The memcached representation cache

Basic idea

After constructing a JSON representation, we can store it in memory and use it later in preference to constructing it again. I've spent a lot of time seeing how this would work and running timing tests. We implemented much of a solution but ultimately decided the performance improvement was

The promise

I used the [[https://dev.launchpad.net/Foundations/Webservice?action=AttachFile&do=view&target=performance_test.py||launchpadlib test script]] throughout these trials. For simplicity I'll only report the final number from that test, the mean end-to-end time across 30 fetches for a bug or for a collection of bugs.

Local tests

First I hacked lazr.restful to serve representations directly from memcached, without any consideration for permission control. These results gives a theoretical lower bound on how fast we can can possibly make lazr.restful by using memcached.

Then I implemented a real representation caching system within lazr.restful, except using a Python dict instead of memcached. Here's how it works: if it turns out that a client can see the full representation of an entry (there are no redacted fields), the entry's string representation is stored in the cache. Upon subsequent requests, the string will be pulled from the cache and either served whole or incorporated into a JSON representation of a collection. If the client cannot see the full representation of a cached entry, we pull the cached string, simplejson.loads() it, redact the appropriate fields, and dump it again. A collection is built from (possibly cached) entry representations, using string operations rather than simplejson.dumps().

I then changed my implementation so that it used memcached instead of an in-memory Python dict, and ran the tests again. The results:

No representation cache: 0.09s (entry test), 0.74s (collection test)
In-memory dict: 0.09s (entry test), 0.11s (collection test)
memcached: 0.10s (entry test), 0.15s (collection test)
Theoretical memcached best case: 0.09s (entry test), 0.18s (collection test)

In other words, with an in-memory dict it's about as fast to get one page of a collection as it is to get a single bug. Since memcached is slower than an in-memory dict, it's about 50% faster to get one bug than one page of a collection--but still much faster than not having a representation cache. The current memcached implementation is within the margin of error of the theoretical best case.

The main remaining problem is invalidation. My implementation invalidates an object's representation cache on an ObjectModifiedEvent (though this also needs a little work), but Launchpad isn't good about sending out ObjectModifiedEvents. We might get good results by doing cleanup on the Launchpad side and then tagging the corresponding objects as being okay to cache. Do this for bugs+bugtasks+people+branches and we'll have covered the most common access scenarios.

Production test

OK, but these are local tests. What about a test in production? To find out, I temporarily enabled my representation cache on staging. I ran different two tests. Each test ran some bit of code ten times.

In the first test, the bit of code run ten times was an iteration over the first 100 projects in launchpad.projects. The total number of HTTP requests made was 20. Without the cache enabled, all ten runs took approximately the same time:

 First run took 5.76 sec
 First five runs took 24.29 sec (mean: 4.86 sec)
 All 10 runs took 56.92 sec (mean: 5.69 sec)

With the cache enabled, the first run took about the same time as when the cache was disabled, but the subsequent runs were much faster.

 First run took 6.11 sec
 First five runs took 10.92 sec (mean: 2.18 sec)
 All 10 runs took 15.15 sec (mean: 1.51 sec)

The second run took only 27% of the time of the first.

In the second test, the bit of code run ten times was an iteration over the first 100 projects, the first 100 users, and then the first 100 bugs. The total number of HTTP requests made was 60, magnifying the effects of latency versus the first test.

Again, with the cache disabled, all ten runs took approximately the same time:

 First run took 23.12 sec
 First five runs took 108.58 sec (mean: 21.72 sec)
 All 10 runs took 192.76 sec (mean: 19.28 sec)

Again, with the cache enabled, the first run took about the same time, but subsequent runs were much faster.

 First run took 13.18 sec
 First five runs took 57.03 sec (mean: 11.41 sec)
 All 10 runs took 108.08 sec (mean: 10.81 sec)

The second run took about 56% the time of the first. The difference between this 56% and the first test's much better 27% is caused by extra latency.

Little benefit in practice

An analysis of actual web service usage shows that it's pretty rare that real applications fetch collections. Launchpad commonly serves collections when handling the searchTasks() named operation, and some applications repeat the same searches over and over again over a period of time. It's also not uncommon for an application to fetch the list of a distribution's distro_series objects (usually Ubuntu's).

But our analysis shows that the majority of GET requests are requests for individual objects (usually distro_series or user objects). Implementing the memcached cache would speed up these requests, but not noticeably.

This could change in one of two ways. First, the web service could become more popular in absolute terms. Even if searchTasks() invocations only make up 2% of web service requests, once that's thousands of requests a day, it becomes worth the effort to speed up each request by one second.

Second, the use cases in which a representation cache is useful could become more popular. Someone could write a hot new application that drives up searchTasks() invocations to 10% of web service requests. Or mutatis mutandis for any other request, or combination of requests, that would benefit from a representation cache.

Lessons learned

We waited too long before analyzing real-world web service traffic. As soon as we knew that the representation cache only improved requests for collections, we should have checked to see how often we served collections. Instead we waited until we'd overcome all the non-difficult implementation details.

Unsolved problems

Even though the performance gain is marginal right now, the representation cache would be worth putting into place were it nor for several tricky unsolved problems. These problems make it not cost-effective to finish the implementation right now.

Invalidating composite objects

Problem: popular web service objects like IBug and IUser have representations that include data from other Storm objects. For instance, Bug.tags is a list of strings derived from IBugTag objects. As the IBugTag objects are created and destroyed, the corresponding IBug needs to be removed from the cache. Similarly, IUser has a number of fields derived from an IUserLocation object. When the user's location is changed, the storm_flush for the IUserLocation is triggered, but that for the IUser (which is the one we want) is not.

Solution: Make it possible to annotate an interface with information about how invalidations should cascade. For instance, IUserLocation could be tagged with something like this:

@cascade_invalidation('user')

or

@cascade_invalidation(attrgetter('user'))

And then within storm_flush for the IUserLocation, we'd get all the @cascade_invalidation property names, fetch each one, and invalidate that object as well. In this case, the UserLocation object location would be invalidated, and so would the result of attrgetter('user')(location).

Defining cache keys for objects with no Storm info

Problem: Many objects whose representations we want to cache aren't Storm objects. Bug cache keys are generated using Storm info, so we can't cache these objects.

Solution: Define a IRepresentationCacheable interface that includes key_for(). Create a default implementation that uses canonical_url. Create another implementation for Storm subclasses that uses Storm info.

Objects with no Storm info tend to be composite objects, so solve that problem first.

Invalidating on changes that don't trigger __storm_flush__

Problem: storm_flush is only triggered when a Storm object is saved to the database. There are plenty of situations (such as direct SQL statements or mass updates) where this doesn't happen.

Solution: A database trigger that generates bug cache keys for rows that are changed, and sends them to a script that removes those keys from memcached.

It either needs to generate its own list of _all_ the keys to be removed (one for every version of the web service * one for every media type being cached * one for every Launchpad instance being managed by this database), or the script needs to be written to generate the appropriate keys itself before removing them from memcached. Since the script needs to run in a Launchpad context anyway (see below), I think the second solution makes more sense.

The script that removes keys from memcached needs to handle invalidation cascades. If it gets (user_location,240) it needs to run an algorithm like this:

1. Remove (user_location, 240) from the cache in all its permutations.

2. Look up this information somehow: the user_location table corresponds to the IUserLocation interface, and that interface defines an invalidation cascade into .user.

3. Get the IUserLocation implementation corresponding to (user_location,240).

4. getUtility(IWebServiceRepresentationCache).delete(location.user)

Alternative solution: Our database may already define the relationship between IUserLocation and IUser, so that we can cascade deletes. If so, we may be able to have the database trigger look at that relationship and generate both (user_location, 240) and the corresponding (user, 1022) cache invalidation key. This would let our script be faster and a lot dumber.

This will not obviate the need for solution #1, because when a change is made through the web service itself, we want to cascade invalidations immediately, during the same request. If you modify a user's .latitude through the web service, it should return a 209 response code and a representation of the user with the .latitude changed. If the change to UserLocation doesn't cascade immediately to User, the representation of the user returned will be the cached, unchanged one.

Race conditions

Because it takes a while for the database trigger to remove cached representations from the caches, there's a short period of time during which the caches will serve out-of-date representations. We've decided this is acceptable.

But here's another race condition.

1. A client makes a change to an entry object. Like all non-GET requests, this change goes through the master database.

2. The cache invalidation is propagated to memcached. Note that the master database has not yet been replicated to the slave databases.

3. Not content with the updated representation returned with the 209 response code, the client refreshes their entry object. This change goes through one of the slave databases. The slave database creates a representation of the entry based on the out-of-date data, and stores it in memcached. The end-user sees the change they made suddenly vanish.

4. The master database is replicated to the slave databases. The slave database now has up-to-date data, but the representation cache has a cached representation based on bad data. Since that out-of-date representation was cached _after_ the cache invalidation, it won't be going away anytime soon.

We have a preliminary solution based on the use of the Last-Modified header. With this solution the server can tell that a client has a more recent representation than the server's own database, and avoid creating the out-of-date representation.

Future promise

Have we undercounted the benefit?

Use to store HTML representations for use in template snippets and the Ajax application?

Use to store JSON representations embedded in the website's web pages?

Foundations/Webservice/RepresentationCache (last edited 2010-06-17 20:11:23 by leonardr)