Foundations/Webservice/DraftProposal


Statement of the problem

The existing web service and launchpadlib implementations are very easy to write code for, but difficult to write _efficient_ code for, and difficult to understand. Our users' performance problems have two main causes:

Too much data, too few ways to slice it

The only way to filter a collection is to scope it to some entry, or to invoke a named operation. These methods don't cover all, or even most, of the ways clients want to restrict our various datasets. So clients end up fetching huge datasets, iterating over every entry, and filtering on the client side.

The named operations we do have are not standardized in any way: they're nearly-raw glimpses into our internal Python API. This makes it difficult to learn the web service and even to find a specific thing you want. For instance, this is from Edwin Grubbs's report on the OpenStack Design Summit (warthogs list, 2010-11-15):

"I also answered some questions about searching for bugs via the API. The fact that the method is named project.searchTasks() may have caused it to be ignored when reading the API docs."

Our batching system was designed to stop clients who don't need an entire dataset from having to get the whole thing. But almost all clients do need the entirety of whatever dataset they're operating on. Unlike website users, they generally aren't satisfied with the first 75 items.

The real problem is that because of our poor filtering support, the dataset a client can actually get is much larger than the dataset they need to operate on. In real usage, the batching system actually results in more HTTP requests and lower efficiency. (Though it does reduce timeouts.)

Example of problematic client code

    for mp in launchpad.projects['launchpad'].getMergeProposals(status='Merged'):
        if mp.date_created > some_date:
            process(mp)

We get a huge number of merge proposals and immediately discard almost all of them. If there were some way of applying the date_created filter on the server side, we could avoid this.

Too many requests

Retrieving an entry or collection associated with some other entry (such as a bug's owner or a team's members) requires a new HTTP request. Entries are cached, but we don't send Cache-Control directives, so even when the entry is cached we end up making a (conditional) HTTP request. It's the high-latency request, not the cost of processing it on the server side, that's painful.

Client code that crosses the network boundary (bug.owner) looks exactly like client code that doesn't (bug.id). We need to stop hiding the network boundary from the user, or at least pull back from hiding it so relentlessly. It should be obvious when you write inefficient code, and/or more difficult to write it.

Sending Cache-Control directives when we serve entries will mitigate this problem somewhat. Thus far we haven't done so, out of concerns about data freshness.

Example of problematic client code

    for bug in source_package.getBugTasks():
        if bug.owner is not None:
            if bug.owner.is_team:
                for member in bug.owner.members:
                    ...

This code looks innocuous, but it has big problems. We make a separate HTTP request for every access to 'bug.owner' -- that's three requests per bug for the same data. The second and third requests are conditional, but that doesn't help much.

A simple Cache-Control header saying it's okay to cache an entry for five seconds would alleviate this problem. But a Cache-Control header can't stop the need to make at least _one_ request for '.owner' and '.owner.members'. This means two HTTP requests for every bug in the 'bugs' list. Let's say there are one hundred bugs and an HTTP request takes one second. This code will run in 6:40 without Cache-Control headers, and in 3:20 if we add Cache-Control headers.

It would be nice to get the running time closer to 0:10.
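The arithmetic above can be sketched as a toy model. The bug count and per-request latency are just the assumptions from the example; the "3 + 1" reflects the three accesses to 'bug.owner' and one to 'bug.owner.members' in the loop.

```python
# Toy model of the example above: 100 bugs, 1 second per HTTP request.
BUGS = 100
SECONDS_PER_REQUEST = 1

# Without Cache-Control: every 'bug.owner' access is a (conditional) request.
# The loop touches bug.owner three times and bug.owner.members once.
no_cache = BUGS * (3 + 1) * SECONDS_PER_REQUEST

# With Cache-Control: one request for .owner, one for .owner.members.
with_cache = BUGS * 2 * SECONDS_PER_REQUEST

def mmss(seconds):
    """Format a number of seconds as M:SS."""
    return "%d:%02d" % divmod(seconds, 60)

print(mmss(no_cache))    # 6:40
print(mmss(with_cache))  # 3:20
```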

Predecessor documents

The "expand" operation

The "expand" operation lets you GET an entry or collection, *plus* some of the entries or collections that it links to. The client code will make one big HTTP request and populate an entire object graph, rather than just one object. This will make it possible to access 'bug.owner' and iterate over 'bug.owner.members' as many times as you want, without causing additional HTTP requests.

Possible client-side syntax

This code acquires a bug's owner, and the owner's members, in a single request. If the owner turns out not to be a team, the collection of members will be empty.

    print bug.owner                  # Raises ValueError: bug.owner is not available
                                     # on this side of the network boundary.
    bug = expand(bug, bug.owner, bug.owner.members)
    expanded_bug = GET(bug)          # Makes an HTTP request.
    expanded_bug.owner               # Does not raise ValueError.
    if expanded_bug.owner.is_team:   # No further HTTP requests.
        for member in expanded_bug.owner.members:
            print member.display_name

This alternative is more conservative: the client must specifically request every single bit of expanded data that will be used.

    bug = expand(bug, bug.owner.is_team, bug.owner.members.each.display_name)
    expanded_bug = GET(bug)          # Makes an HTTP request.
    print expanded_bug.owner.name    # Raises ValueError: value wasn't expanded.
    if expanded_bug.owner.is_team:   # No further HTTP requests.
        for member in expanded_bug.owner.members:
            print member.display_name

Of course, these examples assume we have a specific bug we want to expand. Our problematic code makes two requests *per bug*, and plugging this code in would simply bring that number down to one request per bug.

This code takes that down to one request, period. It operates on a scoped collection instead of an individual bug, and expands every object in the collection at once.

    bugs = source_package.bugtasks
    bugs = expand(bugs, bugs.each.owner, bugs.each.owner.members)
    expanded_bugs = GET(bugs)     # Makes an HTTP request.
    for bug in expanded_bugs:     # No further HTTP requests.
        if bug.owner.is_team:
            for member in bug.owner.members:
                print member.display_name

Possible client-server syntax

The simplest way to support expansion is to add a general ws.expand argument to requests for entries or collections.

  GET /source_package/bugs?ws.expand=each.owner&ws.expand=each.owner.members

Specifying values for ws.expand that don't make sense will result in a 4xx response code.

Specifying values that do make sense will result in a much bigger JSON document than if you hadn't specified ws.expand. This document may take significantly longer to produce--maybe long enough that it would have timed out under the current system--but it will hopefully keep you from making lots of small HTTP requests in the future.
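A client would build such a URL by appending one ws.expand argument per expansion path. A minimal sketch, assuming the parameter name and paths from the proposal above (the helper function is hypothetical, not part of launchpadlib):

```python
from urllib.parse import urlencode

def expand_url(base_url, paths):
    """Append one ws.expand query argument per expansion path."""
    query = urlencode([("ws.expand", p) for p in paths])
    return "%s?%s" % (base_url, query)

url = expand_url("/source_package/bugs", ["each.owner", "each.owner.members"])
print(url)
# → /source_package/bugs?ws.expand=each.owner&ws.expand=each.owner.members
```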

Hypermedia controls

The simplest way to show how to construct an expansion URL is to serve expansion links alongside the links themselves:

'members_collection_link' : 'https://.../~team/members'
'members_collection_expand_link' : 'https://.../~team?ws.expand=members'

If we were serving HTML representations, we could serve an HTML form allowing you to expand a number of the links in a representation at one time.

The alternative is to punt, and put information about how to expand links in the human-readable description of our media type. This would mean we could never change from the ws.expand system to some other system. With links or forms we can change it trivially.

We already use semantic indicators to distinguish between data attributes that are scalar (like person['name']) and attributes that are links which can be followed (like bug['owner_link']). Every single link-style attribute will get this new ability, so there's no need to add new control information indicating which links have it. We just need to program every client with this information.
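A client following the link-based design above would look for the expansion link in a parsed representation, falling back to the plain link when no expansion is available. A sketch, assuming the '_expand_link' naming convention proposed above (the function and example URLs are illustrative):

```python
def expansion_link(representation, name):
    """Find the expansion link for a named collection in a parsed JSON
    representation; fall back to the plain collection link."""
    expand_key = "%s_collection_expand_link" % name
    plain_key = "%s_collection_link" % name
    return representation.get(expand_key, representation.get(plain_key))

team = {
    "members_collection_link": "https://api.example/~team/members",
    "members_collection_expand_link": "https://api.example/~team?ws.expand=members",
}
print(expansion_link(team, "members"))
# → https://api.example/~team?ws.expand=members
```

Because the client asks the representation for the link instead of constructing it, the ws.expand syntax could change server-side without breaking clients.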

The "restrict" operation

The "expand" operation reduces the need to make an additional HTTP request to follow a link. The "restrict" operation reduces the number of links that need to be followed in the first place, by allowing general server-side filters to be placed on a collection before the data is returned.

The client may request a collection with filters applied to any number of filterable fields. Which fields are "filterable" will be specified through hypermedia: they'll probably be the fields on which we have database indices. The representation returned will be a subset of the collection: the subset that matches the filter(s).

Possible client-side syntax

This code restricts a project's merge proposals to those with "Merged" status and created after a certain date.

    project = launchpad.projects['launchpad']
    proposals = project.merge_proposals
    proposals = restrict(proposals.each.status, "Merged")
    proposals = restrict(proposals.each.date_created, GreaterThan(some_date))
    some_proposals = GET(proposals)
    for proposal in some_proposals:
        ...

Two great features to note:

  1. We can apply the date_created filter on the server side, reducing the time and bandwidth expenditure.
  2. We no longer need to publish the getMergeProposals named operation at all. The only purpose of that operation was to let users filter merge proposals by status, and that's now a general feature. In the aggregate, removal of this and similar named operations will greatly simplify the web service.

You're not restricted to filtering collections based on properties of their entries. You can filter based on properties of entries further down the object graph. This code filters a scoped collection of bugs based on a field of the bug's owner. (There may be better ways to do this particular thing, but this should make it very clear what's going on.)

    project = launchpad.projects['launchpad']
    bugs = project.bugs
    bugs = restrict(bugs.each.owner.name, 'leonardr')
    my_launchpad_bugs = GET(bugs)

Possible client-server syntax

The simplest way to do this is to add a series of ws.restrict query arguments, each of which works similarly to ws.expand.

  GET /launchpad/bugs?ws.restrict.owner.name=leonardr

If your value for a ws.restrict.* argument makes no sense, or you specify a ws.restrict.* argument that doesn't map correctly onto the object graph, you'll get a 4xx error. If your arguments do make sense, you'll get a smaller collection than you would have otherwise gotten.
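Constructing such a URL on the client side is mechanical: each filter's path through the object graph becomes a ws.restrict.* parameter name. A sketch covering only simple equality filters (operators like "greater than" would need additional syntax; the helper is hypothetical):

```python
from urllib.parse import urlencode

def restrict_url(base_url, filters):
    """Build a collection URL with one ws.restrict.<path> argument
    per (path, value) filter pair."""
    query = urlencode(
        [("ws.restrict.%s" % path, value) for path, value in filters])
    return "%s?%s" % (base_url, query)

url = restrict_url("/launchpad/bugs", [("owner.name", "leonardr")])
print(url)
# → /launchpad/bugs?ws.restrict.owner.name=leonardr
```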

Hypermedia controls

We need to add hypermedia controls to indicate which fields in the object graph can be the target of a ws.restrict.* argument.

I'm not sure that we can explain the ws.restrict.* idea itself using WADL, since it's more complicated than the ws.expand idea. We may have to settle for human-readable documentation explaining how a client can pre-traverse the object graph and send an appropriate HTTP request.

Don't batch collections

Currently, clients fetch collections in batches, 75 entries at a time. This causes problems when the underlying collection changes between batch requests: entries may show up multiple times or fall through the cracks.

The entry that, five seconds ago, was item #76, may now be item #74, meaning that you'll miss it. Or the entry that used to be item #74 may now be item #76, meaning that you'll see it twice.
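The failure mode is easy to demonstrate with a toy simulation (the collection, batch size, and deletion are illustrative, not real web-service behavior):

```python
# Toy illustration: a collection changes between batch fetches, so a
# fixed-offset batching client misses an entry.
collection = list(range(10))      # items 0..9, batch size 5
first_batch = collection[0:5]     # client fetches items 0-4

# An item in the already-fetched region is deleted behind the scenes,
# shifting everything after it one position forward.
collection.remove(2)
second_batch = collection[5:10]   # client fetches "items 5-9"

seen = first_batch + second_batch
# Item 5 slid into position 4 after the deletion, so it is never fetched.
print(5 in seen)  # False
```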

Our solution is to effectively get rid of batching. Instead of 75, the batch size will be something huge, like 1 million. If you ask for a collection that contains 100,000 entries, you will get a representation of all 100,000 entries.

Don't panic. For one thing, collections with 100,000 entries will be rare, because the "restrict" operation will make it much easier than it is now to get only a desired subset of a collection.

Huge collections will only occur when client code is poorly written (in which case the incredible slowness of the code will be an obvious problem) or when well-written client code actually does need to operate on a huge collection (in which case the incredible slowness of the code is to be expected).

Besides which, you won't get full representations of all 100,000 entries. When you get a collection, you'll receive a list of collapsed representations.

Collapsed representations

A collapsed representation is the opposite of an expanded representation. A normal representation of a person contains scalar data about the person, plus links to related entries and collections. An expanded representation of a person includes some of those related entries and collection inline. A collapsed representation of a person contains *no* data about the person.

A collapsed representation is just a resource identifier: a URL. If you want *any* information about the resource, you need to GET its URL, or visit the expander resource (see below).

This makes it relatively easy for us to dump huge numbers of collapsed representations into a response--we just have to generate the URLs. For items like bug messages, where calculating the URL is expensive, we might consider creating an alternate URL that's easier to calculate.

The expander resource

So, you have 100,000 links. How do you turn those links into representations? Fortunately, the expander resource (located at /expand) is designed to do just that. If you POST it a number of links, it will return a collection of full representations. If the links you POST include ws.expand arguments, the representations will be further expanded according to their ws.expand arguments.

But, the expander resource won't accept 100,000 links. It will only accept some small number, like, say, 75.

Yes, it's a bait and switch. Small-bore batching is still happening; it's just controlled by the client rather than the server. The server dumps the entire *membership* of some collection onto the client in a single atomic operation, but then it's up to the client to get details about the membership in little chunks.

By the time the client is finished getting all those details, it's quite possible the membership has changed. But the client can be certain that the membership was accurate _as of the time of the initial request_.

Rather than hard-coding the maximum capacity of the expander resource, we plan to publish that as a bit of information. In the simplest design, a client can get the maximum capacity of the expander resource by sending it a GET request. This information would be cached for the same amount of time as the site WADL document.
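Client-side, the chunking itself is trivial. A sketch, assuming the capacity of 75 mentioned above (in practice the client would first GET the capacity from the expander resource, then POST each chunk to /expand in turn):

```python
def chunked(links, capacity):
    """Split a list of collapsed representations (links) into chunks no
    larger than the expander resource's advertised capacity."""
    return [links[i:i + capacity] for i in range(0, len(links), capacity)]

links = ["/bugs/%d" % n for n in range(200)]
batches = chunked(links, 75)
print([len(b) for b in batches])  # [75, 75, 50]
```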

Batch PATCH

Any collection URL that you can GET, you can also PATCH. Here's a simple example: marking all of the bugs owned by leonardr as Fixed.

 PATCH /launchpad/bugs?ws.restrict.owner.name=leonardr HTTP/1.1
 Content-Type: application/json

 {'status' : 'Fixed'}

As you can see, this is a potentially dangerous feature. That simple example does something that probably shouldn't be done. But it's not much different from examples that would be very useful (such as taking all the Fix Committed bugs for a given release, and marking them Fix Released).

We can reduce the danger a little by shunning the normal Python-like syntax for modifying objects, in favor of something that makes it more clear that you're modifying an entire dataset at once.

    project = launchpad.projects['launchpad']
    bugs = project.bugs
    bugs = restrict(bugs.each.owner.name, 'leonardr')
    PATCH(bugs, {'status' : 'Fixed'})

Unanswered questions:

Conclusion: What do we get?

Alternate designs and FAQs

Architectural concerns

Caching and conditional requests

Currently individual entries are cached on disk. We don't serve Cache-Control headers, so these cached entries never prevent requests; they only allow us to make conditional requests. That said, we could serve Cache-Control headers if we wanted to, and get a pretty good performance boost for common access patterns.

We haven't been sending Cache-Control because we've been pretty stubborn about making sure the client never has old information. But the whole premise of the expander resource violates that promise. You'll get the membership of a collection at time T1, spend from time T1 to time T2 expanding the membership, and by time T2 the membership of the collection may be totally different and we never told you.

If we're going this far, Cache-Control with a timeout of a few seconds becomes a lot more reasonable. Even in this system we'll still be serving individual entries occasionally, and we should serve Cache-Control when we do.

In the current system, most requests are for individual entries, each of which is cached along with its ETag. In the new system, most requests will be for large collections of entries. It's difficult to calculate an ETag for a collection, and difficult to estimate what kind of Cache-Control header to send for one--that's why we don't do those things now and have no plans to do them.

By dealing almost exclusively with collections, we negate the few performance advantages we have now, rather than building on them.

How can we get those advantages back? Here are two ideas.

Give ETags to collections

Right now, collections don't have ETags because their representations change too often. They change every time the membership of a collection changes, or one of the objects in the collection changes.

But in the new system, the representation of a collection changes only when the membership changes. It's much easier to construct an ETag for a collection, and make the collection respond to conditional GET (and PATCH).

At first glance this isn't a huge help: we still have to run the query to determine the membership, and we will still end up iterating over the entire query set. But it will keep us from sending data over the wire in some cases. And although caching in-memory ETags for entries turned out to be a waste of time, we might be able to do something with cached ETags for collections, such as automatically calculating reasonable values for the Cache-Control header.
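Since a collection's representation would now change only when its membership changes, its ETag could be derived from the membership alone. A minimal sketch; the proposal does not fix a particular algorithm, and the SHA-1-of-URLs scheme here is just one plausible choice:

```python
import hashlib

def collection_etag(member_links):
    """Derive a collection ETag by hashing the ordered list of
    member URLs (the membership), ignoring the members' contents."""
    digest = hashlib.sha1("\n".join(member_links).encode("utf-8"))
    return '"%s"' % digest.hexdigest()

etag = collection_etag(["/bugs/1", "/bugs/2"])
# The same membership always yields the same ETag...
assert etag == collection_etag(["/bugs/1", "/bugs/2"])
# ...and a membership change yields a different one.
assert etag != collection_etag(["/bugs/1", "/bugs/3"])
```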

Cache entries as served from collections

Imagine that you POST the following to the expander resource.

 [{'self_link': '/bugs/1', 'http_etag': 'foo'},
  {'self_link': '/bugs/2', 'http_etag': 'bar'}]

You might get something like this in return:

 [{'self_link': '/bugs/1', 'http_etag': 'baz', 'description': 'Microsoft...'},
  {'self_link': '/bugs/2', 'http_etag': 'bar'}]

So, bug #1 has changed, the ETag is different, and you get the full representation. But bug #2 has not changed, and so you get the collapsed representation right back.

Of course, if you're asking for an expanded representation of these bugs, making a conditional request would be just as useless as making a conditional request for a collection. Expanded representations don't have their own ETags.

And here's another problem: how did we get the ETags for these bugs in the first place? We're not making individual requests to /bugs/2 anymore. We must have made an earlier request to the expander resource and cached the result. This implies that we have some way of picking the individual resources out of an expanded representation and caching them individually. How else would we know exactly where to look to find the cached ETag for bug #2?
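The client-side cache implied here would be keyed by self_link: full representations coming back from the expander update the cache, and collapsed ones are served from it. A sketch under those assumptions (the function name and the "more than two keys means full representation" heuristic are illustrative):

```python
def merge_expander_response(cache, response):
    """Merge an expander response into a client-side cache of
    representations keyed by self_link. Entries that come back
    collapsed (just self_link and http_etag) are served from cache."""
    results = []
    for rep in response:
        link = rep["self_link"]
        if len(rep) > 2:             # full representation: update the cache
            cache[link] = rep
        results.append(cache[link])  # otherwise fall back to the cached copy
    return results

cache = {"/bugs/2": {"self_link": "/bugs/2", "http_etag": "bar",
                     "description": "Old description"}}
response = [
    {"self_link": "/bugs/1", "http_etag": "baz", "description": "Microsoft..."},
    {"self_link": "/bugs/2", "http_etag": "bar"},
]
merged = merge_expander_response(cache, response)
print(merged[1]["description"])  # Old description (served from cache)
```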

If we go down this road, we'll still be using HTTP concepts like ETag, but our use of them will be almost entirely removed from the HTTP standard. ETags will almost never be served in the ETag header, will rarely be sent in the If-Match header, and will almost never be sent in the If-None-Match header. We will need to implement our own cache, since httplib2's cache is a cache of HTTP responses, and we need to implement a cache of representations, each chopped out of some larger representation.

I don't think this is intrinsically bad, since the HTTP requests we're getting rid of were the source of the latency that was killing performance. But "let's take this idea from HTTP and implement our own thing based on it" smells bad to me. We may end up with a system that's low-latency but also low-performance. I think we're also in danger of re-inventing something like CouchDB.

Explorability

Fix batching, don't end it

The current proposal increases the default batch size to an amount so huge that we don't expect it ever to be reached in practice. We can't make the batch size infinite: that might create situations in which creating even a collapsed representation of a collection causes a timeout. But making the batch size infinite is the only way to totally avoid our current batching problems, to the extent that we could remove the batching code from lazr.restfulclient.

We can't totally eliminate our problems with batching, but we can greatly reduce them by a) increasing the default batch size to something like 500, and b) changing lazr.restfulclient to fetch 5 or 10 redundant records when fetching subsequent pages, and de-duplicate the results. So, lazr.restfulclient would first obtain items 0-499, and if you iterated past the end of that list, lazr.restfulclient would obtain items 495-994, rather than items 500-999. This would catch the vast majority of changes to collections that happen while you're iterating.
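The overlapping-fetch idea can be sketched as a generator (the fetch_batch callback and the batch/overlap numbers follow the description above; this is not actual lazr.restfulclient code):

```python
def fetch_all(fetch_batch, batch_size=500, overlap=5):
    """Iterate over a collection using overlapping batches,
    de-duplicating by self_link. fetch_batch(start, size) returns a
    list of representations starting at the given offset."""
    seen = set()
    start = 0
    while True:
        batch = fetch_batch(start, batch_size)
        for rep in batch:
            if rep["self_link"] not in seen:
                seen.add(rep["self_link"])
                yield rep
        if len(batch) < batch_size:
            return                       # short batch: end of collection
        start += batch_size - overlap    # re-fetch the last few records

# Simulated server: a stable collection of 1200 bugs.
data = [{"self_link": "/bugs/%d" % n} for n in range(1200)]
def fetch_batch(start, size):
    return data[start:start + size]

print(len(list(fetch_all(fetch_batch))))  # 1200
```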

If we could find a solution like this that let us keep batching more-or-less as is, we wouldn't need collapsed representations or the expander resource. This would simplify our design a lot. But there are problems:

  1. This only solves the problem for collections sorted by fields that never or rarely change. If a collection is sorted by a field like modification date, huge jumps from the back of the collection to the front are possible. The only general solution is to get the entire list at once.
  2. Naive users write code that iterates over a collection many times, killing performance. There are ways to optimize for this in our current system, but the new system will fix this problem automatically, by constructing a full list of collapsed representations and filling them out over the course of the first iteration.

Batch PATCH

How will we handle conditional writes with batch PATCH? :: Probably not necessary: we are setting values on an abstract dataset, without ever looking at specific values. A specific set of bugs might have an ETag, but "the set of all Fix-Committed bugs" doesn't.

Foundations/Webservice/DraftProposal (last edited 2010-11-18 17:18:57 by leonardr)