Differences between revisions 4 and 5

Statement of the problem

The existing web service and launchpadlib implementations are very easy to write code for, but difficult to write _efficient_ code for, and difficult to understand. Our users' performance problems have two main causes:

Too much data, too few ways to slice it

The only way to filter a collection is to scope it to some entry, or to invoke a named operation. These methods don't cover all, or even most, of the ways clients want to restrict our various datasets. So clients end up fetching huge datasets and iterating over the whole thing, filtering on the client side.

The named operations we do have are not standardized in any way: they're nearly-raw glimpses into our internal Python API. This makes it difficult to learn the web service and even to find a specific thing you want. For instance, this is from Edwin Grubbs's report on the OpenStack Design Summit (warthogs list, 2010-11-15):

"I also answered some questions about searching for bugs via the API. The fact that the method is named project.searchTasks() may have caused it to be ignored when reading the API docs."

Our batching system was designed to stop clients who don't need an entire dataset from having to get the whole thing. But almost all clients do need the entirety of whatever dataset they're operating on. Unlike website users, they generally aren't satisfied with the first 75 items.

The real problem is that because of our poor filtering support, the dataset a client can actually get is much larger than the dataset they need to operate on. In real usage, the batching system actually results in more HTTP requests and lower efficiency. (Though it does reduce timeouts.)

Example of problematic client code

    for mp in launchpad.projects['launchpad'].getMergeProposals(status='Merged'):
        if mp.date_created > some_date:
            process(mp)

We get a huge number of merge proposals and immediately discard almost all of them. If there were some way of applying the date_created filter on the server side, we could avoid this.
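To make the cost concrete, here is a minimal sketch that counts the HTTP requests behind the loop above. Everything in it is an assumption made for illustration: the requests_needed helper, the made-up dataset of 3,000 proposals, and the idea that one batch costs exactly one request. Only the batch size of 75 comes from our current system.

```python
import math
from datetime import date, timedelta

BATCH_SIZE = 75  # entries per batch in the current web service


def requests_needed(entry_count, batch_size=BATCH_SIZE):
    """HTTP requests needed to page through a collection of this size."""
    return max(1, math.ceil(entry_count / batch_size))


# Made-up data: 3,000 merged proposals, one created per day.
first_day = date(2002, 1, 1)
proposals = [
    {"status": "Merged", "date_created": first_day + timedelta(days=i)}
    for i in range(3000)
]
some_date = first_day + timedelta(days=2970)

# Client-side filtering: every proposal crosses the network first.
client_side_requests = requests_needed(len(proposals))
wanted = [mp for mp in proposals if mp["date_created"] > some_date]

# Server-side filtering: only the matching proposals are sent.
server_side_requests = requests_needed(len(wanted))

print(client_side_requests, server_side_requests, len(wanted))
```

With these made-up numbers the client-side loop needs 40 requests to find 29 proposals, where a server-side filter would need one.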

Too many requests

Retrieving an entry or collection associated with some other entry (such as a bug's owner or a team's members) requires a new HTTP request. Entries are cached, but we don't send Cache-Control directives, so even when the entry is cached we end up making a (conditional) HTTP request. It's the high-latency request, not the cost of processing it on the server side, that's painful.

Client code that crosses the network boundary (bug.owner) looks exactly like client code that doesn't (bug.id). We need to stop hiding the network boundary from the user, or at least pull back from hiding it so relentlessly. It should be obvious when you write inefficient code, and/or more difficult to write it.

Sending Cache-Control directives when we serve entries will mitigate this problem somewhat. Thus far we haven't done so, out of concerns about data freshness.
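As a rough illustration of what a client could do with such a directive, here is a sketch of a cache that skips the request entirely while an entry is still fresh. The EntryCache class, its fetch callable, and the example URL are all hypothetical; the only real-world piece is the meaning of "Cache-Control: max-age=N", and our web service does not send it today.

```python
import time


class EntryCache:
    """Hypothetical client-side cache that honors a max-age lifetime."""

    def __init__(self, fetch, max_age=5, clock=time.monotonic):
        self.fetch = fetch        # callable that performs the HTTP GET
        self.max_age = max_age    # seconds, as in "Cache-Control: max-age=5"
        self.clock = clock        # injectable so the sketch can be tested
        self.entries = {}         # url -> (fetched_at, representation)

    def get(self, url):
        cached = self.entries.get(url)
        if cached is not None and self.clock() - cached[0] < self.max_age:
            return cached[1]      # still fresh: no HTTP request at all
        representation = self.fetch(url)
        self.entries[url] = (self.clock(), representation)
        return representation


calls = []


def fake_fetch(url):
    """Stand-in for a real HTTP GET; records each request made."""
    calls.append(url)
    return {"self_link": url}


now = [0.0]
cache = EntryCache(fake_fetch, max_age=5, clock=lambda: now[0])
for _ in range(3):
    cache.get("/~some-owner")   # only the first call hits the network
now[0] = 6.0                    # the entry has gone stale
cache.get("/~some-owner")       # so this call makes a second request
print(len(calls))
```

The three back-to-back lookups cost one request instead of three, which is exactly the 'bug.owner' pattern that hurts us now.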

Example of problematic client code

    for bug in source_package.getBugTasks():
        if bug.owner is not None:
            if bug.owner.is_team:
                for member in bug.owner.members:
                    ...

This code looks innocuous but it has big problems. We make a separate HTTP request every time we access 'bug.owner': that's three successive requests for the same data. The last two requests are conditional, but that doesn't help much.

A simple Cache-Control header saying it's okay to cache an entry for five seconds would alleviate this problem. But a Cache-Control header can't stop the need to make at least _one_ request for '.owner' and one for '.owner.members'. This means two HTTP requests for every bug that getBugTasks() returns. Let's say there are one hundred bugs and an HTTP request takes one second. Without Cache-Control headers this code makes four requests per bug (three for '.owner', one for '.owner.members') and runs in 6:40. With Cache-Control headers it makes two requests per bug and runs in 3:20.

It would be nice to get the running time closer to 0:10.
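The arithmetic behind those two figures can be checked with a few lines of Python. The running_time helper is ours, written only for this illustration; the inputs (one hundred bugs, one second per request) are the ones assumed above.

```python
def running_time(bug_count, requests_per_bug, seconds_per_request=1):
    """Format the total time spent waiting on HTTP requests as m:ss."""
    minutes, seconds = divmod(
        bug_count * requests_per_bug * seconds_per_request, 60)
    return "%d:%02d" % (minutes, seconds)


# Three requests for bug.owner plus one for bug.owner.members.
without_cache_control = running_time(100, 4)
# One request for bug.owner plus one for bug.owner.members.
with_cache_control = running_time(100, 2)

print(without_cache_control, with_cache_control)
```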

Predecessor documents

The "expand" operation

The "expand" operation lets you GET an entry or collection, *plus* some of the entries or collections that it links to. The client code will make one big HTTP request and populate an entire object graph, rather than just one object. This will make it possible to access 'bug.owner' and iterate over 'bug.owner.members' as many times as you want, without causing additional HTTP requests.

Possible client-side syntax

This code acquires a bug's owner, and the owner's members, in a single request. If the owner turns out not to be a team, the collection of members will be empty.

    bug.owner                     # Raises ValueError: bug.owner is not available
                                  # on this side of the network boundary.
    bug = expand(bug, bug.owner, bug.owner.members)
    expanded_bug = GET(bug)       # Makes an HTTP request.
    expanded_bug.owner            # Does not raise ValueError.
    if expanded_bug.owner.is_team:  # No further HTTP requests.
        for member in expanded_bug.owner.members:
            print(member.display_name)

This implementation is more conservative: the client must specifically request every single bit of expanded data it will use.

    bug = expand(bug, bug.owner.is_team, bug.owner.members.each.display_name)
    expanded_bug = GET(bug)       # Makes an HTTP request.
    expanded_bug.owner.name       # Raises ValueError: value wasn't expanded.
    if expanded_bug.owner.is_team:  # No further HTTP requests.
        for member in expanded_bug.owner.members:
            print(member.display_name)

Of course, these examples assume we have a specific bug we want to expand. Our problematic code makes two requests *per bug*, and plugging this code in would simply bring that number down to one request per bug.

This code takes that down to one request, period. It operates on a scoped collection instead of an individual bug, and expands every object in the collection at once.

    bugs = source_package.bugtasks
    bugs = expand(bugs, bugs.each.owner, bugs.each.owner.members)
    expanded_bugs = GET(bugs)     # Makes an HTTP request
    for bug in expanded_bugs:     # No further HTTP requests:
        if bug.owner.is_team:
            for member in bug.owner.members:
                print(member.display_name)
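All of the examples above depend on a client object that refuses to cross the network boundary on its own. Here is one way that behavior might be sketched. The Entry class and its constructor arguments are hypothetical and are not part of launchpadlib; the error message is the one used in the first example.

```python
class Entry:
    """Hypothetical client-side entry that never makes hidden requests."""

    def __init__(self, name, scalars, links=None):
        self._name = name
        self._scalars = scalars      # values already in the representation
        self._links = links or {}    # only the links that were expanded

    def __getattr__(self, attr):
        # Only called when normal lookup fails. Private names are refused
        # outright so a half-built object can never recurse here.
        if attr.startswith("_"):
            raise AttributeError(attr)
        if attr in self._scalars:
            return self._scalars[attr]
        if attr in self._links:
            return self._links[attr]
        raise ValueError(
            "%s.%s is not available on this side of the network boundary."
            % (self._name, attr))


member = Entry("member", {"display_name": "Example Member"})
owner = Entry("owner", {"is_team": True}, {"members": [member]})
bare_bug = Entry("bug", {"id": 1})
expanded_bug = Entry("bug", {"id": 1}, {"owner": owner})

try:
    bare_bug.owner               # The link was never expanded.
except ValueError as error:
    print(error)

names = [m.display_name for m in expanded_bug.owner.members]
print(names)
```

Scalar attributes such as 'id' keep working on both objects. Only the link that was never expanded raises, which makes the inefficient access pattern obvious at the point where it is written.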

Possible client-server syntax

The simplest way to support expansion is to add a general ws.expand argument to requests for entries or collections.

    GET /source_package/bugs?ws.expand=each.owner&ws.expand=each.owner.members

Specifying values for ws.expand that don't make sense will result in a 4xx response code.

Specifying values that do make sense will result in a much bigger JSON document than if you hadn't specified ws.expand. This document may take significantly longer to produce--maybe long enough that it would have timed out under the current system--but it will hopefully keep you from making lots of small HTTP requests in the future.
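As a sketch of both halves of this exchange, the code below builds the request URL on the client and checks the requested expansions on the server. The expand_url and status_for helpers and the LINKS table are hypothetical. We also assume 400 as the specific 4xx code, which this proposal has not decided.

```python
from urllib.parse import urlencode

# Assumption: a tiny map of which link attributes each entry type has.
LINKS = {"bug": {"owner": "person"}, "person": {"members": "person"}}


def expand_url(path, *expansions):
    """Client side: repeat ws.expand once per requested expansion."""
    query = urlencode([("ws.expand", expansion) for expansion in expansions])
    return path + "?" + query if query else path


def status_for(entry_type, expansions):
    """Server side: 200 if every expansion follows real links, else 400."""
    for expansion in expansions:
        steps = expansion.split(".")
        if len(steps) < 2 or steps[0] != "each":
            return 400
        current = entry_type
        for step in steps[1:]:
            if step not in LINKS.get(current, {}):
                return 400
            current = LINKS[current][step]
    return 200


url = expand_url("/source_package/bugs", "each.owner", "each.owner.members")
print(url)
print(status_for("bug", ["each.owner", "each.owner.members"]))
print(status_for("bug", ["each.milestone"]))
```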

Hypermedia controls

We already use semantic indicators to distinguish between data attributes that are scalar (like person['name']) and attributes that are links which can be followed (like bug['owner_link']). Every single link-style attribute will get this new ability, so there's no need to add new control information indicating which links have it.

What we need to do is update our human-readable description of our media type to describe this new addition to the HTTP protocol.

Ideally the ws.expand syntax itself would be described using a hypermedia control (allowing us to change it without changing the client), but we're not sure how to do this with our existing hypermedia standard (WADL). This idea effectively turns every single link into a form.

The URI Templates standard lets us describe forms that look like links, but the section of the standard that would let us do something like ws.expand is not defined, and clients that don't understand URI Templates will incorrectly interpret a URI Template as a URL.

The "restrict" operation

The "expand" operation reduces the need to make an additional HTTP request to follow a link. The "restrict" operation reduces the number of links that need to be followed in the first place, by allowing general server-side filters to be placed on a collection before the data is returned.

The client may request a collection with filters applied to any number of filterable fields. Which fields are "filterable" will be specified through hypermedia: they'll probably be the fields on which we have database indices. The representation returned will be a subset of the collection: the subset that matches the filter(s).

Possible client-side syntax

This code restricts a project's merge proposals to those with "Merged" status and created after a certain date.

    project = launchpad.projects['launchpad']
    proposals = project.merge_proposals
    proposals = restrict(proposals.each.status, "Merged")
    proposals = restrict(proposals.each.date_created, GreaterThan(some_date))
    some_proposals = GET(proposals)
    for proposal in some_proposals:
        ...

Two great features to note:

We can apply the date_created filter on the server side, reducing the time and bandwidth expenditure.
We no longer need to publish the getMergeProposals named operation at all. The only purpose of that operation was to let users filter merge proposals by status, and that's now a general feature. In the aggregate, removal of this and similar named operations will greatly simplify the web service.
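On the server side, the two restrictions in that example might be applied along these lines. This is a sketch under stated assumptions rather than a design: apply_restriction, FILTERABLE_FIELDS, and this GreaterThan class are hypothetical, and a real implementation would push the filter into the database query instead of looping over a list.

```python
from datetime import date


class GreaterThan:
    """Hypothetical condition object, mirroring the client syntax above."""

    def __init__(self, bound):
        self.bound = bound

    def matches(self, value):
        return value > self.bound


# Assumption: the fields our hypermedia advertises as filterable.
FILTERABLE_FIELDS = {"status", "date_created"}


def apply_restriction(entries, field, condition):
    """Return the subset of entries whose field satisfies the condition."""
    if field not in FILTERABLE_FIELDS:
        raise ValueError("%s is not a filterable field" % field)
    if isinstance(condition, GreaterThan):
        return [e for e in entries if condition.matches(e[field])]
    return [e for e in entries if e[field] == condition]


# Made-up data standing in for a project's merge proposals.
proposals = [
    {"status": "Merged", "date_created": date(2010, 11, 1)},
    {"status": "Merged", "date_created": date(2010, 9, 1)},
    {"status": "Needs review", "date_created": date(2010, 11, 10)},
]
some_date = date(2010, 10, 1)

subset = apply_restriction(proposals, "status", "Merged")
subset = apply_restriction(subset, "date_created", GreaterThan(some_date))
print(subset)
```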

You're not restricted to filtering collections based on properties of their entries. You can filter based on properties of entries further down the object graph. This code filters a scoped collection of bugs based on a field of the bug's owner. (There may be better ways to do this particular thing, but this should make it very clear what's going on.)

    project = launchpad.projects['launchpad']
    bugs = project.bugs
    bugs = restrict(bugs.owner.name, 'leonardr')
    my_launchpad_bugs = GET(bugs)

Possible client-server syntax

The simplest way to do this is to add a series of ws.restrict query arguments, each of which works similarly to ws.expand.

    GET /launchpad/bugs?ws.restrict.owner.name=leonardr

If your value for a ws.restrict.* argument makes no sense, or you specify a ws.restrict.* argument that doesn't map correctly onto the object graph, you'll get a 4xx error. If your arguments do make sense, you'll get a smaller collection than you would have otherwise gotten.
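Here is a sketch of how the two sides might handle these arguments. The restrict_url and parse_restrictions helpers and the FILTERABLE_PATHS set are hypothetical, and 400 is again our stand-in for whichever 4xx code we settle on.

```python
from urllib.parse import parse_qsl, urlencode

PREFIX = "ws.restrict."
# Assumption: the object-graph paths advertised as valid filter targets.
FILTERABLE_PATHS = {"owner.name"}


def restrict_url(path, restrictions):
    """Client side: one ws.restrict.* argument per restricted field."""
    query = urlencode(
        [(PREFIX + field, value) for field, value in restrictions.items()])
    return path + "?" + query


def parse_restrictions(query):
    """Server side: return (status, restrictions) for a query string."""
    restrictions = {}
    for key, value in parse_qsl(query):
        if not key.startswith(PREFIX):
            continue                      # not a restriction; ignore it
        field = key[len(PREFIX):]
        if field not in FILTERABLE_PATHS:
            return 400, {}                # doesn't map onto the object graph
        restrictions[field] = value
    return 200, restrictions


url = restrict_url("/launchpad/bugs", {"owner.name": "leonardr"})
print(url)
print(parse_restrictions("ws.restrict.owner.name=leonardr"))
print(parse_restrictions("ws.restrict.owner.shoe_size=9"))
```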

Hypermedia controls

We need to add hypermedia controls to indicate which fields in the object graph can be the target of a ws.restrict.* argument.

I'm not sure that we can explain the ws.restrict.* idea itself using WADL, since it's more complicated than the ws.expand idea. We may have to settle for human-readable documentation explaining how a client can pre-traverse the object graph and send an appropriate HTTP request.

Don't batch collections

Currently, clients fetch collections in batches, 75 entries at a time. This causes problems when the underlying collections are changing behind the scenes: between one batch request and the next, entries may show up multiple times or fall through the cracks.

The entry that, five seconds ago, was item #76, may now be item #74, meaning that you'll miss it. Or the entry that used to be item #74 may now be item #76, meaning that you'll see it twice.
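Both failure modes are easy to reproduce with a list standing in for the collection. This is only a simulation: walk_in_batches is a hypothetical helper, and the 200 numbered entries are made up. Each entry's number is its original position in the collection.

```python
BATCH_SIZE = 75


def walk_in_batches(collection, change):
    """Fetch two batches, letting the collection change in between."""
    seen = list(collection[:BATCH_SIZE])
    collection = change(collection)
    seen += collection[BATCH_SIZE:2 * BATCH_SIZE]
    return seen


entries = list(range(1, 201))

# Two entries from the first batch disappear before the second request,
# so the entry that was item #76 is now item #74.
after_removal = walk_in_batches(
    entries, lambda c: [e for e in c if e not in (10, 20)])
missed = [e for e in range(1, 151) if e not in after_removal]

# Two new entries are added at the front before the second request,
# so the entry that was item #74 is now item #76.
after_insertion = walk_in_batches(entries, lambda c: [201, 202] + c)
seen_twice = sorted(
    e for e in set(after_insertion) if after_insertion.count(e) > 1)

print(missed, seen_twice)
```

The first walk never sees entries 76 and 77. The second sees entries 74 and 75 twice.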

Our solution is to get rid of batching. If you ask for a collection, and the collection contains 100,000 entries, you will get a representation of all 100,000 entries.

Don't panic. You won't get full representations of all 100,000 entries. You'll get collapsed representations.

Collapsed representations

The expander resource

PATCHing a collection

Foundations/Webservice/DraftProposal (last edited 2010-11-18 17:18:57 by leonardr)

Revision 4 as of 2010-11-16 19:35:44 (size: 11481, editor: leonardr)
Revision 5 as of 2010-11-16 20:49:17 (size: 12251, editor: leonardr)
-Line 274:
+Line 272:
-= Don't batch collections, and send only IDs =
+= Don't batch collections =

