Diff for "Foundations/Webservice/DraftProposal"


Differences between revisions 1 and 17 (spanning 16 versions)
Revision 1 as of 2010-11-13 12:56:25
Size: 17980
Editor: gary
Comment:
Revision 17 as of 2010-11-17 20:12:16
Size: 19825
Editor: leonardr
Comment:
= Draft of Webservice Plans =

''This is a place Leonard, Benji and Gary sometimes use to scribble ideas while we discuss. We will present it to a larger audience for discussion when we all think it hangs together well, with a basic syntax we agree on and science-fiction examples of how we might rewrite existing real-world code.''

== Goals ==

 * Core webservice idioms encourage efficient use
 * Easy to use because of a uniform, small, powerful API
 * Python syntax explicitly exposes the network boundary so that it is easy to reason about

== Proposed approach ==

Leonard has called it "filter and expand". Because "filter" is a Python built-in, I'm calling it "refine and expand" here. The names are up for discussion, like everything else.

Leonard's most recent draft is here: http://pastebin.ubuntu.com/529604/

Question about the proposal: why does the proposal say that we POST to the expander? The request does not mutate server state, so I would think GET would be a better verb.

== Q & A ==

 Why refine and expand? ::
 :: See leonardr's draft.

 Why get identities (possibly filtered or refined) in one request and get data for batches of identities (expanded) in multiple subsequent requests? ::
 :: See leonardr's draft. But in bullet points...
 :: it makes getting a filtered set transactional.
 :: it makes working with the result--len, walking in reverse, walking in step, and so on--easy, natural, and efficient.
 :: it can be regular, uniform and unlimited, if we think we can always return the set no matter how big it is, or at least with *really big* sets (we are currently envisioning a limit of somewhere between hundreds of thousands and a million).
 :: it continues the basic refine/expand model of encouraging users to think about working with collections, and not individual entities.

 Why prefer not to use strings for identifying sub-elements? ::
 :: Prettier
 :: Easier/more obvious spelling correction
 :: help() and dir() will work on them
 :: Tab completion

 Why prefer immutable query? ::
 :: Easier to reason about derived queries: always a new copy.
 :: Unlikely to present a memory problem

 Why prefer immutable results? ::
 :: Snapshot of response; shows that it is not transactional
 :: Clear parallel to a browser response
 :: No partial updates of sets (i.e., the data in a given response collection will always be self-consistent because it was fetched inside a transaction. This is actually stated a little strongly, and therefore the argument is weakened: the set membership is transactional, but the expansions are batched.)

 ''Potential worries:''
 :: may cause memory problems. Can come up with ways to alleviate if a problem (e.g., unchanged bits are kept) but hopefully we won't have to.
 :: people may want to union results. To alleviate, provide an explicit function that does this, returning a new collection. Members will be reused, so the cost should be cheap.

 Why do we need to be able to make a query from a response? Or (essentially the same question/same answer)...Why do we need arbitrary collections of items, instead of always filtering? ::
 :: The decision on what objects to get next may need to be made after the request. Example: User chooses items from checkboxes, and we want those.

 Why is it valuable to have arbitrary (local) Python callables as a filter? ::
 :: We can request only the parts we need for the filter

 Why do we need slicing in addition to Python callables? ::
 :: An explicit statement of which items you want: the beginning or end of a group, or even every Nth item (for statistical purposes).

 What syntax does benji want other than [::] for slicing? ::
 :: He's not sure. Maybe it's just another server-side filter type: Slice(0, 20). His main concern is that [::] suggests too strongly that results are list-like, which they aren't, so we want to avoid confusing users. I wonder whether results really can be list-like enough now, but I understand the concern, even though I don't agree yet.

 Why would you want to filter a query based on a result? ::
 :: result may be too big
 :: you may want a sub-query. This raises a HTTP syntax worry: what if we effectively want to filter against a very large set? the query string will not support this because of practical url size limits (2K for IE, for instance; see http://www.boutell.com/newfaq/misc/urllength.html).
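The URL-size worry above can be made concrete with a small sketch. This is a toy, not the proposed client: the helper name `choose_method` and the parameter names are invented here, and the 2,048-byte cap is the IE limit the text cites.

```python
from urllib.parse import urlencode

# Conservative cap from the text: IE's ~2K URL limit.
MAX_URL_LENGTH = 2048

def choose_method(base_url, params):
    """Return ('GET', url) when the query string fits in a URL,
    else ('POST', base_url), implying the filter goes in the body.
    Purely illustrative; a real client would also build that body."""
    url = base_url + '?' + urlencode(params, doseq=True)
    if len(url) <= MAX_URL_LENGTH:
        return 'GET', url
    return 'POST', base_url

# A small filter fits in the URL...
method, url = choose_method('https://api.launchpad.net/bugs',
                            {'ws.restrict.status': 'Triaged'})
assert method == 'GET'

# ...but filtering against thousands of identities does not.
big_filter = {'ws.restrict.id': [str(n) for n in range(1000)]}
method, url = choose_method('https://api.launchpad.net/bugs', big_filter)
assert method == 'POST'
```

This is one argument for an expander-style resource that accepts POSTed identities rather than relying on the query string.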

 When might you want to get a single specific item, expanded? ::
 :: You are looking for a specific project, like Ubuntu; or specific person, like yourself; and just want to expand it.

 When might you want to get a bunch of specific heterogeneous items, all at once, expanded? ::
 :: You are building a "page" for GroundControl or a hypothetical all-AJAX Launchpad. You need a bunch of things for the page, and you want it as efficiently as possible. Leonard's syntax proposes a generic "expand" resource, so we can do it in the webservice.

 Why do we want to have links "typed"? Or (essentially the same question/same answer)...Why do we want the links in homogeneous groups? ::
 :: Because that way we can use the same proposed mechanism for expanding and filtering as we always have.

 Why do we prefer to make filters specify fields for their Python callables, not entire objects? ::
 :: ''Benji needs to make this argument, because I forget what he said, possibly because I was not completely convinced.''
 :: When you do this, you may really just want to iterate a result.

 Benji also asks, "Are we initially going to allow people to expand non-leaf-node values?" ::
 :: My answer is, yeah, probably, or at the very least that is ''intended'' initially. I'm not sure where Benji stands on this.

 Why does specifying a field in a filter not include it in the Python representation as if you had expanded it? ::
 :: reduce surprise. If you want it, include it in the expansion.

Some related questions:

 A number of people (martin and lifeless among them, I think) have said that it would be really nice to be able to use the same syntax for using the webservice and for writing LP code itself (in-process, not over the network). Are we considering that? ::
 :: The transactional semantics of writing LP code in-process are valuable and we will probably not want to try to make transactional-ish semantics for the webservice ever. This will quite possibly (but not necessarily, granted) lead to different syntaxes.
 :: That said, if out-of-process had certain operations that were transactional (get(), patch(), etc.) it seems that having a start_transaction() and end_transaction() that you could wrap them with when you were in-process would be doable. But we still think that even if we had it no one would want to use it. The launchpadlib API is going to be a shadow of what's possible when you have a fast connection to the DB.

== Examples ==

These are various competing drafts of a Python syntax, reflecting the above thoughts in various ways.

For the current thoughts (and competing syntaxes) on the HTTP side, see Leonard's draft.

=== Immutable request, very explicit version ===

Approaches in this one:
- request and response are immutable
- network calls are as explicit as possible

As an easy-to-see convention, function calls that make network calls are in capitals.

{{{#!python
base_req = ids(launchpad.people['canonical-ubuntu'].assigned_bugs)
}}}

This creates a request for the identities of bugs in the canonical-ubuntu team. We can make refined versions of this request. Note that nothing has gone over the wire yet, so we have no validation that there is actually a team or person named "canonical-ubuntu". In fact, "base_req" is just going to be the basis of requests that we actually do send.

{{{#!python
req1 = refine(base_req.milestone.name, OneOf('milestone1', 'milestone2'))
req2 = refine(base_req.specifications.target_milestone.name, OneOf('milestone1', 'milestone2'))
}}}

req1 and req2 are refined subsets, or filtered subsets, of the "base_req". Available filters are OneOf, AnyOf, GreaterThan, LessThan, EqualTo, and XXX (would be nice to have a draft list; can we check types on the client side, and do we want to?). They can be passed specific values, as seen here, or OneOf and AnyOf can use collections generated from previous webservice interactions, as we will see below.

(Note that we "refine" the request, rather than "filter" it. This is just because "filter" is a Python built-in. Note also that, contrary to other proposals, I did not specify making expansion hints before you make the identity request. I'll get into that more below.)

Now let's get some responses.

{{{#!python
response1 = GET(req1)
response2 = GET(req2)
}}}

Each GET is a single network call, and the two responses are not yet expanded, so all we know about each are the identities (links) and types (which we knew because of what we requested, not because of any server response).

As an aside, we could have just gotten the first 50 items from one of those requests using slice notation on the request.

{{{#!python
truncated_response = GET(req1[:50])
}}}

In this case, {{{req1[:50]}}} returns another request object, which is not iterable. The {{{truncated_response}}} is.

(Benji has some valid concerns over using the standard slice spelling, and might prefer something like using a standard Python slice object as an argument to {{{refine}}}: {{{truncated_response = GET(refine(req1, slice(50)))}}}. However, since the request can't be iterated, I think the semantics are harder to misuse than other __getitem__ hacks; and it is very concise and convenient. I prefer it.)
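The "slicing returns a new, non-iterable request" semantics can be sketched in a few lines. The `Request` class below is an invented stand-in for the proposed object, just to show the immutability contract:

```python
class Request:
    """Toy stand-in for the proposed immutable request object.
    Slicing returns a *new* request; the original is unchanged, and
    neither object is iterable until a GET is performed."""
    def __init__(self, path, start=None, stop=None):
        self.path, self.start, self.stop = path, start, stop

    def __getitem__(self, key):
        if not isinstance(key, slice):
            raise TypeError('requests only support slicing')
        # Derive a new request rather than mutating this one.
        return Request(self.path, key.start, key.stop)

    def __iter__(self):
        raise TypeError('a request is not iterable; GET it first')

req1 = Request('/people/canonical-ubuntu/assigned_bugs')
truncated = req1[:50]
assert truncated is not req1                      # a derived copy
assert (req1.start, req1.stop) == (None, None)    # original untouched
assert (truncated.start, truncated.stop) == (None, 50)
```

Because iteration raises immediately, the `[:50]` spelling can't be mistaken for list access, which is the point of contention above.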

Let's get back to the main example, in which we have response1 and response2, obtained from req1 and req2. We will GET and union the responses in one line of Python.

{{{#!python
identities = union(response1, response2)
}}}

We unioned the two responses, showing that the union is done locally. The response is now a union of the identities (links) from the two filters.

(Note that, if we had specified expansion desires earlier as is done in other proposals, I think this union would be much trickier, because we would have to keep track of what had been requested for each merged response, and somehow do a merge of requests when they overlap. This might result in a collection that would be expanded in a heterogeneous way. Alternatively, we could enforce homogenous expansion annotations in merged sets, though that seems like it could become very annoying; or merge all expansion requests into one for the unioned set, though this seems a bit too automatic. It makes more sense to me to specify what you want to expand when you make the expansion request, as I will show below.)
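Since responses are just sets of identities (links), the local union is trivial. A minimal sketch, assuming responses can be modeled as iterables of identity URLs (the `union` function here is an illustration, not the proposed implementation):

```python
def union(*responses):
    """Combine identity responses locally: no network traffic,
    and a new immutable collection is produced each time."""
    result = frozenset()
    for response in responses:
        result = result | frozenset(response)
    return result

response1 = ['/bugs/1', '/bugs/2', '/bugs/3']
response2 = ['/bugs/3', '/bugs/4']
identities = union(response1, response2)
assert identities == {'/bugs/1', '/bugs/2', '/bugs/3', '/bugs/4'}
```

Set semantics also mean duplicate identities across the two responses collapse for free.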

Now let's make a request to expand the unioned response, so we have some actual data to work with.

{{{#!python
request = expand(identities, identities.assignee, identities.milestone)
}}}

We're saying that we are interested in the top-level data of the bugs we found, the top-level data of each bug's assignee, and the top-level data for each associated milestone. For a few more examples, {{{expand(response)}}} would just specify the top-level data, while {{{expand(response.assignee)}}} would omit the top-level data, getting only the assignees.

That's an expansion request. Unlike an identity request, you can't use {{{refine}}} with it.

On the other hand, you can call {{{expand}}} on another expansion request to make a new request adding additional data requests. You can use {{{request.identities}}} to get to the original response collection that makes up the request. Also, as we'll show below, you can combine multiple expansion requests. Anyway, for now, let's get the expanded values.

{{{#!python
first_fifty = GET(request[:50])
second_fifty = GET(request[50:100])
}}}

Each of those collections of fifty bugs would have objects that had all their scalar top-level data, and the scalar data for the assignee and milestone.

Something to note about any expanded collection is that the result may not end up matching the original filters; the collection did, back when it was made, but it might not now. The separate request highlights this distinction. (I thought about proposing that we provide something that would locally filter as it expands, but then the expansion has to also request everything that was used as a filter initially. I decided that this was too tricky; however, for this and other stories, I am pretty sure that responses should remember the requests that generated them, so clever things can be built later.)

Why did we batch the request? Because expansion must be batched.

Why is this expansion separate from the original request? Because it shows that we can only guarantee to expand in batches--it's not transactional. It also clearly shows that the expansion is a separate request.

One downside to an expansion call that is separate from the initial filtering call for identity is that it doesn't allow optimizations of the server sending both the identifiers and the expansions at once, as other proposals have allowed. That's true; it's a trade off against making the API clearer. We were not sure we wanted to do that anyway.

Another potential downside is that it will probably be tedious to expand with explicit slices many times, for each batch. To help with that, we could provide a batching convenience for this, like the below.

{{{#!python
expanded = batchGET(expand(response, response.assignee, response.milestone))
}}}

That would give you a collection of the union of the two sets, with the expanded top-level data for those bugs.

The collection from a batchGET would be lazy, and expand batches as they are requested. This would give back some of the automation that people enjoy from the existing webservice API, while using the spelling to clarify expectations.
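The lazy-batching behavior described above can be sketched with a generator. Everything here is invented for illustration: `batch_get` stands in for the proposed {{{batchGET}}}, and `fake_fetch` stands in for one HTTP round trip:

```python
def batch_get(identities, fetch, batch_size=75):
    """Lazily expand identities in batches: each batch triggers one
    (simulated) network call, and only when it is actually consumed."""
    for i in range(0, len(identities), batch_size):
        yield from fetch(identities[i:i + batch_size])

calls = []
def fake_fetch(batch):          # stands in for one HTTP round trip
    calls.append(len(batch))
    return [{'id': ident} for ident in batch]

results = batch_get(list(range(200)), fake_fetch, batch_size=75)
first = next(results)            # only the first batch has been fetched
assert calls == [75]
rest = list(results)             # consuming the rest fetches the remainder
assert calls == [75, 75, 50]
assert len([first] + rest) == 200
```

A client that stops iterating early never pays for the batches it didn't consume, which is the convenience the existing API offers, but with the network cost visible in the spelling.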

What if you want to collect arbitrary items from a response and make them into a new expansion request? Make a {{{collection}}}.

{{{#!python
coll = response.collection([response[0], response[3], response[5], response[10]])
coll.add(response[-10])
request = expand(coll, coll.assignee)
}}}

Collections have the semantics of a set of identities of homogeneous type. The type is determined from the object from which they originate. Therefore, from a request or response of bugs, the collection is of bugs. Similarly, {{{launchpad.people.collection()}}} will make a collection of people.

What about a Python callable for a filter? We don't provide this. Instead, follow these steps.

 1. Get a collection of identities (e.g. {{{identities = GET(request(launchpad.people))}}} or {{{identities = GET(refine(request(launchpad.people).project_membership.name, AnyOf('Launchpad', 'Landscape', 'Storm')))}}}).
 1. Expand the identities to get only the fields you need to filter the identities locally (e.g., {{{data = batchGET(expand(identities.name))}}})
 1. Add every item that meets your requirements to a collection (e.g., {{{wanted = identities.collection(o for o in identities if sounds_french(o.name))}}}).
 1. Expand the wanted items as desired (e.g., {{{data = batchGET(expand(wanted))}}}) and do stuff with it.
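The four steps above can be simulated end to end with toy data. Nothing here is the real API: `PEOPLE`, `get_identities`, `expand`, and `sounds_french` are all invented to show the shape of the workflow:

```python
# Fake server data keyed by identity URL.
PEOPLE = {
    '/people/jean': {'name': 'jean', 'karma': 10},
    '/people/bob':  {'name': 'bob',  'karma': 20},
    '/people/rene': {'name': 'rene', 'karma': 30},
}

def get_identities():                # step 1: one request for the links
    return list(PEOPLE)

def expand(identities, *fields):     # steps 2 and 4: batched expansion
    if fields:
        return [{f: PEOPLE[i][f] for f in fields} for i in identities]
    return [dict(PEOPLE[i], link=i) for i in identities]

def sounds_french(name):             # step 3: the arbitrary local callable
    return name in ('jean', 'rene')

identities = get_identities()
names = expand(identities, 'name')   # fetch only the field the filter needs
wanted = [i for i, d in zip(identities, names) if sounds_french(d['name'])]
data = expand(wanted)                # full expansion of the survivors only
assert [d['name'] for d in data] == ['jean', 'rene']
```

The payoff is in step 2: only the {{{name}}} field crosses the wire for the full set, and full records are fetched only for the items that passed the local filter.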

What if a response should be part of a filter? No problem. An identity response, an expansion response, a batchGET object, or a collection can all be passed as one of the arguments to AnyOf or OneOf (e.g., {{{AnyOf(identities)}}}).

What if you want to expand only one thing? I think practicality must trump purity here. {{{launchpad.people['gary']}}} should give you an object that you can use as an expand request, so you can say {{{gary = GET(launchpad.people['gary'])}}}. {{{gary = GET(launchpad.me)}}} should work too. You should also be able to say {{{gary_request = launchpad.people['gary']; GET(expand(gary_request, gary_request.assigned_bugs))}}}.

What if you want to expand heterogeneous elements efficiently? GET and batchGET can take multiple expansion requests and aggregate them. When you provide more than one argument to GET or batchGET, you get back a tuple of responses, one per request. For instance, given one request for a person and another for a set of bugs, you could say {{{person_response, bug_response = GET(person_request, bug_request)}}}.

Should we support sorting? How could we express it syntactically in HTTP? Would it be efficient enough on the server? Let's assume the answers are "yes," "some solvable way," and "yes," and think for a moment about Python syntax. Then we are left with more questions. What should the Python syntax for sorting be? Can you union/merge sorted items and keep the sorting if they were sorted the same way? How do you get only the first N/last N/step-by-N items when the request is for a sorted response?

Would this work?

{{{identities = GET(sorted(ids(launchpad.people).name)[:50])}}}

What if you want to patch something? You use the PATCH function. It takes attribute references (from an identity request, or from an identity or expand response) alternating with the values to assign:

{{{#!python
items = ids(launchpad.people['canonical-ubuntu'].assigned_bugs)
items = refine(items.assignee, ...)
result = PATCH(items.status, "Won't Fix", items.assignee, None) # N pairs
}}}

You can use a {{{.collection()}}} too, and you can use items from the query tree:

(Did not think this through clearly; I think this would work.)

{{{#!python
result = PATCH(launchpad.people['gary'].name, "Yrag Retsop")
}}}

(What is the result of a PATCH? I forget what we return now.)

DELETE would work very similarly.

What if you want to create something? This is entirely separate from filter, expand, patch, and all that. It's probably some API that the collection provides, like {{{launchpad.people.create(...)}}} or something.


=== Mutable request version ===
XXX

=== Mutable response version ===
XXX
= Statement of the problem =

The existing web service and launchpadlib implementations are very
easy to write code for, but difficult to write _efficient_ code for,
and difficult to understand. Our users' performance problems have two
main causes:

== Too much data, too few ways to slice it ==

The only way to filter a collection is to scope it to some entry, or
to invoke a named operation. These methods don't cover all, or even
most, of the ways clients want to restrict our various datasets. So
clients end up getting huge datasets and iterating over the whole
thing, filtering them on the client level.

The named operations we do have are not standardized in any way:
they're nearly-raw glimpses into our internal Python API. This makes
it difficult to learn the web service and even to find a specific
thing you want. For instance, this is from Edwin Grubbs's report on
the Open``Stack Design Summit (warthogs list, 2010-11-15):

''"I also answered some questions about searching for bugs via the API. The fact that the method is named project.searchTasks() may have caused it to be ignored when reading the API docs."''

Our batching system was designed to stop clients who don't need an
entire dataset from having to get the whole thing. But almost all
clients do need the entirety of whatever dataset they're operating
on. Unlike website users, they generally aren't satisfied with the
first 75 items.

The real problem is that because of our poor filtering support, the
dataset a client can actually get is much larger than the dataset they
need to operate on. In real usage, the batching system actually
results in more HTTP requests and lower efficiency. (Though it does
reduce timeouts.)

=== Example of problematic client code ===

{{{#!python
for mp in launchpad.projects['launchpad'].getMergeProposals(status='Merged'):
    if mp.date_created > some_date:
        process(mp)
}}}

We get a huge number of merge proposals and immediately discard almost
all of them. If there were some way of applying the {{{date_created}}}
filter on the server side, we could avoid this.
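The waste is easy to quantify with toy numbers. This simulation is invented for illustration (the data and date are made up); it just contrasts what crosses the wire today with what the client actually keeps:

```python
import datetime

# Fifteen merged proposals spread over November, plus one rejected one.
PROPOSALS = [
    {'status': 'Merged', 'date_created': datetime.date(2010, 11, d)}
    for d in range(1, 16)
] + [{'status': 'Rejected', 'date_created': datetime.date(2010, 11, 1)}]

some_date = datetime.date(2010, 11, 10)

# Today: the server applies only the status filter...
transferred = [p for p in PROPOSALS if p['status'] == 'Merged']
# ...and the client throws most of the transfer away.
kept = [p for p in transferred if p['date_created'] > some_date]

assert len(transferred) == 15   # paid for on the wire
assert len(kept) == 5           # actually wanted
```

With a server-side {{{date_created}}} filter, only the five wanted records would be transferred.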

== Too many requests ==

Retrieving an entry or collection associated with some other entry
(such as a bug's owner or a team's members) requires a new HTTP
request. Entries are cached, but we don't send Cache-Control
directives, so even when the entry is cached we end up making a
(conditional) HTTP request. It's the high-latency request, not the
cost of processing it on the server side, that's painful.

Client code that crosses the network boundary (bug.owner) looks
exactly like client code that doesn't (bug.id). We need to stop hiding
the network boundary from the user, or at least pull back from hiding
it so relentlessly. It should be obvious when you write inefficient
code, and/or more difficult to write it.

Sending Cache-Control directives when we serve entries will mitigate
this problem somewhat. Thus far we haven't done so, out of concerns
about data freshness.

=== Example of problematic client code ===

{{{#!python
for bug in source_package.getBugTasks():
    if bug.owner is not None:
        if bug.owner.is_team:
            for member in bug.owner.members:
                ...
}}}

This code looks innocuous but it has big problems. We make a separate
HTTP request for every access to 'bug.owner' -- that's three
subsequent requests for the same data. The latter two requests are
conditional, but that doesn't help much.

A simple Cache-Control header saying it's okay to cache an entry for
five seconds would alleviate this problem. But a Cache-Control header
can't stop the need to make at least _one_ request for '.owner'
and '.owner.members'. This means two HTTP requests for every bug in
the 'bugs' list. Let's say there are one hundred bugs and an HTTP
request takes one second. This code will run in 6:40 without
Cache-Control headers, and in 3:20 if we add Cache-Control headers.
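The arithmetic above can be checked directly, reading the loop as four owner-related requests per bug without caching (three fetches of '.owner' plus one of '.owner.members') and two with caching:

```python
def running_time(bugs, requests_per_bug, seconds_per_request=1):
    """Format the total wall-clock time as m:ss."""
    total = bugs * requests_per_bug * seconds_per_request
    return '%d:%02d' % divmod(total, 60)

# Without Cache-Control: three '.owner' fetches plus '.members' per bug.
assert running_time(100, 4) == '6:40'
# With Cache-Control: one '.owner' plus one '.members' per bug.
assert running_time(100, 2) == '3:20'
```

One expand request per 75-bug batch, by contrast, would mean two requests total for 100 bugs, which is how the 0:10 ballpark becomes plausible.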

It would be nice to get the running time closer to 0:10.

= Predecessor documents =

 * [[http://pastebin.ubuntu.com/529604/|Leonard's final draft describing the restrict and expand operations]]
 * [[http://pastebin.ubuntu.com/533178/|Benji/Gary discussion of client syntaxes]]

= The "expand" operation =

The "expand" operation lets you GET an entry or collection, *plus*
some of the entries or collections that it links to. The client code
will make one big HTTP request and populate an entire object graph,
rather than just one object. This will make it possible to
access 'bug.owner' and iterate over 'bug.owner.members' as many times
as you want, without causing additional HTTP requests.

== Possible client-side syntax ==

This code acquires a bug's owner, and the owner's members, in a single
request. If the owner turns out not to be a team, the collection of
members will be empty.

{{{#!python
print bug.owner # Raises ValueError: bug.owner is not available
                # on this side of the network boundary.
bug = expand(bug, bug.owner, bug.owner.members)
expanded_bug = GET(bug) # Makes an HTTP request.
expanded_bug.owner # Does not raise ValueError.
if expanded_bug.owner.is_team: # No further HTTP requests.
    for member in expanded_bug.owner.members:
        print member.display_name
}}}

This implementation is more conservative: it must specifically request
every single bit of expanded data that will be used.

{{{#!python
bug = expand(bug, bug.owner.is_team, bug.owner.members.each.display_name)
expanded_bug = GET(bug) # Makes an HTTP request.
print expanded_bug.owner.name # Raises ValueError: value wasn't expanded.
if expanded_bug.owner.is_team: # No further HTTP requests.
    for member in expanded_bug.owner.members:
        print member.display_name
}}}

Of course, these examples assume we have a specific bug we want to
expand. Our problematic code makes two requests *per bug*, and
plugging this code in would simply bring that number down to one
request per bug.

This code takes that down to one request, period. It operates on a
scoped collection instead of an individual bug, and expands every
object in the collection at once.

{{{#!python
bugs = source_package.bugtasks
bugs = expand(bugs, bugs.each.owner, bugs.each.owner.members)
expanded_bugs = GET(bugs) # Makes an HTTP request.
for bug in expanded_bugs: # No further HTTP requests.
    if bug.owner.is_team:
        for member in bug.owner.members:
            print member.display_name
}}}

== Possible client-server syntax ==

The simplest way to support expansion is to add a general ws.expand
argument to requests for entries or collections.

{{{
  GET /source_package/bugs?ws.expand=each.owner&ws.expand=each.owner.members
}}}

Specifying values for ws.expand that don't make sense will result in a
4xx response code.

Specifying values that do make sense will result in a much bigger JSON
document than if you hadn't specified ws.expand. This document may
take significantly longer to produce--maybe long enough that it would
have timed out under the current system--but it will hopefully keep
you from making lots of small HTTP requests in the future.
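Repeating the ws.expand key once per expansion is ordinary form encoding, so building the URL is mechanical. A minimal sketch using the standard library (the path is taken from the example above):

```python
from urllib.parse import urlencode

# One (key, value) pair per expansion; urlencode repeats the key.
params = [('ws.expand', 'each.owner'),
          ('ws.expand', 'each.owner.members')]
query = urlencode(params)
url = '/source_package/bugs?' + query
assert url == ('/source_package/bugs'
               '?ws.expand=each.owner&ws.expand=each.owner.members')
```

Dotted paths pass through unescaped, since '.' is a safe character in query strings.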

== Hypermedia controls ==

The simplest way to show how to construct an expansion URL is to serve
expansion links alongside the links themselves:

{{{
"members_collection_link": "https://.../~team/members",
"members_collection_expand_link": "https://.../~team?ws.expand=members"
}}}

If we were serving HTML representations, we could serve an HTML form
allowing you to expand a number of the links in a representation at
one time.

The alternative is to punt, and put information about how to expand
links in the human-readable description of our media type. This would
mean we could never change from the ws.expand system to some other
system. With links or forms we can change it trivially.

We already use semantic indicators to distinguish between data
attributes that are scalar (like person['name']) and attributes that
are links which can be followed (like bug['owner_link']). Every single
link-style attribute will get this new ability, so there's no need to
add new control information indicating which links have it. We just
need to program every client with this information.

= The "restrict" operation =

The "expand" operation reduces the need to make an additional HTTP
request to follow a link. The "restrict" operation reduces the number
of links that need to be followed in the first place, by allowing
general server-side filters to be placed on a collection before the
data is returned.

The client may request a collection with filters applied to any number
of filterable fields. Which fields are "filterable" will be specified
through hypermedia: they'll probably be the fields on which we have
database indices. The representation returned will be a subset of the
collection: the subset that matches the filter(s).

== Possible client-side syntax ==

This code restricts a project's merge proposals to those with "Merged"
status that were created after a certain date.

{{{#!python
project = launchpad.projects['launchpad']
proposals = project.merge_proposals
proposals = restrict(proposals.each.status, "Merged")
proposals = restrict(proposals.each.date_created, GreaterThan(some_date))
some_proposals = GET(proposals)
for proposal in some_proposals:
    ...
}}}

Two great features to note:

 1. We can apply the date_created filter on the server side, reducing
 the time and bandwidth expenditure.

 2. We no longer need to publish the getMergeProposals named operation
 at all. The only purpose of that operation was to let users filter
 merge proposals by status, and that's now a general feature. In the
 aggregate, removal of this and similar named operations will greatly
 simplify the web service.

You're not restricted to filtering collections based on properties of
their entries. You can filter based on properties of entries further
down the object graph. This code filters a scoped collection of bugs
based on a field of the bug's owner. (There may be better ways to do
this particular thing, but this should make it very clear what's going
on.)

{{{#!python
project = launchpad.projects['launchpad']
bugs = project.bugs
bugs = restrict(bugs.owner.name, 'leonardr')
my_launchpad_bugs = GET(bugs)
}}}

== Possible client-server syntax ==

The simplest way to do this is to add a series of ''ws.restrict''
query arguments, each of which works similarly to ''ws.expand''.

{{{
  GET /launchpad/bugs?ws.restrict.owner.name=leonardr
}}}

If your value for a ws.restrict.* argument makes no sense, or you
specify a ws.restrict.* argument that doesn't map correctly onto the
object graph, you'll get a 4xx error. If your arguments do make sense,
you'll get a smaller collection than you would have otherwise gotten.

== Hypermedia controls ==

We need to add hypermedia controls to indicate which fields in the
object graph can be the target of a ws.restrict.* argument.

I'm not sure that we can explain the ws.restrict.* idea itself using
WADL, since it's more complicated than the ws.expand idea. We may have
to settle for human-readable documentation explaining how a client can
pre-traverse the object graph and send an appropriate HTTP request.

= Don't batch collections =

Currently, clients fetch collections in batches, 75 entries at a
time. This causes problems: as the underlying collections change
behind the scenes, entries may show up multiple times or fall through
the cracks.

The entry that, five seconds ago, was item #76, may now be item #74,
meaning that you'll miss it. Or the entry that used to be item #74 may
now be item #76, meaning that you'll see it twice.

Our solution is to effectively get rid of batching. Instead of 75, the
batch size will be something huge, like 1 million. If you ask for a
collection that contains 100,000 entries, you will get a
representation of all 100,000 entries.

Don't panic. For one thing, collections with 100,000 entries will be
rare, because the "restrict" operation will make it much easier than
it is now to get only a desired subset of a collection.

Huge collections will only occur when client code is poorly written
(in which case the incredible slowness of the code will be an obvious
problem) or when well-written client code actually does need to
operate on a huge collection (in which case the incredible slowness of
the code is to be expected).

Besides which, you won't get ''full'' representations of all 100,000
entries. When you get a collection, you'll receive a list
of ''collapsed'' representations.

== Collapsed representations ==

A collapsed representation is the opposite of an expanded
representation. A normal representation of a person contains scalar
data about the person, plus links to related entries and
collections. An expanded representation of a person includes some of
those related entries and collection inline. A collapsed
representation of a person contains *no* data about the person.

A collapsed representation is just a resource identifier: a URL. If
you want *any* information about the resource, you need to GET its
URL, or visit the expander resource (see below).

This makes it relatively easy for us to dump huge numbers of collapsed
representations into a response--we just have to generate the
URLs. For items like bug messages, where calculating the URL is
expensive, we might consider creating an alternate URL that's easier
to calculate.

== The expander resource ==

So, you have 100,000 links. How do you turn those links into
representations? Fortunately, the expander resource (located at
`/expand`) is designed to do just that. If you POST it a number of
links, it will return a collection of full representations. If the
links you POST include ws.expand arguments, the representations will
be further expanded according to their ws.expand arguments.

But, the expander resource won't accept 100,000 links. It will only
accept some small number, like, say, 75.

Yes, it's a bait and switch. Small-bore batching is still happening;
it's just controlled by the client rather than the server. The server
dumps the entire *membership* of some collection onto the client in a
single atomic operation, but then it's up to the client to get details
about the membership in little chunks.

By the time the client is finished getting all those details, it's
quite possible the membership has changed. But the client can be
certain that the membership was accurate _as of the time of the
initial request_.

Rather than hard-coding the maximum capacity of the expander resource,
we plan to publish that as a bit of information. In the simplest
design, a client can get the maximum capacity of the expander resource
by sending it a GET request. This information would be cached for the
same amount of time as the site WADL document.
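A client holding more links than the expander will accept has to chunk them itself. A minimal sketch of that loop, where `post_expander` is a stand-in for POSTing one chunk of links to `/expand`:

{{{#!python
def expand_all(links, post_expander, max_capacity=75):
    """Turn a large list of collapsed links into representations by
    POSTing them to the expander in chunks of at most max_capacity."""
    representations = []
    for i in range(0, len(links), max_capacity):
        representations.extend(post_expander(links[i:i + max_capacity]))
    return representations
}}}

In the real protocol, max_capacity would first be discovered by sending the expander resource a GET request, as described above.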

= Batch PATCH =

Any collection URL that you can GET, you can also PATCH. Here's a
simple example: marking all of the bugs assigned to leonardr as fixed.

{{{
 PATCH /launchpad/bugs?ws.restrict.owner.name=leonardr HTTP/1.1
 Content-type: application/json

 {'status' : 'Fixed'}
}}}

As you can see, this is a potentially dangerous feature. That simple
example does something that probably shouldn't be done. But it's not
much different from examples that would be very useful (such as taking
all the Fix Committed bugs for a given release, and marking them Fix
Released).

We can reduce the danger a little by shunning the normal Python-like
syntax for modifying objects, in favor of something that makes it more
clear that you're modifying an entire dataset at once.

{{{#!python
project = launchpad.projects['launchpad']
bugs = project.bugs
bugs = restrict(bugs.owner.name, 'leonardr')
PATCH(bugs, {'status' : 'Fixed'})
}}}

Unanswered questions:

 * What should the HTTP response be? What if some of the operations succeed and others fail? Should the entire PATCH be rolled back, or should we use the 207 Multi-Status response code defined by WebDAV?
 * Can the server reject a batch PATCH that it feels operates on too wide a dataset? I really don't think it should be possible to PATCH every single bug in Launchpad with a single operation. If the server can reject a PATCH, how does the client respond? Can this negotiation be done automatically? (If it can, what's the point in ever rejecting a batch PATCH? It will happen anyway, it'll just go slower.)

= Conclusion: What do we get? =

 * The "expand" operation solves the problem of "too many requests". It allows us to retrieve an object graph in O(1) requests, rather than in O(N) requests.
 * Since entries in collections are served as collapsed representations, retrieving a ''detailed'' object graph still requires O(N) requests. But whereas N used to be the number of objects in the collection, N is now that number divided by 75 (i.e. the maximum capacity of the expander resource).
 * The "restrict" operation solves the problem of "too much data, too few ways to slice it". It allows N to be much smaller than it would otherwise be, and allows most common ways of reducing N to be done on the server instead of the client.
 * The "restrict" operation also allows us to get rid of many of our existing one-off named operations, simplifying the service.
 * Since the batch size is now so high as to be meaningless, clients will no longer overlook or duplicate entries as they page through the batches of a collection.
 * In some cases, the batch PATCH allows a write operation to proceed in O(1) requests, rather than O(N). This is a much less common scenario than a batch GET.

= Alternate designs and FAQs =

== Fix batching, don't end it ==

The current proposal increases the default batch size to an amount so
huge that we don't expect it ever to be reached in practice. We can't
make the batch size infinite: that might create situations in which
creating even a collapsed representation of a collection causes a
timeout. But making the batch size infinite is the only way to totally
avoid our current batching problems, to the extent that we could
remove the batching code from lazr.restfulclient.

We can't totally eliminate our problems with batching, but we can
greatly reduce them by a) increasing the default batch size to
something like 500, and b) changing lazr.restfulclient to fetch 5 or
10 redundant records when fetching subsequent pages, and de-duplicate
the results. So, lazr.restfulclient would first obtain items 0-499,
and if you iterated past the end of that list, lazr.restfulclient
would obtain items 495-994, rather than items 500-999. This would
catch the vast majority of changes to collections that happen while
you're iterating.
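The overlap-and-de-duplicate scheme described above can be sketched as follows. Here `fetch_page` stands in for one batched GET, and items are assumed to be identified by something stable, like their URLs:

{{{#!python
def fetch_with_overlap(fetch_page, page_size=500, overlap=5):
    """Fetch a collection page by page, re-fetching a few records at
    each page boundary and de-duplicating, so that small shifts in the
    underlying collection don't make entries fall through the cracks."""
    seen = set()
    results = []
    start = 0
    while True:
        page = fetch_page(start, page_size)
        for item in page:
            if item not in seen:
                seen.add(item)
                results.append(item)
        if len(page) < page_size:
            break
        start += page_size - overlap
    return results
}}}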

If we could find a solution like this that let us keep batching
more-or-less as is, we wouldn't need collapsed representations or the
expander resource. This would simplify our design a lot. But the
approach has two problems:

 1. This only solves the problem for collections sorted by fields that never or rarely change. If a collection is sorted by a field like modification date, huge jumps from the back of the collection to the front are possible. The only general solution is to get the entire list at once.
 2. Naive users write code that iterates over a collection many times, killing performance. There are ways to optimize for this in our current system, but the new system will fix this problem automatically, by constructing a full list of collapsed representations and filling them out over the course of the first iteration.

= Statement of the problem =

The existing web service and launchpadlib implementations are very easy to write code for, but difficult to write _efficient_ code for, and difficult to understand. Our users' performance problems have two main causes:

== Too much data, too few ways to slice it ==

The only way to filter a collection is to scope it to some entry, or to invoke a named operation. These methods don't cover all, or even most, of the ways clients want to restrict our various datasets. So clients end up getting huge datasets and iterating over the whole thing, filtering them on the client level.

The named operations we do have are not standardized in any way: they're nearly-raw glimpses into our internal Python API. This makes it difficult to learn the web service and even to find a specific thing you want. For instance, this is from Edwin Grubbs's report on the OpenStack Design Summit (warthogs list, 2010-11-15):

"I also answered some questions about searching for bugs via the API. The fact that the method is named project.searchTasks() may have caused it to be ignored when reading the API docs."

Our batching system was designed to stop clients who don't need an entire dataset from having to get the whole thing. But almost all clients do need the entirety of whatever dataset they're operating on. Unlike website users, they generally aren't satisfied with the first 75 items.

The real problem is that because of our poor filtering support, the dataset a client can actually get is much larger than the dataset they need to operate on. In real usage, the batching system actually results in more HTTP requests and lower efficiency. (Though it does reduce timeouts.)

=== Example of problematic client code ===

{{{#!python
for mp in launchpad.projects['launchpad'].getMergeProposals(status='Merged'):
    if mp.date_created > some_date:
        process(mp)
}}}

We get a huge number of merge proposals and immediately discard almost all of them. If there was some way of applying the date_created filter on the server side, we could avoid this.

== Too many requests ==

Retrieving an entry or collection associated with some other entry (such as a bug's owner or a team's members) requires a new HTTP request. Entries are cached, but we don't send Cache-Control directives, so even when the entry is cached we end up making a (conditional) HTTP request. It's the high-latency request, not the cost of processing it on the server side, that's painful.

Client code that crosses the network boundary (bug.owner) looks exactly like client code that doesn't (bug.id). We need to stop hiding the network boundary from the user, or at least pull back from hiding it so relentlessly. It should be obvious when you write inefficient code, and/or more difficult to write it.

Sending Cache-Control directives when we serve entries will mitigate this problem somewhat. Thus far we haven't done so, out of concerns about data freshness.

=== Example of problematic client code ===

{{{#!python
for bug in source_package.getBugTasks():
    if bug.owner is not None:
        if bug.owner.is_team:
            for member in bug.owner.members:
                ...
}}}

This code looks innocuous but it has big problems. We make a separate HTTP request for every access to 'bug.owner' -- that's three subsequent requests for the same data. The latter two requests are conditional, but that doesn't help much.

A simple Cache-Control header saying it's okay to cache an entry for five seconds would alleviate this problem. But a Cache-Control header can't stop the need to make at least _one_ request for '.owner' and '.owner.members'. This means two HTTP requests for every bug in the 'bugs' list. Let's say there are one hundred bugs and an HTTP request takes one second. This code will run in 6:40 without Cache-Control headers, and in 3:20 if we add Cache-Control headers.

It would be nice to get the running time closer to 0:10.
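The 6:40 and 3:20 figures can be reproduced with a little arithmetic: without caching each bug costs four requests (three traversals of '.owner' plus one of '.owner.members'); with a short Cache-Control lifetime, '.owner' is fetched once per bug and '.members' once.

{{{#!python
bugs = 100
seconds_per_request = 1

# Without Cache-Control, every traversal of bug.owner is a request:
# three accesses to .owner plus one to .owner.members.
uncached = bugs * 4 * seconds_per_request

# With caching, .owner is fetched once per bug, and .owner.members once.
cached = bugs * 2 * seconds_per_request

def mmss(seconds):
    """Format a number of seconds as m:ss."""
    return '%d:%02d' % divmod(seconds, 60)
}}}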

= Predecessor documents =

= The "expand" operation =

The "expand" operation lets you GET an entry or collection, *plus* some of the entries or collections that it links to. The client code will make one big HTTP request and populate an entire object graph, rather than just one object. This will make it possible to access 'bug.owner' and iterate over 'bug.owner.members' as many times as you want, without causing additional HTTP requests.

== Possible client-side syntax ==

This code acquires a bug's owner, and the owner's members, in a single request. If the owner turns out not to be a team, the collection of members will be empty.

{{{#!python
print bug.owner                  # Raises ValueError: bug.owner is not available
                                 # on this side of the network boundary.
bug = expand(bug, bug.owner, bug.owner.members)
expanded_bug = GET(bug)          # Makes an HTTP request.
expanded_bug.owner               # Does not raise ValueError.
if expanded_bug.owner.is_team:   # No further HTTP requests.
    for member in expanded_bug.owner.members:
        print member.display_name
}}}

This implementation is more conservative: it must specifically request every single bit of expanded data that will be used.

{{{#!python
bug = expand(bug, bug.owner.is_team, bug.owner.members.each.display_name)
expanded_bug = GET(bug)          # Makes an HTTP request.
print expanded_bug.owner.name    # Raises ValueError: value wasn't expanded.
if expanded_bug.owner.is_team:   # No further HTTP requests.
    for member in expanded_bug.owner.members:
        print member.display_name
}}}

Of course, these examples assume we have a specific bug we want to expand. Our problematic code makes two requests *per bug*, and plugging this code in would simply bring that number down to one request per bug.

This code takes that down to one request, period. It operates on a scoped collection instead of an individual bug, and expands every object in the collection at once.

{{{#!python
bugs = source_package.bugtasks
bugs = expand(bugs, bugs.each.owner, bugs.each.owner.members)
expanded_bugs = GET(bugs)     # Makes an HTTP request.
for bug in expanded_bugs:     # No further HTTP requests.
    if bug.owner.is_team:
        for member in bug.owner.members:
            print member.display_name
}}}

== Possible client-server syntax ==

The simplest way to support expansion is to add a general ws.expand argument to requests for entries or collections.

{{{
 GET /source_package/bugs?ws.expand=each.owner&ws.expand=each.owner.members
}}}

Specifying values for ws.expand that don't make sense will result in a 4xx response code.

Specifying values that do make sense will result in a much bigger JSON document than if you hadn't specified ws.expand. This document may take significantly longer to produce--maybe long enough that it would have timed out under the current system--but it will hopefully keep you from making lots of small HTTP requests in the future.
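Server-side, single-level expansion amounts to replacing a link with the representation it points to. A sketch under simplifying assumptions: entries are plain dicts, links follow the `*_link` naming convention, and `fetch` stands in for the server's internal lookup of a URL.

{{{#!python
def expand_entry(entry, names, fetch):
    """Return a copy of 'entry' with each linked resource named in
    'names' fetched and inlined (single-level expansion only)."""
    expanded = dict(entry)
    for name in names:
        expanded[name] = fetch(entry[name + '_link'])
    return expanded
}}}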

== Hypermedia controls ==

The simplest way to show how to construct an expansion URL is to serve expansion links alongside the links themselves:

{{{
 'members_collection_link' : 'https://.../~team/members'
 'members_collection_expand_link' : 'https://.../~team?ws.expand=members'
}}}

If we were serving HTML representations, we could serve an HTML form allowing you to expand a number of the links in a representation at one time.

The alternative is to punt, and put information about how to expand links in the human-readable description of our media type. This would mean we could never change from the ws.expand system to some other system. With links or forms we can change it trivially.

We already use semantic indicators to distinguish between data attributes that are scalar (like person['name']) and attributes that are links which can be followed (like bug['owner_link']). Every single link-style attribute will get this new ability, so there's no need to add new control information indicating which links have it. We just need to program every client with this information.
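The convention is mechanical enough that a client can find every link-style attribute in a representation just by inspecting attribute names. A sketch, assuming the `*_link` / `*_collection_link` naming convention:

{{{#!python
def link_attributes(representation):
    """Return the link-style attributes of a representation, relying on
    the *_link naming convention to distinguish them from scalar data."""
    return [name for name in representation if name.endswith('_link')]
}}}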

= The "restrict" operation =

The "expand" operation reduces the need to make an additional HTTP request to follow a link. The "restrict" operation reduces the number of links that need to be followed in the first place, by allowing general server-side filters to be placed on a collection before the data is returned.

The client may request a collection with filters applied to any number of filterable fields. Which fields are "filterable" will be specified through hypermedia: they'll probably be the fields on which we have database indices. The representation returned will be a subset of the collection: the subset that matches the filter(s).

== Possible client-side syntax ==

This code restricts a project's merge proposals to those with "Merged" status and created after a certain date.

{{{#!python
project = launchpad.projects['launchpad']
proposals = project.merge_proposals
proposals = restrict(proposals.each.status, "Merged")
proposals = restrict(proposals.each.date_created, GreaterThan(some_date))
some_proposals = GET(proposals)
for proposal in some_proposals:
    ...
}}}

Two great features to note:

  1. We can apply the date_created filter on the server side, reducing the time and bandwidth expenditure.
  2. We no longer need to publish the getMergeProposals named operation at all. The only purpose of that operation was to let users filter merge proposals by status, and that's now a general feature. In the aggregate, removal of this and similar named operations will greatly simplify the web service.

You're not restricted to filtering collections based on properties of their entries. You can filter based on properties of entries further down the object graph. This code filters a scoped collection of bugs based on a field of the bug's owner. (There may be better ways to do this particular thing, but this should make it very clear what's going on.)

{{{#!python
project = launchpad.projects['launchpad']
bugs = project.bugs
bugs = restrict(bugs.owner.name, 'leonardr')
my_launchpad_bugs = GET(bugs)
}}}

== Possible client-server syntax ==

The simplest way to do this is to add a series of ws.restrict query arguments, each of which works similarly to ws.expand.

{{{
 GET /launchpad/bugs?ws.restrict.owner.name=leonardr
}}}

If your value for a ws.restrict.* argument makes no sense, or you specify a ws.restrict.* argument that doesn't map correctly onto the object graph, you'll get a 4xx error. If your arguments do make sense, you'll get a smaller collection than you would have otherwise gotten.
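On the server side, each ws.restrict.* argument names a path through the object graph and a value to match. A minimal sketch of how such arguments might be applied, using plain dicts for entries (field names here are illustrative, not the real schema):

{{{#!python
PREFIX = 'ws.restrict.'

def parse_restrictions(query_args):
    """Turn ws.restrict.* query arguments into (path, value) pairs."""
    return [(key[len(PREFIX):].split('.'), value)
            for key, value in query_args.items()
            if key.startswith(PREFIX)]

def matches(entry, path, value):
    """Walk the object graph along 'path' and compare to 'value'."""
    target = entry
    for step in path:
        target = target[step]
    return target == value

def restrict_collection(entries, query_args):
    """Return the subset of 'entries' matching every restriction."""
    restrictions = parse_restrictions(query_args)
    return [entry for entry in entries
            if all(matches(entry, path, value)
                   for path, value in restrictions)]
}}}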

== Hypermedia controls ==

We need to add hypermedia controls to indicate which fields in the object graph can be the target of a ws.restrict.* argument.

I'm not sure that we can explain the ws.restrict.* idea itself using WADL, since it's more complicated than the ws.expand idea. We may have to settle for human-readable documentation explaining how a client can pre-traverse the object graph and send an appropriate HTTP request.

= Don't batch collections =

Currently, clients fetch collections in batches, 75 entries at a time. This causes problems when the underlying collections are changing behind the scenes. As the collections change behind the scenes, entries may show up multiple times or fall through the cracks.

The entry that, five seconds ago, was item #76, may now be item #74, meaning that you'll miss it. Or the entry that used to be item #74 may now be item #76, meaning that you'll see it twice.
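The skipped-entry failure is easy to reproduce. In this sketch a client pages through a live collection by numeric offset (batch size 3 instead of 75, for brevity) while an item is deleted behind its back:

{{{#!python
def read_in_batches(get_collection, batch_size):
    """Page through a collection by numeric offset, re-querying the
    (possibly changed) collection for every batch."""
    seen = []
    offset = 0
    while True:
        batch = get_collection()[offset:offset + batch_size]
        if not batch:
            break
        seen.extend(batch)
        offset += batch_size
    return seen

state = {'items': ['a', 'b', 'c', 'd', 'e', 'f'], 'reads': 0}

def live_collection():
    # After the first batch is read, 'a' is deleted behind our back,
    # shifting every later entry one position forward.
    state['reads'] += 1
    if state['reads'] == 2:
        state['items'].pop(0)
    return state['items']
}}}

Here `read_in_batches(live_collection, 3)` returns `['a', 'b', 'c', 'e', 'f']`: the deletion moved 'd' from position 3 to position 2 between requests, so it fell through the cracks.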

Our solution is to effectively get rid of batching. Instead of 75, the batch size will be something huge, like 1 million. If you ask for a collection that contains 100,000 entries, you will get a representation of all 100,000 entries.

Don't panic. For one thing, collections with 100,000 entries will be rare, because the "restrict" operation will make it much easier than it is now to get only a desired subset of a collection.


Foundations/Webservice/DraftProposal (last edited 2010-11-18 17:18:57 by leonardr)