Diff for "ArchitectureGuide/ServicesAnalysis"

Current state

Launchpad is currently designed as a single large Python project plus a PostgreSQL database, with component libraries and a very few additional services: loggerhead and mhonarc.

Some parts of the project run on different stacks (the librarian, buildd-manager), and some are deployed very differently (buildd-slave).

Friction attributable to this design

Code coupling

Because optimisations within the data model require complex queries, different domains within the Launchpad code tree get quite tangled.

Test suite

As a result of high code coupling and a lack of reliable contracts, many changes have unexpectedly wide consequences. Because the code base is very large, the test suite is very large. Because the layers are not amenable to substitution, unit tests are very tricky to write and we usually exercise most of the entire stack. Additionally, finding the right tests to work on is hard - partly because what one might consider layers are all smooshed together, and partly because we have many different styles of test which are not consistently testing at the same layers - so it is hard to tell where to start.

Monolithic downtime

With a single database and deployed tree, most changes require complete downtime.

Poor integration with non-included services

Things like mhonarc and loggerhead integrate very poorly with LP because we have no good way to skin them, and our monolithic approach drives what little integration we do have.

Benefits from this design

Schema and code coherency

We never have to deal with a schema from a different tree - all the code that knows about the plumbing is in one place.

Atomic relational integrity

All our data is in one place, so we can use foreign keys to track deletes, and things are never inconsistent.

Fewer moving parts to learn

As all the code is in one tree, we generally have one set of idioms, one language and one database engine to work with. This helps keep things approachable (and, as bugs in-tree generally get more rapid attention than bugs in related trees, we have some anecdata to support this benefit).

A service based Launchpad

In changing the tradeoff we make, we should be clear about the things we want to optimise for, so that we can evaluate new tradeoff points.

As a team we want to achieve three key things:

  • Fast development times
    • Which implies fast test runs and a low WTF factor on changes
  • Low latency services
  • Low or no-downtime schema changes

Better isolation of changes permits confident changes with smaller numbers of tests run, and decreases the WTF factor - so we need to look at how we can increase isolation of changes. Improving the overall speed with which requests are serviced helps with latency - so we need to consider whether we will add (or remove) intrinsic limits to performance. The busier a subsystem is, the harder it is to take it down for schema changes (and the larger the subsystem is, the wider the outage caused by taking it down).

For change isolation we need to look at the entire stack - previously we've only considered contracts within the one code base. But sitting on one schema actually implies one large bucket within which changes can propagate: we see this regularly when we deal with optimisations and changes to the implementation of common layers (such as the team participation graph-caching optimisation).

One way we could improve isolation is to (gradually, not radically) convert Launchpad to be built from smaller services, each of which has a crisp contract, its own storage and schema. This would permit testing of just the changes within one service - or only the services that talk to that service.

While there are many ways one might try to slice up Launchpad into smaller services, we need to avoid creating silos between components we want to deeply integrate. One way to avoid that when we don't know what we want to integrate is to focus on layers which can be reused across Launchpad rather than on (for instance) breaking out user-visible components.

Identified service opportunities

These are potential things we could pull out into services. They are examples only - detailed analysis of each has not been done, so it is not possible to say that they are all definite: they are merely opportunities.

team participation / directory service

The largest teams-per-person count is ~300; the largest persons-per-team count is 18K, but discounting the top two teams drops it to 3.7K, and the top ten gets it down to 1.6K. Even the 18K case can be serialised and passed over the network in 300ms, which makes it feasible to grab and pass between systems. Smaller cases like a 2K-member team can be handled in 40ms (assessed using psql and postgres).

We have a number of significant use cases around the directory service which are poorly satisfied at the moment - we don't permit non-membership relations like 'administers' or 'audits' (e.g. being granted view privileges but not mutation privileges).

Running (minimally) the person-in-team, teams-for-person, persons-for-team facilities as a service would aid the separation of SSO (by providing a high availability service that the SSO web service could back end onto).
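
As a rough illustration only, those three operations could sit behind a plain HTTP/JSON interface along these lines; the endpoint paths, port and in-memory data below are assumptions made up for the sketch, not an existing Launchpad API.

{{{#!python
# Illustrative sketch of a directory service exposing person-in-team,
# teams-for-person and persons-for-team over HTTP/JSON.  All names, paths
# and data here are invented for the example.
import json
from urllib.parse import parse_qs
from wsgiref.simple_server import make_server

# Toy in-memory data: team -> set of member names.
MEMBERS = {
    "launchpad-devs": {"alice", "bob"},
    "bug-supervisors": {"alice"},
}


def app(environ, start_response):
    query = parse_qs(environ.get("QUERY_STRING", ""))
    person = query.get("person", [""])[0]
    team = query.get("team", [""])[0]
    path = environ["PATH_INFO"]
    if path == "/persons-for-team":
        body = sorted(MEMBERS.get(team, set()))
    elif path == "/teams-for-person":
        body = sorted(t for t, members in MEMBERS.items() if person in members)
    elif path == "/person-in-team":
        body = person in MEMBERS.get(team, set())
    else:
        start_response("404 Not Found", [("Content-Type", "text/plain")])
        return [b"unknown endpoint"]
    payload = json.dumps(body).encode("utf-8")
    start_response("200 OK", [("Content-Type", "application/json")])
    return [payload]


if __name__ == "__main__":
    # e.g. GET /person-in-team?person=alice&team=launchpad-devs -> true
    make_server("localhost", 8021, app).serve_forever()
}}}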

blob storage (the librarian)

The librarian stores upwards of 14M distinct files (after coalescing by hash) - but it is tightly coupled to the Launchpad schema.

We could build or bring in a simpler blob store and layer our special needs on top of it, or as an extension to it - for instance, needs such as the public-restricted librarian, size calculations for many objects, or even aggregates (e.g. model a PPA as a bucket of blobs and we can get size data aggregated directly by the blob store).

The current service is difficult to evolve because it is tightly coupled: any attempt to modify the schema runs into the slow-patch-application and high-change-friction issue, which exists primarily because dozens of call sites talk directly to the storage schema even though most of them just want URL generation.
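
To make the layering concrete, here is a minimal sketch (classes and names invented for illustration, not the real librarian code) of a dumb content-addressed blob store with URL generation and per-bucket size aggregation layered on top, so call sites never touch the storage schema.

{{{#!python
# Sketch only: generic content-addressed storage underneath, Launchpad's
# special needs (URL generation, aggregates) layered on top.
import hashlib
from collections import defaultdict


class BlobStore:
    """Simple content-addressed storage: bytes in, hash key out."""

    def __init__(self):
        self._blobs = {}

    def put(self, data):
        digest = hashlib.sha256(data).hexdigest()
        self._blobs[digest] = data
        return digest

    def get(self, digest):
        return self._blobs[digest]


class LibrarianLayer:
    """Launchpad-specific extras layered over the generic store."""

    def __init__(self, store, base_url):
        self._store = store
        self._base_url = base_url
        self._buckets = defaultdict(set)   # e.g. one bucket per PPA

    def add(self, bucket, data):
        digest = self._store.put(data)
        self._buckets[bucket].add(digest)
        return digest

    def url_for(self, digest):
        # Most call sites only need this; they never see the storage schema.
        return "%s/%s" % (self._base_url, digest)

    def bucket_size(self, bucket):
        # Aggregate size for e.g. a PPA, computed by the blob layer itself.
        return sum(len(self._store.get(d)) for d in self._buckets[bucket])


if __name__ == "__main__":
    layer = LibrarianLayer(BlobStore(), "https://blobs.example.net")
    digest = layer.add("ppa:alice/ppa", b"some package contents")
    print(layer.url_for(digest), layer.bucket_size("ppa:alice/ppa"))
}}}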

mhonarc (the lists.launchpad.net UI)

This runs as an external service, but our appservers do not use it as a backend - instead end users use it directly. If we modified it to write its archived information as machine-readable metadata (e.g. a JSON file per message, per index page and per list), then our template servers - which know about facets, menus and so on - could efficiently grab that and render a nice page.

This would be easier than retrofitting an event system to update individual archives with different menus as LP policy and metadata change. It would even - if we want it to - permit robust renames of mailing lists and so on.
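
As a purely hypothetical example of the machine-readable metadata, the archiver could write something like the JSON below alongside each archived message, and a template server would read it and wrap the message in the normal Launchpad chrome; the field names and file names are assumptions, not an existing format.

{{{#!python
# Sketch of per-message metadata an archiver could emit, and how a template
# server might consume it.  Field and file names are illustrative only.
import json

message_metadata = {
    "list": "launchpad-dev",
    "message_id": "<example@lists.launchpad.net>",
    "subject": "Re: services analysis",
    "author": "Alice Example",
    "date": "2011-05-09T23:07:13Z",
    "in_reply_to": None,
    "body_html": "archive/launchpad-dev/msg00042.html",
}

# Archiver side: write the metadata next to the archived HTML.
with open("msg00042.json", "w") as f:
    json.dump(message_metadata, f, indent=2)

# Template server side: read the metadata and render the archived body
# inside the site's own facets and menus instead of the archiver's HTML.
with open("msg00042.json") as f:
    meta = json.load(f)
print("Rendering %r from %s with standard LP navigation"
      % (meta["subject"], meta["body_html"]))
}}}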

loggerhead (bazaar.launchpad.net web UI)

As with mhonarc, this already runs as a separate service; however, we don't present any of its content in the main UI, and it's a constant source of poor user experience. Happily it already has a minimal JSON API we can use and extend.

bzr+ssh (bazaar.launchpad.net bzr protocol)

This doesn't talk to the database at all and is a shoo-in to be split out now.

distribution source package names

The package names that can be used in the bugtracker depend on what packages are present in the distribution - currently this is a non-trivial query, but it could easily be delivered via a web service to the bugtracker UI. Whether a deeper split in the packaging metadata is needed or desirable is a related but as yet unanalysed question.
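
A sketch of the lookup such a web service could wrap - a simple prefix search over the distribution's published source package names, suitable for name completion in the bug filing UI. The data and function names are illustrative.

{{{#!python
# Sketch of a prefix-completion lookup over a distribution's source package
# names; in reality the list would come from the packaging data, not a
# hard-coded sample.
import bisect

PACKAGE_NAMES = sorted(["apache2", "apparmor", "apt", "bzr", "coreutils", "cups"])


def complete_package_name(prefix, limit=10):
    """Return up to `limit` package names starting with `prefix`."""
    start = bisect.bisect_left(PACKAGE_NAMES, prefix)
    matches = []
    for name in PACKAGE_NAMES[start:]:
        if not name.startswith(prefix):
            break
        matches.append(name)
        if len(matches) == limit:
            break
    return matches


if __name__ == "__main__":
    # The bug filing UI would call a small web service wrapping this lookup
    # instead of issuing the non-trivial query against the main database.
    print(complete_package_name("ap"))   # ['apache2', 'apparmor', 'apt']
}}}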

A graph database

We have many graph-shaped problems in the system - team participation, branch merged-status, package set traversals and probably more. A generic high-performance graph database supporting caching, reachability oracles and parallelism could be used to simplify much of the graph-using code in Launchpad. While we could use a PostgreSQL datatype, previous investigations suggest these are generally less capable than dedicated graph servers.
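
To illustrate the sort of query such a service would answer, here is a toy reachability check for branch merged-status (is revision X an ancestor of branch tip Y?), with a memoised transitive closure standing in for a real reachability oracle; the revision graph is invented.

{{{#!python
# Sketch of a reachability query over a revision graph, the kind of
# question a graph service with a reachability oracle would answer.
from functools import lru_cache

# Toy revision graph: revision -> parent revisions.
PARENTS = {
    "r4": ("r3", "r2"),   # r4 is a merge commit
    "r3": ("r1",),
    "r2": ("r1",),
    "r1": (),
}


@lru_cache(maxsize=None)
def ancestors(revision):
    """All revisions reachable from `revision`, computed once and cached."""
    result = set()
    for parent in PARENTS.get(revision, ()):
        result.add(parent)
        result |= ancestors(parent)
    return frozenset(result)


def is_merged(revision, tip):
    """Has `revision` been merged into the branch whose tip is `tip`?"""
    return revision == tip or revision in ancestors(tip)


if __name__ == "__main__":
    print(is_merged("r2", "r4"))  # True: r2 is an ancestor of r4
    print(is_merged("r4", "r3"))  # False
}}}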

A subscription service

There are some things we have subscriptions to (pillars, branches, bugs, questions), all of which are implemented differently; beyond that, we'd like to have ephemeral subscriptions to named objects (e.g. when someone has a browser page open on a given URL and we want to push updates to the page to them), and possibly even ephemeral subscriptions to anonymous things (which we might use to implement hand-off based callbacks for event-driven API scripts).

A subscription service which offers filtering, durations and callbacks on changes could provide a key piece of functionality for implementing lower-latency backend services (like browser notifications when a branch has been scanned), all in a single module which can be highly optimised.
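
A minimal sketch of those primitives - filters, expiry for ephemeral subscriptions, and a callback on matching events. None of these classes exist in Launchpad; the names and the example event are made up.

{{{#!python
# Sketch of subscription primitives: a predicate filter, an optional
# duration after which the subscription lapses, and a callback per event.
import time


class Subscription:
    def __init__(self, target, callback, predicate=None, duration=None):
        self.target = target                 # e.g. a bug id, branch name or URL
        self.callback = callback             # called with each matching event
        self.predicate = predicate or (lambda event: True)
        self.expires = time.time() + duration if duration else None

    def active(self):
        return self.expires is None or time.time() < self.expires


class SubscriptionService:
    def __init__(self):
        self._subs = []

    def subscribe(self, target, callback, predicate=None, duration=None):
        sub = Subscription(target, callback, predicate, duration)
        self._subs.append(sub)
        return sub

    def publish(self, target, event):
        # Drop expired (ephemeral) subscriptions, then notify the rest.
        self._subs = [s for s in self._subs if s.active()]
        for sub in self._subs:
            if sub.target == target and sub.predicate(event):
                sub.callback(event)


if __name__ == "__main__":
    service = SubscriptionService()
    # e.g. notify an open browser page when its branch finishes scanning,
    # but only for the next half hour.
    service.subscribe(
        "branch:~alice/project/trunk",
        callback=lambda e: print("push to page:", e),
        predicate=lambda e: e.get("kind") == "scan-complete",
        duration=1800,
    )
    service.publish("branch:~alice/project/trunk", {"kind": "scan-complete"})
}}}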

A discussions/microlist service

Similar to the subscription service, we have many objects that can be commented on, and we want the same facilities on them all - spam management, searching and indexing, reporting (what has person X commented on) - but at the moment we have to make schema changes for every new object we want a discussion facility around. It might even be possible (if we wanted to) to converge on one discussion facility, with mailman/mhonarc backending and optimising things.
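
A sketch of what a generic discussion facility might look like if comments were keyed by an opaque target identifier, so attaching comments to a new kind of object needs no schema change; the class, fields and targets below are assumptions for illustration.

{{{#!python
# Sketch of a generic discussion/microlist service: comments keyed by an
# opaque target string, with spam flagging and per-author reporting.
from collections import defaultdict
from datetime import datetime, timezone


class DiscussionService:
    def __init__(self):
        self._comments = defaultdict(list)   # target -> list of comment dicts
        self._by_author = defaultdict(list)  # author -> list of (target, comment)

    def add_comment(self, target, author, text):
        comment = {
            "author": author,
            "text": text,
            "created": datetime.now(timezone.utc),
            "spam": False,
        }
        self._comments[target].append(comment)
        self._by_author[author].append((target, comment))
        return comment

    def comments(self, target, include_spam=False):
        return [c for c in self._comments[target] if include_spam or not c["spam"]]

    def mark_spam(self, target, index):
        self._comments[target][index]["spam"] = True

    def commented_on(self, author):
        """Reporting: everything a given person has commented on."""
        return sorted({target for target, _ in self._by_author[author]})


if __name__ == "__main__":
    service = DiscussionService()
    # The same service backs any commentable object: bugs, branches, questions...
    service.add_comment("bug:1", "alice", "Confirmed on natty.")
    service.add_comment("branch:~alice/project/trunk", "alice", "Ready for review.")
    print(service.commented_on("alice"))
}}}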

Friction in this design

Benefits from this design
