About this document
Migrating to a service-based design is a large project. We can't do it in one go, and per the analysis there are many different things we could do. We want to make sure that the project is a net win: we shouldn't put more effort into it than we will save in efficiency on future changes and avoided headaches. One way to do this is to make sure each individual transition will either pay for itself in the short term, or be part of a larger set which we expect to pay for itself in aggregate.
This document provides an overview of the basic decisions we need to reach before starting widespread work on using services, and provides some top level categories we can use to assess service projects.
Each service possibility we are considering should be included in this document: if you encounter service opportunities not listed here, please add them. Each service needs to be described: basically what it would do and why.
As bringing up these services is by definition the creation of a new subproject under /launchpad-project, we don't strictly have a sensible place to file bugs. However for any service possibility which is a clear-cut improvement and involves extracting code from LP itself - please feel free to file that as a bug on Launchpad.
Level of detail
We don't need a huge level of detail in the roadmap: the description of each service possibility needs to be enough to let any of the following decisions be made:
- A developer could scratch an itch and JFDI
- A squad needs to work on this for (a time period)
- Some other defect is best fixed by implementing this service (be that scaling, responsiveness, resilience...)
All services need to meet the minimum requirements for new services. Some will be very straightforward technical implementations. Others will need some refinement around what the service needs to accomplish. For instance the GPG service has questions around key management and security - a checking-only service is a no-brainer but a service which might create keys or do signing needs more thought.
If a service looks complex, or its needs are hard to pin down with high confidence, then a LEP is needed. The normal LEP process will be used, but the UI under review is the API within the datacentre rather than an end user UI, and the technical architect rather than the product strategist will approve the LEP.
- We have a lot of learning to do - HA and deployment will be significantly more complex. Our monitoring needs to get a lot better.
- Some things will be dependent on object-sync facilities (e.g. rabbitmq) in the data centre.
- Adding backend services will increase latency if done poorly.
We need to choose various defaults for backend services.
Some have already been chosen.
We're using RabbitMQ, as decided in mid 2010. Not because it's the best, but because it's already in use in Canonical, and we're very unlikely to gain enough from using a different MQ for now: once we're solidly service based we can revisit this.
Note: we're seeing many test failures with the rabbit infrastructure in buildbot/jenkins. If that doesn't get addressed soon, changing to a queue that is simpler to deploy may well be appealing.
In-datacentre HTTP protocol backend authentication
Some services may not use HTTP as a substrate and will need bespoke authentication.
After consultation with IS Projects:
- micro services run behind Apache and use ip address + basic auth (must be from a known ip address with a basic auth password to work). The in-dc network is considered trustworthy - we can run on HTTP in this setup.
Excluded options were:
- OAuth. Possibly with ip limits as per basic.
- Each service implements auth directly.
The big webapp today runs behind haproxy not apache (apache is only at the outer edge) so we should expect to implement whatever we choose directly for it (but not for other microservices).
-- RobertCollins: The use of apache + ip limits + basic auth will permit easy debugging (dev services will have no credentials), extremely lightweight client and per-request overhead, and we can add OAuth into Apache whenever we want in the future.
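As a rough illustration of the "known ip address plus basic auth" rule, the check can be sketched in Python. This is only a sketch of the kind of check the big webapp would have to implement directly (since it sits behind haproxy rather than Apache); the network range, user name, password and the function name `is_authorised` are all made up for the example.

```python
import base64
import hmac
import ipaddress

# Hypothetical values for illustration only: a real deployment would load
# these from configuration managed with IS.
ALLOWED_NETWORKS = [ipaddress.ip_network("10.0.0.0/24")]
CREDENTIALS = {"gpg-client": "s3cret"}


def is_authorised(remote_addr, authorization_header):
    """Return True only if the caller's IP is known AND basic auth matches."""
    addr = ipaddress.ip_address(remote_addr)
    if not any(addr in net for net in ALLOWED_NETWORKS):
        return False
    if not authorization_header or not authorization_header.startswith("Basic "):
        return False
    try:
        decoded = base64.b64decode(authorization_header[6:]).decode("utf-8")
    except (ValueError, UnicodeDecodeError):
        return False
    user, sep, password = decoded.partition(":")
    if not sep:
        return False
    expected = CREDENTIALS.get(user)
    if expected is None:
        return False
    # Constant-time comparison avoids leaking how much of the password matched.
    return hmac.compare_digest(password.encode(), expected.encode())
```

Both conditions must hold: a correct password from an unknown address is rejected, as is a known address with no credentials.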
In-datacentre network protocol
In the datacentre we have no latency to worry about, but we do need to worry about efficiency and ease of development. While some services already have protocols, any new service we make will need us to choose a protocol for it. No decisions yet but some options are:
- XMLRPC: pros: already deployed, batteries included in Python and many other languages. cons: XML, RPC model rather than restful - no opportunity for caching, URLs can be opaque when debugging.
- adhoc restful json based apis. pros: nice to look at by hand, easy to interact with manually. cons: not included in the Python standard library, optimises for things that don't really affect us.
- google protobufs: pros: clear contracts, wire level upgrading built-in. cons: not well understood within the LP team, currently somewhat slow [in python].
- lazr.restful : Not suitable for rapid development or consumption by other servers.
- AMQP: No high-availability solution for the message queue infrastructure.
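To make the first two options a little more concrete, here is a small sketch that encodes the same call both ways using only the Python standard library. The method name `person_in_team` and its arguments are hypothetical, and the JSON envelope shown is just one ad hoc shape, not a proposed wire format.

```python
import json
import xmlrpc.client

# Hypothetical method name and arguments, for illustration only.
method = "person_in_team"
params = ("alice", "launchpad-devs")

# XML-RPC: the standard library builds the full request body for us.
xmlrpc_body = xmlrpc.client.dumps(params, methodname=method)

# Ad hoc JSON: we have to invent the envelope ourselves.
json_body = json.dumps({"method": method, "params": list(params)})

# Round-trip the XML-RPC body to show the call survives encoding.
decoded_params, decoded_method = xmlrpc.client.loads(xmlrpc_body)

print(len(xmlrpc_body), "bytes as XML-RPC")
print(len(json_body), "bytes as JSON")
```

The XML-RPC body is noticeably larger and harder to read by eye, while the JSON form leaves the envelope and the URL scheme for us to define.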
High availability of HTTP backend services
HAProxy. Apache. Linux Virtual Server. (LVS per datacentre -> HAProxy -> backends with apache in each datacentre -> local connection to the microservice).
Other services will need some appropriate solution (e.g. clustering, failover etc).
Things which Launchpad may use but are not intrinsically tied to the Launchpad data model belong here. Some services will be in a grey area. A good question to ask is 'could a different web site use the service sensibly'. If the answer is yes then it's probably independent.
Launchpad has to manage a number of GPG keys in a few different ways:
- Create new keys for PPA signatures
- Validate signatures (text in, (signer, status, cleartext) out)
- Store prepared key revocation certificates in the event of a compromise.
- Sign new package binaries built in the build cluster (for PPAs, Ubuntu and derived distributions).
We may not want to expose all of these as a web service (because a single lying client in the datacentre could get a hostile binary package signed). However exposing the validation of signatures is a no brainer and would save some significant operational complexity. We can iterate to add more as wanted.
A dedicated web service that takes signed text in and returns the signer and the cleartext.
- avoid cold cache warmup for GPG validation (by having a long running gpg cache dir)
- make GPG validation available to non-lp-zope services without the headache of direct integration.
- Migrate over the GPG handler code and create a test fake.
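The "test fake" mentioned above could look something like the following sketch. It assumes the service contract is the (signer, status, cleartext) triple described earlier; the class name, method names and status strings are invented for illustration, and the fake performs no cryptography at all.

```python
from collections import namedtuple

# Mirrors the "(signer, status, cleartext) out" shape described above.
VerificationResult = namedtuple("VerificationResult", "signer status cleartext")


class FakeGPGValidationService:
    """In-memory stand-in for the validation service, for use in tests.

    It performs NO cryptography: a test registers the answer it wants
    returned for a given piece of signed text.
    """

    def __init__(self):
        self._known = {}

    def register(self, signed_text, signer, cleartext):
        """Teach the fake that signed_text is validly signed by signer."""
        self._known[signed_text] = VerificationResult(signer, "good", cleartext)

    def validate(self, signed_text):
        """Return the registered result, or a 'bad' result if unknown."""
        return self._known.get(
            signed_text, VerificationResult(None, "bad", None))
```

Code under test would talk to this object through the same interface as the real client, so tests never need a working GPG setup.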
A dedicated web service which has 4 APIs:
- Create a key (takes various metadata, publishes, returns the key id).
- Revoke a key (takes a key id, publishes the revocation cert for it).
- Clearsign a document (takes a plaintext and a key id, returns signed doc)
- Detached sign a document (takes a plaintext and a key id, returns signature)
- Aids in securing GPG keys: they can be placed on dedicated machines with tightly controlled API access (only the publisher machines would have access)
- avoids needing a working GPG setup on all the publisher machines (ubuntu, ppas, derived distros)
- A separate instance can be run up for higher security keys
- Migrate over existing GPG handler code, create a test fake, allocate at least one dedicated machine (for security reasons, sharing machines isn't a good idea).
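The four APIs listed above can be sketched as a test fake in the same spirit. All names here (`FakeGPGSigningService`, `create_key`, `revoke_key`, `clearsign`, `detached_sign`, the `FAKE0001` id format) are assumptions made for the example; no real keys or signatures are produced.

```python
import itertools


class FakeGPGSigningService:
    """Test fake with the four operations listed above; no real crypto."""

    def __init__(self):
        self._ids = itertools.count(1)
        self.keys = {}        # key id -> metadata supplied at creation
        self.revoked = set()  # key ids whose revocation has been "published"

    def create_key(self, **metadata):
        """Create a key from metadata and return its (fake) key id."""
        key_id = "FAKE%04d" % next(self._ids)
        self.keys[key_id] = metadata
        return key_id

    def revoke_key(self, key_id):
        self._require(key_id)
        self.revoked.add(key_id)

    def clearsign(self, plaintext, key_id):
        """Return the plaintext with a fake inline signature appended."""
        self._require(key_id)
        return "%s\n[fake signature by %s]" % (plaintext, key_id)

    def detached_sign(self, plaintext, key_id):
        """Return only a fake signature, separate from the document."""
        self._require(key_id)
        return "[fake signature by %s over %d chars]" % (key_id, len(plaintext))

    def _require(self, key_id):
        # Unknown and revoked keys are both unusable.
        if key_id not in self.keys or key_id in self.revoked:
            raise KeyError(key_id)
```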
blob storage (the librarian)
The librarian stores upwards of 14M distinct files (after coalescing by hash) - but as implemented is tightly coupled to the Launchpad schema. It suffers from cold-cache effects on a regular basis, and we have explicit mechanisms in the schema to let us have weaker-than-actual links (for instance we can delete the blob but keep the reference, and delete the reference but retain the blob for a while).
We could build/bring in a simpler blob store and layer our special needs on top or as an extension to it: for instance, needs such as the public-restricted librarian, size calculations for many objects, or even aggregates (e.g. model a ppa as a bucket of blobs and we can get size data directly aggregated by the blob store).
The current service is difficult to evolve because it is tightly coupled: any attempt to modify the schema runs into the slow-patch-application + high-change-friction issue which primarily exists because dozens of call sites talk directly to the storage schema even though most of them just want url generation.
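A minimal sketch of the "simpler blob store with buckets" idea follows. It assumes SHA-256 content addressing and an in-memory dictionary purely for illustration; the class and method names are hypothetical and nothing here reflects how the librarian is actually implemented.

```python
import hashlib


class BlobStore:
    """Content-addressed store: identical content is stored only once."""

    def __init__(self):
        self._blobs = {}    # sha256 hex digest -> bytes
        self._buckets = {}  # bucket name -> {file name -> digest}

    def put(self, bucket, name, content):
        """Store content under bucket/name and return its digest."""
        digest = hashlib.sha256(content).hexdigest()
        self._blobs.setdefault(digest, content)  # coalesce by hash
        self._buckets.setdefault(bucket, {})[name] = digest
        return digest

    def get(self, bucket, name):
        return self._blobs[self._buckets[bucket][name]]

    def bucket_size(self, bucket):
        """Total bytes referenced by a bucket (e.g. one PPA).

        Each reference is counted, even if two names share a blob.
        """
        refs = self._buckets.get(bucket, {})
        return sum(len(self._blobs[digest]) for digest in refs.values())

    def distinct_blobs(self):
        return len(self._blobs)
```

Callers that only want a URL or a size would talk to an interface like this rather than to the storage schema, which is the decoupling the paragraph above is after.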
bzr+ssh (bazaar.launchpad.net bzr protocol)
The bzr+ssh / sftp service is essentially a standalone codebase included in the Launchpad tree. Its dependencies are on a URL mapping scheme which Launchpad already exposes over XMLRPC, access to credentials (also via XMLRPC) and on a chosen filesystem layout (hard coded today).
With a small amount of effort this should be extractable into a generic reusable standalone service.
Benefits: decouple maintenance, stop running tests for how bzr is hosted in the main LP test suite. Would also make it easier for the bzr team (the natural maintainer of this component) to hack on it and improve it - or even make it into a better offering for bzr users to use in-house.
existing backend services
Things like splitting out the importd engine, rosetta translation import/export services go here. The benefits from splitting these things out will be shorter test run times for the main application; as they are generally already service based the operational change should be modest. Consider these low hanging fruit - easy to do, some benefits, low risk.
Our scripts all need internal APIs set up, and need to be migrated to them. They belong here. Many of these scripts routinely cause headaches. We should expect to identify a raft of performance problems and data integrity / access control issues as we migrate them. (Our current scripts have a lobotomised security model in place which we would not want to keep).
We have a number of potential optimisations best done using services: examples include
- graph traversals/reachability queries
- batch/aggregate workloads (map reduce)
- long poll callbacks
These are best done using services because they either need custom databases, are built using a dedicated (and separate) schema on a SQL database or will be long running and not subject to our normal transaction timeliness constraints.
A graph database
We have many places in the system that traverse graphs - team participation, branch merged-status, package set traversals and probably more. A generic high performance graph database supporting caching, reachability oracles and parallelism could be used to simplify much of the graph-using code in Launchpad. While we could use a postgresql datatype, those found in previous searches were generally less capable than dedicated graph servers.
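The core query such a service would answer is reachability. As a sketch only, here it is as a plain breadth-first search over an in-memory adjacency mapping; a real graph server would add the caching, oracles and parallelism mentioned above. The function name and the sample team names are made up.

```python
from collections import deque


def reachable(edges, start):
    """Return every node reachable from start, excluding start itself.

    edges maps a node to the nodes it points at directly, e.g. a person
    to the teams they joined, and a team to the teams it belongs to.
    """
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in edges.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    seen.discard(start)
    return seen


# Hypothetical membership graph: alice is only a direct member of bzr-core.
MEMBERSHIP = {
    "alice": ["bzr-core"],
    "bzr-core": ["bzr"],
    "bzr": ["launchpad-users"],
    "bob": ["bzr"],
}
print(sorted(reachable(MEMBERSHIP, "alice")))
```

The same function covers the other cases named above if the edges are, say, revision parents or package set inclusions.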
distribution source package names
The package names that can be used in the bugtracker depend on what packages are present in the distribution - currently this is a non-trivial query, but it could easily be stored as a specific business rule delivered via a web service to the bugtracker UI. Whether a deeper split in the packaging metadata is needed or desirable is a related but as yet unanalyzed question.
Reporting / data warehousing
We may well need to build multiple parallel schemas for our site - one cheap-transaction schema to support changing data rapidly, one search schema for fast lookups, and one data warehousing schema for reporting. While we could place all these in the same DB schema, the different constraints and requirements (search schemas want denormalised data, cheap-transaction schemas want maximum orthogonality, warehouse schemas want data aggregated into fact tables) mean that we would benefit from using dedicated tools for these. For instance, we'd avoid contention on memory footprint and be able to use a dedicated warehouse DB if appropriate.
We have other services which are already poorly integrated into Launchpad. Specific examples are:
Overhauling these so that their UI is entirely done in the LP template engine and they act as pure data sources with real-time lookups would make a significant improvement to user experience.
loggerhead (bazaar.launchpad.net web UI)
Like mhonarc, this already runs as a separate service; however we don't present any of its content in the main UI, and it is a constant source of poor user experience. Happily it already has a minimal json API we can use and extend.
Projects/LiveBranches is doing this.
mhonarc (the lists.launchpad.net UI)
This runs as an external service but our appservers do not use it as a backend - instead end users use it directly. If we modified it to write its archived information as machine readable metadata (e.g. a json file per message, per index page, per list) then our template servers which know about facets and menus and so on could efficiently grab that and render a nice page.
This would be easier than retrofitting an event system to update individual archives with different menus as LP policy and metadata change. It would even - if we want it to - permit robust renames of mailing lists and so on.
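The "json file per message" idea might be sketched as follows. The field names, the directory layout and the numbering scheme are all assumptions for the example, and the sketch only handles simple non-multipart messages; it is not based on how mhonarc actually stores its archives.

```python
import json
import os
import tempfile
from email import message_from_string


def archive_message(raw_message, list_name, archive_root):
    """Write one JSON metadata file for a message and return its path."""
    msg = message_from_string(raw_message)
    record = {
        "list": list_name,
        "message_id": msg["Message-ID"],
        "subject": msg["Subject"],
        "from": msg["From"],
        "date": msg["Date"],
        "body": msg.get_payload(),  # a str for non-multipart messages
    }
    list_dir = os.path.join(archive_root, list_name)
    os.makedirs(list_dir, exist_ok=True)
    # Hypothetical naming scheme: one file per message, in arrival order.
    path = os.path.join(list_dir, "%06d.json" % len(os.listdir(list_dir)))
    with open(path, "w", encoding="utf-8") as f:
        json.dump(record, f)
    return path


raw = (
    "Message-ID: <1@example.com>\n"
    "From: a@example.com\n"
    "Subject: Hello\n"
    "Date: Sat, 1 Jan 2011 00:00:00 +0000\n"
    "\n"
    "Hi all\n"
)
with tempfile.TemporaryDirectory() as root:
    path = archive_message(raw, "launchpad-dev", root)
    with open(path, encoding="utf-8") as f:
        loaded = json.load(f)
```

A template server would then read `loaded` and wrap it in the current facets and menus at render time, so a policy or metadata change never requires rewriting the archive.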
We have functionality that is currently tightly coupled which we would benefit from splitting into dedicated services:
- directory services
- rendering/UI (with API bundled in because we render in the public API)
A map/reduce facility
Many of the ad hoc analyses users use launchpadlib for, and many of our backend scripts, consist of running some code against independent objects - for instance, extracting a semantic map from bugs, updating heat on bugs, generating custom aggregates (other than the precanned reports we offer), checking for new product releases, auditing Ubuntu mirrors, creating the burn-down charts and lp-kanban. A map/reduce service might let us generalise the parallelism and filtering aspects of this while reducing the administration involved in running such internal checks - and done well we could offer it to our users for ad hoc purposes.
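As a toy illustration of the shape of such a job, the sketch below scores bugs independently (the map step) and then combines the scores (the reduce step). The bug records and the heat weights are invented for the example and are not Launchpad's real heat formula; a thread pool stands in for whatever parallelism a real service would provide.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

# Hypothetical bug records; a real service would fetch these from Launchpad.
BUGS = [
    {"id": 1, "subscribers": 4, "duplicates": 1, "affected": 2},
    {"id": 2, "subscribers": 10, "duplicates": 0, "affected": 1},
    {"id": 3, "subscribers": 1, "duplicates": 3, "affected": 0},
]


def heat(bug):
    """Map step: score one bug independently of all the others.

    The weights are made up for illustration.
    """
    score = 2 * bug["subscribers"] + 6 * bug["duplicates"] + 4 * bug["affected"]
    return bug["id"], score


def hottest(best, scored):
    """Reduce step: keep whichever (id, heat) pair has the larger heat."""
    return scored if scored[1] > best[1] else best


def run(bugs, workers=4):
    """Run the map step in parallel, then fold the results together."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        scored = list(pool.map(heat, bugs))
    return reduce(hottest, scored)


print(run(BUGS))
```

Because each bug is scored without reference to any other, the service is free to decide how to partition and schedule the work.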
team participation / directory service
We have a number of significant use cases around the Launchpad person/team directory service which are poorly satisfied at the moment - for instance, we don't permit non-membership relations like 'administers' or 'audits' (e.g. is granted view privileges but not mutation privileges).
The largest teams-per-person is ~300, the largest persons-per-team is 18K, but discounting the top two drops it to 3.7K, and top ten gets down to 1.6K. The 18K case can be serialised and passed over the network in 300ms though, which makes it feasible to grab and pass between systems. Smaller cases like a 2K membership team can be handled in 40ms (using psql and postgres to assess).
Running (minimally) the person-in-team, teams-for-person, persons-for-team facilities as a service would aid the separation of SSO (by providing a high availability service that the SSO web service could back end onto).
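The three facilities named above, plus the non-membership relations mentioned earlier, could be sketched like this. The class, the relation names and the in-memory storage are assumptions for illustration, and the sketch only answers direct relations: it does not expand nested teams.

```python
from collections import defaultdict


class Directory:
    """Toy person/team directory with typed relations (direct only)."""

    def __init__(self):
        # (person, team) -> set of relation names, e.g. {"member"}
        self._relations = defaultdict(set)

    def relate(self, person, team, relation="member"):
        self._relations[(person, team)].add(relation)

    def person_in_team(self, person, team, relation="member"):
        return relation in self._relations.get((person, team), ())

    def teams_for_person(self, person, relation="member"):
        return sorted(
            t for (p, t), rels in self._relations.items()
            if p == person and relation in rels)

    def persons_for_team(self, team, relation="member"):
        return sorted(
            p for (p, t), rels in self._relations.items()
            if t == team and relation in rels)
```

Keeping the relation as a parameter means 'administers' or 'audits' are just more rows, rather than new special cases in the membership model.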
A subscription service
There are some things we have subscriptions to (pillars, branches, bugs, questions), all of which are implemented differently; beyond that we'd like to have ephemeral subscriptions to named objects (e.g. when someone has a browser page open on a given url and we want to push updates to the page to them), and possibly even ephemeral subscriptions to anonymous things (which we might use to implement hand-off based callbacks for event driven api scripts).
A subscription service which offers filtering, durations and callbacks on changes could provide a key piece of functionality for implementing lower latency backend services (like browser notifications when a branch has been scanned) all in a single module which can be highly optimised.
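A minimal in-process sketch of "filtering, durations and callbacks" is shown below. Every name in it is hypothetical, and a real service would deliver callbacks over the network (or via a queue) rather than by calling a Python function directly.

```python
import time


class SubscriptionService:
    """Toy subscription registry with filters, durations and callbacks."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock  # injectable so tests can control time
        self._subs = []      # (name, expires_at or None, predicate, callback)

    def subscribe(self, name, callback, predicate=None, duration=None):
        """Subscribe to events on name; duration=None means durable."""
        expires = None if duration is None else self._clock() + duration
        accept_all = lambda event: True
        self._subs.append((name, expires, predicate or accept_all, callback))

    def publish(self, name, event):
        """Deliver event to matching live subscriptions; return the count."""
        now = self._clock()
        # Drop expired (ephemeral) subscriptions before delivering.
        self._subs = [s for s in self._subs if s[1] is None or s[1] > now]
        delivered = 0
        for sub_name, _expires, predicate, callback in self._subs:
            if sub_name == name and predicate(event):
                callback(event)
                delivered += 1
        return delivered
```

An open browser page would map to a short-duration subscription on its url, while a bug or branch subscription would be a durable one with a filter.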
A discussions/microlist service
Similar to the subscription service, we also have many objects that can be commented on, and we want the same facilities on them all - spam management, searching and indexing, reporting (what has person X commented on); but at the moment we have to do schema changes for every new object we want a discussion facility around. It might even be possible (if we wanted to) to converge on one discussion facility, with mailman/mhonarc backending and optimising things.
Template/API service (internet facing)
This is probably the key service: the service our users (both browser and launchpadlib) talk to. Currently our least reliable tests are involved with actual service delivery - and probably always will be (the nature of the beast when we're driving browsers programmatically). We could look at a number of possible ways to split it up, but any changes made to this service are likely to be very visible. Our public API serves two masters: the web site (where it does template rendering into fragments for page updates) and launchpadlib, our programmatic interface for users to drive Launchpad. The launchpadlib API depends heavily on the WADL and lazr.restful zope stack - changing that (for any reason) is going to require considerable care as we have users on stable LTS releases of Ubuntu to cater to.
However, if we treat the templating and api engine as the entry-service rather than as part of the core data access service, we can dramatically simplify the testing story: a clean contract between template rendering/public api and model manipulation/optimisation/refactoring. If care is taken around how information disclosure is managed, this front end service could dispense with the entire zope security model, and with database access also removed, would have no *correctness* related thread-local information: we could use scatter-gather techniques to gather all the needed information for a page upfront concurrently rather than serially. For instance, bug page rendering would (in terms of data gathering) change from sum(time to get tasks, time to get messages, time to get questions, time to get attachments, time to render) into sum(max(time to get tasks, time to get messages, time to get questions, time to get attachments), time to render) because we can parallelise obtaining data but not rendering (at least today).
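The sum-versus-max argument can be demonstrated with a small sketch. The latencies below are invented stand-ins for real backend calls, and asyncio is used only as one convenient way to express scatter-gather; nothing here implies the front end would actually be built on it.

```python
import asyncio
import time

# Hypothetical backend latencies in seconds, standing in for service calls.
LATENCIES = {"tasks": 0.05, "messages": 0.10, "questions": 0.03, "attachments": 0.08}


async def fetch(name):
    """Pretend to call one backend service and return its data."""
    await asyncio.sleep(LATENCIES[name])
    return name, "<%s data>" % name


async def gather_serially():
    # One request at a time: total time is roughly sum(latencies).
    return dict([await fetch(name) for name in LATENCIES])


async def gather_concurrently():
    # Scatter all requests, then gather: roughly max(latencies).
    return dict(await asyncio.gather(*(fetch(name) for name in LATENCIES)))


def timed(coro_func):
    """Run a coroutine function and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = asyncio.run(coro_func())
    return result, time.perf_counter() - start


serial_result, serial_time = timed(gather_serially)
concurrent_result, concurrent_time = timed(gather_concurrently)
print("serial:     %.2fs" % serial_time)
print("concurrent: %.2fs" % concurrent_time)
```

With these made-up numbers the serial path takes about the sum (0.26s) and the concurrent path about the maximum (0.10s); rendering time is then added to either figure.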
One thing that would make this service easier to implement is to stop rendering templates in API calls (at all) - and instead generate those things client side if they are being served out in an API response.
At the moment all logs are written directly to local disk (?) but logging could be constructed as an internal network service instead. This would perhaps make disk space management easier (reportedly an IS problem), and might allow live analytics or alerting.
Supplementary to logs, some sites (e.g. Etsy) benefit from having a 'counter' service that records when events happen and can track their frequency.
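A counter service of this kind can be sketched as a sliding window over event timestamps. The class and method names are hypothetical, timestamps are passed in explicitly to keep the example deterministic, and events are assumed to be recorded in non-decreasing time order.

```python
from collections import defaultdict, deque


class EventCounter:
    """Records when named events happen and reports their recent rate."""

    def __init__(self):
        self._events = defaultdict(deque)  # event name -> timestamps

    def record(self, name, timestamp):
        # Assumes timestamps arrive in non-decreasing order per name.
        self._events[name].append(timestamp)

    def frequency(self, name, now, window=60.0):
        """Events per second for name over the last window seconds."""
        stamps = self._events[name]
        # Discard anything that has fallen out of the window.
        while stamps and stamps[0] <= now - window:
            stamps.popleft()
        return len(stamps) / window
```

An alerting rule could then be as simple as comparing `frequency("oops", now)` against a threshold.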
Audit trail / time line
Some users would appreciate being able to see a list of actions made on an object, or by a person, over time. At the moment this is done ad-hoc for some objects such as bugs.
This is somewhat connected to the concepts of a notification service and a logging service, with the difference being that logs are intended for internal consumption not for users, and notifications are sent as mail etc whereas this would record and retrieve them.