We want to serve PPAs via a virtual file system rather than materialising them into a persistent cache on germanium. See [[LEP/DisklessArchives|The LEP]] for the project constraints. The implementation approach presented here is (as usual) subject to evolution if surprises are encountered. A prototype of the highest-risk components has been done (and thrown away for good measure).

= Diskless Archives design and implementation =

`ppa.launchpad.net` will become a horizontally scaling collection of HTTP/HTTPS/SFTP/FTP nodes. Each node will have a txpkgupload instance, a squid instance with Launchpad-specific helpers that can serve both launchpadlibrarian.net and ppa/private-ppa.launchpad.net, and cron tasks for PPA log file analysis. The Launchpad script servers will take care of PPA archive publishing, publishing all the data into the librarian. Upload processing may be centralised, or execute on each machine. A new API daemon may be introduced, depending on the cheapest way to deliver two small internal APIs the project will need. If introduced, it will run on the existing Launchpad appserver nodes, with haproxy frontending it. The overall architecture is shown in [[http://people.canonical.com/~wgrant/launchpad/lp-disklessarchives-arch.png|this diagram]] (blue/white boxes are new components).

== Frontends ==

Squid will receive requests on HTTP or HTTPS and process them. Squid will:
 * identify the domain for the request
 * identify the realm for the request (the specific PPA in question) for private-ppa requests
 * perform credential validation
 * map the requested URL through to a Librarian file using an internal Launchpad API
 * stream the resulting file to the client, caching it on the node ''under the backend librarian URL''

This will:
 * share the cache contents with the librarian domain
 * handle index files gracefully, as a new index will instantly result in cache misses, because the mapping step will find a different librarian URL (we don't plan to cache the PPA mapping logic unless we need to, and if we do need to we will cache it using domain knowledge)

We will need squid 3.1 with the EXT_TAG patch backported. This is available from `lp:~lifeless/squid/3.1-ext-tag`: http://bazaar.launchpad.net/~lifeless/squid/3.1-ext-tag/revision/10456

=== Implementation ===

We need custom Squid [[http://wiki.squid-cache.org/Features/AddonHelpers|addon helpers]], and a small set of webservice API calls to Launchpad. Squid will operate in reverse proxy ("accelerator") mode, using `external_acl_type` and `url_rewrite_program` helpers to provide decoupled authentication/authorization and path resolution.

Initially, squid will run with no synchronisation between the instances: things will be cached on multiple nodes. If load testing or scaling gives reason to be concerned, we'll introduce some form of consistent hashing: either a non-caching squid, or haproxy as a front tier (on the same hardware) that routes to a single node (while it's up) for a given URL. That removes all duplication from the cache farm and provides more effective horizontal scaling, at the cost of routing requests within the cluster.

==== PPA requests ====

For private-ppa and ppa requests, all requests will go through a mapper to determine the PPA from the URL. This will be purely local, with no DB access, and its results are cached by squid on the URL of the request.
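
For illustration, a URL->PPA mapper speaking Squid's line-oriented external ACL helper protocol might look roughly like the sketch below. The PPA path layout, the use of `%URI` as the only format token, and the `tag=` reply keyword are assumptions made here; making the returned tag usable by the later helpers is what the EXT_TAG patch is assumed to provide.

{{{#!python
#!/usr/bin/python
# Illustrative URL->PPA mapper for Squid's external ACL helper protocol:
# read one request per line, answer OK (with a tag naming the PPA) or ERR.
# Assumes squid is configured to pass just the request URL (%URI) per line.
import re
import sys
import urlparse

# Assumed path layout: /<owner>/<archive>/ubuntu/...
PPA_PATH = re.compile(r'^/(?P<owner>[^/]+)/(?P<archive>[^/]+)/ubuntu(/|$)')


def main():
    for line in iter(sys.stdin.readline, ''):
        path = urlparse.urlparse(line.strip()).path
        match = PPA_PATH.match(path)
        if match:
            # Pure string work: no database access in this helper.
            sys.stdout.write('OK tag=~%s/%s\n' % (
                match.group('owner'), match.group('archive')))
        else:
            sys.stdout.write('ERR\n')
        sys.stdout.flush()


if __name__ == '__main__':
    main()
}}}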
All requests then go through a helper to check that the authentication details are correct for that PPA:
 * ppa.l.n requests check that the PPA is public, not private.
 * private-ppa.l.n requests check that the PPA is private and the authentication details are correct.

The results for both of these helpers are cached by squid, so we only do one backend lookup per cache timeout per PPA+credentials tuple. The common case will be one per session per user. Auth failures won't be cached at all, so attempting 'too early' will not prevent logins.

Then, for all requests, our `url_rewrite_program` handler will map the requested URL, via a Launchpad API, to the corresponding internal librarian URL (which will take the LFC id or possibly even the file path), and Squid will stream that file (or return a 404 if the file has been deleted or does not exist). There's a proof-of-concept [[http://paste.ubuntu.com/1036739/|Squid config]] and corresponding [[http://paste.ubuntu.com/1036741/|trivial helpers]].

==== launchpadlibrarian requests ====

The request will go through a url rewrite mapper that does an LFA->LFC lookup and determines the backend data to use. This will be in the same format as for the PPA requests (it will be the same helper in fact, but a different code path). For restricted.launchpadlibrarian.net requests, a separate helper will be used to validate the time-limited token on the request. That acl helper will have caching disabled.

==== Librarian ====

A new listening port will be added to the librarian, which serves data out for the new backend API call - e.g. it takes an LFC id only, or a path + file hash. This will be used by the new service as it comes up, and the two current download ports will be entirely removed once the new system takes over full operation of the launchpadlibrarian domain.

=== Caching (all sorts) ===

The responses delivered from Squid will need their cache control headers removed or replaced, to ensure that downstream HTTP proxies do not cache content in the archive URL hierarchy inappropriately. This is a problem we are well versed in for archive.ubuntu.com, and the same set of rules we use there should work well. However, the url rewrite mapper will also be taking the user-supplied URL and determining a Librarian URL from it. This mapping, if cached, would have the same net effect as poor HTTP caching, and so needs to be aware of our policies around these files:
 * '''Packages''' (both source and binary). These live in `$ppa/pool/`, they're the bulk of the data, their contents are never modified, and each can be cached indefinitely.
 * '''Indices''' (`Packages`, `Sources`, `Release`, `Release.gpg`, etc.). These live in paths containing `$ppa/dists/` and are modified every time the package list changes. Cache coherency within a suite is essential, as they're authenticated by a path of hashes up to `Release.gpg`. These should use must-revalidate, but can be cached.
 * '''Custom uploads''' (`debian-installer`, `dist-upgrader`, `i18n`). These also live in `dists/`, and except for `i18n` their contents are immutable. They're uploaded as tarballs that are stored in the librarian, and we will have to unpack them into separate librarian files. No single set of heuristics is known to work for custom uploads, but cache coherency of `i18n` may be of some import. Note that custom uploads aren't commonly used in PPAs, but we expect them to be important to some of Ubuntu Engineering's users (they are a currently supported feature).
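
To make those classes concrete, a path classifier along the following lines could drive the header rewriting. This is a sketch only; the specific `Cache-Control` values are placeholders, not agreed numbers.

{{{#!python
# Illustrative mapping from the three path classes above to downstream
# Cache-Control values. The concrete values are placeholders for this sketch.
import re

POOL = re.compile(r'/pool/')
DISTS = re.compile(r'/dists/')


def cache_policy(path):
    if POOL.search(path):
        # Packages: never modified once published, cache indefinitely.
        return 'public, max-age=31536000'
    if DISTS.search(path):
        # Indices and custom uploads: cacheable, but coherency matters, so
        # keep the lifetime short and require revalidation once stale.
        return 'public, max-age=300, must-revalidate'
    # Anything else: be conservative.
    return 'no-cache'


assert cache_policy(
    '~owner/ppa/pool/main/h/hello/hello_2.8-2_i386.deb'
    ) == 'public, max-age=31536000'
}}}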
=== Expected load ===

 * The new librarian port will include aggressive caching headers as the current librarian does: file hotspots should be handled very gracefully, though we'll need to make sure the lookup and mapping code is extremely reliable. If load is a problem we'll introduce consistent hashing, and if it's still a problem, scale the cluster horizontally.
 * All content will be cached - private PPAs, restricted librarian files, etc.
 * Squid will cache the results from the external_acl helpers - for instance, all requests for a single private PPA will use the cached credentials check, even though different URLs are in play (within a single squid instance). (This is the reason for a dedicated URL->PPA helper.)
 * Depending on how expensive URL->Librarian URL mapping is, the rewrite helper may have to do its own caching: squid won't. The initial implementation should not have a rewrite cache; instead we will load test the system by replaying existing request patterns through the new hardware, measuring latency and load. If necessary, a rewrite cache can be added: memcache from the rewriter would be the cache of choice, using the existing LP memcache cluster.

HTTP caching and horizontal scaling of frontends should resolve most issues with scaling to enormous volumes of data. What the new design doesn't handle well -- in fact, what it handles far worse than the existing static disk archives -- is enormous numbers of requests: where it was previously a filesystem path traversal, it's now a set of remote database queries. The caching points above are aimed at reducing the amount of redundant mapping work we do. It's also important to note that Launchpad's database slaves have ample spare capacity, so they should be able to handle a lot of lookup requests. The most commonly hit lookups will be cached, but the Soyuz publication schema wasn't designed for efficient path resolution, so it's likely that we'll need some denormalisation or at least some creative new queries to achieve excellent performance.

== Backend (API service) ==

The frontends will talk to the database via a webservice API of some kind. While ideally we'd use our existing lazr.restful or XML-RPC infrastructure, it's likely that the Zope stack is unacceptably slow for the request volume we expect and the performance we desire (100ms response time to users 99% of the time). An independent lightweight WSGI service which just exposes the relevant API methods, directly implemented as SQL without our slow infrastructure, is an effective option to achieve this, given the narrow schema and needs of the system.

 -- This contradicts "Access to a particular form of persistence (e.g. the librarians files-on-disk, or a particular postgresql schema) requires being in the same source tree as the definition of that persistence mechanism." on [[ArchitectureGuide/ServicesRequirements]]. I assume that's captured in the "ideally" bit? -- jml, [[<>]]

There are three calls to make:
 * `authorize('~owner/ppa', 'subscriber', 'password') -> true|false`
 * `assert_public('~owner/ppa') -> true|false`
 * `map_to_librarian('~owner/ppa/pool/main/h/hello/hello_2.8-2_i386.deb') -> 'http://lplibrarian.internal/1234/hello_2.8-2_i386.deb'`

We can run arbitrarily many instances, caching is not required, and it's read-only, so it can balance across the slave databases in a fault-tolerant fashion. It will want to check missing auth creds against the master DB to handle replication latency, but they will be rare (except in a DOS situation). Even then, such lookups will be extraordinarily cheap, and our regular concurrency capping in haproxy will protect the core infrastructure from meltdown.
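
As an illustration of how small such a service could be, here is a hedged sketch of a WSGI app exposing the three calls. The routes, parameter names, and stubbed lookups are assumptions; the real bodies would be direct SQL against the slave (and occasionally master) databases.

{{{#!python
# Hedged sketch of the lightweight WSGI service; names and routing are
# illustrative only.
import json
from urlparse import parse_qs
from wsgiref.simple_server import make_server


def authorize(ppa, subscriber, password):
    # Stub: would check the PPA's subscription credentials on a slave DB,
    # retrying against the master only when the credentials are missing.
    return False


def assert_public(ppa):
    # Stub: would check whether the archive is public.
    return True


def map_to_librarian(path):
    # Stub: would resolve the archive path to an internal librarian URL.
    return None


def application(environ, start_response):
    args = parse_qs(environ.get('QUERY_STRING', ''))

    def field(name):
        return args.get(name, [''])[0]

    route = environ.get('PATH_INFO', '')
    if route == '/authorize':
        result = authorize(field('ppa'), field('subscriber'), field('password'))
    elif route == '/assert_public':
        result = assert_public(field('ppa'))
    elif route == '/map_to_librarian':
        result = map_to_librarian(field('path'))
    else:
        start_response('404 Not Found', [('Content-Type', 'text/plain')])
        return ['unknown call']
    start_response('200 OK', [('Content-Type', 'application/json')])
    return [json.dumps(result)]


if __name__ == '__main__':
    make_server('localhost', 8080, application).serve_forever()
}}}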
== Backend (non-interactive) ==

The main backend components implicated in these changes are the Soyuz publisher jobs: `process-accepted`, `publish-distro`, and `process-death-row`. These scripts currently serve both PPAs and the Ubuntu primary archive. Any changes made to them to support the new PPA system need to:
 * Preserve the primary archive behaviour.
 * Preserve the behaviour on germanium until the project is complete and germanium is decommissioned.

One approach is to hybridise these scripts so they are controllable in a fine-grained manner: we can turn off the disk updates when we decommission germanium, and enable the PPA publishing in a new home (e.g. a Launchpad scripts server). Another approach is to produce a new (perhaps celery-based) variant of these scripts dedicated to the new publishing data, and have them run even if germanium's existing publisher has published content on disk. This would avoid adding load to germanium, and make the migration process require less ops coordination (as well as reducing latency for the new system immediately, as LP script startup time could be eliminated [this last point is irrelevant for process-death-row]). A third, recommended approach is to split the work in these scripts between on-disk logic and non-disk logic, and migrate the non-disk logic off of germanium early in the development process, reducing its current load and giving a single source of logic for the new schema as it comes of age; a rough sketch of this split appears at the end of this section. Some complexity is entailed by the primary archive using apt-ftparchive, but the special cases needed are fewer than those required to handle two different publication mechanisms working on the same data for PPAs.

 * `process-accepted` handles new binary and custom uploads, mostly touching the database. The only bit that touches the on-disk archive is custom upload publication, which will need to be altered to also copy the tarball contents into the Librarian and record the paths in new DB tables/fields.
 * `publish-distro` looks in the database for new source and binary publications, streams the files out of the librarian into the on-disk archive, marks them published, dominates active publications, and writes out indices to the filesystem. Diskless archives still need all the non-disk bits: the publications must be marked as published, they must be dominated, and the indices must be written to the librarian instead of local disk.
 * `process-death-row` looks in the database for publications that are unwanted and unused, and so can be removed from disk. While the removing-from-disk part obviously isn't relevant for diskless archives, the logic to set `dateremoved` to cause garbage collection must be retained.

There are a couple of other jobs of interest:
 * `generate-ppa-htaccess` becomes deletable, as we no longer use `htaccess` for authentication or authorization. The model gets simplified a little to remove such aspects as deactivation queueing.
 * `parse-ppa-apache-access-logs` can probably just be run on each frontend, but it may have minor concurrency problems:
  * A log file is considered parsed if its first line matches the list of parsed files in the system, but multiple apaches raise the possibility of duplicate first lines.
  * The script operates in a batch mode where N lines are processed, then committed, and the summary counts updated. The summaries could collide with other scripts, leading either to rollbacks or (assuming some nasty storm caching) even to wrong counts. Some solutions would be to coordinate the parsing across all frontends (e.g. a lower frequency and staggered runs), or to change to a near-realtime forwarding system (like logstash) with a centralised consumer. Changing the frequency is definitely easier (but doesn't address the 'is-it-parsed' check above), and this is a very low priority issue.

`expire-archive-files` is unaffected, since `process-death-row` or its replacement will still set `dateremoved`, just like now.
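
To illustrate the recommended split mentioned above, here is a hedged sketch of the kind of seam that could separate the disk and non-disk halves of `publish-distro`. All class and function names are invented for this illustration; the real publisher code is considerably more involved.

{{{#!python
# Illustrative only: a seam between "where published files end up" and the
# rest of publish-distro's logic. All names are invented for this sketch.
import os


class DiskPublishingTarget:
    """Write archive files into the on-disk tree (current germanium model)."""

    def __init__(self, archive_root):
        self.archive_root = archive_root

    def publish(self, relative_path, content):
        path = os.path.join(self.archive_root, relative_path)
        if not os.path.isdir(os.path.dirname(path)):
            os.makedirs(os.path.dirname(path))
        with open(path, 'wb') as target:
            target.write(content)


class LibrarianPublishingTarget:
    """Store archive files in the librarian and record the path mapping."""

    def __init__(self, upload_file, record_path):
        # upload_file and record_path are stand-ins for the real librarian
        # upload call and the new path->LFC database rows respectively.
        self.upload_file = upload_file
        self.record_path = record_path

    def publish(self, relative_path, content):
        alias_id = self.upload_file(relative_path, content)
        self.record_path(relative_path, alias_id)


def publish_indices(target, indices):
    """Write generated indices through whichever target this archive uses.

    Domination, marking publications published, and index generation are
    unchanged; only the target differs between germanium and diskless PPAs.
    """
    for relative_path, content in indices:
        target.publish(relative_path, content)
}}}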
== Uploads ==

PPA uploads are currently made over FTP or SFTP to `ppa.launchpad.net`, the same hostname used by public HTTP downloads. We would like to avoid breaking this, to avoid disrupting the ~4000 people that use it. This means we have to either run upload daemons on each frontend or forward the services to other hosts, at least until we can get updated versions of e.g. dput to all users.

In terms of concurrency, it's safe to run the (S)FTP daemon (`txpkgupload`) on each machine. It's less safe to run `process-upload` on a separate upload queue on each machine, but this is easily fixed (there's no archive lock, so two uploads can be accepted concurrently when consistency checks would reject them if they were processed serially).
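
One possible shape for that fix, as a hedged sketch only (the advisory-lock approach and keying it on the archive id are assumptions, not the agreed solution):

{{{#!python
# Illustrative per-archive serialisation for process-upload using a
# transaction-scoped PostgreSQL advisory lock (9.1+). Keying the lock on the
# archive id is an assumption made for this sketch.
from contextlib import contextmanager


@contextmanager
def archive_lock(cursor, archive_id):
    """Hold a per-archive advisory lock for the current transaction.

    Concurrent process-upload runs on different frontends block here until
    the first one's transaction finishes, so consistency checks effectively
    see the uploads for one archive serially.
    """
    cursor.execute("SELECT pg_advisory_xact_lock(%s)", (archive_id,))
    yield
    # The lock is released automatically when the transaction ends.


# Usage sketch:
#   with archive_lock(cursor, upload.archive.id):
#       accept_upload(upload)
#   connection.commit()
}}}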
== Availability ==

`ppa.launchpad.net`'s availability has traditionally depended only on `germanium`'s Apache and the network between there and the Internet. Diskless archives change all that. As shown in the earlier architectural diagram, we will now depend not just on the frontends but also on the full Launchpad librarian and database stacks, with all the service, machine, and network dependencies that they entail.

 * '''Frontends''': We can run multiple redundant instances without a problem. That's the point of the project. Round-robin DNS or LVS can deliver the service to clients.
 * '''API services''': Again, no problem with running multiple redundant instances.
 * '''Librarian''': The layers of redundant Squids should mean that requests for common files rarely hit the backend at all. If they do and the backend is down, those files will be unavailable until it returns. Upgrades are downtime-free and there is no scheduled downtime, but the librarian frontend is currently running on a single machine. We can't reasonably support active-active machine redundancy at this time, but we can probably arrange SAN-backed active-passive with failover fairly easily.
 * '''Database''': The Squid helper caching will keep credentials cached, preventing those lookups from hitting the API server and DB at all. Uncached lookups will fail in the event of an issue with the API service. We have two redundant slave databases, but access to all three databases is currently denied during fastdowntime (usually 70s a couple of times a week, but up to 5 minutes every day). Future iterations on fastdowntime may do a rolling upgrade that leaves at least one slave available at all times.

Fault-tolerance of all non-HTTP/HTTPS PPA services will be increased, as redundant instances can be brought up on multiple machines. PPA services should be able to move into the nodowntime set, eliminating another source of LP production variation.

== New Hostname ==

Idea from James Troup: run this stuff on a new hostname. software-center-agent starts handing out the new hostname to new subscriptions when we are happy to go live. Old clients still hit the current setup on ppa.launchpad.net. This makes testing easy, and we can roll back with a DNS change. Once it is proven bulletproof and all stakeholders are happy with the implications, {private-,}ppa.launchpad.net could be changed in DNS.

Concerns:
 * How does LP know what to return when asked for a sources.list entry?
  * Could regex it on the s-c-a side to do what we want.
  * s-c-a already supports multiple "archive roots" for packages, so adding another would be OK, or maybe it would be special handling for the existing PPA ones.
 * achuni thinks the client behaves differently depending on whether the PPA is public; we should confirm that and see if it would be affected. Are there implications on the LP side?