= Diskless Apt Archives =
Serve PPA files directly from the librarian, rather than from a single machine's multi-terabyte filesystem.
'''Contact:''' William Grant <<BR>>
'''On Launchpad:''' https://bugs.launchpad.net/launchpad-project/+bugs?field.tag=diskless-archives
== Rationale ==
This project aims to solve scaling and latency problems that degrade the experience of publishing and consuming paid software.
Such software is built into ''commercial'' PPAs, which, like all PPAs, are currently hosted on a single machine, `germanium`. Access to a specific piece of software is granted by the system writing a separate access control file in each archive. That write happens an arbitrary amount of time after the API call to grant access completes, so customers see delays and get a poor experience.
The current architecture imposes a very large footprint on any machine serving `ppa.launchpad.net`: it needs multiple TB of disk, and enough IO and CPU capacity to run all the PPA maintenance functions (uploading, publication, access control, log analysis). We are struggling with load today. A newer machine would defer that struggle, but without a scaling story it would be only a moderate amount of time before we faced the same problem again, this time without the option of fixing it by upgrading. Uploading is not (at present) a scaling problem for us, but it is bound to the same hostname, so we need to change how uploading is handled in order to scale `ppa.launchpad.net`.
The project will be successful if our sysadmins can easily and effectively add capacity to handle rapid and substantial increases in the number of PPAs and PPA users, and if Software Center users get their purchases immediately, without hassles introduced by Launchpad.
== Stakeholders ==
* Consumer Apps
* IS
== User stories ==
'''As a ''' Software Center customer<<BR>>
'''I want ''' my download to start immediately after purchase<<BR>>
'''so that ''' I can use my new application as soon as possible.<<BR>>

'''As a ''' Software Center customer<<BR>>
'''I want ''' my downloads to be quick<<BR>>
'''so that ''' I can use my new application as soon as possible.<<BR>>

'''As a ''' package uploader<<BR>>
'''I want ''' my archive to be updated quickly<<BR>>
'''so that ''' my users can use my new package as soon as possible.<<BR>>

'''As a ''' commercial application provider<<BR>>
'''I want ''' Launchpad PPAs to scale easily to cope with my app's downloads<<BR>>
'''so that ''' I can worry about more important things than distribution.<<BR>>

'''As a ''' Launchpad sysadmin<<BR>>
'''I want ''' to add new PPA download capacity easily and rapidly on modest hardware<<BR>>
'''so that ''' I can quickly respond to and mitigate high load situations.<<BR>>
== Constraints and Requirements ==
=== Must ===
* Allow `ppa.launchpad.net` HTTP(S) download frontends to scale to handle additional load without service disruption.
* Let private PPAs scale to millions of subscribers.
 * Let private PPAs scale to tens of thousands of additional archives.
* Permit people access to private PPAs immediately after activating their subscription.
* Commission and activate a new scalable `ppa.launchpad.net` node in less than one hour (after base OS install).
* Run scalable nodes on our stock hardware build without requiring special RAM or disk configuration.
* Retain interoperability with PPA access from all supported versions of Ubuntu (hardy and up at time of writing).
=== Nice to have ===
* Scale PPA archive publication by adding machines.
* Scale PPA uploading and upload processing by adding machines.
* Reduce PPA package publication delay.
* Decrease or eliminate downtime for `ppa.launchpad.net`. Ideally it becomes a regular nodowntime target.
=== Must not ===
* Break compatibility with existing apt `sources.list` entries, including private PPA credentials.
 * Interfere with other parts of Launchpad (e.g. PPA statistics, or Ubuntu main and universe remaining regular archives on disk).
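As an illustration of what the `sources.list` compatibility constraint covers, the following Python sketch splits a one-line apt source entry into the parts that must keep working unchanged, including credentials embedded in the URI. The owner, PPA, user name, password and series in the sample entries are hypothetical, and the sketch deliberately ignores the optional bracketed options field that newer apt versions accept.

```python
from urllib.parse import urlsplit


def parse_sources_line(line):
    """Split a one-line apt source entry ("deb URI suite components...")."""
    fields = line.split()
    if len(fields) < 3 or fields[0] not in ("deb", "deb-src"):
        raise ValueError("not a one-line apt source entry: %r" % line)
    entry_type, uri, suite = fields[:3]
    url = urlsplit(uri)
    return {
        "type": entry_type,
        "scheme": url.scheme,
        "host": url.hostname,
        "path": url.path,
        # Private PPA credentials travel inside the URI (user:password@host).
        "username": url.username,
        "password": url.password,
        "suite": suite,
        "components": fields[3:],
    }


# Hypothetical entries, used only to exercise the parser.
public = parse_sources_line(
    "deb http://ppa.launchpad.net/example-owner/example-ppa/ubuntu precise main"
)
private = parse_sources_line(
    "deb https://alice:s3cret@private-ppa.launchpad.net/"
    "example-owner/example-private/ubuntu precise main"
)
```

Whatever serves the archive after this project has to accept the same host, path and credential fields, because these entries already sit on users' machines and cannot be rewritten centrally.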
=== Undesirable ===
 * Doing a lockstep break of (S)FTP uploads to the existing overloaded `ppa.launchpad.net` hostname. There were 4000 distinct uploaders over all of 2011, so contacting them is doable if we need to. We have a long-term desire to move PPA hosting to its own domain, as the librarian has.
== Success ==
=== How will we know when we are done? ===
We can seamlessly increase capacity to handle additional PPA downloads, without downtime or other service disruption.
Users can download packages from private PPAs immediately after activating their subscription.
=== How will we measure how well we have done? ===
 * Software Center (SC) stops seeing user pain related to the performance (download rate, subscription activation latency) of `ppa.launchpad.net`, e.g. no more bug reports or questions.
 * Publication latency for all PPAs drops to <= 60 seconds 99% of the time.
* Upload latency remains constant or decreases.
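The publication latency target above can be checked mechanically. The sketch below is a minimal Python illustration, assuming latencies are available as a list of seconds measured from upload acceptance to publication. That measurement window, and the function and parameter names, are assumptions made for this example, not part of the proposal.

```python
def fraction_within(latencies_seconds, limit_seconds=60.0):
    """Return the fraction of samples at or below the latency limit."""
    if not latencies_seconds:
        raise ValueError("no latency samples to evaluate")
    within = sum(1 for latency in latencies_seconds if latency <= limit_seconds)
    return within / len(latencies_seconds)


def meets_publication_target(latencies_seconds, limit_seconds=60.0,
                             required_fraction=0.99):
    """True if at least required_fraction of samples are within the limit."""
    return fraction_within(latencies_seconds, limit_seconds) >= required_fraction


# Made-up samples: 99 fast publications and one slow one out of 100.
samples = [10.0] * 99 + [300.0]
target_met = meets_publication_target(samples)
```

Counting samples directly, rather than interpolating a percentile, keeps the check unambiguous for small sample sets: the target is met exactly when no more than 1% of publications take longer than 60 seconds.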
== Thoughts? ==
* See also [[DisklessArchives|design and implementation notes]]