= Build Farm Scalability =

This LEP describes the constraints on the build farm and defines an acceptable level of performance for it.

'''As a ''' package uploader<<BR>>
'''I want ''' my build to be dispatched as soon as a builder is free and I am next in the queue<<BR>>
'''so that ''' I don't wait unnecessarily for my build to be finished<<BR>>

'''As a ''' person who pays for new hardware<<BR>>
'''I want ''' to see builders always busy if there are jobs in the queue<<BR>>
'''so that''' I know I am getting the best performance I can for my money.<<BR>>

'''As a ''' buildd admin<<BR>>
'''I want ''' to add new build slaves without adversely affecting dispatch times to other builders<<BR>>
'''so that''' the build farm scales.<<BR>>


This LEP '''does not''' describe a new messaging system, etc.

== Rationale ==

This LEP is to guide development in the right direction such that we don't waste resources making changes that we don't really need.

It is being done now because the load on the build farm is increasing quite rapidly (due to rebuilds, etc.), and because the current manager does not scale and leaves slave resources wastefully idle for long periods.

It would bring value by increasing the throughput of jobs on the build farm, which would make PPA users, buildd admins and the purse-carriers happier.

== Stakeholders ==

 * PPA users
 * Ubuntu Team
 * IS (LaMont Jones)
 * Canonical Shareholders
 * Linaro Team

== Constraints and Requirements ==

=== Must ===

 * When a builder becomes free, we must dispatch a queued build to it within a maximum of 30 seconds. '''[DONE]'''
 * Misbehaving jobs must not affect the rest of the build farm '''[DONE]'''
 * Misbehaving builders must not affect the rest of the build farm '''[DONE]'''
 * When a build is ready on a builder, it must be collected<<FootNote(''collected'' is a term specific to the build farm, and refers to the explicit step performed by the build master to fetch the results from the builder)>> and passed on to the next stage within 30 seconds of reaching the ready state. '''[DONE]'''
 * Each new builder added must not degrade the overall responsiveness by more than half a second. '''[UNSURE]'''<<FootNote(We currently lack the tools to measure this)>>
 * Design for a system with 200 builders. '''[UNSURE]'''<<FootNote(We lack the testing environments to properly measure this)>>
 * Low-scored builds must not be starved when there are higher-scored builds in the queue (though low-scored builds may be performed at a lower rate). '''[DONE]'''<<FootNote(Although starvation is still possible in theory, the greatly improved throughput means that it is no longer a problem in practice)>>
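
The numeric limits in the list above can be written down as simple checks. The following is a minimal sketch only: the function names and the way latencies are sampled are assumptions made for illustration, not part of the build manager.

```python
# Limits taken from the "Must" list; everything else here is hypothetical.
DISPATCH_LIMIT_S = 30.0      # maximum dispatch (and collection) delay
PER_BUILDER_BUDGET_S = 0.5   # maximum added delay per new builder


def within_dispatch_limit(latencies_s, limit_s=DISPATCH_LIMIT_S):
    """Return True if every observed latency is within the limit."""
    return all(latency <= limit_s for latency in latencies_s)


def degradation_per_builder(latency_before_s, latency_after_s, builders_added):
    """Average extra latency, in seconds, attributable to each added builder."""
    if builders_added <= 0:
        raise ValueError("builders_added must be positive")
    return (latency_after_s - latency_before_s) / builders_added


def within_degradation_budget(latency_before_s, latency_after_s,
                              builders_added, budget_s=PER_BUILDER_BUDGET_S):
    """Return True if the per-builder degradation stays within the budget."""
    per_builder = degradation_per_builder(
        latency_before_s, latency_after_s, builders_added)
    return per_builder <= budget_s
```

Checks of this kind only become meaningful once the measurement tools mentioned in the footnotes exist; the sketch assumes latencies have already been collected in seconds.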

=== Nice to have ===

 * Even faster dispatch and collection, say under 10 seconds '''[DONE]'''
 * 10 millisecond response degradation for new builders '''[UNSURE]'''
 * The ability to dynamically alter the queue positions of jobs based on: '''[NOT DONE]'''
   * job type
   * archive (PPA)

== Success ==

=== How will we know when we are done? ===

'''Bugs for this feature:''' [[https://launchpad.net/launchpad-project/+bugs?field.tag=buildd-scalability]]

 * When we meet the '''Must''' criteria above, or get acceptably close to them. '''[DONE]'''
 * We can actually process "daily builds" daily. '''[DONE]'''

=== How will we measure how well we have done? ===

 * Graphing response times, build farm throughput and examining the build manager's log file.
   * Ultimately, the thing we're optimizing for is the time between "user requests build" and "user gets usable result of build". We should be graphing that, perhaps broken down by user type or build type, probably as some kind of distribution chart. -- jml
   * https://lpstats.canonical.com/graphs/BuildersActiveVirtual/
   * https://lpstats.canonical.com/graphs/BuildersActiveNonVirtual/
   * XXX: Still need graphs for measuring latency.
     * Something measuring total time from job request to job fully done with happy user (perhaps somehow adjusted for jobs taking different amounts of time to run)
     * Something measuring time from "builder available" to "job fully done" minus the time actually spent doing the work.
     * https://lpstats.canonical.com/graphs/BuilddLagPPASupportedArch/
     * https://lpstats.canonical.com/graphs/BuilddLagPrivatePPA/
     * https://lpstats.canonical.com/graphs/BuilddLagProductionSupportedArch/
     * https://lpstats.canonical.com/graphs/BuilddLagProductionUnsupportedArch/
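
The "distribution chart" idea above could start from a per-build-type summary of request-to-result times. This is a rough sketch under stated assumptions: the input tuple layout and the function name `turnaround_summary` are hypothetical, and the data would have to come from wherever build timestamps are actually recorded.

```python
from collections import defaultdict
from statistics import median, quantiles


def turnaround_summary(builds):
    """Summarise request-to-result times per build type.

    `builds` is an iterable of (build_type, requested_at_s, result_ready_at_s)
    tuples, with both times in seconds (a hypothetical layout).
    """
    by_type = defaultdict(list)
    for build_type, requested_at, result_ready_at in builds:
        by_type[build_type].append(result_ready_at - requested_at)

    summary = {}
    for build_type, durations in by_type.items():
        durations.sort()
        if len(durations) > 1:
            # Last of the nine decile cut points, i.e. the 90th percentile.
            p90 = quantiles(durations, n=10, method="inclusive")[-1]
        else:
            # quantiles() needs at least two data points.
            p90 = durations[0]
        summary[build_type] = {
            "count": len(durations),
            "median": median(durations),
            "p90": p90,
            "max": durations[-1],
        }
    return summary
```

The same grouping could be done by user type instead of build type; `statistics.quantiles` requires Python 3.8 or later.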

== Thoughts? ==

 * This LEP is essential to the completion of [[LEP/SourcePackageRecipeBuilds]] -- jml

 * A nice upshot of some of these metrics is that it will make the impact of ''removing'' builders more obvious -- jml
   * Perhaps we should put the graphs in Launchpad itself, rather than lpstats? -- jml

== Post-mortem ==

All of the key requirements for the system have been met. We still have some outstanding requirements that may warrant future work, in particular:

=== Scaling the system to large numbers of builders ===

We have not properly tested how system response degrades in the face of large numbers of builders.  We do know that the CPU utilization of the build manager seems to increase linearly with the number of builders, and that this is a problem.

=== Making starvation impossible ===

The system currently avoids queue starvation simply by processing queues faster than they normally accumulate.  Starving out a particular kind of build is still a theoretical possibility, and so we should expect it to happen at some point in our production environment.
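
To illustrate why throughput alone does not rule starvation out, here is a toy simulation of a queue that always dispatches the highest-scored job first. It is not the build manager's scheduler; the scores, the tick model and the function name are all assumptions made for the example.

```python
import heapq
from itertools import count

LOW_SCORE = 1      # hypothetical score for the low-priority job
HIGH_SCORE = 100   # hypothetical score for competing jobs


def simulate_strict_score_queue(ticks, high_arrivals_per_tick,
                                dispatches_per_tick):
    """Return the tick at which the single low-scored job is dispatched,
    or None if it is still queued after `ticks` ticks."""
    order = count()  # tie-breaker so equal scores dispatch in arrival order
    queue = []
    # heapq is a min-heap, so scores are negated to pop the highest first.
    heapq.heappush(queue, (-LOW_SCORE, next(order), "low"))
    for tick in range(ticks):
        for _ in range(high_arrivals_per_tick):
            heapq.heappush(queue, (-HIGH_SCORE, next(order), "high"))
        for _ in range(dispatches_per_tick):
            if not queue:
                break
            _, _, label = heapq.heappop(queue)
            if label == "low":
                return tick
    return None
```

When dispatch capacity exceeds the arrival rate of high-scored jobs, the low-scored job gets out quickly; when high-scored jobs arrive at least as fast as they are dispatched, it never does, which is the theoretical case described above.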

=== Queue management ===

Launchpad should have a user interface that lets build farm admins control the ordering of items in the queue.

=== More non-virtual builders ===

The [[https://lpstats.canonical.com/graphs/BuildersActiveVirtual/|virtual builders graph]] shows that we are making excellent use of our builders when the work-in-queue warrants it. The [[https://lpstats.canonical.com/graphs/BuildersActiveNonVirtual/|non-virtual builders graph]] shows that the build farm is unable to keep up with the work for non-virtual builders. Hence, we probably need more non-virtual builders.

=== More flexible architecture builds ===

The [[https://lpstats.canonical.com/graphs/BuildersActiveVirtual/|virtual builders graph]] shows that even in times of peak use, we are not using all of our machines.  That is because some of the machines being counted are of the wrong architecture for the builds that have been requested.  It is conceivable that we could alter the builders so that their virtual machines ran with the architecture that was appropriate for the builds in the queue.
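
As a sketch of what demand-driven architecture selection might look like, the function below hands each idle virtual builder to whichever architecture has the most unserved queued builds. The function name, the input shapes and the greedy policy are all assumptions; it also ignores real constraints such as which architectures a given host can actually run.

```python
from collections import Counter


def assign_architectures(idle_builders, queued_by_arch):
    """Map idle builder names to architectures, greedily by queue demand.

    `idle_builders` is a list of builder names and `queued_by_arch` maps an
    architecture name to its number of queued builds (hypothetical inputs).
    """
    remaining = Counter(queued_by_arch)
    assignment = {}
    for builder in idle_builders:
        candidates = [(n, arch) for arch, n in remaining.items() if n > 0]
        if not candidates:
            break  # nothing left in the queue; leave the rest idle
        # Most demand wins; ties are broken by architecture name so the
        # result is deterministic.
        _, arch = max(candidates)
        assignment[builder] = arch
        remaining[arch] -= 1
    return assignment
```

For example, with three idle builders and a queue of two i386 builds and one amd64 build, two builders would be assigned to i386 and one to amd64.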

=== Measure overall system latency ===

We measure throughput, but we don't yet measure the overall system latency.  We need graphs that show the time from upload to queueing, time in queue, time until build starts, time from build complete to collection, time from collection until successful publishing, etc.
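
The stages listed above could be derived from per-build event timestamps along the following lines. This is only a sketch: the event names (`uploaded`, `queued`, `dispatched`, and so on) and the mapping of each stage to a pair of events are assumptions, not fields that are known to exist in Launchpad.

```python
# (stage name, start event, end event); all names are hypothetical.
STAGES = [
    ("upload_to_queueing", "uploaded", "queued"),
    ("time_in_queue", "queued", "dispatched"),
    ("time_until_build_starts", "dispatched", "build_started"),
    ("build_complete_to_collection", "build_completed", "collected"),
    ("collection_to_publishing", "collected", "published"),
]


def stage_durations(events):
    """Return seconds spent in each stage for one build.

    `events` maps an event name to a timestamp in seconds. Stages whose
    start or end event is missing are left out rather than guessed.
    """
    durations = {}
    for stage, start, end in STAGES:
        if start in events and end in events:
            durations[stage] = events[end] - events[start]
    return durations
```

Graphing each stage separately would show where a build spends its time outside of the build itself, which throughput graphs alone cannot show.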

== Footnotes ==