Build Farm Scalability

This LEP describes the constraints on the build farm and what counts as an acceptable level of performance for it.

As a package uploader
I want my build to be dispatched as soon as a builder is free and I am next in the queue
so that I don't wait unnecessarily for my build to be finished

As a person who pays for new hardware
I want to see builders always busy if there are jobs in the queue
so that I know I am getting the best performance I can for my money.

As a buildd admin
I want to add new build slaves without adversely affecting dispatch times to other builders
so that the build farm scales.

This LEP does not describe a new messaging system or anything similar.

Rationale

This LEP is meant to guide development in the right direction, so that we don't waste resources on changes that we don't really need.

It is being done now because the load on the build farm is increasing quite rapidly (due to rebuilds, among other things), and the current manager does not scale: it leaves slave resources wastefully idle for long periods.

It would bring value by increasing the throughput of jobs on the build farm, which would make PPA users, buildd admins and the purse-carriers happier.

Stakeholders

PPA users
Ubuntu Team
IS (LaMont Jones)
Canonical Shareholders
Linaro Team

Constraints and Requirements

Must

When a builder becomes free, we must dispatch a queued build to it within a maximum of 30 seconds. [DONE]
Misbehaving jobs must not affect the rest of the build farm [DONE]
Misbehaving builders must not affect the rest of the build farm [DONE]
When a build is ready on a builder, it must be collected¹ and passed on to the next stage within 30 seconds of reaching the ready state. [DONE]
When adding new builders, each builder must not degrade the overall responsiveness by more than half a second per builder. [UNSURE]²
Design for a system with 200 builders. [UNSURE]³
Low-scored builds must not be starved when there are higher-scored builds in the queue (though low-scored builds may be performed at a lower rate). [DONE]⁴

Nice to have

Even faster dispatch and collection, say under 10 seconds [DONE]
At most 10 milliseconds of response degradation per new builder [UNSURE]
The ability to dynamically alter the queue positions of jobs based on: [NOT DONE]
- job type
- archive (PPA)
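
The per-builder degradation limits and the 200-builder design target interact with the 30-second dispatch requirement. The following back-of-the-envelope check is only a sketch: it assumes that each builder's degradation adds up linearly, which this LEP does not state, and the function and constant names are made up for illustration.

```python
# Assumption (not from the LEP): every builder adds a fixed delay,
# so the total added delay is builders * per-builder delay.

def added_delay_seconds(builders: int, per_builder_seconds: float) -> float:
    """Worst-case extra delay if every builder uses its full allowance."""
    return builders * per_builder_seconds


DESIGN_BUILDERS = 200    # "Design for a system with 200 builders"
MUST_LIMIT = 0.5         # seconds per builder ("Must")
NICE_LIMIT = 0.010       # seconds per builder ("Nice to have")
DISPATCH_BUDGET = 30.0   # seconds, maximum dispatch time ("Must")

must_total = added_delay_seconds(DESIGN_BUILDERS, MUST_LIMIT)
nice_total = added_delay_seconds(DESIGN_BUILDERS, NICE_LIMIT)

# Under the linear assumption, the "Must" allowance alone would exceed
# the dispatch budget at 200 builders; the "Nice to have" one would not.
print(must_total, must_total <= DISPATCH_BUDGET)
print(nice_total, nice_total <= DISPATCH_BUDGET)
```

If degradation really were additive, half a second per builder would only be compatible with the 30-second dispatch limit for up to 60 builders, which is one reason the 10 millisecond figure is attractive.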

Success

How will we know when we are done?

Bugs for this feature: https://launchpad.net/launchpad-project/+bugs?field.tag=buildd-scalability

When we meet the *Must* criteria above, or get acceptably close to them. [DONE]
We can actually process "daily builds" daily. [DONE]

How will we measure how well we have done?

Graphing response times, build farm throughput and examining the build manager's log file.
- Ultimately, the thing we're optimizing for is the time between "user requests build" and "user gets usable result of build". We should be graphing that, perhaps broken down by user type or build type, probably as some kind of distribution chart. -- jml
- https://lpstats.canonical.com/graphs/BuildersActiveVirtual/
- https://lpstats.canonical.com/graphs/BuildersActiveNonVirtual/
- XXX: Still need graphs for measuring latency.
  - Something measuring total time from job request to job fully done with happy user (perhaps somehow adjusted for jobs taking different amounts of time to run)
  - Something measuring time from "builder available" to "job fully done" minus the time actually spent doing the work.
  - https://lpstats.canonical.com/graphs/BuilddLagPPASupportedArch/
  - https://lpstats.canonical.com/graphs/BuilddLagPrivatePPA/
  - https://lpstats.canonical.com/graphs/BuilddLagProductionSupportedArch/
  - https://lpstats.canonical.com/graphs/BuilddLagProductionUnsupportedArch/
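
The two latency measurements proposed above can be written down as simple timestamp arithmetic. This is a minimal sketch, not a description of how Launchpad records these events: the `JobTimes` type, its field names and the sample values are all hypothetical.

```python
from datetime import datetime, timedelta
from typing import NamedTuple


class JobTimes(NamedTuple):
    """Timestamps for one build job. Field names are hypothetical."""
    requested: datetime          # user requests the build
    builder_available: datetime  # a suitable builder becomes free
    work_started: datetime       # the builder starts the actual build
    work_finished: datetime      # the builder finishes the actual build
    fully_done: datetime         # the user has a usable result


def total_latency(job: JobTimes) -> timedelta:
    """Time from job request to job fully done."""
    return job.fully_done - job.requested


def farm_overhead(job: JobTimes) -> timedelta:
    """Time from "builder available" to "job fully done",
    minus the time actually spent doing the work."""
    elapsed = job.fully_done - job.builder_available
    working = job.work_finished - job.work_started
    return elapsed - working


# Made-up sample values, purely to exercise the functions.
t0 = datetime(2000, 1, 1, 12, 0, 0)
job = JobTimes(
    requested=t0,
    builder_available=t0 + timedelta(seconds=60),
    work_started=t0 + timedelta(seconds=75),
    work_finished=t0 + timedelta(seconds=675),
    fully_done=t0 + timedelta(seconds=700),
)
print(total_latency(job).total_seconds())   # 700.0
print(farm_overhead(job).total_seconds())   # 40.0
```

Subtracting the working time is what makes the second figure comparable across jobs that take different amounts of time to run.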

Thoughts?

This LEP is essential to the completion of LEP/SourcePackageRecipeBuilds -- jml
A nice upshot of some of these metrics is that it will make the impact of removing builders more obvious -- jml
- Perhaps we should put the graphs in Launchpad itself, rather than lpstats? -- jml

Post-mortem

All of the key requirements for the system have been met. Some outstanding requirements may still warrant future work, in particular the following.

Scaling the system to large numbers of builders

We have not properly tested how system response degrades in the face of large numbers of builders. We do know that the CPU utilization of the build manager seems to increase linearly with the number of builders, and that this is a problem.

Making starvation impossible

The system currently avoids queue starvation simply by processing queues faster than they normally accumulate. Starving out a particular kind of build is still a theoretical possibility, and thus will certainly happen at some point later in our production environment.
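
One common way to rule out starvation is priority aging, where a job's effective score grows the longer it waits. The sketch below illustrates that general technique only; it is not how the build farm scores jobs, and the function names, the tuple layout and the `boost_per_hour` value are assumptions made for the example.

```python
from typing import List, Tuple

# (name, base_score, enqueued_at_seconds); a hypothetical job record.
Job = Tuple[str, int, float]


def effective_score(base_score: int, waited_seconds: float,
                    boost_per_hour: float = 10.0) -> float:
    """Base score plus a bonus that grows linearly with waiting time."""
    return base_score + boost_per_hour * waited_seconds / 3600.0


def pick_next(queue: List[Job], now: float) -> Job:
    """Return the queued job with the highest effective score."""
    return max(queue, key=lambda job: effective_score(job[1], now - job[2]))


# A low-scored job queued at t=0 and a high-scored job queued an hour later:
# the high-scored job still wins at first.
early = [("low", 100, 0.0), ("high", 1000, 3600.0)]
print(pick_next(early, now=3600.0)[0])        # high

# After 100 hours of waiting the low-scored job outranks a fresh
# high-scored one, so it cannot be postponed forever.
late = [("low", 100, 0.0), ("high", 1000, 100 * 3600.0)]
print(pick_next(late, now=100 * 3600.0)[0])   # low
```

With any positive boost, every job's effective score eventually exceeds that of newly queued work, which turns "unlikely in practice" into a hard bound on waiting time.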

Queue management

Launchpad should have a user interface that lets build farm admins control the ordering of items in the queue.

More non-virtual builders

The virtual builders graph shows that we are making excellent use of our builders when the work-in-queue warrants it. The non-virtual builders graph shows that the build farm is unable to keep up with the work for non-virtual builders. Hence, we probably need more non-virtual builders.

More flexible architecture builds

The virtual builders graph shows that even in times of peak use, we are not using all of our machines. That is because some of the machines being counted are of the wrong architecture for the builds that have been requested. It is conceivable that we could alter the builders so that their virtual machines run with the architecture appropriate for the builds in the queue.

Measure overall system latency

We measure throughput, but we don't yet measure the overall system latency. We must have graphs that show time from upload to queueing, time in queue, time until build starts, time from build completion to collection, time from collection until successful publishing, etc.
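
As a rough illustration of the breakdown such graphs would need, the following sketch turns one job's event timestamps into per-stage durations. The stage names, their order and the sample numbers are assumptions made for the example, not fields that Launchpad is known to record.

```python
from typing import Dict, List

# Hypothetical event names, in the order a job passes through them.
STAGES: List[str] = [
    "uploaded",
    "queued",
    "build_started",
    "build_completed",
    "collected",
    "published",
]


def stage_durations(timestamps: Dict[str, float]) -> Dict[str, float]:
    """Seconds spent between each pair of consecutive events."""
    return {
        f"{start} -> {end}": timestamps[end] - timestamps[start]
        for start, end in zip(STAGES, STAGES[1:])
    }


# Made-up timestamps (seconds since upload) for a single job.
sample = {
    "uploaded": 0.0,
    "queued": 5.0,
    "build_started": 65.0,
    "build_completed": 665.0,
    "collected": 680.0,
    "published": 980.0,
}
for stage, seconds in stage_durations(sample).items():
    print(f"{stage}: {seconds:.0f}s")
```

Summing the stage durations gives the end-to-end latency, so a graph per stage shows directly where that time goes.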

Footnotes

¹ "Collected" is a term specific to the build farm; it refers to the explicit step performed by the build master to fetch the results from the builder.
² We currently lack the tools to measure this.
³ We lack the testing environments to properly measure this.
⁴ Although starvation is still possible in theory, the greatly improved throughput means that it is no longer a problem in practice.
