LEP/BuildFarmScalability

Build Farm Scalability

This LEP describes the constraints on the build farm and defines what counts as an acceptable level of performance for it.

As a package uploader
I want my build to be dispatched as soon as a builder is free and I am next in the queue
so that I don't wait unnecessarily for my build to be finished

As a person who pays for new hardware
I want to see builders always busy if there are jobs in the queue
so that I know I am getting the best performance I can for my money.

As a buildd admin
I want to add new build slaves without adversely affecting dispatch times to other builders
so that the build farm scales.

This LEP does not describe a new messaging system or similar infrastructure changes.

Rationale

This LEP exists to guide development in the right direction, so that we don't waste resources on changes that we don't really need.

The work is being done now because the load on the build farm is increasing rapidly (due to rebuilds, among other things), and the current manager does not scale: it leaves slave resources idle for long periods.

It would increase the throughput of jobs on the build farm, which would make PPA users, buildd admins and the purse-carriers happier.

Stakeholders

Constraints and Requirements

Must

Nice to have

Success

How will we know when we are done?

Bugs for this feature: https://launchpad.net/launchpad-project/+bugs?field.tag=buildd-scalability

How will we measure how well we have done?

Thoughts?

Post-mortem

All of the key requirements for the system have been met. Some outstanding requirements remain that may warrant future work, in particular:

Scaling the system to large numbers of builders

We have not properly tested how system response degrades in the face of large numbers of builders. We do know that the CPU utilization of the build manager seems to increase linearly with the number of builders, and that this is a problem.

Making starvation impossible

The system currently avoids queue starvation simply by processing queues faster than they normally accumulate. Starving out a particular kind of build is still a theoretical possibility, and will therefore certainly happen at some point in our production environment.
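
One common way to rule out starvation, rather than relying on throughput, is to let a job's priority grow with the time it has waited (often called "aging"). The sketch below is only an illustration of that general technique, not a description of how the build manager works. The names `effective_score`, `pick_next` and `boost_per_hour`, and the tuple layout of a job, are all hypothetical.

```python
def effective_score(base_score, waited_seconds, boost_per_hour=10):
    """Priority that grows with waiting time (hypothetical scoring rule)."""
    return base_score + boost_per_hour * (waited_seconds / 3600)


def pick_next(jobs, now):
    """Pick the job with the highest effective score.

    jobs: list of (name, base_score, enqueued_at) tuples, where enqueued_at
    and now are timestamps in seconds. This layout is an assumption made
    for illustration only.
    """
    return max(jobs, key=lambda job: effective_score(job[1], now - job[2]))
```

With this rule, a low-scored job that has waited long enough eventually outranks any newly arrived high-scored job, so no kind of build can be passed over indefinitely. The cost is that strict priority ordering is no longer guaranteed.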

Queue management

Launchpad should have a user interface that lets build farm admins control the ordering of items in the queue.

More non-virtual builders

The virtual builders graph shows that we are making excellent use of our builders when the work-in-queue warrants it. The non-virtual builders graph shows that the build farm is unable to keep up with the work for non-virtual builders. Hence, we probably need more non-virtual builders.

More flexible architecture builds

The virtual builders graph shows that even in times of peak use, we are not using all of our machines. That is because some of the machines being counted are of the wrong architecture for the builds that have been requested. It is conceivable that we could alter the builders so that their virtual machines run whichever architecture is appropriate for the builds in the queue.
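
To make the potential gain concrete, the following minimal sketch compares how many idle builders can take work when each builder is tied to one architecture with how many could take work if any builder could switch. The function name `usable_idle_builders` and its inputs are assumptions for illustration, and the architecture names in the usage note are only examples.

```python
from collections import Counter


def usable_idle_builders(queued_archs, idle_builder_archs):
    """Return (usable_fixed, usable_flexible) counts of idle builders.

    queued_archs: architecture of each queued build.
    idle_builder_archs: architecture of each idle builder.
    """
    queued = Counter(queued_archs)
    idle = Counter(idle_builder_archs)
    # Fixed architectures: an idle builder only helps if a build of its
    # own architecture is waiting.
    usable_fixed = sum(min(idle[arch], queued[arch]) for arch in idle)
    # Flexible architectures: any idle builder can take any queued build.
    usable_flexible = min(sum(idle.values()), sum(queued.values()))
    return usable_fixed, usable_flexible
```

For example, with five i386 builds and one amd64 build queued, and three amd64 plus two i386 builders idle, only three builders can be used under fixed architectures, while all five could be used if they were able to switch.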

Measure overall system latency

We measure throughput, but we don't yet measure the overall system latency. We need graphs that show the time from upload to queueing, the time in queue, the time until the build starts, the time from build completion to collection, the time from collection until successful publishing, and so on.
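
As a rough sketch of what the data behind such graphs could look like, the code below derives per-stage latencies from a set of timestamps recorded for one build. The event names (`uploaded`, `queued`, `dispatched`, `started`, `completed`, `collected`, `published`) and the stage boundaries are assumptions. In particular, "time in queue" is taken here to end at dispatch, and "time until build starts" to run from dispatch to start.

```python
from datetime import datetime, timedelta

# Hypothetical mapping of stage name -> (start event, end event).
STAGES = {
    "upload_to_queue": ("uploaded", "queued"),
    "in_queue": ("queued", "dispatched"),
    "dispatch_to_start": ("dispatched", "started"),
    "complete_to_collect": ("completed", "collected"),
    "collect_to_publish": ("collected", "published"),
}


def stage_latencies(events):
    """Return {stage: timedelta} for every stage with both timestamps known.

    events: dict mapping event name to a datetime.
    """
    latencies = {}
    for stage, (begin, end) in STAGES.items():
        if begin in events and end in events:
            latencies[stage] = events[end] - events[begin]
    return latencies


def overall_latency(events):
    """Time from upload to publishing, or None if either is missing."""
    if "uploaded" in events and "published" in events:
        return events["published"] - events["uploaded"]
    return None
```

Stages with a missing timestamp are skipped rather than guessed, so a build that is still in progress yields only the latencies that are already known.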

Footnotes

  1. "Collected" is a term specific to the build farm, and refers to the explicit step performed by the build master to fetch the results from the builder.

  2. We currently lack the tools to measure this.

  3. We lack the testing environments to properly measure this.

  4. Although starvation is still possible in theory, the greatly improved throughput means that it is no longer a problem in practice.

LEP/BuildFarmScalability (last edited 2010-12-10 12:33:53 by jml)