Build Farm Scalability

This LEP describes the constraints on the build farm and what counts as an acceptable level of performance for it.

As a package uploader
I want my build to be dispatched as soon as a builder is free and I am next in the queue
so that I don't wait unnecessarily for my build to be finished

As a person who pays for new hardware
I want to see builders always busy if there are jobs in the queue
so that I know I am getting the best performance I can for my money.

As a buildd admin
I want to add new build slaves without adversely affecting dispatch times to other builders
so that the build farm scales.

This LEP does not describe a new messaging system or the like.

Rationale

This LEP is meant to guide development in the right direction so that we don't waste resources on changes we don't really need.

It is being done now because the load on the build farm is increasing quite rapidly due to rebuilds etc. The current manager does not scale and leaves slave resources wastefully idle for long periods.

It would increase the throughput of jobs on the build farm, which would make PPA users, buildd admins and the purse-carriers happier.

Stakeholders

PPA users
Ubuntu Team
IS (LaMont Jones)
Canonical Shareholders
Linaro Team

Constraints and Requirements

Must

When a builder becomes free, we must dispatch a queued build to it within 30 seconds.
- 30 seconds maximum? -- jml
It must be robust to failures. That is, failures dealing with one builder should not affect any other builder.
- Perhaps also something about failing fast, cleanly & noticeably. -- jml
When a build is ready on a builder, it must be collected within 30 seconds of reaching the ready state.
- What does this mean? "Collected" is a generic verb. I'm guessing this means -- jml
When adding new builders, each builder must not degrade the overall responsiveness by more than half a second per builder.
- Perhaps rather than 0.5s, say that we could double the number of builders without noticeably reducing the responsiveness of the system? -- jml
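
The numeric limits above can be written down as simple checks. The following is only a sketch under stated assumptions: the record layout, the function names and the idea of feeding it timestamps taken from the build manager's log are all hypothetical, and only the 30 second and 0.5 second figures come from this LEP.

```python
from dataclasses import dataclass

# Limits taken from the "Must" list in this LEP.
DISPATCH_LIMIT_S = 30.0       # free builder -> build dispatched
COLLECT_LIMIT_S = 30.0        # build ready -> build collected
PER_BUILDER_BUDGET_S = 0.5    # allowed slowdown per added builder


@dataclass
class BuilderEvent:
    """Hypothetical record; timestamps are seconds on a shared clock."""
    builder: str
    started_waiting: float    # builder became free, or build became ready
    handled: float            # build dispatched, or build collected


def violations(events, limit):
    """Return the names of builders whose wait exceeded the limit."""
    return [e.builder for e in events
            if e.handled - e.started_waiting > limit]


def scaling_within_budget(latency_before, latency_after, builders_added,
                          budget=PER_BUILDER_BUDGET_S):
    """True if adding builders slowed responsiveness by no more than
    `budget` seconds per builder added."""
    return latency_after - latency_before <= budget * builders_added
```

For example, `violations(dispatch_events, DISPATCH_LIMIT_S)` would list the builders that waited too long for a job, and the same function with `COLLECT_LIMIT_S` covers collection.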

XXX How realistic are these numbers? I pulled numbers out of the air because it's better than the current delay of 20-30 minutes. -- Julian

Until you have daemons & messaging, you're always going to be limited by cron's 1m rule. -- jml

Nice to have

Even faster dispatch and collection, say sub-10-second
At most 10 milliseconds of response degradation per new builder
The ability to dynamically alter the queue positions of jobs based on:
- job type
- archive (PPA)
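
One possible shape for such a re-orderable queue is sketched below. It is an illustration, not the build manager's actual design: the `BuildQueue` class, the job type and archive names, and the weights are all made up for the example.

```python
import heapq
import itertools

# Hypothetical weights; a lower total score is dispatched first.
JOB_TYPE_WEIGHT = {"security": 0, "package": 10, "recipe": 20}
ARCHIVE_WEIGHT = {"primary": 0, "ppa": 5}
DEFAULT_WEIGHT = 50


class BuildQueue:
    """Priority queue whose ordering can be recomputed on demand."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # keeps FIFO order among equals

    @staticmethod
    def score(job_type, archive):
        return (JOB_TYPE_WEIGHT.get(job_type, DEFAULT_WEIGHT)
                + ARCHIVE_WEIGHT.get(archive, DEFAULT_WEIGHT))

    def add(self, name, job_type, archive):
        entry = (self.score(job_type, archive), next(self._counter),
                 name, job_type, archive)
        heapq.heappush(self._heap, entry)

    def rescore(self):
        """Recompute every job's position after the weights change."""
        self._heap = [(self.score(job_type, archive), seq, name,
                       job_type, archive)
                      for _, seq, name, job_type, archive in self._heap]
        heapq.heapify(self._heap)

    def pop(self):
        """Return the name of the next job to dispatch."""
        return heapq.heappop(self._heap)[2]
```

Changing an entry in `JOB_TYPE_WEIGHT` or `ARCHIVE_WEIGHT` and then calling `rescore()` moves the affected jobs without disturbing the arrival order of jobs that tie.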

Must not

Subfeatures

None (yet).

Workflows

Success

How will we know when we are done?

When we meet the *Must* criteria above, or get acceptably close to them.
When we can actually process "daily builds" daily.

How will we measure how well we have done?

By graphing response times and build farm throughput, and by examining the build manager's log file.
- Ultimately, the thing we're optimizing for is the time between "user requests build" and "user gets usable result of build". We should be graphing that, perhaps broken down by user type or build type, probably as some kind of distribution chart. -- jml
- Since one motivating use-case is keeping hardware busy, we should also have a graph of the number of machines actively building and the number of jobs in the queue. https://lpstats.canonical.com/graphs/CodeImports/ is such a graph.
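
As a rough sketch of the numbers behind such graphs, the two suggestions above could be computed as follows. The input tuples and function names are assumptions made for illustration; nothing here reflects how lpstats or the build manager actually records data.

```python
from collections import defaultdict
from statistics import median


def turnaround_by_type(builds):
    """Summarise request-to-result time per build type.

    builds: iterable of (build_type, requested_at, finished_at),
    with timestamps in seconds (hypothetical input format).
    """
    grouped = defaultdict(list)
    for build_type, requested_at, finished_at in builds:
        grouped[build_type].append(finished_at - requested_at)
    return {
        build_type: {
            "count": len(times),
            "median": median(times),
            "max": max(times),
        }
        for build_type, times in grouped.items()
    }


def idle_while_queued(samples):
    """Fraction of samples in which a builder sat idle with jobs queued.

    samples: iterable of (busy_builders, total_builders, queued_jobs).
    """
    samples = list(samples)
    if not samples:
        return 0.0
    wasted = sum(1 for busy, total, queued in samples
                 if busy < total and queued > 0)
    return wasted / len(samples)
```

The first function feeds a distribution chart per build type; the second gives a single figure for the "builders always busy if there are jobs in the queue" goal, where 0.0 is the ideal.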

Thoughts?

This LEP is essential to the completion of LEP/DailyBuilds -- jml
A nice upshot of some of these metrics is that it will make the impact of removing builders more obvious -- jml
- Perhaps we should put the graphs in Launchpad itself, rather than lpstats? -- jml
