Diff for "LEP/BuildFarmScalability"

Differences between revisions 2 and 3
Revision 2 as of 2010-08-17 15:31:43
Size: 2797
Comment:
Revision 3 as of 2010-08-18 10:42:03
Size: 4028
Editor: jml
Comment: Bunch of comments

Build Farm Scalability

This LEP describes the constraints on the build farm and what counts as an acceptable level of performance for it.

As a package uploader
I want my build to be dispatched as soon as a builder is free and I am next in the queue
so that I don't wait unnecessarily for my build to be finished

As a person who pays for new hardware
I want to see builders always busy if there are jobs in the queue
so that I know I am getting the best performance I can for my money.

As a buildd admin
I want to add new build slaves without adversely affecting dispatch times to other builders
so that the build farm scales.

This LEP does not describe a new messaging system, etc.

Rationale

This LEP is meant to guide development in the right direction, so that we don't waste resources on changes that we don't really need.

It is being done now because the load on the build farm is increasing quite rapidly (due to rebuilds, etc.), and the current manager does not scale: it leaves slave resources wastefully idle for long periods.

It would increase the throughput of jobs on the build farm, which would make PPA users, buildd admins and the purse-carriers happier.

Stakeholders

  • PPA users
  • Ubuntu Team
  • IS (LaMont Jones)

  • Canonical Shareholders
  • Linaro Team

Constraints and Requirements

Must

  • When a builder becomes free, we must dispatch a queued build to it within 30 seconds.
    • 30 seconds maximum? -- jml
  • It must be robust to failures. That is, failures dealing with one builder should not affect any other builder.
    • Perhaps also something about failing fast, cleanly & noticeably. -- jml

  • When a build is ready on a builder, it must be collected within 30 seconds of reaching the ready state.
    • What does this mean? "Collected" is a generic verb. I'm guessing this means -- jml
  • When adding new builders, each builder must not degrade the overall responsiveness by more than half a second per builder.
    • Perhaps rather than 0.5s, say that we could double the number of builders without noticeably reducing the responsiveness of the system? -- jml

XXX How realistic are these numbers? I pulled numbers out of the air because it's better than the current delay of 20-30 minutes. -- Julian

  • Until you have daemons & messaging, you're always going to be limited by cron's 1m rule. -- jml
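
As a rough way to sanity-check the numbers above, here is a minimal Python sketch of how the polling interval and the per-builder cost interact. It assumes a simplified model that is not stated in this LEP: the manager scans builders one after another once per polling interval, and a builder that becomes free just after being scanned waits one full interval plus one full scan. The function names are hypothetical.

```python
def worst_case_dispatch_delay(poll_interval_s, n_builders, per_builder_cost_s):
    """Worst-case wait, in seconds, before a free builder gets a job.

    Assumed model: one sequential scan per polling interval, and each
    builder adds per_builder_cost_s to the length of a scan.
    """
    return poll_interval_s + n_builders * per_builder_cost_s


def max_builders(limit_s, poll_interval_s, per_builder_cost_s):
    """Largest builder count that still keeps the worst case within limit_s."""
    budget = limit_s - poll_interval_s
    if budget < 0:
        # The polling interval alone already exceeds the limit.
        return 0
    return int(budget // per_builder_cost_s)
```

Under this model, a 60 second cron interval cannot meet the 30 second limit for any number of builders (`max_builders(30, 60, 0.5)` is 0), while a 10 second interval at half a second per builder leaves room for 40 builders.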

Nice to have

  • Even faster dispatch and collection, say sub 10 second
  • 10 millisecond response degradation for new builders
  • The ability to dynamically alter the queue positions of jobs based on:
    • job type
    • archive (PPA)
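
One possible shape for the dynamic queue positions listed above is to compute the order at dispatch time from adjustable weights, so that changing a weight reorders jobs that are already queued. This is only a sketch: the `Job` fields, the weight tables and the example job type and archive names are assumptions made for illustration, not part of the current build manager.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    name: str
    job_type: str     # hypothetical values, e.g. "package" or "rebuild"
    archive: str      # hypothetical values, e.g. "primary" or a PPA name
    queued_at: float  # time the job entered the queue, in seconds


def dispatch_order(jobs, type_weight, archive_weight):
    """Return jobs sorted by descending score, oldest first within a score.

    The score is recomputed on every call, so editing the weight tables
    changes the position of jobs that are already queued.
    """
    def key(job):
        score = type_weight.get(job.job_type, 0) + archive_weight.get(job.archive, 0)
        return (-score, job.queued_at)

    return sorted(jobs, key=key)
```

With empty weight tables this degrades to plain first-in, first-out ordering, which keeps the "I am next in the queue" expectation intact when no priorities are configured.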

Must not

Subfeatures

None (yet).

Workflows

Success

How will we know when we are done?

  • When we meet the *Must* criteria above, or get acceptably close to them.
  • We can actually process "daily builds" daily.

How will we measure how well we have done?

  • Graphing response times, build farm throughput and examining the build manager's log file.
    • Ultimately, the thing we're optimizing for is the time between "user requests build" and "user gets usable result of build". We should be graphing that, perhaps broken down by user type or build type, probably as some kind of distribution chart. -- jml
    • Since one motivating use-case is keeping hardware busy, we should also have a graph of the number of machines actively building and the number of jobs in the queue. https://lpstats.canonical.com/graphs/CodeImports/ is such a graph.
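
The measurements suggested above could be summarised along these lines. This is a sketch only: it assumes the request-to-result durations are already available as a list of seconds, uses a simple nearest-rank percentile, and the function names are hypothetical.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of durations."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def wasted_builders(total_builders, busy_builders, queued_jobs):
    """Number of builders sitting idle while jobs are waiting in the queue."""
    idle = total_builders - busy_builders
    return min(idle, queued_jobs)
```

Plotting a few percentiles (say the 50th and 95th) over time gives a crude distribution view, and a `wasted_builders` value above zero for any length of time is the situation the "person who pays for new hardware" story wants to avoid.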

Thoughts?

  • This LEP is essential to the completion of LEP/DailyBuilds -- jml

  • A nice upshot of some of these metrics is that it will make the impact of removing builders more obvious -- jml

    • Perhaps we should put the graphs in Launchpad itself, rather than lpstats? -- jml

LEP/BuildFarmScalability (last edited 2010-12-10 12:33:53 by jml)