Diff for "Foundations/NewTaskSystem/Requirements"

Not logged in - Log In / Register

Differences between revisions 5 and 6
Revision 5 as of 2010-09-02 21:04:05
Size: 6231
Editor: abentley
Comment:
Revision 6 as of 2010-09-17 14:06:11
Size: 6280
Editor: abentley
Comment:
Deletions are marked like this. Additions are marked like this.
Line 48: Line 48:
 * Robust against subverted workers
Line 69: Line 68:
   -- lifeless asserts that this is not worth it. Providing good estimates compromises performance and our estimates are constantly being revised as tasks are added to the system.    -- lifeless asserts that this is not worth it. Providing good estimates compromises performance and our estimates are never right-- they're constantly being revised as tasks are added to the system.  Perhaps an "average time in queue" would be a better guidance?
Line 87: Line 86:
 2. Manual, dyanmic control.  2. Manual, dynamic control.

This is a draft list of requirements for a new system that would replace the buildfarm and the jobs system. Its purpose is to allow us to assess candidate solutions according to how much they provide, and how much we would need to build on top of them.

Requirements

  • testable
    • easy to write fast, robust isolated unit tests
    • Able to run on a laptop with hardly any set-up, and with reasonable confidence that it mostly matches production
  • debuggable in production
    • e.g. if a task (or type of task) is abusing the database, it is easy to determine what type it is, and ideally which individual task
  • Scalable
    • amount of work done scales O(n) with the number of workers
    • responsiveness of the system scales O(1) with the number of workers
    • amount of work done on any dispatcher can scale better than O(n) with the number of workers (e.g. by running multiple dispatchers)
  • Should have good availability
    • Should not increase Launchpad scheduled downtime
    • Downtime should be less than 1 hour per month
  • Supports fair (see below) scheduling in a way that we are content with
  • Avoid penalizing responsiveness of some tasks when other tasks are too slow
  • Permits manual override of scheduling (e.g. getting a security fix built quickly)
  • View
    • Queue status
    • Queue history (with a limited length)
    • builder/worker status
    • builder/worker history (with a limited length)
    • task status and progress (e.g. log)
  • Should not store excessive amounts of data
  • Estimate of task duration
  • Administer tasks
    • set priority
    • cancel
    • restart
  • cron-like functionality (create or dispatch tasks according to a schedule)
  • (Some?) tasks can be restarted after being cancelled/aborted.
  • Bad task detection. Work out if a task is blowing up each worker it uses and disable it after a certain limit.
  • Workers can persist
  • Expensive set-up for particular task types can be amortized by running several tasks of the same type per set-up.
    • for micro-jobs, process startup is expensive, so this implies reusing processes.
  • Workers cannot interfere with other processes on their machine (e.g. by killing workers that consume too much memory or too many file handles)
    • ideally, this will be handled in a way that can be applied to our other systems with similar requirements
  • sub-second latency (e.g. tasks can be dispatched almost as soon as they are created)
  • Workers will have different capabilities (e.g. the ability to build for a given architecture or to run in a virtualized environment), and will not select or be assigned jobs that don't match their capabilities (Tags, message queue tokens); see the sketch after this list
  • Untrusted code can be run safely (e.g. using xen).
  • Multiple machines can participate in the system (e.g. a build farm)
  • Machines can be easily added to or removed from the system, even when tasks are in progress
    • in-progress tasks are moved to a different machine or rescheduled
  • Workers can communicate with the master (logs, heartbeats)
  • Robust against subverted workers (not all workers shall be considered "at-risk of subversion")
    • Subverted workers cannot communicate with any machines except master
    • Subverted workers cannot compromise master
    • Subverted workers cannot impersonate other workers
    • Subverted workers cannot pretend to be handling a task that they aren't handling
  • Completely independent workers, i.e. one bad worker can't take down another (but not all workers are at risk of being "bad").
  • Completely asynchronous and queue-based
  • Cope with network brownouts. Yes, DC engineers sometimes trip over cables.
  • Avoid dumping tasks onto staging via database restores
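
The capability-matching requirement above can be illustrated with a small sketch. This is only an illustration, assuming capabilities are plain string tags; the class and method names (Task, Worker, can_run) are hypothetical and not part of any existing Launchpad API.

{{{#!python
# Hypothetical sketch: capability tags as sets, and a subset test to
# decide whether a worker may take a task.

class Task:
    def __init__(self, task_id, required_capabilities):
        self.task_id = task_id
        self.required_capabilities = frozenset(required_capabilities)


class Worker:
    def __init__(self, worker_id, capabilities):
        self.worker_id = worker_id
        self.capabilities = frozenset(capabilities)

    def can_run(self, task):
        # Eligible only if the worker offers every capability the task
        # requires; it never selects or is assigned a mismatched job.
        return task.required_capabilities <= self.capabilities


worker = Worker("bob", {"amd64", "virtualized"})
assert worker.can_run(Task(1, {"amd64"}))
assert not worker.can_run(Task(2, {"i386"}))
}}}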

Disputed

  • No single point of failure. This means ditching the current notion of a single buildd-manager.
  • Dynamic task queue with manual overrides
    • -- this looks like an implementation detail. What are the use cases? Providing fairness, but allowing emergencies to be handled?
  • Estimate of task start and completion time
    • -- lifeless asserts that this is not worth it. Providing good estimates compromises performance, and our estimates are never right -- they're constantly being revised as tasks are added to the system. Perhaps an "average time in queue" would be better guidance?
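
If "average time in queue" were adopted as the guidance metric, it could be tracked cheaply without per-task estimates. The sketch below is illustrative only (the class name and window size are assumptions): it keeps a rolling average of how long recently dispatched tasks waited between creation and dispatch.

{{{#!python
import time
from collections import deque


class QueueWaitTracker:
    """Illustrative rolling average of time spent waiting in the queue."""

    def __init__(self, window=1000):
        # Keep only the most recent waits so storage stays bounded.
        self._waits = deque(maxlen=window)

    def task_dispatched(self, created_at):
        # Record how long this task sat in the queue (seconds).
        self._waits.append(time.time() - created_at)

    def average_wait(self):
        # None until at least one task has been dispatched.
        if not self._waits:
            return None
        return sum(self._waits) / len(self._waits)
}}}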

Nice-to-have

  • Suspend and resume tasks
  • real-time logtail
  • production/edge/staging split
  • adding new task types does not require creating new database users.
  • does not need to be upgraded routinely when Launchpad is upgraded, only when its code changes
  • supports sub-second jobs (aka "micro-jobs")

Possible strategies

  • manage i386/amd64 builders with UEC (and potentially other services that support the AWS API)
  • manage builders with Landscape

Queueing notes

We need:

  1. An "entry" priority where all things being equal we can programatically determine relative priority at job creation time,
  2. Manual, dynamic control.
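
A minimal sketch of how these two controls could coexist, assuming integer priorities where lower sorts first; the names (TaskQueue, override_priority) are illustrative, not a design decision.

{{{#!python
import heapq
import itertools


class TaskQueue:
    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # FIFO tie-break for equal priorities

    def add(self, task, entry_priority):
        # 1. "Entry" priority, computed programmatically at creation time.
        entry = [entry_priority, next(self._counter), task]
        heapq.heappush(self._heap, entry)
        return entry

    def override_priority(self, entry, new_priority):
        # 2. Manual, dynamic control: an operator bumps a task (e.g. a
        # security fix) by superseding its old entry with a new priority.
        task = entry[2]
        entry[2] = None  # mark the old entry as superseded
        return self.add(task, new_priority)

    def pop(self):
        # Return the highest-priority live task, skipping superseded entries.
        while self._heap:
            priority, tie, task = heapq.heappop(self._heap)
            if task is not None:
                return task
        raise IndexError("pop from an empty queue")
}}}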

Fairness in queueing

Fairness is inevitably fuzzy. Everyone wants their own task handled next, and when there's contention, that just can't happen. It seems inevitable that scheduling will be subjected to endless tweaking. Still, we must try to produce a system that people feel is fair, even if it doesn't satisfy them completely. We can describe some aspects of fairness now, and flesh it out as we go along.

Some aspects of fairness that we may consider:

  • Users who create a lot of tasks should not delay other users unduly
  • Users who create a lot of tasks should not delay themselves unduly. E.g. Barry Warsaw probably doesn't want his Python builds to delay his Ubuntu on Mac Extras builds significantly.
  • Interactively-created tasks should have lower latency than batch tasks
  • Even low-priority tasks should not starve
  • Tasks that take unusually long to run should not block all other tasks, or all tasks in their class
  • If a logical task is split into several sub-tasks, the priority of the logical task should apply to the subtasks, especially if the subtasks are created dynamically. (e.g. binary builds from a source package recipe build should start soon after the source package recipe build completes)
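
As one example of a mechanism that addresses the first two points (not a chosen design), a dispatcher could round-robin across users, so a user who queues many tasks keeps making progress without unduly delaying anyone else. The names below are illustrative only.

{{{#!python
from collections import OrderedDict, deque


class FairDispatcher:
    def __init__(self):
        # user -> deque of that user's tasks, in arrival order.
        self._per_user = OrderedDict()

    def add(self, user, task):
        self._per_user.setdefault(user, deque()).append(task)

    def next_task(self):
        # Serve the least-recently-served user, then rotate them to the
        # back so no single user monopolises the workers.
        if not self._per_user:
            return None
        user, tasks = next(iter(self._per_user.items()))
        task = tasks.popleft()
        del self._per_user[user]
        if tasks:
            self._per_user[user] = tasks  # rotate to the back
        return task
}}}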
