= Dispatch time estimation for build farm jobs = == Introduction == Due to technical limitations (the art of writing psychic software is not very well established yet :-) job dispatch times are '''''estimations''''' only. For the purpose of this description a 'platform' is considered to be the combination of a * processor (e.g. i386 or amd64) and a * virtualization setting (one of: true, false or null) Build farm jobs can either target a specific platform (e.g. binary builds) or be platform-independent (e.g. "generate a source package from a recipe" builds). The former can only make use of build machines (or "builders" in Soyuz parlance) of the given platform while platform-independent jobs may run on any available builder. Jobs with an unspecified virtualization setting will be dispatched to virtual builders only. Builders can -- roughly speaking -- either be idle or building. For any job running on a particular builder its estimated duration and its start time are available allowing us to ''estimate'' the job's remaining execution time. By the way, did I already mention that job dispatch times are an ''estimation'' only? === Problem definition === Given: * a queue of pending (i.e. ready to build) build farm jobs sorted in descending order according to their score * a pool of builders Wanted: the ''estimated'' dispatch time for a specific job (the job of interest (aka JOI)) in the pending queue. === Solution overview === There are two questions we need to answer in order to come up with a dispatch time ''estimation'' for the job of interest (JOI): 1. how long will the jobs ahead of the JOI (in the pending queue) take to run? This is the ''predecessor lead time'' (PLT). 1. how long will it take until the job at the head of the pending queue is dispatched to a builder? This is the ''time to next builder'' (TNB). The dispatch time ''estimation'' for the JOI is then calculated as follows: `now() + PLT + TNB` == Time to next builder == The time to next builder (TNB) is estimated for the ''head job'' which is the job at the head of the pending queue. Given the head job's platform (processor: P, virtualization setting: V) The `TNB` is taken to be the minimum ''remaining'' job execution time across all builders providing (P,V). Example: the head job's platform is (`i386`,`true`) and we have the following builders: || builder || estimated duration || job start time || || Africa || 10 minutes || -2 minutes || || Americas || 12 minutes || -4 minutes || || Antarctica || 8 minutes || -2 minutes || || Australia || 22 minutes || -8 minutes || The resulting `TNB` would be 6 minutes since the `Antarctica` builder is ''estimated'' to finish its job in that time. Sometimes jobs overdraw their estimated duration i.e. they run longer than estimated. In such cases we assume that the corresponding builder will finish in 2 minutes. This is somewhat of a .. *cough* .. ''educated guess'' but has worked reasonably well in the past. == Predecessor lead time == === Overview === The ''predecessor set'' is comprised by jobs that fulfil the following criteria: they are * ahead of a given job of interest (JOI) in the pending queue (the latter is sorted according to job score in descending order) * competing with it for builders The ''predecessor lead time'' for the JOI is then estimated as follows: 1. sum up the ''estimated duration'' of the jobs in the ''predecessor set''. This results in a ''lead time total'' (LTT). 1. divide the `LTT` by the smaller of these two values: the size of the a. ''predecessor set'' a. pool of builders available to run the jobs in the ''predecessor set'' '''Example A''': 10 builders can run the JOI and the ''predecessor set'' is comprised of jobs with estimated durations of 2, 4 and 6 minutes respectively. This results in a ''predecessor lead time'' of 4 minutes. The idea here being that although we have 10 builders only 3 of these can be used to run the jobs in the predecessor set. '''Example B''': 3 builders can run the JOI and the ''predecessor set'' is comprised of jobs with estimated durations of 2, 3, 4 and 6 minutes respectively. This results in a ''predecessor lead time'' of 5 minutes. === Multiple job types === Before the build farm generalization we only had one job type (''binary'' builds) and could hence make the assumption that all jobs in the ''predecessor set'' share the same builder pool. With the introduction of processor-independent build farm jobs that assumption ceased to be true. The examples that follow will all be based on the following set-up: '''Builders''': || builder pool size || processor || virtual || || 4 || `i386` || `false` || || 3 || `i386` || `true` || || 2 || `amd64` || `true` || || 1 || `hppa` || `true` || '''Jobs''': || Job || estimated duration || score || processor || virtual || || J1 || 2 minutes || 99 || `i386` || `true` || || J2 || 4 minutes || 98 || `i386` || `true` || || J3 || 5 minutes || 97 || `null` || `null` || || J4 || 1 minute || 96 || `i386` || `true` || || J5 || 5 minutes || 95 || `i386` || `true` || || J6 || 4 minutes || 94 || `null` || `null` || || J7 || 2 minutes || 93 || `hppa` || `true` || || J8 || 3 minutes || 92 || `null` || `null` || || J9 || 2 minutes || 91 || `i386` || `true` || '''Example C: Job of interest''': `J9` '''Predecessor set''': `{ J1, J2, J3, J4, J5, J6, J8, }` (`J7` needs a (`hppa`, `true`) builder and thus does not compete with `J9`). Please note that the `{ J3, J6, J8 }` job subset needs to be treated differently than `{ J1, J2, J4, J5 }` since it can run on 6 builders (all ''virtual'' builders) whereas `{ J1, J2, J4, J5 }` may only execute on the 3 builders with the (`i386`, `true`) platform. The predecessor lead time for `J9` is 8 minutes and estimated to be the sum of: * `{ J1, J2, J4, J5 }`: `(2 + 4 + 1 + 5 minutes)/3 = 4 minutes` * `{ J3, J6, J8 }`: `(5 + 4 + 3 minutes)/3 = 4 minutes` '''Example D: Job of interest''': `J8` '''Predecessor set''': `{ J1, J2, J3, J4, J5, J6, J7 }`, now all jobs building on virtual builders compete with `J8`. That includes `J7`. The predecessor lead time for `J8` is 10.5 minutes and estimated to be the sum of: * `{ J1, J2, J4, J5 }`: `(2 + 4 + 1 + 5 minutes)/3 = 4 minutes` * `{ J3, J6 }`: `(5 + 4 minutes)/2 = 4.5 minutes` * `{ J7 }`: `2 minutes` === In conclusion === The ''predecessor lead time'' estimation is probably a bit on the "pessimistic" side for ''predecessor sets'' (and/or subsets) that are relatively small in comparison to the builder pool size. However, that improves for larger "pending build job" queues which is when users typically become more interested in the dispatch time estimation in first place. == Where to find things == The ''time to next builder'' estimations are performed by `BuildQueue._estimateTimeToNextBuilder()`. The ''predecessor lead time'' estimation logic can be found in `BuildQueue._estimateJobDelay()`. `lib/lp/buildmaster/tests/test_buildqueue.py` contains a large number of tests that exercise the code in question and may help in understanding how it works.