Express Lane scheduling

As time goes on, we'll be using the job system more and more. It's critical that we complete jobs, but we also want our system to be responsive. Many of our job types have highly-variable durations. So it's easy for a short-running job to get stuck behind a long-running job, waiting for its turn. We can introduce parallelism, allowing multiple jobs to run at once, but this just makes the problem harder to encounter; if there are a maximum of four running jobs, it's possible for a short-running job to get stuck behind four long-running jobs.

Pragmatism

We often can and should improve the speed of the long-running jobs. But this is not always possible, as some jobs involve inherently slow network operations, large datasets, external libraries, etc. Accommodating varying durations provides increased responsiveness in the general case.

Applying the express lane concept to job scheduling

We can two types lanes: "express" and "slow". These will be implemented as processes. We must have at least one of each at a given time.

Slow Lane

The slow lane will behave like our normal twisted job runner does now. It will have a timeout that reflects the maximum time we want a job to run. Jobs which time out are considered permanently failed. It will accept jobs that have timed out in the express lane. When there are no jobs that have timed out in the express lane, it will accept any jobs.

Express Lane

The express lane will have a timeout that reflects the maximim time we want people to wait before their job starts. Jobs which time out are marked to be retried in the slow lane. It will accept jobs that have not timed out in the express lane.

An example

We start with both express and slow lanes empty. The timeout on the express lane is one minute, and the timeout on the slow lane is fifteen minutes. There are four pending jobs, none of which has run before: [fast1, slow1, slow2, fast2].

We schedule fast1 to run in the slow lane and slow1 to run in the fast lane.

0:00
Express lane: slow1
Slow lane: fast1

fast1 completes in 10 seconds, so we schedule slow2 to run in the slow lane

0:10
Express lane: slow1
Slow lane: slow2

slow1 does not complete in 1 minute, so it is ejected from the express lane, and fast2 is scheduled

1:00
Express lane: fast2
Slow lane: slow2

fast2 completes in 10 seconds, and we have nothing left to schedule in the express lane

1:10
Express lane: empty
Slow lane: slow2

slow2 completes in 10 minutes, and we schedule slow1 to run in the slow lane

10:10
Express lane: empty
Slow lane: slow1

slow1 also completes in 10 minutes, and we've exhausted our pending jobs

20:10
Express lane: empty
Slow lane: empty

Key points:

We were able to complete our fast-running jobs in 1:10
We wasted 1 minute trying to run slow1 because we didn't know it would be slow. So the runtime of slow1 is a 1.10x slower than strictly necessary. The slowdown approaches 2x as the slow-lane running-time approaches the express lane timeout.

Although it is not shown in this example, I assume that new jobs can be added to the pending list at any time. If jobs were added between 1:10 and 20:10, they would be scheduled to run in the express lane immediately.

Further details

Assumptions

Jobs can be re-run
Responsiveness is important to users

Optimizing using prediction

A strength of this system is that it behaves reasonably despite not knowing how long a job will take to run. However, if we assume we have a prediction of whether a job will be slow, we can reduce inefficiency and increase responsiveness. The method for predicting whether a job will be slow is not important here.

We can use this prediction two ways which are not mutually exclusive:

Sort the pending list according to how likely each job is to be slow, so that predicted-slow jobs appear at the end of the list.
Refuse to run jobs in the express lane if they are highly likely to be slow

Types of jobs that could benefit

Branch upgrade jobs
Merge proposal diff update jobs
Merge proposal creation jobs (because they generate diffs)
Branch puller (not currently a job)
Code Imports (not currently a job)

Lane planning

It may make sense to have as many slow lanes as there are processor cores, at least for some workloads. This way, there would be no lost performance when the express lanes were empty.

The number of express and slow lanes could vary according to the size of the queue and the current responsiveness of the system.

Implementation details

It appears no model changes are required. Jobs which time out in the express lane can be set "waiting", with an attempt count of 1. The express lane would ignore pending jobs whose attempt count was not 0.

Most of the work would be done in the ParallelLimitedTaskConsumer, which would provide "lanes" by tracking how many jobs were running as "express" jobs and how many were running as "slow" jobs.

The express lane could suspend jobs that time out, rather than aborting them. This would avoid wasting work, at the cost of increasing memory use.

It's also possible to have additional layers of lanes, if some jobs are ridiculously slow, but must be completed.

launchpad development

CodeTeam/ExpressLane