Differences between revisions 15 and 25 (spanning 10 versions)

Status

lazr.jobrunner project at https://launchpad.net/lazr.jobrunner

Removal of job-running code from the Launchpad Tree: https://code.launchpad.net/~adeuring/launchpad/lp-lazr.jobrunner/+merge/97458

Initial code landed in r15046.

Successful test against qastaging (Command: https://pastebin.canonical.com/63588/)

RT #51940: "Please enable branch scan via Celery."

Example feature rule

jobs.celery.enabled_classes    default 0       BranchScanJob
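
As a rough illustration of how a rule like this could be read, the sketch below splits a rule line into four whitespace-separated fields and checks whether a job class is listed in the last one. The field names (flag, scope, priority, value) and the helpers `parse_feature_rule` and `celery_enabled_for` are assumptions made for this example; they are not Launchpad's feature-flag API.

```python
from typing import NamedTuple


class FeatureRule(NamedTuple):
    flag: str
    scope: str
    priority: int
    value: str


def parse_feature_rule(line: str) -> FeatureRule:
    """Split one rule line into (flag, scope, priority, value)."""
    # Assumption: fields are separated by whitespace, and the value is the
    # remainder of the line, so it may itself contain spaces.
    flag, scope, priority, value = line.split(None, 3)
    return FeatureRule(flag, scope, int(priority), value.strip())


def celery_enabled_for(class_name: str, rule: FeatureRule) -> bool:
    """Return True if class_name is listed in the rule's value."""
    # Assumption: several class names would be separated by whitespace.
    return class_name in rule.value.split()


# Rule text modelled on the example above.
rule = parse_feature_rule("jobs.celery.enabled_classes\tdefault\t0\tBranchScanJob")
print(celery_enabled_for("BranchScanJob", rule))
```

With this rule only BranchScanJob would be routed to Celery; any other class name makes `celery_enabled_for` return False.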

Completed

Implementation of celery-based job running: https://code.launchpad.net/~abentley/lazr.jobrunner/run-via-celery
Implementation of job-running code in lazr.jobrunner: https://code.launchpad.net/~adeuring/launchpad/lazr.jobrunner-more-tests
More updates to support running Launchpad jobs via lazr.jobrunner: https://code.launchpad.net/~abentley/lazr.jobrunner/launchpad-via-celery/+merge/98894
release lazr.jobrunner 0.2
land ~abentley/launchpad/run-via-celery

Aaron's Todo

Support new job types for running via Celery.
Resource limits
- Oops if memory limit exceeded (may just need tests)
- Oops handling should not be subjected to memory limit
- Perhaps needs to be NIHed: don't accept jobs while too much memory is in use.
Fast lane/slow lane
Upgrade story
Revise/remove lease handling (requires removal of all existing job-running scripts).
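
The two resource-limit items above (oops when the memory limit is exceeded, but do not subject oops handling to the limit) could be sketched roughly as follows. This is a minimal sketch for a Unix system, not lazr.jobrunner's code: `memory_limit`, `run_with_memory_limit` and the `report_oops` callback are hypothetical names, and using `RLIMIT_AS` as the limit is an assumption.

```python
import resource
from contextlib import contextmanager


@contextmanager
def memory_limit(limit_bytes):
    """Temporarily set the soft address-space limit of this process."""
    soft, hard = resource.getrlimit(resource.RLIMIT_AS)
    # Raises ValueError if limit_bytes is above a finite hard limit.
    resource.setrlimit(resource.RLIMIT_AS, (limit_bytes, hard))
    try:
        yield
    finally:
        # Restore the previous soft limit before any error handling runs.
        resource.setrlimit(resource.RLIMIT_AS, (soft, hard))


def run_with_memory_limit(job, limit_bytes, report_oops):
    """Run job() under a memory limit; report failures outside the limit."""
    try:
        with memory_limit(limit_bytes):
            return job()
    except MemoryError as error:
        # The limit has already been restored at this point, so the
        # (hypothetical) report_oops callback is not subject to it.
        report_oops(error)
        return None
```

Because the context manager restores the old limit in its `finally` clause, the `except MemoryError` branch always runs with the original limit in place.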

Nice to have

Timeline in oopses
Use rabbitfixture for lazr.jobrunner test suite
Set up resource limits once per worker process, instead of per-job.

Postponed

Kill workers whose tasks error
- Rationale: We are not sure that this is necessary, and it seems hard.

Resources

Main machine is ackee
IRC log https://pastebin.canonical.com/62113/
jjo suggests we use 1.5-2 cores and 1.5 G on ackee
- therefore, probably one worker process each for fast and slow lanes
lifeless revises this to 4 workers, suggesting 3 fast and 1 slow
loganberry is also available, but heavily loaded

Open Questions

Currently, each job type has its own config. Do we retain that? If so, how do we associate jobs with their configs?

How do we support nodowntime/fastdowntime upgrades?

Send SIGINT to current celeryds and start new ones?
- Less coding
- More chance of exceeding resource limits, as new workers take jobs while old workers are still working.
- Old workers may last until the slow lane times out
On message or signal, workers stop accepting jobs, reschedule any running jobs, and exit?
- More coding
- No chance of exceeding resource limits
- Delays slow-running jobs further
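
The second option (stop accepting jobs, reschedule the running job, exit) might look roughly like the toy loop below. It is only a sketch: the in-memory `deque` stands in for the real queue, `SIGUSR1` is an arbitrary choice of signal, and `ShutdownRequested`, `install_shutdown_handler` and `run_jobs` are hypothetical names rather than Celery or lazr.jobrunner APIs. It also ignores the small windows in which a signal could arrive between bookkeeping steps.

```python
import signal
from collections import deque


class ShutdownRequested(Exception):
    """Raised inside the worker loop when a shutdown signal arrives."""


def install_shutdown_handler(signum=signal.SIGUSR1):
    """Make `signum` raise ShutdownRequested in the main thread."""
    def handler(received_signum, frame):
        raise ShutdownRequested()
    # signal.signal() returns the previous handler so callers can restore it.
    return signal.signal(signum, handler)


def run_jobs(queue, execute):
    """Run jobs from `queue` until it is empty or shutdown is requested.

    Returns True if the queue was drained, False if the worker stopped early.
    """
    current = None
    try:
        while queue:
            current = queue.popleft()
            execute(current)
            current = None
    except ShutdownRequested:
        if current is not None:
            # Put the interrupted job back at the front for another worker.
            queue.appendleft(current)
        return False
    return True
```

A worker would call `install_shutdown_handler()` once at startup and then `run_jobs(queue, execute)`. The interrupted job restarts from scratch on another worker, which is the "delays slow-running jobs further" cost noted above.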

Analysis of current work load

Current work load

Fast-first vs slow-first

The fast lane must always time out and dump timed-out jobs to the slow lane. But should jobs start in the fast lane or the slow lane?

Fast-first

If they start in the fast lane, then every slow job will run in the fast lane first, then be re-queued into the slow lane. If slow jobs take up every fast-lane worker, then latency will be introduced. Therefore, there are competing incentives to maximize the number of fast workers (to avoid latency) and to maximize the number of slow workers (to increase the throughput of slow jobs).
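
To make the cost of fast-first concrete, here is a small simulation with one queue per lane. It is a sketch under simplifying assumptions: job durations are known up front, a job that times out is restarted from scratch in the slow lane, and `run_fast_first` is a hypothetical helper.

```python
from collections import deque


def run_fast_first(job_durations, fast_timeout):
    """Route jobs fast-first and return (finished_fast, finished_slow, wasted).

    job_durations maps a job name to how long it would take to run.
    """
    fast_lane = deque(job_durations)   # every job starts in the fast lane
    slow_lane = deque()
    finished_fast, finished_slow = [], []
    wasted = 0                         # fast-lane time spent on timed-out jobs

    while fast_lane:
        job = fast_lane.popleft()
        if job_durations[job] <= fast_timeout:
            finished_fast.append(job)
        else:
            # The job hit the timeout: it is re-queued into the slow lane,
            # which has no timeout, and its fast-lane time is lost.
            wasted += fast_timeout
            slow_lane.append(job)

    while slow_lane:
        finished_slow.append(slow_lane.popleft())

    return finished_fast, finished_slow, wasted


print(run_fast_first({"scan": 2, "import": 50, "mail": 1}, fast_timeout=10))
# → (['scan', 'mail'], ['import'], 10)
```

Each slow job occupies a fast-lane worker for the full timeout before it moves on, which is the source of the latency described above.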

Slow-first

Another way to organise this is to initially queue into the slow lane, and treat the fast lane as overflow when all slow-lane workers are occupied. In this case, latency is introduced only when all workers are busy, rather than whenever all fast-lane workers are busy. However, slow-lane workers are more likely to be busy than fast-lane workers, because they do not have a timeout. So while there is still an incentive to maximize the number of fast workers (to reduce the likelihood that all workers will be busy), it is less pronounced than with fast-first. Two fast-lane workers are probably plenty.
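
The slow-first routing decision can be sketched as a single function. `choose_lane_slow_first` is a hypothetical helper that assumes the dispatcher knows how many workers in each lane are busy, which a real setup might not be able to observe directly.

```python
def choose_lane_slow_first(busy_slow, slow_workers, busy_fast, fast_workers):
    """Pick a lane for a new job under the slow-first scheme.

    Returns "slow", "fast", or None when every worker is busy.
    """
    if busy_slow < slow_workers:
        return "slow"      # preferred: no timeout, so the job never restarts
    if busy_fast < fast_workers:
        return "fast"      # overflow: may time out and be dumped to slow
    return None            # all workers busy: the job waits (latency)


# With two workers per lane, a job overflows only once both slow workers
# are occupied.
print(choose_lane_slow_first(1, 2, 0, 2))   # → slow
print(choose_lane_slow_first(2, 2, 1, 2))   # → fast
print(choose_lane_slow_first(2, 2, 2, 2))   # → None
```

A new job waits only in the third case, when both lanes are full.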

Analysis

While fast jobs are the norm (see Current work load), it's not clear whether slow jobs arrive in clusters. If they do, slow-first has a better chance to absorb the work without introducing latency.

Slow-first reduces the advantage of having many fast-lane workers, encouraging a more balanced set of workers. The fewer workers permitted, the more slow-first shows an advantage in latency and throughput.

lifeless rightly observes that slow-first is more complicated than fast-first, and also that slow-first can be built on fast-first. He has indicated a strong preference for fast-first, until the need for slow-first can be demonstrated.

- Revision 15 as of 2012-03-23 18:56:43
  Size: 4117
  Editor: abentley
  Comment:
+ Revision 25 as of 2012-04-09 14:03:05
  Size: 4648
  Editor: abentley
  Comment:
 Line 6:
-== Merged ==
+Initial code landed in r15046.

Successful test against qastaging ([[https://pastebin.canonical.com/63588/|Command]])

RT #51940: "Please enable branch scan via Celery."

Example feature rule
{{{
jobs.celery.enabled_classes	default	0	BranchScanJob
}}}

== Completed ==
-Line 12:
+Line 23:
+ * More updates to support running Launchpad jobs via lazr.jobrunner: https://code.launchpad.net/~abentley/lazr.jobrunner/launchpad-via-celery/+merge/98894

 * release lazr.jobrunner 0.2

 * land ~abentley/launchpad/run-via-celery
-Line 15:
+Line 31:
- * '''blocked on lp:~adeuring/lazr.jobrunner/use_job_repr_in_logging''' release lazr.jobrunner 0.2
 * '''blocked on lp:~adeuring/launchpad/lp-lazr.jobrunner''' land ~abentley/launchpad/run-via-celery
 *  workers: one-time Launchpad init
 * per job-type init
  * Possibly not needed, because not expensive.
  * Can be implemented in Job.run
+ * Support new job types for running via Celery.
-Line 22:
+Line 33:
-  * Maybe per-celeryd, since available resources doesn't change...
  * Oops if memory limit exceeded
+  * Oops if memory limit exceeded (may just need tests)
  * Oops handling should not be subjected to memory limit
-Line 27:
+Line 38:
+ * '''requires removal of all existing job-running scripts''' Revise/remove lease handling.
-Line 30:
+Line 42:
- * Use rabbitfixture for test suite
+ * Use rabbitfixture for lazr.jobrunner test suite
 * Set up resource limits once per worker process, instead of per-job.
-Line 34:
+Line 47:
+  * Rationale: We are not sure that this is necessary, and it seems hard.
-Line 47:
+Line 61:
- * Send SIGTERM to current celeryds and start new ones?
+ * Send SIGINT to current celeryds and start new ones?
-Line 50:
+Line 64:
+  * old workers may last until the slow lane times out


Diff for "CeleryJobRunner"