Diff for "CeleryJobRunner"


Differences between revisions 29 and 30
Revision 29 as of 2012-05-16 15:06:23
Size: 4111
Editor: abentley
Revision 30 as of 2012-05-16 15:25:53
Size: 4190
Editor: abentley

Test command for qastaging: https://pastebin.canonical.com/63588/ (paste)

Status

Developer Docs: CeleryJobs

lazr.jobrunner project at https://launchpad.net/lazr.jobrunner

The fast lane is deployed on staging and production. Unfortunately, it will fail over to a nonexistent slow lane, so only branch scans are enabled, and only on staging.

RT #52808: Deploy running jobs via Celery on staging/qastaging

Example feature rule

jobs.celery.enabled_classes     default 0       BranchScanJob
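A minimal sketch of how such a rule could be read, assuming the four whitespace-separated columns are flag name, scope, priority and value, and that the value is a space-separated list of job class names. `parse_rule` and `celery_enabled` are hypothetical helpers written for illustration; they are not the Launchpad feature-flag API, and scope matching is ignored here.

```python
from collections import namedtuple

# Assumed column layout of a feature rule: flag, scope, priority, value.
FeatureRule = namedtuple("FeatureRule", "flag scope priority value")

FLAG_NAME = "jobs.celery.enabled_classes"


def parse_rule(line):
    """Split one rule line into its four assumed columns."""
    flag, scope, priority, value = line.split(None, 3)
    return FeatureRule(flag, scope, int(priority), value.strip())


def celery_enabled(class_name, rules):
    """Return True if class_name is listed in the highest-priority rule.

    Hypothetical helper: every rule is treated as being in scope.
    """
    matching = [rule for rule in rules if rule.flag == FLAG_NAME]
    if not matching:
        return False
    best = max(matching, key=lambda rule: rule.priority)
    return class_name in best.value.split()


rule = parse_rule("jobs.celery.enabled_classes     default 0       BranchScanJob")
print(celery_enabled("BranchScanJob", [rule]))
```

With the example rule, only BranchScanJob would be routed to Celery; any other job class would keep using its existing runner.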


Completed

RT #51940: "Please enable branch scan via Celery."

Aaron's Todo

  • Resource limits
    • Oops handling should not be subjected to memory limit
    • Perhaps this needs to be NIHed: don't accept jobs while too much memory is in use.
  • Revise/remove lease handling.
    • Requires removal of all existing job-running scripts
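The resource-limit items above ask that oops handling run outside the memory limit. One possible shape for that, as a sketch only: it uses the standard library `resource` module (Unix only) and limits address space via `RLIMIT_AS`. The names `memory_limit` and `run_job_with_memory_limit`, and the `report_oops` callback, are hypothetical and not taken from lazr.jobrunner.

```python
import contextlib
import resource


@contextlib.contextmanager
def memory_limit(max_bytes):
    """Lower the soft address-space limit for the duration of the block."""
    soft, hard = resource.getrlimit(resource.RLIMIT_AS)
    # The soft limit may never exceed the hard limit.
    if hard == resource.RLIM_INFINITY:
        new_soft = max_bytes
    else:
        new_soft = min(max_bytes, hard)
    resource.setrlimit(resource.RLIMIT_AS, (new_soft, hard))
    try:
        yield
    finally:
        # Restore the previous soft limit, even if the job failed.
        resource.setrlimit(resource.RLIMIT_AS, (soft, hard))


def run_job_with_memory_limit(job, report_oops, max_bytes):
    """Run job() under the limit; report failures after the limit is lifted."""
    try:
        with memory_limit(max_bytes):
            return job()
    except Exception as error:
        # The context manager has already exited here, so the oops
        # report is not subject to the job's memory limit.
        report_oops(error)
        return None
```

Because the limit is set and restored around each job, this is the per-job variant; setting it once per worker process would need a separate way to exempt oops handling.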

Nice to have

  • Timeline in oopses
  • Use rabbitfixture for lazr.jobrunner test suite
  • Set up resource limits once per worker process, instead of per-job.

Postponed

  • Kill workers whose tasks error
    • Rationale: We are not sure that this is necessary, and it seems hard.

Resources

  • Main machine is ackee
  • IRC log https://pastebin.canonical.com/62113/

  • jjo suggests we use 1.5-2 cores and 1.5 G on ackee
    • therefore, probably one worker process each for fast and slow lanes
  • lifeless revises this to 4 workers, suggesting 3 fast and 1 slow
  • loganberry is also available, but heavily loaded

Analysis of current work load

Current work load

Fast-first vs slow-first

The fast lane must always time out and dump timed-out jobs to the slow lane. But should jobs start in the fast lane or the slow lane?

Fast-first

If they start in the fast lane, then every slow job will run in the fast lane first, then be re-queued into the slow lane. If slow jobs take up every fast-lane worker, then latency will be introduced. Therefore, there are competing incentives to maximize the number of fast workers (to avoid latency) and to maximize the number of slow workers (to increase the throughput of slow jobs).
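The fast-first flow can be modelled in a few lines. This is a toy model rather than Celery code: jobs are represented only by their run time, `FAST_TIMEOUT` is a made-up value, and it assumes a timed-out job restarts from scratch in the slow lane, so the time spent in the fast lane is wasted.

```python
from collections import deque

FAST_TIMEOUT = 5.0  # hypothetical fast-lane timeout, in seconds


def run_fast_first(durations, timeout=FAST_TIMEOUT):
    """Map each job index to (lane it finished in, total worker time used)."""
    fast_lane = deque(enumerate(durations))
    slow_lane = deque()
    results = {}
    while fast_lane:
        job_id, duration = fast_lane.popleft()
        if duration <= timeout:
            results[job_id] = ("fast", duration)
        else:
            # The fast worker gives up at the timeout and re-queues the job.
            slow_lane.append((job_id, duration))
    while slow_lane:
        job_id, duration = slow_lane.popleft()
        # Assumed: the job starts over, so the timeout period is pure overhead.
        results[job_id] = ("slow", timeout + duration)
    return results


print(run_fast_first([1.0, 10.0, 2.0]))
```

In this model every slow job holds a fast-lane worker for the full timeout before it reaches the slow lane, which is the source of the latency described above.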

Slow-first

Another way to organise this is to initially queue into the slow lane, and treat the fast lane as overflow when all slow-lane workers are occupied. In this case, latency is introduced only when all workers are busy, rather than whenever all fast-lane workers are busy. However, slow-lane workers are more likely to be busy than fast-lane workers, because they do not have a timeout. So while there is still an incentive to maximize the number of fast workers (to reduce the likelihood that all workers will be busy), it is less pronounced than with fast-first. Two fast-lane workers are probably plenty.
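The slow-first dispatch rule could look roughly like the following. `choose_lane` is a hypothetical helper; how the dispatcher would learn the number of busy workers is not addressed here.

```python
def choose_lane(busy_slow, slow_workers, busy_fast, fast_workers):
    """Pick a lane for a new job under slow-first.

    Returns "slow", "fast", or None when every worker is busy and the
    job has to wait (the only case in which latency is introduced).
    """
    if busy_slow < slow_workers:
        return "slow"
    if busy_fast < fast_workers:
        # Overflow: the job may still time out here and be re-queued.
        return "fast"
    return None


# One slow worker already busy, three idle fast workers.
print(choose_lane(busy_slow=1, slow_workers=1, busy_fast=0, fast_workers=3))
```

A job that overflows into the fast lane is still subject to the fast-lane timeout, so the fast lane keeps its role of dumping timed-out jobs to the slow lane.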

Analysis

While fast jobs are the norm (see Current work load), it's not clear whether slow jobs arrive in clusters. If they do, slow-first has a better chance to absorb the work without introducing latency.
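The clustering argument can be made concrete with a back-of-the-envelope check. It assumes all workers are idle when the cluster arrives and that every job in the cluster outlasts the fast-lane timeout; the function name and the policy strings are illustrative only.

```python
def cluster_causes_latency(policy, cluster_size, fast_workers, slow_workers):
    """Return True if a burst of slow jobs leaves no worker free for new jobs."""
    if policy == "fast-first":
        # New jobs enter via the fast lane, so latency starts once
        # the cluster occupies every fast-lane worker.
        return cluster_size >= fast_workers
    if policy == "slow-first":
        # Jobs fill the slow lane, then overflow to the fast lane, so
        # latency starts only once every worker is occupied.
        return cluster_size >= fast_workers + slow_workers
    raise ValueError("unknown policy: %r" % (policy,))


# Using the suggested split of 3 fast workers and 1 slow worker.
for policy in ("fast-first", "slow-first"):
    print(policy, cluster_causes_latency(policy, 3, 3, 1))
```

With 3 fast workers and 1 slow worker, a cluster of three slow jobs ties up the fast lane under fast-first but still leaves one worker free under slow-first.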

Slow-first reduces the advantage of having many fast-lane workers, encouraging a more balanced set of workers. The fewer workers permitted, the more slow-first shows an advantage in latency and throughput.

lifeless rightly observes that slow-first is more complicated than fast-first, and also that slow-first can be built on fast-first. He has indicated a strong preference for fast-first, until the need for slow-first can be demonstrated.

CeleryJobRunner (last edited 2012-05-16 15:31:39 by abentley)