Status
lazr.jobrunner project at https://launchpad.net/lazr.jobrunner
Removal of job-running code from the Launchpad Tree: https://code.launchpad.net/~adeuring/launchpad/lp-lazr.jobrunner/+merge/97458
Implementation of job-running code in lazr.jobrunner: https://code.launchpad.net/~adeuring/launchpad/lazr.jobrunner-more-tests
Implementation of celery-based job running: https://code.launchpad.net/~abentley/lazr.jobrunner/run-via-celery
Aaron's Todo
- workers: one-time Launchpad init
- per job-type init
  - Possibly not needed, because not expensive.
  - Can be implemented in Job.run (see the init sketch after this list)
- Resource limits
  - Maybe per-celeryd, since available resources don't change...
  - OOPS if memory limit exceeded
  - Perhaps needs to be NIHed: don't accept jobs while too much memory is in use (see the memory-check sketch after this list)
- Fast lane/slow lane
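A minimal sketch of the init split above, assuming Celery's worker_process_init signal for the one-time Launchpad setup and a Job.run that does its own per-type setup. The set_up_launchpad_environment helper and the BaseRunnableJob class are placeholders, not the actual lazr.jobrunner code.

{{{#!python
# Hedged sketch: worker_process_init is real Celery API; the Launchpad setup
# call and the job base class below are illustrative stand-ins.
from celery.signals import worker_process_init


def set_up_launchpad_environment():
    # Stand-in for Launchpad's real one-time setup (ZCML, database access, ...).
    pass


@worker_process_init.connect
def initialise_launchpad(**kwargs):
    # Runs once per worker process, before any job is handled.
    set_up_launchpad_environment()


class BaseRunnableJob:
    """Illustrative job base class; per-job-type init lives at the top of run()."""

    def setup(self):
        # Cheap per-type initialisation; possibly not needed at all, as noted above.
        pass

    def do_work(self):
        raise NotImplementedError

    def run(self):
        self.setup()
        self.do_work()
}}}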
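A rough sketch of the memory-check idea, again only an assumption about how it might look: the 1.5 G threshold echoes jjo's figure in the Resources section below, and the /proc reading is Linux-specific.

{{{#!python
# Hedged sketch of "don't accept jobs while too much memory is in use";
# the threshold and the caller's behaviour on refusal are not agreed design.
import os

MEMORY_LIMIT_BYTES = int(1.5 * 1024 ** 3)  # roughly 1.5 G
PAGE_SIZE = os.sysconf('SC_PAGE_SIZE')


def memory_in_use_bytes():
    # The second field of /proc/self/statm is the resident set size, in pages.
    with open('/proc/self/statm') as statm:
        rss_pages = int(statm.read().split()[1])
    return rss_pages * PAGE_SIZE


def should_accept_job():
    # Refuse new work while over the limit; the caller would leave the job
    # on the queue (or retry it later) rather than running it now.
    return memory_in_use_bytes() < MEMORY_LIMIT_BYTES
}}}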
Postponed
- Kill workers whose tasks error
Resources
- Main machine is ackee
  - jjo suggests we use 1.5-2 cores and 1.5 G on ackee
  - therefore, probably one worker process each for fast and slow lanes
  - lifeless revises this to 4 workers, suggesting 3 fast and 1 slow
- loganberry is also available, but heavily loaded
Open Questions
Currently, each job type has its own config. Do we retain that? If so, how do we associate jobs with their configs?
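One possible, purely illustrative answer: each job class names its config section and the runner looks it up. The class names, section names, and lookup helper below are stand-ins, not existing Launchpad or lazr.jobrunner API.

{{{#!python
# Hypothetical mapping from job type to config section; nothing here is settled design.
JOB_CONFIG_SECTIONS = {
    'BranchScanJob': 'branchscanner',
    'MergeProposalNeedsReviewEmailJob': 'merge_proposal_jobs',
}


def config_for_job(job, config):
    """Return the config section for this job's type, falling back to a default."""
    section_name = JOB_CONFIG_SECTIONS.get(type(job).__name__, 'default_jobs')
    return getattr(config, section_name)
}}}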
How do we support nodowntime/fastdowntime upgrades?
- Send SIGTERM to current celeryds and start new ones?
  - Less coding
  - More chance of exceeding resource limits, as new workers take jobs while old workers are still working.
- On message or signal, workers stop accepting jobs, reschedule any running jobs, and exit? (See the sketch after this list.)
  - More coding
  - No chance of exceeding resource limits
  - Delays slow-running jobs further
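A rough sketch of what the second option might look like, using only standard-library signal handling; the job bookkeeping and the requeue() call are hypothetical, and a real implementation would hook into celeryd rather than raw signals.

{{{#!python
# Hedged sketch of option two: stop accepting jobs, put running jobs back on
# the queue, and exit. currently_running and Job.requeue() are hypothetical.
import signal
import sys

accepting_jobs = True
currently_running = []  # jobs this worker is executing (hypothetical bookkeeping)


def graceful_shutdown(signum, frame):
    global accepting_jobs
    accepting_jobs = False          # stop taking new jobs off the queue
    for job in currently_running:
        job.requeue()               # hypothetical: reschedule the running job
    sys.exit(0)


signal.signal(signal.SIGTERM, graceful_shutdown)
}}}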
Analysis of current work load
See CeleryJobRunner/CurrentWorkLoad.
Fast-first vs slow-first
The fast lane must always time out and dump timed-out jobs into the slow lane. But should jobs start in the fast lane or the slow lane?
Fast-first
If they start in the fast lane, then every slow job will run in the fast lane first, then be re-queued into the slow lane. If slow jobs take up every fast-lane worker, then latency will be introduced. Therefore, there are competing incentives: to maximize the number of fast workers (to avoid latency) and to maximize the number of slow workers (to increase the throughput of slow jobs).
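A hedged sketch of fast-first with Celery, not the actual lazr.jobrunner implementation: the queue names, broker URL, the five-minute limit, and the job body are assumptions. Celery's soft time limit raises SoftTimeLimitExceeded inside the task, which the fast-lane task catches in order to re-queue the job on the slow lane.

{{{#!python
# Illustrative only: queue names, broker URL, and the 300 s limit are placeholders.
from celery import Celery
from celery.exceptions import SoftTimeLimitExceeded

app = Celery('jobs', broker='amqp://')


def do_the_work(job_id):
    # Stand-in for the real job body.
    pass


@app.task(bind=True, soft_time_limit=300)
def run_job(self, job_id):
    """Fast lane: every job starts here and is dumped to the slow lane on timeout."""
    try:
        do_the_work(job_id)
    except SoftTimeLimitExceeded:
        # Timed out in the fast lane; re-queue into the slow lane, which has
        # no (or a much longer) time limit.
        run_job_slowly.apply_async(args=[job_id], queue='job_slow')


@app.task
def run_job_slowly(job_id):
    """Slow lane: no fast-lane timeout applies."""
    do_the_work(job_id)
}}}

Under this split, fast-lane celeryds would consume the default queue and slow-lane celeryds only 'job_slow' (for example 3 fast and 1 slow worker processes, per the Resources section above); slow-first would simply invert which queue jobs are published to initially.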
Slow-first
Another way to organise this is to initially queue into the slow lane, and treat the fast lane as overflow when all slow-lane workers are occupied. In this case, latency is introduced only when all workers are busy, not all fast-lane workers. However, slow-lane workers are more likely to be busy than fast-lane workers, because they do not have a timeout. So while there is still an incentive to maximize the number of fast workers (to reduce the likelihood that all workers will be busy), it is less pronounced than with fast-first. Two fast-lane workers is probably plenty.
Analysis
While fast jobs are the norm (see Current work load), it's not clear whether slow jobs arrive in clusters. If they do, slow-first has a better chance to absorb the work without introducing latency.
Slow-first reduces the advantage of having many fast-lane workers, encouraging a more balanced set of workers. The fewer workers permitted, the more slow-first shows an advantage in latency and throughput.
lifeless rightly observes that slow-first is more complicated than fast-first, and also that slow-first can be built on fast-first. He has indicated a strong preference for fast-first, until the need for slow-first can be demonstrated.