Removing script activity from the LP database
Launchpad currently includes an operational tool, 'script activity', that reports on scripts which fail to run. Because this is tied into the LP core, it is not readily usable by components we split out of Launchpad itself. We would like to keep the same reporting facilities but permit them to work on split-out components. Further, it might be nice to let other things within Canonical get the same functionality.
Contact: RobertCollins
On Launchpad: https://bugs.launchpad.net/launchpad-project/+bugs?field.tag=scriptactivity
This could grow into a monster project if we were to re-examine the base requirements; at this point we just need to permit the existing functionality on non-LP-core scripts.
Rationale
We are doing this because, as our service-oriented architecture expands, we have scripts that need monitoring. Nagios can be configured to report on scripts that don't run, but only indirectly via log files and regexes; scriptactivity was added to LP to provide something simpler, and consultation with IS has confirmed that it would still be beneficial there.
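(For concreteness, a rough sketch of what a scriptactivity-style recorder captures: each run records the script's name, the host it ran on, and when it started and finished; questions like "has it run lately?" can be derived from that. The function and store below are illustrative, not the actual Launchpad code.)

{{{#!python
# Illustrative sketch only -- not the actual Launchpad scriptactivity code.
import datetime
import socket


def record_script_activity(store, name, date_started, date_completed):
    """Record one successful script run in a shared store.

    store is assumed to expose an add(row) method (e.g. a thin wrapper
    over a database table or a small web service client).
    """
    store.add({
        'name': name,
        'hostname': socket.gethostname(),
        'date_started': date_started,
        'date_completed': date_completed,
    })


# Typical use at the end of a cron script:
#   started = datetime.datetime.utcnow()
#   ... do the work ...
#   record_script_activity(store, 'nightly-example-script',
#                          started, datetime.datetime.utcnow())
}}}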
Our users probably don't care at all ;). We might in future offer this as a service; it is in principle something reusable.
Stakeholders
- GSA
- OSA
- LP TA
Possibly also U1/LS/ISD architects.
User stories
As a Developer
I want to easily set up my scripts so that failures to run are reported
so that my team and I know when things don't run
As a Sysadmin
I want to find out when a script last ran
so that I can tell how long something has been broken
As a Sysadmin
I want to be able to have Nagios alert when scripts have missed their deadline
so that we can be told about problems automatically (see the check sketch after these stories)
As a Sysadmin
I want to be able to tell scriptactivity that a script is no longer expected
so that when machines or scripts go away we don't get nagged
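(The last two stories amount to a Nagios 'freshness' check: alert when the most recent recorded run of a script is older than its expected interval, unless the script has been marked as no longer expected. A rough sketch, assuming some lookup supplies the last completed time; the names here are illustrative, not an existing plugin.)

{{{#!python
# Sketch of a Nagios-style freshness check; exit codes follow the
# Nagios plugin convention (0=OK, 2=CRITICAL, 3=UNKNOWN).
import datetime
import sys


def check_script(name, max_age, last_completed, still_expected=True):
    """Return a (status, message) pair for one monitored script.

    last_completed is the datetime of the most recent recorded run, or
    None if the script has never reported; still_expected lets us
    silence scripts that have been retired.
    """
    if not still_expected:
        return 0, '%s is no longer expected; OK' % name
    if last_completed is None:
        return 3, '%s has never reported' % name
    age = datetime.datetime.utcnow() - last_completed
    if age > max_age:
        return 2, '%s last ran %s ago (limit %s)' % (name, age, max_age)
    return 0, '%s last ran %s ago' % (name, age)


if __name__ == '__main__':
    # Example: a script expected at least daily, last seen 30 hours ago.
    status, message = check_script(
        'nightly-example-script', datetime.timedelta(hours=25),
        datetime.datetime.utcnow() - datetime.timedelta(hours=30))
    print(message)
    sys.exit(status)
}}}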
Constraints and Requirements
Must
- Allow us to delete scriptactivity from LP.
- Have Nagios integration.
- Permit multiple teams to use it.
- Support Python scripts.
Nice to have
- Support sh scripts.
Must not
Out of scope
Subfeatures
Success
How will we know when we are done?
How will we measure how well we have done?
Thoughts?
I vote for a minimal solution (e.g. XML-RPC call back to Launchpad). I think scriptactivity is a flawed approach, and we should conserve our energy and time to consider something different, especially given the vast efforts required to get a new service rolled out. -- GavinPanella, 2011-11-18
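(For illustration of the 'minimal solution' shape: a split-out script could report each run over XML-RPC to a small recording endpoint. The endpoint URL and method name below are assumptions for the sketch, not necessarily the interface we would end up with.)

{{{#!python
# Sketch only: the endpoint URL and method name are placeholders.
import datetime
import socket
import xmlrpc.client


def report_run(name, date_started, date_completed,
               endpoint='http://launchpad.example/scriptactivity'):
    """Report one completed script run back to a central recording service."""
    proxy = xmlrpc.client.ServerProxy(endpoint)
    # XML-RPC has a native dateTime type, so datetime objects pass
    # through without manual formatting.
    proxy.recordSuccess(name, socket.gethostname(),
                        date_started, date_completed)


# A cron script would call this just before exiting successfully:
#   started = datetime.datetime.utcnow()
#   ... do the work ...
#   report_run('nightly-example-script', started,
#              datetime.datetime.utcnow())
}}}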
Flawed how? -- RobertCollins
 * Managing jobs in tons of crontabs is a pain. Pausing jobs during a release might require changes to several crontabs.
  * Though it works well with source code control.
 * Multiple machines each with their own scheduler (i.e. crond) seems reliable against machine failure.
  * However, afaik, there is no HA by default. If a machine goes down, jobs must be migrated by hand.
  * HA with cron is hard because of locking. Don't want the same script running in two places.
 * A single place to synchronously record activity - scriptactivity - is a point of failure. This is mitigated by the fact that the Launchpad database is essential to most (all?) scripts; if the database is down there's no point worrying about recording script runs.
-- GavinPanella, 2011-11-21
Fwiw, I think it would be better to dispatch jobs from a central location. A system like that inherently knows if scripts are running or not, plus it's easier to get an overview of the state of things, and much easier to suspend or modify execution. -- GavinPanella, 2011-11-18
That certainly fits into the monster-project camp. It raises issues with authorisation (who can edit what jobs), delegation, and separation of concerns (should scripts from other projects use a separate service? what about scripts that affect two projects?). You can't tell whether something is or isn't running unless you have process group tracking, e.g. upstart or systemd, and that's another whole kettle of fish to code. It also drives rigidity into the system, as you need to know all the machines you can dispatch on, and then need both push and pull rules for running tasks. There is also security to consider, as you can't run an insecure script on e.g. the machine with the Ubuntu signing keys. -- RobertCollins
> That certainly fits into the monster-project camp

Yes, perhaps :) Though not ''that'' monster. For example, Celery <http://ask.github.com/celery/> already addresses several parts of the puzzle.

> it raises issues with authorisation (who can edit what jobs), delegation

Having all crontabs in a single branch (lp-production-crontabs) and most configuration in a single branch (lp-production-configs) shows that we're not addressing this problem as it is.

> separation of concerns (should scripts from other projects use a separate service? what about scripts that affect two projects?)

I don't really know what you mean here. Do you mean U1, for example?

> can't tell whether something is or isn't running unless you have process group tracking, e.g. upstart or systemd, and that's another whole kettle of fish to code

upstart, systemd, daemontools, and almost certainly other software address these problems, so there are either solutions or inspiration should we need them. We do control the scripts we run. If things are spawning crap and not keeping track of it, we should fix the scripts. That's true now; cron doesn't care what gets spawned.

> it also drives rigidity into the system, as you need to know all the machines you can dispatch on, and then need both push and pull rules for running tasks. There is also security to consider, as you can't run an insecure script on e.g. the machine with the Ubuntu signing keys.

I'm not thinking of entirely cloud-like, homogeneous-computing-resource job dispatching. I imagine something where each machine is set up in advance with the environments for each job type we will want to run on it (and each job type is set up on at least two machines for HA). The scheduling of those tasks takes place elsewhere, and the tasks can decide exactly how much configuration they're willing to accept from outside - often none at all (i.e. configuration is provided by other means). For example, a centralized (though configured for HA) process can dispatch a "check bug watches" task. Any one of the machines able to service that request does so. The centralized service does not dispatch another "check bug watches" task until the most recent one has reported back, and so on. This relies upon a reliable messaging system, which we do not yet have, but that is achievable. If there are really security-sensitive things then they could still be put on their own isolated machines, and we could keep crontabs for those. A system like this would give us an overview of what's in progress (in that respect it is a superset of scriptactivity), what's coming up, what's delayed, and so on. It would be easier to quiesce the whole Launchpad application, or subsets of it, and to adjust job schedules. HA for jobs would also be fairly simple to achieve.
-- GavinPanella, 2011-11-21
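(For illustration only: a rough sketch of the central-dispatch shape described above, assuming Celery with a beat scheduler. The broker URL, queue and task names are placeholders. Note that beat by itself dispatches on a fixed schedule; the "don't dispatch again until the previous run reports back" behaviour would still need chaining or a lock on top.)

{{{#!python
# Placeholder broker URL, queue and task names -- illustration only.
from celery import Celery

app = Celery('lp_jobs', broker='amqp://broker.example//')

# The beat scheduler (one centrally run process) decides *when* jobs
# are dispatched; any worker listening on the right queue decides *where*.
app.conf.beat_schedule = {
    'check-bug-watches': {
        'task': 'lp_jobs.check_bug_watches',
        'schedule': 300.0,          # seconds: every five minutes
        'options': {'queue': 'bugwatches'},
    },
}


@app.task(name='lp_jobs.check_bug_watches', acks_late=True)
def check_bug_watches():
    # acks_late means the message is acknowledged only after the task
    # finishes, so if this worker dies another machine can pick it up.
    ...
}}}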
I think this is talking about another issue entirely - there are lots of facets to scheduling, cron, etc., and reporting is one aspect. Locking and concurrency issues still exist in e.g. Celery - you need concurrency-safe consumers in that environment, or a big unscalable inner lock. (It's not an issue for us, but consider scheduling for a farm of 20K servers in 5 datacenters.) Centralised control isn't a good answer here; federatable is - but as soon as you're talking about locking, federation isn't sufficient. ScriptActivity is clearly federatable, if we were to build it on Cassandra. Now, the 20K scale doesn't apply to us, but that's only a thought experiment to make *clear* the issues that evolve: they all appear at much smaller scales, and I'm sure we'd run into them if we were to try for one big system to rule them all. -- RobertCollins