'''''Talk to the product strategist soon after cutting a first draft of this document'''''

= Removing script activity from the LP database =

Currently Launchpad includes an operational tool, 'script activity', that reports on scripts which fail to run. Because this is tied into the LP core, it's not readily usable by components we split out of Launchpad itself. We would like to keep the same reporting facilities but permit them to work on split-out components. Further, it might be nice to let other things within Canonical get the same functionality.

'''Contact:''' RobertCollins<<BR>>
'''On Launchpad:''' https://bugs.launchpad.net/launchpad-project/+bugs?field.tag=scriptactivity

This could grow into a monster project if we were to reexamine the base requirements; at this point we need only permit the existing functionality on non-LP-core scripts.

''Link this from [[LEP]]''

== Rationale ==

We are doing this because, as our service-oriented architecture expands, we have more scripts that need monitoring. Nagios can be configured to report on scripts that don't run, using indirection via log files and regexes; scriptactivity was added to LP to provide something simpler, and consultation with IS has confirmed that it would still be beneficial there. Our users probably don't care at all ;). We might in future offer this as a service; it is in principle something reusable.

== Stakeholders ==

 * GSA
 * OSA
 * LP TA

Possibly also U1/LS/ISD architects.

== User stories ==

=== $STORY_NAME ===

'''As a ''' Developer<<BR>>
'''I want ''' to easily set up my scripts so that failures to run are notified<<BR>>
'''so that ''' my team and I know when things don't run

'''As a ''' Sysadmin<<BR>>
'''I want ''' to find out when a script last ran<<BR>>
'''so that ''' I can find out how long something has been broken for

'''As a ''' Sysadmin<<BR>>
'''I want ''' Nagios to alert when scripts have missed their deadline<<BR>>
'''so that ''' we can be told about problems automatically

'''As a ''' Sysadmin<<BR>>
'''I want ''' to be able to tell scriptactivity that a script is no longer expected<<BR>>
'''so that ''' we don't get nagged when machines or scripts go away
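The monitoring these stories describe could be as small as a single call at the end of each script run. Purely as an illustrative sketch - the service URL and the `record_activity` method here are assumptions, not an existing API - this is roughly what the reporting side might look like if done over XML-RPC (the minimal approach Gavin suggests in the Thoughts section below):

{{{#!python
# Hypothetical sketch only: the endpoint URL and the record_activity
# method are assumptions, not a real scriptactivity API.
import socket
import xmlrpc.client
from datetime import datetime, timezone

SCRIPTACTIVITY_URL = "http://scriptactivity.example.internal/xmlrpc"  # assumed

def report_script_run(name, started, finished):
    """Record one completed run of the named script."""
    proxy = xmlrpc.client.ServerProxy(SCRIPTACTIVITY_URL)
    proxy.record_activity(name, socket.gethostname(),
                          started.isoformat(), finished.isoformat())

if __name__ == "__main__":
    start = datetime.now(timezone.utc)
    # ... the script's real work would go here ...
    report_script_run("my-script", start, datetime.now(timezone.utc))
}}}

Anything that records the script name, host, and finish time somewhere queryable would satisfy the stories above; XML-RPC is only one possible transport.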
== Constraints and Requirements ==

=== Must ===

 * Allow us to delete scriptactivity from LP.
 * Have Nagios integration.
 * Permit multiple teams to use it.
 * Support Python scripts.

=== Nice to have ===

 * Support sh scripts.

=== Must not ===

=== Out of scope ===

== Subfeatures ==

== Success ==

=== How will we know when we are done? ===

=== How will we measure how well we have done? ===

== Thoughts? ==

 * I vote for a minimal solution (e.g. an XML-RPC call back to Launchpad). I think scriptactivity is a flawed approach, and we should conserve our energy and time to consider something different, especially given the vast effort required to get a new service rolled out. -- GavinPanella, 2011-11-18
 * Flawed how? -- RobertCollins
 * {{{
* Managing jobs in tons of crontabs is a pain. Pausing jobs during a
  release might require changes to several crontabs.
  * Though it works well with source code control.
* Multiple machines each with their own scheduler (i.e. crond) seems
  reliable against machine failure.
  * However, afaik, there is no HA by default. If a machine goes down,
    jobs must be migrated by hand.
  * HA with cron is hard because of locking: we don't want the same
    script running in two places (easy on one machine - see the sketch
    after this comment - but not across machines).
* A single place to synchronously record activity - scriptactivity -
  is a point of failure. This is mitigated by the fact that the
  Launchpad database is essential to most (all?) scripts; if the
  database is down there's no point worrying about recording script
  runs.
}}} -- GavinPanella, 2011-11-21
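As an aside, the locking point above is easy to make concrete. A minimal sketch, using only the Python standard library: the usual guard against the same script running twice is an exclusive file lock, but the lock file lives on a local filesystem, so it protects a single machine only - which is exactly why HA across machines is the hard part. The lock path here is hypothetical.

{{{#!python
# Minimal per-machine lock sketch using fcntl.flock. This stops two
# concurrent runs on one host; it cannot coordinate across hosts.
import fcntl
import sys

LOCK_PATH = "/var/lock/my-script.lock"  # hypothetical path

def main():
    with open(LOCK_PATH, "w") as lock_file:
        try:
            # Non-blocking: if another run already holds the lock, bail out.
            fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            print("another instance is already running; exiting")
            return 1
        # ... the script's real work happens while the lock is held ...
        return 0

if __name__ == "__main__":
    sys.exit(main())
}}}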
 * Fwiw, I think it would be better to dispatch jobs from a central location. A system like that inherently ''knows'' if scripts are running or not, plus it's easier to get an overview of the state of things, and much easier to suspend or modify execution. -- GavinPanella, 2011-11-18
 * That certainly fits into the monster-project camp :) - it raises issues with authorisation (who can edit what jobs), delegation, and separation of concerns (scripts from other projects should use a separate service? what about scripts that affect two projects?). You can't tell if something is or isn't running unless you have process group tracking, e.g. upstart or systemd, and that's another whole kettle of fish to code. It also drives rigidity into the system, as you need to know all the machines you can dispatch on, and then need both push and pull rules for running tasks. Also security, as you can't run an insecure script on e.g. the machine with Ubuntu signing keys. -- RobertCollins
 * {{{
> That certainly fits into the monster-project camp

Yes, perhaps :) Though not ''that'' monster. For example, Celery already
addresses several parts of the puzzle.

> it raises issues with authorisation (who can edit what jobs),
> delegation

Having all crontabs in a single branch (lp-production-crontabs) and most
configuration in a single branch (lp-production-configs) shows that we're
not addressing this problem as it is.

> separation of concerns (scripts from other projects should use a
> separate service? what about scripts that affect two projects?)

I don't really know what you mean here. Do you mean U1, for example?

> can't tell if something is or isn't running unless you have process
> group tracking, e.g. upstart or systemd, and that's another whole
> kettle of fish to code.

upstart, systemd, daemontools, and almost certainly other software address
these problems, so there are either solutions or inspiration should we
need them. We do control the scripts we run. If things are spawning crap
and not keeping track of it, we should fix the scripts. That's true now;
cron doesn't care what gets spawned.

> It also drives rigidity into the system, as you need to know all the
> machines you can dispatch on, and then need both push and pull rules
> for running tasks. Also security, as you can't run an insecure
> script on e.g. the machine with Ubuntu signing keys.

I'm not imagining entirely cloud-like job dispatching over a homogeneous
computing resource. I imagine something where each machine is set up in
advance with the environments for each job type we will want to run on it
(and each job type is set up on at least two machines for HA). The
scheduling of those tasks takes place elsewhere, and the tasks can decide
exactly how much configuration they're willing to accept from outside,
often none at all (i.e. configuration will be provided by other means).

For example, a centralized (though configured for HA) process can dispatch
a "check bug watches" task. Any one of the machines able to service that
request does so. The centralized service does not dispatch another "check
bug watches" task until the most recent one has reported back, etc. This
relies upon a reliable messaging system, which we do not yet have, but
that is achievable.

If there are really security-sensitive things then they could still be put
on their own isolated machines, and we could keep crontabs for those.

A system like this would give us an overview of what's in progress (in
that respect it is a superset of scriptactivity), what's coming up, what's
delayed, and so on. It would be easier to quiesce the whole Launchpad
application, or subsets of it, and to adjust job schedules. HA for jobs
would also be fairly simple to achieve.
}}} -- GavinPanella, 2011-11-21
 * I think this is talking about another issue entirely - there are lots of facets to scheduling, cron, etc., and reporting is one aspect. Locking and concurrency issues still exist in e.g. Celery - you need concurrency-safe consumers in that environment, or a big unscalable inner lock. (It's not an issue for us, but consider scheduling for a farm of 20K servers in 5 datacentres.) Centralised control isn't a good answer here; federatable is - but as soon as you're talking locking, federation isn't sufficient. ScriptActivity is clearly federatable, if we were to build it on Cassandra. Now, the 20K scale doesn't apply to us, but that's only a thought experiment to make *clear* the issues that evolve: they all appear at much smaller scales, and I'm sure we'd run into them if we were to try for one big system to rule them all. -- RobertCollins
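To make the Nagios-integration requirement above concrete, here is a hedged sketch of what a check might look like. The `get_last_run` lookup is a placeholder - it depends entirely on how the scriptactivity replacement exposes its data - but the exit codes follow the standard Nagios plugin convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

{{{#!python
# Illustrative sketch of a Nagios check for script freshness.
# get_last_run() is a placeholder: the real lookup depends on the
# eventual service API.
import sys
from datetime import datetime, timedelta, timezone

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

def get_last_run(script_name):
    """Placeholder: fetch the most recent recorded run time (UTC)."""
    raise NotImplementedError("depends on the eventual service API")

def check(script_name, deadline):
    try:
        last_run = get_last_run(script_name)
    except Exception as e:
        print("UNKNOWN: could not query scriptactivity: %s" % e)
        return UNKNOWN
    age = datetime.now(timezone.utc) - last_run
    if age > deadline:
        print("CRITICAL: %s last ran %s ago (deadline %s)"
              % (script_name, age, deadline))
        return CRITICAL
    print("OK: %s last ran %s ago" % (script_name, age))
    return OK

if __name__ == "__main__":
    # e.g.: check_scriptactivity my-script 24   (deadline in hours)
    sys.exit(check(sys.argv[1], timedelta(hours=int(sys.argv[2]))))
}}}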