Removing script activity from the LP database
Launchpad currently includes an operational tool, 'script activity', that reports on scripts which fail to run. Because this is tied into the LP core, it is not readily usable by components we split out of Launchpad itself. We would like to keep the same reporting facilities but permit them to work on split-out components. Further, it might be nice to let other things within Canonical get the same functionality.
Contact: RobertCollins
On Launchpad: https://bugs.launchpad.net/launchpad-project/+bugs?field.tag=scriptactivity
This could grow into a monster project if we were to re-examine the base requirements; at this point we just need to permit the existing functionality on non-LP-core scripts.
Rationale
We are doing this because, as our service-oriented architecture expands, we have scripts that need monitoring. Nagios can be configured to report on scripts that don't run, but only indirectly via log files and regexes; scriptactivity was added to LP to provide something simpler, and consultation with IS has confirmed that it would still be beneficial there.
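(For concreteness, a rough sketch of what a scriptactivity-style recorder captures: each run records the script's name, the host it ran on, and when it started and finished; questions like "has it run lately?" can be derived from that. The function and store below are illustrative, not the actual Launchpad code.)

{{{#!python
# Illustrative sketch only -- not the actual Launchpad scriptactivity code.
import datetime
import socket


def record_script_activity(store, name, date_started, date_completed):
    """Record one successful script run in a shared store.

    store is assumed to expose an add(row) method (e.g. a thin wrapper
    over a database table or a small web service client).
    """
    store.add({
        'name': name,
        'hostname': socket.gethostname(),
        'date_started': date_started,
        'date_completed': date_completed,
    })


# Typical use at the end of a cron script:
#   started = datetime.datetime.utcnow()
#   ... do the work ...
#   record_script_activity(store, 'nightly-example-script',
#                          started, datetime.datetime.utcnow())
}}}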
Our users probably don't care at all ;). We might in future offer this as a service; it is in principle something reusable.
Stakeholders
- GSA
- OSA
- LP TA
Possibly also U1/LS/ISD architects.
User stories
As a Developer
I want to easily set up my scripts so that failures to run are reported
so that my team and I know when things don't run
As a Sysadmin
I want to find out when a script last ran
so that I can tell how long something has been broken
As a Sysadmin
I want to be able to have Nagios alert when scripts have missed their deadline
so that we can be told about problems automatically (see the check sketch after these stories)
As a Sysadmin
I want to be able to tell scriptactivity that a script is no longer expected
so that when machines or scripts go away we don't get nagged
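(The last two stories amount to a Nagios 'freshness' check: alert when the most recent recorded run of a script is older than its expected interval, unless the script has been marked as no longer expected. A rough sketch, assuming some lookup supplies the last completed time; the names here are illustrative, not an existing plugin.)

{{{#!python
# Sketch of a Nagios-style freshness check; exit codes follow the
# Nagios plugin convention (0=OK, 2=CRITICAL, 3=UNKNOWN).
import datetime
import sys


def check_script(name, max_age, last_completed, still_expected=True):
    """Return a (status, message) pair for one monitored script.

    last_completed is the datetime of the most recent recorded run, or
    None if the script has never reported; still_expected lets us
    silence scripts that have been retired.
    """
    if not still_expected:
        return 0, '%s is no longer expected; OK' % name
    if last_completed is None:
        return 3, '%s has never reported' % name
    age = datetime.datetime.utcnow() - last_completed
    if age > max_age:
        return 2, '%s last ran %s ago (limit %s)' % (name, age, max_age)
    return 0, '%s last ran %s ago' % (name, age)


if __name__ == '__main__':
    # Example: a script expected at least daily, last seen 30 hours ago.
    status, message = check_script(
        'nightly-example-script', datetime.timedelta(hours=25),
        datetime.datetime.utcnow() - datetime.timedelta(hours=30))
    print(message)
    sys.exit(status)
}}}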
Constraints and Requirements
Must
- Allow us to delete scriptactivity from LP.
- Have Nagios integration.
- Permit multiple teams to use it.
- Support Python scripts.
Nice to have
- Support sh scripts.
Must not
Out of scope
Subfeatures
Success
How will we know when we are done?
How will we measure how well we have done?
Thoughts?
I vote for a minimal solution (e.g. XML-RPC call back to Launchpad). I think scriptactivity is a flawed approach, and we should conserve our energy and time to consider something different, especially given the vast efforts required to get a new service rolled out. -- GavinPanella, 2011-11-18
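(For illustration of the 'minimal solution' shape: a split-out script could report each run over XML-RPC to a small recording endpoint. The endpoint URL and method name below are assumptions for the sketch, not necessarily the interface we would end up with.)

{{{#!python
# Sketch only: the endpoint URL and method name are placeholders.
import datetime
import socket
import xmlrpc.client


def report_run(name, date_started, date_completed,
               endpoint='http://launchpad.example/scriptactivity'):
    """Report one completed script run back to a central recording service."""
    proxy = xmlrpc.client.ServerProxy(endpoint)
    # XML-RPC has a native dateTime type, so datetime objects pass
    # through without manual formatting.
    proxy.recordSuccess(name, socket.gethostname(),
                        date_started, date_completed)


# A cron script would call this just before exiting successfully:
#   started = datetime.datetime.utcnow()
#   ... do the work ...
#   report_run('nightly-example-script', started,
#              datetime.datetime.utcnow())
}}}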
Flawed how? -- RobertCollins
 * Managing jobs in tons of crontabs is a pain. Pausing jobs during a release might require changes to several crontabs.
  * Though it works well with source code control.
 * Multiple machines each with their own scheduler (i.e. crond) seems reliable against machine failure.
  * However, afaik, there is no HA by default. If a machine goes down, jobs must be migrated by hand.
  * HA with cron is hard because of locking. Don't want the same script running in two places.
 * A single place to synchronously record activity - scriptactivity - is a point of failure. This is mitigated by the fact that the Launchpad database is essential to most (all?) scripts; if the database is down there's no point worrying about recording script runs.
-- GavinPanella, 2011-11-21
Fwiw, I think it would be better to dispatch jobs from a central location. A system like that inherently knows if scripts are running or not, plus it's easier to get an overview of the state of things, and much easier to suspend or modify execution. -- GavinPanella, 2011-11-18
That certainly fits into the monster-project camp. It raises issues with authorisation (who can edit what jobs), delegation, and separation of concerns (should scripts from other projects use a separate service? what about scripts that affect two projects?). You can't tell whether something is or isn't running unless you have process group tracking, e.g. upstart or systemd, and that's another whole kettle of fish to code. It also drives rigidity into the system, as you need to know all the machines you can dispatch on, and then need both push and pull rules for running tasks. There is also security to consider, as you can't run an insecure script on e.g. the machine with the Ubuntu signing keys. -- RobertCollins
> That certainly fits into the monster-project camp

Yes, perhaps :) Though not ''that'' monster. For example, Celery <http://ask.github.com/celery/> already addresses several parts of the puzzle.

> it raises issues with authorisation (who can edit what jobs), delegation

Having all crontabs in a single branch (lp-production-crontabs) and most configuration in a single branch (lp-production-configs) shows that we're not addressing this problem as it is.

> separation of concerns (should scripts from other projects use a separate service? what about scripts that affect two projects?)

I don't really know what you mean here. Do you mean U1, for example?

> can't tell whether something is or isn't running unless you have process group tracking, e.g. upstart or systemd, and that's another whole kettle of fish to code

upstart, systemd, daemontools, and almost certainly other software address these problems, so there are either solutions or inspiration should we need them. We do control the scripts we run. If things are spawning crap and not keeping track of it, we should fix the scripts. That's true now; cron doesn't care what gets spawned.

> it also drives rigidity into the system, as you need to know all the machines you can dispatch on, and then need both push and pull rules for running tasks. There is also security to consider, as you can't run an insecure script on e.g. the machine with the Ubuntu signing keys.

I'm not thinking of entirely cloud-like, homogeneous-computing-resource job dispatching. I imagine something where each machine is set up in advance with the environments for each job type we will want to run on it (and each job type is set up on at least two machines for HA). The scheduling of those tasks takes place elsewhere, and the tasks can decide exactly how much configuration they're willing to accept from outside - often none at all (i.e. configuration is provided by other means). For example, a centralized (though configured for HA) process can dispatch a "check bug watches" task. Any one of the machines able to service that request does so. The centralized service does not dispatch another "check bug watches" task until the most recent one has reported back, and so on. This relies upon a reliable messaging system, which we do not yet have, but that is achievable. If there are really security-sensitive things then they could still be put on their own isolated machines, and we could keep crontabs for those. A system like this would give us an overview of what's in progress (in that respect it is a superset of scriptactivity), what's coming up, what's delayed, and so on. It would be easier to quiesce the whole Launchpad application, or subsets of it, and to adjust job schedules. HA for jobs would also be fairly simple to achieve.
-- GavinPanella, 2011-11-21
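(For illustration only: a rough sketch of the central-dispatch shape described above, assuming Celery with a beat scheduler. The broker URL, queue and task names are placeholders. Note that beat by itself dispatches on a fixed schedule; the "don't dispatch again until the previous run reports back" behaviour would still need chaining or a lock on top.)

{{{#!python
# Placeholder broker URL, queue and task names -- illustration only.
from celery import Celery

app = Celery('lp_jobs', broker='amqp://broker.example//')

# The beat scheduler (one centrally run process) decides *when* jobs
# are dispatched; any worker listening on the right queue decides *where*.
app.conf.beat_schedule = {
    'check-bug-watches': {
        'task': 'lp_jobs.check_bug_watches',
        'schedule': 300.0,          # seconds: every five minutes
        'options': {'queue': 'bugwatches'},
    },
}


@app.task(name='lp_jobs.check_bug_watches', acks_late=True)
def check_bug_watches():
    # acks_late means the message is acknowledged only after the task
    # finishes, so if this worker dies another machine can pick it up.
    ...
}}}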
I think this is talking about another issue entirely - there are lots of facets to scheduling, cron, etc., and reporting is one aspect. Locking and concurrency issues still exist in e.g. Celery - you need concurrency-safe consumers in that environment, or a big unscalable inner lock. (It's not an issue for us, but consider scheduling for a farm of 20K servers in 5 datacenters.) Centralised control isn't a good answer here; federatable is - but as soon as you're talking about locking, federation isn't sufficient. ScriptActivity is clearly federatable, if we were to build it on Cassandra. Now, the 20K scale doesn't apply to us, but that's only a thought experiment to make *clear* the issues that evolve: they all appear at much smaller scales, and I'm sure we'd run into them if we were to try for one big system to rule them all. -- RobertCollins