Checkwatches: The Next Generation
The problem
Checkwatches is the collective name for the code that allows us to interact with remote bugtrackers (usually found in lp.bugs.externalbugtracker) and the cronscript that runs that code (found in lp.bugs.scripts.checkwatches).
There are a number of problems with the current checkwatches approach:
It's serial. Bug watches are checked on a per bug-tracker basis, one bug tracker after another.
It's inefficient. Whilst bug watch checking is often batched (so we request a block of bugs from the remote server instead of one at a time) there's still a significant setup cost to checking each bug tracker. For example, a single check of bugzilla.gnome.org can take up to 15 seconds, and most of that isn't spent getting watches from the remote server.
There can be only one. Only one instance of checkwatches.py runs at any one time. It's a cronscript that (currently) runs at 10-minute intervals. However, if one script run overruns it prevents the next one from taking place at all. The likelihood of an overrun increases as more bug trackers need watches checking. Since some trackers will always need the maximum number of watches checking at a time, overruns become very likely indeed.
Error reporting sucks. At the moment, checkwatches has an OOPS report, but it's not brilliant and it's difficult to separate the signal from the noise.
checkwatches always checks every watch. Even if a remote bug hasn't changed status for years, checkwatches will still try to check it once every 24 hours. When a bug tracker has a lot of bug watches (gnome-bugs has 13,000 at the time of writing) this gets pretty silly.
It's all CLI. You need a LOSA to do anything with checkwatches, and they need to play silly buggers with checkwatches on loganberry.
Possible solutions
Here are some possible solutions to the problems of checkwatches. These are a bit brain-dumpy at the moment (this isn't even a spec; it's a proto-spec at best, so feel free to point out nonsense should it appear in the following).
Go Parallel
In other words, do more at the same time, for example by:
Running multiple instances
- Each bugtracker could get its own checkwatches script. That is, there could be one master script which decides which trackers need checking and then creates instances to check those specific trackers.
- If there's already an instance for a bugtracker, the master script should have a way of communicating to that instance that there are more bugs for it to check, so it shouldn't die after its current run is complete.
Using Twisted, or another asynchronous framework
checkwatches spends most of its time blocked doing network stuff. Twisted eats that kind of stuff with babies on the side. The externalbugtracker package would be reasonably easy to convert to use Twisted.
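As a rough illustration of why an asynchronous approach helps, here is a minimal sketch using Python's standard-library asyncio rather than Twisted. Everything in it is hypothetical: check_tracker, check_all and run_checks are invented names, and the sleep merely stands in for the network round trip to a remote bug tracker. It is not how externalbugtracker works today.

```python
import asyncio
import time


async def check_tracker(name, delay):
    """Pretend to check one tracker; the sleep stands in for network I/O."""
    await asyncio.sleep(delay)
    return name


async def check_all(trackers):
    """Check every tracker concurrently; results keep the input order."""
    return await asyncio.gather(
        *(check_tracker(name, delay) for name, delay in trackers))


def run_checks(trackers):
    """Run all checks and return (tracker names, elapsed seconds)."""
    start = time.monotonic()
    results = asyncio.run(check_all(trackers))
    return results, time.monotonic() - start
```

With three trackers that each block for 0.2 seconds, the whole run should take roughly as long as the slowest check, not the sum of all three, which is the point of going asynchronous.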
Don't check watches that don't need checking or can't be checked
- If a watch hasn't changed status for more than N days, we should check it less often than other watches. For example, if our threshold is 30 days, bug watches that haven't changed for 30 days should be checked maybe twice a week instead of once a day. Watches unchanged for more than 60 days could be checked once a week and so on.
- If a watch is in a 'resolved' state we shouldn't check it as often - maybe once a week. We could combine this with the above logic ("it's been resolved for the last 120 days; check it once a month" and so on).
- 'Resolved' in this context means that the remote bug's status maps to one of (Fix Released, Invalid, Won't Fix) in Launchpad.
- If we repeatedly get errors when checking a given bug watch or bug tracker we should check that watch / tracker less often (after all, we're just wasting cycles on it otherwise). We shouldn't back off too far on these, though, because it could be a transient problem.
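The scheduling rules above could be sketched roughly as follows. This is only an illustration: check_interval and its parameters are hypothetical names, the thresholds are the example figures from the text, and the doubling error backoff with a one-week cap is an assumption rather than anything decided.

```python
from datetime import timedelta

DEFAULT_INTERVAL = timedelta(days=1)
# Assumed cap: errors may be transient, so never back off more than a week.
MAX_ERROR_BACKOFF = timedelta(days=7)


def check_interval(days_unchanged, resolved=False, consecutive_errors=0):
    """Return how long to wait before checking a watch again."""
    if resolved and days_unchanged >= 120:
        interval = timedelta(days=30)    # resolved for ages: once a month
    elif resolved or days_unchanged > 60:
        interval = timedelta(days=7)     # once a week
    elif days_unchanged >= 30:
        interval = timedelta(days=3.5)   # twice a week
    else:
        interval = DEFAULT_INTERVAL      # once a day
    if consecutive_errors > 0:
        # Double the wait for each consecutive error, up to the cap.
        factor = 2 ** min(consecutive_errors, 10)
        backoff = min(DEFAULT_INTERVAL * factor, MAX_ERROR_BACKOFF)
        interval = max(interval, backoff)
    return interval
```

For example, check_interval(45) gives three and a half days, while check_interval(5, consecutive_errors=2) gives four days. The error backoff only ever lengthens the interval, so a long-resolved watch with a single error still waits a month.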
Make it possible to control checkwatches through the UI
- Via the UI, admins should be able to:
- Tell checkwatches to force an update for a particular bug tracker (i.e. check all its watches).
- Set the remote products (where applicable) for which comments should be synced for a given bug tracker (e.g. gnome-bugs).
Provide better stats in the UI
- The UI for bug trackers should show:
- Number of watches
- Number of watches with errors (with a link through to a report of only those watches).
- Last run time for a given bug tracker
- Most common errors for a bug tracker
- Report of errors in the last run
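The per-tracker numbers listed above could be computed along these lines. This is a sketch under assumptions: summarise_watches is a hypothetical helper, each watch is modelled as a plain (remote bug id, last error) pair instead of a real database object, and the error names used are made up.

```python
from collections import Counter


def summarise_watches(watches):
    """Summarise (remote_bug_id, last_error) pairs for one bug tracker.

    last_error is None when the most recent check succeeded.
    """
    watches = list(watches)
    errors = Counter(err for _, err in watches if err is not None)
    return {
        "watch_count": len(watches),
        "error_count": sum(errors.values()),
        "most_common_errors": errors.most_common(3),
    }
```

Calling it with four watches, two of which last failed with "TIMEOUT" and one with "BUG_NOT_FOUND", reports four watches, three with errors, and "TIMEOUT" as the most common error.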
Import everything
- At the moment we have the framework in place to support importance importing along with status importing, but we do it for exactly zero bug trackers. We should rectify this.
- We have comment importing support for Bugzilla (3.4 native or 3.0-3.2 with plugin) and Trac (with plugin), but we've no statistics about how often it actually gets used.