Overview

An OOPS is an error report - like a traceback or crash dump, but annotated with additional data about the thing that went wrong. Common annotations are the hostname, the time of the problem, a timeline of external operations made during the program or request that failed, CGI parameters for web requests, and so forth. OOPS defines keys for a BSON document, which lets programs written in different languages interoperate freely.
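As a rough illustration of what such an annotated report might look like, the sketch below builds one as a plain Python dict. The key names (id, type, value, hostname, time, timeline, req_vars), the timeline layout and all the values are assumptions chosen to mirror the annotations listed above, not a statement of the official OOPS key set. JSON stands in for BSON so that the example needs only the standard library.

```python
import datetime
import json
import socket


def build_example_report(exc):
    """Return a dict shaped like an annotated error report (illustrative keys only)."""
    return {
        "id": "OOPS-example-1",  # hypothetical report id
        "type": type(exc).__name__,
        "value": str(exc),
        "hostname": socket.gethostname(),
        "time": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        # Assumed layout for a timeline entry: [start_ms, end_ms, category, detail].
        "timeline": [[0, 12, "SQL", "SELECT 1"]],
        # Request variables, as a web framework might supply them.
        "req_vars": {"REQUEST_METHOD": "GET", "PATH_INFO": "/example"},
    }


try:
    1 / 0
except ZeroDivisionError as error:
    report = build_example_report(error)

# Any language with a JSON (or BSON) library can read the same keys back.
serialized = json.dumps(report, sort_keys=True)
print(serialized)
```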

Launchpad developed the OOPS system to get detailed diagnostic information on errors that happen rarely or in hard-to-reproduce circumstances, so that we can diagnose and correct issues more efficiently.

There are a number of separate projects that make up the OOPS stack.

Console and tools

Plumbing

Getting started

There are three components in a minimal OOPS system:

A console to display and collate error reports.
A transport linking the console and the publishers together.
One or more publishers creating error reports.

Console

Using buildout, pip or a similar tool, install oops-tools. It's a pretty standard Django application. QA/OopsToolsSetup contains information on how it is set up and configured for Launchpad.

Transport

Typically AMQP is used for the transport. Install your favourite broker (e.g. RabbitMQ).

Oops-tools contains a script, amqp2disk, which will pull error reports out of AMQP, write them to a disk repository and inject them into the Django analysis console at the same time. Pick a directory to receive these reports, then:

mkdir /path/to/output
amqp2disk --host <amqp host> --password <amqp password> --username <amqp user>  --vhost <amqp vhost> --output /path/to/output --queue=oops-tools --bind-to=oopses

This will create a queue (oops-tools) and an exchange (oopses) and bind them together. It will then start consuming messages from the oops-tools queue. You can CTRL-C it when you want to stop it. The queue and exchange are persistent, so error reports will not be lost even if the consumer is not running. The next time it's run, you do not need the bind-to parameter.

A fresh install of RabbitMQ on localhost, for instance, has a guest user and a default vhost of / - so you might run it as

amqp2disk --host localhost --password guest --username guest  --vhost / --output /path/to/output --queue=oops-tools --bind-to=oopses

Publishing

The exact way to set up publishing depends on your environment: are you running scripts, a Django or Zope web app, a Twisted daemon, or something else?

In all cases you need a config object which encapsulates where to publish, what hooks are to be applied to the report and what filters are to be used.

So this becomes a two-step process: set up your config object, then hook it into your environment somewhere.
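To make the roles of publishers, hooks and filters concrete, here is a minimal stand-alone sketch of the pattern. SketchConfig is a hypothetical stand-in written for this page, not the oops library's class, and the attribute names on_create and filters, along with the helper functions, are assumptions for illustration.

```python
class SketchConfig:
    """Stand-in config: hooks fill in a report, filters may drop it,
    publishers send it somewhere. Not the oops library's class."""

    def __init__(self):
        self.on_create = []   # callables(report, context) that annotate the report
        self.filters = []     # callables(report) returning True to drop the report
        self.publishers = []  # callables(report) returning an id, or None

    def create(self, context=None):
        """Build a new report by running every hook over an empty dict."""
        report = {}
        for hook in self.on_create:
            hook(report, context or {})
        return report

    def publish(self, report):
        """Return the ids assigned by publishers, or None if a filter dropped it."""
        if any(should_drop(report) for should_drop in self.filters):
            return None
        ids = []
        for publisher in self.publishers:
            report_id = publisher(report)
            if report_id:
                ids.append(report_id)
        return ids


published = []


def list_publisher(report):
    """Hypothetical publisher that just keeps reports in memory."""
    published.append(report)
    return "OOPS-%d" % len(published)


def add_exception_type(report, context):
    """Hypothetical hook copying the exception type out of the context."""
    report["type"] = context.get("type", "Unknown")


config = SketchConfig()
config.on_create.append(add_exception_type)
config.filters.append(lambda report: report.get("type") == "KeyboardInterrupt")
config.publishers.append(list_publisher)

# The first report is published; the second is dropped by the filter.
print(config.publish(config.create({"type": "ValueError"})))
print(config.publish(config.create({"type": "KeyboardInterrupt"})))
```

Step one is everything up to the last append; step two is whatever calls create and publish from your script, web app or daemon.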

AMQP publishing

from functools import partial

from amqplib import client_0_8 as amqp
import oops
import oops_amqp

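# The config object holds the publishers, hooks and filters.
config = oops.Config()
# The publisher takes a connection factory rather than a live connection.
conn_factory = partial(amqp.Connection, host="localhost:5672",
    userid="guest", password="guest", virtual_host="/", insist=False)
# Publish to the "oopses" exchange with an empty routing key.
publisher = oops_amqp.Publisher(conn_factory, "oopses", "")
config.publishers.append(publisher)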

At this point you can emit oopses via the config:

print(config.publish(config.create()))

WSGI

Publishing OOPS error reports from a WSGI app is pretty straightforward: just wrap your application and install the WSGI hooks.

import oops_wsgi

oops_wsgi.install_hooks(config)
application = oops_wsgi.make_app(application, config)

At this point, unhandled exceptions from the contained app, or calls to start_response with an exc_info parameter, will generate an OOPS. The OOPS id will be present in a new header X-OOPS-ID, and where safe, an error page showing this will be given to the user.
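The behaviour described above can be sketched with a small piece of WSGI middleware. This is an illustration of the general technique using only the standard library, not the implementation in oops_wsgi; make_sketch_app, record_error and the id format are hypothetical names invented for the example.

```python
import sys


def make_sketch_app(app, report_error):
    """Wrap a WSGI app so unhandled exceptions are reported and answered with a 500.

    report_error is a hypothetical callable taking exc_info and returning an id.
    """
    def wrapped(environ, start_response):
        try:
            result = app(environ, start_response)
            try:
                # Materialise the body so errors raised while iterating are caught too.
                return list(result)
            finally:
                if hasattr(result, "close"):
                    result.close()
        except Exception:
            exc_info = sys.exc_info()
            oops_id = report_error(exc_info)
            body = ("Something went wrong. Reference: %s\n" % oops_id).encode("utf-8")
            headers = [
                ("Content-Type", "text/plain; charset=utf-8"),
                ("Content-Length", str(len(body))),
                ("X-OOPS-ID", oops_id),
            ]
            # Passing exc_info lets the server replace headers that were set but
            # not yet sent; if output has already started, the server re-raises.
            start_response("500 Internal Server Error", headers, exc_info)
            return [body]
    return wrapped


reports = []


def record_error(exc_info):
    """Hypothetical reporter: remember the exception and hand back an id."""
    reports.append(exc_info[1])
    return "OOPS-sketch-%d" % len(reports)


def failing_app(environ, start_response):
    raise RuntimeError("boom")


application = make_sketch_app(failing_app, record_error)
```

Buffering the whole response body keeps the sketch short, but it gives up streaming; a production wrapper would more likely iterate lazily and track whether any output has been sent before deciding that an error page is safe.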

Diagnosing OOPSes in anger

There is an analyze_error_reports script which will give you a breakdown of the errors happening.

$ bin/analyze_error_reports -r <report> --from <today>
