OOPS Display/analysis/processing

Contact: RobertCollins
On Launchpad: https://bugs.launchpad.net/launchpad-project/+bugs?field.tag=oops-handling

As a Technical architect
I want oops display/analysis/processing to be scalable, efficient and extensible
so that developers can analyse production issues more efficiently

Rationale

Oopses are a key part of our production issue reaction and analysis workflow, but our current toolchain is missing some features we expect would help us. Improving it will allow developers to gather fresh data more easily.

Our users care that we fix problems promptly; making it easier to fix problems helps us deliver fixes to them faster.

Stakeholders

RobertCollins -- LP Technical Architect

Other departments at Canonical are interested too

Landscape
Ubuntu One
ISD are trialling Sentry but will reevaluate in 6-12 months

Constraints and Requirements

Must

We need to solve a number of problems:

The way separate crashes are collated is hard to customise (and I think we need customisation for different sources of problems).
Reporting and ad hoc querying are very high latency at the moment. Only a few people know how to query, and it's basically manual.
The service doesn't scale to our requirements (multi-tenancy and extreme load spikes).

Launchpad has the following constraints/requirements:

< 60 seconds 99% of the time from OOPS generation to viewability (get it there quickly)
full text search of OOPS contents (find occurrences of an exception/line of code/page id)
prompt and useful garbage collection (with LP integration for references). (do not waste space)
1M oops/day capacity from the initial design [we generate 30K/day at the moment; this allows for a large spike, or a 1% soft failure rate on 100 million web requests a day (we do 6M a day at the moment, so this is a single order-of-magnitude scaling buffer)]. (Be prepared for reasonable growth)
SQL slow request + repeated request discovery (do the performance analysis the current tools do).
- not sure what this means -- jml
- This means that the details in the oops about which queries were slow and which were repeated must also be provided by any replacement system.
Collation by (contextlabel, exception type) tuples. Things that are LP web pages can use the page id as the contextlabel, scripts would choose one making sense for their context - e.g. 'checkwatches' for the checkwatches script. (aggregate reliably)
Easy deployment on development machines (so the whole team can contribute to the stack).
LOSA deployable and maintainable (be part of the production environment).
Gracefully handle unknown or missing fields in an OOPS (handle skew as we add new information before we decide how to deal with it). This also covers ad hoc structured/semi-structured data - e.g. a task id, delegation reference for a user etc: unknown fields should be rendered in some conservative fashion.
Working with OOPS logging in tests needs to be easy.
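
The collation requirement above can be illustrated with a minimal sketch. It assumes each OOPS is available as a dict, and the field names 'contextlabel', 'type' and 'id' as well as the sample values are hypothetical, chosen only for this example; it is not the collation code of any existing tool.

```python
from collections import defaultdict


def collation_key(oops):
    """Return the (contextlabel, exception type) tuple used for grouping.

    The field names are assumptions for this sketch; missing values fall
    back to 'unknown' so that incomplete reports still collate somewhere.
    """
    return (oops.get("contextlabel", "unknown"), oops.get("type", "unknown"))


def collate(reports):
    """Group OOPS ids by their collation key."""
    groups = defaultdict(list)
    for oops in reports:
        groups[collation_key(oops)].append(oops["id"])
    return dict(groups)


# Made-up sample reports: two from a web page (page id as contextlabel)
# and one from a script that picked its own label.
reports = [
    {"id": "OOPS-1", "contextlabel": "BugTask:+index", "type": "TimeoutError"},
    {"id": "OOPS-2", "contextlabel": "checkwatches", "type": "KeyError"},
    {"id": "OOPS-3", "contextlabel": "BugTask:+index", "type": "TimeoutError"},
]
collated = collate(reports)
```

Because the key is an ordinary function, a different source of problems could supply its own key function, which is one possible way to get the customisation asked for in the 'need to solve' list.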

UbuntuOne has the following constraints/requirements:

traverse untrusted networks (oops reporting system in Canonical DC, server on ec2)
tolerate transient network failures (e.g. don't crash in the generator, don't lose all the oops)
emit signals of some sort for event system integration (e.g. graphite, tuolumne etc).
allow arbitrary data above some core (e.g. datetime, 'type of oops')
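
One way to read "arbitrary data above some core", together with the Launchpad requirement to render unknown fields conservatively, is sketched below. The choice of core fields ('id', 'time', 'type') and the output layout are assumptions made for illustration only.

```python
import json

# Assumed core fields for this sketch; the real core set is not specified here.
CORE_FIELDS = ("id", "time", "type")


def render_oops(oops):
    """Render an OOPS dict as text without failing on missing or extra fields."""
    lines = []
    for name in CORE_FIELDS:
        # A missing core field is reported, not treated as an error.
        lines.append("%s: %s" % (name, oops.get(name, "(missing)")))
    for name in sorted(k for k in oops if k not in CORE_FIELDS):
        # Unknown fields are shown verbatim as JSON, with no interpretation;
        # values that JSON cannot encode fall back to their repr().
        value = json.dumps(oops[name], sort_keys=True, default=repr)
        lines.append("%s (unrecognised): %s" % (name, value))
    return "\n".join(lines)


rendered = render_oops({"id": "OOPS-1", "type": "KeyError", "task_id": 42})
```

Rendering unknown values as plain JSON keeps the display code independent of whatever new fields the generators start sending.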

Nice to have

Must not

Launchpad has this negative constraint:

Must not require per-instance or per-script configuration (e.g. oops prefix must go)

Out of scope

Subfeatures

Workflows

A user encounters an OOPS; they paste the id into #launchpad, and a Launchpad engineer can look at the OOPS pretty much immediately, and then recommend a workaround to the user, file a bug, or even just sit down and fix the issue.

Success

How will we know when we are done?

The problems listed in the 'need to solve' section are solved.

How will we measure how well we have done?

Thoughts?

-very- early /possible/ design sketch:

rabbit exchange on each physical appserver [implemented w/one exchange + fallback cron job]
shovel or similar to move stuff from a queue on that appserver to the exchange on the oops tools rabbit broker [implemented]
lucene for indexing
cassandra data store (because there's nearly nothing relational in the system, and this will scale to run many different projects easily)
json for oops serialisation [used bson]
md5 (json bytes) for oops ids [used bson]
cassandra-django stack for the web instance
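
The "md5 (json bytes) for oops ids" idea in the sketch above could look roughly like the following. This illustrates the original json-based idea only; per the bracketed notes the implementation used bson instead, and the 'OOPS-' prefix and the canonicalisation settings here are assumptions.

```python
import hashlib
import json


def oops_id(report):
    """Derive an id from the serialised report content."""
    # Sorted keys and fixed separators make the bytes deterministic, so two
    # reports with the same content always produce the same id.
    payload = json.dumps(report, sort_keys=True, separators=(",", ":"))
    digest = hashlib.md5(payload.encode("utf-8")).hexdigest()
    return "OOPS-" + digest  # the prefix is a hypothetical choice


first = oops_id({"type": "KeyError", "time": "2011-08-01T12:00:00Z"})
second = oops_id({"time": "2011-08-01T12:00:00Z", "type": "KeyError"})
other = oops_id({"type": "ValueError", "time": "2011-08-01T12:00:00Z"})
```

A content-derived id needs no per-instance prefix or counter, which fits the "must not require per-instance or per-script configuration" constraint.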
