Diff for "LEP/OopsDisplay"

Differences between revisions 3 and 4
Revision 3 as of 2011-02-22 05:54:10
Size: 4419
Editor: wgrant
Comment: Unbreak formatting slightly.
Revision 4 as of 2011-02-22 20:47:16
Size: 4786
Editor: lifeless
Comment: clarifications based on the questions aaron raised.
Deletions are marked with "-". Additions are marked with "+".
Line 46:
- * Collation by (page id, exception type) tuples (aggregate reliably).
+ * Collation by (contextlabel, exception type) tuples. Things that are LP web pages can use the page id as the contextlabel, scripts would choose one making sense for their context - e.g. 'checkwatches' for the checkwatches script. (aggregate reliably)
Line 49:
- * Gracefully handle unknown or missing fields in an OOPS (handle skew as we add new information before we decide how to deal with it).
+ * Gracefully handle unknown or missing fields in an OOPS (handle skew as we add new information before we decide how to deal with it). This also covers ad hoc structured/semi-structured data - e.g. a task id, a delegation reference for a user, etc.: unknown fields should be rendered in some conservative fashion.
Line 63:
- * Must not require per-instance configuration (e.g. oops prefix must go)
+ * Must not require per-instance or per-script configuration (e.g. oops prefix must go)

OOPS Display/analysis/processing

Contact: RobertCollins
On Launchpad: https://bugs.launchpad.net/launchpad-project/+bugs?field.tag=oops-handling

As a Technical architect
I want oops display/analysis/processing to be fast, efficient and extensible
so that developers can analyse production issues more efficiently

Rationale

Oopses are a key part of our production issue reaction and analysis workflow, but our current toolchain is insufficient to meet our needs. Improving it will allow developers to gather fresh data with less latency, reduce sysadmin interrupts when gathering such data and allow more rapid responses when something does go wrong.

Our users care that we fix problems promptly; making problems easier to fix helps us deliver fixes to users faster.

Stakeholders

Other departments at Canonical are interested too:

  • Landscape
  • Ubuntu One
  • ISD

Constraints and Requirements

Must

We need to solve a number of problems:

  • We need to decouple upgrades to what we capture and how it's analysed so that it's easier to improve things.
  • The way separate crashes are collated is hard to customise (and I think we need customisation for different sources of problems).
  • The reporting, querying and data collection are all high latency at the moment. Only a few people know how to query, and it's basically manual.
  • The service doesn't scale to our requirements: it's slow at heart.
  • The service is not part of the production environment -- but it needs to be.
  • We have per-process configuration that is a real headache for deployments.

Launchpad has the following constraints/requirements:

  • < 60 seconds 99% of the time from OOPS generation to viewability (get it there quickly)

  • full text search of OOPS contents (find occurrences of an exception/line of code/page id)
  • prompt and useful garbage collection (with LP integration for references). (do not waste space)
  • 1M oops/day capacity from the initial design [we generate 30K/day at the moment; this allows for a large spike, or a 1% soft failure rate on 100 million web requests a day (we do 6M a day at the moment, so this is a single order-of-magnitude scaling buffer)]. (Be prepared for reasonable growth)

  • SQL slow request + repeated request discovery (do the performance analysis the current tools do).
  • Collation by (contextlabel, exception type) tuples. Things that are LP web pages can use the page id as the contextlabel, scripts would choose one making sense for their context - e.g. 'checkwatches' for the checkwatches script. (aggregate reliably)
  • Easy deployment on development machines (so the whole team can contribute to the stack).
  • LOSA deployable and maintainable (be part of the production environment).
  • Gracefully handle unknown or missing fields in an OOPS (handle skew as we add new information before we decide how to deal with it). This also covers ad hoc structured/semi-structured data - e.g. a task id, a delegation reference for a user, etc.: unknown fields should be rendered in some conservative fashion.
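
The collation and unknown-field requirements above might look roughly like the following. This is only a sketch: the field names (`contextlabel`, `exception_type`, `datetime`, `id`), the `<unknown>` placeholder and the helper names are assumptions made for illustration, since this LEP does not define an OOPS schema.

```python
from collections import Counter

# Hypothetical field names; the LEP does not define an OOPS schema.
KNOWN_FIELDS = ("id", "contextlabel", "exception_type", "datetime")


def collation_key(oops):
    """Return the (contextlabel, exception type) tuple used for grouping.

    Missing fields fall back to a placeholder so that skewed or partial
    reports still aggregate instead of raising.
    """
    return (
        oops.get("contextlabel", "<unknown>"),
        oops.get("exception_type", "<unknown>"),
    )


def collate(reports):
    """Count reports per collation key."""
    return Counter(collation_key(r) for r in reports)


def render(oops):
    """Render known fields first, then unrecognised ones via repr()."""
    lines = [f"{name}: {oops[name]}" for name in KNOWN_FIELDS if name in oops]
    for name in sorted(set(oops) - set(KNOWN_FIELDS)):
        # Conservative rendering: show the raw value, interpret nothing.
        lines.append(f"{name} (unrecognised): {oops[name]!r}")
    return "\n".join(lines)


reports = [
    {"id": "a1", "contextlabel": "BugTask:+index", "exception_type": "TimeoutError"},
    {"id": "b2", "contextlabel": "checkwatches", "exception_type": "KeyError", "task_id": 42},
    {"id": "c3", "contextlabel": "BugTask:+index", "exception_type": "TimeoutError"},
    {"id": "d4"},  # a report missing both collation fields
]
groups = collate(reports)
```

Here a web page and a script share the same grouping code and differ only in the contextlabel they supply, while the report with no collation fields lands in its own placeholder group rather than breaking the run.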

UbuntuOne has the following constraints/requirements:

  • traverse untrusted networks (oops reporting system in Canonical DC, server on EC2)
  • tolerate transient network failures (e.g. don't crash in the generator, don't lose all the oops)

  • emit signals of some sort for event system integration (e.g. graphite, tuolumne etc).
  • allow arbitrary data above some core (e.g. datetime, 'type of oops')
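
One possible way to meet the "tolerate transient network failures" and "arbitrary data above some core" requirements is a publisher that spools to local disk when sending fails. The sketch below is an assumption-heavy illustration, not a proposed implementation: `SpoolingPublisher`, `make_report`, the `send` callable and the spool layout are all hypothetical, and the transport is faked with a plain function.

```python
import json
import os
import tempfile
import uuid


def make_report(oops_type, when, **extra):
    """Build a report: two core fields plus arbitrary extra data."""
    report = dict(extra)
    report.update({"type": oops_type, "datetime": when})
    return report


class SpoolingPublisher:
    """Try to send each report; on failure, keep it on local disk."""

    def __init__(self, send, spool_dir):
        self._send = send            # callable taking bytes; may raise OSError
        self._spool_dir = spool_dir

    def publish(self, report):
        payload = json.dumps(report, sort_keys=True).encode("utf-8")
        try:
            self._send(payload)
            return True
        except OSError:
            # Never propagate into the code that hit the original error.
            path = os.path.join(self._spool_dir, uuid.uuid4().hex + ".json")
            with open(path, "wb") as f:
                f.write(payload)
            return False

    def flush(self):
        """Resend spooled reports; stop at the first failure."""
        sent = 0
        for name in sorted(os.listdir(self._spool_dir)):
            path = os.path.join(self._spool_dir, name)
            with open(path, "rb") as f:
                payload = f.read()
            try:
                self._send(payload)
            except OSError:
                break
            os.remove(path)
            sent += 1
        return sent


# Fake transport standing in for the real network hop.
received = []
state = {"up": False}


def send(payload):
    if not state["up"]:
        raise ConnectionError("broker unreachable")
    received.append(payload)


spool = tempfile.mkdtemp()
publisher = SpoolingPublisher(send, spool)
delivered = publisher.publish(
    make_report("TimeoutError", "2011-02-22T20:47:16Z", task_id=7)
)
state["up"] = True
resent = publisher.flush()
```

The first `publish` call fails quietly and leaves a file in the spool; once the fake network is up, `flush` delivers it and removes the file. A real deployment would also need to bound the spool size, which this sketch ignores.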

Nice to have

Must not

Launchpad has this negative constraint:

  • Must not require per-instance or per-script configuration (e.g. oops prefix must go)

Out of scope

Subfeatures

Workflows

A user encounters an OOPS; they paste the id into #launchpad, and a Launchpad engineer can look at the OOPS pretty much immediately, and then recommend a workaround to the user, file a bug, or even just sit down and fix the issue.

Success

How will we know when we are done?

The problems listed in the 'need to solve' section are solved.

How will we measure how well we have done?

Thoughts?

-very- early /possible/ design sketch:

  • rabbit exchange on each physical appserver
  • shovel or similar to move stuff from a queue on that appserver to the exchange on the oops tools rabbit broker
  • lucene for indexing
  • cassandra data store (because there's almost nothing relational in the system, and this will scale to run many different projects easily)
  • json for oops serialisation
  • md5 (json bytes) for oops ids
  • cassandra-django stack for the web instance
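
The "json for oops serialisation" and "md5 (json bytes) for oops ids" items in the sketch above could be combined as follows. The deterministic encoding (sorted keys, compact separators, UTF-8) and the function names are assumptions added here; the sketch itself only names JSON and md5.

```python
import hashlib
import json


def serialise(report):
    """Serialise a report to bytes.

    Assumption: keys are sorted and separators fixed so that the same
    report always yields the same bytes, and therefore the same id.
    """
    return json.dumps(report, sort_keys=True, separators=(",", ":")).encode("utf-8")


def oops_id(report):
    """Derive the id from the content alone: md5 of the JSON bytes."""
    return hashlib.md5(serialise(report)).hexdigest()


a = {"contextlabel": "checkwatches", "exception_type": "KeyError"}
b = {"exception_type": "KeyError", "contextlabel": "checkwatches"}  # same data, different key order
c = {"contextlabel": "checkwatches", "exception_type": "ValueError"}

same = oops_id(a) == oops_id(b)
different = oops_id(a) != oops_id(c)
```

Because the id depends only on the report's content, no per-instance or per-script prefix has to be configured to keep ids distinct. Two byte-identical reports would share an id, so a real report would need something that varies per occurrence, such as a timestamp.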

LEP/OopsDisplay (last edited 2011-12-12 06:24:37 by lifeless)