Diff for "LEP/OopsDisplay"

Not logged in - Log In / Register

Differences between revisions 1 and 2
Revision 1 as of 2011-02-22 04:38:57
Size: 1891
Editor: lifeless
Comment:
Revision 2 as of 2011-02-22 05:12:33
Size: 4412
Editor: lifeless
Comment: flesh out
Deletions are marked like this. Additions are marked like this.
Line 1: Line 1:
The purpose of this template is to help us get ReadyToCode on features or tricky bugs as quickly as possible. See also LaunchpadEnhancementProposalProcess. = OOPS Display/analysis/processing =
Line 3: Line 3:
The bits in ''italics'' are the bits that you should fill in. '''Delete the italic bits.''' '''Contact:''' RobertCollins <<BR>>
'''On Launchpad:''' https://bugs.launchpad.net/launchpad-project/+bugs?field.tag=oops-handling
Line 5: Line 6:
'''''Talk to the product strategist soon after cutting a first draft of this document'''''

= $HEADLINE =

''Short description of feature''

'''Contact:''' ''The primary contact for this LEP. Normally the drafter or the implementer.'' <<BR>>
'''On Launchpad:''' ''Link to a blueprint, milestone or (best) a bug tag search across launchpad-project''

'''As a ''' $PERSON<<BR>>
'''I want ''' $FEATURE<<BR>>
'''so that ''' $BENEFIT<<BR>>

''Consider clarifying the feature by describing what it is not?''

''Link this from [[LEP]]''
'''As a ''' Technical architect<<BR>>
'''I want ''' oops display/analysis/processing to be fast, efficient and extensible<<BR>>
'''so that ''' developers can analyse production issues more efficiently<<BR>>
Line 24: Line 12:
''Why are we doing this now?'' Oopses are a key part of our production issue reaction and analysis workflow, but our current toolchain is insufficient to meet our needs. Improving it will allow developers to gather fresh data with less latency, reduce sysadmin interrupts when gathering such data and allow more rapid responses when something does go wrong.
Line 26: Line 14:
''What value does this give our users? Which users?'' Our users care that we fix problems promptly, making it easier to fix problems helps us deliver faster fixes to users.
Line 30: Line 18:
''Who really cares about this feature? When did you last talk to them?''  * RobertCollins -- LP Technical Architect

Other departments at Canonical are interested too

 * Landscape
 * Ubuntu one
 * ISD
Line 36: Line 31:
''What MUST the new behaviour provide?'' We need to solve a number of problems:
 * we need to decouple upgrades to what we capture and how its
analysed so that its easier to improve things
 * the way separate crashes are collated is hard to customise (and I
think we need customisation for different sources of problems)
 * the reporting, querying and data collection are all high latency at
the moment. Only a few people know how how to query, and its basically manual.
 * the service doesn't scale to our requirements : its slow at heart.
 * The service is not part of the production environment - but it needs to be
 * We have per-process configuration that is a real headache for deployments.

Launchpad has the following constraints/requirements:
 * < 60 seconds 99% of the time from OOPS generation to viewability (get it there quickly)
 * full text search of OOPS contents (find occurences of an exception/line of code/page id)
 * prompt and useful garbage collection (with LP integration for references). (do not waste space)
 * 1M oops/day capacity from the initial design [we generate 30K/day
at the moment, this allows for a large spike, or a 1% soft failure
rate on 100million web requests a day (we do 6M a day at the moment,
so this is a single order-of-magnitude scaling buffer). (Be prepared for reasonable growth)
 * sql slow request + repeated request discovery (do the performance analysis the current tools do)
 * collation by (page id, exception type) tuples (aggregate reliably)
 * easy deployment on development machines (so the whole team can contribute to the stack)
 * LOSA deployable and maintainable (be part of the production environment)
 * Gracefully handle unknown or missing fields in an OOPS (handle skew as we add new information before we decide how to deal with it)

UbuntuOne has the following constraints/requirements:
 * traverse untrusted networks (oops reporting system in Canonical DC, server on ec2)
 * tolerate transient network failures (e.g. don't crash in the generator, don't lose all the
oops)
 * emit signals of some sort for event system integration (e.g. graphite, tuolumne etc).
 * allow arbitrary data above some core (e.g. datetime, 'type of oops')
Line 42: Line 67:
''What MUST it not do?'' Launchpad has this negative constraint:
 * Must not require per-instance configuration (eg. oops prefix must go)
Line 48: Line 74:
''Other LaunchpadEnhancementProposal``s that form a part of this one.''
Line 52: Line 76:
''What are the workflows for this feature? Even a short list can help you and others understand the scope of the change.''
''Provide mockups for each workflow.''

'''''You do not have to get the mockups and workflows right at this point. In fact, it is better to have several alternatives, delaying deciding on the final set of workflows until the last responsible moment.'''''
A user encounters an OOPS; they paste the id into #launchpad, and a Launchpad engineer can look at the OOPS pretty much immediately, and then recommend a workaround to the user, file a bug, or even just sit down and fix the issue.
Line 61: Line 82:
The problems listed in the 'need to solve' section are solved.
Line 65: Line 88:
''Put everything else here. Better out than in.'' -very- early /possible/ design sketch:
 * rabbit exchange on each physical appserver
 * shovel or similar to move stuff from a queue on that appserver to the exchange on the oops tools rabbit broker
 * lucene for indexing
 * cassandra data store (because theres nearly nothing relational in the system, and this will scale to run many different projects easily)
 * json for oops serialiation
 * md5 (json bytes) for oops ids
 * cassandra-django stack for the web instance

OOPS Display/Analysis/Processing

Contact: RobertCollins
On Launchpad: https://bugs.launchpad.net/launchpad-project/+bugs?field.tag=oops-handling

As a technical architect
I want OOPS display/analysis/processing to be fast, efficient, and extensible
so that developers can analyse production issues more efficiently

Rationale

Oopses are a key part of our production issue reaction and analysis workflow, but our current toolchain is insufficient for our needs. Improving it will let developers gather fresh data with less latency, reduce the sysadmin interrupts involved in gathering that data, and allow more rapid responses when something does go wrong.

Our users care that we fix problems promptly; making it easier to fix problems helps us deliver fixes to users faster.

Stakeholders

  • RobertCollins -- LP Technical Architect

Other departments at Canonical are interested too:

  • Landscape
  • Ubuntu One
  • ISD

Constraints and Requirements

Must

We need to solve a number of problems:

  • we need to decouple upgrades to what we capture from upgrades to how it is analysed, so that it's easier to improve things
  • the way separate crashes are collated is hard to customise (and I think we need customisation for different sources of problems)
  • the reporting, querying and data collection are all high latency at the moment. Only a few people know how to query, and it's basically manual.
  • the service doesn't scale to our requirements: it's slow at heart.
  • the service is not part of the production environment - but it needs to be
  • we have per-process configuration that is a real headache for deployments.

Launchpad has the following constraints/requirements:

  • < 60 seconds, 99% of the time, from OOPS generation to viewability (get it there quickly)
  • full text search of OOPS contents (find occurrences of an exception, line of code or page id)
  • prompt and useful garbage collection, with LP integration for references (do not waste space)
  • 1M oops/day capacity from the initial design [we generate 30K/day at the moment; this allows for a large spike, or a 1% soft failure rate on 100 million web requests a day (we do 6M a day at the moment, so this is a single order-of-magnitude scaling buffer)] (be prepared for reasonable growth)
  • SQL slow-request and repeated-request discovery (do the performance analysis the current tools do)
  • collation by (page id, exception type) tuples (aggregate reliably; see the sketch after this list)
  • easy deployment on development machines (so the whole team can contribute to the stack)
  • LOSA deployable and maintainable (be part of the production environment)
  • gracefully handle unknown or missing fields in an OOPS (handle skew as we add new information before we decide how to deal with it)
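
As a rough illustration of the collation and field-skew requirements above (this is not existing code; the field names pageid and exc_type are assumptions for the sketch):

{{{
from collections import defaultdict

def collate(reports):
    """Group OOPS reports by (page id, exception type).

    Reports are plain dicts; unknown extra fields are carried along
    untouched, and missing fields fall back to 'unknown', per the
    skew-handling requirement above.
    """
    groups = defaultdict(list)
    for report in reports:
        key = (report.get('pageid', 'unknown'),
               report.get('exc_type', 'unknown'))
        groups[key].append(report)
    return groups

# Two reports land in the same bucket even though one carries a
# field the collator has never heard of.
reports = [
    {'pageid': 'BugTask:+index', 'exc_type': 'TimeoutError'},
    {'pageid': 'BugTask:+index', 'exc_type': 'TimeoutError',
     'extra': 'unanticipated data'},
]
print(collate(reports))
}}}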

UbuntuOne has the following constraints/requirements:

  • traverse untrusted networks (oops reporting system in the Canonical DC, server on EC2)
  • tolerate transient network failures (e.g. don't crash in the generator, and don't lose all the oopses; see the sketch after this list)
  • emit signals of some sort for event system integration (e.g. graphite, tuolumne, etc.)
  • allow arbitrary data above some core (e.g. datetime, 'type of oops')
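
A minimal sketch of the "don't crash, don't lose the oops" behaviour: spool each report to local disk first, then forward it with retries. The spool location and the publish callable are assumptions, not part of this proposal:

{{{
import json
import os
import time
import uuid

SPOOL_DIR = '/var/spool/oops'  # assumed location; must already exist

def record_oops(report):
    """Write the OOPS to local disk before any network I/O, so an
    outage between the generator and the reporting system cannot
    lose the report or crash the generator."""
    path = os.path.join(SPOOL_DIR, '%s.json' % uuid.uuid4())
    with open(path, 'w') as f:
        json.dump(report, f)

def drain_spool(publish, retries=5):
    """Forward spooled reports; a transient failure leaves the file
    in place for the next pass instead of dropping the oops."""
    for name in os.listdir(SPOOL_DIR):
        path = os.path.join(SPOOL_DIR, name)
        with open(path) as f:
            report = json.load(f)
        for attempt in range(retries):
            try:
                publish(report)  # e.g. post to the reporting system
            except OSError:
                time.sleep(2 ** attempt)  # back off, then retry
            else:
                os.unlink(path)
                break
}}}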

Nice to have

Must not

Launchpad has this negative constraint:

  • Must not require per-instance configuration (e.g. the oops prefix must go)

Out of scope

Subfeatures

Workflows

A user encounters an OOPS; they paste the id into #launchpad, and a Launchpad engineer can look at the OOPS pretty much immediately, and then recommend a workaround to the user, file a bug, or even just sit down and fix the issue.

Success

How will we know when we are done?

The problems listed in the 'Must' section above are solved.

How will we measure how well we have done?

Thoughts?

A -very- early, /possible/ design sketch:

  • RabbitMQ exchange on each physical appserver
  • shovel or similar to move messages from a queue on that appserver to the exchange on the oops-tools RabbitMQ broker
  • Lucene for indexing
  • Cassandra data store (because there's nearly nothing relational in the system, and this will scale to run many different projects easily)
  • JSON for oops serialisation
  • md5(json bytes) for oops ids (sketched below)
  • cassandra-django stack for the web instance
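
Making the serialisation and id items concrete (the sorted-keys canonicalisation is an assumption; the sketch above only says md5 of the JSON bytes):

{{{
import hashlib
import json

def oops_id(report):
    """Derive the oops id from the report's own content.

    Serialising with sorted keys gives a canonical byte string, so
    the same report always hashes to the same id. Content-derived
    ids need no coordination between instances, which fits the
    'must not require per-instance configuration' constraint: the
    per-instance oops prefix can go.
    """
    data = json.dumps(report, sort_keys=True).encode('utf-8')
    return hashlib.md5(data).hexdigest()

print(oops_id({'pageid': 'BugTask:+index', 'exc_type': 'TimeoutError'}))
}}}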
