Diff for "PolicyAndProcess/ZeroOOPSPolicy"


Differences between revisions 1 and 2
Revision 1 as of 2009-10-09 20:40:02
Size: 2462
Editor: flacoste
Comment:
Revision 2 as of 2009-10-12 11:38:41
Size: 2561
Editor: jml
Comment:
Line 13: Line 13:

Look at [[http://people.canonical.com/~flacoste/tags-burndown-report.html|the burndown chart]].

  • Policy Name: Zero OOPS Policy

  • Policy Owner: Francis Lacoste

  • Parent Process/Activity: None

  • Supported Policy: None

Policy Overview

In a nutshell, this policy is about moving the tolerance level for OOPSes to zero. This means that any user-visible error happening in production is a stop-the-line event and should be fixed ASAP.

Look at the burndown chart.

Why this policy?

We should be proud of the service we build and deliver, and we cannot take pride in a low-quality product. Every time an OOPS page reaches a user, whether because of a timeout or an unhandled exception, we have failed on the measure of quality. An OOPS page means that a user was prevented from completing their work, and that is really bad.

Having zero tolerance for OOPSes in production means that we are putting actions behind our mantra of quality. An OOPS is basically an escaped defect, and we cannot tolerate that.

Every day we see between tens and hundreds of OOPSes. This policy is basically about making sure that the Exceptions and Timeouts sections of the report are empty.

What should be done about OOPSes

  • Every time an OOPS is encountered in production, a bug should be filed for it with a priority of High. The bug should be tagged with either 'oops' or 'timeout'.
  • Fixing bugs tagged 'oops' and 'timeout' takes priority over any feature development.
  • We should deploy all possible OOPS fixes to production.
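
The filing rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: BugReport, TAG_FOR_KIND and bug_for_oops are hypothetical names, not part of Launchpad's bug tracker API, and the two kinds ("exception" and "timeout") are assumed to map onto the 'oops' and 'timeout' tags.

```python
from dataclasses import dataclass, field

# Hypothetical mapping from the kind of OOPS to the tag the policy asks for.
TAG_FOR_KIND = {"exception": "oops", "timeout": "timeout"}


@dataclass
class BugReport:
    """Hypothetical stand-in for a bug filed against an OOPS."""
    title: str
    priority: str
    tags: list = field(default_factory=list)


def bug_for_oops(oops_id: str, kind: str) -> BugReport:
    """Build the bug the policy requires: priority High, tagged by kind."""
    if kind not in TAG_FOR_KIND:
        raise ValueError(f"unknown OOPS kind: {kind!r}")
    return BugReport(
        title=f"OOPS {oops_id} in production",
        priority="High",
        tags=[TAG_FOR_KIND[kind]],
    )
```

Calling bug_for_oops with an OOPS identifier and the kind "timeout" would yield a High-priority bug carrying the 'timeout' tag; an unknown kind is rejected so that nothing is filed untagged.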

Once we achieve Zero-OOPS status:

  • Do root-cause analysis for every OOPS that occurs in production, to ensure that our process is really robust against escaped defects.

But not all OOPSes are equal

All OOPSes in the "Exceptions" and "Time outs" sections should be eliminated. If an error isn't important (because it's only triggered by robots, or for whatever reason), then it shouldn't be recorded as an OOPS. Change the exception type so that it doesn't appear in these sections.
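
One way to picture "change the exception type" is a marker base class that the report skips. The sketch below is an assumption about how such a filter could look, not a description of how the OOPS report is actually built: NotAnOops, RobotRequestError and report_section are hypothetical names.

```python
from typing import Optional


class NotAnOops(Exception):
    """Hypothetical base class for errors that should not count as OOPSes."""


class RobotRequestError(NotAnOops):
    """Hypothetical example: a malformed request that only robots send."""


def report_section(exc: BaseException) -> Optional[str]:
    """Return the report section an error belongs in, or None to leave it out."""
    if isinstance(exc, NotAnOops):
        # Reclassified as unimportant: keep it out of both sections.
        return None
    if isinstance(exc, TimeoutError):
        return "Time outs"
    return "Exceptions"
```

With this arrangement, reclassifying an unimportant error means changing which class is raised, and anything that still lands in either section is a real problem to fix.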

The end goal is that users don't get the BSOD pages and that the OOPS report sections are empty. That way, when something appears there, we know it's a problem to fix, with no sifting through many false positives.

When

We are starting this policy now.

Coming Soon

Burndown chart of the bugs with the "oops" and "timeout" tags.

PolicyAndProcess/ZeroOOPSPolicy (last edited 2011-09-14 20:39:02 by lifeless)