Diff for "PolicyAndProcess/ZeroOOPSPolicy"

Differences between revisions 1 and 9 (spanning 8 versions)
Revision 1 as of 2009-10-09 20:40:02
Size: 2462
Editor: flacoste
Comment:
Revision 9 as of 2011-09-14 20:39:02
Size: 3538
Editor: lifeless
Comment: update to reference definition of critical
Deletions are marked like this. Additions are marked like this.
Line 12: Line 12:
stop-the-line event and should be fixed ASAP. stop-the-line event and should be fixed ASAP. This includes javascript errors
even though we do not currently record OOPS for them: Bug:741991.

[[http://webnumbr.com/launchpad-timeout-bugs|timeout tagged bugs]].
Line 16: Line 19:
We should be proud of the service we build and deliver, and we cannot take
pride in a low-quality product. Everytime an OOPS page reaches a user, whether
because of a time out or an unhandled exception, we failed on the measure of
quality. An OOPS page means that a user was prevented from completing their
work, that's really bad.
For three interlocking reasons:
 *
We should be proud of the service we build and deliver, and we cannot take pride in a low-quality product. Everytime an OOPS page reaches a user, whether because of a time out or an unhandled exception, we failed on the measure of quality. An OOPS page means that a user was prevented from completing their work, that's really bad.
Line 22: Line 22:
Having zero tolerance for OOPSes in production means that we are putting
actions behind our mantra of quality. An OOPS is basically an escaped defect,
and we cannot tolerate that.
 * Having zero tolerance for OOPSes in production means that we are putting actions behind our mantra of quality. An OOPS is basically an escaped defect, and we cannot tolerate that.
Line 26: Line 24:
Daily we have between tens and hundreds of OOPS. This policy is basically
about making sure that the Exceptions and Timeouts section of the report are
 * The OOPS system tells us when something has gone wrong. If that report isn't essentially empty every day (due to many different OOPSes affecting a small number of users), we won't detect really severe problems (particularly those that only happen a couple of times but may have very severe impact) - our operational signal to noise ratio is compromised. This aspect is supported by our (internal, sorry) definition of critical policy.

This policy is basically about making sure that the Exceptions and Timeouts section of the report are
Line 33: Line 32:
  it with priority of High. It should be tagged with either 'oops' or   it with priority of Critical. It should be tagged with either 'oops' or
Line 35: Line 34:
  * Fixing bugs tagged 'oops' and 'timeout' takes priority over any
  feature development.
  * We should deploy all possible OOPS fix to production.
  * Fixing critical bugs takes priority over all other bug fixing work (done
  by the interrupt squads).
     * Unless it's an operational incident, we should respect the work-in-process limit set through [[Kanban]]. But critical bugs should be the first bugs to be pulled for development once capacity is available.
  * We should deploy all possible OOPS fix to production as rapidly as possible.
Line 45: Line 45:
== But All OOPSes are not equals == == But All OOPSes are not the same ==
Line 49: Line 49:
whatever reason, then it shouldn't record an OOPS. Change the exception
type so that it doesn't appear in these sections.
whatever reason, then the system should be changed to not record an OOPS.
Line 52: Line 51:
The end-goal is that the users don't get the BSOD pages and that the OOPS
report sections are empty. So that when something appears there, we know it's
a problem to fix. No sifting through many false positives.
One way to prevent an OOPS being visible is to change the exception
type so that it doesn't trigger the OOPS code.

The end goals are:
 * users are able to use the system
 * OOPSes are only recorded when something is wrong that developers or operators need to know about.

The expected result of achieving these goals is that the system will generally be in good shape and if an OOPS is recorded its something important we should work on immediately - no sifting through many false positives.
Line 62: Line 66:
Burn down chart of the bugs with the "oops" and "timeout" tags. Burn down chart of the bugs with the "oops" tags. 

  • Policy Name: Zero OOPS Policy

  • Policy Owner: Francis Lacoste

  • Parent Process/Activity: None

  • Supported Policy: None

Policy Overview

In a nutshell, this policy is about moving the tolerance level for OOPSes to zero. This means that any user-visible error happening in production is a stop-the-line event and should be fixed ASAP. This includes JavaScript errors, even though we do not currently record an OOPS for them: bug 741991.

timeout tagged bugs.

Why this policy?

For three interlocking reasons:

  • We should be proud of the service we build and deliver, and we cannot take pride in a low-quality product. Every time an OOPS page reaches a user, whether because of a timeout or an unhandled exception, we have failed on the measure of quality. An OOPS page means that a user was prevented from completing their work, and that's really bad.
  • Having zero tolerance for OOPSes in production means that we are putting actions behind our mantra of quality. An OOPS is basically an escaped defect, and we cannot tolerate that.
  • The OOPS system tells us when something has gone wrong. If that report isn't essentially empty every day, because it is filled with many different OOPSes that each affect a small number of users, we won't detect really severe problems (particularly those that only happen a couple of times but may have very severe impact): our operational signal-to-noise ratio is compromised. This aspect is supported by our (internal, sorry) definition of critical policy.

This policy is basically about making sure that the Exceptions and Timeouts sections of the report are empty.

What should be done about OOPSes

  • Every time an OOPS is encountered in production, a bug should be filed for it with a priority of Critical. It should be tagged with either 'oops' or 'timeout'.
  • Fixing critical bugs takes priority over all other bug fixing work (done by the interrupt squads).
    • Unless it's an operational incident, we should respect the work-in-process limit set through Kanban. But critical bugs should be the first bugs to be pulled for development once capacity is available.

  • We should deploy every available OOPS fix to production as rapidly as possible.

Once we achieve Zero-OOPS status:

  • Do root-cause analysis for every OOPS that occurs in production, to ensure that our process is really robust against escaped defects.

But All OOPSes are not the same

All OOPSes in the "Exceptions" and "Time outs" sections should be eliminated. If an OOPS isn't important, because it's only triggered by robots or for whatever other reason, then the system should be changed to not record an OOPS.

One way to prevent an OOPS being visible is to change the exception type so that it doesn't trigger the OOPS code.
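
As a rough illustration of that approach, here is a minimal Python sketch. It is not Launchpad's actual OOPS code: the names ExpectedError, RobotProbeError, should_record_oops and run_request are all hypothetical, and the sketch simply assumes that the recording code decides what to report by looking at the exception's type.

```python
class ExpectedError(Exception):
    """Hypothetical marker base class: errors that are not defects."""


class RobotProbeError(ExpectedError):
    """Hypothetical example: a malformed request sent only by crawlers."""


def should_record_oops(exc: Exception) -> bool:
    """Record an OOPS for everything except explicitly expected errors."""
    return not isinstance(exc, ExpectedError)


def run_request(handler, oops_report: list) -> str:
    """Run a request handler; record an OOPS only for unexpected failures."""
    try:
        return handler()
    except Exception as exc:
        if should_record_oops(exc):
            # Unexpected failure: this is an escaped defect, so report it.
            oops_report.append(type(exc).__name__)
            return "oops page"
        # Expected failure: show an error, but keep the report clean.
        return "plain error page"
```

Under these assumptions, re-raising an unimportant failure as a subclass of the marker class keeps it out of the report, while any other exception is still recorded.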

The end goals are:

  • users are able to use the system
  • OOPSes are only recorded when something is wrong that developers or operators need to know about.

The expected result of achieving these goals is that the system will generally be in good shape, and if an OOPS is recorded, it's something important we should work on immediately: no sifting through many false positives.

When

We are starting this policy now.

Coming Soon

Burn down chart of the bugs with the "oops" tag.

PolicyAndProcess/ZeroOOPSPolicy (last edited 2011-09-14 20:39:02 by lifeless)