Diff for "Foundations/QA/OOPSToolsMiniSprint"

Not logged in - Log In / Register

Differences between revisions 5 and 20 (spanning 15 versions)
Revision 5 as of 2010-09-22 17:50:50
Size: 2122
Editor: matsubara
Comment:
Revision 20 as of 2010-10-01 22:12:01
Size: 7145
Editor: gary
Comment:
Deletions are marked like this. Additions are marked like this.

OopsTools Sprint

  • Where: Campinas, SP, Brazil
  • When: 20-24 September 2010

Ursula and Diogo got together for a week to discuss improvements to https://edge.launchpad.net/oops-tools/. One of the goals for this sprint was to have Ursula and Diogo hacking together, so that Ursula could become more comfortable hacking on oops-tools.

Topics discussed

  • how to get rid of the reports sent to the launchpad@ list, or at most have a single summary email sent (e.g. https://pastebin.canonical.com/37802/); a sketch of such a consolidated summary appears after this list
  • improve the content of the reports
  • provide a web UI so developers can generate a customized report containing only the OOPSes that interest them
  • how to fix bug 461269 in a way that new OOPS attributes can be used to uniquely identify an infestation
  • change pageid to become a first-class object rather than an OOPS attribute, so we can make queries and build reports for it
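
A very rough sketch of what assembling a single consolidated summary email might look like. This is illustrative only: the data shapes (OOPSes as dicts with team, kind and signature keys) and the helper names are assumptions for this page, not the oops-tools API.

{{{#!python
# Illustrative sketch only: group OOPSes by team and render one summary email.
# The data shapes and helper names are hypothetical, not the oops-tools API.
from collections import defaultdict

def summarize_by_team(oopses):
    """Group OOPS records (dicts with 'team', 'kind', 'signature') by team."""
    by_team = defaultdict(lambda: defaultdict(int))
    for oops in oopses:
        by_team[oops['team']][(oops['kind'], oops['signature'])] += 1
    return by_team

def render_summary_email(oopses, date):
    """Render one plain-text summary, broken down by team, for a whole day."""
    if not oopses:
        return 'Subject: OOPS report for %s\n\nNo OOPSes today.' % date
    total = len(oopses)
    lines = ['Subject: OOPS report for %s' % date, '',
             '%d OOPSes in total' % total, '']
    for team, counts in sorted(summarize_by_team(oopses).items()):
        team_total = sum(counts.values())
        lines.append('== %s (%.0f%% of all OOPSes) ==' % (team, 100.0 * team_total / total))
        for (kind, signature), count in sorted(counts.items(), key=lambda item: -item[1]):
            lines.append(' * %d %s: %s' % (count, kind, signature))
        lines.append('')
    return '\n'.join(lines)
}}}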

User Stories

  • Deryck wants to see all OOPSes related to the Bugs team, but without the checkwatches noise.
  • Danilo wants to see all OOPSes related to Translations in a single report.
  • Robert wants to see reports grouped by pageid.
  • Julian doesn't want to receive any more email.
  • Francis, Robert and Jono want to see the overall state of the production instances.
  • Gary wants the connection between infestations and Launchpad bug reports to be very reliable (i.e. once the tool is taught about a false positive, it should do the right thing the next time).

Post-sprint Analysis

After looking at these user stories, Ursula, Diogo and Gary identified high-level goals for the OOPS tools and discussed measurable goals for them.

High Level Goals

  • Help reduce error rates by effectively communicating application exceptions.
  • Help improve performance by effectively communicating timeouts.

We believe that the distinction between exceptions and timeouts is important, because analysis and communication of these problems can be significantly different.

The user stories strongly indicate two kinds of interest in these goals. Gary had hoped to focus only on the application in its entirety, but team leads do want us to also focus on filtering data for their teams. Because of this, we can subdivide the two big goals.

  • Help team leads reduce the error rate in their part of the application.
  • Help team leads reduce timeouts and improve performance in their part of the application.
  • Help the entire team, especially strategy leads, reduce the application's error rate.
  • Help the entire team, especially strategy leads, reduce the application's timeouts and improve the application's performance.

Measurable Goals

How can we measure those goals? We want to measure the quality of Launchpad's code; we should also be able to measure the quality of our own impact. This can validate our efforts, or encourage us to try alternative approaches.

For this discussion, an "infestation" is a database record that is associated with a few things:

  • a type, such as Timeout, or an exception type such as RuntimeError;
  • a value, which for timeouts is the normalized SQL statement that took the most cumulative time in the request, and for exceptions is the normalized exception value;
  • maybe a pageid in the future (see comment 4 of bug 461269); and
  • maybe a Launchpad bug number, if it has been related.

OOPS -> infestation is a one-to-many relationship.
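
For illustration, here is a minimal sketch of the infestation record described above as plain Python dataclasses, ignoring the actual database layer; the class and field names are assumptions for this page, not the real oops-tools schema.

{{{#!python
# Hypothetical sketch of the infestation record described above, and of the
# one-to-many OOPS relationship; not the real oops-tools schema.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Infestation:
    # Timeout, or an exception type such as RuntimeError.
    exception_type: str
    # Normalized SQL statement (timeouts) or normalized exception value.
    exception_value: str
    # Maybe a pageid in the future (see bug 461269, comment 4).
    pageid: Optional[str] = None
    # Launchpad bug number, if one has been related.
    bug_number: Optional[int] = None

@dataclass
class Oops:
    oops_id: str
    # Many OOPSes can share one infestation.
    infestation: Optional[Infestation] = None
}}}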

How do we measure error rates?

XXX The following still needs to be rewritten/expanded.

  • We rejected measuring and comparing total errors over time. Keeping errors out of the software is a QA responsibility, but not an OOPS tool responsibility. OOPS tools are generally too late in the game: they communicate errors that have been experienced, rather than errors that can be prevented.
  • We focused on measuring and comparing existing errors; that is, given two non-overlapping time ranges, new and old, we would ignore errors in the new time range that were not related to infestations that were in the old one. With that broad idea, we were interested in the following approaches (a sketch of the comparison appears after this list).
    • Measure the total number of OOPSes in each of the two time periods. Ignoring new infestations as described above, we want to see that number decrease. We believe that this can be regarded as a very rough measurement of how well we are addressing existing infestations.
    • Compare the number of OOPS bugs closed between two time periods. (What does "closed" mean? XXX)
    • Record how long it takes for an infestation to disappear.
    • Measure how many infestations have LP bug reports (or are triaged, or ...). (Right now we don't have very easy links between bugs and infestations. XXX)
    • Measure how long it takes to create/link an LP bug to an infestation (or triage it, or whatever).
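
As a concrete illustration of the comparison described above, here is a rough sketch of counting only those OOPSes in the new time range whose infestation already existed in the old one. It reuses the hypothetical Oops/Infestation classes from the earlier sketch; none of this is actual oops-tools code.

{{{#!python
# Rough sketch of the "ignore new infestations" comparison described above.
# Reuses the hypothetical Oops/Infestation classes from the earlier sketch.

def infestation_key(infestation):
    # An infestation is identified by its type and value (and maybe pageid).
    return (infestation.exception_type, infestation.exception_value, infestation.pageid)

def count_oopses_from_existing_infestations(old_oopses, new_oopses):
    """Count OOPSes in the new time range whose infestation already existed
    in the old time range; over time we want this number to decrease."""
    known = {infestation_key(o.infestation) for o in old_oopses if o.infestation}
    return sum(1 for o in new_oopses
               if o.infestation and infestation_key(o.infestation) in known)
}}}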

How do we measure performance/timeouts?

  • Record the percentage of timeouts per total page views today, and compare with the future (see the sketch after this list).
  • Record the percentage of pages that take longer than the desired timeout per total page views; we would also want easy access to the pageids and OOPSes involved.
  • Monitor the number of pageids that have special feature flags to increase the timeout; that should also go down, apart from occasional jumps.
  • Measure how many timeout infestations have LP bug reports (or are triaged, or ...). (Right now we don't group similar timeouts very well, and we don't have very easy links between bugs and infestations. XXX)
  • Measure how long it takes to create/link an LP bug to a timeout infestation (or triage it, or whatever).
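
A sketch of the first metric, assuming we already have per-pageid counts of total requests and of requests that hit the timeout; the data source and function names are hypothetical, not part of oops-tools.

{{{#!python
# Sketch of the timeout-rate metric: percentage of requests per pageid that
# hit the timeout.  The input data shapes and names are hypothetical.

def timeout_rates(request_counts, timeout_counts):
    """Map pageid -> percentage of its requests that timed out.

    request_counts and timeout_counts both map pageid -> a count."""
    rates = {}
    for pageid, total in request_counts.items():
        if total:
            rates[pageid] = 100.0 * timeout_counts.get(pageid, 0) / total
    return rates

def worst_offenders(request_counts, timeout_counts, limit=10):
    """The pageids with the highest timeout rate, to compare between periods."""
    rates = timeout_rates(request_counts, timeout_counts)
    return sorted(rates.items(), key=lambda item: item[1], reverse=True)[:limit]
}}}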

How do we measure whether we are helping teams focus on their own problems?

  • Maybe this is a possible solution rather than something we need to measure for our own effectiveness.

Action items

  • Bug 652350: change the ErrorSummary object to accept sections so it can be built dynamically (a rough sketch follows this list).
  • Bug 652351: web UI so developers can generate reports customized to what they need (http://ubuntuone.com/p/HvI/).
  • Bug 652356: pageid should become a first-class object.
  • Bug 652354: put the exception value normalization code into the database.
  • Bug 461269: new OOPS attributes, such as pageid, should be used to uniquely identify an infestation.
  • File an RT to have lp-production-configs on devpad automatically updated (RT #41653).
  • Bug 592355: team-based OOPS summaries should use the infestation team information to better group OOPSes.
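
To illustrate the first action item, here is a hypothetical sketch of an ErrorSummary built from pluggable sections rather than a fixed set. The class structure and names are assumptions for this page, not the actual oops-tools implementation of bug 652350.

{{{#!python
# Hypothetical sketch for bug 652350: an ErrorSummary built from pluggable
# sections instead of a hard-coded set.  Names are illustrative only.

class SummarySection:
    def __init__(self, title, matcher):
        self.title = title
        self.matcher = matcher   # predicate deciding which OOPSes belong here
        self.oopses = []

    def maybe_add(self, oops):
        if self.matcher(oops):
            self.oopses.append(oops)

class ErrorSummary:
    def __init__(self, sections):
        self.sections = list(sections)   # callers decide which sections to include

    def add(self, oops):
        for section in self.sections:
            section.maybe_add(oops)

    def render(self):
        return '\n'.join('== %s (%d) ==' % (section.title, len(section.oopses))
                         for section in self.sections)

# A report customized for one team might then be assembled like this:
#   summary = ErrorSummary([
#       SummarySection('Timeouts', lambda o: o.infestation and
#                      o.infestation.exception_type == 'Timeout'),
#       SummarySection('Exceptions', lambda o: o.infestation and
#                      o.infestation.exception_type != 'Timeout'),
#   ])
}}}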

XXX discuss timeline, ordering.

XXX add discussion: bug 592355, identify infestations for teams better. The following notes are not exactly pertinent, but related.

  • Ursula: keep it as is and measure how well it is doing.
  • Gary: be able to change the bug.
  • Gary: be able to look at new Launchpad bugs for OOPS listings in the description and guess that there might be a link.

Bugs fixed during the sprint

  • Bug 612354: fix oops-tools bootstrapping.
  • Bug 251896: oops-tools should filter out not-found errors referred from non-local domains (a rough sketch of such a filter follows).
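
As an illustration of the second fix, here is a rough sketch of the kind of filter involved; the function names, the OOPS type string and the list of local domains are assumptions for this page, not the actual oops-tools code.

{{{#!python
# Rough sketch of a bug 251896 style filter: skip not-found OOPSes whose
# referer is not one of our own domains.  Domain list, type string and names
# are illustrative only, not the actual oops-tools implementation.
from urllib.parse import urlparse

LOCAL_DOMAINS = ('launchpad.net', 'launchpad.dev')

def is_local_referer(referer):
    """True if the referer URL points at one of our own domains."""
    if not referer:
        return False
    host = urlparse(referer).netloc.split(':')[0].lower()
    return any(host == d or host.endswith('.' + d) for d in LOCAL_DOMAINS)

def should_report(oops_type, referer):
    """Skip not-found OOPSes that were referred from non-local domains."""
    if oops_type == 'NotFound' and not is_local_referer(referer):
        return False
    return True
}}}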

Foundations/QA/OOPSToolsMiniSprint (last edited 2010-10-01 22:12:01 by gary)