OopsTools Sprint
- Where: Campinas, SP, Brazil
- When: 20-24 September 2010
Ursula and Diogo got together for a week to discuss improvements for https://edge.launchpad.net/oops-tools/. One of the goals for this sprint was to have Ursula and Diogo hacking together so Ursula can become more comfortable hacking on oops-tools.
Topics discussed
- how to get rid of the reports sent to the launchpad@ list, or at most have just a single email sent (e.g. https://pastebin.canonical.com/37802/)
  - improve the content of the reports
  - provide web ui so developers can generate a customized report with only the oopses interesting to them
- how to fix bug 461269 in a way that new oops attributes can be used to uniquely identify an infestation
  - change pageid to become a first class object rather than an oops attribute, so we can make queries and build reports for them.
User Stories
- Deryck wants to see all OOPSes related to the Bugs team, but without the checkwatches noise.
- Danilo wants to see all OOPSes related to Translations in a single report.
- Robert wants to see reports grouped by pageid
- Julian doesn't want to receive any more email
- Francis, Robert and Jono want to see the overall state of the production instances
- Gary wants the connection between infestations and Launchpad bug reports to be very reliable (i.e. once the tool is taught about a false positive, it should do the right thing the next time)
Post-sprint Analysis
After looking at these user stories, Ursula, Diogo and Gary identified high-level goals for the OOPS tools and discussed how to make those goals measurable.
High Level Goals
- Help reduce error rates by effectively communicating application exceptions.
- Help improve performance by effectively communicating timeouts.
We believe that the distinction between exceptions and timeouts is important, because analysis and communication of these problems can be significantly different.
The user stories strongly indicate two kinds of interests for these goals. Gary had hoped to focus only on the application in its entirety, but team leads also want us to focus on filtering data for their teams. Because of this, we can subdivide the two big goals.
- Help team leads reduce the error rate in their part of the application
- Help team leads reduce timeouts and improve performance in their part of the application
- Help the entire team, especially strategy leads, reduce the application's error rate.
- Help the entire team, especially strategy leads, reduce the application's timeouts and improve the application's performance.
Measurable Goals
How can we measure those goals? We want to measure the quality of Launchpad's code; we should also be able to measure the quality of our impact. This can give us validation of our efforts, or encourage us to try alternate approaches.
For this discussion, an "infestation" is a database record that is associated with a few things:
- a type, such as Timeout, or an exception type such as RuntimeError;
- a value, which for timeouts is the normalized SQL statement that took the most cumulative time in the request, and for exceptions is the normalized exception value;
- maybe a pageid in the future (see comment 4 of bug 461269); and
- maybe a Launchpad bug number, if it has been related.
Infestation -> OOPS is a one-to-many relationship: one infestation groups many OOPSes.
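The record described above could be sketched roughly as follows. This is only an illustration: the class names, field names and the `group_by_infestation` helper are hypothetical and are not taken from the oops-tools code. It assumes that several OOPSes can point at the same infestation, and that type plus value is what identifies an infestation today.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Infestation:
    """Hypothetical stand-in for the infestation record (not the real model)."""
    exception_type: str            # "Timeout", or an exception type such as "RuntimeError"
    exception_value: str           # normalized SQL statement or normalized exception value
    pageid: Optional[str] = None   # possible future attribute (bug 461269)
    bug: Optional[int] = None      # Launchpad bug number, if one has been linked


@dataclass
class Oops:
    """Hypothetical OOPS report that points at exactly one infestation."""
    oops_id: str
    infestation: Infestation


def group_by_infestation(oopses):
    """Map (type, value) to the list of OOPS ids that share that infestation."""
    groups = {}
    for oops in oopses:
        key = (oops.infestation.exception_type, oops.infestation.exception_value)
        groups.setdefault(key, []).append(oops.oops_id)
    return groups
```

If pageid became part of the identity, as bug 461269 suggests, the grouping key in this sketch would grow to include it.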
How do we measure error rates?
XXX The following still needs to be rewritten/expanded.
- We rejected measuring and comparing total errors over time. Keeping errors out of the software is a QA responsibility, but not an OOPS tool responsibility. OOPS tools are generally too late in the game: they communicate errors that have been experienced, rather than errors that can be prevented.
- We focused on measuring and comparing existing errors--that is, given two non-overlapping time ranges, new and old, we would ignore errors in the new time range that were not related to infestations that were in the old one. With that broad idea, we were interested in the following approaches.
- Measure the number of total OOPSes between two time periods. Ignoring new infestations as described above, we want to see that number decrease. We believe that this can be regarded as a very rough measurement of how well
- compare the number of OOPS bugs closed between two time periods. (what does closed mean? XXX)
- record how long it takes for an infestation to disappear.
- measure how many infestations have LP bug reports (or are triaged, or...) (right now we don't have very easy links between bugs and infestations XXX)
- measure how long it takes to create/link an LP bug to an infestation (or triage it or whatever)
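The "existing errors" comparison could be sketched like this. It is a rough illustration under stated assumptions, not the tool's implementation: each OOPS is reduced to a hashable infestation key (for example a type and value pair), and the function name `compare_existing_errors` is hypothetical.

```python
from collections import Counter


def compare_existing_errors(old_keys, new_keys):
    """Compare OOPS totals between two non-overlapping time ranges.

    Each argument holds one infestation key per OOPS seen in that range.
    OOPSes in the new range whose infestation did not appear in the old
    range are ignored, so only already-known problems are compared.
    """
    old_counts = Counter(old_keys)
    new_counts = Counter(key for key in new_keys if key in old_counts)
    return sum(old_counts.values()), sum(new_counts.values())


old_total, new_total = compare_existing_errors(
    [("RuntimeError", "a"), ("RuntimeError", "a"), ("Timeout", "q")],
    [("RuntimeError", "a"), ("KeyError", "z"), ("KeyError", "z")],
)
```

Here the two `KeyError` OOPSes belong to an infestation that is new in the second range, so they do not count against the comparison.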
How do we measure performance/timeouts?
- Record percentages of timeouts per total page views today, compare with future.
- Record percentages of pages that take longer than the desired timeout per total page views; we would also want to have easy access to the pageids and OOPSes involved.
- Monitor number of pageids that have special feature flags to increase timeout; that should also go down, other than occasional jumps
- Measure how many timeout infestations have LP bug reports (or are triaged, or...) (right now we don't group similar timeouts very well XXX) (right now we don't have very easy links between bugs and infestations XXX)
- Measure how long it takes to create/link an LP bug to a timeout infestation (or triage it or whatever)
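The percentage measurements could be computed along these lines. This is a minimal sketch: it assumes request data is available as (pageid, duration in seconds) pairs, which the notes do not specify, and the function names and pageid strings are hypothetical.

```python
from collections import Counter


def percentage(part, total):
    """Return part as a percentage of total, or 0.0 when there is no traffic."""
    return 100.0 * part / total if total else 0.0


def slow_request_report(requests, desired_timeout):
    """Summarize requests that took longer than the desired timeout.

    requests is an iterable of (pageid, duration_seconds) pairs.
    Returns the percentage of slow requests per total page views and a
    per-pageid count, so the pageids involved are easy to look up.
    """
    requests = list(requests)
    slow = Counter(pageid for pageid, duration in requests
                   if duration > desired_timeout)
    return percentage(sum(slow.values()), len(requests)), dict(slow)


slow_pct, slow_by_pageid = slow_request_report(
    [("Page:+a", 1.0), ("Page:+a", 12.5), ("Page:+b", 3.0), ("Page:+b", 9.5)],
    desired_timeout=9.0,
)
```

Recording the same figures for two periods gives the "today versus future" comparison from the list above.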
How do we measure whether we are helping teams focus on their own problems?
- Maybe this is a possible solution, not something we need to measure for our own effectiveness.
Action items
- Bug 652350: change ErrorSummary object to accept sections so it can be built dynamically
- Bug 652351: web ui so developers can generate reports customized to what they need (http://ubuntuone.com/p/HvI/)
- Bug 652356: page id should become a first class object
- Bug 652354: put the exception value normalization code into the database
- Bug 461269: new oops attributes, such as pageid, should be used to uniquely identify an infestation
- File RT to have lp-production-configs on devpad automatically updated (RT #41653)
- Bug 592355: team based oops summaries should use the infestation team information to better group oopses
XXX discuss timeline, ordering.
XXX add discussion:
- 592355 identify infestations for teams better. The following notes are not exactly pertinent, but related.
  - Ursula: keep it as is and measure how well it is doing
  - Gary: be able to change the bug
  - Gary: be able to look at new Launchpad bugs for OOPS listings in the description and guess that there might be a link