Differences between revisions 12 and 13

This page is about triaging Launchpad-related bugs. For general information on handling bug reports, see BugHandling. If you have any questions, ask for help right away.

Launchpad Bug Triage

What is bug triage?

Triage is the act of sorting bugs into different priority groups. There are many conflicting sorts - everyone has their pet bug that should be 'first'. The sort order we choose is from the project's perspective: we try to balance the needs of our users.

So bug triage is sorting bugs by importance to the project. In assessing that importance, we try to strike a balance between these influences:

Things affecting Launchpad project health
Things affecting stakeholders
Things affecting other users

When we have triaged a bug, it has the status Triaged and an importance other than unknown.

Why triage

This may be obvious, but having just a big bucket of open bugs isn't very efficient: there are more genuinely important issues to fix than there are engineers, so engineers will lose track of which things are urgent and which aren't.

Secondly, each of the groups of users whose needs we're trying to balance is interested in when things will get done. By sorting the bugs we provide a proxy metric for when tasks will be worked on.

How much triage is needed?

The world is dynamic and constantly changing, so any sort we come up with for our bugs will be outdated pretty quickly. We could make the sort complete (so all bugs are ranked) and constantly refresh it. However, this is inefficient: the only times the sort actually matters are:

when a new bug is being selected to work on (by project importance).
when a user is taking a decision based on how long until the bug is likely to be worked on. For instance, they might decide to work up a patch, or whether to use Launchpad at all.

So how much sorting is enough? Two interesting metrics are freshness and completeness.

If the sort is too old, it will flag bugs as 'should be next to work on' when that is no longer true. Our priorities may change month to month, but they rarely change faster than that, so we can tolerate things being months (or more) stale.

The sort is complete enough if 'what is an important bug to work on now' and the questions that users may ask (like 'how long till this will be worked on') get accurate enough answers. So how accurate do they need to be?

Well that's a tradeoff, but we think the answers are accurate enough if:

users can see that we care about performance, regressions, usability and polish
engineers selecting 'next bug to work on' based on the triage sort usually pick the things that are most useful to the project/stakeholders/users; that is, inconsequential stuff is tackled after consequential stuff

Bug Importance

Bug importance in Launchpad is where we record the result of the triage process; we have 5 buckets we can use in Launchpad: critical/high/medium/low/wishlist.

We don't actually ever block a release based on having a bug of a particular importance - we block releases based on having regressions, which any commit can have - and we mark that on the bug mapping to the commit.

The buckets combine to give a partial sort: bugs in the critical bucket are sorted before bugs in the high bucket.
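A minimal sketch of what "partial sort" means here, in Python. The `Bug` record and the numeric ordering of the enum are assumptions made for illustration; they are not Launchpad API objects. The importance bucket orders bugs between buckets but says nothing about the order within a bucket.

```python
from dataclasses import dataclass
from enum import IntEnum


class Importance(IntEnum):
    # Lower value sorts first; the names follow the five buckets listed above.
    CRITICAL = 0
    HIGH = 1
    MEDIUM = 2
    LOW = 3
    WISHLIST = 4


@dataclass(frozen=True)
class Bug:
    # Hypothetical minimal record, for illustration only.
    bug_id: int
    importance: Importance


def partial_sort(bugs):
    # sorted() is stable, so bugs that share a bucket keep their input order:
    # the bucket decides order between buckets, not inside one.
    return sorted(bugs, key=lambda bug: bug.importance)


bugs = [
    Bug(3, Importance.HIGH),
    Bug(1, Importance.LOW),
    Bug(2, Importance.CRITICAL),
    Bug(4, Importance.HIGH),
]
ordered_ids = [bug.bug_id for bug in partial_sort(bugs)]
```

Bugs 3 and 4 share the high bucket, so the sort leaves their relative order untouched.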

We can choose to use some or all of these 5 buckets.

How many do we need? A good way to answer that is to take our hypothetical complete, fresh sort and consider how many slices we'd need to make in it to answer questions well. We also need to consider what would happen to those slices when things change (such as new bugs arriving that sort to the front).

Also, buckets have a cost: we need a ruleset for triage that will let us assign bugs to buckets, and every bucket makes the heuristics more complex.

Given that we have a freshness tolerance for most bugs of some months, that we don't want to update many bugs when a single bug shuffles in front, and that we have more bugs coming in than we fix, we need three or perhaps four buckets:

A topmost bucket that is generally empty, which crisis bugs go into.
A default bucket for bugs that we haven't picked out as important enough to sort above any other specific bug.
[optional] A bucket for bugs that are reasonably important but not extremely so.
A bucket containing bugs which are within the first 6 months of work.

We map these buckets into:

critical: generally empty; bugs that need to jump the queue go here.
high: bugs that are likely to get attention within 6 months.
low [or perhaps wishlist]: all other bugs.

This has a clear tension: time-till-we-start-work is a good metric for which bucket to use, but given a bug with some symptoms, how do we decide which bucket it should go into?

To address this tension we use two things:

A quarterly review of the bugs in the high bucket, to stop it overflowing.
Some heuristics for sorting bugs

Quarterly review

This is pretty simple - we re-triage bugs with high importance to see if things have changed and they should be downgraded. For upgrades we assume that user prompting will cause us to upgrade them.

Triage guidelines

These guidelines describe the rules we use to sort bugs - and from that sort we assign bugs to buckets. We broadly want:

queue jumping bugs to be in the critical bucket. (OOPS, timeouts, regressions, stakeholder-escalated bugs are all examples of queue jumping bugs)
the high bucket to be about 6 months deep - many parts of Canonical are on a 6-month cycle and fitting in with that is convenient

The quarterly review is responsible for shrinking the high bucket if it's too full.
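One way to picture the quarterly review is as trimming the high bucket to a target depth. The sketch below is only an illustration: `fixes_per_month` and the numeric `value` attached to each bug are assumptions (the page does not define either), and in practice the re-triage is a human judgement rather than a computation.

```python
def review_high_bucket(high_bugs, fixes_per_month, depth_months=6):
    """Split the high bucket into (keep, downgrade) lists.

    high_bugs: list of (bug_id, value) pairs; a higher value means the
        triager judges the bug more important (hypothetical score).
    fixes_per_month: assumed whole number of high bugs fixed per month.
    depth_months: target depth of the high bucket, 6 months by default.
    """
    capacity = fixes_per_month * depth_months
    # Most important first, so anything beyond capacity is the least important.
    ranked = sorted(high_bugs, key=lambda pair: pair[1], reverse=True)
    return ranked[:capacity], ranked[capacity:]


keep, downgrade = review_high_bucket(
    [(101, 5), (102, 9), (103, 1)], fixes_per_month=1, depth_months=2
)
```

Here the bucket holds three bugs but only has room for two, so the lowest-valued bug (103) is the candidate for downgrading to low.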

What we need to do in assessing the bucket for a bug, then, is to do *enough* sorting on it to see if it's a queue jumper, or if it's more important than the least important bug currently in the high bucket. Beyond that, all bugs are in the low bucket.

If a bug is a regression (the thing *was* working and now isn't), we sort it higher. We're currently discussing a policy that regressions are critical; if implemented, it will make them queue jumpers (critical bucket).

If the bug is one that has been escalated via the Launchpad stakeholder process, it is a queue jumper (critical bucket).

OOPS and timeout bugs also jump the queue: performance is very important to our stakeholders and OOPS dramatically affect our ability to operate and maintain Launchpad as well as being a very negative experience when encountered. The ZeroOopsPolicy contains details on this.

For things like browser support, when a new browser is released by a vendor that is in our supported-browser set, we should treat issues as regressions, so they will be queue jumpers.

Beyond these rules, a bug is more important than another bug if fixing it will make Launchpad better than fixing the other bug would. Discretion and a feel for what's in the bug database will help a lot here, as will awareness of our userbase and their needs. One sensible heuristic is to look at 5-10 existing high bugs, and if the new bug is less important than all of them, mark it low (it's probably less important than all existing high bugs).
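The guidelines so far can be sketched as a small decision function. Everything named here is hypothetical: the `Report` fields stand for judgements a triager makes by reading the bug, `value` is a made-up score for "how much better would fixing this make Launchpad", and the `regressions_are_critical` switch models the policy that is still under discussion. Treating an empty sample as "nothing to compare against, so high" is also an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Report:
    # Hypothetical fields filled in by the triager after reading the bug.
    is_oops: bool = False
    is_timeout: bool = False
    is_stakeholder_escalated: bool = False
    is_regression: bool = False
    value: int = 0  # subjective importance score, higher is more important


def is_queue_jumper(report, regressions_are_critical=False):
    if report.is_oops or report.is_timeout or report.is_stakeholder_escalated:
        return True
    # Only applies if the proposed "regressions are critical" policy is adopted.
    return regressions_are_critical and report.is_regression


def choose_bucket(report, sample_of_high, regressions_are_critical=False):
    """Return 'critical', 'high' or 'low' for a new report.

    sample_of_high: 5-10 existing high bugs to compare against.
    """
    if is_queue_jumper(report, regressions_are_critical):
        return "critical"
    # Less important than every sampled high bug: probably less important
    # than all high bugs, so it goes in the default bucket.
    if sample_of_high and all(report.value < other.value for other in sample_of_high):
        return "low"
    return "high"


existing_high = [Report(value=5), Report(value=7)]
```

For example, `choose_bucket(Report(value=3), existing_high)` gives `"low"`, while a report with `is_timeout=True` is critical regardless of the sample.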

Engineers have discretion to decide any particular bug should be sorted higher (or lower) than it has been; some change requests are very important to many of our users while still not big enough to need a dedicated feature-squad working on them (so these bugs may be high). When two engineers disagree, or if someone in the management chain disagrees, common sense and courtesy should be used in resolving the disagreement.

How to triage

Visit unknown/undecided importance bugs and untriaged status bugs

For each bug:

See if there are any duplicates by having a bit of a look around, searching your memory, etc. If you find a duplicate, mark the newer bug as a duplicate of the older bug (unless there is a compelling reason to use the newer bug as the master). Consider updating the description and tags of the older bug to help make it clearer. We use the older bug by default because we (roughly) work through bugs in the same bucket in date order.
If the bug is unrelated to Launchpad, move it somewhere appropriate.
If the bug is something we won't do at all, mark it as won't fix.
If it's an operational request, convert it to a question.
Apply the guidelines in 'Triage guidelines' to get a bucket for the bug and set the bug importance to that bucket.
If the bug status is 'Incomplete', check that the filer was asked to clarify something; if they were and haven't replied in a month, close the bug. Otherwise either ask them to clarify something, or set the bug to Triaged if they have clarified whatever was needed.
If the bug status is New, set it to Triaged.
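The per-bug steps above can be sketched as a function that returns the actions to take. This is an illustration under stated assumptions, not a Launchpad API: the `Bug` fields are hypothetical stand-ins for what the triager has found out, "a month" is taken to be 30 days, and `bucket` is whatever importance the triage guidelines produced.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass
class Bug:
    # Hypothetical record; the flags stand for the triager's findings.
    status: str = "New"
    duplicate_of: Optional[int] = None    # id of an older duplicate, if found
    about_launchpad: bool = True
    will_do: bool = True
    operational_request: bool = False
    asked_on: Optional[date] = None       # when the filer was asked to clarify
    filer_replied: bool = False


def triage_actions(bug, bucket, today):
    """Return the list of actions for one bug, following the steps in order."""
    if bug.duplicate_of is not None:
        return [f"mark as duplicate of bug {bug.duplicate_of}"]
    if not bug.about_launchpad:
        return ["move to an appropriate project"]
    if not bug.will_do:
        return ["set status to Won't Fix"]
    if bug.operational_request:
        return ["convert to a question"]

    actions = [f"set importance to {bucket}"]
    if bug.status == "Incomplete":
        # Assumption: "a month" means 30 days since the filer was asked.
        waited_a_month = (
            bug.asked_on is not None
            and today - bug.asked_on >= timedelta(days=30)
        )
        if bug.filer_replied:
            actions.append("set status to Triaged")
        elif waited_a_month:
            return ["close the bug"]
        else:
            actions.append("ask the filer to clarify")
    elif bug.status == "New":
        actions.append("set status to Triaged")
    return actions


today = date(2011, 4, 29)
new_bug_actions = triage_actions(Bug(), "low", today)
```

Returning a list of plain descriptions keeps the sketch independent of any particular bug tracker client; a real script would map each action onto the tracker's own operations.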

Assignment

Bug triage does not involve assigning an engineer. Engineers should only be assigned to bugs that are in progress. Even critical bugs do not need an engineer assigned: operational incidents are not tracked in the bug database, though critical bugs may be generated as followup work to be done; those bugs are then in the front-section of the queue, but that's all that is needed.

Selecting bugs to work on

The bug database holds the /project/ importance set of bugs. However individual or squad work-queues may be quite different. For instance, we have 3 squads working on features at any one time, 2 on maintenance. Generally speaking squads on feature-rotation will ignore 'importance' in selecting what to work on - they will be working on a feature and creating bugs as appropriate to create discussion points and todo items for that feature.

The Launchpad maintenance squads however will usually be working from the bug database - picking bugs up to work on based on their triaged importance. So for maintenance squads, they should simply look in each bucket in order - critical, high, low - and from within that bucket take one of the oldest bugs - one that seems interesting to them at the time. Crucially though, all bugs in the critical bucket should have someone or some squad working on them before any bugs in the high bucket are picked up and worked on, and likewise for low.
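A rough sketch of the maintenance-squad selection rule, assuming a hypothetical `Bug` record with a report date and an `in_progress` flag (neither is a Launchpad API name). It walks the buckets in order and only moves on once every bug in the current bucket has someone working on it; within a bucket it offers a few of the oldest bugs, leaving the final choice to the engineer.

```python
from dataclasses import dataclass
from datetime import date

# Buckets in the order a maintenance squad looks at them.
BUCKET_ORDER = ["critical", "high", "low"]


@dataclass(frozen=True)
class Bug:
    # Hypothetical record, for illustration only.
    bug_id: int
    importance: str
    reported: date
    in_progress: bool = False


def next_candidates(bugs, oldest=3):
    """Return up to `oldest` bugs a maintenance squad could pick from next."""
    for bucket in BUCKET_ORDER:
        unstarted = [
            bug for bug in bugs
            if bug.importance == bucket and not bug.in_progress
        ]
        if unstarted:
            # Oldest first; the engineer picks whichever of these interests them.
            return sorted(unstarted, key=lambda bug: bug.reported)[:oldest]
    return []


bugs = [
    Bug(1, "critical", date(2011, 1, 1), in_progress=True),
    Bug(2, "high", date(2011, 2, 1)),
    Bug(3, "high", date(2010, 12, 1)),
    Bug(4, "low", date(2009, 1, 1)),
]
candidate_ids = [bug.bug_id for bug in next_candidates(bugs, oldest=2)]
```

The only critical bug is already being worked on, so the candidates come from the high bucket, oldest first; the much older low bug is not offered while unstarted high bugs remain.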

Community work will often ignore our bug triage and focus on itch scratching - and this also applies to patches done by Launchpad engineers in their personal and slack time: the selection logic for picking a bug only applies to effort being put in as part of their primary duties. That is, it's always totally ok to fix that low priority bug that's really annoying you, whether you're a user of Launchpad or a developer. A bug fix is a bug fix!

- Revision 12 as of 2011-02-14 11:10:29
  Size: 11433
  Editor: allenap
  Comment: Some apostrophes.
+ Revision 13 as of 2011-04-29 21:38:59
  Size: 11496
  Editor: flacoste
  Comment: proper zerooopspolicy link
 Line 154:
-experience when encountered. The ZeroOopsPolicy contains details on this.
+experience when encountered. The [[https://dev.launchpad.net/PolicyAndProcess/ZeroOOPSPolicy|ZeroOopsPolicy]]
contains details on this.
