Bug triage is an integral part of the QA process. This guide aims to help you triage bugs in the Launchpad project.
XXX: matsubara: need to add a section explaining the different launchpad projects and how the initial triage queue works (e.g. https://launchpad.net/launchpad/+filebug)
Bug life cycle
The bug life cycle runs from the time a bug is filed in the bug tracker to the time it is closed. A closed bug has one of the final statuses: "Won't Fix", "Invalid", or "Fix Released".
- Bug reported (Starts as "New")
- Bug refined and clarified (may involve change to "Incomplete")
Then either:
- Bug rejected (Status "Won't Fix" or "Invalid")
Or the following steps:
- Bug marked Confirmed (may involve change to Triaged)
- Bug assigned a milestone
- Bug marked In Progress
- Bug fixed and reviewed
- Fix merged to RF
- Bug marked Fix Committed (with the comment "fixed in RF rNNNN")
- Launchpad release
- Bug marked Fix Released
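The life cycle above can be read as a small state machine. The following is a minimal sketch in Python; the transition table and the function names are illustrative assumptions drawn from the steps listed here, not Launchpad's own data model, and it simplifies by allowing rejection only from "New" or "Incomplete".

```python
# Hypothetical transition table derived from the life cycle steps above.
# It is a simplification for illustration, not Launchpad's actual rules.
ALLOWED_TRANSITIONS = {
    "New": {"Incomplete", "Confirmed", "Triaged", "Invalid", "Won't Fix"},
    "Incomplete": {"Confirmed", "Triaged", "Invalid", "Won't Fix"},
    "Confirmed": {"Triaged", "In Progress"},
    "Triaged": {"In Progress"},
    "In Progress": {"Fix Committed"},
    "Fix Committed": {"Fix Released"},
    # Final statuses have no outgoing transitions in this sketch.
    "Fix Released": set(),
    "Invalid": set(),
    "Won't Fix": set(),
}

FINAL_STATUSES = {"Won't Fix", "Invalid", "Fix Released"}


def can_move(current, target):
    """Return True if this sketch allows moving from current to target."""
    return target in ALLOWED_TRANSITIONS.get(current, set())


def is_closed(status):
    """Return True if the status is one of the final bug statuses."""
    return status in FINAL_STATUSES
```

For example, `can_move("New", "Incomplete")` is allowed, while `can_move("New", "Fix Released")` is not, because a fix has to be committed before it is released.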
Status definition
New: This is the status all newly filed bugs start with. The primary goal of the triage team is to handle all bugs with "New" status. It is meant to be a temporary state, and the triage team should aim to answer all newly filed reports within 48 hours.
Incomplete: Another temporary status. It's used when the bug report doesn't provide enough information about the problem. Usually this status change involves asking questions to the reporter to help clarify the issue:
- What were you trying to do?
- What was the expected result?
- What was the result?
- Can you attach a screenshot?
Confirmed: Means the bug is reproducible and the report contains all the information needed to reproduce it. Useful information to have in the bug report:
- Rationale behind the decision to implement the feature
- Use cases
- Meeting decisions
- Link to blueprints
- Link to other bugs
- Tags
Triaged: Means the developer working on the bug has enough information to start working on a fix. Usually the developer sets this.
In Progress: work has started on the bug.
Fix committed: work has been reviewed and merged into RF. Developers setting this status should leave a comment in the form: "Fixed in RF rNNNN", where NNNN is the revision number where the fix was committed.
Fix released: work has been released to the public after a Launchpad rollout or cherrypick.
Invalid: spam, bugs reported in the wrong place (e.g. distro bugs reported against Launchpad), questions better suited to the Answer tracker.
Won't Fix: It's good practice to add an explanation when setting this status. This tells the reporter that, although we understand the report and the reasons behind it, we are explicitly deciding not to fix it.
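The 48-hour target given for the "New" status can be checked mechanically. The sketch below assumes bugs are available as simple `(bug_id, status, date_created)` tuples; that shape and the function name are hypothetical and chosen only for illustration.

```python
from datetime import datetime, timedelta, timezone

# The triage target stated for bugs in the "New" status.
TRIAGE_TARGET = timedelta(hours=48)


def overdue_new_bugs(bugs, now):
    """Return the ids of "New" bugs that have waited longer than the target.

    bugs is an iterable of (bug_id, status, date_created) tuples; this
    input shape is an assumption made for the example.
    """
    return [
        bug_id
        for bug_id, status, date_created in bugs
        if status == "New" and now - date_created > TRIAGE_TARGET
    ]


# Usage with made-up data.
now = datetime(2024, 1, 10, 12, 0, tzinfo=timezone.utc)
example_bugs = [
    (1, "New", now - timedelta(hours=72)),
    (2, "New", now - timedelta(hours=2)),
    (3, "Incomplete", now - timedelta(days=30)),
]
overdue = overdue_new_bugs(example_bugs, now)
```

Only bug 1 is reported here: bug 2 is still inside the 48-hour window, and bug 3 has already left the "New" status.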
Importance definition
Or: "Why I don't classify bugs as medium"
(Based on an email from Curtis Hovey.)
The process of triaging issues (bugs, features, and tasks) has one crucial principle: Prioritise the work according to need and certainty.
Work is prioritised because there are not enough engineers to do all the work. Some features will never be completed, some bugs will never be fixed. Triage determines which bugs can and will be fixed, which features can and will be implemented. Need is generally understood, when planning work, but certainty is not, and that often leads to wasted work and unmet expectations.
By need, I mean a measure of severity: what percentage of users the issue affects, and how severely it impedes them from completing their task.
By certainty, I mean a measure of how certain the engineers are that they can address the issue. Time is also a factor in this measure: the longer an issue takes to address, the more likely it is that the conditions that were first judged will change.
The act of triage is separating work into groups that are being worked on now, next and last. There can only be as many "now" bugs or features as there are engineers. The amount of "next" work is limited by the velocity of the engineers and how infrequently plans change. The bugs that are last will probably never be addressed; the last features may never be started.
The corollary to this rule is that there are a finite number of bugs or features in the first two groups. There cannot be more work in these groups than there are engineers to do for the given period of time; otherwise the engineers, businesses and users are being misinformed about when issues will be addressed.
An Example
Suppose there is one engineer and two bugs. He can only work on one bug at a time. One bug is more important than the other. The risk is that he may not be able to fix one of the bugs before users are disappointed and abandon the application. He risks disappointing all users if he does not fix either bug because he chose the one with the most need over the one he was certain he could address.
If he does not know how to fix the bug with the most need, or if the fix takes a long time, he is wasting time he could have spent fixing the bug with more certainty. The only way he can address the bug with the most need is to employ a hack to reduce the need, to meet the expectations of some users. The hack is also used to gain time to understand the problem, and thus increase certainty.
Only Assign Work that You Are Committing to do in the Near Future
When work is assigned to an engineer, he is committing to complete the work in the near future. What the "near future" means is different for each project. I suggest 3 releases is the "near future", because when work is planned, the engineer is thinking about now, next, and last. For some projects this period might be 6 weeks, for others, 6 months.
I prefer to plan for the current release and the next one. As work is reprioritised, it may be rescheduled to the third release. I do not think it is wise to plan a bug or feature to be completed in the third release, because if it slips to the fourth or fifth release, I doubt that it was correctly prioritised as high.
Any high work that is assigned to an engineer for more than 3 releases was not high. If it were, the work would have been reassigned to someone who could complete it in the scheduled time. Any other work that is assigned for more than 1 release is also misprioritised. You are lying to yourself, and to the project's users, when you assign work that you are not committing to fixing.
Practical Classifications of Importance
Work is often classified in relative terms. It is better to classify work according to how it is managed, to convey when and under what terms the bug will be fixed or a feature will be completed. There are three priorities that work can be classified as:
- Critical
The bug dramatically impairs users. Users may lose their data. Users cannot complete crucial tasks. The feature is needed to encourage adoption or prevent abandonment of the project.
Synonyms: required, essential, now, must do
- The work is immediately assigned to an engineer. It is his top priority to fix. Team members help the engineer to plan and do the work. The work is released as soon as it is deployable; in the case of a bug, it is released outside of the release schedule.
- High
The bug prevents users from completing their tasks. The feature provides new kinds of tasks or new ways of completing tasks.
Synonyms: expected, next, can do, should do
- The work is assigned to an engineer to be completed in the next 3 releases. The engineer may choose to do other work if he believes it is within the scope of the high priority work.
- Medium
The bug is an inconvenience for many users. The feature provides new ways of completing tasks.
Synonyms: preferred
- The work is not scheduled, though it is intended to be completed. When the work is assigned, it may also be scheduled, but there is no commitment to complete it for the stated release. The engineer may choose to postpone the work in favour of more important work.
- Low
The bug is an inconvenience to users, but it does not prevent them from completing their tasks. The feature is a convenience to users.
Synonyms: optional, last, may do
The engineer may assign the work to himself while working on high priority work, because the high work provides an opportunity to complete the low priority work at less cost. If the low work in any way jeopardises the high priority work, the low work is unassigned. The engineer is thus certain that the work can be fixed quickly and without difficulty. A corollary to this rule is that low work that is assigned to an engineer must be in the "in progress" or "fixed" states.
The Problem with "Medium"
It might be argued that when the engineer has an opportunity to fix a low or a medium bug, he must choose the medium one. This rule does not define a practical distinction between medium and low. There is no commitment to fix the medium bug; it will not be scheduled for fixing. An engineer chooses to undertake a low bug because he sees an opportunity to fix it while working in the affected code. The engineer is choosing to do unscheduled work because he is certain it does not jeopardise his scheduled work. The engineer might see an opportunity to fix a medium and a low bug at the same time, but that is unlikely.
It can also be argued that 'critical' is 'high' and that 'high' is 'medium'. True, that is a matter of semantics. The crux of the issue is that there are three practical classifications of work. The words chosen to describe the classifications could use the tofu scale of hard, firm, and soft. People who are unfamiliar with triage will appreciate names that convey the kind of attention the issue will receive.
Some teams with a large number of bugs prefer to keep a pool of medium work from which releases are planned. Items in the pool may be escalated to high if it is perceived that once work is started, there should be a commitment to complete it as scheduled. This work is different from low work because the work makes a substantial improvement to the application, but like low, there is no commitment when the work will be completed. It can be argued that work starts on medium bugs and features because of changes to other priorities, certainties, or the number of users it affects.
Consequences of Misprioritised Work
Stakeholders often use reports that list the prioritised work for a release and for each engineer. When work is misclassified there are two commonly observed consequences: a decrease in certainty, and a decrease in communication.
In the first consequence, the engineer's effort may be wasted; there are issues that have more need and certainty. Engineers, and other stakeholders, are often tempted to complete the misdirected work after the misclassification is discovered because it is assumed that it is better to always deliver something finished than nothing at all. This is a risky choice, because it jeopardises work in future releases. By working on less important work, the engineer is decreasing the certainty of the more important work.
The second consequence is that the engineer ignores the list and works on issues according to some other source, such as the opinion of another stakeholder. While the engineer is working on the correct issue, it is unclear to other parties what work is going on and when it will be completed. Users may abandon the project in frustration. Planners cannot coordinate all the stakeholders.
The first consequence is possibly a failure to reprioritise during the triage process, but the second consequence is a total failure of the triage process. Why would anyone do triage if the prioritisation will be ignored? How can work be coordinated if the work is unknown to all stakeholders? Why would users trust a project if it does not do what it says it will do?
Work must be reprioritised during the triage process to ensure that engineers are working on the issues with the most need and certainty. Engineers must work from the list of prioritised issues.
Indicators of Misprioritised Work
The rules of practical classification provide tests for misprioritised bugs, features, or tasks.
- The work is critical, but it is not assigned and targeted for release.
- The work is prioritised as high, but it is not assigned and targeted for a release.
- The work is high, but has not been worked on in 3 releases.
- The work is low and unassigned, yet it is targeted for a release.
- The work is low and assigned, but the engineer is not working on it.
- The work is considered to be triaged, but its priority is not critical, high, or low.
- An engineer is assigned more work than he can accomplish in 3 releases, and it cannot be reassigned.
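Most of the indicators above are tests on a single work item, so they can be sketched as a checker. The `WorkItem` fields below (including `releases_idle`) and the function name are hypothetical, introduced only for this example. The last indicator is left out because it concerns an engineer's total workload, not a single item.

```python
from dataclasses import dataclass
from typing import List, Optional

# Statuses that this sketch treats as "the engineer is working on it".
ACTIVE_STATUSES = ("In Progress", "Fix Committed", "Fix Released")


@dataclass
class WorkItem:
    """Hypothetical record of a bug, feature, or task."""
    importance: Optional[str]        # "critical", "high", "medium", "low" or None
    assignee: Optional[str] = None
    milestone: Optional[str] = None  # the release the work is targeted for
    status: str = "Triaged"
    releases_idle: int = 0           # releases that passed without work on it


def misprioritised_reasons(item: WorkItem) -> List[str]:
    """Return the indicators of misprioritised work that apply to item."""
    reasons = []
    assigned = item.assignee is not None
    targeted = item.milestone is not None
    if item.importance in ("critical", "high") and not (assigned and targeted):
        reasons.append(
            f"{item.importance} work must be assigned and targeted for a release")
    if item.importance == "high" and item.releases_idle >= 3:
        reasons.append("high work has not been worked on in 3 releases")
    if item.importance == "low" and not assigned and targeted:
        reasons.append("low work is unassigned yet targeted for a release")
    if item.importance == "low" and assigned and item.status not in ACTIVE_STATUSES:
        reasons.append("low work is assigned but not being worked on")
    if item.status == "Triaged" and item.importance not in ("critical", "high", "low"):
        reasons.append("triaged work must be critical, high, or low")
    return reasons
```

An empty list means none of the per-item indicators fired; it does not prove that the item is correctly prioritised.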
Initial triage
XXX: matsubara: update this section to explain use of Triaged status and why we don't use Confirmed for LP bugs. XXX: update this section to explain how we don't use wishlist importance and how feature requests should be proposed/tagged.
Getting bugs out of the New state is a good thing. We can do this in three ways:
- Marking a bug Confirmed
- Marking a bug Invalid or Won't Fix
- Marking a bug as duplicate
In the process of getting a bug to one of the two initial states or marked as duplicate, its status may be temporarily set to Incomplete.
Before a bug report can be Confirmed, there are four things that should be clear:
- What happened
- What should have happened
- What the difference is
- Why the change in behavior is justified
Often, the original report doesn't explicitly state all of these things. If the report relates to an OOPS, the "What happened" part is clear, but aside from "the page should not OOPS", it may not be clear what the page should display in the circumstances in which the error occurred. During bug triage, update the bug description if necessary to clarify these things and make it easy for people reading the report to understand the problem.
In the process of discussing the bug with the reporter, we may find that we want to do something different to what the reporter suggested. When this happens, we should ask them "we want to do this a different way, what do you think?". Some bugs may evolve to become different to what was reported.
When looking at a bug, it's important to ask "Is the system here simple enough? Can it be simpler?". A bug can point to a bigger underlying problem than the symptom the reporter is complaining about. Reporters can suggest solutions which are more complicated than they need be. Sometimes, removing code from Launchpad can fix bugs.
OOPS bugs
Along with responding to new bugs, dealing with oopses is one of the most important QA tasks for Launchpad. Oopses can be divided into 5 types:
- Programming errors
- Hard timeouts (RequestExpired, RequestQueryTimedOut)
- Soft timeouts (SoftRequestTimeout)
- Programming errors that we don't care enough to fix (user generated errors section in the Oops Summary)
- Not found errors
Of these five kinds, the first two indicate a problem that needs fixing urgently ("severe oopses"). For each programming error and hard timeout, there should be a corresponding bug report, so that work can proceed on fixing the problem. The goal is to eliminate these kinds of oopses completely. To help manage these kinds of bug, we use the special "oops" and "timeout" tags.
While soft timeouts and not found errors can indicate problems, they are much less of a priority. The exception is when a surge in one of these categories is detected. For example, if the reports show a surge in Not found errors, it usually means there is an exposed broken link in the UI, so this kind of bug becomes a priority.
In order to ensure that oopses are getting enough attention, the QA team prepares a list of bugs tagged with "oops" or "timeout" and brings them up during the Launchpad developer meeting.
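The five oops types and the two severe kinds can be sketched as a small classifier. The exception names come from the list above; the `user_generated` and `not_found` flags, the function names, and the mapping of programming errors to the "oops" tag and hard timeouts to the "timeout" tag are assumptions made for this example.

```python
# Exception names taken from the oops types listed above.
HARD_TIMEOUTS = {"RequestExpired", "RequestQueryTimedOut"}
SOFT_TIMEOUTS = {"SoftRequestTimeout"}


def classify_oops(exception_type, user_generated=False, not_found=False):
    """Return which of the five oops types an oops belongs to.

    The two flags are hypothetical inputs; how an oops is recognised as
    user generated or as a not found error is outside this sketch.
    """
    if exception_type in HARD_TIMEOUTS:
        return "hard timeout"
    if exception_type in SOFT_TIMEOUTS:
        return "soft timeout"
    if not_found:
        return "not found"
    if user_generated:
        return "user generated"
    return "programming error"


def bug_tag(kind):
    """Return the tag for a severe oops, or None if no bug report is required.

    Assumes programming errors are tagged "oops" and hard timeouts "timeout".
    """
    return {"programming error": "oops", "hard timeout": "timeout"}.get(kind)
```

For instance, `bug_tag(classify_oops("RequestExpired"))` gives `"timeout"`, while a `SoftRequestTimeout` yields `None` because soft timeouts only become a priority when a surge is detected.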
More about the OOPS QA process
Tagging bug reports
Launchpad bug reports should be tagged according to these tags.
If you find a bug report that would benefit from grouping and the report is not covered by any of the existing tags, feel free to propose a new one.
Assigning Milestones
Milestones in the Launchpad project are used to plan what Launchpad developers will work on for that specific release.
Milestones are set by developers and leads, and occasionally by the QA team.
Assigning a milestone is important because:
- it helps team leads and management plan the workload for the cycle;
- it keeps the Launchpad community better informed about when bugs are expected to be fixed.
When changing the milestone (i.e. postponing a bug fix), leave a comment explaining why it was changed.
Dealing with duplicates
Bugs are marked as duplicates when they describe the same user-visible problem. They shouldn't be marked as duplicates if they have different user-visible problems, but the same presumed underlying cause.
For bugs that have the same underlying cause, but the user-described problem is different, leave a comment pointing to the bug that best describes the underlying problem.
File a bug report or ask a question?
Occasionally users end up filing bug reports to request action from a Launchpad admin.
For instance, we have some spam all over Launchpad and Canonical hosted wikis (e.g. http://wiki.ubuntu.com) and sometimes users file bugs so an admin can deal with a specific instance of spam.
The best way to deal with specific cases of spam, like deleting an inappropriate comment, is to file a question and have a Launchpad admin solve it. Bug reports, in this sense, should be used to report the more general problem, something like "Launchpad needs better spam protection", and questions should be used to deal with the specific symptoms.
Such questions should then be linked to the bug report that would avoid all the questions in the first place.
Wrongly filed bugs
Sometimes users file bugs when what they really want is help to fix a specific issue. In cases like this, convert the bug report into a question, leaving a comment to the user explaining what happened.
Relevant information on bug reports
Oops references
Oops references are automatically converted into links, so in general, full tracebacks are not required in bug reports. It's useful to include the exception type and exception value, though, as it makes searching for those bugs easier.
Steps to reproduce
When adding links to bugs to demonstrate things, links should be to staging or launchpad.dev. This will discourage developers from changing other people's information in production. Staging will be changed so that all Launchpad developers have admin rights (bug #30670).
Good bug summaries
A bug's summary should be the shortest understandable phrase that describes the problem specifically.
- Horrible: Various problems
- Bad: OOPS ABC-1234
- Good: I get an oops when contributing to a bounty
- Better: Oops when contributing to a bounty
- Great: Contributing to bounty fails if I've already contributed
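The progression from "Horrible" to "Great" suggests a few rough checks that could flag weak summaries. The heuristics, word list, and function name below are illustrative assumptions, not an established Launchpad rule, and they cannot tell a "Better" summary from a "Great" one.

```python
import re

# A summary that consists only of an OOPS reference, e.g. "OOPS ABC-1234".
OOPS_ONLY = re.compile(r"^oops[\s-]+\S+$", re.IGNORECASE)
# Words that say nothing specific on their own (hypothetical list).
VAGUE_WORDS = {"various", "problem", "problems", "issue", "issues", "broken"}
# Leading filler such as "I get an ..." that only lengthens the summary.
FILLER_PREFIX = re.compile(r"^i (get|got|see|have) (an? )?", re.IGNORECASE)


def summary_warnings(summary):
    """Return a list of warnings for a bug summary; empty means no complaint."""
    text = summary.strip()
    words = text.lower().split()
    warnings = []
    if OOPS_ONLY.match(text):
        warnings.append("only an OOPS reference; describe what fails")
    elif words and all(word in VAGUE_WORDS for word in words):
        warnings.append("too vague; name the specific problem")
    if FILLER_PREFIX.match(text):
        warnings.append("drop the leading filler to shorten the summary")
    return warnings
```

Applied to the examples above, the first three summaries each produce one warning and the last two produce none.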