Diff for "PolicyAndProcess/Downtime"

Differences between revisions 31 and 32
Revision 31 as of 2010-03-04 22:15:27
Size: 10407
Editor: flacoste
Comment: Remove reference to CP as nothing blocks PQM. Add note about overly pessimistic downtime estimate
Revision 32 as of 2010-06-09 21:09:43
Size: 10993
Editor: flacoste
Comment: Add communication requirements and incident report
Line 123: Line 123:
 * Immediately after the roll-out, examine the site for problems. For example, ensure CSS loads properly, all external links on the front page are reachable, etc.
Line 127: Line 125:

 * Start an [[IncidentTrackingTemplate|incident report]] to document
 any issues that will be encountered during the roll-out.
Line 140: Line 141:
Line 141: Line 143:

  * Immediately after the roll-out, examine the site for problems. For example, ensure CSS loads properly, all external links on the front page are reachable, etc.

  * Announce on `launchpadstatus` that the release is done.

  * Announce any post-roll out issues on `lauchpadstatus` as they are
  discovered. Add them to the incident report. (Following the
  Launchpad/PolicyandProcess/Announcements/IncidentProcess, where the release
  manager acts as the communication liaison.)
Line 147: Line 158:
  
Line 153: Line 163:

  * Complete the incident report and ask any engineers that participated in
  fixing the issues in doing root-cause analysis.

  • Process Name: Release Manager Rotation Process

  • Process Owner: Francis Lacoste

  • Parent Process/Activity: None

  • Supported Policy: None

Process Overview

Each cycle, a different engineer takes the role of release manager. The release manager coordinates with the release team and all team leads to ensure that the tree is ready for the roll-out and that all critical bugs are fixed or worked around.

Back-up release managers are the two RMs from the previous two cycles.

One option that has worked very well is to share the release-manager role across timezones, handing over the current status and tasks to the backup in the next timezone at the end of your day. It's a great opportunity to work together as a team.

Release Manager inputs

Activities

Week 3 (the one before the roll-out)

  • During week 3 ensure that staging is up-to-date so that it can be used for non-edge QA.

  • During week 3 ensure that Matt Revell has a downtime announcement email ready for lp-announce.
  • The down-time estimate should be based on the last staging DB restore.
    • We should be overly conservative in the read-only announcement. Recently, we have consistently encountered DB locking issues during the roll-out, which increased the read-only window. Until we have stable DB upgrades and can determine a reliable fudge factor between time on staging and production, we should plan for the worst and come back up earlier than announced.
  • During week 3 send out an email to Stuart Metcalf from Canonical ISD about the upcoming roll-out to ask about any changes that need to be rolled out for canonical-identity-provider or shipit. (Launchpad roll-outs imply a roll-out of the Canonical Identity Provider and ShipIt code.)

  • You might want to request that pqm-blockers (such as cherry-picks) are not processed on the day that PQM is closing.
  • At the end of the week, make sure all the cowboys listed on LaunchpadProductionStatus are either already landed or listed among CurrentRolloutBlockers.

  • Also go through all the 'unusual rollout requirements' and clean them up from the LaunchpadProductionStatus page.

Release Week

  • At the beginning of week 4 (Monday 00:00 UTC), make sure that release-critical has been turned on in PQM.
    • There is an option to leave devel open for r-c landings as well until Tuesday.
  • At the beginning of week 4, schedule a call with the Foundations team lead (and other leads if known to be pertinent) to determine what system changes might need comprehensive QA. If any exist, consider the following:
    • Any related problem encountered should be treated as a red flag, forcing more thorough QA.
    • The Foundations lead should report on reviewing logs on edge, such as cronscript output.
  • Look at the staging DB restore time for the weekend and determine if any changes to the announced down-time should be made.
  • Determine the schedule and deadlines. Send an email to launchpad-dev with all of the deadlines, similar to this example email. Place the deadlines on the team calendar.

  • Update the #launchpad-dev topic to state we are in 'Release Critical' and to list the release manager.

  • Maintain the list of the Current roll-out blockers

    • The release manager should poll the team leads and QA engineers continuously to ensure that the list of release blockers is up-to-date. (We need to explore a work-around to retire this wiki page and do the management in Launchpad.) All bugs that are likely to cause lots of OOPSes, time-outs or prevent several users from working are good CRB candidates. It's a good idea to subscribe yourself to the page. (Currently broken.)
  • Make sure that developers are assigned to all problems we want to fix.
  • Review release-critical merge proposals. The policy should be:
    • All RC candidates go through the normal review process.
    • After code and UI review the MP is left in 'Needs Review' state.
    • A new review of type 'release-critical' is added to the MP and assigned to the release manager.
    • If the MP is approved for 'release-critical', the review is marked 'Approve' and the state of the MP is set to 'Approved'.
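
The review policy above amounts to a small state transition on the merge proposal. A minimal sketch follows, assuming a hypothetical in-memory MergeProposal class; it is not the Launchpad API, and the names are for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class MergeProposal:
    """Hypothetical stand-in for a Launchpad merge proposal (not the real API)."""
    status: str = "Needs Review"
    reviews: dict = field(default_factory=dict)  # review type -> details


def request_release_critical(mp, release_manager):
    """Add a pending 'release-critical' review assigned to the release manager."""
    if mp.status != "Needs Review":
        # The policy leaves the MP in 'Needs Review' after code and UI review.
        raise ValueError("MP must be in 'Needs Review' to request release-critical")
    mp.reviews["release-critical"] = {"reviewer": release_manager, "vote": None}


def approve_release_critical(mp):
    """Mark the release-critical review 'Approve' and set the MP to 'Approved'."""
    review = mp.reviews.get("release-critical")
    if review is None:
        raise ValueError("no release-critical review was requested")
    review["vote"] = "Approve"
    mp.status = "Approved"


mp = MergeProposal()
request_release_critical(mp, "release-manager")
approve_release_critical(mp)
print(mp.status)
```

The point of the sketch is that the MP only moves to 'Approved' through the release manager's release-critical review, never directly after the normal code and UI review.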

On the day before the roll-out

  • Check that the LOSAs do a staged deployment of the code. This is to catch any hidden build problems and to determine how long this step will take.
  • Request that landing to the devel branch be closed 24 hours before the scheduled release. On the last day, all changes should be merged through db-devel.

On the day of the roll-out

  • Chase up Current Rollout Blockers and any other pending release-critical fixes.

  • With PQM remaining open, have the LOSAs stop buildbot and set it to do manual runs.
  • Remind people that all changes need to be in buildbot for 9 hours before the roll-out time. The LOSAs require two hours of pre-release preparation and we need to allow for two complete buildbot cycles. (9 = 2 + 2 * 3.5)

  • In the case of failures, it's best to roll out the last-known-good build rather than delaying the release. The cut-off point to decide which revision to roll out is 2 hours before the scheduled release.

  • Re-announce the downtime on the launchpadstatus account at least 4 hours before the actual roll-out, so that it serves as a reminder.

  • Ensure that any embargoed external resources (e.g. blog entries) are live and accessible through the links provided. Ensure that a blog editor (Matthew Revell or delegate) is available at the time of the roll-out.
  • Check that the source code dependency revision numbers are correct: run "bzr branch lp:~launchpad-pqm/lp-production-configs/trunk", look at config-manager/production-{devel,stable}, and compare the revision numbers with what is listed in utilities/sourcedeps.conf.

  • Start an incident report to document any issues encountered during the roll-out.
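
The timing rules in the list above (the 9-hour landing deadline, the 2-hour revision cut-off and the 4-hour re-announcement) can be worked out from the roll-out time. The sketch below assumes a 3.5-hour buildbot cycle, as implied by the 9 = 2 + 2 * 3.5 arithmetic; the function name and the example roll-out date are hypothetical.

```python
from datetime import datetime, timedelta, timezone

LOSA_PREP = timedelta(hours=2)         # pre-release preparation by the LOSAs
BUILDBOT_CYCLE = timedelta(hours=3.5)  # assumed length of one buildbot cycle
BUILDBOT_CYCLES = 2                    # complete cycles to allow for
REVISION_CUTOFF = timedelta(hours=2)   # decide which revision to roll out
REANNOUNCE_LEAD = timedelta(hours=4)   # reminder on launchpadstatus


def rollout_deadlines(rollout):
    """Return the key deadlines for a roll-out scheduled at `rollout`."""
    landing_lead = LOSA_PREP + BUILDBOT_CYCLES * BUILDBOT_CYCLE  # 9 hours
    return {
        "landing_deadline": rollout - landing_lead,
        "reannounce_by": rollout - REANNOUNCE_LEAD,
        "revision_cutoff": rollout - REVISION_CUTOFF,
    }


# Arbitrary example date, using the Americas roll-out time of 23:00 UTC.
rollout = datetime(2010, 6, 9, 23, 0, tzinfo=timezone.utc)
for name, when in rollout_deadlines(rollout).items():
    print(f"{name}: {when:%Y-%m-%d %H:%M} UTC")
```

For a 23:00 UTC roll-out, this gives a landing deadline of 14:00 UTC, a re-announcement by 19:00 UTC and a revision cut-off at 21:00 UTC.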

During the roll-out

  • If unexpected problems during the roll-out put the schedule off-track by more than 30 minutes, the roll-out should be aborted. In that case:
    • Identify the source of the problem and come up with a best estimate of how soon it can be fixed.
    • Determine when the next attempt should be undertaken and make sure it is announced. Give at least 12 hours of advance notice.
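
The abort rule above can be expressed as two small checks. This is a sketch only; the function names are hypothetical, and the thresholds are the 30 minutes and 12 hours stated in the list.

```python
from datetime import datetime, timedelta, timezone

ABORT_THRESHOLD = timedelta(minutes=30)  # schedule slip that triggers an abort
MIN_NOTICE = timedelta(hours=12)         # notice required before the next attempt


def should_abort(delay):
    """True when the roll-out is off-track by more than 30 minutes."""
    return delay > ABORT_THRESHOLD


def earliest_retry(announced_at):
    """Earliest time for the next attempt, given when it is announced."""
    return announced_at + MIN_NOTICE


print(should_abort(timedelta(minutes=45)))
print(earliest_retry(datetime(2010, 6, 9, 23, 45, tzinfo=timezone.utc)))
```

Note that a slip of exactly 30 minutes does not trigger an abort under this reading of "more than 30 minutes".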

After the roll-out

  • Immediately after the roll-out, examine the site for problems. For example, ensure CSS loads properly, all external links on the front page are reachable, etc.
  • Announce on launchpadstatus that the release is done.

  • Announce any post-roll-out issues on launchpadstatus as they are discovered. Add them to the incident report. (Follow the Launchpad/PolicyandProcess/Announcements/IncidentProcess, where the release manager acts as the communication liaison.)

  • With the QA engineers, review the OOPS reports.
    • All common OOPSes are candidates for more release-critical fixes and scheduling another roll-out.
  • Prepare and schedule any necessary re-roll.
  • When a re-roll is needed, the same activities apply as in the pre-roll-out case.
  • Open the tree for the next cycle once the released version is fine.
  • Complete the incident report and ask any engineers who participated in fixing the issues to help with the root-cause analysis.
  • The release manager needs to select the next release manager.

Re-opening PQM

Once the roll out is complete and any critical issues have been dealt with, it's time to re-open PQM. Before doing that, though, we need to merge db-devel back into devel.

Release critical policy

  • To apply for a release-critical approval, you must have a reviewed merge proposal on Launchpad. The engineer adds a review of type release-critical to the merge proposal and ensures it is in the 'Needs Review' state.

  • Good candidates for release-critical approval are issues found during QA that are bound to create OOPSes and time-outs or otherwise significantly inconvenience our end-users.
  • Apart from special exceptions discussed with the project lead, only bug fixes should be granted release-critical approval.
  • If there is no way for the developer to QA their change on staging through the normal update procedure before the roll-out, it's recommended to request a cowboy of the branch on staging to QA it before approval.

Database Patches

  • Release-critical branches containing database patches should only be accepted if they don't impact the estimated roll-out time.
  • On the day of the roll-out, only database patches that are critical to prevent data corruption should be accepted.
  • For the second re-roll, again only DB patches critical to data safety should be considered, since DB patches impact our ability to update without down-time.

Scheduling

  • Engineers apply in advance for one cycle.
  • They are selected by the previous release manager. Once selected, their name is put on the Launchpad Production Status page.

  • The actual roll-out time is determined by the release-manager's location:
      Location        Roll out time
      Americas        23:00 UTC
      Europe          10:00 UTC
      Asia/Pacific    00:00 UTC

  • No engineer should apply for the role more than twice a year.
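
The location rule above is a simple lookup from the release manager's region to a UTC roll-out time. A minimal sketch, with a hypothetical function name and the three regions taken from the list above:

```python
from datetime import time, timezone

# Roll-out time (UTC) by the release manager's location.
ROLLOUT_TIME_UTC = {
    "Americas": time(23, 0, tzinfo=timezone.utc),
    "Europe": time(10, 0, tzinfo=timezone.utc),
    "Asia/Pacific": time(0, 0, tzinfo=timezone.utc),
}


def rollout_time(location):
    """Return the roll-out time for a release manager's location."""
    try:
        return ROLLOUT_TIME_UTC[location]
    except KeyError:
        raise ValueError(f"unknown location: {location!r}") from None


print(rollout_time("Europe").strftime("%H:%M UTC"))
```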

References

PolicyAndProcess/Downtime (last edited 2011-06-06 22:02:02 by flacoste)