Diff for "PolicyAndProcess/Downtime"


Differences between revisions 1 and 32 (spanning 31 versions)
Revision 1 as of 2009-07-21 23:00:34
Size: 4124
Editor: flacoste
Comment:
Revision 32 as of 2010-06-09 21:09:43
Size: 10993
Editor: flacoste
Comment: Add communication requirements and incident report
Line 15: Line 15:
Back-up release managers are the two RMs from the previous two cycles.

One option that has worked very well is to share the release-manager role across timezones, handing over the current status and tasks to the backup in the next timezone at the end of your day. It's a great opportunity to work together as a team.
Line 18: Line 22:
  * OOPS report
  * Merge proposal
  * OOPS reports
  * Merge proposals
  * [[https://wiki.canonical.com/InformationInfrastructure/OSA/LaunchpadProductionStatus|LaunchpadProductionStatus]] (private)
Line 23: Line 28:
=== Before the roll-out === === Week 3 (the one before the roll-out) ===

  * During week 3 ensure that staging is [[https://staging.launchpad.net/successful-updates.txt|up-to-date]] so that it can be used for non-edge QA.

  * During week 3 ensure that Matt Revell has a downtime announcement email ready for lp-announce.

  * The down-time estimate should be based on the last staging DB restore.

    We should be overly conservative in the read-only announcement.
    Recently, we always encountered DB locking issues during the roll-out which
    increased the read-only window. Until we are able to have stable DB
    upgrades and determine a reliable fudge factor between time on staging and
    production, we should plan for the worst and come back earlier.

  * During week 3 send out an email to Stuart Metcalf from Canonical ISD about
  the upcoming roll-out to ask about any changes that need to be rolled-out
  for canonical-identity-provider or shipit. (Launchpad roll-outs imply
  a roll-out of the Canonical Identity Provider and ShipIt code).

  * You might want to request that pqm-blockers (such as cherry-picks) are not processed on the day that PQM is closing.

  * At the end of the week, make sure all the cowboys listed on [[https://wiki.canonical.com/InformationInfrastructure/OSA/LaunchpadProductionStatus|LaunchpadProductionStatus]] are either already landed or listed among CurrentRolloutBlockers

  * Also go through all the 'unusual rollout requirements' and clean them up from [[https://wiki.canonical.com/InformationInfrastructure/OSA/LaunchpadProductionStatus|LaunchpadProductionStatus]] page

=== Release Week ===
Line 26: Line 56:
  in PQM.

  * Update the `#launchpad-dev` topic to list him as release-manager.
  in PQM. (Monday 00:00 UTC)
     * There is an option to leave devel open for r-c landings as well until Tuesday.

  * At the beginning of week 4, schedule a call with the Foundations team lead
  (and other leads if known to be pertinent) to determine what system
  changes might need comprehensive QA. If these exist, consider these
  thoughts.

     * Any related problem encountered should be treated as a red flag,
     forcing more thorough QA.

     * Foundations lead should report on reviewing logs on edge, such
     as of cronscript output.

  * Look at the staging DB restore time for the week-end and determine if any
  changes to the announced down-time should be made.

  * Determine the schedule and deadlines. Send an email to launchpad-dev with all of the deadlines, similar to this [[ExampleReleaseScheduleEmail|example email]]. Place the deadlines on the team calendar.

  * Update the `#launchpad-dev` topic to state we are in 'Release Critical' and to list the release manager.
Line 33: Line 80:
    continuously to ensure that the list of release blockers is up to date.

    All bugs that are likely to cause lots of OOPSes, time outs or prevent
    continuously to ensure that the list of release blockers is up-to-date.  (We need to explore a
    work-around to retire this wiki page and do the management in Launchpad.)


    All bugs that are likely to cause lots of OOPSes, time-outs or prevent
Line 38: Line 86:
    It's a good idea to subscribe yourself to the page. (Currently broken.)
Line 40: Line 90:
  * Review release-critical merge.   * Review release-critical merge proposals. The policy should be:
     * All RC candidates go through the normal review process.
     * After code and UI review the MP is left in 'Needs Review' state.
     * A new review of type 'release-critical' is added to the MP and assigned to the release manager.
     * If the MP is approved for 'release-critical', the review is marked 'Approve' and the state of the MP is set to 'Approved'.
Line 45: Line 99:
  * Request that landing to the `devel` branch be closed. (All changes    should on the last day be merged through `db-devel`.)
  * Check that the LOSA do a staged deployment of the code. We are looking for
  any hidden build problems and to determine the amount of time this step will
  take.

  *
Request that landing to the `devel` branch be closed, 24 hours before the scheduled release.  All changes should on the last day be merged through `db-devel`.
Line 54: Line 110:
  * Remind people that all changes need to be in buildbot for '''6 hours'''
  before the roll-out time.
  * With PQM remaining open, have the LOSAs stop buildbot and set it do manual runs.
  * Remind people that all changes need to be in buildbot for '''9 hours'''
  before the roll-out time. The LOSAs require two hours of pre-release preparation and we need
  to allow for two complete buildbot cycles. (9 = 2 + 2 * 3.5)
Line 61: Line 119:
  * Re-announce downtime so it serves as a reminder on [[http://identi.ca/launchpadstatus|launchpadstatus]] account at least 4h before the actual rollout.

  * Ensure that any embargoed external resources (e.g. blog entries) are live and accessible through the links provided. Ensure that a blog editor (Matthew Revell or delegate) is available at the time of the roll-out.

 * Check that source code dependencies revision numbers are correct (compare
 them with what is listed in `utilities/sourcedeps.conf`) "bzr branch lp:~launchpad-pqm/lp-production-configs/trunk" then look at config-manager/production-{devel,stable}

 * Start an [[IncidentTrackingTemplate|incident report]] to document
 any issues that will be encountered during the roll-out.

=== During the roll-out ===

  * If some unexpected problems are encountered during the roll-out and these
  put the roll-out schedule off-track by more than 30 minutes, the decision to
  abort the roll-out should be taken. In that case:

    * Identify the source of the problem and come up with a best estimate on
    how soon it can be fixed.

    * Determine when the next attempt should be undertaken and make sure it is
    announced. Give at least 12 hours of fore-notice.
Line 64: Line 144:
  * Immediately after the roll-out, examine the site for problems. For example, ensure CSS loads properly, all external links on the front page are reachable, etc.

  * Announce on `launchpadstatus` that the release is done.

  * Announce any post-roll out issues on `lauchpadstatus` as they are
  discovered. Add them to the incident report. (Following the
  Launchpad/PolicyandProcess/Announcements/IncidentProcess, where the release
  manager acts as the communication liaison.)
Line 66: Line 155:
    All common OOPSes are canditates for more release-critical fixes and     All common OOPSes are candidates for more release-critical fixes and
Line 75: Line 164:
  * The release-manager need to select the next release manager.   * Complete the incident report and ask any engineers that participated in
  fixing the issues in doing root-cause analysis.

  * The release-manager needs to select the next release manager.

== Re-opening PQM ==

Once the roll out is complete and any critical issues have been dealt with, it's time to re-open PQM. Before doing that, though, we need to merge db-devel back into devel.
Line 80: Line 176:
  merge proposal on Launchpad. The release manager simply add a review of type
  `release-critical` to the merge proposal.

  * Any issues found during QA that is bound to create OOPSes, time outs or be
  very inconveniencing to users are good candidate for release-critical
  approval.

  * Apart special exceptions discussed with the project lead, only bug fixes
  merge proposal on Launchpad. The engineer adds a review of type
  `release-critical` to the merge proposal and ensures it is in the 'Needs Review' state.

  * Good candidates for release-critical approval are issues found during QA that are
  bound to create OOPSes and time outs or otherwise significantly inconvenience our end-users.

  * Apart from special exceptions discussed with the project lead, only bug fixes
Line 90: Line 185:
  * If there is no way that the developer can QA his change on staging through
  the normal update procedure before the roll-out, for complex changes, it's
  recommended to ask a cow-boy of the branch on staging to QA it before
  approval.

  * For the second roll-out, any change requiring database changes should go
  through the project lead, since a re-roll with a DB updates creates
  significant down-time for our users.
  * If there is no way for the developer to QA his change on staging through
  the normal update procedure before the roll-out, it's recommended to request
  a cowboy of the branch on staging to QA it before approval.

=== Database Patches ===

  * Release-critical branches containing database patches should only
  be accepted if they don't impact the estimated roll-out time.

  * One the day of the roll-out, only database patches that would be critical
  to prevent data corruption should be accepted.

  * For the second re-roll, again only DB patches critical to data safety
  should be considered as it impacts our ability to update without down-time.
Line 102: Line 202:
  * Engineer apply in advance for one cycle.

  * They are selected by the previous release manager. Once selected, their
  * Engineers apply in advance for one cycle.

  * They are selected by the previous release manager. Once selected, their name
Line 107: Line 207:
  * The actual roll-out time is determined based on the release-manager
 
location:
  * The actual roll-out time is determined by the release-manager's location:
Line 111: Line 210:
        || Americas || 00:00UTC ||
        || Europe || 09:00UTC ||
        || Americas || 23:00UTC ||
        || Europe || 10:00UTC ||
Line 115: Line 214:
  * No engineer can apply for the role more than twice a year.   * No engineer should apply for the role more than twice a year.

== References ==

 * [[https://wiki.canonical.com/InformationInfrastructure/OSA/LaunchpadRollout | OSA Launchpad Rollout Procedures]]
 * SpuriousFailures -- useful for diagnosing last-minute build failures
 * CurrentRolloutBlockers -- things currently blocking rollout
 * [[QATeam/TestPlans]] -- proof from all the teams that their code works
 * [[https://wiki.canonical.com/InformationInfrastructure/OSA/LaunchpadProductionStatus|Launchpad production status]]

  • Process Name: Release Manager Rotation Process

  • Process Owner: Francis Lacoste

  • Parent Process/Activity: None

  • Supported Policy: None

Process Overview

Each cycle a different engineer takes the role of release manager. The release manager coordinates with the release team and all team leads to ensure that the tree is ready for the roll-out and that all critical bugs are in or worked-around.

Back-up release managers are the two RMs from the previous two cycles.

One option that has worked very well is to share the release-manager role across timezones, handing over the current status and tasks to the backup in the next timezone at the end of your day. It's a great opportunity to work together as a team.

Release Manager inputs

Activities

Week 3 (the one before the roll-out)

  • During week 3 ensure that staging is up-to-date so that it can be used for non-edge QA.

  • During week 3 ensure that Matt Revell has a downtime announcement email ready for lp-announce.
  • The down-time estimate should be based on the last staging DB restore.
    • We should be overly conservative in the read-only announcement. Recently, we have always encountered DB locking issues during the roll-out, which increased the read-only window. Until we have stable DB upgrades and can determine a reliable fudge factor between time on staging and production, we should plan for the worst and come back earlier.
  • During week 3 send an email to Stuart Metcalf from Canonical ISD about the upcoming roll-out to ask about any changes that need to be rolled out for canonical-identity-provider or shipit. (Launchpad roll-outs imply a roll-out of the Canonical Identity Provider and ShipIt code.)

  • You might want to request that pqm-blockers (such as cherry-picks) are not processed on the day that PQM is closing.
  • At the end of the week, make sure all the cowboys listed on LaunchpadProductionStatus are either already landed or listed among CurrentRolloutBlockers.

  • Also go through all the 'unusual rollout requirements' and clean them up from the LaunchpadProductionStatus page.

Release Week

  • At the beginning of week 4, make sure that release-critical was turned on in PQM (Monday 00:00 UTC).
    • There is an option to leave devel open for r-c landings as well until Tuesday.
  • At the beginning of week 4, schedule a call with the Foundations team lead (and other leads if known to be pertinent) to determine what system changes might need comprehensive QA. If any exist, consider the following:
    • Any related problem encountered should be treated as a red flag, forcing more thorough QA.
    • The Foundations lead should report on reviewing logs on edge, such as cronscript output.
  • Look at the staging DB restore time for the week-end and determine if any changes to the announced down-time should be made.
  • Determine the schedule and deadlines. Send an email to launchpad-dev with all of the deadlines, similar to this example email. Place the deadlines on the team calendar.

  • Update the #launchpad-dev topic to state we are in 'Release Critical' and to list the release manager.

  • Maintain the list of the Current roll-out blockers

    • The release manager should poll the team leads and QA engineers continuously to ensure that the list of release blockers is up-to-date. (We need to explore a work-around to retire this wiki page and do the management in Launchpad.) All bugs that are likely to cause lots of OOPSes, time-outs or prevent several users from working are good CRB candidates. It's a good idea to subscribe yourself to the page. (Currently broken.)
  • Make sure that developers are assigned to all problems we want to fix.
  • Review release-critical merge proposals. The policy should be:
    • All RC candidates go through the normal review process.
    • After code and UI review the MP is left in 'Needs Review' state.
    • A new review of type 'release-critical' is added to the MP and assigned to the release manager.
    • If the MP is approved for 'release-critical', the review is marked 'Approve' and the state of the MP is set to 'Approved'.
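
The review policy above can be read as a small state machine. The following is a minimal sketch of that reading only; the `MergeProposal` class and both helper functions are hypothetical stand-ins written for illustration and are not part of the Launchpad API.

```python
from dataclasses import dataclass, field


@dataclass
class MergeProposal:
    """Hypothetical stand-in for a Launchpad merge proposal (not the real API)."""
    status: str = "Needs Review"
    reviews: dict = field(default_factory=dict)  # review type -> review details


def request_release_critical(mp, release_manager):
    """Add a pending 'release-critical' review assigned to the release manager."""
    # After code and UI review the MP must still be in 'Needs Review'.
    if mp.status != "Needs Review":
        raise ValueError("MP must be left in 'Needs Review' for RC review")
    mp.reviews["release-critical"] = {"reviewer": release_manager, "vote": None}


def approve_release_critical(mp):
    """Mark the release-critical review 'Approve' and set the MP to 'Approved'."""
    review = mp.reviews.get("release-critical")
    if review is None:
        raise ValueError("no release-critical review was requested")
    review["vote"] = "Approve"
    mp.status = "Approved"
```

In this sketch an MP only reaches 'Approved' through the release manager's review, which mirrors the policy that the normal code and UI reviews leave it in 'Needs Review'.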

On the day before the roll-out

  • Check that the LOSAs do a staged deployment of the code. We are looking for any hidden build problems and want to determine the amount of time this step will take.
  • Request that landing to the devel branch be closed 24 hours before the scheduled release. On the last day, all changes should be merged through db-devel.

On the day of the roll-out

  • Chase up Current Rollout Blockers and any other pending release-critical fixes.

  • With PQM remaining open, have the LOSAs stop buildbot and set it to do manual runs.
  • Remind people that all changes need to be in buildbot for 9 hours before the roll-out time. The LOSAs require two hours of pre-release preparation and we need to allow for two complete buildbot cycles. (9 = 2 + 2 * 3.5)

  • In the case of failures, it's best to roll out the last-known-good build rather than delaying the release. The cut-off point to decide which revision to roll out is 2 hours before the scheduled release.

  • Re-announce the downtime on the launchpadstatus account at least 4 hours before the actual roll-out, so that it serves as a reminder.

  • Ensure that any embargoed external resources (e.g. blog entries) are live and accessible through the links provided. Ensure that a blog editor (Matthew Revell or delegate) is available at the time of the roll-out.
  • Check that the revision numbers of the source code dependencies are correct: run "bzr branch lp:~launchpad-pqm/lp-production-configs/trunk", look at config-manager/production-{devel,stable}, and compare the revisions with what is listed in utilities/sourcedeps.conf.

  • Start an incident report to document any issues encountered during the roll-out.
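
The lead times mentioned so far (24 hours for closing devel, 9 hours in buildbot, 4 hours for the reminder, 2 hours for the revision cut-off) can all be derived from the scheduled roll-out time. The following is a minimal sketch of that arithmetic; the function names are hypothetical and the date in the example is made up, while the 10:00 UTC slot is the Europe time from the Scheduling section.

```python
from datetime import datetime, timedelta, timezone

# Figures taken from the process text above.
LOSA_PREP_HOURS = 2.0        # pre-release preparation by the LOSAs
BUILDBOT_CYCLE_HOURS = 3.5   # one complete buildbot cycle
BUILDBOT_CYCLES = 2          # complete cycles to allow for


def buildbot_lead_hours():
    """Hours a change must have been in buildbot before the roll-out."""
    return LOSA_PREP_HOURS + BUILDBOT_CYCLES * BUILDBOT_CYCLE_HOURS


def rollout_deadlines(rollout):
    """Return the deadlines implied by a scheduled roll-out time (hypothetical helper)."""
    return {
        "close devel landings": rollout - timedelta(hours=24),
        "last landing in buildbot": rollout - timedelta(hours=buildbot_lead_hours()),
        "re-announce downtime (latest)": rollout - timedelta(hours=4),
        "revision cut-off": rollout - timedelta(hours=2),
    }


if __name__ == "__main__":
    # Example date only, chosen for illustration.
    rollout = datetime(2010, 6, 16, 10, 0, tzinfo=timezone.utc)
    for name, when in rollout_deadlines(rollout).items():
        print(f"{when:%Y-%m-%d %H:%M} UTC  {name}")
```

A table like this could be pasted into the schedule email sent to launchpad-dev at the beginning of the release week.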

During the roll-out

  • If some unexpected problems are encountered during the roll-out and these put the roll-out schedule off-track by more than 30 minutes, the decision to abort the roll-out should be taken. In that case:
    • Identify the source of the problem and come up with a best estimate on how soon it can be fixed.
    • Determine when the next attempt should be undertaken and make sure it is announced. Give at least 12 hours of advance notice.
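
The abort rule above comes down to two thresholds. As a minimal sketch (the function names are hypothetical, and the reading that exactly 30 minutes of slip does not yet trigger an abort follows from "more than 30 minutes"):

```python
from datetime import datetime, timedelta, timezone

ABORT_THRESHOLD = timedelta(minutes=30)  # schedule slip that triggers an abort
MIN_NOTICE = timedelta(hours=12)         # notice required before the next attempt


def should_abort(delay):
    """True when the roll-out is off-track by more than the threshold."""
    return delay > ABORT_THRESHOLD


def earliest_next_attempt(announced_at):
    """Earliest time a new attempt may start, given when it is announced."""
    return announced_at + MIN_NOTICE
```

For example, a retry announced at 11:00 UTC could start no earlier than 23:00 UTC the same day.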

After the roll-out

  • Immediately after the roll-out, examine the site for problems. For example, ensure CSS loads properly, all external links on the front page are reachable, etc.
  • Announce on launchpadstatus that the release is done.

  • Announce any post-roll-out issues on launchpadstatus as they are discovered. Add them to the incident report. (Follow the Launchpad/PolicyandProcess/Announcements/IncidentProcess, where the release manager acts as the communication liaison.)

  • With the QA engineers, review the OOPS reports.
    • All common OOPSes are candidates for more release-critical fixes and scheduling another roll-out.
  • Prepare and schedule any necessary re-roll.
  • When a re-roll is needed, follow the same activities as in the pre-roll-out case.
  • Open the tree for the next cycle once the released version is fine.
  • Complete the incident report and ask any engineers who participated in fixing the issues to help with the root-cause analysis.
  • The release-manager needs to select the next release manager.

Re-opening PQM

Once the roll out is complete and any critical issues have been dealt with, it's time to re-open PQM. Before doing that, though, we need to merge db-devel back into devel.

Release critical policy

  • To apply for a release-critical approval, you must have a reviewed merge proposal on Launchpad. The engineer adds a review of type release-critical to the merge proposal and ensures it is in the 'Needs Review' state.

  • Good candidates for release-critical approval are issues found during QA that are bound to create OOPSes and time-outs or otherwise significantly inconvenience our end-users.
  • Apart from special exceptions discussed with the project lead, only bug fixes should be granted release-critical approval.
  • If there is no way for the developer to QA his change on staging through the normal update procedure before the roll-out, it's recommended to request a cowboy of the branch on staging to QA it before approval.

Database Patches

  • Release-critical branches containing database patches should only be accepted if they don't impact the estimated roll-out time.
  • On the day of the roll-out, only database patches that are critical to prevent data corruption should be accepted.
  • For the second re-roll, again only DB patches critical to data safety should be considered, as DB patches impact our ability to update without down-time.
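
One way to read the three rules above is as a single acceptance check. This is a sketch under an assumption the text does not settle: that on the roll-out day or for a re-roll, the data-corruption criterion replaces the roll-out-time criterion rather than adding to it. The function and parameter names are hypothetical.

```python
def accept_db_patch(impacts_rollout_time, prevents_data_corruption,
                    rollout_day=False, re_roll=False):
    """Decide whether a release-critical branch with a DB patch is acceptable."""
    if rollout_day or re_roll:
        # Only patches critical to data safety are considered at this point.
        return prevents_data_corruption
    # Earlier in the release week, the patch must not lengthen the roll-out.
    return not impacts_rollout_time
```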

Scheduling

  • Engineers apply in advance for one cycle.
  • They are selected by the previous release manager. Once selected, their name is put on the Launchpad Production Status page.

  • The actual roll-out time is determined by the release-manager's location:

        || Location || Roll out time ||
        || Americas || 23:00UTC ||
        || Europe || 10:00UTC ||
        || Asia/Pacific || 00:00UTC ||

  • No engineer should apply for the role more than twice a year.
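
The scheduling rules can be captured as a lookup table plus one limit. The following is a minimal sketch; the names are hypothetical, and it assumes "twice a year" counts terms already served in the same year.

```python
from datetime import time

# Roll-out time (UTC) by the release manager's location, from the list above.
ROLLOUT_TIME_UTC = {
    "Americas": time(23, 0),
    "Europe": time(10, 0),
    "Asia/Pacific": time(0, 0),
}

MAX_TERMS_PER_YEAR = 2  # no engineer should apply more than twice a year


def rollout_time_for(location):
    """Return the roll-out time (UTC) for the release manager's location."""
    return ROLLOUT_TIME_UTC[location]


def may_apply(terms_served_this_year):
    """True if the engineer has not yet reached the yearly limit."""
    return terms_served_this_year < MAX_TERMS_PER_YEAR
```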

References

PolicyAndProcess/Downtime (last edited 2011-06-06 22:02:02 by flacoste)