Diff for "PolicyAndProcess/Downtime"

Differences between revisions 1 and 25 (spanning 24 versions)
Revision 1 as of 2009-07-21 23:00:34
Size: 4124
Editor: flacoste
Comment:
Revision 25 as of 2009-12-14 17:03:02
Size: 9303
Editor: danilo
Comment:
Line 15: Line 15:
Back-up release managers are the two RMs from the previous two cycles.

One option that has worked very well is to share the release-manager role across timezones, handing over the current status and tasks to the backup in the next timezone at the end of your day. It's a great opportunity to work together as a team.
Line 18: Line 22:
  * OOPS report
  * Merge proposal
  * OOPS reports
  * Merge proposals
  * [[https://wiki.canonical.com/InformationInfrastructure/OSA/LaunchpadProductionStatus|LaunchpadProductionStatus]] (private)
Line 23: Line 28:
=== Before the roll-out === === Week 3 (the one before the roll-out) ===

  * During week 3 ensure that staging is [[https://staging.launchpad.net/successful-updates.txt|up-to-date]] so that it can be used for non-edge QA.

  * During week 3 ensure that Matt Revell has a downtime announcement email ready for lp-announce.

  * The down-time estimate should be based on the last staging DB restore.

  * During week 3 send out an email to Stuart Metcalf from Canonical ISD about
  the upcoming roll-out to ask about any changes that need to be rolled-out
  for canonical-identity-provider or shipit. (Launchpad roll-outs imply
  a roll-out of the Canonical Identity Provider and ShipIt code).

  * You might want to request that pqm-blockers (such as cherry-picks) are not processed on the day that PQM is closing.

  * At the end of the week, make sure all the cowboys listed on https://wiki.canonical.com/InformationInfrastructure/OSA/LaunchpadProductionStatus are either already landed or listed among CurrentRolloutBlockers

=== Release Week ===
Line 26: Line 48:
  in PQM.   in PQM. (Monday 00:00 UTC)
Line 28: Line 50:
  * Update the `#launchpad-dev` topic to list him as release-manager.   * At the beginning of week 4, schedule a call with the Foundations team lead
  (and other leads if known to be pertinent) to determine what system
  changes might need comprehensive QA. If these exist, consider these
  thoughts.

     * Any related problem encountered should be treated as a red flag,
     forcing more thorough QA.

     * Foundations lead should report on reviewing logs on edge, such
     as of cronscript output.

  * Look at the staging DB restore time for the week-end and determine if any
  changes to the announced down-time should be made.

  * Determine the schedule and deadlines. Send an email to launchpad-dev with all of the deadlines, similar to this [[ExampleReleaseScheduleEmail|example email]]. Place the deadlines on the team calendar.

  * Update the `#launchpad-dev` topic to state we are in 'Release Critical' and to list the release manager.
Line 33: Line 71:
    continuously to ensure that the list of release blockers is up to date.     continuously to ensure that the list of release blockers is up-to-date.  (We need to explore a
    work-around to retire this wiki page and do the management in Launchpad.)
Line 35: Line 74:
    All bugs that are likely to cause lots of OOPSes, time outs or prevent     All bugs that are likely to cause lots of OOPSes, time-outs or prevent
Line 37: Line 76:

    It's a good idea to subscribe yourself to the page. (Currently broken.)
Line 40: Line 81:
  * Review release-critical merge.   * Review release-critical merge proposals. The policy should be:
     * All RC candidates go through the normal review process.
     * After code and UI review the MP is left in 'Needs Review' state.
     * A new review of type 'release-critical' is added to the MP and assigned to the release manager.
     * If the MP is approved for 'release-critical', the review is marked 'Approve' and the state of the MP is set to 'Approved'.
Line 45: Line 90:
  * Request that landing to the `devel` branch be closed. (All changes
  should on the last day be merged through `db-devel`.)
  * Check that the LOSA do a staged deployment of the code. We are looking for
  any hidden build problems and to determine the amount of time this step will
  take.
Line 48: Line 94:
  * Request that landing to the `devel` branch be closed, 24 hours before the scheduled release. All changes should on the last day be merged through `db-devel`.
Line 54: Line 101:
  * Remind people that all changes need to be in buildbot for '''6 hours'''
  before the roll-out time.
  * With PQM remaining open, have the LOSAs stop buildbot and set it do manual runs.
  * Remind people that all changes need to be in buildbot for '''9 hours'''
  before the roll-out time. The LOSAs require two hours of pre-release preparation and we need
  to allow for two complete buildbot cycles. (9 = 2 + 2 * 3.5)
Line 61: Line 110:
  * Ensure that any embargoed external resources (e.g. blog entries) are live and accessible through the links provided. Ensure that a blog editor (Matthew Revell or delegate) is available at the time of the roll-out.

 * Immediately after the roll-out, examine the site for problems. For example, ensure CSS loads properly, all external links on the front page are reachable, etc.

=== During the roll-out ===

  * If some unexpected problems are encountered during the roll-out and these
  put the roll-out schedule off-track by more than 30 minutes, the decision to
  abort the roll-out should be taken. In that case:

    * Identify the source of the problem and come up with a best estimate on
    how soon it can be fixed.

    * Determine when the next attempt should be undertaken and make sure it is
    announced. Give at least 12 hours of fore-notice.
Line 66: Line 130:
    All common OOPSes are canditates for more release-critical fixes and     All common OOPSes are candidates for more release-critical fixes and
Line 69: Line 133:
  
Line 75: Line 140:
  * The release-manager need to select the next release manager.   * The release-manager needs to select the next release manager.

== Re-opening PQM ==

Once the roll out is complete and any critical issues have been dealt with, it's time to re-open PQM. Before doing that, though, we need to merge db-devel back into devel.
Line 80: Line 149:
  merge proposal on Launchpad. The release manager simply add a review of type
  `release-critical` to the merge proposal.
  merge proposal on Launchpad. The engineer adds a review of type
  `release-critical` to the merge proposal and ensures it is in the 'Needs Review' state.
Line 83: Line 152:
  * Any issues found during QA that is bound to create OOPSes, time outs or be
  very inconveniencing to users are good candidate for release-critical
  approval.
  * Good candidates for release-critical approval are issues found during QA that are
  bound to create OOPSes and time outs or otherwise significantly inconvenience our end-users.
Line 87: Line 155:
  * Apart special exceptions discussed with the project lead, only bug fixes   * Apart from special exceptions discussed with the project lead, only bug fixes
Line 90: Line 158:
  * If there is no way that the developer can QA his change on staging through
  the normal update procedure before the roll-out, for complex changes, it's
  recommended to ask a cow-boy of the branch on staging to QA it before
  approval.
  * If there is no way for the developer to QA his change on staging through
  the normal update procedure before the roll-out, it's recommended to request
  a cowboy of the branch on staging to QA it before approval.
Line 95: Line 162:
  * For the second roll-out, any change requiring database changes should go
  through the project lead, since a re-roll with a DB updates creates
  significant down-time for our users.
=== Database Patches ===
Line 99: Line 164:
  * Release-critical branches containing database patches should only
  be accepted if they don't impact the estimated roll-out time.

  * One the day of the roll-out, only database patches that would be critical
  to prevent data corruption should be accepted.

  * For the second re-roll, again only DB patches critical to data safety
  should be considered as it impacts our ability to update without down-time.
Line 102: Line 175:
  * Engineer apply in advance for one cycle.   * Engineers apply in advance for one cycle.
Line 104: Line 177:
  * They are selected by the previous release manager. Once selected, their   * They are selected by the previous release manager. Once selected, their name
Line 107: Line 180:
  * The actual roll-out time is determined based on the release-manager
  location:
  * The actual roll-out time is determined by the release-manager's location:
Line 111: Line 183:
        || Americas || 00:00UTC ||
        || Europe || 09:00UTC ||
        || Americas || 23:00UTC ||
        || Europe || 10:00UTC ||
Line 115: Line 187:
  * No engineer can apply for the role more than twice a year.   * No engineer should apply for the role more than twice a year.

== References ==

 * [[https://wiki.canonical.com/InformationInfrastructure/OSA/LaunchpadRollout | OSA Launchpad Rollout Procedures]]
 * SpuriousFailures -- useful for diagnosing last-minute build failures
 * CurrentRolloutBlockers -- things currently blocking rollout
 * [[QATeam/TestPlans]] -- proof from all the teams that their code works
 * [[https://wiki.canonical.com/InformationInfrastructure/OSA/LaunchpadProductionStatus|Launchpad production status]]

  • Process Name: Release Manager Rotation Process

  • Process Owner: Francis Lacoste

  • Parent Process/Activity: None

  • Supported Policy: None

Process Overview

Each cycle a different engineer takes the role of release manager. The release manager coordinates with the release team and all team leads to ensure that the tree is ready for the roll-out and that all critical bugs are either fixed in the tree or worked around.

Back-up release managers are the two RMs from the previous two cycles.

One option that has worked very well is to share the release-manager role across timezones, handing over the current status and tasks to the backup in the next timezone at the end of your day. It's a great opportunity to work together as a team.

Release Manager inputs

  • OOPS reports
  • Merge proposals
  • LaunchpadProductionStatus (private)

Activities

Week 3 (the one before the roll-out)

  • During week 3 ensure that staging is up-to-date so that it can be used for non-edge QA.

  • During week 3 ensure that Matt Revell has a downtime announcement email ready for lp-announce.
  • The down-time estimate should be based on the last staging DB restore.
  • During week 3 send out an email to Stuart Metcalf from Canonical ISD about the upcoming roll-out to ask about any changes that need to be rolled out for canonical-identity-provider or shipit. (Launchpad roll-outs imply a roll-out of the Canonical Identity Provider and ShipIt code.)

  • You might want to request that pqm-blockers (such as cherry-picks) are not processed on the day that PQM is closing.
  • At the end of the week, make sure all the cowboys listed on https://wiki.canonical.com/InformationInfrastructure/OSA/LaunchpadProductionStatus are either already landed or listed among CurrentRolloutBlockers

Release Week

  • At the beginning of week 4, make sure that release-critical has been turned on in PQM (Monday 00:00 UTC).
  • At the beginning of week 4, schedule a call with the Foundations team lead (and other leads if known to be pertinent) to determine what system changes might need comprehensive QA. If any such changes exist, keep the following in mind:
    • Any related problem encountered should be treated as a red flag, forcing more thorough QA.
    • The Foundations lead should report on reviewing logs on edge, such as cronscript output.
  • Look at the staging DB restore time for the week-end and determine if any changes to the announced down-time should be made.
  • Determine the schedule and deadlines. Send an email to launchpad-dev with all of the deadlines, similar to this example email. Place the deadlines on the team calendar.

  • Update the #launchpad-dev topic to state we are in 'Release Critical' and to list the release manager.

  • Maintain the list of the Current roll-out blockers

    • The release manager should poll the team leads and QA engineers continuously to ensure that the list of release blockers is up-to-date. (We need to explore a work-around to retire this wiki page and do the management in Launchpad.) All bugs that are likely to cause lots of OOPSes, time-outs or prevent several users from working are good CRB candidates. It's a good idea to subscribe yourself to the page. (Currently broken.)
  • Make sure that developers are assigned to all problems we want to fix.
  • Review release-critical merge proposals. The policy should be:
    • All RC candidates go through the normal review process.
    • After code and UI review the MP is left in 'Needs Review' state.
    • A new review of type 'release-critical' is added to the MP and assigned to the release manager.
    • If the MP is approved for 'release-critical', the review is marked 'Approve' and the state of the MP is set to 'Approved'.
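
The release-critical review steps above can be read as a small state change on the merge proposal. The following is only an illustrative sketch: the MergeProposal class, both function names and the "Pending" vote are hypothetical and are not part of the Launchpad API. Only the 'Needs Review' and 'Approved' states, the 'release-critical' review type and the 'Approve' vote come from the policy.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class MergeProposal:
    """Hypothetical stand-in for a merge proposal; not the Launchpad API."""
    state: str = "Needs Review"
    # Maps review type (e.g. "code", "ui", "release-critical") to its vote.
    reviews: Dict[str, str] = field(default_factory=dict)


def request_release_critical(mp: MergeProposal) -> None:
    """Add a 'release-critical' review after the normal reviews are done."""
    if mp.state != "Needs Review":
        raise ValueError("MP must stay in 'Needs Review' until RC approval")
    # "Pending" is an assumed placeholder for a review that has no vote yet.
    mp.reviews["release-critical"] = "Pending"


def approve_release_critical(mp: MergeProposal) -> None:
    """Release manager approves: vote 'Approve' and set the MP to 'Approved'."""
    if "release-critical" not in mp.reviews:
        raise ValueError("no release-critical review was requested")
    mp.reviews["release-critical"] = "Approve"
    mp.state = "Approved"
```

The point of the sketch is that the MP state changes only at the last step, which is why the MP is left in 'Needs Review' after the code and UI reviews.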

On the day before the roll-out

  • Check that the LOSAs do a staged deployment of the code. We are looking for any hidden build problems and want to determine how long this step will take.
  • Request that landing to the devel branch be closed 24 hours before the scheduled release. On the last day, all changes should be merged through db-devel.

On the day of the roll-out

  • Chase up Current Rollout Blockers and any other pending release-critical fixes.

  • With PQM remaining open, have the LOSAs stop buildbot and set it to do manual runs.
  • Remind people that all changes need to be in buildbot for 9 hours before the roll-out time. The LOSAs require two hours of pre-release preparation and we need to allow for two complete buildbot cycles. (9 = 2 + 2 * 3.5)

  • In the case of failures, it's best to roll out the last-known-good build rather than delay the release. The cut-off point to decide which revision to roll out is 2 hours before the scheduled release.

  • Ensure that any embargoed external resources (e.g. blog entries) are live and accessible through the links provided. Ensure that a blog editor (Matthew Revell or delegate) is available at the time of the roll-out.
  • Immediately after the roll-out, examine the site for problems. For example, ensure CSS loads properly, all external links on the front page are reachable, etc.
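
The deadlines in this checklist all follow from the scheduled roll-out time, so they can be worked out mechanically. The sketch below assumes the figures given in this page (24 hours for closing devel, 2 hours of LOSA preparation, two buildbot cycles of 3.5 hours each, and a 2-hour revision cut-off); the function name, the dictionary keys and the example date are made up for illustration.

```python
from datetime import datetime, timedelta, timezone

# Figures taken from the checklist on this page.
LOSA_PREP_HOURS = 2.0        # pre-release preparation by the LOSAs
BUILDBOT_CYCLE_HOURS = 3.5   # one complete buildbot cycle
BUILDBOT_CYCLES = 2          # number of cycles to allow for


def rollout_deadlines(rollout: datetime) -> dict:
    """Return the deadlines implied by a scheduled roll-out time."""
    # 2 + 2 * 3.5 = 9 hours before the roll-out.
    buildbot_lead = timedelta(
        hours=LOSA_PREP_HOURS + BUILDBOT_CYCLES * BUILDBOT_CYCLE_HOURS
    )
    return {
        "devel_closed": rollout - timedelta(hours=24),
        "changes_in_buildbot": rollout - buildbot_lead,
        "revision_cutoff": rollout - timedelta(hours=2),
    }


if __name__ == "__main__":
    # Hypothetical roll-out date, using the 10:00 UTC Europe slot.
    example = datetime(2009, 12, 16, 10, 0, tzinfo=timezone.utc)
    for name, when in rollout_deadlines(example).items():
        print(f"{name}: {when:%Y-%m-%d %H:%M} UTC")
```

For a 10:00 UTC roll-out this puts the buildbot deadline at 01:00 UTC the same day and the revision cut-off at 08:00 UTC.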

During the roll-out

  • If some unexpected problems are encountered during the roll-out and these put the roll-out schedule off-track by more than 30 minutes, the decision to abort the roll-out should be taken. In that case:
    • Identify the source of the problem and come up with a best estimate on how soon it can be fixed.
    • Determine when the next attempt should be undertaken and make sure it is announced. Give at least 12 hours of fore-notice.
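
The abort rule above amounts to two simple checks. This is a minimal sketch under the thresholds stated in this section (more than 30 minutes off-track, at least 12 hours of notice); the function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

ABORT_THRESHOLD = timedelta(minutes=30)  # from the policy above
MIN_NOTICE = timedelta(hours=12)         # from the policy above


def should_abort(delay: timedelta) -> bool:
    """True when the roll-out is off-track by more than 30 minutes."""
    return delay > ABORT_THRESHOLD


def earliest_next_attempt(announced_at: datetime) -> datetime:
    """Earliest start of a new attempt, given when it is announced."""
    return announced_at + MIN_NOTICE


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    if should_abort(timedelta(minutes=45)):
        print("Abort; next attempt no earlier than", earliest_next_attempt(now))
```

A delay of exactly 30 minutes does not trigger an abort in this sketch, since the policy says "more than 30 minutes".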

After the roll-out

  • With the QA engineers, review the OOPS reports.
    • All common OOPSes are candidates for more release-critical fixes and scheduling another roll-out.
  • Prepare and schedule any necessary re-roll.
  • When a re-roll is needed, the same activities apply as in the pre-roll-out case.
  • Open the tree for the next cycle once the released version is fine.
  • The release-manager needs to select the next release manager.

Re-opening PQM

Once the roll-out is complete and any critical issues have been dealt with, it's time to re-open PQM. Before doing that, though, we need to merge db-devel back into devel.

Release critical policy

  • To apply for a release-critical approval, you must have a reviewed merge proposal on Launchpad. The engineer adds a review of type release-critical to the merge proposal and ensures it is in the 'Needs Review' state.

  • Good candidates for release-critical approval are issues found during QA that are bound to create OOPSes and time outs or otherwise significantly inconvenience our end-users.
  • Apart from special exceptions discussed with the project lead, only bug fixes should be granted release-critical approval.
  • If there is no way for the developer to QA his change on staging through the normal update procedure before the roll-out, it's recommended to request a cowboy of the branch on staging to QA it before approval.

Database Patches

  • Release-critical branches containing database patches should only be accepted if they don't impact the estimated roll-out time.
  • On the day of the roll-out, only database patches that are critical to preventing data corruption should be accepted.
  • For the second re-roll, again only DB patches critical to data safety should be considered, as a DB patch impacts our ability to update without down-time.

Scheduling

  • Engineers apply in advance for one cycle.
  • They are selected by the previous release manager. Once selected, their name is put on the Launchpad Production Status page.

  • The actual roll-out time is determined by the release-manager's location:
    || Location || Roll out time ||
    || Americas || 23:00UTC ||
    || Europe || 10:00UTC ||
    || Asia/Pacific || 00:00UTC ||

  • No engineer should apply for the role more than twice a year.
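
The location table above can be expressed as a lookup that turns a roll-out day and the release manager's region into a UTC timestamp. This is a sketch only: the region names and times are copied from the table, while the names ROLLOUT_TIME_UTC and rollout_datetime and the example date are assumptions made for illustration.

```python
from datetime import date, datetime, time, timezone

# Roll-out times by release-manager location, as listed in the table.
ROLLOUT_TIME_UTC = {
    "Americas": time(23, 0),
    "Europe": time(10, 0),
    "Asia/Pacific": time(0, 0),
}


def rollout_datetime(day: date, region: str) -> datetime:
    """Combine the roll-out day with the region's slot, in UTC."""
    try:
        slot = ROLLOUT_TIME_UTC[region]
    except KeyError:
        raise ValueError(f"unknown region: {region!r}") from None
    return datetime.combine(day, slot, tzinfo=timezone.utc)


if __name__ == "__main__":
    # Hypothetical roll-out day.
    print(rollout_datetime(date(2009, 12, 16), "Europe"))
```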

References

PolicyAndProcess/Downtime (last edited 2011-06-06 22:02:02 by flacoste)