Diff for "PolicyAndProcess/Downtime"


Differences between revisions 1 and 84 (spanning 83 versions)
Revision 1 as of 2009-07-21 23:00:34
Size: 4124
Editor: flacoste
Comment:
Revision 84 as of 2011-04-06 21:49:37
Size: 6719
Editor: lifeless
Comment: overhaul
Deletions are marked like this. Additions are marked like this.
Line 1: Line 1:
## page was renamed from PolicyAndProcess/ReleaseManagerRotation
Line 10: Line 11:
Each cycle a different engineer takes the role of release manager. The release
manager coordinates with the release team and all team leads to ensure that
the tree is ready for the roll-out and that all critical bugs are in or
worked-around.
Each month a DB downtime is scheduled. The team lead schedules this with the Launchpad stakeholders and LOSAs. It is broadly scheduled up to a year in advance on [[DowntimeDeploymentSchedule|the downtime schedule]]. The scheduled time should overlap with availability from the DBA.
Line 15: Line 13:
== Release Manager inputs == The Launchpad engineers request a merge from db-stable of *only* QA-ok commits 48 hours before the scheduled time. The DB downtime then deploys just that merge revision and, where possible, only does schema changes.

Other downtime-requiring events (such as codehosting setup changes) have partial downtime scheduled as required.

== DB-stable->devel merge inputs ==
Line 18: Line 20:
  * OOPS report
  * Merge proposal
  * Deployment reports
   * [[https://devpad.canonical.com/~lpqateam/qa_reports/deployment-stable.html|stable]]
   * [[https://devpad.canonical.com/~lpqateam/qa_reports/deployment-db-stable.html|db-stable]]
Line 23: Line 26:
=== Before the roll-out === === Coordinator (LP Team lead) ===
Line 25: Line 28:
  * At the beginning of week 4, make sure that release-critical was turned on
  in PQM.
 * Negotiate and confirm the dates and times for the main release window and a backup release window with the following groups. Refer to the [[DowntimeDeploymentSchedule|downtime schedule]] for the pre-scheduled dates and times (we have a process in place to ensure our release calendar does not conflict with Ubuntu; conflicts normally do not happen). See [[#Scheduling|Scheduling]] for more info on exact times. The goal is to confirm that the scheduled downtime does not interrupt other teams at critical times, to ensure that the release window does not place undue stress or risk upon our own operations team, and to renegotiate the dates and times if necessary. Note when scheduling that services on Launchpad can begin being paused up to half an hour before the rollout.
Line 28: Line 30:
  * Update the `#launchpad-dev` topic to list them as the release manager.    * the stakeholder's list (private-canonical-launchpad-stakeholders@lists.launchpad.net).
   * the LOSAs (losas@canonical.com)
   * the DBA (Stuart Bishop <stuart.bishop@canonical.com>)
Line 30: Line 34:
  * Maintain the list of the [[CurrentRolloutBlockers|Current roll-out blockers]] Let Matthew Revell <matthew.revell@canonical.com> know the time to announce it.
Line 32: Line 36:
    The release manager should poll the team leads and QA engineers
    continuously to ensure that the list of release blockers is up to date.
The downtime window is always 90 minutes; if less time is consumed, great. If more time is needed, we go back to the drawing board to figure out how to do it faster.
Line 35: Line 38:
    All bugs that are likely to cause lots of OOPSes, time outs or prevent
    several users from working are good CRB candidates.
Re-announce the downtime on the [[http://identi.ca/launchpadstatus|launchpadstatus]] account at least '''4 hours''' before the actual rollout, so it serves as a reminder. You can do this yourself with the [[https://wiki.canonical.com/Launchpad/PolicyandProcess/Announcements/Downtime|identi.ca login info]].
Line 38: Line 40:
  * Make sure that developers are assigned to all problems we want to fix. Ensure that Matt Revell has a downtime announcement email ready for lp-announce. Again, make sure it's noted that some LP services (e.g. soyuz) can go offline temporarily up to half an hour before the rollout.
Line 40: Line 42:
  * Review release-critical merge. === Engineering responsibilities ===
Line 42: Line 44:
 * All QA for db changes must include checking that the aggregate database time is within budget. To find out how long the database updates will take to deploy on production, double the aggregate time from staging and add 10 minutes. Grep for lines like "2208-30-0 applied just now in 70.5 seconds" in devpad:/x/launchpad.net-logs/staging/sourcherry/2011-*-staging_restore.log
 * If the aggregate time is too large, either merge fewer than all the possible revisions, or back out the problem revision. This is part of QAing on db-stable.
Line 43: Line 47:
=== On the day before the roll-out ===  * Ask the LOSAs to merge db-stable rev XXX to devel and put PQM into RC mode ('on prae as pqm {{{cd /home/pqm/archives/rocketfuel/launchpad/devel; bzr update; bzr revert; bzr merge -r 10381 lp:~launchpad-pqm/launchpad/db-stable; bzr commit -m "Merge db-stable 10381." ; bzr push lp:~launchpad-pqm/launchpad/devel}}}'). If there are conflicts, a developer should reproduce the merge and tell the LOSA how to resolve them.
 * After buildbot blesses the merge, ask a LOSA to do the upgrade to qastaging ('refresh the qastaging tree on sourcherry, run the normal upgrade scripts there').
 * Ask the LOSAs to take PQM out of RC mode once qastaging is verified to be working well (e.g. re-QA all the schema changes included in the merge).
  * Update the `#launchpad-dev` topic to state 'PQM in RC mode' and then reverse it.
Line 45: Line 52:
  * Request that landing to the `devel` branch be closed. (On the last day, all
  changes should be merged through `db-devel`.)
  * Make sure all the cowboys listed on [[https://wiki.canonical.com/InformationInfrastructure/OSA/LaunchpadProductionStatus|LaunchpadProductionStatus]] are either already landed or are added to the 'unusual rollout requirements' on that same page, so that the LOSAs will re-apply the cowboys after the rollout. It is definitely preferable not to re-apply cowboys, so it's a good idea to track down who hasn't landed a permanent fix yet.
Line 48: Line 54:
  * If unexpected problems encountered during the roll-out put the schedule off-track by more than 30 minutes, and the roll-out can be aborted, then it should be aborted. In that case:
Line 49: Line 56:
=== On the day of the roll-out ===   * Identify the source of the problem and come up with a best estimate on how soon it can be fixed.
Line 51: Line 58:
  * Chase up ''Current Rollout Blockers'' and any other pending release-critical
  fixes.
  * Determine when the next attempt should be undertaken (usually: the backup window) and make sure it is announced. Give at least 24 hours of advance warning (because the readonly replica takes 24 hours to restore).
Line 54: Line 60:
  * Remind people that all changes need to be in buildbot for '''6 hours'''
  before the roll-out time.
==== After the roll-out ====
Line 57: Line 62:
  * In the case of failures, it's best to roll-out the last-known-good-build
  rather than delaying the release. The cut-off point to decide which revision
  to roll out is '''2 hours''' before the scheduled release.
  * Immediately after the roll-out, examine the site for problems. For example, ensure CSS loads properly, all external links on the front page are reachable, etc.
Line 61: Line 64:
  * Announce on `launchpadstatus` that the release is done.
Line 62: Line 66:
=== After the roll-out ===   * Announce any post-rollout issues on `launchpadstatus` as they are discovered. Create incident reports for them (following the Launchpad/PolicyandProcess/Announcements/IncidentProcess).
Line 64: Line 68:
  * With the QA engineers, review the OOPS reports. === Database Patches ===
Line 66: Line 70:
    All common OOPSes are candidates for more release-critical fixes and
    scheduling another roll-out.
  * If the qa-ok of a patch merged to devel was faulty and a fix is needed, this shouldn't impact the downtime; if it will, the entire change should just be backed out.
Line 69: Line 72:
  * Prepare and schedule any necessary re-roll. == References ==
Line 71: Line 74:
  * When a re-roll is needed, the same activities as in the pre-roll-out case apply.

  * Open the tree, when the released version is fine for the next cycle.

  * The release manager needs to select the next release manager.

== Release critical policy ==

  * To apply for release-critical approval, you must have a reviewed
  merge proposal on Launchpad. The release manager simply adds a review of type
  `release-critical` to the merge proposal.

  * Any issue found during QA that is bound to create OOPSes or time-outs, or to
  be very inconvenient to users, is a good candidate for release-critical
  approval.

  * Apart from special exceptions discussed with the project lead, only bug
  fixes should be granted release-critical approval.

  * For complex changes, if there is no way for the developer to QA the change
  on staging through the normal update procedure before the roll-out, it is
  recommended to ask for a cowboy of the branch on staging so it can be QA'd
  before approval.

  * For the second roll-out, any change requiring database changes should go
  through the project lead, since a re-roll with DB updates creates
  significant downtime for our users.


== Scheduling ==

  * Engineers apply in advance for one cycle.

  * They are selected by the previous release manager. Once selected, their
  name is put on the [[https://wiki.canonical.com/InformationInfrastructure/OSA/LaunchpadProductionStatus|Launchpad Production Status page]].

  * The actual roll-out time is determined based on the release-manager
  location:

        || Location || Roll out time ||
        || Americas || 00:00UTC ||
        || Europe || 09:00UTC ||
        || Asia/Pacific || 00:00UTC ||

  * No engineer can apply for the role more than twice a year.
 * [[https://wiki.canonical.com/InformationInfrastructure/OSA/LaunchpadRollout | OSA Launchpad Rollout Procedures]]
 * SpuriousFailures -- useful for diagnosing last-minute build failures
 * [[https://wiki.canonical.com/InformationInfrastructure/OSA/LaunchpadProductionStatus|Launchpad production status]]
 * [[https://wiki.canonical.com/Launchpad/PolicyandProcess/ProductionChangeApprovalPolicy|Approval of production change]]
 * [[https://wiki.canonical.com/Launchpad/PolicyandProcess/EmergencyChange|Emergency change (aka cherry-picks)]]

  • Process Name: Release Manager Rotation Process

  • Process Owner: Francis Lacoste

  • Parent Process/Activity: None

  • Supported Policy: None

Process Overview

Each month a DB downtime is scheduled. The team lead schedules this with the Launchpad stakeholders and LOSAs. It is broadly scheduled up to a year in advance on the downtime schedule. The scheduled time should overlap with availability from the DBA.

The Launchpad engineers request a merge from db-stable of *only* QA-ok commits 48 hours before the scheduled time. The DB downtime then deploys just that merge revision and, where possible, only does schema changes.

Other downtime-requiring events (such as codehosting setup changes) have partial downtime scheduled as required.

DB-stable->devel merge inputs

  • Email and IRC messages from engineers and team leads.
  • Deployment reports

Activities

Coordinator (LP Team lead)

  • Negotiate and confirm the dates and times for the main release window and a backup release window with the following groups. Refer to the downtime schedule for the pre-scheduled dates and times (we have a process in place to ensure our release calendar does not conflict with Ubuntu; conflicts normally do not happen). See Scheduling for more info on exact times. The goal is to confirm that the scheduled downtime does not interrupt other teams at critical times, to ensure that the release window does not place undue stress or risk upon our own operations team, and to renegotiate the dates and times if necessary. Note when scheduling that services on Launchpad can begin being paused up to half an hour before the rollout.

Let Matthew Revell <matthew.revell@canonical.com> know the time to announce it.

The downtime window is always 90 minutes; if less time is consumed, great. If more time is needed, we go back to the drawing board to figure out how to do it faster.

Re-announce the downtime on the launchpadstatus account at least 4 hours before the actual rollout, so it serves as a reminder. You can do this yourself with the identi.ca login info.

Ensure that Matt Revell has a downtime announcement email ready for lp-announce. Again, make sure it's noted that some LP services (e.g. soyuz) can go offline temporarily up to half an hour before the rollout.

Engineering responsibilities

  • All QA for db changes must include checking that the aggregate database time is within budget. To find out how long the database updates will take to deploy on production, double the aggregate time from staging and add 10 minutes. Grep for lines like "2208-30-0 applied just now in 70.5 seconds" in devpad:/x/launchpad.net-logs/staging/sourcherry/2011-*-staging_restore.log
  • If the aggregate time is too large, either merge fewer than all the possible revisions, or back out the problem revision. This is part of QAing on db-stable.
  • Ask the LOSAs to merge db-stable rev XXX to devel and put PQM into RC mode ('on prae as pqm: cd /home/pqm/archives/rocketfuel/launchpad/devel; bzr update; bzr revert; bzr merge -r 10381 lp:~launchpad-pqm/launchpad/db-stable; bzr commit -m "Merge db-stable 10381." ; bzr push lp:~launchpad-pqm/launchpad/devel'). If there are conflicts, a developer should reproduce the merge and tell the LOSA how to resolve them.

  • After buildbot blesses the merge, ask a LOSA to do the upgrade to qastaging ('refresh the qastaging tree on sourcherry, run the normal upgrade scripts there').
  • Ask the LOSAs to take PQM out of RC mode once qastaging is verified to be working well (e.g. re-QA all the schema changes included in the merge).
    • Update the #launchpad-dev topic to state 'PQM in RC mode' and then reverse it.

    • Make sure all the cowboys listed on LaunchpadProductionStatus are either already landed or are added to the 'unusual rollout requirements' on that same page, so that the LOSAs will re-apply the cowboys after the rollout. It is definitely preferable not to re-apply cowboys, so it's a good idea to track down who hasn't landed a permanent fix yet.

    • If unexpected problems encountered during the roll-out put the schedule off-track by more than 30 minutes, and the roll-out can be aborted, then it should be aborted. In that case:
    • Identify the source of the problem and come up with a best estimate on how soon it can be fixed.
    • Determine when the next attempt should be undertaken (usually: the backup window) and make sure it is announced. Give at least 24 hours of advance warning (because the readonly replica takes 24 hours to restore).
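The aggregate-time check above ('double the aggregate time from staging and add 10 minutes') can be scripted against the staging restore log. A minimal sketch, assuming the log lines keep the quoted format (`NNNN-NN-N applied just now in S seconds`); the helper name `estimate_downtime` is ours, not part of the process:

```shell
# Estimate the production DB update window from a staging restore log.
# Process rule: double the aggregate staging time, then add 10 minutes.
# Assumes log lines of the form:
#   2208-30-0 applied just now in 70.5 seconds
estimate_downtime() {
    log="$1"
    # Sum the seconds field (next-to-last word) of every applied-patch line.
    total=$(grep 'applied just now in' "$log" \
        | awk '{ sum += $(NF-1) } END { printf "%.0f", sum + 0 }')
    # Double the staging time and add a 600-second (10 minute) margin.
    echo $(( total * 2 + 600 ))
}
```

Run it over one of the `2011-*-staging_restore.log` files on devpad; it prints the estimated number of seconds. If that exceeds the 90-minute window, drop or back out revisions as described above.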

After the roll-out

  • Immediately after the roll-out, examine the site for problems. For example, ensure CSS loads properly, all external links on the front page are reachable, etc.
  • Announce on launchpadstatus that the release is done.

  • Announce any post-rollout issues on launchpadstatus as they are discovered. Create incident reports for them (following the Launchpad/PolicyandProcess/Announcements/IncidentProcess).
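Part of the post-rollout examination can be automated. A rough sketch using curl; the `smoke_check` helper and any URLs passed to it are illustrative, not an official checklist:

```shell
# Post-rollout smoke check: confirm each given URL answers with HTTP 200.
# Typical targets: the front page, its stylesheet, and the external links
# found on the front page.
smoke_check() {
    for url in "$@"; do
        # -s: quiet; -o /dev/null: discard body; -w: print only the status code.
        status=$(curl -s -o /dev/null -w '%{http_code}' "$url")
        if [ "$status" != "200" ]; then
            echo "FAIL $status $url"
            return 1
        fi
        echo "OK   $url"
    done
}
```

For example, `smoke_check https://launchpad.net/` plus the external links collected from the front page; a FAIL line points at the first unreachable URL.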

Database Patches

  • If the qa-ok of a patch merged to devel was faulty and a fix is needed, this shouldn't impact the downtime; if it will, the entire change should just be backed out.

References

  • OSA Launchpad Rollout Procedures
  • SpuriousFailures -- useful for diagnosing last-minute build failures
  • Launchpad production status
  • Approval of production change
  • Emergency change (aka cherry-picks)

PolicyAndProcess/Downtime (last edited 2011-06-06 22:02:02 by flacoste)