Diff for "PolicyAndProcess/Downtime"


Differences between revisions 31 and 88 (spanning 57 versions)
Revision 31 as of 2010-03-04 22:15:27
Size: 10407
Editor: flacoste
Comment: Remove reference to CP as nothing blocks PQM. Add note about overly pessimistic downtime estimate
Revision 88 as of 2011-06-06 22:02:02
Size: 7541
Editor: flacoste
Comment: PQM tag to use for rc
Line 1: Line 1:
## page was renamed from PolicyAndProcess/ReleaseManagerRotation
Line 3: Line 4:
 * '''Process Name:''' Release Manager Rotation Process  * '''Process Name:''' Monthly DB deployment with downtime
Line 10: Line 11:
Each cycle a different engineer takes the role of release manager. The release
manager coordinates with the release team and all team leads to ensure that
the tree is ready for the roll-out and that all critical bugs are in or
worked-around.
Each month a DB downtime is scheduled. The team lead schedules this with the Launchpad stakeholders and LOSAs. It is broadly scheduled up to a year in advance on [[DowntimeDeploymentSchedule|the downtime schedule]]. The scheduled time should overlap with availability from the DBA.
Line 15: Line 13:
Back-up release managers are the two RMs from the previous two cycles. The Launchpad engineers request a merge from db-stable of *only* QA-ok commits 48 hours before the scheduled time. The DB downtime then deploys just that merge revision and, where possible, applies only schema changes.
Line 17: Line 15:
One option that has worked very well is to share the release-manager role across timezones, handing over the current status and tasks to the backup in the next timezone at the end of your day. It's a great opportunity to work together as a team. Other events requiring downtime (such as codehosting setup changes) have partial downtime scheduled as required.
Line 19: Line 17:
== Release Manager inputs == == DB-stable->devel merge inputs ==
Line 22: Line 20:
  * OOPS reports
  * Merge proposals
  * [[https://wiki.canonical.com/InformationInfrastructure/OSA/LaunchpadProductionStatus|LaunchpadProductionStatus]] (private)
  * Deployment reports
   * [[https://devpad.canonical.com/~lpqateam/qa_reports/deployment-stable.html|stable]]
   * [[https://devpad.canonical.com/~lpqateam/qa_reports/deployment-db-stable.html|db-stable]]
Line 28: Line 26:
=== Week 3 (the one before the roll-out) === === Coordinator (LP Team lead) ===
Line 30: Line 28:
  * During week 3 ensure that staging is [[https://staging.launchpad.net/successful-updates.txt|up-to-date]] so that it can be used for non-edge QA.  * Negotiate and confirm the dates and times for the main release window and a backup release window with the following groups. Refer to [[DowntimeDeploymentSchedule|the downtime schedule]] for the pre-scheduled dates and times (we have a process in place to ensure our release calendar does not conflict with Ubuntu; conflicts normally do not happen). The goal is to confirm that the scheduled downtime does not interrupt other teams at critical times, to ensure that the release window does not place undue stress or risk upon our own operations team, and to renegotiate the dates and times if necessary. Note when scheduling that services on Launchpad can begin being paused up to half an hour before the rollout.
Line 32: Line 30:
  * During week 3 ensure that Matt Revell has a downtime announcement email ready for lp-announce.    * the stakeholder's list (private-canonical-launchpad-stakeholders@lists.launchpad.net).
   * the LOSAs (losas@canonical.com)
   * the DBA (Stuart Bishop <stuart.bishop@canonical.com>)
Line 34: Line 34:
  * The down-time estimate should be based on the last staging DB restore. Let Matthew Revell <matthew.revell@canonical.com> know the time to announce it.
Line 36: Line 36:
    We should be overly conservative in the read-only announcement.
    Recently, we always encountered DB locking issues during the roll-out which
    increased the read-only window. Until we are able to have stable DB
    upgrades and determine a reliable fudge factor between time on staging and
    production, we should plan for the worst and come back earlier.
The downtime window is always 90 minutes; if less time is consumed, great. If more time is needed, we go back to the drawing board to figure out how to do it faster.
Line 42: Line 38:
  * During week 3 send out an email to Stuart Metcalf from Canonical ISD about
  the upcoming roll-out to ask about any changes that need to be rolled-out
  for canonical-identity-provider or shipit. (Launchpad roll-outs imply
  a roll-out of the Canonical Identity Provider and ShipIt code).
Re-announce the downtime on the [[http://identi.ca/launchpadstatus|launchpadstatus]] account at '''least 4h''' before the actual rollout, so that it serves as a reminder. You can do this yourself with the [[https://wiki.canonical.com/Launchpad/PolicyandProcess/Announcements/Downtime|identi.ca login info]].
Line 47: Line 40:
  * You might want to request that pqm-blockers (such as cherry-picks) are not processed on the day that PQM is closing. Ensure that Matt Revell has a downtime announcement email ready for lp-announce. Again, make sure it's noted that some LP services (e.g. soyuz) can go offline temporarily up to half an hour before the rollout.
Line 49: Line 42:
  * At the end of the week, make sure all the cowboys listed on [[https://wiki.canonical.com/InformationInfrastructure/OSA/LaunchpadProductionStatus|LaunchpadProductionStatus]] are either already landed or listed among CurrentRolloutBlockers === Engineering responsibilities ===
Line 51: Line 44:
  * Also go through all the 'unusual rollout requirements' and clean them up from [[https://wiki.canonical.com/InformationInfrastructure/OSA/LaunchpadProductionStatus|LaunchpadProductionStatus]] page  * All QA for DB changes must include checking that the aggregate database time is within budget. To find out how long the database updates will take to deploy on production, double the aggregate time from staging and add 10 minutes. Grep for lines like "2208-30-0 applied just now in 70.5 seconds" in devpad:/x/launchpad.net-logs/staging/sourcherry/2011-*-staging_restore.log
 * If the aggregate time is too large, either merge fewer than all of the possible revisions, or back out the problem revision. This is part of QA on db-stable.
Line 53: Line 47:
=== Release Week ===  * Ask the LOSAs to merge db-stable rev XXX to `devel` and put PQM into RC mode (on prae, as pqm: {{{cd /home/pqm/archives/rocketfuel/launchpad/devel; bzr update; bzr revert; bzr merge -r 10381 lp:~launchpad-pqm/launchpad/db-stable; bzr commit -m "Merge db-stable 10381." ; bzr push lp:~launchpad-pqm/launchpad/devel}}}). If there are conflicts, a developer should reproduce the merge and tell the LOSA how to resolve them.
 * After buildbot blesses the merge ask LOSA to do the upgrade to qastaging ('refresh the qastaging tree on sourcherry, run normal upgrade scripts there')
 * '''While in RC mode:'''
    * Anyone can land 'release-critical' branches on `devel` that contain
    fixes for problems that would prevent the deployment from succeeding. This includes
    fixes or reversals for items marked `qa-bad`. Use `[release-critical=<you>]` in the PQM commit message.
    * A good rule of thumb to apply is: will landing this change save us from handling [[https://wiki.canonical.com/Launchpad/PolicyandProcess/CrisisHandlingPolicy|an incident]] during or shortly after the deployment? If so, land it!
    * For all other changes (features that would be delayed, nice-to-have fixes, annoyances to users, etc.) approval is required from either the [[https://launchpad.net/~launchpad-leader|project lead]], [[https://launchpad.net/~launchpad-architect|technical architect]] or [[https://launchpad.net/~launchpad-strategist|product strategist]].
 * Ask the LOSAs to take PQM out of RC mode once you have verified that qastaging is working well (e.g. re-QA all the schema changes included in the merge).
  * Update the `#launchpad-dev` topic to state 'PQM in RC mode' when RC mode starts, and revert the topic when RC mode ends.
Line 55: Line 58:
  * At the beginning of week 4. Make sure that release-critical was turned on
  in PQM. (Monday 00:00 UTC)
     * There is an option to leave devel open for r-c landings as well until Tuesday.
  * Make sure all the cowboys listed on [[https://wiki.canonical.com/InformationInfrastructure/OSA/LaunchpadProductionStatus|LaunchpadProductionStatus]] are either already landed or are added to the 'unusual rollout requirements' on that same page, so that the LOSAs will re-apply the cowboys after the rollout. It is definitely preferable not to re-apply cowboys, so it's a good idea to track down who hasn't landed a permanent fix yet.
Line 59: Line 60:
  * At the beginning of week 4, schedule a call with the Foundations team lead
  (and other leads if known to be pertinent) to determine what system
  changes might need comprehensive QA. If these exist, consider these
  thoughts.
  * If some unexpected problems are encountered during the roll-out and these put the roll-out schedule off-track by more than 30 minutes, and it can be aborted, then it should be aborted. In that case:
Line 64: Line 62:
     * Any related problem encountered should be treated as a red flag,
     forcing more thorough QA.
  * Identify the source of the problem and come up with a best estimate on how soon it can be fixed.
Line 67: Line 64:
     * Foundations lead should report on reviewing logs on edge, such
     as of cronscript output.
  * Determine when the next attempt should be undertaken (usually: the backup window) and make sure it is announced. Give at least 24 hours of advance warning (because the readonly replica takes 24 hours to restore).
Line 70: Line 66:
  * Look at the staging DB restore time for the week-end and determine if any
  changes to the announced down-time should be made.
==== After the roll-out ====
Line 73: Line 68:
  * Determine the schedule and deadlines. Send an email to launchpad-dev with all of the deadlines, similar to this [[ExampleReleaseScheduleEmail|example email]]. Place the deadlines on the team calendar.   * Immediately after the roll-out, examine the site for problems. For example, ensure CSS loads properly, all external links on the front page are reachable, etc.
Line 75: Line 70:
  * Update the `#launchpad-dev` topic to state we are in 'Release Critical' and to list the release manager.   * Announce on `launchpadstatus` that the release is done.
Line 77: Line 72:
  * Maintain the list of the [[CurrentRolloutBlockers|Current roll-out blockers]]

    The release manager should poll the team leads and QA engineers
    continuously to ensure that the list of release blockers is up-to-date. (We need to explore a
    work-around to retire this wiki page and do the management in Launchpad.)

    All bugs that are likely to cause lots of OOPSes, time-outs or prevent
    several users from working are good CRB candidates.

    It's a good idea to subscribe yourself to the page. (Currently broken.)

  * Make sure that developers are assigned to all problems we want to fix.

  * Review release-critical merge proposals. The policy should be:
     * All RC candidates go through the normal review process.
     * After code and UI review the MP is left in 'Needs Review' state.
     * A new review of type 'release-critical' is added to the MP and assigned to the release manager.
     * If the MP is approved for 'release-critical', the review is marked 'Approve' and the state of the MP is set to 'Approved'.


=== On the day before the roll-out ===

  * Check that the LOSA do a staged deployment of the code. We are looking for
  any hidden build problems and to determine the amount of time this step will
  take.

  * Request that landing to the `devel` branch be closed, 24 hours before the scheduled release. All changes should on the last day be merged through `db-devel`.

=== On the day of the roll-out ===

  * Chase up ''Current Rollout Blockers'' and any other pending release-critical
  fixes.

  * With PQM remaining open, have the LOSAs stop buildbot and set it do manual runs.
  * Remind people that all changes need to be in buildbot for '''9 hours'''
  before the roll-out time. The LOSAs require two hours of pre-release preparation and we need
  to allow for two complete buildbot cycles. (9 = 2 + 2 * 3.5)

  * In the case of failures, it's best to roll-out the last-known-good-build
  rather than delaying the release. The cut-off point to decide which revision
  to roll out is '''2 hours''' before the scheduled release.

  * Re-announce downtime so it serves as a reminder on [[http://identi.ca/launchpadstatus|launchpadstatus]] account at least 4h before the actual rollout.

  * Ensure that any embargoed external resources (e.g. blog entries) are live and accessible through the links provided. Ensure that a blog editor (Matthew Revell or delegate) is available at the time of the roll-out.

 * Immediately after the roll-out, examine the site for problems. For example, ensure CSS loads properly, all external links on the front page are reachable, etc.

 * Check that source code dependencies revision numbers are correct (compare
 them with what is listed in `utilities/sourcedeps.conf`) "bzr branch lp:~launchpad-pqm/lp-production-configs/trunk" then look at config-manager/production-{devel,stable}

=== During the roll-out ===

  * If some unexpected problems are encountered during the roll-out and these
  put the roll-out schedule off-track by more than 30 minutes, the decision to
  abort the roll-out should be taken. In that case:

    * Identify the source of the problem and come up with a best estimate on
    how soon it can be fixed.

    * Determine when the next attempt should be undertaken and make sure it is
    announced. Give at least 12 hours of fore-notice.

=== After the roll-out ===

  * With the QA engineers, review the OOPS reports.

    All common OOPSes are candidates for more release-critical fixes and
    scheduling another roll-out.

  
  * Prepare and schedule any necessary re-roll.

  * When a re-roll is needed, same activities than in the pre-roll out case.

  * Open the tree, when the released version is fine for the next cycle.

  * The release-manager needs to select the next release manager.

== Re-opening PQM ==

Once the roll out is complete and any critical issues have been dealt with, it's time to re-open PQM. Before doing that, though, we need to merge db-devel back into devel.

== Release critical policy ==

  * To apply for a release-critical approval, you must have a reviewed
  merge proposal on Launchpad. The engineer adds a review of type
  `release-critical` to the merge proposal and ensures it is in the 'Needs Review' state.

  * Good candidates for release-critical approval are issues found during QA that are
  bound to create OOPSes and time outs or otherwise significantly inconvenience our end-users.

  * Apart from special exceptions discussed with the project lead, only bug fixes
  should be granted release-critical approval.

  * If there is no way for the developer to QA his change on staging through
  the normal update procedure before the roll-out, it's recommended to request
  a cowboy of the branch on staging to QA it before approval.
  * Announce any post-roll-out issues on `launchpadstatus` as they are discovered. Create incident reports for them, following the Launchpad/PolicyandProcess/Announcements/IncidentProcess.
Line 178: Line 76:
  * Release-critical branches containing database patches should only
  be accepted if they don't impact the estimated roll-out time.

  * One the day of the roll-out, only database patches that would be critical
  to prevent data corruption should be accepted.

  * For the second re-roll, again only DB patches critical to data safety
  should be considered as it impacts our ability to update without down-time.

== Scheduling ==

  * Engineers apply in advance for one cycle.

  * They are selected by the previous release manager. Once selected, their name
  is put on the [[https://wiki.canonical.com/InformationInfrastructure/OSA/LaunchpadProductionStatus|Launchpad Production Status page]].

  * The actual roll-out time is determined by the release-manager's location:

        || Location || Roll out time ||
        || Americas || 23:00UTC ||
        || Europe || 10:00UTC ||
        || Asia/Pacific || 00:00UTC ||

  * No engineer should apply for the role more than twice a year.
  * If the qa-ok of a patch merged to devel was faulty and a fix is needed, this shouldn't impact the downtime - if it will, the entire thing should just be backed out.
Line 207: Line 82:
 * CurrentRolloutBlockers -- things currently blocking rollout
 * [[QATeam/TestPlans]] -- proof from all the teams that their code works
Line 210: Line 83:
 * [[https://wiki.canonical.com/Launchpad/PolicyandProcess/ProductionChangeApprovalPolicy|Approval of production change]]
 * [[https://wiki.canonical.com/Launchpad/PolicyandProcess/EmergencyChange|Emergency change (aka cherry-picks)]]

  • Process Name: Monthly DB deployment with downtime

  • Process Owner: Francis Lacoste

  • Parent Process/Activity: None

  • Supported Policy: None

Process Overview

Each month a DB downtime is scheduled. The team lead schedules this with the Launchpad stakeholders and LOSAs. It is broadly scheduled up to a year in advance on the downtime schedule. The scheduled time should overlap with availability from the DBA.

The Launchpad engineers request a merge from db-stable of *only* QA-ok commits 48 hours before the scheduled time. The DB downtime then deploys just that merge revision and, where possible, applies only schema changes.

Other events requiring downtime (such as codehosting setup changes) have partial downtime scheduled as required.

DB-stable->devel merge inputs

  • Email and IRC messages from engineers and team leads.
  • Deployment reports

Activities

Coordinator (LP Team lead)

  • Negotiate and confirm the dates and times for the main release window and a backup release window with the following groups. Refer to the downtime schedule for the pre-scheduled dates and times (we have a process in place to ensure our release calendar does not conflict with Ubuntu; conflicts normally do not happen). The goal is to confirm that the scheduled downtime does not interrupt other teams at critical times, to ensure that the release window does not place undue stress or risk upon our own operations team, and to renegotiate the dates and times if necessary. Note when scheduling that services on Launchpad can begin being paused up to half an hour before the rollout.
    • the stakeholders' list (private-canonical-launchpad-stakeholders@lists.launchpad.net)
    • the LOSAs (losas@canonical.com)
    • the DBA (Stuart Bishop <stuart.bishop@canonical.com>)

Let Matthew Revell <matthew.revell@canonical.com> know the time to announce it.

The downtime window is always 90 minutes; if less time is consumed, great. If more time is needed, we go back to the drawing board to figure out how to do it faster.

Re-announce the downtime on the launchpadstatus account at least 4h before the actual rollout, so that it serves as a reminder. You can do this yourself with the identi.ca login info.

Ensure that Matt Revell has a downtime announcement email ready for lp-announce. Again, make sure it's noted that some LP services (e.g. soyuz) can go offline temporarily up to half an hour before the rollout.
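The lead times mentioned in this process (merge requested 48 hours ahead, reminder at least 4 hours ahead, services possibly pausing up to half an hour ahead, and a 90-minute window) can be derived from the rollout start time. The following is only a minimal sketch of that arithmetic; the function name, the dictionary keys and the example date are hypothetical and are not part of any Launchpad tooling.

```python
from datetime import datetime, timedelta, timezone


def downtime_timeline(rollout_start):
    """Return the key deadlines around a scheduled rollout start time."""
    return {
        # The merge of QA-ok db-stable commits is requested 48 hours ahead.
        "merge_request_due": rollout_start - timedelta(hours=48),
        # The reminder goes out at least 4 hours ahead.
        "reminder_due": rollout_start - timedelta(hours=4),
        # Services may begin pausing up to half an hour ahead.
        "earliest_service_pause": rollout_start - timedelta(minutes=30),
        # The downtime window is always 90 minutes.
        "window_end": rollout_start + timedelta(minutes=90),
    }


# Hypothetical rollout time, used only for illustration.
rollout = datetime(2011, 7, 6, 23, 0, tzinfo=timezone.utc)
for name, when in downtime_timeline(rollout).items():
    print(f"{name}: {when.isoformat()}")
```

Working in UTC throughout avoids ambiguity when the coordinator, the LOSAs and the DBA are in different timezones.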

Engineering responsibilities

  • All QA for DB changes must include checking that the aggregate database time is within budget. To find out how long the database updates will take to deploy on production, double the aggregate time from staging and add 10 minutes. Grep for lines like "2208-30-0 applied just now in 70.5 seconds" in devpad:/x/launchpad.net-logs/staging/sourcherry/2011-*-staging_restore.log
  • If the aggregate time is too large, either merge fewer than all of the possible revisions, or back out the problem revision. This is part of QA on db-stable.
  • Ask the LOSAs to merge db-stable rev XXX to devel and put PQM into RC mode (on prae, as pqm: cd /home/pqm/archives/rocketfuel/launchpad/devel; bzr update; bzr revert; bzr merge -r 10381 lp:~launchpad-pqm/launchpad/db-stable; bzr commit -m "Merge db-stable 10381." ; bzr push lp:~launchpad-pqm/launchpad/devel). If there are conflicts, a developer should reproduce the merge and tell the LOSA how to resolve them.

  • After buildbot blesses the merge ask LOSA to do the upgrade to qastaging ('refresh the qastaging tree on sourcherry, run normal upgrade scripts there')
  • While in RC mode:

    • Anyone can land 'release-critical' branches on devel that contain fixes for problems that would prevent the deployment from succeeding. This includes fixes or reversals for items marked qa-bad. Use [release-critical=<you>] in the PQM commit message.

    • A good rule of thumb to apply is: will landing this change save us from handling an incident during or shortly after the deployment? If so, land it!

    • For all other changes (features that would be delayed, nice-to-have fixes, annoyances to users, etc.) approval is required from either the project lead, technical architect or product strategist.

  • Ask the LOSAs to take PQM out of RC mode once you have verified that qastaging is working well (e.g. re-QA all the schema changes included in the merge).
    • Update the #launchpad-dev topic to state 'PQM in RC mode' when RC mode starts, and revert the topic when RC mode ends.

    • Make sure all the cowboys listed on LaunchpadProductionStatus are either already landed or are added to the 'unusual rollout requirements' on that same page, so that the LOSAs will re-apply the cowboys after the rollout. It is definitely preferable not to re-apply cowboys, so it's a good idea to track down who hasn't landed a permanent fix yet.

    • If some unexpected problems are encountered during the roll-out and these put the roll-out schedule off-track by more than 30 minutes, and it can be aborted, then it should be aborted. In that case:
    • Identify the source of the problem and come up with a best estimate on how soon it can be fixed.
    • Determine when the next attempt should be undertaken (usually: the backup window) and make sure it is announced. Give at least 24 hours of advance warning (because the readonly replica takes 24 hours to restore).
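The production estimate described under Engineering responsibilities (double the aggregate staging time and add 10 minutes) can be sketched in a few lines of Python. This is an illustration only: the function names are hypothetical, the sample log text is made up apart from the line format quoted above, and comparing the estimate against the 90-minute downtime window is an assumption about what "within budget" means here.

```python
import re

# Matches staging restore log lines such as:
#   2208-30-0 applied just now in 70.5 seconds
PATCH_LINE = re.compile(r"(\d+-\d+-\d+) applied just now in (\d+(?:\.\d+)?) seconds")

# Assumption: the budget is the fixed 90-minute downtime window.
DOWNTIME_WINDOW_MINUTES = 90


def staging_seconds(log_text):
    """Sum the per-patch times reported in a staging restore log."""
    return sum(float(m.group(2)) for m in PATCH_LINE.finditer(log_text))


def production_estimate_minutes(total_staging_seconds):
    """Double the aggregate staging time and add 10 minutes."""
    return 2 * total_staging_seconds / 60 + 10


def within_budget(log_text, budget_minutes=DOWNTIME_WINDOW_MINUTES):
    """Return True if the production estimate fits inside the budget."""
    estimate = production_estimate_minutes(staging_seconds(log_text))
    return estimate <= budget_minutes


# Hypothetical log excerpt; the second patch number is invented.
sample_log = (
    "2208-30-0 applied just now in 70.5 seconds\n"
    "2208-31-0 applied just now in 49.5 seconds\n"
)
print(production_estimate_minutes(staging_seconds(sample_log)))
print(within_budget(sample_log))
```

If the estimate does not fit, the options listed above still apply: merge fewer revisions or back out the problem revision.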

After the roll-out

  • Immediately after the roll-out, examine the site for problems. For example, ensure CSS loads properly, all external links on the front page are reachable, etc.
  • Announce on launchpadstatus that the release is done.

  • Announce any post-roll-out issues on launchpadstatus as they are discovered. Create incident reports for them, following the Launchpad/PolicyandProcess/Announcements/IncidentProcess.

Database Patches

  • If the qa-ok of a patch merged to devel was faulty and a fix is needed, this shouldn't impact the downtime - if it will, the entire thing should just be backed out.

References

PolicyAndProcess/Downtime (last edited 2011-06-06 22:02:02 by flacoste)