Diff for "PolicyAndProcess/Downtime"

Differences between revisions 54 and 55
Revision 54 as of 2010-09-22 18:32:30
Size: 14047
Editor: edwin-grubbs
Comment:
Revision 55 as of 2010-10-21 14:35:44
Size: 14099
Editor: edwin-grubbs
Comment:
Line 66: Line 66:
  * Follow [[https://bugs.edge.launchpad.net/launchpad-project/+bugs?field.tag=qa-needstesting | the needs testing bugs]] and encourage engineers to QA before PQM closes.   * Follow [[https://bugs.edge.launchpad.net/launchpad-project/+bugs?field.tag=qa-needstesting,qa-bad | the needs testing bugs]] and encourage engineers to QA before PQM closes.
Line 79: Line 79:
  * Make sure all the cowboys listed on [[https://wiki.canonical.com/InformationInfrastructure/OSA/LaunchpadProductionStatus|LaunchpadProductionStatus]] are either already landed or listed among CurrentRolloutBlockers   * Make sure all the cowboys listed on [[https://wiki.canonical.com/InformationInfrastructure/OSA/LaunchpadProductionStatus|LaunchpadProductionStatus]] are either already landed or are added to the 'unusual rollout requirements' on that same page, so that the LOSAs will re-apply the cowboys after the rollout. It is definitely preferable not to re-apply cowboys, so it's a good idea to track down who hasn't landed a permanent fix yet.
Line 106: Line 106:
  * Maintain the list of the [[CurrentRolloutBlockers|Current roll-out blockers]]   * Check the qa report for [[https://devpad.canonical.com/~lpqateam/qa_reports/deployment-stable.html|stable]] and [[https://devpad.canonical.com/~lpqateam/qa_reports/deployment-db-stable.html|db-stable]].
Line 111: Line 111:
  * The release manager should poll the team leads and QA engineers continuously to ensure that the list of release blockers is up-to-date. (We need to explore a work-around to retire this wiki page and do the management in Launchpad.)

      * All bugs that are likely to cause lots of OOPSes, time-outs or prevent
      several users from working are good CRB candidates.

      * It's a good idea to subscribe yourself to the page. (The Subscribe
      User is broken, but you can still subscribe by going to UserPreferences
      and adding `CurrentRolloutBlockers` to the list of subscribed page.)
  * The release manager should poll the team leads and QA engineers continuously to ensure that the release blockers in the qa report ([[https://devpad.canonical.com/~lpqateam/qa_reports/deployment-stable.html|stable]] and [[https://devpad.canonical.com/~lpqateam/qa_reports/deployment-db-stable.html|db-stable]]) are being handled.
Line 142: Line 135:
  * Chase up ''Current Rollout Blockers'' and any other pending release-critical fixes.   * Chase up ''Current Rollout Blockers'' in the [[[[https://devpad.canonical.com/~lpqateam/qa_reports/deployment-db-stable.html|db-stable qa report]] and any other pending release-critical fixes.
Line 203: Line 196:
  [[https://wiki.canonical.com/Launchpad/PolicyandProcess/ProductionChangeApprovalPolicy|Production Change Approval Policy]]. In this case, it is important that in addition of your change being tracked on CurrentRolloutBlockers, you ensure that the release manager is aware of that extra branch.   [[https://wiki.canonical.com/Launchpad/PolicyandProcess/ProductionChangeApprovalPolicy|Production Change Approval Policy]]. In this case, it is important that you ensure that the release manager is aware of that extra branch.
Line 236: Line 229:
 * CurrentRolloutBlockers -- things currently blocking rollout

  • Process Name: Release Manager Rotation Process

  • Process Owner: Francis Lacoste

  • Parent Process/Activity: None

  • Supported Policy: None

Process Overview

Each cycle a different engineer takes the role of release manager. The release manager coordinates with the release team and all team leads to ensure that the tree is ready for the roll-out and that all critical bugs are in or worked-around.

Back-up release managers are the two RMs from the previous two cycles.

One option that has worked very well is to share the release-manager role across timezones, handing over the current status and tasks to the backup in the next timezone at the end of your day. It's a great opportunity to work together as a team.

It is the incumbent release manager's job to find a willing person to manage the next release.

Release Manager inputs

Activities

When you start

  • Negotiate and confirm the dates and times for the main release window and a backup release window with the following groups. Refer to the Launchpad Release Calendar for the pre-scheduled dates and times (we have a process in place to ensure our release calendar does not conflict with Ubuntu; conflicts normally do not happen). The goal is to confirm that the scheduled downtime does not interrupt other teams at critical times, to ensure that the release window does not place undue stress or risk upon our own operations team, and to renegotiate the dates and times if necessary.

    • Check the stakeholders list archives for the mails I sent out if you are looking for a template. -- mars 2010-07-29 14:37:13

    • Consider when critical people have scheduled holidays and vacation around the release date. You can run a release without critical people like DB reviewers or the Project Lead, but it increases the stress and risk when something goes wrong. -- mars 2010-07-29 14:37:13

    • We consider the backup release window now because getting feedback from all the stakeholders can take a few days. Presumably you want to try a second rollout as soon as possible after the first (ask the LOSAs about what a reasonable backup window would be). -- mars 2010-07-29 14:52:37

End of Week 2

  • Notify the canonical-launchpad list of the expected date and time for PQM closure. This should ideally happen at least 1 week before PQM closes; at the latest it should happen at the start of Week 3.

Week 3 (the one before the roll-out)

Preparation is required. Many items can be considered critical.

Start of Week 3

  • Ensure that staging is up-to-date and had a DB restore so that it can be used for non-edge QA.

  • The down-time estimate should be based on the last staging DB restore. We should be overly conservative in the read-only announcement. Recently, we have always encountered DB locking issues during the roll-out, which increased the read-only window. Until we are able to have stable DB upgrades and determine a reliable fudge factor between time on staging and production, we should plan for the worst and come back earlier.
  • Follow the needs testing bugs and encourage engineers to QA before PQM closes.

During Week 3

  • Ensure that Matt Revell has a downtime announcement email ready for lp-announce.

End of Week 3

On the day the PQM closes.

  • Go through all the 'unusual rollout requirements' and clean them up from the LaunchpadProductionStatus page.

  • Make sure all the cowboys listed on LaunchpadProductionStatus are either already landed or are added to the 'unusual rollout requirements' on that same page, so that the LOSAs will re-apply the cowboys after the rollout. It is definitely preferable not to re-apply cowboys, so it's a good idea to track down who hasn't landed a permanent fix yet.

  • Offer to review anything that may need a release-critical approval. Provide your RC to any work that may miss the PQM close time (EC2 is unpredictable). Providing RCs before the close of PQM allows engineers to work with some certainty that their work is required.

  • Ask the LOSAs to switch PQM to release-critical around 22:00 UTC (Friday or Thursday). Allow landing to devel.
  • Update the #launchpad-dev topic to state we are in 'Release Critical' and to list the release manager.

  • Be prepared to make test fixes over the weekend to ensure the last branches are complete, or back them out so that staging is ready for testing on Monday.

Release Week

Start of Week 4

  • At the beginning of week 4, schedule a call with the Foundations team lead (and other leads if known to be pertinent) to determine what system changes might need comprehensive QA. If these exist, consider these thoughts.
    • Any related problem encountered should be treated as a red flag, forcing more thorough QA.
    • Foundations lead should report on reviewing logs on edge, such as of cronscript output.
  • Look at the staging DB restore time for the week-end and determine if any changes to the announced down-time should be made.
  • Determine the schedule and deadlines. Send an email to launchpad-dev with all of the deadlines, similar to this example email. Place the deadlines on the team calendar.

  • Check the qa report for stable and db-stable.

During Week 4

  • The release manager should poll the team leads and QA engineers continuously to ensure that the release blockers in the qa report (stable and db-stable) are being handled.

  • Make sure that developers are assigned to all problems we want to fix.
  • Review release-critical merge proposals. The policy should be:
    • All RC candidates go through the normal review process.
    • After code and UI review the MP is left in 'Needs Review' state.
    • A new review of type 'release-critical' is added to the MP and assigned to the release manager.
    • If the MP is approved for 'release-critical', the review is marked 'Approve' and the state of the MP is set to 'Approved'.
  • Consider closing devel for RC landings (Tuesday is usually the latest devel can be open)
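
The release-critical review policy above can be read as a simple check on a merge proposal. The following is a minimal sketch only: `MergeProposal` is a hypothetical stand-in (it is not the Launchpad API), the review-type keys are assumptions, and for brevity it tracks only the code review, not a UI review.

```python
from dataclasses import dataclass, field


@dataclass
class MergeProposal:
    """Hypothetical stand-in for a merge proposal; not the Launchpad API."""
    state: str = "Needs Review"
    # Maps review type to vote, e.g. {"code": "Approve"}.
    reviews: dict = field(default_factory=dict)


def rc_ready_to_land(mp: MergeProposal) -> bool:
    """True once the normal review and the release-critical review are both
    approved and the proposal itself has been set to 'Approved'."""
    normal_review_done = mp.reviews.get("code") == "Approve"
    rc_approved = mp.reviews.get("release-critical") == "Approve"
    return normal_review_done and rc_approved and mp.state == "Approved"
```

A proposal that has passed code review but is still waiting on the release manager stays in 'Needs Review', so `rc_ready_to_land` returns `False` for it.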

On the day before the roll-out

  • Ensure that PQM is closed to devel; RC landings go to db-devel.
  • Check that the LOSAs do a staged deployment of the code. We are looking for any hidden build problems and to determine the amount of time this step will take.

On the day of the roll-out

  • Chase up Current Rollout Blockers in the db-stable qa report and any other pending release-critical fixes.

  • With PQM remaining open, have the LOSAs stop buildbot and set it to manual runs. This step just keeps the buildmaster process tree tidy when the codehost servers are restarted.
  • Remind people that all changes need to be in buildbot for 9 hours before the roll-out time. The LOSAs require two hours of pre-release preparation and we need to allow for two complete buildbot cycles. (9 = 2 + 2 * 3.5)

  • In the case of failures, it's best to roll-out the last-known-good-build rather than delaying the release. The cut-off point to decide which revision to roll out is 2 hours before the scheduled release.

  • Re-announce the downtime on the launchpadstatus account at least 4 hours before the actual roll-out, so that it serves as a reminder.

  • Start a new RolloutReport to document any issues that will be encountered during the roll-out.

  • Once the revision number that will be rolled-out has been announced to the LOSA, PQM can be re-opened and db-devel should be merged into devel.
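
The deadlines in this list all count back from the scheduled roll-out time, so they can be computed in one place. The following is a minimal sketch, assuming a 3.5-hour buildbot cycle as implied by the arithmetic above; the function name and the dictionary keys are hypothetical and not part of any Launchpad tool.

```python
from datetime import datetime, timedelta, timezone

LOSA_PREP = timedelta(hours=2)         # pre-release preparation by the LOSAs
BUILDBOT_CYCLE = timedelta(hours=3.5)  # assumed length of one buildbot cycle
BUILDBOT_CYCLES = 2                    # complete cycles to allow for


def rollout_deadlines(rollout: datetime) -> dict:
    """Return the day-of-roll-out deadlines, counted back from `rollout`."""
    return {
        # 2 + 2 * 3.5 = 9 hours before the roll-out
        "changes_in_buildbot_by": rollout - (LOSA_PREP + BUILDBOT_CYCLES * BUILDBOT_CYCLE),
        # the reminder goes out at least 4 hours ahead
        "reannounce_downtime_by": rollout - timedelta(hours=4),
        # last point to choose which revision is rolled out
        "revision_cutoff": rollout - timedelta(hours=2),
    }
```

For a roll-out scheduled at 22:00 UTC, this gives 13:00 UTC for changes to be in buildbot, 18:00 UTC for the reminder, and 20:00 UTC for the revision cut-off.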

During the roll-out

  • If unexpected problems are encountered during the roll-out and these put the roll-out schedule off-track by more than 30 minutes, the decision to abort the roll-out should be taken. In that case:
    • Identify the source of the problem and come up with a best estimate of how soon it can be fixed.
    • Determine when the next attempt should be undertaken and make sure it is announced. Give at least 12 hours of advance notice.
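
The abort rule and the notice period can be written down as two small checks. This is a sketch under the thresholds stated above; the function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

ABORT_THRESHOLD = timedelta(minutes=30)  # off-track by more than this: abort
MIN_NOTICE = timedelta(hours=12)         # notice required before the next attempt


def should_abort(delay: timedelta) -> bool:
    """True when the roll-out is off-track by more than 30 minutes."""
    return delay > ABORT_THRESHOLD


def earliest_next_attempt(announced_at: datetime) -> datetime:
    """Earliest start for the next attempt, given when it is announced."""
    return announced_at + MIN_NOTICE
```

A delay of exactly 30 minutes does not trigger the abort; the rule applies only once the schedule has slipped by more than that.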

After the roll-out

  • Immediately after the roll-out, examine the site for problems. For example, ensure CSS loads properly, all external links on the front page are reachable, etc.
  • Announce on launchpadstatus that the release is done.

  • Announce any post-roll-out issues on launchpadstatus as they are discovered. Add them to the incident report. (Following the Launchpad/PolicyandProcess/Announcements/IncidentProcess, where the release manager acts as the communication liaison.)

  • With the QA engineers, review the OOPS reports. All common OOPSes are candidates for cherry-picks.
  • Complete the incident report and ask any engineers who participated in fixing the issues to help with the root-cause analysis.
  • The release-manager needs to select the next release manager.

Release critical policy

  • To apply for a release-critical approval, you must have a reviewed merge proposal on Launchpad. The engineer adds a review of type release-critical to the merge proposal and ensures it is in the 'Needs Review' state.

  • Good candidates for release-critical approval are issues found during QA that are bound to create OOPSes and time outs or otherwise significantly inconvenience our end-users.
  • A good rule of thumb to apply is if this is something that would be likely to be deployed as a cherry-pick, it's good for release-critical.

  • Apart from special exceptions discussed with the project lead, only bug fixes should be granted release-critical approval.
  • If there is no way for the developer to QA his change on staging through the normal update procedure before the roll-out, it's recommended to request a cowboy of the branch on staging to QA it before approval.
  • If the current release manager isn't available for granting release-critical approval, you can seek approval using the regular Production Change Approval Policy. In this case, it is important that you ensure that the release manager is aware of that extra branch.

Database Patches

  • Release-critical branches containing database patches should only be accepted if they don't impact the estimated roll-out time.
  • On the day of the roll-out, only database patches that would be critical to prevent data corruption should be accepted.
  • For the second re-roll, again only DB patches critical to data safety should be considered as it impacts our ability to update without down-time.

Scheduling

  • Engineers apply in advance for one cycle.
  • They are selected by the previous release manager. Once selected, their name is put on the Launchpad Production Status page.

  • The actual roll-out time is determined by the release-manager's location:

      Location        Roll-out time
      Americas        22:00 UTC
      Europe          10:00 UTC
      Asia/Pacific    22:00 UTC

  • No engineer should apply for the role more than twice a year.
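
The location table above is small enough to keep as a lookup. A minimal sketch, with a hypothetical function name; the times are the ones listed in the table.

```python
from datetime import time

# Roll-out time (UTC) by the release manager's location, from the table above.
ROLLOUT_TIME_UTC = {
    "Americas": time(22, 0),
    "Europe": time(10, 0),
    "Asia/Pacific": time(22, 0),
}


def rollout_time(location: str) -> time:
    """Return the UTC roll-out time for a release manager's location."""
    try:
        return ROLLOUT_TIME_UTC[location]
    except KeyError:
        raise ValueError(f"unknown release manager location: {location!r}") from None
```

Raising on an unknown location, rather than falling back to a default, keeps a mistyped region from silently scheduling the wrong window.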

References

PolicyAndProcess/Downtime (last edited 2011-06-06 22:02:02 by flacoste)