## page was renamed from PolicyAndProcess/ReleaseManagerRotation

 * '''Process Name:''' Monthly DB deployment with downtime
 * '''Process Owner:''' Francis Lacoste
 * '''Parent Process/Activity:''' None
 * '''Supported Policy:''' None

== Process Overview ==

Each month a DB downtime is scheduled. The team lead schedules this with the Launchpad stakeholders and LOSAs. It is broadly scheduled up to a year in advance on [[DowntimeDeploymentSchedule|the downtime schedule]]. The scheduled time should overlap with the DBA's availability.

The Launchpad engineers request a merge from db-stable of *only* QA-ok commits 48 hours before the scheduled time. The DB downtime then deploys just that merge revision and, where possible, makes only schema changes.

Other events that require downtime (such as codehosting setup changes) have partial downtime scheduled as required.

== DB-stable->devel merge inputs ==

 * Email and IRC messages from engineers and team leads.
 * Deployment reports
  * [[https://devpad.canonical.com/~lpqateam/qa_reports/deployment-stable.html|stable]]
  * [[https://devpad.canonical.com/~lpqateam/qa_reports/deployment-db-stable.html|db-stable]]

== Activities ==

=== Coordinator (LP Team lead) ===

 * Negotiate and confirm the dates and times for the main release window and a backup release window with the groups listed below. Refer to [[DowntimeDeploymentSchedule|the downtime schedule]] for the pre-scheduled dates and times (we have a process in place to ensure our release calendar does not conflict with Ubuntu; conflicts normally do not happen). The goal is to confirm that the scheduled downtime does not interrupt other teams at critical times, to ensure that the release window does not place undue stress or risk upon our own operations team, and to renegotiate the dates and times if necessary. When scheduling, note that services on Launchpad can start being paused up to half an hour before the rollout.
  * the stakeholders' list (private-canonical-launchpad-stakeholders@lists.launchpad.net)
  * the LOSAs (losas@canonical.com)
  * the DBA (Stuart Bishop)

Let Matthew Revell know the time so that he can announce it.

The downtime window is always 90 minutes; if less time is consumed, great. If more time is needed, we go back to the drawing board to figure out how to do it faster.

Re-announce the downtime as a reminder on the [[http://identi.ca/launchpadstatus|launchpadstatus]] account '''at least 4h''' before the actual rollout. You can do this yourself with the [[https://wiki.canonical.com/Launchpad/PolicyandProcess/Announcements/Downtime|identi.ca login info]].

Ensure that Matthew Revell has a downtime announcement email ready for lp-announce. Again, make sure it notes that some LP services (e.g. soyuz) can go offline temporarily up to half an hour before the rollout.

=== Engineering responsibilities ===

 * All QA for DB changes must include checking that the aggregate database time is within budget. To estimate how long the database updates will take to deploy on production, double the aggregate time from staging and add 10 minutes. Grep for lines like "2208-30-0 applied just now in 70.5 seconds" in `devpad:/x/launchpad.net-logs/staging/sourcherry/2011-*-staging_restore.log`.
 * If the aggregate time is too large, either merge fewer than all the possible revisions, or back out the problem revision. This is part of QAing on db-stable.
 * Ask the LOSAs to merge db-stable rev XXX to `devel` and put PQM into RC mode. (On prae, as pqm: {{{cd /home/pqm/archives/rocketfuel/launchpad/devel; bzr update; bzr revert; bzr merge -r 10381 lp:~launchpad-pqm/launchpad/db-stable; bzr commit -m "Merge db-stable 10381." ; bzr push lp:~launchpad-pqm/launchpad/devel}}}) If there are conflicts, a developer should reproduce the merge and tell the LOSA how to resolve them.
 * After buildbot blesses the merge, ask a LOSA to do the upgrade to qastaging ('refresh the qastaging tree on sourcherry, run normal upgrade scripts there').
 * '''While in RC mode:'''
  * Anyone can land 'release-critical' branches on `devel` that fix problems which would otherwise prevent the deployment from succeeding. This includes a fix or reversal for items marked `qa-bad`. Use `[release-critical=]` in the PQM commit message.
  * A good rule of thumb to apply is: will landing this change save us from handling [[https://wiki.canonical.com/Launchpad/PolicyandProcess/CrisisHandlingPolicy|an incident]] during or shortly after the deployment? If so, land it!
  * For all other changes (features that would be delayed, nice-to-have fixes, annoyances to users, etc.) approval is required from either the [[https://launchpad.net/~launchpad-leader|project lead]], [[https://launchpad.net/~launchpad-architect|technical architect]] or [[https://launchpad.net/~launchpad-strategist|product strategist]].
 * Ask the LOSAs to take PQM out of RC mode after verifying that qastaging is working well (e.g. re-QA all the schema changes included in the merge).
 * Update the `#launchpad-dev` topic to state 'PQM in RC mode' when RC mode starts, and revert the topic once RC mode ends.
 * Make sure all the cowboys listed on [[https://wiki.canonical.com/InformationInfrastructure/OSA/LaunchpadProductionStatus|LaunchpadProductionStatus]] are either already landed or are added to the 'unusual rollout requirements' on that same page, so that the LOSAs will re-apply the cowboys after the rollout. It is definitely preferable not to re-apply cowboys, so it's a good idea to track down who hasn't landed a permanent fix yet.
 * If unexpected problems during the roll-out put the schedule off-track by more than 30 minutes, and the roll-out can be aborted, then it should be aborted. In that case:
  * Identify the source of the problem and come up with a best estimate of how soon it can be fixed.
  * Determine when the next attempt should be undertaken (usually the backup window) and make sure it is announced. Give at least 24 hours of advance warning (because the read-only replica takes 24 hours to restore).

==== After the roll-out ====

 * Immediately after the roll-out, examine the site for problems. For example, ensure CSS loads properly, all external links on the front page are reachable, etc.
 * Announce on `launchpadstatus` that the release is done.
 * Announce any post-roll-out issues on `launchpadstatus` as they are discovered. Create incident reports for them, following the Launchpad/PolicyandProcess/Announcements/IncidentProcess.

=== Database Patches ===

 * If the qa-ok of a patch merged to devel was faulty and a fix is needed, the fix should not affect the downtime. If it would, the entire patch should be backed out.

== References ==

 * [[https://wiki.canonical.com/InformationInfrastructure/OSA/LaunchpadRollout|OSA Launchpad Rollout Procedures]]
 * SpuriousFailures -- useful for diagnosing last-minute build failures
 * [[https://wiki.canonical.com/InformationInfrastructure/OSA/LaunchpadProductionStatus|Launchpad production status]]
 * [[https://wiki.canonical.com/Launchpad/PolicyandProcess/ProductionChangeApprovalPolicy|Approval of production change]]
 * [[https://wiki.canonical.com/Launchpad/PolicyandProcess/EmergencyChange|Emergency change (aka cherry-picks)]]
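
As a rough illustration of the time-budget rule under Engineering responsibilities (double the aggregate staging time and add 10 minutes), the check could be sketched as below. This is not an existing Launchpad tool. The function names are hypothetical, the log line layout is inferred from the single sample line quoted on this page, and treating the 90 minute downtime window as the budget is an assumption.

```python
import re

# Matches lines such as "2208-30-0 applied just now in 70.5 seconds".
# Assumption: every patch line in the staging restore log has this shape.
PATCH_LINE = re.compile(
    r"(\d+-\d+-\d+) applied just now in (\d+(?:\.\d+)?) seconds"
)


def staging_patch_times(log_lines):
    """Return a list of (patch_id, seconds) for every matching log line."""
    times = []
    for line in log_lines:
        match = PATCH_LINE.search(line)
        if match:
            times.append((match.group(1), float(match.group(2))))
    return times


def estimate_production_seconds(staging_seconds):
    """Rule from this page: double the staging aggregate, add 10 minutes."""
    return 2 * staging_seconds + 10 * 60


def within_budget(log_lines, budget_minutes=90):
    """Compare the production estimate with a budget given in minutes.

    Assumption: the budget equals the 90 minute downtime window.
    """
    total = sum(seconds for _, seconds in staging_patch_times(log_lines))
    return estimate_production_seconds(total) <= budget_minutes * 60


# Sample input. The first line is the example quoted on this page; the
# second patch id and its timing are made up for illustration.
SAMPLE_LOG = [
    "2208-30-0 applied just now in 70.5 seconds",
    "2208-31-0 applied just now in 129.5 seconds",
    "some unrelated log line",
]
```

With the sample lines, the staging aggregate is 200 seconds, so the production estimate is 2 * 200 + 600 = 1000 seconds, well inside a 90 minute window. In practice the input would be the lines of the relevant `staging_restore.log` file rather than a hard-coded list.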