This LEP has been implemented. See the related documentation:
Fast downtime
Rather than extended periods of downtime (typically 60-90 minutes), have short downtime windows multiple times a week.
Contact: RobertCollins
On Launchpad: https://bugs.launchpad.net/launchpad-project/+bugs?field.tag=fastdowntime
Future related work on Launchpad: https://bugs.launchpad.net/launchpad-project/+bugs?field.tag=fastdowntime-later
Rationale
Our development cycle times correlate very highly with schema changes. Technical limitations in our environment mean that applying schema changes requires disconnecting all clients for a period of time. By making this period short and designing our schema changes carefully, we can dramatically simplify the way that we do downtime (most of the time), resulting in less overall downtime and faster delivery of features (with less churn on developer focus).
The basis for this change has been raised and hammered out on the stakeholders list; coding can start while further fine tuning is done on the -users list.
See also Database/LivePatching, which documents some implementation issues as well as things we can do totally live (like adding new indices).
Stakeholders
All the LP stakeholders, particularly OEM, who depend on LP to do daily releases.
User stories
developer-make-change
As a developer
I want to change Launchpad's schema without waiting 4 weeks
so that I can fix a bug / improve functionality for users.
When a developer has a DB patch, they can choose to try to deploy it in a fixed downtime window. They will broadly follow these steps:
- DB review, flagging that they want it done before the monthly downtime.
- Timing test on qastaging (done in a transaction and rolled back afterwards)
- Land the patch on devel
- Deploys continue as normal
- Downtime is scheduled for the DB patch, and the patch is applied
- Developer can start using the patch in Python code.
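The timing test in the steps above (apply the patch inside a transaction, measure it, roll it back) can be sketched as follows. This is a minimal illustration rather than Launchpad's actual tooling: it uses Python's built-in sqlite3 module as a stand-in for the real database, and the time_patch helper, the person table and the patch statement are all hypothetical.

```python
import sqlite3
import time


def time_patch(conn, statements):
    """Run the patch statements in one transaction, time them, then roll back.

    Returns the elapsed wall-clock time in seconds. The schema is left
    unchanged because the transaction is never committed.
    """
    cur = conn.cursor()
    cur.execute("BEGIN")
    try:
        start = time.perf_counter()
        for statement in statements:
            cur.execute(statement)
        return time.perf_counter() - start
    finally:
        # Undo the patch whether it succeeded or failed.
        if conn.in_transaction:
            cur.execute("ROLLBACK")


# isolation_level=None stops the sqlite3 module from managing transactions
# itself, so the explicit BEGIN / ROLLBACK above are the only ones issued.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE person (id INTEGER PRIMARY KEY)")  # hypothetical table

patch = ["ALTER TABLE person ADD COLUMN nickname TEXT"]  # hypothetical patch
elapsed = time_patch(conn, patch)
columns = [row[1] for row in conn.execute("PRAGMA table_info(person)")]
print(f"patch took {elapsed:.4f}s; columns after rollback: {columns}")
```

Because the transaction is rolled back, the same test can be repeated against the same database as often as needed. The measured time only predicts the real downtime if the test database is comparable in size and load to production.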
Constraints and Requirements
Must
- Reliably remove all contention on the DB for schema changes
- Reliably restore connections to the DB without requiring appserver / librarian / builddmaster / codebrowse-mapper instance restarts
- Be reliably fast: 3-5 minutes initially, but aim for 30-60 seconds in the medium term.
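One way a client could meet the "restore connections without a restart" requirement is to retry the connection with a growing delay until the database is back. The sketch below is only an illustration of that idea, not how Launchpad implements it: run_with_reconnect, its parameters and the flaky_connect stand-in are hypothetical, and the built-in ConnectionError stands in for whatever disconnection error the real database driver raises.

```python
import time


def run_with_reconnect(connect, work, attempts=5, delay=1.0, sleep=time.sleep):
    """Call work(connect()), reconnecting with exponential backoff on failure.

    connect: callable returning a fresh connection (hypothetical).
    work:    callable taking that connection and returning a result.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            conn = connect()
            return work(conn)
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(delay * 2 ** attempt)  # wait longer after each failure


# Demonstration with a stand-in connect() that fails twice, as if the
# database were mid-downtime, and then succeeds.
calls = {"count": 0}


def flaky_connect():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("database unavailable")
    return "connection"


waits = []  # record the requested delays instead of actually sleeping
result = run_with_reconnect(
    flaky_connect, lambda conn: f"used {conn}", delay=0.5, sleep=waits.append
)
```

With a downtime target of a few minutes, the number of attempts and the base delay would need to be chosen so that the total retry window comfortably exceeds the longest expected outage.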
Nice to have
There are a lot of bells and whistles we could add, but they will be the focus of future, completely distinct work: we want to deliver the core functionality as rapidly and reliably as possible.
Must not
- Require manual steps during the schema deploy: it must be fully automated
Out of scope
- A 'fail whale' page during the downtime
- Schema patches that cannot be done incrementally (or which we decide are simply too hard).
Subfeatures
Success
How will we know when we are done?
We can reliably deploy schema changes 24 hours after they land in devel, with less than 5 minutes of downtime.
How will we measure how well we have done?
The project lead has cycle time graphs which show long cycle times for DB-related projects. Those cycle times should drop substantially: the further they drop, the better this project has succeeded.
Thoughts?