Diff for "LEP/FastDowntime"


Differences between revisions 3 and 4
Revision 3 as of 2011-07-12 23:55:45
Size: 3208
Editor: lifeless
Comment: bring in workflow
Revision 4 as of 2011-07-13 22:21:39
Size: 3335
Editor: lifeless
Comment: link -later
Line 6 (revision 3):
'''On Launchpad:''' https://bugs.launchpad.net/launchpad-project/+bugs?field.tag=fastdowntime
Line 6 (revision 4):
'''On Launchpad:''' https://bugs.launchpad.net/launchpad-project/+bugs?field.tag=fastdowntime<<BR>>
'''Future related work on Launchpad:''' https://bugs.launchpad.net/launchpad-project/+bugs?field.tag=fastdowntime-later

Fast downtime

Rather than extended periods of downtime (typically 60-90 minutes), have short downtime windows multiple times a week.

Contact: RobertCollins
On Launchpad: https://bugs.launchpad.net/launchpad-project/+bugs?field.tag=fastdowntime
Future related work on Launchpad: https://bugs.launchpad.net/launchpad-project/+bugs?field.tag=fastdowntime-later

Rationale

Our development cycle times correlate very highly with schema changes. Because of technical limitations in our environment, applying a schema change requires disconnecting all clients for a period of time. By keeping this period short and designing our schema changes carefully, we can dramatically simplify the way that we do downtime (most of the time), resulting in less overall downtime and faster delivery of features (with less churn on developer focus).

The basis for this change has been raised and hammered out on the stakeholders list; coding can start while further fine tuning is done on the -users list.

See also Database/LivePatching, which documents some implementation issues as well as things we can do totally live (like adding new indices).

Stakeholders

All the LP stakeholders, particularly OEM, who depend on LP to do daily releases.

User stories

developer-make-change

As a developer
I want to change Launchpad's schema without waiting 4 weeks
so that I can fix a bug / improve functionality for users.

When a developer has a DB patch, they can choose to deploy it in a fixed downtime window. They will broadly follow these steps:

  • DB review, flagging that they want it done before the monthly downtime.
  • Timing test on qastaging (done in a transaction and rolled back afterwards)
  • Land the patch on devel
  • Deploys continue as normal
  • Downtime is scheduled for the DB patch, and the patch is applied
  • Developer can start using the patch in Python code.
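
The timing test in the steps above can be sketched as follows. This is a minimal illustration rather than Launchpad's actual tooling: it uses Python's built-in sqlite3 module with an in-memory database so that it runs anywhere, whereas the real test would run against the qastaging database. The function names, the `bug` table and the `heat` column are all hypothetical.

```python
import sqlite3
import time


def time_patch_in_rolled_back_transaction(conn, patch_statements):
    """Run the patch inside a transaction, time it, then roll it back.

    Returns the elapsed time in seconds. The connection is assumed to be
    in autocommit mode (isolation_level=None) so that BEGIN and ROLLBACK
    are issued explicitly here and not by the driver.
    """
    cur = conn.cursor()
    cur.execute("BEGIN")
    start = time.monotonic()
    try:
        for statement in patch_statements:
            cur.execute(statement)
        elapsed = time.monotonic() - start
    finally:
        # Always undo the patch, whether it succeeded or failed.
        cur.execute("ROLLBACK")
    return elapsed


def table_columns(conn, table):
    """Return the column names of a table (SQLite-specific helper)."""
    return [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]


# Hypothetical schema and patch, for illustration only.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE bug (id INTEGER PRIMARY KEY, title TEXT)")
patch = ["ALTER TABLE bug ADD COLUMN heat INTEGER DEFAULT 0"]

elapsed = time_patch_in_rolled_back_transaction(conn, patch)
columns_after = table_columns(conn, "bug")
print(f"patch took {elapsed:.4f}s; columns after rollback: {columns_after}")
```

Because the transaction is rolled back, the schema is left unchanged after the measurement, so the same timing run can be repeated before the patch lands on devel.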

Constraints and Requirements

Must

  • Reliably remove all contention on the DB for schema changes
  • Reliably restore connections to the DB without requiring appserver / librarian / builddmaster / codebrowse-mapper instance restarts
  • Be reliably fast: 3-5 minutes initially, but aim for 30-60 seconds in the medium term.
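
One way a long-running client could meet the "restore connections without restarts" requirement is to retry its connection with a bounded backoff until the database is available again. The sketch below assumes a caller-supplied `connect` callable that raises `ConnectionError` while the database is unavailable. The function name, the parameters and the default values are illustrative and are not taken from the Launchpad code base; the 300-second default simply mirrors the initial 5-minute upper bound above.

```python
import time


def reconnect_with_retry(connect, max_wait=300.0, initial_delay=0.5,
                         max_delay=10.0, sleep=time.sleep,
                         clock=time.monotonic):
    """Call connect() until it succeeds or max_wait seconds would be exceeded.

    The delay between attempts doubles after each failure, capped at
    max_delay. If the next wait would pass the deadline, the last
    ConnectionError is re-raised.
    """
    deadline = clock() + max_wait
    delay = initial_delay
    while True:
        try:
            return connect()
        except ConnectionError:
            if clock() + delay > deadline:
                raise
            sleep(delay)
            delay = min(delay * 2, max_delay)


# Usage with a fake connection that fails three times, standing in for a
# database that is briefly unavailable during a schema change.
attempts = {"count": 0}


def flaky_connect():
    attempts["count"] += 1
    if attempts["count"] <= 3:
        raise ConnectionError("database is down for a schema change")
    return "connected"


slept = []  # record the requested delays without actually sleeping
result = reconnect_with_retry(flaky_connect, sleep=slept.append)
print(result, slept)
```

Passing `sleep` and `clock` as parameters keeps the retry logic testable without real waiting; a production client would leave them at their defaults.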

Nice to have

There are a lot of bells and whistles we could do, but they will be the focus of future completely distinct work: we want to deliver the core functionality as rapidly and reliably as possible.

Must not

  • Require manual steps during the schema deploy: it must be fully automated

Out of scope

  • A 'fail whale' page during the downtime
  • Schema patches that cannot be done incrementally (or which we decide are simply too hard).

Subfeatures

Success

How will we know when we are done?

We can reliably deploy schema changes 24 hours after they land in devel, with < 5 minutes downtime.
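
The criterion above can be expressed as a simple check. This sketch assumes that "24 hours after they land" means the deploy happens within 24 hours of landing; the function name and the timestamps are illustrative values, not measured data.

```python
from datetime import datetime, timedelta

MAX_DEPLOY_LAG = timedelta(hours=24)   # from landing in devel to deploy
MAX_DOWNTIME = timedelta(minutes=5)    # downtime must be strictly below this


def meets_success_criterion(landed_at, deployed_at, downtime):
    """Return True if a schema deploy satisfies both limits."""
    lag = deployed_at - landed_at
    return lag <= MAX_DEPLOY_LAG and downtime < MAX_DOWNTIME


# Illustrative values only.
landed = datetime(2011, 7, 13, 9, 0)
deployed = datetime(2011, 7, 14, 8, 30)

fast_deploy_ok = meets_success_criterion(landed, deployed, timedelta(minutes=4))
slow_deploy_ok = meets_success_criterion(landed, deployed, timedelta(minutes=6))
print(fast_deploy_ok, slow_deploy_ok)
```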

How will we measure how well we have done?

The project lead has cycle time graphs which show long cycle times for DB-related projects. Those cycle times should come way down: the further they come down, the better this project has succeeded.

Thoughts?

Put everything else here. Better out than in.

LEP/FastDowntime (last edited 2011-12-22 18:22:22 by gary)