Downtime

Revision 5 as of 2012-07-25 19:26:35

Overview

Launchpad tries not to have downtime at all, but some system changes require it, and sometimes we have unscheduled downtime.

Unscheduled partial downtime

Whenever a service (or even a page) isn't working, that part of the system is unavailable. In these situations the system administrators and developers treat the fault as critical, requiring immediate action.

Unscheduled complete downtime

Sometimes severe faults occur that prevent most or all of Launchpad from working. These faults are also treated as critical, with immediate action required; they are usually escalated to more people straight away so that as many resources as possible are working on them.

Scheduled partial downtime

Some services run in a configuration where changing over from one version of the code to another requires a short period of downtime (our PPA upload service is an example of this). Other services (though we are working to get rid of this category) run with only one instance of a particular resource, so if hardware maintenance or security fixes are required, we need to take the service offline for a short period. ppa.launchpad.net and bazaar.launchpad.net are examples of this configuration. We keep the downtime involved as short as possible and schedule it with 24 hours' notice (unless extenuating circumstances, like remote root exploits, apply). We don't try to do this sort of upgrade or maintenance at the same time as scheduled complete downtime, because that multiplies the risk of something going wrong and extends such downtime.

Scheduled complete downtime

Finally we have some crucial services which require the entire Launchpad system to be unavailable while they are altered and maintained. We take this seriously - interrupting all our users is a Big Deal, and we're working on reducing the number of things which require complete downtime.

Fastdowntime

We have a window scheduled at 08:00 UTC on weekdays in which we can do extremely fast updates to crucial services such as the primary database. Fast in this case means guaranteed less than 5 minutes, and usually less than one minute. We usually announce that we'll be using this window to make a change 24 hours in advance. The primary use for this 08:00 window is database schema changes, which require a short period with no activity in the database to complete reliably (the Launchpad database is quite busy, and attempting most forms of online schema change results in a deadlock). Part of our QA is ensuring that DB patches will apply rapidly (< 15 seconds on the master node for a single patch). The current total allotted time for a fastdowntime (FDT) is 5 minutes.
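The two limits above (under 15 seconds per patch, 5 minutes for the whole window) can be expressed as a simple pre-flight check. The following is only an illustrative sketch, not Launchpad's actual QA tooling: the function name `check_patches`, the patch names, and the idea of comparing the sum of patch timings against the 5-minute allotment are all assumptions made for the example.

```python
from datetime import timedelta

# Limits quoted in the text above.
PER_PATCH_LIMIT = timedelta(seconds=15)  # a single patch on the master node
WINDOW_BUDGET = timedelta(minutes=5)     # total allotted time for one FDT


def check_patches(timings):
    """Return a list of problems for a set of measured patch timings.

    `timings` maps a (hypothetical) patch name to its measured apply
    time in seconds. An empty list means every patch is under the
    per-patch limit and the sum is under the window budget.
    """
    problems = []
    total = timedelta()
    for name, seconds in timings.items():
        duration = timedelta(seconds=seconds)
        total += duration
        if duration >= PER_PATCH_LIMIT:
            problems.append(
                f"{name}: {seconds}s is not under the 15s per-patch limit"
            )
    # Assumption: the patches run one after another, so their sum must fit.
    if total >= WINDOW_BUDGET:
        problems.append(
            f"total {total.total_seconds()}s is not under the 5 minute budget"
        )
    return problems


if __name__ == "__main__":
    print(check_patches({"patch-a": 3.0, "patch-b": 12.5}))
    print(check_patches({"patch-slow": 20.0}))
```

In practice the window also has to cover stopping and restarting services, so a real check would leave headroom below the 5-minute figure.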

There are some days and weeks when we will not perform a fastdowntime: currently only around Ubuntu release time, or when operations staff are on leave.
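The scheduling rules for fastdowntime (08:00 UTC, weekdays only, usually 24 hours' notice, with some excluded days) can be sketched as a small date calculation. This is a hedged illustration rather than anything Launchpad runs: `next_fastdowntime` and the `blackout_dates` parameter are hypothetical names, and treating Ubuntu release time or staff leave as a plain set of excluded dates is an assumption.

```python
from datetime import date, datetime, time, timedelta, timezone

WINDOW_START = time(8, 0, tzinfo=timezone.utc)  # 08:00 UTC
NOTICE = timedelta(hours=24)                    # usual announcement lead time


def next_fastdowntime(now, blackout_dates=frozenset()):
    """Return the first usable fastdowntime slot after `now`.

    `now` must be a timezone-aware datetime. `blackout_dates` is a
    hypothetical set of `date` objects on which no fastdowntime is done.
    """
    earliest = now.astimezone(timezone.utc) + NOTICE
    day = earliest.date()
    while True:
        candidate = datetime.combine(day, WINDOW_START)
        is_weekday = candidate.weekday() < 5  # Monday=0 ... Friday=4
        if candidate >= earliest and is_weekday and day not in blackout_dates:
            return candidate
        day += timedelta(days=1)


if __name__ == "__main__":
    # Wednesday evening: Thursday 08:00 is less than 24 hours away,
    # so the first usable slot is Friday 08:00 UTC.
    now = datetime(2012, 7, 25, 19, 26, tzinfo=timezone.utc)
    print(next_fastdowntime(now))
    # With Friday excluded, the weekend is skipped and Monday is chosen.
    print(next_fastdowntime(now, {date(2012, 7, 27)}))
```

Keeping the exclusions as an explicit set keeps the rule easy to audit, at the cost of someone having to maintain that set by hand.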

Slow downtime

In some (rare) circumstances we may need a long window where a crucial service which the site cannot run without will be unavailable. We do these no more than once a month, and announce them at least 24 hours in advance.

As this is extremely disruptive, we have no regular schedule for slow downtime and only schedule it when needed.