Downtime


Revision 1 as of 2011-03-30 23:57:59


Overview

Launchpad tries not to have downtime at all, but some system changes require it, and sometimes we have unscheduled downtime.

Unscheduled partial downtime

Whenever a service (or even a page) isn't working, that part of the system is unavailable. In these situations the system administrators and developers treat the fault as critical, requiring immediate action.

Unscheduled complete downtime

Sometimes severe faults occur that prevent most or all of Launchpad from working. These faults are also treated as critical, with immediate action required; they are usually escalated to more people straight away so that as many resources as possible are working on them.

Scheduled partial downtime

Some services run in a configuration where changing over from one version of the code to another requires a short period of downtime (our database server is an example of this). Other services run with only one instance of a particular resource, so if hardware maintenance or security fixes are required we need to take the service offline for a short period; we are working to get rid of this category. ppa.launchpad.net and bazaar.launchpad.net are examples of this configuration. We keep the downtime involved as short as possible and schedule it with 24 hours' notice (unless extenuating circumstances, like remote root exploits, apply). We don't try to do this sort of upgrade or maintenance at the same time as scheduled complete downtime, because that multiplies the risk of something going wrong and extends such downtime.
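The notice rule above can be sketched in a few lines of Python. This is only an illustration, not Launchpad's actual tooling: the function name, its parameters, and the `emergency` flag are hypothetical, and only the 24-hour figure and the exception for extenuating circumstances come from the text.

```python
from datetime import datetime, timedelta

# The 24-hour minimum is taken from the policy text; everything else
# in this sketch (names, parameters) is a hypothetical illustration.
MINIMUM_NOTICE = timedelta(hours=24)


def notice_is_sufficient(announced_at, starts_at, emergency=False):
    """Return True if a scheduled partial downtime may go ahead.

    announced_at: when the downtime was announced to users.
    starts_at: when the downtime is due to begin.
    emergency: True for extenuating circumstances (for example a
        remote root exploit), which waive the notice period.
    """
    if emergency:
        return True
    return starts_at - announced_at >= MINIMUM_NOTICE
```

For example, a window announced at noon on one day and starting at noon the next day passes the check, while one starting a minute earlier does not unless `emergency=True` is passed.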

Scheduled complete downtime

Finally, we have some crucial services which require the entire Launchpad system to be unavailable while they are altered and maintained. We currently schedule these events once a month, as they are slow and complex operations and affect every user. We are working to make these downtimes shorter and faster. The primary reason for scheduling complete downtime is to perform database schema changes: the Launchpad database is quite busy, and attempting most forms of online schema change results in a deadlock. We have caused unscheduled complete downtime attempting online schema changes in the past, and now require that all database activity is halted when we make schema changes. Our monthly schedule is available here.