Downtime

Overview

Launchpad tries not to have downtime at all, but some system changes require it, and sometimes we have unscheduled downtime.

Unscheduled partial downtime

Whenever a service (or even a page) isn't working, that part of the system is unavailable. In these situations the system administrators and developers treat the fault as critical, requiring immediate action.

Unscheduled complete downtime

Sometimes severe faults occur that prevent most or all of Launchpad from working. These faults are also treated as critical, with immediate action required; they are usually escalated to more people straight away so that as many resources as possible are working on them.

Scheduled partial downtime

Some services run in a configuration where changing over from one version of the code to another requires a short period of downtime (our PPA upload service is an example of this). Other services (though we are working to get rid of this category) run with only one instance of a particular resource, so if hardware maintenance or security fixes are required, we need to take the service offline for a short period. ppa.launchpad.net and bazaar.launchpad.net are examples of this configuration. We keep the downtime involved as short as possible and schedule it with 24 hours' notice (unless extenuating circumstances, like remote root exploits, apply). We don't try to do this sort of upgrade or maintenance at the same time as scheduled complete downtime, because that multiplies the risk of something going wrong and extends such downtime.

Scheduled complete downtime

Finally, we have some crucial services that require the entire Launchpad system to be unavailable while they are altered and maintained. We take this seriously - interrupting all our users is a Big Deal, and we're working on reducing the number of things that require complete downtime.

Fastdowntime

Updates that incur less than 30 seconds of downtime can be done unscheduled without announcement. The primary example is database schema changes, which require a short period with no activity in the database to complete reliably (the Launchpad database is quite busy and attempting most forms of online schema change results in a deadlock). Part of our QA is ensuring that DB patches will apply rapidly (<15 seconds on the master node for a single patch).
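
As an illustration only, the two limits above can be written down as a small check. This is a sketch, not Launchpad's actual QA tooling: the function names and the timing dictionary are hypothetical, and treating the sum of patch times as the total downtime is a simplifying assumption (a real outage also includes time to quiesce and resume the database).

```python
# Limits quoted in the text above.
MAX_FASTDOWNTIME_SECONDS = 30  # total downtime must stay under this
MAX_PATCH_SECONDS = 15         # each patch must apply in less than this


def slow_patches(patch_timings):
    """Return the names of patches that reach or exceed the per-patch limit.

    patch_timings is a hypothetical mapping of patch name to the number of
    seconds the patch took to apply on a staging copy of the master node.
    """
    return [
        name
        for name, seconds in patch_timings.items()
        if seconds >= MAX_PATCH_SECONDS
    ]


def fits_fastdowntime(patch_timings):
    """Return True if every patch is fast enough and the total is under budget.

    Assumption: total downtime is approximated by the sum of patch times.
    """
    total = sum(patch_timings.values())
    return not slow_patches(patch_timings) and total < MAX_FASTDOWNTIME_SECONDS
```

For example, two patches measured at 3 and 10 seconds would pass this check, while three patches of 14 seconds each would not, because together they exceed the 30 second budget even though each one is individually fast enough.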

Slow downtime

In some (rare) circumstances we may need a long window where a crucial service which the site cannot run without will be unavailable. We do these no more than once a month, and announce them at least 24 hours in advance. As this is extremely disruptive, there is no regular schedule.
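
The two constraints above (at least 24 hours of notice, and no more than one slow downtime a month) can be sketched as a simple scheduling check. This is only an illustration: the function name and its arguments are hypothetical, and reading "once a month" as "once per calendar month" is an assumption rather than something the policy spells out.

```python
from datetime import datetime, timedelta

# Minimum notice quoted in the text above.
MIN_NOTICE = timedelta(hours=24)


def can_schedule_slow_downtime(announced_at, starts_at, previous_starts):
    """Return True if a proposed slow downtime satisfies both constraints.

    announced_at and starts_at are datetimes for the announcement and the
    start of the window; previous_starts lists the start times of earlier
    slow downtimes. Assumption: "once a month" means once per calendar month.
    """
    enough_notice = starts_at - announced_at >= MIN_NOTICE
    already_this_month = any(
        p.year == starts_at.year and p.month == starts_at.month
        for p in previous_starts
    )
    return enough_notice and not already_this_month
```

A window announced on 1 January at noon and starting on 3 January at noon would pass, provided no other slow downtime had already started in January.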

Downtime (last edited 2016-01-06 12:53:58 by wgrant)