Database/LivePatching

Not logged in - Log In / Register

Revision 4 as of 2011-03-22 20:33:58

Clear message

Summary

We want to make database schema updates mid-development cycle. These updates will be limited in scope, as we need to keep system downtime to under 5 minutes. Any unsuitable updates need to be applied as part of the normal scheduled monthly upgrade which has a much more generous time budget.

These updates will be done independantly of code updates. This means that, before or along with landing the database patch, code may need to be landed to ensure everything runs both with and without the database patch applied.

Lightweight Patches

Some database patches we can apply live. These are applied directly to the databases, bypassing Slony.

Lightweight patches will land in the Launchpad tree. They will be given a non-zero patch number in the (major, minor, patch) revision number. Launchpad ignores these patches when detecting if the patch level of the database matches the patch level it expects, allowing Launchpad to run even though patches have been applied to the database that do not exist in that instances tree.

Lightweight patches are applied manually to all live Launchpad database nodes containing the main Launchpad schema (launchpad_prod_*, launchpad_standalone_*), as the patch being applied to the live database may need to be an altered version of the patch landed in the tree.

The (major, minor, patch) revision tuple needs to be manually inserted into the LaunchpadDatabaseRevision table on the master. Failing to do this will abort the next full upgrade when the system attempts to apply a patch that has already been applied (this situation is normally caught by the full staging update).

Index Creation

Indexes can be created without downtime by applying the database patch manually, altering all the CREATE INDEX commands to CREATE INDEX CONCURRENTLY.

Don't bother starting index creation when there are long running transactions in progress. The indexes will not complete until the long running transactions have completed.

CREATE INDEX CONCURRENTLY cannot run inside a transaction, so indexes should first be built on the master. If INDEXES cannot be built for some reason, such as trying to build a UNIQUE index on non-unique data, manual repair will be needed to back out the changes.

Once build on the master, the indexes can be built on the slaves simultaneously or serially.

Index creation should be done from a screen(1) session or similar to avoid network dropouts - for CREATE INDEX CONCURRENTLY, early disconnection will result in an invalid index rather than rolling back.

The database patch tuple needs to be inserted manually into the LaunchpadDatabaseRevision table on the master. This stops attempts to reapply the patch during the next full upgrade (which will cause it to fail).

The CREATE INDEX CONCURRENTLY statements need to be run on all Launchpad database nodes.

After creating the indexes, confirm that there are no invalid indexes on the affected tables on all nodes.

Table Creation

Run the CREATE TABLE statements on all nodes.

Add the new tables and associated sequences to the desired replication set (currently requires writing an adhoc Slonik script, but a helper to do this is possible).

On the master, populate the newly replicated table with any wanted data.

Views

Run the CREATE OR REPLACE VIEW statements on all nodes.

Stored Procedures / trusted.sql

trusted.sql can be run on all nodes inside a transaction at any time to create new or modify existing stored procedures. To minimize version skew between the master and slaves, run this on all nodes simultaneously, and COMMIT the transactions as close to simultaneously as you can (two phase commit would be overkill - version skew should be harmess as data modifying stored procedures are only ever run on the master).

Heavyweight Patches

Heavyweight patches need to be applied via Slony as they change the structure of the replicated database. This requires some system downtime to allow Slony to grab the locks it needs.

Patch Process Overview

1) Update wildcherry Launchpad source tree, and run 'make clean build'.

2) Wait for long running transactions to terminate, or kill them.

3) Shutdown or reconfigure clients to not use the database.

4) Put pgbouncer into 'pause' state.

5) Apply database patches. Note that we don't need to run fti.py - the full text indexes seem static now and if we do need to tweak them we can do it as part of the monthly update:

    cd database/schema
    ./upgrade.py
    # ./fti.py
    ./security.py --cluster

6) Unpause pgbouncer

7) Restore client systems to normal

TODO