Database/LivePatching

= Summary =

This wiki page documents the schema changes that can be done on Launchpad's PostgreSQL + Slony database.

We use Hot schema changes to deploy changes without causing downtime, and Cold schema changes to deploy changes with minimal (a few seconds per change) downtime.

In both cases, we deploy these changes without changing the running appserver code, so the changes need to be compatible with the existing live code. This means that, before landing the database patch, code may need to be landed to ensure everything runs both with and without the database patch applied.

All DB schema changes are landed on db-devel, and must land without appserver code changes (but test changes may be included, if trigger/db function changes are being landed). Schema patches need a non-zero patch level (e.g. a (major, minor, 1) schema version number). Launchpad ignores these patches when detecting whether the patch level of the database matches the patch level it expects, allowing Launchpad to run even though patches have been applied to the database that do not exist in that instance's tree.

= Hot patches =

Some database patches can be applied live. These are applied directly to the databases, bypassing Slony.

They are currently applied by crafting an appropriate wrapper script and then running it directly on all the nodes containing the LP schema (launchpad_prod_*, launchpad_standalone_*).

The (major, minor, patch) revision tuple needs to be manually inserted into the LaunchpadDatabaseRevision table on the master. Failing to do this will abort the next full upgrade, when the system attempts to apply a patch that has already been applied (this situation is normally caught by the full staging update). The insert must not be applied to each node individually - it has to replicate via Slony.
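As a sketch of the manual step (the column names here are assumed, not confirmed by this page - verify them against the live LaunchpadDatabaseRevision definition before running anything):

```sql
-- Record a hot patch, e.g. patch (2208, 99, 1), as applied.
-- Run on the master ONLY; Slony replicates the row to the slaves.
-- Column names are an assumption; check the real table first.
INSERT INTO LaunchpadDatabaseRevision (major, minor, patch)
VALUES (2208, 99, 1);
```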

We will be automating this in future, to handle at least the happy path and to avoid accidentally running a hot patch during a fastdowntime.

== Index Creation ==

Index creation for all but the smallest tables should be done in a hot patch.

The patch needs to use CREATE INDEX CONCURRENTLY, which performs three separate fast transactions to create, populate, and activate the index. There is some contention added, but it is minimal.
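For example, with a hypothetical index (the table and column names below are made up for illustration), the statement as run live would look like:

```sql
-- CREATE INDEX CONCURRENTLY avoids the long exclusive lock that a
-- plain CREATE INDEX would hold on the table for the whole build.
-- Table/column/index names here are hypothetical.
CREATE INDEX CONCURRENTLY bugtask__datecreated__idx
    ON BugTask (datecreated);
```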

Don't bother starting index creation when there are long running transactions in progress: the index will not complete until those transactions have completed, because the index has to wait for the existing read transactions on the table to finish before it can be activated.

CREATE INDEX CONCURRENTLY cannot run inside a transaction, so indexes should first be built on the master. If the index cannot be built for some reason, such as trying to build a UNIQUE index on non-unique data, manual repair will be needed to back out the changes. /!\ Note that deleting an index requires a fastdowntime deploy.

Once built on the master, the indexes can be built on the slaves simultaneously or serially.

Index creation should be done from a screen(1) session or similar to avoid network dropouts - with CREATE INDEX CONCURRENTLY, an early disconnection leaves an invalid index behind rather than rolling back.

After creating the indexes, confirm that there are no invalid indexes on the affected tables on all nodes.
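One way to check for invalid indexes is to query the system catalogs on each node - CREATE INDEX CONCURRENTLY marks an index it could not finish as invalid in pg_index:

```sql
-- List invalid indexes (e.g. left behind by an interrupted
-- CREATE INDEX CONCURRENTLY). Run on every node; expect no rows.
SELECT c.relname AS index_name
FROM pg_index i
JOIN pg_class c ON c.oid = i.indexrelid
WHERE NOT i.indisvalid;
```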

== Views ==

Run the CREATE OR REPLACE VIEW statements on all nodes.
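A minimal sketch, with a hypothetical view name; note that CREATE OR REPLACE VIEW can only append new columns at the end of the column list, not drop or retype existing ones:

```sql
-- Hypothetical view for illustration. Run identically on all nodes.
CREATE OR REPLACE VIEW ValidPerson AS
    SELECT id, name
    FROM Person
    WHERE merged IS NULL;
```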

== Stored Procedures / trusted.sql ==

Stored procedures can be changed (as long as the signature is still consistent with any triggers).

Table trigger definitions must be done as cold patches.

trusted.sql can be run on all nodes inside a transaction at any time to create new or modify existing stored procedures. To minimize version skew between the master and slaves, run it on all nodes simultaneously, each inside a transaction, and COMMIT the transactions as close to simultaneously as you can (two-phase commit would be overkill - version skew should be harmless, as data-modifying stored procedures are only ever run on the master).
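The per-node transaction has this general shape (the function here is hypothetical, standing in for whatever trusted.sql defines):

```sql
-- Run the same block on every node, then issue the COMMITs
-- as close to simultaneously as possible.
BEGIN;

-- Hypothetical stored procedure; the real definitions come
-- from trusted.sql.
CREATE OR REPLACE FUNCTION is_printable_ascii(text)
RETURNS boolean
LANGUAGE sql IMMUTABLE AS $$
    SELECT $1 ~ '^[ -~]*$';
$$;

COMMIT;
```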

= Cold Patches =

Cold patches need to be applied with no other system activity, to prevent locks from holding the patch up. We have automation around this to apply them as rapidly as possible.

== Patch Process Overview ==

 1. Update the wildcherry Launchpad source tree to the revision to deploy. Make sure that the revision will include only one new patch (we do one patch at a time to reduce the debugging burden when something goes wrong post-apply), and run 'make clean build'.
 1. /!\ Run full-update.py --dry-run and check that only the expected db patch will be applied.
 1. Run full-update.py.

== Table Creation ==

This can sometimes be done hot, but not reliably, because activating FK relations needs exclusive table locks. Do it as a cold patch.

/!\ Don't attempt to do this without the DBA available, as there still seem to be edge cases where we cannot do this live (such as a table with foreign key constraints to a busy table, requiring locks on the busy table that will not be granted quickly enough).

Run the CREATE TABLE statements on all nodes.

Add the new tables and associated sequences to the desired replication set (currently this requires writing an ad hoc Slonik script, but a helper to do this is possible).

On the master, populate the newly replicated table with any wanted data.
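The creation and population steps above look roughly like this (table, column, and source names are hypothetical; the real patch determines them):

```sql
-- Step 1: on EVERY node, create the empty table.
-- Hypothetical definition for illustration only.
CREATE TABLE BugSummaryJournal (
    id serial PRIMARY KEY,
    bug integer NOT NULL,
    count integer NOT NULL DEFAULT 0
);

-- Step 2: only after the table (and its sequence) are in the
-- replication set, populate it on the MASTER only; Slony
-- replicates the rows out.
INSERT INTO BugSummaryJournal (bug, count)
SELECT id, 0 FROM Bug;
```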

== Table Renaming ==

Because code is being deployed separately from database updates, we need to maintain a fully updateable working alias for the old table name to the new table name. See Database/TableRenamePatch for an example database patch.
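The general shape of such a patch is a view under the old name plus rewrite rules, so old code can still read and write through it. All names and columns below are hypothetical - see Database/TableRenamePatch for the real example:

```sql
-- Hypothetical rename for illustration.
ALTER TABLE OldName RENAME TO NewName;

-- Old code still sees a fully updateable relation at the old name.
CREATE VIEW OldName AS SELECT * FROM NewName;

CREATE RULE oldname_insert AS ON INSERT TO OldName
    DO INSTEAD INSERT INTO NewName VALUES (NEW.*);
CREATE RULE oldname_update AS ON UPDATE TO OldName
    DO INSTEAD UPDATE NewName
        SET name = NEW.name      -- one SET per real column
        WHERE id = OLD.id;
CREATE RULE oldname_delete AS ON DELETE TO OldName
    DO INSTEAD DELETE FROM NewName WHERE id = OLD.id;
```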

= TODO =

 * To easily control connections, we need connections to go via pgbouncer instead of directly to the database. Setting this up is currently underway.
 * Work out how we test that things work both with and without the database patch applied.
 * There is logic in upgrade.py that detects unreplicated tables and adds them to the relevant replication sets. This should be split out into a helper that can be applied to the live system to make new table creation easier.

Database/LivePatching (last edited 2013-02-04 00:46:17 by wgrant)