Diff for "Database/LivePatching"

Not logged in - Log In / Register

Differences between revisions 1 and 15 (spanning 14 versions)
Revision 1 as of 2011-03-09 09:03:02
Size: 5237
Editor: stub
Comment:
Revision 15 as of 2013-02-04 00:46:17
Size: 5135
Editor: wgrant
Comment: Update for the post-slony era
Deletions are marked like this. Additions are marked like this.
Line 1: Line 1:
= Live Database Updates = <<TableOfContents(2)>>
Line 3: Line 3:
== Summary == = Summary =
Line 5: Line 5:
We want to make database schema updates mid-development cycle. These updates will be limited in scope, as we need to keep system downtime to under 5 minutes. Any unsuitable updates need to be applied as part of the normal scheduled monthly upgrade which has a much more generous time budget. This wiki page documents the schema changes that can be deployed on Launchpad's replicated PostgreSQL database.
Line 7: Line 7:
These updates will be done independantly of code updates. This means that, before or along with landing the database patch, code may need to be landed to ensure everything runs both with and without the database patch applied. We use Hot schema changes to deploy changes without causing downtime.
Line 9: Line 9:
== Lightweight Patches == And we use Cold schema changes to deploy changes with minimal (a few seconds per change) downtime.
Line 11: Line 11:
Some database patches we can apply live. These are applied directly to
the databases, bypassing Slony.
In both cases, we deploy these changes without changing the running appserver code, so the changes need to be compatible with the existing live code. This means that, before landing the database patch, code may need to be landed to ensure everything runs both with and without the database patch applied.
Line 14: Line 13:
Lightweight patches will land in the Launchpad tree. They will be
given a non-zero patch number in the (major, minor, patch) revision
number. Launchpad ignores these patches when detecting if the patch level
of the database matches the patch level it expects, allowing Launchpad
to run even though patches have been applied to the database that do
not exist in that instances tree.
Cold schema changes are landed on db-devel, and hot ones on devel. All schema changes must land without appserver code changes (but test changes may be included, if trigger/db function changes are being landed).
Line 21: Line 15:
Lightweight patches are applied manually to all live Launchpad
database nodes containing the main Launchpad schema (launchpad_prod_*,
launchpad_standalone_*), as the patch being applied to the live database
may need to be an altered version of the patch landed in the tree.
/!\ These instructions only apply to tables under native PostgreSQL streaming replication. `lp_*` tables are replicated to SSO via Slony, so require extra special care.


= Hot patches =

Some database patches we can apply live. They are currently applied by crafting an appropriate wrapper script and then running it directly on the master database node, from where it will replicate to the slaves.
Line 28: Line 24:
=== Index Creation === We will be automating this in future, to handle at least the happy-path and avoiding accidentally running a hot patch during a fastdowntime.
Line 30: Line 26:
Indexes can be created without downtime by applying the database
patch manually, altering all the CREATE INDEX commands to CREATE INDEX
CONCURRENTLY.
== Index Creation ==
Line 34: Line 28:
Don't bother starting index creation when there are long running transactions in progress. The indexes will not complete until the long running transactions have completed. Index creation for all but the smallest tables should be done in a hot patch.
Line 36: Line 30:
CREATE INDEX CONCURRENTLY cannot run inside a transaction, so indexes should first be built on the master. If INDEXES cannot be built for some reason, such as trying to build a UNIQUE index on non-unique data, manual repair will be needed to back out the changes. The patch needs to use CREATE INDEX CONCURRENTLY which will do 3 separate fast transactions to make, populate, and activate the index. There is some contention added but it is minimal.
Line 38: Line 32:
Once build on the master, the indexes can be built on the slaves simultaneously or serially. Don't bother starting index creation when there are long running transactions in progress. The indexes will not complete until the long running transactions have completed. (Because the index has to wait for the existing read transactions on the table to complete before it can be activated.

CREATE INDEX CONCURRENTLY cannot run inside a transaction, so if the index cannot be built for some reason, such as trying to build a UNIQUE index on non-unique data, manual repair will be needed to back out the changes. /!\ Note that deleting an index requires a fastdowntime deploy.

Once built on the master, the index will automatically replicate to the slaves.
Line 44: Line 42:
The database patch tuple needs to be inserted manually into the
LaunchpadDatabaseRevision table on the master. This stops attempts to
reapply the patch during the next full upgrade (which will cause it
to fail).
After creating the indexes, confirm that there are no invalid indexes on the affected tables.
Line 49: Line 44:
The CREATE INDEX CONCURRENTLY statements need to be run on all Launchpad
database nodes.
== Views ==
Line 52: Line 46:
After creating the indexes, confirm that there are no invalid indexes on the affected tables on all nodes. Run the CREATE OR REPLACE VIEW statements on the master.

== Stored Procedures ==

Stored procedures can be changed (as long as the signature is still consistent with any triggers).

Table trigger definitions must be done as cold patches.
Line 55: Line 55:
=== Table Creation === = Cold Patches =
Line 57: Line 57:
Run the CREATE TABLE statements on all nodes.

Add the new tables and associated sequences to the desired replication set (currently requires writing an adhoc Slonik script, but a helper to do this is possible).

On the master, populate the newly replicated table with any wanted data.

=== Views ===

Run the CREATE OR REPLACE VIEW statements on all nodes.

=== Stored Procedures / trusted.sql ===

trusted.sql can be run on all nodes inside a transaction at any time to create new or modify existing stored procedures. To minimize version skew between the master and slaves, run this on all nodes simultaneously, and COMMIT the transactions as close to simultaneously as you can (two phase commit would be overkill - version skew should be harmess as data modifying stored procedures are only ever run on the master).
Cold patches need to be applied with no other system activity to prevent locks holding the patch up. We have automation around this to do it as rapidly as possible.
Line 72: Line 60:
== Heavyweight Patches == == Patch Process Overview ==
Line 74: Line 62:
Heavyweight patches need to be applied via Slony as they change the structure of the replicated database. This requires some system downtime to allow Slony to grab the locks it needs.  1. Update wildcherry Launchpad source tree to the revision to run: make sure that the revision will only include one new patch (we do one patch at a time to reduce the debugging burden when something goes wrong post-apply), and run 'make clean build'.
Line 76: Line 64:
=== Patch Process Overview ===  1. /!\ Run full-update.py --dry-run and check that only the expected db patch will be applied.
Line 78: Line 66:
1) Update wildcherry Launchpad source tree, and run 'make clean build'.

2) Wait for long running transactions to terminate, or kill them.

3) Shutdown or reconfigure clients to not use the database.

4) Put pgbouncer into 'pause' state.

5) Apply database patches. Note that we don't need to run fti.py - we will not be :

    cd database/schema
    ./upgrade.py
    # ./fti.py
    ./security.py --cluster

6) Unpause pgbouncer

7) Restore client systems to normal
 1. Run full-update.py
Line 98: Line 69:
== TODO == == Table Creation ==
Line 100: Line 71:
 * To easily control connections, we need connections to go via. pgbouncer instead of directly to the database. Setting this up is currently underway. Confirm pgbouncers PAUSE mode works as we need it to. This can sometimes be done hot, but not reliably due to FK relations needing exclusive table locks to activate. Do it as a cold patch.

/!\ Don't attempt to do this without the DBA available, as there still seem to be edge cases where we cannot do this live (such as a table with foreign key constraints to a busy table, requiring locks on the busy table that will not be granted quickly enough).


== Table Renaming ==

Because code is being deployed separately from database updates, we need to maintain a fully updateable working alias for the old table name to the new table name. See [[Database/TableRenamePatch|Database/TableRenamePatch]] for an example database patch.

== Adding columns ==
If the table is large enough that applying the patch raises performance concerns, split the work up into the following steps. All steps, aside from the first, are optional depending on the desired result.
 1. Create the column without indices and without setting not-null or a default value.
 1. Set the default
 1. Add the index via a hot patch
 1. Run a garbo job (or similar) to fill in the values
 1. Set the column not-null.



= TODO =
Line 103: Line 93:

 * There is logic in upgrade.py that detects unreplicated tables and adds them to the relevant replication sets. This should be split out into a helper that can be applied to the live system to make new table creation easier.

 * Try to tune Slony to propagate events faster.

Summary

This wiki page documents the schema changes that can be deployed on Launchpad's replicated PostgreSQL database.

We use Hot schema changes to deploy changes without causing downtime.

And we use Cold schema changes to deploy changes with minimal (a few seconds per change) downtime.

In both cases, we deploy these changes without changing the running appserver code, so the changes need to be compatible with the existing live code. This means that, before landing the database patch, code may need to be landed to ensure everything runs both with and without the database patch applied.

Cold schema changes are landed on db-devel, and hot ones on devel. All schema changes must land without appserver code changes (but test changes may be included, if trigger/db function changes are being landed).

/!\ These instructions only apply to tables under native PostgreSQL streaming replication. lp_* tables are replicated to SSO via Slony, so require extra special care.

Hot patches

Some database patches we can apply live. They are currently applied by crafting an appropriate wrapper script and then running it directly on the master database node, from where it will replicate to the slaves.

The (major, minor, patch) revision tuple needs to be manually inserted into the LaunchpadDatabaseRevision table on the master. Failing to do this will abort the next full upgrade when the system attempts to apply a patch that has already been applied (this situation is normally caught by the full staging update).

We will be automating this in future, to handle at least the happy-path and avoiding accidentally running a hot patch during a fastdowntime.

Index Creation

Index creation for all but the smallest tables should be done in a hot patch.

The patch needs to use CREATE INDEX CONCURRENTLY which will do 3 separate fast transactions to make, populate, and activate the index. There is some contention added but it is minimal.

Don't bother starting index creation when there are long running transactions in progress. The indexes will not complete until the long running transactions have completed. (Because the index has to wait for the existing read transactions on the table to complete before it can be activated.

CREATE INDEX CONCURRENTLY cannot run inside a transaction, so if the index cannot be built for some reason, such as trying to build a UNIQUE index on non-unique data, manual repair will be needed to back out the changes. /!\ Note that deleting an index requires a fastdowntime deploy.

Once built on the master, the index will automatically replicate to the slaves.

Index creation should be done from a screen(1) session or similar to avoid network dropouts - for CREATE INDEX CONCURRENTLY, early disconnection will result in an invalid index rather than rolling back.

After creating the indexes, confirm that there are no invalid indexes on the affected tables.

Views

Run the CREATE OR REPLACE VIEW statements on the master.

Stored Procedures

Stored procedures can be changed (as long as the signature is still consistent with any triggers).

Table trigger definitions must be done as cold patches.

Cold Patches

Cold patches need to be applied with no other system activity to prevent locks holding the patch up. We have automation around this to do it as rapidly as possible.

Patch Process Overview

  1. Update wildcherry Launchpad source tree to the revision to run: make sure that the revision will only include one new patch (we do one patch at a time to reduce the debugging burden when something goes wrong post-apply), and run 'make clean build'.
  2. /!\ Run full-update.py --dry-run and check that only the expected db patch will be applied.

  3. Run full-update.py

Table Creation

This can sometimes be done hot, but not reliably due to FK relations needing exclusive table locks to activate. Do it as a cold patch.

/!\ Don't attempt to do this without the DBA available, as there still seem to be edge cases where we cannot do this live (such as a table with foreign key constraints to a busy table, requiring locks on the busy table that will not be granted quickly enough).

Table Renaming

Because code is being deployed separately from database updates, we need to maintain a fully updateable working alias for the old table name to the new table name. See Database/TableRenamePatch for an example database patch.

Adding columns

If the table is large enough that applying the patch raises performance concerns, split the work up into the following steps. All steps, aside from the first, are optional depending on the desired result.

  1. Create the column without indices and without setting not-null or a default value.
  2. Set the default
  3. Add the index via a hot patch
  4. Run a garbo job (or similar) to fill in the values
  5. Set the column not-null.

TODO

  • Work out how we test things work both with and without the database patch applied.

Database/LivePatching (last edited 2013-02-04 00:46:17 by wgrant)