Rationale
We currently tie together two unrelated things:
- Changes to the database
- Exposure of new features on 'launchpad.net'
This causes many problems for us, including:
- Unfinished features are deployed to the entire userbase.
- Finished features and small bugfixes/tweaks are held back from users for (on average) 2 weeks.
- We regularly encounter conflicts and test interactions between the 'released' and 'beta' code bases.
- We have a very complex deployment situation, with *at best* 2 versions of the code running at any one time, and regularly encounter friction between the two versions.
We want to decouple these two things, deliver features when they are ready (not earlier or later), and streamline and simplify our deployment of code, our changes to the database, and our general maintenance processes.
Stakeholders
Launchpad developers: Various mails have been sent to launchpad-dev. Changes to the development process affect how developers work, so the new process has to meet their needs.
Launchpad users: It is hard to have a discussion with all the users who are affected. Users generally want Launchpad to be fast and reliable, and if this LEP succeeds they will get both of those things more often.
Constraints
- Allow features that are not polished to only be shown to early adopters.
- Must be able to review the list of features that are exposed to less-than-every-user.
- Must not slow down the current development process.
- Deliver finished features to all users as promptly as possible.
Out-of-scope
- Optimising the flow of features that require database schema changes. That is up for review in a future effort. Note that smaller changes will be doable in stable with this workflow.
Nice-to-have
- It is obvious, when looking at edge, that a feature is new and warrants feedback. This won't be delivered as a core part of this work, though an idiom for doing it will be possible and fairly easy to apply on a case-by-case basis.
Implementation
The end state we want to arrive at is:
- New features are controlled by feature flags rather than by whether the code is deployed to edge or to production.
- All our non test/demo servers run one code base, and that code is fully QA'd.
- Rollouts are done on demand, are lightweight, can easily be rolled back, and are totally automated.
- Rollouts are done whenever we have something ready to roll out.
This may take some time to completely achieve, so we are staging the implementation.
In progress - Stage 0 - Stop using edge for 'unreleased features'
To disentangle the two concerns (deployment and releasing) we need features in the code base to be enabled at runtime rather than at deployment time. This will be accomplished by LEP/FeatureFlags. Nothing should depend on the 'is_edge' check. The feature flags facility is now available in db-devel, and from 10.09 onwards it should be used for all changes that the developer does not want given to users *immediately*.
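As a rough illustration of what "enabled at runtime" means, the sketch below resolves a flag against the scopes that apply to the current request. It is not the LEP/FeatureFlags API: `FlagRegistry`, its methods, the flag name `recipes.enabled` and the scope name `team:early-adopters` are all hypothetical, chosen only to show the shape of the idea. The `limited_exposure` helper reflects the constraint that we must be able to review which features are shown to less than every user.

```python
from dataclasses import dataclass, field


@dataclass
class FlagRule:
    """One rule: within `scope`, the flag named `flag` has `value`."""
    flag: str
    scope: str
    priority: int
    value: str


@dataclass
class FlagRegistry:
    """Hypothetical in-memory rule store (an assumption, not Launchpad code)."""
    rules: list = field(default_factory=list)

    def add_rule(self, flag, scope, priority, value):
        self.rules.append(FlagRule(flag, scope, priority, value))

    def get_flag(self, flag, active_scopes):
        """Return the value of the highest-priority matching rule, or None."""
        matching = [
            rule for rule in self.rules
            if rule.flag == flag and rule.scope in active_scopes
        ]
        if not matching:
            return None
        return max(matching, key=lambda rule: rule.priority).value

    def limited_exposure(self):
        """Flags that have rules but none in the 'default' (everyone) scope."""
        for_everyone = {r.flag for r in self.rules if r.scope == "default"}
        return sorted({r.flag for r in self.rules} - for_everyone)


registry = FlagRegistry()
registry.add_rule("recipes.enabled", "team:early-adopters", 10, "on")

# A request from a member of the early-adopter team sees the feature;
# a request with only the default scope does not.
early = registry.get_flag("recipes.enabled", {"default", "team:early-adopters"})
everyone = registry.get_flag("recipes.enabled", {"default"})
```

With something along these lines, the same code revision can run on every appserver, and releasing a feature becomes a change to the rules rather than a deployment.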
DONE - Stage 1 - Remove appserver rollout downtime
RT 40685: deploy icing to Apache before updating the appservers. This fixes the downtime experienced by some users during appserver-only rollouts.
DONE - Stage 2 - QA all code
This involves setting up a QA environment on the staging server that runs the 'stable' branch against the production database schema. Rather than deploying 'stable tip', we will deploy a nominated revision of stable: the highest revision for which it and every earlier commit have been QA'd. A QA failure requires the failing revision to be reverted, and any revisions landed after the failing one must also pass QA or be reverted.
See MergeWorkflow for the QA process details. This stage permits no-downtime DB patches to be applied within a release cycle, as long as they are blessed as such, and code that depends on them is landed after the patch has been applied.
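The rule for choosing the nominated revision can be sketched as a scan over the revisions landed on stable since the last deployment. This is a minimal illustration only: the function name and the status strings (`"ok"`, `"untested"`, `"failed"`) are assumptions made for the example, not the tags or tooling described in MergeWorkflow.

```python
def highest_deployable_revision(revisions):
    """Return the highest revno whose whole history has passed QA.

    `revisions` is a list of (revno, status) pairs in landing order,
    covering everything landed since the last deployed revision.
    Returns None when the very first revision has not passed QA.
    """
    deployable = None
    for revno, status in revisions:
        if status != "ok":
            # An untested or failed revision blocks everything after it,
            # even revisions that have themselves passed QA.
            break
        deployable = revno
    return deployable


landed = [(101, "ok"), (102, "ok"), (103, "untested"), (104, "ok")]
nominated = highest_deployable_revision(landed)  # 102
```

Revision 104 has passed QA here but is not deployable, because deploying it would also deploy the untested revision 103.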
In progress - Stage 3 - remove 'edge'
With all code QA'd, we will deploy to all appservers whenever we deploy, rather than only to edge. The edge appservers will be repurposed as production appservers, and the edge sites will be turned into redirects to production. We considered using the edge hostname to switch on many or all in-progress features, but decided against it: it would make testing interactions in the daily QA process significantly more complex, and we are aiming for something as simple as possible here.
These bugs affect zero-downtime deployments to appservers:
640065 (appserver requests are interrupted); RT 41503 will work around this.
We need to add a feature flag for recipes, and a team-based scope, before we can disable edge.
Finally we need to test the behaviour of redirects on old launchpadlib clients.
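To make the redirect behaviour concrete when testing clients, the mapping from an edge URL to its production equivalent can be sketched as below. This only illustrates the intended URL rewrite, assuming that edge hostnames differ from production ones by a single `edge` label (for example `api.edge.launchpad.net` versus `api.launchpad.net`). The function name is hypothetical, and the real redirects would live in the web server configuration rather than in Python.

```python
from urllib.parse import urlsplit, urlunsplit


def production_url(url):
    """Return the production equivalent of an edge URL, or None.

    None means the URL is not on an edge hostname, so no redirect applies.
    Any user:password part of the URL is dropped in this sketch.
    """
    parts = urlsplit(url)
    labels = (parts.hostname or "").split(".")
    if "edge" not in labels:
        return None
    host = ".".join(label for label in labels if label != "edge")
    netloc = host if parts.port is None else f"{host}:{parts.port}"
    # Path, query string and fragment are preserved unchanged.
    return urlunsplit(
        (parts.scheme, netloc, parts.path, parts.query, parts.fragment))


target = production_url("https://api.edge.launchpad.net/1.0/bugs/1?ws.op=x")
```

An old launchpadlib client pointed at the edge API root would need to follow such a redirect for every request, including non-GET requests, which is the behaviour that needs checking.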
Stage 4 - iterate on deployment friction
To reduce the complexity of our environment we want all the servers to run the same revision, but at the moment some areas are hard to deploy to or cause downtime when we do.
The following RT tickets will improve this:
- 40477 XMLRPC server rolling upgrades (obsolete)
- 40480 codehosting rolling upgrades (in progress)
- 40482 staging-with-production-schema environment (done)
- 41762 loggerhead rolling upgrades (open)
- 42472 redirect edge to prod (in progress)
- 41379 librarian HA upload port (in progress)
- 40490 deployment visibility (open)
Some bugs in the LP codebase will also help, but are less strictly needed:
http://pad.lv/607391 (done)
There may be other issues, but we will discover these if/when a deployment goes wrong, and feed them back into the process as high/critical bugs.
Success
When we can update the db schema without rolling out features under development, and the Launchpad developers haven't gone mad from crazy process changes.
Better quality features released to production.