'''See TarmacMergeMachinery for the Operator's Instructions and Developer Notes.'''
'''See SimplifedMergePlanning for a roadmap of this project's implementation.'''
= Simplify Merge and Batch Testing Machinery =
We currently speed up our landing process by batching several merges into a single test. The [[Trunk | mechanism we have to do this]] more than doubled the complexity of the previous, slower approach by even a naive comparison: for a single landing, we went from a single branch (devel) and a single merge process (PQM) to two branches (devel and stable) and PQM plus buildbot plus a custom poller that landed branches. The mechanism also introduced a new source of slowdowns for developers: a stop-the-line "testfix" condition.
(This is separate from the division of stable and db-stable, which was done for a separate reason. That is not changed in this proposal.)
Despite all the complexity, analysis showed the new process to be a win over our previous one-branch-at-a-time process (analysis presented by flacoste at All Hands 2009 [[https://allhands09.canonical.com/combined/2009-05-20/EngineeringTools/4/|Company private link, sorry.]]).
This proposal outlines an approach that gets us similar benefits for significantly less complexity and fewer opportunities for slowdowns.
There are other proposals for speeding up our test process: speed up the suite itself, and run the tests in parallel. They can work in conjunction with this effort, but I believe that the effort I describe is much simpler and much more likely to be completed soon.
'''As a ''' developer<<BR>>
'''I want ''' smoother branch landing machinery<<BR>>
'''so that ''' I can land my code faster and with fewer spurious errors<<BR>>
'''On Launchpad:''' https://bugs.launchpad.net/launchpad-project?field.tag=merge-machinery-story
== Rationale ==
We're doing this now because the current merge process is a constant drain on the time of the LOSAs, the build engineers, and the developers; and the proposed solution should alleviate a significant part of the pain quickly.
== Stakeholders ==
LOSAs, QA and build engineers, Launchpad developers.
== Constraints ==
 * It should be possible to run the tests for the db branch and the regular branch in parallel.
* When the regular branch gets merges, something should automatically try to merge it into the db branch. If the merge fails, it should generate a sufficient notice to developers, and should be resolvable with a manual merge to the db branch. This is still a "stop-the-line" situation.
* Problems should be communicated clearly to the people responsible for resolving them (at least as clearly as they are now).
== Unnecessary Desires ==
* We switch to Tarmac. That's still desired, but not necessary for this particular proposal. If Robert has the time for making this work soon in PQM, we can treat it as an experiment on the process before we switch to Tarmac.
== Success ==
=== The Pilot Program ===
Our first point for evaluating success will be after the completion of a volunteer pilot program. The pilot volunteers will use our new system to land their code on the `stable` branch.
To evaluate the pilot's success, we will measure the time from code review approval until the time that the branch has landed in `stable`. The average landing time will be measured for a period six months before the pilot program, in the month immediately preceding the program, and during the program. The average time to land a branch on `stable` should be shorter for the pilot volunteers.
We expect a shorter landing time because, with testfix mode eliminated, fewer people across the project should have their branches rejected. Less time spent resubmitting rejected branches should lower the average landing time.
We have published the branch landing statistics for the month of October 2010 [[https://spreadsheets.google.com/a/canonical.com/ccc?key=0Aq6EjGubW4qydF90TElqaV9CRFR0QndmX0twV3ZDN3c&hl=en|here]] ,,[company internal link],,
=== Final Goals ===
If the pilot is successful, we will want to achieve the following goals in the remainder of the program:
 * Buildbot and the poller script have been eliminated, and the number of "trunks" has been halved (leaving effectively only stable, db-stable, and production-stable, whatever they are called).
* We no longer have a testfix mode.
* We can land branches to stable faster on average than we do now (see "Thoughts" section below).
== Subfeatures ==
(None)
== Implementation Proposal ==
For this description, I will use "branch lander" as a name for a component that is Tarmac or PQM.
=== Description ===
 * We run the branch lander on four separate machines (virtual or otherwise). These can be the machines that LOSAs have obtained for dedicated buildbot slaves, since they will be used for the same basic mechanism. One branch lander is for the db Launchpad branch, one is for the regular Launchpad branch, one is for the Launchpad production branch, and one is for everything else managed by the branch lander mechanism. (PQM or Tarmac might use another mechanism, but the basic constraint is that the landers for the three Launchpad branches shouldn't block one another.)
The remainder of the description describes the branch lander behavior for the three Launchpad branches only. The branch lander for other branches should be unchanged.
 * The branch lander has at least one configurable knob: how many branches to merge and test at once. It might have another configurable knob: how long to wait after the first merge request before starting a test run. As of this writing, the existing machinery has the following values, where "treeStableCount" is the number of branches to merge and test at once, and "treeStableTimer" is the number of seconds to wait after the first merge request before starting a test run: treeStableCount=3, treeStableTimer=12*60.
* The branch lander locally merges N branches into a copy of the destination branch and tests them, where N is determined by the circumstances and the configuration described in the previous point.
* If the tests for the N branches pass, they are merged to the destination branch individually. Moreover, on the regular (non-db) branch lander, once the branches are merged into the target branch, the branch lander makes a request to the db branch lander that the regular Launchpad branch be merged into the db branch. (This mimics the current behavior; the basic constraint is that changes in the regular branch should be automatically propagated to the db branch, and if the automatic merge fails, it should fail loudly and require a manual merge before things can proceed.)
* If the tests fail, all N branches are rejected, the submitters are alerted, and the branch lander goes back to looking in its queue for more branches to merge and test. (Notice that there is no testfix mode for this failure because no failing branches are ever merged.)
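The behavior described in the bullets above can be sketched roughly as follows. This is only an illustration, not PQM or Tarmac code: `LanderConfig`, `MergeRequest`, `take_batch`, `run_cycle`, and the `run_tests`, `land`, `notify`, and `on_landed` callbacks are all hypothetical names. It also assumes that a test run starts as soon as either treeStableCount requests are queued or treeStableTimer seconds have passed since the oldest request, which is one possible reading of "determined by the circumstances and the configuration".

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class LanderConfig:
    # Values quoted above for the existing machinery.
    tree_stable_count: int = 3        # branches to merge and test at once
    tree_stable_timer: int = 12 * 60  # seconds to wait after the first request


@dataclass
class MergeRequest:
    branch: str
    submitter: str
    requested_at: float  # seconds, on the same clock as `now`


def take_batch(queue: List[MergeRequest], config: LanderConfig,
               now: float) -> List[MergeRequest]:
    """Remove and return the next batch, or [] if it is too early to start."""
    if not queue:
        return []
    full = len(queue) >= config.tree_stable_count
    waited = now - queue[0].requested_at >= config.tree_stable_timer
    if not (full or waited):
        return []
    batch = queue[:config.tree_stable_count]
    del queue[:config.tree_stable_count]
    return batch


def run_cycle(queue: List[MergeRequest], config: LanderConfig, now: float,
              run_tests: Callable[[List[str]], bool],
              land: Callable[[str], None],
              notify: Callable[[str, str], None],
              on_landed: Optional[Callable[[], None]] = None) -> str:
    """Run one merge-and-test cycle; return 'idle', 'landed' or 'rejected'."""
    batch = take_batch(queue, config, now)
    if not batch:
        return "idle"
    # The N branches are merged into a local copy and tested together.
    if run_tests([request.branch for request in batch]):
        for request in batch:
            land(request.branch)  # each branch is merged individually
        if on_landed is not None:
            # Regular lander only: ask the db lander to merge the regular branch.
            on_landed()
        return "landed"
    # Nothing was merged, so there is no testfix mode: reject and alert.
    for request in batch:
        notify(request.submitter, request.branch)
    return "rejected"
```

Passing the test runner and the landing step in as callbacks keeps the sketch independent of whether Tarmac or PQM does the real work. The automatic merge into the db branch would be wired in through `on_landed` on the regular branch lander only.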
=== Risks ===
Does anyone actually value the fact that the devel branch has changes committed to it very quickly? This approach eliminates that, along with the specifics of the "five minute PQM" (which is really about a 13 minute PQM). Five minute PQM was arguably a shell game anyway: the real interest is how quickly we get to stable, and this approach can be faster and more stable than the current mechanism, while retaining its true value (batching test runs).
== Thoughts ==
This reduces complexity, reduces fragility, and eliminates a source of stop-the-line problems, with little or no downside. I think it is an obvious win, especially if it can be done relatively cheaply.
How might we measure whether we land branches faster? The metric I had in mind was comparing three numbers:
1. using the current landing mechanism, the average time between [someone sends a branch to ec2 land] and [branch lands on stable];
2. using the current landing mechanism, the average time between [someone sends a branch directly to pqm] and [branch lands on stable]; and
3. using the proposed mechanism, the same as #2 (the average time between [someone sends a branch directly to pqm] and [branch lands on stable]).
I propose that comparing #1 and #3 is reasonable because we are always supposed to use ec2 test: if we break things, we break them for the entire team. That comparison will clearly give a win to #3. Comparing #2 and #3 will, I think, still clearly give a win to #3, because
* testfix is no longer an issue;
* ec2 and buildbot fragility is no longer an issue; and
* we don't have to compile the code twice, once in pqm and once in buildbot.
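As a rough illustration of the comparison above, the three averages could be computed from pairs of (submitted, landed) timestamps. This is a minimal sketch: the function names and the shape of the input data are assumptions, and it says nothing about how the timestamps would actually be collected from ec2 land, PQM, or the proposed branch lander.

```python
from datetime import datetime, timedelta
from statistics import mean
from typing import Dict, Iterable, Tuple

Record = Tuple[datetime, datetime]  # (submitted, landed on stable)


def average_landing_time(records: Iterable[Record]) -> timedelta:
    """Mean time between submission and landing; needs at least one record."""
    seconds = [(landed - submitted).total_seconds()
               for submitted, landed in records]
    return timedelta(seconds=mean(seconds))


def compare_mechanisms(groups: Dict[str, Iterable[Record]]) -> Dict[str, timedelta]:
    """Map each mechanism label (e.g. '#1', '#2', '#3') to its average."""
    return {name: average_landing_time(records)
            for name, records in groups.items()}
```

For example, `compare_mechanisms({"#1": ec2_records, "#3": proposed_records})` would return one `timedelta` per label, where `ec2_records` and `proposed_records` are hypothetical lists of timestamp pairs.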