Simplify Merge and Batch Testing Machinery

We currently speed up our landing process by batching several merges into a single test run. Even by a naive comparison, the mechanism we use for this more than doubled the complexity of the previous, slower approach: for a single landing, we went from one branch (devel) and one merge process (PQM) to two branches (devel and stable) and PQM plus buildbot plus a custom poller that lands branches. The mechanism also introduced a new source of slowdowns for developers: a stop-the-line "testfix" condition.

(This is separate from the division of stable and db-stable, which was done for a separate reason. That is not changed in this proposal.)

Despite all the complexity, analysis showed the new process to be a win over our previous one-branch-at-a-time process (analysis presented by flacoste; I don't have a reference).

This proposal outlines an approach that gets us similar benefits for significantly less complexity and fewer opportunities for slowdowns.

There are other proposals for speeding up our test process: speeding up the suite itself, and running the tests in parallel. They can work in conjunction with this effort, but I believe that the effort I describe is much simpler and much more likely to be completed soon.

As a developer
I want smoother branch landing machinery
so that I can land my code faster and with fewer spurious errors

Rationale

We're doing this now because the current merge process is a constant drain on the time of the LOSAs, the build engineers, and the developers; and the proposed solution should alleviate a significant part of the pain quickly.

Stakeholders

LOSAs, QA and build engineers, Launchpad developers.

Constraints

It should be possible to run tests for the db branch and the regular branch in parallel.
When branches are merged into the regular branch, something should automatically try to merge the regular branch into the db branch. If that merge fails, it should notify developers clearly, and the failure should be resolvable with a manual merge to the db branch. This is still a "stop-the-line" situation.
Problems should be communicated clearly to the people responsible for resolving them (at least as clearly as they are now).

Unnecessary Desires

We switch to Tarmac. That's still desired, but not necessary for this particular proposal. If Robert has the time for making this work soon in PQM, we can treat it as an experiment on the process before we switch to Tarmac.

Success

Buildbot and the poller script have been eliminated, and the number of "trunks" has been halved (leaving effectively only stable, db-stable, and production-stable, whatever they are called).
We no longer have a testfix mode.
We can land branches to stable faster on average than we do now.

Subfeatures

(None)

Implementation Proposal

For this description, I will use "branch lander" as a name for a component that is Tarmac or PQM.

Description

If the branch lander is PQM, we run PQM on three separate machines (virtual or otherwise). These can be the machines that LOSAs have obtained for dedicated buildbot slaves, since they will be used for the same basic mechanism. One PQM is for the db Launchpad branches, one PQM is for the regular Launchpad branches, and one PQM is for everything else managed by PQM. (PQM or Tarmac might use another mechanism, but the basic constraint is that these three things shouldn't block one another.)

The remainder of the description describes the branch lander behavior for the two Launchpad branches only. The branch lander for other branches should be unchanged.

The branch lander has at least one configurable knob: how many branches to merge and test at once. It might have a second configurable knob: how long to wait after the first merge request before starting a test run. As of this writing, the existing machinery has the following values, where "treeStableCount" is the number of branches to merge and test at once, and "treeStableTimer" is the number of seconds to wait after the first merge request before starting a test run: treeStableCount=3, treeStableTimer=12*60.
The branch lander locally merges N branches into a copy of the destination branch and tests them, where N is determined by the circumstances and the configuration described in the previous point.
- If the tests for the N branches pass, they are merged to the destination branch individually. Moreover, on the regular (non-db) branch lander, once the branches are merged into the target branch, the branch lander makes a request to the db branch lander that the regular Launchpad branch be merged into the db branch. (This mimics the current behavior; the basic constraint is that changes in the regular branch should be automatically propagated to the db branch, and if the automatic merge fails, it should fail loudly and require a manual merge before things can proceed.)
- If the tests fail, all N branches are rejected, the submitters are alerted (and must determine what caused the test failures and then resubmit their branches), and the branch lander goes back to looking in its queue for more branches to merge and test. (Notice that there is no testfix mode for this failure because no failing branches are ever merged.)
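
The behavior described above can be sketched roughly as follows. This is a minimal illustration, not PQM or Tarmac code: the names BranchLander, LanderConfig, run_tests and on_landed are hypothetical, and the rule that a test run starts when either the queue reaches treeStableCount or treeStableTimer seconds have passed since the first request is an assumption about how the two knobs would combine.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class LanderConfig:
    # Values quoted in the proposal for the existing machinery.
    tree_stable_count: int = 3        # branches to merge and test at once
    tree_stable_timer: int = 12 * 60  # seconds to wait after the first request


@dataclass
class BranchLander:
    """Hypothetical model of one branch lander (regular or db)."""
    config: LanderConfig
    # Stand-in for "merge the batch locally and run the test suite".
    run_tests: Callable[[List[str]], bool]
    # Hook called after a successful landing; the regular lander would use
    # it to ask the db lander to merge the regular branch.
    on_landed: Optional[Callable[[float], None]] = None
    queue: List[str] = field(default_factory=list)
    landed: List[str] = field(default_factory=list)
    rejected: List[str] = field(default_factory=list)
    first_request_at: Optional[float] = None

    def submit(self, branch: str, now: float) -> None:
        if not self.queue:
            self.first_request_at = now
        self.queue.append(branch)

    def ready(self, now: float) -> bool:
        # Assumption: start when the batch is full OR the timer has expired.
        if not self.queue:
            return False
        if len(self.queue) >= self.config.tree_stable_count:
            return True
        return now - self.first_request_at >= self.config.tree_stable_timer

    def process(self, now: float) -> Optional[bool]:
        """Test one batch; return None if no run started, else pass/fail."""
        if not self.ready(now):
            return None
        n = self.config.tree_stable_count
        batch, self.queue = self.queue[:n], self.queue[n:]
        # Assumption: leftover requests restart the timer from now.
        self.first_request_at = now if self.queue else None
        if self.run_tests(batch):
            # Tests passed: every branch in the batch lands.
            self.landed.extend(batch)
            if self.on_landed is not None:
                self.on_landed(now)
            return True
        # Tests failed: nothing is merged, so there is no testfix mode.
        self.rejected.extend(batch)
        return False
```

In this sketch the regular lander would be constructed with an on_landed hook that submits the regular branch to a second BranchLander instance standing in for the db lander. Because a failed batch never reaches the destination branch, a failure only costs the submitters a resubmission rather than stopping the line.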

Risks

Does anyone actually value the fact that the devel branch has changes committed to it very quickly? This approach eliminates that, along with the specifics of the "five minute PQM" (which is really about a 13 minute PQM). Five minute PQM was arguably a shell game anyway: the real interest is how quickly we get to stable, and this approach can be faster and more stable than the current mechanism, while retaining its true value (batching test runs).

Thoughts?

This reduces complexity, reduces fragility, and eliminates a source of stop-the-line problems, with little or no downside. I think it is an obvious win, especially if it can be done relatively cheaply.
