See TarmacMergeMachinery for the Operator's Instructions and Developer Notes.

See SimplifedMergePlanning for a roadmap of this project's implementation.

Simplify Merge and Batch Testing Machinery

We currently speed up our landing process by batching several merges into a single test run. Even by a naive comparison, the mechanism we use to do this more than doubled the complexity of the previous, slower approach: for a single landing, we went from a single branch (devel) and a single merge process (PQM) to two branches (devel and stable) and PQM plus buildbot plus a custom poller that landed branches. The mechanism also introduced a new source of slowdowns for developers: a stop-the-line "testfix" condition.

(This is separate from the division of stable and db-stable, which was done for a separate reason. That is not changed in this proposal.)

Despite all the complexity, analysis showed the new process to be a win over our previous one-branch-at-a-time process (analysis presented by flacoste at All Hands 2009; company-private link, sorry).

This proposal outlines an approach that gets us similar benefits for significantly less complexity and fewer opportunities for slowdowns.

There are other proposals for speeding up our test process: speed up the suite itself, and run the tests in parallel. They can work in conjunction with this effort, but I believe that the effort I describe is much simpler and much more likely to be completed soon.

As a developer
I want smoother branch landing machinery
so that I can land my code faster and with fewer spurious errors

On Launchpad: https://bugs.launchpad.net/launchpad-project?field.tag=merge-machinery-story

Rationale

We're doing this now because the current merge process is a constant drain on the time of the LOSAs, the build engineers, and the developers; and the proposed solution should alleviate a significant part of the pain quickly.

Stakeholders

LOSAs, QA and build engineers, Launchpad developers.

Constraints

Unnecessary Desires

Success

The Pilot Program

Our first point for evaluating success will be after the completion of a volunteer pilot program. The pilot volunteers will use our new system to land their code on the stable branch.

To evaluate the pilot's success, we will measure the time from code review approval until the branch has landed in stable. The average landing time will be measured for three periods: a period six months before the pilot program, the month immediately preceding the program, and the program itself. The average time to land a branch on stable should be shorter for the pilot volunteers.
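The measurement described above could be computed along these lines. This is a minimal sketch, not an existing tool: the function name `average_landing_time` and the input shape (a list of `(approved, landed)` timestamp pairs) are assumptions made for illustration, and how those timestamps would be extracted from Launchpad is not covered here.

```python
from datetime import datetime, timedelta
from statistics import mean


def average_landing_time(landings, start, end):
    """Mean time from review approval to landing on stable.

    `landings` is a list of (approved, landed) datetime pairs
    (a hypothetical input format). Only branches approved in the
    half-open window [start, end) are counted. Returns None when
    no branch falls inside the window.
    """
    durations = [
        (landed - approved).total_seconds()
        for approved, landed in landings
        if start <= approved < end
    ]
    if not durations:
        return None
    return timedelta(seconds=mean(durations))
```

Calling the function once per period (the earlier baseline, the month before the pilot, and the pilot itself) gives the three averages to compare.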

Our reasoning for the shorter landing time is that fewer people across the project should have their branches rejected by testfix mode (because of testfix mode's elimination). Lowering the time spent resubmitting rejected branches should cause a drop in the average landing time.

We have published the branch landing statistics for the month of October 2010 here [company internal link].

Final Goals

If the pilot is successful, we will want to achieve the following goals in the remainder of the program:

Subfeatures

(None)

Implementation Proposal

For this description, I will use "branch lander" as a name for a component that is Tarmac or PQM.

Description

The remainder of the description describes the branch lander behavior for the three Launchpad branches only. The branch lander for other branches should be unchanged.

Risks

Does anyone actually value the fact that the devel branch has changes committed to it very quickly? This approach eliminates that, along with the specifics of the "five minute PQM" (which is really about a 13 minute PQM). Five minute PQM was arguably a shell game anyway: the real interest is how quickly we get to stable, and this approach can be faster and more stable than the current mechanism, while retaining its true value (batching test runs).

Thoughts

This reduces complexity, reduces fragility, and eliminates a source of stop-the-line problems, with little or no downside. I think it is an obvious win, especially if it can be done relatively cheaply.

How might we measure whether we land branches faster? The metric I had in mind was comparing three numbers:

  1. using the current landing mechanism, the average time between [someone sends a branch to ec2 land] and [branch lands on stable];
  2. using the current landing mechanism, the average time between [someone sends a branch directly to pqm] and [branch lands on stable]; and
  3. using the proposed mechanism, the same as #2 (the average time between [someone sends a branch directly to pqm] and [branch lands on stable]).
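The three numbers listed above could be produced by grouping landing records by mechanism and submission route. The following is a rough sketch under stated assumptions: the record layout `(mechanism, route, submitted, landed)` and the labels "current", "proposed", "ec2 land", and "pqm" are hypothetical names chosen for this example, not fields of any existing report.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import mean


def average_by_group(records):
    """Average submit-to-stable time per (mechanism, route) pair.

    `records` is an iterable of (mechanism, route, submitted, landed)
    tuples, where the first two items are labels and the last two are
    datetimes (a hypothetical layout for illustration).
    """
    buckets = defaultdict(list)
    for mechanism, route, submitted, landed in records:
        buckets[(mechanism, route)].append((landed - submitted).total_seconds())
    return {
        key: timedelta(seconds=mean(seconds))
        for key, seconds in buckets.items()
    }
```

With this grouping, number 1 corresponds to the key `("current", "ec2 land")`, number 2 to `("current", "pqm")`, and number 3 to `("proposed", "pqm")`.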

I propose that comparing #1 and #3 is reasonable because we are always supposed to use ec2 test: if we break things, we break things for the entire team. That will clearly give a win to #3. Comparing #2 and #3 will, I think, still clearly give a win to #3 because

Foundations/Proposals/SimplifyMergeMachinery (last edited 2010-12-08 22:07:44 by mars)