Single machine parallel testing of single branches

Running more than one instance of the Launchpad test suite for a single branch at the same time on the same computer. That is, taking one run of the Launchpad test suite, splitting it across many different processes on the same computer, and gathering the results.

Contact: RobertCollins
On Launchpad: https://bugs.launchpad.net/launchpad-project/+bugs?field.tag=paralleltest

As a Technical Architect
I want our automated test runs to parallelise effectively within a single machine
so that the qa and branch promotion pipeline moves faster

As a Developer
I want ec2 runs to parallelise effectively within the ec2 instance
so that landings happen faster

Rationale

Our test suite execution time is a crucial factor in the minimum cycle time to make a change and deploy it. Similarly, our ability to recover from intermittently failing tests is limited by the time it takes the test suite to run.

Having a dramatically faster test suite will reduce the latency for bugfixes, gathering of profile data and repetition/solving of intermittent failures.

Stakeholders

All Launchpad developers.

Constraints and Requirements

Must

Create a means to run our tests with one test thread per CPU in the machine, no less reliably than our existing single-threaded mode. It must parallelise more effectively than bin/test -j (which does per-layer splits).
Update our ec2 test environment to use the ec2 machine type that will run our parallelised test suite as rapidly as possible.
Determine the needed disk/cpu bandwidth per test process so that we can reasonably project the performance of different size machines for buildbot.
Organise and upgrade our CI test running to take advantage of this new environment.
Permit developers to reliably run parallelised tests as well.
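
To illustrate the difference between per-layer splits and a finer-grained split with one worker per CPU, here is a minimal sketch of partitioning individual tests across workers by their recorded durations. It is an assumption for illustration only, not the mechanism this LEP commits to: the function name partition_tests, the durations mapping and the test ids are all hypothetical.

```python
import heapq
import os


def partition_tests(durations, workers=None):
    """Split test ids into one bucket per worker, balancing total runtime.

    `durations` maps a (hypothetical) test id to its last known runtime in
    seconds. Uses a greedy longest-first heuristic: each test goes to the
    bucket that currently has the least work.
    """
    if workers is None:
        # One worker per CPU, as the requirement above asks for.
        workers = os.cpu_count() or 1
    # Heap of (total_seconds, bucket_index) so the lightest bucket pops first.
    heap = [(0.0, i) for i in range(workers)]
    heapq.heapify(heap)
    buckets = [[] for _ in range(workers)]
    # Longest tests first; ties broken by test id for a stable result.
    ordered = sorted(durations.items(), key=lambda kv: (-kv[1], kv[0]))
    for test_id, seconds in ordered:
        total, index = heapq.heappop(heap)
        buckets[index].append(test_id)
        heapq.heappush(heap, (total + seconds, index))
    return buckets


if __name__ == "__main__":
    sample = {"test_a": 5.0, "test_b": 3.0, "test_c": 2.0, "test_d": 2.0}
    for number, bucket in enumerate(partition_tests(sample, workers=2)):
        print(number, bucket)
```

Splitting at the level of individual tests rather than whole layers means one slow layer cannot keep a single worker busy while the others sit idle, which is the limitation of per-layer splits noted above.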

Nice to have

testr --parallel support for Launchpad working.

Must not

Out of scope

Changing the landing technology is out of scope: that's something the SimplifiedMergeMachinery project will evaluate and act on.

Workflows

Success

How will we know when we are done?

ec2 and buildbot test runs are dramatically shorter than they are today: down to less than 50% of the current time, preferably 15%-20%.
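
As a rough way to reason about these targets, the sketch below converts a target runtime fraction into the speedup it implies, and uses Amdahl's law to estimate the runtime fraction for a given number of workers. The serial fraction (setup and other work that cannot be split) is an assumed input, not a measured Launchpad figure, and the function names are hypothetical.

```python
def required_speedup(target_fraction):
    """Speedup needed to bring runtime down to `target_fraction` of today's."""
    return 1.0 / target_fraction


def runtime_fraction(serial_fraction, workers):
    """Amdahl's law: fraction of the original runtime left with `workers`.

    `serial_fraction` is the assumed share of the run that cannot be
    parallelised (for example, one-off setup).
    """
    return serial_fraction + (1.0 - serial_fraction) / workers


if __name__ == "__main__":
    for target in (0.50, 0.20, 0.15):
        print(f"target {target:.0%}: needs {required_speedup(target):.2f}x")
    # Assumed 5% serial work on an 8-way machine.
    print(f"8 workers, 5% serial: {runtime_fraction(0.05, 8):.1%} of current time")
```

Under this model, reaching 50% needs a 2x speedup and reaching 20% needs 5x, so the lower targets depend on keeping the non-parallel part of the run small.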

How will we measure how well we have done?

Length of time from submission to devel through to the revision landing on "stable". We expect it to go down.

Thoughts?

It is possible that the work on this will make it easier to get started hacking on Launchpad. Although that is out of scope for this LEP, we should keep our eyes open.
The prototype LXC + testr based parallelisation seems to have the best effort-reward tradeoff today. It will have some friction around host OS and so on, but benchmarks so far show dramatic scaling: linear on reasonably large machines.
See the project page.
