Helpful tips for troubleshooting Launchpad parallel testing

Running the Launchpad test suite in parallel significantly decreases the turn-around time for buildbot to bless recently landed changes. The decrease from over five hours (as of 2012-08-28) to around 45 minutes (estimated from EC2 runs) will allow quicker deployments and better branch isolation, since the shorter testing window means fewer branches end up in each collection under test.

These benefits come at the price of complexity. When things go wrong they are harder to figure out -- a lot harder to figure out. And slicing and dicing the test suite exposes problems that were previously masked by runs on a single processor.

Here we provide some tips on where to look when the test suite falls over. Of course, it won't always be the fault of running the tests in parallel. Expect good old-fashioned test-fix mode due to real bugs at about the same rate as before parallelizing, though the blame list should be shorter.

How do I find out what worker ran a particular test?

If you are looking at a failure in the buildbot summary output, the second line of the failure listing should look something like this: "worker ID: worker-3". That's the worker. You are done.

If instead you are given a full or partial test name and a subunit output file, you should be able to do something like subunit-filter -s --with 'NAME OF TEST'. After that, look in the output for the "test: NAME OF TEST" line. The next line or so should have a "tags:" prefix. Look there for the name of the worker (of the form "worker-N").
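For example, a minimal sketch, assuming the stream was saved as subunit-output.txt and the test of interest is lp.bugs.tests.test_example (both names are made up here):

  # Filter the stream down to the one test, then show its test: and tags: lines.
  cat subunit-output.txt | \
      subunit-filter -s --no-passthrough --with 'lp.bugs.tests.test_example' | \
      grep -E "^(test|tags):"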

How do I get the list of tests run on a particular process?

If you are looking at buildbot, this is really easy: look at the buildbot log named for the given worker. That's what you want.

If instead you are given a subunit output, the answer is just a bit more complicated.
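A pipeline along these lines should produce the list (subunit-output.txt and worker-0 are stand-ins for your own stream file and worker, and this assumes your version of subunit-filter supports selecting by tag with --with-tag):

  # Keep only the chosen worker's tests, drop the layer setUp/tearDown
  # pseudo-tests, and list what remains.
  cat subunit-output.txt | \
      subunit-filter -s --no-passthrough --with-tag="worker-0" \
          --without=":setUp$" --without=":tearDown$" | \
      subunit-ls > worker0.txt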

The two --without=":setUp$" --without=":tearDown$" options should be replaceable with --without-tag='zope:layer' (see bug 986429), but we are not using that yet -- tags are still a bit hosed in other ways, so we don't completely trust them.

How do I run the tests in a given order, to mimic the order that tests ran in a particular process?

There's an optimization here that works more than 90% of the time, in our experience (a rough estimate). The steps that are part of the optimization are prefixed with "OPTIMIZATION:". If you can't dupe the failure with these instructions, the next thing to try is removing the optimization steps.

  1. Download a worker's test list from buildbot, or otherwise create a test list from your failed test run (one test per line) to be fed to --load-list. Let's assume you save the file as worker0.txt.

  2. OPTIMIZATION: Find the layer where the bad test is run and the first test in that layer: bin/test --load-list worker0.txt --list-tests > layered.txt

  3. OPTIMIZATION: In layered.txt, find the test of interest and then search backwards for the layer marker. Note the first test in that layer. In worker0.txt, copy the set of tests from the first test in that layer through the bad test and save the result as worker0-trimmed.txt; see the sed sketch after this list for one way to do the copy. (This change has to be made to the original file, not the one produced by --list-tests, as the latter names some tests in a way that is not compatible with --load-list.)

  4. Now you can run just the tests that are candidates for causing the isolation error: bin/test -vv --load-list worker0-trimmed.txt
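For step 3, something like this sed invocation can do the copying, assuming (purely for illustration) that the first test in the layer is lp.foo.tests.test_first and the bad test is lp.bar.tests.test_bad:

  # Copy everything from the first test in the layer through the bad test.
  sed -n '/^lp\.foo\.tests\.test_first/,/^lp\.bar\.tests\.test_bad/p' \
      worker0.txt > worker0-trimmed.txt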

The next question might help you too...

What do I do if a test fails that is unrelated to code changes?

If a test fails that is clearly outside the scope of the changes to the code and it passes when run by itself, then it is likely you've discovered a test isolation problem. These problems are revealed by parallel testing as a result of tests being run in different sets and in a different order than in a single-process run.

The likely culprit is some test in that same worker set that ran earlier, changed global state, and didn't clean up after itself. Perhaps it changed a class variable and didn't reset it. Or maybe it created a file on the file system and didn't remove it when done. With 17,000+ tests in our suite, some of them are still unhygienic and will cause problems.

To figure out what has happened you need to 1) find the worker that ran the failing test (see "How do I find out what worker ran a particular test?" above), 2) get the list of tests run by that worker (see "How do I get the list of tests run on a particular process?" above), and 3) begin a binary search of the tests that ran up to and including the failing test. Run that set as shown above and use the process of elimination to zoom in on the minimal set of tests that cause the problem; a sketch of one bisection step follows. In practice this can go much more quickly than you'd expect. Once the suspects are found there are no easy answers for figuring out the problem other than normal code inspection and debugging techniques. Note that sometimes an interaction error is caused by three or more tests. If you can't dupe with either half but you can dupe with the whole set, that's probably what you have.
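Here is a sketch of one bisection step, reusing worker0-trimmed.txt from the previous section and assuming, just for the example, that it holds about 100 candidate tests with the failing test on its last line:

  # Run the first half of the candidates plus the failing test itself.
  head -n 50 worker0-trimmed.txt > bisect.txt
  tail -n 1 worker0-trimmed.txt >> bisect.txt
  bin/test -vv --load-list bisect.txt
  # If the failure reproduces, the culprit is in the first half;
  # otherwise repeat the process with the second half of the list.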

A problem similar to a test interaction is a failing test that expects some state it did not set up itself -- some previous test did. The debugging steps are the same as given above.

What if subunit reports the wrong number of tests?

Occasionally the test runs generate a subunit stream that is either invalid, or valid but not parsed correctly by subunit. When that happens, arbitrary chunks of the test results can be lost, usually manifesting as a dramatic reduction in the apparent number of tests run.

Diagnosing this kind of problem requires finding the location of the parse problem, usually via inspection or binary search. Once it is found, either the subunit stream generation or the parsing will have to be fixed.

When this happens in buildbot, download the full output and then try to dupe using the subunit tools locally. A basic pattern to get a test count from a subunit output file is cat FILENAME | subunit-filter -s --without=":setUp$" --without=":tearDown$" --without-tag="zope:info_suboptimal" --no-passthrough | subunit-ls | wc -l.
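One rough way to do the binary search, assuming a text-based (v1) stream so that splitting on line boundaries is reasonable, is to count the tests reported in a prefix of the file and vary the prefix length until you find the spot where tests start going missing:

  # Count the tests seen in the first 100000 lines of the stream; adjust
  # the line count up and down to home in on the parse problem.
  head -n 100000 FILENAME | subunit-filter -s --no-passthrough | subunit-ls | wc -l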

Known issues

The following tests are currently causing intermittent problems that have not been resolved:

The test times of the different workers in a given run also vary more than we expect or want. This could be caused by a number of things, including resource contention of some sort. Another possible cause is discussed in this bug:
