yellow/ParallelTestingTroubleshooting

Revision 1 as of 2012-08-28 14:49:20

Helpful tips for troubleshooting Launchpad parallel testing

Running the Launchpad test suite in parallel significantly reduces the turn-around time for buildbot to bless recently landed changes. The decrease from over five hours (as of 2012-08-28) to around 45 minutes (estimated from EC2 runs) will allow for quicker deployment and better branch isolation: because the window shrinks, fewer branches will be in each new collection under test.

These benefits come at the price of complexity. When things go wrong they are harder to figure out -- a lot harder to figure out. And slicing and dicing the test suite exposes problems that were previously masked by runs on a single processor.

Here we provide some tips on where to look when the test suite falls over. Of course, it won't always be the fault of running the tests in parallel. Expect good old-fashioned test fix mode due to real bugs at about the same rate as before parallelizing, though the blame list should be shorter.

How do I find out what worker ran a particular test?

Given a full or partial test name and a subunit output, you should be able to do something like subunit-filter -s --with 'NAME OF TEST'. After that, look in the output for the test: NAME OF TEST line. The next line or so should have a "tags:" prefix. Look there for the name of the worker (of the form "worker-N").
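The lookup described above can also be scripted. The following is a minimal sketch, assuming the filtered output is available as plain text lines in which a "test: NAME" line is followed within a few lines by a "tags:" line. The function name find_worker and the lookahead parameter are hypothetical and are not part of subunit.

```python
import re

# Worker tags have the form "worker-N", as described above.
WORKER_TAG = re.compile(r"\bworker-\d+\b")


def find_worker(lines, test_name, lookahead=3):
    """Return the worker tag for a test matching test_name, or None.

    `lines` is textual subunit output, one line per item. `lookahead`
    (an assumption) is how many lines after the "test:" line to search
    for a "tags:" line.
    """
    lines = [line.rstrip("\n") for line in lines]
    for i, line in enumerate(lines):
        if line.startswith("test: ") and test_name in line:
            # The tags line should be "the next line or so".
            for follow in lines[i + 1 : i + 1 + lookahead]:
                if follow.startswith("tags:"):
                    match = WORKER_TAG.search(follow)
                    if match:
                        return match.group(0)
    return None
```

For example, with the filtered output saved to a file, find_worker(open("filtered.txt"), "NAME OF TEST") would return something like "worker-0", or None if no matching test with a worker tag is found.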

How do I get the list of tests run on a particular process?

Unfortunately, there's not a simple answer.

The two --without=":setUp$" --without=":tearDown$" options should be replaceable with --without-tag='zope:layer' (see 986429) but we are not using that yet -- tags are still a bit hosed in other ways so we don't completely trust them yet.

How do I run the tests in a given order, to mimic the order that tests ran in a particular process?

  1. Create a test list (one test per line) to be fed to --load-list from your failed test run. Let's assume you save the file as worker0.txt.

  2. Find the layer where the bad test is run and the first test in that layer: bin/test --load-list worker0.txt --list-tests > layered.txt

  3. In layered.txt find the test of interest and then search backwards for the layer marker. Note the first test in that layer. In worker0.txt, copy the set of tests starting with the first in the layer up to and including the bad test, and save it as worker0-trimmed.txt. (This change has to be made to the original file, not the one produced by --list-tests, as the latter names some tests in a way that is not compatible with --load-list.)

  4. Now you can run just the tests that are candidates for causing the isolation error: bin/test -vv --load-list worker0-trimmed.txt
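The trimming in step 3 can be done by hand, but a small script avoids copy-and-paste mistakes. This is a minimal sketch, assuming the test list has one test id per line and that you have already identified the first test in the layer and the failing test. The function names trim_test_list and trim_file are hypothetical helpers, not part of the Launchpad tree.

```python
def trim_test_list(tests, first_in_layer, bad_test):
    """Return the tests from first_in_layer through bad_test, inclusive.

    Raises ValueError if either test is missing, or if bad_test does
    not appear at or after first_in_layer.
    """
    start = tests.index(first_in_layer)
    end = tests.index(bad_test, start)
    return tests[start : end + 1]


def trim_file(first_in_layer, bad_test,
              source="worker0.txt", dest="worker0-trimmed.txt"):
    """Read the original test list and write the trimmed one."""
    with open(source) as f:
        # One test id per line; skip blank lines.
        tests = [line.strip() for line in f if line.strip()]
    trimmed = trim_test_list(tests, first_in_layer, bad_test)
    with open(dest, "w") as f:
        f.write("\n".join(trimmed) + "\n")
    return trimmed
```

The test ids passed in should be copied from worker0.txt rather than from layered.txt, for the compatibility reason given in step 3.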

What to do if a test fails, especially one that is unrelated to code changes?