Helpful tips for troubleshooting Launchpad parallel testing
Running the Launchpad test suite in parallel significantly reduces the turn-around time for buildbot to bless recently landed changes. The decrease from over five hours (as of 2012-08-28) to around 45 minutes (estimated from EC2 runs) will allow for quicker deployment and better branch isolation: because the window shrinks, fewer branches will be in the new collection under test.
These benefits come at the price of complexity. When things go wrong they are harder to figure out -- a lot harder to figure out. And slicing and dicing the test suite exposes problems that were previously masked by runs on a single processor.
Here we provide some tips on where to look when the test suite falls over. Of course, it won't always be the fault of running the tests in parallel. Expect good old fashioned test fix mode due to real bugs at about the same rate as before parallelizing, though the blame list should be shorter.
How do I find out what worker ran a particular test?
Given a full or partial test name and a subunit output, you should be able to do something like subunit-filter -s --with 'NAME OF TEST'. After that, look in the output for the test: NAME OF TEST line. The next line or so should have a "tags:" prefix. Look there for the name of the worker (of the form "worker-N").
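If you would rather not scan the filtered output by eye, the lookup can be scripted. The following is a minimal sketch, assuming the filtered output has been saved as text and has the layout described above: a "test: NAME" line followed within a few lines by a "tags:" line. The function name find_worker and the lookahead parameter are hypothetical, not part of subunit.

```python
import re

# Matches a worker tag of the form "worker-N".
WORKER_TAG = re.compile(r"\bworker-\d+\b")


def find_worker(subunit_text, test_name, lookahead=5):
    """Return the worker tag for the first matching test, or None.

    Assumes a "test: NAME" line is followed within `lookahead` lines
    by a "tags:" line naming the worker.
    """
    lines = subunit_text.splitlines()
    for i, line in enumerate(lines):
        if line.startswith("test: ") and test_name in line:
            # Look at "the next line or so" for the tags prefix.
            for follow in lines[i + 1 : i + 1 + lookahead]:
                if follow.startswith("tags:"):
                    match = WORKER_TAG.search(follow)
                    if match:
                        return match.group(0)
    return None
```

If the tags line does not appear right after the test line in your output, increase lookahead or fall back to reading the output directly.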
How do I get the list of tests run on a particular process?
Unfortunately, there's not a simple answer.
- Download the desired subunit output from the buildbot output.
- Now run a command that looks something like this:
$ cat ~/Downloads/testrepo-0.txt | env PYTHONPATH=python ./filters/subunit-filter -s --with-tag='worker-0' --without=":setUp$" --without=":tearDown$" --without-tag="zope:info_suboptimal" --no-passthrough | subunit-ls > worker0.txt
The two --without=":setUp$" --without=":tearDown$" options should be replaceable with --without-tag='zope:layer' (see 986429) but we are not using that yet -- tags are still a bit hosed in other ways so we don't completely trust them yet.
How do I run the tests in a given order, to mimic the order that tests ran in a particular process?
Create a test list (one test per line) to be fed to --load-list from your failed test run. Let's assume you save the file as worker0.txt.
Find the layer where the bad test is run and the first test in that layer: bin/test --load-list worker0.txt --list-tests > layered.txt
In layered.txt find the test of interest and then search backwards for the layer marker. Note the first test in that layer. In worker0.txt, copy the set of tests starting with the first in the layer and ending with the bad test, and save it as worker0-trimmed.txt. (This change has to be made to the original file, not the one produced by --list-tests, as the latter names some tests in a way that is not compatible with --load-list.)
Now you can run just the tests that are candidates for causing the isolation error: bin/test -vv --load-list worker0-trimmed.txt
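The copy-and-paste trimming step can also be scripted. This is a minimal sketch, assuming worker0.txt holds one test ID per line and that you have already identified the first test in the layer and the failing test by hand. The helper names trim_test_list and trim_file are hypothetical.

```python
def trim_test_list(tests, first_in_layer, failing_test):
    """Return the tests from first_in_layer through failing_test, inclusive.

    Raises ValueError if either test is missing or out of order.
    """
    start = tests.index(first_in_layer)
    end = tests.index(failing_test, start)
    return tests[start : end + 1]


def trim_file(first_in_layer, failing_test,
              source="worker0.txt", dest="worker0-trimmed.txt"):
    """Read the original test list and write the trimmed list to dest."""
    with open(source) as f:
        tests = [line.strip() for line in f if line.strip()]
    trimmed = trim_test_list(tests, first_in_layer, failing_test)
    with open(dest, "w") as f:
        f.write("\n".join(trimmed) + "\n")
    return trimmed
```

The test IDs passed in must be spelled exactly as they appear in worker0.txt, not as they appear in the --list-tests output.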
What to do if a test fails, especially one that is unrelated to code changes?
If a test fails that is clearly outside the scope of the changes to the code and it passes when run by itself, then it is likely you've discovered a test isolation problem. These problems are revealed by parallel testing as a result of tests being run in different sets and in a different order than in a single execution run. The likely culprits are:
- Some test in that same worker set that has already run changed global state and didn't clean up after itself. Perhaps it changed a class variable and didn't reset it. Or maybe it created a file on the file system and didn't remove it when done. With 17,000+ tests in our suite, some of them are still unhygienic and will cause problems.
To figure out what has happened, you need to 1) find the worker that ran the failing test, 2) get the list of tests run by that worker, and 3) begin a binary search of the tests that ran up to and including the failing test.
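The binary search in step 3 can be sketched as follows. This is an illustration under two assumptions: a single earlier test is responsible for the failure, and the failure reproduces reliably. The name find_culprit and the fails_after callback are hypothetical. In practice the callback would write the given subset followed by the failing test to a file, run it with bin/test --load-list, and report whether the failing test still failed.

```python
def find_culprit(candidates, fails_after):
    """Binary-search candidates for the one test that breaks a later test.

    candidates: tests that ran before the failing test, in order.
    fails_after(subset): returns True if the failing test fails when
        run after `subset` (hypothetical callback supplied by the caller).
    Returns the culprit, or None if the failure cannot be reproduced.
    """
    if not candidates or not fails_after(candidates):
        return None
    lo, hi = 0, len(candidates)
    # Invariant: the culprit lies somewhere in candidates[lo:hi].
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if fails_after(candidates[lo:mid]):
            hi = mid  # culprit is in the first half
        else:
            lo = mid  # culprit must be in the second half
    return candidates[lo]
```

Each step halves the candidate list, so even a worker that ran a few thousand tests needs only a dozen or so runs. If the failure depends on a combination of earlier tests, this simple search will not find it.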