yellow/ParallelTestingTroubleshooting

Helpful tips for troubleshooting Launchpad parallel testing

Running the Launchpad test suite in parallel significantly decreases the turn-around time for buildbot to bless recently landed changes. The drop from over five hours (as of 2012-08-28) to around 45 minutes (estimated from EC2 runs) will allow quicker deployment and better branch isolation, as fewer branches will be in each collection under test since the window will shrink.

These benefits come at the price of complexity. When things go wrong they are harder to figure out -- a lot harder to figure out. And slicing and dicing the test suite exposes problems that were previously masked by runs on a single processor.

Here we provide some tips on where to look when the test suite falls over. Of course, it won't always be the fault of running the tests in parallel. Expect good old-fashioned test-fix mode due to real bugs at about the same rate as before parallelizing, though the blame list should be shorter.

How do I find out what worker ran a particular test?

Given a full or partial test name and a subunit output file, you should be able to run something like subunit-filter -s --with 'NAME OF TEST'. In the output, find the test: NAME OF TEST line. The next line or so should have a "tags:" prefix. Look there for the name of the worker (of the form "worker-N").
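If the subunit tools are not handy, the lookup can also be sketched in a few lines of Python, assuming the line-based subunit v1 text format where each test is announced on a "test:" line and its tags follow on a "tags:" line (the sample stream and test name below are invented for illustration):

```python
def find_worker(stream_lines, test_name):
    """Return the worker-N tag recorded for the given test, or None."""
    current_test = None
    for line in stream_lines:
        line = line.strip()
        if line.startswith("test:"):
            current_test = line[len("test:"):].strip()
        elif line.startswith("tags:") and current_test == test_name:
            # Tags are space-separated; the worker name looks like "worker-N".
            for tag in line[len("tags:"):].split():
                if tag.startswith("worker-"):
                    return tag
    return None

# A made-up three-line stream in the v1 text style, for illustration.
sample = [
    "test: lp.registry.tests.test_person.TestPerson.test_rename",
    "tags: worker-3 zope:layer",
    "success: lp.registry.tests.test_person.TestPerson.test_rename",
]
print(find_worker(sample, "lp.registry.tests.test_person.TestPerson.test_rename"))
```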

How do I get the list of tests run on a particular process?

Unfortunately, there's not a simple answer.

  • Install the Testing Cabal's PPA or the Yellow Squad's PPA on Precise and get or update subunit.

  • Download the desired subunit output from the buildbot output.
  • Now run a command that looks something like this:
    • $ cat ~/Downloads/testrepo-0.txt | env PYTHONPATH=python ./filters/subunit-filter -s --with-tag='worker-0' --without=":setUp$" --without=":tearDown$" --without-tag="zope:info_suboptimal" --no-passthrough | subunit-ls > worker0.txt

Q: is the PYTHONPATH and running out of a branch required or can we just run subunit-filter as installed by the PPA?

The two --without=":setUp$" --without=":tearDown$" options should be replaceable with --without-tag='zope:layer' (see bug 986429), but we are not using that yet -- tags are still a bit hosed in other ways so we don't completely trust them yet.

How do I run the tests in a given order, to mimic the order that tests ran in a particular process?

  1. Create a test list (one test per line) to be fed to --load-list from your failed test run. Let's assume you save the file as worker0.txt.

  2. Find the layer where the bad test is run and the first test in that layer: bin/test --load-list worker0.txt --list-tests > layered.txt

  3. In layered.txt find the test of interest and then search backwards for the layer marker. Note the first test in that layer. In worker0.txt, copy and paste the set of tests starting with the first in the layer through the bad test and save as worker0-trimmed.txt. (This change has to be made to the original file, not the one produced by --list-tests, as the latter names some tests in a way that is not compatible with --load-list.)

  4. Now you can run just the tests that are candidates for causing the isolation error: bin/test -vv --load-list worker0-trimmed.txt
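Step 3's copy and paste can also be scripted. This sketch (the function and file contents are hypothetical) slices the worker's list from the first test in the layer through the bad test, in original order:

```python
def trim_to_layer(test_names, first_in_layer, bad_test):
    """Return the tests from the first test in the failing test's layer
    up to and including the failing test, preserving order."""
    start = test_names.index(first_in_layer)
    end = test_names.index(bad_test)
    return test_names[start:end + 1]

# Invented stand-in for the contents of worker0.txt.
worker0 = [
    "lp.a.TestA.test_one",
    "lp.b.TestB.test_two",
    "lp.c.TestC.test_three",
    "lp.d.TestD.test_bad",
    "lp.e.TestE.test_five",
]
trimmed = trim_to_layer(worker0, "lp.b.TestB.test_two", "lp.d.TestD.test_bad")
# --load-list expects one test name per line:
print("\n".join(trimmed))
```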

What to do if a test fails, especially one that is unrelated to code changes?

If a test fails that is clearly outside the scope of the changes to the code and it passes when run by itself, then it is likely you've discovered a test isolation problem. These problems are revealed by parallel testing as a result of tests being run in different sets and in a different order than in a single execution run. The likely culprits are:

  • Some test in that same worker set that has already run changed global state and didn't clean up after itself. Perhaps it changed a class variable and didn't reset it. Or maybe it created a file on the file system and didn't remove it when done. With 17,000+ tests in our suite, some of them are still unhygienic and will cause problems.
    • To figure out what has happened you need to 1) find the worker that ran the failing test, 2) get the list of tests run by that worker, 3) begin a binary search of the tests that ran up to and including the failing test. Run that set as shown above and, using the process of elimination, zoom in on the minimal set of tests that cause the problem. (In practice this can go much more quickly than you'd expect.) Once the suspects are found there are no easy answers for figuring out the problem other than normal code inspection and debugging techniques.
  • The failing test expects some state that it did not set up itself.
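The binary search described above can be mechanized. In this sketch, fails_with is a placeholder for "run this candidate subset followed by the failing test (e.g. via bin/test --load-list) and report whether the failing test failed"; it assumes a single culprit and a reproducible failure:

```python
def find_culprit(candidates, fails_with):
    """Binary-search the tests that ran before the failing test for the
    one that poisons global state.  Assumes exactly one culprit and a
    deterministic failure."""
    lo, hi = 0, len(candidates)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        # Run only the first half of the remaining suspects, then the
        # failing test; a failure means the culprit is in that half.
        if fails_with(candidates[lo:mid]):
            hi = mid
        else:
            lo = mid
    return candidates[lo]

suspects = ["t%d" % i for i in range(1, 8)]
# Pretend "t5" is the test that leaves dirty state behind.
print(find_culprit(suspects, lambda subset: "t5" in subset))
```

Each round halves the suspect list, so even a worker that ran thousands of tests needs only a dozen or so re-runs.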

What if subunit reports the wrong number of tests?

Occasionally the test runs generate a subunit stream that is either invalid or is valid but not parsed correctly by subunit. When that happens arbitrary chunks of the test results can be lost, usually manifesting in a dramatic reduction in the apparent number of tests run.

Diagnosing this kind of problem requires finding the location of the parse problem, usually via inspection or binary search. Once found either the subunit stream generation or parsing will have to be fixed.
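A quick audit of the stream can narrow the search: compare the number of announced tests with the number of recorded outcomes and note the last test left unfinished. This sketch assumes the subunit v1 text status words, and the sample stream is invented:

```python
def audit_stream(stream_lines):
    """Compare announced tests with recorded outcomes in a subunit v1
    text stream.  A mismatch, plus the last unfinished test, gives a
    starting point for locating the parse problem."""
    outcome_words = {"success", "successful", "failure", "error", "skip", "xfail"}
    started = finished = 0
    open_test = None
    for number, line in enumerate(stream_lines, 1):
        word = line.split(":", 1)[0].strip()
        if word == "test":
            started += 1
            open_test = (number, line.strip())
        elif word in outcome_words:
            finished += 1
            open_test = None
    return started, finished, open_test

sample = [
    "test: lp.a.TestA.test_ok",
    "success: lp.a.TestA.test_ok",
    "test: lp.b.TestB.test_lost",   # truncated: no outcome follows
]
print(audit_stream(sample))
```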

Known issues

The following tests are currently causing intermittent problems that have not been resolved:

  • 974617 - test_operationalerror_view_integration fails intermittently in parallel tests

  • 1002820 - lp.testing.layers.RabbitMQLayer:setUp times out rarely/intermittently in parallel tests

  • 1013921 - our zope.testing fork needs to emit subunit time immediately before test start and immediately before test completion

  • 1014916 - simultaneously started lucid containers pause while starting after the first seven

  • Other bugs tagged ''paralleltest''

yellow/ParallelTestingTroubleshooting (last edited 2012-08-29 18:36:05 by bac)