Parallel Testing Status: 2012-03-21
Overview
These two weeks were a roller coaster for the project, moving back and forth between seemingly final failure and exhilarating success. We addressed over 2800 test failures, conquered hangs, and fixed issues in Launchpad, Testrepository, and LXC. Today we have our first unimpeded runs to completion on an eight-core EC2 machine. These runs took around 55 minutes, and had a handful of test failures.
We also made progress on related jobs, such as getting our Python shell tools packaged in Ubuntu, getting Python charm helpers added to the official charm helpers package, and moving along our replace-rocketfuel-* slack time project.
We still have a lot of work to do. We need to improve the continuous integration steps in a variety of ways, for stability, reporting, and speed; address the remaining test failures as we find them, including getting help on a kernel issue; complete experiments to determine the incremental value of cores to the parallelization; and get the smoothly oiled machine we have with Juju and EC2 also running manually in the data center. However, this update marks a major milestone for the project, and we are pleased with what we accomplished in these two weeks.
Progress towards biweekly action items
- [yellow] [carried over] Scaling assessment based on experimental results
- We were waiting on webops. In the week of March 14 they announced that they would not have the resources to follow through in the short term, so we decided to pursue this on EC2 using a large machine there (an eight-core instance size is available).
- A very poor test success rate (~2800 failures/errors) and hangs in the test run made us question whether any experiment would be too flawed to be useful. By fixing (and working around) LXC issues, test isolation issues, and testrepository issues, the team got this down to under 10 failures/errors (more details later) and got rid of the known hangs, so this no longer seems to be a concern.
- After a false start on a machine with too little memory (a c1.xlarge instance, 8 cores and 7GB), we switched to the other available eight-core machine, which has lots of memory (an m2.4xlarge, 8 cores and 68.4GB). This worked fine. Eight tests running simultaneously seem to need at least 15GB, based on spot checks while tests ran.
- Initial information (determined today) is that tests on eight cores take about 55 minutes. We have no comparison values yet, but they are in progress (not comparable but interesting: a test run takes 6 hours on lpbuildbot).
- In-progress methodology: on the same machine, hack testrepository’s testcommand.local_concurrency to report 1-8 cores (see bug 957145). Get one or two results per core. These are only initial test ordering (round robin) runs. We have a kanban card to keep the .testrepository directory across builds. Once we have this, we will run tests to see how results change on second and third runs; we will then clean out .testrepository data after each concurrency change.
- [yellow] Identify and fix Launchpad bugs for test failures discovered in parallel test runs
- Fixed
- Launchpad, bug 954319 (Benji): readonly mode isolation bug, caused > 2700 of our failures
- LXC, bug 959352 (Benji): partial workaround (also see in tracking)
- Launchpad bugs 953912 (Benji), 953911 (Francesco), 953902 (Francesco): Test isolation errors
- LXC, bug 951150 (Gary, Benji): non-ephemeral home directories were causing us problems
- LXC, bug 949956 (Benji): shared MAC address/IP address issues
- Testrepository, bug 955006 (Francesco): Unicode issues, workaround in place
- Identified and not yet fixed
- Launchpad test isolation bugs: 953913 (Brad, in progress); more coming
- /dev/random exhaustion needs to be addressed in setuplxc configuration (Gary, Francesco, Benji): was causing hangs on 8-core machine. Tried replacing /dev/random with /dev/urandom: insufficient, at least at first attempt. rng-tools worked, reproducibly:
- apt-get install rng-tools
- echo "HRNGDEVICE=/dev/urandom" >> /etc/default/rng-tools
- /etc/init.d/rng-tools start
- [Francis] [carried over] Get proposed deployment plan, as approved by Robert, also approved by mthaddon.
- We have yet to get any webops cycles for this.
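The in-progress methodology for the scaling assessment (one or two runs per core count, from 1 to 8 cores on the same machine) could be summarized roughly as follows once results exist. This is only a sketch: the function names and the dictionary layout are assumptions made for illustration, not part of testrepository or of our buildbot setup, and no measured per-core data is included because we do not have it yet.

```python
from statistics import mean


def speedup(baseline_minutes, parallel_minutes):
    """Return how many times faster the parallel run is than the baseline."""
    return baseline_minutes / parallel_minutes


def summarize(runs_by_cores):
    """Map each core count to (mean minutes, speedup, efficiency).

    runs_by_cores is a dict of core count -> list of wall-clock minutes
    (hypothetical layout). Speedup is measured against the smallest core
    count present; efficiency divides it by the factor of cores added.
    """
    base_cores = min(runs_by_cores)
    base = mean(runs_by_cores[base_cores])
    summary = {}
    for cores in sorted(runs_by_cores):
        avg = mean(runs_by_cores[cores])
        gain = speedup(base, avg)
        summary[cores] = (avg, gain, gain / (cores / base_cores))
    return summary


# The only figures reported so far: about 55 minutes on eight cores on EC2
# and about 6 hours on lpbuildbot. These are different machines, so this
# ratio is "not comparable but interesting", not a measured speedup.
print(round(speedup(6 * 60, 55), 1))
```

An efficiency that drops well below 1.0 as cores are added would indicate that extra cores are buying less and less, which is the incremental value we want to determine.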
Other accomplishments
- [Brad, Graham] Disposition of common Python code:
- Generic code was extracted to create python-shelltoolbox, a collection of helpers for interacting with the shell via Python. A PPA was created and is in ~yellow. Clint has accepted the task of redoing the packaging so that it is backwards compatible to Lucid and sponsoring it for inclusion in Ubuntu.
- Charm-specific code has been moved into the existing charm-tools source package. The packaging for it is a bit hairy, as that source package already builds two binary packages and the new Python bits would add a third. Clint offered suggestions on how to do it, but they didn’t work, so he has taken on the task of getting the packaging to work for our additions to charm-tools.
- Code specific to the buildbot charms, shared between master and slave, is currently duplicated via the locals.py files. This approach works but will be replaced by a PPA.
- [Francesco: slack] lpsetup:
- Added buildout files with testing support.
- Added unit tests (for argparser, handlers and utils)
- The env var LANG=C is set during the installation of launchpad developer packages. This way we can avoid installing language-pack-en.
- Updated the install and lxc-install subcommands to support a custom ssh key name. The root ssh key is no longer needed, so it is no longer created.
- Created the recipe for debian packaging.
Progress on tracked items
Completed by others
- LXC: 925024 - apparmor makes it impossible to install postgresql-common on Precise
- LXC: aufs option should be added to lxc-start-ephemeral
New and incomplete
- LXC 959352: Ephemeral containers have "/rootfs" prefix in /proc/self/maps entries HIGH OR CRITICAL
- Testrepository 949950 (mentioned but not filed last time): testrepository show full subunit stream of running tests HIGH
- Testrepository 957145: force amount of parallelization, overriding reported cores
- 961103: testrepository “String or Integer object expected for key, unicode found”
Carried over and incomplete
- 914166 - Zope layer setup and teardown 'tests' cannot be filtered by testr
- no activity in the last eight weeks
- RT 50242 - get a buildbot machine for testing
- actively in progress again
- We prefer EC2 for tests
Goals for next meeting
- Bug fixes for test failures discovered in parallel test runs. Already known targets:
- Launchpad
- 953913: test isolation error
- /dev/random exhaustion solution in setuplxc
- Testrepository/zope.testing
- 609986: subunit support for layer failures
- Buildbot improvements
- clean up old broken ephemeral lxc containers
- keep .testrepository data around between builds
- report failures more accurately
- make tests always randomly ordered
- Tracking
- LXC 959352
- Launchpad
- Deliver scaling assessment based on experimental results, using EC2 (carried over from previous two weeks)
- Get data center box running tests, and have a single comparison run with EC2.
- /dev/random exhaustion solution approved and installed
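For the /dev/random exhaustion item, one way to check whether a fix such as rng-tools is taking effect is to look at the kernel's entropy estimate while tests run. The sketch below assumes a Linux host that exposes /proc/sys/kernel/random/entropy_avail. The threshold is a hypothetical value chosen for illustration, not something we measured, and none of this is part of setuplxc.

```python
from pathlib import Path

# Linux exposes the kernel's current entropy estimate (in bits) here.
ENTROPY_PATH = Path("/proc/sys/kernel/random/entropy_avail")
LOW_WATERMARK = 200  # hypothetical threshold, not from our measurements


def parse_entropy(text):
    """Convert the file's contents (a number and a newline) to an int."""
    return int(text.strip())


def is_starved(bits, threshold=LOW_WATERMARK):
    """Return True when the entropy estimate is below the threshold."""
    return bits < threshold


def check(path=ENTROPY_PATH):
    """Read the entropy estimate and report (bits, starved)."""
    bits = parse_entropy(path.read_text())
    return bits, is_starved(bits)


if __name__ == "__main__":
    if ENTROPY_PATH.exists():
        bits, starved = check()
        print(f"entropy_avail={bits} starved={starved}")
```

Running this periodically on the host during a test run would show whether the pool is being drained when a hang occurs.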