November 2010 Quarterly Update
Landing and QA improvements
To support the Release Features When They Are Done effort, Maris, Ursula, and Diogo delivered a new component, qatagger, along with changes to ec2 test and bzr lp-land. Together these let developers participate in the QA system more easily, and let developers and LOSAs see the QA state and the most recent deployable revision based on QA.
We also changed ec2 test, bzr lp-land, and the PQM regex to enforce our orphan commit policies, requiring commits to be linked to a bug, or marked no-qa. This makes our QA process more robust.
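As a rough illustration of the orphan commit check, here is a minimal sketch in Python. The bracketed tag syntax (`[bug=NNN]`, `[no-qa]`) and the function name are assumptions made for this example; the actual PQM regex and the checks in ec2 test and bzr lp-land may differ.

```python
import re

# Hypothetical tag syntax: the policy only says a commit must be linked to a
# bug or marked no-qa. The exact bracketed forms below are assumptions.
BUG_TAG = re.compile(r"\[bug=\d+(?:,\d+)*\]")
NO_QA_TAG = re.compile(r"\[no-qa\]")


def is_orphan_commit(message):
    """Return True if the message has neither a bug link nor a no-qa marker."""
    return not (BUG_TAG.search(message) or NO_QA_TAG.search(message))
```

A landing tool could call a check like this before submitting and refuse any commit message for which it returns True.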
Maris, Diogo, and Ursula are now working on implementing the Simplify Merge Machinery proposal, with Paul Hummer's assistance. This will hopefully increase the Launchpad team's velocity. We will have measurements to validate the experiment: we will compare the average time between a merge proposal being approved and its landing on stable, before the change and after it.
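The comparison could be computed along these lines. This is only a sketch: the function name and the shape of the input (pairs of approval and landing timestamps) are assumptions, not a description of how the measurements are actually collected.

```python
from datetime import timedelta


def mean_landing_delay(proposals):
    """Average time from approval to landing on stable.

    `proposals` is a list of (approved_at, landed_at) datetime pairs;
    this input shape is an assumption made for the sketch.
    """
    if not proposals:
        raise ValueError("need at least one landed proposal")
    # Sum the individual delays, starting from a zero timedelta.
    total = sum(
        (landed - approved for approved, landed in proposals),
        timedelta(),
    )
    return total / len(proposals)
```

Running it once over proposals that landed before the change and once over those that landed after gives the two averages to compare.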
The OOPS tools and related error reporting got some maintenance work, including changes to improve grouping of webservice bugs and performance. We started planning out some ideas on how to make the OOPS tools and the OOPS reports more effective for both communication and research, but decided to move to the Simplify Merge Machinery effort, above, first. We hope to return to improving the OOPS tools in the new year.
Lucid/Python 2.6/Slony/Postgres upgrade
In the last quarterly report, we reported that the software was ready for Lucid, and the associated upgrade to Python 2.6, Slony, and Postgres 8.4. By mid October, the LOSAs had upgraded all Launchpad machines.
Each individual machine migration was fairly smooth. However, the larger process had several significant problems.
- During the months-long period between the software being ready and the production machines being updated, we had developers, ec2 tests, PQM, buildbot, staging, and production all varying in whether they ran Hardy/2.5 or Lucid/2.6. This caused much pain for developers, particularly towards the end, when we introduced 2.6-isms and the only indication of a problem was edge or staging being broken.
- While Postgres 8.4 may have mitigated the problems of some slow queries, it provoked new problems. These caught us by surprise.
- We had no story for rolling back the Lucid/Slony/Postgres 8.4 upgrade. It was all or nothing.
Robert has requested an analysis of these failures, such as a root cause analysis. We have not yet performed this analysis. The following are preliminary notes in that direction:
- The switch took the IS team longer than the Foundations team expected (months rather than weeks after the software was ready). This arguably turned the first bullet point above from an annoyance to a significant problem. Why did we misjudge the time for that effort? Similarly, why did it take so long? Why did it happen so gradually--that is, even if it took a long time to start, why wasn't the switch more compressed in time from start to stop?
- The switch between LTS releases, Hardy to Lucid, did not inherently have a smooth transition available between Python versions, Slony versions, or Postgres versions. This was noted to Ubuntu developers before the 2009 All Hands meeting, and discussed there. The solution agreed upon required less work at the expense of higher risk. Was this reasonable?
Webservice
Leonard and Benji have primarily focused on bug 532055, providing a user-friendly way of authorizing desktop, GUI launchpadlib applications. Communication externally with Ubuntu developers and internally with the Launchpad team about everything from user interface to security to implementation took up the vast majority of time for this effort, but the last branches should be going through review now.
Benji is keen on improving webservice documentation, and is making simple but important strides there now.
Diogo worked on a service level agreement for our webservice clients, and Maris and Diogo worked on testing some of our most important webservice clients. Concretely, this effort led to branches from the Foundations team that added tests and test improvements for apport and a few OEM scripts, and to advice on how to test applications and scripts based on our webservice. Work on the SLA will hopefully be published soon, along with some of our further plans to help with testing.
Leonard and Benji are about to begin some long-planned work to improve the usability of the webservice, and look forward to addressing some long-standing bugs in that regard.
OpenID
Stuart made some significant changes to address the OpenID bugs discussed in the last quarterly report. We closed the majority of the high-priority bugs, and now have a clear path to solving the other high-priority ones that existed then.
The change did cause unanticipated problems with SSO. Bug 644824 tracks these problems, which are on their way to being addressed.
Launchpad performance
Stuart finished his database report work.
- The performance report got some unexpected help from Francis, who is getting the monthly and weekly performance reports working.
Gary and Maris implemented ++profile++ (598289). It needs to be made available on staging/qastaging for authorized users before it is as useful as it could be. It also would ideally be usable in the webservice.
- As mentioned in the last report, according to our webpagetest.org tests, for the average page, the most obvious way to speed up our pages is in networking and client-side optimizations. Robert and James Troupe have reportedly made progress on a private VPN experiment to see if an approach like that can help our SSL and static resource costs, both of which are surprisingly high.
Stuart has made some improvements to the memcached tal integration (634326, 634646).
- As described in the last quarterly report, the alternate template implementation named "Chameleon" seems to give about a 15% speed increase on average on the server. Gary has investigated the technical challenges necessary to take this further, and is beginning the effort now.
Test speedups
Stuart did not get around to investigating his long-proposed Mock Database work, but still hopes to.
Build improvements
We are now using a released version of buildout and dependencies. We've only begun to take advantage of the features in our own packages, and in fact we need to return to buildout to address some problems reported by other teams with the new features.
We're still hoping to find time to switch away from the local download-cache in favor of a shared stash of our sdists somewhere in the data center. This hopefully just takes a small amount of work, and once it is done it should make some things simpler for lazr packages and Launchpad's build too (and remove the download-cache-as-bzr-branch that has annoyed several folks).
Smoldering Fires
The Librarian's memory leak (556245) now has a meliae dump for us to look at, thanks to the LOSAs. We'll be doing that soon.
We've had two recent instances of production hanging (669296). We've started investigating that.