August 2010
Lucid/Python 2.6 upgrade
Software is ready. LOSAs are working on the upgrade. Foundations may be responsible for some of the image upgrades, but this is mostly now a LOSA task.
Slony and Postgres upgrade
The software changes have been done for both of these. We are going to upgrade Slony first. spm has backported the Slony packages from Lucid, and they will be deployed to Hardy very soon, hopefully in the next week or two. After that, the LOSAs will hopefully be able to migrate the DB servers to Lucid. At the moment this is out of Foundations' hands.
Webservice
Leonard stopped the performance effort after making some nice gains (https://dev.launchpad.net/Foundations/Webservice/Performance).
- Leonard has started work on the desktop integration work (a script for managing OAuth tokens without opening a web browser).
Leonard and Benji are working on changing the len implementation for Soyuz bug 590708, which I incorrectly handled as a critical bug, interrupting the webservice desktop integration.
After we are out of the 590708 hole, and have finished desktop integration, Leonard and Benji will be working with jml to identify webservice usability improvements. Example candidate bugs: 534363 274074 481090 487522 539070 541637 583761. Leonard and Benji will then work to implement the selected improvements in lazr.restful and in the Launchpad webservice itself.
Diogo is leading an effort to make our webservice clients easier to QA, with the help of Leonard, Martin Pitt and the bugs team (539705).
We're trying to make sure that webservice OOPSes and performance are reported clearly in our various tools (606184 and 607154).
OpenID bugs
We had just started trying to dig ourselves out of the OpenID bug hole at the epic (https://bugs.edge.launchpad.net/launchpad-foundations/+bugs?field.tag=openid), but other priorities pushed this work aside. The majority of these bugs stem from the fact that we need to be able to associate multiple accounts with a single person. Stuart will be tackling 580461 in the next few weeks. Hopefully that will set us up to knock down the rest of the bugs like dominoes.
Launchpad performance
- Reporting
We've already announced our performance report work. Stuart is finishing up some database report work. In general we've been very pleased with the information we've been able to gather from the tests we've assembled, and with the ability they give us to analyze our problems. We want to make sure that these reports are understandable and usable for everyone on the team who is interested, though, and we believe there's still some work there to be done.
Maris made progress with Robert on implementing ++profile++ (598289) but it needs to be picked back up before the work is complete.
- According to our webpagetest.org tests, the most obvious way to speed up the average page is through networking and client-side optimizations.
- After an initial cut at a risk analysis of removing HTTPS for some interactions with the site, we have rejected the idea. Instead, Robert and James Troupe are going to run a private VPN experiment to see if an approach like that can help our SSL and static resource costs, both of which are surprisingly high.
- Maris also has some client-side improvements he will attempt soon.
Finally we also want to implement an ability to test page dependencies in the browser using Windmill, so that we can make assertions about the number and size of resources a page loads in tests, both initially and on a page reload. We think that will help us write tests to maintain fixes that we need to do to our pages (see 609885, for instance).
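As a rough illustration of the kind of assertion described above, here is a minimal sketch in Python. It does not use Windmill's API: it assumes some test harness has already collected the list of resources a page load requested. The `Resource` record, the `check_page_budget` helper, the paths, and all of the numbers are hypothetical.

```python
from collections import namedtuple

# Hypothetical record of one resource requested during a page load.
Resource = namedtuple("Resource", ["url", "size_bytes"])


def check_page_budget(resources, max_count, max_total_bytes):
    """Return a list of budget violations; an empty list means the page passes."""
    problems = []
    total = sum(r.size_bytes for r in resources)
    if len(resources) > max_count:
        problems.append(
            "%d resources loaded, budget is %d" % (len(resources), max_count))
    if total > max_total_bytes:
        problems.append(
            "%d bytes loaded, budget is %d" % (total, max_total_bytes))
    return problems


# Made-up data standing in for what a browser test might record.
first_load = [
    Resource("/page", 12000),
    Resource("/style.css", 40000),
    Resource("/app.js", 90000),
]
# On a reload, cached static resources should not be fetched again.
reload_load = [Resource("/page", 12000)]

first_load_problems = check_page_budget(
    first_load, max_count=5, max_total_bytes=200000)
reload_problems = check_page_budget(
    reload_load, max_count=1, max_total_bytes=20000)
```

Keeping separate budgets for the initial load and the reload matches the two cases we want to cover: the first catches pages that pull in too much, the second catches caching regressions.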
The work on reducing timeouts is even more important than I had thought: the outliers really are a big problem. According to our graphs, the average server page rendering time in Launchpad is in fact not inexcusably slow at all (median of 0.15 seconds, average of 0.36 seconds). Improving the network and client side, as discussed above, will make a big difference for the average page. However, arguably the bigger problem in our usability (at least for users relatively close to the London data center) does appear to be the edge cases on the server side that are the outliers on our graphs. They happen fairly infrequently, but Launchpad is so big and has so many users that a relatively small percentage of requests can be a big problem for one of the many constituents that we care about.
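To show how a few outliers can pull the average well above the median, here is a small Python sketch. The render times are invented for illustration and are not Launchpad measurements; the `summarize` helper and the one-second threshold are likewise assumptions.

```python
from statistics import mean, median


def summarize(times, slow_threshold):
    """Summarize render times and the share of requests above a threshold."""
    slow = [t for t in times if t > slow_threshold]
    return {
        "median": median(times),
        "mean": mean(times),
        "slow_share": len(slow) / len(times),
    }


# Hypothetical data: 95 fast requests and 5 slow outliers, in seconds.
render_times = [0.15] * 95 + [4.0] * 5
summary = summarize(render_times, slow_threshold=1.0)

print("median: %.2f s" % summary["median"])
print("mean: %.2f s" % summary["mean"])
print("share above threshold: %.0f%%" % (summary["slow_share"] * 100))
```

In this made-up sample only 5% of requests are slow, yet they more than double the mean relative to the median. That is the same shape as the numbers above: a healthy typical page alongside a small set of requests that hurt badly.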
- Stuart is giving his assistance on some SQL aspects of this.
- We do plan to retry the Chameleon integration in the next few months. Chameleon seems to give about a 15% speed increase on average on the server. It should speed up a few pages even more, perhaps including the problem page that Danilo showed me in Translations.
- We also have some ideas on how to make the OOPS tools and the OOPS reports more effective for both communication and research.
Workflow changes
Robert is leading the way to a number of great workflow changes, mostly centered on continuous deployment, and we're supporting him with changes to the related machinery. Here are some details on progress for the changes that we are a part of now.
- Ursula is finishing the changes to QA tagging that will allow marking commits as incremental or just unQAable, extinguishing orphan commits.
Continuous QA will hopefully reduce one source of emergencies during deployment, and is a step towards continuous deployment. Maris, Diogo, and Ursula are starting work on this now. Also see how devs will interact with QA.
We want to simplify the merge machinery, eliminating buildbot and testfix mode and thereby hopefully reducing emergencies for the build engineer. Robert plans to hack PQM to give this pattern a whirl soon, and Paul and Tim aim to get Merge Queues and tarmac ready to implement the pattern just a bit later.
Stuart will be working on making the cronscripts more robust, which is part of what we need for smoother deployments (607391).
Test speedups
Stuart is planning to investigate his long-proposed Mock Database work.
Build improvements
The changes we made to Buildout were accepted upstream by the maintainer, but a beta release caused issues, particularly for people who used Buildout with virtualenv. A few people (Tim, for instance) reported issues that have been fixed in the software but not yet deployed to Launchpad. A new beta release of Buildout should go out publicly this week, and be incorporated into Launchpad this week or after the upcoming LP release. This will be good for Landscape too, it seems.
This also should be a small part of some long-needed simplification and cleanup of our lazr packages to make the build easier to use and understand. Maris has been leading the cleanups there.
We're also hoping to find time to switch away from the local download-cache in favor of a shared stash of our sdists somewhere in the data center. This hopefully just takes a small amount of work, and once it is done it should make some things simpler for lazr packages and Launchpad's build too (and remove the download-cache-as-bzr-branch that has annoyed several folks).
Smoldering Fires
Robert's trying to improve our search. I'm sure that will take some of Stuart's time, at the least.
The Librarian needs some love. Robert is giving it some, bless him. There's a memory leak that needs to be addressed too (556245).
App server machines are going into swap every week or two now. We closed one memory leak, but there's at least one more slow leak that we know about. We will probably need to tackle this again soon.