Performance


Revision 9 as of 2010-08-11 06:40:23


Overview

Performance can mean many different things and as such is very hard to measure - but generally we know when something 'performs well': it feels responsive, it lets us achieve what we want quickly, it handles lots of users, and it makes efficient use of resources. Each of these we can measure.

This page focuses on the purely technical aspects of performance: responsiveness, capacity, throughput and efficiency. It aims to provide a common language for talking about challenges and improvements in these areas, and the style of tools available for dealing with them. There are other technology and challenge specific pages in this wiki, which can and should be consulted for the specifics about what a team is working on or the chosen technology for some part of the system.

This is a living document - please edit it and add to it freely! In writing it, I (RobertCollins) have drawn mainly on experiences with bzr, launchpad, squid, and various books : I'm not citing sources in the interests of brevity and clarity, but if there are bits in here that you disagree with, let me know and I'll dig up sources / discuss / whatever.

Responsiveness

Responsiveness is simply how fast or slow you get feedback on actions in the system. Depending on the layer you are working at, this may be web requests, UI changes, package builds, and so on.

We want Launchpad to be very responsive to interactive operations, so that users are not waiting for Launchpad. For batch operations we care less about responsiveness, but it's still nice to be responsive where we can. Managing a task through an asynchronous work queue does not by itself make it a batch operation: if it happens on behalf of an interactive user, it should still be considered an interactive task.
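To illustrate the distinction, here is a minimal sketch (all names are hypothetical, not Launchpad code) of a work queue that remembers whether a task was triggered by an interactive user, so queued work on a user's behalf can still be run ahead of batch work:

```python
from dataclasses import dataclass, field
from queue import PriorityQueue

# Illustrative priorities: interactive tasks sort ahead of batch tasks,
# even though both travel through the same asynchronous queue.
INTERACTIVE, BATCH = 0, 1

@dataclass(order=True)
class Task:
    priority: int
    name: str = field(compare=False)  # excluded from ordering

queue = PriorityQueue()
queue.put(Task(BATCH, "rebuild-search-index"))
queue.put(Task(INTERACTIVE, "send-merge-proposal-email"))

# The user-triggered task is dequeued first, despite being enqueued second.
first = queue.get()
```

The point is only that "goes via a queue" and "is batch work" are independent properties; the queue's ordering policy is what encodes the difference.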

We have an internal-only report (sorry - it includes confidential URLs at the moment) which shows how long server side requests take to complete. To estimate how long 99% of requests take to complete, add three times the standard deviation to the mean. Some of our requests are very fast - many are extremely fast - but many are also very slow.
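As a rough illustration of that rule of thumb, here is a sketch in Python. The sample durations are invented, and note that mean-plus-three-standard-deviations assumes a roughly normal distribution - request times are usually long-tailed - so treat the result as a coarse bound, not a true 99th percentile:

```python
import statistics

# Invented request durations in seconds, standing in for the report's data.
durations = [0.12, 0.18, 0.25, 0.30, 0.45, 0.60, 0.95, 1.40, 2.10, 4.80]

mean = statistics.mean(durations)
stddev = statistics.stdev(durations)

# The rule of thumb from the report: ~99% of requests finish within this.
threshold = mean + 3 * stddev
```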

Things that help perceived responsiveness:

Things that hinder responsiveness (red flags):

Capacity

Launchpad is the hosting site for thousands of projects and the Ubuntu distribution. It is important that it be able to handle the entire set of users, and grow gracefully as we add features or more users start using it. This is often what people mean when they talk of 'scaling' in the context of a web site. Scaling can also refer to just handling very large datasets, though, and we have that in Launchpad too - the bugs database, for instance, is effectively write-only, so it grows and grows.

Capacity is related to responsiveness - when a system exceeds its capacity, its responsiveness is often one of the first things to go downhill.

We have some graphs in the lpstats system (sorry, also staff only) that show system load on the database servers, web servers and so on. These, combined with the request counts, buildd queue lengths and similar metrics, can give a feel for our workload and how we're managing it.

For now, we can only really tell when we're hitting a specific capacity issue when some part of the system breaks.

However, there are a number of things which can help with capacity:

And some things that can be particularly painful:

Throughput & Efficiency

Throughput is about the total amount of work you can put through the system; efficiency is about how much responsiveness/capacity/throughput we get per dollar. If we can increase efficiency, we can handle more users and projects without hardware changes - which is nice, as that reduces migrations and downtime.

When we increase responsiveness, for instance by moving work out of a web transaction, we interact with throughput. For most of Launchpad throughput isn't a primary metric: responsiveness is. However, for some resources, like the buildds, throughput is the primary metric, because we expect the load on the system to vary wildly and often exceed our capacity for short periods of time.

The primary tool we have for improving throughput is efficient code / database schemas / components. A highly responsive, high capacity, efficient system typically has pretty good throughput.

Current themes

As we work on Launchpad performance, we need to focus on specific areas one at a time, to make sure they are solid, under control and out of the way before we pick up a new challenge.

The current theme is appserver rendering time. This is a three part theme:

5 second hard timeout

The hard timeout determines when we release resources from a long running transaction. The longer it is, the more a bad page can impact other pages and appservers. Setting this limit to 5 seconds will help reduce contention in the database and means that things which go wrong don't hang on the user for long periods of time.

To get there we are ratcheting down the database timeout (which is how this limit is configured) in small steps, whenever the page performance report and hourly OOPS graphs show that we have a decent buffer.
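A hypothetical sketch of that decision rule - the 20% buffer and the function name are assumptions for illustration, not the real deployment policy:

```python
def can_ratchet_down(current_timeout, slowest_common_page, buffer=0.2):
    """Return True if the slowest routinely-seen page finishes with at
    least `buffer` headroom below the current hard timeout.

    Illustrative only: the real decision uses the page performance
    report and hourly OOPS graphs, not a single number.
    """
    return slowest_common_page <= current_timeout * (1 - buffer)

# With a 9s timeout and pages peaking around 6.5s there is room to step
# down; with pages peaking at 7.5s there is not.
```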

1 second rendering for the landing pages for 99% of requests

The second part is to make the landing pages for our apps - /project /projectgroup /distro /distro/+source/sourcepackage - render in 1 second for 99% of requests. This will provide a foundation - we'll know we have the facilities in place to deliver this page rendering time in each domain. This theme is actually pretty easy, but we're not looking at it until the first theme is complete.

1 second rendering for 99% of all requests

This last theme is where we bring the time it takes to use all of Launchpad down under 1 second. It's still over the horizon, but we can see that it's possible.

Web

The foundations performance work on page loads and SSL is very relevant here. For web performance we're primarily interested in how long a page takes to become usable for the user.

Our page performance report, mentioned under responsiveness above, is only one tool - but a necessary one - for determining the responsiveness of web pages. Many factors impact web rendering and delivery time, but one thing is certain: a slow page render on the appserver guarantees a slow delivery time to the user.

API

Very similar to the Web as a topic area, but unlike browser pages, the requests a client makes depend on user code, not on the UI we give the user.

Currently the API suffers from a many-roundtrips latency multiplication effect, which also interacts badly with appserver responsiveness issues, but this should improve considerably as we progress through the performance themes.
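Some back-of-the-envelope arithmetic shows why round trips multiply latency. All numbers here are invented, and the batched call is a hypothetical optimisation, not an existing API endpoint:

```python
def total_time(requests, round_trip, server_time):
    """Total wall-clock time for sequential requests: each one pays the
    network round trip plus the server-side processing time."""
    return requests * (round_trip + server_time)

# 50 sequential API calls at 150ms round trip + 50ms server time each,
# versus one hypothetical batched call doing the same server-side work.
naive = total_time(50, 0.150, 0.050)
batched = total_time(1, 0.150, 50 * 0.050)
```

The server does identical work in both cases; the difference is entirely the 49 extra round trips, which is why appserver slowness and client round-trip count compound each other.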

Database

Our database generally performs well, but occasionally we have queries for which the planner comes up with terrible plans, or which lack an index. One critical thing to be aware of is that we have a master-slave replication system using Slony, which means that all writes occur on one machine: write-heavy codepaths may well be slow and suffer more contention.
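A minimal sketch of the routing consequence: writes must all hit the single master, while reads can spread across slaves. The connection names and the statement classifier are illustrative, not Launchpad's actual store configuration:

```python
import itertools

class Router:
    """Route SQL statements under Slony-style master-slave replication:
    anything that modifies data goes to the one master; reads round-robin
    over the read-only slaves."""

    def __init__(self, master, slaves):
        self.master = master
        self._slaves = itertools.cycle(slaves)

    def pick(self, sql):
        # Crude classifier for illustration: first keyword decides.
        is_write = sql.lstrip().split(None, 1)[0].upper() in {
            "INSERT", "UPDATE", "DELETE"}
        return self.master if is_write else next(self._slaves)

router = Router("master-db", ["slave-1", "slave-2"])
```

The sketch makes the contention point concrete: adding slaves scales read capacity, but every write-heavy codepath still funnels through `master-db`.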

Test

Our test performance suffers from two problems: the throughput is poor (2-4 hours to run the full suite), and the responsiveness is also poor (20 seconds to run a single test).

Folk are working on this piecemeal, but there isn't really a coordinated effort at the moment.
