Performance


Revision 9 as of 2010-08-11 06:40:23


Overview

Performance can mean many different things and as such is very hard to measure - but generally we know when something 'performs well': it feels responsive, it lets us achieve what we want quickly, it handles lots of users, and it makes efficient use of resources. Each of these we can measure.

This page focuses on the purely technical aspects of performance: responsiveness, capacity, throughput and efficiency. It aims to provide a common language for talking about challenges and improvements in these areas, and the style of tools available for dealing with them. There are other technology and challenge specific pages in this wiki, which can and should be consulted for the specifics about what a team is working on or the chosen technology for some part of the system.

This is a living document - please edit it and add to it freely! In writing it, I (RobertCollins) have drawn mainly on experiences with bzr, launchpad, squid, and various books : I'm not citing sources in the interests of brevity and clarity, but if there are bits in here that you disagree with, let me know and I'll dig up sources / discuss / whatever.

Responsiveness

Responsiveness is simply how fast or slow you get feedback on actions in the system. Depending on the layer you are working at, this may be web requests, UI changes, package builds, and so on.

We want Launchpad to be very responsive to interactive operations, so that users are not waiting for Launchpad. For batch operations we care less about responsiveness, but it's still nice to be responsive where we can. Managing a task through an asynchronous work queue does not by itself make it a batch operation: if it happens on behalf of an interactive user, it should still be considered an interactive task.
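To illustrate the distinction, here is a minimal sketch (all names are hypothetical, not Launchpad code) of a work queue that remembers whether a task was triggered by an interactive user, so queued work on a user's behalf can still be run ahead of batch work:

```python
from dataclasses import dataclass, field
from queue import PriorityQueue

# Illustrative priorities: interactive tasks sort ahead of batch tasks,
# even though both travel through the same asynchronous queue.
INTERACTIVE, BATCH = 0, 1

@dataclass(order=True)
class Task:
    priority: int
    name: str = field(compare=False)  # excluded from ordering

queue = PriorityQueue()
queue.put(Task(BATCH, "rebuild-search-index"))
queue.put(Task(INTERACTIVE, "send-merge-proposal-email"))

# The user-triggered task is dequeued first, despite being enqueued second.
first = queue.get()
```

The point is only that "goes via a queue" and "is batch work" are independent properties; the queue's ordering policy is what encodes the difference.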

We have an internal-only report (sorry - it includes confidential URLs at the moment) which shows how long server side requests take to complete. To estimate how long 99% of requests take to complete, add three times the standard deviation to the mean. Some of our requests are very fast - many are extremely fast - but many are also very slow.
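As a rough illustration of that rule of thumb, here is a sketch in Python. The sample durations are invented, and note that mean-plus-three-standard-deviations assumes a roughly normal distribution - request times are usually long-tailed - so treat the result as a coarse bound, not a true 99th percentile:

```python
import statistics

# Invented request durations in seconds, standing in for the report's data.
durations = [0.12, 0.18, 0.25, 0.30, 0.45, 0.60, 0.95, 1.40, 2.10, 4.80]

mean = statistics.mean(durations)
stddev = statistics.stdev(durations)

# The rule of thumb from the report: ~99% of requests finish within this.
threshold = mean + 3 * stddev
```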

Things that help perceived responsiveness:

Things that hinder responsiveness (red flags):

Capacity

Launchpad is the hosting site for thousands of projects and the Ubuntu distribution. It is important that it be able to handle the entire set of users, and grow gracefully as we add features or more users start using it. This is often what people mean when they talk of 'scaling' in the context of a web site. Scaling can also refer to just handling very large datasets, though, and we have that in Launchpad too - the bugs database, for instance, is effectively write-only, so it grows and grows.

Capacity is related to responsiveness - when a system exceeds its capacity, its responsiveness is often one of the first things to go downhill.

We have some graphs in the lpstats system (sorry, also staff only) that show system load on the database servers, web servers and so on. These, combined with the request counts, buildd queue lengths and similar metrics, can give a feel for our workload and how we're managing it.

For now, we can only really tell when we're hitting a specific capacity issue when some part of the system breaks.

However, there are a number of things which can help with capacity:

And some things that can be particularly painful:

Throughput & Efficiency

Throughput is about the total amount of work you can put through the system; efficiency is about how much responsiveness/capacity/throughput we get per dollar. If we can increase efficiency, we can handle more users and projects without hardware changes - which is nice, as that reduces migrations and downtime.

When we increase responsiveness, for instance by moving work out of a web transaction, we interact with throughput. For most of Launchpad throughput isn't a primary metric: responsiveness is. However, for some resources, like the buildds, throughput is the primary metric, because we expect the load on the system to vary wildly and often exceed our capacity for short periods of time.

The primary tool we have for improving throughput is efficient code / database schemas / components. A highly responsive, high capacity, efficient system typically has pretty good throughput.

Current themes

As we work on Launchpad performance, we need to focus on specific areas one at a time, to make sure they are solid, under control and out of the way before we pick up a new challenge.

The current theme is appserver rendering time. This is a three part theme:

5 second hard timeout

The hard timeout determines when we release resources from a long running transaction. The longer it is, the more a bad page can impact other pages and appservers. Setting this limit to 5 seconds will help reduce contention in the database and means that things which go wrong don't hang on the user for long periods of time.

To get there we are ratcheting down the database timeout (which is how this limit is configured) in small steps, whenever the page performance report and hourly OOPS graphs show that we have a decent buffer.
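A hypothetical sketch of that decision rule - the 20% buffer and the function name are assumptions for illustration, not the real deployment policy:

```python
def can_ratchet_down(current_timeout, slowest_common_page, buffer=0.2):
    """Return True if the slowest routinely-seen page finishes with at
    least `buffer` headroom below the current hard timeout.

    Illustrative only: the real decision uses the page performance
    report and hourly OOPS graphs, not a single number.
    """
    return slowest_common_page <= current_timeout * (1 - buffer)

# With a 9s timeout and pages peaking around 6.5s there is room to step
# down; with pages peaking at 7.5s there is not.
```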

1 second rendering for the landing pages for 99% of requests

The second part is to make the landing pages for our apps - /project /projectgroup /distro /distro/+source/sourcepackage - render in 1 second for 99% of requests. This will provide a foundation - we'll know we have the facilities in place to deliver this page rendering time in each domain. This theme is actually pretty easy, but we're not looking at it until the first theme is complete.

1 second rendering for 99% of all requests

This last theme is where we bring the time it takes to use all of Launchpad down under 1 second. It's still over the horizon, but we can see that it's possible.

Web

The foundations performance work on page loads and SSL is very relevant here. For web performance we're primarily interested in how long a page takes to become usable for the user.

Our page performance report, mentioned under responsiveness above, is only one tool - but a necessary one - for determining the responsiveness of web pages. Many factors impact web rendering and delivery time, but one thing is certain: a slow page render on the appserver guarantees a slow delivery time to the user.

API

Very similar to the Web as a topic area, but unlike browser pages, the requests a client makes depend on user code, not on the UI we give the user.

Currently the API suffers from a many-roundtrips latency multiplication effect, which also interacts badly with appserver responsiveness issues, but this should improve considerably as we progress through the performance themes.
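Some back-of-the-envelope arithmetic shows why round trips multiply latency. All numbers here are invented, and the batched call is a hypothetical optimisation, not an existing API endpoint:

```python
def total_time(requests, round_trip, server_time):
    """Total wall-clock time for sequential requests: each one pays the
    network round trip plus the server-side processing time."""
    return requests * (round_trip + server_time)

# 50 sequential API calls at 150ms round trip + 50ms server time each,
# versus one hypothetical batched call doing the same server-side work.
naive = total_time(50, 0.150, 0.050)
batched = total_time(1, 0.150, 50 * 0.050)
```

The server does identical work in both cases; the difference is entirely the 49 extra round trips, which is why appserver slowness and client round-trip count compound each other.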

Database

Our database generally performs well, but occasionally we have queries for which the planner comes up with terrible plans, or which lack an index. One critical thing to be aware of is that we have a master-slave replication system using Slony, which means that all writes occur on one machine: write-heavy codepaths may well be slow and suffer more contention.
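A minimal sketch of the routing consequence: writes must all hit the single master, while reads can spread across slaves. The connection names and the statement classifier are illustrative, not Launchpad's actual store configuration:

```python
import itertools

class Router:
    """Route SQL statements under Slony-style master-slave replication:
    anything that modifies data goes to the one master; reads round-robin
    over the read-only slaves."""

    def __init__(self, master, slaves):
        self.master = master
        self._slaves = itertools.cycle(slaves)

    def pick(self, sql):
        # Crude classifier for illustration: first keyword decides.
        is_write = sql.lstrip().split(None, 1)[0].upper() in {
            "INSERT", "UPDATE", "DELETE"}
        return self.master if is_write else next(self._slaves)

router = Router("master-db", ["slave-1", "slave-2"])
```

The sketch makes the contention point concrete: adding slaves scales read capacity, but every write-heavy codepath still funnels through `master-db`.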

Test

Our test performance suffers from two problems: the throughput is poor (2-4 hours to run the full suite), and the responsiveness is also poor (20 seconds to run a single test).

Folk are working on this piecemeal, but there isn't really a coordinated effort at the moment.
