Overview

Performance can mean many different things, and as such is very hard to measure - but generally we know when something 'performs well': it feels responsive, it lets us achieve what we want quickly, it handles lots of users, and it makes efficient use of resources. We can measure all of these.

This page focuses on the purely technical aspects of performance: responsiveness, capacity, throughput and efficiency. It aims to provide a common language for talking about challenges and improvements in these areas, and the styles of tools available for dealing with them. There are other technology- and challenge-specific pages in this wiki, which can and should be consulted for the specifics of what a team is working on or the technology chosen for some part of the system.

This is a living document - please edit it and add to it freely! In writing it, I (Robert Collins) have drawn mainly on experiences with bzr, launchpad, squid, and various books; I'm not citing sources in the interests of brevity and clarity, but if there are bits in here that you disagree with, let me know and I'll dig up sources / discuss / whatever.

Fixing performance problems

First, read the definitions and context section to have a common language for discussing problems.

Secondly, make sure you've gathered some data:

Then, pick from this menu:

Page responsiveness - a checklist

This checklist is ordered by how often each sort of problem was affecting us as of January 2011 - multiple causes may affect a single page, so make sure you've checked them all and evaluated their relative impact before deciding on your course to improve the page.

  1. Check the query count for the page. Is it under 30?
    • If it's over 30 (or thereabouts), the page is probably suffering from late evaluation, also known as death-by-SQL: the page pays the overhead of talking to the DB server many times, and makes the DB perform repeated work by looking up individual items rather than sets. See Database/Performance for specifics on how to avoid late evaluation; there's also a concrete before-and-after sketch of this pattern in the Web section below.

  2. Is a single query performing slowly? (Or several queries, if late evaluation is also occurring.)

    The database statistics may need a tweak, your query may be tricking the planner into a bad plan, or you may be working with something that is not (or cannot be) indexed appropriately. This is a rich topic all of its own - see Database/Performance, and the EXPLAIN sketch after this checklist. Often the query can be improved without altering the schema. Sometimes schema changes are needed (and we should make them).

  3. Doing batch work (noninteractive) in the webapp? Some examples which may fall into this category: sending emails, updating memos in the system, making web requests to external services. Often the problem isn't that you are doing the work *per se*, it's that the amount of work is not controlled by the webapp - it's controlled by our users. For instance, sending emails is OK as long as we're sending only a few. Sending 3000 to the members of a big super-team becomes a failure, because it violates the principle of constant work for web pages.
    • You could use the job system (see the sketch at the end of this checklist).

    • You could redefine what gets done (for instance, do people really want an email when someone leaves a team?)
    • You could make what you do drastically more efficient (this is often very hard and eventually something will scale us past it again).
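
For item 2 in the checklist, the first diagnostic step is usually to look at the plan the database chose. A minimal sketch using psycopg2 - the DSN and the query are placeholders, not our real configuration or schema:

    # Sketch: asking PostgreSQL how it plans to run a slow query.
    # The DSN, table and column are placeholders for whatever you're debugging.
    import psycopg2

    conn = psycopg2.connect("dbname=launchpad_dev")  # placeholder DSN
    cur = conn.cursor()

    # EXPLAIN ANALYZE runs the query and reports the chosen plan with
    # per-node timings; planner mis-estimates show up as large gaps
    # between estimated and actual row counts.
    cur.execute(
        "EXPLAIN ANALYZE SELECT * FROM bug WHERE date_created > %s",
        ("2011-01-01",))
    for (line,) in cur.fetchall():
        print(line)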
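
For item 3, the general shape of the fix is: the web request records the work, and a worker outside the web transaction performs it. A sketch of the pattern with a hypothetical jobs table and hypothetical helpers - not the real job system's API:

    # Sketch of constant work in the webapp: the request does one INSERT
    # however big the team is; a worker does the unbounded part later.
    # Table names and send_email are hypothetical, not the job system's API.

    def request_team_notification(cur, team_id, template):
        # Runs inside the web request: constant work, whatever the team size.
        cur.execute(
            "INSERT INTO jobs (job_type, team_id, template)"
            " VALUES (%s, %s, %s)",
            ("notify_team", team_id, template))

    def run_pending_jobs(cur, send_email):
        # Runs in a worker process: the user-controlled amount of work
        # happens here, off the interactive path.
        cur.execute(
            "SELECT id, team_id, template FROM jobs WHERE job_type = %s",
            ("notify_team",))
        for job_id, team_id, template in cur.fetchall():
            cur.execute(
                "SELECT email FROM team_members WHERE team_id = %s",
                (team_id,))
            for (email,) in cur.fetchall():
                send_email(email, template)
            cur.execute("DELETE FROM jobs WHERE id = %s", (job_id,))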

? Your topic here

Definitions and context

Responsiveness

Responsiveness is simply how fast or slow you get feedback on actions in the system. Depending on the layer you are working at, this may be web requests, UI changes, package builds, etc.

We want Launchpad to be very responsive to interactive operations, so that users are not waiting for Launchpad. For batch operations we care less about responsiveness, though it's still nice to be responsive where we can. Managing a task through an asynchronous work queue does not by itself make the task a batch operation: if it happens on behalf of an interactive user, it should still be considered an interactive task.

We have a page performance report (internal only for now - an RT ticket is open to make part of it public; the whole report can expose confidential names of things) which shows how long server-side requests take to complete. It shows the slowest pages, and the 99% column combined with the page hit counts are very interesting things to sort by.
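
For clarity, the 99% column is the 99th percentile: the time under which 99% of sampled requests completed. A tiny illustration with made-up durations:

    # Sketch of what the 99% column means. The durations are made up.
    def percentile(samples, fraction):
        ordered = sorted(samples)
        index = min(len(ordered) - 1, int(len(ordered) * fraction))
        return ordered[index]

    request_times = [0.21, 0.35, 0.28, 0.40, 4.8, 0.25, 0.31, 0.22, 0.27, 9.5]
    # The 99th percentile is dominated by the rare, very slow requests.
    print(percentile(request_times, 0.99))  # -> 9.5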

Things that help perceived responsiveness:

Things that hinder responsiveness (red flags):

Capacity

Launchpad is the hosting site for thousands of projects and the Ubuntu distribution. It is important that it be able to handle the entire set of users, and grow gracefully as we add features and/or as more users start using it. This is often what people mean when they talk of 'scaling' in the context of a web site. Scaling can also refer to just handling very large datasets, though, and we have that in Launchpad too - the bugs database, for instance, is write-only, so it grows and grows.

Capacity is related to responsiveness - when a system exceeds its capacity, its responsiveness is often one of the first things to go downhill.

We have some graphs in the lpstats system (sorry, also staff only) that show system load on the database servers, web servers, etc. These, combined with the request counts, buildd queue lengths and so on, can give a feel for our workload and how we're managing it.

For now, we can only really tell that we're hitting a specific capacity limit when some part of the system breaks.

However, there are a number of things which can help with capacity:

And some things that can be particularly painful:

Throughput & Efficiency

Throughput is about the total amount of work you can put through the system; efficiency is about how much responsiveness/capacity/throughput we get per dollar. If we can increase efficiency, we can handle more users and projects without hardware changes - which is nice, as that reduces migrations and downtime.

When we increase responsiveness, for instance by moving work out of a web transaction, we interact with throughput. For most of Launchpad throughput isn't a primary metric: responsiveness is. However, for some resources, like the buildds, throughput is the primary metric, because we expect the load on the system to vary wildly and often exceed our capacity for short periods of time.

The primary tool we have for improving throughput is efficient code / database schemas / components. A highly responsive, high capacity, efficient system typically has pretty good throughput.

Current themes

As we work on Launchpad performance, we need to focus on specific areas one at a time, to make sure they are solid, under control and out of the way before we pick up a new challenge.

The current theme is appserver rendering time. This is a three-part theme:

5 second hard timeout

The hard timeout determines when we release resources from a long-running transaction. The longer it is, the more a bad page can impact other pages and appservers. Setting this limit to 5 seconds will help reduce contention in the database, and means that things which go wrong don't hang on the user for long periods of time.

To get to 5 seconds we are ratcheting down the database timeout (which is how this is configured) in small steps, whenever the page performance report and hourly OOPS graphs show that we have a decent buffer.
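
A database statement timeout behaves like this - a minimal sketch with psycopg2 (the DSN is a placeholder), showing a too-slow query being cancelled server-side so it stops holding resources:

    # Sketch: a PostgreSQL statement timeout of the kind the hard timeout
    # is built on. The DSN is a placeholder.
    import psycopg2

    conn = psycopg2.connect("dbname=launchpad_dev")  # placeholder DSN
    cur = conn.cursor()
    cur.execute("SET statement_timeout = 5000")  # milliseconds

    try:
        cur.execute("SELECT pg_sleep(10)")  # deliberately too slow
    except psycopg2.extensions.QueryCanceledError:
        # The server cancelled the query at the 5 second mark; roll back so
        # the transaction's locks and resources are released promptly.
        conn.rollback()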

1 second rendering for the landing pages for 99% of requests

The second part is to make the landing pages for our apps - /project /projectgroup /distro /distro/+source/sourcepackage - render in 1 second for 99% of requests. This will provide a foundation - we'll know we have the facilities in place to deliver this page rendering time in each domain. This theme is actually pretty easy, but we're not looking at it until the first theme is complete.

1 second rendering for 99% of all requests

This last theme is where we bring the time it takes to use all of Launchpad down under 1 second. It's still some way off, but we can see that it's possible.

Web

The foundations performance work on page loads and SSL is very relevant here. For web performance we're primarily interested in how long a page takes to become usable by the user.

Our page performance report, mentioned under responsiveness above, is only one tool - but a necessary one - in determining the responsiveness of web pages. Many factors impact web rendering and delivery time, but one thing is certain: a slow page render on the appserver guarantees a slow delivery time to the user.

One important thing to watch out for is any high-latency loop: for instance, requesting a single database row per object in a loop (e.g. for person in team: if person.is_ubuntu_coc_signer: do_something()). This will frequently perform very poorly as soon as a large team is involved, and will probably force unnecessary batching, which makes things more complicated. We want constant overhead per page as a guiding principle - even though we may show more people on one page than another, the total work done should be pretty much constant (if you hand-wave away the DB server).
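
To make that concrete, here is the loop written both ways - a minimal sketch using psycopg2, with hypothetical person and team_members tables rather than our real schema:

    # Sketch of the high-latency loop and its set-based replacement.
    # The DSN, tables and columns are hypothetical.
    import psycopg2

    conn = psycopg2.connect("dbname=launchpad_dev")  # placeholder DSN
    cur = conn.cursor()

    def do_something(person_id):
        pass  # stand-in for the real per-person work

    def slow_way(team_id):
        # One round trip per member: latency grows with team size.
        cur.execute(
            "SELECT person_id FROM team_members WHERE team_id = %s",
            (team_id,))
        for (person_id,) in cur.fetchall():
            cur.execute(
                "SELECT is_ubuntu_coc_signer FROM person WHERE id = %s",
                (person_id,))
            if cur.fetchone()[0]:
                do_something(person_id)

    def fast_way(team_id):
        # One round trip total: constant overhead per page.
        cur.execute(
            "SELECT p.id FROM person p"
            " JOIN team_members tm ON tm.person_id = p.id"
            " WHERE tm.team_id = %s AND p.is_ubuntu_coc_signer",
            (team_id,))
        for (person_id,) in cur.fetchall():
            do_something(person_id)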

API

Very similar to the Web as a topic area, but unlike browser pages, the requests a client makes depend on user code, not on the UI we give the user.

Currently the API suffers from a many-roundtrips latency multiplication effect, which also interacts badly with appserver responsiveness issues, but this should improve considerably as we progress through the performance themes.
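
The effect is easy to model: total time is roughly requests × (round-trip latency + server time). A back-of-envelope sketch; every number here is an illustrative assumption, not a measurement of Launchpad:

    # Back-of-envelope model of the many-roundtrips effect.
    # All numbers are assumptions for illustration only.
    rtt = 0.15          # network round trip per request, seconds
    server_time = 0.25  # appserver time per request, seconds
    objects = 200       # objects a client script wants to inspect

    one_request_per_object = objects * (rtt + server_time)  # 80 seconds
    one_batched_request = rtt + server_time  # if one request returns them all

    print(one_request_per_object, one_batched_request)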

Database

Our database generally performs well, but occasionally we have queries for which the planner comes up with terrible plans, or which are missing indices. One critical thing to be aware of is that we have a master-slave replication system using Slony, which means that all writes occur on one machine: write-heavy codepaths may well be slow and suffer more contention.
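
To illustrate the consequence, here is the routing constraint in miniature - a sketch with hypothetical master and slave connection objects, not our real infrastructure:

    # Sketch of routing under master-slave replication: reads can be spread
    # across slaves, but every write lands on the one master.
    # The master/slaves connections are hypothetical stand-ins.
    import random

    def run(query, params, master, slaves):
        if query.lstrip().upper().startswith("SELECT"):
            conn = random.choice(slaves)  # reads scale out horizontally
        else:
            conn = master  # writes all contend on a single machine
        cur = conn.cursor()
        cur.execute(query, params)
        # Beware replication lag: a read that must see your own just-made
        # write needs to go to the master as well.
        return cur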

However, while the database itself performs well, we often run into pages that are slow because of how they interact with it. See Database/Performance.

Test

Our test suite's performance suffers from two problems: throughput is poor (2-4 hours to run all the tests), and responsiveness is also poor (20 seconds to run a single test).

Folk are working on this piecemeal, but there isn't really a coordinated effort at the moment.

Presentations

lightning performance.odp -- a lightning talk from the January 2011 Epic.

Comments
