Overview
Performance can mean many different things, and as such it is hard to measure - but we generally know when something 'performs well': it feels responsive, it lets us achieve what we want quickly, it handles lots of users, and it makes efficient use of resources. All of these can be measured.
This page focuses on the purely technical aspects of performance: responsiveness, capacity, throughput and efficiency. It aims to provide a common language for talking about challenges and improvements in these areas, and about the styles of tools available for dealing with them. There are other technology- and challenge-specific pages in this wiki, which can and should be consulted for specifics about the area a team is working on or the technology chosen for some part of the system.
This is a living document - please edit it and add to it freely! In writing it, I (Robert Collins) have drawn mainly on experiences with bzr, launchpad, squid, and various books. I'm not citing sources in the interests of brevity and clarity, but if there are bits in here that you disagree with, let me know and I'll dig up sources / discuss / whatever.
Fixing performance problems
First, read the definitions and context section to have a common language for discussing problems.
Secondly, make sure you've gathered some data:
- an OOPS or two (look in the OOPS reports, or generate one if needed with ++oops++ or ++profile++);
- a kcachegrind profile (see ++profile++) if the OOPS isn't crystal clear about what's going wrong - though reading profiles can take a bit of practice; and
- if needed, the EXPLAIN ANALYZE output for long queries (you will need to talk to someone who has access to the production database or staging, such as the LOSAs or XXX).
Then, pick from this menu:
Page responsiveness - a checklist
This checklist is ordered by how often the different sorts of problems were affecting us as of Jan 2011. Multiple causes may affect a single page, so make sure you've checked them all and evaluated their relative impact before deciding on your course to improve the page.
- Check the query count for the page. Is it under 30?
If the count is over 30 (or thereabouts) then the page is probably suffering from late evaluation. Also known as death-by-sql, this means the page is paying the overhead of talking to the DB server repeatedly - and making it perform redundant work by looking up individual items rather than sets. See Database/Performance for specifics on how to avoid late evaluation.
Consider writing a test that asserts at most N queries are performed for your page, and that the page issues the same number of queries when more dynamically shown things (like subscribers, branches, or comments) are added - see the test sketch at the end of this checklist. The patch sent to the list on 10th August in the 'Performance Tuesday' thread has an example of such a test.
- A single query is performing slowly? (or possibly several queries, if late evaluation is also occurring)
The database statistics may need a tweak, your query may be tricking the planner into a bad plan, or you may be working with something not indexed (or not able to be indexed) appropriately. This is a rich topic of its own - see Database/Performance. Often the query can be improved without altering the schema; sometimes schema changes are needed (and we should make them).
- Doing batch work (non-interactive) in the webapp? Some examples that may fall into this category: sending emails, updating memos in the system, making web requests to external services. Often the problem isn't that you are doing the work *per se*, it's that the amount of work is not controlled by the webapp - it's controlled by our users. For instance, sending emails is fine as long as we're only sending a few; sending 3000 to the members of a big super-team becomes a failure because it violates the principle of constant work for web pages.
- You could use the job system (a sketch of the idea follows this checklist).
- You could redefine what gets done (for instance, do people really want an email when someone leaves a team?)
- You could make what you do drastically more efficient (this is often very hard and eventually something will scale us past it again).
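Launchpad's job system has its own API; the sketch below only illustrates the shape of the fix, with made-up names (work_queue, handle_leave_team, run_jobs) rather than real Launchpad code. The interactive request does a constant amount of work and records the rest as a job, and a worker performs the deferred work later:

{{{
# Minimal sketch only - these names are illustrative, not the real job API.
# The request does constant work; the per-member work is deferred.
import queue

work_queue = queue.Queue()   # stands in for a persistent job table

def handle_leave_team(user, member_addresses):
    """The interactive request: constant work now, enqueue the rest."""
    print("%s removed from the team" % user)     # the cheap, inline part
    work_queue.put(("notify-members", user, list(member_addresses)))
    return "Membership updated - notifications will follow."

def run_jobs(send_email):
    """A job runner (cron script or worker) drains the queue later."""
    while not work_queue.empty():
        kind, user, addresses = work_queue.get()
        if kind == "notify-members":
            for address in addresses:            # O(team size), off the request
                send_email(address, "%s left the team" % user)

# Example run:
handle_leave_team("alice", ["bob@example.com", "carol@example.com"])
run_jobs(lambda addr, msg: print("mail to %s: %s" % (addr, msg)))
}}}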
? Your topic here
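Here is the kind of query-count test suggested in the first checklist item, as a self-contained sketch. QueryRecorder and render_page are toy stand-ins for the real test helpers (Launchpad has its own query-recording infrastructure); the shape of the assertions is the point.

{{{
# Illustrative only: QueryRecorder and render_page are toy stand-ins for the
# real test helpers - the shape of the assertions is the point.
import unittest

class QueryRecorder:
    """Toy recorder; real code would hook the database connection."""
    def __init__(self):
        self.queries = []

    def record(self, sql):
        self.queries.append(sql)

def render_page(recorder, comments):
    # Toy 'page': one query for the bug plus one set query for all comments,
    # rather than one query per comment.
    recorder.record("SELECT * FROM bug WHERE id = 1")
    recorder.record("SELECT * FROM message WHERE bug = 1")
    return ["comment %d" % i for i in range(comments)]

class ConstantQueryCountTest(unittest.TestCase):
    def count_queries(self, comments):
        recorder = QueryRecorder()
        render_page(recorder, comments)
        return len(recorder.queries)

    def test_query_count_is_bounded_and_constant(self):
        few = self.count_queries(comments=2)
        many = self.count_queries(comments=20)
        self.assertLessEqual(few, 30)   # stay under the query budget
        self.assertEqual(few, many)     # no extra queries per comment shown

if __name__ == "__main__":
    unittest.main()
}}}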
Definitions and context
Responsiveness
Responsiveness is simply how fast or slow you get feedback on actions in the system. Depending on the layer you are working at, this may be web requests, UI changes, package builds, and so on.
We want Launchpad to be very responsive to interactive operations, so that users are not left waiting for Launchpad. For batch operations we care less about responsiveness, although being responsive is still nice where we can manage it. Managing a task through an asynchronous work queue does not by itself make the task a batch operation: if it happens on behalf of an interactive user, it should still be considered an interactive task.
We have a page performance report (internal only for now - an RT ticket is open to make part of it public; the full report can expose confidential names of things) which shows how long server-side requests take to complete. This report shows the slowest pages, and the 99% column and the page hit counts are very interesting things to sort by.
Things that help perceived responsiveness:
- Consistently responding quickly: If some requests are very slow, the system as a whole is perceived as being very slow.
- Separating out work-for-the-user and work-for-our-housekeeping.
- Efficient queries / formatting
- Memoised data we can query more easily (indices in the database are an extremely common case of this, but we can also memoise things in other data structures) - see the toy sketch below for the contrast with on-demand caching.
Things that hinder responsiveness (red flags):
- Doing memoisation or background processing in an interactive context. Note that sometimes we have to complete some memoisation or background processing work before the user's *desire* is satisfied, but we can still respond quickly to tell them we're working on what they need.
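To make the memo-versus-cache distinction concrete, here is a toy sketch (the team data, timings and function names are all made up): an on-demand cache makes the first reader pay the full cost, while an anticipated memo pays at write time so every read is cheap.

{{{
# Toy illustration of on-demand caching vs anticipated memoisation; the team
# data and the sleep standing in for a slow query are made up.
import functools
import time

TEAMS = {"launchpad-devs": {"alice", "bob", "carol"}}

def expensive_member_count(team_name):
    time.sleep(0.5)                      # stands in for a slow query
    return len(TEAMS[team_name])

# On-demand cache: populated by the first request, so the first hit after the
# cache goes cold is still slow - perceived responsiveness suffers once.
@functools.lru_cache(maxsize=None)
def cached_member_count(team_name):
    return expensive_member_count(team_name)

# Anticipated memo: the cost is paid at write time (like a database index or
# a denormalised counter), so every read is cheap.
member_count_memo = {name: len(members) for name, members in TEAMS.items()}

def add_member(team_name, person):
    TEAMS[team_name].add(person)
    member_count_memo[team_name] = len(TEAMS[team_name])   # maintained on write

def memoised_member_count(team_name):
    return member_count_memo[team_name]
}}}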
Capacity
Launchpad is the hosting site for thousands of projects and the Ubuntu distribution. It is important that it be able to handle the entire set of users, and grow gracefully as we add features or as more users start using it. This is often what people mean when they talk of 'scaling' in the context of a web site. Scaling can also refer to just handling very large datasets, though, and we have that in Launchpad too - the bugs database, for instance, is effectively append-only, so it grows and grows.
Capacity is related to responsiveness - when a system exceeds its capacity, its responsiveness is often one of the first things to go downhill.
We have some graphs in the lpstats system (sorry, also staff only) that show system load on the database servers, web servers and so on. These, combined with request counts, buildd queue lengths and the like, can give a feel for our workload and how we're managing it.
For now, we can only really tell when we're hitting a specific capacity issue when some part of the system breaks.
However, there are a number of things which can help with capacity:
- Load-balanced systems, where we can add more of the system when the load exceeds capacity. This lets us spend money to get capacity - which is in some ways easier than redeveloping things.
- Efficient operation of tasks: doing things very efficiently can allow enough capacity that we don't need to worry about load balancing and so forth. Many things will be fairly constant for us, and for these we can just make them efficient. Load balancing may still be a good idea to help with availability. The librarian is an example of this approach - we have one system which grows quite slowly, and we load balance the software on it to allow smooth upgrades.
- Caching: when we have identical requests coming in at high frequency, we can use caching to hand out slightly stale answers promptly. We have both squid and memcached servers available for doing this (a minimal sketch follows this list). Note that this does not help with responsiveness in general, because caches (unlike memos) are populated by the first request for the data, and if that's slow, the perceived responsiveness will be slow even if the next 10 hits are extremely fast.
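A minimal cache-aside sketch using the python-memcache client - the key name, timeout and compute_popular_projects() are made up for illustration, not real Launchpad code:

{{{
# Minimal cache-aside sketch; assumes a memcached instance on localhost and
# the python-memcache client. All names here are illustrative.
import memcache

mc = memcache.Client(['127.0.0.1:11211'])

def compute_popular_projects():
    # Stands in for an expensive query we are happy to serve slightly stale.
    return ['launchpad', 'bzr', 'ubuntu']

def popular_projects():
    cached = mc.get('popular-projects')
    if cached is not None:
        return cached                             # fast path: slightly stale
    result = compute_popular_projects()           # slow path: first request pays
    mc.set('popular-projects', result, time=300)  # keep for five minutes
    return result
}}}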
And some things that can be particularly painful:
- Resources that have a bottleneck. For instance, our postgresql server is a single point of write contention: all database updates go through it, and only one machine can perform the writes, so a large burst of write activity can exceed our capacity, and the only way we can increase the capacity of this resource is by buying ever larger and faster machines.
- Deep queueing systems: where we have a queue that can get long, a common failure mode is for the queue depth (measured in time) to drive further insertions into the queue. For instance, the bazaar.launchpad.net url mapping code has a queue that apache 'manages', which interacted badly with the cached url mapping - when the queue exceeded 2 seconds in depth, the cache eviction rate hit 100% and thereafter every item in the queue caused a new backend lookup.
Throughput & Efficiency
Throughput is about the total amount of work you can put through the system; efficiency is about how much responsiveness/capacity/throughput we get per dollar. If we can increase efficiency, we can handle more users and projects without hardware changes - which is nice, as it reduces migrations and downtime.
When we increase responsiveness, for instance by moving work out of a web transaction, we interact with throughput. For most of Launchpad throughput isn't a primary metric: responsiveness is. However, for some resources, like the buildds, throughput is the primary metric, because we expect the load on the system to vary wildly and often exceed our capacity for short periods of time.
The primary tool we have for improving throughput is efficient code / database schemas / components. A highly responsive, high capacity, efficient system typically has pretty good throughput.
Current themes
As we work on Launchpad performance, we need to focus on specific areas one at a time, making sure each is solid, under control and out of the way before we pick up a new challenge.
The current theme is appserver rendering time. This is a three part theme:
5 second hard timeout
The hard timeout determines when we release resources from a long running transaction. The longer this is the more a bad page can impact other pages and appservers. Setting this limit to 5 seconds will help reduce contention in the database and means that things which go wrong don't hang on the user for long periods of time.
To get to 5 seconds we are ratcheting down the database timeout (which is how this limit is configured) in small steps, whenever the page performance report and hourly OOPS graphs show that we have a decent buffer.
1 second rendering for the landing pages for 99% of requests
The second part is to make the landing pages for our apps - /project, /projectgroup, /distro, /distro/+source/sourcepackage - render in 1 second for 99% of requests. This will provide a foundation - we'll know we have the facilities in place to deliver this page rendering time in each domain. This theme is actually pretty easy, but we're not looking at it until the first theme is complete.
1 second rendering for 99% of all requests
This last theme is where we bring the time it takes to use all of Launchpad down under 1 second. It's still some way off, but we can see that it's possible.
Web
The foundations performance work on page loads and SSL is very relevant here. For web performance we're primarily interested in how long a page takes to become usable by the user.
Our page performance report, mentioned under responsiveness above, is only one tool - but a necessary one - in determining the responsiveness of web pages. Many factors impact web rendering and delivery time, but one thing is certain: a slow page render on the appserver guarantees a slow delivery time to the user.
One important thing to watch out for is any high-latency loop: for instance, requesting a single database row per object in a loop (e.g. for person in team: if person.is_ubuntu_coc_signer: do_something()). This will perform very poorly as soon as a large team is used, and will probably force unnecessary batching, which makes things more complicated. We want constant overhead per page as a guiding principle - even though we may show more people on one page than another, the total work done should be pretty much constant (if you hand-wave away the DB server). A sketch of the anti-pattern and a set-based alternative follows.
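Here is a sketch of that anti-pattern next to a set-based alternative, using a plain DB-API cursor and made-up table names - illustrative only, not Launchpad's real model code:

{{{
# Illustrative only: the schema is made up and 'cursor' is any qmark-style
# DB-API cursor (e.g. sqlite3).

def coc_signers_slow(cursor, member_ids):
    # Anti-pattern: one query per person - latency grows with team size.
    signers = []
    for person_id in member_ids:
        cursor.execute(
            "SELECT 1 FROM signed_coc WHERE person = ? AND active",
            (person_id,))
        if cursor.fetchone() is not None:
            signers.append(person_id)
    return signers

def coc_signers_fast(cursor, team_id):
    # Constant query count: ask the database for the whole set in one trip.
    cursor.execute(
        "SELECT person FROM signed_coc"
        " WHERE active AND person IN"
        "   (SELECT person FROM team_membership WHERE team = ?)",
        (team_id,))
    return [row[0] for row in cursor.fetchall()]
}}}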
API
Very similar to the Web as a topic area, but unlike browser pages, the requests a client makes depend on user code, not on the UI we give the user.
Currently the API suffers from a many-roundtrips latency multiplication effect (illustrated below), which also interacts badly with appserver responsiveness issues, but this should become much better as we progress through the performance themes.
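As an illustration with launchpadlib (the exact request count depends on client-side caching and batching, but following a link attribute such as task.bug typically costs an extra HTTP round trip per object):

{{{
# Illustration using launchpadlib; following task.bug fetches each linked bug
# with a separate HTTP request, so the loop's cost grows with the result size.
from launchpadlib.launchpad import Launchpad

lp = Launchpad.login_anonymously('perf-demo', 'production')
project = lp.projects['launchpad']

# Latency multiplication: roughly one extra request per task to get its bug.
for task in project.searchTasks()[:20]:
    print("%s: %s" % (task.status, task.bug.title))

# Cheaper: use the data the task entries already carry (they arrive in
# batches) and only follow links when you really need them.
for task in project.searchTasks()[:20]:
    print("%s: %s" % (task.status, task.bug_link))
}}}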
Database
Our database generally performs well, but occasionally we have queries for which the planner comes up with terrible plans, or which are missing indices. One critical thing to be aware of is that we have a master-slave replication system using Slony, and that means all writes occur on one machine: write-heavy codepaths may well be slow and suffer more contention.
However, while the database itself performs well, we often run into pages that are slow because of how they interact with it. See Database/Performance.
Test
Our test performance suffers from two problems: the throughput is poor (2-4 hours to run all the tests), and the responsiveness is also poor (20 seconds to run a single test).
Folk are working on this piecemeal, but there isn't really a coordinated effort at the moment.
Presentations
lightning performance.odp -- a lightning talk from the January 2011 Epic.
Comments
- Aren't memos by definition something that's calculated when it's first needed and then remembered? -- mbp
Memos can be calculated earlier - both a cache and a memo have the property of representing some state. I'm looking for a term that means 'thing we pay the cost of calculating early to get better performance later' vs 'thing we stash away after paying for it so that we don't pay for it again for a while': anticipated vs on-demand. On-demand can play havoc with responsiveness; anticipated can play havoc with capacity and throughput.