Differences between revisions 4 and 5

Overview

Performance can mean many different things, and as such is very hard to measure. Generally, though, we know when something 'performs well': it feels responsive, it lets us achieve what we want quickly, it handles lots of users, and it makes efficient use of resources. Each of these is something we can measure.

This page focuses on the purely technical aspects of performance: responsiveness, capacity, throughput and efficiency. It aims to provide a common language for talking about challenges and improvements in these areas, and the style of tools available for dealing with them. There are other technology-specific and challenge-specific pages in this wiki, which can and should be consulted for the specifics of what a team is working on or of the technology chosen for some part of the system.

This is a living document - please edit it and add to it freely! In writing it, I (RobertCollins) have drawn mainly on experiences with bzr, Launchpad, squid, and various books. I'm not citing sources, in the interests of brevity and clarity, but if there are bits in here that you disagree with, let me know and I'll dig up sources / discuss / whatever.

Responsiveness

Responsiveness is simply how quickly you get feedback on actions in the system. Depending on the layer you are working at, this may mean web requests, UI changes, package builds, etc.

We want Launchpad to be very responsive to interactive operations, so that users are not waiting for Launchpad. For batch operations we care less about responsiveness, but where we can be, it's nice to still be responsive. Note that using work queues to complete tasks asynchronously does not imply that these tasks are batch operations: if they are done on behalf of an interactive user, they should still be considered interactive tasks.

We have an (internal only - sorry, it includes confidential URLs at the moment) report which shows how long server side requests take to complete. To estimate how long 99% of requests take to complete, add 3 times the stddev to the mean. This is a rule of thumb: it is a reasonable bound when request times are roughly normally distributed, and it can underestimate when a few requests are very slow. Many of our requests are extremely fast, but many are also very slow.
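As a rough illustration of the rule of thumb above, the sketch below compares mean + 3 * stddev against a directly measured 99th percentile. The function names and the sample timings are made up for illustration; they are not taken from the internal report.

```python
import math
import statistics


def mean_plus_three_sigma(durations):
    """Rule-of-thumb estimate: mean plus three population standard deviations."""
    return statistics.mean(durations) + 3 * statistics.pstdev(durations)


def percentile(durations, fraction):
    """Nearest-rank percentile: the smallest value covering `fraction` of the samples."""
    ordered = sorted(durations)
    index = max(0, math.ceil(fraction * len(ordered)) - 1)
    return ordered[index]


# Hypothetical timings in seconds: mostly fast requests plus a couple of very slow ones.
sample = [0.1] * 98 + [10.0] * 2

estimate = mean_plus_three_sigma(sample)
measured = percentile(sample, 0.99)
print(f"mean + 3*stddev: {estimate:.2f}s, measured 99th percentile: {measured:.2f}s")
```

With skewed timings like these the estimate comes out well below the measured percentile, so it is worth checking the rule of thumb against real percentiles when the raw timings are available.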

Things that help perceived responsiveness:

Consistently responding quickly: If some requests are very slow, the system as a whole is perceived as being very slow.
Separating out work-for-the-user and work-for-our-housekeeping.
Efficient queries / formatting
Memoised data we can query more easily (indices in the database are an extremely common case of this, but we can also memoise things in other data structures)

Things that hinder responsiveness (red flags):

Doing memoisation or background processing in an interactive context. Note that sometimes we have to complete some memoisation or background processing work before the user's *desire* is satisfied, but we can still respond quickly to tell them we're working on what they need.
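One way to picture "respond quickly, finish the work later" is a request handler that only enqueues the slow work and hands back a job id the user can poll. This is a minimal sketch using the Python standard library; `handle_request`, `worker`, and the `status`/`results` dictionaries are hypothetical names, not Launchpad code.

```python
import queue
import threading
import uuid

jobs = queue.Queue()   # pending background work
status = {}            # job id -> "queued" | "running" | "done"
results = {}           # job id -> result of the slow task


def handle_request(payload):
    """Interactive path: record the job and return immediately with its id."""
    job_id = str(uuid.uuid4())
    status[job_id] = "queued"
    jobs.put((job_id, payload))
    return job_id


def worker(slow_task):
    """Background path: run slow tasks one at a time until a None sentinel arrives."""
    while True:
        item = jobs.get()
        if item is None:
            jobs.task_done()
            break
        job_id, payload = item
        status[job_id] = "running"
        results[job_id] = slow_task(payload)
        status[job_id] = "done"
        jobs.task_done()
```

A caller would start `worker` in a `threading.Thread`, call `handle_request` from the interactive code path, and look up `status[job_id]` to tell the user how their request is progressing.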

Capacity

Launchpad is the hosting site for thousands of projects and the Ubuntu distribution. It is important that it be able to handle the entire set of users, and grow gracefully as we add features and/or more users start using it. This is often what people mean when they talk of 'Scaling' in the context of a web site. Scaling can also refer to just handling very large datasets though, and we have that in Launchpad too - the bugs database for instance is effectively append-only, so it grows and grows.

Capacity is related to Responsiveness - when a system exceeds its capacity, its responsiveness is often one of the first things to go downhill.

We have some graphs in the lpstats system (sorry, also staff only) that show system load on the database servers, web servers etc. These, combined with the request counts, buildd queue lengths and so on, can give a feeling for our workload and how we're managing it.

For now, we can only really tell that we're hitting a specific capacity limit when some part of the system breaks.

However, there are a number of things which can help with capacity:

Load-balanced systems where we can add more of the system when the load exceeds capacity. This lets us spend money to get capacity - which is in some ways easier than redeveloping things.
Efficient operation of tasks: Doing things very efficiently can allow enough capacity that we don't need to worry about load balancing and so forth. Many things will be fairly constant for us, and for these we can just make them efficient. Load balancing may still be a good idea to help with availability. The librarian is an example of this approach - we have one system which grows quite slowly, and we load balance the software on it to allow smooth upgrades.
Caching: When we have identical tasks coming in at high frequency, we can use caching to hand out slightly stale answers promptly. We have both squid and memcached servers available for doing this. Note that this does not help with responsiveness in general, because caches (unlike memos) are populated by the first request for the data, and if that's slow, the perceived responsiveness will be slow even if the next 10 hits are extremely fast.
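The difference between a cache and a memo in the caching point above can be sketched as follows. The class names and the `slow_lookup` helper are hypothetical; the only point is who pays for the slow computation: the first user (cache) or housekeeping done ahead of time (memo).

```python
class Cache:
    """Populated lazily: the first request for a key pays for the slow computation."""

    def __init__(self, compute):
        self._compute = compute
        self._store = {}

    def get(self, key):
        if key not in self._store:
            self._store[key] = self._compute(key)  # slow path, inside a user request
        return self._store[key]


class Memo:
    """Populated ahead of time: housekeeping pays, so user requests only read."""

    def __init__(self, compute, keys):
        self._store = {key: compute(key) for key in keys}  # slow path, outside any request

    def get(self, key):
        return self._store[key]


slow_calls = []  # records every time the expensive computation actually runs


def slow_lookup(key):
    """Stand-in for an expensive query or rendering step."""
    slow_calls.append(key)
    return key.upper()
```

With `Cache(slow_lookup)`, the first `get("bug-1")` runs `slow_lookup` inside the request and later calls do not. With `Memo(slow_lookup, ["bug-1"])`, the slow call happens at construction time and no request triggers it.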

And some things that can be particularly painful:

Resources that have a bottleneck. For instance, our PostgreSQL server is a single point of write contention: all database updates go through it, and only one machine can perform the writes, so a large burst of write activity can exceed our capacity. The only way we can increase the capacity of this resource is by buying ever larger and faster machines.
Deep queueing systems: Where we have a queue that can get long, a common failure mode is to have the queue depth (measured in time) drive insertions into the queue. For instance, the bazaar.launchpad.net URL mapping code has a queue that Apache 'manages', which interacted with the cached URL mapping - when the queue exceeded 2 seconds in depth, the cache eviction rate hit 100% and thereafter every item in the queue caused a new backend lookup.
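The deep-queue failure mode can be illustrated with a toy simulation. The model here is an assumption made for illustration, not a description of the real bazaar.launchpad.net code: a single worker serves requests in order, a request that has waited longer than the cache lifetime finds its cached entry gone and needs a slow backend lookup, and all timings (in milliseconds) are invented.

```python
def count_misses(burst, steady, interval_ms=100, hit_ms=50, miss_ms=500, ttl_ms=2000):
    """Count requests that waited longer than ttl_ms and so needed a backend lookup.

    `burst` requests arrive together at time 0, followed by `steady` requests
    spaced interval_ms apart. All parameters are hypothetical.
    """
    arrivals = [0] * burst + [interval_ms * k for k in range(1, steady + 1)]
    free_at = 0   # time at which the single worker is next free
    misses = 0
    for arrived in arrivals:
        start = max(free_at, arrived)
        waited = start - arrived
        if waited > ttl_ms:
            misses += 1                 # cached entry expired while queued
            free_at = start + miss_ms   # slow backend lookup
        else:
            free_at = start + hit_ms    # fast cached answer
    return misses


print(count_misses(burst=30, steady=100))  # small burst: the queue drains
print(count_misses(burst=60, steady=100))  # larger burst: the queue never recovers
```

In this model a burst small enough to keep waits under the cache lifetime causes no backend lookups, while a somewhat larger burst tips the queue past the threshold, after which every request is a miss and the backlog keeps growing even at the normal arrival rate.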

Throughput & Efficiency

Throughput is about the total amount of work you can put through the system; efficiency is about how much responsiveness/capacity/throughput we get per dollar. If we can increase efficiency, we can handle more users and projects without hardware changes, which is nice because it reduces migrations and downtime.

When we increase responsiveness, for instance by moving work out of a web transaction, we interact with throughput. For most of Launchpad throughput isn't a primary metric: responsiveness is. However, for some resources, like the buildds, throughput is the primary metric, because we expect the load on the system to vary wildly and often exceed our capacity for short periods of time.
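To make the buildd case concrete, here is a small sketch of how a backlog builds and drains when load briefly exceeds capacity. The function name and the job counts are illustrative assumptions, not measurements from the buildds.

```python
def backlog_over_time(arrivals, capacity):
    """Return the queue length after each interval.

    `arrivals` lists jobs arriving per interval and `capacity` is the number
    of jobs the system can finish per interval.
    """
    backlog = 0
    history = []
    for arrived in arrivals:
        backlog = max(0, backlog + arrived - capacity)
        history.append(backlog)
    return history


# Hypothetical load: a burst of 10 jobs in the third interval against a capacity of 4.
print(backlog_over_time([2, 2, 10, 2, 2, 2], capacity=4))
```

As long as total capacity over the period exceeds total arrivals, the backlog from a burst eventually drains; what throughput determines is how long jobs sit in the queue while it does.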

The primary tool we have for improving throughput is efficient code / database schemas / components. A highly responsive, high capacity, efficient system typically has pretty good throughput.

- Revision 4 as of 2010-08-11 00:01:38
  Size: 2309
  Editor: lifeless
  Comment: snapshot
+ Revision 5 as of 2010-08-11 00:30:04
  Size: 7135
  Editor: lifeless
  Comment: more babble
 Line 8:
+This is a living document - please edit it and add to it freely! In writing it, I (RobertCollins) have drawn mainly on experiences with bzr, launchpad, squid, and various books : I'm not citing sources in the interests of brevity and clarity, but if there are bits in here that you disagree with, let me know and I'll dig up sources / discuss / whatever.
-Line 17:
+Line 19:
-Things that help responsiveness:
+Things that help perceived responsiveness:
-Line 19:
+Line 21:
+ * Consistently responding quickly: If some requests are very slow, the system as a whole is perceived as being very slow.
-Line 23:
+Line 26:
-== Capacity
+Things that hinder responsiveness(red flags):

