Foundations/SystemPerformance/Metrics

Performance Metrics for Developers

Foundations wants to provide easily available metrics for DIY performance improvements across the Launchpad team. This page documents them and helps you find and use them.

OOPS Reports

These reports, sent to Canonical Launchpad developers, identify pages that are individually slower on the server than the absolute limit that we deem acceptable.

The web interface to the OOPS tool is https://lp-oops.canonical.com/oops/ . Interesting future improvements might be to identify the 10 pages with the most timeouts over a rolling period of 7 days, and keep track of how long those pages have been in the top (bottom) 10 list. It would also be nice to have the reports on the web.
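The "top 10 over a rolling 7 days" idea could be prototyped roughly as follows. This is only a sketch: it assumes timeouts are available as `(day, page_id)` pairs, which is a hypothetical input shape and not the actual OOPS storage format, and the function names are made up for illustration.

```python
from collections import Counter
from datetime import date, timedelta


def top_timeout_pages(timeouts, today, window_days=7, limit=10):
    """Return up to `limit` page ids with the most timeouts in the
    `window_days`-day window ending on `today` (inclusive)."""
    start = today - timedelta(days=window_days - 1)
    counts = Counter(page for day, page in timeouts if start <= day <= today)
    return [page for page, _ in counts.most_common(limit)]


def days_in_top(timeouts, today, history_days, window_days=7, limit=10):
    """For each page on today's list, count how many consecutive days
    (ending today, looking back at most `history_days`) it has been listed."""
    streaks = {}
    for page in top_timeout_pages(timeouts, today, window_days, limit):
        streak = 0
        for back in range(history_days):
            day = today - timedelta(days=back)
            if page not in top_timeout_pages(timeouts, day, window_days, limit):
                break
            streak += 1
        streaks[page] = streak
    return streaks


# Hypothetical sample data: (day the timeout occurred, page id).
sample = [
    (date(2010, 8, 1), "A"),
    (date(2010, 8, 1), "A"),
    (date(2010, 8, 2), "B"),
]
print(top_timeout_pages(sample, date(2010, 8, 3)))
print(days_in_top(sample, date(2010, 8, 3), history_days=5))
```

Recomputing the list for every day in the history is simple but quadratic; a real report would more likely store each day's list once and read it back.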

webpagetest.org

https://devpad.canonical.com/~mars/metrics/webpagetest.org/

We can now look at webpagetest.org results over time for high-impact pages. This can give a full picture of time-to-interact (though note that these are for anonymous users, so server-side results are sometimes Squid cache hits).

++profile++

See https://dev.launchpad.net/Debugging#Profiling%20page%20requests .

Further work:

Tracelog Analysis

Page Performance Reports are currently available at https://devpad.canonical.com/~stub/ppr.

They are being generated daily from the Zope tracelogs.

Reports by:

You can help:

The script that generates the report is utilities/page-performance-report.py.

Pay attention to the y-axis of the graphs: it is logarithmic. Most of our pages really do render quickly. Some pages render slowly consistently, identified by high means and medians. Other pages render slowly inconsistently, identified by high standard deviation and variance.
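To make the distinction between consistently and inconsistently slow pages concrete, here is a minimal sketch using Python's standard `statistics` module. It is not taken from utilities/page-performance-report.py; the thresholds (`slow_seconds`, `noisy_stdev`) and function names are assumptions chosen purely for illustration.

```python
import statistics


def summarize(render_times):
    """Compute the summary statistics mentioned in the report."""
    return {
        "mean": statistics.mean(render_times),
        "median": statistics.median(render_times),
        "stdev": statistics.pstdev(render_times),
        "variance": statistics.pvariance(render_times),
    }


def classify(render_times, slow_seconds=1.0, noisy_stdev=1.0):
    """Label a page from its render times in seconds.

    Both thresholds are hypothetical; pick values that suit the data.
    """
    s = summarize(render_times)
    # High mean AND high median: the page is slow most of the time.
    if s["mean"] >= slow_seconds and s["median"] >= slow_seconds:
        return "consistently slow"
    # Typical requests are fine, but the spread is large.
    if s["stdev"] >= noisy_stdev:
        return "inconsistently slow"
    return "fast"


print(classify([2.0, 2.1, 2.2]))
print(classify([0.1, 0.1, 0.1, 6.0]))
print(classify([0.1, 0.2, 0.15]))
```

The second sample shows why the median matters: one 6-second request raises the mean considerably while the median stays at 0.1 seconds.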

These reports cover only what happens in the appservers. SSL setup times, cache overheads, etc. are not counted here.

Database Utilization Statistics

Live, daily and weekly reports on master database utilization statistics are available at https://devpad.canonical.com/~stub/dbr. The rate of tuples (database rows) read, updated and deleted per table shows which tables in the database are busiest. CPU utilization shows which processes are hogging the database; 100% CPU utilization corresponds to 1 core (we have 16).
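Because 100% means one core, a figure from the report has to be divided by the core count to get a share of the whole machine. A small sketch of that arithmetic, with a hypothetical helper name:

```python
CORES = 16  # core count stated above


def share_of_host(cpu_percent, cores=CORES):
    """Convert a CPU figure where 100% == one core into a fraction
    of the whole machine (1.0 == every core fully busy)."""
    return cpu_percent / (100.0 * cores)


print(share_of_host(100))   # one core busy
print(share_of_host(400))   # four cores busy
```

So a process shown at 100% is using one sixteenth of the available CPU, not all of it.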

We can add statistics for slave databases if needed, but there seems to be little need, so for now we are avoiding the overhead and maintenance.

Alarm Page

A challenge with improving the "big picture" of time-to-interact is that improvements may get lost over time, because we have no way to write automated tests for network improvements. We are experimenting with an email alarm that fires for a page when the fastest run in the most recent webpagetest.org sweep exceeds a specified amount of time. When you improve a page through network changes, you can then lower the time that triggers the alarm for that page. Maybe we can identify other approaches with Windmill in the future to catch at least a significant subset of these sorts of problems.
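The alarm rule described above can be sketched as follows. The data shapes (a dict of run times per page and a dict of per-page thresholds, both in seconds) and the function name are assumptions for illustration, not the format of the actual webpagetest.org results or of the experimental alarm.

```python
def pages_to_alarm(latest_sweep, thresholds):
    """Return (page, fastest_run, limit) for every page whose fastest run
    in the latest sweep is slower than its configured limit.

    latest_sweep: {page: [run time in seconds, ...]}  (hypothetical shape)
    thresholds:   {page: limit in seconds}            (hypothetical shape)
    """
    alarms = []
    for page, runs in sorted(latest_sweep.items()):
        limit = thresholds.get(page)
        if limit is None or not runs:
            continue  # no threshold configured, or no data for this page
        fastest = min(runs)
        if fastest > limit:
            alarms.append((page, fastest, limit))
    return alarms


sweep = {"/bugs": [3.2, 2.9, 3.5], "/code": [1.1, 1.4]}
limits = {"/bugs": 2.5, "/code": 2.0}
print(pages_to_alarm(sweep, limits))

# After a network improvement to /code, tighten its limit so that a
# later regression past the new limit would raise the alarm.
limits["/code"] = 1.0
print(pages_to_alarm(sweep, limits))
```

Comparing against the fastest run rather than the average means a single slow, noisy run does not raise the alarm by itself.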

Foundations/SystemPerformance/Metrics (last edited 2010-08-31 15:28:35 by gary)