Performance Metrics for Developers

Foundations wants to provide easily available metrics for DIY performance improvements across the Launchpad team. This page documents them and helps you find and use them.

OOPS Reports

These reports, sent to Canonical Launchpad developers, identify pages that are individually slower on the server than the absolute limit that we deem acceptable.

The web interface to the OOPS tool is https://lp-oops.canonical.com/oops/ . Interesting future improvements might be to identify the 10 pages with the most timeouts over a rolling period of 7 days, and keep track of how long those pages have been in the top (bottom) 10 list. It would also be nice to have the reports on the web.
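
The rolling top-10 idea could be prototyped along these lines. This is only a sketch: the (timestamp, page id) record shape and both function names are assumptions made for illustration, not the real OOPS data format or tooling.

```python
from collections import Counter
from datetime import datetime, timedelta


def top_timeout_pages(timeouts, now, days=7, limit=10):
    """Return (pageid, count) pairs for the pages with the most timeouts.

    `timeouts` is an iterable of (timestamp, pageid) pairs. This record
    shape is an assumption, not the real OOPS format.
    """
    cutoff = now - timedelta(days=days)
    counts = Counter(
        pageid for when, pageid in timeouts if cutoff < when <= now)
    return counts.most_common(limit)


def days_in_top(daily_top_lists, pageid):
    """Count the most recent consecutive days `pageid` was in the list.

    `daily_top_lists` is ordered oldest to newest; each entry is the
    list of page ids that made the top list on that day.
    """
    streak = 0
    for top in reversed(daily_top_lists):
        if pageid not in top:
            break
        streak += 1
    return streak


if __name__ == "__main__":
    now = datetime(2010, 8, 10)
    # Made-up sample data with made-up page ids.
    sample = [
        (datetime(2010, 8, 9), "PageA"),
        (datetime(2010, 8, 8), "PageA"),
        (datetime(2010, 8, 9), "PageB"),
        (datetime(2010, 7, 1), "PageC"),  # older than 7 days, ignored
    ]
    print(top_timeout_pages(sample, now))
```

Running `top_timeout_pages` once a day and storing the result would give `days_in_top` the history it needs.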

webpagetest.org

https://devpad.canonical.com/~mars/metrics/webpagetest.org/

We can now look at webpagetest.org results over time for high-impact pages. This can give a full picture of time-to-interact (though note that these are for anonymous users, so server-side results are sometimes Squid cache hits).

++profile++

See https://dev.launchpad.net/Debugging#Profiling%20page%20requests .

Further work:

Make it possible to use this on staging. This needs to be done soon to close the initial bug 598289. The plan is to constrain access to ++profile++ via feature flags; add a feature flag scope for Canonical Launchpad developers (the output contains potentially sensitive information, so it is not appropriate for non-Canonical employees); and ask LOSAs to enable it.
++profile++download should return a zip or tarball of the generated page output, the raw OOPS information, and the KCacheGrind-friendly profiling information.
++profile++ should be available for the webservice (606952).

Tracelog Analysis

Page Performance Reports are currently available at https://devpad.canonical.com/~stub/ppr.

They are being generated daily from the Zope tracelogs.

Reports by:

Categories. Pages whose URL matches a regular expression.
Pageids. Webservice requests don't show up on here yet, but most other pages have a page id and are grouped here.
Top200. Top200 URLs by total time.
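
As a rough illustration of how the Categories and Top200 groupings work, here is a minimal sketch. The category names, regular expressions and the (url, seconds) input shape are invented for the example; the real categories are defined in utilities/page-performance-report.ini and the real script may work quite differently.

```python
import re
from collections import defaultdict

# Hypothetical categories, for illustration only.
CATEGORIES = {
    "Bugs": re.compile(r"^https?://bugs\."),
    "Code": re.compile(r"^https?://code\."),
}


def total_time_by_category(requests, categories=CATEGORIES):
    """Sum request time for every category whose regex matches the URL.

    `requests` is an iterable of (url, seconds) pairs (assumed shape).
    A URL that matches several categories is counted in each of them.
    """
    totals = defaultdict(float)
    for url, seconds in requests:
        for name, pattern in categories.items():
            if pattern.search(url):
                totals[name] += seconds
    return dict(totals)


def top_urls_by_total_time(requests, limit=200):
    """Return the `limit` URLs with the largest summed request time."""
    totals = defaultdict(float)
    for url, seconds in requests:
        totals[url] += seconds
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return ranked[:limit]


if __name__ == "__main__":
    sample = [
        ("https://bugs.example.test/foo", 1.5),
        ("https://code.example.test/bar", 0.5),
        ("https://bugs.example.test/foo", 2.0),
    ]
    print(total_time_by_category(sample))
    print(top_urls_by_total_time(sample, limit=1))
```

Note that ranking by total time favours URLs that are either slow or requested very often, which is usually what you want when deciding where to spend optimisation effort.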

You can help:

Add more categories - they are defined in utilities/page-performance-report.ini.
Make the report more interactive. Javascript love would allow us to collapse unwanted columns and render graphs on demand. There are currently no graphs of SQL time and statement count, to keep browsers from crashing. Python tickcount is missing entirely because the report is already too wide.

The script that generates the report is utilities/page-performance-report.py.

Pay attention to the graph y-axis. It is *highly* logarithmic. Most of our pages really do render quickly. Some pages consistently render slowly, identified by high means and medians. Some pages render slowly only some of the time, identified by high standard deviation and variance.
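
To make the difference between the two kinds of slow page concrete, here is a small sketch using made-up render times. The numbers and the `summarize` helper are purely illustrative, and the use of population standard deviation is an assumption; the report script may compute its statistics differently.

```python
import statistics


def summarize(durations):
    """Return mean, median and population standard deviation in seconds."""
    return {
        "mean": statistics.mean(durations),
        "median": statistics.median(durations),
        "stdev": statistics.pstdev(durations),
    }


# Made-up render times in seconds.
consistently_slow = [4.0, 4.0, 4.0, 4.0]
inconsistently_slow = [0.5, 0.5, 0.5, 10.5]

if __name__ == "__main__":
    # High mean and median, no spread: slow on every request.
    print(summarize(consistently_slow))
    # Low median but high standard deviation: usually fast, sometimes
    # very slow.
    print(summarize(inconsistently_slow))
```

The second page has a lower mean than the first, yet its worst request is far slower, which is why the spread is worth reading alongside the averages.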

These reports cover only what happens in the appservers. SSL setup times, cache overheads etc. are not included.

Database Utilization Statistics

Live, daily and weekly reports on master database utilization statistics are available at https://devpad.canonical.com/~stub/dbr. The rate of tuples (database rows) read, updated and deleted per table shows which tables in the database are busiest. CPU utilization shows which processes are hogging the database; 100% CPU utilization corresponds to 1 core (we have 16).
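
Because the CPU figure is per core, values above 100% are possible. A minimal sketch of the conversion, using the 16 cores mentioned above (the helper names are made up for this example):

```python
CORES = 16  # number of cores on the master database server, per the text


def cores_used(cpu_percent):
    """Convert a reported CPU percentage into a number of cores."""
    return cpu_percent / 100.0


def share_of_server(cpu_percent, cores=CORES):
    """Return the fraction of the whole server's CPU capacity in use."""
    return cpu_percent / (100.0 * cores)


if __name__ == "__main__":
    # A process reported at 400% is using 4 cores, a quarter of the server.
    print(cores_used(400), share_of_server(400))
```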

We can add statistics for slave databases if needed, but there seems to be little need, so for now we are avoiding the overhead and maintenance.

Alarm Page

A challenge with improving the "big picture" of time-to-interact is that improvements may get lost over time, because we have no way to write automated tests for network improvement. We are experimenting with setting up an email alarm for pages when the fastest run in the most recent webpagetest.org sweep goes above a certain specified amount of time. Therefore, when you improve a page because of network changes, you can specify a smaller amount of time to trigger the alarm for that page. Maybe we can identify other approaches with Windmill in the future to catch at least a significant subset of these sorts of problems.
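
The alarm check described above might look something like the following sketch. The input shapes (a mapping of page URL to run times, and a mapping of page URL to threshold) and the function name are assumptions for illustration; this is not the actual alarm implementation, and sending the email is left out.

```python
def pages_to_alarm(sweep_results, thresholds):
    """Return (url, fastest, limit) for each page over its threshold.

    `sweep_results` maps a page URL to the list of run times (seconds)
    from the most recent sweep; `thresholds` maps a page URL to the
    maximum acceptable time for its fastest run. Both shapes are
    assumptions made for this sketch.
    """
    alarms = []
    for url, limit in sorted(thresholds.items()):
        runs = sweep_results.get(url)
        if not runs:
            # No data for this page in the sweep; nothing to compare.
            continue
        fastest = min(runs)
        if fastest > limit:
            alarms.append((url, fastest, limit))
    return alarms


if __name__ == "__main__":
    # Made-up URLs and timings.
    results = {
        "https://example.test/page-a": [3.2, 2.9, 3.5],
        "https://example.test/page-b": [6.1, 5.8, 7.0],
    }
    limits = {
        "https://example.test/page-a": 3.0,
        "https://example.test/page-b": 5.0,
    }
    for url, fastest, limit in pages_to_alarm(results, limits):
        print("ALARM %s: fastest run %.1fs exceeds %.1fs" % (url, fastest, limit))
```

Comparing the fastest run rather than the average keeps one unlucky run in a sweep from triggering the alarm. After a network improvement, lowering the page's entry in the thresholds mapping locks in the gain.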
