Arsenal launchpadlib performance audit

I went through the Arsenal] scripts looking for performance problems and warts related to their use of launchpadlib. I found everything we could hope for: a single code path causing a huge performance bottleneck in every one of the arsenal scripts. Take a look at arsenal_lib.py, in ArsenalBug.init().

This class was designed to optimize the web service interface for Arsenal programmers. But you pay a high price in unnecessary HTTP requests. Here's the problematic code:

 self.bug_tasks = bug.bug_tasks
 self.attachments = bug.attachments
 self.num_dupes = len(bug.duplicates)
 self.num_subs  = len(bug.subscriptions)

Each line makes an expensive HTTP request, whether or not the data retrieved is used for this ArsenalBug object. Take a look at this code from process_crashes.py:

        bug = ArsenalBug(bugtask.bug, lp.launchpad)
        if not (bug.has_tag("crash") or bug.has_tag("apport-crash") or bug.has_tag("crash-nullptr")):
            continue

That's 5 HTTP requests (the four I mentioned above, plus one to get bugtask.bug), for an object that in many (most?) cases is immediately discarded.

In this particular case, you should be able to get only the bugs you want by using the 'tags' argument to searchTasks(). But even for the bugs you want, you're making 5 HTTP requests and usually only one of those (bug.bugtask) is necessary.

For instance, .num_dupes and .num_subs are used only by gravity(), which is used primarily by bugtask_as_dict(). But I believe 'gravity' is only used in bug-listing.py, and then only if --show-gravity is enabled. Putting that code into gravity() and forcing anyone who wants the bug gravity to explicitly invoke ArsenalBug.gravity() would speed up your application a lot.

I have a plan to reduce the impact of this kind of coding. I will be changing lazr.restfulclient so that when you call eg. 'bug.bug_tasks' you get back a shim object. The expensive HTTP request will be made only when you use the object, by writing 'for x in bug.bug_tasks' or 'len(bug.bug_tasks)'. If you never use .bug_tasks, the request is never made. This will let you set up lots of shims in advance without having to know whether the code path you run will use the underlying data.

This would invisibly solve the problem for .bug_tasks and .attachments, but not for .num_dupes and .num_subs, since you call len() on those objects as soon as you get them.

The best immediate solution is to make .bug_tasks, .attachments, .num_dupes, and num_subs into @property methods. If I'm right, this should speed up all your scripts significantly.

More performance improvements you can make right now

searchTasks()

In some scripts you invoke searchTasks() with the 'tags' and 'tag_combinator' arguments, so that you only retrieve bugs that match certain tags. You could use this in a lot more places. In process-crashes.py you filter out bugs unless they're tagged "crash" or "apport-crash" or "crash-nullptr". By passing those tags into searchTasks() you can eliminate such bugs on the server side.

    crash_tags = ['crash', 'apport-crash', 'crash-nullpts']
    searchTasks(tags=crash_tags, tag_combinator="All")

In some places you check for the _lack_ of a tag. You could do this within searchTasks() as well. For instance, you could use it to filter out the bugs tagged with 'ubuntu_omittable' tags, without having to get a bunch of bugs and call ubuntu_omittable() on each.

    ubuntu_omittable_tags = ['-omit', '-kubuntu', '-xubuntu', '-ppc']
    searchTasks(tags=ubuntu_omittable_tags, tag_combinator="All")

This should eliminate those bugtasks on the server side, so you never see them.

Accessing objects more than once

In process-incomplete-bugs.py you have some code like this:

     for a in bug.attachments:
     author = a.message.owner.name
         if author == bug.bug.owner.name:

Every time you access bug.owner, you trigger an HTTP request (I think). Assign bug.bug.owner.name to a variable outside the loop. (Or set it as ArsenalBug.owner_name).

Server-side changes

This is stuff we could change in lazr.restful or in the Launchpad web service itself.

Representation expansion

There are several places where you could benefit from a feature we have planned. Some sample code:

        for watch in self.bug.bug_watches:
            if watch.bug_tracker.name in distro_trackers:

    for bug in bugs:
        if a.message.owner.name != bug.owner.name:

In each of these cases you iterate over a collection, and then access some related object (watch.bug_tracker, attachment.message and then message.owner). The access of the related object spawns one HTTP request for each element in the collection. With representation expansion, you could write code equivalent to this made-up example:

    for watch in self.bug.bug_watches(expand='bug_tracker'):
        if watch.bug_tracker.name in distro_trackers:

The information about each bug_watch's bug_tracker would be fetched along with the information about the bug_watches themselves. Accessing watch.bug_tracker would no longer spawn an HTTP request. The price would be additional time and bandwidth to fulfil the request for self.bug.bug_watches.

Other

You sometimes get a bunch of bugtasks and reject the ones whose bugs have a certain status (like "Fix Committed"). You could push this work onto the server side by calling searchTasks(status=["Every", "Other", "Status"]). But that's annoying. It would be nice to add an exclude_status argument to searchTasks.
It would be nice if you didn't have to synthesize web service objects into brand new data model objects (bugtask_as_dict) for the sake of programmer convenience, but it's understandable because the bug/bugtask split is weird. Maybe we could give a bugtask some of the attributes of its bug. Bug that wouldn't give any performance improvement above what we'd get from representation expansion, it would just make the code a little nicer. (Note that we cannot publish any 'age' type fields through the web service because that would make our representations totally uncacheable. We can only publish absolute times.)
There are a couple places where you iterate over attachments and filter them by title or contents. Would it be generally useful to publish such a filter on the server side?

Miscellaneous

These are non-performance things I noticed you could improve. In general, I speak of removing hacks that are no longer necessary due to improvement in the web service.

launchpad.load()

The use of launchpad.load() or URI construction is a code smell. In arsenal_lib.py you have this code:

self.launchpad.load("%s%s" % (self.launchpad._root_uri, project))

You can replace this with self.launchpad.projects[project].

You also have this code:

   target = self.launchpad.load(self.project.self_link + "/+source/" + package)
   bug_url = self.launchpad.bugs.createBug(
            target = target,
            title = title,
            description = description)

Here, there's no efficient alternative to constructing the URI, but you shouldn't need to call self.launchpad.load(). You should be able to pass the URI in as 'target', saving one HTTP request. (If this doesn't work, file a bug and I'll add this capability to the client.)

Finally:

   person = self.launchpad.load(service_root + '~' + lp_id)
   self.bug.subscribe(person = person)
   print "bug " + str(self.id) + ": subscribing " + person.display_name

self.launchpad.people[lp_id] looks better. Or, if you're OK printing out lp_id instead of person.display_name, you should be able to save yourself an HTTP request by doing this:

   url = service_root + '~' + lp_id
   self.bug.subscribe(person=url)
   print "bug " + str(self.id) + ": subscribing " + lp_id

Other

In several places you invoke named operations transitionToStatus and transitionToTarget. You can now set .status and .target directly and call lp_save().
In arsenal_lib.py, get_creds() implements a credential management system. You should be able to rewrite it to use launchpadlib's login_with(). This would save a lot of code and make your scripts use the same cache as other launchpadlib installations.
In several places you have code to work around Launchpad's sporadic 5xx service errors, either by retrying or by catching an error and reporting a problem. You even define a function called is_launchpad_down(). New versions of lazr.restfulclient automatically retry requests (with backoff) when they get a 502 or 503 error, so you shouldn't need to worry about this unless there is a serious problem with Launchpad. In most cases your code to deal with these situations can be removed.
It would be nice if you didn't have to synthesize web service objects into brand new data model objects (bugtask_as_dict) for the sake of programmer convenience, but it's understandable because the bug/bugtask split is weird. Maybe we could give a bugtask some of the attributes of its bug. Bug that wouldn't give any performance improvement above what we'd get from representation expansion, it would just make the code a little nicer. (We cannot publish any 'age' type fields through the web service because that would make our representations totally uncacheable. We can only publish times.)
Can you refactor process-tagging.py and process-release-tagging.py? They contain very similar logic.

Foundations/Webservice/ClientSidePerformance/ArsenalPerformanceAudit (last edited 2010-06-07 19:31:51 by leonardr)

launchpad development

Foundations/Webservice/ClientSidePerformance/ArsenalPerformanceAudit

Arsenal launchpadlib performance audit

More performance improvements you can make right now

searchTasks()

Accessing objects more than once

Server-side changes

Representation expansion

Other

Miscellaneous

launchpad.load()

Other