Arsenal launchpadlib performance audit

I went through the Arsenal] scripts looking for performance problems and warts related to their use of launchpadlib. I found everything we could hope for: a single code path causing a huge performance bottleneck in every one of the arsenal scripts. Take a look at arsenal_lib.py, in ArsenalBug.init().

This class was designed to optimize the web service interface for Arsenal programmers. But you pay a high price in unnecessary HTTP requests. Here's the problematic code:

 self.bug_tasks = bug.bug_tasks
 self.attachments = bug.attachments
 self.num_dupes = len(bug.duplicates)
 self.num_subs  = len(bug.subscriptions)

Each line makes an expensive HTTP request, whether or not the data retrieved is used for this ArsenalBug object. Take a look at this code from process_crashes.py:

        bug = ArsenalBug(bugtask.bug, lp.launchpad)
        if not (bug.has_tag("crash") or bug.has_tag("apport-crash") or bug.has_tag("crash-nullptr")):
            continue

That's 5 HTTP requests (the four I mentioned above, plus one to get bugtask.bug), for an object that in many (most?) cases is immediately discarded.

In this particular case, you should be able to get only the bugs you want by using the 'tags' argument to searchTasks(). But even for the bugs you want, you're making 5 HTTP requests and usually only one of those (bug.bugtask) is necessary.

For instance, .num_dupes and .num_subs are used only by gravity(), which is used primarily by bugtask_as_dict(). But I believe 'gravity' is only used in bug-listing.py, and then only if --show-gravity is enabled. Putting that code into gravity() and forcing anyone who wants the bug gravity to explicitly invoke ArsenalBug.gravity() would speed up your application a lot.

I have a plan to reduce the impact of this kind of coding. I will be changing lazr.restfulclient so that when you call eg. 'bug.bug_tasks' you get back a shim object. The expensive HTTP request will be made only when you use the object, by writing 'for x in bug.bug_tasks' or 'len(bug.bug_tasks)'. If you never use .bug_tasks, the request is never made. This will let you set up lots of shims in advance without having to know whether the code path you run will use the underlying data.

This would invisibly solve the problem for .bug_tasks and .attachments, but not for .num_dupes and .num_subs, since you call len() on those objects as soon as you get them.

The best immediate solution is to make .bug_tasks, .attachments, .num_dupes, and num_subs into @property methods. If I'm right, this should speed up all your scripts significantly.


More performance improvements you can make right now

searchTasks()

In some scripts you invoke searchTasks() with the 'tags' and 'tag_combinator' arguments, so that you only retrieve bugs that match certain tags. You could use this in a lot more places. In process-crashes.py you filter out bugs unless they're tagged "crash" or "apport-crash" or "crash-nullptr". By passing those tags into searchTasks() you can eliminate such bugs on the server side.

    crash_tags = ['crash', 'apport-crash', 'crash-nullpts']
    searchTasks(tags=crash_tags, tag_combinator="All")

In some places you check for the _lack_ of a tag. You could do this within searchTasks() as well. For instance, you could use it to filter out the bugs tagged with 'ubuntu_omittable' tags, without having to get a bunch of bugs and call ubuntu_omittable() on each.

    ubuntu_omittable_tags = ['-omit', '-kubuntu', '-xubuntu', '-ppc']
    searchTasks(tags=ubuntu_omittable_tags, tag_combinator="All")

This should eliminate those bugtasks on the server side, so you never see them.

Accessing objects more than once

In process-incomplete-bugs.py you have some code like this:

     for a in bug.attachments:
     author = a.message.owner.name
         if author == bug.bug.owner.name:

Every time you access bug.owner, you trigger an HTTP request (I think). Assign bug.bug.owner.name to a variable outside the loop. (Or set it as ArsenalBug.owner_name).


Server-side changes

This is stuff we could change in lazr.restful or in the Launchpad web service itself.

Representation expansion

There are several places where you could benefit from a feature we have planned. Some sample code:

        for watch in self.bug.bug_watches:
            if watch.bug_tracker.name in distro_trackers:

    for bug in bugs:
        if a.message.owner.name != bug.owner.name:

In each of these cases you iterate over a collection, and then access some related object (watch.bug_tracker, attachment.message and then message.owner). The access of the related object spawns one HTTP request for each element in the collection. With representation expansion, you could write code equivalent to this made-up example:

    for watch in self.bug.bug_watches(expand='bug_tracker'):
        if watch.bug_tracker.name in distro_trackers:

The information about each bug_watch's bug_tracker would be fetched along with the information about the bug_watches themselves. Accessing watch.bug_tracker would no longer spawn an HTTP request. The price would be additional time and bandwidth to fulfil the request for self.bug.bug_watches.

Other


Miscellaneous

These are non-performance things I noticed you could improve. In general, I speak of removing hacks that are no longer necessary due to improvement in the web service.

launchpad.load()

The use of launchpad.load() or URI construction is a code smell. In arsenal_lib.py you have this code:

self.launchpad.load("%s%s" % (self.launchpad._root_uri, project))

You can replace this with self.launchpad.projects[project].

You also have this code:

   target = self.launchpad.load(self.project.self_link + "/+source/" + package)
   bug_url = self.launchpad.bugs.createBug(
            target = target,
            title = title,
            description = description)

Here, there's no efficient alternative to constructing the URI, but you shouldn't need to call self.launchpad.load(). You should be able to pass the URI in as 'target', saving one HTTP request. (If this doesn't work, file a bug and I'll add this capability to the client.)

Finally:

   person = self.launchpad.load(service_root + '~' + lp_id)
   self.bug.subscribe(person = person)
   print "bug " + str(self.id) + ": subscribing " + person.display_name

self.launchpad.people[lp_id] looks better. Or, if you're OK printing out lp_id instead of person.display_name, you should be able to save yourself an HTTP request by doing this:

   url = service_root + '~' + lp_id
   self.bug.subscribe(person=url)
   print "bug " + str(self.id) + ": subscribing " + lp_id

Other

Foundations/Webservice/ClientSidePerformance/ArsenalPerformanceAudit (last edited 2010-06-07 19:31:51 by leonardr)