AdamW's Debugging Adventures: Inadvertent Extreme Optimization
It's time for that rarest of events: a blog post! And it's another debugging adventure. Settle in, folks!
Recently I got interested in improving the time it takes to do a full compose of Fedora. This is when we start from just the packages and a few other inputs (basically image recipes and package groups), and produce a set of repositories and boot trees and images and all that stuff. For a long time this took somewhere between 5 and 10 hours. Recently we've managed to get it down to 3-4, then I figured out a change which has got it under 3 hours.
After that I re-analyzed the process and figured out that the biggest remaining point to attack is something called the 'pkgset' phase, which happens near the start of the process, not in parallel with anything else, and takes 35 minutes or so. So I started digging into that to see if it can be improved.
I fairly quickly found that it spends about 20 minutes in one relatively small codepath. By that point it has created one giant package set (just a concept in memory at this stage; it gets turned into an actual repo later) with every package in the compose in it. During this 20 minutes, it creates subsets of that package set per architecture, with only the packages relevant to that architecture in each (so packages for that arch, plus noarch packages, plus source packages, plus packages from 'compatible' arches, like including i686 for x86_64).
I poked about at that code a bit and decided I could maybe make it more efficient. The current version creates each arch subset one at a time by looping over the big global set. Because every arch subset includes the noarch and src packages, it winds up looping over the noarch and src lists once per arch, which seemed inefficient. So I rewrote it to create all the subsets in a single pass, to try and reduce the repeated looping.
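Roughly, the change amounts to something like this - a simplified sketch with invented names and toy `(name, arch)` package tuples, not pungi's actual code or data structures:

```python
def subsets_per_arch(global_set, arches, compatible):
    """Old approach: one full pass over the global set per architecture,
    so the noarch/src packages get examined once per arch."""
    subsets = {}
    for arch in arches:
        wanted = {arch, "noarch", "src"} | compatible.get(arch, set())
        subsets[arch] = [pkg for pkg in global_set if pkg[1] in wanted]
    return subsets


def subsets_single_pass(global_set, arches, compatible):
    """Rewritten approach: a single pass over the global set, assigning
    each package to every arch subset that wants it."""
    wanted = {a: {a, "noarch", "src"} | compatible.get(a, set()) for a in arches}
    subsets = {a: [] for a in arches}
    for pkg in global_set:
        for arch in arches:
            if pkg[1] in wanted[arch]:
                subsets[arch].append(pkg)
    return subsets
```

Both produce the same subsets; the second just avoids re-walking the whole global set once per architecture.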
Today I was testing that out, which unfortunately has to be done more or less 'in production', so if you like you can watch me here, where you'll see composes appearing and disappearing every fifteen minutes or so. At first, of course, my change didn't work at all because I'd made the usual few dumb mistakes with wrong variable names and such. After fixing all that up, I timed it, and it came out about 7 minutes faster. Not earth-shattering, but hey.
So I started checking that it was accurate (i.e. that it created the same package sets as the old code). It turned out it wasn't quite (a subtle bug with noarch package exclusions). While fixing that, I ran across some lines in the code that had bugged me since the first time I started looking at it:
```python
if i.file_path in self.file_cache:
    # TODO: test if it really works
    continue
```
These were extra suspicious to me because, not much later, they're followed by this:
```python
self.file_cache.file_cache[i.file_path] = i
```
that is, we check if the thing is in `self.file_cache` and move on if it is, but if it's not, we add it to `self.file_cache.file_cache`? That didn't look right at all. But up till now I'd left it alone, because hey, it had been this way for years, right? Must be OK. Well, this afternoon, in passing, I thought "eh, let's try changing it".
Then things got weird.
I was having trouble getting the compose process to actually run exactly as it does in production, but once I did, I was getting what seemed like really bizarre results. The original code was taking 22 minutes in my tests. My earlier test of my branch had taken about 14 minutes. Now it was taking three seconds.
I thought, this can't possibly be right! So I spent a few hours running and re-running the tests, adding debug lines, trying to figure out how (surely) I had completely broken it and it was just bypassing the whole block, or something.
Then I thought...what if I go back to the original code, but change the cache thing?
So I went back to unmodified pungi code, commented out those three lines, ran a compose...and it took three seconds. Tried again with the check corrected to `self.file_cache.file_cache` instead of `self.file_cache`...three seconds.
I repeated this enough times that it must be true, but it still bugged me. So I just spent a while digging into it, and I think I know why. These file caches are `kobo.pkgset.FileCache` instances; see the source code here. So, what's the difference between `foo in self.file_cache` and `foo in self.file_cache.file_cache`? Well, a `FileCache` instance's own `file_cache` is a dict. `FileCache` instances also implement `__iter__`, returning `iter(self.file_cache)`. I think this is why `foo in self.file_cache` works at all - it actually does do the right thing. But the key is, I think, that it does it inefficiently.
Python's preferred way to do `foo in bar` is to call `bar.__contains__(foo)`. If that doesn't work, it falls back on iterating over `bar` until it either hits `foo` or runs out of iterations. If `bar` doesn't support iteration at all, it just raises an exception.
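That fallback behaviour is easy to demonstrate with a couple of throwaway classes (invented here purely for illustration):

```python
class IterOnly:
    """No __contains__, but iterable - so `in` works via iteration."""
    def __iter__(self):
        return iter([1, 2, 3])


class Nothing:
    """Neither __contains__ nor __iter__ - `in` has nothing to fall back on."""
    pass


print(2 in IterOnly())   # True - Python iterated until it found 2
print(9 in IterOnly())   # False - Python iterated all the way through

try:
    2 in Nothing()
except TypeError as e:
    print(e)             # argument of type 'Nothing' is not iterable
```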
Python dictionaries have a very efficient implementation of `__contains__`. So when we do `foo in self.file_cache.file_cache`, we hit that efficient algorithm. But `FileCache` does not implement `__contains__`, so when we do `foo in self.file_cache`, we fall back to iteration and wind up using that iterator over the dictionary's keys. This works, but is massively less efficient than the dictionary's `__contains__` method would be. And because these package sets are absolutely fracking huge, that makes a very significant difference in the end (because we hit the cache check a huge number of times, and every time it has to iterate over a huge number of dict keys).
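To see the cost concretely, here's a toy stand-in for `FileCache` (the class and field names are invented for illustration) that counts how many keys the fallback iterator walks for a single `in` check:

```python
class FileCacheLike:
    """Toy stand-in: holds a dict in .file_cache, exposes __iter__
    but (crucially) no __contains__ - just like the real class."""
    def __init__(self):
        self.file_cache = {}
        self.iter_steps = 0  # how many keys we walk during `in`

    def __iter__(self):
        for key in self.file_cache:
            self.iter_steps += 1
            yield key


cache = FileCacheLike()
cache.file_cache = {f"/pkg/{n}.rpm": n for n in range(100_000)}

# `in` on the wrapper: no __contains__, so Python falls back to iterating.
"/pkg/99999.rpm" in cache
print(cache.iter_steps)        # 100000 - every key walked for one lookup

# `in` on the inner dict: a hash lookup, no iteration at all.
cache.iter_steps = 0
"/pkg/99999.rpm" in cache.file_cache
print(cache.iter_steps)        # 0
```

Multiply that linear scan by one check per package added, and you get the 20 minutes that turned into three seconds.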
So...here's the pull request.
Turns out I could have saved the day and a half it took me to get my rewrite correct. And if anyone had ever got around to TODOing the TODO, we could've saved about 20 minutes out of every Fedora compose for the last nine years...