AdamW's Debugging Adventures: Bootloaders and machine IDs
Adam Williamson, 2022-01-11
https://www.happyassassin.net/posts/2022/01/11/adamws-debugging-adventures-bootloaders-and-machine-ids/

<p>Hi folks! Well, it looks like I forgot to blog for...<em>checks watch</em>....<em>checks calendar</em>...a year. Wow. Whoops. Sorry about that. I'm still here, though! We released, uh, lots of Fedoras since the last time I wrote about that. Fedora 35 is the current one. It's, uh, <a href="https://fedoraproject.org/wiki/Common_F35_bugs">mostly great!</a> Go <a href="https://getfedora.org/">get a copy</a>, why don't you?</p> <p>And while that's downloading, you can get comfy and listen to another of Crazy Uncle Adam's Debugging Adventures. In this episode, we'll be uncomfortably reminded just how much of the code that causes your system to actually boot at all consists of fragile shell script with no tests, so this'll be fun!</p> <p>Last month, booting a system installed from Rawhide live images stopped working properly. You could boot the live image fine, run the installation fine, but on rebooting, the system would fail to boot with an error: <code>dracut: FATAL: Don't know how to handle 'root=live:CDLABEL=Fedora-WS-Live-rawh-20211229-n-1'</code>. openQA caught this, and so did one of our QA community members - Ahed Almeleh - who <a href="https://bugzilla.redhat.com/show_bug.cgi?id=2036199">filed a bug</a>. After the end-of-year holidays, I got to figuring out what was going wrong.</p> <p>As usual, I got a bit of a head start from pre-existing knowledge. I happen to know that error message is referring to kernel arguments that are set in the bootloader configuration of <em>the live image itself</em>.
dracut is the tool that handles an early phase of boot where we boot into a temporary environment that's loaded entirely into system memory, set up the <em>real</em> system environment, and boot that. This early environment is contained in the <code>initrd</code> files you can find alongside the kernel on most Linux distributions; that's what they're for. Part of dracut's job is to be run when a kernel is installed to <em>produce</em> this environment, and then other parts of dracut are included <em>in the environment itself</em> to handle initializing things, finding the real system root, preparing it, and then switching to it. The initrd environments on Fedora live images are built to contain a dracut 'module' (called <code>90dmsquash-live</code>) that knows to interpret <code>root=live:CDLABEL=Fedora-WS-Live-rawh-20211229-n-1</code> as meaning 'go look for a live system root on the filesystem with that label and boot that'. Installed systems don't contain that module, because, well, they don't need to know how to do that, and you wouldn't really ever want an installed system to <em>try</em> and do that.</p> <p>So the short version here is: the installed system has the wrong kernel argument for telling dracut where to find the system root. It <em>should</em> look something like <code>root=/dev/mapper/fedora-root</code> (where we're pointing to a system root on an LVM volume that dracut will set up and then switch to). So the obvious next question is: why? Why is our installed system getting this wrong argument? It seemed likely that it 'leaked' from the live system to the installed system somehow, but I needed to figure out how.</p> <p>From here, I had kinda two possible ways to investigate. The easiest and fastest would probably be if I happened to know exactly how we deal with setting up bootloader configuration when running a live install. Then I'd likely have been able to start poking the most obvious places right away and figure out the problem. 
But, as it happens, I didn't at the time remember exactly how that works. I just remembered that I wind up having to figure it out every few years, and it's complicated and scary, so I tend to forget again right afterwards. I kinda knew where to start looking, but didn't really want to have to work it all out again from scratch if I could avoid it.</p> <p>So I went with the other possibility, which is always: figure out when it broke, and figure out what changed between the last time it worked and the first time it broke. This usually makes life much easier because now you know one of the things on that list is the problem. The shorter and simpler the list, the easier life gets.</p> <p>I looked at the openQA result history and found that the bug was introduced somewhere between 20211215.n.0 and 20211229.n.1 (unfortunately kind of a wide range). The good news is that only a few packages could plausibly be involved in this bug; the most likely are dracut itself, grub2 (the bootloader), grubby (a Red Hat / Fedora-specific grub configuration...thing), anaconda (the Fedora installer, which obviously does some bootloader configuration stuff), the kernel itself, and systemd (which is of course involved in the boot process itself, but also - perhaps less obviously - is where <a href="https://github.com/systemd/systemd/tree/main/src/kernel-install"><code>kernel-install</code></a>, the script used (on Fedora and many other distros) to 'install' kernels, lives; this was another handy thing I happened to know already, but really - it's always a safe bet to include systemd on the list of potential suspects for anything boot-related).</p> <p>Looking at what changed between 2021-12-15 and 2021-12-29, we could rule out grub2 and grubby, as they didn't change. There were some kernel builds, but nothing in the scriptlets changed in any way that could be related.
dracut got a build with <a href="https://src.fedoraproject.org/rpms/dracut/c/76eb28fc2ef2f9e43b5ea66d0b9c96f83e124d4b?branch=rawhide">one change</a>, but again it seemed clearly unrelated. So I was down to anaconda and systemd as suspects. On an initial quick check during the vacation, I thought anaconda had not changed, and took a brief look at systemd, but didn't see anything immediately obvious.</p> <p>When I came back to look at it more thoroughly, I realized anaconda did get a new version (36.12) on 2021-12-15, so that initially interested me quite a lot. I spent some time going through the changes in that version, and there were some that really could have been related - it changed how running things during install inside the installed system worked (which is definitely how we do some bootloader setup stuff during install), and it had interesting commit messages like "Remove the dracut_args attribute" and "Remove upd-kernel". So I spent an afternoon fairly sure it'd turn out to be one of those, reviewed all those changes, mocked up locally how they worked, examined the logs of the actual image composes, and...concluded that none of those seemed to be the problem at all. The installer seemed to still be doing things the same as it always had. There weren't any tell-tale missing or failing bootloader config steps. However, this time wasn't entirely wasted: I was reminded of exactly what anaconda does to configure the bootloader when installing from a live image.</p> <p>When we install from a live image, we don't do what the 'traditional' installer does and install a bunch of RPM packages using dnf. The live image does not contain any RPM packages. The live image itself was <em>built</em> by installing a bunch of RPM packages, but it is the <em>result</em> of that process. Instead, we essentially set up the filesystems on the drive(s) we're installing to and then just dump the contents of the live image filesystem <em>itself</em> onto them. 
Then we run a few tweaks to adjust anything that needs adjusting for this now being an installed system, not a live one. One of the things we do is re-generate the <code>initrd</code> file for the installed system, and then re-generate the bootloader configuration. This involves running <code>kernel-install</code> (which places the kernel and initrd files onto the boot partition, and writes some bootloader configuration 'snippet' files), and then running <code>grub2-mkconfig</code>. The main thing <code>grub2-mkconfig</code> does is produce the main bootloader configuration file, but that's not really why we run it at this point. There's <a href="https://github.com/rhinstaller/anaconda/blob/anaconda-36.12-1/pyanaconda/modules/storage/bootloader/utils.py#L244">a very interesting comment</a> explaining why in the anaconda source:</p> <div class="code"><pre class="code literal-block"><span class="c1"># Update the bootloader configuration to make sure that the BLS</span> <span class="c1"># entries will have the correct kernel cmdline and not the value</span> <span class="c1"># taken from /proc/cmdline, that is used to boot the live image.</span> </pre></div> <p>Which is exactly what we were dealing with here. The "BLS entries" we're talking about here are the things I called 'snippet' files above, they live in <code>/boot/loader/entries</code> on Fedora systems. These are where the kernel arguments used at boot are specified, and indeed, that's where the problematic <code>root=live:...</code> arguments were specified in broken installs - in the "BLS entries" in <code>/boot/loader/entries</code>. So it seemed like, somehow, this mechanism just wasn't working right any more - we were expecting this run of <code>grub2-mkconfig</code> in the installed system root after live installation to correct those snippets, but it wasn't. 
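<p>For concreteness, a snippet on a broken install would have looked something like this (a hypothetical reconstruction - the title, version and paths here are made up for illustration - but the <code>options</code> line shows the leaked live-image argument):</p>

```
title Fedora Linux Rawhide
version 5.16.0-0.rc7.x86_64
linux /vmlinuz-5.16.0-0.rc7.x86_64
initrd /initramfs-5.16.0-0.rc7.x86_64.img
options root=live:CDLABEL=Fedora-WS-Live-rawh-20211229-n-1 rd.live.image
```

<p>where the <code>options</code> line should instead have pointed at the installed system root, something like <code>root=/dev/mapper/fedora-root</code>.</p>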
However, as I said, I couldn't establish that any change to anaconda was causing this.</p> <p>So I eventually shelved anaconda at least temporarily and looked at systemd. And it turned out that systemd had changed too. During the time period in question, we'd gone from systemd 250~rc1 to 250~rc3. (If you check the build history of systemd the dates don't seem to match up - by 2021-12-29 the 250-2 build had happened already, but in fact the 250-1 and 250-2 builds were untagged for causing a different problem, so the 2021-12-29 compose had 250~rc3). By now I was obviously pretty focused on <code>kernel-install</code> as the most likely related part of systemd, so I went to my systemd git checkout and ran:</p> <div class="code"><pre class="code literal-block">git log v250-rc1..v250-rc3 src/kernel-install/ </pre></div> <p>which shows all the commits under <code>src/kernel-install</code> between 250-rc1 and 250-rc3. And that gave me another juicy-looking, yet thankfully short, set of commits:</p> <p><a href="https://github.com/systemd/systemd/commit/641e2124de6047e6010cd2925ea22fba29b25309">641e2124de6047e6010cd2925ea22fba29b25309</a> kernel-install: replace 00-entry-directory with K_I_LAYOUT in k-i <a href="https://github.com/systemd/systemd/commit/357376d0bb525b064f468e0e2af8193b4b90d257">357376d0bb525b064f468e0e2af8193b4b90d257</a> kernel-install: Introduce KERNEL_INSTALL_MACHINE_ID in /etc/machine-info <a href="https://github.com/systemd/systemd/commit/447a822f8ee47b63a4cae00423c4d407bfa5e516">447a822f8ee47b63a4cae00423c4d407bfa5e516</a> kernel-install: Remove "Default" from list of suffixes checked</p> <p>So I went and looked at all of those. And again...I got it wrong at first! This is I guess a good lesson from this Debugging Adventure: you don't always get the right answer at first, but that's okay. You just have to keep plugging, and always keep open the possibility that you're wrong and you should try something else. 
I spent time thinking the cause was likely a change in anaconda before focusing on systemd, then focused on the wrong systemd commit first. I got interested in 641e212 first, and had even written out a whole Bugzilla comment blaming it before I realized it wasn't the culprit (fortunately, I didn't post it!) I thought the problem was that the new check for <code>$BOOT_ROOT/$MACHINE_ID</code> would not behave as it should on Fedora and cause the install scripts to do something different from what they should - generating incorrect snippet files, or putting them in the wrong place, or something.</p> <p>Fortunately, I decided to test this before declaring it was the problem, and found out that it wasn't. I did this using something that turned out to be invaluable in figuring out the real problem.</p> <p>You may have noticed by this point - harking back to our intro - that this critical <code>kernel-install</code> script, key to making sure your system boots, is...a shell script. That calls other shell scripts. You know what else is a big pile of shell scripts? <code>dracut</code>. You know, that critical component that both builds and controls the initial boot environment. Big pile of shell script. The install script - the <code>dracut</code> command itself - is shell. All the dracut <code>modules</code> - the bits that do most of the work - are shell. There's a bit of C in the source tree (I'm not entirely sure what that bit does), but most of it's shell.</p> <p>Critical stuff like this being written in shell makes me shiver, because shell is very easy to get wrong, and quite hard to test properly (and in fact neither dracut nor kernel-install has good tests). But one good thing about it is that it's quite easy to <em>debug</em>, thanks to the magic of <code>sh -x</code>. 
If you run some shell script via <code>sh -x</code> (whether that's really <code>sh</code>, or <code>bash</code> or some other alternative pretending to be <code>sh</code>), it will run as normal but print out most of the logic (variable assignments, tests, and so on) that happen along the way. So on a VM where I'd run a broken install, I could do <code>chroot /mnt/sysimage</code> (to get into the root of the installed system), find the exact <code>kernel-install</code> command that anaconda ran from one of the logs in <code>/var/log/anaconda</code> (I forget which), and re-run it through <code>sh -x</code>. This showed me all the logic going on through the run of <code>kernel-install</code> itself and all the scripts it sources under <code>/usr/lib/kernel/install.d</code>. Using this, I could confirm that the check I suspected had the result I suspected - I could see that it was deciding that <code>layout="other"</code>, not <code>layout="bls"</code>, <a href="https://github.com/systemd/systemd/blob/v250-rc3/src/kernel-install/kernel-install#L134">here</a>. But I could <em>also</em> figure out a way to override that decision, confirm that it worked, and find that it didn't solve the problem: the config snippets were still wrong, and running <code>grub2-mkconfig</code> didn't fix them. In fact the config snippets got wronger - it turned out that we do <em>want</em> <code>kernel-install</code> to pick 'other' rather than 'bls' here, because Fedora doesn't really implement BLS according to the upstream specs, so if we let kernel-install think we do, the config snippets we get are wrong.</p> <p>So now I'd been wrong twice! But each time, I learned a bit more that eventually helped me be right. After I decided that commit wasn't the cause after all, I finally spotted the problem. I figured this out by continuing with the <code>sh -x</code> debugging, and noticing an inconsistency. 
By this point I'd thought to find out what bit of <code>grub2-mkconfig</code> should be doing the work of correcting the key bit of configuration here. It's in a <a href="https://src.fedoraproject.org/rpms/grub2/blob/b25606806096b51e9d920f50b9bb47773641644d/f/0062-Add-BLS-support-to-grub-mkconfig.patch#_230">Fedora-only downstream patch to one of the scriptlets in <code>/etc/grub.d</code></a>. It replaces the <code>options=</code> line in any snippet files it finds with what it reckons the kernel arguments "should be". So I got curious about what exactly was going wrong there. I tweaked <code>grub2-mkconfig</code> slightly to run those scriptlets using <code>sh -x</code> by changing these lines in <code>grub2-mkconfig</code>:</p> <div class="code"><pre class="code literal-block"><span class="n">echo</span> <span class="s">"### BEGIN $i ###"</span> <span class="s">"$i"</span> <span class="n">echo</span> <span class="s">"### END $i ###"</span> </pre></div> <p>to read:</p> <div class="code"><pre class="code literal-block"><span class="n">echo</span> <span class="s">"### BEGIN $i ###"</span> <span class="n">sh</span> <span class="o">-</span><span class="n">x</span> <span class="s">"$i"</span> <span class="n">echo</span> <span class="s">"### END $i ###"</span> </pre></div> <p>Now I could re-run <code>grub2-mkconfig</code> and look at what was going on behind the scenes of the scriptlet, and I noticed that it wasn't <em>finding</em> any snippet files at all. But why not?</p> <p><a href="https://src.fedoraproject.org/rpms/grub2/blob/b25606806096b51e9d920f50b9bb47773641644d/f/0062-Add-BLS-support-to-grub-mkconfig.patch#_211">The code that looks for the snippet files</a> reads the file <code>/etc/machine-id</code> as a string, then looks for files in <code>/boot/loader/entries</code> whose names start with that string (and end in <code>.conf</code>). 
So I went and looked at my sample system and...found that the files in <code>/boot/loader/entries</code> did <em>not</em> start with the string in <code>/etc/machine-id</code>. The files in <code>/boot/loader/entries</code> started with <code>a69bd9379d6445668e7df3ddbda62f86</code>, but the ID in <code>/etc/machine-id</code> was <code>b8d80a4c887c40199c4ea1a8f02aa9b4</code>. This is why everything was broken: because those IDs didn't match, <code>grub2-mkconfig</code> couldn't find the files to correct, so the argument was wrong, so the system didn't boot.</p> <p>Now I knew what was going wrong and I only had two systemd commits left on the list, it was pretty easy to see the problem. It was in <a href="https://github.com/systemd/systemd/commit/357376d0bb525b064f468e0e2af8193b4b90d257">357376d</a>. That changes how <code>kernel-install</code> names these snippet files when creating them. It names them by finding a machine ID to use as a prefix. Previously, it used whatever string was in <code>/etc/machine-id</code>; if that file didn't exist or was empty, it just used the string "Default". After that commit, it also looks for a value specified in <code>/etc/machine-info</code>. If there's a <code>/etc/machine-id</code> but not <code>/etc/machine-info</code> when you run <code>kernel-install</code>, it uses the value from <code>/etc/machine-id</code> and writes it to <code>/etc/machine-info</code>.</p> <p>When I checked those files, it turned out that on the live image, the ID in both <code>/etc/machine-id</code> and <code>/etc/machine-info</code> was <code>a69bd9379d6445668e7df3ddbda62f86</code> - the problematic ID on the installed system. When we generate the live image itself, <code>kernel-install</code> uses the value from <code>/etc/machine-id</code> and writes it to <code>/etc/machine-info</code>, and both files wind up in the live filesystem. 
But <em>on the installed system</em>, the ID in <code>/etc/machine-info</code> was that same value, but the ID in <code>/etc/machine-id</code> was different (as we saw above).</p> <p>Remember how I mentioned above that when doing a live install, we essentially dump the live filesystem itself onto the installed system? Well, one of the 'tweaks' we make when doing this is to <a href="https://github.com/rhinstaller/anaconda/blob/612bfee7d37e16d7bfab329b44182a71d04a3344/pyanaconda/modules/storage/bootloader/utils.py#L41-L48">re-generate <code>/etc/machine-id</code></a>, because that ID is meant to be unique to each installed system - we don't want every system installed from a Fedora live image to have the same machine ID as the live image itself. However, as this <code>/etc/machine-info</code> file is new, we don't strip it from or re-generate it in the installed system, we just install it. The installed system has a <code>/etc/machine-info</code> with the same ID as the live image's machine ID, but a new, different ID in <code>/etc/machine-id</code>. And this (finally) was the ultimate source of the problem! When we run them on the installed system, the new version of <code>kernel-install</code> writes config snippet files using the ID from <code>/etc/machine-info</code>. But Fedora's patched <code>grub2-mkconfig</code> scriptlet doesn't know about that mechanism at all (since it's brand new), and expects the snippet files to contain the ID from <code>/etc/machine-id</code>.</p> <p>There are various ways you could potentially solve this, but after consulting with systemd upstream, the one we chose is to <a href="https://github.com/rhinstaller/anaconda/commit/fe652add9733943a3476128a83b24b9a8c63b335">have anaconda exclude <code>/etc/machine-info</code></a> when doing a live install. 
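<p>Put together, the whole interaction can be sketched like this (hypothetical shell, not systemd's or anaconda's actual code; the file layout is simulated in a temporary directory):</p>

```shell
# Sketch of kernel-install's ID selection after commit 357376d:
# prefer KERNEL_INSTALL_MACHINE_ID from machine-info, fall back to
# machine-id, and write the fallback into machine-info for later runs
pick_machine_id() {
    etc_dir="$1"    # stands in for /etc
    id=""
    if [ -r "$etc_dir/machine-info" ]; then
        id=$(sed -n 's/^KERNEL_INSTALL_MACHINE_ID=//p' "$etc_dir/machine-info")
    fi
    if [ -z "$id" ] && [ -s "$etc_dir/machine-id" ]; then
        id=$(cat "$etc_dir/machine-id")
        echo "KERNEL_INSTALL_MACHINE_ID=$id" >> "$etc_dir/machine-info"
    fi
    echo "$id"
}

# Reproduce the bug: machine-info still carries the live image's ID,
# while machine-id was freshly regenerated by the installer
tmp=$(mktemp -d)
echo "b8d80a4c887c40199c4ea1a8f02aa9b4" > "$tmp/machine-id"
echo "KERNEL_INSTALL_MACHINE_ID=a69bd9379d6445668e7df3ddbda62f86" > "$tmp/machine-info"
pick_machine_id "$tmp"   # the stale live-image ID wins

# The fix: don't carry machine-info over from the live image at all
rm "$tmp/machine-info"
pick_machine_id "$tmp"   # now the fresh ID is used (and written back)
```

<p>After the second call, <code>machine-info</code> exists again but now agrees with <code>machine-id</code>, so the snippet files get names <code>grub2-mkconfig</code> can find.</p>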
The changes to systemd here aren't wrong - it does potentially make sense that <code>/etc/machine-id</code> and <code>/etc/machine-info</code> could both exist and specify different IDs in some cases. But for the case of Fedora live installs, it doesn't make sense. The sanest result is for those IDs to match and both be the 'fresh' machine ID that's generated at the end of the install process. By just not including <code>/etc/machine-info</code> on the installed system, we achieve this result, because now when <code>kernel-install</code> runs at the end of the install process, it reads the ID from <code>/etc/machine-id</code> and writes it to <code>/etc/machine-info</code>, and both IDs are the same, <code>grub2-mkconfig</code> finds the snippet files and edits them correctly, the installed system boots, and I can move along to the next debugging odyssey...</p>
openQA caught this, and so did one of our QA community members - Ahed Almeleh - who <a href="https://bugzilla.redhat.com/show_bug.cgi?id=2036199">filed a bug</a>. After the end-of-year holidays, I got to figuring out what was going wrong.</p> <p>As usual, I got a bit of a head start from pre-existing knowledge. I happen to know that error message is referring to kernel arguments that are set in the bootloader configuration of <em>the live image itself</em>. dracut is the tool that handles an early phase of boot where we boot into a temporary environment that's loaded entirely into system memory, set up the <em>real</em> system environment, and boot that. This early environment is contained in the <code>initrd</code> files you can find alongside the kernel on most Linux distributions; that's what they're for. Part of dracut's job is to be run when a kernel is installed to <em>produce</em> this environment, and then other parts of dracut are included <em>in the environment itself</em> to handle initializing things, finding the real system root, preparing it, and then switching to it. The initrd environments on Fedora live images are built to contain a dracut 'module' (called <code>90dmsquash-live</code>) that knows to interpret <code>root=live:CDLABEL=Fedora-WS-Live-rawh-20211229-n-1</code> as meaning 'go look for a live system root on the filesystem with that label and boot that'. Installed systems don't contain that module, because, well, they don't need to know how to do that, and you wouldn't really ever want an installed system to <em>try</em> and do that.</p> <p>So the short version here is: the installed system has the wrong kernel argument for telling dracut where to find the system root. It <em>should</em> look something like <code>root=/dev/mapper/fedora-root</code> (where we're pointing to a system root on an LVM volume that dracut will set up and then switch to). So the obvious next question is: why? 
Why is our installed system getting this wrong argument? It seemed likely that it 'leaked' from the live system to the installed system somehow, but I needed to figure out how.</p> <p>From here, I had kinda two possible ways to investigate. The easiest and fastest would probably be if I happened to know exactly how we deal with setting up bootloader configuration when running a live install. Then I'd likely have been able to start poking the most obvious places right away and figure out the problem. But, as it happens, I didn't at the time remember exactly how that works. I just remembered that I wind up having to figure it out every few years, and it's complicated and scary, so I tend to forget again right afterwards. I kinda knew where to start looking, but didn't really want to have to work it all out again from scratch if I could avoid it.</p> <p>So I went with the other possibility, which is always: figure out when it broke, and figure out what changed between the last time it worked and the first time it broke. This usually makes life much easier because now you know one of the things on that list is the problem. The shorter and simpler the list, the easier life gets.</p> <p>I looked at the openQA result history and found that the bug was introduced somewhere between 20211215.n.0 and 20211229.n.1 (unfortunately kind of a wide range). 
The good news is that only a few packages could plausibly be involved in this bug; the most likely are dracut itself, grub2 (the bootloader), grubby (a Red Hat / Fedora-specific grub configuration...thing), anaconda (the Fedora installer, which obviously does some bootloader configuration stuff), the kernel itself, and systemd (which is of course involved in the boot process itself, but also - perhaps less obviously - is where a script called <a href="https://github.com/systemd/systemd/tree/main/src/kernel-install"><code>kernel-install</code></a> that is used (on Fedora and many other distros) to 'install' kernels lives (this was another handy thing I happened to know already, but really - it's always a safe bet to include systemd on the list of potential suspects for anything boot-related).</p> <p>Looking at what changed between 2021-12-15 and 2021-12-29, we could let out grub2 and grubby as they didn't change. There were some kernel builds, but nothing in the scriptlets changed in any way that could be related. dracut got a build with <a href="https://src.fedoraproject.org/rpms/dracut/c/76eb28fc2ef2f9e43b5ea66d0b9c96f83e124d4b?branch=rawhide">one change</a>, but again it seemed clearly unrelated. So I was down to anaconda and systemd as suspects. On an initial quick check during the vacation, I thought anaconda had not changed, and took a brief look at systemd, but didn't see anything immediately obvious.</p> <p>When I came back to look at it more thoroughly, I realized anaconda did get a new version (36.12) on 2021-12-15, so that initially interested me quite a lot. I spent some time going through the changes in that version, and there were some that really could have been related - it changed how running things during install inside the installed system worked (which is definitely how we do some bootloader setup stuff during install), and it had interesting commit messages like "Remove the dracut_args attribute" and "Remove upd-kernel". 
So I spent an afternoon fairly sure it'd turn out to be one of those, reviewed all those changes, mocked up locally how they worked, examined the logs of the actual image composes, and...concluded that none of those seemed to be the problem at all. The installer seemed to still be doing things the same as it always had. There weren't any tell-tale missing or failing bootloader config steps. However, this time wasn't entirely wasted: I was reminded of exactly what anaconda does to configure the bootloader when installing from a live image.</p> <p>When we install from a live image, we don't do what the 'traditional' installer does and install a bunch of RPM packages using dnf. The live image does not contain any RPM packages. The live image itself was <em>built</em> by installing a bunch of RPM packages, but it is the <em>result</em> of that process. Instead, we essentially set up the filesystems on the drive(s) we're installing to and then just dump the contents of the live image filesystem <em>itself</em> onto them. Then we run a few tweaks to adjust anything that needs adjusting for this now being an installed system, not a live one. One of the things we do is re-generate the <code>initrd</code> file for the installed system, and then re-generate the bootloader configuration. This involves running <code>kernel-install</code> (which places the kernel and initrd files onto the boot partition, and writes some bootloader configuration 'snippet' files), and then running <code>grub2-mkconfig</code>. The main thing <code>grub2-mkconfig</code> does is produce the main bootloader configuration file, but that's not really why we run it at this point. 
There's <a href="https://github.com/rhinstaller/anaconda/blob/anaconda-36.12-1/pyanaconda/modules/storage/bootloader/utils.py#L244">a very interesting comment</a> explaining why in the anaconda source:</p> <div class="code"><pre class="code literal-block"><span class="c1"># Update the bootloader configuration to make sure that the BLS</span> <span class="c1"># entries will have the correct kernel cmdline and not the value</span> <span class="c1"># taken from /proc/cmdline, that is used to boot the live image.</span> </pre></div> <p>Which is exactly what we were dealing with here. The "BLS entries" we're talking about here are the things I called 'snippet' files above, they live in <code>/boot/loader/entries</code> on Fedora systems. These are where the kernel arguments used at boot are specified, and indeed, that's where the problematic <code>root=live:...</code> arguments were specified in broken installs - in the "BLS entries" in <code>/boot/loader/entries</code>. So it seemed like, somehow, this mechanism just wasn't working right any more - we were expecting this run of <code>grub2-mkconfig</code> in the installed system root after live installation to correct those snippets, but it wasn't. However, as I said, I couldn't establish that any change to anaconda was causing this.</p> <p>So I eventually shelved anaconda at least temporarily and looked at systemd. And it turned out that systemd had changed too. During the time period in question, we'd gone from systemd 250~rc1 to 250~rc3. (If you check the build history of systemd the dates don't seem to match up - by 2021-12-29 the 250-2 build had happened already, but in fact the 250-1 and 250-2 builds were untagged for causing a different problem, so the 2021-12-29 compose had 250~rc3). 
By now I was obviously pretty focused on <code>kernel-install</code> as the most likely related part of systemd, so I went to my systemd git checkout and ran:</p> <div class="code"><pre class="code literal-block">git log v250-rc1..v250-rc3 src/kernel-install/ </pre></div> <p>which shows all the commits under <code>src/kernel-install</code> between 250-rc1 and 250-rc3. And that gave me another juicy-looking, yet thankfully short, set of commits:</p> <p><a href="https://github.com/systemd/systemd/commit/641e2124de6047e6010cd2925ea22fba29b25309">641e2124de6047e6010cd2925ea22fba29b25309</a> kernel-install: replace 00-entry-directory with K_I_LAYOUT in k-i <a href="https://github.com/systemd/systemd/commit/357376d0bb525b064f468e0e2af8193b4b90d257">357376d0bb525b064f468e0e2af8193b4b90d257</a> kernel-install: Introduce KERNEL_INSTALL_MACHINE_ID in /etc/machine-info <a href="https://github.com/systemd/systemd/commit/447a822f8ee47b63a4cae00423c4d407bfa5e516">447a822f8ee47b63a4cae00423c4d407bfa5e516</a> kernel-install: Remove "Default" from list of suffixes checked</p> <p>So I went and looked at all of those. And again...I got it wrong at first! This is I guess a good lesson from this Debugging Adventure: you don't always get the right answer at first, but that's okay. You just have to keep plugging, and always keep open the possibility that you're wrong and you should try something else. I spent time thinking the cause was likely a change in anaconda before focusing on systemd, then focused on the wrong systemd commit first. I got interested in 641e212 first, and had even written out a whole Bugzilla comment blaming it before I realized it wasn't the culprit (fortunately, I didn't post it!) 
I thought the problem was that the new check for <code>$BOOT_ROOT/$MACHINE_ID</code> would not behave as it should on Fedora and cause the install scripts to do something different from what they should - generating incorrect snippet files, or putting them in the wrong place, or something.</p> <p>Fortunately, I decided to test this before declaring it was the problem, and found out that it wasn't. I did this using something that turned out to be invaluable in figuring out the real problem.</p> <p>You may have noticed by this point - harking back to our intro - that this critical <code>kernel-install</code> script, key to making sure your system boots, is...a shell script. That calls other shell scripts. You know what else is a big pile of shell scripts? <code>dracut</code>. You know, that critical component that both builds and controls the initial boot environment. Big pile of shell script. The install script - the <code>dracut</code> command itself - is shell. All the dracut <code>modules</code> - the bits that do most of the work - are shell. There's a bit of C in the source tree (I'm not entirely sure what that bit does), but most of it's shell.</p> <p>Critical stuff like this being written in shell makes me shiver, because shell is very easy to get wrong, and quite hard to test properly (and in fact neither dracut nor kernel-install has good tests). But one good thing about it is that it's quite easy to <em>debug</em>, thanks to the magic of <code>sh -x</code>. If you run some shell script via <code>sh -x</code> (whether that's really <code>sh</code>, or <code>bash</code> or some other alternative pretending to be <code>sh</code>), it will run as normal but print out most of the logic (variable assignments, tests, and so on) that happen along the way. 
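<p>As a quick standalone illustration (my own toy example, not taken from the systems above), here's what <code>sh -x</code> tracing looks like, driven from Python so it's self-contained:</p>

```python
import subprocess

# A tiny throwaway script; 'sh -x' echoes each command it executes
# (prefixed with '+') to stderr, including variable assignments and
# the results of tests, while the script's normal output is unchanged.
script = """
layout=bls
if [ "$layout" = "bls" ]; then
    echo "using BLS layout"
fi
"""

result = subprocess.run(["sh", "-x", "-c", script],
                        capture_output=True, text=True)
print(result.stderr)   # the trace: '+ layout=bls', '+ [ bls = bls ]', ...
print(result.stdout)   # the script's own output: 'using BLS layout'
```

<p>Running a real script like <code>kernel-install</code> this way gives you the same kind of trace, just much longer.</p>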
So on a VM where I'd run a broken install, I could do <code>chroot /mnt/sysimage</code> (to get into the root of the installed system), find the exact <code>kernel-install</code> command that anaconda ran from one of the logs in <code>/var/log/anaconda</code> (I forget which), and re-run it through <code>sh -x</code>. This showed me all the logic going on through the run of <code>kernel-install</code> itself and all the scripts it sources under <code>/usr/lib/kernel/install.d</code>. Using this, I could confirm that the check I suspected had the result I suspected - I could see that it was deciding that <code>layout="other"</code>, not <code>layout="bls"</code>, <a href="https://github.com/systemd/systemd/blob/v250-rc3/src/kernel-install/kernel-install#L134">here</a>. But I could <em>also</em> figure out a way to override that decision, confirm that it worked, and find that it didn't solve the problem: the config snippets were still wrong, and running <code>grub2-mkconfig</code> didn't fix them. In fact the config snippets got wronger - it turned out that we do <em>want</em> <code>kernel-install</code> to pick 'other' rather than 'bls' here, because Fedora doesn't really implement BLS according to the upstream specs, so if we let kernel-install think we do, the config snippets we get are wrong.</p> <p>So now I'd been wrong twice! But each time, I learned a bit more that eventually helped me be right. After I decided that commit wasn't the cause after all, I finally spotted the problem. I figured this out by continuing with the <code>sh -x</code> debugging, and noticing an inconsistency. By this point I'd thought to find out what bit of <code>grub2-mkconfig</code> should be doing the work of correcting the key bit of configuration here. 
It's in a <a href="https://src.fedoraproject.org/rpms/grub2/blob/b25606806096b51e9d920f50b9bb47773641644d/f/0062-Add-BLS-support-to-grub-mkconfig.patch#_230">Fedora-only downstream patch to one of the scriptlets in <code>/etc/grub.d</code></a>. It replaces the <code>options=</code> line in any snippet files it finds with what it reckons the kernel arguments "should be". So I got curious about what exactly was going wrong there. I tweaked <code>grub2-mkconfig</code> slightly to run those scriptlets using <code>sh -x</code> by changing these lines in <code>grub2-mkconfig</code>:</p> <div class="code"><pre class="code literal-block"><span class="n">echo</span> <span class="s">"### BEGIN $i ###"</span> <span class="s">"$i"</span> <span class="n">echo</span> <span class="s">"### END $i ###"</span> </pre></div> <p>to read:</p> <div class="code"><pre class="code literal-block"><span class="n">echo</span> <span class="s">"### BEGIN $i ###"</span> <span class="n">sh</span> <span class="o">-</span><span class="n">x</span> <span class="s">"$i"</span> <span class="n">echo</span> <span class="s">"### END $i ###"</span> </pre></div> <p>Now I could re-run <code>grub2-mkconfig</code> and look at what was going on behind the scenes of the scriptlet, and I noticed that it wasn't <em>finding</em> any snippet files at all. But why not?</p> <p><a href="https://src.fedoraproject.org/rpms/grub2/blob/b25606806096b51e9d920f50b9bb47773641644d/f/0062-Add-BLS-support-to-grub-mkconfig.patch#_211">The code that looks for the snippet files</a> reads the file <code>/etc/machine-id</code> as a string, then looks for files in <code>/boot/loader/entries</code> whose names start with that string (and end in <code>.conf</code>). So I went and looked at my sample system and...found that the files in <code>/boot/loader/entries</code> did <em>not</em> start with the string in <code>/etc/machine-id</code>. 
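<p>The matching logic is simple enough to re-render in a few lines of Python (purely illustrative - the real thing is shell inside the grub2 scriptlet), and doing so also demonstrates the failure mode: mismatched IDs mean no snippet files are found at all:</p>

```python
import tempfile
from pathlib import Path

def find_bls_entries(machine_id_file, entries_dir):
    # Read the machine ID, then keep only snippet files whose names
    # start with that ID and end in '.conf' -- the same matching rule
    # the patched scriptlet applies (re-rendered in Python here).
    machine_id = Path(machine_id_file).read_text().strip()
    return sorted(p.name for p in Path(entries_dir).glob("*.conf")
                  if p.name.startswith(machine_id))

# Reproduce the broken state: the snippet's prefix doesn't match
# the ID in machine-id, so nothing is found.
tmp = Path(tempfile.mkdtemp())
(tmp / "machine-id").write_text("b8d80a4c887c40199c4ea1a8f02aa9b4\n")
entries = tmp / "entries"
entries.mkdir()
(entries / "a69bd9379d6445668e7df3ddbda62f86-5.16.0.conf").write_text("")
print(find_bls_entries(tmp / "machine-id", entries))  # -> []

# With a matching prefix, the snippet would be found and corrected:
(entries / "b8d80a4c887c40199c4ea1a8f02aa9b4-5.16.0.conf").write_text("")
print(find_bls_entries(tmp / "machine-id", entries))
# -> ['b8d80a4c887c40199c4ea1a8f02aa9b4-5.16.0.conf']
```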
The files in <code>/boot/loader/entries</code> started with <code>a69bd9379d6445668e7df3ddbda62f86</code>, but the ID in <code>/etc/machine-id</code> was <code>b8d80a4c887c40199c4ea1a8f02aa9b4</code>. This is why everything was broken: because those IDs didn't match, <code>grub2-mkconfig</code> couldn't find the files to correct, so the argument was wrong, so the system didn't boot.</p> <p>Now I knew what was going wrong and I only had two systemd commits left on the list, it was pretty easy to see the problem. It was in <a href="https://github.com/systemd/systemd/commit/357376d0bb525b064f468e0e2af8193b4b90d257">357376d</a>. That changes how <code>kernel-install</code> names these snippet files when creating them. It names them by finding a machine ID to use as a prefix. Previously, it used whatever string was in <code>/etc/machine-id</code>; if that file didn't exist or was empty, it just used the string "Default". After that commit, it also looks for a value specified in <code>/etc/machine-info</code>. If there's a <code>/etc/machine-id</code> but not <code>/etc/machine-info</code> when you run <code>kernel-install</code>, it uses the value from <code>/etc/machine-id</code> and writes it to <code>/etc/machine-info</code>.</p> <p>When I checked those files, it turned out that on the live image, the ID in both <code>/etc/machine-id</code> and <code>/etc/machine-info</code> was <code>a69bd9379d6445668e7df3ddbda62f86</code> - the problematic ID on the installed system. When we generate the live image itself, <code>kernel-install</code> uses the value from <code>/etc/machine-id</code> and writes it to <code>/etc/machine-info</code>, and both files wind up in the live filesystem. 
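<p>To make that lookup order concrete, here's a sketch in Python (mine, and hypothetical - the real logic lives in the <code>kernel-install</code> shell scripts) of the ID selection just described:</p>

```python
import tempfile
from pathlib import Path

def pick_entry_id(etc):
    """Illustrative sketch (not the real kernel-install code) of the
    ID-selection order described above: prefer KERNEL_INSTALL_MACHINE_ID
    from machine-info, fall back to machine-id (persisting the choice),
    else use the string 'Default'."""
    etc = Path(etc)
    info = etc / "machine-info"
    if info.exists():
        for line in info.read_text().splitlines():
            if line.startswith("KERNEL_INSTALL_MACHINE_ID="):
                return line.split("=", 1)[1].strip()
    mid = etc / "machine-id"
    if mid.exists() and mid.read_text().strip():
        machine_id = mid.read_text().strip()
        # the new behaviour: persist the chosen ID into machine-info
        info.write_text(f"KERNEL_INSTALL_MACHINE_ID={machine_id}\n")
        return machine_id
    return "Default"

# A stale machine-info (carried over from the live image) wins over a
# freshly generated machine-id -- exactly the mismatch seen above:
tmp = Path(tempfile.mkdtemp())
(tmp / "machine-info").write_text(
    "KERNEL_INSTALL_MACHINE_ID=a69bd9379d6445668e7df3ddbda62f86\n")
(tmp / "machine-id").write_text("b8d80a4c887c40199c4ea1a8f02aa9b4\n")
print(pick_entry_id(tmp))  # the stale live-image ID wins

# Without a machine-info, the machine-id value is used and persisted:
tmp2 = Path(tempfile.mkdtemp())
(tmp2 / "machine-id").write_text("b8d80a4c887c40199c4ea1a8f02aa9b4\n")
print(pick_entry_id(tmp2))
print((tmp2 / "machine-info").exists())  # -> True
```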
But <em>on the installed system</em>, the ID in <code>/etc/machine-info</code> was that same value, but the ID in <code>/etc/machine-id</code> was different (as we saw above).</p> <p>Remember how I mentioned above that when doing a live install, we essentially dump the live filesystem itself onto the installed system? Well, one of the 'tweaks' we make when doing this is to <a href="https://github.com/rhinstaller/anaconda/blob/612bfee7d37e16d7bfab329b44182a71d04a3344/pyanaconda/modules/storage/bootloader/utils.py#L41-L48">re-generate <code>/etc/machine-id</code></a>, because that ID is meant to be unique to each installed system - we don't want every system installed from a Fedora live image to have the same machine ID as the live image itself. However, as this <code>/etc/machine-info</code> file is new, we don't strip it from or re-generate it in the installed system, we just install it. The installed system has a <code>/etc/machine-info</code> with the same ID as the live image's machine ID, but a new, different ID in <code>/etc/machine-id</code>. And this (finally) was the ultimate source of the problem! When we run them on the installed system, the new version of <code>kernel-install</code> writes config snippet files using the ID from <code>/etc/machine-info</code>. But Fedora's patched <code>grub2-mkconfig</code> scriptlet doesn't know about that mechanism at all (since it's brand new), and expects the snippet files to contain the ID from <code>/etc/machine-id</code>.</p> <p>There are various ways you could potentially solve this, but after consulting with systemd upstream, the one we chose is to <a href="https://github.com/rhinstaller/anaconda/commit/fe652add9733943a3476128a83b24b9a8c63b335">have anaconda exclude <code>/etc/machine-info</code></a> when doing a live install. 
The changes to systemd here aren't wrong - it does potentially make sense that <code>/etc/machine-id</code> and <code>/etc/machine-info</code> could both exist and specify different IDs in some cases. But for the case of Fedora live installs, it doesn't make sense. The sanest result is for those IDs to match and both be the 'fresh' machine ID that's generated at the end of the install process. By just not including <code>/etc/machine-info</code> on the installed system, we achieve this result, because now when <code>kernel-install</code> runs at the end of the install process, it reads the ID from <code>/etc/machine-id</code> and writes it to <code>/etc/machine-info</code>, and both IDs are the same, <code>grub2-mkconfig</code> finds the snippet files and edits them correctly, the installed system boots, and I can move along to the next debugging odyssey...</p> AdamW's Debugging Adventures: "dnf is locked by another application" https://www.happyassassin.net/posts/2019/10/18/adamws-debugging-adventures-dnf-is-locked-by-another-application/ 2019-10-18T13:45:04Z 2019-10-18T13:45:04Z Adam Williamson <p></p><p>Gather round the fire, kids, it's time for another Debugging Adventure! These are posts where I write up the process of diagnosing the root cause of a bug, where it turned out to be interesting (to me, anyway...)</p> <p>This case - <a href="https://bugzilla.redhat.com/show_bug.cgi?id=1750575">Bugzilla #1750575</a> - involved dnfdragora, the package management tool used on Fedora Xfce, which is a release-blocking environment for the ARM architecture. It was a pretty easy bug to reproduce: any time you updated a package, the update would work, but then dnfdragora would show an error "DNF is locked by another process. 
dnfdragora will exit.", and immediately exit.</p> <p>The bug sat around on the blocker list for a while; Daniel Mach (a DNF developer) <a href="https://bugzilla.redhat.com/show_bug.cgi?id=1750575#c10">looked into it a bit</a> but didn't have time to figure it out all the way. So I got tired of waiting for someone else to do it, and decided to work it out myself.</p> <p></p><h2>Where's the error coming from?</h2> <p>As a starting point, I had a nice error message - so the obvious thing to do is figure out where that message comes from. The text appears in a couple of places in dnfdragora - in an <a href="https://github.com/manatools/dnfdragora/blob/master/dnfdragora/basedragora.py#L131">exception handler</a> and also in a <a href="https://github.com/manatools/dnfdragora/blob/master/dnfdragora/basedragora.py#L90">method for setting up a connection to dnfdaemon</a>. So, if we didn't already know (I happened to) this would be the point at which we'd realize that dnfdragora is a frontend app to a backend - <a href="https://github.com/manatools/dnfdaemon">dnfdaemon</a> - which does the heavy lifting.</p> <p>So, to figure out in more detail how we were getting to one of these two points, I hacked both the points where that error is logged. Both of them read <code>logger.critical(errmsg)</code>. I changed this to <code>logger.exception(errmsg)</code>. <code>logger.exception</code> is a very handy feature of Python's <a href="https://docs.python.org/3/library/logging.html">logging module</a> which logs whatever message you specify, <strong>plus a traceback to the current state</strong>, just like the traceback you get if the app actually crashes. 
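<p>A minimal standalone demonstration of the difference (my own toy example, not dnfdragora's code):</p>

```python
import io
import logging

# Capture log output in a buffer so we can inspect it.
logger = logging.getLogger("demo")
buf = io.StringIO()
logger.addHandler(logging.StreamHandler(buf))

try:
    {}["missing"]          # any exception will do
except KeyError:
    # logger.critical(msg) would record only the message itself;
    # logger.exception(msg) records the message *plus* the full
    # traceback of the exception currently being handled.
    logger.exception("dnfdaemon client error")

print(buf.getvalue())
```

<p>The captured output contains both the message and a <code>Traceback (most recent call last):</code> block, just as if the app had crashed at that point.</p>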
So by doing that, the dnfdragora log (it logs to a file <code>dnfdragora.log</code> in the directory you run it from) gave us a traceback showing how we got to the error:</p> <pre><code>2019-10-14 17:53:29,436 [dnfdragora](ERROR) dnfdaemon client error: g-io-error-quark: GDBus.Error:org.baseurl.DnfSystem.LockedError: dnf is locked by another application (36)
Traceback (most recent call last):
  File "/usr/bin/dnfdragora", line 85, in &lt;module&gt;
    main_gui.handleevent()
  File "/usr/lib/python3.7/site-packages/dnfdragora/ui.py", line 1273, in handleevent
    if not self._searchPackages(filter, True) :
  File "/usr/lib/python3.7/site-packages/dnfdragora/ui.py", line 949, in _searchPackages
    packages = self.backend.search(fields, strings, self.match_all, self.newest_only, tags )
  File "/usr/lib/python3.7/site-packages/dnfdragora/misc.py", line 135, in newFunc
    rc = func(*args, **kwargs)
  File "/usr/lib/python3.7/site-packages/dnfdragora/dnf_backend.py", line 464, in search
    newest_only, tags)
  File "/usr/lib/python3.7/site-packages/dnfdaemon/client/__init__.py", line 508, in Search
    fields, keys, attrs, match_all, newest_only, tags))
  File "/usr/lib/python3.7/site-packages/dnfdaemon/client/__init__.py", line 293, in _run_dbus_async
    result = self._get_result(data)
  File "/usr/lib/python3.7/site-packages/dnfdaemon/client/__init__.py", line 277, in _get_result
    self._handle_dbus_error(user_data['error'])
  File "/usr/lib/python3.7/site-packages/dnfdaemon/client/__init__.py", line 250, in _handle_dbus_error
    raise DaemonError(str(err))
dnfdaemon.client.DaemonError: g-io-error-quark: GDBus.Error:org.baseurl.DnfSystem.LockedError: dnf is locked by another application (36) </code></pre> <p>So, this tells us quite a bit of stuff. We know we're crashing in some sort of 'search' operation, and dbus seems to be involved. We can also see a bit more of the architecture here. 
Note how we have <code>dnfdragora/dnf_backend.py</code> and <code>dnfdaemon/client/__init__.py</code> included in the trace, even though we're only in the dnfdragora executable here (dnfdaemon is a separate process). Looking at that and then looking at those files a bit, it's quite easy to see that the dnfdaemon Python library provides a sort of <a href="https://github.com/manatools/dnfdaemon/blob/master/python/dnfdaemon/client/__init__.py#L199">framework for a client class called (oddly enough) <code>DnfDaemonBase</code></a> which the actual client - dnfdragora in our case - is expected to subclass and flesh out. dnfdragora does this in a <a href="https://github.com/manatools/dnfdragora/blob/master/dnfdragora/dnf_backend.py#L144">class called <code>DnfRootBackend</code></a>, which inherits from both <code>dnfdragora.backend.Backend</code> (a sort of abstraction layer for dnfdragora to have multiple of these backends, though at present it only actually has this one) and <code>dnfdaemon.client.Client</code>, which is just a <a href="https://github.com/manatools/dnfdaemon/blob/master/python/dnfdaemon/client/__init__.py#L544">small extension to <code>DnfDaemonBase</code></a> that adds some dbus signal handling.</p> <p>So now we know more about the design we're dealing with, and we can also see that we're trying to do some sort of search operation which looks like it works by the client class communicating with the actual <code>dnfdaemon</code> server process via dbus, only we're hitting some kind of error in that process, and interpreting it as 'dnf is locked by another application'. If we dig a little deeper, we can figure out a bit more. 
We have to read through all of the backtrace frames and examine the functions, but ultimately we can figure out that <code>DnfRootBackend.Search()</code> is wrapped by <code>dnfdragora.misc.ExceptionHandler</code>, which handles <code>dnfdaemon.client.DaemonError</code> exceptions - like the one that's ultimately getting raised here! - by calling the base class's own <code>exception_handler()</code> on them...and for us, that's <a href="https://github.com/manatools/dnfdragora/blob/master/dnfdragora/basedragora.py#L121">BaseDragora.exception_handler</a>, one of the two places we found earlier that ultimately produces this "DNF is locked by another process. dnfdragora will exit" text. We also now have <em>two</em> indications (the dbus error itself, and the code in <code>exception_handler()</code>) that the error we're dealing with is "LockedError".</p> <h2>A misleading error...</h2> <p>At this point, I went looking for the text <code>LockedError</code>, and found it in two files in dnfdaemon that are kinda variants on each other - <a href="https://github.com/manatools/dnfdaemon/blob/master/daemon/dnfdaemon-session.py">daemon/dnfdaemon-session.py</a> and <a href="https://github.com/manatools/dnfdaemon/blob/master/daemon/dnfdaemon-system.py">daemon/dnfdaemon-system.py</a>. I didn't actually know offhand which of the two is used in our case, but it doesn't really matter, because the codepath to <code>LockedError</code> is the same in both. There's a <a href="https://github.com/manatools/dnfdaemon/blob/master/daemon/dnfdaemon-system.py#L700">function called <code>check_lock()</code></a> which checks that <code>self._lock == sender</code>, and raises <code>LockedError</code> if it doesn't. That sure looks like where we're at.</p> <p>So at this point I did a bit of poking around into how <code>self._lock</code> gets set and unset in the daemon. It turns out to be pretty simple. 
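<p>Boiled down to a toy model (mine, not the actual dnfdaemon code, and with made-up sender IDs), the check we just found behaves like this:</p>

```python
class LockedError(Exception):
    pass

class DaemonSketch:
    """Toy stand-in for the daemon's lock state (illustrative only)."""
    def __init__(self):
        self._lock = None            # no client holds the lock initially

    def check_lock(self, sender):
        # Passes only if *this* sender holds the lock.
        if self._lock != sender:
            raise LockedError("dnf is locked by another application")
        return True

daemon = DaemonSketch()
daemon._lock = ":1.100"              # pretend client :1.100 took the lock
print(daemon.check_lock(":1.100"))   # -> True: the caller holds the lock

for state in (":1.200", None):       # someone else holds it / nobody does
    daemon._lock = state
    try:
        daemon.check_lock(":1.100")
    except LockedError as err:
        print(err)                   # the same error in both cases
```

<p>Note that the same exception comes out whether the lock is held by a different client or not held at all - which turns out to matter.</p>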
The daemon is basically implemented as a class with a bunch of methods that are wrapped by <code>@dbus.service.method</code>, which makes them accessible as <em>DBus</em> methods. (One of them is <code>Search()</code>, and we can see that the <a href="https://github.com/manatools/dnfdaemon/blob/master/python/dnfdaemon/client/__init__.py#L493">client class's own <code>Search()</code> basically just calls that</a>). There are also methods called <a href="https://github.com/manatools/dnfdaemon/blob/master/daemon/dnfdaemon-system.py#L105"><code>Lock()</code></a> and <a href="https://github.com/manatools/dnfdaemon/blob/master/daemon/dnfdaemon-system.py#L350"><code>Unlock()</code></a>, which - not surprisingly - set and release this lock, by setting the daemon class' <code>self._lock</code> to be either an identifier for the DBus client or <code>None</code>, respectively. And when the daemon is first initialized, the value is set to <code>None</code>.</p> <p>At this point, I realized that the error we're dealing with here is actually a lie in two important ways:</p> <ol> <li>The message claims that the problem is the lock being held "by another application", but that's not what <code>check_lock()</code> checks, really. It passes <strong>only if the caller holds the lock</strong>. It <em>does</em> fail if the lock is held "by another application", but it <em>also</em> fails if <strong>the lock is not held at all</strong>. Given all the code we looked at so far, we can't actually trust the message's assertion that something else is holding the lock. It is also possible that the lock is not held at all.</li> <li>The message suggests that the lock in question is a lock on dnf itself. I know dnf/libdnf <em>do</em> have locking, so up to now I'd been assuming we were actually dealing with the locking in dnf itself. But at this point I realized we weren't. The dnfdaemon lock code we just looked at doesn't actually call or wrap dnf's own locking code in any way. 
This lock we're dealing with is entirely internal to dnfdaemon. It's really a lock on <em>the dnfdaemon instance itself</em>.</li> </ol> <p>So, at this point I started thinking of the error as being "dnfdaemon is either locked by another DBus client, or not locked at all".</p> <h2>So what's going on with this lock anyway?</h2> <p>My next step, now I understood the locking process we're dealing with, was to stick some logging into it. I added log lines to the <code>Lock()</code> and <code>Unlock()</code> methods, and I also made <code>check_lock()</code> log what <code>sender</code> and <code>self._lock</code> were set to before returning. Because the daemon's <code>__init__</code> sets <code>self._lock</code> to <code>None</code>, I also added a log line there that just records that we're in it. That got me some more useful information:</p> <pre><code>2019-10-14 18:53:03.397784 XXX In DnfDaemon.init now!
2019-10-14 18:53:03.402804 XXX LOCK: sender is :1.1835
2019-10-14 18:53:03.407524 XXX CHECK LOCK: sender is :1.1835
XXX CHECK LOCK: self._lock is :1.1835
2019-10-14 18:53:07.556499 XXX CHECK LOCK: sender is :1.1835
XXX CHECK LOCK: self._lock is :1.1835
[...snip a bunch more calls to check_lock where the sender is the same...]
2019-10-14 18:53:13.560828 XXX CHECK LOCK: sender is :1.1835
XXX CHECK LOCK: self._lock is :1.1835
2019-10-14 18:53:13.560941 XXX CHECK LOCK: sender is :1.1835
XXX CHECK LOCK: self._lock is :1.1835
2019-10-14 18:53:16.513900 XXX In DnfDaemon.init now! 
2019-10-14 18:53:16.516724 XXX CHECK LOCK: sender is :1.1835
XXX CHECK LOCK: self._lock is None </code></pre> <p>so we could see that when we started dnfdragora, dnfdaemon started up and dnfdragora locked it almost immediately, then throughout the whole process of reproducing the bug - run dnfdragora, search for a package to be updated, mark it for updating, run the transaction, wait for the error - there were several instances of DBus method calls where everything worked fine (we see <code>check_lock()</code> being called and finding <code>sender</code> and <code>self._lock</code> set to the same value, the identifier for dnfdragora), but then suddenly we see the daemon's <code>__init__</code> running again for some reason, not being locked, and then a <code>check_lock()</code> call that fails because the daemon instance's <code>self._lock</code> is <code>None</code>.</p> <p>After a couple of minutes, I guessed what was going on here, and the <a href="https://bugzilla.redhat.com/show_bug.cgi?id=1750575#c23">daemon's service logs confirmed it</a> - dnfdaemon was crashing and automatically restarting. The first attempt to invoke a DBus method after the crash and restart fails, because dnfdragora has not locked this <em>new</em> instance of the daemon (it has no idea it just crashed and restarted), so <code>check_lock()</code> fails. So as soon as a DBus method invocation is attempted after the dnfdaemon crash, dnfdragora errors out with the confusing "dnf is locked by another process" error.</p> <p>The crash was already mentioned in the bug report, but until now the exact interaction between the crash and the error had not been worked out - we just knew the daemon crashed and the app errored out, but we didn't really know what order those things happened in or how they related to each other.</p> <h2>OK then...why is dnfdaemon crashing?</h2> <p>So, the question now became: why is dnfdaemon crashing? 
Well, the backtrace we had didn't tell us a lot; really it only told us that something was going wrong in libdbus, which we could also tell from the dnfdaemon service log:</p> <pre><code>Oct 14 18:53:15 adam.happyassassin.net dnfdaemon-system[226042]: dbus[226042]: arguments to dbus_connection_unref() were incorrect, assertion "connection-&gt;generation == _dbus_current_generation" failed in file ../../dbus/dbus-connection.c line 2823.
Oct 14 18:53:15 adam.happyassassin.net dnfdaemon-system[226042]: This is normally a bug in some application using the D-Bus library.
Oct 14 18:53:15 adam.happyassassin.net dnfdaemon-system[226042]: D-Bus not built with -rdynamic so unable to print a backtrace </code></pre> <p>That last line looked like a cue, so of course, off I went to figure out how to build DBus with <code>-rdynamic</code>. A bit of Googling told me - thanks "the3dfxdude"! - that <a href="https://www.linuxquestions.org/questions/slackware-14/compile-dbus-with-backtrace-support-4175459330/#post4938346">the trick is to compile with --enable-asserts</a>. So I did that and reproduced the bug again, and <a href="https://bugzilla.redhat.com/show_bug.cgi?id=1750575#c25">got a bit of a better backtrace</a>. It's a long one, but by picking through it carefully I could spot - in frame #17 - the actual point at which the problem happened, which was <a href="https://github.com/manatools/dnfdaemon/blob/master/python/dnfdaemon/server/__init__.py#L595">in <code>dnfdaemon.server.DnfDaemonBase.run_transaction()</code></a>. (Note, this is a <em>different</em> <code>DnfDaemonBase</code> class from <code>dnfdaemon.client.DnfDaemonBase</code>; I don't know why they have the same name, that's just confusing).</p> <p>So, the daemon's crashing on this <code>self.TransactionEvent('end-run', NONE)</code> call. 
I poked into what that does a bit, and found a design here that kinda mirrors what happens on the client side: this <code>DnfDaemonBase</code>, like the other one, is a framework for a full daemon implementation, and it's subclassed by <a href="https://github.com/manatools/dnfdaemon/blob/master/daemon/dnfdaemon-system.py#L59">a <code>DnfDaemon</code> class here</a>. That class defines <a href="https://github.com/manatools/dnfdaemon/blob/master/daemon/dnfdaemon-system.py#L639">a <code>TransactionEvent</code> method</a> that emits a DBus signal. So...we're crashing when trying to emit a dbus signal. That all adds up with the backtrace going through libdbus and all. But, <em>why</em> are we crashing?</p> <p>At this point I tried to make a small reproducer (which basically just set up a <code>DnfDaemon</code> instance and called <code>self.TransactionEvent</code> in the same way, I think) but that didn't work - I didn't know why at the time, but figured it out later. Continuing to trace it out through code wouldn't be that easy because now we're in DBus, which I know from experience is a big complex codebase that's not that easy to just reason your way through. We had the actual DBus error to work from too - "arguments to dbus_connection_unref() were incorrect, assertion "connection-&gt;generation == _dbus_current_generation" failed" - and I looked into that a bit, but there were no really helpful leads there (I got a bit more understanding about what the error means exactly, but it didn't help me understand <em>why it was happening</em> at all).</p> <h2>Time for the old standby...</h2> <p>So, being a bit stuck, I fell back on the most trusty standby: trial and error! Well, also a <em>bit</em> of logic. It did occur to me that the dbus broker is itself a long-running daemon that other things can talk to. So I started just wondering if something was interfering with dnfdaemon's connection with the dbus broker, somehow. 
This was in my head as I poked around at stuff - wherever I wound up looking, I was looking for stuff that involved dbus.</p> <p>But to figure out where to look, I just started hacking up dnfdaemon a bit. Now this first part is probably pure intuition, but that <code>self._reset_base()</code> call on the line right before the <code>self.TransactionEvent</code> call that crashes bugged me. It's probably just long experience telling me that anything with "reset" or "refresh" in the name is bad news. :P So I thought, hey, what happens if we move it?</p> <p>I stuck some logging lines into this <code>run_transaction</code> so I knew where we got to before we crashed - this is a great dumb trick, btw, just stick lines like <code>self.logger('XXX HERE 1')</code>, <code>self.logger('XXX HERE 2')</code> etc. between every significant line in the thing you're debugging, and grep the logs for "XXX" - and moved the <code>self._reset_base()</code> call down under the <code>self.TransactionEvent</code> call...and found that when I did that, we got further, the <code>self.TransactionEvent</code> call worked and we crashed the next time something <em>else</em> tried to emit a DBus signal. I also tried commenting out the <code>self._reset_base()</code> call entirely, and found that now we would only crash the next time a DBus signal was emitted <em>after a subsequent call to the <code>Unlock()</code> method</em>, which is another method that calls <code>self._reset_base()</code>. So, at this point I was pretty confident in this description: "dnfdaemon is crashing on the first interaction with DBus after <code>self._reset_base()</code> is called".</p> <p>So my next step was to break down what <code>_reset_base()</code> was actually <em>doing</em>. 
</p> <p>Gather round the fire, kids, it's time for another Debugging Adventure! These are posts where I write up the process of diagnosing the root cause of a bug, where it turned out to be interesting (to me, anyway...)</p> <p>This case - <a href="https://bugzilla.redhat.com/show_bug.cgi?id=1750575">Bugzilla #1750575</a> - involved dnfdragora, the package management tool used on Fedora Xfce, which is a release-blocking environment for the ARM architecture. It was a pretty easy bug to reproduce: any time you updated a package, the update would work, but then dnfdragora would show an error "DNF is locked by another process. dnfdragora will exit.", and immediately exit.</p> <p>The bug sat around on the blocker list for a while; Daniel Mach (a DNF developer) <a href="https://bugzilla.redhat.com/show_bug.cgi?id=1750575#c10">looked into it a bit</a> but didn't have time to figure it out all the way. So I got tired of waiting for someone else to do it, and decided to work it out myself.</p> <h2>Where's the error coming from?</h2> <p>As a starting point, I had a nice error message - so the obvious thing to do is figure out where that message comes from.
The text appears in a couple of places in dnfdragora - in an <a href="https://github.com/manatools/dnfdragora/blob/master/dnfdragora/basedragora.py#L131">exception handler</a> and also in a <a href="https://github.com/manatools/dnfdragora/blob/master/dnfdragora/basedragora.py#L90">method for setting up a connection to dnfdaemon</a>. So, if we didn't already know (I happened to) this would be the point at which we'd realize that dnfdragora is a frontend app to a backend - <a href="https://github.com/manatools/dnfdaemon">dnfdaemon</a> - which does the heavy lifting.</p> <p>So, to figure out in more detail how we were getting to one of these two points, I hacked both the points where that error is logged. Both of them read <code>logger.critical(errmsg)</code>. I changed this to <code>logger.exception(errmsg)</code>. <code>logger.exception</code> is a very handy feature of Python's <a href="https://docs.python.org/3/library/logging.html">logging module</a> which logs whatever message you specify, <strong>plus a traceback to the current state</strong>, just like the traceback you get if the app actually crashes. 
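As a tiny standalone illustration (not dnfdragora's actual code - the logger name and message here are made up), the difference between the two calls looks like this:

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("demo")

errmsg = "DNF is locked by another process. demo will exit."

try:
    raise RuntimeError("simulated dbus failure")
except RuntimeError:
    # logger.critical(errmsg) would log only the message;
    # logger.exception() logs the message *plus* the formatted traceback
    # of the exception currently being handled.
    logger.exception(errmsg)
```

Called from inside an `except` block, `logger.exception()` behaves like `logger.error()` but appends the traceback - exactly what you want when an app swallows exceptions into a friendly error message.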
So by doing that, the dnfdragora log (it logs to a file <code>dnfdragora.log</code> in the directory you run it from) gave us a traceback showing how we got to the error:</p> <pre><code>2019-10-14 17:53:29,436 (ERROR) dnfdragora: dnfdaemon client error: g-io-error-quark: GDBus.Error:org.baseurl.DnfSystem.LockedError: dnf is locked by another application (36)
Traceback (most recent call last):
  File "/usr/bin/dnfdragora", line 85, in &lt;module&gt;
    main_gui.handleevent()
  File "/usr/lib/python3.7/site-packages/dnfdragora/ui.py", line 1273, in handleevent
    if not self._searchPackages(filter, True) :
  File "/usr/lib/python3.7/site-packages/dnfdragora/ui.py", line 949, in _searchPackages
    packages = self.backend.search(fields, strings, self.match_all, self.newest_only, tags )
  File "/usr/lib/python3.7/site-packages/dnfdragora/misc.py", line 135, in newFunc
    rc = func(*args, **kwargs)
  File "/usr/lib/python3.7/site-packages/dnfdragora/dnf_backend.py", line 464, in search
    newest_only, tags)
  File "/usr/lib/python3.7/site-packages/dnfdaemon/client/__init__.py", line 508, in Search
    fields, keys, attrs, match_all, newest_only, tags))
  File "/usr/lib/python3.7/site-packages/dnfdaemon/client/__init__.py", line 293, in _run_dbus_async
    result = self._get_result(data)
  File "/usr/lib/python3.7/site-packages/dnfdaemon/client/__init__.py", line 277, in _get_result
    self._handle_dbus_error(user_data['error'])
  File "/usr/lib/python3.7/site-packages/dnfdaemon/client/__init__.py", line 250, in _handle_dbus_error
    raise DaemonError(str(err))
dnfdaemon.client.DaemonError: g-io-error-quark: GDBus.Error:org.baseurl.DnfSystem.LockedError: dnf is locked by another application (36)
</code></pre> <p>So, this tells us quite a bit of stuff. We know we're crashing in some sort of 'search' operation, and dbus seems to be involved. We can also see a bit more of the architecture here.
Note how we have <code>dnfdragora/dnf_backend.py</code> and <code>dnfdaemon/client/__init__.py</code> included in the trace, even though we're only in the dnfdragora executable here (dnfdaemon is a separate process). Looking at that and then looking at those files a bit, it's quite easy to see that the dnfdaemon Python library provides a sort of <a href="https://github.com/manatools/dnfdaemon/blob/master/python/dnfdaemon/client/__init__.py#L199">framework for a client class called (oddly enough) <code>DnfDaemonBase</code></a> which the actual client - dnfdragora in our case - is expected to subclass and flesh out. dnfdragora does this in a <a href="https://github.com/manatools/dnfdragora/blob/master/dnfdragora/dnf_backend.py#L144">class called <code>DnfRootBackend</code></a>, which inherits from both <code>dnfdragora.backend.Backend</code> (a sort of abstraction layer for dnfdragora to have multiple of these backends, though at present it only actually has this one) and <code>dnfdaemon.client.Client</code>, which is just a <a href="https://github.com/manatools/dnfdaemon/blob/master/python/dnfdaemon/client/__init__.py#L544">small extension to <code>DnfDaemonBase</code></a> that adds some dbus signal handling.</p> <p>So now we know more about the design we're dealing with, and we can also see that we're trying to do some sort of search operation which looks like it works by the client class communicating with the actual <code>dnfdaemon</code> server process via dbus, only we're hitting some kind of error in that process, and interpreting it as 'dnf is locked by another application'. If we dig a little deeper, we can figure out a bit more.
We have to read through all of the backtrace frames and examine the functions, but ultimately we can figure out that <code>DnfRootBackend.Search()</code> is wrapped by <code>dnfdragora.misc.ExceptionHandler</code>, which handles <code>dnfdaemon.client.DaemonError</code> exceptions - like the one that's ultimately getting raised here! - by calling the base class's own <code>exception_handler()</code> on them...and for us, that's <a href="https://github.com/manatools/dnfdragora/blob/master/dnfdragora/basedragora.py#L121">BaseDragora.exception_handler</a>, one of the two places we found earlier that ultimately produces this "DNF is locked by another process. dnfdragora will exit" text. We also now have <em>two</em> indications (the dbus error itself, and the code in <code>exception_handler()</code>) that the error we're dealing with is "LockedError".</p> <h2>A misleading error...</h2> <p>At this point, I went looking for the text <code>LockedError</code>, and found it in two files in dnfdaemon that are kinda variants on each other - <a href="https://github.com/manatools/dnfdaemon/blob/master/daemon/dnfdaemon-session.py">daemon/dnfdaemon-session.py</a> and <a href="https://github.com/manatools/dnfdaemon/blob/master/daemon/dnfdaemon-system.py">daemon/dnfdaemon-system.py</a>. I didn't actually know offhand which of the two is used in our case, but it doesn't really matter, because the codepath to <code>LockedError</code> is the same in both. There's a <a href="https://github.com/manatools/dnfdaemon/blob/master/daemon/dnfdaemon-system.py#L700">function called <code>check_lock()</code></a> which checks that <code>self._lock == sender</code>, and raises <code>LockedError</code> if it doesn't. That sure looks like where we're at.</p> <p>So at this point I did a bit of poking around into how <code>self._lock</code> gets set and unset in the daemon. It turns out to be pretty simple.
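Simple enough, in fact, to sketch the whole scheme in miniature. This is a toy reconstruction for illustration, not the actual dnfdaemon code; in the real daemon, <code>sender</code> is the caller's DBus identifier (like <code>:1.1835</code>):

```python
class LockedError(Exception):
    pass


class MiniDaemon:
    """Toy version of dnfdaemon's sender-keyed lock (illustrative only)."""

    def __init__(self):
        self._lock = None              # a fresh daemon instance is unlocked

    def Lock(self, sender):
        if self._lock is None:
            self._lock = sender        # remember *which* client locked us

    def Unlock(self, sender):
        if self._lock == sender:
            self._lock = None

    def check_lock(self, sender):
        # Passes only if *this* caller holds the lock; it fails both when
        # another client holds it and when nobody holds it at all.
        if self._lock != sender:
            raise LockedError("dnf is locked by another application")
```

Note the second failure mode: when <code>self._lock</code> is <code>None</code>, every guarded call fails, even though no "other application" is involved at all.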
The daemon is basically implemented as a class with a bunch of methods that are wrapped by <code>@dbus.service.method</code>, which makes them accessible as <em>DBus</em> methods. (One of them is <code>Search()</code>, and we can see that the <a href="https://github.com/manatools/dnfdaemon/blob/master/python/dnfdaemon/client/__init__.py#L493">client class's own <code>Search()</code> basically just calls that</a>). There are also methods called <a href="https://github.com/manatools/dnfdaemon/blob/master/daemon/dnfdaemon-system.py#L105"><code>Lock()</code></a> and <a href="https://github.com/manatools/dnfdaemon/blob/master/daemon/dnfdaemon-system.py#L350"><code>Unlock()</code></a>, which - not surprisingly - set and release this lock, by setting the daemon class' <code>self._lock</code> to be either an identifier for the DBus client or <code>None</code>, respectively. And when the daemon is first initialized, the value is set to <code>None</code>.</p> <p>At this point, I realized that the error we're dealing with here is actually a lie in two important ways:</p> <ol> <li>The message claims that the problem is the lock being held "by another application", but that's not what <code>check_lock()</code> checks, really. It passes <strong>only if the caller holds the lock</strong>. It <em>does</em> fail if the lock is held "by another application", but it <em>also</em> fails if <strong>the lock is not held at all</strong>. Given all the code we looked at so far, we can't actually trust the message's assertion that something else is holding the lock. It is also possible that the lock is not held at all.</li> <li>The message suggests that the lock in question is a lock on dnf itself. I know dnf/libdnf <em>do</em> have locking, so up to now I'd been assuming we were actually dealing with the locking in dnf itself. But at this point I realized we weren't. The dnfdaemon lock code we just looked at doesn't actually call or wrap dnf's own locking code in any way. 
This lock we're dealing with is entirely internal to dnfdaemon. It's really a lock on <em>the dnfdaemon instance itself</em>.</li> </ol> <p>So, at this point I started thinking of the error as being "dnfdaemon is either locked by another DBus client, or not locked at all".</p> <h2>So what's going on with this lock anyway?</h2> <p>My next step, now I understood the locking process we're dealing with, was to stick some logging into it. I added log lines to the <code>Lock()</code> and <code>Unlock()</code> methods, and I also made <code>check_lock()</code> log what <code>sender</code> and <code>self._lock</code> were set to before returning. Because it sets <code>self._lock</code> to <code>None</code>, I also added a log line to the daemon's <code>__init__</code> that just records that we're in it. That got me some more useful information:</p> <pre><code>2019-10-14 18:53:03.397784 XXX In DnfDaemon.init now!
2019-10-14 18:53:03.402804 XXX LOCK: sender is :1.1835
2019-10-14 18:53:03.407524 XXX CHECK LOCK: sender is :1.1835
XXX CHECK LOCK: self._lock is :1.1835
2019-10-14 18:53:07.556499 XXX CHECK LOCK: sender is :1.1835
XXX CHECK LOCK: self._lock is :1.1835
[...snip a bunch more calls to check_lock where the sender is the same...]
2019-10-14 18:53:13.560828 XXX CHECK LOCK: sender is :1.1835
XXX CHECK LOCK: self._lock is :1.1835
2019-10-14 18:53:13.560941 XXX CHECK LOCK: sender is :1.1835
XXX CHECK LOCK: self._lock is :1.1835
2019-10-14 18:53:16.513900 XXX In DnfDaemon.init now!
2019-10-14 18:53:16.516724 XXX CHECK LOCK: sender is :1.1835
XXX CHECK LOCK: self._lock is None
</code></pre> <p>So we could see that when we started dnfdragora, dnfdaemon started up and dnfdragora locked it almost immediately, then throughout the whole process of reproducing the bug - run dnfdragora, search for a package to be updated, mark it for updating, run the transaction, wait for the error - there were several instances of DBus method calls where everything worked fine (we see <code>check_lock()</code> being called and finding <code>sender</code> and <code>self._lock</code> set to the same value, the identifier for dnfdragora), but then suddenly we see the daemon's <code>__init__</code> running again for some reason, not being locked, and then a <code>check_lock()</code> call that fails because the daemon instance's <code>self._lock</code> is <code>None</code>.</p> <p>After a couple of minutes, I guessed what was going on here, and the <a href="https://bugzilla.redhat.com/show_bug.cgi?id=1750575#c23">daemon's service logs confirmed it</a> - dnfdaemon was crashing and automatically restarting. The first attempt to invoke a DBus method after the crash and restart fails, because dnfdragora has not locked this <em>new</em> instance of the daemon (it has no idea it just crashed and restarted), so <code>check_lock()</code> fails. So as soon as a DBus method invocation is attempted after the dnfdaemon crash, dnfdragora errors out with the confusing "dnf is locked by another process" error.</p> <p>The crash was already mentioned in the bug report, but until now the exact interaction between the crash and the error had not been worked out - we just knew the daemon crashed and the app errored out, but we didn't really know what order those things happened in or how they related to each other.</p> <h2>OK then...why is dnfdaemon crashing?</h2> <p>So, the question now became: why is dnfdaemon crashing?
Well, the backtrace we had didn't tell us a lot; really it only told us that something was going wrong in libdbus, which we could also tell from the dnfdaemon service log:</p> <pre><code>Oct 14 18:53:15 adam.happyassassin.net dnfdaemon-system[226042]: dbus[226042]: arguments to dbus_connection_unref() were incorrect, assertion "connection-&gt;generation == _dbus_current_generation" failed in file ../../dbus/dbus-connection.c line 2823.
Oct 14 18:53:15 adam.happyassassin.net dnfdaemon-system[226042]: This is normally a bug in some application using the D-Bus library.
Oct 14 18:53:15 adam.happyassassin.net dnfdaemon-system[226042]: D-Bus not built with -rdynamic so unable to print a backtrace
</code></pre> <p>That last line looked like a cue, so of course, off I went to figure out how to build DBus with <code>-rdynamic</code>. A bit of Googling told me - thanks "the3dfxdude"! - that <a href="https://www.linuxquestions.org/questions/slackware-14/compile-dbus-with-backtrace-support-4175459330/#post4938346">the trick is to compile with --enable-asserts</a>. So I did that and reproduced the bug again, and <a href="https://bugzilla.redhat.com/show_bug.cgi?id=1750575#c25">got a bit of a better backtrace</a>. It's a long one, but by picking through it carefully I could spot - in frame #17 - the actual point at which the problem happened, which was <a href="https://github.com/manatools/dnfdaemon/blob/master/python/dnfdaemon/server/__init__.py#L595">in <code>dnfdaemon.server.DnfDaemonBase.run_transaction()</code></a>. (Note, this is a <em>different</em> <code>DnfDaemonBase</code> class from <code>dnfdaemon.client.DnfDaemonBase</code>; I don't know why they have the same name, that's just confusing).</p> <p>So, the daemon's crashing on this <code>self.TransactionEvent('end-run', NONE)</code> call.
I poked into what that does a bit, and found a design here that kinda mirrors what happens on the client side: this <code>DnfDaemonBase</code>, like the other one, is a framework for a full daemon implementation, and it's subclassed by <a href="https://github.com/manatools/dnfdaemon/blob/master/daemon/dnfdaemon-system.py#L59">a <code>DnfDaemon</code> class here</a>. That class defines <a href="https://github.com/manatools/dnfdaemon/blob/master/daemon/dnfdaemon-system.py#L639">a <code>TransactionEvent</code> method</a> that emits a DBus signal. So...we're crashing when trying to emit a dbus signal. That all adds up with the backtrace going through libdbus and all. But, <em>why</em> are we crashing?</p> <p>At this point I tried to make a small reproducer (which basically just set up a <code>DnfDaemon</code> instance and called <code>self.TransactionEvent</code> in the same way, I think) but that didn't work - I didn't know why at the time, but figured it out later. Continuing to trace it out through code wouldn't be that easy because now we're in DBus, which I know from experience is a big complex codebase that's not that easy to just reason your way through. We had the actual DBus error to work from too - "arguments to dbus_connection_unref() were incorrect, assertion "connection-&gt;generation == _dbus_current_generation" failed" - and I looked into that a bit, but there were no really helpful leads there (I got a bit more understanding about what the error means exactly, but it didn't help me understand <em>why it was happening</em> at all).</p> <h2>Time for the old standby...</h2> <p>So, being a bit stuck, I fell back on the most trusty standby: trial and error! Well, also a <em>bit</em> of logic. It did occur to me that the dbus broker is itself a long-running daemon that other things can talk to. So I started just wondering if something was interfering with dnfdaemon's connection with the dbus broker, somehow. 
This was in my head as I poked around at stuff - wherever I wound up looking, I was looking for stuff that involved dbus.</p> <p>But to figure out where to look, I just started hacking up dnfdaemon a bit. Now this first part is probably pure intuition, but that <code>self._reset_base()</code> call on the line right before the <code>self.TransactionEvent</code> call that crashes bugged me. It's probably just long experience telling me that anything with "reset" or "refresh" in the name is bad news. :P So I thought, hey, what happens if we move it?</p> <p>I stuck some logging lines into this <code>run_transaction</code> so I knew where we got to before we crashed - this is a great dumb trick, btw, just stick lines like <code>self.logger('XXX HERE 1')</code>, <code>self.logger('XXX HERE 2')</code> etc. between every significant line in the thing you're debugging, and grep the logs for "XXX" - and moved the <code>self._reset_base()</code> call down under the <code>self.TransactionEvent</code> call...and found that when I did that, we got further, the <code>self.TransactionEvent</code> call worked and we crashed the next time something <em>else</em> tried to emit a DBus signal. I also tried commenting out the <code>self._reset_base()</code> call entirely, and found that now we would only crash the next time a DBus signal was emitted <em>after a subsequent call to the <code>Unlock()</code> method</em>, which is another method that calls <code>self._reset_base()</code>. So, at this point I was pretty confident in this description: "dnfdaemon is crashing on the first interaction with DBus after <code>self._reset_base()</code> is called".</p> <p>So my next step was to break down what <code>_reset_base()</code> was actually <em>doing</em>. 
Turns out all of the detail is in the <code>DnfDaemonBase</code> skeleton server class: it has a <code>self._base</code> which is a <code>dnf.base.Base()</code> instance, and that method just calls that instance's <code>close()</code> method and sets <code>self._base</code> to <code>None</code>. So off I went into dnf code to see <a href="https://github.com/rpm-software-management/dnf/blob/master/dnf/base.py#L454">what <code>dnf.base.Base.close()</code> does</a>. Turns out it basically does two things: it calls <code>self._finalize_base()</code> and then calls <code>self.reset(True, True, True)</code>.</p> <p>Looking at the code it wasn't immediately obvious which of these would be the culprit, so it was all aboard the trial and error train again! I changed the call to <code>self._reset_base()</code> in the daemon to <code>self._base.reset(True, True, True)</code>...and the bug stopped happening! So that told me the problem was in the call to <code>_finalize_base()</code>, not the call to <code>reset()</code>. So I dug into what <code>_finalize_base()</code> does and kinda repeated this process - I kept drilling down through layers and splitting up what things did into individual pieces, and doing subsets of those pieces at a time to try and find the "smallest" thing I could which would cause the bug.</p> <p>To take a short aside...this is what I really like about these kinds of debugging odysseys. It's like being a detective, only ultimately you know that there's a definite <em>reason</em> for what's happening and there's always some way you can get closer to it. If you have enough patience there's always a next step you can take that will get you a little bit closer to figuring out what's going on. 
You just have to keep working through the little steps until you finally get there.</p> <p>Eventually I lit upon <a href="https://github.com/rpm-software-management/dnf/blob/master/dnf/rpm/transaction.py#L50">this bit of <code>dnf.rpm.transaction.TransactionWrapper.close()</code></a>. That was the key, as close as I could get to it: reducing the daemon's <code>self._reset_base()</code> call to just <code>self._base._priv_ts.ts = None</code> (which is what that line does) was enough to cause the bug. That was the one thing out of all the things that <code>self._reset_base()</code> does which caused the problem.</p> <p>So, of course, I took a look at what this <code>ts</code> thing was. Turns out it's an instance of <code>rpm.TransactionSet</code>, from RPM's Python library. So, at some point, we're setting up an instance of <code>rpm.TransactionSet</code>, and at this point we're dropping our reference to it, which - point to ponder - might trigger some kind of cleanup on it.</p> <p>Remember how I was looking for things that deal with dbus? Well, that turned out to bear fruit at this point...because what I did next was simply to go to my git checkout of rpm and grep it for 'dbus'. And lo and behold...<a href="https://github.com/rpm-software-management/rpm/blob/ab601b882b9d9d8248250111317615db1aa7b7c6/plugins/systemd_inhibit.c">this showed up</a>.</p> <p>Turns out RPM has plugins (TIL!), and in particular, it has <em>this</em> one, which talks to dbus. (What it actually does is try to inhibit systemd from suspending or shutting down the system while a package transaction is happening). And this plugin has a <a href="https://github.com/rpm-software-management/rpm/blob/ab601b882b9d9d8248250111317615db1aa7b7c6/plugins/systemd_inhibit.c#L85">cleanup function which calls something called <code>dbus_shutdown()</code></a> - aha!</p> <p>This was enough to get me pretty suspicious. 
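That "point to ponder" deserves a quick demonstration: in CPython, dropping the last reference to an object runs its cleanup immediately, so an innocent-looking assignment can trigger arbitrary teardown code. (This uses a made-up stand-in class, not the real <code>rpm.TransactionSet</code>, whose cleanup happens at the C level.)

```python
events = []


class FakeTransactionSet:
    """Stand-in for rpm.TransactionSet; only demonstrates cleanup timing."""

    def __del__(self):
        # In the real bug, the C-level cleanup tore down the RPM plugins,
        # and the systemd-inhibit plugin's teardown called dbus_shutdown().
        events.append("cleanup ran")


ts = FakeTransactionSet()
events.append("dropping reference")
ts = None   # last reference gone: __del__ runs right here in CPython
events.append("after assignment")
```

After this runs, <code>events</code> is <code>['dropping reference', 'cleanup ran', 'after assignment']</code>: the cleanup fires synchronously, in the middle of whatever else the code was doing, which is exactly how a plain <code>ts = None</code> could detonate <code>dbus_shutdown()</code>.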
So I checked my system and, indeed, I had a package <code>rpm-plugin-systemd-inhibit</code> installed. I poked at dependencies a bit and found that <code>python3-dnf</code> recommends that package, which means it'll basically be installed on nearly all Fedora installs. Still looking like a prime suspect. So, it was easy enough to check: I put the code back to a state where the crash happened, uninstalled the package, and tried again...and bingo! The crash stopped happening.</p> <p>So at this point the case was more or less closed. I just had to do a bit of confirming and tidying up. I checked and it turned out that indeed this call to <code>dbus_shutdown()</code> had been <a href="https://github.com/rpm-software-management/rpm/commit/d5f201345f6d27b6280750e5c6502f4418614fbc">added quite recently</a>, which tied in with the bug not showing up earlier. I looked up the <a href="https://dbus.freedesktop.org/doc/api/html/group__DBusMemory.html#ga01912903e39428872920d861ef565bac">documentation for <code>dbus_shutdown()</code></a> which confirmed that it's a bit of a big cannon which certainly could cause a problem like this:</p> <p>"Frees all memory allocated internally by libdbus and reverses the effects of dbus_threads_init().</p> <p>libdbus keeps internal global variables, for example caches and thread locks, and it can be useful to free these internal data structures.</p> <p>...</p> <p>You can't continue to use any D-Bus objects, such as connections, that were allocated prior to dbus_shutdown(). You can, however, start over; call dbus_threads_init() again, create new connections, and so forth."</p> <p>and then I did a scratch build of rpm with the commit reverted, tested, and found that indeed, it solved the problem. So, we finally had our culprit: when the <code>rpm.TransactionSet</code> instance went out of scope, it got cleaned up, and that resulted in this plugin's cleanup function getting called, and <code>dbus_shutdown()</code> happening. 
The RPM devs had intended that call to clean up <em>the RPM plugin's</em> DBus handles, but this is all happening in a single process, so the call also cleaned up the DBus handles used by dnfdaemon itself, and that was enough (as the docs suggest) to cause any further attempts to communicate with DBus in dnfdaemon code to blow up and crash the daemon.</p> <p>So, that's how you get from dnfdragora claiming that DNF is locked by another process to a stray RPM plugin crashing dnfdaemon on a DBus interaction!</p> AdamW's Debugging Adventures: The Mysterious Disappearing /proc https://www.happyassassin.net/posts/2019/01/17/adamws-debugging-adventures-the-mysterious-disappearing-proc/ 2019-01-17T19:15:30Z 2019-01-17T19:15:30Z Adam Williamson <p></p><p>Yep, folks, it's that time again - time for one of old Grandpa Adam's tall tales of root causing adventure...</p> <p>There's a sort of catch-22 situation in Fedora that has been a personal bugbear for a very long time. It mainly affects <a href="https://fedoraproject.org/wiki/Releases/Branched">Branched</a> releases - each new Fedora release, when it has branched from Rawhide, but before it has been released. During this period the <a href="https://admin.fedoraproject.org/updates/">Bodhi update system</a> is in effect, meaning all new packages have to go through Bodhi review before they are included in the composes for the release. This means, in theory, we should be able to make sure nothing really broken lands in the release. However, there's a big class of really important updates we have never been able to test properly at all: updates that affect the installer.</p> <p>The catch-22 is this - release engineering only builds install media from the 'stable' package set, those packages that have gone through review. So if a package under review breaks the installer, we can't test whether it breaks the installer unless we push it stable. 
Well, you <em>can</em>, but it's quite difficult - you have to learn how to build an installer image yourself, then build one containing the packages from the update and test it. I can do that, but most other people aren't going to bother.</p> <p>I've filed bugs and talked to people about ways to resolve this multiple times over many years, but a few months back I just got sick of the problem and decided to fix it myself. So I wrote an <a href="https://fedoraproject.org/wiki/OpenQA">openQA</a> update test which automates the process: it builds an installer image, with the packages from the update available to the installer image build tool. I also included a subsequent test which takes that image and runs an install with it. Since I already had the process for doing this manually down pat, it wasn't actually very difficult.</p> <p>Only...when I deployed the test to the openQA staging instance and actually tried it out, I found the installer image build would frequently fail in a rather strange way.</p> <p>The installer image build process works (more or less) by creating a temporary directory, installing a bunch of packages to it (using dnf's feature of installing to an alternative 'root'), fiddling around with that environment a bit more, creating a disk image whose root is that temporary directory, then fiddling with the image a bit to make it into a bootable ISO. (HANDWAVE HANDWAVE). However, I was finding it would commonly fail in the 'fiddling around with the environment' stage, because somehow some parts of the environment had <em>disappeared</em>. Specifically, it'd show this error:</p> <pre><code>FileNotFoundError: [Errno 2] No such file or directory: '/var/tmp/lorax.q8xfvc0p/installroot//proc/modules'
</code></pre> <p>lorax was, at that point, trying to touch that directory (never mind why). That's the <code>/proc/modules</code> inside the temporary root, basically. The question was, why was it disappearing?
And why had neither myself nor <a href="https://www.brianlane.com/">bcl</a> (the lorax maintainer) seen it happening previously in manual use, or in official composes?</p> <p>I tried reproducing it in a virtual machine...and failed. Then I tried again, and succeeded. Then I ran the command again...and it worked! That pattern turned out to repeat: I could usually get it to happen the first time I tried it in a VM, but any subsequent attempts in the same VM succeeded.</p> <p>So this was seeming really pretty mysterious. Brian couldn't get it to happen at all.</p> <p>At this point I wrote a dumb, short Python script which just constantly monitored the disappearing location and told me when it appeared and disappeared. I hacked up the openQA test to run this script, and upload the result. Using the timestamps, I was able to figure out exactly what bit of lorax was running when the directory suddenly disappeared. But...I couldn't immediately see why anything in that chunk of lorax would wind up deleting the directory.</p> <p>At this point, other work became more important, and I wound up leaving this on the back burner for a couple of months. Then I came back to it a couple days ago. I picked back up where I left off, and did a custom build of lorax with some debug logging statements strewn around the relevant section, to figure out <em>really precisely</em> where we were when things went wrong. But this turned out to be a bit of a brick wall, because it turned out that at the time the directory disappeared, lorax was just...running <code>mksquashfs</code>. And I could not figure out any plausible reason at all why a run of <code>mksquashfs</code> would cause the directory to vanish.</p> <p>After a bit, though, the thought struck me - maybe it's <em>not</em> lorax itself wiping the directory out at all! Maybe something else is doing it. So I thought to look at the system logs. And lo and behold, I found my smoking gun. 
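The monitoring script itself isn't shown in the post; a minimal reconstruction of the idea (the path, polling interval, and <code>max_checks</code> knob are my own placeholders) looks something like this:

```python
import os
import time
from datetime import datetime


def watch(path, interval=0.5, max_checks=None):
    """Poll `path`, printing a timestamped line on every appear/disappear.

    Returns the list of observed transitions ("appeared"/"disappeared").
    The very first check also reports the initial state.
    """
    events = []
    last = None
    checks = 0
    while max_checks is None or checks < max_checks:
        exists = os.path.exists(path)
        if exists != last:
            state = "appeared" if exists else "disappeared"
            print(f"{datetime.now().isoformat()} {path} {state}")
            events.append(state)
            last = exists
        checks += 1
        time.sleep(interval)
    return events
```

Pointed at something like the installroot's <code>/proc/modules</code> and left running alongside the image build, the timestamps in its output can be lined up against lorax's own log to see exactly which build step was active when the path vanished.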
At the exact time my script logged that the directory had disappeared, this message appeared in the system log:</p> <pre><code>Jan 18 01:57:30 ibm-p8-kvm-03-guest-02.virt.pnr.lab.eng.rdu2.redhat.com systemd[1]: Starting Cleanup of Temporary Directories... </code></pre> <p>now, remember our problem directory is in <code>/var/tmp</code>. So this smells very suspicious indeed! So I figured out what that service actually is - to do this, you just grep for the description ("Cleanup of Temporary Directories") in <code>/usr/lib/systemd/system</code> - and it turned out to be <code>/usr/lib/systemd/system/systemd-tmpfiles-clean.service</code>, which is part of systemd's <code>systemd-tmpfiles</code> mechanism, which you can read up on in great detail in <code>man systemd-tmpfiles</code> and <code>man tmpfiles.d</code>.</p> <p>I had run into it a few times before, so I had a vague idea what I was dealing with and what to look for. It's basically a mechanism for managing temporary files and directories: you can write config snippets which systemd will read and do stuff like creating expected temporary files or directories on boot (this lets packages manage temporary directories without doing it themselves in scriptlets). I poked through the docs again and, sure enough, it turns out another thing the system can do is <em>delete</em> temporary files that reach a certain age:</p> <pre><code>Age The date field, when set, is used to decide what files to delete when cleaning. If a file or directory is older than the current time minus the age field, it is deleted. The field format is a series of integers each followed by one of the following suffixes for the respective time units: s, m or min, h, d, w, ms, and us, meaning seconds, minutes, hours, days, weeks, milliseconds, and microseconds, respectively. Full names of the time units can be used too. </code></pre> <p>This <code>systemd-tmpfiles-clean.service</code> does that job. 
So I went looking for <code>tmpfiles.d</code> snippets that cover <code>/var/tmp</code>, and sure enough, found one, in Fedora's stock config file <code>/usr/lib/tmpfiles.d/tmp.conf</code>:</p> <pre><code>q /var/tmp 1777 root root 30d </code></pre> <p>The <code>30d</code> there is the 'age' field. So this tells the tmpfiles mechanism that it's fine to wipe anything under <code>/var/tmp</code> which is older than 30 days.</p> <p>Of course, naively we might think our directory <em>won't be</em> older than 30 days - after all, we only just ran lorax! But remember, lorax installs packages into this temporary directory, and files and directories in packages get some of their time attributes from the package. So we (at this point, Brian and I were chatting about the problem as I poked it) looked into how systemd-tmpfiles defines age, precisely:</p> <pre><code>The age of a file system entry is determined from its last modification timestamp (mtime), its last access timestamp (atime), and (except for directories) its last status change timestamp (ctime). Any of these three (or two) values will prevent cleanup if it is more recent than the current time minus the age field. </code></pre> <p>So since our thing is a directory, its <code>mtime</code> and <code>atime</code> are relevant. So Brian and I both looked into those. He did it manually, while I hacked up my check script to also print the <code>mtime</code> and <code>atime</code> of the directory when it existed. And sure enough, it turned out these were several months in the past - they were obviously related to the date the <code>filesystem</code> package (from which <code>/proc/modules</code> comes) was built. They were certainly longer than 30 days ago.</p> <p>Finally, I looked into what was actually running <code>systemd-tmpfiles-clean.service</code>; it's run on a timer, <code>systemd-tmpfiles-clean.timer</code>. 
That timer is set to run the service 15 minutes after the system boots, and every day thereafter.</p> <p>So all of this hooked up nicely into a convincing story. openQA kept running into this problem <em>because it always runs the test in a freshly-booted VM</em> - that '15 minutes after boot' was turning out to be right in the middle of the image creation. My manual reproductions were failing on the first try for the same reason - but then succeeding on the <em>second</em> and subsequent tries because the cleaner would not run again until the next day. And Brian and I had either never or rarely seen this when we ran lorax manually for one reason or another because it was pretty unlikely the "once a day" timer would happen to wake up and run just when we had lorax running (and if it did happen, we'd try again, and when it worked, we'd figure it was just some weird transient failure). The problem likely never happens in official composes, I think, because the tmpfiles timer isn't active at all in the environment lorax gets run in (haven't double-checked this, though).</p> <p>Brian now gets to deal with the thorny problem of trying to fix this somehow on the lorax side (so the tmpfiles cleanup won't remove bits of the temporary tree even if it does run while lorax is running). Now I know what's going on, it was easy enough to work around in the openQA test - I just have the test do <code>systemctl stop systemd-tmpfiles-clean.timer</code> before running the image build.</p> <p></p><p>Yep, folks, it's that time again - time for one of old Grandpa Adam's tall tales of root causing adventure...</p> <p>There's a sort of catch-22 situation in Fedora that has been a personal bugbear for a very long time. It mainly affects <a href="https://fedoraproject.org/wiki/Releases/Branched">Branched</a> releases - each new Fedora release, when it has branched from Rawhide, but before it has been released. 
During this period the <a href="https://admin.fedoraproject.org/updates/">Bodhi update system</a> is in effect, meaning all new packages have to go through Bodhi review before they are included in the composes for the release. This means, in theory, we should be able to make sure nothing really broken lands in the release. However, there's a big class of really important updates we have never been able to test properly at all: updates that affect the installer.</p> <p>The catch-22 is this - release engineering only builds install media from the 'stable' package set, those packages that have gone through review. So if a package under review breaks the installer, we can't test whether it breaks the installer unless we push it stable. Well, you <em>can</em>, but it's quite difficult - you have to learn how to build an installer image yourself, then build one containing the packages from the update and test it. I can do that, but most other people aren't going to bother.</p> <p>I've filed bugs and talked to people about ways to resolve this multiple times over many years, but a few months back I just got sick of the problem and decided to fix it myself. So I wrote an <a href="https://fedoraproject.org/wiki/OpenQA">openQA</a> update test which automates the process: it builds an installer image, with the packages from the update available to the installer image build tool. I also included a subsequent test which takes that image and runs an install with it. 
Since I already had the process for doing this manually down pat, it wasn't actually very difficult.</p> <p>Only...when I deployed the test to the openQA staging instance and actually tried it out, I found the installer image build would frequently fail in a rather strange way.</p> <p>The installer image build process works (more or less) by creating a temporary directory, installing a bunch of packages to it (using dnf's feature of installing to an alternative 'root'), fiddling around with that environment a bit more, creating a disk image whose root is that temporary directory, then fiddling with the image a bit to make it into a bootable ISO. (HANDWAVE HANDWAVE). However, I was finding it would commonly fail in the 'fiddling around with the environment' stage, because somehow some parts of the environment had <em>disappeared</em>. Specifically, it'd show this error:</p> <pre><code>FileNotFoundError: [Errno 2] No such file or directory: '/var/tmp/lorax.q8xfvc0p/installroot//proc/modules' </code></pre> <p>lorax was, at that point, trying to touch that directory (never mind why). That's the <code>/proc/modules</code> inside the temporary root, basically. The question was, why was it disappearing? And why had neither I nor <a href="https://www.brianlane.com/">bcl</a> (the lorax maintainer) seen it happening previously in manual use, or in official composes?</p> <p>I tried reproducing it in a virtual machine...and failed. Then I tried again, and succeeded. Then I ran the command again...and it worked! That pattern turned out to repeat: I could usually get it to happen the first time I tried it in a VM, but any subsequent attempts in the same VM succeeded.</p> <p>So this was seeming really pretty mysterious. Brian couldn't get it to happen at all.</p> <p>At this point I wrote a dumb, short Python script which just constantly monitored the disappearing location and told me when it appeared and disappeared.
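<p>That script is nothing fancy - a minimal sketch of the idea (the function names here are illustrative, not the exact script I ran) looks like this:</p>

```python
import os
import time
from datetime import datetime

def check(path, was_present):
    """Compare the path's current existence against its last known state.
    Returns (present, message); message is None unless the state changed."""
    present = os.path.exists(path)
    if present == was_present:
        return present, None
    stamp = datetime.now().isoformat(timespec="seconds")
    state = "appeared" if present else "disappeared"
    return present, "%s %s %s" % (stamp, path, state)

def watch(path, interval=0.5):
    """Poll forever, printing a timestamped line on every state change."""
    present = os.path.exists(path)
    while True:
        present, message = check(path, present)
        if message:
            print(message, flush=True)
        time.sleep(interval)
```

<p>Running something like <code>watch()</code> against the temporary path while the build runs is enough to timestamp exactly when it vanishes.</p>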
I hacked up the openQA test to run this script, and upload the result. Using the timestamps, I was able to figure out exactly what bit of lorax was running when the directory suddenly disappeared. But...I couldn't immediately see why anything in that chunk of lorax would wind up deleting the directory.</p> <p>At this point, other work became more important, and I wound up leaving this on the back burner for a couple of months. Then I came back to it a couple of days ago. I picked back up where I left off, and did a custom build of lorax with some debug logging statements strewn around the relevant section, to figure out <em>really precisely</em> where we were when things went wrong. But this turned out to be a bit of a brick wall, because it turned out that at the time the directory disappeared, lorax was just...running <code>mksquashfs</code>. And I could not figure out any plausible reason at all why a run of <code>mksquashfs</code> would cause the directory to vanish.</p> <p>After a bit, though, the thought struck me - maybe it's <em>not</em> lorax itself wiping the directory out at all! Maybe something else is doing it. So I thought to look at the system logs. And lo and behold, I found my smoking gun. At the exact time my script logged that the directory had disappeared, this message appeared in the system log:</p> <pre><code>Jan 18 01:57:30 ibm-p8-kvm-03-guest-02.virt.pnr.lab.eng.rdu2.redhat.com systemd[1]: Starting Cleanup of Temporary Directories... </code></pre> <p>Now, remember our problem directory is in <code>/var/tmp</code>. So this smells very suspicious indeed!
So I figured out what that service actually is - to do this, you just grep for the description ("Cleanup of Temporary Directories") in <code>/usr/lib/systemd/system</code> - and it turned out to be <code>/usr/lib/systemd/system/systemd-tmpfiles-clean.service</code>, which is part of systemd's <code>systemd-tmpfiles</code> mechanism, which you can read up on in great detail in <code>man systemd-tmpfiles</code> and <code>man tmpfiles.d</code>.</p> <p>I had run into it a few times before, so I had a vague idea what I was dealing with and what to look for. It's basically a mechanism for managing temporary files and directories: you can write config snippets which systemd will read and do stuff like creating expected temporary files or directories on boot (this lets packages manage temporary directories without doing it themselves in scriptlets). I poked through the docs again and, sure enough, it turns out another thing the system can do is <em>delete</em> temporary files that reach a certain age:</p> <pre><code>Age The date field, when set, is used to decide what files to delete when cleaning. If a file or directory is older than the current time minus the age field, it is deleted. The field format is a series of integers each followed by one of the following suffixes for the respective time units: s, m or min, h, d, w, ms, and us, meaning seconds, minutes, hours, days, weeks, milliseconds, and microseconds, respectively. Full names of the time units can be used too. </code></pre> <p>This <code>systemd-tmpfiles-clean.service</code> does that job. So I went looking for <code>tmpfiles.d</code> snippets that cover <code>/var/tmp</code>, and sure enough, found one, in Fedora's stock config file <code>/usr/lib/tmpfiles.d/tmp.conf</code>:</p> <pre><code>q /var/tmp 1777 root root 30d </code></pre> <p>The <code>30d</code> there is the 'age' field. 
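<p>Translated into Python, the age rule as it applies to a directory (per the man page excerpt quoted below, only <code>mtime</code> and <code>atime</code> count for directories) is roughly this - a sketch of the rule, not tmpfiles' actual logic, and the helper name is my own:</p>

```python
import os
import time

AGE = 30 * 24 * 60 * 60  # the 30d age field from tmp.conf, in seconds

def would_clean_dir(path, now=None):
    """Sketch of the tmpfiles age rule for a directory: it is only
    eligible for cleanup if BOTH mtime and atime are older than
    now minus the age field (ctime is ignored for directories)."""
    st = os.stat(path)
    if now is None:
        now = time.time()
    # The newest of the two timestamps decides; if either is recent,
    # the directory is kept.
    return max(st.st_mtime, st.st_atime) < now - AGE
```

<p>In our case, as we found out, both timestamps came from the package build - months in the past - so the directory was fair game.</p>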
So this tells the tmpfiles mechanism that it's fine to wipe anything under <code>/var/tmp</code> which is older than 30 days.</p> <p>Of course, naively we might think our directory <em>won't be</em> older than 30 days - after all, we only just ran lorax! But remember, lorax installs packages into this temporary directory, and files and directories in packages get some of their time attributes from the package. So we (at this point, Brian and I were chatting about the problem as I poked it) looked into how systemd-tmpfiles defines age, precisely:</p> <pre><code>The age of a file system entry is determined from its last modification timestamp (mtime), its last access timestamp (atime), and (except for directories) its last status change timestamp (ctime). Any of these three (or two) values will prevent cleanup if it is more recent than the current time minus the age field. </code></pre> <p>So since our thing is a directory, its <code>mtime</code> and <code>atime</code> are relevant. So Brian and I both looked into those. He did it manually, while I hacked up my check script to also print the <code>mtime</code> and <code>atime</code> of the directory when it existed. And sure enough, it turned out these were several months in the past - they were obviously related to the date the <code>filesystem</code> package (from which <code>/proc/modules</code> comes) was built. They were certainly more than 30 days in the past.</p> <p>Finally, I looked into what was actually running <code>systemd-tmpfiles-clean.service</code>; it's run on a timer, <code>systemd-tmpfiles-clean.timer</code>. That timer is set to run the service 15 minutes after the system boots, and every day thereafter.</p> <p>So all of this hooked up nicely into a convincing story. openQA kept running into this problem <em>because it always runs the test in a freshly-booted VM</em> - that '15 minutes after boot' was turning out to be right in the middle of the image creation.
My manual reproductions were failing on the first try for the same reason - but then succeeding on the <em>second</em> and subsequent tries because the cleaner would not run again until the next day. And Brian and I had rarely or never seen this when we ran lorax manually, because it was pretty unlikely the "once a day" timer would happen to wake up and run just when we had lorax running (and if it did happen, we'd try again, and when it worked, we'd figure it was just some weird transient failure). The problem likely never happens in official composes, I think, because the tmpfiles timer isn't active at all in the environment lorax gets run in (haven't double-checked this, though).</p> <p>Brian now gets to deal with the thorny problem of trying to fix this somehow on the lorax side (so the tmpfiles cleanup won't remove bits of the temporary tree even if it does run while lorax is running). Now that I knew what was going on, it was easy enough to work around in the openQA test - I just have the test do <code>systemctl stop systemd-tmpfiles-clean.timer</code> before running the image build.</p> AdamW's Debugging Adventures: Python 3 Porting 201 https://www.happyassassin.net/posts/2019/01/08/adamws-debugging-adventures-python-3-porting-201/ 2019-01-08T20:12:36Z 2019-01-08T20:12:36Z Adam Williamson <p>Hey folks! Time for another edition of AdamW's Debugging Adventures, wherein I boast about how great I am at fixin' stuff.</p> <p>Today's episode is about <a href="https://pagure.io/koji/issue/1192">a bug in the client for Fedora's Koji buildsystem</a> which has been biting more and more Fedora maintainers lately. The most obvious thing it affects is task watching. When you do a package build with <code>fedpkg</code>, it will by default "watch" the build task - it'll update you when the various subtasks start and finish, and not quit until the build ultimately succeeds or fails.
You can also directly watch tasks with <code>koji watch-task</code>. So this is something Fedora maintainers see a lot. There's also a common workflow where you chain something to the successful completion of a <code>fedpkg build</code> or <code>koji watch-task</code>, which relies on the task watch completing successfully and exiting 0, if the build actually completed.</p> <p>However, recently, people noticed that this task watching seemed to be just...failing, quite a lot. While the task was still running, it'd suddenly exit, usually showing this message:</p> <pre><code>ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response',)) </code></pre> <p>After a while, <a href="https://www.scrye.com/wordpress/nirik/">nirik</a> realized that this seemed to be associated with the client going from running under Python 2 by default to running under Python 3 by default. This seems to happen when running on Python 3; it doesn't seem to happen when running on Python 2.</p> <p>Today I finally decided it had got annoying enough that I'd spend some time trying to track it down.</p> <p>It's pretty obvious that the message we see relates to an exception, in some way. But ultimately something is catching that exception and printing it out and then exiting (we're not actually getting a traceback, as you do if the exception is ultimately left to reach the interpreter). So my first approach was to dig into the <code>watch-task</code> code from the top down, and try and find something that handles exceptions that looks like it might be the bit we were hitting.</p> <p>And...I failed! This happens, sometimes. In fact I still haven't found the exact bit of code that prints the message and exits. Sometimes, this just happens. It's OK. Don't give up. Try something else!</p> <p>So what I did next was kind of a long shot - I just grepped the code for the exception text.
I wasn't really expecting this to work, as there's nothing to suggest the actual exception is part of Koji; it's most likely the code doesn't contain any of that text at all. But hey, it's easy to do, so why not? And as it happened, I got lucky and <a href="https://pagure.io/koji/blob/2b8a9dd5739cc8ca725a4efc15003d467267ec04/f/koji/__init__.py#_1991">hit paydirt</a>: there happens to be a comment with some of the text from the error we're hitting. And it sure looks like it might be relevant to the problem we're having! The comment itself, and the function it's in, looked so obviously promising that I went ahead and dug a little deeper.</p> <p>That function, <code>is_conn_error()</code>, is used by only one other thing: this <a href="https://pagure.io/koji/blob/2b8a9dd5739cc8ca725a4efc15003d467267ec04/f/koji/__init__.py#_2397"><code>_sendCall()</code> method in the same file</a>. And that seems <strong>very</strong> interesting, because what it does can be boiled down to: "hey, we got an error! OK, send it to <code>is_conn_error()</code>. If that returns True, then just log a debug message and kick the session. If that returns False, then raise an exception". That behaviour obviously smells a <em>lot</em> like it could be causing our problem. So, I now had a working theory: for some reason, given some particular server behaviour, <code>is_conn_error()</code> returns <code>True</code> on Python 2 but <code>False</code> on Python 3. That causes this <code>_sendCall()</code> to raise an exception instead of just resetting the session and carrying on, and some other code - which we no longer need to find - catches that exception, prints it, and quits.</p> <p>The next step was to test this theory - because at this point it's only a theory, it could be entirely wrong. I've certainly come up with entirely plausible theories like this before which turned out to be not what was going on at all. 
So, like a true lazy shortcut enthusiast, I hacked up my local copy of Koji's <code>__init__.py</code> and sprinkled a bunch of lines like <code>print("HERE 1!")</code> and <code>print("HERE 2!")</code> through the whole of <code>is_conn_error()</code>. Then I just ran <code>koji watch-task</code> commands on random tasks until one failed.</p> <p>This is fine. When you're just trying to debug the problem you don't need to be super elegant about it. You don't need to do a proper git patch and rebuild the Koji package for your system and use proper logging methods and all the rest of it. Just dumping some <code>print</code> lines in a working copy of the file is <em>just fine</em>, if it works. Just remember to put everything back as it was before later. :)</p> <p>So, as it happened the god of root causing was on my side today, and it turned out I was right on the money. When one of the <code>koji watch-task</code> commands failed, it hit my <code>HERE 1!</code> and <code>HERE 3!</code> lines right when it died. Those told me we were indeed running through <code>is_conn_error()</code> right before the error, and further, where we were coming <em>out</em> of it. We were entering the <code>if isinstance(e, socket.error)</code> block at the start of the function, and returning <code>False</code> because the exception (<code>e</code>) <em>did</em> appear to be an instance of <code>socket.error</code>, but either did <em>not</em> have an <code>errno</code> attribute, or it was not one of <code>errno.ECONNRESET</code>, <code>errno.ECONNABORTED</code>, or <code>errno.EPIPE</code>.</p> <p>Obviously, this made me curious as to what the exception actually is, whether it has an <code>errno</code> at all, and if so, what it is. So I threw in a few more debugging lines - to print out <code>type(e)</code>, and <code>getattr(e, 'errno', 'foobar')</code>. The result of this was pretty interesting.
The second print statement gave me 'foobar', meaning the exception doesn't have an <code>errno</code> attribute at all. And the type of the exception was...<code>requests.exceptions.ConnectionError</code>.</p> <p>That's a bit curious! You wouldn't necessarily expect <code>requests.exceptions.ConnectionError</code> to be an instance of <code>socket.error</code>, would you? So why are we in a block that only handles instances of <code>socket.error</code>? Also, it's clear the <em>code</em> doesn't expect this, because there's a block later in the function that explicitly handles instances of <code>requests.exceptions.ConnectionError</code> - but because this earlier block that handles <code>socket.error</code> instances always returns, we will <em>never</em> reach that block if <code>requests.exceptions.ConnectionError</code> instances are also instances of <code>socket.error</code>. So there's clearly something screwy going on here.</p> <p>So of course the next thing to do is...look up <code>socket.error</code> in the Python 2 and Python 3 docs. <em>ANY TIME</em> you're investigating a mysterious Python 3 porting issue, remember this can be useful. Here's the <a href="https://docs.python.org/2/library/socket.html#socket.error">Python 2 <code>socket.error</code> entry</a>, and the <a href="https://docs.python.org/3/library/socket.html#socket.error">Python 3 <code>socket.error</code> entry</a>. And indeed there's a rather significant difference! The Python 2 docs talk about <code>socket.error</code> as an exception that is, well, its own unique thing. However, the Python 3 docs say: "A deprecated alias of OSError." - and even tell us specifically that this changed in Python 3.3: "Changed in version 3.3: Following PEP 3151, this class was made an alias of OSError." Obviously, this is looking an <em>awful</em> lot like one more link in the chain of what's going wrong here.</p> <p>With a bit of Python knowledge you should be able to figure out what's going on now. 
Think: if <code>socket.error</code> is now just an alias of <code>OSError</code>, what does <code>if isinstance(e, socket.error)</code> mean, in Python 3.3+? It means just the same as <code>if isinstance(e, OSError)</code>. And guess what? <code>requests.exceptions.ConnectionError</code> happens to be a subclass of <code>OSError</code>. Thus, if <code>e</code> is an instance of <code>requests.exceptions.ConnectionError</code>, <code>isinstance(e, socket.error)</code> will return <code>True</code> in Python 3.3+. In Python 2, it returns <code>False</code>. It's easy to check this in an interactive Python shell or with a test script, to confirm.</p> <p>Because of this, when we run under Python 3 and <code>e</code> is a <code>requests.exceptions.ConnectionError</code>, we're unexpectedly entering this block intended for handling <code>socket.error</code> exceptions and - because that block always returns, having the <code>return False</code> line that gets hit if the <code>errno</code> attribute check fails - we're never actually reaching the later block that's actually intended to handle <code>requests.exceptions.ConnectionError</code> instances at all; we return <code>False</code> before we get there.</p> <p>There are a few different ways you could fix this - you could just drop the <code>return False</code> short-circuit line in the <code>socket.error</code> block, for instance, or change the ordering so the <code>requests.exceptions.ConnectionError</code> handling is done first. In the end I sent <a href="https://pagure.io/koji/pull-request/1203">a pull request</a> which drops the <code>return False</code>, but also drops the <code>if isinstance(e, socket.error)</code> checks (there's another, for nested exceptions, later) entirely. Since <code>socket.error</code> is meant to be deprecated in Python 3.3+ we shouldn't really use it, and we probably don't need to - we can just rely on the <code>errno</code> attribute check alone.
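<p>Here's that quick confirmation as a stdlib-only sketch - <code>FakeConnectionError</code> stands in for <code>requests.exceptions.ConnectionError</code> (which derives from <code>OSError</code> on Python 3), and <code>looks_like_conn_error()</code> is a paraphrase of the errno-only approach, not the literal patch:</p>

```python
import errno
import socket

# On Python 3.3+, socket.error is literally the same class as OSError
# (PEP 3151), so any OSError subclass passes an isinstance check on it.
print(socket.error is OSError)

class FakeConnectionError(OSError):
    """Stand-in for requests.exceptions.ConnectionError (an OSError
    subclass on Python 3)."""

e = FakeConnectionError("Connection aborted.")
print(isinstance(e, socket.error))  # True on Python 3.3+, False on Python 2
print(getattr(e, "errno", None))    # None - no errno was ever set

# The errno-only check described above, as a sketch:
CONN_ERRNOS = (errno.ECONNRESET, errno.ECONNABORTED, errno.EPIPE)

def looks_like_conn_error(exc):
    return getattr(exc, "errno", None) in CONN_ERRNOS

print(looks_like_conn_error(OSError(errno.ECONNRESET, "reset")))  # True
print(looks_like_conn_error(e))  # False: no errno, so not provably one
```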
Whatever type the exception is, if it has an <code>errno</code> attribute and that attribute is <code>errno.ECONNRESET</code>, <code>errno.ECONNABORTED</code>, or <code>errno.EPIPE</code>, I think we can be pretty sure this is a connection error.</p> <p>What's the moral of this debugging tale? I guess it's this: when porting from Python 2 to Python 3 (or doing anything similar to that), fixing the things that outright crash or obviously behave wrong is sometimes the <em>easy</em> part. Even if everything seems to be working fine on a simple test, it's certainly possible that subtler issues like this could be lurking in the background, causing unexpected failures or (possibly worse) subtly incorrect behaviour. And of course, that's just another reason to add to the big old "Why To Have A Really Good Test Suite" list!</p> <p>There's also a 'secondary moral', I guess, and that's this: predicting all the impacts of an interface change like this is hard. Remember the Python 3 docs mentioned a PEP associated with this change? Well, <a href="https://www.python.org/dev/peps/pep-3151/">here it is</a>. If you read it, it's clear the proposers actually put quite a lot of effort into thinking about how existing code might be affected by the change, but it looks like they still didn't consider a case like this. They talk about "Careless (or "naïve") code" which "blindly catches any of OSError, IOError, socket.error, mmap.error, WindowsError, select.error without checking the errno attribute", and about "Careful code is defined as code which, when catching any of the above exceptions, examines the errno attribute to determine the actual error condition and takes action depending on it" - and claim that "useful compatibility doesn't alter the behaviour of careful exception-catching code". 
However, Koji's code here clearly matches their definition of "careful" code - it considers both the exception's type, and the <code>errno</code> attribute, in making decisions - but because it is not just doing <code>except socket.error as e</code> or similar, but catching the exception elsewhere and then passing it to this function and using <code>isinstance</code>, it still gets tripped up by the change.</p> <p>So...the ur-moral, as always, is: software is hard!</p> <p></p><p>Hey folks! Time for another edition of AdamW's Debugging Adventures, wherein I boast about how great I am at fixin' stuff.</p> <p>Today's episode is about <a href="https://pagure.io/koji/issue/1192">a bug in the client for Fedora's Koji buildsystem</a> which has been biting more and more Fedora maintainers lately. The most obvious thing it affects is task watching. When you do a package build with <code>fedpkg</code>, it will by default "watch" the build task - it'll update you when the various subtasks start and finish, and not quit until the build ultimately succeeds or fails. You can also directly watch tasks with <code>koji watch-task</code>. So this is something Fedora maintainers see a lot. There's also a common workflow where you chain something to the successful completion of a <code>fedpkg build</code> or <code>koji watch-task</code>, which relies on the task watch completing successfully and exiting 0, if the build actually completed.</p> <p>However, recently, people noticed that this task watching seemed to be just...failing, quite a lot. While the task was still running, it'd suddenly exit, usually showing this message:</p> <p>ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response',))</p> <p>After a while, <a href="https://www.scrye.com/wordpress/nirik/">nirik</a> realized that this seemed to be associated with the client going from running under Python 2 by default to running under Python 3 by default. 
This seems to happen when running on Python 3; it doesn't seem to happen when running on Python 2.</p> <p>Today I finally decided it had got annoying enough that I'd spend some time trying to track it down.</p> <p>It's pretty obvious that the message we see relates to an exception, in some way. But ultimately something is catching that exception and printing it out and then exiting (we're not actually getting a traceback, as you do if the exception is ultimately left to reach the interpreter). So my first approach was to dig into the <code>watch-task</code> code from the top down, and try and find something that handles exceptions that looks like it might be the bit we were hitting.</p> <p>And...I failed! This happens, sometimes. In fact I still haven't found the exact bit of code that prints the message and exits. Sometimes, this just happens. It's OK. Don't give up. Try something else!</p> <p>So what I did next was kind of a long shot - I just grepped the code for the exception text. I wasn't really expecting this to work, as there's nothing to suggest the actual exception is part of Koji; it's most likely the code doesn't contain any of that text at all. But hey, it's easy to do, so why not? And as it happened, I got lucky and <a href="https://pagure.io/koji/blob/2b8a9dd5739cc8ca725a4efc15003d467267ec04/f/koji/__init__.py#_1991">hit paydirt</a>: there happens to be a comment with some of the text from the error we're hitting. And it sure looks like it might be relevant to the problem we're having! The comment itself, and the function it's in, looked so obviously promising that I went ahead and dug a little deeper.</p> <p>That function, <code>is_conn_error()</code>, is used by only one other thing: this <a href="https://pagure.io/koji/blob/2b8a9dd5739cc8ca725a4efc15003d467267ec04/f/koji/__init__.py#_2397"><code>_sendCall()</code> method in the same file</a>. 
And that seems <strong>very</strong> interesting, because what it does can be boiled down to: "hey, we got an error! OK, send it to <code>is_conn_error()</code>. If that returns True, then just log a debug message and kick the session. If that returns False, then raise an exception". That behaviour obviously smells a <em>lot</em> like it could be causing our problem. So, I now had a working theory: for some reason, given some particular server behaviour, <code>is_conn_error()</code> returns <code>True</code> on Python 2 but <code>False</code> on Python 3. That causes this <code>_sendCall()</code> to raise an exception instead of just resetting the session and carrying on, and some other code - which we no longer need to find - catches that exception, prints it, and quits.</p> <p>The next step was to test this theory - because at this point it's only a theory, it could be entirely wrong. I've certainly come up with entirely plausible theories like this before which turned out to be not what was going on at all. So, like a true lazy shortcut enthusiast, I hacked up my local copy of Koji's <code><strong>init</strong>.py</code> and sprinkled a bunch of lines like <code>print("HERE 1!")</code> and <code>print("HERE 2!")</code> through the whole of <code>is_conn_error()</code>. Then I just ran<code>koji wait-task</code> commands on random tasks until one failed.</p> <p>This is fine. When you're just trying to debug the problem you don't need to be super elegant about it. You don't need to do a proper git patch and rebuild the Koji package for your system and use proper logging methods and all the rest of it. Just dumping some <code>print</code> lines in a working copy of the file is <em>just fine</em>, if it works. Just remember to put everything back as it was before later. :)</p> <p>So, as it happened the god of root causing was on my side today, and it turned out I was right on the money. 
When one of the <code>koji watch-task</code> commands failed, it hit my <code>HERE 1!</code> and <code>HERE 3!</code> lines right when it died. Those told me we were indeed running through <code>is_conn_error()</code> right before the error, and further, where we were coming <em>out</em> of it. We were entering the <code>if isinstance(e, socket.error)</code> block at the start of the function, and returning <code>False</code> because the exception (<code>e</code>) <em>did</em> appear to be an instance of <code>socket.error</code>, but either did <em>not</em> have an <code>errno</code> attribute, or it was not one of <code>errno.ECONNRESET</code>, <code>errno.ECONNABORTED</code>, or <code>errno.EPIPE</code>.</p> <p>Obviously, this made me curious as to what the exception actually is, whether it has an <code>errno</code> at all, and if so, what it is. So I threw in a few more debugging lines - to print out <code>type(e)</code>, and <code>getattr(e, 'errno', 'foobar')</code>. The result of this was pretty interesting. The second print statement gave me 'foobar', meaning the exception doesn't have an <code>errno</code> attribute at all. And the type of the exception was...<code>requests.exceptions.ConnectionError</code>.</p> <p>That's a bit curious! You wouldn't necessarily expect <code>requests.exceptions.ConnectionError</code> to be an instance of <code>socket.error</code>, would you? So why are we in a block that only handles instances of <code>socket.error</code>? Also, it's clear the <em>code</em> doesn't expect this, because there's a block later in the function that explicitly handles instances of <code>requests.exceptions.ConnectionError</code> - but because this earlier block that handles <code>socket.error</code> instances always returns, we will <em>never</em> reach that block if <code>requests.exceptions.ConnectionError</code> instances are also instances of <code>socket.error</code>. 
So there's clearly something screwy going on here.</p> <p>So of course the next thing to do is...look up <code>socket.error</code> in the Python 2 and Python 3 docs. <em>ANY TIME</em> you're investigating a mysterious Python 3 porting issue, remember this can be useful. Here's the <a href="https://docs.python.org/2/library/socket.html#socket.error">Python 2 <code>socket.error</code> entry</a>, and the <a href="https://docs.python.org/3/library/socket.html#socket.error">Python 3 <code>socket.error</code> entry</a>. And indeed there's a rather significant difference! The Python 2 docs talk about <code>socket.error</code> as an exception that is, well, its own unique thing. However, the Python 3 docs say: "A deprecated alias of OSError." - and even tell us specifically that this changed in Python 3.3: "Changed in version 3.3: Following PEP 3151, this class was made an alias of OSError." Obviously, this is looking an <em>awful</em> lot like one more link in the chain of what's going wrong here.</p> <p>With a bit of Python knowledge you should be able to figure out what's going on now. Think: if <code>socket.error</code> is now just an alias of <code>OSError</code>, what does <code>if isinstance(e, socket.error)</code> mean, in Python 3.3+? It means just the same as <code>if isinstance(e, OSError)</code>. And guess what? <code>requests.exceptions.ConnectionError</code> happens to be a subclass of <code>OSError</code>. Thus, if <code>e</code> is an instance of <code>requests.exceptions.ConnectionError</code>, <code>isinstance(e, socket.error)</code> will return <code>True</code> in Python 3.3+. In Python 2, it returns <code>False</code>. 
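<p>A quick stand-alone snippet confirms all of this; the <code>FakeConnectionError</code> class here is a hypothetical stand-in for <code>requests.exceptions.ConnectionError</code> (itself an <code>OSError</code> subclass), so it runs with only the standard library:</p>

```python
import errno
import socket

# On Python 3.3+, socket.error is literally the same class as OSError:
assert socket.error is OSError

class FakeConnectionError(OSError):
    """Hypothetical stand-in for requests.exceptions.ConnectionError."""

e = FakeConnectionError("Connection aborted")

# The exception satisfies the socket.error isinstance check on Python 3...
assert isinstance(e, socket.error)
# ...but it carries no errno, so an errno test like Koji's sends it down
# the "not a connection error" path:
assert getattr(e, "errno", None) not in (errno.ECONNRESET, errno.ECONNABORTED, errno.EPIPE)
print("confirmed: OSError subclasses pass the socket.error check")
```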
It's easy to check this in an interactive Python shell or with a test script, to confirm.</p> <p>Because of this, when we run under Python 3 and <code>e</code> is a <code>requests.exceptions.ConnectionError</code>, we're unexpectedly entering the block intended for handling <code>socket.error</code> exceptions, and - because that block always returns, hitting its <code>return False</code> line when the <code>errno</code> attribute check fails - we never reach the later block that's actually intended to handle <code>requests.exceptions.ConnectionError</code> instances: we return <code>False</code> before we get there.</p> <p>There are a few different ways you could fix this - you could just drop the <code>return False</code> short-circuit line in the <code>socket.error</code> block, for instance, or change the ordering so the <code>requests.exceptions.ConnectionError</code> handling is done first. In the end I sent <a href="https://pagure.io/koji/pull-request/1203">a pull request</a> which drops the <code>return False</code>, but also drops the <code>if isinstance(e, socket.error)</code> checks (there's another, for nested exceptions, later) entirely. Since <code>socket.error</code> is meant to be deprecated in Python 3.3+ we shouldn't really use it, and we probably don't need to - we can just rely on the <code>errno</code> attribute check alone. Whatever type the exception is, if it has an <code>errno</code> attribute and that attribute is <code>errno.ECONNRESET</code>, <code>errno.ECONNABORTED</code>, or <code>errno.EPIPE</code>, I think we can be pretty sure this is a connection error.</p> <p>What's the moral of this debugging tale? I guess it's this: when porting from Python 2 to Python 3 (or doing anything similar to that), fixing the things that outright crash or obviously behave wrong is sometimes the <em>easy</em> part. 
Even if everything seems to be working fine on a simple test, it's certainly possible that subtler issues like this could be lurking in the background, causing unexpected failures or (possibly worse) subtly incorrect behaviour. And of course, that's just another reason to add to the big old "Why To Have A Really Good Test Suite" list!</p> <p>There's also a 'secondary moral', I guess, and that's this: predicting all the impacts of an interface change like this is hard. Remember the Python 3 docs mentioned a PEP associated with this change? Well, <a href="https://www.python.org/dev/peps/pep-3151/">here it is</a>. If you read it, it's clear the proposers actually put quite a lot of effort into thinking about how existing code might be affected by the change, but it looks like they still didn't consider a case like this. They talk about "Careless (or "naïve") code" which "blindly catches any of OSError, IOError, socket.error, mmap.error, WindowsError, select.error without checking the errno attribute", and about "Careful code is defined as code which, when catching any of the above exceptions, examines the errno attribute to determine the actual error condition and takes action depending on it" - and claim that "useful compatibility doesn't alter the behaviour of careful exception-catching code". However, Koji's code here clearly matches their definition of "careful" code - it considers both the exception's type, and the <code>errno</code> attribute, in making decisions - but because it is not just doing <code>except socket.error as e</code> or similar, but catching the exception elsewhere and then passing it to this function and using <code>isinstance</code>, it still gets tripped up by the change.</p> <p>So...the ur-moral, as always, is: software is hard!</p> AdamW's Debugging Adventures: Has Anyone Seen My Kernel? 
https://www.happyassassin.net/posts/2018/11/05/adamws-debugging-adventures-has-anyone-seen-my-kernel/ 2018-11-05T15:59:44Z 2018-11-05T15:59:44Z Adam Williamson <p></p><p>Welp, since I haven't blogged for a while, here's another root-causing write up! Honestly, root-causing things is one of my favorite parts of the job, lately.</p> <p>I've been on vacation recently, and when I came back, it looked like several things were wrong with Rawhide. Several of these were relatively easy fixes: <a href="https://pagure.io/fedora-kickstarts/c/0ed1396cb94161825756900224f179589914e250?branch=master">live images not behaving right at all</a> and <a href="https://bugzilla.redhat.com/show_bug.cgi?id=1644936">the installer not launching properly any more</a> both only took a couple of hours to track down. But there were also a couple of bugs causing more recent composes to fail entirely. The inimitable <a href="https://patrick.uiterwijk.org/">puiterwijk</a> dealt with one of those (aarch64 cloud images not building properly), and I wound up taking the other one: overnight, most live image composes had suddenly stopped working.</p> <h2>What's The Problem?</h2> <p>The starting point for this debugging odyssey was the error messages we got in the failed composes:</p> <pre><code>DEBUG pylorax.treebuilder: kernels=[] ERROR livemedia-creator: No kernels found, cannot rebuild_initrds </code></pre> <p>So, the first step is pretty easy: let's just go look up where those errors come from. pylorax and livemedia-creator are both part of the <a href="https://github.com/weldr/lorax">lorax</a> tree, so we'll start there. 
It's easy to use grep to find the sources of those two messages: they're both in <a href="https://github.com/weldr/lorax/blob/59d8c91ae5fae00d9b92411b720a245871a02c16/src/pylorax/treebuilder.py">treebuilder.py</a>, the first <a href="https://github.com/weldr/lorax/blob/59d8c91ae5fae00d9b92411b720a245871a02c16/src/pylorax/treebuilder.py#L390">here, in <code>findkernels()</code></a> and the second <a href="https://github.com/weldr/lorax/blob/59d8c91ae5fae00d9b92411b720a245871a02c16/src/pylorax/treebuilder.py#L287">here, in <code>TreeBuilder.rebuild_initrds()</code></a>. As the second happens if there is no <code>self.kernels</code>, and we can see <a href="https://github.com/weldr/lorax/blob/59d8c91ae5fae00d9b92411b720a245871a02c16/src/pylorax/treebuilder.py#L269">just a few lines further back</a> that <code>self.kernels</code> is a property based on a call to <code>findkernels()</code>, it's pretty obvious that ultimately what's going wrong here is that <code>findkernels()</code> isn't finding any kernels.</p> <h2>So...Why Aren't We Finding Any Kernels?</h2> <p>So next, of course, I put my thinking cap on and had a good look at <code>findkernels()</code> - not just at the code itself, but at its history. When something that was working breaks, you're always looking for a <em>change</em> that caused it. There were no recent changes to <code>findkernels()</code>, and I couldn't see anything obviously wrong in its implementation - it's basically looking for files named <code>vmlinuz-(something)</code> in <code>/boot</code> - so it didn't look like the quest was over yet.</p> <p>So at this point there were kinda two possibilities: <code>findkernels()</code> was broken, or it was working fine but <em>there really weren't any kernels to find</em>. I decided the easiest way to figure out which we were dealing with was just to reproduce the problem and take a look. 
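<p>For reference, the behaviour of <code>findkernels()</code> described above boils down to something like this hypothetical sketch (not pylorax's actual code):</p>

```python
import glob
import os

def findkernels_sketch(root="/"):
    # Hypothetical simplification of pylorax's findkernels(): collect the
    # version suffix of every vmlinuz-<version> file under <root>/boot.
    prefix = "vmlinuz-"
    paths = glob.glob(os.path.join(root, "boot", prefix + "*"))
    return sorted(os.path.basename(p)[len(prefix):] for p in paths)
```

If that returns an empty list, there are no kernels to rebuild initrds for - which is exactly the failure mode here.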
With any sufficiently complex problem, you can usually get some way into figuring it out just by looking at logs and code and thinking about things logically, but at some point you'll want to try and reproduce the problem in a controlled environment you can poke at - knowing when you're at that point isn't an exact science, but if you find yourself with multiple possibilities and no easy way to decide which you're dealing with, that's often a good indicator.</p> <p>Doing a test live image build is a <em>bit</em> awkward for Fedora at present, unfortunately, but not really too bad. I've been meaning to write a helper tool for a while, but never got around to it. The way I do it - because I want to replicate as close as possible how the official images are built - is just to try and do exactly what the build system does, only on my local system and by hand. What the build system does is spin up a mock root, install the live image creation tools and a flattened kickstart into it, and run the tool. So, that's what I do! A handy 'trick' here is just to look at the logs from a real live image creation task - like <a href="https://koji.fedoraproject.org/koji/taskinfo?taskID=30594219">this one</a> - and shadow what they do. Note these logs will get garbage collected at some point, but I can't do anything about that.</p> <p>The main log to look at is <a href="https://kojipkgs.fedoraproject.org//work/tasks/4219/30594219/root.log">root.log</a>. 
First, we see that the tools are installed to the mock:</p> <pre><code>DEBUG util.py:439: Marking packages as installed by the group: DEBUG util.py:439: @livemedia-build glibc-all-langpacks coreutils lorax-lmc-novirt DEBUG util.py:439: util-linux selinux-policy-targeted bash DEBUG util.py:439: shadow-utils </code></pre> <p>So we'll do just the same with our mock:</p> <pre><code>mock -r fedora-rawhide-x86_64 --install glibc-all-langpacks coreutils lorax-lmc-novirt util-linux selinux-policy-targeted bash shadow-utils </code></pre> <p>(Really all you want to do is install the <code>livemedia-build</code> group, but I haven't actually found an equivalent of <code>mock --groupinstall</code>, so I just do that).</p> <p>We can find the actual image compose command in the task log again, by looking for 'livemedia-creator' - it looks like this:</p> <pre><code>INFO backend.py:285: Running in chroot: ['/sbin/livemedia-creator', '--ks', '/chroot_tmpdir/koji-image-f30-build-30594219.ks', '--logfile', '/chroot_tmpdir/lmc-logs/livemedia-out.log', '--no-virt', '--resultdir', '/chroot_tmpdir/lmc', '--project', 'Fedora-KDE-Live', '--make-iso', '--volid', 'Fedora-KDE-Live-rawh-20181101.n.', '--iso-only', '--iso-name', 'Fedora-KDE-Live-x86_64-Rawhide-20181101.n.0.iso', '--releasever', 'Rawhide', '--title', 'Fedora-KDE-Live', '--macboot'] </code></pre> <p>we can easily turn that Python array into a console command by just replacing occurrences of <code>', '</code> with a space (and dropping the quotes and brackets):</p> <pre><code>/sbin/livemedia-creator --ks /chroot_tmpdir/koji-image-f30-build-30594219.ks --logfile /chroot_tmpdir/lmc-logs/livemedia-out.log --no-virt --resultdir /chroot_tmpdir/lmc --project Fedora-KDE-Live --make-iso --volid Fedora-KDE-Live-rawh-20181101.n. 
--iso-only --iso-name Fedora-KDE-Live-x86_64-Rawhide-20181101.n.0.iso --releasever Rawhide --title Fedora-KDE-Live --macboot </code></pre> <p>We can see that's using a scratch directory called <code>/chroot_tmpdir</code>, and a kickstart file called <code>koji-image-f30-build-30594219.ks</code>. This kickstart can be found as <a href="https://kojipkgs.fedoraproject.org//work/tasks/6109/30606109/koji-image-f30-build-30606109.ks">one of the task assets</a>, so we'll grab it and copy it into the mock's <code>/root</code> for now:</p> <pre><code>sudo cp koji-image-f30-build-30606109.ks /var/lib/mock/fedora-rawhide-x86_64/root/root </code></pre> <p>Then finally, we're going to get a shell in the mock root, using the old-style chroot implementation (this is necessary for live image builds to work, the new systemd-based implementation doesn't work yet):</p> <pre><code>mock -r fedora-rawhide-x86_64 --shell --old-chroot </code></pre> <p>Inside the mock, we'll create the <code>/chroot_tmpdir</code> scratch dir, copy the kickstart into it, and finally run the image creation (pointing <code>--ks</code> at the kickstart we just copied in):</p> <pre><code>mkdir -p /chroot_tmpdir cd /chroot_tmpdir cp /root/koji-image-f30-build-30606109.ks . /sbin/livemedia-creator --ks /chroot_tmpdir/koji-image-f30-build-30606109.ks --logfile /chroot_tmpdir/lmc-logs/livemedia-out.log --no-virt --resultdir /chroot_tmpdir/lmc --project Fedora-KDE-Live --make-iso --volid Fedora-KDE-Live-rawh-20181101.n. --iso-only --iso-name Fedora-KDE-Live-x86_64-Rawhide-20181101.n.0.iso --releasever Rawhide --title Fedora-KDE-Live --macboot </code></pre> <p>And when I did that, it worked away for a while - half an hour or so - and eventually failed exactly like the 'official' build had! So now I had a failure in a controlled environment (my little mock root) to look at. 
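<p>Incidentally, if you have a Python shell handy, the array-to-command conversion can be done more robustly with the standard library's <code>shlex.join()</code> (Python 3.8+), which also quotes any argument that actually needs it:</p>

```python
import shlex

# The argv list as logged by Koji (abbreviated here for illustration):
argv = ['/sbin/livemedia-creator', '--ks',
        '/chroot_tmpdir/koji-image-f30-build-30594219.ks',
        '--no-virt', '--project', 'Fedora-KDE-Live', '--make-iso']

# shlex.join() shell-quotes each argument as needed; these plain
# arguments come out joined with single spaces.
command = shlex.join(argv)
print(command)
```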
Note that if you're playing along at home, this will only work so long as you can grab that kickstart from Koji, and the 20181101.n.0 compose files are kept around, which will only be for another two weeks or so - after that you won't be able to reproduce this, but you can of course follow the same procedure with a newer Koji task if you want to reproduce a newer official live image build.</p> <p>Next, I needed to examine the actual filesystem produced by the image build process and see if it really had any kernels in it (remember, that's what we were trying to figure out). This requires a bit of knowledge about how livemedia-creator works, which you'd have had to look up if you didn't know it already: it creates an image file, loopback mounts it, and installs into the loopback-mounted directory. When it fails, it leaves the image file around, and you can just mount it again and poke around. The file will be in the <code>lmc/</code> subdirectory of the directory where the image build command was run, with a filename like <code>lmc-disk-ctfz98m5.img</code> (the alphanumeric soup bit is random), and we mount it like this:</p> <pre><code>mount -o loop lmc/lmc-disk-ctfz98m5.img /mnt/sysimage </code></pre> <p>(the mount point having been created and left around by the tool). Now we can look in <code>/mnt/sysimage/boot</code>, and when I did that...I found that, indeed, it contained no <code>vmlinuz-*</code> files at all! So, I had eliminated the possibility that <code>findkernels()</code> was going wrong: it was doing its job just fine, and it wasn't finding any kernels because...there were no kernels to find.</p> <h2>OK...So Why Aren't There Any Kernels?</h2> <p>So now I had to try and work out: <em>why</em> were there no kernels in the image's <code>/boot</code>? 
I knew from the logs of earlier, successful image composes that, in successful composes, there really were kernels to find: it wasn't that there had <em>never</em> been kernels, but this had only recently become fatal for some reason. The difference really was that there used to be kernels present when this lorax code ran, but now there weren't.</p> <p>This led me into a fun new piece of research: figuring out how kernel files get into <code>/boot</code> in a Fedora system at all. You might think - I did - that they're simply perfectly normal packaged files installed by the <code>kernel-core</code> package. But it turns out, they're not! The <code>kernel-core</code> package does own the relevant <code>/boot/vmlinuz-*</code> file, but it doesn't actually directly install it: it <a href="https://src.fedoraproject.org/rpms/kernel/blob/980b5751f77839480ec6f0c805b36e4ee1d9ebc4/f/kernel.spec#_1832">uses an RPM directive called <code>%ghost</code> to own it without installing it</a>. So the file must get installed some other way. Here again I cheated a bit with prior knowledge - I knew this overall mechanism existed, though I didn't know until now that it really installed the kernel file itself - but if you don't have that, you could look at <a href="https://src.fedoraproject.org/rpms/kernel/blob/980b5751f77839480ec6f0c805b36e4ee1d9ebc4/f/kernel.spec#_1746-1753">the %posttrans script in the kernel package</a>: when a kernel is installed, a command called <code>kernel-install</code> gets run.</p> <p>I also found out (by diffing the logged packages from the 20181030.n.0 and 20181101.n.0 live image composes) that the kernel itself had been updated in the 20181101.n.0 compose (which was when things started breaking). 
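<p>Diffing two composes' logged package sets like that is a quick job with a few lines of Python; a hypothetical sketch:</p>

```python
def diff_packages(old, new):
    # Compare two iterables of name-version-release strings, as found in
    # compose logs, and return what disappeared and what appeared.
    old_set, new_set = set(old), set(new)
    return sorted(old_set - new_set), sorted(new_set - old_set)

# Illustrative inputs only - not the real package lists from the composes:
removed, added = diff_packages(
    ["kernel-core-4.19.0-1.fc30", "systemd-239-3.fc29"],
    ["kernel-core-4.20.0-0.rc0.git7.1.fc30", "systemd-239-3.fc29"],
)
print(removed, added)
```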
So once again I had a couple of promising lines of inquiry: the new kernel, and this <code>kernel-install</code> path.</p> <p>Well, turns out systemd haters in the audience can get very excited, because kernel-install is part of systemd:</p> <pre><code>[adamw@adam lorax (master %)]$ rpm -qf `which kernel-install` systemd-udev-239-3.fc29.x86_64 </code></pre> <p>Anyway, I refreshed my memory a bit about what kernel-install does, but it's kinda complicated and it calls out to various other things, including <code>/usr/sbin/new-kernel-pkg</code> (part of grubby) and some plugins in <code>/usr/lib/kernel/install.d</code> (various of which come from systemd, grub2, and dracut). So I think what I did next (my memory gets a bit hazy) was to wonder whether the same problem would affect a regular install from the same packages.</p> <p>I got the last working Rawhide network install image, and set it to install from the Everything repo from the failed 20181101.n.0 compose. I let that install run, then checked that the <code>/boot/vmlinuz-(whatever)</code> file existed in the installed system...which it did. This sort of ruled out one theory I had: that the new kernel package had somehow messed something up such that the kernel file <em>never</em> actually got installed properly at all.</p> <p>So, I got to wondering whether <code>kernel-install</code> really <em>was</em> the thing that put the <code>/boot/vmlinuz-(whatever)</code> file in place (I still wasn't sure at this point), whether it <em>reproducibly</em> failed to do so in the live image creation environment but succeeded in doing so in the regular install environment, and if so, what differed between the two.</p> <p>I could see the exact <code>kernel-install</code> command just by examining the <code>kernel-core</code> package scripts:</p> <pre><code>rpm -q --scripts kernel-core | grep kernel-install /bin/kernel-install add 4.20.0-0.rc0.git7.1.fc30.x86_64 /lib/modules/4.20.0-0.rc0.git7.1.fc30.x86_64/vmlinuz || exit $? 
</code></pre> <p>So I tried just deleting /boot/vmlinuz* and re-running that command in the installed system...and sure enough, the file re-appeared! So now I was pretty sure <code>kernel-install</code> was the thing that's <em>supposed</em> to install it. I also tried doing this in my live image creation mock environment:</p> <pre><code>chroot /mnt/sysimage /bin/kernel-install add 4.20.0-0.rc0.git7.1.fc30.x86_64 /lib/modules/4.20.0-0.rc0.git7.1.fc30.x86_64/vmlinuz </code></pre> <p>...and sure enough, it <em>didn't</em> create the <code>/boot/vmlinuz-(foo)</code> file. So now I was narrowing in on the problem: something about the live image creation environment meant that this <code>kernel-install</code> invocation didn't install the file, when it probably should.</p> <h2>OK...So Why Isn't <code>kernel-install</code> Installing Kernels?</h2> <p>At this point I probably could've figured out the problem by reading the source if I'd read it carefully enough, but I decided to carry on with the practical experimentation. I tried running the script through <code>sh -x</code> in each environment, to see exactly what commands were run by the script in each case, and somehow - I forget how - I zeroed in on one of the <code>/usr/lib/kernel/install.d</code> plugin scripts: <code>/usr/lib/kernel/install.d/20-grub.install</code>. This is part of grub2. I think I found these scripts from the <code>sh -x</code> output, and noticed that this one has what looks like <a href="https://src.fedoraproject.org/rpms/grub2/blob/master/f/20-grub.install#_55-65">the code to actually install the kernel file to /boot</a>. So I made <em>that</em> script run with <code>-x</code> as well, and this finally got me my next big breakthrough. 
In the installed system I could see that script doing a lot of stuff, but in the live environment it seemed to exit almost as soon as it started:</p> <pre><code>+ [[ -n '' ]] + exit 0 </code></pre> <p>It's not 100% obvious, but I was pretty sure that just meant it was failing in the test right at the start:</p> <pre><code>if ! [[ $KERNEL_INSTALL_MACHINE_ID ]]; then exit 0 fi </code></pre> <p>So I went and <a href="https://www.google.com/search?q=KERNEL_INSTALL_MACHINE_ID+fedora&amp;ie=utf-8&amp;oe=utf-8&amp;client=firefox-b-ab">looked up <code>$KERNEL_INSTALL_MACHINE_ID</code></a> and the references suggested that it's basically tied to <code>/etc/machine-id</code>. So I looked, and, lo and behold, in the regular installed system, that file contained a random alphanumeric string, but in the live image creation environment, the file was empty! This sure looked suspicious.</p> <p>I read through some references on the file, and found that it's usually meant to get created by <a href="https://src.fedoraproject.org/rpms/systemd/blob/71e781a09623e04dc1006dcba12838578595d70f/f/systemd.spec#_484">a call to <code>/usr/bin/systemd-machine-id-setup</code> in systemd package scripts</a>. So I tried running <code>systemd-machine-id-setup</code> in the live image creation environment, and suddenly the file got a random string, and when I ran <code>kernel-install</code> again, the kernel file <em>did</em> get installed to <code>/boot</code>!</p> <h2>OK...So Why Isn't There A Machine ID?</h2> <p>So now I'd moved along to a new problem again: why was <code>/etc/machine-id</code> empty when the kernel %post script ran, but if I tried to generate it again, it worked? Was the initial generation failing? Was it happening too late? 
Was it working, but the file getting touched by something else?</p> <p>Again, I looked at quite a lot of code to try and figure it out - there's code that at least <em>could</em> touch <code>/etc/machine-id</code> in anaconda, for instance, and in lorax - but in the end I decided to go with practical experimentation again. So I did a custom scratch build of systemd to try and find out what happened when the <code>%post</code> script ran. I changed the command from this:</p> <pre><code>systemd-machine-id-setup &amp;&gt;/dev/null || : </code></pre> <p>to this:</p> <pre><code>systemd-machine-id-setup &gt; /tmp/smids.log 2&gt;&amp;1 cat /etc/machine-id &gt;&gt; /tmp/smids.log </code></pre> <p>pulled that build into a side repository, edited the live kickstart to use that side repository, and re-ran the live image creation. And that hit paydirt in a big way, because in smids.log I saw this:</p> <pre><code>systemd-machine-id-setup: error while loading shared libraries: libssl.so.1.1: cannot open shared object file: No such file or directory </code></pre> <p>...and here was the ultimate solution to our mystery! The attempt to set the machine-id in systemd <code>%post</code> was failing because it needs libssl, but it obviously wasn't present yet. libssl is part of <code>openssl-libs</code>, but the systemd spec did not specify that its <code>%post</code> script needs <code>openssl-libs</code> installed. What I surmise had happened was that up until 20181030, some other dependency in some other package happened to mean that dnf would always choose to install <code>openssl-libs</code> before installing <code>systemd</code>, so no-one had ever noticed this missing dependency...but on 20181101, some change to some package caused dnf to start installing systemd <em>before</em> openssl-libs, and suddenly, this problem showed up. 
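<p>Putting the pieces together: the <code>20-grub.install</code> gate means an <em>empty</em> <code>/etc/machine-id</code> (not just a missing one) silently disables kernel installation, since bash's <code>[[ $KERNEL_INSTALL_MACHINE_ID ]]</code> test is false for an empty string. A hypothetical sketch of the equivalent check:</p>

```python
def machine_id_usable(path="/etc/machine-id"):
    # kernel-install derives KERNEL_INSTALL_MACHINE_ID from /etc/machine-id;
    # the 20-grub.install plugin exits immediately when that value is unset
    # or empty - the same outcome as this function returning False.
    try:
        with open(path) as f:
            return bool(f.read().strip())
    except FileNotFoundError:
        return False
```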
So - as is very often the case - once I'd finally managed to nail down the problem, the <a href="https://src.fedoraproject.org/rpms/systemd/c/71e781a09623e04dc1006dcba12838578595d70f?branch=master">fix was simple</a>: we just add the missing dependency to systemd, so that openssl-libs will always be installed before systemd's <code>%post</code> is run. With that fix, generation of <code>/etc/machine-id</code> will succeed again, and so the plugin script that installs the kernel file to <code>/boot</code> won't bail early, and so there will <em>be</em> a kernel file in <code>/boot</code>, and lorax won't fail when it tries to regenerate initrds because there aren't any kernels present!</p> <p>...and so ends another exciting episode of AdamW's Debugging Adventures :)</p> <p></p> <p></p><p>Welp, since I haven't blogged for a while, here's another root-causing write up! Honestly, root-causing things is one of my favorite parts of the job, lately.</p> <p>I've been on vacation recently, and when I came back, it looked like several things were wrong with Rawhide. Several of these were relatively easy fixes: <a href="https://pagure.io/fedora-kickstarts/c/0ed1396cb94161825756900224f179589914e250?branch=master">live images not behaving right at all</a> and <a href="https://bugzilla.redhat.com/show_bug.cgi?id=1644936">the installer not launching properly any more</a> both only took a couple of hours to track down. But there were also a couple of bugs causing more recent composes to fail entirely. 
The inimitable <a href="https://patrick.uiterwijk.org/">puiterwijk</a> dealt with one of those (aarch64 cloud images not building properly), and I wound up taking the other one: overnight, most live image composes had suddenly stopped working.</p> <h2>What's The Problem?</h2> <p>The starting point for this debugging odyssey was the error messages we got in the failed composes:</p> <pre><code>DEBUG pylorax.treebuilder: kernels=[] ERROR livemedia-creator: No kernels found, cannot rebuild_initrds </code></pre> <p>So, the first step is pretty easy: let's just go look up where those errors come from. pylorax and livemedia-creator are both part of the <a href="https://github.com/weldr/lorax">lorax</a> tree, so we'll start there. It's easy to use grep to find the sources of those two messages: they're both in <a href="https://github.com/weldr/lorax/blob/59d8c91ae5fae00d9b92411b720a245871a02c16/src/pylorax/treebuilder.py">treebuilder.py</a>, the first <a href="https://github.com/weldr/lorax/blob/59d8c91ae5fae00d9b92411b720a245871a02c16/src/pylorax/treebuilder.py#L390">here, in <code>findkernels()</code></a> and the second <a href="https://github.com/weldr/lorax/blob/59d8c91ae5fae00d9b92411b720a245871a02c16/src/pylorax/treebuilder.py#L287">here, in <code>TreeBuilder.rebuild_initrds()</code></a>. As the second happens if there is no <code>self.kernels</code>, and we we can see <a href="https://github.com/weldr/lorax/blob/59d8c91ae5fae00d9b92411b720a245871a02c16/src/pylorax/treebuilder.py#L269">just a few lines further back</a> that <code>self.kernels</code> is a property based on a call to <code>findkernels()</code>, it's pretty obvious that ultimately what's going wrong here is that <code>findkernels()</code> isn't finding any kernels.</p> <h2>So...Why Aren't We Finding Any Kernels?</h2> <p>So next, of course, I put my thinking cap on and had a good look at <code>findkernels()</code> - not just at the code itself, but at its history. 
When something that was working breaks, you're always looking for a <em>change</em> that caused it. There were no recent changes to <code>findkernels()</code>, and I couldn't see anything obviously wrong in its implementation - it's basically looking for files named <code>vmlinuz-(something)</code> in <code>/boot</code> - so it didn't look like the quest was over yet.</p> <p>So at this point there were kinda two possibilities: <code>findkernels()</code> was broken, or it was working fine but <em>there really weren't any kernels to find</em>. I decided the easiest way to figure out which we were dealing with was just to reproduce the problem and take a look. With any sufficiently complex problem, you can usually get some way into figuring it out just by looking and logs and code and thinking about things logically, but at some point you'll want to try and reproduce the problem in a controlled environment you can poke at - knowing when you're at that point isn't an exact science, but if you find yourself with multiple possibilities and no easy way to decide which you're dealing with, that's often a good indicator.</p> <p>Doing a test live image build is a <em>bit</em> awkward for Fedora at present, unfortunately, but not really too bad. I've been meaning to write a helper tool for a while, but never got around to it. The way I do it - because I want to replicate as close as possible how the official images are built - is just to try and do exactly what the build system does, only on my local system and by hand. What the build system does is spin up a mock root, install the live image creation tools and a flattened kickstart into it, and run the tool. So, that's what I do! A handy 'trick' here is just to look at the logs from a real live image creation task - like <a href="https://koji.fedoraproject.org/koji/taskinfo?taskID=30594219">this one</a> - and shadow what they do. 
Note these logs will get garbage collected at some point, but I can't do anything about that.</p> <p>The main log to look at is <a href="https://kojipkgs.fedoraproject.org//work/tasks/4219/30594219/root.log">root.log</a>. First, we see that the tools are installed to the mock:</p> <pre><code>DEBUG util.py:439: Marking packages as installed by the group:
DEBUG util.py:439:  @livemedia-build glibc-all-langpacks coreutils lorax-lmc-novirt
DEBUG util.py:439:  util-linux selinux-policy-targeted bash
DEBUG util.py:439:  shadow-utils
</code></pre> <p>So we'll do just the same with our mock:</p> <pre><code>mock -r fedora-rawhide-x86_64 --install glibc-all-langpacks coreutils lorax-lmc-novirt util-linux selinux-policy-targeted bash shadow-utils
</code></pre> <p>(Really all you want to do is install the <code>livemedia-build</code> group, but I haven't actually found an equivalent of <code>mock --groupinstall</code>, so I just do that).</p> <p>We can find the actual image compose command in the task log again, by looking for 'livemedia-creator' - it looks like this:</p> <pre><code>INFO backend.py:285: Running in chroot: ['/sbin/livemedia-creator', '--ks', '/chroot_tmpdir/koji-image-f30-build-30594219.ks', '--logfile', '/chroot_tmpdir/lmc-logs/livemedia-out.log', '--no-virt', '--resultdir', '/chroot_tmpdir/lmc', '--project', 'Fedora-KDE-Live', '--make-iso', '--volid', 'Fedora-KDE-Live-rawh-20181101.n.', '--iso-only', '--iso-name', 'Fedora-KDE-Live-x86_64-Rawhide-20181101.n.0.iso', '--releasever', 'Rawhide', '--title', 'Fedora-KDE-Live', '--macboot']
</code></pre> <p>We can easily turn that Python array into a console command by just replacing occurrences of <code>', '</code> with <code>' '</code> (a single space) and trimming the brackets and quotes at either end:</p> <pre><code>/sbin/livemedia-creator --ks /chroot_tmpdir/koji-image-f30-build-30594219.ks --logfile /chroot_tmpdir/lmc-logs/livemedia-out.log --no-virt --resultdir /chroot_tmpdir/lmc --project Fedora-KDE-Live --make-iso --volid Fedora-KDE-Live-rawh-20181101.n. --iso-only --iso-name Fedora-KDE-Live-x86_64-Rawhide-20181101.n.0.iso --releasever Rawhide --title Fedora-KDE-Live --macboot
</code></pre> <p>We can see that's using a scratch directory called <code>/chroot_tmpdir</code>, and a kickstart file called <code>koji-image-f30-build-30594219.ks</code>. This kickstart can be found as <a href="https://kojipkgs.fedoraproject.org//work/tasks/6109/30606109/koji-image-f30-build-30606109.ks">one of the task assets</a>, so we'll grab it and copy it into the mock's <code>/root</code> for now:</p> <pre><code>sudo cp koji-image-f30-build-30606109.ks /var/lib/mock/fedora-rawhide-x86_64/root/root
</code></pre> <p>Then finally, we're going to get a shell in the mock root, using the old-style chroot implementation (this is necessary for live image builds to work; the new systemd-based implementation doesn't work yet):</p> <pre><code>mock -r fedora-rawhide-x86_64 --shell --old-chroot
</code></pre> <p>Inside the mock, we'll create the <code>/chroot_tmpdir</code> scratch dir, copy the kickstart into it, and finally run the image creation:</p> <pre><code>mkdir -p /chroot_tmpdir
cd /chroot_tmpdir
cp /root/koji-image-f30-build-30606109.ks .
/sbin/livemedia-creator --ks /chroot_tmpdir/koji-image-f30-build-30606109.ks --logfile /chroot_tmpdir/lmc-logs/livemedia-out.log --no-virt --resultdir /chroot_tmpdir/lmc --project Fedora-KDE-Live --make-iso --volid Fedora-KDE-Live-rawh-20181101.n. --iso-only --iso-name Fedora-KDE-Live-x86_64-Rawhide-20181101.n.0.iso --releasever Rawhide --title Fedora-KDE-Live --macboot
</code></pre> <p>And when I did that, it worked away for a while - half an hour or so - and eventually failed exactly like the 'official' build had! So now I had a failure in a controlled environment (my little mock root) to look at.
Note that if you're playing along at home, this will only work so long as you can grab that kickstart from Koji, and the 20181101.n.0 compose files are kept around, which will only be for another two weeks or so - after that you won't be able to reproduce this, but you can of course follow the same procedure with a newer Koji task if you want to reproduce a newer official live image build.</p> <p>Next, I needed to examine the actual filesystem produced by the image build process and see if it really had any kernels in it (remember, that's what we were trying to figure out). This requires a bit of knowledge about how livemedia-creator works, which you'd have had to look up if you didn't know it already: it creates an image file, loopback mounts it, and installs into the loopback-mounted directory. When it fails, it leaves the image file around, and you can just mount it again and poke around. The file will be in the <code>lmc/</code> subdirectory of the directory where the image build command was run, with a filename like <code>lmc-disk-ctfz98m5.img</code> (the alphanumeric soup bit is random), and we mount it like this:</p> <pre><code>mount -o loop lmc/lmc-disk-ctfz98m5.img /mnt/sysimage </code></pre> <p>(the mount point having been created and left around by the tool). Now we can look in <code>/mnt/sysimage/boot</code>, and when I did that...I found that, indeed, it contained no <code>vmlinuz-*</code> files at all! So, I had eliminated the possibility that <code>findkernels()</code> was going wrong: it was doing its job just fine, and it wasn't finding any kernels because...there were no kernels to find.</p> <h2>OK...So Why Aren't There Any Kernels?</h2> <p>So now I had to try and work out: <em>why</em> were there no kernels in the image's <code>/boot</code>? 
I knew from the logs of earlier, successful image composes that, in successful composes, there really were kernels to find: it wasn't that there had <em>never</em> been kernels, but this had only recently become fatal for some reason. The difference really was that there used to be kernels present when this lorax code ran, but now there weren't.</p> <p>This led me into a fun new piece of research: figuring out how kernel files get into <code>/boot</code> in a Fedora system at all. You might think - I did - that they're simply perfectly normal packaged files installed by the <code>kernel-core</code> package. But it turns out, they're not! The <code>kernel-core</code> package does own the relevant <code>/boot/vmlinuz-*</code> file, but it doesn't actually directly install it: it <a href="https://src.fedoraproject.org/rpms/kernel/blob/980b5751f77839480ec6f0c805b36e4ee1d9ebc4/f/kernel.spec#_1832">uses an RPM directive called <code>%ghost</code> to own it without installing it</a>. So the file must get installed some other way. Here again I cheated a bit with prior knowledge - I knew this overall mechanism existed, though I didn't know until now that it really installed the kernel file itself - but if you don't have that, you could look at <a href="https://src.fedoraproject.org/rpms/kernel/blob/980b5751f77839480ec6f0c805b36e4ee1d9ebc4/f/kernel.spec#_1746-1753">the %posttrans script in the kernel package</a>: when a kernel is installed, a command called <code>kernel-install</code> gets run.</p> <p>I also found out (by diffing the logged packages from the 20181030.n.0 and 20181101.n.0 live image composes) that the kernel itself had been updated in the 20181101.n.0 compose (which was when things started breaking). 
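</p> <p>(If you're curious, that sort of package-list diff is a one-minute script. Here's a rough sketch of the idea - the filenames are made up for illustration; you'd dump one package name per line from each compose's logs.)</p>

```python
def read_pkgs(path):
    """Read a plain-text package list, one package per line."""
    with open(path) as f:
        return {line.strip() for line in f if line.strip()}

def diff_pkgs(old_path, new_path):
    """Return (packages only in the new list, packages only in the old list)."""
    old, new = read_pkgs(old_path), read_pkgs(new_path)
    return sorted(new - old), sorted(old - new)

# Hypothetical usage, with lists dumped from the two compose logs:
# added, dropped = diff_pkgs("pkgs-20181030.n.0.txt", "pkgs-20181101.n.0.txt")
```

<p>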
So once again I had a couple of promising lines of inquiry: the new kernel, and this <code>kernel-install</code> path.</p> <p>Well, turns out systemd haters in the audience can get very excited, because kernel-install is part of systemd:</p> <pre><code>[adamw@adam lorax (master %)]$ rpm -qf `which kernel-install`
systemd-udev-239-3.fc29.x86_64
</code></pre> <p>Anyway, I refreshed my memory a bit about what kernel-install does, but it's kinda complicated and it calls out to various other things, including <code>/usr/sbin/new-kernel-pkg</code> (part of grubby) and some plugins in <code>/usr/lib/kernel/install.d</code> (various of which come from systemd, grub2, and dracut). So I think what I did next (my memory gets a bit hazy) was to wonder whether the same problem would affect a regular install from the same packages.</p> <p>I got the last working Rawhide network install image, and set it to install from the Everything repo from the failed 20181101.n.0 compose. I let that install run, then checked that the <code>/boot/vmlinuz-(whatever)</code> file existed in the installed system...which it did. This ruled out one theory I had: that the new kernel package had somehow messed something up such that the kernel file <em>never</em> actually got installed properly at all.</p> <p>So, I got to wondering whether <code>kernel-install</code> really <em>was</em> the thing that put the <code>/boot/vmlinuz-(whatever)</code> file in place (I still wasn't sure at this point), whether it <em>reproducibly</em> failed to do so in the live image creation environment but succeeded in doing so in the regular install environment, and if so, what differed between the two.</p> <p>I could see the exact <code>kernel-install</code> command just by examining the <code>kernel-core</code> package scripts:</p> <pre><code>rpm -q --scripts kernel-core | grep kernel-install
/bin/kernel-install add 4.20.0-0.rc0.git7.1.fc30.x86_64 /lib/modules/4.20.0-0.rc0.git7.1.fc30.x86_64/vmlinuz || exit $?
</code></pre> <p>So I tried just deleting /boot/vmlinuz* and re-running that command in the installed system...and sure enough, the file re-appeared! So now I was pretty sure <code>kernel-install</code> was the thing that's <em>supposed</em> to install it. I also tried doing this in my live image creation mock environment:</p> <pre><code>chroot /mnt/sysimage /bin/kernel-install add 4.20.0-0.rc0.git7.1.fc30.x86_64 /lib/modules/4.20.0-0.rc0.git7.1.fc30.x86_64/vmlinuz
</code></pre> <p>...and sure enough, it <em>didn't</em> create the <code>/boot/vmlinuz-(foo)</code> file. So now I was narrowing in on the problem: something about the live image creation environment meant that this <code>kernel-install</code> invocation didn't install the file, when it probably should.</p> <h2>OK...So Why Isn't <code>kernel-install</code> Installing Kernels?</h2> <p>At this point I probably could've figured out the problem by reading the source if I'd read it carefully enough, but I decided to carry on with the practical experimentation. I tried running the script through <code>sh -x</code> in each environment, to see exactly what commands were run by the script in each case, and somehow - I forget how - I zeroed in on one of the <code>/usr/lib/kernel/install.d</code> plugin scripts: <code>/usr/lib/kernel/install.d/20-grub.install</code>. This is part of grub2. I think I found these scripts from the <code>sh -x</code> output, and noticed that this one has what looks like <a href="https://src.fedoraproject.org/rpms/grub2/blob/master/f/20-grub.install#_55-65">the code to actually install the kernel file to /boot</a>. So I made <em>that</em> script run with <code>-x</code> as well, and this finally got me my next big breakthrough.
In the installed system I could see that script doing a lot of stuff, but in the live environment it seemed to exit almost as soon as it started:</p> <pre><code>+ [[ -n '' ]]
+ exit 0
</code></pre> <p>It's not 100% obvious, but I was pretty sure that just meant it was failing in the test right at the start:</p> <pre><code>if ! [[ $KERNEL_INSTALL_MACHINE_ID ]]; then
    exit 0
fi
</code></pre> <p>So I went and <a href="https://www.google.com/search?q=KERNEL_INSTALL_MACHINE_ID+fedora&amp;ie=utf-8&amp;oe=utf-8&amp;client=firefox-b-ab">looked up <code>$KERNEL_INSTALL_MACHINE_ID</code></a> and the references suggested that it's basically tied to <code>/etc/machine-id</code>. So I looked, and, lo and behold, in the regular installed system, that file contained a random alphanumeric string, but in the live image creation environment, the file was empty! This sure looked suspicious.</p> <p>I read through some references on the file, and found that it's usually meant to get created by <a href="https://src.fedoraproject.org/rpms/systemd/blob/71e781a09623e04dc1006dcba12838578595d70f/f/systemd.spec#_484">a call to <code>/usr/bin/systemd-machine-id-setup</code> in systemd package scripts</a>. So I tried running <code>systemd-machine-id-setup</code> in the live image creation environment, and suddenly the file got a random string, and when I ran <code>kernel-install</code> again, the kernel file <em>did</em> get installed to <code>/boot</code>!</p> <h2>OK...So Why Isn't There A Machine ID?</h2> <p>So now I'd moved along to a new problem again: why was <code>/etc/machine-id</code> empty when the kernel %post script ran, but if I tried to generate it again, it worked? Was the initial generation failing? Was it happening too late?
Was it working, but the file getting touched by something else?</p> <p>Again, I looked at quite a lot of code to try and figure it out - there's code that at least <em>could</em> touch <code>/etc/machine-id</code> in anaconda, for instance, and in lorax - but in the end I decided to go with practical experimentation again. So I did a custom scratch build of systemd to try and find out what happened when the <code>%post</code> script ran. I changed the command from this:</p> <pre><code>systemd-machine-id-setup &amp;&gt;/dev/null || :
</code></pre> <p>to this:</p> <pre><code>systemd-machine-id-setup &gt; /tmp/smids.log 2&gt;&amp;1
cat /etc/machine-id &gt;&gt; /tmp/smids.log
</code></pre> <p>pulled that build into a side repository, edited the live kickstart to use that side repository, and re-ran the live image creation. And that hit paydirt in a big way, because in smids.log I saw this:</p> <pre><code>systemd-machine-id-setup: error while loading shared libraries: libssl.so.1.1: cannot open shared object file: No such file or directory
</code></pre> <p>...and here was the ultimate solution to our mystery! The attempt to set the machine-id in systemd <code>%post</code> was failing because it needs libssl, but it obviously wasn't present yet. libssl is part of <code>openssl-libs</code>, but the systemd spec did not specify that its <code>%post</code> script needs <code>openssl-libs</code> installed. What I surmise had happened was that up until 20181030, some other dependency in some other package happened to mean that dnf would always choose to install <code>openssl-libs</code> before installing <code>systemd</code>, so no-one had ever noticed this missing dependency...but on 20181101, some change to some package caused dnf to start installing systemd <em>before</em> openssl-libs, and suddenly, this problem showed up.
So - as is very often the case - once I'd finally managed to nail down the problem, the <a href="https://src.fedoraproject.org/rpms/systemd/c/71e781a09623e04dc1006dcba12838578595d70f?branch=master">fix was simple</a>: we just add the missing dependency to systemd, so that openssl-libs will always be installed before systemd's <code>%post</code> is run. With that fix, generation of <code>/etc/machine-id</code> will succeed again, and so the plugin script that installs the kernel file to <code>/boot</code> won't bail early, and so there will <em>be</em> a kernel file in <code>/boot</code>, and lorax won't fail when it tries to regenerate initrds because there aren't any kernels present!</p> <p>...and so ends another exciting episode of AdamW's Debugging Adventures :)</p> <p></p> Adam's Debugging Adventures: The Immutable Mutable Object https://www.happyassassin.net/posts/2018/06/27/adams-debugging-adventures-the-immutable-mutable-object/ 2018-06-27T16:03:50Z 2018-06-27T16:03:50Z Adam Williamson <p></p><p>Here's a puzzle for you, Python fans:</p> <pre><code>[adamw@adam dnf (master %)]$ python3
Python 3.6.5 (default, Apr 23 2018, 22:53:50)
[GCC 8.0.1 20180410 (Red Hat 8.0.1-0.21)] on linux
Type "help", "copyright", "credits" or "license" for more information.
&gt;&gt;&gt; from dnf.conf import MainConf
&gt;&gt;&gt; mc = MainConf()
&gt;&gt;&gt; print(mc.group_package_types)
['mandatory', 'default', 'conditional']
&gt;&gt;&gt; mc.group_package_types.append('foo')
&gt;&gt;&gt; print(mc.group_package_types)
['mandatory', 'default', 'conditional']
&gt;&gt;&gt;
</code></pre> <p>Note: if you want to try reproducing this...make sure you use DNF 3. It works as expected with DNF &lt; 3. That's why it just showed up as a problem.</p> <p>Before I explain what's going on there...let's unpick the problem a bit for non-Pythonistas.</p> <p>In Python (and in many other languages) some things - objects - are 'mutable', and some are 'immutable'. A 'mutable' object can be changed.
An 'immutable' object cannot.</p> <p>In the Python session above, we create an object, <code>mc</code>, which is an instance of the class <code>MainConf</code> (don't worry if you don't entirely understand that, it's not compulsory). We then examine one <em>attribute</em> of <code>mc</code>: <code>mc.group_package_types</code>. Python tells us that this is <code>['mandatory', 'default', 'conditional']</code> - which is a Python list containing the values <code>'mandatory'</code>, <code>'default'</code> and <code>'conditional'</code>.</p> <p>In Python, lists are 'mutable'. That means you can take a list and change it somehow - remove an item from it, add an item to it, re-order it - and it's <em>still the same object</em>. Any existing reference to the object is still valid, and now refers to the changed list.</p> <p>For comparison, an example of an 'immutable' object is a string. If you do this:</p> <pre><code>mystring = "here's some text"
</code></pre> <p>you can't then change the actual string object referenced by <code>mystring</code> there. It has no methods that let you change it in any way, and there are no functions that can operate on it and change it. You can do this:</p> <pre><code>mystring = "here's some text"
mystring = "here's some different text"
mystring = mystring.replace("some", "some more")
</code></pre> <p>and at each step the contents of the string to which the name <code>mystring</code> refers are different - but also, at each step <em><code>mystring</code> refers to a DIFFERENT object</em>. (That's why the <code>replace()</code> string method returns a new string - it <em>can't</em> mutate the existing string).
So if you did this:</p> <pre><code>mystring = "here's some text"
otherref = mystring
mystring = "here's some different text"
</code></pre> <p>then at the end, <code>otherref</code> still points to the first-created string object and its value is still <code>"here's some text"</code>, while <code>mystring</code> points to a <em>new</em> string object and its value is <code>"here's some different text"</code>. Let's compare with a similar case using a mutable object:</p> <pre><code>mylist = [1, 2, 3]
otherref = mylist
mylist.append(4)
print(mylist)
print(otherref)
</code></pre> <p>In this case, when we get to the end, both <code>mylist</code> and <code>otherref</code> are still pointing to the <em>same</em> object, the original object, and both <code>print</code>s will print <code>[1, 2, 3, 4]</code>. No <em>new</em> list object was created at any point after the initial creation of <code>mylist</code>.</p> <p>So with that understood, take a look back at the original example, and maybe you can see why it's so weird. By all appearances, it looks like we have a pretty simple scenario here: we have an object that has an attribute which is a list, and we just append a value to that list. Then we go look at its value again, and it...hasn't changed at all? But we didn't get any kind of <em>crash</em>, or <em>error</em>, or anything. Our <code>append</code> call returned apparently successfully. It just...didn't seem to change anything. The list is an immutable mutable object!</p> <p>This is a real problem in real code: it <a href="https://pagure.io/dusty/failed-composes/issue/481">broke the most recent Fedora Rawhide compose</a>. So, obviously, I was pretty keen to figure out what the hell was going on, here! It turns out that it's down to dnf getting clever (arguably over-clever).</p> <p>Python's a very...flexible language.
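</p> <p>(You can make the 'same object or not?' question completely explicit with <code>is</code> comparisons - a standalone demo of my own, nothing dnf-specific yet.)</p>

```python
# Immutable: replace() hands back a brand new string object.
mystring = "here's some text"
old = mystring
mystring = mystring.replace("some", "some more")
assert mystring is not old           # a different object...
assert old == "here's some text"     # ...while the original is untouched

# Mutable: append() changes the list in place - same object throughout.
mylist = [1, 2, 3]
alias = mylist
mylist.append(4)
assert mylist is alias               # still the very same object
assert alias == [1, 2, 3, 4]         # so every reference sees the change
```

<p>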
The key to the problem here turned out to be exactly what happens when we actually look up the <code>group_package_types</code> attribute of <code>mc</code>, the <code>dnf.conf.MainConf</code> instance.</p> <p>Getting and setting attributes of objects in Python is usually a kind of core operation that you never really think about, you just expect it to work in the 'usual way'. A simple approximation of how it works is that the object has a Python <code>dict</code> (like a 'hash' in some languages - a key/value store, more or less) whose keys are attribute names and whose values are the attribute values. When you ask for an attribute of an object, Python checks if its name is one of the keys in that object's dict, and if it is, returns the value. If it's not, Python raises an error. When you <em>set</em> an attribute, Python stuffs it into the dict.</p> <p>But since Python's flexible, it provides some mechanisms to let you mess around with this stuff, if you want to. You can define <code>__setattr__</code>, <code>__getattr__</code> and/or <code>__getattribute__</code> methods in a class, and they'll affect this behaviour.</p> <p>The base <code>object</code> class that almost all Python classes inherit from defines the default <code>__setattr__</code> and <code>__getattribute__</code>, which work sort of like the approximation I gave above. If you override <code>__setattr__</code> in a class, then when something tries to set an attribute for an instance of that class, that method will get called instead of the default <code>object.__setattr__</code>. 
If you override <code>__getattribute__</code>, then that method will get called instead of <code>object.__getattribute__</code> when something tries to <em>look up</em> an attribute of an instance of that class.</p> <p>If you leave <code>__getattribute__</code> alone but define <code>__getattr__</code>, then when something tries to look up an attribute, first the stock <code>object.__getattribute__</code> will be used to try and look it up, but if that doesn't find it, rather than raising an exception immediately, Python will try your <code>__getattr__</code> method.</p> <p>We can actually override <code>__setattr__</code> and <code>__getattr__</code> to do a very simplified demonstration of how the default implementation usually works:</p> <pre><code>#!/bin/python3

class TestClass(object):
    def __init__(self):
        self.__dict__["_attrs"] = {}

    def __setattr__(self, name, value):
        print("Hi, setattr here, just settin' attrs...")
        self._attrs[name] = value

    def __getattr__(self, name):
        print("Hi, getattr here, here's your attr!")
        return self._attrs[name]

tc = TestClass()
tc.foo = [1, 2, 3]
print(tc.foo)
tc.foo.append(4)
print(tc.foo)
</code></pre> <p>Note that <code>__dict__</code> is the store that <code>object.__getattribute__</code> uses, so that's why we set up our backing store with <code>self.__dict__["_attrs"] = {}</code> - that ensures that when we look up <code>self._attrs</code>, we will find it via <code>__getattribute__</code>. We can't just do <code>self._attrs = {}</code> because then we wind up in an infinite recursion in our <code>__getattr__</code>.</p> <p>If you save that and run it, you'll see <code>[1, 2, 3]</code> then <code>[1, 2, 3, 4]</code> (plus the messages that prove our new methods are being used). Our mutable attribute is nice and properly mutable. We can append things to it and everything.
Notice that when we append the value, we hit <code>__getattr__</code> but not <code>__setattr__</code>.</p> <p>So, how does this manage <em>not</em> to work with dnf config classes? (<code>MainConf</code> is a subclass of <code>BaseConfig</code>, and so are a ton of other config-related classes in dnf - we actually encountered this bug with another subclass, <code>RepoConf</code>). It turns out to be because dnf overrides <code>BaseConfig.__setattr__</code> and <code>BaseConfig.__getattr__</code> to do some "clever" stuff, and it breaks this!</p> <p>We don't need to go into what its <code>__setattr__</code> does in detail, except to note one thing: it <em>doesn't</em> store the values in the <code>__dict__</code> store, so <code>object.__getattribute__</code> can never find them. When looking up any attribute on an instance of one of these classes except <code>_config</code> (which is the store the class' <code>__setattr__</code> and <code>__getattr__</code> methods themselves use, just like <code>_attrs</code> in our example, and is created directly in <code>__dict__</code> in the same way), we always wind up in the class's <code>__getattr__</code>.</p> <p>Here's the whole of <a href="https://github.com/rpm-software-management/dnf/blob/3.0.1/dnf/conf/config.py#L198-L210">current dnf's <code>BaseConfig.__getattr__</code></a>:</p> <pre><code>def __getattr__(self, name):
    option = getattr(self._config, name, None)
    if option is None:
        return None
    try:
        value = option().getValue()
    except Exception as ex:
        return None
    if isinstance(value, cfg.VectorString):
        return list(value)
    if isinstance(value, str):
        return ucd(value)
    return value
</code></pre> <p>There is some more stuff going on in the background here that we don't need to worry about <em>too</em> much (a feature of DNF, I have found, is that it has layers upon layers. It contains multitudes.
You usually can't debug anything in DNF without going through at least eight levels of things calling other things that turn out to be yet other things that turn out to be written in C++ just cuz.) In the case of the <code>group_package_types</code> option, and also the option we were actually dealing with in the buggy case (the <code>baseurl</code> option for a repo), the option is basically a list-y type, so we wind up in the <code>if isinstance(value, cfg.VectorString):</code> branch here.</p> <p>So if you follow it through, when we asked for <code>mc.group_package_types</code>, <em>unlike</em> in the default Python implementation or our simplified example, we didn't just pull an existing list object out from some sekrit store in the <code>mc</code> object. No. We got some kind of object (fact fans: it's a <code>libdnf.conf.OptionStringList</code> - libdnf is the C++ bit I mentioned earlier...) out of the <code>self._config</code> dict that's acting as our sort-of attribute store, and ran its <code>getValue</code> method to get some <em>other</em> sort of object (fact fans: it's a <code>libdnf.conf.VectorString</code>), then we ran <code>list()</code> on <em>that</em> object, and returned <em>that</em>.</p> <p>The problem is that the thing that gets returned is basically a temporary copy of the 'real' backing value. It's a mutable object - it really is a list! - and we can mutate it...but the next time anyone looks up the same attribute we looked up to get that list, they won't get the <em>same</em> list object we got. This wacky <code>__getattr__</code> will run through the same indirection maze and return a <em>new listified copy</em> of the backing value. Every time you look up the attribute, it does that. We can mutate the copies all we like, but doing that doesn't actually change the backing value.</p> <p>Now, it's easy enough to work around this, <em>once you know what's going on</em>. 
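</p> <p>(In fact, the whole trap fits in a tiny toy class - my own simplification, not dnf's actual code - whose <code>__getattr__</code> hands out a fresh copy of the backing value on every lookup, much like the listified <code>VectorString</code> above.)</p>

```python
class CopyOnGet:
    """Toy model of the dnf behaviour described above: reads return a
    throwaway copy of the backing value, so mutating it never sticks."""
    def __init__(self):
        self.__dict__["_backing"] = {
            "group_package_types": ["mandatory", "default", "conditional"]
        }

    def __setattr__(self, name, value):
        # Writes really do reach the backing store...
        self._backing[name] = list(value)

    def __getattr__(self, name):
        # ...but every read hands out a fresh copy of it.
        try:
            return list(self._backing[name])
        except KeyError:
            raise AttributeError(name)

conf = CopyOnGet()
conf.group_package_types.append("foo")  # mutates a copy: silently lost
assert conf.group_package_types == ["mandatory", "default", "conditional"]

conf.group_package_types = conf.group_package_types + ["foo"]  # replacing works
assert conf.group_package_types == ["mandatory", "default", "conditional", "foo"]
```

<p>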
The overridden <code>__setattr__</code>, of course, actually <em>does</em> change the backing store. So if you explicitly try to <em>set</em> an attribute (rather than getting one and mutating it), that <em>does</em> work:</p> <pre><code>[adamw@adam dnf (master %)]$ python3
Python 3.6.5 (default, Apr 23 2018, 22:53:50)
[GCC 8.0.1 20180410 (Red Hat 8.0.1-0.21)] on linux
Type "help", "copyright", "credits" or "license" for more information.
&gt;&gt;&gt; from dnf.conf import MainConf
&gt;&gt;&gt; mc = MainConf()
&gt;&gt;&gt; print(mc.group_package_types)
['mandatory', 'default', 'conditional']
&gt;&gt;&gt; mc.group_package_types = mc.group_package_types + ['foo']
&gt;&gt;&gt; print(mc.group_package_types)
['mandatory', 'default', 'conditional', 'foo']
&gt;&gt;&gt;
</code></pre> <p>That time, it worked because we didn't try to mutate our magical immutable mutable object. We just flat out <em>replaced</em> it with a <em>new</em> list. When we explicitly set the attribute like that, we hit the overridden <code>__setattr__</code> method, which does the necessary magic to write the new value to the backing store.</p> <p>But any regular Pythonista who sees that the instance attribute <code>foo.bar</code> is a mutable object is naturally going to assume that they can go ahead and mutate it. That's just...standard. They aren't going to figure they should ignore the fact that it's a mutable object and just replace it every time they want to change it. That's <a href="https://github.com/livecd-tools/livecd-tools/blob/livecd-tools-25.0/imgcreate/dnfinst.py#L181-L182">exactly what we did do</a> in the code that got broken. That's the exact code that used to work with DNF 2 but no longer does with DNF 3.</p> <p>So, that took a few hours to figure out! But <a href="https://github.com/livecd-tools/livecd-tools/pull/102">I got there in the end</a>. I really just wanted to write up my latest 'interesting' debugging adventure!
But if there's a moral to this story...I guess it's "think <em>really</em> hard about whether messing with core behaviour like this is the right way to go about implementing your special thing"?</p> <p>Oh, and by the way: comments should be working again now! I just disabled the plugin that was interfering with them. So, you know, comment away.</p> <p></p> <p></p><p>Here's a puzzle for you, Python fans:</p> <pre><code>[adamw@adam dnf (master %)]$ python3 Python 3.6.5 (default, Apr 23 2018, 22:53:50) [GCC 8.0.1 20180410 (Red Hat 8.0.1-0.21)] on linux Type "help", "copyright", "credits" or "license" for more information. &gt;&gt;&gt; from dnf.conf import MainConf &gt;&gt;&gt; mc = MainConf() &gt;&gt;&gt; print(mc.group_package_types) ['mandatory', 'default', 'conditional'] &gt;&gt;&gt; mc.group_package_types.append('foo') &gt;&gt;&gt; print(mc.group_package_types) ['mandatory', 'default', 'conditional'] &gt;&gt;&gt; </code></pre> <p>Note: if you want to try reproducing this...make sure you use DNF 3. It works as expected with DNF &lt; 3. That's why it just showed up as a problem.</p> <p>Before I explain what's going on there...let's unpick the problem a bit for non-Pythonistas.</p> <p>In Python (and in many other languages) some things - objects - are 'mutable', and some are 'immutable'. A 'mutable' object can be changed. An 'immutable' object cannot.</p> <p>In the Python session above, we create an object, <code>mc</code>, which is an instance of the class <code>MainConf</code> (don't worry if you don't entirely understand that, it's not compulsory). We then examine one <em>attribute</em> of <code>mc</code>: <code>mc.group_package_types</code>. Python tells us that this is <code>['mandatory', 'default', 'conditional']</code> - which is a Python list containing the values <code>'mandatory'</code>, <code>'default'</code> and <code>'conditional'</code>.</p> <p>In Python, lists are 'mutable'. 
That means you can take a list and change it somehow - remove an item from it, add an item to it, re-order it - and it's <em>still the same object</em>. Any existing reference to the object is still valid, and now refers to the changed list.</p> <p>For comparison, an example of an 'immutable' object is a string. If you do this:</p> <pre><code>mystring = "here's some text" </code></pre> <p>you can't then change the actual string object referenced by <code>mystring</code> there. It has no methods that let you change it in any way, and there are no functions that can operate on it and change it. You can do this:</p> <pre><code>mystring = "here's some text" mystring = "here's some different text" mystring = mystring.replace("some", "some more") </code></pre> <p>and at each step the contents of the string to which the name <code>mystring</code> refers are different - but also, at each step <em><code>mystring</code> refers to a DIFFERENT object</em>. (That's why the <code>replace()</code> string method returns a new string - it <em>can't</em> mutate the existing string). So if you did this:</p> <pre><code>mystring = "here's some text" otherref = mystring mystring = "here's some different text" </code></pre> <p>then at the end, <code>otherref</code> still points to the first-created string object and its value is still <code>"here's some text"</code>, while <code>mystring</code> points to a <em>new</em> string object and its value is <code>"here's some different text"</code>. Let's compare with a similar case using a mutable object:</p> <pre><code>mylist = [1, 2, 3] otherref = mylist mylist.append(4) print(mylist) print(otherref) </code></pre> <p>In this case, when we get to the end, both <code>mylist</code> and <code>otherref</code> are still pointing to the <em>same</em> object, the original object, and both <code>print</code>s will print <code>[1, 2, 3, 4]</code>. 
No <em>new</em> list object was created at any point after the initial creation of <code>mylist</code>.</p> <p>So with that understood, take a look back at the original example, and maybe you can see why it's so weird. By all appearances, it looks like we have a pretty simple scenario here: we have an object that has an attribute which is a list, and we just append a value to that list. Then we go look at its value again, and it...hasn't changed at all? But we didn't get any kind of <em>crash</em>, or <em>error</em>, or anything. Our <code>append</code> call returned apparently successfully. It just...didn't seem to change anything. The list is an immutable mutable object!</p> <p>This is a real problem in real code: it <a href="https://pagure.io/dusty/failed-composes/issue/481">broke the most recent Fedora Rawhide compose</a>. So, obviously, I was pretty keen to figure out what the hell was going on, here! It turns out that it's down to dnf getting clever (arguably over-clever).</p> <p>Python's a very...flexible language. The key to the problem here turned out to be exactly what happens when we actually look up the <code>group_package_types</code> attribute of <code>mc</code>, the <code>dnf.conf.MainConf</code> instance.</p> <p>Getting and setting attributes of objects in Python is usually a kind of core operation that you never really think about, you just expect it to work in the 'usual way'. A simple approximation of how it works is that the object has a Python <code>dict</code> (like a 'hash' in some languages - a key/value store, more or less) whose keys are attribute names and whose values are the attribute values. When you ask for an attribute of an object, Python checks if its name is one of the keys in that object's dict, and if it is, returns the value. If it's not, Python raises an error. 
When you <em>set</em> an attribute, Python stuffs it into the dict.</p> <p>But since Python's flexible, it provides some mechanisms to let you mess around with this stuff, if you want to. You can define <code>__setattr__</code>, <code>__getattr__</code> and/or <code>__getattribute__</code> methods in a class, and they'll affect this behaviour.</p> <p>The base <code>object</code> class that almost all Python classes inherit from defines the default <code>__setattr__</code> and <code>__getattribute__</code>, which work sort of like the approximation I gave above. If you override <code>__setattr__</code> in a class, then when something tries to set an attribute for an instance of that class, that method will get called instead of the default <code>object.__setattr__</code>. If you override <code>__getattribute__</code>, then that method will get called instead of <code>object.__getattribute__</code> when something tries to <em>look up</em> an attribute of an instance of that class.</p> <p>If you leave <code>__getattribute__</code> alone but define <code>__getattr__</code>, then when something tries to look up an attribute, first the stock <code>object.__getattribute__</code> will be used to try and look it up, but if that doesn't find it, rather than raising an exception immediately, Python will try your <code>__getattr__</code> method.</p> <p>We can actually override <code>__setattr__</code> and <code>__getattr__</code> to do a very simplified demonstration of how the default implementation usually works:</p> <pre><code>#!/bin/python3

class TestClass(object):
    def __init__(self):
        self.__dict__["_attrs"] = {}

    def __setattr__(self, name, value):
        print("Hi, setattr here, just settin' attrs...")
        self._attrs[name] = value

    def __getattr__(self, name):
        print("Hi, getattr here, here's your attr!")
        return self._attrs[name]

tc = TestClass()
tc.foo = [1, 2, 3]
print(tc.foo)
tc.foo.append(4)
print(tc.foo)
</code></pre> <p>Note that <code>__dict__</code> is the store that
<code>object.__getattribute__</code> uses, so that's why we set up our backing store with <code>self.__dict__["_attrs"] = {}</code> - that ensures that when we look up <code>self._attrs</code>, we will find it via <code>__getattribute__</code>. We can't just do <code>self._attrs = {}</code> because then we wind up in an infinite recursion in our <code>__getattr__</code>.</p> <p>If you save that and run it, you'll see <code>[1, 2, 3]</code> then <code>[1, 2, 3, 4]</code> (plus the messages that prove our new methods are being used). Our mutable attribute is nice and properly mutable. We can append things to it and everything. Notice that when we append the value, we hit <code>__getattr__</code> but not <code>__setattr__</code>.</p> <p>So, how does this manage <em>not</em> to work with dnf config classes? (<code>MainConf</code> is a subclass of <code>BaseConfig</code>, and so are a ton of other config-related classes in dnf - we actually encountered this bug with another subclass, <code>RepoConf</code>). It turns out to be because dnf overrides <code>BaseConfig.__setattr__</code> and <code>BaseConfig.__getattr__</code> to do some "clever" stuff, and it breaks this!</p> <p>We don't need to go into what its <code>__setattr__</code> does in detail, except to note one thing: it <em>doesn't</em> store the values in the <code>__dict__</code> store, so <code>object.__getattribute__</code> can never find them. 
When looking up any attribute on an instance of one of these classes except <code>_config</code> (which is the store the class' <code>__setattr__</code> and <code>__getattr__</code> methods themselves use, just like <code>_attrs</code> in our example, and is created directly in <code>__dict__</code> in the same way), we always wind up in the class's <code>__getattr__</code>.</p> <p>Here's the whole of <a href="https://github.com/rpm-software-management/dnf/blob/3.0.1/dnf/conf/config.py#L198-L210">current dnf's <code>BaseConfig.__getattr__</code></a>:</p> <pre><code>def __getattr__(self, name):
    option = getattr(self._config, name, None)
    if option is None:
        return None
    try:
        value = option().getValue()
    except Exception as ex:
        return None
    if isinstance(value, cfg.VectorString):
        return list(value)
    if isinstance(value, str):
        return ucd(value)
    return value
</code></pre> <p>There is some more stuff going on in the background here that we don't need to worry about <em>too</em> much (a feature of DNF, I have found, is that it has layers upon layers. It contains multitudes. You usually can't debug anything in DNF without going through at least eight levels of things calling other things that turn out to be yet other things that turn out to be written in C++ just cuz.) In the case of the <code>group_package_types</code> option, and also the option we were actually dealing with in the buggy case (the <code>baseurl</code> option for a repo), the option is basically a list-y type, so we wind up in the <code>if isinstance(value, cfg.VectorString):</code> branch here.</p> <p>So if you follow it through, when we asked for <code>mc.group_package_types</code>, <em>unlike</em> in the default Python implementation or our simplified example, we didn't just pull an existing list object out from some sekrit store in the <code>mc</code> object. No. We got some kind of object (fact fans: it's a <code>libdnf.conf.OptionStringList</code> - libdnf is the C++ bit I mentioned earlier...)
out of the <code>self._config</code> dict that's acting as our sort-of attribute store, and ran its <code>getValue</code> method to get some <em>other</em> sort of object (fact fans: it's a <code>libdnf.conf.VectorString</code>), then we ran <code>list()</code> on <em>that</em> object, and returned <em>that</em>.</p> <p>The problem is that the thing that gets returned is basically a temporary copy of the 'real' backing value. It's a mutable object - it really is a list! - and we can mutate it...but the next time anyone looks up the same attribute we looked up to get that list, they won't get the <em>same</em> list object we got. This wacky <code>__getattr__</code> will run through the same indirection maze and return a <em>new listified copy</em> of the backing value. Every time you look up the attribute, it does that. We can mutate the copies all we like, but doing that doesn't actually change the backing value.</p> <p>Now, it's easy enough to work around this, <em>once you know what's going on</em>. The overridden <code>__setattr__</code>, of course, actually <em>does</em> change the backing store. So if you explicitly try to <em>set</em> an attribute (rather than getting one and mutating it), that <em>does</em> work:</p> <pre><code>[adamw@adam dnf (master %)]$ python3
Python 3.6.5 (default, Apr 23 2018, 22:53:50)
[GCC 8.0.1 20180410 (Red Hat 8.0.1-0.21)] on linux
Type "help", "copyright", "credits" or "license" for more information.
&gt;&gt;&gt; from dnf.conf import MainConf
&gt;&gt;&gt; mc = MainConf()
&gt;&gt;&gt; print(mc.group_package_types)
['mandatory', 'default', 'conditional']
&gt;&gt;&gt; mc.group_package_types = mc.group_package_types + ['foo']
&gt;&gt;&gt; print(mc.group_package_types)
['mandatory', 'default', 'conditional', 'foo']
&gt;&gt;&gt;
</code></pre> <p>That time, it worked because we didn't try to mutate our magical immutable mutable object. We just flat out <em>replaced</em> it with a <em>new</em> list.
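</p> <p>The whole trap is easy to reproduce outside dnf, by the way. Here's a toy class - entirely made up, just mimicking the <em>shape</em> of <code>BaseConfig</code> - whose <code>__getattr__</code> hands out a fresh listified copy of the backing value on every lookup:</p> <pre><code>class CopyOnLookup:
    def __init__(self):
        # backing store, created directly in __dict__ so we dodge our
        # own __setattr__ and __getattr__
        self.__dict__["_store"] = {"types": ["mandatory", "default"]}

    def __setattr__(self, name, value):
        # writes DO reach the backing store...
        self.__dict__["_store"][name] = value

    def __getattr__(self, name):
        # ...but every read returns a brand new copy, like
        # BaseConfig.__getattr__'s list(value)
        return list(self.__dict__["_store"][name])

c = CopyOnLookup()
c.types.append("foo")        # mutates a throwaway copy
print(c.types)               # ['mandatory', 'default'] - unchanged!
c.types = c.types + ["foo"]  # explicit set goes via __setattr__
print(c.types)               # ['mandatory', 'default', 'foo']
</code></pre> <p>Mutating the copy goes nowhere, but replacing the attribute lands in <code>__setattr__</code> and really does update the backing store - the same workaround as above.</p> <p>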
When we explicitly set the attribute like that, we hit the overridden <code>__setattr__</code> method, which does the necessary magic to write the new value to the backing store.</p> <p>But any regular Pythonista who sees that the instance attribute <code>foo.bar</code> is a mutable object is naturally going to assume that they can go ahead and mutate it. That's just...standard. They aren't going to figure they should ignore the fact that it's a mutable object and just replace it every time they want to change it. That's <a href="https://github.com/livecd-tools/livecd-tools/blob/livecd-tools-25.0/imgcreate/dnfinst.py#L181-L182">exactly what we did do</a> in the code that got broken. That's the exact code that used to work with DNF 2 but no longer does with DNF 3.</p> <p>So, that took a few hours to figure out! But <a href="https://github.com/livecd-tools/livecd-tools/pull/102">I got there in the end</a>. I really just wanted to write up my latest 'interesting' debugging adventure! But if there's a moral to this story...I guess it's "think <em>really</em> hard about whether messing with core behaviour like this is the right way to go about implementing your special thing"?</p> <p>Oh, and by the way: comments should be working again now! I just disabled the plugin that was interfering with them. So, you know, comment away.</p> The Tale Of The Two-Day, One-Character Patch https://www.happyassassin.net/posts/2017/01/11/the-tale-of-the-two-day-one-character-patch/ 2017-01-11T18:57:21Z 2017-01-11T18:57:21Z Adam Williamson <p>I'm feeling like writing a very long explanation of a very small change again. Some folks have told me they enjoy my attempts to detail the entire step-by-step process of debugging some somewhat complex problem, so sit back, folks, and enjoy...The Tale Of The Two-Day, One-Character Patch!</p> <p>Recently we landed Python 3.6 in Fedora Rawhide.
A Python version bump like that requires all Python-dependent packages in the distribution to be rebuilt. As usually happens, several packages failed to rebuild successfully, so among other work, I've been helping work through the list of failed packages and fixing them up.</p> <p>Two days ago, I reached <a href="https://admin.fedoraproject.org/pkgdb/package/python-deap">python-deap</a>. As usual, I first simply tried a mock build of the package: sometimes it turns out we already fixed whatever had previously caused the build to fail, and simply retrying will make it work. But that wasn't the case this time.</p> <p>The build failed due to build dependencies not being installable - <code>python2-pypandoc</code>, in this case. It turned out that this depends on <code>pandoc-citeproc</code>, and that wasn't installable because a new ghc build had been done without rebuilds of the set of pandoc-related packages that must be rebuilt after a ghc bump. So I rebuilt <code>pandoc</code>, and <code>ghc-aeson-pretty</code> (an updated version was needed to build an updated <code>pandoc-citeproc</code> which had been committed but not built), and finally <code>pandoc-citeproc</code>.</p> <p>With that done, I could do a successful scratch build of <code>python-deap</code>. I tweaked the package a bit to enable the test suites - another thing I'm doing for each package I'm fixing the build of, if possible - and fired off an <a href="https://koji.fedoraproject.org/koji/taskinfo?taskID=17206208">official build</a>.</p> <p>Now you may notice that this looks a bit odd, because all the builds for the different arches succeeded (they're green), but the overall 'State' is "failed". What's going on there? 
Well, if you click "Show result", you'll see this:</p> <pre><code>BuildError: The following noarch package built differently on different architectures: python-deap-doc-1.0.1-2.20160624git232ed17.fc26.noarch.rpm
rpmdiff output was:
error: cannot open Packages index using db5 - Permission denied (13)
error: cannot open Packages database in /var/lib/rpm
error: cannot open Packages database in /var/lib/rpm
removed /usr/share/doc/python-deap/html/_images/cma_plotting_01_00.png
removed /usr/share/doc/python-deap/html/examples/es/cma_plotting_01_00.hires.png
removed /usr/share/doc/python-deap/html/examples/es/cma_plotting_01_00.pdf
removed /usr/share/doc/python-deap/html/examples/es/cma_plotting_01_00.png
</code></pre> <p>So, this is a good example of where background knowledge is valuable. Getting from step to step in this kind of debugging/troubleshooting process is a sort of combination of logic, knowledge and perseverance. Always try to be logical and methodical. When you start out you won't have an awful lot of knowledge, so you'll need a lot of perseverance; hopefully, the longer you go on, the more knowledge you'll pick up, and thus the less perseverance you'll need!</p> <p>In this case the error is actually fairly helpful, but I also know a bit about packages (which helps) and remembered a recent mailing list discussion. Fedora allows arched packages with noarch subpackages, and this is how <code>python-deap</code> is set up: the main packages are arched, but there is a <code>python-deap-doc</code> subpackage that is noarch. We're concerned with that package here. I recalled <a href="https://lists.fedoraproject.org/archives/list/devel@lists.fedoraproject.org/message/VGL6XRJRZ4DM2ZFQCP5IRLLHT6XL4GW6/">a recent mailing list discussion</a> of this "built differently on different architectures" error.</p> <p>As discussed in that thread, we're failing a Koji check specific to this kind of package.
If all the per-arch builds succeed individually, Koji will take the noarch subpackage(s) from each arch and compare them; if they're not all the same, Koji will consider this an error and fail the build. After all, the point of a noarch package is that its contents are the same for all arches and so it shouldn't matter which arch build we take the noarch subpackage from. If it comes out different on different arches, something is clearly up.</p> <p>So this left me with the problem of figuring out which arch was different (it'd be nice if the Koji message actually told us...) and why. I started out just looking at the build logs for each arch and searching for 'cma_plotting'. This is actually another important thing: one of the most important approaches to have in your toolbox for this kind of work is just 'searching for significant-looking text strings'. That might be a grep or it might be a web search, but you'll probably wind up doing a lot of both. Remember good searching technique: try to find the most 'unusual' strings you can to search for, ones for which the results will be strongly correlated with your problem. This quickly told me that the problematic arch was ppc64. The 'removed' files were not present in that build, but they were present in the builds for all other arches.</p> <p>So I started looking more deeply into the <a href="https://kojipkgs.fedoraproject.org//work/tasks/6214/17206214/build.log">ppc64 build log</a>. If you search for 'cma_plotting' in that file, you'll see the very first result is "WARNING: Exception occurred in plotting cma_plotting". That sounds bad! Below it is a long Python traceback - the text starting "Traceback (most recent call last):".</p> <p>So what we have here is some kind of Python thing crashing during the build. If we quickly compare with the build logs on other arches, we don't see the same thing at all - there is no traceback in those build logs. 
Especially since this shows up right when the build process should be generating the files we know are the problem (the <code>cma_plotting</code> files, remember), we can be pretty sure this is our culprit.</p> <p>Now this is a pretty big scary traceback, but we can learn some things from it quite easily. One is very important: we can see quite easily <em>what</em> it is that's going wrong. If we look at the end of the traceback, we see that all the last calls involve files in <code>/usr/lib64/python2.7/site-packages/matplotlib</code>. This means we're dealing with a Python module called <code>matplotlib</code>. We can quite easily associate that with the package <a href="https://admin.fedoraproject.org/pkgdb/package/python-matplotlib">python-matplotlib</a>, and now we have our next suspect.</p> <p>If we look a bit before the traceback, we can get a bit more general context of what's going on, though it turns out not to be very important in this case. Sometimes it is, though. In this case we can see this:</p> <pre><code>+ sphinx-build-2 doc build/html
Running Sphinx v1.5.1
</code></pre> <p>Again, background knowledge comes in handy here: I happen to know that Sphinx is a tool for generating documentation. But if you didn't already know that, you should quite easily be able to find it out, by good old web search. So what's going on is the package build process is trying to generate python-deap's documentation, and that process uses this <code>matplotlib</code> library, and something is going very wrong - but only on ppc64, remember - in <code>matplotlib</code> when we try to generate one particular set of doc files.</p> <p>So next I start trying to figure out what's actually going wrong in matplotlib. As I mentioned, the traceback is pretty long. This is partly just because matplotlib is big and complex, but it's more because it's a fairly rare type of Python error - an infinite recursion.
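</p> <p>If you've never run into one, you can make yourself a miniature version of this error very easily (a toy, obviously, not matplotlib's code; note that on Python 3 the guard raises <code>RecursionError</code>, while the Python 2.7 used in these builds reports it as a <code>RuntimeError</code>):</p> <pre><code>def loop():
    # calls itself unconditionally, so it can never return; Python's
    # recursion guard eventually gives up
    return loop()

try:
    loop()
except RecursionError as exc:
    print(exc)    # e.g. maximum recursion depth exceeded
</code></pre> <p>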
You'll see the traceback ends with many, many repetitions of this line:</p> <pre><code>  File "/usr/lib64/python2.7/site-packages/matplotlib/mathtext.py", line 861, in _get_glyph
    return self._get_glyph('rm', font_class, sym, fontsize)
</code></pre> <p>followed by:</p> <pre><code>  File "/usr/lib64/python2.7/site-packages/matplotlib/mathtext.py", line 816, in _get_glyph
    uniindex = get_unicode_index(sym, math)
  File "/usr/lib64/python2.7/site-packages/matplotlib/mathtext.py", line 87, in get_unicode_index
    if symbol == '-':
RuntimeError: maximum recursion depth exceeded in cmp
</code></pre> <p>What 'recursion' means is pretty simple: it just means that a function can call <em>itself</em>. A common example of where you might want to do this is if you're trying to walk a directory tree. In Python-y pseudo-code it might look a bit like this:</p> <pre><code>def read_directory(directory):
    print(directory.name)
    for entry in directory:
        if entry is file:
            print(entry.name)
        if entry is directory:
            read_directory(entry)
</code></pre> <p>To deal with directories nested in other directories, the function just calls itself. The danger is if you somehow mess up when writing code like this, and it winds up in a loop, calling itself over and over and never escaping: this is 'infinite recursion'. Python, being a nice language, notices when this is going on, and bails after a certain number of recursions, which is what's happening here.</p> <p>So now we know where to look in matplotlib, and what to look for. Let's go take a look! matplotlib, like most everything else in the universe these days, is <a href="https://github.com/matplotlib/matplotlib">in github</a>, which is bad for ecosystem health but handy just for finding stuff. Let's go look at the <a href="https://github.com/matplotlib/matplotlib/blob/7a83fc426f689f15df812676dc36d003f7d39a9a/lib/matplotlib/mathtext.py#L801">function from the backtrace</a>.</p> <p>Well, this is pretty long, and maybe a bit intimidating.
But an interesting thing is, we don't really need to know what this function is <em>for</em> - I actually still don't know precisely (according to the name it should be returning a 'glyph' - a single visual representation for a specific character from a font - but it actually returns a font, the unicode index for the glyph, the name of the glyph, the font size, and whether the glyph is italicized, for some reason). What we need to concentrate on is the question of why this function is getting in a recursion loop on one arch (ppc64) but not any others.</p> <p>First let's figure out how the recursion is actually triggered - that's vital to figuring out what the next step in our chain is. The line that triggers the loop is this one:</p> <pre><code>            return self._get_glyph('rm', font_class, sym, fontsize)
</code></pre> <p>That's where it calls itself. It's kinda obvious that the authors expect that call to succeed - it shouldn't run down the same logical path, but instead get to the 'success' path (the <code>return font, uniindex, symbol_name, fontsize, slanted</code> line at the end of the function) and thus break the loop. But on ppc64, for some reason, it doesn't.</p> <p>So what's the logic path that leads us to that call, both initially and when it recurses? Well, it's down three levels of conditionals:</p> <pre><code>        if not found_symbol:
            if self.cm_fallback:
                &lt;other path&gt;
            else:
                if fontname in ('it', 'regular') and isinstance(self, StixFonts):
                    return self._get_glyph('rm', font_class, sym, fontsize)
</code></pre> <p>So we only get to this path if <code>found_symbol</code> is not set by the time we reach that first <code>if</code>, then if <code>self.cm_fallback</code> is not set, then if the <code>fontname</code> given when the function was called was 'it' or 'regular' and if the class instance this function (actually method) is a part of is an instance of the <code>StixFonts</code> class (or a subclass).
Don't worry if we're getting a bit too technical at this point, because I did spend a bit of time looking into those last two conditions, but ultimately they turned out not to be that significant. The important one is the first one: <code>if not found_symbol</code>.</p> <p>By this point, I'm starting to wonder if the problem is that we're failing to 'find' the symbol - in the first half of the function - when we shouldn't be. Now there are a couple of handy logical shortcuts we can take here that turned out to be rather useful. First we look at the whole logic flow of the <code>found_symbol</code> variable and see that it's a bit convoluted. From the start of the function, there are two different ways it can be set <code>True</code> - the <code>if self.use_cmex</code> block and then the 'fallback' <code>if not found_symbol</code> block after that. Then there's another block that starts <code>if found_symbol:</code> where it gets <em>set back to False again</em>, and another lookup is done:</p> <pre><code>        if found_symbol:
            (...)
            found_symbol = False
            font = self._get_font(new_fontname)
            if font is not None:
                glyphindex = font.get_char_index(uniindex)
                if glyphindex != 0:
                    found_symbol = True
</code></pre> <p>At first, though, we don't know if we're even hitting that block, or if we're failing to 'find' the symbol earlier on. It turns out, though, that it's easy to tell - because of this earlier block:</p> <pre><code>        if not found_symbol:
            try:
                uniindex = get_unicode_index(sym, math)
                found_symbol = True
            except ValueError:
                uniindex = ord('?')
                warn("No TeX to unicode mapping for '%s'" %
                     sym.encode('ascii', 'backslashreplace'),
                     MathTextWarning)
</code></pre> <p>Basically, if we don't find the symbol there, the code logs a warning.
We can see from our build log that we <em>don't</em> see any such warning, so we know that the code <em>does</em> initially succeed in finding the symbol - that is, when we get to the <code>if found_symbol:</code> block, <code>found_symbol</code> is True. That logically means that it's <em>that</em> block where the problem occurs - we have <code>found_symbol</code> going in, but where that block sets it back to <code>False</code> then looks it up again (after doing some kind of font substitution, I don't know why, don't care), it fails.</p> <p>The other thing I noticed while poking through this code is a later warning. Remember that the infinite recursion only happens <code>if fontname in ('it', 'regular') and isinstance(self, StixFonts)</code>? Well, what happens if that's <em>not</em> the case is interesting:</p> <pre><code>                if fontname in ('it', 'regular') and isinstance(self, StixFonts):
                    return self._get_glyph('rm', font_class, sym, fontsize)
                warn("Font '%s' does not have a glyph for '%s' [U+%x]" %
                     (new_fontname,
                      sym.encode('ascii', 'backslashreplace').decode('ascii'),
                      uniindex),
                     MathTextWarning)
</code></pre> <p>that is, if that condition <em>isn't</em> satisfied, instead of calling itself, the next thing the function does is log a warning. So it occurred to me to go and see if there are any of those warnings in the build logs. And, whaddayaknow, there are four such warnings in the ppc64 build log:</p> <pre><code>/usr/lib64/python2.7/site-packages/matplotlib/mathtext.py:866: MathTextWarning: Font 'rm' does not have a glyph for '1' [U+1d7e3]
  MathTextWarning)
/usr/lib64/python2.7/site-packages/matplotlib/mathtext.py:867: MathTextWarning: Substituting with a dummy symbol.
  warn("Substituting with a dummy symbol.", MathTextWarning)
/usr/lib64/python2.7/site-packages/matplotlib/mathtext.py:866: MathTextWarning: Font 'rm' does not have a glyph for '0' [U+1d7e2]
  MathTextWarning)
/usr/lib64/python2.7/site-packages/matplotlib/mathtext.py:866: MathTextWarning: Font 'rm' does not have a glyph for '-' [U+2212]
  MathTextWarning)
/usr/lib64/python2.7/site-packages/matplotlib/mathtext.py:866: MathTextWarning: Font 'rm' does not have a glyph for '2' [U+1d7e4]
  MathTextWarning)
</code></pre> <p>but there are <em>no</em> such warnings in the logs for other arches. That's really rather interesting. It makes one possibility very unlikely: that we do reach the recursed call on all arches, but it fails on ppc64 and succeeds on the other arches. It's looking far more likely that the problem is the "re-discovery" bit of the function - the <code>if found_symbol:</code> block where it looks up the symbol again - is usually working on other arches, but failing on ppc64.</p> <p>So just by looking at the logical flow of the function, particularly what happens in different conditional branches, we've actually been able to figure out quite a lot, without knowing or even caring what the function is really <em>for</em>. By this point, I was really focusing in on that <code>if found_symbol:</code> block. And that leads us to our next suspect. The most important bit in that block is where it actually decides whether to set <code>found_symbol</code> to True or not, here:</p> <pre><code>            font = self._get_font(new_fontname)
            if font is not None:
                glyphindex = font.get_char_index(uniindex)
                if glyphindex != 0:
                    found_symbol = True
</code></pre> <p>I didn't actually know whether it was failing because <code>self._get_font</code> didn't find anything, or because <code>font.get_char_index</code> returned 0.
I think I just played a hunch that <code>get_char_index</code> was the problem, but it wouldn't be too difficult to find out by just editing the code a bit to log a message telling you whether or not <code>font</code> was <code>None</code>, and re-running the test suite.</p> <p>Anyhow, I wound up looking at <code>get_char_index</code>, so we need to go find that. You could work backwards through the code and figure out what <code>font</code> is an instance of so you can find it, but that's boring: it's far quicker just to grep the damn code. If you do that, you get various results that are calls of it, then this:</p> <pre><code>src/ft2font_wrapper.cpp:const char *PyFT2Font_get_char_index__doc__ =
src/ft2font_wrapper.cpp:    "get_char_index()\n"
src/ft2font_wrapper.cpp:static PyObject *PyFT2Font_get_char_index(PyFT2Font *self, PyObject *args, PyObject *kwds)
src/ft2font_wrapper.cpp:    if (!PyArg_ParseTuple(args, "I:get_char_index", &amp;ccode)) {
src/ft2font_wrapper.cpp:    {"get_char_index", (PyCFunction)PyFT2Font_get_char_index, METH_VARARGS, PyFT2Font_get_char_index__doc__},
</code></pre> <p>Which is the point at which I started mentally buckling myself in, because now we're out of Python and into C++. Glorious C++! I should note at this point that, while I'm probably a half-decent Python coder at this point, I am still pretty awful at C(++). I may be somewhat or very wrong in anything I say about it. Corrections welcome.</p> <p>So I buckled myself in and went for a look at this <a href="https://github.com/matplotlib/matplotlib/blob/v2.0.0rc2/src/ft2font_wrapper.cpp">ft2font_wrapper.cpp thing</a>. I've seen this kind of thing a couple of times before, so by squinting at it a bit sideways, I could more or less see that this is what Python calls an <a href="https://docs.python.org/2/extending/extending.html">extension module</a>: basically, it's a Python module written in C or C++.
This gets done if you need to create a new built-in type, or for speed, or - as in this case - because the Python project wants to work directly with a system shared library (in this case, freetype), either because it doesn't have Python bindings or because the project doesn't want to use them for some reason.</p> <p>This code pretty much provides a few classes for working with Freetype fonts. It defines a class called <code>matplotlib.ft2font.FT2Font</code> with a method <code>get_char_index</code>, and that's what the code back up in <code>mathtext.py</code> is dealing with: that <code>font</code> we were dealing with is an <code>FT2Font</code> instance, and we're using its <code>get_char_index</code> method to try and 'find' our 'symbol'.</p> <p>Fortunately, this <code>get_char_index</code> method is actually simple enough that even I can figure out what it's doing:</p> <pre><code>static PyObject *PyFT2Font_get_char_index(PyFT2Font *self, PyObject *args, PyObject *kwds)
{
    FT_UInt index;
    FT_ULong ccode;

    if (!PyArg_ParseTuple(args, "I:get_char_index", &amp;ccode)) {
        return NULL;
    }

    index = FT_Get_Char_Index(self-&gt;x-&gt;get_face(), ccode);

    return PyLong_FromLong(index);
}
</code></pre> <p>(If you're playing along at home for MEGA BONUS POINTS, you now have all the necessary information and you can try to figure out what the bug is. If you just want me to explain it, keep reading!)</p> <p>There's really not an awful lot there. It's calling <code>FT_Get_Char_Index</code> with a couple of args and returning the result. Not rocket science.</p> <p>In fact, this seemed like a good point to start just doing a bit of experimenting to identify the precise problem, because we've reduced the problem to a very small area.
So this is where I stopped just reading the code and started hacking it up to see what it did.</p> <p>First I tweaked the relevant block in <code>mathtext.py</code> to just log the values it was feeding in and getting out:</p> <pre><code>            font = self._get_font(new_fontname)
            if font is not None:
                glyphindex = font.get_char_index(uniindex)
                warn("uniindex: %s, glyphindex: %s" % (uniindex, glyphindex))
                if glyphindex != 0:
                    found_symbol = True
</code></pre> <p>Sidenote: how exactly to just print something out to the console when you're building or running tests can vary quite a bit depending on the codebase in question. What I usually do is just look at how the project already does it - find some message that <em>is</em> being printed when you build or run the tests, and then copy that. Thus in this case we can see that the code is using this <code>warn</code> function (it's actually <code>warnings.warn</code>), and we know those messages are appearing in our build logs, so...let's just copy that.</p> <p>Then I ran the test suite on both x86_64 and ppc64, and compared. This told me that the Python code was passing the same <code>uniindex</code> values to the C code on both x86_64 and ppc64, but getting different results back - that is, I got the same recorded <code>uniindex</code> values, but on x86_64 the resulting <code>glyphindex</code> value was always something larger than 0, but on ppc64, it was sometimes 0.</p> <p>The next step should be pretty obvious: log the input and output values in the <em>C</em> code.</p> <pre><code>    index = FT_Get_Char_Index(self-&gt;x-&gt;get_face(), ccode);
    printf("ccode: %lu index: %u\n", ccode, index);
</code></pre> <p>Another sidenote: one of the more annoying things with this particular issue was just being able to run the tests with modifications and see what happened. First, I needed an actual ppc64 environment to use.
The awesome <a href="https://patrick.uiterwijk.org/">Patrick Uiterwijk</a> of Fedora release engineering provided me with one. Then I built a .src.rpm of the python-matplotlib package, ran a mock build of it, and shelled into the mock environment. That gives you an environment with all the necessary build dependencies and the source and the tests all there and prepared already. Then I just copied the necessary build, install and test commands from the spec file. For a simple pure-Python module this is all usually pretty easy and you can just check the source out and do it right in your regular environment or in a virtualenv or something, but for something like matplotlib which has this C++ extension module too, it's more complex. The spec builds the code, then installs it, then runs the tests out of the source directory with <code>PYTHONPATH=BUILDROOT/usr/lib64/python2.7/site-packages</code>, so the code that was actually built and installed is used for the tests. When I wanted to modify the C part of matplotlib, I edited it in the source directory, then re-ran the 'build' and 'install' steps, then ran the tests; if I wanted to modify the Python part I just edited it directly in the BUILDROOT location and re-ran the tests. When I ran the tests on ppc64, I noticed that several hundred of them failed with exactly the bug we'd seen in the <code>python-deap</code> package build - this infinite recursion problem. Several others failed due to not being able to find the glyph, without hitting the recursion. (It turned out the package maintainer had disabled the tests on ppc64, and so Fedora 24+'s <code>python-matplotlib</code> has been broken on ppc64 since about April.)</p> <p>So anyway, with that modified C code built and used to run the test suite, I finally had a smoking gun. Running this on x86_64 and ppc64, the logged <code>ccode</code> values were totally different. The values logged on ppc64 were huge. 
But as we know from the previous logging, there was no difference in the value when the Python code passed it to the C code (the <code>uniindex</code> value logged in the Python code).</p> <p>So now I knew: the problem lay in how the C code took the value from the Python code. At this point I started figuring out how that worked. The key line is this one:</p> <pre><code>if (!PyArg_ParseTuple(args, "I:get_char_index", &amp;ccode)) { </code></pre> <p>That <code>PyArg_ParseTuple</code> function is what the C code is using to read in the value that <code>mathtext.py</code> calls <code>uniindex</code> and it calls <code>ccode</code>, the one that's somehow being messed up on ppc64. So let's <a href="https://docs.python.org/2/c-api/arg.html#c.PyArg_ParseTuple">read the docs</a>!</p> <p>This is one unusual example where the Python docs, which are usually awesome, are a bit difficult, because that's a very thin description which doesn't provide the references you usually get. But all you really need to do is read up - go back to the top of the page, and you get a much more comprehensive explanation. Reading carefully through the whole page, we can see pretty much what's going on in this call. It basically means that <code>args</code> is expected to be a structure representing a single Python object, a number, which we will store into the C variable <code>ccode</code>. The tricky bit is that second arg, <code>"I:get_char_index"</code>. This is the 'format string' that the Python page goes into a lot of helpful detail about.</p> <p>As it tells us, <code>PyArg_ParseTuple</code> "use[s] format strings which are used to tell the function about the expected arguments...A format string consists of zero or more “format units.” A format unit describes one Python object; it is usually a single character or a parenthesized sequence of format units. 
With a few exceptions, a format unit that is not a parenthesized sequence normally corresponds to a single address argument to these functions." Next we get a list of the 'format units', and <code>I</code> is one of those:</p> <pre><code> I (integer) [unsigned int] Convert a Python integer to a C unsigned int, without overflow checking. </code></pre> <p>You might also notice that the list of format units includes several for converting Python integers to <em>other</em> things, like <code>i</code> for 'signed int' and <code>h</code> for 'short int'. This will become significant soon!</p> <p>The <code>:get_char_index</code> bit threw me for a minute, but it's explained further down:</p> <p>"A few other characters have a meaning in a format string. These may not occur inside nested parentheses. They are: ... : The list of format units ends here; the string after the colon is used as the function name in error messages (the “associated value” of the exception that PyArg_ParseTuple() raises)." So in our case here, we have only a single 'format unit' - <code>I</code> - and <code>get_char_index</code> is just a name that'll be used in any error messages this call might produce.</p> <p>So now we know what this call is doing. It's saying "when some Python code calls this function, take the args it was called with and parse them into C structures so we can do stuff with them. In this case, we expect there to be just a single arg, which will be a Python integer, and we want to convert it to a C unsigned integer, and store it in the C variable <code>ccode</code>."</p> <p>(If you're playing along at home but you didn't get it earlier, you really should be able to get it now! Hint: read up just a few lines in the C code. If not, go refresh your memory about architectures...)</p> <p>And once I understood that, I realized what the problem was. 
Let's read up just a few lines in the C code:</p> <pre><code>FT_ULong ccode; </code></pre> <p>Unlike Python, C and C++ are 'typed languages'. That just means that all variables must be declared to be of a specific type, unlike Python variables, which you don't have to declare explicitly and which can change type any time you like. This is a variable declaration: it's simply saying "we want a variable called <code>ccode</code>, and it's of type <code>FT_ULong</code>".</p> <p>If you know anything at all about C integer types, you should know what the problem is by now (you probably worked it out a few paragraphs back). But if you don't, now's a good time to learn!</p> <p>There are several different types you can use for storing integers in C: <code>short</code>, <code>int</code>, <code>long</code>, and possibly <code>long long</code> (depends on your arch). This is basically all about efficiency: you can only put a small number in a <code>short</code>, but if you only <em>need</em> to store small numbers, it might be more efficient to use a <code>short</code> than a <code>long</code>. Theoretically, when you use a <code>short</code> the compiler will allocate less memory than when you use an <code>int</code>, which uses less memory again than a <code>long</code>, which uses less than a <code>long long</code>. Practically speaking some of them wind up being the same size on some platforms, but the basic idea's there.</p> <p>All the types have <code>signed</code> and <code>unsigned</code> variants. The difference there is simple: <code>signed</code> numbers can be negative, <code>unsigned</code> ones can't. Say an <code>int</code> is big enough to let you store 101 different values: a <code>signed int</code> would let you store any number between -50 and +50, while an <code>unsigned int</code> would let you store any number between 0 and 100.</p> <p>Now look at that <code>ccode</code> declaration again. What is its type? <code>FT_ULong</code>. 
That <code>ULong</code>...sounds a lot like <code>unsigned long</code>, right?</p> <p>Yes it does! Here, have a cookie. C code often declares its own aliases for standard C types like this; we can find Freetype's in its API documentation, which I found by the cunning technique of doing a <a href="https://duckduckgo.com/?q=ft_ulong&amp;t=h_&amp;ia=web">web search for FT_ULong</a>. That finds us <a href="https://www.freetype.org/freetype2/docs/reference/ft2-basic_types.html#FT_ULong">this handy definition</a>: "A typedef for unsigned long."</p> <p>Aaaaaaand herein lies our bug! Whew, at last. As, hopefully, you can now see, this <code>ccode</code> variable is declared as an <code>unsigned long</code>, but we're telling <code>PyArg_ParseTuple</code> to convert the Python object such that we can store it as an <code>unsigned int</code>, not an <code>unsigned long</code>.</p> <p>But wait, you think. Why does this seem to work OK on most arches, and only fail on ppc64? Again, some of you will already know the answer, good for you, now go read something else. ;) For the rest of you, it's all about this concept called 'endianness', which you might have come across and completely failed to understand, like I did many times! But it's really pretty simple, at least if we skate over it just a bit.</p> <p>Consider the number "forty-two". Here is how we write it with numerals: 42. Right? At least, that's how most humans do it, these days, unless you're a particularly hardy survivor of the fall of Rome, or something. This means we humans are 'big-endian'. If we were 'little-endian', we'd write it like this: 24. 'Big-endian' just means the most significant element comes 'first' in the representation; 'little-endian' means the most significant element comes last.</p> <p>All the arches Fedora supports except for ppc64 are little-endian. On little-endian arches, this error doesn't actually cause a problem: even though we used the wrong format unit, the value winds up being correct. 
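</p> <p>Sidenote: you can simulate the whole mess from regular Python with the <code>struct</code> module. This is only a sketch of what I <em>believe</em> the C ends up doing: the <code>I</code> format unit writes <code>sizeof(unsigned int)</code> = 4 bytes through the pointer it's given, the variable behind that pointer is really 8 bytes wide (an <code>unsigned long</code>), and we'll assume the 4 bytes that never get written happen to be zero (in the real bug they're whatever was lying around on the stack):</p>

```python
import struct

def parse_I_into_ulong(value, byteorder):
    """Simulate PyArg_ParseTuple("I", ...) storing into an unsigned long.

    The "I" format unit writes a 4-byte unsigned int through the
    pointer it's given; here the variable behind that pointer is
    really 8 bytes wide. We assume the 4 bytes we never write stay
    zero (in the real bug they hold stack garbage).
    """
    order = '>' if byteorder == 'big' else '<'
    buf = bytearray(8)                          # the 8-byte unsigned long
    buf[0:4] = struct.pack(order + 'I', value)  # write 4 bytes at its start
    return struct.unpack(order + 'Q', bytes(buf))[0]

print(hex(parse_I_into_ulong(0x2212, 'little')))  # 0x2212 -- correct, by luck
print(hex(parse_I_into_ulong(0x2212, 'big')))     # 0x221200000000 -- value << 32
```

<p>(U+2212 is the minus sign, one of the very characters the ppc64 build log complained about; and <code>value &lt;&lt; 32</code> is the same as multiplying by 2^32.)</p> <p>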
On (64-bit) big-endian arches, however, it <em>does</em> cause a problem - when you tell <code>PyArg_ParseTuple</code> to convert to an unsigned int, but store the result into a variable that was declared as an unsigned long, you get a completely different value (it's multiplied by 2^32). The reasons for this involve getting into a more technical understanding of little-endian vs. big-endian (we actually have to get into the icky details of how values are <em>really</em> represented in memory), which I'm going to skip since this post is already long enough.</p> <p>But you don't <em>really</em> need to understand it completely, certainly not to be able to spot problems like this. All you need to know is that there are little-endian and big-endian arches, and little-endian ones are far more prevalent these days, so it's not unusual for low-level code to have weird bugs on big-endian arches. If something works fine on most arches but not on one or two, check if the ones where it fails are big-endian. If so, then keep a careful eye out for this kind of integer type mismatch problem, because it's very, very likely to be the cause.</p> <p>So now all that remained to do was to <em>fix</em> the problem. And here we go, with our <a href="https://patch-diff.githubusercontent.com/raw/matplotlib/matplotlib/pull/7768.patch">one character patch</a>:</p> <pre><code>diff --git a/src/ft2font_wrapper.cpp b/src/ft2font_wrapper.cpp index a97de68..c77dd83 100644 --- a/src/ft2font_wrapper.cpp +++ b/src/ft2font_wrapper.cpp @@ -971,7 +971,7 @@ static PyObject *PyFT2Font_get_char_index(PyFT2Font *self, PyObject *args, PyObj FT_UInt index; FT_ULong ccode; - if (!PyArg_ParseTuple(args, "I:get_char_index", &amp;ccode)) { + if (!PyArg_ParseTuple(args, "k:get_char_index", &amp;ccode)) { return NULL; } </code></pre> <p>There's something I just love about a one-character change that fixes several hundred test failures. 
:) As you can see, we simply change the <code>I</code> - the format unit for <code>unsigned int</code> - to <code>k</code> - the format unit for <code>unsigned long</code>. And with that, the bug is solved! I applied this change on both x86_64 and ppc64, re-built the code and re-ran the test suite, and observed that several hundred errors disappeared from the test suite on ppc64, while the x86_64 tests continued to pass.</p> <p>So I was able to send that patch upstream, apply it to the Fedora package, and once the package build went through, I could finally <a href="https://koji.fedoraproject.org/koji/buildinfo?buildID=827137">build python-deap successfully</a>, two days after I'd first tried it.</p> <p>Bonus extra content: even though I'd fixed the <code>python-deap</code> problem, as I'm never able to leave well enough alone, it wound up bugging me that there were still several hundred other failures in the matplotlib test suite on ppc64. So I wound up looking into all the other failures, and finding <a href="https://github.com/matplotlib/matplotlib/pull/7781">several</a> <a href="https://github.com/matplotlib/matplotlib/pull/7782">other</a> <a href="https://github.com/matplotlib/matplotlib/pull/7791">similar</a> <a href="https://github.com/matplotlib/matplotlib/pull/7796">issues</a>, which got the failure count down to <a href="https://github.com/matplotlib/matplotlib/issues/7797">just</a> <a href="https://github.com/matplotlib/matplotlib/issues/7798">two</a> sets of problems that are too domain-specific for me to figure out, and actually also happen on aarch64 and ppc64le (they're not big-endian issues). 
So to both the people running matplotlib on ppc64...you're welcome ;)</p> <p>Seriously, though, I suspect without these fixes, we might have had some odd cases where a noarch package's documentation would suddenly get messed up if the package happened to get built on a ppc64 builder.</p> <p>I'm feeling like writing a very long explanation of a very small change again. Some folks have told me they enjoy my attempts to detail the entire step-by-step process of debugging some somewhat complex problem, so sit back, folks, and enjoy...The Tale Of The Two-Day, One-Character Patch!</p> <p>Recently we landed Python 3.6 in Fedora Rawhide. A Python version bump like that requires all Python-dependent packages in the distribution to be rebuilt. As usually happens, several packages failed to rebuild successfully, so among other work, I've been helping work through the list of failed packages and fixing them up.</p> <p>Two days ago, I reached <a href="https://admin.fedoraproject.org/pkgdb/package/python-deap">python-deap</a>. As usual, I first simply tried a mock build of the package: sometimes it turns out we already fixed whatever had previously caused the build to fail, and simply retrying will make it work. But that wasn't the case this time.</p> <p>The build failed due to build dependencies not being installable - <code>python2-pypandoc</code>, in this case. It turned out that this depends on <code>pandoc-citeproc</code>, and that wasn't installable because a new ghc build had been done without rebuilds of the set of pandoc-related packages that must be rebuilt after a ghc bump. So I rebuilt <code>pandoc</code>, and <code>ghc-aeson-pretty</code> (an updated version was needed to build an updated <code>pandoc-citeproc</code> which had been committed but not built), and finally <code>pandoc-citeproc</code>.</p> <p>With that done, I could do a successful scratch build of <code>python-deap</code>. 
I tweaked the package a bit to enable the test suites - another thing I'm doing for each package I'm fixing the build of, if possible - and fired off an <a href="https://koji.fedoraproject.org/koji/taskinfo?taskID=17206208">official build</a>.</p> <p>Now you may notice that this looks a bit odd, because all the builds for the different arches succeeded (they're green), but the overall 'State' is "failed". What's going on there? Well, if you click "Show result", you'll see this:</p> <pre><code>BuildError: The following noarch package built differently on different architectures: python-deap-doc-1.0.1-2.20160624git232ed17.fc26.noarch.rpm rpmdiff output was: error: cannot open Packages index using db5 - Permission denied (13) error: cannot open Packages database in /var/lib/rpm error: cannot open Packages database in /var/lib/rpm removed /usr/share/doc/python-deap/html/_images/cma_plotting_01_00.png removed /usr/share/doc/python-deap/html/examples/es/cma_plotting_01_00.hires.png removed /usr/share/doc/python-deap/html/examples/es/cma_plotting_01_00.pdf removed /usr/share/doc/python-deap/html/examples/es/cma_plotting_01_00.png </code></pre> <p>So, this is a good example of where background knowledge is valuable. Getting from step to step in this kind of debugging/troubleshooting process is a sort of combination of logic, knowledge and perseverance. Always try to be logical and methodical. When you start out you won't have an awful lot of knowledge, so you'll need a lot of perseverance; hopefully, the longer you go on, the more knowledge you'll pick up, and thus the less perseverance you'll need!</p> <p>In this case the error is actually fairly helpful, but I also know a bit about packages (which helps) and remembered a recent mailing list discussion. Fedora allows arched packages with noarch subpackages, and this is how <code>python-deap</code> is set up: the main packages are arched, but there is a <code>python-deap-docs</code> subpackage that is noarch. 
We're concerned with that package here. I recalled <a href="https://lists.fedoraproject.org/archives/list/devel@lists.fedoraproject.org/message/VGL6XRJRZ4DM2ZFQCP5IRLLHT6XL4GW6/">a recent mailing list discussion</a> of this "built differently on different architectures" error.</p> <p>As discussed in that thread, we're failing a Koji check specific to this kind of package. If all the per-arch builds succeed individually, Koji will take the noarch subpackage(s) from each arch and compare them; if they're not all the same, Koji will consider this an error and fail the build. After all, the point of a noarch package is that its contents are the same for all arches and so it shouldn't matter which arch build we take the noarch subpackage from. If it comes out different on different arches, something is clearly up.</p> <p>So this left me with the problem of figuring out which arch was different (it'd be nice if the Koji message actually told us...) and why. I started out just looking at the build logs for each arch and searching for 'cma_plotting'. This is actually another important thing: one of the most important approaches to have in your toolbox for this kind of work is just 'searching for significant-looking text strings'. That might be a grep or it might be a web search, but you'll probably wind up doing a lot of both. Remember good searching technique: try to find the most 'unusual' strings you can to search for, ones for which the results will be strongly correlated with your problem. This quickly told me that the problematic arch was ppc64. The 'removed' files were not present in that build, but they were present in the builds for all other arches.</p> <p>So I started looking more deeply into the <a href="https://kojipkgs.fedoraproject.org//work/tasks/6214/17206214/build.log">ppc64 build log</a>. If you search for 'cma_plotting' in that file, you'll see the very first result is "WARNING: Exception occurred in plotting cma_plotting". That sounds bad! 
Below it is a long Python traceback - the text starting "Traceback (most recent call last):".</p> <p>So what we have here is some kind of Python thing crashing during the build. If we quickly compare with the build logs on other arches, we don't see the same thing at all - there is no traceback in those build logs. Especially since this shows up right when the build process should be generating the files we know are the problem (the <code>cma_plotting</code> files, remember), we can be pretty sure this is our culprit.</p> <p>Now this is a pretty big scary traceback, but we can learn some things from it quite easily. One is very important: we can see quite easily <em>what</em> it is that's going wrong. If we look at the end of the traceback, we see that all the last calls involve files in <code>/usr/lib64/python2.7/site-packages/matplotlib</code>. This means we're dealing with a Python module called <code>matplotlib</code>. We can quite easily associate that with the package <a href="https://admin.fedoraproject.org/pkgdb/package/python-matplotlib">python-matplotlib</a>, and now we have our next suspect.</p> <p>If we look a bit before the traceback, we can get a bit more general context of what's going on, though it turns out not to be very important in this case. Sometimes it is, though. In this case we can see this:</p> <pre><code>+ sphinx-build-2 doc build/html Running Sphinx v1.5.1 </code></pre> <p>Again, background knowledge comes in handy here: I happen to know that Sphinx is a tool for generating documentation. But if you didn't already know that, you should quite easily be able to find it out, by good old web search. 
So what's going on is the package build process is trying to generate python-deap's documentation, and that process uses this <code>matplotlib</code> library, and something is going very wrong - but only on ppc64, remember - in <code>matplotlib</code> when we try to generate one particular set of doc files.</p> <p>So next I start trying to figure out what's actually going wrong in matplotlib. As I mentioned, the traceback is pretty long. This is partly just because matplotlib is big and complex, but it's more because it's a fairly rare type of Python error - an infinite recursion. You'll see the traceback ends with many, many repetitions of this line:</p> <pre><code> File "/usr/lib64/python2.7/site-packages/matplotlib/mathtext.py", line 861, in _get_glyph return self._get_glyph('rm', font_class, sym, fontsize) </code></pre> <p>followed by:</p> <pre><code> File "/usr/lib64/python2.7/site-packages/matplotlib/mathtext.py", line 816, in _get_glyph uniindex = get_unicode_index(sym, math) File "/usr/lib64/python2.7/site-packages/matplotlib/mathtext.py", line 87, in get_unicode_index if symbol == '-': RuntimeError: maximum recursion depth exceeded in cmp </code></pre> <p>What 'recursion' means is pretty simple: it just means that a function can call <em>itself</em>. A common example of where you might want to do this is if you're trying to walk a directory tree. In Python-y pseudo-code it might look a bit like this:</p> <pre><code>def read_directory(directory): print(directory.name) for entry in directory: if entry is file: print(entry.name) if entry is directory: read_directory(entry) </code></pre> <p>To deal with directories nested in other directories, the function just calls itself. The danger is if you somehow mess up when writing code like this, and it winds up in a loop, calling itself over and over and never escaping: this is 'infinite recursion'. 
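</p> <p>If you want to see that for yourself, here's about the shortest possible demonstration - nothing matplotlib-specific, just a function that unconditionally calls itself. (On Python 2, which this build was using, the exception is a plain <code>RuntimeError</code>, as in our build log; Python 3.5+ gives it its own <code>RecursionError</code> class.)</p>

```python
import sys

def loop():
    # no base case and no way out: every call just makes another call
    return loop()

try:
    loop()
except RecursionError as exc:
    # Python notices the runaway recursion and bails for us
    print("bailed out:", exc)

print("the limit was:", sys.getrecursionlimit())
```

<p>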
Python, being a nice language, notices when this is going on, and bails after a certain number of recursions, which is what's happening here.</p> <p>So now we know where to look in matplotlib, and what to look for. Let's go take a look! matplotlib, like most everything else in the universe these days, is <a href="https://github.com/matplotlib/matplotlib">in github</a>, which is bad for ecosystem health but handy just for finding stuff. Let's go look at the <a href="https://github.com/matplotlib/matplotlib/blob/7a83fc426f689f15df812676dc36d003f7d39a9a/lib/matplotlib/mathtext.py#L801">function from the backtrace</a>.</p> <p>Well, this is pretty long, and maybe a bit intimidating. But an interesting thing is, we don't really need to know what this function is <em>for</em> - I actually still don't know precisely (according to the name it should be returning a 'glyph' - a single visual representation for a specific character from a font - but it actually returns a font, the unicode index for the glyph, the name of the glyph, the font size, and whether the glyph is italicized, for some reason). What we need to concentrate on is the question of why this function is getting in a recursion loop on one arch (ppc64) but not any others.</p> <p>First let's figure out how the recursion is actually triggered - that's vital to figuring out what the next step in our chain is. The line that triggers the loop is this one:</p> <pre><code> return self._get_glyph('rm', font_class, sym, fontsize) </code></pre> <p>That's where it calls itself. It's kinda obvious that the authors expect that call to succeed - it shouldn't run down the same logical path, but instead get to the 'success' path (the <code>return font, uniindex, symbol_name, fontsize, slanted</code> line at the end of the function) and thus break the loop. But on ppc64, for some reason, it doesn't.</p> <p>So what's the logic path that leads us to that call, both initially and when it recurses? 
Well, it's down three levels of conditionals:</p> <pre><code> if not found_symbol: if self.cm_fallback: &lt;other path&gt; else: if fontname in ('it', 'regular') and isinstance(self, StixFonts): return self._get_glyph('rm', font_class, sym, fontsize) </code></pre> <p>So we only get to this path if <code>found_symbol</code> is not set by the time we reach that first <code>if</code>, then if <code>self.cm_fallback</code> is not set, then if the <code>fontname</code> given when the function was called was 'it' or 'regular' and if the class instance this function (actually method) is a part of is an instance of the <code>StixFonts</code> class (or a subclass). Don't worry if we're getting a bit too technical at this point, because I did spend a bit of time looking into those last two conditions, but ultimately they turned out not to be that significant. The important one is the first one: <code>if not found_symbol</code>.</p> <p>By this point, I'm starting to wonder if the problem is that we're failing to 'find' the symbol - in the first half of the function - when we shouldn't be. Now there are a couple of handy logical shortcuts we can take here that turned out to be rather useful. First we look at the whole logic flow of the <code>found_symbol</code> variable and see that it's a bit convoluted. From the start of the function, there are two different ways it can be set <code>True</code> - the <code>if self.use_cmex</code> block and then the 'fallback' <code>if not found_symbol</code> block after that. Then there's another block that starts <code>if found_symbol:</code> where it gets <em>set back to False again</em>, and another lookup is done:</p> <pre><code> if found_symbol: (...) 
found_symbol = False font = self._get_font(new_fontname) if font is not None: glyphindex = font.get_char_index(uniindex) if glyphindex != 0: found_symbol = True </code></pre> <p>At first, though, we don't know if we're even hitting that block, or if we're failing to 'find' the symbol earlier on. It turns out, though, that it's easy to tell - because of this earlier block:</p> <pre><code> if not found_symbol: try: uniindex = get_unicode_index(sym, math) found_symbol = True except ValueError: uniindex = ord('?') warn("No TeX to unicode mapping for '%s'" % sym.encode('ascii', 'backslashreplace'), MathTextWarning) </code></pre> <p>Basically, if we don't find the symbol there, the code logs a warning. We can see from our build log that we <em>don't</em> see any such warning, so we know that the code <em>does</em> initially succeed in finding the symbol - that is, when we get to the <code>if found_symbol:</code> block, <code>found_symbol</code> is True. That logically means that it's <em>that</em> block where the problem occurs - we have <code>found_symbol</code> going in, but where that block sets it back to <code>False</code> then looks it up again (after doing some kind of font substitution, I don't know why, don't care), it fails.</p> <p>The other thing I noticed while poking through this code is a later warning. Remember that the infinite recursion only happens <code>if fontname in ('it', 'regular') and isinstance(self, StixFonts)</code>? Well, what happens if that's <em>not</em> the case is interesting:</p> <pre><code> if fontname in ('it', 'regular') and isinstance(self, StixFonts): return self._get_glyph('rm', font_class, sym, fontsize) warn("Font '%s' does not have a glyph for '%s' [U+%x]" % (new_fontname, sym.encode('ascii', 'backslashreplace').decode('ascii'), uniindex), MathTextWarning) </code></pre> <p>that is, if that condition <em>isn't</em> satisfied, instead of calling itself, the next thing the function does is log a warning. 
So it occurred to me to go and see if there are any of those warnings in the build logs. And, whaddayaknow, there are four such warnings in the ppc64 build log:</p> <pre><code>/usr/lib64/python2.7/site-packages/matplotlib/mathtext.py:866: MathTextWarning: Font 'rm' does not have a glyph for '1' [U+1d7e3] MathTextWarning) /usr/lib64/python2.7/site-packages/matplotlib/mathtext.py:867: MathTextWarning: Substituting with a dummy symbol. warn("Substituting with a dummy symbol.", MathTextWarning) /usr/lib64/python2.7/site-packages/matplotlib/mathtext.py:866: MathTextWarning: Font 'rm' does not have a glyph for '0' [U+1d7e2] MathTextWarning) /usr/lib64/python2.7/site-packages/matplotlib/mathtext.py:866: MathTextWarning: Font 'rm' does not have a glyph for '-' [U+2212] MathTextWarning) /usr/lib64/python2.7/site-packages/matplotlib/mathtext.py:866: MathTextWarning: Font 'rm' does not have a glyph for '2' [U+1d7e4] MathTextWarning) </code></pre> <p>but there are <em>no</em> such warnings in the logs for other arches. That's really rather interesting. It makes one possibility very unlikely: that we do reach the recursed call on all arches, but it fails on ppc64 and succeeds on the other arches. It's looking far more likely that the problem is the "re-discovery" bit of the function - the <code>if found_symbol:</code> block where it looks up the symbol again - is usually working on other arches, but failing on ppc64.</p> <p>So just by looking at the logical flow of the function, particularly what happens in different conditional branches, we've actually been able to figure out quite a lot, without knowing or even caring what the function is really <em>for</em>. By this point, I was really focusing in on that <code>if found_symbol:</code> block. And that leads us to our next suspect. 
The most important bit in that block is where it actually decides whether to set <code>found_symbol</code> to True or not, here:</p> <pre><code> font = self._get_font(new_fontname) if font is not None: glyphindex = font.get_char_index(uniindex) if glyphindex != 0: found_symbol = True </code></pre> <p>I didn't actually know whether it was failing because <code>self._get_font</code> didn't find anything, or because <code>font.get_char_index</code> returned 0. I think I just played a hunch that <code>get_char_index</code> was the problem, but it wouldn't be too difficult to find out by just editing the code a bit to log a message telling you whether or not <code>font</code> was <code>None</code>, and re-running the test suite.</p> <p>Anyhow, I wound up looking at <code>get_char_index</code>, so we need to go find that. You could work backwards through the code and figure out what <code>font</code> is an instance of so you can find it, but that's boring: it's far quicker just to grep the damn code. If you do that, you get various results that are calls of it, then this:</p> <pre><code>src/ft2font_wrapper.cpp:const char *PyFT2Font_get_char_index__doc__ = src/ft2font_wrapper.cpp: "get_char_index()\n" src/ft2font_wrapper.cpp:static PyObject *PyFT2Font_get_char_index(PyFT2Font *self, PyObject *args, PyObject *kwds) src/ft2font_wrapper.cpp: if (!PyArg_ParseTuple(args, "k:get_char_index", &amp;ccode)) { src/ft2font_wrapper.cpp: {"get_char_index", (PyCFunction)PyFT2Font_get_char_index, METH_VARARGS, PyFT2Font_get_char_index__doc__}, </code></pre> <p>Which is the point at which I started mentally buckling myself in, because now we're out of Python and into C++. Glorious C++! I should note at this point that, while I'm probably a half-decent Python coder at this point, I am still pretty awful at C(++). I may be somewhat or very wrong in anything I say about it. 
Corrections welcome.</p> <p>So I buckled myself in and went for a look at this <a href="https://github.com/matplotlib/matplotlib/blob/v2.0.0rc2/src/ft2font_wrapper.cpp">ft2font_wrapper.cpp thing</a>. I've seen this kind of thing a couple of times before, so by squinting at it a bit sideways, I could more or less see that this is what Python calls an <a href="https://docs.python.org/2/extending/extending.html">extension module</a>: basically, it's a Python module written in C or C++. This gets done if you need to create a new built-in type, or for speed, or - as in this case - because the Python project wants to work directly with a system shared library (in this case, freetype), either because it doesn't have Python bindings or because the project doesn't want to use them for some reason.</p> <p>This code pretty much provides a few classes for working with Freetype fonts. It defines a class called <code>matplotlib.ft2font.FT2Font</code> with a method <code>get_char_index</code>, and that's what the code back up in <code>mathtext.py</code> is dealing with: that <code>font</code> we were dealing with is an <code>FT2Font</code> instance, and we're using its <code>get_char_index</code> method to try and 'find' our 'symbol'.</p> <p>Fortunately, this <code>get_char_index</code> method is actually simple enough that even I can figure out what it's doing:</p> <pre><code>static PyObject *PyFT2Font_get_char_index(PyFT2Font *self, PyObject *args, PyObject *kwds) { FT_UInt index; FT_ULong ccode; if (!PyArg_ParseTuple(args, "I:get_char_index", &amp;ccode)) { return NULL; } index = FT_Get_Char_Index(self-&gt;x-&gt;get_face(), ccode); return PyLong_FromLong(index); } </code></pre> <p>(If you're playing along at home for MEGA BONUS POINTS, you now have all the necessary information and you can try to figure out what the bug is. If you just want me to explain it, keep reading!)</p> <p>There's really not an awful lot there. 
It's calling <code>FT_Get_Char_Index</code> with a couple of args and returning the result. Not rocket science.</p> <p>In fact, this seemed like a good point to start just doing a bit of experimenting to identify the precise problem, because we've reduced the problem to a very small area. So this is where I stopped just reading the code and started hacking it up to see what it did.</p> <p>First I tweaked the relevant block in <code>mathtext.py</code> to just log the values it was feeding in and getting out:</p> <pre><code> font = self._get_font(new_fontname) if font is not None: glyphindex = font.get_char_index(uniindex) warn("uniindex: %s, glyphindex: %s" % (uniindex, glyphindex)) if glyphindex != 0: found_symbol = True </code></pre> <p>Sidenote: how exactly to just print something out to the console when you're building or running tests can vary quite a bit depending on the codebase in question. What I usually do is just look at how the project already does it - find some message that <em>is</em> being printed when you build or run the tests, and then copy that. Thus in this case we can see that the code is using this <code>warn</code> function (it's actually <code>warnings.warn</code>), and we know those messages are appearing in our build logs, so...let's just copy that.</p> <p>Then I ran the test suite on both x86_64 and ppc64, and compared. 
This told me that the Python code was passing the same <code>uniindex</code> values to the C code on both x86_64 and ppc64, but getting different results back - that is, I got the same recorded <code>uniindex</code> values, but on x86_64 the resulting <code>glyphindex</code> value was always something larger than 0, but on ppc64, it was sometimes 0.</p> <p>The next step should be pretty obvious: log the input and output values in the <em>C</em> code.</p> <pre><code>index = FT_Get_Char_Index(self-&gt;x-&gt;get_face(), ccode); printf("ccode: %lu index: %u\n", ccode, index); </code></pre> <p>Another sidenote: one of the more annoying things with this particular issue was just being able to run the tests with modifications and see what happened. First, I needed an actual ppc64 environment to use. The awesome <a href="https://patrick.uiterwijk.org/">Patrick Uiterwijk</a> of Fedora release engineering provided me with one. Then I built a .src.rpm of the python-matplotlib package, ran a mock build of it, and shelled into the mock environment. That gives you an environment with all the necessary build dependencies and the source and the tests all there and prepared already. Then I just copied the necessary build, install and test commands from the spec file. For a simple pure-Python module this is all usually pretty easy and you can just check the source out and do it right in your regular environment or in a virtualenv or something, but for something like matplotlib which has this C++ extension module too, it's more complex. The spec builds the code, then installs it, then runs the tests out of the source directory with <code>PYTHONPATH=BUILDROOT/usr/lib64/python2.7/site-packages</code> , so the code that was actually built and installed is used for the tests. 
When I wanted to modify the C part of matplotlib, I edited it in the source directory, then re-ran the 'build' and 'install' steps, then ran the tests; if I wanted to modify the Python part I just edited it directly in the BUILDROOT location and re-ran the tests. When I ran the tests on ppc64, I noticed that several hundred of them failed with exactly the bug we'd seen in the <code>python-deap</code> package build - this infinite recursion problem. Several others failed due to not being able to find the glyph, without hitting the recursion. It turned out the package maintainer had disabled the tests on ppc64, and so Fedora 24+'s <code>python-matplotlib</code> had been broken on ppc64 since about April.</p> <p>So anyway, with that modified C code built and used to run the test suite, I finally had a smoking gun. Running this on x86_64 and ppc64, the logged <code>ccode</code> values were totally different. The values logged on ppc64 were huge. But as we know from the previous logging, there was no difference in the value when the Python code passed it to the C code (the <code>uniindex</code> value logged in the Python code).</p> <p>So now I knew: the problem lay in how the C code took the value from the Python code. At this point I started figuring out how that worked. The key line is this one:</p> <pre><code>if (!PyArg_ParseTuple(args, "I:get_char_index", &amp;ccode)) { </code></pre> <p>That <code>PyArg_ParseTuple</code> function is what the C code is using to read in the value that <code>mathtext.py</code> calls <code>uniindex</code> and it calls <code>ccode</code>, the one that's somehow being messed up on ppc64. So let's <a href="https://docs.python.org/2/c-api/arg.html#c.PyArg_ParseTuple">read the docs</a>!</p> <p>This is one unusual example where the Python docs, which are usually awesome, are a bit difficult, because that's a very thin description which doesn't provide the references you usually get. 
But all you really need to do is read up - go back to the top of the page, and you get a much more comprehensive explanation. Reading carefully through the whole page, we can see pretty much what's going on in this call. It basically means that <code>args</code> is expected to be a structure representing a single Python object, a number, which we will store into the C variable <code>ccode</code>. The tricky bit is that second arg, <code>"I:get_char_index"</code>. This is the 'format string' that the Python page goes into a lot of helpful detail about.</p> <p>As it tells us, <code>PyArg_ParseTuple</code> "use[s] format strings which are used to tell the function about the expected arguments...A format string consists of zero or more “format units.” A format unit describes one Python object; it is usually a single character or a parenthesized sequence of format units. With a few exceptions, a format unit that is not a parenthesized sequence normally corresponds to a single address argument to these functions." Next we get a list of the 'format units', and <code>I</code> is one of those:</p> <pre><code> I (integer) [unsigned int] Convert a Python integer to a C unsigned int, without overflow checking. </code></pre> <p>You might also notice that the list of format units include several for converting Python integers to <em>other</em> things, like <code>i</code> for 'signed int' and <code>h</code> for 'short int'. This will become significant soon!</p> <p>The <code>:get_char_index</code> bit threw me for a minute, but it's explained further down:</p> <p>"A few other characters have a meaning in a format string. These may not occur inside nested parentheses. They are: ... : The list of format units ends here; the string after the colon is used as the function name in error messages (the “associated value” of the exception that PyArg_ParseTuple() raises)." 
So in our case here, we have only a single 'format unit' - <code>I</code> - and <code>get_char_index</code> is just a name that'll be used in any error messages this call might produce.</p> <p>So now we know what this call is doing. It's saying "when some Python code calls this function, take the args it was called with and parse them into C structures so we can do stuff with them. In this case, we expect there to be just a single arg, which will be a Python integer, and we want to convert it to a C unsigned integer, and store it in the C variable <code>ccode</code>."</p> <p>(If you're playing along at home but you didn't get it earlier, you really should be able to get it now! Hint: read up just a few lines in the C code. If not, go refresh your memory about architectures...)</p> <p>And once I understood that, I realized what the problem was. Let's read up just a few lines in the C code:</p> <pre><code>FT_ULong ccode; </code></pre> <p>Unlike Python, C and C++ are 'typed languages'. That just means that all variables must be declared to be of a specific type, unlike Python variables, which you don't have to declare explicitly and which can change type any time you like. This is a variable declaration: it's simply saying "we want a variable called <code>ccode</code>, and it's of type <code>FT_ULong</code>".</p> <p>If you know anything at all about C integer types, you should know what the problem is by now (you probably worked it out a few paragraphs back). But if you don't, now's a good time to learn!</p> <p>There are several different types you can use for storing integers in C: <code>short</code>, <code>int</code>, <code>long</code>, and possibly <code>long long</code> (depends on your arch). This is basically all about efficiency: you can only put a small number in a <code>short</code>, but if you only <em>need</em> to store small numbers, it might be more efficient to use a <code>short</code> than a <code>long</code>. 
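</p> <p>Sidenote: you can actually check the sizes of the two types in play here from Python itself, via <code>ctypes</code>. This is just a quick illustrative check - the exact figures are platform-dependent - but on the 64-bit Linux boxes we're talking about, an unsigned int is 4 bytes and an unsigned long is 8:</p>

```python
import ctypes

# Sizes (in bytes) of the C integer types discussed here.
# These are platform-dependent; on 64-bit Linux they are 4 and 8.
print(ctypes.sizeof(ctypes.c_uint))   # 'unsigned int' - what format unit 'I' fills
print(ctypes.sizeof(ctypes.c_ulong))  # 'unsigned long'
```

<p>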
Theoretically, when you use a <code>short</code> the compiler will allocate less memory than when you use an <code>int</code>, which uses less memory again than a <code>long</code>, which uses less than a <code>long long</code>. Practically speaking some of them wind up being the same size on some platforms, but the basic idea's there.</p> <p>All the types have <code>signed</code> and <code>unsigned</code> variants. The difference there is simple: <code>signed</code> numbers can be negative, <code>unsigned</code> ones can't. Say an <code>int</code> is big enough to let you store 101 different values: a <code>signed int</code> would let you store any number between -50 and +50, while an <code>unsigned int</code> would let you store any number between 0 and 100.</p> <p>Now look at that <code>ccode</code> declaration again. What is its type? <code>FT_ULong</code>. That <code>ULong</code>...sounds a lot like <code>unsigned long</code>, right?</p> <p>Yes it does! Here, have a cookie. C code often declares its own aliases for standard C types like this; we can find Freetype's in its API documentation, which I found by the cunning technique of doing a <a href="https://duckduckgo.com/?q=ft_ulong&amp;t=h_&amp;ia=web">web search for FT_ULong</a>. That finds us <a href="https://www.freetype.org/freetype2/docs/reference/ft2-basic_types.html#FT_ULong">this handy definition</a>: "A typedef for unsigned long."</p> <p>Aaaaaaand herein lies our bug! Whew, at last. As, hopefully, you can now see, this <code>ccode</code> variable is declared as an <code>unsigned long</code>, but we're telling <code>PyArg_ParseTuple</code> to convert the Python object such that we can store it as an <code>unsigned int</code>, not an <code>unsigned long</code>.</p> <p>But wait, you think. Why does this seem to work OK on most arches, and only fail on ppc64? Again, some of you will already know the answer, good for you, now go read something else. 
;) For the rest of you, it's all about this concept called 'endianness', which you might have come across and completely failed to understand, like I did many times! But it's really pretty simple, at least if we skate over it just a bit.</p> <p>Consider the number "forty-two". Here is how we write it with numerals: 42. Right? At least, that's how most humans do it, these days, unless you're a particularly hardy survivor of the fall of Rome, or something. This means we humans are 'big-endian'. If we were 'little-endian', we'd write it like this: 24. 'Big-endian' just means the most significant element comes 'first' in the representation; 'little-endian' means the most significant element comes last.</p> <p>All the arches Fedora supports except for ppc64 are little-endian. On little-endian arches, this error doesn't actually cause a problem: even though we used the wrong format unit, the value winds up being correct. On (64-bit) big-endian arches, however, it <em>does</em> cause a problem - when you tell <code>PyArg_ParseTuple</code> to convert to an unsigned int, but store the result into a variable that was declared as an unsigned long, you get a completely different value (it's multiplied by 2<sup>32</sup>). The reasons for this involve getting into a more technical understanding of little-endian vs. big-endian (we actually have to get into the icky details of how values are <em>really</em> represented in memory), which I'm going to skip since this post is already long enough.</p> <p>But you don't <em>really</em> need to understand it completely, certainly not to be able to spot problems like this. All you need to know is that there are little-endian and big-endian arches, and little-endian are far more prevalent these days, so it's not unusual for low-level code to have weird bugs on big-endian arches. If something works fine on most arches but not on one or two, check if the ones where it fails are big-endian. 
If so, then keep a careful eye out for this kind of integer type mismatch problem, because it's very, very likely to be the cause.</p> <p>So now all that remained to do was to <em>fix</em> the problem. And here we go, with our <a href="https://patch-diff.githubusercontent.com/raw/matplotlib/matplotlib/pull/7768.patch">one character patch</a>:</p> <pre><code>diff --git a/src/ft2font_wrapper.cpp b/src/ft2font_wrapper.cpp index a97de68..c77dd83 100644 --- a/src/ft2font_wrapper.cpp +++ b/src/ft2font_wrapper.cpp @@ -971,7 +971,7 @@ static PyObject *PyFT2Font_get_char_index(PyFT2Font *self, PyObject *args, PyObj FT_UInt index; FT_ULong ccode; - if (!PyArg_ParseTuple(args, "I:get_char_index", &amp;ccode)) { + if (!PyArg_ParseTuple(args, "k:get_char_index", &amp;ccode)) { return NULL; } </code></pre> <p>There's something I just love about a one-character change that fixes several hundred test failures. :) As you can see, we simply change the <code>I</code> - the format unit for <code>unsigned int</code> - to <code>k</code> - the format unit for <code>unsigned long</code>. And with that, the bug is solved! I applied this change on both x86_64 and ppc64, re-built the code and re-ran the test suite, and observed that several hundred errors disappeared from the test suite on ppc64, while the x86_64 tests continued to pass.</p> <p>So I was able to send that patch upstream, apply it to the Fedora package, and once the package build went through, I could finally <a href="https://koji.fedoraproject.org/koji/buildinfo?buildID=827137">build python-deap successfully</a>, two days after I'd first tried it.</p> <p>Bonus extra content: even though I'd fixed the <code>python-deap</code> problem, as I'm never able to leave well enough alone, it wound up bugging me that there were still several hundred other failures in the matplotlib test suite on ppc64. 
So I wound up looking into all the other failures, and finding <a href="https://github.com/matplotlib/matplotlib/pull/7781">several</a> <a href="https://github.com/matplotlib/matplotlib/pull/7782">other</a> <a href="https://github.com/matplotlib/matplotlib/pull/7791">similar</a> <a href="https://github.com/matplotlib/matplotlib/pull/7796">issues</a>, which got the failure count down to <a href="https://github.com/matplotlib/matplotlib/issues/7797">just</a> <a href="https://github.com/matplotlib/matplotlib/issues/7798">two</a> sets of problems that are too domain-specific for me to figure out, and actually also happen on aarch64 and ppc64le (they're not big-endian issues). So to both the people running matplotlib on ppc64...you're welcome ;)</p> <p>Seriously, though, I suspect without these fixes, we might have had some odd cases where a noarch package's documentation would suddenly get messed up if the package happened to get built on a ppc64 builder.</p> <p></p> Of DPIs, desktops, and toolkits https://www.happyassassin.net/posts/2015/07/09/of-dpis-desktops-and-toolkits/ 2015-07-09T15:40:01Z 2015-07-09T15:40:01Z Adam Williamson <p></p><p>Fair warning up front: this is going to be one of my long, rambling, deep dives. It doesn't even have a conclusion yet, just lots of messy details about what I'm trying to deal with.</p> <p>So this week I was aiming to do some work on our openQA stuff - ideally, add some extra tests for Server things. But first, since we finally had working Rawhide images again, I figured I'd check the existing openQA tests are still working.</p> <p>Unfortunately...nope. There are various problems, but I'm focusing on one particular case here. The 'default install of a live image' test reached the 'Root Password' screen and then failed.</p> <p>With a bit of careful image comparison, I figured out why openQA wasn't happy. 
With the Workstation live image, the capital letters in the text - 'R', 'P', 'D' - were all one pixel shorter than on the reference screenshot, which had been taken from a boot.iso install. openQA works on screenshot comparisons; this difference was sufficient for it to decide the screen of the test didn't match the reference screen, and so the test failed. Now I needed to figure out <em>why</em> the fonts rendered ever so slightly differently between the two cases.</p> <p>I happen to know a bit about font rendering and GNOME internals, so I had an initial theory: the difference was in the font rendering configuration. Somehow some element of this was set differently in the Workstation live environment vs. the traditional installer environment. On the Workstation live, obviously, GNOME is running. On 'traditional install' images - boot.iso (network install) and the Server DVD image - GNOME is not running; you just have anaconda running on metacity on X.</p> <p>I happen to know that GNOME - via gnome-settings-daemon - has some of its own default settings for font rendering. It defaults to 'medium' hinting, 'greyscale' antialiasing, and a resolution of exactly 96dpi (ignoring hidpi cases for now). You can see the hinting values in <a href="https://git.gnome.org/browse/gnome-settings-daemon/tree/data/org.gnome.settings-daemon.plugins.xsettings.gschema.xml.in.in">this gsettings schema</a>, and the DPI comes in via <a href="https://git.gnome.org/browse/gnome-settings-daemon/tree/plugins/xsettings/gsd-xsettings-manager.c#n79">gsd-xsettings-manager.c</a>.</p> <p>I also knew that there's a file <code>/etc/X11/Xresources</code>, which can be used to set the same configuration at the X level in some way. I checked a clean Fedora install and found that file looks like this (minus comments):</p> <pre><code>Xft.dpi: 96 Xft.hintstyle: hintmedium Xft.hinting: true </code></pre> <p>so...it appeared to be identical to the GNOME defaults. 
At this point I was a bit stuck, and was very grateful to <a href="https://blogs.gnome.org/mclasen/">Matthias Clasen</a> for helping with the next bit.</p> <p>He told me how to find out what settings a given anaconda instance was actually <em>using</em>. You can press ctrl+shift+d (or ctrl+shift+i) in any running GTK+ 3 app to launch <a href="https://blogs.gnome.org/mclasen/2014/05/15/introducing-gtkinspector/">GtkInspector</a>, a rather handy debugging tool. By clicking the GtkSettings 'object' and then going to 'Properties', we can see the relevant settings (down at the bottom): gtk-xft-antialias, gtk-xft-dpi, gtk-xft-hinting and gtk-xft-hintstyle. In my test case - a KVM at 1024x768 resolution - I found that in the Workstation live case, gtk-xft-dpi was 98304 and gtk-xft-hintstyle was 'medium', while in the boot.iso case (and also on KDE live images), gtk-xft-dpi was 98401 and gtk-xft-hintstyle was 'full'.</p> <p>So then we had to figure out why. gtk-xft-dpi has a multiplication factor of 1024. So on the Workstation image the actual DPI being used was 98304 / 1024 = 96 (precisely), while on boot.iso it was 98401 / 1024 = 96.094726563 . I confirmed that changing the gtk-xft-dpi value to 98304 and gtk-xft-hintstyle to medium made the affected glyphs the same in boot.iso as in Workstation. But why was boot.iso getting different values from Workstation? And where did the rather odd 98401 gtk-xft-dpi value come from?</p> <p>Matthias again pointed out what actually <em>happens</em> with the <code>/etc/X11/Xresources</code> values, which I never knew. They get loaded via a command <code>xrdb</code>, which affects the contents of something called an X resource database. This is reasonably well documented on <a href="https://en.wikipedia.org/wiki/X_resources">Wikipedia</a> and in the xrdb man page. Basically, the resource database is used as a store of configuration data. The <code>/etc/X11/Xresources</code> values (usually...) 
get loaded into the RESOURCE_MANAGER property of the root window of screen 0 at X startup, and act as defaults, more or less.</p> <p>There are various paths on which the <code>/etc/X11/Xresources</code> values actually get loaded into the RESOURCE_MANAGER. There's a file <code>/etc/X11/xinitrc-common</code> which does it: it calls <code>xrdb -nocpp -merge /etc/X11/Xresources</code> if that file exists. (It also loads in ~/.Xresources afterwards if that exists, allowing a user to override the system-wide config). <code>xinitrc-common</code> is sourced by <code>/etc/X11/xinit/xinitrc</code> and <code>/etc/X11/xinit/Xsession</code>, meaning any way you start X that runs through one of those files will get the <code>/etc/X11/Xresources</code> values applied. The <code>startx</code> command runs through <code>xinit</code> which runs through <code>xinitrc</code>, so that's covered. Display managers either run through <code>/etc/X11/xinit/Xsession</code> or implement something rather similar themselves; either way, they all ultimately load in the <code>Xresources</code> settings via <code>xrdb</code>.</p> <p>Sidebar: how gnome-settings-daemon actually implements the configuration that's specified in dconf is to load it into the X resource database and <a href="http://standards.freedesktop.org/xsettings-spec/xsettings-spec-0.5.html">xsettings</a>. So what happens when you boot to GNOME is that gdm loads in the values from <code>/etc/X11/Xresources</code>, then gnome-settings-daemon overwrites them with whatever it gets from dconf.</p> <p>In the anaconda traditional installer image case, though, nothing does actually load the settings. X gets started by anaconda itself, which does it simply by calling <code>/usr/libexec/Xorg</code> directly; it doesn't do anything to load in the <code>Xresources</code> settings, either then or at any other time.</p> <p>GTK+ is written to use the values from xsettings or the X resource database, if there are any. 
But what does it do when nothing has written the settings?</p> <p>Well, as Matthias and I finally found out, it has its own internal fallbacks. As of Monday, it would default to hintstyle 'full', and for the DPI, it would calculate it from the resolution and the display size reported by X. I found that, on the test VM, with a resolution of 1024x768, xdpyinfo reported the display size as 270x203mm. The calculation uses the vertical values, so we get: 768 / (203 / 25.4) * 1024 = 98400.851231527 , which rounds up to 98401 - the gtk-xft-dpi value we found that boot.iso was using! Eureka.</p> <p>So finally we understood what was actually going on: nothing was loading in the <code>Xresources</code> or dconf settings (DPI exactly 96, medium hinting) so GTK+ was falling back on its internal defaults, and getting a DPI of just over 96 and 'full' hinting, and that was causing the discrepancy in rendering.</p> <p>The next thing I did for extra credit was figure out what's going on with KDE, which had the same rendering as traditional installer images. You'd think the KDE path would load in the <code>Xresources</code> values, though. And - although at first I guessed the problem was that it was failing to - it does. SDDM, the login manager we use for KDE, certainly does load them in via <code>xrdb</code>. No, the problem turned out to be a bit subtler.</p> <p>KDE does something broadly similar to GNOME. It has its own font rendering configuration options stored in its own settings database, and a little widget called 'krdb' which kicks in on session start and loads them into the X resource database. Qt then (I believe) respects the xrdb and xsettings values like GTK+ does.</p> <p>Like GNOME, KDE overrides any defaults loaded in from <code>Xresources</code> (or anywhere else) with its own settings. In the KDE control center you can pick hinting and antialiasing settings and a DPI value, and they all get written out via xrdb. 
However, KDE has one important difference in how it handles the DPI setting. As I said above, GNOME wires it to 96 by default. KDE, instead, tries to entirely unset the <code>Xft.dpi</code> setting. What KDE expects to happen in that case is that apps/toolkits will calculate a 'correct' DPI for the screen and use that. As we saw earlier, this is indeed what GTK+ attempts to do, and Qt attempts much the same.</p> <p>As another sidebar, while figuring all this out, Kevin Kofler - who was equally helpful on the KDE side of things as Matthias was on the GNOME side - and I worked out that the krdb code is actually quite badly broken here: instead of just unsetting <code>Xft.dpi</code>, <a href="https://bugs.kde.org/show_bug.cgi?id=350039">it throws away <em>all</em> existing X resource settings</a>. But even if that bug gets fixed, the fact that it throws out any DPI setting that came from <code>/etc/X11/Xresources</code> is actually intentional. So in the KDE path, the values from <code>/etc/X11/Xresources</code> do get loaded (by SDDM), but krdb then throws them away. And so we fall into the same path we did on boot.iso, for anaconda: GTK+ has no X resource or xsetting value for the DPI, so it calculates it, and gets the same 98401 result.</p> <p>For today's extra credit, I started looking at the DPI calculation aspect of this whole shebang from the other end: where was the 270x203mm 'display size', that results in GTK+ calculating a DPI of just over 96, coming from in the first place?</p> <p>The (rather funny, to me at least...) answer is that it comes from X!</p> <p>When you start X, it sets an initial physical screen size - and it doesn't do it the way you might, perhaps, expect. It doesn't look at the results of an EDID monitor probe, or try in some way to combine the results of multiple probes for multiple monitors. 
Nope, it simply specifies a physical size <em>with the intent of producing a particular DPI value</em>!</p> <p><a href="http://cgit.freedesktop.org/xorg/xserver/tree/hw/xfree86/modes/xf86RandR12.c#n787">Here's the code</a>.</p> <p>As you can see there are basically three paths: one if <code>monitorResolution</code> is set, one if <code>output-&gt;conf_monitor-&gt;mon_width</code> and <code>output-&gt;conf_monitor-&gt;mon_height</code> are set, and a fallback if neither of those are true. <code>monitorResolution</code> is set if the Xorg binary was called with the <code>-dpi</code> parameter. <code>mon_width</code> and <code>mon_height</code> are set if some X config file or another contains the <code>DisplaySize</code> parameter.</p> <p>If you specified a dpi on the command line, X will figure out whatever 'physical size' would result in that DPI for the resolution that's in use, and claim that the screen is that size. If you specify a size with the <code>DisplaySize</code> parameter, X just uses that. And if neither of those is true, X will pick a physical size based on a built-in default DPI, which is...96.</p> <p>One important wrinkle is that X uses ints for this calculation. What that means is it uses whole numbers; the ultimate result (the display size in millimetres) is rounded to the nearest whole number.</p> <p>You know those bits where GTK+ and Qt try to calculate a 'correct' DPI based on the monitor size? Well (I bet you can guess where this is going), so far as I can tell, the value they use is <em>the value X made up to match some arbitrary DPI in the first place</em>!</p> <p>Let's have a look at the full series of magnificent inanity that goes on when you boot up a clean Fedora KDE and run an app:</p> <ul> <li>X is called without <code>-dpi</code> and there is no <code>DisplaySize</code> config option, so X calculates a completely artificial 'physical size' for the screen in order to get a result of <code>DEFAULT_DPI</code>, which is 96. 
Let's say we get 1024x768 resolution - X figures out that a display of size 270x203mm has a resolution of 96dpi, and sets that as the 'physical size' of the display. (Well...almost. Note that bit about rounding above.)</li> <li>sddm loads in the <code>/etc/X11/Xresources</code> DPI value - precisely 96 - to the X resource database: <code>Xft.dpi: 96</code></li> <li>krdb says 'no, we don't want some silly hardcoded default DPI! We want to be clever and calculate the 'correct' DPI for the display!' and throws away the <code>Xft.dpi</code> value from the X resource database. (Because it's broken it throws away all the other settings too, but for now let's pretend it just does what it actually <em>means</em> to do.)</li> <li>When the app starts up, Qt (or GTK+, they do the same calculation) takes the resolution and the physical display size reported by X and calculates the DPI...which, not surprisingly, comes out to about 96. Only not exactly 96, because of the rounding thing: X rounded off its calculation, so in most cases, the DPI the toolkit arrives at won't be <em>quite</em> exactly 96.</li> </ul> <p>Seems a mite silly, doesn't it? :) There is something of a 'heated debate' about whether it's better to try and calculate a DPI based on screen size or simply assume 96 (with some kind of integer scaling for hidpi displays) - the GNOME code which assumes 96dpi has a long email justification from ajax as a comment block - but regardless of which side of that debate you pick, it's fairly absurd that the toolkits have this code to perform the 'correct' DPI calculation based on a screen size which is almost always going to be a lie intended to produce a specific DPI result!</p> <p>Someone <a href="http://lists.x.org/archives/xorg-devel/2011-October/026227.html">proposed a patch</a> for X.org back in 2011 which would've used the EDID-reported display size (so KDE would actually get the result it wants), but it seems to have died on the vine. 
There's some more colorful, er, 'context' in <a href="https://bugs.freedesktop.org/show_bug.cgi?id=41115">this bug report</a>.</p> <p>I haven't figured out absolutely every wrinkle of this stuff, yet. There's clearly some code in at least Qt and possibly in X which does...stuff...to at least some 'display size' values when the monitor mode is changed (via RandR) in any way. One rather funny thing is that if you go into the KDE Control Center and change the display resolution, the physical display size reported by <code>xdpyinfo</code> changes - but it doesn't change to match the correct physical size of the screen, it looks like something else does the same 'report a physical screen size that would be 96dpi at this resolution' calculation and changes the physical size to <em>that</em>. It doesn't quite seem to do the calculation identically to how X does it, though, or something - it comes up with slightly different answers.</p> <p>So on my 1920x1080 laptop, X initially sets the physical size to 507x285mm (which gives a DPI of 96.252631579). If I use KDE control center to change the resolution to 1400x1050, the physical size reported by xdpyinfo changes to 369x277mm (which gives a DPI of 96.281588448). If I then change it back to 1920x1080, the physical size changes to 506x284mm - <em>nearly</em> the same value X initially set it to, but not quite. That gives a DPI of 96.591549296. Why is that? I'm damned if I know. None of those sizes, of course, resembles the actual size of the laptop's display, which is 294x165mm (as reported by xrandr) - it's a 13" laptop.</p> <p>None of this happens, though, if I change the resolution via xrandr instead of the KDE control panel: the reported physical size stays at 507x285mm throughout. 
(So if you change resolutions in KDE with xrandr then launch a new app, it'll be using some different DPI value from your other apps, most likely).</p> <p>So what the hell do I do to make anaconda look consistent in all cases, for openQA purposes? Well, mclasen <a href="https://git.gnome.org/browse/gtk+/commit/?id=bdf0820c501437a2150d8ff0d5340246e713f73f">made a change to GTK+</a> which may be controversial, but would certainly solve my problem: he changed its 'fallback' behaviour. If no hinting style is set in the X resource database or xsettings, now it'll default to 'medium' instead of 'full' - that's fairly uncontroversial. But if no specific DPI is configured in xrdb or xsettings, instead of doing the calculation, it now simply assumes 96dpi (pace hidpi scaling). In theory, this could be a significant change - if things all worked as KDE thought they did, there'd now be a significant discrepancy between Qt and GTK+ apps in KDE on higher resolution displays. But in point of fact, since the 'calculated' DPI is going to wind up being ~96dpi in most cases anyway, it's not going to make such a huge difference...</p> <p>Without the GTK+ change (or if it gets reverted), we'd probably have to make anaconda load the <code>Xresources</code> settings in when it starts X, so it gets the precise 96dpi value rather than the calculation which ends up giving not-quite-exactly 96dpi. For the KDE case, we'd probably have to futz up the tests a bit to somehow force the 96dpi value into xrdb or xsettings before running anaconda. We're not yet running the openQA tests on the KDE live image, though, so it's not urgent yet.</p> <p></p> <p></p><p>Fair warning up front: this is going to be one of my long, rambling, deep dives. It doesn't even have a conclusion yet, just lots of messy details about what I'm trying to deal with.</p> <p>So this week I was aiming to do some work on our openQA stuff - ideally, add some extra tests for Server things. 
But first, since we finally had working Rawhide images again, I figured I'd check the existing openQA tests are still working.</p> <p>Unfortunately...nope. There are various problems, but I'm focusing on one particular case here. The 'default install of a live image' test reached the 'Root Password' screen and then failed.</p> <p>With a bit of careful image comparison, I figured out why openQA wasn't happy. With the Workstation live image, the capital letters in the text - 'R', 'P', 'D' - were all one pixel shorter than on the reference screenshot, which had been taken from a boot.iso install. openQA works on screenshot comparisons; this difference was sufficient for it to decide the screen of the test didn't match the reference screen, and so the test failed. Now I needed to figure out <em>why</em> the fonts rendered ever so slightly differently between the two cases.</p> <p>I happen to know a bit about font rendering and GNOME internals, so I had an initial theory: the difference was in the font rendering configuration. Somehow some element of this was set differently in the Workstation live environment vs. the traditional installer environment. On the Workstation live, obviously, GNOME is running. On 'traditional install' images - boot.iso (network install) and the Server DVD image - GNOME is not running; you just have anaconda running on metacity on X.</p> <p>I happen to know that GNOME - via gnome-settings-daemon - has some of its own default settings for font rendering. It defaults to 'medium' hinting, 'greyscale' antialiasing, and a resolution of exactly 96dpi (ignoring hidpi cases for now). 
You can see the hinting values in <a href="https://git.gnome.org/browse/gnome-settings-daemon/tree/data/org.gnome.settings-daemon.plugins.xsettings.gschema.xml.in.in">this gsettings schema</a>, and the DPI comes in via <a href="https://git.gnome.org/browse/gnome-settings-daemon/tree/plugins/xsettings/gsd-xsettings-manager.c#n79">gsd-xsettings-manager.c</a>.</p> <p>I also knew that there's a file <code>/etc/X11/Xresources</code>, which can be used to set the same configuration at the X level in some way. I checked a clean Fedora install and found that file looks like this (minus comments):</p> <pre><code>Xft.dpi: 96
Xft.hintstyle: hintmedium
Xft.hinting: true
</code></pre> <p>so...it appeared to be identical to the GNOME defaults. At this point I was a bit stuck, and was very grateful to <a href="https://blogs.gnome.org/mclasen/">Matthias Clasen</a> for helping with the next bit.</p> <p>He told me how to find out what settings a given anaconda instance was actually <em>using</em>. You can press ctrl+shift+d (or ctrl+shift+i) in any running GTK+ 3 app to launch <a href="https://blogs.gnome.org/mclasen/2014/05/15/introducing-gtkinspector/">GtkInspector</a>, a rather handy debugging tool. By clicking the GtkSettings 'object' and then going to 'Properties', we can see the relevant settings (down at the bottom): gtk-xft-antialias, gtk-xft-dpi, gtk-xft-hinting and gtk-xft-hintstyle. In my test case - a KVM at 1024x768 resolution - I found that in the Workstation live case, gtk-xft-dpi was 98304 and gtk-xft-hintstyle was 'medium', while in the boot.iso case (and also on KDE live images), gtk-xft-dpi was 98401 and gtk-xft-hintstyle was 'full'.</p> <p>So then we had to figure out why. gtk-xft-dpi has a multiplication factor of 1024. So on the Workstation image the actual DPI being used was 98304 / 1024 = 96 (precisely), while on boot.iso it was 98401 / 1024 = 96.094726563. 
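</p> <p>To make the units concrete, here's a quick (purely illustrative) check of that factor-of-1024 conversion - this is just the arithmetic, not GTK+'s actual code:</p>

```python
# gtk-xft-dpi stores the DPI multiplied by 1024, so converting back
# to a plain DPI figure is a simple division.
def gtk_xft_dpi_to_dpi(value):
    return value / 1024.0

print(gtk_xft_dpi_to_dpi(98304))  # 96.0 - the Workstation live value
print(gtk_xft_dpi_to_dpi(98401))  # 96.0947265625 - the boot.iso value
```

<p>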
I confirmed that changing the gtk-xft-dpi value to 98304 and gtk-xft-hintstyle to medium made the affected glyphs the same in boot.iso as in Workstation. But why was boot.iso getting different values from Workstation? And where did the rather odd 98401 gtk-xft-dpi value come from?</p> <p>Matthias again pointed out what actually <em>happens</em> with the <code>/etc/X11/Xresources</code> values, which I never knew. They get loaded via a command <code>xrdb</code>, which affects the contents of something called an X resource database. This is reasonably well documented on <a href="https://en.wikipedia.org/wiki/X_resources">Wikipedia</a> and in the xrdb man page. Basically, the resource database is used as a store of configuration data. The <code>/etc/X11/Xresources</code> values (usually...) get loaded into the RESOURCE_MANAGER property of the root window of screen 0 at X startup, and act as defaults, more or less.</p> <p>There are various paths on which the <code>/etc/X11/Xresources</code> values actually get loaded into the RESOURCE_MANAGER. There's a file <code>/etc/X11/xinitrc-common</code> which does it: it calls <code>xrdb -nocpp -merge /etc/X11/Xresources</code> if that file exists. (It also loads in ~/.Xresources afterwards if that exists, allowing a user to override the system-wide config). <code>xinitrc-common</code> is sourced by <code>/etc/X11/xinit/xinitrc</code> and <code>/etc/X11/xinit/Xsession</code>, meaning any way you start X that runs through one of those files will get the <code>/etc/X11/Xresources</code> values applied. The <code>startx</code> command runs through <code>xinit</code> which runs through <code>xinitrc</code>, so that's covered. 
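</p> <p>For illustration, here's a toy re-implementation (mine, not xrdb's) of the 'merge simple key/value resources' idea - the real <code>xrdb</code> also does preprocessing, handles includes, and much more:</p>

```python
def parse_xresources(text):
    # Minimal parser for 'key: value' Xresources-style lines.
    # '!' starts a comment line in real Xresources files.
    resources = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("!"):
            continue
        key, _, value = line.partition(":")
        resources[key.strip()] = value.strip()
    return resources

defaults = parse_xresources(
    "! system-wide font rendering defaults\n"
    "Xft.dpi: 96\n"
    "Xft.hintstyle: hintmedium\n"
    "Xft.hinting: true\n"
)
print(defaults["Xft.dpi"])  # 96
```

<p>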
Display managers either run through <code>/etc/X11/xinit/Xsession</code> or implement something rather similar themselves; either way, they all ultimately load in the <code>Xresources</code> settings via <code>xrdb</code>.</p> <p>(Sidebar: how gnome-settings-daemon actually implements the configuration that's specified in dconf is to load it into the X resource database and <a href="http://standards.freedesktop.org/xsettings-spec/xsettings-spec-0.5.html">xsettings</a>. So what happens when you boot to GNOME is that gdm loads in the values from <code>/etc/X11/Xresources</code>, then gnome-settings-daemon overwrites them with whatever it gets from dconf).</p> <p>In the anaconda traditional installer image case, though, nothing does actually load the settings. X gets started by anaconda itself, which does it simply by calling <code>/usr/libexec/Xorg</code> directly; it doesn't do anything to load in the <code>Xresources</code> settings, either then or at any other time.</p> <p>GTK+ is written to use the values from xsettings or the X resource database, if there are any. But what does it do when nothing has written the settings?</p> <p>Well, as Matthias and I finally found out, it has its own internal fallbacks. As of Monday, it would default to hintstyle 'full', and for the DPI, it would calculate it from the resolution and the display size reported by X. I found that, on the test VM, with a resolution of 1024x768, xdpyinfo reported the display size as 270x203mm. The calculation uses the vertical values, so we get: 768 / (203 / 25.4) * 1024 = 98400.851231527, which rounds up to 98401 - the gtk-xft-dpi value we found that boot.iso was using! 
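</p> <p>That calculation is easy to reproduce (again, a sketch of the arithmetic, not GTK+'s actual code):</p>

```python
MM_PER_INCH = 25.4

def fallback_gtk_xft_dpi(height_px, height_mm):
    # DPI derived from the X-reported physical height, scaled by 1024
    # and rounded to the nearest integer for the gtk-xft-dpi setting.
    return round(height_px / (height_mm / MM_PER_INCH) * 1024)

print(fallback_gtk_xft_dpi(768, 203))    # 98401 - what boot.iso got
print(fallback_gtk_xft_dpi(768, 203.2))  # 98304 - i.e. exactly 96dpi, had X
                                         # reported the un-truncated size
```

<p>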
Eureka.</p> <p>So finally we understood what was actually going on: nothing was loading in the <code>Xresources</code> or dconf settings (DPI exactly 96, medium hinting) so GTK+ was falling back on its internal defaults, and getting a DPI of just over 96 and 'full' hinting, and that was causing the discrepancy in rendering.</p> <p>The next thing I did for extra credit was figure out what's going on with KDE, which had the same rendering as traditional installer images. You'd think the KDE path would load in the <code>Xresources</code> values, though. And - although at first I guessed the problem was that it was failing to - it does. SDDM, the login manager we use for KDE, certainly does load them in via <code>xrdb</code>. No, the problem turned out to be a bit subtler.</p> <p>KDE does something broadly similar to GNOME. It has its own font rendering configuration options stored in its own settings database, and a little widget called 'krdb' which kicks in on session start and loads them into the X resource database. Qt then (I believe) respects the xrdb and xsettings values like GTK+ does.</p> <p>Like GNOME, KDE overrides any defaults loaded in from <code>Xresources</code> (or anywhere else) with its own settings. In the KDE control center you can pick hinting and antialiasing settings and a DPI value, and they all get written out via xrdb. However, KDE has one important difference in how it handles the DPI setting. As I said above, GNOME wires it to 96 by default. KDE, instead, tries to entirely unset the <code>Xft.dpi</code> setting. What KDE expects to happen in that case is that apps/toolkits will calculate a 'correct' DPI for the screen and use that. 
As we saw earlier, this is indeed what GTK+ attempts to do, and Qt attempts much the same.</p> <p>As another sidebar, while figuring all this out, Kevin Kofler - who was equally helpful on the KDE side of things as Matthias was on the GNOME side - and I worked out that the krdb code is actually quite badly broken here: instead of just unsetting <code>Xft.dpi</code>, <a href="https://bugs.kde.org/show_bug.cgi?id=350039">it throws away <em>all</em> existing X resource settings</a>. But even if that bug gets fixed, the fact that it throws out any DPI setting that came from <code>/etc/X11/Xresources</code> is actually intentional. So in the KDE path, the values from <code>/etc/X11/Xresources</code> do get loaded (by SDDM), but krdb then throws them away. And so we fall into the same path we did on boot.iso, for anaconda: GTK+ has no X resource or xsetting value for the DPI, so it calculates it, and gets the same 98401 result.</p> <p>For today's extra credit, I started looking at the DPI calculation aspect of this whole shebang from the other end: where was the 270x203mm 'display size', that results in GTK+ calculating a DPI of just over 96, coming from in the first place?</p> <p>The (rather funny, to me at least...) answer is that it comes from X!</p> <p>When you start X, it sets an initial physical screen size - and it doesn't do it the way you might, perhaps, expect. It doesn't look at the results of an EDID monitor probe, or try in some way to combine the results of multiple probes for multiple monitors. 
Nope, it simply specifies a physical size <em>with the intent of producing a particular DPI value</em>!</p> <p><a href="http://cgit.freedesktop.org/xorg/xserver/tree/hw/xfree86/modes/xf86RandR12.c#n787">Here's the code</a>.</p> <p>As you can see there are basically three paths: one if <code>monitorResolution</code> is set, one if <code>output-&gt;conf_monitor-&gt;mon_width</code> and <code>output-&gt;conf_monitor-&gt;mon_height</code> are set, and a fallback if neither of those is true. <code>monitorResolution</code> is set if the Xorg binary was called with the <code>-dpi</code> parameter. <code>mon_width</code> and <code>mon_height</code> are set if some X config file or another contains the <code>DisplaySize</code> parameter.</p> <p>If you specified a dpi on the command line, X will figure out whatever 'physical size' would result in that DPI for the resolution that's in use, and claim that the screen is that size. If you specify a size with the <code>DisplaySize</code> parameter, X just uses that. And if neither of those is true, X will pick a physical size based on a built-in default DPI, which is...96.</p> <p>One important wrinkle is that X uses ints for this calculation. What that means is it uses whole numbers; the ultimate result (the display size in millimetres) gets truncated to a whole number.</p> <p>You know those bits where GTK+ and Qt try to calculate a 'correct' DPI based on the monitor size? Well (I bet you can guess where this is going), so far as I can tell, the value they use is <em>the value X made up to match some arbitrary DPI in the first place</em>!</p> <p>Let's have a look at the full series of magnificent inanity that goes on when you boot up a clean Fedora KDE and run an app:</p> <ul> <li>X is called without <code>-dpi</code> and there is no <code>DisplaySize</code> config option, so X calculates a completely artificial 'physical size' for the screen in order to get a result of <code>DEFAULT_DPI</code>, which is 96. 
Let's say we get 1024x768 resolution - X figures out that a display of size 270x203mm has a resolution of 96dpi, and sets that as the 'physical size' of the display. (Well...almost. Note that bit about rounding above.)</li> <li>sddm loads in the <code>/etc/X11/Xresources</code> DPI value - precisely 96 - to the X resource database: <code>Xft.dpi: 96</code></li> <li>krdb says 'no, we don't want some silly hardcoded default DPI! We want to be clever and calculate the 'correct' DPI for the display!' and throws away the <code>Xft.dpi</code> value from the X resource database. (Because it's broken it throws away all the other settings too, but for now let's pretend it just does what it actually <em>means</em> to do.)</li> <li>When the app starts up, Qt (or GTK+, they do the same calculation) takes the resolution and the physical display size reported by X and calculates the DPI...which, not surprisingly, comes out to about 96. Only not exactly 96, because of the rounding thing: X rounded off its calculation, so in most cases, the DPI the toolkit arrives at won't be <em>quite</em> exactly 96.</li> </ul> <p>Seems a mite silly, doesn't it? :) There is something of a 'heated debate' about whether it's better to try and calculate a DPI based on screen size or simply assume 96 (with some kind of integer scaling for hidpi displays) - the GNOME code which assumes 96dpi has a long email justification from ajax as a comment block - but regardless of which side of that debate you pick, it's fairly absurd that the toolkits have this code to perform the 'correct' DPI calculation based on a screen size which is almost always going to be a lie intended to produce a specific DPI result!</p> <p>Someone <a href="http://lists.x.org/archives/xorg-devel/2011-October/026227.html">proposed a patch</a> for X.org back in 2011 which would've used the EDID-reported display size (so KDE would actually get the result it wants), but it seems to have died on the vine. 
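</p> <p>The round trip described in that list is short enough to sketch in a few lines of Python - my paraphrase of the logic, to be clear, not the actual X or toolkit code:</p>

```python
MM_PER_INCH = 25.4
DEFAULT_DPI = 96

def fake_physical_size_mm(width_px, height_px, dpi=DEFAULT_DPI):
    # X invents whatever millimetre size would give the target DPI,
    # with integer maths truncating the result.
    return (int(width_px * MM_PER_INCH / dpi),
            int(height_px * MM_PER_INCH / dpi))

w_mm, h_mm = fake_physical_size_mm(1024, 768)
print(w_mm, h_mm)  # 270 203 - the size xdpyinfo reported on the test VM

# ...and the toolkit then 'calculates' the DPI back from that made-up size:
dpi = 768 / (h_mm / MM_PER_INCH)
print(round(dpi, 6))  # 96.094581 - not quite the 96 that X started from
```

<p>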
There's some more colorful, er, 'context' in <a href="https://bugs.freedesktop.org/show_bug.cgi?id=41115">this bug report</a>.</p> <p>I haven't figured out absolutely every wrinkle of this stuff, yet. There's clearly some code in at least Qt and possibly in X which does...stuff...to at least some 'display size' values when the monitor mode is changed (via RandR) in any way. One rather funny thing is that if you go into the KDE Control Center and change the display resolution, the physical display size reported by <code>xdpyinfo</code> changes - but it doesn't change to match the correct physical size of the screen, it looks like something else does the same 'report a physical screen size that would be 96dpi at this resolution' calculation and changes the physical size to <em>that</em>. It doesn't quite seem to do the calculation identically to how X does it, though, or something - it comes up with slightly different answers.</p> <p>So on my 1920x1080 laptop, X initially sets the physical size to 507x285mm (which gives a DPI of 96.252631579). If I use KDE control center to change the resolution to 1400x1050, the physical size reported by xdpyinfo changes to 369x277mm (which gives a DPI of 96.281588448). If I then change it back to 1920x1080, the physical size changes to 506x284mm - <em>nearly</em> the same value X initially set it to, but not quite. That gives a DPI of 96.591549296. Why is that? I'm damned if I know. None of those sizes, of course, resembles the actual size of the laptop's display, which is 294x165mm (as reported by xrandr) - it's a 13" laptop.</p> <p>None of this happens, though, if I change the resolution via xrandr instead of the KDE control panel: the reported physical size stays at 507x285mm throughout. 
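</p> <p>Those slightly-off DPI figures are just the same vertical-size arithmetic again; a quick sanity check against the numbers above (my throwaway calculation, nothing more):</p>

```python
MM_PER_INCH = 25.4

def dpi_from_height(height_px, height_mm):
    # The toolkits use the vertical resolution and physical height.
    return height_px / (height_mm / MM_PER_INCH)

print(round(dpi_from_height(1080, 285), 9))  # 96.252631579 - X's initial 507x285mm
print(round(dpi_from_height(1050, 277), 9))  # 96.281588448 - after switching to 1400x1050
print(round(dpi_from_height(1080, 284), 9))  # 96.591549296 - after switching back
print(round(dpi_from_height(1080, 165), 1))  # 166.3 - the panel's *real* 165mm height
```

<p>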
(So if you change resolutions in KDE with xrandr then launch a new app, it'll be using some different DPI value from your other apps, most likely).</p> <p>So what the hell do I do to make anaconda look consistent in all cases, for openQA purposes? Well, mclasen <a href="https://git.gnome.org/browse/gtk+/commit/?id=bdf0820c501437a2150d8ff0d5340246e713f73f">made a change to GTK+</a> which may be controversial, but would certainly solve my problem: he changed its 'fallback' behaviour. If no hinting style is set in the X resource database or xsettings, now it'll default to 'medium' instead of 'full' - that's fairly uncontroversial. But if no specific DPI is configured in xrdb or xsettings, instead of doing the calculation, it now simply assumes 96dpi (pace hidpi scaling). In theory, this could be a significant change - if things all worked as KDE thought they did, there'd now be a significant discrepancy between Qt and GTK+ apps in KDE on higher resolution displays. But in point of fact, since the 'calculated' DPI is going to wind up being ~96dpi in most cases anyway, it's not going to make such a huge difference...</p> <p>Without the GTK+ change (or if it gets reverted), we'd probably have to make anaconda load the <code>Xresources</code> settings in when it starts X, so it gets the precise 96dpi value rather than the calculation which ends up giving not-quite-exactly 96dpi. For the KDE case, we'd probably have to futz up the tests a bit to somehow force the 96dpi value into xrdb or xsettings before running anaconda. We're not yet running the openQA tests on the KDE live image, though, so it's not urgent yet.</p>