AdamW's Debugging Adventures: Bootloaders and machine IDs
Adam Williamson, 2022-01-11
https://www.happyassassin.net/posts/2022/01/11/adamws-debugging-adventures-bootloaders-and-machine-ids/

<p>Hi folks! Well, it looks like I forgot to blog for...<em>checks watch</em>....<em>checks calendar</em>...a year. Wow. Whoops. Sorry about that. I'm still here, though! We released, uh, lots of Fedoras since the last time I wrote about that. Fedora 35 is the current one. It's, uh, <a href="https://fedoraproject.org/wiki/Common_F35_bugs">mostly great!</a> Go <a href="https://getfedora.org/">get a copy</a>, why don't you?</p> <p>And while that's downloading, you can get comfy and listen to another of Crazy Uncle Adam's Debugging Adventures. In this episode, we'll be uncomfortably reminded just how much of the code that causes your system to actually boot at all consists of fragile shell script with no tests, so this'll be fun!</p> <p>Last month, booting a system installed from Rawhide live images stopped working properly. You could boot the live image fine, run the installation fine, but on rebooting, the system would fail to boot with an error: <code>dracut: FATAL: Don't know how to handle 'root=live:CDLABEL=Fedora-WS-Live-rawh-20211229-n-1'</code>. openQA caught this, and so did one of our QA community members - Ahed Almeleh - who <a href="https://bugzilla.redhat.com/show_bug.cgi?id=2036199">filed a bug</a>. After the end-of-year holidays, I got to figuring out what was going wrong.</p> <p>As usual, I got a bit of a head start from pre-existing knowledge. I happen to know that error message is referring to kernel arguments that are set in the bootloader configuration of <em>the live image itself</em>.
dracut is the tool that handles an early phase of boot where we boot into a temporary environment that's loaded entirely into system memory, set up the <em>real</em> system environment, and boot that. This early environment is contained in the <code>initrd</code> files you can find alongside the kernel on most Linux distributions; that's what they're for. Part of dracut's job is to be run when a kernel is installed to <em>produce</em> this environment, and then other parts of dracut are included <em>in the environment itself</em> to handle initializing things, finding the real system root, preparing it, and then switching to it. The initrd environments on Fedora live images are built to contain a dracut 'module' (called <code>90dmsquash-live</code>) that knows to interpret <code>root=live:CDLABEL=Fedora-WS-Live-rawh-20211229-n-1</code> as meaning 'go look for a live system root on the filesystem with that label and boot that'. Installed systems don't contain that module, because, well, they don't need to know how to do that, and you wouldn't really ever want an installed system to <em>try</em> and do that.</p> <p>So the short version here is: the installed system has the wrong kernel argument for telling dracut where to find the system root. It <em>should</em> look something like <code>root=/dev/mapper/fedora-root</code> (where we're pointing to a system root on an LVM volume that dracut will set up and then switch to). So the obvious next question is: why? Why is our installed system getting this wrong argument? It seemed likely that it 'leaked' from the live system to the installed system somehow, but I needed to figure out how.</p> <p>From here, I had kinda two possible ways to investigate. The easiest and fastest would probably be if I happened to know exactly how we deal with setting up bootloader configuration when running a live install. Then I'd likely have been able to start poking the most obvious places right away and figure out the problem. 
But, as it happens, I didn't at the time remember exactly how that works. I just remembered that I wind up having to figure it out every few years, and it's complicated and scary, so I tend to forget again right afterwards. I kinda knew where to start looking, but didn't really want to have to work it all out again from scratch if I could avoid it.</p> <p>So I went with the other possibility, which is always: figure out when it broke, and figure out what changed between the last time it worked and the first time it broke. This usually makes life much easier because now you know one of the things on that list is the problem. The shorter and simpler the list, the easier life gets.</p> <p>I looked at the openQA result history and found that the bug was introduced somewhere between 20211215.n.0 and 20211229.n.1 (unfortunately kind of a wide range). The good news is that only a few packages could plausibly be involved in this bug; the most likely are dracut itself, grub2 (the bootloader), grubby (a Red Hat / Fedora-specific grub configuration...thing), anaconda (the Fedora installer, which obviously does some bootloader configuration stuff), the kernel itself, and systemd (which is of course involved in the boot process itself, but also - perhaps less obviously - is where <a href="https://github.com/systemd/systemd/tree/main/src/kernel-install"><code>kernel-install</code></a>, the script used (on Fedora and many other distros) to 'install' kernels, lives; this was another handy thing I happened to know already, but really - it's always a safe bet to include systemd on the list of potential suspects for anything boot-related).</p> <p>Looking at what changed between 2021-12-15 and 2021-12-29, we could rule out grub2 and grubby, as they didn't change. There were some kernel builds, but nothing in the scriptlets changed in any way that could be related.
dracut got a build with <a href="https://src.fedoraproject.org/rpms/dracut/c/76eb28fc2ef2f9e43b5ea66d0b9c96f83e124d4b?branch=rawhide">one change</a>, but again it seemed clearly unrelated. So I was down to anaconda and systemd as suspects. On an initial quick check during the vacation, I thought anaconda had not changed, and took a brief look at systemd, but didn't see anything immediately obvious.</p> <p>When I came back to look at it more thoroughly, I realized anaconda did get a new version (36.12) on 2021-12-15, so that initially interested me quite a lot. I spent some time going through the changes in that version, and there were some that really could have been related - it changed how running things during install inside the installed system worked (which is definitely how we do some bootloader setup stuff during install), and it had interesting commit messages like "Remove the dracut_args attribute" and "Remove upd-kernel". So I spent an afternoon fairly sure it'd turn out to be one of those, reviewed all those changes, mocked up locally how they worked, examined the logs of the actual image composes, and...concluded that none of those seemed to be the problem at all. The installer seemed to still be doing things the same as it always had. There weren't any tell-tale missing or failing bootloader config steps. However, this time wasn't entirely wasted: I was reminded of exactly what anaconda does to configure the bootloader when installing from a live image.</p> <p>When we install from a live image, we don't do what the 'traditional' installer does and install a bunch of RPM packages using dnf. The live image does not contain any RPM packages. The live image itself was <em>built</em> by installing a bunch of RPM packages, but it is the <em>result</em> of that process. Instead, we essentially set up the filesystems on the drive(s) we're installing to and then just dump the contents of the live image filesystem <em>itself</em> onto them. 
Then we run a few tweaks to adjust anything that needs adjusting for this now being an installed system, not a live one. One of the things we do is re-generate the <code>initrd</code> file for the installed system, and then re-generate the bootloader configuration. This involves running <code>kernel-install</code> (which places the kernel and initrd files onto the boot partition, and writes some bootloader configuration 'snippet' files), and then running <code>grub2-mkconfig</code>. The main thing <code>grub2-mkconfig</code> does is produce the main bootloader configuration file, but that's not really why we run it at this point. There's <a href="https://github.com/rhinstaller/anaconda/blob/anaconda-36.12-1/pyanaconda/modules/storage/bootloader/utils.py#L244">a very interesting comment</a> explaining why in the anaconda source:</p> <div class="code"><pre class="code literal-block"><span class="c1"># Update the bootloader configuration to make sure that the BLS</span> <span class="c1"># entries will have the correct kernel cmdline and not the value</span> <span class="c1"># taken from /proc/cmdline, that is used to boot the live image.</span> </pre></div> <p>Which is exactly what we were dealing with here. The "BLS entries" we're talking about here are the things I called 'snippet' files above, they live in <code>/boot/loader/entries</code> on Fedora systems. These are where the kernel arguments used at boot are specified, and indeed, that's where the problematic <code>root=live:...</code> arguments were specified in broken installs - in the "BLS entries" in <code>/boot/loader/entries</code>. So it seemed like, somehow, this mechanism just wasn't working right any more - we were expecting this run of <code>grub2-mkconfig</code> in the installed system root after live installation to correct those snippets, but it wasn't. 
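<p>For concreteness, a snippet on a broken install would have looked something like this (a hypothetical reconstruction - the title, version and paths here are made up for illustration - but the <code>options</code> line shows the leaked live-image argument):</p>

```
title Fedora Linux Rawhide
version 5.16.0-0.rc7.x86_64
linux /vmlinuz-5.16.0-0.rc7.x86_64
initrd /initramfs-5.16.0-0.rc7.x86_64.img
options root=live:CDLABEL=Fedora-WS-Live-rawh-20211229-n-1 rd.live.image
```

<p>where the <code>options</code> line should instead have pointed at the installed system root, something like <code>root=/dev/mapper/fedora-root</code>.</p>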
However, as I said, I couldn't establish that any change to anaconda was causing this.</p> <p>So I eventually shelved anaconda at least temporarily and looked at systemd. And it turned out that systemd had changed too. During the time period in question, we'd gone from systemd 250~rc1 to 250~rc3. (If you check the build history of systemd the dates don't seem to match up - by 2021-12-29 the 250-2 build had happened already, but in fact the 250-1 and 250-2 builds were untagged for causing a different problem, so the 2021-12-29 compose had 250~rc3). By now I was obviously pretty focused on <code>kernel-install</code> as the most likely related part of systemd, so I went to my systemd git checkout and ran:</p> <div class="code"><pre class="code literal-block">git log v250-rc1..v250-rc3 src/kernel-install/ </pre></div> <p>which shows all the commits under <code>src/kernel-install</code> between 250-rc1 and 250-rc3. And that gave me another juicy-looking, yet thankfully short, set of commits:</p> <p><a href="https://github.com/systemd/systemd/commit/641e2124de6047e6010cd2925ea22fba29b25309">641e2124de6047e6010cd2925ea22fba29b25309</a> kernel-install: replace 00-entry-directory with K_I_LAYOUT in k-i <a href="https://github.com/systemd/systemd/commit/357376d0bb525b064f468e0e2af8193b4b90d257">357376d0bb525b064f468e0e2af8193b4b90d257</a> kernel-install: Introduce KERNEL_INSTALL_MACHINE_ID in /etc/machine-info <a href="https://github.com/systemd/systemd/commit/447a822f8ee47b63a4cae00423c4d407bfa5e516">447a822f8ee47b63a4cae00423c4d407bfa5e516</a> kernel-install: Remove "Default" from list of suffixes checked</p> <p>So I went and looked at all of those. And again...I got it wrong at first! This is I guess a good lesson from this Debugging Adventure: you don't always get the right answer at first, but that's okay. You just have to keep plugging, and always keep open the possibility that you're wrong and you should try something else. 
I spent time thinking the cause was likely a change in anaconda before focusing on systemd, then focused on the wrong systemd commit first. I got interested in 641e212 first, and had even written out a whole Bugzilla comment blaming it before I realized it wasn't the culprit (fortunately, I didn't post it!) I thought the problem was that the new check for <code>$BOOT_ROOT/$MACHINE_ID</code> would not behave as it should on Fedora and cause the install scripts to do something different from what they should - generating incorrect snippet files, or putting them in the wrong place, or something.</p> <p>Fortunately, I decided to test this before declaring it was the problem, and found out that it wasn't. I did this using something that turned out to be invaluable in figuring out the real problem.</p> <p>You may have noticed by this point - harking back to our intro - that this critical <code>kernel-install</code> script, key to making sure your system boots, is...a shell script. That calls other shell scripts. You know what else is a big pile of shell scripts? <code>dracut</code>. You know, that critical component that both builds and controls the initial boot environment. Big pile of shell script. The install script - the <code>dracut</code> command itself - is shell. All the dracut <code>modules</code> - the bits that do most of the work - are shell. There's a bit of C in the source tree (I'm not entirely sure what that bit does), but most of it's shell.</p> <p>Critical stuff like this being written in shell makes me shiver, because shell is very easy to get wrong, and quite hard to test properly (and in fact neither dracut nor kernel-install has good tests). But one good thing about it is that it's quite easy to <em>debug</em>, thanks to the magic of <code>sh -x</code>. 
If you run some shell script via <code>sh -x</code> (whether that's really <code>sh</code>, or <code>bash</code> or some other alternative pretending to be <code>sh</code>), it will run as normal but print out most of the logic (variable assignments, tests, and so on) that happen along the way. So on a VM where I'd run a broken install, I could do <code>chroot /mnt/sysimage</code> (to get into the root of the installed system), find the exact <code>kernel-install</code> command that anaconda ran from one of the logs in <code>/var/log/anaconda</code> (I forget which), and re-run it through <code>sh -x</code>. This showed me all the logic going on through the run of <code>kernel-install</code> itself and all the scripts it sources under <code>/usr/lib/kernel/install.d</code>. Using this, I could confirm that the check I suspected had the result I suspected - I could see that it was deciding that <code>layout="other"</code>, not <code>layout="bls"</code>, <a href="https://github.com/systemd/systemd/blob/v250-rc3/src/kernel-install/kernel-install#L134">here</a>. But I could <em>also</em> figure out a way to override that decision, confirm that it worked, and find that it didn't solve the problem: the config snippets were still wrong, and running <code>grub2-mkconfig</code> didn't fix them. In fact the config snippets got wronger - it turned out that we do <em>want</em> <code>kernel-install</code> to pick 'other' rather than 'bls' here, because Fedora doesn't really implement BLS according to the upstream specs, so if we let kernel-install think we do, the config snippets we get are wrong.</p> <p>So now I'd been wrong twice! But each time, I learned a bit more that eventually helped me be right. After I decided that commit wasn't the cause after all, I finally spotted the problem. I figured this out by continuing with the <code>sh -x</code> debugging, and noticing an inconsistency. 
By this point I'd thought to find out what bit of <code>grub2-mkconfig</code> should be doing the work of correcting the key bit of configuration here. It's in a <a href="https://src.fedoraproject.org/rpms/grub2/blob/b25606806096b51e9d920f50b9bb47773641644d/f/0062-Add-BLS-support-to-grub-mkconfig.patch#_230">Fedora-only downstream patch to one of the scriptlets in <code>/etc/grub.d</code></a>. It replaces the <code>options=</code> line in any snippet files it finds with what it reckons the kernel arguments "should be". So I got curious about what exactly was going wrong there. I tweaked <code>grub2-mkconfig</code> slightly to run those scriptlets using <code>sh -x</code> by changing these lines in <code>grub2-mkconfig</code>:</p> <div class="code"><pre class="code literal-block"><span class="n">echo</span> <span class="s">"### BEGIN $i ###"</span> <span class="s">"$i"</span> <span class="n">echo</span> <span class="s">"### END $i ###"</span> </pre></div> <p>to read:</p> <div class="code"><pre class="code literal-block"><span class="n">echo</span> <span class="s">"### BEGIN $i ###"</span> <span class="n">sh</span> <span class="o">-</span><span class="n">x</span> <span class="s">"$i"</span> <span class="n">echo</span> <span class="s">"### END $i ###"</span> </pre></div> <p>Now I could re-run <code>grub2-mkconfig</code> and look at what was going on behind the scenes of the scriptlet, and I noticed that it wasn't <em>finding</em> any snippet files at all. But why not?</p> <p><a href="https://src.fedoraproject.org/rpms/grub2/blob/b25606806096b51e9d920f50b9bb47773641644d/f/0062-Add-BLS-support-to-grub-mkconfig.patch#_211">The code that looks for the snippet files</a> reads the file <code>/etc/machine-id</code> as a string, then looks for files in <code>/boot/loader/entries</code> whose names start with that string (and end in <code>.conf</code>). 
So I went and looked at my sample system and...found that the files in <code>/boot/loader/entries</code> did <em>not</em> start with the string in <code>/etc/machine-id</code>. The files in <code>/boot/loader/entries</code> started with <code>a69bd9379d6445668e7df3ddbda62f86</code>, but the ID in <code>/etc/machine-id</code> was <code>b8d80a4c887c40199c4ea1a8f02aa9b4</code>. This is why everything was broken: because those IDs didn't match, <code>grub2-mkconfig</code> couldn't find the files to correct, so the argument was wrong, so the system didn't boot.</p> <p>Now I knew what was going wrong and I only had two systemd commits left on the list, it was pretty easy to see the problem. It was in <a href="https://github.com/systemd/systemd/commit/357376d0bb525b064f468e0e2af8193b4b90d257">357376d</a>. That changes how <code>kernel-install</code> names these snippet files when creating them. It names them by finding a machine ID to use as a prefix. Previously, it used whatever string was in <code>/etc/machine-id</code>; if that file didn't exist or was empty, it just used the string "Default". After that commit, it also looks for a value specified in <code>/etc/machine-info</code>. If there's a <code>/etc/machine-id</code> but not <code>/etc/machine-info</code> when you run <code>kernel-install</code>, it uses the value from <code>/etc/machine-id</code> and writes it to <code>/etc/machine-info</code>.</p> <p>When I checked those files, it turned out that on the live image, the ID in both <code>/etc/machine-id</code> and <code>/etc/machine-info</code> was <code>a69bd9379d6445668e7df3ddbda62f86</code> - the problematic ID on the installed system. When we generate the live image itself, <code>kernel-install</code> uses the value from <code>/etc/machine-id</code> and writes it to <code>/etc/machine-info</code>, and both files wind up in the live filesystem. 
But <em>on the installed system</em>, the ID in <code>/etc/machine-info</code> was that same value, but the ID in <code>/etc/machine-id</code> was different (as we saw above).</p> <p>Remember how I mentioned above that when doing a live install, we essentially dump the live filesystem itself onto the installed system? Well, one of the 'tweaks' we make when doing this is to <a href="https://github.com/rhinstaller/anaconda/blob/612bfee7d37e16d7bfab329b44182a71d04a3344/pyanaconda/modules/storage/bootloader/utils.py#L41-L48">re-generate <code>/etc/machine-id</code></a>, because that ID is meant to be unique to each installed system - we don't want every system installed from a Fedora live image to have the same machine ID as the live image itself. However, as this <code>/etc/machine-info</code> file is new, we don't strip it from or re-generate it in the installed system, we just install it. The installed system has a <code>/etc/machine-info</code> with the same ID as the live image's machine ID, but a new, different ID in <code>/etc/machine-id</code>. And this (finally) was the ultimate source of the problem! When we run them on the installed system, the new version of <code>kernel-install</code> writes config snippet files using the ID from <code>/etc/machine-info</code>. But Fedora's patched <code>grub2-mkconfig</code> scriptlet doesn't know about that mechanism at all (since it's brand new), and expects the snippet files to contain the ID from <code>/etc/machine-id</code>.</p> <p>There are various ways you could potentially solve this, but after consulting with systemd upstream, the one we chose is to <a href="https://github.com/rhinstaller/anaconda/commit/fe652add9733943a3476128a83b24b9a8c63b335">have anaconda exclude <code>/etc/machine-info</code></a> when doing a live install. 
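<p>Put together, the whole interaction can be sketched like this (hypothetical shell, not systemd's or anaconda's actual code; the file layout is simulated in a temporary directory):</p>

```shell
# Sketch of kernel-install's ID selection after commit 357376d:
# prefer KERNEL_INSTALL_MACHINE_ID from machine-info, fall back to
# machine-id, and write the fallback into machine-info for later runs
pick_machine_id() {
    etc_dir="$1"    # stands in for /etc
    id=""
    if [ -r "$etc_dir/machine-info" ]; then
        id=$(sed -n 's/^KERNEL_INSTALL_MACHINE_ID=//p' "$etc_dir/machine-info")
    fi
    if [ -z "$id" ] && [ -s "$etc_dir/machine-id" ]; then
        id=$(cat "$etc_dir/machine-id")
        echo "KERNEL_INSTALL_MACHINE_ID=$id" >> "$etc_dir/machine-info"
    fi
    echo "$id"
}

# Reproduce the bug: machine-info still carries the live image's ID,
# while machine-id was freshly regenerated by the installer
tmp=$(mktemp -d)
echo "b8d80a4c887c40199c4ea1a8f02aa9b4" > "$tmp/machine-id"
echo "KERNEL_INSTALL_MACHINE_ID=a69bd9379d6445668e7df3ddbda62f86" > "$tmp/machine-info"
pick_machine_id "$tmp"   # the stale live-image ID wins

# The fix: don't carry machine-info over from the live image at all
rm "$tmp/machine-info"
pick_machine_id "$tmp"   # now the fresh ID is used (and written back)
```

<p>After the second call, <code>machine-info</code> exists again but now agrees with <code>machine-id</code>, so the snippet files get names <code>grub2-mkconfig</code> can find.</p>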
The changes to systemd here aren't wrong - it does potentially make sense that <code>/etc/machine-id</code> and <code>/etc/machine-info</code> could both exist and specify different IDs in some cases. But for the case of Fedora live installs, it doesn't make sense. The sanest result is for those IDs to match and both be the 'fresh' machine ID that's generated at the end of the install process. By just not including <code>/etc/machine-info</code> on the installed system, we achieve this result, because now when <code>kernel-install</code> runs at the end of the install process, it reads the ID from <code>/etc/machine-id</code> and writes it to <code>/etc/machine-info</code>, and both IDs are the same, <code>grub2-mkconfig</code> finds the snippet files and edits them correctly, the installed system boots, and I can move along to the next debugging odyssey...</p>
openQA caught this, and so did one of our QA community members - Ahed Almeleh - who <a href="https://bugzilla.redhat.com/show_bug.cgi?id=2036199">filed a bug</a>. After the end-of-year holidays, I got to figuring out what was going wrong.</p> <p>As usual, I got a bit of a head start from pre-existing knowledge. I happen to know that error message is referring to kernel arguments that are set in the bootloader configuration of <em>the live image itself</em>. dracut is the tool that handles an early phase of boot where we boot into a temporary environment that's loaded entirely into system memory, set up the <em>real</em> system environment, and boot that. This early environment is contained in the <code>initrd</code> files you can find alongside the kernel on most Linux distributions; that's what they're for. Part of dracut's job is to be run when a kernel is installed to <em>produce</em> this environment, and then other parts of dracut are included <em>in the environment itself</em> to handle initializing things, finding the real system root, preparing it, and then switching to it. The initrd environments on Fedora live images are built to contain a dracut 'module' (called <code>90dmsquash-live</code>) that knows to interpret <code>root=live:CDLABEL=Fedora-WS-Live-rawh-20211229-n-1</code> as meaning 'go look for a live system root on the filesystem with that label and boot that'. Installed systems don't contain that module, because, well, they don't need to know how to do that, and you wouldn't really ever want an installed system to <em>try</em> and do that.</p> <p>So the short version here is: the installed system has the wrong kernel argument for telling dracut where to find the system root. It <em>should</em> look something like <code>root=/dev/mapper/fedora-root</code> (where we're pointing to a system root on an LVM volume that dracut will set up and then switch to). So the obvious next question is: why? 
Why is our installed system getting this wrong argument? It seemed likely that it 'leaked' from the live system to the installed system somehow, but I needed to figure out how.</p> <p>From here, I had kinda two possible ways to investigate. The easiest and fastest would probably be if I happened to know exactly how we deal with setting up bootloader configuration when running a live install. Then I'd likely have been able to start poking the most obvious places right away and figure out the problem. But, as it happens, I didn't at the time remember exactly how that works. I just remembered that I wind up having to figure it out every few years, and it's complicated and scary, so I tend to forget again right afterwards. I kinda knew where to start looking, but didn't really want to have to work it all out again from scratch if I could avoid it.</p> <p>So I went with the other possibility, which is always: figure out when it broke, and figure out what changed between the last time it worked and the first time it broke. This usually makes life much easier because now you know one of the things on that list is the problem. The shorter and simpler the list, the easier life gets.</p> <p>I looked at the openQA result history and found that the bug was introduced somewhere between 20211215.n.0 and 20211229.n.1 (unfortunately kind of a wide range). 
The good news is that only a few packages could plausibly be involved in this bug; the most likely are dracut itself, grub2 (the bootloader), grubby (a Red Hat / Fedora-specific grub configuration...thing), anaconda (the Fedora installer, which obviously does some bootloader configuration stuff), the kernel itself, and systemd (which is of course involved in the boot process itself, but also - perhaps less obviously - is where a script called <a href="https://github.com/systemd/systemd/tree/main/src/kernel-install"><code>kernel-install</code></a> that is used (on Fedora and many other distros) to 'install' kernels lives (this was another handy thing I happened to know already, but really - it's always a safe bet to include systemd on the list of potential suspects for anything boot-related).</p> <p>Looking at what changed between 2021-12-15 and 2021-12-29, we could let out grub2 and grubby as they didn't change. There were some kernel builds, but nothing in the scriptlets changed in any way that could be related. dracut got a build with <a href="https://src.fedoraproject.org/rpms/dracut/c/76eb28fc2ef2f9e43b5ea66d0b9c96f83e124d4b?branch=rawhide">one change</a>, but again it seemed clearly unrelated. So I was down to anaconda and systemd as suspects. On an initial quick check during the vacation, I thought anaconda had not changed, and took a brief look at systemd, but didn't see anything immediately obvious.</p> <p>When I came back to look at it more thoroughly, I realized anaconda did get a new version (36.12) on 2021-12-15, so that initially interested me quite a lot. I spent some time going through the changes in that version, and there were some that really could have been related - it changed how running things during install inside the installed system worked (which is definitely how we do some bootloader setup stuff during install), and it had interesting commit messages like "Remove the dracut_args attribute" and "Remove upd-kernel". 
So I spent an afternoon fairly sure it'd turn out to be one of those, reviewed all those changes, mocked up locally how they worked, examined the logs of the actual image composes, and...concluded that none of those seemed to be the problem at all. The installer seemed to still be doing things the same as it always had. There weren't any tell-tale missing or failing bootloader config steps. However, this time wasn't entirely wasted: I was reminded of exactly what anaconda does to configure the bootloader when installing from a live image.</p> <p>When we install from a live image, we don't do what the 'traditional' installer does and install a bunch of RPM packages using dnf. The live image does not contain any RPM packages. The live image itself was <em>built</em> by installing a bunch of RPM packages, but it is the <em>result</em> of that process. Instead, we essentially set up the filesystems on the drive(s) we're installing to and then just dump the contents of the live image filesystem <em>itself</em> onto them. Then we run a few tweaks to adjust anything that needs adjusting for this now being an installed system, not a live one. One of the things we do is re-generate the <code>initrd</code> file for the installed system, and then re-generate the bootloader configuration. This involves running <code>kernel-install</code> (which places the kernel and initrd files onto the boot partition, and writes some bootloader configuration 'snippet' files), and then running <code>grub2-mkconfig</code>. The main thing <code>grub2-mkconfig</code> does is produce the main bootloader configuration file, but that's not really why we run it at this point. 
There's <a href="https://github.com/rhinstaller/anaconda/blob/anaconda-36.12-1/pyanaconda/modules/storage/bootloader/utils.py#L244">a very interesting comment</a> explaining why in the anaconda source:</p> <div class="code"><pre class="code literal-block"><span class="c1"># Update the bootloader configuration to make sure that the BLS</span> <span class="c1"># entries will have the correct kernel cmdline and not the value</span> <span class="c1"># taken from /proc/cmdline, that is used to boot the live image.</span> </pre></div> <p>Which is exactly what we were dealing with here. The "BLS entries" we're talking about here are the things I called 'snippet' files above, they live in <code>/boot/loader/entries</code> on Fedora systems. These are where the kernel arguments used at boot are specified, and indeed, that's where the problematic <code>root=live:...</code> arguments were specified in broken installs - in the "BLS entries" in <code>/boot/loader/entries</code>. So it seemed like, somehow, this mechanism just wasn't working right any more - we were expecting this run of <code>grub2-mkconfig</code> in the installed system root after live installation to correct those snippets, but it wasn't. However, as I said, I couldn't establish that any change to anaconda was causing this.</p> <p>So I eventually shelved anaconda at least temporarily and looked at systemd. And it turned out that systemd had changed too. During the time period in question, we'd gone from systemd 250~rc1 to 250~rc3. (If you check the build history of systemd the dates don't seem to match up - by 2021-12-29 the 250-2 build had happened already, but in fact the 250-1 and 250-2 builds were untagged for causing a different problem, so the 2021-12-29 compose had 250~rc3). 
By now I was obviously pretty focused on <code>kernel-install</code> as the most likely related part of systemd, so I went to my systemd git checkout and ran:</p> <div class="code"><pre class="code literal-block">git log v250-rc1..v250-rc3 src/kernel-install/ </pre></div> <p>which shows all the commits under <code>src/kernel-install</code> between 250-rc1 and 250-rc3. And that gave me another juicy-looking, yet thankfully short, set of commits:</p> <p><a href="https://github.com/systemd/systemd/commit/641e2124de6047e6010cd2925ea22fba29b25309">641e2124de6047e6010cd2925ea22fba29b25309</a> kernel-install: replace 00-entry-directory with K_I_LAYOUT in k-i <a href="https://github.com/systemd/systemd/commit/357376d0bb525b064f468e0e2af8193b4b90d257">357376d0bb525b064f468e0e2af8193b4b90d257</a> kernel-install: Introduce KERNEL_INSTALL_MACHINE_ID in /etc/machine-info <a href="https://github.com/systemd/systemd/commit/447a822f8ee47b63a4cae00423c4d407bfa5e516">447a822f8ee47b63a4cae00423c4d407bfa5e516</a> kernel-install: Remove "Default" from list of suffixes checked</p> <p>So I went and looked at all of those. And again...I got it wrong at first! This is I guess a good lesson from this Debugging Adventure: you don't always get the right answer at first, but that's okay. You just have to keep plugging, and always keep open the possibility that you're wrong and you should try something else. I spent time thinking the cause was likely a change in anaconda before focusing on systemd, then focused on the wrong systemd commit first. I got interested in 641e212 first, and had even written out a whole Bugzilla comment blaming it before I realized it wasn't the culprit (fortunately, I didn't post it!) 
I thought the problem was that the new check for <code>$BOOT_ROOT/$MACHINE_ID</code> would not behave as it should on Fedora and cause the install scripts to do something different from what they should - generating incorrect snippet files, or putting them in the wrong place, or something.</p> <p>Fortunately, I decided to test this before declaring it was the problem, and found out that it wasn't. I did this using something that turned out to be invaluable in figuring out the real problem.</p> <p>You may have noticed by this point - harking back to our intro - that this critical <code>kernel-install</code> script, key to making sure your system boots, is...a shell script. That calls other shell scripts. You know what else is a big pile of shell scripts? <code>dracut</code>. You know, that critical component that both builds and controls the initial boot environment. Big pile of shell script. The install script - the <code>dracut</code> command itself - is shell. All the dracut <code>modules</code> - the bits that do most of the work - are shell. There's a bit of C in the source tree (I'm not entirely sure what that bit does), but most of it's shell.</p> <p>Critical stuff like this being written in shell makes me shiver, because shell is very easy to get wrong, and quite hard to test properly (and in fact neither dracut nor kernel-install has good tests). But one good thing about it is that it's quite easy to <em>debug</em>, thanks to the magic of <code>sh -x</code>. If you run some shell script via <code>sh -x</code> (whether that's really <code>sh</code>, or <code>bash</code> or some other alternative pretending to be <code>sh</code>), it will run as normal but print out most of the logic (variable assignments, tests, and so on) that happen along the way. 
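<p>As a quick standalone illustration (my own toy example, not taken from the systems above), here's what <code>sh -x</code> tracing looks like, driven from Python so it's self-contained:</p>

```python
import subprocess

# A tiny throwaway script; 'sh -x' echoes each command it executes
# (prefixed with '+') to stderr, including variable assignments and
# the results of tests, while the script's normal output is unchanged.
script = """
layout=bls
if [ "$layout" = "bls" ]; then
    echo "using BLS layout"
fi
"""

result = subprocess.run(["sh", "-x", "-c", script],
                        capture_output=True, text=True)
print(result.stderr)   # the trace: '+ layout=bls', '+ [ bls = bls ]', ...
print(result.stdout)   # the script's own output: 'using BLS layout'
```

<p>Running a real script like <code>kernel-install</code> this way gives you the same kind of trace, just much longer.</p>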
So on a VM where I'd run a broken install, I could do <code>chroot /mnt/sysimage</code> (to get into the root of the installed system), find the exact <code>kernel-install</code> command that anaconda ran from one of the logs in <code>/var/log/anaconda</code> (I forget which), and re-run it through <code>sh -x</code>. This showed me all the logic going on through the run of <code>kernel-install</code> itself and all the scripts it sources under <code>/usr/lib/kernel/install.d</code>. Using this, I could confirm that the check I suspected had the result I suspected - I could see that it was deciding that <code>layout="other"</code>, not <code>layout="bls"</code>, <a href="https://github.com/systemd/systemd/blob/v250-rc3/src/kernel-install/kernel-install#L134">here</a>. But I could <em>also</em> figure out a way to override that decision, confirm that it worked, and find that it didn't solve the problem: the config snippets were still wrong, and running <code>grub2-mkconfig</code> didn't fix them. In fact the config snippets got wronger - it turned out that we do <em>want</em> <code>kernel-install</code> to pick 'other' rather than 'bls' here, because Fedora doesn't really implement BLS according to the upstream specs, so if we let kernel-install think we do, the config snippets we get are wrong.</p> <p>So now I'd been wrong twice! But each time, I learned a bit more that eventually helped me be right. After I decided that commit wasn't the cause after all, I finally spotted the problem. I figured this out by continuing with the <code>sh -x</code> debugging, and noticing an inconsistency. By this point I'd thought to find out what bit of <code>grub2-mkconfig</code> should be doing the work of correcting the key bit of configuration here. 
It's in a <a href="https://src.fedoraproject.org/rpms/grub2/blob/b25606806096b51e9d920f50b9bb47773641644d/f/0062-Add-BLS-support-to-grub-mkconfig.patch#_230">Fedora-only downstream patch to one of the scriptlets in <code>/etc/grub.d</code></a>. It replaces the <code>options=</code> line in any snippet files it finds with what it reckons the kernel arguments "should be". So I got curious about what exactly was going wrong there. I tweaked <code>grub2-mkconfig</code> slightly to run those scriptlets using <code>sh -x</code> by changing these lines in <code>grub2-mkconfig</code>:</p> <div class="code"><pre class="code literal-block"><span class="n">echo</span> <span class="s">"### BEGIN $i ###"</span> <span class="s">"$i"</span> <span class="n">echo</span> <span class="s">"### END $i ###"</span> </pre></div> <p>to read:</p> <div class="code"><pre class="code literal-block"><span class="n">echo</span> <span class="s">"### BEGIN $i ###"</span> <span class="n">sh</span> <span class="o">-</span><span class="n">x</span> <span class="s">"$i"</span> <span class="n">echo</span> <span class="s">"### END $i ###"</span> </pre></div> <p>Now I could re-run <code>grub2-mkconfig</code> and look at what was going on behind the scenes of the scriptlet, and I noticed that it wasn't <em>finding</em> any snippet files at all. But why not?</p> <p><a href="https://src.fedoraproject.org/rpms/grub2/blob/b25606806096b51e9d920f50b9bb47773641644d/f/0062-Add-BLS-support-to-grub-mkconfig.patch#_211">The code that looks for the snippet files</a> reads the file <code>/etc/machine-id</code> as a string, then looks for files in <code>/boot/loader/entries</code> whose names start with that string (and end in <code>.conf</code>). So I went and looked at my sample system and...found that the files in <code>/boot/loader/entries</code> did <em>not</em> start with the string in <code>/etc/machine-id</code>. 
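<p>The matching logic is simple enough to re-render in a few lines of Python (purely illustrative - the real thing is shell inside the grub2 scriptlet), and doing so also demonstrates the failure mode: mismatched IDs mean no snippet files are found at all:</p>

```python
import tempfile
from pathlib import Path

def find_bls_entries(machine_id_file, entries_dir):
    # Read the machine ID, then keep only snippet files whose names
    # start with that ID and end in '.conf' -- the same matching rule
    # the patched scriptlet applies (re-rendered in Python here).
    machine_id = Path(machine_id_file).read_text().strip()
    return sorted(p.name for p in Path(entries_dir).glob("*.conf")
                  if p.name.startswith(machine_id))

# Reproduce the broken state: the snippet's prefix doesn't match
# the ID in machine-id, so nothing is found.
tmp = Path(tempfile.mkdtemp())
(tmp / "machine-id").write_text("b8d80a4c887c40199c4ea1a8f02aa9b4\n")
entries = tmp / "entries"
entries.mkdir()
(entries / "a69bd9379d6445668e7df3ddbda62f86-5.16.0.conf").write_text("")
print(find_bls_entries(tmp / "machine-id", entries))  # -> []

# With a matching prefix, the snippet would be found and corrected:
(entries / "b8d80a4c887c40199c4ea1a8f02aa9b4-5.16.0.conf").write_text("")
print(find_bls_entries(tmp / "machine-id", entries))
# -> ['b8d80a4c887c40199c4ea1a8f02aa9b4-5.16.0.conf']
```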
The files in <code>/boot/loader/entries</code> started with <code>a69bd9379d6445668e7df3ddbda62f86</code>, but the ID in <code>/etc/machine-id</code> was <code>b8d80a4c887c40199c4ea1a8f02aa9b4</code>. This is why everything was broken: because those IDs didn't match, <code>grub2-mkconfig</code> couldn't find the files to correct, so the argument was wrong, so the system didn't boot.</p> <p>Now I knew what was going wrong and I only had two systemd commits left on the list, it was pretty easy to see the problem. It was in <a href="https://github.com/systemd/systemd/commit/357376d0bb525b064f468e0e2af8193b4b90d257">357376d</a>. That changes how <code>kernel-install</code> names these snippet files when creating them. It names them by finding a machine ID to use as a prefix. Previously, it used whatever string was in <code>/etc/machine-id</code>; if that file didn't exist or was empty, it just used the string "Default". After that commit, it also looks for a value specified in <code>/etc/machine-info</code>. If there's a <code>/etc/machine-id</code> but not <code>/etc/machine-info</code> when you run <code>kernel-install</code>, it uses the value from <code>/etc/machine-id</code> and writes it to <code>/etc/machine-info</code>.</p> <p>When I checked those files, it turned out that on the live image, the ID in both <code>/etc/machine-id</code> and <code>/etc/machine-info</code> was <code>a69bd9379d6445668e7df3ddbda62f86</code> - the problematic ID on the installed system. When we generate the live image itself, <code>kernel-install</code> uses the value from <code>/etc/machine-id</code> and writes it to <code>/etc/machine-info</code>, and both files wind up in the live filesystem. 
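<p>To make that lookup order concrete, here's a sketch in Python (mine, and hypothetical - the real logic lives in the <code>kernel-install</code> shell scripts) of the ID selection just described:</p>

```python
import tempfile
from pathlib import Path

def pick_entry_id(etc):
    """Illustrative sketch (not the real kernel-install code) of the
    ID-selection order described above: prefer KERNEL_INSTALL_MACHINE_ID
    from machine-info, fall back to machine-id (persisting the choice),
    else use the string 'Default'."""
    etc = Path(etc)
    info = etc / "machine-info"
    if info.exists():
        for line in info.read_text().splitlines():
            if line.startswith("KERNEL_INSTALL_MACHINE_ID="):
                return line.split("=", 1)[1].strip()
    mid = etc / "machine-id"
    if mid.exists() and mid.read_text().strip():
        machine_id = mid.read_text().strip()
        # the new behaviour: persist the chosen ID into machine-info
        info.write_text(f"KERNEL_INSTALL_MACHINE_ID={machine_id}\n")
        return machine_id
    return "Default"

# A stale machine-info (carried over from the live image) wins over a
# freshly generated machine-id -- exactly the mismatch seen above:
tmp = Path(tempfile.mkdtemp())
(tmp / "machine-info").write_text(
    "KERNEL_INSTALL_MACHINE_ID=a69bd9379d6445668e7df3ddbda62f86\n")
(tmp / "machine-id").write_text("b8d80a4c887c40199c4ea1a8f02aa9b4\n")
print(pick_entry_id(tmp))  # the stale live-image ID wins

# Without a machine-info, the machine-id value is used and persisted:
tmp2 = Path(tempfile.mkdtemp())
(tmp2 / "machine-id").write_text("b8d80a4c887c40199c4ea1a8f02aa9b4\n")
print(pick_entry_id(tmp2))
print((tmp2 / "machine-info").exists())  # -> True
```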
But <em>on the installed system</em>, the ID in <code>/etc/machine-info</code> was that same value, but the ID in <code>/etc/machine-id</code> was different (as we saw above).</p> <p>Remember how I mentioned above that when doing a live install, we essentially dump the live filesystem itself onto the installed system? Well, one of the 'tweaks' we make when doing this is to <a href="https://github.com/rhinstaller/anaconda/blob/612bfee7d37e16d7bfab329b44182a71d04a3344/pyanaconda/modules/storage/bootloader/utils.py#L41-L48">re-generate <code>/etc/machine-id</code></a>, because that ID is meant to be unique to each installed system - we don't want every system installed from a Fedora live image to have the same machine ID as the live image itself. However, as this <code>/etc/machine-info</code> file is new, we don't strip it from or re-generate it in the installed system, we just install it. The installed system has a <code>/etc/machine-info</code> with the same ID as the live image's machine ID, but a new, different ID in <code>/etc/machine-id</code>. And this (finally) was the ultimate source of the problem! When we run them on the installed system, the new version of <code>kernel-install</code> writes config snippet files using the ID from <code>/etc/machine-info</code>. But Fedora's patched <code>grub2-mkconfig</code> scriptlet doesn't know about that mechanism at all (since it's brand new), and expects the snippet files to contain the ID from <code>/etc/machine-id</code>.</p> <p>There are various ways you could potentially solve this, but after consulting with systemd upstream, the one we chose is to <a href="https://github.com/rhinstaller/anaconda/commit/fe652add9733943a3476128a83b24b9a8c63b335">have anaconda exclude <code>/etc/machine-info</code></a> when doing a live install. 
The changes to systemd here aren't wrong - it does potentially make sense that <code>/etc/machine-id</code> and <code>/etc/machine-info</code> could both exist and specify different IDs in some cases. But for the case of Fedora live installs, it doesn't make sense. The sanest result is for those IDs to match and both be the 'fresh' machine ID that's generated at the end of the install process. By just not including <code>/etc/machine-info</code> on the installed system, we achieve this result, because now when <code>kernel-install</code> runs at the end of the install process, it reads the ID from <code>/etc/machine-id</code> and writes it to <code>/etc/machine-info</code>, and both IDs are the same, <code>grub2-mkconfig</code> finds the snippet files and edits them correctly, the installed system boots, and I can move along to the next debugging odyssey...</p> AdamW's Debugging Adventures: "dnf is locked by another application" https://www.happyassassin.net/posts/2019/10/18/adamws-debugging-adventures-dnf-is-locked-by-another-application/ 2019-10-18T13:45:04Z 2019-10-18T13:45:04Z Adam Williamson <p></p><p>Gather round the fire, kids, it's time for another Debugging Adventure! These are posts where I write up the process of diagnosing the root cause of a bug, where it turned out to be interesting (to me, anyway...)</p> <p>This case - <a href="https://bugzilla.redhat.com/show_bug.cgi?id=1750575">Bugzilla #1750575</a> - involved dnfdragora, the package management tool used on Fedora Xfce, which is a release-blocking environment for the ARM architecture. It was a pretty easy bug to reproduce: any time you updated a package, the update would work, but then dnfdragora would show an error "DNF is locked by another process. 
dnfdragora will exit.", and immediately exit.</p> <p>The bug sat around on the blocker list for a while; Daniel Mach (a DNF developer) <a href="https://bugzilla.redhat.com/show_bug.cgi?id=1750575#c10">looked into it a bit</a> but didn't have time to figure it out all the way. So I got tired of waiting for someone else to do it, and decided to work it out myself.</p> <p></p><h2>Where's the error coming from?</h2> <p>As a starting point, I had a nice error message - so the obvious thing to do is figure out where that message comes from. The text appears in a couple of places in dnfdragora - in an <a href="https://github.com/manatools/dnfdragora/blob/master/dnfdragora/basedragora.py#L131">exception handler</a> and also in a <a href="https://github.com/manatools/dnfdragora/blob/master/dnfdragora/basedragora.py#L90">method for setting up a connection to dnfdaemon</a>. So, if we didn't already know (I happened to) this would be the point at which we'd realize that dnfdragora is a frontend app to a backend - <a href="https://github.com/manatools/dnfdaemon">dnfdaemon</a> - which does the heavy lifting.</p> <p>So, to figure out in more detail how we were getting to one of these two points, I hacked both the points where that error is logged. Both of them read <code>logger.critical(errmsg)</code>. I changed this to <code>logger.exception(errmsg)</code>. <code>logger.exception</code> is a very handy feature of Python's <a href="https://docs.python.org/3/library/logging.html">logging module</a> which logs whatever message you specify, <strong>plus a traceback to the current state</strong>, just like the traceback you get if the app actually crashes. 
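<p>A minimal standalone demonstration of the difference (my own toy example, not dnfdragora's code):</p>

```python
import io
import logging

# Capture log output in a buffer so we can inspect it.
logger = logging.getLogger("demo")
buf = io.StringIO()
logger.addHandler(logging.StreamHandler(buf))

try:
    {}["missing"]          # any exception will do
except KeyError:
    # logger.critical(msg) would record only the message itself;
    # logger.exception(msg) records the message *plus* the full
    # traceback of the exception currently being handled.
    logger.exception("dnfdaemon client error")

print(buf.getvalue())
```

<p>The captured output contains both the message and a <code>Traceback (most recent call last):</code> block, just as if the app had crashed at that point.</p>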
So by doing that, the dnfdragora log (it logs to a file <code>dnfdragora.log</code> in the directory you run it from) gave us a traceback showing how we got to the error:</p> <pre><code>2019-10-14 17:53:29,436 [dnfdragora](ERROR) dnfdaemon client error: g-io-error-quark: GDBus.Error:org.baseurl.DnfSystem.LockedError: dnf is locked by another application (36)
Traceback (most recent call last):
  File "/usr/bin/dnfdragora", line 85, in &lt;module&gt;
    main_gui.handleevent()
  File "/usr/lib/python3.7/site-packages/dnfdragora/ui.py", line 1273, in handleevent
    if not self._searchPackages(filter, True) :
  File "/usr/lib/python3.7/site-packages/dnfdragora/ui.py", line 949, in _searchPackages
    packages = self.backend.search(fields, strings, self.match_all, self.newest_only, tags )
  File "/usr/lib/python3.7/site-packages/dnfdragora/misc.py", line 135, in newFunc
    rc = func(*args, **kwargs)
  File "/usr/lib/python3.7/site-packages/dnfdragora/dnf_backend.py", line 464, in search
    newest_only, tags)
  File "/usr/lib/python3.7/site-packages/dnfdaemon/client/__init__.py", line 508, in Search
    fields, keys, attrs, match_all, newest_only, tags))
  File "/usr/lib/python3.7/site-packages/dnfdaemon/client/__init__.py", line 293, in _run_dbus_async
    result = self._get_result(data)
  File "/usr/lib/python3.7/site-packages/dnfdaemon/client/__init__.py", line 277, in _get_result
    self._handle_dbus_error(user_data['error'])
  File "/usr/lib/python3.7/site-packages/dnfdaemon/client/__init__.py", line 250, in _handle_dbus_error
    raise DaemonError(str(err))
dnfdaemon.client.DaemonError: g-io-error-quark: GDBus.Error:org.baseurl.DnfSystem.LockedError: dnf is locked by another application (36) </code></pre> <p>So, this tells us quite a bit of stuff. We know we're crashing in some sort of 'search' operation, and dbus seems to be involved. We can also see a bit more of the architecture here. 
Note how we have <code>dnfdragora/dnf_backend.py</code> and <code>dnfdaemon/client/__init__.py</code> included in the trace, even though we're only in the dnfdragora executable here (dnfdaemon is a separate process). Looking at that and then looking at those files a bit, it's quite easy to see that the dnfdaemon Python library provides a sort of <a href="https://github.com/manatools/dnfdaemon/blob/master/python/dnfdaemon/client/__init__.py#L199">framework for a client class called (oddly enough) <code>DnfDaemonBase</code></a> which the actual client - dnfdragora in our case - is expected to subclass and flesh out. dnfdragora does this in a <a href="https://github.com/manatools/dnfdragora/blob/master/dnfdragora/dnf_backend.py#L144">class called <code>DnfRootBackend</code></a>, which inherits from both <code>dnfdragora.backend.Backend</code> (a sort of abstraction layer for dnfdragora to have multiple of these backends, though at present it only actually has this one) and <code>dnfdaemon.client.Client</code>, which is just a <a href="https://github.com/manatools/dnfdaemon/blob/master/python/dnfdaemon/client/__init__.py#L544">small extension to <code>DnfDaemonBase</code></a> that adds some dbus signal handling.</p> <p>So now we know more about the design we're dealing with, and we can also see that we're trying to do some sort of search operation which looks like it works by the client class communicating with the actual <code>dnfdaemon</code> server process via dbus, only we're hitting some kind of error in that process, and interpreting it as 'dnf is locked by another application'. If we dig a little deeper, we can figure out a bit more. 
We have to read through all of the backtrace frames and examine the functions, but ultimately we can figure out that <code>DnfRootBackend.Search()</code> is wrapped by <code>dnfdragora.misc.ExceptionHandler</code>, which handles <code>dnfdaemon.client.DaemonError</code> exceptions - like the one that's ultimately getting raised here! - by calling the base class's own <code>exception_handler()</code> on them...and for us, that's <a href="https://github.com/manatools/dnfdragora/blob/master/dnfdragora/basedragora.py#L121">BaseDragora.exception_handler</a>, one of the two places we found earlier that ultimately produces this "DNF is locked by another process. dnfdragora will exit" text. We also now have <em>two</em> indications (the dbus error itself, and the code in <code>exception_handler()</code>) that the error we're dealing with is "LockedError".</p> <h2>A misleading error...</h2> <p>At this point, I went looking for the text <code>LockedError</code>, and found it in two files in dnfdaemon that are kinda variants on each other - <a href="https://github.com/manatools/dnfdaemon/blob/master/daemon/dnfdaemon-session.py">daemon/dnfdaemon-session.py</a> and <a href="https://github.com/manatools/dnfdaemon/blob/master/daemon/dnfdaemon-system.py">daemon/dnfdaemon-system.py</a>. I didn't actually know offhand which of the two is used in our case, but it doesn't really matter, because the codepath to <code>LockedError</code> is the same in both. There's a <a href="https://github.com/manatools/dnfdaemon/blob/master/daemon/dnfdaemon-system.py#L700">function called <code>check_lock()</code></a> which checks that <code>self._lock == sender</code>, and raises <code>LockedError</code> if it doesn't. That sure looks like where we're at.</p> <p>So at this point I did a bit of poking around into how <code>self._lock</code> gets set and unset in the daemon. It turns out to be pretty simple. 
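<p>Boiled down to a toy model (mine, not the actual dnfdaemon code, and with made-up sender IDs), the check we just found behaves like this:</p>

```python
class LockedError(Exception):
    pass

class DaemonSketch:
    """Toy stand-in for the daemon's lock state (illustrative only)."""
    def __init__(self):
        self._lock = None            # no client holds the lock initially

    def check_lock(self, sender):
        # Passes only if *this* sender holds the lock.
        if self._lock != sender:
            raise LockedError("dnf is locked by another application")
        return True

daemon = DaemonSketch()
daemon._lock = ":1.100"              # pretend client :1.100 took the lock
print(daemon.check_lock(":1.100"))   # -> True: the caller holds the lock

for state in (":1.200", None):       # someone else holds it / nobody does
    daemon._lock = state
    try:
        daemon.check_lock(":1.100")
    except LockedError as err:
        print(err)                   # the same error in both cases
```

<p>Note that the same exception comes out whether the lock is held by a different client or not held at all - which turns out to matter.</p>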
The daemon is basically implemented as a class with a bunch of methods that are wrapped by <code>@dbus.service.method</code>, which makes them accessible as <em>DBus</em> methods. (One of them is <code>Search()</code>, and we can see that the <a href="https://github.com/manatools/dnfdaemon/blob/master/python/dnfdaemon/client/__init__.py#L493">client class's own <code>Search()</code> basically just calls that</a>). There are also methods called <a href="https://github.com/manatools/dnfdaemon/blob/master/daemon/dnfdaemon-system.py#L105"><code>Lock()</code></a> and <a href="https://github.com/manatools/dnfdaemon/blob/master/daemon/dnfdaemon-system.py#L350"><code>Unlock()</code></a>, which - not surprisingly - set and release this lock, by setting the daemon class' <code>self._lock</code> to be either an identifier for the DBus client or <code>None</code>, respectively. And when the daemon is first initialized, the value is set to <code>None</code>.</p> <p>At this point, I realized that the error we're dealing with here is actually a lie in two important ways:</p> <ol> <li>The message claims that the problem is the lock being held "by another application", but that's not what <code>check_lock()</code> checks, really. It passes <strong>only if the caller holds the lock</strong>. It <em>does</em> fail if the lock is held "by another application", but it <em>also</em> fails if <strong>the lock is not held at all</strong>. Given all the code we looked at so far, we can't actually trust the message's assertion that something else is holding the lock. It is also possible that the lock is not held at all.</li> <li>The message suggests that the lock in question is a lock on dnf itself. I know dnf/libdnf <em>do</em> have locking, so up to now I'd been assuming we were actually dealing with the locking in dnf itself. But at this point I realized we weren't. The dnfdaemon lock code we just looked at doesn't actually call or wrap dnf's own locking code in any way. 
This lock we're dealing with is entirely internal to dnfdaemon. It's really a lock on <em>the dnfdaemon instance itself</em>.</li> </ol> <p>So, at this point I started thinking of the error as being "dnfdaemon is either locked by another DBus client, or not locked at all".</p> <h2>So what's going on with this lock anyway?</h2> <p>My next step, now I understood the locking process we're dealing with, was to stick some logging into it. I added log lines to the <code>Lock()</code> and <code>Unlock()</code> methods, and I also made <code>check_lock()</code> log what <code>sender</code> and <code>self._lock</code> were set to before returning. Because the daemon's <code>__init__</code> sets <code>self._lock</code> to <code>None</code>, I also added a log line there that just records that we're in it. That got me some more useful information:</p> <pre><code>2019-10-14 18:53:03.397784 XXX In DnfDaemon.init now!
2019-10-14 18:53:03.402804 XXX LOCK: sender is :1.1835
2019-10-14 18:53:03.407524 XXX CHECK LOCK: sender is :1.1835
XXX CHECK LOCK: self._lock is :1.1835
2019-10-14 18:53:07.556499 XXX CHECK LOCK: sender is :1.1835
XXX CHECK LOCK: self._lock is :1.1835
[...snip a bunch more calls to check_lock where the sender is the same...]
2019-10-14 18:53:13.560828 XXX CHECK LOCK: sender is :1.1835
XXX CHECK LOCK: self._lock is :1.1835
2019-10-14 18:53:13.560941 XXX CHECK LOCK: sender is :1.1835
XXX CHECK LOCK: self._lock is :1.1835
2019-10-14 18:53:16.513900 XXX In DnfDaemon.init now! 
2019-10-14 18:53:16.516724 XXX CHECK LOCK: sender is :1.1835
XXX CHECK LOCK: self._lock is None </code></pre> <p>so we could see that when we started dnfdragora, dnfdaemon started up and dnfdragora locked it almost immediately, then throughout the whole process of reproducing the bug - run dnfdragora, search for a package to be updated, mark it for updating, run the transaction, wait for the error - there were several instances of DBus method calls where everything worked fine (we see <code>check_lock()</code> being called and finding <code>sender</code> and <code>self._lock</code> set to the same value, the identifier for dnfdragora), but then suddenly we see the daemon's <code>__init__</code> running again for some reason, not being locked, and then a <code>check_lock()</code> call that fails because the daemon instance's <code>self._lock</code> is <code>None</code>.</p> <p>After a couple of minutes, I guessed what was going on here, and the <a href="https://bugzilla.redhat.com/show_bug.cgi?id=1750575#c23">daemon's service logs confirmed it</a> - dnfdaemon was crashing and automatically restarting. The first attempt to invoke a DBus method after the crash and restart fails, because dnfdragora has not locked this <em>new</em> instance of the daemon (it has no idea it just crashed and restarted), so <code>check_lock()</code> fails. So as soon as a DBus method invocation is attempted after the dnfdaemon crash, dnfdragora errors out with the confusing "dnf is locked by another process" error.</p> <p>The crash was already mentioned in the bug report, but until now the exact interaction between the crash and the error had not been worked out - we just knew the daemon crashed and the app errored out, but we didn't really know what order those things happened in or how they related to each other.</p> <h2>OK then...why is dnfdaemon crashing?</h2> <p>So, the question now became: why is dnfdaemon crashing? 
Well, the backtrace we had didn't tell us a lot; really it only told us that something was going wrong in libdbus, which we could also tell from the dnfdaemon service log:</p> <pre><code>Oct 14 18:53:15 adam.happyassassin.net dnfdaemon-system[226042]: dbus[226042]: arguments to dbus_connection_unref() were incorrect, assertion "connection-&gt;generation == _dbus_current_generation" failed in file ../../dbus/dbus-connection.c line 2823.
Oct 14 18:53:15 adam.happyassassin.net dnfdaemon-system[226042]: This is normally a bug in some application using the D-Bus library.
Oct 14 18:53:15 adam.happyassassin.net dnfdaemon-system[226042]: D-Bus not built with -rdynamic so unable to print a backtrace </code></pre> <p>That last line looked like a cue, so of course, off I went to figure out how to build DBus with <code>-rdynamic</code>. A bit of Googling told me - thanks "the3dfxdude"! - that <a href="https://www.linuxquestions.org/questions/slackware-14/compile-dbus-with-backtrace-support-4175459330/#post4938346">the trick is to compile with --enable-asserts</a>. So I did that and reproduced the bug again, and <a href="https://bugzilla.redhat.com/show_bug.cgi?id=1750575#c25">got a bit of a better backtrace</a>. It's a long one, but by picking through it carefully I could spot - in frame #17 - the actual point at which the problem happened, which was <a href="https://github.com/manatools/dnfdaemon/blob/master/python/dnfdaemon/server/__init__.py#L595">in <code>dnfdaemon.server.DnfDaemonBase.run_transaction()</code></a>. (Note, this is a <em>different</em> <code>DnfDaemonBase</code> class from <code>dnfdaemon.client.DnfDaemonBase</code>; I don't know why they have the same name, that's just confusing).</p> <p>So, the daemon's crashing on this <code>self.TransactionEvent('end-run', NONE)</code> call. 
I poked into what that does a bit, and found a design here that kinda mirrors what happens on the client side: this <code>DnfDaemonBase</code>, like the other one, is a framework for a full daemon implementation, and it's subclassed by <a href="https://github.com/manatools/dnfdaemon/blob/master/daemon/dnfdaemon-system.py#L59">a <code>DnfDaemon</code> class here</a>. That class defines <a href="https://github.com/manatools/dnfdaemon/blob/master/daemon/dnfdaemon-system.py#L639">a <code>TransactionEvent</code> method</a> that emits a DBus signal. So...we're crashing when trying to emit a dbus signal. That all adds up with the backtrace going through libdbus and all. But, <em>why</em> are we crashing?</p> <p>At this point I tried to make a small reproducer (which basically just set up a <code>DnfDaemon</code> instance and called <code>self.TransactionEvent</code> in the same way, I think) but that didn't work - I didn't know why at the time, but figured it out later. Continuing to trace it out through code wouldn't be that easy because now we're in DBus, which I know from experience is a big complex codebase that's not that easy to just reason your way through. We had the actual DBus error to work from too - "arguments to dbus_connection_unref() were incorrect, assertion "connection-&gt;generation == _dbus_current_generation" failed" - and I looked into that a bit, but there were no really helpful leads there (I got a bit more understanding about what the error means exactly, but it didn't help me understand <em>why it was happening</em> at all).</p> <h2>Time for the old standby...</h2> <p>So, being a bit stuck, I fell back on the most trusty standby: trial and error! Well, also a <em>bit</em> of logic. It did occur to me that the dbus broker is itself a long-running daemon that other things can talk to. So I started just wondering if something was interfering with dnfdaemon's connection with the dbus broker, somehow. 
This was in my head as I poked around at stuff - wherever I wound up looking, I was looking for stuff that involved dbus.</p> <p>But to figure out where to look, I just started hacking up dnfdaemon a bit. Now this first part is probably pure intuition, but that <code>self._reset_base()</code> call on the line right before the <code>self.TransactionEvent</code> call that crashes bugged me. It's probably just long experience telling me that anything with "reset" or "refresh" in the name is bad news. :P So I thought, hey, what happens if we move it?</p> <p>I stuck some logging lines into this <code>run_transaction</code> so I knew where we got to before we crashed - this is a great dumb trick, btw, just stick lines like <code>self.logger('XXX HERE 1')</code>, <code>self.logger('XXX HERE 2')</code> etc. between every significant line in the thing you're debugging, and grep the logs for "XXX" - and moved the <code>self._reset_base()</code> call down under the <code>self.TransactionEvent</code> call...and found that when I did that, we got further, the <code>self.TransactionEvent</code> call worked and we crashed the next time something <em>else</em> tried to emit a DBus signal. I also tried commenting out the <code>self._reset_base()</code> call entirely, and found that now we would only crash the next time a DBus signal was emitted <em>after a subsequent call to the <code>Unlock()</code> method</em>, which is another method that calls <code>self._reset_base()</code>. So, at this point I was pretty confident in this description: "dnfdaemon is crashing on the first interaction with DBus after <code>self._reset_base()</code> is called".</p> <p>So my next step was to break down what <code>_reset_base()</code> was actually <em>doing</em>. 
</p> <p>Gather round the fire, kids, it's time for another Debugging Adventure! These are posts where I write up the process of diagnosing the root cause of a bug, where it turned out to be interesting (to me, anyway...)</p> <p>This case - <a href="https://bugzilla.redhat.com/show_bug.cgi?id=1750575">Bugzilla #1750575</a> - involved dnfdragora, the package management tool used on Fedora Xfce, which is a release-blocking environment for the ARM architecture. It was a pretty easy bug to reproduce: any time you updated a package, the update would work, but then dnfdragora would show an error "DNF is locked by another process. dnfdragora will exit.", and immediately exit.</p> <p>The bug sat around on the blocker list for a while; Daniel Mach (a DNF developer) <a href="https://bugzilla.redhat.com/show_bug.cgi?id=1750575#c10">looked into it a bit</a> but didn't have time to figure it out all the way. So I got tired of waiting for someone else to do it, and decided to work it out myself.</p> <h2>Where's the error coming from?</h2> <p>As a starting point, I had a nice error message - so the obvious thing to do is figure out where that message comes from.
The text appears in a couple of places in dnfdragora - in an <a href="https://github.com/manatools/dnfdragora/blob/master/dnfdragora/basedragora.py#L131">exception handler</a> and also in a <a href="https://github.com/manatools/dnfdragora/blob/master/dnfdragora/basedragora.py#L90">method for setting up a connection to dnfdaemon</a>. So, if we didn't already know (I happened to) this would be the point at which we'd realize that dnfdragora is a frontend app to a backend - <a href="https://github.com/manatools/dnfdaemon">dnfdaemon</a> - which does the heavy lifting.</p> <p>So, to figure out in more detail how we were getting to one of these two points, I hacked both the points where that error is logged. Both of them read <code>logger.critical(errmsg)</code>. I changed this to <code>logger.exception(errmsg)</code>. <code>logger.exception</code> is a very handy feature of Python's <a href="https://docs.python.org/3/library/logging.html">logging module</a> which logs whatever message you specify, <strong>plus a traceback to the current state</strong>, just like the traceback you get if the app actually crashes. 
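As a tiny standalone illustration (not dnfdragora's actual code - the logger name and message here are made up), the difference between the two calls looks like this:

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("demo")

errmsg = "DNF is locked by another process. demo will exit."

try:
    raise RuntimeError("simulated dbus failure")
except RuntimeError:
    # logger.critical(errmsg) would log only the message;
    # logger.exception() logs the message *plus* the formatted traceback
    # of the exception currently being handled.
    logger.exception(errmsg)
```

Called from inside an `except` block, `logger.exception()` behaves like `logger.error()` but appends the traceback - exactly what you want when an app swallows exceptions into a friendly error message.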
So by doing that, the dnfdragora log (it logs to a file <code>dnfdragora.log</code> in the directory you run it from) gave us a traceback showing how we got to the error:</p> <pre><code>2019-10-14 17:53:29,436 (ERROR) dnfdragora: dnfdaemon client error: g-io-error-quark: GDBus.Error:org.baseurl.DnfSystem.LockedError: dnf is locked by another application (36)
Traceback (most recent call last):
  File "/usr/bin/dnfdragora", line 85, in &lt;module&gt;
    main_gui.handleevent()
  File "/usr/lib/python3.7/site-packages/dnfdragora/ui.py", line 1273, in handleevent
    if not self._searchPackages(filter, True) :
  File "/usr/lib/python3.7/site-packages/dnfdragora/ui.py", line 949, in _searchPackages
    packages = self.backend.search(fields, strings, self.match_all, self.newest_only, tags )
  File "/usr/lib/python3.7/site-packages/dnfdragora/misc.py", line 135, in newFunc
    rc = func(*args, **kwargs)
  File "/usr/lib/python3.7/site-packages/dnfdragora/dnf_backend.py", line 464, in search
    newest_only, tags)
  File "/usr/lib/python3.7/site-packages/dnfdaemon/client/__init__.py", line 508, in Search
    fields, keys, attrs, match_all, newest_only, tags))
  File "/usr/lib/python3.7/site-packages/dnfdaemon/client/__init__.py", line 293, in _run_dbus_async
    result = self._get_result(data)
  File "/usr/lib/python3.7/site-packages/dnfdaemon/client/__init__.py", line 277, in _get_result
    self._handle_dbus_error(user_data['error'])
  File "/usr/lib/python3.7/site-packages/dnfdaemon/client/__init__.py", line 250, in _handle_dbus_error
    raise DaemonError(str(err))
dnfdaemon.client.DaemonError: g-io-error-quark: GDBus.Error:org.baseurl.DnfSystem.LockedError: dnf is locked by another application (36)
</code></pre> <p>So, this tells us quite a bit of stuff. We know we're crashing in some sort of 'search' operation, and dbus seems to be involved. We can also see a bit more of the architecture here.
Note how we have <code>dnfdragora/dnf_backend.py</code> and <code>dnfdaemon/client/__init__.py</code> included in the trace, even though we're only in the dnfdragora executable here (dnfdaemon is a separate process). Looking at that and then looking at those files a bit, it's quite easy to see that the dnfdaemon Python library provides a sort of <a href="https://github.com/manatools/dnfdaemon/blob/master/python/dnfdaemon/client/__init__.py#L199">framework for a client class called (oddly enough) <code>DnfDaemonBase</code></a> which the actual client - dnfdragora in our case - is expected to subclass and flesh out. dnfdragora does this in a <a href="https://github.com/manatools/dnfdragora/blob/master/dnfdragora/dnf_backend.py#L144">class called <code>DnfRootBackend</code></a>, which inherits from both <code>dnfdragora.backend.Backend</code> (a sort of abstraction layer for dnfdragora to have multiple of these backends, though at present it only actually has this one) and <code>dnfdaemon.client.Client</code>, which is just a <a href="https://github.com/manatools/dnfdaemon/blob/master/python/dnfdaemon/client/__init__.py#L544">small extension to <code>DnfDaemonBase</code></a> that adds some dbus signal handling.</p> <p>So now we know more about the design we're dealing with, and we can also see that we're trying to do some sort of search operation which looks like it works by the client class communicating with the actual <code>dnfdaemon</code> server process via dbus, only we're hitting some kind of error in that process, and interpreting it as 'dnf is locked by another application'. If we dig a little deeper, we can figure out a bit more.
We have to read through all of the backtrace frames and examine the functions, but ultimately we can figure out that <code>DnfRootBackend.Search()</code> is wrapped by <code>dnfdragora.misc.ExceptionHandler</code>, which handles <code>dnfdaemon.client.DaemonError</code> exceptions - like the one that's ultimately getting raised here! - by calling the base class's own <code>exception_handler()</code> on them...and for us, that's <a href="https://github.com/manatools/dnfdragora/blob/master/dnfdragora/basedragora.py#L121">BaseDragora.exception_handler</a>, one of the two places we found earlier that ultimately produces this "DNF is locked by another process. dnfdragora will exit" text. We also now have <em>two</em> indications (the dbus error itself, and the code in <code>exception_handler()</code>) that the error we're dealing with is "LockedError".</p> <h2>A misleading error...</h2> <p>At this point, I went looking for the text <code>LockedError</code>, and found it in two files in dnfdaemon that are kinda variants on each other - <a href="https://github.com/manatools/dnfdaemon/blob/master/daemon/dnfdaemon-session.py">daemon/dnfdaemon-session.py</a> and <a href="https://github.com/manatools/dnfdaemon/blob/master/daemon/dnfdaemon-system.py">daemon/dnfdaemon-system.py</a>. I didn't actually know offhand which of the two is used in our case, but it doesn't really matter, because the codepath to <code>LockedError</code> is the same in both. There's a <a href="https://github.com/manatools/dnfdaemon/blob/master/daemon/dnfdaemon-system.py#L700">function called <code>check_lock()</code></a> which checks that <code>self._lock == sender</code>, and raises <code>LockedError</code> if it doesn't. That sure looks like where we're at.</p> <p>So at this point I did a bit of poking around into how <code>self._lock</code> gets set and unset in the daemon. It turns out to be pretty simple.
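Simple enough, in fact, to sketch the whole scheme in miniature. This is a toy reconstruction for illustration, not the actual dnfdaemon code; in the real daemon, <code>sender</code> is the caller's DBus identifier (like <code>:1.1835</code>):

```python
class LockedError(Exception):
    pass


class MiniDaemon:
    """Toy version of dnfdaemon's sender-keyed lock (illustrative only)."""

    def __init__(self):
        self._lock = None              # a fresh daemon instance is unlocked

    def Lock(self, sender):
        if self._lock is None:
            self._lock = sender        # remember *which* client locked us

    def Unlock(self, sender):
        if self._lock == sender:
            self._lock = None

    def check_lock(self, sender):
        # Passes only if *this* caller holds the lock; it fails both when
        # another client holds it and when nobody holds it at all.
        if self._lock != sender:
            raise LockedError("dnf is locked by another application")
```

Note the second failure mode: when <code>self._lock</code> is <code>None</code>, every guarded call fails, even though no "other application" is involved at all.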
The daemon is basically implemented as a class with a bunch of methods that are wrapped by <code>@dbus.service.method</code>, which makes them accessible as <em>DBus</em> methods. (One of them is <code>Search()</code>, and we can see that the <a href="https://github.com/manatools/dnfdaemon/blob/master/python/dnfdaemon/client/__init__.py#L493">client class's own <code>Search()</code> basically just calls that</a>). There are also methods called <a href="https://github.com/manatools/dnfdaemon/blob/master/daemon/dnfdaemon-system.py#L105"><code>Lock()</code></a> and <a href="https://github.com/manatools/dnfdaemon/blob/master/daemon/dnfdaemon-system.py#L350"><code>Unlock()</code></a>, which - not surprisingly - set and release this lock, by setting the daemon class' <code>self._lock</code> to be either an identifier for the DBus client or <code>None</code>, respectively. And when the daemon is first initialized, the value is set to <code>None</code>.</p> <p>At this point, I realized that the error we're dealing with here is actually a lie in two important ways:</p> <ol> <li>The message claims that the problem is the lock being held "by another application", but that's not what <code>check_lock()</code> checks, really. It passes <strong>only if the caller holds the lock</strong>. It <em>does</em> fail if the lock is held "by another application", but it <em>also</em> fails if <strong>the lock is not held at all</strong>. Given all the code we looked at so far, we can't actually trust the message's assertion that something else is holding the lock. It is also possible that the lock is not held at all.</li> <li>The message suggests that the lock in question is a lock on dnf itself. I know dnf/libdnf <em>do</em> have locking, so up to now I'd been assuming we were actually dealing with the locking in dnf itself. But at this point I realized we weren't. The dnfdaemon lock code we just looked at doesn't actually call or wrap dnf's own locking code in any way. 
This lock we're dealing with is entirely internal to dnfdaemon. It's really a lock on <em>the dnfdaemon instance itself</em>.</li> </ol> <p>So, at this point I started thinking of the error as being "dnfdaemon is either locked by another DBus client, or not locked at all".</p> <h2>So what's going on with this lock anyway?</h2> <p>My next step, now I understood the locking process we're dealing with, was to stick some logging into it. I added log lines to the <code>Lock()</code> and <code>Unlock()</code> methods, and I also made <code>check_lock()</code> log what <code>sender</code> and <code>self._lock</code> were set to before returning. Because it sets <code>self._lock</code> to <code>None</code>, I also added a log line to the daemon's <code>__init__</code> that just records that we're in it. That got me some more useful information:</p> <pre><code>2019-10-14 18:53:03.397784 XXX In DnfDaemon.init now!
2019-10-14 18:53:03.402804 XXX LOCK: sender is :1.1835
2019-10-14 18:53:03.407524 XXX CHECK LOCK: sender is :1.1835
XXX CHECK LOCK: self._lock is :1.1835
2019-10-14 18:53:07.556499 XXX CHECK LOCK: sender is :1.1835
XXX CHECK LOCK: self._lock is :1.1835
[...snip a bunch more calls to check_lock where the sender is the same...]
2019-10-14 18:53:13.560828 XXX CHECK LOCK: sender is :1.1835
XXX CHECK LOCK: self._lock is :1.1835
2019-10-14 18:53:13.560941 XXX CHECK LOCK: sender is :1.1835
XXX CHECK LOCK: self._lock is :1.1835
2019-10-14 18:53:16.513900 XXX In DnfDaemon.init now!
2019-10-14 18:53:16.516724 XXX CHECK LOCK: sender is :1.1835
XXX CHECK LOCK: self._lock is None
</code></pre> <p>So we could see that when we started dnfdragora, dnfdaemon started up and dnfdragora locked it almost immediately, then throughout the whole process of reproducing the bug - run dnfdragora, search for a package to be updated, mark it for updating, run the transaction, wait for the error - there were several instances of DBus method calls where everything worked fine (we see <code>check_lock()</code> being called and finding <code>sender</code> and <code>self._lock</code> set to the same value, the identifier for dnfdragora), but then suddenly we see the daemon's <code>__init__</code> running again for some reason, not being locked, and then a <code>check_lock()</code> call that fails because the daemon instance's <code>self._lock</code> is <code>None</code>.</p> <p>After a couple of minutes, I guessed what was going on here, and the <a href="https://bugzilla.redhat.com/show_bug.cgi?id=1750575#c23">daemon's service logs confirmed it</a> - dnfdaemon was crashing and automatically restarting. The first attempt to invoke a DBus method after the crash and restart fails, because dnfdragora has not locked this <em>new</em> instance of the daemon (it has no idea it just crashed and restarted), so <code>check_lock()</code> fails. So as soon as a DBus method invocation is attempted after the dnfdaemon crash, dnfdragora errors out with the confusing "dnf is locked by another process" error.</p> <p>The crash was already mentioned in the bug report, but until now the exact interaction between the crash and the error had not been worked out - we just knew the daemon crashed and the app errored out, but we didn't really know what order those things happened in or how they related to each other.</p> <h2>OK then...why is dnfdaemon crashing?</h2> <p>So, the question now became: why is dnfdaemon crashing?
Well, the backtrace we had didn't tell us a lot; really it only told us that something was going wrong in libdbus, which we could also tell from the dnfdaemon service log:</p> <pre><code>Oct 14 18:53:15 adam.happyassassin.net dnfdaemon-system[226042]: dbus[226042]: arguments to dbus_connection_unref() were incorrect, assertion "connection-&gt;generation == _dbus_current_generation" failed in file ../../dbus/dbus-connection.c line 2823.
Oct 14 18:53:15 adam.happyassassin.net dnfdaemon-system[226042]: This is normally a bug in some application using the D-Bus library.
Oct 14 18:53:15 adam.happyassassin.net dnfdaemon-system[226042]: D-Bus not built with -rdynamic so unable to print a backtrace
</code></pre> <p>That last line looked like a cue, so of course, off I went to figure out how to build DBus with <code>-rdynamic</code>. A bit of Googling told me - thanks "the3dfxdude"! - that <a href="https://www.linuxquestions.org/questions/slackware-14/compile-dbus-with-backtrace-support-4175459330/#post4938346">the trick is to compile with --enable-asserts</a>. So I did that and reproduced the bug again, and <a href="https://bugzilla.redhat.com/show_bug.cgi?id=1750575#c25">got a bit of a better backtrace</a>. It's a long one, but by picking through it carefully I could spot - in frame #17 - the actual point at which the problem happened, which was <a href="https://github.com/manatools/dnfdaemon/blob/master/python/dnfdaemon/server/__init__.py#L595">in <code>dnfdaemon.server.DnfDaemonBase.run_transaction()</code></a>. (Note, this is a <em>different</em> <code>DnfDaemonBase</code> class from <code>dnfdaemon.client.DnfDaemonBase</code>; I don't know why they have the same name, that's just confusing).</p> <p>So, the daemon's crashing on this <code>self.TransactionEvent('end-run', NONE)</code> call.
I poked into what that does a bit, and found a design here that kinda mirrors what happens on the client side: this <code>DnfDaemonBase</code>, like the other one, is a framework for a full daemon implementation, and it's subclassed by <a href="https://github.com/manatools/dnfdaemon/blob/master/daemon/dnfdaemon-system.py#L59">a <code>DnfDaemon</code> class here</a>. That class defines <a href="https://github.com/manatools/dnfdaemon/blob/master/daemon/dnfdaemon-system.py#L639">a <code>TransactionEvent</code> method</a> that emits a DBus signal. So...we're crashing when trying to emit a dbus signal. That all adds up with the backtrace going through libdbus and all. But, <em>why</em> are we crashing?</p> <p>At this point I tried to make a small reproducer (which basically just set up a <code>DnfDaemon</code> instance and called <code>self.TransactionEvent</code> in the same way, I think) but that didn't work - I didn't know why at the time, but figured it out later. Continuing to trace it out through code wouldn't be that easy because now we're in DBus, which I know from experience is a big complex codebase that's not that easy to just reason your way through. We had the actual DBus error to work from too - "arguments to dbus_connection_unref() were incorrect, assertion "connection-&gt;generation == _dbus_current_generation" failed" - and I looked into that a bit, but there were no really helpful leads there (I got a bit more understanding about what the error means exactly, but it didn't help me understand <em>why it was happening</em> at all).</p> <h2>Time for the old standby...</h2> <p>So, being a bit stuck, I fell back on the most trusty standby: trial and error! Well, also a <em>bit</em> of logic. It did occur to me that the dbus broker is itself a long-running daemon that other things can talk to. So I started just wondering if something was interfering with dnfdaemon's connection with the dbus broker, somehow. 
This was in my head as I poked around at stuff - wherever I wound up looking, I was looking for stuff that involved dbus.</p> <p>But to figure out where to look, I just started hacking up dnfdaemon a bit. Now this first part is probably pure intuition, but that <code>self._reset_base()</code> call on the line right before the <code>self.TransactionEvent</code> call that crashes bugged me. It's probably just long experience telling me that anything with "reset" or "refresh" in the name is bad news. :P So I thought, hey, what happens if we move it?</p> <p>I stuck some logging lines into this <code>run_transaction</code> so I knew where we got to before we crashed - this is a great dumb trick, btw, just stick lines like <code>self.logger('XXX HERE 1')</code>, <code>self.logger('XXX HERE 2')</code> etc. between every significant line in the thing you're debugging, and grep the logs for "XXX" - and moved the <code>self._reset_base()</code> call down under the <code>self.TransactionEvent</code> call...and found that when I did that, we got further, the <code>self.TransactionEvent</code> call worked and we crashed the next time something <em>else</em> tried to emit a DBus signal. I also tried commenting out the <code>self._reset_base()</code> call entirely, and found that now we would only crash the next time a DBus signal was emitted <em>after a subsequent call to the <code>Unlock()</code> method</em>, which is another method that calls <code>self._reset_base()</code>. So, at this point I was pretty confident in this description: "dnfdaemon is crashing on the first interaction with DBus after <code>self._reset_base()</code> is called".</p> <p>So my next step was to break down what <code>_reset_base()</code> was actually <em>doing</em>. 
Turns out all of the detail is in the <code>DnfDaemonBase</code> skeleton server class: it has a <code>self._base</code> which is a <code>dnf.base.Base()</code> instance, and that method just calls that instance's <code>close()</code> method and sets <code>self._base</code> to <code>None</code>. So off I went into dnf code to see <a href="https://github.com/rpm-software-management/dnf/blob/master/dnf/base.py#L454">what <code>dnf.base.Base.close()</code> does</a>. Turns out it basically does two things: it calls <code>self._finalize_base()</code> and then calls <code>self.reset(True, True, True)</code>.</p> <p>Looking at the code it wasn't immediately obvious which of these would be the culprit, so it was all aboard the trial and error train again! I changed the call to <code>self._reset_base()</code> in the daemon to <code>self._base.reset(True, True, True)</code>...and the bug stopped happening! So that told me the problem was in the call to <code>_finalize_base()</code>, not the call to <code>reset()</code>. So I dug into what <code>_finalize_base()</code> does and kinda repeated this process - I kept drilling down through layers and splitting up what things did into individual pieces, and doing subsets of those pieces at a time to try and find the "smallest" thing I could which would cause the bug.</p> <p>To take a short aside...this is what I really like about these kinds of debugging odysseys. It's like being a detective, only ultimately you know that there's a definite <em>reason</em> for what's happening and there's always some way you can get closer to it. If you have enough patience there's always a next step you can take that will get you a little bit closer to figuring out what's going on. 
You just have to keep working through the little steps until you finally get there.</p> <p>Eventually I lit upon <a href="https://github.com/rpm-software-management/dnf/blob/master/dnf/rpm/transaction.py#L50">this bit of <code>dnf.rpm.transaction.TransactionWrapper.close()</code></a>. That was the key, as close as I could get to it: reducing the daemon's <code>self._reset_base()</code> call to just <code>self._base._priv_ts.ts = None</code> (which is what that line does) was enough to cause the bug. That was the one thing out of all the things that <code>self._reset_base()</code> does which caused the problem.</p> <p>So, of course, I took a look at what this <code>ts</code> thing was. Turns out it's an instance of <code>rpm.TransactionSet</code>, from RPM's Python library. So, at some point, we're setting up an instance of <code>rpm.TransactionSet</code>, and at this point we're dropping our reference to it, which - point to ponder - might trigger some kind of cleanup on it.</p> <p>Remember how I was looking for things that deal with dbus? Well, that turned out to bear fruit at this point...because what I did next was simply to go to my git checkout of rpm and grep it for 'dbus'. And lo and behold...<a href="https://github.com/rpm-software-management/rpm/blob/ab601b882b9d9d8248250111317615db1aa7b7c6/plugins/systemd_inhibit.c">this showed up</a>.</p> <p>Turns out RPM has plugins (TIL!), and in particular, it has <em>this</em> one, which talks to dbus. (What it actually does is try to inhibit systemd from suspending or shutting down the system while a package transaction is happening). And this plugin has a <a href="https://github.com/rpm-software-management/rpm/blob/ab601b882b9d9d8248250111317615db1aa7b7c6/plugins/systemd_inhibit.c#L85">cleanup function which calls something called <code>dbus_shutdown()</code></a> - aha!</p> <p>This was enough to get me pretty suspicious. 
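That "point to ponder" deserves a quick demonstration: in CPython, dropping the last reference to an object runs its cleanup immediately, so an innocent-looking assignment can trigger arbitrary teardown code. (This uses a made-up stand-in class, not the real <code>rpm.TransactionSet</code>, whose cleanup happens at the C level.)

```python
events = []


class FakeTransactionSet:
    """Stand-in for rpm.TransactionSet; only demonstrates cleanup timing."""

    def __del__(self):
        # In the real bug, the C-level cleanup tore down the RPM plugins,
        # and the systemd-inhibit plugin's teardown called dbus_shutdown().
        events.append("cleanup ran")


ts = FakeTransactionSet()
events.append("dropping reference")
ts = None   # last reference gone: __del__ runs right here in CPython
events.append("after assignment")
```

After this runs, <code>events</code> is <code>['dropping reference', 'cleanup ran', 'after assignment']</code>: the cleanup fires synchronously, in the middle of whatever else the code was doing, which is exactly how a plain <code>ts = None</code> could detonate <code>dbus_shutdown()</code>.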
So I checked my system and, indeed, I had a package <code>rpm-plugin-systemd-inhibit</code> installed. I poked at dependencies a bit and found that <code>python3-dnf</code> recommends that package, which means it'll basically be installed on nearly all Fedora installs. Still looking like a prime suspect. So, it was easy enough to check: I put the code back to a state where the crash happened, uninstalled the package, and tried again...and bingo! The crash stopped happening.</p> <p>So at this point the case was more or less closed. I just had to do a bit of confirming and tidying up. I checked and it turned out that indeed this call to <code>dbus_shutdown()</code> had been <a href="https://github.com/rpm-software-management/rpm/commit/d5f201345f6d27b6280750e5c6502f4418614fbc">added quite recently</a>, which tied in with the bug not showing up earlier. I looked up the <a href="https://dbus.freedesktop.org/doc/api/html/group__DBusMemory.html#ga01912903e39428872920d861ef565bac">documentation for <code>dbus_shutdown()</code></a> which confirmed that it's a bit of a big cannon which certainly could cause a problem like this:</p> <p>"Frees all memory allocated internally by libdbus and reverses the effects of dbus_threads_init().</p> <p>libdbus keeps internal global variables, for example caches and thread locks, and it can be useful to free these internal data structures.</p> <p>...</p> <p>You can't continue to use any D-Bus objects, such as connections, that were allocated prior to dbus_shutdown(). You can, however, start over; call dbus_threads_init() again, create new connections, and so forth."</p> <p>and then I did a scratch build of rpm with the commit reverted, tested, and found that indeed, it solved the problem. So, we finally had our culprit: when the <code>rpm.TransactionSet</code> instance went out of scope, it got cleaned up, and that resulted in this plugin's cleanup function getting called, and <code>dbus_shutdown()</code> happening. 
The RPM devs had intended that call to clean up <em>the RPM plugin's</em> DBus handles, but this is all happening in a single process, so the call also cleaned up the DBus handles used by dnfdaemon itself, and that was enough (as the docs suggest) to cause any further attempts to communicate with DBus in dnfdaemon code to blow up and crash the daemon.</p> <p>So, that's how you get from dnfdragora claiming that DNF is locked by another process to a stray RPM plugin crashing dnfdaemon on a DBus interaction!</p> AdamW's Debugging Adventures: The Mysterious Disappearing /proc https://www.happyassassin.net/posts/2019/01/17/adamws-debugging-adventures-the-mysterious-disappearing-proc/ 2019-01-17T19:15:30Z 2019-01-17T19:15:30Z Adam Williamson <p></p><p>Yep, folks, it's that time again - time for one of old Grandpa Adam's tall tales of root causing adventure...</p> <p>There's a sort of catch-22 situation in Fedora that has been a personal bugbear for a very long time. It mainly affects <a href="https://fedoraproject.org/wiki/Releases/Branched">Branched</a> releases - each new Fedora release, when it has branched from Rawhide, but before it has been released. During this period the <a href="https://admin.fedoraproject.org/updates/">Bodhi update system</a> is in effect, meaning all new packages have to go through Bodhi review before they are included in the composes for the release. This means, in theory, we should be able to make sure nothing really broken lands in the release. However, there's a big class of really important updates we have never been able to test properly at all: updates that affect the installer.</p> <p>The catch-22 is this - release engineering only builds install media from the 'stable' package set, those packages that have gone through review. So if a package under review breaks the installer, we can't test whether it breaks the installer unless we push it stable. 
Well, you <em>can</em>, but it's quite difficult - you have to learn how to build an installer image yourself, then build one containing the packages from the update and test it. I can do that, but most other people aren't going to bother.</p> <p>I've filed bugs and talked to people about ways to resolve this multiple times over many years, but a few months back I just got sick of the problem and decided to fix it myself. So I wrote an <a href="https://fedoraproject.org/wiki/OpenQA">openQA</a> update test which automates the process: it builds an installer image, with the packages from the update available to the installer image build tool. I also included a subsequent test which takes that image and runs an install with it. Since I already had the process for doing this manually down pat, it wasn't actually very difficult.</p> <p>Only...when I deployed the test to the openQA staging instance and actually tried it out, I found the installer image build would frequently fail in a rather strange way.</p> <p>The installer image build process works (more or less) by creating a temporary directory, installing a bunch of packages to it (using dnf's feature of installing to an alternative 'root'), fiddling around with that environment a bit more, creating a disk image whose root is that temporary directory, then fiddling with the image a bit to make it into a bootable ISO. (HANDWAVE HANDWAVE). However, I was finding it would commonly fail in the 'fiddling around with the environment' stage, because somehow some parts of the environment had <em>disappeared</em>. Specifically, it'd show this error:</p> <pre><code>FileNotFoundError: [Errno 2] No such file or directory: '/var/tmp/lorax.q8xfvc0p/installroot//proc/modules'
</code></pre> <p>lorax was, at that point, trying to touch that directory (never mind why). That's the <code>/proc/modules</code> inside the temporary root, basically. The question was, why was it disappearing?
And why had neither myself nor <a href="https://www.brianlane.com/">bcl</a> (the lorax maintainer) seen it happening previously in manual use, or in official composes?</p> <p>I tried reproducing it in a virtual machine...and failed. Then I tried again, and succeeded. Then I ran the command again...and it worked! That pattern turned out to repeat: I could usually get it to happen the first time I tried it in a VM, but any subsequent attempts in the same VM succeeded.</p> <p>So this was seeming really pretty mysterious. Brian couldn't get it to happen at all.</p> <p>At this point I wrote a dumb, short Python script which just constantly monitored the disappearing location and told me when it appeared and disappeared. I hacked up the openQA test to run this script, and upload the result. Using the timestamps, I was able to figure out exactly what bit of lorax was running when the directory suddenly disappeared. But...I couldn't immediately see why anything in that chunk of lorax would wind up deleting the directory.</p> <p>At this point, other work became more important, and I wound up leaving this on the back burner for a couple of months. Then I came back to it a couple days ago. I picked back up where I left off, and did a custom build of lorax with some debug logging statements strewn around the relevant section, to figure out <em>really precisely</em> where we were when things went wrong. But this turned out to be a bit of a brick wall, because it turned out that at the time the directory disappeared, lorax was just...running <code>mksquashfs</code>. And I could not figure out any plausible reason at all why a run of <code>mksquashfs</code> would cause the directory to vanish.</p> <p>After a bit, though, the thought struck me - maybe it's <em>not</em> lorax itself wiping the directory out at all! Maybe something else is doing it. So I thought to look at the system logs. And lo and behold, I found my smoking gun. 
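The monitoring script itself isn't shown in the post; a minimal reconstruction of the idea (the path, polling interval, and <code>max_checks</code> knob are my own placeholders) looks something like this:

```python
import os
import time
from datetime import datetime


def watch(path, interval=0.5, max_checks=None):
    """Poll `path`, printing a timestamped line on every appear/disappear.

    Returns the list of observed transitions ("appeared"/"disappeared").
    The very first check also reports the initial state.
    """
    events = []
    last = None
    checks = 0
    while max_checks is None or checks < max_checks:
        exists = os.path.exists(path)
        if exists != last:
            state = "appeared" if exists else "disappeared"
            print(f"{datetime.now().isoformat()} {path} {state}")
            events.append(state)
            last = exists
        checks += 1
        time.sleep(interval)
    return events
```

Pointed at something like the installroot's <code>/proc/modules</code> and left running alongside the image build, the timestamps in its output can be lined up against lorax's own log to see exactly which build step was active when the path vanished.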
At the exact time my script logged that the directory had disappeared, this message appeared in the system log:</p> <pre><code>Jan 18 01:57:30 ibm-p8-kvm-03-guest-02.virt.pnr.lab.eng.rdu2.redhat.com systemd[1]: Starting Cleanup of Temporary Directories... </code></pre> <p>now, remember our problem directory is in <code>/var/tmp</code>. So this smells very suspicious indeed! So I figured out what that service actually is - to do this, you just grep for the description ("Cleanup of Temporary Directories") in <code>/usr/lib/systemd/system</code> - and it turned out to be <code>/usr/lib/systemd/system/systemd-tmpfiles-clean.service</code>, which is part of systemd's <code>systemd-tmpfiles</code> mechanism, which you can read up on in great detail in <code>man systemd-tmpfiles</code> and <code>man tmpfiles.d</code>.</p> <p>I had run into it a few times before, so I had a vague idea what I was dealing with and what to look for. It's basically a mechanism for managing temporary files and directories: you can write config snippets which systemd will read and do stuff like creating expected temporary files or directories on boot (this lets packages manage temporary directories without doing it themselves in scriptlets). I poked through the docs again and, sure enough, it turns out another thing the system can do is <em>delete</em> temporary files that reach a certain age:</p> <pre><code>Age The date field, when set, is used to decide what files to delete when cleaning. If a file or directory is older than the current time minus the age field, it is deleted. The field format is a series of integers each followed by one of the following suffixes for the respective time units: s, m or min, h, d, w, ms, and us, meaning seconds, minutes, hours, days, weeks, milliseconds, and microseconds, respectively. Full names of the time units can be used too. </code></pre> <p>This <code>systemd-tmpfiles-clean.service</code> does that job. 
So I went looking for <code>tmpfiles.d</code> snippets that cover <code>/var/tmp</code>, and sure enough, found one, in Fedora's stock config file <code>/usr/lib/tmpfiles.d/tmp.conf</code>:</p> <pre><code>q /var/tmp 1777 root root 30d </code></pre> <p>The <code>30d</code> there is the 'age' field. So this tells the tmpfiles mechanism that it's fine to wipe anything under <code>/var/tmp</code> which is older than 30 days.</p> <p>Of course, naively we might think our directory <em>won't be</em> older than 30 days - after all, we only just ran lorax! But remember, lorax installs packages into this temporary directory, and files and directories in packages get some of their time attributes from the package. So we (at this point, Brian and I were chatting about the problem as I poked it) looked into how systemd-tmpfiles defines age, precisely:</p> <pre><code>The age of a file system entry is determined from its last modification timestamp (mtime), its last access timestamp (atime), and (except for directories) its last status change timestamp (ctime). Any of these three (or two) values will prevent cleanup if it is more recent than the current time minus the age field. </code></pre> <p>So since our thing is a directory, its <code>mtime</code> and <code>atime</code> are relevant. So Brian and I both looked into those. He did it manually, while I hacked up my check script to also print the <code>mtime</code> and <code>atime</code> of the directory when it existed. And sure enough, it turned out these were several months in the past - they were obviously related to the date the <code>filesystem</code> package (from which <code>/proc/modules</code> comes) was built. They were certainly longer than 30 days ago.</p> <p>Finally, I looked into what was actually running <code>systemd-tmpfiles-clean.service</code>; it's run on a timer, <code>systemd-tmpfiles-clean.timer</code>. 
That timer is set to run the service 15 minutes after the system boots, and every day thereafter.</p> <p>So all of this hooked up nicely into a convincing story. openQA kept running into this problem <em>because it always runs the test in a freshly-booted VM</em> - that '15 minutes after boot' was turning out to be right in the middle of the image creation. My manual reproductions were failing on the first try for the same reason - but then succeeding on the <em>second</em> and subsequent tries because the cleaner would not run again until the next day. And Brian and I had either never or rarely seen this when we ran lorax manually for one reason or another because it was pretty unlikely the "once a day" timer would happen to wake up and run just when we had lorax running (and if it did happen, we'd try again, and when it worked, we'd figure it was just some weird transient failure). The problem likely never happens in official composes, I think, because the tmpfiles timer isn't active at all in the environment lorax gets run in (haven't double-checked this, though).</p> <p>Brian now gets to deal with the thorny problem of trying to fix this somehow on the lorax side (so the tmpfiles cleanup won't remove bits of the temporary tree even if it does run while lorax is running). Now I know what's going on, it was easy enough to work around in the openQA test - I just have the test do <code>systemctl stop systemd-tmpfiles-clean.timer</code> before running the image build.</p> <p></p><p>Yep, folks, it's that time again - time for one of old Grandpa Adam's tall tales of root causing adventure...</p> <p>There's a sort of catch-22 situation in Fedora that has been a personal bugbear for a very long time. It mainly affects <a href="https://fedoraproject.org/wiki/Releases/Branched">Branched</a> releases - each new Fedora release, when it has branched from Rawhide, but before it has been released. 
During this period the <a href="https://admin.fedoraproject.org/updates/">Bodhi update system</a> is in effect, meaning all new packages have to go through Bodhi review before they are included in the composes for the release. This means, in theory, we should be able to make sure nothing really broken lands in the release. However, there's a big class of really important updates we have never been able to test properly at all: updates that affect the installer.</p> <p>The catch-22 is this - release engineering only builds install media from the 'stable' package set, those packages that have gone through review. So if a package under review breaks the installer, we can't test whether it breaks the installer unless we push it stable. Well, you <em>can</em>, but it's quite difficult - you have to learn how to build an installer image yourself, then build one containing the packages from the update and test it. I can do that, but most other people aren't going to bother.</p> <p>I've filed bugs and talked to people about ways to resolve this multiple times over many years, but a few months back I just got sick of the problem and decided to fix it myself. So I wrote an <a href="https://fedoraproject.org/wiki/OpenQA">openQA</a> update test which automates the process: it builds an installer image, with the packages from the update available to the installer image build tool. I also included a subsequent test which takes that image and runs an install with it. 
Since I already had the process for doing this manually down pat, it wasn't actually very difficult.</p> <p>Only...when I deployed the test to the openQA staging instance and actually tried it out, I found the installer image build would frequently fail in a rather strange way.</p> <p>The installer image build process works (more or less) by creating a temporary directory, installing a bunch of packages to it (using dnf's feature of installing to an alternative 'root'), fiddling around with that environment a bit more, creating a disk image whose root is that temporary directory, then fiddling with the image a bit to make it into a bootable ISO. (HANDWAVE HANDWAVE). However, I was finding it would commonly fail in the 'fiddling around with the environment' stage, because somehow some parts of the environment had <em>disappeared</em>. Specifically, it'd show this error:</p> <pre><code>FileNotFoundError: [Errno 2] No such file or directory: '/var/tmp/lorax.q8xfvc0p/installroot//proc/modules' </code></pre> <p>lorax was, at that point, trying to touch that directory (never mind why). That's the <code>/proc/modules</code> inside the temporary root, basically. The question was, why was it disappearing? And why had neither I nor <a href="https://www.brianlane.com/">bcl</a> (the lorax maintainer) seen it happening previously in manual use, or in official composes?</p> <p>I tried reproducing it in a virtual machine...and failed. Then I tried again, and succeeded. Then I ran the command again...and it worked! That pattern turned out to repeat: I could usually get it to happen the first time I tried it in a VM, but any subsequent attempts in the same VM succeeded.</p> <p>So this was seeming really pretty mysterious. Brian couldn't get it to happen at all.</p> <p>At this point I wrote a dumb, short Python script which just constantly monitored the disappearing location and told me when it appeared and disappeared.
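<p>That script is nothing fancy - a minimal sketch of the idea (the function names here are illustrative, not the exact script I ran) looks like this:</p>

```python
import os
import time
from datetime import datetime

def check(path, was_present):
    """Compare the path's current existence against its last known state.
    Returns (present, message); message is None unless the state changed."""
    present = os.path.exists(path)
    if present == was_present:
        return present, None
    stamp = datetime.now().isoformat(timespec="seconds")
    state = "appeared" if present else "disappeared"
    return present, "%s %s %s" % (stamp, path, state)

def watch(path, interval=0.5):
    """Poll forever, printing a timestamped line on every state change."""
    present = os.path.exists(path)
    while True:
        present, message = check(path, present)
        if message:
            print(message, flush=True)
        time.sleep(interval)
```

<p>Running something like <code>watch()</code> against the temporary path while the build runs is enough to timestamp exactly when it vanishes.</p>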
I hacked up the openQA test to run this script, and upload the result. Using the timestamps, I was able to figure out exactly what bit of lorax was running when the directory suddenly disappeared. But...I couldn't immediately see why anything in that chunk of lorax would wind up deleting the directory.</p> <p>At this point, other work became more important, and I wound up leaving this on the back burner for a couple of months. Then I came back to it a couple of days ago. I picked back up where I left off, and did a custom build of lorax with some debug logging statements strewn around the relevant section, to figure out <em>really precisely</em> where we were when things went wrong. But this turned out to be a bit of a brick wall, because it turned out that at the time the directory disappeared, lorax was just...running <code>mksquashfs</code>. And I could not figure out any plausible reason at all why a run of <code>mksquashfs</code> would cause the directory to vanish.</p> <p>After a bit, though, the thought struck me - maybe it's <em>not</em> lorax itself wiping the directory out at all! Maybe something else is doing it. So I thought to look at the system logs. And lo and behold, I found my smoking gun. At the exact time my script logged that the directory had disappeared, this message appeared in the system log:</p> <pre><code>Jan 18 01:57:30 ibm-p8-kvm-03-guest-02.virt.pnr.lab.eng.rdu2.redhat.com systemd[1]: Starting Cleanup of Temporary Directories... </code></pre> <p>Now, remember our problem directory is in <code>/var/tmp</code>. So this smells very suspicious indeed!
So I figured out what that service actually is - to do this, you just grep for the description ("Cleanup of Temporary Directories") in <code>/usr/lib/systemd/system</code> - and it turned out to be <code>/usr/lib/systemd/system/systemd-tmpfiles-clean.service</code>, which is part of systemd's <code>systemd-tmpfiles</code> mechanism, which you can read up on in great detail in <code>man systemd-tmpfiles</code> and <code>man tmpfiles.d</code>.</p> <p>I had run into it a few times before, so I had a vague idea what I was dealing with and what to look for. It's basically a mechanism for managing temporary files and directories: you can write config snippets which systemd will read and do stuff like creating expected temporary files or directories on boot (this lets packages manage temporary directories without doing it themselves in scriptlets). I poked through the docs again and, sure enough, it turns out another thing the system can do is <em>delete</em> temporary files that reach a certain age:</p> <pre><code>Age The date field, when set, is used to decide what files to delete when cleaning. If a file or directory is older than the current time minus the age field, it is deleted. The field format is a series of integers each followed by one of the following suffixes for the respective time units: s, m or min, h, d, w, ms, and us, meaning seconds, minutes, hours, days, weeks, milliseconds, and microseconds, respectively. Full names of the time units can be used too. </code></pre> <p>This <code>systemd-tmpfiles-clean.service</code> does that job. So I went looking for <code>tmpfiles.d</code> snippets that cover <code>/var/tmp</code>, and sure enough, found one, in Fedora's stock config file <code>/usr/lib/tmpfiles.d/tmp.conf</code>:</p> <pre><code>q /var/tmp 1777 root root 30d </code></pre> <p>The <code>30d</code> there is the 'age' field. 
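<p>Translated into Python, the age rule as it applies to a directory (per the man page excerpt quoted below, only <code>mtime</code> and <code>atime</code> count for directories) is roughly this - a sketch of the rule, not tmpfiles' actual logic, and the helper name is my own:</p>

```python
import os
import time

AGE = 30 * 24 * 60 * 60  # the 30d age field from tmp.conf, in seconds

def would_clean_dir(path, now=None):
    """Sketch of the tmpfiles age rule for a directory: it is only
    eligible for cleanup if BOTH mtime and atime are older than
    now minus the age field (ctime is ignored for directories)."""
    st = os.stat(path)
    if now is None:
        now = time.time()
    # The newest of the two timestamps decides; if either is recent,
    # the directory is kept.
    return max(st.st_mtime, st.st_atime) < now - AGE
```

<p>In our case, as we found out, both timestamps came from the package build - months in the past - so the directory was fair game.</p>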
So this tells the tmpfiles mechanism that it's fine to wipe anything under <code>/var/tmp</code> which is older than 30 days.</p> <p>Of course, naively we might think our directory <em>won't be</em> older than 30 days - after all, we only just ran lorax! But remember, lorax installs packages into this temporary directory, and files and directories in packages get some of their time attributes from the package. So we (at this point, Brian and I were chatting about the problem as I poked it) looked into how systemd-tmpfiles defines age, precisely:</p> <pre><code>The age of a file system entry is determined from its last modification timestamp (mtime), its last access timestamp (atime), and (except for directories) its last status change timestamp (ctime). Any of these three (or two) values will prevent cleanup if it is more recent than the current time minus the age field. </code></pre> <p>So since our thing is a directory, its <code>mtime</code> and <code>atime</code> are relevant. So Brian and I both looked into those. He did it manually, while I hacked up my check script to also print the <code>mtime</code> and <code>atime</code> of the directory when it existed. And sure enough, it turned out these were several months in the past - they were obviously related to the date the <code>filesystem</code> package (from which <code>/proc/modules</code> comes) was built. They were certainly more than 30 days in the past.</p> <p>Finally, I looked into what was actually running <code>systemd-tmpfiles-clean.service</code>; it's run on a timer, <code>systemd-tmpfiles-clean.timer</code>. That timer is set to run the service 15 minutes after the system boots, and every day thereafter.</p> <p>So all of this hooked up nicely into a convincing story. openQA kept running into this problem <em>because it always runs the test in a freshly-booted VM</em> - that '15 minutes after boot' was turning out to be right in the middle of the image creation.
My manual reproductions were failing on the first try for the same reason - but then succeeding on the <em>second</em> and subsequent tries because the cleaner would not run again until the next day. And Brian and I had rarely or never seen this when we ran lorax manually, because it was pretty unlikely the "once a day" timer would happen to wake up and run just when we had lorax running (and if it did happen, we'd try again, and when it worked, we'd figure it was just some weird transient failure). The problem likely never happens in official composes, I think, because the tmpfiles timer isn't active at all in the environment lorax gets run in (haven't double-checked this, though).</p> <p>Brian now gets to deal with the thorny problem of trying to fix this somehow on the lorax side (so the tmpfiles cleanup won't remove bits of the temporary tree even if it does run while lorax is running). Now that I knew what was going on, it was easy enough to work around in the openQA test - I just have the test do <code>systemctl stop systemd-tmpfiles-clean.timer</code> before running the image build.</p> AdamW's Debugging Adventures: Python 3 Porting 201 https://www.happyassassin.net/posts/2019/01/08/adamws-debugging-adventures-python-3-porting-201/ 2019-01-08T20:12:36Z 2019-01-08T20:12:36Z Adam Williamson <p>Hey folks! Time for another edition of AdamW's Debugging Adventures, wherein I boast about how great I am at fixin' stuff.</p> <p>Today's episode is about <a href="https://pagure.io/koji/issue/1192">a bug in the client for Fedora's Koji buildsystem</a> which has been biting more and more Fedora maintainers lately. The most obvious thing it affects is task watching. When you do a package build with <code>fedpkg</code>, it will by default "watch" the build task - it'll update you when the various subtasks start and finish, and not quit until the build ultimately succeeds or fails.
You can also directly watch tasks with <code>koji watch-task</code>. So this is something Fedora maintainers see a lot. There's also a common workflow where you chain something to the successful completion of a <code>fedpkg build</code> or <code>koji watch-task</code>, which relies on the task watch completing successfully and exiting 0, if the build actually completed.</p> <p>However, recently, people noticed that this task watching seemed to be just...failing, quite a lot. While the task was still running, it'd suddenly exit, usually showing this message:</p> <pre><code>ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response',)) </code></pre> <p>After a while, <a href="https://www.scrye.com/wordpress/nirik/">nirik</a> realized that this seemed to be associated with the client going from running under Python 2 by default to running under Python 3 by default. This seems to happen when running on Python 3; it doesn't seem to happen when running on Python 2.</p> <p>Today I finally decided it had got annoying enough that I'd spend some time trying to track it down.</p> <p>It's pretty obvious that the message we see relates to an exception, in some way. But ultimately something is catching that exception and printing it out and then exiting (we're not actually getting a traceback, as you do if the exception is ultimately left to reach the interpreter). So my first approach was to dig into the <code>watch-task</code> code from the top down, and try and find something that handles exceptions that looks like it might be the bit we were hitting.</p> <p>And...I failed! This happens, sometimes. In fact I still haven't found the exact bit of code that prints the message and exits. Sometimes, this just happens. It's OK. Don't give up. Try something else!</p> <p>So what I did next was kind of a long shot - I just grepped the code for the exception text.
I wasn't really expecting this to work, as there's nothing to suggest the actual exception is part of Koji; it's most likely the code doesn't contain any of that text at all. But hey, it's easy to do, so why not? And as it happened, I got lucky and <a href="https://pagure.io/koji/blob/2b8a9dd5739cc8ca725a4efc15003d467267ec04/f/koji/__init__.py#_1991">hit paydirt</a>: there happens to be a comment with some of the text from the error we're hitting. And it sure looks like it might be relevant to the problem we're having! The comment itself, and the function it's in, looked so obviously promising that I went ahead and dug a little deeper.</p> <p>That function, <code>is_conn_error()</code>, is used by only one other thing: this <a href="https://pagure.io/koji/blob/2b8a9dd5739cc8ca725a4efc15003d467267ec04/f/koji/__init__.py#_2397"><code>_sendCall()</code> method in the same file</a>. And that seems <strong>very</strong> interesting, because what it does can be boiled down to: "hey, we got an error! OK, send it to <code>is_conn_error()</code>. If that returns True, then just log a debug message and kick the session. If that returns False, then raise an exception". That behaviour obviously smells a <em>lot</em> like it could be causing our problem. So, I now had a working theory: for some reason, given some particular server behaviour, <code>is_conn_error()</code> returns <code>True</code> on Python 2 but <code>False</code> on Python 3. That causes this <code>_sendCall()</code> to raise an exception instead of just resetting the session and carrying on, and some other code - which we no longer need to find - catches that exception, prints it, and quits.</p> <p>The next step was to test this theory - because at this point it's only a theory, it could be entirely wrong. I've certainly come up with entirely plausible theories like this before which turned out to be not what was going on at all. 
So, like a true lazy shortcut enthusiast, I hacked up my local copy of Koji's <code>__init__.py</code> and sprinkled a bunch of lines like <code>print("HERE 1!")</code> and <code>print("HERE 2!")</code> through the whole of <code>is_conn_error()</code>. Then I just ran <code>koji watch-task</code> commands on random tasks until one failed.</p> <p>This is fine. When you're just trying to debug the problem you don't need to be super elegant about it. You don't need to do a proper git patch and rebuild the Koji package for your system and use proper logging methods and all the rest of it. Just dumping some <code>print</code> lines in a working copy of the file is <em>just fine</em>, if it works. Just remember to put everything back as it was before later. :)</p> <p>So, as it happened the god of root causing was on my side today, and it turned out I was right on the money. When one of the <code>koji watch-task</code> commands failed, it hit my <code>HERE 1!</code> and <code>HERE 3!</code> lines right when it died. Those told me we were indeed running through <code>is_conn_error()</code> right before the error, and further, where we were coming <em>out</em> of it. We were entering the <code>if isinstance(e, socket.error)</code> block at the start of the function, and returning <code>False</code> because the exception (<code>e</code>) <em>did</em> appear to be an instance of <code>socket.error</code>, but either did <em>not</em> have an <code>errno</code> attribute, or it was not one of <code>errno.ECONNRESET</code>, <code>errno.ECONNABORTED</code>, or <code>errno.EPIPE</code>.</p> <p>Obviously, this made me curious as to what the exception actually is, whether it has an <code>errno</code> at all, and if so, what it is. So I threw in a few more debugging lines - to print out <code>type(e)</code>, and <code>getattr(e, 'errno', 'foobar')</code>. The result of this was pretty interesting.
The second print statement gave me 'foobar', meaning the exception doesn't have an <code>errno</code> attribute at all. And the type of the exception was...<code>requests.exceptions.ConnectionError</code>.</p> <p>That's a bit curious! You wouldn't necessarily expect <code>requests.exceptions.ConnectionError</code> to be an instance of <code>socket.error</code>, would you? So why are we in a block that only handles instances of <code>socket.error</code>? Also, it's clear the <em>code</em> doesn't expect this, because there's a block later in the function that explicitly handles instances of <code>requests.exceptions.ConnectionError</code> - but because this earlier block that handles <code>socket.error</code> instances always returns, we will <em>never</em> reach that block if <code>requests.exceptions.ConnectionError</code> instances are also instances of <code>socket.error</code>. So there's clearly something screwy going on here.</p> <p>So of course the next thing to do is...look up <code>socket.error</code> in the Python 2 and Python 3 docs. <em>ANY TIME</em> you're investigating a mysterious Python 3 porting issue, remember this can be useful. Here's the <a href="https://docs.python.org/2/library/socket.html#socket.error">Python 2 <code>socket.error</code> entry</a>, and the <a href="https://docs.python.org/3/library/socket.html#socket.error">Python 3 <code>socket.error</code> entry</a>. And indeed there's a rather significant difference! The Python 2 docs talk about <code>socket.error</code> as an exception that is, well, its own unique thing. However, the Python 3 docs say: "A deprecated alias of OSError." - and even tell us specifically that this changed in Python 3.3: "Changed in version 3.3: Following PEP 3151, this class was made an alias of OSError." Obviously, this is looking an <em>awful</em> lot like one more link in the chain of what's going wrong here.</p> <p>With a bit of Python knowledge you should be able to figure out what's going on now. 
Think: if <code>socket.error</code> is now just an alias of <code>OSError</code>, what does <code>if isinstance(e, socket.error)</code> mean, in Python 3.3+? It means just the same as <code>if isinstance(e, OSError)</code>. And guess what? <code>requests.exceptions.ConnectionError</code> happens to be a subclass of <code>OSError</code>. Thus, if <code>e</code> is an instance of <code>requests.exceptions.ConnectionError</code>, <code>isinstance(e, socket.error)</code> will return <code>True</code> in Python 3.3+. In Python 2, it returns <code>False</code>. It's easy to check this in an interactive Python shell or with a test script, to confirm.</p> <p>Because of this, when we run under Python 3 and <code>e</code> is a <code>requests.exceptions.ConnectionError</code>, we're unexpectedly entering this block intended for handling <code>socket.error</code> exceptions and - because that block always returns, having the <code>return False</code> line that gets hit if the <code>errno</code> attribute check fails - we're never actually reaching the later block that's actually intended to handle <code>requests.exceptions.ConnectionError</code> instances at all; we return <code>False</code> before we get there.</p> <p>There are a few different ways you could fix this - you could just drop the <code>return False</code> short-circuit line in the <code>socket.error</code> block, for instance, or change the ordering so the <code>requests.exceptions.ConnectionError</code> handling is done first. In the end I sent <a href="https://pagure.io/koji/pull-request/1203">a pull request</a> which drops the <code>return False</code>, but also drops the <code>if isinstance(e, socket.error)</code> checks (there's another, for nested exceptions, later) entirely. Since <code>socket.error</code> is meant to be deprecated in Python 3.3+ we shouldn't really use it, and we probably don't need to - we can just rely on the <code>errno</code> attribute check alone.
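<p>Here's that quick confirmation as a stdlib-only sketch - <code>FakeConnectionError</code> stands in for <code>requests.exceptions.ConnectionError</code> (which derives from <code>OSError</code> on Python 3), and <code>looks_like_conn_error()</code> is a paraphrase of the errno-only approach, not the literal patch:</p>

```python
import errno
import socket

# On Python 3.3+, socket.error is literally the same class as OSError
# (PEP 3151), so any OSError subclass passes an isinstance check on it.
print(socket.error is OSError)

class FakeConnectionError(OSError):
    """Stand-in for requests.exceptions.ConnectionError (an OSError
    subclass on Python 3)."""

e = FakeConnectionError("Connection aborted.")
print(isinstance(e, socket.error))  # True on Python 3.3+, False on Python 2
print(getattr(e, "errno", None))    # None - no errno was ever set

# The errno-only check described above, as a sketch:
CONN_ERRNOS = (errno.ECONNRESET, errno.ECONNABORTED, errno.EPIPE)

def looks_like_conn_error(exc):
    return getattr(exc, "errno", None) in CONN_ERRNOS

print(looks_like_conn_error(OSError(errno.ECONNRESET, "reset")))  # True
print(looks_like_conn_error(e))  # False: no errno, so not provably one
```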
Whatever type the exception is, if it has an <code>errno</code> attribute and that attribute is <code>errno.ECONNRESET</code>, <code>errno.ECONNABORTED</code>, or <code>errno.EPIPE</code>, I think we can be pretty sure this is a connection error.</p> <p>What's the moral of this debugging tale? I guess it's this: when porting from Python 2 to Python 3 (or doing anything similar to that), fixing the things that outright crash or obviously behave wrong is sometimes the <em>easy</em> part. Even if everything seems to be working fine on a simple test, it's certainly possible that subtler issues like this could be lurking in the background, causing unexpected failures or (possibly worse) subtly incorrect behaviour. And of course, that's just another reason to add to the big old "Why To Have A Really Good Test Suite" list!</p> <p>There's also a 'secondary moral', I guess, and that's this: predicting all the impacts of an interface change like this is hard. Remember the Python 3 docs mentioned a PEP associated with this change? Well, <a href="https://www.python.org/dev/peps/pep-3151/">here it is</a>. If you read it, it's clear the proposers actually put quite a lot of effort into thinking about how existing code might be affected by the change, but it looks like they still didn't consider a case like this. They talk about "Careless (or "naïve") code" which "blindly catches any of OSError, IOError, socket.error, mmap.error, WindowsError, select.error without checking the errno attribute", and about "Careful code is defined as code which, when catching any of the above exceptions, examines the errno attribute to determine the actual error condition and takes action depending on it" - and claim that "useful compatibility doesn't alter the behaviour of careful exception-catching code". 
However, Koji's code here clearly matches their definition of "careful" code - it considers both the exception's type, and the <code>errno</code> attribute, in making decisions - but because it is not just doing <code>except socket.error as e</code> or similar, but catching the exception elsewhere and then passing it to this function and using <code>isinstance</code>, it still gets tripped up by the change.</p> <p>So...the ur-moral, as always, is: software is hard!</p> <p></p><p>Hey folks! Time for another edition of AdamW's Debugging Adventures, wherein I boast about how great I am at fixin' stuff.</p> <p>Today's episode is about <a href="https://pagure.io/koji/issue/1192">a bug in the client for Fedora's Koji buildsystem</a> which has been biting more and more Fedora maintainers lately. The most obvious thing it affects is task watching. When you do a package build with <code>fedpkg</code>, it will by default "watch" the build task - it'll update you when the various subtasks start and finish, and not quit until the build ultimately succeeds or fails. You can also directly watch tasks with <code>koji watch-task</code>. So this is something Fedora maintainers see a lot. There's also a common workflow where you chain something to the successful completion of a <code>fedpkg build</code> or <code>koji watch-task</code>, which relies on the task watch completing successfully and exiting 0, if the build actually completed.</p> <p>However, recently, people noticed that this task watching seemed to be just...failing, quite a lot. While the task was still running, it'd suddenly exit, usually showing this message:</p> <p>ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response',))</p> <p>After a while, <a href="https://www.scrye.com/wordpress/nirik/">nirik</a> realized that this seemed to be associated with the client going from running under Python 2 by default to running under Python 3 by default. 
This seems to happen when running on Python 3; it doesn't seem to happen when running on Python 2.</p> <p>Today I finally decided it had got annoying enough that I'd spend some time trying to track it down.</p> <p>It's pretty obvious that the message we see relates to an exception, in some way. But ultimately something is catching that exception and printing it out and then exiting (we're not actually getting a traceback, as you do if the exception is ultimately left to reach the interpreter). So my first approach was to dig into the <code>watch-task</code> code from the top down, and try and find something that handles exceptions that looks like it might be the bit we were hitting.</p> <p>And...I failed! This happens, sometimes. In fact I still haven't found the exact bit of code that prints the message and exits. Sometimes, this just happens. It's OK. Don't give up. Try something else!</p> <p>So what I did next was kind of a long shot - I just grepped the code for the exception text. I wasn't really expecting this to work, as there's nothing to suggest the actual exception is part of Koji; it's most likely the code doesn't contain any of that text at all. But hey, it's easy to do, so why not? And as it happened, I got lucky and <a href="https://pagure.io/koji/blob/2b8a9dd5739cc8ca725a4efc15003d467267ec04/f/koji/__init__.py#_1991">hit paydirt</a>: there happens to be a comment with some of the text from the error we're hitting. And it sure looks like it might be relevant to the problem we're having! The comment itself, and the function it's in, looked so obviously promising that I went ahead and dug a little deeper.</p> <p>That function, <code>is_conn_error()</code>, is used by only one other thing: this <a href="https://pagure.io/koji/blob/2b8a9dd5739cc8ca725a4efc15003d467267ec04/f/koji/__init__.py#_2397"><code>_sendCall()</code> method in the same file</a>. 
And that seems <strong>very</strong> interesting, because what it does can be boiled down to: "hey, we got an error! OK, send it to <code>is_conn_error()</code>. If that returns True, then just log a debug message and kick the session. If that returns False, then raise an exception". That behaviour obviously smells a <em>lot</em> like it could be causing our problem. So, I now had a working theory: for some reason, given some particular server behaviour, <code>is_conn_error()</code> returns <code>True</code> on Python 2 but <code>False</code> on Python 3. That causes this <code>_sendCall()</code> to raise an exception instead of just resetting the session and carrying on, and some other code - which we no longer need to find - catches that exception, prints it, and quits.</p> <p>The next step was to test this theory - because at this point it's only a theory, it could be entirely wrong. I've certainly come up with entirely plausible theories like this before which turned out to be not what was going on at all. So, like a true lazy shortcut enthusiast, I hacked up my local copy of Koji's <code><strong>init</strong>.py</code> and sprinkled a bunch of lines like <code>print("HERE 1!")</code> and <code>print("HERE 2!")</code> through the whole of <code>is_conn_error()</code>. Then I just ran<code>koji wait-task</code> commands on random tasks until one failed.</p> <p>This is fine. When you're just trying to debug the problem you don't need to be super elegant about it. You don't need to do a proper git patch and rebuild the Koji package for your system and use proper logging methods and all the rest of it. Just dumping some <code>print</code> lines in a working copy of the file is <em>just fine</em>, if it works. Just remember to put everything back as it was before later. :)</p> <p>So, as it happened the god of root causing was on my side today, and it turned out I was right on the money. 
When one of the <code>koji watch-task</code> commands failed, it hit my <code>HERE 1!</code> and <code>HERE 3!</code> lines right when it died. Those told me we were indeed running through <code>is_conn_error()</code> right before the error, and further, where we were coming <em>out</em> of it. We were entering the <code>if isinstance(e, socket.error)</code> block at the start of the function, and returning <code>False</code> because the exception (<code>e</code>) <em>did</em> appear to be an instance of <code>socket.error</code>, but either did <em>not</em> have an <code>errno</code> attribute, or it was not one of <code>errno.ECONNRESET</code>, <code>errno.ECONNABORTED</code>, or <code>errno.EPIPE</code>.</p> <p>Obviously, this made me curious as to what the exception actually is, whether it has an <code>errno</code> at all, and if so, what it is. So I threw in a few more debugging lines - to print out <code>type(e)</code>, and <code>getattr(e, 'errno', 'foobar')</code>. The result of this was pretty interesting. The second print statement gave me 'foobar', meaning the exception doesn't have an <code>errno</code> attribute at all. And the type of the exception was...<code>requests.exceptions.ConnectionError</code>.</p> <p>That's a bit curious! You wouldn't necessarily expect <code>requests.exceptions.ConnectionError</code> to be an instance of <code>socket.error</code>, would you? So why are we in a block that only handles instances of <code>socket.error</code>? Also, it's clear the <em>code</em> doesn't expect this, because there's a block later in the function that explicitly handles instances of <code>requests.exceptions.ConnectionError</code> - but because this earlier block that handles <code>socket.error</code> instances always returns, we will <em>never</em> reach that block if <code>requests.exceptions.ConnectionError</code> instances are also instances of <code>socket.error</code>. 
So there's clearly something screwy going on here.</p> <p>So of course the next thing to do is...look up <code>socket.error</code> in the Python 2 and Python 3 docs. <em>ANY TIME</em> you're investigating a mysterious Python 3 porting issue, remember this can be useful. Here's the <a href="https://docs.python.org/2/library/socket.html#socket.error">Python 2 <code>socket.error</code> entry</a>, and the <a href="https://docs.python.org/3/library/socket.html#socket.error">Python 3 <code>socket.error</code> entry</a>. And indeed there's a rather significant difference! The Python 2 docs talk about <code>socket.error</code> as an exception that is, well, its own unique thing. However, the Python 3 docs say: "A deprecated alias of OSError." - and even tell us specifically that this changed in Python 3.3: "Changed in version 3.3: Following PEP 3151, this class was made an alias of OSError." Obviously, this is looking an <em>awful</em> lot like one more link in the chain of what's going wrong here.</p> <p>With a bit of Python knowledge you should be able to figure out what's going on now. Think: if <code>socket.error</code> is now just an alias of <code>OSError</code>, what does <code>if isinstance(e, socket.error)</code> mean, in Python 3.3+? It means just the same as <code>if isinstance(e, OSError)</code>. And guess what? <code>requests.exceptions.ConnectionError</code> happens to be a subclass of <code>OSError</code>. Thus, if <code>e</code> is an instance of <code>requests.exceptions.ConnectionError</code>, <code>isinstance(e, socket.error)</code> will return <code>True</code> in Python 3.3+. In Python 2, it returns <code>False</code>. 
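<p>A quick stand-alone snippet confirms all of this; the <code>FakeConnectionError</code> class here is a hypothetical stand-in for <code>requests.exceptions.ConnectionError</code> (itself an <code>OSError</code> subclass), so it runs with only the standard library:</p>

```python
import errno
import socket

# On Python 3.3+, socket.error is literally the same class as OSError:
assert socket.error is OSError

class FakeConnectionError(OSError):
    """Hypothetical stand-in for requests.exceptions.ConnectionError."""

e = FakeConnectionError("Connection aborted")

# The exception satisfies the socket.error isinstance check on Python 3...
assert isinstance(e, socket.error)
# ...but it carries no errno, so an errno test like Koji's sends it down
# the "not a connection error" path:
assert getattr(e, "errno", None) not in (errno.ECONNRESET, errno.ECONNABORTED, errno.EPIPE)
print("confirmed: OSError subclasses pass the socket.error check")
```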
It's easy to check this in an interactive Python shell or with a test script, to confirm.</p> <p>Because of this, when we run under Python 3 and <code>e</code> is a <code>requests.exceptions.ConnectionError</code>, we're unexpectedly entering the block intended for handling <code>socket.error</code> exceptions, and - because that block always returns, hitting its <code>return False</code> line when the <code>errno</code> attribute check fails - we never reach the later block that's actually intended to handle <code>requests.exceptions.ConnectionError</code> instances: we return <code>False</code> before we get there.</p> <p>There are a few different ways you could fix this - you could just drop the <code>return False</code> short-circuit line in the <code>socket.error</code> block, for instance, or change the ordering so the <code>requests.exceptions.ConnectionError</code> handling is done first. In the end I sent <a href="https://pagure.io/koji/pull-request/1203">a pull request</a> which drops the <code>return False</code>, but also drops the <code>if isinstance(e, socket.error)</code> checks (there's another, for nested exceptions, later) entirely. Since <code>socket.error</code> is meant to be deprecated in Python 3.3+ we shouldn't really use it, and we probably don't need to - we can just rely on the <code>errno</code> attribute check alone. Whatever type the exception is, if it has an <code>errno</code> attribute and that attribute is <code>errno.ECONNRESET</code>, <code>errno.ECONNABORTED</code>, or <code>errno.EPIPE</code>, I think we can be pretty sure this is a connection error.</p> <p>What's the moral of this debugging tale? I guess it's this: when porting from Python 2 to Python 3 (or doing anything similar to that), fixing the things that outright crash or obviously behave wrong is sometimes the <em>easy</em> part. 
Even if everything seems to be working fine on a simple test, it's certainly possible that subtler issues like this could be lurking in the background, causing unexpected failures or (possibly worse) subtly incorrect behaviour. And of course, that's just another reason to add to the big old "Why To Have A Really Good Test Suite" list!</p> <p>There's also a 'secondary moral', I guess, and that's this: predicting all the impacts of an interface change like this is hard. Remember the Python 3 docs mentioned a PEP associated with this change? Well, <a href="https://www.python.org/dev/peps/pep-3151/">here it is</a>. If you read it, it's clear the proposers actually put quite a lot of effort into thinking about how existing code might be affected by the change, but it looks like they still didn't consider a case like this. They talk about "Careless (or "naïve") code" which "blindly catches any of OSError, IOError, socket.error, mmap.error, WindowsError, select.error without checking the errno attribute", and about "Careful code is defined as code which, when catching any of the above exceptions, examines the errno attribute to determine the actual error condition and takes action depending on it" - and claim that "useful compatibility doesn't alter the behaviour of careful exception-catching code". However, Koji's code here clearly matches their definition of "careful" code - it considers both the exception's type, and the <code>errno</code> attribute, in making decisions - but because it is not just doing <code>except socket.error as e</code> or similar, but catching the exception elsewhere and then passing it to this function and using <code>isinstance</code>, it still gets tripped up by the change.</p> <p>So...the ur-moral, as always, is: software is hard!</p> AdamW's Debugging Adventures: Has Anyone Seen My Kernel? 
https://www.happyassassin.net/posts/2018/11/05/adamws-debugging-adventures-has-anyone-seen-my-kernel/ 2018-11-05T15:59:44Z 2018-11-05T15:59:44Z Adam Williamson <p></p><p>Welp, since I haven't blogged for a while, here's another root-causing write up! Honestly, root-causing things is one of my favorite parts of the job, lately.</p> <p>I've been on vacation recently, and when I came back, it looked like several things were wrong with Rawhide. Several of these were relatively easy fixes: <a href="https://pagure.io/fedora-kickstarts/c/0ed1396cb94161825756900224f179589914e250?branch=master">live images not behaving right at all</a> and <a href="https://bugzilla.redhat.com/show_bug.cgi?id=1644936">the installer not launching properly any more</a> both only took a couple of hours to track down. But there were also a couple of bugs causing more recent composes to fail entirely. The inimitable <a href="https://patrick.uiterwijk.org/">puiterwijk</a> dealt with one of those (aarch64 cloud images not building properly), and I wound up taking the other one: overnight, most live image composes had suddenly stopped working.</p> <h2>What's The Problem?</h2> <p>The starting point for this debugging odyssey was the error messages we got in the failed composes:</p> <pre><code>DEBUG pylorax.treebuilder: kernels=[] ERROR livemedia-creator: No kernels found, cannot rebuild_initrds </code></pre> <p>So, the first step is pretty easy: let's just go look up where those errors come from. pylorax and livemedia-creator are both part of the <a href="https://github.com/weldr/lorax">lorax</a> tree, so we'll start there. 
It's easy to use grep to find the sources of those two messages: they're both in <a href="https://github.com/weldr/lorax/blob/59d8c91ae5fae00d9b92411b720a245871a02c16/src/pylorax/treebuilder.py">treebuilder.py</a>, the first <a href="https://github.com/weldr/lorax/blob/59d8c91ae5fae00d9b92411b720a245871a02c16/src/pylorax/treebuilder.py#L390">here, in <code>findkernels()</code></a> and the second <a href="https://github.com/weldr/lorax/blob/59d8c91ae5fae00d9b92411b720a245871a02c16/src/pylorax/treebuilder.py#L287">here, in <code>TreeBuilder.rebuild_initrds()</code></a>. As the second happens if there is no <code>self.kernels</code>, and we can see <a href="https://github.com/weldr/lorax/blob/59d8c91ae5fae00d9b92411b720a245871a02c16/src/pylorax/treebuilder.py#L269">just a few lines further back</a> that <code>self.kernels</code> is a property based on a call to <code>findkernels()</code>, it's pretty obvious that ultimately what's going wrong here is that <code>findkernels()</code> isn't finding any kernels.</p> <h2>So...Why Aren't We Finding Any Kernels?</h2> <p>So next, of course, I put my thinking cap on and had a good look at <code>findkernels()</code> - not just at the code itself, but at its history. When something that was working breaks, you're always looking for a <em>change</em> that caused it. There were no recent changes to <code>findkernels()</code>, and I couldn't see anything obviously wrong in its implementation - it's basically looking for files named <code>vmlinuz-(something)</code> in <code>/boot</code> - so it didn't look like the quest was over yet.</p> <p>So at this point there were kinda two possibilities: <code>findkernels()</code> was broken, or it was working fine but <em>there really weren't any kernels to find</em>. I decided the easiest way to figure out which we were dealing with was just to reproduce the problem and take a look. 
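<p>For reference, the behaviour of <code>findkernels()</code> described above boils down to something like this hypothetical sketch (not pylorax's actual code):</p>

```python
import glob
import os

def findkernels_sketch(root="/"):
    # Hypothetical simplification of pylorax's findkernels(): collect the
    # version suffix of every vmlinuz-<version> file under <root>/boot.
    prefix = "vmlinuz-"
    paths = glob.glob(os.path.join(root, "boot", prefix + "*"))
    return sorted(os.path.basename(p)[len(prefix):] for p in paths)
```

If that returns an empty list, there are no kernels to rebuild initrds for - which is exactly the failure mode here.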
With any sufficiently complex problem, you can usually get some way into figuring it out just by looking at logs and code and thinking about things logically, but at some point you'll want to try and reproduce the problem in a controlled environment you can poke at - knowing when you're at that point isn't an exact science, but if you find yourself with multiple possibilities and no easy way to decide which you're dealing with, that's often a good indicator.</p> <p>Doing a test live image build is a <em>bit</em> awkward for Fedora at present, unfortunately, but not really too bad. I've been meaning to write a helper tool for a while, but never got around to it. The way I do it - because I want to replicate as close as possible how the official images are built - is just to try and do exactly what the build system does, only on my local system and by hand. What the build system does is spin up a mock root, install the live image creation tools and a flattened kickstart into it, and run the tool. So, that's what I do! A handy 'trick' here is just to look at the logs from a real live image creation task - like <a href="https://koji.fedoraproject.org/koji/taskinfo?taskID=30594219">this one</a> - and shadow what they do. Note these logs will get garbage collected at some point, but I can't do anything about that.</p> <p>The main log to look at is <a href="https://kojipkgs.fedoraproject.org//work/tasks/4219/30594219/root.log">root.log</a>. 
First, we see that the tools are installed to the mock:</p> <pre><code>DEBUG util.py:439: Marking packages as installed by the group: DEBUG util.py:439: @livemedia-build glibc-all-langpacks coreutils lorax-lmc-novirt DEBUG util.py:439: util-linux selinux-policy-targeted bash DEBUG util.py:439: shadow-utils </code></pre> <p>So we'll do just the same with our mock:</p> <pre><code>mock -r fedora-rawhide-x86_64 --install glibc-all-langpacks coreutils lorax-lmc-novirt util-linux selinux-policy-targeted bash shadow-utils </code></pre> <p>(Really all you want to do is install the <code>livemedia-build</code> group, but I haven't actually found an equivalent of <code>mock --groupinstall</code>, so I just do that).</p> <p>We can find the actual image compose command in the task log again, by looking for 'livemedia-creator' - it looks like this:</p> <pre><code>INFO backend.py:285: Running in chroot: ['/sbin/livemedia-creator', '--ks', '/chroot_tmpdir/koji-image-f30-build-30594219.ks', '--logfile', '/chroot_tmpdir/lmc-logs/livemedia-out.log', '--no-virt', '--resultdir', '/chroot_tmpdir/lmc', '--project', 'Fedora-KDE-Live', '--make-iso', '--volid', 'Fedora-KDE-Live-rawh-20181101.n.', '--iso-only', '--iso-name', 'Fedora-KDE-Live-x86_64-Rawhide-20181101.n.0.iso', '--releasever', 'Rawhide', '--title', 'Fedora-KDE-Live', '--macboot'] </code></pre> <p>we can easily turn that Python array into a console command by just replacing occurrences of <code>', '</code> with a space (and dropping the quotes and brackets):</p> <pre><code>/sbin/livemedia-creator --ks /chroot_tmpdir/koji-image-f30-build-30594219.ks --logfile /chroot_tmpdir/lmc-logs/livemedia-out.log --no-virt --resultdir /chroot_tmpdir/lmc --project Fedora-KDE-Live --make-iso --volid Fedora-KDE-Live-rawh-20181101.n. 
--iso-only --iso-name Fedora-KDE-Live-x86_64-Rawhide-20181101.n.0.iso --releasever Rawhide --title Fedora-KDE-Live --macboot </code></pre> <p>We can see that's using a scratch directory called <code>/chroot_tmpdir</code>, and a kickstart file called <code>koji-image-f30-build-30594219.ks</code>. This kickstart can be found as <a href="https://kojipkgs.fedoraproject.org//work/tasks/6109/30606109/koji-image-f30-build-30606109.ks">one of the task assets</a>, so we'll grab it and copy it into the mock's <code>/root</code> for now:</p> <pre><code>sudo cp koji-image-f30-build-30606109.ks /var/lib/mock/fedora-rawhide-x86_64/root/root </code></pre> <p>Then finally, we're going to get a shell in the mock root, using the old-style chroot implementation (this is necessary for live image builds to work, the new systemd-based implementation doesn't work yet):</p> <pre><code>mock -r fedora-rawhide-x86_64 --shell --old-chroot </code></pre> <p>Inside the mock, we'll create the <code>/chroot_tmpdir</code> scratch dir, copy the kickstart into it, and finally run the image creation (pointing <code>--ks</code> at the kickstart we just copied in):</p> <pre><code>mkdir -p /chroot_tmpdir cd /chroot_tmpdir cp /root/koji-image-f30-build-30606109.ks . /sbin/livemedia-creator --ks /chroot_tmpdir/koji-image-f30-build-30606109.ks --logfile /chroot_tmpdir/lmc-logs/livemedia-out.log --no-virt --resultdir /chroot_tmpdir/lmc --project Fedora-KDE-Live --make-iso --volid Fedora-KDE-Live-rawh-20181101.n. --iso-only --iso-name Fedora-KDE-Live-x86_64-Rawhide-20181101.n.0.iso --releasever Rawhide --title Fedora-KDE-Live --macboot </code></pre> <p>And when I did that, it worked away for a while - half an hour or so - and eventually failed exactly like the 'official' build had! So now I had a failure in a controlled environment (my little mock root) to look at. 
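<p>Incidentally, if you have a Python shell handy, the array-to-command conversion can be done more robustly with the standard library's <code>shlex.join()</code> (Python 3.8+), which also quotes any argument that actually needs it:</p>

```python
import shlex

# The argv list as logged by Koji (abbreviated here for illustration):
argv = ['/sbin/livemedia-creator', '--ks',
        '/chroot_tmpdir/koji-image-f30-build-30594219.ks',
        '--no-virt', '--project', 'Fedora-KDE-Live', '--make-iso']

# shlex.join() shell-quotes each argument as needed; these plain
# arguments come out joined with single spaces.
command = shlex.join(argv)
print(command)
```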
Note that if you're playing along at home, this will only work so long as you can grab that kickstart from Koji, and the 20181101.n.0 compose files are kept around, which will only be for another two weeks or so - after that you won't be able to reproduce this, but you can of course follow the same procedure with a newer Koji task if you want to reproduce a newer official live image build.</p> <p>Next, I needed to examine the actual filesystem produced by the image build process and see if it really had any kernels in it (remember, that's what we were trying to figure out). This requires a bit of knowledge about how livemedia-creator works, which you'd have had to look up if you didn't know it already: it creates an image file, loopback mounts it, and installs into the loopback-mounted directory. When it fails, it leaves the image file around, and you can just mount it again and poke around. The file will be in the <code>lmc/</code> subdirectory of the directory where the image build command was run, with a filename like <code>lmc-disk-ctfz98m5.img</code> (the alphanumeric soup bit is random), and we mount it like this:</p> <pre><code>mount -o loop lmc/lmc-disk-ctfz98m5.img /mnt/sysimage </code></pre> <p>(the mount point having been created and left around by the tool). Now we can look in <code>/mnt/sysimage/boot</code>, and when I did that...I found that, indeed, it contained no <code>vmlinuz-*</code> files at all! So, I had eliminated the possibility that <code>findkernels()</code> was going wrong: it was doing its job just fine, and it wasn't finding any kernels because...there were no kernels to find.</p> <h2>OK...So Why Aren't There Any Kernels?</h2> <p>So now I had to try and work out: <em>why</em> were there no kernels in the image's <code>/boot</code>? 
I knew from the logs of earlier, successful image composes that, in successful composes, there really were kernels to find: it wasn't that there had <em>never</em> been kernels, but this had only recently become fatal for some reason. The difference really was that there used to be kernels present when this lorax code ran, but now there weren't.</p> <p>This led me into a fun new piece of research: figuring out how kernel files get into <code>/boot</code> in a Fedora system at all. You might think - I did - that they're simply perfectly normal packaged files installed by the <code>kernel-core</code> package. But it turns out, they're not! The <code>kernel-core</code> package does own the relevant <code>/boot/vmlinuz-*</code> file, but it doesn't actually directly install it: it <a href="https://src.fedoraproject.org/rpms/kernel/blob/980b5751f77839480ec6f0c805b36e4ee1d9ebc4/f/kernel.spec#_1832">uses an RPM directive called <code>%ghost</code> to own it without installing it</a>. So the file must get installed some other way. Here again I cheated a bit with prior knowledge - I knew this overall mechanism existed, though I didn't know until now that it really installed the kernel file itself - but if you don't have that, you could look at <a href="https://src.fedoraproject.org/rpms/kernel/blob/980b5751f77839480ec6f0c805b36e4ee1d9ebc4/f/kernel.spec#_1746-1753">the %posttrans script in the kernel package</a>: when a kernel is installed, a command called <code>kernel-install</code> gets run.</p> <p>I also found out (by diffing the logged packages from the 20181030.n.0 and 20181101.n.0 live image composes) that the kernel itself had been updated in the 20181101.n.0 compose (which was when things started breaking). 
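<p>Diffing two composes' logged package sets like that is a quick job with a few lines of Python; a hypothetical sketch:</p>

```python
def diff_packages(old, new):
    # Compare two iterables of name-version-release strings, as found in
    # compose logs, and return what disappeared and what appeared.
    old_set, new_set = set(old), set(new)
    return sorted(old_set - new_set), sorted(new_set - old_set)

# Illustrative inputs only - not the real package lists from the composes:
removed, added = diff_packages(
    ["kernel-core-4.19.0-1.fc30", "systemd-239-3.fc29"],
    ["kernel-core-4.20.0-0.rc0.git7.1.fc30", "systemd-239-3.fc29"],
)
print(removed, added)
```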
So once again I had a couple of promising lines of inquiry: the new kernel, and this <code>kernel-install</code> path.</p> <p>Well, turns out systemd haters in the audience can get very excited, because kernel-install is part of systemd:</p> <pre><code>[adamw@adam lorax (master %)]$ rpm -qf `which kernel-install` systemd-udev-239-3.fc29.x86_64 </code></pre> <p>Anyway, I refreshed my memory a bit about what kernel-install does, but it's kinda complicated and it calls out to various other things, including <code>/usr/sbin/new-kernel-pkg</code> (part of grubby) and some plugins in <code>/usr/lib/kernel/install.d</code> (various of which come from systemd, grub2, and dracut). So I think what I did next (my memory gets a bit hazy) was to wonder whether the same problem would affect a regular install from the same packages.</p> <p>I got the last working Rawhide network install image, and set it to install from the Everything repo from the failed 20181101.n.0 compose. I let that install run, then checked that the <code>/boot/vmlinuz-(whatever)</code> file existed in the installed system...which it did. This sort of ruled out one theory I had: that the new kernel package had somehow messed something up such that the kernel file <em>never</em> actually got installed properly at all.</p> <p>So, I got to wondering whether <code>kernel-install</code> really <em>was</em> the thing that put the <code>/boot/vmlinuz-(whatever)</code> file in place (I still wasn't sure at this point), whether it <em>reproducibly</em> failed to do so in the live image creation environment but succeeded in doing so in the regular install environment, and if so, what differed between the two.</p> <p>I could see the exact <code>kernel-install</code> command just by examining the <code>kernel-core</code> package scripts:</p> <pre><code>rpm -q --scripts kernel-core | grep kernel-install /bin/kernel-install add 4.20.0-0.rc0.git7.1.fc30.x86_64 /lib/modules/4.20.0-0.rc0.git7.1.fc30.x86_64/vmlinuz || exit $? 
</code></pre> <p>So I tried just deleting /boot/vmlinuz* and re-running that command in the installed system...and sure enough, the file re-appeared! So now I was pretty sure <code>kernel-install</code> was the thing that's <em>supposed</em> to install it. I also tried doing this in my live image creation mock environment:</p> <pre><code>chroot /mnt/sysimage /bin/kernel-install add 4.20.0-0.rc0.git7.1.fc30.x86_64 /lib/modules/4.20.0-0.rc0.git7.1.fc30.x86_64/vmlinuz </code></pre> <p>...and sure enough, it <em>didn't</em> create the <code>/boot/vmlinuz-(foo)</code> file. So now I was narrowing in on the problem: something about the live image creation environment meant that this <code>kernel-install</code> invocation didn't install the file, when it probably should.</p> <h2>OK...So Why Isn't <code>kernel-install</code> Installing Kernels?</h2> <p>At this point I probably could've figured out the problem by reading the source if I'd read it carefully enough, but I decided to carry on with the practical experimentation. I tried running the script through <code>sh -x</code> in each environment, to see exactly what commands were run by the script in each case, and somehow - I forget how - I zeroed in on one of the <code>/usr/lib/kernel/install.d</code> plugin scripts: <code>/usr/lib/kernel/install.d/20-grub.install</code>. This is part of grub2. I think I found these scripts from the <code>sh -x</code> output, and noticed that this one has what looks like <a href="https://src.fedoraproject.org/rpms/grub2/blob/master/f/20-grub.install#_55-65">the code to actually install the kernel file to /boot</a>. So I made <em>that</em> script run with <code>-x</code> as well, and this finally got me my next big breakthrough. 
In the installed system I could see that script doing a lot of stuff, but in the live environment it seemed to exit almost as soon as it started:</p> <pre><code>+ [[ -n '' ]] + exit 0 </code></pre> <p>It's not 100% obvious, but I was pretty sure that just meant it was failing in the test right at the start:</p> <pre><code>if ! [[ $KERNEL_INSTALL_MACHINE_ID ]]; then exit 0 fi </code></pre> <p>So I went and <a href="https://www.google.com/search?q=KERNEL_INSTALL_MACHINE_ID+fedora&amp;ie=utf-8&amp;oe=utf-8&amp;client=firefox-b-ab">looked up <code>$KERNEL_INSTALL_MACHINE_ID</code></a> and the references suggested that it's basically tied to <code>/etc/machine-id</code>. So I looked, and, lo and behold, in the regular installed system, that file contained a random alphanumeric string, but in the live image creation environment, the file was empty! This sure looked suspicious.</p> <p>I read through some references on the file, and found that it's usually meant to get created by <a href="https://src.fedoraproject.org/rpms/systemd/blob/71e781a09623e04dc1006dcba12838578595d70f/f/systemd.spec#_484">a call to <code>/usr/bin/systemd-machine-id-setup</code> in systemd package scripts</a>. So I tried running <code>systemd-machine-id-setup</code> in the live image creation environment, and suddenly the file got a random string, and when I ran <code>kernel-install</code> again, the kernel file <em>did</em> get installed to <code>/boot</code>!</p> <h2>OK...So Why Isn't There A Machine ID?</h2> <p>So now I'd moved along to a new problem again: why was <code>/etc/machine-id</code> empty when the kernel %post script ran, but if I tried to generate it again, it worked? Was the initial generation failing? Was it happening too late? 
Was it working, but the file getting touched by something else?</p> <p>Again, I looked at quite a lot of code to try and figure it out - there's code that at least <em>could</em> touch <code>/etc/machine-id</code> in anaconda, for instance, and in lorax - but in the end I decided to go with practical experimentation again. So I did a custom scratch build of systemd to try and find out what happened when the <code>%post</code> script ran. I changed the command from this:</p> <pre><code>systemd-machine-id-setup &amp;&gt;/dev/null || : </code></pre> <p>to this:</p> <pre><code>systemd-machine-id-setup &gt; /tmp/smids.log 2&gt;&amp;1 cat /etc/machine-id &gt;&gt; /tmp/smids.log </code></pre> <p>pulled that build into a side repository, edited the live kickstart to use that side repository, and re-ran the live image creation. And that hit paydirt in a big way, because in smids.log I saw this:</p> <pre><code>systemd-machine-id-setup: error while loading shared libraries: libssl.so.1.1: cannot open shared object file: No such file or directory </code></pre> <p>...and here was the ultimate solution to our mystery! The attempt to set the machine-id in systemd <code>%post</code> was failing because it needs libssl, but it obviously wasn't present yet. libssl is part of <code>openssl-libs</code>, but the systemd spec did not specify that its <code>%post</code> script needs <code>openssl-libs</code> installed. What I surmise had happened was that up until 20181030, some other dependency in some other package happened to mean that dnf would always choose to install <code>openssl-libs</code> before installing <code>systemd</code>, so no-one had ever noticed this missing dependency...but on 20181101, some change to some package caused dnf to start installing systemd <em>before</em> openssl-libs, and suddenly, this problem showed up. 
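<p>Putting the pieces together: the <code>20-grub.install</code> gate means an <em>empty</em> <code>/etc/machine-id</code> (not just a missing one) silently disables kernel installation, since bash's <code>[[ $KERNEL_INSTALL_MACHINE_ID ]]</code> test is false for an empty string. A hypothetical sketch of the equivalent check:</p>

```python
def machine_id_usable(path="/etc/machine-id"):
    # kernel-install derives KERNEL_INSTALL_MACHINE_ID from /etc/machine-id;
    # the 20-grub.install plugin exits immediately when that value is unset
    # or empty - the same outcome as this function returning False.
    try:
        with open(path) as f:
            return bool(f.read().strip())
    except FileNotFoundError:
        return False
```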
So - as is very often the case - once I'd finally managed to nail down the problem, the <a href="https://src.fedoraproject.org/rpms/systemd/c/71e781a09623e04dc1006dcba12838578595d70f?branch=master">fix was simple</a>: we just add the missing dependency to systemd, so that openssl-libs will always be installed before systemd's <code>%post</code> is run. With that fix, generation of <code>/etc/machine-id</code> will succeed again, and so the plugin script that installs the kernel file to <code>/boot</code> won't bail early, and so there will <em>be</em> a kernel file in <code>/boot</code>, and lorax won't fail when it tries to regenerate initrds because there aren't any kernels present!</p> <p>...and so ends another exciting episode of AdamW's Debugging Adventures :)</p> <p></p> <p></p><p>Welp, since I haven't blogged for a while, here's another root-causing write up! Honestly, root-causing things is one of my favorite parts of the job, lately.</p> <p>I've been on vacation recently, and when I came back, it looked like several things were wrong with Rawhide. Several of these were relatively easy fixes: <a href="https://pagure.io/fedora-kickstarts/c/0ed1396cb94161825756900224f179589914e250?branch=master">live images not behaving right at all</a> and <a href="https://bugzilla.redhat.com/show_bug.cgi?id=1644936">the installer not launching properly any more</a> both only took a couple of hours to track down. But there were also a couple of bugs causing more recent composes to fail entirely. 
The inimitable <a href="https://patrick.uiterwijk.org/">puiterwijk</a> dealt with one of those (aarch64 cloud images not building properly), and I wound up taking the other one: overnight, most live image composes had suddenly stopped working.</p> <h2>What's The Problem?</h2> <p>The starting point for this debugging odyssey was the error messages we got in the failed composes:</p> <pre><code>DEBUG pylorax.treebuilder: kernels=[] ERROR livemedia-creator: No kernels found, cannot rebuild_initrds </code></pre> <p>So, the first step is pretty easy: let's just go look up where those errors come from. pylorax and livemedia-creator are both part of the <a href="https://github.com/weldr/lorax">lorax</a> tree, so we'll start there. It's easy to use grep to find the sources of those two messages: they're both in <a href="https://github.com/weldr/lorax/blob/59d8c91ae5fae00d9b92411b720a245871a02c16/src/pylorax/treebuilder.py">treebuilder.py</a>, the first <a href="https://github.com/weldr/lorax/blob/59d8c91ae5fae00d9b92411b720a245871a02c16/src/pylorax/treebuilder.py#L390">here, in <code>findkernels()</code></a> and the second <a href="https://github.com/weldr/lorax/blob/59d8c91ae5fae00d9b92411b720a245871a02c16/src/pylorax/treebuilder.py#L287">here, in <code>TreeBuilder.rebuild_initrds()</code></a>. As the second happens if there is no <code>self.kernels</code>, and we we can see <a href="https://github.com/weldr/lorax/blob/59d8c91ae5fae00d9b92411b720a245871a02c16/src/pylorax/treebuilder.py#L269">just a few lines further back</a> that <code>self.kernels</code> is a property based on a call to <code>findkernels()</code>, it's pretty obvious that ultimately what's going wrong here is that <code>findkernels()</code> isn't finding any kernels.</p> <h2>So...Why Aren't We Finding Any Kernels?</h2> <p>So next, of course, I put my thinking cap on and had a good look at <code>findkernels()</code> - not just at the code itself, but at its history. 
When something that was working breaks, you're always looking for a <em>change</em> that caused it. There were no recent changes to <code>findkernels()</code>, and I couldn't see anything obviously wrong in its implementation - it's basically looking for files named <code>vmlinuz-(something)</code> in <code>/boot</code> - so it didn't look like the quest was over yet.</p> <p>So at this point there were kinda two possibilities: <code>findkernels()</code> was broken, or it was working fine but <em>there really weren't any kernels to find</em>. I decided the easiest way to figure out which we were dealing with was just to reproduce the problem and take a look. With any sufficiently complex problem, you can usually get some way into figuring it out just by looking and logs and code and thinking about things logically, but at some point you'll want to try and reproduce the problem in a controlled environment you can poke at - knowing when you're at that point isn't an exact science, but if you find yourself with multiple possibilities and no easy way to decide which you're dealing with, that's often a good indicator.</p> <p>Doing a test live image build is a <em>bit</em> awkward for Fedora at present, unfortunately, but not really too bad. I've been meaning to write a helper tool for a while, but never got around to it. The way I do it - because I want to replicate as close as possible how the official images are built - is just to try and do exactly what the build system does, only on my local system and by hand. What the build system does is spin up a mock root, install the live image creation tools and a flattened kickstart into it, and run the tool. So, that's what I do! A handy 'trick' here is just to look at the logs from a real live image creation task - like <a href="https://koji.fedoraproject.org/koji/taskinfo?taskID=30594219">this one</a> - and shadow what they do. 
Note these logs will get garbage collected at some point, but I can't do anything about that.</p> <p>The main log to look at is <a href="https://kojipkgs.fedoraproject.org//work/tasks/4219/30594219/root.log">root.log</a>. First, we see that the tools are installed to the mock:</p> <pre><code>DEBUG util.py:439: Marking packages as installed by the group:
DEBUG util.py:439:  @livemedia-build glibc-all-langpacks coreutils lorax-lmc-novirt
DEBUG util.py:439:  util-linux selinux-policy-targeted bash
DEBUG util.py:439:  shadow-utils
</code></pre> <p>So we'll do just the same with our mock:</p> <pre><code>mock -r fedora-rawhide-x86_64 --install glibc-all-langpacks coreutils lorax-lmc-novirt util-linux selinux-policy-targeted bash shadow-utils
</code></pre> <p>(Really all you want to do is install the <code>livemedia-build</code> group, but I haven't actually found an equivalent of <code>mock --groupinstall</code>, so I just do that).</p> <p>We can find the actual image compose command in the task log again, by looking for 'livemedia-creator' - it looks like this:</p> <pre><code>INFO backend.py:285: Running in chroot: ['/sbin/livemedia-creator', '--ks', '/chroot_tmpdir/koji-image-f30-build-30594219.ks', '--logfile', '/chroot_tmpdir/lmc-logs/livemedia-out.log', '--no-virt', '--resultdir', '/chroot_tmpdir/lmc', '--project', 'Fedora-KDE-Live', '--make-iso', '--volid', 'Fedora-KDE-Live-rawh-20181101.n.', '--iso-only', '--iso-name', 'Fedora-KDE-Live-x86_64-Rawhide-20181101.n.0.iso', '--releasever', 'Rawhide', '--title', 'Fedora-KDE-Live', '--macboot']
</code></pre> <p>We can easily turn that Python array into a console command by just replacing occurrences of <code>', '</code> with <code>' '</code> (a single space) and trimming the brackets and quotes at either end:</p> <pre><code>/sbin/livemedia-creator --ks /chroot_tmpdir/koji-image-f30-build-30594219.ks --logfile /chroot_tmpdir/lmc-logs/livemedia-out.log --no-virt --resultdir /chroot_tmpdir/lmc --project Fedora-KDE-Live --make-iso --volid Fedora-KDE-Live-rawh-20181101.n. --iso-only --iso-name Fedora-KDE-Live-x86_64-Rawhide-20181101.n.0.iso --releasever Rawhide --title Fedora-KDE-Live --macboot
</code></pre> <p>We can see that's using a scratch directory called <code>/chroot_tmpdir</code>, and a kickstart file called <code>koji-image-f30-build-30594219.ks</code>. This kickstart can be found as <a href="https://kojipkgs.fedoraproject.org//work/tasks/6109/30606109/koji-image-f30-build-30606109.ks">one of the task assets</a>, so we'll grab it and copy it into the mock's <code>/root</code> for now:</p> <pre><code>sudo cp koji-image-f30-build-30606109.ks /var/lib/mock/fedora-rawhide-x86_64/root/root
</code></pre> <p>Then finally, we're going to get a shell in the mock root, using the old-style chroot implementation (this is necessary for live image builds to work; the new systemd-based implementation doesn't work yet):</p> <pre><code>mock -r fedora-rawhide-x86_64 --shell --old-chroot
</code></pre> <p>Inside the mock, we'll create the <code>/chroot_tmpdir</code> scratch dir, copy the kickstart into it, and finally run the image creation:</p> <pre><code>mkdir -p /chroot_tmpdir
cd /chroot_tmpdir
cp /root/koji-image-f30-build-30606109.ks .
/sbin/livemedia-creator --ks /chroot_tmpdir/koji-image-f30-build-30606109.ks --logfile /chroot_tmpdir/lmc-logs/livemedia-out.log --no-virt --resultdir /chroot_tmpdir/lmc --project Fedora-KDE-Live --make-iso --volid Fedora-KDE-Live-rawh-20181101.n. --iso-only --iso-name Fedora-KDE-Live-x86_64-Rawhide-20181101.n.0.iso --releasever Rawhide --title Fedora-KDE-Live --macboot
</code></pre> <p>And when I did that, it worked away for a while - half an hour or so - and eventually failed exactly like the 'official' build had! So now I had a failure in a controlled environment (my little mock root) to look at.
Note that if you're playing along at home, this will only work so long as you can grab that kickstart from Koji, and the 20181101.n.0 compose files are kept around, which will only be for another two weeks or so - after that you won't be able to reproduce this, but you can of course follow the same procedure with a newer Koji task if you want to reproduce a newer official live image build.</p> <p>Next, I needed to examine the actual filesystem produced by the image build process and see if it really had any kernels in it (remember, that's what we were trying to figure out). This requires a bit of knowledge about how livemedia-creator works, which you'd have had to look up if you didn't know it already: it creates an image file, loopback mounts it, and installs into the loopback-mounted directory. When it fails, it leaves the image file around, and you can just mount it again and poke around. The file will be in the <code>lmc/</code> subdirectory of the directory where the image build command was run, with a filename like <code>lmc-disk-ctfz98m5.img</code> (the alphanumeric soup bit is random), and we mount it like this:</p> <pre><code>mount -o loop lmc/lmc-disk-ctfz98m5.img /mnt/sysimage </code></pre> <p>(the mount point having been created and left around by the tool). Now we can look in <code>/mnt/sysimage/boot</code>, and when I did that...I found that, indeed, it contained no <code>vmlinuz-*</code> files at all! So, I had eliminated the possibility that <code>findkernels()</code> was going wrong: it was doing its job just fine, and it wasn't finding any kernels because...there were no kernels to find.</p> <h2>OK...So Why Aren't There Any Kernels?</h2> <p>So now I had to try and work out: <em>why</em> were there no kernels in the image's <code>/boot</code>? 
I knew from the logs of earlier, successful image composes that, in successful composes, there really were kernels to find: it wasn't that there had <em>never</em> been kernels, but this had only recently become fatal for some reason. The difference really was that there used to be kernels present when this lorax code ran, but now there weren't.</p> <p>This led me into a fun new piece of research: figuring out how kernel files get into <code>/boot</code> in a Fedora system at all. You might think - I did - that they're simply perfectly normal packaged files installed by the <code>kernel-core</code> package. But it turns out, they're not! The <code>kernel-core</code> package does own the relevant <code>/boot/vmlinuz-*</code> file, but it doesn't actually directly install it: it <a href="https://src.fedoraproject.org/rpms/kernel/blob/980b5751f77839480ec6f0c805b36e4ee1d9ebc4/f/kernel.spec#_1832">uses an RPM directive called <code>%ghost</code> to own it without installing it</a>. So the file must get installed some other way. Here again I cheated a bit with prior knowledge - I knew this overall mechanism existed, though I didn't know until now that it really installed the kernel file itself - but if you don't have that, you could look at <a href="https://src.fedoraproject.org/rpms/kernel/blob/980b5751f77839480ec6f0c805b36e4ee1d9ebc4/f/kernel.spec#_1746-1753">the %posttrans script in the kernel package</a>: when a kernel is installed, a command called <code>kernel-install</code> gets run.</p> <p>I also found out (by diffing the logged packages from the 20181030.n.0 and 20181101.n.0 live image composes) that the kernel itself had been updated in the 20181101.n.0 compose (which was when things started breaking). 
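</p> <p>(If you're curious, that sort of package-list diff is a one-minute script. Here's a rough sketch of the idea - the filenames are made up for illustration; you'd dump one package name per line from each compose's logs.)</p>

```python
def read_pkgs(path):
    """Read a plain-text package list, one package per line."""
    with open(path) as f:
        return {line.strip() for line in f if line.strip()}

def diff_pkgs(old_path, new_path):
    """Return (packages only in the new list, packages only in the old list)."""
    old, new = read_pkgs(old_path), read_pkgs(new_path)
    return sorted(new - old), sorted(old - new)

# Hypothetical usage, with lists dumped from the two compose logs:
# added, dropped = diff_pkgs("pkgs-20181030.n.0.txt", "pkgs-20181101.n.0.txt")
```

<p>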
So once again I had a couple of promising lines of inquiry: the new kernel, and this <code>kernel-install</code> path.</p> <p>Well, turns out systemd haters in the audience can get very excited, because kernel-install is part of systemd:</p> <pre><code>[adamw@adam lorax (master %)]$ rpm -qf `which kernel-install`
systemd-udev-239-3.fc29.x86_64
</code></pre> <p>Anyway, I refreshed my memory a bit about what kernel-install does, but it's kinda complicated and it calls out to various other things, including <code>/usr/sbin/new-kernel-pkg</code> (part of grubby) and some plugins in <code>/usr/lib/kernel/install.d</code> (various of which come from systemd, grub2, and dracut). So I think what I did next (my memory gets a bit hazy) was to wonder whether the same problem would affect a regular install from the same packages.</p> <p>I got the last working Rawhide network install image, and set it to install from the Everything repo from the failed 20181101.n.0 compose. I let that install run, then checked that the <code>/boot/vmlinuz-(whatever)</code> file existed in the installed system...which it did. This ruled out one theory I had: that the new kernel package had somehow messed something up such that the kernel file <em>never</em> actually got installed properly at all.</p> <p>So, I got to wondering whether <code>kernel-install</code> really <em>was</em> the thing that put the <code>/boot/vmlinuz-(whatever)</code> file in place (I still wasn't sure at this point), whether it <em>reproducibly</em> failed to do so in the live image creation environment but succeeded in doing so in the regular install environment, and if so, what differed between the two.</p> <p>I could see the exact <code>kernel-install</code> command just by examining the <code>kernel-core</code> package scripts:</p> <pre><code>rpm -q --scripts kernel-core | grep kernel-install
/bin/kernel-install add 4.20.0-0.rc0.git7.1.fc30.x86_64 /lib/modules/4.20.0-0.rc0.git7.1.fc30.x86_64/vmlinuz || exit $?
</code></pre> <p>So I tried just deleting /boot/vmlinuz* and re-running that command in the installed system...and sure enough, the file re-appeared! So now I was pretty sure <code>kernel-install</code> was the thing that's <em>supposed</em> to install it. I also tried doing this in my live image creation mock environment:</p> <pre><code>chroot /mnt/sysimage /bin/kernel-install add 4.20.0-0.rc0.git7.1.fc30.x86_64 /lib/modules/4.20.0-0.rc0.git7.1.fc30.x86_64/vmlinuz
</code></pre> <p>...and sure enough, it <em>didn't</em> create the <code>/boot/vmlinuz-(foo)</code> file. So now I was narrowing in on the problem: something about the live image creation environment meant that this <code>kernel-install</code> invocation didn't install the file, when it probably should.</p> <h2>OK...So Why Isn't <code>kernel-install</code> Installing Kernels?</h2> <p>At this point I probably could've figured out the problem by reading the source if I'd read it carefully enough, but I decided to carry on with the practical experimentation. I tried running the script through <code>sh -x</code> in each environment, to see exactly what commands were run by the script in each case, and somehow - I forget how - I zeroed in on one of the <code>/usr/lib/kernel/install.d</code> plugin scripts: <code>/usr/lib/kernel/install.d/20-grub.install</code>. This is part of grub2. I think I found these scripts from the <code>sh -x</code> output, and noticed that this one has what looks like <a href="https://src.fedoraproject.org/rpms/grub2/blob/master/f/20-grub.install#_55-65">the code to actually install the kernel file to /boot</a>. So I made <em>that</em> script run with <code>-x</code> as well, and this finally got me my next big breakthrough.
In the installed system I could see that script doing a lot of stuff, but in the live environment it seemed to exit almost as soon as it started:</p> <pre><code>+ [[ -n '' ]]
+ exit 0
</code></pre> <p>It's not 100% obvious, but I was pretty sure that just meant it was failing in the test right at the start:</p> <pre><code>if ! [[ $KERNEL_INSTALL_MACHINE_ID ]]; then
    exit 0
fi
</code></pre> <p>So I went and <a href="https://www.google.com/search?q=KERNEL_INSTALL_MACHINE_ID+fedora&amp;ie=utf-8&amp;oe=utf-8&amp;client=firefox-b-ab">looked up <code>$KERNEL_INSTALL_MACHINE_ID</code></a> and the references suggested that it's basically tied to <code>/etc/machine-id</code>. So I looked, and, lo and behold, in the regular installed system, that file contained a random alphanumeric string, but in the live image creation environment, the file was empty! This sure looked suspicious.</p> <p>I read through some references on the file, and found that it's usually meant to get created by <a href="https://src.fedoraproject.org/rpms/systemd/blob/71e781a09623e04dc1006dcba12838578595d70f/f/systemd.spec#_484">a call to <code>/usr/bin/systemd-machine-id-setup</code> in systemd package scripts</a>. So I tried running <code>systemd-machine-id-setup</code> in the live image creation environment, and suddenly the file got a random string, and when I ran <code>kernel-install</code> again, the kernel file <em>did</em> get installed to <code>/boot</code>!</p> <h2>OK...So Why Isn't There A Machine ID?</h2> <p>So now I'd moved along to a new problem again: why was <code>/etc/machine-id</code> empty when the kernel %post script ran, but if I tried to generate it again, it worked? Was the initial generation failing? Was it happening too late?
Was it working, but the file getting touched by something else?</p> <p>Again, I looked at quite a lot of code to try and figure it out - there's code that at least <em>could</em> touch <code>/etc/machine-id</code> in anaconda, for instance, and in lorax - but in the end I decided to go with practical experimentation again. So I did a custom scratch build of systemd to try and find out what happened when the <code>%post</code> script ran. I changed the command from this:</p> <pre><code>systemd-machine-id-setup &amp;&gt;/dev/null || :
</code></pre> <p>to this:</p> <pre><code>systemd-machine-id-setup &gt; /tmp/smids.log 2&gt;&amp;1
cat /etc/machine-id &gt;&gt; /tmp/smids.log
</code></pre> <p>pulled that build into a side repository, edited the live kickstart to use that side repository, and re-ran the live image creation. And that hit paydirt in a big way, because in smids.log I saw this:</p> <pre><code>systemd-machine-id-setup: error while loading shared libraries: libssl.so.1.1: cannot open shared object file: No such file or directory
</code></pre> <p>...and here was the ultimate solution to our mystery! The attempt to set the machine-id in systemd <code>%post</code> was failing because it needs libssl, but it obviously wasn't present yet. libssl is part of <code>openssl-libs</code>, but the systemd spec did not specify that its <code>%post</code> script needs <code>openssl-libs</code> installed. What I surmise had happened was that up until 20181030, some other dependency in some other package happened to mean that dnf would always choose to install <code>openssl-libs</code> before installing <code>systemd</code>, so no-one had ever noticed this missing dependency...but on 20181101, some change to some package caused dnf to start installing systemd <em>before</em> openssl-libs, and suddenly, this problem showed up.
So - as is very often the case - once I'd finally managed to nail down the problem, the <a href="https://src.fedoraproject.org/rpms/systemd/c/71e781a09623e04dc1006dcba12838578595d70f?branch=master">fix was simple</a>: we just add the missing dependency to systemd, so that openssl-libs will always be installed before systemd's <code>%post</code> is run. With that fix, generation of <code>/etc/machine-id</code> will succeed again, and so the plugin script that installs the kernel file to <code>/boot</code> won't bail early, and so there will <em>be</em> a kernel file in <code>/boot</code>, and lorax won't fail when it tries to regenerate initrds because there aren't any kernels present!</p> <p>...and so ends another exciting episode of AdamW's Debugging Adventures :)</p> <p></p> Adam's Debugging Adventures: The Immutable Mutable Object https://www.happyassassin.net/posts/2018/06/27/adams-debugging-adventures-the-immutable-mutable-object/ 2018-06-27T16:03:50Z 2018-06-27T16:03:50Z Adam Williamson <p></p><p>Here's a puzzle for you, Python fans:</p> <pre><code>[adamw@adam dnf (master %)]$ python3
Python 3.6.5 (default, Apr 23 2018, 22:53:50)
[GCC 8.0.1 20180410 (Red Hat 8.0.1-0.21)] on linux
Type "help", "copyright", "credits" or "license" for more information.
&gt;&gt;&gt; from dnf.conf import MainConf
&gt;&gt;&gt; mc = MainConf()
&gt;&gt;&gt; print(mc.group_package_types)
['mandatory', 'default', 'conditional']
&gt;&gt;&gt; mc.group_package_types.append('foo')
&gt;&gt;&gt; print(mc.group_package_types)
['mandatory', 'default', 'conditional']
&gt;&gt;&gt;
</code></pre> <p>Note: if you want to try reproducing this...make sure you use DNF 3. It works as expected with DNF &lt; 3. That's why it just showed up as a problem.</p> <p>Before I explain what's going on there...let's unpick the problem a bit for non-Pythonistas.</p> <p>In Python (and in many other languages) some things - objects - are 'mutable', and some are 'immutable'. A 'mutable' object can be changed.
An 'immutable' object cannot.</p> <p>In the Python session above, we create an object, <code>mc</code>, which is an instance of the class <code>MainConf</code> (don't worry if you don't entirely understand that, it's not compulsory). We then examine one <em>attribute</em> of <code>mc</code>: <code>mc.group_package_types</code>. Python tells us that this is <code>['mandatory', 'default', 'conditional']</code> - which is a Python list containing the values <code>'mandatory'</code>, <code>'default'</code> and <code>'conditional'</code>.</p> <p>In Python, lists are 'mutable'. That means you can take a list and change it somehow - remove an item from it, add an item to it, re-order it - and it's <em>still the same object</em>. Any existing reference to the object is still valid, and now refers to the changed list.</p> <p>For comparison, an example of an 'immutable' object is a string. If you do this:</p> <pre><code>mystring = "here's some text"
</code></pre> <p>you can't then change the actual string object referenced by <code>mystring</code> there. It has no methods that let you change it in any way, and there are no functions that can operate on it and change it. You can do this:</p> <pre><code>mystring = "here's some text"
mystring = "here's some different text"
mystring = mystring.replace("some", "some more")
</code></pre> <p>and at each step the contents of the string to which the name <code>mystring</code> refers are different - but also, at each step <em><code>mystring</code> refers to a DIFFERENT object</em>. (That's why the <code>replace()</code> string method returns a new string - it <em>can't</em> mutate the existing string).
So if you did this:</p> <pre><code>mystring = "here's some text"
otherref = mystring
mystring = "here's some different text"
</code></pre> <p>then at the end, <code>otherref</code> still points to the first-created string object and its value is still <code>"here's some text"</code>, while <code>mystring</code> points to a <em>new</em> string object and its value is <code>"here's some different text"</code>. Let's compare with a similar case using a mutable object:</p> <pre><code>mylist = [1, 2, 3]
otherref = mylist
mylist.append(4)
print(mylist)
print(otherref)
</code></pre> <p>In this case, when we get to the end, both <code>mylist</code> and <code>otherref</code> are still pointing to the <em>same</em> object, the original object, and both <code>print</code>s will print <code>[1, 2, 3, 4]</code>. No <em>new</em> list object was created at any point after the initial creation of <code>mylist</code>.</p> <p>So with that understood, take a look back at the original example, and maybe you can see why it's so weird. By all appearances, it looks like we have a pretty simple scenario here: we have an object that has an attribute which is a list, and we just append a value to that list. Then we go look at its value again, and it...hasn't changed at all? But we didn't get any kind of <em>crash</em>, or <em>error</em>, or anything. Our <code>append</code> call returned apparently successfully. It just...didn't seem to change anything. The list is an immutable mutable object!</p> <p>This is a real problem in real code: it <a href="https://pagure.io/dusty/failed-composes/issue/481">broke the most recent Fedora Rawhide compose</a>. So, obviously, I was pretty keen to figure out what the hell was going on, here! It turns out that it's down to dnf getting clever (arguably over-clever).</p> <p>Python's a very...flexible language.
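</p> <p>(You can make the 'same object or not?' question completely explicit with <code>is</code> comparisons - a standalone demo of my own, nothing dnf-specific yet.)</p>

```python
# Immutable: replace() hands back a brand new string object.
mystring = "here's some text"
old = mystring
mystring = mystring.replace("some", "some more")
assert mystring is not old           # a different object...
assert old == "here's some text"     # ...while the original is untouched

# Mutable: append() changes the list in place - same object throughout.
mylist = [1, 2, 3]
alias = mylist
mylist.append(4)
assert mylist is alias               # still the very same object
assert alias == [1, 2, 3, 4]         # so every reference sees the change
```

<p>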
The key to the problem here turned out to be exactly what happens when we actually look up the <code>group_package_types</code> attribute of <code>mc</code>, the <code>dnf.conf.MainConf</code> instance.</p> <p>Getting and setting attributes of objects in Python is usually a kind of core operation that you never really think about, you just expect it to work in the 'usual way'. A simple approximation of how it works is that the object has a Python <code>dict</code> (like a 'hash' in some languages - a key/value store, more or less) whose keys are attribute names and whose values are the attribute values. When you ask for an attribute of an object, Python checks if its name is one of the keys in that object's dict, and if it is, returns the value. If it's not, Python raises an error. When you <em>set</em> an attribute, Python stuffs it into the dict.</p> <p>But since Python's flexible, it provides some mechanisms to let you mess around with this stuff, if you want to. You can define <code>__setattr__</code>, <code>__getattr__</code> and/or <code>__getattribute__</code> methods in a class, and they'll affect this behaviour.</p> <p>The base <code>object</code> class that almost all Python classes inherit from defines the default <code>__setattr__</code> and <code>__getattribute__</code>, which work sort of like the approximation I gave above. If you override <code>__setattr__</code> in a class, then when something tries to set an attribute for an instance of that class, that method will get called instead of the default <code>object.__setattr__</code>. 
If you override <code>__getattribute__</code>, then that method will get called instead of <code>object.__getattribute__</code> when something tries to <em>look up</em> an attribute of an instance of that class.</p> <p>If you leave <code>__getattribute__</code> alone but define <code>__getattr__</code>, then when something tries to look up an attribute, first the stock <code>object.__getattribute__</code> will be used to try and look it up, but if that doesn't find it, rather than raising an exception immediately, Python will try your <code>__getattr__</code> method.</p> <p>We can actually override <code>__setattr__</code> and <code>__getattr__</code> to do a very simplified demonstration of how the default implementation usually works:</p> <pre><code>#!/bin/python3

class TestClass(object):
    def __init__(self):
        self.__dict__["_attrs"] = {}

    def __setattr__(self, name, value):
        print("Hi, setattr here, just settin' attrs...")
        self._attrs[name] = value

    def __getattr__(self, name):
        print("Hi, getattr here, here's your attr!")
        return self._attrs[name]

tc = TestClass()
tc.foo = [1, 2, 3]
print(tc.foo)
tc.foo.append(4)
print(tc.foo)
</code></pre> <p>Note that <code>__dict__</code> is the store that <code>object.__getattribute__</code> uses, so that's why we set up our backing store with <code>self.__dict__["_attrs"] = {}</code> - that ensures that when we look up <code>self._attrs</code>, we will find it via <code>__getattribute__</code>. We can't just do <code>self._attrs = {}</code> because then we wind up in an infinite recursion in our <code>__getattr__</code>.</p> <p>If you save that and run it, you'll see <code>[1, 2, 3]</code> then <code>[1, 2, 3, 4]</code> (plus the messages that prove our new methods are being used). Our mutable attribute is nice and properly mutable. We can append things to it and everything.
Notice that when we append the value, we hit <code>__getattr__</code> but not <code>__setattr__</code>.</p> <p>So, how does this manage <em>not</em> to work with dnf config classes? (<code>MainConf</code> is a subclass of <code>BaseConfig</code>, and so are a ton of other config-related classes in dnf - we actually encountered this bug with another subclass, <code>RepoConf</code>). It turns out to be because dnf overrides <code>BaseConfig.__setattr__</code> and <code>BaseConfig.__getattr__</code> to do some "clever" stuff, and it breaks this!</p> <p>We don't need to go into what its <code>__setattr__</code> does in detail, except to note one thing: it <em>doesn't</em> store the values in the <code>__dict__</code> store, so <code>object.__getattribute__</code> can never find them. When looking up any attribute on an instance of one of these classes except <code>_config</code> (which is the store the class' <code>__setattr__</code> and <code>__getattr__</code> methods themselves use, just like <code>_attrs</code> in our example, and is created directly in <code>__dict__</code> in the same way), we always wind up in the class's <code>__getattr__</code>.</p> <p>Here's the whole of <a href="https://github.com/rpm-software-management/dnf/blob/3.0.1/dnf/conf/config.py#L198-L210">current dnf's <code>BaseConfig.__getattr__</code></a>:</p> <pre><code>def __getattr__(self, name):
    option = getattr(self._config, name, None)
    if option is None:
        return None
    try:
        value = option().getValue()
    except Exception as ex:
        return None
    if isinstance(value, cfg.VectorString):
        return list(value)
    if isinstance(value, str):
        return ucd(value)
    return value
</code></pre> <p>There is some more stuff going on in the background here that we don't need to worry about <em>too</em> much (a feature of DNF, I have found, is that it has layers upon layers. It contains multitudes.
You usually can't debug anything in DNF without going through at least eight levels of things calling other things that turn out to be yet other things that turn out to be written in C++ just cuz.) In the case of the <code>group_package_types</code> option, and also the option we were actually dealing with in the buggy case (the <code>baseurl</code> option for a repo), the option is basically a list-y type, so we wind up in the <code>if isinstance(value, cfg.VectorString):</code> branch here.</p> <p>So if you follow it through, when we asked for <code>mc.group_package_types</code>, <em>unlike</em> in the default Python implementation or our simplified example, we didn't just pull an existing list object out from some sekrit store in the <code>mc</code> object. No. We got some kind of object (fact fans: it's a <code>libdnf.conf.OptionStringList</code> - libdnf is the C++ bit I mentioned earlier...) out of the <code>self._config</code> dict that's acting as our sort-of attribute store, and ran its <code>getValue</code> method to get some <em>other</em> sort of object (fact fans: it's a <code>libdnf.conf.VectorString</code>), then we ran <code>list()</code> on <em>that</em> object, and returned <em>that</em>.</p> <p>The problem is that the thing that gets returned is basically a temporary copy of the 'real' backing value. It's a mutable object - it really is a list! - and we can mutate it...but the next time anyone looks up the same attribute we looked up to get that list, they won't get the <em>same</em> list object we got. This wacky <code>__getattr__</code> will run through the same indirection maze and return a <em>new listified copy</em> of the backing value. Every time you look up the attribute, it does that. We can mutate the copies all we like, but doing that doesn't actually change the backing value.</p> <p>Now, it's easy enough to work around this, <em>once you know what's going on</em>. 
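</p> <p>(In fact, the whole trap fits in a tiny toy class - my own simplification, not dnf's actual code - whose <code>__getattr__</code> hands out a fresh copy of the backing value on every lookup, much like the listified <code>VectorString</code> above.)</p>

```python
class CopyOnGet:
    """Toy model of the dnf behaviour described above: reads return a
    throwaway copy of the backing value, so mutating it never sticks."""
    def __init__(self):
        self.__dict__["_backing"] = {
            "group_package_types": ["mandatory", "default", "conditional"]
        }

    def __setattr__(self, name, value):
        # Writes really do reach the backing store...
        self._backing[name] = list(value)

    def __getattr__(self, name):
        # ...but every read hands out a fresh copy of it.
        try:
            return list(self._backing[name])
        except KeyError:
            raise AttributeError(name)

conf = CopyOnGet()
conf.group_package_types.append("foo")  # mutates a copy: silently lost
assert conf.group_package_types == ["mandatory", "default", "conditional"]

conf.group_package_types = conf.group_package_types + ["foo"]  # replacing works
assert conf.group_package_types == ["mandatory", "default", "conditional", "foo"]
```

<p>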
The overridden <code>__setattr__</code>, of course, actually <em>does</em> change the backing store. So if you explicitly try to <em>set</em> an attribute (rather than getting one and mutating it), that <em>does</em> work:</p> <pre><code>[adamw@adam dnf (master %)]$ python3
Python 3.6.5 (default, Apr 23 2018, 22:53:50)
[GCC 8.0.1 20180410 (Red Hat 8.0.1-0.21)] on linux
Type "help", "copyright", "credits" or "license" for more information.
&gt;&gt;&gt; from dnf.conf import MainConf
&gt;&gt;&gt; mc = MainConf()
&gt;&gt;&gt; print(mc.group_package_types)
['mandatory', 'default', 'conditional']
&gt;&gt;&gt; mc.group_package_types = mc.group_package_types + ['foo']
&gt;&gt;&gt; print(mc.group_package_types)
['mandatory', 'default', 'conditional', 'foo']
&gt;&gt;&gt;
</code></pre> <p>That time, it worked because we didn't try to mutate our magical immutable mutable object. We just flat out <em>replaced</em> it with a <em>new</em> list. When we explicitly set the attribute like that, we hit the overridden <code>__setattr__</code> method, which does the necessary magic to write the new value to the backing store.</p> <p>But any regular Pythonista who sees that the instance attribute <code>foo.bar</code> is a mutable object is naturally going to assume that they can go ahead and mutate it. That's just...standard. They aren't going to figure they should ignore the fact that it's a mutable object and just replace it every time they want to change it. That's <a href="https://github.com/livecd-tools/livecd-tools/blob/livecd-tools-25.0/imgcreate/dnfinst.py#L181-L182">exactly what we did do</a> in the code that got broken. That's the exact code that used to work with DNF 2 but no longer does with DNF 3.</p> <p>So, that took a few hours to figure out! But <a href="https://github.com/livecd-tools/livecd-tools/pull/102">I got there in the end</a>. I really just wanted to write up my latest 'interesting' debugging adventure!
But if there's a moral to this story...I guess it's "think <em>really</em> hard about whether messing with core behaviour like this is the right way to go about implementing your special thing"?</p> <p>Oh, and by the way: comments should be working again now! I just disabled the plugin that was interfering with them. So, you know, comment away.</p> <p></p> <p></p><p>Here's a puzzle for you, Python fans:</p> <pre><code>[adamw@adam dnf (master %)]$ python3 Python 3.6.5 (default, Apr 23 2018, 22:53:50) [GCC 8.0.1 20180410 (Red Hat 8.0.1-0.21)] on linux Type "help", "copyright", "credits" or "license" for more information. &gt;&gt;&gt; from dnf.conf import MainConf &gt;&gt;&gt; mc = MainConf() &gt;&gt;&gt; print(mc.group_package_types) ['mandatory', 'default', 'conditional'] &gt;&gt;&gt; mc.group_package_types.append('foo') &gt;&gt;&gt; print(mc.group_package_types) ['mandatory', 'default', 'conditional'] &gt;&gt;&gt; </code></pre> <p>Note: if you want to try reproducing this...make sure you use DNF 3. It works as expected with DNF &lt; 3. That's why it just showed up as a problem.</p> <p>Before I explain what's going on there...let's unpick the problem a bit for non-Pythonistas.</p> <p>In Python (and in many other languages) some things - objects - are 'mutable', and some are 'immutable'. A 'mutable' object can be changed. An 'immutable' object cannot.</p> <p>In the Python session above, we create an object, <code>mc</code>, which is an instance of the class <code>MainConf</code> (don't worry if you don't entirely understand that, it's not compulsory). We then examine one <em>attribute</em> of <code>mc</code>: <code>mc.group_package_types</code>. Python tells us that this is <code>['mandatory', 'default', 'conditional']</code> - which is a Python list containing the values <code>'mandatory'</code>, <code>'default'</code> and <code>'conditional'</code>.</p> <p>In Python, lists are 'mutable'. 
That means you can take a list and change it somehow - remove an item from it, add an item to it, re-order it - and it's <em>still the same object</em>. Any existing reference to the object is still valid, and now refers to the changed list.</p> <p>For comparison, an example of an 'immutable' object is a string. If you do this:</p> <pre><code>mystring = "here's some text" </code></pre> <p>you can't then change the actual string object referenced by <code>mystring</code> there. It has no methods that let you change it in any way, and there are no functions that can operate on it and change it. You can do this:</p> <pre><code>mystring = "here's some text" mystring = "here's some different text" mystring = mystring.replace("some", "some more") </code></pre> <p>and at each step the contents of the string to which the name <code>mystring</code> refers are different - but also, at each step <em><code>mystring</code> refers to a DIFFERENT object</em>. (That's why the <code>replace()</code> string method returns a new string - it <em>can't</em> mutate the existing string). So if you did this:</p> <pre><code>mystring = "here's some text" otherref = mystring mystring = "here's some different text" </code></pre> <p>then at the end, <code>otherref</code> still points to the first-created string object and its value is still <code>"here's some text"</code>, while <code>mystring</code> points to a <em>new</em> string object and its value is <code>"here's some different text"</code>. Let's compare with a similar case using a mutable object:</p> <pre><code>mylist = [1, 2, 3] otherref = mylist mylist.append(4) print(mylist) print(otherref) </code></pre> <p>In this case, when we get to the end, both <code>mylist</code> and <code>otherref</code> are still pointing to the <em>same</em> object, the original object, and both <code>print</code>s will print <code>[1, 2, 3, 4]</code>. 
No <em>new</em> list object was created at any point after the initial creation of <code>mylist</code>.</p> <p>So with that understood, take a look back at the original example, and maybe you can see why it's so weird. By all appearances, it looks like we have a pretty simple scenario here: we have an object that has an attribute which is a list, and we just append a value to that list. Then we go look at its value again, and it...hasn't changed at all? But we didn't get any kind of <em>crash</em>, or <em>error</em>, or anything. Our <code>append</code> call returned apparently successfully. It just...didn't seem to change anything. The list is an immutable mutable object!</p> <p>This is a real problem in real code: it <a href="https://pagure.io/dusty/failed-composes/issue/481">broke the most recent Fedora Rawhide compose</a>. So, obviously, I was pretty keen to figure out what the hell was going on, here! It turns out that it's down to dnf getting clever (arguably over-clever).</p> <p>Python's a very...flexible language. The key to the problem here turned out to be exactly what happens when we actually look up the <code>group_package_types</code> attribute of <code>mc</code>, the <code>dnf.conf.MainConf</code> instance.</p> <p>Getting and setting attributes of objects in Python is usually a kind of core operation that you never really think about, you just expect it to work in the 'usual way'. A simple approximation of how it works is that the object has a Python <code>dict</code> (like a 'hash' in some languages - a key/value store, more or less) whose keys are attribute names and whose values are the attribute values. When you ask for an attribute of an object, Python checks if its name is one of the keys in that object's dict, and if it is, returns the value. If it's not, Python raises an error. 
When you <em>set</em> an attribute, Python stuffs it into the dict.</p> <p>But since Python's flexible, it provides some mechanisms to let you mess around with this stuff, if you want to. You can define <code>__setattr__</code>, <code>__getattr__</code> and/or <code>__getattribute__</code> methods in a class, and they'll affect this behaviour.</p> <p>The base <code>object</code> class that almost all Python classes inherit from defines the default <code>__setattr__</code> and <code>__getattribute__</code>, which work sort of like the approximation I gave above. If you override <code>__setattr__</code> in a class, then when something tries to set an attribute for an instance of that class, that method will get called instead of the default <code>object.__setattr__</code>. If you override <code>__getattribute__</code>, then that method will get called instead of <code>object.__getattribute__</code> when something tries to <em>look up</em> an attribute of an instance of that class.</p> <p>If you leave <code>__getattribute__</code> alone but define <code>__getattr__</code>, then when something tries to look up an attribute, first the stock <code>object.__getattribute__</code> will be used to try and look it up, but if that doesn't find it, rather than raising an exception immediately, Python will try your <code>__getattr__</code> method.</p> <p>We can actually override <code>__setattr__</code> and <code>__getattr__</code> to do a very simplified demonstration of how the default implementation usually works:</p> <pre><code>#!/bin/python3

class TestClass(object):
    def __init__(self):
        self.__dict__["_attrs"] = {}

    def __setattr__(self, name, value):
        print("Hi, setattr here, just settin' attrs...")
        self._attrs[name] = value

    def __getattr__(self, name):
        print("Hi, getattr here, here's your attr!")
        return self._attrs[name]

tc = TestClass()
tc.foo = [1, 2, 3]
print(tc.foo)
tc.foo.append(4)
print(tc.foo)
</code></pre> <p>Note that <code>__dict__</code> is the store that
<code>object.__getattribute__</code> uses, so that's why we set up our backing store with <code>self.__dict__["_attrs"] = {}</code> - that ensures that when we look up <code>self._attrs</code>, we will find it via <code>__getattribute__</code>. We can't just do <code>self._attrs = {}</code> because then we wind up in an infinite recursion in our <code>__getattr__</code>.</p> <p>If you save that and run it, you'll see <code>[1, 2, 3]</code> then <code>[1, 2, 3, 4]</code> (plus the messages that prove our new methods are being used). Our mutable attribute is nice and properly mutable. We can append things to it and everything. Notice that when we append the value, we hit <code>__getattr__</code> but not <code>__setattr__</code>.</p> <p>So, how does this manage <em>not</em> to work with dnf config classes? (<code>MainConf</code> is a subclass of <code>BaseConfig</code>, and so are a ton of other config-related classes in dnf - we actually encountered this bug with another subclass, <code>RepoConf</code>). It turns out to be because dnf overrides <code>BaseConfig.__setattr__</code> and <code>BaseConfig.__getattr__</code> to do some "clever" stuff, and it breaks this!</p> <p>We don't need to go into what its <code>__setattr__</code> does in detail, except to note one thing: it <em>doesn't</em> store the values in the <code>__dict__</code> store, so <code>object.__getattribute__</code> can never find them. 
When looking up any attribute on an instance of one of these classes except <code>_config</code> (which is the store the class' <code>__setattr__</code> and <code>__getattr__</code> methods themselves use, just like <code>_attrs</code> in our example, and is created directly in <code>__dict__</code> in the same way), we always wind up in the class's <code>__getattr__</code>.</p> <p>Here's the whole of <a href="https://github.com/rpm-software-management/dnf/blob/3.0.1/dnf/conf/config.py#L198-L210">current dnf's <code>BaseConfig.__getattr__</code></a>:</p> <pre><code>def __getattr__(self, name):
    option = getattr(self._config, name, None)
    if option is None:
        return None
    try:
        value = option().getValue()
    except Exception as ex:
        return None
    if isinstance(value, cfg.VectorString):
        return list(value)
    if isinstance(value, str):
        return ucd(value)
    return value
</code></pre> <p>There is some more stuff going on in the background here that we don't need to worry about <em>too</em> much (a feature of DNF, I have found, is that it has layers upon layers. It contains multitudes. You usually can't debug anything in DNF without going through at least eight levels of things calling other things that turn out to be yet other things that turn out to be written in C++ just cuz.) In the case of the <code>group_package_types</code> option, and also the option we were actually dealing with in the buggy case (the <code>baseurl</code> option for a repo), the option is basically a list-y type, so we wind up in the <code>if isinstance(value, cfg.VectorString):</code> branch here.</p> <p>So if you follow it through, when we asked for <code>mc.group_package_types</code>, <em>unlike</em> in the default Python implementation or our simplified example, we didn't just pull an existing list object out from some sekrit store in the <code>mc</code> object. No. We got some kind of object (fact fans: it's a <code>libdnf.conf.OptionStringList</code> - libdnf is the C++ bit I mentioned earlier...)
out of the <code>self._config</code> dict that's acting as our sort-of attribute store, and ran its <code>getValue</code> method to get some <em>other</em> sort of object (fact fans: it's a <code>libdnf.conf.VectorString</code>), then we ran <code>list()</code> on <em>that</em> object, and returned <em>that</em>.</p> <p>The problem is that the thing that gets returned is basically a temporary copy of the 'real' backing value. It's a mutable object - it really is a list! - and we can mutate it...but the next time anyone looks up the same attribute we looked up to get that list, they won't get the <em>same</em> list object we got. This wacky <code>__getattr__</code> will run through the same indirection maze and return a <em>new listified copy</em> of the backing value. Every time you look up the attribute, it does that. We can mutate the copies all we like, but doing that doesn't actually change the backing value.</p> <p>Now, it's easy enough to work around this, <em>once you know what's going on</em>. The overridden <code>__setattr__</code>, of course, actually <em>does</em> change the backing store. So if you explicitly try to <em>set</em> an attribute (rather than getting one and mutating it), that <em>does</em> work:</p> <pre><code>[adamw@adam dnf (master %)]$ python3
Python 3.6.5 (default, Apr 23 2018, 22:53:50)
[GCC 8.0.1 20180410 (Red Hat 8.0.1-0.21)] on linux
Type "help", "copyright", "credits" or "license" for more information.
&gt;&gt;&gt; from dnf.conf import MainConf
&gt;&gt;&gt; mc = MainConf()
&gt;&gt;&gt; print(mc.group_package_types)
['mandatory', 'default', 'conditional']
&gt;&gt;&gt; mc.group_package_types = mc.group_package_types + ['foo']
&gt;&gt;&gt; print(mc.group_package_types)
['mandatory', 'default', 'conditional', 'foo']
&gt;&gt;&gt;
</code></pre> <p>That time, it worked because we didn't try to mutate our magical immutable mutable object. We just flat out <em>replaced</em> it with a <em>new</em> list.
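</p> <p>The whole trap is easy to reproduce outside dnf, by the way. Here's a toy class - entirely made up, just mimicking the <em>shape</em> of <code>BaseConfig</code> - whose <code>__getattr__</code> hands out a fresh listified copy of the backing value on every lookup:</p> <pre><code>class CopyOnLookup:
    def __init__(self):
        # backing store, created directly in __dict__ so we dodge our
        # own __setattr__ and __getattr__
        self.__dict__["_store"] = {"types": ["mandatory", "default"]}

    def __setattr__(self, name, value):
        # writes DO reach the backing store...
        self.__dict__["_store"][name] = value

    def __getattr__(self, name):
        # ...but every read returns a brand new copy, like
        # BaseConfig.__getattr__'s list(value)
        return list(self.__dict__["_store"][name])

c = CopyOnLookup()
c.types.append("foo")        # mutates a throwaway copy
print(c.types)               # ['mandatory', 'default'] - unchanged!
c.types = c.types + ["foo"]  # explicit set goes via __setattr__
print(c.types)               # ['mandatory', 'default', 'foo']
</code></pre> <p>Mutating the copy goes nowhere, but replacing the attribute lands in <code>__setattr__</code> and really does update the backing store - the same workaround as above.</p> <p>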
When we explicitly set the attribute like that, we hit the overridden <code>__setattr__</code> method, which does the necessary magic to write the new value to the backing store.</p> <p>But any regular Pythonista who sees that the instance attribute <code>foo.bar</code> is a mutable object is naturally going to assume that they can go ahead and mutate it. That's just...standard. They aren't going to figure they should ignore the fact that it's a mutable object and just replace it every time they want to change it. That's <a href="https://github.com/livecd-tools/livecd-tools/blob/livecd-tools-25.0/imgcreate/dnfinst.py#L181-L182">exactly what we did do</a> in the code that got broken. That's the exact code that used to work with DNF 2 but no longer does with DNF 3.</p> <p>So, that took a few hours to figure out! But <a href="https://github.com/livecd-tools/livecd-tools/pull/102">I got there in the end</a>. I really just wanted to write up my latest 'interesting' debugging adventure! But if there's a moral to this story...I guess it's "think <em>really</em> hard about whether messing with core behaviour like this is the right way to go about implementing your special thing"?</p> <p>Oh, and by the way: comments should be working again now! I just disabled the plugin that was interfering with them. So, you know, comment away.</p> The Tale Of The Two-Day, One-Character Patch https://www.happyassassin.net/posts/2017/01/11/the-tale-of-the-two-day-one-character-patch/ 2017-01-11T18:57:21Z 2017-01-11T18:57:21Z Adam Williamson <p>I'm feeling like writing a very long explanation of a very small change again. Some folks have told me they enjoy my attempts to detail the entire step-by-step process of debugging some somewhat complex problem, so sit back, folks, and enjoy...The Tale Of The Two-Day, One-Character Patch!</p> <p>Recently we landed Python 3.6 in Fedora Rawhide.
A Python version bump like that requires all Python-dependent packages in the distribution to be rebuilt. As usually happens, several packages failed to rebuild successfully, so among other work, I've been helping work through the list of failed packages and fixing them up.</p> <p>Two days ago, I reached <a href="https://admin.fedoraproject.org/pkgdb/package/python-deap">python-deap</a>. As usual, I first simply tried a mock build of the package: sometimes it turns out we already fixed whatever had previously caused the build to fail, and simply retrying will make it work. But that wasn't the case this time.</p> <p>The build failed due to build dependencies not being installable - <code>python2-pypandoc</code>, in this case. It turned out that this depends on <code>pandoc-citeproc</code>, and that wasn't installable because a new ghc build had been done without rebuilds of the set of pandoc-related packages that must be rebuilt after a ghc bump. So I rebuilt <code>pandoc</code>, and <code>ghc-aeson-pretty</code> (an updated version was needed to build an updated <code>pandoc-citeproc</code> which had been committed but not built), and finally <code>pandoc-citeproc</code>.</p> <p>With that done, I could do a successful scratch build of <code>python-deap</code>. I tweaked the package a bit to enable the test suites - another thing I'm doing for each package I'm fixing the build of, if possible - and fired off an <a href="https://koji.fedoraproject.org/koji/taskinfo?taskID=17206208">official build</a>.</p> <p>Now you may notice that this looks a bit odd, because all the builds for the different arches succeeded (they're green), but the overall 'State' is "failed". What's going on there? 
Well, if you click "Show result", you'll see this:</p> <pre><code>BuildError: The following noarch package built differently on different architectures: python-deap-doc-1.0.1-2.20160624git232ed17.fc26.noarch.rpm
rpmdiff output was:
error: cannot open Packages index using db5 - Permission denied (13)
error: cannot open Packages database in /var/lib/rpm
error: cannot open Packages database in /var/lib/rpm
removed /usr/share/doc/python-deap/html/_images/cma_plotting_01_00.png
removed /usr/share/doc/python-deap/html/examples/es/cma_plotting_01_00.hires.png
removed /usr/share/doc/python-deap/html/examples/es/cma_plotting_01_00.pdf
removed /usr/share/doc/python-deap/html/examples/es/cma_plotting_01_00.png
</code></pre> <p>So, this is a good example of where background knowledge is valuable. Getting from step to step in this kind of debugging/troubleshooting process is a sort of combination of logic, knowledge and perseverance. Always try to be logical and methodical. When you start out you won't have an awful lot of knowledge, so you'll need a lot of perseverance; hopefully, the longer you go on, the more knowledge you'll pick up, and thus the less perseverance you'll need!</p> <p>In this case the error is actually fairly helpful, but I also know a bit about packages (which helps) and remembered a recent mailing list discussion. Fedora allows arched packages with noarch subpackages, and this is how <code>python-deap</code> is set up: the main packages are arched, but there is a <code>python-deap-doc</code> subpackage that is noarch. We're concerned with that package here. I recalled <a href="https://lists.fedoraproject.org/archives/list/devel@lists.fedoraproject.org/message/VGL6XRJRZ4DM2ZFQCP5IRLLHT6XL4GW6/">a recent mailing list discussion</a> of this "built differently on different architectures" error.</p> <p>As discussed in that thread, we're failing a Koji check specific to this kind of package.
If all the per-arch builds succeed individually, Koji will take the noarch subpackage(s) from each arch and compare them; if they're not all the same, Koji will consider this an error and fail the build. After all, the point of a noarch package is that its contents are the same for all arches and so it shouldn't matter which arch build we take the noarch subpackage from. If it comes out different on different arches, something is clearly up.</p> <p>So this left me with the problem of figuring out which arch was different (it'd be nice if the Koji message actually told us...) and why. I started out just looking at the build logs for each arch and searching for 'cma_plotting'. This is actually another important thing: one of the most important approaches to have in your toolbox for this kind of work is just 'searching for significant-looking text strings'. That might be a grep or it might be a web search, but you'll probably wind up doing a lot of both. Remember good searching technique: try to find the most 'unusual' strings you can to search for, ones for which the results will be strongly correlated with your problem. This quickly told me that the problematic arch was ppc64. The 'removed' files were not present in that build, but they were present in the builds for all other arches.</p> <p>So I started looking more deeply into the <a href="https://kojipkgs.fedoraproject.org//work/tasks/6214/17206214/build.log">ppc64 build log</a>. If you search for 'cma_plotting' in that file, you'll see the very first result is "WARNING: Exception occurred in plotting cma_plotting". That sounds bad! Below it is a long Python traceback - the text starting "Traceback (most recent call last):".</p> <p>So what we have here is some kind of Python thing crashing during the build. If we quickly compare with the build logs on other arches, we don't see the same thing at all - there is no traceback in those build logs. 
Especially since this shows up right when the build process should be generating the files we know are the problem (the <code>cma_plotting</code> files, remember), we can be pretty sure this is our culprit.</p> <p>Now this is a pretty big scary traceback, but we can learn some things from it quite easily. One is very important: we can see quite easily <em>what</em> it is that's going wrong. If we look at the end of the traceback, we see that all the last calls involve files in <code>/usr/lib64/python2.7/site-packages/matplotlib</code>. This means we're dealing with a Python module called <code>matplotlib</code>. We can quite easily associate that with the package <a href="https://admin.fedoraproject.org/pkgdb/package/python-matplotlib">python-matplotlib</a>, and now we have our next suspect.</p> <p>If we look a bit before the traceback, we can get a bit more general context of what's going on, though it turns out not to be very important in this case. Sometimes it is, though. In this case we can see this:</p> <pre><code>+ sphinx-build-2 doc build/html
Running Sphinx v1.5.1
</code></pre> <p>Again, background knowledge comes in handy here: I happen to know that Sphinx is a tool for generating documentation. But if you didn't already know that, you should quite easily be able to find it out, by good old web search. So what's going on is the package build process is trying to generate python-deap's documentation, and that process uses this <code>matplotlib</code> library, and something is going very wrong - but only on ppc64, remember - in <code>matplotlib</code> when we try to generate one particular set of doc files.</p> <p>So next I start trying to figure out what's actually going wrong in matplotlib. As I mentioned, the traceback is pretty long. This is partly just because matplotlib is big and complex, but it's more because it's a fairly rare type of Python error - an infinite recursion.
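</p> <p>If you've never run into one, you can make yourself a miniature version of this error very easily (a toy, obviously, not matplotlib's code; note that on Python 3 the guard raises <code>RecursionError</code>, while the Python 2.7 used in these builds reports it as a <code>RuntimeError</code>):</p> <pre><code>def loop():
    # calls itself unconditionally, so it can never return; Python's
    # recursion guard eventually gives up
    return loop()

try:
    loop()
except RecursionError as exc:
    print(exc)    # e.g. maximum recursion depth exceeded
</code></pre> <p>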
You'll see the traceback ends with many, many repetitions of this line:</p> <pre><code>  File "/usr/lib64/python2.7/site-packages/matplotlib/mathtext.py", line 861, in _get_glyph
    return self._get_glyph('rm', font_class, sym, fontsize)
</code></pre> <p>followed by:</p> <pre><code>  File "/usr/lib64/python2.7/site-packages/matplotlib/mathtext.py", line 816, in _get_glyph
    uniindex = get_unicode_index(sym, math)
  File "/usr/lib64/python2.7/site-packages/matplotlib/mathtext.py", line 87, in get_unicode_index
    if symbol == '-':
RuntimeError: maximum recursion depth exceeded in cmp
</code></pre> <p>What 'recursion' means is pretty simple: it just means that a function can call <em>itself</em>. A common example of where you might want to do this is if you're trying to walk a directory tree. In Python-y pseudo-code it might look a bit like this:</p> <pre><code>def read_directory(directory):
    print(directory.name)
    for entry in directory:
        if entry is file:
            print(entry.name)
        if entry is directory:
            read_directory(entry)
</code></pre> <p>To deal with directories nested in other directories, the function just calls itself. The danger is if you somehow mess up when writing code like this, and it winds up in a loop, calling itself over and over and never escaping: this is 'infinite recursion'. Python, being a nice language, notices when this is going on, and bails after a certain number of recursions, which is what's happening here.</p> <p>So now we know where to look in matplotlib, and what to look for. Let's go take a look! matplotlib, like most everything else in the universe these days, is <a href="https://github.com/matplotlib/matplotlib">in github</a>, which is bad for ecosystem health but handy just for finding stuff. Let's go look at the <a href="https://github.com/matplotlib/matplotlib/blob/7a83fc426f689f15df812676dc36d003f7d39a9a/lib/matplotlib/mathtext.py#L801">function from the backtrace</a>.</p> <p>Well, this is pretty long, and maybe a bit intimidating.
But an interesting thing is, we don't really need to know what this function is <em>for</em> - I actually still don't know precisely (according to the name it should be returning a 'glyph' - a single visual representation for a specific character from a font - but it actually returns a font, the unicode index for the glyph, the name of the glyph, the font size, and whether the glyph is italicized, for some reason). What we need to concentrate on is the question of why this function is getting in a recursion loop on one arch (ppc64) but not any others.</p> <p>First let's figure out how the recursion is actually triggered - that's vital to figuring out what the next step in our chain is. The line that triggers the loop is this one:</p> <pre><code>            return self._get_glyph('rm', font_class, sym, fontsize)
</code></pre> <p>That's where it calls itself. It's kinda obvious that the authors expect that call to succeed - it shouldn't run down the same logical path, but instead get to the 'success' path (the <code>return font, uniindex, symbol_name, fontsize, slanted</code> line at the end of the function) and thus break the loop. But on ppc64, for some reason, it doesn't.</p> <p>So what's the logic path that leads us to that call, both initially and when it recurses? Well, it's down three levels of conditionals:</p> <pre><code>        if not found_symbol:
            if self.cm_fallback:
                &lt;other path&gt;
            else:
                if fontname in ('it', 'regular') and isinstance(self, StixFonts):
                    return self._get_glyph('rm', font_class, sym, fontsize)
</code></pre> <p>So we only get to this path if <code>found_symbol</code> is not set by the time we reach that first <code>if</code>, then if <code>self.cm_fallback</code> is not set, then if the <code>fontname</code> given when the function was called was 'it' or 'regular' and if the class instance this function (actually method) is a part of is an instance of the <code>StixFonts</code> class (or a subclass).
Don't worry if we're getting a bit too technical at this point, because I did spend a bit of time looking into those last two conditions, but ultimately they turned out not to be that significant. The important one is the first one: <code>if not found_symbol</code>.</p> <p>By this point, I'm starting to wonder if the problem is that we're failing to 'find' the symbol - in the first half of the function - when we shouldn't be. Now there are a couple of handy logical shortcuts we can take here that turned out to be rather useful. First we look at the whole logic flow of the <code>found_symbol</code> variable and see that it's a bit convoluted. From the start of the function, there are two different ways it can be set <code>True</code> - the <code>if self.use_cmex</code> block and then the 'fallback' <code>if not found_symbol</code> block after that. Then there's another block that starts <code>if found_symbol:</code> where it gets <em>set back to False again</em>, and another lookup is done:</p> <pre><code>        if found_symbol:
            (...)
            found_symbol = False
            font = self._get_font(new_fontname)
            if font is not None:
                glyphindex = font.get_char_index(uniindex)
                if glyphindex != 0:
                    found_symbol = True
</code></pre> <p>At first, though, we don't know if we're even hitting that block, or if we're failing to 'find' the symbol earlier on. It turns out, though, that it's easy to tell - because of this earlier block:</p> <pre><code>        if not found_symbol:
            try:
                uniindex = get_unicode_index(sym, math)
                found_symbol = True
            except ValueError:
                uniindex = ord('?')
                warn("No TeX to unicode mapping for '%s'" %
                     sym.encode('ascii', 'backslashreplace'),
                     MathTextWarning)
</code></pre> <p>Basically, if we don't find the symbol there, the code logs a warning.
We can see from our build log that we <em>don't</em> see any such warning, so we know that the code <em>does</em> initially succeed in finding the symbol - that is, when we get to the <code>if found_symbol:</code> block, <code>found_symbol</code> is True. That logically means that it's <em>that</em> block where the problem occurs - we have <code>found_symbol</code> going in, but where that block sets it back to <code>False</code> then looks it up again (after doing some kind of font substitution, I don't know why, don't care), it fails.</p> <p>The other thing I noticed while poking through this code is a later warning. Remember that the infinite recursion only happens <code>if fontname in ('it', 'regular') and isinstance(self, StixFonts)</code>? Well, what happens if that's <em>not</em> the case is interesting:</p> <pre><code>                if fontname in ('it', 'regular') and isinstance(self, StixFonts):
                    return self._get_glyph('rm', font_class, sym, fontsize)
                warn("Font '%s' does not have a glyph for '%s' [U+%x]" %
                     (new_fontname,
                      sym.encode('ascii', 'backslashreplace').decode('ascii'),
                      uniindex),
                     MathTextWarning)
</code></pre> <p>that is, if that condition <em>isn't</em> satisfied, instead of calling itself, the next thing the function does is log a warning. So it occurred to me to go and see if there are any of those warnings in the build logs. And, whaddayaknow, there are four such warnings in the ppc64 build log:</p> <pre><code>/usr/lib64/python2.7/site-packages/matplotlib/mathtext.py:866: MathTextWarning: Font 'rm' does not have a glyph for '1' [U+1d7e3]
  MathTextWarning)
/usr/lib64/python2.7/site-packages/matplotlib/mathtext.py:867: MathTextWarning: Substituting with a dummy symbol.
  warn("Substituting with a dummy symbol.", MathTextWarning)
/usr/lib64/python2.7/site-packages/matplotlib/mathtext.py:866: MathTextWarning: Font 'rm' does not have a glyph for '0' [U+1d7e2]
  MathTextWarning)
/usr/lib64/python2.7/site-packages/matplotlib/mathtext.py:866: MathTextWarning: Font 'rm' does not have a glyph for '-' [U+2212]
  MathTextWarning)
/usr/lib64/python2.7/site-packages/matplotlib/mathtext.py:866: MathTextWarning: Font 'rm' does not have a glyph for '2' [U+1d7e4]
  MathTextWarning)
</code></pre> <p>but there are <em>no</em> such warnings in the logs for other arches. That's really rather interesting. It makes one possibility very unlikely: that we do reach the recursed call on all arches, but it fails on ppc64 and succeeds on the other arches. It's looking far more likely that the problem is the "re-discovery" bit of the function - the <code>if found_symbol:</code> block where it looks up the symbol again - is usually working on other arches, but failing on ppc64.</p> <p>So just by looking at the logical flow of the function, particularly what happens in different conditional branches, we've actually been able to figure out quite a lot, without knowing or even caring what the function is really <em>for</em>. By this point, I was really focusing in on that <code>if found_symbol:</code> block. And that leads us to our next suspect. The most important bit in that block is where it actually decides whether to set <code>found_symbol</code> to True or not, here:</p> <pre><code>            font = self._get_font(new_fontname)
            if font is not None:
                glyphindex = font.get_char_index(uniindex)
                if glyphindex != 0:
                    found_symbol = True
</code></pre> <p>I didn't actually know whether it was failing because <code>self._get_font</code> didn't find anything, or because <code>font.get_char_index</code> returned 0.
I think I just played a hunch that <code>get_char_index</code> was the problem, but it wouldn't be too difficult to find out by just editing the code a bit to log a message telling you whether or not <code>font</code> was <code>None</code>, and re-running the test suite.</p> <p>Anyhow, I wound up looking at <code>get_char_index</code>, so we need to go find that. You could work backwards through the code and figure out what <code>font</code> is an instance of so you can find it, but that's boring: it's far quicker just to grep the damn code. If you do that, you get various results that are calls of it, then this:</p> <pre><code>src/ft2font_wrapper.cpp:const char *PyFT2Font_get_char_index__doc__ =
src/ft2font_wrapper.cpp:    "get_char_index()\n"
src/ft2font_wrapper.cpp:static PyObject *PyFT2Font_get_char_index(PyFT2Font *self, PyObject *args, PyObject *kwds)
src/ft2font_wrapper.cpp:    if (!PyArg_ParseTuple(args, "I:get_char_index", &amp;ccode)) {
src/ft2font_wrapper.cpp:    {"get_char_index", (PyCFunction)PyFT2Font_get_char_index, METH_VARARGS, PyFT2Font_get_char_index__doc__},
</code></pre> <p>Which is the point at which I started mentally buckling myself in, because now we're out of Python and into C++. Glorious C++! I should note at this point that, while I'm probably a half-decent Python coder at this point, I am still pretty awful at C(++). I may be somewhat or very wrong in anything I say about it. Corrections welcome.</p> <p>So I buckled myself in and went for a look at this <a href="https://github.com/matplotlib/matplotlib/blob/v2.0.0rc2/src/ft2font_wrapper.cpp">ft2font_wrapper.cpp thing</a>. I've seen this kind of thing a couple of times before, so by squinting at it a bit sideways, I could more or less see that this is what Python calls an <a href="https://docs.python.org/2/extending/extending.html">extension module</a>: basically, it's a Python module written in C or C++.
This gets done if you need to create a new built-in type, or for speed, or - as in this case - because the Python project wants to work directly with a system shared library (in this case, freetype), either because it doesn't have Python bindings or because the project doesn't want to use them for some reason.</p> <p>This code pretty much provides a few classes for working with Freetype fonts. It defines a class called <code>matplotlib.ft2font.FT2Font</code> with a method <code>get_char_index</code>, and that's what the code back up in <code>mathtext.py</code> is dealing with: that <code>font</code> we were dealing with is an <code>FT2Font</code> instance, and we're using its <code>get_char_index</code> method to try and 'find' our 'symbol'.</p> <p>Fortunately, this <code>get_char_index</code> method is actually simple enough that even I can figure out what it's doing:</p> <pre><code>static PyObject *PyFT2Font_get_char_index(PyFT2Font *self, PyObject *args, PyObject *kwds)
{
    FT_UInt index;
    FT_ULong ccode;

    if (!PyArg_ParseTuple(args, "I:get_char_index", &amp;ccode)) {
        return NULL;
    }

    index = FT_Get_Char_Index(self-&gt;x-&gt;get_face(), ccode);

    return PyLong_FromLong(index);
}
</code></pre> <p>(If you're playing along at home for MEGA BONUS POINTS, you now have all the necessary information and you can try to figure out what the bug is. If you just want me to explain it, keep reading!)</p> <p>There's really not an awful lot there. It's calling <code>FT_Get_Char_Index</code> with a couple of args and returning the result. Not rocket science.</p> <p>In fact, this seemed like a good point to start just doing a bit of experimenting to identify the precise problem, because we've reduced the problem to a very small area.
So this is where I stopped just reading the code and started hacking it up to see what it did.</p> <p>First I tweaked the relevant block in <code>mathtext.py</code> to just log the values it was feeding in and getting out:</p> <pre><code>            font = self._get_font(new_fontname)
            if font is not None:
                glyphindex = font.get_char_index(uniindex)
                warn("uniindex: %s, glyphindex: %s" % (uniindex, glyphindex))
                if glyphindex != 0:
                    found_symbol = True
</code></pre> <p>Sidenote: how exactly to just print something out to the console when you're building or running tests can vary quite a bit depending on the codebase in question. What I usually do is just look at how the project already does it - find some message that <em>is</em> being printed when you build or run the tests, and then copy that. Thus in this case we can see that the code is using this <code>warn</code> function (it's actually <code>warnings.warn</code>), and we know those messages are appearing in our build logs, so...let's just copy that.</p> <p>Then I ran the test suite on both x86_64 and ppc64, and compared. This told me that the Python code was passing the same <code>uniindex</code> values to the C code on both x86_64 and ppc64, but getting different results back - that is, I got the same recorded <code>uniindex</code> values, but on x86_64 the resulting <code>glyphindex</code> value was always something larger than 0, but on ppc64, it was sometimes 0.</p> <p>The next step should be pretty obvious: log the input and output values in the <em>C</em> code.</p> <pre><code>    index = FT_Get_Char_Index(self-&gt;x-&gt;get_face(), ccode);
    printf("ccode: %lu index: %u\n", ccode, index);
</code></pre> <p>Another sidenote: one of the more annoying things with this particular issue was just being able to run the tests with modifications and see what happened. First, I needed an actual ppc64 environment to use.
The awesome <a href="https://patrick.uiterwijk.org/">Patrick Uiterwijk</a> of Fedora release engineering provided me with one. Then I built a .src.rpm of the python-matplotlib package, ran a mock build of it, and shelled into the mock environment. That gives you an environment with all the necessary build dependencies and the source and the tests all there and prepared already. Then I just copied the necessary build, install and test commands from the spec file. For a simple pure-Python module this is all usually pretty easy and you can just check the source out and do it right in your regular environment or in a virtualenv or something, but for something like matplotlib which has this C++ extension module too, it's more complex. The spec builds the code, then installs it, then runs the tests out of the source directory with <code>PYTHONPATH=BUILDROOT/usr/lib64/python2.7/site-packages</code>, so the code that was actually built and installed is used for the tests. When I wanted to modify the C part of matplotlib, I edited it in the source directory, then re-ran the 'build' and 'install' steps, then ran the tests; if I wanted to modify the Python part I just edited it directly in the BUILDROOT location and re-ran the tests. When I ran the tests on ppc64, I noticed that several hundred of them failed with exactly the bug we'd seen in the <code>python-deap</code> package build - this infinite recursion problem. Several others failed due to not being able to find the glyph, without hitting the recursion. (It turned out the package maintainer had disabled the tests on ppc64, and so Fedora 24+'s <code>python-matplotlib</code> has been broken on ppc64 since about April.)</p> <p>So anyway, with that modified C code built and used to run the test suite, I finally had a smoking gun. Running this on x86_64 and ppc64, the logged <code>ccode</code> values were totally different. The values logged on ppc64 were huge. 
But as we know from the previous logging, there was no difference in the value when the Python code passed it to the C code (the <code>uniindex</code> value logged in the Python code).</p> <p>So now I knew: the problem lay in how the C code took the value from the Python code. At this point I started figuring out how that worked. The key line is this one:</p> <pre><code>if (!PyArg_ParseTuple(args, "I:get_char_index", &amp;ccode)) { </code></pre> <p>That <code>PyArg_ParseTuple</code> function is what the C code is using to read in the value that <code>mathtext.py</code> calls <code>uniindex</code> and it calls <code>ccode</code>, the one that's somehow being messed up on ppc64. So let's <a href="https://docs.python.org/2/c-api/arg.html#c.PyArg_ParseTuple">read the docs</a>!</p> <p>This is one unusual example where the Python docs, which are usually awesome, are a bit difficult, because that's a very thin description which doesn't provide the references you usually get. But all you really need to do is read up - go back to the top of the page, and you get a much more comprehensive explanation. Reading carefully through the whole page, we can see pretty much what's going on in this call. It basically means that <code>args</code> is expected to be a structure representing a single Python object, a number, which we will store into the C variable <code>ccode</code>. The tricky bit is that second arg, <code>"I:get_char_index"</code>. This is the 'format string' that the Python page goes into a lot of helpful detail about.</p> <p>As it tells us, <code>PyArg_ParseTuple</code> "use[s] format strings which are used to tell the function about the expected arguments...A format string consists of zero or more “format units.” A format unit describes one Python object; it is usually a single character or a parenthesized sequence of format units. 
With a few exceptions, a format unit that is not a parenthesized sequence normally corresponds to a single address argument to these functions." Next we get a list of the 'format units', and <code>I</code> is one of those:</p> <pre><code> I (integer) [unsigned int] Convert a Python integer to a C unsigned int, without overflow checking. </code></pre> <p>You might also notice that the list of format units includes several for converting Python integers to <em>other</em> things, like <code>i</code> for 'signed int' and <code>h</code> for 'short int'. This will become significant soon!</p> <p>The <code>:get_char_index</code> bit threw me for a minute, but it's explained further down:</p> <p>"A few other characters have a meaning in a format string. These may not occur inside nested parentheses. They are: ... : The list of format units ends here; the string after the colon is used as the function name in error messages (the “associated value” of the exception that PyArg_ParseTuple() raises)." So in our case here, we have only a single 'format unit' - <code>I</code> - and <code>get_char_index</code> is just a name that'll be used in any error messages this call might produce.</p> <p>So now we know what this call is doing. It's saying "when some Python code calls this function, take the args it was called with and parse them into C structures so we can do stuff with them. In this case, we expect there to be just a single arg, which will be a Python integer, and we want to convert it to a C unsigned integer, and store it in the C variable <code>ccode</code>."</p> <p>(If you're playing along at home but you didn't get it earlier, you really should be able to get it now! Hint: read up just a few lines in the C code. If not, go refresh your memory about architectures...)</p> <p>And once I understood that, I realized what the problem was. 
Let's read up just a few lines in the C code:</p> <pre><code>FT_ULong ccode; </code></pre> <p>Unlike Python, C and C++ are 'typed languages'. That just means that all variables must be declared to be of a specific type, unlike Python variables, which you don't have to declare explicitly and which can change type any time you like. This is a variable declaration: it's simply saying "we want a variable called <code>ccode</code>, and it's of type <code>FT_ULong</code>".</p> <p>If you know anything at all about C integer types, you should know what the problem is by now (you probably worked it out a few paragraphs back). But if you don't, now's a good time to learn!</p> <p>There are several different types you can use for storing integers in C: <code>short</code>, <code>int</code>, <code>long</code>, and possibly <code>long long</code> (depends on your arch). This is basically all about efficiency: you can only put a small number in a <code>short</code>, but if you only <em>need</em> to store small numbers, it might be more efficient to use a <code>short</code> than a <code>long</code>. Theoretically, when you use a <code>short</code> the compiler will allocate less memory than when you use an <code>int</code>, which uses less memory again than a <code>long</code>, which uses less than a <code>long long</code>. Practically speaking some of them wind up being the same size on some platforms, but the basic idea's there.</p> <p>All the types have <code>signed</code> and <code>unsigned</code> variants. The difference there is simple: <code>signed</code> numbers can be negative, <code>unsigned</code> ones can't. Say an <code>int</code> is big enough to let you store 101 different values: a <code>signed int</code> would let you store any number between -50 and +50, while an <code>unsigned int</code> would let you store any number between 0 and 100.</p> <p>Now look at that <code>ccode</code> declaration again. What is its type? <code>FT_ULong</code>. 
That <code>ULong</code>...sounds a lot like <code>unsigned long</code>, right?</p> <p>Yes it does! Here, have a cookie. C code often declares its own aliases for standard C types like this; we can find Freetype's in its API documentation, which I found by the cunning technique of doing a <a href="https://duckduckgo.com/?q=ft_ulong&amp;t=h_&amp;ia=web">web search for FT_ULong</a>. That finds us <a href="https://www.freetype.org/freetype2/docs/reference/ft2-basic_types.html#FT_ULong">this handy definition</a>: "A typedef for unsigned long."</p> <p>Aaaaaaand herein lies our bug! Whew, at last. As, hopefully, you can now see, this <code>ccode</code> variable is declared as an <code>unsigned long</code>, but we're telling <code>PyArg_ParseTuple</code> to convert the Python object such that we can store it as an <code>unsigned int</code>, not an <code>unsigned long</code>.</p> <p>But wait, you think. Why does this seem to work OK on most arches, and only fail on ppc64? Again, some of you will already know the answer, good for you, now go read something else. ;) For the rest of you, it's all about this concept called 'endianness', which you might have come across and completely failed to understand, like I did many times! But it's really pretty simple, at least if we skate over it just a bit.</p> <p>Consider the number "forty-two". Here is how we write it with numerals: 42. Right? At least, that's how most humans do it, these days, unless you're a particularly hardy survivor of the fall of Rome, or something. This means we humans are 'big-endian'. If we were 'little-endian', we'd write it like this: 24. 'Big-endian' just means the most significant element comes 'first' in the representation; 'little-endian' means the most significant element comes last.</p> <p>All the arches Fedora supports except for ppc64 are little-endian. On little-endian arches, this error doesn't actually cause a problem: even though we used the wrong format unit, the value winds up being correct. 
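</p> <p>Sidenote: you can simulate the whole mess from regular Python with the <code>struct</code> module. This is only a sketch of what I <em>believe</em> the C ends up doing: the <code>I</code> format unit writes <code>sizeof(unsigned int)</code> = 4 bytes through the pointer it's given, the variable behind that pointer is really 8 bytes wide (an <code>unsigned long</code>), and we'll assume the 4 bytes that never get written happen to be zero (in the real bug they're whatever was lying around on the stack):</p>

```python
import struct

def parse_I_into_ulong(value, byteorder):
    """Simulate PyArg_ParseTuple("I", ...) storing into an unsigned long.

    The "I" format unit writes a 4-byte unsigned int through the
    pointer it's given; here the variable behind that pointer is
    really 8 bytes wide. We assume the 4 bytes we never write stay
    zero (in the real bug they hold stack garbage).
    """
    order = '>' if byteorder == 'big' else '<'
    buf = bytearray(8)                          # the 8-byte unsigned long
    buf[0:4] = struct.pack(order + 'I', value)  # write 4 bytes at its start
    return struct.unpack(order + 'Q', bytes(buf))[0]

print(hex(parse_I_into_ulong(0x2212, 'little')))  # 0x2212 -- correct, by luck
print(hex(parse_I_into_ulong(0x2212, 'big')))     # 0x221200000000 -- value << 32
```

<p>(U+2212 is the minus sign, one of the very characters the ppc64 build log complained about; and <code>value &lt;&lt; 32</code> is the same as multiplying by 2^32.)</p> <p>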
On (64-bit) big-endian arches, however, it <em>does</em> cause a problem - when you tell <code>PyArg_ParseTuple</code> to convert to an unsigned int, but store the result into a variable that was declared as an unsigned long, you get a completely different value (it's multiplied by 2^32). The reasons for this involve getting into a more technical understanding of little-endian vs. big-endian (we actually have to get into the icky details of how values are <em>really</em> represented in memory), which I'm going to skip since this post is already long enough.</p> <p>But you don't <em>really</em> need to understand it completely, certainly not to be able to spot problems like this. All you need to know is that there are little-endian and big-endian arches, and little-endian ones are far more prevalent these days, so it's not unusual for low-level code to have weird bugs on big-endian arches. If something works fine on most arches but not on one or two, check if the ones where it fails are big-endian. If so, then keep a careful eye out for this kind of integer type mismatch problem, because it's very, very likely to be the cause.</p> <p>So now all that remained to do was to <em>fix</em> the problem. And here we go, with our <a href="https://patch-diff.githubusercontent.com/raw/matplotlib/matplotlib/pull/7768.patch">one character patch</a>:</p> <pre><code>diff --git a/src/ft2font_wrapper.cpp b/src/ft2font_wrapper.cpp index a97de68..c77dd83 100644 --- a/src/ft2font_wrapper.cpp +++ b/src/ft2font_wrapper.cpp @@ -971,7 +971,7 @@ static PyObject *PyFT2Font_get_char_index(PyFT2Font *self, PyObject *args, PyObj FT_UInt index; FT_ULong ccode; - if (!PyArg_ParseTuple(args, "I:get_char_index", &amp;ccode)) { + if (!PyArg_ParseTuple(args, "k:get_char_index", &amp;ccode)) { return NULL; } </code></pre> <p>There's something I just love about a one-character change that fixes several hundred test failures. 
:) As you can see, we simply change the <code>I</code> - the format unit for <code>unsigned int</code> - to <code>k</code> - the format unit for <code>unsigned long</code>. And with that, the bug is solved! I applied this change on both x86_64 and ppc64, re-built the code and re-ran the test suite, and observed that several hundred errors disappeared from the test suite on ppc64, while the x86_64 tests continued to pass.</p> <p>So I was able to send that patch upstream, apply it to the Fedora package, and once the package build went through, I could finally <a href="https://koji.fedoraproject.org/koji/buildinfo?buildID=827137">build python-deap successfully</a>, two days after I'd first tried it.</p> <p>Bonus extra content: even though I'd fixed the <code>python-deap</code> problem, as I'm never able to leave well enough alone, it wound up bugging me that there were still several hundred other failures in the matplotlib test suite on ppc64. So I wound up looking into all the other failures, and finding <a href="https://github.com/matplotlib/matplotlib/pull/7781">several</a> <a href="https://github.com/matplotlib/matplotlib/pull/7782">other</a> <a href="https://github.com/matplotlib/matplotlib/pull/7791">similar</a> <a href="https://github.com/matplotlib/matplotlib/pull/7796">issues</a>, which got the failure count down to <a href="https://github.com/matplotlib/matplotlib/issues/7797">just</a> <a href="https://github.com/matplotlib/matplotlib/issues/7798">two</a> sets of problems that are too domain-specific for me to figure out, and actually also happen on aarch64 and ppc64le (they're not big-endian issues). 
So to both the people running matplotlib on ppc64...you're welcome ;)</p> <p>Seriously, though, I suspect without these fixes, we might have had some odd cases where a noarch package's documentation would suddenly get messed up if the package happened to get built on a ppc64 builder.</p> <p>I'm feeling like writing a very long explanation of a very small change again. Some folks have told me they enjoy my attempts to detail the entire step-by-step process of debugging some somewhat complex problem, so sit back, folks, and enjoy...The Tale Of The Two-Day, One-Character Patch!</p> <p>Recently we landed Python 3.6 in Fedora Rawhide. A Python version bump like that requires all Python-dependent packages in the distribution to be rebuilt. As usually happens, several packages failed to rebuild successfully, so among other work, I've been helping work through the list of failed packages and fixing them up.</p> <p>Two days ago, I reached <a href="https://admin.fedoraproject.org/pkgdb/package/python-deap">python-deap</a>. As usual, I first simply tried a mock build of the package: sometimes it turns out we already fixed whatever had previously caused the build to fail, and simply retrying will make it work. But that wasn't the case this time.</p> <p>The build failed due to build dependencies not being installable - <code>python2-pypandoc</code>, in this case. It turned out that this depends on <code>pandoc-citeproc</code>, and that wasn't installable because a new ghc build had been done without rebuilds of the set of pandoc-related packages that must be rebuilt after a ghc bump. So I rebuilt <code>pandoc</code>, and <code>ghc-aeson-pretty</code> (an updated version was needed to build an updated <code>pandoc-citeproc</code> which had been committed but not built), and finally <code>pandoc-citeproc</code>.</p> <p>With that done, I could do a successful scratch build of <code>python-deap</code>. 
I tweaked the package a bit to enable the test suites - another thing I'm doing for each package I'm fixing the build of, if possible - and fired off an <a href="https://koji.fedoraproject.org/koji/taskinfo?taskID=17206208">official build</a>.</p> <p>Now you may notice that this looks a bit odd, because all the builds for the different arches succeeded (they're green), but the overall 'State' is "failed". What's going on there? Well, if you click "Show result", you'll see this:</p> <pre><code>BuildError: The following noarch package built differently on different architectures: python-deap-doc-1.0.1-2.20160624git232ed17.fc26.noarch.rpm rpmdiff output was: error: cannot open Packages index using db5 - Permission denied (13) error: cannot open Packages database in /var/lib/rpm error: cannot open Packages database in /var/lib/rpm removed /usr/share/doc/python-deap/html/_images/cma_plotting_01_00.png removed /usr/share/doc/python-deap/html/examples/es/cma_plotting_01_00.hires.png removed /usr/share/doc/python-deap/html/examples/es/cma_plotting_01_00.pdf removed /usr/share/doc/python-deap/html/examples/es/cma_plotting_01_00.png </code></pre> <p>So, this is a good example of where background knowledge is valuable. Getting from step to step in this kind of debugging/troubleshooting process is a sort of combination of logic, knowledge and perseverance. Always try to be logical and methodical. When you start out you won't have an awful lot of knowledge, so you'll need a lot of perseverance; hopefully, the longer you go on, the more knowledge you'll pick up, and thus the less perseverance you'll need!</p> <p>In this case the error is actually fairly helpful, but I also know a bit about packages (which helps) and remembered a recent mailing list discussion. Fedora allows arched packages with noarch subpackages, and this is how <code>python-deap</code> is set up: the main packages are arched, but there is a <code>python-deap-docs</code> subpackage that is noarch. 
We're concerned with that package here. I recalled <a href="https://lists.fedoraproject.org/archives/list/devel@lists.fedoraproject.org/message/VGL6XRJRZ4DM2ZFQCP5IRLLHT6XL4GW6/">a recent mailing list discussion</a> of this "built differently on different architectures" error.</p> <p>As discussed in that thread, we're failing a Koji check specific to this kind of package. If all the per-arch builds succeed individually, Koji will take the noarch subpackage(s) from each arch and compare them; if they're not all the same, Koji will consider this an error and fail the build. After all, the point of a noarch package is that its contents are the same for all arches and so it shouldn't matter which arch build we take the noarch subpackage from. If it comes out different on different arches, something is clearly up.</p> <p>So this left me with the problem of figuring out which arch was different (it'd be nice if the Koji message actually told us...) and why. I started out just looking at the build logs for each arch and searching for 'cma_plotting'. This is actually another important thing: one of the most important approaches to have in your toolbox for this kind of work is just 'searching for significant-looking text strings'. That might be a grep or it might be a web search, but you'll probably wind up doing a lot of both. Remember good searching technique: try to find the most 'unusual' strings you can to search for, ones for which the results will be strongly correlated with your problem. This quickly told me that the problematic arch was ppc64. The 'removed' files were not present in that build, but they were present in the builds for all other arches.</p> <p>So I started looking more deeply into the <a href="https://kojipkgs.fedoraproject.org//work/tasks/6214/17206214/build.log">ppc64 build log</a>. If you search for 'cma_plotting' in that file, you'll see the very first result is "WARNING: Exception occurred in plotting cma_plotting". That sounds bad! 
Below it is a long Python traceback - the text starting "Traceback (most recent call last):".</p> <p>So what we have here is some kind of Python thing crashing during the build. If we quickly compare with the build logs on other arches, we don't see the same thing at all - there is no traceback in those build logs. Especially since this shows up right when the build process should be generating the files we know are the problem (the <code>cma_plotting</code> files, remember), we can be pretty sure this is our culprit.</p> <p>Now this is a pretty big scary traceback, but we can learn some things from it quite easily. One is very important: we can see quite easily <em>what</em> it is that's going wrong. If we look at the end of the traceback, we see that all the last calls involve files in <code>/usr/lib64/python2.7/site-packages/matplotlib</code>. This means we're dealing with a Python module called <code>matplotlib</code>. We can quite easily associate that with the package <a href="https://admin.fedoraproject.org/pkgdb/package/python-matplotlib">python-matplotlib</a>, and now we have our next suspect.</p> <p>If we look a bit before the traceback, we can get a bit more general context of what's going on, though it turns out not to be very important in this case. Sometimes it is, though. In this case we can see this:</p> <pre><code>+ sphinx-build-2 doc build/html Running Sphinx v1.5.1 </code></pre> <p>Again, background knowledge comes in handy here: I happen to know that Sphinx is a tool for generating documentation. But if you didn't already know that, you should quite easily be able to find it out, by good old web search. 
So what's going on is the package build process is trying to generate python-deap's documentation, and that process uses this <code>matplotlib</code> library, and something is going very wrong - but only on ppc64, remember - in <code>matplotlib</code> when we try to generate one particular set of doc files.</p> <p>So next I start trying to figure out what's actually going wrong in matplotlib. As I mentioned, the traceback is pretty long. This is partly just because matplotlib is big and complex, but it's more because it's a fairly rare type of Python error - an infinite recursion. You'll see the traceback ends with many, many repetitions of this line:</p> <pre><code> File "/usr/lib64/python2.7/site-packages/matplotlib/mathtext.py", line 861, in _get_glyph return self._get_glyph('rm', font_class, sym, fontsize) </code></pre> <p>followed by:</p> <pre><code> File "/usr/lib64/python2.7/site-packages/matplotlib/mathtext.py", line 816, in _get_glyph uniindex = get_unicode_index(sym, math) File "/usr/lib64/python2.7/site-packages/matplotlib/mathtext.py", line 87, in get_unicode_index if symbol == '-': RuntimeError: maximum recursion depth exceeded in cmp </code></pre> <p>What 'recursion' means is pretty simple: it just means that a function can call <em>itself</em>. A common example of where you might want to do this is if you're trying to walk a directory tree. In Python-y pseudo-code it might look a bit like this:</p> <pre><code>def read_directory(directory): print(directory.name) for entry in directory: if entry is file: print(entry.name) if entry is directory: read_directory(entry) </code></pre> <p>To deal with directories nested in other directories, the function just calls itself. The danger is if you somehow mess up when writing code like this, and it winds up in a loop, calling itself over and over and never escaping: this is 'infinite recursion'. 
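</p> <p>If you want to see that for yourself, here's about the shortest possible demonstration - nothing matplotlib-specific, just a function that unconditionally calls itself. (On Python 2, which this build was using, the exception is a plain <code>RuntimeError</code>, as in our build log; Python 3.5+ gives it its own <code>RecursionError</code> class.)</p>

```python
import sys

def loop():
    # no base case and no way out: every call just makes another call
    return loop()

try:
    loop()
except RecursionError as exc:
    # Python notices the runaway recursion and bails for us
    print("bailed out:", exc)

print("the limit was:", sys.getrecursionlimit())
```

<p>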
Python, being a nice language, notices when this is going on, and bails after a certain number of recursions, which is what's happening here.</p> <p>So now we know where to look in matplotlib, and what to look for. Let's go take a look! matplotlib, like most everything else in the universe these days, is <a href="https://github.com/matplotlib/matplotlib">in github</a>, which is bad for ecosystem health but handy just for finding stuff. Let's go look at the <a href="https://github.com/matplotlib/matplotlib/blob/7a83fc426f689f15df812676dc36d003f7d39a9a/lib/matplotlib/mathtext.py#L801">function from the backtrace</a>.</p> <p>Well, this is pretty long, and maybe a bit intimidating. But an interesting thing is, we don't really need to know what this function is <em>for</em> - I actually still don't know precisely (according to the name it should be returning a 'glyph' - a single visual representation for a specific character from a font - but it actually returns a font, the unicode index for the glyph, the name of the glyph, the font size, and whether the glyph is italicized, for some reason). What we need to concentrate on is the question of why this function is getting in a recursion loop on one arch (ppc64) but not any others.</p> <p>First let's figure out how the recursion is actually triggered - that's vital to figuring out what the next step in our chain is. The line that triggers the loop is this one:</p> <pre><code> return self._get_glyph('rm', font_class, sym, fontsize) </code></pre> <p>That's where it calls itself. It's kinda obvious that the authors expect that call to succeed - it shouldn't run down the same logical path, but instead get to the 'success' path (the <code>return font, uniindex, symbol_name, fontsize, slanted</code> line at the end of the function) and thus break the loop. But on ppc64, for some reason, it doesn't.</p> <p>So what's the logic path that leads us to that call, both initially and when it recurses? 
Well, it's down three levels of conditionals:</p> <pre><code> if not found_symbol: if self.cm_fallback: &lt;other path&gt; else: if fontname in ('it', 'regular') and isinstance(self, StixFonts): return self._get_glyph('rm', font_class, sym, fontsize) </code></pre> <p>So we only get to this path if <code>found_symbol</code> is not set by the time we reach that first <code>if</code>, then if <code>self.cm_fallback</code> is not set, then if the <code>fontname</code> given when the function was called was 'it' or 'regular' and if the class instance this function (actually method) is a part of is an instance of the <code>StixFonts</code> class (or a subclass). Don't worry if we're getting a bit too technical at this point, because I did spend a bit of time looking into those last two conditions, but ultimately they turned out not to be that significant. The important one is the first one: <code>if not found_symbol</code>.</p> <p>By this point, I'm starting to wonder if the problem is that we're failing to 'find' the symbol - in the first half of the function - when we shouldn't be. Now there are a couple of handy logical shortcuts we can take here that turned out to be rather useful. First we look at the whole logic flow of the <code>found_symbol</code> variable and see that it's a bit convoluted. From the start of the function, there are two different ways it can be set <code>True</code> - the <code>if self.use_cmex</code> block and then the 'fallback' <code>if not found_symbol</code> block after that. Then there's another block that starts <code>if found_symbol:</code> where it gets <em>set back to False again</em>, and another lookup is done:</p> <pre><code> if found_symbol: (...) 
found_symbol = False font = self._get_font(new_fontname) if font is not None: glyphindex = font.get_char_index(uniindex) if glyphindex != 0: found_symbol = True </code></pre> <p>At first, though, we don't know if we're even hitting that block, or if we're failing to 'find' the symbol earlier on. It turns out, though, that it's easy to tell - because of this earlier block:</p> <pre><code> if not found_symbol: try: uniindex = get_unicode_index(sym, math) found_symbol = True except ValueError: uniindex = ord('?') warn("No TeX to unicode mapping for '%s'" % sym.encode('ascii', 'backslashreplace'), MathTextWarning) </code></pre> <p>Basically, if we don't find the symbol there, the code logs a warning. We can see from our build log that we <em>don't</em> see any such warning, so we know that the code <em>does</em> initially succeed in finding the symbol - that is, when we get to the <code>if found_symbol:</code> block, <code>found_symbol</code> is True. That logically means that it's <em>that</em> block where the problem occurs - we have <code>found_symbol</code> going in, but where that block sets it back to <code>False</code> then looks it up again (after doing some kind of font substitution, I don't know why, don't care), it fails.</p> <p>The other thing I noticed while poking through this code is a later warning. Remember that the infinite recursion only happens <code>if fontname in ('it', 'regular') and isinstance(self, StixFonts)</code>? Well, what happens if that's <em>not</em> the case is interesting:</p> <pre><code> if fontname in ('it', 'regular') and isinstance(self, StixFonts): return self._get_glyph('rm', font_class, sym, fontsize) warn("Font '%s' does not have a glyph for '%s' [U+%x]" % (new_fontname, sym.encode('ascii', 'backslashreplace').decode('ascii'), uniindex), MathTextWarning) </code></pre> <p>that is, if that condition <em>isn't</em> satisfied, instead of calling itself, the next thing the function does is log a warning. 
So it occurred to me to go and see if there are any of those warnings in the build logs. And, whaddayaknow, there are four such warnings in the ppc64 build log:</p> <pre><code>/usr/lib64/python2.7/site-packages/matplotlib/mathtext.py:866: MathTextWarning: Font 'rm' does not have a glyph for '1' [U+1d7e3] MathTextWarning) /usr/lib64/python2.7/site-packages/matplotlib/mathtext.py:867: MathTextWarning: Substituting with a dummy symbol. warn("Substituting with a dummy symbol.", MathTextWarning) /usr/lib64/python2.7/site-packages/matplotlib/mathtext.py:866: MathTextWarning: Font 'rm' does not have a glyph for '0' [U+1d7e2] MathTextWarning) /usr/lib64/python2.7/site-packages/matplotlib/mathtext.py:866: MathTextWarning: Font 'rm' does not have a glyph for '-' [U+2212] MathTextWarning) /usr/lib64/python2.7/site-packages/matplotlib/mathtext.py:866: MathTextWarning: Font 'rm' does not have a glyph for '2' [U+1d7e4] MathTextWarning) </code></pre> <p>but there are <em>no</em> such warnings in the logs for other arches. That's really rather interesting. It makes one possibility very unlikely: that we do reach the recursed call on all arches, but it fails on ppc64 and succeeds on the other arches. It's looking far more likely that the problem is the "re-discovery" bit of the function - the <code>if found_symbol:</code> block where it looks up the symbol again - is usually working on other arches, but failing on ppc64.</p> <p>So just by looking at the logical flow of the function, particularly what happens in different conditional branches, we've actually been able to figure out quite a lot, without knowing or even caring what the function is really <em>for</em>. By this point, I was really focusing in on that <code>if found_symbol:</code> block. And that leads us to our next suspect. 
The most important bit in that block is where it actually decides whether to set <code>found_symbol</code> to True or not, here:</p> <pre><code> font = self._get_font(new_fontname) if font is not None: glyphindex = font.get_char_index(uniindex) if glyphindex != 0: found_symbol = True </code></pre> <p>I didn't actually know whether it was failing because <code>self._get_font</code> didn't find anything, or because <code>font.get_char_index</code> returned 0. I think I just played a hunch that <code>get_char_index</code> was the problem, but it wouldn't be too difficult to find out by just editing the code a bit to log a message telling you whether or not <code>font</code> was <code>None</code>, and re-running the test suite.</p> <p>Anyhow, I wound up looking at <code>get_char_index</code>, so we need to go find that. You could work backwards through the code and figure out what <code>font</code> is an instance of so you can find it, but that's boring: it's far quicker just to grep the damn code. If you do that, you get various results that are calls of it, then this:</p> <pre><code>src/ft2font_wrapper.cpp:const char *PyFT2Font_get_char_index__doc__ = src/ft2font_wrapper.cpp: "get_char_index()\n" src/ft2font_wrapper.cpp:static PyObject *PyFT2Font_get_char_index(PyFT2Font *self, PyObject *args, PyObject *kwds) src/ft2font_wrapper.cpp: if (!PyArg_ParseTuple(args, "k:get_char_index", &amp;ccode)) { src/ft2font_wrapper.cpp: {"get_char_index", (PyCFunction)PyFT2Font_get_char_index, METH_VARARGS, PyFT2Font_get_char_index__doc__}, </code></pre> <p>Which is the point at which I started mentally buckling myself in, because now we're out of Python and into C++. Glorious C++! I should note at this point that, while I'm probably a half-decent Python coder at this point, I am still pretty awful at C(++). I may be somewhat or very wrong in anything I say about it. 
Corrections welcome.</p> <p>So I buckled myself in and went for a look at this <a href="https://github.com/matplotlib/matplotlib/blob/v2.0.0rc2/src/ft2font_wrapper.cpp">ft2font_wrapper.cpp thing</a>. I've seen this kind of thing a couple of times before, so by squinting at it a bit sideways, I could more or less see that this is what Python calls an <a href="https://docs.python.org/2/extending/extending.html">extension module</a>: basically, it's a Python module written in C or C++. This gets done if you need to create a new built-in type, or for speed, or - as in this case - because the Python project wants to work directly with a system shared library (in this case, freetype), either because it doesn't have Python bindings or because the project doesn't want to use them for some reason.</p> <p>This code pretty much provides a few classes for working with Freetype fonts. It defines a class called <code>matplotlib.ft2font.FT2Font</code> with a method <code>get_char_index</code>, and that's what the code back up in <code>mathtext.py</code> is dealing with: that <code>font</code> we were dealing with is an <code>FT2Font</code> instance, and we're using its <code>get_char_index</code> method to try and 'find' our 'symbol'.</p> <p>Fortunately, this <code>get_char_index</code> method is actually simple enough that even I can figure out what it's doing:</p> <pre><code>static PyObject *PyFT2Font_get_char_index(PyFT2Font *self, PyObject *args, PyObject *kwds) { FT_UInt index; FT_ULong ccode; if (!PyArg_ParseTuple(args, "I:get_char_index", &amp;ccode)) { return NULL; } index = FT_Get_Char_Index(self-&gt;x-&gt;get_face(), ccode); return PyLong_FromLong(index); } </code></pre> <p>(If you're playing along at home for MEGA BONUS POINTS, you now have all the necessary information and you can try to figure out what the bug is. If you just want me to explain it, keep reading!)</p> <p>There's really not an awful lot there. 
It's calling <code>FT_Get_Char_Index</code> with a couple of args and returning the result. Not rocket science.</p> <p>In fact, this seemed like a good point to start just doing a bit of experimenting to identify the precise problem, because we've reduced the problem to a very small area. So this is where I stopped just reading the code and started hacking it up to see what it did.</p> <p>First I tweaked the relevant block in <code>mathtext.py</code> to just log the values it was feeding in and getting out:</p> <pre><code> font = self._get_font(new_fontname) if font is not None: glyphindex = font.get_char_index(uniindex) warn("uniindex: %s, glyphindex: %s" % (uniindex, glyphindex)) if glyphindex != 0: found_symbol = True </code></pre> <p>Sidenote: how exactly to just print something out to the console when you're building or running tests can vary quite a bit depending on the codebase in question. What I usually do is just look at how the project already does it - find some message that <em>is</em> being printed when you build or run the tests, and then copy that. Thus in this case we can see that the code is using this <code>warn</code> function (it's actually <code>warnings.warn</code>), and we know those messages are appearing in our build logs, so...let's just copy that.</p> <p>Then I ran the test suite on both x86_64 and ppc64, and compared. 
This told me that the Python code was passing the same <code>uniindex</code> values to the C code on both x86_64 and ppc64, but getting different results back - that is, I got the same recorded <code>uniindex</code> values, but on x86_64 the resulting <code>glyphindex</code> value was always something larger than 0, but on ppc64, it was sometimes 0.</p> <p>The next step should be pretty obvious: log the input and output values in the <em>C</em> code.</p> <pre><code>index = FT_Get_Char_Index(self-&gt;x-&gt;get_face(), ccode); printf("ccode: %lu index: %u\n", ccode, index); </code></pre> <p>Another sidenote: one of the more annoying things with this particular issue was just being able to run the tests with modifications and see what happened. First, I needed an actual ppc64 environment to use. The awesome <a href="https://patrick.uiterwijk.org/">Patrick Uiterwijk</a> of Fedora release engineering provided me with one. Then I built a .src.rpm of the python-matplotlib package, ran a mock build of it, and shelled into the mock environment. That gives you an environment with all the necessary build dependencies and the source and the tests all there and prepared already. Then I just copied the necessary build, install and test commands from the spec file. For a simple pure-Python module this is all usually pretty easy and you can just check the source out and do it right in your regular environment or in a virtualenv or something, but for something like matplotlib which has this C++ extension module too, it's more complex. The spec builds the code, then installs it, then runs the tests out of the source directory with <code>PYTHONPATH=BUILDROOT/usr/lib64/python2.7/site-packages</code> , so the code that was actually built and installed is used for the tests. 
When I wanted to modify the C part of matplotlib, I edited it in the source directory, then re-ran the 'build' and 'install' steps, then ran the tests; if I wanted to modify the Python part I just edited it directly in the BUILDROOT location and re-ran the tests. When I ran the tests on ppc64, I noticed that several hundred of them failed with exactly the bug we'd seen in the <code>python-deap</code> package build - this infinite recursion problem. Several others failed due to not being able to find the glyph, without hitting the recursion. It turned out the package maintainer had disabled the tests on ppc64, and so Fedora 24+'s <code>python-matplotlib</code> had been broken on ppc64 since about April.</p> <p>So anyway, with that modified C code built and used to run the test suite, I finally had a smoking gun. Running this on x86_64 and ppc64, the logged <code>ccode</code> values were totally different. The values logged on ppc64 were huge. But as we know from the previous logging, there was no difference in the value when the Python code passed it to the C code (the <code>uniindex</code> value logged in the Python code).</p> <p>So now I knew: the problem lay in how the C code took the value from the Python code. At this point I started figuring out how that worked. The key line is this one:</p> <pre><code>if (!PyArg_ParseTuple(args, "I:get_char_index", &amp;ccode)) { </code></pre> <p>That <code>PyArg_ParseTuple</code> function is what the C code is using to read in the value that <code>mathtext.py</code> calls <code>uniindex</code> and it calls <code>ccode</code>, the one that's somehow being messed up on ppc64. So let's <a href="https://docs.python.org/2/c-api/arg.html#c.PyArg_ParseTuple">read the docs</a>!</p> <p>This is one unusual example where the Python docs, which are usually awesome, are a bit difficult, because that's a very thin description which doesn't provide the references you usually get. 
But all you really need to do is read up - go back to the top of the page, and you get a much more comprehensive explanation. Reading carefully through the whole page, we can see pretty much what's going on in this call. It basically means that <code>args</code> is expected to be a structure representing a single Python object, a number, which we will store into the C variable <code>ccode</code>. The tricky bit is that second arg, <code>"I:get_char_index"</code>. This is the 'format string' that the Python page goes into a lot of helpful detail about.</p> <p>As it tells us, <code>PyArg_ParseTuple</code> "use[s] format strings which are used to tell the function about the expected arguments...A format string consists of zero or more “format units.” A format unit describes one Python object; it is usually a single character or a parenthesized sequence of format units. With a few exceptions, a format unit that is not a parenthesized sequence normally corresponds to a single address argument to these functions." Next we get a list of the 'format units', and <code>I</code> is one of those:</p> <pre><code> I (integer) [unsigned int] Convert a Python integer to a C unsigned int, without overflow checking. </code></pre> <p>You might also notice that the list of format units include several for converting Python integers to <em>other</em> things, like <code>i</code> for 'signed int' and <code>h</code> for 'short int'. This will become significant soon!</p> <p>The <code>:get_char_index</code> bit threw me for a minute, but it's explained further down:</p> <p>"A few other characters have a meaning in a format string. These may not occur inside nested parentheses. They are: ... : The list of format units ends here; the string after the colon is used as the function name in error messages (the “associated value” of the exception that PyArg_ParseTuple() raises)." 
So in our case here, we have only a single 'format unit' - <code>I</code> - and <code>get_char_index</code> is just a name that'll be used in any error messages this call might produce.</p> <p>So now we know what this call is doing. It's saying "when some Python code calls this function, take the args it was called with and parse them into C structures so we can do stuff with them. In this case, we expect there to be just a single arg, which will be a Python integer, and we want to convert it to a C unsigned integer, and store it in the C variable <code>ccode</code>."</p> <p>(If you're playing along at home but you didn't get it earlier, you really should be able to get it now! Hint: read up just a few lines in the C code. If not, go refresh your memory about architectures...)</p> <p>And once I understood that, I realized what the problem was. Let's read up just a few lines in the C code:</p> <pre><code>FT_ULong ccode; </code></pre> <p>Unlike Python, C and C++ are 'typed languages'. That just means that all variables must be declared to be of a specific type, unlike Python variables, which you don't have to declare explicitly and which can change type any time you like. This is a variable declaration: it's simply saying "we want a variable called <code>ccode</code>, and it's of type <code>FT_ULong</code>".</p> <p>If you know anything at all about C integer types, you should know what the problem is by now (you probably worked it out a few paragraphs back). But if you don't, now's a good time to learn!</p> <p>There are several different types you can use for storing integers in C: <code>short</code>, <code>int</code>, <code>long</code>, and possibly <code>long long</code> (depends on your arch). This is basically all about efficiency: you can only put a small number in a <code>short</code>, but if you only <em>need</em> to store small numbers, it might be more efficient to use a <code>short</code> than a <code>long</code>. 
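</p> <p>Sidenote: you can actually check the sizes of the two types in play here from Python itself, via <code>ctypes</code>. This is just a quick illustrative check - the exact figures are platform-dependent - but on the 64-bit Linux boxes we're talking about, an unsigned int is 4 bytes and an unsigned long is 8:</p>

```python
import ctypes

# Sizes (in bytes) of the C integer types discussed here.
# These are platform-dependent; on 64-bit Linux they are 4 and 8.
print(ctypes.sizeof(ctypes.c_uint))   # 'unsigned int' - what format unit 'I' fills
print(ctypes.sizeof(ctypes.c_ulong))  # 'unsigned long'
```

<p>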
Theoretically, when you use a <code>short</code> the compiler will allocate less memory than when you use an <code>int</code>, which uses less memory again than a <code>long</code>, which uses less than a <code>long long</code>. Practically speaking some of them wind up being the same size on some platforms, but the basic idea's there.</p> <p>All the types have <code>signed</code> and <code>unsigned</code> variants. The difference there is simple: <code>signed</code> numbers can be negative, <code>unsigned</code> ones can't. Say an <code>int</code> is big enough to let you store 101 different values: a <code>signed int</code> would let you store any number between -50 and +50, while an <code>unsigned int</code> would let you store any number between 0 and 100.</p> <p>Now look at that <code>ccode</code> declaration again. What is its type? <code>FT_ULong</code>. That <code>ULong</code>...sounds a lot like <code>unsigned long</code>, right?</p> <p>Yes it does! Here, have a cookie. C code often declares its own aliases for standard C types like this; we can find Freetype's in its API documentation, which I found by the cunning technique of doing a <a href="https://duckduckgo.com/?q=ft_ulong&amp;t=h_&amp;ia=web">web search for FT_ULong</a>. That finds us <a href="https://www.freetype.org/freetype2/docs/reference/ft2-basic_types.html#FT_ULong">this handy definition</a>: "A typedef for unsigned long."</p> <p>Aaaaaaand herein lies our bug! Whew, at last. As, hopefully, you can now see, this <code>ccode</code> variable is declared as an <code>unsigned long</code>, but we're telling <code>PyArg_ParseTuple</code> to convert the Python object such that we can store it as an <code>unsigned int</code>, not an <code>unsigned long</code>.</p> <p>But wait, you think. Why does this seem to work OK on most arches, and only fail on ppc64? Again, some of you will already know the answer, good for you, now go read something else. 
;) For the rest of you, it's all about this concept called 'endianness', which you might have come across and completely failed to understand, like I did many times! But it's really pretty simple, at least if we skate over it just a bit.</p> <p>Consider the number "forty-two". Here is how we write it with numerals: 42. Right? At least, that's how most humans do it, these days, unless you're a particularly hardy survivor of the fall of Rome, or something. This means we humans are 'big-endian'. If we were 'little-endian', we'd write it like this: 24. 'Big-endian' just means the most significant element comes 'first' in the representation; 'little-endian' means the most significant element comes last.</p> <p>All the arches Fedora supports except for ppc64 are little-endian. On little-endian arches, this error doesn't actually cause a problem: even though we used the wrong format unit, the value winds up being correct. On (64-bit) big-endian arches, however, it <em>does</em> cause a problem - when you tell <code>PyArg_ParseTuple</code> to convert to an unsigned int, but store the result into a variable that was declared as an unsigned long, you get a completely different value (it's multiplied by 2<sup>32</sup>). The reasons for this involve getting into a more technical understanding of little-endian vs. big-endian (we actually have to get into the icky details of how values are <em>really</em> represented in memory), which I'm going to skip since this post is already long enough.</p> <p>But you don't <em>really</em> need to understand it completely, certainly not to be able to spot problems like this. All you need to know is that there are little-endian and big-endian arches, and little-endian are far more prevalent these days, so it's not unusual for low-level code to have weird bugs on big-endian arches. If something works fine on most arches but not on one or two, check if the ones where it fails are big-endian. 
If so, then keep a careful eye out for this kind of integer type mismatch problem, because it's very, very likely to be the cause.</p> <p>So now all that remained to do was to <em>fix</em> the problem. And here we go, with our <a href="https://patch-diff.githubusercontent.com/raw/matplotlib/matplotlib/pull/7768.patch">one character patch</a>:</p> <pre><code>diff --git a/src/ft2font_wrapper.cpp b/src/ft2font_wrapper.cpp index a97de68..c77dd83 100644 --- a/src/ft2font_wrapper.cpp +++ b/src/ft2font_wrapper.cpp @@ -971,7 +971,7 @@ static PyObject *PyFT2Font_get_char_index(PyFT2Font *self, PyObject *args, PyObj FT_UInt index; FT_ULong ccode; - if (!PyArg_ParseTuple(args, "I:get_char_index", &amp;ccode)) { + if (!PyArg_ParseTuple(args, "k:get_char_index", &amp;ccode)) { return NULL; } </code></pre> <p>There's something I just love about a one-character change that fixes several hundred test failures. :) As you can see, we simply change the <code>I</code> - the format unit for <code>unsigned int</code> - to <code>k</code> - the format unit for <code>unsigned long</code>. And with that, the bug is solved! I applied this change on both x86_64 and ppc64, re-built the code and re-ran the test suite, and observed that several hundred errors disappeared from the test suite on ppc64, while the x86_64 tests continued to pass.</p> <p>So I was able to send that patch upstream, apply it to the Fedora package, and once the package build went through, I could finally <a href="https://koji.fedoraproject.org/koji/buildinfo?buildID=827137">build python-deap successfully</a>, two days after I'd first tried it.</p> <p>Bonus extra content: even though I'd fixed the <code>python-deap</code> problem, as I'm never able to leave well enough alone, it wound up bugging me that there were still several hundred other failures in the matplotlib test suite on ppc64. 
So I wound up looking into all the other failures, and finding <a href="https://github.com/matplotlib/matplotlib/pull/7781">several</a> <a href="https://github.com/matplotlib/matplotlib/pull/7782">other</a> <a href="https://github.com/matplotlib/matplotlib/pull/7791">similar</a> <a href="https://github.com/matplotlib/matplotlib/pull/7796">issues</a>, which got the failure count down to <a href="https://github.com/matplotlib/matplotlib/issues/7797">just</a> <a href="https://github.com/matplotlib/matplotlib/issues/7798">two</a> sets of problems that are too domain-specific for me to figure out, and actually also happen on aarch64 and ppc64le (they're not big-endian issues). So to both the people running matplotlib on ppc64...you're welcome ;)</p> <p>Seriously, though, I suspect without these fixes, we might have had some odd cases where a noarch package's documentation would suddenly get messed up if the package happened to get built on a ppc64 builder.</p> <p></p> Of DPIs, desktops, and toolkits https://www.happyassassin.net/posts/2015/07/09/of-dpis-desktops-and-toolkits/ 2015-07-09T15:40:01Z 2015-07-09T15:40:01Z Adam Williamson <p></p><p>Fair warning up front: this is going to be one of my long, rambling, deep dives. It doesn't even have a conclusion yet, just lots of messy details about what I'm trying to deal with.</p> <p>So this week I was aiming to do some work on our openQA stuff - ideally, add some extra tests for Server things. But first, since we finally had working Rawhide images again, I figured I'd check the existing openQA tests are still working.</p> <p>Unfortunately...nope. There are various problems, but I'm focusing on one particular case here. The 'default install of a live image' test reached the 'Root Password' screen and then failed.</p> <p>With a bit of careful image comparison, I figured out why openQA wasn't happy. 
With the Workstation live image, the capital letters in the text - 'R', 'P', 'D' - were all one pixel shorter than on the reference screenshot, which had been taken from a boot.iso install. openQA works on screenshot comparisons; this difference was sufficient for it to decide the screen of the test didn't match the reference screen, and so the test failed. Now I needed to figure out <em>why</em> the fonts rendered ever so slightly differently between the two cases.</p> <p>I happen to know a bit about font rendering and GNOME internals, so I had an initial theory: the difference was in the font rendering configuration. Somehow some element of this was set differently in the Workstation live environment vs. the traditional installer environment. On the Workstation live, obviously, GNOME is running. On 'traditional install' images - boot.iso (network install) and the Server DVD image - GNOME is not running; you just have anaconda running on metacity on X.</p> <p>I happen to know that GNOME - via gnome-settings-daemon - has some of its own default settings for font rendering. It defaults to 'medium' hinting, 'greyscale' antialiasing, and a resolution of exactly 96dpi (ignoring hidpi cases for now). You can see the hinting values in <a href="https://git.gnome.org/browse/gnome-settings-daemon/tree/data/org.gnome.settings-daemon.plugins.xsettings.gschema.xml.in.in">this gsettings schema</a>, and the DPI comes in via <a href="https://git.gnome.org/browse/gnome-settings-daemon/tree/plugins/xsettings/gsd-xsettings-manager.c#n79">gsd-xsettings-manager.c</a>.</p> <p>I also knew that there's a file <code>/etc/X11/Xresources</code>, which can be used to set the same configuration at the X level in some way. I checked a clean Fedora install and found that file looks like this (minus comments):</p> <pre><code>Xft.dpi: 96 Xft.hintstyle: hintmedium Xft.hinting: true </code></pre> <p>so...it appeared to be identical to the GNOME defaults. 
At this point I was a bit stuck, and was very grateful to <a href="https://blogs.gnome.org/mclasen/">Matthias Clasen</a> for helping with the next bit.</p> <p>He told me how to find out what settings a given anaconda instance was actually <em>using</em>. You can press ctrl+shift+d (or ctrl+shift+i) in any running GTK+ 3 app to launch <a href="https://blogs.gnome.org/mclasen/2014/05/15/introducing-gtkinspector/">GtkInspector</a>, a rather handy debugging tool. By clicking the GtkSettings 'object' and then going to 'Properties', we can see the relevant settings (down at the bottom): gtk-xft-antialias, gtk-xft-dpi, gtk-xft-hinting and gtk-xft-hintstyle. In my test case - a KVM at 1024x768 resolution - I found that in the Workstation live case, gtk-xft-dpi was 98304 and gtk-xft-hintstyle was 'medium', while in the boot.iso case (and also on KDE live images), gtk-xft-dpi was 98401 and gtk-xft-hintstyle was 'full'.</p> <p>So then we had to figure out why. gtk-xft-dpi has a multiplication factor of 1024. So on the Workstation image the actual DPI being used was 98304 / 1024 = 96 (precisely), while on boot.iso it was 98401 / 1024 = 96.094726563 . I confirmed that changing the gtk-xft-dpi value to 98304 and gtk-xft-hintstyle to medium made the affected glyphs the same in boot.iso as in Workstation. But why was boot.iso getting different values from Workstation? And where did the rather odd 98401 gtk-xft-dpi value come from?</p> <p>Matthias again pointed out what actually <em>happens</em> with the <code>/etc/X11/Xresources</code> values, which I never knew. They get loaded via a command <code>xrdb</code>, which affects the contents of something called an X resource database. This is reasonably well documented on <a href="https://en.wikipedia.org/wiki/X_resources">Wikipedia</a> and in the xrdb man page. Basically, the resource database is used as a store of configuration data. The <code>/etc/X11/Xresources</code> values (usually...) 
get loaded into the RESOURCE_MANAGER property of the root window of screen 0 at X startup, and act as defaults, more or less.</p> <p>There are various paths on which the <code>/etc/X11/Xresources</code> values actually get loaded into the RESOURCE_MANAGER. There's a file <code>/etc/X11/xinitrc-common</code> which does it: it calls <code>xrdb -nocpp -merge /etc/X11/Xresources</code> if that file exists. (It also loads in ~/.Xresources afterwards if that exists, allowing a user to override the system-wide config). <code>xinitrc-common</code> is sourced by <code>/etc/X11/xinit/xinitrc</code> and <code>/etc/X11/xinit/Xsession</code>, meaning any way you start X that runs through one of those files will get the <code>/etc/X11/Xresources</code> values applied. The <code>startx</code> command runs through <code>xinit</code> which runs through <code>xinitrc</code>, so that's covered. Display managers either run through <code>/etc/X11/xinit/Xsession</code> or implement something rather similar themselves; either way, they all ultimately load in the <code>Xresources</code> settings via <code>xrdb</code>.</p> <p>Sidebar: how gnome-settings-daemon actually implements the configuration that's specified in dconf is to load it into the X resource database and <a href="http://standards.freedesktop.org/xsettings-spec/xsettings-spec-0.5.html">xsettings</a>. So what happens when you boot to GNOME is that gdm loads in the values from <code>/etc/X11/Xresources</code>, then gnome-settings-daemon overwrites them with whatever it gets from dconf.</p> <p>In the anaconda traditional installer image case, though, nothing does actually load the settings. X gets started by anaconda itself, which does it simply by calling <code>/usr/libexec/Xorg</code> directly; it doesn't do anything to load in the <code>Xresources</code> settings, either then or at any other time.</p> <p>GTK+ is written to use the values from xsettings or the X resource database, if there are any. 
But what does it do when nothing has written the settings?</p> <p>Well, as Matthias and I finally found out, it has its own internal fallbacks. As of Monday, it would default to hintstyle 'full', and for the DPI, it would calculate it from the resolution and the display size reported by X. I found that, on the test VM, with a resolution of 1024x768, xdpyinfo reported the display size as 270x203mm. The calculation uses the vertical values, so we get: 768 / (203 / 25.4) * 1024 = 98400.851231527 , which rounds up to 98401 - the gtk-xft-dpi value we found that boot.iso was using! Eureka.</p> <p>So finally we understood what was actually going on: nothing was loading in the <code>Xresources</code> or dconf settings (DPI exactly 96, medium hinting) so GTK+ was falling back on its internal defaults, and getting a DPI of just over 96 and 'full' hinting, and that was causing the discrepancy in rendering.</p> <p>The next thing I did for extra credit was figure out what's going on with KDE, which had the same rendering as traditional installer images. You'd think the KDE path would load in the <code>Xresources</code> values, though. And - although at first I guessed the problem was that it was failing to - it does. SDDM, the login manager we use for KDE, certainly does load them in via <code>xrdb</code>. No, the problem turned out to be a bit subtler.</p> <p>KDE does something broadly similar to GNOME. It has its own font rendering configuration options stored in its own settings database, and a little widget called 'krdb' which kicks in on session start and loads them into the X resource database. Qt then (I believe) respects the xrdb and xsettings values like GTK+ does.</p> <p>Like GNOME, KDE overrides any defaults loaded in from <code>Xresources</code> (or anywhere else) with its own settings. In the KDE control center you can pick hinting and antialiasing settings and a DPI value, and they all get written out via xrdb. 
However, KDE has one important difference in how it handles the DPI setting. As I said above, GNOME wires it to 96 by default. KDE, instead, tries to entirely unset the <code>Xft.dpi</code> setting. What KDE expects to happen in that case is that apps/toolkits will calculate a 'correct' DPI for the screen and use that. As we saw earlier, this is indeed what GTK+ attempts to do, and Qt attempts much the same.</p> <p>As another sidebar, while figuring all this out, Kevin Kofler - who was equally helpful on the KDE side of things as Matthias was on the GNOME side - and I worked out that the krdb code is actually quite badly broken here: instead of just unsetting <code>Xft.dpi</code>, <a href="https://bugs.kde.org/show_bug.cgi?id=350039">it throws away <em>all</em> existing X resource settings</a>. But even if that bug gets fixed, the fact that it throws out any DPI setting that came from <code>/etc/X11/Xresources</code> is actually intentional. So in the KDE path, the values from <code>/etc/X11/Xresources</code> do get loaded (by SDDM), but krdb then throws them away. And so we fall into the same path we did on boot.iso, for anaconda: GTK+ has no X resource or xsetting value for the DPI, so it calculates it, and gets the same 98401 result.</p> <p>For today's extra credit, I started looking at the DPI calculation aspect of this whole shebang from the other end: where was the 270x203mm 'display size', that results in GTK+ calculating a DPI of just over 96, coming from in the first place?</p> <p>The (rather funny, to me at least...) answer is that it comes from X!</p> <p>When you start X, it sets an initial physical screen size - and it doesn't do it the way you might, perhaps, expect. It doesn't look at the results of an EDID monitor probe, or try in some way to combine the results of multiple probes for multiple monitors. 
Nope, it simply specifies a physical size <em>with the intent of producing a particular DPI value</em>!</p> <p><a href="http://cgit.freedesktop.org/xorg/xserver/tree/hw/xfree86/modes/xf86RandR12.c#n787">Here's the code</a>.</p> <p>As you can see there are basically three paths: one if <code>monitorResolution</code> is set, one if <code>output-&gt;conf_monitor-&gt;mon_width</code> and <code>output-&gt;conf_monitor-&gt;mon_height</code> are set, and a fallback if neither of those are true. <code>monitorResolution</code> is set if the Xorg binary was called with the <code>-dpi</code> parameter. <code>mon_width</code> and <code>mon_height</code> are set if some X config file or another contains the <code>DisplaySize</code> parameter.</p> <p>If you specified a dpi on the command line, X will figure out whatever 'physical size' would result in that DPI for the resolution that's in use, and claim that the screen is that size. If you specify a size with the <code>DisplaySize</code> parameter, X just uses that. And if neither of those is true, X will pick a physical size based on a built-in default DPI, which is...96.</p> <p>One important wrinkle is that X uses ints for this calculation. What that means is it uses whole numbers; the ultimate result (the display size in millimetres) is rounded to the nearest whole number.</p> <p>You know those bits where GTK+ and Qt try to calculate a 'correct' DPI based on the monitor size? Well (I bet you can guess where this is going), so far as I can tell, the value they use is <em>the value X made up to match some arbitrary DPI in the first place</em>!</p> <p>Let's have a look at the full series of magnificent inanity that goes on when you boot up a clean Fedora KDE and run an app:</p> <ul> <li>X is called without <code>-dpi</code> and there is no <code>DisplaySize</code> config option, so X calculates a completely artificial 'physical size' for the screen in order to get a result of <code>DEFAULT_DPI</code>, which is 96. 
Let's say we get 1024x768 resolution - X figures out that a display of size 270x203mm has a resolution of 96dpi, and sets that as the 'physical size' of the display. (Well...almost. Note that bit about rounding above.)</li> <li>sddm loads in the <code>/etc/X11/Xresources</code> DPI value - precisely 96 - to the X resource database: <code>Xft.dpi: 96</code></li> <li>krdb says 'no, we don't want some silly hardcoded default DPI! We want to be clever and calculate the 'correct' DPI for the display!' and throws away the <code>Xft.dpi</code> value from the X resource database. (Because it's broken it throws away all the other settings too, but for now let's pretend it just does what it actually <em>means</em> to do.)</li> <li>When the app starts up, Qt (or GTK+, they do the same calculation) takes the resolution and the physical display size reported by X and calculates the DPI...which, not surprisingly, comes out to about 96. Only not exactly 96, because of the rounding thing: X rounded off its calculation, so in most cases, the DPI the toolkit arrives at won't be <em>quite</em> exactly 96.</li> </ul> <p>Seems a mite silly, doesn't it? :) There is something of a 'heated debate' about whether it's better to try and calculate a DPI based on screen size or simply assume 96 (with some kind of integer scaling for hidpi displays) - the GNOME code which assumes 96dpi has a long email justification from ajax as a comment block - but regardless of which side of that debate you pick, it's fairly absurd that the toolkits have this code to perform the 'correct' DPI calculation based on a screen size which is almost always going to be a lie intended to produce a specific DPI result!</p> <p>Someone <a href="http://lists.x.org/archives/xorg-devel/2011-October/026227.html">proposed a patch</a> for X.org back in 2011 which would've used the EDID-reported display size (so KDE would actually get the result it wants), but it seems to have died on the vine. 
There's some more colorful, er, 'context' in <a href="https://bugs.freedesktop.org/show_bug.cgi?id=41115">this bug report</a>.</p> <p>I haven't figured out absolutely every wrinkle of this stuff, yet. There's clearly some code in at least Qt and possibly in X which does...stuff...to at least some 'display size' values when the monitor mode is changed (via RandR) in any way. One rather funny thing is that if you go into the KDE Control Center and change the display resolution, the physical display size reported by <code>xdpyinfo</code> changes - but it doesn't change to match the correct physical size of the screen, it looks like something else does the same 'report a physical screen size that would be 96dpi at this resolution' calculation and changes the physical size to <em>that</em>. It doesn't quite seem to do the calculation identically to how X does it, though, or something - it comes up with slightly different answers.</p> <p>So on my 1920x1080 laptop, X initially sets the physical size to 507x285mm (which gives a DPI of 96.252631579). If I use KDE control center to change the resolution to 1400x1050, the physical size reported by xdpyinfo changes to 369x277mm (which gives a DPI of 96.281588448). If I then change it back to 1920x1080, the physical size changes to 506x284mm - <em>nearly</em> the same value X initially set it to, but not quite. That gives a DPI of 96.591549296. Why is that? I'm damned if I know. None of those sizes, of course, resembles the actual size of the laptop's display, which is 294x165mm (as reported by xrandr) - it's a 13" laptop.</p> <p>None of this happens, though, if I change the resolution via xrandr instead of the KDE control panel: the reported physical size stays at 507x285mm throughout. 
(So if you change resolutions in KDE with xrandr then launch a new app, it'll be using some different DPI value from your other apps, most likely).</p> <p>So what the hell do I do to make anaconda look consistent in all cases, for openQA purposes? Well, mclasen <a href="https://git.gnome.org/browse/gtk+/commit/?id=bdf0820c501437a2150d8ff0d5340246e713f73f">made a change to GTK+</a> which may be controversial, but would certainly solve my problem: he changed its 'fallback' behaviour. If no hinting style is set in the X resource database or xsettings, now it'll default to 'medium' instead of 'full' - that's fairly uncontroversial. But if no specific DPI is configured in xrdb or xsettings, instead of doing the calculation, it now simply assumes 96dpi (pace hidpi scaling). In theory, this could be a significant change - if things all worked as KDE thought they did, there'd now be a significant discrepancy between Qt and GTK+ apps in KDE on higher resolution displays. But in point of fact, since the 'calculated' DPI is going to wind up being ~96dpi in most cases anyway, it's not going to make such a huge difference...</p> <p>Without the GTK+ change (or if it gets reverted), we'd probably have to make anaconda load the <code>Xresources</code> settings in when it starts X, so it gets the precise 96dpi value rather than the calculation which ends up giving not-quite-exactly 96dpi. For the KDE case, we'd probably have to futz up the tests a bit to somehow force the 96dpi value into xrdb or xsettings before running anaconda. We're not yet running the openQA tests on the KDE live image, though, so it's not urgent yet.</p> <p></p> <p></p><p>Fair warning up front: this is going to be one of my long, rambling, deep dives. It doesn't even have a conclusion yet, just lots of messy details about what I'm trying to deal with.</p> <p>So this week I was aiming to do some work on our openQA stuff - ideally, add some extra tests for Server things. 
But first, since we finally had working Rawhide images again, I figured I'd check the existing openQA tests are still working.</p> <p>Unfortunately...nope. There are various problems, but I'm focusing on one particular case here. The 'default install of a live image' test reached the 'Root Password' screen and then failed.</p> <p>With a bit of careful image comparison, I figured out why openQA wasn't happy. With the Workstation live image, the capital letters in the text - 'R', 'P', 'D' - were all one pixel shorter than on the reference screenshot, which had been taken from a boot.iso install. openQA works on screenshot comparisons; this difference was sufficient for it to decide the screen of the test didn't match the reference screen, and so the test failed. Now I needed to figure out <em>why</em> the fonts rendered ever so slightly differently between the two cases.</p> <p>I happen to know a bit about font rendering and GNOME internals, so I had an initial theory: the difference was in the font rendering configuration. Somehow some element of this was set differently in the Workstation live environment vs. the traditional installer environment. On the Workstation live, obviously, GNOME is running. On 'traditional install' images - boot.iso (network install) and the Server DVD image - GNOME is not running; you just have anaconda running on metacity on X.</p> <p>I happen to know that GNOME - via gnome-settings-daemon - has some of its own default settings for font rendering. It defaults to 'medium' hinting, 'greyscale' antialiasing, and a resolution of exactly 96dpi (ignoring hidpi cases for now). 
You can see the hinting values in <a href="https://git.gnome.org/browse/gnome-settings-daemon/tree/data/org.gnome.settings-daemon.plugins.xsettings.gschema.xml.in.in">this gsettings schema</a>, and the DPI comes in via <a href="https://git.gnome.org/browse/gnome-settings-daemon/tree/plugins/xsettings/gsd-xsettings-manager.c#n79">gsd-xsettings-manager.c</a>.</p> <p>I also knew that there's a file <code>/etc/X11/Xresources</code>, which can be used to set the same configuration at the X level in some way. I checked a clean Fedora install and found that file looks like this (minus comments):</p> <pre><code>Xft.dpi: 96
Xft.hintstyle: hintmedium
Xft.hinting: true
</code></pre> <p>so...it appeared to be identical to the GNOME defaults. At this point I was a bit stuck, and was very grateful to <a href="https://blogs.gnome.org/mclasen/">Matthias Clasen</a> for helping with the next bit.</p> <p>He told me how to find out what settings a given anaconda instance was actually <em>using</em>. You can press ctrl+shift+d (or ctrl+shift+i) in any running GTK+ 3 app to launch <a href="https://blogs.gnome.org/mclasen/2014/05/15/introducing-gtkinspector/">GtkInspector</a>, a rather handy debugging tool. By clicking the GtkSettings 'object' and then going to 'Properties', we can see the relevant settings (down at the bottom): gtk-xft-antialias, gtk-xft-dpi, gtk-xft-hinting and gtk-xft-hintstyle. In my test case - a KVM at 1024x768 resolution - I found that in the Workstation live case, gtk-xft-dpi was 98304 and gtk-xft-hintstyle was 'medium', while in the boot.iso case (and also on KDE live images), gtk-xft-dpi was 98401 and gtk-xft-hintstyle was 'full'.</p> <p>So then we had to figure out why. gtk-xft-dpi has a multiplication factor of 1024. So on the Workstation image the actual DPI being used was 98304 / 1024 = 96 (precisely), while on boot.iso it was 98401 / 1024 = 96.094726563. 
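</p> <p>To make the units concrete, here's a quick (purely illustrative) check of that factor-of-1024 conversion - this is just the arithmetic, not GTK+'s actual code:</p>

```python
# gtk-xft-dpi stores the DPI multiplied by 1024, so converting back
# to a plain DPI figure is a simple division.
def gtk_xft_dpi_to_dpi(value):
    return value / 1024.0

print(gtk_xft_dpi_to_dpi(98304))  # 96.0 - the Workstation live value
print(gtk_xft_dpi_to_dpi(98401))  # 96.0947265625 - the boot.iso value
```

<p>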
I confirmed that changing the gtk-xft-dpi value to 98304 and gtk-xft-hintstyle to medium made the affected glyphs the same in boot.iso as in Workstation. But why was boot.iso getting different values from Workstation? And where did the rather odd 98401 gtk-xft-dpi value come from?</p> <p>Matthias again pointed out what actually <em>happens</em> with the <code>/etc/X11/Xresources</code> values, which I never knew. They get loaded via a command <code>xrdb</code>, which affects the contents of something called an X resource database. This is reasonably well documented on <a href="https://en.wikipedia.org/wiki/X_resources">Wikipedia</a> and in the xrdb man page. Basically, the resource database is used as a store of configuration data. The <code>/etc/X11/Xresources</code> values (usually...) get loaded into the RESOURCE_MANAGER property of the root window of screen 0 at X startup, and act as defaults, more or less.</p> <p>There are various paths on which the <code>/etc/X11/Xresources</code> values actually get loaded into the RESOURCE_MANAGER. There's a file <code>/etc/X11/xinitrc-common</code> which does it: it calls <code>xrdb -nocpp -merge /etc/X11/Xresources</code> if that file exists. (It also loads in ~/.Xresources afterwards if that exists, allowing a user to override the system-wide config). <code>xinitrc-common</code> is sourced by <code>/etc/X11/xinit/xinitrc</code> and <code>/etc/X11/xinit/Xsession</code>, meaning any way you start X that runs through one of those files will get the <code>/etc/X11/Xresources</code> values applied. The <code>startx</code> command runs through <code>xinit</code> which runs through <code>xinitrc</code>, so that's covered. 
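</p> <p>For illustration, here's a toy re-implementation (mine, not xrdb's) of the 'merge simple key/value resources' idea - the real <code>xrdb</code> also does preprocessing, handles includes, and much more:</p>

```python
def parse_xresources(text):
    # Minimal parser for 'key: value' Xresources-style lines.
    # '!' starts a comment line in real Xresources files.
    resources = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("!"):
            continue
        key, _, value = line.partition(":")
        resources[key.strip()] = value.strip()
    return resources

defaults = parse_xresources(
    "! system-wide font rendering defaults\n"
    "Xft.dpi: 96\n"
    "Xft.hintstyle: hintmedium\n"
    "Xft.hinting: true\n"
)
print(defaults["Xft.dpi"])  # 96
```

<p>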
Display managers either run through <code>/etc/X11/xinit/Xsession</code> or implement something rather similar themselves; either way, they all ultimately load in the <code>Xresources</code> settings via <code>xrdb</code>.</p> <p>(Sidebar: how gnome-settings-daemon actually implements the configuration that's specified in dconf is to load it into the X resource database and <a href="http://standards.freedesktop.org/xsettings-spec/xsettings-spec-0.5.html">xsettings</a>. So what happens when you boot to GNOME is that gdm loads in the values from <code>/etc/X11/Xresources</code>, then gnome-settings-daemon overwrites them with whatever it gets from dconf).</p> <p>In the anaconda traditional installer image case, though, nothing does actually load the settings. X gets started by anaconda itself, which does it simply by calling <code>/usr/libexec/Xorg</code> directly; it doesn't do anything to load in the <code>Xresources</code> settings, either then or at any other time.</p> <p>GTK+ is written to use the values from xsettings or the X resource database, if there are any. But what does it do when nothing has written the settings?</p> <p>Well, as Matthias and I finally found out, it has its own internal fallbacks. As of Monday, it would default to hintstyle 'full', and for the DPI, it would calculate it from the resolution and the display size reported by X. I found that, on the test VM, with a resolution of 1024x768, xdpyinfo reported the display size as 270x203mm. The calculation uses the vertical values, so we get: 768 / (203 / 25.4) * 1024 = 98400.851231527, which rounds up to 98401 - the gtk-xft-dpi value we found that boot.iso was using! 
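</p> <p>That calculation is easy to reproduce (again, a sketch of the arithmetic, not GTK+'s actual code):</p>

```python
MM_PER_INCH = 25.4

def fallback_gtk_xft_dpi(height_px, height_mm):
    # DPI derived from the X-reported physical height, scaled by 1024
    # and rounded to the nearest integer for the gtk-xft-dpi setting.
    return round(height_px / (height_mm / MM_PER_INCH) * 1024)

print(fallback_gtk_xft_dpi(768, 203))    # 98401 - what boot.iso got
print(fallback_gtk_xft_dpi(768, 203.2))  # 98304 - i.e. exactly 96dpi, had X
                                         # reported the un-truncated size
```

<p>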
Eureka.</p> <p>So finally we understood what was actually going on: nothing was loading in the <code>Xresources</code> or dconf settings (DPI exactly 96, medium hinting) so GTK+ was falling back on its internal defaults, and getting a DPI of just over 96 and 'full' hinting, and that was causing the discrepancy in rendering.</p> <p>The next thing I did for extra credit was figure out what's going on with KDE, which had the same rendering as traditional installer images. You'd think the KDE path would load in the <code>Xresources</code> values, though. And - although at first I guessed the problem was that it was failing to - it does. SDDM, the login manager we use for KDE, certainly does load them in via <code>xrdb</code>. No, the problem turned out to be a bit subtler.</p> <p>KDE does something broadly similar to GNOME. It has its own font rendering configuration options stored in its own settings database, and a little widget called 'krdb' which kicks in on session start and loads them into the X resource database. Qt then (I believe) respects the xrdb and xsettings values like GTK+ does.</p> <p>Like GNOME, KDE overrides any defaults loaded in from <code>Xresources</code> (or anywhere else) with its own settings. In the KDE control center you can pick hinting and antialiasing settings and a DPI value, and they all get written out via xrdb. However, KDE has one important difference in how it handles the DPI setting. As I said above, GNOME wires it to 96 by default. KDE, instead, tries to entirely unset the <code>Xft.dpi</code> setting. What KDE expects to happen in that case is that apps/toolkits will calculate a 'correct' DPI for the screen and use that. 
As we saw earlier, this is indeed what GTK+ attempts to do, and Qt attempts much the same.</p> <p>As another sidebar, while figuring all this out, Kevin Kofler - who was equally helpful on the KDE side of things as Matthias was on the GNOME side - and I worked out that the krdb code is actually quite badly broken here: instead of just unsetting <code>Xft.dpi</code>, <a href="https://bugs.kde.org/show_bug.cgi?id=350039">it throws away <em>all</em> existing X resource settings</a>. But even if that bug gets fixed, the fact that it throws out any DPI setting that came from <code>/etc/X11/Xresources</code> is actually intentional. So in the KDE path, the values from <code>/etc/X11/Xresources</code> do get loaded (by SDDM), but krdb then throws them away. And so we fall into the same path we did on boot.iso, for anaconda: GTK+ has no X resource or xsetting value for the DPI, so it calculates it, and gets the same 98401 result.</p> <p>For today's extra credit, I started looking at the DPI calculation aspect of this whole shebang from the other end: where was the 270x203mm 'display size', that results in GTK+ calculating a DPI of just over 96, coming from in the first place?</p> <p>The (rather funny, to me at least...) answer is that it comes from X!</p> <p>When you start X, it sets an initial physical screen size - and it doesn't do it the way you might, perhaps, expect. It doesn't look at the results of an EDID monitor probe, or try in some way to combine the results of multiple probes for multiple monitors. 
Nope, it simply specifies a physical size <em>with the intent of producing a particular DPI value</em>!</p> <p><a href="http://cgit.freedesktop.org/xorg/xserver/tree/hw/xfree86/modes/xf86RandR12.c#n787">Here's the code</a>.</p> <p>As you can see there are basically three paths: one if <code>monitorResolution</code> is set, one if <code>output-&gt;conf_monitor-&gt;mon_width</code> and <code>output-&gt;conf_monitor-&gt;mon_height</code> are set, and a fallback if neither of those is true. <code>monitorResolution</code> is set if the Xorg binary was called with the <code>-dpi</code> parameter. <code>mon_width</code> and <code>mon_height</code> are set if some X config file or another contains the <code>DisplaySize</code> parameter.</p> <p>If you specified a dpi on the command line, X will figure out whatever 'physical size' would result in that DPI for the resolution that's in use, and claim that the screen is that size. If you specify a size with the <code>DisplaySize</code> parameter, X just uses that. And if neither of those is true, X will pick a physical size based on a built-in default DPI, which is...96.</p> <p>One important wrinkle is that X uses ints for this calculation. What that means is it uses whole numbers; the ultimate result (the display size in millimetres) gets truncated to a whole number.</p> <p>You know those bits where GTK+ and Qt try to calculate a 'correct' DPI based on the monitor size? Well (I bet you can guess where this is going), so far as I can tell, the value they use is <em>the value X made up to match some arbitrary DPI in the first place</em>!</p> <p>Let's have a look at the full series of magnificent inanity that goes on when you boot up a clean Fedora KDE and run an app:</p> <ul> <li>X is called without <code>-dpi</code> and there is no <code>DisplaySize</code> config option, so X calculates a completely artificial 'physical size' for the screen in order to get a result of <code>DEFAULT_DPI</code>, which is 96. 
Let's say we get 1024x768 resolution - X figures out that a display of size 270x203mm has a resolution of 96dpi, and sets that as the 'physical size' of the display. (Well...almost. Note that bit about rounding above.)</li> <li>sddm loads in the <code>/etc/X11/Xresources</code> DPI value - precisely 96 - to the X resource database: <code>Xft.dpi: 96</code></li> <li>krdb says 'no, we don't want some silly hardcoded default DPI! We want to be clever and calculate the 'correct' DPI for the display!' and throws away the <code>Xft.dpi</code> value from the X resource database. (Because it's broken it throws away all the other settings too, but for now let's pretend it just does what it actually <em>means</em> to do.)</li> <li>When the app starts up, Qt (or GTK+, they do the same calculation) takes the resolution and the physical display size reported by X and calculates the DPI...which, not surprisingly, comes out to about 96. Only not exactly 96, because of the rounding thing: X rounded off its calculation, so in most cases, the DPI the toolkit arrives at won't be <em>quite</em> exactly 96.</li> </ul> <p>Seems a mite silly, doesn't it? :) There is something of a 'heated debate' about whether it's better to try and calculate a DPI based on screen size or simply assume 96 (with some kind of integer scaling for hidpi displays) - the GNOME code which assumes 96dpi has a long email justification from ajax as a comment block - but regardless of which side of that debate you pick, it's fairly absurd that the toolkits have this code to perform the 'correct' DPI calculation based on a screen size which is almost always going to be a lie intended to produce a specific DPI result!</p> <p>Someone <a href="http://lists.x.org/archives/xorg-devel/2011-October/026227.html">proposed a patch</a> for X.org back in 2011 which would've used the EDID-reported display size (so KDE would actually get the result it wants), but it seems to have died on the vine. 
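</p> <p>The round trip described in that list is short enough to sketch in a few lines of Python - my paraphrase of the logic, to be clear, not the actual X or toolkit code:</p>

```python
MM_PER_INCH = 25.4
DEFAULT_DPI = 96

def fake_physical_size_mm(width_px, height_px, dpi=DEFAULT_DPI):
    # X invents whatever millimetre size would give the target DPI,
    # with integer maths truncating the result.
    return (int(width_px * MM_PER_INCH / dpi),
            int(height_px * MM_PER_INCH / dpi))

w_mm, h_mm = fake_physical_size_mm(1024, 768)
print(w_mm, h_mm)  # 270 203 - the size xdpyinfo reported on the test VM

# ...and the toolkit then 'calculates' the DPI back from that made-up size:
dpi = 768 / (h_mm / MM_PER_INCH)
print(round(dpi, 6))  # 96.094581 - not quite the 96 that X started from
```

<p>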
There's some more colorful, er, 'context' in <a href="https://bugs.freedesktop.org/show_bug.cgi?id=41115">this bug report</a>.</p> <p>I haven't figured out absolutely every wrinkle of this stuff, yet. There's clearly some code in at least Qt and possibly in X which does...stuff...to at least some 'display size' values when the monitor mode is changed (via RandR) in any way. One rather funny thing is that if you go into the KDE Control Center and change the display resolution, the physical display size reported by <code>xdpyinfo</code> changes - but it doesn't change to match the correct physical size of the screen, it looks like something else does the same 'report a physical screen size that would be 96dpi at this resolution' calculation and changes the physical size to <em>that</em>. It doesn't quite seem to do the calculation identically to how X does it, though, or something - it comes up with slightly different answers.</p> <p>So on my 1920x1080 laptop, X initially sets the physical size to 507x285mm (which gives a DPI of 96.252631579). If I use KDE control center to change the resolution to 1400x1050, the physical size reported by xdpyinfo changes to 369x277mm (which gives a DPI of 96.281588448). If I then change it back to 1920x1080, the physical size changes to 506x284mm - <em>nearly</em> the same value X initially set it to, but not quite. That gives a DPI of 96.591549296. Why is that? I'm damned if I know. None of those sizes, of course, resembles the actual size of the laptop's display, which is 294x165mm (as reported by xrandr) - it's a 13" laptop.</p> <p>None of this happens, though, if I change the resolution via xrandr instead of the KDE control panel: the reported physical size stays at 507x285mm throughout. 
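</p> <p>Those slightly-off DPI figures are just the same vertical-size arithmetic again; a quick sanity check against the numbers above (my throwaway calculation, nothing more):</p>

```python
MM_PER_INCH = 25.4

def dpi_from_height(height_px, height_mm):
    # The toolkits use the vertical resolution and physical height.
    return height_px / (height_mm / MM_PER_INCH)

print(round(dpi_from_height(1080, 285), 9))  # 96.252631579 - X's initial 507x285mm
print(round(dpi_from_height(1050, 277), 9))  # 96.281588448 - after switching to 1400x1050
print(round(dpi_from_height(1080, 284), 9))  # 96.591549296 - after switching back
print(round(dpi_from_height(1080, 165), 1))  # 166.3 - the panel's *real* 165mm height
```

<p>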
(So if you change resolutions in KDE with xrandr then launch a new app, it'll be using some different DPI value from your other apps, most likely).</p> <p>So what the hell do I do to make anaconda look consistent in all cases, for openQA purposes? Well, mclasen <a href="https://git.gnome.org/browse/gtk+/commit/?id=bdf0820c501437a2150d8ff0d5340246e713f73f">made a change to GTK+</a> which may be controversial, but would certainly solve my problem: he changed its 'fallback' behaviour. If no hinting style is set in the X resource database or xsettings, now it'll default to 'medium' instead of 'full' - that's fairly uncontroversial. But if no specific DPI is configured in xrdb or xsettings, instead of doing the calculation, it now simply assumes 96dpi (pace hidpi scaling). In theory, this could be a significant change - if things all worked as KDE thought they did, there'd now be a significant discrepancy between Qt and GTK+ apps in KDE on higher resolution displays. But in point of fact, since the 'calculated' DPI is going to wind up being ~96dpi in most cases anyway, it's not going to make such a huge difference...</p> <p>Without the GTK+ change (or if it gets reverted), we'd probably have to make anaconda load the <code>Xresources</code> settings in when it starts X, so it gets the precise 96dpi value rather than the calculation which ends up giving not-quite-exactly 96dpi. For the KDE case, we'd probably have to futz up the tests a bit to somehow force the 96dpi value into xrdb or xsettings before running anaconda. We're not yet running the openQA tests on the KDE live image, though, so it's not urgent yet.</p>