AdamW on Linux and more (Posts about Red Hat) https://www.happyassassin.net/categories/red-hat.atom 2023-06-20T12:09:46Z Adam Williamson Nikola Fedora 37, openQA news, Mastodon and more https://www.happyassassin.net/posts/2022/11/18/fedora-37-openqa-news-mastodon-and-more/ 2022-11-18T22:34:59Z 2022-11-18T22:34:59Z Adam Williamson <p>Hey, time for my now-apparently-annual blog post, I guess? First, a quick note: I joined the herd showing up on <a href="https://joinmastodon.org/">Mastodon</a>, on the <a href="https://fosstodon.org">Fosstodon</a> server, as <a href="https://fosstodon.org/@adamw">@adamw@fosstodon.org</a>. So, you know, follow me or whatever. I posted to Twitter even less than I post here, but we'll see what happens!</p> <p>The big news lately is of course that <a href="https://fedoramagazine.org/announcing-fedora-37/">Fedora 37 is out</a>. Pulling this release together was a bit more painful than has been the norm lately, and it does have at least <a href="https://ask.fedoraproject.org/t/f37-install-media-dont-boot-in-uefi-mode-on-certain-motherboards/28364">one bug I'm sad we didn't sort out</a>, but unless you have one of a very few motherboards from six years ago and want to do a reinstall, everything should be great!</p> <p>Personally I've been running Fedora Silverblue this cycle, as an experiment to see how it fares as a daily driver and a <a href="https://en.wikipedia.org/wiki/Eating_your_own_dog_food">dogfooding</a> base. Overall it's been working fine; there are still some awkward corners if you are strict about avoiding RPM overlays, though. I'm definitely interested in <a href="https://fedoraproject.org/wiki/Changes/OstreeNativeContainerStable">Colin's big native container rework proposal</a>, which would significantly change how the rpm-ostree-based systems work and make package layering a more 'accepted' thing to do. I also found that sourcing apps feels slightly odd - I'd kinda like to use Fedora Flatpaks for everything, from a dogfooding perspective, but not everything I use is available as one, so I wound up with kind of a mix of things sourced from Flathub and from Fedora Flatpaks. I was also surprised that Fedora Flatpaks aren't generally updated terribly often, and don't seem to have 'development' branches - while Fedora 37 was in development, I couldn't get Flatpak builds of apps that matched the Fedora 37 RPM builds, I was stuck running Fedora 36-based Flatpaks. So it actually impeded my ability to test the latest versions of everything. It'd be nice to see some improvement here going forward.</p> <p>My biggest project this year has been working towards gating Rawhide critical path updates on the <a href="https://openqa.fedoraproject.org">openQA</a> tests, as we do for stable and Branched releases. This has been a deceptively large effort; ensuring all the tests work OK on Rawhide was a relatively small job, but the experience of actually having the tests running has been interesting. There are, overall, a lot more updates for Rawhide than any other release, and obviously, they tend to break things more often. First I turned the tests on for the staging instance, then after a few months trying to get on top of things there, turned them on for the production instance. I planned to run this way for a month or two to see if I could stay on top of keeping the tests running smoothly and passing when they should, and dealing with breakage. On the whole, it's been possible...but just barely. 
The increased workload means tests can take several hours to complete after an update is submitted, which isn't ideal. Because we don't have the gating turned on, when somebody does submit an update that breaks the tests, I have to ensure it gets fixed <em>right away</em>, or get it untagged before the next Rawhide compose happens; otherwise the test will fail for every subsequent update too. That can be stressful. We've also had quite a lot of 'fun' with intermittent problems like <a href="https://bugzilla.redhat.com/show_bug.cgi?id=2133829">systemd-oomd killing things it shouldn't</a>. This can result in a lot of time spent manually restarting failed tests, coming up with awkward workarounds, and trying to debug the problems.</p> <p>So, I kinda felt like things weren't quite solid enough yet to turn the gating on, and I wound up working down a path intended to help with the "too many jobs take too long" and "intermittent failures" angles. This actually started out when I <a href="https://pagure.io/fedora-comps/pull-request/764">added a proper critical path definition for KDE</a>. This rather increased the openQA workload, as it added a bunch of packages to critical path that weren't there before. There was an especially fun moment when a couple hundred KDE package updates got submitted separately as Rawhide updates, and openQA spent a day running 55 tests on all of them, including all the GNOME and Server tests.</p> <p>As part of getting the KDE stuff added to the critical path, I wound up doing a <a href="https://pagure.io/releng/pull-request/11000">big update</a> to the script that actually generates the critical path definition, and working on that made me realize it wouldn't be difficult to track the critical path package set by group, not just as one big flat list. That, in turn, could allow us to only run "relevant" openQA tests for an update: if the update is only in the KDE critical path, we don't need to run the GNOME and Server tests on it, for instance. So for the last few weeks I've been working on what turned out to be quite a lot of pieces relevant to that.</p> <p>First, I <a href="https://pagure.io/releng/pull-request/11044">added the fundamental support in the critical path generation script</a>. Then I had to make <a href="https://bodhi.fedoraproject.org">Bodhi</a> work with this. Bodhi decides whether an update is critical path or not, and openQA gets that information from Bodhi. Bodhi, as currently configured, actually gets this information from <a href="https://pdc.fedoraproject.org/">PDC</a>, which seems to me an unnecessary layer of indirection, especially as we're hoping to retire PDC; Bodhi could just as easily itself be the 'source of truth' for the critical path. So I <a href="https://github.com/fedora-infra/bodhi/pull/4755">made Bodhi capable of reading critpath information directly from the files output by the script</a>, then <a href="https://github.com/fedora-infra/bodhi/pull/4759">made it use the group information for Greenwave queries and show it in the web UI and API query results</a>. That's all a hard requirement for running fewer tests on some updates, because without that, we would still always gate on all the openQA tests for every critical path update - so if we didn't run all the tests for some update, it would always fail gating.
I also <a href="https://pagure.io/fedora-infra/ansible/c/18817b175aca65c4c1c8167eaadad231ef7a0449?branch=main">changed the Greenwave policies accordingly</a>, to only require the appropriate set of tests to pass for each critical path group, once our production Bodhi is set up to use all this new stuff - until then, the combined policy for the non-grouped decision contexts Bodhi still uses for now winds up identical to what it was before.</p> <p>Once a new Bodhi release is made and deployed to production, and we configure it to use the new grouped-critpath stuff instead of the flat definition from PDC, all of the groundwork is in place for me to actually change the openQA scheduler to check which critical path group(s) an update is in, and only schedule the appropriate tests. But along the way, I noticed this change meant Bodhi was querying Greenwave for <em>even more</em> decision contexts for each update. Right now for critical path updates Bodhi usually sends two queries to Greenwave (if there are more than seven packages in the update, it sends 2*((number of packages in update+1)/8) queries). With these changes, if an update was in, say, three critical path groups, it would send 4 (or more) queries. This slows things down, and also produces <a href="https://github.com/fedora-infra/bodhi/issues/4320">rather awkward and hard-to-understand output in the web UI</a>. So I decided to fix that too. I made it so the gating status displayed in the web UI is <a href="https://github.com/fedora-infra/bodhi/pull/4784">combined from however many queries Bodhi has to make</a>, instead of just displaying the result of each query separately. Then I tweaked greenwave to allow <a href="https://github.com/release-engineering/greenwave/pull/95">querying multiple decision contexts together</a>, and <a href="https://github.com/fedora-infra/bodhi/pull/4821">had Bodhi make use of that</a>. With those changes combined, Bodhi should only have to query once for most updates, and for updates with more than seven packages, the displayed gating status won't be confusing any more!</p> <p>I'm hoping all those Bodhi changes can be deployed to stable soon, so I can move forward with the remaining work needed, and ultimately see how much of an improvement we see. I'm hoping we'll wind up having to run rather fewer tests, which should reduce the wait time for tests to complete and also mitigate the problem of intermittent failures a bit. If this works out well enough, we might be able to move ahead with actually turning on the gating for Rawhide updates, which I'm really looking forward to doing.</p>
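<p>To make the grouped-critpath idea a little more concrete, here's a rough sketch of the kind of mapping the openQA scheduler could use once all of this is deployed. This is not the real scheduler code, and the group and test suite names are invented for illustration; it just shows the shape of "look up which critical path group(s) an update is in, and only schedule the tests relevant to those groups":</p> <div class="code"><pre class="code literal-block"># Hypothetical sketch only: the group names and test suite names are made up.
GROUP_TESTS = {
    "critical-path-gnome": {"desktop_browser", "desktop_update_graphical"},
    "critical-path-kde": {"desktop_browser_kde", "desktop_update_kde"},
    "critical-path-server": {"server_cockpit_default", "server_database"},
}

def tests_to_schedule(update_groups):
    """Given the critical path group(s) Bodhi reports for an update,
    return only the openQA test suites worth scheduling for it."""
    suites = set()
    for group in update_groups:
        suites |= GROUP_TESTS.get(group, set())
    return sorted(suites)

# An update that is only in the KDE critical path gets just the KDE
# suites, instead of the full GNOME + KDE + Server set:
print(tests_to_schedule(["critical-path-kde"]))
</pre></div>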
AdamW's Debugging Adventures: Bootloaders and machine IDs https://www.happyassassin.net/posts/2022/01/11/adamws-debugging-adventures-bootloaders-and-machine-ids/ 2022-01-11T22:08:00Z 2022-01-11T22:08:00Z Adam Williamson <p>Hi folks! Well, it looks like I forgot to blog for...<em>checks watch</em>....<em>checks calendar</em>...a year. Wow. Whoops. Sorry about that. I'm still here, though! We released, uh, lots of Fedoras since the last time I wrote about that. Fedora 35 is the current one. It's, uh, <a href="https://fedoraproject.org/wiki/Common_F35_bugs">mostly great!</a> Go <a href="https://getfedora.org/">get a copy</a>, why don't you?</p> <p>And while that's downloading, you can get comfy and listen to another of Crazy Uncle Adam's Debugging Adventures. In this episode, we'll be uncomfortably reminded just how much of the code that causes your system to actually boot at all consists of fragile shell script with no tests, so this'll be fun!</p> <p>Last month, booting a system installed from Rawhide live images stopped working properly. You could boot the live image fine, run the installation fine, but on rebooting, the system would fail to boot with an error: <code>dracut: FATAL: Don't know how to handle 'root=live:CDLABEL=Fedora-WS-Live-rawh-20211229-n-1'</code>. openQA caught this, and so did one of our QA community members - Ahed Almeleh - who <a href="https://bugzilla.redhat.com/show_bug.cgi?id=2036199">filed a bug</a>. After the end-of-year holidays, I got to figuring out what was going wrong.</p> <p>As usual, I got a bit of a head start from pre-existing knowledge.
I happen to know that error message is referring to kernel arguments that are set in the bootloader configuration of <em>the live image itself</em>. dracut is the tool that handles an early phase of boot where we boot into a temporary environment that's loaded entirely into system memory, set up the <em>real</em> system environment, and boot that. This early environment is contained in the <code>initrd</code> files you can find alongside the kernel on most Linux distributions; that's what they're for. Part of dracut's job is to be run when a kernel is installed to <em>produce</em> this environment, and then other parts of dracut are included <em>in the environment itself</em> to handle initializing things, finding the real system root, preparing it, and then switching to it. The initrd environments on Fedora live images are built to contain a dracut 'module' (called <code>90dmsquash-live</code>) that knows to interpret <code>root=live:CDLABEL=Fedora-WS-Live-rawh-20211229-n-1</code> as meaning 'go look for a live system root on the filesystem with that label and boot that'. Installed systems don't contain that module, because, well, they don't need to know how to do that, and you wouldn't really ever want an installed system to <em>try</em> and do that.</p> <p>So the short version here is: the installed system has the wrong kernel argument for telling dracut where to find the system root. It <em>should</em> look something like <code>root=/dev/mapper/fedora-root</code> (where we're pointing to a system root on an LVM volume that dracut will set up and then switch to). So the obvious next question is: why? Why is our installed system getting this wrong argument? It seemed likely that it 'leaked' from the live system to the installed system somehow, but I needed to figure out how.</p> <p>From here, I had kinda two possible ways to investigate. The easiest and fastest would probably be if I happened to know exactly how we deal with setting up bootloader configuration when running a live install. Then I'd likely have been able to start poking the most obvious places right away and figure out the problem. But, as it happens, I didn't at the time remember exactly how that works. I just remembered that I wind up having to figure it out every few years, and it's complicated and scary, so I tend to forget again right afterwards. I kinda knew where to start looking, but didn't really want to have to work it all out again from scratch if I could avoid it.</p> <p>So I went with the other possibility, which is always: figure out when it broke, and figure out what changed between the last time it worked and the first time it broke. This usually makes life much easier because now you know one of the things on that list is the problem. The shorter and simpler the list, the easier life gets.</p> <p>I looked at the openQA result history and found that the bug was introduced somewhere between 20211215.n.0 and 20211229.n.1 (unfortunately kind of a wide range). 
The good news is that only a few packages could plausibly be involved in this bug; the most likely are dracut itself, grub2 (the bootloader), grubby (a Red Hat / Fedora-specific grub configuration...thing), anaconda (the Fedora installer, which obviously does some bootloader configuration stuff), the kernel itself, and systemd (which is of course involved in the boot process itself, but also - perhaps less obviously - is where a script called <a href="https://github.com/systemd/systemd/tree/main/src/kernel-install"><code>kernel-install</code></a>, used on Fedora and many other distros to 'install' kernels, lives; this was another handy thing I happened to know already, but really - it's always a safe bet to include systemd on the list of potential suspects for anything boot-related).</p> <p>Looking at what changed between 2021-12-15 and 2021-12-29, we could rule out grub2 and grubby as they didn't change. There were some kernel builds, but nothing in the scriptlets changed in any way that could be related. dracut got a build with <a href="https://src.fedoraproject.org/rpms/dracut/c/76eb28fc2ef2f9e43b5ea66d0b9c96f83e124d4b?branch=rawhide">one change</a>, but again it seemed clearly unrelated. So I was down to anaconda and systemd as suspects. On an initial quick check during the vacation, I thought anaconda had not changed, and took a brief look at systemd, but didn't see anything immediately obvious.</p> <p>When I came back to look at it more thoroughly, I realized anaconda did get a new version (36.12) on 2021-12-15, so that initially interested me quite a lot. I spent some time going through the changes in that version, and there were some that really could have been related - it changed how running things during install inside the installed system worked (which is definitely how we do some bootloader setup stuff during install), and it had interesting commit messages like "Remove the dracut_args attribute" and "Remove upd-kernel". So I spent an afternoon fairly sure it'd turn out to be one of those, reviewed all those changes, mocked up locally how they worked, examined the logs of the actual image composes, and...concluded that none of those seemed to be the problem at all. The installer seemed to still be doing things the same as it always had. There weren't any tell-tale missing or failing bootloader config steps. However, this time wasn't entirely wasted: I was reminded of exactly what anaconda does to configure the bootloader when installing from a live image.</p> <p>When we install from a live image, we don't do what the 'traditional' installer does and install a bunch of RPM packages using dnf. The live image does not contain any RPM packages.
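<p>To make "snippet files" a bit less abstract: these are Boot Loader Specification (BLS) entries under <code>/boot/loader/entries</code>, one per installed kernel, named with the machine ID as a prefix. On one of the installs hit by this bug, an entry looked roughly like the following - the kernel version and the extra arguments here are invented for illustration, but the <code>root=live:</code> argument is the real problematic one:</p> <div class="code"><pre class="code literal-block">title Fedora Linux (Rawhide)
version 5.16.0-0.rc5.fc36.x86_64
linux /vmlinuz-5.16.0-0.rc5.fc36.x86_64
initrd /initramfs-5.16.0-0.rc5.fc36.x86_64.img
options root=live:CDLABEL=Fedora-WS-Live-rawh-20211229-n-1 rd.live.image quiet
</pre></div> <p>On a healthy installed system that <code>options</code> line would instead point at the real root, something like <code>root=/dev/mapper/fedora-root</code> - which is exactly the correction the <code>grub2-mkconfig</code> run described next is supposed to make.</p>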
The main thing <code>grub2-mkconfig</code> does is produce the main bootloader configuration file, but that's not really why we run it at this point. There's <a href="https://github.com/rhinstaller/anaconda/blob/anaconda-36.12-1/pyanaconda/modules/storage/bootloader/utils.py#L244">a very interesting comment</a> explaining why in the anaconda source:</p> <div class="code"><pre class="code literal-block"><span class="c1"># Update the bootloader configuration to make sure that the BLS</span> <span class="c1"># entries will have the correct kernel cmdline and not the value</span> <span class="c1"># taken from /proc/cmdline, that is used to boot the live image.</span> </pre></div> <p>Which is exactly what we were dealing with here. The "BLS entries" we're talking about here are the things I called 'snippet' files above, they live in <code>/boot/loader/entries</code> on Fedora systems. These are where the kernel arguments used at boot are specified, and indeed, that's where the problematic <code>root=live:...</code> arguments were specified in broken installs - in the "BLS entries" in <code>/boot/loader/entries</code>. So it seemed like, somehow, this mechanism just wasn't working right any more - we were expecting this run of <code>grub2-mkconfig</code> in the installed system root after live installation to correct those snippets, but it wasn't. However, as I said, I couldn't establish that any change to anaconda was causing this.</p> <p>So I eventually shelved anaconda at least temporarily and looked at systemd. And it turned out that systemd had changed too. During the time period in question, we'd gone from systemd 250~rc1 to 250~rc3. (If you check the build history of systemd the dates don't seem to match up - by 2021-12-29 the 250-2 build had happened already, but in fact the 250-1 and 250-2 builds were untagged for causing a different problem, so the 2021-12-29 compose had 250~rc3). By now I was obviously pretty focused on <code>kernel-install</code> as the most likely related part of systemd, so I went to my systemd git checkout and ran:</p> <div class="code"><pre class="code literal-block">git log v250-rc1..v250-rc3 src/kernel-install/ </pre></div> <p>which shows all the commits under <code>src/kernel-install</code> between 250-rc1 and 250-rc3. And that gave me another juicy-looking, yet thankfully short, set of commits:</p> <p><a href="https://github.com/systemd/systemd/commit/641e2124de6047e6010cd2925ea22fba29b25309">641e2124de6047e6010cd2925ea22fba29b25309</a> kernel-install: replace 00-entry-directory with K_I_LAYOUT in k-i <a href="https://github.com/systemd/systemd/commit/357376d0bb525b064f468e0e2af8193b4b90d257">357376d0bb525b064f468e0e2af8193b4b90d257</a> kernel-install: Introduce KERNEL_INSTALL_MACHINE_ID in /etc/machine-info <a href="https://github.com/systemd/systemd/commit/447a822f8ee47b63a4cae00423c4d407bfa5e516">447a822f8ee47b63a4cae00423c4d407bfa5e516</a> kernel-install: Remove "Default" from list of suffixes checked</p> <p>So I went and looked at all of those. And again...I got it wrong at first! This is I guess a good lesson from this Debugging Adventure: you don't always get the right answer at first, but that's okay. You just have to keep plugging, and always keep open the possibility that you're wrong and you should try something else. I spent time thinking the cause was likely a change in anaconda before focusing on systemd, then focused on the wrong systemd commit first. 
I got interested in 641e212 first, and had even written out a whole Bugzilla comment blaming it before I realized it wasn't the culprit (fortunately, I didn't post it!) I thought the problem was that the new check for <code>$BOOT_ROOT/$MACHINE_ID</code> would not behave as it should on Fedora and cause the install scripts to do something different from what they should - generating incorrect snippet files, or putting them in the wrong place, or something.</p> <p>Fortunately, I decided to test this before declaring it was the problem, and found out that it wasn't. I did this using something that turned out to be invaluable in figuring out the real problem.</p> <p>You may have noticed by this point - harking back to our intro - that this critical <code>kernel-install</code> script, key to making sure your system boots, is...a shell script. That calls other shell scripts. You know what else is a big pile of shell scripts? <code>dracut</code>. You know, that critical component that both builds and controls the initial boot environment. Big pile of shell script. The install script - the <code>dracut</code> command itself - is shell. All the dracut <code>modules</code> - the bits that do most of the work - are shell. There's a bit of C in the source tree (I'm not entirely sure what that bit does), but most of it's shell.</p> <p>Critical stuff like this being written in shell makes me shiver, because shell is very easy to get wrong, and quite hard to test properly (and in fact neither dracut nor kernel-install has good tests). But one good thing about it is that it's quite easy to <em>debug</em>, thanks to the magic of <code>sh -x</code>. If you run some shell script via <code>sh -x</code> (whether that's really <code>sh</code>, or <code>bash</code> or some other alternative pretending to be <code>sh</code>), it will run as normal but print out most of the logic (variable assignments, tests, and so on) that happen along the way. So on a VM where I'd run a broken install, I could do <code>chroot /mnt/sysimage</code> (to get into the root of the installed system), find the exact <code>kernel-install</code> command that anaconda ran from one of the logs in <code>/var/log/anaconda</code> (I forget which), and re-run it through <code>sh -x</code>. This showed me all the logic going on through the run of <code>kernel-install</code> itself and all the scripts it sources under <code>/usr/lib/kernel/install.d</code>. Using this, I could confirm that the check I suspected had the result I suspected - I could see that it was deciding that <code>layout="other"</code>, not <code>layout="bls"</code>, <a href="https://github.com/systemd/systemd/blob/v250-rc3/src/kernel-install/kernel-install#L134">here</a>. But I could <em>also</em> figure out a way to override that decision, confirm that it worked, and find that it didn't solve the problem: the config snippets were still wrong, and running <code>grub2-mkconfig</code> didn't fix them. In fact the config snippets got wronger - it turned out that we do <em>want</em> <code>kernel-install</code> to pick 'other' rather than 'bls' here, because Fedora doesn't really implement BLS according to the upstream specs, so if we let kernel-install think we do, the config snippets we get are wrong.</p> <p>So now I'd been wrong twice! But each time, I learned a bit more that eventually helped me be right. After I decided that commit wasn't the cause after all, I finally spotted the problem. 
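<p>For the record, the re-run recipe described above boils down to something like this; the kernel version and arguments in this sketch are illustrative, the real command comes straight out of the anaconda logs on the broken system:</p> <div class="code"><pre class="code literal-block"># get into the root of the broken installed system (anaconda mounts it at /mnt/sysimage)
chroot /mnt/sysimage
# re-run the kernel-install command anaconda ran, tracing the shell logic
# of the script and everything it sources as it goes (arguments illustrative)
sh -x /usr/bin/kernel-install add 5.16.0-0.rc5.fc36.x86_64 /lib/modules/5.16.0-0.rc5.fc36.x86_64/vmlinuz
</pre></div>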
I figured this out by continuing with the <code>sh -x</code> debugging, and noticing an inconsistency. By this point I'd thought to find out what part of <code>grub2-mkconfig</code> should be doing the work of correcting the key bit of configuration here. It's in a <a href="https://src.fedoraproject.org/rpms/grub2/blob/b25606806096b51e9d920f50b9bb47773641644d/f/0062-Add-BLS-support-to-grub-mkconfig.patch#_230">Fedora-only downstream patch to one of the scriptlets in <code>/etc/grub.d</code></a>. It replaces the <code>options=</code> line in any snippet files it finds with what it reckons the kernel arguments "should be". So I got curious about what exactly was going wrong there. I tweaked <code>grub2-mkconfig</code> slightly to run those scriptlets using <code>sh -x</code> by changing these lines in <code>grub2-mkconfig</code>:</p> <div class="code"><pre class="code literal-block"><span class="n">echo</span> <span class="s">"### BEGIN $i ###"</span>
<span class="s">"$i"</span>
<span class="n">echo</span> <span class="s">"### END $i ###"</span>
</pre></div> <p>to read:</p> <div class="code"><pre class="code literal-block"><span class="n">echo</span> <span class="s">"### BEGIN $i ###"</span>
<span class="n">sh</span> <span class="o">-</span><span class="n">x</span> <span class="s">"$i"</span>
<span class="n">echo</span> <span class="s">"### END $i ###"</span>
</pre></div> <p>Now I could re-run <code>grub2-mkconfig</code> and look at what was going on behind the scenes of the scriptlet, and I noticed that it wasn't <em>finding</em> any snippet files at all. But why not?</p> <p><a href="https://src.fedoraproject.org/rpms/grub2/blob/b25606806096b51e9d920f50b9bb47773641644d/f/0062-Add-BLS-support-to-grub-mkconfig.patch#_211">The code that looks for the snippet files</a> reads the file <code>/etc/machine-id</code> as a string, then looks for files in <code>/boot/loader/entries</code> whose names start with that string (and end in <code>.conf</code>). So I went and looked at my sample system and...found that the files in <code>/boot/loader/entries</code> did <em>not</em> start with the string in <code>/etc/machine-id</code>. The files in <code>/boot/loader/entries</code> started with <code>a69bd9379d6445668e7df3ddbda62f86</code>, but the ID in <code>/etc/machine-id</code> was <code>b8d80a4c887c40199c4ea1a8f02aa9b4</code>. This is why everything was broken: because those IDs didn't match, <code>grub2-mkconfig</code> couldn't find the files to correct, so the argument was wrong, so the system didn't boot.</p> <p>Now that I knew what was going wrong and only had two systemd commits left on the list, it was pretty easy to see the problem. It was in <a href="https://github.com/systemd/systemd/commit/357376d0bb525b064f468e0e2af8193b4b90d257">357376d</a>. That changes how <code>kernel-install</code> names these snippet files when creating them. It names them by finding a machine ID to use as a prefix. Previously, it used whatever string was in <code>/etc/machine-id</code>; if that file didn't exist or was empty, it just used the string "Default". After that commit, it also looks for a value specified in <code>/etc/machine-info</code>.
If there's a <code>/etc/machine-id</code> but not <code>/etc/machine-info</code> when you run <code>kernel-install</code>, it uses the value from <code>/etc/machine-id</code> and writes it to <code>/etc/machine-info</code>.</p> <p>When I checked those files, it turned out that on the live image, the ID in both <code>/etc/machine-id</code> and <code>/etc/machine-info</code> was <code>a69bd9379d6445668e7df3ddbda62f86</code> - the problematic ID on the installed system. When we generate the live image itself, <code>kernel-install</code> uses the value from <code>/etc/machine-id</code> and writes it to <code>/etc/machine-info</code>, and both files wind up in the live filesystem. But <em>on the installed system</em>, the ID in <code>/etc/machine-info</code> was that same value, but the ID in <code>/etc/machine-id</code> was different (as we saw above).</p> <p>Remember how I mentioned above that when doing a live install, we essentially dump the live filesystem itself onto the installed system? Well, one of the 'tweaks' we make when doing this is to <a href="https://github.com/rhinstaller/anaconda/blob/612bfee7d37e16d7bfab329b44182a71d04a3344/pyanaconda/modules/storage/bootloader/utils.py#L41-L48">re-generate <code>/etc/machine-id</code></a>, because that ID is meant to be unique to each installed system - we don't want every system installed from a Fedora live image to have the same machine ID as the live image itself. However, as this <code>/etc/machine-info</code> file is new, we don't strip it from or re-generate it in the installed system, we just install it. The installed system has a <code>/etc/machine-info</code> with the same ID as the live image's machine ID, but a new, different ID in <code>/etc/machine-id</code>. And this (finally) was the ultimate source of the problem! When we run them on the installed system, the new version of <code>kernel-install</code> writes config snippet files using the ID from <code>/etc/machine-info</code>. But Fedora's patched <code>grub2-mkconfig</code> scriptlet doesn't know about that mechanism at all (since it's brand new), and expects the snippet files to contain the ID from <code>/etc/machine-id</code>.</p> <p>There are various ways you could potentially solve this, but after consulting with systemd upstream, the one we chose is to <a href="https://github.com/rhinstaller/anaconda/commit/fe652add9733943a3476128a83b24b9a8c63b335">have anaconda exclude <code>/etc/machine-info</code></a> when doing a live install. The changes to systemd here aren't wrong - it does potentially make sense that <code>/etc/machine-id</code> and <code>/etc/machine-info</code> could both exist and specify different IDs in some cases. But for the case of Fedora live installs, it doesn't make sense. The sanest result is for those IDs to match and both be the 'fresh' machine ID that's generated at the end of the install process. By just not including <code>/etc/machine-info</code> on the installed system, we achieve this result, because now when <code>kernel-install</code> runs at the end of the install process, it reads the ID from <code>/etc/machine-id</code> and writes it to <code>/etc/machine-info</code>, and both IDs are the same, <code>grub2-mkconfig</code> finds the snippet files and edits them correctly, the installed system boots, and I can move along to the next debugging odyssey...</p>
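<p>Incidentally, if you want to check a system for this particular mismatch, the comparison is easy to do by hand. The IDs below are the real ones from the broken install discussed above; the entry filename's kernel version is invented, and <code>/etc/machine-info</code> can carry other keys besides the one shown. Since the patched <code>grub2-mkconfig</code> scriptlet only looks for entry files whose names start with the ID from <code>/etc/machine-id</code>, a mismatch like this means it finds nothing to fix:</p> <div class="code"><pre class="code literal-block"># on the broken installed system
$ cat /etc/machine-id
b8d80a4c887c40199c4ea1a8f02aa9b4
$ cat /etc/machine-info
KERNEL_INSTALL_MACHINE_ID=a69bd9379d6445668e7df3ddbda62f86
$ ls /boot/loader/entries/
a69bd9379d6445668e7df3ddbda62f86-5.16.0-0.rc5.fc36.x86_64.conf
</pre></div> <p>With the anaconda fix in place, both files carry the same freshly generated ID, the entry files get that ID as their prefix, and <code>grub2-mkconfig</code> can find and correct them as intended.</p>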
We released, uh, lots of Fedoras since the last time I wrote about that. Fedora 35 is the current one. It's, uh, <a href="https://fedoraproject.org/wiki/Common_F35_bugs">mostly great!</a> Go <a href="https://getfedora.org/">get a copy</a>, why don't you?</p> <p>And while that's downloading, you can get comfy and listen to another of Crazy Uncle Adam's Debugging Adventures. In this episode, we'll be uncomfortably reminded just how much of the code that causes your system to actually boot at all consists of fragile shell script with no tests, so this'll be fun!</p> <p>Last month, booting a system installed from Rawhide live images stopped working properly. You could boot the live image fine, run the installation fine, but on rebooting, the system would fail to boot with an error: <code>dracut: FATAL: Don't know how to handle 'root=live:CDLABEL=Fedora-WS-Live-rawh-20211229-n-1'</code>. openQA caught this, and so did one of our QA community members - Ahed Almeleh - who <a href="https://bugzilla.redhat.com/show_bug.cgi?id=2036199">filed a bug</a>. After the end-of-year holidays, I got to figuring out what was going wrong.</p> <p>As usual, I got a bit of a head start from pre-existing knowledge. I happen to know that error message is referring to kernel arguments that are set in the bootloader configuration of <em>the live image itself</em>. dracut is the tool that handles an early phase of boot where we boot into a temporary environment that's loaded entirely into system memory, set up the <em>real</em> system environment, and boot that. This early environment is contained in the <code>initrd</code> files you can find alongside the kernel on most Linux distributions; that's what they're for. Part of dracut's job is to be run when a kernel is installed to <em>produce</em> this environment, and then other parts of dracut are included <em>in the environment itself</em> to handle initializing things, finding the real system root, preparing it, and then switching to it. The initrd environments on Fedora live images are built to contain a dracut 'module' (called <code>90dmsquash-live</code>) that knows to interpret <code>root=live:CDLABEL=Fedora-WS-Live-rawh-20211229-n-1</code> as meaning 'go look for a live system root on the filesystem with that label and boot that'. Installed systems don't contain that module, because, well, they don't need to know how to do that, and you wouldn't really ever want an installed system to <em>try</em> and do that.</p> <p>So the short version here is: the installed system has the wrong kernel argument for telling dracut where to find the system root. It <em>should</em> look something like <code>root=/dev/mapper/fedora-root</code> (where we're pointing to a system root on an LVM volume that dracut will set up and then switch to). So the obvious next question is: why? Why is our installed system getting this wrong argument? It seemed likely that it 'leaked' from the live system to the installed system somehow, but I needed to figure out how.</p> <p>From here, I had kinda two possible ways to investigate. The easiest and fastest would probably be if I happened to know exactly how we deal with setting up bootloader configuration when running a live install. Then I'd likely have been able to start poking the most obvious places right away and figure out the problem. But, as it happens, I didn't at the time remember exactly how that works. 
I just remembered that I wind up having to figure it out every few years, and it's complicated and scary, so I tend to forget again right afterwards. I kinda knew where to start looking, but didn't really want to have to work it all out again from scratch if I could avoid it.</p> <p>So I went with the other possibility, which is always: figure out when it broke, and figure out what changed between the last time it worked and the first time it broke. This usually makes life much easier because now you know one of the things on that list is the problem. The shorter and simpler the list, the easier life gets.</p> <p>I looked at the openQA result history and found that the bug was introduced somewhere between 20211215.n.0 and 20211229.n.1 (unfortunately kind of a wide range). The good news is that only a few packages could plausibly be involved in this bug; the most likely are dracut itself, grub2 (the bootloader), grubby (a Red Hat / Fedora-specific grub configuration...thing), anaconda (the Fedora installer, which obviously does some bootloader configuration stuff), the kernel itself, and systemd (which is of course involved in the boot process itself, but also - perhaps less obviously - is where a script called <a href="https://github.com/systemd/systemd/tree/main/src/kernel-install"><code>kernel-install</code></a> that is used (on Fedora and many other distros) to 'install' kernels lives (this was another handy thing I happened to know already, but really - it's always a safe bet to include systemd on the list of potential suspects for anything boot-related).</p> <p>Looking at what changed between 2021-12-15 and 2021-12-29, we could let out grub2 and grubby as they didn't change. There were some kernel builds, but nothing in the scriptlets changed in any way that could be related. dracut got a build with <a href="https://src.fedoraproject.org/rpms/dracut/c/76eb28fc2ef2f9e43b5ea66d0b9c96f83e124d4b?branch=rawhide">one change</a>, but again it seemed clearly unrelated. So I was down to anaconda and systemd as suspects. On an initial quick check during the vacation, I thought anaconda had not changed, and took a brief look at systemd, but didn't see anything immediately obvious.</p> <p>When I came back to look at it more thoroughly, I realized anaconda did get a new version (36.12) on 2021-12-15, so that initially interested me quite a lot. I spent some time going through the changes in that version, and there were some that really could have been related - it changed how running things during install inside the installed system worked (which is definitely how we do some bootloader setup stuff during install), and it had interesting commit messages like "Remove the dracut_args attribute" and "Remove upd-kernel". So I spent an afternoon fairly sure it'd turn out to be one of those, reviewed all those changes, mocked up locally how they worked, examined the logs of the actual image composes, and...concluded that none of those seemed to be the problem at all. The installer seemed to still be doing things the same as it always had. There weren't any tell-tale missing or failing bootloader config steps. However, this time wasn't entirely wasted: I was reminded of exactly what anaconda does to configure the bootloader when installing from a live image.</p> <p>When we install from a live image, we don't do what the 'traditional' installer does and install a bunch of RPM packages using dnf. The live image does not contain any RPM packages. 
The live image itself was <em>built</em> by installing a bunch of RPM packages, but it is the <em>result</em> of that process. Instead, we essentially set up the filesystems on the drive(s) we're installing to and then just dump the contents of the live image filesystem <em>itself</em> onto them. Then we run a few tweaks to adjust anything that needs adjusting for this now being an installed system, not a live one. One of the things we do is re-generate the <code>initrd</code> file for the installed system, and then re-generate the bootloader configuration. This involves running <code>kernel-install</code> (which places the kernel and initrd files onto the boot partition, and writes some bootloader configuration 'snippet' files), and then running <code>grub2-mkconfig</code>. The main thing <code>grub2-mkconfig</code> does is produce the main bootloader configuration file, but that's not really why we run it at this point. There's <a href="https://github.com/rhinstaller/anaconda/blob/anaconda-36.12-1/pyanaconda/modules/storage/bootloader/utils.py#L244">a very interesting comment</a> explaining why in the anaconda source:</p> <div class="code"><pre class="code literal-block"><span class="c1"># Update the bootloader configuration to make sure that the BLS</span> <span class="c1"># entries will have the correct kernel cmdline and not the value</span> <span class="c1"># taken from /proc/cmdline, that is used to boot the live image.</span> </pre></div> <p>Which is exactly what we were dealing with here. The "BLS entries" we're talking about here are the things I called 'snippet' files above, they live in <code>/boot/loader/entries</code> on Fedora systems. These are where the kernel arguments used at boot are specified, and indeed, that's where the problematic <code>root=live:...</code> arguments were specified in broken installs - in the "BLS entries" in <code>/boot/loader/entries</code>. So it seemed like, somehow, this mechanism just wasn't working right any more - we were expecting this run of <code>grub2-mkconfig</code> in the installed system root after live installation to correct those snippets, but it wasn't. However, as I said, I couldn't establish that any change to anaconda was causing this.</p> <p>So I eventually shelved anaconda at least temporarily and looked at systemd. And it turned out that systemd had changed too. During the time period in question, we'd gone from systemd 250~rc1 to 250~rc3. (If you check the build history of systemd the dates don't seem to match up - by 2021-12-29 the 250-2 build had happened already, but in fact the 250-1 and 250-2 builds were untagged for causing a different problem, so the 2021-12-29 compose had 250~rc3). By now I was obviously pretty focused on <code>kernel-install</code> as the most likely related part of systemd, so I went to my systemd git checkout and ran:</p> <div class="code"><pre class="code literal-block">git log v250-rc1..v250-rc3 src/kernel-install/ </pre></div> <p>which shows all the commits under <code>src/kernel-install</code> between 250-rc1 and 250-rc3. 
And that gave me another juicy-looking, yet thankfully short, set of commits:</p> <p><a href="https://github.com/systemd/systemd/commit/641e2124de6047e6010cd2925ea22fba29b25309">641e2124de6047e6010cd2925ea22fba29b25309</a> kernel-install: replace 00-entry-directory with K_I_LAYOUT in k-i <a href="https://github.com/systemd/systemd/commit/357376d0bb525b064f468e0e2af8193b4b90d257">357376d0bb525b064f468e0e2af8193b4b90d257</a> kernel-install: Introduce KERNEL_INSTALL_MACHINE_ID in /etc/machine-info <a href="https://github.com/systemd/systemd/commit/447a822f8ee47b63a4cae00423c4d407bfa5e516">447a822f8ee47b63a4cae00423c4d407bfa5e516</a> kernel-install: Remove "Default" from list of suffixes checked</p> <p>So I went and looked at all of those. And again...I got it wrong at first! This is I guess a good lesson from this Debugging Adventure: you don't always get the right answer at first, but that's okay. You just have to keep plugging, and always keep open the possibility that you're wrong and you should try something else. I spent time thinking the cause was likely a change in anaconda before focusing on systemd, then focused on the wrong systemd commit first. I got interested in 641e212 first, and had even written out a whole Bugzilla comment blaming it before I realized it wasn't the culprit (fortunately, I didn't post it!) I thought the problem was that the new check for <code>$BOOT_ROOT/$MACHINE_ID</code> would not behave as it should on Fedora and cause the install scripts to do something different from what they should - generating incorrect snippet files, or putting them in the wrong place, or something.</p> <p>Fortunately, I decided to test this before declaring it was the problem, and found out that it wasn't. I did this using something that turned out to be invaluable in figuring out the real problem.</p> <p>You may have noticed by this point - harking back to our intro - that this critical <code>kernel-install</code> script, key to making sure your system boots, is...a shell script. That calls other shell scripts. You know what else is a big pile of shell scripts? <code>dracut</code>. You know, that critical component that both builds and controls the initial boot environment. Big pile of shell script. The install script - the <code>dracut</code> command itself - is shell. All the dracut <code>modules</code> - the bits that do most of the work - are shell. There's a bit of C in the source tree (I'm not entirely sure what that bit does), but most of it's shell.</p> <p>Critical stuff like this being written in shell makes me shiver, because shell is very easy to get wrong, and quite hard to test properly (and in fact neither dracut nor kernel-install has good tests). But one good thing about it is that it's quite easy to <em>debug</em>, thanks to the magic of <code>sh -x</code>. If you run some shell script via <code>sh -x</code> (whether that's really <code>sh</code>, or <code>bash</code> or some other alternative pretending to be <code>sh</code>), it will run as normal but print out most of the logic (variable assignments, tests, and so on) that happen along the way. So on a VM where I'd run a broken install, I could do <code>chroot /mnt/sysimage</code> (to get into the root of the installed system), find the exact <code>kernel-install</code> command that anaconda ran from one of the logs in <code>/var/log/anaconda</code> (I forget which), and re-run it through <code>sh -x</code>. 
This showed me all the logic going on through the run of <code>kernel-install</code> itself and all the scripts it sources under <code>/usr/lib/kernel/install.d</code>. Using this, I could confirm that the check I suspected had the result I suspected - I could see that it was deciding that <code>layout="other"</code>, not <code>layout="bls"</code>, <a href="https://github.com/systemd/systemd/blob/v250-rc3/src/kernel-install/kernel-install#L134">here</a>. But I could <em>also</em> figure out a way to override that decision, confirm that it worked, and find that it didn't solve the problem: the config snippets were still wrong, and running <code>grub2-mkconfig</code> didn't fix them. In fact the config snippets got wronger - it turned out that we do <em>want</em> <code>kernel-install</code> to pick 'other' rather than 'bls' here, because Fedora doesn't really implement BLS according to the upstream specs, so if we let kernel-install think we do, the config snippets we get are wrong.</p> <p>So now I'd been wrong twice! But each time, I learned a bit more that eventually helped me be right. After I decided that commit wasn't the cause after all, I finally spotted the problem. I figured this out by continuing with the <code>sh -x</code> debugging, and noticing an inconsistency. By this point I'd thought to find out what bit of <code>grub2-mkconfig</code> should be doing the work of correcting the key bit of configuration here. It's in a <a href="https://src.fedoraproject.org/rpms/grub2/blob/b25606806096b51e9d920f50b9bb47773641644d/f/0062-Add-BLS-support-to-grub-mkconfig.patch#_230">Fedora-only downstream patch to one of the scriptlets in <code>/etc/grub.d</code></a>. It replaces the <code>options=</code> line in any snippet files it finds with what it reckons the kernel arguments "should be". So I got curious about what exactly was going wrong there. I tweaked <code>grub2-mkconfig</code> slightly to run those scriptlets using <code>sh -x</code> by changing these lines in <code>grub2-mkconfig</code>:</p> <div class="code"><pre class="code literal-block"><span class="n">echo</span> <span class="s">"### BEGIN $i ###"</span> <span class="s">"$i"</span> <span class="n">echo</span> <span class="s">"### END $i ###"</span> </pre></div> <p>to read:</p> <div class="code"><pre class="code literal-block"><span class="n">echo</span> <span class="s">"### BEGIN $i ###"</span> <span class="n">sh</span> <span class="o">-</span><span class="n">x</span> <span class="s">"$i"</span> <span class="n">echo</span> <span class="s">"### END $i ###"</span> </pre></div> <p>Now I could re-run <code>grub2-mkconfig</code> and look at what was going on behind the scenes of the scriptlet, and I noticed that it wasn't <em>finding</em> any snippet files at all. But why not?</p> <p><a href="https://src.fedoraproject.org/rpms/grub2/blob/b25606806096b51e9d920f50b9bb47773641644d/f/0062-Add-BLS-support-to-grub-mkconfig.patch#_211">The code that looks for the snippet files</a> reads the file <code>/etc/machine-id</code> as a string, then looks for files in <code>/boot/loader/entries</code> whose names start with that string (and end in <code>.conf</code>). So I went and looked at my sample system and...found that the files in <code>/boot/loader/entries</code> did <em>not</em> start with the string in <code>/etc/machine-id</code>. The files in <code>/boot/loader/entries</code> started with <code>a69bd9379d6445668e7df3ddbda62f86</code>, but the ID in <code>/etc/machine-id</code> was <code>b8d80a4c887c40199c4ea1a8f02aa9b4</code>. 
This is why everything was broken: because those IDs didn't match, <code>grub2-mkconfig</code> couldn't find the files to correct, so the argument was wrong, so the system didn't boot.</p> <p>Now that I knew what was going wrong and only had two systemd commits left on the list, it was pretty easy to see the problem. It was in <a href="https://github.com/systemd/systemd/commit/357376d0bb525b064f468e0e2af8193b4b90d257">357376d</a>. That changes how <code>kernel-install</code> names these snippet files when creating them. It names them by finding a machine ID to use as a prefix. Previously, it used whatever string was in <code>/etc/machine-id</code>; if that file didn't exist or was empty, it just used the string "Default". After that commit, it also looks for a value specified in <code>/etc/machine-info</code>. If there's a <code>/etc/machine-id</code> but not <code>/etc/machine-info</code> when you run <code>kernel-install</code>, it uses the value from <code>/etc/machine-id</code> and writes it to <code>/etc/machine-info</code>.</p> <p>When I checked those files, it turned out that on the live image, the ID in both <code>/etc/machine-id</code> and <code>/etc/machine-info</code> was <code>a69bd9379d6445668e7df3ddbda62f86</code> - the problematic ID on the installed system. When we generate the live image itself, <code>kernel-install</code> uses the value from <code>/etc/machine-id</code> and writes it to <code>/etc/machine-info</code>, and both files wind up in the live filesystem. But <em>on the installed system</em>, the ID in <code>/etc/machine-info</code> was that same value, but the ID in <code>/etc/machine-id</code> was different (as we saw above).</p> <p>Remember how I mentioned above that when doing a live install, we essentially dump the live filesystem itself onto the installed system? Well, one of the 'tweaks' we make when doing this is to <a href="https://github.com/rhinstaller/anaconda/blob/612bfee7d37e16d7bfab329b44182a71d04a3344/pyanaconda/modules/storage/bootloader/utils.py#L41-L48">re-generate <code>/etc/machine-id</code></a>, because that ID is meant to be unique to each installed system - we don't want every system installed from a Fedora live image to have the same machine ID as the live image itself. However, as this <code>/etc/machine-info</code> file is new, we don't strip it from or re-generate it in the installed system; we just install it. The installed system has a <code>/etc/machine-info</code> with the same ID as the live image's machine ID, but a new, different ID in <code>/etc/machine-id</code>. And this (finally) was the ultimate source of the problem! When <code>kernel-install</code> runs on the installed system, the new version writes config snippet files using the ID from <code>/etc/machine-info</code>. But Fedora's patched <code>grub2-mkconfig</code> scriptlet doesn't know about that mechanism at all (since it's brand new), and expects the snippet files to contain the ID from <code>/etc/machine-id</code>.</p> <p>There are various ways you could potentially solve this, but after consulting with systemd upstream, the one we chose is to <a href="https://github.com/rhinstaller/anaconda/commit/fe652add9733943a3476128a83b24b9a8c63b335">have anaconda exclude <code>/etc/machine-info</code></a> when doing a live install. The changes to systemd here aren't wrong - it does potentially make sense that <code>/etc/machine-id</code> and <code>/etc/machine-info</code> could both exist and specify different IDs in some cases.
But for the case of Fedora live installs, it doesn't make sense. The sanest result is for those IDs to match and both be the 'fresh' machine ID that's generated at the end of the install process. By just not including <code>/etc/machine-info</code> on the installed system, we achieve this result, because now when <code>kernel-install</code> runs at the end of the install process, it reads the ID from <code>/etc/machine-id</code> and writes it to <code>/etc/machine-info</code>, and both IDs are the same, <code>grub2-mkconfig</code> finds the snippet files and edits them correctly, the installed system boots, and I can move along to the next debugging odyssey...</p> On inclusive language: an extended metaphor involving parties because why not https://www.happyassassin.net/posts/2020/06/17/on-inclusive-language-an-extended-metaphor-involving-parties-because-why-not/ 2020-06-17T00:57:49Z 2020-06-17T00:57:49Z Adam Williamson <p>So there's been some discussion within Red Hat about inclusive language lately, obviously related to current events and the worldwide protests against racism, especially anti-Black racism. I don't want to get into any internal details, but in one case we got into some general debate about the validity of efforts to use more inclusive language. I thought up this florid party metaphor, and I figured instead of throwing it at an internal list, I'd put it up here instead. If you have constructive thoughts on it, go ahead and mail me or start a twitter thread or something. If you have non-constructive thoughts on it, keep 'em to yourself!</p> <p>Before we get into my pontificating, though, here's some useful practical resources if you just want to read up on how you can make the language in your projects and docs more inclusive:</p> <ul> <li><a href="https://developers.google.com/style/inclusive-documentation">Google 'inclusive documentation' style guide</a></li> <li><a href="https://lethargy.org/~jesus/writes/a-guide-to-nomenclature-selection/">A Guide to Nomenclature Selection, by Theo Schlossnagle</a></li> <li><a href="https://tools.ietf.org/id/draft-knodel-terminology-00.html">Terminology, Power and Oppressive Language RFC draft by M. Knodel</a></li> </ul> <p>To provide a bit of context: I was thinking about a suggestion that people promoting the use of more inclusive language are "trying to be offended". And here's where my mind went!</p> <p>Imagine you are throwing a party. You send out the invites, order in some hors d'ouevres (did I spell that right? I never spell that right), queue up some Billie Eilish (everyone loves Billie Eilish, it's a scientific fact), set out the drinks, and wait for folks to arrive. In they all come, the room's buzzing, everyone seems to be having a good time, it's going great!</p> <p>But then you notice (or maybe someone else notices, and tells you) that most of the people at your party seem to be straight white dudes and their wives and girlfriends. That's weird, you think, I'm an open minded modern guy, I'd be happy to see some Black folks and maybe a cute gay couple or something! What gives? I don't want people to think I'm some kind of racist or sexist or homophobe or something!</p> <p>So you go and ask some non-white folks and some non-straight folks and some non-male folks what's going on. What is it? Is it me? What did I do wrong?</p> <p>Well, they say, look, it's a hugely complex issue, I mean, we could be here all night talking about it. 
And yes, fine, that broken pipeline outside your house might have something to do with it (IN-JOKE ALERT). But since you ask, look, let us break this one part of it down for you.</p> <p>You know how you've got a bouncer outside, and every time someone rolls up to the party he looks them up and down and says "well hi there! What's your name? Is it on the BLACKLIST or the WHITELIST?" Well...I mean...that might put some folks off a bit. And you know how you made the theme of the party "masters and slaves"? You know, that might have something to do with it too. And, yeah, you see how you sent all the invites to men and wrote "if your wife wants to come too, just put her name in your reply"? I mean, you know, that might speak to some people more than others, you hear what I'm saying?</p> <p>Now...this could go one of two ways. On the Good Ending, you might say "hey, you know what? I didn't think about that. Thanks for letting me know. I guess next time I'll maybe change those things up a bit and maybe it'll help. Hey thanks! I appreciate it!"</p> <p>and that would be great. But unfortunately, you might instead opt for the Bad Ending. In the Bad Ending, you say something like this:</p> <p>"Wow. I mean, just wow. I feel so <em>attacked</em> here. It's not like I called it a 'blacklist' because I'm <em>racist</em> or something. I don't have a racist bone in my body, why do you have to read it that way? You know <a href="https://github.com/graphite-project/graphite-web/issues/1569#issuecomment-237320189">blacklist doesn't even MEAN that</a>, right? And jeez, look, the whole 'masters and slaves' thing was just a bit of fun, it's not like we made all the Black people the slaves or something! And besides that whole thing was so long ago! And I mean look, most people are straight, right? It's just easier to go with what's accurate for most people. It's so inconvenient to have to think about EVERYBODY all the time. It's not like I'm <em>homophobic</em> or anything. If gay people would just write back and say 'actually I have a husband' or whatever they'd be TOTALLY welcome, I'm all cool with that. God, why do you have to be so EASILY OFFENDED? Why do you want to make me feel so guilty?"</p> <p>So, I mean. Out of Bad Ending Person and Good Ending Person...whose next party do we think is gonna be more inclusive?</p> <p>So obviously, in this metaphor, Party Throwing Person is Red Hat, or Google, or Microsoft, or pretty much any company that says "hey, we accept this industry has a problem with inclusion and we're trying to do better", and the party is our software and communities and events and so on. If you are looking at your communities and wondering why they seem to be pretty white and male and straight, and you ask folks for ideas on how to improve that, and they give you some ideas...just <em>listen</em>. And try to take them on board. You asked. They're trying to help. They are not saying you are a BAD PERSON who has done BAD THINGS and OFFENDED them and you must feel GUILTY for that. They're just trying to help you make a positive change that will help more folks feel more welcome in your communities.</p> <p>You know, in a weird way, if our Party Throwing Person wasn't quite Good Ending Person or Bad Ending person but instead said "hey, you know what, I don't care about women or Black people or gays or whatever, this is a STRAIGHT WHITE GUY PARTY! WOOOOO! SOMEONE TAP THAT KEG!"...that's almost <em>not as bad</em>. At least you know where you <em>stand</em> with that. 
You don't feel like you're getting gaslit. You can just write that idiot and their party off and try and find another. The kind of Bad Ending Person who keeps insisting they're not racist or sexist or homophobic and they totally want more minorities to show up at their party but they <em>just can't figure out</em> why they all seem to be so awkward and easily offended and why they want to make poor Bad Ending Person feel so guilty...you know...that gets pretty tiring to deal with sometimes.</p> Fedora CoreOS Test Day coming up on 2020-06-08 https://www.happyassassin.net/posts/2020/06/03/fedora-coreos-test-day-coming-up-on-2020-06-08/ 2020-06-03T22:45:27Z 2020-06-03T22:45:27Z Adam Williamson <p>Mark your calendars for next Monday, folks: 2020-06-08 will be the <a href="https://fedoraproject.org/wiki/Test_Day:Fedora_32_CoreOS_2020-06-08">very first Fedora CoreOS test day</a>! Fedora QA and the CoreOS team are collaborating to bring you this event. We'll be asking participants to test the bleeding-edge <em>next</em> stream of Fedora CoreOS, run some <a href="https://fedoraproject.org/wiki/Category:CoreOS_Test_Cases">test cases</a>, and also read over the <a href="https://docs.fedoraproject.org/en-US/fedora-coreos/">documentation</a> and give feedback.</p> <p>All the details are on the <a href="https://fedoraproject.org/wiki/Test_Day:Fedora_32_CoreOS_2020-06-08">Test Day page</a>. You can join in on the day on Freenode IRC, we'll be using <a href="https://webchat.freenode.net/#fedora-coreos">#fedora-coreos</a> rather than #fedora-test-day for this event. Please come by and help out if you have the time!</p> Fedora 32 release and Lenovo announcement https://www.happyassassin.net/posts/2020/04/28/fedora-32-release-and-lenovo-announcement/ 2020-04-28T23:18:03Z 2020-04-28T23:18:03Z Adam Williamson <p>It's been a big week in Fedora news: first came the <a href="https://fedoramagazine.org/coming-soon-fedora-on-lenovo-laptops/">announcement of Lenovo planning to ship laptops preloaded with Fedora</a>, and today <a href="https://fedoramagazine.org/announcing-fedora-32/">Fedora 32 is released</a>. I'm happy this release was again "on time" (at least if you go by our definition and not Phoronix's!), though it was kinda chaotic in the last week or so. We just changed <a href="https://pagure.io/releng/issue/9403#comment-643466">the installer, the partitioning library, the custom partitioning tool, the kernel and the main desktop's display manager</a> - that's all perfectly normal stuff to change a day before you sign off the release, right? I'm pretty confident this is fine!</p> <p>But seriously folks, I think it turned out to be a pretty good sausage, like most of the ones we've put on the shelves lately.
Please do take it for a spin and see how it works for you. </p> <p>I'm also really happy about the Lenovo announcement. The team working on that has been doing an awful lot of diplomacy and negotiation and cajoling for quite a while now and it's great to see it pay off. The RH Fedora QA team was formally brought into the plan in the last month or two, and Lenovo has kindly provided us with several test laptops which we've distributed around. While the project wasn't public we were clear that we couldn't do anything like making the Fedora 32 release contingent on test results on Lenovo hardware purely for this reason or anything like that, but both our team and Lenovo's have been running tests and we did accept several freeze exceptions to fix bugs like <a href="https://bugzilla.redhat.com/show_bug.cgi?id=1814015">this one</a>, which also affected some Dell systems and maybe others too. Now this project is officially public, it's possible we'll consider adding some official release criteria for the supported systems, or something like that, so look out for proposals on the mailing lists in future.</p> Do not upgrade to Fedora 32, and do not adjust your sets https://www.happyassassin.net/posts/2020/02/14/do-not-upgrade-to-fedora-32-and-do-not-adjust-your-sets/ 2020-02-14T17:30:26Z 2020-02-14T17:30:26Z Adam Williamson <p></p><p>If you were unlucky today, you might have received a notification from GNOME in Fedora 30 or 31 that Fedora 32 is now available for upgrade.</p> <p>This might have struck you as a bit odd, it being rather early for Fedora 32 to be out and there not being any news about it or anything. And if so, you'd be right! This was an error, and we're very sorry for it.</p> <p>What happened is that a <a href="https://admin.fedoraproject.org/pkgdb/api/collections/">particular bit of data</a> which GNOME Software (among other things) uses as its source of truth about Fedora releases was updated for the branching of Fedora 32...but by mistake, 32 was added with status 'Active' (meaning 'stable release') rather than 'Under Development'. This fooled poor GNOME Software into thinking a new stable release was available, and telling you about it.</p> <p>Kamil Paral spotted this very quickly and releng fixed it right away, but if your GNOME Software happened to check for updates during the few minutes the incorrect data was up, it will have cached it, and you'll see the incorrect notification for a while.</p> <p>Please <strong>DO NOT</strong> upgrade to Fedora 32 yet. It is under heavy development and is very much not ready for normal use. We're very sorry for the incorrect notification and we hope it didn't cause too much disruption.</p> Using Zuul CI with Pagure.io https://www.happyassassin.net/posts/2020/02/12/using-zuul-ci-with-pagure-io/ 2020-02-12T18:15:47Z 2020-02-12T18:15:47Z Adam Williamson <p></p><p>I attended <a href="https://www.devconf.info/cz/">Devconf.cz</a> again this year - I'll try and post a full blog post on that soon.
One of the most interesting talks, though, was <a href="https://devconfcz2020a.sched.com/event/YOtV/cicd-for-fedora-packaging-with-zuul">CI/CD for Fedora packaging with Zuul</a>, where Fabien Boucher and Matthieu Huin introduced the work they've done to integrate <a href="https://fedora.softwarefactory-project.io/zuul/status">a specific Zuul instance</a> (part of the <a href="https://www.softwarefactory-project.io/">Software Factory</a> effort) with the <a href="https://src.fedoraproject.org/">Pagure instance Fedora uses for packages</a> and also with <a href="https://pagure.io/">Pagure.io</a>, the general-purpose Pagure instance that many Fedora groups use to host projects, including us in QA.</p> <p>They've done a lot of work to make it as simple as possible to hook up a project in either Pagure instance to run CI via Zuul, and it looked pretty cool, so I thought I'd try it on one of our projects and see how it compares to other options, like the Jenkins-based <a href="https://www.happyassassin.net/2017/02/16/getting-started-with-pagure-ci/">Pagure CI</a>.</p> <p>I wound up more or less following the instructions on <a href="https://fedoraproject.org/wiki/Zuul-based-ci#How_to_Zuul_attach_a_Pagure_repository_on_Zuul">this Wiki page</a>, but it does not give you an example of a minimal framework in the project repository itself to actually run some checks. However, after I submitted the pull request for <a href="https://pagure.io/fedora-project-config">fedora-project-config</a> as explained on the wiki page, Tristan Cacqueray was kind enough to send me this as <a href="https://pagure.io/fedora-qa/os-autoinst-distri-fedora/pull-request/141">a pull request for my project repository</a>.</p> <p>So, all that was needed to get a kind of 'hello world' process running was:</p> <p></p><ol> <li>Add the appropriate web hook in the project options</li> <li>Add the 'zuul' user as a committer on the project in the project options</li> <li>Get a <a href="https://pagure.io/fedora-project-config/pull-request/45">pull request</a> merged to fedora-project-config to add the desired project</li> <li>Add a <a href="https://pagure.io/fedora-qa/os-autoinst-distri-fedora/pull-request/141#request_diff">basic Zuul config which runs a single job</a></li> </ol> <p>After that, the next step was to have it run <em>useful</em> checks. I set the project up such that all the appropriate checks could be run just by calling <code>tox</code> (which is a great test runner for Python projects) - see the <a href="https://pagure.io/fedora-qa/os-autoinst-distri-fedora/blob/master/f/tox.ini">tox configuration</a>. Then, with a bit more help from Tristan, I was able to tweak the Zuul config to run it successfully. This mainly required a couple of things:</p> <ol> <li>Adding <code>nodeset: fedora-31-vm</code> to the <a href="https://pagure.io/fedora-qa/os-autoinst-distri-fedora/blob/master/f/.zuul.yaml">Zuul config</a> - this makes the CI job run on a Fedora 31 VM rather than the default CentOS 7 VM (CentOS 7's tox is too old for a modern Python 3 project)</li> <li>Modifying the <a href="https://pagure.io/fedora-qa/os-autoinst-distri-fedora/blob/master/f/ci/tox.yaml">job configuration</a> to ensure tox is installed (there's a canned role for this, called <code>ensure-tox</code>) and also all available Python interpreters (using the <code>package</code> module)</li> </ol> <p>This was all pretty small and easy stuff, and we had the whole thing up and running in a few hours. 
Now it all works great, so whenever a pull request is submitted for the project, the tests are automatically run and the results shown on the pull request.</p> <p>You can set up more complex workflows where Zuul takes over merging of pull requests entirely - an admin posts a comment indicating a PR is ready to merge, whereupon Zuul will retest it and then merge it automatically if the test succeeds. This can also be used to merge series of PRs together, with proper testing. But for my small project, this simple integration is enough so far.</p> <p>It's been a positive experience working with the system so far, and I'd encourage others to try it for their packages and Pagure projects!</p> Uptime https://www.happyassassin.net/posts/2020/02/02/uptime/ 2020-02-02T17:17:40Z 2020-02-02T17:17:40Z Adam Williamson <p></p><p>OK, so that was two days longer than I was expecting! Sorry for the extended downtime, folks, especially Fedora folks. It was rather beyond my control. But now I'm (just barely) back, through the single working cable outlet in the house and a powerline ethernet connection to the router, at least until the cable co can come and fix all the other outlets!</p> In praise of WebAuthn https://www.happyassassin.net/posts/2019/12/24/in-praise-of-webauthn/ 2019-12-24T12:01:52Z 2019-12-24T12:01:52Z Adam Williamson <p></p><p>tl;dr: I just got a Yubikey 5 and set it up on a bunch of things. You should too, because WebAuthn is <em>awesome</em>.</p> <p>Now the long version!</p> <p>Two-factor authentication has been a thing for a while now. You're probably familiar with it in various forms. There's SMS-based 2FA, commonly used by banks, where they text you a code you have to re-type when logging in. Then there are token/one-time-password based systems where you can use a hardware key like a Yubikey or a software authenticator like Google Authenticator or FreeOTP to generate and enter a one-time password when logging into a system.</p> <p>If you're like I was yesterday, maybe you've got two Yubikeys on your keyring and another in a drawer somewhere and you have to remember which four systems you have set up on which slots on which key, and you've got FreeOTP as a backup.</p> <p>Maybe you've also kinda heard about "U2F" and had the vague idea that it sounded neat.
And also maybe you've read some stuff about "WebAuthn" recently and thought it sounded maybe cool but also maybe confusing and what's going on and maybe this isn't the most important thing you could be figuring out today?</p> <p>Well, prodded by a couple of mailing list threads I did figure it out a bit, and here's the thing: WebAuthn is spreading, and it's <em>awesome</em>. If you are just a person who wants to log into stuff - it's 2FA done <em>way better</em>.</p> <p>Here's the cool stuff:</p> <p></p><ul> <li>You can use one key ('authenticator') to log into as many different WebAuthn-supporting sites as you want (and this is secure, they're not all sharing the same seed or anything)</li> <li>You can register multiple authenticators per site</li> <li>An authenticator can be a hardware key (mainly a <a href="https://www.yubico.com/products/yubikey-5-overview/">Yubikey 5</a>, at this point, but the <a href="https://solokeys.com/">Solokey</a> is supposed to be a fully open-source WebAuthn-supporting key, only available to backers so far), but you can also use a phone or laptop with a fingerprint reader or facial ID system</li> <li>It works on Linux. Really easily. It works on Firefox (not just Chrome). It works on Firefox on Android. Yeah, all the stuff you kinda automatically assume is going to be a pain in the ass...isn't! It actually fricking works!</li> <li>WebAuthn-compatible keys can still support other systems too...specifically, you can get a Yubikey and use it for WebAuthn but it also still has two OTP slots, and no you don't have to do something stupid to pick which system you're using, it all just magically works somehow, I don't know how and I don't care. The Yubikey and Firefox are also backwards-compatible with U2F, so sites that implemented U2F but didn't update to WebAuthn yet work fine</li> </ul> <p>Seriously, it's awesome. And it actually works, like, right now, really well. On useful sites! Try it! Github supports it, for instance. Go to your Github account, go to the Settings page, go to Security, enable 2FA if you don't have it enabled already, and hit edit on 'Security keys'. Then click 'Register new security key'. Give it a name (like 'phone' or 'yubikey #1' or whatever). If you're using a Yubikey, plug it in and hit the button. If you're using a phone with a fingerprint sensor or facial ID, there'll be an option for 'use this device with a fingerprint' or something like that. Pick it, and touch the sensor or show it your face. And <em>that's it</em>. You're done. Then when you login you just do the same thing (plug in, push button, touch sensor, or show face) and you're in. It's like the fricking future or something.</p> <p>You can even use a Yubikey via NFC to log in with Firefox on Android (and I assume Chrome too, but I didn't try that). Yeah, I tried it, it worked. First time. (Once I figured out where the NFC sensor was, anyway). You can even apparently use your phone connected via Bluetooth to login on a computer, though I didn't try that yet - the browser should let you pick the Bluetooth-connected phone as the authenticator, then the phone will ask you for your fingerprint or face.</p> <p>It's all so much frickin' better than re-typing codes from text messages or remembering Yubikey slot numbers. I really did not realize it was gonna be this nice. It is also more secure than OTP-based systems and <em>much</em> more secure than SMS-based systems, which is great, but even if it wasn't it's just <em>nicer</em>. 
I really hope W3C and Mozilla and Google and Apple and whoever go out and sell it that way as hard as they can.</p> <p>So far I've set up my Google account (I think Google is still technically using U2F not WebAuthn, but as far as the user experience went it didn't make any difference), Github, <a href="https://bitwarden.com/">Bitwarden</a> (which is a great open-source password management service), and <a href="https://www.gandi.net/">Gandi</a> (I use them for domain registration and DNS, they're great for that), and now I'm busy writing to a ton of other sites to demand they get on the bandwagon already. I used the OTP slots for Fedora and Red Hat internal systems (neither supports WebAuthn yet, unfortunately - one limitation of WebAuthn is that it <em>is</em> fairly 'web-y', it's less suited to systems where you need to authenticate in non-web-protocol scenarios, so FAS and RH auth can't just switch over to it that easily). And my three pre-U2F Yubikeys are wiped and on their way to hardware heaven...</p> AdamW's Debugging Adventures: "dnf is locked by another application" https://www.happyassassin.net/posts/2019/10/18/adamws-debugging-adventures-dnf-is-locked-by-another-application/ 2019-10-18T13:45:04Z 2019-10-18T13:45:04Z Adam Williamson <p></p><p>Gather round the fire, kids, it's time for another Debugging Adventure!
These are posts where I write up the process of diagnosing the root cause of a bug, where it turned out to be interesting (to me, anyway...)</p> <p>This case - <a href="https://bugzilla.redhat.com/show_bug.cgi?id=1750575">Bugzilla #1750575</a> - involved dnfdragora, the package management tool used on Fedora Xfce, which is a release-blocking environment for the ARM architecture. It was a pretty easy bug to reproduce: any time you updated a package, the update would work, but then dnfdragora would show an error "DNF is locked by another process. dnfdragora will exit.", and immediately exit.</p> <p>The bug sat around on the blocker list for a while; Daniel Mach (a DNF developer) <a href="https://bugzilla.redhat.com/show_bug.cgi?id=1750575#c10">looked into it a bit</a> but didn't have time to figure it out all the way. So I got tired of waiting for someone else to do it, and decided to work it out myself.</p> <p></p><h2>Where's the error coming from?</h2> <p>As a starting point, I had a nice error message - so the obvious thing to do is figure out where that message comes from. The text appears in a couple of places in dnfdragora - in an <a href="https://github.com/manatools/dnfdragora/blob/master/dnfdragora/basedragora.py#L131">exception handler</a> and also in a <a href="https://github.com/manatools/dnfdragora/blob/master/dnfdragora/basedragora.py#L90">method for setting up a connection to dnfdaemon</a>. So, if we didn't already know (I happened to) this would be the point at which we'd realize that dnfdragora is a frontend app to a backend - <a href="https://github.com/manatools/dnfdaemon">dnfdaemon</a> - which does the heavy lifting.</p> <p>So, to figure out in more detail how we were getting to one of these two points, I hacked both the points where that error is logged. Both of them read <code>logger.critical(errmsg)</code>. I changed this to <code>logger.exception(errmsg)</code>. <code>logger.exception</code> is a very handy feature of Python's <a href="https://docs.python.org/3/library/logging.html">logging module</a> which logs whatever message you specify, <strong>plus a traceback to the current state</strong>, just like the traceback you get if the app actually crashes. 
So by doing that, the dnfdragora log (it logs to a file <code>dnfdragora.log</code> in the directory you run it from) gave us a traceback showing how we got to the error:</p> <pre><code>2019-10-14 17:53:29,436 [dnfdragora](ERROR) dnfdaemon client error: g-io-error-quark: GDBus.Error:org.baseurl.DnfSystem.LockedError: dnf is locked by another application (36)
Traceback (most recent call last):
  File "/usr/bin/dnfdragora", line 85, in &lt;module&gt;
    main_gui.handleevent()
  File "/usr/lib/python3.7/site-packages/dnfdragora/ui.py", line 1273, in handleevent
    if not self._searchPackages(filter, True) :
  File "/usr/lib/python3.7/site-packages/dnfdragora/ui.py", line 949, in _searchPackages
    packages = self.backend.search(fields, strings, self.match_all, self.newest_only, tags )
  File "/usr/lib/python3.7/site-packages/dnfdragora/misc.py", line 135, in newFunc
    rc = func(*args, **kwargs)
  File "/usr/lib/python3.7/site-packages/dnfdragora/dnf_backend.py", line 464, in search
    newest_only, tags)
  File "/usr/lib/python3.7/site-packages/dnfdaemon/client/__init__.py", line 508, in Search
    fields, keys, attrs, match_all, newest_only, tags))
  File "/usr/lib/python3.7/site-packages/dnfdaemon/client/__init__.py", line 293, in _run_dbus_async
    result = self._get_result(data)
  File "/usr/lib/python3.7/site-packages/dnfdaemon/client/__init__.py", line 277, in _get_result
    self._handle_dbus_error(user_data['error'])
  File "/usr/lib/python3.7/site-packages/dnfdaemon/client/__init__.py", line 250, in _handle_dbus_error
    raise DaemonError(str(err))
dnfdaemon.client.DaemonError: g-io-error-quark: GDBus.Error:org.baseurl.DnfSystem.LockedError: dnf is locked by another application (36)
</code></pre> <p>So, this tells us quite a bit of stuff. We know we're crashing in some sort of 'search' operation, and dbus seems to be involved. We can also see a bit more of the architecture here. Note how we have <code>dnfdragora/dnf_backend.py</code> and <code>dnfdaemon/client/__init__.py</code> included in the trace, even though we're only in the dnfdragora executable here (dnfdaemon is a separate process). Looking at that and then looking at those files a bit, it's quite easy to see that the dnfdaemon Python library provides a sort of <a href="https://github.com/manatools/dnfdaemon/blob/master/python/dnfdaemon/client/__init__.py#L199">framework for a client class called (oddly enough) <code>DnfDaemonBase</code></a> which the actual client - dnfdragora in our case - is expected to subclass and flesh out. dnfdragora does this in a <a href="https://github.com/manatools/dnfdragora/blob/master/dnfdragora/dnf_backend.py#L144">class called <code>DnfRootBackend</code></a>, which inherits from both <code>dnfdragora.backend.Backend</code> (a sort of abstraction layer for dnfdragora to have multiple of these backends, though at present it only actually has this one) and <code>dnfdaemon.client.Client</code>, which is just a <a href="https://github.com/manatools/dnfdaemon/blob/master/python/dnfdaemon/client/__init__.py#L544">small extension to <code>DnfDaemonBase</code></a> that adds some dbus signal handling.</p> <p>So now we know more about the design we're dealing with, and we can also see that we're trying to do some sort of search operation which looks like it works by the client class communicating with the actual <code>dnfdaemon</code> server process via dbus, only we're hitting some kind of error in that process, and interpreting it as 'dnf is locked by another application'.
If we dig a little deeper, we can figure out a bit more. We have to read through all of the backtrace frames and examine the functions, but ultimately we can figure out that <code>DnfRootBackend.Search()</code> is wrapped by <code>dnfdragora.misc.ExceptionHandler</code>, which handles <code>dnfdaemon.client.DaemonError</code> exceptions - like the one that's ultimately getting raised here! - by calling the base class's own <code>exception_handler()</code> on them...and for us, that's <a href="https://github.com/manatools/dnfdragora/blob/master/dnfdragora/basedragora.py#L121">BaseDragora.exception_handler</a>, one of the two places we found earlier that ultimately produces this "DNF is locked by another process. dnfdragora will exit" text. We also now have <em>two</em> indications (the dbus error itself, and the code in <code>exception_handler()</code>) that the error we're dealing with is "LockedError".</p> <h2>A misleading error...</h2> <p>At this point, I went looking for the text <code>LockedError</code>, and found it in two files in dnfdaemon that are kinda variants on each other - <a href="https://github.com/manatools/dnfdaemon/blob/master/daemon/dnfdaemon-session.py">daemon/dnfdaemon-session.py</a> and <a href="https://github.com/manatools/dnfdaemon/blob/master/daemon/dnfdaemon-system.py">daemon/dnfdaemon-system.py</a>. I didn't actually know offhand which of the two is used in our case, but it doesn't really matter, because the codepath to <code>LockedError</code> is the same in both. There's a <a href="https://github.com/manatools/dnfdaemon/blob/master/daemon/dnfdaemon-system.py#L700">function called <code>check_lock()</code></a> which checks that <code>self._lock == sender</code>, and if it doesn't, raises <code>LockedError</code>. That sure looks like where we're at.</p> <p>So at this point I did a bit of poking around into how <code>self._lock</code> gets set and unset in the daemon. It turns out to be pretty simple. The daemon is basically implemented as a class with a bunch of methods that are wrapped by <code>@dbus.service.method</code>, which makes them accessible as <em>DBus</em> methods. (One of them is <code>Search()</code>, and we can see that the <a href="https://github.com/manatools/dnfdaemon/blob/master/python/dnfdaemon/client/__init__.py#L493">client class's own <code>Search()</code> basically just calls that</a>). There are also methods called <a href="https://github.com/manatools/dnfdaemon/blob/master/daemon/dnfdaemon-system.py#L105"><code>Lock()</code></a> and <a href="https://github.com/manatools/dnfdaemon/blob/master/daemon/dnfdaemon-system.py#L350"><code>Unlock()</code></a>, which - not surprisingly - set and release this lock, by setting the daemon class' <code>self._lock</code> to be either an identifier for the DBus client or <code>None</code>, respectively. And when the daemon is first initialized, the value is set to <code>None</code>.</p> <p>At this point, I realized that the error we're dealing with here is actually a lie in two important ways:</p> <ol> <li>The message claims that the problem is the lock being held "by another application", but that's not what <code>check_lock()</code> checks, really. It passes <strong>only if the caller holds the lock</strong>. It <em>does</em> fail if the lock is held "by another application", but it <em>also</em> fails if <strong>the lock is not held at all</strong>. Given all the code we looked at so far, we can't actually trust the message's assertion that something else is holding the lock.
It is also possible that the lock is not held at all.</li> <li>The message suggests that the lock in question is a lock on dnf itself. I know dnf/libdnf <em>do</em> have locking, so up to now I'd been assuming we were actually dealing with the locking in dnf itself. But at this point I realized we weren't. The dnfdaemon lock code we just looked at doesn't actually call or wrap dnf's own locking code in any way. This lock we're dealing with is entirely internal to dnfdaemon. It's really a lock on <em>the dnfdaemon instance itself</em>.</li> </ol> <p>So, at this point I started thinking of the error as being "dnfdaemon is either locked by another DBus client, or not locked at all".</p> <h2>So what's going on with this lock anyway?</h2> <p>My next step, now I understood the locking process we're dealing with, was to stick some logging into it. I added log lines to the <code>Lock()</code> and <code>Unlock()</code> methods, and I also made <code>check_lock()</code> log what <code>sender</code> and <code>self._lock</code> were set to before returning. Because the daemon's <code>__init__</code> is what sets <code>self._lock</code> to <code>None</code>, I also added a log line there that just records that we're in it. That got me some more useful information:</p> <pre><code>2019-10-14 18:53:03.397784 XXX In DnfDaemon.init now!
2019-10-14 18:53:03.402804 XXX LOCK: sender is :1.1835
2019-10-14 18:53:03.407524 XXX CHECK LOCK: sender is :1.1835 XXX CHECK LOCK: self._lock is :1.1835
2019-10-14 18:53:07.556499 XXX CHECK LOCK: sender is :1.1835 XXX CHECK LOCK: self._lock is :1.1835
[...snip a bunch more calls to check_lock where the sender is the same...]
2019-10-14 18:53:13.560828 XXX CHECK LOCK: sender is :1.1835 XXX CHECK LOCK: self._lock is :1.1835
2019-10-14 18:53:13.560941 XXX CHECK LOCK: sender is :1.1835 XXX CHECK LOCK: self._lock is :1.1835
2019-10-14 18:53:16.513900 XXX In DnfDaemon.init now!
2019-10-14 18:53:16.516724 XXX CHECK LOCK: sender is :1.1835 XXX CHECK LOCK: self._lock is None
</code></pre> <p>so we could see that when we started dnfdragora, dnfdaemon started up and dnfdragora locked it almost immediately, then throughout the whole process of reproducing the bug - run dnfdragora, search for a package to be updated, mark it for updating, run the transaction, wait for the error - there were several instances of DBus method calls where everything worked fine (we see <code>check_lock()</code> being called and finding <code>sender</code> and <code>self._lock</code> set to the same value, the identifier for dnfdragora), but then suddenly we see the daemon's <code>__init__</code> running again for some reason, not being locked, and then a <code>check_lock()</code> call that fails because the daemon instance's <code>self._lock</code> is <code>None</code>.</p> <p>After a couple of minutes, I guessed what was going on here, and the <a href="https://bugzilla.redhat.com/show_bug.cgi?id=1750575#c23">daemon's service logs confirmed it</a> - dnfdaemon was crashing and automatically restarting. The first attempt to invoke a DBus method after the crash and restart fails, because dnfdragora has not locked this <em>new</em> instance of the daemon (it has no idea it just crashed and restarted), so <code>check_lock()</code> fails.
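</p> <p>A stripped-down rendition of that locking logic makes it pretty clear why a restarted daemon rejects the old client. Again, this is just a sketch of the behaviour described above, not the actual dnfdaemon code:</p> <pre><code>class LockedError(Exception):
    pass

class ToyDaemon:
    def __init__(self):
        self._lock = None           # a freshly started daemon is unlocked

    def Lock(self, sender):
        if self._lock is None:
            self._lock = sender     # remember which DBus client locked us

    def Unlock(self, sender):
        if self._lock == sender:
            self._lock = None

    def check_lock(self, sender):
        # passes only if the *caller* holds the lock; it fails both when another
        # client holds the lock and when nobody holds it at all
        if self._lock != sender:
            raise LockedError("dnf is locked by another application")

daemon = ToyDaemon()
daemon.Lock(":1.1835")              # dnfdragora locks the daemon right after startup
daemon.check_lock(":1.1835")        # fine: every method call passes the check

daemon = ToyDaemon()                # the daemon crashes and restarts: the lock is gone
try:
    daemon.check_lock(":1.1835")    # the very next method call from dnfdragora...
except LockedError as err:
    print(err)                      # ...fails, even though nothing holds the lock
</code></pre> <p>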
So as soon as a DBus method invocation is attempted after the dnfdaemon crash, dnfdragora errors out with the confusing "dnf is locked by another process" error.</p> <p>The crash was already mentioned in the bug report, but until now the exact interaction between the crash and the error had not been worked out - we just knew the daemon crashed and the app errored out, but we didn't really know what order those things happened in or how they related to each other.</p> <h2>OK then...why is dnfdaemon crashing?</h2> <p>So, the question now became: why is dnfdaemon crashing? Well, the backtrace we had didn't tell us a lot; really it only told us that something was going wrong in libdbus, which we could also tell from the dnfdaemon service log:</p> <pre><code>Oct 14 18:53:15 adam.happyassassin.net dnfdaemon-system[226042]: dbus[226042]: arguments to dbus_connection_unref() were incorrect, assertion "connection-&gt;generation == _dbus_current_generation" failed in file ../../dbus/dbus-connection.c line 2823.
Oct 14 18:53:15 adam.happyassassin.net dnfdaemon-system[226042]: This is normally a bug in some application using the D-Bus library.
Oct 14 18:53:15 adam.happyassassin.net dnfdaemon-system[226042]: D-Bus not built with -rdynamic so unable to print a backtrace
</code></pre> <p>that last line looked like a cue, so of course, off I went to figure out how to build DBus with <code>-rdynamic</code>. A bit of Googling told me - thanks "the3dfxdude"! - that <a href="https://www.linuxquestions.org/questions/slackware-14/compile-dbus-with-backtrace-support-4175459330/#post4938346">the trick is to compile with --enable-asserts</a>. So I did that and reproduced the bug again, and <a href="https://bugzilla.redhat.com/show_bug.cgi?id=1750575#c25">got a bit of a better backtrace</a>. It's a long one, but by picking through it carefully I could spot - in frame #17 - the actual point at which the problem happened, which was <a href="https://github.com/manatools/dnfdaemon/blob/master/python/dnfdaemon/server/__init__.py#L595">in <code>dnfdaemon.server.DnfDaemonBase.run_transaction()</code></a>. (Note, this is a <em>different</em> <code>DnfDaemonBase</code> class from <code>dnfdaemon.client.DnfDaemonBase</code>; I don't know why they have the same name, that's just confusing).</p> <p>So, the daemon's crashing on this <code>self.TransactionEvent('end-run', NONE)</code> call. I poked into what that does a bit, and found a design here that kinda mirrors what happens on the client side: this <code>DnfDaemonBase</code>, like the other one, is a framework for a full daemon implementation, and it's subclassed by <a href="https://github.com/manatools/dnfdaemon/blob/master/daemon/dnfdaemon-system.py#L59">a <code>DnfDaemon</code> class here</a>. That class defines <a href="https://github.com/manatools/dnfdaemon/blob/master/daemon/dnfdaemon-system.py#L639">a <code>TransactionEvent</code> method</a> that emits a DBus signal. So...we're crashing when trying to emit a dbus signal. That all adds up with the backtrace going through libdbus and all. But, <em>why</em> are we crashing?</p> <p>At this point I tried to make a small reproducer (which basically just set up a <code>DnfDaemon</code> instance and called <code>self.TransactionEvent</code> in the same way, I think) but that didn't work - I didn't know why at the time, but figured it out later.
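</p> <p>If you haven't poked at python-dbus before, the daemon-side pattern we keep bumping into - methods exported with <code>@dbus.service.method</code>, and signals like <code>TransactionEvent</code> declared with <code>@dbus.service.signal</code> and emitted simply by calling the decorated method - looks roughly like this. To be clear, this is an illustrative sketch with made-up interface and member names, not the actual dnfdaemon source:</p> <pre><code>import dbus
import dbus.service

IFACE = 'org.example.ToyDaemon'   # hypothetical interface name

class ToyService(dbus.service.Object):
    @dbus.service.method(IFACE, in_signature='s', out_signature='s')
    def Echo(self, text):
        # runs in the daemon process when a client calls Echo over DBus
        return text

    @dbus.service.signal(IFACE, signature='ss')
    def TransactionEvent(self, event, data):
        # a signal method's body is conventionally empty: calling
        # self.TransactionEvent('end-run', '') emits the signal on the bus
        pass
</code></pre> <p>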
Continuing to trace it out through code wouldn't be that easy because now we're in DBus, which I know from experience is a big complex codebase that's not that easy to just reason your way through. We had the actual DBus error to work from too - "arguments to dbus_connection_unref() were incorrect, assertion "connection-&gt;generation == _dbus_current_generation" failed" - and I looked into that a bit, but there were no really helpful leads there (I got a bit more understanding about what the error means exactly, but it didn't help me understand <em>why it was happening</em> at all).</p> <h2>Time for the old standby...</h2> <p>So, being a bit stuck, I fell back on the most trusty standby: trial and error! Well, also a <em>bit</em> of logic. It did occur to me that the dbus broker is itself a long-running daemon that other things can talk to. So I started just wondering if something was interfering with dnfdaemon's connection with the dbus broker, somehow. This was in my head as I poked around at stuff - wherever I wound up looking, I was looking for stuff that involved dbus.</p> <p>But to figure out where to look, I just started hacking up dnfdaemon a bit. Now this first part is probably pure intuition, but that <code>self._reset_base()</code> call on the line right before the <code>self.TransactionEvent</code> call that crashes bugged me. It's probably just long experience telling me that anything with "reset" or "refresh" in the name is bad news. :P So I thought, hey, what happens if we move it?</p> <p>I stuck some logging lines into this <code>run_transaction</code> so I knew where we got to before we crashed - this is a great dumb trick, btw, just stick lines like <code>self.logger('XXX HERE 1')</code>, <code>self.logger('XXX HERE 2')</code> etc. between every significant line in the thing you're debugging, and grep the logs for "XXX" - and moved the <code>self._reset_base()</code> call down under the <code>self.TransactionEvent</code> call...and found that when I did that, we got further, the <code>self.TransactionEvent</code> call worked and we crashed the next time something <em>else</em> tried to emit a DBus signal. I also tried commenting out the <code>self._reset_base()</code> call entirely, and found that now we would only crash the next time a DBus signal was emitted <em>after a subsequent call to the <code>Unlock()</code> method</em>, which is another method that calls <code>self._reset_base()</code>. So, at this point I was pretty confident in this description: "dnfdaemon is crashing on the first interaction with DBus after <code>self._reset_base()</code> is called".</p> <p>So my next step was to break down what <code>_reset_base()</code> was actually <em>doing</em>. Turns out all of the detail is in the <code>DnfDaemonBase</code> skeleton server class: it has a <code>self._base</code> which is a <code>dnf.base.Base()</code> instance, and that method just calls that instance's <code>close()</code> method and sets <code>self._base</code> to <code>None</code>. So off I went into dnf code to see <a href="https://github.com/rpm-software-management/dnf/blob/master/dnf/base.py#L454">what <code>dnf.base.Base.close()</code> does</a>. Turns out it basically does two things: it calls <code>self._finalize_base()</code> and then calls <code>self.reset(True, True, True)</code>.</p> <p>Looking at the code it wasn't immediately obvious which of these would be the culprit, so it was all aboard the trial and error train again! 
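</p> <p>As an aside, that "XXX HERE" logging trick from a couple of paragraphs back is about as dumb as debugging techniques get, which is exactly why it works. In case it's not obvious, here's the kind of thing I mean, with a made-up function standing in for <code>run_transaction()</code>:</p> <pre><code>import logging

logging.basicConfig(filename='debug.log', level=logging.DEBUG)
log = logging.getLogger('xxx-trick')

def fragile_operation():
    log.debug('XXX HERE 1')
    numbers = list(range(5))   # stand-in for a real step
    log.debug('XXX HERE 2')
    result = numbers[10]       # pretend this is the line that blows up
    log.debug('XXX HERE 3')    # never reached, so it never shows up in the log
    return result

try:
    fragile_operation()
except IndexError:
    pass
# afterwards, grepping debug.log for XXX shows HERE 1 and HERE 2 but not HERE 3,
# which pins the failure down to the line between markers 2 and 3
</code></pre> <p>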
I changed the call to <code>self._reset_base()</code> in the daemon to <code>self._base.reset(True, True, True)</code>...and the bug stopped happening! So that told me the problem was in the call to <code>_finalize_base()</code>, not the call to <code>reset()</code>. So I dug into what <code>_finalize_base()</code> does and kinda repeated this process - I kept drilling down through layers and splitting up what things did into individual pieces, and doing subsets of those pieces at a time to try and find the "smallest" thing I could which would cause the bug.</p> <p>To take a short aside...this is what I really like about these kinds of debugging odysseys. It's like being a detective, only ultimately you know that there's a definite <em>reason</em> for what's happening and there's always some way you can get closer to it. If you have enough patience there's always a next step you can take that will get you a little bit closer to figuring out what's going on. You just have to keep working through the little steps until you finally get there.</p> <p>Eventually I lit upon <a href="https://github.com/rpm-software-management/dnf/blob/master/dnf/rpm/transaction.py#L50">this bit of <code>dnf.rpm.transaction.TransactionWrapper.close()</code></a>. That was the key, as close as I could get to it: reducing the daemon's <code>self._reset_base()</code> call to just <code>self._base._priv_ts.ts = None</code> (which is what that line does) was enough to cause the bug. That was the one thing out of all the things that <code>self._reset_base()</code> does which caused the problem.</p> <p>So, of course, I took a look at what this <code>ts</code> thing was. Turns out it's an instance of <code>rpm.TransactionSet</code>, from RPM's Python library. So, at some point, we're setting up an instance of <code>rpm.TransactionSet</code>, and at this point we're dropping our reference to it, which - point to ponder - might trigger some kind of cleanup on it.</p> <p>Remember how I was looking for things that deal with dbus? Well, that turned out to bear fruit at this point...because what I did next was simply to go to my git checkout of rpm and grep it for 'dbus'. And lo and behold...<a href="https://github.com/rpm-software-management/rpm/blob/ab601b882b9d9d8248250111317615db1aa7b7c6/plugins/systemd_inhibit.c">this showed up</a>.</p> <p>Turns out RPM has plugins (TIL!), and in particular, it has <em>this</em> one, which talks to dbus. (What it actually does is try to inhibit systemd from suspending or shutting down the system while a package transaction is happening). And this plugin has a <a href="https://github.com/rpm-software-management/rpm/blob/ab601b882b9d9d8248250111317615db1aa7b7c6/plugins/systemd_inhibit.c#L85">cleanup function which calls something called <code>dbus_shutdown()</code></a> - aha!</p> <p>This was enough to get me pretty suspicious. So I checked my system and, indeed, I had a package <code>rpm-plugin-systemd-inhibit</code> installed. I poked at dependencies a bit and found that <code>python3-dnf</code> recommends that package, which means it'll basically be installed on nearly all Fedora installs. Still looking like a prime suspect. So, it was easy enough to check: I put the code back to a state where the crash happened, uninstalled the package, and tried again...and bingo! The crash stopped happening.</p> <p>So at this point the case was more or less closed. I just had to do a bit of confirming and tidying up. 
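</p> <p>One footnote on that "point to ponder": in CPython, dropping the last reference to an object runs its cleanup right there and then, which is how an innocent-looking assignment like <code>self._base._priv_ts.ts = None</code> can end up synchronously running C-level teardown code - and, in this case, the plugin cleanup that calls <code>dbus_shutdown()</code>. A trivial illustration, with made-up class names:</p> <pre><code>class Handle:
    """Stand-in for an object wrapping C-level state, like rpm.TransactionSet."""
    def __del__(self):
        print("C-level cleanup runs right now")

class Wrapper:
    def __init__(self):
        self.ts = Handle()

w = Wrapper()
w.ts = None   # the Handle's reference count hits zero here, so its cleanup fires
              # immediately - not at some later garbage collection pass
</code></pre> <p>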
I checked and it turned out that indeed this call to <code>dbus_shutdown()</code> had been <a href="https://github.com/rpm-software-management/rpm/commit/d5f201345f6d27b6280750e5c6502f4418614fbc">added quite recently</a>, which tied in with the bug not showing up earlier. I looked up the <a href="https://dbus.freedesktop.org/doc/api/html/group__DBusMemory.html#ga01912903e39428872920d861ef565bac">documentation for <code>dbus_shutdown()</code></a> which confirmed that it's a bit of a big cannon which certainly could cause a problem like this:</p> <p>"Frees all memory allocated internally by libdbus and reverses the effects of dbus_threads_init().</p> <p>libdbus keeps internal global variables, for example caches and thread locks, and it can be useful to free these internal data structures.</p> <p>...</p> <p>You can't continue to use any D-Bus objects, such as connections, that were allocated prior to dbus_shutdown(). You can, however, start over; call dbus_threads_init() again, create new connections, and so forth."</p> <p>and then I did a scratch build of rpm with the commit reverted, tested, and found that indeed, it solved the problem. So, we finally had our culprit: when the <code>rpm.TransactionSet</code> instance went out of scope, it got cleaned up, and that resulted in this plugin's cleanup function getting called, and <code>dbus_shutdown()</code> happening. The RPM devs had intended that call to clean up <em>the RPM plugin's</em> DBus handles, but this is all happening in a single process, so the call also cleaned up the DBus handles used by dnfdaemon itself, and that was enough (as the docs suggest) to cause any further attempts to communicate with DBus in dnfdaemon code to blow up and crash the daemon.</p> <p>So, that's how you get from dnfdragora claiming that DNF is locked by another process to a stray RPM plugin crashing dnfdaemon on a DBus interaction!</p>