Hi folks! Well, it looks like I forgot to blog for...*checks watch*....*checks calendar*...a year. Wow. Whoops. Sorry about that. I'm still here, though! We released, uh, lots of Fedoras since the last time I wrote about that. Fedora 35 is the current one. It's, uh, [mostly great!](https://fedoraproject.org/wiki/Common_F35_bugs) Go [get a copy](https://getfedora.org/), why don't you? And while that's downloading, you can get comfy and listen to another of Crazy Uncle Adam's Debugging Adventures. In this episode, we'll be uncomfortably reminded just how much of the code that causes your system to actually boot at all consists of fragile shell script with no tests, so this'll be fun! Last month, booting a system installed from Rawhide live images stopped working properly. You could boot the live image fine, run the installation fine, but on rebooting, the system would fail to boot with an error: `dracut: FATAL: Don't know how to handle 'root=live:CDLABEL=Fedora-WS-Live-rawh-20211229-n-1'`. openQA caught this, and so did one of our QA community members - Ahed Almeleh - who [filed a bug](https://bugzilla.redhat.com/show_bug.cgi?id=2036199). After the end-of-year holidays, I got to figuring out what was going wrong. As usual, I got a bit of a head start from pre-existing knowledge. I happen to know that error message is referring to kernel arguments that are set in the bootloader configuration of *the live image itself*. dracut is the tool that handles an early phase of boot where we boot into a temporary environment that's loaded entirely into system memory, set up the *real* system environment, and boot that. This early environment is contained in the `initrd` files you can find alongside the kernel on most Linux distributions; that's what they're for. Part of dracut's job is to be run when a kernel is installed to *produce* this environment, and then other parts of dracut are included *in the environment itself* to handle initializing things, finding the real system root, preparing it, and then switching to it. The initrd environments on Fedora live images are built to contain a dracut 'module' (called `90dmsquash-live`) that knows to interpret `root=live:CDLABEL=Fedora-WS-Live-rawh-20211229-n-1` as meaning 'go look for a live system root on the filesystem with that label and boot that'. Installed systems don't contain that module, because, well, they don't need to know how to do that, and you wouldn't really ever want an installed system to *try* and do that. So the short version here is: the installed system has the wrong kernel argument for telling dracut where to find the system root. It *should* look something like `root=/dev/mapper/fedora-root` (where we're pointing to a system root on an LVM volume that dracut will set up and then switch to). So the obvious next question is: why? Why is our installed system getting this wrong argument? It seemed likely that it 'leaked' from the live system to the installed system somehow, but I needed to figure out how. From here, I had kinda two possible ways to investigate. The easiest and fastest would probably be if I happened to know exactly how we deal with setting up bootloader configuration when running a live install. Then I'd likely have been able to start poking the most obvious places right away and figure out the problem. But, as it happens, I didn't at the time remember exactly how that works. I just remembered that I wind up having to figure it out every few years, and it's complicated and scary, so I tend to forget again right afterwards. I kinda knew where to start looking, but didn't really want to have to work it all out again from scratch if I could avoid it. So I went with the other possibility, which is always: figure out when it broke, and figure out what changed between the last time it worked and the first time it broke. This usually makes life much easier because now you know one of the things on that list is the problem. The shorter and simpler the list, the easier life gets. I looked at the openQA result history and found that the bug was introduced somewhere between 20211215.n.0 and 20211229.n.1 (unfortunately kind of a wide range). The good news is that only a few packages could plausibly be involved in this bug; the most likely are dracut itself, grub2 (the bootloader), grubby (a Red Hat / Fedora-specific grub configuration...thing), anaconda (the Fedora installer, which obviously does some bootloader configuration stuff), the kernel itself, and systemd (which is of course involved in the boot process itself, but also - perhaps less obviously - is where a script called [`kernel-install`](https://github.com/systemd/systemd/tree/main/src/kernel-install) that is used (on Fedora and many other distros) to 'install' kernels lives (this was another handy thing I happened to know already, but really - it's always a safe bet to include systemd on the list of potential suspects for anything boot-related). Looking at what changed between 2021-12-15 and 2021-12-29, we could let out grub2 and grubby as they didn't change. There were some kernel builds, but nothing in the scriptlets changed in any way that could be related. dracut got a build with [one change](https://src.fedoraproject.org/rpms/dracut/c/76eb28fc2ef2f9e43b5ea66d0b9c96f83e124d4b?branch=rawhide), but again it seemed clearly unrelated. So I was down to anaconda and systemd as suspects. On an initial quick check during the vacation, I thought anaconda had not changed, and took a brief look at systemd, but didn't see anything immediately obvious. When I came back to look at it more thoroughly, I realized anaconda did get a new version (36.12) on 2021-12-15, so that initially interested me quite a lot. I spent some time going through the changes in that version, and there were some that really could have been related - it changed how running things during install inside the installed system worked (which is definitely how we do some bootloader setup stuff during install), and it had interesting commit messages like "Remove the dracut_args attribute" and "Remove upd-kernel". So I spent an afternoon fairly sure it'd turn out to be one of those, reviewed all those changes, mocked up locally how they worked, examined the logs of the actual image composes, and...concluded that none of those seemed to be the problem at all. The installer seemed to still be doing things the same as it always had. There weren't any tell-tale missing or failing bootloader config steps. However, this time wasn't entirely wasted: I was reminded of exactly what anaconda does to configure the bootloader when installing from a live image. When we install from a live image, we don't do what the 'traditional' installer does and install a bunch of RPM packages using dnf. The live image does not contain any RPM packages. The live image itself was *built* by installing a bunch of RPM packages, but it is the *result* of that process. Instead, we essentially set up the filesystems on the drive(s) we're installing to and then just dump the contents of the live image filesystem *itself* onto them. Then we run a few tweaks to adjust anything that needs adjusting for this now being an installed system, not a live one. One of the things we do is re-generate the `initrd` file for the installed system, and then re-generate the bootloader configuration. This involves running `kernel-install` (which places the kernel and initrd files onto the boot partition, and writes some bootloader configuration 'snippet' files), and then running `grub2-mkconfig`. The main thing `grub2-mkconfig` does is produce the main bootloader configuration file, but that's not really why we run it at this point. There's [a very interesting comment](https://github.com/rhinstaller/anaconda/blob/anaconda-36.12-1/pyanaconda/modules/storage/bootloader/utils.py#L244) explaining why in the anaconda source: # Update the bootloader configuration to make sure that the BLS # entries will have the correct kernel cmdline and not the value # taken from /proc/cmdline, that is used to boot the live image. Which is exactly what we were dealing with here. The "BLS entries" we're talking about here are the things I called 'snippet' files above, they live in `/boot/loader/entries` on Fedora systems. These are where the kernel arguments used at boot are specified, and indeed, that's where the problematic `root=live:...` arguments were specified in broken installs - in the "BLS entries" in `/boot/loader/entries`. So it seemed like, somehow, this mechanism just wasn't working right any more - we were expecting this run of `grub2-mkconfig` in the installed system root after live installation to correct those snippets, but it wasn't. However, as I said, I couldn't establish that any change to anaconda was causing this. So I eventually shelved anaconda at least temporarily and looked at systemd. And it turned out that systemd had changed too. During the time period in question, we'd gone from systemd 250~rc1 to 250~rc3. (If you check the build history of systemd the dates don't seem to match up - by 2021-12-29 the 250-2 build had happened already, but in fact the 250-1 and 250-2 builds were untagged for causing a different problem, so the 2021-12-29 compose had 250~rc3). By now I was obviously pretty focused on `kernel-install` as the most likely related part of systemd, so I went to my systemd git checkout and ran: git log v250-rc1..v250-rc3 src/kernel-install/ which shows all the commits under `src/kernel-install` between 250-rc1 and 250-rc3. And that gave me another juicy-looking, yet thankfully short, set of commits: [641e2124de6047e6010cd2925ea22fba29b25309](https://github.com/systemd/systemd/commit/641e2124de6047e6010cd2925ea22fba29b25309) kernel-install: replace 00-entry-directory with K_I_LAYOUT in k-i [357376d0bb525b064f468e0e2af8193b4b90d257](https://github.com/systemd/systemd/commit/357376d0bb525b064f468e0e2af8193b4b90d257) kernel-install: Introduce KERNEL_INSTALL_MACHINE_ID in /etc/machine-info [447a822f8ee47b63a4cae00423c4d407bfa5e516](https://github.com/systemd/systemd/commit/447a822f8ee47b63a4cae00423c4d407bfa5e516) kernel-install: Remove "Default" from list of suffixes checked So I went and looked at all of those. And again...I got it wrong at first! This is I guess a good lesson from this Debugging Adventure: you don't always get the right answer at first, but that's okay. You just have to keep plugging, and always keep open the possibility that you're wrong and you should try something else. I spent time thinking the cause was likely a change in anaconda before focusing on systemd, then focused on the wrong systemd commit first. I got interested in 641e212 first, and had even written out a whole Bugzilla comment blaming it before I realized it wasn't the culprit (fortunately, I didn't post it!) I thought the problem was that the new check for `$BOOT_ROOT/$MACHINE_ID` would not behave as it should on Fedora and cause the install scripts to do something different from what they should - generating incorrect snippet files, or putting them in the wrong place, or something. Fortunately, I decided to test this before declaring it was the problem, and found out that it wasn't. I did this using something that turned out to be invaluable in figuring out the real problem. You may have noticed by this point - harking back to our intro - that this critical `kernel-install` script, key to making sure your system boots, is...a shell script. That calls other shell scripts. You know what else is a big pile of shell scripts? `dracut`. You know, that critical component that both builds and controls the initial boot environment. Big pile of shell script. The install script - the `dracut` command itself - is shell. All the dracut `modules` - the bits that do most of the work - are shell. There's a bit of C in the source tree (I'm not entirely sure what that bit does), but most of it's shell. Critical stuff like this being written in shell makes me shiver, because shell is very easy to get wrong, and quite hard to test properly (and in fact neither dracut nor kernel-install has good tests). But one good thing about it is that it's quite easy to *debug*, thanks to the magic of `sh -x`. If you run some shell script via `sh -x` (whether that's really `sh`, or `bash` or some other alternative pretending to be `sh`), it will run as normal but print out most of the logic (variable assignments, tests, and so on) that happen along the way. So on a VM where I'd run a broken install, I could do `chroot /mnt/sysimage` (to get into the root of the installed system), find the exact `kernel-install` command that anaconda ran from one of the logs in `/var/log/anaconda` (I forget which), and re-run it through `sh -x`. This showed me all the logic going on through the run of `kernel-install` itself and all the scripts it sources under `/usr/lib/kernel/install.d`. Using this, I could confirm that the check I suspected had the result I suspected - I could see that it was deciding that `layout="other"`, not `layout="bls"`, [here](https://github.com/systemd/systemd/blob/v250-rc3/src/kernel-install/kernel-install#L134). But I could *also* figure out a way to override that decision, confirm that it worked, and find that it didn't solve the problem: the config snippets were still wrong, and running `grub2-mkconfig` didn't fix them. In fact the config snippets got wronger - it turned out that we do *want* `kernel-install` to pick 'other' rather than 'bls' here, because Fedora doesn't really implement BLS according to the upstream specs, so if we let kernel-install think we do, the config snippets we get are wrong. So now I'd been wrong twice! But each time, I learned a bit more that eventually helped me be right. After I decided that commit wasn't the cause after all, I finally spotted the problem. I figured this out by continuing with the `sh -x` debugging, and noticing an inconsistency. By this point I'd thought to find out what bit of `grub2-mkconfig` should be doing the work of correcting the key bit of configuration here. It's in a [Fedora-only downstream patch to one of the scriptlets in `/etc/grub.d`](https://src.fedoraproject.org/rpms/grub2/blob/b25606806096b51e9d920f50b9bb47773641644d/f/0062-Add-BLS-support-to-grub-mkconfig.patch#_230). It replaces the `options=` line in any snippet files it finds with what it reckons the kernel arguments "should be". So I got curious about what exactly was going wrong there. I tweaked `grub2-mkconfig` slightly to run those scriptlets using `sh -x` by changing these lines in `grub2-mkconfig`: echo "### BEGIN $i ###" "$i" echo "### END $i ###" to read: echo "### BEGIN $i ###" sh -x "$i" echo "### END $i ###" Now I could re-run `grub2-mkconfig` and look at what was going on behind the scenes of the scriptlet, and I noticed that it wasn't *finding* any snippet files at all. But why not? [The code that looks for the snippet files](https://src.fedoraproject.org/rpms/grub2/blob/b25606806096b51e9d920f50b9bb47773641644d/f/0062-Add-BLS-support-to-grub-mkconfig.patch#_211) reads the file `/etc/machine-id` as a string, then looks for files in `/boot/loader/entries` whose names start with that string (and end in `.conf`). So I went and looked at my sample system and...found that the files in `/boot/loader/entries` did *not* start with the string in `/etc/machine-id`. The files in `/boot/loader/entries` started with `a69bd9379d6445668e7df3ddbda62f86`, but the ID in `/etc/machine-id` was `b8d80a4c887c40199c4ea1a8f02aa9b4`. This is why everything was broken: because those IDs didn't match, `grub2-mkconfig` couldn't find the files to correct, so the argument was wrong, so the system didn't boot. Now I knew what was going wrong and I only had two systemd commits left on the list, it was pretty easy to see the problem. It was in [357376d](https://github.com/systemd/systemd/commit/357376d0bb525b064f468e0e2af8193b4b90d257). That changes how `kernel-install` names these snippet files when creating them. It names them by finding a machine ID to use as a prefix. Previously, it used whatever string was in `/etc/machine-id`; if that file didn't exist or was empty, it just used the string "Default". After that commit, it also looks for a value specified in `/etc/machine-info`. If there's a `/etc/machine-id` but not `/etc/machine-info` when you run `kernel-install`, it uses the value from `/etc/machine-id` and writes it to `/etc/machine-info`. When I checked those files, it turned out that on the live image, the ID in both `/etc/machine-id` and `/etc/machine-info` was `a69bd9379d6445668e7df3ddbda62f86` - the problematic ID on the installed system. When we generate the live image itself, `kernel-install` uses the value from `/etc/machine-id` and writes it to `/etc/machine-info`, and both files wind up in the live filesystem. But *on the installed system*, the ID in `/etc/machine-info` was that same value, but the ID in `/etc/machine-id` was different (as we saw above). Remember how I mentioned above that when doing a live install, we essentially dump the live filesystem itself onto the installed system? Well, one of the 'tweaks' we make when doing this is to [re-generate `/etc/machine-id`](https://github.com/rhinstaller/anaconda/blob/612bfee7d37e16d7bfab329b44182a71d04a3344/pyanaconda/modules/storage/bootloader/utils.py#L41-L48), because that ID is meant to be unique to each installed system - we don't want every system installed from a Fedora live image to have the same machine ID as the live image itself. However, as this `/etc/machine-info` file is new, we don't strip it from or re-generate it in the installed system, we just install it. The installed system has a `/etc/machine-info` with the same ID as the live image's machine ID, but a new, different ID in `/etc/machine-id`. And this (finally) was the ultimate source of the problem! When we run them on the installed system, the new version of `kernel-install` writes config snippet files using the ID from `/etc/machine-info`. But Fedora's patched `grub2-mkconfig` scriptlet doesn't know about that mechanism at all (since it's brand new), and expects the snippet files to contain the ID from `/etc/machine-id`. There are various ways you could potentially solve this, but after consulting with systemd upstream, the one we chose is to [have anaconda exclude `/etc/machine-info`](https://github.com/rhinstaller/anaconda/commit/fe652add9733943a3476128a83b24b9a8c63b335) when doing a live install. The changes to systemd here aren't wrong - it does potentially make sense that `/etc/machine-id` and `/etc/machine-info` could both exist and specify different IDs in some cases. But for the case of Fedora live installs, it doesn't make sense. The sanest result is for those IDs to match and both be the 'fresh' machine ID that's generated at the end of the install process. By just not including `/etc/machine-info` on the installed system, we achieve this result, because now when `kernel-install` runs at the end of the install process, it reads the ID from `/etc/machine-id` and writes it to `/etc/machine-info`, and both IDs are the same, `grub2-mkconfig` finds the snippet files and edits them correctly, the installed system boots, and I can move along to the next debugging odyssey...