Welp, since I haven't blogged for a while, here's another root-causing write up! Honestly, root-causing things is one of my favorite parts of the job, lately.

I've been on vacation recently, and when I came back, it looked like several things were wrong with Rawhide. Several of these were relatively easy fixes: live images not behaving right at all and the installer not launching properly any more both only took a couple of hours to track down. But there were also a couple of bugs causing more recent composes to fail entirely. The inimitable puiterwijk dealt with one of those (aarch64 cloud images not building properly), and I wound up taking the other one: overnight, most live image composes had suddenly stopped working.

What's The Problem?

The starting point for this debugging odyssey was the error messages we got in the failed composes:

DEBUG pylorax.treebuilder: kernels=[]
ERROR livemedia-creator: No kernels found, cannot rebuild_initrds

So, the first step is pretty easy: let's just go look up where those errors come from. pylorax and livemedia-creator are both part of the lorax tree, so we'll start there. It's easy to use grep to find the sources of those two messages: they're both in treebuilder.py, the first here, in findkernels() and the second here, in TreeBuilder.rebuild_initrds(). As the second happens if there is no self.kernels, and we we can see just a few lines further back that self.kernels is a property based on a call to findkernels(), it's pretty obvious that ultimately what's going wrong here is that findkernels() isn't finding any kernels.

So...Why Aren't We Finding Any Kernels?

So next, of course, I put my thinking cap on and had a good look at findkernels() - not just at the code itself, but at its history. When something that was working breaks, you're always looking for a change that caused it. There were no recent changes to findkernels(), and I couldn't see anything obviously wrong in its implementation - it's basically looking for files named vmlinuz-(something) in /boot - so it didn't look like the quest was over yet.

So at this point there were kinda two possibilities: findkernels() was broken, or it was working fine but there really weren't any kernels to find. I decided the easiest way to figure out which we were dealing with was just to reproduce the problem and take a look. With any sufficiently complex problem, you can usually get some way into figuring it out just by looking and logs and code and thinking about things logically, but at some point you'll want to try and reproduce the problem in a controlled environment you can poke at - knowing when you're at that point isn't an exact science, but if you find yourself with multiple possibilities and no easy way to decide which you're dealing with, that's often a good indicator.

Doing a test live image build is a bit awkward for Fedora at present, unfortunately, but not really too bad. I've been meaning to write a helper tool for a while, but never got around to it. The way I do it - because I want to replicate as close as possible how the official images are built - is just to try and do exactly what the build system does, only on my local system and by hand. What the build system does is spin up a mock root, install the live image creation tools and a flattened kickstart into it, and run the tool. So, that's what I do! A handy 'trick' here is just to look at the logs from a real live image creation task - like this one - and shadow what they do. Note these logs will get garbage collected at some point, but I can't do anything about that.

The main log to look at is root.log. First, we see that the tools are installed to the mock:

DEBUG util.py:439:  Marking packages as installed by the group:
DEBUG util.py:439:   @livemedia-build glibc-all-langpacks coreutils               lorax-lmc-novirt
DEBUG util.py:439:                    util-linux          selinux-policy-targeted bash            
DEBUG util.py:439:                    shadow-utils

So we'll do just the same with our mock:

mock -r fedora-rawhide-x86_64 --install glibc-all-langpacks coreutils lorax-lmc-novirt util-linux selinux-policy-targeted bash shadow-utils

(Really all you want to do is install the livemedia-build group, but I haven't actually found an equivalent of mock --groupinstall, so I just do that).

We can find the actual image compose command in the task log again, by looking for 'livemedia-creator' - it looks like this:

INFO backend.py:285:  Running in chroot: ['/sbin/livemedia-creator', '--ks', '/chroot_tmpdir/koji-image-f30-build-30594219.ks', '--logfile', '/chroot_tmpdir/lmc-logs/livemedia-out.log', '--no-virt', '--resultdir', '/chroot_tmpdir/lmc', '--project', 'Fedora-KDE-Live', '--make-iso', '--volid', 'Fedora-KDE-Live-rawh-20181101.n.', '--iso-only', '--iso-name', 'Fedora-KDE-Live-x86_64-Rawhide-20181101.n.0.iso', '--releasever', 'Rawhide', '--title', 'Fedora-KDE-Live', '--macboot']

we can easily turn that Python array into a console command by just replacing occurrences of ', ' with :

/sbin/livemedia-creator --ks /chroot_tmpdir/koji-image-f30-build-30594219.ks --logfile /chroot_tmpdir/lmc-logs/livemedia-out.log --no-virt --resultdir /chroot_tmpdir/lmc --project Fedora-KDE-Live --make-iso --volid Fedora-KDE-Live-rawh-20181101.n. --iso-only --iso-name Fedora-KDE-Live-x86_64-Rawhide-20181101.n.0.iso --releasever Rawhide --title Fedora-KDE-Live --macboot

We can see that's using a scratch directory called /chroot_tmpdir, and a kickstart file called koji-image-f30-build-30594219.ks. This kickstart can be found as one of the task assets, so we'll grab it and copy it into the mock's /root for now:

sudo cp koji-image-f30-build-30606109.ks /var/lib/mock/fedora-rawhide-x86_64/root/root

Then finally, we're going to get a shell in the mock root, using the old-style chroot implementation (this is necessary for live image builds to work, the new systemd-based implementation doesn't work yet):

mock -r fedora-rawhide-x86_64 --shell --old-chroot

Inside the mock, we'll create the /chroot_tmpdir scratch dir, copy the kickstart into it, and finally run the image creation:

mkdir -p /chroot_tmpdir
cd /chroot_tmpdir
cp /root/koji-image-f30-build-30606109.ks .
/sbin/livemedia-creator --ks /chroot_tmpdir/koji-image-f30-build-30594219.ks --logfile /chroot_tmpdir/lmc-logs/livemedia-out.log --no-virt --resultdir /chroot_tmpdir/lmc --project Fedora-KDE-Live --make-iso --volid Fedora-KDE-Live-rawh-20181101.n. --iso-only --iso-name Fedora-KDE-Live-x86_64-Rawhide-20181101.n.0.iso --releasever Rawhide --title Fedora-KDE-Live --macboot

And when I did that, it worked away for a while - half an hour or so - and eventually failed exactly like the 'official' build had! So now I had a failure in a controlled environment (my little mock root) to look at. Note that if you're playing along at home, this will only work so long as you can grab that kickstart from Koji, and the 20181101.n.0 compose files are kept around, which will only be for another two weeks or so - after that you won't be able to reproduce this, but you can of course follow the same procedure with a newer Koji task if you want to reproduce a newer official live image build.

Next, I needed to examine the actual filesystem produced by the image build process and see if it really had any kernels in it (remember, that's what we were trying to figure out). This requires a bit of knowledge about how livemedia-creator works, which you'd have had to look up if you didn't know it already: it creates an image file, loopback mounts it, and installs into the loopback-mounted directory. When it fails, it leaves the image file around, and you can just mount it again and poke around. The file will be in the lmc/ subdirectory of the directory where the image build command was run, with a filename like lmc-disk-ctfz98m5.img (the alphanumeric soup bit is random), and we mount it like this:

mount -o loop lmc/lmc-disk-ctfz98m5.img /mnt/sysimage

(the mount point having been created and left around by the tool). Now we can look in /mnt/sysimage/boot, and when I did that...I found that, indeed, it contained no vmlinuz-* files at all! So, I had eliminated the possibility that findkernels() was going wrong: it was doing its job just fine, and it wasn't finding any kernels because...there were no kernels to find.

OK...So Why Aren't There Any Kernels?

So now I had to try and work out: why were there no kernels in the image's /boot? I knew from the logs of earlier, successful image composes that, in successful composes, there really were kernels to find: it wasn't that there had never been kernels, but this had only recently become fatal for some reason. The difference really was that there used to be kernels present when this lorax code ran, but now there weren't.

This led me into a fun new piece of research: figuring out how kernel files get into /boot in a Fedora system at all. You might think - I did - that they're simply perfectly normal packaged files installed by the kernel-core package. But it turns out, they're not! The kernel-core package does own the relevant /boot/vmlinuz-* file, but it doesn't actually directly install it: it uses an RPM directive called %ghost to own it without installing it. So the file must get installed some other way. Here again I cheated a bit with prior knowledge - I knew this overall mechanism existed, though I didn't know until now that it really installed the kernel file itself - but if you don't have that, you could look at the %posttrans script in the kernel package: when a kernel is installed, a command called kernel-install gets run.

I also found out (by diffing the logged packages from the 20181030.n.0 and 20181101.n.0 live image composes) that the kernel itself had been updated in the 20181101.n.0 compose (which was when things started breaking). So once again I had a couple of promising lines of inquiry: the new kernel, and this kernel-install path.

Well, turns out systemd haters in the audience can get very excited, because kernel-install is part of systemd:

[adamw@adam lorax (master %)]$ rpm -qf `which kernel-install`

Anyway, I refreshed my memory a bit about what kernel-install does, but it's kinda complicated and it calls out to various other things, including /usr/sbin/new-kernel-pkg (part of grubby) and some plugins in /usr/lib/kernel/install.d (various of which come from systemd, grub2, and dracut). So I think what I did next (my memory gets a bit hazy) was to wonder whether the same problem would affect a regular install from the same packages.

I got the last working Rawhide network install image, and set it to install from the Everything repo from the failed 20181101.n.0 compose. I let that install run, then checked that the /boot/vmlinuz-(whatever) file existed in the installed system...which it did. This sort of let out one theory I had: that the new kernel package had somehow messed something up such that the kernel file never actually got installed properly at all.

So, I got to wondering whether kernel-install really was the thing that put the /boot/vmlinuz-(whatever) file in place (I still wasn't sure at this point), whether it reproducibly failed to do so in the live image creation environment but succeeded in doing so in the regular install environment, and if so, what differed between the two.

I could see the exact kernel-install command just by examining the kernel-core package scripts:

rpm -q --scripts kernel-core | grep kernel-install
/bin/kernel-install add 4.20.0-0.rc0.git7.1.fc30.x86_64 /lib/modules/4.20.0-0.rc0.git7.1.fc30.x86_64/vmlinuz || exit $?

So I tried just deleting /boot/vmlinuz* and re-running that command in the installed system...and sure enough, the file re-appeared! So now I was pretty sure kernel-install was the thing that's supposed to install it. I also tried doing this in my live image creation mock environment:

chroot /mnt/sysimage
/bin/kernel-install add 4.20.0-0.rc0.git7.1.fc30.x86_64 /lib/modules/4.20.0-0.rc0.git7.1.fc30.x86_64/vmlinuz

...and sure enough, it didn't create the /boot/vmlinuz-(foo) file. So now I was narrowing in on the problem: something about the live image creation environment meant that this kernel-install invocation didn't install the file, when it probably should.

OK...So Why Isn't kernel-install Installing Kernels?

At this point I probably could've figured out the problem by reading the source if I'd read it carefully enough, but I decided to carry on with the practical experimentation. I tried running the script through sh -x in each environment, to see exactly what commands were run by the script in each case, and somehow - I forget how - I zeroed in on one of the /usr/lib/kernel/install.d plugin scripts: /usr/lib/kernel/install.d/20-grub.install. This is part of grub2. I think I found these scripts from the sh -x output, and noticed that this one has what looks like the code to actually install the kernel file to /boot. So I made that script run with -z as well, and this finally got me my next big breakthrough. In the installed system I could see that script doing a lot of stuff, but in the live environment it seemed to exit almost as soon as it started:

+ [[ -n '' ]]
+ exit 0

It's not 100% obvious, but I was pretty sure that just meant it was failing in the test right at the start:

    exit 0

So I went and looked up $KERNEL_INSTALL_MACHINE_ID and the references suggested that it's basically tied to /etc/machine-id. So I looked, and, lo and behold, in the regular installed system, that file contained a random alphanumeric string, but in the live image creation environment, the file was empty! This sure looked suspicious.

I read through some references on the file, and found that it's usually meant to get created by a call to /usr/bin/systemd-machine-id-setup in systemd package scripts. So I tried running systemd-machine-id-setup in the live image creation environment, and suddenly the file got a random string, and when I ran kernel-install again, the kernel file did get installed to /boot!

OK...So Why Isn't There A Machine ID?

So now I'd moved along to a new problem again: why was /etc/machine-id empty when the kernel %post script ran, but if I tried to generate it again, it worked? Was the initial generation failing? Was it happening too late? Was it working, but the file getting touched by something else?

Again, I looked at quite a lot of code to try and figure it out - there's code that at least could touch /etc/machine-id in anaconda, for instance, and in lorax - but in the end I decided to go with practical experimentation again. So I did a custom scratch build of systemd to try and find out what happened when the %post script ran. I changed the command from this:

systemd-machine-id-setup &>/dev/null || :

to this:

systemd-machine-id-setup > /tmp/smids.log 2>&1
cat /etc/machine-id >> /tmp/smids.log

pulled that build into a side repository, edited the live kickstart to use that side repository, and re-ran the live image creation. And that hit paydirt in a big way, because in smids.log I saw this:

systemd-machine-id-setup: error while loading shared libraries: libssl.so.1.1: cannot open shared object file: No such file or directory

...and here was the ultimate solution to our mystery! The attempt to set the machine-id in systemd %post was failing because it needs libssl, but it obviously wasn't present yet. libssl is part of openssl-libs, but the systemd spec did not specify that its %post script needs openssl-libs installed. What I surmise had happened was that up until 20181030, some other dependency in some other package happened to mean that dnf would always choose to install openssl-libs before installing systemd, so no-one had ever noticed this missing dependency...but on 20181101, some change to some package caused dnf to start installing systemd before openssl-libs, and suddenly, this problem showed up. So - as is very often the case - once I'd finally managed to nail down the problem, the fix was simple: we just add the missing dependency to systemd, so that openssl-libs will always be installed before systemd's %post is run. With that fix, generation of /etc/machine-id will succeed again, and so the plugin script that installs the kernel file to /boot won't bail early, and so there will be a kernel file in /boot, and lorax won't fail when it tries to regenerate initrds because there aren't any kernels present!

...and so ends another exciting episode of AdamW's Debugging Adventures :)