AdamW on Linux and more (Posts about Technical) https://www.happyassassin.net/categories/technical.atom 2023-06-20T12:09:45Z Adam Williamson Nikola DevConf.CZ 2023, Rawhide update test gating, ELN testing and more! https://www.happyassassin.net/posts/2023/06/20/devconfcz-2023-rawhide-update-test-gating-eln-testing-and-more/ 2023-06-20T11:17:53Z 2023-06-20T11:17:53Z Adam Williamson <p>I'm in Brno, working from the office for a few days after the end of DevConf.CZ. It was a great conference, good to see people and feel some positive energy after all the stuff with RH layoffs and so on. It was really well attended, and there were a lot of useful talks. I presented on the current state of openQA and Fedora CI, with Miroslav Vadkerti kindly covering the Fedora CI stuff (thanks to Miro for that). The segmented talk video hasn't been updated yet, but you can watch it from the recorded live stream starting <a href="https://youtu.be/detokEIR4jU?t=21872">here</a> (at 6:04:32). I think it went pretty well, I'm much happier with this latest version of the talk than the versions I did at DevConf.US and LinuxCon (sorry to anyone who saw those ones!)</p> <p>The <a href="https://devconfcz2023.sched.com/event/1MYbo/estimators-of-the-lost-arc-advance-recon-crew">talk by Aoife Moloney and Michal Konecny</a> on how the CPE team (which handles Fedora infra and apps) has started doing organized pre-scoping for projects instead of just diving in was really interesting and informative. The anaconda meetup wound up being just the anaconda team and myself and Sumantro from QA, but it was very useful as we were able to talk about the plans for moving forward with the new <a href="https://fedoramagazine.org/anaconda-web-ui-preview-image-now-public/">anaconda webUI</a> and how we can contribute testing for that - look out for Test Weeks coming soon. Davide Cavalca's <a href="https://devconfcz2023.sched.com/event/1MYlp/fedora-eln-at-meta-a-testbed-for-fleet-upgrades">talk on Fedora ELN usage at Meta</a> was great, and inspired me to work on stuff (more on that later).</p> <p>There were a lot of random conversations as always - thanks to it being June, the "hallway track" mostly evolved into the "shadow track", under the shade of a big tree in the courtyard, with beanbags and ice cream! That's a definite improvement. The social event was in a great location - around an outdoor swimming pool (although we couldn't swim - apparently we couldn't serve drinks if swimming was allowed, so that seems like the best choice!) All in all, a great conference. I'm very much looking forward to <a href="https://flocktofedora.org/">Flock in Cork</a> now, and will be doing my talk there again if it's accepted.</p> <p>Tomorrow will be an exciting day, because (barring any unforeseen issues) we'll be <a href="https://lists.fedoraproject.org/archives/list/devel-announce@lists.fedoraproject.org/message/G2K2SMYN7ONOJEFQEMCDR7GK72MZQFYB/">turning on gating of Rawhide updates</a>! I've been working towards this for some time now - improving the reliability of the tests, implementing test re-run support from Bodhi, implementing the critical path group stuff, and improving the Bodhi web UI display of test results and gating status - so I'm really looking forward to getting it done (and hoping it goes well). 
This should mean Rawhide's stability improves even more, and Kevin and I don't have to scramble quite so much to "shadow gate" Rawhide any more (by untagging builds that fail the tests).</p> <p>Davide mentioned during his ELN talk that they ran into an issue that openQA would have caught if it ran on ELN, so I asked if that would be useful, and he said yes. So, yesterday I did it. This required changes to <a href="https://pagure.io/fedora-qa/fedfind/c/967349c5e2f81bc95f8f617457119c95ef1f067d?branch=main">fedfind</a>, the <a href="https://pagure.io/fedora-qa/os-autoinst-distri-fedora/c/b0fb6911f3d74721a2ca70ee0db309d619146df6?branch=main">openQA tests</a>, and the <a href="https://pagure.io/fedora-qa/fedora_openqa/c/2e0500d7ce0d9becb3f59186c1f007898aba76ad?branch=main">openQA scheduler</a> - and then, once that all worked out and I had deployed it, I realized it also needed changes to the result reporting code and a couple of other things too, which I had to do in rather a hurry! But it's all sorted out now, and we have new ELN composes automatically tested in production when they land. Initially only a couple of default-install-and-boot tests were running; I'm now working to extend the test set and the tested images.</p> <p>Other than that I've been doing a lot of work on the usual things - keeping openQA updated and running smoothly, investigating and fixing test failures, improving stuff in Bodhi and Greenwave, and reviewing new tests written by lruzicka. I'll be on vacation for a week or so from Friday, which will be a nice way to decompress from DevConf, then back to work on a bunch of ideas that came out of it!</p>
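<p>A footnote on the "shadow gating" mentioned above: it is essentially a manual version of what the automated gating will now do for us - if a build fails the openQA update tests, we untag it so it is not pulled into the next Rawhide compose. A minimal sketch of that (the tag and package names are stand-ins, not from a real case):</p> <div class="code"><pre class="code literal-block"># Manual "shadow gating": untag a build that failed the update tests so it
# is not included in the next Rawhide compose. Tag and NVR are examples only.
koji untag-build f39 somepackage-1.2.3-1.fc39
</pre></div>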
Thoughts on a pile of laptops https://www.happyassassin.net/posts/2023/02/02/thoughts-on-a-pile-of-laptops/ 2023-02-02T21:42:51Z 2023-02-02T21:42:51Z Adam Williamson <p>Hi folks! For the first post of 2023, I thought I'd do something a bit different.
If you want to keep up with what I've been working on, these days <a href="https://fosstodon.org/@adamw">Mastodon</a> is the best place - I've been posting a quick summary at the end of every working day there. Seems to be working out well so far. The biggest thing lately is that "grouped critical path", which I wrote about in my <a href="https://www.happyassassin.net/posts/2022/11/18/fedora-37-openqa-news-mastodon-and-more/">last post</a>, is deployed in production now. This has already reduced the number of tests openQA has to run, and I'm working on some further changes to optimize things more.</p> <p>So instead of that, I want to rhapsodize on this pile of laptops:</p> <p><img alt="A pile of laptops" src="https://www.happyassassin.net/images/laptops.jpg"></p> <p>On the top is the one I used as my main laptop for the last six years, and my main system for the last couple, since I got rid of my desktop. It's a Dell XPS 13 9360, the "Kaby Lake" generation. Not pictured (as it's over here being typed on, not in the pile) is its replacement, a 2022 XPS 13 (9315), which I bought in December and have been pretty happy with so far. On the bottom of the pile is a Lenovo tester (with AMD Ryzen hardware) which I tried to use as my main system for a bit, but it didn't work out as it only has 8G of RAM and that turns out to be...not enough. Second from bottom is a terrible budget Asus laptop with Windows on it that I keep around for the occasional time I need to use Windows - mainly to strip DRM from ebooks. Not pictured is the older XPS 13 I used before the later two, which broke down after a few years.</p> <p>But the hidden star of the show is the one second from top. It has a high-resolution 13" display with pretty slim bezels and a built-in webcam. It has dual NVIDIA and Intel GPUs. It has 8G of RAM, SSD storage and a multicore CPU, and runs Fedora 36 just fine, with decent (3-4hr) battery life. It weighs 3.15lb (1.43kg) and has USB, HDMI and ethernet outs.</p> <p>It also has a built-in DVD drive, VGA out and an ExpressCard slot (anyone remember those?) That's because it's from <strong>2010</strong>.</p> <p>It's a Sony Vaio Z VPC-Z11, and I still use it as a backup/test system. It barely feels outdated at all (until you remember about the DVD drive, which is actually pretty damn useful sometimes still). Every time I open it I'm still amazed at what a ridiculous piece of kit it is/was. Just do an image search for "2010 laptop" and you'll see stuff like, well, <a href="https://www.notebookcheck.net/NBC-Onsite-HP-Notebook-Lineup-05-2010.30296.0.html">this</a>. That's what pretty much every laptop looked like in 2010. They had 4G of RAM if you were lucky, and hard disks. They weighed 2kg+. They had huge frickin' bezels. The MacBook Air had come out in 2008, but it was an underpowered thing with a weak CPU and HDD storage. The 2010 models had SSDs, but maxed out at 4G RAM and still had pretty weak CPUs (and way bigger bezels, and worse screens, and they certainly didn't have DVD drives). They'd probably feel pretty painful to use now, but the Vaio still feels fine. Here's a glamour shot:</p> <p><img alt="One very cool laptop" src="https://www.happyassassin.net/images/vaioz.jpg"></p> <p>I've only had to replace its battery twice and its SSDs (it came from the factory with two SSDs configured RAID-0, because weird Sony is like that) once in 12 years. Probably one day it will finally not be really usable any more, but who the heck knows how long that will be.</p>
Fedora 37, openQA news, Mastodon and more https://www.happyassassin.net/posts/2022/11/18/fedora-37-openqa-news-mastodon-and-more/ 2022-11-18T22:34:59Z 2022-11-18T22:34:59Z Adam Williamson <p>Hey, time for my now-apparently-annual blog post, I guess? First, a quick note: I joined the herd showing up on <a href="https://joinmastodon.org/">Mastodon</a>, on the <a href="https://fosstodon.org">Fosstodon</a> server, as <a href="https://fosstodon.org/@adamw">@adamw@fosstodon.org</a>. So, you know, follow me or whatever. I posted to Twitter even less than I post here, but we'll see what happens!</p> <p>The big news lately is of course that <a href="https://fedoramagazine.org/announcing-fedora-37/">Fedora 37 is out</a>. Pulling this release together was a bit more painful than has been the norm lately, and it does have at least <a href="https://ask.fedoraproject.org/t/f37-install-media-dont-boot-in-uefi-mode-on-certain-motherboards/28364">one bug I'm sad we didn't sort out</a>, but unless you have one of a very few motherboards from six years ago and want to do a reinstall, everything should be great!</p> <p>Personally I've been running Fedora Silverblue this cycle, as an experiment to see how it fares as a daily driver and a <a href="https://en.wikipedia.org/wiki/Eating_your_own_dog_food">dogfooding</a> base. Overall it's been working fine; there are still some awkward corners if you are strict about avoiding RPM overlays, though. I'm definitely interested in <a href="https://fedoraproject.org/wiki/Changes/OstreeNativeContainerStable">Colin's big native container rework proposal</a>, which would significantly change how the rpm-ostree-based systems work and make package layering a more 'accepted' thing to do. I also found that sourcing apps feels slightly odd - I'd kinda like to use Fedora Flatpaks for everything, from a dogfooding perspective, but not everything I use is available as one, so I wound up with kind of a mix of things sourced from Flathub and from Fedora Flatpaks. I was also surprised that Fedora Flatpaks aren't generally updated terribly often, and don't seem to have 'development' branches - while Fedora 37 was in development, I couldn't get Flatpak builds of apps that matched the Fedora 37 RPM builds, so I was stuck running Fedora 36-based Flatpaks. So it actually impeded my ability to test the latest versions of everything. It'd be nice to see some improvement here going forward.</p> <p>My biggest project this year has been working towards gating Rawhide critical path updates on the <a href="https://openqa.fedoraproject.org">openQA</a> tests, as we do for stable and Branched releases. This has been a deceptively large effort; ensuring all the tests work OK on Rawhide was a relatively small job, but the experience of actually having the tests running has been interesting. There are, overall, a lot more updates for Rawhide than any other release, and obviously, they tend to break things more often. First I turned the tests on for the staging instance, then after a few months trying to get on top of things there, turned them on for the production instance. I planned to run this way for a month or two to see if I could stay on top of keeping the tests running smoothly and passing when they should, and dealing with breakage. On the whole, it's been possible...but just barely.
The increased workload means tests can take several hours to complete after an update is submitted, which isn't ideal. Because we don't have the gating turned on, when somebody does submit an update that breaks the tests, I have to ensure it gets fixed <em>right away</em>, or get it untagged before the next Rawhide compose happens; otherwise the test will fail for every subsequent update too. That can be stressful. We have also had quite a lot of 'fun' with intermittent problems like <a href="https://bugzilla.redhat.com/show_bug.cgi?id=2133829">systemd-oomd killing things it shouldn't</a>. This can result in a lot of time spent manually restarting failed tests, coming up with awkward workarounds, and trying to debug the problems.</p> <p>So, I kinda felt like things weren't quite solid enough yet to turn the gating on, and I wound up working down a path intended to help with the "too many jobs take too long" and "intermittent failures" angles. This actually started out when I <a href="https://pagure.io/fedora-comps/pull-request/764">added a proper critical path definition for KDE</a>. This rather increased the openQA workload, as it added a bunch of packages to critical path that weren't there before. There was an especially fun moment when a couple hundred KDE package updates got submitted separately as Rawhide updates, and openQA spent a day running 55 tests on all of them, including all the GNOME and Server tests.</p> <p>As part of getting the KDE stuff added to the critical path, I wound up doing a <a href="https://pagure.io/releng/pull-request/11000">big update</a> to the script that actually generates the critical path definition, and working on that made me realize it wouldn't be difficult to track the critical path package set by group, not just as one big flat list. That, in turn, could allow us to only run "relevant" openQA tests for an update: if the update is only in the KDE critical path, we don't need to run the GNOME and Server tests on it, for instance. So for the last few weeks I've been working on what turned out to be quite a lot of pieces relevant to that.</p> <p>First, I <a href="https://pagure.io/releng/pull-request/11044">added the fundamental support in the critical path generation script</a>. Then I had to make <a href="https://bodhi.fedoraproject.org">Bodhi</a> work with this. Bodhi decides whether an update is critical path or not, and openQA gets that information from Bodhi. Bodhi, as currently configured, actually gets this information from <a href="https://pdc.fedoraproject.org/">PDC</a>, which seems to me an unnecessary layer of indirection, especially as we're hoping to retire PDC; Bodhi could just as easily itself be the 'source of truth' for the critical path. So I <a href="https://github.com/fedora-infra/bodhi/pull/4755">made Bodhi capable of reading critpath information directly from the files output by the script</a>, then <a href="https://github.com/fedora-infra/bodhi/pull/4759">made it use the group information for Greenwave queries and show it in the web UI and API query results</a>. That's all a hard requirement for running fewer tests on some updates, because without that, we would still always gate on all the openQA tests for every critical path update - so if we didn't run all the tests for some update, it would always fail gating.
I also <a href="https://pagure.io/fedora-infra/ansible/c/18817b175aca65c4c1c8167eaadad231ef7a0449?branch=main">changed the Greenwave policies accordingly</a>, to only require the appropriate set of tests to pass for each critical path group, once our production Bodhi is set up to use all this new stuff - until then, the combined policy for the non-grouped decision contexts Bodhi still uses for now winds up identical to what it was before.</p> <p>Once a new Bodhi release is made and deployed to production, and we configure it to use the new grouped-critpath stuff instead of the flat definition from PDC, all of the groundwork is in place for me to actually change the openQA scheduler to check which critical path group(s) an update is in, and only schedule the appropriate tests. But along the way, I noticed this change meant Bodhi was querying Greenwave for <em>even more</em> decision contexts for each update. Right now for critical path updates Bodhi usually sends two queries to Greenwave (if there are more than seven packages in the update, it sends 2*((number of packages in update+1)/8) queries). With these changes, if an update was in, say, three critical path groups, it would send 4 (or more) queries. This slows things down, and also produces <a href="https://github.com/fedora-infra/bodhi/issues/4320">rather awkward and hard-to-understand output in the web UI</a>. So I decided to fix that too. I made it so the gating status displayed in the web UI is <a href="https://github.com/fedora-infra/bodhi/pull/4784">combined from however many queries Bodhi has to make</a>, instead of just displaying the result of each query separately. Then I tweaked Greenwave to allow <a href="https://github.com/release-engineering/greenwave/pull/95">querying multiple decision contexts together</a>, and <a href="https://github.com/fedora-infra/bodhi/pull/4821">had Bodhi make use of that</a>. With those changes combined, Bodhi should only have to query once for most updates, and for updates with more than seven packages, the displayed gating status won't be confusing any more!</p> <p>I'm hoping all those Bodhi changes can be deployed to stable soon, so I can move forward with the remaining work needed, and ultimately see how much of an improvement we get. I'm hoping we'll wind up having to run rather fewer tests, which should reduce the wait time for tests to complete and also mitigate the problem of intermittent failures a bit. If this works out well enough, we might be able to move ahead with actually turning on the gating for Rawhide updates, which I'm really looking forward to doing.</p>
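<p>A footnote for the curious: Greenwave answers these gating questions through its <code>decision</code> API, and the change described above lets a single request cover several decision contexts at once. A rough sketch of what such a combined query might look like - the decision context names, product version and update ID here are illustrative guesses, not copied from the real Bodhi configuration:</p> <div class="code"><pre class="code literal-block"># Hypothetical combined gating query: one request covering multiple decision
# contexts. Context names and the update ID are made up for illustration.
curl -s -X POST https://greenwave.fedoraproject.org/api/v1.0/decision \
  -H 'Content-Type: application/json' \
  -d '{"decision_context": ["bodhi_update_push_testing_critpath",
                            "bodhi_update_push_stable_critpath"],
       "product_version": "fedora-rawhide",
       "subject_type": "bodhi_update",
       "subject_identifier": "FEDORA-2022-abcdef1234"}'
</pre></div>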
AdamW's Debugging Adventures: Bootloaders and machine IDs https://www.happyassassin.net/posts/2022/01/11/adamws-debugging-adventures-bootloaders-and-machine-ids/ 2022-01-11T22:08:00Z 2022-01-11T22:08:00Z Adam Williamson <p>Hi folks! Well, it looks like I forgot to blog for...<em>checks watch</em>....<em>checks calendar</em>...a year. Wow. Whoops. Sorry about that. I'm still here, though! We released, uh, lots of Fedoras since the last time I wrote about that. Fedora 35 is the current one. It's, uh, <a href="https://fedoraproject.org/wiki/Common_F35_bugs">mostly great!</a> Go <a href="https://getfedora.org/">get a copy</a>, why don't you?</p> <p>And while that's downloading, you can get comfy and listen to another of Crazy Uncle Adam's Debugging Adventures. In this episode, we'll be uncomfortably reminded just how much of the code that causes your system to actually boot at all consists of fragile shell script with no tests, so this'll be fun!</p> <p>Last month, booting a system installed from Rawhide live images stopped working properly. You could boot the live image fine, run the installation fine, but on rebooting, the system would fail to boot with an error: <code>dracut: FATAL: Don't know how to handle 'root=live:CDLABEL=Fedora-WS-Live-rawh-20211229-n-1'</code>. openQA caught this, and so did one of our QA community members - Ahed Almeleh - who <a href="https://bugzilla.redhat.com/show_bug.cgi?id=2036199">filed a bug</a>. After the end-of-year holidays, I got to figuring out what was going wrong.</p> <p>As usual, I got a bit of a head start from pre-existing knowledge.
I happen to know that error message is referring to kernel arguments that are set in the bootloader configuration of <em>the live image itself</em>. dracut is the tool that handles an early phase of boot where we boot into a temporary environment that's loaded entirely into system memory, set up the <em>real</em> system environment, and boot that. This early environment is contained in the <code>initrd</code> files you can find alongside the kernel on most Linux distributions; that's what they're for. Part of dracut's job is to be run when a kernel is installed to <em>produce</em> this environment, and then other parts of dracut are included <em>in the environment itself</em> to handle initializing things, finding the real system root, preparing it, and then switching to it. The initrd environments on Fedora live images are built to contain a dracut 'module' (called <code>90dmsquash-live</code>) that knows to interpret <code>root=live:CDLABEL=Fedora-WS-Live-rawh-20211229-n-1</code> as meaning 'go look for a live system root on the filesystem with that label and boot that'. Installed systems don't contain that module, because, well, they don't need to know how to do that, and you wouldn't really ever want an installed system to <em>try</em> and do that.</p> <p>So the short version here is: the installed system has the wrong kernel argument for telling dracut where to find the system root. It <em>should</em> look something like <code>root=/dev/mapper/fedora-root</code> (where we're pointing to a system root on an LVM volume that dracut will set up and then switch to). So the obvious next question is: why? Why is our installed system getting this wrong argument? It seemed likely that it 'leaked' from the live system to the installed system somehow, but I needed to figure out how.</p> <p>From here, I had kinda two possible ways to investigate. The easiest and fastest would probably be if I happened to know exactly how we deal with setting up bootloader configuration when running a live install. Then I'd likely have been able to start poking the most obvious places right away and figure out the problem. But, as it happens, I didn't at the time remember exactly how that works. I just remembered that I wind up having to figure it out every few years, and it's complicated and scary, so I tend to forget again right afterwards. I kinda knew where to start looking, but didn't really want to have to work it all out again from scratch if I could avoid it.</p> <p>So I went with the other possibility, which is always: figure out when it broke, and figure out what changed between the last time it worked and the first time it broke. This usually makes life much easier because now you know one of the things on that list is the problem. The shorter and simpler the list, the easier life gets.</p> <p>I looked at the openQA result history and found that the bug was introduced somewhere between 20211215.n.0 and 20211229.n.1 (unfortunately kind of a wide range). 
The good news is that only a few packages could plausibly be involved in this bug; the most likely are dracut itself, grub2 (the bootloader), grubby (a Red Hat / Fedora-specific grub configuration...thing), anaconda (the Fedora installer, which obviously does some bootloader configuration stuff), the kernel itself, and systemd (which is of course involved in the boot process itself, but also - perhaps less obviously - is where <a href="https://github.com/systemd/systemd/tree/main/src/kernel-install"><code>kernel-install</code></a>, a script used (on Fedora and many other distros) to 'install' kernels, lives; this was another handy thing I happened to know already, but really - it's always a safe bet to include systemd on the list of potential suspects for anything boot-related).</p> <p>Looking at what changed between 2021-12-15 and 2021-12-29, we could rule out grub2 and grubby, as they didn't change. There were some kernel builds, but nothing in the scriptlets changed in any way that could be related. dracut got a build with <a href="https://src.fedoraproject.org/rpms/dracut/c/76eb28fc2ef2f9e43b5ea66d0b9c96f83e124d4b?branch=rawhide">one change</a>, but again it seemed clearly unrelated. So I was down to anaconda and systemd as suspects. On an initial quick check during the vacation, I thought anaconda had not changed, and took a brief look at systemd, but didn't see anything immediately obvious.</p> <p>When I came back to look at it more thoroughly, I realized anaconda did get a new version (36.12) on 2021-12-15, so that initially interested me quite a lot. I spent some time going through the changes in that version, and there were some that really could have been related - it changed how running things during install inside the installed system worked (which is definitely how we do some bootloader setup stuff during install), and it had interesting commit messages like "Remove the dracut_args attribute" and "Remove upd-kernel". So I spent an afternoon fairly sure it'd turn out to be one of those, reviewed all those changes, mocked up locally how they worked, examined the logs of the actual image composes, and...concluded that none of those seemed to be the problem at all. The installer seemed to still be doing things the same as it always had. There weren't any tell-tale missing or failing bootloader config steps. However, this time wasn't entirely wasted: I was reminded of exactly what anaconda does to configure the bootloader when installing from a live image.</p> <p>When we install from a live image, we don't do what the 'traditional' installer does and install a bunch of RPM packages using dnf. The live image does not contain any RPM packages. The live image itself was <em>built</em> by installing a bunch of RPM packages, but it is the <em>result</em> of that process. Instead, we essentially set up the filesystems on the drive(s) we're installing to and then just dump the contents of the live image filesystem <em>itself</em> onto them. Then we run a few tweaks to adjust anything that needs adjusting for this now being an installed system, not a live one. One of the things we do is re-generate the <code>initrd</code> file for the installed system, and then re-generate the bootloader configuration. This involves running <code>kernel-install</code> (which places the kernel and initrd files onto the boot partition, and writes some bootloader configuration 'snippet' files), and then running <code>grub2-mkconfig</code>.
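</p> <p>Roughly speaking - and this is a simplified sketch of the steps just described, not anaconda's actual code - that post-install bootloader regeneration boils down to something like the following, run inside the installed system's root (the kernel version and image path are invented for the example):</p> <div class="code"><pre class="code literal-block"># Simplified sketch of the live-install bootloader fix-up: regenerate the
# initrd and BLS snippets with kernel-install, then rewrite the grub config.
# Kernel version and image path are placeholders.
chroot /mnt/sysimage
kernel-install add 5.16.0-0.rc5.fc36.x86_64 /lib/modules/5.16.0-0.rc5.fc36.x86_64/vmlinuz
grub2-mkconfig -o /boot/grub2/grub.cfg
</pre></div> <p>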
The main thing <code>grub2-mkconfig</code> does is produce the main bootloader configuration file, but that's not really why we run it at this point. There's <a href="https://github.com/rhinstaller/anaconda/blob/anaconda-36.12-1/pyanaconda/modules/storage/bootloader/utils.py#L244">a very interesting comment</a> explaining why in the anaconda source:</p> <div class="code"><pre class="code literal-block"><span class="c1"># Update the bootloader configuration to make sure that the BLS</span> <span class="c1"># entries will have the correct kernel cmdline and not the value</span> <span class="c1"># taken from /proc/cmdline, that is used to boot the live image.</span> </pre></div> <p>Which is exactly what we were dealing with here. The "BLS entries" we're talking about here are the things I called 'snippet' files above, they live in <code>/boot/loader/entries</code> on Fedora systems. These are where the kernel arguments used at boot are specified, and indeed, that's where the problematic <code>root=live:...</code> arguments were specified in broken installs - in the "BLS entries" in <code>/boot/loader/entries</code>. So it seemed like, somehow, this mechanism just wasn't working right any more - we were expecting this run of <code>grub2-mkconfig</code> in the installed system root after live installation to correct those snippets, but it wasn't. However, as I said, I couldn't establish that any change to anaconda was causing this.</p> <p>So I eventually shelved anaconda at least temporarily and looked at systemd. And it turned out that systemd had changed too. During the time period in question, we'd gone from systemd 250~rc1 to 250~rc3. (If you check the build history of systemd the dates don't seem to match up - by 2021-12-29 the 250-2 build had happened already, but in fact the 250-1 and 250-2 builds were untagged for causing a different problem, so the 2021-12-29 compose had 250~rc3). By now I was obviously pretty focused on <code>kernel-install</code> as the most likely related part of systemd, so I went to my systemd git checkout and ran:</p> <div class="code"><pre class="code literal-block">git log v250-rc1..v250-rc3 src/kernel-install/ </pre></div> <p>which shows all the commits under <code>src/kernel-install</code> between 250-rc1 and 250-rc3. And that gave me another juicy-looking, yet thankfully short, set of commits:</p> <p><a href="https://github.com/systemd/systemd/commit/641e2124de6047e6010cd2925ea22fba29b25309">641e2124de6047e6010cd2925ea22fba29b25309</a> kernel-install: replace 00-entry-directory with K_I_LAYOUT in k-i <a href="https://github.com/systemd/systemd/commit/357376d0bb525b064f468e0e2af8193b4b90d257">357376d0bb525b064f468e0e2af8193b4b90d257</a> kernel-install: Introduce KERNEL_INSTALL_MACHINE_ID in /etc/machine-info <a href="https://github.com/systemd/systemd/commit/447a822f8ee47b63a4cae00423c4d407bfa5e516">447a822f8ee47b63a4cae00423c4d407bfa5e516</a> kernel-install: Remove "Default" from list of suffixes checked</p> <p>So I went and looked at all of those. And again...I got it wrong at first! This is I guess a good lesson from this Debugging Adventure: you don't always get the right answer at first, but that's okay. You just have to keep plugging, and always keep open the possibility that you're wrong and you should try something else. I spent time thinking the cause was likely a change in anaconda before focusing on systemd, then focused on the wrong systemd commit first. 
I got interested in 641e212 first, and had even written out a whole Bugzilla comment blaming it before I realized it wasn't the culprit (fortunately, I didn't post it!) I thought the problem was that the new check for <code>$BOOT_ROOT/$MACHINE_ID</code> would not behave as it should on Fedora and cause the install scripts to do something different from what they should - generating incorrect snippet files, or putting them in the wrong place, or something.</p> <p>Fortunately, I decided to test this before declaring it was the problem, and found out that it wasn't. I did this using something that turned out to be invaluable in figuring out the real problem.</p> <p>You may have noticed by this point - harking back to our intro - that this critical <code>kernel-install</code> script, key to making sure your system boots, is...a shell script. That calls other shell scripts. You know what else is a big pile of shell scripts? <code>dracut</code>. You know, that critical component that both builds and controls the initial boot environment. Big pile of shell script. The install script - the <code>dracut</code> command itself - is shell. All the dracut <code>modules</code> - the bits that do most of the work - are shell. There's a bit of C in the source tree (I'm not entirely sure what that bit does), but most of it's shell.</p> <p>Critical stuff like this being written in shell makes me shiver, because shell is very easy to get wrong, and quite hard to test properly (and in fact neither dracut nor kernel-install has good tests). But one good thing about it is that it's quite easy to <em>debug</em>, thanks to the magic of <code>sh -x</code>. If you run some shell script via <code>sh -x</code> (whether that's really <code>sh</code>, or <code>bash</code> or some other alternative pretending to be <code>sh</code>), it will run as normal but print out most of the logic (variable assignments, tests, and so on) that happen along the way. So on a VM where I'd run a broken install, I could do <code>chroot /mnt/sysimage</code> (to get into the root of the installed system), find the exact <code>kernel-install</code> command that anaconda ran from one of the logs in <code>/var/log/anaconda</code> (I forget which), and re-run it through <code>sh -x</code>. This showed me all the logic going on through the run of <code>kernel-install</code> itself and all the scripts it sources under <code>/usr/lib/kernel/install.d</code>. Using this, I could confirm that the check I suspected had the result I suspected - I could see that it was deciding that <code>layout="other"</code>, not <code>layout="bls"</code>, <a href="https://github.com/systemd/systemd/blob/v250-rc3/src/kernel-install/kernel-install#L134">here</a>. But I could <em>also</em> figure out a way to override that decision, confirm that it worked, and find that it didn't solve the problem: the config snippets were still wrong, and running <code>grub2-mkconfig</code> didn't fix them. In fact the config snippets got wronger - it turned out that we do <em>want</em> <code>kernel-install</code> to pick 'other' rather than 'bls' here, because Fedora doesn't really implement BLS according to the upstream specs, so if we let kernel-install think we do, the config snippets we get are wrong.</p> <p>So now I'd been wrong twice! But each time, I learned a bit more that eventually helped me be right. After I decided that commit wasn't the cause after all, I finally spotted the problem. 
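</p> <p>(As an aside, here is roughly what that <code>sh -x</code> debugging loop looks like in practice. The exact <code>kernel-install</code> command has to be fished out of the anaconda logs, so the arguments below are invented placeholders, not from a real log:)</p> <div class="code"><pre class="code literal-block"># From the live environment, enter the installed system and re-run the same
# kernel-install command under sh -x to trace the scripts' logic as they run.
chroot /mnt/sysimage
sh -x /usr/bin/kernel-install add 5.16.0-0.rc5.fc36.x86_64 \
    /lib/modules/5.16.0-0.rc5.fc36.x86_64/vmlinuz
</pre></div> <p>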
I figured this out by continuing with the <code>sh -x</code> debugging, and noticing an inconsistency. By this point I'd thought to find out what bit of <code>grub2-mkconfig</code> should be doing the work of correcting the key bit of configuration here. It's in a <a href="https://src.fedoraproject.org/rpms/grub2/blob/b25606806096b51e9d920f50b9bb47773641644d/f/0062-Add-BLS-support-to-grub-mkconfig.patch#_230">Fedora-only downstream patch to one of the scriptlets in <code>/etc/grub.d</code></a>. It replaces the <code>options=</code> line in any snippet files it finds with what it reckons the kernel arguments "should be". So I got curious about what exactly was going wrong there. I tweaked <code>grub2-mkconfig</code> slightly to run those scriptlets using <code>sh -x</code> by changing these lines in <code>grub2-mkconfig</code>:</p> <div class="code"><pre class="code literal-block"><span class="n">echo</span> <span class="s">"### BEGIN $i ###"</span>
<span class="s">"$i"</span>
<span class="n">echo</span> <span class="s">"### END $i ###"</span>
</pre></div> <p>to read:</p> <div class="code"><pre class="code literal-block"><span class="n">echo</span> <span class="s">"### BEGIN $i ###"</span>
<span class="n">sh</span> <span class="o">-</span><span class="n">x</span> <span class="s">"$i"</span>
<span class="n">echo</span> <span class="s">"### END $i ###"</span>
</pre></div> <p>Now I could re-run <code>grub2-mkconfig</code> and look at what was going on behind the scenes of the scriptlet, and I noticed that it wasn't <em>finding</em> any snippet files at all. But why not?</p> <p><a href="https://src.fedoraproject.org/rpms/grub2/blob/b25606806096b51e9d920f50b9bb47773641644d/f/0062-Add-BLS-support-to-grub-mkconfig.patch#_211">The code that looks for the snippet files</a> reads the file <code>/etc/machine-id</code> as a string, then looks for files in <code>/boot/loader/entries</code> whose names start with that string (and end in <code>.conf</code>). So I went and looked at my sample system and...found that the files in <code>/boot/loader/entries</code> did <em>not</em> start with the string in <code>/etc/machine-id</code>. The files in <code>/boot/loader/entries</code> started with <code>a69bd9379d6445668e7df3ddbda62f86</code>, but the ID in <code>/etc/machine-id</code> was <code>b8d80a4c887c40199c4ea1a8f02aa9b4</code>. This is why everything was broken: because those IDs didn't match, <code>grub2-mkconfig</code> couldn't find the files to correct, so the argument was wrong, so the system didn't boot.</p> <p>Now that I knew what was going wrong and only had two systemd commits left on the list, it was pretty easy to see the problem. It was in <a href="https://github.com/systemd/systemd/commit/357376d0bb525b064f468e0e2af8193b4b90d257">357376d</a>. That changes how <code>kernel-install</code> names these snippet files when creating them. It names them by finding a machine ID to use as a prefix. Previously, it used whatever string was in <code>/etc/machine-id</code>; if that file didn't exist or was empty, it just used the string "Default". After that commit, it also looks for a value specified in <code>/etc/machine-info</code>.
If there's a <code>/etc/machine-id</code> but not <code>/etc/machine-info</code> when you run <code>kernel-install</code>, it uses the value from <code>/etc/machine-id</code> and writes it to <code>/etc/machine-info</code>.</p> <p>When I checked those files, it turned out that on the live image, the ID in both <code>/etc/machine-id</code> and <code>/etc/machine-info</code> was <code>a69bd9379d6445668e7df3ddbda62f86</code> - the problematic ID on the installed system. When we generate the live image itself, <code>kernel-install</code> uses the value from <code>/etc/machine-id</code> and writes it to <code>/etc/machine-info</code>, and both files wind up in the live filesystem. But <em>on the installed system</em>, the ID in <code>/etc/machine-info</code> was that same value, while the ID in <code>/etc/machine-id</code> was different (as we saw above).</p> <p>Remember how I mentioned above that when doing a live install, we essentially dump the live filesystem itself onto the installed system? Well, one of the 'tweaks' we make when doing this is to <a href="https://github.com/rhinstaller/anaconda/blob/612bfee7d37e16d7bfab329b44182a71d04a3344/pyanaconda/modules/storage/bootloader/utils.py#L41-L48">re-generate <code>/etc/machine-id</code></a>, because that ID is meant to be unique to each installed system - we don't want every system installed from a Fedora live image to have the same machine ID as the live image itself. However, as this <code>/etc/machine-info</code> file is new, we don't strip it from or re-generate it in the installed system; we just install it. The installed system has a <code>/etc/machine-info</code> with the same ID as the live image's machine ID, but a new, different ID in <code>/etc/machine-id</code>. And this (finally) was the ultimate source of the problem! When we run those bootloader regeneration steps on the installed system, the new version of <code>kernel-install</code> writes config snippet files using the ID from <code>/etc/machine-info</code>. But Fedora's patched <code>grub2-mkconfig</code> scriptlet doesn't know about that mechanism at all (since it's brand new), and expects the snippet files to contain the ID from <code>/etc/machine-id</code>.</p> <p>There are various ways you could potentially solve this, but after consulting with systemd upstream, the one we chose is to <a href="https://github.com/rhinstaller/anaconda/commit/fe652add9733943a3476128a83b24b9a8c63b335">have anaconda exclude <code>/etc/machine-info</code></a> when doing a live install. The changes to systemd here aren't wrong - it does potentially make sense that <code>/etc/machine-id</code> and <code>/etc/machine-info</code> could both exist and specify different IDs in some cases. But for the case of Fedora live installs, it doesn't make sense. The sanest result is for those IDs to match and both be the 'fresh' machine ID that's generated at the end of the install process. By just not including <code>/etc/machine-info</code> on the installed system, we achieve this result: now when <code>kernel-install</code> runs at the end of the install process, it reads the ID from <code>/etc/machine-id</code> and writes it to <code>/etc/machine-info</code>, both IDs are the same, <code>grub2-mkconfig</code> finds the snippet files and edits them correctly, the installed system boots, and I can move along to the next debugging odyssey...</p>
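<p>For reference, if you ever suspect an install has hit this class of problem, the quickest check is to compare the IDs in the places involved and look at the <code>options</code> lines in the BLS snippets. A rough sketch of that kind of poking around (assuming <code>/etc/machine-info</code> exists at all):</p> <div class="code"><pre class="code literal-block"># Compare the machine IDs kernel-install and grub2-mkconfig care about with
# the prefixes of the BLS snippet files, and see what root= argument they pass.
cat /etc/machine-id
grep MACHINE_ID /etc/machine-info
ls /boot/loader/entries/
grep ^options /boot/loader/entries/*.conf
</pre></div>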
We released, uh, lots of Fedoras since the last time I wrote about that. Fedora 35 is the current one. It's, uh, <a href="https://fedoraproject.org/wiki/Common_F35_bugs">mostly great!</a> Go <a href="https://getfedora.org/">get a copy</a>, why don't you?</p> <p>And while that's downloading, you can get comfy and listen to another of Crazy Uncle Adam's Debugging Adventures. In this episode, we'll be uncomfortably reminded just how much of the code that causes your system to actually boot at all consists of fragile shell script with no tests, so this'll be fun!</p> <p>Last month, booting a system installed from Rawhide live images stopped working properly. You could boot the live image fine, run the installation fine, but on rebooting, the system would fail to boot with an error: <code>dracut: FATAL: Don't know how to handle 'root=live:CDLABEL=Fedora-WS-Live-rawh-20211229-n-1'</code>. openQA caught this, and so did one of our QA community members - Ahed Almeleh - who <a href="https://bugzilla.redhat.com/show_bug.cgi?id=2036199">filed a bug</a>. After the end-of-year holidays, I got to figuring out what was going wrong.</p> <p>As usual, I got a bit of a head start from pre-existing knowledge. I happen to know that error message is referring to kernel arguments that are set in the bootloader configuration of <em>the live image itself</em>. dracut is the tool that handles an early phase of boot where we boot into a temporary environment that's loaded entirely into system memory, set up the <em>real</em> system environment, and boot that. This early environment is contained in the <code>initrd</code> files you can find alongside the kernel on most Linux distributions; that's what they're for. Part of dracut's job is to be run when a kernel is installed to <em>produce</em> this environment, and then other parts of dracut are included <em>in the environment itself</em> to handle initializing things, finding the real system root, preparing it, and then switching to it. The initrd environments on Fedora live images are built to contain a dracut 'module' (called <code>90dmsquash-live</code>) that knows to interpret <code>root=live:CDLABEL=Fedora-WS-Live-rawh-20211229-n-1</code> as meaning 'go look for a live system root on the filesystem with that label and boot that'. Installed systems don't contain that module, because, well, they don't need to know how to do that, and you wouldn't really ever want an installed system to <em>try</em> and do that.</p> <p>So the short version here is: the installed system has the wrong kernel argument for telling dracut where to find the system root. It <em>should</em> look something like <code>root=/dev/mapper/fedora-root</code> (where we're pointing to a system root on an LVM volume that dracut will set up and then switch to). So the obvious next question is: why? Why is our installed system getting this wrong argument? It seemed likely that it 'leaked' from the live system to the installed system somehow, but I needed to figure out how.</p> <p>From here, I had kinda two possible ways to investigate. The easiest and fastest would probably be if I happened to know exactly how we deal with setting up bootloader configuration when running a live install. Then I'd likely have been able to start poking the most obvious places right away and figure out the problem. But, as it happens, I didn't at the time remember exactly how that works. 
I just remembered that I wind up having to figure it out every few years, and it's complicated and scary, so I tend to forget again right afterwards. I kinda knew where to start looking, but didn't really want to have to work it all out again from scratch if I could avoid it.</p> <p>So I went with the other possibility, which is always: figure out when it broke, and figure out what changed between the last time it worked and the first time it broke. This usually makes life much easier because now you know one of the things on that list is the problem. The shorter and simpler the list, the easier life gets.</p> <p>I looked at the openQA result history and found that the bug was introduced somewhere between 20211215.n.0 and 20211229.n.1 (unfortunately kind of a wide range). The good news is that only a few packages could plausibly be involved in this bug; the most likely are dracut itself, grub2 (the bootloader), grubby (a Red Hat / Fedora-specific grub configuration...thing), anaconda (the Fedora installer, which obviously does some bootloader configuration stuff), the kernel itself, and systemd (which is of course involved in the boot process itself, but also - perhaps less obviously - is where a script called <a href="https://github.com/systemd/systemd/tree/main/src/kernel-install"><code>kernel-install</code></a> that is used (on Fedora and many other distros) to 'install' kernels lives (this was another handy thing I happened to know already, but really - it's always a safe bet to include systemd on the list of potential suspects for anything boot-related)).</p> <p>Looking at what changed between 2021-12-15 and 2021-12-29, we could rule out grub2 and grubby as they didn't change. There were some kernel builds, but nothing in the scriptlets changed in any way that could be related. dracut got a build with <a href="https://src.fedoraproject.org/rpms/dracut/c/76eb28fc2ef2f9e43b5ea66d0b9c96f83e124d4b?branch=rawhide">one change</a>, but again it seemed clearly unrelated. So I was down to anaconda and systemd as suspects. On an initial quick check during the vacation, I thought anaconda had not changed, and took a brief look at systemd, but didn't see anything immediately obvious.</p> <p>When I came back to look at it more thoroughly, I realized anaconda did get a new version (36.12) on 2021-12-15, so that initially interested me quite a lot. I spent some time going through the changes in that version, and there were some that really could have been related - it changed how running things during install inside the installed system worked (which is definitely how we do some bootloader setup stuff during install), and it had interesting commit messages like "Remove the dracut_args attribute" and "Remove upd-kernel". So I spent an afternoon fairly sure it'd turn out to be one of those, reviewed all those changes, mocked up locally how they worked, examined the logs of the actual image composes, and...concluded that none of those seemed to be the problem at all. The installer seemed to still be doing things the same as it always had. There weren't any tell-tale missing or failing bootloader config steps. However, this time wasn't entirely wasted: I was reminded of exactly what anaconda does to configure the bootloader when installing from a live image.</p> <p>When we install from a live image, we don't do what the 'traditional' installer does and install a bunch of RPM packages using dnf. The live image does not contain any RPM packages.
The live image itself was <em>built</em> by installing a bunch of RPM packages, but it is the <em>result</em> of that process. Instead, we essentially set up the filesystems on the drive(s) we're installing to and then just dump the contents of the live image filesystem <em>itself</em> onto them. Then we run a few tweaks to adjust anything that needs adjusting for this now being an installed system, not a live one. One of the things we do is re-generate the <code>initrd</code> file for the installed system, and then re-generate the bootloader configuration. This involves running <code>kernel-install</code> (which places the kernel and initrd files onto the boot partition, and writes some bootloader configuration 'snippet' files), and then running <code>grub2-mkconfig</code>. The main thing <code>grub2-mkconfig</code> does is produce the main bootloader configuration file, but that's not really why we run it at this point. There's <a href="https://github.com/rhinstaller/anaconda/blob/anaconda-36.12-1/pyanaconda/modules/storage/bootloader/utils.py#L244">a very interesting comment</a> explaining why in the anaconda source:</p> <div class="code"><pre class="code literal-block"><span class="c1"># Update the bootloader configuration to make sure that the BLS</span> <span class="c1"># entries will have the correct kernel cmdline and not the value</span> <span class="c1"># taken from /proc/cmdline, that is used to boot the live image.</span> </pre></div> <p>Which is exactly what we were dealing with here. The "BLS entries" we're talking about here are the things I called 'snippet' files above, they live in <code>/boot/loader/entries</code> on Fedora systems. These are where the kernel arguments used at boot are specified, and indeed, that's where the problematic <code>root=live:...</code> arguments were specified in broken installs - in the "BLS entries" in <code>/boot/loader/entries</code>. So it seemed like, somehow, this mechanism just wasn't working right any more - we were expecting this run of <code>grub2-mkconfig</code> in the installed system root after live installation to correct those snippets, but it wasn't. However, as I said, I couldn't establish that any change to anaconda was causing this.</p> <p>So I eventually shelved anaconda at least temporarily and looked at systemd. And it turned out that systemd had changed too. During the time period in question, we'd gone from systemd 250~rc1 to 250~rc3. (If you check the build history of systemd the dates don't seem to match up - by 2021-12-29 the 250-2 build had happened already, but in fact the 250-1 and 250-2 builds were untagged for causing a different problem, so the 2021-12-29 compose had 250~rc3). By now I was obviously pretty focused on <code>kernel-install</code> as the most likely related part of systemd, so I went to my systemd git checkout and ran:</p> <div class="code"><pre class="code literal-block">git log v250-rc1..v250-rc3 src/kernel-install/ </pre></div> <p>which shows all the commits under <code>src/kernel-install</code> between 250-rc1 and 250-rc3. 
And that gave me another juicy-looking, yet thankfully short, set of commits:</p> <p><a href="https://github.com/systemd/systemd/commit/641e2124de6047e6010cd2925ea22fba29b25309">641e2124de6047e6010cd2925ea22fba29b25309</a> kernel-install: replace 00-entry-directory with K_I_LAYOUT in k-i <a href="https://github.com/systemd/systemd/commit/357376d0bb525b064f468e0e2af8193b4b90d257">357376d0bb525b064f468e0e2af8193b4b90d257</a> kernel-install: Introduce KERNEL_INSTALL_MACHINE_ID in /etc/machine-info <a href="https://github.com/systemd/systemd/commit/447a822f8ee47b63a4cae00423c4d407bfa5e516">447a822f8ee47b63a4cae00423c4d407bfa5e516</a> kernel-install: Remove "Default" from list of suffixes checked</p> <p>So I went and looked at all of those. And again...I got it wrong at first! This is I guess a good lesson from this Debugging Adventure: you don't always get the right answer at first, but that's okay. You just have to keep plugging, and always keep open the possibility that you're wrong and you should try something else. I spent time thinking the cause was likely a change in anaconda before focusing on systemd, then focused on the wrong systemd commit first. I got interested in 641e212 first, and had even written out a whole Bugzilla comment blaming it before I realized it wasn't the culprit (fortunately, I didn't post it!) I thought the problem was that the new check for <code>$BOOT_ROOT/$MACHINE_ID</code> would not behave as it should on Fedora and cause the install scripts to do something different from what they should - generating incorrect snippet files, or putting them in the wrong place, or something.</p> <p>Fortunately, I decided to test this before declaring it was the problem, and found out that it wasn't. I did this using something that turned out to be invaluable in figuring out the real problem.</p> <p>You may have noticed by this point - harking back to our intro - that this critical <code>kernel-install</code> script, key to making sure your system boots, is...a shell script. That calls other shell scripts. You know what else is a big pile of shell scripts? <code>dracut</code>. You know, that critical component that both builds and controls the initial boot environment. Big pile of shell script. The install script - the <code>dracut</code> command itself - is shell. All the dracut <code>modules</code> - the bits that do most of the work - are shell. There's a bit of C in the source tree (I'm not entirely sure what that bit does), but most of it's shell.</p> <p>Critical stuff like this being written in shell makes me shiver, because shell is very easy to get wrong, and quite hard to test properly (and in fact neither dracut nor kernel-install has good tests). But one good thing about it is that it's quite easy to <em>debug</em>, thanks to the magic of <code>sh -x</code>. If you run some shell script via <code>sh -x</code> (whether that's really <code>sh</code>, or <code>bash</code> or some other alternative pretending to be <code>sh</code>), it will run as normal but print out most of the logic (variable assignments, tests, and so on) that happen along the way. So on a VM where I'd run a broken install, I could do <code>chroot /mnt/sysimage</code> (to get into the root of the installed system), find the exact <code>kernel-install</code> command that anaconda ran from one of the logs in <code>/var/log/anaconda</code> (I forget which), and re-run it through <code>sh -x</code>. 
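Concretely, the re-run looked something like this (the kernel version here is a placeholder, not the exact command from the anaconda logs):</p> <div class="code"><pre class="code literal-block"># from the installer environment, after a broken install
chroot /mnt/sysimage
# re-run the same kernel-install command anaconda ran, with tracing;
# KVER stands in for the real version string taken from the logs
KVER="5.16.0-0.rc5.fc36.x86_64"  # illustrative only
sh -x /usr/bin/kernel-install add "$KVER" "/lib/modules/$KVER/vmlinuz" 2>&1 | less
</pre></div> <p>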
This showed me all the logic going on through the run of <code>kernel-install</code> itself and all the scripts it sources under <code>/usr/lib/kernel/install.d</code>. Using this, I could confirm that the check I suspected had the result I suspected - I could see that it was deciding that <code>layout="other"</code>, not <code>layout="bls"</code>, <a href="https://github.com/systemd/systemd/blob/v250-rc3/src/kernel-install/kernel-install#L134">here</a>. But I could <em>also</em> figure out a way to override that decision, confirm that it worked, and find that it didn't solve the problem: the config snippets were still wrong, and running <code>grub2-mkconfig</code> didn't fix them. In fact the config snippets got wronger - it turned out that we do <em>want</em> <code>kernel-install</code> to pick 'other' rather than 'bls' here, because Fedora doesn't really implement BLS according to the upstream specs, so if we let kernel-install think we do, the config snippets we get are wrong.</p> <p>So now I'd been wrong twice! But each time, I learned a bit more that eventually helped me be right. After I decided that commit wasn't the cause after all, I finally spotted the problem. I figured this out by continuing with the <code>sh -x</code> debugging, and noticing an inconsistency. By this point I'd thought to find out what bit of <code>grub2-mkconfig</code> should be doing the work of correcting the key bit of configuration here. It's in a <a href="https://src.fedoraproject.org/rpms/grub2/blob/b25606806096b51e9d920f50b9bb47773641644d/f/0062-Add-BLS-support-to-grub-mkconfig.patch#_230">Fedora-only downstream patch to one of the scriptlets in <code>/etc/grub.d</code></a>. It replaces the <code>options=</code> line in any snippet files it finds with what it reckons the kernel arguments "should be". So I got curious about what exactly was going wrong there. I tweaked <code>grub2-mkconfig</code> slightly to run those scriptlets using <code>sh -x</code> by changing these lines in <code>grub2-mkconfig</code>:</p> <div class="code"><pre class="code literal-block"><span class="n">echo</span> <span class="s">"### BEGIN $i ###"</span> <span class="s">"$i"</span> <span class="n">echo</span> <span class="s">"### END $i ###"</span> </pre></div> <p>to read:</p> <div class="code"><pre class="code literal-block"><span class="n">echo</span> <span class="s">"### BEGIN $i ###"</span> <span class="n">sh</span> <span class="o">-</span><span class="n">x</span> <span class="s">"$i"</span> <span class="n">echo</span> <span class="s">"### END $i ###"</span> </pre></div> <p>Now I could re-run <code>grub2-mkconfig</code> and look at what was going on behind the scenes of the scriptlet, and I noticed that it wasn't <em>finding</em> any snippet files at all. But why not?</p> <p><a href="https://src.fedoraproject.org/rpms/grub2/blob/b25606806096b51e9d920f50b9bb47773641644d/f/0062-Add-BLS-support-to-grub-mkconfig.patch#_211">The code that looks for the snippet files</a> reads the file <code>/etc/machine-id</code> as a string, then looks for files in <code>/boot/loader/entries</code> whose names start with that string (and end in <code>.conf</code>). So I went and looked at my sample system and...found that the files in <code>/boot/loader/entries</code> did <em>not</em> start with the string in <code>/etc/machine-id</code>. The files in <code>/boot/loader/entries</code> started with <code>a69bd9379d6445668e7df3ddbda62f86</code>, but the ID in <code>/etc/machine-id</code> was <code>b8d80a4c887c40199c4ea1a8f02aa9b4</code>. 
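It's the kind of thing you can check in seconds once you know to look. This is roughly what my broken install showed - output trimmed, and the kernel version part of the filename is illustrative:</p> <div class="code"><pre class="code literal-block">$ cat /etc/machine-id
b8d80a4c887c40199c4ea1a8f02aa9b4
$ ls /boot/loader/entries/
a69bd9379d6445668e7df3ddbda62f86-5.16.0-0.rc5.fc36.x86_64.conf
$ grep ^options /boot/loader/entries/*.conf
options root=live:CDLABEL=Fedora-WS-Live-rawh-20211229-n-1 ...
</pre></div> <p>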
This is why everything was broken: because those IDs didn't match, <code>grub2-mkconfig</code> couldn't find the files to correct, so the argument was wrong, so the system didn't boot.</p> <p>Now I knew what was going wrong and I only had two systemd commits left on the list, it was pretty easy to see the problem. It was in <a href="https://github.com/systemd/systemd/commit/357376d0bb525b064f468e0e2af8193b4b90d257">357376d</a>. That changes how <code>kernel-install</code> names these snippet files when creating them. It names them by finding a machine ID to use as a prefix. Previously, it used whatever string was in <code>/etc/machine-id</code>; if that file didn't exist or was empty, it just used the string "Default". After that commit, it also looks for a value specified in <code>/etc/machine-info</code>. If there's a <code>/etc/machine-id</code> but not <code>/etc/machine-info</code> when you run <code>kernel-install</code>, it uses the value from <code>/etc/machine-id</code> and writes it to <code>/etc/machine-info</code>.</p> <p>When I checked those files, it turned out that on the live image, the ID in both <code>/etc/machine-id</code> and <code>/etc/machine-info</code> was <code>a69bd9379d6445668e7df3ddbda62f86</code> - the problematic ID on the installed system. When we generate the live image itself, <code>kernel-install</code> uses the value from <code>/etc/machine-id</code> and writes it to <code>/etc/machine-info</code>, and both files wind up in the live filesystem. But <em>on the installed system</em>, the ID in <code>/etc/machine-info</code> was that same value, but the ID in <code>/etc/machine-id</code> was different (as we saw above).</p> <p>Remember how I mentioned above that when doing a live install, we essentially dump the live filesystem itself onto the installed system? Well, one of the 'tweaks' we make when doing this is to <a href="https://github.com/rhinstaller/anaconda/blob/612bfee7d37e16d7bfab329b44182a71d04a3344/pyanaconda/modules/storage/bootloader/utils.py#L41-L48">re-generate <code>/etc/machine-id</code></a>, because that ID is meant to be unique to each installed system - we don't want every system installed from a Fedora live image to have the same machine ID as the live image itself. However, as this <code>/etc/machine-info</code> file is new, we don't strip it from or re-generate it in the installed system, we just install it. The installed system has a <code>/etc/machine-info</code> with the same ID as the live image's machine ID, but a new, different ID in <code>/etc/machine-id</code>. And this (finally) was the ultimate source of the problem! When we run them on the installed system, the new version of <code>kernel-install</code> writes config snippet files using the ID from <code>/etc/machine-info</code>. But Fedora's patched <code>grub2-mkconfig</code> scriptlet doesn't know about that mechanism at all (since it's brand new), and expects the snippet files to contain the ID from <code>/etc/machine-id</code>.</p> <p>There are various ways you could potentially solve this, but after consulting with systemd upstream, the one we chose is to <a href="https://github.com/rhinstaller/anaconda/commit/fe652add9733943a3476128a83b24b9a8c63b335">have anaconda exclude <code>/etc/machine-info</code></a> when doing a live install. The changes to systemd here aren't wrong - it does potentially make sense that <code>/etc/machine-id</code> and <code>/etc/machine-info</code> could both exist and specify different IDs in some cases. 
But for the case of Fedora live installs, it doesn't make sense. The sanest result is for those IDs to match and both be the 'fresh' machine ID that's generated at the end of the install process. By just not including <code>/etc/machine-info</code> on the installed system, we achieve this result, because now when <code>kernel-install</code> runs at the end of the install process, it reads the ID from <code>/etc/machine-id</code> and writes it to <code>/etc/machine-info</code>, and both IDs are the same, <code>grub2-mkconfig</code> finds the snippet files and edits them correctly, the installed system boots, and I can move along to the next debugging odyssey...</p> Site and blog migration https://www.happyassassin.net/posts/2020/11/24/site-and-blog-migration/ 2020-11-24T00:36:54Z 2020-11-24T00:36:54Z Adam Williamson <p>So I've been having an adventurous week here at HA Towers: I decided, after something more than a decade, I'm going to get out of the self-hosting game, as far as I can. It makes me a bit sad, because it's been kinda cool to do and I think it's worked pretty well, but I'm getting to a point where it seems silly that a small part of me has to constantly be concerned with making sure my web and mail servers and all the rest of it keep working, when the services exist to do it much more efficiently. It's cool that it's still possible to do it, but I don't think I need to <em>actually do it</em> any more.</p> <p>So, if you're reading this...and I didn't do something really weird...it's not being served to you by a Fedora system three feet from my desk any more. It's being served to you by a server owned by a commodity web hoster...somewhere in North America...running Lightspeed (boo) on who knows what OS. I pre-paid for four years of hosting before realizing they were running proprietary software, and I figured what the hell, it's just a web serving serving static files. If it starts to really bug me I'll move it, and hopefully you'll never notice.</p> <p>All the redirects for old Wordpress URLs should still be in place, and also all URLs for software projects I used to host here (fedfind etc) should redirect to appropriate places in Pagure and/or Pypi. Please yell if you see something that seems to be wrong. I moved <a href="https://openqa.fedoraproject.org/nightlies.html">nightlies</a> and <a href="https://openqa.fedoraproject.org/testcase_stats/">testcase_stats</a> to the Fedora openQA server for now; that's still a slightly odd place for them to be, but at least it's in the Fedora domain not on my personal domain, and it was easiest to do since I have all the necessary permissions, putting them anywhere else would be more work and require other people to do stuff, so this is good enough for now. Redirects are in place for those too.</p> <p>I've been working on all the other stuff I self-host, too. Today I set up all the IRC channels I regularly read in my <a href="https://matrix.org/">Matrix</a> account and I'm going to try using that setup for IRC instead of my own proxy (which ran <a href="https://bip.milkypond.org/">bip</a>). It seems to work okay so far. I'm using the <a href="https://github.com/quotient-im/Quaternion">Quaternion</a> client for now, as it seems to have the most efficient UI layout and isn't a big heavy wrapper around a web client. Matrix is a really cool thing, and it'd be great to see more F/OSS projects adopting it to lower barriers to entry without compromising F/OSS principles; IRC really is getting pretty creaky these days, folks. 
There's some talk about both Fedora and GNOME adopting Matrix officially, and I really hope that happens.</p> <p>I also set up a <a href="https://kolabnow.com/">Kolab Now</a> account and switched my contacts and calendar to it, which was nice and easy to do (download the ICS files from Radicale, upload them to Kolab, switch my accounts on my laptops and phone, shut down the Radicale server, done). I also plan to have it serve my mail, but that migration is going to be the longest and most complicated as I'll have to move several gigs of mail and re-do all my filters. Fun!</p> <p>I also refreshed my "desktop" setup; after (again) something more than a decade having a dedicated desktop PC I'm trying to roll without one again. Back when I last did this, I got to resenting the clunky nature of docking at the time, and also I still ran quite a lot of local code compiles and laptops aren't ideal for that. These days, though, docking is getting pretty slick, and I don't recall the last time I built anything really chunky locally. My current laptop (a 2017 XPS 13) should have enough power anyhow, for the occasional case. So I got me a <a href="https://www.apple.com/ca/shop/product/HMX12ZM/A/caldigit-ts3-plus-dock">fancy Thunderbolt dock</a> - yes, from the Apple store, because apparently no-one else has it in stock in Canada - and a <a href="https://www.techradar.com/reviews/benq-ew3270u-monitor">32" 4K monitor</a> and plugged the things into the things and waited a whole night while all sorts of gigantic things I forgot I had lying around my home directory synced over to the laptop and...hey, it works. Probably in two months I'll run into something weird that's only set up on the old desktop box, but hey.</p> <p>So once I have all this wrapped up I'm aiming to have substantially fewer computers lying around here and fewer Sysadmin Things taking up space in my brain. At the cost of being able to say I run an entire domain out of a $20 TV stand in my home office. Ah, well.</p> <p>Oh, I also bought a <a href="https://blueradius.ca">new domain</a> as part of this whole thing, as a sort of backup / staging area for transitions and also possibly as an alternative vanity domain. Because it is sometimes awkward telling people yes, my email address is happyassassin.net, no, I'm not an assassin, don't worry, it's a name based on a throwaway joke from university which I probably wouldn't have picked if I knew I'd be signing up for bank accounts with it fifteen years later. So if I do start using it for stuff, here is your advance notice that yeah, it's me. This name I just picked to be vaguely memorable and hopefully to be entirely inoffensive, vaguely professional-sounding, and composed of sounds that are unambiguous when read over an international phone line to a call centre in India. It doesn't mean anything at all.</p> <p>So I've been having an adventurous week here at HA Towers: I decided, after something more than a decade, I'm going to get out of the self-hosting game, as far as I can. It makes me a bit sad, because it's been kinda cool to do and I think it's worked pretty well, but I'm getting to a point where it seems silly that a small part of me has to constantly be concerned with making sure my web and mail servers and all the rest of it keep working, when the services exist to do it much more efficiently. 
It's cool that it's still possible to do it, but I don't think I need to <em>actually do it</em> any more.</p> <p>So, if you're reading this...and I didn't do something really weird...it's not being served to you by a Fedora system three feet from my desk any more. It's being served to you by a server owned by a commodity web hoster...somewhere in North America...running Lightspeed (boo) on who knows what OS. I pre-paid for four years of hosting before realizing they were running proprietary software, and I figured what the hell, it's just a web serving serving static files. If it starts to really bug me I'll move it, and hopefully you'll never notice.</p> <p>All the redirects for old Wordpress URLs should still be in place, and also all URLs for software projects I used to host here (fedfind etc) should redirect to appropriate places in Pagure and/or Pypi. Please yell if you see something that seems to be wrong. I moved <a href="https://openqa.fedoraproject.org/nightlies.html">nightlies</a> and <a href="https://openqa.fedoraproject.org/testcase_stats/">testcase_stats</a> to the Fedora openQA server for now; that's still a slightly odd place for them to be, but at least it's in the Fedora domain not on my personal domain, and it was easiest to do since I have all the necessary permissions, putting them anywhere else would be more work and require other people to do stuff, so this is good enough for now. Redirects are in place for those too.</p> <p>I've been working on all the other stuff I self-host, too. Today I set up all the IRC channels I regularly read in my <a href="https://matrix.org/">Matrix</a> account and I'm going to try using that setup for IRC instead of my own proxy (which ran <a href="https://bip.milkypond.org/">bip</a>). It seems to work okay so far. I'm using the <a href="https://github.com/quotient-im/Quaternion">Quaternion</a> client for now, as it seems to have the most efficient UI layout and isn't a big heavy wrapper around a web client. Matrix is a really cool thing, and it'd be great to see more F/OSS projects adopting it to lower barriers to entry without compromising F/OSS principles; IRC really is getting pretty creaky these days, folks. There's some talk about both Fedora and GNOME adopting Matrix officially, and I really hope that happens.</p> <p>I also set up a <a href="https://kolabnow.com/">Kolab Now</a> account and switched my contacts and calendar to it, which was nice and easy to do (download the ICS files from Radicale, upload them to Kolab, switch my accounts on my laptops and phone, shut down the Radicale server, done). I also plan to have it serve my mail, but that migration is going to be the longest and most complicated as I'll have to move several gigs of mail and re-do all my filters. Fun!</p> <p>I also refreshed my "desktop" setup; after (again) something more than a decade having a dedicated desktop PC I'm trying to roll without one again. Back when I last did this, I got to resenting the clunky nature of docking at the time, and also I still ran quite a lot of local code compiles and laptops aren't ideal for that. These days, though, docking is getting pretty slick, and I don't recall the last time I built anything really chunky locally. My current laptop (a 2017 XPS 13) should have enough power anyhow, for the occasional case. 
So I got me a <a href="https://www.apple.com/ca/shop/product/HMX12ZM/A/caldigit-ts3-plus-dock">fancy Thunderbolt dock</a> - yes, from the Apple store, because apparently no-one else has it in stock in Canada - and a <a href="https://www.techradar.com/reviews/benq-ew3270u-monitor">32" 4K monitor</a> and plugged the things into the things and waited a whole night while all sorts of gigantic things I forgot I had lying around my home directory synced over to the laptop and...hey, it works. Probably in two months I'll run into something weird that's only set up on the old desktop box, but hey.</p> <p>So once I have all this wrapped up I'm aiming to have substantially fewer computers lying around here and fewer Sysadmin Things taking up space in my brain. At the cost of being able to say I run an entire domain out of a $20 TV stand in my home office. Ah, well.</p> <p>Oh, I also bought a <a href="https://blueradius.ca">new domain</a> as part of this whole thing, as a sort of backup / staging area for transitions and also possibly as an alternative vanity domain. Because it is sometimes awkward telling people yes, my email address is happyassassin.net, no, I'm not an assassin, don't worry, it's a name based on a throwaway joke from university which I probably wouldn't have picked if I knew I'd be signing up for bank accounts with it fifteen years later. So if I do start using it for stuff, here is your advance notice that yeah, it's me. This name I just picked to be vaguely memorable and hopefully to be entirely inoffensive, vaguely professional-sounding, and composed of sounds that are unambiguous when read over an international phone line to a call centre in India. It doesn't mean anything at all.</p> Fedora 32 release and Lenovo announcement https://www.happyassassin.net/posts/2020/04/28/fedora-32-release-and-lenovo-announcement/ 2020-04-28T23:18:03Z 2020-04-28T23:18:03Z Adam Williamson <p>It's been a big week in Fedora news: first came the <a href="https://fedoramagazine.org/coming-soon-fedora-on-lenovo-laptops/">announcement of Lenovo planning to ship laptops preloaded with Fedora</a>, and today <a href="https://fedoramagazine.org/announcing-fedora-32/">Fedora 32 is released</a>. I'm happy this release was again "on time" (at least if you go by our definition and not Phoronix's!), though it was kinda chaotic in the last week or so. We just changed <a href="https://pagure.io/releng/issue/9403#comment-643466">the installer, the partitioning library, the custom partitioning tool, the kernel and the main desktop's display manager</a> - that's all perfectly normal stuff to change a day before you sign off the release, right? I'm pretty confident this is fine!</p> <p>But seriously folks, I think it turned out to be a pretty good sausage, like most of the ones we've put on the shelves lately. Please do take it for a spin and see how it works for you. </p> <p>I'm also really happy about the Lenovo announcement. The team working on that has been doing an awful lot of diplomacy and negotiation and cajoling for quite a while now and it's great to see it pay off. The RH Fedora QA team was formally brought into the plan in the last month or two, and Lenovo has kindly provided us with several test laptops which we've distributed around. 
While the project wasn't public we were clear that we couldn't do anything like making the Fedora 32 release contingent on test results on Lenovo hardware purely for this reason or anything like that, but both our team and Lenovo's have been running tests and we did accept several freeze exceptions to fix bugs like <a href="https://bugzilla.redhat.com/show_bug.cgi?id=1814015">this one</a>, which also affected some Dell systems and maybe others too. Now this project is officially public, it's possible we'll consider adding some official release criteria for the supported systems, or something like that, so look out for proposals on the mailing lists in future.</p> <p>It's been a big week in Fedora news: first came the <a href="https://fedoramagazine.org/coming-soon-fedora-on-lenovo-laptops/">announcement of Lenovo planning to ship laptops preloaded with Fedora</a>, and today <a href="https://fedoramagazine.org/announcing-fedora-32/">Fedora 32 is released</a>. I'm happy this release was again "on time" (at least if you go by our definition and not Phoronix's!), though it was kinda chaotic in the last week or so. We just changed <a href="https://pagure.io/releng/issue/9403#comment-643466">the installer, the partitioning library, the custom partitioning tool, the kernel and the main desktop's display manager</a> - that's all perfectly normal stuff to change a day before you sign off the release, right? I'm pretty confident this is fine!</p> <p>But seriously folks, I think it turned out to be a pretty good sausage, like most of the ones we've put on the shelves lately. Please do take it for a spin and see how it works for you. </p> <p>I'm also really happy about the Lenovo announcement. The team working on that has been doing an awful lot of diplomacy and negotiation and cajoling for quite a while now and it's great to see it pay off. The RH Fedora QA team was formally brought into the plan in the last month or two, and Lenovo has kindly provided us with several test laptops which we've distributed around. While the project wasn't public we were clear that we couldn't do anything like making the Fedora 32 release contingent on test results on Lenovo hardware purely for this reason or anything like that, but both our team and Lenovo's have been running tests and we did accept several freeze exceptions to fix bugs like <a href="https://bugzilla.redhat.com/show_bug.cgi?id=1814015">this one</a>, which also affected some Dell systems and maybe others too. Now this project is officially public, it's possible we'll consider adding some official release criteria for the supported systems, or something like that, so look out for proposals on the mailing lists in future.</p> No more Wordpress! https://www.happyassassin.net/posts/2020/04/24/no-more-wordpress/ 2020-04-24T00:57:51Z 2020-04-24T00:57:51Z Adam Williamson <p>So I finally managed to bite the bullet and move my blog off Wordpress! I've tried this multiple times over the last few years but always sort of ran out of gas, but this time I finished the job. I'm using <a href="https://getnikola.com">Nikola</a>, and with a bit of poking around, managed to convert my entire blog, including existing comments. I don't intend to allow new comments or user registrations, but I wanted to keep the existing ones visible.</p> <p>More or less all old URLs should be redirected properly. This domain is still set up in a really icky way that I should redo sometime, but that's gonna have to wait till I get some more roundtuits. 
I didn't bother trying to copy the theme I was using before, I'm just using one of the stock Nikola themes with minor tweaks to display the comments, so the site's appearance is a bit different now, but hey, it's just a blog.</p> <p>I killed my tt-rss deployment and an old cgit deployment I had forgotten I had running at the same time. Now if I can find some time to switch from Roundcube to Mailpile or something, I can uninstall PHP forever...</p> <p>So I finally managed to bite the bullet and move my blog off Wordpress! I've tried this multiple times over the last few years but always sort of ran out of gas, but this time I finished the job. I'm using <a href="https://getnikola.com">Nikola</a>, and with a bit of poking around, managed to convert my entire blog, including existing comments. I don't intend to allow new comments or user registrations, but I wanted to keep the existing ones visible.</p> <p>More or less all old URLs should be redirected properly. This domain is still set up in a really icky way that I should redo sometime, but that's gonna have to wait till I get some more roundtuits. I didn't bother trying to copy the theme I was using before, I'm just using one of the stock Nikola themes with minor tweaks to display the comments, so the site's appearance is a bit different now, but hey, it's just a blog.</p> <p>I killed my tt-rss deployment and an old cgit deployment I had forgotten I had running at the same time. Now if I can find some time to switch from Roundcube to Mailpile or something, I can uninstall PHP forever...</p> Do not upgrade to Fedora 32, and do not adjust your sets https://www.happyassassin.net/posts/2020/02/14/do-not-upgrade-to-fedora-32-and-do-not-adjust-your-sets/ 2020-02-14T17:30:26Z 2020-02-14T17:30:26Z Adam Williamson <p></p><p>If you were unlucky today, you might have received a notification from GNOME in Fedora 30 or 31 that Fedora 32 is now available for upgrade.</p> <p>This might have struck you as a bit odd, it being rather early for Fedora 32 to be out and there not being any news about it or anything. And if so, you'd be right! This was an error, and we're very sorry for it.</p> <p>What happened is that a <a href="https://admin.fedoraproject.org/pkgdb/api/collections/">particular bit of data</a> which GNOME Software (among other things) uses as its source of truth about Fedora releases was updated for the branching of Fedora 32...but by mistake, 32 was added with status 'Active' (meaning 'stable release') rather than 'Under Development'. This fooled poor GNOME Software into thinking a new stable release was available, and telling you about it.</p> <p>Kamil Paral spotted this very quickly and releng fixed it right away, but if your GNOME Software happened to check for updates during the few minutes the incorrect data was up, it will have cached it, and you'll see the incorrect notification for a while.</p> <p>Please <strong>DO NOT</strong> upgrade to Fedora 32 yet. It is under heavy development and is very much not ready for normal use. We're very sorry for the incorrect notification and we hope it didn't cause too much disruption.</p> <p></p><p>If you were unlucky today, you might have received a notification from GNOME in Fedora 30 or 31 that Fedora 32 is now available for upgrade.</p> <p>This might have struck you as a bit odd, it being rather early for Fedora 32 to be out and there not being any news about it or anything. And if so, you'd be right! 
This was an error, and we're very sorry for it.</p> <p>What happened is that a <a href="https://admin.fedoraproject.org/pkgdb/api/collections/">particular bit of data</a> which GNOME Software (among other things) uses as its source of truth about Fedora releases was updated for the branching of Fedora 32...but by mistake, 32 was added with status 'Active' (meaning 'stable release') rather than 'Under Development'. This fooled poor GNOME Software into thinking a new stable release was available, and telling you about it.</p> <p>Kamil Paral spotted this very quickly and releng fixed it right away, but if your GNOME Software happened to check for updates during the few minutes the incorrect data was up, it will have cached it, and you'll see the incorrect notification for a while.</p> <p>Please <strong>DO NOT</strong> upgrade to Fedora 32 yet. It is under heavy development and is very much not ready for normal use. We're very sorry for the incorrect notification and we hope it didn't cause too much disruption.</p> Using Zuul CI with Pagure.io https://www.happyassassin.net/posts/2020/02/12/using-zuul-ci-with-pagure-io/ 2020-02-12T18:15:47Z 2020-02-12T18:15:47Z Adam Williamson <p></p><p>I attended <a href="https://www.devconf.info/cz/">Devconf.cz</a> again this year - I'll try and post a full blog post on that soon. One of the most interesting talks, though, was <a href="https://devconfcz2020a.sched.com/event/YOtV/cicd-for-fedora-packaging-with-zuul">CI/CD for Fedora packaging with Zuul</a>, where Fabien Boucher and Matthieu Huin introduced the work they've done to integrate <a href="https://fedora.softwarefactory-project.io/zuul/status">a specific Zuul instance</a> (part of the <a href="https://www.softwarefactory-project.io/">Software Factory</a> effort) with the <a href="https://src.fedoraproject.org/">Pagure instance Fedora uses for packages</a> and also with <a href="https://pagure.io/">Pagure.io</a>, the general-purpose Pagure instance that many Fedora groups use to host projects, including us in QA.</p> <p>They've done a lot of work to make it as simple as possible to hook up a project in either Pagure instance to run CI via Zuul, and it looked pretty cool, so I thought I'd try it on one of our projects and see how it compares to other options, like the Jenkins-based <a href="https://www.happyassassin.net/2017/02/16/getting-started-with-pagure-ci/">Pagure CI</a>.</p> <p>I wound up more or less following the instructions on <a href="https://fedoraproject.org/wiki/Zuul-based-ci#How_to_Zuul_attach_a_Pagure_repository_on_Zuul">this Wiki page</a>, but it does not give you an example of a minimal framework in the project repository itself to actually run some checks. 
However, after I submitted the pull request for <a href="https://pagure.io/fedora-project-config">fedora-project-config</a> as explained on the wiki page, Tristan Cacqueray was kind enough to send me this as <a href="https://pagure.io/fedora-qa/os-autoinst-distri-fedora/pull-request/141">a pull request for my project repository</a>.</p> <p>So, all that was needed to get a kind of 'hello world' process running was:</p> <p></p><ol> <li>Add the appropriate web hook in the project options</li> <li>Add the 'zuul' user as a committer on the project in the project options</li> <li>Get a <a href="https://pagure.io/fedora-project-config/pull-request/45">pull request</a> merged to fedora-project-config to add the desired project</li> <li>Add a <a href="https://pagure.io/fedora-qa/os-autoinst-distri-fedora/pull-request/141#request_diff">basic Zuul config which runs a single job</a></li> </ol> <p>After that, the next step was to have it run <em>useful</em> checks. I set the project up such that all the appropriate checks could be run just by calling <code>tox</code> (which is a great test runner for Python projects) - see the <a href="https://pagure.io/fedora-qa/os-autoinst-distri-fedora/blob/master/f/tox.ini">tox configuration</a>. Then, with a bit more help from Tristan, I was able to tweak the Zuul config to run it successfully. This mainly required a couple of things:</p> <ol> <li>Adding <code>nodeset: fedora-31-vm</code> to the <a href="https://pagure.io/fedora-qa/os-autoinst-distri-fedora/blob/master/f/.zuul.yaml">Zuul config</a> - this makes the CI job run on a Fedora 31 VM rather than the default CentOS 7 VM (CentOS 7's tox is too old for a modern Python 3 project)</li> <li>Modifying the <a href="https://pagure.io/fedora-qa/os-autoinst-distri-fedora/blob/master/f/ci/tox.yaml">job configuration</a> to ensure tox is installed (there's a canned role for this, called <code>ensure-tox</code>) and also all available Python interpreters (using the <code>package</code> module)</li> </ol> <p>This was all pretty small and easy stuff, and we had the whole thing up and running in a few hours. Now it all works great, so whenever a pull request is submitted for the project, the tests are automatically run and the results shown on the pull request.</p> <p>You can set up more complex workflows where Zuul takes over merging of pull requests entirely - an admin posts a comment indicating a PR is ready to merge, whereupon Zuul will retest it and then merge it automatically if the test succeeds. This can also be used to merge series of PRs together, with proper testing. But for my small project, this simple integration is enough so far.</p> <p>It's been a positive experience working with the system so far, and I'd encourage others to try it for their packages and Pagure projects!</p> <p></p><p>I attended <a href="https://www.devconf.info/cz/">Devconf.cz</a> again this year - I'll try and post a full blog post on that soon. 
One of the most interesting talks, though, was <a href="https://devconfcz2020a.sched.com/event/YOtV/cicd-for-fedora-packaging-with-zuul">CI/CD for Fedora packaging with Zuul</a>, where Fabien Boucher and Matthieu Huin introduced the work they've done to integrate <a href="https://fedora.softwarefactory-project.io/zuul/status">a specific Zuul instance</a> (part of the <a href="https://www.softwarefactory-project.io/">Software Factory</a> effort) with the <a href="https://src.fedoraproject.org/">Pagure instance Fedora uses for packages</a> and also with <a href="https://pagure.io/">Pagure.io</a>, the general-purpose Pagure instance that many Fedora groups use to host projects, including us in QA.</p> <p>They've done a lot of work to make it as simple as possible to hook up a project in either Pagure instance to run CI via Zuul, and it looked pretty cool, so I thought I'd try it on one of our projects and see how it compares to other options, like the Jenkins-based <a href="https://www.happyassassin.net/2017/02/16/getting-started-with-pagure-ci/">Pagure CI</a>.</p> <p>I wound up more or less following the instructions on <a href="https://fedoraproject.org/wiki/Zuul-based-ci#How_to_Zuul_attach_a_Pagure_repository_on_Zuul">this Wiki page</a>, but it does not give you an example of a minimal framework in the project repository itself to actually run some checks. However, after I submitted the pull request for <a href="https://pagure.io/fedora-project-config">fedora-project-config</a> as explained on the wiki page, Tristan Cacqueray was kind enough to send me this as <a href="https://pagure.io/fedora-qa/os-autoinst-distri-fedora/pull-request/141">a pull request for my project repository</a>.</p> <p>So, all that was needed to get a kind of 'hello world' process running was:</p> <p></p><ol> <li>Add the appropriate web hook in the project options</li> <li>Add the 'zuul' user as a committer on the project in the project options</li> <li>Get a <a href="https://pagure.io/fedora-project-config/pull-request/45">pull request</a> merged to fedora-project-config to add the desired project</li> <li>Add a <a href="https://pagure.io/fedora-qa/os-autoinst-distri-fedora/pull-request/141#request_diff">basic Zuul config which runs a single job</a></li> </ol> <p>After that, the next step was to have it run <em>useful</em> checks. I set the project up such that all the appropriate checks could be run just by calling <code>tox</code> (which is a great test runner for Python projects) - see the <a href="https://pagure.io/fedora-qa/os-autoinst-distri-fedora/blob/master/f/tox.ini">tox configuration</a>. Then, with a bit more help from Tristan, I was able to tweak the Zuul config to run it successfully. This mainly required a couple of things:</p> <ol> <li>Adding <code>nodeset: fedora-31-vm</code> to the <a href="https://pagure.io/fedora-qa/os-autoinst-distri-fedora/blob/master/f/.zuul.yaml">Zuul config</a> - this makes the CI job run on a Fedora 31 VM rather than the default CentOS 7 VM (CentOS 7's tox is too old for a modern Python 3 project)</li> <li>Modifying the <a href="https://pagure.io/fedora-qa/os-autoinst-distri-fedora/blob/master/f/ci/tox.yaml">job configuration</a> to ensure tox is installed (there's a canned role for this, called <code>ensure-tox</code>) and also all available Python interpreters (using the <code>package</code> module)</li> </ol> <p>This was all pretty small and easy stuff, and we had the whole thing up and running in a few hours. 
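To give a rough idea of the shape of it - the job name here is made up and the details are simplified, the real files are linked above - the project side boils down to a small <code>.zuul.yaml</code> plus the Ansible playbook it points at:</p> <div class="code"><pre class="code literal-block"># .zuul.yaml (sketch - the job name is illustrative)
- job:
    name: project-tox
    description: Run the project's tox checks
    run: ci/tox.yaml
    nodeset: fedora-31-vm

- project:
    check:
      jobs:
        - project-tox

# ci/tox.yaml (sketch) - make sure tox and the interpreters are there, then run tox
- hosts: all
  roles:
    - ensure-tox
  tasks:
    - name: Install Python interpreters
      become: true
      package:
        name: python3
        state: present
    - name: Run tox
      command: tox
      args:
        chdir: "{{ zuul.project.src_dir }}"
</pre></div> <p>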
Now it all works great, so whenever a pull request is submitted for the project, the tests are automatically run and the results shown on the pull request.</p> <p>You can set up more complex workflows where Zuul takes over merging of pull requests entirely - an admin posts a comment indicating a PR is ready to merge, whereupon Zuul will retest it and then merge it automatically if the test succeeds. This can also be used to merge series of PRs together, with proper testing. But for my small project, this simple integration is enough so far.</p> <p>It's been a positive experience working with the system so far, and I'd encourage others to try it for their packages and Pagure projects!</p> Uptime https://www.happyassassin.net/posts/2020/02/02/uptime/ 2020-02-02T17:17:40Z 2020-02-02T17:17:40Z Adam Williamson <p></p><p>OK, so that was two days longer than I was expecting! Sorry for the extended downtime, folks, especially Fedora folks. It was rather beyond my control. But now I'm (just barely) back, through the single working cable outlet in the house and a powerline ethernet connection to the router, at least until the cable co can come and fix all the other outlets!</p> <p></p><p>OK, so that was two days longer than I was expecting! Sorry for the extended downtime, folks, especially Fedora folks. It was rather beyond my control. But now I'm (just barely) back, through the single working cable outlet in the house and a powerline ethernet connection to the router, at least until the cable co can come and fix all the other outlets!</p>