New openQA tests: update live image build/install

Hot on the heels of adding installer image build/install tests to openQA, I've now added tests which do just the same, but for the Workstation live image.

That means that, when running the desktop tests for an update, openQA will also run a test that builds a Workstation live image and a test that boots and installs it. The packages from the update will be used - if relevant - in the live image creation environment, and included in the live image itself. This will allow us to catch problems in updates that relate to the build and basic functionality of live images.

Here's an update where you can see that both the installer and live image build tests ran successfully and passed - see the updates-everything-boot-iso and updates-workstation-live-iso flavors.

I'm hoping this will help us catch compose issues much more easily during the upcoming Fedora 30 release process.

Devconf.cz 2019 trip report

I've just got back from my Devconf.cz 2019 trip, after spending a few days after the conference in Red Hat's Brno office with other Fedora QA team members, then a few days visiting family.

I gave both my talks - Don't Move That Fence 'Til You Know Why It's There and Things Fedora QA Robots Do - and both were well-attended and, I think, well-received. The slide decks are up on the talk pages, and recordings should, I believe, go up on the DevConf YouTube channel soon.

I attended many other talks, my personal favourite being Stef Walter's Using machine learning to find Linux bugs. Stef noticed something I have also seen in our openQA work: "test flakes" are very often not just random blips but genuine bugs that can be investigated and fixed with a little care. He ran with that idea, using the extensive results generated by Cockpit's automated test suite as input data for a machine learning-based system which clusters "test flakes" based on an analysis of key data from the logs for each test. When a large number of apparent "flakes" turn out to share significant common features, they are likely to be occurrences of the same bug, and the team can then go back, analyze the commonalities between those cases, and identify the underlying bug. We likely aren't currently running enough tests in openQA to use the approach Stef outlined in full, but the concept is very interesting and may be useful in future with more data, and perhaps for Fedora CI results.

Other useful and valuable talks I saw included Dan Walsh on podman, Lennart Poettering on portable services, Daniel Mach and Jaroslav Mracek on the future of DNF, Kevin Fenzi and Stephen Smoogen on the future of EPEL, Jiri Benc and Marian Šámal on a summer coding camp for kids, Ben Cotton on community project management, the latest edition of Will Woods' and Stephen Gallagher's crusade to kill all scriptlets, and the Fedora Council BoF.

There were also of course lots of useful "hallway track" sessions with Miroslav Vadkerti, Kevin Fenzi, Mohan Boddu, Patrick Uiterwijk, Alexander Bokovoy, Dominik Perpeet, Matthew Miller, Brian Exelbierd and many more - it is invaluable to be able to catch up with people in person and discuss things that are harder to do in tickets and IRC.

As usual it was an enjoyable and productive event, and the rum list at the Bar That Doesn't Exist remains as impressive as ever...;)

Devconf.cz 2019

For anyone who - inexplicably - hasn't already had it in their social calendar in pink sharpie for months, I will be at Devconf.cz 2019 this weekend, at FIT VUT in Brno. I'll be doing two talks: Things Fedora QA Robots Do on Friday at 3pm (which is basically a brain dump about the pile of little fedmsg consumers that do quite important jobs that probably no-one knows about but me), and Don't Move That Fence 'Til You Know Why It's There on Saturday at 11am, which is a less QA-specific talk that's about how I reckon you ought to go about changing code. The slides for both talks are up now, if you want a sneak preview (though if you do, you're disqualified from the audience participation section of the "fence" talk!)

Do come along to the talks, if you're around and there's nothing more interesting in that timeslot. Otherwise feel free to buttonhole me around the conference any time.

New openQA tests: update installer tests and desktop app start/stop test

It's been a while since I wrote about significant developments in Fedora openQA, so today I'll be writing about two! I wrote about one of them a bit in my last post, but that was primarily about a bug I ran into along the way, so now let's focus on the changes themselves.

Testing of install media built from packages in updates-testing

We have long had a problem in Fedora testing that we could not always properly test installer changes. This is most significant during the period of development after a new release has branched from Rawhide, but before it is released as the new stable Fedora release (we use the name 'Branched' to refer to a release in this state; in a month or so, Fedora 30 will branch from Rawhide and become the current Branched release).

During most of this time, the Bodhi update system is enabled for the release. New packages built for the release do not immediately appear in any repositories, but - as with stable releases - must be submitted as "updates", sometimes together with related packages. Once submitted as an update, the package(s) are sent to the "updates-testing" repository for the release. This repository is enabled on installed Branched systems by default (this is a difference from stable releases), so testers who have already installed Branched will receive the package(s) at this point (unless they disable the "updates-testing" repository, which some do). However, the package is still not truly a part of the release at this point. It is not included in the nightly testing composes, nor will it be included in any Beta or Final candidate composes that may be run while it is in updates-testing. That means that if the actual release media were composed while the package was still in updates-testing, it would not be a part of the release proper. Packages only become part of these composes once they pass through Bodhi and are 'pushed stable'.

This system allows us to back out packages that turn out to be problematic, and hopefully to prevent them from destabilizing the test and release composes by not pushing them stable if they turn out to cause problems. It also means more conservative testers have the option to disable the "updates-testing" repository and avoid some destabilizing updates, though of course if all the testers did this, no-one would be finding the problems. In the last few years we have also been running several automated tests on updates (via Taskotron, openQA and the CI pipeline) and reporting results from those to Bodhi, allowing packagers to pull the update if the tests find problems.

However, there has long been a bit of a problem in this process: if an update works fine on an installed system but causes problems when included in (for example) an installer image or live image, we had no good way to find that out. There was no system for automatically building media like this that include the updates currently in testing so they could be tested. The only way to find this sort of problem was for testers to manually create test media - a process that is not widely understood, is time-consuming, and can be somewhat difficult. We also, of course, could not do automated testing without media to test.

We've looked at different ways of addressing this in the past, but ultimately none of them came to much (yet), so last year I decided to just go ahead and do something. And after a bit of a roadblock (see that last post), that something is now done!

Our openQA now has two new tests it runs on all the updates it tests. The first test - here's an example run - builds a network install image, and the second - example run - tests it. Most importantly, any packages from the update under testing are both used in the process of building the install image (if they are relevant to that process) and included in the installer image (if they are packages which would usually be in such an image). Thus if the update breaks the production of the image, or the basic functionality of the image itself, this will be caught. This (finally) means that we have some idea whether a new anaconda, lorax, pykickstart, systemd, dbus, blivet, dracut or any one of dozens of other key packages might break the installer. If you're a packager and you see that one of these two tests has failed for your update, we should look into that! If you're not sure how to go about that, you can poke me, bcl, or the anaconda developers in Freenode #anaconda, and we should be able to help.

It is also possible for a human tester to download the image produced by the first test and run more in-depth tests on it manually; I haven't yet done anything to make that possibility more visible or easier, but will try to look into ways of doing that over the next few weeks.

GNOME application start/stop testing

My colleague Lukáš Růžička has recently been looking into what we might be able to do to streamline and improve our desktop application testing, something I'd honestly been avoiding because it seemed quite intractable! After some great work by Lukáš, one major fruit of this work is now visible in Fedora openQA: a GNOME application start/stop test suite. Here's an example run of it - note that more recent runs have a ton of failures caused by a change in GNOME; Lukáš has proposed a change to the test to address that, but I have not yet reviewed it.

This big test suite simply starts and then exits a large number of the applications installed by default on the Fedora Workstation edition, making sure each one launches and exits successfully. This is of course pretty easy for a human to do - but it's extremely tedious and time-consuming, so it's something we don't do very often (usually only a handful of times per release cycle), meaning a very critical bug (like failing to launch at all) in an application we don't commonly use might go unnoticed for some time.

Making an automated system like openQA do this is actually quite a lot of work, so it was a great job by Lukáš to get it working. Now, by monitoring the results of this test on the nightly composes closely, we should find out much more quickly if one of the tested applications is completely broken (or has gone missing entirely).

AdamW's Debugging Adventures: The Mysterious Disappearing /proc

Yep, folks, it's that time again - time for one of old Grandpa Adam's tall tales of root causing adventure...

There's a sort of catch-22 situation in Fedora that has been a personal bugbear for a very long time. It mainly affects Branched releases - each new Fedora release, when it has branched from Rawhide, but before it has been released. During this period the Bodhi update system is in effect, meaning all new packages have to go through Bodhi review before they are included in the composes for the release. This means, in theory, we should be able to make sure nothing really broken lands in the release. However, there's a big class of really important updates we have never been able to test properly at all: updates that affect the installer.

The catch-22 is this - release engineering only builds install media from the 'stable' package set, those packages that have gone through review. So if a package under review breaks the installer, we can't find that out until we push it stable. Well, you can, but it's quite difficult - you have to learn how to build an installer image yourself, then build one containing the packages from the update and test it. I can do that, but most other people aren't going to bother.

I've filed bugs and talked to people about ways to resolve this multiple times over many years, but a few months back I just got sick of the problem and decided to fix it myself. So I wrote an openQA update test which automates the process: it builds an installer image, with the packages from the update available to the installer image build tool. I also included a subsequent test which takes that image and runs an install with it. Since I already had the process for doing this manually down pat, it wasn't actually very difficult.

Only...when I deployed the test to the openQA staging instance and actually tried it out, I found the installer image build would frequently fail in a rather strange way.

The installer image build process works (more or less) by creating a temporary directory, installing a bunch of packages to it (using dnf's feature of installing to an alternative 'root'), fiddling around with that environment a bit more, creating a disk image whose root is that temporary directory, then fiddling with the image a bit to make it into a bootable ISO. (HANDWAVE HANDWAVE). However, I was finding it would commonly fail in the 'fiddling around with the environment' stage, because somehow some parts of the environment had disappeared. Specifically, it'd show this error:

FileNotFoundError: [Errno 2] No such file or directory: '/var/tmp/lorax.q8xfvc0p/installroot//proc/modules'

lorax was, at that point, trying to touch that file (never mind why). That's the /proc/modules inside the temporary root, basically. The question was, why was it disappearing? And why had neither myself nor bcl (the lorax maintainer) seen it happening previously in manual use, or in official composes?

I tried reproducing it in a virtual machine...and failed. Then I tried again, and succeeded. Then I ran the command again...and it worked! That pattern turned out to repeat: I could usually get it to happen the first time I tried it in a VM, but any subsequent attempts in the same VM succeeded.

So this was seeming really pretty mysterious. Brian couldn't get it to happen at all.

At this point I wrote a dumb, short Python script which just constantly monitored the disappearing location and told me when it appeared and disappeared. I hacked up the openQA test to run this script, and upload the result. Using the timestamps, I was able to figure out exactly what bit of lorax was running when the directory suddenly disappeared. But...I couldn't immediately see why anything in that chunk of lorax would wind up deleting the directory.
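For the curious, the script didn't need to be anything fancy - here's a minimal sketch of the kind of thing I mean (not my exact script; the path and polling interval are just examples):

#!/usr/bin/python3
# Minimal sketch of a "watch this path" script: poll the path and print a
# timestamped message whenever it appears or disappears.
import os
import time

PATH = "/var/tmp/lorax.q8xfvc0p/installroot/proc/modules"  # example path

exists = os.path.exists(PATH)
while True:
    now = os.path.exists(PATH)
    if now != exists:
        state = "appeared" if now else "disappeared"
        print("{0}: {1} {2}".format(time.strftime("%H:%M:%S"), PATH, state), flush=True)
        exists = now
    time.sleep(0.5)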

At this point, other work became more important, and I wound up leaving this on the back burner for a couple of months. Then I came back to it a couple days ago. I picked back up where I left off, and did a custom build of lorax with some debug logging statements strewn around the relevant section, to figure out really precisely where we were when things went wrong. But this turned out to be a bit of a brick wall, because it turned out that at the time the directory disappeared, lorax was just...running mksquashfs. And I could not figure out any plausible reason at all why a run of mksquashfs would cause the directory to vanish.

After a bit, though, the thought struck me - maybe it's not lorax itself wiping the directory out at all! Maybe something else is doing it. So I thought to look at the system logs. And lo and behold, I found my smoking gun. At the exact time my script logged that the directory had disappeared, this message appeared in the system log:

Jan 18 01:57:30 ibm-p8-kvm-03-guest-02.virt.pnr.lab.eng.rdu2.redhat.com systemd[1]: Starting Cleanup of Temporary Directories...

Now, remember, our problem directory is in /var/tmp. So this smells very suspicious indeed! So I figured out what that service actually is - to do this, you just grep for the description ("Cleanup of Temporary Directories") in /usr/lib/systemd/system - and it turned out to be /usr/lib/systemd/system/systemd-tmpfiles-clean.service, which is part of systemd's systemd-tmpfiles mechanism, which you can read up on in great detail in man systemd-tmpfiles and man tmpfiles.d.

I had run into it a few times before, so I had a vague idea what I was dealing with and what to look for. It's basically a mechanism for managing temporary files and directories: you can write config snippets which systemd will read and do stuff like creating expected temporary files or directories on boot (this lets packages manage temporary directories without doing it themselves in scriptlets). I poked through the docs again and, sure enough, it turns out another thing the system can do is delete temporary files that reach a certain age:

Age
The date field, when set, is used to decide what files to delete when cleaning. If a file or directory is
older than the current time minus the age field, it is deleted. The field format is a series of integers
each followed by one of the following suffixes for the respective time units: s, m or min, h, d, w, ms, and
us, meaning seconds, minutes, hours, days, weeks, milliseconds, and microseconds, respectively. Full names
of the time units can be used too.

This systemd-tmpfiles-clean.service does that job. So I went looking for tmpfiles.d snippets that cover /var/tmp, and sure enough, found one, in Fedora's stock config file /usr/lib/tmpfiles.d/tmp.conf:

q /var/tmp 1777 root root 30d

The 30d there is the 'age' field. So this tells the tmpfiles mechanism that it's fine to wipe anything under /var/tmp which is older than 30 days.

Of course, naively we might think our directory won't be older than 30 days - after all, we only just ran lorax! But remember, lorax installs packages into this temporary directory, and files and directories in packages get some of their time attributes from the package. So we (at this point, Brian and I were chatting about the problem as I poked it) looked into how systemd-tmpfiles defines age, precisely:

The age of a file system entry is determined from its last modification timestamp (mtime), its last access
timestamp (atime), and (except for directories) its last status change timestamp (ctime). Any of these three
(or two) values will prevent cleanup if it is more recent than the current time minus the age field.

So since our thing is a directory, its mtime and atime are relevant. So Brian and I both looked into those. He did it manually, while I hacked up my check script to also print the mtime and atime of the directory when it existed. And sure enough, it turned out these were several months in the past - they were obviously related to the date the filesystem package (from which /proc/modules comes) was built. They were certainly more than 30 days in the past.
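If you want to poke at this sort of thing yourself, the relevant check boils down to something like this - a rough approximation of the tmpfiles age rule for a directory, not systemd's actual code:

import os
import time

AGE = 30 * 24 * 60 * 60  # the '30d' from tmp.conf, in seconds

def would_be_cleaned(path, age=AGE):
    # Rough approximation of the tmpfiles age check for a directory: it is
    # a candidate for deletion only if both its mtime and atime are older
    # than (now - age).
    st = os.stat(path)
    cutoff = time.time() - age
    return st.st_mtime < cutoff and st.st_atime < cutoff

print(would_be_cleaned("/var/tmp"))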

Finally, I looked into what was actually running systemd-tmpfiles-clean.service; it's run on a timer, systemd-tmpfiles-clean.timer. That timer is set to run the service 15 minutes after the system boots, and every day thereafter.

So all of this hooked up nicely into a convincing story. openQA kept running into this problem because it always runs the test in a freshly-booted VM - that '15 minutes after boot' was turning out to be right in the middle of the image creation. My manual reproductions were failing on the first try for the same reason - but then succeeding on the second and subsequent tries because the cleaner would not run again until the next day. And Brian and I had rarely if ever seen this when running lorax manually, because it was pretty unlikely that the "once a day" timer would happen to wake up and run just while we had lorax running (and if it ever did, we'd try again, see it work, and figure it had just been some weird transient failure). The problem likely never happens in official composes, I think, because the tmpfiles timer isn't active at all in the environment lorax gets run in (I haven't double-checked this, though).

Brian now gets to deal with the thorny problem of trying to fix this somehow on the lorax side (so the tmpfiles cleanup won't remove bits of the temporary tree even if it does run while lorax is running). Now that I knew what was going on, it was easy enough to work around in the openQA test - I just have the test run systemctl stop systemd-tmpfiles-clean.timer before the image build.

AdamW's Debugging Adventures: Python 3 Porting 201

Hey folks! Time for another edition of AdamW's Debugging Adventures, wherein I boast about how great I am at fixin' stuff.

Today's episode is about a bug in the client for Fedora's Koji buildsystem which has been biting more and more Fedora maintainers lately. The most obvious thing it affects is task watching. When you do a package build with fedpkg, it will by default "watch" the build task - it'll update you when the various subtasks start and finish, and not quit until the build ultimately succeeds or fails. You can also directly watch tasks with koji watch-task. So this is something Fedora maintainers see a lot. There's also a common workflow where you chain something to the successful completion of a fedpkg build or koji watch-task, which relies on the task watch completing and exiting 0 when the build succeeds.

However, recently, people noticed that this task watching seemed to be just...failing, quite a lot. While the task was still running, it'd suddenly exit, usually showing this message:

ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response',))

After a while, nirik realized that this seemed to be associated with the client switching from running under Python 2 by default to running under Python 3 by default: the failure shows up when running on Python 3, but not on Python 2.

Today I finally decided it had got annoying enough that I'd spend some time trying to track it down.

It's pretty obvious that the message we see relates to an exception, in some way. But ultimately something is catching that exception and printing it out and then exiting (we're not actually getting a traceback, as you do if the exception is ultimately left to reach the interpreter). So my first approach was to dig into the watch-task code from the top down, and try and find something that handles exceptions that looks like it might be the bit we were hitting.

And...I failed! This happens, sometimes. In fact I still haven't found the exact bit of code that prints the message and exits. It's OK. Don't give up. Try something else!

So what I did next was kind of a long shot - I just grepped the code for the exception text. I wasn't really expecting this to work, as there's nothing to suggest the actual exception is part of Koji; it's most likely the code doesn't contain any of that text at all. But hey, it's easy to do, so why not? And as it happened, I got lucky and hit paydirt: there happens to be a comment with some of the text from the error we're hitting. And it sure looks like it might be relevant to the problem we're having! The comment itself, and the function it's in, looked so obviously promising that I went ahead and dug a little deeper.

That function, is_conn_error(), is used by only one other thing: this _sendCall() method in the same file. And that seems very interesting, because what it does can be boiled down to: "hey, we got an error! OK, send it to is_conn_error(). If that returns True, then just log a debug message and kick the session. If that returns False, then raise an exception". That behaviour obviously smells a lot like it could be causing our problem. So, I now had a working theory: for some reason, given some particular server behaviour, is_conn_error() returns True on Python 2 but False on Python 3. That causes this _sendCall() to raise an exception instead of just resetting the session and carrying on, and some other code - which we no longer need to find - catches that exception, prints it, and quits.

The next step was to test this theory - because at this point it's only a theory, it could be entirely wrong. I've certainly come up with entirely plausible theories like this before which turned out to be not what was going on at all. So, like a true lazy shortcut enthusiast, I hacked up my local copy of Koji's __init__.py and sprinkled a bunch of lines like print("HERE 1!") and print("HERE 2!") through the whole of is_conn_error(). Then I just ran koji watch-task commands on random tasks until one failed.

This is fine. When you're just trying to debug the problem you don't need to be super elegant about it. You don't need to do a proper git patch and rebuild the Koji package for your system and use proper logging methods and all the rest of it. Just dumping some print lines in a working copy of the file is just fine, if it works. Just remember to put everything back as it was before later. :)

So, as it happened the god of root causing was on my side today, and it turned out I was right on the money. When one of the koji watch-task commands failed, it hit my HERE 1! and HERE 3! lines right when it died. Those told me we were indeed running through is_conn_error() right before the error, and further, where we were coming out of it. We were entering the if isinstance(e, socket.error) block at the start of the function, and returning False because the exception (e) did appear to be an instance of socket.error, but either did not have an errno attribute, or it was not one of errno.ECONNRESET, errno.ECONNABORTED, or errno.EPIPE.

Obviously, this made me curious as to what the exception actually is, whether it has an errno at all, and if so, what it is. So I threw in a few more debugging lines - to print out type(e), and getattr(e, 'errno', 'foobar'). The result of this was pretty interesting. The second print statement gave me 'foobar', meaning the exception doesn't have an errno attribute at all. And the type of the exception was...requests.exceptions.ConnectionError.

That's a bit curious! You wouldn't necessarily expect requests.exceptions.ConnectionError to be an instance of socket.error, would you? So why are we in a block that only handles instances of socket.error? Also, it's clear the code doesn't expect this, because there's a block later in the function that explicitly handles instances of requests.exceptions.ConnectionError - but because this earlier block that handles socket.error instances always returns, we will never reach that block if requests.exceptions.ConnectionError instances are also instances of socket.error. So there's clearly something screwy going on here.

So of course the next thing to do is...look up socket.error in the Python 2 and Python 3 docs. ANY TIME you're investigating a mysterious Python 3 porting issue, remember this can be useful. Here's the Python 2 socket.error entry, and the Python 3 socket.error entry. And indeed there's a rather significant difference! The Python 2 docs talk about socket.error as an exception that is, well, its own unique thing. However, the Python 3 docs say: "A deprecated alias of OSError." - and even tell us specifically that this changed in Python 3.3: "Changed in version 3.3: Following PEP 3151, this class was made an alias of OSError." Obviously, this is looking an awful lot like one more link in the chain of what's going wrong here.

With a bit of Python knowledge you should be able to figure out what's going on now. Think: if socket.error is now just an alias of OSError, what does if isinstance(e, socket.error) mean, in Python 3.3+ ? It means just the same as if isinstance(e, OSError). And guess what? requests.exceptions.ConnectionError happens to be a subclass of OSError. Thus, if e is an instance of requests.exceptions.ConnectionError, isinstance(e, socket.error) will return True in Python 3.3+. In Python 2, it returns False. It's easy to check this in an interactive Python shell or with a test script, to confirm.
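For instance, a quick check along these lines (it needs the requests library installed) shows the difference:

# Run this under both Python 2 and Python 3 and compare the output.
import socket
import requests

e = requests.exceptions.ConnectionError("connection aborted")
# On Python 3.3+, socket.error is just an alias of OSError, and requests'
# ConnectionError ultimately derives from OSError (via IOError), so this
# prints True. On Python 2, socket.error is its own class, so it prints False.
print(isinstance(e, socket.error))
print(socket.error is OSError)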

Because of this, when we run under Python 3 and e is a requests.exceptions.ConnectionError, we unexpectedly enter this block intended for handling socket.error exceptions, and - because that block always returns, the return False line being hit when the errno attribute check fails - we never actually reach the later block that's actually intended to handle requests.exceptions.ConnectionError instances at all: we return False before we get there.

There are a few different ways you could fix this - you could just drop the return False short-circuit line in the socket.error block, for instance, or change the ordering so the requests.exceptions.ConnectionError handling is done first. In the end I sent a pull request which drops the return False, but also drops the if isinstance(e, socket.error) checks (there's another, for nested exceptions, later) entirely. Since socket.error is deprecated in Python 3.3+ we shouldn't really use it, and we probably don't need to - we can just rely on the errno attribute check alone. Whatever type the exception is, if it has an errno attribute and that attribute is errno.ECONNRESET, errno.ECONNABORTED, or errno.EPIPE, I think we can be pretty sure this is a connection error.
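To make that concrete, here's a sketch of the errno-only approach - the idea, not the actual code from the pull request:

import errno

def looks_like_conn_error(e):
    # Ignore the exception's type entirely; just check whether it carries
    # one of the errno values that indicate a dropped connection.
    return getattr(e, 'errno', None) in (errno.ECONNRESET,
                                         errno.ECONNABORTED,
                                         errno.EPIPE)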

What's the moral of this debugging tale? I guess it's this: when porting from Python 2 to Python 3 (or doing anything similar to that), fixing the things that outright crash or obviously behave wrong is sometimes the easy part. Even if everything seems to be working fine on a simple test, it's certainly possible that subtler issues like this could be lurking in the background, causing unexpected failures or (possibly worse) subtly incorrect behaviour. And of course, that's just another reason to add to the big old "Why To Have A Really Good Test Suite" list!

There's also a 'secondary moral', I guess, and that's this: predicting all the impacts of an interface change like this is hard. Remember the Python 3 docs mentioned a PEP associated with this change? Well, here it is. If you read it, it's clear the proposers actually put quite a lot of effort into thinking about how existing code might be affected by the change, but it looks like they still didn't consider a case like this. They talk about "Careless (or "naïve") code" which "blindly catches any of OSError, IOError, socket.error, mmap.error, WindowsError, select.error without checking the errno attribute", and about "Careful code is defined as code which, when catching any of the above exceptions, examines the errno attribute to determine the actual error condition and takes action depending on it" - and claim that "useful compatibility doesn't alter the behaviour of careful exception-catching code". However, Koji's code here clearly matches their definition of "careful" code - it considers both the exception's type, and the errno attribute, in making decisions - but because it is not just doing except socket.error as e or similar, but catching the exception elsewhere and then passing it to this function and using isinstance, it still gets tripped up by the change.

So...the ur-moral, as always, is: software is hard!

Reverse proxying WebSocket requests with Apache: a generic approach that works (even with Firefox)

Right up front, I should say all credit for this goes to Patrick Uiterwijk - I am just writing it up :)

So I'm upgrading Fedora's openQA instances to the latest upstream code, which replaces the old 'interactive mode' with a new 'developer mode'. This relies on the browser being able to establish a WebSocket connection to the server.

Both my pet deployment of openQA and the official Fedora instances have some reverse proxying going on between the browser and the actual box where the openQA server bits are running, and in both cases, Apache is involved.

Some HTTP reverse proxies just magically pass WebSocket requests correctly, I've heard it said, but Apache does not.

The 'standard' way to do reverse proxying with Apache looks something like this:

<VirtualHost *:443>
    ServerName openqa.happyassassin.net
    ProxyPass / https://openqa-backend01/
    ProxyPassReverse / https://openqa-backend01/
    ProxyRequests off
</VirtualHost>

but that alone does not handle WebSocket requests correctly. AIUI, it sort of strips the WebSocket-y bits off and makes them into plain https requests, which the backend server then mishandles because, well, why are you sending plain https requests when you should be sending WebSocket-y ones?

If you've run into this before, and Googled around for solutions, what you've probably found is stuff which relies on knowing specific locations to which requests should always be WebSocket-y, and ProxyPassing those specifically, like this:

<VirtualHost *:443>
    ServerName openqa.happyassassin.net
    ProxyPass /liveviewhandler/ wss://openqa-backend01/liveviewhandler/
    ProxyPassReverse /liveviewhandler/ wss://openqa-backend01/liveviewhandler/
    ProxyPass / https://openqa-backend01/
    ProxyPassReverse / https://openqa-backend01/
    ProxyRequests off
</VirtualHost>

...which basically means 'if this is a request to a path under liveviewhandler/, we know that ought to be a WebSocket request, so proxy it as one; otherwise, proxy as https'. And that works, you can do that. For every websocket-y path. For every application. So long as you can always distinguish between where websocket-y requests go and where plain http-y ones go.

But it seems like a bit of a drag! So instead, why not try this?

<VirtualHost *:443>
    ServerName openqa.happyassassin.net
    RewriteEngine on
    RewriteCond %{HTTP:Upgrade} websocket [NC]
    RewriteCond %{HTTP:Connection} upgrade [NC]
    RewriteRule .* "wss://openqa-backend01%{REQUEST_URI}" [P]
    ProxyPass / https://openqa-backend01/
    ProxyPassReverse / https://openqa-backend01/
    ProxyRequests off
</VirtualHost>

That takes advantage of mod_rewrite, and what it basically says is: if the HTTP Connection header has the string 'upgrade' in it, and the HTTP Upgrade header has the string 'websocket' in it - both case-insensitive, that's what the [NC] means - then it's a WebSocket request, so proxy it as one. Otherwise, proxy as plain https.

You may have seen similar incantations, but stricter, like this:

RewriteCond %{HTTP:Upgrade} ^WebSocket$
RewriteCond %{HTTP:Connection} ^Upgrade$

or case-insensitive but still rejecting other text before or after the key word, or loose about additional text but case-sensitive. If you tried something like that, you might've found it doesn't work with Firefox...and that's because Firefox, apparently, sends 'websocket' (not 'WebSocket') in the Upgrade header, and 'keep-alive, Upgrade' (not just 'Upgrade') in the Connection header. It seems that lots of stuff around WebSocket requests has been developed using Chrom(e|ium) as a reference, assuming the headers will always look just the way Chrome sends them...but that's not the real world!

The magic recipe above should correctly proxy all WebSocket requests through to the backend server unmolested.

Thanks again to Patrick for this!

AdamW's Debugging Adventures: Has Anyone Seen My Kernel?

Welp, since I haven't blogged for a while, here's another root-causing write up! Honestly, root-causing things is one of my favorite parts of the job, lately.

I've been on vacation recently, and when I came back, it looked like several things were wrong with Rawhide. Several of these were relatively easy fixes: live images not behaving right at all and the installer not launching properly any more both only took a couple of hours to track down. But there were also a couple of bugs causing more recent composes to fail entirely. The inimitable puiterwijk dealt with one of those (aarch64 cloud images not building properly), and I wound up taking the other one: overnight, most live image composes had suddenly stopped working.

What's The Problem?

The starting point for this debugging odyssey was the error messages we got in the failed composes:

DEBUG pylorax.treebuilder: kernels=[]
ERROR livemedia-creator: No kernels found, cannot rebuild_initrds

So, the first step is pretty easy: let's just go look up where those errors come from. pylorax and livemedia-creator are both part of the lorax tree, so we'll start there. It's easy to use grep to find the sources of those two messages: they're both in treebuilder.py, the first here, in findkernels() and the second here, in TreeBuilder.rebuild_initrds(). As the second happens if there is no self.kernels, and we can see just a few lines further back that self.kernels is a property based on a call to findkernels(), it's pretty obvious that ultimately what's going wrong here is that findkernels() isn't finding any kernels.
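For context, findkernels() boils down to something like this - a much simplified sketch, not lorax's actual implementation:

import glob
import os

def find_kernels(root="/"):
    # Simplified sketch of what pylorax's findkernels() is doing: look for
    # vmlinuz-<version> files under <root>/boot and return whatever it finds.
    pattern = os.path.join(root, "boot", "vmlinuz-*")
    return sorted(glob.glob(pattern))

# If this returns an empty list, rebuild_initrds() has nothing to work with.
print(find_kernels())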

So...Why Aren't We Finding Any Kernels?

So next, of course, I put my thinking cap on and had a good look at findkernels() - not just at the code itself, but at its history. When something that was working breaks, you're always looking for a change that caused it. There were no recent changes to findkernels(), and I couldn't see anything obviously wrong in its implementation - it's basically looking for files named vmlinuz-(something) in /boot - so it didn't look like the quest was over yet.

So at this point there were kinda two possibilities: findkernels() was broken, or it was working fine but there really weren't any kernels to find. I decided the easiest way to figure out which we were dealing with was just to reproduce the problem and take a look. With any sufficiently complex problem, you can usually get some way into figuring it out just by looking at logs and code and thinking about things logically, but at some point you'll want to try and reproduce the problem in a controlled environment you can poke at - knowing when you're at that point isn't an exact science, but if you find yourself with multiple possibilities and no easy way to decide which you're dealing with, that's often a good indicator.

Doing a test live image build is a bit awkward for Fedora at present, unfortunately, but not really too bad. I've been meaning to write a helper tool for a while, but never got around to it. The way I do it - because I want to replicate as close as possible how the official images are built - is just to try and do exactly what the build system does, only on my local system and by hand. What the build system does is spin up a mock root, install the live image creation tools and a flattened kickstart into it, and run the tool. So, that's what I do! A handy 'trick' here is just to look at the logs from a real live image creation task - like this one - and shadow what they do. Note these logs will get garbage collected at some point, but I can't do anything about that.

The main log to look at is root.log. First, we see that the tools are installed to the mock:

DEBUG util.py:439:  Marking packages as installed by the group:
DEBUG util.py:439:   @livemedia-build glibc-all-langpacks coreutils               lorax-lmc-novirt
DEBUG util.py:439:                    util-linux          selinux-policy-targeted bash            
DEBUG util.py:439:                    shadow-utils

So we'll do just the same with our mock:

mock -r fedora-rawhide-x86_64 --install glibc-all-langpacks coreutils lorax-lmc-novirt util-linux selinux-policy-targeted bash shadow-utils

(Really all you want to do is install the livemedia-build group, but I haven't actually found an equivalent of mock --groupinstall, so I just do that).

We can find the actual image compose command in the task log again, by looking for 'livemedia-creator' - it looks like this:

INFO backend.py:285:  Running in chroot: ['/sbin/livemedia-creator', '--ks', '/chroot_tmpdir/koji-image-f30-build-30594219.ks', '--logfile', '/chroot_tmpdir/lmc-logs/livemedia-out.log', '--no-virt', '--resultdir', '/chroot_tmpdir/lmc', '--project', 'Fedora-KDE-Live', '--make-iso', '--volid', 'Fedora-KDE-Live-rawh-20181101.n.', '--iso-only', '--iso-name', 'Fedora-KDE-Live-x86_64-Rawhide-20181101.n.0.iso', '--releasever', 'Rawhide', '--title', 'Fedora-KDE-Live', '--macboot']

We can easily turn that Python array into a console command by just replacing each occurrence of ', ' with a space:

/sbin/livemedia-creator --ks /chroot_tmpdir/koji-image-f30-build-30594219.ks --logfile /chroot_tmpdir/lmc-logs/livemedia-out.log --no-virt --resultdir /chroot_tmpdir/lmc --project Fedora-KDE-Live --make-iso --volid Fedora-KDE-Live-rawh-20181101.n. --iso-only --iso-name Fedora-KDE-Live-x86_64-Rawhide-20181101.n.0.iso --releasever Rawhide --title Fedora-KDE-Live --macboot

We can see that's using a scratch directory called /chroot_tmpdir, and a kickstart file called koji-image-f30-build-30594219.ks. This kickstart can be found as one of the task assets, so we'll grab it and copy it into the mock's /root for now:

sudo cp koji-image-f30-build-30594219.ks /var/lib/mock/fedora-rawhide-x86_64/root/root

Then finally, we're going to get a shell in the mock root, using the old-style chroot implementation (this is necessary for live image builds to work, the new systemd-based implementation doesn't work yet):

mock -r fedora-rawhide-x86_64 --shell --old-chroot

Inside the mock, we'll create the /chroot_tmpdir scratch dir, copy the kickstart into it, and finally run the image creation:

mkdir -p /chroot_tmpdir
cd /chroot_tmpdir
cp /root/koji-image-f30-build-30594219.ks .
/sbin/livemedia-creator --ks /chroot_tmpdir/koji-image-f30-build-30594219.ks --logfile /chroot_tmpdir/lmc-logs/livemedia-out.log --no-virt --resultdir /chroot_tmpdir/lmc --project Fedora-KDE-Live --make-iso --volid Fedora-KDE-Live-rawh-20181101.n. --iso-only --iso-name Fedora-KDE-Live-x86_64-Rawhide-20181101.n.0.iso --releasever Rawhide --title Fedora-KDE-Live --macboot

And when I did that, it worked away for a while - half an hour or so - and eventually failed exactly like the 'official' build had! So now I had a failure in a controlled environment (my little mock root) to look at. Note that if you're playing along at home, this will only work so long as you can grab that kickstart from Koji, and the 20181101.n.0 compose files are kept around, which will only be for another two weeks or so - after that you won't be able to reproduce this, but you can of course follow the same procedure with a newer Koji task if you want to reproduce a newer official live image build.

Next, I needed to examine the actual filesystem produced by the image build process and see if it really had any kernels in it (remember, that's what we were trying to figure out). This requires a bit of knowledge about how livemedia-creator works, which you'd have had to look up if you didn't know it already: it creates an image file, loopback mounts it, and installs into the loopback-mounted directory. When it fails, it leaves the image file around, and you can just mount it again and poke around. The file will be in the lmc/ subdirectory of the directory where the image build command was run, with a filename like lmc-disk-ctfz98m5.img (the alphanumeric soup bit is random), and we mount it like this:

mount -o loop lmc/lmc-disk-ctfz98m5.img /mnt/sysimage

(the mount point having been created and left around by the tool). Now we can look in /mnt/sysimage/boot, and when I did that...I found that, indeed, it contained no vmlinuz-* files at all! So, I had eliminated the possibility that findkernels() was going wrong: it was doing its job just fine, and it wasn't finding any kernels because...there were no kernels to find.

OK...So Why Aren't There Any Kernels?

So now I had to try and work out: why were there no kernels in the image's /boot? I knew from the logs of earlier, successful image composes that, in successful composes, there really were kernels to find: it wasn't that there had never been kernels, but this had only recently become fatal for some reason. The difference really was that there used to be kernels present when this lorax code ran, but now there weren't.

This led me into a fun new piece of research: figuring out how kernel files get into /boot in a Fedora system at all. You might think - I did - that they're simply perfectly normal packaged files installed by the kernel-core package. But it turns out, they're not! The kernel-core package does own the relevant /boot/vmlinuz-* file, but it doesn't actually directly install it: it uses an RPM directive called %ghost to own it without installing it. So the file must get installed some other way. Here again I cheated a bit with prior knowledge - I knew this overall mechanism existed, though I didn't know until now that it really installed the kernel file itself - but if you don't have that, you could look at the %posttrans script in the kernel package: when a kernel is installed, a command called kernel-install gets run.

I also found out (by diffing the logged packages from the 20181030.n.0 and 20181101.n.0 live image composes) that the kernel itself had been updated in the 20181101.n.0 compose (which was when things started breaking). So once again I had a couple of promising lines of inquiry: the new kernel, and this kernel-install path.

Well, turns out systemd haters in the audience can get very excited, because kernel-install is part of systemd:

[adamw@adam lorax (master %)]$ rpm -qf `which kernel-install`
systemd-udev-239-3.fc29.x86_64

Anyway, I refreshed my memory a bit about what kernel-install does, but it's kinda complicated and it calls out to various other things, including /usr/sbin/new-kernel-pkg (part of grubby) and some plugins in /usr/lib/kernel/install.d (various of which come from systemd, grub2, and dracut). So I think what I did next (my memory gets a bit hazy) was to wonder whether the same problem would affect a regular install from the same packages.

I got the last working Rawhide network install image, and set it to install from the Everything repo from the failed 20181101.n.0 compose. I let that install run, then checked that the /boot/vmlinuz-(whatever) file existed in the installed system...which it did. This sort of let out one theory I had: that the new kernel package had somehow messed something up such that the kernel file never actually got installed properly at all.

So, I got to wondering whether kernel-install really was the thing that put the /boot/vmlinuz-(whatever) file in place (I still wasn't sure at this point), whether it reproducibly failed to do so in the live image creation environment but succeeded in doing so in the regular install environment, and if so, what differed between the two.

I could see the exact kernel-install command just by examining the kernel-core package scripts:

rpm -q --scripts kernel-core | grep kernel-install
/bin/kernel-install add 4.20.0-0.rc0.git7.1.fc30.x86_64 /lib/modules/4.20.0-0.rc0.git7.1.fc30.x86_64/vmlinuz || exit $?

So I tried just deleting /boot/vmlinuz* and re-running that command in the installed system...and sure enough, the file re-appeared! So now I was pretty sure kernel-install was the thing that's supposed to install it. I also tried doing this in my live image creation mock environment:

chroot /mnt/sysimage
/bin/kernel-install add 4.20.0-0.rc0.git7.1.fc30.x86_64 /lib/modules/4.20.0-0.rc0.git7.1.fc30.x86_64/vmlinuz

...and sure enough, it didn't create the /boot/vmlinuz-(foo) file. So now I was narrowing in on the problem: something about the live image creation environment meant that this kernel-install invocation didn't install the file, when it probably should.

OK...So Why Isn't kernel-install Installing Kernels?

At this point I probably could've figured out the problem by reading the source if I'd read it carefully enough, but I decided to carry on with the practical experimentation. I tried running the script through sh -x in each environment, to see exactly what commands were run by the script in each case, and somehow - I forget how - I zeroed in on one of the /usr/lib/kernel/install.d plugin scripts: /usr/lib/kernel/install.d/20-grub.install. This is part of grub2. I think I found these scripts from the sh -x output, and noticed that this one has what looks like the code to actually install the kernel file to /boot. So I made that script run with -x as well, and this finally got me my next big breakthrough. In the installed system I could see that script doing a lot of stuff, but in the live environment it seemed to exit almost as soon as it started:

+ [[ -n '' ]]
+ exit 0

It's not 100% obvious, but I was pretty sure that just meant it was failing in the test right at the start:

if ! [[ $KERNEL_INSTALL_MACHINE_ID ]]; then
    exit 0
fi

So I went and looked up $KERNEL_INSTALL_MACHINE_ID and the references suggested that it's basically tied to /etc/machine-id. So I looked, and, lo and behold, in the regular installed system, that file contained a random alphanumeric string, but in the live image creation environment, the file was empty! This sure looked suspicious.

I read through some references on the file, and found that it's usually meant to get created by a call to /usr/bin/systemd-machine-id-setup in systemd package scripts. So I tried running systemd-machine-id-setup in the live image creation environment, and suddenly the file got a random string, and when I ran kernel-install again, the kernel file did get installed to /boot!

OK...So Why Isn't There A Machine ID?

So now I'd moved along to a new problem again: why was /etc/machine-id empty when the kernel %post script ran, but if I tried to generate it again, it worked? Was the initial generation failing? Was it happening too late? Was it working, but the file getting touched by something else?

Again, I looked at quite a lot of code to try and figure it out - there's code that at least could touch /etc/machine-id in anaconda, for instance, and in lorax - but in the end I decided to go with practical experimentation again. So I did a custom scratch build of systemd to try and find out what happened when the %post script ran. I changed the command from this:

systemd-machine-id-setup &>/dev/null || :

to this:

systemd-machine-id-setup > /tmp/smids.log 2>&1
cat /etc/machine-id >> /tmp/smids.log

pulled that build into a side repository, edited the live kickstart to use that side repository, and re-ran the live image creation. And that hit paydirt in a big way, because in smids.log I saw this:

systemd-machine-id-setup: error while loading shared libraries: libssl.so.1.1: cannot open shared object file: No such file or directory

...and here was the ultimate solution to our mystery! The attempt to set the machine-id in systemd %post was failing because it needs libssl, but it obviously wasn't present yet. libssl is part of openssl-libs, but the systemd spec did not specify that its %post script needs openssl-libs installed. What I surmise had happened was that up until 20181030, some other dependency in some other package happened to mean that dnf would always choose to install openssl-libs before installing systemd, so no-one had ever noticed this missing dependency...but on 20181101, some change to some package caused dnf to start installing systemd before openssl-libs, and suddenly, this problem showed up. So - as is very often the case - once I'd finally managed to nail down the problem, the fix was simple: we just add the missing dependency to systemd, so that openssl-libs will always be installed before systemd's %post is run. With that fix, generation of /etc/machine-id will succeed again, and so the plugin script that installs the kernel file to /boot won't bail early, and so there will be a kernel file in /boot, and lorax won't fail when it tries to regenerate initrds because there aren't any kernels present!

...and so ends another exciting episode of AdamW's Debugging Adventures :)

AdamW's Debugging Adventures: The Immutable Mutable Object

Here's a puzzle for you, Python fans:

[adamw@adam dnf (master %)]$ python3
Python 3.6.5 (default, Apr 23 2018, 22:53:50) 
[GCC 8.0.1 20180410 (Red Hat 8.0.1-0.21)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from dnf.conf import MainConf
>>> mc = MainConf()
>>> print(mc.group_package_types)
['mandatory', 'default', 'conditional']
>>> mc.group_package_types.append('foo')
>>> print(mc.group_package_types)
['mandatory', 'default', 'conditional']
>>> 

Note: if you want to try reproducing this...make sure you use DNF 3. It works as expected with DNF < 3. That's why it just showed up as a problem.

Before I explain what's going on there...let's unpick the problem a bit for non-Pythonistas.

In Python (and in many other languages) some things - objects - are 'mutable', and some are 'immutable'. A 'mutable' object can be changed. An 'immutable' object cannot.

In the Python session above, we create an object, mc, which is an instance of the class MainConf (don't worry if you don't entirely understand that, it's not compulsory). We then examine one attribute of mc: mc.group_package_types. Python tells us that this is ['mandatory', 'default', 'conditional'] - which is a Python list containing the values 'mandatory', 'default' and 'conditional'.

In Python, lists are 'mutable'. That means you can take a list and change it somehow - remove an item from it, add an item to it, re-order it - and it's still the same object. Any existing reference to the object is still valid, and now refers to the changed list.

For comparison, an example of an 'immutable' object is a string. If you do this:

mystring = "here's some text"

you can't then change the actual string object referenced by mystring there. It has no methods that let you change it in any way, and there are no functions that can operate on it and change it. You can do this:

mystring = "here's some text"
mystring = "here's some different text"
mystring = mystring.replace("some", "some more")

and at each step the contents of the string to which the name mystring refers are different - but also, at each step mystring refers to a DIFFERENT object. (That's why the replace() string method returns a new string - it can't mutate the existing string). So if you did this:

mystring = "here's some text"
otherref = mystring
mystring = "here's some different text"

then at the end, otherref still points to the first-created string object and its value is still "here's some text", while mystring points to a new string object and its value is "here's some different text". Let's compare with a similar case using a mutable object:

mylist = [1, 2, 3]
otherref = mylist
mylist.append(4)
print(mylist)
print(otherref)

In this case, when we get to the end, both mylist and otherref are still pointing to the same object, the original object, and both prints will print [1, 2, 3, 4]. No new list object was created at any point after the initial creation of mylist.

So with that understood, take a look back at the original example, and maybe you can see why it's so weird. By all appearances, it looks like we have a pretty simple scenario here: we have an object that has an attribute which is a list, and we just append a value to that list. Then we go look at its value again, and it...hasn't changed at all? But we didn't get any kind of crash, or error, or anything. Our append call returned apparently successfully. It just...didn't seem to change anything. The list is an immutable mutable object!

This is a real problem in real code: it broke the most recent Fedora Rawhide compose. So, obviously, I was pretty keen to figure out what the hell was going on, here! It turns out that it's down to dnf getting clever (arguably over-clever).

Python's a very...flexible language. The key to the problem here turned out to be exactly what happens when we actually look up the group_package_types attribute of mc, the dnf.conf.MainConf instance.

Getting and setting attributes of objects in Python is usually a kind of core operation that you never really think about, you just expect it to work in the 'usual way'. A simple approximation of how it works is that the object has a Python dict (like a 'hash' in some languages - a key/value store, more or less) whose keys are attribute names and whose values are the attribute values. When you ask for an attribute of an object, Python checks if its name is one of the keys in that object's dict, and if it is, returns the value. If it's not, Python raises an error. When you set an attribute, Python stuffs it into the dict.
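A quick demonstration of that approximation - this is ordinary Python, nothing dnf-specific:

class Thing(object):
    pass

t = Thing()
t.colour = "blue"        # stored in the instance's attribute dict...
print(t.__dict__)        # -> {'colour': 'blue'}
print(t.colour)          # ...and looked up from the same place: 'blue'
try:
    t.shape              # not in the dict, so...
except AttributeError as err:
    print(err)           # ...Python raises an error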

But since Python's flexible, it provides some mechanisms to let you mess around with this stuff, if you want to. You can define __setattr__, __getattr__ and/or __getattribute__ methods in a class, and they'll affect this behaviour.

The base object class that almost all Python classes inherit from defines the default __setattr__ and __getattribute__, which work sort of like the approximation I gave above. If you override __setattr__ in a class, then when something tries to set an attribute for an instance of that class, that method will get called instead of the default object.__setattr__. If you override __getattribute__, then that method will get called instead of object.__getattribute__ when something tries to look up an attribute of an instance of that class.

If you leave __getattribute__ alone but define __getattr__, then when something tries to look up an attribute, first the stock object.__getattribute__ will be used to try and look it up, but if that doesn't find it, rather than raising an exception immediately, Python will try your __getattr__ method.
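
That distinction is worth a tiny demo of its own: __getattr__ is only a fallback for lookups that fail, it never even sees attributes the normal lookup can find (the class here is just made up for illustration):

class Fallback(object):
    def __init__(self):
        self.real = "found the normal way"

    def __getattr__(self, name):
        # we only get here if object.__getattribute__ couldn't find 'name'
        return "made up a value for {}".format(name)

fb = Fallback()
print(fb.real)    # normal lookup succeeds, __getattr__ is never called
print(fb.fake)    # normal lookup fails, so __getattr__ gets consulted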

We can actually override __setattr__ and __getattr__ to do a very simplified demonstration of how the default implementation usually works:

#!/bin/python3

class TestClass(object):
    def __init__(self):
        self.__dict__["_attrs"] = {}

    def __setattr__(self, name, value):
        print("Hi, setattr here, just settin' attrs...")
        self._attrs[name] = value

    def __getattr__(self, name):
        print("Hi, getattr here, here's your attr!")
        return self._attrs[name]

tc = TestClass()
tc.foo = [1, 2, 3]
print(tc.foo)
tc.foo.append(4)
print(tc.foo)

Note that __dict__ is the store that object.__getattribute__ uses, so that's why we set up our backing store with self.__dict__["_attrs"] = {} - that ensures that when we look up self._attrs, we will find it via __getattribute__. We can't just do self._attrs = {} because then we wind up in an infinite recursion in our __getattr__.

If you save that and run it, you'll see [1, 2, 3] then [1, 2, 3, 4] (plus the messages that prove our new methods are being used). Our mutable attribute is nice and properly mutable. We can append things to it and everything. Notice that when we append the value, we hit __getattr__ but not __setattr__.
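
As an aside, here's roughly what the recursion trap I mentioned above looks like if you do fall into it - a deliberately broken sketch, so don't copy this bit:

class Broken(object):
    def __init__(self):
        self._attrs = {}                # goes through our __setattr__...

    def __setattr__(self, name, value):
        self._attrs[name] = value       # ...which looks up self._attrs...

    def __getattr__(self, name):
        return self._attrs[name]        # ...which isn't in __dict__, so we land
                                        # here, look up self._attrs again, land
                                        # here again, and so on until Python
                                        # gives up

Broken()   # RecursionError: maximum recursion depth exceeded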

So, how does this manage not to work with dnf config classes? (MainConf is a subclass of BaseConfig, and so are a ton of other config-related classes in dnf - we actually encountered this bug with another subclass, RepoConf). It turns out to be because BaseConfig defines its own __setattr__ and __getattr__ to do some "clever" stuff, and that breaks this!

We don't need to go into what its __setattr__ does in detail, except to note one thing: it doesn't store the values in __dict__, so object.__getattribute__ can never find them. That means that when we look up any attribute on an instance of one of these classes, we always wind up in the class's __getattr__ - the only exception being _config, which is the store the class's __setattr__ and __getattr__ methods themselves use (just like _attrs in our example) and which is created directly in __dict__ in the same way.

Here's the whole of current dnf's BaseConfig.__getattr__:

def __getattr__(self, name):
    option = getattr(self._config, name, None)
    if option is None:
        return None
    try:
        value = option().getValue()
    except Exception as ex:
        return None
    if isinstance(value, cfg.VectorString):
        return list(value)
    if isinstance(value, str):
        return ucd(value)
    return value

There is some more stuff going on in the background here that we don't need to worry about too much (a feature of DNF, I have found, is that it has layers upon layers. It contains multitudes. You usually can't debug anything in DNF without going through at least eight levels of things calling other things that turn out to be yet other things that turn out to be written in C++ just cuz.) In the case of the group_package_types option, and also the option we were actually dealing with in the buggy case (the baseurl option for a repo), the option is basically a list-y type, so we wind up in the if isinstance(value, cfg.VectorString): branch here.

So if you follow it through, when we asked for mc.group_package_types, unlike in the default Python implementation or our simplified example, we didn't just pull an existing list object out from some sekrit store in the mc object. No. We got some kind of object (fact fans: it's a libdnf.conf.OptionStringList - libdnf is the C++ bit I mentioned earlier...) out of the self._config dict that's acting as our sort-of attribute store, and ran its getValue method to get some other sort of object (fact fans: it's a libdnf.conf.VectorString), then we ran list() on that object, and returned that.

The problem is that the thing that gets returned is basically a temporary copy of the 'real' backing value. It's a mutable object - it really is a list! - and we can mutate it...but the next time anyone looks up the same attribute we looked up to get that list, they won't get the same list object we got. This wacky __getattr__ will run through the same indirection maze and return a new listified copy of the backing value. Every time you look up the attribute, it does that. We can mutate the copies all we like, but doing that doesn't actually change the backing value.
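
You don't actually need any of dnf's machinery to reproduce the effect, either. Here's a stripped-down sketch - purely illustrative, this is not dnf's real code, I've just borrowed the attribute name - where __getattr__ hands back a fresh copy of the backing value on every single lookup:

class CopyConf(object):
    def __init__(self):
        # the 'real' backing store, standing in for dnf's self._config
        self.__dict__["_backing"] = {"group_package_types": ["mandatory", "default"]}

    def __setattr__(self, name, value):
        self._backing[name] = value

    def __getattr__(self, name):
        # every lookup returns a brand new list built from the backing value
        return list(self._backing[name])

conf = CopyConf()
conf.group_package_types.append("foo")   # mutates a throwaway copy...
print(conf.group_package_types)          # ...so this is still ['mandatory', 'default']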

Now, it's easy enough to work around this, once you know what's going on. The overridden __setattr__, of course, actually does change the backing store. So if you explicitly try to set an attribute (rather than getting one and mutating it), that does work:

[adamw@adam dnf (master %)]$ python3
Python 3.6.5 (default, Apr 23 2018, 22:53:50) 
[GCC 8.0.1 20180410 (Red Hat 8.0.1-0.21)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from dnf.conf import MainConf
>>> mc = MainConf()
>>> print(mc.group_package_types)
['mandatory', 'default', 'conditional']
>>> mc.group_package_types = mc.group_package_types + ['foo']
>>> print(mc.group_package_types)
['mandatory', 'default', 'conditional', 'foo']
>>> 

That time, it worked because we didn't try to mutate our magical immutable mutable object. We just flat out replaced it with a new list. When we explicitly set the attribute like that, we hit the overridden __setattr__ method, which does the necessary magic to write the new value to the backing store.

But any regular Pythonista who sees that the instance attribute foo.bar is a mutable object is naturally going to assume that they can go ahead and mutate it. That's just...standard. They aren't going to figure they should ignore the fact that it's a mutable object and just replace it every time they want to change it. Mutating the attribute in place is exactly what we did in the code that got broken - the exact code that used to work with DNF 2 but no longer does with DNF 3.

So, that took a few hours to figure out! But I got there in the end. I really just wanted to write up my latest 'interesting' debugging adventure! But if there's a moral to this story...I guess it's "think really hard about whether messing with core behaviour like this is the right way to go about implementing your special thing"?

Oh, and by the way: comments should be working again now! I just disabled the plugin that was interfering with them. So, you know, comment away.

QA: the glamorous bit

Of course, we all know that working in QA is more or less a 24x7 whirl of red carpets and high-end cocktail parties...but today is particularly glamorous! Here's what I'm doing right now:

  1. Build an RPM of a git snapshot of Plymouth
  2. Put it in a temporary repo
  3. Build an installer image containing it
  4. Boot the installer image in a VM, see if it reaches anaconda
  5. Repeat, more or less ad infinitum

I just can't take the excitement!