Fedora CoreOS Test Day coming up on 2020-06-08

Mark your calendars for next Monday, folks: 2020-06-08 will be the very first Fedora CoreOS test day! Fedora QA and the CoreOS team are collaborating to bring you this event. We'll be asking participants to test the bleeding-edge next stream of Fedora CoreOS, run some test cases, and also read over the documentation and give feedback.

All the details are on the Test Day page. You can join in on the day on Freenode IRC, we'll be using #fedora-coreos rather than #fedora-test-day for this event. Please come by and help out if you have the time!

Fedora 32 release and Lenovo announcement

It's been a big week in Fedora news: first came the announcement of Lenovo planning to ship laptops preloaded with Fedora, and today Fedora 32 is released. I'm happy this release was again "on time" (at least if you go by our definition and not Phoronix's!), though it was kinda chaotic in the last week or so. We just changed the installer, the partitioning library, the custom partitioning tool, the kernel and the main desktop's display manager - that's all perfectly normal stuff to change a day before you sign off the release, right? I'm pretty confident this is fine!

But seriously folks, I think it turned out to be a pretty good sausage, like most of the ones we've put on the shelves lately. Please do take it for a spin and see how it works for you.

I'm also really happy about the Lenovo announcement. The team working on that has been doing an awful lot of diplomacy and negotiation and cajoling for quite a while now and it's great to see it pay off. The RH Fedora QA team was formally brought into the plan in the last month or two, and Lenovo has kindly provided us with several test laptops which we've distributed around. While the project wasn't public we were clear that we couldn't do anything like making the Fedora 32 release contingent on test results on Lenovo hardware purely for this reason or anything like that, but both our team and Lenovo's have been running tests and we did accept several freeze exceptions to fix bugs like this one, which also affected some Dell systems and maybe others too. Now this project is officially public, it's possible we'll consider adding some official release criteria for the supported systems, or something like that, so look out for proposals on the mailing lists in future.

No more Wordpress!

So I finally managed to bite the bullet and move my blog off Wordpress! I've tried this multiple times over the last few years but always sort of ran out of gas, but this time I finished the job. I'm using Nikola, and with a bit of poking around, managed to convert my entire blog, including existing comments. I don't intend to allow new comments or user registrations, but I wanted to keep the existing ones visible.

More or less all old URLs should be redirected properly. This domain is still set up in a really icky way that I should redo sometime, but that's gonna have to wait till I get some more roundtuits. I didn't bother trying to copy the theme I was using before, I'm just using one of the stock Nikola themes with minor tweaks to display the comments, so the site's appearance is a bit different now, but hey, it's just a blog.

I killed my tt-rss deployment and an old cgit deployment I had forgotten I had running at the same time. Now if I can find some time to switch from Roundcube to Mailpile or something, I can uninstall PHP forever...

Do not upgrade to Fedora 32, and do not adjust your sets

If you were unlucky today, you might have received a notification from GNOME in Fedora 30 or 31 that Fedora 32 is now available for upgrade.

This might have struck you as a bit odd, it being rather early for Fedora 32 to be out and there not being any news about it or anything. And if so, you'd be right! This was an error, and we're very sorry for it.

What happened is that a particular bit of data which GNOME Software (among other things) uses as its source of truth about Fedora releases was updated for the branching of Fedora 32...but by mistake, 32 was added with status 'Active' (meaning 'stable release') rather than 'Under Development'. This fooled poor GNOME Software into thinking a new stable release was available, and telling you about it.

Kamil Paral spotted this very quickly and releng fixed it right away, but if your GNOME Software happened to check for updates during the few minutes the incorrect data was up, it will have cached it, and you'll see the incorrect notification for a while.

Please DO NOT upgrade to Fedora 32 yet. It is under heavy development and is very much not ready for normal use. We're very sorry for the incorrect notification and we hope it didn't cause too much disruption.

Using Zuul CI with Pagure.io

I attended Devconf.cz again this year - I'll try and post a full blog post on that soon. One of the most interesting talks, though, was CI/CD for Fedora packaging with Zuul, where Fabien Boucher and Matthieu Huin introduced the work they've done to integrate a specific Zuul instance (part of the Software Factory effort) with the Pagure instance Fedora uses for packages and also with Pagure.io, the general-purpose Pagure instance that many Fedora groups use to host projects, including us in QA.

They've done a lot of work to make it as simple as possible to hook up a project in either Pagure instance to run CI via Zuul, and it looked pretty cool, so I thought I'd try it on one of our projects and see how it compares to other options, like the Jenkins-based Pagure CI.

I wound up more or less following the instructions on this Wiki page, but it does not give you an example of a minimal framework in the project repository itself to actually run some checks. However, after I submitted the pull request for fedora-project-config as explained on the wiki page, Tristan Cacqueray was kind enough to send me this as a pull request for my project repository.

So, all that was needed to get a kind of 'hello world' process running was:

  1. Add the appropriate web hook in the project options
  2. Add the 'zuul' user as a committer on the project in the project options
  3. Get a pull request merged to fedora-project-config to add the desired project
  4. Add a basic Zuul config which runs a single job

After that, the next step was to have it run useful checks. I set the project up such that all the appropriate checks could be run just by calling tox (which is a great test runner for Python projects) - see the tox configuration. Then, with a bit more help from Tristan, I was able to tweak the Zuul config to run it successfully. This mainly required a couple of things:

  1. Adding nodeset: fedora-31-vm to the Zuul config - this makes the CI job run on a Fedora 31 VM rather than the default CentOS 7 VM (CentOS 7's tox is too old for a modern Python 3 project)
  2. Modifying the job configuration to ensure tox is installed (there's a canned role for this, called ensure-tox) and also all available Python interpreters (using the package module)

This was all pretty small and easy stuff, and we had the whole thing up and running in a few hours. Now it all works great, so whenever a pull request is submitted for the project, the tests are automatically run and the results shown on the pull request.

You can set up more complex workflows where Zuul takes over merging of pull requests entirely - an admin posts a comment indicating a PR is ready to merge, whereupon Zuul will retest it and then merge it automatically if the test succeeds. This can also be used to merge series of PRs together, with proper testing. But for my small project, this simple integration is enough so far.

It's been a positive experience working with the system so far, and I'd encourage others to try it for their packages and Pagure projects!

Uptime

OK, so that was two days longer than I was expecting! Sorry for the extended downtime, folks, especially Fedora folks. It was rather beyond my control. But now I'm (just barely) back, through the single working cable outlet in the house and a powerline ethernet connection to the router, at least until the cable co can come and fix all the other outlets!

Downtime

happyassassin.net - this blog, all email addresses, and anything else hosted there (which should only matter to me...) - will be down for about the next day and a half or so, just For The Record. I'm moving house again, and am now at the whims of the fickle gods of Shaw Cable. Make all homage unto them.

In praise of WebAuthn

tl;dr: I just got a Yubikey 5 and set it up on a bunch of things. You should too, because WebAuthn is awesome.

Now the long version!

Two-factor authentication has been a thing for a while now. You're probably familiar with it in various forms. There's SMS-based 2FA, commonly used by banks, where they text you a code you have to re-type when logging in. Then there are token/one-time-password based systems where you can use a hardware key like a Yubikey or a software authenticator like Google Authenticator or FreeOTP to generate and enter a one-time password when logging into a system.

If you're like I was yesterday, maybe you've got two Yubikeys on your keyring and another in a drawer somewhere and you have to remember which four systems you have set up on which slots on which key, and you've got FreeOTP as a backup.

Maybe you've also kinda heard about "U2F" and had the vague idea that it sounded neat. And also maybe you've read some stuff about "WebAuthn" recently and thought it sounded maybe cool but also maybe confusing and what's going on and maybe this isn't the most important thing you could be figuring out today?

Well, prodded by a couple of mailing list threads I did figure it out a bit, and here's the thing: WebAuthn is spreading, and it's awesome. If you are just a person who wants to log into stuff - it's 2FA done way better.

Here's the cool stuff:

  • You can use one key ('authenticator') to log into as many different WebAuthn-supporting sites as you want (and this is secure, they're not all sharing the same seed or anything)
  • You can register multiple authenticators per site
  • An authenticator can be a hardware key (mainly a Yubikey 5, at this point, but the Solokey is supposed to be a fully open-source WebAuthn-supporting key, only available to backers so far), but you can also use a phone or laptop with a fingerprint reader or facial ID system
  • It works on Linux. Really easily. It works on Firefox (not just Chrome). It works on Firefox on Android. Yeah, all the stuff you kinda automatically assume is going to be a pain in the ass...isn't! It actually fricking works!
  • WebAuthn-compatible keys can still support other systems too...specifically, you can get a Yubikey and use it for WebAuthn but it also still has two OTP slots, and no you don't have to do something stupid to pick which system you're using, it all just magically works somehow, I don't know how and I don't care. The Yubikey and Firefox are also backwards-compatible with U2F, so sites that implemented U2F but didn't update to WebAuthn yet work fine

Seriously, it's awesome. And it actually works, like, right now, really well. On useful sites! Try it! Github supports it, for instance. Go to your Github account, go to the Settings page, go to Security, enable 2FA if you don't have it enabled already, and hit edit on 'Security keys'. Then click 'Register new security key'. Give it a name (like 'phone' or 'yubikey #1' or whatever). If you're using a Yubikey, plug it in and hit the button. If you're using a phone with a fingerprint sensor or facial ID, there'll be an option for 'use this device with a fingerprint' or something like that. Pick it, and touch the sensor or show it your face. And that's it. You're done. Then when you login you just do the same thing (plug in, push button, touch sensor, or show face) and you're in. It's like the fricking future or something.

You can even use a Yubikey via NFC to log in with Firefox on Android (and I assume Chrome too, but I didn't try that). Yeah, I tried it, it worked. First time. (Once I figured out where the NFC sensor was, anyway). You can even apparently use your phone connected via Bluetooth to login on a computer, though I didn't try that yet - the browser should let you pick the Bluetooth-connected phone as the authenticator, then the phone will ask you for your fingerprint or face.

It's all so much frickin' better than re-typing codes from text messages or remembering Yubikey slot numbers. I really did not realize it was gonna be this nice. It is also more secure than OTP-based systems and much more secure than SMS-based systems, which is great, but even if it wasn't it's just nicer. I really hope W3C and Mozilla and Google and Apple and whoever go out and sell it that way as hard as they can.

So far I've set up my Google account (I think Google is still technically using U2F not WebAuthn, but as far as the user experience went it didn't make any difference), Github, Bitwarden (which is a great open-source password management service), and Gandi (I use them for domain registration and DNS, they're great for that), and now I'm busy writing to a ton of other sites to demand they get on the bandwagon already. I used the OTP slots for Fedora and Red Hat internal systems (neither supports WebAuthn yet, unfortunately - one limitation of WebAuthn is that it is fairly 'web-y', it's less suited to systems where you need to authenticate in non-web-protocol scenarios, so FAS and RH auth can't just switch over to it that easily). And my three pre-U2F Yubikeys are wiped and on their way to hardware heaven...

AdamW's Debugging Adventures: "dnf is locked by another application"

Gather round the fire, kids, it's time for another Debugging Adventure! These are posts where I write up the process of diagnosing the root cause of a bug, where it turned out to be interesting (to me, anyway...)

This case - Bugzilla #1750575 - involved dnfdragora, the package management tool used on Fedora Xfce, which is a release-blocking environment for the ARM architecture. It was a pretty easy bug to reproduce: any time you updated a package, the update would work, but then dnfdragora would show an error "DNF is locked by another process. dnfdragora will exit.", and immediately exit.

The bug sat around on the blocker list for a while; Daniel Mach (a DNF developer) looked into it a bit but didn't have time to figure it out all the way. So I got tired of waiting for someone else to do it, and decided to work it out myself.

Where's the error coming from?

As a starting point, I had a nice error message - so the obvious thing to do is figure out where that message comes from. The text appears in a couple of places in dnfdragora - in an exception handler and also in a method for setting up a connection to dnfdaemon. So, if we didn't already know (I happened to) this would be the point at which we'd realize that dnfdragora is a frontend app to a backend - dnfdaemon - which does the heavy lifting.

So, to figure out in more detail how we were getting to one of these two points, I hacked both the points where that error is logged. Both of them read logger.critical(errmsg). I changed this to logger.exception(errmsg). logger.exception is a very handy feature of Python's logging module which logs whatever message you specify, plus a traceback to the current state, just like the traceback you get if the app actually crashes. So by doing that, the dnfdragora log (it logs to a file dnfdragora.log in the directory you run it from) gave us a traceback showing how we got to the error:

2019-10-14 17:53:29,436 <a href="ERROR">dnfdragora</a> dnfdaemon client error: g-io-error-quark: GDBus.Error:org.baseurl.DnfSystem.LockedError: dnf is locked by another application (36)
Traceback (most recent call last):
File "/usr/bin/dnfdragora", line 85, in <module>
main_gui.handleevent()
File "/usr/lib/python3.7/site-packages/dnfdragora/ui.py", line 1273, in handleevent
if not self._searchPackages(filter, True) :
File "/usr/lib/python3.7/site-packages/dnfdragora/ui.py", line 949, in _searchPackages
packages = self.backend.search(fields, strings, self.match_all, self.newest_only, tags )
File "/usr/lib/python3.7/site-packages/dnfdragora/misc.py", line 135, in newFunc
rc = func(*args, **kwargs)
File "/usr/lib/python3.7/site-packages/dnfdragora/dnf_backend.py", line 464, in search
newest_only, tags)
File "/usr/lib/python3.7/site-packages/dnfdaemon/client/__init__.py", line 508, in Search
fields, keys, attrs, match_all, newest_only, tags))
File "/usr/lib/python3.7/site-packages/dnfdaemon/client/__init__.py", line 293, in _run_dbus_async
result = self._get_result(data)
File "/usr/lib/python3.7/site-packages/dnfdaemon/client/__init__.py", line 277, in _get_result
self._handle_dbus_error(user_data['error'])
File "/usr/lib/python3.7/site-packages/dnfdaemon/client/__init__.py", line 250, in _handle_dbus_error
raise DaemonError(str(err))
dnfdaemon.client.DaemonError: g-io-error-quark: GDBus.Error:org.baseurl.DnfSystem.LockedError: dnf is locked by another application (36)</module>

So, this tells us quite a bit of stuff. We know we're crashing in some sort of 'search' operation, and dbus seems to be involved. We can also see a bit more of the architecture here. Note how we have dnfdragora/dnf_backend.py and dnfdaemon/client/init.py included in the trace, even though we're only in the dnfdragora executable here (dnfdaemon is a separate process). Looking at that and then looking at those files a bit, it's quite easy to see that the dnfdaemon Python library provides a sort of framework for a client class called (oddly enough) DnfDaemonBase which the actual client - dnfdragora in our case - is expected to subclass and flesh out. dnfdragora does this in a class called DnfRootBackend, which inherits from both dnfdragora.backend.Backend (a sort of abstraction layer for dnfdragora to have multiple of these backends, though at present it only actually has this one) and dnfdaemon.client.Client, which is just a small extension to DnfDaemonBase that adds some dbus signal handling.

So now we know more about the design we're dealing with, and we can also see that we're trying to do some sort of search operation which looks like it works by the client class communicating with the actual dnfdaemon server process via dbus, only we're hitting some kind of error in that process, and interpreting it as 'dnf is locked by another application'. If we dig a little deeper, we can figure out a bit more. We have to read through all of the backtrace frames and examine the functions, but ultimately we can figure out that DnfRootBackend.Search() is wrapped by dnfdragora.misc.ExceptionHandler, which handles dnfdaemon.client.DaemonError exceptions - like the one that's ultimately getting raised here! - by calling the base class's own exception_handler() on them...and for us, that's BaseDragora.exception_handler, one of the two places we found earlier that ultimately produces this "DNF is locked by another process. dnfdragora will exit" text. We also now have two indications (the dbus error itself, and the code in exception_handler() that the error we're dealing with is "LockedError".

A misleading error...

At this point, I went looking for the text LockedError, and found it in two files in dnfdaemon that are kinda variants on each other - daemon/dnfdaemon-session.py and daemon/dnfdaemon-system.py. I didn't actually know offhand which of the two is used in our case, but it doesn't really matter, because the codepath to LockedError is the same in both. There's a function called check_lock() which checks that self._lock == sender, and if it doesn't, raises LockedError. That sure looks like where we're at.

So at this point I did a bit of poking around into how self._lock gets set and unset in the daemon. It turns out to be pretty simple. The daemon is basically implemented as a class with a bunch of methods that are wrapped by @dbus.service.method, which makes them accessible as DBus methods. (One of them is Search(), and we can see that the client class's own Search() basically just calls that). There are also methods called Lock() and Unlock(), which - not surprisingly - set and release this lock, by setting the daemon class' self._lock to be either an identifier for the DBus client or None, respectively. And when the daemon is first initialized, the value is set to None.

At this point, I realized that the error we're dealing with here is actually a lie in two important ways:

  1. The message claims that the problem is the lock being held "by another application", but that's not what check_lock() checks, really. It passes only if the caller holds the lock. It does fail if the lock is held "by another application", but it also fails if the lock is not held at all. Given all the code we looked at so far, we can't actually trust the message's assertion that something else is holding the lock. It is also possible that the lock is not held at all.
  2. The message suggests that the lock in question is a lock on dnf itself. I know dnf/libdnf do have locking, so up to now I'd been assuming we were actually dealing with the locking in dnf itself. But at this point I realized we weren't. The dnfdaemon lock code we just looked at doesn't actually call or wrap dnf's own locking code in any way. This lock we're dealing with is entirely internal to dnfdaemon. It's really a lock on the dnfdaemon instance itself.

So, at this point I started thinking of the error as being "dnfdaemon is either locked by another DBus client, or not locked at all".

So what's going on with this lock anyway?

My next step, now I understood the locking process we're dealing with, was to stick some logging into it. I added log lines to the Lock() and Unlock() methods, and I also made check_lock() log what sender and self._lock were set to before returning. Because it sets self._lock to None, I also added a log line to the daemon's init that just records that we're in it. That got me some more useful information:

2019-10-14 18:53:03.397784 XXX In DnfDaemon.init now!
2019-10-14 18:53:03.402804 XXX LOCK: sender is :1.1835
2019-10-14 18:53:03.407524 XXX CHECK LOCK: sender is :1.1835
XXX CHECK LOCK: self._lock is :1.1835
2019-10-14 18:53:07.556499 XXX CHECK LOCK: sender is :1.1835
XXX CHECK LOCK: self._lock is :1.1835
[...snip a bunch more calls to check_lock where the sender is the same...]
2019-10-14 18:53:13.560828 XXX CHECK LOCK: sender is :1.1835
XXX CHECK LOCK: self._lock is :1.1835
2019-10-14 18:53:13.560941 XXX CHECK LOCK: sender is :1.1835
XXX CHECK LOCK: self._lock is :1.1835
2019-10-14 18:53:16.513900 XXX In DnfDaemon.init now!
2019-10-14 18:53:16.516724 XXX CHECK LOCK: sender is :1.1835
XXX CHECK LOCK: self._lock is None

so we could see that when we started dnfdragora, dnfdaemon started up and dnfdragora locked it almost immediately, then throughout the whole process of reproducing the bug - run dnfdragora, search for a package to be updated, mark it for updating, run the transaction, wait for the error - there were several instances of DBus method calls where everything worked fine (we see check_lock() being called and finding sender and self._lock set to the same value, the identifier for dnfdragora), but then suddenly we see the daemon's init running again for some reason, not being locked, and then a check_lock() call that fails because the daemon instance's self._lock is None.

After a couple of minutes, I guessed what was going on here, and the daemon's service logs confirmed it - dnfdaemon was crashing and automatically restarting. The first attempt to invoke a DBus method after the crash and restart fails, because dnfdragora has not locked this new instance of the daemon (it has no idea it just crashed and restarted), so check_lock() fails. So as soon as a DBus method invocation is attempted after the dnfdaemon crash, dnfdragora errors out with the confusing "dnf is locked by another process" error.

The crash was already mentioned in the bug report, but until now the exact interaction between the crash and the error had not been worked out - we just knew the daemon crashed and the app errored out, but we didn't really know what order those things happened in or how they related to each other.

OK then...why is dnfdaemon crashing?

So, the question now became: why is dnfdaemon crashing? Well, the backtrace we had didn't tell us a lot; really it only told us that something was going wrong in libdbus, which we could also tell from the dnfdaemon service log:

Oct 14 18:53:15 adam.happyassassin.net dnfdaemon-system[226042]: dbus[226042]: arguments to dbus_connection_unref() were incorrect, assertion "connection->generation == _dbus_current_generation" failed in file ../../dbus/dbus-connection.c line 2823.
Oct 14 18:53:15 adam.happyassassin.net dnfdaemon-system[226042]: This is normally a bug in some application using the D-Bus library.
Oct 14 18:53:15 adam.happyassassin.net dnfdaemon-system[226042]:   D-Bus not built with -rdynamic so unable to print a backtrace

that last line looked like a cue, so of course, off I went to figure out how to build DBus with -rdynamic. A bit of Googling told me - thanks "the3dfxdude"! - that the trick is to compile with --enable-asserts. So I did that and reproduced the bug again, and got a bit of a better backtrace. It's a long one, but by picking through it carefully I could spot - in frame #17 - the actual point at which the problem happened, which was in dnfdaemon.server.DnfDaemonBase.run_transaction(). (Note, this is a different DnfDaemonBase class from dnfdaemon.client.DnfDaemonBase; I don't know why they have the same name, that's just confusing).

So, the daemon's crashing on this self.TransactionEvent('end-run', NONE) call. I poked into what that does a bit, and found a design here that kinda mirrors what happens on the client side: this DnfDaemonBase, like the other one, is a framework for a full daemon implementation, and it's subclassed by a DnfDaemon class here. That class defines a TransactionEvent method that emits a DBus signal. So...we're crashing when trying to emit a dbus signal. That all adds up with the backtrace going through libdbus and all. But, why are we crashing?

At this point I tried to make a small reproducer (which basically just set up a DnfDaemon instance and called self.TransactionEvent in the same way, I think) but that didn't work - I didn't know why at the time, but figured it out later. Continuing to trace it out through code wouldn't be that easy because now we're in DBus, which I know from experience is a big complex codebase that's not that easy to just reason your way through. We had the actual DBus error to work from too - "arguments to dbus_connection_unref() were incorrect, assertion "connection->generation == _dbus_current_generation" failed" - and I looked into that a bit, but there were no really helpful leads there (I got a bit more understanding about what the error means exactly, but it didn't help me understand why it was happening at all).

Time for the old standby...

So, being a bit stuck, I fell back on the most trusty standby: trial and error! Well, also a bit of logic. It did occur to me that the dbus broker is itself a long-running daemon that other things can talk to. So I started just wondering if something was interfering with dnfdaemon's connection with the dbus broker, somehow. This was in my head as I poked around at stuff - wherever I wound up looking, I was looking for stuff that involved dbus.

But to figure out where to look, I just started hacking up dnfdaemon a bit. Now this first part is probably pure intuition, but that self._reset_base() call on the line right before the self.TransactionEvent call that crashes bugged me. It's probably just long experience telling me that anything with "reset" or "refresh" in the name is bad news. :P So I thought, hey, what happens if we move it?

I stuck some logging lines into this run_transaction so I knew where we got to before we crashed - this is a great dumb trick, btw, just stick lines like self.logger('XXX HERE 1'), self.logger('XXX HERE 2') etc. between every significant line in the thing you're debugging, and grep the logs for "XXX" - and moved the self._reset_base() call down under the self.TransactionEvent call...and found that when I did that, we got further, the self.TransactionEvent call worked and we crashed the next time something else tried to emit a DBus signal. I also tried commenting out the self._reset_base() call entirely, and found that now we would only crash the next time a DBus signal was emitted after a subsequent call to the Unlock() method, which is another method that calls self._reset_base(). So, at this point I was pretty confident in this description: "dnfdaemon is crashing on the first interaction with DBus after self._reset_base() is called".

So my next step was to break down what _reset_base() was actually doing. Turns out all of the detail is in the DnfDaemonBase skeleton server class: it has a self._base which is a dnf.base.Base() instance, and that method just calls that instance's close() method and sets self._base to None. So off I went into dnf code to see what dnf.base.Base.close() does. Turns out it basically does two things: it calls self._finalize_base() and then calls self.reset(True, True, True).

Looking at the code it wasn't immediately obvious which of these would be the culprit, so it was all aboard the trial and error train again! I changed the call to self._reset_base() in the daemon to self._base.reset(True, True, True)...and the bug stopped happening! So that told me the problem was in the call to _finalize_base(), not the call to reset(). So I dug into what _finalize_base() does and kinda repeated this process - I kept drilling down through layers and splitting up what things did into individual pieces, and doing subsets of those pieces at a time to try and find the "smallest" thing I could which would cause the bug.

To take a short aside...this is what I really like about these kinds of debugging odysseys. It's like being a detective, only ultimately you know that there's a definite reason for what's happening and there's always some way you can get closer to it. If you have enough patience there's always a next step you can take that will get you a little bit closer to figuring out what's going on. You just have to keep working through the little steps until you finally get there.

Eventually I lit upon this bit of dnf.rpm.transaction.TransactionWrapper.close(). That was the key, as close as I could get to it: reducing the daemon's self._reset_base() call to just self._base._priv_ts.ts = None (which is what that line does) was enough to cause the bug. That was the one thing out of all the things that self._reset_base() does which caused the problem.

So, of course, I took a look at what this ts thing was. Turns out it's an instance of rpm.TransactionSet, from RPM's Python library. So, at some point, we're setting up an instance of rpm.TransactionSet, and at this point we're dropping our reference to it, which - point to ponder - might trigger some kind of cleanup on it.

Remember how I was looking for things that deal with dbus? Well, that turned out to bear fruit at this point...because what I did next was simply to go to my git checkout of rpm and grep it for 'dbus'. And lo and behold...this showed up.

Turns out RPM has plugins (TIL!), and in particular, it has this one, which talks to dbus. (What it actually does is try to inhibit systemd from suspending or shutting down the system while a package transaction is happening). And this plugin has a cleanup function which calls something called dbus_shutdown() - aha!

This was enough to get me pretty suspicious. So I checked my system and, indeed, I had a package rpm-plugin-systemd-inhibit installed. I poked at dependencies a bit and found that python3-dnf recommends that package, which means it'll basically be installed on nearly all Fedora installs. Still looking like a prime suspect. So, it was easy enough to check: I put the code back to a state where the crash happened, uninstalled the package, and tried again...and bingo! The crash stopped happening.

So at this point the case was more or less closed. I just had to do a bit of confirming and tidying up. I checked and it turned out that indeed this call to dbus_shutdown() had been added quite recently, which tied in with the bug not showing up earlier. I looked up the documentation for dbus_shutdown() which confirmed that it's a bit of a big cannon which certainly could cause a problem like this:

"Frees all memory allocated internally by libdbus and reverses the effects of dbus_threads_init().

libdbus keeps internal global variables, for example caches and thread locks, and it can be useful to free these internal data structures.

...

You can't continue to use any D-Bus objects, such as connections, that were allocated prior to dbus_shutdown(). You can, however, start over; call dbus_threads_init() again, create new connections, and so forth."

and then I did a scratch build of rpm with the commit reverted, tested, and found that indeed, it solved the problem. So, we finally had our culprit: when the rpm.TransactionSet instance went out of scope, it got cleaned up, and that resulted in this plugin's cleanup function getting called, and dbus_shutdown() happening. The RPM devs had intended that call to clean up the RPM plugin's DBus handles, but this is all happening in a single process, so the call also cleaned up the DBus handles used by dnfdaemon itself, and that was enough (as the docs suggest) to cause any further attempts to communicate with DBus in dnfdaemon code to blow up and crash the daemon.

So, that's how you get from dnfdragora claiming that DNF is locked by another process to a stray RPM plugin crashing dnfdaemon on a DBus interaction!