Hey folks! Time for another edition of AdamW's Debugging Adventures, wherein I boast about how great I am at fixin' stuff.
Today's episode is about a bug in the client for Fedora's Koji buildsystem which has been biting more and more Fedora maintainers lately. The most obvious thing it affects is task watching. When you do a package build with
fedpkg, it will by default "watch" the build task - it'll update you when the various subtasks start and finish, and not quit until the build ultimately succeeds or fails. You can also directly watch tasks with
koji watch-task. So this is something Fedora maintainers see a lot. There's also a common workflow where you chain something to the successful completion of a
fedpkg build or
koji watch-task, which relies on the task watch completing successfully and exiting 0, if the build actually completed.
However, recently, people noticed that this task watching seemed to be just...failing, quite a lot. While the task was still running, it'd suddenly exit, usually showing this message:
ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response',))
After a while, nirik realized that this seemed to be associated with the client going from running under Python 2 by default to running under Python 3 by default. This seems to happen when running on Python 3; it doesn't seem to happen when running on Python 2.
Today I finally decided it had got annoying enough that I'd spend some time trying to track it down.
It's pretty obvious that the message we see relates to an exception, in some way. But ultimately something is catching that exception and printing it out and then exiting (we're not actually getting a traceback, as you do if the exception is ultimately left to reach the interpreter). So my first approach was to dig into the
watch-task code from the top down, and try and find something that handles exceptions that looks like it might be the bit we were hitting.
And...I failed! This happens, sometimes. In fact I still haven't found the exact bit of code that prints the message and exits. Sometimes, this just happens. It's OK. Don't give up. Try something else!
So what I did next was kind of a long shot - I just grepped the code for the exception text. I wasn't really expecting this to work, as there's nothing to suggest the actual exception is part of Koji; it's most likely the code doesn't contain any of that text at all. But hey, it's easy to do, so why not? And as it happened, I got lucky and hit paydirt: there happens to be a comment with some of the text from the error we're hitting. And it sure looks like it might be relevant to the problem we're having! The comment itself, and the function it's in, looked so obviously promising that I went ahead and dug a little deeper.
is_conn_error(), is used by only one other thing: this
_sendCall() method in the same file. And that seems very interesting, because what it does can be boiled down to: "hey, we got an error! OK, send it to
is_conn_error(). If that returns True, then just log a debug message and kick the session. If that returns False, then raise an exception". That behaviour obviously smells a lot like it could be causing our problem. So, I now had a working theory: for some reason, given some particular server behaviour,
True on Python 2 but
False on Python 3. That causes this
_sendCall() to raise an exception instead of just resetting the session and carrying on, and some other code - which we no longer need to find - catches that exception, prints it, and quits.
The next step was to test this theory - because at this point it's only a theory, it could be entirely wrong. I've certainly come up with entirely plausible theories like this before which turned out to be not what was going on at all. So, like a true lazy shortcut enthusiast, I hacked up my local copy of Koji's
init.py and sprinkled a bunch of lines like
print("HERE 1!") and
print("HERE 2!") through the whole of
is_conn_error(). Then I just ran
koji wait-task commands on random tasks until one failed.
This is fine. When you're just trying to debug the problem you don't need to be super elegant about it. You don't need to do a proper git patch and rebuild the Koji package for your system and use proper logging methods and all the rest of it. Just dumping some
So, as it happened the god of root causing was on my side today, and it turned out I was right on the money. When one of the
koji watch-task commands failed, it hit my
HERE 1! and
HERE 3! lines right when it died. Those told me we were indeed running through
is_conn_error() right before the error, and further, where we were coming out of it. We were entering the
if isinstance(e, socket.error) block at the start of the function, and returning
False because the exception (
e) did appear to be an instance of
socket.error, but either did not have an
errno attribute, or it was not one of
Obviously, this made me curious as to what the exception actually is, whether it has an
errno at all, and if so, what it is. So I threw in a few more debugging lines - to print out
getattr(e, 'errno', 'foobar'). The result of this was pretty interesting. The second print statement gave me 'foobar', meaning the exception doesn't have an
errno attribute at all. And the type of the exception was...
That's a bit curious! You wouldn't necessarily expect
requests.exceptions.ConnectionError to be an instance of
socket.error, would you? So why are we in a block that only handles instances of
socket.error? Also, it's clear the code doesn't expect this, because there's a block later in the function that explicitly handles instances of
requests.exceptions.ConnectionError - but because this earlier block that handles
socket.error instances always returns, we will never reach that block if
requests.exceptions.ConnectionError instances are also instances of
socket.error. So there's clearly something screwy going on here.
So of course the next thing to do is...look up
socket.error in the Python 2 and Python 3 docs. ANY TIME you're investigating a mysterious Python 3 porting issue, remember this can be useful. Here's the Python 2
socket.error entry, and the Python 3
socket.error entry. And indeed there's a rather significant difference! The Python 2 docs talk about
socket.error as an exception that is, well, its own unique thing. However, the Python 3 docs say: "A deprecated alias of OSError." - and even tell us specifically that this changed in Python 3.3: "Changed in version 3.3: Following PEP 3151, this class was made an alias of OSError." Obviously, this is looking an awful lot like one more link in the chain of what's going wrong here.
With a bit of Python knowledge you should be able to figure out what's going on now. Think: if
socket.error is now just an alias of
OSError, what does
if isinstance(e, socket.error) mean, in Python 3.3+ ? It means just the same as
if isinstance(e, OSError). And guess what?
requests.exception.ConnectionError happens to be a subclass of
OSError. Thus, if
e is an instance of
isinstance(e, socket.error) will return
True in Python 3.3+. In Python 2, it returns
False. It's easy to check this in an interactive Python shell or with a test script, to confirm.
Because of this, when we run under Python 3 and
e is a
requests.exception.ConnectionError, we're unexpectedly entering this block intended for handling
socket.error exceptions and - because that block always returns, having the
return False line that gets hit if the
errno attribute check fails - we're never actually reaching the later block that's actually intended to handle
requests.exception.ConnectionError instances at all, we return
False before we get there.
There are a few different ways you could fix this - you could just drop the
return False short-circuit line in the
socket.error block, for instance, or change the ordering so the
requests.exception.ConnectionError handling is done first. In the end I sent a pull request which drops the
return False, but also drops the
if isinstance(e, socket.error) checks (there's another, for nested exceptions, later) entirely. Since
socket.error is meant to be deprecated in Python 3.3+ we shouldn't really use it, and we probably don't need to - we can just rely on the
errno attribute check alone. Whatever type the exception is, if it has an
errno attribute and that attribute is
errno.EPIPE, I think we can be pretty sure this is a connection error.
What's the moral of this debugging tale? I guess it's this: when porting from Python 2 to Python 3 (or doing anything similar to that), fixing the things that outright crash or obviously behave wrong is sometimes the easy part. Even if everything seems to be working fine on a simple test, it's certainly possible that subtler issues like this could be lurking in the background, causing unexpected failures or (possibly worse) subtly incorrect behaviour. And of course, that's just another reason to add to the big old "Why To Have A Really Good Test Suite" list!
There's also a 'secondary moral', I guess, and that's this: predicting all the impacts of an interface change like this is hard. Remember the Python 3 docs mentioned a PEP associated with this change? Well, here it is. If you read it, it's clear the proposers actually put quite a lot of effort into thinking about how existing code might be affected by the change, but it looks like they still didn't consider a case like this. They talk about "Careless (or "naïve") code" which "blindly catches any of OSError, IOError, socket.error, mmap.error, WindowsError, select.error without checking the errno attribute", and about "Careful code is defined as code which, when catching any of the above exceptions, examines the errno attribute to determine the actual error condition and takes action depending on it" - and claim that "useful compatibility doesn't alter the behaviour of careful exception-catching code". However, Koji's code here clearly matches their definition of "careful" code - it considers both the exception's type, and the
errno attribute, in making decisions - but because it is not just doing
except socket.error as e or similar, but catching the exception elsewhere and then passing it to this function and using
isinstance, it still gets tripped up by the change.
So...the ur-moral, as always, is: software is hard!