AdamW's Debugging Adventures: Python 3 Porting 201

Hey folks! Time for another edition of AdamW's Debugging Adventures, wherein I boast about how great I am at fixin' stuff.

Today's episode is about a bug in the client for Fedora's Koji buildsystem which has been biting more and more Fedora maintainers lately. The most obvious thing it affects is task watching. When you do a package build with fedpkg, it will by default "watch" the build task - it'll update you when the various subtasks start and finish, and not quit until the build ultimately succeeds or fails. You can also directly watch tasks with koji watch-task. So this is something Fedora maintainers see a lot. There's also a common workflow where you chain something to the successful completion of a fedpkg build or koji watch-task, which relies on the task watch completing successfully and exiting 0, if the build actually completed.

However, recently, people noticed that this task watching seemed to be just...failing, quite a lot. While the task was still running, it'd suddenly exit, usually showing this message:

ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response',))

After a while, nirik realized that this seemed to be associated with the client going from running under Python 2 by default to running under Python 3 by default. This seems to happen when running on Python 3; it doesn't seem to happen when running on Python 2.

Today I finally decided it had got annoying enough that I'd spend some time trying to track it down.

It's pretty obvious that the message we see relates to an exception, in some way. But ultimately something is catching that exception and printing it out and then exiting (we're not actually getting a traceback, as you do if the exception is ultimately left to reach the interpreter). So my first approach was to dig into the watch-task code from the top down, and try and find something that handles exceptions that looks like it might be the bit we were hitting.

And...I failed! This happens, sometimes. In fact I still haven't found the exact bit of code that prints the message and exits. Sometimes, this just happens. It's OK. Don't give up. Try something else!

So what I did next was kind of a long shot - I just grepped the code for the exception text. I wasn't really expecting this to work, as there's nothing to suggest the actual exception is part of Koji; it's most likely the code doesn't contain any of that text at all. But hey, it's easy to do, so why not? And as it happened, I got lucky and hit paydirt: there happens to be a comment with some of the text from the error we're hitting. And it sure looks like it might be relevant to the problem we're having! The comment itself, and the function it's in, looked so obviously promising that I went ahead and dug a little deeper.

That function, is_conn_error(), is used by only one other thing: this _sendCall() method in the same file. And that seems very interesting, because what it does can be boiled down to: "hey, we got an error! OK, send it to is_conn_error(). If that returns True, then just log a debug message and kick the session. If that returns False, then raise an exception". That behaviour obviously smells a lot like it could be causing our problem. So, I now had a working theory: for some reason, given some particular server behaviour, is_conn_error() returns True on Python 2 but False on Python 3. That causes this _sendCall() to raise an exception instead of just resetting the session and carrying on, and some other code - which we no longer need to find - catches that exception, prints it, and quits.

The next step was to test this theory - because at this point it's only a theory, it could be entirely wrong. I've certainly come up with entirely plausible theories like this before which turned out to be not what was going on at all. So, like a true lazy shortcut enthusiast, I hacked up my local copy of Koji's init.py and sprinkled a bunch of lines like print("HERE 1!") and print("HERE 2!") through the whole of is_conn_error(). Then I just rankoji wait-task commands on random tasks until one failed.

This is fine. When you're just trying to debug the problem you don't need to be super elegant about it. You don't need to do a proper git patch and rebuild the Koji package for your system and use proper logging methods and all the rest of it. Just dumping some print lines in a working copy of the file is just fine, if it works. Just remember to put everything back as it was before later. :)

So, as it happened the god of root causing was on my side today, and it turned out I was right on the money. When one of the koji watch-task commands failed, it hit my HERE 1! and HERE 3! lines right when it died. Those told me we were indeed running through is_conn_error() right before the error, and further, where we were coming out of it. We were entering the if isinstance(e, socket.error) block at the start of the function, and returning False because the exception (e) did appear to be an instance of socket.error, but either did not have an errno attribute, or it was not one of errno.ECONNRESET, errno.ECONNABORTED, or errno.EPIPE.

Obviously, this made me curious as to what the exception actually is, whether it has an errno at all, and if so, what it is. So I threw in a few more debugging lines - to print out type(e), and getattr(e, 'errno', 'foobar'). The result of this was pretty interesting. The second print statement gave me 'foobar', meaning the exception doesn't have an errno attribute at all. And the type of the exception was...requests.exceptions.ConnectionError.

That's a bit curious! You wouldn't necessarily expect requests.exceptions.ConnectionError to be an instance of socket.error, would you? So why are we in a block that only handles instances of socket.error? Also, it's clear the code doesn't expect this, because there's a block later in the function that explicitly handles instances of requests.exceptions.ConnectionError - but because this earlier block that handles socket.error instances always returns, we will never reach that block if requests.exceptions.ConnectionError instances are also instances of socket.error. So there's clearly something screwy going on here.

So of course the next thing to do is...look up socket.error in the Python 2 and Python 3 docs. ANY TIME you're investigating a mysterious Python 3 porting issue, remember this can be useful. Here's the Python 2 socket.error entry, and the Python 3 socket.error entry. And indeed there's a rather significant difference! The Python 2 docs talk about socket.error as an exception that is, well, its own unique thing. However, the Python 3 docs say: "A deprecated alias of OSError." - and even tell us specifically that this changed in Python 3.3: "Changed in version 3.3: Following PEP 3151, this class was made an alias of OSError." Obviously, this is looking an awful lot like one more link in the chain of what's going wrong here.

With a bit of Python knowledge you should be able to figure out what's going on now. Think: if socket.error is now just an alias of OSError, what does if isinstance(e, socket.error) mean, in Python 3.3+ ? It means just the same as if isinstance(e, OSError). And guess what? requests.exception.ConnectionError happens to be a subclass of OSError. Thus, if e is an instance of requests.exception.ConnectionError, isinstance(e, socket.error) will return True in Python 3.3+. In Python 2, it returns False. It's easy to check this in an interactive Python shell or with a test script, to confirm.

Because of this, when we run under Python 3 and e is a requests.exception.ConnectionError, we're unexpectedly entering this block intended for handling socket.error exceptions and - because that block always returns, having the return False line that gets hit if the errno attribute check fails - we're never actually reaching the later block that's actually intended to handle requests.exception.ConnectionError instances at all, we return False before we get there.

There are a few different ways you could fix this - you could just drop the return False short-circuit line in the socket.error block, for instance, or change the ordering so the requests.exception.ConnectionError handling is done first. In the end I sent a pull request which drops the return False, but also drops the if isinstance(e, socket.error) checks (there's another, for nested exceptions, later) entirely. Since socket.error is meant to be deprecated in Python 3.3+ we shouldn't really use it, and we probably don't need to - we can just rely on the errno attribute check alone. Whatever type the exception is, if it has an errno attribute and that attribute is errno.ECONNRESET, errno.ECONNABORTED, or errno.EPIPE, I think we can be pretty sure this is a connection error.

What's the moral of this debugging tale? I guess it's this: when porting from Python 2 to Python 3 (or doing anything similar to that), fixing the things that outright crash or obviously behave wrong is sometimes the easy part. Even if everything seems to be working fine on a simple test, it's certainly possible that subtler issues like this could be lurking in the background, causing unexpected failures or (possibly worse) subtly incorrect behaviour. And of course, that's just another reason to add to the big old "Why To Have A Really Good Test Suite" list!

There's also a 'secondary moral', I guess, and that's this: predicting all the impacts of an interface change like this is hard. Remember the Python 3 docs mentioned a PEP associated with this change? Well, here it is. If you read it, it's clear the proposers actually put quite a lot of effort into thinking about how existing code might be affected by the change, but it looks like they still didn't consider a case like this. They talk about "Careless (or "naïve") code" which "blindly catches any of OSError, IOError, socket.error, mmap.error, WindowsError, select.error without checking the errno attribute", and about "Careful code is defined as code which, when catching any of the above exceptions, examines the errno attribute to determine the actual error condition and takes action depending on it" - and claim that "useful compatibility doesn't alter the behaviour of careful exception-catching code". However, Koji's code here clearly matches their definition of "careful" code - it considers both the exception's type, and the errno attribute, in making decisions - but because it is not just doing except socket.error as e or similar, but catching the exception elsewhere and then passing it to this function and using isinstance, it still gets tripped up by the change.

So...the ur-moral, as always, is: software is hard!

Comments

Ken Dreyer wrote on 2019-01-09 06:34:
Thanks for tracking this down Adam. I remember hitting this bug and having to go back to watching tasks in Python 2. is_conn_error() strikes me as a relic the era before python-requests.