Anaconda Test Day Tuesday, and debugging

Sorry (again) for the late notice, but Tuesday 2013-05-21 - tomorrow or today, depending - is Anaconda Test Day!

Yeah, in a way it's anaconda test day every day in Fedoraland, as we focus a lot of our day-to-day testing on the installer. But the folks running this Test Day have come up with some good extended test cases that go beyond what we cover in the release validation testing, so we should be able to find some more interesting bugs (it's always fun).

We'll be testing the current Beta candidate build, Beta RC2. As always, full instructions for testing are available in the Test Day page, and there will be QA folks on hand in #fedora-test-day on Freenode IRC for discussion and any help you need in testing or debugging. If you're not sure what IRC is or how to use it, we have instructions here, and you can also simply click here to join the chat through a Web front end.

If you haven't tried out the improved anaconda in Fedora 19 yet, this event is a great excuse to take a look at it: it is much more polished and stable now than it was in Fedora 18, so come try it for yourself and help us to improve it further before the final release of Fedora 19!

I've been really enjoying work on Fedora 19 lately. After the somewhat hellish Fedora 18 cycle where we worked more or less 24x7 on stabilising anaconda, Fedora 19 has been a pretty well-behaved release so far, and so I've had time to look around and try to investigate and fix up some other things. Among other things, today, I was looking into a couple of bugs which turned out to be very interesting (to weird people like me, anyhow) to debug. I don't know if anyone else feels like this, but I absolutely love this stuff - it's my favourite thing about my job, really, the process of pinning down exactly why something is going wrong, and how to fix it. I also love that in a project like Fedora, you rarely do it alone: it's very cool to work alongside (or just watch) a big group of extremely smart and knowledgeable people, both inside RH and in the Fedora community. I just had the urge to do a post mortem on a specific bug here, because I really love this process. Those of you who do this every day can safely move on. :)

Since upgrading to Fedora 19, I hit a strange problem three or four times where there was a long delay when using the 'su' command to become root. I'd run 'su', enter my password, and then it would sit there apparently thinking about it for 15-20 seconds before returning a root prompt; it's usually instant. This was different from the delay that usually happens when you get the password wrong - it happens when you get the password right, and is longer than that delay.

So when I finally got around to trying to figure out what was going on recently, I started out by trying to 'strace' the su process.

Now, strace is rarely the tool you really want to use for debugging. It's a very unwieldy thing to use, for most cases. There's usually a better way of doing it. But it can be invaluable for your real mystery bugs, like this one, where there is no other obvious way to attack the problem - no logs to sink your teeth into or obvious triage techniques you can use. It turned out to be invaluable for this bug.

At first I was stymied because if you just run the su process through strace from the outset, authentication always fails - strace interferes with it somehow. Tomas Mraz pointed out the workaround which really should've been obvious to me in hindsight: just run su first, then attach the strace to it (the 'strace -p [pid]' command does this). That works fine. So, now I had the strace output, and it had a reasonably obvious smoking gun:

stat("/root/.xauth/import", 0x7fff660f8b70) = -1 ENOENT (No such file or directory) getgid() = 1001 getuid() = 1001 pipe([4, 6]) = 0 pipe([7, 8]) = 0 clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f2a174b9a10) = 15192 close(4) = 0 close(8) = 0 close(6) = 0 read(7,

PAUSE HERE

"", 2048) = 0 --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=15192, si_status=1, si_utime=0, si_stime=0} --- close(7)

The pause happened at "PAUSE HERE": the strace output sat right in the middle of that line for 15 seconds before moving on. So what does that bunch of gibberish mean?

Extracting meaning from voluminous gibberish is an invaluable skill for the kind of stuff we're working on here, which you only really develop with experience. But in this particular case, here's how that stuff reads to me:

blahblahblahwordswordswords CHILD CHILD SIGCHLD 15192 blahblahblahwords...pause...SIGCHLD CLD_EXITED wordswordswords

The key thing to know is that, just before the pause, the su process created a child process. Basically, it ran some other command that it needed to run. This happens all the time: say you were writing a very unsophisticated file manager, if the user clicked the Create Directory button, your program would run a child 'mkdir' process. (That's a horrible way to write a file manager, but you get the idea).

So what this means is that we're not going to get any further staring at this strace, because the data just isn't there. The pause happens right after a child process is forked off, and right when the pause ends, we see that the child process finished. That makes it pretty damn obvious that the pause is happening in the child process, somehow.

So what next? Well, we find out what the child process is. That '15192' is the pid of the child process. How did I know that? I didn't (though a real developer probably would have), but it seemed like a reasonable guess - it's a pid-ish number, it's clearly associated with the creation of the child process, and if I were the developer of strace, then when a child process was created, I'd bloody well make sure it logged the pid. So it seemed a pretty safe bet, and it was easy enough to check: re-do the strace and check 'ps' output for the apparent pid while the pause was happening. And lo, it came to pass that I was right. So ps told me that the child process was this:

/usr/bin/xauth -f /run/gdm/auth-for-adamw-szT11D/database nlist :0

it's trying to read some kind of authentication information from a database provided by gdm. Okay. Now we're making progress.

The next step is probably pretty obvious: let's strace the child process. It's easy enough to watch the su strace output and fire off an 'strace -p 15192' (or whatever the new number was) right when the pause happens.

That yielded some nice gratifying results. The xauth strace showed this happening, seven times:

open("/run/gdm/auth-for-adamw-szT11D/database-c", O_WRONLY|O_CREAT|O_EXCL, 0600) = -1 EACCES (Permission denied) rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0 rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0 nanosleep({2, 0}, 0x7fff8b56d5c0) = 0

it tries to access (actually, create) a file, gets 'Permission denied', WORDSWORDSWORDS, then goes away to sulk in the corner for a bit: after that 'nanosleep' line, there was a two second pause. Multiply that by 7 (actually I think it's probably 8, and the first happened before I got the strace started) and you get the long pause. Eureka!

Of course, we're not done yet. When you run up against something like this, there are two key questions:

a) Why is it trying to do that in the first place? b) Why is it failing?

a) can sometimes be an interesting question, but then, I always find it's a good idea to consider a third question: 'do the people who did this know better than I do?', to which the answer is often 'yes'. I'm certainly not going to start questioning the motives of the people who maintain util-linux unless it seems to be urgently necessary. So in a case like this, it's probably prudent to assume that there's a good and valid answer to a), and slide right along to b).

At this point it was necessary to call in the cavalry again, what with me still being the idiot monkey and all. So I got on IRC and bleated piteously for halfline - Ray Strode - the maintainer of GDM. He did most of the work from there on; I just sat around coming up with silly ideas.

It was still a pretty wide open problem area to start with; there are always multiple possibilities when you're looking at a problem like this. We had to consider, for instance, the possibility that this file creation always failed, but didn't always cause a delay. Or that the attempt to create the file itself isn't a 'normal' thing but some kind of failure path; perhaps in the 'success' case, the file creation isn't attempted. So it seemed prudent to run an strace on the 'success' case as well as the failure case. ajax - Adam Jackson - had a handy trick for achieving this. In the 'success' case the xauth command returns pretty much instantaneously, so you can't catch it in strace just by being handy with the copy/paste shortcuts. Noo, we have to be smarter than that. So, we do this:

cp /usr/bin/xauth /usr/bin/xauth-real cat > /usr/bin/xauth << EOF strace -o /tmp/$$.strace /usr/bin/xauth-real "$@" EOF

What we did there was replace the xauth binary with a wrapper script that runs the 'real' xauth (renamed to xauth-real) through strace. (The cat > << EOF...EOF stuff is just a way to create a file with any contents you like; you could just as easily create a file in gedit, and type in that strace line to it). The "$$" means "the pid of the process", so you wind up with /tmp/18104.strace or whatever.

After that nifty trick we run our 'su', and take a look at the strace. It shows that the creation of the file is indeed attempted in the success case, and works immediately. So as it turns out, this isn't a curveball: the problem really is that xauth should be able to create that file, but it's being denied permission. Still - always check for curveballs. Don't get blinkers on when you've caught a hint of what the problem is: at every stage of debugging, try and consider all the possibilities.

With that knowledge we could focus in a little bit. After looking at various things we worked our way around to what turned out to be the problem: on the system where I was seeing the bug, /run/gdm/ had the permissions 1770 (and ownership root.gdm). Well, that's clear enough.

We still weren't done, though! The question is always why: why did the directory have those permissions? Remember, the bug isn't permanent. Checking a normally-working system, the permissions were 0711, which is what halfline expected to see. Since I didn't see the bug all the time, it looked like the permissions were usually right, but somehow, sometimes, they got set to the wrong thing. So what would be changing the permissions in the middle of a running session?

Tracking that down was a bit tricky. We poked through the gdm codebase looking for calls to chmod or other functions for setting permissions and couldn't find anything. Finally, though, when we were getting pretty stumped, halfline thought to check the gdm package spec file. Eureka once again:

%attr(1770, root, gdm) %dir /run/gdm

And there, the prosecution rests: we have our culprit. On package installation, the permission on /run/gdm were set wrong. As we'd discovered while poking the gdm codebase, gdm has code to set the correct permissions - 0711 - at runtime, so most of the time, that was disguising this bug; when I booted up my system, gdm would fix the incorrect permissions. But what would trigger the bug would be any update to the 'gdm' package that I installed on the fly, without rebooting or restarting GNOME: the package install would set the permissions back to 1770, and boom, bug. A reboot would trigger gdm fixing the permissions and the bug went away again.

I don't know, maybe I'm just weird, but I enjoy the whole process of narrowing something like that down, and it's a great feeling when you get to the end and you're satisfied that you know exactly what broke and how to fix it. Even if someone else wound up doing most of the work! Thanks much to Ray, Adam and Tomas for that.

I actually intended to also work through in similar fashion the other interesting thing I figured out today, which is why most of anaconda's icons are missing when you run it from the new MATE Fedora spin, but this has gotten very long, so I guess I'll save that for another time! Thanks for reading, if you got this far.

Comments

samworm wrote on 2013-05-21 10:51:
"strace -f" (or "strace -ff -o filename") is handy in situations like this, it will automatically strace child processes for you, and in the case of "-ff" store each child's output in a separate file. Much easier than trying to attach to pids as they come into existence. Maybe I just like blunt instruments, but strace if often the first tool I turn to for debugging things like this ;)
adamw wrote on 2013-05-21 16:02:
samworm: I knew someone was going to teach me something there :) Thanks!