FreeIPA two factor authentication Test Day right now and tomorrow! (2013-06-06 to 2013-06-07)

Sorry for the late notice, but it seems everyone forgot to send out announcements for this event, so I'm doing it now!

EDIT: npmccallum (one of the developers involved in this Test Day) is very keen to get some testing results, so he's asked if we can extend the Test Day into tomorrow, given the lack of advance notice. So that's what we're doing! If this sounded interesting but you just didn't have the time, never fear: just come along tomorrow instead and we'll still be there.

Right now, FreeIPA two factor authentication Test Day is going on. This is a pretty specialized event, but if you have a FreeIPA setup, you should probably be interested: didn't you hear that two factor authentication is what all the cool kids are doing these days? Yubikey, dude. Yubikey.

As always, there are full instructions on the Test Day page and there will be developers and QA folks in #fedora-test-day on Freenode IRC for discussion and any help you need in testing or debugging. If you're not sure what IRC is or how to use it, we have instructions here, and you can also simply click here to join the chat through a Web front end.

Fedora 19 NetworkManager Test Day tomorrow (2013-06-04)!

We're in the final stretch of Fedora 19 Test Days now: this week marks the last two scheduled for the cycle. First up, tomorrow - Tuesday 2013-06-04 - is NetworkManager Test Day. As usual, we'll be checking NetworkManager is performing correctly across as many configurations as possible, and also checking out some advanced features like IPv6 support, channel bonding, and wifi hotspot functionality.

As always, full testing instructions and a live image for testing if you do not want to install Fedora 19 are available on the Test Day wiki page, and there will be developers and QA folks in #fedora-test-day on Freenode IRC for discussion and any help you need in testing or debugging. If you're not sure what IRC is or how to use it, we have instructions here, and you can also simply click here to join the chat through a Web front end.

(And no, I didn't forget about the debugging quiz - answer coming soon :>)

The banality of 'Don't be evil'

I'm never totally sure what to make of Julian Assange, but this piece is brilliant, and packed with zingers (including the headline): "In the book the authors happily take up the white geek’s burden."

And "This book is a balefully seminal work in which neither author has the language to see, much less to express, the titanic centralizing evil they are constructing." - well, quite.

Debugging 3: Pop Quiz Edition

I'm kinda enjoying these 'debugging' posts (even if no-one else is!), so I thought I'd do one in a different style. I'll just describe the bug. You get to try and explain it.

I have a properly-working Fedora 18 NFS server. It has a directory /export/fedora , containing the Fedora 19 Beta x86-64 DVD ISO image (and nothing else). Its /etc/exports looks like this:

/export 192.168.1.0/24(ro,sync,insecure,root_squash,no_subtree_check,fsid=0) /export/fedora 192.168.1.0/24(rw,nohide,sync,insecure,root_squash,no_subtree_check)

Assume no issues on the NFS server side.

Now:

  1. If I boot the Fedora 19 Beta x86_64 network install image and append the parameter 'inst.repo=nfs:192.168.1.105:/fedora' to the boot options, the installer boots, and the ISO on the NFS server is used as the installation repository.
  2. If I boot the Fedora 19 Beta x86_64 network install image and append the parameter 'inst.repo=nfs:192.168.1.105:/export/fedora' to the boot options, the installer boots, and the installation repository is set to Closest Mirror.
  3. If I boot a bare Fedora 19 Beta anaconda kernel/initramfs pair (which is the typical case for a PXE or virt-install installation) and pass the parameter 'inst.repo=nfs:192.168.1.105:/fedora' boot fails due to a failure to mount the NFS repository.
  4. If I boot a bare Fedora 19 Beta anaconda kernel/initramfs pair (which is the typical case for a PXE or virt-install installation) and pass the parameter 'inst.repo=nfs:192.168.1.105:/export/fedora', the installer boots (successfully retrieving the installer itself from the NFS repo), and the installation repository is set to Closest Mirror.

Identify all the factors which together explain the above behaviour.

Note: this has already been debugged and there are 'spoilers available', so if you want to figure this one out for yourself, avoid:

  • recent posts to anaconda-patches-list
  • https://bugzilla.redhat.com/show_bug.cgi?id=958763
  • https://bugzilla.redhat.com/show_bug.cgi?id=968023
  • anaconda IRC logs from this morning

  • the current version of https://fedoraproject.org/wiki/Anaconda/Options

You may refer to this version of Anaconda/Options as it existed before I edited it today, if you like.

Winner gets a full year of the highest paid Fedora support tier. Anyone who cheats gets a full year of texlive maintenance duty. Go!

Virtualization Test Days this week

After we signed off on Fedora 19 Beta (coming to a mirror near you tomorrow!) there's no rest: once again, we have two Test Days coming up this week. They are both virtualization related, too.

Tomorrow, 2013-05-28, is the main Virtualization Test Day, while Thursday 2013-05-30 is Spice Test Day.

We'll be testing the 'official' Fedora virtualization stack - qemu-kvm, libvirt, and virt-manager - to make sure it's working well for both host and guest duties in Fedora 19, and testing out some of the new features that have been added during the Fedora 19 development cycle, including the virtio random number generator, live migration of VMs without shared storage, and assignment of PCI devices from the host machine to the guest. Please come help us make sure the Fedora virtualization stack continues to be reliable and continues to improve!

As always, full testing instructions, live images etc are available on the Test Day pages. There will be QA folks on hand in #fedora-test-day on Freenode IRC for discussion and any help you need in testing or debugging. If you're not sure what IRC is or how to use it, we have instructions here, and you can also simply click here to join the chat through a Web front end.

More debugging

Continuing my utterly fascinating series on Bugs What I Figured Out Lately, here's one I spent most of Friday and Saturday on.

The symptom was pretty simple: a user with a UK keyboard had a key that didn't work.

The debugging, ah, the debugging was fun.

You probably don't think much about your keyboard, but the apparently-simple process of pressing a key on a keyboard and a character showing up on screen is really anything but simple. On Linux it's backed by one of the more complex and less well-understood mechanisms most users are likely to come across - xkb/xkeyboard-config.

The reason why might become more apparent if we take a quick look at all the possibilities. If you ignore the complexities and handwave a little bit, you can say that a typical xkb configuration offers 400+ possible 'keyboard layouts'. That's a big space. Why so many?

Well, mainly, you've got a lot for different countries, of course. It's kind of fascinating looking up different national keyboard layouts on Wikipedia; in many countries there are actual official standard documents defining the keyboard layout, which I didn't know. Anyhow, there's a ton of the buggers.

Then you get into the fun stuff: there are over a dozen 'variants' of the basic US English keyboard layout in xkb, for instance. Two or three of them are just different ways of inputting 'extended' characters (basically, characters only used in other languages) using the same basic layout. Then you get the fun stuff, like the DVORAK variant, or the Cherokee variant, which is intended to be switched with the basic US layout by Cherokee speakers so they can input both US and Cherokee characters. And then, because no-one in the Linux ever said "I don't think I have enough choices", people start putting the damn things in cauldrons and synthesizing them together, so you have a variant of the DVORAK variant for letting you input accented characters on your DVORAK layout.

But that's just 'layouts' in terms of what characters are mapped to what keycodes. You also have the fun of varying actual physical keyboard layouts - you can buy all kinds of exotic keyboards which have different keys in different places. This was rather more of a problem in the 1980s and 1990s, when xkb grew a lot of its complexity that is still hanging around and confusing the hell out of people now, but it's still a going concern in some ways. Notably for this bug, there still isn't a single international 'standard' physical keyboard layout (just the question of what keys there are on a keyboard in what arrangement, completely ignoring what's written on the keycaps). There are two very common alternatives, which I'll call 'US' and 'everyone else' (there's actually a third somewhat less common one and several much less common ones, and a few other countries use the 'US' physical board, but let's ignore that).

If you ignore function, arrow and number keys and just look at the block with alphanumeric characters, punctuation and modifier keys, there are two major differences between the two physical layouts. On a US keyboard the second row from the bottom has a Shift key on the far left, and then the first letter key on that row (which is 'Z' on a US keyboard). On the 'everyone else' keyboard, there are three keys in this space. On a UK keyboard there's still a shift key and a Z key, but the shift key is smaller and there's a key in the middle which is marked with a backslash and a broken pipe symbol. On other international keyboards the first key is almost always still a Shift key, but the other two might be different things, but the physical layout is the same. Over on the right hand side, the 'US' physical keyboard has a long enter key which is only a single row high, and a key above it (which is the backslash / pipe key on a US keyboard); on the 'everyone else' model, the Enter key is two rows high but less wide, and the other key is to the left of it on the third row instead of above it. Hard to describe, but if you look at a couple of pictures it's pretty obvious.

So (after 700 words!) back to the bug: for our poor reporter, who has a UK keyboard, the 'extra' key between Shift and Z - the backslash/pipe key - didn't (appear to) work. He pressed it, nothing seemed to happen. (UK users commonly use this key to enter, well, backslashes and pipes; there are actually other keys on the UK layout that can enter these characters, but they're less conveniently situated).

After hearing his report, I went haring off and learned a hell of a lot more about xkb than I had learned before, and possibly more than I actually wanted to know. I developed a neat elegant theory which explained all the apparent symptoms of the bug, which turned out to be completely wrong. This is something that happens frequently in debugging. At least to me. Maybe the rest of you are much smarter and don't have this problem. But don't get discouraged! You almost certainly learned something in coming up with your theory, and it will probably be useful to you in future.

My elegant, wrong theory revolved around an xkb setting. Well, actually, two xkb settings, one a 'real' one and one an abstraction. I am not going to explain the internals of xkb here because OH GOD NO, but as an example of the horrors you will find if you delve into it: you may be somewhat familiar with the most common way of configuring xkb, which is to fiddle with a small set of parameters:

XkbModel XkbLayout XkbVariant XkbOptions

It is, perhaps, instructive to the complexity of xkb internals to know that these don't exactly really exist - or rather, they're not really how xkb thinks about things. Internally, xkb thinks about keycodes, types, compat, symbols and geometry. It is a fairly little-known fact that you can actually specify an xkb config by specifying each of these directly (as XkbKeycodes, XkbTypes, XkbCompat etc etc). When you specify a config in the Model, Layout, Variant, Options format, what you are in fact doing is telling xkb to use a given set of rules to transform those options into a set of keycodes, types, compat, symbols and geometry settings. Essentially, xkb heard you liked configuration, dawg, so it wrapped its configuration scheme in a configuration scheme you can configure while you configure.

So, I had gotten about 60% of the xkb cluestick, and I started to latch on to the concepts of 'geometry' and 'model', without quite understanding that it was part of one of xkb's config schemes and the concept of a 'model' was part of the other. But it was at least becoming apparent to me that in general the xkb config you get from Fedora's installer or desktop tools did not specify either model or geometry, and that these concepts appeared to include the question of which physical keyboard layout is in use. It also was beginning to seem to me that, if you didn't specify this explicitly, xkb would assume the US physical keyboard.

All of this is actually true; I wasn't wrong. It just wasn't the source of the bug. But you can see why I thought it was, right? Surely, if xkb thinks we're using a US physical keyboard, and the US physical keyboard doesn't have that key at all, that's why it's not working, right?! There was even corroborating evidence: GNOME can render a preview of any given keyboard layout. If you pick the UK keyboard layout in GNOME and look at the preview, that key is missing (assuming you have a 'normal' Fedora X config, where xkb is assuming the US physical keyboard). If you then manually change your X config to specify the international keyboard model (which you can do by specifying XkbModel "pc105"), the preview does show the key. Pretty damning, right?

Well, as it turns out, no. Though I'm damn well giving myself a pass for thinking so. But as I dug deeper (and got some invaluable help from Rui Matos), it turns out that the concept of 'geometry' - which is what xkb calls the actual physical key layout - is not that important to xkb at all. xkb does not really use it, in terms of it being a factor in what keycodes are translated to what keysyms (actual on-screen characters...more or less), at all. It's actually only used for...rendering preview images of keyboard layouts.

@()^)@#(^&!&)!(&$!)(*$DAMNIT, xkb.

As noted above, the concept of a 'model' is actually an abstraction of the 'real' underlying configuration items. If you set a keyboard 'model' in your xkb config, what that actually means depends on the model you set, but for the models that are just representations of 'typical' physical keyboard layouts (pc101, pc102, pc104, pc105), it just results in the appropriate 'geometry' being set, and doesn't change any of the other config settings (keycodes, types, compat, symbols) at all. So it really doesn't matter.

This was confirmed with the bug reporter: changing model didn't fix his key. So my beautiful elegant theory was dead. Ah, well.

Fortunately, all this crap I'd learned about xkb internals made it 'easy' to diagnose the real problem, once I got the info necessary. If you've read this far, congrats, because here's something you may find useful. In debugging a bug of this type, the information you're going to really really want is this: the output of 'setxkbmap -print', and the output shown on the console when you run 'xev' and press the key in question. So, here's what I got from the bug reporter:

setxkbmap -print xkb_keymap { xkb_keycodes { include "evdev+aliases(qwerty)" }; xkb_types { include "complete" }; xkb_compat { include "complete" }; xkb_symbols { include "pc+gb+us:2+inet(evdev)+level5(lsgt_switch_lock)" }; xkb_geometry { include "pc(pc102)" }; };

KeyPress event, serial 36, synthetic NO, window 0x2200001, root 0x82, subw 0x0, time 695097, (367,321), root:(369,408), state 0x1, keycode 94 (keysym 0xfe11, ISO_Level5_Shift), same_screen YES, XLookupString gives 0 bytes: XmbLookupString gives 0 bytes: XFilterEvent returns: False

I might have figured it out just from that even before doing a deep dive into xkb, but it sure seemed obvious afterwards.

setxkbmap -print in particular is a really useful bit of data: it tells you exactly what the 'real' xkb parameters are set to right at that moment. No abstractions, no muss, no fuss: that's your 'real' xkb config right there. The setting for keycodes is sane and normal (that defines 'what are we expecting to see from the underlying kernel driver', which these days is just about always evdev; if it's set to the wrong thing you'll get complete gibberish, on the 'garbage in, garbage out' principle). types and compat, also boring. geometry, as we've established, unimportant (though for the record, 'pc102' is the international-style physical keyboard with no windows keys). The interesting bit is symbols - it usually is, because that's the bit that's doing most of the grunt work. The stuff listed there is basically 'the rules being used to determine what keycodes mean what keysyms'.

Those first four are pretty normal, but that "level5(lsgt_switch_lock)" is not at all, I hadn't actually seen it before. The 'lsgt' was a big warning flag: the 'missing key' in question is referred to as 'lsgt' internally in xkb (it means 'lesser/greater', because if you set the US layout on an international physical keyboard, that key gets mapped as the "lesser than" and "greater than" symbols). You might also, if you have sharp eyes, have spotted something apparently related in the xev output:

(keysym 0xfe11, ISO_Level5_Shift)

if so, congratulations, because it definitely is related. If you're looking at xev output and trying to figure out something related to a keypress, that part of the output is probably what you're going to find useful:

keycode 94 (keysym 0xfe11, ISO_Level5_Shift)

what that's telling us is that the key that was pressed has the 'keycode' 94, and that keycode is mapped by the currently-operational xkb rules to the 'keysym' 0xfe11, which - if you don't speak xkb as a native language - is handily explained to be "ISO_Level5_Shift". What we would be expecting if the key was working properly would be (keysym 0x5c, backslash). So what the xev output tells us is that the key actually is doing something, but not what we expected.

It happens that what it's doing is something quite subtle: "level5_shift" basically means 'this is a modifier key which will shift to the fifth level of characters while it's pressed'. The first level of characters is what you get when you just press a key with no modifiers, the second level is what you get when you press a key while holding Shift, the third level is what you get when you press a key while holding a third level modifier key, the fourth level is the third level shifted, the fifth level is yet ANOTHER level, and the sixth level is the fifth level shifted. Obviously, the point of the third, fourth, fifth, sixth levels is to let you input more than however many characters can possibly be input with all the keys on the keyboard with and without the shift modifier key.

In practice you can enter a hell of a lot of characters with just four levels, so the fifth level rarely gets used at all, and as the key was set to behave like a 'shift' key - it only did something if you held it down and pressed another key - the reporter could be forgiven for thinking it wasn't doing anything at all. It probably wouldn't have done much even if he had held it down and pressed another key, as I don't think the UK layout has a fifth level. Hell, I'm not sure it has a third (though some of its variants do).

So as it transpires, the setting "level5(lsgt_switch_lock)" was causing the key to act as a level 5 modifier key instead of whatever it would otherwise have done, and that's why the reporter thought it wasn't doing anything. I'd been poking other levels of the keyboard input stack too, so I knew the most obvious places to look for that setting being requested: an X config snippet, or GNOME's keyboard config settings.

In GNOME, the gsettings keys under org/gnome/desktop/input-sources define what GNOME wants the xkb config to be (and therefore, almost certainly, what it actually is when you're running GNOME, unless you've got some script changing them after GNOME starts, or you just changed them yourself with setxkbmap). The key 'xkb-options' specifies XkbOptions. Sure enough, the reporter had "level5(lsgt_switch_lock)" set in his xkb-options key. As soon as he unset the key, his \ key started working.

Whew! Sorry I've rambled again, but I hope this has illustrated a couple of things: it can take a lot of background knowledge to quickly and confidently diagnose a problem which from a superficial description seems simple, and it is not unusual to look at a problem carefully and conscientiously, come up with an explanation that appears entirely internally consistent and explains all the symptoms of the problem and even appears to be corroborated by other data, and then find out that that explanation is entirely wrong. If this happens to you, don't worry. It's probably not your fault. It's just...computers.

Anaconda Test Day Tuesday, and debugging

Sorry (again) for the late notice, but Tuesday 2013-05-21 - tomorrow or today, depending - is Anaconda Test Day!

Yeah, in a way it's anaconda test day every day in Fedoraland, as we focus a lot of our day-to-day testing on the installer. But the folks running this Test Day have come up with some good extended test cases that go beyond what we cover in the release validation testing, so we should be able to find some more interesting bugs (it's always fun).

We'll be testing the current Beta candidate build, Beta RC2. As always, full instructions for testing are available in the Test Day page, and there will be QA folks on hand in #fedora-test-day on Freenode IRC for discussion and any help you need in testing or debugging. If you're not sure what IRC is or how to use it, we have instructions here, and you can also simply click here to join the chat through a Web front end.

If you haven't tried out the improved anaconda in Fedora 19 yet, this event is a great excuse to take a look at it: it is much more polished and stable now than it was in Fedora 18, so come try it for yourself and help us to improve it further before the final release of Fedora 19!

I've been really enjoying work on Fedora 19 lately. After the somewhat hellish Fedora 18 cycle where we worked more or less 24x7 on stabilising anaconda, Fedora 19 has been a pretty well-behaved release so far, and so I've had time to look around and try to investigate and fix up some other things. Among other things, today, I was looking into a couple of bugs which turned out to be very interesting (to weird people like me, anyhow) to debug. I don't know if anyone else feels like this, but I absolutely love this stuff - it's my favourite thing about my job, really, the process of pinning down exactly why something is going wrong, and how to fix it. I also love that in a project like Fedora, you rarely do it alone: it's very cool to work alongside (or just watch) a big group of extremely smart and knowledgeable people, both inside RH and in the Fedora community. I just had the urge to do a post mortem on a specific bug here, because I really love this process. Those of you who do this every day can safely move on. :)

Since upgrading to Fedora 19, I hit a strange problem three or four times where there was a long delay when using the 'su' command to become root. I'd run 'su', enter my password, and then it would sit there apparently thinking about it for 15-20 seconds before returning a root prompt; it's usually instant. This was different from the delay that usually happens when you get the password wrong - it happens when you get the password right, and is longer than that delay.

So when I finally got around to trying to figure out what was going on recently, I started out by trying to 'strace' the su process.

Now, strace is rarely the tool you really want to use for debugging. It's a very unwieldy thing to use, for most cases. There's usually a better way of doing it. But it can be invaluable for your real mystery bugs, like this one, where there is no other obvious way to attack the problem - no logs to sink your teeth into or obvious triage techniques you can use. It turned out to be invaluable for this bug.

At first I was stymied because if you just run the su process through strace from the outset, authentication always fails - strace interferes with it somehow. Tomas Mraz pointed out the workaround which really should've been obvious to me in hindsight: just run su first, then attach the strace to it (the 'strace -p [pid]' command does this). That works fine. So, now I had the strace output, and it had a reasonably obvious smoking gun:

stat("/root/.xauth/import", 0x7fff660f8b70) = -1 ENOENT (No such file or directory) getgid() = 1001 getuid() = 1001 pipe([4, 6]) = 0 pipe([7, 8]) = 0 clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f2a174b9a10) = 15192 close(4) = 0 close(8) = 0 close(6) = 0 read(7,

PAUSE HERE

"", 2048) = 0 --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=15192, si_status=1, si_utime=0, si_stime=0} --- close(7)

The pause happened at "PAUSE HERE": the strace output sat right in the middle of that line for 15 seconds before moving on. So what does that bunch of gibberish mean?

Extracting meaning from voluminous gibberish is an invaluable skill for the kind of stuff we're working on here, which you only really develop with experience. But in this particular case, here's how that stuff reads to me:

blahblahblahwordswordswords CHILD CHILD SIGCHLD 15192 blahblahblahwords...pause...SIGCHLD CLD_EXITED wordswordswords

The key thing to know is that, just before the pause, the su process created a child process. Basically, it ran some other command that it needed to run. This happens all the time: say you were writing a very unsophisticated file manager, if the user clicked the Create Directory button, your program would run a child 'mkdir' process. (That's a horrible way to write a file manager, but you get the idea).

So what this means is that we're not going to get any further staring at this strace, because the data just isn't there. The pause happens right after a child process is forked off, and right when the pause ends, we see that the child process finished. That makes it pretty damn obvious that the pause is happening in the child process, somehow.

So what next? Well, we find out what the child process is. That '15192' is the pid of the child process. How did I know that? I didn't (though a real developer probably would have), but it seemed like a reasonable guess - it's a pid-ish number, it's clearly associated with the creation of the child process, and if I were the developer of strace, then when a child process was created, I'd bloody well make sure it logged the pid. So it seemed a pretty safe bet, and it was easy enough to check: re-do the strace and check 'ps' output for the apparent pid while the pause was happening. And lo, it came to pass that I was right. So ps told me that the child process was this:

/usr/bin/xauth -f /run/gdm/auth-for-adamw-szT11D/database nlist :0

it's trying to read some kind of authentication information from a database provided by gdm. Okay. Now we're making progress.

The next step is probably pretty obvious: let's strace the child process. It's easy enough to watch the su strace output and fire off an 'strace -p 15192' (or whatever the new number was) right when the pause happens.

That yielded some nice gratifying results. The xauth strace showed this happening, seven times:

open("/run/gdm/auth-for-adamw-szT11D/database-c", O_WRONLY|O_CREAT|O_EXCL, 0600) = -1 EACCES (Permission denied) rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0 rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0 nanosleep({2, 0}, 0x7fff8b56d5c0) = 0

it tries to access (actually, create) a file, gets 'Permission denied', WORDSWORDSWORDS, then goes away to sulk in the corner for a bit: after that 'nanosleep' line, there was a two second pause. Multiply that by 7 (actually I think it's probably 8, and the first happened before I got the strace started) and you get the long pause. Eureka!

Of course, we're not done yet. When you run up against something like this, there are two key questions:

a) Why is it trying to do that in the first place? b) Why is it failing?

a) can sometimes be an interesting question, but then, I always find it's a good idea to consider a third question: 'do the people who did this know better than I do?', to which the answer is often 'yes'. I'm certainly not going to start questioning the motives of the people who maintain util-linux unless it seems to be urgently necessary. So in a case like this, it's probably prudent to assume that there's a good and valid answer to a), and slide right along to b).

At this point it was necessary to call in the cavalry again, what with me still being the idiot monkey and all. So I got on IRC and bleated piteously for halfline - Ray Strode - the maintainer of GDM. He did most of the work from there on; I just sat around coming up with silly ideas.

It was still a pretty wide open problem area to start with; there are always multiple possibilities when you're looking at a problem like this. We had to consider, for instance, the possibility that this file creation always failed, but didn't always cause a delay. Or that the attempt to create the file itself isn't a 'normal' thing but some kind of failure path; perhaps in the 'success' case, the file creation isn't attempted. So it seemed prudent to run an strace on the 'success' case as well as the failure case. ajax - Adam Jackson - had a handy trick for achieving this. In the 'success' case the xauth command returns pretty much instantaneously, so you can't catch it in strace just by being handy with the copy/paste shortcuts. Noo, we have to be smarter than that. So, we do this:

cp /usr/bin/xauth /usr/bin/xauth-real cat > /usr/bin/xauth << EOF strace -o /tmp/$$.strace /usr/bin/xauth-real "$@" EOF

What we did there was replace the xauth binary with a wrapper script that runs the 'real' xauth (renamed to xauth-real) through strace. (The cat > << EOF...EOF stuff is just a way to create a file with any contents you like; you could just as easily create a file in gedit, and type in that strace line to it). The "$$" means "the pid of the process", so you wind up with /tmp/18104.strace or whatever.

After that nifty trick we run our 'su', and take a look at the strace. It shows that the creation of the file is indeed attempted in the success case, and works immediately. So as it turns out, this isn't a curveball: the problem really is that xauth should be able to create that file, but it's being denied permission. Still - always check for curveballs. Don't get blinkers on when you've caught a hint of what the problem is: at every stage of debugging, try and consider all the possibilities.

With that knowledge we could focus in a little bit. After looking at various things we worked our way around to what turned out to be the problem: on the system where I was seeing the bug, /run/gdm/ had the permissions 1770 (and ownership root.gdm). Well, that's clear enough.

We still weren't done, though! The question is always why: why did the directory have those permissions? Remember, the bug isn't permanent. Checking a normally-working system, the permissions were 0711, which is what halfline expected to see. Since I didn't see the bug all the time, it looked like the permissions were usually right, but somehow, sometimes, they got set to the wrong thing. So what would be changing the permissions in the middle of a running session?

Tracking that down was a bit tricky. We poked through the gdm codebase looking for calls to chmod or other functions for setting permissions and couldn't find anything. Finally, though, when we were getting pretty stumped, halfline thought to check the gdm package spec file. Eureka once again:

%attr(1770, root, gdm) %dir /run/gdm

And there, the prosecution rests: we have our culprit. On package installation, the permission on /run/gdm were set wrong. As we'd discovered while poking the gdm codebase, gdm has code to set the correct permissions - 0711 - at runtime, so most of the time, that was disguising this bug; when I booted up my system, gdm would fix the incorrect permissions. But what would trigger the bug would be any update to the 'gdm' package that I installed on the fly, without rebooting or restarting GNOME: the package install would set the permissions back to 1770, and boom, bug. A reboot would trigger gdm fixing the permissions and the bug went away again.

I don't know, maybe I'm just weird, but I enjoy the whole process of narrowing something like that down, and it's a great feeling when you get to the end and you're satisfied that you know exactly what broke and how to fix it. Even if someone else wound up doing most of the work! Thanks much to Ray, Adam and Tomas for that.

I actually intended to also work through in similar fashion the other interesting thing I figured out today, which is why most of anaconda's icons are missing when you run it from the new MATE Fedora spin, but this has gotten very long, so I guess I'll save that for another time! Thanks for reading, if you got this far.

Quote of the day

jwb: so let me get this straight. you're using an unreleased kernel with a btrfs and ceph combination? atb9090_: thinking about it jwb: think harder.

Installer memory usage in F19 Beta TC3

So I had a few spare cycles this morning and thought I'd look at the memory usage of the installer in F19 Beta TC3.

The awesome Chris Lumens made this much easier back in 2011 by adding memory use logging functionality to anaconda. He wrote up his findings at the time - based on a Fedora 15 install - in this blog post.

I ran a very similar test to Chris' using F19 Beta TC3. We both did a default package set install from a DVD image. Only difference is that I did not switch language to German in the anaconda UI, as Chris did. Here's my graph:

Memory usage during F19 Beta TC3 install

As Chris did, I've annotated the spikes (although much more messily than he did).

So, what do we see? It's a bit hard to see in this graph as I blew through the install process very quickly, but I just ran another test, and if I complete the necessary spokes and leave the UI sitting at the hub screen before actually kicking off an install, the 'background level' of anaconda itself is 329MB. In Chris' graph, this 'background level' before package installation kicks in looks like it's around 200MB. However, oldUI was very different from newUI, and most of the difference there may be accounted for by the fact that we've already fired up storage and repository configuration code at this point.

The 'background level' during package installation - where the line sort of mostly sits aside from spikes - is around 590MB in my data. In Chris' F15 test, it looks like it was around 520MB-530MB (eyeballing it). So usage has increased about 60-70MB there.

The spikes are as follows:

A - sepolgen-ifgen running during install of policycoreutils-devel package (spikes to 796MB) B and C - both gtk-update-icon-cache, running during %posttrans I believe. See icon cache packaging guidelines. D - this whole ramp is yum's "VerifyTransaction" action, which is new since Fedora 15 (I forget which release introduced it)

The shorter unlabelled spikes I haven't looked into, but they're probably update-mime-databases, like the similar small spikes in Chris' F15 run.

It's rather odd that gtk-update-icon-cache runs multiple times in %posttrans, the policy seems designed to prevent this happening. From the raw data it looks like it runs three times, once at 08:29-31, once at 08:36-38, and once at 09:17. The memory usage spikes in all three cases. edit: Thanks to kalev who came up with a nifty bit of RPM magic to inspect all scripts that do gtk-update-icon-cache, it looks like this makes sense: there are three icon themes of significant size that are updated in %posttrans of various packages, hicolor, gnome and HighContrast. So the three spikes associated with gtk-update-icon-cache runs make sense.

So it looks like since F15, the general 'background' usage of memory during an anaconda run has increased modestly, and the spikes in usage have changed somewhat.

The good news is that the highest spike in a default F19 run is lower than the highest spike in a default F15 run; Chris' graph spikes clear over 800MB, while mine just touches it. The big spike in F15 was "from the selinux-policy-targeted package's %post running semodule"; that's gone completely in F19, there is no spike at all associated with selinux-policy-targeted. However, the biggest spike in F19 appears to be a new one: it is a different SELinux spike, this time sepolgen-ifgen running during install of policycoreutils-devel. It is not quite as big as the F15 spike, but obviously something we should try and optimize if possible.

gtk-update-icon-cache caused spikes both times. However, during F15 it ran only once (according to Chris), and was a somewhat smaller spike. It's labelled K in Chris' graph, and is a spike of 70-80MB, to 600MB total. In F19Btc3, the spike is of 110MB, to 700MB total.

The 'depmod' spike from Chris' run seems to have disappeared now.

Finally, the 'ramp' associated with the new VerifyTransaction step in yum is a new issue, and possibly worrying: I haven't tested with a larger package set, but this test indicates that the memory usage never seems to level off, which would indicate that memory usage during VerifyTransaction is directly proportionate to the number of packages being installed. So at least theoretically, if you did a really big package set install, the memory usage during VerifyTransaction might be the highest of all, and could go quite high. I'll try doing an 'as much stuff as possible' install and see how high it gets.

So we have a mixed picture from this small comparison; in fact the highest memory usage of a stock DVD install seems to have gone down slightly since F15, but there's clearly room for improvement, and there may be issues with larger package sets that we didn't have in F15.

ABRT and SSSD Test Days this week

This week in Test Days: we'll be testing ABRT on Tuesday 2013-05-07 and SSSD improvements and Active Directory integration on Thursday 2013-05-09!

ABRT is the Fedora tool for catching and reporting crashes. If you've been running Fedora 19, or you've updated with updates-testing in Fedora 18 in the last few days, you may have noticed some major changes to ABRT and libreport, including a completely new graphical tool for reporting crashes called gnome-abrt. We'll be testing out these big changes at the ABRT Test Day. ABRT gets better every Fedora release, but the more broad-based testing we get the more issues we can squish, so please, come along and help us test!

The SSSD improvements and Active Directory integration Test Day will focus on Fedora 19 enhancements to our enterprise authentication tools. In particular, we'll be testing integrating Fedora 19 systems into Active Directory domains. This probably won't be of interest to some of you, but if you use or even help to admin a FreeIPA or AD shop, you might well want to come along and help check if we have things working properly for your deployment.

As always, full instructions for taking part in each Test Day are available on the Wiki page, and we'll be making live images available so you can do as much of the testing as possible without needing to install a pre-release Fedora. QA and development folks will be present in the #fedora-test-day channel on Freenode IRC for discussion and any help you might need in testing. If you're not sure what IRC is or how to use it, we have instructions here, and you can also simply click here to join the chat through a Web front end.

Thanks to all in advance!