September 21st, 2011
Let’s take a look at why – as I’ve done for previous slips.
We tried something to help avoid slips this time: we released Fedora 16 Beta TC1 (Test Compose 1 – the first full ‘dry run’ for any given Fedora release or pre-release is a Test Compose, a full set of ISO images built like a regular release by release engineering, for QA to run validation tests on) a week ahead of the original schedule, on 2011-08-31 instead of 2011-09-06. While this obviously didn’t succeed in preventing a slip in the end, we feel it had a positive effect; the extra week wasn’t wasted, and we think there’s a significant chance the Beta would have been in danger of slipping two weeks without the early start on testing.
The accepted Fedora 16 Beta blocker bugs that were not addressed at the time of the Go/No-Go meeting are:
This bug was identified early in the Beta cycle, on 2011-09-05. However, we have not succeeded in identifying a specific cause for it, and it’s possible there may be multiple bugs with the same symptom. Although it was initially accepted as a blocker, it has not proven to be a really severe problem in later Beta testing, and it’s possible this bug would be revised to non-blocker status on Friday. If we’d had no other blockers, we might have downgraded this one and gone ahead with the Beta release. It manifests itself as a two-minute delay to system boot which seems to happen intermittently – we have not found a system on which it happens at every boot. On boots where you hit the bug, you would also not be able to run the live installer successfully. The workaround is simply to reboot until you don’t hit the bug.
This is part of a family of bugs which is related to a major change in Fedora 16: the switch to the grub2 bootloader. Previously, when upgrading Fedora, both preupgrade and ‘normal’ anaconda-based upgrades would default to updating the existing bootloader configuration with the kernel from the new release, rather than re-installing the bootloader. With Fedora 16, the ‘update’ option is no longer available, and the intention is that upgrades will write a new grub2 bootloader. preupgrade, however, needs to be patched to actually request that anaconda writes the new bootloader. This bug was recognized quite early, but there is some QA fail associated with it. There was a similar bug for regular anaconda-based upgrades, to change the default action to ‘write a new bootloader’, and make that action work. We assumed preupgrade would follow the anaconda default and hence be fixed as well, but in actual fact, preupgrade writes a kickstart file and passes it to anaconda, and defines the bootloader action in that, and so will need a separate fix. We should have checked more closely into this bug rather than assuming it would be fixed.
This bug was identified as soon as the Beta RC1 build was made available, on 2011-09-15. It’s been the focus of our work since, as it’s a clearly critical bug that proved to be complex to diagnose. The symptom of the bug is that a file in the home directory of the ‘liveuser’ account on live images had the wrong SELinux context, which caused gnome-settings-daemon to crash when its helper processes could not access it, which caused GNOME itself not to start correctly. At first we were confused as to why the bug had suddenly appeared in Beta RC1 when neither selinux-policy nor the relevant GNOME components appeared to have changed in a way that would cause the bug. We finally worked out that a different bug in the live image generation tools had been masking this bug: it had caused the /home/liveuser/.local directory to be owned by the root user, which prevented the problematic file from being created at all. gnome-settings-daemon does not crash if the problematic file is not present – only if it is present, but incorrectly labelled – and so previous builds had not exhibited the problem. We were also confused by the fact that the behaviour with a non-live install, or when creating and logging into a new user on an existing F16 install, was different: this turned out to be due to yet a further bug in selinux-policy which did not trigger in the live boot case. After clearing up both those sources of confusion, and fixing the obscuring bugs, we were finally able to establish that this problem was down to the ‘filetrans’ mechanism by which the kernel should monitor newly created files and directories and apply the correct SELinux labels to them not working correctly. Fixing this issue ultimately proved to be the resolution for this bug. However, we were only able to establish the real problem and build a fix at a time that was too late to allow us to meet the release deadline. 
I don’t think that either QA or the developers did much wrong in this case; it was just a very tricky bug to pin down, and the process took longer than the time we had available.
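The way one bug hid another in the live image case is easier to see as a toy model. This is only an illustrative sketch of the chain of events described above, not real SELinux or GNOME code; the names `LiveHome`, `problem_file_label` and `gnome_starts` are made up for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LiveHome:
    """Hypothetical model of the liveuser home directory on a live image."""
    local_dir_owner: str   # owner of /home/liveuser/.local
    filetrans_works: bool  # are new files given the correct SELinux label?


def problem_file_label(home: LiveHome) -> Optional[str]:
    """Label state of the problematic file, or None if it is never created."""
    if home.local_dir_owner != "liveuser":
        # A root-owned .local stops liveuser from creating the file at all.
        return None
    return "correct" if home.filetrans_works else "wrong"


def gnome_starts(home: LiveHome) -> bool:
    """gnome-settings-daemon only crashes if the file exists AND is mislabelled."""
    label = problem_file_label(home)
    return label is None or label == "correct"


# Earlier builds: the ownership bug masks the labelling bug.
earlier_builds = LiveHome(local_dir_owner="root", filetrans_works=False)
# Beta RC1: ownership fixed, so the labelling bug becomes visible.
beta_rc1 = LiveHome(local_dir_owner="liveuser", filetrans_works=False)
# After the fix: ownership and labelling both behave.
fixed = LiveHome(local_dir_owner="liveuser", filetrans_works=True)

for name, home in [("earlier", earlier_builds), ("RC1", beta_rc1), ("fixed", fixed)]:
    print(name, "GNOME starts:", gnome_starts(home))
```

In this model, fixing the ownership bug on its own is what makes the build regress, which matches why the problem seemed to appear from nowhere in RC1.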
This was another bug that relates to the introduction of the grub2 bootloader, and in this case, the introduction of GPT disk labels.
The GPT disk label format will replace the MS-DOS disk label format (often referred to simply as the MBR) which has been used on just about all disks in PCs for decades. It permits a more flexible partition table and partitions greater than 2TB in size, among other improvements. However, as with any Shiny New Thing, it introduces complexities.
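For the curious, the 2TB figure falls out of simple arithmetic. The sketch below assumes traditional 512-byte sectors, and relies on the fact that the MS-DOS label stores sector addresses in 32 bits while GPT uses 64 bits; the function name is just for illustration.

```python
SECTOR_SIZE = 512   # bytes; assumption: traditional 512-byte logical sectors
MSDOS_LBA_BITS = 32  # width of sector addresses in an MS-DOS partition table
GPT_LBA_BITS = 64    # width of sector addresses in a GPT


def max_addressable_bytes(lba_bits: int, sector_size: int = SECTOR_SIZE) -> int:
    """Largest size that can be described with lba_bits-wide sector counts."""
    return (2 ** lba_bits) * sector_size


msdos_limit = max_addressable_bytes(MSDOS_LBA_BITS)
gpt_limit = max_addressable_bytes(GPT_LBA_BITS)

print("MS-DOS label limit:", msdos_limit // 2 ** 40, "TiB")
print("GPT limit:", gpt_limit // 2 ** 70, "ZiB")
```

With larger sector sizes the MS-DOS limit moves up accordingly, but the 32-bit ceiling is still there.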
With the old MS-DOS disk label / MBR system, there was a space behind the MBR on the disk in which a bootloader could expand its second stage. With GPT disk labels, there is no such space.
GPT is associated with the new EFI system firmware specification, which will eventually replace BIOS. On EFI-based systems, bootloaders can be installed into the EFI system partition. On BIOS-based systems, however, there is no such partition. Consequently, if you want to boot from a GPT-labelled disk on a BIOS-based (rather than EFI-based) system, it is necessary to put a special ‘BIOS boot partition’ on the disk. grub2 will use this partition to contain its second stage.
The upshot of this is that, in Fedora 16, there is a tighter relationship between bootloader installation and partitioning than was previously the case. With grub and MS-DOS disk labels, it was never the case that you needed to partition a drive in a specific way in order to install a bootloader to its MBR. With grub2 and GPT disk labels, this is the case.
The Fedora installer, anaconda, turns out to have been designed with the assumption that partitioning and bootloader installation do not need to be closely related. This bug essentially encapsulates the problems that emerge when such an installer is used in a situation in which the two are closely related. It’s notoriously difficult to work out all the implications of implementing a Shiny New Thing like the new disk label system and bootloader used by Fedora 16 in advance; pre-release testing exists in part to identify these kinds of issues.
Essentially, we discovered various cases in which anaconda would not allow you to install the bootloader in the logical and desired location, or would even install it in an inappropriate location silently and automatically.
anaconda has a test it runs to determine if a device is a valid one on which to place a bootloader’s first stage. This test fails if the device in question is the MBR of a GPT-labelled drive, and the system is BIOS-based. Now, anaconda’s logic had already been fortified to handle the case where such a drive was the only one being used in the installation: anaconda would realize the drive needed a BIOS boot partition, and either create one automatically (if automatic partitioning was in use) or prompt the user to create one (in manual partitioning).
However, the logic proved to be unequal to the case where multiple drives (of various types) were available to the installer. In the first noted case, anaconda would sometimes consider the USB stick from which it was being installed (if the user had written the live image, or DVD image, to a USB stick) as a potential bootloader target. In another case, the user might have two hard disks, one with a working OS installation and an MS-DOS disk label, and the other about to be formatted and have Fedora 16 installed on it.
In both of these cases, anaconda would run its ‘is this a valid bootloader location?’ test on both the actual target installation drive, and the ‘other’ drive – the USB stick in one case, the ‘working OS’ drive in the other. This test would often pass – after all, the drive would likely have an MS-DOS disk label, and hence wouldn’t need any special handling to have a bootloader written to it.
Because it had found one drive which looked like a valid bootloader location, anaconda wouldn’t insist on the creation of a BIOS boot partition on the actual target drive for Fedora 16 installation.
If you selected manual partitioning, Fedora would prompt you for a bootloader location after package installation. If you encountered this bug, you would find that the MBR of the target drive was not an available choice – because no BIOS boot partition had been created on the target drive. You could only choose to install the bootloader to the root or /boot partition of the target drive, or to the MBR of the ‘other’ drive. Neither of these was likely to be what you actually wanted to do.
If you selected a form of automatic partitioning, the impact was even worse – anaconda would automatically (and silently) install the bootloader to the MBR of the ‘other’ drive. So if you installed Fedora 16 Beta RC1 to the second drive of a system whose first drive contained a working Fedora 15 installation, the first drive’s bootloader would be overwritten with a new grub2 bootloader, and the second drive (where you installed Fedora 16) would get no bootloader at all. This configuration might well work, but it would not be what you wanted. If you hit the USB key case, Fedora would install the bootloader to the MBR of the USB key, leaving you confused when you booted the system without the key attached and found it failed to boot (or plugged the key in and found it had grown a bootloader you did not expect).
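The logic described above can be modelled roughly as follows. To be clear, this is not anaconda's actual code: `Drive`, `valid_stage1` and the other names are invented for illustration, the target drive is assumed to get a GPT label, and the ‘strict’ variant is just one conceivable alternative, not the fix that was eventually chosen.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Drive:
    """Hypothetical stand-in for a disk the installer can see."""
    name: str
    label: str                  # "msdos" or "gpt"
    has_bios_boot: bool = False
    install_target: bool = False


def valid_stage1(drive: Drive, bios: bool = True) -> bool:
    """On BIOS, the MBR of a GPT disk is only usable with a BIOS boot partition."""
    if bios and drive.label == "gpt":
        return drive.has_bios_boot
    return True


def needs_bios_boot_buggy(drives: List[Drive]) -> bool:
    """Models the bug: satisfied as soon as ANY visible drive passes the test."""
    return not any(valid_stage1(d) for d in drives)


def needs_bios_boot_strict(drives: List[Drive]) -> bool:
    """One possible stricter check (an assumption): only look at install targets."""
    return not any(valid_stage1(d) for d in drives if d.install_target)


def auto_bootloader_drive(drives: List[Drive]) -> Optional[str]:
    """Models automatic partitioning silently picking the first valid MBR."""
    for d in drives:
        if valid_stage1(d):
            return d.name
    return None


# Two-disk case: sda holds a working Fedora 15, sdb is the Fedora 16 target.
two_disks = [
    Drive("sda", "msdos"),
    Drive("sdb", "gpt", install_target=True),
]
# Single-disk case, which the installer already handled correctly.
one_disk = [Drive("sda", "gpt", install_target=True)]

print("buggy check asks for BIOS boot partition:", needs_bios_boot_buggy(two_disks))
print("strict check asks for BIOS boot partition:", needs_bios_boot_strict(two_disks))
print("bootloader silently goes to:", auto_bootloader_drive(two_disks))
```

In the two-disk case the buggy check is satisfied by `sda`, so no BIOS boot partition is requested on `sdb` and the automatic choice lands on the wrong drive. The USB stick case is the same shape, with the stick playing the part of `sda`.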
Again, this bug proved somewhat tricky to pin down, as it’s easy not to hit it, or to hit it in various circumstances whose results seem quite different. It was reported shortly after the Beta RC1 release, but only completely diagnosed on 2011-09-19, which was already likely too late to meet the release schedule. It’s possible that we could have caught and diagnosed this bug earlier, but I’m not confident in stating that we ought to have done. It’s also not straightforward to decide how to fix it, and the fix we eventually decided upon is quite drastic and will require extensive testing.
This bug was only discovered quite recently. It’s not as obviously critical as the previous few, but it is a clear violation of the Beta criteria. QA could clearly have performed better here by doing the desktop validation testing more promptly: this bug would easily have been exposed by one of the desktop validation tests. However, a lot of our attention and resources were diverted to the more obvious bugs. On balance, we ought to have identified this bug earlier, but doing so would not actually have made a material difference to the release, as the other bugs would still have been present.
I can definitely see areas in the bugs we hit, and the speed with which we were able to discover and diagnose them, where the QA team can improve our performance. However, I think ultimately the slip would have been very difficult to avoid; the SELinux and bootloader bugs were simply too complex to diagnose and fix in time, even though we did discover them with several days to spare in the schedule. The SELinux bug is one of those perfect storms of a somewhat obscure bug and several complicating circumstances which seem to arise once in a while and seem to be essentially impossible to design out of a six-month release cycle. The bootloader issue is a consequence of the kind of major change which Fedora exists to make: it’s unfortunate, but again, quite difficult to avoid. It is very, very difficult for a large and complex project like anaconda to figure out such a consequence of the grub2/GPT change in advance. It would have been possible for QA to discover and diagnose it sooner, though, and we will be extending our validation test case set to include some test cases which ought to aid in the discovery of similar cases in the future.
Well, I hope that helped to clarify some of the considerations that go into making (and delaying) a Fedora release!
As of this month, I am Red Hat’s Fedora QA team lead, which means I’m responsible for directing the efforts of the Red Hat staff who form a part of the Fedora QA team. Combined with my existing community manager role, this means I’m probably in the position of being the ‘point person’ for Fedora QA, responsible for ensuring we do our job of validating releases and pre-releases comprehensively and promptly. I’m still learning in the role and definitely feel that I’m not quite living up to the high standard set by James Laska yet, but I’m hoping to learn the lessons from each pre-release as we go and try to do a better job of ensuring the team gets the necessary testing done efficiently. Any errors or sub-par performance by the QA team in this release should be considered to rest on my shoulders, not those of the other RH staff and community members who make up the team, all of whom did sterling work. In particular I’d be remiss not to thank (if I may presume to do so) Tim Flink, Andre Robatino, Athmane Madjouj, Jóhann Guðmundsson, Mads Killerich, Thomas Gilliard, Dennis Gilmore, and the anaconda and selinux teams for going far above and beyond the call of duty in trying to get the release out on time. I know we all gave it our best shot.
We’re pretty confident that we’ll be able to get the Beta out with just the one week’s delay, so look out for it on 2011-10-04. It should be an exciting release, and thanks to this strict release validation process, it may be a bit late, but it should be a pretty solid Beta.