June 16th, 2009
So, here’s the deal on that. I’ve been explaining it in comments and so on, but I thought it would be worth a recap blog post. The reason there’s an unusually high number of bugs in the storage code in anaconda is simple: it was entirely rewritten for Fedora 11. Here’s the feature page explaining this.
As you can see from the feature page, this is a pretty complex area: there are multiple filesystems, things like RAID and LVM, hardware issues, and desired and pre-existing partition layouts to consider. There are a ton of variables, in other words, and many of their combinations are ones we are unlikely ever to hit in internal testing. We (the QA group) could test installs forever and a day and probably not hit some of the situations that get hit very rapidly once code gets out to the real world.
Some questions arise. Why was a rewrite needed in the first place? This is explained on the feature page. Basically, the old code was very old and was not written to be easily extensible, which made it very hard to add support for interesting new things like LUKS and iSCSI. We wanted a more modular code base so these and future desirable storage-related innovations could be handled properly. The longer you delay a necessary rewrite, the more pain is involved, so it made sense to do it as soon as the need was identified.
Why was the rewrite put into the main Fedora release so soon? It would have been possible to keep the ‘old’ and ‘new’ tracks separate, maintain both, ship Fedora 11 with the ‘old’ code, and maybe only bring the ‘new’ code into Fedora 12. There are a couple of reasons we didn’t do this.
First, as noted above, many situations just aren’t tested until the code gets out. Pretty much all the situations we can test internally actually work in F11’s anaconda. Our test matrix is full of green check marks. So even if we’d delayed the new storage code until F12, it probably wouldn’t have had many more problems fixed than it does at present. We have to find out about the problems before we can fix them.
Second, this would have used up (or, in many ways, wasted) rather a lot of developer resources. We would have had to split the anaconda team and have some of them work on maintaining the old code for the F11 release. This work would have been essentially wasted as that code was destined for the trash, and it would have been that many man hours diverted from work on the new code. So we decided to push the new branch into the main codebase relatively quickly and have it released in F11.
The final question is more of a rant I’m seeing a lot of: why do you guys suck so much? Why are there so many bugs? Surely the coders must just be lazy / incompetent if they couldn’t fix $MY_ISSUE before release?
Short answer – well, no, they’re not. First, here’s a solid number for you: from January 1st to now, a total of 332 bugs were filed on anaconda in Rawhide (so F11 at the time) and fixed. Here’s the list. Just as a quick ballpark comparison, the number for the kernel over the same period is 122. So it’s certainly not the case that the anaconda team are a bunch of lazy asses; they’ve been working their behinds off on bug fixing throughout the F11 cycle.
Are they incompetent? No, they’re not that either. As I mentioned at the top, storage during installation is an innately tricky area, and Fedora has to support a _lot_ of different scenarios here (especially as more or less the same code is used in RHEL, which has a lot of fairly robust requirements in this area). A lot of the variables have a huge range of possible values (previously existing partition layout, for example), so what looks like just a ‘perfectly standard installation’ could actually have several thousand variations for different sets of data that previously exist on the disk and different hardware. When you’re working from scratch to write code to handle a situation this icky, it’s just inevitable that bugs happen. It’s unlikely that any set of coders could have produced a significantly different result.
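To put a rough number on that ‘several thousand variations’ point, here’s a quick back-of-the-envelope sketch in Python. The variables and their values are made up purely for illustration (they are hypothetical names, not anaconda’s real test matrix), but they show how quickly a handful of independent choices multiplies out:

```python
from itertools import product

# Hypothetical, illustrative values only: NOT anaconda's real test matrix.
VARIABLES = {
    "filesystem": ["ext3", "ext4", "xfs", "btrfs"],
    "volume_layer": ["plain", "lvm", "raid", "lvm-on-raid"],
    "encryption": ["none", "luks"],
    "disk_transport": ["local", "iscsi"],
    # Stand-in for the many possible pre-existing partition layouts.
    "existing_layout": [f"layout-{i}" for i in range(50)],
}


def count_scenarios(variables):
    """Multiply the number of options per variable to get the total."""
    total = 1
    for values in variables.values():
        total *= len(values)
    return total


def iter_scenarios(variables):
    """Yield every combination as a dict of variable name -> chosen value."""
    names = list(variables)
    for combo in product(*variables.values()):
        yield dict(zip(names, combo))


if __name__ == "__main__":
    print(count_scenarios(VARIABLES))
```

With these invented numbers, just five variables already give 3,200 distinct installs, and that’s before you start counting differences in hardware.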
In conclusion: we knew this was going to happen, and we went into it with our eyes open. We knew there’d be regressions. That’s regrettable, and it’s fine to criticize this, mark F11 down for it in reviews, and warn readers that the installer’s storage code is a bit problematic and they may hit issues here. That’s all perfectly true and valid and fair. What I wanted to address with this post are just the questions of why this is the case, why it’s not because we just suck at what we do, and why we went ahead and did it even though we knew it would cause some level of pain. Hope it’s been useful.
And of course: FILE BUGS! When anaconda fails, it usually gives you a dialog box with a traceback of the issue, and lets you save a copy. Please do so, and file a bug explaining where it failed, what choices you made during installation, and ideally the previous partition layout of your system. Include the traceback as an attachment. If you don’t report the problems, they don’t get fixed. Thanks!