PSA: Don't fedup to Fedora 21 right now (EDIT: you can now!)

EDIT 2014-11-05: It's fine to fedup to F21 now - at least so far as I know, and so far as the bug described in this post is concerned. We've made sure, in several different ways, that you shouldn't be able to hit the 15 minute timeout bug.

It's probably not a good idea to try and upgrade to Fedora 21 with fedup right now.

Currently Fedora 21 has a build of systemd that includes a new feature that was added upstream after the release of 216, which is intended to time out system startup if it's not complete after 15 minutes - the idea being to avoid things like your laptop melting / starting a fire in your bag if it gets accidentally powered on, stuff like that.

Unfortunately, turns out that having a timeout that hard powers down the system if boot hasn't completed after 15 minutes doesn't work very well with fedup, because while fedup's actual 'install the updated packages' step is running, systemd considers that boot has not 'completed'. So if you try and fedup to Fedora 21 using a fedup environment that has the affected systemd build (like the one in the Beta tree, and also in the current 21 'stable' tree), and your 'install updated packages' boot takes more than 15 minutes, it'll just suddenly cut off and shut down. Obviously, there's quite a high chance that'll leave the system in a broken state.

So: don't do it. Really, don't.

We're currently investigating the best way to deal with this problem, and we'll certainly try to have it all straightened out by Beta release date (Tuesday). But of course, it's never a good idea to upgrade a production system to a pre-release, especially if you don't have good backups!

Comments

Matthew Miller wrote on 2014-11-01 14:02:
Adam, does this also affect offline updates if the system has not been updated in a while and there is a lot to do?
adamw wrote on 2014-11-01 15:52:
That's an interesting question! I hadn't thought of that one. It's possible it does. I'll look into it today (weekends, har.)
adamw wrote on 2014-11-01 20:10:
so I'm afraid I can't quite tell for sure. systemd doesn't log specifically when it decides the startup has got far enough and it's disabling the timer, afaics. I'm not sure I can slow down a VM enough to make the offline update install take 15 minutes. If you log in on tty2 while the update is running you get the 'systemd is starting up' message from pam-nologin, but 'systemctl status' shows 'running', which is different from the fedup case where it shows 'initializing', and 'systemctl status basic.target' shows that it's been reached. It's a bit tricky to see all the timeouts the upstream commit set up and exactly when they expire, at least for this dumbass monkey; I'd have to look at the commit with a bit more context. It might be best just to ask a systemd dev.
sheepdestroyer wrote on 2014-11-01 14:30:
Oops, just a little too late for me. I got a system half upgraded with a lot of duplicated packages. Still able to boot but not log in to GNOME. Enlightenment was still fine though, so I could do a backup *after* the failed upgrade. Lucky in my misfortune :) It was a good excuse to format and install fresh from the Beta-RC4 DVD iso for once (I'd been lazily upgrading each time since Fedora 16)
Daniel Miranda wrote on 2014-11-01 16:22:
What about adding an override for basic.target in /etc/systemd/system to make sure the upgrade works, and removing it after it is done? Should work both as a manual step for now and as an automated step for the installer.
adamw wrote on 2014-11-01 19:18:
fedup doesn't actually use basic.target at all. See /usr/lib/dracut/modules.d/90system-upgrade/README.txt . I already have an approach that works (see the latest comments in the bug), just up to Will whether he likes it or not.
Daniel Miranda wrote on 2014-11-01 19:23:
Right, it seems there is both a timeout for some targets and a global startup timeout. Removing the former would obviously not help with the latter.
Robin wrote on 2014-11-02 01:45:
Does that mean you don't have any kind of continuous integration for testing whether upgrading works at all? Something like this should really be caught by automated testing, and it worries me a bit that it wasn't AFAICS.
adamw wrote on 2014-11-02 04:10:
No. 'CI' for a distribution is actually a pretty difficult thing to do, it's much much harder than for a single software project. OpenSUSE has probably the most advanced automated testing for a distro, but I don't know if even theirs does upgrade tests. Fedora's automated QA system is Taskotron - https://fedoraproject.org/wiki/Taskotron - but so far it does package tests, it doesn't do installation / upgrade tests. The previous system - AutoQA - briefly did Rawhide install tests, but it didn't stay working very long; like I said, it's a difficult thing to implement. Again, the bug is only apparent if the upgrade install step takes more than 15 minutes, so there's no guarantee automated testing would have caught it even if we had it; you usually keep automated tests simple so they're less likely to break, and a simple upgrade test is one with a small package set installed, and a small package set doesn't take long to upgrade.
Robin wrote on 2014-11-08 06:59:
I can understand, it's always more complicated than it looks from the outside. In general I've noticed that some projects struggle more with this (or don't see the need for it?) than what I'm used to from $dayjob, where doing CI is taken very seriously. Open source still seems to be catching up to this in some areas. Part of the problem is probably that it can be hard to get the infrastructure for a distributed project. On the other hand, I'm very excited by the work on GNOME Continuous :)! (Btw, I'm now on Fedora 21, it's great so far!)
adamw wrote on 2014-11-08 08:52:
Lots of F/OSS projects take CI seriously, and certainly a lot of Fedora / RH projects. You've seen the GNOME work, anaconda has some serious CI work going on, no-one really writes anything new without it any more (except me, because I'm an idiot monkey who can't code properly). But it really is the case that it's vastly more complex and problematic to apply the concept of 'CI' to a distribution than to a typical software project. I mean, it gets almost existential at points. What *is* a distribution? What does it *mean* to CI it? Strictly speaking, wouldn't that mean running every possible validation test we can think of every time anyone checked a single byte into any package, kickstart, comps...well, that's clearly impractical. But then you have to start deciding what it makes sense to test when, and then you have to start mapping out all the ways in which poking X can cause a change in Y and deciding which are the most significant / fragile ones, and all this is before you've written a damn line of code. And that's just *one* of the problems. Another major one with doing 'CI for a distribution' is it's just flat harder to test. For most software projects you can kinda stand up a test box and leave it running, and throw tests at it every so often. You can't really do that for a distribution - certainly not for something like 'validate the installer' or 'validate the upgrade process'. Your 'thing you run the test on' (let's call it a 'test client', that's what Taskotron calls it) needs to be basically blown away and rebuilt from scratch every time you do the test - to properly test an upgrade, you have to do a clean install of the release you're upgrading from and then run the upgrade. 'Disposable test clients' is what we call this, and it's what we're currently focusing on most heavily in Taskotron development, but it's not a particularly trivial problem to solve - not impossible, just involves a bunch of spadework. And so on, and so on...:)
yayo wrote on 2014-11-02 13:57:
What about fscking really fsck'd rootfs/other partitions, what if that takes too long? Is it time to panic yet?
adamw wrote on 2014-11-02 16:31:
it's ALWAYS time to panic! I don't want to answer in too much detail because, again, I haven't looked in detail at exactly what timeouts the upstream patch set and when they're deactivated. All I know for sure is it happens in the fedup case. There will be a systemd update for F21 soon which disables / improves the feature, obviously. fedup is a special case because of the 'install upgraded packages' step, which runs in a special environment which is actually a slightly customized initramfs. That initramfs, called upgrade.img, is built as part of each Fedora compose. The contents of each upgrade.img are then set and cannot be changed. The upgrade.img that's part of Beta RC4 has a copy of systemd with this feature and that can't be changed, so we need to work around it in a different way. For any other case 'fixing' this is a lot simpler and just involves changing systemd. If you're worried you can edit /etc/systemd/system.conf and add the line 'StartTimeoutAction=none'; that'll stop the startup timeout from doing anything when it's hit. Earlier and later systemd builds may complain when they see that line as it may be 'unknown' to them, but it won't hurt anything.
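For anyone who would rather script that system.conf edit than do it by hand, here is a minimal sketch in Python. It only works on the text of a config file passed in as a string and never touches /etc itself. The helper name set_manager_option is made up for illustration, and putting the line under a [Manager] section header is an assumption on top of the comment above, which only says to add the line to the file.

```python
def set_manager_option(conf_text, key, value):
    """Return conf_text with key=value set under [Manager].

    Hypothetical helper: the [Manager] placement is an assumption,
    the comment above only says to add the line to system.conf.
    """
    setting = f"{key}={value}"
    out = []
    in_manager = False
    done = False
    for line in conf_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("[") and stripped.endswith("]"):
            # Leaving [Manager] without having set the key: add it first.
            if in_manager and not done:
                out.append(setting)
                done = True
            in_manager = stripped == "[Manager]"
        elif in_manager and "=" in stripped and not stripped.startswith(("#", ";")):
            # Replace an existing, uncommented assignment of the same key.
            if stripped.split("=", 1)[0].strip() == key:
                if not done:
                    out.append(setting)
                    done = True
                continue
        out.append(line)
    if not done:
        # Either the file ended inside [Manager], or it has no such section.
        if not in_manager:
            out.append("[Manager]")
        out.append(setting)
    return "\n".join(out) + "\n"


# Made-up sample content, not a copy of any real system.conf.
sample = "[Manager]\n#LogLevel=info\n"
updated = set_manager_option(sample, "StartTimeoutAction", "none")
print(updated, end="")
```

Running the helper a second time on its own output leaves the text unchanged, so it should be safe to apply repeatedly. Actually writing the result back to /etc/systemd/system.conf is left out on purpose; keep a backup of the file if you do that.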
Howard Chu wrote on 2014-11-03 13:02:
re: testing a slow update - route to the update server over a 2G modem, or 19.2kbps dialup modem... Or use an update server serving off an SD card. etc...
adamw wrote on 2014-11-03 15:16:
Again, no use, because it's not the stage where updates are *downloaded* that needs to be slow, but the stage where they're *installed*.
Rik van Riel wrote on 2014-11-03 18:10:
Looks like this could break systems with a very long / slow fsck, too. Not a good idea for servers, either...
adamw wrote on 2014-11-03 18:49:
It's already getting reverted in systemd. fedup is the case that requires special handling because of how it works.
Patrick wrote on 2014-11-05 17:05:
This works now to upgrade to the first post-beta release candidate from 20 to 21:
yum update
yum install fedup
fedup --nogpgcheck --network 21 --product workstation --instrepo https://dl.fedoraproject.org/pub/alt/stage/21_TC1/Server/x86_64/os/
adamw wrote on 2014-11-05 17:10:
It should work without the explicit --instrepo now, I just need to test it and send out some announcements.