PSA: Failure to boot after kernel update on Skylake systems

Hi folks! In the last couple of days a significant issue in all Fedora releases has come to our attention, affecting several systems that use the Intel 'Skylake' hardware platform. Systems that appear to be affected so far (at least with some system firmware versions) include: Lenovo ThinkPad T460, Lenovo ThinkPad X260, Lenovo Yoga 260, ASUS Zenbook UX305CA, ASUS Zenbook UX303UB, Samsung Notebook 9.

The problem looks like this: you install a kernel update, and the new kernel fails very early in the boot process (right after the boot loader). Older kernels boot fine.

For those who don't want to read much, there are a few workarounds available if this is the bug you're hitting:

  1. Boot an older kernel.
  2. Boot with the kernel parameter dis_ucode_ldr.
  3. Downgrade the microcode_ctl package to version 2.1-11 or earlier and force an initramfs rebuild, either with dracut or by reinstalling the packages for the kernels that don't boot.
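As a rough sketch, workarounds 2 and 3 could look like the shell functions below. The function names are made up for illustration, the kernel version in the example is just the one mentioned in the comments on this post, and the sketch assumes your repositories still carry a microcode_ctl build of 2.1-11 or earlier for dnf to downgrade to. For a one-off boot you can also press 'e' in the GRUB menu and append dis_ucode_ldr to the kernel command line by hand.

```shell
#!/bin/sh
# Sketch only: function names are illustrative, not part of any Fedora tool.

# Path of the initramfs image for a kernel version (usual Fedora naming).
initramfs_path() {
    printf '/boot/initramfs-%s.img\n' "$1"
}

# Workaround 2: add dis_ucode_ldr to the boot entries of all installed kernels.
disable_ucode_loader() {
    grubby --update-kernel=ALL --args="dis_ucode_ldr"
}

# Workaround 3: downgrade microcode_ctl, then rebuild the initramfs for one kernel.
# Assumes the repos offer a build older than 2.1-12; check the result with
# "rpm -q microcode_ctl" before rebuilding.
downgrade_and_rebuild() {
    kver="$1"
    dnf downgrade microcode_ctl &&
    dracut --force "$(initramfs_path "$kver")" "$kver"
}

# Example, run as root after booting an older kernel:
#   downgrade_and_rebuild 4.6.5-200.fc23.x86_64
```

Nothing runs until you call one of the functions, so you can read through the file before using it.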

You may also find that a system firmware update is available from your system manufacturer, and updating the system firmware makes the bug go away. So do please check with your manufacturer's site and try updating your system firmware if there's an update available.

We're sorry for the inconvenience, and we're looking at better fixes at present.

The story behind this bug is that it's not actually a kernel bug at all. It's a bug in microcode_ctl. This is a package which contains both processor 'microcode' updates for Intel processors and a loader for such updates. You can think of processor microcode as being kind of like firmware for your processor; this mechanism lets Intel correct bugs and improve behaviour in processors after they've been released and shipped out. It also occasionally lets them break stuff, like in this case. :)

The way this mechanism works on most Linux distros (this bug is affecting other distros as well as Fedora, btw) is that if there's a microcode update for the CPU in your system at the time an initramfs is built, the update and a loader mechanism for it are built into the initramfs in such a way that they load very early during initramfs initialization, which is as early in the boot process as we can manage. If there is no microcode update for your CPU, you get an initramfs without this trickery.

Prior to microcode_ctl-2.1-12 there was no microcode update for the affected CPUs. microcode_ctl-2.1-12 has added a microcode update for these CPUs, so all initramfs'es built after microcode_ctl is updated to that version will include the update. The bug seems to be that on some system firmwares, the microcode load fails and hangs the system. On other system firmwares the microcode loads fine and the system boots.
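If you want to check which side of that line your installed package is on, something like the following might do. It's a deliberately simplified sketch: the function name is made up, and it only understands the 2.1-N version series discussed here, not RPM version comparison in general.

```shell
#!/bin/sh
# Succeeds if a microcode_ctl VERSION-RELEASE string (e.g. "2.1-12.fc23")
# is 2.1-12 or later, i.e. a build that carries the new microcode update.
# Simplification: handles only the 2.1-N series; anything else returns 2.
carries_skylake_update() {
    case "$1" in
        2.1-*) ;;
        *) return 2 ;;
    esac
    rel=${1#2.1-}     # "12.fc23"
    rel=${rel%%.*}    # "12"
    [ "$rel" -ge 12 ]
}

# Example:
#   carries_skylake_update "$(rpm -q --qf '%{VERSION}-%{RELEASE}' microcode_ctl)" \
#       && echo "new initramfs builds will include the microcode update"
```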

The reason the bug appears when you update your kernel - rather than when you update microcode_ctl - is simply that updating microcode_ctl does not trigger any initramfs rebuilds; your existing installed kernels will still have initramfs'es with no microcode update loader mechanism. But when you install a new kernel, a new initramfs is generated for it, and now it will include the microcode update, and thus hit the bug.
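To see which of your installed kernels have an initramfs with the microcode update built in, you could inspect the images with lsinitrd (shipped with dracut). This is a sketch under two assumptions: that lsinitrd lists the contents of the early CPIO image, and that the Intel blob lives at the usual path kernel/x86/microcode/GenuineIntel.bin. The function names are made up, and reading images under /boot generally needs root.

```shell
#!/bin/sh
# Reads an lsinitrd listing on stdin; succeeds if it mentions the Intel
# early-microcode blob (path is an assumption, see above).
listing_has_intel_microcode() {
    grep -q 'kernel/x86/microcode/GenuineIntel\.bin'
}

# For every initramfs under a directory (default /boot), report whether
# its listing contains the blob.
report_initramfs_microcode() {
    dir="${1:-/boot}"
    for img in "$dir"/initramfs-*.img; do
        [ -e "$img" ] || continue
        if lsinitrd "$img" 2>/dev/null | listing_has_intel_microcode; then
            echo "$img: microcode included"
        else
            echo "$img: no microcode"
        fi
    done
}

# Example, run as root:
#   report_initramfs_microcode
```

On an affected system you'd expect the kernels that still boot to show up as "no microcode" and the newly installed one as "microcode included".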

This is why you can work around the issue by downgrading microcode_ctl and then regenerating the initramfs for affected kernels. It also means that if, after microcode_ctl has been updated, you regenerate the initramfs for a kernel that was working fine, that kernel will stop working on an affected system.

The dis_ucode_ldr kernel parameter simply disables this microcode loading mechanism, so the update is never loaded and the bug is avoided.
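Once the system is up, you can confirm the parameter actually made it onto the kernel command line. A small sketch (the function name is made up):

```shell
#!/bin/sh
# Succeeds if a kernel command line string contains dis_ucode_ldr as a
# whole word.
cmdline_disables_ucode_loader() {
    case " $1 " in
        *" dis_ucode_ldr "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example:
#   cmdline_disables_ucode_loader "$(cat /proc/cmdline)" \
#       && echo "microcode loader disabled for this boot"
```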

Comments

Sérgio Basto wrote on 2016-07-27 23:30:
microcode_ctl-2.1-13.fc23.x86_64.rpm in updates-testing solves anything ?
Donny D wrote on 2016-07-28 15:17:
@ sergio, I have a skylake, and even the newest from rawhide does not solve the issue. I think it might be a mix of Dell dragging their feet for my Precision 7510, and this bug. Thanks to Adam for posting this.
Sérgio Basto wrote on 2016-07-28 20:40:
yeah, my Dell Latitude E6410 has some strange problems with microcode_ctl-2.1-13, so I'm reverting to microcode_ctl-2.1-10:
  zcat /var/log/dnf.rpm.log*gz | grep microcode_ctl
  Jun 29 05:54:55 INFO Upgraded: microcode_ctl-2:2.1-12.fc23.x86_64
  Jun 29 06:00:40 INFO Cleanup: microcode_ctl-2:2.1-10.fc23.x86_64
Sérgio Basto wrote on 2016-07-30 15:19:
after going back to microcode_ctl-2.1-10.fc23.x86_64 and regenerating /boot/initramfs- with it, kernel-4.6.5-200.fc23.x86_64 doesn't oops anymore.
Donny D wrote on 2016-08-16 16:12:
The latest update today on FC24 with kernel-4.6.6-300.fc24.x86_64 seems to have resolved my issue on the Dell Precision 7510. I will keep testing, and post an update in a couple days.
Donny D wrote on 2016-08-16 16:16:
Looks like Dell has also posted an update for the Precision 7510 in relation to microcode:
http://www.dell.com/support/home/us/en/04/Drivers/DriversDetails?driverId=2D61C&fileId=3559587173&osCode=BIOSA&productCode=precision-m7510-workstation&languageCode=en&categoryId=BI

Fixes & Enhancements

Fixes:
  - Resolved Intel soft Guard Extension driver install issue.
  - Resolved boot time problem when Hyperthreading is disabled.
  - Resolved Dell Thunderbolt Dock issues.

Enhancements:
  - Support Win10 Enterprise.
  - Updated TI Power Delivery firmware.
  - Update Realtek USB LAN firmware.
  - Updated Embedded Controller firmware.
  - Update Intel Processor Micro code. <<<--- RIGHT HERE
  - Added embedded LAN MAC address display item in BIOS setup menu.
  - Added Attempt legacy boot item in BIOS setup menu.
  - Removed Data wipe support when NVMe SSD attached.
  - Updated suspend/resume protections.
Donny D wrote on 2016-08-23 14:03:
OK, so after the BIOS update and the newest patches my Skylake system no longer has the microcode issue and 25 out of the last 25 boots have been successful.
chiyong wrote on 2016-11-07 08:53:
I'm also getting some problems with this micro code.
Cantam James wrote on 2017-10-03 09:06:
I am also getting the same error. Micro code not running properly.