The future of Fedora QA

Adam Williamson

2017-02-12 09:33

Welcome to version 2.0 of this blog post! This space was previously occupied by a whole bunch of long-winded explanation about some changes that are going on in Fedoraland, and are going to be accelerating (I think) in the near future. But it was way too long. So here's the executive summary!

First of all: if you do nothing else to get up to speed on Stuff That's Going On, watch Ralph Bean's Factory 2.0 talk and Adam Samalik's Modularity talk from Devconf 2017. Stephen Gallagher's Fedora Server talk and Dennis Gilmore's 'moving everyone to Rawhide' talk are also valuable, but please at least watch Ralph's. It's a one-hour overview of all the big stuff that people really want to build for Fedora (and RH) soon.

To put it simply: Fedora (and RH) don't want to be only in the business of releasing a bunch of RPMs and operating system images every X months (or years) any more. And we're increasingly moving away from the traditional segmented development process where developers/package maintainers make the bits, then release engineering bundles them all up into 'things', and then QA looks at the 'things' and says "er, it doesn't boot, try again", and we do that for several months until QA is happy, then we release it and start over. There is a big project to completely overhaul the way we build and ship products, using a pipeline that involves true CI, where each proposed change to Fedora produces an immediate feedback loop of testing and the change is blocked if the testing fails. Again, watch Ralph's talk, because what he basically does is put up a big schematic of this entire system and go into a whole bunch of detail about his vision for how it's all going to work.

As part of this, some of the folks in RH's Fedora QA team whose job has been to work on 'automated testing' - a concept that is very tied to the traditional model for building and shipping a 'distribution', and just means taking some of the tasks assigned to QA/QE in that model and automating them - are now instead going to be part of a new team at Red Hat whose job is to work on the infrastructure that supports this CI pipeline. That doesn't mean they're leaving Fedora, or we're going to throw away all the work we've invested in the components of Taskotron and start all over again, but it does mean that some or all of the components of Taskotron are going to be re-envisaged as part of a modernized pipeline for building and shipping whatever it is we want to call Fedora in the future - and also, if things go according to plan, for building and shipping CentOS and Red Hat products, as part of the vision is that as many components of the pipeline as possible will be shared among many projects.

So that's one thing that's happening to Fedora QA: the RH team is going to get a bit smaller, but it's for good and sensible reasons. You're also not going to see those folks disappear into some kind of internal RH wormhole; they'll still be right here working on Fedora, just in a somewhat different context.

Of course, all of this change has other implications for Fedora QA as well, and I reckon this is a good time for those of us still wearing 'Fedora QA' hats - whether we're paid by Red Hat or not - to be reconsidering exactly what our goals and priorities ought to be. Much like with Taskotron, we really haven't sat down and done that for several years. I've been thinking about it myself for a while, and I wouldn't say I have it all figured out, but I do have some thoughts.

For a start, I think we should be looking ahead to the time when we're no longer on what the anaconda team used to call 'the blocker treadmill', where a large portion of our working time is eaten up by a more or less constant cycle of waking up, finding out what broke in Rawhide or Branched today, and trying to get it fixed. If the plans above come about, that should happen a lot less, for a couple of reasons: firstly, Fedora won't just be a project which releases a bunch of OS images every six months any more, and secondly, distribution-level CI ought to mean that things aren't broken all the damn time any more. In an ideal scenario, a lot of the basic fundamental breakage that, right now, is still mostly caught by QA - and that we spend a lot of our cycles on dealing with - will just no longer be our problem. In a proper CI system, it becomes truly the developers' responsibility: developers don't get to throw in a change that breaks everything and then wait for QA to notice and tell them about it. If they try to send a change that breaks everything, it gets rejected, and hopefully, the breakage never really 'happens'.

Sadly (or happily, given I still have a mortgage to pay off) this probably doesn't mean Project Colada will finally be reality and we all get to sit on the beach drinking cocktails for the rest of our lives. CI is a great process for ensuring your project basically works all the time, but 'basically works' is a long way from 'perfect'. Software is still software, after all, and a CI process is never going to catch all of the bugs. Freeing QA from the blocker treadmill lets us look up and think, well, what else can we do?

To be clear, I think we're still going to need 'release validation'. In fact, if the bits of the plan about having more release streams than just 'all the bits, every six months' come off, we'll need more release validation. But hopefully there'll be a lot more "well, this doesn't quite work right in this quite involved real-world scenario" and less "it doesn't boot and I think it ate my cat" involved. For the near future, we're going to have to keep up the treadmill: bar a few proofs of concept and stuff, Fedora 26 is still an 'all the bits, every six months' release, and there's still an awful lot of "it doesn't boot" involved. (Right now, Rawhide doesn't even compose, let alone boot!) But it's not too early to start thinking about how we might want to revise the 'release validation' concept for a world where the wheels don't fall off the bus every five minutes. It might be a good idea to go back to the teams responsible for all the Fedora products - Server, Workstation, Atomic et al. - and see if we need to take another good look at the documents that define what those products should deliver, and the test processes we have in place to try and determine whether they deliver them.

We're also still going to be doing 'updates testing' and 'test days', I think. In fact, the biggest consequence of a world where the CI stuff works out might be that we are free to do more of those. There may be some change in what 'updates' are - it may not just be RPM packages any more - but whatever interesting forms of 'update' we wind up shipping out to people, we're still going to need to make sure they work properly, and manual testing is always going to be able to find things that automated tests miss there.

I think the question of to what extent we still have a role in 'automated testing' and what it should be is also a really interesting one. One of the angles of the 'more collaboration between RH and Fedora' bit here is that RH is now very interested in 'upstreaming' a bunch of its internal tests that it previously considered to be sort of 'RH secret sauce'. Specifically, there's a set of tests from RH's 'Platform QE' team which currently run through a pipeline using RH's Beaker test platform which we'd really like to have at least a subset of running on Fedora. So there's an open question about whether and to what extent Fedora QA would have a role in adapting those tests to Fedora and overseeing their operation. The nuts and bolts of 'make sure Fedora has the necessary systems in place to be able to run the tests at all' is going to be the job of the new 'infrastructure' team, but we may well wind up being involved in the work of adapting the tests themselves to Fedora and deciding which ones we want to run and for what purposes. In general, there is likely still going to be a requirement for 'automated testing' that isn't CI - it's still going to be necessary to test the things we build at a higher level. I don't think we can yet know exactly what requirements we'll have there, but it's something to think about and figure out as we move forward, and I think it's definitely going to be part of our job.

We may also need to reconsider how Fedora QA, and indeed Fedora as a whole, decides what is really important. Right now, there's a pretty solid process for this, but it's quite tied to the 'all the things, every six months' release cycle. For each release we decide which Fedora products are 'release blocking', and we care about those, and the bits that go into them and the tools for building them, an awful lot more than we care about anything else. This works pretty well to focus our limited resources on what's really important. But if we're going to be moving to having more and more varied 'Fedora' products with different release streams, the binary 'is it release blocking?' question doesn't really work any more. Fedora as a whole might need a better way of doing that, and QA should have a role to play in figuring that out and making sure we work out our priorities properly from it.

So there we go! I hope that was useful and thought-provoking. We've got a QA meeting coming up tomorrow (2017-02-13) at 1600 UTC where I'm hoping we can chew these topics over a bit, just to serve as an opportunity to get people thinking. Hope to see you there, or on the mailing list!