It’s not flaky!

“reliability” by mikecohen1872 is licensed under CC BY 2.0

This is a tech post, I promise. But first humour me, just for a minute. Imagine that six months ago you bought a new car. You saved hard and managed to get the one you wanted, with most of the features you were after, including the fancy fuel injection system. All is fine for three months, but then the fuel injection system starts to fail intermittently… 10% of the time. What do you do? You go to the garage where you bought the car and report the problem. The garage looks over the car, inspects the fuel injection system and tells you “it’s flaky”. Would you accept that as a summary of the problem? Would you want the mechanic to explain what they mean by flaky? Or would you feel comfortable driving the car away, hoping the problem doesn’t recur?

Interestingly, in everyday life we wouldn’t accept being told that something is flaky when it’s ultimately unreliable, particularly when we depend upon it. And yet the term has found its way into software delivery. It’s not a new phrase. I’ve been hearing it for years and I’m confident I’ve used it at some point in my career. More recently I’ve grown to dislike the term, and I’ve now reached the point where I need to express my distaste.

The basis of my distaste

1) It’s not ubiquitous: The term flaky has no ubiquitous, shared meaning. If I were to state that a particular automated CI build were flaky, what would that mean to you? It’s impossible to interpret that statement, other than to say it’s an abstraction of a genuine problem. It could be a problem with a particular build agent, a timing issue, a faulty script, a permissions issue, a bug in the application being compiled, and so on. Using the term flaky prevents a team from having a consistent understanding of either the symptoms or the cause.

2) Tolerance: Flaky, for one reason or another, seems to be more palatable than either faulty or unreliable. Maybe that’s because, as the conveyors of information, we prefer to generalise, and by generalising we avoid having to provide specific details. By avoiding specifics, we breeze past complicated investigations and discussions and thus avoid distraction from feature delivery. By stating that a CI build is flaky, for example, we avoid the detail, such as what the failure rate is and what (categorically speaking) the cause is. Alternatively, if we described the same CI build as unreliable or faulty, we’d all feel more inclined to fix it.
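
To show what that detail could look like, here’s a minimal sketch of putting a number on “flaky”: re-run the same suite repeatedly against an unchanged commit and record the failure rate. The test command and run count below are assumptions; substitute your own invocation.

```python
# A minimal sketch: run the same test command repeatedly against an
# unchanged commit and report a failure rate, instead of the label "flaky".
# TEST_CMD and RUNS are assumptions; swap in your own invocation.
import subprocess

TEST_CMD = ["pytest", "tests/"]  # hypothetical test command
RUNS = 20                        # assumed sample size

failures = 0
for _ in range(RUNS):
    result = subprocess.run(TEST_CMD, capture_output=True)
    if result.returncode != 0:
        failures += 1

print(f"{failures}/{RUNS} runs failed ({100 * failures / RUNS:.0f}% failure rate)")
```

A statement like “this suite fails 3 runs in 20” invites investigation in a way that “it’s flaky” never will.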

3) Entropy: Tolerance leads to entropy. Given a scenario where you rely on something that does not give you consistent results, and you become tolerant of it, rot occurs [see software entropy for more]. The one flaky test that you have today turns into two flaky tests tomorrow. Before you know it you’re excluding or deleting tests that were once valuable. Given enough pressure, an impending deadline, a heavy-handed leadership team, or a financially damaging bug, people will bypass or ignore a flaky test and release regardless. This can be catastrophic, since the reason the test historically failed may not be the reason it’s failing today. Intermittent failures can mask genuine bugs, which can have terrible consequences.
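
To illustrate how an intermittent failure can mask a genuine bug, here’s a hypothetical example of a test with a hidden wall-clock dependency. Everything in it is invented for illustration; the point is that the same red build can mean “nondeterministic test” one day and “real regression” the next.

```python
import datetime

def is_support_open(now=None):
    # Hypothetical production code with a hidden wall-clock dependency.
    now = now or datetime.datetime.now()
    return 9 <= now.hour < 17

def test_support_is_open():
    # Green or red depending on when CI happens to schedule it. Once the
    # team labels this test "flaky" and habitually re-runs it, a genuine
    # regression in is_support_open() produces the same red build, and
    # risks being waved through for the same reason.
    assert is_support_open()
```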

4) Unplanned work: Entropy leads to unplanned work, and unplanned work can lead to poor delivery, to the point where it can derail even high-performing teams. Flaky builds, tests, scripts, deployments, networks, servers and APIs often become a time sink and are essentially a distraction from planned work. The more complex the scenario, the more effort and time the unplanned work takes. Typically this kind of effort is unmonitored: the tasks involved don’t appear on any sprint or kanban boards, and the effort isn’t taken into account when planning future work. Unplanned work can be a project killer and is one of the common reasons why software delivery teams fail to deliver on business expectations.

5) Repetition and deterministic behaviour: Repetition is a given in our industry. We write code, test it, ship it, monitor it. And then we do it again. The thing about repetition, certainly in software delivery, is that you’re looking for consistent outcomes. When I run a build, I’m looking for a big green tick. If I can run that same build without a code change and get the same outcome each time, then I have a valuable, repeatable process and, more importantly, deterministic behaviour. If I run that same build n times, without any changes, and get inconsistent results, I’ve lost the value of repetition, lost predictability and lost any form of deterministic behaviour. Unpredictability in automated tools and processes kills trust and can, like entropy, mask underlying problems.
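
One way to turn that expectation into a check is to build the same commit twice and compare output digests. This is a sketch under assumptions: the build command and artifact path are placeholders for your own.

```python
# A minimal sketch of checking for deterministic build output: run the
# same build twice with no code changes and compare artifact digests.
# BUILD_CMD and ARTIFACT_PATH are assumptions for illustration.
import hashlib
import subprocess

BUILD_CMD = ["make", "build"]      # hypothetical build invocation
ARTIFACT_PATH = "dist/app.tar.gz"  # hypothetical build artifact

def build_and_digest() -> str:
    # Run the build, then fingerprint what it produced.
    subprocess.run(BUILD_CMD, check=True)
    with open(ARTIFACT_PATH, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

if build_and_digest() == build_and_digest():
    print("same inputs, same outputs: the build is repeatable")
else:
    print("same inputs, different outputs: the build is non-deterministic")
```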

6) Hope is not a strategy: Hoping an intermittent problem will just disappear won’t make it so. Yet we have a habit of hitting the button to trigger a re-execution, in the hope of a different outcome. An example I recently came across was a set of tests labelled as “flaky UI tests”. Further investigation made us realise that load tests being run by another team were leaving lots of data around post-execution. The load tests served an entirely different purpose, but their aftermath was having a negative impact on the response times of the system we were testing. Up until that investigation, our tests had just been labelled “flaky UI tests”. Unfortunately, re-running them could yield different outcomes because, as we later found out, the DBAs were monitoring the databases and cleaning up the load-test data on an ad hoc basis. This taught me that hoping a problem will vanish of its own volition is a poor strategy.
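
In hindsight, a cheap guard could have turned “flaky UI tests” into an actionable failure much sooner. The sketch below is hypothetical (the sqlite3 driver, the orders table and the threshold are all invented for illustration): before the UI suite runs, assert that the shared environment is in its expected baseline state.

```python
# A hypothetical pre-test guard: fail fast, with a specific message, when a
# shared environment isn't in its expected baseline state, instead of
# letting contamination surface later as mysteriously slow "flaky" tests.
import sqlite3  # stand-in for your real database driver

EXPECTED_MAX_ROWS = 10_000  # assumed baseline for the shared environment

def assert_clean_baseline(db_path: str) -> None:
    conn = sqlite3.connect(db_path)
    try:
        (count,) = conn.execute("SELECT COUNT(*) FROM orders").fetchone()
    finally:
        conn.close()
    if count > EXPECTED_MAX_ROWS:
        raise RuntimeError(
            f"'orders' holds {count} rows (expected <= {EXPECTED_MAX_ROWS}); "
            "possible leftover load-test data, aborting the UI suite"
        )
```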

Conclusion

The term flaky is certainly not only applied to tests and builds, though that’s the context in which I’ve heard it used most commonly. JetBrains has even created a feature in TeamCity to identify and highlight flaky tests. In contrast, someone recently referred to my client’s search feature as flaky, and I’ve heard it used in relation to FTP servers, networks, virtual machines, monitoring tools and third-party APIs. Typically it rears its head when we are aware of the symptoms but not the cause.

I’ve worked in software delivery long enough to know that perfection rarely exists and that niggles creep in. We have to remain practical and pragmatic, assess risks, and make informed decisions and trade-offs. Realistically, we cannot always tackle problems at the moment they occur. But we need to ensure we’re making those challenges visible, highlighting the imperfections and adding clarity to the problems we face before the rot sets in.

One thing’s for sure: we need to stop using the term flaky! It has no place in software delivery.
