I’ve worked on a number of code bases that have intermittent problems. Usually these show up in nightly test runs. Often we’ll talk about them:
“Are we ready to deploy the new build?”
“Almost — I’m still looking at a few test failures, but I think they’re just intermittent.”
“Oh, Pat says we don’t need to worry about it — there are some intermittent failures in the request handling.”
Sometimes the word isn’t even used.
“I’m getting some weird results. Is it supposed to say 5 here?”
“Oh, yeah, it does that sometimes. Just reset the frequency switch and it’ll go away.”
On these teams, intermittent issues are generally treated as harmless nuisances. People work around them and move on. Sometimes there's a cleanup push, which lasts until something more pressing comes along or the remaining intermittent issues prove too hard.
Intermittent issues are far from harmless nuisances. What the word “intermittent” actually means is “this is an area where my mental model is inadequate”.
We don’t say cars intermittently stop working and sometimes get better when you reset the gas tank. This is because we understand that cars gradually use up their gas, we have systems for keeping track of gas usage (the fuel gauge, or sometimes familiarity with the car combined with the odometer), and we understand the process of adding gas to a car. We even understand complexities, like the extra load of driving up a mountain with the AC on.
We don’t say that eyeballs intermittently stop working outside and it’s worse in the winter. This is because we understand that much outdoor light comes from the sun, we have a good sense of how much light is necessary for our vision, and we know that the days are shorter in the winter. We even understand complexities, like how streetlights increase the available light and thick clouds diminish it.
The reason “intermittent” issues are so problematic is that mental model mismatches are dangerous. Mental model problems act as both bug farms and bug shelters. When you don’t understand a system properly, the actions you take will not align with its needs. When you don’t understand a system properly, you will have trouble detecting problems in it. The subtle weirdnesses that are the first symptoms of a problem will be lost in the noise.
Misunderstanding a system that you affect puts the continued health of the system at risk. If the system owners brush off intermittent problems instead of trying to heal the mental model gap, the system cannot thrive.
(Besides, even if I’m not concerned about the long-term health of the code, it’s much more fun to find and fix a dangling file-handle problem than it is to squint at ten pages of test logs and eventually decide that it’s an intermittent failure with nothing to be done.)
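To make that last point concrete, here is a hedged sketch (not from the original essay; the function names and the scenario are hypothetical) of how a dangling file handle can masquerade as an "intermittent" failure: a logger that never closes its handles leaves data sitting in an unflushed buffer, so whether a later read sees the data depends on timing and buffer state rather than on anything the test itself does.

```python
import os
import tempfile

# Hypothetical illustration: a log writer that leaks its file handles.
# We keep references in a list so the handles genuinely stay open
# (otherwise CPython's refcounting would close them on function exit).
leaked_handles = []

def log_result_leaky(path, line):
    f = open(path, "a")          # handle is never closed: the "bug farm"
    leaked_handles.append(f)
    f.write(line + "\n")         # small write sits in the handle's buffer

def log_result_fixed(path, line):
    with open(path, "a") as f:   # context manager closes, which also flushes
        f.write(line + "\n")

# Demo: the leaky write is invisible to readers until the handle is
# eventually closed -- from the outside it looks "intermittent".
d = tempfile.mkdtemp()
leaky_log = os.path.join(d, "leaky.log")
log_result_leaky(leaky_log, "result: 5")
print(repr(open(leaky_log).read()))   # empty: data still buffered

for f in leaked_handles:              # closing flushes the buffer, and
    f.close()                         # the "missing" data suddenly appears
print(repr(open(leaky_log).read()))

fixed_log = os.path.join(d, "fixed.log")
log_result_fixed(fixed_log, "result: 5")
print(repr(open(fixed_log).read()))   # always present
```

The point of the sketch is that nothing here is random: once you hold the right mental model (buffered writes are only guaranteed visible after a flush or close), the "sometimes it fails" behavior collapses into an ordinary, fixable bug.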