“Intermittent” is a Dirty Word

September 26th, 2013

I’ve worked on a number of code bases that have intermittent problems.  Usually these show up in nightly test runs.  Often we’ll talk about them:

“Are we ready to deploy the new build?”
“Almost — I’m still looking at a few test failures, but I think they’re just intermittent.”


“Oh, Pat says we don’t need to worry about it — there are some intermittent failures in the request handling.”

Sometimes the word isn’t even used.

“I’m getting some weird results.  Is it supposed to say 5 here?”
“Oh, yeah, it does that sometimes.  Just reset the frequency switch and it’ll go away.”

On these teams, intermittent issues are generally treated like harmless nuisances.  People work around them and move on.  Sometimes there's a cleanup push, which lasts until something distracting comes along or the remaining intermittent issues prove too hard.

Intermittent issues are far from harmless nuisances.  What the word “intermittent” actually means is “this is an area where my mental model is inadequate”.

We don’t say cars intermittently stop working and sometimes get better when you reset the gas tank.  This is because we understand that cars gradually use up their gas, we have systems for keeping track of gas usage (the fuel gauge, or sometimes a knowledge of the car combined with the odometer), and we understand the process of adding gas to a car.  We even understand complexities, like the extra load of driving up a mountain with the AC on.

We don’t say that eyeballs intermittently stop working outside and it’s worse in the winter.  This is because we understand that much outdoor light comes from the sun, we have a good sense of how much light is necessary for our vision, and we know that the days are shorter in the winter.  We even understand complexities, like how streetlights increase the available light and thick clouds diminish it.

The reason “intermittent” issues are so problematic is that mental model mismatches are dangerous.  Mental model problems act as both bug farms and as bug shelters.  When you don’t understand a system properly, the actions you take will not align with its needs.  When you don’t understand a system properly, you will have trouble detecting problems in it.  The subtle weirdnesses that are the first symptoms of a problem will be lost in the noise.

Misunderstanding a system that you affect puts the continued health of the system at risk.  If the system owners brush off intermittent problems instead of trying to heal the mental model gap, the system cannot thrive.

(Besides, even if I’m not concerned about the long-term health of the code, it’s much more fun to find and fix a dangling file-handle problem than it is to squint at ten pages of test logs and eventually decide that it’s an intermittent failure with nothing to be done.)

Book Review: Debugging the Development Process

July 15th, 2010

Unlike the previous book, and despite what its title implies, this book is not about technical debugging at all. Rather, it’s about people skills and systems of people.

This book is clearly aimed at new and newish managers. It presents, argues for, and teaches a variety of human factors skills that relate to managing a small group of developers. Despite that, I think it’s useful to a wide variety of people. Most of the focus is on not letting inertia / denial / other people / conflict-avoidance make your decisions for you. It can be read like an engineer’s self-help book: action A causes bad result B via mechanism C; alternative action A’ prevents mechanism C and furthermore encourages good result D through mechanism E. There’s no coddling, but there’s no attacking, either. That said, I think it would be a dandy resource for its target audience.

Debugging the Development Process deals explicitly with subtle influences and with the fact that systems of people are not predictable by the sum of their parts. That’s both helpful and rare. I think systems and subtleties are more important to software folks than clean bite-sized absolutes, and it frustrates me that I can’t convince more of my industry.

As a pleasure read, this book also scores highly. Maguire’s writing style is unusually easy and enjoyable to read for a technical author. Like his earlier book, Writing Solid Code, Debugging is information-dense without being word-dense. I’m sad that he’s only written two books, both of them in the nineties — I’d like to see what he’d produce if he continued to write.

Perhaps the biggest flaw in this book is its age. It was published in 1994. While social systems don’t age as quickly or as badly as technical systems do (anyone want to buy my copy of Java 1.1 in a Nutshell?), there have still been useful advances in the field of software management in the last decade and a half. Much of the advice Maguire gives would still be useful for people in self-organizing teams, but it would have to be adapted from his presentation to suit. Similarly, he’s assuming a certain model of scheduling-up-front that, while it is flexible, is not compatible with the incremental micro-scheduling practiced by many agile projects.

Overall, a helpful and enjoyable book, and a quick read — check it out!

Book Review: Why is the Phone on Fire?

April 5th, 2010

That’s not the full title of the book, just the most memorable part. The full title is If I Only Changed the Software, Why is the Phone on Fire?: Embedded Debugging Methods Revealed. I pulled it off the library shelf on a whim. I want more ability to explain debugging to others, and besides, library checkouts are cheap.

It turned out to be an extremely good whim. Despite some minor issues, this book is a remarkably readable introduction to how to approach debugging, with a fair bit of specific advice. It’s structured as a series of chapters about a fictional team. Each chapter is bracketed by discussion of a specific real-world bug that caused a lot of very visible trouble. Inside each chapter, the fictional team struggles to figure out a problem that turns out to be caused by similar forces. The discussion focuses on their thought patterns and realizations. The writing is nicely specific without having to include a lot of source code. Symptoms and behaviors are the main focus.

Although the title specifies embedded debugging, most of the techniques in this book can (and should!) be applied to any sort of debugging. The value in this book isn’t in the specific situations it describes, but in the approach to diagnosis and problem-solving it teaches. That said, all of the situations in this book use relatively small software systems. A book on debugging larger systems would be a good next read. The closest match I can think of is Working Effectively With Legacy Code, but that has a development focus, not a debugging focus. (I recommend it in its own right, but it’s not an obvious follow-up for this book.)

Although Why is the Phone on Fire? is generally a good resource, there were some problems with its writing. The members of the fictional team behave in racially stereotyped ways. The writing style was occasionally pedantic — I could sometimes interpret this as the old hands on the team being patronizing to the newbie, but that’s problematic too. I think these problems are overshadowed by the positive qualities of the book, but potential readers should be aware of them going in.

Overall, it’s a very good book in an underserved market niche. I’m going to buy a copy for lending purposes.

Software is like Plumbing

February 28th, 2010

One of my mental models for how people interact with software is shaped by a few plumbing adventures I’ve had. They’re the standard sort of mishaps: one backed up sewer pipe and one overflowing shower drain. In each, I eventually learned the proximal cause (previous inhabitants of the house were flushing inappropriate materials; the apartment building didn’t have large enough pipes in the shower stack, which meant that our first-floor apartment served as a de facto drain pipe), but I had neither control over, nor really any interest in, why things were going wrong.

I’m interested in plumbing in an abstract sense. Large-scale human engineering is cool and plumbing history has undergone some interesting twists and turns. The plumbing I live with, however, should just work. I don’t want to think too much about how it works or why it works. I just want to take my shower and get on with my day. I don’t know much about the features that are necessary for that to “just work” — I won’t be able to tell you how wide to make the sewer pipes or how to keep metals from leaching into the clean water. I don’t care much about the features, either. What I care about are a) toilet works, b) shower works, c) faucets work, d) dishwasher works, and e) all the waste water goes away.

A lot of software teams would consider me a really annoying customer. Plumbers don’t, or if they do, they hide it well. They understand that the subject they’ve spent 40 hours a week on for the last two decades is not that interesting to me, their customer, and they tailor their message accordingly. I try to remind myself of this before I push data structures and installer dependencies on software customers — some people just don’t care about anything beyond “Your toilet was clogged. I’ve fixed it. Don’t flush menstrual products. That’ll be $119.95.” And that’s fine. I don’t want to trace pipes in my basement either.

Mixins vs. typeclasses

November 22nd, 2009

This is a fragment of a discussion on polymorphism that I had in another context. I was pleased by how well this explanation came out, and wanted to put it in a more public format.

I’ve thought more about the issue you raised of whether Haskell’s typeclasses are the same as Ruby’s mixins, and I think my answer is a firm “sort of”. A Ruby mixin provides a consistent API and a consistent implementation to any class it’s added to. A Haskell typeclass provides a consistent API and an individual implementation to any type it’s defined for. It’s roughly the difference between base classes and interfaces in C++/C#/Java.
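To make the contrast concrete, here’s a minimal Ruby sketch (the module and class names are invented for illustration). The mixin hands every including class the same `greet` body; a Haskell typeclass with a `greet` method would instead let each instance supply its own body while keeping the same API contract.

```ruby
# A mixin supplies both the API and a single shared implementation
# to every class that includes it.
module Greetable
  def greet
    "Hello, I am #{name}"  # one body, reused by all includers
  end
end

class Person
  include Greetable
  def name; "Pat"; end
end

class Robot
  include Greetable
  def name; "R2"; end
  # In Haskell, `instance Greetable Robot` could define a completely
  # different greet body; here Robot can only override the method by hand.
end
```

Both `Person.new.greet` and `Robot.new.greet` run the one shared mixin body — which is exactly the base-class flavor of reuse, where a typeclass instance behaves more like an interface with a per-type implementation.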

Javascript is the new C

October 16th, 2009

I’ve become increasingly convinced that Javascript is following the path that C did a generation before.

C was the ubiquitous language of its time. It ran on every platform, although each platform had its own interpretation of things like integer size. It was a higher-level, more portable language than was commonly in use at the time. It was syntactically fairly simple, with a few key ideas that it used heavily, like macros and pointers. More than thirty years later, it’s still the interop language between different technical domains. Younger, higher-level languages offer the ability to write speedier code in C or to hook into other code written in C.

Javascript is the ubiquitous language of the Web. After a rocky start, it now runs on every Web browser, although some browsers interpret things differently and many users have configured their browsers differently.[1] It allows more interactivity than preceding technologies (server-side manipulation and DHTML). It’s syntactically quite simple, with a few key ideas that it uses heavily, like prototypes and first-class functions. It’s used extremely heavily. There are a large number of libraries built on it. Other technologies (like Silverlight) are starting to build out from it.

I think that C and Javascript are both going to be with us forever. I think that, just as C is increasingly the province of device drivers, compiler back-ends and heavily-optimized inner loops, Javascript is going to move into the background of Web programming, replaced by technologies we probably haven’t even met yet. However, despite their spiffiness, the new VirtualSmellX and 4DSpline Web languages will have Javascript modules at their core and will exchange data in JSON.

(Confidential to social historians of the 22nd century: you guys are going to have So Much Fun. “Rapid evolutionary change” doesn’t begin to cover it.)

[1] I once had to pay for a conference by hand-delivered check. My idiosyncratic NoScript settings had convinced the registration software that I didn’t have to pay, and no amount of twiddling on my part would convince it to allow me to pay.

Misuse of Arithmetic

October 1st, 2009

Many people know about crazy fraction reduction: 16/64 is 1/4 by normal arithmetic, but if you cancel the 6s, you get 1/4 as well. Right answer, wrong method.

I was just introduced to the inverse of crazy fraction reduction. 32/24 = 4/3, but if you cancel the 2s, 32/24 = 3/4. For some reason, this is more alarming to me than standard crazy fraction reduction.
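Both flavors are easy to enumerate. Here’s a small Ruby sketch (variable names are mine) that scans every two-digit fraction of the form ab/bc, collecting the “lucky” cases where canceling the shared digit accidentally gives the right answer, and the inverse cases where, as with 32/24, it gives the reciprocal instead.

```ruby
# Scan fractions (10a+b)/(10b+d) where "canceling" the shared digit b
# is tempting.  "lucky" holds cases where the bogus cancellation a/d
# equals the true value; "inverse" holds cases where it equals the
# reciprocal of the true value instead.
lucky, inverse = [], []
(10..99).each do |num|
  (10..99).each do |den|
    a, b = num.divmod(10)
    c, d = den.divmod(10)
    next unless b == c && d != 0 && num != den
    lucky   << [num, den] if num * d == den * a  # num/den == a/d
    inverse << [num, den] if num * a == den * d  # num/den == d/a
  end
end
```

The lucky list comes out to the four classics (16/64, 19/95, 26/65, 49/98), and 32/24 duly shows up on the inverse list.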

If it hurts, kill something

September 13th, 2009

Two weeks ago at Agile 2009 Jim Shore and Arlo Belshee led an open-ended discussion session on end-to-end tests and how to get rid of them. Part of this session was lightning talks based on small-group discussion. This is a refinement of the lightning talk I gave.

A lot of teams have a legacy test codebase to match their legacy code codebase. Part of the solution to end-to-end testing for them is to find ways to whittle down their existing end-to-end tests. Wholesale deletion is possible, but emotionally/politically difficult. I propose piecewise deletion, starting with the most important areas.

Whenever an end-to-end test bites you — there’s a false positive, six hundred tests fail because you change one method, or you can’t change an assumption because too many tests depend on it — kill something that’s part of the problem. Sometimes it’ll be obvious what to change: if this end-to-end test is failing for no good reason, remove it. Sometimes it’ll be harder. Instead of determining the exact right thing to remove, remove something and keep going. Remove a helper method and the tests that use it. Remove a test hook from the production code and the tests that use it.

If you keep repeating this action, you’ll slash away at the test code that causes the most problems. This may mean that you have no tests left for some areas of the code. That’s a valuable exposure of information: this code is so problematic that worthwhile tests cannot be written for it. Now you know where to concentrate your refactoring and rewriting efforts.

The original talk was focused on end-to-end tests, but I think this is a broader tool. Whenever you have code that is a never-ending source of trouble and pain, start treating it more roughly. Eradicate some of the mess each time you walk through the code, even if you don’t know where the absolute worst spots are. There’s no sense in using laparoscopic surgery on a leg that’s dead to the knee.

Dream it, then cause it

August 17th, 2009

One of the most fun aspects of technology is the ability to just up and create something. I actually find this easiest with stringcrafts. Feeling awkward at a party? Crochet finger puppets. Microphone for the read-aloud hard to keep at a standard distance from people’s mouths? Ply and braid a cord for it. Stuff needs transporting from hither to yon? Angle bungee cords properly to keep it from sliding off the bike rack.

I haven’t found ways to make programming usage as casual as string usage. Because of this, I sometimes forget that programming can be a casual solution to ideas or problems.

I read Simon Peyton Jones’ STM essay in Beautiful Code last night, and worked through the code to make sure I understood. I’d used STM in Haskell before, but for thorny networking problems that made it hard to understand how simple the magic actually is. At some point this morning I decided to combine STM and gtk2hs (which I’ve never used beyond a single ten-line example, and which I had installed incorrectly on my box as part of an installation of leksah) to produce a graphical representation of the Dining Philosophers problem. I now (after time off for reading, lunch, goofing off, and spending time with friends and their kittens) have a working and attractive implementation.

Creation is fun.

Emotions are important

July 14th, 2009

A friend of mine has a story that describes the clash between emotion-valuing and emotion-fearing conversation. While talking with a coworker about the Foobar module, he’d say, “I’m scared of what would happen if we put Barfoo data through it,” intending to start a discussion of the robustness of the code. Instead, the response would be, “Oh Paul, it’s okay. You don’t have to be scared. It’s okay.”

People, especially STEM-inclined people, often prize rationality over emotions[1]. Emotions are sticky and mushy and uncomfortable and hard to define. Numbers and facts are measurable and tidy[2]. I disagree; I think that both measurable information and fuzzy/gooshy information have value.

Metrics can be very helpful. How many units of work did we finish last release? How many bushels of corn were harvested from that field last year? How many milliliters of air can this person breathe out at once? Once you have a specific question, metrics and measurable information can help you answer it. Even if you don’t have a specific question, metrics can help compress specific data so you can see trends at a glance.

Emotions and fuzzy interpretations are also very helpful. I’m scared of what will happen to this code in a multi-threaded situation. I don’t feel secure about the performance of this system under load. This project is falling together perfectly. This patient looks woozy. Emotions are the high-level summations that your brain puts together from weeks or months of ill-defined input. They can’t be computed precisely, but as estimations, they’re very helpful.

This post was prompted by peripatetic axiom‘s Let’s talk about feelings post.

[1] The definition of “rational” is generally “agrees with whatever I’ve already decided is true”. This is a related, but separate, rant.

[2] I profoundly disagree with this too.