CodeNSM
The Problem · Part 16

“Oh, that one always throws” — the five scariest words in software

2026-06-16· 7 min read· by Think North

Every engineering team has a sentence like this, and if you listen for it, you'll hear it within a week:

"Oh yeah, that one always throws. It's fine. Ignore it."

Said casually, by a competent person, about a real error in a production system, to a new hire who — this is the important part — writes it down. Not the error. The lesson: here, this class of failure is background noise. Welcome aboard.

This post is about that sentence: where it comes from, what it costs, and why the most famous analysis of it involves a space shuttle.

The hum

No mature production system runs clean. Somewhere in your logs, right now, there's a steady drizzle of exceptions: the third-party API that times out a few times an hour, the parser that chokes on malformed input from that one enterprise client, the retry that fails and succeeds on the second attempt, mostly. Each has a story, the story is usually "annoying but survivable," and over time the whole set fuses into something your team perceives the way you perceive a refrigerator's hum — which is to say, not at all.

Watch how the hum actually forms, though, because the mechanism is where the danger is. Nobody ever decided "we accept a 3% failure rate on payment retries." What happened was: an error appeared; it was investigated once; the fix looked expensive and the impact looked small; someone muted the alert "for now"; and the next person to see it was told it was fine by the person before them. Each step was locally reasonable. The output is an error rate that no one chose, no one owns, and — critically — no one is measuring anymore, because measurement attention follows alerts, and the alerts are muted.

Your tolerated error rate wasn't set by engineering judgment. It was set by fatigue, then laundered into policy by time.

The sociologist and the space shuttle

The definitive study of this mechanism wasn't written about software. It's Diane Vaughan's The Challenger Launch Decision, her exhaustive analysis of how NASA came to launch a shuttle whose solid rocket boosters had a known, recurring flaw. Vaughan coined the phrase that engineering culture has been borrowing ever since: the normalization of deviance.

Here's the shape of it, and notice how little translation software requires. The O-rings sealing the booster joints were never supposed to be damaged AT ALL — that was the design spec. Then a flight came back with erosion. Anomaly! It was investigated, analyzed, and — because the mission had succeeded — recategorized: erosion within this much is acceptable. Then more flights, more erosion, each one technically reviewed, each one flown, each success quietly expanding the envelope of "normal." Vaughan's crucial point — and she's emphatic about this — is that NASA's engineers weren't villains cutting corners. They were diligent professionals following their own review processes. The deviance was normalized through the analysis, not around it: each anomaly-that-didn't-kill-anyone became evidence that the anomaly was survivable. The absence of catastrophe was doing the work that the presence of safety should have been doing.

Until January 28, 1986, when a cold morning pushed the eroded envelope one increment past its edge, on live television.

Now, your codebase is not a space shuttle, and I want to use Vaughan's work honestly rather than as a scare-quote garnish — her subject was a $3 billion vehicle with seven lives aboard, and the disanalogy matters. But the MECHANISM she documented doesn't require rockets. It requires only three ingredients: a recurring anomaly, survival of that anomaly, and a review culture that gradually converts survival into acceptability. Every software team has all three. "Oh, that one always throws" is O-ring erosion after the recategorization: a defect with seniority, protected by its own track record of not-yet-mattering.

What the hum actually costs

It eats your signal. This is the operational killer. The day a NEW error appears in a function that's thrown 50 times a day for two years, who notices? The alert has been muted since 2024. Your tolerated errors aren't just failures — they're camouflage for the failure that will matter, exactly the way a car alarm that cries nightly trains a neighborhood to sleep through the actual theft.

It compounds silently. The hum's components interact. The flaky vendor API is fine (retry handles it), and the retry queue backing up is fine (it drains overnight), and the overnight job running long is fine (it finishes by 6) — until a Black Friday puts all three "fines" in a row and you discover they were a single fuse. Vaughan's deepest finding was that normalized anomalies don't stay independent; they form the taken-for-granted floor that the next decision stands on.

It miscalibrates everyone. The 2024 DORA report tracks change-fail rate and stability as first-class outcomes because they separate elite performers from the rest — but a team that's normalized its hum literally cannot report its own stability accurately. Ask them "what's your error rate?" and they'll quote you the rate of errors they still notice. The tolerated ones have been redefined out of the denominator. The books are cooked, with no cook.

It gets inherited. This is the quietest cost and maybe the worst one. The hum doesn't just persist — it REPRODUCES. When the team stands up a new service, the alerting config gets copied from the old one, muted rules and all. When a veteran onboards a new hire, "ignore that one" is taught in week one, with the same tone as where the coffee machine is. When someone finally rewrites the flaky integration, they write retry-and-swallow logic that matches the old failure rate, because that rate has become the spec — the deviance is now load-bearing. Vaughan called the institutional version of this the culture of production: yesterday's exception becomes today's procedure becomes tomorrow's training material. Your error tolerance isn't a state. It's an heirloom, and you're actively bequeathing it.

And each tolerated error is also, mundanely, a running tax nobody invoices: the support tickets it generates downstream, the retries that burn compute, the on-call engineer who glances at the familiar red and looks away — a few minutes of attention, several times a day, forever. Small numbers, huge multipliers. That's the hum's favorite arithmetic.

De-normalizing, without heroics

The fix is not "care more" — Vaughan's engineers cared plenty. The fix is structural: deviance normalizes when anomalies are evaluated by memory and social precedent ("it's always done that") instead of against a recorded baseline. The antidote is embarrassingly unglamorous: per-function error accounting. Every function's failure rate, tracked against its own history, with "normal for this function" written down by an instrument instead of by folklore. The moment the baseline is explicit, the question changes from "is this fine?" (answered by vibes and tenure) to "is this function above its own baseline?" (answered by arithmetic) — and a second, more subversive question becomes possible: "we've been paying a 4% failure tax on this desk for two years; do we want to KEEP choosing that, now that it's a number in a report instead of a shrug in a hallway?" (Tracking exactly that — each function's error economics against its own baseline, with the chronically-elevated flagged as Stressed — is the piece of CodeNSM this post is quietly about.)

Below, a self-audit. It won't tell you whether you have a hum — you do. It'll tell you how thoroughly you've stopped hearing it.

References

  1. Vaughan, D. (1996). The Challenger Launch Decision: Risky Technology, Culture, and Deviance at NASA. University of Chicago Press.
  2. Google Cloud (2024). DORA Accelerate State of DevOps Report.
  3. Tornhill, A. & Borg, M. (2022). Code Red: The Business Impact of Code Quality.

See your own codebase as an office.

One pip install and every function reports for duty — archetype, live state, debt tier, and a single Code-Health North-Star. Free plan, no card.

Read next