The Problem · Part 6

The confidence machine

2026-06-27 · 7 min read · by Think North

Here's a superpower you have and have never once thanked yourself for: you can hear doubt.

When a colleague says "this should handle the timezone thing," your brain automatically italicizes the should and files the timezone thing under "verify before trusting." When a PR description says "not 100% sure about the locking here," you read the locking code twice. When Dana — brilliant, terse Dana — writes an uncharacteristically long comment above a function, you know, without anyone telling you, that Dana was nervous, and that this function has teeth.

An entire risk-allocation system, running on tone. Nobody designed it. Nobody maintains it. And two years ago, without a memo, its input signal went silent.

The flat line

Ask a language model for a binary search and it produces one, confidently. Ask it for a distributed rate limiter with fairness guarantees under clock skew and it produces one, equally confidently. Same fluent prose in the docstring. Same tidy variable names. Same total absence of sweat.

One of those two outputs is almost certainly correct — the training data contains a million binary searches. The other is a genuinely hard problem where even specialist humans routinely get it wrong, and the model's output is, at best, a plausible draft. But nothing — NOTHING — in the artifact tells you which regime you're in. The confidence is not a signal about the code. It's a property of the machine, like the hum of a refrigerator.

With humans, confidence correlates with correctness — imperfectly, gameably, but reliably enough that we built our entire collaboration model on it. With models, the correlation between how the code sounds and how good it is approaches zero at exactly the difficulty levels where it matters most. You've been handed a colleague whose voice never wavers, ever, about anything.

In a person, we would call that a red flag. In a tool, we call it polish.

Take a second to inventory what the old signal actually consisted of, because it was richer than you think. Hedging comments ("this should be safe"). TODO and FIXME markers — literal flags of known doubt, planted by the person best positioned to know. Commit messages that trail off ("handle edge case, hopefully"). The over-long explanation that means the author was talking themselves into it. The suspiciously short one that means they didn't want to look at it anymore. Even formatting carried signal — code written nervously reads nervously. Every one of those channels is now flooded with the model's even, unhesitating output. The flags don't get planted, because the entity doing the writing has never once felt the feeling the flag was invented for.

The researchers who caught it on camera

This isn't theory; the miscalibration has been measured, on both sides of the human-machine boundary.

On the human side: Perry, Srivastava, Kumar and Boneh's Stanford study, "Do Users Write More Insecure Code with AI Assistants?", ran a controlled experiment on security-relevant programming tasks. Participants with an AI assistant wrote measurably less secure code than the control group — and here is the finding that belongs in a museum — believed they had written more secure code. The assistant didn't just fail to transfer correctness. It transferred confidence, separately from correctness, like a virus with two payloads and one of them counterfeit.

METR's 2025 randomized trial found the same signature in a different domain: experienced open-source developers, working in their own repositories, took about 19% longer with AI assistance, while estimating afterward that they had been sped up by around 20%. Veterans, on home turf, misreading the machine's effect on their own performance by roughly 40 percentage points. The machine's serenity is contagious, and expertise is not a vaccine.

And on the machine side, the benchmark literature shows the flat line directly: on SWE-bench — Jimenez and colleagues' benchmark of real GitHub issues — models emit patch after patch with identical fluency, and a substantial fraction simply don't work. The failures read exactly like the successes. There is no benchmark for "sounding unsure when wrong," because the machine has no access to the difference.

Where the poison actually spreads

The subtle damage isn't the wrong code itself — wrong code is survivable; we've shipped it for decades. The damage is what uniform confidence does to every downstream trust decision, because your organization is a pipeline of people deciding how hard to check things:

The prompter checks less, because the output looks like the outputs that were fine. (It always looks like the outputs that were fine.)
The reviewer — already at 8x volume, see Part 4 — spends their scarce skepticism budget by triaging on smell. The code has no smell. Skepticism goes unspent.
The next developer, six months later, reads the confident docstring on the rate limiter and builds on top of it. Why wouldn't they? It says right there what it guarantees.
The incident responder, at 2am, initially refuses to suspect the polished module — the search starts in the ugly-looking legacy file, because ugliness is where bugs live. Used to live.

Four layers of misdirected vigilance, all downstream of one missing signal. Your organization didn't get more credulous. It's running the same trust algorithm it always ran — allocate scrutiny where the doubt sounds loudest — against an input that no longer carries doubt. A perfectly good algorithm, fed a severed wire.
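To make the severed wire concrete, here is a toy sketch of that trust algorithm. Everything in it is an assumption for illustration: the marker list, the function names, and the sample comments are invented, and no real team triages with a regex. The point is only that a ranking built on doubt markers goes flat when the markers stop being written.

```python
import re

# Hypothetical doubt markers, loosely based on the signals named earlier
# (TODO/FIXME flags and hedging words). Not an exhaustive or real-world list.
DOUBT_PATTERN = re.compile(
    r"\b(TODO|FIXME|should|hopefully|not sure)\b", re.IGNORECASE
)

def doubt_score(source: str) -> int:
    """Count doubt markers in a piece of source text."""
    return len(DOUBT_PATTERN.findall(source))

def triage(files: dict[str, str]) -> list[tuple[str, int]]:
    """Rank files by doubt score, loudest doubt first."""
    scored = ((name, doubt_score(text)) for name, text in files.items())
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Invented sample comments: the same two modules, written in two voices.
human_written = {
    "binary_search.py": "# Standard bisection over a sorted list.",
    "rate_limiter.py": "# TODO: not sure about clock skew; this should be fair, hopefully.",
}
generated = {
    "binary_search.py": "# Standard bisection over a sorted list.",
    "rate_limiter.py": "# Token bucket providing fairness guarantees under clock skew.",
}

print(triage(human_written))  # the rate limiter floats to the top
print(triage(generated))      # every score is zero; the ranking says nothing
```

The algorithm is unchanged between the two calls. Only the input lost its signal.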

(The old world had this problem in one narrow spot, by the way: the overconfident junior. And notice what we did about it — we wrapped that one person in extra review, mentorship, and a probation period. We built an entire social exoskeleton to contain a single miscalibrated voice. We have now given every developer a miscalibrated pair-programmer and dissolved the exoskeleton for throughput reasons.)

You can't fix the voice. You can stop listening to it.

Making models say "I'm not sure" turns out to be genuinely hard: the confidence isn't a personality flaw to be coached out; it's how generation works. The artifact will keep sounding finished. So the correction has to happen on your side of the boundary, and it comes down to one discipline: replace tone with evidence.

Concretely, that means the trust a function earns should come from observable history, not from how its docstring reads. Has it run? How many times? What's its error rate, its latency drift, its behaviour at the boundaries, compared against other functions doing the same job? A function with a million clean production calls has earned the confidence its comments merely claim. A function that shipped yesterday has earned nothing, however serene it sounds. (This is, mechanically, what runtime telemetry substitutes for the severed wire — in CodeNSM's fleet, a function's trustworthiness is a number derived from what it did, not a vibe derived from what it says; the confident-sounding and the hesitant-sounding get exactly the same audit.)
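A minimal sketch of what "trust as a number derived from what it did" could look like, under stated assumptions: the `Telemetry` record, the `trust_score` function, and the fully pessimistic prior are all hypothetical choices made for this example, not a description of how CodeNSM or any other product computes its score. Note what is absent from the inputs: the docstring.

```python
from dataclasses import dataclass

@dataclass
class Telemetry:
    """Hypothetical per-function runtime record; field names are illustrative."""
    calls: int
    errors: int

def trust_score(t: Telemetry, prior_calls: int = 100,
                prior_error_rate: float = 1.0) -> float:
    """Estimated success rate, shrunk toward a pessimistic prior.

    The prior acts like `prior_calls` imaginary calls that failed at
    `prior_error_rate`. With the defaults, a function with no history
    scores 0.0 and only climbs as clean calls accumulate.
    """
    estimated_error = (t.errors + prior_calls * prior_error_rate) / (
        t.calls + prior_calls
    )
    return 1.0 - estimated_error

shipped_yesterday = Telemetry(calls=0, errors=0)
battle_tested = Telemetry(calls=1_000_000, errors=0)
flaky = Telemetry(calls=1_000, errors=100)

for name, record in [("shipped_yesterday", shipped_yesterday),
                     ("battle_tested", battle_tested),
                     ("flaky", flaky)]:
    print(f"{name}: {trust_score(record):.4f}")
```

The pessimistic default is a deliberate design choice for the sketch: it encodes "has earned nothing" as a starting score of zero. A real system would presumably also weigh latency drift, boundary behaviour, and comparison against peer functions, which this example leaves out.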

The mental shift is small to state and large to live: treat every generated artifact the way a good editor treats copy from a famously smooth writer. The smoothness is real. It's just not about anything.

Doubt was never the enemy. Doubt was the free, self-updating map of where the danger lived. The new author doesn't carry the map — it walks every path, minefield or meadow, with the same untroubled stride, and it will narrate the walk beautifully either way.

Your job, starting now, is to stop asking the walker how the ground feels.

Instrument the ground.
