The Problem · Part 2

The prompt lottery

2026-06-23 · 7 min read · by Think North

Imagine an airline where the pilots each rolled a die at takeoff. Six sides. Five of them: a normal, professional flight. One of them: the plane still lands, but the landing gear was improvised out of confidence and the maintenance crew won't find out for six months.

Now imagine the airline doesn't record the die rolls. Doesn't know the die exists. Publishes glowing on-time statistics.

You are the airline. The die is the prompt. Welcome to Part 2.

An experiment you can run today (please don't tell anyone the results)

Take your best engineer. Give them a real ticket — something meaty, like "add rate limiting to the export endpoint." Have them do it twice with their AI assistant, on two different mornings, with two honest-but-different prompts. The Monday prompt is careful: context pasted in, constraints spelled out, edge cases named. The Thursday prompt is the one we all actually write at 4:45pm: "add rate limiting to the export endpoint."

Diff the two results. Really look.

Monday's version reuses your existing middleware, respects your error taxonomy, and handles the concurrent-request case. Thursday's version works — it demos perfectly — but it hand-rolls a token bucket with a subtle race condition, imports a pattern from a framework you don't use, and stores state somewhere that won't survive your next deploy. Same developer. Same ticket. Same green checkmarks in CI.
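To make the "subtle race condition" concrete, here is a minimal sketch of a token bucket in Python. It is an illustration, not Thursday's actual output and not anyone's production limiter: the class name `TokenBucket`, its parameters, and the injectable `clock` are all assumptions made for the example. The comment marks the check-then-act step where an unlocked version goes wrong.

```python
import threading
import time


class TokenBucket:
    """Hypothetical in-process token bucket (illustrative names throughout)."""

    def __init__(self, capacity, refill_per_second, clock=time.monotonic):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self._clock = clock
        self._tokens = float(capacity)
        self._last = clock()
        self._lock = threading.Lock()

    def try_acquire(self):
        """Take one token if available; return True on success."""
        # Without this lock, two threads can both observe _tokens >= 1 and
        # both decrement it. That check-then-act gap is the kind of race
        # that demos perfectly and fails under concurrent load.
        with self._lock:
            now = self._clock()
            elapsed = now - self._last
            self._last = now
            # Refill in proportion to elapsed time, capped at capacity.
            self._tokens = min(
                self.capacity,
                self._tokens + elapsed * self.refill_per_second,
            )
            if self._tokens >= 1:
                self._tokens -= 1
                return True
            return False
```

This version is safe across threads within one process, but its state still lives in process memory, so it would share Thursday's other flaw: the counts reset on every deploy and are not shared across instances.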

The difference between those two artifacts is not skill, and it's not effort. It's the roll of a die you didn't know was being thrown: prompt phrasing, what happened to be in the context window, which model version was live, whether the relevant file got retrieved, the temperature of the sampling gods.

Quality used to be a function of who. Now it's a function of who, times prompt, times context, times model-of-the-week — and only the first factor appears anywhere in your records.

This isn't a hypothetical — people have measured the die

The variance is documented, not vibes. In the Stanford study "Do Users Write More Insecure Code with AI Assistants?", Perry and colleagues watched participants solve identical security-relevant tasks, some with an AI assistant and some without, and found that the assistant group produced more insecure solutions than the control group. Outcomes also varied meaningfully with how participants talked to the model: the ones who invested in their prompts, adjusted parameters, and iterated produced code with fewer vulnerabilities than the ones who took the first plausible answer. Same task pool. Same assistant. The prompt was a quality dial, and most participants didn't know they were holding it.

Zoom out to the benchmark world and the picture repeats. SWE-bench, the benchmark of real GitHub issues that Jimenez and colleagues built to test whether models can resolve actual software problems, shows how sensitive model performance on real tasks is to context and setup: the same model's results swing substantially depending on which files it is shown and how the problem is framed. That sensitivity doesn't disappear when the model leaves the benchmark and joins your sprint. It just stops being measured.

And at the codebase level, GitClear's longitudinal research on AI-assisted code — analyzing hundreds of millions of changed lines — found churn and copy-paste duplication rising in step with assistant adoption: code increasingly written, shipped, and then rewritten or duplicated within weeks. That's what a quality lottery looks like from orbit. Individual rolls are invisible; the aggregate shows up as churn.

The part that should actually scare you

Here is the sentence I want you to say out loud in your next planning meeting, just to feel it:

"The quality of our codebase is now partially determined by a random variable, and we do not track it."

Not "we track it badly." Not "we track it informally." We. Do. Not. Track. It. There is no column anywhere in your company for "which die roll produced this function." Your git history records the human. Your CI records the tests passing. Your PR records a thumbs-up emoji. The single largest new source of variance in your codebase's quality — larger, arguably, than the variance between your engineers — is recorded nowhere.
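For a sense of how small that missing column could be, here is a minimal sketch in Python of a provenance record written alongside each AI-assisted commit. Everything in it is an assumption for illustration: the `GenerationRecord` fields, the function names, and the choice to store a hash of the prompt rather than the prompt itself. It is one possible shape, not a standard or an existing tool.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class GenerationRecord:
    """Hypothetical provenance for one AI-assisted change."""
    commit: str            # the commit the generated code landed in
    author: str            # the human git already records
    model: str             # model identifier as reported by the tool
    temperature: float     # sampling temperature, if the tool exposes it
    prompt_sha256: str     # hash of the prompt, not the prompt text
    context_files: tuple   # files that were in the context window


def record_generation(commit, author, model, temperature, prompt, context_files):
    """Build a record; the prompt is hashed so the log holds no raw text."""
    digest = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    return GenerationRecord(
        commit, author, model, temperature, digest, tuple(sorted(context_files))
    )


def to_json_line(record):
    """Serialize one record as a single JSON line for an append-only log."""
    return json.dumps(asdict(record), sort_keys=True)
```

Hashing the prompt is a design choice in this sketch: it lets two changes be grouped by identical prompt without keeping potentially sensitive text in the log. A team that wants to study phrasing would need to store more.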

Think about what your company does track. Ad spend by channel, to the dollar. Conversion by cohort. Support tickets by category. Somewhere there is a dashboard for how many people clicked a button whose color you A/B tested. And meanwhile the machine that writes an ever-growing share of your core asset is rolling dice, and the dice rolls go straight into production, unlabeled.

(A fun objection at this point: "engineers were always variable! Monday-Dave and Thursday-Dave were never the same engineer either." True! But Dave's variance was legible — his teammates knew his patterns, his reviewer knew his weak spots, and Dave-quality was roughly continuous over time. The lottery's variance is illegible, discontinuous, and hidden inside output that always looks like good-Dave. That last part is the killer. Bad human code usually looks bad. Bad generated code looks exactly like good generated code.)

Why review can't see the die roll

You might hope review catches the bad rolls. It catches some! The obvious ones. But the whole trouble with lottery losers is that they're syntactically indistinguishable from winners — fluent, idiomatic, confidently commented. A reviewer at 2026 volume (much more on this in Part 4) is pattern-matching for smells, and the model has read every style guide ever written. It does not produce smells. It produces polished code that is occasionally, structurally, wrong.

The failure isn't at the line level, where review looks. It's at the level of "this hand-rolled a thing we already had," "this pattern won't survive our deploy model," "this is subtly racy under load" — exactly the failures that need either deep context (which the review had no time for) or runtime evidence (which the review happens too early to see).

Which points at the only durable answer anyone has found: if the variance can't be prevented at the source and can't be caught at the gate, it has to be observed downstream, where the dice have already landed. Every function's actual behaviour in production — error rate, latency drift, whether it duplicates work something else already does — is the die roll, revealed. In CodeNSM's own fleet telemetry, this is exactly how lottery losers eventually announce themselves: not in the diff, but as functions whose runtime behaviour quietly diverges from their role's baseline weeks after the green checkmark.
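As a rough illustration of "diverges from its role's baseline", the sketch below flags functions whose error rate sits far above that of their peers in the same role. This is not CodeNSM's method, which the text does not describe; the function name, the role labels, the leave-one-out comparison, and the threshold of three standard deviations are all assumptions chosen to keep the example short.

```python
from statistics import mean, stdev


def diverging_functions(error_rates, roles, threshold=3.0, min_peers=3):
    """Return names whose error rate is far above their role's peers.

    error_rates: dict of function name -> observed error rate (0.0 to 1.0)
    roles:       dict of function name -> role label (hypothetical labels)
    """
    flagged = []
    for name, role in roles.items():
        # Baseline is built from the other functions in the same role, so
        # an outlier cannot inflate the spread it is being judged against.
        peers = [
            error_rates[other]
            for other, other_role in roles.items()
            if other_role == role and other != name
        ]
        if len(peers) < min_peers:
            continue  # too few peers to call anything a baseline
        mu, sigma = mean(peers), stdev(peers)
        if sigma > 0 and (error_rates[name] - mu) / sigma > threshold:
            flagged.append(name)
    return sorted(flagged)
```

A real system would also need a time dimension, since the divergence described above shows up weeks after merge, and would look at latency and duplication as well as errors. The sketch only shows the comparison step.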

The question to take with you

The lottery isn't going away. Prompting will get better, models will get better, and the variance will shrink — and then new models and new tools will widen it again, because that's what churn in tooling does. The question isn't how to stop rolling dice. It's this:

If quality became a random variable in your codebase two years ago, what in your organization would have noticed?

Go through the list. Review? Skims. CI? Checks what the tests check (Part 9 will ruin tests for you, sorry in advance). Velocity metrics? Actively reward fast rolls. Your best engineer's intuition? Doesn't scale past what they personally read.

The honest answer, for almost every team, is: nothing would have noticed. Nothing did.

The dice have been rolling for two years. The results are already in your codebase — you just haven't looked at them yet.
