The Problem · Part 4

The review theater

2026-06-25 · 7 min read · by Think North

Somewhere in your company, right now, an engineer is looking at a pull request. It's 1,400 lines. It was generated in about four minutes. The engineer has a stand-up in twenty minutes, six more PRs behind this one, and a sprint commitment of their own. They scroll. They scroll faster. Their eyes perform the motion of reading the way your eyes perform the motion of reading a terms-of-service page.

Then they type the most load-bearing acronym in modern software: LGTM.

Looks good to me. Not "is good." Not "I verified this." Looks good. To me. The acronym was always quietly confessing something, and it took an AI-volume PR queue to make the confession audible.

What review was actually for (hint: not what you think)

Let's start with the heresy, because it's peer-reviewed heresy. In 2013, Alberto Bacchelli and Christian Bird at Microsoft Research published "Expectations, Outcomes, and Challenges of Modern Code Review" — they surveyed and observed hundreds of developers and managers across Microsoft to find out what code review actually accomplishes. The number one stated motivation, by a wide margin: finding defects.

The measured reality: the comments reviewers actually leave are mostly about style, minor improvements, and maintainability — and the deep, defect-catching review that everyone believes in happens far less than anyone expects, and mostly only when the reviewer already has strong context on the code. What review demonstrably did deliver, they found, was subtler: knowledge transfer, team awareness, shared ownership. Review was less a filter than a broadcast — the mechanism by which understanding of a change spread from one head into several.

Read that again, because it's the hinge of this whole essay. A decade before AI, at one of the most sophisticated engineering organizations on Earth, review was already mostly not catching deep bugs. It was already primarily a comprehension ritual. It worked anyway, because the ritual had two quiet preconditions: the author understood the change and could defend it, and the volume was low enough that reviewers could actually build context.

You can see where this is going.

The load calculation nobody ran

Every review process has a capacity, like a bridge. Nobody wrote yours down, but it exists: some number of lines per reviewer per day at which comprehension actually happens. Under the old regime you were probably near it. Humans typed the code, so production rate and review rate were coupled — they were the same people, at the same speed. The system was self-balancing without anyone designing it to be.

Then one side of the equation got a rocket engine. GitHub's own controlled experiment on Copilot found developers completing a set coding task 55% faster, and that was 2022-era tooling; agentic workflows have pushed effective code-production multiples far higher on scoped work. GitClear's analysis of hundreds of millions of changed lines documents the volume wave directly: more code added, more code churned, more code duplicated, year over year, as assistant adoption climbed. Meanwhile the reviewing organ (a human, with a calendar, reading at the speed of human reading) got precisely 0% faster.

Five to ten times the volume through the same gate. There are only three mathematically possible outcomes: the queue backs up forever (nobody tolerates this), review time per line collapses (this is what actually happened), or teams route around review entirely (also happening, quietly). Collapse the time per line far enough and review doesn't degrade gracefully — it undergoes a phase change. Below some threshold of attention, you are no longer reviewing. You are skimming for the absence of alarm, which is a different cognitive act with a different failure profile. The 2024 DORA report caught the system-level signature: teams adopting AI assistance reported drops in delivery throughput and stability even as individual speed rose — the local acceleration arriving at the organizational level as strain.
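If you want to run the load calculation for your own team, the arithmetic fits in a few lines. What follows is a rough sketch, not a measurement: the class name, the two reviewing hours a day, and the 12 seconds of attention per line are all illustrative assumptions, so substitute your own numbers. It prices out the three outcomes above for a given volume multiplier.

```python
from dataclasses import dataclass


@dataclass
class ReviewLoad:
    """Hypothetical model of one reviewer's daily capacity (all inputs are assumptions)."""

    reviewer_hours_per_day: float   # hours actually available for review
    seconds_per_line: float         # attention per line at which comprehension happens
    incoming_lines_per_day: float   # lines arriving for review before the volume increase

    def capacity_lines_per_day(self) -> float:
        # Lines per day that can be read at full attention.
        return self.reviewer_hours_per_day * 3600 / self.seconds_per_line

    def backlog_growth_per_day(self, multiplier: float) -> float:
        # Outcome 1: attention per line stays fixed, so the queue grows.
        incoming = self.incoming_lines_per_day * multiplier
        return max(0.0, incoming - self.capacity_lines_per_day())

    def seconds_per_line_at(self, multiplier: float) -> float:
        # Outcome 2: the queue is cleared daily, so attention per line shrinks.
        incoming = self.incoming_lines_per_day * multiplier
        available = self.reviewer_hours_per_day * 3600
        return min(self.seconds_per_line, available / incoming)

    def unreviewed_fraction(self, multiplier: float) -> float:
        # Outcome 3: attention stays fixed and the queue stays empty,
        # so the remainder bypasses review.
        incoming = self.incoming_lines_per_day * multiplier
        return max(0.0, 1 - self.capacity_lines_per_day() / incoming)


# Illustrative numbers only: a reviewer already running at capacity.
load = ReviewLoad(reviewer_hours_per_day=2, seconds_per_line=12, incoming_lines_per_day=600)
for multiplier in (1, 5, 8):
    print(
        multiplier,
        load.backlog_growth_per_day(multiplier),
        load.seconds_per_line_at(multiplier),
        load.unreviewed_fraction(multiplier),
    )
```

With these assumed inputs, capacity is 600 lines a day. At 8x volume the queue grows by 4,200 lines daily, or attention falls to 1.5 seconds per line, or 87.5% of the code goes unread. Pick one; the model does not offer a fourth.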

Approve-with-vibes isn't a character flaw. It's the only equilibrium the math allows.

Why theater is worse than nothing (a short, mean argument)

Here's the truly perverse part. A rubber-stamp review isn't a weak safety mechanism — it's an anti-safety mechanism, because everything downstream still believes in it.

Your deployment policy assumes reviewed code is understood code. Your compliance story assumes it. Your incident process assumes it ("it was reviewed, so the reviewer can help debug it" — can they?). Your junior engineers calibrate their own carefulness against it ("someone senior will catch it if it's bad"). The author relaxes because there's a reviewer; the reviewer relaxes because the author is competent and the tests are green; and between those two relaxations, a 1,400-line generated change sails into production having been deeply read by no human being at any point, while every system and person around it behaves as if it had been.

A gate everyone knows is open changes behaviour honestly. A gate that looks closed changes nothing except everyone's willingness to lean on it.

That's the theater. The costumes are process. The audience is your compliance auditor. The pyrotechnics are real.

What actually survives the phase change

Be honest about what a skimming reviewer can still catch: naming, style, obvious footguns, "we already have a helper for this" (sometimes), anything the diff makes visually loud. Now list what they can't, at volume: subtle concurrency, wrong-but-plausible business logic, security properties (Perry and colleagues at Stanford showed AI-assisted code trending less secure while its authors grew more confident — the exact combination a skim can't detect), architectural erosion spread across twenty small PRs, and duplication of things that exist in files the reviewer has never opened.

Notice the pattern: everything still catchable lives on the surface of the diff. Everything lost lives in behaviour and context — how the code acts under load, over time, against the rest of the system. Which suggests the honest response isn't to exhort reviewers to read harder (they can't; the math is the math) but to move the missing half of review to where behaviour is actually visible: after merge, in production, continuously. Review the diff for what diffs can show; let runtime evidence review the rest. (This is the half CodeNSM was built for — the review that happens after the merge, function by function, against each one's own baseline, at whatever volume the models care to produce. Machines scale with machines.)
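To make "against each one's own baseline" concrete, here is a minimal sketch of what a per-function runtime check could look like. It is an illustration under stated assumptions, not a description of how CodeNSM or any other product works: the function names, the latency samples, and the three-standard-deviation threshold are all hypothetical.

```python
from statistics import mean, stdev


def drifted(baseline, recent, threshold=3.0):
    """Return True if the recent mean sits more than `threshold` baseline
    standard deviations away from the baseline mean."""
    if len(baseline) < 2 or not recent:
        return False  # not enough evidence to judge either way
    spread = stdev(baseline)
    if spread == 0:
        # A perfectly flat baseline: any change in the mean counts as drift.
        return mean(recent) != mean(baseline)
    return abs(mean(recent) - mean(baseline)) / spread > threshold


def review_after_merge(baselines, recents, threshold=3.0):
    """Return the names of functions whose post-merge samples drifted
    from their own pre-merge baseline, sorted alphabetically."""
    return sorted(
        name
        for name, samples in recents.items()
        if name in baselines and drifted(baselines[name], samples, threshold)
    )


# Hypothetical latency samples in milliseconds, before and after a merge.
baselines = {
    "charge_card": [10, 11, 9, 10, 10],
    "render_home": [50, 52, 48, 51, 49],
}
recents = {
    "charge_card": [30, 31, 29],
    "render_home": [50, 51, 49],
}
print(review_after_merge(baselines, recents))
```

The design choice worth noticing is that each function is compared only with itself. A slow function that has always been slow raises no flag; a fast one that tripled overnight does, whether or not anyone read the diff that caused it.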

Your review process, meanwhile, deserves triage: spend the scarce human attention where the blast radius is — the auth gate, the payment path, the hot router — and stop pretending the same two eyeballs can warranty everything at 8x volume. Bacchelli and Bird's deepest finding was that review quality tracks reviewer context, not reviewer effort. Concentrate the context. Ration the warranty honestly.
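Triage by blast radius can start as something this small. The sketch below assumes a hypothetical repository layout (the `src/auth`, `src/payments`, and `src/router` paths are invented for illustration) and two made-up tier names; the point is only that the routing rule is explicit and written down, not that these are the right paths for your system.

```python
from fnmatch import fnmatch

# Hypothetical high-blast-radius paths; replace with your own auth gate,
# payment path, and hot router.
HIGH_BLAST_RADIUS = ("src/auth/*", "src/payments/*", "src/router/*")


def review_tier(changed_paths):
    """Return "deep" if any changed file touches a high-blast-radius path,
    otherwise "surface"."""
    touches_critical = any(
        fnmatch(path, pattern)
        for path in changed_paths
        for pattern in HIGH_BLAST_RADIUS
    )
    return "deep" if touches_critical else "surface"


print(review_tier(["src/auth/session.py", "docs/readme.md"]))
print(review_tier(["docs/readme.md"]))
```

A "deep" change goes to a reviewer who already has context on that code and gets real time; a "surface" change gets the skim it was going to get anyway, labeled honestly as a skim.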

The one-question audit

If you want to know tonight whether your review process is real or theatrical, you need exactly one question, asked of your last significant incident:

Did the review of the offending change have any realistic chance of preventing it?

Not "was it reviewed" — it was; everything is. Whether the review, as actually performed, at the speed it was actually performed, could have caught it. Keep asking across incidents. When the honest answer keeps being "no," you haven't lost your quality gate — you've discovered it left some time ago, and the theater has been running the shift since.

The curtain call is optional. The math isn't.
