The Problem · Part 29

Measuring what the code does (not what it says)

2026-06-08 · 7 min read · by Think North

Suppose you were asked to judge a restaurant, thoroughly, with unlimited access to everything — except you may never taste the food and never watch a single diner eat.

You'd do real work. You'd read the menu (well-composed, maybe a little ambitious). You'd audit the supplier invoices and discover they've reordered fish sauce every week for a year but haven't bought saffron since 2023 — from which you'd infer, cleverly, quite a lot. You would produce a genuinely informative report. And you would still not know the one thing anyone actually wants to know: is the food good? Which dishes do people order? Which ones come back to the kitchen half-eaten? What does the Tuesday crowd suffer that the Friday crowd never sees?

This is, with almost no exaggeration, the state of the art in knowing your own codebase. Two instruments exist. Both read documents. Neither tastes the food.

Instrument one: reading the code

Static analysis is the menu-reader. It parses your source and grades its structure — complexity, smells, type errors, security patterns. Real value, decades of it; keep your linters. But be precise about what it measures: what the code says. Static analysis cannot tell a load-bearing function from a dormant one (Part 22's twins score identically), cannot see error rates or latency, and — the AI-era failure — keys on textual signals of carelessness that models simply don't emit. LLMs generate smell-free mediocrity at industrial scale: fluent, idiomatic, well-shaped code whose problems are behavioral, not grammatical. Judging generated code by static analysis is grading a con man on his penmanship. (And the volume problem compounds it: a linter filing seventeen complaints per file, across a codebase growing at generation speed, stops being an instrument at all — it's a fire hose pointed at a to-do list nobody will ever finish.)

Instrument two: reading the commits

Repository mining is the invoice-auditor, and it's the cleverer of the two. Adam Tornhill's Your Code as a Crime Scene built a whole forensic discipline on it: hotspots (complex code that keeps changing), change coupling (files that always move together), knowledge maps (who owns what, and what's abandoned). Tornhill and Borg's Code Red then attached business outcomes to it. This is real measurement and it's underused — but again, be precise about the object: repo mining measures how humans have behaved toward the code. Where they struggled, what they touched, what they fixed twice. It's a diary of the development process. And it has two blind spots, one old and one brand new. The old one: code that's rarely changed but constantly executed — the quiet load-bearing function that hasn't been committed to since 2024 — barely registers, while a hot mess that's dormant in production looks urgent. The new one is worse: repo mining's signal source is human behavior, and the AI era just adulterated it. When a model regenerates a file wholesale, churn stops meaning struggle; when "author" is a human-model blend, knowledge maps map nothing. GitClear's longitudinal data shows exactly this — churn and duplication patterns shifting under assistant adoption, which means the diary is increasingly being written by something that wasn't there.
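To make "change coupling" concrete, here is a minimal sketch in Python. It assumes the commit history has already been reduced to a list of file lists, one per commit (in practice that might be parsed from `git log --name-only`). The function name, the `min_shared` cutoff and the coupling formula are illustrative assumptions, not necessarily what Tornhill's own tooling computes.

```python
from collections import Counter
from itertools import combinations

def change_coupling(commits, min_shared=2):
    """Count how often each pair of files changes in the same commit.

    `commits` is a list of file-path lists, one per commit (hypothetical input).
    """
    file_changes = Counter()
    pair_changes = Counter()
    for files in commits:
        unique = sorted(set(files))
        file_changes.update(unique)
        pair_changes.update(combinations(unique, 2))

    results = []
    for (a, b), shared in pair_changes.items():
        if shared < min_shared:
            continue
        # One simple coupling degree: shared commits relative to the
        # less frequently changed file of the pair (an assumption).
        degree = shared / min(file_changes[a], file_changes[b])
        results.append((a, b, shared, round(degree, 2)))
    return sorted(results, key=lambda r: (-r[2], r[0], r[1]))

commits = [
    ["billing.py", "invoice.py"],
    ["billing.py", "invoice.py", "README.md"],
    ["billing.py"],
    ["auth.py"],
]
coupled = change_coupling(commits)
# coupled == [("billing.py", "invoice.py", 2, 1.0)]
```

Note what the input is: commits, and nothing else. Nothing in it says whether `billing.py` ran once last month or a million times.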

The missing third instrument

Meanwhile the answers to every question this series has raised — which debt accrues interest (Part 22), where the firefighting concentrates (Part 23), what a rewrite would actually need to preserve (Part 24), what the CEO should be told (Part 25), what the tech lead's inner map contains (Part 26), all five health properties (Part 27) — live in exactly one place: production, at function granularity. How often does each function run? How often does it fail, and has that drifted from its own baseline? What does its latency tail do under Tuesday's load? Did anything call it at all this month?
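As a rough illustration of what answering those questions takes, the sketch below wraps a function in a decorator that keeps a running tally of calls, errors and latency. Everything named here (`census`, `CENSUS`, `parse_amount`) is hypothetical, the maximum latency is only a crude stand-in for a real tail percentile, and the tallies are not thread-safe. It is a toy, not how any particular product collects the data.

```python
import time
from collections import defaultdict
from functools import wraps

# One small aggregate per function: no per-call records are kept.
CENSUS = defaultdict(
    lambda: {"calls": 0, "errors": 0, "total_s": 0.0, "max_s": 0.0}
)

def census(func):
    """Record call count, error count and latency aggregates for `func`."""
    key = f"{func.__module__}.{func.__qualname__}"

    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        except Exception:
            CENSUS[key]["errors"] += 1
            raise  # never swallow the caller's exception
        finally:
            elapsed = time.perf_counter() - start
            stats = CENSUS[key]
            stats["calls"] += 1
            stats["total_s"] += elapsed
            stats["max_s"] = max(stats["max_s"], elapsed)

    return wrapper

@census
def parse_amount(text):
    return float(text)

for raw in ["1.50", "2", "oops"]:
    try:
        parse_amount(raw)
    except ValueError:
        pass

stats = [v for k, v in CENSUS.items() if k.endswith("parse_amount")][0]
# stats["calls"] == 3 and stats["errors"] == 1
```

A function that never shows up in `CENSUS` over a month is the dormancy answer, for free.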

"But I have observability!" — you partly do, and it's worth being precise about the gap. APM and tracing grew up around services and endpoints: they'll tell you the checkout endpoint is slow and let a human spelunk into traces to find out why, during an incident, on demand. What they don't natively give you is the standing, per-function census this series keeps needing — every function's utilization, reliability and drift, all the time, including the 96% of functions no dashboard was ever hand-built for. Endpoint-level observability is a security camera on the front door. The questions we keep asking are about what every employee in the building actually does all day. The distinction matters most exactly where this series lives: the fragile-and-load-bearing intersection is defined at function granularity, and so is dormancy, and so is the firefighting ratio's address list. Zoom the instrument out to endpoint level and every one of those concepts dissolves into an average.

Static analysis reads the essay. Repo mining reads the diary. Production is the only witness that watched the code actually live — and almost nobody has put it on the stand.

The standard objections arrive in a predictable order, so let's take them at speed. "The overhead!" — a function-level census needs counts, error tallies and latency aggregates, not a full trace of every call; sampled and aggregated, this is among the cheapest classes of telemetry there is, far lighter than the request tracing you already run. "The noise — ten thousand functions is unreadable!" — correct, which is why raw telemetry isn't the product any more than raw ledger lines are an audit; the census becomes an instrument only when something classifies and ranks it (keep reading). And "we'd never look at it" — also correct, and fine: nobody stares at instruments. Instruments are for the moment the needle moves, and the entire point of a per-function baseline is that the needle CAN move visibly, for functions no human was ever going to watch by hand.
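What "the needle moves" could mean in practice: compare each function's current error rate against its own baseline, with fixed thresholds. The sketch below is one way to write that down. The `min_calls`, `ratio` and `floor` values are illustrative assumptions, not recommended settings.

```python
def error_rate(calls, errors):
    return errors / calls if calls else 0.0

def drifted(baseline, current, min_calls=100, ratio=2.0, floor=0.01):
    """True when the current error rate is at least `ratio` times the baseline.

    `baseline` and `current` are (calls, errors) tuples for one function.
    `floor` keeps a near-zero baseline from flagging on a handful of errors.
    All thresholds are illustrative assumptions.
    """
    cur_calls, cur_errors = current
    if cur_calls < min_calls:
        return False  # too little traffic to say anything
    base_rate = max(error_rate(*baseline), floor)
    return error_rate(cur_calls, cur_errors) >= ratio * base_rate

baseline = (10_000, 50)  # 0.5% errors historically
flags = [
    drifted(baseline, (1_000, 30)),  # 3.0% now
    drifted(baseline, (1_000, 12)),  # 1.2% now
    drifted(baseline, (50, 25)),     # 50% now, but only 50 calls
]
# flags == [True, False, False]
```

Nobody has to watch this function. The check runs for all ten thousand of them, and only the ones that return True reach a human.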

Rules, not vibes

One more design constraint, and it's non-negotiable: whatever reads that third instrument must be deterministic. The fashionable 2026 move is to pour telemetry into a large model and ask what it thinks, and for a measurement instrument that's disqualifying. A measurement you can't reproduce isn't a measurement; a classifier that flips a function between 'healthy' and 'fragile' with the model version is a mood ring with a pricing page. The warnings here are well documented. Sculley and colleagues' hidden-technical-debt paper catalogued how learned components entangle with one another and drift as the world changes. METR's randomized 2025 study found experienced developers believing AI assistance had sped them up when their measured task times had gone up: even human self-perception fails as a sensor, and an LLM judge adds a second layer of unauditable perception on top. Function-level behavior doesn't need a model's opinion anyway. Calls, errors, latency, dormancy: it's arithmetic with a good taxonomy. Where the physics is known, write the physics down.
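"Arithmetic with a good taxonomy" can be sketched in a few lines. The labels below borrow the words this series already uses, but the thresholds, the `FunctionStats` shape and the idea that "load-bearing" simply means high call volume are all assumptions made for illustration. A real rule set would be tuned, versioned and almost certainly richer.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FunctionStats:
    name: str
    calls: int      # calls in the observation window
    errors: int     # calls that raised
    p99_ms: float   # latency tail in milliseconds

# Hypothetical thresholds, fixed in code so every run agrees.
HIGH_TRAFFIC_CALLS = 10_000
FRAGILE_ERROR_RATE = 0.02
SLOW_P99_MS = 500.0

def classify(stats):
    """Return a sorted tuple of labels; the same input always gives the same output."""
    if stats.calls == 0:
        return ("dormant",)
    labels = []
    if stats.calls >= HIGH_TRAFFIC_CALLS:
        labels.append("load-bearing")
    too_many_errors = stats.errors / stats.calls >= FRAGILE_ERROR_RATE
    if too_many_errors or stats.p99_ms >= SLOW_P99_MS:
        labels.append("fragile")
    return tuple(sorted(labels)) or ("healthy",)

checkout = classify(FunctionStats("checkout.total", 50_000, 1_500, 120.0))
legacy = classify(FunctionStats("legacy.export", 0, 0, 0.0))
helper = classify(FunctionStats("util.slug", 200, 0, 3.0))
# checkout == ("fragile", "load-bearing")
# legacy == ("dormant",)
# helper == ("healthy",)
```

Run it tomorrow, or next year, on the same numbers and the labels are identical. If someone disputes a verdict, the argument is about a threshold in a file, which is an argument you can actually have.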

That is the instrument CodeNSM exists to be, stated plainly for once since this is Part 29 of 30: a runtime census at function level, classified and scored by fixed rules, so the third column in that diagram finally reports alongside the other two. (The generative models stay where perception belongs, drafting fixes against deterministically measured targets, and never in the judging.) Whether you buy ours or build your own: the restaurant has been reviewed off the menu and the invoices for forty years. Taste the food.
