Tests that test nothing
A quick story about a thermometer.
A hospital wants to be sure no patient has a fever, so it institutes a rule: every patient's temperature must be charted every hour. Compliance is measured, dashboards are built, the compliance number climbs to 98%. Magnificent. Then an auditor visits and notices something about how the temperature is being charted on the night shift: the nurse looks at the previous reading and writes it down again.
Every box is ticked. Every chart is full. The measurement ritual is executing flawlessly, at scale, hourly — and the hospital has NO IDEA whether anyone has a fever, and hasn't for months.
Now let me tell you how your test suite has been doing lately.
The circularity problem
Here's the workflow that has become totally standard, and which almost nobody has thought hard about: a function gets written (often generated). Then somebody — increasingly the same model, in the same session — is asked to "write tests for this."
And the model does! Diligently. It reads the function, determines what the function does, and writes assertions that the function does that. If calculate_discount returns 0.15 for gold-tier customers, the test asserts that gold-tier customers get 0.15. Run the suite: green. Coverage: up. Everyone: pleased.
Notice what happened. Nothing was tested except the code's agreement with itself. The test is a mirror held up to the implementation — if the implementation is wrong, the mirror faithfully reflects the wrongness, in green. Was 0.15 correct? That fact lives in a product spec, a pricing meeting, a founder's head — places the model, reading only the code, cannot reach. A test is only ever as good as its independent source of truth about what should happen. Generate the test from the code, and the independent source is the code. The circle closes. The assertion asserts nothing.
The old-world version of this failure at least required a human to consciously phone it in. A human writing tests — even lazily — usually knew what the function was for, and some intent leaked into the assertions despite their best efforts. The model has no access to intent. Circularity isn't its failure mode; it's its only available behaviour when the spec exists nowhere in its context. The test-writing revolution didn't automate testing. It automated the transcription of current behaviour into assertion syntax.
Regression detection, to be fair, survives — if the code changes accidentally, the mirror notices. That's worth something! But understand what you now have: a suite that defends whatever the code happened to do on generation day. Including the bugs. Especially the bugs — they're load-bearing now. They have tests.
Coverage was already a suspect metric. Now it's a compromised one.
The research community called this years before the models arrived. Inozemtseva and Holmes' ICSE 2014 study — bluntly titled "Coverage Is Not Strongly Correlated with Test Suite Effectiveness" — measured exactly what it says across large Java systems: once you control for suite size, how much code your tests execute tells you surprisingly little about how many faults they detect. Coverage counts lines visited, not truths checked. A test that calls a function and asserts nothing meaningful lights up every line it touches. The metric literally cannot see the difference between interrogation and tourism.
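To make "lines visited, not truths checked" concrete, here is a small sketch that records which lines of a function each test executes, using Python's standard `sys.settrace` hook. The function `apply_late_fee`, its planted bug, and the billing rule (a late balance gains a 5.0 fee) are all hypothetical, and this is a toy tracer, not how a real coverage tool is built.

```python
import sys
from typing import Callable, Set, Tuple


def apply_late_fee(balance: float, days_late: int) -> float:
    # Hypothetical function with a planted bug: the fee is subtracted, not added.
    if days_late > 0:
        return balance - 5.0
    return balance


def run_traced(test: Callable[[], None]) -> Tuple[Set[int], bool]:
    """Run `test`; return the lines of apply_late_fee it executed and whether it passed."""
    target = apply_late_fee.__code__
    seen: Set[int] = set()

    def tracer(frame, event, arg):
        # Record only 'line' events that belong to the function under test.
        if event == "line" and frame.f_code is target:
            seen.add(frame.f_lineno)
        return tracer

    passed = True
    sys.settrace(tracer)
    try:
        test()
    except AssertionError:
        passed = False
    finally:
        sys.settrace(None)
    return seen, passed


def tourist_test() -> None:
    # Visits both branches, checks nothing that could be wrong.
    assert apply_late_fee(100.0, 0) is not None
    assert apply_late_fee(100.0, 3) is not None


def interrogating_test() -> None:
    # Same calls, but the expected values come from the (assumed) billing rule.
    assert apply_late_fee(100.0, 0) == 100.0
    assert apply_late_fee(100.0, 3) == 105.0
```

Calling `run_traced(tourist_test)` and `run_traced(interrogating_test)` yields the same set of executed lines, so a line-coverage report would score the two identically. Only the second run comes back with `passed == False`.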
That was the pre-AI situation: coverage was a weak proxy defended mostly by the fact that hand-written tests took effort, and effort correlated (loosely) with intent. Generation severed that last thread. Tests are now free, which means coverage is now free, which means a rising coverage number carries approximately zero information — you are watching the night nurse's compliance dashboard. Google's mutation-testing work (Petrović and Ivanković's "State of Mutation Testing at Google") points at what an honest metric has to do instead: deliberately inject faults and check whether the suite notices. Kill rate, not visit rate. It is telling that the industry's most serious testing organization decided the only trustworthy question is "does the alarm actually go off?" — and it is equally telling how few teams anywhere ask it.
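The "does the alarm go off?" question can be sketched in a few lines. What follows is a hand-rolled illustration using Python's standard `ast` module, not Google's tooling and not any particular mutation framework; `is_adult`, the single mutation operator, and both suites are made up for the example.

```python
import ast
from typing import Callable

# Hypothetical code under test, kept as source so it can be mutated.
SOURCE = """
def is_adult(age):
    return age >= 18
"""


class FlipGtE(ast.NodeTransformer):
    """One mutation operator: replace every `>=` with `>`."""

    def visit_Compare(self, node: ast.Compare) -> ast.Compare:
        self.generic_visit(node)
        node.ops = [ast.Gt() if isinstance(op, ast.GtE) else op for op in node.ops]
        return node


def load(tree: ast.Module) -> Callable[[int], bool]:
    """Compile a module AST and return the is_adult function it defines."""
    namespace: dict = {}
    exec(compile(tree, "<generated>", "exec"), namespace)
    return namespace["is_adult"]


def weak_suite(fn: Callable[[int], bool]) -> bool:
    # Executes every line of is_adult but never probes the boundary.
    return fn(30) is True and fn(5) is False


def strong_suite(fn: Callable[[int], bool]) -> bool:
    # Checks the boundary, where the injected fault actually lives.
    return fn(18) is True and fn(17) is False and fn(30) is True


def kills_mutant(suite: Callable[[Callable[[int], bool]], bool]) -> bool:
    """True if the suite passes on the original code and fails on the mutant."""
    original = load(ast.parse(SOURCE))
    mutated_tree = ast.fix_missing_locations(FlipGtE().visit(ast.parse(SOURCE)))
    mutant = load(mutated_tree)
    return suite(original) and not suite(mutant)
```

Here `kills_mutant(weak_suite)` is `False` and `kills_mutant(strong_suite)` is `True`. Both suites execute the whole function, but only one of them notices when the function is deliberately broken.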
Meanwhile the benchmark world quietly demonstrates what real tests are worth: SWE-bench, the benchmark of real GitHub issues, can score model patches at all only because each task ships with fail-to-pass tests, taken from the pull request that resolved the issue. These are tests written from an understanding of what should happen, which the buggy code fails and a true fix passes. Every SWE-bench evaluation is a small proof that intent-bearing tests can mechanically verify machine-written code. And every generated mirror-test in your codebase is the same machinery running in reverse: verifying nothing, at scale, in green.
The two-generation trap
Now stack this part (Part 9) on top of Parts 4 and 6, because the layers interlock into something genuinely nasty:
- The code is generated, with uniform confidence, subtly wrong at some unknowable rate (Part 6).
- Review, at volume, has degraded into a skim whose main comfort signals are… green tests and a coverage number (Part 4).
- The tests are generated from the code, so they are green precisely because they check nothing the code could fail (this part).
Each layer was supposed to backstop the one above it. Instead, the backstops are now made of the same material as the thing they're backstopping — machine output validating machine output, with the human reduced to admiring the agreement. It's not a safety net under a trapeze. It's a second trapeze, painted to look like a net, swinging in sync with the first because they share an engine.
(Self-aware aside: yes, a human can break the circle — write the test from the ticket instead of the code, feed the model the spec instead of the implementation, review assertions instead of coverage. Some disciplined teams do. But notice that every one of those fixes requires the scarce thing — human attention carrying actual intent — which is exactly the resource whose collapse created the problem. The circle isn't hard to break. It's hard to break at volume, which is the only place it matters.)
So what does honest verification look like now?
Strip away the ritual and ask what a test was ever for: it's a claim about how the code should behave, checked against how it does behave. The generated-mirror problem poisons the first half — the "should" went missing. But the second half has a supplement nobody's suite can fake: production. Real inputs, real load, real consequences. A function's live error rate, its latency under Thursday traffic, whether its callers retry, whether it's even called — these are behavioural facts, and they are gloriously incapable of being generated into agreement with anything. Production is the one test the code cannot write for itself. (This is the corner of the problem CodeNSM lives in — watching every function's runtime behaviour against its own baseline, precisely because a green suite and a rising coverage number no longer certify what they used to.)
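As a rough sketch of what "runtime behaviour against its own baseline" could mean at the smallest possible scale, here is a decorator that counts calls, errors and latency per function and compares the observed error rate to a baseline. This is not CodeNSM's API or implementation; every name here (`observed`, `STATS`, `drifted`, `parse_quantity`) and the 0.05 tolerance are assumptions made for illustration.

```python
import functools
import time
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class FunctionStats:
    calls: int = 0
    errors: int = 0
    latencies: List[float] = field(default_factory=list)

    @property
    def error_rate(self) -> float:
        return self.errors / self.calls if self.calls else 0.0


# In-memory store keyed by function name; a real system would export these.
STATS: Dict[str, FunctionStats] = {}


def observed(fn: Callable[..., Any]) -> Callable[..., Any]:
    """Record calls, errors and wall-clock latency for the wrapped function."""
    stats = STATS.setdefault(fn.__name__, FunctionStats())

    @functools.wraps(fn)
    def wrapper(*args: Any, **kwargs: Any) -> Any:
        start = time.perf_counter()
        stats.calls += 1
        try:
            return fn(*args, **kwargs)
        except Exception:
            stats.errors += 1
            raise  # observe the failure, never swallow it
        finally:
            stats.latencies.append(time.perf_counter() - start)

    return wrapper


def drifted(name: str, baseline_error_rate: float, tolerance: float = 0.05) -> bool:
    """True if the observed error rate exceeds the baseline by more than tolerance."""
    return STATS[name].error_rate > baseline_error_rate + tolerance


@observed
def parse_quantity(raw: str) -> int:
    # Hypothetical production function fed real, messy input.
    return int(raw)
```

Feed `parse_quantity` the inputs `"1"`, `"2"`, `"x"`, `"3"` and the recorded error rate is 0.25, so `drifted("parse_quantity", baseline_error_rate=0.0)` is `True`. No generated test produced that number; the inputs did.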
None of which retires your test suite — pre-deploy verification still catches what it catches, and intent-bearing tests remain the gold standard when someone actually encodes the intent. The point is narrower and sharper: stop reading coverage as safety. Coverage now measures the productivity of your test generator. That's all. The night nurse is very productive.
The chart is full. The chart was always full.
Whether anyone has a fever — that, you still have to actually check.
References
- Inozemtseva, L. & Holmes, R. (2014). Coverage Is Not Strongly Correlated with Test Suite Effectiveness. ICSE.
- Petrović, G. & Ivanković, M. (2018). State of Mutation Testing at Google. ICSE-SEIP.
- Jimenez, C. et al. (2024). SWE-bench: Can Language Models Resolve Real-World GitHub Issues? ICLR.
- GitClear (2025). AI Copilot Code Quality: research on churn, duplication and moved code.