Rule engines where we know the physics, LLMs only for the last mile
Picture the architecture slide we could have drawn. Stream all your function telemetry into a large language model, let it "reason" about your codebase, put a sparkle emoji on it. It would demo beautifully. Investors would nod. And it would be the wrong machine for the job — the way a food processor is the wrong machine for checking your bank balance. We think the discipline to say that out loud is becoming a differentiator.
Why your measuring tape should be the most boring object you own
CodeNSM's core outputs — the archetype assigned to each function, its live state, its debt tier, its health score, the North-Star Metric they roll up into — are all computed by deterministic rule engines. Same inputs, same answer, every time, forever. Three reasons, and none of them is nostalgia:
- Auditability. When we tell your team that your payment-retry function is fragile and load-bearing, that claim may reprioritise a sprint. It must be explainable down to the arithmetic, and it must be the same answer tomorrow. A probabilistic classifier that flips a function between "Top performer" and "Injury-prone" run-to-run would be worse than useless — it would train you to ignore the instrument.
- Science. Our 51-hypothesis research programme depends on the taxonomy being a fixed measurement instrument. You cannot pre-register hypotheses about archetype behaviour if the archetype assignment drifts with a model version. Metrology before analytics.
- Privacy and cost. Deterministic classification runs on aggregates, ships no source code anywhere, adds near-zero marginal cost per function, and works identically for a 200-function side project and a 40,000-function monolith. Those properties are structural, not the result of tuning.
There is a deeper point here, and it came from the machine-learning community itself. Sculley and colleagues' widely cited NeurIPS paper on hidden technical debt in ML systems warned that learned components erode abstraction boundaries: entanglement, undeclared consumers, feedback loops, behaviour that changes when the world does. Those costs are sometimes worth paying — when you genuinely don't know the physics of the problem. We mostly do. A function that is called ten thousand times a day, fails 4% of the time and sits on the request path is not an ambiguous, high-dimensional pattern-recognition problem. It is arithmetic with good taxonomy. Using a trillion-parameter model to do arithmetic is not sophistication; it is a category error with an invoice attached.
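To make "arithmetic with good taxonomy" concrete, here is a minimal Python sketch of a deterministic classifier applied to the function just described. The thresholds, the field names and the `classify` function are assumptions made for illustration, not CodeNSM's actual rules; only the state names Fit and Stressed come from this article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FunctionStats:
    """Aggregate telemetry for one function; no source code is involved."""
    calls_per_day: int
    error_rate: float        # fraction of calls that fail, 0.0 to 1.0
    on_request_path: bool


# Hypothetical thresholds, invented for this illustration only.
HIGH_TRAFFIC_CALLS = 1_000
STRESSED_ERROR_RATE = 0.02


def classify(stats: FunctionStats) -> dict:
    """Return a state plus the rules that fired, so the verdict can be explained."""
    load_bearing = stats.on_request_path and stats.calls_per_day >= HIGH_TRAFFIC_CALLS
    fragile = stats.error_rate >= STRESSED_ERROR_RATE

    reasons = []
    if load_bearing:
        reasons.append(
            f"on request path with {stats.calls_per_day} calls/day >= {HIGH_TRAFFIC_CALLS}"
        )
    if fragile:
        reasons.append(
            f"error rate {stats.error_rate:.1%} >= {STRESSED_ERROR_RATE:.1%}"
        )

    state = "Stressed" if fragile else "Fit"
    return {"state": state, "load_bearing": load_bearing, "reasons": reasons}


# The function from the paragraph above: 10,000 calls a day, 4% failures, on the request path.
example = FunctionStats(calls_per_day=10_000, error_rate=0.04, on_request_path=True)
verdict = classify(example)
print(verdict["state"], verdict["reasons"])
```

There is no sampling and no model version in this path, so the same statistics always produce the same verdict, and the `reasons` list carries the numbers behind it.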
Where the physics is known, write the physics down. Spend the model where the physics runs out.
So where does the LLM actually earn its keep?
None of this is anti-LLM. It is about putting the stochastic component where stochasticity is the point — the last mile, where structured findings must become action inside a specific, messy, human codebase. Yours.
The clearest example is the fix prompt. When CodeNSM flags a regression, the deterministic layer knows everything measurable: the function, its file and line, its archetype, its error and latency history, the target that would mark it healthy again. What it cannot know is your codebase's idioms, your test conventions, your naming taste. So we compile the finding into a precise, verifiable brief (context, task list, success metric) and hand it to your coding agent. Benchmarks like SWE-bench are built around the same idea: an agent's patch is scored against a concrete, checkable target, the tests that must pass, rather than against a request to "improve the code." Our determinism is what makes the target checkable.
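As a rough sketch of that hand-off, the Python below renders a finding as a three-part brief and pairs it with a deterministic check on the success metric. The `Finding` fields, the brief wording, the `meets_target` helper and every example value (function name, path, line, numbers) are hypothetical; the real brief format is not shown in this article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    """What the deterministic layer measured about one regression."""
    function: str
    file: str
    line: int
    archetype: str
    error_rate: float          # observed fraction of failing calls
    p95_latency_ms: float      # observed latency
    target_error_rate: float   # at or below this, the function counts as healthy


def compile_brief(f: Finding) -> str:
    """Render a finding as context, task list and success metric."""
    return "\n".join([
        "Context:",
        f"  {f.function} ({f.file}:{f.line}), archetype {f.archetype}",
        f"  error rate {f.error_rate:.1%}, p95 latency {f.p95_latency_ms:.0f} ms",
        "Tasks:",
        f"  1. Find why {f.function} fails and fix the cause.",
        "  2. Follow this repository's existing test and naming conventions.",
        "Success metric:",
        f"  error rate <= {f.target_error_rate:.1%}",
    ])


def meets_target(f: Finding, observed_error_rate: float) -> bool:
    """Deterministic check applied once the agent's change has been measured."""
    return observed_error_rate <= f.target_error_rate


# Invented example values, for illustration only.
finding = Finding("retry_payment", "billing/retry.py", 42, "Translator",
                  error_rate=0.04, p95_latency_ms=310.0, target_error_rate=0.01)
brief = compile_brief(finding)
print(brief)
```

The agent is free to decide how to satisfy the tasks; whether it succeeded is decided by `meets_target`, not by the agent's own account of its work.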
The same logic applies in reverse to evaluation. We deliberately do not use an LLM as the judge of code health: "LLM-as-judge" for a paid metric invites every bias the measurement literature warns about, with none of the reproducibility. And the caution is earned: METR's 2025 randomised study of experienced open-source developers found they took about 19% longer on tasks in their own, familiar repositories when using AI assistance, while believing they had been faster. Sit with that for a second: the instrument's users misestimated its effect on themselves. A measurement company cannot build its ground truth on that. The 2024 DORA report likewise found AI adoption associated with small drops in delivery throughput and stability even as individual satisfaction rose. Enthusiasm and measurement are different instruments.
Complexity is a hot potato — someone always holds it
John Ousterhout's A Philosophy of Software Design argues that the great divide in systems is between complexity you pay down once, in design, and complexity you pay forever, in operation. A rule engine is exactly that trade: expensive judgment concentrated at design time — what distinguishes a Records Clerk from a Translator, what makes a state flip from Fit to Stressed — so that operation is cheap, explainable and testable forever after. The rules encode years of tech-lead judgment; that encoding, not any single rule, is the asset.
The four-question test every new feature has to pass
Internally the dividing line is a checklist, and it is short. Does the output need to be identical tomorrow? Rule engine. Will a customer make a resourcing decision on it? Rule engine. Does it feed the research programme? Rule engine, no exceptions. Is it prose for a human, or a brief for an agent that will be verified deterministically afterwards? That is the last mile, and the model is welcome to it. The checklist has survived every feature review so far, including the ones where the LLM version would have demoed better.
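The checklist is simple enough to write down as code. The sketch below is one possible Python encoding, not an internal tool: the `OutputSpec` fields and the `route` function are hypothetical names, and two choices are assumptions, namely that the first three questions take precedence over the fourth and that an output matching none of them is sent to review.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OutputSpec:
    """Answers to the four questions for one proposed feature output."""
    must_be_identical_tomorrow: bool
    drives_resourcing_decision: bool
    feeds_research_programme: bool
    is_prose_for_human: bool
    is_agent_brief_verified_deterministically: bool


def route(spec: OutputSpec) -> str:
    """Decide which side of the line an output belongs on."""
    # Questions one to three: any "yes" means rule engine (assumed to win over question four).
    if (spec.must_be_identical_tomorrow
            or spec.drives_resourcing_decision
            or spec.feeds_research_programme):
        return "rule engine"
    # Question four: the last mile.
    if spec.is_prose_for_human or spec.is_agent_brief_verified_deterministically:
        return "llm"
    # Assumption: the article does not say what happens when no question applies.
    return "needs review"


health_score = OutputSpec(True, True, True, False, False)
fix_prompt = OutputSpec(False, False, False, False, True)
print(route(health_score), "|", route(fix_prompt))
```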
It also has a pleasant side effect on trust. When you ask "why is this function marked Stressed?", support answers with the actual rule and the actual numbers, in one message. There is no "the model weighs many factors" paragraph in our documentation, because there is no model in the measurement path to hide behind. In a market where every competitor's answer to hard questions is a shrug wrapped in the word "AI", being able to show your arithmetic is a moat made of something unfashionable: accountability.
So the honest architecture slide reads: deterministic where we know the physics, generative where we don't, and a hard, documented line between the two. Less fashionable than "AI-powered insights." Far more valuable to own.
References
- Sculley, D. et al. (2015). Hidden Technical Debt in Machine Learning Systems. NeurIPS.
- Jimenez, C. et al. (2024). SWE-bench: Can Language Models Resolve Real-World GitHub Issues? ICLR.
- METR (2025). Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity.
- Google Cloud (2024). DORA Accelerate State of DevOps Report.
- Ousterhout, J. (2018). A Philosophy of Software Design.