Pre-registered software telemetry science: why we wrote 51 falsifiable hypotheses before looking at the data
Here is a magic trick every software vendor knows, whether or not they admit it. Collect a large proprietary dataset. Poke at it until something flatters the product. Publish that. The result looks like science — charts, percentages, an n in the footnote — but it is closer to advertising, because the poking guarantees a finding. Test enough post-hoc hypotheses against any dataset and some of them will clear significance by pure chance. The trick works every time. That should bother you.
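The arithmetic behind that claim fits in a few lines. The sketch below assumes every tested hypothesis is actually false (all nulls true) and that the tests are independent, and it uses the conventional 0.05 threshold. Those are simplifying assumptions for illustration, not a description of any particular vendor's analysis.

```python
def chance_of_spurious_finding(k: int, alpha: float = 0.05) -> float:
    """Probability that at least one of k independent tests of true-null
    hypotheses clears the significance threshold alpha by chance alone."""
    return 1.0 - (1.0 - alpha) ** k


if __name__ == "__main__":
    # The more post-hoc hypotheses you try, the closer a "finding" gets to certain.
    for k in (1, 10, 100):
        print(k, round(chance_of_spurious_finding(k), 3))
```

With one test the chance of a spurious hit is the threshold itself; with a hundred it is close to certainty, which is why the trick rarely fails.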
Science noticed this about itself first. John Ioannidis's famous 2005 paper argued that most published research findings are false, and that flexibility in design, definitions and analysis is one of the main reasons. The remedy the open-science movement converged on is pre-registration: write down your hypotheses, your tests and your minimum sample sizes before you see the data, so the data can only confirm or refute — never inspire — the claim. As Nosek and colleagues put it in their PNAS paper on the pre-registration revolution, the whole point is to keep the line between hypothesis-generating and hypothesis-testing research honest.
So when we started collecting anonymized production telemetry across the CodeNSM fleet, we made ourselves a rule: no magic trick. The result is a research programme of exactly 51 pre-registered, falsifiable hypotheses about how production codebases actually behave — written in code, before the data arrived.
What 51 bets about your codebase look like
The hypotheses live in seven mutually exclusive categories, each about a different object of study:
- Structural laws of production codebases — e.g. is production call load Pareto-concentrated? Does the dormant share of functions have a floor? Does concentration grow with scale?
- Archetype behaviour — role-conditioned claims, such as whether third-party-integration code is reliably the most error-prone role in a codebase, or whether router-style functions carry a multiple of their headcount share of traffic.
- LLM-era authorship effects — how the AI-native cohort of codebases differs structurally from the rest.
- Tech-debt dynamics — how debt distributes, accumulates and concentrates over a project's life.
- Latency and reliability economics — tail behaviour, and whether slow tails and fragile tails travel together.
- Developer and team effects — repo-linked claims about how team size and commit patterns relate to runtime health.
- Methodology and construct validity — hypotheses about our own instrument, including the falsifiable claim that our North-Star Metric construct tracks something real. If our own metric fails its pre-registered validity test, that status will say refuted on our board, in public red.
Each hypothesis is registered in code with a plain-language statement, a fixed statistical test, a direction, and a minimum n below which we report nothing but "collecting". The evaluation engine is generic — it never changes per hypothesis — so there is no way to quietly adjust a test after the fact without the diff showing it. We booby-trapped our own ability to cheat.
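As a rough illustration of what "registered in code" could look like, here is a minimal sketch. Every name in it (`Hypothesis`, `StatTest`, `Direction`, `REGISTRY`, `reportable`) is hypothetical, the single entry is a placeholder rather than one of the 51, and the real registry may be organised quite differently. Only the four registered fields and the three test families come from the description in this article.

```python
from dataclasses import dataclass
from enum import Enum


class StatTest(Enum):
    # The three classical test families the article names.
    ONE_SAMPLE_T = "one_sample_t"
    WELCH_T = "welch_t"
    PEARSON_R = "pearson_r"


class Direction(Enum):
    # Hypothetical encoding of the registered direction of the effect.
    GREATER = "greater"
    LESS = "less"


@dataclass(frozen=True)  # frozen: a registered entry cannot be mutated at runtime
class Hypothesis:
    hypothesis_id: str
    statement: str        # plain-language claim
    test: StatTest        # fixed statistical test
    direction: Direction  # registered direction
    min_n: int            # below this, nothing is reported but "collecting"


REGISTRY = (
    Hypothesis(
        hypothesis_id="EXAMPLE-01",
        statement="Illustrative placeholder, not a registered hypothesis.",
        test=StatTest.ONE_SAMPLE_T,
        direction=Direction.GREATER,
        min_n=30,  # made-up value for the sketch
    ),
)


def reportable(h: Hypothesis, n: int) -> bool:
    """True once the observation count reaches the registered minimum n."""
    return n >= h.min_n
```

Because an entry like this is plain data in version control, changing a test or a threshold means changing a line in the registry, and that change appears in the diff.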
The part where we can lose
Every one of the 51 can fail. Statuses are mechanical: collecting until minimum n, then validated, signal or refuted depending on effect direction and significance. And we genuinely expect some refutations — several hypotheses were registered as near-coin-flips on purpose. A research programme where nothing can fail is a brochure.
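A mechanical status rule of this kind can be sketched as a single pure function. The exact mapping is not spelled out here, so the one below is an assumption: significant in the registered direction is validated, in the registered direction but not significant is signal, and anything else is refuted. The function name, the `expected_sign` encoding and the 0.05 threshold are likewise illustrative.

```python
def hypothesis_status(
    n: int,
    min_n: int,
    effect: float,
    p_value: float,
    expected_sign: int,   # +1 or -1: the registered direction (assumed encoding)
    alpha: float = 0.05,  # assumed significance threshold
) -> str:
    """Assign a status from n, effect direction and significance only."""
    if n < min_n:
        return "collecting"
    in_registered_direction = effect * expected_sign > 0
    significant = p_value < alpha
    if in_registered_direction and significant:
        return "validated"
    if in_registered_direction:
        return "signal"
    # Assumption: an effect that is zero or points the wrong way counts as refuted.
    return "refuted"
```

The point of the sketch is that the function takes no hypothesis-specific arguments beyond what was registered, so the outcome cannot be nudged case by case.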
A hypothesis board with no red on it is not a research programme. It is a landing page.
This matters commercially as much as scientifically. The replication crisis, with the Open Science Collaboration's large-scale replication effort in Science as its landmark, taught everyone that many published findings do not hold up when independently re-tested. Findings that survive pre-registration are the ones worth building on, citing, and acquiring.
Why you could check our arithmetic with a pocket calculator
The second design decision was actually made earlier than the first: every hypothesis had to be testable from anonymized aggregates alone. The research observation schema contains no source code, no function names, no file paths and no company identifiers. It holds only per-project daily aggregates (distribution shapes, shares, ratios, latency percentiles, archetype census counts), keyed by an irreversible project hash.
This constraint bites. There are interesting questions we simply cannot register, because answering them would require identifying data, and we chose not to collect it. But the constraint buys something rare: the research database itself is shareable. A reviewer, a journalist, or an acquirer's technical diligence team can be handed the observations file and recompute every statistic with a pocket calculator and patience — the tests are classical one-sample t, Welch t and Pearson correlations, implemented in pure Python, with degrees of freedom stored so anyone can check our arithmetic.
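To show how little machinery these tests need, here is a minimal sketch of the three statistics using only the Python standard library. It is not the production implementation, and the function names are made up for illustration. Each function returns the statistic together with its degrees of freedom, so the p-value can be looked up in an ordinary t table.

```python
from math import sqrt
from statistics import mean, variance  # variance() is the sample variance (n - 1)


def one_sample_t(xs, mu0):
    """t statistic and degrees of freedom for H0: mean(xs) == mu0."""
    n = len(xs)
    t = (mean(xs) - mu0) / sqrt(variance(xs) / n)
    return t, n - 1


def welch_t(xs, ys):
    """Welch's unequal-variance t and its Welch-Satterthwaite degrees of freedom."""
    nx, ny = len(xs), len(ys)
    vx, vy = variance(xs) / nx, variance(ys) / ny  # squared standard errors
    t = (mean(xs) - mean(ys)) / sqrt(vx + vy)
    df = (vx + vy) ** 2 / (vx ** 2 / (nx - 1) + vy ** 2 / (ny - 1))
    return t, df


def pearson_r(xs, ys):
    """Pearson correlation coefficient and the df (n - 2) for its t test."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy), len(xs) - 2
```

Every step is a sum, a square root or a division over the aggregate values, which is what makes recomputation by hand tedious but entirely possible.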
Wait — why is a telemetry company doing science at all?
Partly because the questions are genuinely open. Software engineering research has strong traditions — from Lehman's laws of software evolution onward — but surprisingly little of it is grounded in continuous production runtime behaviour across many independent codebases. Static analysis sees the code; repository mining sees the commits; almost nobody sees what the code actually does all day, across a fleet, in a form that can be published.
And partly, candidly, because it compounds. Every project that installs CodeNSM adds observations no one else has, against hypotheses no one else registered, under an anonymization design no one else committed to in advance. Individual features can be copied in a quarter. A pre-registered longitudinal dataset cannot be copied at any price — the registration dates are the moat.
What lands on the board next
As hypotheses cross their pre-registered thresholds, we will publish them — statement, test, n, effect size, p-value, and the anonymized observations behind them. Where a result refutes our own prior (or our own metric), that gets published with the same prominence. If you want to see the discipline before the results, the categories and the rules above are the commitment; the board fills in as the fleet grows.
If you run production Python and want your codebase to be part of the n — anonymously, aggregates only — installation takes two minutes, and the science costs you nothing.
References
- Ioannidis, J.P.A. (2005). Why Most Published Research Findings Are False. PLoS Medicine.
- Nosek, B.A. et al. (2018). The preregistration revolution. PNAS.
- Open Science Collaboration (2015). Estimating the reproducibility of psychological science. Science.
- Center for Open Science — Preregistration.
- Lehman, M.M. (1980). Programs, Life Cycles, and Laws of Software Evolution. Proceedings of the IEEE.