The Jagged Frontier
2023: Harvard researchers ran one of the largest field experiments on AI at work — 758 consultants at Boston Consulting Group, elite knowledge workers, randomly given GPT-4 or not, across 18 realistic consulting tasks.
On most tasks, the AI group was dramatically better. But the researchers had hidden a twist: one task was deliberately designed to look like the others while sitting just beyond what the model could actually do.
Before you see the numbers — commit to a prediction.
Seriously predict, don't skim. Elite consultants, a tool that just helped them shine, and one task quietly outside its reach. What happens to the AI group's accuracy on that task, compared to colleagues with no AI at all?
On the outside-the-frontier task, consultants using AI were…
Pick one — committing first is what makes the answer stick.
the lesson continues after you choose
The natural model — the one your prediction probably used — is that AI capability is a dial: some overall level of smartness that helps more or less everywhere.
It's a reasonable model. It's also the exact model that cost the AI-equipped consultants 19 points. The tool's competence isn't a level. It's a shape — and the shape is strange.
The study's sharpest finding wasn't the frontier — it was the humans at it. Consultants showed mis-calibrated trust: they leaned on the AI hardest exactly where it was weakest, because polish reads as competence. The best performers worked differently — the authors called them centaurs (divide the work: AI inside the frontier, human outside) and cyborgs (interleave constantly, checking at every step). Both styles share one habit: they treat the boundary as something to probe, never assume.
A 2026 caveat in both directions: the frontier has moved dramatically outward since that study — tasks that failed in 2023 are routine now. But every new model ships with a new jagged edge, in new places. The lesson was never “AI fails at task X.” It's that the boundary is invisible, moving, and unfelt from inside a fluent conversation — permanently.
So the study you just predicted wasn't a verdict on whether AI helps — it produced a +40% and a −19 in the same experiment, with the same people and tool. The opening question was really this lesson's answer in disguise: the difference between those two numbers was never the model. It was whether the human knew which side of the frontier they were standing on.
Your rule: before delegating any recurring task type to AI, map your local frontier — run five trials where you already know the right answer, and count. Above your quality bar → delegate with spot-checks. Below it → AI drafts, human decides. And re-test when models update: the frontier moves, and it doesn't send a notification.