
The Jagged Frontier

01 · THE SETUP

In 2023, a team led by Harvard Business School researchers ran one of the largest field experiments on AI at work: 758 consultants at Boston Consulting Group, elite knowledge workers, randomly assigned to work with or without GPT-4 on realistic consulting tasks, 18 of which sat within the model's capabilities.

On most tasks, the AI group was dramatically better. But the researchers had hidden a twist: one task was deliberately designed to look like the others while sitting just beyond what the model could actually do.

Before you see the numbers — commit to a prediction.

Seriously predict, don't skim. Elite consultants, a tool that just helped them shine, and one task quietly outside its reach. What happens to the AI group's accuracy on that task, compared to colleagues with no AI at all?

02 · YOUR CALL

On the outside-the-frontier task, consultants using AI were…

If you pick A

That's the intuition the study demolished — and it's most people's pick. 'Expert + tool ≥ expert' feels like arithmetic. But it assumes the expert can tell when the tool is wrong, and that turned out to be exactly what fluent AI output erodes.

If you pick B

A sensible hedge. But it assumes consultants weighed the AI's contribution neutrally. They didn't — polished, confident prose got adopted, not audited. The effect had a direction, and it wasn't zero.

If you pick C — the mechanism

Correct, and the size of the effect is worth feeling. Top-tier professionals did substantially worse than colleagues with no AI, using a tool that had just boosted them everywhere else. The cause has a name from the lead author's earlier research, 'falling asleep at the wheel': fluent output switched off their own judgment.

If you pick D

On the tasks inside the frontier you'd have been right: 12.2% more tasks completed, 25.1% faster, and more than 40% higher rated quality. That's what makes the twist task the important one: the same tool, the same pool of consultants, and the sign of the effect flipped.


03 · NOT SO FAST

The natural model — the one your prediction probably used — is that AI capability is a dial: some overall level of smartness that helps more or less everywhere.

It's a reasonable model. It's also the exact model that cost the AI-equipped consultants 19 percentage points of correctness on the twist task. The tool's competence isn't a level. It's a shape, and the shape is strange.

04 · THE MECHANISM

The frontier is jagged: two tasks of equal human difficulty can sit on opposite sides.

THE PRINCIPLE

The jagged frontier

It means: AI capability forms an invisible, jagged boundary that does not track human difficulty. Tasks that feel equivalent to you can sit on opposite sides of it.
It works through: Models are superhuman where training signal was dense and verifiable, and weak where it wasn't — so capability follows the data's shape, not our intuition of 'hard'. Because output quality (fluency, confidence, format) looks identical on both sides, users can't feel the crossing — they only feel the polish.
Spot it when: Gold-medal maths next to a bungled scheduling constraint; flawless prose summarising a document it subtly misread. When your only evidence of quality is how good the output *looks*, you're at the frontier's edge.

The study's sharpest finding wasn't the frontier. It was the humans standing at it. Consultants showed miscalibrated trust: they leaned on the AI hardest exactly where it was weakest, because polish reads as competence. The best performers worked differently. The authors called them centaurs (divide the work: AI inside the frontier, human outside) and cyborgs (interleave constantly, checking at every step). Both styles share one habit: they treat the boundary as something to probe, never something to assume.

A 2026 caveat in both directions: the frontier has moved dramatically outward since that study — tasks that failed in 2023 are routine now. But every new model ships with a new jagged edge, in new places. The lesson was never “AI fails at task X.” It's that the boundary is invisible, moving, and unfelt from inside a fluent conversation — permanently.

05 · BACK TO THE OPENING

So the study you just predicted wasn't a verdict on whether AI helps. It produced a quality gain of more than 40% and a 19-percentage-point drop in correct answers in the same experiment, with the same pool of consultants and the same tool. The opening question was really this lesson's answer in disguise: the difference between those two numbers was never the model. It was whether the human knew which side of the frontier they were standing on.

06 · TAKE THIS WITH YOU

Your rule: before delegating any recurring task type to AI, map your local frontier. Run five trials where you already know the right answer, and count the ones the AI gets right. Hit rate clears your quality bar → delegate with spot-checks. Hit rate falls below it → AI drafts, human decides. And re-test when models update: the frontier moves, and it doesn't send a notification.
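
If you want to make the rule mechanical, here is one minimal way to sketch it in Python. Everything in it is an illustrative assumption rather than anything from the study: the names Trial and map_local_frontier, the 0.8 default bar, the choice to treat a hit rate exactly at the bar as a pass, and the two callables you supply (ask_model wraps whatever AI tool you use, is_correct encodes your own grading standard).

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Trial:
    """One task whose right answer you already know (hypothetical helper type)."""
    prompt: str
    expected: str


def map_local_frontier(
    trials: Sequence[Trial],
    ask_model: Callable[[str], str],
    is_correct: Callable[[str, str], bool],
    quality_bar: float = 0.8,  # assumed default; set your own bar
) -> tuple[float, str]:
    """Run every trial, count the hits, and return (hit rate, delegation rule)."""
    if not trials:
        raise ValueError("need at least one trial with a known answer")

    hits = sum(1 for t in trials if is_correct(ask_model(t.prompt), t.expected))
    hit_rate = hits / len(trials)

    # Assumption: a hit rate exactly at the bar counts as clearing it.
    if hit_rate >= quality_bar:
        return hit_rate, "delegate with spot-checks"
    return hit_rate, "AI drafts, human decides"


if __name__ == "__main__":
    # Stand-in for a real model: a lookup table with one deliberate error.
    canned = {"2+2": "4", "3+3": "6", "5+5": "10", "1+1": "2", "7+7": "15"}
    trials = [
        Trial("2+2", "4"),
        Trial("3+3", "6"),
        Trial("5+5", "10"),
        Trial("1+1", "2"),
        Trial("7+7", "14"),
    ]
    rate, decision = map_local_frontier(
        trials,
        ask_model=canned.__getitem__,
        is_correct=lambda got, want: got.strip() == want,
    )
    print(f"{rate:.0%} correct -> {decision}")
```

Five trials give only a coarse estimate, so treat the result as a first probe of the boundary and keep spot-checking afterwards. Re-running the same trial set after a model update is the cheap way to notice that the frontier has moved.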
