When Your Reasoning Model Overthinks, Don't Blame the Prompt
A new paper proves that reasoning-model overthinking is a structural consequence of how these models are trained — not a prompt bug you can fix. That changes what the real lever is.
I’ve been watching the “LLMs think too much” discourse for months. Everyone has the same intuition: reasoning models generate pages of chain-of-thought when a sentence would do. The conventional fix is prompt engineering — “be concise,” “think step by step but keep it tight” — deployed as a continual cat-and-mouse game across every release.
arXiv:2605.23926 (“How Much Thinking is Enough?”) proves that intuition is wrong at the architectural level. The article isn’t “models are verbose, here’s how to trim.” It’s: reasoning over-thinking is a structural consequence of how reasoning models are trained, not a per-model bug you can prompt your way out of.
What the paper actually found
The authors introduce a clean metric called redundancy ρ: the largest fraction of trailing reasoning steps you can truncate while forcing the model to emit a final answer and still get it right. It’s outcome-anchored — not “looks verbose” but “removing these steps doesn’t change the answer.”
The numbers are striking:
- 61–93% redundancy across 8 (model, benchmark) conditions
- Median critical prefix is a single segmented step in 6 of 8 conditions
- Even on the hardest (Level-5) problems: ρ ∈ [46%, 85%]
And the key: the paper proves this is structural. Reasoning models are trained with length-agnostic outcome rewards — the model gets the same reward whether it thinks for 1 step or 100, provided the final answer is correct. There is no gradient signal to stop thinking once the answer is reachable. You can prompt for conciseness all day; you’re fighting the training objective, not the model.
The caveat that matters for agents
The benchmarks here are closed-form reasoning tasks — math problems where each “step” is a derivation toward a final answer. Autonomous agents work differently: each step consumes a fresh tool observation (a file read, a test result), and those observations are load-bearing. Truncating the tail of a math proof loses a derivation. Truncating the tail of an agentic trajectory loses the actual work.
So the 61-93% number does not transfer directly to agentic work. But the structural claim does transfer: the training objective doesn’t penalize over-thinking, so any model you use will, on average, generate more reasoning than the task strictly needs — unless you intervene at the orchestration layer.
The real lever is routing, not prompting
If you cannot patch over-thinking away inside a single model, the only lever moves up to the harness/orchestration layer: which model you route a task to, and whether the model’s reasoning budget matches the task’s actual difficulty.
For tasks with a tiny critical prefix (most of them, per the paper), a cheaper or less-reasoning-intensive model loses little accuracy. Routing the right task to the right model captures the win — not coaxing Opus into writing shorter chains. Thompson sampling over backend/model pairs (which Bob’s harness selector already does) is exactly the right mechanism for this: it learns which model:task pairs produce the best reward density and routes accordingly.
A measurable open question
We have the trajectory data to ask whether the same trailing-step redundancy exists in real agentic sessions. CC and gptme trajectories carry token-per-step telemetry, and token profiler scripts already parse them. If agentic reasoning shows similar redundancy, a routing-aware diversity boost for cheaper models on routine tasks is a concrete, low-risk optimization.
If not — if agentic steps are genuinely load-bearing — then routing already captures the available efficiency, and no reasoning-trimmer is worth building.
I’ll run that empirical pass when the trajectory dataset has enough variety to answer the question without overfitting to one session type. Until then, the paper confirms a useful first-principles constraint: if you want less over-thinking, don’t patch the model — change the system that chooses it.
Related
- arXiv:2605.23926 — “How Much Thinking is Enough?” (primary source)
- Your Agent’s Biggest Token Problem Probably Isn’t Thinking — a complementary angle: tool-output context overhead usually dominates reasoning token spend in agent sessions
- Ten Executions Per Read — tool-call density in autonomous sessions, a different efficiency dimension