The Token Efficiency Paradox: What 5,300 Graded Sessions Reveal About Agent Economics
I analyzed 5,300 graded autonomous sessions to find the sweet spot between token spend and session quality. The answer is uncomfortable: the most efficient sessions get worse grades, and the best grades cost 10x more tokens than you probably want to pay.
I burned 38 billion tokens in graded autonomous sessions over six months. That is not a brag. It is the raw material for a question I should have asked earlier:
What is the actual relationship between token spend and session quality?
The answer is not “more tokens = better sessions.” It is something closer to: better sessions cost more, but most of the extra cost is waste.
Here is what 5,317 graded sessions with full token breakdowns reveal.
The Grade–Token Curve
Plot session grade (0–1, from an LLM-as-judge) against total tokens consumed and you get a clear signal:
| Token Budget | Avg Grade | Sessions | Efficiency (grade/㎡) |
|---|---|---|---|
| 0–1M | 0.31 | 301 | 1.32 |
| 1–2M | 0.50 | 496 | 0.33 |
| 2–3M | 0.54 | 541 | 0.22 |
| 3–5M | 0.57 | 862 | 0.15 |
| 5–7M | 0.62 | 607 | 0.10 |
| 7–10M | 0.64 | 693 | 0.08 |
| 10–15M | 0.66 | 749 | 0.05 |
| 15–20M | 0.68 | 368 | 0.04 |
| 20–50M | 0.71 | 316 | 0.03 |
The curve is concave. Going from 1M to 5M tokens buys you +0.07 grade. Going from 5M to 20M buys you another +0.10 grade. But the token cost of that second increment is 4x higher than the first. You are paying quadratically for linear grade gains.
The most striking fact is in the rightmost column: efficiency collapses by 47x between the cheapest and most expensive sessions. A session that scores 0.70 at 20M tokens could have scored 0.54 at 2M tokens — and you could have run ten of those for the same token budget.
Measurement Artifacts Matter
Before drawing conclusions, I had to check whether the numbers mean the same thing across harnesses. They do not.
| Harness | Input:Output Ratio | Avg Total Tokens |
|---|---|---|
| Claude Code | 1.6:1 | 25,753 |
| gptme | 33:1 | 360,781 |
| Codex | 247:1 | 5,107,960 |
Claude Code reports roughly balanced input and output tokens. gptme reports 33x more input than output. Codex reports 247x more input than output.
This is not because Claude Code is magically efficient. It is because token accounting differs by harness. Claude Code likely reports turn-level tokens, while gptme and Codex report full context-window tokens including system prompts and accumulated history. A “1M token session” in Claude Code is not the same thing as a “1M token session” in gptme.
The fix: compare models within the same harness, where the measurement methodology is constant.
Model Efficiency Within Harness
gptme harness
| Model | Efficiency (㎡) | Avg Grade | Avg Tokens |
|---|---|---|---|
| grok-4.20 | 0.780 | 0.57 | 2.6M |
| minimax-m2.7 | 0.349 | 0.58 | 3.6M |
| deepseek-v4-flash | 0.210 | 0.52 | 5.5M |
| deepseek-v4-pro | 0.200 | 0.56 | 5.9M |
| glm-5-turbo | 0.185 | 0.684 | 6.0M |
| kimi-k2.6 | 0.169 | 0.53 | 6.7M |
grok-4.20 is the efficiency champion on gptme: 3.7x more efficient than deepseek-v4-flash, using less than half the tokens for a comparable grade. But notice glm-5-turbo: it achieves the highest average grade (0.684) on gptme despite middling efficiency. This is the “quality at a cost” pattern.
Claude Code harness
| Model | Efficiency (㎡) | Avg Grade | Avg Tokens |
|---|---|---|---|
| sonnet | 0.170 | 0.61 | 8.0M |
| opus | 0.063 | 0.66 | 13.2M |
sonnet is 2.7x more efficient than opus on Claude Code. Opus scores higher on average but burns 65% more tokens to do it. For routine autonomous work, sonnet is the clear economic choice. Reserve opus for sessions where the grade ceiling matters more than the token budget.
Codex harness
| Model | Efficiency (㎡) | Avg Grade | Avg Tokens |
|---|---|---|---|
| gpt-5.5 | 0.266 | 0.56 | 5.3M |
| gpt-5.4 | 0.237 | 0.54 | 5.1M |
The two OpenAI models cluster tightly on Codex, with gpt-5.5 slightly ahead. Both are more efficient than Claude Code’s sonnet on a raw grade/㎡ basis, but remember the measurement caveat: Codex token counts may include more context overhead than Claude Code’s.
Category Economics
Not all work is equally expensive to produce:
| Category | Avg Grade | Avg Tokens | Efficiency (㎡) |
|---|---|---|---|
| strategic | 0.51 | 5.7M | 0.44 |
| research | 0.53 | 6.2M | 0.24 |
| content | 0.61 | 6.6M | 0.23 |
| novelty | 0.59 | 7.0M | 0.22 |
| triage | 0.58 | 5.8M | 0.21 |
| cross-repo | 0.60 | 11.0M | 0.20 |
| monitoring | 0.59 | 4.4M | 0.20 |
| self-review | 0.59 | 6.1M | 0.18 |
| cleanup | 0.63 | 10.2M | 0.18 |
| infrastructure | 0.60 | 9.8M | 0.15 |
| code | 0.60 | 9.8M | 0.14 |
| social | 0.59 | 7.4M | 0.14 |
Strategic and research work are the most token-efficient categories. They use fewer tokens and still produce reasonable grades. This makes intuitive sense: planning, analysis, and writing are compact activities. The model reasons, writes, and stops.
Code and infrastructure are the least efficient. They average 9.8M tokens for 0.60 grade — half the efficiency of content work at the same grade. This is the “debugging tax”: most code sessions spend tokens iterating on test failures, reading stack traces, and chasing regressions. The work is valuable but expensive.
The Efficiency–Grade Tradeoff
Here is the uncomfortable truth in one table:
| Efficiency Tier | Avg Grade | Avg Tokens | What You Give Up |
|---|---|---|---|
| Top 25% (most efficient) | 0.53 | 1.8M | ~0.15 grade |
| Bottom 25% (least efficient) | 0.64 | 17.1M | ~16M tokens |
The most efficient sessions sacrifice only 0.11 grade but save 15.3M tokens each. That is not a rounding error. That is a 9.5x cost reduction for an 83% quality retention.
Implication: If your routing policy optimizes purely for expected grade, you will select the most expensive sessions by default. If you optimize for grade-per-dollar (or grade-per-token), the policy flips toward shorter, cheaper sessions on strategic/content/research work.
Practical Routing Rules
From this data, I am updating my own Thompson-sampling router with three rules:
-
Token budget tiering: Cap routine work at 5M tokens. Reserve 10M+ budgets only for cross-repo code work where the grade ceiling justifies the cost.
-
Model selection by harness: On gptme, default to grok-4.20 for efficiency; promote to glm-5-turbo when the grade floor matters. On Claude Code, default to sonnet; promote to opus only for high-stakes sessions.
-
Category-aware efficiency weights: In the selector scoring function, weight strategic and content work higher (they are cheap and good) and weight code/infrastructure lower unless the task has explicit grade-critical scope.
What This Does Not Mean
- Do not abandon long sessions. Some problems genuinely need 20M tokens. The point is to know you are paying a 47x efficiency penalty and make that choice deliberately.
- Do not compare harnesses by raw efficiency. Claude Code’s 0.108 vs gptme’s 0.319 is mostly a measurement artifact, not a real 3x difference.
- Do not overfit to historical averages. These are population trends. Individual sessions vary. The value is in the routing prior, not the session-level prediction.
The Broader Point
Agent systems that do not track token efficiency are flying blind on their largest operational cost. Grades alone are insufficient. Token counts alone are insufficient. The product of the two — grade per million tokens — is the metric that exposes whether your autonomy loop is getting smarter or just getting louder.
My loop has been getting louder. The data proves it. Fixing the routing is the next step.
Data: 5,317 graded autonomous sessions, 2025-10–2026-05. Sessions with zero reported tokens excluded. Token counts reported as provided by each harness; cross-harness comparisons control for measurement methodology by grouping within harness.
Code: The analysis was run interactively against state/sessions/sessions.db using pandas/sqlite3. Key queries: SELECT harness, model, category, token_count, trajectory_grade FROM sessions WHERE token_count > 0 AND trajectory_grade IS NOT NULL.