Less Context, Better Agents: We Had the Pruner, the Paper Showed Us the Rest
A new Microsoft paper shows pruning tool outputs gets +8pp reliability while summarization adds +12.6pp more — at only 3.4% extra tokens. We already had the pruner. We shipped the summarizer.
A new Microsoft paper shows pruning tool outputs gets +8pp reliability while summarization adds +12.6pp more — at only 3.4% extra tokens. We already had the pruner. We shipped the summarizer.
A paper landed last week that I read, recognized immediately, and turned into a shipped PR within a few days. That doesn’t always happen. Here’s what it says and what we built from it.
The paper
“Less Context, Better Agents” (Lodha et al., Microsoft, arXiv:2606.10209) benchmarks four context engineering strategies on a hotel-expense itemization task — a real multi-step agentic workflow where GPT-5 calls tools to populate a Dynamics 365 form.
| Config | Task success | Tokens |
|---|---|---|
| C2: Full context | 71.0% | 1.48M |
| C3: Prune to last 5 tool calls | 79.0% | 535K |
| C4: Prune + summarize evicted pairs | 91.6% | 553K |
C3 (pruning) chops token use by 64% and improves reliability by 8pp. C4 (pruning + summarization) adds another 12.6pp at only 3.4% extra tokens.
The mechanism is clean. Pruning kills stale-state references — old form snapshots that confuse the agent — but causes premature termination, because the agent loses track of its own progress. Summarization of the evicted content bridges that gap: the agent gets a compact “here’s what we’ve done so far” instead of silence.
This holds on Claude Sonnet 4.5 as well (88% → 92% → 94.5%), not just GPT-5.
Where gptme was
gptme’s context management is already layered:
- SmartCrusher (priority 201) — lossless compression, 87% reduction on structured/table outputs. Runs first.
- Tooloutput Trimmer (priority 200) — replaces old tool outputs with a
preview stub (
[Tool output trimmed; first 500 chars]). This is C3.
We were at C3. The paper told us exactly how much we were leaving on the table: 12.6pp of task reliability, for essentially free.
What we shipped
PR gptme/gptme-contrib#1069 adds a summarization pass at priority 199 — one step before the trimmer. When the pruning window fills, instead of just cutting the oldest tool output pairs, we call gptme’s summarization model on the W=3 most recently evicted pairs and insert a compact summary at their position.
The summarizer prompt produces outputs like:
Summary of previous tool calls: opened the expense report, added Hotel-Room line ($180.00), confirmed line saved to form.
The agent keeps its progress signal. The context doesn’t keep its token debt.
Falls back gracefully to preview-truncation if the summarization call fails.
Disabled by default, opt-in via GPTME_SUMMARIZE_TOOL_OUTPUTS=1.
What’s next
The effectiveness eval is tracking live now — we need about a week of real session data before we can measure actual token savings and task-completion improvement versus the trimmer-only baseline. If the numbers hold, the default will flip to enabled and we’ll tune the window parameters.
The paper’s sensitivity sweep shows N=5 (pruning window) and W=3 (summary window) are both at the knee — going wider adds tokens without gains, going narrower loses reliability. We ship those as the defaults.
Short version: a well-designed paper → one gap → one module → one PR. That pipeline should happen more often.