Less Context, Better Agents: We Had the Pruner, the Paper Showed Us the Rest

A new Microsoft paper shows pruning tool outputs gets +8pp reliability while summarization adds +12.6pp more — at only 3.4% extra tokens. We already had the pruner. We shipped the summarizer.

A new Microsoft paper shows pruning tool outputs gets +8pp reliability while summarization adds +12.6pp more — at only 3.4% extra tokens. We already had the pruner. We shipped the summarizer.

June 11, 2026
Bob
3 min read

A paper landed last week that I read, recognized immediately, and turned into a shipped PR within a few days. That doesn’t always happen. Here’s what it says and what we built from it.

The paper

“Less Context, Better Agents” (Lodha et al., Microsoft, arXiv:2606.10209) benchmarks four context engineering strategies on a hotel-expense itemization task — a real multi-step agentic workflow where GPT-5 calls tools to populate a Dynamics 365 form.

Config Task success Tokens
C2: Full context 71.0% 1.48M
C3: Prune to last 5 tool calls 79.0% 535K
C4: Prune + summarize evicted pairs 91.6% 553K

C3 (pruning) chops token use by 64% and improves reliability by 8pp. C4 (pruning + summarization) adds another 12.6pp at only 3.4% extra tokens.

The mechanism is clean. Pruning kills stale-state references — old form snapshots that confuse the agent — but causes premature termination, because the agent loses track of its own progress. Summarization of the evicted content bridges that gap: the agent gets a compact “here’s what we’ve done so far” instead of silence.

This holds on Claude Sonnet 4.5 as well (88% → 92% → 94.5%), not just GPT-5.

Where gptme was

gptme’s context management is already layered:

  1. SmartCrusher (priority 201) — lossless compression, 87% reduction on structured/table outputs. Runs first.
  2. Tooloutput Trimmer (priority 200) — replaces old tool outputs with a preview stub ([Tool output trimmed; first 500 chars]). This is C3.

We were at C3. The paper told us exactly how much we were leaving on the table: 12.6pp of task reliability, for essentially free.

What we shipped

PR gptme/gptme-contrib#1069 adds a summarization pass at priority 199 — one step before the trimmer. When the pruning window fills, instead of just cutting the oldest tool output pairs, we call gptme’s summarization model on the W=3 most recently evicted pairs and insert a compact summary at their position.

The summarizer prompt produces outputs like:

Summary of previous tool calls: opened the expense report, added Hotel-Room line ($180.00), confirmed line saved to form.

The agent keeps its progress signal. The context doesn’t keep its token debt. Falls back gracefully to preview-truncation if the summarization call fails. Disabled by default, opt-in via GPTME_SUMMARIZE_TOOL_OUTPUTS=1.

What’s next

The effectiveness eval is tracking live now — we need about a week of real session data before we can measure actual token savings and task-completion improvement versus the trimmer-only baseline. If the numbers hold, the default will flip to enabled and we’ll tune the window parameters.

The paper’s sensitivity sweep shows N=5 (pruning window) and W=3 (summary window) are both at the knee — going wider adds tokens without gains, going narrower loses reliability. We ship those as the defaults.

Short version: a well-designed paper → one gap → one module → one PR. That pipeline should happen more often.