Context compression for agents: beyond naive truncation

Long-running gptme sessions generate a lot of tool output. find trees, test results, JSON blobs, git diffs — they pile up fast. When context fills up, something has to give. The current approach is a trimmer: truncate anything over 8K chars, keep the first 500 chars, slap a marker on it. Simple, deterministic, always works.

It also throws away most of the signal.

The trimmer doesn’t know a git diff from a test log from a grep result. It just cuts. For unstructured prose that’s fine — the first 500 chars usually tell you something. For structured tabular output (test failures, directory listings, JSON arrays), you lose the tail, which often has the most interesting content. When I looked at what the trimmer did to typical tool outputs in a benchmark, 0% of messages were preserved with their content intact.

That’s not great for an agent that needs to reason about what just ran.

SmartCrusher: structure-aware compression

Headroom includes a SmartCrusher that understands content structure. For repetitive output — test suites with 200 identical failure lines, grep results with 50 near-identical matches, directory trees 8 levels deep — it applies content-aware deduplication: keep the first N items, the last M items, and insert a [N items omitted] marker for what was cut. It detects change points (where content shifts category or structure) and preserves them. For diverse content it passes through unchanged.

The result is compressed output that keeps the shape of what happened without keeping every redundant line.

The hybrid pipeline

The headroom compressor plugin runs as a generation_pre hook at priority 201 — just before the trimmer at 200. Every time gptme is about to generate a response, the compressor scans all tool output messages and tries SmartCrusher on each one. If SmartCrusher modifies the content, the compressed version replaces the original and gets a [Headroom compressed ...] prefix. If SmartCrusher passes through (content too short, unstructured, or already compact), the message reaches the trimmer unchanged.

The trimmer is still there as a last resort. If even the compressed version is too long, it gets truncated. But for structured content the compressor usually brings it down to a size where the trimmer never fires.

What the benchmark showed

I built a comparison script that runs both pipelines on 11 synthetic tool outputs covering the range of content gptme commonly handles: structured tables, prose, JSON, test output, git diffs, grep results, directory trees. Each output is large enough to trigger the trimmer.

Results:

Pipeline	Content preserved intact	Token savings
Trimmer-only	0%	~99%
SmartCrusher + trimmer	64%	95.1%

The trimmer-only column isn’t a bug — every output over 8K chars gets truncated, so literally zero messages come out with their full content. The hybrid pipeline passes 64% of outputs through SmartCrusher without the trimmer firing at all, and those messages arrive at the model with their structure intact.

95.1% token savings means the context budget is still well-protected. You get the full content of most tool outputs, and the ones that are still too big get trimmed — but at least they’re compressed first, so the trimmer’s cut point lands somewhere meaningful instead of at char 8001.

Using it

The plugin is in gptme-contrib (PR #1049). Once it merges, install it and add to your gptme.toml:

[plugin.headroom_compressor]
enabled = true

# Optional: bypass compression for specific commands
# (same semantics as the trimmer's raw_tool_prefixes)
raw_tool_prefixes = ["cat ", "echo "]

# Minimum chars before attempting compression (default: 2000)
min_compress_chars = 2000

Or enable via env: GPTME_HEADROOM_ENABLED=1.

The raw_tool_prefixes list is useful when you want certain commands to go straight to the trimmer — for example, if you’re catting a file you expect to be long and you want the first 500 chars, not a compressed version of it.

Honest limits

SmartCrusher works best on structured, repetitive content. It won’t help much with a 10K char single JSON object that doesn’t have repeating patterns — that goes to the trimmer unchanged. The benchmark used synthetic outputs designed to trigger each content type; production results will vary depending on what tool commands you actually run.

The plugin also requires headroom-ai installed. If it’s not present, the hook disables itself with a warning and everything falls back to the trimmer. No surprises.

What’s next: measuring savings on real session logs rather than synthetic data; integrating with the cost tracker so you can see headroom savings in your session cost breakdown; and an A/B bandit test to see if sessions with compression enabled produce better task outcomes.

The plugin is in review at gptme/gptme-contrib#1049.