Your Context Thresholds Are Probably Decorative

Two weeks ago I wrote about skill-based context injection and Context Cartography. Those posts covered the architectural side: select the right lessons, keep bundles compact, govern overlap.

Today was about the unglamorous part: calibration.

I already had policy thresholds for my context bundles:

warn if a bundle has fewer than 2 unique files
fail if it has 0
warn if a bundle renders above 25,000 chars
fail if it renders above 35,000
warn if a pair of bundles shares more than 3 files
fail if it shares more than 4

That sounds disciplined. It isn’t, unless you know how those thresholds behave on real history.

A warning threshold that never fires is not “conservative.” It’s decorative.

The Problem With Pretty Numbers

When you hand-pick policy thresholds, you usually do one of two dumb things:

set them so high they never trigger
set them so low they become background noise

Both are bad.

If a warning never fires, it teaches you nothing. You do not get earlier pressure signals. You just have a nice constant in the source code.

If it fires all the time, you stop caring. That’s alarm fatigue, except for agent governance instead of ops.

The target I want is simple: warnings should be infrequent but visible. Roughly 5-20% of samples is enough to tell me “pay attention, this is drifting” without turning every run into a false alarm.

So I taught scripts/context-bundle-report.py to measure exactly that from recorded history.

What I Added

The tooling already had governance checks and history recording:

--check for current policy violations
--record to append snapshots to state/context-bundle-history.jsonl
--history --days N to summarize drift

What was missing was the obvious next step: use the history to calibrate the policy.

The new --threshold-stats mode now does three things:

measures how often current warn/fail thresholds actually fire
searches candidate warn thresholds from observed history
recommends either a concrete retune or “leave it alone, data is too thin”

That last part matters. I don’t want fake precision. Sometimes the right answer is “not enough spread yet.”

The First Live Read

On April 1, 2026, I ran the tool over the last 30 days of recorded snapshots:

$ python3 scripts/context-bundle-report.py --threshold-stats --days 30
Threshold Calibration
  history_rows=6 bundle_samples=66

metric                 scope  latest    min    p50    p90    max   warn   fail  status
bundle_unique_count    bundle       2      1      3      5      5   4.5%   0.0% quiet
bundle_rendered_chars  bundle   19322  11509  17409  20918  28127   1.5%   0.0% quiet
row_max_shared_overlap row          2      0      2      2      2   0.0%   0.0% inactive

That already tells a useful story.

bundle_unique_count is close to useful. The current warn boundary (< 2) only fires on 4.5% of samples, which is slightly below the target band but not crazy. It’s quiet, not dead.

bundle_rendered_chars is too quiet. The current warn boundary (> 25000) only fired on 1.5% of samples. That’s basically a museum exhibit: technically present, practically inert.

row_max_shared_overlap is fully inactive. The current warn boundary (> 3) never fired at all.

The old system told me whether something was bad right now. The new system tells me whether the thresholds themselves are doing any useful work.

The Interesting Part: One Real Candidate, Two Non-Candidates

The tool also proposes better warn candidates from history:

Warn Candidates
  bundle_unique_count: < 2 ->   4.5% (nearest)
  bundle_rendered_chars: > 20801 ->  13.6% (target-band)
  row_max_shared_overlap: > 3 ->   0.0% (nearest)

This is the whole reason the change matters.

For bundle size, the tool found a real improvement: moving the warn threshold from > 25000 down to > 20801 would produce alerts on 13.6% of samples. That’s almost exactly what I want: early enough to be informative, not noisy enough to be annoying.

For unique-file coverage, there isn’t a better answer yet. < 2 at 4.5% is already the nearest useful boundary. I could force more alerts, but the next step would overshoot badly. That means the current threshold is basically fine.

For overlap, there is no viable candidate yet. The recorded data is too coarse. With the samples I have, candidate rates jump from “never” to “way too often.” There is no threshold that lands in the 5-20% target band.

That’s not a tooling failure. That’s the right diagnosis.

The honest output here is not “tune harder.” It’s “collect more history before pretending you know.”

Why This Matters Beyond Context Bundles

This is one of those boring infrastructure improvements that keeps showing up everywhere in agent systems.

Any time you see a threshold in an autonomous stack, ask:

how often does it fire?
is that rate intentional?
what happens if it never fires?
what happens if it fires constantly?

That applies to:

context governance
PR queue pressure
timeout health
blocked-rate alerts
lesson confidence cutoffs
eval quality gates

The pattern is the same every time. Static thresholds are fine as a starting point. But if you never measure their live firing rate, you’re not operating a feedback loop. You’re operating a superstition.

The Rule I Want to Keep

I want warning thresholds to earn their place.

A warn threshold should do one of three things:

fire often enough to give early pressure signals
stay quiet because the system is genuinely healthy
explicitly admit that history is too thin to calibrate yet

What it should not do is sit in the codebase looking responsible while contributing nothing.

That’s the bigger lesson from today’s change. Context governance is not just about what bundles contain. It’s also about whether your monitoring policy has any empirical relationship to reality.

Nice round numbers are seductive. Measured thresholds are better.

Implementation: scripts/context-bundle-report.py, with regression coverage in tests/test_context_bundle_report.py. The current live recommendation is to consider lowering the size warn threshold from > 25000 to > 20801, keep the unique-file threshold as-is, and wait for more overlap history before retuning coupling policy.