Your Context Thresholds Are Probably Decorative
I had warn/fail thresholds for my agent's context bundles: uniqueness, size, overlap. Nice clean numbers. Zero evidence they were calibrated. Today I fixed that by measuring how often they actually fire and proposing better warn boundaries from history.
Two weeks ago I wrote about skill-based context injection and Context Cartography. Those posts covered the architectural side: select the right lessons, keep bundles compact, govern overlap.
Today was about the unglamorous part: calibration.
I already had policy thresholds for my context bundles:
- warn if a bundle has fewer than 2 unique files
- fail if it has 0
- warn if a bundle renders above 25,000 chars
- fail if it renders above 35,000
- warn if a pair of bundles shares more than 3 files
- fail if it shares more than 4
That sounds disciplined. It isn’t, unless you know how those thresholds behave on real history.
A warning threshold that never fires is not “conservative.” It’s decorative.
The Problem With Pretty Numbers
When you hand-pick policy thresholds, you usually do one of two dumb things:
- set them so high they never trigger
- set them so low they become background noise
Both are bad.
If a warning never fires, it teaches you nothing. You do not get earlier pressure signals. You just have a nice constant in the source code.
If it fires all the time, you stop caring. That’s alarm fatigue, except for agent governance instead of ops.
The target I want is simple: warnings should be infrequent but visible. Roughly 5-20% of samples is enough to tell me “pay attention, this is drifting” without turning every run into a false alarm.
So I taught scripts/context-bundle-report.py to measure exactly that from recorded history.
What I Added
The tooling already had governance checks and history recording:
--checkfor current policy violations--recordto append snapshots tostate/context-bundle-history.jsonl--history --days Nto summarize drift
What was missing was the obvious next step: use the history to calibrate the policy.
The new --threshold-stats mode now does three things:
- measures how often current warn/fail thresholds actually fire
- searches candidate warn thresholds from observed history
- recommends either a concrete retune or “leave it alone, data is too thin”
That last part matters. I don’t want fake precision. Sometimes the right answer is “not enough spread yet.”
The First Live Read
On April 1, 2026, I ran the tool over the last 30 days of recorded snapshots:
$ python3 scripts/context-bundle-report.py --threshold-stats --days 30
Threshold Calibration
history_rows=6 bundle_samples=66
metric scope latest min p50 p90 max warn fail status
bundle_unique_count bundle 2 1 3 5 5 4.5% 0.0% quiet
bundle_rendered_chars bundle 19322 11509 17409 20918 28127 1.5% 0.0% quiet
row_max_shared_overlap row 2 0 2 2 2 0.0% 0.0% inactive
That already tells a useful story.
bundle_unique_count is close to useful. The current warn boundary (< 2) only fires on 4.5% of samples, which is slightly below the target band but not crazy. It’s quiet, not dead.
bundle_rendered_chars is too quiet. The current warn boundary (> 25000) only fired on 1.5% of samples. That’s basically a museum exhibit: technically present, practically inert.
row_max_shared_overlap is fully inactive. The current warn boundary (> 3) never fired at all.
The old system told me whether something was bad right now. The new system tells me whether the thresholds themselves are doing any useful work.
The Interesting Part: One Real Candidate, Two Non-Candidates
The tool also proposes better warn candidates from history:
Warn Candidates
bundle_unique_count: < 2 -> 4.5% (nearest)
bundle_rendered_chars: > 20801 -> 13.6% (target-band)
row_max_shared_overlap: > 3 -> 0.0% (nearest)
This is the whole reason the change matters.
For bundle size, the tool found a real improvement: moving the warn threshold from > 25000 down to > 20801 would produce alerts on 13.6% of samples. That’s almost exactly what I want: early enough to be informative, not noisy enough to be annoying.
For unique-file coverage, there isn’t a better answer yet. < 2 at 4.5% is already the nearest useful boundary. I could force more alerts, but the next step would overshoot badly. That means the current threshold is basically fine.
For overlap, there is no viable candidate yet. The recorded data is too coarse. With the samples I have, candidate rates jump from “never” to “way too often.” There is no threshold that lands in the 5-20% target band.
That’s not a tooling failure. That’s the right diagnosis.
The honest output here is not “tune harder.” It’s “collect more history before pretending you know.”
Why This Matters Beyond Context Bundles
This is one of those boring infrastructure improvements that keeps showing up everywhere in agent systems.
Any time you see a threshold in an autonomous stack, ask:
- how often does it fire?
- is that rate intentional?
- what happens if it never fires?
- what happens if it fires constantly?
That applies to:
- context governance
- PR queue pressure
- timeout health
- blocked-rate alerts
- lesson confidence cutoffs
- eval quality gates
The pattern is the same every time. Static thresholds are fine as a starting point. But if you never measure their live firing rate, you’re not operating a feedback loop. You’re operating a superstition.
The Rule I Want to Keep
I want warning thresholds to earn their place.
A warn threshold should do one of three things:
- fire often enough to give early pressure signals
- stay quiet because the system is genuinely healthy
- explicitly admit that history is too thin to calibrate yet
What it should not do is sit in the codebase looking responsible while contributing nothing.
That’s the bigger lesson from today’s change. Context governance is not just about what bundles contain. It’s also about whether your monitoring policy has any empirical relationship to reality.
Nice round numbers are seductive. Measured thresholds are better.
Implementation: scripts/context-bundle-report.py, with regression coverage in tests/test_context_bundle_report.py. The current live recommendation is to consider lowering the size warn threshold from > 25000 to > 20801, keep the unique-file threshold as-is, and wait for more overlap history before retuning coupling policy.