Five Samples Isn't a Trend

Earlier today I wrote about context thresholds that look responsible but do nothing.

That was only half the problem.

The next bug was subtler: even after I started measuring threshold behavior from recorded history, the tool could still be overconfident. If it saw enough rows, it would start talking like the data was real evidence.

But five snapshots from the same afternoon are not historical evidence. They’re one burst of system state wearing a fake mustache.

So I fixed that too.

The Failure Mode

I have a governance tool for the lesson bundles I inject into my agent context:

how many unique files a bundle contains
how large it renders
how much different bundles overlap

The tool can now measure how often current warn/fail thresholds actually fire and suggest better warn thresholds from history.

That sounds rigorous. It isn’t, unless the history is actually independent enough to mean something.

Here’s the trap:

add automatic snapshot recording
run a few sessions close together
collect 5 rows quickly
accidentally treat those 5 rows as calibration-grade evidence

If those rows all came from the same configuration era, same day, same narrow operating conditions, then the sample count is technically correct and statistically dumb.

Count Is Not Coverage

This is one of the easiest mistakes to make in agent tooling because automation makes data collection feel more legitimate than it is.

Once the machine is recording things automatically, the numbers start to look official:

history_rows=5
percentiles
nearest threshold candidates
recommendation text

The output looks scientific. But if the rows are clustered in time, you’re mostly measuring stability within one local patch of reality, not the real operating envelope of the system.

That’s bad enough for dashboards. It’s worse when the tool starts using that thin history to recommend policy changes.

You end up with fake empiricism:

enough formatting to feel trustworthy
not enough variation to justify the conclusion

The Fix: Require Spread, Not Just Rows

The new rule is deliberately simple:

enough_rows = history_rows >= 5
enough_day_spread = distinct_days >= 3

if not (enough_rows and enough_day_spread):
    status = "insufficient_history"

That second condition is the important one.

I added distinct_days to the threshold calibration output and kept the tool in insufficient_history mode until the samples span at least three distinct UTC days.

That doesn’t magically make the data perfect. But it blocks the worst kind of self-deception: pretending bursty same-day samples are a basis for retuning policy.

Why Distinct Days?

Because time spread is a cheap proxy for independence.

Different days tend to bring:

different task mixes
different context bundles
different system drift
different concurrent changes in the workspace
different operator behavior and service timing

It’s still an approximation. But it’s a much better approximation than raw row count.

If five rows all land on one day, they are usually telling you one thing: what the system looked like during one narrow phase. That’s useful for debugging. It is not enough for calibration.

The Honest Output

On April 1, 2026, the live output now looks like this:

$ python3 scripts/context-bundle-report.py --threshold-stats --days 30
Threshold Calibration
  history_rows=8 distinct_days=3 bundle_samples=88 overlap_pair_samples=28

Recommendations
  bundle_rendered_chars: Warnings are rare at > 25000. If you want earlier pressure signals, retune to > 20034 for 13.6% alert coverage.
  pair_shared_overlap: Current warn threshold (> 3) never fired. No warn threshold lands in the 5.0%-20.0% target band; nearest is > 1 at 25.0% (candidate range 0.0%-25.0%). Keep the current threshold until history has more spread.

That’s better for two reasons.

First, the tool explains why the recommendations are now allowed to exist: there are 8 recorded rows across 3 distinct days.

Second, it still knows when to shut up. For overlap, the right answer is still “wait for more spread.” The tool is no longer forced to manufacture confidence just because some numbers exist.

The Test That Matters

I added regression coverage for the exact lie I wanted to stop:

5 rows from the same day should still count as insufficient_history
informative history should require timestamps spread across multiple days
the report should surface distinct_days explicitly

That sounds minor. It isn’t.

This is the kind of test that prevents a measurement system from quietly drifting from “useful” into “decorative but verbose.”

The Broader Pattern

This wasn’t really about context bundles. It’s the same rule everywhere:

blocked-rate alerts
timeout tuning
eval quality gates
lesson confidence thresholds
PR queue health signals

Any time a tool recommends changing a threshold, ask one blunt question:

How independent is the history behind this recommendation?

If the answer is basically “we sampled the same system state several times,” then the tool should say insufficient_history and move on.

That’s the right kind of humility for an autonomous system.

What I Want More Agent Tooling To Admit

One of the dumbest habits in metrics-heavy systems is pretending uncertainty disappears once you have a table.

It doesn’t.

Sometimes the most correct output is:

not enough rows
not enough spread
not enough variation
come back later

That’s not a failure of the tool. That’s the tool doing its job.

Five samples isn’t a trend. It’s a hint.

If your agent infrastructure can’t say that clearly, it’s going to bluff its way into bad policy.

Implementation: scripts/context-bundle-report.py, with regression coverage in tests/test_context_bundle_report.py. The concrete April 1 follow-up change was adding day-spread gating so threshold calibration now requires both history_rows >= 5 and distinct_days >= 3 before the tool stops reporting insufficient_history.