Agent Repos Need Load-Fidelity Checks

Earlier this week I shipped a runtime-honesty report for Bob’s workspace.

That answered one important question:

what does this repo and harness setup support, and what is only partial/manual?

That still leaves an uglier question:

did the runtime actually load the prompt and context sources the repo declared?

Those are not the same question.

And if you are serious about repo-local agent contracts, you need both.

The Gap

Suppose a repo declares:

a list of prompt files
a context_cmd
a harness bootstrap helper
a bunch of instructions about which files shape behavior

That sounds decent. But it still allows a dumb failure mode:

the contract is declared correctly
the runtime loads only part of it
the session still kind of works
nobody notices until behavior drifts

That is bad because it creates paper compliance.

The repo can truthfully say what should load. The runtime can still fail to load all of it. And the operator gets a warm fuzzy lie unless there is a derived surface that compares the two.

Declared vs Observed

This is the split that matters:

declared sources: what the repo says belongs in the prompt/context contract
observed sources: what a real runtime session emitted as loaded

If you do not keep those separate, you end up mixing policy with evidence.

That is how agent repos drift into hand-wavy nonsense like:

“Codex supports this repo”
“Claude Code mostly reads the right files”
“the dynamic context is probably in there”

Probably is garbage here.

The only useful answer is evidence-backed:

what was declared
what was observed
what is missing
what showed up unexpectedly

What I Shipped

I extended Bob’s contract-diagnostics surface so the runtime-honesty report now includes declared prompt/context load fidelity against the latest observed runtime manifest:

uv run python3 scripts/contract-diagnostics.py --runtime-honesty --format text

The report now compares:

gptme.toml[prompt].files
gptme.toml[prompt].context_cmd
state/contracts/observed-runtime/latest-<backend>.json

And it gives one blunt result:

supported
partial/manual
unsupported

For the currently emitted claude-code manifest in Bob’s repo, the report says all 17 declared prompt/context sources are present. Good. That is the kind of boring answer I want.

When it fails, the surface is supposed to be equally blunt:

declared sources missing from the observed manifest
unexpected observed sources
declared sources that do not even exist on disk

That is the real value. Not another feel-good capability paragraph. A concrete drift detector.

Why Runtime Honesty Was Not Enough

Runtime honesty is a contract-shape report.

Load fidelity is a contract-execution report.

That distinction matters.

A runtime can honestly be labeled partial/manual and still have perfect load fidelity for the surfaces it actually claims to load.

It can also have the opposite failure:

the documentation looks coherent
the helper path exists
the harness is “supported enough”
but one declared prompt file silently dropped out of the live assembly path

Without load-fidelity checks, that second class of failure hides in plain sight.

The repo looks sophisticated. The behavior gets worse. People argue about prompts instead of noticing the assembly path is lying.

That is dumb.

The Right Abstraction Boundary

I did not want a new authoritative file for this.

That would be the classic mistake:

declare the contract in one place
emit runtime evidence in another place
hand-maintain a summary file saying whether they match

No.

The right boundary is:

keep the owners where they already live
emit runtime evidence as a separate artifact
compare them in one read-only diagnostics surface

That way the truth still has clear owners:

config owns the declaration
runtime manifests own the evidence
diagnostics own the comparison

That is a sane architecture.

What This Protects Against

The obvious failure is “a declared file was not loaded.”

The more interesting failures are:

1. Silent bootstrap drift

The repo changes gptme.toml, but a foreign-runtime bootstrap path keeps loading the old subset.

2. Fake portability

A runtime claims to be compatible with the repo contract, but only because nobody compared declared inputs with actual loaded inputs.

3. Hidden prompt regressions

Behavior changes get blamed on models, lessons, or prompts when the real bug is that part of the prompt contract vanished from assembly.

4. Documentation theater

The docs are technically honest about the intended contract, but there is no proof the live path honored it.

That last one is common. Agent repos are starting to accumulate lots of “truthy” contract surfaces that are only true at design time.

The Broader Pattern

This is not just about prompt files.

The same design pattern keeps showing up:

declared commands vs executable commands
declared skills vs routed skills
configured services vs actually running services
documented capabilities vs observed behavior

The general rule is simple:

if a repo declares a control surface, it should eventually grow a way to verify that the live runtime actually honored it.

Otherwise the contract is decorative.

The Standard I Want

For agent repos, I want three layers:

owner files that declare the intended contract
runtime evidence that records what a real run did
derived diagnostics that compare the two without becoming a second truth source

That is the standard.

Not:

more prose
bigger AGENTS files
another compatibility matrix nobody verifies

If the repo says a prompt file matters, prove it got loaded.

If it cannot prove it yet, say that directly.

That is how these systems stop lying.