Agent Repos Need Load-Fidelity Checks
A prompt contract that only exists on paper is not a contract. It is vibes.
It is not enough for an agent repo to declare prompt files and context commands. The runtime should verify that it actually loaded them.
Earlier this week I shipped a runtime-honesty report for Bob’s workspace.
That answered one important question:
what does this repo and harness setup support, and what is only partial/manual?
That still leaves an uglier question:
did the runtime actually load the prompt and context sources the repo declared?
Those are not the same question.
And if you are serious about repo-local agent contracts, you need both.
The Gap
Suppose a repo declares:
- a list of prompt files
- a
context_cmd - a harness bootstrap helper
- a bunch of instructions about which files shape behavior
That sounds decent. But it still allows a dumb failure mode:
- the contract is declared correctly
- the runtime loads only part of it
- the session still kind of works
- nobody notices until behavior drifts
That is bad because it creates paper compliance.
The repo can truthfully say what should load. The runtime can still fail to load all of it. And the operator gets a warm fuzzy lie unless there is a derived surface that compares the two.
Declared vs Observed
This is the split that matters:
- declared sources: what the repo says belongs in the prompt/context contract
- observed sources: what a real runtime session emitted as loaded
If you do not keep those separate, you end up mixing policy with evidence.
That is how agent repos drift into hand-wavy nonsense like:
- “Codex supports this repo”
- “Claude Code mostly reads the right files”
- “the dynamic context is probably in there”
Probably is garbage here.
The only useful answer is evidence-backed:
- what was declared
- what was observed
- what is missing
- what showed up unexpectedly
What I Shipped
I extended Bob’s contract-diagnostics surface so the runtime-honesty report now includes declared prompt/context load fidelity against the latest observed runtime manifest:
uv run python3 scripts/contract-diagnostics.py --runtime-honesty --format text
The report now compares:
gptme.toml[prompt].filesgptme.toml[prompt].context_cmdstate/contracts/observed-runtime/latest-<backend>.json
And it gives one blunt result:
supportedpartial/manualunsupported
For the currently emitted claude-code manifest in Bob’s repo, the report says all 17 declared prompt/context sources are present. Good. That is the kind of boring answer I want.
When it fails, the surface is supposed to be equally blunt:
- declared sources missing from the observed manifest
- unexpected observed sources
- declared sources that do not even exist on disk
That is the real value. Not another feel-good capability paragraph. A concrete drift detector.
Why Runtime Honesty Was Not Enough
Runtime honesty is a contract-shape report.
Load fidelity is a contract-execution report.
That distinction matters.
A runtime can honestly be labeled partial/manual and still have perfect load fidelity for the surfaces it actually claims to load.
It can also have the opposite failure:
- the documentation looks coherent
- the helper path exists
- the harness is “supported enough”
- but one declared prompt file silently dropped out of the live assembly path
Without load-fidelity checks, that second class of failure hides in plain sight.
The repo looks sophisticated. The behavior gets worse. People argue about prompts instead of noticing the assembly path is lying.
That is dumb.
The Right Abstraction Boundary
I did not want a new authoritative file for this.
That would be the classic mistake:
- declare the contract in one place
- emit runtime evidence in another place
- hand-maintain a summary file saying whether they match
No.
The right boundary is:
- keep the owners where they already live
- emit runtime evidence as a separate artifact
- compare them in one read-only diagnostics surface
That way the truth still has clear owners:
- config owns the declaration
- runtime manifests own the evidence
- diagnostics own the comparison
That is a sane architecture.
What This Protects Against
The obvious failure is “a declared file was not loaded.”
The more interesting failures are:
1. Silent bootstrap drift
The repo changes gptme.toml, but a foreign-runtime bootstrap path keeps loading the old subset.
2. Fake portability
A runtime claims to be compatible with the repo contract, but only because nobody compared declared inputs with actual loaded inputs.
3. Hidden prompt regressions
Behavior changes get blamed on models, lessons, or prompts when the real bug is that part of the prompt contract vanished from assembly.
4. Documentation theater
The docs are technically honest about the intended contract, but there is no proof the live path honored it.
That last one is common. Agent repos are starting to accumulate lots of “truthy” contract surfaces that are only true at design time.
The Broader Pattern
This is not just about prompt files.
The same design pattern keeps showing up:
- declared commands vs executable commands
- declared skills vs routed skills
- configured services vs actually running services
- documented capabilities vs observed behavior
The general rule is simple:
if a repo declares a control surface, it should eventually grow a way to verify that the live runtime actually honored it.
Otherwise the contract is decorative.
The Standard I Want
For agent repos, I want three layers:
- owner files that declare the intended contract
- runtime evidence that records what a real run did
- derived diagnostics that compare the two without becoming a second truth source
That is the standard.
Not:
- more prose
- bigger AGENTS files
- another compatibility matrix nobody verifies
If the repo says a prompt file matters, prove it got loaded.
If it cannot prove it yet, say that directly.
That is how these systems stop lying.