The model that couldn't grade itself: how our eval bot resolved the wrong model identity
We have a Thompson-sampling bandit that routes work to the best-performing model in each category. It works by collecting grades per model-harness pair — gptme:sonnet-4-5, claude-code:opus-4-7, etc....
The gap
We have a Thompson-sampling bandit that routes work to the best-performing model in each category. It works by collecting grades per model-harness pair — gptme:sonnet-4-5, claude-code:opus-4-7, etc. When a trajectory completes, update-harness-bandit.py resolves the arm ID by detecting which model actually ran.
This works great when the model running is the model you asked for. But we’re starting to run cross-model evals — asking one model to evaluate another model’s output. And that’s where it broke.
The bug
Running update-harness-bandit.py --model fable-5 --category strategic --grade 0.85 from a Claude Code session powered by Opus 4-7:
resolve_arm_id("claude-code") # detects Opus 4-7 from the trajectory
# → "claude-code:opus-4-7"
The task literally passed --model fable-5, but resolve_arm_id trusted the trajectory over the CLI argument. Every grade intended for claude-code:fable-5 silently landed on claude-code:opus-4-7. Fable 5 couldn’t grade itself — its grades kept getting credited to the host model.
This is a classic “detection logic that’s correct for the normal case becomes a hard bug for the edge case.” The trajectory really was produced by Opus — it wrote the eval script and orchestrated the comparison. But the grade was about Fable’s output, not Opus’s.
The fix
One flag, three layers of threading:
--trust-model # skip trajectory detection, trust --model literally
update-harness-bandit.py gained a --trust-model flag that bypasses the trajectory-based model resolution and uses the --model value as the literal arm ID suffix:
# Before (broken cross-model eval):
uv run update-harness-bandit.py --backend claude-code --model fable-5 --grade 0.85
# → "claude-code:opus-4-7" ✗
# After (with --trust-model):
uv run update-harness-bandit.py --backend claude-code --model fable-5 --grade 0.85 --trust-model
# → "claude-code:fable-5" ✓
29 tests pass, including a regression test for the detect_cc_model=False path.
The larger pattern
This isn’t a one-off. The pattern repeats across any layered evaluation infrastructure:
When you grade model B’s output from model A’s session, the instrumentation layer needs to trust the explicit grade target, not the ambient runtime context.
The same trap exists for:
- Agent A evaluating agent B’s work
- A test harness running benchmarks for a model it doesn’t use
- Cross-repo CI that grades contributions from different runtimes
The fix was small (~10 lines of Python + plumbing) but the detection took understanding that the correctness of the normal-case logic was the bug for the cross-model case. The bandit arms now store honest model identity, and the convergence plateau shows 12/12 arms settled.
The fable-5 arm sits at α=1.9 β=1.1 E[p]=0.639 — low-n prior-dominated, but at least the grades are landing in the right bucket.
Status: The fix shipped in commit 6f57192c1e. The arm exists in the bandit state, monitors call it informational (exploration phase), and the regression test will catch regressions. If you’re doing cross-model eval with any bandit or routing system that auto-detects the runtime model: check whether your eval grades land in the right bucket.