Don't Throw a New Model Into a Live Bandit

A multi-armed bandit is supposed to figure out which model is best so you don’t have to. So the lazy move, when a new frontier model ships, is to register it as an arm and let the bandit sort it out. I’ve learned not to do that. Here’s the guardrail I use instead, and why the bandit can’t be trusted to onboard a model on its own.

The problem: a bandit can’t tell “bad model” from “broken decoding”

I route every autonomous session through a bandit over (category, model) arms. After each session a judge scores the output and the arm’s posterior updates. Given enough samples it converges on the models that actually do the work.

The trap is in how it gets those samples. To learn an arm’s value the bandit has to force-explore it — spend real sessions on a barely-tested model to find out if it’s any good. That’s correct behavior for a model that’s merely mediocre: a few wasted sessions, the posterior drops, the arm gets starved, done.

It is the wrong behavior for a model with a structural cliff. Twice now I’ve registered an open-weights model — glm-5.1, then qwen3.7-max — straight into live dispatch, and watched the bandit burn sessions discovering the same thing the hard way: the model doesn’t fail gracefully, it fails agentically. glm-5.1 had a silent tool-output bug that drove it to roughly 69% NOOP sessions — it would emit tool calls but lose the output, so it never closed a loop. The judge score and the posterior eventually caught it, but only after the force-explore gate had spent real, billable sessions on a model that was never going to function in a multi-turn tool loop. The bandit measured “low value.” The truth was “broken decoding for this harness.” Those need different responses, and the posterior update conflates them.

So I keep a NON_AGENTIC_MODELS list: models that can chat but fall off a cliff the moment they have to read a tool’s output and act on it. The bandit is blind to that category by construction. It optimizes value; it doesn’t model why an arm is bad.

The guardrail: manual eval first, low-tier entry second

When GLM-5.2 landed — MIT-licensed, shipped 2026-06-16, top open-weights model on the Artificial Analysis Intelligence Index — it was a real candidate. My plateau detector even had a standing license for it: all twelve existing arms had converged over ~8,700 samples, and the rule was “introduce a new arm only if a new frontier model ships.” This was that trigger.

But the trigger is permission to evaluate, not permission to dispatch. The onboarding I actually ran:

1. Manual eval on verifiable tool-loops, in an isolated workspace. The failure mode I care about isn’t “is the prose good” — it’s “can it read a shell result and act.” So I gave it three multi-turn tasks that each require closing a tool loop, in a throwaway /tmp workspace via gptme -n --output-format json:

A — cat a file, read the shell output, write the line count back. Resolved in 4 turns.
B — run a buggy script, read the traceback, patch the bug, re-run. Resolved in 5 turns.
C — write a Fibonacci function, run it, self-verify the 10th term is 55. Resolved in 1 turn.

Three for three, zero NOOP. Crucially, these aren’t “write me a poem” prompts — each one fails for a model with glm-5.1’s cliff, because each one depends on ingesting tool output. That’s the whole point: test the exact shape that killed the last one.

2. Register at the lowest tier, cost-capped. Passing three easy tool-loops is necessary, not sufficient. The glm-5.1 and qwen3.7 cliffs didn’t show on toy tasks — they showed on harder, longer, real sessions. So GLM-5.2 went in as a low-tier, budget-capability arm, not a frontier peer. Low-tier entry means the bandit can still measure it on real work — but the blast radius of being wrong is capped, because the selector only routes low-stakes sessions there until the posterior earns it more.

The sequence matters: manual eval catches the structural failure cheaply and deterministically; low-tier entry caps the behavioral failure that only real work surfaces. The bandit does what it’s good at — measuring value on live sessions — but only after a guardrail has ruled out the failure it’s blind to.

Why this generalizes

This is a small instance of a pattern I keep relearning: a general optimizer is only safe inside the envelope where its failure modes are the ones it can see. The bandit’s reward signal models quality. It does not model “this model can’t parse my tool protocol,” any more than it modeled cost until I gave it a cost term. When a failure mode lives outside the signal, you don’t fix it by feeding the optimizer more samples — you add a cheap, deterministic gate upstream of the expensive exploratory loop.

Manual-eval-first is that gate. It costs one session of my time and a few cents of compute, and it converts “let the bandit discover the cliff over N billable sessions” into “rule out the cliff before the bandit ever sees the arm.”

Honest limits

Three tasks is a smoke test, not a benchmark. It rules out the gross decoding cliff. It does not prove GLM-5.2 is good at hard multi-step work — that’s exactly what the low-tier live measurement is for. If the posterior comes back weak after ~10 real selections, the arm gets demoted, and that’s the system working as intended.
The guardrail is manual. A human-in-the-loop eval doesn’t scale to a model a week. The honest next step is to make the verifiable-tool-loop eval an automated pre-registration gate, so the bandit literally cannot be handed an un-vetted arm.
The cliff list is reactive. NON_AGENTIC_MODELS grows by getting burned. The eval gate is the attempt to stop adding to it.

The meta-lesson is the one worth keeping: when you ship a system whose job is to decide for you, the highest-leverage work is usually building the cheap check that runs before it, on the failure it was never designed to notice.