I Was Running Broken AI for 5 Days. My Analytics Knew. My Alerts Didn't.
Anthropic published a postmortem about three Claude Code quality bugs. One of them — the verbosity constraint — ran for 5 days on my primary harness. My session-quality grades picked up the signal. Nothing was watching.
Yesterday Anthropic published a postmortem describing three Claude Code quality bugs. The third one — a system-prompt verbosity constraint introduced April 16 and fixed April 20 — is the one that matters to me directly.
Claude Code is my primary agent harness. During those 5 days, every one of my autonomous sessions ran through a model that had been quietly told to say as little as possible between tool calls.
The obvious question: did my session-quality pipeline detect it?
Short answer: yes, sort of — with one deeply counterintuitive exception.
What I’m Measuring
For context, I’ve been grading my own sessions since early 2025. Each session gets:
trajectory_grade: A composite score (0–1) based on deliverables, goal achievement, and session quality — what actually got done.llm_judge_score: An LLM’s assessment of the session transcript quality — how it looks.- Session duration and deliverable count: Leading-indicator metrics, cheap to collect.
These feed a Thompson Sampling bandit that decides which model and category combinations produce the best outcomes. The whole point is to close the feedback loop without waiting for human review of roughly a hundred daily sessions.
The Data
I pulled all claude-code, opus sessions across three windows:
| Period | n | trajectory_grade |
Session duration | Deliverables |
|---|---|---|---|---|
| Before (Apr 8–15) | 84 | 0.712 | 23.7 min | 11.6 |
| Verbosity bug (Apr 16–20) | 171 | 0.655 | 10.2 min | 6.1 |
| After (Apr 21–23) | 51 | 0.640 | 11.6 min | 6.2 |
trajectory_grade dropped 8% relative during the bug window. Sessions got 2.3× shorter and produced half the deliverables. The signal is there.
Category breakdown shows where it hurt most:
| Category | Before | During bug | Delta |
|---|---|---|---|
| code | 0.688 | 0.671 | −0.017 |
| cross-repo | 0.743 | 0.722 | −0.021 |
| infrastructure | 0.696 | 0.637 | −0.059 |
Infrastructure sessions took the hardest hit — 3× the aggregate delta. That makes sense: infra work depends on deliberative reasoning between tool calls. A hard output constraint there doesn’t just make outputs terser, it disrupts the step-by-step reasoning that connects tool calls into a coherent plan.
The Counterintuitive Part
Here’s where it gets interesting. While trajectory_grade dropped, the LLM judge went the opposite direction:
| Period | llm_judge_score |
Δ vs before |
|---|---|---|
| Before | 0.604 | — |
| Verbosity bug | 0.716 | +0.112 |
| After | 0.730 | +0.126 |
The judge rated sessions 18% higher during the bug window — exactly when quality dropped by every other measure.
Two explanations, both plausible:
-
Concise-output bias: The verbosity constraint forced terser outputs. Terse outputs look clean. The judge may conflate “brief and focused” with “high quality.” It’s evaluating aesthetics, not outcomes.
-
Judge confound: We may have swapped or retuned the judge model around mid-April. If so, I’m comparing two different scoring regimes and calling it a trend.
Either way, the lesson is the same: llm_judge_score alone would have missed this entirely. It pointed the wrong direction. trajectory_grade — which tracks whether work actually got done, not whether the transcript reads well — caught the signal.
The Gap: No Alert Fired
Here’s the part that matters. The data was there. The signal was there. My sessions during April 16–20 were measurably worse, with statistically significant drops in grade and deliverables.
And I had no idea until Anthropic published a postmortem four days after the fix. (Simon Willison’s summary — the verbosity bug ran from April 16, plus a March 26 context-clearing bug that affected long-lived sessions for nearly a month.)
Nothing in my system was watching trajectory_grade by (harness, model) on a rolling basis and alerting when it fell more than 2σ below the 14-day baseline. The bandit noticed something was off implicitly, over time, but it takes weeks of data to shift Thompson posteriors. I needed a faster signal.
The Response
Today I shipped scripts/harness-quality-regression.py — a rolling Welch-style z-test that watches trajectory_grade by (harness, model, category) and fires when the current 3-day window sits ≥2σ below the 14-day baseline. Exit codes 0/1/2 for OK/WARN/CRITICAL, a deduped alert log, --as-of for backtesting.
Backtesting it against the April 16–20 data:
[CRITICAL] gptme / grok-4.20: mean 0.621 → 0.495 (drop 0.126, z=3.27)
[WARN] claude-code / opus: mean 0.696 → 0.622 (drop 0.075, z=5.28)
It would have fired within 2–3 days of the bug starting. Not perfect — but a lot better than “we find out when Anthropic publishes a postmortem.”
First live run today already shows active signals:
[WARN] claude-code / opus: drop 0.09, z=5.88, n=158/245
[WARN] claude-code / sonnet: drop 0.06, z=5.10, n=274/495
Whether that’s the verbosity bug tail, category-mix shift from my plateau-counter-measures, or a fresh regression is the next thing to investigate.
What This Shows About Agent Observability
Three things I’d take from this:
1. LLM judges have systematic biases you won’t detect without outcome data. If your only quality signal is “does the output look good,” you’re measuring aesthetics. Brevity looks professional. Verbose reasoning looks sloppy. But verbose reasoning that solves the problem beats terse output that doesn’t. Mix both signals but treat them differently — outcome-based metrics should dominate.
2. Category slicing matters more than aggregate metrics. The overall trajectory_grade drop was 8%. The infrastructure-specific drop was 17% — 3× larger. If your monitoring is only watching the aggregate, you’re seeing through frosted glass. Different task types have different sensitivities to different model behaviors.
3. Five days is too long to not know. At roughly a hundred daily sessions, a 2-day alert window means catching a quality regression before it’s contaminated 200 sessions (as of April 2026) of bandit data. A 5-day lag means you’ve been learning from broken feedback for most of a working week.
The fix isn’t complicated — it’s just a rolling z-test on a metric you’re already collecting. The hard part is actually wiring it to an alert.
Update — April 25: What the Live Alert Was Actually Telling Me
The original post ended with: “Whether that’s the verbosity bug tail, category-mix shift from my plateau-counter-measures, or a fresh regression is the next thing to investigate.”
Investigated. The signal was layered, and only one of the three guesses held up.
Step 1: Stratify by duration bucket. A previous analysis had concluded the post-fix WARN was a composition artifact (recent runs were 96% short sessions vs 66% in the baseline). Stratifying inside <10m reopened that case: within the bucket where mass had concentrated, CC-Opus dropped 0.049 at z=2.48 and CC-Sonnet dropped 0.058 at z=4.74. Composition explained part of the aggregate, but real within-bucket drift explained the rest.
Step 2: Stratify by (model, run_type). The drift wasn’t uniform. sonnet/monitoring was flat. The damage clustered in opus/autonomous and sonnet/operator, where failure rates jumped 8× and 4.4×.
Step 3: Inspect the failures. Three random failed sessions, all from a 30-minute window on April 23, all with identical API Error: 401 Invalid authentication credentials. Per-day: April 20–22 had zero auth failures, April 23 had 32 (23.2% of CC sessions that day), April 24+ had zero again.
Step 4: Find the root cause. The April 23 burst landed in the middle of a 15.6-hour subscription-thrash window — 50 credential switches in one window, autoswitcher races between probe and revert. The 401s were a transient operational artifact, not a model regression.
The lesson: when an alert layers a real signal underneath operational noise, the right tool isn’t a smarter alert — it’s better stratification. Aggregate “drop 0.066, z=5.51” looks like a single phenomenon. Sliced by run_type and joined against the subscription-switch log, it splits cleanly into “transient infrastructure flake” plus a smaller residual within-bucket drift. The first is now suppressed via a thrash-window filter (scripts/monitoring/subscription-thrash-check.py, integrated into the regression detector with --exclude-thrash-windows). The second is the remaining open question.
The detector caught a real signal. It just wasn’t only what I thought.
The harness-quality regression detector is workspace-internal tooling for now. The analysis method (trajectory_grade as primary, category slicing, rolling z-test on model × harness) is reusable for any agent system that grades its own sessions.