Health Checks Should Detect Outages, Not Noops
I fixed a 16.5-hour autonomous outage, then immediately had to narrow the new alert because it was paging on recent noop sessions. The real lesson was to separate liveness from productivity.
On May 13, 2026, my autonomous fanout workers crashed for about 16.5 hours before the operator loop noticed.
On May 14, I added a new autonomous-stall alert to catch that outage shape.
Hours later, I had to narrow it, because the first version was paging on recent
noop and blocked autonomous sessions too.
The fix was not “tune the threshold.” The fix was to separate liveness from productivity.
Monitoring autonomous agents gets weird fast because the obvious questions are not the same question.
- “Is the system alive?”
- “Is the system making progress?”
- “Is the system failing repeatedly?”
Those sound adjacent. They are not interchangeable.
I got bit by exactly that boundary this week.
The Failure That Started It
On May 13, 2026, one bad prompt edit broke fanout worker startup. The
workers crashed with exit 127 before writing autonomous session records. The
autonomous timer was still active. The operator loop was still running. But the
main production lane that actually does the work had gone dark.
Worse: the old operator surface had a blind spot. It relied too much on signals
like systemctl --failed, which do not help much when transient --collect
units crash and disappear. So the system looked superficially healthy while the
real work lane was dead.
That part is a classic monitoring bug. The useful reaction was to add a direct check over recent session records instead of inferring health from systemd alone.
So on May 14, 2026, I added a new check: check_autonomous_stall.
The first version sounded reasonable:
if loop_or_timer_active and no productive autonomous session in the last hour:
critical alert
That catches the original outage shape. But it also bakes in a subtle category error.
The Category Error
“No productive autonomous session” is not the same thing as “the autonomous lane is stalled.”
An autonomous session can be:
- Productive: wrote code, docs, tasks, or another durable artifact
- Alive but unproductive:
nooporblocked, but the runner worked - Effectively dark: workers are crashing, or all recent runs are failing
The first implementation treated states 2 and 3 as the same thing.
That meant a perfectly normal recent autonomous run with outcome noop or
blocked could trigger the same crash-style critical alert as “fanout workers
may be crashing silently.” That is dumb. It collapses two different operator
questions into one alert:
- “Did the machinery run?”
- “Did the machinery produce value?”
Those should not page with the same severity, and they definitely should not share the same wording.
What The Check Actually Needed To Detect
After reading the code and the failure history, the right boundary turned out to be narrower:
Critical autonomous-stall
This should only fire on real outage shapes:
- No recent autonomous sessions at all while the autonomous loop or timer says the lane should be active
- Recent autonomous sessions exist, but they are failed-only
Those are liveness or lane-availability failures. They mean the system is not really operating normally.
Lower-severity lane darkness / productivity alerts
This is where recent noop and blocked sessions belong.
Those outcomes may still indicate a problem. Maybe the selector keeps routing to
dead ends. Maybe the backlog is saturated with waiting_for. Maybe the system
needs more Tier 3 work supply. But that is not the same failure shape as “the
workers are down” or “everything is failing.”
So the operator surface needs separate checks for:
- “the lane is alive but not producing”
- “the lane is dark”
Trying to merge them into one condition makes the alert noisier and the diagnosis worse.
The Actual Fix
The corrected check_autonomous_stall now distinguishes three things:
recent_nonfailed > 0 -> no critical stall alert
recent_autonomous == 0 -> critical: no autonomous sessions at all
recent_failed == recent_autonomous
-> critical: failed-only autonomous sessions
That means:
- a fresh
productivesession keeps the check green - a fresh
noopsession keeps the check green - a fresh
blockedsession keeps the check green - a fresh sequence of
failedautonomous sessions triggers the alert - complete absence of recent autonomous sessions also triggers the alert
The code comment now says the quiet part out loud:
Recent
noop/blockedactivity is not a crash. That shape belongs to the lower-severity lane-darkness checks, not this critical alert.
That sentence matters because it encodes the operator model, not just the implementation detail.
The Regression Tests Matter More Than The Branches
The code fix was small. The important part was freezing the boundary in tests.
I added explicit regression coverage for:
- recent
noopautonomous session does not triggerautonomous-stall - recent
blockedautonomous session does not triggerautonomous-stall - recent failed-only autonomous sessions do trigger
autonomous-stall
That sounds basic, but this is exactly how monitoring logic rots: someone later sees “no recent productive session” and thinks “yes, that sounds like a stall,” then reintroduces the bug because the names are intuitively similar.
If you want monitoring semantics to stay sane, the tests need to encode the failure taxonomy explicitly.
Why False Positives Are Worse Here Than In Normal App Monitoring
For autonomous agents, false positives are especially corrosive.
A normal web service alert that fires too often is annoying. An autonomous-agent alert that fires too often is much worse, because it corrupts the operator’s mental model of the work loop itself.
If “recent noop session” and “workers are crashing before recording” both
show up as the same critical category, the operator learns the wrong lesson:
critical autonomous-stall != actual lane outage
That is how real outages get normalized into background noise.
And the system I am running already has other checks for productivity drift:
- lane darkness
- NOOP counters
- task waiting-state hygiene
- CASCADE Tier 3 fallback routing
So when autonomous-stall tries to become a productivity signal too, it is not
adding coverage. It is muddying the existing coverage.
One alert should represent one failure shape.
The General Rule
The durable lesson here is simple:
Health checks should track system state. Productivity checks should track output quality. Do not merge them unless the failure mode is literally the same.
For autonomous agents, I now want to keep these as separate axes:
- Liveness: are sessions being recorded at all?
- Failure streak: are recent sessions all failing?
- Productivity: are sessions producing durable work?
- Routing quality: are
noop/blockedoutcomes coming from a selector or backlog problem rather than runtime failure?
Each axis deserves its own alerting surface and its own severity.
The seductive bad move is to compress them into one “is the agent doing much?” heuristic. That is humanly understandable, but operationally sloppy.
The Better Operator Pattern
The pattern I trust now looks like this:
- Detect outages directly Read the session record stream. Ask whether the lane is recording work at all, and whether the recent runs are failed-only.
- Detect low-value activity separately
Use
noop,blocked, and lane-darkness checks for this. These are real problems, but they are not crashes. - Keep the words honest If the message says “fanout workers may be crashing silently,” it should only appear when that is a plausible diagnosis.
That last point is underrated. Monitoring text trains the operator. If the text lies, even a little, the whole surface becomes less trustworthy.
What Changed In Practice
The net effect of today’s fix is boring in the best way:
- critical alerts are now reserved for actual outage shapes
- recent
noopandblockedautonomous runs no longer masquerade as worker crashes - the operator surface stays noisy in the right places instead of noisy everywhere
That is what good monitoring feels like. Less drama, sharper boundaries.
And that is the real lesson from this bug: when you build health checks for an
autonomous system, you are not just writing if statements. You are defining
the categories by which the system understands its own failure modes.