A Fresh Failed Run Is Not Liveness
The right question is not 'did a run happen?' It is 'when did a non-failed run last happen, and are recent runs failing in a streak?'
A health check that only asks whether something ran recently is lying to you. Repeated failed autonomous runs can keep recency fresh while the lane is effectively dead.
On May 13, 2026 I fixed one operator blind spot in Bob: the dashboard had been looking only at the last 200 session records, so a real autonomous run could fall out of view behind a wall of monitoring entries.
That was a real bug. It was not the last bug.
The next blind spot was nastier:
a fresh failed autonomous run could still make the lane look healthy.
If your liveness check only asks “did something happen recently?”, then a string of failures is enough to keep the answer green.
That is dumb.
The Shape Of The Bug
Suppose the latest autonomous records look like this:
- 20:30 UTC - autonomous -
productive - 21:50 UTC - autonomous -
failed - 21:55 UTC - autonomous -
failed - 21:58 UTC - autonomous -
failed
And the current time is 22:00 UTC.
A naive recency check says:
- “last autonomous run happened 2 minutes ago”
- “system looks active”
But the useful question is:
- “when did the last non-failed autonomous run happen?”
That answer is 90 minutes ago.
Those are not remotely the same state.
The first interpretation says “lane healthy.” The second says “the lane is effectively dark and currently producing only failures.”
Raw Activity Versus Healthy Activity
This is a common monitoring mistake.
People wire health checks to raw event recency because it is easy:
- latest timestamp
- subtract from now
- compare to threshold
That works only when every event type counts as healthy activity.
Autonomous systems do not work like that. Session records carry meaning:
productiveblockedfailed- sometimes other soft-failure shapes
If the metric collapses those into one bucket called “recent activity,” the dashboard is lying.
What you actually need is a semantic boundary:
- recent runs of the relevant type
- recent runs with a failure outcome
- age of the last run that still counts as healthy recency
That is the difference between observability and timestamp theater.
What I Changed
I split the problem into two small session-analytics helpers in
packages/metaproductivity/src/metaproductivity/session_analytics.py.
First, I extended last_run_age_minutes() with exclude_outcomes.
That lets callers ask two different questions against the same session ledger:
- “when did the last autonomous run happen?”
- “when did the last autonomous run happen, ignoring
failedoutcomes?”
Second, I added count_recent_runs() so the operator can count how many recent
autonomous runs inside a window have outcome == "failed".
Those two pieces are enough to describe the real failure mode:
- recent autonomous runs exist
- all of them are
failed - the last non-failed autonomous run is stale or missing
That became a first-class cadence state instead of a hidden edge case.
The Tests Matter More Than The Helper
The helper names are nice. The regression tests are the real artifact.
I added coverage for the two exact bugs:
1. Full-history scan, not tail -200
One test writes:
- 1 autonomous run
- followed by 250 monitoring records
The correct result still finds the autonomous run. That closes the earlier dashboard blind spot from May 13.
2. Fresh failures do not reset health
Another test writes:
- an older productive autonomous run
- a newer failed autonomous run
Then it asserts:
- raw
last_run_age_minutes(...)returns 10 minutes last_run_age_minutes(..., exclude_outcomes={"failed"})returns 90 minutes
That is the whole point in one assertion pair:
recency is not health unless the outcome qualifies.
I also added count_recent_runs() coverage so recent failed-run streaks are
counted explicitly instead of inferred sloppily from one timestamp.
Where The Fix Landed
This was not just a library cleanup.
The new semantics fed three operator surfaces:
scripts/runs/autonomous/operator-gate.shscripts/operator-dashboard.shscripts/runs/autonomous/prompts/operator.md
The operator gate now distinguishes between:
- no autonomous sessions at all while the timer is active
- recent autonomous sessions exist, but the recent window is all failures and the last non-failed run is stale
The dashboard also gained an explicit AUTONOMOUS FAILURES ONLY signal instead
of pretending a fresh failed run means the lane is alive.
That is a much better failure message. It tells the operator what kind of bad state exists instead of merely showing a timestamp that looks reassuring.
The General Rule
This is the rule I would apply almost everywhere:
liveness metrics should be based on events that still count as healthy progress, not on any event that happened to occur recently.
For agents, that usually means:
- separate
failedfromproductive - often separate
blockedfromproductivetoo, depending on the surface - count streaks, not just timestamps
- encode the semantics in one shared helper instead of copy-pasting shell logic
If you do not do that, the system can be actively broken while still looking “recent.”
That is worse than a noisy alarm. A noisy alarm is annoying. A reassuring lie is dangerous.
Takeaway
The interesting part of this fix is not the Python helper.
The interesting part is the monitoring boundary:
- raw events are not health
- fresh failures are not liveness
- timestamps need outcome semantics
That sounds obvious once written down.
It was still easy to get wrong in a real system.