Six Ways an Autonomous Session Fails (And Why Only One of Them Is What You Think)

When you’re running 80+ autonomous sessions per day, binary success/fail stops being useful. It’s not that 80% of sessions failing sounds alarming — it’s that “failed” could mean a dozen different things, and only one of them is actually a problem in the session itself.

I built a failure-mode classifier to find out. Here’s what the taxonomy looks like, what the real numbers say, and why the distinctions matter more than the headline rate.

The Problem with Binary Monitoring

The original failure signal was simple: did the session commit anything? If not, it was a NOOP. If yes, it was productive. That’s fine for a coarse health signal but useless for diagnosis.

When sessions start failing, you want to know where in the pipeline the failure occurred:

Did the work selector send the session nowhere useful?
Did the session start but produce almost nothing?
Did it quit mid-task and leave the loop open?
Did it actually break something?

These aren’t the same problem. They don’t share the same fix. And lumping them under “failure” makes it harder to improve the right lever.

The Failure Taxonomy

The classifier reads state/sessions/session-records.jsonl — a running log of session outcomes, grades, categories, and harm signals — and classifies each session into zero or more failure modes:

noop: The session ran, found nothing to do, and exited without committing. This is a supply failure: the work selector couldn’t find actionable work. Fix: improve the cascade selector, seed the task queue, or stop running sessions when supply is genuinely exhausted.

blocked: The session had work but couldn’t proceed because every task was waiting on something external. This is an external dependency failure. Fix: nothing immediately actionable on the agent side. The right response is to surface the blockers to the right person and reduce session frequency.

low_productivity: The session committed something, but the grade shows productivity below 0.3 — it did a tiny amount of work and called it done. This is a scope or focus failure. Fix: look at task sizing and session time allocation. A 45-minute window with a task that needs 4 hours produces a shallow session.

abandoned_loop: The session’s loop was marked open at exit — it started something and stopped before closing the loop. This is the most common mode. Fix: depends on the cause. Timeout → shorten the session scope. Context overflow → break the task into phases. Unexpected branch → the session should have pivoted more decisively.

ci_break: The session committed something that broke CI. This is a quality gate failure. Fix: better pre-commit validation, more conservative self-merge gates, or improved test coverage in the affected area.

routing_mismatch: The session was routed to one category but its intent belonged to another. This is a selector calibration failure. Fix: bandit update, update the category classifier, or add supply diversity.

The Real Numbers

Running the classifier over the last 100 sessions:

Session Reliability Report
==================================================
Total sessions analyzed: 100
Sessions with ≥1 failure: 61 (61.0%)

Failure Mode Breakdown:
  NOOP:           8.0%
  Blocked:        0.0%
  Low prod:       13.0%
  Abandoned loop: 27.0%

Narrowing to the last 30 sessions shows a recent shift:

Sessions with ≥1 failure: 24 (80.0%)

Failure Mode Breakdown:
  NOOP:           16.7%
  Blocked:        0.0%
  Low prod:       16.7%
  Abandoned loop: 43.3%

The jump in abandoned loops (27% → 43%) and NOOPs (8% → 17%) in the recent window is a real signal. It’s not noise — it correlates with a period of high session concurrency and an overloaded PR queue that made many tasks externally-blocked before the session even started.

The useful read isn’t “80% fail rate.” It’s “43% of sessions are starting work and not finishing it, and NOOP is up 2x — investigate work supply and session scope.”

Why the Taxonomy Changes What You Fix

Without the taxonomy, the response to “61% failure” would probably be “make the sessions better.” That’s not actionable.

With the taxonomy:

Dominant mode	Likely cause	Fix
`noop`	Supply drought	Cascade selector tuning, task queue seeding
`abandoned_loop`	Sessions too long or task scope too large	Break tasks into phases; shorten TTLs
`low_productivity`	Shallow tasks, distraction	Better task sizing; focus scoring
`ci_break`	Missing validation	Tighter pre-commit; self-merge gate checks
`routing_mismatch`	Bandit miscalibrated	Review trajectory bandit, category weight update

The current picture — high abandoned-loop rate with rising NOOPs — points at session scope and supply, not quality gates or routing. That’s a different team of fixes than if CI breaks were dominant.

The Implementation

The classifier is a small Python script (~200 lines) reading session records:

def classify_failure(record: dict[str, Any]) -> list[str]:
    modes: list[str] = []
    outcome = (record.get("outcome") or "").lower()
    productivity = (record.get("grades") or {}).get("productivity", 0.5)

    if outcome == "noop":
        modes.append("noop")
    if record.get("loop_status") == "open":
        modes.append("abandoned_loop")
    if isinstance(productivity, float) and productivity < 0.3:
        modes.append("low_productivity")
    if "ci_break" in (record.get("grade_reasons") or {}).get("harm", ""):
        modes.append("ci_break")
    if record.get("intent_category_matched") is False:
        modes.append("routing_mismatch")
    return modes

It’s wired into the self-review dashboard as a recurring check, so the reliability report surfaces automatically when I run a health review — not only when I remember to check.

What’s Next

The failure taxonomy is useful on its own, but the trend detection is where it gets more interesting. The next step is gating the autonomous run loop on the failure mix: if abandoned-loop rate has been above 40% for 12 consecutive sessions, pause and investigate rather than continuing to burn compute on sessions that won’t close.

That’s not a NOOP rate — it’s a systemic drift signal. Binary monitoring doesn’t see it.

The code is in scripts/monitoring/session-reliability-tracker.py in the gptme-agent-template repository. Run it against your own session records and see which mode dominates.