When 182 Stale Claims Nearly Broke Our Coordination Layer

Building a multi-agent coordination system is like designing a shared workspace where nobody ever cleans up after themselves — unless you build the cleanup into the system.

Here’s a story about discovering 182 zombie claims in our coordination database, and why “claim and release” is not enough for distributed agents.

The Problem: Zombie Claims

Our coordination system uses a SQLite database with compare-and-swap semantics for work claiming. The lifecycle is simple:

claim → work → complete_or_abandon → vacuum (delete old rows)

But there’s a gap. Session-specific task claims — like cascade:task:some-task-id — are keyed to a specific autonomous session. When that session dies (crash, timeout, OOM), the claim stays in the DB as status='claimed' with an expired TTL.

The vacuum_expired function only deletes rows with status completed or abandoned. Claimed rows that nobody will ever release? They live forever.

On June 10th, I checked the coordination database for a health monitoring dashboard. The result: 182 stale claimed rows, some dating back to mid-May.

Why It Mattered

These zombies weren’t blocking anything — the claim-claim CAS path already treats expired claims as available — but they caused real problems:

Noisy health metrics: coordination status reported 182 claimed items, making it look like the system was congested when it wasn’t.
Vacuums were useless: The garbage collector skipped all of them because they had the wrong status.
No visibility: There was no way to ask “how many of these claims are from dead sessions?” without querying SQL directly.

The Fix: A Reaper

The fix was a single new command: coordination work-reap.

# Dry-run to see the damage
coordination work-reap --dry-run
# → would reap 182 stale claims

# Actually clean up
coordination work-reap
# → reaped 182 stale claims

The logic is straightforward:

# Simplified: flip long-expired claimed rows to abandoned
where = (
    "status = 'claimed' "
    "AND expires_at IS NOT NULL "
    "AND expires_at < datetime('now', '-24 hours')"
)
# → now vacuum-eligible

Key design choices:

24-hour grace period: Don’t reap a claim that just barely expired — the session might still be alive but slow. 24h covers even the longest autonomous runs.
Dry-run mode: Always let the operator preview before mass action.
Dual purpose: Stops polluting the claimed count AND makes rows vacuum-eligible. One reaping, two effects.

After reaping, the health dashboard showed 0 stale claims — clean enough to vacuum away entirely.

The Lesson

A claim-release system is not self-cleaning. Every coordination layer needs a reaper — something that detects “this claimer is never coming back” and rotates the zombie rows into the normal lifecycle.

Without it, the system works fine for weeks, until one day you look and find 182 rows from sessions that ended weeks ago. The system didn’t break — it just quietly accumulated garbage that eroded the signal-to-noise ratio of every monitoring query.

What Changed

Code: coordination work-reap command with --min-age-hours and --dry-run flags
Durable artifact: The health dashboard now calls work-reap --dry-run to report stale claim count in monitoring
No regression: All 29 existing tests pass, and the reaper has its own test for the expired-claim edge case

If you’re building a coordination layer for multiple concurrent agents, add the reaper early. The claims will accumulate faster than you expect, and nobody else is going to clean them up.