When 182 Stale Claims Nearly Broke Our Coordination Layer
Building a multi-agent coordination system means dealing with zombie claims: claims left behind by sessions that crashed and never released their work. Here's how we found 182 of them and built the reaper to clean them up.
Building a multi-agent coordination system is like designing a shared workspace where nobody ever cleans up after themselves — unless you build the cleanup into the system.
Here’s a story about discovering 182 zombie claims in our coordination database, and why “claim and release” is not enough for distributed agents.
The Problem: Zombie Claims
Our coordination system uses a SQLite database with compare-and-swap semantics for work claiming. The lifecycle is simple:
claim → work → complete_or_abandon → vacuum (delete old rows)
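To make the claim step concrete, here is a minimal sketch of a compare-and-swap claim on SQLite. The table name `work_claims`, its columns, and the `try_claim` helper are assumptions for illustration, not the actual schema or API of the system described here.

```python
import sqlite3

# Hypothetical schema: table and column names are illustrative assumptions.
SCHEMA = """
CREATE TABLE IF NOT EXISTS work_claims (
    key        TEXT PRIMARY KEY,
    owner      TEXT NOT NULL,
    status     TEXT NOT NULL,
    expires_at TEXT
)
"""


def try_claim(conn: sqlite3.Connection, key: str, owner: str, ttl_seconds: int) -> bool:
    """Try to claim `key`; return True only if this caller now holds the claim."""
    ttl_modifier = f"{ttl_seconds:+d} seconds"  # e.g. "+60 seconds"
    with conn:  # commit on success, roll back on error
        # Compare-and-swap: take over an existing row only if it is not
        # currently held by a live (unexpired) claim.
        cur = conn.execute(
            "UPDATE work_claims "
            "SET owner = ?, status = 'claimed', expires_at = datetime('now', ?) "
            "WHERE key = ? "
            "AND (status != 'claimed' OR expires_at < datetime('now'))",
            (owner, ttl_modifier, key),
        )
        if cur.rowcount == 1:
            return True
        # No row was swapped: either the key is new or a live claim holds it.
        # INSERT OR IGNORE succeeds only in the first case.
        cur = conn.execute(
            "INSERT OR IGNORE INTO work_claims (key, owner, status, expires_at) "
            "VALUES (?, ?, 'claimed', datetime('now', ?))",
            (key, owner, ttl_modifier),
        )
        return cur.rowcount == 1
```

In this sketch a second session calling `try_claim` on a live claim gets `False`, while an expired claim can be taken over. The expired row is only rewritten when someone claims the same key again, which is why unclaimed leftovers can pile up.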
But there’s a gap. Session-specific task claims — like
cascade:task:some-task-id — are keyed to a specific autonomous session.
When that session dies (crash, timeout, OOM), the claim stays in the DB
as status='claimed' with an expired TTL.
The vacuum_expired function only deletes rows with status completed
or abandoned. Claimed rows that nobody will ever release? They live
forever.
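The gap can be reproduced in a few lines. The sketch below is a stand-in for the vacuum step, not the real `vacuum_expired` implementation: the `work_claims` table and its columns are assumed, and any age filter the real function applies is left out.

```python
import sqlite3


def vacuum_expired(conn: sqlite3.Connection) -> int:
    """Stand-in vacuum: delete only finished rows (completed or abandoned)."""
    with conn:
        cur = conn.execute(
            "DELETE FROM work_claims WHERE status IN ('completed', 'abandoned')"
        )
    return cur.rowcount


def demo_gap():
    """Show that an expired 'claimed' row survives the vacuum."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical schema, for illustration only.
    conn.execute(
        "CREATE TABLE work_claims ("
        "key TEXT PRIMARY KEY, status TEXT NOT NULL, expires_at TEXT)"
    )
    conn.executemany(
        "INSERT INTO work_claims VALUES (?, ?, datetime('now', ?))",
        [
            ("cascade:task:done", "completed", "-2 days"),
            ("cascade:task:given-up", "abandoned", "-2 days"),
            ("cascade:task:zombie", "claimed", "-2 days"),  # session died here
        ],
    )
    deleted = vacuum_expired(conn)
    survivors = [row[0] for row in conn.execute("SELECT key FROM work_claims")]
    return deleted, survivors
```

Calling `demo_gap()` deletes the two finished rows and leaves the zombie in place, even though its TTL expired two days earlier.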
On June 10th, I checked the coordination database for a health monitoring dashboard. The result: 182 stale claimed rows, some dating back to mid-May.
Why It Mattered
These zombies weren’t blocking anything (the CAS claim path already treats expired claims as available), but they caused real problems:
- Noisy health metrics: coordination status reported 182 claimed items, making it look like the system was congested when it wasn’t.
- Vacuums were useless: The garbage collector skipped all of them because they had the wrong status.
- No visibility: There was no way to ask “how many of these claims are from dead sessions?” without querying SQL directly.
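One way to close the visibility gap is to split the raw claimed count into live and expired claims. This is a sketch under assumptions: the `work_claims` table, its columns, and the `claim_breakdown` helper are hypothetical, and a claim with no `expires_at` is treated as live.

```python
import sqlite3

# Hypothetical schema, for illustration only.
SCHEMA = """
CREATE TABLE IF NOT EXISTS work_claims (
    key        TEXT PRIMARY KEY,
    status     TEXT NOT NULL,
    expires_at TEXT
)
"""


def claim_breakdown(conn: sqlite3.Connection) -> dict:
    """Split 'claimed' rows into live claims and expired (likely dead) ones."""
    row = conn.execute(
        """
        SELECT
            COALESCE(SUM(expires_at IS NULL OR expires_at >= datetime('now')), 0),
            COALESCE(SUM(expires_at IS NOT NULL AND expires_at < datetime('now')), 0),
            MIN(CASE WHEN expires_at < datetime('now') THEN expires_at END)
        FROM work_claims
        WHERE status = 'claimed'
        """
    ).fetchone()
    # oldest_stale is None when no expired claims exist.
    return {"live": row[0], "stale": row[1], "oldest_stale": row[2]}
```

A status command that reports `live` and `stale` separately would not have looked congested: the 182 expired rows would show up as stale rather than as active work.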
The Fix: A Reaper
The fix was a single new command: coordination work-reap.
# Dry-run to see the damage
coordination work-reap --dry-run
# → would reap 182 stale claims
# Actually clean up
coordination work-reap
# → reaped 182 stale claims
The logic is straightforward:
# Simplified: flip long-expired claimed rows to abandoned
where = (
    "status = 'claimed' "
    "AND expires_at IS NOT NULL "
    "AND expires_at < datetime('now', '-24 hours')"
)
# → now vacuum-eligible
Key design choices:
- 24-hour grace period: Don’t reap a claim that just barely expired — the session might still be alive but slow. 24h covers even the longest autonomous runs.
- Dry-run mode: Always let the operator preview before mass action.
- Dual purpose: Stops polluting the claimed count AND makes rows vacuum-eligible. One reaping, two effects.
After reaping, the health dashboard showed 0 stale claims, and the 182 reaped rows, now marked abandoned, were eligible for the next vacuum.
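Putting the filter and the design choices together, a reaper along these lines could look like the sketch below. It reuses the WHERE clause shown earlier, but the `work_claims` table name, the schema, and the `reap_stale_claims` function are assumptions, not the actual `work-reap` implementation.

```python
import sqlite3

# Hypothetical schema, for illustration only.
SCHEMA = """
CREATE TABLE IF NOT EXISTS work_claims (
    key        TEXT PRIMARY KEY,
    status     TEXT NOT NULL,
    expires_at TEXT
)
"""

# Same filter as in the post, with the grace period passed as a parameter.
STALE_WHERE = (
    "status = 'claimed' "
    "AND expires_at IS NOT NULL "
    "AND expires_at < datetime('now', ?)"
)


def reap_stale_claims(
    conn: sqlite3.Connection, min_age_hours: int = 24, dry_run: bool = False
) -> int:
    """Mark claims that expired more than `min_age_hours` ago as abandoned.

    Returns the number of rows that were (or, in dry-run mode, would be) reaped.
    """
    if min_age_hours < 0:
        raise ValueError("min_age_hours must be non-negative")
    cutoff = f"-{min_age_hours} hours"  # SQLite datetime modifier
    if dry_run:
        # Preview only: count matching rows without changing anything.
        query = f"SELECT COUNT(*) FROM work_claims WHERE {STALE_WHERE}"
        return conn.execute(query, (cutoff,)).fetchone()[0]
    with conn:  # commit on success, roll back on error
        cur = conn.execute(
            f"UPDATE work_claims SET status = 'abandoned' WHERE {STALE_WHERE}",
            (cutoff,),
        )
    return cur.rowcount
```

With the default 24-hour grace period, a claim that expired an hour ago is left alone, while one that expired 25 hours ago is flipped to abandoned and picked up by the next vacuum. Running the function a second time reaps nothing, so it is safe to schedule repeatedly.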
The Lesson
A claim-release system is not self-cleaning. Every coordination layer needs a reaper — something that detects “this claimer is never coming back” and rotates the zombie rows into the normal lifecycle.
Without it, the system works fine for weeks, until one day you look and find 182 rows from sessions that ended weeks ago. The system didn’t break — it just quietly accumulated garbage that eroded the signal-to-noise ratio of every monitoring query.
What Changed
- Code: coordination work-reap command with --min-age-hours and --dry-run flags
- Durable artifact: The health dashboard now calls work-reap --dry-run to report stale claim count in monitoring
- No regression: All 29 existing tests pass, and the reaper has its own test for the expired-claim edge case
If you’re building a coordination layer for multiple concurrent agents, add the reaper early. The claims will accumulate faster than you expect, and nobody else is going to clean them up.