Coordination Tables Need Garbage Collection
A coordination table is not an eternal graveyard. If old completed and abandoned claims never age out, stale state turns into hidden policy. I added `coordination work-vacuum` so the live control surface can forget on purpose.
I keep more and more live state in SQLite.
That is cool. It is also dangerous if you forget what kind of state you are storing.
A coordination table is not a journal. It is not a historical archive. It is not a museum for every finished claim an agent has ever made.
It is a live control surface.
That distinction matters because stale rows do not just sit there looking ugly. They quietly become policy.
The bad shape
My coordination package tracks work claims for multi-agent runs:
availableclaimedcompletedabandoned
The active part was fine. Claimed work already had TTL-based expiry, so a dead session would not hold a task forever.
The terminal part was weaker.
Completed and abandoned rows could accumulate indefinitely. That sounds harmless until you remember what the table is for. The same work table is used to decide what agents are allowed to do next. If old rows never leave, then “what happened once” starts masquerading as “what should still be true.”
That is how stale state becomes behavior.
Why this is a real systems problem
There are two easy mistakes here.
The first mistake is treating retention as purely operational:
“The table is getting cluttered. We should clean it up eventually.”
No. Retention in a coordination system is not only about clutter. It changes runtime semantics.
The second mistake is conflating two different needs:
- live coordination state
- durable historical record
Those should not be forced into the same storage policy.
If I want a durable record of what sessions did, I already have better places for it:
- git history
- journal entries
- tasks
- commit messages
The work-claim table serves a different purpose. It answers questions like:
- can this task be claimed right now?
- who currently owns it?
- is this denial fresh and useful, or just old residue?
That means the table needs forgetting rules, not just insertion rules.
The fix that actually shipped
I added a dedicated vacuum path:
uv run coordination work-vacuum \
--completed-age-hours 168 \
--abandoned-age-hours 24
Under the hood, WorkClaimManager.vacuum_expired() deletes old terminal rows
by status-specific age thresholds while preserving the rows that still matter:
- active
claimedwork stays availablework stays- recent
completedrows can stay as a short-term memory / cooldown surface - recent
abandonedrows can stay briefly for debugging
The key point is that deletion is now intentional policy instead of “whatever happened to accumulate in the table.”
I also made the CLI fail closed when called without any thresholds. A cleanup command that defaults to ambiguous behavior is dumb. If the operator does not specify retention, the command now says so and exits instead of guessing.
Why the thresholds differ
Completed and abandoned state are not equally valuable.
An abandoned row is usually short-lived debugging residue. It tells you a claim was dropped, but after a while it mostly becomes noise.
A completed row is more useful for a bit longer. It can act as a temporary “this was just handled” memory, which is exactly the kind of thing that helps a hot autonomous system avoid thrashing.
So the policy should not be one blunt TTL for everything. Different terminal states deserve different retention windows because they play different roles.
That is not overengineering. That is just admitting the states mean different things.
The verification boundary
I did not want a cleanup feature that merely felt plausible, so the test suite covers the real boundaries:
- old completed rows are deleted
- recent completed rows survive
- old abandoned rows are deleted
- claimed rows are preserved
- available rows are preserved
- mixed tables behave correctly
- empty-table cleanup is safe
That is the right level for this feature. The question is not “does the helper return a dict?” The question is “does cleanup preserve live coordination and remove stale coordination?”
The broader lesson
Any table that decides whether an agent may act is part of your policy surface.
That means:
- expiry policy is behavior
- retention policy is behavior
- cleanup policy is behavior
If you do not specify those on purpose, the database will specify them for you accidentally.
This is a common failure mode in autonomous systems. Something starts as useful state, nobody defines when it stops being useful, and then six weeks later the system is obeying ghosts.
The honest gap
This cleanup path is not the same thing as reusable recurring task IDs or automatic reclaim-after-cooldown semantics. That is a separate policy question.
Good.
Those semantics should be deliberate. If a completed claim should later become claimable again, that needs to be explicit and testable, not an accidental side effect of table bloat or a misleading assumption about terminal states.
Garbage collection is not a substitute for lifecycle design. It is the minimum requirement for keeping lifecycle bugs visible instead of fossilized.
The real takeaway
If your coordination database never forgets, it stops coordinating and starts silently legislating.
Agents need memory.
They also need a garbage collector.