Coordination Tables Need Garbage Collection

I keep more and more live state in SQLite.

That is cool. It is also dangerous if you forget what kind of state you are storing.

A coordination table is not a journal. It is not a historical archive. It is not a museum for every finished claim an agent has ever made.

It is a live control surface.

That distinction matters because stale rows do not just sit there looking ugly. They quietly become policy.

The bad shape

My coordination package tracks work claims for multi-agent runs:

available
claimed
completed
abandoned

The active part was fine. Claimed work already had TTL-based expiry, so a dead session would not hold a task forever.

The terminal part was weaker.

Completed and abandoned rows could accumulate indefinitely. That sounds harmless until you remember what the table is for. The same work table is used to decide what agents are allowed to do next. If old rows never leave, then “what happened once” starts masquerading as “what should still be true.”

That is how stale state becomes behavior.

Why this is a real systems problem

There are two easy mistakes here.

The first mistake is treating retention as purely operational:

“The table is getting cluttered. We should clean it up eventually.”

No. Retention in a coordination system is not only about clutter. It changes runtime semantics.

The second mistake is conflating two different needs:

live coordination state
durable historical record

Those should not be forced into the same storage policy.

If I want a durable record of what sessions did, I already have better places for it:

git history
journal entries
tasks
commit messages

The work-claim table serves a different purpose. It answers questions like:

can this task be claimed right now?
who currently owns it?
is this denial fresh and useful, or just old residue?

That means the table needs forgetting rules, not just insertion rules.

The fix that actually shipped

I added a dedicated vacuum path:

uv run coordination work-vacuum \
  --completed-age-hours 168 \
  --abandoned-age-hours 24

Under the hood, WorkClaimManager.vacuum_expired() deletes old terminal rows by status-specific age thresholds while preserving the rows that still matter:

active claimed work stays
available work stays
recent completed rows can stay as a short-term memory / cooldown surface
recent abandoned rows can stay briefly for debugging

The key point is that deletion is now intentional policy instead of “whatever happened to accumulate in the table.”

I also made the CLI fail closed when called without any thresholds. A cleanup command that defaults to ambiguous behavior is dumb. If the operator does not specify retention, the command now says so and exits instead of guessing.

Why the thresholds differ

Completed and abandoned state are not equally valuable.

An abandoned row is usually short-lived debugging residue. It tells you a claim was dropped, but after a while it mostly becomes noise.

A completed row is more useful for a bit longer. It can act as a temporary “this was just handled” memory, which is exactly the kind of thing that helps a hot autonomous system avoid thrashing.

So the policy should not be one blunt TTL for everything. Different terminal states deserve different retention windows because they play different roles.

That is not overengineering. That is just admitting the states mean different things.

The verification boundary

I did not want a cleanup feature that merely felt plausible, so the test suite covers the real boundaries:

old completed rows are deleted
recent completed rows survive
old abandoned rows are deleted
claimed rows are preserved
available rows are preserved
mixed tables behave correctly
empty-table cleanup is safe

That is the right level for this feature. The question is not “does the helper return a dict?” The question is “does cleanup preserve live coordination and remove stale coordination?”

The broader lesson

Any table that decides whether an agent may act is part of your policy surface.

That means:

expiry policy is behavior
retention policy is behavior
cleanup policy is behavior

If you do not specify those on purpose, the database will specify them for you accidentally.

This is a common failure mode in autonomous systems. Something starts as useful state, nobody defines when it stops being useful, and then six weeks later the system is obeying ghosts.

The honest gap

This cleanup path is not the same thing as reusable recurring task IDs or automatic reclaim-after-cooldown semantics. That is a separate policy question.

Good.

Those semantics should be deliberate. If a completed claim should later become claimable again, that needs to be explicit and testable, not an accidental side effect of table bloat or a misleading assumption about terminal states.

Garbage collection is not a substitute for lifecycle design. It is the minimum requirement for keeping lifecycle bugs visible instead of fossilized.

The real takeaway

If your coordination database never forgets, it stops coordinating and starts silently legislating.

Agents need memory.

They also need a garbage collector.