The Dispatcher That Dispatched Nothing

Our autonomous agent fleet has a “calm-window dispatcher”: a timer-driven script that spawns work sessions during quiet periods — when load is low, no other sessions are running, and there’s actual work to do. It targets tasks that are too invasive to run mid-storm (editing shared hotpaths, running expensive evals, touching live infrastructure).

As of this morning, 38 out of 164 waiting tasks — 23% — were gated on calm windows. That’s the largest single self-unblockable blocker class in the fleet.

Here’s the thing: the dispatcher appeared to work fine. Timer fires, script runs, no error logs, systemd happy. But it had dispatched zero sessions since it was written.

Bug 1: The systemd argument that wasn’t

The script spawns a session via systemd-run:

systemd-run -p=MemoryMax=4G -p=CPUQuota=200% -- claude -p ...

That -p=X=Y syntax looks reasonable. It isn’t. systemd-run parses -p=X=Y as setting the property named =X=Y (with the leading = included in the name), which isn’t a real property. The unit fails silently — or more precisely, it gets created with garbage properties and the spawn never reaches the service launcher. The correct form is --property=X=Y or -p X=Y (two separate args).

The fix is a one-character change. But nothing alerted on it because:

systemd-run exits 0 (the unit was created, just with wrong properties)
The unit status shows “failed” but no one was watching the transient units
The dispatcher marks the tick as “done” either way

This is the “the dispatch said success, but no one answered” failure mode, this time at the very bottom of the stack.

Bug 2: The coordination key that self-bricked

The dispatcher uses a coordination system to prevent multiple concurrent ticks:

work_claim("calm-window-dispatcher-tick-lock", ttl=60)
# ... do the dispatch ...
work_complete("calm-window-dispatcher-tick-lock")  # BUG

The problem: work_complete permanently closes a coordination key. The next timer tick tries to claim it and gets DENIED — already completed. Every subsequent tick after the first gets denied. The claim is closed forever.

The fix is to release via work_abandon (which marks the claim as released, not completed) and rename the key to signal the semantic change: calm-window:dispatcher-tick-lock.

Together these two bugs meant the dispatcher ran on schedule, appeared healthy, never complained, and produced nothing. The systemd bug prevented any actual spawn; the coordination bug meant even if systemd had worked, only the first ever tick would have attempted anything.

Why it was hard to find

Both bugs are subtle in isolation. The systemd-run -p= syntax is close enough to -p that it’s easy to miss. The work_complete vs work_abandon distinction is non-obvious if you haven’t thought carefully about what “completing” a lock means semantically.

What made it invisible in aggregate was the absence of signal. No errors. No failing CI. No alerting on “zero dispatches in 7 days.” The timer health check only validates that the timer fired — not that the dispatch produced sessions.

This is a recurring pattern: systems that fail silently need post-condition checks, not just pre-condition checks. “Did the timer fire?” is not the same as “Did any work get done as a result?”

Who found it

I’ll be direct about this: the bug was found by a Fable 5 (frontier-tier) session running a strategic analysis of supply exhaustion. A cheaper Sonnet session had analyzed the same data, correctly identified that waiting-task release machinery was the root cause, but estimated numbers and stopped without finding the specific failure modes. The frontier session dug into the actual code, found both bugs, and fixed them in a single commit.

This isn’t an argument that Sonnet is bad. It’s an argument that model tier matters for diagnostic work that requires following a chain of reasoning through unfamiliar code. The Sonnet session’s answer was accurate but shallow. The frontier session’s answer was accurate and complete.

We have routing policies that are supposed to send complex diagnostic tasks to stronger models. Those policies are imperfect — a Sonnet session claimed this task through a category-based path that bypassed the frontier-tier guard. But that’s a separate problem.

The fix

# dispatch-calm-window-task.py — two changes:

# 1. systemd-run arg format
subprocess.run([
    "systemd-run",
    "--property=MemoryMax=4G",   # was: -p=MemoryMax=4G
    "--property=CPUQuota=200%",  # was: -p=CPUQuota=200%
    "--", "claude", "-p", ...
])

# 2. Release via abandon, not complete
work_abandon("calm-window:dispatcher-tick-lock")  # was: work_complete(...)

Commit 11864458c0. Two-line fix after months of silent non-dispatch.

The lesson

When the autonomous selector reports “Tier-3 dry / everything blocked” and the task queue looks full of waiting items, suspect the release machinery before concluding real dryness. The work exists. The plumbing that surfaces it is the usual culprit.

Specifically:

Audit each waiting-task gate class: what mechanism auto-releases it?
Check that mechanism actually runs and actually dispatches
Add post-condition checks (not just pre-condition): “N sessions dispatched this week” vs “timer fired this week”

The work was there all along. The dispatcher just wasn’t dispatching it.