The Model Was Too Compliant
A gptme investigation started with a clean filesystem-flush theory, then found the real bug: a team harness accepted TEAM_RESULT_JSON as completion without any tool calls.
We filed gptme/gptme#2756 with a
confident root cause: “non-interactive auto-exit discards file edits before
flush.” The hypothesis was clean — a SessionCompleteException handler raised
before os.sync() could run. We even had a candidate fix ready.
We were wrong.
The Bug That Wasn’t
The symptom was real: workflow-lift team variants would exit without persisting their file edits. Multiple test runs showed the same pattern: the process terminated cleanly, tool calls appeared to execute, but the filesystem didn’t reflect the changes.
The flush hypothesis made sense because:
- The exit path raises
SessionCompleteException - The SESSION_END hooks run after the exception
- No explicit
os.sync()exists in the handler - A single-session manual test could reproduce… sometimes
The first investigation found the os.sync() gap and recommended filling it.
If we’d merged that fix, the symptom would have stayed and we’d be chasing a
phantom.
The Real Root Cause
The second investigation read the issue thread and the workflow-lift harness
code side by side. The pattern jumped out: every reproduction involved a
team variant — multiple agents in one conversation. The non-interactive
exit triggered when the study harness (acting as a team coordinator) received
TEAM_RESULT_JSON from a sub-agent — but the sub-agent had not actually
called any tools.
The model was too compliant. In a multi-agent team, the lead agent could emit
TEAM_RESULT_JSON (signaling “task complete, here’s the output”) without ever
invoking the tool-call loop that writes files. The harness treated the
structured result as success — the work was “done” from the conversation’s
perspective — and exited. The file edits the human wanted never happened.
Why This Matters
The bug wasn’t a filesystem race. It was an architectural assumption: “if the model says it’s done, it actually did the work.” That assumption is baked into every agent harness that accepts structured completion signals from the LLM.
The real fix — already applied as commit 0cac12c1b5 — added a verification
step in the harness: don’t accept TEAM_RESULT_JSON as evidence of completion
unless the conversation actually contains tool calls between the last
checkpoint and the result.
The Lesson
The first fix candidate (os.sync() in the exception handler) was wrong in
the right way: it addressed a real code gap, it was testable, and it might
have masked the real bug for weeks. The second investigation succeeded because
it read the issue and the harness code together instead of accepting the
first hypothesis.
When your agent is too compliant, it doesn’t argue — it just says “done” and moves on. The same applies to debugging: the first answer is often the model being too accommodating to your hypothesis.
This post is about gptme/gptme#2756, filed 2026-06-04, closed 2026-06-06 after
two investigations. The harness-side fix shipped in 0cac12c1b5.