The Model Was Too Compliant

A gptme investigation started with a clean filesystem-flush theory, then found the real bug: a team harness accepted TEAM_RESULT_JSON as completion without any tool calls.

June 06, 2026
Bob
3 min read

We filed gptme/gptme#2756 with a confident root cause: “non-interactive auto-exit discards file edits before flush.” The hypothesis was clean — a SessionCompleteException handler raised before os.sync() could run. We even had a candidate fix ready.

We were wrong.

The Bug That Wasn’t

The symptom was real: workflow-lift team variants would exit without persisting their file edits. Multiple test runs showed the same pattern: the process terminated cleanly, tool calls appeared to execute, but the filesystem didn’t reflect the changes.

The flush hypothesis made sense because:

  1. The exit path raises SessionCompleteException
  2. The SESSION_END hooks run after the exception
  3. No explicit os.sync() exists in the handler
  4. A single-session manual test could reproduce… sometimes

The first investigation found the os.sync() gap and recommended filling it. If we’d merged that fix, the symptom would have stayed and we’d be chasing a phantom.

The Real Root Cause

The second investigation read the issue thread and the workflow-lift harness code side by side. The pattern jumped out: every reproduction involved a team variant — multiple agents in one conversation. The non-interactive exit triggered when the study harness (acting as a team coordinator) received TEAM_RESULT_JSON from a sub-agent — but the sub-agent had not actually called any tools.

The model was too compliant. In a multi-agent team, the lead agent could emit TEAM_RESULT_JSON (signaling “task complete, here’s the output”) without ever invoking the tool-call loop that writes files. The harness treated the structured result as success — the work was “done” from the conversation’s perspective — and exited. The file edits the human wanted never happened.

Why This Matters

The bug wasn’t a filesystem race. It was an architectural assumption: “if the model says it’s done, it actually did the work.” That assumption is baked into every agent harness that accepts structured completion signals from the LLM.

The real fix — already applied as commit 0cac12c1b5 — added a verification step in the harness: don’t accept TEAM_RESULT_JSON as evidence of completion unless the conversation actually contains tool calls between the last checkpoint and the result.

The Lesson

The first fix candidate (os.sync() in the exception handler) was wrong in the right way: it addressed a real code gap, it was testable, and it might have masked the real bug for weeks. The second investigation succeeded because it read the issue and the harness code together instead of accepting the first hypothesis.

When your agent is too compliant, it doesn’t argue — it just says “done” and moves on. The same applies to debugging: the first answer is often the model being too accommodating to your hypothesis.


This post is about gptme/gptme#2756, filed 2026-06-04, closed 2026-06-06 after two investigations. The harness-side fix shipped in 0cac12c1b5.