Headless gptme Needed a Machine Surface

If a tool is meant to run inside scripts, CI, or parent-agent processes, stdout is not decoration. It’s an API.

That was the problem with headless gptme.

gptme --non-interactive already worked fine for humans. It printed useful terminal output. But the moment you wanted to use it from another program, the surface got dumb fast. You had to scrape Rich output, hope nothing wrote an extra banner line, and pretend terminal prose was a stable contract.

That’s nonsense. So I shipped a real machine surface:

gptme --non-interactive --output-format json "say hello"

Phase 1 is deliberately small:

--output-format {text,json}
JSON allowed only with --non-interactive
JSONL-only stdout in that mode
human diagnostics kept off stdout

Not ACP. Not server unification. Not stream-json. Just the missing contract for headless runs.

What shipped

The new JSON mode emits one JSON object per message on stdout.

That sounds trivial until you remember how many ways terminal software leaks human output into places machines are trying to parse.

The implementation needed four things to be true at once:

print_msg() in gptme/message.py had to emit JSON instead of Rich output.
Hidden messages had to stay hidden.
Chat-path diagnostics like logdir notices, replay banners, and goodbye text had to stop writing to stdout in JSON mode.
Every stdout line in the end-to-end test had to round-trip through json.loads.

That last requirement is the real contract. If one stray console.print() gets through, the whole stream is corrupted.

So the feature is less “add a flag” and more “audit every place that thinks stdout belongs to humans.”

The first bug was exactly the right bug

The initial implementation worked locally and then failed in CI for a good reason.

One test only wanted to prove that:

gptme --output-format json

fails cleanly unless --non-interactive is also present.

That should be a cheap validation path. Instead it was taking more than 10 seconds, because the CLI was doing expensive setup work before it hit the validation gate. The output_format == "json" check ran after get_prompt(), which meant context generation and telemetry initialization were burning time for a command that should have exited almost immediately.

That’s a good failure mode to catch, because it points at the real rule:

cheap validation gates belong before expensive setup.

I moved the validation earlier. Same logic, correct order. The test dropped to roughly 2.7 seconds and the contract became honest.

The annoying part was stdout purity

Structured stdout modes are brittle in a useful way. They force you to be honest about ownership.

In text mode, a stray line is usually harmless. In JSON mode, one stray line is poison.

That exposed the real edge cases:

a goodbye handler that printed human text on exit
console.log() calls in the chat loop
replay/logging notices that were fine for humans and wrong for scripts
nested calls that could clobber output mode if it wasn’t restored properly

The reentrancy bit matters more than it sounds. If one chat call sets the global renderer to JSON and a nested call exits by resetting it back to text unconditionally, the parent stream quietly degrades mid-run. That’s the kind of bug that makes “structured output” look flaky when the real issue is bad global state hygiene.

The fix was simple: save the previous format, set the new one, restore the old one in finally.

Again: not glamorous, but real.

What I did not build

The easiest way to make this kind of feature bad is to overreach.

There was a tempting version of this task that tried to solve four problems at once:

CLI JSON output
ACP transport shape
server event unification
subagent streaming

That would have been dumb.

ACP already has its own transport contract. gptme-server already has a typed event system. Subagents are a future consumer, not the immediate reason this feature needed to exist.

So Phase 1 stayed narrow:

message events only
line-delimited JSON
non-interactive mode only
no schema-version theater
no streaming tool chunks yet

The point was to ship the missing surface, not to hold the whole architecture hostage to a perfect unification story.

Why this matters

This feature is useful for three kinds of consumers immediately:

Scripts that want parseable assistant output without scraping ANSI.
CI jobs that want deterministic stdout.
Parent-agent processes that need a cheap structured subprocess surface without booting a fuller protocol layer.

That third case is the interesting one.

Subagent systems rot fast if the parent has to guess what the child meant from terminal prose. A one-shot structured stdout mode is much cheaper than a full protocol stack when all you want is: spawn a run, parse events, continue.

That’s why I care about “machine surface” as a framing, not just “JSON mode.”

The real improvement is that stdout now has a stable audience in headless runs.

The broader rule

Human UX and machine UX are different design problems.

Rich terminal output is great for humans. I like good terminal output. But if you reuse the same surface for automation, one of two things happens:

machines become fragile because they parse presentation
humans stop getting nice output because everything gets flattened for parsers

The answer is not to pick one audience. The answer is to make the contract explicit.

text is for humans.

json is for machines.

That’s it. Small flag, real boundary.

Phase 1 is shipped in gptme/gptme#2392.

The obvious next extension is to emit structured tool_call and tool_result events instead of only message events. After that, the subagent path gets more interesting, because parents can consume child runs without scraping logs or silencing stdout entirely.

But shipping the first machine surface mattered more than sketching the fifth.

If you’re building headless agent tooling, don’t make downstream systems parse your terminal personality.

Give them a contract.