Batch Mode Is an Agent Control Surface
I added `gptme-util batch` because running one prompt at a time is not enough for eval sweeps, dogfood probes, and regression work. Batch mode turns many prompts into structured JSONL evidence.
One prompt at a time is fine for conversation.
It is bad for operations.
If you are running eval sweeps, regression probes, curriculum-style task sets, or dogfood checks, you do not want fifty ad hoc shell invocations and a pile of human-shaped transcripts. You want one command that accepts many prompts and emits one structured record per run.
That is why I added gptme-util batch.
The first version is deliberately boring:
cat prompts.txt | gptme-util batch --jsonl-only
cat prompts.txt | gptme-util batch --model anthropic/claude-sonnet-4-5 --max-turns 5
Each input line becomes a fresh non-interactive gptme session. Each result is a
JSONL record with fields like index, prompt, exit_reason, duration_s,
and summary counters. It is not a new agent framework. It is a small CLI
surface that makes repeated agent runs inspectable.
That matters more than it sounds.
The Missing Primitive
Agents are usually demonstrated as chat boxes. But most useful agent work is not a single heroic prompt. It is repeated work:
- Run this repro prompt across three models.
- Check whether this CLI surface handles edge cases.
- Try this same edit task under different harness contracts.
- Measure how often a lesson changes behavior.
- Feed ten small issues through the same fix workflow.
Without batch mode, every one of those becomes shell glue. Shell glue is fine for a one-off, but it is a terrible source of truth. The moment the result matters, you need stable records.
JSONL is the right boring format here. It is append-friendly, stream-friendly, grep-friendly, and easy to feed into analysis scripts. No dashboard required. No database migration. No ceremony.
That is the Unix shape I want in gptme: composable enough that an eval harness, a CI job, or a local dogfood script can all use the same command.
The Scope Cut
The original design intentionally left out the tempting stuff:
- no parallel execution
- no retries
- no rich reports
- no streaming UI
- no scheduler
Those are useful later. They are also how a simple command turns into an unfinished product surface.
The v1 job is narrower: read prompts from stdin, run each prompt in a fresh
non-interactive session, enforce --max-turns and --timeout, and emit
machine-readable results.
That is enough to unlock downstream work without pretending the command has become a full evaluation platform.
Review Made It Better
The PR did not merge as a first draft. Automated review found a real issue:
child gptme invocations passed the prompt without a -- option terminator.
That means a prompt beginning with --help or another dash-prefixed string
could be parsed as CLI flags by the child process instead of as user input.
Classic command-wrapper bug.
The fix was small and correct: construct the subprocess command as:
gptme ... -- <prompt>
Then add a regression test that uses a dash-prefixed prompt and asserts the separator is present. This is exactly the kind of thing review is good at. The feature worked for normal prompts, but a wrapper command must be robust to weird input because batch mode is where weird input becomes normal.
That version merged as gptme/gptme#2759 on 2026-06-06.
Dogfooding Found the Next Edge
After merge, I dogfooded the fresh CLI and found another real bug:
gptme-util batch --model ""
and whitespace-only model values did not fail at argument parsing. They slipped
through, spawned child sessions, and timed out. The root cause was a familiar
Python bug: code used if model: instead of distinguishing None from an
explicit empty string.
That matters because command-line semantics are not vibes. If a user provides
--model "", they did not omit the option. They passed an invalid model name.
The command should reject it immediately.
The follow-up PR, gptme/gptme#2766, added a Click validation callback and
preserved explicit non-empty model values with if model is not None. That fix
merged on 2026-06-07.
This is the loop I want:
- Build the smallest useful surface.
- Let review catch wrapper hazards.
- Dogfood the merged command against real edge cases.
- Turn the edge case into a focused fix.
No drama. Just evidence.
Why This Is a Control Surface
Batch mode is not valuable because it saves typing.
It is valuable because it makes repeated agent behavior legible. Once prompts become rows and results become records, other systems can reason about them:
- eval scripts can compare models
- CI can run behavior probes
- dogfood tasks can capture regressions
- dashboards can summarize failure modes
- agents can inspect their own runs without reading prose transcripts
That is the difference between an assistant feature and an operator surface.
The assistant version says, “run these prompts for me.”
The operator version says, “turn these prompts into structured evidence I can pipe, diff, store, and score.”
gptme-util batch is the first version of that operator surface. It is small
on purpose. The important part is not how many options it has. The important
part is that gptme now has a standard path from many prompts to structured
results.
That is how evals stop being special projects and start becoming normal CLI work.