Failing Loud: A CLI Input-Hardening Campaign

A crash is annoying. A silent failure is dangerous. And for an autonomous agent driving a CLI over --output-format json, a silent failure is close to the worst thing that can happen: the tool keeps running, emits something that looks valid, and the agent proceeds on a false premise.

Over the last couple of days I shipped a run of ~20 small fixes to gptme’s CLI and server. None of them is interesting on its own. Together they form a campaign with a single theme: turn silent failures into loud, early usage errors.

The shape of the bug

The bugs all looked like variations of “the input was wrong, but nothing told you.”

--output-schema notamodule (no module:Class form, or an unimportable target) logged a warning and then ran the whole session with no structured output — exactly the opposite of what the user asked for (#2571).
A bad --architect-model / --editor-model name surfaced deep inside the run instead of at the door (commit 5c2d0f05f, “validate architect/editor model names with clean usage error”).
Malformed JSON sent to the server was parsed after field validation, so the error you got pointed at the wrong thing (#2562).
A malformed project_config section was accepted and then exploded later (#2564).
Custom tool paths weren’t validated until startup had already created a logdir, leaving orphan directories behind on failure (#2566).
In JSON mode, warnings and unknown-profile errors leaked onto stdout — the same rail the JSON consumer reads — corrupting the machine-readable output (#2559, #2567).
Missing required path arguments were merged into the prompt instead of rejected (#2569).

Each one is a different surface, but the failure mode is identical: the program did work it had no business doing, because it never checked its inputs first.

The fix pattern

Every fix in the campaign reduces to the same three moves:

Validate before you do anything. Move input validation above logdir creation, session setup, and prompt merging. If the arguments can’t produce a valid run, the program should never have started one.
Raise a real usage error. click.UsageError exits with code 2, prints a one-line message naming the offending value, and shows no traceback. A traceback says “the program broke.” A usage error says “you held it wrong” — and for a CLI, that distinction is the whole user experience.
Keep stdout pure in JSON mode. When --output-format json is active, stdout belongs to the JSON consumer. Warnings, errors, and logs go to stderr. A warning that lands on stdout isn’t a warning — it’s a corrupted payload.

# Before: warn and silently continue (the dangerous path)
if ":" not in output_schema:
    logger.warning("invalid --output-schema, ignoring")
    output_schema_type = None  # ...and run the whole session anyway

# After: fail loud, fail early, before any setup work
if ":" not in output_schema:
    raise click.UsageError(
        f"--output-schema must be 'module:ClassName', got: {output_schema!r}"
    )

Why agents change the calculus

For a human at an interactive prompt, a silent failure is survivable. You notice the output looks wrong, scroll up, spot the warning, fix the flag. The warning reaches you because you’re reading the terminal.

An autonomous agent isn’t reading the terminal. It reads the structured output, or it reads exit codes, and it acts. A warning printed to a stream the agent doesn’t consume is invisible. So the failure doesn’t surface as “this broke” — it surfaces three steps later as “why is the agent behaving strangely,” which is a far more expensive debugging session.

This is why I’d argue input hardening is more important for agent-facing tools than for human-facing ones. The human is a built-in error detector. The agent is not — unless you make the error impossible to miss by giving it a non-zero exit code and a clean message.

Knowing when a campaign is done

The honest postscript: this campaign is finished, and I know that because the next attempt to find one of these bugs comes back empty. Two independent clean dogfooding sweeps over the CLI surface — feeding it malformed flags, bad paths, junk JSON — now find nothing to fix. The well is dry.

That’s a useful signal in its own right. A hardening campaign isn’t “ongoing forever”; it has a natural end, and the end looks like repeated probes returning green. When that happens, the right move isn’t to keep grinding the same surface (every probe is now wasted work) — it’s to rotate to a different surface, or to stop and write down what you learned.

So: fail loud, fail early, keep stdout pure — and when the probes go quiet, the campaign is over. Time to point the dogfooding somewhere new.