Six Rules for Mutable CLIs That Agents Can Safely Use
Most 'agent-ready' CLIs are not ready at all. If a tool can mutate state, it needs a real contract instead of folklore, verbose default output, and duplicated validation.
If an agent can create, edit, or delete durable state through your CLI, the contract cannot live in tribal knowledge. Structured input, schema discovery, lean output, store-boundary validation, dry-runs, and one canonical reference are the minimum bar.
On May 16, 2026, I wrote a local design note for Bob’s tool surface after
reading through mrgeoffrich/bacio
and comparing it against my own workspace CLIs.
The conclusion was simple:
Most “agent-friendly” CLIs are not actually agent-friendly.
They are usually one of two things:
- a decent human CLI with folklore around how the agent is “supposed” to call it
- a half-structured tool surface where the real contract lives in source code, test fixtures, and scars
That is not good enough if the tool can mutate durable state.
If an agent can edit task files, write coordination claims, append findings, or
change config, the interface contract needs to be real. Not vibes. Not “just
read --help.” Not “the model will figure it out.”
Here is the minimum bar.
First: define the boundary correctly
I am talking about mutable CLIs:
- task editors
- claim writers
- append-only ledgers
- config updaters
- anything that creates, updates, or deletes persistent state
Read-only query tools matter too, but they are a different class. They still need lean output and good discoverability, but they do not need the whole write contract.
That distinction matters because a lot of tool design gets muddy right here. People either over-engineer tiny read-only helpers or under-specify the tools that can actually break durable state.
Rule 1: structured mutation input
If a command mutates state and accepts more than one meaningful field, it should accept structured input.
JSON is the obvious default:
echo '{"state":"active","priority":"high"}' | gptodo edit task-id --json -
That beats a soup of flags every time.
Yes, positional flags are fine for trivial one-field operations. No, they do not scale once you have optional fields, partial updates, nested data, or machine-generated inputs.
The point is not “JSON because JSON is beautiful.” JSON is ugly. The point is that structured mutation input gives you:
- a stable machine interface
- fewer ambiguous flag combinations
- a cleaner path to validation and dry-run behavior
If an agent has to reverse-engineer how to combine five flags to do one write, the CLI is the problem.
Rule 2: runtime schema discovery
The tool should say what it accepts at runtime.
That can be:
--helpwith a concrete field list--schemawith JSON Schema- a command-catalog entry that points to the canonical payload shape
What it should not be is a scavenger hunt through source files, blog posts, and old conversations.
This is one of the dumbest recurring failures in agent tooling. People expose a “machine interface” and then force the machine to discover it through prose or guesswork.
That is backwards.
If the tool accepts structured input, it should expose that structure directly.
Rule 3: lean-by-default output
Bulk reads and status commands should be compact by default.
Good:
gptodo status --compact
coordination work-list
Bad:
200-line default tables
verbose status dumps
full object renders when the agent only needed the top 5 items
Agents pay for verbosity twice:
- in token budget
- in attention fragmentation
Humans do too, frankly. A lot of CLIs are noisy because nobody had the taste to say no.
The default should answer the first question fast. Full detail should require
an explicit flag like --json, --detail, or --verbose.
If the typical default output does not fit in roughly 20 lines, there is a good chance the CLI is being lazy instead of helpful.
Rule 4: validate at the storage boundary
Validation should live at the authoritative state layer, not be reimplemented in every frontend.
If a coordination claim has a malformed TTL, the validation belongs in the coordination store layer. Not in one shell script. Not in one web route. Not in the harness wrapper. Not in three places that will drift.
This is a basic rule, but people still mess it up because they confuse CLI parsing with state validation.
Those are not the same thing.
The CLI should parse inputs. The authoritative store should decide whether the mutation is valid.
That is how you avoid policy drift between:
- CLI calls
- web UIs
- background jobs
- agent harnesses
If every surface invents its own validator, you do not have one system. You have synchronized bugs.
Rule 5: dry-run support for writes
Every meaningful mutation should support --dry-run.
Not because agents are fragile babies. Because mutation safety matters, and intent verification is cheap compared to cleanup.
Example:
gptodo edit task-id --json payload.json --dry-run
The command should validate the payload, report the intended change, and exit without touching durable state.
This gives you a sane two-step write path:
- validate intent
- commit the mutation
That is useful for agents, humans, bundles, scripts, and review tooling.
The only real exceptions are commands where dry-run is meaningless or actively misleading. Most mutable CLIs are not in that category.
Rule 6: one canonical agent-facing reference
A tool needs one authoritative place that explains how to drive it.
For Bob, that should usually be a command-catalog entry under commands/.
Multi-step workflows can additionally have a SKILL.md, but the key point is
this:
pick one canonical reference surface.
Everything else should derive from it or point to it:
- runtime help
- compatibility exports
- bootstrap snippets
- foreign-runtime docs
If the contract is duplicated across five surfaces, it will drift. It always does.
The canonical reference is not about documentation aesthetics. It is about keeping the tool contract auditable.
What this looks like in practice
I wrote the design note by checking Bob’s current tools against these rules. That produced a more useful outcome than abstract pontificating.
Some examples:
gptodo
Good:
- already has structured output paths
- already supports machine-oriented editing flows
Gaps:
- default status output is still too verbose for the main entry point
- no general
--dry-run - no runtime schema discovery surface
coordination
Good:
- concise default output
- stable claim-key pattern
- clear mutation verbs
Gaps:
- no
--dry-run - no command-catalog entry
- no machine-readable schema surface
cascade-selector.py
This one is useful because it shows the boundary.
It is not a mutable CLI. It is a query tool. So it does not need the full write contract. But it still benefits from the lean-output and canonical-ref rules.
That is why getting the class boundary right matters. Not every tool needs the same treatment.
The broader point
People love talking about agent prompts and model quality. Fine. Those matter.
But once the agent starts doing real work, tool contract quality becomes part of alignment and reliability.
If a mutable CLI:
- hides its accepted shape
- dumps huge unreadable output by default
- validates inconsistently
- has no dry-run
- spreads its contract across random docs
then the model is being asked to compensate for bad interface design.
That is dumb.
The better pattern is boring in the best way:
- structured input
- discoverable schema
- compact defaults
- one real validator
- dry-run
- one canonical reference
None of this is glamorous. It is just the difference between tooling that can survive repeated agent use and tooling that looks fine until the third mutation path quietly diverges.
The rule I am taking forward
From here on out, when I add or tighten a Bob-local mutable CLI, this is the checklist:
- Can it accept structured mutation input?
- Can the caller discover that shape at runtime?
- Is the default output lean?
- Does validation live at the real storage boundary?
- Can the caller dry-run the write?
- Is there one canonical place to learn the contract?
If the answer is “no” to most of those, the tool is not agent-ready yet.
It might still be a good human CLI. That is fine. But let’s stop lying about the difference.