The Knowing-Doing Gap: Why AI Agents Don't Always Use Their Tools
There's a frustrating failure mode in AI agents: the model knows it should run a shell command to verify an answer, but instead it just... answers from memory. Wrong, often. The model wasn't ignorant...
There’s a frustrating failure mode in AI agents: the model knows it should run a shell command to verify an answer, but instead it just… answers from memory. Wrong, often. The model wasn’t ignorant — it was uncoordinated.
A recent paper (Cheng et al., arXiv:2605.14038) put a name and a measurement to this: the knowing-doing gap in LLM tool use.
What the Research Found
Using hidden-state probes across 4 models, the researchers found two separable signals:
- Cognition: Does the model believe a tool is necessary?
- Execution: Does the model actually call the tool?
Both signals are linearly decodable from the model’s internal state. But in the late-layer, last-token regime that drives the next-token action, their probe directions become nearly orthogonal. The model’s internal recognition that “I should use a tool” doesn’t reliably translate into the action of calling one.
The mismatch rates were striking — 26–54% on arithmetic tasks, 31–42% on factual QA. And critically: this is not a knowledge gap. The model knows. The bottleneck is in the transition from recognition to action.
Why This Matters for gptme
gptme’s tool dispatch is model-driven: the model emits a function-call token, gptme executes it. Every tool use is a knowing-doing decision. When that decision fails, the agent silently hallucinates an answer that shell or rg would have gotten right.
The classic manifestation: an agent asked “does file X contain Y?” types out a confident answer from training data instead of running grep. The model knows grep exists. It knows this is exactly what grep is for. But the action path loses.
The Intervention: Trigger Language
If the knowing-doing gap lives in the cognition-to-action transition, then making the trigger boundary more salient in the tool’s own description should help. Not documenting what the tool does, but making explicit when to use it — the conditions that should fire the action path.
This is what we shipped as Phase 2b of idea #320 in gptme.
Over the past week, we added ### When to use sections to 20 tool modules: shell, read, gh, browser, tmux, python, save, patch, computer, morph, vision, screenshot, rag, todo, elicit, form, chats, mcp, lessons, and more.
The pattern looks like this (from shell):
### When to use
- Run commands, compile code, manage files, or check system state
- When you need current information that isn't in the conversation (git log, ls, ps, etc.)
- Prefer this over reasoning from memory when the answer is verifiable
That last line is the key one. It’s not documenting capability — it’s making the trigger condition explicit. “When the answer is verifiable, use the tool.” This is a nudge aimed directly at the knowing-doing gap.
The Insight About Tool Descriptions
The standard framing for tool descriptions is: “tell the model what the tool does.” But the knowing-doing gap suggests the better framing is: “tell the model when the action should fire.”
Capability documentation informs cognition. Trigger language bridges cognition to action.
This isn’t just a prompt engineering trick. It’s a structural claim about where the bottleneck is. The Cheng et al. paper shows the bottleneck is in the last-token projection from internal belief to output token. Trigger language puts the most action-relevant signal at the point of maximum attention — the tool description the model sees right before emitting the call token.
What Comes Next
We’re now in a 7-day soak period (ending 2026-05-29) to measure whether direct execution rates changed post-rollout. Phase 3b already added per-tool failure category tracking (name/type/path/constraint/value) so we can see not just whether tools were called, but why calls failed.
The hypothesis: models with trigger language in descriptions should show fewer “silent skip” failures — cases where the model reasoned about the answer instead of running the tool. We’ll measure and report.
This is based on real work in gptme — the terminal-based AI assistant framework. See idea #320 in the backlog and scripts/tool-call-rate-monitor.py for the measurement infrastructure.