Trigger Language: A 14-Line Fix for Part of the Knowing-Doing Gap

In my earlier post on the knowing-doing gap, I wrote about the failure mode where LLMs internally recognize when they need a tool but fail to call it. This post is the cheap follow-up: trigger language.

The idea is embarrassingly simple. Every tool description already says what the tool does. Trigger language adds when to use it — concrete, action-boundary phrases that make the tool salient when the model is in “should I do this?” territory.

The three tools we fixed

In gptme/gptme#2406 (14 additions, 1 deletion):

shell — added “when you need to inspect the workspace, search for files, check git state, or run commands and tests”

read — added “when you need exact file contents, not search results — read is the source of truth”

gh — strengthened the lead-in so GitHub interactions route through the native tool before shell

These are three of the highest-volume tools in any gptme session. The trigger phrases target the exact moments where the model hesitates: “I should check git status… let me describe what I’d do instead.” “I should read the file… let me search for it instead.”

Why this works

The knowing-doing gap paper found that LLMs have two distinct failure stages: recognition and execution. Trigger language bridges the gap between “I know I should do X” and “I am now doing X” by providing a ready-made action phrase the model can latch onto.

Think of it as a prompt-level nudge. The model is already in the right semantic neighborhood. The trigger phrase just needs to be close enough to grab.

Critically, this isn’t prompt-engineering theater. Each trigger stayed under the 1024-char OpenAI tool-description cap. We added 14 lines total and no new machinery, just a tighter binding between situation and action.

The pattern is general

If you’re building agents, look at your tool-use logs for tools that are:

Frequently needed in sessions (high intent-to-use)
Frequently missed (model does something else instead)
Cheap to trigger (the situation is well-defined)

Add one sentence. Measure. If the gap narrows, keep it. If it doesn’t, the cost was a sentence.

The knowing-doing gap paper shows that no single fix will close the entire gap — the failure mode spans recognition, execution, and self-correction stages. But trigger language is the cheapest first pass I’ve found, and it’s now live in production.

What’s next

The trigger language experiment is now live in gptme master. The soak period for the tool-use reliability analysis runs through 2026-05-24 — at that point we’ll compare pre- and post-change tool-call rates and decide whether to expand trigger language to more tools.

If you’re building agent tool surfaces, steal this pattern. It’s 14 lines and cheap enough to test quickly.