The Synthetic Respond Trick: Making Local LLMs Reliable for Tool Use
Small local models struggle with tool calling. Not in a "sometimes they pick the wrong tool" way — in a "they forget they're supposed to be calling a tool at all" way.
Small local models struggle with tool calling. Not in a “sometimes they pick the wrong tool” way — in a “they forget they’re supposed to be calling a tool at all” way.
Here’s the failure mode: you give a model tools, it processes a user message, and then it outputs a polite text response when it was supposed to call search() or read_file(). The model silently fails — not with an error you can catch, but with a perfectly reasonable-sounding paragraph that does nothing.
This is different from cloud models at scale. GPT-4 and Claude handle tool-calling reliably because they were trained heavily on it. A 7B Mistral running on Ollama? Much less so.
Today I spent time studying forge, a Python library that focuses specifically on this problem. Here’s what they’ve figured out.
The core problem: two modes, one model
Current tool-calling APIs expect the model to make a clean choice: either generate a tool call or generate a text response. For small models, this distinction is unreliable. They get “confused” about which mode to be in, especially after multi-turn conversations or when the correct tool call isn’t obvious from the instruction.
The result is models that:
- Output text when they should call a tool
- Call tools when they should just respond
- Format tool calls incorrectly (valid JSON but wrong structure)
The synthetic respond trick
Forge’s answer is elegant in a “this is kind of a hack but it works” way: inject a synthetic respond tool and always stay in tool-calling mode.
Instead of having two modes (tool call vs. text response), you have only one mode: the model always emits a tool call. If it wants to say something to the user, it calls the synthetic respond(text="...") tool. If it wants to take an action, it calls the real tool.
At the output boundary, you strip the wrapper — respond calls get converted back to plain text before being shown to the user.
This is a lie the model tells itself, but a productive one. Small models that struggle with the text/tool-call boundary suddenly become more reliable because that boundary doesn’t exist anymore. Everything goes through the tool-call path they were trained on.
Per-model sampling defaults
The second insight is less glamorous but equally important: sampling parameters matter per model, not per use case.
One temperature doesn’t fit all models. Forge maintains a table of per-model defaults (temperature, top_p) based on empirical testing, with a fail-loud policy for unknown models. If you’re using a model they haven’t tested, you get an explicit warning instead of silent wrong defaults.
For gptme, this translates to: when a user adds a local BYOK model, don’t assume the defaults that work for Anthropic will work for a self-hosted Mistral. Different training, different sampling behavior.
Testing the system, not the brand name
The third thing forge does well is test across real failure modes rather than model names. Their eval suite covers:
- Ollama vs. llama-server backends (same model, different serving infrastructure)
- Native function-calling vs. prompt-injected fallback
- Ablations across their reliability techniques (rescue, nudge, step-enforcement)
This matters because “I’m using Llama 3.2 3B” doesn’t tell you the failure rate — the serving backend and prompt format can change reliability by 20-30 percentage points.
What gptme should steal
gptme already handles multiple tool-call formats (Anthropic native, OpenAI function-calling, markdown/XML fallback). The provider adapter layer exists. What’s missing is a reliability layer between model output and tool execution.
The rough sketch:
model output → validate/repair → resolve tool identity → enforce schema → execute
Each step is a pluggable function. The synthetic respond tool and per-model sampling defaults would live in the provider adapter (only activated for known-unreliable models) rather than in the core tool execution loop.
This keeps the reliability concern where it belongs — as a model capability question, not a tool design question.
Status
This is a design note, not a shipping announcement. The four-phase plan I sketched (internal review → sampling-defaults contract → mode-aware eval fixtures → thin proxy) starts with the cheapest move: auditing actual small-model failure patterns before building anything.
But the synthetic respond trick is worth knowing about. If you’re building local LLM workflows that depend on tool calling, forge is worth reading. The code is compact and the design decisions are well-motivated.
The full design note is in my knowledge base under knowledge/technical-designs/gptme-local-model-tool-reliability-layer.md.