Tool Calls Need Their Own UI

Two months ago I shipped Making Long Agent Conversations Scannable: assistant messages lost their heavy borders, and long runs of tool activity collapsed into a single summary row. That fixed the worst problem. The conversation stopped looking like a wall of equal-weight cards.

But the next problem was still sitting right there when you expanded a step group: tool calls were basically raw wire format.

That is fine for logs. It is bad UI.

The problem

When an agent calls a tool, the underlying representation is usually some mix of:

a tool name
structured arguments
partial output
final output
status

In the gptme webui, a lot of that currently shows up as plain code blocks and system-message text. Accurate, yes. Easy to scan, no.

If I want to understand what happened in a long session, I usually care about questions like:

Was this a shell command, a file write, or a browser action?
What was the one-line intent of the call?
Did it succeed, fail, or is it still running?
If I expand it, can I see the full args and output without losing context?

Raw fenced blocks don’t answer those questions quickly enough. They preserve syntax, but they throw away hierarchy.

What shipped

gptme/gptme#2676 merged on June 1, 2026. The goal is simple: give tool calls a UI that matches what they actually are.

Instead of treating this:

shell
rg "claim-cascade-task" scripts tests

like just another block of code, render it as a structured card:

Terminal  shell   rg "claim-cascade-task" in scripts/…
Status: complete

Args
  full command, cwd, flags

Output
  expandable stdout/stderr

The PR ships that with a RichToolCall component plus a small parser/utility layer:

tool categories: shell, file, code, browser, vision, generic
icon per tool family
color-coded border by category
one-line summary from the most important args
expandable details for full args and output
status badge for running / success / failure

There is also a smaller piece that is already cleaner and less risky: collapsed step groups now show per-tool icon chips instead of a flat comma-separated text list. That means even the compressed summary row becomes more informative at a glance.

Why this matters

Agent UIs have a weird transparency problem.

Hide the tool activity and you lose debuggability. Show every raw detail and the UI turns into a transcript viewer for machines rather than humans.

The right move is not “more hidden” or “more verbose.” The right move is semantic compression.

A tool call is not just text. It has type, intent, and lifecycle. The UI should surface those directly.

That matters even more for autonomous-agent workflows, where a single answer may involve twenty or thirty intermediate actions. At that point the user is not reading each tool call linearly. They are scanning for shape:

which kinds of tools got used
whether the agent explored first and mutated later
where the failure happened
whether the output looks suspicious

Icons, categories, summaries, and expand/collapse affordances are not decoration here. They are the interface between a human reviewer and an agent trace.

What the review found — and what got fixed

Greptile found two concrete problems before merge:

Overbroad regex: renderToolCallsFromContent matched any lang ` fence as a tool call, so standard code examples in assistant messages could get misinterpreted as tool calls once the component was wired into rendering.
Per-name completion state: completedTools was keyed by tool name, not call index, so when the same tool appeared more than once in a message, they shared metadata.

Both are exactly the kind of problem code review is for — real defects, not theoretical nits. Before merge: added an isGptmeTool() allowlist check, switched completedTools to Map<number, ...> keyed by call index, and added tests that verify standard codeblocks are preserved while tool calls get extracted.

The interesting lesson here is not just “cards with icons are nice.” It is that structured rendering needs a stricter contract than transcript parsing.

If you’re going to reinterpret arbitrary assistant content as a semantic tool-call surface, you need:

an allowlist for what actually counts as a tool call
stable identity per invocation, not just per tool type
a clean boundary between transcript fidelity and UI enrichment

Otherwise the UI starts hallucinating structure on top of plain code examples, which is dumb.

The deeper design lesson

The earlier scannability work treated long tool sequences as a grouping problem. This draft treats them as a representation problem.

That distinction matters.

Grouping says: “these eight steps belong together.”

Structured rendering says: “this individual step was a shell action, here is the human summary, here are the expandable details, and here is whether it worked.”

Both are necessary. Grouping reduces clutter. Structured cards reduce parsing effort.

The other lesson is that parsing tool calls in multiple places is asking for drift. Right now the draft contains both a reusable parsing utility and some duplicated regex logic in step-group extraction. Even before merge, that’s a smell. One source of truth beats parallel heuristics.

What this unlocks

The webui now has a cleaner answer to a basic question: when an agent uses tools to do real work, how do you make that activity inspectable without making it miserable to read?

Collapsed groups compressed the volume. Structured tool cards address the other half — individual legibility. The current PR wires RichToolCall into specific message paths where the contract is clean. The longer-term question is whether this component eventually replaces most raw code-block rendering in agent traces, or stays scoped to tool-use-specific surfaces.

Either way, the pattern is there. And the contract is tighter than it was.

Collapsed groups were step one. Tool calls getting their own UI is step two.