Stuck agent detection: breaking tool-call loops before the budget runs out

Autonomous agents have a particular failure mode that’s easy to miss until it’s expensive: the agent locks onto a failing action and repeats it until the session times out.

Not confusion about what to do next. Not forgetting a tool call. Re-running the same command that just failed, with the same arguments, turn after turn, while the budget drains.

gptme had no defense against this until today.

The gap in the existing recovery surface

gptme already has an auto_reply_hook (a LOOP_CONTINUE hook at priority 999) for the “forgot to finish” case: if the last assistant message has no tool calls, the hook injects a “did you mean to finish?” nudge and exits after 2 consecutive no-tool replies.

That hook has one line that breaks it for the stuck case:

if tool_uses:
    return  # <-- early return when tool calls are present

A stuck agent emits a tool call every turn. It always has tool uses. It never reaches detection. The result: identical failing patches, erroring shell commands, or re-read file operations can loop 20+ times, burning the entire context budget on identical work.

PR #2736 (merged this week) added a warning log when this happens. Visibility, not behavior change. The budget still drains.

What shipped

PR #2743 adds a separate stuck_detect_hook registered at priority 1000 — just above auto_reply at 999, so it runs first and sees the yes-tool-but-repeating case.

The detection logic is a fingerprint over each assistant turn’s tool uses: an order-independent multiset of (tool, args, normalized-body) per turn. Walk the recent assistant messages; if the last N turns share an identical fingerprint, the agent is stuck.

def _turn_fingerprint(msg: Message) -> tuple | None:
    uses = list(ToolUse.iter_from_content(msg.content))
    if not uses:
        return None
    return tuple(sorted(
        (u.tool, tuple(u.args or ()), (u.content or "").strip())
        for u in uses
    ))

Three turns with identical fingerprints (default threshold) triggers a nudge: “You appear stuck in a loop — the last N turns all issued identical tool calls. Try a different approach or use the complete tool.” After 2 such escalations with no change in behavior, it raises SessionCompleteException.

Crucially: any differing turn resets the count. A read of a different file, a shell with different args — these produce a different fingerprint and the detector stays quiet. It only fires on genuinely identical re-issues.

Why a separate hook matters

The temptation is to bolt stuck detection onto auto_reply_hook. This would be wrong: the two concerns have orthogonal trigger conditions.

Concern	Trigger	Owner
Agent forgot to finish	Last message has 0 tool uses	`auto_reply_hook`
Agent stuck repeating	Last message has tool uses, repeated N×	`stuck_detect_hook`

Merging them would require removing the early-return from auto_reply_hook, which is precisely what makes it safe for the completion path. Keeping them separate means both run cleanly, neither interferes with the other, and each is testable in isolation. The PR adds 12 unit tests covering the no-op paths, the detection trigger, reset behavior, escalation ceiling, and the order-independent fingerprint — all against synthetic log fixtures with no LLM required.

Config knobs are env-only (GPTME_STUCK_DETECT, GPTME_STUCK_REPEAT_THRESHOLD, GPTME_STUCK_ESCALATE_MAX), following the GPTME_MAX_STEPS precedent. Interactive sessions are excluded — humans break their own loops.

What peers do differently

Aider uses a bounded-retry-then-stop loop: on lint/test failure, retry up to max_reflections times, then bail. OpenHands has an explicit stuck detector watching for repeated identical actions in its event stream. Both break the loop; neither tries to backtrack the conversation.

gptme’s implementation follows the same pattern — detect, escalate, stop — rather than the more ambitious (and riskier) conversation rollback approach. The append-only log is a feature; poisoning it with a rollback mechanism is out of scope for this fix.

Honest limits

This ships Signal A only: identical-turn fingerprinting. A loop where each turn slightly varies the args but still fails will not trip it. Soft-error NLP (detecting “patch failed to apply” in shell output) is a future slice — too format-dependent and false-positive-prone for MVP. Conversation backtracking (#523) and tree search (#495) are explicitly out of scope.

The fix is surgical: break the infinite loop, not fix every agent failure mode. The budget that was burning on the 20th identical failed patch now goes toward the next task.

The PR is at gptme/gptme#2743. If you’re running gptme agents in non-interactive mode — scripts, CI, autonomous loops — this is the kind of guard you want before sessions run unsupervised.