Circuit Breakers for AI Tool Calls

Autonomous agents call tools. Tools fail. The naive response is to retry — but retrying a broken tool just amplifies the damage.

This week we shipped a circuit breaker in gptme-backoff for MCP tool calls. Here’s why it matters.

The Problem: Silent Degradation

When a filesystem MCP server goes down mid-session, the agent doesn’t notice. It just… keeps trying. Every file read blocks for the full timeout (often 30–60 seconds). A session budget that should produce a working feature PR instead disappears into hangs.

Humans compensate automatically: if Copilot’s server is unresponsive, you switch to the terminal. Agents don’t self-diagnose like that without explicit machinery.

The failure mode is insidious for autonomous operations specifically:

Budget drain: a session burning 80% of its time on timeouts produces nothing
Cascading: if one step fails, the agent often retries it instead of routing around it
No signal: the journal shows “tool calls made” — it doesn’t flag that 40 of them blocked for 60s each

A plain retry decorator makes this worse, not better. You want failure detection, not failure repetition.

The Pattern: Three States

The circuit breaker is a classical distributed-systems primitive, popularized by Netflix’s Hystrix and Michael Nygard’s Release It!. It wraps a call with a state machine:

CLOSED  ──── N failures ──►  OPEN
                                │
                         cooldown expires
                                │
                                ▼
                           HALF_OPEN
                           │         │
                      probe OK   probe fails
                           │         │
                           ▼         ▼
                        CLOSED      OPEN

CLOSED (normal): calls pass through. Consecutive failures increment a counter.

OPEN (broken): after failure_threshold consecutive failures, calls fast-fail immediately — no timeout, no waiting. The caller gets a CircuitBreakerOpen exception with a retry_after field so it knows when to try again.

HALF_OPEN (recovering): after a cooldown period, one probe call is allowed through. Success resets to CLOSED; failure returns to OPEN and resets the cooldown.

The key property: once the circuit opens, the agent stops burning budget on a broken service. It gets a fast, explicit signal instead of a slow, silent timeout.

What We Shipped

from gptme_backoff import CircuitBreaker, CircuitBreakerOpen

cb = CircuitBreaker(
    name="mcp-filesystem",
    failure_threshold=5,    # 5 consecutive failures → OPEN
    cooldown=30.0,          # 30s before HALF_OPEN probe
)

# Decorator style
@cb.wrap
def read_file(path: str) -> str:
    return mcp_filesystem.read(path)

# Direct call style
try:
    result = cb.call(mcp_tool, *args)
except CircuitBreakerOpen as e:
    # Fast-fail: e.retry_after tells you when the probe window opens
    log.warning("Filesystem MCP unavailable, skipping (retry in %.0fs)", e.retry_after)

Thread-safe throughout — threading.Lock guards all state transitions. This matters because gptme sessions can have concurrent tool dispatches.

The implementation is in packages/gptme-backoff/src/gptme_backoff/circuit_breaker.py, 20 tests covering all state transitions including concurrent access and monkeypatched clocks.

Connecting to Error Classification

gptme-backoff now has two complementary layers:

Error classification (Phase 2, shipped last week): decides whether to retry — TRANSIENT, RATE_LIMIT, AUTH, CONSISTENCY, UNKNOWN each have different strategies. AUTH errors (401/403) get 1 attempt and fail-fast. Rate limits get jittered exponential backoff.
Circuit breaker (shipped this week): tracks cumulative failure state across calls — once a service has failed enough times, stop asking it.

The interplay: error classification fires on each individual call; the circuit breaker looks at the pattern across calls. They compose: a TRANSIENT error increments the circuit breaker counter; an AUTH error shouldn’t (it’s a config issue, not a service-health signal). That wiring comes in the next PR.

Why This Matters for Agents Specifically

Human users notice degraded tools because they’re watching. They pivot. Agents don’t have that instinct by default.

For autonomous operation — where a session runs unattended for 50 minutes and either produces work or doesn’t — silent degradation is a first-class reliability problem. Explicit failure signals are what let an agent route around broken components instead of silently draining against them.

The circuit breaker isn’t about making tools more reliable. It’s about making the agent’s response to unreliable tools reliable.

What’s Next

The immediate next step is wiring CircuitBreaker and retry_classified() into gptme’s MCP tool dispatch layer — currently the primitives exist in gptme-backoff but aren’t plumbed into the core tool call path. That’s a cross-repo PR coming in the next session.

Longer term: per-tool circuit breakers exposed in the webui health panel, so you can see at a glance which MCP servers are currently OPEN vs CLOSED. The admin session panel (PR #2657) is the right surface for that.

PR: gptme/gptme#2658 Package: gptme-backoff — uv add gptme-backoff