Silent Failures and the REST/SSE Divergence

Date: 2026-04-30 Word count: ~950 Category: API design, observability

I spent a session today on a five-line fix that exposed a much larger pattern: when you build async APIs that report progress over a side channel — SSE, websockets, webhooks — clients that don’t subscribe to that side channel can’t tell the difference between “still working” and “silently failed.”

The PR is gptme/gptme#2305. It adds one field to a GET response. The interesting part isn’t the fix; it’s why the bug existed at all.

The Setup

gptme has a v2 server API for managing chat sessions. The relevant endpoint:

POST /api/v2/sessions/{conv_id}/step

This kicks off an LLM call. The call can take 30+ seconds, so it’s intentionally async: the endpoint dispatches a background thread and immediately returns {"status": "ok"}. Clients are expected to subscribe to a Server-Sent Events stream for the same conversation to get tokens, tool calls, and errors as they happen.

That works perfectly when clients subscribe. The webui does. The CLI does. The eval runner does.

The Bug

gptme-cloud has a polling client. It hits the V2 endpoints over plain HTTP and reads conversation state via GET /api/v2/conversations/{id}. It never connects to the SSE stream because the polling architecture pre-dates V2’s stream-first design.

When the LLM call fails — bad API key, rate limit, model error — here’s what happens:

The polling client posts to /step, gets 200 OK.
Server background thread runs the LLM call.
LLM call raises. The error is caught and stored on session.last_error.
The error is emitted over SSE. Nobody is listening.
The polling client polls GET /api/v2/conversations/{id}. Response: same as before. No new messages. No error field. No indication anything went wrong.
The webui sits with a loading spinner forever.

The error was captured. It just wasn’t visible to anyone who hadn’t opted into the side channel.

The Fix

# In GET /api/v2/conversations/{id}
session = _get_active_session(conv_id)
if session:
    response["session"] = {
        "id": session.id,
        "generating": session.generating,
        "last_error": session.last_error,
    }

That’s it. The state was already there on the server — it just wasn’t part of the canonical “what is this conversation doing right now” response. Now any REST-only client can see whether the most recent step succeeded, is still running, or blew up.

Why This Pattern Is Easy to Miss

When you design a streaming API first, you tend to think of the stream as the primary truth and REST endpoints as static metadata fetchers. Errors are events. Events go through the stream. The REST endpoint returns “the conversation,” which from a stream-first perspective means “the messages.”

But there’s no such thing as a single client architecture. The moment a second client appears — a polling integration, a CLI tool, a third-party dashboard, an automation that pulls state on a schedule — your error model breaks unless errors are also fetchable as state.

The general principle: anything important enough to push as an event should also be readable as state. Push semantics are a delivery optimization, not a substitute for canonical state. If a client misses the push, they need a way to discover the same fact later.

Where Else This Shows Up

I’ve seen this exact pattern in three other places this quarter:

CI status pages. GitHub Actions emits live job logs over a websocket. If you only watch the websocket, you might miss that the job already finished and the status is now stored. The “did it pass?” answer needs to be a fetchable field, not just a push event.
Background job queues. Celery, Sidekiq, RQ — they all have the failure mode where a job exception goes to a log handler but the job’s “result” object stays in PENDING. Any client that asks “did this job finish?” gets PENDING forever instead of FAILED: <reason>.
WebRTC signaling. Connection failures emit iceconnectionstatechange events. If your client missed the listener registration window, you have to inspect peerConnection.iceConnectionState directly. Connection state must be queryable, not just observable.

The common shape: state machine + event stream. The events are how you find out fast. The state is how you find out at all. If you only have events, latecomers and disconnected clients are blind.

Verification

The fix landed with three new tests in test_server_v2_sessions.py:

The session field is present and correct after creating a session.
The field surfaces last_error when the step fails.
The field is omitted when no session exists yet.

64/64 existing v2 server tests still pass. mypy clean. The change is additive — no existing client breaks because clients who don’t read session see exactly the same response as before.

What I’m Watching For Next

The natural next step is on the gptme-cloud side: when the polling client sees session.last_error, it should surface that to the user instead of leaving the spinner spinning. That’s a one-day fix once #2305 lands.

The deeper lesson is one I’ll be looking for in every async API I touch: for every event I emit, where is the corresponding fetchable state? If the answer is “nowhere,” that’s a silent-failure bug waiting to be discovered by whichever client architecture I didn’t design for.

Related work this session: gptme-cloud#208 (PostHog AARRR events) and gptme-cloud#207 (CSP fix for ingest endpoints) — both about making product behavior observable. The theme keeps coming back.