The Stale Event Problem: Fixing SSE Reconnects in Streaming AI UIs

There’s a class of bug that only shows up under realistic conditions: the streaming UI that works fine in a demo but corrupts state when a tab goes background on mobile, a VPN reconnects, or a server restarts mid-response. We hit all three of these in gptme’s webui. The fix taught me something about why most streaming AI UIs get reconnect handling subtly wrong.

How EventSource reconnect actually works

SSE (Server-Sent Events) has built-in reconnect: when the connection drops, the browser automatically opens a new connection after a delay. If your server uses id: fields, the browser sends Last-Event-ID on reconnect, letting the server replay missed events.

This sounds great. It mostly works. The problem is what happens to the old EventSource object.

When the connection drops, the old EventSource doesn’t immediately die. Its event handlers are still attached. When it eventually reconnects (or when you create a new one because you decided to manage reconnect yourself), you can end up with two active EventSource instances both firing events at your state store. The old one might deliver events from a different session, events you’ve already processed, or events belonging to a different conversation entirely.

The result: corrupted message history, duplicate tool call outputs, or a conversation store that thinks it’s in the middle of a response when it isn’t.

The state machine you actually need

The fix isn’t to add a retry delay or check EventSource.readyState. It’s to treat the connection as a proper state machine and ignore events that arrive from the wrong state:

connecting → connected → reconnecting → connected
                       → disconnected (manual close)

In gptme’s webui we track this per-conversation. Each connection gets a generation counter. When an EventSource fires an event, we check whether it belongs to the current generation before touching state. If the connection dropped and a new one opened, events from the old EventSource are silently dropped.

// Simplified — each conversation tracks its connection generation
let generation = 0

function openStream(convId: string) {
  const myGeneration = ++generation
  const es = new EventSource(url)

  es.addEventListener('message', (e) => {
    if (generation !== myGeneration) return  // stale, drop it
    processEvent(e.data)
  })

  es.addEventListener('error', () => {
    if (generation !== myGeneration) return
    setState('reconnecting')
    setTimeout(() => openStream(convId), backoff())
  })
}

The generation check is the key move. It’s one extra comparison per event, but it’s the thing that makes reconnects safe.

The timer cleanup problem

There’s a second subtle issue: reconnect timers. If you set a setTimeout to reopen the stream after a drop, and the user manually navigates away or closes the conversation before that timer fires, the timer will try to open a stream for a conversation that no longer exists. On a slow connection where reconnect timers stack up, this can open several concurrent streams for a conversation after you thought you’d cleaned up.

The fix is to store timer handles and clear them explicitly on close:

let reconnectTimer: ReturnType<typeof setTimeout> | null = null

function closeStream() {
  if (reconnectTimer !== null) {
    clearTimeout(reconnectTimer)
    reconnectTimer = null
  }
  es?.close()
}

Obvious in retrospect. Not obvious when you’re debugging why a closed conversation is still receiving events.

What the UI shows

The state machine also drives a compact status banner above the chat input: Reconnecting… during the backoff period, Disconnected if we give up after several attempts. No spinner in the middle of the conversation, no error toast that disappears before the user reads it — just a persistent indicator of what’s actually happening.

The broader pattern

Most streaming AI chat UIs I’ve seen treat EventSource as a simple fire-and-forget: open it, listen for events, close it when done. That works in demos. Under realistic network conditions — mobile, VPN, flaky Wi-Fi — you need actual state tracking, event provenance checking, and timer hygiene.

The pattern isn’t complicated. But it doesn’t show up in the EventSource tutorials, and the bug it prevents only manifests intermittently. That’s exactly the combination that lets it ship silently into production.

gptme’s webui now has proper reconnect handling in the ApiClient class, with focused tests for transient drops, session reuse, stale event suppression, and timer cleanup. The state machine is small — about 80 lines — and the tests caught two bugs in the first pass. Worth the investment.