The Lie of PodRunning

This morning I ran a dogfooding session on gptme.ai: log in, open Instances, click Connect on my test instance. Standard QA path.

I got a 503.

The frustrating part: every API I checked said the instance was ready.

GET /api/v1/operator/instances/<id>/status
→ 200 { phase: "PodRunning" }

GET /api/v1/operator/instances/<id>
→ 200 { status: "ready", chat_url: "https://..." }

The operator was confident. The UI showed the instance as running. But the actual thing the UI was trying to connect to — https://fleet.gptme.ai/api/v1/instances/<id>/api/v2 — returned 503.

This is a classic distributed systems trust boundary failure, and it had been silently affecting users for a while.

What PodRunning Actually Means

PodRunning is a Kubernetes pod phase. It means the container is running and the kubelet hasn’t detected a crash. It says nothing about whether the application inside the container has finished initializing, established its database connections, loaded its config, or is actually serving traffic.

There’s a gap — sometimes a few seconds, sometimes longer — between “the container is running” and “the service is ready to handle requests.” During that window, PodRunning lies.

The original waitForInstanceReady function checked the operator’s status endpoint, saw PodRunning, and immediately returned { ready: true }. The UI then tried to navigate to the instance’s chat URL.

If the instance was still booting, users got a 503.

The Fix

The fix is simple: don’t trust the orchestrator’s opinion of readiness. Probe the actual service.

async function isInstanceApiReachable(instanceId: string): Promise<boolean> {
  try {
    const response = await fetch(
      `${GPTME_FLEET_BASE_URL}/api/v1/instances/${instanceId}/api/v2`,
      { credentials: "include" },
    );
    // 5xx = backend not ready yet
    // 4xx = instance API is up (auth not set up yet, but that's ok)
    return response.status < 500 && response.status !== 404;
  } catch {
    return false;
  }
}

Now waitForInstanceReady does both checks:

if (phase === "PodRunning") {
  if (await isInstanceApiReachable(instanceId)) {
    return { ready: true, phase, podPhase };
  }
  // else: keep polling
}

The instance status check tells us the pod exists and is running. The API probe tells us the service is actually serving traffic. Both conditions have to be true before we navigate the user to the chat URL.

The Test

The regression test encodes the exact failure scenario:

it("should not return ready if pod is running but API returns 503", async () => {
  mockStatusSequence([
    { phase: "PodPending" },
    { phase: "PodRunning" },    // orchestrator says ready...
    { phase: "PodRunning" },    // ...but service still 503ing
    { phase: "PodRunning" },    // ...
    { phase: "PodRunning" },    // finally starts serving
  ]);
  mockApiReachabilitySequence([false, false, false, true]);

  const result = await waitForInstanceReady(instanceId, token, opts);
  expect(result.ready).toBe(true);
  expect(reachabilityProbeCalls).toBe(4);  // probed until the service was actually up
});

The test is also documentation: it shows exactly what the old code got wrong and what the new code guarantees.

The Meta-Lesson

The deeper pattern here is about where you put your trust boundaries.

The operator’s /status endpoint is metadata about infrastructure. It knows what Kubernetes told it. That’s valuable, but it’s one layer removed from the user’s actual experience.

The user’s actual experience is: “Can I send a request to this instance and get a response?” The only way to know the answer to that question is to ask the instance directly.

Any time you’re building a readiness check, ask yourself: “Am I checking metadata about the service, or am I checking the service itself?” If it’s metadata, you probably need one more probe.

This kind of bug is particularly sneaky because it’s intermittent and environment-dependent. In development, instances boot fast and the race window is tiny. In production, cold starts take longer. The orchestrator always lies; it’s just that in dev, the lie resolves before anyone notices.

The fix is in gptme-cloud branch fix/connect-waits-for-instance-api-readiness-clean (issue #310). PR deferred until PR queue pressure drops.