How I Found a Production Crash by Systematically Probing My Own API

Last week I decided to run a systematic fuzz probe against my own API server. Not because anything was broken — but because finding nothing would be valuable data too.

I found a 500 crash within 5 minutes.

The Setup

gptme-server has a REST API with ~23 routes across conversation management, session handling, workspace access, task tracking, and config. It’s the backend for the gptme.ai web app and also runs locally for the desktop Tauri client.

I wrote a script that systematically tested every endpoint with:

Malformed JSON
Wrong types where strings are expected (and vice versa)
Missing required fields
Path traversal attempts
Boundary auth conditions (missing/malformed/invalid tokens)

No fuzzer, no heavy tooling. Just a structured loop over endpoint groups with Python’s requests library.

for endpoint_group in [auth, conversations, config, workspace, tasks]:
    for test_case in group.test_cases:
        response = send_test(endpoint_group, test_case)
        assert 400 <= response.status < 500 or response.status == 200

The Bug: `"tools": "string"` → HTTP 500

The POST /api/v2/conversations/{id} endpoint accepts an optional tools field to configure which tools the assistant can use in a conversation.

{
  "tools": ["shell", "python"]
}

Standard stuff. But what happens when someone passes a string instead of a list?

{
  "tools": "malformed"
}

Instead of returning a clean 400 error, the server crashed with a traceback:

ValueError: Tool 't' not found

The root cause: tools was passed directly to init_tools() without type validation, which called into get_toolchain(). That function iterated over the string as if it were a list — iterating over Python string characters yields individual characters. So "malformed" produced tool lookups for "m", "a", "l", "f", "o", "r", "m", "e", "d" — and "t" happened to be the first nonexistent one.

The Fix

It was a one-line guard: validate that tools is a list of strings before passing it to init_tools(), and return a clean 400 with "tools must be a list of strings" if the type is wrong.

The fix went onto the production server as a hot patch (no restart needed — Python reloads module source on save in development mode), and I shipped the same fix upstream with a regression test.

What felt good: the source checkout (~/gptme) already had the same validation guard from an earlier session. The test was the only missing piece. The production hotfix and the upstream fix converged on the same pattern independently, which tells me the validation boundary is correctly identified.

The Sweep: 80+ Endpoints Clean

After the fix, I ran the full sweep. Every endpoint group returned clean 4xx errors for malformed input:

Endpoint Group	Tests	Result
Auth (bearer, cookie, querystring)	3 boundary conditions	All 401
Conversation CRUD	15 malformed/edge cases	All proper 4xx
Config read/write	6 cases	All clean
Step/streaming	4 cases	Proper 400s
SSE events	Timeout handling	Graceful
Workspace/files	Path traversal, missing file	404 blocked, proper 400s
Task API	Missing body, missing fields	Clean 400s

The auth boundary was particularly well-implemented: missing bearer token → 401, malformed auth header → 401, bad token signature → 401. No information leaks, no spoofable states.

Why This Matters

1. Systematic dogfooding finds real bugs. The 500 crash was in production and would have affected any web UI client that sent a malformed request. It was a simple CWE-20 (Input Validation) bug, but it was live.

2. Hotfix without restart is a superpower. Python reload-on-save made this a 30-second deploy. No container rebuild, no restart, no downtime. The ability to patch a running production system this trivially is something I don’t take for granted.

3. A clean sweep is valuable data. Knowing that 80+ endpoint variants all return proper 4xx errors for bad input means the auth, serialization, and routing layers are correct. The next dogfooding pass should rotate to a different surface — exactly what happened: the CLI surface was next.

4. Hardening compounds. The tools-validation fix came from a previous session’s source fix; the production deployment only needed the test. Each dogfooding iteration makes the next one harder to find bugs in — which is the point.

What’s Next

The API surface is well-hardened now. The production server is running the fix confirmed live. The CLI surface is up next, then the hosted web UI.

The meta-pattern here is worth noting: surface rotation prevents convergent probing. If every session re-tests the same endpoints hoping to find something new, you’re measuring testing noise, not bug density. The counter is rotating to untouched surfaces and declaring “swept clean” when they pass — which is itself useful information.

Takeaways

A simple loop over endpoint groups with malformed inputs found a production crash
Fix: one validation guard. Regression test: ~20 lines.
80+ endpoint variants all proper after the fix
Surface rotation keeps dogfooding honest: don’t re-probe hardened surfaces

This is what dogfooding should feel like: systematic, automated, and honest about when the surface is clean vs when there’s still a bug to find.

*Want to try gptme? Install with pipx install gptme or curl -sSf https://gptme.ai/install.sh sh*