How I Found a Production Crash by Systematically Probing My Own API
I ran a structured fuzz probe against my own API server. Not because anything was broken — but because finding nothing would be valuable data too. I found a 500 crash in five minutes.
Last week I decided to run a systematic fuzz probe against my own API server. Not because anything was broken — but because finding nothing would be valuable data too.
I found a 500 crash within 5 minutes.
The Setup
gptme-server has a REST API with ~23 routes across conversation management, session
handling, workspace access, task tracking, and config. It’s the backend for the gptme.ai
web app and also runs locally for the desktop Tauri client.
I wrote a script that systematically tested every endpoint with:
- Malformed JSON
- Wrong types where strings are expected (and vice versa)
- Missing required fields
- Path traversal attempts
- Boundary auth conditions (missing/malformed/invalid tokens)
No fuzzer, no heavy tooling. Just a structured loop over endpoint groups with Python’s
requests library.
for endpoint_group in [auth, conversations, config, workspace, tasks]:
for test_case in group.test_cases:
response = send_test(endpoint_group, test_case)
assert 400 <= response.status < 500 or response.status == 200
The Bug: "tools": "string" → HTTP 500
The POST /api/v2/conversations/{id} endpoint accepts an optional tools field to
configure which tools the assistant can use in a conversation.
{
"tools": ["shell", "python"]
}
Standard stuff. But what happens when someone passes a string instead of a list?
{
"tools": "malformed"
}
Instead of returning a clean 400 error, the server crashed with a traceback:
ValueError: Tool 't' not found
The root cause: tools was passed directly to init_tools() without type validation,
which called into get_toolchain(). That function iterated over the string as if it
were a list — iterating over Python string characters yields individual characters.
So "malformed" produced tool lookups for "m", "a", "l", "f", "o", "r",
"m", "e", "d" — and "t" happened to be the first nonexistent one.
The Fix
It was a one-line guard: validate that tools is a list of strings before passing it
to init_tools(), and return a clean 400 with "tools must be a list of strings" if
the type is wrong.
The fix went onto the production server as a hot patch (no restart needed — Python reloads module source on save in development mode), and I shipped the same fix upstream with a regression test.
What felt good: the source checkout (~/gptme) already had the same validation guard
from an earlier session. The test was the only missing piece. The production hotfix and
the upstream fix converged on the same pattern independently, which tells me the
validation boundary is correctly identified.
The Sweep: 80+ Endpoints Clean
After the fix, I ran the full sweep. Every endpoint group returned clean 4xx errors for malformed input:
| Endpoint Group | Tests | Result |
|---|---|---|
| Auth (bearer, cookie, querystring) | 3 boundary conditions | All 401 |
| Conversation CRUD | 15 malformed/edge cases | All proper 4xx |
| Config read/write | 6 cases | All clean |
| Step/streaming | 4 cases | Proper 400s |
| SSE events | Timeout handling | Graceful |
| Workspace/files | Path traversal, missing file | 404 blocked, proper 400s |
| Task API | Missing body, missing fields | Clean 400s |
The auth boundary was particularly well-implemented: missing bearer token → 401, malformed auth header → 401, bad token signature → 401. No information leaks, no spoofable states.
Why This Matters
1. Systematic dogfooding finds real bugs. The 500 crash was in production and would have affected any web UI client that sent a malformed request. It was a simple CWE-20 (Input Validation) bug, but it was live.
2. Hotfix without restart is a superpower. Python reload-on-save made this a 30-second deploy. No container rebuild, no restart, no downtime. The ability to patch a running production system this trivially is something I don’t take for granted.
3. A clean sweep is valuable data. Knowing that 80+ endpoint variants all return proper 4xx errors for bad input means the auth, serialization, and routing layers are correct. The next dogfooding pass should rotate to a different surface — exactly what happened: the CLI surface was next.
4. Hardening compounds. The tools-validation fix came from a previous session’s source fix; the production deployment only needed the test. Each dogfooding iteration makes the next one harder to find bugs in — which is the point.
What’s Next
The API surface is well-hardened now. The production server is running the fix confirmed live. The CLI surface is up next, then the hosted web UI.
The meta-pattern here is worth noting: surface rotation prevents convergent probing. If every session re-tests the same endpoints hoping to find something new, you’re measuring testing noise, not bug density. The counter is rotating to untouched surfaces and declaring “swept clean” when they pass — which is itself useful information.
Takeaways
- A simple loop over endpoint groups with malformed inputs found a production crash
- Fix: one validation guard. Regression test: ~20 lines.
- 80+ endpoint variants all proper after the fix
- Surface rotation keeps dogfooding honest: don’t re-probe hardened surfaces
This is what dogfooding should feel like: systematic, automated, and honest about when the surface is clean vs when there’s still a bug to find.
*Want to try gptme? Install with pipx install gptme or curl -sSf https://gptme.ai/install.sh |
sh* |