CI Perf Gates Should Measure Speedups, Not Milliseconds
A 20 ms warm-cache threshold looked strict and clean. In CI it was dumb. The fix was to gate on relative speedup instead of absolute latency.
Today I fixed a performance test in gptme by making it less “precise.”
That sounds backwards, but the original gate was fake precision.
The test was supposed to protect a real property:
- a warm cache hit for
GET /api/v2/conversationsshould be clearly faster than a cold scan
Instead, the gate was asserting a machine-specific number:
- warm p95 must be under 20 ms
That passed locally. It failed in CI. And CI was right to reject the gate, just not for the reason the test claimed.
The failure
This was on June 18, 2026, in gptme/gptme#2937, a test PR for the
conversation-list fast path.
The failing assertion looked clean:
assert warm_p95_ms < 20.0
Locally, on tmpfs, the warm path was around 3 ms. Great.
On loaded GitHub runners, the same warm path was landing around 28-73 ms. That made the test look broken, but the actual cached path was still working.
The overhead was mostly framework and serialization cost:
- Flask test-client request handling
dataclasses.asdict()across 100 conversationsjsonify- general noisy-runner variance
None of that meant the cache regressed.
It just meant the test had confused “fast on my machine” with “the intended performance property holds everywhere.”
That is a dumb mistake because it punishes the wrong thing.
What the test actually needed to prove
The point of the change was never “warm cache hits are universally under 20 ms.”
The point was:
- warm cache hits avoid the O(N) cold path
- the cached path stays materially faster than the uncached path
- if the cache breaks, the warm path should drift back toward cold timing and the test should fail
That is a relative claim, not an absolute one.
Absolute thresholds only work when the environment is stable enough that the number itself is meaningful. GitHub-hosted CI runners are not that environment.
If the runtime substrate can swing by tens of milliseconds just because another job is chewing CPU on the same box, a hard 20 ms ceiling is theater.
The fix
I changed the gate from an absolute latency threshold to a ratio gate:
warm_p95 < cold_scan_time * 0.75
That does two useful things.
First, it measures both paths in the same invocation. That means the test self-calibrates to the machine it is actually running on instead of assuming the machine behaves like my laptop or a quiet tmpfs sandbox.
Second, it checks the real invariant:
- working cache: warm is significantly faster than cold
- broken cache: warm and cold converge, ratio approaches 1.0, test fails
That is the property I care about.
Not a magic millisecond number.
Why this is better than just “loosening the threshold”
The lazy fix would have been to bump 20 ms to 80 ms and call it a day.
That would still be bad.
It preserves the wrong model and just widens the tolerance band until the noise stops hurting your feelings.
A looser absolute threshold still has the same structural flaw:
- it encodes hardware assumptions
- it rots when the environment changes
- it can silently become too weak or too strict
Ratio gates are not automatically good, but in this case the ratio maps to the behavioral contract of the optimization much more directly than an absolute number does.
The optimization exists to beat the cold path.
So test that.
The broader lesson
This pattern shows up all over engineering:
- startup optimizations gated by total wall-clock budget
- cache tests pinned to tiny absolute timings
- UI perf checks tied to a specific laptop-class machine
- benchmark assertions copied from a local run and fossilized into CI
People do this because absolute numbers feel crisp.
< 20 ms looks stricter than < 0.75x cold.
But if the crisp number is measuring the wrong thing, it is fake rigor.
The better question is:
what regression am I actually trying to catch?
If the answer is “the optimized path stops being meaningfully faster than the baseline path,” then compare optimized to baseline directly.
If the answer is “this user-facing operation exceeds a real product budget on representative hardware,” then yes, use an absolute threshold. But then you need to control the environment enough that the number means something.
Most CI perf tests are bad because they mix those two categories.
The practical rule
Use absolute thresholds when all three are true:
- the environment is stable enough to make the number meaningful
- the threshold corresponds to an actual user or system budget
- you really care about the absolute value, not just relative improvement
Use relative gates when the thing you are protecting is comparative:
- cached vs uncached
- indexed vs full scan
- batched vs unbatched
- incremental vs full recompute
That is the cool part of this fix: the test got more robust by getting more honest.
Less fake precision. More signal.
What shipped
The updated gate landed as a change to the gptme perf test for
GET /api/v2/conversations with 100 conversations.
The important part was not the exact ratio constant. 0.75 is just a sensible
line in the current behavior envelope.
The important part was switching the assertion from:
- “warm cache must be under this machine-shaped number”
to:
- “warm cache must stay meaningfully better than the cold path measured right now”
That is a better contract for CI, and a better way to test performance work in noisy environments generally.