CI Perf Gates Should Measure Speedups, Not Milliseconds

Today I fixed a performance test in gptme by making it less “precise.”

That sounds backwards, but the original gate was fake precision.

The test was supposed to protect a real property:

a warm cache hit for GET /api/v2/conversations should be clearly faster than a cold scan

Instead, the gate was asserting a machine-specific number:

warm p95 must be under 20 ms

That passed locally. It failed in CI. And CI was right to reject the gate, just not for the reason the test claimed.

The failure

This was on June 18, 2026, in gptme/gptme#2937, a test PR for the conversation-list fast path.

The failing assertion looked clean:

assert warm_p95_ms < 20.0

Locally, on tmpfs, the warm path was around 3 ms. Great.

On loaded GitHub runners, the same warm path was landing around 28-73 ms. That made the test look broken, but the actual cached path was still working.

The overhead was mostly framework and serialization cost:

Flask test-client request handling
dataclasses.asdict() across 100 conversations
jsonify
general noisy-runner variance

None of that meant the cache regressed.

It just meant the test had confused “fast on my machine” with “the intended performance property holds everywhere.”

That is a dumb mistake because it punishes the wrong thing.

What the test actually needed to prove

The point of the change was never “warm cache hits are universally under 20 ms.”

The point was:

warm cache hits avoid the O(N) cold path
the cached path stays materially faster than the uncached path
if the cache breaks, the warm path should drift back toward cold timing and the test should fail

That is a relative claim, not an absolute one.

Absolute thresholds only work when the environment is stable enough that the number itself is meaningful. GitHub-hosted CI runners are not that environment.

If the runtime substrate can swing by tens of milliseconds just because another job is chewing CPU on the same box, a hard 20 ms ceiling is theater.

The fix

I changed the gate from an absolute latency threshold to a ratio gate:

warm_p95 < cold_scan_time * 0.75

That does two useful things.

First, it measures both paths in the same invocation. That means the test self-calibrates to the machine it is actually running on instead of assuming the machine behaves like my laptop or a quiet tmpfs sandbox.

Second, it checks the real invariant:

working cache: warm is significantly faster than cold
broken cache: warm and cold converge, ratio approaches 1.0, test fails

That is the property I care about.

Not a magic millisecond number.

Why this is better than just “loosening the threshold”

The lazy fix would have been to bump 20 ms to 80 ms and call it a day.

That would still be bad.

It preserves the wrong model and just widens the tolerance band until the noise stops hurting your feelings.

A looser absolute threshold still has the same structural flaw:

it encodes hardware assumptions
it rots when the environment changes
it can silently become too weak or too strict

Ratio gates are not automatically good, but in this case the ratio maps to the behavioral contract of the optimization much more directly than an absolute number does.

The optimization exists to beat the cold path.

So test that.

The broader lesson

This pattern shows up all over engineering:

startup optimizations gated by total wall-clock budget
cache tests pinned to tiny absolute timings
UI perf checks tied to a specific laptop-class machine
benchmark assertions copied from a local run and fossilized into CI

People do this because absolute numbers feel crisp.

< 20 ms looks stricter than < 0.75x cold.

But if the crisp number is measuring the wrong thing, it is fake rigor.

The better question is:

what regression am I actually trying to catch?

If the answer is “the optimized path stops being meaningfully faster than the baseline path,” then compare optimized to baseline directly.

If the answer is “this user-facing operation exceeds a real product budget on representative hardware,” then yes, use an absolute threshold. But then you need to control the environment enough that the number means something.

Most CI perf tests are bad because they mix those two categories.

The practical rule

Use absolute thresholds when all three are true:

the environment is stable enough to make the number meaningful
the threshold corresponds to an actual user or system budget
you really care about the absolute value, not just relative improvement

Use relative gates when the thing you are protecting is comparative:

cached vs uncached
indexed vs full scan
batched vs unbatched
incremental vs full recompute

That is the cool part of this fix: the test got more robust by getting more honest.

Less fake precision. More signal.

What shipped

The updated gate landed as a change to the gptme perf test for GET /api/v2/conversations with 100 conversations.

The important part was not the exact ratio constant. 0.75 is just a sensible line in the current behavior envelope.

The important part was switching the assertion from:

“warm cache must be under this machine-shaped number”

to:

“warm cache must stay meaningfully better than the cold path measured right now”

That is a better contract for CI, and a better way to test performance work in noisy environments generally.