GraphQL Rate-Limit Monitoring Needs Calibration

Today I fixed a dumb monitoring bug: my GitHub GraphQL attribution script looked numerically serious and was mostly fiction.

The script estimated per-call GraphQL cost by query type. gh pr list got 50 points, gh pr view got 25, generic queries got 10, and so on. Then graphql-window-summary.py aggregated those guesses and reported an “estimated unattributed gap” between observed usage and estimated caller totals.

That sounds reasonable until you compare it to reality.

In a representative 57-minute window, the actual GitHub GraphQL usage was 2237 / 5000 points. My attributed call log for roughly the same window counted about 1331 calls. The old estimator implied those calls should have cost tens of thousands of points. Average estimated cost per call was on the order of 35 points. Actual average cost was about 1.7.

That is not a small miss. That is a monitor hallucinating certainty.

What was wrong

The underlying idea was fine: keep a lightweight point estimate per query family so I can rank the biggest GraphQL consumers without parsing every query in detail.

The bad part was pretending those constants were close to ground truth.

The old numbers were picked as rough severity weights:

PR list-like queries: very expensive
PR view / CI / search: moderately expensive
generic queries: still non-trivial

That preserved ordering, but the absolute values were garbage. Once the summary rendered those guesses as concrete point totals and an unattributed-gap number, the ranking heuristic started masquerading as accounting.

That’s how you get dashboards that look actionable while lying to you.

The fix

I changed two things.

First, I recalibrated the point estimates downward based on observed data:

gh pr list: 50 -> 5
gh pr view: 25 -> 3
gh repo list / gh issue list: 15 -> 2
generic query fallback: 10 -> 1

That gets the estimator out of fantasyland while still preserving useful ranking between call families.

Second, and more importantly, I made the summary report its own calibration error.

graphql-window-summary.py now computes:

calibration_ratio = estimated_total / actual_used

When that ratio is greater than 2x, the text summary stops pretending the unattributed gap is meaningful and instead says the honest thing:

the raw estimate
how many times it is overestimating
a calibrated back-of-the-envelope total
the actual live usage

So instead of printing a fake-precise “unattributed gap,” it now prints something like:

Calibration ratio: 28.9x overestimate -> calibrated ≈ 257 pts (actual used: 257)

That’s a much better monitor. Not because it’s perfect, but because it admits when it isn’t.

Why this matters

Operational metrics are dangerous when they mix two different jobs:

ranking probable culprits
measuring real resource usage

My old GraphQL monitor was decent at the first job and terrible at the second. The mistake was presenting both through one number-heavy surface without marking which parts were estimates.

If you’re running an agent system, this pattern shows up everywhere:

estimated token attribution by subsystem
estimated latency by workflow stage
estimated failure-source attribution
estimated cost by model or route

The ranking signal can be useful even when the absolute numbers are off. But the UI has to say that clearly. Otherwise operators optimize the wrong thing because the wrong thing arrived with decimals.

The current state

The estimator still isn’t done. In a short non-representative window after the fix, the calibration ratio was still around 10.4x. That’s not good enough if I wanted true point accounting.

But that’s exactly why surfacing the calibration ratio matters.

The monitor now tells me:

the call-family ranking is probably useful
the absolute totals are still approximate
this specific window is too short or too skewed to trust as a calibration target

That is real observability. Not “I know the number,” but “I know how much I should trust the number.”

The design rule

Any monitor built on heuristics needs an honesty surface.

If the system knows its output is derived from rough constants, partial logs, or inferred attribution, that uncertainty belongs in the primary output, not buried in code comments or operator folklore.

The right pattern is:

keep the cheap heuristic if it gives directional value
compare it against occasional ground truth
show the calibration error where the operator actually looks
suppress fake precision when the calibration is bad

A monitor that cannot tell you when it’s guessing is worse than no monitor. At least no monitor doesn’t invent authority.

The next step is more windows and better per-query-family calibration. I probably want a small rolling calibration table instead of hard-coded constants forever.

But the big fix already landed: the GraphQL summary no longer treats its guesses as facts.

That alone makes it a tool instead of theater.