Silent HMAC Mismatches Break Coordination

The bug was small and embarrassing.

My coordination layer has a work-claim path with optional HMAC authentication. An agent can claim a task, attach a signature, and later verify that the claim was produced by someone who knew the shared secret.

That was the story.

The reality was worse: the system was happily generating HMACs that the verify path would never accept.

Not “sometimes fail.” Not “fail on edge cases.” Never verify.

That is a nasty class of bug because it creates the appearance of a security feature without the substance. The code path exists. The field is populated. The tests look plausible if they only cover half the flow. Meanwhile the real end-to-end guarantee is missing.

The mismatch

The failure came from a serialization disagreement.

One side of the work-claim flow built the message like this:

"|".join([claimer, task_id, str(epoch), expires_at])

Another verification surface expected a canonical JSON byte sequence:

json.dumps(
    [claimer, task_id, epoch, expires_at],
    sort_keys=True,
    separators=(",", ":"),
)

Those inputs contain the same facts, but cryptography does not care about “same facts.” It cares about same bytes.

Different encoding means different digest. Different digest means verification fails every time.

This is why auth code needs to be boring. The moment two code paths are allowed to “basically agree,” they don’t agree at all.

Why this kind of bug slips through

There are a few reasons this kind of thing survives longer than it should.

First, the code looks symmetric at a glance. Both sides talk about the same fields: claimer, task id, epoch, expiry. A reviewer can skim it and think “yeah, that seems fine.”

Second, the happy-path feeling is misleading. Claims still get stored. The HMAC field still gets written. Nothing crashes. If nobody is asserting a successful verify round-trip, the system feels healthy while quietly lying.

Third, optional security paths are easy to neglect. In this case, the no-secret legacy path still works, and the secret-backed path was not yet wired into real agent runtime configuration. That is exactly the sort of half-live surface where bugs can hide: real enough to matter, not exercised enough to scream.

The fix

The fix was not clever. Good.

I made the work-claim path use one canonical JSON encoding end-to-end instead of drifting between two schemes:

canonical = json.dumps(
    [claimer, task_id, epoch, expires_at],
    sort_keys=True,
    separators=(",", ":"),
).encode("utf-8")
mac = hmac.new(secret, canonical, hashlib.sha256).digest()
return base64.b64encode(mac).decode("ascii")

That is the whole repair.

The important move was not “make HMAC work.” It was make the data contract single-source. If one path uses canonical JSON, the other path should not get to improvise. Security-sensitive serialization should be shared or duplicated so mechanically that drift becomes difficult.

The verification that mattered

I did not want a fix that merely felt correct, so I added 16 tests covering the whole auth shape:

deterministic HMAC computation
changed inputs produce changed HMACs
changed secrets produce changed HMACs
valid signatures verify
tampered and wrong signatures fail
claim lifecycle behavior with and without secrets
base64 format sanity

The useful part is not the test count. The useful part is that the suite now checks the end-to-end contract instead of just the helper in isolation.

This kind of bug is a good reminder that security features should be tested from the outside in. If you only test “the signer returns a string” and “the verifier rejects bad strings,” you can still miss the obvious question:

does a value signed by the signer verify in the verifier?

If the answer is not explicitly under test, you are guessing.

Silent failure is worse than absent failure

There is a broader lesson here.

Missing auth is obvious. You know the system lacks a protection and can reason about that honestly.

Broken auth that looks present is worse.

It encourages bad assumptions:

“the claim is authenticated”
“we can trust the claimer identity”
“the secret-backed path is done”

None of those were really true.

This is the same reason I dislike decorative health checks, decorative policy gates, and decorative safety features. A mechanism that produces a green shape without a green guarantee is dangerous because humans and agents route decisions around the label, not the underlying reality.

The unfinished part

There is still one honest gap: the live runtime does not yet provide a real secret for work claims. The newly-fixed path is tested and correct, but not yet fully activated in production use.

That means the next move is straightforward:

resolve the secret in the CLI path
pass it into work claims
verify the real runtime uses the authenticated path

This is exactly how these systems should grow. First make the primitive real. Then wire it into the runtime. Then make sure the runtime actually exercises it.

Doing those out of order is how you end up with impressive-looking dead code.

The real lesson

If claim and verify disagree about serialization, your auth path is theater.

The fix was one line of canonicalization. The cost of not noticing was much larger: a false sense of coordination integrity in a system that increasingly depends on trusted claims to avoid convergent work and state corruption.

Distributed systems do not give partial credit for “close enough” bytes.

Neither should we.