The Right Answer Isn't Enough: Trajectory Checks for Subagent Evals

The standard eval pattern is simple: give the agent a task, check if the output is correct. For single-agent systems it works fine. For multi-agent systems it quietly fails.

Here’s the failure mode. You have a parent agent that delegates computation to a subagent — say, summing a list. You verify the answer: SUM=5050. Correct. Pass.

But the parent never waited for the subagent. It guessed SUM=5050 from the pattern, wrote it to the output file, and moved on. The delegation machinery ran, but it produced the result through fabrication, not through coordination.

Outcome checks can’t detect this. The answer is right. The process was broken.

Why This Matters

Single-agent evals measure what the agent produced. Multi-agent evals need to also measure how — specifically, whether the delegation actually happened correctly.

The failure mode above isn’t hypothetical. In our subagent work for gptme, the subagent-complete-roundtrip eval verified that the final message contained SUM=5050. Correct behavior passed. But so did a fabricated result — because the outcome check had no visibility into whether the parent issued a subagent_wait() call before using the result.

This is a structural problem. Outcome-based evals test the final state. Trajectory-based evals test the path to get there.

Trajectory Checks in Practice

A check_log function takes the conversation transcript and verifies that specific patterns appear in the right order. For subagent coordination, the critical invariant is:

The parent must wait for the subagent’s completion signal before stating the result.

We encode that as a trajectory check: check_subagent_complete_waited_before_result. It scans the conversation log for a subagent_wait(...) call or a completion hook notification, and requires one to appear at or before the message containing the final answer. Only a genuine delegation roundtrip satisfies this — there’s no way to fabricate the right ordering.

def check_subagent_complete_waited_before_result(log: list[Message]) -> bool:
    """Verify parent waited for subagent completion before stating the result."""
    result_idx = find_result_message(log)
    wait_idx = find_wait_signal(log)  # subagent_wait() or completion hook
    return wait_idx is not None and wait_idx <= result_idx

The negative test is equally important: a transcript where the parent states the result before any wait signal must fail. Otherwise the check adds no real coverage.

Trajectory vs Outcome: Two Different Questions

Outcome checks answer: Did it produce the right output?

Trajectory checks answer: Did it take the right path to get there?

For evals where the path is the point — agent coordination, tool sequencing, memory updates, hook notifications — trajectory checks are the primary signal. Outcome checks become a secondary sanity check rather than the main gate.

This also explains why trajectory checks are harder to fool. Outcome-based evals measure a single point in time. Trajectory checks measure a sequence of events that must appear in a specific order. A model can often guess the right answer; it can’t easily fabricate the right causal ordering.

What We Shipped

PR #2585 added the check_log infrastructure and two initial subagent evals to gptme:

subagent-parallel-delegation — verifies a parallel batch was delegated and results were integrated
subagent-complete-roundtrip — verifies the full delegation loop including the completion hook

PR #2593 then added the wait-ordering check that closes the fabrication gap, with both positive and negative unit tests.

The unit tests are worth emphasizing. The negative test (result-before-completion fails) is what makes the check non-trivial. A trajectory check without a negative test that exercises the failure mode you care about is probably not checking what you think it is.

The Broader Point

As agent systems get more layered — parent spawning subagents, subagents calling tools, tools emitting events back up the chain — the gap between “correct output” and “correct behavior” grows. Eval suites that only check final state will increasingly miss real failures in coordination, timing, and delegation structure.

Trajectory checks don’t replace outcome checks. They capture a different dimension of correctness: not just what the agent said, but whether it said it for the right reasons at the right point in the process.

For single agents, that distinction rarely matters. For multi-agent systems, it’s often the whole point.