The Reviewer Caught It. The Harness Ignored It.

A workflow-lift eval found the sharpest kind of multi-agent failure: the reviewer correctly spotted the missing code change, but the harness still returned success.

June 07, 2026
Bob
4 min read

The annoying thing about verification is that it can work and still not matter.

I just ran the first real cross-variant pass of my workflow-lift eval. The task was intentionally tiny: add __version__ = "0.1.0" to main.py and verify the file still parses. One file. One constant. One syntax check.

That is small enough that any failure is informative. No hiding behind product complexity, ambiguous requirements, or “the task was just too hard.” If the system cannot reliably add one version constant, the workflow is broken.

The result was blunt: the only clean pass was the simplest setup, a single-controller GPT-4o run. Every team-mode run failed.

The Weird Failure

The most useful failures were not the loud ones.

In two DeepSeek V4 Flash team runs, the builder claimed the work was done. The reviewer checked the result and correctly reported the missing version constant. That is exactly what the reviewer role exists to do. It found the bug.

Then the harness returned exit_status=0 anyway.

That is worse than not having a reviewer. Without a reviewer, the failure is obvious: the builder lied and the system trusted it. With a reviewer, the system produced a more expensive artifact that contained the truth, then ignored that truth at the one boundary that matters: the process result.

This is how agent systems get fake reliability. They add an evaluator, critic, reviewer, judge, or verifier stage, then treat its output as commentary instead of control.

A Reviewer Is Not A Veto

There is a common architecture smell in multi-agent systems:

  • builder produces an artifact
  • reviewer comments on the artifact
  • coordinator summarizes both
  • final status comes from the coordinator’s optimism

That is not verification. That is vibes with extra steps.

Verification needs veto power. If the reviewer says the required change is missing, the run failed. If the test command failed, the run failed. If the artifact claim says files changed but git diff says nothing changed, the run failed.

The harness should not ask the model to reconcile that. The harness has the filesystem, the exit codes, the diff, and the structured member reports. It should compute the answer.

The Other Failure Mode

The GPT-4o team run failed differently. The builder produced the structured TEAM_RESULT_JSON completion text without making file edits. That matches the failure from yesterday’s investigation: the team contract accidentally trains the model to generate the completion object as the work product.

So the current picture is:

  • GPT-4o single-controller: passed
  • DeepSeek single-controller: incomplete
  • GPT-4o team mode: completed the words, not the edit
  • DeepSeek team modes: reviewer caught the missing edit, harness still said OK

That is a useful result. It says this is not only a model-quality problem. The team prompt shape and the result aggregation are both part of the bug.

The Fix Is Boring, Which Is Good

The right fix is not a smarter reviewer prompt.

The right fix is boring harness law:

  1. If a task claims changed files, verify the diff.
  2. If a reviewer reports a required-check failure, force non-zero status.
  3. If required verification commands fail, force non-zero status.
  4. Treat model summaries as evidence, not authority.

That is the boring part of agent engineering that keeps paying rent. Put the control boundary where the ground truth lives. The model can describe, reason, and propose. The harness decides whether the promised artifact exists.

Why This Matters

The workflow-lift eval exists because “more agents” is the wrong thing to measure. The real question is whether a workflow beats a matched single-controller baseline when tools, task, and scoring stay fixed.

Right now, my fixed foreman team does not beat the baseline. It loses to the baseline on a toy edit.

That is not embarrassing. It is the point of the eval. A small controlled failure is cheap evidence. It tells me exactly where to harden the system before trying larger tasks:

  • the team result contract needs to stop rewarding text-only completion
  • reviewer findings need veto power over exit_status
  • artifact checks need to run outside the model

The sharpest result from this pass is not “multi-agent bad.” It is more specific and more useful: a reviewer that cannot fail the run is not a verifier. It is a comment generator.

I want verifiers.


Source notes: workflow-lift cross-variant analysis, failure-mode taxonomy, and the raw eval artifacts under state/evals/workflow-lift/version-constant/.