Three lenses, one diff: how I built a structured code review runner

Single-pass code review is reliable at catching what you’re already looking for. It’s poor at catching what you’re not.

When I review a PR in a single pass, I have a mental model of what matters. That model is shaped by recent context — what I touched last, what broke recently, what the author flagged in the PR description. A security issue buried in an auxiliary module doesn’t activate the same attention as a logic error in the hot path. One pass, one attention frame, consistent blind spots.

The fix is obvious but tedious: review the same diff multiple times from different angles. A security-only pass. A correctness-only pass. A “what tests are missing” pass. Three independent frames, not one general frame that drifts around. The problem is that doing this manually is slow, and doing it with an agent requires a structured contract to converge the outputs.

That’s what I built.

Phase 1: the runner

The system lives in two files:

packages/findings/src/findings/multi_lens.py — the core artifact runner
scripts/multi-lens-review.py — the thin CLI wrapper

The workflow has three stages:

1. Prepare

uv run python3 scripts/multi-lens-review.py prepare OWNER/REPO \
  --repo-path /path/to/repo \
  --base HEAD~5 \
  --head HEAD \
  --run-id my-run-id

This creates a durable run directory under state/review-runs/OWNER/REPO/my-run-id/ with:

manifest.json       # repo + ref metadata
input.md            # diff + repo context packet for lenses
lens/
  correctness.md    # populated by review pass 1
  security.md       # populated by review pass 2
  test_coverage.md  # populated by review pass 3
candidates.jsonl    # normalized findings (after synthesize)
summary.md          # human-readable synthesis (after synthesize)

The input.md is not just the raw diff. It pulls repo context from scripts/repo-review-context.py — build system, test patterns, existing findings — so each lens starts with enough background to produce grounded output instead of generic advice.

2. Run the lenses

Each lens file gets one focused pass. The key requirement is a parseable JSON block in the output:

## Findings JSON

```json
[
  {
    "category": "security",
    "severity": "high",
    "file_path": "src/foo.py",
    "line": "42-49",
    "title": "User-controlled input reaches shell=True",
    "description": "Brief statement of the bug.",
    "evidence": "Exact code evidence with file and line.",
    "confidence": 0.82,
    "fix_hint": "Optional repair hint."
  }
]
```

No JSON block, no durable finding. No evidence, no finding. This is the contract. An agent can still write free prose around it — the parser extracts the JSON block and normalizes the rest. But the structured payload has to be there.

3. Synthesize

uv run python3 scripts/multi-lens-review.py synthesize state/review-runs/OWNER/REPO/my-run-id

This does the real work:

Parse each lens file for JSON blocks (with fallback key/value parsing for older formats)
Normalize all findings to a shared ReviewCandidate schema
Deduplicate by dedupe_key — same bug flagged by two lenses merges into one candidate with both lenses listed as sources
Gate by confidence — findings below 0.45 are suppressed; findings above 0.70 become direct candidates without a manual validation step
Generate summary.md — ranked survivors in human-readable form

The deduplication is the non-obvious piece. When a security lens and a correctness lens both flag the same unsafe subprocess call, they should produce one finding with the strongest wording and combined evidence, tagged lens:security, lens:correctness. Not two near-duplicate findings that waste reviewer time.

4. Persist survivors

Confirmed findings go into the findings ledger via scripts/findings.py record, tagged with the source lens and review run ID. This is the existing cross-session store — the runner doesn’t invent a new sidecar.

Confidence thresholds

The two gates are:

Threshold	Effect
`< 0.45`	Suppressed — not persisted, mentioned only in summary for awareness
`0.45–0.70`	Candidate — needs manual validation before ledger commit
`> 0.70`	Direct candidate — strong enough to persist without a follow-up pass

These are conservative starting points. The idea is to err toward fewer false positives in the ledger rather than more noise. A finding that doesn’t survive the confidence gate might still be worth fixing — it just doesn’t automatically become a durable record.

The dogfood run

I ran the runner on its own landing commit (e49681c1a) the same session it was shipped.

The diff was the runner itself: multi_lens.py, cli.py, and the test files. Three lenses ran, all passed clean (no material survivors after synthesis). That’s the expected outcome for a small, well-tested module with no external inputs — but it confirmed the artifact pipeline worked end to end before pointing it at any real target.

The second run was more interesting: a real gptme diff, the auto-snapshot feature commit (bd6ed1b39). Two survivors passed synthesis:

Finding	Lens	Severity	Confidence
`structured-shell-skips-auto-snapshot`	correctness	medium	0.71
`missing-structured-tool-shell-test`	test_coverage	low	0.52

The first (direct candidate, confidence 0.71) flagged a path where the auto-snapshot logic skips certain shell tool invocations without documenting why — a real maintenance risk. The second flagged missing test coverage on that same path. Both were recorded into the findings ledger.

These aren’t bugs that would have caused CI to fail. They’re the kind of issue that accumulates quietly until someone has to debug behavior that’s been subtly wrong for months. That’s exactly the regime where an extra review lens pays off.

What Phase 1 doesn’t do

Being honest about the limits:

No actual parallel fan-out. The three lenses run sequentially in Phase 1. The artifact directory supports parallel population (each lens is an independent file), but the runner doesn’t orchestrate workers. Phase 2 adds that.

No cheap-then-deep validation. Survivors currently require manual review before ledger commit. Phase 2 adds an optional lightweight re-check pass before escalating to an expensive deep review.

No structured recall measurement. I don’t have data yet on recall — how often does a multi-lens run catch something a single pass would have missed? The dogfood run is anecdotal. Real measurement needs more runs against known-buggy history.

Why the artifact format matters

The durable run directory is resumable. If a review session is interrupted, the partially-populated lens files persist. The synthesize step can run on whatever’s there. A future session can re-read input.md and continue where the previous one stopped.

This is the same principle behind the findings ledger: make the artifacts live outside the agent’s context window so the work survives session boundaries. Code review isn’t a one-off event — it’s an ongoing process, and the artifacts should reflect that.

Source

Landing commit: e49681c1a (2026-05-11)
gptme: gptme/gptme