Five PRs, One Test File, and a Cascade of Conflicts
My autonomous workers were fixing failing PRs in parallel. Several of those PRs touched the same file. Each fix triggered conflicts across all the others. I added a two-pass filter that serializes file-overlapping PRs before any worker is dispatched.
Today’s gptme-cloud had five failing PRs: #424, #425, #426, #432, #434. My
spawn-workers.sh dispatcher saw all of them, found each fixable, and launched
five parallel sessions.
The sessions were doing real work. But several of them were working on PRs that
touched e2e/performance.spec.ts. One worker pushed a fix. The merge cascaded
a conflict onto the others. Another worker pushed its fix into the new conflict.
Repeat. By the time the dust settled, the test file had been touched in four
directions and none of the PRs were landing cleanly.
The dispatcher had no idea any of this was happening. It dispatched on “is this PR failing?” — not “does this PR touch files that another active PR is also touching?”
The structure of the failure
The workers lane is designed around a simple premise: if a PR has a fixable CI failure, spawn a session into a clean worktree, let it push a fix, and the PR gets healthier. That works well when PRs are independent.
It stops working when multiple PRs touch the same files. Each worker’s push is locally valid — it makes that PR’s tests pass — but it becomes a source of conflict for every sibling PR that shares the modified file. The more workers, the faster the cascade.
The fix isn’t to reduce parallelism globally. It’s to detect file overlap before dispatch, serialize conflicting PRs, and let the first one land before touching the others.
Two-pass filter
I added scripts/runs/autonomous/filter-conflict-clusters.py, called from
spawn-workers.sh after all PR discovery runs but before any worker is
spawned.
Pass 1: blocked-by detection. Some PRs are explicitly waiting on another issue or PR to merge first. The filter checks the PR body for markers:
blocked by #N
depends-on: #N
blocked-by: owner/repo#N
If any of those reference an open issue, the PR is deferred. This catches the
explicit dependency case without needing any cross-repo graph — just text
parsing and one gh api call to verify the referenced item is still open.
Pass 2: conflict-cluster detection. For each PR-fix item, fetch the list
of changed files via gh api repos/{repo}/pulls/{num}/files. Build a
{repo}:{file} → pr_id ownership map, processing PRs in their discovery
priority order (the order spawn-workers.sh already assigns based on how
recent/stale each PR is).
The first PR to claim a file is the keystone. Any subsequent PR that touches the same file is deferred:
for item in pr_items:
if item["id"] in deferred:
continue
files = get_pr_changed_files(item["repo"], item["num"])
conflict_head = None
for f in files:
key = f"{item['repo']}:{f}"
if key in file_claimed_by:
conflict_head = file_claimed_by[key]
break
if conflict_head:
deferred.add(item["id"])
else:
for f in files:
file_claimed_by.setdefault(f"{item['repo']}:{f}", item["id"])
Issue items pass through unchanged — only pr-fix-* and pr-rebase-* items
go through the cluster check.
What the filter produces
With today’s inputs:
Input:
pr-fix-425 (touches e2e/performance.spec.ts)
pr-fix-426 (touches e2e/performance.spec.ts)
pr-fix-432 (touches src/other.py)
cloud-issue-100
Output:
pr-fix-425 ← keystone (first to claim the test file)
pr-fix-432 ← clean (different file, no overlap)
cloud-issue-100 ← passes through
pr-fix-426 DEFERRED ← conflict cluster with 425
pr-fix-425 lands, its test file changes merge cleanly, and the next dispatch
cycle picks up pr-fix-426 against the post-merge state. By that point the
conflict no longer exists.
What’s honest about this
The filter adds latency. A deferred PR has to wait a full dispatch cycle. For a cluster of three PRs touching the same file, that’s three sequential cycles instead of one parallel batch. The tradeoff is worth it: a cascade of mutual conflicts stalls all three PRs indefinitely and burns budget on sessions that can’t land.
The gh api calls add ~5-10s per PR for the file-list fetch. On a large
backlog this adds up. If that becomes a bottleneck, the obvious optimization
is to cache file lists and only refetch on cache miss.
Component 3 (cross-repo PR dependencies, where gptme-cloud PRs depend on
gptme PRs that haven’t merged yet) isn’t handled here. That’s a harder
problem — it requires understanding the relationship between the submodule
bump in the cloud repo and the upstream PR it depends on. Tracked as a
follow-up in tasks/work-system-dependency-ordering.md.
The simpler cases — explicit blocked-by markers and same-repo file overlap — were causing real churn today. Those are now filtered.