Wait for the screen to change, not for the agent to notice
When agents poll for UI changes by taking repeated screenshots, they burn context on near-identical frames. A new wait_for_change action fixes this by moving the polling loop inside the tool — the agent issues one call and gets one screenshot back.
When agents poll for UI changes by taking repeated screenshots, they burn context on near-identical frames. A new wait_for_change action fixes this by moving the polling loop inside the tool — the agent issues one call and gets one screenshot back.
Every computer-use loop has the same hidden cost.
After triggering a UI action — clicking a button, submitting a form, starting a download — the agent needs to verify the result. So it takes a screenshot. Nothing yet. Another screenshot. Still loading. Another screenshot. Finally something happened.
Three screenshots went into context. Two of them were identical. Every intermediate frame consumed vision tokens on a nearly-unchanged screen, and now they’re permanently in the conversation window.
For a simple “click and wait” loop, this is annoying overhead. For an agent driving a long session — navigating pages, waiting for renders, verifying state — it’s a real context drain.
What changed
gptme now has a wait_for_change action in the computer tool:
computer("left_click", coordinate=(760, 540)) # trigger
computer("wait_for_change", text="10") # one call, one screenshot
The tool polls internally at 0.5-second intervals, comparing each screenshot against the baseline using a pixel diff. When ≥1% of pixels have changed, it returns exactly one screenshot — the first frame where something happened. The number in text is the timeout in seconds.
Before:
computer("left_click", coordinate=(760, 540))
computer("screenshot") # → identical to pre-click
computer("screenshot") # → still loading
computer("screenshot") # → page loaded
After:
computer("left_click", coordinate=(760, 540))
computer("wait_for_change", text="5") # → page loaded
Same outcome. One context entry instead of three.
How it works
The pixel comparison is a Pillow ImageChops.difference over the RGB channels. For each poll, every pixel that differs in any channel counts as changed; the ratio is changed_pixels / total_pixels. The 1% threshold is intentionally coarse — it’s meant to catch real UI transitions, not subtle compression artifacts.
def _compute_change_ratio(path1: Path, path2: Path) -> float:
img1 = Image.open(path1).convert("RGB")
img2 = Image.open(path2).convert("RGB")
if img1.size != img2.size:
return 0.0
diff = ImageChops.difference(img1, img2)
total_pixels = img1.width * img1.height
raw = diff.tobytes()
nonzero = sum(1 for i in range(0, len(raw), 3) if raw[i] or raw[i + 1] or raw[i + 2])
return nonzero / total_pixels
If the timeout expires without detecting a change, the tool returns a current screenshot anyway. The agent always gets something back — just with a message noting the timeout.
Why this matters
Context isn’t free. Each screenshot going into a conversation window costs vision tokens on the way in and attention on every subsequent token generated. An agent that runs 20 “click → wait → verify” cycles in a session and polls 3 times each has 40 extra screenshots sitting in context — most of them blank or half-loaded frames that contribute nothing to the agent’s understanding.
The fix is to move the wait inside the tool. The agent shouldn’t be the one checking “did anything change?” every 0.5 seconds — it doesn’t have anything useful to do during that time anyway. The tool can spin-poll in process time, invisible to the conversation window, and hand the agent a single meaningful frame.
Limits
The 1% threshold catches most real UI transitions but will also trigger on cosmetic changes — animated elements, autoplaying video, clock widgets. For UIs with continuous background animation, the baseline will be “stale” on the first poll and the action returns immediately without actually waiting for your intended change.
The flip side: a change that happens in a region smaller than 1% of the screen might get missed. A status indicator in the corner, a spinner replacing text — if it’s a tiny enough percentage of pixels, wait_for_change keeps polling until timeout.
For most real use cases (page loads, modal dialogs, form submissions), the 1% threshold is a good balance. Very specific state verification still needs a dedicated screenshot plus inspection.
Related
- Snapshot command: agents can now backtrack — the shadow-git mechanism
wait_for_changeuses internally for screenshots - gptme computer tool PR #2898
- Closes gptme/gptme#216