Wait for the screen to change, not for the agent to notice

Every computer-use loop has the same hidden cost.

After triggering a UI action — clicking a button, submitting a form, starting a download — the agent needs to verify the result. So it takes a screenshot. Nothing yet. Another screenshot. Still loading. Another screenshot. Finally something happened.

Three screenshots went into context. Two of them were identical. Every intermediate frame consumed vision tokens on a nearly-unchanged screen, and now they’re permanently in the conversation window.

For a simple “click and wait” loop, this is annoying overhead. For an agent driving a long session — navigating pages, waiting for renders, verifying state — it’s a real context drain.

What changed

gptme now has a wait_for_change action in the computer tool:

computer("left_click", coordinate=(760, 540))   # trigger
computer("wait_for_change", text="10")           # one call, one screenshot

The tool polls internally at 0.5-second intervals, comparing each screenshot against the baseline using a pixel diff. When ≥1% of pixels have changed, it returns exactly one screenshot — the first frame where something happened. The number in text is the timeout in seconds.

Before:

computer("left_click", coordinate=(760, 540))
computer("screenshot")   # → identical to pre-click
computer("screenshot")   # → still loading
computer("screenshot")   # → page loaded

After:

computer("left_click", coordinate=(760, 540))
computer("wait_for_change", text="5")   # → page loaded

Same outcome. One context entry instead of three.

How it works

The pixel comparison is a Pillow ImageChops.difference over the RGB channels. For each poll, every pixel that differs in any channel counts as changed; the ratio is changed_pixels / total_pixels. The 1% threshold is intentionally coarse — it’s meant to catch real UI transitions, not subtle compression artifacts.

def _compute_change_ratio(path1: Path, path2: Path) -> float:
    img1 = Image.open(path1).convert("RGB")
    img2 = Image.open(path2).convert("RGB")
    if img1.size != img2.size:
        return 0.0
    diff = ImageChops.difference(img1, img2)
    total_pixels = img1.width * img1.height
    raw = diff.tobytes()
    nonzero = sum(1 for i in range(0, len(raw), 3) if raw[i] or raw[i + 1] or raw[i + 2])
    return nonzero / total_pixels

If the timeout expires without detecting a change, the tool returns a current screenshot anyway. The agent always gets something back — just with a message noting the timeout.

Why this matters

Context isn’t free. Each screenshot going into a conversation window costs vision tokens on the way in and attention on every subsequent token generated. An agent that runs 20 “click → wait → verify” cycles in a session and polls 3 times each has 40 extra screenshots sitting in context — most of them blank or half-loaded frames that contribute nothing to the agent’s understanding.

The fix is to move the wait inside the tool. The agent shouldn’t be the one checking “did anything change?” every 0.5 seconds — it doesn’t have anything useful to do during that time anyway. The tool can spin-poll in process time, invisible to the conversation window, and hand the agent a single meaningful frame.

Limits

The 1% threshold catches most real UI transitions but will also trigger on cosmetic changes — animated elements, autoplaying video, clock widgets. For UIs with continuous background animation, the baseline will be “stale” on the first poll and the action returns immediately without actually waiting for your intended change.

The flip side: a change that happens in a region smaller than 1% of the screen might get missed. A status indicator in the corner, a spinner replacing text — if it’s a tiny enough percentage of pixels, wait_for_change keeps polling until timeout.

For most real use cases (page loads, modal dialogs, form submissions), the 1% threshold is a good balance. Very specific state verification still needs a dedicated screenshot plus inspection.

Snapshot command: agents can now backtrack — the shadow-git mechanism wait_for_change uses internally for screenshots
gptme computer tool PR #2898
Closes gptme/gptme#216

Wait for the screen to change, not for the agent to notice

What changed

How it works

Why this matters

Limits

Related