gptme can now use the mouse wheel: scroll action lands in computer-use

gptme’s computer-use tool lets an agent drive a real desktop: move the mouse, click, type, take screenshots. Until this week, one interaction was missing: the scroll wheel.

That sounds like a small gap. It isn’t.

What you can’t do without scrolling

Without a scroll wheel, an agent looking at a web page sees the top ~800 pixels and nothing else. If the button is below the fold, it’s invisible. If the setting is three pages down, the agent has no path to it except guessing at coordinates and hoping.

Real computer use involves constant scrolling — through documentation, file listings, settings panels, spreadsheets, chat histories, search results. Every UI designed after 1995 assumes a scroll wheel exists. Agents that can’t scroll are effectively blind to most of the content in any scrollable viewport.

What shipped

The scroll action (gptme/gptme#2884, merged June 14) adds mouse wheel support to the computer tool:

computer("scroll", coordinate=(512, 400), text="down")   # scroll down 3 clicks
computer("scroll", coordinate=(512, 400), text="up")     # scroll up

Four directions: up, down, left, right. The text field carries the direction; coordinate specifies where to position the cursor before scrolling (scroll events are position-dependent on most window systems).

Under the hood:

Linux (X11): xdotool click --repeat 3 4 for scroll-down (button 4 = wheel down, button 5 = wheel up, 6/7 for horizontal)
macOS: Quartz.CGEventCreateScrollWheelEvent for native scroll wheel events
CUA sandbox transport: delegates to sandbox.mouse.scroll() when running in Anthropic’s computer-use sandbox

Where this fits

gptme’s computer-use tool has been growing steadily. The current action set:

Action	What it does
`move`	Move mouse to coordinates
`click`	Click (left/right/middle)
`double_click`	Double-click
`drag`	Click-and-drag
`type`	Type text
`key`	Press a key or key combo
`scroll`	Mouse wheel (new)
`screenshot`	Capture the screen
`wait`	Pause

The scroll action was the most requested missing primitive (tracked in gptme#216). With it, the tool covers the full set of mouse interactions an agent needs for realistic desktop automation.

Verification

342 lines added across 4 files, 112 tests passing. The implementation is straightforward — no new dependencies, no breaking changes. Linux uses the existing xdotool (already a dependency for click/type). macOS uses the existing pyobjc-framework-Quartz (already imported for other actions).

The main subtlety was getting the xdotool invocation right: the initial implementation used a per-click loop (N subprocess spawns for N scroll clicks), but Greptile flagged the extra overhead. The merged version uses xdotool click --repeat N button — single subprocess, same effect.

What’s next

Scroll is the last mouse primitive, but the higher-level orchestration loop remains future work. The “look → act → look” cycle — scroll down, take a screenshot, decide what to do next — still requires the agent to explicitly chain actions. A built-in scroll-and-observe primitive would make computer-use sessions dramatically shorter. That’s tracked in gptme#216.

For now: gptme agents can scroll. They can finally see the rest of the page.