gptme can now use the mouse wheel: scroll action lands in computer-use
The scroll action fills one of the last gaps in gptme's computer-use toolkit — agents can now use the mouse wheel to navigate web pages, documents, and list-heavy UIs, not just click and type.
The scroll action fills one of the last gaps in gptme's computer-use toolkit — agents can now use the mouse wheel to navigate web pages, documents, and list-heavy UIs, not just click and type.
gptme’s computer-use tool lets an agent drive a real desktop: move the mouse, click, type, take screenshots. Until this week, one interaction was missing: the scroll wheel.
That sounds like a small gap. It isn’t.
What you can’t do without scrolling
Without a scroll wheel, an agent looking at a web page sees the top ~800 pixels and nothing else. If the button is below the fold, it’s invisible. If the setting is three pages down, the agent has no path to it except guessing at coordinates and hoping.
Real computer use involves constant scrolling — through documentation, file listings, settings panels, spreadsheets, chat histories, search results. Every UI designed after 1995 assumes a scroll wheel exists. Agents that can’t scroll are effectively blind to most of the content in any scrollable viewport.
What shipped
The scroll action (gptme/gptme#2884, merged June 14) adds mouse wheel support to the computer tool:
computer("scroll", coordinate=(512, 400), text="down") # scroll down 3 clicks
computer("scroll", coordinate=(512, 400), text="up") # scroll up
Four directions: up, down, left, right. The text field carries the direction; coordinate specifies where to position the cursor before scrolling (scroll events are position-dependent on most window systems).
Under the hood:
- Linux (X11):
xdotool click --repeat 3 4for scroll-down (button 4 = wheel down, button 5 = wheel up, 6/7 for horizontal) - macOS:
Quartz.CGEventCreateScrollWheelEventfor native scroll wheel events - CUA sandbox transport: delegates to
sandbox.mouse.scroll()when running in Anthropic’s computer-use sandbox
Where this fits
gptme’s computer-use tool has been growing steadily. The current action set:
| Action | What it does |
|---|---|
move |
Move mouse to coordinates |
click |
Click (left/right/middle) |
double_click |
Double-click |
drag |
Click-and-drag |
type |
Type text |
key |
Press a key or key combo |
scroll |
Mouse wheel (new) |
screenshot |
Capture the screen |
wait |
Pause |
The scroll action was the most requested missing primitive (tracked in gptme#216). With it, the tool covers the full set of mouse interactions an agent needs for realistic desktop automation.
Verification
342 lines added across 4 files, 112 tests passing. The implementation is straightforward — no new dependencies, no breaking changes. Linux uses the existing xdotool (already a dependency for click/type). macOS uses the existing pyobjc-framework-Quartz (already imported for other actions).
The main subtlety was getting the xdotool invocation right: the initial implementation used a per-click loop (N subprocess spawns for N scroll clicks), but Greptile flagged the extra overhead. The merged version uses xdotool click --repeat N button — single subprocess, same effect.
What’s next
Scroll is the last mouse primitive, but the higher-level orchestration loop remains future work. The “look → act → look” cycle — scroll down, take a screenshot, decide what to do next — still requires the agent to explicitly chain actions. A built-in scroll-and-observe primitive would make computer-use sessions dramatically shorter. That’s tracked in gptme#216.
For now: gptme agents can scroll. They can finally see the rest of the page.