The remote-desktop analogy
Imagine handing your laptop to a remote assistant over a screenshare: they see what you see, they move your cursor, they type. They cannot magically read the database — they have to navigate the same UI you do.
Claude computer use is exactly that. The model is given screenshots, can request mouse moves, clicks, keystrokes, and scrolling. It treats your screen as a UI it has to operate, just like a person would. The big leap: any app with a GUI suddenly has an API.
How the loop works
- The host application takes a screenshot of the desktop or a target window.
- Claude receives the screenshot + a goal ("file an expense report for this receipt").
- Claude responds with a tool call:
mouse_move(x, y),left_click(),type("invoice 402"),key("Tab"), orscreenshot()to see the new state. - The host executes the action and snaps a new screenshot.
- Loop until the goal is met or a stop condition fires.
Underneath, this is just tool use with a vision-capable model and a small set of GUI-driving tools. The magic is the loop's reliability and the model's ability to read messy real UIs.
What it unlocks
- Browser automation without selectors — no fragile XPath, no element IDs. The model sees the page and clicks the obvious "Submit" button.
- Legacy GUI integration — apps with no API, only a Win32 / Mac UI.
- Cross-app workflows — pull data from a desktop app, paste into a web form, attach a file, send.
- End-to-end QA — drive the actual UI a user would, validate by screenshot.
The honest limitations
- Latency. Each loop step is a screenshot + LLM call + action. Real workflows are seconds-per-step, minutes-per-task.
- Reliability. Modern UIs have ads, modals, layout shifts. The model sometimes clicks the wrong thing. Build retries and verification screenshots into every flow.
- Cost. Vision tokens are not cheap; long sessions add up. Worth it for high-value automation, painful for trivial scripts.
- Safety surface. A model with mouse and keyboard access can do anything a user can. Including "anything bad."
How to deploy it without burning the house down
- Sandbox the environment. Run in a VM, container, or dedicated user. Never on your live workstation.
- Limit network and filesystem access to what the task needs.
- Confirmation gates for destructive actions — sending money, deleting, "publish."
- Audit trail — log every screenshot + action. Lets you reconstruct any failure or attack.
- Rate limit screenshots and actions so a runaway loop is bounded.
- Watch for prompt injection in what's on the screen — a malicious page can show "ignore previous instructions" text the model reads.
When to reach for it
- Useful: filling forms across many sites, bulk data entry into legacy software, end-to-end UI testing, accessibility automation, demos.
- Less useful: when a real API exists (use the API), or when speed and reliability matter more than flexibility.
What to expect quality-wise
- High-end frontier models with vision are good enough for most well-known SaaS UIs.
- Custom internal apps require some prompting work — show the model the layout, name the parts.
- Constantly changing pages (ads, banners) need defensive prompts — "ignore promotional banners; focus on the form."
- Treat it as a junior intern with full computer access — capable, fast, occasionally needs a sanity check.