OS AI Computer Use
Demo video: 2025-10-19.22.26.02.mov
Local agent for desktop automation. It currently integrates Anthropic Computer Use (Claude) but is architected to be provider‑agnostic: the LLM layer is abstracted behind LLMClient, so OpenAI Computer Use (and others) can be added with minimal changes.
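As a rough illustration of that abstraction, the sketch below shows a provider-neutral client interface with per-provider adapters. The names and signatures are simplified for illustration and are not the exact interface in the repo.

```python
# Simplified sketch of the provider-agnostic LLM layer; the real LLMClient
# interface in this repo may use different names and signatures.
from typing import Any, Protocol

class LLMClient(Protocol):
    def create_message(self, messages: list[dict[str, Any]],
                       tools: list[dict[str, Any]]) -> dict[str, Any]:
        """Send the conversation plus tool schema, return the raw provider response."""
        ...

class AnthropicClient:
    """Adapter over the Anthropic Computer Use API."""
    def create_message(self, messages, tools):
        raise NotImplementedError("wrap anthropic messages.create(...) here")

class OpenAIClient:
    """Planned adapter exposing the same surface for OpenAI Computer Use."""
    def create_message(self, messages, tools):
        raise NotImplementedError("wrap the OpenAI computer-use call here")
```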
What this project is:
- A provider‑agnostic Computer Use agent with a stable tool interface
- An OS‑agnostic execution layer using ports/drivers (macOS and Windows today)
- A CLI you can bundle into a single executable for local use
What it is not (yet):
- A remote SaaS; this is a local agent
- A finished set of drivers for every OS/desktop (Linux Wayland has limits for synthetic input)
Highlights:
- Smooth mouse movement, clicks, drag-and-drop with easing and timing controls (see the motion sketch below)
- Reliable keyboard input (robust Enter on macOS), hotkeys and hold sequences
- Screenshots (Quartz on macOS or PyAutoGUI fallback), on‑disk saving and base64 tool_result
- Detailed logs and running cost estimation per iteration and total
- Multiple chats
- Image upload
- Voice input
- AI API agnostic
See provider architecture in docs/architecture-universal-llm.md, OS ports/drivers in docs/os-architecture.md, and packaging notes in docs/ci-packaging.md.
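To give a flavor of the smooth-mouse-movement highlight, here is a simplified easing sketch using PyAutoGUI. It only illustrates the idea; the project's drivers add distance-based durations, calibration, and post-move verification.

```python
# Simplified illustration of eased cursor motion with PyAutoGUI
# (not the project's actual driver code).
import time
import pyautogui

def ease_in_out(t: float) -> float:
    """Quadratic ease-in-out: slow start, fast middle, slow stop."""
    return 2 * t * t if t < 0.5 else 1 - (-2 * t + 2) ** 2 / 2

def smooth_move(x2: int, y2: int, duration: float = 0.5, steps: int = 40) -> None:
    x1, y1 = pyautogui.position()
    for i in range(1, steps + 1):
        k = ease_in_out(i / steps)
        pyautogui.moveTo(x1 + (x2 - x1) * k, y1 + (y2 - y1) * k)
        time.sleep(duration / steps)

smooth_move(800, 450, duration=0.6)
```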
Requirements:
- macOS 13+ or Windows 10/11 (unit tests run on any OS; GUI tests require macOS or a self-hosted Windows runner)
- Python 3.12+
- Anthropic API key in ANTHROPIC_API_KEY (for now; OpenAI support is planned)
Install:
# (optional) create and activate venv
python -m venv .venv && source .venv/bin/activate
# install dependencies
make install
# (optional) install local packages in editable mode (mono-repo dev)
make dev-install

macOS permissions (required for GUI automation):
make macos-perms # opens System Settings → Privacy & Security panels
Grant permissions to Terminal/iTerm and your venv Python under Accessibility, Input Monitoring, and Screen Recording.
Run the agent (CLI):
export ANTHROPIC_API_KEY=sk-ant-...
python main.py --provider anthropic --debug --task "Open Safari, search for 'macOS automation', scroll, take a screenshot"

# 1) Open Chrome, search in Google, take a screenshot
python main.py --provider anthropic --task "Open Chrome, focus the address bar, type google.com, search for 'computer use AI', open first result, scroll down and take a screenshot"
# 2) Copy/paste workflow in a text editor
python main.py --provider anthropic --task "Open TextEdit, create a new document, type 'Hello world!', select all and copy, create another document and paste"
# 3) Window management + hotkeys
python main.py --provider anthropic --task "Open System Settings, search for 'Privacy', navigate to Privacy & Security, disable GEO"
# 4) Precise drag operations
python main.py --provider anthropic --task "In Finder, open Downloads, switch to icon view, drag the first file to Desktop"

Useful make targets:
make install # install top-level dependencies
make test # unit tests
RUN_CURSOR_TESTS=1 make itest # GUI integration tests (macOS; requires permissions)
make itest-local-keyboard # run keyboard harness
make itest-local-click # run click/drag harness

For development with backend + frontend (Flutter UI):
# (optional) create and activate venv
python -m venv .venv && source .venv/bin/activate
# install Python dependencies
make install
# install local packages in editable mode for mono-repo dev
make dev-install

# Set your API key
export ANTHROPIC_API_KEY=sk-ant-...
# (optional) enable debug mode
export OS_AI_BACKEND_DEBUG=1
# Start backend on 127.0.0.1:8765
os-ai-backend
# Or run directly via Python module
# python -m os_ai_backend.app

Backend environment variables (optional):
- OS_AI_BACKEND_HOST - host address (default: 127.0.0.1)
- OS_AI_BACKEND_PORT - port number (default: 8765)
- OS_AI_BACKEND_DEBUG - enable debug logging (default: 0)
- OS_AI_BACKEND_TOKEN - authentication token (optional)
- OS_AI_BACKEND_CORS_ORIGINS - allowed CORS origins (default: http://localhost,http://127.0.0.1)
Backend endpoints:
- GET /healthz - health check
- WS /ws - WebSocket for JSON-RPC commands
- POST /v1/files - file upload
- GET /v1/files/{file_id} - file download
- GET /metrics - metrics snapshot
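A minimal smoke test against these endpoints might look like the sketch below; the JSON-RPC method name and parameters are assumptions, so check the backend source for the real command schema.

```python
# Minimal smoke test for the local backend (default host/port assumed).
# The JSON-RPC method "agent.run_task" is hypothetical — look up the real
# method names and params in os_ai_backend before using this.
import asyncio
import json

import requests
import websockets

BASE = "http://127.0.0.1:8765"

def check_health() -> None:
    resp = requests.get(f"{BASE}/healthz", timeout=5)
    print("healthz:", resp.status_code, resp.text)

async def send_rpc() -> None:
    async with websockets.connect("ws://127.0.0.1:8765/ws") as ws:
        request = {
            "jsonrpc": "2.0",
            "id": 1,
            "method": "agent.run_task",  # hypothetical method name
            "params": {"task": "take a screenshot"},
        }
        await ws.send(json.dumps(request))
        print("response:", await ws.recv())

if __name__ == "__main__":
    check_health()
    asyncio.run(send_rpc())
```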
Frontend (Flutter):
cd frontend_flutter
# Install Flutter dependencies
flutter pub get
# Run on macOS
flutter run -d macos
# Or run on other platforms
# flutter run -d chrome # web
# flutter run -d windows # Windows

Frontend config (in code):
- Default backend WebSocket: ws://127.0.0.1:8765/ws
- Default REST base: http://127.0.0.1:8765
See frontend_flutter/README.md for more details on the Flutter app architecture and features.
Features:
- Smooth mouse motion: easing, distance-based durations
- Clicks with modifiers: modifiers: "cmd+shift" for click/down/up
- Drag control: hold_before_ms, hold_after_ms, steps, step_delay
- Keyboard input: key, hold_key; robust Enter on macOS via Quartz
- Screenshots: Quartz (macOS) or PyAutoGUI fallback; optional downscale for model display
- Logging and cost: per-iteration and total usage/cost with 429 retry logic
- OS-agnostic execution: core depends only on OS ports; drivers are loaded per OS (see docs/os-architecture.md and the sketch after the OS support list below)

OS support:
- macOS (supported):
  - Full driver set with overlay (AppKit), robust Enter (Quartz), screenshots (Quartz/PyAutoGUI), sounds (NSSound).
  - Integration tests available; requires Accessibility, Input Monitoring, Screen Recording.
  - Single-file CLI bundle via make build-macos-bundle.
- Windows (implemented, not yet integration-tested):
  - Drivers for mouse/keyboard/screen via PyAutoGUI; overlay/sound are no-op baselines.
  - Unit contract tests exist; for GUI tests use a self-hosted Windows runner (see docs/windows-integration-testing.md).
  - Single-file CLI bundle via make build-windows-bundle (build on Windows).
- Linux: not provided out‑of‑the‑box. X11 can support synthetic input (XTest), while Wayland often restricts it. Contributions welcome.
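The ports/drivers split above can be pictured as a small interface the core calls, with a per-OS implementation selected at startup. The sketch below is illustrative only; class and method names are not the project's actual API (see docs/os-architecture.md for the real interfaces).

```python
# Illustrative sketch of the ports/drivers idea (not the project's real classes).
import sys
from typing import Protocol


class MousePort(Protocol):
    """Contract the core depends on; each OS ships its own driver."""

    def move(self, x: int, y: int, duration: float = 0.3) -> None: ...
    def click(self, x: int, y: int, modifiers: str | None = None) -> None: ...


class MacMouseDriver:
    def move(self, x: int, y: int, duration: float = 0.3) -> None:
        print(f"[quartz] move to ({x}, {y}) over {duration}s")

    def click(self, x: int, y: int, modifiers: str | None = None) -> None:
        print(f"[quartz] click ({x}, {y}) modifiers={modifiers}")


class WindowsMouseDriver:
    def move(self, x: int, y: int, duration: float = 0.3) -> None:
        print(f"[pyautogui] move to ({x}, {y}) over {duration}s")

    def click(self, x: int, y: int, modifiers: str | None = None) -> None:
        print(f"[pyautogui] click ({x}, {y}) modifiers={modifiers}")


def load_mouse_driver() -> MousePort:
    """Pick a driver for the current platform; core code only sees MousePort."""
    return MacMouseDriver() if sys.platform == "darwin" else WindowsMouseDriver()
```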
Key options (partial list):
- Coordinates/calibration: COORD_X_SCALE, COORD_Y_SCALE, COORD_X_OFFSET, COORD_Y_OFFSET
- Post-move correction: POST_MOVE_VERIFY, POST_MOVE_TOLERANCE_PX, POST_MOVE_CORRECTION_DURATION
- Screenshots: SCREENSHOT_MODE (native|downscale), VIRTUAL_DISPLAY_ENABLED, VIRTUAL_DISPLAY_WIDTH_PX, VIRTUAL_DISPLAY_HEIGHT_PX, SCREENSHOT_FORMAT (PNG|JPEG), SCREENSHOT_JPEG_QUALITY
- Overlay: PREMOVE_HIGHLIGHT_ENABLED, PREMOVE_HIGHLIGHT_DEFAULT_DURATION, PREMOVE_HIGHLIGHT_RADIUS, colors
- Model/tool: MODEL_NAME, COMPUTER_TOOL_TYPE, COMPUTER_BETA_FLAG, MAX_TOKENS, ALLOW_PARALLEL_TOOL_USE
See the config file for the full list and comments.
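To make the calibration options concrete, here is one plausible reading of how the scale/offset values and the post-move tolerance could be applied. It is illustrative only; the actual logic and configuration mechanism live in the project's config and driver code.

```python
# Hypothetical illustration of how the calibration options relate model
# coordinates to screen coordinates; the names mirror the options above,
# but the real config mechanism is defined in the project's config file.
COORD_X_SCALE = 1.0
COORD_Y_SCALE = 1.0
COORD_X_OFFSET = 0
COORD_Y_OFFSET = 0
POST_MOVE_TOLERANCE_PX = 2


def model_to_screen(x: int, y: int) -> tuple[int, int]:
    """Map a model-space coordinate onto the physical screen."""
    return (
        round(x * COORD_X_SCALE + COORD_X_OFFSET),
        round(y * COORD_Y_SCALE + COORD_Y_OFFSET),
    )


def needs_correction(target: tuple[int, int], actual: tuple[int, int]) -> bool:
    """Post-move verification: re-correct if the cursor landed too far off."""
    dx, dy = actual[0] - target[0], actual[1] - target[1]
    return (dx * dx + dy * dy) ** 0.5 > POST_MOVE_TOLERANCE_PX
```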
The agent expects blocks with action and parameters:
- Mouse movement
{"action":"mouse_move","coordinate":[x,y],"coordinate_space":"auto|screen|model","duration":0.35,"tween":"linear"}
- Clicks
{"action":"left_click","coordinate":[x,y],"modifiers":"cmd+shift"}
- Key press / hold
{"action":"key","key":"cmd+l"}
{"action":"hold_key","key":"ctrl+shift+t"}
- Drag-and-drop
{
  "action":"left_click_drag",
  "start":[x1,y1],
  "end":[x2,y2],
  "modifiers":"shift",
  "hold_before_ms":80,
  "hold_after_ms":80,
  "steps":4,
  "step_delay":0.02
}
- Scroll
{"action":"scroll","coordinate":[x,y],"scroll_direction":"down|up|left|right","scroll_amount":3}
- Typing
{"action":"type","text":"Hello, world!"}
- Screenshot
{"action":"screenshot"}

Responses are returned as a list of tool_result content blocks (text/image). Screenshots are base64-encoded.
Unit tests (no real GUI):
make test

Integration (real OS tests, macOS; Windows via self-hosted runner):
export RUN_CURSOR_TESTS=1
make itest

If macOS blocks automation, tests are skipped. Grant permissions with make macos-perms and retry.
Windows integration testing options are described in docs/windows-integration-testing.md.
Recommended setup: Flutter as a pure UI, backed by the local Python service:
- Transport: WebSocket + JSON‑RPC for chat/commands, REST for files
- Streams: screenshots (JPEG/PNG), logs, events
- Example notes: docs/flutter.md
To run backend + frontend in development mode, see the Development Mode section above.
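For reference, the REST file endpoints can be exercised from Python as sketched below; the multipart field name and the response key are assumptions, so verify them against the backend code before relying on this.

```python
# Upload a file to the local backend and download it back.
# The multipart field name ("file") and the response JSON key ("file_id")
# are assumptions — verify them against os_ai_backend.
import requests

BASE = "http://127.0.0.1:8765"

with open("screenshot.png", "rb") as f:
    upload = requests.post(f"{BASE}/v1/files", files={"file": f}, timeout=10)
upload.raise_for_status()
file_id = upload.json()["file_id"]  # assumed key name

download = requests.get(f"{BASE}/v1/files/{file_id}", timeout=10)
download.raise_for_status()
with open("downloaded.png", "wb") as out:
    out.write(download.content)
```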
Note: project code and docs use English.
- Fork → feature branch → PR
- Code style: readable, explicit names, avoid deep nesting
- Tests: add unit tests and integration tests when applicable
- Before PR:
make test
RUN_CURSOR_TESTS=1 make itest # optional if GUI interactions changed
- Commit messages: clear and atomic
Architecture, packaging and testing docs:
- OS Ports & Drivers: docs/os-architecture.md
- Packaging & CI: docs/ci-packaging.md
- Windows integration testing: docs/windows-integration-testing.md
- Code style: CODE_STYLE.md
- Contributing: CONTRIBUTING.md
Packaging (single executable bundles):
- macOS: make build-macos-bundle → dist/agent_core/agent_core
- Windows: make build-windows-bundle → dist/agent_core/agent_core.exe
Apache License 2.0. Preserve NOTICE when distributing.
- See LICENSE and NOTICE at the repository root.
- Cursor/keyboard don’t work (macOS): grant permissions in System Settings → Privacy & Security (Accessibility, Input Monitoring, Screen Recording) for Terminal and current Python.
- Integration tests skipped: restart the terminal and ensure the same interpreter (which python, python -c 'import sys; print(sys.executable)').
- Screenshots empty/missing overlay: enable Screen Recording; check screenshot mode settings.
Open issues and PRs in this repository. Attribution is listed in NOTICE.