LangGraph agent for Computer Use via VNC with Multi-Model LLM Support
A Python package that enables autonomous desktop interaction through VNC using state-of-the-art Computer Use models (Gemini 2.5 or Anthropic Claude Haiku 4.5). The agent runs a tight observe → propose → act loop, grounding actions directly from screenshots without OCR or template matching.
- Multi-model LLM support - Choose between Gemini 2.5 Computer Use or Anthropic Claude Haiku 4.5
- Direct screenshot grounding - No OCR, no template matching (per Computer Use design)
- Generic desktop automation - Works with any VNC-accessible desktop
- Secure credential management - Multi-tenant support with OS keyring, .netrc, or environment variables
- Safety-first HITL - Optional human-in-the-loop gates for risky actions
- Coordinate denormalization - Handles any screen resolution
- LangGraph orchestration - Stateless vision architecture for efficient token usage
- Structured logging - Full run artifacts with screenshots at each step
- Markdown execution reports - Human-readable reports with embedded screenshots showing what the agent saw, thought, and did
- MCP server - FastMCP 2.0 streaming server for AI agent integration
The agent implements a continuous observe-propose-act loop:
Task
↓
Observe (VNC Screenshot)
↓
Propose (LLM: Gemini or Claude)
↓
HITL Gate (if confirmation needed)
↓
Act (Execute on VNC)
↓
Loop until done or timeout
vnc-use supports multiple LLM providers through a pluggable architecture:
Provider | Model | Best For | Speed | Cost |
---|---|---|---|---|
Gemini | gemini-2.5-computer-use-preview-10-2025 |
Complex tasks, multi-step workflows | Medium | Medium |
Anthropic | claude-haiku-4-5-20251015 |
Fast responses, cost optimization | Fast | Low |
Select via environment variable:
# Use Anthropic Claude Haiku 4.5 (faster, cheaper)
export MODEL_PROVIDER=anthropic
export ANTHROPIC_API_KEY=sk-ant-...
# Use Gemini 2.5 Computer Use (default)
export MODEL_PROVIDER=gemini
export GOOGLE_API_KEY=...
Or via CLI flag:
vnc-use run --model-provider anthropic --task "..."
vnc-use run --model-provider gemini --task "..."
Both models use the same VNC backend, HITL safety, and action execution infrastructure.
Key Design Decisions:
- Stateless vision: Each turn sends only the current screenshot + text history (no screenshot accumulation)
- Normalized coordinates: Gemini returns 0-999 coords, automatically converted to pixels
- Safety gates: Optional human-in-the-loop for risky actions
- Structured logging: Every run generates detailed artifacts including markdown reports
# Clone repository
git clone git@github.com:mayflower/vnc-use.git
cd vnc-use
# Install with uv
uv sync
# Or with pip
pip install -e .
# Optional: Install OS keyring support for encrypted credential storage
pip install -e ".[keyring]"
Dependencies:
- Python 3.10+
- LangGraph, Google GenAI SDK, vncdotool, Pillow, FastMCP
- Optional:
keyring
package for OS-encrypted credential storage
# Copy the example environment file
cp .env.example .env
# Edit .env and add your Google API key
# GOOGLE_API_KEY=your_google_api_key_here
# Start both VNC desktop and MCP server
docker-compose up -d
# Check services are running
docker-compose ps
Available services:
- VNC Desktop (web): http://localhost:6901 (password: vncpassword)
- VNC Desktop (VNC): localhost:5901
- MCP Server: http://localhost:8001/mcp
Using the MCP Server:
# The MCP server is now running at http://localhost:8001/mcp
# Use any MCP client to connect and call the execute_vnc_task tool
Using Python API (local):
# If running locally without Docker, set API key
export GOOGLE_API_KEY=your_google_api_key
CLI:
uv run vnc-use run --task "Open a browser and search for LangGraph"
Python API:
from vnc_use import VncUseAgent
# First, configure credentials (one-time setup):
# vnc-use-credentials set localhost --server localhost::5901 --password vncpassword
# Then use the agent (credentials looked up automatically):
agent = VncUseAgent(
vnc_server="localhost::5901",
vnc_password="vncpassword", # Or omit if using credential store
step_limit=40,
seconds_timeout=300,
hitl_mode=True, # Enable safety confirmations
)
result = agent.run("Open the browser and navigate to google.com")
print(f"Success: {result['success']}")
print(f"Artifacts: {result['run_dir']}")
MCP Server:
# Docker MCP server is pre-configured with vnc-desktop credentials
# For custom servers, configure credentials first:
# vnc-use-credentials set my-vnc --server my-vnc.example.com::5901
import asyncio
from fastmcp import Client
async def run_task():
async with Client("http://localhost:8001/mcp") as client:
result = await client.call_tool(
"execute_vnc_task",
{
"hostname": "vnc-desktop", # Uses pre-configured Docker credentials
"task": "Open browser and search for LangGraph",
}
)
print(result.content[0].text)
asyncio.run(run_task())
The agent supports these Computer Use actions (mapped to VNC operations):
click_at(x, y)
- Click at normalized coordinatesdouble_click_at(x, y)
- Double-click at coordinates (for launching apps)hover_at(x, y)
- Hover at coordinatestype_text_at(x, y, text, press_enter?, clear_before_typing?)
- Type textkey_combination(keys)
- Press keyboard shortcuts (e.g., "ctrl+a")scroll_document(direction, magnitude)
- Scroll pagescroll_at(x, y, direction, magnitude)
- Scroll at locationdrag_and_drop(x, y, destination_x, destination_y)
- Drag and dropwait_5_seconds()
- Wait 5 seconds (useful for page loads)
The following browser-specific actions are excluded by default because they require direct browser API access and cannot be reliably implemented via VNC mouse/keyboard simulation:
open_web_browser
- Use generic click/type actions to launch browsers from desktopnavigate
- Click URL bar and type URLs insteadgo_back
,go_forward
- Click browser navigation buttons insteadsearch
- Type in search boxes instead
You can override these exclusions by passing excluded_actions=[]
when initializing the agent, though browser tasks are more reliably accomplished using the generic UI actions.
All coordinates from Gemini are normalized to 0-999 (1000x1000 reference grid) and automatically converted to pixels based on the current screenshot dimensions.
VNC server credentials are stored securely using a multi-tier credential store system. Credentials are never passed as tool parameters to avoid exposing them to LLMs.
The system tries multiple storage backends in order (most secure first):
-
OS Keyring (encrypted, recommended for production)
- macOS: Keychain
- Windows: Credential Locker
- Linux: Secret Service / GNOME Keyring
- Requires:
pip install vnc-use[keyring]
-
~/.vnc_credentials (Unix .netrc format)
- Simple text file with
chmod 600
permissions - Standard Unix pattern, works everywhere
- Format:
machine hostname\nlogin server\npassword secret
- Simple text file with
-
Environment Variables (fallback for single-tenant)
VNC_SERVER
andVNC_PASSWORD
- Useful for Docker/testing
- Only supports one VNC server
Set credentials for a VNC server:
# Password will be prompted (recommended)
vnc-use-credentials set vnc-prod --server prod.example.com::5901
# Or specify password inline (⚠ visible in shell history)
vnc-use-credentials set vnc-desktop --server vnc-desktop::5901 --password vncpassword
List configured servers:
vnc-use-credentials list
Get credentials:
# Show server (password masked)
vnc-use-credentials get vnc-prod
# Show password in plain text
vnc-use-credentials get vnc-prod --show-password
Delete credentials:
vnc-use-credentials delete vnc-prod
Once credentials are stored, refer to VNC servers by hostname only:
from vnc_use import VncUseAgent
# Credentials looked up automatically by hostname
agent = VncUseAgent(vnc_server="vnc-prod")
result = agent.run("Open browser and go to example.com")
For the MCP server, pass only the hostname:
result = await client.call_tool(
"execute_vnc_task",
{
"hostname": "vnc-desktop", # Credentials from server-side store
"task": "Open browser...",
}
)
- ✓ Passwords never in LLM conversation logs
- ✓ Passwords never in tool parameters
- ✓ OS-level encryption (when using keyring)
- ✓ Multi-tenant support (hundreds of VNC servers)
- ✓ Credential rotation without code changes
The VNC Computer Use agent can be run as a Model Context Protocol (MCP) server, enabling integration with MCP clients and providing streaming progress updates, observations, and screenshots during execution.
1. Start the services:
# Configure your API key
cp .env.example .env
# Edit .env and set GOOGLE_API_KEY=your_google_api_key
# Start VNC desktop + MCP server
docker-compose up -d
# Verify services are running
docker-compose ps
# Should show:
# vnc-use-test-desktop - VNC desktop (ports 5901, 6901)
# vnc-use-mcp-server - MCP server (port 8001)
2. Test the MCP server:
# View VNC desktop in browser
open http://localhost:6901 # Password: vncpassword
# MCP server endpoint
curl http://localhost:8001/mcp
3. Use from Python client:
import asyncio
from fastmcp import Client
async def main():
async with Client("http://localhost:8001/mcp") as client:
result = await client.call_tool(
"execute_vnc_task",
{
"hostname": "vnc-desktop", # Credentials from server-side store
"task": "Open the browser and go to example.com",
"step_limit": 20,
}
)
print(f"Success: {result.content[0].text}")
asyncio.run(main())
Note: The Docker MCP server is pre-configured with credentials for vnc-desktop
from environment variables. For production use with multiple VNC servers, use the credential management CLI.
With Docker (Recommended):
# Start all services (VNC + MCP server)
docker-compose up -d
# MCP server available at http://localhost:8001/mcp
# VNC desktop accessible at vnc-desktop::5901 (inside Docker network)
# View logs
docker-compose logs -f mcp-server
# Stop services
docker-compose down
Without Docker (Local VNC):
# Start your own VNC server first, then:
export GOOGLE_API_KEY=your_google_api_key
export MCP_HOST=127.0.0.1 # localhost only (default)
export MCP_PORT=8001 # default port
uv run vnc-use-mcp
# MCP server available at http://localhost:8001/mcp
# Connect to your VNC server at localhost::5901
The server exposes a single tool for executing VNC tasks.
Parameters:
hostname
(required): VNC server hostname for credential lookup (e.g., "vnc-desktop", "vnc-prod")task
(required): Task description to executestep_limit
(optional): Maximum steps (default: 40)timeout
(optional): Timeout in seconds (default: 300)
Security Note: Credentials are looked up from the server-side credential store using the hostname. Never pass passwords as parameters - they would be exposed to the LLM.
Returns:
{
"success": true,
"run_id": "20251010_153334_96982a39",
"run_dir": "runs/20251010_153334_96982a39",
"steps": 12,
"error": null
}
The MCP server streams real-time updates during execution:
- Progress notifications: Step count and current action
- Model observations: What the agent sees and thinks
- Compressed screenshots: 256px width images at each step
- Action results: Execution status of each action
Basic Example - Browser Navigation:
import asyncio
import json
from fastmcp import Client
async def browse_website():
"""Navigate to a website and extract information."""
async with Client("http://localhost:8001/mcp") as client:
# Discover available tools
tools = await client.list_tools()
print(f"Available: {[t.name for t in tools]}")
# Execute task
result = await client.call_tool(
"execute_vnc_task",
{
"hostname": "vnc-desktop",
"task": "Open browser and go to python.org. Tell me the latest Python version shown on the homepage.",
"step_limit": 20,
"timeout": 120,
}
)
# Parse result
data = json.loads(result.content[0].text)
print(f"\n{'='*70}")
print(f"Success: {data['success']}")
print(f"Steps taken: {data['steps']}")
print(f"Run directory: {data['run_dir']}")
if data.get('error'):
print(f"Error: {data['error']}")
print(f"{'='*70}")
return data
asyncio.run(browse_website())
Advanced Example - Web Scraping:
async def scrape_news():
"""Extract news headlines from a website."""
async with Client("http://localhost:8001/mcp") as client:
task = """
Open a web browser and navigate to a tech news site.
Find the top 3 news headlines on the page.
Report the headlines as a numbered list.
"""
result = await client.call_tool(
"execute_vnc_task",
{
"hostname": "vnc-desktop",
"task": task,
"step_limit": 30,
}
)
data = json.loads(result.content[0].text)
# Check screenshots
if data['success']:
print(f"✓ Task completed in {data['steps']} steps")
print(f"Check execution report:")
print(f" cat {data['run_dir']}/EXECUTION_REPORT.md")
asyncio.run(scrape_news())
Credential Configuration:
The Docker MCP server is pre-configured with credentials for vnc-desktop
via environment variables. The credentials are automatically set up when the container starts.
For multiple VNC servers in production:
- Configure credentials on the MCP server host using
vnc-use-credentials set
- Pass only the
hostname
parameter - credentials are looked up server-side - Never pass passwords in tool parameters - they would be exposed to the LLM
Service not starting:
# Check Docker services status
docker-compose ps
# View logs
docker-compose logs mcp-server
docker-compose logs vnc-desktop
# Restart services
docker-compose restart
Connection refused:
# Verify MCP server is listening
curl http://localhost:8001/mcp
# Check port is not already in use
lsof -i :8001
# Try rebuilding containers
docker-compose down
docker-compose up -d --build
Task fails immediately:
- Check GOOGLE_API_KEY is set correctly in
.env
- Verify VNC desktop is running:
docker-compose ps
- Test VNC access in browser: http://localhost:6901
- Check MCP server logs:
docker-compose logs mcp-server
Desktop state persists between runs:
# Restart VNC desktop for clean state
docker-compose restart vnc-desktop
sleep 10 # Wait for desktop to be ready
Accessing screenshots and reports:
# Screenshots are saved in the runs/ directory (volume-mounted from container)
ls runs/
cat runs/LATEST_RUN_ID/EXECUTION_REPORT.md
- Default binding: The server binds to
127.0.0.1
(localhost only) by default - Network exposure: Set
MCP_HOST=0.0.0.0
to expose externally (use with caution) - No authentication: The server does not implement authentication (add reverse proxy if needed)
- API keys: GOOGLE_API_KEY is required and should be kept secure
- VNC password: Default is
vncpassword
- change in docker-compose.yml for production
The agent integrates Gemini's safety decision system with MCP's elicitation mechanism to enable human approval for risky actions:
How it works:
- Gemini detection: When Gemini Computer Use model marks an action with
safety_decision.action = "require_confirmation"
, the agent pauses execution - MCP elicitation: The MCP server uses FastMCP's
ctx.elicit()
to request user approval - User decision: MCP client (Claude Desktop, IDE) prompts user to approve/decline/cancel
- Execution continues: Only proceeds if user explicitly approves
HITL is enabled by default in MCP mode for safety. The system works at two levels:
- Gemini-level safety: Model detects risky operations (e.g., system commands, destructive actions)
- MCP protocol-level: Clients should implement additional approval UI per MCP specification
Example risky actions that trigger HITL:
- System-wide keyboard shortcuts (Ctrl+Alt+Delete)
- Actions marked as sensitive by Gemini's safety system
- Operations requiring explicit confirmation per model's judgment
For CLI usage, HITL uses LangGraph interrupts instead of MCP elicitation. Disable with --no-hitl
flag if needed.
VncUseAgent(
vnc_server="localhost::5901", # VNC server address
vnc_password=None, # VNC password (optional, see note below)
screen_size=(1440, 900), # Default size (auto-detected)
excluded_actions=None, # List of actions to exclude
step_limit=40, # Max steps (guardrail)
seconds_timeout=300, # Wall-clock timeout
hitl_mode=True, # Enable safety gates
api_key=None, # Google API key (or use env var)
)
Note on credentials: For production use, configure credentials using the credential management system:
vnc-use-credentials set localhost --server localhost::5901 --password your_password
Then the agent can look up credentials automatically. Direct password parameters are supported for backward compatibility and simple use cases.
vnc-use run \
--vnc localhost::5901 \
--password vncpassword \
--task "Your task here" \
--step-limit 50 \
--timeout 600 \
--no-hitl \
--excluded-actions drag_and_drop \
--verbose
Each run creates a runs/<run_id>/
directory containing:
EXECUTION_REPORT.md
- Comprehensive markdown report with embedded screenshots, model observations, and execution timelinestep_000_initial.png
- Initial screenshotstep_NNN_after.png
- Screenshots after each actionaction_history.txt
- Text log of all actions and resultsmetadata.json
- Run metadata (task, duration, status, final state)
The EXECUTION_REPORT.md
file provides a human-readable execution trace showing:
- Task description and run statistics
- Initial screenshot
- For each step:
- Model's observations and reasoning
- Proposed actions
- Executed action with parameters
- Success/error status
- Screenshot showing the result
- Summary with total steps, success status, and error messages (if any)
This makes it easy to understand what the agent saw, what it tried to do, and why it made certain decisions.
# VNC backend tests (with Docker VNC)
docker-compose up -d
uv run python tests/test_vnc_backend.py
# Gemini wrapper tests (no API needed)
uv run python tests/test_gemini_wrapper.py
# With real API
GOOGLE_API_KEY=key uv run python tests/test_gemini_wrapper.py --with-api
# Requires GOOGLE_API_KEY and running VNC
docker-compose up -d
# Simple desktop task
GOOGLE_API_KEY=key uv run python tests/test_e2e.py --test simple
# Browser task - Visit Mayflower blog and extract headlines
GOOGLE_API_KEY=key uv run python tests/test_browser_search.py
# Shell command task - Use terminal to check disk space
GOOGLE_API_KEY=key uv run python tests/test_shell_commands.py --test df
# Memory usage check
GOOGLE_API_KEY=key uv run python tests/test_shell_commands.py --test free
# MCP HTTP test - Get heise.de news via MCP server
python tests/test_mcp_http_heise.py
# Install dev dependencies
uv sync --extra dev
# Run all tests
uv run pytest
# Run specific test file
uv run pytest tests/test_vnc_backend.py
vnc-use/
├── src/vnc_use/
│ ├── __init__.py # Public API
│ ├── agent.py # LangGraph agent
│ ├── types.py # Type definitions
│ ├── safety.py # HITL handling
│ ├── logging_utils.py # Run artifacts
│ ├── cli.py # CLI entrypoint
│ ├── mcp_server.py # MCP server implementation
│ ├── mcp_cli.py # MCP server CLI entrypoint
│ ├── backends/
│ │ └── vnc.py # VNC controller
│ └── planners/
│ └── gemini.py # Gemini wrapper
├── tests/
│ ├── test_vnc_backend.py # VNC tests
│ ├── test_gemini_wrapper.py # Gemini tests
│ ├── test_e2e.py # E2E tests
│ ├── test_browser_search.py # Browser automation tests (Mayflower blog)
│ ├── test_shell_commands.py # Shell command integration tests (df, free)
│ ├── test_mcp_server.py # MCP server tests
│ └── test_mcp_http_heise.py # MCP HTTP integration test (heise.de news)
├── docker-compose.yml # Test VNC desktop
├── pyproject.toml # Package configuration
├── LICENSE # MIT license
└── README.md # This file
- Python 3.10+
- Google API key (Gemini API)
- VNC server (or use docker-compose)
- Browser-optimized - The Computer Use model is tuned for browser workflows; native desktop apps may need custom functions
- Sequential execution - Function calls are executed one-by-one (parallelism possible future enhancement)
- HITL not interactive - Current implementation logs but doesn't block on user input (TODO)
- No OCR fallback - Purely vision-based grounding (by design)
- Model:
gemini-2.5-computer-use-preview-10-2025
- Tool: Computer Use with
ENVIRONMENT_BROWSER
- Coordinate system: Normalized 0-999 grid
- Response format: FunctionCalls with PNG screenshots
For detailed testing instructions, see TESTING.md
Contributions are welcome! Please feel free to submit a Pull Request.
MIT License - see LICENSE file for details.