This directory contains the CoAct-1 implementation — a replica (not the exact code) of a closed-source computer-use agent architecture that enables computer control through both vision (the eye) and code (the programmer). At its core, CoAct-1 is built on top of the CUA (Computer Use Agent) framework, which provides the foundational abstraction layers for agents, computer interaction, and core utilities. The implementation leverages and modifies the original agent, computer, and core directories from the CUA repository, adapting them to fit the CoAct hierarchical architecture.
CoAct-1 implements a hierarchical multi-agent system inspired by the paper "CoAct: A Multi-Agent System for Cooperative Computer Control". The system orchestrates three specialized agents to execute computer automation tasks through coordinated action:
- Orchestrator: Strategic task decomposition and delegation
- Programmer: Shell and Python command execution
- GUI Operator: Vision-based graphical user interface interactions
- Model: `gemini/gemini-2.5-flash`
- Role: Decomposes user tasks into minimal executable subtasks
- Strategy: Prefer the Programmer agent for efficiency; use the GUI Operator only for visual interactions
- Delegation Logic: Break tasks into 5-10 second executable units
- Model: `gemini/gemini-2.5-flash`
- Role: Executes code and system-level operations
- Tools (sketched below):
  - `run_command()`: Execute shell commands with output capture
  - `run_command_in_background()`: Launch GUI applications asynchronously
  - File system operations (`list_dir`, `read_file`, `write_file`)
  - Virtual environment commands (`venv_cmd`)
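
The actual tool implementations live in `coact_1.py` and execute inside the Docker VM through the CUA computer interface. As a rough illustration only, here is a local sketch of what the first two tools might look like (class and method bodies are hypothetical, not the repo's code):

```python
import asyncio

class ProgrammerToolsSketch:
    """Hypothetical shapes for the Programmer's shell tools. The real
    versions run inside the Docker VM via the CUA computer interface,
    not on the host like this sketch does."""

    async def run_command(self, command: str, timeout: float = 60.0) -> str:
        """Run a shell command and capture combined stdout/stderr."""
        proc = await asyncio.create_subprocess_shell(
            command,
            stdout=asyncio.subprocess.PIPE,
            stderr=asyncio.subprocess.STDOUT,
        )
        try:
            output, _ = await asyncio.wait_for(proc.communicate(), timeout)
        except asyncio.TimeoutError:
            proc.kill()
            return f"Command timed out after {timeout}s: {command}"
        return output.decode(errors="replace")

    async def run_command_in_background(self, command: str) -> str:
        """Launch a GUI application without blocking on its exit."""
        await asyncio.create_subprocess_shell(command)
        return f"Started in background: {command}"
```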
- Model: `huggingface-local/OpenGVLab/InternVL3_5-4B+gemini/gemini-2.5-flash`
- Role: Vision-based GUI manipulation and visual element interaction
- Capabilities: Mouse/keyboard simulation, screenshot analysis, OCR text detection, element interaction
- OCR Features: Automatic text element detection, click-by-text functionality, confidence scoring
- Efficiency Principle: Minimize vision model calls, prefer keyboard shortcuts over mouse clicks, leverage OCR for precise text interactions
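
As an illustration of the click-by-text idea: OCR the screenshot, find the target string, and click the center of its bounding box. The sketch below uses `pytesseract` purely as a stand-in OCR backend; the actual implementation's OCR engine and API may differ:

```python
# Hypothetical click-by-text sketch: OCR the screenshot, locate the target
# string, and return the center of its bounding box as click coordinates.
import pytesseract
from PIL import Image

def find_click_target(screenshot_path: str, text: str, min_conf: float = 60.0):
    image = Image.open(screenshot_path)
    data = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)
    for i, word in enumerate(data["text"]):
        conf = float(data["conf"][i])
        if word.strip() == text and conf >= min_conf:
            # Center of the detected word's bounding box
            x = data["left"][i] + data["width"][i] // 2
            y = data["top"][i] + data["height"][i] // 2
            return x, y, conf
    return None  # fall back to the vision grounding model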
- Python: 3.12+
- Docker: Running Docker Desktop (Windows/macOS) or Docker Engine (Linux)
- Conda/Miniconda: For environment management
- Google API Key: For Gemini models (`GOOGLE_API_KEY` environment variable)
- RAM: 8GB minimum, 16GB recommended
- GPU: Optional but recommended for local vision models (CUDA support)
- Storage: ~5GB for Docker images and models
```bash
# Create and activate the conda environment
conda create -n coact1 python=3.12 -y
conda activate coact1

# Navigate to the coact_implementation directory
cd coact_implementation

# Install Python dependencies
pip install -r requirements.txt
```
```bash
# Set your Google API key
export GOOGLE_API_KEY="your-api-key-here"

# Verify the key is set
echo $GOOGLE_API_KEY
```
```bash
# Basic usage
python coact_1.py -m "Open Firefox and navigate to github.com"

# Example tasks
python coact_1.py -m "Take a screenshot and save it as test.png"
python coact_1.py -m "Create a text file with 'Hello World' content"
python coact_1.py -m "Open terminal and run 'ls -la'"
```
CoAct-1 includes a modern web interface for real-time visualization of agent execution, featuring live screenshots, OCR text detection, grounding model predictions, and function call logs.
```bash
# Navigate to the web application directory
cd agent-viz-canvas

# Install dependencies
npm install

# Start the development server
npm run dev
```
The web interface will be available at http://localhost:5173 (or similar; check the terminal output).
In a separate terminal, start CoAct-1 to see live progress:
```bash
# Navigate back to the main directory
cd ..

# Run CoAct-1 with a task
python main.py -m "get me my roboflow api key"
```
- Live Screenshots: Real-time display of the computer screen as agents interact
- OCR Text Detection: Automatic text element detection with bounding boxes
- Grounding Model Panel: Shows when vision models predict click coordinates
- Function Call Log: Live tracking of all agent actions and tool calls
- Agent State Indicators: Visual status of Orchestrator, Programmer, and GUI Operator
- Task Progress: Hierarchical view of task decomposition and completion
The system uses WebSocket connections on port 8765 for real-time data streaming between the Python backend and web frontend.
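
The exact event schema is internal to this repo, but a minimal broadcast server on port 8765 could look like the following sketch (built on the `websockets` package; the message shape is a hypothetical example, not the actual protocol):

```python
# Minimal broadcast-server sketch using the `websockets` package.
# The {"type": ..., "data": ...} event shape is hypothetical.
import asyncio
import json
import websockets

CLIENTS = set()

async def handler(ws):
    CLIENTS.add(ws)
    try:
        await ws.wait_closed()
    finally:
        CLIENTS.discard(ws)

async def broadcast(event_type: str, payload: dict):
    message = json.dumps({"type": event_type, "data": payload})
    for ws in list(CLIENTS):
        try:
            await ws.send(message)
        except websockets.ConnectionClosed:
            CLIENTS.discard(ws)

async def main():
    async with websockets.serve(handler, "localhost", 8765):
        await asyncio.Future()  # run forever

if __name__ == "__main__":
    asyncio.run(main())
```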
```bash
python coact_1.py -m "TASK_DESCRIPTION"
```
Parameters:
- `-m, --message`: The task description to execute (required)
```bash
# File operations
python coact_1.py -m "Create a directory called 'test' and add a file with some content"

# Application management
python coact_1.py -m "Open Firefox browser and search for 'artificial intelligence'"

# System operations
python coact_1.py -m "Check disk usage and list running processes"

# GUI interactions
python coact_1.py -m "Open a text editor and type 'Hello from CoAct-1'"
```
- Initialization: System starts a Docker-based Linux VM (`trycua/cua-ubuntu:latest`)
- Screenshot Capture: Initial screenshot taken for context
- Task Decomposition: Orchestrator analyzes the task and decomposes it into subtasks
- Agent Delegation: Subtasks are delegated to the Programmer or GUI Operator
- Execution: Specialist agents execute their delegated subtasks
- Progress Evaluation: Orchestrator reviews results and either continues or completes
- Cleanup: VM resources are released
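
In rough pseudocode terms, the loop behaves like the self-contained toy below (the stub classes and method names are illustrative; the real Orchestrator makes these decisions with an LLM):

```python
# Toy version of the delegate -> execute -> evaluate cycle described above.
import asyncio
from dataclasses import dataclass

@dataclass
class Decision:
    done: bool
    agent: str = ""      # "programmer" or "gui_operator"
    subtask: str = ""
    summary: str = ""

class StubOrchestrator:
    """Stands in for the LLM-backed Orchestrator."""
    def __init__(self, plan):
        self.plan = list(plan)  # queue of (agent, subtask) pairs

    async def decide(self) -> Decision:
        if not self.plan:
            return Decision(done=True, summary="all subtasks finished")
        agent, subtask = self.plan.pop(0)
        return Decision(done=False, agent=agent, subtask=subtask)

async def execute(agent: str, subtask: str) -> str:
    # The real specialist agents act on the Docker VM here
    return f"[{agent}] completed: {subtask}"

async def run_task(orchestrator: StubOrchestrator, max_steps: int = 20) -> str:
    for _ in range(max_steps):
        decision = await orchestrator.decide()
        if decision.done:
            return decision.summary
        result = await execute(decision.agent, decision.subtask)
        print(result)  # in the real loop, the Orchestrator evaluates this
    return "max steps reached without completion"

plan = [("programmer", "mkdir -p ~/test"), ("gui_operator", "click the Save button")]
print(asyncio.run(run_task(StubOrchestrator(plan))))
```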
The system uses three models by default:
```python
orchestrator_model = "gemini/gemini-2.5-flash"
programmer_model = "gemini/gemini-2.5-flash"
gui_operator_model = "huggingface-local/OpenGVLab/InternVL3_5-4B+gemini/gemini-2.5-flash"
```
For Orchestrator/Programmer:
- `anthropic/claude-3-5-sonnet-20241022`
- `openai/gpt-4o`

For GUI Operator:
- `omniparser+gemini/gemini-2.5-flash` (uses OmniParser for element detection)
- `huggingface-local/ByteDance-Seed/UI-TARS-1.5-7B` (UI-TARS model)
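
Swapping models amounts to changing the strings shown above before the agents are constructed, for example:

```python
# Example override: Claude for planning/coding, OmniParser + Gemini for GUI.
# Variable names mirror the defaults above; how coact_1.py consumes them
# may differ.
orchestrator_model = "anthropic/claude-3-5-sonnet-20241022"
programmer_model = "anthropic/claude-3-5-sonnet-20241022"
gui_operator_model = "omniparser+gemini/gemini-2.5-flash"
```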
The system runs on a Docker-based Linux VM with these defaults:
```python
computer = Computer(
    os_type="linux",
    provider_type=VMProviderType.DOCKER,
    name="cua-coact1-demo",
    image="trycua/cua-ubuntu:latest",
)
```
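
A hypothetical lifecycle sketch around this configuration (the import path matches the bundled `computer` package, but the `run()`/`stop()` method names are assumptions; check `computer/` for the actual interface):

```python
# Hypothetical lifecycle sketch; run()/stop() are assumed method names.
import asyncio
from computer import Computer, VMProviderType

async def main():
    computer = Computer(
        os_type="linux",
        provider_type=VMProviderType.DOCKER,
        name="cua-coact1-demo",
        image="trycua/cua-ubuntu:latest",
    )
    await computer.run()        # start (or attach to) the Docker VM
    try:
        pass                    # agents drive the VM here
    finally:
        await computer.stop()   # the "Cleanup" step from the workflow above

asyncio.run(main())
```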
The system uses a computer abstraction built on the CUA (Computer Use Agent) framework:
- Docker VM: Isolated Ubuntu Linux environment
- WebSocket Communication: Real-time interaction with VM
- GUI Proxy: Restricted interface for GUI Operator (no shell access)
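
The GUI proxy's "no shell access" guarantee can be pictured as a thin wrapper that forwards only pointer, keyboard, and screen methods. The sketch below illustrates the idea; it is not the repo's actual proxy class, and the allowed method names are illustrative:

```python
# Sketch of the GUI-proxy idea: expose only GUI-safe methods so the
# GUI Operator cannot reach the shell. Method names are illustrative.
class GUIProxy:
    """Restricted view of the computer interface for the GUI Operator."""

    _ALLOWED = {"screenshot", "left_click", "double_click", "type_text",
                "press_key", "move_cursor", "scroll"}

    def __init__(self, interface):
        self._interface = interface

    def __getattr__(self, name):
        if name not in self._ALLOWED:
            raise AttributeError(f"GUI Operator may not call {name!r}")
        return getattr(self._interface, name)
```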
Agents communicate through multimodal messages containing:
- Text instructions and task descriptions
- Base64-encoded screenshot images
- OCR-detected text elements with bounding boxes and confidence scores
- Function call delegations and results
- Progress summaries and status updates
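
Concretely, a single delegation message might carry something like the following (a hypothetical shape in the OpenAI-style multimodal format; the actual keys used in `coact_1.py` may differ):

```python
# Hypothetical message shape; actual field names in coact_1.py may differ.
message = {
    "role": "user",
    "content": [
        {"type": "text", "text": "Click the 'Sign in' button on the page."},
        {"type": "image_url",
         "image_url": {"url": "data:image/png;base64,iVBORw0KGgo..."}},
        {"type": "text",
         "text": "OCR: 'Sign in' at bbox (412, 288, 96, 32), conf 0.94"},
    ],
}
```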
- Token Efficiency: Filtered conversation history to reduce context length
- Vision Optimization: Minimal screenshot usage, text-based progress summaries
- Execution Strategy: Background command execution for GUI applications
- Delegation Logic: Programmer-first approach for reliability
- Docker Connection Failed

  ```bash
  # Ensure Docker is running
  docker --version
  docker ps
  ```

- API Key Not Set

  ```bash
  # Should print your key if it is set
  echo $GOOGLE_API_KEY
  ```

- CUDA/GPU Issues
  - For CPU-only use: install `torch` without CUDA
  - Check the CUDA installation: `nvidia-smi`

- Model Loading Errors
  - Ensure sufficient RAM (16GB+ recommended)
  - Check the internet connection for model downloads

- Port Conflicts
  - The default WebSocket port (8765) may already be in use
  - Check for running Docker containers
Enable verbose logging:

```python
import logging
logging.basicConfig(level=logging.INFO)
```
- GPU Acceleration: Install CUDA-enabled PyTorch for vision models
- Memory Management: Close other applications during execution
- Network: Ensure stable internet for API calls
```
coact_implementation/
├── coact_1.py                    # Main CoAct-1 implementation
├── requirements.txt              # Python dependencies
├── COACT1_TECHNICAL_README.md    # Technical documentation
├── agent/                        # Agent framework
├── computer/                     # Computer interface abstraction
├── core/                         # Core utilities
└── benchmarks/                   # Benchmarking tools
```
```python
class CustomTools:
    async def custom_operation(self, param: str) -> str:
        # Implement your tool logic; return a result string for the agent
        return f"custom_operation received {param!r}"
```

```python
def _create_custom_agent(self) -> ComputerAgent:
    instructions = "Your custom agent instructions..."
    tools = [self.custom_tools.custom_operation]
    return ComputerAgent(
        model="your-model",
        tools=tools,
        instructions=instructions,
    )
```
Run the system with simple tasks first:
```bash
# Test basic functionality
python coact_1.py -m "List files in current directory"

# Test GUI operations
python coact_1.py -m "Open terminal application"
```
- Isolated Execution: All operations run within Docker containers
- No Host Access: VM cannot modify host system files
- Controlled APIs: Limited computer interface exposure
- Agent Isolation: Clean separation between agent capabilities
- Task Completion Time: 30 seconds to 5 minutes depending on complexity
- Token Usage: ~10K-50K tokens per complex task
- Memory Usage: 4-8GB RAM during execution
- Success Rate: 85-95% for well-defined tasks
- GUI Precision: Vision-based element detection may fail on complex UIs
- Browser Compatibility: Optimized for Firefox, may need adaptation for other browsers
- Network Dependency: Requires internet for cloud models
- Resource Intensive: High memory/CPU usage during execution
- Fork the repository
- Create a feature branch: `git checkout -b feature-name`
- Make changes and test thoroughly
- Submit a pull request with a detailed description
This implementation is for research and educational purposes. See the main project LICENSE for details.
This implementation is inspired by the CoAct architecture and built upon the excellent CUA framework. Special thanks to the trycua team for providing the foundational computer automation infrastructure.