This directory contains the CoAct-1 implementation — a replica (not the exact code) of a closed-source computer-use agent architecture that enables computer control through both vision (the eye) and code (the programmer). At its core, CoAct-1 is built on top of the CUA (Computer Use Agent) framework, which provides the foundational abstraction layers for agents, computer interaction, and core utilities. The implementation leverages and modifies the original agent, computer, and core directories from the CUA repository, adapting them to fit the CoAct hierarchical architecture.
CoAct-1 implements a hierarchical multi-agent system inspired by the paper "CoAct: A Multi-Agent System for Cooperative Computer Control". The system orchestrates three specialized agents to execute computer automation tasks through coordinated action:
- Orchestrator: Strategic task decomposition and delegation
- Programmer: Shell and Python command execution
- GUI Operator: Vision-based graphical user interface interactions
- Model: `gemini/gemini-2.5-flash`
- Role: Decomposes user tasks into minimal executable subtasks
- Strategy: Prefer the Programmer agent for efficiency; use the GUI Operator only for visual interactions
- Delegation Logic: Break tasks into 5-10 second executable units
- Model: `gemini/gemini-2.5-flash`
- Role: Executes code and system-level operations
- Tools (sketched below):
  - `run_command()`: Execute shell commands with output capture
  - `run_command_in_background()`: Launch GUI applications asynchronously
  - File system operations (`list_dir`, `read_file`, `write_file`)
  - Virtual environment commands (`venv_cmd`)
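
The actual tool implementations live in `coact_1.py` and execute inside the Docker VM through the CUA computer interface. As a rough illustration only, here is a local sketch of what the first two tools might look like (class and method bodies are hypothetical, not the repo's code):

```python
import asyncio

class ProgrammerToolsSketch:
    """Hypothetical shapes for the Programmer's shell tools. The real
    versions run inside the Docker VM via the CUA computer interface,
    not on the host like this sketch does."""

    async def run_command(self, command: str, timeout: float = 60.0) -> str:
        """Run a shell command and capture combined stdout/stderr."""
        proc = await asyncio.create_subprocess_shell(
            command,
            stdout=asyncio.subprocess.PIPE,
            stderr=asyncio.subprocess.STDOUT,
        )
        try:
            output, _ = await asyncio.wait_for(proc.communicate(), timeout)
        except asyncio.TimeoutError:
            proc.kill()
            return f"Command timed out after {timeout}s: {command}"
        return output.decode(errors="replace")

    async def run_command_in_background(self, command: str) -> str:
        """Launch a GUI application without blocking on its exit."""
        await asyncio.create_subprocess_shell(command)
        return f"Started in background: {command}"
```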
- Model: `huggingface-local/OpenGVLab/InternVL3_5-4B+gemini/gemini-2.5-flash`
- Role: Vision-based GUI manipulation and visual element interaction
- Capabilities: Mouse/keyboard simulation, screenshot analysis, OCR text detection, element interaction
- OCR Features: Automatic text element detection, click-by-text functionality, confidence scoring
- Efficiency Principle: Minimize vision model calls, prefer keyboard shortcuts over mouse clicks, leverage OCR for precise text interactions
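
As an illustration of the click-by-text idea: OCR the screenshot, find the target string, and click the center of its bounding box. The sketch below uses `pytesseract` purely as a stand-in OCR backend; the actual implementation's OCR engine and API may differ:

```python
# Hypothetical click-by-text sketch: OCR the screenshot, locate the target
# string, and return the center of its bounding box as click coordinates.
import pytesseract
from PIL import Image

def find_click_target(screenshot_path: str, text: str, min_conf: float = 60.0):
    image = Image.open(screenshot_path)
    data = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)
    for i, word in enumerate(data["text"]):
        conf = float(data["conf"][i])
        if word.strip() == text and conf >= min_conf:
            # Center of the detected word's bounding box
            x = data["left"][i] + data["width"][i] // 2
            y = data["top"][i] + data["height"][i] // 2
            return x, y, conf
    return None  # fall back to the vision grounding model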
- Python: 3.12+
- Docker: Running Docker Desktop (Windows/macOS) or Docker Engine (Linux)
- Conda/Miniconda: For environment management
- Google API Key: For Gemini models (`GOOGLE_API_KEY` environment variable)
- RAM: 8GB minimum, 16GB recommended
- GPU: Optional but recommended for local vision models (CUDA support)
- Storage: ~5GB for Docker images and models
```bash
# Create and activate the conda environment
conda create -n coact1 python=3.12 -y
conda activate coact1

# Navigate to the coact_implementation directory
cd coact_implementation

# Install Python dependencies
pip install -r requirements.txt
```
```bash
# Set your Google API key
export GOOGLE_API_KEY="your-api-key-here"

# Verify the key is set
echo $GOOGLE_API_KEY
```
```bash
# Basic usage
python coact_1.py -m "Open Firefox and navigate to github.com"

# Example tasks
python coact_1.py -m "Take a screenshot and save it as test.png"
python coact_1.py -m "Create a text file with 'Hello World' content"
python coact_1.py -m "Open terminal and run 'ls -la'"
```
CoAct-1 includes a modern web interface for real-time visualization of agent execution, featuring live screenshots, OCR text detection, grounding model predictions, and function call logs.
```bash
# Navigate to the web application directory
cd agent-viz-canvas

# Install dependencies
npm install

# Start the development server
npm run dev
```
The web interface will be available at http://localhost:5173 (or similar; check the terminal output).
In a separate terminal, start CoAct-1 to see live progress:
```bash
# Navigate back to the main directory
cd ..

# Run CoAct-1 with a task
python main.py -m "get me my roboflow api key"
```
- Live Screenshots: Real-time display of the computer screen as agents interact
- OCR Text Detection: Automatic text element detection with bounding boxes
- Grounding Model Panel: Shows when vision models predict click coordinates
- Function Call Log: Live tracking of all agent actions and tool calls
- Agent State Indicators: Visual status of Orchestrator, Programmer, and GUI Operator
- Task Progress: Hierarchical view of task decomposition and completion
The system uses WebSocket connections on port 8765 for real-time data streaming between the Python backend and web frontend.
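
The exact event schema is internal to this repo, but a minimal broadcast server on port 8765 could look like the following sketch (built on the `websockets` package; the message shape is a hypothetical example, not the actual protocol):

```python
# Minimal broadcast-server sketch using the `websockets` package.
# The {"type": ..., "data": ...} event shape is hypothetical.
import asyncio
import json
import websockets

CLIENTS = set()

async def handler(ws):
    CLIENTS.add(ws)
    try:
        await ws.wait_closed()
    finally:
        CLIENTS.discard(ws)

async def broadcast(event_type: str, payload: dict):
    message = json.dumps({"type": event_type, "data": payload})
    for ws in list(CLIENTS):
        try:
            await ws.send(message)
        except websockets.ConnectionClosed:
            CLIENTS.discard(ws)

async def main():
    async with websockets.serve(handler, "localhost", 8765):
        await asyncio.Future()  # run forever

if __name__ == "__main__":
    asyncio.run(main())
```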
```bash
python coact_1.py -m "TASK_DESCRIPTION"
```
Parameters:
- `-m, --message`: The task description to execute (required)
```bash
# File operations
python coact_1.py -m "Create a directory called 'test' and add a file with some content"

# Application management
python coact_1.py -m "Open Firefox browser and search for 'artificial intelligence'"

# System operations
python coact_1.py -m "Check disk usage and list running processes"

# GUI interactions
python coact_1.py -m "Open a text editor and type 'Hello from CoAct-1'"
```
- Initialization: System starts a Docker-based Linux VM (`trycua/cua-ubuntu:latest`)
- Screenshot Capture: Initial screenshot taken for context
- Task Decomposition: Orchestrator analyzes the task and decomposes it into subtasks
- Agent Delegation: Subtasks are delegated to the Programmer or GUI Operator
- Execution: Specialist agents execute their delegated subtasks
- Progress Evaluation: Orchestrator reviews results and either continues or completes
- Cleanup: VM resources are released
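
In rough pseudocode terms, the loop behaves like the self-contained toy below (the stub classes and method names are illustrative; the real Orchestrator makes these decisions with an LLM):

```python
# Toy version of the delegate -> execute -> evaluate cycle described above.
import asyncio
from dataclasses import dataclass

@dataclass
class Decision:
    done: bool
    agent: str = ""      # "programmer" or "gui_operator"
    subtask: str = ""
    summary: str = ""

class StubOrchestrator:
    """Stands in for the LLM-backed Orchestrator."""
    def __init__(self, plan):
        self.plan = list(plan)  # queue of (agent, subtask) pairs

    async def decide(self) -> Decision:
        if not self.plan:
            return Decision(done=True, summary="all subtasks finished")
        agent, subtask = self.plan.pop(0)
        return Decision(done=False, agent=agent, subtask=subtask)

async def execute(agent: str, subtask: str) -> str:
    # The real specialist agents act on the Docker VM here
    return f"[{agent}] completed: {subtask}"

async def run_task(orchestrator: StubOrchestrator, max_steps: int = 20) -> str:
    for _ in range(max_steps):
        decision = await orchestrator.decide()
        if decision.done:
            return decision.summary
        result = await execute(decision.agent, decision.subtask)
        print(result)  # in the real loop, the Orchestrator evaluates this
    return "max steps reached without completion"

plan = [("programmer", "mkdir -p ~/test"), ("gui_operator", "click the Save button")]
print(asyncio.run(run_task(StubOrchestrator(plan))))
```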
The system uses three models by default:
```python
orchestrator_model = "gemini/gemini-2.5-flash"
programmer_model = "gemini/gemini-2.5-flash"
gui_operator_model = "huggingface-local/OpenGVLab/InternVL3_5-4B+gemini/gemini-2.5-flash"
```
For Orchestrator/Programmer:
- `anthropic/claude-3-5-sonnet-20241022`
- `openai/gpt-4o`

For GUI Operator:
- `omniparser+gemini/gemini-2.5-flash` (uses OmniParser for element detection)
- `huggingface-local/ByteDance-Seed/UI-TARS-1.5-7B` (UI-TARS model)
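
Swapping models amounts to changing the strings shown above before the agents are constructed, for example:

```python
# Example override: Claude for planning/coding, OmniParser + Gemini for GUI.
# Variable names mirror the defaults above; how coact_1.py consumes them
# may differ.
orchestrator_model = "anthropic/claude-3-5-sonnet-20241022"
programmer_model = "anthropic/claude-3-5-sonnet-20241022"
gui_operator_model = "omniparser+gemini/gemini-2.5-flash"
```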
The system runs on a Docker-based Linux VM with these defaults:
```python
computer = Computer(
    os_type="linux",
    provider_type=VMProviderType.DOCKER,
    name="cua-coact1-demo",
    image="trycua/cua-ubuntu:latest",
)
```
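
A hypothetical lifecycle sketch around this configuration (the import path matches the bundled `computer` package, but the `run()`/`stop()` method names are assumptions; check `computer/` for the actual interface):

```python
# Hypothetical lifecycle sketch; run()/stop() are assumed method names.
import asyncio
from computer import Computer, VMProviderType

async def main():
    computer = Computer(
        os_type="linux",
        provider_type=VMProviderType.DOCKER,
        name="cua-coact1-demo",
        image="trycua/cua-ubuntu:latest",
    )
    await computer.run()        # start (or attach to) the Docker VM
    try:
        pass                    # agents drive the VM here
    finally:
        await computer.stop()   # the "Cleanup" step from the workflow above

asyncio.run(main())
```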
The system uses a computer abstraction built on the CUA (Computer Use Agent) framework:
- Docker VM: Isolated Ubuntu Linux environment
- WebSocket Communication: Real-time interaction with VM
- GUI Proxy: Restricted interface for GUI Operator (no shell access)
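
The GUI proxy's "no shell access" guarantee can be pictured as a thin wrapper that forwards only pointer, keyboard, and screen methods. The sketch below illustrates the idea; it is not the repo's actual proxy class, and the allowed method names are illustrative:

```python
# Sketch of the GUI-proxy idea: expose only GUI-safe methods so the
# GUI Operator cannot reach the shell. Method names are illustrative.
class GUIProxy:
    """Restricted view of the computer interface for the GUI Operator."""

    _ALLOWED = {"screenshot", "left_click", "double_click", "type_text",
                "press_key", "move_cursor", "scroll"}

    def __init__(self, interface):
        self._interface = interface

    def __getattr__(self, name):
        if name not in self._ALLOWED:
            raise AttributeError(f"GUI Operator may not call {name!r}")
        return getattr(self._interface, name)
```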
Agents communicate through multimodal messages containing:
- Text instructions and task descriptions
- Base64-encoded screenshot images
- OCR-detected text elements with bounding boxes and confidence scores
- Function call delegations and results
- Progress summaries and status updates
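
Concretely, a single delegation message might carry something like the following (a hypothetical shape in the OpenAI-style multimodal format; the actual keys used in `coact_1.py` may differ):

```python
# Hypothetical message shape; actual field names in coact_1.py may differ.
message = {
    "role": "user",
    "content": [
        {"type": "text", "text": "Click the 'Sign in' button on the page."},
        {"type": "image_url",
         "image_url": {"url": "data:image/png;base64,iVBORw0KGgo..."}},
        {"type": "text",
         "text": "OCR: 'Sign in' at bbox (412, 288, 96, 32), conf 0.94"},
    ],
}
```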
- Token Efficiency: Filtered conversation history to reduce context length
- Vision Optimization: Minimal screenshot usage, text-based progress summaries
- Execution Strategy: Background command execution for GUI applications
- Delegation Logic: Programmer-first approach for reliability
- Docker Connection Failed

  ```bash
  # Ensure Docker is running
  docker --version
  docker ps
  ```

- API Key Not Set

  ```bash
  # Should print your key if it is set
  echo $GOOGLE_API_KEY
  ```

- CUDA/GPU Issues
  - For CPU-only use: install `torch` without CUDA
  - Check the CUDA installation: `nvidia-smi`

- Model Loading Errors
  - Ensure sufficient RAM (16GB+ recommended)
  - Check the internet connection for model downloads

- Port Conflicts
  - The default WebSocket port (8765) may already be in use
  - Check for running Docker containers
Enable verbose logging:

```python
import logging
logging.basicConfig(level=logging.INFO)
```
- GPU Acceleration: Install CUDA-enabled PyTorch for vision models
- Memory Management: Close other applications during execution
- Network: Ensure stable internet for API calls
```
coact_implementation/
├── coact_1.py                    # Main CoAct-1 implementation
├── requirements.txt              # Python dependencies
├── COACT1_TECHNICAL_README.md    # Technical documentation
├── agent/                        # Agent framework
├── computer/                     # Computer interface abstraction
├── core/                         # Core utilities
└── benchmarks/                   # Benchmarking tools
```
```python
class CustomTools:
    async def custom_operation(self, param: str) -> str:
        # Implement your tool logic; return a result string for the agent
        return f"custom_operation received {param!r}"
```

```python
def _create_custom_agent(self) -> ComputerAgent:
    instructions = "Your custom agent instructions..."
    tools = [self.custom_tools.custom_operation]
    return ComputerAgent(
        model="your-model",
        tools=tools,
        instructions=instructions,
    )
```
Run the system with simple tasks first:
```bash
# Test basic functionality
python coact_1.py -m "List files in current directory"

# Test GUI operations
python coact_1.py -m "Open terminal application"
```
- Isolated Execution: All operations run within Docker containers
- No Host Access: VM cannot modify host system files
- Controlled APIs: Limited computer interface exposure
- Agent Isolation: Clean separation between agent capabilities
- Task Completion Time: 30 seconds to 5 minutes depending on complexity
- Token Usage: ~10K-50K tokens per complex task
- Memory Usage: 4-8GB RAM during execution
- Success Rate: 85-95% for well-defined tasks
- GUI Precision: Vision-based element detection may fail on complex UIs
- Browser Compatibility: Optimized for Firefox, may need adaptation for other browsers
- Network Dependency: Requires internet for cloud models
- Resource Intensive: High memory/CPU usage during execution
- Fork the repository
- Create a feature branch: `git checkout -b feature-name`
- Make changes and test thoroughly
- Submit a pull request with a detailed description
This implementation is for research and educational purposes. See the main project LICENSE for details.
This implementation is inspired by the CoAct architecture and built upon the excellent CUA framework. Special thanks to the trycua team for providing the foundational computer automation infrastructure.