An autonomous computer control agent powered by Vision-Language Models
| Date | Version | Changes |
|---|---|---|
| Feb 2026 | v2.0 | Upgraded to Holo2-4B - improved accuracy and reasoning |
| June 2025 | v1.0 | Initial release with Holo1.5-3B |
Computer Use Agent is an autonomous system that can control your computer to complete tasks. Give it a high-level goal like "Search for flights to New York" and watch it navigate, click, type, and interact with your desktop - all powered by a local Vision-Language Model.
The agent uses a tri-role architecture (see the loop sketch after this list):
- Navigator - Analyzes screenshots and decides the next action
- Localizer - Finds exact coordinates of UI elements
- Validator - Confirms actions were successful (optional)
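Putting the three roles together, the control loop looks roughly like the sketch below. All names here are illustrative, not the project's actual API; `capture_screen` and `execute_action` are sketched later in this README:

```python
def run_agent(goal: str, navigator, localizer, validator=None,
              max_steps: int = 20) -> None:
    """Illustrative tri-role loop: navigate -> localize -> execute -> validate."""
    for _ in range(max_steps):
        screenshot = capture_screen()                     # sketched under Speed Presets
        action = navigator.next_action(goal, screenshot)  # role 1: decide what to do
        if action.kind == "done":                         # navigator says goal reached
            return
        if action.target is not None:                     # role 2: ground the target
            action.x, action.y = localizer.locate(action.target, screenshot)
        execute_action(action)                            # sketched under Usage
        if validator is not None:                         # role 3 (optional): verify
            validator.check(goal, action, capture_screen())
```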
| Feature | Description |
|---|---|
| Autonomous Mode | Set a goal and let the agent work independently |
| Manual Testing | Test localization, navigation, and validation separately |
| Speed Presets | Quality, Balanced, Fast, Fastest - trade accuracy for speed |
| Real-time Streaming | Watch the model think and reason live |
| Stop Control | Interrupt the agent at any time |
| Thinking Mode | Enable/disable chain-of-thought reasoning |
| Library | Purpose |
|---|---|
| PyTorch | Deep learning framework |
| Transformers | Model loading and inference |
| Gradio | Web interface |
| PyAutoGUI | Mouse and keyboard control |
| mss | Fast screen capture |
| Pydantic | Data validation |
| bitsandbytes | 8-bit quantization |
Hardware
- NVIDIA GPU with 8GB+ VRAM (tested on an RTX 4070 Laptop GPU)
- 16GB+ RAM recommended
Software
- Python 3.10+
- CUDA 11.8+
- Windows 10/11
- Clone the repository

  ```bash
  git clone https://github.com/yourusername/Computer-Use-Agent.git
  cd Computer-Use-Agent
  ```

- Install dependencies

  ```bash
  pip install torch torchvision --index-url https://download.pytorch.org/whl/cu118
  pip install transformers gradio pyautogui mss pydantic pillow bitsandbytes
  ```

- Download the model

  Download Holo2-4B from Hugging Face and place it in your preferred directory. Update `MODEL_PATH` in `core/model.py`:

  ```python
  MODEL_PATH = r"C:\AI\Holo2-4B"  # change this to your path
  ```

- Run the agent

  ```bash
  python agent.py
  ```

- Open the UI

  Navigate to `http://localhost:7860` in your browser.
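For reference, a minimal sketch of what the loading code in `core/model.py` might look like with 8-bit quantization via bitsandbytes. The model class and settings here are assumptions (they depend on the exact checkpoint architecture), not the project's verbatim code:

```python
import torch
from transformers import AutoModelForVision2Seq, AutoProcessor, BitsAndBytesConfig

MODEL_PATH = r"C:\AI\Holo2-4B"  # change this to your path

# 8-bit quantization keeps a 4B-parameter VLM within an 8GB VRAM budget.
quant_config = BitsAndBytesConfig(load_in_8bit=True)

processor = AutoProcessor.from_pretrained(MODEL_PATH)
model = AutoModelForVision2Seq.from_pretrained(
    MODEL_PATH,
    quantization_config=quant_config,
    torch_dtype=torch.float16,  # compute dtype for the non-quantized layers
    device_map="auto",          # place layers on the GPU automatically
)
```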
- Enter a task description (e.g., "Open Notepad and type Hello World")
- Select speed preset (Balanced recommended)
- Click Start Agent
- A new browser tab opens automatically; the agent will work there (the execution sketch below shows how actions reach the desktop)
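Under the hood, each decided action has to be executed on the real desktop. A minimal executor sketch using PyAutoGUI follows; the `Action` fields (`kind`, `x`, `y`, `text`, `amount`) are hypothetical stand-ins for the classes in `core/actions.py`:

```python
import pyautogui

pyautogui.FAILSAFE = True  # abort by slamming the mouse into a screen corner

def execute_action(action) -> None:
    """Hypothetical executor for navigator actions (not the project's API)."""
    if action.kind == "click":
        pyautogui.click(action.x, action.y)
    elif action.kind == "type":
        pyautogui.write(action.text, interval=0.02)  # small per-key delay
    elif action.kind == "scroll":
        pyautogui.scroll(action.amount)  # positive scrolls up, negative down
    elif action.kind == "key":
        pyautogui.press(action.text)     # e.g. "enter", "tab", "esc"
```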
| Preset | Resolution | Use Case |
|---|---|---|
| Quality | 1280px | Best accuracy, slower |
| Balanced | 896px | Good balance |
| Fast | 768px | Faster, still accurate |
| Fastest | 512px | Maximum speed, may miss small elements |
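The presets cap the longest side of the screenshot before it is sent to the model. As an illustration, a capture-and-downscale sketch using mss and Pillow (the function name and preset mapping are assumptions):

```python
import mss
from PIL import Image

PRESETS = {"Quality": 1280, "Balanced": 896, "Fast": 768, "Fastest": 512}

def capture_screen(preset: str = "Balanced") -> Image.Image:
    """Grab the primary monitor and cap its longest side at the preset size."""
    with mss.mss() as sct:
        shot = sct.grab(sct.monitors[1])  # monitors[0] is the full virtual screen
        img = Image.frombytes("RGB", shot.size, shot.rgb)
    scale = PRESETS[preset] / max(img.size)
    if scale < 1.0:  # only downscale, never upscale
        img = img.resize((round(img.width * scale), round(img.height * scale)))
    return img
```

Note that coordinates predicted on a downscaled image must be scaled back up by the same factor before they are passed to the mouse.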
Use the other tabs to test individual components:
- Localization - Upload screenshot, describe element, get coordinates
- Navigation - Upload screenshot, describe task, get next action
- Validator - Verify if an action was successful
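For example, a single localization call might look like the sketch below, reusing the `processor` and `model` from the loading sketch above. Both the chat-template usage and the `(x, y)` answer format are assumptions; the real prompt templates live in `core/prompts.py`:

```python
import re
from PIL import Image

def locate(element: str, screenshot: Image.Image) -> tuple[int, int]:
    """Hypothetical localizer call: ask the VLM for pixel coordinates."""
    messages = [{"role": "user", "content": [
        {"type": "image", "image": screenshot},
        {"type": "text", "text": f"Return the coordinates of: {element}"},
    ]}]
    inputs = processor.apply_chat_template(
        messages, add_generation_prompt=True, tokenize=True,
        return_dict=True, return_tensors="pt",
    ).to(model.device)
    out = model.generate(**inputs, max_new_tokens=64)
    answer = processor.decode(out[0, inputs["input_ids"].shape[1]:],
                              skip_special_tokens=True)
    match = re.search(r"\((\d+),\s*(\d+)\)", answer)  # assumed "(x, y)" format
    if match is None:
        raise ValueError(f"No coordinates in model output: {answer!r}")
    return int(match.group(1)), int(match.group(2))
```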
```
Computer-Use-Agent/
├── agent.py              # entry point
├── core/
│   ├── __init__.py
│   ├── model.py          # model loading & inference
│   ├── actions.py        # action classes & execution
│   └── prompts.py        # prompt templates
└── ui/
    ├── __init__.py
    └── gradio_app.py     # web interface
```
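Since the library table lists Pydantic for data validation, `core/actions.py` plausibly models actions as validated schemas. A sketch under that assumption (field names are hypothetical):

```python
from typing import Literal, Optional
from pydantic import BaseModel, Field

class ClickAction(BaseModel):
    kind: Literal["click"] = "click"
    target: str = Field(description="Natural-language description of the element")
    x: Optional[int] = None  # filled in by the localizer
    y: Optional[int] = None

class TypeAction(BaseModel):
    kind: Literal["type"] = "type"
    text: str

class DoneAction(BaseModel):
    kind: Literal["done"] = "done"
    success: bool = True
```

Validating the navigator's output against schemas like these catches malformed actions before they ever reach the mouse and keyboard.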
This project uses Holo2-4B from Hcompany, a Vision-Language Model fine-tuned for GUI understanding and computer control tasks.
| Model | Parameters | Link |
|---|---|---|
| Holo2-4B | 4B | huggingface.co/Hcompany/Holo2-4B |
| Holo1.5-3B | 3B | huggingface.co/Hcompany/Holo1.5-3B |
- Hcompany for the Holo models
Semester Project - Agentic AI
Spring 2025

