
See-Think-Act AI Agent for Windows 11

An autonomous AI Agent that can See, Think, and Act to perform tasks on Windows 11, using Ollama with the Qwen3-VL vision-language model.

🎯 Features

  • 👁️ See: Captures screenshots of your desktop using efficient Windows APIs
  • 🧠 Think: Analyzes screenshots with the Qwen3-VL model via Ollama to understand the UI and plan actions
  • 🎯 Act: Executes mouse clicks, keyboard input, and system actions autonomously
  • 🔄 Loop: Continues until the task is complete or the maximum number of iterations is reached
  • 📸 Screenshot History: Saves all screenshots for debugging and analysis
  • 🛡️ Safe Operation: Includes failsafe mechanisms to prevent runaway automation

🚀 Quick Start

Prerequisites

  1. Windows 11 (or Windows 10)
  2. Python 3.9+
  3. Ollama installed and running
  4. Qwen3-VL model pulled in Ollama

Installation

  1. Install Ollama (if not already installed):

    # Download from https://ollama.ai and install
    # Or use winget:
    winget install Ollama.Ollama
  2. Pull the Qwen3-VL model:

    ollama pull qwen3-vl:235b-cloud
  3. Clone this repository:

    git clone <repository-url>
    cd Computer-Use-Agent
  4. Install Python dependencies:

    pip install -r requirements.txt

Running the Agent

Option 1: Command Line

python see_think_act_agent.py

You can also run custom tasks programmatically:

from see_think_act_agent import SeeThinkActAgent

# Initialize agent
agent = SeeThinkActAgent(
    model="qwen3-vl:235b-cloud",
    max_iterations=30,
    save_screenshots=True
)

# Run a task
result = agent.run("Open Notepad and type 'Hello World'")
print(result)

Option 2: Jupyter Notebook

Open and run see_think_act_demo.ipynb for interactive examples:

jupyter notebook see_think_act_demo.ipynb

๐Ÿ“ Project Structure

STC/
โ”œโ”€โ”€ see_think_act_agent.py       # Main agent implementation
โ”œโ”€โ”€ see_think_act_demo.ipynb     # Demo notebook with examples
โ”œโ”€โ”€ requirements.txt              # Python dependencies
โ”œโ”€โ”€ README.md                     # This file
โ”œโ”€โ”€ utils/
โ”‚   โ”œโ”€โ”€ __init__.py
โ”‚   โ”œโ”€โ”€ screenshot_capture.py    # Screen capture utility
โ”‚   โ”œโ”€โ”€ ollama_client.py         # Ollama API wrapper
โ”‚   โ”œโ”€โ”€ action_executor.py       # Action execution (mouse/keyboard)
โ”‚   โ””โ”€โ”€ agent_function_call.py   # Function calling definitions
โ”œโ”€โ”€ screenshots/                  # Default screenshot directory
โ””โ”€โ”€ agent_screenshots/           # Agent execution screenshots

🎮 How It Works

The agent operates in a continuous See-Think-Act loop:

1. SEE 👁️

  • Captures a screenshot of the current desktop state
  • Uses mss library for efficient, low-latency screen capture
  • Encodes image for transmission to the model
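
As a rough sketch of what the SEE step boils down to (assuming mss plus Pillow; the helper name is illustrative, not the repository's actual API in utils/screenshot_capture.py):

import base64
import io

import mss
from PIL import Image

def capture_screenshot_b64() -> str:
    """Capture the primary monitor and return it as a base64-encoded PNG."""
    with mss.mss() as sct:
        monitor = sct.monitors[1]               # monitors[0] is the combined virtual screen
        raw = sct.grab(monitor)                 # raw BGRA pixels
        img = Image.frombytes("RGB", raw.size, raw.rgb)
        buffer = io.BytesIO()
        img.save(buffer, format="PNG")
        return base64.b64encode(buffer.getvalue()).decode("ascii")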

2. THINK 🧠

  • Sends screenshot to Qwen3-VL model via Ollama
  • Model analyzes the visual state and understands the UI
  • Uses function calling to structure the next action
  • Considers task progress and decides next step
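
For illustration, a single THINK call could look like the sketch below: the latest screenshot goes to Qwen3-VL through the Ollama Python client together with a tool definition the model may call. The click tool and the image path here are hypothetical placeholders; the real schemas live in utils/agent_function_call.py.

import ollama

# Hypothetical tool definition in the OpenAI-style function-calling format
click_tool = {
    "type": "function",
    "function": {
        "name": "click",
        "description": "Left-click at a screen coordinate.",
        "parameters": {
            "type": "object",
            "properties": {
                "x": {"type": "integer", "description": "X pixel coordinate"},
                "y": {"type": "integer", "description": "Y pixel coordinate"},
            },
            "required": ["x", "y"],
        },
    },
}

response = ollama.chat(
    model="qwen3-vl:235b-cloud",
    messages=[{
        "role": "user",
        "content": "Task: open Notepad. What is the next action?",
        "images": ["screenshots/step_001.png"],  # placeholder path or base64 string
    }],
    tools=[click_tool],
)
print(response["message"])  # contains either text content or tool_calls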

3. ACT 🎯

  • Executes the action decided by the model
  • Uses pyautogui for precise mouse and keyboard control
  • Actions include:
    • Mouse clicks (left, right, double, middle)
    • Mouse movement and dragging
    • Keyboard typing and key presses
    • Scrolling
    • Waiting for UI updates
    • Task termination
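
A minimal sketch of the ACT step, dispatching a structured action onto pyautogui (action names and fields here are illustrative; the repository's real dispatcher is utils/action_executor.py):

import pyautogui

pyautogui.FAILSAFE = True   # slamming the mouse into the top-left corner aborts

def execute_action(action: dict) -> None:
    """Execute one model-decided action with pyautogui."""
    name, args = action["name"], action.get("arguments", {})
    if name == "click":
        pyautogui.click(args["x"], args["y"], button=args.get("button", "left"))
    elif name == "double_click":
        pyautogui.doubleClick(args["x"], args["y"])
    elif name == "type_text":
        pyautogui.typewrite(args["text"], interval=0.03)
    elif name == "press_key":
        pyautogui.press(args["key"])
    elif name == "scroll":
        pyautogui.scroll(args["amount"])
    elif name == "wait":
        pyautogui.sleep(args.get("seconds", 1))
    else:
        raise ValueError(f"Unknown action: {name}")

execute_action({"name": "click", "arguments": {"x": 640, "y": 360}})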

4. REPEAT 🔄

  • Captures new screenshot to see the result
  • Loop continues until task completion or max iterations
  • Agent adapts based on what it sees
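
Put together, the loop is roughly the sketch below. It reuses the hypothetical helpers from the earlier snippets (capture_screenshot_b64, click_tool, execute_action); the real loop in see_think_act_agent.py additionally keeps conversation history, saves screenshots, logs, and handles errors.

import ollama

def run_task(task: str, model: str = "qwen3-vl:235b-cloud", max_iterations: int = 30) -> str:
    for step in range(max_iterations):
        screenshot = capture_screenshot_b64()        # SEE
        message = ollama.chat(                       # THINK
            model=model,
            messages=[{
                "role": "user",
                "content": f"Task: {task}. Decide the next action.",
                "images": [screenshot],
            }],
            tools=[click_tool],
        )["message"]
        tool_calls = message.get("tool_calls")
        if not tool_calls:                           # no action requested: treat as done
            return message["content"]
        for call in tool_calls:                      # ACT
            execute_action({
                "name": call["function"]["name"],
                "arguments": dict(call["function"]["arguments"]),
            })
    return "Stopped: reached max_iterations"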

🛠️ Configuration

Agent Parameters

agent = SeeThinkActAgent(
    model="qwen3-vl:235b-cloud",    # Ollama model name
    max_iterations=30,               # Maximum action loops
    save_screenshots=True,           # Save screenshots for debugging
    screenshot_dir="screenshots",    # Screenshot save directory
    log_level="INFO"                 # Logging verbosity
)

Action Executor Settings

Edit utils/action_executor.py to adjust:

  • Mouse movement speed
  • Typing speed
  • Wait times between actions
  • Failsafe settings
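
For example, pyautogui's module-level settings control pacing and the abort gesture (the exact variable names used inside action_executor.py may differ):

import pyautogui

pyautogui.PAUSE = 0.5        # pause inserted after every pyautogui call
pyautogui.FAILSAFE = True    # moving the mouse to the top-left corner aborts
pyautogui.typewrite("example", interval=0.05)   # per-keystroke delay sets typing speed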

๐Ÿ“ Example Tasks

Simple Tasks

# Open an application
agent.run("Open Notepad")

# Type text
agent.run("Open Notepad and type 'Hello World'")

# Use calculator
agent.run("Open Calculator and calculate 123 + 456")

Complex Tasks

# Web browsing
agent.run("Open Microsoft Edge and search for 'Ollama AI'")

# File management
agent.run("Open File Explorer and create a new folder named 'AI_Projects'")

# Multi-step tasks
agent.run("Open Notepad, type a grocery list, and save it as groceries.txt on Desktop")

🔒 Safety Features

  • Failsafe: Move mouse to top-left corner to abort (pyautogui feature)
  • Max Iterations: Prevents infinite loops
  • Keyboard Interrupt: Press Ctrl+C to stop
  • Screenshot History: Review what the agent did
  • Logging: Full activity logs for debugging
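
When driving the agent from your own script, you can make Ctrl+C stop cleanly; a usage sketch built on the API shown earlier:

from see_think_act_agent import SeeThinkActAgent

agent = SeeThinkActAgent(model="qwen3-vl:235b-cloud", max_iterations=30)
try:
    result = agent.run("Open Notepad and type 'Hello World'")
    print(result)
except KeyboardInterrupt:
    print("Stopped by user; check the screenshot directory to review what the agent did.")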

๐Ÿ› Troubleshooting

Model Not Found

# Pull the model
ollama pull qwen3-vl:235b-cloud

# Verify it's available
ollama list

Ollama Connection Error

# Check if Ollama is running
ollama serve

# Or restart Ollama service
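
You can also probe Ollama's default local endpoint (http://localhost:11434) directly; a quick check, assuming the requests package is installed:

import requests

try:
    r = requests.get("http://localhost:11434/api/tags", timeout=5)
    r.raise_for_status()
    models = [m["name"] for m in r.json().get("models", [])]
    print("Ollama is running. Available models:", models)
except requests.RequestException as exc:
    print("Could not reach Ollama:", exc)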

Screen Capture Issues

  • Ensure Python has screen capture permissions
  • Try running with administrator privileges
  • Check antivirus/security software settings

Mouse Control Problems

  • Verify screen resolution in agent initialization
  • Adjust coordinate scaling if needed
  • Check for display scaling settings in Windows
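
Windows display scaling is a common culprit: pyautogui may see a scaled (logical) resolution while screenshots are captured at the full physical resolution. A Windows-only diagnostic sketch:

import ctypes
import pyautogui

print("Before DPI awareness:", pyautogui.size())   # may be the scaled (logical) resolution
ctypes.windll.user32.SetProcessDPIAware()          # opt this process out of DPI virtualization
print("After DPI awareness: ",
      ctypes.windll.user32.GetSystemMetrics(0),    # SM_CXSCREEN, physical width
      ctypes.windll.user32.GetSystemMetrics(1))    # SM_CYSCREEN, physical height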

Actions Too Fast/Slow

  • Adjust pyautogui.PAUSE in action_executor.py
  • Modify wait times after actions
  • Increase iteration delays

🎯 Best Practices

  1. Start Simple: Begin with simple tasks and build up complexity
  2. Clear Desktop: Minimize visual clutter for better recognition
  3. Specific Instructions: Give clear, specific task descriptions
  4. Monitor Progress: Watch the agent work to understand its decisions
  5. Review Screenshots: Check saved screenshots to debug issues
  6. Reasonable Scope: Keep tasks focused and achievable

🧪 Testing

Test individual components:

# Test screenshot capture
python utils/screenshot_capture.py

# Test Ollama client
python utils/ollama_client.py

# Test action executor
python utils/action_executor.py

📊 Architecture

┌─────────────────────────────────────────────────────────────┐
│                    See-Think-Act Agent                       │
└─────────────────────────────────────────────────────────────┘
                            │
        ┌───────────────────┼───────────────────┐
        │                   │                   │
        ▼                   ▼                   ▼
┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│  SEE Module  │    │ THINK Module │    │  ACT Module  │
│              │    │              │    │              │
│  Screenshot  │───▶│    Ollama    │───▶│  PyAutoGUI   │
│   Capture    │    │   Qwen3-VL   │    │   Control    │
│   (mss)      │    │  Vision LLM  │    │              │
└──────────────┘    └──────────────┘    └──────────────┘
        │                   │                   │
        │                   │                   │
        └───────────────────┴───────────────────┘
                            │
                      ┌─────▼─────┐
                      │  Windows  │
                      │    OS     │
                      └───────────┘

🤝 Contributing

Contributions are welcome! Areas for improvement:

  • Additional action types
  • Better error handling
  • Multi-monitor support
  • Task planning and optimization
  • Integration with more models

🙏 Acknowledgments

  • Qwen Team: For the amazing Qwen3-VL model
  • Ollama: For making local LLM inference easy
  • PyAutoGUI: For GUI automation capabilities
  • MSS: For efficient screen capture

📞 Support

For issues and questions:

  • Open an issue on GitHub
  • Check existing issues for solutions
  • Review the troubleshooting section

⚠️ Warning: This agent can control your computer. Use it responsibly and monitor its actions, especially when testing new tasks.