An autonomous AI agent that can See, Think, and Act to perform tasks on Windows 11 using Ollama with the Qwen3-VL vision-language model.
- 👁️ See: Captures screenshots of your desktop using efficient Windows APIs
- 🧠 Think: Analyzes screenshots with the Qwen3-VL model via Ollama to understand the UI and plan actions
- 🎯 Act: Executes mouse clicks, keyboard input, and system actions autonomously
- 🔄 Loop: Continues until the task is complete or the maximum number of iterations is reached
- 📸 Screenshot History: Saves all screenshots for debugging and analysis
- 🛡️ Safe Operation: Includes failsafe mechanisms to prevent runaway automation
- Windows 11 (or Windows 10)
- Python 3.9+
- Ollama installed and running
- Qwen3-VL model pulled in Ollama
- Install Ollama (if not already installed):

  ```bash
  # Download from https://ollama.ai and install
  # Or use winget:
  winget install Ollama.Ollama
  ```

- Pull the Qwen3-VL model:

  ```bash
  ollama pull qwen3-vl:235b-cloud
  ```

- Clone this repository:

  ```bash
  git clone <repository-url>
  cd Computer-Use-Agent
  ```

- Install Python dependencies:

  ```bash
  pip install -r requirements.txt
  ```
```bash
python see_think_act_agent.py
```

You can also run custom tasks programmatically:
```python
from see_think_act_agent import SeeThinkActAgent

# Initialize agent
agent = SeeThinkActAgent(
    model="qwen3-vl:235b-cloud",
    max_iterations=30,
    save_screenshots=True
)

# Run a task
result = agent.run("Open Notepad and type 'Hello World'")
print(result)
```

Open and run `see_think_act_demo.ipynb` for interactive examples:
```bash
jupyter notebook see_think_act_demo.ipynb
```

```
STC/
├── see_think_act_agent.py     # Main agent implementation
├── see_think_act_demo.ipynb   # Demo notebook with examples
├── requirements.txt           # Python dependencies
├── README.md                  # This file
├── utils/
│   ├── __init__.py
│   ├── screenshot_capture.py  # Screen capture utility
│   ├── ollama_client.py       # Ollama API wrapper
│   ├── action_executor.py     # Action execution (mouse/keyboard)
│   └── agent_function_call.py # Function calling definitions
├── screenshots/               # Default screenshot directory
└── agent_screenshots/         # Agent execution screenshots
```
The agent operates in a continuous See-Think-Act loop:
- Captures a screenshot of the current desktop state
- Uses the `mss` library for efficient, low-latency screen capture
- Encodes the image for transmission to the model
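The encoding step is typically just base64 over the captured PNG bytes, since Ollama's chat API accepts base64-encoded images. A minimal sketch — the `mss` capture and PNG conversion are omitted, and `encode_screenshot` is a hypothetical helper, not necessarily the repo's actual API:

```python
import base64

def encode_screenshot(png_bytes: bytes) -> str:
    """Base64-encode raw PNG bytes for transmission to the Ollama API."""
    return base64.b64encode(png_bytes).decode("ascii")

# Example with a stand-in payload (a real call would pass mss-captured PNG bytes)
payload = encode_screenshot(b"\x89PNG\r\n\x1a\n")
print(payload[:12], "...")
```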
- Sends screenshot to Qwen3-VL model via Ollama
- Model analyzes the visual state and understands the UI
- Uses function calling to structure the next action
- Considers task progress and decides next step
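As a rough illustration of what a function-calling definition can look like — the real definitions live in `utils/agent_function_call.py`; the `click` tool below is a hypothetical example in the JSON-schema style used by tool-calling APIs such as Ollama's:

```python
# Hypothetical tool definition for a single "click" action. The actual
# schemas in utils/agent_function_call.py may differ in names and fields.
click_tool = {
    "type": "function",
    "function": {
        "name": "click",
        "description": "Click the mouse at a screen coordinate",
        "parameters": {
            "type": "object",
            "properties": {
                "x": {"type": "integer", "description": "X pixel coordinate"},
                "y": {"type": "integer", "description": "Y pixel coordinate"},
                "button": {
                    "type": "string",
                    "enum": ["left", "right", "middle", "double"],
                },
            },
            "required": ["x", "y"],
        },
    },
}
```

The model answers with a structured call (tool name plus arguments) instead of free text, which is what lets the agent execute the next step deterministically.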
- Executes the action decided by the model
- Uses `pyautogui` for precise mouse and keyboard control
- Actions include:
  - Mouse clicks (left, right, double, middle)
  - Mouse movement and dragging
  - Keyboard typing and key presses
  - Scrolling
  - Waiting for UI updates
  - Task termination
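One common way to wire these actions up is a dispatch table from action names to handlers. The sketch below uses stub handlers that only describe what they would do; the repo's `action_executor.py` would call `pyautogui` functions (e.g. `pyautogui.click`) instead:

```python
# Sketch of an action dispatcher with stub handlers. These names and the
# action dict shape are assumptions for illustration, not the repo's API.
import time

def do_click(x: int, y: int, button: str = "left") -> str:
    return f"click {button} at ({x}, {y})"

def do_type(text: str) -> str:
    return f"type {text!r}"

def do_wait(seconds: float) -> str:
    time.sleep(min(seconds, 0.01))  # capped so the sketch runs instantly
    return f"waited {seconds}s"

DISPATCH = {"click": do_click, "type": do_type, "wait": do_wait}

def execute(action: dict) -> str:
    """Run one model-chosen action, e.g. {'name': 'click', 'args': {...}}."""
    handler = DISPATCH[action["name"]]
    return handler(**action.get("args", {}))

print(execute({"name": "click", "args": {"x": 100, "y": 200}}))
# → click left at (100, 200)
```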
- Captures new screenshot to see the result
- Loop continues until task completion or max iterations
- Agent adapts based on what it sees
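The loop above can be sketched as a small driver function, with stubs standing in for the real see/think/act components (screenshot capture, the Qwen3-VL call, and `pyautogui`):

```python
# Skeleton of the See-Think-Act loop. The callable parameters are
# placeholders for the agent's real components.

def run_loop(see, think, act, max_iterations=30):
    """Run see → think → act until the model signals completion."""
    for i in range(max_iterations):
        screenshot = see()            # SEE: capture current desktop state
        action = think(screenshot)    # THINK: model picks the next action
        if action["name"] == "done":  # model decided the task is finished
            return f"completed in {i + 1} iterations"
        act(action)                   # ACT: execute, then loop to observe

    return "stopped: max iterations reached"

# Toy run: the "model" clicks once, then reports completion
steps = iter([{"name": "click"}, {"name": "done"}])
result = run_loop(see=lambda: b"png", think=lambda s: next(steps), act=lambda a: None)
print(result)  # → completed in 2 iterations
```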
```python
agent = SeeThinkActAgent(
    model="qwen3-vl:235b-cloud",   # Ollama model name
    max_iterations=30,             # Maximum action loops
    save_screenshots=True,         # Save screenshots for debugging
    screenshot_dir="screenshots",  # Screenshot save directory
    log_level="INFO"               # Logging verbosity
)
```

Edit `utils/action_executor.py` to adjust:
- Mouse movement speed
- Typing speed
- Wait times between actions
- Failsafe settings
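For reference, `pyautogui` exposes these knobs as module-level settings and per-call parameters. The values below are illustrative, not the repo's defaults (and the fragment needs an attached display to actually run):

```python
import pyautogui

# Global pause inserted after every pyautogui call, in seconds
pyautogui.PAUSE = 0.5

# Abort automation by slamming the mouse into the top-left corner
pyautogui.FAILSAFE = True

# Per-call speed: duration controls how long the mouse movement takes
pyautogui.moveTo(500, 300, duration=0.25)

# interval controls the delay between individual keystrokes
pyautogui.typewrite("hello", interval=0.05)
```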
```python
# Open an application
agent.run("Open Notepad")

# Type text
agent.run("Open Notepad and type 'Hello World'")

# Use calculator
agent.run("Open Calculator and calculate 123 + 456")
```

```python
# Web browsing
agent.run("Open Microsoft Edge and search for 'Ollama AI'")

# File management
agent.run("Open File Explorer and create a new folder named 'AI_Projects'")

# Multi-step tasks
agent.run("Open Notepad, type a grocery list, and save it as groceries.txt on Desktop")
```

- Failsafe: Move the mouse to the top-left corner of the screen to abort (a `pyautogui` feature)
- Max Iterations: Prevents infinite loops
- Keyboard Interrupt: Press Ctrl+C to stop
- Screenshot History: Review what the agent did
- Logging: Full activity logs for debugging
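A Ctrl+C abort can also be handled explicitly so a run always ends in a known state. A minimal sketch, with `fake_run` standing in for `agent.run`:

```python
# Wrap an agent run so Ctrl+C aborts cleanly instead of leaving the
# mouse/keyboard mid-action. run_fn stands in for agent.run.

def safe_run(run_fn, task: str) -> str:
    try:
        return run_fn(task)
    except KeyboardInterrupt:
        return "aborted by user (Ctrl+C)"

def fake_run(task):  # stand-in that simulates the user pressing Ctrl+C
    raise KeyboardInterrupt

print(safe_run(fake_run, "Open Notepad"))  # → aborted by user (Ctrl+C)
```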
```bash
# Pull the model
ollama pull qwen3-vl:235b-cloud

# Verify it's available
ollama list
```

```bash
# Check if Ollama is running
ollama serve

# Or restart the Ollama service
```

- Ensure Python has screen capture permissions
- Try running with administrator privileges
- Check antivirus/security software settings
- Verify screen resolution in agent initialization
- Adjust coordinate scaling if needed
- Check for display scaling settings in Windows
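Coordinate scaling usually reduces to a linear mapping from the screenshot's resolution to the physical screen resolution (Windows display scaling such as 125% has the same effect). A sketch, where the helper name `scale_coords` is hypothetical:

```python
# Map a coordinate from screenshot pixels to physical screen pixels.
# Needed when the screenshot sent to the model is downscaled, or when
# Windows display scaling makes the two resolutions differ.

def scale_coords(x, y, shot_size, screen_size):
    """Convert (x, y) in screenshot pixels to screen pixels."""
    sw, sh = shot_size
    pw, ph = screen_size
    return round(x * pw / sw), round(y * ph / sh)

# Model saw a 1280x720 screenshot; the real screen is 2560x1440
print(scale_coords(640, 360, (1280, 720), (2560, 1440)))  # → (1280, 720)
```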
- Adjust `pyautogui.PAUSE` in `action_executor.py`
- Modify wait times after actions
- Increase iteration delays
- Start Simple: Begin with simple tasks and build up complexity
- Clear Desktop: Minimize visual clutter for better recognition
- Specific Instructions: Give clear, specific task descriptions
- Monitor Progress: Watch the agent work to understand its decisions
- Review Screenshots: Check saved screenshots to debug issues
- Reasonable Scope: Keep tasks focused and achievable
Test individual components:
```bash
# Test screenshot capture
python utils/screenshot_capture.py

# Test Ollama client
python utils/ollama_client.py

# Test action executor
python utils/action_executor.py
```

```
┌─────────────────────────────────────────────────────┐
│                 See-Think-Act Agent                 │
└──────────────────────────┬──────────────────────────┘
                           │
       ┌───────────────────┼───────────────────┐
       │                   │                   │
       ▼                   ▼                   ▼
┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│  SEE Module  │    │ THINK Module │    │  ACT Module  │
│              │    │              │    │              │
│  Screenshot  │───▶│    Ollama    │───▶│  PyAutoGUI   │
│  Capture     │    │   Qwen3-VL   │    │   Control    │
│  (mss)       │    │  Vision LLM  │    │              │
└──────────────┘    └──────────────┘    └──────────────┘
       │                   │                   │
       │                   │                   │
       └───────────────────┴───────────────────┘
                           │
                    ┌──────▼──────┐
                    │   Windows   │
                    │     OS      │
                    └─────────────┘
```
Contributions are welcome! Areas for improvement:
- Additional action types
- Better error handling
- Multi-monitor support
- Task planning and optimization
- Integration with more models
- Qwen Team: For the amazing Qwen3-VL model
- Ollama: For making local LLM inference easy
- PyAutoGUI: For GUI automation capabilities
- MSS: For efficient screen capture
For issues and questions:
- Open an issue on GitHub
- Check existing issues for solutions
- Review the troubleshooting section