NextDesk is an intelligent desktop automation application powered by LLMs via OpenRouter (using advanced models like Google's Gemini 3.0) that uses the ReAct (Reasoning + Acting) framework to understand and execute complex computer tasks through natural language commands.
⚠️ UNDER DEVELOPMENT: This project is currently in development and not ready for production use. The vision-based element detection tool (`detectElementPosition`) is experimental. We recommend using keyboard shortcuts (`pressKeys`) and the `getShortcuts` tool for more reliable automation.
This Flutter desktop application combines AI reasoning with keyboard automation and input control to automate desktop tasks. Simply describe what you want to do in natural language (e.g., "open Chrome and search for Flutter documentation"), and the AI agent will break it down into executable steps, reason about each action, and perform the automation.
| Feature | Status | Notes |
|---|---|---|
| ReAct Framework | ✅ Working | Core reasoning loop is functional |
| Keyboard Automation | ✅ Working | Reliable keyboard shortcuts via `pressKeys` |
| AI Shortcuts Tool | ✅ Working | `getShortcuts` dynamically fetches shortcuts |
| Mouse Control | ✅ Working | Basic mouse movement and clicks |
| Screenshot Capture | ✅ Working | Screen capture functionality |
| Vision Detection | ⚠️ Experimental | Unreliable; not recommended for use |
| User Interaction | ✅ Working | Agent can ask user questions via dialog |
| Task Persistence | ✅ Working | Isar database for task history |
Current Focus: Improving vision detection accuracy and reliability using newer vision models.
| Platform | Status | Notes |
|---|---|---|
| macOS | ✅ Supported | Fully tested and working |
| Windows | ⚠️ Untested | Requires testing of the `bixat_key_mouse` plugin for proper keyboard and mouse control |
| Linux | ⚠️ Untested | Requires testing of the `bixat_key_mouse` plugin for proper keyboard and mouse control |
Note: While the `bixat_key_mouse` plugin claims to support Windows and Linux, we need to thoroughly test keyboard and mouse control functionality on these platforms before officially supporting them in this application.
Main interface showing task history and quick actions
The AI's reasoning process displayed in real-time with numbered thought steps
Execution log showing all function calls and their parameters
```
nextdesk/
├── lib/
│   ├── main.dart                    # Application entry point
│   ├── config/
│   │   ├── app_theme.dart           # Centralized theme & design system
│   │   └── app_config.dart          # API keys and configuration
│   ├── models/
│   │   ├── task.dart                # Task data model (Isar)
│   │   ├── detection_result.dart    # UI element detection results
│   │   └── react_agent_state.dart   # ReAct agent state
│   ├── services/
│   │   ├── openrouter_service.dart  # OpenRouter AI integration
│   │   ├── vision_service.dart      # AI-powered UI element detection
│   │   ├── automation_service.dart  # All automation functions
│   │   └── shortcuts_service.dart   # AI-powered keyboard shortcuts
│   ├── providers/
│   │   └── app_state.dart           # Main state management (Provider)
│   ├── screens/
│   │   └── main_screen.dart         # Main UI with responsive layout
│   ├── widgets/
│   │   ├── task_card.dart           # Reusable task card widget
│   │   └── user_prompt_dialog.dart  # User interaction dialog
│   └── main.g.dart                  # Generated Isar database code
├── macos/                           # macOS platform-specific code
├── windows/                         # Windows platform-specific code
├── linux/                           # Linux platform-specific code
├── pubspec.yaml                     # Dependencies and project configuration
└── README.md                        # This file
```
The application follows separation of concerns with a clean modular architecture:
- `Task`: Isar database model for storing automation tasks with thoughts and steps
- `DetectionResult`: Model for UI element detection results with coordinates
- `ReActAgentState`: State management for the ReAct reasoning cycle
- `OpenRouterService`: Initializes and configures AI models via the OpenRouter API with function calling support
- `VisionService`: AI-powered UI element detection using the OpenRouter Vision API
- `AutomationService`: Wrapper for all automation capabilities (mouse, keyboard, screen)
`AppState`: Main state management using the Provider pattern
- Manages task execution state
- Handles ReAct agent lifecycle
- Stores execution logs and thought history
- Manages database operations
`MainScreen`: Primary interface with responsive layout
- Adaptive design (800px breakpoint; see the layout sketch below)
- Side-by-side panels on large screens
- Drawer navigation on small screens
- Tabbed interface for thoughts and actions
`TaskCard`: Reusable task card with animations and metrics
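To make the 800px breakpoint behavior concrete, a responsive shell along these lines is one way to express it (a minimal sketch, not the actual `MainScreen` code; the widget and panel names here are assumptions):

```dart
import 'package:flutter/material.dart';

/// Hypothetical sketch of the 800px breakpoint logic (not the real MainScreen).
class ResponsiveShell extends StatelessWidget {
  const ResponsiveShell({super.key, required this.thoughts, required this.actions});

  final Widget thoughts; // e.g. the thought-log panel
  final Widget actions;  // e.g. the execution-log panel

  @override
  Widget build(BuildContext context) {
    return LayoutBuilder(
      builder: (context, constraints) {
        if (constraints.maxWidth >= 800) {
          // Large screens: side-by-side panels.
          return Row(
            children: [
              Expanded(child: thoughts),
              const VerticalDivider(width: 1),
              Expanded(child: actions),
            ],
          );
        }
        // Small screens: drawer navigation plus a tabbed view.
        return DefaultTabController(
          length: 2,
          child: Scaffold(
            appBar: AppBar(
              title: const Text('NextDesk'),
              bottom: const TabBar(
                tabs: [Tab(text: 'Thoughts'), Tab(text: 'Actions')],
              ),
            ),
            drawer: const Drawer(),
            body: TabBarView(children: [thoughts, actions]),
          ),
        );
      },
    );
  }
}
```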
`AppTheme`: Centralized design system
- Material Design 3 theme
- Color palette (Purple/Blue/Green)
- 8px spacing system
- Typography using Google Fonts Inter
- Shadow and border radius constants
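For orientation, a design system in this spirit might be sketched as follows (illustrative only; the seed color and constant names are assumptions, not the real `AppTheme`):

```dart
import 'package:flutter/material.dart';
import 'package:google_fonts/google_fonts.dart';

/// Hypothetical sketch of a centralized design system (values are assumptions).
class AppThemeSketch {
  // 8px spacing system.
  static const double spacingUnit = 8.0;
  static const double spacingMedium = spacingUnit * 2; // 16
  static const double spacingLarge = spacingUnit * 3;  // 24

  // Shared radius constant.
  static const double cardRadius = 12.0;

  static ThemeData light() => ThemeData(
        useMaterial3: true,
        colorSchemeSeed: Colors.deepPurple,      // purple-leaning palette (assumed seed)
        textTheme: GoogleFonts.interTextTheme(), // Inter via google_fonts
      );
}
```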
The application implements the ReAct (Reasoning + Acting) pattern, which combines reasoning and action in an iterative loop:
1. THOUGHT → 2. ACTION → 3. OBSERVATION → (repeat)
The AI agent analyzes the current state and decides what to do next:
- Understands the user's goal
- Considers what has been done so far
- Plans the next logical step
The agent executes one of the available automation functions:
- `captureScreenshot()`: Takes a screenshot to see the current state
- `detectElementPosition(description)`: Finds UI elements using AI vision
- `moveMouse(x, y)`: Moves the cursor to coordinates
- `clickMouse(button, action)`: Performs mouse clicks
- `typeText(text)`: Types text via the keyboard
- `pressKeys(keys)`: Presses keyboard shortcuts
- `wait(seconds)`: Waits for a specified duration
- `getShortcuts(query)`: Dynamically fetches app shortcuts
The agent receives feedback from the action:
- Success/failure status
- Element coordinates (for detection)
- Screenshot data
- Error messages
This cycle repeats until the task is complete or max iterations (20) is reached.
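A stripped-down version of that loop could look like the sketch below (illustrative only; `AgentStep`, `decideNextAction`, and `executeTool` are hypothetical stand-ins for the OpenRouter call and the `AutomationService` dispatch):

```dart
// Hypothetical sketch of the ReAct loop; not the actual agent implementation.
class AgentStep {
  AgentStep({required this.thought, this.toolName, this.arguments, this.isDone = false});
  final String thought;                  // the model's reasoning for this step
  final String? toolName;                // e.g. 'pressKeys', 'typeText', 'captureScreenshot'
  final Map<String, dynamic>? arguments; // tool arguments chosen by the model
  final bool isDone;                     // model declared the task complete
}

const int maxIterations = 20;

Future<void> runReactLoop(
  String goal,
  Future<AgentStep> Function(String goal, List<String> history) decideNextAction,
  Future<String> Function(String tool, Map<String, dynamic> args) executeTool,
) async {
  final history = <String>[];

  for (var i = 0; i < maxIterations; i++) {
    // 1. THOUGHT: the model reasons about the next step given the history so far.
    final step = await decideNextAction(goal, history);
    history.add('THOUGHT: ${step.thought}');

    if (step.isDone || step.toolName == null) break;

    // 2. ACTION: execute one of the automation functions listed above.
    final observation =
        await executeTool(step.toolName!, step.arguments ?? const <String, dynamic>{});
    history.add('ACTION: ${step.toolName}');

    // 3. OBSERVATION: feed the result back into the next iteration.
    history.add('OBSERVATION: $observation');
  }
}
```

In the real agent, the thoughts and executed steps are additionally persisted to the `Task` model and surfaced in the UI in real time.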
The application uses OpenRouter to access powerful LLMs (like Google Gemini 3.0 Flash/Pro) with function calling capabilities.
The service handles:
- Chat session management
- System prompts for ReAct behavior
- Tool/Function definition and execution signatures
- Response parsing and JSON handling
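To make the function-calling flow concrete, a raw request to OpenRouter's OpenAI-compatible chat completions endpoint could be assembled roughly like this with the `http` package (a sketch; the model id and the single `pressKeys` tool schema are illustrative, not necessarily what `OpenRouterService` sends):

```dart
import 'dart:convert';

import 'package:http/http.dart' as http;

/// Sketch: one chat turn with a single tool declared, sent to OpenRouter.
Future<Map<String, dynamic>> chatWithTools(String apiKey, String userPrompt) async {
  final response = await http.post(
    Uri.parse('https://openrouter.ai/api/v1/chat/completions'),
    headers: {
      'Authorization': 'Bearer $apiKey',
      'Content-Type': 'application/json',
    },
    body: jsonEncode({
      'model': 'google/gemini-3.0-flash', // illustrative id; check OpenRouter's model list
      'messages': [
        {'role': 'system', 'content': 'You are a ReAct desktop automation agent.'},
        {'role': 'user', 'content': userPrompt},
      ],
      // Tools are declared with JSON Schema; the model answers with tool calls.
      'tools': [
        {
          'type': 'function',
          'function': {
            'name': 'pressKeys',
            'description': 'Press a keyboard shortcut.',
            'parameters': {
              'type': 'object',
              'properties': {
                'keys': {'type': 'array', 'items': {'type': 'string'}},
              },
              'required': ['keys'],
            },
          },
        },
      ],
    }),
  );
  return jsonDecode(response.body) as Map<String, dynamic>;
}
```

The model's reply then carries either plain text or `tool_calls`, which the agent maps onto the automation functions listed in the ReAct section.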
The VisionService leverages the OpenRouter Vision API for UI element detection. It sends screenshots to a vision-capable model (e.g., Gemini 3.0 Flash) to identify pixel coordinates of described elements.
How it works:
- Takes a screenshot of the current screen
- Sends the image + element description to the OpenRouter API
- AI analyzes the image and returns pixel coordinates via JSON
- Returns a `DetectionResult` with x, y coordinates and a confidence score
Example:
```dart
final result = await VisionService.detectElementPosition(
  imageBytes,
  "blue Submit button",
  config,
);
// Returns: {x: 450, y: 320, confidence: 0.95}
```

The automation layer uses the `bixat_key_mouse` package (a custom Rust-based FFI plugin) for:
- Mouse Control: Move cursor, click, double-click, right-click
- Keyboard Control: Type text, press keys, keyboard shortcuts
- Screen Capture: Take screenshots via `screen_capturer`
The AppState class manages:
- Current task execution state
- Execution logs and thought logs
- Screenshot data
- Task history from database
- ReAct agent state (iteration count, current thought, observations)
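A skeletal version of such a provider might look like this (a sketch only; field and method names are assumptions, not the real `AppState`):

```dart
import 'package:flutter/foundation.dart';

/// Hypothetical skeleton of the main state object (names are assumptions).
class AppStateSketch extends ChangeNotifier {
  bool isRunning = false;
  String currentPrompt = '';
  int iteration = 0;                    // current ReAct iteration
  final List<String> executionLog = []; // function calls and their results
  final List<String> thoughtLog = [];   // model reasoning steps

  void startTask(String prompt) {
    currentPrompt = prompt;
    isRunning = true;
    iteration = 0;
    executionLog.clear();
    thoughtLog.clear();
    notifyListeners(); // widgets listening via Provider rebuild
  }

  void addThought(String thought) {
    thoughtLog.add(thought);
    notifyListeners();
  }

  void addLogEntry(String entry) {
    executionLog.add(entry);
    notifyListeners();
  }

  void finishTask() {
    isRunning = false;
    notifyListeners();
  }
}
```

Widgets would observe it through Provider in the usual way, e.g. with `context.watch<AppState>()`.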
Tasks are stored locally using Isar (NoSQL database):
```dart
@collection
class Task {
  Id id = Isar.autoIncrement;
  String prompt = '';
  List<String> thoughts = []; // AI reasoning steps
  List<String> steps = [];    // Executed actions
  bool completed = false;
  DateTime createdAt = DateTime.now();
}
```

- http (^1.2.0): For making API requests to OpenRouter
- bixat_key_mouse: Custom Rust-based FFI package for mouse/keyboard control
- screen_capturer (^0.2.1): Cross-platform screen capture functionality
- provider (^6.1.1): State management using ChangeNotifier pattern
- isar (^3.1.0+1): Fast, local NoSQL database for task persistence
- isar_flutter_libs (^3.1.0+1): Isar platform-specific bindings
- flutter_animate (^4.5.0): Declarative animations and transitions
- google_fonts (^6.1.0): Inter font family for typography
- Material Design 3: Modern design system with gradient themes
- path_provider (^2.1.1): Access to file system paths
- uuid (^4.2.1): Generate unique identifiers for tasks
- image (^4.5.4): Image processing and manipulation
- build_runner (^2.4.6): Code generation for Isar
- isar_generator (^3.1.0+1): Generates Isar database code
- Flutter SDK (>=3.0.0)
- An OpenRouter API key (from https://openrouter.ai)
- macOS desktop environment (Windows and Linux not yet supported - see Platform Support)
- Clone the repository

  ```bash
  git clone <repository-url>
  cd nextdesk
  ```

- Install dependencies

  ```bash
  flutter pub get
  ```

- Set up the `bixat_key_mouse` package

  Ensure the `bixat_key_mouse_dart` package is available at `../bixat_key_mouse_dart`.

- Configure the API key

  You can configure the API key directly in the app settings, or set it via an environment variable.

  Copy the example config file:

  ```bash
  cp lib/config/app_config.dart.example lib/config/app_config.dart
  ```

  Then open `lib/config/app_config.dart` and replace the API key (optional if using the Settings UI):

  ```dart
  static const String openRouterApiKey = 'YOUR_OPENROUTER_API_KEY_HERE';
  ```

- Generate Isar database code

  ```bash
  flutter pub run build_runner build
  ```

- Run the application

  ```bash
  flutter run -d macos  # or windows/linux
  ```
Input: "Open Chrome and search for Flutter documentation"
ReAct Process:
```
THOUGHT: I need to open Chrome first
ACTION: pressKeys(['cmd', 'space'])
OBSERVATION: Spotlight opened
THOUGHT: Now type Chrome to search for it
ACTION: typeText('Chrome')
OBSERVATION: Text typed successfully
THOUGHT: Press Enter to open Chrome
ACTION: pressKeys(['enter'])
OBSERVATION: Chrome is opening
THOUGHT: Wait for Chrome to load
ACTION: wait(2)
OBSERVATION: Waited 2 seconds
THOUGHT: Now I need to click on the address bar
ACTION: captureScreenshot()
OBSERVATION: Screenshot captured
THOUGHT: Detect the address bar
ACTION: detectElementPosition('address bar at the top')
OBSERVATION: Found at x:500, y:100
THOUGHT: Click on the address bar
ACTION: moveMouse(500, 100)
ACTION: clickMouse('left', 'click')
OBSERVATION: Clicked successfully
THOUGHT: Type the search query
ACTION: typeText('Flutter documentation')
ACTION: pressKeys(['enter'])
OBSERVATION: Task complete
```
- ✅ Natural language task understanding
- ✅ ReAct reasoning framework (Thought → Action → Observation)
- ✅ AI-powered UI element detection using computer vision (OpenRouter)
- ✅ Mouse and keyboard automation
- ✅ Screenshot capture and analysis
- ✅ Task history and persistence (Isar database)
- ✅ Multi-step task execution with iteration control
- ✅ Real-time execution logs and thought visualization
- ✅ Responsive desktop interface
- Multi-monitor support
- Task templates and macros
- Voice command input
- Task scheduling and automation
- Error recovery and retry logic
- Plugin system for custom actions
- Export task history to JSON/CSV
The `detectElementPosition` function uses AI vision to locate UI elements. While modern models like Gemini 3.0 are powerful, detection may still be imprecise in some contexts:
- Accuracy: Detection may be off by several pixels depending on the model's interpretation.
- Performance: Vision API calls can have latency.
- Complex UIs: Very dense UIs can still challenge current vision models.
✅ RECOMMENDED APPROACH:
- Use keyboard shortcuts (`pressKeys`) whenever possible; they are much more reliable.
- Use the `getShortcuts` tool to dynamically fetch keyboard shortcuts for applications.
- Use vision detection as a fallback when no keyboard shortcut is available.
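As an illustration of that shortcut-first strategy, a single step might be sketched like this (the Dart signatures for `getShortcuts` and `pressKeys` are assumptions; only the tool names come from the list above):

```dart
// Sketch of the shortcut-first strategy; the signatures are assumed, not the real API.
Future<void> openNewTabShortcutFirst(
  Future<List<List<String>>> Function(String query) getShortcuts,
  Future<void> Function(List<String> keys) pressKeys,
) async {
  // 1. Ask the shortcuts tool for a relevant shortcut instead of hunting for a button.
  final shortcuts = await getShortcuts('new tab in Google Chrome on macOS');

  if (shortcuts.isNotEmpty) {
    // 2. Prefer the keyboard shortcut, e.g. ['cmd', 't'].
    await pressKeys(shortcuts.first);
  } else {
    // 3. Fall back to vision detection + mouse clicks only if no shortcut exists.
  }
}
```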
- "Failed to detect element"
- Ensure the element description is very clear and specific.
- Try taking a screenshot first to verify visibility.
- Use keyboard shortcuts instead of mouse clicks when possible.
- "API key error"
- Verify your OpenRouter API key is valid.
- Update the API key in the app Settings or in `lib/config/app_config.dart`.
- Mouse/keyboard not working
- Grant accessibility permissions to the app (System Preferences → Security & Privacy).
- Check that the `bixat_key_mouse` package is properly installed.
- Restart the application after granting permissions.
Contributions are welcome! Please feel free to submit a Pull Request.
Built with ❤️ using Flutter and OpenRouter