Screen understanding using Apple's Accessibility APIs, soon powered by local LLMs
Demo • Quick Start • How It Works • Privacy
Coming soon: GIF showing the overlay in action, UI tree extraction, and activity timeline
What you'll see:
- Floating minimal overlay that tracks your work in real-time
- Accessibility API extracting UI elements from any app (buttons, text fields, menus)
- Activity timeline showing your workflow patterns
- All running locally at 1 FPS, with no cloud and no tracking
- Soon: Local LLM understanding full screen context
Most productivity tools fall into one of three traps:
- 🚫 Invasive (uploading your screen to the cloud)
- 🚫 Limited (only tracking app names, not context)
- 🚫 Closed-source (you have no idea what they're doing with your data)
Visual Agent is different:
- ✅ 100% local processing using Apple's native Accessibility APIs
- ✅ Works with ANY application: extracts actual UI structure, not just pixels
- ✅ Fully open source: audit every line of code
- ✅ Built by developers, for developers
- ✅ Coming soon: Local LLM integration for semantic understanding
Ever wondered:
- "How much time did I actually spend focused today?"
- "What was I working on 2 hours ago?"
- "Which apps are killing my productivity?"
Traditional time trackers only see app names. Visual Agent sees structure: the actual UI elements, window layouts, and, soon, semantic meaning via local LLMs.
# Clone and build
git clone https://github.com/yourusername/macos-visual-agent.git
cd macos-visual-agent
open VisualAgent.xcodeproj
# Grant permissions when prompted (Screen Recording + Accessibility)
# Launch from Xcode or build for Release

That's it. The floating overlay appears in your top-right corner.
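The permission prompts come from macOS itself. If you're curious how an app can check both at launch, here's a minimal sketch (an assumed approach, not necessarily what VisualAgent ships) using the standard Accessibility and Screen Recording checks:

```swift
import ApplicationServices
import CoreGraphics

// Sketch: check both permissions and prompt for whichever is missing.
func ensurePermissions() -> Bool {
    // Accessibility: passing the prompt option shows the system dialog if not yet granted.
    let axOptions = [kAXTrustedCheckOptionPrompt.takeUnretainedValue() as String: true] as CFDictionary
    let axGranted = AXIsProcessTrustedWithOptions(axOptions)

    // Screen Recording: preflight silently, then request (shows the system dialog once).
    let screenGranted = CGPreflightScreenCaptureAccess() || CGRequestScreenCaptureAccess()

    return axGranted && screenGranted
}
```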
Screen Capture (1 FPS)
↓
Accessibility APIs → Extract UI tree (buttons, inputs, text, menus)
↓
Vision Framework → Capture visible text with coordinates
↓
Window Manager → Track active apps & window layouts
↓
[COMING SOON] Local LLM → Semantic understanding of screen context
↓
In-Memory Processing → Real-time insights (persistence coming soon)
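To make the first stage concrete, here's a minimal, self-contained sketch of capturing the main display at 1 FPS with ScreenCaptureKit. It's an assumed shape for illustration, not the repo's actual ScreenCaptureManager:

```swift
import ScreenCaptureKit
import CoreMedia

// Sketch: stream the main display at 1 FPS and hand each frame to a callback.
final class FrameGrabber: NSObject, SCStreamOutput {
    private var stream: SCStream?
    var onFrame: ((CMSampleBuffer) -> Void)?

    func start() async throws {
        let content = try await SCShareableContent.current
        guard let display = content.displays.first else { return }

        let config = SCStreamConfiguration()
        config.minimumFrameInterval = CMTime(value: 1, timescale: 1) // 1 FPS
        config.width = display.width
        config.height = display.height

        let filter = SCContentFilter(display: display, excludingWindows: [])
        let stream = SCStream(filter: filter, configuration: config, delegate: nil)
        try stream.addStreamOutput(self, type: .screen, sampleHandlerQueue: .main)
        try await stream.startCapture()
        self.stream = stream
    }

    // SCStreamOutput callback: one CMSampleBuffer per captured frame.
    func stream(_ stream: SCStream, didOutputSampleBuffer sampleBuffer: CMSampleBuffer,
                of type: SCStreamOutputType) {
        guard type == .screen else { return }
        onFrame?(sampleBuffer)
    }
}
```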
This isn't some Electron wrapper shipping an entire Chrome browser. It's pure native Swift using Apple's most powerful APIs:
| Framework | What It Does |
|---|---|
| Accessibility APIs | Extracts the complete UI tree: every button, text field, and menu with exact coordinates |
| ScreenCaptureKit | Captures display at 1 FPS (macOS 12.3+) for visual context |
| Vision Framework | Detects text regions and extracts content with word-level bounding boxes |
| SwiftUI | Native, buttery-smooth 120Hz interface |
| [Coming] Local LLM | Ollama/MLX integration for semantic screen understanding |
| [Coming] Vector DB | Persistent storage with semantic search capabilities |
Performance: Uses ~50MB RAM, <2% CPU on Apple Silicon.
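As a concrete example of the Vision row above, here's a small sketch (assumed for illustration, not the repo's VisionTextExtractor) that recognizes text in a captured frame and keeps each string with its normalized bounding box:

```swift
import Vision
import CoreGraphics

// Sketch: run text recognition on one frame and return (string, normalized box) pairs.
func extractText(from image: CGImage) throws -> [(text: String, box: CGRect)] {
    let request = VNRecognizeTextRequest()
    request.recognitionLevel = .fast // favor the 1 FPS budget over maximum accuracy
    try VNImageRequestHandler(cgImage: image, options: [:]).perform([request])

    return (request.results ?? []).compactMap { observation -> (text: String, box: CGRect)? in
        guard let candidate = observation.topCandidates(1).first else { return nil }
        // boundingBox is normalized (0...1) with the origin at the bottom-left.
        return (text: candidate.string, box: observation.boundingBox)
    }
}
```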
Unlike screen scraping or pixel-based hacks, Visual Agent uses Apple's Accessibility APIs, the same system VoiceOver relies on:
- Extract complete UI hierarchies from any app
- Get precise element types (button, checkbox, text field, etc.)
- Know exact coordinates and states
- Works even with custom UI frameworks
Why this matters: this is far more accurate than pixel-based methods, and it works with native apps, Electron apps, and web apps alike.
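Here's roughly what that looks like in code: a minimal sketch (not the repo's AccessibilityAnalyzer) that walks the frontmost app's accessibility tree and prints each element's role and title. It requires the Accessibility permission:

```swift
import ApplicationServices
import AppKit

// Sketch: read one attribute from an accessibility element, returning nil on failure.
func copyAttribute(_ element: AXUIElement, _ name: String) -> CFTypeRef? {
    var value: CFTypeRef?
    guard AXUIElementCopyAttributeValue(element, name as CFString, &value) == .success else { return nil }
    return value
}

// Recursively print role and title for every element in the tree.
func dumpTree(_ element: AXUIElement, depth: Int = 0) {
    let role = copyAttribute(element, kAXRoleAttribute) as? String ?? "?"
    let title = copyAttribute(element, kAXTitleAttribute) as? String ?? ""
    print(String(repeating: "  ", count: depth) + "\(role) \(title)")

    if let children = copyAttribute(element, kAXChildrenAttribute) as? [AXUIElement] {
        for child in children {
            dumpTree(child, depth: depth + 1)
        }
    }
}

// Usage: walk whatever app is currently frontmost.
if let app = NSWorkspace.shared.frontmostApplication {
    dumpTree(AXUIElementCreateApplication(app.processIdentifier))
}
```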
The architecture is designed for local AI integration:
Screen Context → UI Tree + Visual Data → Local LLM → Semantic Understanding
Imagine:
- "Show me all code-related activity from yesterday"
- "What documentation was I reading when I wrote this function?"
- "Auto-tag my work sessions by project context"
Privacy-first: Your screen data never leaves your Mac. Run Ollama or MLX models locally.
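Nothing below exists in the repo yet; it's only a sketch of what that planned step could look like, talking to a locally running Ollama server over loopback (the model name and prompt are placeholders):

```swift
import Foundation

// Ollama's local HTTP API returns the generated text in a "response" field.
struct OllamaResponse: Decodable { let response: String }

// Sketch: send a summarized UI tree to a local model and get back a plain-language description.
func describeScreen(uiTreeSummary: String) async throws -> String {
    var request = URLRequest(url: URL(string: "http://localhost:11434/api/generate")!)
    request.httpMethod = "POST"
    request.setValue("application/json", forHTTPHeaderField: "Content-Type")

    let payload: [String: Any] = [
        "model": "llama3.2",                    // placeholder: any model you've pulled locally
        "prompt": "Describe what the user is doing:\n\(uiTreeSummary)",
        "stream": false
    ]
    request.httpBody = try JSONSerialization.data(withJSONObject: payload)

    let (data, _) = try await URLSession.shared.data(for: request)
    return try JSONDecoder().decode(OllamaResponse.self, from: data).response
}
```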
Want to build:
- A personal search engine of everything you've seen?
- RAG system with your entire work context?
- AI copilot that sees your full development environment?
- Context-aware automation ("when Figma opens, show design system docs")?
You can. The architecture is modular: just plug into ContextStreamManager.
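ContextStreamManager's exact API isn't documented in this README, so treat the following as a hypothetical illustration of the kind of hook a plugin might implement; ScreenContextFrame, ContextConsumer, and ContextSwitchLogger are made-up names for this sketch:

```swift
import Foundation

// Hypothetical shape of one frame of screen context emitted by the pipeline.
struct ScreenContextFrame {
    let timestamp: Date
    let appBundleID: String
    let uiElements: [String]   // flattened element descriptions
    let visibleText: [String]
}

// Hypothetical plugin hook: anything that wants the context stream implements this.
protocol ContextConsumer {
    func consume(_ frame: ScreenContextFrame)
}

// Example consumer: log every context switch between apps.
final class ContextSwitchLogger: ContextConsumer {
    private var lastApp: String?

    func consume(_ frame: ScreenContextFrame) {
        if frame.appBundleID != lastApp {
            print("[\(frame.timestamp)] switched to \(frame.appBundleID)")
            lastApp = frame.appBundleID
        }
    }
}
```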
- ❌ No keystroke logging (removed in a security audit)
- ❌ No network requests (check the code: zero external API calls)
- ❌ No cloud uploads (everything stays on your Mac)
- ❌ No telemetry or tracking (not even anonymous analytics)
- ❌ No persistent storage yet (currently in-memory only)
- ✅ UI element metadata: button labels, text fields, window titles
- ✅ Screen captures: processed for visual context, then discarded
- ✅ Text regions: what's visible and where
- ✅ App usage patterns: which apps you're using, and when
Current state: All data is in-memory and lost when you quit the app.
Coming soon: Optional local persistence with vector embeddings for semantic search.
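For a concrete picture of what "in-memory only" means, here's a hypothetical capped buffer holding the metadata listed above; the type names are invented for this sketch, and nothing is ever written to disk:

```swift
import Foundation

// Hypothetical snapshot of the metadata categories above.
struct ContextSnapshot {
    let timestamp: Date
    let appName: String
    let windowTitle: String
    let elementLabels: [String]   // button labels, field names, menu items
    let visibleText: [String]
}

// Keeps only the most recent N snapshots; everything vanishes when the process exits.
final class InMemoryContextStore {
    private var snapshots: [ContextSnapshot] = []
    private let capacity: Int

    init(capacity: Int = 3600) { self.capacity = capacity } // about one hour at 1 FPS

    func append(_ snapshot: ContextSnapshot) {
        snapshots.append(snapshot)
        if snapshots.count > capacity {
            snapshots.removeFirst(snapshots.count - capacity)
        }
    }

    func recent(_ count: Int) -> [ContextSnapshot] {
        Array(snapshots.suffix(count))
    }
}
```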
# Verify zero network activity
sudo lsof -i -P | grep VisualAgent # Should return nothing
# Audit the code yourself
grep -r "URLSession\|fetch\|http" VisualAgent/ # Zero API calls
# Check for any data files
find ~/Library -name "*visualagent*" -o -name "*VisualAgent*"  # Currently none

When LLM support arrives: All inference runs locally via Ollama or MLX. Your data never touches the internet.
For Developers:
- Track context switches: Xcode → docs → StackOverflow → Slack
- See actual productivity patterns beyond "Chrome was open for 4 hours"
- Build AI dev tools that understand your full environment
- Create personal knowledge base from your screen history
For Researchers:
- Study UI/UX patterns in real applications
- Log screen interactions for user studies (with consent)
- Analyze accessibility compliance across apps
- Build datasets of human-computer interaction
For Hackers & Tinkerers:
- Personal "time machine" search of everything you've seen
- Auto-journal your workday based on actual screen context
- Context-aware automation and workflows
- Train local AI models on your work patterns
VisualAgent/
├── ScreenCaptureManager.swift → ScreenCaptureKit wrapper (1 FPS)
├── AccessibilityAnalyzer.swift → UI tree extraction via AX APIs
├── VisionTextExtractor.swift → Text detection with coordinates
├── ContextStreamManager.swift → Pipeline coordinator
├── ContentView.swift → SwiftUI overlay interface
└── [Coming] LLMContextProcessor.swift → Local LLM integration
Current Stack:
- Accessibility APIs for UI structure
- Vision framework for text + coordinates
- Native screen capture at 1 FPS
- In-memory state management
Coming Soon:
- Ollama integration for local LLaMA/Mistral models
- MLX support for Apple Silicon optimized inference
- Vector database for persistent semantic search
- RAG pipeline for screen memory
Want to contribute?
- Help integrate Ollama/MLX for local LLM support
- Design the persistence layer (vector DB, embeddings)
- Build export plugins (Obsidian, Notion, CSV)
- Create visualization tools for screen timelines
- Multi-monitor support
No webpack. No npm. No bullshit. Just Swift + Xcode.
Phase 1: Native Foundation (Done)
- Screen capture at 1 FPS via ScreenCaptureKit
- Accessibility API integration for UI tree extraction
- Vision framework for text detection
- Floating overlay interface
- In-memory activity tracking
Phase 2: Persistence & Intelligence (Next)
- Local persistence layer (vector DB or similar)
- Smart activity categorization (coding vs. browsing vs. meetings)
- Export formats (JSON, CSV, Markdown)
- Data retention policies (auto-cleanup after N days)
- Encryption for stored data
Phase 3: Local AI Superpowers (Planned)
- Ollama integration: run LLaMA 3.3 and Mistral locally
- MLX support: Apple Silicon optimized inference
- Vector embeddings: semantic understanding of screen context
- RAG pipeline: "Search everything I've seen this week about React hooks"
- Semantic tagging: auto-categorize sessions by project/topic
- Context-aware insights: "You're most productive in morning sessions when..."
Phase 4: Ecosystem (Future)
- Plugin system for custom analyzers
- Multi-display support
- Team features (opt-in collaborative insights)
- API for third-party integrations
We want your ideas, especially around local LLM integration.
Working on Ollama/MLX/llama.cpp? Want to help make this the best local-first productivity tool?
- Open an issue first so we can discuss your idea
- Fork & build your changes
- Submit a PR with a clear description
Priority areas:
- Local LLM integration (Ollama, MLX)
- Persistence layer design (vector DB, embeddings)
- RAG implementation for screen memory
- Privacy-preserving analytics
- Export & visualization tools
No bureaucracy. No corporate approval. Just ship.
MIT License: build products, fork it, sell it, integrate it. We don't care.
Just keep it local. Keep it open. Keep it honest.
Built with ❤️ by developers who believe your screen data belongs on YOUR machine.
Not in some cloud. Not feeding someone's AI training pipeline. Just yours.
Report Bug • Request Feature • Discussions