🎯 Visual Agent

Your Mac, but it actually understands what's on screen

Screen understanding using Apple's Accessibility APIs, soon powered by local LLMs

🚀 Demo • ⚡ Quick Start • 🧠 How It Works • 🔒 Privacy



🎬 Demo

Coming soon: GIF showing the overlay in action, UI tree extraction, and activity timeline

What you'll see:

  • Floating minimal overlay that tracks your work in real time
  • Accessibility API extracting UI elements from any app (buttons, text fields, menus)
  • Activity timeline showing your workflow patterns
  • All running locally at 1 FPS: no cloud, no tracking
  • Soon: Local LLM understanding full screen context

💡 Why This Exists

Most productivity tools are either:

  • 🚫 Invasive (uploading your screen to the cloud)
  • 🚫 Limited (only track app names, not context)
  • 🚫 Closed-source (you have no idea what they're doing with your data)

Visual Agent is different:

  • ✅ 100% local processing using Apple's native Accessibility APIs
  • ✅ Works with ANY application: extracts actual UI structure, not just pixels
  • ✅ Fully open source: audit every line of code
  • ✅ Built by developers, for developers
  • ✅ Coming soon: Local LLM integration for semantic understanding

The Problem It Solves

Ever wondered:

  • "How much time did I actually spend focused today?"
  • "What was I working on 2 hours ago?"
  • "Which apps are killing my productivity?"

Traditional time trackers only see app names. Visual Agent sees structure: the actual UI elements, window layouts, and soon, semantic meaning via local LLMs.


⚡ Quick Start

# Clone and build
git clone https://github.com/peakmojo/macos-visual-agent.git
cd macos-visual-agent
open VisualAgent.xcodeproj

# Grant permissions when prompted (Screen Recording + Accessibility)
# Launch from Xcode or build for Release

That's it. The floating overlay appears in your top-right corner.


🧠 How It Works

The Intelligence Pipeline

Screen Capture (1 FPS)
    ↓
Accessibility APIs → Extract UI tree (buttons, inputs, text, menus)
    ↓
Vision Framework → Capture visible text with coordinates
    ↓
Window Manager → Track active apps & window layouts
    ↓
[COMING SOON] Local LLM → Semantic understanding of screen context
    ↓
In-Memory Processing → Real-time insights (persistence coming soon)
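
The Vision step is the least obvious one, so here is a minimal sketch of how the Vision framework can return visible text with bounding boxes. It is illustrative only (the helper name is arbitrary) and not necessarily how VisionTextExtractor.swift is written:

import Vision
import CoreGraphics

// Sketch: recognize text in a captured frame and return each string with its
// normalized (0...1) bounding box.
func recognizeText(in frame: CGImage, completion: @escaping ([(String, CGRect)]) -> Void) {
    let request = VNRecognizeTextRequest { request, _ in
        let observations = (request.results as? [VNRecognizedTextObservation]) ?? []
        let results = observations.compactMap { observation -> (String, CGRect)? in
            guard let best = observation.topCandidates(1).first else { return nil }
            return (best.string, observation.boundingBox)
        }
        completion(results)
    }
    request.recognitionLevel = .accurate

    let handler = VNImageRequestHandler(cgImage: frame, options: [:])
    try? handler.perform([request])
}

Vision reports coordinates in a normalized, bottom-left-origin space, so they have to be flipped and scaled into screen coordinates before they line up with the UI tree.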

Under the Hood

This isn't some Electron wrapper running a Chrome browser. It's pure native Swift using Apple's most powerful APIs:

Framework → What it does

  • Accessibility APIs → extracts the complete UI tree: every button, text field, and menu with exact coordinates
  • ScreenCaptureKit → captures the display at 1 FPS (macOS 12.3+) for visual context
  • Vision Framework → detects text regions and extracts content with word-level bounding boxes
  • SwiftUI → native, buttery-smooth 120Hz interface
  • [Coming] Local LLM → Ollama/MLX integration for semantic screen understanding
  • [Coming] Vector DB → persistent storage with semantic search capabilities

Performance: Uses ~50MB RAM, <2% CPU on Apple Silicon.
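
For reference, the "1 FPS" figure maps onto a single ScreenCaptureKit setting. The snippet below is a sketch under assumed names, not the project's actual ScreenCaptureManager.swift:

import ScreenCaptureKit
import CoreMedia

struct NoDisplayError: Error {}

// Sketch: build an SCStream for the main display, capped at one frame per second.
func makeOneFPSStream() async throws -> SCStream {
    let content = try await SCShareableContent.excludingDesktopWindows(false, onScreenWindowsOnly: true)
    guard let display = content.displays.first else { throw NoDisplayError() }

    let filter = SCContentFilter(display: display, excludingWindows: [])
    let config = SCStreamConfiguration()
    config.minimumFrameInterval = CMTime(value: 1, timescale: 1)  // at least 1 s between frames
    config.width = display.width
    config.height = display.height

    return SCStream(filter: filter, configuration: config, delegate: nil)
}

A real capture path would then attach an SCStreamOutput and start the stream; frames arrive no faster than the configured interval, which is what keeps CPU usage low.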


🎨 What Makes This Special

1. Accessibility-First Architecture

Unlike screen scraping or pixel-based hacks, Visual Agent uses Apple's Accessibility APIs, the same system VoiceOver uses:

  • Extract complete UI hierarchies from any app
  • Get precise element types (button, checkbox, text field, etc.)
  • Know exact coordinates and states
  • Work even with apps built on custom UI frameworks

Why this matters: this approach is far more reliable than pixel-based scraping, and it works with native apps, Electron apps, and web apps alike.
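
As a rough illustration, and assuming the Accessibility permission has already been granted, walking an app's UI tree boils down to a few AX calls (a sketch, not AccessibilityAnalyzer.swift itself):

import AppKit
import ApplicationServices

// Sketch: recursively print role and title for every element in an app's AX tree.
func dumpUITree(of element: AXUIElement, depth: Int = 0) {
    var role: CFTypeRef?
    var title: CFTypeRef?
    AXUIElementCopyAttributeValue(element, kAXRoleAttribute as CFString, &role)
    AXUIElementCopyAttributeValue(element, kAXTitleAttribute as CFString, &title)
    let indent = String(repeating: "  ", count: depth)
    print("\(indent)\(role as? String ?? "?") \(title as? String ?? "")")

    var children: CFTypeRef?
    AXUIElementCopyAttributeValue(element, kAXChildrenAttribute as CFString, &children)
    for child in (children as? [AXUIElement]) ?? [] {
        dumpUITree(of: child, depth: depth + 1)
    }
}

// Start from whatever app is frontmost.
func dumpFrontmostApp() {
    guard let pid = NSWorkspace.shared.frontmostApplication?.processIdentifier else { return }
    dumpUITree(of: AXUIElementCreateApplication(pid))
}

A production version would first check AXIsProcessTrusted() and prompt for the permission if it returns false.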

2. Local LLM Ready

The architecture is designed for local AI integration:

Screen Context → UI Tree + Visual Data → Local LLM → Semantic Understanding

Imagine:

  • "Show me all code-related activity from yesterday"
  • "What documentation was I reading when I wrote this function?"
  • "Auto-tag my work sessions by project context"

Privacy-first: Your screen data never leaves your Mac. Run Ollama or MLX models locally.
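
Nothing below ships yet. It is only a sketch of what the planned integration could look like against an Ollama server running on this machine (its /api/generate endpoint on localhost:11434); the model name and prompt are placeholders:

import Foundation

// Sketch of the planned local-LLM step: send screen context to a local Ollama
// instance. Everything stays on localhost.
struct OllamaRequest: Codable {
    let model: String
    let prompt: String
    let stream: Bool
}

func summarize(screenContext: String) async throws -> String {
    let body = OllamaRequest(
        model: "llama3.2",   // any locally pulled model; the name is illustrative
        prompt: "Summarize this screen context in one sentence:\n\(screenContext)",
        stream: false
    )

    var request = URLRequest(url: URL(string: "http://localhost:11434/api/generate")!)
    request.httpMethod = "POST"
    request.setValue("application/json", forHTTPHeaderField: "Content-Type")
    request.httpBody = try JSONEncoder().encode(body)

    let (data, _) = try await URLSession.shared.data(for: request)
    // Ollama's non-streaming reply carries the generated text in a "response" field.
    let json = try JSONSerialization.jsonObject(with: data) as? [String: Any]
    return json?["response"] as? String ?? ""
}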

3. Ridiculously Extensible

Want to build:

  • A personal search engine of everything you've seen?
  • A RAG system with your entire work context?
  • An AI copilot that sees your full development environment?
  • Context-aware automation ("when Figma opens, show design system docs")?

You can. The architecture is modular: just plug into ContextStreamManager (a hypothetical consumer is sketched below).
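
The hook points are still settling, so treat the following as a purely hypothetical sketch: the ScreenContext payload and the publisher it subscribes to are assumptions, meant only to show the shape a custom consumer of ContextStreamManager could take.

import Combine
import Foundation

// Hypothetical payload; the real ContextStreamManager output may differ.
struct ScreenContext {
    let appName: String
    let windowTitle: String
    let visibleText: [String]
}

// Hypothetical consumer: log whenever a keyword shows up on screen.
final class KeywordLogger {
    private var subscription: AnyCancellable?

    func attach(to contexts: AnyPublisher<ScreenContext, Never>, keyword: String) {
        subscription = contexts
            .filter { ctx in ctx.visibleText.contains { $0.localizedCaseInsensitiveContains(keyword) } }
            .sink { ctx in
                print("Saw '\(keyword)' in \(ctx.appName): \(ctx.windowTitle)")
            }
    }
}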


๐Ÿ” Privacy

What This App Does NOT Do

  • โŒ No keystroke logging (removed in security audit)
  • โŒ No network requests (check the codeโ€”zero external API calls)
  • โŒ No cloud uploads (everything stays on your Mac)
  • โŒ No telemetry or tracking (not even anonymous analytics)
  • โŒ No persistent storage yet (currently in-memory only)

What It DOES Collect (100% Locally, In-Memory)

  • ✅ UI element metadata → button labels, text fields, window titles
  • ✅ Screen captures → processed for visual context, then discarded
  • ✅ Text regions → what's visible and where
  • ✅ App usage patterns → which apps you're using when

Current state: All data is in-memory and lost when you quit the app.

Coming soon: Optional local persistence with vector embeddings for semantic search.

For the Paranoid (We Love You)

# Verify zero network activity
sudo lsof -i -P | grep VisualAgent  # Should return nothing

# Audit the code yourself
grep -r "URLSession\|fetch\|http" VisualAgent/  # Zero API calls

# Check for any data files
find ~/Library -name "*visualagent*" -o -name "*VisualAgent*"  # Currently none

When LLM support arrives: All inference runs locally via Ollama or MLX. Your data never touches the internet.


🚀 Use Cases

For Developers:

  • Track context switches: Xcode → docs → StackOverflow → Slack
  • See actual productivity patterns beyond "Chrome was open for 4 hours"
  • Build AI dev tools that understand your full environment
  • Create a personal knowledge base from your screen history

For Researchers:

  • Study UI/UX patterns in real applications
  • Log screen interactions for user studies (with consent)
  • Analyze accessibility compliance across apps
  • Build datasets of human-computer interaction

For Hackers & Tinkerers:

  • Personal "time machine" search of everything you've seen
  • Auto-journal your workday based on actual screen context
  • Context-aware automation and workflows
  • Train local AI models on your work patterns

๐Ÿ› ๏ธ Architecture for Contributors

VisualAgent/
├── 📸 ScreenCaptureManager.swift      → ScreenCaptureKit wrapper (1 FPS)
├── 🎯 AccessibilityAnalyzer.swift     → UI tree extraction via AX APIs
├── 🧠 VisionTextExtractor.swift       → Text detection with coordinates
├── 🔄 ContextStreamManager.swift      → Pipeline coordinator
├── 🎨 ContentView.swift               → SwiftUI overlay interface
└── [Coming] LLMContextProcessor.swift → Local LLM integration

Current Stack:

  • Accessibility APIs for UI structure
  • Vision framework for text + coordinates
  • Native screen capture at 1 FPS
  • In-memory state management

Coming Soon:

  • Ollama integration for local LLaMA/Mistral models
  • MLX support for Apple Silicon optimized inference
  • Vector database for persistent semantic search
  • RAG pipeline for screen memory

Want to contribute?

  • Help integrate Ollama/MLX for local LLM support
  • Design the persistence layer (vector DB, embeddings)
  • Build export plugins (Obsidian, Notion, CSV)
  • Create visualization tools for screen timelines
  • Add multi-monitor support

No webpack. No npm. No bullshit. Just Swift + Xcode.


🎯 Roadmap

Phase 1: Native Foundation (✅ Done)

  • Screen capture at 1 FPS via ScreenCaptureKit
  • Accessibility API integration for UI tree extraction
  • Vision framework for text detection
  • Floating overlay interface
  • In-memory activity tracking

Phase 2: Persistence & Intelligence (🚧 Next)

  • Local persistence layer (vector DB or similar)
  • Smart activity categorization (coding vs. browsing vs. meetings)
  • Export formats (JSON, CSV, Markdown)
  • Data retention policies (auto-cleanup after N days)
  • Encryption for stored data

Phase 3: Local AI Superpowers (💡 Planned)

  • Ollama integration → Run LLaMA 3.3 or Mistral locally
  • MLX support → Apple Silicon optimized inference
  • Vector embeddings → Semantic understanding of screen context (see the sketch after this list)
  • RAG pipeline → "Search everything I've seen this week about React hooks"
  • Semantic tagging → Auto-categorize sessions by project/topic
  • Context-aware insights → "You're most productive in morning sessions when..."
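
To make the vector-embeddings bullet concrete, here is a toy sketch of the retrieval half only, assuming frames have already been embedded by some local model; in-memory cosine similarity is the simplest stand-in for what a real vector DB would do:

import Foundation

// Toy sketch: rank remembered frames by cosine similarity to a query embedding.
// Producing the embeddings (via a local model) is out of scope here.
struct RememberedFrame {
    let summary: String          // e.g. "Reading React hooks docs in Safari"
    let embedding: [Float]
}

func cosineSimilarity(_ a: [Float], _ b: [Float]) -> Float {
    let dot = zip(a, b).reduce(0) { $0 + $1.0 * $1.1 }
    let normA = a.reduce(0) { $0 + $1 * $1 }.squareRoot()
    let normB = b.reduce(0) { $0 + $1 * $1 }.squareRoot()
    guard normA > 0, normB > 0 else { return 0 }
    return dot / (normA * normB)
}

func search(_ query: [Float], in frames: [RememberedFrame], topK: Int = 5) -> [RememberedFrame] {
    return frames
        .sorted { cosineSimilarity(query, $0.embedding) > cosineSimilarity(query, $1.embedding) }
        .prefix(topK)
        .map { $0 }
}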

Phase 4: Ecosystem (🔮 Future)

  • Plugin system for custom analyzers
  • Multi-display support
  • Team features (opt-in collaborative insights)
  • API for third-party integrations

๐Ÿค Contributing

We want your ideas, especially around local LLM integration.

Working on Ollama/MLX/llama.cpp? Want to help make this the best local-first productivity tool?

  1. Open an issue first: let's discuss your idea
  2. Fork & build your changes
  3. Submit a PR with a clear description

Priority areas:

  • Local LLM integration (Ollama, MLX)
  • Persistence layer design (vector DB, embeddings)
  • RAG implementation for screen memory
  • Privacy-preserving analytics
  • Export & visualization tools

No bureaucracy. No corporate approval. Just ship.


📜 License

MIT License: build products, fork it, sell it, integrate it, we don't care.

Just keep it local. Keep it open. Keep it honest.


โญ If local-first AI tools matter to you, star this.

๐Ÿด If you want to build something with screen context, fork it.

🚀 If you're working on local LLMs, let's collaborate.


Built with โค๏ธ by developers who believe your screen data belongs on YOUR machine.

Not in some cloud. Not feeding someone's AI training pipeline. Just yours.

Report Bug • Request Feature • Discussions