# Cora

Demo video: `Cora.Demo.Updated.1.1.1.mp4`
This project is a Chrome-based browser agent that observes web pages, reasons over the DOM + screenshots using OpenAI models, and executes actions such as clicking, typing, and scrolling.
It combines:
- A FastAPI backend (LLM controller / planner)
- A Chrome MV3 extension (UI observation + action execution)
- Screenshot-based multimodal reasoning
- Overlay indexing for deterministic element selection
This project consists of two main components:

### 1. Chrome Extension (MV3)

**Files**

- `manifest.json`
- `background.js`
- `content.js`
- `llmClient.js`
- `prompts.js`
- `onboarding.html`
- `onboarding.js`
- `onboarding.css`

**Responsibilities**
- Injects content scripts into pages
- Detects clickable elements
- Draws numbered overlays
- Captures screenshots
- Sends structured observations to backend
- Executes validated actions returned by the model
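The element-distillation step (detect clickable elements, assign overlay indices) can be sketched in Python for illustration. The real logic lives in `content.js`; field names like `visible` and `clickable` are assumptions, not the extension's actual schema:

```python
# Illustrative sketch of content.js-style element distillation.
# Input: raw element records scraped from the DOM (field names assumed).
# Output: indexed, visible, interactive elements ready for overlay drawing.

def distill_elements(raw_elements):
    """Keep visible, clickable elements and assign overlay indices."""
    distilled = []
    for el in raw_elements:
        if not el.get("visible") or not el.get("clickable"):
            continue
        distilled.append({
            "index": len(distilled),  # the number drawn in the overlay
            "tag": el.get("tag", ""),
            "text": (el.get("text") or "").strip()[:80],  # truncate long labels
        })
    return distilled

raw = [
    {"tag": "button", "text": "Sign in", "visible": True, "clickable": True},
    {"tag": "div", "text": "decoration", "visible": True, "clickable": False},
    {"tag": "a", "text": "Docs", "visible": True, "clickable": True},
]
elements = distill_elements(raw)
```

Indices are assigned in distilled order so the model can refer to elements deterministically by number rather than by brittle CSS selectors.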
### 2. FastAPI Backend

**File**

- `server.py`

**Responsibilities**
- Receives page context + screenshot
- Sends multimodal prompt to OpenAI
- Returns structured action JSON
**Supported Endpoints**

- `/agent-step` (baseline loop)
- `/execute-step`
- `/status`
- `/summarize`
- `/intent`
- `/profile-answer`
- `/tts`
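The `/agent-step` contract can be illustrated with plain Python dicts. The field names below are a sketch of the kind of JSON exchanged, not the exact schema from `server.py`:

```python
import json

# Hypothetical shape of an /agent-step request body (names are illustrative).
observation = {
    "goal": "Find the pricing page",
    "screenshot": "<base64 PNG>",
    "elements": [
        {"index": 0, "tag": "a", "text": "Pricing"},
        {"index": 1, "tag": "button", "text": "Sign in"},
    ],
    "history": [{"action": "scroll", "direction": "down"}],
    "metadata": {"url": "https://example.com", "title": "Example"},
}

# Hypothetical shape of the structured action the backend returns.
action = {"action": "click_index", "index": 0, "reason": "Link matches the goal"}

# Both sides exchange plain JSON over HTTP.
payload = json.dumps(observation)
```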
### Agent Loop

1. User starts the agent (goal provided)
2. `background.js` requests a page observation
3. `content.js`:
   - Distills interactive elements
   - Shows overlays
   - Returns structured `elements[]` + `pageContext`
4. Background captures a screenshot via `chrome.tabs.captureVisibleTab()`
5. `llmClient.js` sends:
   - goal
   - screenshot (base64)
   - elements
   - action history
   - metadata
6. Backend sends a multimodal request to OpenAI
7. Model returns a structured action: `click_index`, `type_text`, `scroll`, or `finish`
8. Action is validated against an allowlist
9. `content.js` executes the action
10. Loop continues until `finish` or max steps
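The allowlist validation might look like the following minimal sketch. The actual checks in the project may differ; the required fields per action (e.g. that `type_text` carries an element index and a string) are assumptions:

```python
ALLOWED_ACTIONS = {"click_index", "type_text", "scroll", "finish"}

def validate_action(action, num_elements):
    """Reject any model output that is not a well-formed, allowlisted action."""
    name = action.get("action")
    if name not in ALLOWED_ACTIONS:
        return False
    if name == "click_index":
        idx = action.get("index")
        # Index must point at a real distilled element.
        return isinstance(idx, int) and 0 <= idx < num_elements
    if name == "type_text":
        # Assumed shape: a target index plus the text to type.
        return isinstance(action.get("index"), int) and isinstance(action.get("text"), str)
    return True  # scroll / finish need no extra fields in this sketch
```

Validating before execution means a malformed or hallucinated model output is dropped rather than run against the page.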
### Project Structure

```
cora/
│
├── manifest.json
├── background.js
├── content.js
├── llmClient.js
├── prompts.js
│
├── onboarding.html
├── onboarding.js
├── onboarding.css
│
└── server.py
```
### Backend Setup

Create virtual environment:

```bash
python -m venv venv
```

Activate:

macOS / Linux:

```bash
source venv/bin/activate
```

Windows:

```bash
venv\Scripts\activate
```

Install packages:

```bash
pip install fastapi uvicorn openai websockets python-dotenv
```

This project requires your OpenAI key as an environment variable.

macOS / Linux:

```bash
export OPENAI_API_KEY="your-key-here"
```

Windows PowerShell:

```powershell
$env:OPENAI_API_KEY="your-key-here"
```

Start the server:

```bash
uvicorn server:app --reload --port 8000
```

Backend runs at:
http://localhost:8000
### Extension Setup

- Open Chrome
- Navigate to `chrome://extensions`
- Enable Developer Mode
- Click Load unpacked
- Select the project folder (where `manifest.json` lives)
The extension is now active.
### Running the Agent

- Ensure the backend is running on `localhost:8000`
- Open any webpage
- Start the agent via:
  - Extension UI
  - Onboarding interface
  - Custom trigger in your code
The agent will:
- Index elements
- Show overlays
- Begin iterative reasoning loop
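The iterative reasoning loop amounts to an observe → decide → act cycle. A sketch with stubbed `observe`/`decide`/`act` callables (all names hypothetical; the real loop spans `background.js`, `llmClient.js`, and the backend):

```python
MAX_STEPS = 3  # real runs would allow more steps

def run_agent(goal, observe, decide, act, max_steps=MAX_STEPS):
    """Observe the page, ask the model for an action, execute, repeat."""
    history = []
    for _ in range(max_steps):
        obs = observe()                      # page elements + screenshot
        action = decide(goal, obs, history)  # model picks the next action
        history.append(action)
        if action["action"] == "finish":
            break                            # goal reached (or given up)
        act(action)                          # execute in the page
    return history

# Stubbed demo: the "model" scrolls once, then finishes.
script = iter([{"action": "scroll"}, {"action": "finish"}])
history = run_agent(
    goal="demo",
    observe=lambda: {"elements": []},
    decide=lambda g, o, h: next(script),
    act=lambda a: None,
)
```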
### Security & Privacy

This project uses powerful browser permissions:

- `<all_urls>`
- `activeTab`
- `tabs`
- `scripting`
- `webNavigation`
- `storage`
The system captures the visible viewport when making LLM decisions.
Captured data includes:
- Visible screenshot (PNG)
- URL
- Page title
- Indexed elements (text + metadata)
- Action history
Data flow:
Browser → Localhost Backend → OpenAI API
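The backend-to-OpenAI hop can be sketched as building a Chat Completions message that mixes text with an inline base64 screenshot. The exact prompt assembled in `server.py` may differ; this shows one common shape using the `image_url` data-URL format:

```python
import base64

def build_multimodal_message(goal, screenshot_png_bytes, elements):
    """Build one user message mixing text and an inline data-URL screenshot."""
    b64 = base64.b64encode(screenshot_png_bytes).decode("ascii")
    element_text = "\n".join(f"[{e['index']}] {e['text']}" for e in elements)
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": f"Goal: {goal}\nElements:\n{element_text}"},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{b64}"}},
        ],
    }

msg = build_multimodal_message(
    "Find pricing", b"\x89PNG...", [{"index": 0, "text": "Pricing"}]
)
```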
The OpenAI API key is:

- Loaded from `OPENAI_API_KEY`
- Never committed to the repository
- Never stored in the Chrome extension
### Planned Improvements

- Observation caching
- Delta-based DOM diffing
- Plan caching
- Persistent memory layer
- Element ranking (top-K distilled elements)
- Local embedding store
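For the planned top-K element ranking, one simple approach is to score elements with cheap heuristics and keep the K best before sending them to the model. This is purely a sketch; the scoring features (tag type, label presence, a hypothetical past-interaction count) are assumptions:

```python
def rank_elements(elements, k=2):
    """Keep the k highest-scoring elements by simple heuristics."""
    def score(el):
        s = 0
        if el.get("tag") in {"button", "a", "input"}:
            s += 2                 # prefer natively interactive tags
        if el.get("text"):
            s += 1                 # labeled elements are more useful
        s += el.get("clicks", 0)   # hypothetical past-interaction signal
        return s
    return sorted(elements, key=score, reverse=True)[:k]

candidates = [
    {"tag": "div", "text": ""},
    {"tag": "button", "text": "Submit"},
    {"tag": "a", "text": "Help", "clicks": 1},
]
top = rank_elements(candidates)
```

Trimming to the top K shrinks both the prompt and the overlay clutter, at the cost of occasionally hiding the element the model actually needs.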