A modern org-mode database with semantic search capabilities. Built with a hybrid architecture: Emacs handles org parsing and UI, while a Python FastAPI backend provides storage, embeddings, and vector search.
- Semantic Search: Find org content by meaning using embeddings
- Full-Text Search: Fast keyword search with SQLite FTS5 and snippets
- Image Search: Find images by text description using CLIP embeddings
- Headline Search: Browse and jump to org headlines across all files
- Scoped Search: Limit searches to specific directories, projects (Projectile), or keywords/tags
- Agenda: View TODOs, deadlines, and scheduled items from all indexed files
- Non-blocking Indexing: Queue-based indexing with configurable delays keeps Emacs responsive
- Linked File Indexing: (Experimental/disabled) Can index linked documents (PDF, DOCX, PPTX, etc.); needs optimization for large collections
- gptel Integration: Expose search tools to LLMs for AI-powered org file exploration
- Transient UI: Easy-to-use menu system for all commands (H-v or M-x org-db)
- Web Interface: Browse statistics and documentation via browser
┌─────────────────┐         ┌──────────────────────┐
│  Emacs (UI)     │◄───────►│  FastAPI Server      │
│  - org parsing  │  HTTP   │  - SQLite storage    │
│  - search UI    │         │  - embeddings (ML)   │
│  - navigation   │         │  - vector search     │
│  - transient    │         │  - CLIP (images)     │
└─────────────────┘         └──────────────────────┘
- Emacs 28.1+
- Python 3.10+
- uv (Python package manager)
cd python
uv sync

This installs all dependencies including:
- FastAPI & Uvicorn (web server)
- sentence-transformers (text embeddings)
- CLIP (image embeddings)
- SQLite (database)
Add to your Emacs config:
(add-to-list 'load-path "/path/to/org-db-v3/elisp")
(require 'org-db-v3)
;; Bind the menu to a convenient key
(global-set-key (kbd "H-v") 'org-db-menu)
;; Auto-start server when Emacs starts (optional)
(setq org-db-v3-auto-start-server t)
;; Optional: Enable gptel integration for AI-powered search
(when (featurep 'gptel)
  (require 'org-db-v3-gptel-tools)
  (org-db-v3-gptel-register-tools))

Required Emacs packages:
- plz (async HTTP client)
- transient (menu system)
Optional packages:
- gptel (for AI-powered search integration)
Install via package manager or manually.
From the python directory:
cd python
uv run uvicorn org_db_server.main:app --reload --port 8765 > /tmp/org-db-server.log 2>&1 &

Or from Emacs:
M-x org-db-v3-start-server

M-x org-db
;; or press H-v if you bound it

Choose from the menu:
- d: Index entire directory (recursive, non-blocking)
- u: Update current file
- U: Update all open org buffers
- r: Reindex all files in database
Or manually:
M-x org-db-v3-index-directory

Files are processed one at a time with configurable delays to keep Emacs responsive.
M-x org-db-v3-semantic-search
;; or press 'v' in the menu

Enter a query like “machine learning projects” and find content by semantic similarity.
M-x org-db-v3-fulltext-search
;; or press 'k' in the menu

M-x org-db-v3-search-at-point
;; or press 'p' in the menu

Uses selected region or sentence at point as the search query.
M-x org-db-v3-image-search
;; or press 'i' in the menu

Find images by text description using CLIP embeddings.
M-x org-db-v3-headline-search
;; or press 'h' in the menu
;; With prefix arg, choose sort order interactively
C-u M-x org-db-v3-headline-search

Browse all headlines with completing-read interface.
Sort Order Options:
- last_updated: Most recently updated files first (default)
- filename: Alphabetical by filename
- indexed_at: Most recently indexed files first
You can customize the default sort order:
(setq org-db-v3-headline-sort-order "last_updated") ; default
;; or "filename" or "indexed_at"

All searches (semantic, fulltext, headline, image) can be limited to specific subsets of your indexed files. The scope setting applies to the next search only and then resets to “All files”.
From the main menu (H-v), use the scope options:
- -a: Search all files (default)
- -d: Limit to a specific directory
- -p: Limit to a Projectile project (prompts to select from known projects)
- -t: Limit to files with a specific keyword/tag
Example workflow:
- Press H-v to open the menu
- Press -p to select a project scope
- Choose “my-notes” from the project list
- Press v for semantic search
- Enter your query; results will only come from the “my-notes” project
- The next search will automatically reset to “All files”
The current scope is shown in the menu header: org-db v3 [Scope: Project: my-notes]
Note: Scope options use dash prefixes (-a, -d, -p, -t) to distinguish them from action keys.
M-x org-db-v3-agenda
;; or press 'a' in the menu

View TODO items with deadlines and scheduled dates from indexed files.
By default, shows items due in the next 2 weeks. Use a prefix argument (C-u M-x org-db-v3-agenda) to specify a custom date range:
- +2w: 2 weeks (default)
- +1m: 1 month
- +3w: 3 weeks
- 2025-12-31: specific date
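As a rough illustration of how such range specs could be interpreted, here is a minimal Python sketch. It is not the package's actual parser (the prompt is handled on the Emacs side); the function name `parse_range` and the 30-day approximation of a month are assumptions made for this example.

```python
import re
from datetime import date, timedelta


def parse_range(spec: str, today: date) -> date:
    """Return the end date for a spec such as '+2w', '+1m', or '2025-12-31'."""
    match = re.fullmatch(r"\+(\d+)([wm])", spec)
    if match is None:
        # Not a relative offset, so treat it as an absolute ISO date.
        return date.fromisoformat(spec)
    count, unit = int(match.group(1)), match.group(2)
    # Assumption: a month is approximated as 30 days.
    days_per_unit = {"w": 7, "m": 30}
    return today + timedelta(days=count * days_per_unit[unit])


today = date(2025, 1, 1)
print(parse_range("+2w", today))         # 2025-01-15
print(parse_range("+1m", today))         # 2025-01-31
print(parse_range("2025-12-31", today))  # 2025-12-31

# A day count like this could feed the "days_ahead" field of POST /api/agenda.
print((parse_range("+2w", today) - today).days)  # 14
```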
org-db v3 has the capability to index linked documents (PDF, DOCX, PPTX, etc.) found in your org files, but this feature needs significant work to handle large-scale use cases efficiently.
With large document collections (5000+ linked files), linked file indexing can cause:
- Database bloat: 29.5 GB database from paragraph-level chunking
- Slow searches: 30+ second timeouts or server crashes
- 360,000+ chunk embeddings: From aggressive per-paragraph chunking
The current implementation chunks each paragraph separately, creating too many embeddings for large PDF collections.
The feature is disabled by default in python/org_db_server/config.py:
enable_linked_files: bool = False # Disabled to prevent database bloat

Before enabling for large collections, the following improvements are needed:
- File-level embeddings: Aggregate chunks into single embeddings (98% size reduction)
- Smarter chunking: Use larger fixed-size chunks instead of paragraph chunking
- Selective indexing: Only index important/recent documents
- Better limits: Enforce per-file chunk limits and file size limits
See python/LINKED_FILES_OPTIMIZATION.md and python/EMBEDDING_AGGREGATION_STRATEGIES.md for detailed analysis and solutions.
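To make the file-level aggregation idea concrete, the sketch below averages a file's chunk vectors into a single vector (mean pooling) and checks the size-reduction figure quoted in this README. Mean pooling is only one possible strategy and is an assumption here; the strategy documents referenced above may recommend something different, and `mean_pool` is a hypothetical helper.

```python
from typing import List


def mean_pool(chunk_vectors: List[List[float]]) -> List[float]:
    """Average equal-length chunk vectors into one file-level vector."""
    if not chunk_vectors:
        raise ValueError("need at least one chunk vector")
    dim = len(chunk_vectors[0])
    totals = [0.0] * dim
    for vector in chunk_vectors:
        if len(vector) != dim:
            raise ValueError("all chunk vectors must have the same length")
        for i, value in enumerate(vector):
            totals[i] += value
    return [total / len(chunk_vectors) for total in totals]


# Three toy 2-dimensional chunk vectors collapse into one file vector.
file_vector = mean_pool([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
print(file_vector)

# Figures quoted in this README: about 360,000 chunk embeddings
# reduced to about 6,000 file-level embeddings.
reduction = 1 - 6_000 / 360_000
print(f"{reduction:.1%}")  # 98.3%
```

The trade-off is that a single averaged vector can no longer point to a specific passage inside the document, which is why a hybrid file-level plus chunk-level search appears on the roadmap.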
If you have a small number of linked files (<100), you can enable it:
export ORG_DB_ENABLE_LINKED_FILES=true
export ORG_DB_MAX_LINKED_FILE_SIZE_MB=20 # Skip large files
export ORG_DB_MAX_LINKED_FILE_CHUNKS=50 # Limit chunks per file

Or edit python/org_db_server/config.py and restart the server.
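For readers wondering how environment variables like these typically map onto settings, here is a standalone sketch using only the standard library. It does not reproduce the real config.py: the class name, the helper, the accepted truthy strings, and the fallback values of 20 and 50 are all assumptions for illustration.

```python
import os
from dataclasses import dataclass
from typing import Mapping


def _env_bool(env: Mapping[str, str], name: str, default: bool) -> bool:
    """Interpret common truthy strings; fall back to the default when unset."""
    raw = env.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


@dataclass
class LinkedFileSettings:
    # Hypothetical settings object; field names mirror the variables above.
    enable_linked_files: bool = False
    max_linked_file_size_mb: int = 20   # assumed fallback value
    max_linked_file_chunks: int = 50    # assumed fallback value

    @classmethod
    def from_env(cls, env: Mapping[str, str] = os.environ) -> "LinkedFileSettings":
        return cls(
            enable_linked_files=_env_bool(env, "ORG_DB_ENABLE_LINKED_FILES", False),
            max_linked_file_size_mb=int(env.get("ORG_DB_MAX_LINKED_FILE_SIZE_MB", 20)),
            max_linked_file_chunks=int(env.get("ORG_DB_MAX_LINKED_FILE_CHUNKS", 50)),
        )


# Pass an explicit mapping so the example does not depend on the real environment.
settings = LinkedFileSettings.from_env({"ORG_DB_ENABLE_LINKED_FILES": "true"})
print(settings)
```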
- PDF files (.pdf)
- Microsoft Word (.docx)
- PowerPoint (.pptx)
- Legacy Office formats (.doc, .xls, .xlsx, .ppt) via docling subprocess
- HTML/Web (.html, .htm)
- Images with OCR (.png, .jpg, .jpeg, etc.)
- And more (see SUPPORTED_FORMATS.md)
M-x org-db-v3-open-linked-file
;; or press 'F' in the menu

Shows all indexed linked files with file type, source location, chunks, and conversion status.
Press RET to jump to the org file link location, or C-u RET to open the linked file directly.
;; Linked file indexing (DISABLED by default - see above)
(setq org-db-v3-enable-linked-files nil) ; default: disabled
;; If you enable it for small collections:
(setq org-db-v3-max-linked-file-size 20971520) ; 20MB limit

If you use gptel for LLM integration in Emacs, org-db v3 provides tools that allow AI assistants to search your org files.
;; Load the gptel tools module
(require 'org-db-v3-gptel-tools)
;; Register the tools with gptel
(org-db-v3-gptel-register-tools)

The integration provides two search tools for LLMs:
org_semantic_search: Search using semantic similarity. Best for:
- Conceptual queries (“projects related to machine learning”)
- Finding related content without exact keywords
- Questions (“what have I written about travel?”)
org_fulltext_search: Search using exact keyword matching (SQLite FTS5). Best for:
- Finding specific terms or names
- Exact phrases
- Fast keyword lookup
In any org file with gptel enabled, you can configure the tools in the file header:
#+PROPERTY: GPTEL_TOOLS org_semantic_search org_fulltext_search
Or set buffer-locally:
M-x gptel-set-tools RET org_semantic_search org_fulltext_search RET

Then ask the LLM to search your org files:
- “Search my org files for travel plans”
- “What have I written about machine learning?”
- “Find all mentions of project deadlines”
The LLM will automatically use the appropriate tool and present results in a readable format.
Both tools accept optional parameters:
- limit: Number of results (default 5, max 20)
- filename_pattern: SQL LIKE pattern to filter files (e.g., %2024% for files in 2024, %project% for project-related files)
Example query: “Search my 2024 journal entries for mentions of conferences”
The LLM will automatically add filename_pattern like %journal/2024% if needed.
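The behaviour of SQL LIKE patterns such as %journal/2024% can be checked with an in-memory SQLite database. The table layout and file paths below are made up for the demonstration and are not the actual org-db schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical single-column table standing in for the real files table.
conn.execute("CREATE TABLE files (filename TEXT)")
conn.executemany(
    "INSERT INTO files (filename) VALUES (?)",
    [
        ("/notes/journal/2024/01-january.org",),
        ("/notes/journal/2023/12-december.org",),
        ("/notes/projects/project-alpha.org",),
    ],
)


def matching_files(pattern: str) -> list:
    """Return filenames matching a LIKE pattern ('%' matches any run of characters)."""
    rows = conn.execute(
        "SELECT filename FROM files WHERE filename LIKE ? ORDER BY filename",
        (pattern,),
    )
    return [row[0] for row in rows]


print(matching_files("%journal/2024%"))  # ['/notes/journal/2024/01-january.org']
print(matching_files("%project%"))       # ['/notes/projects/project-alpha.org']
```

Passing the pattern as a bound parameter (the `?` placeholder) rather than formatting it into the SQL string keeps user-supplied or LLM-supplied patterns from being interpreted as SQL.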
To remove the tools from gptel:
(org-db-v3-gptel-unregister-tools)

Press H-v or M-x org-db to open the menu:
- -a: Search all files (reset scope)
- -d: Limit to directory
- -p: Limit to Projectile project
- -t: Limit to tag/keyword
- q: Quit menu
- v: Semantic search (search by meaning)
- k: Full-text search (keyword search with FTS5)
- h: Headline search (browse headlines)
- i: Image search (find images by description)
- p: Search at point (use text at cursor)
- f: Open file from database (browse indexed files)
- F: Open linked file (browse indexed linked documents like PDFs, DOCX, etc.)
- u: Update current file
- U: Update all open org buffers
- d: Index directory (recursive, non-blocking)
- r: Reindex database (all files)
- a: Show agenda (TODOs, deadlines, scheduled)
- S: Server status
- R: Restart server
- L: View server logs
- W: Open web interface
- q: Quit menu
In the *org-db search* buffer:
- RET: Jump to result location
- n: Next result
- p: Previous result
- s: New search
- q: Quit window
;; Server connection
(setq org-db-v3-server-host "127.0.0.1")
(setq org-db-v3-server-port 8765)
;; Auto-start server
(setq org-db-v3-auto-start-server t)
;; Indexing speed (seconds between files)
;; Lower = faster indexing but less responsive Emacs and higher server load
;; Higher = slower indexing but more responsive Emacs and prevents server overload
(setq org-db-v3-index-delay 0.5) ; default: 500ms (allows linked file processing)
;; Search defaults
(setq org-db-v3-search-default-limit 10)
;; Headline search sort order
(setq org-db-v3-headline-sort-order "last_updated") ; default
;; Options: "last_updated", "filename", "indexed_at"
;; Linked file indexing (DISABLED by default due to performance issues)
(setq org-db-v3-enable-linked-files nil) ; disabled by default
;; (setq org-db-v3-max-linked-file-size 20971520) ; 20MB limit if enabled
;; gptel integration (optional)
(require 'org-db-v3-gptel-tools)
(org-db-v3-gptel-register-tools)
(setq org-db-v3-gptel-search-limit 5) ; results returned to LLM

Open the web interface to view statistics and API documentation:
M-x org-db-v3-open-web-interface
;; or press 'W' in the menu
;; or visit http://127.0.0.1:8765 in your browser

The homepage shows:
- Database location and size
- File counts, headlines, embeddings
- Recent files indexed
- Complete API documentation
- Getting started guide
The FastAPI server provides REST endpoints. Visit http://127.0.0.1:8765/docs for interactive documentation.
GET /health
POST /api/file
Content-Type: application/json
{
"filename": "/path/to/file.org",
"md5": "abc123...",
"file_size": 1024,
"content": "full file content...",
"headlines": [...],
"links": [...],
"keywords": [...],
"src_blocks": [...],
"images": [...]
}
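A payload of this shape could be assembled as in the Python sketch below. In org-db the payload is built by the Emacs client, so this is only an illustration: `build_file_payload` is a hypothetical helper, and it assumes that "md5" is the hex digest of the file's UTF-8 bytes and that "file_size" is counted in bytes.

```python
import hashlib
import json


def build_file_payload(filename: str, content: str) -> dict:
    """Build a request body with the same keys as the example above."""
    data = content.encode("utf-8")
    return {
        "filename": filename,
        # Assumption: the hash covers the UTF-8 encoded file content.
        "md5": hashlib.md5(data).hexdigest(),
        "file_size": len(data),
        "content": content,
        # Left empty here; in org-db these come from the Emacs-side org parser.
        "headlines": [],
        "links": [],
        "keywords": [],
        "src_blocks": [],
        "images": [],
    }


payload = build_file_payload("/path/to/file.org", "* TODO Write README\n")
body = json.dumps(payload)
print(payload["file_size"])  # 20
print(len(payload["md5"]))   # 32
```

Storing a content hash alongside each file makes it cheap to detect unchanged files: if the hash of the current content equals the stored one, reindexing can be skipped.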
POST /api/search/semantic
Content-Type: application/json
{
"query": "your search query",
"limit": 10,
"model": "all-MiniLM-L6-v2" // optional
}
Response includes similarity scores, filenames, line numbers, and context.
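The semantic search endpoint can be called from any HTTP client. The sketch below builds the request with Python's standard library, using the default host and port from this README; the helper name is hypothetical, and the send step is left commented out because it needs a running server and the exact response layout is not specified here.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8765"  # default host and port from this README


def semantic_search_request(query: str, limit: int = 10) -> urllib.request.Request:
    """Build (but do not send) a POST request for /api/search/semantic."""
    body = json.dumps({"query": query, "limit": limit}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/api/search/semantic",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = semantic_search_request("machine learning projects", limit=5)
print(request.get_method(), request.full_url)

# With the server running, the request could be sent like this:
# with urllib.request.urlopen(request, timeout=10) as response:
#     results = json.load(response)
```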
POST /api/search/fulltext
Content-Type: application/json
{
"query": "keyword search",
"limit": 10
}
Returns snippets and character positions for jumping to results.
POST /api/search/images
Content-Type: application/json
{
"query": "a photo of a cat",
"limit": 10
}
POST /api/search/headlines
Content-Type: application/json
{
"query": "project",
"limit": 10,
"sort_by": "last_updated" // optional: "filename", "last_updated", "indexed_at"
}
Sorts by most recently updated files first by default.
POST /api/agenda
Content-Type: application/json
{
"days_ahead": 7
}
Returns TODO items with deadlines and scheduled dates.
GET /api/files # List all indexed files
DELETE /api/file?filename=/path/to/file.org # Remove file from database
GET /api/stats/ # Database statistics
GET /api/stats/files # List of files with timestamps
Database location: ~/org-db/org-db-v3.db
SQLite database with 22 tables including:
- files: Indexed org files with MD5 hashes
- headlines: Org headings with metadata (TODO, tags, priority, scheduled, deadline)
- chunks: Text chunks for semantic search with line numbers
- embeddings: Vector embeddings stored as BLOB (float32)
- fts_content: Full-text search index (SQLite FTS5)
- links: All links with type, path, and description
- properties: Org properties (key-value pairs)
- keywords: File-level keywords
- tags: Tags with relationships to headlines
- src_blocks: Source code blocks with language and contents
- images: Image paths with positions
- image_embeddings: CLIP embeddings for images
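Storing a vector as a float32 BLOB amounts to packing each component into 4 bytes. The round trip can be sketched with the standard library as follows; the little-endian byte order and the helper names are assumptions, and the server may well use NumPy for this instead.

```python
import struct
from typing import List


def to_blob(vector: List[float]) -> bytes:
    """Pack floats as consecutive little-endian float32 values."""
    return struct.pack(f"<{len(vector)}f", *vector)


def from_blob(blob: bytes) -> List[float]:
    """Unpack a float32 BLOB back into a list of Python floats."""
    count = len(blob) // 4  # each float32 occupies 4 bytes
    return list(struct.unpack(f"<{count}f", blob))


vector = [0.5, -1.0, 0.25]  # values chosen to be exactly representable in float32
blob = to_blob(vector)
print(len(blob))        # 12
print(from_blob(blob))  # [0.5, -1.0, 0.25]

# A 384-dimensional text embedding therefore takes 384 * 4 = 1536 bytes.
print(len(to_blob([0.0] * 384)))  # 1536
```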
Python tests:
cd python
uv run pytest tests/ -v

Emacs tests:
emacs -batch -l tests/org-db-v3-search-test.el -f ert-run-tests-batch-and-exit

After making changes to elisp files:
M-x load-file RET reload.el RET
M-x org-db-v3-reload

Or restart Emacs to load fresh code.
org-db-v3/
├── python/
│   ├── org_db_server/
│   │   ├── api/                    # FastAPI routes
│   │   │   ├── indexing.py         # File indexing endpoints
│   │   │   ├── search.py           # Search endpoints
│   │   │   ├── stats.py            # Statistics endpoints
│   │   │   ├── agenda.py           # Agenda endpoint
│   │   │   └── linked_files.py     # Linked file endpoints
│   │   ├── models/                 # Pydantic schemas & DB models
│   │   ├── services/               # Business logic
│   │   │   ├── database.py         # SQLite operations
│   │   │   ├── embeddings.py       # Text embeddings
│   │   │   ├── clip_service.py     # Image embeddings
│   │   │   ├── chunking.py         # Text chunking
│   │   │   ├── docling_service.py  # Document conversion
│   │   │   └── docling_worker.py   # Subprocess worker
│   │   └── templates/              # HTML templates
│   └── tests/                      # Python tests
├── elisp/
│   ├── org-db-v3.el                # Main package
│   ├── org-db-v3-parse.el          # Org parsing
│   ├── org-db-v3-client.el         # HTTP client & indexing queue
│   ├── org-db-v3-server.el         # Server management
│   ├── org-db-v3-search.el         # Search UI
│   ├── org-db-v3-agenda.el         # Agenda functionality
│   ├── org-db-v3-ui.el             # Transient menu
│   └── org-db-v3-gptel-tools.el    # gptel integration
├── tests/                          # Emacs tests
└── reload.el                       # Development helper
- Emacs parses org file using org-element-parse-buffer
- Extracts headlines, links, keywords, src blocks, images, and full content
- Adds to queue for non-blocking processing
- Timer processes one file at a time
- Sends JSON to FastAPI server via async HTTP
- Server stores structured data in SQLite
- Generates embeddings:
- Text chunks using sentence-transformers
- Images using CLIP
- Stores vectors as float32 bytes in database
- Populates FTS5 table for full-text search
Directory indexing uses a queue-based approach:
- Files are collected upfront
- Regular timer processes one file at configurable intervals (default: 500ms, set via org-db-v3-index-delay)
- Local/directory variables are suppressed for safety
- Already-open buffers are preserved
- Progress shown in echo area
- Cancellable with M-x org-db-v3-cancel-indexing
- Speed tunable via org-db-v3-index-delay (lower = faster, higher = more responsive)
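The queue-based approach described above can be modelled in a few lines. The real queue lives in Emacs Lisp and is driven by a timer; this Python sketch only mirrors the logic (collect upfront, process one file per tick, report progress, allow cancellation), and every name in it is hypothetical.

```python
from collections import deque
from typing import Callable, Deque, Iterable, List


class IndexQueue:
    """Toy model of a one-file-per-tick indexing queue."""

    def __init__(self, files: Iterable[str], index_file: Callable[[str], None]) -> None:
        self.pending: Deque[str] = deque(files)  # files are collected upfront
        self.total = len(self.pending)
        self.done = 0
        self.index_file = index_file

    def tick(self) -> bool:
        """Process exactly one file; return False once nothing is left."""
        if not self.pending:
            return False
        self.index_file(self.pending.popleft())
        self.done += 1
        return True

    def cancel(self) -> None:
        """Drop all remaining work."""
        self.pending.clear()

    def progress(self) -> str:
        return f"{self.done}/{self.total}"


indexed: List[str] = []
queue = IndexQueue(["a.org", "b.org", "c.org"], indexed.append)
queue.tick()
queue.tick()
print(queue.progress())  # 2/3
queue.cancel()
print(queue.tick())      # False
```

In the editor, each tick would be scheduled by a timer with the configured delay between calls, so the user interface gets control back after every file.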
- User enters query in minibuffer
- Server generates query embedding
- Calculates cosine similarity with all stored embeddings
- Returns top N results ranked by similarity
- Emacs displays in org-mode buffer with navigation
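The ranking step in the list above can be illustrated with plain Python. This is a brute-force sketch with toy 3-dimensional vectors and hypothetical chunk identifiers; the server presumably uses vectorised operations on 384-dimensional embeddings instead.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_n(query: List[float], stored: Dict[str, List[float]], n: int) -> List[Tuple[str, float]]:
    """Score every stored vector against the query and keep the n best."""
    scored = [(chunk_id, cosine_similarity(query, vec)) for chunk_id, vec in stored.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:n]


stored = {
    "notes.org:10": [1.0, 0.0, 0.0],
    "travel.org:42": [0.0, 1.0, 0.0],
    "ml.org:7": [0.9, 0.1, 0.0],
}
results = top_n([1.0, 0.0, 0.0], stored, n=2)
print([chunk_id for chunk_id, _ in results])  # ['notes.org:10', 'ml.org:7']
```

Because every stored vector is compared with the query, the cost grows linearly with the number of embeddings, which is one reason the very large linked-file collections described earlier lead to slow searches.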
- Query sent to FTS5 index
- SQLite returns matching content with snippets
- Character positions enable precise jumping
- Results displayed with context
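The FTS5 query flow can be tried out with an in-memory database. The sketch below assumes the Python build's SQLite includes the FTS5 extension, and it uses a demo table (`fts_demo`) with invented rows rather than the real fts_content schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumption: SQLite was compiled with FTS5 support.
conn.execute("CREATE VIRTUAL TABLE fts_demo USING fts5(filename, content)")
conn.executemany(
    "INSERT INTO fts_demo (filename, content) VALUES (?, ?)",
    [
        ("travel.org", "Book flights for the conference in Lisbon next spring"),
        ("ml.org", "Notes on gradient descent and learning rates"),
    ],
)

# snippet(table, column index, match start, match end, ellipsis, max tokens)
rows = conn.execute(
    """
    SELECT filename, snippet(fts_demo, 1, '[', ']', '...', 8)
    FROM fts_demo
    WHERE fts_demo MATCH ?
    ORDER BY rank
    """,
    ("conference",),
).fetchall()

for filename, text in rows:
    print(filename, text)
```

The `snippet` function returns a short excerpt around the match with the matched term wrapped in the given markers, which is what makes highlighted result previews possible.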
- Query text converted to CLIP embedding
- Similarity calculated with stored image embeddings
- Matching images returned with thumbnails
- Click to view full image
- Model: all-MiniLM-L6-v2 (384 dimensions)
- Fast inference (~50 sentences/second on CPU)
- Good accuracy for semantic search
- Runs locally (no API calls)
- Downloaded automatically on first use (~90MB)

- Model: clip-ViT-B-32
- Text-to-image and image-to-image similarity
- Downloaded automatically on first use (~600MB)
- Indexing: ~100 headlines/second
- Search: <100ms for 1000 documents (local)
- Embedding generation: ~50 sentences/second (CPU)
- Database: SQLite handles 100k+ chunks efficiently
- Non-blocking: Emacs remains responsive during indexing
Check if port 8765 is already in use:
lsof -i :8765

The server includes automatic protection against multiple starts:
- Detects if port is already in use
- Identifies zombie/stuck processes
- Offers to automatically clean them up (when started manually)
- Auto-cleans zombies when auto-starting
- Fast health checks (1 second timeout) prevent hanging
If you see “Address already in use” errors:
- Try M-x org-db-v3-start-server; it will prompt to kill zombies
- Or manually: M-x org-db-v3-kill-zombie-processes
- Or restart Emacs (auto-start will clean up zombies)
Change port in both Python and Emacs config if needed.
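A port check with a short timeout, similar in spirit to the startup protection described above, can be sketched as follows. The actual check is part of the Emacs package; `port_in_use` is a hypothetical helper, and the 1 second default simply mirrors the health-check timeout mentioned earlier.

```python
import socket


def port_in_use(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)  # keep the check fast even if nothing answers
        # connect_ex returns 0 on success and an error code otherwise.
        return sock.connect_ex((host, port)) == 0


print(port_in_use("127.0.0.1", 8765))
```

A successful connection only shows that some process is listening; confirming that it is a healthy org-db server would still require a request to GET /health.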
Old code is cached in Emacs. Reload the elisp files:
M-x load-file RET reload.el RET
M-x org-db-v3-reload

Or restart Emacs.
- Verify files are indexed: Check server logs at /tmp/org-db-server.log
- Ensure embeddings generated: Look for embedding_service in logs
- Check database size: Visit web interface (W in menu)
- Try reindexing: Press r in menu
If indexing appears to pause between files:
- Ensure you have the latest code (uses regular timers, not idle timers)
- Check/adjust indexing speed: (setq org-db-v3-index-delay 0.05) (or lower for faster)
- Reload code: M-x load-file RET reload.el RET
The queue-based system should prevent this, but if it happens:
- Cancel current operation: M-x org-db-v3-cancel-indexing
- Check queue status: M-x describe-variable RET org-db-v3-index-queue
- Reload code (see above)
- Restart Emacs if needed
The code suppresses local and directory variables, but if you still see prompts:
- Check that you’ve reloaded the latest code
- Set (setq enable-local-variables nil) temporarily
- Report as a bug
Ensure uv environment is set up:
cd python
uv sync
uv run python -c "import org_db_server"

If you see “void-variable org-db-v3-server-url” errors:
- Make sure you’ve loaded the gptel-tools module:
  (load-file "/path/to/org-db-v3/elisp/org-db-v3-gptel-tools.el")
  (org-db-v3-gptel-register-tools)
- Check that the main org-db-v3 package is loaded (it defines org-db-v3-server-url)
- Verify the tools are registered:
  M-x describe-variable RET gptel-tools RET
  ;; Should show org_semantic_search and org_fulltext_search
- Make sure your GPTEL_TOOLS property uses the correct names:
  - org_semantic_search (not org-db-semantic-search)
  - org_fulltext_search (not org-db-fulltext)
- [X] Semantic search with embeddings
- [X] Full-text search (FTS5) with snippets
- [X] Image search with CLIP
- [X] Headline search
- [X] Scoped search (directory, project, tag filters)
- [X] Agenda with customizable date ranges (default 2 weeks)
- [X] Transient menu UI with infix arguments
- [X] Web interface with statistics
- [X] Non-blocking directory indexing
- [X] File browser (open files from database)
- [X] Linked file indexing (PDF, DOCX, PPTX, etc.) - currently disabled due to performance issues with large collections (5000+ files)
- [X] Linked file browser (browse and open indexed documents)
- [ ] File-level embedding aggregation (needed for linked files at scale)
- [X] gptel integration (expose search tools to LLMs)
- [X] Search at point
- [X] Auto-indexing on save
- [X] Full file paths in search results
- [X] Auto-enable for already-open buffers
- [X] Server startup protection (prevents multiple servers, auto-cleans zombies)
- [X] Fast health checks with short timeouts
- [X] Configurable indexing speed
- [X] Memory-optimized document conversion (95% reduction vs docling)
- [ ] File-level embedding aggregation (reduce 360K embeddings to 6K)
- [ ] Smart chunking for linked files (fixed-size instead of paragraph)
- [ ] Selective linked file indexing (by type, size, date)
- [ ] Hybrid search (file-level + chunk-level)
- [ ] Reranking strategy
- [ ] Custom chunk sizes/strategies
- [ ] More search filters (date ranges, file patterns)
- [ ] Incremental indexing optimizations
See LICENSE file.
Contributions welcome! Please:
- Write tests for new features
- Follow existing code style
- Update documentation
- Use descriptive commit messages
Built with:
- FastAPI (Python web framework)
- sentence-transformers (text embeddings)
- CLIP (image embeddings)
- plz.el (async HTTP for Emacs)
- uv (Python package manager)
- transient (Emacs menu system)
Inspired by org-roam, org-ql, and semantic search research.