jkitchin/org-db-v3

org-db v3

Overview

A modern org-mode database with semantic search capabilities. Built with a hybrid architecture: Emacs handles org parsing and UI, while a Python FastAPI backend provides storage, embeddings, and vector search.

Features

Semantic Search
Find org content by meaning using embeddings
Full-Text Search
Fast keyword search with SQLite FTS5 and snippets
Image Search
Find images by text description using CLIP embeddings
Headline Search
Browse and jump to org headlines across all files
Scoped Search
Limit searches to specific directories, Projectile projects, or keywords/tags
Agenda
View TODOs, deadlines, and scheduled items from all indexed files
Non-blocking Indexing
Queue-based indexing with configurable delays keeps Emacs responsive
Linked File Indexing
(Experimental/disabled) Can index linked documents (PDF, DOCX, PPTX, etc.) - needs optimization for large collections
gptel Integration
Expose search tools to LLMs for AI-powered org file exploration
Transient UI
Easy-to-use menu system for all commands (H-v or M-x org-db)
Web Interface
Browse statistics and documentation via browser

Architecture

┌─────────────────┐         ┌──────────────────────┐
│  Emacs (UI)     │◄───────►│  FastAPI Server      │
│  - org parsing  │  HTTP   │  - SQLite storage    │
│  - search UI    │         │  - embeddings (ML)   │
│  - navigation   │         │  - vector search     │
│  - transient    │         │  - CLIP (images)     │
└─────────────────┘         └──────────────────────┘

Installation

Prerequisites

  • Emacs 28.1+
  • Python 3.10+
  • uv (Python package manager)

Python Backend

cd python
uv sync

This installs all dependencies including:

  • FastAPI & Uvicorn (web server)
  • sentence-transformers (text embeddings)
  • CLIP (image embeddings)
  • SQLite (database)

Emacs Package

Add to your Emacs config:

(add-to-list 'load-path "/path/to/org-db-v3/elisp")
(require 'org-db-v3)

;; Bind the menu to a convenient key
(global-set-key (kbd "H-v") 'org-db-menu)

;; Auto-start server when Emacs starts (optional)
(setq org-db-v3-auto-start-server t)

;; Optional: Enable gptel integration for AI-powered search
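;; with-eval-after-load also covers the case where gptel is loaded lazily,
;; after this part of the config has already run
(with-eval-after-load 'gptel
  (require 'org-db-v3-gptel-tools)
  (org-db-v3-gptel-register-tools))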

Required Emacs packages:

  • plz (async HTTP client)
  • transient (menu system)

Optional packages:

  • gptel (for AI-powered search integration)

Install via package manager or manually.

Quick Start

1. Start the Server

From the python directory:

cd python
uv run uvicorn org_db_server.main:app --reload --port 8765 > /tmp/org-db-server.log 2>&1 &

Or from Emacs:

M-x org-db-v3-start-server

2. Open the Menu

M-x org-db
;; or press H-v if you bound it

3. Index Your Org Files

Choose from the menu:

d
Index entire directory (recursive, non-blocking)
u
Update current file
U
Update all open org buffers
r
Reindex all files in database

Or manually:

M-x org-db-v3-index-directory

Files are processed one at a time with configurable delays to keep Emacs responsive.

4. Search

Semantic Search (by meaning)

M-x org-db-v3-semantic-search
;; or press 'v' in the menu

Enter a query like “machine learning projects” and find content by semantic similarity.

Full-Text Search (by keywords)

M-x org-db-v3-fulltext-search
;; or press 'k' in the menu

Search at Point

M-x org-db-v3-search-at-point
;; or press 'p' in the menu

Uses the selected region, or the sentence at point if no region is active, as the search query.

Image Search

M-x org-db-v3-image-search
;; or press 'i' in the menu

Find images by text description using CLIP embeddings.

Headline Search

M-x org-db-v3-headline-search
;; or press 'h' in the menu

;; With prefix arg, choose sort order interactively
C-u M-x org-db-v3-headline-search

Browse all headlines with a completing-read interface.

Sort Order Options:

last_updated
Most recently updated files first (default)
filename
Alphabetical by filename
indexed_at
Most recently indexed files first

You can customize the default sort order:

(setq org-db-v3-headline-sort-order "last_updated")  ; default
;; or "filename" or "indexed_at"

Using Search Scope

All searches (semantic, fulltext, headline, image) can be limited to specific subsets of your indexed files. The scope setting applies to the next search only and then resets to “All files”.

From the main menu (H-v), use the scope options:

-a
Search all files (default)
-d
Limit to a specific directory
-p
Limit to a Projectile project (prompts to select from known projects)
-t
Limit to files with a specific keyword/tag

Example workflow:

  1. Press H-v to open menu
  2. Press -p to select a project scope
  3. Choose “my-notes” from the project list
  4. Press v for semantic search
  5. Enter your query - results will only come from the “my-notes” project
  6. Next search will automatically reset to “All files”

The current scope is shown in the menu header: org-db v3 [Scope: Project: my-notes]

Note: Scope options use dash prefixes (-a, -d, -p, -t) to distinguish them from action keys.

5. Agenda

M-x org-db-v3-agenda
;; or press 'a' in the menu

View TODO items with deadlines and scheduled dates from indexed files.

By default, shows items due in the next 2 weeks. Use a prefix argument (C-u M-x org-db-v3-agenda) to specify a custom date range:

  • +2w - 2 weeks (default)
  • +1m - 1 month
  • +3w - 3 weeks
  • 2025-12-31 - specific date
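
The range specs above can be read as "an end date relative to today". The following is only a sketch of that idea in Python, not the package's actual parser (which lives in the Emacs code and may behave differently). The helper names are hypothetical, and treating a month as 30 days is a simplification made here for illustration.

```python
import re
from datetime import date, timedelta

# Hypothetical mapping; "m" as 30 days is a simplification for this sketch.
_UNIT_DAYS = {"d": 1, "w": 7, "m": 30}


def resolve_range_end(spec: str, today: date) -> date:
    """Turn '+2w', '+1m', or 'YYYY-MM-DD' into an end date."""
    match = re.fullmatch(r"\+(\d+)([dwm])", spec)
    if match:
        count, unit = int(match.group(1)), match.group(2)
        return today + timedelta(days=count * _UNIT_DAYS[unit])
    # Anything else must be an ISO date; date.fromisoformat raises ValueError otherwise.
    return date.fromisoformat(spec)


def days_ahead(spec: str, today: date) -> int:
    """Number of days between today and the end of the requested range."""
    return (resolve_range_end(spec, today) - today).days
```

For example, days_ahead("+3w", date.today()) gives 21, which is the kind of value the days_ahead field of the agenda endpoint expects.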

6. Linked File Indexing

⚠️ IMPORTANT: Linked file indexing is currently DISABLED by default due to performance issues.

org-db v3 has the capability to index linked documents (PDF, DOCX, PPTX, etc.) found in your org files, but this feature needs significant work to handle large-scale use cases efficiently.

Known Issues

With large document collections (5000+ linked files), linked file indexing can cause:

  • Database bloat: 29.5 GB database from paragraph-level chunking
  • Slow searches: 30+ second timeouts or server crashes
  • 360,000+ chunk embeddings: From aggressive per-paragraph chunking

The current implementation chunks each paragraph separately, creating too many embeddings for large PDF collections.

Current Status

The feature is disabled by default in python/org_db_server/config.py:

enable_linked_files: bool = False  # Disabled to prevent database bloat

Future Work Needed

Before enabling for large collections, the following improvements are needed:

  1. File-level embeddings: Aggregate chunks into single embeddings (98% size reduction)
  2. Smarter chunking: Use larger fixed-size chunks instead of paragraph chunking
  3. Selective indexing: Only index important/recent documents
  4. Better limits: Enforce per-file chunk limits and file size limits

See python/LINKED_FILES_OPTIMIZATION.md and python/EMBEDDING_AGGREGATION_STRATEGIES.md for detailed analysis and solutions.
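
To make the file-level aggregation idea concrete, here is a minimal sketch that averages chunk embeddings into one vector per file. Mean pooling is an assumption chosen for illustration; the strategy documents referenced above may recommend something else. The function names are hypothetical. The arithmetic helper uses the 360K and 6K figures quoted in this README.

```python
import math


def mean_pool(chunk_vectors):
    """Average chunk embeddings into one vector, then L2-normalize it."""
    if not chunk_vectors:
        raise ValueError("need at least one chunk vector")
    dim = len(chunk_vectors[0])
    sums = [0.0] * dim
    for vec in chunk_vectors:
        if len(vec) != dim:
            raise ValueError("all chunk vectors must have the same length")
        for i, x in enumerate(vec):
            sums[i] += x
    mean = [s / len(chunk_vectors) for s in sums]
    norm = math.sqrt(sum(x * x for x in mean))
    # A zero vector cannot be normalized; return it unchanged.
    return [x / norm for x in mean] if norm else mean


def reduction_percent(before: int, after: int) -> float:
    """Percentage of embeddings removed when going from `before` to `after`."""
    return 100.0 * (1 - after / before)
```

Going from 360,000 chunk embeddings to 6,000 file embeddings is reduction_percent(360_000, 6_000), about 98.3%, which matches the "98% size reduction" estimate above.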

If You Want to Try It

If you have a small number of linked files (<100), you can enable it:

export ORG_DB_ENABLE_LINKED_FILES=true
export ORG_DB_MAX_LINKED_FILE_SIZE_MB=20      # Skip large files
export ORG_DB_MAX_LINKED_FILE_CHUNKS=50       # Limit chunks per file

Or edit python/org_db_server/config.py and restart the server.
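
As a rough illustration of how the three environment variables could map onto settings, here is a standard-library sketch. It is not the contents of config.py, which may use a different mechanism. Only the disabled default for enable_linked_files is documented above; the 20 MB and 50-chunk fallbacks are simply copied from the export example and may not be the real defaults.

```python
import os
from dataclasses import dataclass

_TRUE_VALUES = {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class LinkedFileSettings:
    enable_linked_files: bool = False   # documented default: disabled
    max_linked_file_size_mb: int = 20   # illustrative fallback, not a confirmed default
    max_linked_file_chunks: int = 50    # illustrative fallback, not a confirmed default


def load_settings(env=None) -> LinkedFileSettings:
    """Read the ORG_DB_* variables from `env` (defaults to os.environ)."""
    env = os.environ if env is None else env
    enabled = env.get("ORG_DB_ENABLE_LINKED_FILES", "false").strip().lower()
    return LinkedFileSettings(
        enable_linked_files=enabled in _TRUE_VALUES,
        max_linked_file_size_mb=int(env.get("ORG_DB_MAX_LINKED_FILE_SIZE_MB", "20")),
        max_linked_file_chunks=int(env.get("ORG_DB_MAX_LINKED_FILE_CHUNKS", "50")),
    )
```

Passing a plain dict as env makes the mapping easy to check without touching the real environment.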

Supported Formats (when enabled)

  • PDF files (.pdf)
  • Microsoft Word (.docx)
  • PowerPoint (.pptx)
  • Legacy Office formats (.doc, .xls, .xlsx, .ppt) via docling subprocess
  • HTML/Web (.html, .htm)
  • Images with OCR (.png, .jpg, .jpeg, etc.)
  • And more (see SUPPORTED_FORMATS.md)

Browsing Linked Files

M-x org-db-v3-open-linked-file
;; or press 'F' in the menu

Shows all indexed linked files with file type, source location, chunks, and conversion status. Press RET to jump to the org file link location, or C-u RET to open the linked file directly.

Configuration

;; Linked file indexing (DISABLED by default - see above)
(setq org-db-v3-enable-linked-files nil)  ; default: disabled

;; If you enable it for small collections:
(setq org-db-v3-max-linked-file-size 20971520)  ; 20MB limit

7. gptel Integration (AI-Powered Search)

If you use gptel for LLM integration in Emacs, org-db v3 provides tools that allow AI assistants to search your org files.

Setup

;; Load the gptel tools module
(require 'org-db-v3-gptel-tools)

;; Register the tools with gptel
(org-db-v3-gptel-register-tools)

Available Tools

The integration provides two search tools for LLMs:

org_semantic_search

Search using AI/semantic similarity. Best for:

  • Conceptual queries (“projects related to machine learning”)
  • Finding related content without exact keywords
  • Questions (“what have I written about travel?”)

org_fulltext_search

Search using exact keyword matching (SQLite FTS5). Best for:

  • Finding specific terms or names
  • Exact phrases
  • Fast keyword lookup

Usage in gptel

In any org file with gptel enabled, you can configure the tools in the file header:

#+PROPERTY: GPTEL_TOOLS org_semantic_search org_fulltext_search

Or set buffer-locally:

M-x gptel-set-tools RET org_semantic_search org_fulltext_search RET

Then ask the LLM to search your org files:

  • “Search my org files for travel plans”
  • “What have I written about machine learning?”
  • “Find all mentions of project deadlines”

The LLM will automatically use the appropriate tool and present results in a readable format.

Tool Parameters

Both tools accept optional parameters:

limit
Number of results (default 5, max 20)
filename_pattern
SQL LIKE pattern to filter files by path (e.g., %2024% for paths containing 2024, %project% for paths containing project)

Example query: “Search my 2024 journal entries for mentions of conferences”

The LLM will add a filename_pattern such as %journal/2024% when the request calls for one.
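
For readers unfamiliar with SQL LIKE, the sketch below shows how a pattern such as %journal/2024% selects paths. It uses a throwaway in-memory table named demo_files, which is hypothetical and not the project's schema; % matches any run of characters.

```python
import sqlite3


def matching_files(filenames, pattern):
    """Return the filenames that match a SQL LIKE pattern, sorted."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE demo_files (filename TEXT)")
        conn.executemany(
            "INSERT INTO demo_files VALUES (?)", [(name,) for name in filenames]
        )
        rows = conn.execute(
            "SELECT filename FROM demo_files WHERE filename LIKE ? ORDER BY filename",
            (pattern,),
        )
        return [row[0] for row in rows]
    finally:
        conn.close()
```

Note that SQLite's LIKE is case-insensitive for ASCII letters by default, so %Project% and %project% select the same paths.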

Unregistering Tools

To remove the tools from gptel:

(org-db-v3-gptel-unregister-tools)

Usage

Transient Menu

Press H-v or M-x org-db to open the menu:

Scope Options

-a
Search all files (reset scope)
-d
Limit to directory
-p
Limit to Projectile project
-t
Limit to tag/keyword

Search Commands

v
Semantic search (search by meaning)
k
Full-text search (keyword search with FTS5)
h
Headline search (browse headlines)
i
Image search (find images by description)
p
Search at point (use text at cursor)

File Management

f
Open file from database (browse indexed files)
F
Open linked file (browse indexed linked documents like PDFs, DOCX, etc.)

Indexing

u
Update current file
U
Update all open org buffers
d
Index directory (recursive, non-blocking)
r
Reindex database (all files)

Agenda

a
Show agenda (TODOs, deadlines, scheduled)

Server Management

S
Server status
R
Restart server
L
View server logs
W
Open web interface

Other

q
Quit menu

Search Results Navigation

In the *org-db search* buffer:

RET
Jump to result location
n
Next result
p
Previous result
s
New search
q
Quit window

Configuration

;; Server connection
(setq org-db-v3-server-host "127.0.0.1")
(setq org-db-v3-server-port 8765)

;; Auto-start server
(setq org-db-v3-auto-start-server t)

;; Indexing speed (seconds between files)
;; Lower = faster indexing but less responsive Emacs and higher server load
;; Higher = slower indexing but more responsive Emacs and prevents server overload
(setq org-db-v3-index-delay 0.5)  ; default: 500ms (allows linked file processing)

;; Search defaults
(setq org-db-v3-search-default-limit 10)

;; Headline search sort order
(setq org-db-v3-headline-sort-order "last_updated")  ; default
;; Options: "last_updated", "filename", "indexed_at"

;; Linked file indexing (DISABLED by default due to performance issues)
(setq org-db-v3-enable-linked-files nil)  ; disabled by default
;; (setq org-db-v3-max-linked-file-size 20971520)  ; 20MB limit if enabled

;; gptel integration (optional)
(require 'org-db-v3-gptel-tools)
(org-db-v3-gptel-register-tools)
(setq org-db-v3-gptel-search-limit 5)  ; results returned to LLM

Web Interface

Open the web interface to view statistics and API documentation:

M-x org-db-v3-open-web-interface
;; or press 'W' in the menu
;; or visit http://127.0.0.1:8765 in your browser

The homepage shows:

  • Database location and size
  • File counts, headlines, embeddings
  • Recent files indexed
  • Complete API documentation
  • Getting started guide

API Endpoints

The FastAPI server provides REST endpoints. Visit http://127.0.0.1:8765/docs for interactive documentation.

Core Endpoints

Health Check

GET /health

Index File

POST /api/file
Content-Type: application/json

{
  "filename": "/path/to/file.org",
  "md5": "abc123...",
  "file_size": 1024,
  "content": "full file content...",
  "headlines": [...],
  "links": [...],
  "keywords": [...],
  "src_blocks": [...],
  "images": [...]
}
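
The payload is normally built by the Emacs client. As a sketch of its shape from Python, the helper below fills in the scalar fields and leaves the list fields empty, since their element format is not shown here. It assumes md5 is the hex digest of the UTF-8 file content and file_size is the size in bytes; neither is confirmed by this README, and build_index_payload is a hypothetical name.

```python
import hashlib
import json


def build_index_payload(filename: str, content: str) -> dict:
    """Assemble a request body with the keys shown in the example above."""
    data = content.encode("utf-8")
    return {
        "filename": filename,
        # Assumption: hex MD5 of the file content, used only for change detection.
        "md5": hashlib.md5(data, usedforsecurity=False).hexdigest(),
        "file_size": len(data),  # assumption: size in bytes
        "content": content,
        # Element formats for these lists are not documented here; left empty.
        "headlines": [],
        "links": [],
        "keywords": [],
        "src_blocks": [],
        "images": [],
    }


if __name__ == "__main__":
    print(json.dumps(build_index_payload("/path/to/file.org", "* Hello\n"), indent=2))
```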

Semantic Search

POST /api/search/semantic
Content-Type: application/json

{
  "query": "your search query",
  "limit": 10,
  "model": "all-MiniLM-L6-v2"  // optional
}

Response includes similarity scores, filenames, line numbers, and context.
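
A minimal way to call this endpoint from Python with only the standard library might look like the sketch below. The helper names are hypothetical, the base URL is the default host and port from this README, and the response is returned as parsed JSON without assuming any particular field names.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8765"  # default host and port from this README


def semantic_search_request(query, limit=10, model=None, base_url=BASE_URL):
    """Build (but do not send) a POST request for /api/search/semantic."""
    body = {"query": query, "limit": limit}
    if model is not None:
        body["model"] = model  # optional, e.g. "all-MiniLM-L6-v2"
    return urllib.request.Request(
        f"{base_url}/api/search/semantic",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=10.0):
    """Send the request to a running server and return the decoded JSON."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

With the server running, send(semantic_search_request("machine learning projects", limit=5)) returns whatever JSON the server produces. Separating request construction from sending keeps the payload easy to inspect.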

Full-Text Search

POST /api/search/fulltext
Content-Type: application/json

{
  "query": "keyword search",
  "limit": 10
}

Returns snippets and character positions for jumping to results.

Image Search

POST /api/search/images
Content-Type: application/json

{
  "query": "a photo of a cat",
  "limit": 10
}

Headline Search

POST /api/search/headlines
Content-Type: application/json

{
  "query": "project",
  "limit": 10,
  "sort_by": "last_updated"  // optional: "filename", "last_updated", "indexed_at"
}

Sorts by most recently updated files first by default.

Agenda

POST /api/agenda
Content-Type: application/json

{
  "days_ahead": 7
}

Returns TODO items with deadlines and scheduled dates.

File Management

GET /api/files           # List all indexed files
DELETE /api/file?filename=/path/to/file.org  # Remove file from database

Statistics

GET /api/stats/          # Database statistics
GET /api/stats/files     # List of files with timestamps

Database Schema

Database location: ~/org-db/org-db-v3.db

SQLite database with 22 tables including:

files
Indexed org files with MD5 hashes
headlines
Org headings with metadata (TODO, tags, priority, scheduled, deadline)
chunks
Text chunks for semantic search with line numbers
embeddings
Vector embeddings stored as BLOB (float32)
fts_content
Full-text search index (SQLite FTS5)
links
All links with type, path, and description
properties
Org properties (key-value pairs)
keywords
File-level keywords
tags
Tags with relationships to headlines
src_blocks
Source code blocks with language and contents
images
Image paths with positions
image_embeddings
CLIP embeddings for images
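
Storing an embedding "as BLOB (float32)" means each dimension takes 4 bytes, so a 384-dimensional vector occupies 384 × 4 = 1536 bytes. The sketch below shows one way to pack and unpack such a blob with the standard library. Little-endian byte order is an assumption, and the function names are hypothetical rather than taken from the server code.

```python
import struct


def pack_vector(values):
    """Serialize floats as consecutive little-endian float32 values."""
    return struct.pack(f"<{len(values)}f", *values)


def unpack_vector(blob: bytes):
    """Inverse of pack_vector: decode a float32 blob back into a list."""
    if len(blob) % 4:
        raise ValueError("blob length must be a multiple of 4 bytes")
    return list(struct.unpack(f"<{len(blob) // 4}f", blob))
```

Values that are not exactly representable in float32 come back slightly rounded, which is normal for this storage format.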

Development

Running Tests

Python tests:

cd python
uv run pytest tests/ -v

Emacs tests:

emacs -batch -l tests/org-db-v3-search-test.el -f ert-run-tests-batch-and-exit

Reloading Code During Development

After making changes to elisp files:

M-x load-file RET reload.el RET
M-x org-db-v3-reload

Or restart Emacs to load fresh code.

Project Structure

org-db-v3/
├── python/
│   ├── org_db_server/
│   │   ├── api/              # FastAPI routes
│   │   │   ├── indexing.py   # File indexing endpoints
│   │   │   ├── search.py     # Search endpoints
│   │   │   ├── stats.py      # Statistics endpoints
│   │   │   ├── agenda.py     # Agenda endpoint
│   │   │   └── linked_files.py # Linked file endpoints
│   │   ├── models/           # Pydantic schemas & DB models
│   │   ├── services/         # Business logic
│   │   │   ├── database.py   # SQLite operations
│   │   │   ├── embeddings.py # Text embeddings
│   │   │   ├── clip_service.py # Image embeddings
│   │   │   ├── chunking.py   # Text chunking
│   │   │   ├── docling_service.py # Document conversion
│   │   │   └── docling_worker.py  # Subprocess worker
│   │   └── templates/        # HTML templates
│   └── tests/                # Python tests
├── elisp/
│   ├── org-db-v3.el          # Main package
│   ├── org-db-v3-parse.el    # Org parsing
│   ├── org-db-v3-client.el   # HTTP client & indexing queue
│   ├── org-db-v3-server.el   # Server management
│   ├── org-db-v3-search.el   # Search UI
│   ├── org-db-v3-agenda.el   # Agenda functionality
│   ├── org-db-v3-ui.el       # Transient menu
│   └── org-db-v3-gptel-tools.el  # gptel integration
├── tests/                    # Emacs tests
└── reload.el                 # Development helper

How It Works

Indexing Pipeline

  1. Emacs parses org file using org-element-parse-buffer
  2. Extracts headlines, links, keywords, src blocks, images, and full content
  3. Adds to queue for non-blocking processing
  4. A regular (non-idle) timer processes one file at a time
  5. Sends JSON to FastAPI server via async HTTP
  6. Server stores structured data in SQLite
  7. Generates embeddings:
    • Text chunks using sentence-transformers
    • Images using CLIP
  8. Stores vectors as float32 bytes in database
  9. Populates FTS5 table for full-text search

Non-blocking Indexing

Directory indexing uses a queue-based approach:

  • Files are collected upfront
  • Regular timer processes one file at configurable intervals (org-db-v3-index-delay; the Configuration section gives the default as 0.5 seconds)
  • Local/directory variables are suppressed for safety
  • Already-open buffers are preserved
  • Progress shown in echo area
  • Cancellable with M-x org-db-v3-cancel-indexing
  • Speed tunable via org-db-v3-index-delay (lower = faster, higher = more responsive)
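
The queue itself is implemented in Emacs Lisp. Purely to illustrate the pattern described above (collect files upfront, handle one per timer tick, allow cancellation), here is a small Python analogue. The class and function names are hypothetical and do not correspond to anything in the package.

```python
import time
from collections import deque


class IndexQueue:
    """Process queued files one at a time, with progress and cancellation."""

    def __init__(self, files, process):
        self._pending = deque(files)  # files are collected upfront
        self._process = process
        self.total = len(self._pending)
        self.done = 0

    def tick(self) -> bool:
        """Handle at most one file; return True while work remains."""
        if not self._pending:
            return False
        self._process(self._pending.popleft())
        self.done += 1
        return bool(self._pending)

    def cancel(self) -> None:
        """Drop everything that has not been processed yet."""
        self._pending.clear()

    def progress(self) -> str:
        return f"Indexed {self.done}/{self.total}"


def run_all(queue: IndexQueue, delay: float = 0.5, sleep=time.sleep) -> None:
    """Stand-in for the timer: pause `delay` seconds between files."""
    while queue.tick():
        sleep(delay)
```

In Emacs the pause between ticks is where the editor handles user input, which is why a longer delay feels more responsive and a shorter one finishes sooner.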

Search Pipeline

Semantic Search

  1. User enters query in minibuffer
  2. Server generates query embedding
  3. Calculates cosine similarity with all stored embeddings
  4. Returns top N results ranked by similarity
  5. Emacs displays in org-mode buffer with navigation
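
Steps 3 and 4 amount to a brute-force nearest-neighbour scan. The sketch below shows that computation in plain Python; the server presumably uses vectorized code, so treat this as an illustration of the math rather than the implementation. The names cosine and top_n are hypothetical.

```python
import heapq
import math


def cosine(a, b) -> float:
    """Cosine similarity: dot(a, b) / (|a| * |b|), or 0.0 for a zero vector."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_n(query_vec, stored, n=10):
    """Score every stored vector and return the n best as (score, key) pairs."""
    scored = ((cosine(query_vec, vec), key) for key, vec in stored.items())
    return heapq.nlargest(n, scored)
```

Because every stored embedding is scored on each query, cost grows linearly with the number of chunks, which is one reason the embedding count matters so much for linked files.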

Full-Text Search

  1. Query sent to FTS5 index
  2. SQLite returns matching content with snippets
  3. Character positions enable precise jumping
  4. Results displayed with context
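
To show what an FTS5 query with snippets looks like, here is a self-contained sketch using an in-memory database. The table demo_fts and its columns are hypothetical and not the project's fts_content schema. It also assumes the local Python sqlite3 build includes FTS5, which most builds do.

```python
import sqlite3


def fts_demo(docs, query, limit=10):
    """Index (filename, content) pairs and return (filename, snippet) matches."""
    conn = sqlite3.connect(":memory:")
    try:
        # filename is stored but not searched; content is the indexed column.
        conn.execute(
            "CREATE VIRTUAL TABLE demo_fts USING fts5(filename UNINDEXED, content)"
        )
        conn.executemany(
            "INSERT INTO demo_fts (filename, content) VALUES (?, ?)", docs
        )
        rows = conn.execute(
            # snippet(table, column index, match start, match end, ellipsis, max tokens)
            "SELECT filename, snippet(demo_fts, 1, '[', ']', '...', 8) "
            "FROM demo_fts WHERE demo_fts MATCH ? ORDER BY rank LIMIT ?",
            (query, limit),
        )
        return rows.fetchall()
    finally:
        conn.close()
```

The snippet function marks each match with the given start and end strings, which is what lets a client highlight the hit inside its surrounding context.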

Image Search

  1. Query text converted to CLIP embedding
  2. Similarity calculated with stored image embeddings
  3. Matching images returned with thumbnails
  4. Click to view full image

Embedding Models

Text (Semantic Search)

  • Model: all-MiniLM-L6-v2 (384 dimensions)
  • Fast inference (~50 sentences/second on CPU)
  • Good accuracy for semantic search
  • Runs locally (no API calls)
  • Downloaded automatically on first use (~90MB)

Images (CLIP)

  • Model: clip-ViT-B-32
  • Text-to-image and image-to-image similarity
  • Downloaded automatically on first use (~600MB)

Performance

  • Indexing: ~100 headlines/second
  • Search: <100ms for 1000 documents (local)
  • Embedding generation: ~50 sentences/second (CPU)
  • Database: SQLite handles 100k+ chunks efficiently
  • Non-blocking: Emacs remains responsive during indexing

Troubleshooting

Server won’t start

Check if port 8765 is already in use:

lsof -i :8765

The server includes automatic protection against multiple starts:

  • Detects if port is already in use
  • Identifies zombie/stuck processes
  • Offers to automatically clean them up (when started manually)
  • Auto-cleans zombies when auto-starting
  • Fast health checks (1 second timeout) prevent hanging
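
The difference between "port in use" and "server healthy" is what identifies a stuck process: the port accepts connections but /health does not answer in time. The sketch below illustrates both checks from Python. It is not the Emacs implementation; the function names are hypothetical, and treating any HTTP 200 from /health as healthy is an assumption about the response.

```python
import http.client
import socket
import urllib.request


def port_in_use(host="127.0.0.1", port=8765, timeout=1.0) -> bool:
    """True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        return sock.connect_ex((host, port)) == 0


def is_healthy(base_url="http://127.0.0.1:8765", timeout=1.0) -> bool:
    """True if GET /health answers with HTTP 200 within `timeout` seconds."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # Refused connections, timeouts, and HTTP errors all count as unhealthy.
        return False
```

A process for which port_in_use() is true while is_healthy() is false is a candidate for the cleanup described above.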

If you see “Address already in use” errors:

  1. Try M-x org-db-v3-start-server - it will prompt to kill zombies
  2. Or manually: M-x org-db-v3-kill-zombie-processes
  3. Or restart Emacs (auto-start will clean up zombies)

Change port in both Python and Emacs config if needed.

404 errors when indexing

Old code is cached in Emacs. Reload the elisp files:

M-x load-file RET reload.el RET
M-x org-db-v3-reload

Or restart Emacs.

No search results

  1. Verify files are indexed: Check server logs at /tmp/org-db-server.log
  2. Ensure embeddings generated: Look for embedding_service in logs
  3. Check database size: Visit web interface (W in menu)
  4. Try reindexing: Press r in menu

Indexing is slow or requires keypresses

If indexing appears to pause between files:

  1. Ensure you have the latest code (uses regular timers, not idle timers)
  2. Check/adjust indexing speed: (setq org-db-v3-index-delay 0.05) (or lower for faster)
  3. Reload code: M-x load-file RET reload.el RET

Indexing hangs or blocks Emacs

The queue-based system should prevent this, but if it happens:

  1. Cancel current operation: M-x org-db-v3-cancel-indexing
  2. Check queue status: M-x describe-variable RET org-db-v3-index-queue
  3. Reload code (see above)
  4. Restart Emacs if needed

Local variable prompts during indexing

The code suppresses local and directory variables, but if you still see prompts:

  1. Check that you’ve reloaded the latest code
  2. Set (setq enable-local-variables nil) temporarily
  3. Report as a bug

Import errors

Ensure uv environment is set up:

cd python
uv sync
uv run python -c "import org_db_server"

gptel tools not working

If you see “void-variable org-db-v3-server-url” errors:

  1. Make sure you’ve loaded the gptel-tools module:
    (load-file "/path/to/org-db-v3/elisp/org-db-v3-gptel-tools.el")
    (org-db-v3-gptel-register-tools)

  2. Check that the main org-db-v3 package is loaded (it defines org-db-v3-server-url)
  3. Verify the tools are registered:
    M-x describe-variable RET gptel-tools RET
    ;; Should show org_semantic_search and org_fulltext_search

  4. Make sure your GPTEL_TOOLS property uses the correct names:
    • org_semantic_search (not org-db-semantic-search)
    • org_fulltext_search (not org-db-fulltext)

Completed Features

  • [X] Semantic search with embeddings
  • [X] Full-text search (FTS5) with snippets
  • [X] Image search with CLIP
  • [X] Headline search
  • [X] Scoped search (directory, project, tag filters)
  • [X] Agenda with customizable date ranges (default 2 weeks)
  • [X] Transient menu UI with infix arguments
  • [X] Web interface with statistics
  • [X] Non-blocking directory indexing
  • [X] File browser (open files from database)
  • [X] Linked file indexing (PDF, DOCX, PPTX, etc.) - currently disabled due to performance issues with large collections (5000+ files)
  • [X] Linked file browser (browse and open indexed documents)
  • [ ] File-level embedding aggregation (needed for linked files at scale)
  • [X] gptel integration (expose search tools to LLMs)
  • [X] Search at point
  • [X] Auto-indexing on save
  • [X] Full file paths in search results
  • [X] Auto-enable for already-open buffers
  • [X] Server startup protection (prevents multiple servers, auto-cleans zombies)
  • [X] Fast health checks with short timeouts
  • [X] Configurable indexing speed
  • [X] Memory-optimized document conversion (95% reduction vs docling)

Future Enhancements

High Priority (Linked Files)

  • [ ] File-level embedding aggregation (reduce 360K embeddings to 6K)
  • [ ] Smart chunking for linked files (fixed-size instead of paragraph)
  • [ ] Selective linked file indexing (by type, size, date)
  • [ ] Hybrid search (file-level + chunk-level)

Other Enhancements

  • [ ] Reranking strategy
  • [ ] Custom chunk sizes/strategies
  • [ ] More search filters (date ranges, file patterns)
  • [ ] Incremental indexing optimizations

License

See LICENSE file.

Contributing

Contributions welcome! Please:

  1. Write tests for new features
  2. Follow existing code style
  3. Update documentation
  4. Use descriptive commit messages

Credits

Built with:

Inspired by org-roam, org-ql, and semantic search research.
