org-db v3

Overview

org-db v3 is an org-mode database with semantic search. It uses a hybrid architecture: Emacs handles org parsing and the UI, while a Python FastAPI backend provides storage, embeddings, and vector search.

Features

Semantic Search: Find org content by meaning using embeddings
Full-Text Search: Fast keyword search with SQLite FTS5 and snippets
Image Search: Find images by text description using CLIP embeddings
Headline Search: Browse and jump to org headlines across all files
Scoped Search: Limit searches to specific directories, projects (Projectile), or keywords/tags
Agenda: View TODOs, deadlines, and scheduled items from all indexed files
Non-blocking Indexing: Queue-based indexing with configurable delays keeps Emacs responsive
Linked File Indexing: (Experimental/disabled) Can index linked documents (PDF, DOCX, PPTX, etc.) - needs optimization for large collections
gptel Integration: Expose search tools to LLMs for AI-powered org file exploration
Transient UI: Easy-to-use menu system for all commands (H-v or M-x org-db)
Web Interface: Browse statistics and documentation via browser

Architecture

┌─────────────────┐         ┌──────────────────────┐
│  Emacs (UI)     │◄───────►│  FastAPI Server      │
│  - org parsing  │  HTTP   │  - SQLite storage    │
│  - search UI    │         │  - embeddings (ML)   │
│  - navigation   │         │  - vector search     │
│  - transient    │         │  - CLIP (images)     │
└─────────────────┘         └──────────────────────┘

Installation

Prerequisites

Emacs 28.1+
Python 3.10+
uv (Python package manager)

Python Backend

cd python
uv sync

This installs all dependencies including:

FastAPI & Uvicorn (web server)
sentence-transformers (text embeddings)
CLIP (image embeddings)
SQLite (database)

Emacs Package

Add to your Emacs config:

(add-to-list 'load-path "/path/to/org-db-v3/elisp")
(require 'org-db-v3)

;; Bind the menu to a convenient key
(global-set-key (kbd "H-v") 'org-db-menu)

;; Auto-start server when Emacs starts (optional)
(setq org-db-v3-auto-start-server t)

;; Optional: Enable gptel integration for AI-powered search
;; with-eval-after-load also works when gptel is loaded later in your init
(with-eval-after-load 'gptel
  (require 'org-db-v3-gptel-tools)
  (org-db-v3-gptel-register-tools))

Required Emacs packages:

plz (async HTTP client)
transient (menu system)

Optional packages:

gptel (for AI-powered search integration)

Install via package manager or manually.

Quick Start

1. Start the Server

From the python directory:

cd python
uv run uvicorn org_db_server.main:app --reload --port 8765 > /tmp/org-db-server.log 2>&1 &

Or from Emacs:

M-x org-db-v3-start-server

2. Open the Menu

M-x org-db
;; or press H-v if you bound it

3. Index Your Org Files

Choose from the menu:

d: Index entire directory (recursive, non-blocking)
u: Update current file
U: Update all open org buffers
r: Reindex all files in database

Or manually:

M-x org-db-v3-index-directory

Files are processed one at a time with configurable delays to keep Emacs responsive.

4. Search

Semantic Search (by meaning)

M-x org-db-v3-semantic-search
;; or press 'v' in the menu

Enter a query like “machine learning projects” and find content by semantic similarity.

Full-Text Search (by keywords)

M-x org-db-v3-fulltext-search
;; or press 'k' in the menu

Search at Point

M-x org-db-v3-search-at-point
;; or press 'p' in the menu

Uses the selected region, or the sentence at point if no region is active, as the search query.

Image Search

M-x org-db-v3-image-search
;; or press 'i' in the menu

Find images by text description using CLIP embeddings.

Headline Search

M-x org-db-v3-headline-search
;; or press 'h' in the menu

;; With prefix arg, choose sort order interactively
C-u M-x org-db-v3-headline-search

Browse all headlines with completing-read interface.

Sort Order Options:

last_updated: Most recently updated files first (default)
filename: Alphabetical by filename
indexed_at: Most recently indexed files first

You can customize the default sort order:

(setq org-db-v3-headline-sort-order "last_updated")  ; default
;; or "filename" or "indexed_at"

Using Search Scope

All searches (semantic, fulltext, headline, image) can be limited to specific subsets of your indexed files. The scope setting applies to the next search only and then resets to “All files”.

From the main menu (H-v), use the scope options:

-a: Search all files (default)
-d: Limit to a specific directory
-p: Limit to a Projectile project (prompts to select from known projects)
-t: Limit to files with a specific keyword/tag

Example workflow:

Press H-v to open menu
Press -p to select a project scope
Choose “my-notes” from the project list
Press v for semantic search
Enter your query - results will only come from “my-notes” project
Next search will automatically reset to “All files”

The current scope is shown in the menu header: org-db v3 [Scope: Project: my-notes]

Note: Scope options use dash prefixes (-a, -d, -p, -t) to distinguish them from action keys.

5. Agenda

M-x org-db-v3-agenda
;; or press 'a' in the menu

View TODO items with deadlines and scheduled dates from indexed files.

By default, shows items due in the next 2 weeks. Use a prefix argument (C-u M-x org-db-v3-agenda) to specify a custom date range:

+2w - 2 weeks (default)
+1m - 1 month
+3w - 3 weeks
2025-12-31 - specific date
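
The range specifiers above can be resolved to a concrete end date in a few lines. The following is a minimal Python sketch, not the parsing org-db v3 itself performs: the helper name `resolve_range`, the extra `d` (days) unit, and the rule of clamping to the last day of a shorter month are all assumptions made for illustration.

```python
import calendar
import re
from datetime import date, timedelta


def resolve_range(spec: str, today: date) -> date:
    """Turn '+2w', '+1m', '+10d', or 'YYYY-MM-DD' into an end date."""
    match = re.fullmatch(r"\+(\d+)([dwm])", spec)
    if match is None:
        # Not a relative specifier: treat it as an absolute ISO date.
        return date.fromisoformat(spec)
    amount, unit = int(match.group(1)), match.group(2)
    if unit == "d":
        return today + timedelta(days=amount)
    if unit == "w":
        return today + timedelta(weeks=amount)
    # Months: advance the calendar month, then clamp the day so that
    # e.g. January 31 plus one month lands on the last day of February.
    month_index = today.month - 1 + amount
    year = today.year + month_index // 12
    month = month_index % 12 + 1
    day = min(today.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)
```

For example, `resolve_range("+2w", date.today())` gives the date two weeks from now, which matches the default agenda window.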

6. Linked File Indexing

⚠️ IMPORTANT: Linked file indexing is currently DISABLED by default due to performance issues.

org-db v3 has the capability to index linked documents (PDF, DOCX, PPTX, etc.) found in your org files, but this feature needs significant work to handle large-scale use cases efficiently.

Known Issues

With large document collections (5000+ linked files), linked file indexing can cause:

Database bloat: 29.5 GB database from paragraph-level chunking
Slow searches: 30+ second timeouts or server crashes
360,000+ chunk embeddings: From aggressive per-paragraph chunking

The current implementation chunks each paragraph separately, creating too many embeddings for large PDF collections.

Current Status

The feature is disabled by default in python/org_db_server/config.py:

enable_linked_files: bool = False  # Disabled to prevent database bloat

Future Work Needed

Before enabling for large collections, the following improvements are needed:

File-level embeddings: Aggregate chunks into single embeddings (98% size reduction)
Smarter chunking: Use larger fixed-size chunks instead of paragraph chunking
Selective indexing: Only index important/recent documents
Better limits: Enforce per-file chunk limits and file size limits

See python/LINKED_FILES_OPTIMIZATION.md and python/EMBEDDING_AGGREGATION_STRATEGIES.md for detailed analysis and solutions.
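
To make the difference between paragraph chunking and capped fixed-size chunking concrete, here is a small Python sketch. It is an illustration under stated assumptions, not the code in python/org_db_server/services/chunking.py: the function names, the 2000-character chunk size, and the 50-chunk cap are hypothetical.

```python
def paragraph_chunks(text: str) -> list[str]:
    """One chunk per blank-line-separated paragraph (the costly strategy)."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]


def fixed_size_chunks(text: str, size: int = 2000, max_chunks: int = 50) -> list[str]:
    """Chunks of at most `size` characters, capped at `max_chunks` per file."""
    # Collapse all whitespace so paragraph breaks no longer force a new chunk.
    normalized = " ".join(text.split())
    chunks = [normalized[i:i + size] for i in range(0, len(normalized), size)]
    # Enforce the per-file limit by dropping everything past the cap.
    return chunks[:max_chunks]
```

A document made of many short paragraphs yields one embedding per paragraph with the first function, but only a handful with the second, which is the reduction the list above is aiming for. Splitting on raw character offsets can cut words in half; a real implementation would probably split on sentence or token boundaries.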

If You Want to Try It

If you have a small number of linked files (<100), you can enable it:

export ORG_DB_ENABLE_LINKED_FILES=true
export ORG_DB_MAX_LINKED_FILE_SIZE_MB=20      # Skip large files
export ORG_DB_MAX_LINKED_FILE_CHUNKS=50       # Limit chunks per file

Or edit python/org_db_server/config.py and restart the server.
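
For readers curious how such environment variables might map onto settings, the sketch below reads the three ORG_DB_* variables with the standard library only. It is an assumption-laden illustration, not the contents of config.py: the class name `LinkedFileSettings`, the field names other than `enable_linked_files`, the accepted truthy strings, and the 20 MB / 50 chunk fallbacks are all hypothetical.

```python
import os
from dataclasses import dataclass
from typing import Mapping

# Strings treated as "enabled"; this set is an assumption.
TRUE_VALUES = {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class LinkedFileSettings:
    # Only enable_linked_files appears in the README; the rest is illustrative.
    enable_linked_files: bool = False
    max_linked_file_size_mb: int = 20
    max_linked_file_chunks: int = 50


def load_settings(env: Mapping[str, str] = os.environ) -> LinkedFileSettings:
    """Read the ORG_DB_* variables, falling back to the defaults above."""
    defaults = LinkedFileSettings()
    flag = env.get("ORG_DB_ENABLE_LINKED_FILES", "").strip().lower()
    return LinkedFileSettings(
        enable_linked_files=flag in TRUE_VALUES,
        max_linked_file_size_mb=int(
            env.get("ORG_DB_MAX_LINKED_FILE_SIZE_MB", defaults.max_linked_file_size_mb)
        ),
        max_linked_file_chunks=int(
            env.get("ORG_DB_MAX_LINKED_FILE_CHUNKS", defaults.max_linked_file_chunks)
        ),
    )
```

Because the values are read when `load_settings` is called, the variables need to be exported in the shell that launches the server, and the server has to be restarted after a change.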

Supported Formats (when enabled)

PDF files (.pdf)
Microsoft Word (.docx)
PowerPoint (.pptx)
Legacy Office formats (.doc, .xls, .xlsx, .ppt) via docling subprocess
HTML/Web (.html, .htm)
Images with OCR (.png, .jpg, .jpeg, etc.)
And more (see SUPPORTED_FORMATS.md)

Browsing Linked Files

M-x org-db-v3-open-linked-file
;; or press 'F' in the menu

Shows all indexed linked files with file type, source location, chunks, and conversion status. Press RET to jump to the org file link location, or C-u RET to open the linked file directly.

Configuration

;; Linked file indexing (DISABLED by default - see above)
(setq org-db-v3-enable-linked-files nil)  ; default: disabled

;; If you enable it for small collections:
(setq org-db-v3-max-linked-file-size 20971520)  ; 20MB limit

7. gptel Integration (AI-Powered Search)

If you use gptel for LLM integration in Emacs, org-db v3 provides tools that allow AI assistants to search your org files.

Setup

;; Load the gptel tools module
(require 'org-db-v3-gptel-tools)

;; Register the tools with gptel
(org-db-v3-gptel-register-tools)

Available Tools

The integration provides two search tools for LLMs:

`org_semantic_search`

Search using AI/semantic similarity. Best for:

Conceptual queries (“projects related to machine learning”)
Finding related content without exact keywords
Questions (“what have I written about travel?”)

`org_fulltext_search`

Search using exact keyword matching (SQLite FTS5). Best for:

Finding specific terms or names
Exact phrases
Fast keyword lookup

Usage in gptel

In any org file with gptel enabled, you can configure the tools in the file header:

#+PROPERTY: GPTEL_TOOLS org_semantic_search org_fulltext_search

Or set buffer-locally:

M-x gptel-set-tools RET org_semantic_search org_fulltext_search RET

Then ask the LLM to search your org files:

“Search my org files for travel plans”
“What have I written about machine learning?”
“Find all mentions of project deadlines”

The LLM will automatically use the appropriate tool and present results in a readable format.

Tool Parameters

Both tools accept optional parameters:

limit: Number of results (default 5, max 20)
filename_pattern: SQL LIKE pattern to filter files (e.g., %2024% for files in 2024, %project% for project-related files)

Example query: “Search my 2024 journal entries for mentions of conferences”

For a query like this, the LLM can add a filename_pattern such as %journal/2024% to narrow the search.
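
To see what a filename_pattern matches, the Python sketch below runs SQL LIKE against an in-memory SQLite table. The table layout and the helper name `filter_filenames` are assumptions for illustration; only the LIKE semantics (`%` matches any run of characters, `_` matches exactly one) are standard SQLite behavior.

```python
import sqlite3


def filter_filenames(filenames: list[str], pattern: str) -> list[str]:
    """Return the filenames matching a SQL LIKE pattern, sorted."""
    conn = sqlite3.connect(":memory:")
    try:
        # A throwaway one-column table stands in for the real files table.
        conn.execute("CREATE TABLE files (filename TEXT)")
        conn.executemany("INSERT INTO files VALUES (?)", [(f,) for f in filenames])
        rows = conn.execute(
            "SELECT filename FROM files WHERE filename LIKE ? ORDER BY filename",
            (pattern,),
        ).fetchall()
    finally:
        conn.close()
    return [row[0] for row in rows]
```

Note that SQLite's LIKE is case-insensitive for ASCII characters by default, so %Project% and %project% select the same files.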

Unregistering Tools

To remove the tools from gptel:

(org-db-v3-gptel-unregister-tools)

Usage

Transient Menu

Press H-v or M-x org-db to open the menu:

Scope Options

-a: Search all files (reset scope)
-d: Limit to directory
-p: Limit to Projectile project
-t: Limit to tag/keyword

Actions

q: Quit menu

Search Commands

v: Semantic search (search by meaning)
k: Full-text search (keyword search with FTS5)
h: Headline search (browse headlines)
i: Image search (find images by description)
p: Search at point (use text at cursor)

File Management

f: Open file from database (browse indexed files)
F: Open linked file (browse indexed linked documents like PDFs, DOCX, etc.)

Indexing

u: Update current file
U: Update all open org buffers
d: Index directory (recursive, non-blocking)
r: Reindex database (all files)

Agenda

a: Show agenda (TODOs, deadlines, scheduled)

Server Management

S: Server status
R: Restart server
L: View server logs
W: Open web interface

Other

q: Quit menu

Search Results Navigation

In the *org-db search* buffer:

RET: Jump to result location
n: Next result
p: Previous result
s: New search
q: Quit window

Configuration

;; Server connection
(setq org-db-v3-server-host "127.0.0.1")
(setq org-db-v3-server-port 8765)

;; Auto-start server
(setq org-db-v3-auto-start-server t)

;; Indexing speed (seconds between files)
;; Lower = faster indexing but less responsive Emacs and higher server load
;; Higher = slower indexing but more responsive Emacs and prevents server overload
(setq org-db-v3-index-delay 0.5)  ; default: 500ms (allows linked file processing)

;; Search defaults
(setq org-db-v3-search-default-limit 10)

;; Headline search sort order
(setq org-db-v3-headline-sort-order "last_updated")  ; default
;; Options: "last_updated", "filename", "indexed_at"

;; Linked file indexing (DISABLED by default due to performance issues)
(setq org-db-v3-enable-linked-files nil)  ; disabled by default
;; (setq org-db-v3-max-linked-file-size 20971520)  ; 20MB limit if enabled

;; gptel integration (optional)
(require 'org-db-v3-gptel-tools)
(org-db-v3-gptel-register-tools)
(setq org-db-v3-gptel-search-limit 5)  ; results returned to LLM

Web Interface

Open the web interface to view statistics and API documentation:

M-x org-db-v3-open-web-interface
;; or press 'W' in the menu
;; or visit http://127.0.0.1:8765 in your browser

The homepage shows:

Database location and size
File counts, headlines, embeddings
Recent files indexed
Complete API documentation
Getting started guide

API Endpoints

The FastAPI server provides REST endpoints. Visit http://127.0.0.1:8765/docs for interactive documentation.

Core Endpoints

Health Check

GET /health

Index File

POST /api/file
Content-Type: application/json

{
  "filename": "/path/to/file.org",
  "md5": "abc123...",
  "file_size": 1024,
  "content": "full file content...",
  "headlines": [...],
  "links": [...],
  "keywords": [...],
  "src_blocks": [...],
  "images": [...]
}
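
In normal use the Emacs client builds this payload from the parsed org buffer. For scripting or testing, the scalar fields could be assembled in Python as sketched below. This is a hedged illustration: the helper name `build_index_payload` is hypothetical, and it assumes that md5 is the hash of the file's bytes and that file_size is measured in bytes. The structured lists are left empty because their element format is not shown here.

```python
import hashlib
import os
from pathlib import Path


def build_index_payload(path: str) -> dict:
    """Assemble the scalar fields of a POST /api/file body for one org file."""
    raw = Path(path).read_bytes()
    return {
        "filename": os.path.abspath(path),
        # Assumption: the server expects an MD5 of the raw file bytes.
        "md5": hashlib.md5(raw).hexdigest(),
        # Assumption: size is counted in bytes, not characters.
        "file_size": len(raw),
        "content": raw.decode("utf-8"),
        # Left empty: these are produced by the Emacs-side org parser.
        "headlines": [],
        "links": [],
        "keywords": [],
        "src_blocks": [],
        "images": [],
    }
```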

Semantic Search

POST /api/search/semantic
Content-Type: application/json

{
  "query": "your search query",
  "limit": 10,
  "model": "all-MiniLM-L6-v2"  // optional
}

Response includes similarity scores, filenames, line numbers, and context.
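
The endpoint can be called from any HTTP client. Below is a minimal Python sketch using only the standard library. The function names are hypothetical, the base URL assumes the default host and port from the Configuration section, and the response is returned as parsed JSON without assuming its exact field names.

```python
import json
import urllib.request
from typing import Optional

BASE_URL = "http://127.0.0.1:8765"  # default host/port; adjust if you changed them


def semantic_search_request(
    query: str, limit: int = 10, model: Optional[str] = None
) -> urllib.request.Request:
    """Build (but do not send) a POST request for /api/search/semantic."""
    body = {"query": query, "limit": limit}
    if model is not None:
        # "model" is optional; omit it to use the server's default.
        body["model"] = model
    return urllib.request.Request(
        f"{BASE_URL}/api/search/semantic",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def semantic_search(query: str, limit: int = 10) -> dict:
    """Send the request and return the decoded JSON; needs a running server."""
    request = semantic_search_request(query, limit)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)
```

Splitting request construction from sending keeps the payload easy to inspect and test without a live server.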

Full-Text Search

POST /api/search/fulltext
Content-Type: application/json

{
  "query": "keyword search",
  "limit": 10
}

Returns snippets and character positions for jumping to results.

Image Search

POST /api/search/images
Content-Type: application/json

{
  "query": "a photo of a cat",
  "limit": 10
}

Headline Search

POST /api/search/headlines
Content-Type: application/json

{
  "query": "project",
  "limit": 10,
  "sort_by": "last_updated"  // optional: "filename", "last_updated", "indexed_at"
}

Sorts by most recently updated files first by default.

Agenda

POST /api/agenda
Content-Type: application/json

{
  "days_ahead": 7
}

Returns TODO items with deadlines and scheduled dates.

File Management

GET /api/files           # List all indexed files
DELETE /api/file?filename=/path/to/file.org  # Remove file from database
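
Because the filename travels in the query string, paths containing spaces, `&`, or other reserved characters should be URL-encoded. A small Python sketch of building such a request follows; the helper name `delete_file_request` is hypothetical and the base URL assumes the default host and port.

```python
import urllib.parse
import urllib.request


def delete_file_request(
    filename: str, base_url: str = "http://127.0.0.1:8765"
) -> urllib.request.Request:
    """Build (but do not send) a DELETE request for one indexed file."""
    # urlencode escapes spaces, '&', '/', and non-ASCII characters.
    query = urllib.parse.urlencode({"filename": filename})
    return urllib.request.Request(f"{base_url}/api/file?{query}", method="DELETE")
```

Passing the returned object to `urllib.request.urlopen` sends it to a running server.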

Statistics

GET /api/stats/          # Database statistics
GET /api/stats/files     # List of files with timestamps

Database Schema

Database location: ~/org-db/org-db-v3.db

SQLite database with 22 tables including:

files: Indexed org files with MD5 hashes
headlines: Org headings with metadata (TODO, tags, priority, scheduled, deadline)
chunks: Text chunks for semantic search with line numbers
embeddings: Vector embeddings stored as BLOB (float32)
fts_content: Full-text search index (SQLite FTS5)
links: All links with type, path, and description
properties: Org properties (key-value pairs)
keywords: File-level keywords
tags: Tags with relationships to headlines
src_blocks: Source code blocks with language and contents
images: Image paths with positions
image_embeddings: CLIP embeddings for images

Development

Running Tests

Python tests:

cd python
uv run pytest tests/ -v

Emacs tests:

emacs -batch -l tests/org-db-v3-search-test.el -f ert-run-tests-batch-and-exit

Reloading Code During Development

After making changes to elisp files:

M-x load-file RET reload.el RET
M-x org-db-v3-reload

Or restart Emacs to load fresh code.

Project Structure

org-db-v3/
├── python/
│   ├── org_db_server/
│   │   ├── api/              # FastAPI routes
│   │   │   ├── indexing.py   # File indexing endpoints
│   │   │   ├── search.py     # Search endpoints
│   │   │   ├── stats.py      # Statistics endpoints
│   │   │   ├── agenda.py     # Agenda endpoint
│   │   │   └── linked_files.py # Linked file endpoints
│   │   ├── models/           # Pydantic schemas & DB models
│   │   ├── services/         # Business logic
│   │   │   ├── database.py   # SQLite operations
│   │   │   ├── embeddings.py # Text embeddings
│   │   │   ├── clip_service.py # Image embeddings
│   │   │   ├── chunking.py   # Text chunking
│   │   │   ├── docling_service.py # Document conversion
│   │   │   └── docling_worker.py  # Subprocess worker
│   │   └── templates/        # HTML templates
│   └── tests/                # Python tests
├── elisp/
│   ├── org-db-v3.el          # Main package
│   ├── org-db-v3-parse.el    # Org parsing
│   ├── org-db-v3-client.el   # HTTP client & indexing queue
│   ├── org-db-v3-server.el   # Server management
│   ├── org-db-v3-search.el   # Search UI
│   ├── org-db-v3-agenda.el   # Agenda functionality
│   ├── org-db-v3-ui.el       # Transient menu
│   └── org-db-v3-gptel-tools.el  # gptel integration
├── tests/                    # Emacs tests
└── reload.el                 # Development helper

How It Works

Indexing Pipeline

Emacs parses org file using org-element-parse-buffer
Extracts headlines, links, keywords, src blocks, images, and full content
Adds to queue for non-blocking processing
A timer processes one file at a time
Sends JSON to FastAPI server via async HTTP
Server stores structured data in SQLite
Generates embeddings:
- Text chunks using sentence-transformers
- Images using CLIP
Stores vectors as float32 bytes in database
Populates FTS5 table for full-text search
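
The "float32 bytes" storage step can be illustrated with the standard library alone. The sketch below round-trips a vector through a SQLite BLOB column. The table and column names are assumptions for the example rather than the actual schema, and `array("f")` uses the machine's native byte order, which is fine as long as the database is read on the same kind of machine that wrote it.

```python
import sqlite3
from array import array


def to_blob(vector: list[float]) -> bytes:
    """Pack floats as 4-byte float32 values."""
    return array("f", vector).tobytes()


def from_blob(blob: bytes) -> list[float]:
    """Unpack a float32 BLOB back into a list of Python floats."""
    values = array("f")
    values.frombytes(blob)
    return values.tolist()


def roundtrip(vector: list[float]) -> list[float]:
    """Store a vector in an in-memory database and read it back."""
    conn = sqlite3.connect(":memory:")
    try:
        # Hypothetical two-column table standing in for the embeddings table.
        conn.execute("CREATE TABLE embeddings (chunk_id INTEGER, vector BLOB)")
        conn.execute("INSERT INTO embeddings VALUES (?, ?)", (1, to_blob(vector)))
        (blob,) = conn.execute(
            "SELECT vector FROM embeddings WHERE chunk_id = 1"
        ).fetchone()
    finally:
        conn.close()
    return from_blob(blob)
```

At 4 bytes per dimension, a 384-dimensional all-MiniLM-L6-v2 embedding occupies 1536 bytes per chunk.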

Non-blocking Indexing

Directory indexing uses a queue-based approach:

Files are collected upfront
A regular timer processes one file per interval (default: 0.5 seconds, as set by org-db-v3-index-delay in Configuration)
Local/directory variables are suppressed for safety
Already-open buffers are preserved
Progress shown in echo area
Cancellable with M-x org-db-v3-cancel-indexing
Speed tunable via org-db-v3-index-delay (lower = faster, higher = more responsive)

Search Pipeline

Semantic Search

User enters query in minibuffer
Server generates query embedding
Calculates cosine similarity with all stored embeddings
Returns top N results ranked by similarity
Emacs displays in org-mode buffer with navigation
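
The ranking step above amounts to a brute-force cosine similarity scan. Here is a minimal pure-Python sketch of that idea. It is not the server's implementation, which presumably uses vectorized math; the function names and the dictionary-based storage are assumptions for illustration.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_n(
    query_vec: list[float], stored: dict[str, list[float]], n: int = 5
) -> list[tuple[str, float]]:
    """Score every stored embedding against the query and keep the best n."""
    scored = [(chunk_id, cosine(query_vec, vec)) for chunk_id, vec in stored.items()]
    # Highest similarity first.
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:n]
```

Because every stored embedding is compared against the query, the cost grows linearly with the number of chunks, which is one reason the very large chunk counts described under Linked File Indexing lead to slow searches.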

Full-Text Search

Query sent to FTS5 index
SQLite returns matching content with snippets
Character positions enable precise jumping
Results displayed with context
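
The FTS5 query with snippets can be tried in isolation with Python's built-in sqlite3 module, provided your Python's SQLite was compiled with FTS5 (most current builds are). This is a sketch under assumptions: the column names of `fts_content` and the `[`/`]` highlight markers are hypothetical, while `MATCH`, `rank`, and `snippet()` are standard FTS5 features.

```python
import sqlite3


def demo_index() -> sqlite3.Connection:
    """Create a tiny in-memory FTS5 index with two example documents."""
    conn = sqlite3.connect(":memory:")
    # UNINDEXED keeps filename retrievable but excludes it from matching.
    conn.execute(
        "CREATE VIRTUAL TABLE fts_content USING fts5(filename UNINDEXED, content)"
    )
    conn.executemany(
        "INSERT INTO fts_content VALUES (?, ?)",
        [
            ("/notes/ml.org", "Vector embeddings enable semantic search over notes."),
            ("/notes/shopping.org", "Grocery list: milk, eggs, bread."),
        ],
    )
    return conn


def fulltext_search(conn: sqlite3.Connection, query: str, limit: int = 10):
    """Return (filename, snippet) rows, best match first."""
    return conn.execute(
        # snippet(table, column index, start mark, end mark, ellipsis, max tokens)
        "SELECT filename, snippet(fts_content, 1, '[', ']', '...', 8) "
        "FROM fts_content WHERE fts_content MATCH ? ORDER BY rank LIMIT ?",
        (query, limit),
    ).fetchall()
```

Calling `fulltext_search(demo_index(), "embeddings")` returns the one matching file together with a snippet in which the matched term is wrapped in the chosen markers.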

Image Search

Query text converted to CLIP embedding
Similarity calculated with stored image embeddings
Matching images returned with thumbnails
Click to view full image

Embedding Models

Text (Semantic Search)

Model: all-MiniLM-L6-v2 (384 dimensions)
Fast inference (~50 sentences/second on CPU)
Good accuracy for semantic search
Runs locally (no API calls)
Downloaded automatically on first use (~90MB)

Images (CLIP)

Model: clip-ViT-B-32
Text-to-image and image-to-image similarity
Downloaded automatically on first use (~600MB)

Performance

Indexing: ~100 headlines/second
Search: <100ms for 1000 documents (local)
Embedding generation: ~50 sentences/second (CPU)
Database: SQLite handles 100k+ chunks efficiently
Non-blocking: Emacs remains responsive during indexing

Troubleshooting

Server won’t start

Check if port 8765 is already in use:

lsof -i :8765

The server includes automatic protection against multiple starts:

Detects if port is already in use
Identifies zombie/stuck processes
Offers to automatically clean them up (when started manually)
Auto-cleans zombies when auto-starting
Fast health checks (1 second timeout) prevent hanging
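
The port-in-use check with a short timeout can be reproduced outside Emacs. The following Python sketch is an illustration of the idea rather than the package's own logic: the helper name `port_in_use` is hypothetical, and it only reports whether something accepts TCP connections on the port, not whether that process is an org-db server.

```python
import socket


def port_in_use(port: int, host: str = "127.0.0.1", timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        # A short timeout keeps the check from hanging on a stuck process.
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an error code otherwise.
        return sock.connect_ex((host, port)) == 0
```

If `port_in_use(8765)` is true but GET /health does not answer, the port is probably held by a stuck process of the kind the cleanup steps below address.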

If you see “Address already in use” errors:

Try M-x org-db-v3-start-server - it will prompt to kill zombies
Or manually: M-x org-db-v3-kill-zombie-processes
Or restart Emacs (auto-start will clean up zombies)

Change port in both Python and Emacs config if needed.

404 errors when indexing

Old code is cached in Emacs. Reload the elisp files:

M-x load-file RET reload.el RET
M-x org-db-v3-reload

Or restart Emacs.

No search results

Verify files are indexed: Check server logs at /tmp/org-db-server.log
Ensure embeddings generated: Look for embedding_service in logs
Check database size: Visit web interface (W in menu)
Try reindexing: Press r in menu

Indexing is slow or requires keypresses

If indexing appears to pause between files:

Ensure you have the latest code (uses regular timers, not idle timers)
Check/adjust indexing speed: (setq org-db-v3-index-delay 0.05) (or lower for faster)
Reload code: M-x load-file RET reload.el RET

Indexing hangs or blocks Emacs

The queue-based system should prevent this, but if it happens:

Cancel current operation: M-x org-db-v3-cancel-indexing
Check queue status: M-x describe-variable RET org-db-v3-index-queue
Reload code (see above)
Restart Emacs if needed

Local variable prompts during indexing

The code suppresses local and directory variables, but if you still see prompts:

Check that you’ve reloaded the latest code
Set (setq enable-local-variables nil) temporarily
Report as a bug

Import errors

Ensure uv environment is set up:

cd python
uv sync
uv run python -c "import org_db_server"

gptel tools not working

If you see “void-variable org-db-v3-server-url” errors:

Make sure you’ve loaded the gptel-tools module:

(load-file "/path/to/org-db-v3/elisp/org-db-v3-gptel-tools.el")
(org-db-v3-gptel-register-tools)

Check that the main org-db-v3 package is loaded (it defines org-db-v3-server-url)

Verify the tools are registered:

M-x describe-variable RET gptel-tools RET
;; Should show org_semantic_search and org_fulltext_search

Make sure your GPTEL_TOOLS property uses the correct names:
- org_semantic_search (not org-db-semantic-search)
- org_fulltext_search (not org-db-fulltext)

Completed Features

[X] Semantic search with embeddings
[X] Full-text search (FTS5) with snippets
[X] Image search with CLIP
[X] Headline search
[X] Scoped search (directory, project, tag filters)
[X] Agenda with customizable date ranges (default 2 weeks)
[X] Transient menu UI with infix arguments
[X] Web interface with statistics
[X] Non-blocking directory indexing
[X] File browser (open files from database)
[X] Linked file indexing (PDF, DOCX, PPTX, etc.) - currently disabled due to performance issues with large collections (5000+ files)
[X] Linked file browser (browse and open indexed documents)
[ ] File-level embedding aggregation (needed for linked files at scale)
[X] gptel integration (expose search tools to LLMs)
[X] Search at point
[X] Auto-indexing on save
[X] Full file paths in search results
[X] Auto-enable for already-open buffers
[X] Server startup protection (prevents multiple servers, auto-cleans zombies)
[X] Fast health checks with short timeouts
[X] Configurable indexing speed
[X] Memory-optimized document conversion (95% reduction vs docling)

Future Enhancements

High Priority (Linked Files)

[ ] File-level embedding aggregation (reduce 360K embeddings to 6K)
[ ] Smart chunking for linked files (fixed-size instead of paragraph)
[ ] Selective linked file indexing (by type, size, date)
[ ] Hybrid search (file-level + chunk-level)

Other Enhancements

[ ] Reranking strategy
[ ] Custom chunk sizes/strategies
[ ] More search filters (date ranges, file patterns)
[ ] Incremental indexing optimizations

License

See LICENSE file.

Contributing

Contributions welcome! Please:

Write tests for new features
Follow existing code style
Update documentation
Use descriptive commit messages

Credits

Built with:

FastAPI (Python web framework)
sentence-transformers (text embeddings)
CLIP (image embeddings)
plz.el (async HTTP for Emacs)
uv (Python package manager)
transient (Emacs menu system)

Inspired by org-roam, org-ql, and semantic search research.

jkitchin/org-db-v3
