Code Indexer (cidx)

AI-powered semantic code search for your codebase. Find code by meaning, not just keywords.

Version 7.1.0

New in 7.1.0: Full-text search (FTS) support with Tantivy backend - see Full-Text Search section below.

Two Operating Modes

CLI Mode (cidx command)

Traditional command-line interface for individual developers:

  • Direct CLI commands - cidx init, cidx index, cidx query
  • Local project indexing - Index your current project directory
  • Real-time progress - Individual file status progression (starting → chunking → vectorizing → finalizing → complete)
  • Integrated with Claude CLI - Seamless AI code analysis with semantic search context

Multi-User Server Mode

FastAPI web service for team environments:

  • REST API - Web service with authentication and user management
  • JWT Authentication - Token-based security
  • Golden Repositories - Centralized code repository management for teams
  • Background Processing - Async job system for heavy operations
  • Multi-project support - Team workspace management

Core Capabilities

Semantic Search Engine

  • Vector embeddings - Find code by meaning using fixed-size chunking
  • Multiple providers - Local (Ollama) or cloud (VoyageAI) embeddings
  • Smart indexing - Parallel file processing, incremental updates, git-aware
  • Advanced filtering - Language, file paths, extensions, similarity scores
  • Multi-language - Python, JavaScript, TypeScript, Java, C#, Go, Kotlin, and more

Parallel Processing Architecture

  • Slot-based file processing - Real-time state visibility with natural slot reuse
  • Dual thread pools - Frontend file processing (threadcount+2) feeds backend vectorization (threadcount)
  • Real-time progress - Individual file status progression (starting → chunking → vectorizing → finalizing → complete)
  • Clean cancellation - Post-write cancellation preserves file atomicity

Vector Search Architecture (v7.0)

HNSW Graph-Based Indexing

Code Indexer v7.0 introduces HNSW (Hierarchical Navigable Small World) graph-based indexing for blazing-fast semantic search with O(log N) complexity.

Performance:

  • 300x speedup: ~20ms queries (vs 6+ seconds with binary index)
  • Scalability: Tested with 37K vectors, sub-30ms response times
  • Memory efficient: 154 MB index for 37K vectors (4.2 KB per vector)

Algorithm Complexity:

Query Time Complexity: O(log N + K)
  - HNSW graph search: O(log N) average case
  - Candidate loading: O(K) where K = limit * 2, K << N
  - Practical: ~20ms for 37K vectors

HNSW Configuration:

  • M=16: Connections per node (graph connectivity)
  • ef_construction=200: Build-time accuracy parameter
  • ef_query=50: Query-time accuracy parameter
  • Space=cosine: Cosine similarity distance metric
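
These parameters appear in any HNSW implementation; whether or not hnswlib is the exact library used internally, a minimal sketch with hnswlib (embedding dimension of 1024 is illustrative) looks like:

import numpy as np
import hnswlib

dim = 1024                                               # embedding dimension (model dependent)
vectors = np.random.rand(1000, dim).astype(np.float32)   # placeholder embeddings
ids = np.arange(1000)

index = hnswlib.Index(space="cosine", dim=dim)           # Space=cosine
index.init_index(max_elements=50_000, ef_construction=200, M=16)
index.add_items(vectors, ids)
index.set_ef(50)                                          # ef_query=50

query = np.random.rand(dim).astype(np.float32)
labels, distances = index.knn_query(query, k=20)          # O(log N) graph search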

Filesystem Vector Store

Container-free vector storage using filesystem + HNSW indexing:

Storage Structure:

.code-indexer/index/<collection>/
├── hnsw_index.bin              # HNSW graph (O(log N) search)
├── id_index.bin                # Binary mmap ID→path mapping
├── collection_meta.json        # Metadata + staleness tracking
└── vectors/                    # Quantized path structure
    └── <level1>/<level2>/<level3>/<level4>/
        └── vector_<uuid>.json  # Individual vector + payload

Key Features:

  • Path-as-Vector Quantization: 64-dim projection → 4-level directory depth
  • Git-Aware Storage:
    • Clean files: Store only git blob hash (space efficient)
    • Dirty/non-git: Store full chunk_text
  • Hash-Based Staleness: SHA256 for precise change detection
  • 3-Tier Content Retrieval: Current file → git blob → error
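
A hedged sketch of the 3-tier retrieval flow; function and argument names are illustrative, not the project's API, and the git step uses the standard git cat-file plumbing command:

import subprocess
from pathlib import Path

def retrieve_chunk_text(file_path: Path, blob_hash: str, start: int, end: int) -> str:
    """Illustrative 3-tier retrieval: current file -> git blob -> error."""
    # Tier 1: read the chunk from the current working-tree file if it still exists
    if file_path.exists():
        return file_path.read_text(errors="replace")[start:end]
    # Tier 2: fall back to the git blob hash recorded at index time
    result = subprocess.run(["git", "cat-file", "blob", blob_hash],
                            capture_output=True, text=True)
    if result.returncode == 0:
        return result.stdout[start:end]
    # Tier 3: content is unavailable in both places
    raise FileNotFoundError(f"Cannot recover chunk content for {file_path}")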

Binary ID Index:

  • Format: Packed binary [num_entries:uint32][id_len:uint16, id:utf8, path_len:uint16, path:utf8]...
  • Performance: <20ms cached loads via memory mapping (mmap)
  • Thread-safe: RLock for concurrent access
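
A sketch of reading that packed layout with Python's struct and mmap modules (little-endian byte order is an assumption; the real index may differ in detail):

import mmap
import struct

def read_id_index(path: str) -> dict[str, str]:
    """Illustrative reader for the packed ID->path layout, using mmap for fast loads."""
    mapping: dict[str, str] = {}
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as buf:
        (num_entries,) = struct.unpack_from("<I", buf, 0)          # num_entries: uint32
        offset = 4
        for _ in range(num_entries):
            (id_len,) = struct.unpack_from("<H", buf, offset); offset += 2
            vector_id = buf[offset:offset + id_len].decode("utf-8"); offset += id_len
            (path_len,) = struct.unpack_from("<H", buf, offset); offset += 2
            file_path = buf[offset:offset + path_len].decode("utf-8"); offset += path_len
            mapping[vector_id] = file_path
    return mapping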

Parallel Query Execution

2-Thread Architecture for 15-30% Latency Reduction:

Query Pipeline:
┌─────────────────────────────────────────┐
│  Thread 1: Index Loading (I/O bound)   │
│  - Load HNSW graph (~5-15ms)           │
│  - Load ID index via mmap (<20ms)      │
└─────────────────────────────────────────┘
           ⬇ Parallel Execution ⬇
┌─────────────────────────────────────────┐
│ Thread 2: Embedding (CPU/Network bound)│
│  - Generate query embedding (5s API)   │
└─────────────────────────────────────────┘
           ⬇ Join ⬇
┌─────────────────────────────────────────┐
│  HNSW Graph Search + Filtering         │
│  - Navigate graph: O(log N)            │
│  - Load K candidates: O(K)             │
│  - Apply filters and score             │
│  - Return top-K results                │
└─────────────────────────────────────────┘

Typical savings: 175-265ms per query. Threading overhead: 7-16% (transparently reported).
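
A minimal sketch of the 2-thread pipeline using concurrent.futures; the helper functions are hypothetical stand-ins for the index-loading, embedding, and scoring steps described above:

from concurrent.futures import ThreadPoolExecutor

def execute_query(query_text: str, limit: int):
    """Illustrative pipeline: load indexes while the embedding request is in flight."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        index_future = pool.submit(load_hnsw_and_id_index)        # Thread 1: I/O bound
        embedding_future = pool.submit(embed_query, query_text)   # Thread 2: CPU/network bound
        hnsw, id_index = index_future.result()
        query_vector = embedding_future.result()
    candidate_ids, _ = hnsw.knn_query(query_vector, k=limit * 2)  # O(log N) graph search
    return score_and_filter(candidate_ids, id_index, limit)       # exact scoring + filters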

Search Strategy Evolution

Version 6.x: Binary Index (O(N) Linear Scan)

# Load ALL vectors
for vector_id in all_vectors:  # O(N)
    vector = load_vector(vector_id)
    similarity = cosine(query, vector)
    results.append((vector_id, similarity))

results.sort(key=lambda r: r[1], reverse=True)  # O(N log N)
return results[:limit]

# Performance: 6+ seconds for 7K vectors

Version 7.0: HNSW Index (O(log N) Graph Search)

# Load HNSW graph index
hnsw = load_hnsw_index()  # O(1)

# Navigate graph to find approximate nearest neighbors
candidates = hnsw.search(query, k=limit*2)  # O(log N)

# Load ONLY candidate vectors
for candidate_id in candidates:  # O(K) where K << N
    vector = load_vector(candidate_id)
    similarity = exact_cosine(query, vector)
    if filter_match(vector.payload):
        results.append((candidate_id, similarity))

results.sort(key=lambda r: r[1], reverse=True)  # O(K log K)
return results[:limit]

# Performance: ~20ms for 37K vectors (300x faster)

Performance Decision Analysis

Why HNSW?

  1. vs FAISS: Simpler integration, no external C++ dependencies, optimal for small-medium datasets (<100K vectors)
  2. vs Annoy: Better accuracy-speed tradeoff, superior graph connectivity
  3. vs Product Quantization: Maintains full precision, no accuracy loss
  4. vs Brute Force: 300x speedup justifies ~150MB index overhead

Quantization Strategy:

  • 64-dim projection: Optimal balance (tested 32, 64, 128, 256 dimensions)
  • 4-level depth: Enables 64^4 = 16.8M unique paths (sufficient for large codebases)
  • 2-bit quantization: Further reduces from 64 to 4 levels per dimension

Storage Trade-offs:

  • JSON vs Binary: JSON chosen for git-trackability and debuggability (3-5x size acceptable)
  • Individual files: Enable incremental updates and git change tracking
  • Binary exceptions: ID index and HNSW use binary for performance-critical components

How the Indexing Algorithm Works

The code-indexer uses a sophisticated dual-phase parallel processing architecture with git-aware metadata extraction and dynamic VoyageAI batch optimization.

For detailed technical documentation, see docs/algorithms.md which covers:

  • Complete algorithm flow and complexity analysis
  • VoyageAI batch processing optimization
  • Slot-based progress tracking architecture
  • Git-aware branch isolation and deduplication
  • Performance characteristics and implementation details

Installation

pipx (Recommended)

# Install the package
pipx install git+https://github.com/jsbattig/code-indexer.git@v7.1.0

# Setup global registry (standalone command - requires sudo)
cidx setup-global-registry

# If cidx command is not found, add pipx bin directory to PATH:
# export PATH="$HOME/.local/bin:$PATH"

pip with virtual environment

python3 -m venv code-indexer-env
source code-indexer-env/bin/activate
pip install git+https://github.com/jsbattig/code-indexer.git@v7.1.0

# Setup global registry (standalone command - requires sudo)
cidx setup-global-registry

Requirements

  • Python 3.9+
  • Container Engine: Docker or Podman (for containerized services)
  • Memory: 4GB+ RAM recommended
  • Disk Space: ~500MB for base containers + vector data storage

Docker vs Podman Support

Code Indexer supports both Docker and Podman container engines:

  • Auto-detection: Automatically detects and uses available container engine
  • Podman preferred: Uses Podman by default when available (better rootless support)
  • Force Docker: Use --force-docker flag to force Docker usage
  • Rootless containers: Fully supports rootless container execution

Global Port Registry: The setup-global-registry command configures system-wide port coordination at /var/lib/code-indexer/port-registry, preventing conflicts when running multiple code-indexer projects simultaneously. This is required for proper multi-project support.

Vector Storage Backends

Code Indexer supports two vector storage backends for different deployment scenarios:

Filesystem Backend (Default)

Container-free vector storage using the local filesystem - ideal for environments where containers are restricted or unavailable.

Features:

  • No containers required - Stores vector data directly in .code-indexer/index/ directory
  • Zero setup overhead - Works immediately without Docker/Podman
  • Lightweight - Minimal resource footprint, no container orchestration
  • Portable - Vector data travels with your repository

Usage:

# Default behavior (no flag needed)
cidx init

# Or explicitly specify filesystem backend
cidx init --vector-store filesystem

Directory Structure:

your-project/
└── .code-indexer/
    ├── config.json
    └── index/           # Vector data stored here
        └── (vector storage files)

Qdrant Backend

Container-based vector storage using Docker/Podman + Qdrant vector database - provides advanced vector database features and scalability.

Features:

  • High performance - Optimized vector similarity search with HNSW indexing
  • Advanced filtering - Complex queries and metadata filtering
  • Horizontal scaling - Suitable for large codebases and production deployments
  • Container isolation - Clean separation of concerns with containerized services

Usage:

# Initialize with Qdrant backend
cidx init --vector-store qdrant

# Requires: Docker or Podman installed
# Automatically allocates unique ports per project
# Requires global port registry: cidx setup-global-registry

When to use each backend:

  • Filesystem: Development environments, CI/CD pipelines, container-restricted systems, quick prototyping
  • Qdrant: Production deployments, large teams, advanced vector search requirements, high-performance needs

Switching Between Backends

You can switch between filesystem and Qdrant backends at any time. Warning: Switching backends requires destroying existing vector data and re-indexing your codebase. Your source code is never affected - only the vector index data is removed.

Complete Switching Workflow:

# Method 1: Manual step-by-step (recommended for first-time users)
cidx stop                                  # Stop any running services
cidx uninstall --confirm                   # Remove current backend data
cidx init --vector-store <new-backend>     # Initialize new backend (filesystem or qdrant)
cidx start                                 # Start services for new backend
cidx index                                 # Re-index codebase with new backend

# Method 2: Force reinitialize (quick switch, skips uninstall)
cidx init --vector-store <new-backend> --force
cidx start
cidx index

Examples:

# Switch from Qdrant to filesystem (container-free)
cidx stop
cidx uninstall --confirm
cidx init --vector-store filesystem
cidx start
cidx index

# Switch from filesystem to Qdrant (containerized)
cidx stop  # No-op for filesystem, safe to run
cidx uninstall --confirm
cidx init --vector-store qdrant
cidx start
cidx index

# Quick switch using --force flag
cidx init --vector-store qdrant --force
cidx start
cidx index

What gets preserved:

  • ✅ All source code files
  • ✅ Git repository and history
  • ✅ Project structure and dependencies
  • ✅ Configuration settings (file exclusions, size limits, etc.)

What gets removed:

  • ❌ Vector embeddings and index data
  • ❌ Cached search results
  • ❌ Backend-specific containers (when switching from Qdrant)
  • ❌ Backend-specific storage directories

Safety considerations:

  • The --confirm flag skips the confirmation prompt for uninstall
  • The --force flag overwrites existing configuration when using init
  • Always ensure you have a backup if you've customized your .code-indexer/config.json
  • Re-indexing time depends on codebase size (typically seconds to minutes)

Quick Start

# 1. Setup global registry (once per system)
cidx setup-global-registry

# 2. Navigate to your project
cd /path/to/your/project

# 3. Start services and index code
cidx start     # Auto-creates config if needed
cidx index     # Smart incremental indexing

# 4. Search semantically
cidx query "authentication logic"

# 5. Search with filtering
cidx query "user" --language python --min-score 0.7
cidx query "save" --path-filter "*/models/*" --limit 20

# 6. Teach AI assistants about semantic search (optional)
cidx teach-ai --claude --project

Alternative: Custom Configuration

# Optional: Initialize with custom settings first
cidx init --embedding-provider voyage-ai --max-file-size 2000000
cidx start
cidx index

Full-Text Search (FTS)

CIDX now supports blazing-fast, index-backed full-text search alongside semantic search. FTS is perfect for finding exact text matches, specific identifiers, or debugging typos in your codebase.

Why Use FTS?

  • Sub-5ms query latency on large codebases (vs ~20ms for semantic search)
  • Exact text matching for finding specific function names, classes, or identifiers
  • Fuzzy matching with configurable edit distance for typo tolerance
  • Case sensitivity control for precise matching
  • Real-time updates in watch mode

Building the FTS Index

# Build both semantic and FTS indexes (recommended)
cidx index --fts

# Enable FTS in watch mode for real-time updates
cidx watch --fts

Note: FTS requires Tantivy v0.25.0. Install with: pip install tantivy==0.25.0
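
For context, the sketch below shows the general shape of building and querying a Tantivy index with the tantivy Python bindings; field names and documents are illustrative, not cidx internals (cidx keeps its index under .code-indexer/tantivy_index/):

import tantivy

schema_builder = tantivy.SchemaBuilder()
schema_builder.add_text_field("path", stored=True)
schema_builder.add_text_field("body", stored=True)
schema = schema_builder.build()

index = tantivy.Index(schema)            # in-memory here; pass path=... to persist on disk
writer = index.writer()
writer.add_document(tantivy.Document(path="src/auth.py",
                                     body="def authenticate_user(token): ..."))
writer.commit()

index.reload()
searcher = index.searcher()
query = index.parse_query("authenticate_user", ["body"])
for score, doc_address in searcher.search(query, 10).hits:
    print(score, searcher.doc(doc_address).to_dict()["path"])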

Search Modes

CIDX supports three search modes:

1. Semantic Search (Default)

AI-powered conceptual search using vector embeddings:

cidx query "authentication logic"
cidx query "database connection setup"
cidx query "error handling patterns"

2. Full-Text Search (--fts)

Fast, exact text matching with optional fuzzy tolerance:

# Exact text search
cidx query "authenticate_user" --fts

# Case-sensitive search (find exact case)
cidx query "ParseError" --fts --case-sensitive

# Fuzzy matching (typo tolerant)
cidx query "authenticte" --fts --fuzzy  # Finds "authenticate"
cidx query "conection" --fts --edit-distance 2  # Finds "connection"

3. Hybrid Search (--fts --semantic)

Run both searches in parallel for comprehensive results:

cidx query "parse" --fts --semantic
cidx query "login" --fts --semantic --limit 5

FTS-Specific Options

# Case sensitivity
--case-sensitive      # Enable case-sensitive matching
--case-insensitive    # Force case-insensitive (default)

# Fuzzy matching
--fuzzy               # Enable fuzzy matching (edit distance 1)
--edit-distance N     # Set fuzzy tolerance (0-3, default: 0)
                      # 0=exact, 1=1 typo, 2=2 typos, 3=3 typos

# Context control
--snippet-lines N     # Lines of context around matches (0-50, default: 5)
                      # 0=list files only, no content snippets

FTS Examples

# Find specific function names
cidx query "authenticate_user" --fts

# Case-sensitive class search
cidx query "UserAuthentication" --fts --case-sensitive

# Fuzzy search for typos
cidx query "respnse" --fts --fuzzy                    # Finds "response"
cidx query "athenticate" --fts --edit-distance 2      # Finds "authenticate"

# Minimal output (list files only)
cidx query "TODO" --fts --snippet-lines 0

# Extended context
cidx query "error" --fts --snippet-lines 10

# Filter by language and path
cidx query "test" --fts --language python --path-filter "*/tests/*"

# Hybrid search (both semantic and exact matching)
cidx query "login" --fts --semantic

FTS Performance

FTS queries are extremely fast:

  • Sub-5ms latency for most searches
  • Parallel execution in hybrid mode (both searches run simultaneously)
  • Real-time index updates in watch mode
  • Storage: .code-indexer/tantivy_index/

When to Use Each Mode

| Use Case | Best Mode | Example |
|----------|-----------|---------|
| Finding specific functions/classes | FTS with --case-sensitive | cidx query "UserAuth" --fts --case-sensitive |
| Exploring concepts/patterns | Semantic (default) | cidx query "authentication strategies" |
| Debugging typos in code | FTS with --fuzzy | cidx query "respnse" --fts --fuzzy |
| Comprehensive search | Hybrid | cidx query "parse" --fts --semantic |
| Finding TODO comments | FTS | cidx query "TODO" --fts |
| Understanding architecture | Semantic | cidx query "how does caching work" |

Regex Pattern Matching

Use --regex flag for token-based pattern matching (faster than grep on indexed repos):

# Find function definitions
cidx query "def" --fts --regex

# Find identifiers starting with "auth"
cidx query "auth.*" --fts --regex --language python

# Find test functions
cidx query "test_.*" --fts --regex --exclude-path "*/vendor/*"

# Find TODO comments (use lowercase for case-insensitive matching)
cidx query "todo" --fts --regex

# Case-sensitive regex (use uppercase for exact case match)
cidx query "ERROR" --fts --regex --case-sensitive

Important Limitation - Token-Based Matching: Tantivy regex operates on individual TOKENS, not full text:

  • Works: r"def", r"login.*", r"todo", r"test_.*"
  • Doesn't work: r"def\s+\w+" (whitespace removed during tokenization)

Case Sensitivity: By default, regex searches are case-insensitive (patterns matched against lowercased content). Use --case-sensitive flag for exact case matching.

For multi-word patterns spanning whitespace, use exact search instead of regex.

Performance: Regex on indexed repos is 10-50x faster than grep for large codebases.

Multi-User Server

The CIDX server provides a FastAPI-based multi-user semantic code search service with JWT authentication and role-based access control.

Server Quick Start

# 1. Install and setup (same as CLI)
pipx install git+https://github.com/jsbattig/code-indexer.git@v7.1.0
cidx setup-global-registry

# 2. Install and configure the server
cidx install-server

# 3. Start the server
cidx server start

# 4. Access the API documentation
# Visit: http://localhost:8090/docs

Server Features

  • JWT Authentication: Token-based authentication system
  • User Roles: Admin, power_user, and normal_user with different permissions
  • Golden Repositories: Centralized repository management
  • Repository Activation: Copy-on-Write cloning for user workspaces
  • Advanced Search: Semantic search with file extension filtering
  • Background Jobs: Async processing for indexing and heavy operations
  • Health Monitoring: System status and performance endpoints

API Endpoints Overview

  • Authentication: POST /auth/login - JWT token authentication
  • User Management: POST /users/create - Admin-only user creation
  • Golden Repos: GET|POST|DELETE /repositories/golden/* - Repository management
  • Activation: POST|DELETE /repositories/activate/* - User repository activation
  • Search: POST /query/* - Semantic search with filtering
  • Jobs: GET /jobs/* - Background job status monitoring
  • Health: GET /health - System health and metrics

Authentication & Roles

# Login and get JWT token
curl -X POST http://localhost:8090/auth/login \
  -H "Content-Type: application/json" \
  -d '{"username": "admin", "password": "admin123"}'

# Use token in subsequent requests
curl -X GET http://localhost:8090/repositories/golden \
  -H "Authorization: Bearer YOUR_JWT_TOKEN"

Default Users (for testing):

  • Admin: admin / admin123 - Full system access
  • Power User: power_user / power123 - Repository and search access
  • Normal User: normal_user / normal123 - Search-only access

Example Usage

# 1. Login as admin
TOKEN=$(curl -s -X POST http://localhost:8090/auth/login \
  -H "Content-Type: application/json" \
  -d '{"username": "admin", "password": "admin123"}' | jq -r '.access_token')

# 2. Add a golden repository
curl -X POST http://localhost:8090/repositories/golden \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"repository_path": "/path/to/your/repo", "alias": "my-repo"}'

# 3. Activate repository for user
curl -X POST http://localhost:8090/repositories/activate \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"golden_alias": "my-repo", "user_alias": "my-workspace"}'

# 4. Semantic search
curl -X POST http://localhost:8090/query/repositories \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"query_text": "authentication logic", "file_extensions": [".py", ".js"]}'

Complete CLI Reference

Setup Commands

# Global system setup
cidx setup-global-registry              # Setup global port registry (requires sudo)
cidx setup-global-registry --test-access --quiet  # Test registry access

# Project initialization
cidx init                               # Initialize with default settings
cidx init --embedding-provider voyage-ai  # Use VoyageAI instead of Ollama
cidx init --max-file-size 2000000       # Set 2MB file size limit
cidx init --setup-global-registry       # Init + setup registry (legacy)
cidx init --create-override-file        # Create .code-indexer-override.yaml

Service Management

# Service lifecycle
cidx start                      # Start services (smart detection)
cidx start --force-docker       # Force Docker instead of Podman
cidx start --force-recreate     # Force recreate containers
cidx start --quiet              # Silent mode
cidx start -m all-minilm-l6-v2  # Different Ollama model

cidx status                     # Check service status
cidx status --force-docker      # Check Docker status specifically

cidx stop                       # Stop services (preserve data)
cidx stop --force-docker        # Stop Docker services specifically

Indexing Commands

# Standard indexing
cidx index                      # Smart incremental indexing
cidx index --clear              # Force full reindex
cidx index --reconcile          # Reconcile disk vs database
cidx index --detect-deletions   # Handle deleted files
cidx index --batch-size 25      # Custom batch size
cidx index --files-count-to-process 100  # Limit file count
cidx index --threads 8          # Custom thread count (configure in config.json)

# Real-time monitoring
cidx watch                      # Git-aware file watching
cidx watch --debounce 5.0       # Custom debounce delay
cidx watch --initial-sync       # Full sync before watching

Search Commands

# Basic search
cidx query "search terms"      # Semantic search
cidx query "auth" --limit 20   # More results
cidx query "function" --quiet  # Only results, no headers

# Advanced filtering
cidx query "user" --language python  # Filter by language
cidx query "save" --path-filter "*/models/*" # Filter by path pattern
cidx query "function" --min-score 0.7  # Higher confidence matches
cidx query "database" --limit 15     # More results
cidx query "test" --min-score 0.8     # High-confidence matches

# Short alias (cidx = code-indexer)
cidx query "search terms"              # Same as code-indexer query

Language Filtering

CIDX supports intelligent language filtering with comprehensive file extension mapping:

# Friendly language names (recommended)
cidx query "authentication" --language python     # Matches .py, .pyw, .pyi files
cidx query "components" --language javascript     # Matches .js, .jsx files  
cidx query "models" --language typescript         # Matches .ts, .tsx files
cidx query "handlers" --language cpp              # Matches .cpp, .cc, .cxx, .c++ files

# Direct extension usage (also supported)
cidx query "function" --language py               # Matches only .py files
cidx query "component" --language jsx             # Matches only .jsx files

Supported Languages: python, javascript, typescript, java, csharp, c, cpp, go, rust, php, ruby, swift, kotlin, scala, dart, html, css, vue, markdown, xml, yaml, json, sql, shell, bash, dockerfile, and more.

Customizing Language Mappings

You can customize language mappings by editing .code-indexer/language-mappings.yaml:

# Add custom languages or modify existing mappings
python: [py, pyw, pyi]          # Multiple extensions
mylang: [ml, mli]               # Your custom language  
javascript: [js, jsx]           # Modify existing mappings

Changes take effect on the next query execution. The file is automatically created during cidx init or on first use.
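
A minimal sketch of how such a mapping can be resolved into concrete extensions at query time (illustrative only; the actual resolution logic may differ):

import yaml

def extensions_for(language: str,
                   mapping_path: str = ".code-indexer/language-mappings.yaml") -> list[str]:
    """Resolve a --language value to file extensions, falling back to the raw value."""
    with open(mapping_path) as f:
        mappings = yaml.safe_load(f) or {}
    # "python" -> ["py", "pyw", "pyi"]; a direct extension like "py" passes through unchanged
    return mappings.get(language, [language])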

Exclusion Filters

CIDX provides powerful exclusion filters to remove unwanted files from your search results. Exclusions always take precedence over inclusions, giving you precise control over your search scope.

Excluding Files by Language

Filter out files of specific programming languages using --exclude-language:

# Exclude JavaScript files from results
cidx query "database implementation" --exclude-language javascript

# Exclude multiple languages
cidx query "api handlers" --exclude-language javascript --exclude-language typescript --exclude-language css

# Combine with language inclusion (Python only, no JS)
cidx query "web server" --language python --exclude-language javascript

Excluding Files by Path Pattern

Use --exclude-path with glob patterns to filter out files in specific directories or with certain names:

# Exclude all test files
cidx query "production code" --exclude-path "*/tests/*" --exclude-path "*_test.py"

# Exclude dependency and cache directories
cidx query "application logic" \
  --exclude-path "*/node_modules/*" \
  --exclude-path "*/vendor/*" \
  --exclude-path "*/__pycache__/*"

# Exclude by file extension
cidx query "source code" --exclude-path "*.min.js" --exclude-path "*.pyc"

# Complex path patterns
cidx query "configuration" --exclude-path "*/build/*" --exclude-path "*/.*"  # Hidden files

Combining Multiple Filter Types

Create sophisticated queries by combining inclusion and exclusion filters:

# Python files in src/, excluding tests and cache
cidx query "database models" \
  --language python \
  --path-filter "*/src/*" \
  --exclude-path "*/tests/*" \
  --exclude-path "*/__pycache__/*"

# High-relevance results, no test files or vendored code
cidx query "authentication logic" \
  --min-score 0.8 \
  --exclude-path "*/tests/*" \
  --exclude-path "*/vendor/*" \
  --exclude-language javascript

# API code only, multiple exclusions
cidx query "REST endpoints" \
  --path-filter "*/api/*" \
  --exclude-path "*/tests/*" \
  --exclude-path "*/mocks/*" \
  --exclude-language javascript \
  --exclude-language css

Common Exclusion Patterns

Testing Files

--exclude-path "*/tests/*"        # Test directories
--exclude-path "*/test/*"         # Alternative test dirs
--exclude-path "*_test.py"        # Python test files
--exclude-path "*_test.go"        # Go test files
--exclude-path "*.test.js"        # JavaScript test files
--exclude-path "*/fixtures/*"     # Test fixtures
--exclude-path "*/mocks/*"        # Mock files

Dependencies and Vendor Code

--exclude-path "*/node_modules/*"    # Node.js dependencies
--exclude-path "*/vendor/*"          # Vendor libraries
--exclude-path "*/.venv/*"           # Python virtual environments
--exclude-path "*/site-packages/*"   # Python packages
--exclude-path "*/bower_components/*" # Bower dependencies

Build Artifacts and Cache

--exclude-path "*/build/*"        # Build output
--exclude-path "*/dist/*"         # Distribution files
--exclude-path "*/target/*"       # Maven/Cargo output
--exclude-path "*/__pycache__/*"  # Python cache
--exclude-path "*.pyc"            # Python compiled files
--exclude-path "*.pyo"            # Python optimized files
--exclude-path "*.class"          # Java compiled files
--exclude-path "*.o"              # Object files
--exclude-path "*.so"             # Shared libraries

Generated and Minified Files

--exclude-path "*.min.js"         # Minified JavaScript
--exclude-path "*.min.css"        # Minified CSS
--exclude-path "*_pb2.py"         # Protocol buffer generated
--exclude-path "*.generated.*"    # Generated files
--exclude-path "*/migrations/*"   # Database migrations

Filter Conflicts and Warnings

CIDX automatically detects contradictory filters and provides helpful feedback:

# Language conflict (same language included AND excluded)
cidx query "database" --language python --exclude-language python
# Output: 🚫 Language 'python' is both included and excluded.

# Path conflict (same path included AND excluded)
cidx query "config" --path-filter "*/src/*" --exclude-path "*/src/*"
# Output: 🚫 Path pattern '*/src/*' is both included and excluded.

# Over-exclusion warning (many exclusions without inclusions)
cidx query "code" --exclude-language python --exclude-language javascript \
  --exclude-language typescript --exclude-language java --exclude-language go
# Output: ⚠️  Excluding 5 languages without any inclusion filters may result in unexpected results.

Performance Notes

  • Each exclusion filter adds minimal overhead (typically <2ms)
  • Filters are applied during the search phase, not during indexing
  • Use specific patterns when possible for better performance
  • Complex glob patterns may have slightly higher overhead
  • The order of filters does not affect performance

AI Platform Instructions

The teach-ai command generates instruction files that teach AI assistants how to use cidx for semantic code search. Instructions are loaded from template files, allowing non-technical users to update content without code changes.

# Install Claude instructions in project root
cidx teach-ai --claude --project    # Creates ./CLAUDE.md

# Install Claude instructions globally
cidx teach-ai --claude --global     # Creates ~/.claude/CLAUDE.md

# Preview instruction content
cidx teach-ai --claude --show-only  # Show without writing

# Supported AI platforms
cidx teach-ai --claude              # Claude Code
cidx teach-ai --codex               # OpenAI Codex
cidx teach-ai --gemini              # Google Gemini
cidx teach-ai --opencode            # OpenCode
cidx teach-ai --q                   # Q
cidx teach-ai --junie               # Junie

# Combine platform and scope flags
cidx teach-ai --gemini --global     # Gemini global install
cidx teach-ai --codex --project     # Codex project install

Template Location: prompts/ai_instructions/{platform}.md

Platform File Locations:

| Platform | Project File | Global File |
|----------|--------------|-------------|
| Claude | CLAUDE.md | ~/.claude/CLAUDE.md |
| Codex | CODEX.md | ~/.codex/instructions.md |
| Gemini | .gemini/styleguide.md | N/A (project-only) |
| OpenCode | AGENTS.md | ~/.config/opencode/AGENTS.md |
| Q | .amazonq/rules/cidx.md | ~/.aws/amazonq/Q.md |
| Junie | .junie/guidelines.md | N/A (project-only) |

Safety Features: Smart update preserves existing content and updates only CIDX sections.

Data Management Commands

# Quick cleanup (recommended)
cidx clean-data                 # Clear current project data
cidx clean-data --all-projects  # Clear all projects data
cidx clean-data --force-docker  # Use Docker for cleanup

# Complete removal
cidx uninstall                  # Remove current project completely
cidx uninstall --force-docker   # Use Docker for removal
cidx uninstall --wipe-all       # DANGEROUS: Complete system wipe

# Migration and maintenance
cidx clean-legacy               # Migrate from legacy containers
cidx optimize                   # Optimize vector database
cidx force-flush                # Force flush to disk (deprecated)
cidx force-flush --collection mycoll  # Flush specific collection

Configuration Commands

# Configuration repair
cidx fix-config                 # Fix corrupted configuration
cidx fix-config --dry-run       # Preview fixes
cidx fix-config --verbose       # Detailed fix information
cidx fix-config --force         # Apply without confirmation

# Claude integration setup
cidx set-claude-prompt          # Set CIDX instructions in project CLAUDE.md
cidx set-claude-prompt --user-prompt  # Set in global ~/.claude/CLAUDE.md

Global Options

# Available on most commands
--force-docker          # Force Docker instead of Podman
--verbose, -v           # Verbose output
--config, -c PATH       # Custom config file path
--path, -p PATH         # Custom project directory

# Special global options
--use-cidx-prompt       # Generate AI integration prompt
--format FORMAT         # Output format (text, markdown, compact, comprehensive)
--output FILE           # Save output to file
--compact               # Generate compact prompt

Command Aliases

  • cidx → code-indexer (shorter alias for all commands)
  • Use cidx for faster typing: cidx start, cidx query "search", etc.

Configuration

Embedding Providers

Ollama (Default - Local)

cidx init --embedding-provider ollama

VoyageAI (Cloud)

export VOYAGE_API_KEY="your-key"
cidx init --embedding-provider voyage-ai

Real-time Progress Display

During indexing, the progress display shows:

  • File processing progress with counts and percentages
  • Performance metrics: files/s, KB/s, active threads
  • Individual file status with processing stages

Example progress: 15/100 files (15%) | 8.3 files/s | 156.7 KB/s | 12 threads

Individual file status display:

├─ main.py (15.2 KB) starting
├─ utils.py (8.3 KB) chunking...  
├─ config.py (4.1 KB) vectorizing...
├─ helpers.py (3.2 KB) finalizing...
├─ models.py (12.5 KB) complete ✓

Configuration File

Configuration is stored in .code-indexer/config.json:

  • file_extensions: File types to index
  • exclude_dirs: Directories to skip
  • embedding_provider: ollama or voyage-ai (determines chunk size automatically)
  • max_file_size: Maximum file size in bytes (default: 1MB)
  • chunk_size: Legacy setting (ignored, chunker uses model-aware sizing)
  • chunk_overlap: Legacy setting (ignored, chunker uses 15% of chunk size)
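
An illustrative config.json combining the settings above (values are examples, not required defaults; the legacy chunking keys may still appear but are ignored as noted):

{
  "file_extensions": ["py", "js", "ts", "java"],
  "exclude_dirs": ["node_modules", ".venv", "build"],
  "embedding_provider": "voyage-ai",
  "max_file_size": 1048576,
  "chunk_size": 1500,
  "chunk_overlap": 150
}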

Model-Aware Chunking Strategy

Code Indexer uses a model-aware fixed-size chunking approach optimized for different embedding models:

How it works:

  • Model-optimized chunk sizes: Automatically selects optimal chunk size based on embedding model capabilities
  • Consistent overlap: 15% overlap between adjacent chunks (across all models)
  • Simple arithmetic: Next chunk starts at (chunk_size - overlap_size) from current start position
  • Token optimization: Uses larger chunk sizes for models with higher token capacity

Model-Specific Chunk Sizes:

  • voyage-code-3: 4,096 characters (leverages 32K token capacity)
  • voyage-code-2: 4,096 characters (leverages 16K token capacity)
  • voyage-large-2: 4,096 characters (leverages large context capacity)
  • nomic-embed-text: 2,048 characters (512 tokens - Ollama limitation)
  • Unknown models: 1,000 characters (conservative fallback)

Example chunking (voyage-code-3):

Chunk 1: characters 0-4095     (4096 chars)
Chunk 2: characters 3482-7577  (4096 chars, overlaps 614 chars with Chunk 1) 
Chunk 3: characters 6964-11059 (4096 chars, overlaps 614 chars with Chunk 2)

Benefits:

  • Model optimization: Uses larger chunks for high-capacity models (VoyageAI: 4096, Ollama: 2048)
  • Better context: More complete code sections per chunk
  • Efficiency: Fewer total chunks reduce storage requirements
  • Model utilization: Takes advantage of each model's token capacity
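
A minimal sketch of this arithmetic (illustrative, not the project's internal chunker); for voyage-code-3 it reproduces the 3482-character step shown above:

MODEL_CHUNK_SIZES = {
    "voyage-code-3": 4096,
    "voyage-code-2": 4096,
    "voyage-large-2": 4096,
    "nomic-embed-text": 2048,
}

def chunk_text(text: str, model: str) -> list[str]:
    chunk_size = MODEL_CHUNK_SIZES.get(model, 1000)   # conservative fallback for unknown models
    step = chunk_size - int(chunk_size * 0.15)        # next chunk starts chunk_size - overlap later
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]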

Supported Languages

Code Indexer provides fixed-size chunking with intelligent text processing for all supported languages:

| Language | File Extensions | Text Processing |
|----------|-----------------|-----------------|
| Python | .py | Fixed-size chunks with consistent overlap |
| JavaScript | .js, .jsx | Fixed-size chunks with consistent overlap |
| TypeScript | .ts, .tsx | Fixed-size chunks with consistent overlap |
| Java | .java | Fixed-size chunks with consistent overlap |
| C# | .cs | Fixed-size chunks with consistent overlap |
| Go | .go | Fixed-size chunks with consistent overlap |
| Kotlin | .kt, .kts | Fixed-size chunks with consistent overlap |
| Groovy | .groovy, .gradle, .gvy, .gy | Fixed-size chunks with consistent overlap |
| Pascal/Delphi | .pas, .pp, .dpr, .dpk, .inc | Fixed-size chunks with consistent overlap |
| SQL | .sql | Fixed-size chunks with consistent overlap |
| C | .c, .h | Fixed-size chunks with consistent overlap |
| C++ | .cpp, .cc, .cxx, .hpp, .hxx | Fixed-size chunks with consistent overlap |
| Rust | .rs | Fixed-size chunks with consistent overlap |
| Swift | .swift | Fixed-size chunks with consistent overlap |
| Ruby | .rb, .rake, .gemspec | Fixed-size chunks with consistent overlap |
| Lua | .lua | Fixed-size chunks with consistent overlap |
| HTML | .html, .htm | Fixed-size chunks with consistent overlap |
| CSS | .css, .scss, .sass | Fixed-size chunks with consistent overlap |
| YAML | .yaml, .yml | Fixed-size chunks with consistent overlap |
| XML | .xml, .xsd, .xsl, .xslt | Fixed-size chunks with consistent overlap |

Model-Aware Chunking Benefits:

  • Optimized chunk sizes: Each model gets its optimal chunk size (voyage-code-3: 4096, nomic-embed-text: 2048)
  • Consistent overlap: 15% overlap between adjacent chunks (across all models)
  • Fast processing: No complex parsing overhead, pure arithmetic operations
  • Complete search results: Full code sections without truncation
  • Model efficiency: Leverages each embedding model's capabilities

Parallel File Processing

Code Indexer uses slot-based parallel file processing for efficient throughput:

Architecture:

  • Dual thread pool design - Frontend file processing (threadcount+2 workers) feeds backend vectorization (threadcount workers)
  • File-level parallelism - Multiple files processed concurrently with dedicated slot allocation
  • Slot-based allocation - Fixed-size display array (threadcount+2 slots) with natural worker slot reuse
  • Real-time progress - Individual file status visible during processing (starting → chunking → vectorizing → finalizing → complete)
  • Thread allocation - Base threadcount configured via voyage_ai.parallel_requests in config.json

Thread Configuration:

  • VoyageAI default: 8 vectorization threads → 10 file processing workers (8+2)
  • Ollama default: 1 vectorization thread → 3 file processing workers (1+2)
  • Frontend thread pool: threadcount+2 workers handle file reading, chunking, and coordination (provides preemptive capacity)
  • Backend thread pool: threadcount workers handle vector embedding calculations
  • Pipeline design: Frontend stays ahead of backend, ensuring continuous vector thread utilization
  • Custom configuration: Adjust parallel_requests in config.json for optimal performance
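
A simplified sketch of the dual-pool hand-off (the submit-based mechanics and helper functions are assumptions; the real coordination, progress tracking, and cancellation logic are more involved):

from concurrent.futures import ThreadPoolExecutor

threadcount = 8   # e.g. voyage_ai.parallel_requests from config.json

backend = ThreadPoolExecutor(max_workers=threadcount)        # vector embedding calculations
frontend = ThreadPoolExecutor(max_workers=threadcount + 2)   # file reading, chunking, coordination

def process_file(path):
    # Frontend work: read and chunk one file, then feed its chunks to the backend pool
    chunks = chunk_text(read_file(path), model="voyage-code-3")   # hypothetical helpers
    return [backend.submit(embed_and_store, chunk, path) for chunk in chunks]

file_futures = [frontend.submit(process_file, p) for p in files_to_index]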

Containerized Services

Code Indexer uses the following containerized services:

  • Qdrant: Vector database for storing code embeddings
  • Ollama (optional): Local language model service for embeddings when not using VoyageAI
  • Data Cleaner: Containerized service for cleaning root-owned files during data removal operations

Development

Developer Onboarding

Code Indexer uses a comprehensive test infrastructure organized into three main categories for optimal maintainability and execution speed:

Test Directory Structure

tests/
├── unit/              # Fast unit tests organized by functionality
│   ├── parsers/       # Language parser tests
│   ├── chunking/      # Text chunking logic tests
│   ├── config/        # Configuration management tests
│   ├── services/      # Service layer unit tests
│   ├── cli/           # CLI unit tests
│   ├── git/           # Git operations tests
│   ├── infrastructure/ # Core infrastructure tests
│   └── bugfixes/      # Bug fix regression tests
├── integration/       # Integration tests with real services
│   ├── performance/   # Performance and throughput tests
│   ├── docker/        # Docker integration tests
│   ├── multiproject/  # Multi-project workflow tests
│   ├── services/      # Service integration tests
│   └── cli/           # CLI integration tests
├── e2e/              # End-to-end workflow tests
│   ├── git_workflows/ # Git workflow e2e tests
│   ├── payload_indexes/ # Payload indexing e2e tests
│   ├── providers/     # Provider switching tests
│   ├── semantic_search/ # Search capability tests
│   ├── claude_integration/ # Claude integration tests
│   ├── display/       # UI and display tests
│   └── infrastructure/ # Infrastructure e2e tests
├── shared/           # Shared test utilities and fixtures
└── fixtures/         # Test data and fixtures

Test Safety and Container Categories

Tests are categorized by container requirements using pytest markers:

  • @pytest.mark.shared_safe - Can use either Docker or Podman containers, data-only operations
  • @pytest.mark.docker_only - Requires Docker-specific features or configurations
  • @pytest.mark.podman_only - Requires Podman-specific features or rootless containers
  • @pytest.mark.destructive - Manipulates containers directly, requires isolation

Dual-Container Architecture

Code Indexer supports both Docker and Podman with intelligent container management:

  • Production containers: code-indexer-ollama, code-indexer-qdrant (ports 11434, 6333)
  • Test containers: code-indexer-test-*-docker/podman (ports 50000+)
  • Automatic isolation: Test containers use different names, ports, and networks
  • Safety measures: Tests never interfere with production container setups

Test Running Instructions

Quick Development Tests (Fast)

# Run fast unit tests and integration tests
./ci-github.sh

# Run specific test categories
pytest tests/unit/ -v                    # Unit tests only
pytest tests/integration/ -v             # Integration tests only
pytest tests/e2e/ -v                     # E2E tests only

Comprehensive Testing (Slow)

# Run all tests including slow e2e tests
./full-automation.sh

# Run tests by container category
pytest -m shared_safe -v                 # Shared-safe tests only
pytest -m docker_only -v                 # Docker-specific tests only
pytest -m "not destructive" -v           # Non-destructive tests only

Development Setup

git clone https://github.com/jsbattig/code-indexer.git
cd code-indexer
pip install -e ".[dev]"

# Run quick tests to verify setup
./ci-github.sh

# Optional: Run comprehensive tests (requires Docker/Podman)
./full-automation.sh

Test Infrastructure Features

  • Service persistence: Tests keep services running between executions for speed
  • Prerequisite validation: Tests ensure required services are available at startup
  • Clean data isolation: Uses clean-data for test isolation, not service shutdown
  • Automatic categorization: Tests are automatically categorized by container requirements
  • Shared fixtures: Common test utilities available via tests.shared.test_infrastructure

Linting and Code Quality

# Run linting (includes ruff, black, mypy)
./lint.sh

# Check for import issues and syntax errors
python -m py_compile src/code_indexer/**/*.py

License

MIT License

Contributing

Issues and pull requests welcome!
