reddtoric/files-semantic-search

🔍 Files Semantic Search

An AI-powered file search tool that finds files by meaning, not just keywords. Uses vector embeddings to understand context and find relevant files even when they use different terminology.

(Most, if not all, code and texts are generated by ChatGPT and Claude.)

Inspiration

The idea for this tool came from my own habits: I accumulate ad hoc scripts, formulas, random notes, and other files that I don't want to organize or maintain. So what if I put everything in one folder and let an AI tool check whether I already have something that fits, returning a short list of files that might contain what I'm looking for, so I don't have to hunt for it manually?

Example 1: A coworker asks for a list of users who subscribed to the pro tier the previous year. I write the SQL query for this one request, and they may need the report again in the future. But where do I put the query, and what do I name it, so that I can still find it 1 month or 6 months down the road? This is where the tool comes in: it uses AI to search your files (taking file paths and filenames into account as well, in case a file is named appropriately) to check whether you've written this query before. You can put all of these random files under a single directory (subdirectories are searched too) and let the tool do the searching.

Example 2: There's a formula for something that I think is in a PDF. But I don't remember whether it was in a PDF, a DOC, or a TXT file, or even whether I saved it at all. I tell the tool what I'm looking for and it finds the file that contains the formula.

✨ Key Features

  • 🧠 Semantic Understanding: Finds files by meaning, not just exact word matches
  • ⚡ Fast Performance: Optimized scanning with GPU acceleration support
  • 💾 Smart Caching: Skips unchanged files for faster subsequent searches
  • 🎯 Flexible Models: Choose from fast, balanced, or high-quality AI models
  • 📊 Progress Tracking: Real-time progress with detailed feedback
  • 🛠️ Enhanced Context: Includes file paths and names in search context
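
To make the "Enhanced Context" idea concrete, here is a minimal sketch of how path information could be combined with file content before embedding. The function name `build_context`, the header layout, and the `max_chars` limit are all assumptions for illustration; the actual logic in file_processor.py may differ.

```python
from pathlib import PurePath, PurePosixPath


def build_context(path: PurePath, root: PurePath, content: str, max_chars: int = 2000) -> str:
    """Combine path metadata and file content into one string for embedding.

    Hypothetical helper: the real tool may format its context differently.
    """
    rel = path.relative_to(root)
    # Folder names and the filename often describe the content, so they go
    # first, where truncating the content cannot remove them.
    folders = " / ".join(rel.parts[:-1]) or "(root)"
    header = f"File: {rel.name}\nFolder: {folders}\nPath: {rel.as_posix()}"
    return f"{header}\n\n{content[:max_chars]}"


example = build_context(
    PurePosixPath("/data/sql/pro_users.sql"),
    PurePosixPath("/data"),
    "SELECT id FROM users WHERE tier = 'pro';",
)
print(example)
```

With a header like this, a query such as "pro tier users" can match on the filename even when the file body uses different wording.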

⚙️📌 Configuration

You can customize several default configurations in config.py.

  • Root search directory
  • AI model
  • Top k results
  • Minimum similarity score threshold
  • Number of threads
  • GPU batch size

🚨📢 Important ‼️

🚨📢 Start by opening config.py and setting DEFAULT_ROOT_DIR to your go-to folder. That way, you won't have to pass in the directory every time or keep this tool's files inside the directory you want to search. 📢🚨
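
As a rough idea of what the edited settings might look like, here is a sketch of a config.py fragment. Only `DEFAULT_ROOT_DIR` is named in this README; every other constant name is hypothetical, and the numeric values are placeholders rather than the tool's real defaults. Check config.py itself for the actual names.

```python
# config.py (sketch). Only DEFAULT_ROOT_DIR is a name taken from this README;
# the remaining names and values are hypothetical placeholders.
DEFAULT_ROOT_DIR = "C:/Users/reddtoric/catch-all-drawer"  # searched when -r is omitted

DEFAULT_MODEL = "A"           # model shortcut or full model name
DEFAULT_TOP_K = 10            # maximum number of results to show (placeholder value)
DEFAULT_MIN_SCORE = 0.0       # minimum similarity score threshold
DEFAULT_THREADS = 8           # worker threads (placeholder value)
DEFAULT_GPU_BATCH_SIZE = 256  # texts per GPU batch (placeholder value)
```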

🚀 Quick Start

Basic Usage

# Specified directory search
python file_search.py "find the sql query that retrieves users who are subscribed to free tier" -r "C:/Users/reddtoric/catch-all-drawer"

# Simple search
python file_search.py "sql query retrieving error messages between dates ordered by system type"

# With caching for faster repeated searches
python file_search.py "package json for 7tv" --cache

# Fast model for large codebases
python file_search.py "find the sql query that retrieves users who are subscribed to pro tier" --model fast

# High quality search with more results
python file_search.py "formula for calculating profit margins for ABC project" --model best -k 15

Advanced Usage

# Specific file types only
python file_search.py "React components" --extensions .js .jsx .ts .tsx

# Debug mode to see all scores
python file_search.py "utility functions" --debug --min-score 0.0

# Performance tuning
python file_search.py "test files" --threads 16 --gpu-batch-size 512 --perf-report

📁 File Structure

semantic-file-search/
├── file_search.py     # 🎯 Main entry point and CLI
├── config.py          # ⚙️ Configuration management
├── file_processor.py  # 📂 File scanning and text extraction
├── vector_db.py       # 🤖 AI model and vector database
├── utils.py           # 🛠️ Utilities (spinner, progress, caching)
└── README.md          # 📖 This documentation

🔧 Installation

Prerequisites

pip install chromadb sentence-transformers unstructured python-magic-bin torch

GPU Acceleration (Optional but Recommended)

# For NVIDIA GPUs with CUDA
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

🎛️ Command Line Options

Essential Options

| Option | Description | Example |
| --- | --- | --- |
| query | What to search for | "machine learning code" |
| -k, --top-k | Max results (1-100) | -k 5 |
| -s, --min-score | Similarity threshold (0.0-1.0) | -s 0.7 |
| -r, --root-dir | Directory to search | -r "C:/Projects" |
| -m, --model | AI model to use | --model fast |

Performance Options

| Option | Description | Example |
| --- | --- | --- |
| --cache | Enable smart caching | --cache |
| --threads N | Worker threads | --threads 16 |
| --gpu-batch-size N | GPU batch size | --gpu-batch-size 512 |
| --no-gpu | Disable GPU | --no-gpu |

File Filtering

| Option | Description | Example |
| --- | --- | --- |
| --extensions | File types to include | --extensions .py .js .md |
| --max-size MB | Max file size | --max-size 500 |
| --max-files N | Limit total files | --max-files 1000 |

Development & Debug

| Option | Description | Example |
| --- | --- | --- |
| --debug | Show detailed info | --debug |
| --verbose | Technical logging | --verbose |
| --perf-report | Performance breakdown | --perf-report |
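
For readers who want to see how these options fit together, the sketch below declares the flags from the tables above with Python's standard `argparse` module. It is an illustration only: the real parser in file_search.py may be organized differently, and the default values here (other than model `A` and min score `0.0`, which this README mentions) are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Sketch of a parser for the documented options (not the tool's actual code)."""
    parser = argparse.ArgumentParser(prog="file_search.py")
    parser.add_argument("query", help="What to search for")
    # Essential options
    parser.add_argument("-k", "--top-k", type=int, default=10, help="Max results (1-100)")
    parser.add_argument("-s", "--min-score", type=float, default=0.0, help="Similarity threshold")
    parser.add_argument("-r", "--root-dir", default=None, help="Directory to search")
    parser.add_argument("-m", "--model", default="A", help="AI model to use")
    # Performance options
    parser.add_argument("--cache", action="store_true", help="Enable smart caching")
    parser.add_argument("--threads", type=int, default=None, help="Worker threads")
    parser.add_argument("--gpu-batch-size", type=int, default=None, help="GPU batch size")
    parser.add_argument("--no-gpu", action="store_true", help="Disable GPU")
    # File filtering
    parser.add_argument("--extensions", nargs="+", default=None, help="File types to include")
    parser.add_argument("--max-size", type=float, default=None, help="Max file size in MB")
    parser.add_argument("--max-files", type=int, default=None, help="Limit total files")
    # Development and debug
    parser.add_argument("--debug", action="store_true", help="Show detailed info")
    parser.add_argument("--verbose", action="store_true", help="Technical logging")
    parser.add_argument("--perf-report", action="store_true", help="Performance breakdown")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(["React components", "--extensions", ".js", ".jsx", "-k", "5"])
print(args.query, args.extensions, args.top_k)
```

Because `--extensions` accepts one or more values, the query is safest as the first argument, which is how every example in this README is written.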

🤖 AI Model Selection

Quick Reference

| Model | Speed | Quality | Best For |
| --- | --- | --- | --- |
| fast | ⚡⚡⚡ | ⭐⭐ | Large codebases, quick results |
| A (default) | ⚡⚡ | ⭐⭐⭐ | Balanced speed and quality |
| B | ⚡ | ⭐⭐⭐⭐ | Better accuracy, still fast |
| best | ⚡ | ⭐⭐⭐⭐⭐ | Research, critical searches |
| code | ⚡ | ⭐⭐⭐⭐ | Programming files |
| multi | ⚡⚡ | ⭐⭐⭐ | Multiple languages |

** More model shortcuts are listed in config.py.

# You can also pass any other accepted model by its full name (quoting the name is optional)
python file_search.py "API endpoints" --model sentence-transformers/all-MiniLM-L6-v2
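
One simple way to support both shortcuts and full names is a dictionary lookup that falls back to the input. The sketch below is an assumption about how that could work: the names `MODEL_SHORTCUTS` and `resolve_model` are hypothetical, and the shortcut-to-model pairs are illustrative guesses, not the mapping actually defined in config.py.

```python
# Hypothetical mapping for illustration only; see config.py for the real one.
MODEL_SHORTCUTS = {
    "fast": "sentence-transformers/all-MiniLM-L6-v2",
    "best": "sentence-transformers/all-mpnet-base-v2",
}


def resolve_model(name: str) -> str:
    """Return the full model name for a shortcut, or the name unchanged."""
    return MODEL_SHORTCUTS.get(name, name)


print(resolve_model("fast"))
print(resolve_model("sentence-transformers/all-MiniLM-L6-v2"))
```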

Examples

# Speed priority - fastest processing
python file_search.py "API endpoints" --model fast

# Balanced (recommended) - good speed and quality  
python file_search.py "user authentication" --model A

# Quality priority - best accuracy
python file_search.py "algorithm research" --model best

# Code-specific - understands programming concepts
python file_search.py "database queries" --model code

📊 Understanding Results

Similarity Scores

  • 0.8-1.0: Very similar files (exact matches, same topic)
  • 0.6-0.8: Quite similar files (related functionality)
  • 0.4-0.6: Somewhat related files (same domain)
  • 0.2-0.4: Loosely related files (may contain relevant info)
  • 0.0-0.2: Minimal similarity (likely not relevant)

** Note: In practice, a score of 0.0-0.2 does not necessarily mean a file is irrelevant, because scores can also be negative. This is why the default minimum score is set to 0.0.
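
This README does not say which similarity metric the tool uses. Assuming it is cosine similarity, which ranges from -1 to 1, negative scores are expected, and a 0.0 threshold only drops vectors that point away from the query. The sketch below shows that behavior with toy 2-D vectors in plain Python; `cosine_similarity` and `rank` are hypothetical helpers, not functions from vector_db.py.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors, in the range [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank(query_vec, file_vecs, top_k=10, min_score=0.0):
    """Score every file, drop those below min_score, return the best top_k."""
    scored = [(cosine_similarity(query_vec, vec), path) for path, vec in file_vecs.items()]
    kept = [(score, path) for score, path in scored if score >= min_score]
    return sorted(kept, reverse=True)[:top_k]


# Toy embeddings: real ones have hundreds of dimensions.
files = {"a.sql": [1.0, 0.0], "b.txt": [0.0, 1.0], "c.md": [-1.0, 0.0]}
print(rank([1.0, 0.0], files, min_score=0.0))
print(rank([1.0, 0.0], files, min_score=-1.0))
```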

Example Output

🎯 Found 5 relevant files for 'authentication code':
============================================================
 1. [0.847] src/auth/login.py
 2. [0.723] utils/user_validation.js  
 3. [0.681] components/SignIn.tsx
 4. [0.634] middleware/auth_middleware.py
 5. [0.592] tests/test_authentication.py
============================================================

⚡ Performance Tips

First Run vs Subsequent Runs

  • First run: Builds search index (slower, ~2-10 minutes)
  • Later runs: Uses cached index (fast, ~5-30 seconds)
  • With --cache: Skips unchanged files (much faster)
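
Skipping unchanged files usually comes down to remembering a cheap signature per file and comparing it on the next run. The sketch below uses file size plus modification time stored in a JSON file. This is an assumed approach with hypothetical names (`file_signature`, `split_changed`); the caching code in utils.py may detect changes differently.

```python
import json
import os
from pathlib import Path


def file_signature(path):
    """Cheap change signature: file size and modification time in nanoseconds."""
    st = os.stat(path)
    return [st.st_size, st.st_mtime_ns]


def split_changed(paths, cache_file):
    """Return the paths whose signature differs from the cached one, then update the cache."""
    cache_path = Path(cache_file)
    old = json.loads(cache_path.read_text()) if cache_path.exists() else {}
    new, changed = {}, []
    for p in paths:
        key = str(p)
        sig = file_signature(p)
        new[key] = sig
        if old.get(key) != sig:
            changed.append(key)  # new or modified file: needs re-indexing
    cache_path.write_text(json.dumps(new))
    return changed
```

On the first run every file is reported as changed; on later runs only new or modified files are returned, so only those need to be embedded again.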

Speed Optimizations

# Fastest possible search
python file_search.py "query" --model fast --cache --priority-exts

# For large projects (limit files for testing)
python file_search.py "query" --max-files 1000 --model fast

# Multi-core optimization
python file_search.py "query" --threads 16 --gpu-batch-size 512
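
Reading many files is mostly I/O-bound, which is why more threads can help. Here is a minimal sketch of parallel file reading with the standard library's `ThreadPoolExecutor`. The helper names `read_text` and `read_all` are hypothetical, and how `--threads` is actually used inside the tool is an assumption.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def read_text(path):
    """Read one file as text; unreadable files yield an empty string."""
    try:
        return str(path), Path(path).read_text(encoding="utf-8", errors="ignore")
    except OSError:
        return str(path), ""


def read_all(paths, threads=8):
    """Read files concurrently and return a {path: text} dictionary."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return dict(pool.map(read_text, paths))
```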

Memory Management

  • Large projects: Use --model fast or --model tiny
  • Memory issues: Reduce --gpu-batch-size to 128 or 64
  • Very large: Use --max-files to limit processing
  • Automatic garbage collection every 1000 files
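
The batch size and periodic garbage collection mentioned above can be pictured as the loop below. It is a sketch under assumptions: `embed` stands in for whatever function turns a batch of texts into vectors, and `batched` and `index_in_batches` are hypothetical names rather than the tool's own.

```python
import gc


def batched(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def index_in_batches(texts, embed, batch_size=128, gc_every=1000):
    """Embed texts batch by batch, running garbage collection every gc_every items."""
    vectors, since_gc = [], 0
    for batch in batched(texts, batch_size):
        vectors.extend(embed(batch))  # smaller batches mean lower peak memory
        since_gc += len(batch)
        if since_gc >= gc_every:
            gc.collect()
            since_gc = 0
    return vectors


# Stand-in "model" that maps each text to its length.
print(index_in_batches(["a", "bb", "ccc"], lambda batch: [len(t) for t in batch], batch_size=2))
```

Lowering `batch_size` trades throughput for memory, which matches the advice to reduce --gpu-batch-size when memory runs out.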

🔍 Search Strategies

Finding Code

# Programming concepts
python file_search.py "error handling patterns" --model code
python file_search.py "database connection logic" --extensions .py .js .ts
python file_search.py "REST API endpoints" --model code

# Specific technologies
python file_search.py "React component state management"
python file_search.py "SQL query optimization" 
python file_search.py "async await patterns"

Finding Documentation

# Documentation and guides
python file_search.py "installation instructions" --extensions .md .txt .rst
python file_search.py "API documentation" --model best
python file_search.py "troubleshooting guide"

Finding Configuration

# Config and setup files
python file_search.py "database configuration" --extensions .json .yaml .cfg
python file_search.py "environment variables" --extensions .env .config
python file_search.py "build settings" --extensions .json .yaml .toml

πŸ› Troubleshooting

Common Issues

Slow Performance

# Try faster model
python file_search.py "query" --model fast

# Limit files for testing
python file_search.py "query" --max-files 1000

# Check if GPU is working
python file_search.py "query" --debug  # Look for GPU detection info

No Results Found

# Lower the threshold to see all results
python file_search.py "query" --min-score 0.0 --debug

# Try different model
python file_search.py "query" --model best

# Check what files are being processed
python file_search.py "query" --debug --verbose

Memory Issues

# Use smaller model
python file_search.py "query" --model tiny

# Reduce GPU batch size
python file_search.py "query" --gpu-batch-size 64

# Force CPU-only mode
python file_search.py "query" --no-gpu

GPU Not Working

# Check CUDA installation
python -c "import torch; print(torch.cuda.is_available())"

# Force CPU mode
python file_search.py "query" --no-gpu

# Install GPU-enabled PyTorch
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
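
If you want the same check in script form, the sketch below picks a device and falls back to CPU when PyTorch is missing or CUDA is unavailable. `pick_device` is a hypothetical helper; it only relies on `torch.cuda.is_available()`, the same call used in the one-liner above, and the tool's own detection logic may differ.

```python
def pick_device(no_gpu: bool = False) -> str:
    """Return "cuda" when a CUDA GPU is usable, otherwise "cpu"."""
    if no_gpu:
        return "cpu"  # mirrors the intent of --no-gpu
    try:
        import torch  # imported lazily so the check works without PyTorch installed
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


print(pick_device())
```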

Debug Mode

Use --debug to see detailed information:

  • File processing statistics
  • Search score breakdown
  • Hardware detection results
  • Performance bottlenecks

python file_search.py "query" --debug --perf-report

🔄 What's New

Performance Improvements

  • ⚡ Faster Startup: Deferred imports, show config immediately
  • 🚀 Optimized Scanning: Smart directory traversal, skip build folders
  • 📊 Live Progress: Real-time progress for both scanning and indexing
  • 💾 Better Caching: Smarter file change detection
  • 🎯 Priority Processing: Common file types processed first
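
The scanning ideas in this list (skipping build folders, filtering by extension and size, and handling common file types first) can be sketched with `os.walk`. Everything named here is an assumption for illustration: `SKIP_DIRS`, `PRIORITY_EXTS`, and `scan` are hypothetical, and the folders and extensions the tool actually uses are defined in its own code.

```python
import os

# Hypothetical lists; the tool's real skip and priority lists may differ.
SKIP_DIRS = {".git", "node_modules", "__pycache__", "build", "dist"}
PRIORITY_EXTS = (".py", ".sql", ".md", ".txt")


def scan(root, extensions=None, max_size_mb=None, max_files=None):
    """Collect file paths under root, pruning skipped folders and applying filters."""
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Editing dirnames in place stops os.walk from descending into these folders.
        dirnames[:] = [d for d in dirnames if d not in SKIP_DIRS]
        for name in filenames:
            path = os.path.join(dirpath, name)
            ext = os.path.splitext(name)[1].lower()
            if extensions and ext not in extensions:
                continue
            if max_size_mb is not None and os.path.getsize(path) > max_size_mb * 1024 * 1024:
                continue
            found.append(path)
    # Priority extensions sort first (False < True), then alphabetically by path.
    found.sort(key=lambda p: (os.path.splitext(p)[1].lower() not in PRIORITY_EXTS, p))
    return found[:max_files] if max_files else found
```

Pruning `dirnames` in place is what makes the traversal cheap: skipped folders are never opened at all.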

Enhanced Context

  • 📁 Path Awareness: File paths included in search context
  • 🏷️ Better Naming: Filenames and directory names boost relevance
  • 🔍 Improved Matching: Finds files even with different terminology

Modular Architecture

  • 📂 Split Files: Organized into logical modules
  • 🛠️ Better Maintenance: Easier to extend and modify
  • 📖 Clear Documentation: Comprehensive help and examples
