🔍 Files Semantic Search

An AI-powered file search tool that finds files by meaning, not just keywords. Uses vector embeddings to understand context and find relevant files even when they use different terminology.

(Most, if not all, of the code and text was generated by ChatGPT and Claude.)

Inspiration

The idea for this tool came from my own habits: I have ad hoc scripts, formulas, random notes, and other files that I do not want to organize and maintain. So what if I put them all in one folder and had an AI tool check whether anything there fits, then return a short list of files that might contain what I'm looking for, so I don't have to search for it manually?

Example 1: A coworker asks for a list of users who subscribed to the pro tier the previous year. I write the SQL query this once, but they may need the report again in the future, so where do I save it and what do I name it so that I can still find it 1 month or 6 months down the road? This is where the tool comes in: it uses AI to check your files for a query you've already written, taking file paths and filenames into account as well in case a file is named appropriately. Now you can put all these random files under a single directory (subdirectories included) and let the tool search it for you.

Example 2: There's a formula for something that I think is in a PDF. But I don't remember whether it was in a PDF, a DOC, or a TXT file, or even whether I saved it. I tell the AI tool what I'm looking for and it finds the file that has the formula.

✨ Key Features

🧠 Semantic Understanding: Finds files by meaning, not just exact word matches
⚡ Fast Performance: Optimized scanning with GPU acceleration support
💾 Smart Caching: Skips unchanged files for faster subsequent searches
🎯 Flexible Models: Choose from fast, balanced, or high-quality AI models
📊 Progress Tracking: Real-time progress with detailed feedback
🛠️ Enhanced Context: Includes file paths and names in search context
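
To illustrate what "finding files by meaning" amounts to, here is a minimal sketch of ranking files by vector similarity. It assumes cosine similarity and uses tiny hand-made vectors in place of real embeddings; the function names and the toy index are hypothetical and are not taken from this project's code, which may score results differently.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (range -1.0 to 1.0)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero vector as unrelated to everything
    return dot / (norm_a * norm_b)

def rank_files(query_vec, file_vecs, top_k=5, min_score=0.0):
    """Return up to top_k (score, path) pairs with score >= min_score, best first."""
    scored = [(cosine_similarity(query_vec, vec), path) for path, vec in file_vecs.items()]
    scored = [item for item in scored if item[0] >= min_score]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:top_k]

# Toy 3-dimensional "embeddings"; a real model produces hundreds of dimensions.
toy_index = {
    "queries/pro_tier_users.sql": [0.9, 0.1, 0.0],
    "notes/margin_formula.txt": [0.1, 0.9, 0.2],
    "misc/todo.txt": [-0.5, 0.0, 0.8],
}
results = rank_files([1.0, 0.0, 0.0], toy_index, top_k=2)
for score, path in results:
    print(f"[{score:.3f}] {path}")
```

The third file gets a negative score against this query, so the default `min_score=0.0` filters it out.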

⚙️📌 Configuration

You can customize several default configurations in config.py.

Root search directory
AI model
Top k results
Minimum similarity score threshold
Number of threads
GPU batch size

🚨📢 Important ‼️

🚨📢 Start by opening config.py and setting DEFAULT_ROOT_DIR to your go-to folder. That way, you won’t have to pass in the directory every time or keep the tool's files inside the directory you want to search. 📢🚨
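
As a rough idea of what such a config file might contain, here is a sketch. Only DEFAULT_ROOT_DIR is a name confirmed by this README; every other name and most values below are hypothetical placeholders, so check the real config.py for the actual names and defaults.

```python
# Illustrative config sketch. Only DEFAULT_ROOT_DIR is named in this README;
# all other names and numeric values are hypothetical placeholders.
from pathlib import Path

# Folder that is searched when -r/--root-dir is not given (example path).
DEFAULT_ROOT_DIR = Path.home() / "catch-all-drawer"

DEFAULT_MODEL = "A"            # the README lists `A` as the default model shortcut
DEFAULT_TOP_K = 10             # hypothetical: maximum number of results
DEFAULT_MIN_SCORE = 0.0        # the README states the default minimum score is 0.0
DEFAULT_THREADS = 8            # hypothetical: worker threads for scanning
DEFAULT_GPU_BATCH_SIZE = 256   # hypothetical: files embedded per GPU batch
```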

🚀 Quick Start

Basic Usage

# Specified directory search
python file_search.py "find the sql query that retrieves users who are subscribed to free tier" -r "C:/Users/reddtoric/catch-all-drawer"

# Simple search
python file_search.py "sql query retrieving error messages between dates ordered by system type"

# With caching for faster repeated searches
python file_search.py "package json for 7tv" --cache

# Fast model for large codebases
python file_search.py "find the sql query that retrieves users who are subscribed to pro tier" --model fast

# High quality search with more results
python file_search.py "formula for calculating profit margins for ABC project" --model best -k 15

Advanced Usage

# Specific file types only
python file_search.py "React components" --extensions .js .jsx .ts .tsx

# Debug mode to see all scores
python file_search.py "utility functions" --debug --min-score 0.0

# Performance tuning
python file_search.py "test files" --threads 16 --gpu-batch-size 512 --perf-report

📁 File Structure

semantic-file-search/
├── file_search.py     # 🎯 Main entry point and CLI
├── config.py          # ⚙️ Configuration management
├── file_processor.py  # 📂 File scanning and text extraction
├── vector_db.py       # 🤖 AI model and vector database
├── utils.py           # 🛠️ Utilities (spinner, progress, caching)
└── README.md          # 📖 This documentation

🔧 Installation

Prerequisites

pip install chromadb sentence-transformers unstructured python-magic-bin torch

GPU Acceleration (Optional but Recommended)

# For NVIDIA GPUs with CUDA
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

🎛️ Command Line Options

Essential Options

Option	Description	Example
`query`	What to search for	`"machine learning code"`
`-k, --top-k`	Max results (1-100)	`-k 5`
`-s, --min-score`	Similarity threshold (0.0-1.0)	`-s 0.7`
`-r, --root-dir`	Directory to search	`-r "C:/Projects"`
`-m, --model`	AI model to use	`--model fast`

Performance Options

Option	Description	Example
`--cache`	Enable smart caching	`--cache`
`--threads N`	Worker threads	`--threads 16`
`--gpu-batch-size N`	GPU batch size	`--gpu-batch-size 512`
`--no-gpu`	Disable GPU	`--no-gpu`

File Filtering

Option	Description	Example
`--extensions`	File types to include	`--extensions .py .js .md`
`--max-size MB`	Max file size	`--max-size 500`
`--max-files N`	Limit total files	`--max-files 1000`

Development & Debug

Option	Description	Example
`--debug`	Show detailed info	`--debug`
`--verbose`	Technical logging	`--verbose`
`--perf-report`	Performance breakdown	`--perf-report`
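
For readers who want to see how the options in the tables above fit together, here is a minimal argparse sketch that accepts the same flags. It is an illustration only, not the parser in file_search.py; the default values are assumptions, and the documented ranges (for example 1-100 for -k) are not enforced here.

```python
import argparse

def build_parser():
    """Build a parser mirroring the options documented in this README (sketch)."""
    parser = argparse.ArgumentParser(prog="file_search.py",
                                     description="Semantic file search (sketch)")
    # Essential options
    parser.add_argument("query", help="What to search for")
    parser.add_argument("-k", "--top-k", type=int, default=10, help="Max results")
    parser.add_argument("-s", "--min-score", type=float, default=0.0,
                        help="Similarity threshold")
    parser.add_argument("-r", "--root-dir", default=None, help="Directory to search")
    parser.add_argument("-m", "--model", default="A", help="AI model to use")
    # Performance options
    parser.add_argument("--cache", action="store_true", help="Enable smart caching")
    parser.add_argument("--threads", type=int, default=None, help="Worker threads")
    parser.add_argument("--gpu-batch-size", type=int, default=None, help="GPU batch size")
    parser.add_argument("--no-gpu", action="store_true", help="Disable GPU")
    # File filtering
    parser.add_argument("--extensions", nargs="+", default=None,
                        help="File types to include")
    parser.add_argument("--max-size", type=float, default=None, help="Max file size in MB")
    parser.add_argument("--max-files", type=int, default=None, help="Limit total files")
    # Development and debug
    parser.add_argument("--debug", action="store_true", help="Show detailed info")
    parser.add_argument("--verbose", action="store_true", help="Technical logging")
    parser.add_argument("--perf-report", action="store_true", help="Performance breakdown")
    return parser

# Parse an example command line instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(
    ["React components", "--extensions", ".js", ".jsx", "-k", "15", "--cache"]
)
print(args.query, args.extensions, args.top_k, args.cache)
```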

🤖 AI Model Selection

Quick Reference

Model	Speed	Quality	Best For
`fast`	⚡⚡⚡	⭐⭐	Large codebases, quick results
`A` (default)	⚡⚡	⭐⭐⭐	Balanced speed and quality
`B`	⚡	⭐⭐⭐⭐	Better accuracy, still fast
`best`	⚡	⭐⭐⭐⭐⭐	Research, critical searches
`code`	⚡	⭐⭐⭐⭐	Programming files
`multi`	⚡⚡	⭐⭐⭐	Multiple languages

** More model shortcuts are listed in config.py.

# You can also pass any other accepted model by its full name (quoting the name is optional)
python file_search.py "API endpoints" --model sentence-transformers/all-MiniLM-L6-v2

python file_search.py "API endpoints" --model "sentence-transformers/all-MiniLM-L6-v2"

Examples

# Speed priority - fastest processing
python file_search.py "API endpoints" --model fast

# Balanced (recommended) - good speed and quality  
python file_search.py "user authentication" --model A

# Quality priority - best accuracy
python file_search.py "algorithm research" --model best

# Code-specific - understands programming concepts
python file_search.py "database queries" --model code

📊 Understanding Results

Similarity Scores

0.8-1.0: Very similar files (exact matches, same topic)
0.6-0.8: Quite similar files (related functionality)
0.4-0.6: Somewhat related files (same domain)
0.2-0.4: Loosely related files (may contain relevant info)
0.0-0.2: Minimal similarity (likely not relevant)

** Note: I have noticed that a score of 0.0-0.2 does not necessarily mean a file is irrelevant, because scores can also be negative. This is why I set the default minimum score to 0.0.
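
The bands above can be expressed as a small helper. This is a sketch for reading scores, not code from the tool; the function name and the label strings are hypothetical, and negative scores get their own label since the tool can produce them.

```python
def describe_score(score):
    """Map a similarity score to the rough bands listed in this README."""
    if score >= 0.8:
        return "very similar"
    if score >= 0.6:
        return "quite similar"
    if score >= 0.4:
        return "somewhat related"
    if score >= 0.2:
        return "loosely related"
    if score >= 0.0:
        return "minimal similarity"
    return "negative score"

for value in (0.847, 0.592, 0.15, -0.3):
    print(f"{value:+.3f} -> {describe_score(value)}")
```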

Example Output

🎯 Found 5 relevant files for 'authentication code':
============================================================
 1. [0.847] src/auth/login.py
 2. [0.723] utils/user_validation.js  
 3. [0.681] components/SignIn.tsx
 4. [0.634] middleware/auth_middleware.py
 5. [0.592] tests/test_authentication.py
============================================================

⚡ Performance Tips

First Run vs Subsequent Runs

First run: Builds search index (slower, ~2-10 minutes)
Later runs: Uses cached index (fast, ~5-30 seconds)
With --cache: Skips unchanged files (much faster)
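
One common way to skip unchanged files is to remember each file's size and modification time and re-index only when they differ. The sketch below assumes that approach; this README does not say how the tool actually detects changes, and the function names and the in-memory dict cache are hypothetical (a real cache would be saved to disk between runs).

```python
import os

def file_fingerprint(path):
    """Cheap change indicator: (size in bytes, modification time in nanoseconds)."""
    stat = os.stat(path)
    return (stat.st_size, stat.st_mtime_ns)

def files_needing_reindex(paths, cache):
    """Return the paths whose fingerprint differs from the cached one.

    `cache` maps path strings to fingerprints and is updated in place, so a
    second call with no file changes returns an empty list.
    """
    changed = []
    for path in paths:
        key = str(path)
        fingerprint = file_fingerprint(path)
        if cache.get(key) != fingerprint:
            changed.append(path)
            cache[key] = fingerprint
    return changed
```

Hashing file contents would catch more edge cases (for example, an edit that keeps the same size and timestamp) at the cost of reading every file on every run.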

Speed Optimizations

# Fastest possible search
python file_search.py "query" --model fast --cache --priority-exts

# For large projects (limit files for testing)
python file_search.py "query" --max-files 1000 --model fast

# Multi-core optimization
python file_search.py "query" --threads 16 --gpu-batch-size 512

Memory Management

Large projects: Use --model fast or --model tiny
Memory issues: Reduce --gpu-batch-size to 128 or 64
Very large: Use --max-files to limit processing
Automatic garbage collection every 1000 files
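
The batch-size and garbage-collection points above can be pictured with the following sketch. It assumes files are embedded in fixed-size batches and that `gc.collect()` is called each time another 1000 files have been processed; `process_files`, `embed_batch`, and the parameter names are hypothetical stand-ins, not the tool's own functions.

```python
import gc

def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def process_files(paths, embed_batch, batch_size=128, gc_every=1000):
    """Send files to `embed_batch` in batches, collecting garbage periodically."""
    processed = 0
    next_gc = gc_every
    for batch in batched(paths, batch_size):
        embed_batch(batch)  # smaller batches lower peak memory use
        processed += len(batch)
        if processed >= next_gc:
            gc.collect()  # release unreachable objects before continuing
            while processed >= next_gc:
                next_gc += gc_every
    return processed

# Example with a stand-in "embedder" that only records the batch sizes.
seen_batch_sizes = []
total = process_files([f"file_{i}.txt" for i in range(10)],
                      lambda batch: seen_batch_sizes.append(len(batch)),
                      batch_size=4)
print(total, seen_batch_sizes)
```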

🔍 Search Strategies

Finding Code

# Programming concepts
python file_search.py "error handling patterns" --model code
python file_search.py "database connection logic" --extensions .py .js .ts
python file_search.py "REST API endpoints" --model code

# Specific technologies
python file_search.py "React component state management"
python file_search.py "SQL query optimization" 
python file_search.py "async await patterns"

Finding Documentation

# Documentation and guides
python file_search.py "installation instructions" --extensions .md .txt .rst
python file_search.py "API documentation" --model best
python file_search.py "troubleshooting guide"

Finding Configuration

# Config and setup files
python file_search.py "database configuration" --extensions .json .yaml .cfg
python file_search.py "environment variables" --extensions .env .config
python file_search.py "build settings" --extensions .json .yaml .toml

🐛 Troubleshooting

Common Issues

Slow Performance

# Try faster model
python file_search.py "query" --model fast

# Limit files for testing
python file_search.py "query" --max-files 1000

# Check if GPU is working
python file_search.py "query" --debug  # Look for GPU detection info

No Results Found

# Lower the threshold to see all results
python file_search.py "query" --min-score 0.0 --debug

# Try different model
python file_search.py "query" --model best

# Check what files are being processed
python file_search.py "query" --debug --verbose

Memory Issues

# Use smaller model
python file_search.py "query" --model tiny

# Reduce GPU batch size
python file_search.py "query" --gpu-batch-size 64

# Force CPU-only mode
python file_search.py "query" --no-gpu

GPU Not Working

# Check CUDA installation
python -c "import torch; print(torch.cuda.is_available())"

# Force CPU mode
python file_search.py "query" --no-gpu

# Install GPU-enabled PyTorch
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

Debug Mode

Use --debug to see detailed information:

File processing statistics
Search score breakdown
Hardware detection results
Performance bottlenecks

python file_search.py "query" --debug --perf-report

🔄 What's New

Performance Improvements

⚡ Faster Startup: Deferred imports, show config immediately
🚀 Optimized Scanning: Smart directory traversal, skip build folders
📊 Live Progress: Real-time progress for both scanning and indexing
💾 Better Caching: Smarter file change detection
🎯 Priority Processing: Common file types processed first
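
As an illustration of "skip build folders" and "common file types processed first", here is a sketch built on `os.walk`. The skip list and the priority extensions are assumptions chosen for the example, not the lists the tool uses.

```python
import os

# Hypothetical lists for illustration; the tool's real lists may differ.
SKIP_DIRS = {"node_modules", ".git", "__pycache__", "build", "dist"}
PRIORITY_EXTS = (".py", ".sql", ".md", ".txt")

def scan_files(root):
    """Walk `root`, pruning skipped folders, and return priority file types first."""
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Editing dirnames in place stops os.walk from descending into them.
        dirnames[:] = [name for name in dirnames if name not in SKIP_DIRS]
        for name in filenames:
            found.append(os.path.join(dirpath, name))
    # False sorts before True, so files with a priority extension come first.
    found.sort(key=lambda path: (os.path.splitext(path)[1].lower() not in PRIORITY_EXTS, path))
    return found
```

Pruning `dirnames` in place is what saves time: the skipped folders are never opened at all, rather than being scanned and filtered afterwards.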

Enhanced Context

📁 Path Awareness: File paths included in search context
🏷️ Better Naming: Filenames and directory names boost relevance
🔍 Improved Matching: Finds files even with different terminology
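
A simple way to make paths and filenames count toward relevance is to prepend them to the text that gets embedded. The sketch below assumes that approach; the header layout, the function name, and the `max_chars` limit are hypothetical and may not match what the tool does.

```python
from pathlib import PurePosixPath

def build_document_text(relative_path, content, max_chars=2000):
    """Prefix file content with its path, filename, and folder names (sketch)."""
    path = PurePosixPath(relative_path)
    folders = " ".join(path.parts[:-1])
    header = f"path: {path}\nfilename: {path.stem}\nfolders: {folders}\n"
    # Truncate long files so the path header is not drowned out by content.
    return header + content[:max_chars]

text = build_document_text("reports/sql/pro_tier_users.sql", "SELECT * FROM users")
print(text)
```

With a header like this, a well-named file can match a query even when its contents use different wording.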

Modular Architecture

📂 Split Files: Organized into logical modules
🛠️ Better Maintenance: Easier to extend and modify
📖 Clear Documentation: Comprehensive help and examples

