An AI-powered file search tool that finds files by meaning, not just keywords. Uses vector embeddings to understand context and find relevant files even when they use different terminology.
(Most, if not all, of the code and text in this project was generated by ChatGPT and Claude.)
The idea for this tool came about because I have ad hoc scripts, formulas, random notes, and other files that I don't want to organize and maintain. So, what if I put them all in one folder and let an AI tool check whether I already have something that fits, returning a short list of files that likely contain what I'm looking for, so I don't have to search manually?
Example 1: A coworker asks for a list of users who subscribed to the pro tier last year. I write the SQL query this time, but they may need the report again later, so where do I put it, and what do I name it, so I remember it 1 or 6 months down the road? In comes this tool, which uses AI to search your files (taking file paths and filenames into consideration as well, in case the file is named appropriately) to check whether you've written this query before. Now you can drop all these random files under a single directory (subdirectories included) and let the tool search it for you.
Example 2: There's a formula for something in a PDF. But I don't remember whether it was a PDF, a DOC, or a TXT file, or even whether I saved it. I tell the AI tool what I'm looking for, and it finds the file that contains the formula.
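At its core, the search embeds both the query and each file's text into vectors, then ranks files by similarity. Here is a minimal sketch of that idea using sentence-transformers; the model name and the file contents below are illustrative, not the tool's exact pipeline:

```python
# Minimal semantic-matching sketch (illustrative, not this tool's exact code).
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

files = {
    "queries/pro_subs_2023.sql": "SELECT user_id FROM subscriptions WHERE tier = 'pro'",
    "notes/margins.txt": "profit margin = (revenue - cost) / revenue",
}

query_emb = model.encode("users subscribed to the pro tier last year")
for path, text in files.items():
    # Embed the path together with the content so filenames add signal
    doc_emb = model.encode(f"{path}\n{text}")
    score = util.cos_sim(query_emb, doc_emb).item()
    print(f"[{score:.3f}] {path}")
```

Even though the SQL file never says "subscribed users", its embedding lands close to the query, which is what lets the tool match different terminology.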
- 🧠 Semantic Understanding: Finds files by meaning, not just exact word matches
- ⚡ Fast Performance: Optimized scanning with GPU acceleration support
- 💾 Smart Caching: Skips unchanged files for faster subsequent searches
- 🎯 Flexible Models: Choose from fast, balanced, or high-quality AI models
- 📊 Progress Tracking: Real-time progress with detailed feedback
- 🛠️ Enhanced Context: Includes file paths and names in search context
You can customize several default configurations in config.py.
- Root search directory
- AI model
- Top-k results
- Minimum similarity score threshold
- Number of threads
- GPU batch size
🚨📢 Start by opening config.py and setting DEFAULT_ROOT_DIR to your go-to folder. That way, you won't have to pass in the directory every time or move your files into the default search directory. 📢🚨
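For reference, the defaults might look something like this; the variable names below (other than DEFAULT_ROOT_DIR) are assumptions based on this README, so check config.py for the real ones:

```python
# Hypothetical config.py defaults -- names are assumed from this README,
# not copied from the actual file.
DEFAULT_ROOT_DIR = "C:/Users/reddtoric/catch-all-drawer"  # your go-to folder
DEFAULT_MODEL = "A"           # balanced model shortcut
DEFAULT_TOP_K = 10            # maximum results returned
DEFAULT_MIN_SCORE = 0.0       # minimum similarity threshold
DEFAULT_THREADS = 8           # worker threads for scanning
DEFAULT_GPU_BATCH_SIZE = 256  # embeddings per GPU batch
```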
# Specified directory search
python file_search.py "find the sql query that retrieves users who are subscribed to free tier" -r "C:/Users/reddtoric/catch-all-drawer"
# Simple search
python file_search.py "sql query retrieving error messages between dates ordered by system type"
# With caching for faster repeated searches
python file_search.py "package json for 7tv" --cache
# Fast model for large codebases
python file_search.py "find the sql query that retrieves users who are subscribed to pro tier" --model fast
# High quality search with more results
python file_search.py "formula for calculating profit margins for ABC project" --model best -k 15# Specific file types only
python file_search.py "React components" --extensions .js .jsx .ts .tsx
# Debug mode to see all scores
python file_search.py "utility functions" --debug --min-score 0.0
# Performance tuning
python file_search.py "test files" --threads 16 --gpu-batch-size 512 --perf-reportsemantic-file-search/
├── file_search.py       # 🎯 Main entry point and CLI
├── config.py            # ⚙️ Configuration management
├── file_processor.py    # 📁 File scanning and text extraction
├── vector_db.py         # 🤖 AI model and vector database
├── utils.py             # 🛠️ Utilities (spinner, progress, caching)
└── README.md            # 📖 This documentation

pip install chromadb sentence-transformers unstructured python-magic-bin torch

# For NVIDIA GPUs with CUDA
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
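Once the dependencies are installed, the indexing core (vector_db.py in the layout above) pairs a sentence-transformers model with a ChromaDB collection. A minimal sketch of that general pattern, assuming standard ChromaDB usage rather than the tool's actual code:

```python
# Index-and-query sketch with chromadb + sentence-transformers
# (illustrative pattern, not vector_db.py's actual implementation).
import chromadb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
client = chromadb.Client()  # in-memory; a real index would persist to disk
collection = client.create_collection(name="files")

docs = {"src/auth/login.py": "def login(user, password): ..."}
for path, text in docs.items():
    emb = model.encode(f"{path}\n{text}").tolist()
    collection.add(ids=[path], embeddings=[emb], documents=[text])

query_emb = model.encode("authentication code").tolist()
hits = collection.query(query_embeddings=[query_emb], n_results=5)
print(hits["ids"][0])  # ranked file paths
```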
| Option | Description | Example |
|---|---|---|
| `query` | What to search for | `"machine learning code"` |
| `-k, --top-k` | Max results (1-100) | `-k 5` |
| `-s, --min-score` | Similarity threshold (0.0-1.0) | `-s 0.7` |
| `-r, --root-dir` | Directory to search | `-r "C:/Projects"` |
| `-m, --model` | AI model to use | `--model fast` |
| Option | Description | Example |
|---|---|---|
| `--cache` | Enable smart caching | `--cache` |
| `--threads N` | Worker threads | `--threads 16` |
| `--gpu-batch-size N` | GPU batch size | `--gpu-batch-size 512` |
| `--no-gpu` | Disable GPU | `--no-gpu` |
| Option | Description | Example |
|---|---|---|
| `--extensions` | File types to include | `--extensions .py .js .md` |
| `--max-size MB` | Max file size | `--max-size 500` |
| `--max-files N` | Limit total files | `--max-files 1000` |
| Option | Description | Example |
|---|---|---|
| `--debug` | Show detailed info | `--debug` |
| `--verbose` | Technical logging | `--verbose` |
| `--perf-report` | Performance breakdown | `--perf-report` |
| Model | Speed | Quality | Best For |
|---|---|---|---|
| `fast` | ⚡⚡⚡ | ⭐⭐ | Large codebases, quick results |
| `A` (default) | ⚡⚡ | ⭐⭐⭐ | Balanced speed and quality |
| `B` | ⚡ | ⭐⭐⭐⭐ | Better accuracy, still fast |
| `best` | ⚡ | ⭐⭐⭐⭐⭐ | Research, critical searches |
| `code` | ⚡ | ⭐⭐⭐⭐ | Programming files |
| `multi` | ⚡⚡ | ⭐⭐⭐ | Multiple languages |
**Note:** More model shortcuts are listed in config.py.
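Internally, a shortcut is presumably just a key that resolves to a full Hugging Face model name, along these lines (the mapping below is illustrative; the real one lives in config.py and may differ):

```python
# Illustrative shortcut resolution -- the actual mapping is in config.py.
MODEL_SHORTCUTS = {
    "fast": "sentence-transformers/paraphrase-MiniLM-L3-v2",
    "A":    "sentence-transformers/all-MiniLM-L6-v2",
    "best": "sentence-transformers/all-mpnet-base-v2",
}

def resolve_model(name: str) -> str:
    # Full model names pass through unchanged
    return MODEL_SHORTCUTS.get(name, name)
```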
# You can still specify other accepted models using the full name
python file_search.py "API endpoints" --model sentence-transformers/all-MiniLM-L6-v2
python file_search.py "API endpoints" --model "sentence-transformers/all-MiniLM-L6-v2"# Speed priority - fastest processing
python file_search.py "API endpoints" --model fast
# Balanced (recommended) - good speed and quality
python file_search.py "user authentication" --model A
# Quality priority - best accuracy
python file_search.py "algorithm research" --model best
# Code-specific - understands programming concepts
python file_search.py "database queries" --model code- 0.8-1.0: Very similar files (exact matches, same topic)
- 0.6-0.8: Quite similar files (related functionality)
- 0.4-0.6: Somewhat related files (same domain)
- 0.2-0.4: Loosely related files (may contain relevant info)
- 0.0-0.2: Minimal similarity (likely not relevant)
**Note:** In my experience, a 0.0-0.2 score does not necessarily mean a file is irrelevant, because scores can also go negative. This is why I set the default minimum score to 0.0.
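The reason scores can go negative: cosine similarity ranges over [-1, 1], so 0.0 is the midpoint, not the floor. A quick check:

```python
# Cosine similarity is in [-1, 1]; vectors pointing "away" from the query
# score below zero, so a 0.0 floor is not the minimum possible score.
import numpy as np

def cos_sim(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cos_sim(np.array([1.0, 0.0]), np.array([-1.0, 0.0])))  # -1.0
print(cos_sim(np.array([1.0, 0.0]), np.array([1.0, 1.0])))   # ~0.707
```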
🎯 Found 5 relevant files for 'authentication code':
============================================================
1. [0.847] src/auth/login.py
2. [0.723] utils/user_validation.js
3. [0.681] components/SignIn.tsx
4. [0.634] middleware/auth_middleware.py
5. [0.592] tests/test_authentication.py
============================================================

- First run: Builds search index (slower, ~2-10 minutes)
- Later runs: Uses cached index (fast, ~5-30 seconds)
- With --cache: Skips unchanged files (much faster)
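One common way to implement that kind of skip (a sketch under the assumption that the cache compares file size and modification time; the tool's actual change detection may differ):

```python
# Sketch: skip files whose size/mtime signature matches a stored manifest.
# The cache filename and signature scheme are assumptions, not the tool's format.
import json, os

CACHE_PATH = ".search_cache.json"

def load_manifest() -> dict:
    try:
        with open(CACHE_PATH) as f:
            return json.load(f)
    except FileNotFoundError:
        return {}

def changed_files(paths, manifest):
    for p in paths:
        st = os.stat(p)
        sig = [st.st_size, st.st_mtime]
        if manifest.get(p) != sig:
            manifest[p] = sig
            yield p  # only changed or new files get re-embedded
```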
# Fastest possible search
python file_search.py "query" --model fast --cache --priority-exts
# For large projects (limit files for testing)
python file_search.py "query" --max-files 1000 --model fast
# Multi-core optimization
python file_search.py "query" --threads 16 --gpu-batch-size 512- Large projects: Use
--model fastor--model tiny - Memory issues: Reduce
--gpu-batch-sizeto 128 or 64 - Very large: Use
--max-filesto limit processing - Automatic garbage collection every 1000 files
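The 1000-file garbage-collection interval is the only detail given here; a sketch of how such periodic collection might look during a long scan (everything else in this snippet is assumed):

```python
# Sketch: free memory periodically while walking a large directory tree.
import gc
from pathlib import Path

def iter_file_texts(root: str, gc_every: int = 1000):
    for i, path in enumerate(Path(root).rglob("*")):
        if not path.is_file():
            continue
        yield path, path.read_text(errors="ignore")
        if (i + 1) % gc_every == 0:
            gc.collect()  # reclaim memory from processed file contents
```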
# Programming concepts
python file_search.py "error handling patterns" --model code
python file_search.py "database connection logic" --extensions .py .js .ts
python file_search.py "REST API endpoints" --model code
# Specific technologies
python file_search.py "React component state management"
python file_search.py "SQL query optimization"
python file_search.py "async await patterns"# Documentation and guides
python file_search.py "installation instructions" --extensions .md .txt .rst
python file_search.py "API documentation" --model best
python file_search.py "troubleshooting guide"# Config and setup files
python file_search.py "database configuration" --extensions .json .yaml .cfg
python file_search.py "environment variables" --extensions .env .config
python file_search.py "build settings" --extensions .json .yaml .toml# Try faster model
python file_search.py "query" --model fast
# Limit files for testing
python file_search.py "query" --max-files 1000
# Check if GPU is working
python file_search.py "query" --debug # Look for GPU detection info# Lower the threshold to see all results
python file_search.py "query" --min-score 0.0 --debug
# Try different model
python file_search.py "query" --model best
# Check what files are being processed
python file_search.py "query" --debug --verbose# Use smaller model
python file_search.py "query" --model tiny
# Reduce GPU batch size
python file_search.py "query" --gpu-batch-size 64
# Force CPU-only mode
python file_search.py "query" --no-gpu# Check CUDA installation
python -c "import torch; print(torch.cuda.is_available())"
# Force CPU mode
python file_search.py "query" --no-gpu
# Install GPU-enabled PyTorch
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

Use --debug to see detailed information:
- File processing statistics
- Search score breakdown
- Hardware detection results
- Performance bottlenecks
python file_search.py "query" --debug --perf-report- β‘ Faster Startup: Deferred imports, show config immediately
- 🔍 Optimized Scanning: Smart directory traversal, skip build folders
- 📊 Live Progress: Real-time progress for both scanning and indexing
- 💾 Better Caching: Smarter file change detection
- 🎯 Priority Processing: Common file types processed first
- 📁 Path Awareness: File paths included in search context
- 🏷️ Better Naming: Filenames and directory names boost relevance
- 🔍 Improved Matching: Finds files even with different terminology
- 📂 Split Files: Organized into logical modules
- 🛠️ Better Maintenance: Easier to extend and modify
- 📖 Clear Documentation: Comprehensive help and examples