CyberSamatha is a Retrieval-Augmented Generation (RAG) system designed for cybersecurity professionals. It leverages local documentation and real-time data from multiple cybersecurity sources to provide accurate, context-aware answers to your security questions.
This system combines your local cybersecurity documentation with regularly updated data from leading security repositories to create a comprehensive knowledge base. Using Google's Gemini AI, it provides intelligent responses to your queries with proper source citations.
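The retrieve-then-answer flow described above can be illustrated with a minimal sketch. It is not the project's actual code: the function names, the toy three-dimensional "embeddings", and the file paths in `CHUNKS` are all hypothetical, and the real system would obtain vectors from an embedding model and send the prompt to Gemini.

```python
import math
from typing import Dict, List

Vector = List[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query_vec: Vector, chunks: List[Dict], top_k: int = 2) -> List[Dict]:
    """Return the top_k chunks most similar to the query vector."""
    ranked = sorted(
        chunks,
        key=lambda c: cosine_similarity(query_vec, c["embedding"]),
        reverse=True,
    )
    return ranked[:top_k]


def build_prompt(question: str, retrieved: List[Dict]) -> str:
    """Number each retrieved chunk so the model can cite it as [n]."""
    context = "\n".join(
        f"[{i}] ({c['source']}) {c['text']}"
        for i, c in enumerate(retrieved, start=1)
    )
    return (
        "Answer using only the context below and cite sources as [n].\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


# Hypothetical toy data: hand-written vectors stand in for real embeddings.
CHUNKS = [
    {"source": "data/handbooks/xss.md", "text": "XSS injects scripts into pages.",
     "embedding": [0.9, 0.1, 0.0]},
    {"source": "data/nvdcve/apache.json", "text": "Apache HTTP Server CVE entry.",
     "embedding": [0.0, 0.2, 0.9]},
    {"source": "data/handbooks/sqli.md", "text": "SQL injection alters queries.",
     "embedding": [0.6, 0.7, 0.1]},
]

if __name__ == "__main__":
    hits = retrieve([1.0, 0.0, 0.0], CHUNKS)
    print(build_prompt("What is XSS?", hits))
```

Numbering the chunks in the prompt is one simple way to make source citations possible, since the model can refer back to `[1]`, `[2]`, and so on.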
- Multi-format Document Support: Handles PDF, DOCX, PPTX, JSON, YAML, TXT, and MD files
- Automated Data Updates: Regularly syncs with top cybersecurity repositories
- Semantic Search: Uses vector embeddings for intelligent document retrieval
- Source Citation: Provides references to original documents in all responses
- Interactive CLI: Natural conversation interface for querying your knowledge base
- Change Detection: Only reindexes modified documents to save processing time
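Change detection of the kind listed above is commonly built on content hashes. The sketch below shows that general idea only, under the assumption of a JSON manifest of SHA-256 digests; the README does not say how the project detects changes, and `changed_files` and the manifest layout are hypothetical.

```python
import hashlib
import json
from pathlib import Path
from typing import Dict, List


def file_digest(path: Path) -> str:
    """SHA-256 of a file's bytes, read in blocks to bound memory use."""
    h = hashlib.sha256()
    with path.open("rb") as fh:
        for block in iter(lambda: fh.read(65536), b""):
            h.update(block)
    return h.hexdigest()


def changed_files(data_dir: Path, manifest_path: Path) -> List[Path]:
    """Return files that are new or modified since the last run, then save the new manifest.

    manifest_path should live outside data_dir so it is not hashed itself.
    """
    old: Dict[str, str] = {}
    if manifest_path.exists():
        old = json.loads(manifest_path.read_text(encoding="utf-8"))

    new: Dict[str, str] = {}
    changed: List[Path] = []
    for path in sorted(p for p in data_dir.rglob("*") if p.is_file()):
        key = path.relative_to(data_dir).as_posix()
        new[key] = file_digest(path)
        if old.get(key) != new[key]:
            changed.append(path)

    manifest_path.write_text(json.dumps(new, indent=2), encoding="utf-8")
    return changed
```

Only the paths returned by `changed_files` would need to be re-embedded; a forced reindex would simply ignore the manifest.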
The system automatically maintains updated data from:
- Awesome Cybersecurity Handbooks: Comprehensive security guides and references
- Exploit Database: Latest exploits and proof-of-concepts
- GitHub Advisory Database: Official vulnerability advisories
- NVD CVE Database: Common Vulnerabilities and Exposures data
- Python 3.8+
- Google Gemini API key
- Git (for data updates)
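The three prerequisites above can be verified programmatically. This is a hedged sketch of such a check, not the contents of the project's `setup_check.py`; the function name and result keys are hypothetical, and it only looks at the process environment, so a key stored solely in `.env` would need to be loaded first.

```python
import os
import shutil
import sys
from typing import Dict, Mapping, Tuple


def check_prerequisites(
    env: Mapping[str, str] = os.environ,
    version: Tuple[int, int] = (sys.version_info[0], sys.version_info[1]),
) -> Dict[str, bool]:
    """Report which prerequisites are satisfied on this machine."""
    return {
        # Python 3.8 or newer
        "python>=3.8": version >= (3, 8),
        # A non-empty GEMINI_API_KEY in the environment
        "gemini_api_key": bool(env.get("GEMINI_API_KEY", "").strip()),
        # A git executable on PATH (needed for data updates)
        "git": shutil.which("git") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{'OK     ' if ok else 'MISSING'} {name}")
```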
- Clone the repository:
git clone https://github.com/RicheByte/cyberSamantha
cd cyberSamantha
- Install dependencies:
pip install -r requirements.txt
- Create environment configuration:
# Linux/Mac
echo "GEMINI_API_KEY=your_api_key_here" > .env
# Windows PowerShell
"GEMINI_API_KEY=your_api_key_here" | Out-File -Encoding utf8 .env
# Windows CMD
echo GEMINI_API_KEY=your_api_key_here > .env
- Make scripts executable (Linux/Mac only):
chmod +x ask.sh
- Check your setup (cross-platform):
python setup_check.py
- Update all data sources:
python update_data.py --update
- Index the documents:
python cybersamatha.py --index
Start a conversation with your cybersecurity knowledge base:
python cybersamatha.py
Ask a specific question:
python cybersamatha.py --question "What are the latest Apache vulnerabilities?"
Use --quiet flag for faster execution:
# Cross-platform
python cybersamatha.py --question "What is XSS?" --quiet
# With ASCII banner
python cybersamatha.py --question "What is XSS?" --quiet --banner
# Quick launchers
./ask.sh "What is XSS?" # Linux/Mac
.\ask.ps1 "What is XSS?" # Windows PowerShell
ask.bat "What is XSS?" # Windows CMD
Update the vector database with all documents:
python cybersamatha.py --index --force
Remove large Git pack files to save disk space:
python cleanup_storage.py --status # Show current sizes
python cleanup_storage.py --all # Full cleanup (~5.9 GB freed)
python cleanup_storage.py --temp # Remove temp packs only
python cleanup_storage.py --all --keep-handbooks # Keep handbooks updatable
Update all cybersecurity data repositories:
python update_data.py --update # Update all sources
python update_data.py --update --cleanup # Update + optimize storage
python update_data.py --status # Check current status
View current data status:
python update_data.py --status
Add to crontab for daily updates (Linux/Mac):
0 2 * * * cd /path/to/cybersamantha && python update_data.py --update --cleanup
Or use Task Scheduler on Windows:
# Create a scheduled task
$action = New-ScheduledTaskAction -Execute "python" -Argument "update_data.py --update --cleanup" -WorkingDirectory "C:\path\to\cybersamantha"
$trigger = New-ScheduledTaskTrigger -Daily -At 2am
Register-ScheduledTask -TaskName "CyberSamantha Update" -Action $action -Trigger $trigger
cybersamatha/
├── cybersamatha.py # Main RAG system
├── update_data.py # Data updater
├── data/ # Local documents and fetched data
│ ├── handbooks/ # Cybersecurity handbooks
│ ├── exploits/ # Exploit database
│ ├── advisories/ # Security advisories
│ └── nvdcve/ # CVE database
├── chroma_db/ # Vector database (auto-generated)
├── .env # Environment variables
└── README.md
Create a .env file with:
GEMINI_API_KEY=your_gemini_api_key
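For illustration, a `.env` file like the one above could be read and validated with the standard library alone. This is a sketch under assumptions: the project may well use a library such as python-dotenv instead, and `parse_env_file` and `require_api_key` are hypothetical helpers.

```python
from pathlib import Path
from typing import Dict

# Placeholder values used in this README; treated as "not configured".
PLACEHOLDERS = {"your_gemini_api_key", "your_api_key_here"}


def parse_env_file(path: Path) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values: Dict[str, str] = {}
    # utf-8-sig drops a byte-order mark, which some Windows tools prepend.
    for raw in path.read_text(encoding="utf-8-sig").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def require_api_key(path: Path = Path(".env")) -> str:
    """Return GEMINI_API_KEY from the file or raise a descriptive error."""
    if not path.exists():
        raise RuntimeError(f"{path} not found; create it with GEMINI_API_KEY=<your key>")
    key = parse_env_file(path).get("GEMINI_API_KEY", "")
    if not key or key in PLACEHOLDERS:
        raise RuntimeError("GEMINI_API_KEY is missing or still set to a placeholder")
    return key
```

Failing early with a clear message is a reasonable design choice here, because a missing key would otherwise surface only at the first API call.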
Customize config.yaml to control:
- Data sources: Enable/disable repos (handbooks, exploits, advisories, nvdcve)
- Storage: Auto-cleanup Git history to save disk space
- RAG settings: Embedding model, Gemini model, chunk sizes
Example to keep only handbooks (saves ~5.9 GB):
data_sources:
  handbooks:
    enabled: true
  exploits:
    enabled: false # Disable large repos
  advisories:
    enabled: false
  nvdcve:
    enabled: false
Add your own documents to the data/ directory in any supported format. The system will automatically index them.
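Picking up custom documents from data/ could look roughly like the sketch below, which walks the directory and keeps only files whose extension matches a format this README lists as supported. The function name and the extension set as a constant are assumptions for illustration, not the project's actual indexing code.

```python
from pathlib import Path
from typing import List

# Extensions taken from the formats listed in this README.
SUPPORTED_EXTENSIONS = {
    ".pdf", ".docx", ".pptx", ".json", ".yaml", ".yml", ".txt", ".md",
}


def find_documents(data_dir: Path) -> List[Path]:
    """Recursively collect files with a supported extension (case-insensitive)."""
    return sorted(
        p for p in data_dir.rglob("*")
        if p.is_file() and p.suffix.lower() in SUPPORTED_EXTENSIONS
    )


if __name__ == "__main__":
    for doc in find_documents(Path("data")):
        print(doc)
```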
- Text: TXT, MD
- Documents: PDF, DOCX, PPTX
- Data: JSON, YAML, YML
- AI Provider: Google Gemini
- Vector Database: ChromaDB
- Embeddings: Google Generative AI Embeddings
- Vulnerability Research: Quick access to CVE details and exploits
- Security Advisory Lookup: Find relevant security advisories
- Procedure Reference: Access security handbooks and guides
- Incident Response: Rapid information retrieval during security incidents
- Security Training: Learn from comprehensive cybersecurity documentation
- Missing API Key
  - Ensure GEMINI_API_KEY is set in .env file
- Document Indexing Failures
  - Check file permissions in data directory
  - Verify supported file formats
- Update Failures
  - Ensure git is installed and accessible
  - Check network connectivity to repositories
- Use --quiet flag for faster queries (skips verbose output)
- Keep only handbooks enabled (6.89 MB) for minimal storage
- The system only reindexes changed files by default
- Vector database persists between sessions
- First query is slower (model loading), subsequent queries are fast
- Use quick launcher scripts for one-off queries
- Stay in interactive mode for multiple questions
For detailed performance optimization, see PERFORMANCE.md
Contributions are welcome! Please feel free to submit pull requests or open issues for bugs and feature requests.
- Google Gemini AI for language model capabilities
- ChromaDB for vector storage
- All data source maintainers for their valuable cybersecurity content
For issues and questions:
- Open an issue on GitHub
- Check existing documentation
- Review troubleshooting section
CyberSamatha - Your intelligent cybersecurity knowledge companion


