A Python-based RAG (Retrieval-Augmented Generation) system for document processing and vector search.
- PDF and text document loading with LangChain
- Document chunking and embedding generation
- Vector storage using ChromaDB and FAISS
- Support for multiple document formats (PDF, TXT)

Dependencies:
- langchain & langchain-community
- chromadb
- faiss-cpu
- sentence-transformers
- pymupdf & pypdf

Install the dependencies:

    pip install -r requirements.txt

Documents are stored in the data/ directory:
- data/pdf/ - PDF files
- data/text_files/ - Text files
- data/vector_store/ - ChromaDB vector storage
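
The directory layout above can be prepared and inspected with a few lines of standard-library Python. This is a minimal sketch, not code from the notebooks: the helper names `ensure_data_layout` and `list_documents` are hypothetical, and the `*.pdf` / `*.txt` file patterns are assumptions based on the formats listed in this README.

```python
from pathlib import Path

# Subdirectories named in this README; the default root "data" is an assumption.
DATA_SUBDIRS = ("pdf", "text_files", "vector_store")


def ensure_data_layout(root="data"):
    """Create the data/ subdirectories if they are missing and return their paths."""
    root = Path(root)
    paths = {name: root / name for name in DATA_SUBDIRS}
    for path in paths.values():
        path.mkdir(parents=True, exist_ok=True)
    return paths


def list_documents(root="data"):
    """Return the PDF and text files found under pdf/ and text_files/."""
    root = Path(root)
    return {
        "pdf": sorted((root / "pdf").glob("*.pdf")),
        "text": sorted((root / "text_files").glob("*.txt")),
    }
```

Calling `ensure_data_layout()` once before running the notebooks avoids missing-directory errors, and `list_documents()` gives a quick view of what is available to load.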
Example notebooks demonstrating document loading, processing, and the RAG pipeline:
- document.ipynb - Basic document loading with text and PDF files
- pdf_loader.ipynb - Complete RAG pipeline with PDF processing, chunking, embeddings, and vector storage
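
The chunking, embedding, and vector storage stages mentioned above can be illustrated without any third-party packages. The sketch below is not the notebooks' code, and every name in it is hypothetical: `toy_embed` is a hashed bag-of-words stand-in for a real sentence-transformers model, and `ToyVectorStore` is a brute-force in-memory stand-in for ChromaDB or FAISS. The chunk size and overlap defaults are arbitrary assumptions.

```python
import hashlib
import math


def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into fixed-size character chunks that overlap by `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks


def toy_embed(text, dim=64):
    """Hashed bag-of-words vector, L2-normalized (stand-in for a real embedding model)."""
    vec = [0.0] * dim
    for word in text.lower().split():
        digest = hashlib.md5(word.encode("utf-8")).digest()
        vec[int.from_bytes(digest[:4], "big") % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


class ToyVectorStore:
    """In-memory store with brute-force cosine search (stand-in for ChromaDB or FAISS)."""

    def __init__(self):
        self._items = []  # list of (chunk, embedding) pairs

    def add(self, chunks):
        for chunk in chunks:
            self._items.append((chunk, toy_embed(chunk)))

    def search(self, query, k=3):
        """Return up to k (score, chunk) pairs, highest cosine similarity first."""
        q = toy_embed(query)
        scored = [
            (sum(a * b for a, b in zip(q, vec)), chunk)
            for chunk, vec in self._items
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]
```

A typical flow is `store.add(chunk_text(document_text))` followed by `store.search(question)`, with the returned chunks passed to a language model as context. In a real pipeline the toy pieces would be replaced by an actual embedding model and a persistent vector store.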