This project implements a Retrieval-Augmented Generation (RAG) pipeline for querying unstructured PDF documents.
It combines embeddings, vector search, and a large language model to return context-aware answers in real time.
**Note:** The bundled data is limited for now; it will be updated over time, and you can also add your own PDFs (see the database build step below).
- **Document Ingestion** (`core/data_loader.py`): Load and chunk PDF documents.
- **Embeddings** (`core/embedding_manager.py`): Generate 384-dim sentence embeddings with `all-MiniLM-L6-v2`.
- **Vector Store** (`core/vector_store.py`): Store and search embeddings using ChromaDB (HNSW indexing).
- **Retriever** (`core/retriever.py`): Fetch relevant context for queries (see the sketch after this list).
- **Pipeline** (`pipelines/rag_pipeline.py`): Combine the retriever + LLM (Google's `gemma2-9b-it`) for RAG responses.
- **Streamlit UI** (`main.py`): Simple, interactive interface for querying documents.
- **Configurable** (`config.py`): Centralized settings for model, database, and pipeline options.
- **Experiments** (`notebooks/rag_pipeline.ipynb`): Notebook for experiments and benchmarks.
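At query time the flow is: embed the question with `all-MiniLM-L6-v2`, look up the nearest chunks in the persisted ChromaDB index, and pass them to the LLM. Below is a minimal sketch of that retrieval step, assuming a collection name, store path, and top-k value for illustration; the actual settings live in `config.py` and `core/retriever.py`.

```python
# Hedged sketch of query-time retrieval against the persisted ChromaDB index.
# The collection name ("documents") and k=4 are assumptions, not the
# project's actual configuration.
import chromadb
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")            # 384-dim embeddings
client = chromadb.PersistentClient(path="data/vector_store")  # persisted index
collection = client.get_or_create_collection("documents")

def retrieve(query: str, k: int = 4) -> list[str]:
    """Embed the query and return the k most similar chunks."""
    query_embedding = embedder.encode([query]).tolist()
    results = collection.query(query_embeddings=query_embedding, n_results=k)
    return results["documents"][0]

print(retrieve("What does the report say about revenue?"))
```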
This project uses uv for Python package management. Make sure you have uv installed first:

```bash
pip install uv
```
Clone the repo and install dependencies:

```bash
git clone https://github.com/<your-username>/<repo-name>.git
cd <repo-name>
uv sync
```
Build the database (one-time setup):
- Upload your PDFs to the `data/pdf_files/` directory.
- Then run:

```bash
python main.py --build
```
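Conceptually, the build step loads every PDF under `data/pdf_files/`, splits it into overlapping chunks, embeds the chunks, and persists them to ChromaDB. The sketch below illustrates that flow using LangChain's loader and splitter; the chunk size, overlap, and collection name are assumptions, not the actual values used in `core/data_loader.py`.

```python
# Hedged sketch of what the --build step conceptually does:
# PDFs -> page documents -> overlapping chunks -> embeddings -> ChromaDB.
from pathlib import Path

import chromadb
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")
client = chromadb.PersistentClient(path="data/vector_store")
collection = client.get_or_create_collection("documents")
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)

for pdf in Path("data/pdf_files").glob("*.pdf"):
    pages = PyPDFLoader(str(pdf)).load()       # one Document per page
    chunks = splitter.split_documents(pages)   # overlap preserves context across chunks
    texts = [c.page_content for c in chunks]
    collection.add(
        ids=[f"{pdf.stem}-{i}" for i in range(len(texts))],
        documents=texts,
        embeddings=embedder.encode(texts).tolist(),
    )
```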
API setup:
- Get your API key for the `gemma2-9b-it` model from groq-api-keys.
- Create a `.env` file in the project root and assign your API key to `GROQ_API_KEY`.
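For reference, here is a minimal sketch of how the key can be read from `.env` and used to call `gemma2-9b-it` through the Groq SDK; the prompt wording is illustrative, not the template used in `pipelines/rag_pipeline.py`.

```python
# Hedged sketch: load GROQ_API_KEY from .env and query gemma2-9b-it via the
# Groq SDK. The system/user prompt below is an illustrative placeholder.
import os

from dotenv import load_dotenv
from groq import Groq

load_dotenv()  # reads .env from the project root
client = Groq(api_key=os.environ["GROQ_API_KEY"])

def answer(question: str, context: str) -> str:
    """Ask the LLM to answer using only the retrieved context."""
    response = client.chat.completions.create(
        model="gemma2-9b-it",
        messages=[
            {"role": "system", "content": "Answer using only the given context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content
```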
Start the Streamlit app locally:

```bash
streamlit run main.py
```

or open the deployed app: DocQuery.

Type your query and get context-aware answers.
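A minimal sketch of such a Streamlit front end is shown below; it reuses the hypothetical `retrieve` and `answer` helpers from the earlier sketches, which stand in for the project's actual retriever and pipeline classes.

```python
# Hedged sketch of the Streamlit front end: a text box for the question and
# the generated answer below it. retrieve() and answer() are the illustrative
# helpers defined in the sketches above, not the project's real interfaces.
import streamlit as st

st.title("DocQuery")
question = st.text_input("Ask a question about your documents")

if question:
    context = "\n\n".join(retrieve(question))   # top-k chunks from ChromaDB
    st.write(answer(question, context))         # context-grounded LLM response
```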
```
.
├── core/                      # Core components
│   ├── data_loader.py         # PDF loading + chunking
│   ├── embedding_manager.py   # Embedding generation
│   ├── retriever.py           # Context retrieval
│   └── vector_store.py        # ChromaDB integration
│
├── data/                      # Input and storage
│   ├── pdf_files/             # Source documents
│   └── vector_store/          # Persisted ChromaDB index
│
├── notebooks/
│   └── rag_pipeline.ipynb     # Experiments & benchmarks
│
├── pipelines/
│   └── rag_pipeline.py        # Full RAG pipeline logic
│
├── config.py                  # Global configs
├── main.py                    # Streamlit entry point
├── pyproject.toml             # uv dependencies
├── requirements.txt           # pip fallback
├── uv.lock                    # uv lock file
├── .gitignore
└── README.md
```
- Benchmark the retrieval strategies
- https://www.youtube.com/watch?v=fZM3oX4xEyg&list=PLZoTAELRMXVM8Pf4U67L4UuDRgV4TNX9D
- https://www.singlestore.com/blog/a-guide-to-retrieval-augmented-generation-rag/
- https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2
- https://python.langchain.com/docs/introduction/
- https://console.groq.com/docs/