A basic RAG pipeline that uses the Gemma 2 model to answer user queries with external knowledge stored in a vector database.

neehanthreddym/doc-query-rag

DocQuery: Retrieval-Augmented Generation Pipeline

Python Streamlit HuggingFace ChromaDB Groq uv

This project implements a Retrieval-Augmented Generation (RAG) pipeline for querying unstructured PDF documents.
It combines embeddings, vector search, and a large language model to return context-aware answers in real time. Note: The data is currently limited; in the future it will be updated, or you can

📊 Workflow

RAG Workflow

🚀 Features

  • Document Ingestion (core/data_loader.py): Load and chunk PDF documents.
  • Embeddings (core/embedding_manager.py): Generate 384-dim sentence embeddings with all-MiniLM-L6-v2.
  • Vector Store (core/vector_store.py): Store and search embeddings using ChromaDB (HNSW indexing).
  • Retriever (core/retriever.py): Fetch relevant context for queries.
  • Pipeline (pipelines/rag_pipeline.py): Combine retriever + LLM (Google’s gemma2-9b-it) for RAG responses.
  • Streamlit UI (main.py): Simple and interactive interface for querying documents.
  • Configurable (config.py): Centralized settings for model, database, and pipeline options.
  • Experiments (notebooks/rag_pipeline.ipynb).
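
To make the flow through these components concrete, here is a minimal, standard-library-only sketch of the chunk, embed, retrieve, and prompt steps. It is not the project's code: every function name is hypothetical, the bag-of-words `toy_embed` merely stands in for the all-MiniLM-L6-v2 embeddings, and the linear scan in `retrieve` stands in for ChromaDB's HNSW search. The chunk size, overlap, and prompt wording are likewise assumptions.

```python
import math
import re
from collections import Counter


def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character windows that overlap (assumed strategy)."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


def toy_embed(text: str) -> Counter:
    """Bag-of-words term counts; a toy stand-in for a real sentence embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, chunks: list[str], top_k: int = 2) -> list[str]:
    """Return the top_k chunks most similar to the query (brute-force scan)."""
    q = toy_embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, toy_embed(c)), reverse=True)
    return ranked[:top_k]


def build_prompt(query: str, context: list[str]) -> str:
    """Assemble the retrieved context and the question into one LLM prompt."""
    joined = "\n\n".join(context)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{joined}\n\n"
        f"Question: {query}"
    )


if __name__ == "__main__":
    docs = [
        "ChromaDB stores embeddings in an HNSW index.",
        "Streamlit renders the user interface.",
    ]
    question = "Where are embeddings stored?"
    print(build_prompt(question, retrieve(question, docs, top_k=1)))
```

In a real pipeline, the prompt returned by `build_prompt` would be sent to the LLM; overlapping chunks are a common way to avoid cutting a relevant sentence in half at a chunk boundary.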

⚙️ Setup

This project uses uv for Python package management.
Make sure you have uv installed first:

pip install uv

Clone the repo and install dependencies:

git clone https://github.com/<your-username>/<repo-name>.git
cd <repo-name>
uv sync

▶️ Usage

Build the database (this is a one-time setup):

  • Place your PDFs in the data/pdf_files directory.
  • Then run this command:
python main.py --build

API Setup:

  • Get an API key for the gemma2-9b-it model from groq-api-keys.
  • Create a .env file in your project root and assign your API key to GROQ_API_KEY.
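
For reference, the .env file would contain a single line of the form `GROQ_API_KEY=<your key>`. The sketch below shows one way such a file could be read and validated at startup using only the standard library. The helper names are hypothetical, and the project itself may load the file differently (for example with the python-dotenv package); treat this as an illustration rather than the project's implementation.

```python
import os
from pathlib import Path


def parse_env_file(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def get_groq_api_key(env: dict[str, str]) -> str:
    """Return GROQ_API_KEY from the given mapping, or fail with a clear error."""
    key = env.get("GROQ_API_KEY", "").strip()
    if not key:
        raise RuntimeError("GROQ_API_KEY is not set; add it to your .env file")
    return key


def load_groq_api_key(path: str = ".env") -> str:
    """Merge the .env file (if present) with the process environment."""
    file_values = {}
    env_path = Path(path)
    if env_path.exists():
        file_values = parse_env_file(env_path.read_text())
    # Real environment variables take precedence over the file.
    return get_groq_api_key({**file_values, **os.environ})
```

Failing early with an explicit message avoids a confusing authentication error later, when the first query is sent to the model.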

Start the Streamlit app locally:

streamlit run app.py

Or open the hosted app: DocQuery

Type your query, and get context-aware answers.

📂 Project Structure

.
├── core/                    # Core components
│   ├── data_loader.py       # PDF loading + chunking
│   ├── embedding_manager.py # Embedding generation
│   ├── retriever.py         # Context retrieval
│   └── vector_store.py      # ChromaDB integration
│
├── data/                    # Input and storage
│   ├── pdf_files/           # Source documents
│   └── vector_store/        # Persisted ChromaDB index
│
├── notebooks/
│   └── rag_pipeline.ipynb   # Experiments & benchmarks
│
├── pipelines/
│   └── rag_pipeline.py      # Full RAG pipeline logic
│
├── config.py                # Global configs
├── main.py                  # Streamlit entry point
├── pyproject.toml           # uv dependencies
├── requirements.txt         # pip fallback
├── uv.lock                  # uv lock file
├── .gitignore
└── README.md

Future work

  • Benchmark the retrieval strategies

Reference