This project was developed as an assignment for Skyclad Ventures to make structured datasets (CSV, Excel, JSON, and PDFs with tables) conversational and easy to query. It implements a Retrieval-Augmented Generation (RAG) pipeline tailored for tabular data, preserving row-level precision while enabling natural-language questions over your tables.
- A FastAPI backend that ingests datasets, extracts structure, chunks data into multiple "views", and stores vectors in ChromaDB.
- A Streamlit frontend for quick uploads and conversational querying.
- A LangExtract-based structure extractor (configured to use `LANGEXTRACT_API_KEY`) that creates LLM-friendly metadata and schema descriptions.
- A Python-first PDF table extraction path (pdfplumber + PyMuPDF + optional Camelot) so the system doesn't require Java by default.
- UTF-8 safe output and robust error handling so extraction won't fail the whole workflow on Windows encoding issues.
```mermaid
graph TB
    U[User] --> Q[Query Classification]
    Q --> T{Query Type}
    T -->|new data| L[Load & Validate]
    T -->|existing data| R[Retrieve Context]
    L --> E[Structure Extraction - LangExtract]
    E --> C[Chunking - schema, row, stats, relationships]
    C --> V[Embedding & ChromaDB]
    V --> R
    R --> H[Hybrid Search - vector + metadata]
    H --> G[LLM Generation - Gemini]
    G --> O[Formatted Answer]
```
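
To make the "Chunking" stage concrete, here is a minimal sketch of how a dataset might be split into the four views the diagram names. The function and field names are hypothetical (not the repo's actual API); only the four view types come from the project itself:

```python
import pandas as pd

def build_chunk_views(df: pd.DataFrame, dataset_name: str) -> list[dict]:
    """Illustrative only: the four chunk 'views' the pipeline indexes."""
    chunks = []

    # Schema view: one chunk describing columns and dtypes.
    schema = ", ".join(f"{col} ({dtype})" for col, dtype in df.dtypes.items())
    chunks.append({"view": "schema", "text": f"{dataset_name} columns: {schema}"})

    # Row view: one chunk per row, preserving row-level precision.
    for i, (_, row) in enumerate(df.iterrows()):
        text = "; ".join(f"{col}={val}" for col, val in row.items())
        chunks.append({"view": "row", "text": text, "row_index": i})

    # Statistics view: summary stats over numeric columns.
    numeric = df.select_dtypes("number")
    if not numeric.empty:
        chunks.append({"view": "stats", "text": numeric.describe().to_string()})

        # Relationships view: pairwise correlations as a simple relationship signal.
        if numeric.shape[1] >= 2:
            chunks.append({"view": "relationships", "text": numeric.corr().to_string()})

    return chunks
```

Indexing the same table under several views lets the retriever match a schema question to the schema chunk and a lookup question to an individual row chunk, instead of forcing one chunk granularity to serve every query.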
- Create and activate the conda env or venv you prefer. Example with conda:

  ```bash
  conda activate langgraph
  ```

- Install dependencies (if you haven't already):

  ```bash
  pip install -r requirements.txt
  ```

- Create a `.env` file with your keys. At minimum I use:

  ```
  GOOGLE_API_KEY=your-gemini-api-key
  LANGEXTRACT_API_KEY=your-langextract-key
  ```

- Start the API and UI (from the repo root, in PowerShell):

  ```powershell
  uvicorn src.api.routes:app --host 0.0.0.0 --port 8000 --reload
  streamlit run src/ui/streamlit_app.py
  ```
Notes:
- If you want Tabula/Camelot features for PDF tables, you may need Java (for Tabula) or Ghostscript (for Camelot). I intentionally default to a pure-Python pipeline so Java is optional.
- On Windows, I force UTF-8 when saving extraction outputs to avoid encoding failures; the system also creates a safe JSON fallback if needed (sketched below).
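
A minimal sketch of that write-with-fallback pattern; the helper name and file layout here are hypothetical, not the repo's actual function:

```python
import json
from pathlib import Path

def save_extraction_output(path: Path, html: str, data: dict) -> Path:
    """Hypothetical helper: write HTML as UTF-8, fall back to sanitized JSON."""
    try:
        path.write_text(html, encoding="utf-8")
        return path
    except (OSError, UnicodeError):
        # ensure_ascii=True escapes non-ASCII characters, so this write
        # succeeds even under strict Windows code pages and ingestion continues.
        fallback = path.with_suffix(".json")
        fallback.write_text(json.dumps(data, ensure_ascii=True), encoding="ascii")
        return fallback
```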
- Upload CSV/Excel/JSON or PDF files via the Streamlit UI or the `/upload` API endpoint.
- Let the ingestion complete (PDF and LangExtract steps can take longer than simple CSV ingestion). I increased UI timeouts to accommodate long-running PDF/table extraction.
- Query the dataset with natural language. The system classifies your query and selects the best chunking view (schema, row-level, statistics, or relationships) to retrieve relevant context. A sketch of driving both steps through the API follows this list.
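
For reference, here is a hedged example of exercising the API directly with `requests`. Only the `/upload` endpoint is named above; the `/query` path, the JSON payloads, and the response fields are assumptions for illustration:

```python
import requests

API = "http://localhost:8000"

# Upload a dataset through the /upload endpoint mentioned above.
with open("sales.csv", "rb") as f:
    resp = requests.post(f"{API}/upload", files={"file": ("sales.csv", f)})
resp.raise_for_status()
dataset_id = resp.json().get("dataset_id")  # response shape is an assumption

# Ask a natural-language question; the /query path and payload are hypothetical.
answer = requests.post(
    f"{API}/query",
    json={"dataset_id": dataset_id, "question": "Which region had the highest Q3 revenue?"},
    timeout=300,  # long timeout: PDF/LangExtract-backed ingestion and retrieval can be slow
)
print(answer.json())
```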
- Environment variables (in `.env`):
  - `GOOGLE_API_KEY`: Gemini API key
  - `LANGEXTRACT_API_KEY`: LangExtract key (also set programmatically by the extractor)
  - `CHROMA_PERSIST_DIRECTORY`: where Chroma stores vectors
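
A minimal sketch of loading these at startup, assuming `python-dotenv`; the fallback Chroma path shown is illustrative, not the repo's default:

```python
import os
from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv()  # read .env from the working directory

google_key = os.environ["GOOGLE_API_KEY"]            # Gemini calls
langextract_key = os.environ["LANGEXTRACT_API_KEY"]  # structure extraction
chroma_dir = os.getenv("CHROMA_PERSIST_DIRECTORY", "./chroma_db")  # fallback path is an assumption
```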
- PDF extraction: I try pdfplumber first, then fall back to PyMuPDF for text-block heuristics, then to a text-pattern table parser. Camelot/Tabula are optional. (A sketch of this fallback chain follows the list.)
- LangExtract integration: the extractor ensures `LANGEXTRACT_API_KEY` is visible to the LangExtract library and passes the key explicitly when calling the extract method (also sketched below).
- Encoding: all HTML/visualization files are written with UTF-8; if that fails on a filesystem, a sanitized JSON fallback is written and the ingestion continues.
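
A condensed sketch of the PDF fallback chain. The `pdfplumber` and PyMuPDF calls are real library APIs; the surrounding function and the `parse_tables_from_text` helper are hypothetical stand-ins for the repo's actual code:

```python
import pdfplumber
import fitz  # PyMuPDF

def extract_pdf_tables(path: str) -> list:
    """Illustrative fallback chain; the repo's real extraction code differs."""
    # 1) pdfplumber: structured table extraction, pure Python, no Java needed.
    try:
        with pdfplumber.open(path) as pdf:
            tables = [t for page in pdf.pages for t in page.extract_tables()]
        if tables:
            return tables
    except Exception:
        pass  # fall through to the next strategy

    # 2) PyMuPDF: pull text blocks, then hand them to a text-pattern parser.
    doc = fitz.open(path)
    blocks = [b[4] for page in doc for b in page.get_text("blocks")]
    return parse_tables_from_text("\n".join(blocks))  # hypothetical helper
```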
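
And a sketch of the LangExtract key handling, assuming the library's documented `lx.extract` entry point; the prompt, few-shot example, and model id are placeholders, and the exact signature should be checked against the LangExtract docs:

```python
import os
import langextract as lx

def describe_structure(table_text: str, api_key: str):
    """Sketch of the key handling above; details may differ from the repo."""
    # Make the key visible to the library via the env var it expects.
    os.environ["LANGEXTRACT_API_KEY"] = api_key

    # One minimal few-shot example, per LangExtract's documented API.
    examples = [
        lx.data.ExampleData(
            text="id,name\n1,Alice",
            extractions=[lx.data.Extraction(extraction_class="column", extraction_text="id")],
        )
    ]

    return lx.extract(
        text_or_documents=table_text,
        prompt_description="Extract column names and describe the table schema.",
        examples=examples,
        model_id="gemini-2.5-flash",  # placeholder model id
        api_key=api_key,              # passed explicitly as well, as the note describes
    )
```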