This project was developed as an assignment for Skyclad Ventures to make structured datasets (CSV, Excel, JSON, and PDFs with tables) conversational and easy to query. It implements a Retrieval-Augmented Generation (RAG) pipeline tailored for tabular data, preserving row-level precision while enabling natural-language questions over your tables.
- A FastAPI backend that ingests datasets, extracts structure, chunks data into multiple "views", and stores vectors in ChromaDB.
- A Streamlit frontend for quick uploads and conversational querying.
- A LangExtract-based structure extractor (configured to use `LANGEXTRACT_API_KEY`) that creates LLM-friendly metadata and schema descriptions.
- A Python-first PDF table extraction path (pdfplumber + PyMuPDF + optional Camelot) so the system doesn't require Java by default.
- UTF-8 safe output and robust error handling so extraction won't fail the whole workflow on Windows encoding issues.
```mermaid
graph TB
    U[User] --> Q[Query Classification]
    Q --> T{Query Type}
    T -->|new data| L[Load & Validate]
    T -->|existing data| R[Retrieve Context]
    L --> E[Structure Extraction - LangExtract]
    E --> C[Chunking - schema, row, stats, relationships]
    C --> V[Embedding & ChromaDB]
    V --> R
    R --> H[Hybrid Search - vector + metadata]
    H --> G[LLM Generation - Gemini]
    G --> O[Formatted Answer]
```
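
To make the "Chunking" stage concrete, here is a minimal sketch of how a dataset might be split into the four views the diagram names. The function and field names are hypothetical (not the repo's actual API); only the four view types come from the project itself:

```python
import pandas as pd

def build_chunk_views(df: pd.DataFrame, dataset_name: str) -> list[dict]:
    """Illustrative only: the four chunk 'views' the pipeline indexes."""
    chunks = []

    # Schema view: one chunk describing columns and dtypes.
    schema = ", ".join(f"{col} ({dtype})" for col, dtype in df.dtypes.items())
    chunks.append({"view": "schema", "text": f"{dataset_name} columns: {schema}"})

    # Row view: one chunk per row, preserving row-level precision.
    for i, (_, row) in enumerate(df.iterrows()):
        text = "; ".join(f"{col}={val}" for col, val in row.items())
        chunks.append({"view": "row", "text": text, "row_index": i})

    # Statistics view: summary stats over numeric columns.
    numeric = df.select_dtypes("number")
    if not numeric.empty:
        chunks.append({"view": "stats", "text": numeric.describe().to_string()})

        # Relationships view: pairwise correlations as a simple relationship signal.
        if numeric.shape[1] >= 2:
            chunks.append({"view": "relationships", "text": numeric.corr().to_string()})

    return chunks
```

Indexing the same table under several views lets the retriever match a schema question to the schema chunk and a lookup question to an individual row chunk, instead of forcing one chunk granularity to serve every query.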
- Create and activate the conda env or venv you prefer. Example with conda:

  ```bash
  conda activate langgraph
  ```

- Install dependencies (if you haven't already):

  ```bash
  pip install -r requirements.txt
  ```

- Create a `.env` file with your keys. At minimum I use:

  ```
  GOOGLE_API_KEY=your-gemini-api-key
  LANGEXTRACT_API_KEY=your-langextract-key
  ```

- Start the API and UI (from the repo root, in PowerShell):

  ```powershell
  uvicorn src.api.routes:app --host 0.0.0.0 --port 8000 --reload
  streamlit run src/ui/streamlit_app.py
  ```
Notes:
- If you want Tabula/Camelot features for PDF tables, you may need Java (for Tabula) or Ghostscript (for Camelot). I intentionally default to a pure-Python pipeline so Java is optional.
- On Windows, I force UTF-8 when saving extraction outputs to avoid encoding failures; the system also creates a safe JSON fallback if needed (sketched below).
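
A minimal sketch of that write-with-fallback pattern; the helper name and file layout here are hypothetical, not the repo's actual function:

```python
import json
from pathlib import Path

def save_extraction_output(path: Path, html: str, data: dict) -> Path:
    """Hypothetical helper: write HTML as UTF-8, fall back to sanitized JSON."""
    try:
        path.write_text(html, encoding="utf-8")
        return path
    except (OSError, UnicodeError):
        # ensure_ascii=True escapes non-ASCII characters, so this write
        # succeeds even under strict Windows code pages and ingestion continues.
        fallback = path.with_suffix(".json")
        fallback.write_text(json.dumps(data, ensure_ascii=True), encoding="ascii")
        return fallback
```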
- Upload CSV/Excel/JSON or PDF files via the Streamlit UI or the `/upload` API endpoint.
- Let the ingestion complete (PDF and LangExtract steps can take longer than simple CSV ingestion). I increased UI timeouts to accommodate long-running PDF/table extraction.
- Query the dataset with natural language. The system classifies your query and selects the best chunking view (schema, row-level, statistics, or relationships) to retrieve relevant context. A sketch of driving both steps through the API follows this list.
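
For reference, here is a hedged example of exercising the API directly with `requests`. Only the `/upload` endpoint is named above; the `/query` path, the JSON payloads, and the response fields are assumptions for illustration:

```python
import requests

API = "http://localhost:8000"

# Upload a dataset through the /upload endpoint mentioned above.
with open("sales.csv", "rb") as f:
    resp = requests.post(f"{API}/upload", files={"file": ("sales.csv", f)})
resp.raise_for_status()
dataset_id = resp.json().get("dataset_id")  # response shape is an assumption

# Ask a natural-language question; the /query path and payload are hypothetical.
answer = requests.post(
    f"{API}/query",
    json={"dataset_id": dataset_id, "question": "Which region had the highest Q3 revenue?"},
    timeout=300,  # long timeout: PDF/LangExtract-backed ingestion and retrieval can be slow
)
print(answer.json())
```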
- Environment variables (in `.env`):
  - `GOOGLE_API_KEY`: Gemini API key
  - `LANGEXTRACT_API_KEY`: LangExtract key (also set programmatically by the extractor)
  - `CHROMA_PERSIST_DIRECTORY`: where Chroma stores vectors
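
A minimal sketch of loading these at startup, assuming `python-dotenv`; the fallback Chroma path shown is illustrative, not the repo's default:

```python
import os
from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv()  # read .env from the working directory

google_key = os.environ["GOOGLE_API_KEY"]            # Gemini calls
langextract_key = os.environ["LANGEXTRACT_API_KEY"]  # structure extraction
chroma_dir = os.getenv("CHROMA_PERSIST_DIRECTORY", "./chroma_db")  # fallback path is an assumption
```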
- PDF extraction: I try pdfplumber first, then fall back to PyMuPDF for text-block heuristics, then to a text-pattern table parser. Camelot/Tabula are optional. (A sketch of this fallback chain follows the list.)
- LangExtract integration: the extractor ensures `LANGEXTRACT_API_KEY` is visible to the LangExtract library and passes the key explicitly when calling the extract method (also sketched below).
- Encoding: all HTML/visualization files are written with UTF-8; if that fails on a filesystem, a sanitized JSON fallback is written and the ingestion continues.
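
A condensed sketch of the PDF fallback chain. The `pdfplumber` and PyMuPDF calls are real library APIs; the surrounding function and the `parse_tables_from_text` helper are hypothetical stand-ins for the repo's actual code:

```python
import pdfplumber
import fitz  # PyMuPDF

def extract_pdf_tables(path: str) -> list:
    """Illustrative fallback chain; the repo's real extraction code differs."""
    # 1) pdfplumber: structured table extraction, pure Python, no Java needed.
    try:
        with pdfplumber.open(path) as pdf:
            tables = [t for page in pdf.pages for t in page.extract_tables()]
        if tables:
            return tables
    except Exception:
        pass  # fall through to the next strategy

    # 2) PyMuPDF: pull text blocks, then hand them to a text-pattern parser.
    doc = fitz.open(path)
    blocks = [b[4] for page in doc for b in page.get_text("blocks")]
    return parse_tables_from_text("\n".join(blocks))  # hypothetical helper
```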
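
And a sketch of the LangExtract key handling, assuming the library's documented `lx.extract` entry point; the prompt, few-shot example, and model id are placeholders, and the exact signature should be checked against the LangExtract docs:

```python
import os
import langextract as lx

def describe_structure(table_text: str, api_key: str):
    """Sketch of the key handling above; details may differ from the repo."""
    # Make the key visible to the library via the env var it expects.
    os.environ["LANGEXTRACT_API_KEY"] = api_key

    # One minimal few-shot example, per LangExtract's documented API.
    examples = [
        lx.data.ExampleData(
            text="id,name\n1,Alice",
            extractions=[lx.data.Extraction(extraction_class="column", extraction_text="id")],
        )
    ]

    return lx.extract(
        text_or_documents=table_text,
        prompt_description="Extract column names and describe the table schema.",
        examples=examples,
        model_id="gemini-2.5-flash",  # placeholder model id
        api_key=api_key,              # passed explicitly as well, as the note describes
    )
```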