
The RAG Structured Data System is a conversational interface over structured datasets, built with LangExtract. Unlike traditional RAG systems that work with unstructured text, it is designed for tabular data from Excel, CSV, and JSON sources, preserving row-level granularity and precision.


Kaos599/RAG-Structured-Data-Chat


A RAG system for structured data

This project was developed as an assignment for Skyclad Ventures to make structured datasets (CSV, Excel, JSON, and PDFs with tables) conversational and easy to query. It implements a Retrieval-Augmented Generation (RAG) pipeline tailored for tabular data, preserving row-level precision while enabling natural-language questions over your tables.

What I built

  • A FastAPI backend that ingests datasets, extracts structure, chunks data into multiple "views", and stores vectors in ChromaDB.
  • A Streamlit frontend for quick uploads and conversational querying.
  • A LangExtract-based structure extractor (configured to use LANGEXTRACT_API_KEY) that creates LLM-friendly metadata and schema descriptions.
  • A Python-first PDF table extraction path (pdfplumber + PyMuPDF + optional Camelot) so the system doesn't require Java by default.
  • UTF-8 safe output and robust error handling so extraction won't fail the whole workflow on Windows encoding issues.
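To make the "multiple views" idea concrete, here is a minimal sketch of multi-view chunking. The function name, chunk formats, and view labels are illustrative assumptions, not the project's actual code:

```python
import statistics

def build_chunks(rows, table_name="table"):
    """Produce three simple 'views' of a table as text chunks:
    a schema chunk, one chunk per row, and a numeric-statistics chunk."""
    columns = list(rows[0].keys())
    chunks = []
    # Schema view: column names and inferred Python types.
    types = {c: type(rows[0][c]).__name__ for c in columns}
    chunks.append(("schema", f"{table_name} columns: " +
                   ", ".join(f"{c} ({t})" for c, t in types.items())))
    # Row view: one self-describing chunk per row, preserving row-level precision.
    for i, row in enumerate(rows):
        chunks.append(("row", f"{table_name} row {i}: " +
                       "; ".join(f"{c}={row[c]}" for c in columns)))
    # Statistics view: min/mean/max for each numeric column.
    for c in columns:
        values = [r[c] for r in rows if isinstance(r[c], (int, float))]
        if values:
            chunks.append(("stats", f"{table_name}.{c}: min={min(values)}, "
                           f"mean={statistics.mean(values):.2f}, max={max(values)}"))
    return chunks

rows = [{"city": "Pune", "pop": 3.1}, {"city": "Agra", "pop": 1.6}]
for kind, text in build_chunks(rows, "cities"):
    print(kind, "|", text)
```

Each chunk would then be embedded and stored in ChromaDB with its view label as metadata, so retrieval can target the view best suited to a query.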

System overview

```mermaid
graph TB
  U[User] --> Q[Query Classification]
  Q --> T{Query Type}
  T -->|new data| L[Load & Validate]
  T -->|existing data| R[Retrieve Context]
  L --> E[Structure Extraction - LangExtract]
  E --> C[Chunking - schema, row, stats, relationships]
  C --> V[Embedding & ChromaDB]
  V --> R
  R --> H[Hybrid Search - vector + metadata]
  H --> G[LLM Generation - Gemini]
  G --> O[Formatted Answer]
```

Quick start (Windows PowerShell)

  1. Create and activate the conda/env or venv you prefer. Example with conda:

    conda activate langgraph

  2. Install dependencies (if you haven't already):

    pip install -r requirements.txt

  3. Create a .env file with your keys. At minimum I use:

    GOOGLE_API_KEY=your-gemini-api-key
    LANGEXTRACT_API_KEY=your-langextract-key

  4. Start the API and UI (from the repo root, in PowerShell):

    # Start the backend

    uvicorn src.api.routes:app --host 0.0.0.0 --port 8000 --reload

    # In a new terminal, start the UI

    streamlit run src/ui/streamlit_app.py

Notes:

  • If you want Tabula/Camelot features for PDF tables you may need Java (for Tabula) or Ghostscript (for Camelot). I intentionally default to a pure-Python pipeline so Java is optional.
  • On Windows, I force UTF-8 when saving extraction outputs to avoid encoding failures; the system also creates a safe JSON fallback if needed.
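The UTF-8 write with a JSON fallback can be sketched as follows. The function name and exact fallback format are assumptions for illustration; the key idea is that an encoding failure downgrades the output rather than aborting ingestion:

```python
import json
from pathlib import Path

def save_extraction(html: str, data: dict, out_path: Path) -> Path:
    """Write the HTML visualization as UTF-8; if that fails (e.g. a
    platform encoding issue), write a sanitized JSON fallback instead
    so the ingestion workflow can continue."""
    try:
        out_path.write_text(html, encoding="utf-8")
        return out_path
    except (OSError, UnicodeError):
        fallback = out_path.with_suffix(".json")
        # ensure_ascii=True escapes all non-ASCII characters, so this
        # write succeeds regardless of the platform's default encoding.
        fallback.write_text(json.dumps(data, ensure_ascii=True), encoding="ascii")
        return fallback
```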

How I recommend using it

  • Upload CSV/Excel/JSON or PDF files via the Streamlit UI or the /upload API endpoint.
  • Let the ingestion complete (PDF and LangExtract steps can take longer than simple CSV ingestion). I increased UI timeouts to accommodate long-running PDF/table extraction.
  • Query the dataset with natural language. The system classifies your query and selects the best chunking view (schema, row-level, statistics, or relationships) to retrieve relevant context.
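As a rough stand-in for the query-classification step (the real system uses an LLM; this keyword heuristic and its view names are only illustrative), routing a question to a chunking view might look like:

```python
def pick_view(query: str) -> str:
    """Tiny keyword heuristic mapping a question to a chunking view --
    a stand-in for the system's LLM-based query classification."""
    q = query.lower()
    if any(w in q for w in ("average", "mean", "max", "min", "sum", "count")):
        return "statistics"
    if any(w in q for w in ("column", "schema", "field", "type")):
        return "schema"
    if any(w in q for w in ("related", "join", "between", "link")):
        return "relationships"
    return "row"  # default: row-level retrieval preserves precision

print(pick_view("What is the average revenue?"))        # -> statistics
print(pick_view("Which columns does this table have?")) # -> schema
```

The returned label would then be used as a metadata filter on the vector store, restricting retrieval to chunks from the matching view.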

Configuration highlights

  • Environment variables (in .env):
    • GOOGLE_API_KEY — Gemini API key
    • LANGEXTRACT_API_KEY — LangExtract key (also set programmatically by the extractor)
    • CHROMA_PERSIST_DIRECTORY — where Chroma stores vectors
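A minimal settings loader for the variables above might look like this (the helper name and default Chroma directory are assumptions; the project may load its configuration differently, e.g. via python-dotenv):

```python
import os

def load_settings() -> dict:
    """Read the required and optional environment variables listed above.
    Raises KeyError if a required key is missing."""
    return {
        "google_api_key": os.environ["GOOGLE_API_KEY"],          # required
        "langextract_api_key": os.environ["LANGEXTRACT_API_KEY"],  # required
        # optional, with an assumed default persist location
        "chroma_dir": os.getenv("CHROMA_PERSIST_DIRECTORY", "./chroma_db"),
    }
```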

Implementation notes

  • PDF extraction: I try pdfplumber first, then fall back to PyMuPDF for text-block heuristics, then to a text-pattern table parser. Camelot/Tabula are optional.
  • LangExtract integration: the extractor ensures LANGEXTRACT_API_KEY is visible to the LangExtract library and passes the key explicitly when calling the extract method.
  • Encoding: all HTML/visualization files are written with UTF-8; if that fails on a filesystem, a sanitized JSON fallback is written and the ingestion continues.
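The PDF fallback chain above boils down to "try each backend in priority order, skip failures, return the first non-empty result". A generic sketch of that control flow, using stub callables in place of the real pdfplumber/PyMuPDF/pattern-parser backends:

```python
def extract_tables(pdf_path: str, extractors) -> list:
    """Run a chain of table extractors in priority order and return the
    first non-empty result. In the described pipeline the chain would be
    pdfplumber -> PyMuPDF text-block heuristics -> text-pattern parser."""
    for extractor in extractors:
        try:
            tables = extractor(pdf_path)
            if tables:           # empty result: try the next strategy
                return tables
        except Exception:        # a failing backend never aborts ingestion
            continue
    return []

# Stub extractors standing in for the real backends:
def flaky(path): raise RuntimeError("backend unavailable")
def empty(path): return []
def works(path): return [["col_a", "col_b"], ["1", "2"]]

print(extract_tables("report.pdf", [flaky, empty, works]))
# -> [['col_a', 'col_b'], ['1', '2']]
```

Catching broad exceptions here is deliberate: a crash in one optional backend should demote it, not fail the whole ingestion.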
