Agentic dataset discovery, profiling, and visualization. From query to analysis‑ready data in minutes.
Nexora is an AI‑native “data universe explorer” that turns a plain‑English query into curated, profiled, and visualizable datasets. It runs an agentic workflow to search Kaggle, download relevant columnar files (CSV/XLSX/JSON), profile them with pandas, and generate safe matplotlib plots via LLM codegen - all surfaced through a clean FastAPI backend and a React + Three.js frontend.
Demo video: Nexora.mp4
- Agentic pipeline (LangGraph + LangChain) orchestrating: search → download → profile → describe → plot
- Targeted discovery via Tavily + Kaggle API (columnar-first, dedup, file size caps)
- Profiling with pandas: row/column counts, dtype map, missingness; task fit inference
- Sandboxed plotting: GPT‑4o/4o‑mini → matplotlib, headless (Agg) in a restricted Python REPL, returns base64 PNG
- Durable storage: SQLite (WAL) with idempotent upserts and sensible indexes
- Production-friendly FastAPI with CORS for local dev
- Frontend: React/Vite + Three.js interactive results, instant previews, one‑click exports
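The search → download → profile → describe → plot ordering above can be sketched as a plain sequence of stage functions over a shared state. This is illustrative only: the real app wires these steps as LangGraph nodes in backend/agent.py, and `AgentState` and the stub stages here are assumptions, not the project's actual types.

```python
from dataclasses import dataclass, field

@dataclass
class AgentState:
    """Hypothetical shared state threaded through the pipeline stages."""
    query: str
    trace: list = field(default_factory=list)  # records stage order for illustration

# Stub stages; the real ones call Tavily, the Kaggle API, pandas, and an LLM.
def search(state):   state.trace.append("search");   return state
def download(state): state.trace.append("download"); return state
def profile(state):  state.trace.append("profile");  return state
def describe(state): state.trace.append("describe"); return state
def plot(state):     state.trace.append("plot");     return state

PIPELINE = [search, download, profile, describe, plot]

def run_agent(query: str) -> AgentState:
    """Run each stage in order over a fresh state."""
    state = AgentState(query=query)
    for step in PIPELINE:
        state = step(state)
    return state
```

A LangGraph build would register each function as a node and add edges in this order; the linear loop above is the simplest equivalent.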
Tech stack:
- Backend: FastAPI, LangGraph, LangChain, pandas, matplotlib, SQLite
- Tooling/Integrations: Tavily, Kaggle API, python‑dotenv
- Frontend: React, Vite, Three.js, React Router
Prerequisites:
- Python 3.12+
- Node.js 18+
- Kaggle credentials configured (~/.kaggle/kaggle.json)
- API keys in environment (.env):
- OPEN_API_KEY (OpenAI)
- TAVILY_API_KEY
- Backend setup

```bash
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
uvicorn backend.main:app --host 127.0.0.1 --port 8000
```

- Frontend setup (in frontend/)

```bash
cd frontend
npm install
npm run dev
```

Visit the app at http://127.0.0.1:5173.
How it works:
- Search: The agent queries Tavily for Kaggle dataset links and ranks for columnar relevance.
- Download: Datasets are fetched via Kaggle API with safe filenames and size limits.
- Profile: pandas computes rows/cols, dtypes, and missingness with resilient CSV/Excel/JSON parsing.
- Describe: An LLM writes a concise dataset description and infers task fit (classification/regression/etc.).
- Plot: GPT‑4o‑mini generates matplotlib code executed headless in a restricted Python REPL; images are returned as base64.
- Persist + Serve: Metadata stored in SQLite (WAL); FastAPI exposes endpoints for the frontend.
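The profiling step can be sketched with pandas. This computes the same fields the README names (row/column counts, a dtype map, per-column missingness); the function name and report shape are assumptions, not the project's actual schema.

```python
import io

import pandas as pd

def profile_dataframe(df: pd.DataFrame) -> dict:
    """Summarize a loaded dataset: size, column dtypes, and missing values."""
    return {
        "rows": int(df.shape[0]),
        "cols": int(df.shape[1]),
        "dtypes": {col: str(dtype) for col, dtype in df.dtypes.items()},
        "missing": df.isna().sum().to_dict(),  # NaN count per column
    }

# Tiny CSV with one missing value in each column.
csv = "price,beds\n100000,2\n,3\n250000,\n"
df = pd.read_csv(io.StringIO(csv))
report = profile_dataframe(df)
```

A report like this is enough for the describe step to infer task fit (e.g. a numeric target column with few missing values suggests regression).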
Architecture:

```
Frontend (React/Vite + Three.js)
        │
        ▼
FastAPI (backend/main.py)
 ├─ Agent pipeline (backend/agent.py)
 │   ├─ Tavily search → Kaggle download → pandas profile → LLM describe
 │   └─ LangGraph state machine orchestration
 ├─ Plot agent (backend/plot_agent.py)
 │   └─ GPT‑4o‑mini → matplotlib in restricted PythonREPL (Agg)
 └─ DB layer (backend/db.py, SQLite WAL)
```
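The plot agent's headless execution can be sketched as follows: force the Agg backend, run generated code against a minimal namespace, and return the figure as base64. The allow-list of builtins and the `ax`-only namespace are assumptions for illustration, not the project's actual sandbox policy.

```python
import base64
import io

import matplotlib
matplotlib.use("Agg")  # headless backend, no display required
import matplotlib.pyplot as plt

def run_plot_code(code: str) -> str:
    """Execute LLM-generated plotting code and return a base64 PNG."""
    fig, ax = plt.subplots()
    # Minimal namespace: only the axes object and a few safe builtins.
    namespace = {"ax": ax, "__builtins__": {"range": range, "len": len}}
    exec(code, namespace)  # no file, network, or import access exposed
    buf = io.BytesIO()
    fig.savefig(buf, format="png")
    plt.close(fig)
    return base64.b64encode(buf.getvalue()).decode("ascii")

png_b64 = run_plot_code("ax.bar(range(3), [1, 4, 2])")
```

Restricting `__builtins__` blocks `import` and `open` inside the generated code; a production sandbox would also want timeouts and output-size limits.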
Key endpoints:
- POST /run-agent — Run the pipeline for a query
- GET /datasets — List profiled datasets
- GET /dataset?source_url= — Get one dataset + files
- GET /file-preview — Sample rows for preview
- GET /download-file — Download a specific file
- GET /download-dataset-zip — Zip all available files
- POST /plot/suggestions — Heuristic plot prompts
- POST /plot/generate — LLM‑generated matplotlib plot
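The storage pattern behind these endpoints (SQLite in WAL mode with idempotent upserts and indexes) can be sketched with the standard library. The table name, columns, and example URL below are assumptions, not the schema in backend/db.py.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA journal_mode=WAL")  # better concurrent read/write behavior
conn.execute("""
    CREATE TABLE IF NOT EXISTS datasets (
        source_url TEXT PRIMARY KEY,
        title      TEXT,
        rows       INTEGER
    )
""")
conn.execute("CREATE INDEX IF NOT EXISTS idx_datasets_title ON datasets(title)")

def upsert_dataset(url: str, title: str, rows: int) -> None:
    """Insert or update a dataset row keyed on source_url (idempotent)."""
    conn.execute(
        """
        INSERT INTO datasets (source_url, title, rows) VALUES (?, ?, ?)
        ON CONFLICT(source_url) DO UPDATE SET
            title = excluded.title,
            rows  = excluded.rows
        """,
        (url, title, rows),
    )
    conn.commit()

upsert_dataset("https://kaggle.com/d/example", "Example", 100)
upsert_dataset("https://kaggle.com/d/example", "Example", 120)  # re-run: still one row
```

Re-running the agent for the same dataset then updates the profile in place instead of duplicating rows.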
Create a .env at the repo root:
```bash
OPEN_API_KEY=sk-...
TAVILY_API_KEY=tvly-...
```

Kaggle setup: ensure ~/.kaggle/kaggle.json exists and is readable.
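After python-dotenv loads the .env file, a fail-fast check like this catches missing keys before the pipeline runs. The helper is a sketch, not part of the codebase; the variable names match the README (this project uses OPEN_API_KEY, not OPENAI_API_KEY).

```python
import os

REQUIRED = ("OPEN_API_KEY", "TAVILY_API_KEY")

def check_env() -> dict:
    """Return the required keys, raising early if any are unset or empty."""
    missing = [key for key in REQUIRED if not os.getenv(key)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {key: os.environ[key] for key in REQUIRED}
```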
PRs and issues are welcome. Please open an issue to discuss significant changes.
MIT © 2025 Riyanshi Bohra — see LICENSE