SagarManthan-RAG - Quick Start Guide

SagarManthan-RAG is a Retrieval-Augmented Generation (RAG) system for marine scientific data analysis. It enables scientists to ask natural language questions about marine biodiversity, CTD, AWS, and ADCP data, and receive intelligent, data-driven answers.

Overview

SagarManthan-RAG uses:

  • Supabase Storage: Stores marine data files (Parquet files for occurrences, .HDR/.asc for CTD, .txt for AWS/ADCP)
  • ChromaDB: Vector database for semantic search using embeddings
  • Google Gemini: Generates embeddings and answers questions via its LLM
  • FastAPI: REST API backend with comprehensive query support
  • Python parsers (parsers.py): Specialized parsers for CTD, AWS, and ADCP data formats

Prerequisites Checklist

  • Python 3.8+ installed
  • Supabase account with storage bucket processed-data containing marine data files (Parquet, CTD, AWS, ADCP)
  • Google Gemini API key (get from Google AI Studio)
  • Virtual environment (recommended)

Step-by-Step Setup

1. Navigate to Project Directory

cd RAG/SagarManthan-RAG

2. Create and Activate Virtual Environment

python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

3. Install Dependencies

pip install -r requirements.txt

4. Configure Environment Variables

Create a .env file in the SagarManthan-RAG/ directory:

cp .env.example .env  # If .env.example exists, or create manually

Edit .env and add your credentials:

# Supabase Configuration
SUPABASE_URL=https://your-project.supabase.co
SUPABASE_KEY=your-service-role-key-here
STORAGE_BUCKET=processed-data

# Google Gemini API
GEMINI_API_KEY=your-gemini-api-key-here

# ChromaDB Configuration (optional)
CHROMA_PERSIST_DIR=chroma_db_marine
CHROMA_COLLECTION=occurrence_embeddings

# LLM Configuration (optional)
LLM_MODEL=gemini-2.5-flash
EMBED_MODEL=models/text-embedding-004

Note: All scripts automatically load settings from the .env file. You don't need to export environment variables manually.
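Under the hood, that loading step amounts to something like the following. This is a dependency-free sketch for illustration; the actual scripts may use the python-dotenv package instead, and the variable names simply mirror the .env keys listed above:

```python
import os

def load_dotenv(path: str = ".env") -> None:
    """Minimal .env loader: KEY=value lines, '#' comments and blanks ignored.
    Uses setdefault so variables already exported in the shell win."""
    if not os.path.exists(path):
        return
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            os.environ.setdefault(key.strip(), value.strip())

load_dotenv()
SUPABASE_URL = os.environ.get("SUPABASE_URL", "")
STORAGE_BUCKET = os.environ.get("STORAGE_BUCKET", "processed-data")
```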

5. Import Data from Supabase

Import marine data from Supabase storage bucket:

python import_occurrences_supabase.py --output occurrences.json

This script:

  • Downloads marine data files from Supabase storage bucket processed-data
  • Auto-detects data types from file extensions and content:
    • .parquet files → Occurrence data
    • .HDR + .asc pairs → CTD data (automatically grouped and processed together)
    • .txt files with AWS patterns ($GPS, $MET, $SST) → AWS data
    • .txt files with ADCP patterns (Ens, Eas, Nor, Mag) → ADCP data
  • Processes and parses each data type using specialized parsers from parsers.py
  • Converts to unified JSON format for embedding generation
  • Handles column name variations automatically (e.g., scientificName, scientific_name, species)
  • Saves all marine data records with dataType field for proper categorization

Options:

  • --file-pattern: Filter files by pattern (e.g., 'occurrence', 'ctd')
  • --limit: Limit number of records to process (0 = all)
  • --output, -o: Output JSON file path (default: occurrences.json)
  • --files: Manually specify file names to download (e.g., --files 'occurrence.parquet' 'occurrence (1).parquet')
  • --supabase-url: Supabase project URL (defaults to SUPABASE_URL from .env)
  • --supabase-key: Supabase service role key (defaults to SUPABASE_KEY from .env)
  • --bucket: Storage bucket name (defaults to STORAGE_BUCKET from .env or processed-data)

Note: The script automatically groups CTD files (.HDR header + .asc data pairs) and processes them together.
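The detection rules above can be sketched roughly as follows. Function and constant names here are hypothetical; the real logic lives in import_occurrences_supabase.py and parsers.py and may differ in detail:

```python
from pathlib import Path

# Marker strings taken from the rules listed above (hypothetical sketch).
AWS_MARKERS = ("$GPS", "$MET", "$SST")
ADCP_MARKERS = ("Ens", "Eas", "Nor", "Mag")

def detect_data_type(name: str, sample_text: str = "") -> str:
    """Map a file name (and, for .txt files, a content sample) to a data type."""
    suffix = Path(name).suffix.lower()
    if suffix == ".parquet":
        return "OCCURRENCE"
    if suffix in (".hdr", ".asc"):
        return "CTD"  # .HDR/.asc pairs are grouped and parsed together
    if suffix == ".txt":
        if any(m in sample_text for m in AWS_MARKERS):
            return "AWS"
        if any(m in sample_text for m in ADCP_MARKERS):
            return "ADCP"
    return "UNKNOWN"
```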

6. Generate Embeddings

Create vector embeddings for all marine data records:

python marine_embeddings_supabase.py --input occurrences.json

What this does:

  • Reads marine data from JSON file
  • Generates text representations for each record (type-specific):
    • Occurrence: Species, location, depth, water body, sampling method
    • CTD: Temperature, salinity, oxygen, depth profiles
    • AWS: Sea surface temperature, wind speed, air temperature
    • ADCP: Current speeds, velocity profiles, depth bins
  • Creates embeddings using Google Gemini embedding model
  • Stores in ChromaDB with stable, content-based IDs (prevents duplicates)
  • Skips existing records by default (safe to re-run)
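The stable, content-based IDs mentioned above can be illustrated with a small sketch (hypothetical; the actual ID scheme in marine_embeddings_supabase.py may differ). Hashing a canonical serialization of the record yields the same ID on every run, which is what makes re-running the script safe:

```python
import hashlib
import json

def stable_id(record: dict) -> str:
    """Deterministic ID derived from record content: identical records
    always hash to the same ID, so re-imports cannot create duplicates."""
    canonical = json.dumps(record, sort_keys=True, default=str)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:32]
```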

Options:

  • --input, -i: Input JSON file (default: occurrences.json)
  • --limit: Process only N records (0 = all, useful for testing)
  • --skip-existing: Skip records already in ChromaDB (default: True)
  • --force: Force regeneration of all embeddings (disables skip-existing)
  • --batch-size: Batch size for processing (default: 100)
  • --chroma-persist-dir: ChromaDB persistence directory (defaults to CHROMA_PERSIST_DIR from .env or chroma_db_marine)
  • --chroma-collection: ChromaDB collection name (defaults to CHROMA_COLLECTION from .env or occurrence_embeddings)

For testing with a smaller dataset:

python marine_embeddings_supabase.py --input occurrences.json --limit 100

To clear and regenerate all embeddings:

python clear_embeddings.py  # Removes ChromaDB directory
python marine_embeddings_supabase.py --input occurrences.json --force

7. Start the RAG API Server

python marine_rag_supabase_main.py

Or using uvicorn directly:

uvicorn marine_rag_supabase_main:app --host 0.0.0.0 --port 8000 --reload

You should see:

INFO:     Uvicorn running on http://0.0.0.0:8000
📊 ChromaDB collection 'occurrence_embeddings' has X documents
✅ ChromaDB is working - test query returned 1 result(s)
Loading occurrences from Supabase storage...

8. Test the API

Open a new terminal and test:

# Health check
curl http://localhost:8000/health

# Ask a question about occurrences
curl "http://localhost:8000/query?question=What%20species%20are%20found%20in%20the%20Indian%20Ocean?"

# Query with filters
curl "http://localhost:8000/query?question=unique%20species%20in%20Arabian%20Sea&water_body=Arabian%20Sea&min_depth=300&max_depth=600"

# Query CTD data
curl "http://localhost:8000/query?question=temperature%20profiles%20in%20Indian%20Ocean&data_types=CTD"
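The same queries can be issued from Python using only the standard library. This is a client sketch, not part of the project; the base URL assumes the local deployment from step 7:

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "http://localhost:8000"  # assumed local deployment

def build_query_url(question: str, **filters) -> str:
    """Build a /query URL; filters map to the documented query parameters
    (water_body, min_depth, max_depth, data_types, ...)."""
    params = {"question": question,
              **{k: v for k, v in filters.items() if v is not None}}
    return BASE_URL + "/query?" + urllib.parse.urlencode(params)

def ask(question: str, **filters) -> dict:
    with urllib.request.urlopen(build_query_url(question, **filters),
                                timeout=120) as resp:
        return json.loads(resp.read().decode("utf-8"))

# Example (requires the server from step 7 to be running):
# result = ask("unique species in Arabian Sea", water_body="Arabian Sea",
#              min_depth=300, max_depth=600)
```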

9. Use in SAGAR Dashboard

  1. Make sure the RAG API is running (step 7)
  2. Start the SAGAR frontend:
    cd ../../SAGAR
    npm start
  3. Navigate to the Globe View
  4. Use the "AI-Powered Analysis" textarea or bottom search bar to ask questions
  5. Select data types (OCCURRENCE, CTD, AWS, ADCP) or leave empty for auto-detection

Supported Data Types

SagarManthan-RAG supports multiple marine data types:

  • OCCURRENCE: Species occurrence records (biodiversity data)
  • CTD: Conductivity-Temperature-Depth profiles
  • AWS: Automated Weather Station data
  • ADCP: Acoustic Doppler Current Profiler data

The system auto-detects data types from queries, or you can specify them explicitly.

Example Questions to Try

Occurrence Queries

  • "What unique species are found in the Indian Ocean between 350-600 meters?"
    • Auto-extracts: water_body=Indian Ocean, min_depth=350, max_depth=600, deduplicates by species
  • "Which species live at depths between 300 and 500 meters in the Arabian Sea?"
    • Auto-extracts: water_body=Arabian Sea, depth range 300-500m
  • "What sampling methods were used in the Andaman Sea?"
    • Auto-extracts: water_body=Andaman Sea, focuses on sampling protocols
  • "Show me occurrences of deep sea species in the Bay of Bengal"
    • Auto-extracts: water_body=Bay of Bengal, searches for deep sea species
  • "occurrence of unique marine species within the Indian Ocean, specifically focusing on the 350 to 600-meter depth range"
    • Complex query with multiple auto-extracted parameters

CTD Queries

  • "What are the temperature profiles in the Indian Ocean?"
    • Auto-detects: data_type=CTD, water_body=Indian Ocean
  • "Show me salinity data from the Arabian Sea"
    • Auto-detects: data_type=CTD, water_body=Arabian Sea, parameter=salinity
  • "What is the oxygen concentration at different depths?"
    • Auto-detects: data_type=CTD, parameter=oxygen

AWS Queries

  • "What are the sea surface temperatures in the Indian Ocean?"
    • Auto-detects: data_type=AWS, water_body=Indian Ocean, parameter=SST
  • "Show me wind speed data from recent measurements"
    • Auto-detects: data_type=AWS, parameter=wind speed
  • "What are the weather conditions in the Arabian Sea?"
    • Auto-detects: data_type=AWS, water_body=Arabian Sea

ADCP Queries

  • "What are the current speeds in the Indian Ocean?"
    • Auto-detects: data_type=ADCP, water_body=Indian Ocean
  • "Show me ocean current velocity profiles"
    • Auto-detects: data_type=ADCP, focuses on velocity profiles
  • "What are the flow patterns in the Bay of Bengal?"
    • Auto-detects: data_type=ADCP, water_body=Bay of Bengal

Multi-Data Type Queries

  • "Show me all marine data from the Indian Ocean"
    • Searches across all data types (OCCURRENCE, CTD, AWS, ADCP)
  • "What environmental conditions and species are found in the Arabian Sea?"
    • Combines occurrence and environmental data (CTD/AWS)

Key Features

Query Processing

  • Natural Language Queries: Ask questions in plain English; no complex syntax is needed
  • Automatic Parameter Extraction: Intelligently extracts depth ranges, water bodies, and data types from natural language
    • Depth ranges: "350 to 600-meter depth range" → automatically extracts min_depth=350, max_depth=600
    • Water bodies: "Indian Ocean" → automatically filters to Indian Ocean records
    • Data types: "CTD temperature profiles" → automatically detects CTD data type
  • Multi-Data Type Support: Query across Occurrence, CTD, AWS, and ADCP data simultaneously or individually
  • Auto-Detection: If data types are not specified, the system detects them from query keywords
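The extraction described above can be approximated with a few regular expressions. This is a hypothetical sketch, not the project's actual parsing code:

```python
import re

WATER_BODIES = ["Indian Ocean", "Arabian Sea", "Bay of Bengal", "Andaman Sea"]

def extract_params(question: str) -> dict:
    """Rough sketch of parameter extraction from a natural-language query."""
    params = {}
    # Matches "350 to 600-meter", "350-600 meters", "between 300 and 500 meters"
    m = re.search(r"(\d+)\s*(?:to|-|and)\s*(\d+)[\s-]*met", question, re.I)
    if m:
        params["min_depth"], params["max_depth"] = int(m.group(1)), int(m.group(2))
    for wb in WATER_BODIES:
        if wb.lower() in question.lower():
            params["water_body"] = wb
            break
    if re.search(r"\b(ctd|temperature profile|salinity)", question, re.I):
        params["data_types"] = "CTD"
    if re.search(r"\b(unique|distinct)\b", question, re.I):
        params["unique_species"] = True
    return params
```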

Data Processing

  • Unique Species Deduplication: When querying for "unique" or "distinct" species, automatically deduplicates results by scientific name
    • Keeps best match (highest similarity score) for each unique species
    • Returns both total count (before deduplication) and unique count (after deduplication)
  • Intelligent Depth Filtering: Handles depth ranges correctly
    • For occurrence data: Checks if record's depth range overlaps with query range
    • For CTD data: Filters by single depth values
    • Supports both min/max depth ranges and single depth values
  • Case-Insensitive Filtering: Water body and species name matching is case-insensitive
    • "Arabian Sea", "arabian sea", "ARABIAN SEA" all match correctly
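The rules above, depth-range overlap and best-match deduplication, can be sketched as follows (hypothetical code; note that a lower similarity_score means a closer match, as documented under the /query endpoint):

```python
def overlaps(rec_min: float, rec_max: float, q_min: float, q_max: float) -> bool:
    """Occurrence-style depth filter: two ranges overlap iff each one
    starts at or before the other one ends."""
    return rec_min <= q_max and q_min <= rec_max

def dedupe_by_species(results: list) -> list:
    """Keep the best match (lowest similarity_score, i.e. smallest distance)
    per scientific name, matching names case-insensitively."""
    best = {}
    for r in results:
        key = str(r.get("scientificName", "")).lower()
        if key not in best or r["similarity_score"] < best[key]["similarity_score"]:
            best[key] = r
    return list(best.values())
```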

Research-Focused Design

  • All Results by Default: Returns ALL matching records (no artificial limits) for comprehensive research analysis
  • Complete Dataset: Research dashboard receives full dataset for proper visualization and analysis
  • Transparent Results: Shows both total occurrences and unique species counts
  • Source Tracking: Each result includes similarity scores and source information

AI-Powered Analysis

  • Comprehensive Answers: Uses Gemini LLM to generate detailed, scientific answers based on retrieved data
  • Dashboard Summaries: Generates structured summaries with:
    • Executive summary
    • Key findings (bullet points)
    • Species analysis
    • Geographic distribution
    • Depth analysis
    • Temporal patterns
    • Research insights
  • Type-Specific Prompts: Customizes LLM prompts based on data type (CTD, AWS, ADCP, Occurrence) for more accurate answers
  • Graceful Error Handling: If rate limits are hit, still returns data with fallback messages

Performance & Reliability

  • Deduplication: Prevents duplicate embeddings using content-based stable IDs
  • Caching: In-memory cache for fast data access
  • Persistent Storage: ChromaDB persists embeddings across restarts
  • Rate Limit Handling: Gracefully handles API rate limits with informative messages

API Endpoints

GET /health

Health check - shows number of cached records and ChromaDB status.

GET /query

Answer scientific questions using RAG.

Parameters:

  • question (required): The scientific question
  • water_body (optional): Filter by water body
  • scientific_name (optional): Filter by scientific name
  • min_depth (optional): Minimum depth in meters
  • max_depth (optional): Maximum depth in meters
  • data_types (optional): Comma-separated data types (OCCURRENCE,CTD,AWS,ADCP,ALL)
  • top_k (optional): Limit results (None = return all matching)
  • similarity_threshold (optional): Similarity threshold for filtering

Response includes:

  • query: The original question
  • answer: Generated answer from LLM (comprehensive, scientific response)
  • relevant_occurrences: ALL matching records (not limited - complete dataset for research)
  • sources_count: Total records before deduplication (if unique species requested)
  • dashboard_summary: Structured summary with:
    • executive_summary: 2-3 sentence overview
    • key_findings: Array of 5 key findings
    • species_analysis: Detailed species analysis
    • geographic_distribution: Geographic patterns
    • depth_analysis: Depth-related insights
    • temporal_patterns: Time-based patterns
    • research_insights: Research methods and contributions
  • took_ms: Query processing time in milliseconds

Note: Each occurrence in relevant_occurrences includes:

  • All original data fields
  • similarity_score: Relevance score (lower = more relevant)
  • dataType: Type of record (OCCURRENCE, CTD, AWS, ADCP)
  • content: Text representation used for embedding

GET /search

Search marine data with filters (similar to /query but returns raw results without LLM-generated answer).

Parameters: Same as /query endpoint

Response:

  • query: Search query text
  • filters: Applied filters
  • candidates: Total matching records
  • results: All matching records
  • took_ms: Processing time

POST /reload

Reload data from Supabase storage (clears cache and reloads from bucket).

curl -X POST "http://localhost:8000/reload"

Troubleshooting

"GEMINI_API_KEY not set"

  • Create a .env file with GEMINI_API_KEY=your-key
  • Verify the API key is valid and has sufficient quota
  • Check rate limits (free tier: 10 requests/minute)

"Supabase credentials not set"

  • Create .env file with SUPABASE_URL and SUPABASE_KEY
  • Use service role key (not anon key) for storage access
  • Verify bucket name is correct (default: processed-data)

"No occurrence data available" or "No marine data available"

  • Run import_occurrences_supabase.py first to create the JSON file
  • Ensure Supabase credentials are correct and bucket is accessible
  • Check that marine data files exist in the bucket (Parquet files, CTD .HDR/.asc pairs, AWS/ADCP .txt files)
  • Verify file names match expected patterns (e.g., CTD files need matching .HDR and .asc files)

"ChromaDB collection is empty"

  • Run python marine_embeddings_supabase.py to generate embeddings
  • Check that occurrences.json exists and contains data
  • Verify ChromaDB directory has write permissions

"Rate limit exceeded" (429 errors)

  • Free tier Gemini API: 10 requests/minute per model
  • What happens: System gracefully handles rate limits:
    • Still returns ALL matching data records
    • Provides fallback answer message if LLM generation fails
    • Skips dashboard summary generation if rate limited
    • Logs informative messages about rate limit status
  • Solutions:
    • Wait 30-60 seconds between queries
    • Consider upgrading API tier for higher limits
    • The system will automatically retry with exponential backoff
    • Data is still returned even if LLM calls fail
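The retry behaviour can be sketched as a generic exponential-backoff wrapper. This is illustrative only; the project's actual retry logic and error matching may differ:

```python
import random
import time

def with_backoff(call, max_retries: int = 4, base_delay: float = 2.0):
    """Retry `call` on rate-limit errors, sleeping base_delay * 2**attempt
    plus jitter between attempts; re-raise other errors and give up after
    max_retries attempts."""
    for attempt in range(max_retries):
        try:
            return call()
        except Exception as exc:  # real code would match 429 responses precisely
            if "429" not in str(exc) or attempt == max_retries - 1:
                raise
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))
```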

"No results found"

  • Check that data was imported: curl http://localhost:8000/health
  • Verify embeddings were generated
  • Try a more general query without filters
  • Check ChromaDB has documents: Look for "ChromaDB collection has X documents" in startup logs

CORS errors in browser

  • The API already has CORS enabled for all origins
  • If issues persist, check the browser console for specific errors

File Structure

SagarManthan-RAG/
├── marine_rag_supabase_main.py      # Main API server
├── import_occurrences_supabase.py   # Import from Supabase → JSON
├── marine_embeddings_supabase.py    # Generate embeddings from JSON
├── parsers.py                       # Data type parsers (CTD, AWS, ADCP)
├── clear_embeddings.py              # Utility to clear ChromaDB
├── occurrences.json                 # Imported data (created)
├── occurrences_cache.json           # API cache (created)
├── chroma_db_marine/                # ChromaDB persistence directory
├── requirements.txt                 # Python dependencies
├── .env                             # Environment variables (create this)
├── SUPABASE_SETUP.md                # Detailed Supabase setup guide
├── MARINE_RAG_README.md             # Full documentation
├── Rag_process.txt                  # RAG system explanation
└── EXTENSION_PLAN.md                # Multi-data type extension plan

How It Works

Query Processing Flow

  1. Query Received: User asks a natural language question
  2. Parameter Extraction: System automatically extracts:
    • Depth ranges (e.g., "350-600 meters" → min_depth=350, max_depth=600)
    • Water bodies (e.g., "Indian Ocean" → water_body filter)
    • Data types (e.g., "CTD temperature" → data_type=CTD)
    • Unique species request (e.g., "unique species" → deduplication flag)
  3. Data Type Detection: If not specified, detects from query keywords
  4. Vector Search: Searches ChromaDB for semantically similar records
  5. Scalar Filtering: Applies filters (water body, depth, species, data type)
  6. Deduplication: If "unique" requested, deduplicates by scientific name
  7. LLM Generation: Generates answer and dashboard summary from results
  8. Response: Returns ALL matching records + generated insights

Data Flow

Supabase Storage (Parquet, CTD .HDR/.asc, AWS .txt, ADCP .txt files)
    ↓
import_occurrences_supabase.py
    ↓
occurrences.json (Unified JSON format with dataType field)
    ↓
marine_embeddings_supabase.py
    ↓
ChromaDB (Vector embeddings with stable IDs)
    ↓
marine_rag_supabase_main.py (API Server)
    ↓
User Query → Hybrid Search → Results + LLM Answer
    ↓
SAGAR Dashboard (Visualization & Analysis)

Next Steps

  • Read the full documentation in MARINE_RAG_README.md
  • Check SUPABASE_SETUP.md for detailed Supabase configuration
  • Review Rag_process.txt to understand the RAG architecture in detail
  • Explore EXTENSION_PLAN.md to see how multi-data type support was implemented
  • Customize filters and queries for your research needs
  • Add more data sources as needed
  • Integrate with other SAGAR modules

Support

For issues or questions, refer to:

  • MARINE_RAG_README.md for detailed documentation
  • SUPABASE_SETUP.md for Supabase-specific setup
  • Rag_process.txt for RAG system architecture
  • The main SAGAR project documentation
  • Check API logs for error messages

License

Same as the main SAGAR project.


SagarManthan-RAG - Empowering marine research with intelligent data analysis by SAGAR.
