SagarManthan-RAG - Quick Start Guide

SagarManthan-RAG is a Retrieval-Augmented Generation (RAG) system for marine scientific data analysis. It enables scientists to ask natural language questions about marine biodiversity, CTD, AWS, and ADCP data, and receive intelligent, data-driven answers.

Overview

SagarManthan-RAG uses:

  • Supabase Storage: Stores marine data files (Parquet files for occurrences, .HDR/.asc for CTD, .txt for AWS/ADCP)
  • ChromaDB: Vector database for semantic search using embeddings
  • Google Gemini: Generates embeddings and answers questions via its LLM
  • FastAPI: REST API backend with comprehensive query support
  • Python parsers (parsers.py): Specialized parsers for CTD, AWS, and ADCP data formats

Prerequisites Checklist

  • Python 3.8+ installed
  • Supabase account with storage bucket processed-data containing marine data files (Parquet, CTD, AWS, ADCP)
  • Google Gemini API key (get from Google AI Studio)
  • Virtual environment (recommended)

Step-by-Step Setup

1. Navigate to Project Directory

cd RAG/SagarManthan-RAG

2. Create and Activate Virtual Environment

python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

3. Install Dependencies

pip install -r requirements.txt

4. Configure Environment Variables

Create a .env file in the SagarManthan-RAG/ directory:

cp .env.example .env  # If .env.example exists, or create manually

Edit .env and add your credentials:

# Supabase Configuration
SUPABASE_URL=https://your-project.supabase.co
SUPABASE_KEY=your-service-role-key-here
STORAGE_BUCKET=processed-data

# Google Gemini API
GEMINI_API_KEY=your-gemini-api-key-here

# ChromaDB Configuration (optional)
CHROMA_PERSIST_DIR=chroma_db_marine
CHROMA_COLLECTION=occurrence_embeddings

# LLM Configuration (optional)
LLM_MODEL=gemini-2.5-flash
EMBED_MODEL=models/text-embedding-004

Note: All scripts automatically load settings from the .env file. You don't need to export environment variables manually.
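Under the hood, that loading step amounts to something like the following. This is a dependency-free sketch for illustration; the actual scripts may use the python-dotenv package instead, and the variable names simply mirror the .env keys listed above:

```python
import os

def load_dotenv(path: str = ".env") -> None:
    """Minimal .env loader: KEY=value lines, '#' comments and blanks ignored.
    Uses setdefault so variables already exported in the shell win."""
    if not os.path.exists(path):
        return
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            os.environ.setdefault(key.strip(), value.strip())

load_dotenv()
SUPABASE_URL = os.environ.get("SUPABASE_URL", "")
STORAGE_BUCKET = os.environ.get("STORAGE_BUCKET", "processed-data")
```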

5. Import Data from Supabase

Import marine data from Supabase storage bucket:

python import_occurrences_supabase.py --output occurrences.json

This script:

  • Downloads marine data files from Supabase storage bucket processed-data
  • Auto-detects data types from file extensions and content:
    • .parquet files → Occurrence data
    • .HDR + .asc pairs → CTD data (automatically grouped and processed together)
    • .txt files with AWS patterns ($GPS, $MET, $SST) → AWS data
    • .txt files with ADCP patterns (Ens, Eas, Nor, Mag) → ADCP data
  • Processes and parses each data type using specialized parsers from parsers.py
  • Converts to unified JSON format for embedding generation
  • Handles column name variations automatically (e.g., scientificName, scientific_name, species)
  • Saves all marine data records with dataType field for proper categorization

Options:

  • --file-pattern: Filter files by pattern (e.g., 'occurrence', 'ctd')
  • --limit: Limit number of records to process (0 = all)
  • --output, -o: Output JSON file path (default: occurrences.json)
  • --files: Manually specify file names to download (e.g., --files 'occurrence.parquet' 'occurrence (1).parquet')
  • --supabase-url: Supabase project URL (defaults to SUPABASE_URL from .env)
  • --supabase-key: Supabase service role key (defaults to SUPABASE_KEY from .env)
  • --bucket: Storage bucket name (defaults to STORAGE_BUCKET from .env or processed-data)

Note: The script automatically groups CTD files (.HDR header + .asc data pairs) and processes them together.
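The detection rules above can be sketched roughly as follows. Function and constant names here are hypothetical; the real logic lives in import_occurrences_supabase.py and parsers.py and may differ in detail:

```python
from pathlib import Path

# Marker strings taken from the rules listed above (hypothetical sketch).
AWS_MARKERS = ("$GPS", "$MET", "$SST")
ADCP_MARKERS = ("Ens", "Eas", "Nor", "Mag")

def detect_data_type(name: str, sample_text: str = "") -> str:
    """Map a file name (and, for .txt files, a content sample) to a data type."""
    suffix = Path(name).suffix.lower()
    if suffix == ".parquet":
        return "OCCURRENCE"
    if suffix in (".hdr", ".asc"):
        return "CTD"  # .HDR/.asc pairs are grouped and parsed together
    if suffix == ".txt":
        if any(m in sample_text for m in AWS_MARKERS):
            return "AWS"
        if any(m in sample_text for m in ADCP_MARKERS):
            return "ADCP"
    return "UNKNOWN"
```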

6. Generate Embeddings

Create vector embeddings for all marine data records:

python marine_embeddings_supabase.py --input occurrences.json

What this does:

  • Reads marine data from JSON file
  • Generates text representations for each record (type-specific):
    • Occurrence: Species, location, depth, water body, sampling method
    • CTD: Temperature, salinity, oxygen, depth profiles
    • AWS: Sea surface temperature, wind speed, air temperature
    • ADCP: Current speeds, velocity profiles, depth bins
  • Creates embeddings using Google Gemini embedding model
  • Stores in ChromaDB with stable, content-based IDs (prevents duplicates)
  • Skips existing records by default (safe to re-run)
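The stable, content-based IDs mentioned above can be illustrated with a small sketch (hypothetical; the actual ID scheme in marine_embeddings_supabase.py may differ). Hashing a canonical serialization of the record yields the same ID on every run, which is what makes re-running the script safe:

```python
import hashlib
import json

def stable_id(record: dict) -> str:
    """Deterministic ID derived from record content: identical records
    always hash to the same ID, so re-imports cannot create duplicates."""
    canonical = json.dumps(record, sort_keys=True, default=str)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:32]
```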

Options:

  • --input, -i: Input JSON file (default: occurrences.json)
  • --limit: Process only N records (0 = all, useful for testing)
  • --skip-existing: Skip records already in ChromaDB (default: True)
  • --force: Force regeneration of all embeddings (disables skip-existing)
  • --batch-size: Batch size for processing (default: 100)
  • --chroma-persist-dir: ChromaDB persistence directory (defaults to CHROMA_PERSIST_DIR from .env or chroma_db_marine)
  • --chroma-collection: ChromaDB collection name (defaults to CHROMA_COLLECTION from .env or occurrence_embeddings)

For testing with a smaller dataset:

python marine_embeddings_supabase.py --input occurrences.json --limit 100

To clear and regenerate all embeddings:

python clear_embeddings.py  # Removes ChromaDB directory
python marine_embeddings_supabase.py --input occurrences.json --force

7. Start the RAG API Server

python marine_rag_supabase_main.py

Or using uvicorn directly:

uvicorn marine_rag_supabase_main:app --host 0.0.0.0 --port 8000 --reload

You should see:

INFO:     Uvicorn running on http://0.0.0.0:8000
📊 ChromaDB collection 'occurrence_embeddings' has X documents
✅ ChromaDB is working - test query returned 1 result(s)
Loading occurrences from Supabase storage...

8. Test the API

Open a new terminal and test:

# Health check
curl http://localhost:8000/health

# Ask a question about occurrences
curl "http://localhost:8000/query?question=What%20species%20are%20found%20in%20the%20Indian%20Ocean?"

# Query with filters
curl "http://localhost:8000/query?question=unique%20species%20in%20Arabian%20Sea&water_body=Arabian%20Sea&min_depth=300&max_depth=600"

# Query CTD data
curl "http://localhost:8000/query?question=temperature%20profiles%20in%20Indian%20Ocean&data_types=CTD"
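The same queries can be issued from Python using only the standard library. This is a client sketch, not part of the project; the base URL assumes the local deployment from step 7:

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "http://localhost:8000"  # assumed local deployment

def build_query_url(question: str, **filters) -> str:
    """Build a /query URL; filters map to the documented query parameters
    (water_body, min_depth, max_depth, data_types, ...)."""
    params = {"question": question,
              **{k: v for k, v in filters.items() if v is not None}}
    return BASE_URL + "/query?" + urllib.parse.urlencode(params)

def ask(question: str, **filters) -> dict:
    with urllib.request.urlopen(build_query_url(question, **filters),
                                timeout=120) as resp:
        return json.loads(resp.read().decode("utf-8"))

# Example (requires the server from step 7 to be running):
# result = ask("unique species in Arabian Sea", water_body="Arabian Sea",
#              min_depth=300, max_depth=600)
```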

9. Use in SAGAR Dashboard

  1. Make sure the RAG API is running (step 7)
  2. Start the SAGAR frontend:
    cd ../../SAGAR
    npm start
  3. Navigate to the Globe View
  4. Use the "AI-Powered Analysis" textarea or bottom search bar to ask questions
  5. Select data types (OCCURRENCE, CTD, AWS, ADCP) or leave empty for auto-detection

Supported Data Types

SagarManthan-RAG supports multiple marine data types:

  • OCCURRENCE: Species occurrence records (biodiversity data)
  • CTD: Conductivity-Temperature-Depth profiles
  • AWS: Automated Weather Station data
  • ADCP: Acoustic Doppler Current Profiler data

The system auto-detects data types from queries, or you can specify them explicitly.

Example Questions to Try

Occurrence Queries

  • "What unique species are found in the Indian Ocean between 350-600 meters?"
    • Auto-extracts: water_body=Indian Ocean, min_depth=350, max_depth=600, deduplicates by species
  • "Which species live at depths between 300 and 500 meters in the Arabian Sea?"
    • Auto-extracts: water_body=Arabian Sea, depth range 300-500m
  • "What sampling methods were used in the Andaman Sea?"
    • Auto-extracts: water_body=Andaman Sea, focuses on sampling protocols
  • "Show me occurrences of deep sea species in the Bay of Bengal"
    • Auto-extracts: water_body=Bay of Bengal, searches for deep sea species
  • "occurrence of unique marine species within the Indian Ocean, specifically focusing on the 350 to 600-meter depth range"
    • Complex query with multiple auto-extracted parameters

CTD Queries

  • "What are the temperature profiles in the Indian Ocean?"
    • Auto-detects: data_type=CTD, water_body=Indian Ocean
  • "Show me salinity data from the Arabian Sea"
    • Auto-detects: data_type=CTD, water_body=Arabian Sea, parameter=salinity
  • "What is the oxygen concentration at different depths?"
    • Auto-detects: data_type=CTD, parameter=oxygen

AWS Queries

  • "What are the sea surface temperatures in the Indian Ocean?"
    • Auto-detects: data_type=AWS, water_body=Indian Ocean, parameter=SST
  • "Show me wind speed data from recent measurements"
    • Auto-detects: data_type=AWS, parameter=wind speed
  • "What are the weather conditions in the Arabian Sea?"
    • Auto-detects: data_type=AWS, water_body=Arabian Sea

ADCP Queries

  • "What are the current speeds in the Indian Ocean?"
    • Auto-detects: data_type=ADCP, water_body=Indian Ocean
  • "Show me ocean current velocity profiles"
    • Auto-detects: data_type=ADCP, focuses on velocity profiles
  • "What are the flow patterns in the Bay of Bengal?"
    • Auto-detects: data_type=ADCP, water_body=Bay of Bengal

Multi-Data Type Queries

  • "Show me all marine data from the Indian Ocean"
    • Searches across all data types (OCCURRENCE, CTD, AWS, ADCP)
  • "What environmental conditions and species are found in the Arabian Sea?"
    • Combines occurrence and environmental data (CTD/AWS)

Key Features

Query Processing

  • Natural Language Queries: Ask questions in plain English; no complex syntax is needed
  • Automatic Parameter Extraction: Intelligently extracts depth ranges, water bodies, and data types from natural language
    • Depth ranges: "350 to 600-meter depth range" → automatically extracts min_depth=350, max_depth=600
    • Water bodies: "Indian Ocean" → automatically filters to Indian Ocean records
    • Data types: "CTD temperature profiles" → automatically detects CTD data type
  • Multi-Data Type Support: Query across Occurrence, CTD, AWS, and ADCP data simultaneously or individually
  • Auto-Detection: If data types are not specified, the system detects them from query keywords
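The extraction described above can be approximated with a few regular expressions. This is a hypothetical sketch, not the project's actual parsing code:

```python
import re

WATER_BODIES = ["Indian Ocean", "Arabian Sea", "Bay of Bengal", "Andaman Sea"]

def extract_params(question: str) -> dict:
    """Rough sketch of parameter extraction from a natural-language query."""
    params = {}
    # Matches "350 to 600-meter", "350-600 meters", "between 300 and 500 meters"
    m = re.search(r"(\d+)\s*(?:to|-|and)\s*(\d+)[\s-]*met", question, re.I)
    if m:
        params["min_depth"], params["max_depth"] = int(m.group(1)), int(m.group(2))
    for wb in WATER_BODIES:
        if wb.lower() in question.lower():
            params["water_body"] = wb
            break
    if re.search(r"\b(ctd|temperature profile|salinity)", question, re.I):
        params["data_types"] = "CTD"
    if re.search(r"\b(unique|distinct)\b", question, re.I):
        params["unique_species"] = True
    return params
```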

Data Processing

  • Unique Species Deduplication: When querying for "unique" or "distinct" species, automatically deduplicates results by scientific name
    • Keeps best match (highest similarity score) for each unique species
    • Returns both total count (before deduplication) and unique count (after deduplication)
  • Intelligent Depth Filtering: Handles depth ranges correctly
    • For occurrence data: Checks if record's depth range overlaps with query range
    • For CTD data: Filters by single depth values
    • Supports both min/max depth ranges and single depth values
  • Case-Insensitive Filtering: Water body and species name matching is case-insensitive
    • "Arabian Sea", "arabian sea", "ARABIAN SEA" all match correctly
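The rules above, depth-range overlap and best-match deduplication, can be sketched as follows (hypothetical code; note that a lower similarity_score means a closer match, as documented under the /query endpoint):

```python
def overlaps(rec_min: float, rec_max: float, q_min: float, q_max: float) -> bool:
    """Occurrence-style depth filter: two ranges overlap iff each one
    starts at or before the other one ends."""
    return rec_min <= q_max and q_min <= rec_max

def dedupe_by_species(results: list) -> list:
    """Keep the best match (lowest similarity_score, i.e. smallest distance)
    per scientific name, matching names case-insensitively."""
    best = {}
    for r in results:
        key = str(r.get("scientificName", "")).lower()
        if key not in best or r["similarity_score"] < best[key]["similarity_score"]:
            best[key] = r
    return list(best.values())
```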

Research-Focused Design

  • All Results by Default: Returns ALL matching records (no artificial limits) for comprehensive research analysis
  • Complete Dataset: Research dashboard receives full dataset for proper visualization and analysis
  • Transparent Results: Shows both total occurrences and unique species counts
  • Source Tracking: Each result includes similarity scores and source information

AI-Powered Analysis

  • Comprehensive Answers: Uses Gemini LLM to generate detailed, scientific answers based on retrieved data
  • Dashboard Summaries: Generates structured summaries with:
    • Executive summary
    • Key findings (bullet points)
    • Species analysis
    • Geographic distribution
    • Depth analysis
    • Temporal patterns
    • Research insights
  • Type-Specific Prompts: Customizes LLM prompts based on data type (CTD, AWS, ADCP, Occurrence) for more accurate answers
  • Graceful Error Handling: If rate limits are hit, still returns data with fallback messages

Performance & Reliability

  • Deduplication: Prevents duplicate embeddings using content-based stable IDs
  • Caching: In-memory cache for fast data access
  • Persistent Storage: ChromaDB persists embeddings across restarts
  • Rate Limit Handling: Gracefully handles API rate limits with informative messages

API Endpoints

GET /health

Health check - shows number of cached records and ChromaDB status.

GET /query

Answer scientific questions using RAG.

Parameters:

  • question (required): The scientific question
  • water_body (optional): Filter by water body
  • scientific_name (optional): Filter by scientific name
  • min_depth (optional): Minimum depth in meters
  • max_depth (optional): Maximum depth in meters
  • data_types (optional): Comma-separated data types (OCCURRENCE,CTD,AWS,ADCP,ALL)
  • top_k (optional): Limit results (None = return all matching)
  • similarity_threshold (optional): Similarity threshold for filtering

Response includes:

  • query: The original question
  • answer: Generated answer from LLM (comprehensive, scientific response)
  • relevant_occurrences: ALL matching records (not limited - complete dataset for research)
  • sources_count: Total records before deduplication (if unique species requested)
  • dashboard_summary: Structured summary with:
    • executive_summary: 2-3 sentence overview
    • key_findings: Array of 5 key findings
    • species_analysis: Detailed species analysis
    • geographic_distribution: Geographic patterns
    • depth_analysis: Depth-related insights
    • temporal_patterns: Time-based patterns
    • research_insights: Research methods and contributions
  • took_ms: Query processing time in milliseconds

Note: Each occurrence in relevant_occurrences includes:

  • All original data fields
  • similarity_score: Relevance score (lower = more relevant)
  • dataType: Type of record (OCCURRENCE, CTD, AWS, ADCP)
  • content: Text representation used for embedding

GET /search

Search marine data with filters (similar to /query but returns raw results without LLM-generated answer).

Parameters: Same as /query endpoint

Response:

  • query: Search query text
  • filters: Applied filters
  • candidates: Total matching records
  • results: All matching records
  • took_ms: Processing time

POST /reload

Reload data from Supabase storage (clears cache and reloads from bucket).

curl -X POST "http://localhost:8000/reload"

Troubleshooting

"GEMINI_API_KEY not set"

  • Create a .env file with GEMINI_API_KEY=your-key
  • Verify the API key is valid and has sufficient quota
  • Check rate limits (free tier: 10 requests/minute)

"Supabase credentials not set"

  • Create .env file with SUPABASE_URL and SUPABASE_KEY
  • Use service role key (not anon key) for storage access
  • Verify bucket name is correct (default: processed-data)

"No occurrence data available" or "No marine data available"

  • Run import_occurrences_supabase.py first to create the JSON file
  • Ensure Supabase credentials are correct and bucket is accessible
  • Check that marine data files exist in the bucket (Parquet files, CTD .HDR/.asc pairs, AWS/ADCP .txt files)
  • Verify file names match expected patterns (e.g., CTD files need matching .HDR and .asc files)

"ChromaDB collection is empty"

  • Run python marine_embeddings_supabase.py to generate embeddings
  • Check that occurrences.json exists and contains data
  • Verify ChromaDB directory has write permissions

"Rate limit exceeded" (429 errors)

  • Free tier Gemini API: 10 requests/minute per model
  • What happens: System gracefully handles rate limits:
    • Still returns ALL matching data records
    • Provides fallback answer message if LLM generation fails
    • Skips dashboard summary generation if rate limited
    • Logs informative messages about rate limit status
  • Solutions:
    • Wait 30-60 seconds between queries
    • Consider upgrading API tier for higher limits
    • The system will automatically retry with exponential backoff
    • Data is still returned even if LLM calls fail
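The retry behaviour can be sketched as a generic exponential-backoff wrapper. This is illustrative only; the project's actual retry logic and error matching may differ:

```python
import random
import time

def with_backoff(call, max_retries: int = 4, base_delay: float = 2.0):
    """Retry `call` on rate-limit errors, sleeping base_delay * 2**attempt
    plus jitter between attempts; re-raise other errors and give up after
    max_retries attempts."""
    for attempt in range(max_retries):
        try:
            return call()
        except Exception as exc:  # real code would match 429 responses precisely
            if "429" not in str(exc) or attempt == max_retries - 1:
                raise
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))
```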

"No results found"

  • Check that data was imported: curl http://localhost:8000/health
  • Verify embeddings were generated
  • Try a more general query without filters
  • Check ChromaDB has documents: Look for "ChromaDB collection has X documents" in startup logs

CORS errors in browser

  • The API already has CORS enabled for all origins
  • If issues persist, check the browser console for specific errors

File Structure

SagarManthan-RAG/
├── marine_rag_supabase_main.py      # Main API server
├── import_occurrences_supabase.py   # Import from Supabase → JSON
├── marine_embeddings_supabase.py    # Generate embeddings from JSON
├── parsers.py                       # Data type parsers (CTD, AWS, ADCP)
├── clear_embeddings.py              # Utility to clear ChromaDB
├── occurrences.json                 # Imported data (created)
├── occurrences_cache.json           # API cache (created)
├── chroma_db_marine/                # ChromaDB persistence directory
├── requirements.txt                 # Python dependencies
├── .env                             # Environment variables (create this)
├── SUPABASE_SETUP.md                # Detailed Supabase setup guide
├── MARINE_RAG_README.md             # Full documentation
├── Rag_process.txt                  # RAG system explanation
└── EXTENSION_PLAN.md                # Multi-data type extension plan

How It Works

Query Processing Flow

  1. Query Received: User asks a natural language question
  2. Parameter Extraction: System automatically extracts:
    • Depth ranges (e.g., "350-600 meters" → min_depth=350, max_depth=600)
    • Water bodies (e.g., "Indian Ocean" → water_body filter)
    • Data types (e.g., "CTD temperature" → data_type=CTD)
    • Unique species request (e.g., "unique species" → deduplication flag)
  3. Data Type Detection: If not specified, detects from query keywords
  4. Vector Search: Searches ChromaDB for semantically similar records
  5. Scalar Filtering: Applies filters (water body, depth, species, data type)
  6. Deduplication: If "unique" requested, deduplicates by scientific name
  7. LLM Generation: Generates answer and dashboard summary from results
  8. Response: Returns ALL matching records + generated insights

Data Flow

Supabase Storage (Parquet, CTD .HDR/.asc, AWS .txt, ADCP .txt files)
    ↓
import_occurrences_supabase.py
    ↓
occurrences.json (Unified JSON format with dataType field)
    ↓
marine_embeddings_supabase.py
    ↓
ChromaDB (Vector embeddings with stable IDs)
    ↓
marine_rag_supabase_main.py (API Server)
    ↓
User Query → Hybrid Search → Results + LLM Answer
    ↓
SAGAR Dashboard (Visualization & Analysis)

Next Steps

  • Read the full documentation in MARINE_RAG_README.md
  • Check SUPABASE_SETUP.md for detailed Supabase configuration
  • Review Rag_process.txt to understand the RAG architecture in detail
  • Explore EXTENSION_PLAN.md to see how multi-data type support was implemented
  • Customize filters and queries for your research needs
  • Add more data sources as needed
  • Integrate with other SAGAR modules

Support

For issues or questions, refer to:

  • MARINE_RAG_README.md for detailed documentation
  • SUPABASE_SETUP.md for Supabase-specific setup
  • Rag_process.txt for RAG system architecture
  • The main SAGAR project documentation
  • Check API logs for error messages

License

Same as the main SAGAR project.


SagarManthan-RAG - Empowering marine research with intelligent data analysis by SAGAR.
