SagarManthan-RAG is a Retrieval-Augmented Generation (RAG) system for marine scientific data analysis. It enables scientists to ask natural language questions about marine biodiversity, CTD, AWS, and ADCP data, and receive intelligent, data-driven answers.
SagarManthan-RAG uses:
- Supabase Storage: Stores marine data files (Parquet files for occurrences, .HDR/.asc for CTD, .txt for AWS/ADCP)
- ChromaDB: Vector database for semantic search using embeddings
- Google Gemini: For generating embeddings and answering questions using LLM
- FastAPI: REST API backend with comprehensive query support
- Python parsers (`parsers.py`): Specialized parsers for CTD, AWS, and ADCP data formats
- Python 3.8+ installed
- Supabase account with a storage bucket `processed-data` containing marine data files (Parquet, CTD, AWS, ADCP)
- Google Gemini API key (get one from Google AI Studio)
- Virtual environment (recommended)
```bash
cd RAG/SagarManthan-RAG
python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
pip install -r requirements.txt
```

Create a `.env` file in the `SagarManthan-RAG/` directory:

```bash
cp .env.example .env  # If .env.example exists, or create it manually
```

Edit `.env` and add your credentials:
```bash
# Supabase Configuration
SUPABASE_URL=https://your-project.supabase.co
SUPABASE_KEY=your-service-role-key-here
STORAGE_BUCKET=processed-data

# Google Gemini API
GEMINI_API_KEY=your-gemini-api-key-here

# ChromaDB Configuration (optional)
CHROMA_PERSIST_DIR=chroma_db_marine
CHROMA_COLLECTION=occurrence_embeddings

# LLM Configuration (optional)
LLM_MODEL=gemini-2.5-flash
EMBED_MODEL=models/text-embedding-004
```

Note: All scripts automatically load the `.env` file; you don't need to export environment variables manually.
Import marine data from Supabase storage bucket:
```bash
python import_occurrences_supabase.py --output occurrences.json
```

This script:
- Downloads marine data files from the Supabase storage bucket `processed-data`
- Auto-detects data types from file extensions and content:
  - `.parquet` files → Occurrence data
  - `.HDR` + `.asc` pairs → CTD data (automatically grouped and processed together)
  - `.txt` files with AWS patterns (`$GPS`, `$MET`, `$SST`) → AWS data
  - `.txt` files with ADCP patterns (`Ens`, `Eas`, `Nor`, `Mag`) → ADCP data
- Processes and parses each data type using the specialized parsers from `parsers.py`
- Converts everything to a unified JSON format for embedding generation
- Handles column name variations automatically (e.g., `scientificName`, `scientific_name`, `species`)
- Saves all marine data records with a `dataType` field for proper categorization
Options:
- `--file-pattern`: Filter files by pattern (e.g., `'occurrence'`, `'ctd'`)
- `--limit`: Limit number of records to process (0 = all)
- `--output`, `-o`: Output JSON file path (default: `occurrences.json`)
- `--files`: Manually specify file names to download (e.g., `--files 'occurrence.parquet' 'occurrence (1).parquet'`)
- `--supabase-url`: Supabase project URL (defaults to `SUPABASE_URL` from `.env`)
- `--supabase-key`: Supabase service role key (defaults to `SUPABASE_KEY` from `.env`)
- `--bucket`: Storage bucket name (defaults to `STORAGE_BUCKET` from `.env`, or `processed-data`)
Note: The script automatically groups CTD files (.HDR header + .asc data pairs) and processes them together.
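The detection and pairing logic described above can be sketched roughly as follows (function names and the exact marker handling are illustrative, not the actual `parsers.py` API):

```python
from pathlib import Path

AWS_MARKERS = ("$GPS", "$MET", "$SST")
ADCP_MARKERS = ("Ens", "Eas", "Nor", "Mag")

def detect_data_type(filename: str, sample_text: str = "") -> str:
    """Guess the marine data type from the extension, then content markers."""
    suffix = Path(filename).suffix.lower()
    if suffix == ".parquet":
        return "OCCURRENCE"
    if suffix in (".hdr", ".asc"):
        return "CTD"  # .HDR/.asc files are paired by stem and parsed together
    if suffix == ".txt":
        if any(m in sample_text for m in AWS_MARKERS):
            return "AWS"
        if any(m in sample_text for m in ADCP_MARKERS):
            return "ADCP"
    return "UNKNOWN"

def group_ctd_pairs(filenames):
    """Pair CTD .HDR headers with their .asc data files by shared stem."""
    stems = {}
    for name in filenames:
        p = Path(name)
        if p.suffix.lower() in (".hdr", ".asc"):
            stems.setdefault(p.stem, []).append(name)
    # Only complete header+data pairs are processed
    return {stem: files for stem, files in stems.items() if len(files) == 2}
```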
Create vector embeddings for all marine data records:
```bash
python marine_embeddings_supabase.py --input occurrences.json
```

What this does:
- Reads marine data from JSON file
- Generates text representations for each record (type-specific):
- Occurrence: Species, location, depth, water body, sampling method
- CTD: Temperature, salinity, oxygen, depth profiles
- AWS: Sea surface temperature, wind speed, air temperature
- ADCP: Current speeds, velocity profiles, depth bins
- Creates embeddings using Google Gemini embedding model
- Stores in ChromaDB with stable, content-based IDs (prevents duplicates)
- Skips existing records by default (safe to re-run)
Options:
- `--input`, `-i`: Input JSON file (default: `occurrences.json`)
- `--limit`: Process only N records (0 = all, useful for testing)
- `--skip-existing`: Skip records already in ChromaDB (default: True)
- `--force`: Force regeneration of all embeddings (disables skip-existing)
- `--batch-size`: Batch size for processing (default: 100)
- `--chroma-persist-dir`: ChromaDB persistence directory (defaults to `CHROMA_PERSIST_DIR` from `.env`, or `chroma_db_marine`)
- `--chroma-collection`: ChromaDB collection name (defaults to `CHROMA_COLLECTION` from `.env`, or `occurrence_embeddings`)
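The "stable, content-based IDs" that make re-runs safe can be derived by hashing each record's canonical JSON. This is one plausible scheme, sketched for illustration; the actual ID format in `marine_embeddings_supabase.py` may differ:

```python
import hashlib
import json

def stable_record_id(record: dict) -> str:
    """Derive a deterministic ID from record content, so re-running the
    embedding script maps the same record to the same ChromaDB ID."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return f"{record.get('dataType', 'UNKNOWN')}-{digest[:16]}"

rec = {"dataType": "OCCURRENCE", "scientificName": "Thunnus albacares", "depth": 450}
# Same content always yields the same ID; key order doesn't matter
assert stable_record_id(rec) == stable_record_id(
    {"depth": 450, "scientificName": "Thunnus albacares", "dataType": "OCCURRENCE"}
)
```

Because the ID depends only on content, inserting the same record twice overwrites rather than duplicates, which is what makes the script safe to re-run.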
For testing with a smaller dataset:
```bash
python marine_embeddings_supabase.py --input occurrences.json --limit 100
```

To clear and regenerate all embeddings:

```bash
python clear_embeddings.py  # Removes the ChromaDB directory
python marine_embeddings_supabase.py --input occurrences.json --force
```

Start the API server:

```bash
python marine_rag_supabase_main.py
```

Or run uvicorn directly:

```bash
uvicorn marine_rag_supabase_main:app --host 0.0.0.0 --port 8000 --reload
```

You should see:
```
INFO:     Uvicorn running on http://0.0.0.0:8000
📊 ChromaDB collection 'occurrence_embeddings' has X documents
✅ ChromaDB is working - test query returned 1 result(s)
Loading occurrences from Supabase storage...
```
Open a new terminal and test:
```bash
# Health check
curl http://localhost:8000/health

# Ask a question about occurrences
curl "http://localhost:8000/query?question=What%20species%20are%20found%20in%20the%20Indian%20Ocean?"

# Query with filters
curl "http://localhost:8000/query?question=unique%20species%20in%20Arabian%20Sea&water_body=Arabian%20Sea&min_depth=300&max_depth=600"

# Query CTD data
curl "http://localhost:8000/query?question=temperature%20profiles%20in%20Indian%20Ocean&data_types=CTD"
```

- Make sure the RAG API is running (step 7)
- Start the SAGAR frontend:
  ```bash
  cd ../../SAGAR
  npm start
  ```
- Navigate to the Globe View
- Use the "AI-Powered Analysis" textarea or bottom search bar to ask questions
- Select data types (OCCURRENCE, CTD, AWS, ADCP) or leave empty for auto-detection
SagarManthan-RAG supports multiple marine data types:
- OCCURRENCE: Species occurrence records (biodiversity data)
- CTD: Conductivity-Temperature-Depth profiles
- AWS: Automated Weather Station data
- ADCP: Acoustic Doppler Current Profiler data
The system auto-detects data types from queries, or you can specify them explicitly.
- "What unique species are found in the Indian Ocean between 350-600 meters?"
- Auto-extracts: water_body=Indian Ocean, min_depth=350, max_depth=600, deduplicates by species
- "Which species live at depths between 300 and 500 meters in the Arabian Sea?"
- Auto-extracts: water_body=Arabian Sea, depth range 300-500m
- "What sampling methods were used in the Andaman Sea?"
- Auto-extracts: water_body=Andaman Sea, focuses on sampling protocols
- "Show me occurrences of deep sea species in the Bay of Bengal"
- Auto-extracts: water_body=Bay of Bengal, searches for deep sea species
- "occurrence of unique marine species within the Indian Ocean, specifically focusing on the 350 to 600-meter depth range"
- Complex query with multiple auto-extracted parameters
- "What are the temperature profiles in the Indian Ocean?"
- Auto-detects: data_type=CTD, water_body=Indian Ocean
- "Show me salinity data from the Arabian Sea"
- Auto-detects: data_type=CTD, water_body=Arabian Sea, parameter=salinity
- "What is the oxygen concentration at different depths?"
- Auto-detects: data_type=CTD, parameter=oxygen
- "What are the sea surface temperatures in the Indian Ocean?"
- Auto-detects: data_type=AWS, water_body=Indian Ocean, parameter=SST
- "Show me wind speed data from recent measurements"
- Auto-detects: data_type=AWS, parameter=wind speed
- "What are the weather conditions in the Arabian Sea?"
- Auto-detects: data_type=AWS, water_body=Arabian Sea
- "What are the current speeds in the Indian Ocean?"
- Auto-detects: data_type=ADCP, water_body=Indian Ocean
- "Show me ocean current velocity profiles"
- Auto-detects: data_type=ADCP, focuses on velocity profiles
- "What are the flow patterns in the Bay of Bengal?"
- Auto-detects: data_type=ADCP, water_body=Bay of Bengal
- "Show me all marine data from the Indian Ocean"
- Searches across all data types (OCCURRENCE, CTD, AWS, ADCP)
- "What environmental conditions and species are found in the Arabian Sea?"
- Combines occurrence and environmental data (CTD/AWS)
- Natural Language Queries: Ask questions in plain English - no complex syntax needed
- Automatic Parameter Extraction: Intelligently extracts depth ranges, water bodies, and data types from natural language
- Depth ranges: "350 to 600-meter depth range" → automatically extracts min_depth=350, max_depth=600
- Water bodies: "Indian Ocean" → automatically filters to Indian Ocean records
- Data types: "CTD temperature profiles" → automatically detects CTD data type
- Multi-Data Type Support: Query across Occurrence, CTD, AWS, and ADCP data simultaneously or individually
- Auto-Detection: If data types not specified, system intelligently detects from query keywords
- Unique Species Deduplication: When querying for "unique" or "distinct" species, automatically deduplicates results by scientific name
- Keeps best match (highest similarity score) for each unique species
- Returns both total count (before deduplication) and unique count (after deduplication)
- Intelligent Depth Filtering: Handles depth ranges correctly
- For occurrence data: Checks if record's depth range overlaps with query range
- For CTD data: Filters by single depth values
- Supports both min/max depth ranges and single depth values
- Case-Insensitive Filtering: Water body and species name matching is case-insensitive
- "Arabian Sea", "arabian sea", "ARABIAN SEA" all match correctly
- All Results by Default: Returns ALL matching records (no artificial limits) for comprehensive research analysis
- Complete Dataset: Research dashboard receives full dataset for proper visualization and analysis
- Transparent Results: Shows both total occurrences and unique species counts
- Source Tracking: Each result includes similarity scores and source information
- Comprehensive Answers: Uses Gemini LLM to generate detailed, scientific answers based on retrieved data
- Dashboard Summaries: Generates structured summaries with:
- Executive summary
- Key findings (bullet points)
- Species analysis
- Geographic distribution
- Depth analysis
- Temporal patterns
- Research insights
- Type-Specific Prompts: Customizes LLM prompts based on data type (CTD, AWS, ADCP, Occurrence) for more accurate answers
- Graceful Error Handling: If rate limits are hit, still returns data with fallback messages
- Deduplication: Prevents duplicate embeddings using content-based stable IDs
- Caching: In-memory cache for fast data access
- Persistent Storage: ChromaDB persists embeddings across restarts
- Rate Limit Handling: Gracefully handles API rate limits with informative messages
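The automatic parameter extraction described above can be approximated with simple pattern matching. This is a sketch under assumed patterns, not the server's actual extraction code:

```python
import re

# Illustrative list; the real system presumably recognizes more water bodies
WATER_BODIES = ["Indian Ocean", "Arabian Sea", "Bay of Bengal", "Andaman Sea"]

def extract_query_params(question: str) -> dict:
    """Pull depth range, water body, and the unique-species flag
    out of a natural-language query."""
    params = {}
    # Depth ranges: "350 to 600-meter", "between 300 and 500 meters", "350-600 m"
    m = re.search(r"(\d+)\s*(?:to|and|-|–)\s*(\d+)[\s-]*m(?:eter)?s?", question, re.I)
    if m:
        params["min_depth"], params["max_depth"] = int(m.group(1)), int(m.group(2))
    # Case-insensitive water body match
    for wb in WATER_BODIES:
        if wb.lower() in question.lower():
            params["water_body"] = wb
            break
    # "unique"/"distinct" triggers species deduplication
    params["unique_species"] = bool(re.search(r"\b(unique|distinct)\b", question, re.I))
    return params
```

For example, "What unique species are found in the Indian Ocean between 350 and 600 meters?" would yield a depth range of 350–600 m, the Indian Ocean filter, and the deduplication flag.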
Health check - shows number of cached records and ChromaDB status.
Answer scientific questions using RAG.
Parameters:
- `question` (required): The scientific question
- `water_body` (optional): Filter by water body
- `scientific_name` (optional): Filter by scientific name
- `min_depth` (optional): Minimum depth in meters
- `max_depth` (optional): Maximum depth in meters
- `data_types` (optional): Comma-separated data types (`OCCURRENCE`, `CTD`, `AWS`, `ADCP`, `ALL`)
- `top_k` (optional): Limit results (None = return all matching)
- `similarity_threshold` (optional): Similarity threshold for filtering
Response includes:
- `query`: The original question
- `answer`: Generated answer from the LLM (comprehensive, scientific response)
- `relevant_occurrences`: ALL matching records (not limited; complete dataset for research)
- `sources_count`: Total records before deduplication (if unique species requested)
- `dashboard_summary`: Structured summary with:
  - `executive_summary`: 2-3 sentence overview
  - `key_findings`: Array of 5 key findings
  - `species_analysis`: Detailed species analysis
  - `geographic_distribution`: Geographic patterns
  - `depth_analysis`: Depth-related insights
  - `temporal_patterns`: Time-based patterns
  - `research_insights`: Research methods and contributions
- `took_ms`: Query processing time in milliseconds
Note: Each occurrence in `relevant_occurrences` includes:
- All original data fields
- `similarity_score`: Relevance score (lower = more relevant)
- `dataType`: Type of record (OCCURRENCE, CTD, AWS, ADCP)
- `content`: Text representation used for embedding
Search marine data with filters (similar to `/query`, but returns raw results without an LLM-generated answer).
Parameters: Same as the `/query` endpoint
Response:
- `query`: Search query text
- `filters`: Applied filters
- `candidates`: Total matching records
- `results`: All matching records
- `took_ms`: Processing time
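For programmatic access, the endpoints above can be called from Python with nothing but the standard library. A sketch, assuming the server is running locally on port 8000 as in the earlier steps:

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "http://localhost:8000"  # assumed local dev server

def build_query_url(question: str, **filters) -> str:
    """Build the /query URL, dropping unset (None) filters."""
    params = {"question": question,
              **{k: v for k, v in filters.items() if v is not None}}
    return f"{BASE_URL}/query?" + urllib.parse.urlencode(params)

def ask(question: str, **filters) -> dict:
    """Call the API and return the parsed JSON response (server must be running)."""
    with urllib.request.urlopen(build_query_url(question, **filters), timeout=120) as resp:
        return json.load(resp)

# Example (with the API server from step 7 running):
# result = ask("unique species in the Arabian Sea", min_depth=300, max_depth=600)
# print(result["answer"], result["sources_count"])
```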
Reload data from Supabase storage (clears cache and reloads from bucket).
```bash
curl -X POST "http://localhost:8000/reload"
```

- Create a `.env` file with `GEMINI_API_KEY=your-key`
- Verify the API key is valid and has sufficient quota
- Check rate limits (free tier: 10 requests/minute)
- Create a `.env` file with `SUPABASE_URL` and `SUPABASE_KEY`
- Use the service role key (not the anon key) for storage access
- Verify the bucket name is correct (default: `processed-data`)
- Run `import_occurrences_supabase.py` first to create the JSON file
- Ensure Supabase credentials are correct and the bucket is accessible
- Check that marine data files exist in the bucket (Parquet files, CTD .HDR/.asc pairs, AWS/ADCP .txt files)
- Verify file names match expected patterns (e.g., CTD files need matching .HDR and .asc files)
- Run `python marine_embeddings_supabase.py` to generate embeddings
- Check that `occurrences.json` exists and contains data
- Verify the ChromaDB directory has write permissions
- Free tier Gemini API: 10 requests/minute per model
- What happens: System gracefully handles rate limits:
- Still returns ALL matching data records
- Provides fallback answer message if LLM generation fails
- Skips dashboard summary generation if rate limited
- Logs informative messages about rate limit status
- Solutions:
- Wait 30-60 seconds between queries
- Consider upgrading API tier for higher limits
- The system will automatically retry with exponential backoff
- Data is still returned even if LLM calls fail
- Check that data was imported: `curl http://localhost:8000/health`
- Verify embeddings were generated
- Try a more general query without filters
- Check ChromaDB has documents: Look for "ChromaDB collection has X documents" in startup logs
- The API already has CORS enabled for all origins
- If issues persist, check the browser console for specific errors
```
SagarManthan-RAG/
├── marine_rag_supabase_main.py     # Main API server
├── import_occurrences_supabase.py  # Import from Supabase → JSON
├── marine_embeddings_supabase.py   # Generate embeddings from JSON
├── parsers.py                      # Data type parsers (CTD, AWS, ADCP)
├── clear_embeddings.py             # Utility to clear ChromaDB
├── occurrences.json                # Imported data (created)
├── occurrences_cache.json          # API cache (created)
├── chroma_db_marine/               # ChromaDB persistence directory
├── requirements.txt                # Python dependencies
├── .env                            # Environment variables (create this)
├── SUPABASE_SETUP.md               # Detailed Supabase setup guide
├── MARINE_RAG_README.md            # Full documentation
├── Rag_process.txt                 # RAG system explanation
└── EXTENSION_PLAN.md               # Multi-data type extension plan
```
- Query Received: User asks a natural language question
- Parameter Extraction: System automatically extracts:
- Depth ranges (e.g., "350-600 meters" → min_depth=350, max_depth=600)
- Water bodies (e.g., "Indian Ocean" → water_body filter)
- Data types (e.g., "CTD temperature" → data_type=CTD)
- Unique species request (e.g., "unique species" → deduplication flag)
- Data Type Detection: If not specified, detects from query keywords
- Vector Search: Searches ChromaDB for semantically similar records
- Scalar Filtering: Applies filters (water body, depth, species, data type)
- Deduplication: If "unique" requested, deduplicates by scientific name
- LLM Generation: Generates answer and dashboard summary from results
- Response: Returns ALL matching records + generated insights
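Step 6 above keeps, for each scientific name, the record with the best similarity score. A minimal sketch of that deduplication (recall that in this API a lower score means a more relevant match; the real implementation may differ in detail):

```python
from typing import Dict, List

def dedupe_by_species(records: List[dict]) -> List[dict]:
    """Keep one record per scientific name: the one with the best
    (lowest) similarity_score, matching names case-insensitively."""
    best: Dict[str, dict] = {}
    for rec in records:
        name = (rec.get("scientificName") or "").strip().lower()
        if not name:
            continue  # records without a species name are left out of dedup
        kept = best.get(name)
        if kept is None or rec["similarity_score"] < kept["similarity_score"]:
            best[name] = rec
    return list(best.values())
```

This is also where the two counts in the response come from: the input length is the total before deduplication, and the output length is the unique-species count.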
```
Supabase Storage (Parquet, CTD .HDR/.asc, AWS .txt, ADCP .txt files)
        ↓
import_occurrences_supabase.py
        ↓
occurrences.json (Unified JSON format with dataType field)
        ↓
marine_embeddings_supabase.py
        ↓
ChromaDB (Vector embeddings with stable IDs)
        ↓
marine_rag_supabase_main.py (API Server)
        ↓
User Query → Hybrid Search → Results + LLM Answer
        ↓
SAGAR Dashboard (Visualization & Analysis)
```
- Read the full documentation in `MARINE_RAG_README.md`
- Check `SUPABASE_SETUP.md` for detailed Supabase configuration
- Review `Rag_process.txt` to understand the RAG architecture in detail
- Explore `EXTENSION_PLAN.md` to see how multi-data type support was implemented
- Customize filters and queries for your research needs
- Add more data sources as needed
- Integrate with other SAGAR modules
For issues or questions, refer to:
- `MARINE_RAG_README.md` for detailed documentation
- `SUPABASE_SETUP.md` for Supabase-specific setup
- `Rag_process.txt` for RAG system architecture
- The main SAGAR project documentation
- Check API logs for error messages
Same as the main SAGAR project.
SagarManthan-RAG - Empowering marine research with intelligent data analysis by SAGAR.