A modern, interactive web application for uploading and ingesting datasets into the SAGAR data lakehouse. Built with React, Vite, and Supabase, it features a glassmorphism UI with an animated 3D globe background and integrates the proprietary SAGAR-QC quality control system for automated data validation and comprehensive quality reporting.
- Overview
- Features
- Architecture
- Tech Stack
- Project Structure
- Setup Instructions
- Environment Variables
- Components
- Quality Control (SAGAR-QC)
- Workflow
- Supabase Edge Functions
- Development
- Build & Preview
- Deployment
The SAGAR Data Ingestion Portal is a secure, user-friendly interface for the CMLRE (Centre for Marine Living Resources and Ecology) to upload datasets. The application provides:
- Secure Authentication: Simple username/password login with session persistence
- File Upload: Drag-and-drop or browse file selection
- Real-time Processing: Visual feedback during data ingestion
- Automated Quality Control: Proprietary SAGAR-QC module with intelligent test selection
- Comprehensive Quality Reports: Interactive JSON reports and formal PDF downloads
- Automated Pipeline: Automatic triggering of backend ingestion services
- Modern UI: Glassmorphism design with animated 3D globe background
- Login Page: Secure username/password authentication
- Session Persistence: Login state saved in localStorage
- Credential Display: Temporary credentials shown on login page (Username: `admin`, Password: `admin123`)
- Logout Functionality: Secure logout with state cleanup
- Error Handling: Clear error messages for invalid credentials
- Multiple File Selection: Upload up to 100 files at once
- File Selection: Browse button with multiple file selection support
- File List Display: Shows all selected files with individual file sizes
- File Management: Remove individual files from the selection before upload
- Multi-format Support: Automatically converts any file type to CSV:
- CSV files (no conversion needed)
- TSV/TXT files (converts tabs to commas)
- Excel files (XLSX, XLS) - converts first sheet to CSV
- JSON files (converts objects/arrays to CSV)
- Other text files (auto-detects delimiter)
- Sequential Processing: Files are processed one by one to ensure quality
- Backend Processing: All cleaning and processing handled by Data Processing Engine API
- Real-time Progress Tracking: Individual status updates for each file being processed
- Upload Status: Real-time status messages showing current file and progress
- Error Handling: Comprehensive error messages per file; one failure doesn't stop other files
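The delimiter auto-detection for "other text files" is implemented client-side in JavaScript (`src/lib/fileProcessing.js`); the idea can be sketched in Python as follows (a simplified illustration, not the portal's actual code):

```python
def detect_delimiter(sample: str, candidates=(",", "\t", ";", "|")) -> str:
    """Guess the delimiter of a text file: a good candidate appears the
    same nonzero number of times on every (non-empty) line sampled."""
    lines = [ln for ln in sample.splitlines() if ln][:10]
    for d in candidates:
        counts = [ln.count(d) for ln in lines]
        if counts and min(counts) > 0 and len(set(counts)) == 1:
            return d
    return ","  # fall back to comma when nothing matches consistently

# Example: tab-separated input is detected as TSV
# detect_delimiter("name\tdepth\nA\t10\n") returns "\t"
```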
- Client-side Processing:
- File Conversion: Converts any file type (Excel, JSON, TSV, TXT, etc.) to CSV for each selected file
- Sends CSV files to backend API sequentially (one at a time)
- Backend Processing (Data Processing Engine) - Per File:
- CSV Cleaning:
- Removes lines above header
- Ensures proper CSV format
- Handles both comma and tab-separated files
- SAGAR-QC Quality Control:
- Data analysis and intelligent test selection
- QC test execution and flag assignment
- Quality report generation
- Parquet Conversion: Converts cleaned CSV (with QC flags) to Parquet using pandas/pyarrow
- Storage Upload: Automatic upload to `processed-data` bucket
- Metadata Storage: Stores metadata and quality report in `metadata_sagar` table
- Visual Feedback:
- Real-time progress tracking for each file
- Individual status indicators (converting, processing, completed, failed)
- Animated spinner during processing
- Success confirmation with summary
- List of processed files with individual quality report access
- State Management: Smooth transitions between upload, processing, and completion states
- Batch Summary: Shows total successful and failed files after batch completion
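The cleaning step described above (dropping junk lines above the header, normalizing tabs) can be sketched as follows; this is an illustrative helper, not the engine's actual implementation in `processing.py`:

```python
def clean_csv(raw_text: str) -> str:
    """Drop any preamble lines above the header row and convert
    tab-separated content to comma-separated (illustrative sketch)."""
    lines = raw_text.splitlines()
    start = 0
    for i, line in enumerate(lines):
        # Heuristic: the header is the first line containing a delimiter.
        if "," in line or "\t" in line:
            start = i
            break
    return "\n".join(line.replace("\t", ",") for line in lines[start:])
```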
- Intelligent Test Selection: AI-powered (Gemini 2.5 Flash) or rule-based test selection based on data characteristics
- Comprehensive QC Tests:
- IOOS-QC/QARTOD Standards: Gross range, spike detection, flat line, rate of change, climatology, temporal consistency
- SAGAR-Specific Tests: Location validation, duplicate detection, missing data analysis
- Data Type Awareness: Automatically detects occurrence data vs. sensor data and applies appropriate tests
- GPS Format Support: Multi-format GPS coordinate parsing (Decimal Degrees, NMEA 0183, DDM, DMS, UTM)
- Row-wise & Column-wise Testing: Intelligent application based on data type (occurrence data uses row-wise checks)
- QC Flagging System: Standard flags (GOOD, SUSPECT, FAIL, MISSING, UNKNOWN) added to data
- Quality Reports:
- Interactive JSON Reports: Detailed metrics, test results, and recommendations
- Formal PDF Reports: Academic-style downloadable reports with charts and detailed analysis
- Test Rationale: Explains why specific tests were selected for each dataset
- Individual Reports: Each processed file has its own quality report
- Report Access: View quality reports for any processed file from the results list
- Interactive Dashboard: Visual charts (Pie and Bar charts) showing flag distribution
- Expandable Test Results: Column-specific details with expandable dropdowns
- Test Metrics: Detailed statistics for each QC test executed
- Download Options:
- Download JSON quality report for each file
- Download formal PDF report (academic-style) for each file
- Visual Indicators: Color-coded flags and quality scores
- Batch Overview: See all processed files with their individual status and report access
- 3D Globe Background: Interactive rotating globe with decorative points
- Glassmorphism Design: Modern frosted glass effect with backdrop blur
- Responsive Layout: Works on desktop and mobile devices
- Smooth Animations: CSS keyframe animations for loading states
- Color Scheme: Dark theme with cyan/green gradient accents
```
┌─────────────────┐
│    React App    │
│   (Frontend)    │
│ Multiple Files  │
│   (up to 100)   │
└────────┬────────┘
         │
         │ For each file (sequential):
         │ 1. Convert to CSV (client-side)
         │ 2. Send cleaned CSV
         ▼
┌─────────────────┐
│ Data Processing │
│   Engine API    │
│    (FastAPI)    │
│    Per File:    │
└────────┬────────┘
         │
         │ 3. CSV Cleaning
         │ 4. SAGAR-QC Processing
         │    ├─ Data Analysis
         │    ├─ Intelligent Test Selection (Gemini AI)
         │    ├─ QC Test Execution
         │    └─ Flag Assignment
         │ 5. Convert CSV → Parquet (with flags)
         │ 6. Generate Quality Report (JSON)
         │ 7. Upload Parquet
         │ 8. Store Metadata + QC Report
         │ 9. Return QC Report to Frontend
         ▼
┌─────────────────┐
│    Supabase     │
│     Storage     │
│   (processed-   │
│   data bucket)  │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│   Supabase DB   │
│   (metadata_    │
│   sagar table)  │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Quality Report  │
│  Display & PDF  │
│   Generation    │
│   (Per File)    │
└─────────────────┘
```
Note: Files are processed sequentially (one at a time) to ensure quality control and prevent resource conflicts. Each file receives its own quality report and is stored independently.
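The sequential, fault-isolated batch loop described above can be sketched as follows. `process_one` stands in for the per-file conversion and API call, and the dictionary shape is illustrative, not the portal's exact state format:

```python
def process_batch(files, process_one):
    """Process files one at a time; a failure is recorded per file
    and never aborts the rest of the batch."""
    results = []
    for name in files:
        try:
            report = process_one(name)  # convert to CSV, POST to the engine
            results.append({"file": name, "status": "completed", "report": report})
        except Exception as exc:
            results.append({"file": name, "status": "failed", "error": str(exc)})
    return results
```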
- React 18.3.1: UI library
- Vite 5.4.0: Build tool and dev server
- react-globe.gl 2.36.0: 3D globe visualization
- Three.js 0.180.0: 3D graphics library
- @supabase/supabase-js 2.45.4: Supabase client library
- recharts 2.15.4: Interactive charts for quality reports
- html2canvas 1.4.1: HTML to canvas conversion for PDF
- jspdf 2.5.2: PDF generation library
- Data Processing Engine: FastAPI service for CSV to Parquet conversion
- SAGAR-QC Module: Proprietary quality control system
- Google Gemini AI 2.5 Flash: Intelligent test selection
- IOOS-QC/QARTOD Tests: Standard oceanographic QC tests
- Custom QC Tests: SAGAR-specific quality checks
- Supabase:
- Storage buckets for processed data
- Database for metadata storage
- Inline Styles: React inline styles for component styling
- CSS Animations: Keyframe animations for loading states
- Glassmorphism: Backdrop blur effects
```
data-ingestion/
├── src/
│   ├── App.jsx                    # Main application component
│   ├── main.jsx                   # React entry point
│   ├── components/
│   │   ├── GlobeBackground.jsx    # 3D globe background component
│   │   ├── SimpleFilePicker.jsx   # File selection component
│   │   ├── QualityReport.jsx      # Quality report display component
│   │   └── ui/
│   │       └── file-upload.jsx    # Alternative file upload component
│   └── lib/
│       ├── utils.js               # Utility functions
│       ├── fileProcessing.js      # File conversion to CSV utilities
│       └── pdfGenerator.js        # PDF report generation
├── DataProcessingEngine/          # Backend processing API
│   ├── main.py                    # FastAPI application
│   ├── processing.py              # CSV to Parquet conversion + QC integration
│   ├── config.py                  # Supabase & Gemini configuration
│   ├── requirements.txt           # Python dependencies
│   ├── .env.example               # Environment variables template
│   ├── SAGAR_QC/                  # Proprietary QC module
│   │   ├── __init__.py            # Module initialization
│   │   ├── qc_flags.py            # QC flag definitions
│   │   ├── qc_tests.py            # QC test implementations
│   │   ├── qc_analyzer.py         # Data analysis & test selection
│   │   └── qc_pipeline.py         # QC pipeline orchestration
│   └── README.md                  # Processing engine documentation
├── supabase/
│   ├── config.toml                # Supabase configuration
│   └── functions/
│       └── trigger-ingestion/
│           ├── index.ts           # Edge Function (legacy)
│           └── deno.json          # Deno configuration
├── index.html                     # HTML entry point
├── vite.config.js                 # Vite configuration
├── package.json                   # Dependencies and scripts
├── env.js.example                 # Environment variables template
└── README.md                      # This file
```
- Node.js (v16 or higher)
- npm or yarn
- Supabase account and project
- (Optional) Netlify account for deployment
1. Clone the repository

   ```bash
   git clone <repository-url>
   cd data-ingestion
   ```

2. Install dependencies

   ```bash
   npm install
   ```

3. Set up environment variables

   - Copy `env.js.example` to `.env` (or create a `.env` file)
   - Fill in your Supabase credentials and login credentials

4. Configure Supabase

   - Create a storage bucket named `processed-data` in your Supabase project
   - Create a table named `metadata_sagar` for storing file metadata

5. Run the development server

   ```bash
   npm run dev
   ```

6. Open in browser

   - Navigate to `http://localhost:5173` (or the port shown in the terminal)
Create a .env file in the root directory with the following variables:
```
VITE_LOGIN_USERNAME=admin
VITE_LOGIN_PASSWORD=admin123
VITE_PROCESSING_API_URL=http://localhost:8000
```

Note:
- `VITE_PROCESSING_API_URL` is the URL of the Data Processing Engine API. It defaults to `http://localhost:8000` for development; for production, set it to your deployed API URL (e.g., `https://dataprocessingengine-sagar.onrender.com`).
- Supabase credentials are only needed in the backend (`DataProcessingEngine/.env`), not in the frontend.
- The Gemini API key is configured in the backend (`DataProcessingEngine/.env`) for intelligent QC test selection.

Frontend variables:
- `VITE_LOGIN_USERNAME`: Username for portal login
- `VITE_LOGIN_PASSWORD`: Password for portal login
- `VITE_PROCESSING_API_URL`: URL of the Data Processing Engine API (defaults to `http://localhost:8000`)

Backend variables (`DataProcessingEngine/.env`):

```
SUPABASE_URL=https://your-project.supabase.co
SUPABASE_KEY=your-service-role-key-here
GEMINI_API_KEY=your-gemini-api-key-here  # Optional: for AI-powered test selection
```

Note: For production, use secure environment variable management. Never commit `.env` files to version control.
Main application component that handles:
- Authentication state management
- File cleaning (client-side)
- API communication with Data Processing Engine
- UI state transitions (login → upload → processing → complete)
- LocalStorage session persistence
Key Features:
- Login/logout functionality
- Client-side CSV cleaning
- Direct API calls to processing backend
- Error handling and status messages
- Responsive glassmorphism UI
3D interactive globe component using react-globe.gl:
- Rotating 3D Earth visualization
- Decorative points at specific coordinates (Bangalore, Delhi, Mumbai)
- Atmosphere effect with white glow
- Auto-rotation enabled
- Responsive to window resize
Coordinates Displayed:
- Bangalore: 12.97°N, 77.59°E (Green)
- Delhi: 28.61°N, 77.20°E (Cyan)
- Mumbai: 19.07°N, 72.87°E (Gold)
Multiple file selection component:
- Browse button for multiple file selection (up to 100 files)
- File list display with individual file names and sizes
- Remove individual files from selection
- File count display
- Glassmorphism styling
- Scrollable file list for large selections
Comprehensive quality report display component:
- Interactive Charts: Pie chart for flag distribution, bar chart for quality metrics
- Test Results Display: Detailed results for each QC test executed
- Expandable Columns: Click to expand column-specific test results
- Quality Metrics: Overall quality score, flag percentages, test statistics
- Download Options:
- Download JSON report
- Download formal PDF report (academic-style)
- Test Rationale: Displays why specific tests were selected
- Visual Indicators: Color-coded flags and status indicators
- User enters username and password
- Credentials validated against environment variables
- On success:
  - Login state set to `true`
  - Session saved to localStorage
  - User redirected to upload interface
- On failure:
- Error message displayed
- User remains on login page
- User selects files via browse button (supports up to 100 files, any file type)
- Selected files displayed in picker with file names and sizes
- User can remove individual files before upload
- User clicks "Upload [N] Files" button
- Sequential Processing (one file at a time). For each file:
  a. Client-side Processing:
     - File Conversion: File is converted to CSV format (if not already CSV)
       - Excel files (XLSX, XLS) → CSV (first sheet)
       - JSON files → CSV
       - TSV/TXT files → CSV (tabs converted to commas)
       - CSV files → No conversion needed
  b. Backend Processing (Data Processing Engine API):
     - CSV Cleaning:
       - Removes lines above header
       - Ensures proper CSV format
       - Handles both comma and tab-separated files
     - SAGAR-QC Quality Control:
       - Data Analysis: Analyzes CSV structure and data characteristics
       - Intelligent Test Selection:
         - Uses Gemini AI (if available) to analyze headers and select appropriate tests
         - Falls back to rule-based selection if AI unavailable
         - Logs which system was used (AI or rule-based)
       - QC Test Execution: Runs selected tests:
         - For Occurrence Data (species records, biodiversity): `missing_data` (row-wise), `location` (if coordinates present)
         - For Sensor Data (time-series): `gross_range`, `spike`, `flat_line`, `rate_of_change`, `temporal_consistency`, `climatology`, `missing_data` (column-wise), `duplicate_detection`
       - Flag Assignment: Adds `flag` column to DataFrame with QC flags (GOOD, SUSPECT, FAIL, MISSING, UNKNOWN)
       - Quality Report Generation: Creates comprehensive JSON report with:
         - Summary statistics
         - Detailed metrics per test
         - Test rationale
         - Recommendations
     - Parquet Conversion: Converts CSV with QC flags to Parquet format
     - Storage Upload: Parquet file uploaded to `processed-data` bucket
     - Metadata Storage: Metadata + quality report stored in `metadata_sagar` table
  c. Progress Tracking:
     - Real-time status updates for each file (converting, processing, completed, failed)
     - Individual progress indicators
     - Error messages displayed per file if processing fails
- Quality Report Display:
- List of all processed files displayed after batch completion
- Individual "View Report" button for each file
- Interactive charts showing flag distribution for selected file
- Expandable test results with column-specific details
- Download options for JSON and PDF reports per file
- Completion:
- Success animation with checkmark
- Summary message: "[N] files processed and stored to lakehouse"
- List of processed files with status indicators
- Individual quality reports available for viewing and download
- Return to upload interface
- Upload Errors: Displayed in red with error message per file
- Network Errors: Caught and displayed to user; doesn't stop processing of other files
- Validation Errors: File selection validation before upload (max 100 files)
- QC Errors: QC test failures are logged but don't stop processing
- Individual File Errors: One file failure doesn't affect other files in the batch
- Error Recovery: Failed files are clearly marked with error details in the results list
The Data Processing Engine is a FastAPI service located in the DataProcessingEngine/ folder that handles CSV to Parquet conversion and storage.
1. Navigate to the DataProcessingEngine directory:

   ```bash
   cd DataProcessingEngine
   ```

2. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

3. Configure environment variables:

   - Copy `.env.example` to `.env`
   - Fill in your Supabase credentials and optional Gemini API key:

     ```
     SUPABASE_URL=https://your-project.supabase.co
     SUPABASE_KEY=your-service-role-key-here
     GEMINI_API_KEY=your-gemini-api-key-here  # Optional: enables AI-powered test selection
     ```

   - Note: If `GEMINI_API_KEY` is not provided, the system will use rule-based test selection

4. Run the server:

   ```bash
   uvicorn main:app --reload --port 8000
   ```
POST /process-csv
Processes a cleaned CSV file:
- Runs SAGAR-QC quality control tests
- Adds QC flags to data
- Converts CSV to Parquet format using pandas/pyarrow
- Uploads to `processed-data` bucket in Supabase Storage
- Stores metadata and quality report in `metadata_sagar` table
Request:
- Method: POST
- Content-Type: multipart/form-data
- Body: File upload (cleaned CSV file)
Response:
```json
{
  "status": "success",
  "processed_file": "filename.parquet",
  "metadata": {
    "columns": [...],
    "inferred_types": {...},
    "total_rows": 1000,
    "quality_control": {
      "summary": {...},
      "detailed_metrics": {...},
      "test_results": {...}
    },
    "quality_report_json": {
      "summary": {...},
      "detailed_metrics": {...},
      "test_results": {...},
      "test_rationale": "...",
      "recommendations": [...]
    }
  }
}
```

For more details, see DataProcessingEngine/README.md.
The SAGAR-QC module is a proprietary quality control system integrated into the Data Processing Engine. It provides comprehensive data validation using both IOOS-QC/QARTOD standards and custom SAGAR-specific tests.
- AI-Powered (Gemini 2.5 Flash): Analyzes CSV headers and data characteristics to intelligently select appropriate QC tests
- Rule-Based Fallback: Uses rule-based logic if Gemini AI is unavailable
- Data Type Detection: Automatically distinguishes between:
- Occurrence Data: Species records, biodiversity data, checklists (row-wise testing)
- Sensor Data: Time-series measurements, real-time sensor data (column-wise testing)
IOOS-QC/QARTOD Standard Tests:
- Gross Range Test: Validates data within acceptable physical/biological ranges
- Spike Test: Detects sudden, unrealistic value changes
- Flat Line Test: Identifies constant values (sensor malfunction)
- Rate of Change Test: Validates rate of change between consecutive values
- Climatology Test: Compares against historical climatological ranges
- Temporal Consistency Test: Validates temporal ordering and gaps
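As an illustration of the QARTOD-style logic, a gross range test flags values outside an operator-defined fail span as FAIL and values outside a tighter suspect span as SUSPECT. This is a simplified sketch with hypothetical spans; the real tests live in `SAGAR_QC/qc_tests.py`:

```python
GOOD, SUSPECT, FAIL, MISSING = 1, 3, 4, 9  # QARTOD-style flag values


def gross_range_test(values, fail_span, suspect_span):
    """Flag each value: outside fail_span -> FAIL, outside suspect_span
    -> SUSPECT, missing -> MISSING, otherwise GOOD."""
    flags = []
    for v in values:
        if v is None:
            flags.append(MISSING)
        elif not (fail_span[0] <= v <= fail_span[1]):
            flags.append(FAIL)
        elif not (suspect_span[0] <= v <= suspect_span[1]):
            flags.append(SUSPECT)
        else:
            flags.append(GOOD)
    return flags

# Hypothetical sea-temperature spans: hard limits 0..40 degC, plausible 5..35 degC
# gross_range_test([10, 50, None, 36], (0, 40), (5, 35)) returns [1, 4, 9, 3]
```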
SAGAR-Specific Tests:
- Location Test: Validates GPS coordinates in multiple formats:
- Decimal Degrees (DD)
- NMEA 0183 (DDMM.MMMM, DDMMSS.SSSS)
- Degrees Decimal Minutes (DDM)
- Degrees Minutes Seconds (DMS)
- UTM coordinates
- Missing Data Test:
- Row-wise for occurrence data (checks critical identifier fields)
- Column-wise for sensor data (flags columns with excessive missing values)
- Duplicate Detection: Identifies duplicate records
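The row-wise missing-data check for occurrence data can be sketched like this (field names are hypothetical examples; the actual critical fields are chosen by the analyzer):

```python
def missing_data_flags(rows, critical_fields):
    """Row-wise check: a record missing any critical identifier field is
    flagged MISSING (9); otherwise it is flagged GOOD (1)."""
    flags = []
    for row in rows:
        missing = any(row.get(f) in (None, "") for f in critical_fields)
        flags.append(9 if missing else 1)
    return flags

# Example with a hypothetical species-occurrence record set:
# missing_data_flags([{"species": "Sardinella", "lat": 9.97},
#                     {"species": "", "lat": 9.98}],
#                    ["species"]) returns [1, 9]
```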
Standard flags applied to data:
- GOOD (1): Data passes all applicable tests
- UNKNOWN (2): Insufficient information to determine quality
- SUSPECT (3): Data may be questionable but not definitively bad
- FAIL (4): Data fails quality tests
- MISSING (9): Data value is missing
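In code, this flag scheme maps naturally to an integer enum. The sketch below (not the actual `SAGAR_QC/qc_flags.py`) also shows one way to aggregate multiple per-test flags into a single row flag; the precedence order used here is an assumption for illustration:

```python
from enum import IntEnum


class QCFlag(IntEnum):
    """QARTOD-style values written to the added `flag` column."""
    GOOD = 1
    UNKNOWN = 2
    SUSPECT = 3
    FAIL = 4
    MISSING = 9


def worst(flags):
    """Combine per-test flags into one row flag (assumed precedence:
    MISSING > FAIL > SUSPECT > UNKNOWN > GOOD)."""
    for f in (QCFlag.MISSING, QCFlag.FAIL, QCFlag.SUSPECT,
              QCFlag.UNKNOWN, QCFlag.GOOD):
        if f in flags:
            return f
    return QCFlag.UNKNOWN
```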
JSON Report Structure:
```json
{
  "summary": {
    "quality_status": "GOOD|SUSPECT|FAIL",
    "flag_summary": {...},
    "total_rows": 1000,
    "tests_executed": [...]
  },
  "detailed_metrics": {
    "overall_quality_score": 95.5,
    "good_percentage": 90.0,
    "suspect_percentage": 8.0,
    "fail_percentage": 2.0
  },
  "test_results": {
    "test_name": {
      "rows_flagged": 50,
      "columns_checked": [...],
      "column_results": {...}
    }
  },
  "test_rationale": "Explanation of why tests were selected",
  "recommendations": [...]
}
```

PDF Report Features:
- Academic-style formal report
- Title page with dataset information
- Executive summary
- Data characteristics analysis
- Quality control methodology
- Test selection rationale
- Detailed test results with charts
- Recommendations for data improvement
The location test supports multiple GPS coordinate formats:
- Decimal Degrees (DD): `12.9716, 77.5946`
- NMEA 0183 DDMM.MMMM: `958.217, 7614.599` (variable length)
- NMEA 0183 DDMMSS.SSSS: `095821.7, 0761459.9`
- Degrees Decimal Minutes (DDM): `12°58.217', 77°14.599'`
- Degrees Minutes Seconds (DMS): `12°58'13", 77°14'36"`
- UTM: Universal Transverse Mercator coordinates
The system automatically detects the format and converts to decimal degrees for validation.
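For example, converting the NMEA 0183 DDMM.MMMM form to decimal degrees splits the value into whole degrees and decimal minutes. This simplified sketch ignores hemisphere signs; the real parser lives in the SAGAR-QC location test:

```python
def nmea_to_decimal(value: float) -> float:
    """Convert an NMEA 0183 DDMM.MMMM coordinate to decimal degrees,
    e.g. 7614.599 means 76 degrees 14.599 minutes."""
    degrees = int(value // 100)
    minutes = value - degrees * 100
    return degrees + minutes / 60.0

# 958.217 (9 deg 58.217 min) -> about 9.9703 decimal degrees
```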
The SAGAR-QC module is automatically executed during the data processing pipeline:
- CSV is cleaned and parsed
- Data structure is analyzed
- Appropriate tests are selected (AI or rule-based)
- Tests are executed and flags assigned
- Quality report is generated
- DataFrame with flags is converted to Parquet
- Report is stored in metadata and returned to frontend
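The steps above can be sketched as a minimal orchestration loop (hypothetical shapes; the real orchestration is in `SAGAR_QC/qc_pipeline.py`). Each test returns one flag per row, and the pipeline keeps the most severe flag, using an assumed severity order:

```python
SEVERITY = {1: 0, 2: 1, 3: 2, 4: 3, 9: 4}  # GOOD < UNKNOWN < SUSPECT < FAIL < MISSING


def run_pipeline(records, tests):
    """Run each selected test over the records, keep the worst flag per
    row, and return the flags plus a count of rows per flag value."""
    flags = [1] * len(records)  # start every row as GOOD
    for test in tests:
        for i, f in enumerate(test(records)):
            if SEVERITY[f] > SEVERITY[flags[i]]:
                flags[i] = f
    summary = {v: flags.count(v) for v in set(flags)}
    return flags, summary
```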
QC test behavior can be configured via test-specific parameters:
- Range bounds for gross range test
- Spike detection thresholds
- Missing data percentage limits
- Climatology reference data
- And more...
See DataProcessingEngine/SAGAR_QC/ for detailed implementation.
```bash
# Start development server
npm run dev

# Build for production
npm run build

# Preview production build
npm run preview
```

- Runs on `http://localhost:5173` by default
- Hot Module Replacement (HMR) enabled
- Fast refresh for React components
- Component-based: React functional components with hooks
- State Management: React useState and useEffect hooks
- Styling: Inline styles with glassmorphism effects
- Error Handling: Try-catch blocks and error state management
```bash
npm run build
```

This creates an optimized production build in the `dist/` directory:
- Minified JavaScript
- Optimized assets
- Tree-shaking for smaller bundle size
```bash
npm run preview
```

Starts a local server to preview the production build before deployment.
1. Create Netlify Site

   - Connect your Git repository
   - Or drag and drop the `dist` folder after building

2. Configure Build Settings

   - Build command: `npm run build`
   - Publish directory: `dist`
   - Node version: 18.x or higher

3. Set Environment Variables

   In the Netlify dashboard → Site settings → Environment variables:

   ```
   VITE_SUPABASE_URL=your-supabase-url
   VITE_SUPABASE_ANON_KEY=your-anon-key
   VITE_LOGIN_USERNAME=admin
   VITE_LOGIN_PASSWORD=admin123
   ```

4. Deploy

   - Push to the main branch (auto-deploy)
   - Or trigger a manual deploy from the Netlify dashboard
Vercel:
- Similar to Netlify
- Set environment variables in project settings
- Auto-detects Vite configuration
Supabase Hosting:
- Can host static sites
- Environment variables configured in Supabase dashboard
Traditional Hosting:
- Build locally: `npm run build`
- Upload the `dist/` folder contents to the web server
- Configure environment variables on the server
1. Authentication: Currently uses client-side validation. For production:

   - Move authentication to the server side
   - Use Supabase Auth for proper user management
   - Implement JWT tokens

2. Environment Variables:

   - Never commit `.env` files
   - Use secure secret management in production
   - Rotate keys regularly

3. Storage Permissions:

   - Configure Supabase Storage bucket policies
   - Restrict upload permissions appropriately
   - Enable RLS (Row Level Security) if needed

4. API Endpoints:

   - Secure the backend ingestion API
   - Use API keys or authentication tokens
   - Implement rate limiting
Upload fails:
- Check that the Data Processing Engine API is running
- Verify `VITE_PROCESSING_API_URL` is set correctly in `.env` (defaults to `http://localhost:8000` for development)
- Check browser console for detailed error messages
- Ensure backend has proper Supabase credentials configured
- For multiple files: Check individual file status in the processing list to identify which files failed
Login not working:
- Verify environment variables are set correctly
- Check that the `VITE_` prefix is used (required for Vite)
- Restart the dev server after changing `.env`
Globe not displaying:
- Check internet connection (globe uses external image URLs)
- Verify `react-globe.gl` and `three` are installed
- Check browser console for WebGL errors
Backend API not responding:
- Verify the Data Processing Engine is running (`uvicorn main:app --reload --port 8000`)
- Check backend logs for errors
- Ensure the backend `.env` has correct Supabase credentials
- Verify CORS is enabled in the backend for your frontend URL
Quality Control not working:
- Check backend logs for QC test execution messages
- Verify SAGAR_QC module is properly installed
- Check if Gemini AI is being used (look for "Using Gemini AI" or "Using rule-based" in logs)
- If using Gemini AI, ensure `GEMINI_API_KEY` is set in the backend `.env`
- Review the test selection rationale in the quality report to understand which tests were selected
Quality Report not displaying:
- Check browser console for errors
- Verify `quality_report_json` is present in the API response for the specific file
- Ensure `recharts` and the PDF libraries are installed (`npm install`)
- Check that the QualityReport component is properly imported in App.jsx
- For multiple files: Click "View Report" button for the specific file you want to view
- Verify the file was successfully processed (check status indicator)
[Add your license information here]
[Add contributor information here]
For issues, questions, or contributions, please create an issue or contact the development team.
Built with ❤️ for CMLRE SAGAR Data Lakehouse