A FastAPI service that converts CSV files to Parquet format with an integrated SAGAR-QC quality control system. It automatically validates data quality, applies QC tests, and generates comprehensive quality reports before storing results in Supabase storage.
- AI-Powered Preprocessing: Uses Google Gemini 2.5 Flash API to intelligently analyze and convert any file format to CSV
- Automatically identifies headers, columns, and data structure
- Handles complex delimiters (tabs, spaces, commas, mixed)
- Removes metadata and non-data rows intelligently
- Falls back to rule-based preprocessing if Gemini API is not configured
- SAGAR-QC Quality Control: Proprietary quality control system integrated into the processing pipeline
- Intelligent Test Selection: AI-powered (Gemini) or rule-based test selection based on data characteristics
- Comprehensive QC Tests: IOOS-QC/QARTOD standards + SAGAR-specific tests
- Data Type Awareness: Automatically detects occurrence data vs. sensor data
- Multi-format GPS Support: Validates coordinates in various formats (Decimal Degrees, NMEA 0183, DDM, DMS, UTM)
- QC Flagging: Adds a `flag` column with standard QC flags (GOOD, SUSPECT, FAIL, MISSING, UNKNOWN)
- Quality Reports: Generates comprehensive JSON quality reports with metrics and recommendations
- Parquet Conversion: Converts cleaned CSV (with QC flags) to Parquet format using pandas/pyarrow
- Storage Upload: Uploads Parquet files to the Supabase `processed-data` bucket
- Metadata Storage: Stores processing metadata and quality reports in the `metadata_sagar` table
1. Create and activate a virtual environment (recommended):

   ```bash
   python3 -m venv venv
   source venv/bin/activate  # On Windows: venv\Scripts\activate
   ```
2. Install dependencies:

   ```bash
   pip install --upgrade pip
   pip install -r requirements.txt
   ```
3. Configure environment variables:

   - Create a `.env` file in the `DataProcessingEngine` directory
   - Add your Supabase credentials:

     ```
     SUPABASE_URL=https://your-project.supabase.co
     SUPABASE_KEY=your-service-role-key-here
     ```

   - Add your Gemini API key (optional but recommended):

     ```
     GEMINI_API_KEY=your-gemini-api-key-here
     ```

   - Note:
     - Use the SERVICE ROLE KEY (not the anon key) for Supabase backend operations
     - Get a Gemini API key from: https://makersuite.google.com/app/apikey
     - If GEMINI_API_KEY is not set:
       - the system will use rule-based preprocessing (less intelligent)
       - SAGAR-QC will use rule-based test selection (still functional, but less intelligent)
4. Run the server:

   ```bash
   # Activate the virtual environment (if using one)
   source venv/bin/activate

   # Run the server on localhost:8000
   uvicorn main:app --host localhost --port 8000 --reload
   ```
The server will start at: http://localhost:8000
- Root endpoint: `http://localhost:8000/` - Status check
- Health check: `http://localhost:8000/health` - Health status
- API docs: `http://localhost:8000/docs` - Interactive Swagger UI documentation
- Process endpoint: `http://localhost:8000/process-csv` - Main processing endpoint
Processes a CSV file (which may have metadata lines above the header):
- Cleans CSV: Removes lines above the header and ensures a proper format
- SAGAR-QC Processing: Runs quality control tests and adds QC flags
- Converts to Parquet: Converts the cleaned CSV (with QC flags) to Parquet format
- Uploads to Storage: Uploads to the `processed-data` bucket in Supabase Storage
- Stores Metadata: Stores processing metadata and the quality report in the `metadata_sagar` table
Request:
- Method: POST
- Content-Type: multipart/form-data
- Body: File upload (CSV file - can have metadata lines above header)
Response:
```json
{
  "status": "success",
  "processed_file": "filename.parquet",
  "metadata": {
    "columns": [...],
    "inferred_types": {...},
    "total_rows": 1000,
    "quality_control": {
      "summary": {
        "quality_status": "GOOD|SUSPECT|FAIL",
        "flag_summary": {...},
        "total_rows": 1000,
        "tests_executed": [...]
      },
      "detailed_metrics": {
        "overall_quality_score": 95.5,
        "good_percentage": 90.0,
        "suspect_percentage": 8.0,
        "fail_percentage": 2.0
      },
      "test_results": {...}
    },
    "quality_report_json": {
      "summary": {...},
      "detailed_metrics": {...},
      "test_results": {...},
      "test_rationale": "Explanation of why tests were selected",
      "recommendations": [...]
    }
  }
}
```

Health check endpoint.
- `SUPABASE_URL`: Your Supabase project URL
- `SUPABASE_KEY`: Your Supabase service role key (needed for storage uploads)
- `GEMINI_API_KEY`: (Optional) Your Google Gemini API key for AI-powered features
  - Get it from: https://makersuite.google.com/app/apikey
- Used for:
- AI-powered CSV preprocessing (file structure analysis)
- AI-powered QC test selection (intelligent test selection based on data characteristics)
- If not provided:
- System uses rule-based preprocessing
- SAGAR-QC uses rule-based test selection (still functional)
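The fallback behaviour can be sketched as a simple startup check. This is a minimal illustration; the function name `select_processing_mode` is hypothetical and not part of the service's actual code:

```python
import os

def select_processing_mode() -> str:
    """Pick AI-powered or rule-based processing based on GEMINI_API_KEY.

    Hypothetical helper illustrating the documented fallback: with no
    key configured, both preprocessing and QC test selection fall back
    to rule-based logic.
    """
    if os.getenv("GEMINI_API_KEY"):
        return "ai"        # Gemini-powered preprocessing + test selection
    return "rule-based"    # still functional, but less intelligent
```

The service remains fully functional in either mode; only the intelligence of preprocessing and test selection changes.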
1. AI-Powered Preprocessing (if the Gemini API key is configured):
- Uses Google Gemini 2.5 Flash model to analyze file structure
- Intelligently identifies headers, columns, and data patterns
- Handles complex delimiters and mixed formats automatically
- Removes metadata lines and non-data content
- Converts to clean CSV format with proper quoting
- Falls back to rule-based preprocessing if API call fails
2. Rule-Based Preprocessing (fallback, or if no Gemini API key):
- Removes lines above the header row
- Detects header (first line with commas or tabs)
- Ensures all data rows match header column count
- Handles multiple encodings (UTF-8, Latin-1, ISO-8859-1, CP1252)
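The rule-based cleanup can be approximated as follows. This is an illustrative sketch of the documented heuristics (header = first delimited line, data rows must match the header's column count), not the engine's actual implementation:

```python
def clean_csv_lines(lines: list[str]) -> list[str]:
    """Drop metadata lines above the header and rows with a mismatched
    column count. Illustrative only: the real engine also handles
    multiple encodings and quoted fields."""
    # Header heuristic: first line containing a comma or a tab
    header_idx = next(
        (i for i, line in enumerate(lines) if "," in line or "\t" in line),
        0,
    )
    header = lines[header_idx]
    delim = "\t" if "\t" in header else ","
    n_cols = len(header.split(delim))
    # Keep the header plus only the data rows matching its column count
    return [header] + [
        row for row in lines[header_idx + 1:]
        if len(row.split(delim)) == n_cols
    ]
```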
3. Data Processing:
- Reads cleaned CSV into pandas DataFrame
- Handles type inference (dates, numeric values)
- Removes completely empty columns
- Sanitizes column names for Parquet compatibility
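A condensed pandas sketch of these steps. The sanitization rule (replace non-alphanumeric characters with underscores) is an assumption for illustration; the engine's exact rule may differ:

```python
import re
import pandas as pd

def basic_processing(df: pd.DataFrame) -> pd.DataFrame:
    """Illustrates the documented steps: type inference, dropping
    empty columns, and Parquet-safe column names."""
    df = df.dropna(axis=1, how="all")   # remove completely empty columns
    for col in df.columns:
        try:
            df[col] = pd.to_numeric(df[col])   # best-effort numeric inference
        except (ValueError, TypeError):
            pass                               # leave non-numeric columns as-is
    # Sanitize column names for Parquet compatibility (assumed rule)
    df.columns = [re.sub(r"[^0-9A-Za-z_]", "_", str(c)) for c in df.columns]
    return df
```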
4. SAGAR-QC Quality Control:
- Data Analysis: Analyzes DataFrame structure and data characteristics
- Detects data type (occurrence data vs. sensor data)
- Identifies numeric, temporal, and coordinate columns
- Analyzes data patterns and distributions
- Intelligent Test Selection:
- AI-Powered (if Gemini API key available): Uses Gemini 2.5 Flash to analyze CSV headers and intelligently select appropriate QC tests
- Rule-Based (fallback): Uses rule-based logic to select tests based on detected data characteristics
- Logs which system is used (visible in terminal output)
- QC Test Execution: Runs selected tests:
- For Occurrence Data (species records, biodiversity):
  - `missing_data` (row-wise: checks each record for critical identifier fields)
  - `location` (if coordinates present: validates GPS coordinates in multiple formats)
- For Sensor Data (time-series measurements):
  - `gross_range` (validates data within acceptable ranges)
  - `spike` (detects sudden unrealistic value changes)
  - `flat_line` (identifies constant values indicating sensor malfunction)
  - `rate_of_change` (validates rate of change between consecutive values)
  - `temporal_consistency` (validates temporal ordering and gaps)
  - `climatology` (compares against historical climatological ranges)
  - `missing_data` (column-wise: flags columns with excessive missing values)
  - `duplicate_detection` (identifies duplicate records)
- Flag Assignment: Adds a `flag` column to the DataFrame with QC flags:
  - GOOD (1): Data passes all applicable tests
- UNKNOWN (2): Insufficient information to determine quality
- SUSPECT (3): Data may be questionable but not definitively bad
- FAIL (4): Data fails quality tests
- MISSING (9): Data value is missing
- Quality Report Generation: Creates comprehensive JSON report with:
- Summary statistics (flag distribution, quality status)
- Detailed metrics (quality score, percentages)
- Individual test results (columns checked, rows flagged, sample problematic values)
- Test rationale (explanation of why specific tests were selected)
- Recommendations for data improvement
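To give a flavour of what a column-wise sensor test looks like, here is a simplified flat-line check. It is illustrative only; the run-length threshold, tolerance, and function name are assumptions, not SAGAR-QC's actual code:

```python
def flat_line_flags(values: list[float], tolerance: float = 0.0,
                    run_length: int = 3) -> list[int]:
    """Flag values as SUSPECT (3) when they sit in a run of `run_length`
    or more (near-)identical readings; otherwise GOOD (1).
    Simplified stand-in for a QARTOD-style flat_line test."""
    GOOD, SUSPECT = 1, 3
    flags = [GOOD] * len(values)
    run_start = 0
    for i in range(1, len(values) + 1):
        # Close the current run at end-of-data or when the value changes
        if i == len(values) or abs(values[i] - values[run_start]) > tolerance:
            if i - run_start >= run_length:      # constant run detected
                for j in range(run_start, i):
                    flags[j] = SUSPECT
            run_start = i
    return flags
```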
5. Parquet Conversion:
- Converts DataFrame (with QC flags) to Parquet format using PyArrow
- Preserves data types, structure, and QC flags
6. Storage & Metadata:
- Uploads the Parquet file to the `processed-data` bucket
- Creates bounding box geometry if coordinates are available
- Stores metadata and the quality report in the `metadata_sagar` table
- Returns the quality report to the frontend for display
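The bounding-box step can be sketched as a min/max over the coordinate columns. This is a minimal illustration; the field names and the geometry format the service actually stores are assumptions:

```python
def bounding_box(lats: list[float], lons: list[float]) -> dict:
    """Compute a simple lat/lon bounding box from coordinate columns.
    Illustrative; the real service builds a geometry object from this."""
    return {
        "min_lat": min(lats), "max_lat": max(lats),
        "min_lon": min(lons), "max_lon": max(lons),
    }
```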
```bash
# Health check
curl http://localhost:8000/health

# Process a CSV file
curl -X POST http://localhost:8000/process-csv \
  -F "file=@path/to/your/file.csv"
```

Visit http://localhost:8000/docs in your browser to access the Swagger UI, where you can:
- View all available endpoints
- Test the API directly from the browser
- See request/response schemas
The SAGAR-QC module (SAGAR_QC/) is a proprietary quality control system integrated into the processing pipeline.
```
SAGAR_QC/
├── __init__.py      # Module initialization and exports
├── qc_flags.py      # QC flag definitions (GOOD, SUSPECT, FAIL, etc.)
├── qc_tests.py      # QC test implementations
├── qc_analyzer.py   # Data analysis and intelligent test selection
└── qc_pipeline.py   # QC pipeline orchestration
```
IOOS-QC/QARTOD Standard Tests:
- `gross_range_test`: Validates data within acceptable physical/biological ranges
- `spike_test`: Detects sudden, unrealistic value changes
- `flat_line_test`: Identifies constant values (sensor malfunction indicator)
- `rate_of_change_test`: Validates rate of change between consecutive values
- `climatology_test`: Compares against historical climatological ranges
- `temporal_consistency_test`: Validates temporal ordering and gaps
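As an example of the spirit of these tests, a minimal gross-range check might look like this. The thresholds and function signature are illustrative assumptions, not SAGAR-QC's actual configuration:

```python
def gross_range_flags(values, fail_span=(-5.0, 45.0), suspect_span=(0.0, 35.0)):
    """QARTOD-style gross range: FAIL (4) outside the sensor's physical
    span, SUSPECT (3) outside the expected operational span, else GOOD (1)."""
    GOOD, SUSPECT, FAIL = 1, 3, 4
    flags = []
    for v in values:
        if not fail_span[0] <= v <= fail_span[1]:
            flags.append(FAIL)       # physically impossible reading
        elif not suspect_span[0] <= v <= suspect_span[1]:
            flags.append(SUSPECT)    # possible but outside expected range
        else:
            flags.append(GOOD)
    return flags
```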
SAGAR-Specific Tests:
- `location_test`: Validates GPS coordinates in multiple formats:
  - Decimal Degrees (DD): `12.9716, 77.5946`
  - NMEA 0183 DDMM.MMMM: `958.217, 7614.599` (variable length support)
  - NMEA 0183 DDMMSS.SSSS: `095821.7, 0761459.9`
  - Degrees Decimal Minutes (DDM): `12°58.217', 77°14.599'`
  - Degrees Minutes Seconds (DMS): `12°58'13", 77°14'36"`
  - UTM: Universal Transverse Mercator coordinates
- `missing_data_test`: Column-wise missing data analysis for sensor data
- `missing_data_test_occurrence`: Row-wise missing data check for occurrence data
- `duplicate_detection_test`: Identifies duplicate records
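As an illustration of normalizing one of these formats, a DMS-to-decimal-degrees conversion could look like the sketch below. It is not `location_test`'s actual parser, which handles many more variants:

```python
import re

def dms_to_decimal(dms: str) -> float:
    """Convert a DMS string such as 12°58'13" to decimal degrees.
    Illustrative parser only (positive coordinates, no hemisphere letter)."""
    m = re.match(r'(\d+)°(\d+)\'(\d+(?:\.\d+)?)"', dms)
    if not m:
        raise ValueError(f"not a DMS coordinate: {dms!r}")
    deg, minutes, seconds = (float(g) for g in m.groups())
    return deg + minutes / 60 + seconds / 3600
```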
The system intelligently selects appropriate tests based on data characteristics:
1. AI-Powered (Gemini 2.5 Flash):
- Analyzes CSV headers and data structure
- Distinguishes between occurrence data and sensor data
- Selects only relevant tests to avoid false positives
- Provides rationale for test selection
2. Rule-Based (Fallback):
- Analyzes data structure programmatically
- Detects occurrence data by column names (occurrenceID, scientificName, etc.)
- Detects sensor data by temporal columns and numeric patterns
- Applies appropriate tests based on detected characteristics
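The rule-based detection can be sketched as a header inspection. The keyword sets below follow the description above, but the exact lists the engine uses are unknown:

```python
OCCURRENCE_MARKERS = {"occurrenceid", "scientificname", "basisofrecord"}
TEMPORAL_MARKERS = {"time", "timestamp", "date", "datetime"}

def detect_data_type(columns: list[str]) -> str:
    """Classify a dataset as 'occurrence' or 'sensor' from its headers.
    Illustrative heuristic matching the documented rules."""
    lowered = {c.lower() for c in columns}
    if lowered & OCCURRENCE_MARKERS:
        return "occurrence"
    if lowered & TEMPORAL_MARKERS:
        return "sensor"
    return "sensor"  # assumed default for plain tabular measurements
```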
The system automatically detects and handles different data types:
1. Occurrence Data (species records, biodiversity data):
- Uses row-wise testing (each record checked individually)
- Applies: `missing_data` (row-wise), `location` (if coordinates present)
- Skips time-series tests (spike, flat_line, rate_of_change, etc.)
2. Sensor Data (time-series measurements):
- Uses column-wise testing (each column analyzed separately)
- Applies full suite of QC tests including temporal and range-based tests
Standard QC flags applied to data:
- GOOD (1): Data passes all applicable tests
- UNKNOWN (2): Insufficient information to determine quality
- SUSPECT (3): Data may be questionable but not definitively bad
- FAIL (4): Data fails quality tests
- MISSING (9): Data value is missing
Flags are combined using worst-case logic (e.g., if one test flags SUSPECT and another flags FAIL, the final flag is FAIL).
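This worst-case combination can be sketched as taking the most severe flag. The severity ordering beyond the documented SUSPECT/FAIL example (in particular, where MISSING ranks) is an assumption:

```python
# Numeric flag codes from the scheme above
GOOD, UNKNOWN, SUSPECT, FAIL, MISSING = 1, 2, 3, 4, 9

# Assumed severity ranking, least to most severe
SEVERITY = {GOOD: 0, UNKNOWN: 1, SUSPECT: 2, FAIL: 3, MISSING: 4}

def combine_flags(flags: list[int]) -> int:
    """Combine per-test flags with worst-case logic: the final flag is
    the most severe one any test produced."""
    return max(flags, key=SEVERITY.__getitem__)
```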
The system generates comprehensive quality reports in JSON format, including:
- Summary statistics (flag distribution, quality status)
- Detailed metrics (quality score, percentages per flag type)
- Individual test results (columns checked, rows flagged, sample problematic values)
- Test rationale (explanation of why specific tests were selected)
- Recommendations for data improvement
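The summary portion of such a report could be assembled from flag counts as below. Field names mirror the response schema shown earlier, but the quality-status thresholds are invented for illustration:

```python
from collections import Counter

def build_summary(flags: list[int]) -> dict:
    """Build flag-distribution summary metrics from a `flag` column.
    The quality_status thresholds are illustrative, not SAGAR-QC's."""
    total = len(flags)
    counts = Counter(flags)
    fail_pct = 100.0 * counts.get(4, 0) / total
    # Assumed thresholds: any FAIL under 5% downgrades to SUSPECT
    status = "GOOD" if fail_pct == 0 else "SUSPECT" if fail_pct < 5 else "FAIL"
    return {
        "total_rows": total,
        "flag_summary": dict(counts),
        "quality_status": status,
        "good_percentage": 100.0 * counts.get(1, 0) / total,
        "fail_percentage": fail_pct,
    }
```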
Reports are stored in the metadata_sagar table and returned to the frontend for interactive display and PDF generation.
- AI-Powered Features: With Gemini API key, the service uses AI for:
- Intelligent CSV preprocessing (better handling of complex file structures)
- Intelligent QC test selection (reduces false positives by selecting only relevant tests)
- Fallback Processing: Without Gemini API key:
- Uses rule-based preprocessing
- SAGAR-QC uses rule-based test selection (still fully functional)
- QC Processing: SAGAR-QC runs automatically after CSV cleaning and before Parquet conversion
- QC Flags: All processed data includes a `flag` column with QC validation results
- Quality Reports: Comprehensive reports are generated and stored with metadata
- Parquet files are uploaded to the `processed-data` bucket
- Metadata and quality reports are stored in the `metadata_sagar` table
- The service handles comma-, tab-, and space-separated files
- Supports multiple text encodings for international character sets
- Server runs with auto-reload enabled (restarts on code changes)
- Large files (>200k chars) are processed with sampling for efficiency
- Terminal logs indicate whether Gemini AI or rule-based system is used for test selection