SAGAR Data Processing Engine

A FastAPI service that converts CSV files to Parquet with an integrated SAGAR-QC quality-control system. It automatically validates data quality, applies QC tests, and generates comprehensive quality reports before storing results in Supabase storage.

Features

  • AI-Powered Preprocessing: Uses Google Gemini 2.5 Flash API to intelligently analyze and convert any file format to CSV
    • Automatically identifies headers, columns, and data structure
    • Handles complex delimiters (tabs, spaces, commas, mixed)
    • Removes metadata and non-data rows intelligently
    • Falls back to rule-based preprocessing if Gemini API is not configured
  • SAGAR-QC Quality Control: Proprietary quality control system integrated into the processing pipeline
    • Intelligent Test Selection: AI-powered (Gemini) or rule-based test selection based on data characteristics
    • Comprehensive QC Tests: IOOS-QC/QARTOD standards + SAGAR-specific tests
    • Data Type Awareness: Automatically detects occurrence data vs. sensor data
    • Multi-format GPS Support: Validates coordinates in various formats (Decimal Degrees, NMEA 0183, DDM, DMS, UTM)
    • QC Flagging: Adds flag column with standard QC flags (GOOD, SUSPECT, FAIL, MISSING, UNKNOWN)
    • Quality Reports: Generates comprehensive JSON quality reports with metrics and recommendations
  • Parquet Conversion: Converts cleaned CSV (with QC flags) to Parquet format using pandas/pyarrow
  • Storage Upload: Uploads Parquet files to Supabase processed-data bucket
  • Metadata Storage: Stores processing metadata and quality reports in metadata_sagar table

Setup

  1. Create and activate virtual environment (recommended):

    python3 -m venv venv
    source venv/bin/activate  # On Windows: venv\Scripts\activate
  2. Install dependencies:

    pip install --upgrade pip
    pip install -r requirements.txt
  3. Configure environment variables:

    • Create a .env file in the DataProcessingEngine directory
    • Add your Supabase credentials:
      SUPABASE_URL=https://your-project.supabase.co
      SUPABASE_KEY=your-service-role-key-here
      
    • Add your Gemini API key (optional but recommended):
      GEMINI_API_KEY=your-gemini-api-key-here
      
    • Note:
      • Use the SERVICE ROLE KEY (not the anon key) for Supabase backend operations
      • Get Gemini API key from: https://makersuite.google.com/app/apikey
      • If GEMINI_API_KEY is not set:
        • Preprocessing falls back to the rule-based path
        • SAGAR-QC falls back to rule-based test selection (still fully functional)
  4. Run the server:

    # Activate virtual environment (if using one)
    source venv/bin/activate
    
    # Run the server on localhost:8000
    uvicorn main:app --host localhost --port 8000 --reload

    The server will start at: http://localhost:8000

    • Root endpoint: http://localhost:8000/ - Status check
    • Health check: http://localhost:8000/health - Health status
    • API docs: http://localhost:8000/docs - Interactive Swagger UI documentation
    • Process endpoint: http://localhost:8000/process-csv - Main processing endpoint

API Endpoints

POST /process-csv

Processes an uploaded CSV file (which may contain metadata lines above the header):

  1. Cleans CSV: Removes lines above header, ensures proper format
  2. SAGAR-QC Processing: Runs quality control tests and adds QC flags
  3. Converts to Parquet: Converts cleaned CSV (with QC flags) to Parquet format
  4. Uploads to Storage: Uploads to processed-data bucket in Supabase Storage
  5. Stores Metadata: Stores processing metadata and quality report in metadata_sagar table

Request:

  • Method: POST
  • Content-Type: multipart/form-data
  • Body: File upload (a CSV file; metadata lines above the header are allowed)

Response:

{
  "status": "success",
  "processed_file": "filename.parquet",
  "metadata": {
    "columns": [...],
    "inferred_types": {...},
    "total_rows": 1000,
    "quality_control": {
      "summary": {
        "quality_status": "GOOD|SUSPECT|FAIL",
        "flag_summary": {...},
        "total_rows": 1000,
        "tests_executed": [...]
      },
      "detailed_metrics": {
        "overall_quality_score": 95.5,
        "good_percentage": 90.0,
        "suspect_percentage": 8.0,
        "fail_percentage": 2.0
      },
      "test_results": {...}
    },
    "quality_report_json": {
      "summary": {...},
      "detailed_metrics": {...},
      "test_results": {...},
      "test_rationale": "Explanation of why tests were selected",
      "recommendations": [...]
    }
  }
}
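Downstream clients usually need only a few fields out of this payload. A minimal sketch of pulling them out (the helper name `summarize_quality` is hypothetical, not part of the service):

```python
def summarize_quality(response: dict) -> dict:
    """Extract the headline QC figures from a /process-csv response."""
    qc = response["metadata"]["quality_control"]
    return {
        "file": response["processed_file"],
        "status": qc["summary"]["quality_status"],
        "score": qc["detailed_metrics"]["overall_quality_score"],
        "rows": qc["summary"]["total_rows"],
    }

# Example with a response shaped like the one above
sample = {
    "status": "success",
    "processed_file": "filename.parquet",
    "metadata": {
        "quality_control": {
            "summary": {"quality_status": "GOOD", "total_rows": 1000},
            "detailed_metrics": {"overall_quality_score": 95.5},
        }
    },
}
print(summarize_quality(sample))
```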

GET /health

Health check endpoint.

Environment Variables

  • SUPABASE_URL: Your Supabase project URL
  • SUPABASE_KEY: Your Supabase service role key (needed for storage uploads)
  • GEMINI_API_KEY: (Optional) Your Google Gemini API key for AI-powered features
    • Get it from: https://makersuite.google.com/app/apikey
    • Used for:
      • AI-powered CSV preprocessing (file structure analysis)
      • AI-powered QC test selection (intelligent test selection based on data characteristics)
    • If not provided:
      • System uses rule-based preprocessing
      • SAGAR-QC uses rule-based test selection (still functional)
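The fallback decision reduces to a single environment check; a sketch (the function name `select_backends` is illustrative, not from the codebase):

```python
import os

def select_backends() -> dict:
    """Decide AI vs. rule-based paths from the environment (sketch)."""
    has_gemini = bool(os.getenv("GEMINI_API_KEY"))
    return {
        "preprocessing": "gemini" if has_gemini else "rule-based",
        "qc_test_selection": "gemini" if has_gemini else "rule-based",
    }

os.environ.pop("GEMINI_API_KEY", None)
print(select_backends())  # both paths fall back to rule-based
```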

Processing Flow

  1. AI-Powered Preprocessing (if Gemini API key is configured):

    • Uses Google Gemini 2.5 Flash model to analyze file structure
    • Intelligently identifies headers, columns, and data patterns
    • Handles complex delimiters and mixed formats automatically
    • Removes metadata lines and non-data content
    • Converts to clean CSV format with proper quoting
    • Falls back to rule-based preprocessing if API call fails
  2. Rule-Based Preprocessing (fallback or if no Gemini API key):

    • Removes lines above the header row
    • Detects header (first line with commas or tabs)
    • Ensures all data rows match header column count
    • Handles multiple encodings (UTF-8, Latin-1, ISO-8859-1, CP1252)
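These rule-based steps can be sketched as follows (simplified; the real encoding fallback across UTF-8/Latin-1/ISO-8859-1/CP1252 is omitted):

```python
import csv
import io

def clean_csv_text(raw: str) -> str:
    """Drop metadata lines above the header and rows whose column
    count does not match the header (rule-based sketch)."""
    lines = raw.splitlines()
    # Header heuristic from the steps above: first line with a comma or tab.
    start = next(i for i, ln in enumerate(lines) if "," in ln or "\t" in ln)
    delimiter = "\t" if "\t" in lines[start] else ","
    rows = list(csv.reader(io.StringIO("\n".join(lines[start:])), delimiter=delimiter))
    width = len(rows[0])
    kept = [r for r in rows if len(r) == width]  # enforce header column count
    out = io.StringIO()
    csv.writer(out).writerows(kept)
    return out.getvalue()

raw = "Station report\nGenerated 2024-01-01\nid,temp\n1,20.5\n2,21.0\nbadrow\n"
print(clean_csv_text(raw))
```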
  3. Data Processing:

    • Reads cleaned CSV into pandas DataFrame
    • Handles type inference (dates, numeric values)
    • Removes completely empty columns
    • Sanitizes column names for Parquet compatibility
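A pandas sketch of this step (the exact column-name sanitization rule is illustrative):

```python
import re
import pandas as pd

def prepare_dataframe(df: pd.DataFrame) -> pd.DataFrame:
    """Drop empty columns, infer numeric types, and sanitize
    column names for Parquet compatibility (sketch)."""
    df = df.dropna(axis=1, how="all").copy()       # remove completely empty columns
    for col in df.columns:
        converted = pd.to_numeric(df[col], errors="coerce")
        # Convert only if every non-null value parsed as a number
        if converted.notna().sum() == df[col].notna().sum():
            df[col] = converted
    # Replace characters Parquet tools dislike with underscores
    df.columns = [re.sub(r"\W+", "_", str(c)).strip("_") or "col" for c in df.columns]
    return df

raw = pd.DataFrame({"temp (°C)": ["20.5", "21.0"], "note": ["a", "b"], "empty": [None, None]})
print(prepare_dataframe(raw).columns.tolist())
```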
  4. SAGAR-QC Quality Control:

    • Data Analysis: Analyzes DataFrame structure and data characteristics
      • Detects data type (occurrence data vs. sensor data)
      • Identifies numeric, temporal, and coordinate columns
      • Analyzes data patterns and distributions
    • Intelligent Test Selection:
      • AI-Powered (if Gemini API key available): Uses Gemini 2.5 Flash to analyze CSV headers and intelligently select appropriate QC tests
      • Rule-Based (fallback): Uses rule-based logic to select tests based on detected data characteristics
      • Logs which system is used (visible in terminal output)
    • QC Test Execution: Runs selected tests:
      • For Occurrence Data (species records, biodiversity):
        • missing_data (row-wise: checks each record for critical identifier fields)
        • location (if coordinates present: validates GPS coordinates in multiple formats)
      • For Sensor Data (time-series measurements):
        • gross_range (validates data within acceptable ranges)
        • spike (detects sudden unrealistic value changes)
        • flat_line (identifies constant values indicating sensor malfunction)
        • rate_of_change (validates rate of change between consecutive values)
        • temporal_consistency (validates temporal ordering and gaps)
        • climatology (compares against historical climatological ranges)
        • missing_data (column-wise: flags columns with excessive missing values)
        • duplicate_detection (identifies duplicate records)
    • Flag Assignment: Adds flag column to DataFrame with QC flags:
      • GOOD (1): Data passes all applicable tests
      • UNKNOWN (2): Insufficient information to determine quality
      • SUSPECT (3): Data may be questionable but not definitively bad
      • FAIL (4): Data fails quality tests
      • MISSING (9): Data value is missing
    • Quality Report Generation: Creates comprehensive JSON report with:
      • Summary statistics (flag distribution, quality status)
      • Detailed metrics (quality score, percentages)
      • Individual test results (columns checked, rows flagged, sample problematic values)
      • Test rationale (explanation of why specific tests were selected)
      • Recommendations for data improvement
  5. Parquet Conversion:

    • Converts DataFrame (with QC flags) to Parquet format using PyArrow
    • Preserves data types, structure, and QC flags
  6. Storage & Metadata:

    • Uploads Parquet file to processed-data bucket
    • Creates bounding box geometry if coordinates are available
    • Stores metadata and quality report in metadata_sagar table
    • Returns quality report to frontend for display

Testing the API

Using curl:

# Health check
curl http://localhost:8000/health

# Process a CSV file
curl -X POST http://localhost:8000/process-csv \
  -F "file=@path/to/your/file.csv"

Using the Interactive API Documentation:

Visit http://localhost:8000/docs in your browser to access the Swagger UI, where you can:

  • View all available endpoints
  • Test the API directly from the browser
  • See request/response schemas

SAGAR-QC Module

The SAGAR-QC module (SAGAR_QC/) is a proprietary quality control system integrated into the processing pipeline.

Module Structure

SAGAR_QC/
├── __init__.py          # Module initialization and exports
├── qc_flags.py          # QC flag definitions (GOOD, SUSPECT, FAIL, etc.)
├── qc_tests.py          # QC test implementations
├── qc_analyzer.py       # Data analysis and intelligent test selection
└── qc_pipeline.py       # QC pipeline orchestration

Available QC Tests

IOOS-QC/QARTOD Standard Tests:

  • gross_range_test: Validates data within acceptable physical/biological ranges
  • spike_test: Detects sudden, unrealistic value changes
  • flat_line_test: Identifies constant values (sensor malfunction indicator)
  • rate_of_change_test: Validates rate of change between consecutive values
  • climatology_test: Compares against historical climatological ranges
  • temporal_consistency_test: Validates temporal ordering and gaps

SAGAR-Specific Tests:

  • location_test: Validates GPS coordinates in multiple formats:
    • Decimal Degrees (DD): 12.9716, 77.5946
    • NMEA 0183 DDMM.MMMM: 958.217, 7614.599 (variable length support)
    • NMEA 0183 DDMMSS.SSSS: 095821.7, 0761459.9
    • Degrees Decimal Minutes (DDM): 12°58.217', 77°14.599'
    • Degrees Minutes Seconds (DMS): 12°58'13", 77°14'36"
    • UTM: Universal Transverse Mercator coordinates
  • missing_data_test: Column-wise missing data analysis for sensor data
  • missing_data_test_occurrence: Row-wise missing data check for occurrence data
  • duplicate_detection_test: Identifies duplicate records
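As an example, the NMEA 0183 DDMM.MMMM case splits the integer part into degrees and minutes; a simplified sketch (hemisphere signs and the other formats are omitted):

```python
def nmea_ddmm_to_decimal(value: str) -> float:
    """Convert NMEA 0183 DDMM.MMMM to decimal degrees.
    The last two digits of the integer part are whole minutes;
    whatever precedes them (variable length) is degrees."""
    number = float(value)
    degrees = int(number // 100)        # e.g. 958.217 -> 9
    minutes = number - degrees * 100    # -> 58.217
    return degrees + minutes / 60.0

print(round(nmea_ddmm_to_decimal("958.217"), 5))   # ~9.97028
print(round(nmea_ddmm_to_decimal("7614.599"), 5))  # ~76.24332
```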

Intelligent Test Selection

The system intelligently selects appropriate tests based on data characteristics:

  • AI-Powered (Gemini 2.5 Flash):

    • Analyzes CSV headers and data structure
    • Distinguishes between occurrence data and sensor data
    • Selects only relevant tests to avoid false positives
    • Provides rationale for test selection
  • Rule-Based (Fallback):

    • Analyzes data structure programmatically
    • Detects occurrence data by column names (occurrenceID, scientificName, etc.)
    • Detects sensor data by temporal columns and numeric patterns
    • Applies appropriate tests based on detected characteristics
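A sketch of that detection logic (the marker column names beyond occurrenceID and scientificName are illustrative):

```python
# Darwin Core-style column names that indicate occurrence data;
# entries beyond occurrenceID/scientificName are illustrative
OCCURRENCE_COLUMNS = {"occurrenceid", "scientificname", "basisofrecord", "eventdate"}

def detect_data_type(columns: list) -> str:
    """Rule-based sketch: occurrence-style column names mean
    occurrence data; otherwise assume sensor data."""
    normalized = {str(c).strip().lower() for c in columns}
    return "occurrence" if normalized & OCCURRENCE_COLUMNS else "sensor"

print(detect_data_type(["occurrenceID", "scientificName", "decimalLatitude"]))  # occurrence
print(detect_data_type(["timestamp", "temperature", "salinity"]))               # sensor
```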

Data Type Awareness

The system automatically detects and handles different data types:

  • Occurrence Data (species records, biodiversity data):

    • Uses row-wise testing (each record checked individually)
    • Applies: missing_data (row-wise), location (if coordinates present)
    • Skips time-series tests (spike, flat_line, rate_of_change, etc.)
  • Sensor Data (time-series measurements):

    • Uses column-wise testing (each column analyzed separately)
    • Applies full suite of QC tests including temporal and range-based tests

QC Flags

Standard QC flags applied to data:

  • GOOD (1): Data passes all applicable tests
  • UNKNOWN (2): Insufficient information to determine quality
  • SUSPECT (3): Data may be questionable but not definitively bad
  • FAIL (4): Data fails quality tests
  • MISSING (9): Data value is missing

Flags are combined using worst-case logic (e.g., if one test flags SUSPECT and another flags FAIL, the final flag is FAIL).
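That combination rule can be sketched with an explicit severity order (ranking MISSING above FAIL here is an assumption; the text above only specifies that FAIL outranks SUSPECT):

```python
# Flag codes as defined above
GOOD, UNKNOWN, SUSPECT, FAIL, MISSING = 1, 2, 3, 4, 9

# Severity order is an assumption beyond "FAIL beats SUSPECT"
SEVERITY = {GOOD: 0, UNKNOWN: 1, SUSPECT: 2, FAIL: 3, MISSING: 4}

def combine_flags(flags: list) -> int:
    """Worst-case combination: the most severe flag wins."""
    return max(flags, key=SEVERITY.__getitem__)

print(combine_flags([SUSPECT, FAIL]))  # 4 (FAIL wins, per the example above)
print(combine_flags([GOOD, UNKNOWN]))  # 2
```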

Quality Reports

The system generates comprehensive quality reports in JSON format, including:

  • Summary statistics (flag distribution, quality status)
  • Detailed metrics (quality score, percentages per flag type)
  • Individual test results (columns checked, rows flagged, sample problematic values)
  • Test rationale (explanation of why specific tests were selected)
  • Recommendations for data improvement

Reports are stored in the metadata_sagar table and returned to the frontend for interactive display and PDF generation.

Notes

  • AI-Powered Features: With Gemini API key, the service uses AI for:
    • Intelligent CSV preprocessing (better handling of complex file structures)
    • Intelligent QC test selection (reduces false positives by selecting only relevant tests)
  • Fallback Processing: Without Gemini API key:
    • Uses rule-based preprocessing
    • SAGAR-QC uses rule-based test selection (still fully functional)
  • QC Processing: SAGAR-QC runs automatically after CSV cleaning and before Parquet conversion
  • QC Flags: All processed data includes a flag column with QC validation results
  • Quality Reports: Comprehensive reports are generated and stored with metadata
  • Parquet files are uploaded to the processed-data bucket
  • Metadata and quality reports are stored in the metadata_sagar table
  • The service handles comma, tab, and space-separated files
  • Supports multiple text encodings for international character sets
  • Server runs with auto-reload enabled (restarts on code changes)
  • Large files (>200k chars) are processed with sampling for efficiency
  • Terminal logs indicate whether Gemini AI or rule-based system is used for test selection
