SAGAR Data Processing Engine

A FastAPI service that converts CSV files to Parquet with an integrated SAGAR-QC quality-control system. It automatically validates data quality, applies QC tests, and generates comprehensive quality reports before storing results in Supabase storage.

Features

  • AI-Powered Preprocessing: Uses Google Gemini 2.5 Flash API to intelligently analyze and convert any file format to CSV
    • Automatically identifies headers, columns, and data structure
    • Handles complex delimiters (tabs, spaces, commas, mixed)
    • Removes metadata and non-data rows intelligently
    • Falls back to rule-based preprocessing if Gemini API is not configured
  • SAGAR-QC Quality Control: Proprietary quality control system integrated into the processing pipeline
    • Intelligent Test Selection: AI-powered (Gemini) or rule-based test selection based on data characteristics
    • Comprehensive QC Tests: IOOS-QC/QARTOD standards + SAGAR-specific tests
    • Data Type Awareness: Automatically detects occurrence data vs. sensor data
    • Multi-format GPS Support: Validates coordinates in various formats (Decimal Degrees, NMEA 0183, DDM, DMS, UTM)
    • QC Flagging: Adds flag column with standard QC flags (GOOD, SUSPECT, FAIL, MISSING, UNKNOWN)
    • Quality Reports: Generates comprehensive JSON quality reports with metrics and recommendations
  • Parquet Conversion: Converts cleaned CSV (with QC flags) to Parquet format using pandas/pyarrow
  • Storage Upload: Uploads Parquet files to Supabase processed-data bucket
  • Metadata Storage: Stores processing metadata and quality reports in metadata_sagar table

Setup

  1. Create and activate virtual environment (recommended):

    python3 -m venv venv
    source venv/bin/activate  # On Windows: venv\Scripts\activate
  2. Install dependencies:

    pip install --upgrade pip
    pip install -r requirements.txt
  3. Configure environment variables:

    • Create a .env file in the DataProcessingEngine directory
    • Add your Supabase credentials:
      SUPABASE_URL=https://your-project.supabase.co
      SUPABASE_KEY=your-service-role-key-here
      
    • Add your Gemini API key (optional but recommended):
      GEMINI_API_KEY=your-gemini-api-key-here
      
    • Note:
      • Use the SERVICE ROLE KEY (not the anon key) for Supabase backend operations
      • Get Gemini API key from: https://makersuite.google.com/app/apikey
      • If GEMINI_API_KEY is not set:
        • Preprocessing falls back to the rule-based path
        • SAGAR-QC falls back to rule-based test selection (still fully functional)
  4. Run the server:

    # Activate virtual environment (if using one)
    source venv/bin/activate
    
    # Run the server on localhost:8000
    uvicorn main:app --host localhost --port 8000 --reload

    The server will start at: http://localhost:8000

    • Root endpoint: http://localhost:8000/ - Status check
    • Health check: http://localhost:8000/health - Health status
    • API docs: http://localhost:8000/docs - Interactive Swagger UI documentation
    • Process endpoint: http://localhost:8000/process-csv - Main processing endpoint

API Endpoints

POST /process-csv

Processes an uploaded CSV file (which may contain metadata lines above the header):

  1. Cleans CSV: Removes lines above header, ensures proper format
  2. SAGAR-QC Processing: Runs quality control tests and adds QC flags
  3. Converts to Parquet: Converts cleaned CSV (with QC flags) to Parquet format
  4. Uploads to Storage: Uploads to processed-data bucket in Supabase Storage
  5. Stores Metadata: Stores processing metadata and quality report in metadata_sagar table

Request:

  • Method: POST
  • Content-Type: multipart/form-data
  • Body: File upload (a CSV file; metadata lines above the header are allowed)

Response:

{
  "status": "success",
  "processed_file": "filename.parquet",
  "metadata": {
    "columns": [...],
    "inferred_types": {...},
    "total_rows": 1000,
    "quality_control": {
      "summary": {
        "quality_status": "GOOD|SUSPECT|FAIL",
        "flag_summary": {...},
        "total_rows": 1000,
        "tests_executed": [...]
      },
      "detailed_metrics": {
        "overall_quality_score": 95.5,
        "good_percentage": 90.0,
        "suspect_percentage": 8.0,
        "fail_percentage": 2.0
      },
      "test_results": {...}
    },
    "quality_report_json": {
      "summary": {...},
      "detailed_metrics": {...},
      "test_results": {...},
      "test_rationale": "Explanation of why tests were selected",
      "recommendations": [...]
    }
  }
}
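Downstream clients usually need only a few fields out of this payload. A minimal sketch of pulling them out (the helper name `summarize_quality` is hypothetical, not part of the service):

```python
def summarize_quality(response: dict) -> dict:
    """Extract the headline QC figures from a /process-csv response."""
    qc = response["metadata"]["quality_control"]
    return {
        "file": response["processed_file"],
        "status": qc["summary"]["quality_status"],
        "score": qc["detailed_metrics"]["overall_quality_score"],
        "rows": qc["summary"]["total_rows"],
    }

# Example with a response shaped like the one above
sample = {
    "status": "success",
    "processed_file": "filename.parquet",
    "metadata": {
        "quality_control": {
            "summary": {"quality_status": "GOOD", "total_rows": 1000},
            "detailed_metrics": {"overall_quality_score": 95.5},
        }
    },
}
print(summarize_quality(sample))
```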

GET /health

Health check endpoint.

Environment Variables

  • SUPABASE_URL: Your Supabase project URL
  • SUPABASE_KEY: Your Supabase service role key (needed for storage uploads)
  • GEMINI_API_KEY: (Optional) Your Google Gemini API key for AI-powered features
    • Get it from: https://makersuite.google.com/app/apikey
    • Used for:
      • AI-powered CSV preprocessing (file structure analysis)
      • AI-powered QC test selection (intelligent test selection based on data characteristics)
    • If not provided:
      • System uses rule-based preprocessing
      • SAGAR-QC uses rule-based test selection (still functional)
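The fallback decision reduces to a single environment check; a sketch (the function name `select_backends` is illustrative, not from the codebase):

```python
import os

def select_backends() -> dict:
    """Decide AI vs. rule-based paths from the environment (sketch)."""
    has_gemini = bool(os.getenv("GEMINI_API_KEY"))
    return {
        "preprocessing": "gemini" if has_gemini else "rule-based",
        "qc_test_selection": "gemini" if has_gemini else "rule-based",
    }

os.environ.pop("GEMINI_API_KEY", None)
print(select_backends())  # both paths fall back to rule-based
```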

Processing Flow

  1. AI-Powered Preprocessing (if Gemini API key is configured):

    • Uses Google Gemini 2.5 Flash model to analyze file structure
    • Intelligently identifies headers, columns, and data patterns
    • Handles complex delimiters and mixed formats automatically
    • Removes metadata lines and non-data content
    • Converts to clean CSV format with proper quoting
    • Falls back to rule-based preprocessing if API call fails
  2. Rule-Based Preprocessing (fallback or if no Gemini API key):

    • Removes lines above the header row
    • Detects header (first line with commas or tabs)
    • Ensures all data rows match header column count
    • Handles multiple encodings (UTF-8, Latin-1, ISO-8859-1, CP1252)
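These rule-based steps can be sketched as follows (simplified; the real encoding fallback across UTF-8/Latin-1/ISO-8859-1/CP1252 is omitted):

```python
import csv
import io

def clean_csv_text(raw: str) -> str:
    """Drop metadata lines above the header and rows whose column
    count does not match the header (rule-based sketch)."""
    lines = raw.splitlines()
    # Header heuristic from the steps above: first line with a comma or tab.
    start = next(i for i, ln in enumerate(lines) if "," in ln or "\t" in ln)
    delimiter = "\t" if "\t" in lines[start] else ","
    rows = list(csv.reader(io.StringIO("\n".join(lines[start:])), delimiter=delimiter))
    width = len(rows[0])
    kept = [r for r in rows if len(r) == width]  # enforce header column count
    out = io.StringIO()
    csv.writer(out).writerows(kept)
    return out.getvalue()

raw = "Station report\nGenerated 2024-01-01\nid,temp\n1,20.5\n2,21.0\nbadrow\n"
print(clean_csv_text(raw))
```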
  3. Data Processing:

    • Reads cleaned CSV into pandas DataFrame
    • Handles type inference (dates, numeric values)
    • Removes completely empty columns
    • Sanitizes column names for Parquet compatibility
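A pandas sketch of this step (the exact column-name sanitization rule is illustrative):

```python
import re
import pandas as pd

def prepare_dataframe(df: pd.DataFrame) -> pd.DataFrame:
    """Drop empty columns, infer numeric types, and sanitize
    column names for Parquet compatibility (sketch)."""
    df = df.dropna(axis=1, how="all").copy()       # remove completely empty columns
    for col in df.columns:
        converted = pd.to_numeric(df[col], errors="coerce")
        # Convert only if every non-null value parsed as a number
        if converted.notna().sum() == df[col].notna().sum():
            df[col] = converted
    # Replace characters Parquet tools dislike with underscores
    df.columns = [re.sub(r"\W+", "_", str(c)).strip("_") or "col" for c in df.columns]
    return df

raw = pd.DataFrame({"temp (°C)": ["20.5", "21.0"], "note": ["a", "b"], "empty": [None, None]})
print(prepare_dataframe(raw).columns.tolist())
```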
  4. SAGAR-QC Quality Control:

    • Data Analysis: Analyzes DataFrame structure and data characteristics
      • Detects data type (occurrence data vs. sensor data)
      • Identifies numeric, temporal, and coordinate columns
      • Analyzes data patterns and distributions
    • Intelligent Test Selection:
      • AI-Powered (if Gemini API key available): Uses Gemini 2.5 Flash to analyze CSV headers and intelligently select appropriate QC tests
      • Rule-Based (fallback): Uses rule-based logic to select tests based on detected data characteristics
      • Logs which system is used (visible in terminal output)
    • QC Test Execution: Runs selected tests:
      • For Occurrence Data (species records, biodiversity):
        • missing_data (row-wise: checks each record for critical identifier fields)
        • location (if coordinates present: validates GPS coordinates in multiple formats)
      • For Sensor Data (time-series measurements):
        • gross_range (validates data within acceptable ranges)
        • spike (detects sudden unrealistic value changes)
        • flat_line (identifies constant values indicating sensor malfunction)
        • rate_of_change (validates rate of change between consecutive values)
        • temporal_consistency (validates temporal ordering and gaps)
        • climatology (compares against historical climatological ranges)
        • missing_data (column-wise: flags columns with excessive missing values)
        • duplicate_detection (identifies duplicate records)
    • Flag Assignment: Adds flag column to DataFrame with QC flags:
      • GOOD (1): Data passes all applicable tests
      • UNKNOWN (2): Insufficient information to determine quality
      • SUSPECT (3): Data may be questionable but not definitively bad
      • FAIL (4): Data fails quality tests
      • MISSING (9): Data value is missing
    • Quality Report Generation: Creates comprehensive JSON report with:
      • Summary statistics (flag distribution, quality status)
      • Detailed metrics (quality score, percentages)
      • Individual test results (columns checked, rows flagged, sample problematic values)
      • Test rationale (explanation of why specific tests were selected)
      • Recommendations for data improvement
  5. Parquet Conversion:

    • Converts DataFrame (with QC flags) to Parquet format using PyArrow
    • Preserves data types, structure, and QC flags
  6. Storage & Metadata:

    • Uploads Parquet file to processed-data bucket
    • Creates bounding box geometry if coordinates are available
    • Stores metadata and quality report in metadata_sagar table
    • Returns quality report to frontend for display

Testing the API

Using curl:

# Health check
curl http://localhost:8000/health

# Process a CSV file
curl -X POST http://localhost:8000/process-csv \
  -F "file=@path/to/your/file.csv"

Using the Interactive API Documentation:

Visit http://localhost:8000/docs in your browser to access the Swagger UI, where you can:

  • View all available endpoints
  • Test the API directly from the browser
  • See request/response schemas

SAGAR-QC Module

The SAGAR-QC module (SAGAR_QC/) is a proprietary quality control system integrated into the processing pipeline.

Module Structure

SAGAR_QC/
├── __init__.py          # Module initialization and exports
├── qc_flags.py          # QC flag definitions (GOOD, SUSPECT, FAIL, etc.)
├── qc_tests.py          # QC test implementations
├── qc_analyzer.py       # Data analysis and intelligent test selection
└── qc_pipeline.py       # QC pipeline orchestration

Available QC Tests

IOOS-QC/QARTOD Standard Tests:

  • gross_range_test: Validates data within acceptable physical/biological ranges
  • spike_test: Detects sudden, unrealistic value changes
  • flat_line_test: Identifies constant values (sensor malfunction indicator)
  • rate_of_change_test: Validates rate of change between consecutive values
  • climatology_test: Compares against historical climatological ranges
  • temporal_consistency_test: Validates temporal ordering and gaps

SAGAR-Specific Tests:

  • location_test: Validates GPS coordinates in multiple formats:
    • Decimal Degrees (DD): 12.9716, 77.5946
    • NMEA 0183 DDMM.MMMM: 958.217, 7614.599 (variable length support)
    • NMEA 0183 DDMMSS.SSSS: 095821.7, 0761459.9
    • Degrees Decimal Minutes (DDM): 12°58.217', 77°14.599'
    • Degrees Minutes Seconds (DMS): 12°58'13", 77°14'36"
    • UTM: Universal Transverse Mercator coordinates
  • missing_data_test: Column-wise missing data analysis for sensor data
  • missing_data_test_occurrence: Row-wise missing data check for occurrence data
  • duplicate_detection_test: Identifies duplicate records
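As an example, the NMEA 0183 DDMM.MMMM case splits the integer part into degrees and minutes; a simplified sketch (hemisphere signs and the other formats are omitted):

```python
def nmea_ddmm_to_decimal(value: str) -> float:
    """Convert NMEA 0183 DDMM.MMMM to decimal degrees.
    The last two digits of the integer part are whole minutes;
    whatever precedes them (variable length) is degrees."""
    number = float(value)
    degrees = int(number // 100)        # e.g. 958.217 -> 9
    minutes = number - degrees * 100    # -> 58.217
    return degrees + minutes / 60.0

print(round(nmea_ddmm_to_decimal("958.217"), 5))   # ~9.97028
print(round(nmea_ddmm_to_decimal("7614.599"), 5))  # ~76.24332
```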

Intelligent Test Selection

The system intelligently selects appropriate tests based on data characteristics:

  • AI-Powered (Gemini 2.5 Flash):

    • Analyzes CSV headers and data structure
    • Distinguishes between occurrence data and sensor data
    • Selects only relevant tests to avoid false positives
    • Provides rationale for test selection
  • Rule-Based (Fallback):

    • Analyzes data structure programmatically
    • Detects occurrence data by column names (occurrenceID, scientificName, etc.)
    • Detects sensor data by temporal columns and numeric patterns
    • Applies appropriate tests based on detected characteristics
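A sketch of that detection logic (the marker column names beyond occurrenceID and scientificName are illustrative):

```python
# Darwin Core-style column names that indicate occurrence data;
# entries beyond occurrenceID/scientificName are illustrative
OCCURRENCE_COLUMNS = {"occurrenceid", "scientificname", "basisofrecord", "eventdate"}

def detect_data_type(columns: list) -> str:
    """Rule-based sketch: occurrence-style column names mean
    occurrence data; otherwise assume sensor data."""
    normalized = {str(c).strip().lower() for c in columns}
    return "occurrence" if normalized & OCCURRENCE_COLUMNS else "sensor"

print(detect_data_type(["occurrenceID", "scientificName", "decimalLatitude"]))  # occurrence
print(detect_data_type(["timestamp", "temperature", "salinity"]))               # sensor
```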

Data Type Awareness

The system automatically detects and handles different data types:

  • Occurrence Data (species records, biodiversity data):

    • Uses row-wise testing (each record checked individually)
    • Applies: missing_data (row-wise), location (if coordinates present)
    • Skips time-series tests (spike, flat_line, rate_of_change, etc.)
  • Sensor Data (time-series measurements):

    • Uses column-wise testing (each column analyzed separately)
    • Applies full suite of QC tests including temporal and range-based tests

QC Flags

Standard QC flags applied to data:

  • GOOD (1): Data passes all applicable tests
  • UNKNOWN (2): Insufficient information to determine quality
  • SUSPECT (3): Data may be questionable but not definitively bad
  • FAIL (4): Data fails quality tests
  • MISSING (9): Data value is missing

Flags are combined using worst-case logic (e.g., if one test flags SUSPECT and another flags FAIL, the final flag is FAIL).
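That combination rule can be sketched with an explicit severity order (ranking MISSING above FAIL here is an assumption; the text above only specifies that FAIL outranks SUSPECT):

```python
# Flag codes as defined above
GOOD, UNKNOWN, SUSPECT, FAIL, MISSING = 1, 2, 3, 4, 9

# Severity order is an assumption beyond "FAIL beats SUSPECT"
SEVERITY = {GOOD: 0, UNKNOWN: 1, SUSPECT: 2, FAIL: 3, MISSING: 4}

def combine_flags(flags: list) -> int:
    """Worst-case combination: the most severe flag wins."""
    return max(flags, key=SEVERITY.__getitem__)

print(combine_flags([SUSPECT, FAIL]))  # 4 (FAIL wins, per the example above)
print(combine_flags([GOOD, UNKNOWN]))  # 2
```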

Quality Reports

The system generates comprehensive quality reports in JSON format, including:

  • Summary statistics (flag distribution, quality status)
  • Detailed metrics (quality score, percentages per flag type)
  • Individual test results (columns checked, rows flagged, sample problematic values)
  • Test rationale (explanation of why specific tests were selected)
  • Recommendations for data improvement

Reports are stored in the metadata_sagar table and returned to the frontend for interactive display and PDF generation.

Notes

  • AI-Powered Features: With Gemini API key, the service uses AI for:
    • Intelligent CSV preprocessing (better handling of complex file structures)
    • Intelligent QC test selection (reduces false positives by selecting only relevant tests)
  • Fallback Processing: Without Gemini API key:
    • Uses rule-based preprocessing
    • SAGAR-QC uses rule-based test selection (still fully functional)
  • QC Processing: SAGAR-QC runs automatically after CSV cleaning and before Parquet conversion
  • QC Flags: All processed data includes a flag column with QC validation results
  • Quality Reports: Comprehensive reports are generated and stored with metadata
  • Parquet files are uploaded to the processed-data bucket
  • Metadata and quality reports are stored in the metadata_sagar table
  • The service handles comma, tab, and space-separated files
  • Supports multiple text encodings for international character sets
  • Server runs with auto-reload enabled (restarts on code changes)
  • Large files (>200k chars) are processed with sampling for efficiency
  • Terminal logs indicate whether Gemini AI or rule-based system is used for test selection
