Model Observability System

A production-grade machine learning model monitoring and observability platform that detects data drift, concept drift, and performance degradation in real time.

Architecture

System Design

┌─────────────────┐
│   ML Models     │
│  (Production)   │
└────────┬────────┘
         │ Prediction Logs
         ▼
┌─────────────────────────────────────────────────────────┐
│                    FastAPI Backend                      │
├─────────────────────────────────────────────────────────┤
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐ │
│  │ Ingestion    │  │ Drift        │  │ Alerting     │ │
│  │ Service      │  │ Detection    │  │ Service      │ │
│  └──────────────┘  └──────────────┘  └──────────────┘ │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐ │
│  │ Performance  │  │ Statistics   │  │ Dashboard    │ │
│  │ Monitoring   │  │ Engine       │  │ API          │ │
│  └──────────────┘  └──────────────┘  └──────────────┘ │
└─────────────────────────────────────────────────────────┘
         │                    │
         ▼                    ▼
┌──────────────────┐  ┌──────────────────┐
│   PostgreSQL     │  │      Redis       │
│  (Prediction     │  │   (Statistics    │
│   Logs & Stats)  │  │     Cache)       │
└──────────────────┘  └──────────────────┘

Components

  1. Ingestion Service: Receives and validates prediction logs from production models
  2. Drift Detection Engine: Implements statistical tests for distribution changes
  3. Performance Monitor: Tracks accuracy and performance metrics over time
  4. Alerting Service: Threshold-based alerting with configurable rules
  5. Statistics Engine: Computes and caches metrics using Redis
  6. Dashboard API: Exposes monitoring data for visualization

Drift Detection Techniques

1. Kolmogorov-Smirnov (KS) Test

Method: Non-parametric test comparing cumulative distribution functions (CDFs)

Implementation:

  • Compares empirical CDFs of baseline vs. production data
  • Statistic: Maximum vertical distance between CDFs
  • P-value threshold: typically 0.05

Advantages:

  • Distribution-free (no assumptions about data shape)
  • Sensitive to location and shape differences
  • Well-established statistical test

Limitations:

  • Works only for continuous/numerical features
  • Less sensitive to tail differences
  • Requires sufficient sample size (n > 30 recommended)

Use Case: Primary drift detector for numerical features
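The check described above can be sketched with `scipy.stats.ks_2samp`; the helper name, threshold, and sample data below are illustrative, not the project's actual API:

```python
# KS-based drift check sketch: compare baseline vs. production samples
# of a numerical feature and flag drift when p-value falls below alpha.
import numpy as np
from scipy.stats import ks_2samp

def ks_drift(baseline: np.ndarray, production: np.ndarray, alpha: float = 0.05):
    """Return (statistic, p_value, is_drifted) for one numerical feature."""
    stat, p_value = ks_2samp(baseline, production)
    return stat, p_value, p_value < alpha

rng = np.random.default_rng(42)
baseline = rng.normal(125.0, 90.0, size=5_000)   # training-time amounts
shifted  = rng.normal(160.0, 90.0, size=5_000)   # mean-shifted production data

stat, p, drifted = ks_drift(baseline, shifted)
```

With a mean shift this large relative to the spread, the test rejects the null hypothesis and `drifted` is true; comparing a sample against itself yields a p-value of 1 and no drift.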

2. Population Stability Index (PSI)

Method: Measures distribution shift using binned data

Formula:

PSI = Σ (Actual% - Expected%) × ln(Actual% / Expected%)    (summed over all bins)

Interpretation:

  • PSI < 0.1: No significant change
  • 0.1 ≤ PSI < 0.25: Moderate change, investigate
  • PSI ≥ 0.25: Significant change, action required

Advantages:

  • Industry-standard in credit scoring and finance
  • Works with both numerical and categorical features
  • Intuitive interpretation
  • Less sensitive to sample size

Limitations:

  • Binning strategy affects results
  • Can miss subtle drifts within bins
  • Requires careful bin selection (typically 10-20 bins)

Use Case: Feature stability monitoring, categorical drift detection
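A minimal implementation of the formula above might look like this; quantile binning over the baseline and the smoothing constant are assumptions, and the project's binning strategy may differ:

```python
# PSI sketch: bin edges come from the baseline (quantile binning) so that
# expected bin frequencies are roughly uniform; empty bins are clipped to a
# small floor to avoid log(0).
import numpy as np

def psi(baseline: np.ndarray, production: np.ndarray, bins: int = 10) -> float:
    edges = np.quantile(baseline, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf          # catch out-of-range values
    expected = np.histogram(baseline, bins=edges)[0] / len(baseline)
    actual   = np.histogram(production, bins=edges)[0] / len(production)
    expected = np.clip(expected, 1e-6, None)       # smoothing floor
    actual   = np.clip(actual, 1e-6, None)
    return float(np.sum((actual - expected) * np.log(actual / expected)))

rng = np.random.default_rng(0)
stable  = psi(rng.normal(0, 1, 10_000), rng.normal(0, 1, 10_000))
drifted = psi(rng.normal(0, 1, 10_000), rng.normal(0.5, 1, 10_000))
```

Two samples from the same distribution land well under the 0.1 threshold, while a half-standard-deviation mean shift pushes PSI into the "investigate" range.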

3. Chi-Square Test (Categorical Features)

Method: Goodness-of-fit test comparing observed category frequencies against those expected under the baseline distribution

Implementation:

  • Compares category distributions
  • Tests null hypothesis: distributions are identical

Advantages:

  • Standard test for categorical data
  • Provides p-value for statistical significance

Use Case: Categorical feature drift detection
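As a hedged sketch, the check can compare observed production counts against counts expected from the baseline proportions via `scipy.stats.chisquare`; the helper and the category counts are illustrative:

```python
# Chi-square goodness-of-fit sketch for one categorical feature: expected
# counts are the baseline proportions scaled to the production sample size.
from scipy.stats import chisquare

def chi_square_drift(baseline_counts: dict, production_counts: dict,
                     alpha: float = 0.05):
    cats = sorted(set(baseline_counts) | set(production_counts))
    obs = [production_counts.get(c, 0) for c in cats]
    n = sum(obs)
    base_total = sum(baseline_counts.values())
    exp = [n * baseline_counts.get(c, 0) / base_total for c in cats]
    stat, p_value = chisquare(obs, f_exp=exp)
    return stat, p_value, p_value < alpha

baseline   = {"retail": 3500, "food": 2500, "online": 2000, "other": 2000}
production = {"retail": 200, "food": 250, "online": 400, "other": 150}  # "online" spiked

stat, p, drifted = chi_square_drift(baseline, production)
```

Note that categories with a baseline count of zero would need smoothing or grouping before this test applies, which ties into the high-cardinality limitation discussed later.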

Alerting System

Alert Types

  1. Data Drift Alert: Triggered when feature distributions shift significantly
  2. Concept Drift Alert: Triggered when model performance degrades
  3. Volume Alert: Triggered by unusual prediction volume patterns

Configurable Thresholds

# config/thresholds.yaml
drift_detection:
  ks_test_threshold: 0.05        # p-value threshold
  psi_threshold: 0.25             # PSI critical value
  chi_square_threshold: 0.05      # p-value threshold
  
performance:
  accuracy_drop_threshold: 0.05   # 5% drop triggers alert
  min_samples_for_eval: 100       # Minimum samples before evaluation
  
volume:
  prediction_volume_std: 3.0      # Standard deviations for anomaly
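The thresholds above might be consumed along these lines; a plain dict stands in for the parsed YAML, and the volume rule is implemented as a z-score check (all names here are illustrative, not the project's actual code):

```python
# Threshold-driven volume anomaly sketch: flag the current window when its
# prediction count sits more than N standard deviations from the historical mean.
from statistics import mean, stdev

thresholds = {  # stand-in for the parsed config/thresholds.yaml
    "drift_detection": {"ks_test_threshold": 0.05, "psi_threshold": 0.25},
    "performance": {"accuracy_drop_threshold": 0.05, "min_samples_for_eval": 100},
    "volume": {"prediction_volume_std": 3.0},
}

def volume_anomaly(history: list, current: int, cfg: dict) -> bool:
    """True if `current` is more than the configured number of standard
    deviations away from the mean of the historical windows."""
    mu, sigma = mean(history), stdev(history)
    return abs(current - mu) > cfg["volume"]["prediction_volume_std"] * sigma

history = [980, 1010, 1005, 995, 1020, 990]         # predictions per window
normal = volume_anomaly(history, 1000, thresholds)  # within the band
spike  = volume_anomaly(history, 2500, thresholds)  # far outside the band
```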

Storage Strategy

PostgreSQL (Primary Storage)

Schema Design:

-- Prediction logs, partitioned by date (column types are illustrative)
CREATE TABLE predictions (
  id            BIGSERIAL,
  model_id      TEXT NOT NULL,
  timestamp     TIMESTAMPTZ NOT NULL,
  features_json JSONB,
  prediction    DOUBLE PRECISION,
  ground_truth  DOUBLE PRECISION,
  created_at    TIMESTAMPTZ DEFAULT now()
) PARTITION BY RANGE (timestamp);

-- Baseline distributions for drift comparison
CREATE TABLE baselines (
  model_id           TEXT NOT NULL,
  feature_name       TEXT NOT NULL,
  distribution_stats JSONB,
  version            TEXT,
  created_at         TIMESTAMPTZ DEFAULT now()
);

-- Drift detection results
CREATE TABLE drift_results (
  model_id     TEXT NOT NULL,
  feature_name TEXT NOT NULL,
  test_type    TEXT,
  statistic    DOUBLE PRECISION,
  p_value      DOUBLE PRECISION,
  timestamp    TIMESTAMPTZ NOT NULL,
  is_drifted   BOOLEAN
);

-- Performance metrics over time
CREATE TABLE performance_metrics (
  model_id     TEXT NOT NULL,
  window_start TIMESTAMPTZ,
  window_end   TIMESTAMPTZ,
  accuracy     DOUBLE PRECISION,
  precision    DOUBLE PRECISION,
  recall       DOUBLE PRECISION,
  sample_count INTEGER
);

Optimizations:

  • Time-series partitioning for efficient queries
  • Indexes on model_id, timestamp, and composite keys
  • Retention policies (e.g., raw logs: 90 days, aggregates: 1 year)

Redis (Caching Layer)

Cached Data:

  • Recent statistics (1-hour, 24-hour windows)
  • Baseline distribution summaries
  • Alert states and counts
  • Real-time prediction volumes

TTL Strategy:

  • Real-time metrics: 5 minutes
  • Hourly aggregates: 1 hour
  • Daily aggregates: 24 hours
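The tiers above can be exercised with a small in-process stand-in; a real deployment would use redis-py's `setex`, but this sketch demonstrates the lazy-expiry semantics the TTL strategy relies on:

```python
# Illustrative TTL cache: each key is stored with the TTL of its tier and
# expires on read, mimicking how Redis TTLs behave from a client's viewpoint.
import time

TTL = {"realtime": 300, "hourly": 3600, "daily": 86400}  # seconds per tier

class TTLCache:
    def __init__(self):
        self._store = {}                       # key -> (value, expiry_timestamp)

    def set(self, key: str, value, tier: str):
        self._store[key] = (value, time.monotonic() + TTL[tier])

    def get(self, key: str):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires = entry
        if time.monotonic() >= expires:        # lazy expiry on access
            del self._store[key]
            return None
        return value

cache = TTLCache()
cache.set("stats:fraud_detector_v2:1h", {"psi_max": 0.12}, tier="hourly")
```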

Performance Considerations

Scalability Strategies

1. Batch Processing

  • Buffer incoming predictions in Redis
  • Process drift detection in scheduled batches (e.g., every 15 minutes)
  • Reduces database load and improves latency

2. Sampling for Drift Detection

  • Use reservoir sampling for large volumes (>10K predictions/hour)
  • Maintains statistical validity while reducing computation
  • Configurable sampling rate based on traffic
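The sampling step can be sketched with classic reservoir sampling (Algorithm R), which keeps a uniform k-element sample of a stream without storing the whole stream; the function below is illustrative, not the system's actual implementation:

```python
# Reservoir sampling (Algorithm R): after seeing i+1 items, each item has
# been retained with probability k / (i + 1), giving a uniform sample.
import random

def reservoir_sample(stream, k: int, seed: int = 0):
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)             # fill the reservoir first
        else:
            j = rng.randint(0, i)              # j uniform over 0..i
            if j < k:                          # replace with probability k/(i+1)
                reservoir[j] = item
    return reservoir

sample = reservoir_sample(range(100_000), k=500)
```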

3. Async Processing

  • Use Celery/RQ for background tasks:
    • Drift detection computation
    • Performance metric aggregation
    • Alert generation and delivery
  • Non-blocking API responses

4. Horizontal Scaling

  • Stateless API design enables load balancing
  • Redis cluster for distributed caching
  • PostgreSQL read replicas for dashboard queries

Expected Performance

Workload                   Throughput    Latency (p95)
Prediction ingestion       10K req/s     < 50 ms
Drift detection (batch)    1M samples    < 5 min
Dashboard queries          100 req/s     < 200 ms

Hardware Assumptions: 4 CPU cores, 16GB RAM, SSD storage

Configuration

Environment Variables

# Database
DATABASE_URL=postgresql://user:pass@localhost:5432/ml_observability
REDIS_URL=redis://localhost:6379/0

# API
API_HOST=0.0.0.0
API_PORT=8000
LOG_LEVEL=INFO

# Drift Detection
KS_TEST_THRESHOLD=0.05
PSI_THRESHOLD=0.25
DRIFT_CHECK_INTERVAL=900  # seconds (15 minutes)

# Performance Monitoring
MIN_SAMPLES_FOR_EVAL=100
ACCURACY_DROP_THRESHOLD=0.05

# Alerting
ALERT_WEBHOOK_URL=https://your-webhook.com/alerts
ALERT_EMAIL=alerts@yourcompany.com

Quick Start

Installation

# Clone repository
git clone https://github.com/yourusername/model-observability-system.git
cd model-observability-system

# Create virtual environment
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Set up database
alembic upgrade head

# Start Redis
docker run -d -p 6379:6379 redis:7-alpine

# Run application
uvicorn app.main:app --host 0.0.0.0 --port 8000 --reload

Sending Prediction Logs

import requests

# Log a prediction
response = requests.post(
    "http://localhost:8000/api/v1/predictions",
    json={
        "model_id": "fraud_detector_v2",
        "features": {
            "transaction_amount": 150.50,
            "merchant_category": "retail",
            "user_age": 35,
            "account_age_days": 450
        },
        "prediction": 0,  # 0: legitimate, 1: fraud
        "ground_truth": None  # Will be updated later
    }
)

Registering a Baseline

# Upload training data distribution as baseline
response = requests.post(
    "http://localhost:8000/api/v1/baselines",
    json={
        "model_id": "fraud_detector_v2",
        "version": "v2.1.0",
        "features": {
            "transaction_amount": {
                "type": "numerical",
                "mean": 125.30,
                "std": 89.45,
                "min": 0.01,
                "max": 9999.99,
                "samples": [/* sample data */]
            },
            "merchant_category": {
                "type": "categorical",
                "distribution": {
                    "retail": 0.35,
                    "food": 0.25,
                    "online": 0.20,
                    "other": 0.20
                }
            }
        }
    }
)

Checking Drift Status

# Get drift metrics for a model
curl http://localhost:8000/api/v1/drift/fraud_detector_v2

# Response
{
  "model_id": "fraud_detector_v2",
  "timestamp": "2026-01-15T10:30:00Z",
  "features": {
    "transaction_amount": {
      "ks_statistic": 0.08,
      "ks_p_value": 0.15,
      "psi": 0.12,
      "is_drifted": false,
      "drift_severity": "low"
    },
    "merchant_category": {
      "chi_square_statistic": 12.5,
      "chi_square_p_value": 0.006,
      "psi": 0.28,
      "is_drifted": true,
      "drift_severity": "high"
    }
  },
  "alerts": [
    {
      "type": "data_drift",
      "feature": "merchant_category",
      "severity": "high",
      "message": "Significant distribution shift detected"
    }
  ]
}

API Endpoints

Prediction Logging

  • POST /api/v1/predictions - Log a single prediction
  • POST /api/v1/predictions/batch - Log multiple predictions
  • PATCH /api/v1/predictions/{id}/ground-truth - Update ground truth

Baseline Management

  • POST /api/v1/baselines - Register baseline distribution
  • GET /api/v1/baselines/{model_id} - Get current baseline
  • PUT /api/v1/baselines/{model_id} - Update baseline

Drift Monitoring

  • GET /api/v1/drift/{model_id} - Get current drift status
  • GET /api/v1/drift/{model_id}/history - Historical drift trends
  • POST /api/v1/drift/{model_id}/check - Trigger manual drift check

Performance Monitoring

  • GET /api/v1/performance/{model_id} - Current performance metrics
  • GET /api/v1/performance/{model_id}/trends - Performance over time

Alerts

  • GET /api/v1/alerts - List active alerts
  • GET /api/v1/alerts/{model_id} - Alerts for specific model
  • POST /api/v1/alerts/acknowledge/{alert_id} - Acknowledge alert

Health & Metrics

  • GET /health - Health check
  • GET /metrics - Prometheus metrics

Trade-offs & Limitations

Current Limitations

  1. Statistical Power

    • Drift tests require sufficient sample size (typically 100+ samples)
    • Low-traffic models may have delayed drift detection
    • Mitigation: Configurable time windows and sample thresholds
  2. Feature Engineering

    • System monitors input features, not engineered features
    • Drift in engineered features requires separate tracking
    • Mitigation: Log both raw and engineered features if needed
  3. Multivariate Drift

    • Univariate tests may miss interactions between features
    • Correlation drift not directly captured
    • Future Enhancement: Implement multivariate drift tests (e.g., MMD)
  4. Ground Truth Latency

    • Performance monitoring requires ground truth labels
    • Labels may arrive hours/days after prediction
    • Mitigation: Separate real-time drift detection from delayed performance monitoring
  5. High Cardinality Categorical Features

    • PSI and Chi-square less effective with 100+ categories
    • Requires grouping strategies
    • Mitigation: Auto-group rare categories, use entropy-based metrics

Design Trade-offs

Choice                  Benefit                               Trade-off
Batch drift detection   Lower compute cost, better batching   15-min detection delay
Redis caching           Fast dashboard queries                Eventual consistency
PostgreSQL              ACID guarantees, SQL queries          Harder to scale writes
Univariate tests        Simple, interpretable                 Misses feature interactions
Fixed time windows      Consistent evaluation                 May miss sudden spikes

Production Deployment

Docker Deployment

# Build and run with docker-compose
docker-compose up -d

# Scale API workers
docker-compose up -d --scale api=3

Kubernetes (Production)

# Deploy to Kubernetes
kubectl apply -f k8s/

# Components:
# - API deployment (3 replicas)
# - Background worker deployment (2 replicas)
# - PostgreSQL StatefulSet
# - Redis deployment
# - Nginx ingress

Monitoring & Observability

Metrics Exported:

  • Prediction ingestion rate
  • Drift detection computation time
  • Alert generation rate
  • Database query latency
  • Cache hit rate

Logging:

  • Structured JSON logging
  • Request/response logging
  • Error tracking with stack traces
  • Audit logs for baseline changes

Recommended Stack:

  • Prometheus + Grafana for metrics
  • ELK/Loki for log aggregation
  • PagerDuty/Opsgenie for alert routing

Testing

# Run unit tests
pytest tests/unit -v

# Run integration tests
pytest tests/integration -v

# Run with coverage
pytest --cov=app --cov-report=html

# Load testing
locust -f tests/load/locustfile.py

Contributing

See CONTRIBUTING.md for development guidelines.

License

MIT License - see LICENSE file for details.


Note: This is a production-grade framework. Customize thresholds, time windows, and alerting channels based on your specific use case and SLAs.
