A production-grade machine learning model monitoring and observability platform that detects data drift, concept drift, and performance degradation in real time.
┌─────────────────┐
│ ML Models │
│ (Production) │
└────────┬────────┘
│ Prediction Logs
▼
┌─────────────────────────────────────────────────────────┐
│ FastAPI Backend │
├─────────────────────────────────────────────────────────┤
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Ingestion │ │ Drift │ │ Alerting │ │
│ │ Service │ │ Detection │ │ Service │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Performance │ │ Statistics │ │ Dashboard │ │
│ │ Monitoring │ │ Engine │ │ API │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
└─────────────────────────────────────────────────────────┘
│ │
▼ ▼
┌──────────────────┐ ┌──────────────────┐
│ PostgreSQL │ │ Redis │
│ (Prediction │ │ (Statistics │
│ Logs & Stats) │ │ Cache) │
└──────────────────┘ └──────────────────┘
- Ingestion Service: Receives and validates prediction logs from production models
- Drift Detection Engine: Implements statistical tests for distribution changes
- Performance Monitor: Tracks accuracy and performance metrics over time
- Alerting Service: Threshold-based alerting with configurable rules
- Statistics Engine: Computes and caches metrics using Redis
- Dashboard API: Exposes monitoring data for visualization
Method: Kolmogorov–Smirnov (KS) test, a non-parametric test comparing cumulative distribution functions (CDFs)
Implementation:
- Compares empirical CDFs of baseline vs. production data
- Statistic: Maximum vertical distance between CDFs
- P-value threshold: typically 0.05
Advantages:
- Distribution-free (no assumptions about data shape)
- Sensitive to location and shape differences
- Well-established statistical test
Limitations:
- Works only for continuous/numerical features
- Less sensitive to tail differences
- Requires sufficient sample size (n > 30 recommended)
Use Case: Primary drift detector for numerical features
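A minimal sketch of this check using scipy.stats.ks_2samp; the baseline/production arrays are synthetic placeholders, and the 0.05 threshold mirrors the configuration shown later:

# ks_drift_check.py - sketch of the KS drift check described above
import numpy as np
from scipy.stats import ks_2samp

def ks_drift_check(baseline: np.ndarray, production: np.ndarray, alpha: float = 0.05):
    """Return the KS statistic, p-value, and a drift flag for one numerical feature."""
    statistic, p_value = ks_2samp(baseline, production)
    return statistic, p_value, p_value < alpha  # drift flagged when p-value falls below alpha

# Synthetic example: production values shifted relative to baseline
rng = np.random.default_rng(42)
baseline = rng.normal(loc=125.0, scale=90.0, size=5_000)
production = rng.normal(loc=160.0, scale=90.0, size=5_000)
print(ks_drift_check(baseline, production))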
Method: Measures distribution shift using binned data
Formula:
PSI = Σ (Actual% - Expected%) × ln(Actual% / Expected%)
Interpretation:
- PSI < 0.1: No significant change
- 0.1 ≤ PSI < 0.25: Moderate change, investigate
- PSI ≥ 0.25: Significant change, action required
Advantages:
- Industry-standard in credit scoring and finance
- Works with both numerical and categorical features
- Intuitive interpretation
- Less sensitive to sample size
Limitations:
- Binning strategy affects results
- Can miss subtle drifts within bins
- Requires careful bin selection (typically 10-20 bins)
Use Case: Feature stability monitoring, categorical drift detection
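A minimal sketch of the PSI calculation above, assuming quantile bins derived from the baseline; the bin count, clipping of production values into the baseline range, and the epsilon guard against empty bins are implementation choices, not fixed by the design:

# psi.py - sketch of the PSI formula above (quantile bins from the baseline)
import numpy as np

def population_stability_index(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """PSI = sum over bins of (actual% - expected%) * ln(actual% / expected%)."""
    edges = np.quantile(expected, np.linspace(0.0, 1.0, bins + 1))
    edges = np.unique(edges)  # collapse duplicate edges for low-variance features
    # Clip production values into the baseline range so extremes land in the outer bins
    actual = np.clip(actual, edges[0], edges[-1])
    expected_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    actual_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    eps = 1e-4  # avoid log(0) and division by zero for empty bins
    expected_pct = np.clip(expected_pct, eps, None)
    actual_pct = np.clip(actual_pct, eps, None)
    return float(np.sum((actual_pct - expected_pct) * np.log(actual_pct / expected_pct)))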
Method: Chi-square test of independence between observed and expected category frequencies
Implementation:
- Compares category distributions
- Tests null hypothesis: distributions are identical
Advantages:
- Standard test for categorical data
- Provides p-value for statistical significance
Use Case: Categorical feature drift detection
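A minimal sketch of this check using scipy.stats.chi2_contingency on a 2×k table of baseline vs. production category counts; the inputs below are placeholders:

# chi_square_drift.py - sketch of the categorical drift check described above
from collections import Counter
from scipy.stats import chi2_contingency

def chi_square_drift(baseline_labels, production_labels, alpha: float = 0.05):
    """Compare category frequencies between the baseline and a production window."""
    categories = sorted(set(baseline_labels) | set(production_labels))
    baseline_counts = Counter(baseline_labels)
    production_counts = Counter(production_labels)
    table = [
        [baseline_counts.get(c, 0) for c in categories],
        [production_counts.get(c, 0) for c in categories],
    ]
    statistic, p_value, _, _ = chi2_contingency(table)
    return statistic, p_value, p_value < alpha

print(chi_square_drift(["retail"] * 70 + ["food"] * 30, ["retail"] * 40 + ["food"] * 60))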
- Data Drift Alert: Triggered when feature distributions shift significantly
- Concept Drift Alert: Triggered when model performance degrades
- Volume Alert: Triggered by unusual prediction volume patterns
# config/thresholds.yaml
drift_detection:
  ks_test_threshold: 0.05        # p-value threshold
  psi_threshold: 0.25            # PSI critical value
  chi_square_threshold: 0.05     # p-value threshold
performance:
  accuracy_drop_threshold: 0.05  # 5% drop triggers alert
  min_samples_for_eval: 100      # Minimum samples before evaluation
volume:
  prediction_volume_std: 3.0     # Standard deviations for anomaly detection

Schema Design:
-- Prediction logs with partitioning by date
predictions (
id, model_id, timestamp, features_json,
prediction, ground_truth, created_at
) PARTITION BY RANGE (timestamp);
-- Baseline distributions for drift comparison
baselines (
model_id, feature_name, distribution_stats,
version, created_at
);
-- Drift detection results
drift_results (
model_id, feature_name, test_type,
statistic, p_value, timestamp, is_drifted
);
-- Performance metrics over time
performance_metrics (
model_id, window_start, window_end,
accuracy, precision, recall, sample_count
);

Optimizations:
- Time-series partitioning for efficient queries
- Indexes on model_id, timestamp, and composite keys
- Retention policies (e.g., raw logs: 90 days, aggregates: 1 year)
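As a sketch of how the partitioning and indexes might be applied from Python (psycopg2 is an assumed dependency; the partition and index names are illustrative, and the connection string reuses the DATABASE_URL example shown later):

# maintenance.py - sketch of partition/index setup for the schema above
import psycopg2

ddl = [
    # Monthly partition of the predictions table (PARTITION BY RANGE (timestamp))
    """CREATE TABLE IF NOT EXISTS predictions_2026_01 PARTITION OF predictions
       FOR VALUES FROM ('2026-01-01') TO ('2026-02-01');""",
    # Composite indexes to support per-model, time-windowed queries
    "CREATE INDEX IF NOT EXISTS idx_predictions_model_ts ON predictions (model_id, timestamp);",
    "CREATE INDEX IF NOT EXISTS idx_drift_results_model_ts ON drift_results (model_id, timestamp);",
]

with psycopg2.connect("postgresql://user:pass@localhost:5432/ml_observability") as conn:
    with conn.cursor() as cur:
        for statement in ddl:
            cur.execute(statement)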
Cached Data:
- Recent statistics (1-hour, 24-hour windows)
- Baseline distribution summaries
- Alert states and counts
- Real-time prediction volumes
TTL Strategy:
- Real-time metrics: 5 minutes
- Hourly aggregates: 1 hour
- Daily aggregates: 24 hours
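A minimal sketch of this TTL strategy with redis-py, assuming the REDIS_URL from the configuration later in this document; the key names and JSON encoding are illustrative choices:

# cache.py - sketch of the TTL strategy above using redis-py
import json
import redis

r = redis.Redis.from_url("redis://localhost:6379/0")

def cache_realtime_metrics(model_id: str, metrics: dict) -> None:
    # Real-time metrics expire after 5 minutes
    r.setex(f"metrics:realtime:{model_id}", 300, json.dumps(metrics))

def cache_hourly_aggregate(model_id: str, aggregate: dict) -> None:
    # Hourly aggregates expire after 1 hour
    r.setex(f"metrics:hourly:{model_id}", 3600, json.dumps(aggregate))

def get_cached(key: str):
    raw = r.get(key)
    return json.loads(raw) if raw else None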
- Buffer incoming predictions in Redis
- Process drift detection in scheduled batches (e.g., every 15 minutes)
- Reduces database load and improves latency
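A sketch of the buffering approach above using a Redis list; the key name and drain size are assumptions, and the scheduled drift job would call drain_buffer before running its checks:

# ingest_buffer.py - sketch of buffering predictions in Redis for batch processing
import json
import redis

r = redis.Redis.from_url("redis://localhost:6379/0")
BUFFER_KEY = "predictions:buffer"  # illustrative key name

def buffer_prediction(prediction: dict) -> None:
    """Append one prediction log to the buffer; the API can return immediately."""
    r.rpush(BUFFER_KEY, json.dumps(prediction))

def drain_buffer(max_items: int = 10_000) -> list[dict]:
    """Atomically pop up to max_items buffered predictions for the scheduled job."""
    pipe = r.pipeline()
    pipe.lrange(BUFFER_KEY, 0, max_items - 1)
    pipe.ltrim(BUFFER_KEY, max_items, -1)
    items, _ = pipe.execute()
    return [json.loads(item) for item in items]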
- Use reservoir sampling for large volumes (>10K predictions/hour)
- Maintains statistical validity while reducing computation
- Configurable sampling rate based on traffic
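A sketch of reservoir sampling (Algorithm R), which keeps a fixed-size uniform sample of a prediction stream; the default capacity here is an assumption:

# sampling.py - sketch of reservoir sampling for high-volume ingestion
import random

class ReservoirSampler:
    """Keep a uniform random sample of at most `capacity` items from a stream."""

    def __init__(self, capacity: int = 10_000, seed=None):
        self.capacity = capacity
        self.seen = 0
        self.sample: list = []
        self._rng = random.Random(seed)

    def add(self, item) -> None:
        self.seen += 1
        if len(self.sample) < self.capacity:
            self.sample.append(item)
        else:
            # Replace an existing element with probability capacity / seen
            j = self._rng.randrange(self.seen)
            if j < self.capacity:
                self.sample[j] = item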
- Use Celery/RQ for background tasks:
- Drift detection computation
- Performance metric aggregation
- Alert generation and delivery
- Non-blocking API responses
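A sketch of how these background tasks could be wired with Celery, reusing Redis as the broker; the module/task names, arguments, and the 15-minute beat schedule are illustrative, and the task bodies are elided:

# tasks.py - sketch of the background jobs described above (Celery)
from celery import Celery

app = Celery("observability", broker="redis://localhost:6379/0")

@app.task
def run_drift_checks(model_id: str) -> None:
    """Scheduled job: compare recent production data against the stored baseline."""
    ...  # load baseline + recent window, run KS / PSI / chi-square, persist drift_results

@app.task
def aggregate_performance(model_id: str) -> None:
    """Join predictions with late-arriving ground truth and compute windowed metrics."""
    ...  # compute accuracy/precision/recall per window, write performance_metrics

# Periodic scheduling (e.g., every 15 minutes) via Celery beat
app.conf.beat_schedule = {
    "drift-checks": {"task": "tasks.run_drift_checks", "schedule": 900, "args": ("fraud_detector_v2",)},
}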
- Stateless API design enables load balancing
- Redis cluster for distributed caching
- PostgreSQL read replicas for dashboard queries
| Workload | Throughput | Latency (p95) |
|---|---|---|
| Prediction ingestion | 10K req/s | < 50ms |
| Drift detection (batch) | 1M samples | < 5 min |
| Dashboard queries | 100 req/s | < 200ms |
Hardware Assumptions: 4 CPU cores, 16GB RAM, SSD storage
# Database
DATABASE_URL=postgresql://user:pass@localhost:5432/ml_observability
REDIS_URL=redis://localhost:6379/0
# API
API_HOST=0.0.0.0
API_PORT=8000
LOG_LEVEL=INFO
# Drift Detection
KS_TEST_THRESHOLD=0.05
PSI_THRESHOLD=0.25
DRIFT_CHECK_INTERVAL=900 # seconds (15 minutes)
# Performance Monitoring
MIN_SAMPLES_FOR_EVAL=100
ACCURACY_DROP_THRESHOLD=0.05
# Alerting
ALERT_WEBHOOK_URL=https://your-webhook.com/alerts
ALERT_EMAIL=alerts@yourcompany.com

# Clone repository
git clone https://github.com/yourusername/model-observability-system.git
cd model-observability-system
# Create virtual environment
python -m venv venv
source venv/bin/activate # Windows: venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
# Set up database
alembic upgrade head
# Start Redis
docker run -d -p 6379:6379 redis:7-alpine
# Run application
uvicorn app.main:app --host 0.0.0.0 --port 8000 --reload

import requests
# Log a prediction
response = requests.post(
"http://localhost:8000/api/v1/predictions",
json={
"model_id": "fraud_detector_v2",
"features": {
"transaction_amount": 150.50,
"merchant_category": "retail",
"user_age": 35,
"account_age_days": 450
},
"prediction": 0, # 0: legitimate, 1: fraud
"ground_truth": None # Will be updated later
}
)

# Upload training data distribution as baseline
response = requests.post(
"http://localhost:8000/api/v1/baselines",
json={
"model_id": "fraud_detector_v2",
"version": "v2.1.0",
"features": {
"transaction_amount": {
"type": "numerical",
"mean": 125.30,
"std": 89.45,
"min": 0.01,
"max": 9999.99,
"samples": [/* sample data */]
},
"merchant_category": {
"type": "categorical",
"distribution": {
"retail": 0.35,
"food": 0.25,
"online": 0.20,
"other": 0.20
}
}
}
}
)

# Get drift metrics for a model
curl http://localhost:8000/api/v1/drift/fraud_detector_v2
# Response
{
"model_id": "fraud_detector_v2",
"timestamp": "2026-01-15T10:30:00Z",
"features": {
"transaction_amount": {
"ks_statistic": 0.08,
"ks_p_value": 0.15,
"psi": 0.12,
"is_drifted": false,
"drift_severity": "low"
},
"merchant_category": {
"chi_square_statistic": 12.5,
"chi_square_p_value": 0.006,
"psi": 0.28,
"is_drifted": true,
"drift_severity": "high"
}
},
"alerts": [
{
"type": "data_drift",
"feature": "merchant_category",
"severity": "high",
"message": "Significant distribution shift detected"
}
]
}

- POST /api/v1/predictions - Log a single prediction
- POST /api/v1/predictions/batch - Log multiple predictions
- PATCH /api/v1/predictions/{id}/ground-truth - Update ground truth

- POST /api/v1/baselines - Register baseline distribution
- GET /api/v1/baselines/{model_id} - Get current baseline
- PUT /api/v1/baselines/{model_id} - Update baseline

- GET /api/v1/drift/{model_id} - Get current drift status
- GET /api/v1/drift/{model_id}/history - Historical drift trends
- POST /api/v1/drift/{model_id}/check - Trigger manual drift check

- GET /api/v1/performance/{model_id} - Current performance metrics
- GET /api/v1/performance/{model_id}/trends - Performance over time

- GET /api/v1/alerts - List active alerts
- GET /api/v1/alerts/{model_id} - Alerts for specific model
- POST /api/v1/alerts/acknowledge/{alert_id} - Acknowledge alert

- GET /health - Health check
- GET /metrics - Prometheus metrics
- Statistical Power
  - Drift tests require sufficient sample size (typically 100+ samples)
  - Low-traffic models may have delayed drift detection
  - Mitigation: Configurable time windows and sample thresholds
- Feature Engineering
  - System monitors input features, not engineered features
  - Drift in engineered features requires separate tracking
  - Mitigation: Log both raw and engineered features if needed
- Multivariate Drift
  - Univariate tests may miss interactions between features
  - Correlation drift not directly captured
  - Future Enhancement: Implement multivariate drift tests (e.g., MMD)
- Ground Truth Latency
  - Performance monitoring requires ground truth labels
  - Labels may arrive hours/days after prediction
  - Mitigation: Separate real-time drift detection from delayed performance monitoring
- High Cardinality Categorical Features
  - PSI and Chi-square less effective with 100+ categories
  - Requires grouping strategies
  - Mitigation: Auto-group rare categories, use entropy-based metrics (see the sketch below)
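A sketch of the rare-category grouping mitigation; the 1% share cutoff and the bucket label are assumptions, and the mapping derived from the baseline should be applied to both baseline and production windows before running PSI or chi-square:

# grouping.py - sketch of auto-grouping rare categories before PSI / chi-square
from collections import Counter

def rare_category_map(baseline_values, min_share: float = 0.01) -> set:
    """Categories that cover at least min_share of the baseline volume."""
    counts = Counter(baseline_values)
    total = sum(counts.values())
    return {c for c, n in counts.items() if n / total >= min_share}

def apply_grouping(values, keep: set, other_label: str = "__other__") -> list:
    """Collapse everything outside `keep` into a single bucket."""
    return [v if v in keep else other_label for v in values]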
| Choice | Benefit | Trade-off |
|---|---|---|
| Batch drift detection | Lower compute cost, better batching | 15-min detection delay |
| Redis caching | Fast dashboard queries | Eventual consistency |
| PostgreSQL | ACID guarantees, SQL queries | Harder to scale writes |
| Univariate tests | Simple, interpretable | Misses feature interactions |
| Fixed time windows | Consistent evaluation | May miss sudden spikes |
# Build and run with docker-compose
docker-compose up -d
# Scale API workers
docker-compose up -d --scale api=3

# Deploy to Kubernetes
kubectl apply -f k8s/
# Components:
# - API deployment (3 replicas)
# - Background worker deployment (2 replicas)
# - PostgreSQL StatefulSet
# - Redis deployment
# - Nginx ingress

Metrics Exported:
- Prediction ingestion rate
- Drift detection computation time
- Alert generation rate
- Database query latency
- Cache hit rate
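A sketch of exporting these metrics with prometheus_client; the metric names and label sets are illustrative, and in the FastAPI app the same registry would typically be served from the GET /metrics endpoint listed above rather than a standalone server:

# metrics.py - sketch of the exported metrics using prometheus_client
from prometheus_client import Counter, Gauge, Histogram, start_http_server

PREDICTIONS_INGESTED = Counter(
    "predictions_ingested_total", "Prediction logs received", ["model_id"]
)
DRIFT_CHECK_SECONDS = Histogram(
    "drift_check_duration_seconds", "Drift detection computation time"
)
ALERTS_FIRED = Counter("alerts_fired_total", "Alerts generated", ["type", "severity"])
CACHE_HIT_RATE = Gauge("cache_hit_rate", "Redis cache hit rate over the last window")

# Example usage inside the services:
#   PREDICTIONS_INGESTED.labels(model_id="fraud_detector_v2").inc()
#   with DRIFT_CHECK_SECONDS.time(): run_drift_checks(...)

if __name__ == "__main__":
    start_http_server(9100)  # standalone scrape target; port is arbitrary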
Logging:
- Structured JSON logging
- Request/response logging
- Error tracking with stack traces
- Audit logs for baseline changes
Recommended Stack:
- Prometheus + Grafana for metrics
- ELK/Loki for log aggregation
- PagerDuty/Opsgenie for alert routing
# Run unit tests
pytest tests/unit -v
# Run integration tests
pytest tests/integration -v
# Run with coverage
pytest --cov=app --cov-report=html
# Load testing
locust -f tests/load/locustfile.py

See CONTRIBUTING.md for development guidelines.
MIT License - see LICENSE file for details.
Note: This is a production-grade framework. Customize thresholds, time windows, and alerting channels based on your specific use case and SLAs.