SageMaker ML Pipeline Template

A fully native AWS SageMaker ML pipeline template for end-to-end machine learning workflows.

🎯 Overview

This template provides a complete, SageMaker-native ML pipeline built on AWS managed services, giving you scalability without custom infrastructure to operate.

🔥 What makes this different?

🆚 vs Batch ML Pipeline Template:

  • 100% SageMaker native - no custom Docker containers needed
  • Fully managed - AWS handles infrastructure, scaling, monitoring
  • Integrated Model Registry - automatic model versioning and approval workflows
  • Built-in monitoring - SageMaker Model Monitor for data drift detection
  • Cost optimized - auto scaling and Spot Instance support

πŸ—οΈ Architecture

📊 Data Input (S3)
    ↓
🔄 SageMaker Processing (Feature Engineering)
    ↓
🤖 SageMaker Training (Model Training)
    ↓
📈 SageMaker Processing (Model Evaluation)
    ↓
📦 SageMaker Model Registry (Conditional Registration)
    ↓
🚀 SageMaker Endpoints (Real-time Inference)

🧩 Components

  • SageMaker Pipeline: Orchestrates the entire ML workflow
  • SageMaker Processing: Data preprocessing and model evaluation
  • SageMaker Training: Distributed model training
  • SageMaker Model Registry: Model versioning and governance
  • SageMaker Endpoints: Real-time model serving
  • SageMaker Model Monitor: Data drift and model quality monitoring

🚀 Quick Start

1. Infrastructure Setup

# Deploy infrastructure
make deploy-infra ENV=dev

# Check AWS configuration
make check-aws

2. Upload Sample Data

# Create sample dataset (replace with your data)
mkdir -p data
echo "feature1,feature2,feature3,target" > data/sample_dataset.csv
echo "1.0,2.0,3.0,0" >> data/sample_dataset.csv
echo "2.0,3.0,4.0,1" >> data/sample_dataset.csv

# Upload to S3
make upload-data ENV=dev

3. Run Pipeline

# Execute the full ML pipeline
make run-pipeline ENV=dev

# Monitor progress
make list-executions ENV=dev
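
The make targets above presumably wrap scripts/run_pipeline.py. For reference, a minimal sketch of checking execution status directly with boto3 (the pipeline name is an assumption; use whatever your configs/{env}/pipeline_config.json defines):

import boto3

sm = boto3.client("sagemaker", region_name="us-east-1")

# List the most recent executions of the pipeline and print their status
resp = sm.list_pipeline_executions(
    PipelineName="YourMLPipeline",   # assumed name, taken from pipeline_config.json
    SortBy="CreationTime",
    SortOrder="Descending",
    MaxResults=10,
)
for summary in resp["PipelineExecutionSummaries"]:
    print(summary["PipelineExecutionStatus"], summary["PipelineExecutionArn"])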

πŸ“ Project Structure

sagemaker-ml-pipeline/
├── configs/                 # Environment configurations
│   ├── dev/
│   ├── staging/
│   └── prod/
├── src/
│   ├── pipeline/           # SageMaker Pipeline definitions
│   │   └── sagemaker_pipeline.py
│   ├── preprocessing/      # Data preprocessing scripts
│   │   └── preprocess.py
│   ├── training/          # Model training scripts
│   │   └── train.py
│   ├── evaluation/        # Model evaluation scripts
│   │   └── evaluate.py
│   └── inference/         # Inference and endpoint management
│       ├── inference.py
│       └── deploy_endpoint.py
├── infrastructure/        # Terraform IaC
│   └── terraform/
│       ├── modules/
│       └── environments/
├── scripts/              # Utility scripts
│   └── run_pipeline.py
├── tests/               # Test suite
└── notebooks/           # Jupyter notebooks

🔧 Configuration

Environment Variables

Create a .env file for local development:

AWS_REGION=us-east-1
AWS_PROFILE=default
SAGEMAKER_ROLE_ARN=arn:aws:iam::ACCOUNT:role/SageMakerExecutionRole
S3_BUCKET=your-ml-artifacts-bucket
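
A small sketch of reading these values in Python; python-dotenv is an assumption here, not necessarily a dependency of the template:

import os

from dotenv import load_dotenv  # assumed helper; plain `export` works just as well

load_dotenv()  # pull variables from .env into the process environment
role_arn = os.environ["SAGEMAKER_ROLE_ARN"]
bucket = os.environ["S3_BUCKET"]
region = os.environ.get("AWS_REGION", "us-east-1")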

Pipeline Configuration

Modify configs/{env}/pipeline_config.json:

{
  "pipeline_name": "YourMLPipeline",
  "model_package_group_name": "YourModelGroup",
  "processing_instance_type": "ml.m5.xlarge",
  "training_instance_type": "ml.m5.xlarge",
  "endpoint_instance_type": "ml.m5.large"
}
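
One way (not necessarily how sagemaker_pipeline.py does it) to surface these values as pipeline parameters so they can be overridden per execution:

import json

from sagemaker.workflow.parameters import ParameterString

with open("configs/dev/pipeline_config.json") as f:
    cfg = json.load(f)

# Expose instance types as parameters, with the configured values as defaults
processing_instance_type = ParameterString(
    name="ProcessingInstanceType", default_value=cfg["processing_instance_type"]
)
training_instance_type = ParameterString(
    name="TrainingInstanceType", default_value=cfg["training_instance_type"]
)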

📊 Pipeline Steps

1. Data Preprocessing

  • Input: Raw data from S3
  • Processing: Feature engineering, data cleaning, train/val/test split
  • Output: Processed datasets ready for training
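
A minimal sketch of this step with the SageMaker Python SDK, assuming an sklearn-based preprocess.py and the role_arn/bucket values from the Configuration section; the actual sagemaker_pipeline.py may differ:

from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.workflow.steps import ProcessingStep

processor = SKLearnProcessor(
    framework_version="1.2-1",
    role=role_arn,                 # SageMaker execution role from .env / config
    instance_type="ml.m5.xlarge",
    instance_count=1,
)

preprocess_step = ProcessingStep(
    name="PreprocessData",
    processor=processor,
    code="src/preprocessing/preprocess.py",
    inputs=[
        ProcessingInput(source=f"s3://{bucket}/data/", destination="/opt/ml/processing/input"),
    ],
    outputs=[
        ProcessingOutput(output_name="train", source="/opt/ml/processing/train"),
        ProcessingOutput(output_name="validation", source="/opt/ml/processing/validation"),
        ProcessingOutput(output_name="test", source="/opt/ml/processing/test"),
    ],
)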

2. Model Training

  • Input: Processed training and validation data
  • Training: Distributed training with hyperparameter optimization
  • Output: Trained model artifacts
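
Sketched below with the SKLearn estimator in script mode; the framework actually used by train.py is an assumption:

from sagemaker.inputs import TrainingInput
from sagemaker.sklearn.estimator import SKLearn
from sagemaker.workflow.steps import TrainingStep

estimator = SKLearn(
    entry_point="src/training/train.py",
    framework_version="1.2-1",
    role=role_arn,
    instance_type="ml.m5.xlarge",
    instance_count=1,
)

# Feed the processed splits from the preprocessing step into training
train_step = TrainingStep(
    name="TrainModel",
    estimator=estimator,
    inputs={
        "train": TrainingInput(
            s3_data=preprocess_step.properties.ProcessingOutputConfig.Outputs["train"].S3Output.S3Uri
        ),
        "validation": TrainingInput(
            s3_data=preprocess_step.properties.ProcessingOutputConfig.Outputs["validation"].S3Output.S3Uri
        ),
    },
)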

3. Model Evaluation

  • Input: Trained model and test data
  • Evaluation: Performance metrics calculation
  • Output: Evaluation report and model quality assessment
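
A sketch of the evaluation step, assuming evaluate.py writes its metrics to evaluation.json so the pipeline can read them back through a PropertyFile:

from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.workflow.properties import PropertyFile
from sagemaker.workflow.steps import ProcessingStep

evaluation_report = PropertyFile(
    name="EvaluationReport", output_name="evaluation", path="evaluation.json"
)

eval_step = ProcessingStep(
    name="EvaluateModel",
    processor=processor,            # reuse the processor from the preprocessing sketch
    code="src/evaluation/evaluate.py",
    inputs=[
        ProcessingInput(
            source=train_step.properties.ModelArtifacts.S3ModelArtifacts,
            destination="/opt/ml/processing/model",
        ),
        ProcessingInput(
            source=preprocess_step.properties.ProcessingOutputConfig.Outputs["test"].S3Output.S3Uri,
            destination="/opt/ml/processing/test",
        ),
    ],
    outputs=[ProcessingOutput(output_name="evaluation", source="/opt/ml/processing/evaluation")],
    property_files=[evaluation_report],
)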

4. Conditional Model Registration

  • Condition: Model meets quality thresholds
  • Registration: Automatic registration in SageMaker Model Registry
  • Approval: Configurable approval workflow
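
A sketch of the quality gate; the metric path and the 0.8 threshold are assumptions, not values taken from this template:

from sagemaker.workflow.condition_step import ConditionStep
from sagemaker.workflow.conditions import ConditionGreaterThanOrEqualTo
from sagemaker.workflow.functions import JsonGet
from sagemaker.workflow.step_collections import RegisterModel

register_step = RegisterModel(
    name="RegisterModel",
    estimator=estimator,
    model_data=train_step.properties.ModelArtifacts.S3ModelArtifacts,
    content_types=["text/csv"],
    response_types=["text/csv"],
    inference_instances=["ml.m5.large"],
    transform_instances=["ml.m5.xlarge"],
    model_package_group_name="YourModelGroup",
    approval_status="PendingManualApproval",   # keeps the approval workflow manual
)

condition_step = ConditionStep(
    name="CheckModelQuality",
    conditions=[
        ConditionGreaterThanOrEqualTo(
            left=JsonGet(
                step_name=eval_step.name,
                property_file=evaluation_report,
                json_path="metrics.accuracy.value",   # assumed layout of evaluation.json
            ),
            right=0.8,
        )
    ],
    if_steps=[register_step],
    else_steps=[],
)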

5. Model Deployment

  • Endpoint Creation: Automatic endpoint deployment for approved models
  • Scaling: Auto-scaling configuration
  • Monitoring: Data capture and model monitoring setup
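
Putting the sketched steps together into one pipeline object (endpoint deployment, shown under Model Management below, may be handled separately):

from sagemaker.workflow.pipeline import Pipeline

pipeline = Pipeline(
    name="YourMLPipeline",
    parameters=[processing_instance_type, training_instance_type],
    steps=[preprocess_step, train_step, eval_step, condition_step],
)

pipeline.upsert(role_arn=role_arn)   # create or update the pipeline definition
execution = pipeline.start()         # roughly what `make run-pipeline` triggers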

πŸŽ›οΈ Model Management

Deploy a Model to Endpoint

from src.inference.deploy_endpoint import SageMakerEndpointManager

manager = SageMakerEndpointManager()

# Deploy model
predictor, endpoint_name = manager.deploy_model(
    model_data_url="s3://bucket/path/to/model.tar.gz",
    instance_type="ml.m5.large"
)

# Test endpoint
result = manager.test_endpoint(endpoint_name)
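
For reference, the same endpoint can also be invoked directly with boto3; the CSV payload shape depends on what inference.py expects:

import boto3

runtime = boto3.client("sagemaker-runtime", region_name="us-east-1")

response = runtime.invoke_endpoint(
    EndpointName=endpoint_name,      # returned by deploy_model() above
    ContentType="text/csv",
    Body="1.0,2.0,3.0",
)
print(response["Body"].read().decode("utf-8"))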

Model Registry Operations

# List model packages
aws sagemaker list-model-packages \
    --model-package-group-name TrafileaMLModelGroup-dev

# Approve a model
aws sagemaker update-model-package \
    --model-package-arn arn:aws:sagemaker:... \
    --model-approval-status Approved

📈 Monitoring and Observability

Built-in Monitoring

  • CloudWatch Metrics: Training job metrics, endpoint metrics
  • SageMaker Model Monitor: Data drift detection
  • Pipeline Execution Tracking: Step-by-step execution monitoring

Custom Monitoring

# Create a default model monitor
from sagemaker.model_monitor import DefaultModelMonitor
from sagemaker.model_monitor.dataset_format import DatasetFormat

monitor = DefaultModelMonitor(
    role=sagemaker_role,               # SageMaker execution role
    instance_count=1,
    instance_type='ml.m5.xlarge'
)

# Run a baselining job on the training data to produce statistics and constraints
monitor.suggest_baseline(
    baseline_dataset=baseline_data_uri,
    dataset_format=DatasetFormat.csv(header=True)
)
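
Once the baselining job has produced statistics and constraints, a schedule can be attached to the endpoint; the schedule name below is illustrative:

from sagemaker.model_monitor import CronExpressionGenerator

monitor.create_monitoring_schedule(
    monitor_schedule_name="data-drift-monitor",      # illustrative name
    endpoint_input=endpoint_name,                    # endpoint with data capture enabled
    statistics=monitor.baseline_statistics(),
    constraints=monitor.suggested_constraints(),
    schedule_cron_expression=CronExpressionGenerator.hourly(),
)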

🧪 Testing

# Run all tests
make test

# Run specific test
pytest tests/test_pipeline.py -v

# Test with coverage
pytest --cov=src tests/
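
An example of what a unit test in tests/ might look like; get_pipeline is a hypothetical factory function, not necessarily what src/pipeline/sagemaker_pipeline.py actually exposes:

import json

def test_pipeline_defines_steps():
    # Hypothetical factory; adjust to the actual entry point in sagemaker_pipeline.py
    from src.pipeline.sagemaker_pipeline import get_pipeline

    pipeline = get_pipeline(
        region="us-east-1",
        role="arn:aws:iam::123456789012:role/SageMakerExecutionRole",
    )
    definition = json.loads(pipeline.definition())
    assert definition["Steps"], "the pipeline should define at least one step"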

🚀 Deployment

Development

make deploy-infra ENV=dev
make run-pipeline ENV=dev

Production

make deploy-infra ENV=prod
make run-pipeline ENV=prod

🔄 CI/CD Integration

GitHub Actions Example

name: SageMaker Pipeline CI/CD

on:
  push:
    branches: [main]

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Setup AWS
        uses: aws-actions/configure-aws-credentials@v2
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: us-east-1
      
      - name: Deploy Infrastructure
        run: make deploy-infra ENV=prod
      
      - name: Run Pipeline
        run: make run-pipeline ENV=prod

🤝 Contributing

  1. Fork the repository
  2. Create a feature branch: git checkout -b feature/amazing-feature
  3. Commit changes: git commit -m 'Add amazing feature'
  4. Push to branch: git push origin feature/amazing-feature
  5. Open a Pull Request

πŸ“ License

Copyright © 2024 Trafilea

🆘 Support


🎉 Ready to build production ML pipelines with SageMaker!
