Full native AWS SageMaker ML Pipeline template for end-to-end machine learning workflows.
This template provides a complete SageMaker-native ML pipeline that relies on AWS managed services end to end for maximum scalability and minimal operational overhead.
Compared to the Batch ML Pipeline template:
- 100% SageMaker native - no custom Docker containers needed
- Fully managed - AWS handles infrastructure, scaling, monitoring
- Integrated Model Registry - automatic model versioning and approval workflows
- Built-in monitoring - SageMaker Model Monitor for data drift detection
- Cost optimized - automatic scaling and spot instances support
Data Input (S3)
    ↓
SageMaker Processing (Feature Engineering)
    ↓
SageMaker Training (Model Training)
    ↓
SageMaker Processing (Model Evaluation)
    ↓
SageMaker Model Registry (Conditional Registration)
    ↓
SageMaker Endpoints (Real-time Inference)
- SageMaker Pipeline: Orchestrates the entire ML workflow
- SageMaker Processing: Data preprocessing and model evaluation
- SageMaker Training: Distributed model training
- SageMaker Model Registry: Model versioning and governance
- SageMaker Endpoints: Real-time model serving
- SageMaker Model Monitor: Data drift and model quality monitoring
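These components are wired together in src/pipeline/sagemaker_pipeline.py. As a rough illustration (not the template's exact code), a single-step pipeline definition looks like this:

# Minimal sketch of a one-step SageMaker Pipeline definition; names, bucket,
# and role ARN are illustrative placeholders.
import sagemaker
from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.steps import ProcessingStep

session = sagemaker.Session()
role = "arn:aws:iam::ACCOUNT:role/SageMakerExecutionRole"  # your execution role

processor = SKLearnProcessor(
    framework_version="1.2-1",
    role=role,
    instance_type="ml.m5.xlarge",
    instance_count=1,
    sagemaker_session=session,
)

preprocess_step = ProcessingStep(
    name="PreprocessData",
    processor=processor,
    code="src/preprocessing/preprocess.py",
    inputs=[ProcessingInput(
        source="s3://your-ml-artifacts-bucket/raw/",
        destination="/opt/ml/processing/input",
    )],
    outputs=[ProcessingOutput(
        output_name="train",
        source="/opt/ml/processing/train",
    )],
)

pipeline = Pipeline(name="YourMLPipeline", steps=[preprocess_step], sagemaker_session=session)
pipeline.upsert(role_arn=role)  # create or update the pipeline definition
execution = pipeline.start()    # launch an execution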
# Deploy infrastructure
make deploy-infra ENV=dev
# Check AWS configuration
make check-aws
# Create sample dataset (replace with your data)
mkdir -p data
echo "feature1,feature2,feature3,target" > data/sample_dataset.csv
echo "1.0,2.0,3.0,0" >> data/sample_dataset.csv
echo "2.0,3.0,4.0,1" >> data/sample_dataset.csv
# Upload to S3
make upload-data ENV=dev
# Execute the full ML pipeline
make run-pipeline ENV=dev
# Monitor progress
make list-executions ENV=dev
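If you want to track executions outside of make, the same information is available through boto3; a small sketch (the pipeline name is whatever you set in pipeline_config.json):

# Sketch: list recent executions of the pipeline with boto3.
import boto3

sm = boto3.client("sagemaker")
resp = sm.list_pipeline_executions(PipelineName="YourMLPipeline", MaxResults=10)
for ex in resp["PipelineExecutionSummaries"]:
    print(ex["PipelineExecutionArn"], ex["PipelineExecutionStatus"])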
sagemaker-ml-pipeline/
├── configs/                  # Environment configurations
│   ├── dev/
│   ├── staging/
│   └── prod/
├── src/
│   ├── pipeline/             # SageMaker Pipeline definitions
│   │   └── sagemaker_pipeline.py
│   ├── preprocessing/        # Data preprocessing scripts
│   │   └── preprocess.py
│   ├── training/             # Model training scripts
│   │   └── train.py
│   ├── evaluation/           # Model evaluation scripts
│   │   └── evaluate.py
│   └── inference/            # Inference and endpoint management
│       ├── inference.py
│       └── deploy_endpoint.py
├── infrastructure/           # Terraform IaC
│   └── terraform/
│       ├── modules/
│       └── environments/
├── scripts/                  # Utility scripts
│   └── run_pipeline.py
├── tests/                    # Test suite
└── notebooks/                # Jupyter notebooks
Create a .env file for local development:
AWS_REGION=us-east-1
AWS_PROFILE=default
SAGEMAKER_ROLE_ARN=arn:aws:iam::ACCOUNT:role/SageMakerExecutionRole
S3_BUCKET=your-ml-artifacts-bucket
Modify configs/{env}/pipeline_config.json:
{
"pipeline_name": "YourMLPipeline",
"model_package_group_name": "YourModelGroup",
"processing_instance_type": "ml.m5.xlarge",
"training_instance_type": "ml.m5.xlarge",
"endpoint_instance_type": "ml.m5.large"
}
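scripts/run_pipeline.py (or your own tooling) can then load the environment-specific config and feed the values into the pipeline definition; a minimal sketch assuming the layout above:

# Sketch: load the environment-specific config and pass values downstream.
import json
from pathlib import Path

env = "dev"  # or "staging" / "prod"
config = json.loads(Path(f"configs/{env}/pipeline_config.json").read_text())
print(config["pipeline_name"], config["training_instance_type"])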
- Input: Raw data from S3
- Processing: Feature engineering, data cleaning, train/val/test split
- Output: Processed datasets ready for training
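Inside the Processing job, the preprocessing entry point reads from and writes to the paths SageMaker mounts; a minimal sketch (the real logic lives in src/preprocessing/preprocess.py, and the dataset and column names here are illustrative):

# Sketch of a Processing-job entry point. SageMaker Processing mounts the
# declared inputs/outputs under /opt/ml/processing/.
import os
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("/opt/ml/processing/input/sample_dataset.csv").dropna()

train, rest = train_test_split(df, test_size=0.3, random_state=42)
val, test = train_test_split(rest, test_size=0.5, random_state=42)

for name, split in [("train", train), ("validation", val), ("test", test)]:
    out_dir = f"/opt/ml/processing/{name}"
    os.makedirs(out_dir, exist_ok=True)
    split.to_csv(f"{out_dir}/{name}.csv", index=False)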
- Input: Processed training and validation data
- Training: Distributed training with hyperparameter optimization
- Output: Trained model artifacts
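The training entry point picks up the data channel and model output directory that SageMaker injects via environment variables; a minimal sketch (the real logic lives in src/training/train.py, and the RandomForest model and "target" column are illustrative):

# Sketch of a training entry point. SageMaker injects data channels and the
# model output directory via SM_* environment variables.
import os
import joblib
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

train_dir = os.environ.get("SM_CHANNEL_TRAIN", "/opt/ml/input/data/train")
model_dir = os.environ.get("SM_MODEL_DIR", "/opt/ml/model")

df = pd.read_csv(os.path.join(train_dir, "train.csv"))
X, y = df.drop(columns=["target"]), df["target"]  # "target" column is illustrative

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X, y)

joblib.dump(model, os.path.join(model_dir, "model.joblib"))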
- Input: Trained model and test data
- Evaluation: Performance metrics calculation
- Output: Evaluation report and model quality assessment
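The evaluation entry point typically unpacks the trained model artifact, scores the test set, and writes a metrics report for the next step to read; a sketch with illustrative paths and metric names:

# Sketch of an evaluation entry point: unpack the model artifact, score the
# test set, and emit evaluation.json for the condition step to inspect.
import json
import tarfile

import joblib
import pandas as pd
from sklearn.metrics import accuracy_score

with tarfile.open("/opt/ml/processing/model/model.tar.gz") as tar:
    tar.extractall("/opt/ml/processing/model")
model = joblib.load("/opt/ml/processing/model/model.joblib")

test = pd.read_csv("/opt/ml/processing/test/test.csv")
X, y = test.drop(columns=["target"]), test["target"]

report = {"metrics": {"accuracy": {"value": accuracy_score(y, model.predict(X))}}}
with open("/opt/ml/processing/evaluation/evaluation.json", "w") as f:
    json.dump(report, f)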
- Condition: Model meets quality thresholds
- Registration: Automatic registration in SageMaker Model Registry
- Approval: Configurable approval workflow
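In SageMaker Pipelines this gate is usually expressed as a ConditionStep that reads the evaluation report and runs the registration step only when the threshold is met. A sketch, assuming an evaluate_step and register_step defined elsewhere in the pipeline module (the metric path and 0.8 threshold are illustrative):

# Sketch: register the model only if accuracy from evaluation.json clears a
# threshold. `evaluate_step` and `register_step` are assumed to exist in the
# pipeline module, and `evaluation_report` must also be declared on the
# evaluation step via property_files=[evaluation_report].
from sagemaker.workflow.conditions import ConditionGreaterThanOrEqualTo
from sagemaker.workflow.condition_step import ConditionStep
from sagemaker.workflow.functions import JsonGet
from sagemaker.workflow.properties import PropertyFile

evaluation_report = PropertyFile(
    name="EvaluationReport",
    output_name="evaluation",
    path="evaluation.json",
)

accuracy_check = ConditionGreaterThanOrEqualTo(
    left=JsonGet(
        step_name=evaluate_step.name,
        property_file=evaluation_report,
        json_path="metrics.accuracy.value",
    ),
    right=0.8,  # illustrative quality threshold
)

condition_step = ConditionStep(
    name="CheckModelQuality",
    conditions=[accuracy_check],
    if_steps=[register_step],  # register only when the condition holds
    else_steps=[],
)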
- Endpoint Creation: Automatic endpoint deployment for approved models
- Scaling: Auto-scaling configuration
- Monitoring: Data capture and model monitoring setup
from src.inference.deploy_endpoint import SageMakerEndpointManager
manager = SageMakerEndpointManager()
# Deploy model
predictor, endpoint_name = manager.deploy_model(
model_data_url="s3://bucket/path/to/model.tar.gz",
instance_type="ml.m5.large"
)
# Test endpoint
result = manager.test_endpoint(endpoint_name)
# List model packages
aws sagemaker list-model-packages \
--model-package-group-name TrafileaMLModelGroup-dev
# Approve a model
aws sagemaker update-model-package \
--model-package-arn arn:aws:sagemaker:... \
--model-approval-status Approved
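The same workflow can be automated with boto3, for example to approve the most recent package in a group (the group name is copied from above; everything else is a sketch):

# Sketch: approve the most recent model package in the group via boto3.
import boto3

sm = boto3.client("sagemaker")
packages = sm.list_model_packages(
    ModelPackageGroupName="TrafileaMLModelGroup-dev",
    SortBy="CreationTime",
    SortOrder="Descending",
)["ModelPackageSummaryList"]

latest_arn = packages[0]["ModelPackageArn"]
sm.update_model_package(ModelPackageArn=latest_arn, ModelApprovalStatus="Approved")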
- CloudWatch Metrics: Training job metrics, endpoint metrics
- SageMaker Model Monitor: Data drift detection
- Pipeline Execution Tracking: Step-by-step execution monitoring
# Enable model monitoring
from sagemaker.model_monitor import DefaultModelMonitor
from sagemaker.model_monitor.dataset_format import DatasetFormat

monitor = DefaultModelMonitor(
    role=sagemaker_role,
    instance_count=1,
    instance_type='ml.m5.xlarge'
)

# Suggest a baseline (statistics and constraints) from training data
monitor.suggest_baseline(
    baseline_dataset=baseline_data_uri,
    dataset_format=DatasetFormat.csv(header=True)
)
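To attach the monitor to a running endpoint, you would then create a monitoring schedule; a sketch, assuming the monitor object from the snippet above and placeholder endpoint and S3 names:

# Sketch: schedule hourly monitoring against a deployed endpoint using the
# baseline suggested above.
from sagemaker.model_monitor import CronExpressionGenerator

monitor.create_monitoring_schedule(
    monitor_schedule_name="your-endpoint-monitoring",  # placeholder name
    endpoint_input="your-endpoint-name",               # placeholder endpoint
    output_s3_uri="s3://your-ml-artifacts-bucket/monitoring/reports",
    statistics=monitor.baseline_statistics(),
    constraints=monitor.suggested_constraints(),
    schedule_cron_expression=CronExpressionGenerator.hourly(),
)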
# Run all tests
make test
# Run specific test
pytest tests/test_pipeline.py -v
# Test with coverage
pytest --cov=src tests/
make deploy-infra ENV=dev
make run-pipeline ENV=dev
make deploy-infra ENV=prod
make run-pipeline ENV=prod
name: SageMaker Pipeline CI/CD

on:
  push:
    branches: [main]

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Setup AWS
        uses: aws-actions/configure-aws-credentials@v2
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: us-east-1
      - name: Deploy Infrastructure
        run: make deploy-infra ENV=prod
      - name: Run Pipeline
        run: make run-pipeline ENV=prod
- Fork the repository
- Create a feature branch:
git checkout -b feature/amazing-feature
- Commit changes:
git commit -m 'Add amazing feature'
- Push to branch:
git push origin feature/amazing-feature
- Open a Pull Request
Copyright © 2024 Trafilea
- SageMaker Documentation
- SageMaker Pipelines Guide
- Contact: mlops@trafilea.com
Ready to build production ML pipelines with SageMaker!