This repository provides a solution for an MLOps pipeline that includes data ETL, model re-training, model archiving, model serving, and event triggering. Although this solution uses XGBoost as an example, it can be extended to other SageMaker built-in algorithms because it abstracts the model training of SageMaker's built-in algorithms. Various AWS services (Amazon SageMaker, AWS Step Functions, AWS Lambda) are used to provide the MLOps pipeline, and those resources are modeled and deployed through AWS CDK.
Other "Using AWS CDK" series can be found at:
- Amazon Sagemaker Model Serving Using AWS CDK
- AWS ECS DevOps UsingAWS CDK
- AWS Serverless Using AWS CDK
- AWS IoT Greengrass Ver2 using AWS CDK
- Data ETL: AWS Glue job for data ETL (extract/transform/load)
- Model Build/Train: Amazon SageMaker built-in (XGBoost) algorithm and SageMaker training job
- Model Archive/Serve: Amazon SageMaker Model/Endpoint for real-time inference
- Pipeline Orchestration: AWS Step Functions for configuring a state machine of the MLOps pipeline (see the conceptual sketch after this list)
- Programming-based IaC: AWS CDK for modeling & provisioning all AWS cloud resources (TypeScript)
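The state machine stitches these components together in sequence. The following is only a conceptual sketch with placeholder Pass states (the real stack wires up Glue and SageMaker task states); it shows the order in which the pipeline is orchestrated:

```typescript
// Conceptual sketch only: Pass states stand in for the real Glue/SageMaker task states,
// just to show the order in which the MLOps pipeline is orchestrated.
import { App, Stack } from 'aws-cdk-lib';
import * as sfn from 'aws-cdk-lib/aws-stepfunctions';

const app = new App();
const stack = new Stack(app, 'PipelineSketchStack');

const etl = new sfn.Pass(stack, 'DataEtl');              // AWS Glue ETL job
const train = new sfn.Pass(stack, 'ModelTraining');      // SageMaker training job
const validate = new sfn.Pass(stack, 'ModelValidation'); // optional accuracy check
const archive = new sfn.Pass(stack, 'ModelArchiving');   // SageMaker Model creation
const serve = new sfn.Pass(stack, 'ModelServing');       // SageMaker Endpoint create/update

new sfn.StateMachine(stack, 'MLOpsStateMachine', {
  definition: etl.next(train).next(validate).next(archive).next(serve),
});
```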
This solution refers to amazon-sagemaker-examples: automate_model_retraining_workflow for applying the SageMaker built-in XGBoost algorithm. Please refer to the following links to apply other SageMaker built-in algorithms.
- Amazon SageMaker Built-in Algorithms Intro
- Amazon SageMaker Built-in Algorithms Docker Path
- Amazon SageMaker Built-in Algorithms Data Format
Note that using SageMaker built-in algorithms is very convenient because we only need to focus on the data, not the model.
To efficiently define and provision AWS cloud resources, the AWS Cloud Development Kit (CDK), an open-source software development framework for defining cloud application resources using familiar programming languages, is utilized.
Because this solution is implemented in CDK, we can deploy these cloud resources using the CDK CLI. In particular, TypeScript clearly defines restrictions on types, so we can easily and conveniently configure the many parameters of various cloud resources with CDK. In addition, combining programming, one of the advantages of CDK, with design patterns makes the solution extensible into more reusable assets.
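For example, a hypothetical TypeScript interface for the per-stack configuration (field names taken from the config/app-config-demo.json snippets later in this README) would make an invalid configuration fail at compile time:

```typescript
// Hypothetical sketch: a typed view of the per-stack configuration shown in this README.
// With such an interface, a missing or misspelled property is caught at compile time.
interface PipelineStackConfig {
  Name: string;
  EndpointName: string;
  GlueJobFilePath: string;
  GlueJobTimeoutInMin: number;
  TrainContainerImage: string;
  TrainParameters: { [key: string]: string };
  TrainInputContent: string;
  TrainInstanceType: string;
  ModelValidationEnable: boolean;
  ModelErrorThreshold: number;
  EndpointInstanceType: string;
  EndpointInstanceCount: number;
}
```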
- npm install: install dependencies only for TypeScript
- cdk list: list up stacks
- cdk deploy: deploy this stack to your default AWS account/region
- cdk diff: compare deployed stack with current state
- cdk synth: emits the synthesized CloudFormation template
First of all, an AWS account and an IAM user are required. Then the following modules must be installed.
- AWS CLI: aws --version
- Node.js: node --version
- AWS CDK: cdk --version
- jq: jq --version
Please refer to the guide in the CDK Workshop.
aws configure --profile [your-profile]
AWS Access Key ID [None]: xxxxxx
AWS Secret Access Key [None]: yyyyyyyyyyyyyyyyyyyyyyyyyyyyyy
Default region name [None]: us-east-2
Default output format [None]: json
aws sts get-caller-identity --profile [your-profile]
...
...
{
"UserId": ".............",
"Account": "75157*******",
"Arn": "arn:aws:iam::75157*******:user/[your IAM User ID]"
}
In this CDK project, the entry-point file is infra/app-main.ts, which is specified in cdk.json.
This project is based on aws-cdk-project-template-for-devops, which adopts configuration-driven development (CDD). So let's set up the configuration file (config/app-config-demo.json), which describes the deployment target information (account/region) and how to configure each stack's properties.
First of all, change the deployment target information (account/region) in config/app-config-demo.json according to your AWS account environment.
{
"Project": {
"Name": "MLOps", <----- your project name, all stacks wil be prefixed with [Project.Name+Project.Stage]
"Stage": "Demo", <----- your project stage, all stacks wil be prefixed with [Project.Name+Project.Stage]
"Account": "75157*******", <----- update according to your AWS Account
"Region": "us-east-2", <----- update according to your target resion
"Profile": "cdk-v2" <----- AWS Profile, keep empty string if no profile configured
},
...
...
}
Then set the path of this JSON configuration file by setting an environment variable.
export APP_CONFIG=config/app-config-demo.json
Through this external configuration injection, multiple deployments (multiple accounts, multiple regions, multiple stages) are possible without code modification. For example, we can maintain a variety of configuration files such as app-config-dev.json, app-config-test.json and app-config-prod.json at the same time.
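Internally this works roughly as follows; a minimal sketch (assumed, not the template's actual loader): the CDK entry point reads the file pointed to by APP_CONFIG and passes the parsed object to each stack.

```typescript
// Minimal sketch (assumed, not the template's actual loader) of configuration injection:
// the CDK app reads the JSON file named by the APP_CONFIG environment variable.
import * as fs from 'fs';

const configFile = process.env.APP_CONFIG ?? 'config/app-config-demo.json';
const appConfig = JSON.parse(fs.readFileSync(configFile, 'utf8'));

// Stack names become `${Project.Name}${Project.Stage}-...`, deployed to Project.Account/Project.Region.
console.log(appConfig.Project.Name + appConfig.Project.Stage);
```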
Execute the following command, which will check versions and install dependencies for us. For more details, open the script/setup_initial.sh file.
sh script/setup_initial.sh config/app-config-demo.json
Before deployment, please execute the following command to check whether all configurations are ready.
cdk list
...
...
==> CDK App-Config File is config/app-config-demo.json, which is from Environment-Variable.
MLOpsDemo-ChurnXgboostPipelineStack
...
...
Check that you can see the list of stacks as shown above. If there is no problem, run the following command.
cdk deploy *ChurnXgboostPipelineStack --profile [optional: your profile name]
or
sh script/deploy_stacks.sh config/app-config-demo.json
Caution: This solution uses AWS services that are not covered by the free tier, so be aware of the possible costs.
You can find the deployment results in AWS CloudFormation as shown in the following picture.
And you can see a new StateMachine in Step Functions, which looks like this.
Many resources such as the Lambda function, SageMaker training job, and Glue ETL job have been deployed, but they have not been executed yet. Let's trigger them simply by uploading input data.
Download the sample data by running the following command, where the sed command is used to remove the " character from each line.
sh codes/glue/churn-xgboost/script/download_data.sh
The sample data will be downloaded to codes/glue/churn-xgboost/data/input.csv.
Just execute the following command:
sh codes/glue/churn-xgboost/script/upload_input.sh config/app-config-demo.json data/request-01.csv
...
...
upload: codes/glue/churn-xgboost/data/input.csv to s3://mlopsdemo-churnxgboostpipelinestack-asset-[region]-[account 5 digits]/input/data/request-01.csv
This command uploads the input.csv file into an S3 bucket such as mlopsdemo-churnxgboostpipelinestack-asset-[region]-[account 5 digits] with the key input/data/request-01.csv.
Let's go to the Step Functions service in the web console. We can see that a new execution is currently running. Click on it to check the current status, which looks like this:
When all steps are completed, you can see the following results.
Caution: Sometimes the training job can fail because the container image path is wrong. In this case, you may see the following exception:
In this case, visit the SageMaker Docker Registry Path link, select your region, and then select the algorithm. Finally, you can find the expected Docker image path. For example, if your choice is the us-east-2 region and the XGBoost algorithm, you will see a page like this - https://docs.aws.amazon.com/sagemaker/latest/dg/ecr-us-east-2.html#xgboost-us-east-2.title.
Please update the Docker image path in the app-config-demo.json file and deploy the stack again. Finally, if you upload the input data to S3 with a different S3 key (data/request-02.csv), the StateMachine will start again.
sh script/deploy_stacks.sh config/app-config-demo.json
...
...
sh codes/glue/churn-xgboost/script/upload_input.sh config/app-config-demo.json data/request-02.csv
...
...
upload: codes/glue/churn-xgboost/data/input.csv to s3://mlopsdemo-churnxgboostpipelinestack-asset-[region]-[account 5 digits]/input/data/request-02.csv
Amazon SageMaker Endpoint
Caution: In the first deployment, the SageMaker Endpoint is newly created (Create Endpoint path), and in the second deployment it is updated (Update Endpoint path).
AWS Step Functions StateMachine
Internally, it is implemented so that intermediate results are archived according to the following rules.
AWS Glue ETL Job Result in AWS S3 Bucket
Amazon SageMaker Training Job Result in AWS S3 Bucket
Finally, let's invoke the SageMaker Endpoint to make sure it works well.
Before invocation, open the codes/glue/churn-xgboost/script/test_invoke.py file and update the profile name and endpoint name according to your configuration.
...
...
os.environ['AWS_PROFILE'] = 'cdk-v2'
_endpoint_name = 'MLOpsDemo-churn-xgboost'
...
...
Invoke the endpoint by executing the following command:
python3 codes/glue/churn-xgboost/script/test_invoke.py
...
...
0 Invocation ------------------
>>input: 106,0,274.4,120,198.6,82,160.8,62,6.0,3,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,1,0
>>label: 0
>>prediction: 0.37959378957748413
1 Invocation ------------------
>>input: 28,0,187.8,94,248.6,86,208.8,124,10.6,5,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,1,0,1,0
>>label: 0
>>prediction: 0.03738965839147568
2 Invocation ------------------
>>input: 148,0,279.3,104,201.6,87,280.8,99,7.9,2,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,1,0
>>label: 1
>>prediction: 0.9195730090141296
3 Invocation ------------------
>>input: 132,0,191.9,107,206.9,127,272.0,88,12.6,2,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,1,0
>>label: 0
>>prediction: 0.025062650442123413
4 Invocation ------------------
>>input: 92,29,155.4,110,188.5,104,254.9,118,8.0,4,3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,1
>>label: 0
>>prediction: 0.028299745172262192
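For reference, the same endpoint could also be invoked from TypeScript with AWS SDK for JavaScript v3. This is only an illustrative sketch, not part of the repository; the endpoint name, region, and feature row are taken from the examples above.

```typescript
// Illustrative sketch (not part of the repository): invoke the deployed SageMaker endpoint
// with AWS SDK v3, sending one CSV feature row. Run with AWS_PROFILE set to your profile.
import { SageMakerRuntimeClient, InvokeEndpointCommand } from '@aws-sdk/client-sagemaker-runtime';

const client = new SageMakerRuntimeClient({ region: 'us-east-2' });

async function predict(csvRow: string): Promise<string> {
  const res = await client.send(new InvokeEndpointCommand({
    EndpointName: 'MLOpsDemo-churn-xgboost', // must match EndpointName in your configuration
    ContentType: 'text/csv',
    Body: csvRow,
  }));
  return new TextDecoder().decode(res.Body); // XGBoost returns the predicted probability
}

// First >>input row from the output above
predict('106,0,274.4,120,198.6,82,160.8,62,6.0,3,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,1,0').then(console.log);
```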
Because an S3 event trigger is registered on the Lambda Function, the pipeline restarts when you upload a file with a different name (S3 key) under input in mlopsdemo-churnxgboostpipelinestack-asset-[region]-[account].
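The wiring looks roughly like the following CDK sketch (assumed, not the repository's exact code): an S3 notification on the input/ prefix invokes the trigger Lambda, which in turn starts the state machine.

```typescript
// Rough sketch (assumed, not the repository's exact implementation) of the S3 event trigger:
// uploads under the `input/` prefix invoke a Lambda function that starts the pipeline.
import { Stack, StackProps } from 'aws-cdk-lib';
import { Construct } from 'constructs';
import * as s3 from 'aws-cdk-lib/aws-s3';
import * as lambda from 'aws-cdk-lib/aws-lambda';
import { S3EventSource } from 'aws-cdk-lib/aws-lambda-event-sources';

export class TriggerSketchStack extends Stack {
  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props);

    const assetBucket = new s3.Bucket(this, 'AssetBucket');

    const triggerFunction = new lambda.Function(this, 'TriggerFunction', {
      runtime: lambda.Runtime.PYTHON_3_9,
      handler: 'handler.handle',
      code: lambda.Code.fromAsset('codes/lambda/trigger'), // hypothetical path
    });

    // Fire the Lambda whenever a new object lands under input/
    triggerFunction.addEventSource(new S3EventSource(assetBucket, {
      events: [s3.EventType.OBJECT_CREATED],
      filters: [{ prefix: 'input/' }],
    }));
  }
}
```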
An input S3 key generates a unique title, which is used for TrainingJobName, StateMachineExecutionName, and the S3 output key.
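As a purely hypothetical illustration of such a naming rule (the actual rule lives in the trigger Lambda), the key could be sanitized into a name that is valid for all three uses:

```typescript
// Hypothetical illustration only: derive a name-safe, unique title from the uploaded S3 key.
// SageMaker TrainingJobName only allows letters, digits, and hyphens.
function titleFromKey(key: string): string {
  return key
    .replace(/^input\//, '')        // drop the fixed prefix
    .replace(/[^A-Za-z0-9]+/g, '-') // sanitize path separators and dots
    .replace(/^-+|-+$/g, '');       // trim leading/trailing hyphens
}

console.log(titleFromKey('input/data/request-01.csv')); // -> data-request-01-csv
```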
The ChurnXgboostPipeline section in the config/app-config-demo.json file provides various deployment options. Change these values, re-deploy this stack, and trigger it again.
"ChurnXgboostPipeline": {
"Name": "ChurnXgboostPipelineStack",
"EndpointName": "churn-xgboost", <----- SageMaker Endpoint Name, and other resource name
"GlueJobFilePath": "codes/glue/churn-xgboost/src/glue_etl.py", <----- Glue ETL Job Code
"GlueJobTimeoutInMin": 30, <----- Glue ETL Job Timeout
"TrainContainerImage": "306986355934.dkr.ecr.ap-northeast-2.amazonaws.com/xgboost:1", <------ This value is different according to SageMaker built-in algorithm & region
"TrainParameters": {
...
},
"TrainInputContent": "text/csv", <----- This value is difference according to SageMaker built-in alorithm
"TrainInstanceType": "c5.xlarge", <----- SageMaker training job instance type
"ModelValidationEnable": true, <----- Enable/disable a model validation state
"ModelErrorThreshold": 0.1, <----- Model accuracy validation metric threshold
"EndpointInstanceType": "t2.2xlarge", <----- SageMaker endpoint instance number
"EndpointInstanceCount": 1 <----- SageMaker ednpoint instance count
}
The ChurnXgboostPipeline section in the config/app-config-demo.json file also includes hyper-parameters like this. Change these values, re-deploy this stack, and trigger it again.
"ChurnXgboostPipeline": {
"Name": "ChurnXgboostPipelineStack",
...
...
"TrainParameters": {
"max_depth": "5",
"eval_metric": "error",
"eta": "0.2",
"gamma": "4",
"min_child_weight": "6",
"subsample": "0.8",
"objective": "binary:logistic",
"silent": "0",
"num_round": "100"
},
...
...
}
MLOpsPipelineStack provides a general MLOps pipeline that is as abstract as possible for SageMaker built-in algorithms. As a result, it can be extended by injecting only configuration information, without modifying the code.
For example, consider the Object2Vec algorithm.
Step 1: Prepare a new configuration in config/app-config-demo.json:
{
"Project": {
...
...
},
"Stack": {
"ChurnXgboostPipeline": {
"Name": "ChurnXgboostPipelineStack",
...
...
},
"RecommendObject2VecPipeline": {
"Name": "RecommendObject2VecPipelineStack",
"EndpointName": "recommand-object2vec", <----- change according model or usecase
"GlueJobFilePath": "codes/glue/recommand-object2vec/src/glue_etl.py", <----- change according to data format and etl-process
"GlueJobTimeoutInMin": 30, <----- change this value to avoid over-processing and over-charging
"TrainContainerImage": "835164637446.dkr.ecr.ap-northeast-2.amazonaws.com/object2vec:1", <----- change image according to SageMaker built-in alorithm and region
"TrainParameters": {
"_kvstore": "device",
"_num_gpus": "auto",
"_num_kv_servers": "auto",
"bucket_width": "0",
"early_stopping_patience": "3",
"early_stopping_tolerance": "0.01",
"enc0_cnn_filter_width": "3",
"enc0_layers": "auto",
"enc0_max_seq_len": "1",
"enc0_network": "pooled_embedding",
"enc0_token_embedding_dim": "300",
"enc0_vocab_size": "944",
"enc1_layers": "auto",
"enc1_max_seq_len": "1",
"enc1_network": "pooled_embedding",
"enc1_token_embedding_dim": "300",
"enc1_vocab_size": "1684",
"enc_dim": "1024",
"epochs": "20",
"learning_rate": "0.001",
"mini_batch_size": "64",
"mlp_activation": "tanh",
"mlp_dim": "256",
"mlp_layers": "1",
"num_classes": "2",
"optimizer": "adam",
"output_layer": "mean_squared_error"
},
"TrainInputContent": "application/jsonlines", <----- change according to algorithm supported types
"TrainInstanceType": "m4.xlarge", <--- change model training environments
"ModelValidationEnable": false, <----- disable if you don't want to validate model accuracy
"ModelErrorThreshold": 0.1,
"EndpointInstanceType": "m4.xlarge", <--- change model training environments
"EndpointInstanceCount": 1
}
}
}
Step 2: Create a new stack object in infra/app-main.ts like this:
new MLOpsPipelineStack(appContext, appContext.appConfig.Stack.ChurnXgboostPipeline);
new MLOpsPipelineStack(appContext, appContext.appConfig.Stack.RecommendObject2VecPipeline);
Step 3: Deploy and trigger again with new data:
cdk list
cdk deploy *RecommendObject2VecPipelineStack --profile [optional: your profile name]
You can also extend functionality by inheriting from this stack for further expansion.
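For example, a hypothetical subclass could add extra resources (alarms, notifications, post-processing) on top of what MLOpsPipelineStack creates. This sketch assumes the (appContext, stackConfig) constructor signature used in infra/app-main.ts above; the import path is illustrative only.

```typescript
// Hypothetical sketch: extend MLOpsPipelineStack to add custom resources on top of the pipeline.
// Assumes the (appContext, stackConfig) constructor signature used in infra/app-main.ts.
import * as sns from 'aws-cdk-lib/aws-sns';
import { MLOpsPipelineStack } from './stack/mlops-pipeline-stack'; // hypothetical import path

export class CustomXgboostPipelineStack extends MLOpsPipelineStack {
  constructor(appContext: any, stackConfig: any) {
    super(appContext, stackConfig);

    // Add anything extra the base pipeline does not provide, e.g. a notification topic.
    new sns.Topic(this, 'PipelineNotificationTopic');
  }
}
```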
Execute the following command, which will destroy all resources except the S3 bucket. Destroy that resource manually in the AWS web console.
sh ./script/destroy_stacks.sh config/app-config-demo.json
or
cdk destroy *Stack --profile [optional: your profile name]
See CONTRIBUTING for more information.
This library is licensed under the MIT-0 License. See the LICENSE file.