This documentation covers the steps to run TorchServe inside the KServe environment for the MNIST model.
Currently, KServe supports the Inference API for all existing models except the text-to-speech synthesizer, and its Explain API works only for the eager-mode MNIST, BERT, and text classification models.
- To create a CPU based image
./build_image.sh
- To create a CPU based image with custom tag
./build_image.sh -t <repository>/<image>:<tag>
- To create a GPU based image
./build_image.sh -g
- To create a GPU based image with custom tag
./build_image.sh -g -t <repository>/<image>:<tag>
- To create a dev image
./build_image.sh -g -d -t <repository>/<image>:<tag>
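For example, to build a CPU image under your own repository and push it to a registry (the repository name and tag below are placeholders, not values from this repo):
# Build with a custom tag and push it to your registry (illustrative only)
./build_image.sh -t <repository>/torchserve-kfs:latest
docker push <repository>/torchserve-kfs:latest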
- Install eksctl - https://docs.aws.amazon.com/eks/latest/userguide/eksctl.html
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: "kserve-cluster"
  region: "us-west-2"

vpc:
  id: "vpc-xxxxxxxxxxxxxxxxx"
  subnets:
    private:
      us-west-2a:
        id: "subnet-xxxxxxxxxxxxxxxxx"
      us-west-2c:
        id: "subnet-xxxxxxxxxxxxxxxxx"
    public:
      us-west-2a:
        id: "subnet-xxxxxxxxxxxxxxxxx"
      us-west-2c:
        id: "subnet-xxxxxxxxxxxxxxxxx"

nodeGroups:
  - name: ng-1
    minSize: 1
    maxSize: 4
    desiredCapacity: 2
    instancesDistribution:
      instanceTypes: ["p3.8xlarge"] # At least one instance type should be specified
      onDemandBaseCapacity: 0
      onDemandPercentageAboveBaseCapacity: 50
      spotInstancePools: 5
eksctl create cluster -f cluster.yaml
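Once cluster creation finishes (eksctl also updates your kubeconfig), a quick sanity check is to confirm the cluster and worker nodes are ready:
# Verify the cluster exists and the nodes are Ready
eksctl get cluster --region us-west-2
kubectl get nodes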
Run the command below to install KServe in the cluster.
curl -s "https://raw.githubusercontent.com/kserve/kserve/release-0.8/hack/quick_install.sh" | bash
This installs KServe (release 0.8) and its dependencies in the Kubernetes cluster.
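Assuming the script completes successfully, you can sanity-check the installation by listing the pods in the namespaces it sets up (KServe itself, Istio, and Knative Serving):
# Check that the KServe controller and its dependencies are running
kubectl get pods -n kserve
kubectl get pods -n istio-system
kubectl get pods -n knative-serving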
- Create a test namespace kserve-test
kubectl create namespace kserve-test
Here we use the MNIST example from the TorchServe repository.
- Step - 1 : Create the .mar file for MNIST by invoking the command below
Navigate to the cloned serve repo and run
torch-model-archiver --model-name mnist_kf --version 1.0 --model-file examples/image_classifier/mnist/mnist.py --serialized-file examples/image_classifier/mnist/mnist_cnn.pt --handler examples/image_classifier/mnist/mnist_handler.py
For large models, creating a .mar file is not recommended, as it can be slow. Instead, use the no-archive option, which creates a directory mnist_kf that can be uploaded to the model_store.
torch-model-archiver --model-name mnist_kf --version 1.0 --model-file examples/image_classifier/mnist/mnist.py --serialized-file examples/image_classifier/mnist/mnist_cnn.pt --handler examples/image_classifier/mnist/mnist_handler.py --archive-format no-archive
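The archiver writes mnist_kf.mar (or a mnist_kf directory with no-archive) into the current working directory. Purely as a local staging convention before copying to the PV in Step 4, you could do:
# Stage the generated archive in a local model-store directory (illustrative layout)
mkdir -p model-store
mv mnist_kf.mar model-store/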
- Step - 2 : Create a config.properties file with contents like the following:
inference_address=http://0.0.0.0:8085
management_address=http://0.0.0.0:8081
metrics_address=http://0.0.0.0:8082
grpc_inference_port=7070
grpc_management_port=7071
enable_envvars_config=true
install_py_dep_per_model=true
enable_metrics_api=true
metrics_mode=prometheus
NUM_WORKERS=1
number_of_netty_threads=4
job_queue_size=10
model_store=/mnt/models/model-store
model_snapshot={"name":"startup.cfg","modelCount":1,"models":{"mnist_kf":{"1.0":{"defaultVersion":true,"marName":"mnist_kf.mar","minWorkers":1,"maxWorkers":5,"batchSize":1,"maxBatchDelay":5000,"responseTimeout":120}}}}
Please note that the inference address port should be set to 8085, since KServe by default uses port 8080 for its inference service.
If you used --archive-format no-archive, the model_snapshot would be as follows; the only change is "marName":"mnist_kf".
model_snapshot={"name":"startup.cfg","modelCount":1,"models":{"mnist_kf":{"1.0":{"defaultVersion":true,"marName":"mnist_kf","minWorkers":1,"maxWorkers":5,"batchSize":1,"maxBatchDelay":5000,"responseTimeout":120}}}}
- Step - 3 : Create PV, PVC and PV pods in KServe
For EFS backed volume refer - https://github.com/pytorch/serve/tree/master/kubernetes/EKS#setup-persistentvolume-backed-by-efs
Follow the instructions below for creating a PV and copying the config files
- Create volume
EBS volume creation: https://docs.aws.amazon.com/cli/latest/reference/ec2/create-volume.html
For PV and PVC refer: https://kubernetes.io/docs/concepts/storage/persistent-volumes/
- Create PV
Edit the volume id in the pv.yaml file (a sketch of the relevant fields appears after this list)
kubectl apply -f ../reference_yaml/pv-deployments/pv.yaml -n kserve-test
- Create PVC
kubectl apply -f ../reference_yaml/pv-deployments/pvc.yaml -n kserve-test
- Create pod for copying model store files to PV
kubectl apply -f ../reference_yaml/pvpod.yaml -n kserve-test
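The pv.yaml under reference_yaml/pv-deployments is the file to use; purely as a sketch of where the EBS volume id goes (the PV name, size, and volume id below are placeholders), an EBS-backed PersistentVolume looks roughly like this:
apiVersion: v1
kind: PersistentVolume
metadata:
  name: model-pv-volume
spec:
  capacity:
    storage: 5Gi
  accessModes:
    - ReadWriteOnce
  awsElasticBlockStore:
    volumeID: "vol-xxxxxxxxxxxxxxxxx"   # EBS volume id created in the previous step
    fsType: ext4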
- Step - 4 : Copy the config.properties file and mar file to the PVC using the model-store-pod
# Create directory in PV
kubectl exec -it model-store-pod -c model-store -n kserve-test -- mkdir /pv/model-store/
kubectl exec -it model-store-pod -c model-store -n kserve-test -- mkdir /pv/config/
# Copy files to the path
kubectl cp mnist_kf.mar model-store-pod:/pv/model-store/ -c model-store -n kserve-test
kubectl cp config.properties model-store-pod:/pv/config/ -c model-store -n kserve-test
Refer to the link for other storage options.
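It is worth confirming that the files landed where config.properties expects them before creating the InferenceService:
# Verify the archive and config were copied to the PV
kubectl exec -it model-store-pod -c model-store -n kserve-test -- ls /pv/model-store /pv/config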
- Step - 5 : Create the Inference Service
# For v1 protocol
kubectl apply -f ../reference_yaml/torchserve-deployment/v1/ts_sample.yaml -n kserve-test
# For v2 protocol
kubectl apply -f ../reference_yaml/torchserve-deployment/v2/ts_sample.yaml -n kserve-test
Refer to the link for more examples.
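The ts_sample.yaml files above are the reference manifests; as a rough sketch of the idea only, a v1-protocol InferenceService points its PyTorch predictor at the PVC that holds the model-store and config directories (the claim name below is a placeholder, not necessarily what the reference YAML uses):
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: torch-pred
spec:
  predictor:
    pytorch:
      # PVC created in Step 3; it must contain model-store/ and config/
      storageUri: pvc://model-pv-claim/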
- Step - 6 : Generating input files
KServe supports different types of inputs (for example, tensor and bytes). Use the following instructions to generate input files based on the input type.
- MNIST input generation
- BERT input generation
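The generation scripts linked above are the recommended path; purely as an illustration of assembling a v1 request by hand (the image filename digit.png is a placeholder), the payload is just a base64-encoded image wrapped in the instances envelope:
# Build a v1 request JSON from a local image file (illustrative only)
echo "{\"instances\": [{\"data\": \"$(base64 < digit.png | tr -d '\n')\"}]}" > mnist.json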
- Step - 7 : Make a prediction with a curl request as below:
DEPLOYMENT_NAME=torch-pred
SERVICE_HOSTNAME=$(kubectl get inferenceservice ${DEPLOYMENT_NAME} -n kserve-test -o jsonpath='{.status.url}' | cut -d "/" -f 3)
INGRESS_HOST=$(kubectl -n istio-system get service istio-ingressgateway -o jsonpath='{.status.loadBalancer.ingress[0].hostname}')
INGRESS_PORT=$(kubectl -n istio-system get service istio-ingressgateway -o jsonpath='{.spec.ports[?(@.name=="http2")].port}')
For v1 protocol
curl -v -H "Host: ${SERVICE_HOSTNAME}" http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/mnist-kf:predict -d @./kf_request_json/v1/mnist/mnist.json
For v2 protocol
curl -v -H "Host: ${SERVICE_HOSTNAME}" http://${INGRESS_HOST}:${INGRESS_PORT}/v2/models/mnist-kf/infer -d ./kf_request_json/v2/mnist/mnist_v2_bytes.json
- Step - 8 : Request an explanation with a curl request as below:
For v1 protocol
curl -v -H "Host: ${SERVICE_HOSTNAME}" http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/mnist-kf:explain -d ./kf_request_json/v1/mnist/mnist.json
For v2 protocol
curl -v -H "Host: ${SERVICE_HOSTNAME}" http://${INGRESS_HOST}:${INGRESS_PORT}/v2/models/mnist-kf/explain -d ./kf_request_json/v2/mnist/mnist_v2_bytes.json
Refer to the individual READMEs for KServe:
Sample input JSON file for v1 and v2 protocols
For v1 protocol
{
"instances": [
{
"data": "iVBORw0eKGgoAAAANSUhEUgAAABwAAAAcCAAAAABXZoBIAAAAw0lEQVR4nGNgGFggVVj4/y8Q2GOR83n+58/fP0DwcSqmpNN7oOTJw6f+/H2pjUU2JCSEk0EWqN0cl828e/FIxvz9/9cCh1zS5z9/G9mwyzl/+PNnKQ45nyNAr9ThMHQ/UG4tDofuB4bQIhz6fIBenMWJQ+7Vn7+zeLCbKXv6z59NOPQVgsIcW4QA9YFi6wNQLrKwsBebW/68DJ388Nun5XFocrqvIFH59+XhBAxThTfeB0r+vP/QHbuDCgr2JmOXoSsAAKK7bU3vISS4AAAAAElFTkSuQmCC"
}
]
}
For v2 protocol
{
"id": "d3b15cad-50a2-4eaf-80ce-8b0a428bd298",
"inputs": [{
"name": "4b7c7d4a-51e4-43c8-af61-04639f6ef4bc",
"shape": -1,
"datatype": "BYTES",
"data": "this year business is good"
}]
}
For the request and response of the BERT and Text Classifier models, refer to the "Request and Response" section of the BERT README file.
- Check if the pod is up and running:
kubectl get pods -n kserve-test
- Check pod events:
kubectl describe pod <pod-name> -n kserve-test
- Get pod logs to track errors:
kubectl logs torch-pred -c kserve-container -n kserve-test
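- Check the InferenceService status (assuming the service name torch-pred used above); this shows readiness, the serving URL, and revision details:
kubectl get inferenceservice torch-pred -n kserve-test
kubectl describe inferenceservice torch-pred -n kserve-test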
One of the main serverless inference features is automatically scaling the replicas of an InferenceService to match the incoming workload.
KServe enables the Knative Pod Autoscaler by default, which watches traffic flow and scales up and down based on the configured metrics.
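As an illustrative sketch only, the concurrency target and replica bounds can be tuned on the InferenceService itself; the annotation value and replica counts below are placeholders:
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: torch-pred
  annotations:
    autoscaling.knative.dev/target: "10"   # target in-flight requests per replica
spec:
  predictor:
    minReplicas: 1
    maxReplicas: 4
    pytorch:
      storageUri: pvc://model-pv-claim/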
Canary rollout is a deployment strategy in which you release a new version of the model to a small percentage of the production traffic.
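As a hedged sketch of how that looks with KServe's canaryTrafficPercent field (the storage path and percentage below are placeholders for an updated model version):
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: torch-pred
spec:
  predictor:
    canaryTrafficPercent: 10        # send 10% of traffic to the latest revision
    pytorch:
      storageUri: pvc://model-pv-claim/v2/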