This guide demonstrates how to get up and running with Distributed Inference with llm-d on RHOAI, based on:
- Deploying a model by using the Distributed Inference with llm-d [Developer preview]
- KServe Docs - OpenDataHub
- LLM-D Docs - Precise Prefix Cache Aware Routing
Prerequisites

- OpenShift - 4.19+
  - role: cluster-admin
- OpenShift AI - 2.25+
Red Hat Demo Platform Options (Tested)
NOTE: The node sizes below are the recommended minimum to select for provisioning
- AWS with OpenShift Open Environment
  - 1 x Control Plane - m6a.2xlarge
  - 0 x Workers - m6a.4xlarge
- Red Hat OpenShift Container Platform Cluster (AWS)
  - 1 x Control Plane
Install the OpenShift Web Terminal
The following icon should appear in the top right of the OpenShift web console after you have installed the operator. Clicking this icon launches the web terminal.
NOTE: Reload the page in your browser if you do not see the icon after installing the operator.
# apply the enhanced web terminal
oc apply -k https://github.com/redhat-na-ssa/llm-d-demo/demo/web-terminal
# delete old web terminal
$(wtoctl | grep 'oc delete')
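If the terminal icon still does not show up after a reload, you can confirm the Web Terminal operator finished installing. A minimal check, assuming the operator's CSV lands in the default openshift-operators namespace:

# confirm the Web Terminal operator CSV reached the Succeeded phase
oc get csv -n openshift-operators | grep -i web-terminal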
Setup cluster nodes

# isolate the control plane
ocp_control_nodes_not_schedulable
# setup L4 single GPU machine set
ocp_aws_machineset_create_gpu g6.xlarge
# scale machineset to at least 1
ocp_machineset_scale 1
# setup cluster gpu autoscaling
apply_firmly demo/nvidia-gpu-autoscale
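Before continuing, it is worth confirming that the GPU machine set scaled up and the new worker joined the cluster. A rough check, assuming NFD / the NVIDIA GPU Operator has labeled the node with nvidia.com/gpu.present:

# list machine sets and any GPU-labeled nodes (label is applied by NFD / the GPU Operator)
oc get machinesets -n openshift-machine-api
oc get nodes -l nvidia.com/gpu.present=true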
Prerequisites MUST be installed for the following!

- OpenShift
- OpenShift AI
- MetalLB - not required for cloud deployments
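A quick way to sanity-check the prerequisite operators is to look for their CSVs; the operator names below (rhods-operator for OpenShift AI, metallb-operator for MetalLB) are assumptions based on the default operator catalogs:

# look for the OpenShift AI and MetalLB operator CSVs (names assumed from default catalogs)
oc get csv -A | grep -Ei 'rhods-operator|metallb'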
The following command will create an InferenceService using the model gpt-oss-20b in the demo-guidellm namespace. A guidellm pod will attempt to download tokenizer info and benchmark the model deployment above.
apply_firmly demo/guidellm
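To watch the benchmark run, you can follow the guidellm pod's logs; the namespace comes from the text above, while the app=guidellm label selector is an assumption about how the demo manifests label the pod:

# follow the benchmark output (the app=guidellm label is an assumption)
oc get pods -n demo-guidellm
oc logs -n demo-guidellm -l app=guidellm -f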
The following command will create a LLMInferenceService using the model gpt-oss-20b in the demo-llm namespace with a 40G persistent volume claim (to avoid downloading the model multiple times).
until oc apply -k demo/llm-d; do : ; done
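Before wiring up monitoring, it can help to confirm that the model-cache PVC bound and that the llm-d pods are starting; a minimal check against the demo-llm namespace described above:

# confirm the model-cache PVC is Bound and watch the llm-d pods come up
oc get pvc -n demo-llm
oc get pods -n demo-llm -w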
The monitoring stack provides real-time metrics and dashboards for monitoring LLM inference performance, including Time to First Token (TTFT), inter-token latency, KV cache hit rates, and GPU utilization. This helps demonstrate the flexibility of OpenShift for collecting, monitoring, and displaying the inference performance data provided by OpenShift AI.

until oc apply -k gitops/instance/llm-d-monitoring ; do : ; done
# get the grafana url
oc get route grafana -n llm-d-monitoring -o jsonpath='{.spec.host}'
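If the route has not been admitted yet, a port-forward is an alternative way to reach the dashboards; the grafana service name and port 3000 are assumptions based on the route above and the Grafana default:

# fall back to a port-forward if the route is not ready (service name and port are assumptions)
oc port-forward -n llm-d-monitoring svc/grafana 3000:3000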
# wait for the llm inference service to be available
oc get llminferenceservice -n demo-llm

Test with curl
INFERENCE_URL=$(
oc -n openshift-ingress get gateway openshift-ai-inference \
-o jsonpath='{.status.addresses[0].value}'
)
LLM=openai/gpt-oss-20b
LLM_SVC=${LLM##*/}
PROMPT="Explain the difference between supervised and unsupervised learning in machine learning. Include examples of algorithms used in each type."
llm_post_data(){
cat <<JSON
{
"model": "${LLM}",
"prompt": "${PROMPT}",
"max_tokens": 200,
"temperature": 0.7,
"top_p": 0.9
}
JSON
}
curl -s -X POST "http://${INFERENCE_URL}/demo-llm/${LLM_SVC}/v1/completions" \
  -H "Content-Type: application/json" \
  -d "$(llm_post_data)" | jq '.choices[0].text'
- Disconnected RHOAI Notes
- Local Notes
- Manual Steps
- Deploying a model by using the Distributed Inference Server with llm-d
- LLM-D: GPU-Accelerated Cache-Aware LLM Inference
- Demystifying Inferencing at Scale with LLM-D on Red Hat OpenShift on IBM Cloud
- OAI Release Notes - 2.25
- OAI Distributed Inference - 2.25
- guideLLM
- OpenShift Docs - MetalLB
- OpenShift Docs - Ingress (GatewayAPI)
- LLM-d - Why do you need a Gateway?
- RHOAI Docs - Distributed Inference Examples
- Documentation and Improvements for exposing llm-d Gateway
