This guide demonstrates how to get up and running with Distributed Inference with llm-d on RHOAI, based on:
- Deploying a model by using the Distributed Inference with llm-d [Developer preview]
- KServe Docs - OpenDataHub
- LLM-D Docs - Precise Prefix Cache Aware Routing
Prerequisites

- OpenShift - 4.19+
  - role: cluster-admin
- OpenShift AI - 2.25+
Red Hat Demo Platform Options (Tested)
NOTE: The node sizes below are the recommended minimum to select for provisioning
- AWS with OpenShift Open Environment
  - 1 x Control Plane - m6a.2xlarge
  - 0 x Workers - m6a.4xlarge
- Red Hat OpenShift Container Platform Cluster (AWS)
  - 1 x Control Plane
Install the OpenShift Web Terminal
The following icon should appear in the top right of the OpenShift web console after you have installed the operator. Clicking this icon launches the web terminal.
NOTE: Reload the page in your browser if you do not see the icon after installing the operator.
# apply the enhanced web terminal
oc apply -k https://github.com/redhat-na-ssa/llm-d-demo/demo/web-terminal
# delete old web terminal
$(wtoctl | grep 'oc delete')
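If the terminal icon still does not show up after a reload, you can confirm the Web Terminal operator finished installing. A minimal check, assuming the operator's CSV lands in the default openshift-operators namespace:

# confirm the Web Terminal operator CSV reached the Succeeded phase
oc get csv -n openshift-operators | grep -i web-terminal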
Setup cluster nodes

# isolate the control plane
ocp_control_nodes_not_schedulable
# setup L4 single GPU machine set
ocp_aws_machineset_create_gpu g6.xlarge
# scale machineset to at least 1
ocp_machineset_scale 1
# setup cluster gpu autoscaling
apply_firmly demo/nvidia-gpu-autoscale
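Before continuing, it is worth confirming that the GPU machine set scaled up and the new worker joined the cluster. A rough check, assuming NFD / the NVIDIA GPU Operator has labeled the node with nvidia.com/gpu.present:

# list machine sets and any GPU-labeled nodes (label is applied by NFD / the GPU Operator)
oc get machinesets -n openshift-machine-api
oc get nodes -l nvidia.com/gpu.present=true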
Prerequisites MUST be installed for the following!

- OpenShift
- OpenShift AI
- MetalLB - not required for cloud deployments
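A quick way to sanity-check the prerequisite operators is to look for their CSVs; the operator names below (rhods-operator for OpenShift AI, metallb-operator for MetalLB) are assumptions based on the default operator catalogs:

# look for the OpenShift AI and MetalLB operator CSVs (names assumed from default catalogs)
oc get csv -A | grep -Ei 'rhods-operator|metallb'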
The following command will create an InferenceService using the model gpt-oss-20b in the demo-guidellm namespace. A guidellm pod will attempt to download tokenizer info and benchmark the model deployment above.
apply_firmly demo/guidellm
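To watch the benchmark run, you can follow the guidellm pod's logs; the namespace comes from the text above, while the app=guidellm label selector is an assumption about how the demo manifests label the pod:

# follow the benchmark output (the app=guidellm label is an assumption)
oc get pods -n demo-guidellm
oc logs -n demo-guidellm -l app=guidellm -f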
The following command will create a LLMInferenceService using the model gpt-oss-20b in the demo-llm namespace with a 40G persistent volume claim (to avoid downloading the model multiple times).
until oc apply -k demo/llm-d; do : ; done
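Before wiring up monitoring, it can help to confirm that the model-cache PVC bound and that the llm-d pods are starting; a minimal check against the demo-llm namespace described above:

# confirm the model-cache PVC is Bound and watch the llm-d pods come up
oc get pvc -n demo-llm
oc get pods -n demo-llm -w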
The monitoring stack provides real-time metrics and dashboards for monitoring LLM inference performance, including Time to First Token (TTFT), inter-token latency, KV cache hit rates, and GPU utilization. This helps demonstrate the flexibility of OpenShift for collecting, monitoring, and displaying the inference performance data provided by OpenShift AI.

until oc apply -k gitops/instance/llm-d-monitoring ; do : ; done
# get the grafana url
oc get route grafana -n llm-d-monitoring -o jsonpath='{.spec.host}'
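If the route has not been admitted yet, a port-forward is an alternative way to reach the dashboards; the grafana service name and port 3000 are assumptions based on the route above and the Grafana default:

# fall back to a port-forward if the route is not ready (service name and port are assumptions)
oc port-forward -n llm-d-monitoring svc/grafana 3000:3000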
# wait for the llm inference service to be available
oc get llminferenceservice -n demo-llm

Test with curl
INFERENCE_URL=$(
oc -n openshift-ingress get gateway openshift-ai-inference \
-o jsonpath='{.status.addresses[0].value}'
)
LLM=openai/gpt-oss-20b
LLM_SVC=${LLM##*/}
PROMPT="Explain the difference between supervised and unsupervised learning in machine learning. Include examples of algorithms used in each type."
llm_post_data(){
cat <<JSON
{
"model": "${LLM}",
"prompt": "${PROMPT}",
"max_tokens": 200,
"temperature": 0.7,
"top_p": 0.9
}
JSON
}
curl -s -X POST "http://${INFERENCE_URL}/demo-llm/${LLM_SVC}/v1/completions" \
  -H "Content-Type: application/json" \
  -d "$(llm_post_data)" | jq '.choices[0].text'
- Disconnected RHOAI Notes
- Local Notes
- Manual Steps
- Deploying a model by using the Distributed Inference Server with llm-d
- LLM-D: GPU-Accelerated Cache-Aware LLM Inference
- Demystifying Inferencing at Scale with LLM-D on Red Hat OpenShift on IBM Cloud
- OAI Release Notes - 2.25
- OAI Distributed Inference - 2.25
- guideLLM
- OpenShift Docs - MetalLB
- OpenShift Docs - Ingress (GatewayAPI)
- LLM-d - Why do you need a Gateway?
- RHOAI Docs - Distributed Inference Examples
- Documentation and Improvements for exposing llm-d Gateway
