Distributed Inference llm-d Deployment Guide

This guide demonstrates how to get up and running with distributed inference (llm-d) on Red Hat OpenShift AI (RHOAI), based on:

Prerequisites - Get a cluster

  • OpenShift - 4.19+
    • role: cluster-admin
  • OpenShift AI - 2.25+
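
A quick way to verify these prerequisites before continuing (a minimal sketch; the OpenShift AI operator namespace is an assumption based on a default RHOAI install):

# verify the cluster version and that your user is cluster-admin
oc version
oc auth can-i '*' '*' --all-namespaces

# verify the OpenShift AI operator is installed (namespace assumes a default install)
oc get csv -n redhat-ods-operator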

Red Hat Demo Platform Options (Tested)

NOTE: The node sizes below are the recommended minimums to select when provisioning.

The following icon should appear in the top right of the OpenShift web console after you have installed the Web Terminal operator. Clicking this icon launches the web terminal.

Web Terminal

NOTE: Reload the page in your browser if you do not see the icon after installing the operator.

# apply the enhanced web terminal
oc apply -k https://github.com/redhat-na-ssa/llm-d-demo/demo/web-terminal

# delete old web terminal
$(wtoctl | grep 'oc delete')
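
If the terminal does not relaunch, you can confirm the Web Terminal operator finished installing (a hedged check; on a default install the subscription lives in openshift-operators):

# confirm the web terminal operator subscription and csv are healthy
oc get sub,csv -n openshift-operators | grep -i web-terminal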

Setup cluster nodes

# isolate the control plane
ocp_control_nodes_not_schedulable

# setup L40 single GPU machine set
ocp_aws_machineset_create_gpu g6.xlarge

# scale machineset to at least 1
ocp_machineset_scale 1

# setup cluster gpu autoscaling
apply_firmly demo/nvidia-gpu-autoscale
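
To confirm the new GPU capacity before moving on (a sketch; the GPU labels assume the NVIDIA GPU operator and Node Feature Discovery defaults):

# watch the machineset scale up and the node join
oc get machinesets -n openshift-machine-api
oc get nodes -l nvidia.com/gpu.present=true

# confirm the node advertises an allocatable GPU
oc describe node -l nvidia.com/gpu.present=true | grep -i 'nvidia.com/gpu'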

Additional Prerequisites

Prerequisites MUST be installed for the following!

  • OpenShift
  • OpenShift AI
  • MetalLB - not required for cloud deployments

DO NOT ignore this section!
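
One way to confirm the cluster is ready before running the quickstarts (a hedged check; resource names assume a default OpenShift AI install):

# the DataScienceCluster should report Ready
oc get datasciencecluster

# the inference gateway used later in this guide should exist
oc -n openshift-ingress get gateway openshift-ai-inference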

Quickstarts

Model Serving + GuideLLM

The following command will create an InferenceService using the model gpt-oss-20b in the demo-guidellm namespace.

A guidellm pod will attempt to download tokenizer info and benchmark the model deployment above.

Additional Notes

apply_firmly demo/guidellm
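
To follow the benchmark run (a sketch; the guidellm pod label is an assumption, adjust to whatever the pod is actually named):

# watch the InferenceService and benchmark pod come up
oc get pods -n demo-guidellm -w

# tail the benchmark output once the pod is running (label is an assumption)
oc logs -n demo-guidellm -l app=guidellm -f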

Distributed Inference (llm-d)

The following command will create an LLMInferenceService using the model gpt-oss-20b in the demo-llm namespace with a 40G persistent volume claim (to avoid downloading the model multiple times).

# retry until all resources apply cleanly (CRDs may take a moment to register)
until oc apply -k demo/llm-d; do : ; done
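
Once the apply succeeds, you can confirm the model cache PVC bound and the inference pods are starting (a minimal sketch):

# confirm the 40G persistent volume claim is bound
oc get pvc -n demo-llm

# watch the inference pods come up
oc get pods -n demo-llm -w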

Distributed Inference Monitoring Stack (Prometheus + Grafana)

The monitoring stack provides real-time metrics and dashboards for monitoring LLM inference performance, including Time to First Token (TTFT), inter-token latency, KV cache hit rates, and GPU utilization. This demonstrates the flexibility of OpenShift for collecting, monitoring, and displaying the inference performance data exposed by OpenShift AI.

Install Monitoring

# install the monitoring stack (retry until all resources apply cleanly)
until oc apply -k gitops/instance/llm-d-monitoring ; do : ; done

# get the grafana url
oc get route grafana -n llm-d-monitoring -o jsonpath='{.spec.host}'
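
As an example of what the dashboards are built on, you can query Prometheus directly for Time to First Token (a hedged sketch; the Prometheus service name and the vLLM metric name are assumptions and vary by version):

# port-forward to Prometheus and query p95 TTFT (service and metric names are assumptions)
oc -n llm-d-monitoring port-forward svc/prometheus 9090:9090 &
sleep 2
curl -sG http://localhost:9090/api/v1/query \
  --data-urlencode 'query=histogram_quantile(0.95, sum(rate(vllm:time_to_first_token_seconds_bucket[5m])) by (le))' | jq .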

Additional Notes

Send an HTTP request with the OpenAI-compatible API

# wait for the llm inference service to be available
oc get llminferenceservice -n demo-llm
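
The command above only lists the resource; to block until it reports ready (a sketch; the condition name is an assumption and may differ for LLMInferenceService):

# wait for the LLMInferenceService to become Ready (condition name is an assumption)
oc wait llminferenceservice --all -n demo-llm --for=condition=Ready --timeout=15m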

Test with curl

INFERENCE_URL=$(
  oc -n openshift-ingress get gateway openshift-ai-inference \
    -o jsonpath='{.status.addresses[0].value}'
)

LLM=openai/gpt-oss-20b
LLM_SVC=${LLM##*/}   # strip the org prefix -> gpt-oss-20b

PROMPT="Explain the difference between supervised and unsupervised learning in machine learning. Include examples of algorithms used in each type."

llm_post_data(){
cat <<JSON
{
  "model": "${LLM}",
  "prompt": "${PROMPT}",
  "max_tokens": 200,
  "temperature": 0.7,
  "top_p": 0.9
}
JSON
}

curl -s -X POST "http://${INFERENCE_URL}/demo-llm/${LLM_SVC}/v1/completions" \
  -H "Content-Type: application/json" \
  -d "$(llm_post_data)" | jq '.choices[0].text'

Additional Info
