Commit
Add base test for vLLM and its metrics (#1438)
* Add base test for vLLM and its metrics
  Signed-off-by: Luca Giorgi <lgiorgi@redhat.com>
* Update ods_ci/tests/Resources/Files/llm/vllm/vllm_servingruntime.yaml
  Co-authored-by: Vedant Mahabaleshwarkar <vmahabal@redhat.com>
* reimplement using common keywords
  Signed-off-by: Luca Giorgi <lgiorgi@redhat.com>
* Better handle missing metrics from UWM, change expected response format
  Signed-off-by: Luca Giorgi <lgiorgi@redhat.com>
* small cleanup
  Signed-off-by: Luca Giorgi <lgiorgi@redhat.com>
* Move keyword, some cleanup, comments
  Signed-off-by: Luca Giorgi <lgiorgi@redhat.com>

---------

Signed-off-by: Luca Giorgi <lgiorgi@redhat.com>
Co-authored-by: Vedant Mahabaleshwarkar <vmahabal@redhat.com>
1 parent 0775cdf, commit 7b6969a
Showing 10 changed files with 323 additions and 22 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,74 @@ | ||
apiVersion: v1 | ||
kind: Namespace | ||
metadata: | ||
name: vllm-gpt2 | ||
--- | ||
apiVersion: v1 | ||
kind: PersistentVolumeClaim | ||
metadata: | ||
name: vlmm-gpt2-claim | ||
namespace: vllm-gpt2 | ||
spec: | ||
accessModes: | ||
- ReadWriteOnce | ||
volumeMode: Filesystem | ||
resources: | ||
requests: | ||
storage: 10Gi | ||
--- | ||
apiVersion: v1 | ||
kind: Pod | ||
metadata: | ||
name: setup-gpt2-binary | ||
namespace: vllm-gpt2 | ||
labels: | ||
gpt-download-pod: 'true' | ||
spec: | ||
volumes: | ||
- name: model-volume | ||
persistentVolumeClaim: | ||
claimName: vlmm-gpt2-claim | ||
restartPolicy: Never | ||
initContainers: | ||
- name: fix-volume-permissions | ||
image: quay.io/quay/busybox:latest | ||
imagePullPolicy: IfNotPresent | ||
securityContext: | ||
allowPrivilegeEscalation: true | ||
resources: | ||
requests: | ||
memory: "64Mi" | ||
cpu: "250m" | ||
nvidia.com/gpu: "1" | ||
limits: | ||
memory: "128Mi" | ||
cpu: "500m" | ||
nvidia.com/gpu: "1" | ||
command: ["sh"] | ||
args: ["-c", "chown -R 1001:1001 /mnt/models"] | ||
volumeMounts: | ||
- mountPath: "/mnt/models/" | ||
name: model-volume | ||
containers: | ||
- name: download-model | ||
image: registry.access.redhat.com/ubi9/python-311:latest | ||
imagePullPolicy: IfNotPresent | ||
securityContext: | ||
allowPrivilegeEscalation: true | ||
resources: | ||
requests: | ||
memory: "1Gi" | ||
cpu: "1" | ||
nvidia.com/gpu: "1" | ||
limits: | ||
memory: "1Gi" | ||
cpu: "1" | ||
nvidia.com/gpu: "1" | ||
command: ["sh"] | ||
args: [ "-c", "pip install --upgrade pip && pip install --upgrade huggingface_hub && python3 -c 'from huggingface_hub import snapshot_download\nsnapshot_download(repo_id=\"gpt2\", local_dir=\"/mnt/models/gpt2\", local_dir_use_symlinks=False)'"] | ||
volumeMounts: | ||
- mountPath: "/mnt/models/" | ||
name: model-volume | ||
env: | ||
- name: TRANSFORMERS_CACHE | ||
value: /tmp |
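
The `download-model` container passes its Python snippet as a one-line `python3 -c` string, which is hard to read in the manifest. A minimal sketch of the same call as a standalone function follows. The function name `download_model` and the lazy import are illustrative assumptions, not part of the commit, and the sketch assumes `huggingface_hub` is installed (newer releases of that library may deprecate `local_dir_use_symlinks`).

```python
def download_model(repo_id: str = "gpt2", local_dir: str = "/mnt/models/gpt2") -> str:
    """Download a model snapshot from the Hugging Face Hub into local_dir.

    Hypothetical wrapper around the call embedded in the pod's args.
    """
    # Imported lazily so this module loads even where huggingface_hub is absent.
    from huggingface_hub import snapshot_download

    # Same arguments as the manifest: copy real files into the PVC mount
    # instead of symlinking into the Hub cache.
    return snapshot_download(
        repo_id=repo_id,
        local_dir=local_dir,
        local_dir_use_symlinks=False,
    )
```

Calling `download_model()` inside the pod would write the model under `/mnt/models/gpt2`, the same path the serving runtime later passes to `--model`.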
14 changes: 14 additions & 0 deletions
ods_ci/tests/Resources/Files/llm/vllm/vllm-gpt2_inferenceservice.yaml
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,14 @@ | ||
apiVersion: serving.kserve.io/v1beta1 | ||
kind: InferenceService | ||
metadata: | ||
name: vllm-gpt2-openai | ||
namespace: vllm-gpt2 | ||
labels: | ||
modelmesh-enabled: "true" | ||
spec: | ||
predictor: | ||
model: | ||
runtime: kserve-vllm | ||
modelFormat: | ||
name: vLLM | ||
storageUri: pvc://vlmm-gpt2-claim/ |
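
The `storageUri` above uses the `pvc://<claim-name>/<path>` form. As a rough illustration of how such a URI splits into a claim name and a sub-path, here is a small sketch; `parse_pvc_uri` is a hypothetical helper written for this example and is not KServe's own implementation.

```python
from urllib.parse import urlparse


def parse_pvc_uri(uri: str) -> tuple[str, str]:
    """Split a pvc:// storage URI into (claim_name, sub_path)."""
    parsed = urlparse(uri)
    if parsed.scheme != "pvc":
        raise ValueError(f"not a pvc:// URI: {uri!r}")
    # The claim name sits in the authority position; the rest is the sub-path.
    return parsed.netloc, parsed.path.lstrip("/")


claim, sub_path = parse_pvc_uri("pvc://vlmm-gpt2-claim/")
print(claim, repr(sub_path))  # vlmm-gpt2-claim ''
```

An empty sub-path means the whole volume is used. That appears consistent with the serving runtime pointing `--model` at `/mnt/models/gpt2`, assuming the volume is mounted at `/mnt/models`.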
77 changes: 77 additions & 0 deletions
ods_ci/tests/Resources/Files/llm/vllm/vllm_servingruntime.yaml
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,77 @@ | ||
apiVersion: serving.kserve.io/v1alpha1 | ||
kind: ServingRuntime | ||
metadata: | ||
name: kserve-vllm | ||
namespace: vllm-gpt2 | ||
spec: | ||
annotations: | ||
sidecar.istio.io/inject: "true" | ||
sidecar.istio.io/rewriteAppHTTPProbers: "true" | ||
serving.knative.openshift.io/enablePassthrough: "true" | ||
opendatahub.io/dashboard: "true" | ||
openshift.io/display-name: "vLLM OpenAI entry point" | ||
prometheus.io/port: '8080' | ||
prometheus.io/path: "/metrics/" | ||
multiModel: false | ||
supportedModelFormats: | ||
- name: vLLM | ||
autoSelect: true | ||
containers: | ||
- name: kserve-container | ||
#image: kserve/vllmserver:latest | ||
image: quay.io/wxpe/tgis-vllm:release.74803b6 | ||
startupProbe: | ||
httpGet: | ||
port: 8080 | ||
path: /health | ||
# Allow 12 minutes to start | ||
failureThreshold: 24 | ||
periodSeconds: 30 | ||
readinessProbe: | ||
httpGet: | ||
port: 8080 | ||
path: /health | ||
periodSeconds: 30 | ||
timeoutSeconds: 5 | ||
livenessProbe: | ||
httpGet: | ||
port: 8080 | ||
path: /health | ||
periodSeconds: 100 | ||
timeoutSeconds: 8 | ||
terminationMessagePolicy: "FallbackToLogsOnError" | ||
terminationGracePeriodSeconds: 120 | ||
args: | ||
- --port | ||
- "8080" | ||
- --model | ||
- /mnt/models/gpt2 | ||
- --served-model-name | ||
- "gpt2" | ||
command: | ||
- python3 | ||
- -m | ||
- vllm.entrypoints.openai.api_server | ||
env: | ||
- name: STORAGE_URI | ||
value: pvc://vlmm-gpt2-claim/ | ||
- name: HF_HUB_CACHE | ||
value: /tmp | ||
- name: TRANSFORMERS_CACHE | ||
value: $(HF_HUB_CACHE) | ||
- name: NUM_GPUS | ||
value: "1" | ||
- name: CUDA_VISIBLE_DEVICES | ||
value: "0" | ||
ports: | ||
- containerPort: 8080 | ||
protocol: TCP | ||
resources: | ||
limits: | ||
cpu: "4" | ||
memory: 8Gi | ||
nvidia.com/gpu: "1" | ||
requests: | ||
cpu: "1" | ||
memory: 4Gi | ||
nvidia.com/gpu: "1" |
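
The `# Allow 12 minutes to start` comment in the startup probe follows from its settings: the probe may fail up to `failureThreshold` times, with attempts spaced `periodSeconds` apart. A small sketch of that arithmetic is below; the helper name is hypothetical, and the calculation ignores `initialDelaySeconds` and per-probe timeouts.

```python
def startup_budget_seconds(failure_threshold: int, period_seconds: int) -> int:
    """Approximate time a container has to start before the probe gives up."""
    return failure_threshold * period_seconds


# Values taken from the startupProbe in vllm_servingruntime.yaml.
budget = startup_budget_seconds(failure_threshold=24, period_seconds=30)
print(f"{budget} s = {budget / 60:.0f} min")  # 720 s = 12 min
```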
102 changes: 102 additions & 0 deletions
...sts/400__ods_dashboard/420__model_serving/LLMs/vllm/426__model_serving_vllm_metrics.robot
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,102 @@ | ||
*** Settings *** | ||
Documentation Basic vLLM deploy test to validate metrics being correctly exposed in OpenShift | ||
Resource ../../../../../Resources/Page/ODH/ODHDashboard/ODHModelServing.resource | ||
Resource ../../../../../Resources/OCP.resource | ||
Resource ../../../../../Resources/Page/Operators/ISVs.resource | ||
Resource ../../../../../Resources/Page/ODH/ODHDashboard/ODHDashboardAPI.resource | ||
Resource ../../../../../Resources/Page/ODH/ODHDashboard/ODHDataScienceProject/ModelServer.resource | ||
Resource ../../../../../Resources/Page/ODH/ODHDashboard/ODHDataScienceProject/DataConnections.resource | ||
Resource ../../../../../Resources/CLI/ModelServing/llm.resource | ||
Resource ../../../../../Resources/Page/ODH/ODHDashboard/ODHDataScienceProject/Permissions.resource | ||
Library OpenShiftLibrary | ||
Suite Setup Suite Setup | ||
Suite Teardown Suite Teardown | ||
Test Tags KServe | ||
| ||
| ||
*** Variables *** | ||
${VLLM_RESOURCES_DIRPATH}= ods_ci/tests/Resources/Files/llm/vllm | ||
${DL_POD_FILEPATH}= ${VLLM_RESOURCES_DIRPATH}/download_model.yaml | ||
${SR_FILEPATH}= ${VLLM_RESOURCES_DIRPATH}/vllm_servingruntime.yaml | ||
${IS_FILEPATH}= ${VLLM_RESOURCES_DIRPATH}/vllm-gpt2_inferenceservice.yaml | ||
${TEST_NS}= vllm-gpt2 | ||
@{SEARCH_METRICS}= vllm:cache_config_info | ||
... vllm:num_requests_running | ||
... vllm:num_requests_swapped | ||
... vllm:num_requests_waiting | ||
... vllm:gpu_cache_usage_perc | ||
... vllm:cpu_cache_usage_perc | ||
... vllm:prompt_tokens_total | ||
... vllm:generation_tokens_total | ||
... vllm:time_to_first_token_seconds_bucket | ||
... vllm:time_to_first_token_seconds_count | ||
... vllm:time_to_first_token_seconds_sum | ||
... vllm:time_per_output_token_seconds_bucket | ||
... vllm:time_per_output_token_seconds_count | ||
... vllm:time_per_output_token_seconds_sum | ||
... vllm:e2e_request_latency_seconds_bucket | ||
... vllm:e2e_request_latency_seconds_count | ||
... vllm:e2e_request_latency_seconds_sum | ||
... vllm:avg_prompt_throughput_toks_per_s | ||
... vllm:avg_generation_throughput_toks_per_s | ||
| ||
| ||
*** Test Cases *** | ||
Verify User Can Deploy A Model With Vllm Via CLI | ||
[Documentation] Deploy a model (gpt2) using the vllm runtime and confirm that it's running | ||
[Tags] Tier1 Sanity Resources-GPU RHOAIENG-6264 | ||
${rc} ${out}= Run And Return Rc And Output oc apply -f ${DL_POD_FILEPATH} | ||
Should Be Equal As Integers ${rc} ${0} | ||
Wait For Pods To Succeed label_selector=gpt-download-pod=true namespace=${TEST_NS} | ||
${rc} ${out}= Run And Return Rc And Output oc apply -f ${SR_FILEPATH} | ||
Should Be Equal As Integers ${rc} ${0} | ||
#TODO: Switch to common keyword for model DL and SR deploy | ||
#Set Project And Runtime runtime=vllm namespace=${TEST_NS} | ||
#... download_in_pvc=${DOWNLOAD_IN_PVC} model_name=gpt2 | ||
#... storage_size=10Gi | ||
Deploy Model Via CLI ${IS_FILEPATH} ${TEST_NS} | ||
Wait For Pods To Be Ready label_selector=serving.kserve.io/inferenceservice=vllm-gpt2-openai | ||
... namespace=${TEST_NS} | ||
Query Model Multiple Times model_name=gpt2 isvc_name=vllm-gpt2-openai runtime=vllm protocol=http | ||
... inference_type=chat-completions n_times=3 query_idx=8 | ||
... namespace=${TEST_NS} string_check_only=${TRUE} | ||
| ||
Verify Vllm Metrics Are Present | ||
[Documentation] Confirm vLLM metrics are exposed in OpenShift metrics | ||
[Tags] Tier1 Sanity Resources-GPU RHOAIENG-6264 | ||
... Depends On Test Verify User Can Deploy A Model With Vllm Via CLI | ||
${host} = llm.Get KServe Inference Host Via CLI isvc_name=vllm-gpt2-openai namespace=${TEST_NS} | ||
${rc} ${out}= Run And Return Rc And Output | ||
... curl -ks ${host}/metrics/ | ||
Should Be Equal As Integers ${rc} ${0} | ||
Log ${out} | ||
${thanos_url}= Get OpenShift Thanos URL | ||
${token}= Generate Thanos Token | ||
Metrics Should Exist In UserWorkloadMonitoring ${thanos_url} ${token} ${SEARCH_METRICS} | ||
| ||
| ||
*** Keywords *** | ||
Suite Setup | ||
Skip If Component Is Not Enabled kserve | ||
RHOSi Setup | ||
Set Default Storage Class In GCP default=ssd-csi | ||
${is_self_managed}= Is RHODS Self-Managed | ||
IF ${is_self_managed} | ||
Configure User Workload Monitoring | ||
Enable User Workload Monitoring | ||
#TODO: Find reliable signal for UWM being ready | ||
#Sleep 10m | ||
END | ||
Load Expected Responses | ||
| ||
Suite Teardown | ||
Set Default Storage Class In GCP default=standard-csi | ||
${rc}= Run And Return Rc oc delete inferenceservice -n ${TEST_NS} --all | ||
Should Be Equal As Integers ${rc} ${0} | ||
${rc}= Run And Return Rc oc delete servingruntime -n ${TEST_NS} --all | ||
Should Be Equal As Integers ${rc} ${0} | ||
${rc}= Run And Return Rc oc delete pod -n ${TEST_NS} --all | ||
Should Be Equal As Integers ${rc} ${0} | ||
${rc}= Run And Return Rc oc delete namespace ${TEST_NS} | ||
Should Be Equal As Integers ${rc} ${0} | ||
RHOSi Teardown |
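
The `Verify Vllm Metrics Are Present` test fetches `/metrics/` from the model endpoint and then checks that every name in `@{SEARCH_METRICS}` is known to User Workload Monitoring. As a rough sketch of the same idea applied directly to the endpoint output, the code below parses Prometheus text exposition format and reports which expected names are missing. The function names and the sample payload are illustrative assumptions; this is not the implementation of the `Metrics Should Exist In UserWorkloadMonitoring` keyword.

```python
import re

# Metric name at the start of a sample line; Prometheus names may contain colons.
_SAMPLE_NAME = re.compile(r"^([a-zA-Z_:][a-zA-Z0-9_:]*)")


def exposed_metric_names(exposition_text: str) -> set[str]:
    """Collect metric names from Prometheus text exposition output."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        # Skip blank lines and # HELP / # TYPE comments.
        if not line or line.startswith("#"):
            continue
        match = _SAMPLE_NAME.match(line)
        if match:
            names.add(match.group(1))
    return names


def missing_metrics(exposition_text: str, expected: list[str]) -> list[str]:
    """Return the expected metric names that are absent, in the given order."""
    present = exposed_metric_names(exposition_text)
    return [name for name in expected if name not in present]


# Made-up sample payload, for illustration only.
sample = """\
# HELP vllm:num_requests_running Number of requests currently running.
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="gpt2"} 0.0
vllm:prompt_tokens_total{model_name="gpt2"} 42.0
"""
print(missing_metrics(sample, ["vllm:num_requests_running", "vllm:num_requests_waiting"]))
# ['vllm:num_requests_waiting']
```

An empty result would mean every expected metric appears at least once in the scraped output. Histogram series such as `..._bucket`, `..._count` and `..._sum` are matched by their full sample names, which is how they are listed in `@{SEARCH_METRICS}`.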