Commit d8f6fe2

IGX MPS tutorial (#448)

* IGX MPS tutorial

Signed-off-by: Vani Nagarajan <vanin@nvidia.com>

1 parent 4c15732 commit d8f6fe2

File tree: 5 files changed (+114, -3 lines)

tutorials/cuda_mps/README.md

Lines changed: 114 additions & 3 deletions
@@ -5,6 +5,18 @@ applications. It allows multiple CUDA applications to share a single GPU, which
running more than one Holoscan application on a single machine featuring one or more GPUs. This
tutorial describes the steps to enable CUDA MPS and demonstrates a few performance benefits of using it.

## Table of Contents

1. [Steps to enable CUDA MPS](#steps-to-enable-cuda-mps)
2. [Customization](#customization)
3. [x86 System Performance](#performance-benefits-on-x86-system)
4. [IGX Orin](#igx-orin)
    1. [Model Benchmarking Application Setup](#model-benchmarking-application-setup)
    2. [Performance Benchmark Setup](#performance-benchmark-setup)
    3. [Performance Benefits on IGX Orin w/ dGPU](#performance-benefits-on-igx-orin-w-dgpu)
    4. [Varying Number of Instances](#varying-number-of-instances)
    5. [Varying Number of Parallel Inferences](#varying-number-of-parallel-inferences)

## Steps to enable CUDA MPS

Before enabling CUDA MPS, please [check](https://docs.nvidia.com/deploy/mps/index.html#topic_3_3)
@@ -52,12 +64,12 @@ Please note that concurrently running Holoscan applications may increase the GPU
footprint. Therefore, one needs to be careful about hitting the GPU memory size and [potential
delay due to page faults](https://developer.nvidia.com/blog/improving-gpu-memory-oversubscription-performance/).

CUDA MPS improves the performance of concurrently running Holoscan applications.
Since multiple applications can simultaneously execute more than one CUDA compute task with CUDA
MPS, it can also improve the overall GPU utilization.

## Performance Benefits on x86 System

Suppose we want to run the endoscopy tool tracking and ultrasound segmentation applications
concurrently on an x86 workstation with an RTX A6000 GPU. The table below shows the maximum end-to-end latency
without and with CUDA MPS, where the active thread percentage is set to 40% for each application.
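
For reference, one way to apply such a per-application limit is through the MPS control interface. The short sketch below assumes the MPS control daemon can be started as described in the steps above; it is not necessarily the exact setup used for this measurement.

```bash
# Start the MPS control daemon if it is not already running
nvidia-cuda-mps-control -d

# Set the default active thread percentage for MPS clients to 40%
echo "set_default_active_thread_percentage 40" | nvidia-cuda-mps-control

# Applications launched afterwards run as MPS clients limited to roughly 40% of the SMs
```
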
@@ -76,4 +88,103 @@ without CUDA MPS. The experiment demonstrates up to 36% improvement with CUDA MP
![Maximum end-to-end latency with and without CUDA MPS](image.png)

Such experiments can easily be conducted with [Holoscan Flow Benchmarking](../../benchmarks/holoscan_flow_benchmarking) to retrieve
various end-to-end latency performance metrics.

## IGX Orin

CUDA MPS is available on IGX Orin since CUDA 12.5. Please check your CUDA version and upgrade to CUDA 12.5+ to test CUDA MPS. We evaluate the benefits of MPS on IGX Orin with discrete and integrated GPUs. Please follow the steps outlined in [Steps to enable CUDA MPS](https://github.com/nvidia-holoscan/holohub/tree/main/tutorials/cuda_mps#steps-to-enable-cuda-mps) to start the MPS server on IGX Orin.
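
A quick way to confirm the installed CUDA version before proceeding (assuming the CUDA toolkit and the NVIDIA driver utilities are installed) is:

```bash
# CUDA toolkit version (MPS on IGX Orin requires CUDA 12.5 or newer)
nvcc --version

# The driver-reported CUDA version also appears in the nvidia-smi header
nvidia-smi
```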

We use the [model benchmarking](https://github.com/nvidia-holoscan/holohub/tree/main/benchmarks/model_benchmarking) application to demonstrate the benefits of CUDA MPS. In general, MPS improves performance by enabling multiple concurrent processes to share a CUDA context and scheduling resources. We show the benefits of using CUDA MPS along two dimensions: (a) increasing the workload per application instance (varying the number of parallel inferences for the same model) and (b) increasing the total number of instances.

### Model Benchmarking Application Setup

Please follow the steps outlined in [model benchmarking](https://github.com/nvidia-holoscan/holohub/tree/main/benchmarks/model_benchmarking) to ensure that the application builds and runs properly.

> Note that you need to stream the video using [v4l2loopback](https://github.com/nvidia-holoscan/holoscan-sdk/tree/main/examples/v4l2_camera#use-with-v4l2-loopback-devices) in a separate terminal _while_ running the model benchmarking application.

> Make sure to change the device path in the `model_benchmarking/python/model_benchmarking.yaml` file to match the values you provided in the `modprobe` command when following the [v4l2loopback](https://github.com/nvidia-holoscan/holoscan-sdk/tree/main/examples/v4l2_camera#use-with-v4l2-loopback-devices) instructions.

### Performance Benchmark Setup

To gather performance metrics for the model benchmarking application, follow the steps outlined in [Holoscan Flow Benchmarking](../../benchmarks/holoscan_flow_benchmarking).

> If you are running within a container, please complete Step-3 before launching the container.

We use the following steps:

**1. Patch Application:**

`./benchmarks/holoscan_flow_benchmarking/patch_application.sh model_benchmarking`

**2. Build Application for Benchmarking:**

`./run build model_benchmarking python --configure-args -DCMAKE_CXX_FLAGS=-I$PWD/benchmarks/holoscan_flow_benchmarking`

**3. Set Up V4l2Loopback Devices:**

i. Install `v4l2loopback-dkms` and `ffmpeg`:

`sudo apt-get install v4l2loopback-dkms ffmpeg`

ii. Determine the number of instances you would like to benchmark and set that as the value of `devices`. Then, load the `v4l2loopback` kernel module on virtual devices `/dev/video[*]`. This enables each instance to get its input from a separate virtual device.

**Example:** For 3 instances, the `v4l2loopback` kernel module can be loaded on `/dev/video1`, `/dev/video2` and `/dev/video3`:

`sudo modprobe v4l2loopback devices=3 video_nr=1 max_buffers=4`
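
To double-check which device nodes the module actually created (the auto-assigned numbers can differ if other video devices are present), a quick inspection such as the following can help; it assumes the `v4l-utils` package is installed:

```bash
# List video devices; v4l2loopback nodes typically appear as "Dummy video device"
v4l2-ctl --list-devices

# Or simply list the video device nodes
ls -l /dev/video*
```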

Now open 3 separate terminals.

In terminal-1, run:

`ffmpeg -stream_loop -1 -re -i /data/ultrasound_segmentation/ultrasound_256x256.avi -pix_fmt yuyv422 -f v4l2 /dev/video1`

In terminal-2, run:

`ffmpeg -stream_loop -1 -re -i /data/ultrasound_segmentation/ultrasound_256x256.avi -pix_fmt yuyv422 -f v4l2 /dev/video2`

In terminal-3, run:

`ffmpeg -stream_loop -1 -re -i /data/ultrasound_segmentation/ultrasound_256x256.avi -pix_fmt yuyv422 -f v4l2 /dev/video3`

**4. Benchmark Application:**

```
python benchmarks/holoscan_flow_benchmarking/benchmark.py --run-command="python applications/model_benchmarking/python/model_benchmarking.py -l <number of parallel inferences> -i" --language python -i <number of instances> -r <number of runs> -m <number of messages> --sched greedy -d <outputs folder> -u
```

The command executes `<number of runs>` runs of `<number of instances>` instances of the model benchmarking application with `<number of messages>` messages. Each instance runs `<number of parallel inferences>` parallel model benchmarking inferences with no post-processing or visualization (`-i`).

Please refer to [Model benchmarking options](https://github.com/nvidia-holoscan/holohub/tree/main/benchmarks/model_benchmarking#capabilities) and [Holoscan flow benchmarking options](https://github.com/nvidia-holoscan/holohub/tree/main/benchmarks/holoscan_flow_benchmarking) for more information on the various command options.

**Example**: After Step-3, to benchmark 3 instances (each running 7 parallel inferences) for 10 runs with 1000 messages, run:

`python benchmarks/holoscan_flow_benchmarking/benchmark.py --run-command="python applications/model_benchmarking/python/model_benchmarking.py -l 7 -i" --language python -i 3 -r 10 -m 1000 --sched greedy -d myoutputs -u`
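
To compare MPS against the baseline, the same benchmark command can be run once without the MPS control daemon and once with it active. The sketch below illustrates one such workflow; the `nvidia-cuda-mps-control` invocations follow NVIDIA's MPS documentation, the output folder names (`baseline_outputs`, `mps_outputs`) are placeholders, and root privileges may be required depending on how MPS is configured on your system.

```bash
# Baseline: make sure the MPS control daemon is not running, then benchmark
echo quit | nvidia-cuda-mps-control 2>/dev/null || true
python benchmarks/holoscan_flow_benchmarking/benchmark.py \
  --run-command="python applications/model_benchmarking/python/model_benchmarking.py -l 7 -i" \
  --language python -i 3 -r 10 -m 1000 --sched greedy -d baseline_outputs -u

# With MPS: start the control daemon, rerun the same benchmark, then shut MPS down
nvidia-cuda-mps-control -d
python benchmarks/holoscan_flow_benchmarking/benchmark.py \
  --run-command="python applications/model_benchmarking/python/model_benchmarking.py -l 7 -i" \
  --language python -i 3 -r 10 -m 1000 --sched greedy -d mps_outputs -u
echo quit | nvidia-cuda-mps-control
```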

### Performance Benefits on IGX Orin w/ Discrete GPU

We look at the performance benefits of MPS by varying the number of instances and the number of parallel inferences. We use the RTX A6000 GPU for our experiments and observe that enabling MPS results in up to a 12% improvement in maximum latency compared to the default setting.

#### Varying Number of Instances

We fix the number of parallel inferences to 7, the number of runs to 10, and the number of messages to 1000, and vary the number of instances from 3 to 7 using the `-i` parameter. Please refer to [Performance Benchmark Setup](#performance-benchmark-setup) for the benchmarking commands.

The graph below shows the maximum end-to-end latency of the model benchmarking application with and without CUDA MPS, where the active thread percentage was set to `80/(number of instances)`. For example, for 5 instances, we set the active thread percentage to `80/5 = 16`. By provisioning resources this way, we leave some resources idle in case another client needs them. Please refer to [CUDA MPS Resource Provisioning](https://docs.nvidia.com/deploy/mps/#volta-mps-execution-resource-provisioning) for more details.
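
As a concrete illustration of this provisioning rule, the sketch below derives the percentage from a chosen instance count and exports it through the standard `CUDA_MPS_ACTIVE_THREAD_PERCENTAGE` client environment variable; the shell variable names are arbitrary and this is not necessarily the exact script used for these experiments.

```bash
# Derive the MPS active thread percentage from the instance count (80 / instances)
NUM_INSTANCES=5
ACTIVE_THREAD_PERCENTAGE=$((80 / NUM_INSTANCES))   # 80/5 = 16

# Each client started with this variable set is limited to that share of the SMs
export CUDA_MPS_ACTIVE_THREAD_PERCENTAGE=${ACTIVE_THREAD_PERCENTAGE}
echo "Active thread percentage: ${CUDA_MPS_ACTIVE_THREAD_PERCENTAGE}"
```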

The graph is missing a bar for the case of 7 instances and 7 parallel inferences because we were unable to get the baseline to execute; however, we were able to run when MPS was enabled, highlighting the advantage of using MPS for large workloads. We see that the maximum end-to-end latency improves when MPS is enabled, and the improvement is more pronounced as the number of instances increases. This is because, as the number of concurrent processes increases, MPS confines each CUDA workload to a predefined set of SMs. MPS combines the CUDA contexts of multiple processes into one while running them concurrently.
It reduces the number of context switches and the associated interference, resulting in improved GPU utilization.

| Maximum end-to-end Latency |
| :-------------------------:|
| ![max e2e latency](images/multiple_inference_7/Maximum%20Latency%20(ms).png) |

We also notice minor improvements in the 99.9<sup>th</sup> percentile latency and similar improvements in the 99<sup>th</sup> percentile latency.

| 99.9<sup>th</sup> Percentile Latency | 99<sup>th</sup> Percentile Latency |
| :-------------------------: | :-------------------------: |
| ![99.9th percentile latency](images/multiple_inference_7/99.9th%20Percentile%20Latency%20(ms).png) | ![99th percentile latency](images/multiple_inference_7/99th%20Percentile%20Latency%20(ms).png) |

#### Varying Number of Parallel Inferences

We vary the number of parallel inferences to show that MPS may not be beneficial if the workload is insufficient to offset the overhead of running the MPS server. The graph below shows the result of increasing the number of parallel inferences from 3 to 7 while the number of instances is held constant at 5.

As the number of parallel inferences increases, so does the workload, and the benefit of MPS becomes more evident. However, when the workload is low, CUDA MPS may not be beneficial.

| Maximum Latency for 5 Instances |
| :-------------------------:|
| ![max latency for 5 instances](images/Maximum%20Latency%20(ms).png) |