1. [Performance Benefits on IGX Orin w/ Discrete GPU](#performance-benefits-on-igx-orin-w-discrete-gpu)
    1. [Varying Number of Instances](#varying-number-of-instances)
    1. [Varying Number of Parallel Inferences](#varying-number-of-parallel-inferences)

## Steps to enable CUDA MPS
Before enabling CUDA MPS, please [check](https://docs.nvidia.com/deploy/mps/index.html#topic_3_3) whether your system supports CUDA MPS.
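
A minimal sketch of the procedure, based on [NVIDIA's MPS documentation](https://docs.nvidia.com/deploy/mps/index.html); the pipe and log directories are illustrative placeholders:

```bash
# Point clients and the daemon at the same pipe/log directories (placeholders).
export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps
export CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-mps-log

# Start the MPS control daemon.
nvidia-cuda-mps-control -d

# Optionally limit each client's share of SMs, e.g., 40% per application.
echo "set_default_active_thread_percentage 40" | nvidia-cuda-mps-control

# ... run the Holoscan applications in other terminals ...

# Stop the daemon when finished.
echo quit | nvidia-cuda-mps-control
```
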
Please note that concurrently running Holoscan applications may increase the GPU memory footprint. Therefore, one needs to be careful about hitting the GPU memory size and [potential delay due to page faults](https://developer.nvidia.com/blog/improving-gpu-memory-oversubscription-performance/).

## Performance Benefits
CUDA MPS improves the performance for concurrently running Holoscan applications.
Since multiple applications can simultaneously execute more than one CUDA compute task with CUDA MPS, it can also improve the overall GPU utilization.

## Performance Benefits on x86 System

Suppose we want to run the endoscopy tool tracking and ultrasound segmentation applications concurrently on an x86 workstation with an RTX A6000 GPU. The table below shows the maximum end-to-end latency without and with CUDA MPS, where the active thread percentage is set to 40% for each application.
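
As a rough sketch, such a run could be launched as follows; the application launch commands are illustrative, while `CUDA_MPS_ACTIVE_THREAD_PERCENTAGE` is the documented per-client way to set the active thread percentage:

```bash
# With the MPS daemon running, limit each client to 40% of the SMs.
export CUDA_MPS_ACTIVE_THREAD_PERCENTAGE=40

# Launch both applications concurrently (illustrative HoloHub launch commands).
./run launch endoscopy_tool_tracking python &
./run launch ultrasound_segmentation python &
wait
```
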
The experiment demonstrates up to 36% improvement in maximum end-to-end latency with CUDA MPS.

Such experiments can easily be conducted with [Holoscan Flow Benchmarking](../../benchmarks/holoscan_flow_benchmarking) to retrieve various end-to-end latency performance metrics.

## IGX Orin

CUDA MPS is available on IGX Orin since CUDA 12.5. Please check your CUDA version and upgrade to CUDA 12.5 or later to test CUDA MPS. We evaluate the benefits of MPS on IGX Orin with discrete and integrated GPUs. Please follow the steps outlined in [Steps to enable CUDA MPS](https://github.com/nvidia-holoscan/holohub/tree/main/tutorials/cuda_mps#steps-to-enable-cuda-mps) to start running the MPS server on IGX Orin.
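
For example, the installed CUDA version can be checked with:

```bash
nvcc --version   # toolkit version
nvidia-smi       # driver-reported CUDA version and GPU status
```
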
We use the [model benchmarking](https://github.com/nvidia-holoscan/holohub/tree/main/benchmarks/model_benchmarking) application to demonstrate the benefits of CUDA MPS. In general, MPS improves performance by enabling multiple concurrent processes to share a CUDA context and scheduling resources. We show the benefits of using CUDA MPS along two dimensions: (a) increasing the workload per application instance (varying the number of parallel inferences for the same model) and (b) increasing the total number of instances.
### Model Benchmarking Application Setup
Please follow the steps outlined in [model benchmarking](https://github.com/nvidia-holoscan/holohub/tree/main/benchmarks/model_benchmarking) to ensure that the application builds and runs properly.
> Note that you need to run the video using [v4l2loopback](https://github.com/nvidia-holoscan/holoscan-sdk/tree/main/examples/v4l2_camera#use-with-v4l2-loopback-devices) in a separate terminal _while_ running the model benchmarking application.

> Make sure to change the device path in the `model_benchmarking/python/model_benchmarking.yaml` file to match the values you provided in the `modprobe` command when following the [v4l2loopback](https://github.com/nvidia-holoscan/holoscan-sdk/tree/main/examples/v4l2_camera#use-with-v4l2-loopback-devices) instructions.
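
For a single instance, the flow could look like the sketch below; the device number, video file name, and pixel format are illustrative and should be adapted to your setup:

```bash
# Create one virtual capture device at /dev/video3.
sudo modprobe v4l2loopback devices=1 video_nr=3

# In a separate terminal, loop a source video into the virtual device.
ffmpeg -stream_loop -1 -re -i sample_video.mp4 -pix_fmt yuyv422 -f v4l2 /dev/video3
```
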
### Performance Benchmark Setup
To gather performance metrics for the model benchmarking application, follow the steps outlined in [Holoscan Flow Benchmarking](../../benchmarks/holoscan_flow_benchmarking).

> If you are running within a container, please complete Step-3 before launching the container.

ii. Determine the number of instances you would like to benchmark and set that as the value of `devices`. Then, load the `v4l2loopback` kernel module on virtual devices `/dev/video[*]`. This enables each instance to get its input from a separate virtual device.
**Example:** For 3 instances, the `v4l2loopback` kernel module can be loaded on `/dev/video1`, `/dev/video2` and `/dev/video3`:
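
A typical invocation (the exact module options may vary with your setup):

```bash
sudo modprobe v4l2loopback devices=3 video_nr=1,2,3
```
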
With Holoscan Flow Benchmarking set up, the model benchmarking application can be benchmarked with:

```bash
python benchmarks/holoscan_flow_benchmarking/benchmark.py \
  --run-command="python applications/model_benchmarking/python/model_benchmarking.py -l <number of parallel inferences> -i" \
  --language python -i <number of instances> -r <number of runs> -m <number of messages> \
  --sched greedy -d <outputs folder> -u
```
The command executes `<number of runs>` runs of `<number of instances>` instances of the model benchmarking application with `<number of messages>` messages. Each instance runs `<number of parallel inferences>` parallel inferences with post-processing and visualization disabled (`-i`).
Please refer to the [model benchmarking options](https://github.com/nvidia-holoscan/holohub/tree/main/benchmarks/model_benchmarking#capabilities) and [Holoscan flow benchmarking options](https://github.com/nvidia-holoscan/holohub/tree/main/benchmarks/holoscan_flow_benchmarking) for more information on the various command options.
**Example**: After Step-3, to benchmark 3 instances for 10 runs with 1000 messages, run:
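
Assuming 7 parallel inferences per instance (as in the experiments below) and an output folder named `myoutputs`, the command could look like:

```bash
python benchmarks/holoscan_flow_benchmarking/benchmark.py \
  --run-command="python applications/model_benchmarking/python/model_benchmarking.py -l 7 -i" \
  --language python -i 3 -r 10 -m 1000 --sched greedy -d myoutputs -u
```
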
### Performance Benefits on IGX Orin w/ Discrete GPU
We examine the performance benefits of MPS by varying the number of instances and the number of parallel inferences, using the RTX A6000 GPU. We observe that enabling MPS results in up to 12% improvement in maximum latency compared to the default setting.
#### Varying Number of Instances
We fix the number of parallel inferences to 7, the number of runs to 10, and the number of messages to 1000, and vary the number of instances from 3 to 7 using the `-i` parameter. Please refer to [Performance Benchmark Setup](#performance-benchmark-setup) for the benchmarking commands; a sketch of the sweep is shown below.
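
A sketch of such a sweep, reusing the benchmark command from above; the active thread percentage (explained below) is applied through the MPS control daemon before each configuration, and the output folder names are illustrative:

```bash
for n in 3 4 5 6 7; do
  # Set the default active thread percentage to 80/(number of instances),
  # e.g., 80/5 = 16 (integer division).
  echo "set_default_active_thread_percentage $((80 / n))" | nvidia-cuda-mps-control
  python benchmarks/holoscan_flow_benchmarking/benchmark.py \
    --run-command="python applications/model_benchmarking/python/model_benchmarking.py -l 7 -i" \
    --language python -i "$n" -r 10 -m 1000 --sched greedy -d "mps_${n}_instances" -u
done
```
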
The graph below shows the maximum end-to-end latency of the model benchmarking application with and without CUDA MPS, where the active thread percentage was set to `80/(number of instances)`. For example, for 5 instances, we set the active thread percentage to `80/5 = 16`. By provisioning resources this way, we leave some resources idle in case a client requires them. Please refer to [CUDA MPS Resource Provisioning](https://docs.nvidia.com/deploy/mps/#volta-mps-execution-resource-provisioning) for more details.
The graph is missing a bar for the case of 7 instances and 7 parallel inferences because we were unable to get the baseline to execute; the run did complete with MPS enabled, highlighting the advantage of using MPS for large workloads. We see that the maximum end-to-end latency improves when MPS is enabled, and the improvement becomes more pronounced as the number of instances increases. This is because, as the number of concurrent processes grows, MPS confines CUDA workloads to a predefined set of SMs, and it combines the CUDA contexts of multiple processes into one while running them together.
It reduces the number of context switches and the associated interference, resulting in improved GPU utilization.

#### Varying Number of Parallel Inferences

We vary the number of parallel inferences to show that MPS may not be beneficial if the workload is insufficient to offset the overhead of running the MPS server. The graph below shows the result of increasing the number of parallel inferences from 3 to 7 while the number of instances is held constant.
As the number of parallel inferences increases, so does the workload, and the benefit of MPS is more evident. However, when the workload is low, CUDA MPS may not be beneficial.
| Maximum Latency for 5 Instances |
| :-------------------------: |
| *(latency comparison graph)* |