This article lays out how GPU Utilization is actually measured and shows that it can report very high values even when the GPU is doing little useful compute. For example, the author shares that in some of their initial testing, their models were reaching "100% utilization" while only hitting 20% of the maximum theoretical Model FLOPS (Floating Point Operations per Second).
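For a sense of how a figure like that 20% is computed: Model FLOPS utilization is just achieved FLOPs divided by the hardware's theoretical peak. Below is a minimal back-of-envelope sketch using the common 6 × params × tokens/sec approximation for transformer training FLOPs; every number in it is an illustrative assumption, not a figure from the article.

```python
# Back-of-envelope Model FLOPS utilization (MFU), using the common
# 6 * params * tokens/sec approximation for transformer training FLOPs.
# All numbers below are illustrative assumptions, not measurements.
params = 7e9               # model parameter count
tokens_per_second = 2_500  # observed training throughput
peak_flops = 312e12        # e.g. A100 fp16 dense peak, per NVIDIA specs

achieved_flops = 6 * params * tokens_per_second
mfu = achieved_flops / peak_flops
print(f"MFU: {mfu:.1%}")   # ~33.7% for these made-up numbers
```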
The article recommends looking at a metric called SM Efficiency (SM for streaming multiprocessor; also called SM Activity), which reports the % of SMs that are active. A discrepancy between GPU Utilization and SM Efficiency can be an indicator of a less visible bottleneck, one that "fused kernels" can often help with. Using Flash Attention or SDPA is one example of this (see the sketch below), but according to the article there are also readily available fused implementations for other types of layers. I didn't look into those alternatives too much, so it's possible we're already using more than one of them for their general benefits.
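To make the fused-kernel idea concrete, here's a minimal PyTorch sketch contrasting unfused attention (several separate kernels, with the full score matrix round-tripped through GPU memory) against `F.scaled_dot_product_attention`, which dispatches to a Flash-Attention-style fused kernel when the hardware and dtype allow it. The shapes are arbitrary, and it assumes PyTorch 2.0+ on a CUDA device.

```python
import torch
import torch.nn.functional as F

# Arbitrary shapes: (batch, heads, seq_len, head_dim).
q = torch.randn(8, 16, 1024, 64, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# Unfused attention: each op launches its own kernel, and the
# (seq_len x seq_len) score matrix is materialized in GPU memory.
scores = (q @ k.transpose(-2, -1)) / (q.shape[-1] ** 0.5)
out_unfused = torch.softmax(scores, dim=-1) @ v

# Fused attention: one entry point that can dispatch to a
# Flash-Attention-style kernel, avoiding the intermediate matrix.
out_fused = F.scaled_dot_product_attention(q, k, v)

# Loose fp16 tolerances; the two paths differ in accumulation order.
torch.testing.assert_close(out_unfused, out_fused, atol=1e-2, rtol=1e-2)
```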
If nothing else, it may be useful to add SM Efficiency to our standard set of metrics logged on ClearML. The metric is available in the NVIDIA Data Center GPU Manager (DCGM), and it is also available on-demand through `nvidia-smi dmon`.
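As a rough sketch of the on-demand path (before wiring up DCGM properly), the snippet below just shells out to `nvidia-smi dmon` and parses the `sm` column; the exact column set varies by driver version, so it reads the header rather than hard-coding positions. The `print` at the end stands in for whatever ClearML reporting call we'd actually use.

```python
import subprocess

# One-shot sample of per-GPU utilization counters ("-s u" selects the
# utilization group; "-c 1" takes a single sample).
raw = subprocess.run(
    ["nvidia-smi", "dmon", "-s", "u", "-c", "1"],
    capture_output=True, text=True, check=True,
).stdout.splitlines()

# First line is the column-name header, second is the units row.
header = raw[0].lstrip("#").split()
sm_col = header.index("sm")
for line in raw[2:]:
    fields = line.split()
    print(f"GPU {fields[0]}: SM {fields[sm_col]}%")  # stand-in for a ClearML report
```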