Hello everyone,
I’m currently exploring the feasibility and value of using Triton for our inference workloads and would love to hear about your use cases. Specifically, I’m interested in how Triton performs in terms of GPU memory usage and inference capacity. In other words, I’d like three numbers: how many models (of what size) you can execute concurrently on your GPU (with how much memory).
Currently, I am using A30 GPUs with 24 GB of memory on Alibaba Cloud servers. My model (wav2lip, a CNN) is around 200 MB. Alibaba Cloud provides a GPU-virtualization solution for running multiple models concurrently, but it can only run 5 models simultaneously on an A30 with onnxruntime, which results in high inference costs. Since 5 copies of a 200 MB model amount to only about 1 GB of weights on a 24 GB card, I suspect the limit is not GPU memory itself. I hope to run as many models as possible, so I’m now considering Triton as a potential alternative to reduce inference costs.
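Once I have a Triton test deployment up, I plan to measure capacity myself with Triton’s perf_analyzer tool, along these lines (the model name and endpoint are placeholders for my setup, not a verified command):

```
# Hypothetical sweep: drive the (placeholder) wav2lip_onnx model with
# client concurrency 1..16 over gRPC and record throughput and latency.
perf_analyzer -m wav2lip_onnx \
    -u localhost:8001 -i grpc \
    --concurrency-range 1:16
```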
Could you please share your Triton use cases, especially how many models of what size you can run simultaneously on a GPU with how much memory? And do you have any tricks for increasing the number of models Triton can serve in parallel? For example, I guess switching from onnxruntime to TensorRT, using FP16, or tuning some configuration parameters might help; a sketch of what I have in mind follows.
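Concretely, this is roughly the kind of config.pbtxt I imagine trying, based on my reading of the Triton model-configuration docs. The model name, instance count, and batching values below are guesses for my wav2lip case, not a tested configuration:

```
# config.pbtxt -- hypothetical sketch, all values are placeholders
name: "wav2lip_onnx"
platform: "onnxruntime_onnx"
max_batch_size: 8

# Run several copies of the model on the same GPU.
instance_group [
  {
    count: 4
    kind: KIND_GPU
    gpus: [ 0 ]
  }
]

# Let Triton batch individual requests together for throughput.
dynamic_batching {
  max_queue_delay_microseconds: 1000
}

# Ask the ONNX Runtime backend to use the TensorRT execution
# provider with FP16, which is one of the tricks I'm guessing at.
optimization {
  execution_accelerators {
    gpu_execution_accelerator : [
      {
        name : "tensorrt"
        parameters { key: "precision_mode" value: "FP16" }
      }
    ]
  }
}
```

Does anyone know whether settings like these actually raise the number of models (or model instances) a single A30 can serve?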
Thank you!