Hi @csheaff, did you figure out what was happening here? In my experience trying something similar, going from 1 to 2 GPU instances can give a slight improvement in throughput when sending many large batches from a Triton client in short order, but as you said, latency is often slightly worse. My loose mental model is that GPUs are generally unable to run executions simultaneously, so if both instances are in the middle of an inference, one ends up waiting on the other; kind of like a GIL, really. In a high-throughput scenario this still helps, because there is some CPU-bound work in each GPU-using Python backend instance, so one instance can begin executing while the other is finishing up.
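For concreteness, the client-side pattern I have in mind looks roughly like this; the model name, tensor names, shapes, and URL below are placeholders rather than details from this thread:

```python
import numpy as np
import tritonclient.http as httpclient

# Placeholder model and tensor names.
MODEL_NAME = "inference_model"

# `concurrency` sets how many HTTP connections the client keeps open,
# which is what allows several requests to be in flight at once.
client = httpclient.InferenceServerClient(url="localhost:8000", concurrency=4)

def make_inputs(batch: np.ndarray):
    inp = httpclient.InferInput("INPUT__0", batch.shape, "FP32")
    inp.set_data_from_numpy(batch)
    return [inp]

# Fire several large batches back to back without waiting on each result.
batches = [np.random.rand(8, 3, 224, 224).astype(np.float32) for _ in range(4)]
handles = [client.async_infer(MODEL_NAME, inputs=make_inputs(b)) for b in batches]

# Total wall time over all requests reflects throughput;
# each individual request's time reflects latency.
results = [h.get_result().as_numpy("OUTPUT__0") for h in handles]
```

With two instances configured, the second request can start its CPU-side work while the first instance is still busy, which is where the modest throughput gain comes from.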
Context: I have a PyTorch model that cannot be converted to TorchScript, so I am serving it with the Python backend as a BLS model (I'll refer to this as the inference model). I am calling this model from another BLS model, since this step is part of a pipeline (I'll call this the calling model).
I would like to have multiple copies of the inference model available for concurrent model execution on a single GPU, and then have the calling model use those copies by splitting a large batch of data into smaller batches and sending them asynchronously. I believe I largely have the code in place to do this. For my inference model I specify:
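(Simplified sketch of the relevant `config.pbtxt` settings; the instance count and other values here are illustrative placeholders rather than the exact values I'm using.)

```
# Simplified config.pbtxt for the inference model (placeholder values):
# two instances of the Python model on GPU 0, so two requests can execute concurrently.
backend: "python"
max_batch_size: 8

instance_group [
  {
    count: 2
    kind: KIND_GPU
    gpus: [ 0 ]
  }
]
```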
with an inference model file (`model.py`) like so:
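(Again a simplified sketch: the tensor names and data types are placeholders, and `load_my_model()` is a hypothetical helper standing in for however the non-TorchScript PyTorch model actually gets constructed.)

```python
import json

import torch
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def initialize(self, args):
        # args["model_config"] is the JSON-serialized config.pbtxt.
        self.model_config = json.loads(args["model_config"])
        self.device = torch.device("cuda")
        # Hypothetical loader for the non-TorchScript PyTorch model.
        self.model = load_my_model().to(self.device).eval()

    def execute(self, requests):
        responses = []
        for request in requests:
            # Placeholder tensor names.
            in_tensor = pb_utils.get_input_tensor_by_name(request, "INPUT__0")
            batch = torch.from_numpy(in_tensor.as_numpy()).to(self.device)

            with torch.no_grad():
                out = self.model(batch)

            out_tensor = pb_utils.Tensor("OUTPUT__0", out.cpu().numpy())
            responses.append(pb_utils.InferenceResponse(output_tensors=[out_tensor]))
        return responses
```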
In my calling model I do:
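(The sketch below shows the asynchronous BLS pattern I'm describing: split the incoming batch, issue the sub-requests with `async_exec()`, and gather the results. The model and tensor names and the number of chunks are placeholders.)

```python
import asyncio

import numpy as np
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    # async_exec() requires execute() to be a coroutine.
    async def execute(self, requests):
        responses = []
        for request in requests:
            big_batch = pb_utils.get_input_tensor_by_name(request, "INPUT__0").as_numpy()

            # Split the large batch into smaller chunks, one per inference-model instance.
            chunks = np.array_split(big_batch, 2, axis=0)

            infer_futures = []
            for chunk in chunks:
                infer_request = pb_utils.InferenceRequest(
                    model_name="inference_model",  # placeholder name
                    requested_output_names=["OUTPUT__0"],
                    inputs=[pb_utils.Tensor("INPUT__0", chunk)],
                )
                # Issue the request without waiting for it to finish.
                infer_futures.append(infer_request.async_exec())

            # Wait for all sub-requests to complete.
            infer_responses = await asyncio.gather(*infer_futures)

            outputs = []
            for infer_response in infer_responses:
                if infer_response.has_error():
                    raise pb_utils.TritonModelException(infer_response.error().message())
                outputs.append(
                    pb_utils.get_output_tensor_by_name(infer_response, "OUTPUT__0").as_numpy()
                )

            out_tensor = pb_utils.Tensor("OUTPUT__0", np.concatenate(outputs, axis=0))
            responses.append(pb_utils.InferenceResponse(output_tensors=[out_tensor]))
        return responses
```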
This script works, but the runtime is not shorter than the single-model, single-request scenario. In fact, it's a little longer, according to the timings in my log.
Am I missing something here? Is concurrent model execution not possible with models served using the Python backend? Do I need to design my inference model file more intelligently? From what I see here, I'm presuming that this should be possible. Any help would be much appreciated.
Update:
I decided to just literally copy the inference model, then update the model names in my calling model so that the two requests get sent to different models. Same result. I have also verified that the two GPU models are being run concurrently by watching the output of nvidia-smi.
After throwing various log statements into my inference model and verifying the time taken by each step, the conclusion I've come to is that the communication of the (3D) data between the two BLS models is the culprit. Because this overhead overshadows the inference time, I'm seeing basically no benefit.
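For anyone wanting to reproduce this kind of breakdown: a simple approach is to time the BLS round trip in the calling model and the bare forward pass in the inference model separately; the gap between the two is the inter-model data movement plus scheduling overhead. A minimal helper (label strings and placement are illustrative) could look like this:

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label: str):
    """Print wall-clock time for the enclosed block; output shows up in the Triton server log."""
    start = time.perf_counter()
    try:
        yield
    finally:
        print(f"[timing] {label}: {time.perf_counter() - start:.4f} s", flush=True)


# In the calling model:   with timed("BLS round trip"): infer_responses = await asyncio.gather(*infer_futures)
# In the inference model: with timed("forward pass"):   out = self.model(batch)
```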