Hi @csheaff, did you figure out what was happening here? In my experience trying something similar, going from 1 to 2 GPU instances can give a slight improvement in throughput when sending many large batches from a Triton client in short order, but as you said, latency is often slightly worse. My loose mental model is that GPUs are generally unable to run executions simultaneously, so if both instances are in the middle of an inference, one ends up waiting on the other; kind of like a GIL, really. In a high-throughput scenario this still helps, because there is some CPU-bound work in each GPU-using Python backend instance, so one instance can begin executing while the other is finishing up.
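For concreteness, the client-side pattern I have in mind looks roughly like this; the model name, tensor names, shapes, and URL below are placeholders rather than details from this thread:

```python
import numpy as np
import tritonclient.http as httpclient

# Placeholder model and tensor names.
MODEL_NAME = "inference_model"

# `concurrency` sets how many HTTP connections the client keeps open,
# which is what allows several requests to be in flight at once.
client = httpclient.InferenceServerClient(url="localhost:8000", concurrency=4)

def make_inputs(batch: np.ndarray):
    inp = httpclient.InferInput("INPUT__0", batch.shape, "FP32")
    inp.set_data_from_numpy(batch)
    return [inp]

# Fire several large batches back to back without waiting on each result.
batches = [np.random.rand(8, 3, 224, 224).astype(np.float32) for _ in range(4)]
handles = [client.async_infer(MODEL_NAME, inputs=make_inputs(b)) for b in batches]

# Total wall time over all requests reflects throughput;
# each individual request's time reflects latency.
results = [h.get_result().as_numpy("OUTPUT__0") for h in handles]
```

With two instances configured, the second request can start its CPU-side work while the first instance is still busy, which is where the modest throughput gain comes from.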
Context: I have a PyTorch model that cannot be converted to TorchScript, so I am serving it with the Python backend as a BLS model (I'll refer to this as the inference model). I am calling this model from another BLS model, since this step is part of a pipeline (I'll call this the calling model).
I would like to have multiple copies of the inference model available for concurrent model execution on a single GPU, and then have the calling model use those copies by splitting a large batch of data into smaller batches and sending them asynchronously. I believe I largely have the code in place to do this. For my inference model I specify:
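(Simplified sketch of the relevant `config.pbtxt` settings; the instance count and other values here are illustrative placeholders rather than the exact values I'm using.)

```
# Simplified config.pbtxt for the inference model (placeholder values):
# two instances of the Python model on GPU 0, so two requests can execute concurrently.
backend: "python"
max_batch_size: 8

instance_group [
  {
    count: 2
    kind: KIND_GPU
    gpus: [ 0 ]
  }
]
```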
with an inference model file (`model.py`) like so:
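(Again a simplified sketch: the tensor names and data types are placeholders, and `load_my_model()` is a hypothetical helper standing in for however the non-TorchScript PyTorch model actually gets constructed.)

```python
import json

import torch
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def initialize(self, args):
        # args["model_config"] is the JSON-serialized config.pbtxt.
        self.model_config = json.loads(args["model_config"])
        self.device = torch.device("cuda")
        # Hypothetical loader for the non-TorchScript PyTorch model.
        self.model = load_my_model().to(self.device).eval()

    def execute(self, requests):
        responses = []
        for request in requests:
            # Placeholder tensor names.
            in_tensor = pb_utils.get_input_tensor_by_name(request, "INPUT__0")
            batch = torch.from_numpy(in_tensor.as_numpy()).to(self.device)

            with torch.no_grad():
                out = self.model(batch)

            out_tensor = pb_utils.Tensor("OUTPUT__0", out.cpu().numpy())
            responses.append(pb_utils.InferenceResponse(output_tensors=[out_tensor]))
        return responses
```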
In my calling model I do:
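(The sketch below shows the asynchronous BLS pattern I'm describing: split the incoming batch, issue the sub-requests with `async_exec()`, and gather the results. The model and tensor names and the number of chunks are placeholders.)

```python
import asyncio

import numpy as np
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    # async_exec() requires execute() to be a coroutine.
    async def execute(self, requests):
        responses = []
        for request in requests:
            big_batch = pb_utils.get_input_tensor_by_name(request, "INPUT__0").as_numpy()

            # Split the large batch into smaller chunks, one per inference-model instance.
            chunks = np.array_split(big_batch, 2, axis=0)

            infer_futures = []
            for chunk in chunks:
                infer_request = pb_utils.InferenceRequest(
                    model_name="inference_model",  # placeholder name
                    requested_output_names=["OUTPUT__0"],
                    inputs=[pb_utils.Tensor("INPUT__0", chunk)],
                )
                # Issue the request without waiting for it to finish.
                infer_futures.append(infer_request.async_exec())

            # Wait for all sub-requests to complete.
            infer_responses = await asyncio.gather(*infer_futures)

            outputs = []
            for infer_response in infer_responses:
                if infer_response.has_error():
                    raise pb_utils.TritonModelException(infer_response.error().message())
                outputs.append(
                    pb_utils.get_output_tensor_by_name(infer_response, "OUTPUT__0").as_numpy()
                )

            out_tensor = pb_utils.Tensor("OUTPUT__0", np.concatenate(outputs, axis=0))
            responses.append(pb_utils.InferenceResponse(output_tensors=[out_tensor]))
        return responses
```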
This script works, but the runtime is not shorter than the single-model, single-request scenario. In fact, it's a little longer, according to the timings in my log.
Am I missing something here? Is concurrent model execution not possible with models served using the Python backend? Do I need to design my inference model file more intelligently? From what I see here, I'm presuming that this should be possible. Any help would be much appreciated.
Update:
I decided to just literally copy the inference model, then update the model names in my calling model so that the two requests get sent to different models. Same result. I have also verified that the two GPU models are being run concurrently by watching the output of nvidia-smi.
After throwing various log statements into my inference model and verifying the time taken by each step, the conclusion I've come to is that the communication of the (3D) data between the two BLS models is the culprit. Because this overhead overshadows the inference time, I'm seeing basically no benefit.
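For anyone wanting to reproduce this kind of breakdown: a simple approach is to time the BLS round trip in the calling model and the bare forward pass in the inference model separately; the gap between the two is the inter-model data movement plus scheduling overhead. A minimal helper (label strings and placement are illustrative) could look like this:

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label: str):
    """Print wall-clock time for the enclosed block; output shows up in the Triton server log."""
    start = time.perf_counter()
    try:
        yield
    finally:
        print(f"[timing] {label}: {time.perf_counter() - start:.4f} s", flush=True)


# In the calling model:   with timed("BLS round trip"): infer_responses = await asyncio.gather(*infer_futures)
# In the inference model: with timed("forward pass"):   out = self.model(batch)
```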