Hello there!
I'm new to Triton and I can't get my head around how to do batched inference.
This is the code I use to prepare inputs before calling the inference method through a gRPC client. I pass in an inputs vector containing device addresses allocated by cudaMalloc; the addresses are not contiguous. I have already created my triton::client::InferInput objects. This is the prepare_input function I'm having trouble with:
```cpp
void TritonClient::prepare_input(const std::vector<unsigned char*>& inputs,
                                 const std::vector<size_t>& input_byte_sizes) {
  unregister_input_names();
  std::vector<std::unique_ptr<TritonInput>> input_ptrs;
  for (size_t i = 0; i < inputs.size(); ++i) {
    // Wrap each device pointer in a CUDA IPC handle and register it with
    // the server as its own CUDA shared memory region.
    cudaIpcMemHandle_t input_cuda_handle;
    CreateCUDAIPCHandle(&input_cuda_handle, (void*)inputs[i]);
    FAIL_IF_ERR(
        m_client->RegisterCudaSharedMemory(m_triton_registrable_input_name[i],
                                           input_cuda_handle, 0 /* device_id */,
                                           input_byte_sizes[i]),
        "failed to register input shared memory region");

    // Point a separate TritonInput at the region that was just registered.
    std::unique_ptr<TritonInput> input_ptr(new TritonInput());
    FAIL_IF_ERR(input_ptr->SetSharedMemory(m_triton_registrable_input_name[i],
                                           input_byte_sizes[i], 0 /* offset */),
                "unable to set shared memory for input");
    input_ptrs.push_back(std::move(input_ptr));
  }

  // Hand the raw pointers over to the member vector used by the inference
  // call; ownership stays with the local input_ptrs vector.
  m_inputs.clear();
  for (const auto& ptr : input_ptrs) {
    m_inputs.push_back(ptr.get());
  }
}
```
But there are several issues I can't find a solution for:

1. The second time I call SetSharedMemory, I get a segmentation fault.
2. If I drop the shared memory registration and just push the same input pointer into m_inputs multiple times, I get an error that the input already exists.

In Python I managed something similar without shared memory by concatenating the inputs, but I haven't been able to make that work here. What I want is to pass several GPU memory addresses, each with its byte size (the same for all of them), for batched inference, and these buffers are not contiguous. Any help or suggestions would be appreciated.
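For context, here is a minimal sketch of the contiguous-buffer workaround I'd like to avoid: the C++/CUDA analogue of my Python concatenation approach. The gather_contiguous helper and the fixed per-sample size are illustrative assumptions, not part of my real code:

```cpp
// Illustrative sketch only (error checks omitted for brevity).
// Assumes every sample has the same byte size, as in my real setup.
#include <cuda_runtime.h>
#include <cstddef>
#include <vector>

// Copies non-contiguous device buffers of sample_bytes each into one
// freshly allocated contiguous device buffer and returns it.
unsigned char* gather_contiguous(const std::vector<unsigned char*>& inputs,
                                 size_t sample_bytes) {
  unsigned char* batch_ptr = nullptr;
  cudaMalloc(&batch_ptr, inputs.size() * sample_bytes);
  for (size_t i = 0; i < inputs.size(); ++i) {
    // Device-to-device copy of sample i into its slot in the batch buffer.
    cudaMemcpy(batch_ptr + i * sample_bytes, inputs[i], sample_bytes,
               cudaMemcpyDeviceToDevice);
  }
  return batch_ptr;
}
```

With the whole batch in one region I could register it once and call SetSharedMemory once, on a single input whose leading shape dimension is the batch size, but the extra device-to-device copies are exactly what I was hoping CUDA shared memory would let me skip.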