Hello there!
I'm new to Triton and I can't get my head around how to do batched inference.
This is the code I use to prepare inputs before calling the inference method through a gRPC client. I pass in an inputs vector containing device addresses allocated by cudaMalloc; the addresses are not contiguous. I have already created my triton::client::InferInput objects. This is the prepare_input function I'm having trouble with:
```cpp
void TritonClient::prepare_input(const std::vector<unsigned char*>& inputs,
                                 const std::vector<size_t>& input_byte_sizes) {
  unregister_input_names();
  std::vector<std::unique_ptr<TritonInput>> input_ptrs;
  for (size_t i = 0; i < inputs.size(); ++i) {
    // Wrap each device pointer in a CUDA IPC handle and register it with
    // the server as its own CUDA shared memory region.
    cudaIpcMemHandle_t input_cuda_handle;
    CreateCUDAIPCHandle(&input_cuda_handle, (void*)inputs[i]);
    FAIL_IF_ERR(
        m_client->RegisterCudaSharedMemory(m_triton_registrable_input_name[i],
                                           input_cuda_handle, 0 /* device_id */,
                                           input_byte_sizes[i]),
        "failed to register input shared memory region");

    // Point a separate TritonInput at the region that was just registered.
    std::unique_ptr<TritonInput> input_ptr(new TritonInput());
    FAIL_IF_ERR(input_ptr->SetSharedMemory(m_triton_registrable_input_name[i],
                                           input_byte_sizes[i], 0 /* offset */),
                "unable to set shared memory for input");
    input_ptrs.push_back(std::move(input_ptr));
  }

  // Hand the raw pointers over to the member vector used by the inference
  // call; ownership stays with the local input_ptrs vector.
  m_inputs.clear();
  for (const auto& ptr : input_ptrs) {
    m_inputs.push_back(ptr.get());
  }
}
```
But there are several issues I can't find a solution for:

1. The second time I call SetSharedMemory, I get a segmentation fault.
2. If I drop the shared memory registration and just push the same input pointer into m_inputs multiple times, I get an error that the input already exists.

In Python I managed something similar without shared memory by concatenating the inputs, but I haven't been able to make that work here. What I want is to pass several GPU memory addresses, each with its byte size (the same for all of them), for batched inference, and these buffers are not contiguous. Any help or suggestions would be appreciated.
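For context, here is a minimal sketch of the contiguous-buffer workaround I'd like to avoid: the C++/CUDA analogue of my Python concatenation approach. The gather_contiguous helper and the fixed per-sample size are illustrative assumptions, not part of my real code:

```cpp
// Illustrative sketch only (error checks omitted for brevity).
// Assumes every sample has the same byte size, as in my real setup.
#include <cuda_runtime.h>
#include <cstddef>
#include <vector>

// Copies non-contiguous device buffers of sample_bytes each into one
// freshly allocated contiguous device buffer and returns it.
unsigned char* gather_contiguous(const std::vector<unsigned char*>& inputs,
                                 size_t sample_bytes) {
  unsigned char* batch_ptr = nullptr;
  cudaMalloc(&batch_ptr, inputs.size() * sample_bytes);
  for (size_t i = 0; i < inputs.size(); ++i) {
    // Device-to-device copy of sample i into its slot in the batch buffer.
    cudaMemcpy(batch_ptr + i * sample_bytes, inputs[i], sample_bytes,
               cudaMemcpyDeviceToDevice);
  }
  return batch_ptr;
}
```

With the whole batch in one region I could register it once and call SetSharedMemory once, on a single input whose leading shape dimension is the batch size, but the extra device-to-device copies are exactly what I was hoping CUDA shared memory would let me skip.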