Enable Cuda in Graphics Implementation for TensorRT backend #100
Conversation
@nv-kmcgill53 Please review this as discussed
CMakeLists.txt
Outdated
@@ -269,6 +269,7 @@ target_link_libraries(
  triton-tensorrt-backend
  PRIVATE
    CUDA::cudart
    CUDA::cuda_driver
@nv-kmcgill53, @mc-nv, @tanmayv25 - any issues with this dependency?
What is the context behind adding this dependency?
From the documentation I see:
CUDA Driver Library
The CUDA Driver library (cuda) is used by applications that use calls such as cuMemAlloc and cuMemFree.
Targets Created:
CUDA::cuda_driver
Isn't this dependency a requisite of TensorRT itself?
I thought that by default our product expects the driver to be installed and, if GPU capability is given, it is available for usage, including driver targets and binaries.
Adding this dependency should be fine. Ashish is linking correctly according to the CUDA documentation. As it states:
Context management can be done through the driver API, but is not exposed in the runtime API
So they will need to link against the driver instead of just linking against the cuda runtime.
So they will need to link against the driver instead of just linking against the cuda runtime.
I don't agree with this statement; the current linkage doesn't explain why the user would want to add it explicitly.
The CMake documentation isn't exhaustive when it mentions cuMemAlloc and cuMemFree. The user in this case is using the Driver API to set/pass the cuda context around in the backend, rather than letting the core take care of this. This is the reason for adding the CUDA::cuda_driver lib to the linking path. This PR necessarily makes use of functions in the driver where the trt_backend didn't before.
The Triton TensorRT backend is unable to work without CUDA, Triton Inference Server, and a TensorRT installation.
The current change, per my understanding, uses only cudaSetDevice (CUDA::cudart) and cudaGetErrorString (CUDA runtime API), and those dependencies are satisfied. That's why I don't see any reason to link against CUDA::cuda_driver.
The additional dependency is required for cuCtxPushCurrent() / cuCtxPopCurrent() @ https://github.com/triton-inference-server/tensorrt_backend/pull/100/files#diff-3137e95db7c97b9ddd71fa1b600e3dd646025c5a872ad94d4d09f04313fe3fcdR66
The reason we need this dependency is that we are using a special context called the CUDA in Graphics (CiG) context, which has to work with the CUDA driver DLL for its operations.
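To illustrate why the driver library is needed, here is a minimal sketch of the kind of driver-API usage involved (function and variable names are illustrative, not the PR's exact code): the externally created CiG context is pushed onto the calling thread before inference work and popped afterwards, and cuCtxPushCurrent / cuCtxPopCurrent are only provided through CUDA::cuda_driver.

#include <cuda.h>

// Make the shared CiG context current for the duration of the inference call.
// Error handling is omitted for brevity.
void RunWithSharedContext(CUcontext cig_ctx)
{
  cuCtxPushCurrent(cig_ctx);   // make the application's context current on this thread
  // ... enqueue TensorRT inference work here ...
  CUcontext previous = nullptr;
  cuCtxPopCurrent(&previous);  // restore whatever context was current before
}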
@ashishk98 - can you install and run pre-commit checks locally?
@nnshah1 fixed pre-commit
src/tensorrt_model.h
Outdated
//! for gaming use cases. Creating a shared context reduces context-switching
//! overhead and leads to better performance of model execution alongside the
//! graphics workload.
CUcontext GetCiGContext() { return cig_ctx_; }
@ashishk98 question: is this specific to CiG, or could it be applied to any application-provided cuda context?
CMakeLists.txt
Outdated
target_compile_definitions(
  triton-tensorrt-backend
  PRIVATE TRITON_ENABLE_CIG
)
target_link_libraries(
  triton-tensorrt-backend
  PRIVATE
    CUDA::cuda_driver
)
These settings could be achieved with a generator expression, couldn't they?
What is a generator expression?
src/tensorrt_model.cc
Outdated
triton::common::TritonJson::Value value;
std::string ptr_value;
if (parameters.Find("CIG_CONTEXT_PTR", &value)) {
  RETURN_IF_ERROR(value.MemberAsString("string_value", &ptr_value));
@ashishk98 instead of directly converting here as a special case, I would prefer to use something similar to what is done in the trt-llm backend:
In that case there is a template method to convert from a parameter to a value - I think the code will be a little clearer to follow.
Also - can we convert to and from a 64-bit integer? So something like:
model_state->GetParameter<uint64>("CUDA_CONTEXT");
It also strikes me that although we use value.MemberAsString(), we could use value.MemberAsUint("string_value", &ptr_value) instead.
So three things: 1) add a templated GetParameter() method, 2) use MemberAsUint for the uint64 template, and 3) officially transfer uint64 values and convert them to and from the context.
I have added a GetParameter call for std::string instead of UINT64. This is because when we add the parameter to the model config it is directly converted into a hex string rather than a numeric string, so MemberAsUint fails while parsing the pointer because it receives a hex string.
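For reference, a minimal sketch of how such a hex-string parameter could be converted back into a context handle (the helper name and the example value are assumptions for illustration, not the PR's actual code):

#include <cuda.h>
#include <cstdint>
#include <string>

// Parse a pointer serialized as a hex string (e.g. "0x7f3a2c000000") back
// into a CUcontext handle. Base 16 accepts the value with or without the
// "0x" prefix, which is why a decimal parser such as MemberAsUint fails here.
CUcontext
ContextFromHexString(const std::string& ptr_value)
{
  uint64_t raw = std::stoull(ptr_value, nullptr, 16);
  return reinterpret_cast<CUcontext>(raw);
}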
src/model_state.cc
Outdated
  .c_str());
#ifdef TRITON_ENABLE_CIG
  // Set device if CiG is disabled
  if (!isCiGEnabled())
@tanmayv25, @ashishk98 - is there a way to have a single scoped object, ScopedCudaDeviceContext, that internally checks whether there is an application_context and, if there is, uses push / pop - and if not, uses cudaSetDevice?
We don't currently use them in the same locations - but I am wondering if that would be possible. I think it would be cleaner logically - basically an 'application_context' takes the place of the 'device', but otherwise the logic remains the same; see the sketch below.
ScopedObject(Device);
ScopedObject(Context);
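A minimal sketch of what such a scoped helper could look like (the class and member names are assumed for illustration, not the PR's actual implementation): push the application-provided context if one exists, otherwise fall back to cudaSetDevice, and undo the push when the scope ends.

#include <cuda.h>
#include <cuda_runtime_api.h>

class ScopedCudaDeviceContext {
 public:
  ScopedCudaDeviceContext(CUcontext app_ctx, int device_id) : app_ctx_(app_ctx)
  {
    // Error handling omitted for brevity.
    if (app_ctx_ != nullptr) {
      cuCtxPushCurrent(app_ctx_);  // use the shared application context
    } else {
      cudaSetDevice(device_id);    // default path: select the configured device
    }
  }
  ~ScopedCudaDeviceContext()
  {
    if (app_ctx_ != nullptr) {
      CUcontext previous = nullptr;
      cuCtxPopCurrent(&previous);  // restore the previously current context
    }
  }

 private:
  CUcontext app_ctx_;
};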
We can take a look at this in the next iteration
src/model_state.cc
Outdated
@@ -175,7 +175,13 @@ ModelState::ModelState(TRITONBACKEND_Model* triton_model)
ModelState::~ModelState()
{
  for (auto& device_engine : device_engines_) {
    cudaSetDevice(device_engine.first.first);
#ifdef TRITON_ENABLE_CIG
I know I had asked for looking at macros to enable this, but I would like to avoid this kind of guard - if we can use a single method and then have two different implementations of that method / object, I would prefer that to having the macros embedded in the functions / methods.
Fixed
I would like to see:
  if (!model_state_->isCiGEnabled())
#endif  // TRITON_ENABLE_CIG
  {
    cudaSetDevice(DeviceId());
Do you mind sharing the reasoning for avoiding the set-device calls? Wouldn't that cause the issue of the model not being placed / executed on the selected device (based on the model config)?
- The intended use of cuda context sharing targets only single-GPU (RTX end-user) systems. I wanted to avoid complications beyond this use case.
- When we call cudaSetDevice(), the cuda runtime resets the thread to using the default cuda context.
How will model instances on multiple GPUs be handled? AFAIK a cuda context is per device. If we have more than one GPU device, then we should pass a cuda context handle for each GPU device. Or am I missing something here?
@ashishk98 I believe we still need to raise an error if someone tries to use a pre-built cuda context in a multi-GPU environment, right?
Can the stakeholders provide another round of reviews on this PR? We'd like to get these changes into a release asset this week.
I still don't see a guard preventing the use of a pre-built cuda context on a multi-GPU system. We are skipping the cudaSetDevice() call when using the CiG context, and since we are passing only a single CiG context, all the model instances will hit the same GPU device. We need to add a check for Triton loading a model instance on a different device and throw a meaningful error.
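A sketch of the kind of guard being requested (the function and parameter names here are assumptions, not the PR's final code): refuse to load a model instance on any device other than the default one whenever a shared CUDA context has been supplied, since a CUcontext is bound to a single device.

#include "triton/core/tritonserver.h"

// Return an error if context sharing is enabled but the instance targets a
// non-default GPU; otherwise return nullptr (no error).
TRITONSERVER_Error*
CheckContextSharingDevice(bool context_sharing_enabled, int device_id)
{
  if (context_sharing_enabled && device_id != 0) {
    return TRITONSERVER_ErrorNew(
        TRITONSERVER_ERROR_INVALID_ARG,
        "CUDA context sharing is only supported on single-GPU systems; "
        "refusing to load a model instance on a non-default device");
  }
  return nullptr;
}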
@tanmayv25 please review the latest commit
Some minor fixes, otherwise LGTM.
Resolved minor fixes as well
@nvda-mesharma did you verify these changes in CI? The pipeline I launched yesterday seems to be failing with this error:
https://gitlab-master.nvidia.com/dl/dgx/tritonserver/-/jobs/137650575 @ashishk98 for viz.
* Enable CiG support in Tensorrt backend
* Creating scoped runtime context structure for better management of CiG context
* instance_state null check
* Minor bug fix
* pre-commit fixes
* Added new cmake flag TRITON_ENABLE_CIG to make the CiG support build conditional
* pre-commit fixes
* CiG->Cuda. Making the changes more generic to cuda context sharing + hidden ifdefs
* remove todo
* Add GetParameter to fetch string params
* Cig->Cuda + comment updates
* Use RETURN_IF_ERROR macro
* Handle Multi-GPU failure case
* typo + styling fixes

Co-authored-by: Ashish Karale <akarale@nvidia.com>
Add CUDA context sharing support for the TensorRT backend to reduce context-switching overhead when a graphics workload is running in parallel.