Fix the lack of direct GPU to GPU communications in multi-device runs. #642
Conversation
In ICLDisco#570, we moved from using cuda_index to using device_index in the 'nvlink' mask that decides if we can directly communicate to another GPU. However, this bitmask was initialized at query time, before devices get assigned a device_index. As a consequence, the bitmask was wrong and no direct device to device communication was happening. In this PR, we add a step, after all devices have been registered, to complete this initialization.
Why did we move from using cuda_index to using the device_index? We cannot use d2d on devices of different types, and we check for that in parsec_default_gpu_stage_in and parsec_default_gpu_stage_out, so using the device_index in the mask makes little sense. Going back and rebuilding this mask based on cuda_index seems like a simpler and more resilient solution (we could change the device_index without having to rebuild all the info).
device_gpu.c doesn't know the cuda_index (or hip_index or level_zero_index); that's the reason why we moved that mask to being global-device_index based.
I don't think we want to replicate the full parsec_device_gpu_stage_in() function in all device implementations, so if we want to go back to a cuda_index-based bitmask, we can replace the test `if( gpu_device->peer_access_mask & (1 << candidate_dev->super.device_index) )` (and similar tests in device_gpu.c) with a call to a device-specific implementation, something like `if( gpu_device->peer_access(candidate_dev) )`.
I'm fine with that, but I thought the global device_index mask would save these function calls, which cannot be inlined in stage_in, a path that is already taking a lot of time.
to prevent their reuse as candidates on other gpus.
Passes all ctests and works with large-size POTRF -g 8 with nvlink active on Leconte.
I can understand how the need to add debugging messages can be justified as part of this PR, but there are other things (such as changing the coherency state or moving data-copy version manipulation code across function calls) that do not fit the description of this PR. There are also several instances where the indentation of the new code is incorrect, and 8 commits (half of them unsigned) for a seemingly minor issue.
The issue is not minor at all. Turning D2D management back on unearthed a whole bunch of bugs that would produce wrong results (in particular in TRSM, where we would exercise CPU->GPU1,2,3->CPU data motions). While I agree that the description of the PR does not match the real scope of what is achieved here, the changes are not random and serve the greater goal of making D2D work at a basic level.
bugfix: properly compute the number of readers when we impersonate the other gpu-manager during end of D2D transfer
bugfix: d2d_complete tasks do not have a data_out set
Add some comments for clarification, address review remarks
41e8e9b to 2ab2fc4
An alternative is to go back to using cuda_index-based bitmasks, but then the decision of whether two GPUs can communicate directly is device-type specific and needs to move from device_gpu.c to device_cuda_module.c, which means adding another function call into device_cuda_module.c inside the stage_in() path of device_gpu.c.