feat: add DMA-BUF support #618
Open
aws-nslick
wants to merge
5
commits into
aws:master
from
aws-nslick:dmabuf
+655
−453
aws-nslick
force-pushed
the
dmabuf
branch
17 times, most recently
from
September 22, 2024 00:26
e99cff1
to
df356b5
Compare
Check at build time for at least CUDA 11.7, and remove the runtime checks that revalidated this. Prefer cudart functions for things like version checks, for simplicity. Drop the cuda_check functional test: its intention was to ensure that CUDA was not linked, but CUDA is now linked. This change also required fixing nvtx's m4 autodetection, and our GitHub Actions needed to use the actual CUDA repos (they were previously using ancient 11.x toolkits from the Ubuntu universe repos). Signed-off-by: Nicholas Sielicki <nslick@amazon.com>
Along the interface with NCCL, continue to return -ENOTSUP, but for the internal API remove the dmabuf function pointers within the communicators and delete the implementations they pointed at. Signed-off-by: Nicholas Sielicki <nslick@amazon.com>
The MR cache needs to be capable of handling triplets of {base, offset, len} in addition to the current arguments of {base, len}. Add a tagged union with a functional interface that can represent this generically. The tagged union members are struct iovec and struct fi_mr_dmabuf, which matches the union within struct fi_mr_attr. Signed-off-by: Nicholas Sielicki <nslick@amazon.com>
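As a rough illustration of the tagged-union idea in this commit message, the following is a minimal sketch, not the plugin's actual code. All `sketch_*` names are hypothetical, and `struct sketch_mr_dmabuf` is a local stand-in that mirrors the fields of libfabric's `struct fi_mr_dmabuf` so the example compiles without libfabric headers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/uio.h> /* struct iovec */

/* Stand-in for libfabric's struct fi_mr_dmabuf (assumption: real code
 * would include <rdma/fi_domain.h> and use the library type). */
struct sketch_mr_dmabuf {
	int fd;
	uint64_t offset;
	size_t len;
	void *base_addr;
};

/* Tag saying which union member is active. */
typedef enum {
	SKETCH_MR_IOVEC,
	SKETCH_MR_DMABUF
} sketch_mr_kind_t;

/* Hypothetical registration input: {base, len} or {base, offset, len} + fd. */
typedef struct {
	sketch_mr_kind_t kind;
	union {
		struct iovec iov;
		struct sketch_mr_dmabuf dmabuf;
	} u;
} sketch_mr_input_t;

static inline sketch_mr_input_t sketch_mr_from_iovec(void *base, size_t len)
{
	sketch_mr_input_t in = { .kind = SKETCH_MR_IOVEC };
	in.u.iov.iov_base = base;
	in.u.iov.iov_len = len;
	return in;
}

static inline sketch_mr_input_t sketch_mr_from_dmabuf(int fd, uint64_t offset,
						      void *base, size_t len)
{
	assert(fd >= 0); /* a dmabuf input must carry a valid descriptor */
	sketch_mr_input_t in = { .kind = SKETCH_MR_DMABUF };
	in.u.dmabuf.fd = fd;
	in.u.dmabuf.offset = offset;
	in.u.dmabuf.len = len;
	in.u.dmabuf.base_addr = base;
	return in;
}

/* Accessors: callers such as a cache never touch the union directly. */
static inline bool sketch_mr_is_dmabuf(const sketch_mr_input_t *in)
{
	return in->kind == SKETCH_MR_DMABUF;
}

static inline uintptr_t sketch_mr_base(const sketch_mr_input_t *in)
{
	return sketch_mr_is_dmabuf(in) ? (uintptr_t)in->u.dmabuf.base_addr
				       : (uintptr_t)in->u.iov.iov_base;
}

static inline size_t sketch_mr_len(const sketch_mr_input_t *in)
{
	return sketch_mr_is_dmabuf(in) ? in->u.dmabuf.len : in->u.iov.iov_len;
}

static inline uint64_t sketch_mr_offset(const sketch_mr_input_t *in)
{
	/* An iovec input has no offset; report zero. */
	return sketch_mr_is_dmabuf(in) ? in->u.dmabuf.offset : 0;
}
```

Hiding the union behind accessor functions lets a cache compare or hash entries by base, offset, and length without caring which variant it was handed.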
Immediately on any top-level NCCL call, construct an immutable nccl_ofi_mr_input_t on the stack, then pass that to the communicator regmr implementations. Add a flags argument to the internal regmr functions so that they can inspect the input and add FI_MR_DMABUF when the input arguments correspond to a file descriptor. Implement the top-level nccl_net_ofi_regMr in terms of nccl_net_ofi_regMrDmaBuf, simply forwarding its arguments alongside an invalid file descriptor (-1) and a zero offset. DMA-BUF remains unsupported as of this commit, but only because support is not yet advertised back to NCCL/nccom. Signed-off-by: Nicholas Sielicki <nslick@amazon.com>
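The forwarding pattern described in this commit message might look roughly like the sketch below. It is not the plugin's code: every `sketch_*` name is hypothetical, the communicator implementation is a stub that only records the flags it received, and `SKETCH_FI_MR_DMABUF` is an illustrative stand-in for libfabric's `FI_MR_DMABUF` so the example builds without libfabric.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative stand-in for libfabric's FI_MR_DMABUF flag (assumption:
 * real code would take the flag from <rdma/fi_domain.h>). */
#define SKETCH_FI_MR_DMABUF (1ULL << 40)

/* Hypothetical immutable registration input built at the top-level call. */
typedef struct {
	void *base;
	size_t len;
	uint64_t offset;
	int fd; /* -1 means "no dmabuf file descriptor" */
} sketch_reg_input_t;

/* Inspect the input and decide which registration flags apply. */
static uint64_t sketch_regmr_flags(const sketch_reg_input_t *in)
{
	return in->fd >= 0 ? SKETCH_FI_MR_DMABUF : 0;
}

/* Stub for a communicator regmr implementation: records the flags it saw. */
static int sketch_comm_regmr(const sketch_reg_input_t *in, uint64_t flags,
			     uint64_t *flags_seen)
{
	assert(in != NULL && flags_seen != NULL);
	*flags_seen = flags;
	return 0;
}

/* DMA-BUF entry point: builds the input on the stack and passes it down. */
static int sketch_reg_mr_dmabuf(void *data, size_t size, uint64_t offset,
				int fd, uint64_t *flags_seen)
{
	const sketch_reg_input_t in = {
		.base = data, .len = size, .offset = offset, .fd = fd
	};
	return sketch_comm_regmr(&in, sketch_regmr_flags(&in), flags_seen);
}

/* Legacy entry point: forwards with an invalid fd (-1) and a zero offset. */
static int sketch_reg_mr(void *data, size_t size, uint64_t *flags_seen)
{
	return sketch_reg_mr_dmabuf(data, size, 0, -1, flags_seen);
}
```

With a single code path, the legacy and DMA-BUF entry points differ only in whether a valid file descriptor reaches the flag computation.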
aws-nslick
requested review from
rajachan,
bwbarrett and
a team
as code owners
September 22, 2024 00:36
This adds DMA-BUF support to the plugin and enables it under the following conditions:

At build time, libfabric>=1.20 is required (the build checks for FI_MR_DMABUF).

At runtime:
+ The specific version of NCCL being used supports DMA-BUF and passes valid dmabuf fds to the plugin.
+ FI_HMEM must be supported.
+ For CUDA accelerators, CU_DEVICE_ATTRIBUTE_DMA_BUF_SUPPORTED is queried and FI_HMEM_CUDA is requested.
+ For Neuron, we assume all nrt versions are capable of dmabuf export.

Libfabric as of today provides no hints at init time that allow the plugin to differentiate between a provider that merely has FI_HMEM support and one that has dmabuf support. For the case where the plugin is built against libfabric>=1.20 but libfabric is unable to handle dmabuf registrations, a new environment variable (OFI_NCCL_DISABLE_DMABUF=1) is introduced to force the legacy path. When it is set, dmabuf support is not advertised to NCCL, which ensures that the plugin remains on the legacy iovec path.

Testing: Various combinations of
+ OFI_NCCL_DISABLE_DMABUF=0/1
+ OFI_NCCL_PROTOCOL=RDMA/SENDRECV
+ FI_HMEM_CUDA_USE_GDRCOPY=0/1

Signed-off-by: Nicholas Sielicki <nslick@amazon.com>
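The enablement conditions in this description can be summarized as a single predicate. The sketch below is an assumption-laden illustration rather than the plugin's logic: the `sketch_*` names and the capability struct are hypothetical, the environment value is passed in as a string so the function stays pure, and treating any non-zero integer as "disabled" is a guess at the parsing rule.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical summary of the build-time and runtime conditions. */
struct sketch_dmabuf_caps {
	bool built_with_fi_mr_dmabuf; /* libfabric >= 1.20 at build time */
	bool hmem_supported;          /* provider offers FI_HMEM */
	bool device_exports_dmabuf;   /* e.g. the CUDA device attribute is set */
};

/* Interpret the value of the disable variable; NULL means "unset".
 * Assumption: any non-zero integer forces the legacy path. */
static bool sketch_env_disables_dmabuf(const char *value)
{
	return value != NULL && strtol(value, NULL, 10) != 0;
}

/* Decide whether dmabuf support should be advertised back to NCCL. */
static bool sketch_advertise_dmabuf(const struct sketch_dmabuf_caps *caps,
				    const char *disable_env)
{
	assert(caps != NULL);
	if (sketch_env_disables_dmabuf(disable_env))
		return false; /* user forced the legacy iovec path */
	return caps->built_with_fi_mr_dmabuf &&
	       caps->hmem_supported &&
	       caps->device_exports_dmabuf;
}
```

A caller would typically pass `getenv("OFI_NCCL_DISABLE_DMABUF")` as the second argument. Whether NCCL then actually hands the plugin valid dmabuf fds is decided on the NCCL side and is not modeled here.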
Note that currently this requires a patch for libfabric or explicitly disabling gdrcopy: aws-nslick/libfabric@a475d9140
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.