Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(cuda): avoid loading stub #581

Closed
wants to merge 1 commit into from

Conversation

aws-nslick
Copy link
Contributor

CTK ships `stubs/libcuda.so', which may potentially be found by dlopen depending on the environment. The stub exists only for the sake of allowing software to resolve the UMD lib at build time w/o needing to have any particular driver actually installed.

In our usage, it is better to explicitly request libcuda.so.1 to avoid ever potentially loading the stub.

Issue #, if available:

Description of changes:

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

CTK ships `stubs/libcuda.so', which may potentially be found by dlopen
depending on the environment. The stub exists only for the sake of
allowing software to resolve the UMD lib at build time w/o needing to
have any particular driver actually installed.

In our usage, it is better to explicitly request libcuda.so.1 to avoid
ever potentially loading the stub.

Signed-off-by: Nicholas Sielicki <nslick@amazon.com>
@aws-nslick aws-nslick requested a review from a team as a code owner September 5, 2024 22:25
Copy link
Contributor

@bwbarrett bwbarrett left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This has never been a problem reported by customers, and leaves us open to more problems down the line. I'm fine if you want to sync with what latest NCCL does, but let's not fix things that aren't actually broken.

@aws-nslick
Copy link
Contributor Author

Frustrating response, this just aligns us with the rest of the world.

With pytorch: link

With triton: link

And with the library itself:

$ readelf -d /usr/lib/x86_64-linux-gnu/libcuda.so.550.107.02 | grep -i soname
 0x000000000000000e (SONAME)             Library soname: [libcuda.so.1]

This is explained very well by @Artem-B here on the cmake forums.

leaves us open to more problems down the line

Can you elaborate on this? What problems, exactly?

I'm fine if you want to sync with what latest NCCL does, but let's not fix things that aren't actually broken./

NCCL statically links the cuda runtime and resolves all driver functions through it; swapping that out is a much larger change for us. I don't see why you would want to block taking this simple fix on doing that swap.

@aws-nslick
Copy link
Contributor Author

bot:aws:retest

@aws-nslick
Copy link
Contributor Author

Waiting to merge this and ofiwg/libfabric#10365 until @bwbarrett acks

@aws-nslick
Copy link
Contributor Author

Closing this as it's supplanted by d0040f9

included in pr: #618

@aws-nslick aws-nslick closed this Sep 22, 2024
@aws-nslick aws-nslick deleted the avoid-stub-load branch September 25, 2024 22:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants