Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix a number of duplicate definition names #667

Merged
merged 2 commits into from
Oct 23, 2024

Conversation

bwbarrett
Copy link
Contributor

In preparation for moving to C++, make sure we're closer to following the One Definition Rule. There was one major offender, duplicate symbol names between the sendrecv and rdma transports. While we still need to clean up the naming in the rdma transport, this PR prefixes every private symbol in the sendrecv transport with sendrecv_.

The other potential badness was the param code, if the compiler did not inline those functions. To work around that, we do some ugly hacking to put all the param access functions in a single nccl_ofi_param.c file, and nccl_ofi_param.h only creates declarations most of the time. The implementation is ugly, but we want to rewrite this code anyway, so let's limp until we're using C++ and can write something better.

Without this PR:

% nm .libs/libinternal_net_plugin.a | grep ' [tT] ' | cut -f3 -d' ' | sort | uniq -d
accept
connect
dereg_mr_recv_comm
dereg_mr_send_comm
flush
get_properties
listen
listen_close
ofi_process_cq
recv
reg_mr_recv_comm
reg_mr_send_comm
send
test

With this PR:

% nm .libs/libinternal_net_plugin.a | grep ' [tT] ' | cut -f3 -d' ' | sort | uniq -d

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@bwbarrett bwbarrett requested a review from a team as a code owner October 18, 2024 17:25
@a-szegel
Copy link
Contributor

bot:aws:retest ... requested by Xin

@xwangxin
Copy link

bot:aws:retest

Prefix all symbols in the sendrecv interface with sendrecv_ to prevent
naming conflicts with similar functions in the rdma protocol.  We should
update the rdma protocol as well, but this at least gets us out of
violating C++'s ODR rules.

Signed-off-by: Brian Barrett <bbarrett@amazon.com>
The param implementation stamped out accessor functions in each
compilation unit, which created obvious problems with C++'s naming
rules.  We know we want to rewrite this in the future, so for now
just stamp out one implementation in nccl_ofi_param.c that everyone
can call, and do some ugly macro hacking to make that a minimum
required change.

Signed-off-by: Brian Barrett <bbarrett@amazon.com>
aws-nslick
aws-nslick previously approved these changes Oct 21, 2024
Copy link
Contributor

@aws-nslick aws-nslick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm but would prefer to see the diff minimized by just dropping the semicolons on these:

#define OFI_NCCL_PARAM_UINT(name, env, default_value) \
uint64_t ofi_nccl_##name(void);

#define OFI_NCCL_PARAM_INT(name, env, default_value) \
int64_t ofi_nccl_##name(void);

#define OFI_NCCL_PARAM_STR(name, env, default_value) \
const char *ofi_nccl_##name(void);

Then the definitions can remain untouched, unless i'm missing something.

@bwbarrett
Copy link
Contributor Author

bot:aws:retest

ERROR    root:utils.py:94 Command ssh -oStrictHostKeyChecking=no 172.31.3.21 "source /home/ec2-user/PortaFiducia/env/bin/activate; source /etc/profile; nvidia-smi" failed with error:

ssh: connect to host 172.31.3.21 port 22: Connection timed out

@bwbarrett
Copy link
Contributor Author

bot:aws:retest

ERROR    root:utils.py:94 Command ssh -oStrictHostKeyChecking=no 172.31.37.221 "source /home/ubuntu/PortaFiducia/env/bin/activate; source /etc/profile; export LD_LIBRARY_PATH=/home/ubuntu/PortaFiducia/build/libraries/aws_ofi_nccl/667/openmpi/4.1.6/install/lib; NEURON_RT_ROOT_COMM_ID=172.31.37.221:62010 timeout 1800 nccom-test --debug --non-interactive -r 128 -N 4 --data-collector-port 60567 -t avg p25 p50 p75 p90 p99 p100 -b 256 -e 1073741824 -f 2 -n 20 -d fp16 -c redsct --hosts 172.31.37.221 172.31.33.221 172.31.33.27 172.31.42.237" failed with error:

Unable to launch data collection server: [Errno 98] Address already in use

and

ERROR    root:utils.py:94 Command ssh -oStrictHostKeyChecking=no 172.31.0.26 "source /home/ubuntu/PortaFiducia/env/bin/activate; source /etc/profile; nvidia-smi" failed with error:

ssh: connect to host 172.31.0.26 port 22: Connection timed out

@bwbarrett bwbarrett merged commit 71d05d6 into aws:master Oct 23, 2024
27 checks passed
@bwbarrett bwbarrett deleted the cleanup/silly-odr branch October 23, 2024 03:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants