-
Notifications
You must be signed in to change notification settings - Fork 135
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Env conf debug #1534
base: develop
Are you sure you want to change the base?
Env conf debug #1534
Conversation
|
||
local output_file="rccl_tests_output.txt" | ||
|
||
$1/build/all_reduce_perf -b 8 -e 16M -f 8 -g 2 > "${output_file}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we are only running to get the version info, you can just run from -b 8 -e 8 -g 1
, no need to do any kind of sweep.
|
||
##################################################### README ##################################################### | ||
|
||
# This script only requires 1 input arguement, it is the path to the rccl-tests repo. EX: /path/to/rccl-tests/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# This script only requires 1 input arguement, it is the path to the rccl-tests repo. EX: /path/to/rccl-tests/ | |
# This script only requires 1 input argument, it is the path to the rccl-tests repo. EX: /path/to/rccl-tests/ |
# 10. ibv_devinfo is on path | ||
# 11. ibstat device GUIDs | ||
|
||
# All output will be in a folder called conf-script-output that will be created in the same directory as the script |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd suggest making this path user configurable, including allowing it to output to stdout. Perhaps default to cwd.
echo "An error occurred during ${func_call}" | ||
echo "${error_message}" | ||
echo "in step ${func_name}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only successful output is captured to log files, which may not be desirable. These should probably tee
to the conf-script-output location, or to stderr if no output location provided.
|
||
##################################################### setup output folder ##################################################### | ||
|
||
mkdir conf-script-output |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
mkdir conf-script-output | |
mkdir -p conf-script-output |
|
||
##################################################### end define necessary functions ##################################################### | ||
|
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this script is meant to be used by customers, and has a required argument, it should probably display help text. Probably also recognize --help
.
Should there be README or similar documentation to steer users toward this, or should we rely on support to instruct them? |
Should we call this script in our CI to capture our test configurations for debugging? |
Details
Do not mention proprietary info or link to internal work items in this PR.
Work item: Internal
What were the changes?
Developed Customer Support script that gathers customer system configuration.
Why were the changes made?
Script was made per request of the RCCL team to aid RCCL setup and debug.
How was the outcome achieved?
Technical details behind the work. Explain any publicly-available hardware peculiarities.
Additional Documentation:
What else should the reviewer know?
Approval Checklist
Do not approve until these items are satisfied.