Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Env conf debug #1534

Open
wants to merge 17 commits into
base: develop
Choose a base branch
from
Open

Env conf debug #1534

wants to merge 17 commits into from

Conversation

Nikhil-Nunna
Copy link
Contributor

@Nikhil-Nunna Nikhil-Nunna commented Feb 5, 2025

Details

Do not mention proprietary info or link to internal work items in this PR.

Work item: Internal

What were the changes?
Developed Customer Support script that gathers customer system configuration.

Why were the changes made?
Script was made per request of the RCCL team to aid RCCL setup and debug.

How was the outcome achieved?
Technical details behind the work. Explain any publicly-available hardware peculiarities.

Additional Documentation:
What else should the reviewer know?

Approval Checklist

Do not approve until these items are satisfied.

  • Verify the CHANGELOG has been updated, if
    • there are any NCCL API version changes,
    • any changes impact library users, and/or
    • any changes impact any other ROCm library.


local output_file="rccl_tests_output.txt"

$1/build/all_reduce_perf -b 8 -e 16M -f 8 -g 2 > "${output_file}"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we are only running to get the version info, you can just run from -b 8 -e 8 -g 1, no need to do any kind of sweep.


##################################################### README #####################################################

# This script only requires 1 input arguement, it is the path to the rccl-tests repo. EX: /path/to/rccl-tests/
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# This script only requires 1 input arguement, it is the path to the rccl-tests repo. EX: /path/to/rccl-tests/
# This script only requires 1 input argument, it is the path to the rccl-tests repo. EX: /path/to/rccl-tests/

# 10. ibv_devinfo is on path
# 11. ibstat device GUIDs

# All output will be in a folder called conf-script-output that will be created in the same directory as the script
Copy link
Collaborator

@corey-derochie-amd corey-derochie-amd Feb 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd suggest making this path user configurable, including allowing it to output to stdout. Perhaps default to cwd.

Comment on lines +59 to +61
echo "An error occurred during ${func_call}"
echo "${error_message}"
echo "in step ${func_name}"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only successful output is captured to log files, which may not be desirable. These should probably tee to the conf-script-output location, or to stderr if no output location provided.


##################################################### setup output folder #####################################################

mkdir conf-script-output
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
mkdir conf-script-output
mkdir -p conf-script-output


##################################################### end define necessary functions #####################################################


Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this script is meant to be used by customers, and has a required argument, it should probably display help text. Probably also recognize --help.

@corey-derochie-amd
Copy link
Collaborator

Should there be README or similar documentation to steer users toward this, or should we rely on support to instruct them?

@corey-derochie-amd
Copy link
Collaborator

Should we call this script in our CI to capture our test configurations for debugging?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
noCI Disable Jenkins for this PR.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants