Add run-to-run determinism testing to H100 CI #2339
base: main
This adds automatic run-to-run determinism verification for H100 integration tests. Tests marked with `determinism_test=True` will run twice with identical configuration and deterministic flags, then compare losses to ensure they match exactly.

The core loss extraction logic is factored into `torchtitan/tools/loss_utils.py` and shared between the integration test runner and the existing `loss_compare.py` script. The scripts directory is now a package to enable clean imports via `python -m scripts.loss_compare`.

The Float8 and HSDP+CP+compile+Float8 tests in the H100 suite are enabled for determinism testing (CUDA only).

Co-authored-by: Claude <noreply@anthropic.com>
stack-info: PR: #2339, branch: xmfan/stack/11
(e.g., the logical progression), or if it's short just omit the bullet list
entirely.

Disclose that the PR was authored with Claude.
Copied over from PyTorch's CLAUDE.md.
This adds automatic run-to-run determinism verification for H100 integration tests. Tests marked with `determinism_test=True` will run twice with identical configuration and deterministic flags, then compare losses to ensure they match exactly.

The core loss extraction logic is factored into `torchtitan/tools/loss_utils.py` and shared between the integration test runner and the existing `loss_compare.py` script. The scripts directory is now a package to enable clean imports via `python -m scripts.loss_compare`.

The Float8 and HSDP+CP+compile+Float8 tests in the H100 suite are enabled for determinism testing (CUDA only).

The `--run-to-run-determinism` flag in `loss_compare.py` now explicitly validates that no test-specific options are provided, raising a `ValueError` if they are.

Co-authored-by: Claude <noreply@anthropic.com>
stack-info: PR: #2339, branch: xmfan/stack/11
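The `--run-to-run-determinism` validation described in this commit message could look roughly like the sketch below. Only the `--run-to-run-determinism` flag and the `ValueError` come from the PR text; the `--test-options` argument and the `parse_args` function are hypothetical stand-ins for whatever test-specific options `loss_compare.py` actually accepts.

```python
import argparse


def parse_args(argv):
    # Hypothetical parser: only --run-to-run-determinism is named in the PR text.
    parser = argparse.ArgumentParser()
    parser.add_argument("--run-to-run-determinism", action="store_true")
    # Assumed example of a "test-specific" option; the real names may differ.
    parser.add_argument("--test-options", default="")
    args = parser.parse_args(argv)

    # The same configuration is run twice, so options that only apply to
    # the "test" side of a comparison make no sense in this mode.
    if args.run_to_run_determinism and args.test_options:
        raise ValueError(
            "--run-to-run-determinism does not accept test-specific options"
        )
    return args
```

Raising early keeps a misconfigured invocation from silently comparing two different configurations while reporting the result as a determinism check.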
ngpu: int = 4
disabled: bool = False
skip_rocm_test: bool = False
determinism_test: bool = False  # Run twice and verify losses are identical
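For illustration, here is a minimal sketch of how a runner might act on the new field. The four fields are taken from the diff above; the class name `IntegrationTestSpec`, the helper `num_runs`, and the policy it encodes are assumptions, not the PR's actual code.

```python
from dataclasses import dataclass


@dataclass
class IntegrationTestSpec:  # hypothetical name; only the fields come from the diff
    ngpu: int = 4
    disabled: bool = False
    skip_rocm_test: bool = False
    determinism_test: bool = False  # Run twice and verify losses are identical


def num_runs(spec: IntegrationTestSpec, is_cuda: bool) -> int:
    """Return how many times a runner would launch this test (assumed policy)."""
    if spec.disabled:
        return 0
    # The PR describes determinism testing as CUDA only.
    return 2 if (spec.determinism_test and is_cuda) else 1
```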
The point is not only about being deterministic, but also not changing before vs. after
- pytorch nightly updates
- user commits
Is it correct that this PR doesn't address such issues?
This PR just makes sure that when you run the same command twice, it produces the same outputs. By adding this to PR-time CI, you would run H100 CI twice on each PR, both against the same PyTorch nightly.
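The check described here, comparing the losses of two identical runs for an exact match, can be sketched as below. This is not the code in `torchtitan/tools/loss_utils.py`; the log line shape (`step: N  loss: X`) and the function names `extract_losses` and `find_mismatches` are assumptions made for the example.

```python
import re

# Assumed log line shape, e.g. "step:  3  loss:  7.8123"; the real format may differ.
_LOSS_RE = re.compile(r"step:\s*(\d+)\s+loss:\s*([0-9.eE+-]+)")


def extract_losses(log_text: str) -> dict[int, str]:
    """Map step -> loss, kept as the logged string so the comparison is exact."""
    return {int(m.group(1)): m.group(2) for m in _LOSS_RE.finditer(log_text)}


def find_mismatches(run_a: str, run_b: str) -> list[int]:
    """Return the sorted steps whose logged losses differ between the two runs."""
    a, b = extract_losses(run_a), extract_losses(run_b)
    # A step present in only one run also counts as a mismatch.
    return sorted(s for s in a.keys() | b.keys() if a.get(s) != b.get(s))
```

Comparing the logged strings rather than parsed floats avoids any tolerance: two runs pass only if every logged loss is identical, which is what run-to-run determinism requires. It does not compare against a stored baseline, so drift across PyTorch nightlies or user commits is out of scope for this check.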
Add run-to-run determinism testing to H100 CI
This adds automatic run-to-run determinism verification for H100 integration tests. Tests marked with `determinism_test=True` will run twice with identical configuration and deterministic flags, then compare losses to ensure they match exactly.

The core loss extraction logic is factored into `torchtitan/tools/loss_utils.py` and shared between the integration test runner and the existing `loss_compare.py` script. The scripts directory is now a package to enable clean imports via `python -m scripts.loss_compare`.

The Float8 and HSDP+CP+compile+Float8 tests in the H100 suite are enabled for determinism testing (CUDA only).

Co-authored-by: Claude <noreply@anthropic.com>