
Conversation

@xmfan (Member) commented Feb 6, 2026

Add run-to-run determinism testing to H100 CI

This adds automatic run-to-run determinism verification for H100 integration tests. Tests marked with `determinism_test=True` will run twice with identical configuration and deterministic flags, then compare losses to ensure they match exactly.

The core loss extraction logic is factored into `torchtitan/tools/loss_utils.py` and shared between the integration test runner and the existing `loss_compare.py` script. The scripts directory is now a package to enable clean imports via `python -m scripts.loss_compare`.

The Float8 and HSDP+CP+compile+Float8 tests in the H100 suite are enabled for determinism testing (CUDA only).
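The extract-and-compare idea can be sketched as follows. This is a minimal sketch, not the actual `loss_utils.py` (which is not shown in this thread); the log format, regex, and helper names are assumptions:

```python
import re

# Hypothetical sketch: the real torchtitan/tools/loss_utils.py is not shown
# in this thread, so the log format and helper names are assumptions.
LOSS_RE = re.compile(r"step:\s*(\d+).*?loss:\s*([\d.]+)")

def extract_losses(log_text: str) -> dict[int, float]:
    """Map step number -> loss value parsed from a training log."""
    return {int(m.group(1)): float(m.group(2)) for m in LOSS_RE.finditer(log_text)}

def assert_run_to_run_deterministic(log_a: str, log_b: str) -> None:
    """Require identical losses at every step across two runs (no tolerance)."""
    losses_a, losses_b = extract_losses(log_a), extract_losses(log_b)
    if losses_a != losses_b:
        raise AssertionError(f"runs diverged: {losses_a} != {losses_b}")
```

Note the exact (rather than approximate) comparison: with deterministic flags set, two runs of the same config are expected to match bit for bit.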

Co-authored-by: Claude <noreply@anthropic.com>

xmfan added a commit that referenced this pull request Feb 6, 2026
stack-info: PR: #2339, branch: xmfan/stack/11
@meta-cla bot added the "CLA Signed" label Feb 6, 2026
(e.g., the logical progression), or if it's short just omit the bullet list
entirely.

Disclose that the PR was authored with Claude.
@xmfan (Member, Author) replied:

Copied over from pytorch's CLAUDE.md.

xmfan added a commit that referenced this pull request Feb 9, 2026
The `--run-to-run-determinism` flag in `loss_compare.py` now explicitly validates that no test-specific options are provided, raising a `ValueError` if they are.
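The validation described above can be sketched with `argparse`. This is a hypothetical sketch; the real option names in `loss_compare.py` may differ, and `--test-option` here is a stand-in for the test-specific options the flag rejects:

```python
import argparse

def parse_args(argv: list[str]) -> argparse.Namespace:
    # Hypothetical sketch of the mutual-exclusion check; not the real
    # loss_compare.py argument list.
    p = argparse.ArgumentParser()
    p.add_argument("--run-to-run-determinism", action="store_true")
    p.add_argument("--test-option", default=None)  # stand-in for test-specific options
    args = p.parse_args(argv)
    if args.run_to_run_determinism and args.test_option is not None:
        # Determinism mode compares one config against itself, so
        # test-specific overrides make no sense and are rejected loudly.
        raise ValueError(
            "--run-to-run-determinism does not accept test-specific options"
        )
    return args
```

Raising `ValueError` (rather than silently ignoring the extra options) makes a misconfigured CI invocation fail fast.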

@xmfan xmfan marked this pull request as ready for review February 9, 2026 23:23
```python
ngpu: int = 4
disabled: bool = False
skip_rocm_test: bool = False
determinism_test: bool = False  # Run twice and verify losses are identical
```
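How the runner might consume this field can be sketched as below. This is a hypothetical sketch; the actual torchtitan runner code is not shown in this thread, and `TestCase`/`run_case` here only illustrate the run-twice-and-compare flow:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class TestCase:
    # Hypothetical mirror of the fields quoted above; the real class lives
    # in the torchtitan integration test runner.
    name: str
    ngpu: int = 4
    disabled: bool = False
    skip_rocm_test: bool = False
    determinism_test: bool = False  # Run twice and verify losses are identical

def run_case(case: TestCase, run_fn: Callable[[TestCase], list[float]]) -> bool:
    """run_fn executes one training run and returns its per-step losses."""
    if case.disabled:
        return True
    first = run_fn(case)
    if case.determinism_test:
        second = run_fn(case)  # identical config, deterministic flags
        return first == second  # exact equality, no tolerance
    return True
```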
A reviewer (Contributor) commented:

The point is not only about being deterministic, but also about not changing before vs. after:

  • pytorch nightly updates
  • user commits

Is it correct that this PR doesn't address such issues?

@xmfan (Member, Author) replied:

This PR just makes sure that running the same command twice produces the same outputs. By adding this to PR-time CI, H100 CI runs twice on each PR, both runs against the same pytorch nightly.


Labels

ciflow/8gpu, CLA Signed