
Conversation

@xmfan (Member) commented Feb 6, 2026

Add run-to-run determinism testing to H100 CI

This adds automatic run-to-run determinism verification for H100 integration tests. Tests marked with `determinism_test=True` will run twice with identical configuration and deterministic flags, then compare losses to ensure they match exactly.

The core loss extraction logic is factored into `torchtitan/tools/loss_utils.py` and shared between the integration test runner and the existing `loss_compare.py` script. The scripts directory is now a package to enable clean imports via `python -m scripts.loss_compare`.

The Float8 and HSDP+CP+compile+Float8 tests in the H100 suite are enabled for determinism testing (CUDA only).
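The extract-and-compare idea can be sketched as follows. This is a minimal sketch, not the actual `loss_utils.py` (which is not shown in this thread); the log format, regex, and helper names are assumptions:

```python
import re

# Hypothetical sketch: the real torchtitan/tools/loss_utils.py is not shown
# in this thread, so the log format and helper names are assumptions.
LOSS_RE = re.compile(r"step:\s*(\d+).*?loss:\s*([\d.]+)")

def extract_losses(log_text: str) -> dict[int, float]:
    """Map step number -> loss value parsed from a training log."""
    return {int(m.group(1)): float(m.group(2)) for m in LOSS_RE.finditer(log_text)}

def assert_run_to_run_deterministic(log_a: str, log_b: str) -> None:
    """Require identical losses at every step across two runs (no tolerance)."""
    losses_a, losses_b = extract_losses(log_a), extract_losses(log_b)
    if losses_a != losses_b:
        raise AssertionError(f"runs diverged: {losses_a} != {losses_b}")
```

Note the exact (rather than approximate) comparison: with deterministic flags set, two runs of the same config are expected to match bit for bit.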

Co-authored-by: Claude <noreply@anthropic.com>

xmfan added a commit that referenced this pull request Feb 6, 2026
stack-info: PR: #2339, branch: xmfan/stack/11
@meta-cla bot added the "CLA Signed" label Feb 6, 2026
(e.g., the logical progression), or if it's short just omit the bullet list
entirely.

Disclose that the PR was authored with Claude.
@xmfan (Member, Author) replied:

Copied over from pytorch's CLAUDE.md.

xmfan added a commit that referenced this pull request Feb 9, 2026
The `--run-to-run-determinism` flag in `loss_compare.py` now explicitly validates that no test-specific options are provided, raising a `ValueError` if they are.
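The validation described above can be sketched with `argparse`. This is a hypothetical sketch; the real option names in `loss_compare.py` may differ, and `--test-option` here is a stand-in for the test-specific options the flag rejects:

```python
import argparse

def parse_args(argv: list[str]) -> argparse.Namespace:
    # Hypothetical sketch of the mutual-exclusion check; not the real
    # loss_compare.py argument list.
    p = argparse.ArgumentParser()
    p.add_argument("--run-to-run-determinism", action="store_true")
    p.add_argument("--test-option", default=None)  # stand-in for test-specific options
    args = p.parse_args(argv)
    if args.run_to_run_determinism and args.test_option is not None:
        # Determinism mode compares one config against itself, so
        # test-specific overrides make no sense and are rejected loudly.
        raise ValueError(
            "--run-to-run-determinism does not accept test-specific options"
        )
    return args
```

Raising `ValueError` (rather than silently ignoring the extra options) makes a misconfigured CI invocation fail fast.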

@xmfan xmfan marked this pull request as ready for review February 9, 2026 23:23
```python
ngpu: int = 4
disabled: bool = False
skip_rocm_test: bool = False
determinism_test: bool = False  # Run twice and verify losses are identical
```
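How the runner might consume this field can be sketched as below. This is a hypothetical sketch; the actual torchtitan runner code is not shown in this thread, and `TestCase`/`run_case` here only illustrate the run-twice-and-compare flow:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class TestCase:
    # Hypothetical mirror of the fields quoted above; the real class lives
    # in the torchtitan integration test runner.
    name: str
    ngpu: int = 4
    disabled: bool = False
    skip_rocm_test: bool = False
    determinism_test: bool = False  # Run twice and verify losses are identical

def run_case(case: TestCase, run_fn: Callable[[TestCase], list[float]]) -> bool:
    """run_fn executes one training run and returns its per-step losses."""
    if case.disabled:
        return True
    first = run_fn(case)
    if case.determinism_test:
        second = run_fn(case)  # identical config, deterministic flags
        return first == second  # exact equality, no tolerance
    return True
```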
A reviewer (Contributor) commented:

The point is not only about being deterministic, but also about not changing before vs. after:

  • pytorch nightly updates
  • user commits

Is it correct that this PR doesn't address such issues?

@xmfan (Member, Author) replied:

This PR just makes sure that running the same command twice produces the same outputs. By adding this to PR-time CI, H100 CI runs twice on each PR, both runs against the same pytorch nightly.


Labels

ciflow/8gpu, CLA Signed