Add run-to-run determinism testing to H100 CI #2339
New file:

```markdown
# Commit messages

Don't commit unless the user explicitly asks you to.

When writing a commit message, don't make a bullet list of the individual
changes. Instead, if the PR is large, explain the order to review changes
(e.g., the logical progression), or if it's short just omit the bullet list
entirely.

Disclose that the PR was authored with Claude.
```
New file:

```python
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the BSD-style license found in the
# LICENSE file in the root directory of this source tree.
```
```diff
@@ -24,6 +24,7 @@ class OverrideDefinitions:
     ngpu: int = 4
     disabled: bool = False
     skip_rocm_test: bool = False
+    determinism_test: bool = False  # Run twice and verify losses are identical
```

**Contributor:** The point is not only about being deterministic, but also about not changing before vs. after. Is it correct that this PR doesn't address such issues?

**Member (Author):** This PR just makes sure that when you run the same command twice, it produces the same outputs. By adding this to PR-time CI, you would run H100 CI twice on each PR, both against the same PyTorch nightly.

**Contributor:** So this test only guards that determinism is set up correctly and working correctly, right? I think if we make sure the loss doesn't change before vs. after (PyTorch nightly, and user commits), it already covers the determinism check.

```diff
     def __repr__(self):
         return self.test_descr
```
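The author's reply describes the intended semantics of the flag: execute the identical command twice and require bit-identical losses. A minimal standalone sketch of that check (the training run is stubbed with a plain function rather than a real launch, and the inline regex is an assumption mirroring the loss-extraction utility added in this PR):

```python
import re
from typing import Callable

# Assumed pattern, mirroring the loss-extraction utility added in this PR.
STEP_LOSS = re.compile(r"step:\s*(\d+)\s*loss:\s*(\d+\.\d+)")

def extract(log_text: str) -> dict[int, float]:
    """Parse 'step: N loss: X.XXXX' lines into {step: loss}."""
    return {int(s): float(l) for s, l in STEP_LOSS.findall(log_text)}

def determinism_check(run: Callable[[], str]) -> tuple[bool, str]:
    """Invoke the same 'training run' twice; pass only if every step's
    loss is bit-identical across the two runs."""
    first, second = extract(run()), extract(run())
    if first.keys() != second.keys():
        return False, "step sets differ between runs"
    diffs = [s for s in first if first[s] != second[s]]
    if diffs:
        return False, f"losses differ at steps {diffs}"
    return True, f"all {len(first)} steps identical"

# Stubbed run: always emits the same log, so the check passes.
ok, msg = determinism_check(lambda: "step: 1 loss: 8.1234\nstep: 2 loss: 7.5678\n")
print(ok, msg)  # True all 2 steps identical
```

Exact float equality (rather than a tolerance) is the point here: run-to-run determinism on the same nightly should reproduce losses bit for bit.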
New file:

```python
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the BSD-style license found in the
# LICENSE file in the root directory of this source tree.

"""
Shared utilities for loss extraction and comparison.

This module provides common functionality used by both:
- scripts/loss_compare.py (CLI tool for comparing losses across commits)
- tests/integration_tests/run_tests.py (integration test runner)
"""

import re


def extract_losses_from_log(log_file: str) -> dict[int, float]:
    """Extract step and loss pairs from a training log file.

    Parses log lines matching the pattern: "step: N loss: X.XXXX"
    Handles ANSI escape codes that may be present in colored terminal output.

    Args:
        log_file: Path to the training log file

    Returns:
        Dictionary mapping step numbers to loss values
    """
    losses = {}
    step_loss_pattern = re.compile(r"step:\s*(\d+)\s*loss:\s*(\d+\.\d+)")
    ansi_escape = re.compile(r"\x1b\[[0-9;]*m")

    with open(log_file, "r") as f:
        for line in f:
            # Strip ANSI codes before matching
            clean_line = ansi_escape.sub("", line)
            match = step_loss_pattern.search(clean_line)
            if match:
                step, loss = match.groups()
                losses[int(step)] = float(loss)

    return losses


def compare_losses(
    losses1: dict[int, float],
    losses2: dict[int, float],
    name1: str = "run1",
    name2: str = "run2",
) -> tuple[bool, str]:
    """Compare two loss dictionaries for equality.

    Args:
        losses1: First loss dictionary (step -> loss)
        losses2: Second loss dictionary (step -> loss)
        name1: Name for first run (for error messages)
        name2: Name for second run (for error messages)

    Returns:
        Tuple of (success: bool, message: str)
        - success is True if all losses match exactly
        - message contains details about the comparison or mismatch
    """
    if not losses1:
        return False, f"No losses found in {name1}"

    if not losses2:
        return False, f"No losses found in {name2}"

    steps1 = set(losses1.keys())
    steps2 = set(losses2.keys())

    if steps1 != steps2:
        return False, (
            f"Steps mismatch: {name1} has {len(steps1)} steps, "
            f"{name2} has {len(steps2)} steps"
        )

    mismatches = []
    for step in sorted(steps1):
        loss1 = losses1[step]
        loss2 = losses2[step]
        if loss1 != loss2:
            mismatches.append(f"  step {step}: {name1}={loss1}, {name2}={loss2}")

    if mismatches:
        return False, "Loss mismatches:\n" + "\n".join(mismatches)

    return True, f"All {len(steps1)} steps have identical losses"
```

**Contributor** (on `extract_losses_from_log`): What are other use cases for this function to put it in a utils file?
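As a quick illustration of the log format this module parses, here is a standalone sketch with a fabricated log line carrying ANSI color codes; the regexes are duplicated inline (rather than importing the module) so the example runs on its own:

```python
import re
import tempfile

# Fabricated training log; the second line carries ANSI color codes, as a
# colored terminal logger might emit.
fake_log = "step:  1  loss:  8.1234\n\x1b[32mstep:  2  loss:  7.5678\x1b[0m\n"

# Same patterns as extract_losses_from_log above.
step_loss_pattern = re.compile(r"step:\s*(\d+)\s*loss:\s*(\d+\.\d+)")
ansi_escape = re.compile(r"\x1b\[[0-9;]*m")

def extract_losses(path: str) -> dict[int, float]:
    """Mirror of the extraction logic, for demonstration."""
    losses = {}
    with open(path) as f:
        for line in f:
            # Strip color codes first, then match step/loss
            m = step_loss_pattern.search(ansi_escape.sub("", line))
            if m:
                losses[int(m.group(1))] = float(m.group(2))
    return losses

with tempfile.NamedTemporaryFile("w", suffix=".log", delete=False) as f:
    f.write(fake_log)
    path = f.name

print(extract_losses(path))  # {1: 8.1234, 2: 7.5678}
```

Stripping ANSI escapes before matching is what lets the same parser work on both plain CI logs and colored terminal output.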
**Reviewer** (on the commit-message guidelines file): copied over from pytorch's claude md

**Reviewer** (on the commit-message guidelines file): With this file, can we directly ask Claude Code to create a PR for us?