fix: support arbitrary values for checkpointing.metric_name
#1291
Conversation
Thanks for the comments @samodi-nv! I've addressed them.
Force-pushed from a662a80 to 678dbf3.
📝 Walkthrough

This pull request refactors metric-based checkpointing across the codebase by introducing a namespaced metric format with "train:" or "val:" prefixes. Configuration files are updated to use the new format, and algorithm implementations are enhanced to parse these prefixes, validate metric existence, and handle missing metrics gracefully with warnings.
Sequence Diagram

```mermaid
sequenceDiagram
    participant Config as Config
    participant Algo as Algorithm (SFT/GRPO/etc)
    participant Metrics as Metrics Storage
    participant SaveState as Checkpoint SaveState
    Config->>Algo: metric_name = "val:loss"
    Algo->>Algo: Parse prefix<br/>("val" or "train")
    alt Prefix valid ("val" or "train")
        Algo->>Algo: Extract metric_name
        alt Prefix is "val"
            Algo->>Metrics: Select val_metrics
        else Prefix is "train"
            Algo->>Metrics: Select train metrics
        end
        alt Metric exists
            Metrics-->>Algo: metric_value
            Algo->>SaveState: Store metric under<br/>"val:loss" key
        else Metric missing
            Algo-->>Algo: Emit warning
            Algo->>SaveState: Remove "val:loss"<br/>entry if present
        end
    else Invalid prefix
        Algo-->>Algo: Emit warning
    end
```
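As a rough illustration of the flow above, here is a minimal, self-contained Python sketch. The function and variable names (select_checkpoint_metric, save_state, etc.) are illustrative assumptions, not the exact code in this PR:

```python
import warnings


def select_checkpoint_metric(
    full_metric_name: str,
    train_metrics: dict[str, float],
    val_metrics: dict[str, float],
    save_state: dict[str, float],
) -> None:
    """Resolve a "train:<name>"/"val:<name>" metric and record it for top-k checkpointing."""
    prefix, _, metric_name = full_metric_name.partition(":")
    if prefix not in ("train", "val"):
        warnings.warn(
            f"Invalid checkpoint metric prefix '{prefix}'; expected 'train' or 'val'.",
            stacklevel=2,
        )
        return

    metrics = val_metrics if prefix == "val" else train_metrics
    if metric_name in metrics:
        # Store the value under the fully qualified key, e.g. "val:loss".
        save_state[full_metric_name] = metrics[metric_name]
    else:
        warnings.warn(
            f"Checkpoint metric '{full_metric_name}' not found in {prefix} metrics; "
            "this checkpoint will not be ranked for top-k.",
            stacklevel=2,
        )
        # Drop any stale entry so top-k ranking does not use an old value.
        save_state.pop(full_metric_name, None)


# Example: pick the validation loss for checkpoint ranking.
state: dict[str, float] = {}
select_checkpoint_metric("val:loss", {"loss": 0.9}, {"loss": 0.42}, state)
print(state)  # {'val:loss': 0.42}
```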
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes

The changes demonstrate consistent patterns across configuration files (homogeneous updates), but the algorithm implementations introduce heterogeneous logic for metric parsing, validation, and conditional branching based on metric availability. Multiple files require reasoning about the new control flow and error-handling paths.
Actionable comments posted: 5
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
examples/configs/rm.yaml (1)
152-155: Update stale example to match new prefix requirement

The comment still shows the pre-change form. Recommend this fix for consistency:

```diff
- # metric_name: "validation-<NameOfValidationDataset1>_loss"
+ # metric_name: "val:validation-<NameOfValidationDataset1>_loss"
```

nemo_rl/algorithms/rm.py (1)
308-312: Wrong config key in assertion (RM uses rm, not dpo).

The assertion references master_config["dpo"]["val_period"]; it should be ["rm"]["val_period"]. This can raise a KeyError or mask a validation misconfiguration.

Apply:

```diff
- assert val_dataloader is not None or master_config["dpo"]["val_period"] == 0, (
+ assert val_dataloader is not None or master_config["rm"]["val_period"] == 0, (
      "val_dataloader is None, so dpo.val_period must be 0"
  )
```

nemo_rl/algorithms/grpo.py (1)
1054-1056: Wrong config key in GRPO validate assertion.

Should reference grpo.val_period, not dpo.val_period.

Apply:

```diff
- assert val_dataloader is not None or master_config["dpo"]["val_period"] == 0, (
+ assert val_dataloader is not None or master_config["grpo"]["val_period"] == 0, (
      "val_dataloader is None, so dpo.val_period must be 0"
  )
```
🧹 Nitpick comments (3)
nemo_rl/algorithms/sft.py (1)
511-520: Consider more explicit parsing logic for clarity.

The logic at line 519 uses "val" in parts[0] to determine the metric source. While the assertion above ensures the format is correct, using an explicit comparison would be clearer:

```diff
- train_or_val = "val" if "val" in parts[0] else "train"
+ train_or_val = parts[0]  # Already validated to be "val" or "train"
```

This makes the intent clearer and leverages the assertion's validation.
nemo_rl/algorithms/distillation.py (1)
734-759: Parse metric prefix safely and preserve metric names containing colons

Current parsing uses split(":") without maxsplit, which fails for metrics containing colons after the prefix, and relies on substring checks instead of exact comparison. Use split(":", 1) for safe parsing. Apply consistently across all algorithm files:

```diff
- parts = full_metric_name.split(":")
- train_or_val = "val" if "val" in parts[0] else "train"
- metric_name = parts[1]
+ train_or_val, metric_name = full_metric_name.split(":", 1)
+ assert train_or_val in ("train", "val"), (
+     f"Invalid metric prefix '{train_or_val}'. Expected 'train' or 'val'."
+ )
```

Files to update: nemo_rl/algorithms/distillation.py:741, nemo_rl/algorithms/sft.py:518, nemo_rl/algorithms/rm.py:582, nemo_rl/algorithms/dpo.py:654, nemo_rl/algorithms/grpo.py:914, and grpo.py:1714

Optionally, improve the warning message for clarity:

```diff
- warnings.warn(
-     f"You asked to save checkpoints based on {metric_name} but the metric is not found in the {train_or_val} metrics. "
-     "This checkpoint will not be saved as top-k.",
-     stacklevel=2,
- )
+ warnings.warn(
+     f"Checkpoint metric '{full_metric_name}' not found in {train_or_val} metrics; skipping top-k update.",
+     stacklevel=2,
+ )
```

nemo_rl/algorithms/grpo.py (1)
907-931: Deduplicate metric_name parsing via a small utility.

Parsing logic is duplicated across RM/DPO/GRPO (sync + async). Consider a shared helper in nemo_rl/utils/checkpoint.py, e.g., parse_checkpoint_metric_name(full_metric_name) -> tuple[prefix, metric], and reuse it. This reduces drift and enforces a single policy.
If helpful, I can draft the utility and apply call-site changes across modules.
Also applies to: 1709-1731
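If such a helper were added, it might look roughly like the sketch below; the signature and error handling are assumptions for illustration, not an existing function in nemo_rl/utils/checkpoint.py:

```python
def parse_checkpoint_metric_name(full_metric_name: str) -> tuple[str, str]:
    """Split a namespaced checkpoint metric into (prefix, metric_name).

    Args:
        full_metric_name: Metric in the form "train:<name>" or "val:<name>",
            e.g. "val:validation-default_loss".

    Returns:
        A (prefix, metric_name) tuple, where prefix is "train" or "val".

    Raises:
        ValueError: If the prefix is missing or not "train"/"val".
    """
    prefix, sep, metric_name = full_metric_name.partition(":")
    if not sep or prefix not in ("train", "val"):
        raise ValueError(
            f"checkpointing.metric_name must start with 'train:' or 'val:', "
            f"got '{full_metric_name}'"
        )
    return prefix, metric_name
```

Using partition (or split(":", 1)) keeps any additional colons inside the metric name intact, which also covers the colon-handling nitpick above.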
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (18)
examples/configs/distillation_math.yaml (1 hunks)
examples/configs/dpo.yaml (2 hunks)
examples/configs/grpo_math_1B.yaml (1 hunks)
examples/configs/grpo_math_1B_megatron.yaml (1 hunks)
examples/configs/grpo_sliding_puzzle.yaml (1 hunks)
examples/configs/rm.yaml (1 hunks)
examples/configs/sft.yaml (1 hunks)
examples/configs/sft_openmathinstruct2.yaml (1 hunks)
examples/configs/sft_openmathinstruct2_megatron.yaml (1 hunks)
examples/configs/sft_vlm_3B.yaml (1 hunks)
examples/configs/vlm_grpo_3B.yaml (1 hunks)
examples/configs/vlm_grpo_3B_megatron.yaml (1 hunks)
nemo_rl/algorithms/distillation.py (1 hunks)
nemo_rl/algorithms/dpo.py (1 hunks)
nemo_rl/algorithms/grpo.py (2 hunks)
nemo_rl/algorithms/rm.py (1 hunks)
nemo_rl/algorithms/sft.py (1 hunks)
nemo_rl/utils/checkpoint.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
examples/configs/*.yaml
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
examples/configs/*.yaml: Exemplar configs under examples/configs/*.yaml must include documented defaults
When adding a new config key, reflect its recommended default in exemplar YAMLs under examples/configs/*.yaml
Files:
examples/configs/grpo_math_1B.yaml, examples/configs/sft_openmathinstruct2.yaml, examples/configs/rm.yaml, examples/configs/distillation_math.yaml, examples/configs/vlm_grpo_3B_megatron.yaml, examples/configs/sft_openmathinstruct2_megatron.yaml, examples/configs/sft_vlm_3B.yaml, examples/configs/sft.yaml, examples/configs/grpo_math_1B_megatron.yaml, examples/configs/vlm_grpo_3B.yaml, examples/configs/grpo_sliding_puzzle.yaml, examples/configs/dpo.yaml
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py: Follow the Google Python Style Guide for all Python code
Target Python 3.12+ for all Python code in NeMo-RL
Indent Python code with 4 spaces; do not use tabs
Python filenames should be snake_case (e.g., some_file.py)
Class names should be PascalCase
Function and method names should be snake_case
Local variable names should be snake_case; if starting with a number, prefix with k (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE and prefixed with G_ (e.g., G_MY_GLOBAL)
Constants should be UPPER_SNAKE_CASE
Avoid shadowing variables declared in an outer scope
Initialize all externally visible members of a class in the constructor
For public interfaces used outside a file, prefer docstrings over comments
Use comments mainly for code within a function or interfaces local to a file
Commented-out code must include a nearby comment explaining usage and why it is commented out; otherwise remove before merging
Use Google-style docstrings for classes and functions (Sphinx-parseable)
Avoid using reflection when functionality can be easily achieved without it
Limit except clauses to the smallest specific set of exceptions possible
For duck-typing via try/except, keep the try body minimal and use else for main logic
Add the NVIDIA copyright header (with current year) at the top of all Python files, excluding tests/ and test-only scripts
Files:
nemo_rl/utils/checkpoint.py, nemo_rl/algorithms/distillation.py, nemo_rl/algorithms/grpo.py, nemo_rl/algorithms/sft.py, nemo_rl/algorithms/dpo.py, nemo_rl/algorithms/rm.py
nemo_rl/**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
nemo_rl/**/*.py: Do not set non-None configuration defaults in code; YAML is the single source of truth for defaults
Access required config attributes directly (e.g., policy_cfg["precision"]) and assume presence; do not introduce hidden defaults
Express configuration optionality via TypedDict using typing.NotRequired
When adding a new config key to a TypedDict subclass, document the key’s purpose, valid values/types, and recommended default in code
For any class or function decorated with @ray.remote, add '# pragma: no cover' on the class/def line (and on remote functions)
Files:
nemo_rl/utils/checkpoint.py, nemo_rl/algorithms/distillation.py, nemo_rl/algorithms/grpo.py, nemo_rl/algorithms/sft.py, nemo_rl/algorithms/dpo.py, nemo_rl/algorithms/rm.py
🧠 Learnings (1)
📚 Learning: 2025-09-18T13:26:43.307Z
Learnt from: zpqiu
PR: NVIDIA-NeMo/RL#1006
File: examples/configs/recipes/llm/distillation-qwen3-32b-to-8b-base-2n8g-fsdp2tp2.v1.yaml:19-26
Timestamp: 2025-09-18T13:26:43.307Z
Learning: In on-policy distillation workflows, validation can use downstream task performance (like math problem solving) as RL-like reward metrics rather than traditional distillation metrics like KL divergence. In this case, "val_reward" with "higher_is_better: true" is the correct checkpoint monitoring configuration.
Applied to files:
examples/configs/grpo_math_1B.yaml, examples/configs/sft_openmathinstruct2.yaml, examples/configs/rm.yaml, examples/configs/distillation_math.yaml, examples/configs/vlm_grpo_3B_megatron.yaml, examples/configs/sft_openmathinstruct2_megatron.yaml, examples/configs/sft_vlm_3B.yaml, examples/configs/sft.yaml, examples/configs/grpo_math_1B_megatron.yaml, examples/configs/vlm_grpo_3B.yaml, examples/configs/grpo_sliding_puzzle.yaml, examples/configs/dpo.yaml
🪛 Ruff (0.14.0)
nemo_rl/algorithms/grpo.py
922-922: No explicit stacklevel keyword argument found
Set stacklevel=2
(B028)
1722-1722: No explicit stacklevel keyword argument found
Set stacklevel=2
(B028)
nemo_rl/algorithms/sft.py
531-531: No explicit stacklevel keyword argument found
Set stacklevel=2
(B028)
nemo_rl/algorithms/dpo.py
662-662: No explicit stacklevel keyword argument found
Set stacklevel=2
(B028)
nemo_rl/algorithms/rm.py
590-590: No explicit stacklevel keyword argument found
Set stacklevel=2
(B028)
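For context on these B028 findings, a small illustration of the requested fix; the helper name here is hypothetical, not code from the PR:

```python
import warnings


def warn_missing_checkpoint_metric(full_metric_name: str, train_or_val: str) -> None:
    # stacklevel=2 attributes the warning to the caller (e.g. the training loop)
    # rather than to this line, which is what Ruff's B028 rule asks for.
    warnings.warn(
        f"Checkpoint metric '{full_metric_name}' not found in {train_or_val} metrics; "
        "this checkpoint will not be saved as top-k.",
        stacklevel=2,
    )


warn_missing_checkpoint_metric("val:loss", "val")
```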
🔇 Additional comments (13)
nemo_rl/utils/checkpoint.py (1)
43-44: LGTM! Clear documentation of the required metric format.

The documentation clearly specifies that metric_name must use "val:" or "train:" prefixes, which aligns with the implementation across algorithm files and addresses the requirement from past review comments.
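As a rough sketch of how such a documented config key could look; the CheckpointingConfig name and surrounding fields are assumptions for illustration, not necessarily the actual definitions in checkpoint.py:

```python
from typing import NotRequired, TypedDict


class CheckpointingConfig(TypedDict):
    # Illustrative fields only; the real config in the repo may differ.
    enabled: bool
    checkpoint_dir: str
    # Metric used to rank top-k checkpoints. Must be namespaced with its source,
    # e.g. "val:loss", "train:loss", or "val:reward".
    metric_name: NotRequired[str]
    # True for reward-like metrics, False for loss-like metrics.
    higher_is_better: NotRequired[bool]
    keep_top_k: NotRequired[int]
```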
examples/configs/grpo_math_1B_megatron.yaml (1)
32-36: LGTM! Config correctly adopts the new metric naming convention.

The change properly updates the metric_name to use the "val:" prefix and includes a helpful inline comment. The checkpoint_must_save_by field addition aligns with broader checkpointing patterns.
Based on learnings, "val:reward" with "higher_is_better: true" is the correct configuration for RL-based reward metrics.
examples/configs/vlm_grpo_3B_megatron.yaml (1)
29-33: LGTM! Consistent with the new metric naming convention.

The configuration correctly adopts the "val:" prefix format with appropriate inline documentation.
examples/configs/grpo_sliding_puzzle.yaml (1)
14-18: LGTM! Consistent adoption of the new format.

examples/configs/distillation_math.yaml (1)
22-26: LGTM! Config correctly updated.

The metric_name properly uses the new format. The past review comment about documenting format options in checkpoint.py has been addressed in this PR.
examples/configs/sft_openmathinstruct2.yaml (1)
15-19: LGTM! Correctly configured for loss metric.

The metric_name properly uses the "val:loss" format, and higher_is_better: false is correctly set for loss metrics.

examples/configs/grpo_math_1B.yaml (1)
37-41: LGTM! Final config correctly updated.

The metric_name properly adopts the "val:reward" format with appropriate documentation.
examples/configs/sft_vlm_3B.yaml (1)
19-19: LGTM: namespaced checkpoint metric

The switch to "val:loss" with a clarifying comment matches the new convention; higher_is_better: false remains correct for loss.
examples/configs/sft_openmathinstruct2_megatron.yaml (1)
17-18: LGTM: prefixed metric format

"val:loss" and the inline note align with the new full_metric_name workflow.
examples/configs/dpo.yaml (1)
25-26: LGTM: DPO metric now explicitly from validation set

Using "val:validation-default_loss" and updating the comment example reduces ambiguity and matches the new requirement.
Also applies to: 183-184
examples/configs/vlm_grpo_3B.yaml (1)
34-36: LGTM: reward metric correctly namespaced

"val:reward" with higher_is_better: true matches GRPO usage.
examples/configs/rm.yaml (1)
18-20: LGTM: namespaced loss metric

"val:loss" and higher_is_better: false are consistent.
examples/configs/sft.yaml (1)
18-19: LGTM: prefixed loss metric

"val:loss" matches the new convention; no further changes needed.
lgtm. small comment on warning
What does this PR do?
Add a one line overview of what this PR aims to accomplish.
Issues
List issues that this PR closes (syntax):
closes #1261
Usage
# Add a code snippet demonstrating how to use this

Before your PR is "Ready for review"
Pre checks:
Additional Information
Summary by CodeRabbit
Release Notes
Chores
Documentation