
Conversation


@sanjana-inflection commented Oct 17, 2025

What does this PR do?

Adds support for IPO and RPO preference losses (RPO with forward KL, backward KL, and squared-distance variants).

Issues

Issue #193

Usage

  • Select the new objectives through the DPO config (dpo.preference_loss and dpo.gt_reward_scale); a hedged configuration sketch is shown below.
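A minimal configuration sketch, assuming the dpo.preference_loss and dpo.gt_reward_scale keys documented in docs/guides/dpo.md by this PR; the dictionary form and the comments are illustrative, not the exact training-script API:

# Hedged sketch: only preference_loss and gt_reward_scale are keys added by this PR;
# how they are wired into a run is an assumption for illustration.
dpo_overrides = {
    "dpo.preference_loss": "rpo_bwd_kl",  # "dpo" | "ipo" | "rpo_sq" | "rpo_fwd_kl" | "rpo_bwd_kl"
    "dpo.gt_reward_scale": 1.0,           # scaling applied to ground-truth rewards (RPO variants)
}

# RPO variants additionally expect a per-completion "reward" field in the preference
# data, which is propagated to the loss as reward_chosen / reward_rejected.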

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

  • ...

Summary by CodeRabbit

Release Notes

  • New Features

    • Added support for multiple DPO preference loss objectives: IPO and RPO variants (forward KL, backward KL, squared distance).
    • Introduced configurable ground-truth reward scaling for preference optimization.
    • Extended training dataset schemas to support optional reward values per completion.
  • Documentation

    • Updated DPO guides with new preference loss options and configuration examples.
  • Tests

    • Added test coverage for new preference loss modes and reward handling.

Signed-off-by: Sanjana Ravi <sanjana@inflection.ai>
@sanjana-inflection requested review from a team as code owners, October 17, 2025 22:31
@github-actions bot added the documentation label (Improvements or additions to documentation), Oct 17, 2025

coderabbitai bot commented Oct 17, 2025

📝 Walkthrough

This PR extends DPO training to support multiple preference loss objectives (IPO, RPO variants) by adding optional reward fields to data schemas, introducing new DPO configuration options for preference loss selection and ground-truth reward scaling, and implementing corresponding loss computation logic.

Changes

  • Documentation & Examples (docs/guides/dpo.md, examples/run_dpo.py): Adds an "Other Objectives" subsection documenting Identity and Reward-Aware Preference Optimization; updates DPO training data examples to include top-level reward fields; introduces new config options dpo.preference_loss (dpo, ipo, rpo_sq, rpo_fwd_kl, rpo_bwd_kl) and dpo.gt_reward_scale; extends the example preprocessor to propagate reward fields from chosen/rejected completions.
  • Data Schema & Interfaces (nemo_rl/data/interfaces.py, nemo_rl/data/datasets/preference_datasets/preference_dataset.py): Extends the DPODatumSpec TypedDict with optional reward_chosen and reward_rejected fields; documents the optional per-completion reward field in the PreferenceDataset schema for RPO use.
  • Data Processing & Collation (nemo_rl/data/collate_fn.py): Augments preference_collate_fn to track and propagate optional reward fields from training samples, appending reward_chosen and reward_rejected pairs and validating length consistency.
  • Loss Function Implementation (nemo_rl/algorithms/loss_functions.py): Adds preference_loss and gt_reward_scale fields to DPOLossConfig and a rewards field to DPOLossDataDict; changes the DPOLossFn base class from PreferenceLoss to LossFunction; introduces split_output_tensor(); implements branched loss computation for the dpo, ipo, rpo_sq, rpo_fwd_kl, and rpo_bwd_kl modes with ground-truth reward scaling (a hedged sketch follows after this list).
  • Tests (tests/unit/algorithms/test_loss_functions.py): Adds four new unit tests for DPOLossFn covering the ipo, rpo_sq, rpo_fwd_kl, and rpo_bwd_kl preference loss modes, with CUDA-dependent execution and mock reward data validation.
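For orientation, the following is a minimal sketch of how the branched loss modes relate to the documented formulas, assuming the standard DPO/IPO definitions and a Bernoulli-KL reading of the RPO forward/backward KL variants; preference_loss_sketch, its arguments, and the KL formulation are illustrative and may differ from the actual DPOLossFn implementation.

import torch
import torch.nn.functional as F


def preference_loss_sketch(
    delta_r: torch.Tensor,    # implicit reward margin: (policy - reference) logprob gap, chosen minus rejected
    delta_gtr: torch.Tensor,  # ground-truth reward margin, e.g. gt_reward_scale * (reward_chosen - reward_rejected)
    beta: float,              # reference_policy_kl_penalty
    mode: str,                # "dpo" | "ipo" | "rpo_sq" | "rpo_fwd_kl" | "rpo_bwd_kl"
) -> torch.Tensor:
    """Sketch of the branched preference losses; the PR's DPOLossFn may differ in detail."""
    if mode == "dpo":
        return -F.logsigmoid(beta * delta_r).mean()
    if mode == "ipo":
        return ((delta_r - 1.0 / (2.0 * beta)) ** 2).mean()
    if mode == "rpo_sq":
        return ((delta_r - delta_gtr) ** 2).mean()
    # RPO KL variants (assumed formulation): compare the Bernoulli "chosen is preferred"
    # distributions implied by the model margin and the ground-truth margin.
    p_model = torch.sigmoid(beta * delta_r)
    p_gt = torch.sigmoid(delta_gtr)
    if mode == "rpo_fwd_kl":    # KL(p_gt || p_model)
        p, q = p_gt, p_model
    elif mode == "rpo_bwd_kl":  # KL(p_model || p_gt)
        p, q = p_model, p_gt
    else:
        raise ValueError(f"unknown preference_loss: {mode}")
    kl = p * (p / q).log() + (1 - p) * ((1 - p) / (1 - q)).log()
    return kl.mean()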

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)
  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 36.36%, which is below the required threshold of 80.00%. Resolution: run @coderabbitai generate docstrings to improve docstring coverage.
  • Test Results For Major Changes: ⚠️ Warning. This PR adds support for IPO (Identity Preference Optimization) and RPO (Reward-Aware Preference Optimization) variants, a major feature addition affecting core loss computations and training behavior; the new loss calculations and reward-based training paths could impact convergence and numeric behavior. Four new unit tests were added to tests/unit/algorithms/test_loss_functions.py for the new preference loss modes, but the PR description itself contains only a placeholder usage example and the contributor checklist, with no test execution results, convergence validation, or performance benchmarks, so the requirement that the description document testing information to rule out regressions is not met. Resolution: update the PR description with results from the newly added unit tests (demonstrating that the IPO and RPO implementations produce expected loss values and that convergence is not regressed relative to baseline DPO), and, if performance characteristics differ between the preference loss modes, include before-and-after metrics with relevant configuration details. At minimum, include output showing the unit tests pass and a brief summary of test coverage for each new preference loss mode.
✅ Passed checks (2 passed)
  • Title Check: ✅ Passed. The title "feat: Add support for IPO and RPO" directly and accurately reflects the primary objective of the pull request. The changeset's main focus is implementing support for two new preference loss types, Identity Preference Optimization (IPO) and Reward-Aware Preference Optimization (RPO) with variants, along with the necessary infrastructure (reward fields, configuration options, and data handling). The title is concise, specific, and avoids vague terminology, clearly communicating the core feature being introduced.
  • Description Check: ✅ Passed. Check skipped: CodeRabbit's high-level summary is enabled.



@coderabbitai bot left a comment


Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
nemo_rl/data/collate_fn.py (1)

152-165: Make rewards a torch tensor and validate presence once.

Currently, rewards is a Python list while DPOLossFn uses tensor ops, so it will error. Also, the assertion triggers late if only some samples include rewards.

Apply:

@@
-    rewards = []
+    rewards: list[float] = []
+    # Ensure all-or-none rewards presence for the batch to avoid silent partials
+    has_rewards_for_all = all(
+        ("reward_chosen" in ds and "reward_rejected" in ds) for ds in data_batch
+    )
@@
-        if "reward_chosen" in datum_spec and "reward_rejected" in datum_spec:
-            rewards.append(datum_spec["reward_chosen"])
-            rewards.append(datum_spec["reward_rejected"])
+        if has_rewards_for_all:
+            rewards.append(float(datum_spec["reward_chosen"]))       # type: ignore[arg-type]
+            rewards.append(float(datum_spec["reward_rejected"]))     # type: ignore[arg-type]
@@
-    if rewards:
-        assert len(rewards) == len(message_log), (
-            f"rewards length ({len(rewards)}) and message_log length ({len(message_log)}) mismatch"
-        )
+    if rewards:
+        if len(rewards) != len(message_log):
+            raise ValueError(
+                f"rewards length ({len(rewards)}) and message_log length ({len(message_log)}) mismatch; "
+                "either provide rewards for every (chosen,rejected) pair in the batch or for none."
+            )
@@
-    if rewards:
-        data["rewards"] = rewards
+    if rewards:
+        # Align dtype/device with input_ids
+        data["rewards"] = torch.tensor(rewards, dtype=torch.float32, device=data["input_ids"].device)

Also applies to: 166-168, 204-205

nemo_rl/algorithms/loss_functions.py (1)

700-707: Avoid hardcoded .cuda(); use the logits’ device.

Using .cuda() breaks on CPU-only runs and heterogeneous devices. Align with next_token_logits.

Apply:

-            next_tokens = data["input_ids"][:, 1:].cuda()  # Skip first token
+            next_tokens = data["input_ids"][:, 1:].to(next_token_logits.device)  # Skip first token

Consider making the same change in NLLLoss for consistency (separate PR ok).

🧹 Nitpick comments (8)
nemo_rl/data/interfaces.py (1)

50-51: Document new DPODatumSpec fields.

Add concise comments describing semantics to keep schema self-explanatory.

Apply:

-    reward_chosen: NotRequired[float]
-    reward_rejected: NotRequired[float]
+    reward_chosen: NotRequired[float]   # Ground-truth reward for the chosen completion (used by RPO)
+    reward_rejected: NotRequired[float] # Ground-truth reward for the rejected completion (used by RPO)
nemo_rl/algorithms/loss_functions.py (2)

595-609: Replace Greek σ in docstrings to satisfy linters and improve clarity.

Use sigmoid(x) instead of σ(x); also render Δ consistently as Delta.

Apply (excerpt):

-    L_pref(θ) = -E[log(σ(β * Δ_r))]
+    L_pref(theta) = -E[log(sigmoid(beta * Delta_r))]
@@
-    L_pref(θ) = E[(Δ_r - (1/(2β))) ^ 2]
+    L_pref(theta) = E[(Delta_r - (1/(2*beta)))^2]
@@
-    L_pref(θ) = E[(Δ_r - Δ_gtr) ^ 2]
+    L_pref(theta) = E[(Delta_r - Delta_gtr)^2]
@@
-    - σ is the sigmoid function
-    - β is the reference_policy_kl_penalty
+    - sigmoid is the logistic function
+    - beta is the reference_policy_kl_penalty
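For readers of this excerpt, Delta_r and Delta_gtr are not defined above; the standard DPO-style definitions (an assumption about the code's exact variables, with eta denoting gt_reward_scale) are:

\Delta_r = \bigl(\log \pi_\theta(y_w \mid x) - \log \pi_{\mathrm{ref}}(y_w \mid x)\bigr) - \bigl(\log \pi_\theta(y_l \mid x) - \log \pi_{\mathrm{ref}}(y_l \mid x)\bigr), \qquad \Delta_{gtr} = \eta \bigl(r(x, y_w) - r(x, y_l)\bigr)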

805-807: Remove/adjust outdated TODO.

Comment refers to inheriting from PreferenceLoss, but DPOLossFn now inherits from LossFunction.

Replace with a current action item or drop the comment.

tests/unit/algorithms/test_loss_functions.py (4)

348-380: Replace unused metrics_dict with underscore.

The metrics_dict variable is unpacked but never used. Replace it with _ to indicate it's intentionally unused.

Apply this diff:

-    loss, metrics_dict = loss_fn(
+    loss, _ = loss_fn(

382-416: Replace unused metrics_dict with underscore.

The metrics_dict variable is unpacked but never used. Replace it with _ to indicate it's intentionally unused.

Apply this diff:

-    loss, metrics_dict = loss_fn(
+    loss, _ = loss_fn(

418-452: Replace unused metrics_dict with underscore.

The metrics_dict variable is unpacked but never used. Replace it with _ to indicate it's intentionally unused.

Apply this diff:

-    loss, metrics_dict = loss_fn(
+    loss, _ = loss_fn(

454-488: Replace unused metrics_dict with underscore.

The metrics_dict variable is unpacked but never used. Replace it with _ to indicate it's intentionally unused.

Apply this diff:

-    loss, metrics_dict = loss_fn(
+    loss, _ = loss_fn(
docs/guides/dpo.md (1)

49-49: Clarify reward field requirement and use consistent capitalization.

The comment states "Optional, float" followed by "required for rpo", which may confuse readers. Consider rephrasing to clarify that the field is optional for DPO/IPO but required when using RPO variants. Also, use "RPO" (uppercase) for consistency with other acronym usage in the documentation.

Consider updating the comments to:

-      "reward": 10.0, // Optional, float - The ground truth reward of the completion (required for rpo)
+      "reward": 10.0, // Optional, float - The ground truth reward of the completion (required for RPO variants)

Also applies to: 54-54, 87-87, 97-97

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 85eeb8d and d6f9b90.

📒 Files selected for processing (7)
  • docs/guides/dpo.md (5 hunks)
  • examples/run_dpo.py (1 hunks)
  • nemo_rl/algorithms/loss_functions.py (6 hunks)
  • nemo_rl/data/collate_fn.py (3 hunks)
  • nemo_rl/data/datasets/preference_datasets/preference_dataset.py (1 hunks)
  • nemo_rl/data/interfaces.py (1 hunks)
  • tests/unit/algorithms/test_loss_functions.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
docs/**/*.md

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

When a markdown doc under docs/**/*.md is added or renamed, update docs/index.md to include it in the appropriate section

Files:

  • docs/guides/dpo.md
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Follow the Google Python Style Guide for all Python code
Target Python 3.12+ for all Python code in NeMo-RL
Indent Python code with 4 spaces; do not use tabs
Python filenames should be snake_case (e.g., some_file.py)
Class names should be PascalCase
Function and method names should be snake_case
Local variable names should be snake_case; if starting with a number, prefix with k (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE and prefixed with G_ (e.g., G_MY_GLOBAL)
Constants should be UPPER_SNAKE_CASE
Avoid shadowing variables declared in an outer scope
Initialize all externally visible members of a class in the constructor
For public interfaces used outside a file, prefer docstrings over comments
Use comments mainly for code within a function or interfaces local to a file
Commented-out code must include a nearby comment explaining usage and why it is commented out; otherwise remove before merging
Use Google-style docstrings for classes and functions (Sphinx-parseable)
Avoid using reflection when functionality can be easily achieved without it
Limit except clauses to the smallest specific set of exceptions possible
For duck-typing via try/except, keep the try body minimal and use else for main logic
Add the NVIDIA copyright header (with current year) at the top of all Python files, excluding tests/ and test-only scripts

Files:

  • nemo_rl/data/interfaces.py
  • examples/run_dpo.py
  • nemo_rl/data/datasets/preference_datasets/preference_dataset.py
  • nemo_rl/algorithms/loss_functions.py
  • nemo_rl/data/collate_fn.py
  • tests/unit/algorithms/test_loss_functions.py
nemo_rl/**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

nemo_rl/**/*.py: Do not set non-None configuration defaults in code; YAML is the single source of truth for defaults
Access required config attributes directly (e.g., policy_cfg["precision"]) and assume presence; do not introduce hidden defaults
Express configuration optionality via TypedDict using typing.NotRequired
When adding a new config key to a TypedDict subclass, document the key’s purpose, valid values/types, and recommended default in code
For any class or function decorated with @ray.remote, add '# pragma: no cover' on the class/def line (and on remote functions)

Files:

  • nemo_rl/data/interfaces.py
  • nemo_rl/data/datasets/preference_datasets/preference_dataset.py
  • nemo_rl/algorithms/loss_functions.py
  • nemo_rl/data/collate_fn.py
🧬 Code graph analysis (2)
nemo_rl/algorithms/loss_functions.py (2)
nemo_rl/algorithms/interfaces.py (2)
  • LossFunction (28-70)
  • LossType (23-25)
nemo_rl/algorithms/utils.py (1)
  • masked_mean (134-146)
tests/unit/algorithms/test_loss_functions.py (2)
nemo_rl/algorithms/loss_functions.py (1)
  • DPOLossFn (565-869)
nemo_rl/distributed/batched_data_dict.py (1)
  • to (825-832)
🪛 Ruff (0.14.0)
nemo_rl/algorithms/loss_functions.py

596-596: Docstring contains ambiguous σ (GREEK SMALL LETTER SIGMA). Did you mean o (LATIN SMALL LETTER O)? (RUF002)

605-605: Docstring contains ambiguous σ (GREEK SMALL LETTER SIGMA). Did you mean o (LATIN SMALL LETTER O)? (RUF002, reported six times for this line)

608-608: Docstring contains ambiguous σ (GREEK SMALL LETTER SIGMA). Did you mean o (LATIN SMALL LETTER O)? (RUF002, reported six times for this line)

611-611: Docstring contains ambiguous σ (GREEK SMALL LETTER SIGMA). Did you mean o (LATIN SMALL LETTER O)? (RUF002)

tests/unit/algorithms/test_loss_functions.py

369-369: Unpacked variable metrics_dict is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


404-404: Unpacked variable metrics_dict is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


440-440: Unpacked variable metrics_dict is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


476-476: Unpacked variable metrics_dict is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: Post automodel integration comment / Comment on PR
  • GitHub Check: Post submodule check comment / Comment on PR
🔇 Additional comments (3)
nemo_rl/algorithms/loss_functions.py (1)

737-746: Ensure data["rewards"] is a torch.Tensor.

RPO branches expect tensor ops; lists will fail. After applying the collate fix, please re-verify shapes as (2*B,) or (2*B, 1) and dtypes float32.

You can sanity-check at runtime:

assert isinstance(data["rewards"], torch.Tensor)
assert data["rewards"].ndim in (1, 2)
docs/guides/dpo.md (2)

7-11: LGTM!

The "Other Objectives" section clearly introduces IPO and RPO variants with appropriate academic references.


144-145: LGTM!

The new parameter documentation is clear, comprehensive, and accurately describes the new DPO configuration options.

@terrykong requested a review from ashors1, October 17, 2025 23:24

@terrykong left a comment


thanks for the contribution @sanjana-inflection. @ashors1 could you review?


Labels

community-request, documentation (Improvements or additions to documentation)
