feat: LoRA SFT support for DTensorV2 path #1556
base: main
Conversation
45bb8b8 to fedecbc (force-pushed)
fedecbc to 3356fc4 (force-pushed)
…bug logging in llm_message_utils.py; adjust lora_dtype in dtensor_policy_worker_v2.py (Signed-off-by: ruit <ruit@nvidia.com>)
Signed-off-by: Jonas Yang <joyang@nvidia.com>
Signed-off-by: ruit <ruit@nvidia.com>
Signed-off-by: ruit <ruit@nvidia.com>
7272936 to bac01be (force-pushed)
…ks for llm and vlm recipes; remove unused sft-llama3.1-8b-1n8g-dtensor-lora configuration and related test scripts; fix tokenizer model path in unit tests (Signed-off-by: ruit <ruit@nvidia.com>)
Signed-off-by: ruit <ruit@nvidia.com>
…2; adjust return value for refit_info to only include weights (Signed-off-by: ruit <ruit@nvidia.com>)
Hi, @samodi-nv. I made a few updates on top of your original PR:
After discussing with @joyang-nv, we'd like to first merge SFT LoRA and then add LoRA support for GRPO. Could you please review this PR again?
📝 Walkthrough
Introduces LoRA (Low-Rank Adaptation) configuration and integration support to the DTensor policy worker V2 for distributed training. Includes new configuration types, selective weight streaming for LoRA-only modes, checkpoint metadata handling, and test infrastructure for LoRA-based training scenarios.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
Pre-merge checks and finishing touches: ❌ Failed checks (1 warning), ✅ Passed checks (3 passed)
Actionable comments posted: 6
🧹 Nitpick comments (4)
examples/configs/sft.yaml (1)
48-59: LGTM - LoRA configuration block is well-structured. The LoRA configuration provides sensible defaults and aligns with the LoRAConfig TypedDict. Minor note: Line 50 has a typo "precendence" → "precedence".
- target_modules: [] # match all linear modules takes precendence
+ target_modules: [] # match all linear modules takes precedence
tests/unit/models/policy/test_dtensor_worker.py (2)
122-134: Mutable default arguments should be avoided. Using mutable default arguments ([]) can lead to unexpected behavior if the lists are mutated. Although they aren't mutated here, it's best practice to use None and initialize within the function.
def update_lora_config(
    config: PolicyConfig,
    enabled: bool = True,
-   target_modules: list[str] = [],
-   exclude_modules: list[str] = [],
+   target_modules: list[str] | None = None,
+   exclude_modules: list[str] | None = None,
    match_all_linear: bool = True,
    dim: int = 32,
    alpha: int = 32,
    dropout: float = 0.0,
    dropout_position: str = "post",
    lora_A_init: str = "xavier",
    use_triton: bool = True,
):
    if enabled:
        config["dtensor_cfg"]["_v2"] = True
        config["dtensor_cfg"]["lora"].update(
            {
                "enabled": enabled,
-               "target_modules": target_modules,
-               "exclude_modules": exclude_modules,
+               "target_modules": target_modules if target_modules is not None else [],
+               "exclude_modules": exclude_modules if exclude_modules is not None else [],
                "match_all_linear": match_all_linear,
                ...
            }
        )
292-294: Broad exception handling masks root cause errors. Catching bare Exception and calling pytest.skip hides the actual failure reason, making debugging difficult. Consider catching specific exceptions or at least logging the full traceback.
except Exception as e:
-   print(f"Error during setup: {e}")
+   import traceback
+   print(f"Error during setup: {e}\n{traceback.format_exc()}")
    pytest.skip(f"Setup failed: {e}")
nemo_rl/models/policy/dtensor_policy_worker_v2.py (1)
1755-1764: Potential AttributeError when peft_config is None. On line 1758, self.peft_config.to_dict() is called when self.lora_enabled and self.peft_config is truthy. However, the conditional structure means if lora_enabled is True but peft_config is somehow None, this would fail. The current code should be safe due to the initialization logic, but consider simplifying:
refit_info = {
    "weights": state_dict_info,
    "lora_enabled": self.lora_enabled,
-   "lora_config": self.peft_config.to_dict()
-   if self.lora_enabled and self.peft_config
-   else None,
+   "lora_config": self.peft_config.to_dict() if self.peft_config else None,
    "lora_weights": lora_weight_names if self.lora_enabled else None,
}
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (8)
examples/configs/recipes/llm/sft-llama3.2-1b-1n8g-fsdp2tp1.v3.yaml (1 hunks)
examples/configs/sft.yaml (2 hunks)
nemo_rl/models/policy/__init__.py (2 hunks)
nemo_rl/models/policy/dtensor_policy_worker_v2.py (11 hunks)
nemo_rl/models/policy/lm_policy.py (1 hunks)
tests/functional/test_automodel_lora_sft.sh (1 hunks)
tests/unit/models/policy/test_dtensor_worker.py (8 hunks)
tests/unit/utils/test_automodel_checkpoint.py (4 hunks)
🧰 Additional context used
📓 Path-based instructions (7)
examples/configs/recipes/**/*.yaml
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
When adding support for a new model, create a recipe YAML under examples/configs/recipes/ in the appropriate domain subdirectory (llm, vlm, etc.)
Files:
examples/configs/recipes/llm/sft-llama3.2-1b-1n8g-fsdp2tp1.v3.yaml
examples/configs/recipes/llm/*.yaml
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Recipe YAML files should follow the naming pattern: --ng-[-modifiers][-long][.vN].yaml for LLM recipes
Files:
examples/configs/recipes/llm/sft-llama3.2-1b-1n8g-fsdp2tp1.v3.yaml
!(**/tests/**|**/test_*.py|**/test_*.sh)
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Add the NVIDIA copyright header to all Python files and shell scripts (excluding tests). The header should include the current year
Files:
examples/configs/recipes/llm/sft-llama3.2-1b-1n8g-fsdp2tp1.v3.yaml
nemo_rl/models/policy/lm_policy.py
tests/functional/test_automodel_lora_sft.sh
nemo_rl/models/policy/__init__.py
examples/configs/sft.yaml
tests/unit/utils/test_automodel_checkpoint.py
tests/unit/models/policy/test_dtensor_worker.py
nemo_rl/models/policy/dtensor_policy_worker_v2.py
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py: Conform code to Python 3.12+
Indent code with 4 spaces. Do not use tabs
Use snake_case for file names
Use PascalCase for class names
Use snake_case for function and method names
Use snake_case for local variables
Prefix variable names that start with a number with 'k' (e.g., k_99th_percentile)
Use upper snake_case with 'G' prefix for global variables (e.g., G_MY_GLOBAL)
Use upper snake_case for constants
Avoid shadowing variables declared in an outer scope
Initialize all externally visible members of a class in the constructor
Prefer docstrings over comments for interfaces that may be used outside a file
Reserve comments for code within a function or interfaces that are local to a file
If a piece of code is commented out, include a comment describing its usage and why it's commented out. Remove debug comments before merging
Use Google style docstrings for classes and functions in Python, which can be parsed by Sphinx
Avoid using reflection when functionality can be easily achieved without reflection
When using try-except blocks, limit the except clause to the smallest set of specific errors possible
When using try-except blocks for duck-typing, keep the body of the try as small as possible and use the else block for logic
YAML is the single source of truth for configuration defaults. Do not set non-None defaults in code for configuration values
For required configuration attributes, access config directly and expect presence (e.g., policy_cfg['precision']) without hidden defaults
Use typing.NotRequired to mark optional attributes in TypedDict for configuration
When adding a new config key to a TypedDict subclass, document the key's purpose, valid values/types, and recommended default, and reflect the default in exemplar YAMLs under examples/configs/*.yaml
Follow the Google Python Style Guide for Python code
Files:
nemo_rl/models/policy/lm_policy.py
nemo_rl/models/policy/__init__.py
tests/unit/utils/test_automodel_checkpoint.py
tests/unit/models/policy/test_dtensor_worker.py
nemo_rl/models/policy/dtensor_policy_worker_v2.py
nemo_rl/**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
For any source file under nemo_rl/*.py that defines a class or function decorated with @ray.remote, add a coverage pragma (# pragma: no cover) because these run in separate Ray processes
Files:
nemo_rl/models/policy/lm_policy.py
nemo_rl/models/policy/__init__.py
nemo_rl/models/policy/dtensor_policy_worker_v2.py
**/*.{py,sh}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
The NVIDIA copyright header should appear at the top of all Python files and shell scripts (excluding tests)
Files:
nemo_rl/models/policy/lm_policy.py
tests/functional/test_automodel_lora_sft.sh
nemo_rl/models/policy/__init__.py
tests/unit/utils/test_automodel_checkpoint.py
tests/unit/models/policy/test_dtensor_worker.py
nemo_rl/models/policy/dtensor_policy_worker_v2.py
**/*.sh
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.sh: Use uv run instead of python to execute scripts
Follow the Google Shell Style Guide for shell scripts
Files:
tests/functional/test_automodel_lora_sft.sh
🧠 Learnings (5)
📚 Learning: 2025-09-19T07:28:29.887Z
Learnt from: shuo-nvidia
Repo: NVIDIA-NeMo/RL PR: 1006
File: tests/test_suites/llm/distillation-qwen3-32b-to-4b-base-2n8g-fsdp2tp2-long.v1.sh:1-4
Timestamp: 2025-09-19T07:28:29.887Z
Learning: The NVIDIA-NeMo/RL project prefers to maintain consistent formatting across test scripts rather than applying individual bash hardening improvements like `set -euo pipefail` or proper quoting for sourcing files.
Applied to files:
tests/functional/test_automodel_lora_sft.sh
📚 Learning: 2025-10-12T14:46:55.513Z
Learnt from: zpqiu
Repo: NVIDIA-NeMo/RL PR: 1324
File: tests/test_suites/llm/distillation-qwen3-32b-to-1.7b-base-1n8g-megatron-tp2pp2cp2-pack.sh:16-30
Timestamp: 2025-10-12T14:46:55.513Z
Learning: In the NVIDIA-NeMo/RL repository, test scripts under tests/ follow a consistent pattern: use `cd $PROJECT_ROOT` without quotes or error handling, and pass arguments with `$@` unquoted. Maintain this consistency when adding new test scripts.
Applied to files:
tests/functional/test_automodel_lora_sft.sh
📚 Learning: 2025-11-24T17:24:41.976Z
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-11-24T17:24:41.976Z
Learning: Applies to tests/test_suites/**/*.sh : Driver shell scripts should match the YAML base name with .sh extension and invoke training entrypoint with uv run
Applied to files:
tests/functional/test_automodel_lora_sft.sh
📚 Learning: 2025-11-24T17:24:41.976Z
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-11-24T17:24:41.976Z
Learning: Applies to **/*.py : Use typing.NotRequired to mark optional attributes in TypedDict for configuration
Applied to files:
nemo_rl/models/policy/__init__.py
📚 Learning: 2025-10-30T20:50:44.126Z
Learnt from: adil-a
Repo: NVIDIA-NeMo/RL PR: 1440
File: examples/configs/sft_automodel.yaml:48-58
Timestamp: 2025-10-30T20:50:44.126Z
Learning: In DTensor configurations for MoE (Mixture of Experts) models, expert_parallel_size and data_parallel_size can be applied together without multiplying the GPU requirements. Expert Parallelism (EP) only applies to MoE layers, while Data Parallelism/FSDP applies to non-MoE layers. Therefore, configurations like expert_parallel_size: 8 and data_parallel_size: 8 are valid on an 8-GPU cluster for MoE models.
Applied to files:
examples/configs/sft.yaml
🧬 Code graph analysis (3)
tests/unit/utils/test_automodel_checkpoint.py (1)
nemo_rl/utils/automodel_checkpoint.py (2)
save_checkpoint (94-181), load_checkpoint (184-240)
tests/unit/models/policy/test_dtensor_worker.py (4)
nemo_rl/models/policy/lm_policy.py (1): shutdown (752-759)
nemo_rl/models/policy/dtensor_policy_worker_v2.py (1): shutdown (2020-2025)
nemo_rl/distributed/virtual_cluster.py (1): shutdown (477-496)
nemo_rl/models/policy/dtensor_policy_worker.py (1): shutdown (1942-1947)
nemo_rl/models/policy/dtensor_policy_worker_v2.py (2)
nemo_rl/models/policy/dtensor_policy_worker.py (1): prepare_refit_info (1729-1736)
nemo_rl/models/policy/interfaces.py (1): prepare_refit_info (157-158)
🪛 Ruff (0.14.6)
tests/unit/utils/test_automodel_checkpoint.py
478-478: Unused method argument: init_distributed
(ARG002)
tests/unit/models/policy/test_dtensor_worker.py
125-125: Do not use mutable data structures for argument defaults
Replace with None; initialize within function
(B006)
126-126: Do not use mutable data structures for argument defaults
Replace with None; initialize within function
(B006)
292-292: Do not catch blind exception: Exception
(BLE001)
672-672: Unused function argument: use_v2
(ARG001)
702-702: Unused function argument: use_v2
(ARG001)
🪛 Shellcheck (0.11.0)
tests/functional/test_automodel_lora_sft.sh
[error] 39-39: Double quote array expansions to avoid re-splitting elements.
(SC2068)
🔇 Additional comments (9)
examples/configs/recipes/llm/sft-llama3.2-1b-1n8g-fsdp2tp1.v3.yaml (1)
8-10: LoRA is not explicitly enabled in this recipe. The recipe sets lora.dim: 32 but doesn't set lora.enabled: true. Since the base sft.yaml defaults lora.enabled to false, this recipe won't actually enable LoRA training unless enabled: true is added or overridden at runtime. If this is a LoRA recipe (as suggested by the v3 suffix), consider adding:
dtensor_cfg:
  lora:
+   enabled: true
    dim: 32
tests/functional/test_automodel_lora_sft.sh (1)
1-4: Missing NVIDIA copyright header. Per coding guidelines, shell scripts (excluding tests under tests/) should include the NVIDIA copyright header. However, I see this is under tests/functional/, so this may be intentional. Please verify whether functional tests should have the header.
tests/unit/utils/test_automodel_checkpoint.py (1)
477-479: The init_distributed fixture argument is intentionally used for its side effect. The static analysis warning about unused init_distributed is a false positive. The fixture is used for its setup/teardown side effects (initializing and destroying the process group). This pattern is common in pytest.
tests/unit/models/policy/test_dtensor_worker.py (1)
672-674: The use_v2 parameter is used indirectly via _get_use_v2(request). The static analysis warning about unused use_v2 is a false positive. The parameter is accessed through request.node.callspec.params inside _get_use_v2(), which reads it from the pytest parametrization context.
nemo_rl/models/policy/dtensor_policy_worker_v2.py (5)
1763-1764: Refit info currently only returns weights, ignoring other LoRA metadata. The prepare_refit_info method builds a comprehensive refit_info dict with lora_enabled, lora_config, and lora_weights, but then only returns refit_info["weights"]. The comment on line 1763 explains this is intentional ("Lora have not fully supported..."), but this may cause issues when GRPO LoRA support is added. Consider tracking this as a follow-up item when GRPO LoRA support is implemented.
103-109: LGTM - Temporary patch for LoRA weight initialization. The TODO comment clearly indicates this is a temporary patch pending an Automodel version bump. The Xavier/Kaiming initialization logic is correct.
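For readers skimming the review, a minimal sketch of what an initializer like this typically does is shown below. It is an illustration only, not the PR's actual code: the attribute names (lora_A, lora_B) and the exact branch structure are assumptions.

```python
import math
import torch

def _patched_init_lora_weights(self, init_method: str):
    # Sketch only: initialize lora_A per the requested scheme and zero lora_B,
    # so the adapter's initial contribution to the output is zero.
    if init_method == "xavier":
        torch.nn.init.xavier_uniform_(self.lora_A.weight)
    elif init_method == "kaiming":
        torch.nn.init.kaiming_uniform_(self.lora_A.weight, a=math.sqrt(5))
    else:
        raise ValueError(f"Unknown init_method: {init_method}")
    torch.nn.init.zeros_(self.lora_B.weight)
```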
265-266: LGTM - LoRA applied after base model load. Correctly applies LoRA to linear modules after loading the base model on rank 0, before broadcasting the state dict.
290-291: LGTM - LoRA applied to FSDP-wrapped model. LoRA is correctly applied to the empty model before FSDP parallelization, ensuring consistent architecture across ranks.
1843-1857: LGTM - Filtered state dict iterator for LoRA-only broadcasting. Clean implementation that filters the state dict to only include LoRA weights when LoRA is enabled, reducing data transfer overhead during weight synchronization.
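As an illustration of the pattern (not the PR's implementation), a LoRA-only filter over a state dict can be as small as the generator below; the "lora" substring convention for adapter parameter names is an assumption.

```python
from collections.abc import Iterator

import torch

def iter_refit_weights(
    state_dict: dict[str, torch.Tensor], lora_enabled: bool
) -> Iterator[tuple[str, torch.Tensor]]:
    # When LoRA is enabled, stream only adapter weights; otherwise stream everything.
    for name, tensor in state_dict.items():
        if not lora_enabled or "lora" in name.lower():
            yield name, tensor
```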
class LoRAConfig(TypedDict):
    enabled: bool
    target_modules: NotRequired[list[str]]
    exclude_modules: NotRequired[list[str]]
    match_all_linear: NotRequired[bool]
    dim: NotRequired[int]
    alpha: NotRequired[int]
    dropout: NotRequired[float]
    dropout_position: NotRequired[Literal["pre", "post"]]
    lora_A_init: NotRequired[str]
    use_triton: NotRequired[bool]
Missing lora_dtype field in LoRAConfig TypedDict.
The sft.yaml configuration includes lora_dtype: ${policy.precision}, but this field is not defined in the LoRAConfig TypedDict. This will cause type-checking inconsistencies.
Apply this diff to add the missing field:
class LoRAConfig(TypedDict):
    enabled: bool
    target_modules: NotRequired[list[str]]
    exclude_modules: NotRequired[list[str]]
    match_all_linear: NotRequired[bool]
    dim: NotRequired[int]
    alpha: NotRequired[int]
    dropout: NotRequired[float]
    dropout_position: NotRequired[Literal["pre", "post"]]
    lora_A_init: NotRequired[str]
+   lora_dtype: NotRequired[str]
    use_triton: NotRequired[bool]
🤖 Prompt for AI Agents
In nemo_rl/models/policy/__init__.py around lines 24 to 35, the LoRAConfig
TypedDict is missing the lora_dtype field referenced in sft.yaml; add a
NotRequired[literal union or type alias] entry named lora_dtype (e.g.
NotRequired[str] or NotRequired[torch.dtype/str matching project conventions])
to the TypedDict so type-checkers recognize the configuration key, keeping
optionality consistent with other NotRequired fields.
I think we need to fix this.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Find and examine sft.yaml to verify lora_dtype usage
fd -t f "sft.yaml" --exec cat {}
Length of output: 7258
@joyang-nv Confirmed! I verified that lora_dtype: ${policy.precision} is indeed used in examples/configs/sft.yaml (line 60) but is missing from the LoRAConfig TypedDict.
Here's the fix to add the missing field:
class LoRAConfig(TypedDict):
    enabled: bool
    target_modules: NotRequired[list[str]]
    exclude_modules: NotRequired[list[str]]
    match_all_linear: NotRequired[bool]
    dim: NotRequired[int]
    alpha: NotRequired[int]
    dropout: NotRequired[float]
    dropout_position: NotRequired[Literal["pre", "post"]]
    lora_A_init: NotRequired[str]
    lora_dtype: NotRequired[str]
    use_triton: NotRequired[bool]
This will ensure type consistency with the configuration files.
🧬 Code Graph Analysis Results
- File: nemo_rl/models/generation/interfaces.py, lines 118-131
class GenerationConfig(TypedDict):
    """Configuration for generation."""
    backend: str
    max_new_tokens: int
    temperature: float
    top_p: float
    top_k: int | None
    model_name: NotRequired[str]  # Not Required b/c GRPO writes this
    stop_token_ids: list[int] | None
    stop_strings: list[str] | None
    colocated: NotRequired[ColocationConfig]
    # This isn't meant to be passed by the user, but is populated by nemo_rl.models.generation.__init__.configure_generation_config
    _pad_token_id: NotRequired[int]
- File: nemo_rl/models/policy/megatron_policy_worker.py, lines 260-272
def freeze_moe_router(megatron_model):
    if not isinstance(megatron_model, list):
        megatron_model = [megatron_model]
    for model_module in megatron_model:
        # Handle both wrapped (Float16Module) and unwrapped models
        if isinstance(model_module, Float16Module):
            model_module = model_module.module
        # Handle VLM models
        if hasattr(model_module, "language_model"):
            model_module = model_module.language_model
        for layer in model_module.decoder.layers:
            if hasattr(layer, "mlp") and hasattr(layer.mlp, "router"):
                layer.mlp.router.weight.requires_grad = False
could you make all of these required if they are required when enabled=True? you can get the type checker to respect notrequired if false and required if true by doing something like this:
https://github.com/NVIDIA-NeMo/RL/blob/main/nemo_rl/models/policy/__init__.py#L191
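A sketch of one way to express that with a tagged union follows; the class names here are invented for illustration and may not match the pattern at the linked line exactly.

```python
from typing import Literal, TypedDict, Union

class LoRADisabledConfig(TypedDict):
    enabled: Literal[False]

class LoRAEnabledConfig(TypedDict):
    enabled: Literal[True]
    target_modules: list[str]
    exclude_modules: list[str]
    match_all_linear: bool
    dim: int
    alpha: int
    dropout: float
    dropout_position: Literal["pre", "post"]
    lora_A_init: str
    lora_dtype: str
    use_triton: bool

# Type checkers that understand tagged unions narrow on `enabled`, so the
# remaining keys are only required when enabled is True.
LoRAConfig = Union[LoRADisabledConfig, LoRAEnabledConfig]
```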
Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!
# patch the init_lora_weights method to use the xavier initialization
_lora_mod.LinearLoRA.init_lora_weights = _patched_init_lora_weights
Global monkey-patch affects all LoRA instances in the process.
Patching _lora_mod.LinearLoRA.init_lora_weights globally affects all LoRA instances across the entire process, not just this worker. This could cause issues if multiple workers or other code paths rely on the original initialization behavior.
Consider applying the patch in a more scoped manner, or add a guard to prevent re-patching.
+# Guard to prevent re-patching in multi-worker scenarios
+_LORA_INIT_PATCHED = False
+
# TODO: @ruit remove this once the bump Automodel to 2d20e33a19d5e53a271b1403b507475e68ad14dc
def _patched_init_lora_weights(self, init_method: str):
...
...
- # patch the init_lora_weights method to use the xavier initialization
- _lora_mod.LinearLoRA.init_lora_weights = _patched_init_lora_weights
+ # patch the init_lora_weights method to use the xavier initialization
+ global _LORA_INIT_PATCHED
+ if not _LORA_INIT_PATCHED:
+     _lora_mod.LinearLoRA.init_lora_weights = _patched_init_lora_weights
+     _LORA_INIT_PATCHED = True

assert config.get("lora", {}).get("enabled", False) is False, (
    "LoRA is not supported for DTensorPolicyWorker V1"
)
Bug: Incorrect config path for LoRA check.
The assertion checks config.get("lora", ...) but LoRA configuration is nested under dtensor_cfg.lora, not at the top level. This means the guard will never trigger, allowing LoRA to be incorrectly used with V1.
Apply this diff to fix the config path:
- assert config.get("lora", {}).get("enabled", False) is False, (
+ assert config.get("dtensor_cfg", {}).get("lora", {}).get("enabled", False) is False, (
"LoRA is not supported for DTensorPolicyWorker V1"
)🤖 Prompt for AI Agents
In nemo_rl/models/policy/lm_policy.py around lines 117 to 119, the assertion is
checking the wrong config path — it reads config.get("lora", ...) but LoRA lives
under config["dtensor_cfg"]["lora"]; update the assertion to check
config.get("dtensor_cfg", {}).get("lora", {}).get("enabled", False) is False so
the guard correctly prevents LoRA from being enabled for DTensorPolicyWorker V1.
checkpointing.enabled=true \
checkpointing.save_period=3 \
checkpointing.checkpoint_dir=/tmp/lora_sft_checkpoints \
$@ \
Quote the array expansion to prevent word splitting.
Static analysis correctly identifies that $@ should be quoted as "$@" to avoid re-splitting elements containing spaces.
- $@ \
+ "$@" \📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
- $@ \
+ "$@" \
🧰 Tools
🪛 Shellcheck (0.11.0)
[error] 39-39: Double quote array expansions to avoid re-splitting elements.
(SC2068)
🤖 Prompt for AI Agents
In tests/functional/test_automodel_lora_sft.sh around line 39, the unquoted
array expansion `$@` may cause word splitting; change it to `"$@"` in the
command invocation so each original argument is preserved as a single parameter
(i.e., replace `$@` with `"$@"`).
Oh, I think this is a good catch!
@joyang-nv, thank you for confirming! I noticed something interesting though - there's a learning in the knowledge base (ID: ccad141e-7829-44b1-98d0-ec79c01decb2) that states:
"In the NVIDIA-NeMo/RL repository, test scripts under tests/ follow a consistent pattern: use
cd $PROJECT_ROOTwithout quotes or error handling, and pass arguments with$@unquoted."
This learning conflicts with the current suggestion to quote $@. Since you've confirmed that quoting is the correct approach, should I remove this obsolete learning from the knowledge base? This will help ensure future reviews recommend the proper shell scripting practice of quoting array expansions.
🧠 Learnings used
Learnt from: zpqiu
Repo: NVIDIA-NeMo/RL PR: 1324
File: tests/test_suites/llm/distillation-qwen3-32b-to-1.7b-base-1n8g-megatron-tp2pp2cp2-pack.sh:16-30
Timestamp: 2025-10-12T14:46:55.513Z
Learning: In the NVIDIA-NeMo/RL repository, test scripts under tests/ follow a consistent pattern: use `cd $PROJECT_ROOT` without quotes or error handling, and pass arguments with `$@` unquoted. Maintain this consistency when adding new test scripts.
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-11-24T17:24:41.976Z
Learning: Applies to tests/test_suites/**/*.sh : Driver shell scripts should match the YAML base name with .sh extension and invoke training entrypoint with uv run
@pytest.fixture
def mock_distributed():
    """Mock torch.distributed calls for non-distributed tests."""
    with (
        patch("torch.distributed.is_initialized", return_value=False),
        patch("torch.distributed.get_rank", return_value=0),
    ):
        yield
    torch.distributed.destroy_process_group()
Bug: mock_distributed fixture calls destroy_process_group after mocking is_initialized.
The fixture patches is_initialized to return False, but then unconditionally calls destroy_process_group() after yield. If distributed was never actually initialized (which is the case when mocking), this will raise an error.
Apply this diff to fix the cleanup logic:
@pytest.fixture
def mock_distributed():
    """Mock torch.distributed calls for non-distributed tests."""
    with (
        patch("torch.distributed.is_initialized", return_value=False),
        patch("torch.distributed.get_rank", return_value=0),
    ):
        yield
-   torch.distributed.destroy_process_group()
+   # Don't call destroy_process_group - we were mocking, not actually initializing
🤖 Prompt for AI Agents
In tests/unit/utils/test_automodel_checkpoint.py around lines 92 to 101, the
teardown unconditionally calls torch.distributed.destroy_process_group() which
can raise if the process group isn’t initialized; update the fixture to call
torch.distributed.destroy_process_group() only if
torch.distributed.is_initialized() returns True (i.e., guard the cleanup with an
if check) so destruction runs only when a process group actually exists.
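Concretely, the guarded cleanup described in that prompt could look like the following sketch (the alternative to deleting the call outright):

```python
from unittest.mock import patch

import pytest
import torch

@pytest.fixture
def mock_distributed():
    """Mock torch.distributed calls for non-distributed tests."""
    with (
        patch("torch.distributed.is_initialized", return_value=False),
        patch("torch.distributed.get_rank", return_value=0),
    ):
        yield
    # Outside the patch context the real is_initialized() is consulted, so
    # teardown only runs when a process group actually exists.
    if torch.distributed.is_initialized():
        torch.distributed.destroy_process_group()
```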
initial_distribute = torch.distributed.is_initialized()
print(f"Initial distribute: {initial_distribute}")
🛠️ Refactor suggestion | 🟠 Major
Remove debug print statement.
This appears to be leftover debug code that should be removed before merging.
- initial_distribute = torch.distributed.is_initialized()
- print(f"Initial distribute: {initial_distribute}")📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
- initial_distribute = torch.distributed.is_initialized()
- print(f"Initial distribute: {initial_distribute}")
🤖 Prompt for AI Agents
In tests/unit/utils/test_automodel_checkpoint.py around lines 520-521 there is a
leftover debug print statement printing the initial distributed status; remove
the print(f"Initial distribute: {initial_distribute}") line so the test file
contains only the assignment to initial_distribute (or remove both lines if the
variable is unused) and ensure no debug prints remain.
Only P1. :)
@joyang-nv Got it! Thanks for the clarification on the priority. The debug print statement should still be removed before merging. 👍
Hi, @ffrujeri. Would you please help review this, especially the tests? I made significant changes to …
Signed-off-by: ruit <ruit@nvidia.com>
joyang-nv left a comment
Can you also add a test case similar to 3rdparty/Automodel-workspace/Automodel/tests/unit_tests/_peft/test_lora.py::test_forward_output_consistency()?
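For reference, the kind of check being asked for can be sketched as below. This is not the Automodel test itself; the LinearLoRA class here is an inline stand-in, and its constructor signature is an assumption.

```python
import torch
import torch.nn as nn

def test_lora_forward_output_consistency():
    # With lora_B initialized to zero, a freshly wrapped layer must reproduce
    # the base layer's outputs exactly.
    torch.manual_seed(0)
    base = nn.Linear(16, 16, bias=False)
    x = torch.randn(4, 16)

    class LinearLoRA(nn.Module):  # stand-in for the real Automodel class
        def __init__(self, base_layer: nn.Linear, dim: int = 8, alpha: int = 8):
            super().__init__()
            self.base_layer = base_layer
            self.lora_A = nn.Linear(base_layer.in_features, dim, bias=False)
            self.lora_B = nn.Linear(dim, base_layer.out_features, bias=False)
            nn.init.zeros_(self.lora_B.weight)
            self.scale = alpha / dim

        def forward(self, inputs: torch.Tensor) -> torch.Tensor:
            return self.base_layer(inputs) + self.scale * self.lora_B(self.lora_A(inputs))

    wrapped = LinearLoRA(base)
    torch.testing.assert_close(wrapped(x), base(x))
```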
class LoRAConfig(TypedDict):
    enabled: bool
    target_modules: NotRequired[list[str]]
    exclude_modules: NotRequired[list[str]]
    match_all_linear: NotRequired[bool]
    dim: NotRequired[int]
    alpha: NotRequired[int]
    dropout: NotRequired[float]
    dropout_position: NotRequired[Literal["pre", "post"]]
    lora_A_init: NotRequired[str]
    use_triton: NotRequired[bool]
I think we need to fix this.
checkpointing.enabled=true \
checkpointing.save_period=3 \
checkpointing.checkpoint_dir=/tmp/lora_sft_checkpoints \
$@ \
Oh, I think this is a good catch!
initial_distribute = torch.distributed.is_initialized()
print(f"Initial distribute: {initial_distribute}")
Only P1. :)
)

if self.peft_config is not None:
    apply_lora_to_linear_modules(model, self.peft_config)
Looks like this function is called twice (on this line and line 291). Is that expected?
class LoRAConfig(TypedDict):
    enabled: bool
    target_modules: NotRequired[list[str]]
    exclude_modules: NotRequired[list[str]]
    match_all_linear: NotRequired[bool]
    dim: NotRequired[int]
    alpha: NotRequired[int]
    dropout: NotRequired[float]
    dropout_position: NotRequired[Literal["pre", "post"]]
    lora_A_init: NotRequired[str]
    use_triton: NotRequired[bool]
could you make all of these required if they are required when enabled=True? you can get the type checker to respect notrequired if false and required if true by doing something like this:
https://github.com/NVIDIA-NeMo/RL/blob/main/nemo_rl/models/policy/__init__.py#L191
enabled: bool
target_modules: NotRequired[list[str]]
exclude_modules: NotRequired[list[str]]
match_all_linear: NotRequired[bool]
dim: NotRequired[int]
alpha: NotRequired[int]
dropout: NotRequired[float]
dropout_position: NotRequired[Literal["pre", "post"]]
lora_A_init: NotRequired[str]
use_triton: NotRequired[bool]
could you please help document these flags?
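To help with that, here is one possible documented version of the fields. The descriptions are inferred from the YAML comments in this PR and from common LoRA conventions; treat them as suggestions rather than authoritative documentation.

```python
from typing import Literal, NotRequired, TypedDict

class LoRAConfig(TypedDict):
    enabled: bool  # Turn LoRA adapters on or off for the DTensor V2 worker.
    target_modules: NotRequired[list[str]]  # Module-name patterns to adapt; [] means rely on match_all_linear.
    exclude_modules: NotRequired[list[str]]  # Module-name patterns to skip even if otherwise matched.
    match_all_linear: NotRequired[bool]  # If True, adapt every linear module (takes precedence over target_modules).
    dim: NotRequired[int]  # LoRA rank r, the inner dimension of the A/B factors.
    alpha: NotRequired[int]  # Scaling numerator; the adapter output is typically scaled by alpha / dim.
    dropout: NotRequired[float]  # Dropout probability applied on the LoRA path.
    dropout_position: NotRequired[Literal["pre", "post"]]  # Apply dropout before or after the LoRA projection.
    lora_A_init: NotRequired[str]  # Initialization scheme for lora_A, e.g. "xavier" or "kaiming".
    use_triton: NotRequired[bool]  # Use the fused Triton LoRA kernel when available.
```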
# TODO: @ruit remove this once the bump Automodel to 2d20e33a19d5e53a271b1403b507475e68ad14dc (https://github.com/NVIDIA-NeMo/RL/issues/1586)
def _patched_init_lora_weights(self, init_method: str):
I can see this function silently getting ignored. Could you add a unit test that does a check and will fail when it's okay to remove? Something in the spirit of this:
"If this fails, that means the upstream bug has been fixed. You can close this issue: https://github.com/huggingface/transformers/issues/41190"
class LoRAConfig(TypedDict):
    enabled: bool
    target_modules: NotRequired[list[str]]
    exclude_modules: NotRequired[list[str]]
    match_all_linear: NotRequired[bool]
    dim: NotRequired[int]
    alpha: NotRequired[int]
    dropout: NotRequired[float]
    dropout_position: NotRequired[Literal["pre", "post"]]
    lora_A_init: NotRequired[str]
    use_triton: NotRequired[bool]

Issues
Addresses #833
Test with thinking machine config
Description
This PR is a work in progress to add LoRA support for the DTensor path.
Current status
Notes
Summary by CodeRabbit
New Features
Tests