build: Use dynamic engine for generate. #1502

shanmugamr1992 · 2025-11-11T00:05:07Z

What does this PR do?

Adds Megatron Core (mcore) dynamic inference engine support for generation.

https://wandb.ai/shanmugamr/grpo-dev?nw=nwusershanmugamr (plot showing vLLM and Mcore with similar performance)

Final timings:

Generation alone:
Mcore - 20 seconds
vLLM - 8 seconds

End-to-end step times:
Mcore - 47 seconds
vLLM - 37 seconds
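
For reference, the gap implied by these timings can be worked out directly. The snippet below is plain arithmetic on the four numbers reported above. The variable and function names are illustrative, and the last step assumes that generation time is included in the end-to-end step time, which the description does not state explicitly.

```python
# Timings reported in this PR description, in seconds.
generation = {"mcore": 20.0, "vllm": 8.0}
end_to_end = {"mcore": 47.0, "vllm": 37.0}


def slowdown(times: dict[str, float]) -> float:
    """Return how many times slower Mcore is than vLLM."""
    return times["mcore"] / times["vllm"]


gen_ratio = slowdown(generation)
e2e_ratio = slowdown(end_to_end)

# Assumption: generation is part of the end-to-end step, so the
# difference is the time spent outside generation for each backend.
non_generation = {k: end_to_end[k] - generation[k] for k in generation}

print(f"generation: {gen_ratio:.2f}x, end to end: {e2e_ratio:.2f}x")
print(non_generation)
```

Under that assumption, generation is 2.5x slower on Mcore, while the full step is roughly 1.27x slower.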

Issues

List issues that this PR closes (syntax):

Closes #1079

Before your PR is "Ready for review"

Pre checks:

Make sure you read and followed Contributor guidelines
Did you write any new necessary tests?
Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

...

Summary by CodeRabbit

New Features
- Introduced dynamic inference engine for improved generation performance with CUDA graph optimization support.
- Added configuration for GRPO Llama 3.2 1B Instruct model with Megatron generation backend.
Bug Fixes
- Fixed potential error in policy generation initialization.
Tests
- Added new functional tests for GRPO Megatron generation workflow.
- Enabled previously disabled generation tests.

coderabbitai · 2025-11-11T00:16:30Z

📝 Walkthrough

This PR introduces GRPO support for Llama 3.2-1B with dynamic megatron generation inference, refactors MegatronPolicyWorker to use DynamicInferenceEngine instead of StaticInferenceEngine, adds new test scripts to the GPU and nightly test suites, adds a None-check guard for policy_generation setup, and enables previously skipped megatron generation unit tests.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Configuration Addition `examples/configs/recipes/llm/grpo-llama3.2-1b-instruct-1n8g-megatron_generation.yaml` | New YAML config for GRPO with Llama-3.2-1B-Instruct, megatron generation backend, and settings for checkpointing, logging, and inference (max_new_tokens 512). |
| Core Algorithm Logic `nemo_rl/algorithms/grpo.py` | Adds None-check guard around `prepare_refit_info(state_dict_info)` call to prevent AttributeError when `policy_generation` is None. |
| Inference Engine Refactor `nemo_rl/models/policy/megatron_policy_worker.py` | Replaces StaticInferenceEngine with DynamicInferenceContext and DynamicInferenceEngine in `generate()` method; introduces per-prompt request handling, dynamic sampling parameter configuration, explicit detokenization, and CUDA graph optimization support. |
| GPU Functional Tests `tests/functional/L1_Functional_Tests_GPU.sh`, `tests/functional/grpo_megatron_generation.sh` | Extends GPU L1 test suite to include new GRPO megatron generation functional test; new test script sets up experiment environment, runs GRPO training, generates metrics, and validates token probability error < 1.05. |
| Nightly & Unit Tests `tests/test_suites/llm/grpo-llama3.2-1b-instruct-1n8g-megatron_generation.sh`, `tests/test_suites/nightly.txt`, `tests/unit/models/policy/test_megatron_worker.py` | Adds new nightly test script for GRPO megatron generation with config and metric validation; registers test in nightly suite; removes pytest skip decorator to enable megatron worker unit tests. |
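
The None-check guard described for `nemo_rl/algorithms/grpo.py` can be illustrated with a minimal sketch. Only the names `policy_generation`, `prepare_refit_info`, and `state_dict_info` come from the summary; `FakeGeneration` and `setup_refit` are hypothetical stand-ins, and the actual code in the PR may be structured differently.

```python
from typing import Optional


class FakeGeneration:
    """Hypothetical stand-in for a separate generation backend."""

    def __init__(self) -> None:
        self.refit_info: Optional[dict] = None

    def prepare_refit_info(self, state_dict_info: dict) -> None:
        self.refit_info = state_dict_info


def setup_refit(
    policy_generation: Optional[FakeGeneration], state_dict_info: dict
) -> bool:
    """Call prepare_refit_info only when a generation backend exists."""
    if policy_generation is not None:
        policy_generation.prepare_refit_info(state_dict_info)
        return True
    # Without the guard, calling a method on None raises AttributeError.
    return False


print(setup_refit(None, {"layer": "shape"}))
print(setup_refit(FakeGeneration(), {"layer": "shape"}))
```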

Sequence Diagram(s)

sequenceDiagram
    participant caller as Caller
    participant old as Old: StaticInferenceEngine
    participant new as New: DynamicInferenceEngine
    participant model as Model
    participant engine as Engine

    rect rgb(200, 220, 250)
    Note over old,engine: Previous Flow (Static)
    caller->>old: generate(prompts, ...)
    old->>old: run_mcore_engine()
    old->>engine: forward pass
    engine-->>old: output tokens + logprobs
    old-->>caller: return results
    end

    rect rgb(220, 250, 200)
    Note over new,engine: New Flow (Dynamic)
    caller->>new: generate(prompts, ...)
    new->>new: DynamicInferenceContext setup
    new->>new: GPTInferenceWrapper init
    new->>model: prep_model_for_inference()
    model->>model: enable CUDA graphs (local)
    new->>new: compute tokens_to_generate
    new->>new: configure SamplingParams (temp, top_k, top_p)
    loop per-prompt
        new->>engine: create request with params
        engine->>engine: dynamic batching
    end
    new->>engine: collect results by request_id
    new->>new: detokenize per-prompt
    new->>new: sort by request_id
    new-->>caller: return ordered results
    end
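
The per-prompt request and ordering steps in the new flow can be sketched with a toy model. This is not the Megatron Core API: `ToyRequest`, `ToyEngine`, and `generate` are hypothetical, and the engine fakes out-of-order completion by finishing the shortest generations first. It only shows why results are collected by request id and sorted before being returned.

```python
from dataclasses import dataclass


@dataclass
class ToyRequest:
    request_id: int
    prompt_tokens: list[int]
    num_tokens_to_generate: int


class ToyEngine:
    """Hypothetical engine that finishes shorter generations first."""

    def __init__(self) -> None:
        self.pending: list[ToyRequest] = []

    def add_request(self, request: ToyRequest) -> None:
        self.pending.append(request)

    def run(self) -> list[tuple[int, list[int]]]:
        # Completion order differs from submission order, as it can
        # when requests are batched dynamically.
        done = sorted(self.pending, key=lambda r: r.num_tokens_to_generate)
        return [
            (r.request_id, r.prompt_tokens + [0] * r.num_tokens_to_generate)
            for r in done
        ]


def generate(prompts: list[list[int]], max_total_len: int) -> list[list[int]]:
    engine = ToyEngine()
    # One request per prompt, each with its own generation budget.
    for request_id, tokens in enumerate(prompts):
        budget = max_total_len - len(tokens)
        engine.add_request(ToyRequest(request_id, tokens, budget))
    finished = engine.run()
    # Restore the caller's prompt order by sorting on request id.
    finished.sort(key=lambda item: item[0])
    return [tokens for _, tokens in finished]


outputs = generate([[1], [2, 3, 4], [5, 6]], max_total_len=4)
print(outputs)
```

In this toy run the second prompt completes first because it has the smallest budget, yet the returned list still matches the input order.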

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

MegatronPolicyWorker.generate() refactor (nemo_rl/models/policy/megatron_policy_worker.py): Significant internal logic change replacing inference engine approach; requires careful verification of per-prompt batching, sampling parameter handling, result ordering, and CUDA graph integration.
Test script logic (tests/functional/grpo_megatron_generation.sh and corresponding test suite script): Verify environment setup, metric validation thresholds, and log parsing correctness.
Policy generation None-check (nemo_rl/algorithms/grpo.py): Minor defensive fix; straightforward guard condition.
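
When reviewing the metric validation in the test scripts, it may help to have the kind of check in mind. The sketch below assumes metrics are stored as JSON keyed by metric name and step, and that the check takes the maximum over steps. The metric name `token_prob_error`, the JSON layout, and the sample values are all illustrative placeholders, not taken from the PR; only the 1.05 threshold comes from the summary above.

```python
import json


def max_metric(metrics_json: str, name: str) -> float:
    """Return the largest per-step value recorded for a metric."""
    per_step = json.loads(metrics_json)[name]  # maps step -> value
    return max(per_step.values())


# Placeholder values for illustration only, not results from this PR.
sample = '{"token_prob_error": {"1": 1.01, "2": 1.03}}'
THRESHOLD = 1.05  # threshold named in the walkthrough

passed = max_metric(sample, "token_prob_error") < THRESHOLD
print(passed)
```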

Possibly related PRs

cp: fix: Fixes to make Megatron backend match dtensor (1389) into r0.4.0 #1454: Modifies MegatronPolicyWorker __init__ and adds tensor-parallel helpers to the same class being refactored in this PR's generate() method.
feat: FP8 Training in Megatron Path #971: Updates MegatronPolicyWorker with FP8 config handling and padding logic, affecting the same inference code path being refactored.
cp: feat: add Megatron support for on-policy distillation (1324) into r0.4.0 #1398: Implements get_topk_logits and related imports in MegatronPolicyWorker, extending the inference/logits handling in the same class.

Suggested labels

CI:L1, Run CICD

Suggested reviewers

parthchadha
terrykong
guyueh1

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Test Results For Major Changes | ⚠️ Warning | PR implements major inference engine refactoring and performance optimizations but PR description lacks test results, performance metrics, or convergence analysis documentation. | Document performance benchmarks, test execution results, and convergence/regression analysis validating the StaticInferenceEngine to DynamicInferenceEngine refactoring. |

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title 'build: Use dynamic engine for generate' directly reflects the main change in the PR: replacing the hard-coded StaticInferenceEngine with DynamicInferenceEngine in the megatron_policy_worker.py file, which is the core technical change across the PR. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |

coderabbitai

Actionable comments posted: 3

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6a035bc and 8a0f86b.

📒 Files selected for processing (8)

examples/configs/recipes/llm/grpo-llama3.2-1b-instruct-1n8g-megatron_generation.yaml (1 hunks)
nemo_rl/algorithms/grpo.py (1 hunks)
nemo_rl/models/policy/megatron_policy_worker.py (2 hunks)
tests/functional/L1_Functional_Tests_GPU.sh (1 hunks)
tests/functional/grpo_megatron_generation.sh (1 hunks)
tests/test_suites/llm/grpo-llama3.2-1b-instruct-1n8g-megatron_generation.sh (1 hunks)
tests/test_suites/nightly.txt (1 hunks)
tests/unit/models/policy/test_megatron_worker.py (0 hunks)

💤 Files with no reviewable changes (1)

tests/unit/models/policy/test_megatron_worker.py

🧰 Additional context used

📓 Path-based instructions (10)

**/*.sh