Torchtitan changes to integrate into Verl #2333
Conversation
```python
) -> int:
    # Skip initialization if already initialized
    if torch.distributed.is_initialized():
        return torch.distributed.get_world_size()
```
If we go down this path, a lot of the settings in this function / config won't take effect. Shall we add a warning for users?
Makes sense, will add a warning. But I think this also means the user has initialized the distributed env somewhere else with their own settings.
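For reference, a minimal sketch of what that warning could look like (the function name and the `comm_config` argument are illustrative, not torchtitan's exact API):

```python
import logging

import torch

logger = logging.getLogger(__name__)


def maybe_init_distributed(comm_config) -> int:
    # If the caller (e.g. verl) already initialized the process group with
    # its own settings, skip torchtitan's setup and tell the user that the
    # settings below will have no effect.
    if torch.distributed.is_initialized():
        logger.warning(
            "torch.distributed is already initialized; skipping torchtitan's "
            "distributed init. Settings in %s will not take effect.",
            comm_config,
        )
        return torch.distributed.get_world_size()
    torch.distributed.init_process_group(backend="nccl")
    return torch.distributed.get_world_size()
```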
```diff
- # linear warmup
- # 0-indexed step, hence + 1 adjustments
- current_step += 1
+ # linear warmup (0-indexed to match FSDP/HuggingFace)
```
We spent a lot of time converging to the current behavior.
This change will likely break unit tests and user code (a silent change).
Let's be careful about this change.
cc @wwwjn
#1284 has more context on why we converged to the current behavior (red line); see that PR's description. Can you explain, in a similar graph, how removing this current_step += 1 would affect the shape of the learning rate schedule?
Thanks! Turns out I don't need this change and can still achieve numeric parity; it was added during my debugging journey.
Sorry, I take that back: the change seems necessary. Only with this change does the LR schedule exactly match verl's FSDP implementation, and therefore the loss exactly matches. Otherwise torchtitan's LR is always one step ahead of the FSDP LR (FSDP uses a 0-based index but titan uses a 1-based index).
Wondering what the reason was for this +1 adjustment in the first place?
Synced with @wwwjn offline: moving from 1-indexed to 0-indexed likely won't change the shape of the LR schedule, it only shifts it by 1 step. But I will do more thorough testing to confirm the new LR schedule.
I understand that an LR schedule mismatch would cause a loss curve mismatch, and that aligning with verl's FSDP would show stronger numerical alignment.
What I don't understand is why you think verl is the gold standard. Could we either
- not change either side and bear with this difference, or
- change verl's LR scheduling?
If not, could you give me an argument that it is us who needs to change? Thanks!
Yes, this is a valid question; apologies for not doing enough research on this. I think we should aim for the exact same LR schedule and loss (although the loss difference is not large; it's still the same decreasing trend).
The pink curve is with titan's original LR schedule.
I asked an agent to do some research (https://docs.google.com/document/d/1YiFUvIa_JqTYpBd2Xj7ReH3Bw6wS07nKldycBX--uVE/edit?usp=sharing) and most frameworks use a 0-based index (current_step starts from 0). It would be great if titan also switched to 0-based, if it doesn't cause other issues. But I am also fine if we want to preserve the difference.
Having learning rate 0 at the first update sounds ... pointless?
We don't need to align with other frameworks if we don't understand their rationale. +1 that we don't want any step with lr = 0 (it basically wastes the forward and backward computation). Would it be possible to make the "offset by 1" adjustment in verl?
verl has this init_lr_ratio, so its LR schedule is 0-indexed with a configurable initial value. If init_lr_ratio is not set (the default case), then the first step's LR will be 0.
Megatron and DeepSpeed also adopt a similar 0-based index with a configurable min_lr (defaulting to 0).
We could also keep the LR schedules different, as an offset of 1 step likely won't affect model quality? wdyt?
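To make the two conventions concrete, here is a small illustrative sketch (not torchtitan's or verl's actual scheduler code) of the linear warmup factor under each indexing, with verl's configurable init_lr_ratio included:

```python
def warmup_factor_1_indexed(current_step: int, warmup_steps: int) -> float:
    # torchtitan today: the 0-indexed step is shifted by +1, so the very
    # first update already uses 1/warmup_steps of the max LR.
    current_step += 1
    return min(current_step / warmup_steps, 1.0)


def warmup_factor_0_indexed(
    current_step: int, warmup_steps: int, init_lr_ratio: float = 0.0
) -> float:
    # verl / Megatron / DeepSpeed style: 0-indexed with a configurable
    # starting ratio; with the default init_lr_ratio=0, step 0 runs at
    # LR 0 (the "wasted" forward/backward discussed above).
    if current_step >= warmup_steps:
        return 1.0
    return init_lr_ratio + (1.0 - init_lr_ratio) * current_step / warmup_steps


# With warmup_steps=8:
#   1-indexed: 1/8, 2/8, ..., 8/8 at steps 0..7
#   0-indexed: 0,   1/8, ..., 7/8 at steps 0..7 -- same shape, shifted by one
```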
torchtitan/train.py (Outdated)
```python
    # The returned loss here is local SUM loss / global_valid_tokens
    return loss


def forward_step(
```
We might eventually need this as we build RL ourselves, but for now, can we put this in the verl engine?
torchtitan/models/attention.py (Outdated)
```python
    return (position_diff != 1).cumsum(-1)  # [batch, seq]


def create_sdpa_document_causal_mask(positions: torch.Tensor) -> torch.Tensor:
```
Yes, @fegin suggested supporting the document mask only for flex and varlen attention; I will update the PR.
+1
```python
    ) -> AttentionMasksType:
        match self.model_args.attn_type:
            case "sdpa":
                assert extra_inputs is not None and "positions" in extra_inputs
```
positions won't always be available -- wouldn't this break torchtitan in general?
But anyway, we probably don't want to do SDPA with positions; see the other comment.
torchtitan/models/attention.py (Outdated)
```python
) -> _mask_mod_signature:
    """Creates a document mask from position_ids for flex attention.

    Detects document boundaries where position_ids reset (diff != 1).
```
Let's be careful here.
Previously in torchtitan, position_ids was used to index the RoPE cache (for CP to work with a sharded sequence and a replicated RoPE cache), so it was decoupled from the attention block mask decision (which is determined by eos_id in the inputs today).
From what you are trying to do, it sounds like verl / HF couple position_ids with mask creation? IIUC it only works for block-causal masking, assuming the block boundary is given by position_ids. In particular, if the masking is more complicated than block-causal, e.g. in multimodal training where attention is bidirectional within an image, it can't be expressed using position_ids?
So now there will be two ways to create block-causal masking, one by eos_id and the other by position_ids, is that correct?
cc @fegin do you think it's fine to create another model_args.attn_mask_type for this?
Verl's FSDP engine uses a transformers model, whose forward takes in position_ids and generates a block-causal mask based on them. Therefore, apart from get_document_mask_mod, which generates a block_mask from eos_id, I added another get_document_mask_mod_from_positions.
Should we make this generic over a <separator>, where the separator can be eos_id or position_ids?
@drisspg position_ids is not a separator. You need a different function to translate it to a mask mod.
Yeah, fair. One could also probably just have a base doc_mask(, segments) that does the check for equal segments, but I don't think it adds too much value, since you still need the find_packed_sequence_indices.
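For concreteness, here is a hedged reconstruction of a positions-based document mask mod, pieced together from the diff fragments above (the PR's actual implementation may differ):

```python
import torch


def get_document_mask_mod_from_positions(positions: torch.Tensor):
    # positions: [batch, seq]. In a packed batch, a new document starts
    # wherever the position values stop increasing by exactly 1.
    position_diff = torch.diff(positions, dim=-1, prepend=positions[:, :1] - 1)
    sequence_indices = (position_diff != 1).cumsum(-1)  # [batch, seq]

    def document_causal_mask(b, h, q_idx, kv_idx):
        # Attend only within the same document, and only causally.
        same_doc = sequence_indices[b, q_idx] == sequence_indices[b, kv_idx]
        return same_doc & (q_idx >= kv_idx)

    return document_causal_mask
```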
```python
3.0 / 8.0,  # Step 7: 3/8 of max LR
2.0 / 8.0,  # Step 8: 1/4 of max LR
1.0 / 8.0,  # Step 9: 1/8 of max LR
0.0,        # Step 0: 0% of max LR (warmup start)
```
+1, this step is not updating weights at all; the forward and backward computation are wasted.
```python
                input_ids=input_batch, eos_id=tokenizer.eos_id
            )
        )
    case "position_block_causal":
```
So in "position_block_causal", the EOS id is not used to create the block mask, and the user might accidentally use both without knowing which one is actually taking effect?
If they both refer to "block_causal", one possible approach is to let get_document_mask_mod take both the EOS id and positions, and add a warning when both are not None, letting the user know which field we are using to create the "block_causal" mask.
I'm curious to know more about the actual data format (the Trajectory) that the generator passes to the trainer. For the following fields:
- Prompt + Completion: does this field have EOS in it? If so, can we use the EOS id instead of position_id?
And for the trajectory, do they do padding or packing? If padding, would the position_id field also be padded? If yes, does the mask_mod algorithm handle the padded field correctly?
The user would have to specify either block_causal (with eos id) or position_block_causal (using positions); they can't use both at the same time.
> Prompt + Completion: does this field have EOS in it? If so, can we use the EOS id instead of position_id?

I checked: the input prompt + completion does have EOS in it, but it doesn't correspond to positions. It looks like there is one more newline token after the EOS for each sample.
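As a sketch of the single-entry-point alternative suggested earlier in this thread (illustrative only; the PR ultimately keeps block_causal and position_block_causal as separate attn mask types, and get_document_mask_mod_from_eos is a hypothetical name):

```python
import logging
from typing import Optional

import torch

logger = logging.getLogger(__name__)


def get_document_mask_mod(
    input_ids: Optional[torch.Tensor] = None,
    eos_id: Optional[int] = None,
    positions: Optional[torch.Tensor] = None,
):
    # Warn when both sources of boundary information are supplied, so the
    # user knows which one actually takes effect.
    if positions is not None and eos_id is not None:
        logger.warning(
            "Both eos_id and positions were provided; building the "
            "block-causal mask from positions and ignoring eos_id."
        )
    if positions is not None:
        return get_document_mask_mod_from_positions(positions)
    assert input_ids is not None and eos_id is not None
    return get_document_mask_mod_from_eos(input_ids, eos_id)  # hypothetical
```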
torchtitan/models/attention.py (Outdated)
```python
    cumulative_mask = torch.cumsum(torch.where(eos_mask, 1, 0), dim=1)
    sequence_indices = torch.zeros_like(cumulative_mask, dtype=torch.int32)
    sequence_indices[:, 1:] = cumulative_mask[:, :-1]
    if positions is not None:
```
Would this algorithm work for inference-time continuous batching, with mixed prefill and decode requests? E.g., the positions could be [4, 5, 6, 0, 1, 2, 3, 7, 8, 9, 10].
And does it handle padding correctly?
Yes, it detects sequence boundaries through position_diff != 1, so it will give [0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2].
It will not handle padding though (each padding token is marked as a separate sequence), as we don't expect any padding and use packing.
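A quick check of that claim with the boundary-detection rule above (plain PyTorch, assuming the position_diff != 1 convention):

```python
import torch

positions = torch.tensor([[4, 5, 6, 0, 1, 2, 3, 7, 8, 9, 10]])
# Prepend positions[:, :1] - 1 so the first token never counts as a boundary.
diff = torch.diff(positions, dim=-1, prepend=positions[:, :1] - 1)
print((diff != 1).cumsum(-1))
# tensor([[0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]])
```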
```python
        tokenizer=self.tokenizer,
        job_config=job_config,
    )
    if self.train_spec.build_dataloader_fn is not None
```
According to https://github.com/pytorch/torchtitan/blob/main/torchtitan/protocols/train_spec.py#L51, it can't be None.
I'm OK with the type change, but you'll need to assert not None in torchtitan before it's used?
Since we are using verl's dataloader, I don't want to initialize titan's dataloader (otherwise I'd hit an error loading the c4 dataset). I did a hack here: https://github.com/verl-project/verl/pull/5051/changes#diff-f658afe18d14b480f4067f7544fbdb0ef6962a20ef3b5f5d0c709ae31e91809dR101
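For reference, a minimal sketch of the guard being discussed, assuming build_dataloader_fn becomes Optional (the names here are illustrative, not the actual torchtitan code):

```python
from typing import Callable, Optional


def maybe_build_dataloader(
    build_dataloader_fn: Optional[Callable], **dataloader_kwargs
):
    # When an external trainer (e.g. verl) supplies its own dataloader,
    # build_dataloader_fn is None and torchtitan builds nothing.
    if build_dataloader_fn is None:
        return None
    return build_dataloader_fn(**dataloader_kwargs)


def run_training(dataloader):
    # Assert before use, so a None dataloader fails loudly inside torchtitan
    # rather than with an opaque iteration error.
    assert dataloader is not None, (
        "build_dataloader_fn was None and no external dataloader was provided"
    )
    for batch in dataloader:
        ...
```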
Goal: This PR makes the changes needed to integrate Torchtitan as a trainer into Verl: verl-project/verl#5051
Major changes:
- Change the LR schedule to be 0-indexed instead of 1-indexed, to align with Verl's FSDP util ==> we decided not to change Titan's LR scheduler behavior; see more analysis in https://docs.google.com/document/d/1YiFUvIa_JqTYpBd2Xj7ReH3Bw6wS07nKldycBX--uVE/edit?usp=sharing
- Add a position_block_causal attn mask type, which creates a block-causal mask based on position_id for both varlen and flex attention (transformers reference) ==> this is added in Verl's Torchtitan Engine code instead
Todos:
- pp_schedule.eval() does the microbatch split for us, as it takes in the whole batch. However, in verl we split the batch into microbatches before PP, and we'd love to pass a list of pre-split microbatches to the PP schedule. (Thanks for @H-Huang's help.)