
Conversation

@fegin (Contributor) commented on Sep 12, 2025:

Similar to #1696, but this PR uses parallelize_module, similar to TP/SP.

This PR also requires pytorch/pytorch#162542.

device_mesh=world_mesh["cp"],
parallelize_plan=_ContextParallel(
    seq_dim=2,
    attention_type=_ContextParallel.AttentionType.FLEX,
A reviewer (Contributor) commented:

Does this only work for FlexAttention?
Is there a plan to consolidate SDPA and FlexAttention in terms of how CP is applied?

@fegin (author) replied:

This will work for both SDPA and Flex; we just need to pass in a different attention_type depending on which attention implementation is used.
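For illustration, a minimal sketch of that selection. The import path and the AttentionType.SDPA member are assumptions (only AttentionType.FLEX appears in this PR's diff); the actual API comes from pytorch/pytorch#162542.

```python
# Sketch only: _ContextParallel is introduced by pytorch/pytorch#162542; the
# import path and the AttentionType.SDPA member are assumptions, not confirmed API.
from torch.distributed.tensor.experimental._attention import _ContextParallel


def make_cp_plan(use_flex_attention: bool) -> _ContextParallel:
    # Pick the attention type matching the model's attention implementation.
    attention_type = (
        _ContextParallel.AttentionType.FLEX
        if use_flex_attention
        else _ContextParallel.AttentionType.SDPA  # assumed member for SDPA
    )
    # seq_dim=2 matches the [batch, heads, seq, head_dim] layout used in the diff.
    return _ContextParallel(seq_dim=2, attention_type=attention_type)
```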

if parallel_dims.cp_enabled:
    for block in model.layers.values():
        parallelize_module(
            module=block.attention.sdpa.attention_fn_wrapper,
A reviewer (Contributor) commented:

IIUC, for FlexAttention we need this wrapper because the block mask has to be obtained inside the FlexAttention class before calling the wrapper. For SDPA it seems unnecessary, since it is already a very thin wrapper.

If the concern is code branching, the code is going to branch a couple of lines below anyway, so I think it's fine.

@fegin (author) replied:

It's not just about unification. The wrapper must have exactly the same function signature as scaled_dot_product_attention, and our ScaledDotProductAttention doesn't meet this requirement. More importantly, we don't want the wrapper to break when the core library changes the function signature of scaled_dot_product_attention or flex_attention. So the best UX is to always ask users to wrap these APIs in a module whose forward is def forward(*args, **kwargs) -> Any, and TorchTitan should follow the same rule.
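A minimal sketch of such a wrapper, assuming roughly this shape (the class and attribute names are illustrative, not the PR's actual code):

```python
import torch.nn as nn
import torch.nn.functional as F


class AttentionFnWrapper(nn.Module):
    """Thin nn.Module whose forward just forwards *args/**kwargs, so it stays
    valid even if the wrapped attention API's signature changes."""

    def __init__(self, attention_fn=F.scaled_dot_product_attention):
        super().__init__()
        self.attention_fn = attention_fn

    def forward(self, *args, **kwargs):
        # No signature is pinned here; whatever the caller passes is forwarded
        # unchanged to the underlying attention op.
        return self.attention_fn(*args, **kwargs)
```

Because parallelize_module hooks an nn.Module, CP can intercept the call through this wrapper without depending on the exact signature of scaled_dot_product_attention or flex_attention.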

parallelize_module(
    module=block.attention.sdpa.attention_fn_wrapper,
    device_mesh=world_mesh["cp"],
    parallelize_plan=_ContextParallel(
A reviewer (Contributor) commented:

So after this change, we only need to specify the context-parallel plan for the attention module here, and CP for the other modules is still handled by the context manager?

@fegin (author) replied:

You can check the discussion in pytorch/pytorch#162542. It's definitely good to remove the context manager, but that may also have some implications for how users write the model, like the wrapper in this PR.
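For reference, a hedged sketch of how the two pieces coexist after this PR: the attention wrapper gets the _ContextParallel plan via parallelize_module, while sequence-dim buffers are still sharded by PyTorch's experimental context_parallel context manager. The _ContextParallel import path, the model attribute names, and the buffer seq dims are assumptions based on the snippets above; the context manager API is experimental and may change.

```python
from torch.distributed.tensor.experimental import context_parallel
from torch.distributed.tensor.parallel import parallelize_module

# Assumed import path for the plan class introduced in pytorch/pytorch#162542.
from torch.distributed.tensor.experimental._attention import _ContextParallel


def apply_cp_and_run(model, world_mesh, inputs):
    # Register the CP plan on each attention wrapper (as in this PR's diff).
    for block in model.layers.values():
        parallelize_module(
            module=block.attention.sdpa.attention_fn_wrapper,
            device_mesh=world_mesh["cp"],
            parallelize_plan=_ContextParallel(
                seq_dim=2,
                attention_type=_ContextParallel.AttentionType.FLEX,
            ),
        )

    # Inputs (and other sequence-dim buffers) are still sharded by the
    # existing context manager; buffer_seq_dims=[1] assumes a [batch, seq, ...]
    # input layout.
    with context_parallel(
        world_mesh["cp"],
        buffers=[inputs],
        buffer_seq_dims=[1],
    ):
        return model(inputs)
```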
