
Conversation

@ericschreiber

We observed a significant performance regression for MoE models using pure FSDP, with MFU dropping from 24.5% to 16.3% when running Qwen3 30A3B on 2×8 B200s. This regression was traced back to commit 2a7a148, associated with issue #1895.

While 2a7a148 correctly fixes Dynamo graph breaks for combined EP + FSDP setups, the change is unnecessary for pure FSDP configurations and introduces avoidable overhead in that case.

This pull request refactors and extends the apply_compile logic used for MoE model parallelization (DeepSeekV3, Llama4, Qwen3) to better distinguish between EP/FSDP combinations. The fix from #1895 is now applied only when it is actually required. We verified that the experiments from #1895 continue to work as expected.

Because FSDP is applied only after apply_compile runs, the compile step cannot detect FSDP from the model itself, so we introduce an additional fsdp_enabled flag; files that inherit this logic are updated accordingly.
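For illustration, the refactored dispatch could look roughly like the sketch below. The names fsdp_enabled, ep_enabled, and _compile_moe_block_piecewise are placeholders for this sketch, not the actual torchtitan identifiers, and the piecewise path only stands in for the #1895 workaround:

```python
# Hypothetical sketch of the refactored apply_compile dispatch; names are
# illustrative and do not match the actual torchtitan code.
import torch


def apply_compile(model, *, fsdp_enabled: bool, ep_enabled: bool):
    for name, block in list(model.layers.named_children()):
        if ep_enabled and fsdp_enabled:
            # EP + FSDP: keep the #1895 workaround and compile the block
            # piecewise so Dynamo does not graph-break on the EP dispatch.
            _compile_moe_block_piecewise(block)
        else:
            # Pure FSDP (or no FSDP at all): whole-block compile is safe and
            # avoids the extra overhead this PR addresses.
            model.layers.register_module(
                name, torch.compile(block, fullgraph=True)
            )


def _compile_moe_block_piecewise(block):
    # Illustrative only: compile the dense sub-modules and leave the expert
    # dispatch boundary eager.
    block.attention = torch.compile(block.attention)
    block.moe.shared_experts = torch.compile(block.moe.shared_experts)
```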

Finally, as was the case prior to commit 2a7a148, enabling SAC with Qwen3 30A3B triggers the warning:

Detected that context_fn is passed to torch.utils.checkpoint under torch.compile.
Please make sure the checkpointed region does not contain in-place ops (e.g. torch.relu_).

This appears to be related to how SAC augments context_fn. For Qwen3, we verified that this warning can be safely ignored. Still, we would appreciate reviewer input on how this warning should be handled going forward.
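For context, the warning arises from passing a context_fn (the mechanism SAC uses to decide which activations to save) into torch.utils.checkpoint while the region is compiled. Below is a minimal sketch of that wiring; the save policy is illustrative, not torchtitan's actual SAC policy:

```python
# Minimal SAC sketch: save matmul outputs, recompute everything else.
import torch
from torch.utils.checkpoint import (
    CheckpointPolicy,
    checkpoint,
    create_selective_checkpoint_contexts,
)

_ops_to_save = {torch.ops.aten.mm.default}


def _policy(ctx, op, *args, **kwargs):
    return (
        CheckpointPolicy.MUST_SAVE
        if op in _ops_to_save
        else CheckpointPolicy.PREFER_RECOMPUTE
    )


def checkpointed_block(block, x):
    # Passing context_fn under torch.compile is what triggers the warning
    # quoted above; the checkpointed region must not contain in-place ops.
    return checkpoint(
        block,
        x,
        use_reentrant=False,
        context_fn=lambda: create_selective_checkpoint_contexts(_policy),
    )
```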

meta-cla bot added the CLA Signed label on Feb 4, 2026.
@wconstab (Contributor) commented Feb 5, 2026

@xmfan can you take a look at this?

@tianyu-l (Contributor) left a comment

I believe @weifengpy has ongoing work (block-level compile even with EP) to fix FSDP / EP + FSDP altogether.

I'd prefer we wait for that instead of landing this temporary workaround.

@weifengpy (Contributor)

> I believe @weifengpy has ongoing work (block-level compile even with EP) to fix FSDP / EP + FSDP altogether.
>
> I'd prefer we wait for that instead of landing this temporary workaround.

I am verifying profiler traces of per-param mesh FSDP2, which will unblock per-layer torch.compile for MoE. I should be able to publish soon: #2281

