
Conversation

@ericschreiber

We observed a significant performance regression for MoE models using pure FSDP, with MFU dropping from 24.5% to 16.3% when running Qwen3 30A3B on 2×8 B200s. This regression was traced back to commit 2a7a148, associated with issue #1895.

While 2a7a148 correctly fixes Dynamo graph breaks for combined EP + FSDP setups, the change is unnecessary for pure FSDP configurations and introduces avoidable overhead in that case.

This pull request refactors and extends the apply_compile logic used for MoE model parallelization (DeepSeekV3, Llama4, Qwen3) to better distinguish between EP/FSDP combinations. The fix from #1895 is now applied only when it is actually required. We verified that the experiments from #1895 continue to work as expected.

Because FSDP is applied only after apply_compile runs, the compile step cannot detect FSDP from the model itself, so we introduce an additional fsdp_enabled flag; files that inherit this logic are updated accordingly.
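For illustration, the refactored dispatch could look roughly like the sketch below. The names fsdp_enabled, ep_enabled, and _compile_moe_block_piecewise are placeholders for this sketch, not the actual torchtitan identifiers, and the piecewise path only stands in for the #1895 workaround:

```python
# Hypothetical sketch of the refactored apply_compile dispatch; names are
# illustrative and do not match the actual torchtitan code.
import torch


def apply_compile(model, *, fsdp_enabled: bool, ep_enabled: bool):
    for name, block in list(model.layers.named_children()):
        if ep_enabled and fsdp_enabled:
            # EP + FSDP: keep the #1895 workaround and compile the block
            # piecewise so Dynamo does not graph-break on the EP dispatch.
            _compile_moe_block_piecewise(block)
        else:
            # Pure FSDP (or no FSDP at all): whole-block compile is safe and
            # avoids the extra overhead this PR addresses.
            model.layers.register_module(
                name, torch.compile(block, fullgraph=True)
            )


def _compile_moe_block_piecewise(block):
    # Illustrative only: compile the dense sub-modules and leave the expert
    # dispatch boundary eager.
    block.attention = torch.compile(block.attention)
    block.moe.shared_experts = torch.compile(block.moe.shared_experts)
```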

Finally, as was the case prior to commit 2a7a148, enabling SAC with Qwen3 30A3B triggers the warning:

Detected that context_fn is passed to torch.utils.checkpoint under torch.compile.
Please make sure the checkpointed region does not contain in-place ops (e.g. torch.relu_).

This appears to be related to how SAC augments context_fn. For Qwen3, we verified that this warning can be safely ignored. Still, we would appreciate reviewer input on how this warning should be handled going forward.
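For context, the warning arises from passing a context_fn (the mechanism SAC uses to decide which activations to save) into torch.utils.checkpoint while the region is compiled. Below is a minimal sketch of that wiring; the save policy is illustrative, not torchtitan's actual SAC policy:

```python
# Minimal SAC sketch: save matmul outputs, recompute everything else.
import torch
from torch.utils.checkpoint import (
    CheckpointPolicy,
    checkpoint,
    create_selective_checkpoint_contexts,
)

_ops_to_save = {torch.ops.aten.mm.default}


def _policy(ctx, op, *args, **kwargs):
    return (
        CheckpointPolicy.MUST_SAVE
        if op in _ops_to_save
        else CheckpointPolicy.PREFER_RECOMPUTE
    )


def checkpointed_block(block, x):
    # Passing context_fn under torch.compile is what triggers the warning
    # quoted above; the checkpointed region must not contain in-place ops.
    return checkpoint(
        block,
        x,
        use_reentrant=False,
        context_fn=lambda: create_selective_checkpoint_contexts(_policy),
    )
```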

meta-cla bot added the CLA Signed label on Feb 4, 2026.
@wconstab (Contributor) commented Feb 5, 2026

@xmfan can you take a look at this?

@tianyu-l (Contributor) left a comment

I believe @weifengpy has ongoing work (block-level compile even with EP) to fix FSDP / EP + FSDP altogether.

I'd prefer we wait for that instead of landing this temporary workaround.

@weifengpy (Contributor)

> I believe @weifengpy has ongoing work (block-level compile even with EP) to fix FSDP / EP + FSDP altogether.
>
> I'd prefer we wait for that instead of landing this temporary workaround.

I am verifying profiler traces of per-param mesh FSDP2, which will unblock per-layer torch.compile for MoE. I should be able to publish soon: #2281

