[mxfp8 moe training] temp workaround: don't compile GroupedExperts #2268
base: danielvegamyhre/stack/5
Conversation
    ),
)
# temp workaround: compile everything except GroupedExperts
if not isinstance(submod, moe_module.GroupedExperts):
there's also another hardcoded call to compile _run_experts_grouped_mm here that you'd need to skip (I'm doing it in my patch in https://gist.github.com/bdhirsh/970a671b84c35cc95a76f33657ca4d69)
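For concreteness, here is a minimal sketch of the workaround being discussed (assumptions: a simplified compile loop, the GroupedExperts class matched by name, and the inductor backend; the real apply_compile in torchtitan is more involved):

```python
# Hypothetical sketch, not the actual torchtitan implementation.
import torch
import torch.nn as nn


def compile_all_but_grouped_experts(model: nn.Module, backend: str = "inductor") -> None:
    """Compile each child module except GroupedExperts.

    Temp workaround: compiling GroupedExperts currently hits a tensor metadata
    mismatch between the compiled forward output and the grad passed to
    backward(), so it stays eager. Per the comment above, any separate patch
    that compiles _run_experts_grouped_mm would need to be skipped as well.
    """
    for name, submod in model.named_children():
        # Stand-in for `isinstance(submod, moe_module.GroupedExperts)`.
        if type(submod).__name__ == "GroupedExperts":
            continue
        setattr(model, name, torch.compile(submod, backend=backend, fullgraph=True))
```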
torchtitan/models/llama4/__init__.py (Outdated)
| "17bx16e": TransformerModelArgs( | ||
| dim=5120, | ||
| n_layers=48, | ||
| n_layers=6, |
revert
@@ -687,14 +689,8 @@ def apply_compile(model: nn.Module, compile_config: CompileConfig, ep_enabled: b
        in moe_module._run_experts_grouped_mm.__qualname__
    )
    if not already_patched:
can we remove this logic since we are not compiling GroupedExperts any more?
    submod, backend=compile_config.backend, fullgraph=True
),
)
# temp workaround: compile everything except GroupedExperts
Add a comment describing the issue this code is working around.
@tianyu-l Thanks for the early feedback. The top 3 PRs in this stack aren't quite ready; once scale testing is complete and I've confirmed the mxfp8 a2a (all-to-all) feature works as expected (perf, numerics), I'll polish them up, add PR descriptions, and let you know when they're ready. So far so good; just waiting for a long training run to finish to verify training stability and identical convergence to bf16. Should be ready early next week.
tianyu-l left a comment:
We don't have to land this if #2281 is going to be landed soon.
Stacked PRs:
[mxfp8 moe training] temp workaround: don't compile GroupedExperts
See the thread in #2250 (comment) for context.
TL;DR: this avoids a tensor metadata mismatch between the forward output and the backward() input (the upstream grad).
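For debugging, a hypothetical helper (not part of this PR) can surface that kind of mismatch by registering a tensor hook that compares the incoming grad's metadata against the forward output's:

```python
import torch


def check_grad_metadata(out: torch.Tensor, tag: str = "output") -> None:
    """Warn if the grad flowing into `out` has different metadata than `out`.

    `out` must require grad (i.e. be part of the autograd graph).
    """
    expected = (out.shape, out.dtype, out.stride())

    def hook(grad: torch.Tensor) -> torch.Tensor:
        got = (grad.shape, grad.dtype, grad.stride())
        if got != expected:
            print(f"[{tag}] metadata mismatch: forward {expected} vs grad {got}")
        return grad

    out.register_hook(hook)


# Usage sketch:
#   y = experts(x)                    # forward through the (compiled) module
#   check_grad_metadata(y, "experts")
#   y.sum().backward()
```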
Tests