
Conversation

@danielvegamyhre
Contributor

@danielvegamyhre danielvegamyhre commented Jan 17, 2026

Stacked PRs:


Summary

torchao MXFP8 MoE training code now supports a new recipe: wgrad_with_hp (described below). This PR updates the torchtitan integration so users can enable it.

Recipes:
- "mxfp8": Use MXFP8 for all 3 grouped GEMMs in the forward and backward pass (output, dgrad, wgrad).
- "mxfp8_wgrad_with_hp": Use MXFP8 for forward output and dgrad, but keep wgrad in high-precision.
This can be used to trade-off some performance for improved accuracy.

Note: I plan to do some benchmarking to provide more concrete guidance to users on which expert shapes will result in better TPS with wgrad_with_hp.
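As a rough illustration of the difference between the two recipes (this mapping is a sketch for exposition, not torchao code), the following shows which of the three grouped GEMMs each recipe computes in MXFP8:

# Illustrative sketch only, not torchao's implementation: which grouped GEMMs
# each recipe runs in MXFP8 vs. high precision.
RECIPE_TO_MXFP8_GEMMS = {
    # forward output, input gradient (dgrad), and weight gradient (wgrad) all in MXFP8
    "mxfp8": {"output": True, "dgrad": True, "wgrad": True},
    # weight gradient stays in high precision (e.g. bf16) for better accuracy
    "mxfp8_wgrad_with_hp": {"output": True, "dgrad": True, "wgrad": False},
}

def uses_mxfp8(recipe_name: str, gemm: str) -> bool:
    """Return True if the given grouped GEMM runs in MXFP8 under the recipe."""
    return RECIPE_TO_MXFP8_GEMMS[recipe_name][gemm]

assert uses_mxfp8("mxfp8", "wgrad")
assert not uses_mxfp8("mxfp8_wgrad_with_hp", "wgrad")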

Tests

mxfp8 recipe:

CONFIG_FILE=/home/dev/torchtitan/torchtitan/models/llama4/train_configs/llama4_17bx16e.toml ./run_train.sh \
--metrics.log_freq=10 \
--training.steps=200 \
--parallelism.data_parallel_shard_degree=8 \
--parallelism.expert_parallel_degree=8 \
--parallelism.tensor_parallel_degree=1 \
--parallelism.expert_tensor_parallel_degree=1 \
--profiling.enable_profiling --profiling.profile_freq=30 \
--training.seq_len=8192 \
--activation_checkpoint.mode=none \
--model.print_after_conversion \
--training.local_batch_size=12 \
--model.converters="quantize.grouped_mm.mx,quantize.linear.mx" \
--quantize.linear.mx.mxfp8_dim1_cast_kernel_choice="cuda" \
--quantize.linear.mx.filter_fqns="output,router.gate,wk,wv" \
--quantize.grouped_mm.mx.fqns="experts" --quantize.grouped_mm.mx.recipe_name="mxfp8" \
--compile.enable --debug.moe_force_load_balance

mxfp8_wgrad_with_hp recipe:

CONFIG_FILE=/home/dev/torchtitan/torchtitan/models/llama4/train_configs/llama4_17bx16e.toml ./run_train.sh \
--metrics.log_freq=10 \
--training.steps=200 \
--parallelism.data_parallel_shard_degree=8 \
--parallelism.expert_parallel_degree=8 \
--parallelism.tensor_parallel_degree=1 \
--parallelism.expert_tensor_parallel_degree=1 \
--profiling.enable_profiling --profiling.profile_freq=30 \
--training.seq_len=8192 \
--activation_checkpoint.mode=none \
--model.print_after_conversion \
--training.local_batch_size=12 \
--model.converters="quantize.grouped_mm.mx,quantize.linear.mx" \
--quantize.linear.mx.mxfp8_dim1_cast_kernel_choice="cuda" \
--quantize.linear.mx.filter_fqns="output,router.gate,wk,wv" \
--quantize.grouped_mm.mx.fqns="experts" --quantize.grouped_mm.mx.recipe_name="mxfp8_wgrad_with_hp" \
--compile.enable --debug.moe_force_load_balance

danielvegamyhre added a commit that referenced this pull request Jan 17, 2026
stack-info: PR: #2249, branch: danielvegamyhre/stack/3
@danielvegamyhre danielvegamyhre force-pushed the danielvegamyhre/stack/3 branch from df304f1 to a59fe09 on January 17, 2026 20:37
@meta-cla meta-cla bot added the CLA Signed This label is managed by the Meta Open Source bot. label Jan 17, 2026
@danielvegamyhre
Contributor Author

cc @tianyu-l for review. pre-commit passes locally but fails in CI; not sure why yet. I did a fresh pip install of requirements-dev.txt.

Contributor

@tianyu-l tianyu-l left a comment


Would love to see some requests / tech reports / experiments / scientific studies / guidance on such new features -- with all these flexible options, it's not clear to users what works best.

stack-info: PR: #2249, branch: danielvegamyhre/stack/3
@danielvegamyhre danielvegamyhre force-pushed the danielvegamyhre/stack/3 branch from a59fe09 to ac87d5c on February 3, 2026 20:03
@danielvegamyhre
Contributor Author

Would love to see some requests / tech reports / experiments / scientific studies / guidance on such new features -- with all these flexible options, it's not clear to users what works best.

We are publishing docs and an e2e tutorial on the torchao docsite for this, which we can link to from the torchtitan MXFP8 docs, if that works? If so, I can add a new PR on top of this stack with documentation updates.

@tianyu-l
Contributor

tianyu-l commented Feb 5, 2026

We are publishing docs and an e2e tutorial on torchao docsite on this

would love to read this tutorial

@danielvegamyhre
Contributor Author

danielvegamyhre commented Feb 5, 2026

We are publishing docs and an e2e tutorial on torchao docsite on this

would love to read this tutorial

Here is the tutorial preview: https://docs-preview.pytorch.org/pytorch/ao/3752/mxfp8_expert_parallel.html

The torchtitan config params will need to be updated since I changed MXFP8ExpertParallel to be applied automatically for the mxfp8_wgrad_with_hp recipe, but otherwise I think it is good to go. Let me know if you have any thoughts.

@danielvegamyhre
Contributor Author

test failures are unrelated

@dataclass
class MXGroupedMM:
-    recipe_name: Literal["mxfp8"] = "mxfp8"
+    recipe_name: Literal["mxfp8", "mxfp8_wgrad_with_hp"] = "mxfp8"
Contributor


In the tutorial

The mxfp8_wgrad_with_hp recipe is required for MoE training with expert parallelism.

  • why is it only required for EP?
  • why is the default here not wgrad_with_hp?
  • why is mxfp8 an option if hp wgrad is "required"?

Contributor


I just saw #2250 (comment)

Does it make sense that we don't give the user control, and just do the following (sketched below)?

  • mxfp8 when DeepEP is used, or no EP is used
  • mxfp8_wgrad_with_hp when EP > 1 but DeepEP is not used

When EP is not enabled, which one should the user use?
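A rough sketch of the selection policy suggested above (hypothetical helper, not existing torchtitan or torchao code):

def choose_mx_recipe(ep_degree: int, use_deepep: bool) -> str:
    """Hypothetical auto-selection of the MXFP8 MoE recipe, per the suggestion above."""
    # mxfp8 when DeepEP is used, or when no expert parallelism is used
    if use_deepep or ep_degree <= 1:
        return "mxfp8"
    # keep the weight gradient in high precision when EP > 1 without DeepEP
    return "mxfp8_wgrad_with_hp"

assert choose_mx_recipe(ep_degree=1, use_deepep=False) == "mxfp8"
assert choose_mx_recipe(ep_degree=8, use_deepep=False) == "mxfp8_wgrad_with_hp"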

Contributor Author

@danielvegamyhre danielvegamyhre Feb 10, 2026


When EP is not enabled, which one should user use?

  • For performance, it depends on the expert shapes, batch size, and sequence length. For directional guidance on which recipe is better in a given context, I'm thinking about working on some tables like we have for float8 (see the screenshot below). There has been positive feedback from users on this.
  • For accuracy / improved step quality, wgrad_with_hp computes weight gradients in bf16, so this recipe can be more of a net benefit: not in terms of TPS, but in terms of "time to target validation loss" or "time to some eval score threshold." Luca also found a perf benefit for certain smaller shapes with fp8_rowwise_with_gw_hp (the Hopper recipe). These have to be evaluated through experimentation, though.
[Screenshot: float8 recipe performance guidance tables, 2026-02-09]

