Conversation

@danielvegamyhre (Contributor) commented Jan 21, 2026

Stacked PRs:


[mxfp8 moe training] temp workaround: don't compile GroupedExperts

See thread (#2250 (comment)) for context.

TL;DR: this avoids a tensor metadata mismatch between the forward output and the backward() input (the upstream grad).
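To make the workaround concrete, here is a minimal sketch of the idea (illustrative only; the actual change lives in torchtitan's apply_compile and checks moe_module.GroupedExperts directly, as in the diff hunks quoted below):

import torch
import torch.nn as nn

def compile_block_except_grouped_experts(block: nn.Module, backend: str = "inductor") -> None:
    """Compile each child module except GroupedExperts (sketch, not the actual patch).

    Compiling GroupedExperts currently triggers a tensor metadata mismatch between
    the forward output and the upstream grad passed to backward(), so that submodule
    is left in eager mode as a temporary workaround.
    """
    for name, submod in list(block.named_children()):
        # Stand-in for the real check: isinstance(submod, moe_module.GroupedExperts)
        if type(submod).__name__ == "GroupedExperts":
            continue
        block.register_module(
            name, torch.compile(submod, backend=backend, fullgraph=True)
        )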

Tests

CONFIG_FILE=/home/dev/torchtitan/torchtitan/models/deepseek_v3/train_configs/deepseek_v3_16b.toml ./run_train.sh --metrics.log_freq=10 \
--training.steps=1500  \
--parallelism.data_parallel_shard_degree=4 \
--parallelism.expert_parallel_degree=4 \
--parallelism.tensor_parallel_degree=2 \
--parallelism.expert_tensor_parallel_degree=1 \
--training.seq_len=8192 \
--activation_checkpoint.mode=full \
--model.print_after_conversion \
--training.local_batch_size=16 \
--quantize.linear.mx.mxfp8_dim0_cast_kernel_choice="triton" --quantize.linear.mx.mxfp8_dim1_cast_kernel_choice="cuda" \
--quantize.grouped_mm.mx.fqns="experts" --quantize.grouped_mm.mx.recipe_name="mxfp8_wgrad_with_hp" \
--compile.enable --compile.components="model,loss" --debug.moe_force_load_balance \
--model.converters="quantize.grouped_mm.mx"

),
)
# temp workaround: compile everything except GroupedExperts
if not isinstance(submod, moe_module.GroupedExperts):
Review comment (Contributor) on the diff hunk above:
there's also another hardcoded call to compile _run_experts_grouped_mm here that you'd need to skip (I'm doing it in my patch in https://gist.github.com/bdhirsh/970a671b84c35cc95a76f33657ca4d69)
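For illustration, a rough sketch of guarding that second compile site as well; the flag plumbing and helper shown here are assumptions, and only the _run_experts_grouped_mm name comes from the hunks in this PR:

import torch

def maybe_compile_run_experts_grouped_mm(moe_module, compile_grouped_experts: bool) -> None:
    # Temp workaround: skip the hardcoded compile of the grouped-GEMM expert path
    # whenever GroupedExperts itself is not being compiled.
    if not compile_grouped_experts:
        return
    moe_module._run_experts_grouped_mm = torch.compile(
        moe_module._run_experts_grouped_mm, fullgraph=True
    )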

"17bx16e": TransformerModelArgs(
dim=5120,
n_layers=48,
n_layers=6,
Review comment (Contributor) on the diff hunk above:
revert

@@ -687,14 +689,8 @@ def apply_compile(model: nn.Module, compile_config: CompileConfig, ep_enabled: b
in moe_module._run_experts_grouped_mm.__qualname__
)
if not already_patched:
Review comment (Contributor) on the diff hunk above:
can we remove this logic since we are not compiling GroupedExperts any more?

submod, backend=compile_config.backend, fullgraph=True
),
)
# temp workaround: compile everything except GroupedExperts
Review comment (Contributor) on the diff hunk above:
add a comment on the issue this code is working around
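For example, the requested comment could point readers at the thread referenced in the PR description (wording below is only a suggestion):

# temp workaround: compile everything except GroupedExperts.
# Compiling GroupedExperts hits a tensor metadata mismatch between the forward
# output and the upstream grad passed to backward(); see the discussion in
# #2250 (comment) for details.
if not isinstance(submod, moe_module.GroupedExperts):
    ...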

@danielvegamyhre (Contributor, Author) commented Jan 24, 2026

@tianyu-l thanks for the early feedback. The top 3 PRs in this stack aren't quite ready; once scale testing is complete and I've confirmed the mxfp8 a2a feature works as expected (perf, numerics), I'll polish it up, add PR descriptions, and let you know when it's ready.

So far so good, just waiting for a long training run to finish to verify training stability and identical convergence to bf16. Should be ready early next week.

@tianyu-l (Contributor) left a comment:
We don't have to land this if #2281 is going to be landed soon.
