Bug description
See pytorch/ao#3636
When training MXFP8 MoE models with TorchTitan, expert group sizes must be padded to a multiple of 32 to satisfy the MXFP8 scaling-block requirements in the backward pass. Currently, when quantization is enabled through TorchTitan's model-converter path, this padding does not appear to be consistently enforced, so unsupported group sizes (e.g. 44) reach the MXFP8 CUDA kernels and cause runtime failures or a forced fallback to the Triton kernels. TorchTitan already has a token-permutation step that gathers tokens across ranks and can insert padding, so this issue tracks ensuring that expert group sizes are always padded to a multiple of 32 before entering the MXFP8 grouped-MM path.
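For illustration, a minimal sketch of the required rounding, assuming per-expert token counts are tracked as a 1-D tensor of group sizes; the helper name is hypothetical and not a TorchTitan or torchao API:

import torch

def pad_group_sizes_to_multiple(group_sizes: torch.Tensor, multiple: int = 32) -> torch.Tensor:
    # Round each per-expert group size up to the nearest multiple of `multiple`,
    # so every group satisfies the MXFP8 scaling-block size requirement.
    return ((group_sizes + multiple - 1) // multiple) * multiple

# Example: an unsupported group size of 44 would be padded up to 64.
sizes = torch.tensor([44, 0, 128, 17])
print(pad_group_sizes_to_multiple(sizes))  # tensor([ 64,   0, 128,  32])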
Versions
pytorch == 2.11.0a0+gita156055
torchtitan == 0.2.1+gita25dd8f
torchao == 0.16.0+gita5f2693
torchrun \
--nnodes=2 \
--nproc-per-node=4 \
...
-m torchtitan.train \
--job.config_file torchtitan/models/deepseek_v3/train_configs/deepseek_v3_16b.toml \
--training.steps 50 \
...
--model.converters="quantize.linear.mx,quantize.grouped_mm.mx" \
--quantize.linear.mx.recipe_name="mxfp8_cublas" \
--quantize.grouped_mm.mx.fqns="experts" \
--parallelism.expert_parallel_degree 8