Expert group padding to multiple of 32 #2262

@eee4017

Description

Bug description

See pytorch/ao#3636

When training MXFP8 MoE models with TorchTitan, expert group sizes must be padded to a multiple of 32 to satisfy the MXFP8 scaling-block requirements in the backward pass. Currently, when MXFP8 quantization is enabled via TorchTitan's model-converter path, this padding does not appear to be consistently enforced, so unsupported group sizes (e.g. 44) reach the MXFP8 CUDA kernels and cause runtime failures or a forced fallback to Triton. TorchTitan already has a permutation step that gathers tokens across ranks and can insert padding, so this issue tracks ensuring that expert group padding to a multiple of 32 is always applied before entering the MXFP8 grouped-MM path.
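For illustration, here is a minimal sketch of the alignment arithmetic involved (not TorchTitan's actual permutation code; the names `pad_group_sizes`, `tokens_per_expert`, and `ALIGNMENT` are placeholders). Each expert's token count is rounded up to the nearest multiple of 32 before the offsets for the grouped MM are built, so a group size like 44 becomes 64:

```python
import torch

ALIGNMENT = 32  # MXFP8 scaling-block size along the grouped dimension

def pad_group_sizes(tokens_per_expert: torch.Tensor, alignment: int = ALIGNMENT) -> torch.Tensor:
    """Round each expert's group size up to the nearest multiple of `alignment`."""
    return ((tokens_per_expert + alignment - 1) // alignment) * alignment

# Example: an unsupported group size like 44 becomes 64; empty groups stay 0.
tokens_per_expert = torch.tensor([44, 0, 128, 17])
padded = pad_group_sizes(tokens_per_expert)
print(padded)                           # tensor([ 64,   0, 128,  32])
offsets = torch.cumsum(padded, dim=0)   # group offsets for the grouped MM
print(offsets)                          # tensor([ 64,  64, 192, 224])
```

The padded slots would need to be filled with zero (or otherwise masked) tokens during the permutation step so they do not affect the result; the sketch only shows the size/offset rounding.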

cc @danielvegamyhre

Versions

pytorch == 2.11.0a0+gita156055
torchtitan == 0.2.1+gita25dd8f
torchao == 0.16.0+gita5f2693
torchrun \
  --nnodes=2 \
  --nproc-per-node=4 \
  ...
  -m torchtitan.train \
  --job.config_file torchtitan/models/deepseek_v3/train_configs/deepseek_v3_16b.toml \
  --training.steps 50 \
  ...
  --model.converters="quantize.linear.mx,quantize.grouped_mm.mx" \
  --quantize.linear.mx.recipe_name="mxfp8_cublas" \
  --quantize.grouped_mm.mx.fqns="experts" \
  --parallelism.expert_parallel_degree 8
