Expert group padding to multiple of 32 #2262

@eee4017

Description

Bug description

See pytorch/ao#3636

When training MXFP8 MoE models with TorchTitan, expert group sizes must be padded to a multiple of 32 to satisfy the MXFP8 scaling-block requirements in the backward pass. Currently, when MXFP8 quantization is enabled via TorchTitan's model-converter path, this padding does not appear to be consistently enforced, so unsupported group sizes (e.g. 44) reach the MXFP8 CUDA kernels and cause runtime failures or a forced fallback to Triton. TorchTitan already has a permutation step that gathers tokens across ranks and can insert padding, so this issue tracks ensuring that expert group padding to a multiple of 32 is always applied before entering the MXFP8 grouped-MM path.
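For illustration, here is a minimal sketch of the alignment arithmetic involved (not TorchTitan's actual permutation code; the names `pad_group_sizes`, `tokens_per_expert`, and `ALIGNMENT` are placeholders). Each expert's token count is rounded up to the nearest multiple of 32 before the offsets for the grouped MM are built, so a group size like 44 becomes 64:

```python
import torch

ALIGNMENT = 32  # MXFP8 scaling-block size along the grouped dimension

def pad_group_sizes(tokens_per_expert: torch.Tensor, alignment: int = ALIGNMENT) -> torch.Tensor:
    """Round each expert's group size up to the nearest multiple of `alignment`."""
    return ((tokens_per_expert + alignment - 1) // alignment) * alignment

# Example: an unsupported group size like 44 becomes 64; empty groups stay 0.
tokens_per_expert = torch.tensor([44, 0, 128, 17])
padded = pad_group_sizes(tokens_per_expert)
print(padded)                           # tensor([ 64,   0, 128,  32])
offsets = torch.cumsum(padded, dim=0)   # group offsets for the grouped MM
print(offsets)                          # tensor([ 64,  64, 192, 224])
```

The padded slots would need to be filled with zero (or otherwise masked) tokens during the permutation step so they do not affect the result; the sketch only shows the size/offset rounding.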

cc @danielvegamyhre

Versions

pytorch == 2.11.0a0+gita156055
torchtitan == 0.2.1+gita25dd8f
torchao == 0.16.0+gita5f2693
torchrun \
  --nnodes=2 \
  --nproc-per-node=4 \
  ...
  -m torchtitan.train \
  --job.config_file torchtitan/models/deepseek_v3/train_configs/deepseek_v3_16b.toml \
  --training.steps 50 \
  ...
  --model.converters="quantize.linear.mx,quantize.grouped_mm.mx" \
  --quantize.linear.mx.recipe_name="mxfp8_cublas" \
  --quantize.grouped_mm.mx.fqns="experts" \
  --parallelism.expert_parallel_degree 8
