Muon Optimizer #2601
-
@caojiaolong I was pondering doing it, but it's in torch now https://github.com/pytorch/pytorch/blob/main/torch/optim/_muon.py ... so that lowers the priority for adding it to timm unless there are modes/features missing from the torch version. EDIT: I will add the mapping to timm to allow using torch.optim.Muon with the timm factory. One concern with the original: I had some questions regarding compatibility with convnet weight layouts, as the dim handling isn't always friendly with convs and transpose convs. I think some of that might have been fixed; I have to try out the torch version, haven't had a chance yet.
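To illustrate the layout concern, here is a minimal sketch (not timm or torch internals): Conv2d stores weights as (out_ch, in_ch, kH, kW) while ConvTranspose2d stores (in_ch, out_ch, kH, kW), so a naive flatten orthogonalizes over the wrong axis for transpose convs.

```python
import torch

# Minimal sketch (not timm/torch internals) of the conv layout issue noted above.
conv_w = torch.randn(64, 32, 3, 3)    # Conv2d weight: (out_ch, in_ch, kH, kW)
deconv_w = torch.randn(32, 64, 3, 3)  # ConvTranspose2d weight: (in_ch, out_ch, kH, kW)

# A naive flatten assumes dim 0 is the fan-out axis, which only holds for Conv2d.
conv_2d = conv_w.reshape(conv_w.shape[0], -1)        # (64, 288), as intended
deconv_2d = deconv_w.reshape(deconv_w.shape[0], -1)  # (32, 576), axes effectively swapped

# Transpose convs need dims 0/1 swapped first so the orthogonalized matrix is
# again (fan_out, fan_in * kH * kW).
deconv_fixed = deconv_w.transpose(0, 1).reshape(deconv_w.shape[1], -1)  # (64, 288)
print(conv_2d.shape, deconv_2d.shape, deconv_fixed.shape)
```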
-
I could never get Shampoo and SOAP working well (i.e., never found decent hparams for vision tasks). Another related optimizer, Kron, works quite well though.
-
@caojiaolong looking more closely, surprisingly the PyTorch impl of Muon only supports 2D parameters... so I guess that's not going to work. I figured they'd generalize it better if it was going to be included there...
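For context on why a 2D-only optimizer is limiting for vision models, a quick check (torchvision's resnet18 used here purely as an example):

```python
import torch
import torchvision

# Most weight tensors in a typical convnet are 4D conv kernels; only the final
# fc layer is a 2D matrix, so a 2D-only Muon covers very little of the model.
model = torchvision.models.resnet18()
ndims = sorted({p.ndim for p in model.parameters()})
n_2d = sum(1 for p in model.parameters() if p.ndim == 2)
n_4d = sum(1 for p in model.parameters() if p.ndim == 4)
print(ndims, n_2d, n_4d)  # e.g. [1, 2, 4] with a single 2D weight and ~20 4D conv weights
```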
-
@caojiaolong I pushed an impl I had sitting around closer to completion and created a PR (#2596), still testing some things. It's based on Keller's, like many of the others, but I added/integrated a few additional options and sped up the NS iteration a bit without getting too crazy. It should work decently with convnets and hybrid vit-cnn models, and it has two modes for that: flattening like Keller's, and another option that treats the spatial kernel dims as a batch dim for the NS iterations.
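A rough sketch of what those two conv-handling modes could look like (illustrative reshapes only; the names and signatures here are not the actual timm API):

```python
import torch

def conv_to_muon_matrices(w: torch.Tensor, mode: str = "flatten") -> torch.Tensor:
    """Reshape a (O, I, kH, kW) conv weight into 2D matrices for a Newton-Schulz step."""
    o, i, kh, kw = w.shape
    if mode == "flatten":
        # Keller-style: one (O, I*kH*kW) matrix per layer.
        return w.reshape(o, i * kh * kw)
    if mode == "spatial_batch":
        # Treat the kH*kW spatial positions as a batch of (O, I) matrices,
        # so the NS iterations run batched over kernel positions.
        return w.permute(2, 3, 0, 1).reshape(kh * kw, o, i)
    raise ValueError(f"unknown mode: {mode}")

w = torch.randn(64, 32, 3, 3)
print(conv_to_muon_matrices(w, "flatten").shape)        # torch.Size([64, 288])
print(conv_to_muon_matrices(w, "spatial_batch").shape)  # torch.Size([9, 64, 32])
```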
-
I tweaked the heuristics that assign params to Muon vs AdamW/NAdamW updates. This seemed to improve behaviour with convnets like EfficientNets / MobileNets that have lots of depthwise convs. I think I'm ready to merge the initial version, but if anyone following wants to try it, feedback is welcome.
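That kind of routing heuristic could look roughly like the sketch below (illustrative only; the actual rules in the PR may differ):

```python
import torch
import torch.nn as nn

def use_muon(name: str, p: torch.Tensor) -> bool:
    """Toy heuristic: route a parameter to Muon or to the AdamW/NAdamW fallback."""
    if p.ndim < 2:
        return False  # biases, norm scales, scalars -> fallback optimizer
    if any(k in name for k in ("head", "classifier", "embed")):
        return False  # final projection / embeddings -> fallback optimizer
    if p.ndim == 4 and p.shape[1] == 1:
        return False  # depthwise convs (one input channel per group) -> fallback optimizer
    return True

# Toy container, used only to enumerate parameters.
model = nn.Sequential(nn.Conv2d(3, 32, 3), nn.BatchNorm2d(32), nn.Linear(32, 10))
muon = [n for n, p in model.named_parameters() if use_muon(n, p)]
fallback = [n for n, p in model.named_parameters() if not use_muon(n, p)]
print(muon)      # ['0.weight', '2.weight']
print(fallback)  # ['0.bias', '1.weight', '1.bias', '2.bias']
```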
-
Hello @rwightman, thank you! I have tried the Muon optimizer on a private self-supervised learning task, but observed a performance gap compared to AdamW (frankly, it was worse), though that is probably task-specific. I'll try more tasks to see how Muon performs. Thanks for your quick response to the request for Muon!
-
@rwightman Thank you for your excellent work. Training details can be found in this W&B report.

export OMP_NUM_THREADS=2
export MKL_NUM_THREADS=2
# export HF_DATASETS_IN_MEMORY_MAX_SIZE=50240000
MODEL=convnext_tiny # drop-path 0.1, 0.1, 0.15, 0.2 for T, S, B, L
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 PORT=31929 bash distributed_train.sh 8 \
    `# Dataset parameters` \
    --dataset hfds//ImageNet_arrow_rgbdpa_new/ \
    --data-dir /.cache/huggingface/datasets/ \
    --train-split train \
    --val-split validation \
    --input-img-mode RGB \
    --input-key rgb \
    $() \
    `# Model parameters` \
    --model $MODEL \
    --num-classes 1000 \
    --input-size 3 224 224 \
    --mean 0.485 0.456 0.406 \
    --std 0.229 0.224 0.225 \
    --batch-size 256 \
    --grad-accum-steps 2 \
    $() \
    `# Scripting / codegen` \
    `#--torchcompile inductor` \
    $() \
    `# Device & distributed` \
    --amp \
    $() \
    `# Optimizer parameters` \
    --opt muon \
    --weight-decay 0.05 \
    --opt-kwargs "fallback_list=['head.*', 'stem.*']" \
    $() \
    `# Learning rate schedule parameters` \
    --sched-on-updates \
    --lr-base 0.004 \
    --lr-base-size 4096 \
    --warmup-lr 1e-6 \
    --epochs 300 \
    --warmup-epochs 20 \
    $() \
    `# Augmentation & regularization parameters` \
    --aa rand-m9-mstd0.5-inc1 \
    --reprob 0.25 \
    --remode pixel \
    --cutmix 1.0 \
    --mixup 0.8 \
    --drop-path 0.1 \
    $() \
    `# Model Exponential Moving Average` \
    --model-ema \
    --model-ema-decay 0.9999 \
    --model-ema-warmup \
    $() \
    `# Misc` \
    --seed 42 \
    --log-interval 1 \
    --workers 8 \
    --pin-mem \
    --output output/train \
    --experiment convnext_tiny_rep_nmuon_rgb \
    --use-multi-epochs-loader \
    --log-wandb \
    --wandb-project RGBDPretrain

I'm now also experimenting with nMuon, and will share the results once it's done.
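As a rough sanity check of the effective learning rate implied by those flags (assuming timm's linear lr-base scaling, lr = lr_base * global_batch / lr_base_size, is what applies when --lr is not given):

```python
# Effective global batch and base-LR scaling implied by the command above
# (assumes linear lr-base scaling; the actual train.py logic may differ).
batch_size_per_gpu = 256
num_gpus = 8
grad_accum_steps = 2
global_batch = batch_size_per_gpu * num_gpus * grad_accum_steps  # 4096

lr_base, lr_base_size = 0.004, 4096
lr = lr_base * global_batch / lr_base_size
print(global_batch, lr)  # 4096 0.004 -> the base LR is used unscaled here
```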
-
@caojiaolong @sjiang95 so far, this mirrors my experience... early convergence is notably faster with Muon, but over a typical training schedule AdamW or NAdamW ends up winning by a bit. In larger scale LLM training, I feel you're often in a more 'undertrained' state where this convergence behaviour is highly beneficial, vs doing hundreds of epochs for smaller vision models on smaller data. It does seem like the LR could/should be pushed a bit higher for Muon, so I'm trying that right now. I've been leaving the stem as Muon but the final head projection as AdamW. I don't think there's anything inherently wrong with my Muon impl, or Muon generally. But as usual, it's hard to beat good ol' AdamW. Even with a theoretically better optimizer, you have to run through a whole search of LR, weight decay, and schedule to be 'fair' against the familiar hparams for AdamW.
-
I think a better vision test would be pretraining a much larger ViT model, but I don't currently have free hardware resources to give that a spin. I'm going to convert this to a discussion for ongoing updates...

-
The recently introduced [Muon Optimizer](https://github.com/KellerJordan/Muon) has shown promising results, claiming to outperform commonly used optimizers such as AdamW, Shampoo, and SOAP ([reference benchmarks](https://github.com/KellerJordan/modded-nanogpt/tree/master/records/102924_Optimizers)).
Given its potential advantages, it would be valuable to have Muon integrated into the timm library for broader experimentation and adoption. Is there any plan or interest in adding support for the Muon Optimizer to timm?