as of 2025, adamw is the most widely used optimizer for training deep neural networks
adamw is momentum-based though
empirically, momentum-based optimizers tend to favor a few update directions over the rest
the update matrices look suspiciously low-rank
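a quick way to sanity-check that claim yourself (a sketch; `effective_rank_ratio` is a made-up helper, and you’d feed it a real momentum buffer from your own training run):

```python
import torch

def effective_rank_ratio(g: torch.Tensor, thresh: float = 0.01) -> float:
    # fraction of singular values above thresh * the largest one;
    # near 1.0 = energy spread across many directions, near 0 = low-rank
    s = torch.linalg.svdvals(g.float())
    return (s > thresh * s.max()).float().mean().item()

# a random matrix has a flat-ish spectrum, so the ratio is near 1.0;
# momentum buffers from real training tend to score much lower
print(effective_rank_ratio(torch.randn(512, 512)))
```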
it’s like trying to roll a boat down the loss surface
muon’s authors figured that if we could dampen the dominant directions and boost the rarer ones, we’d get better convergence
kinda like bringing the boat back to a ball shape
muon’s authors note that if we run an svd on the momentum update matrix and throw away the singular values (keeping only `U @ V.T`), we get an update with roughly unit length in all directions
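in code, the exact (expensive) version would be something like this sketch:

```python
import torch

def orthogonalize_exact(m: torch.Tensor) -> torch.Tensor:
    # m = U @ diag(S) @ V.T ; dropping S (setting every singular value
    # to 1) leaves U @ V.T, the closest semi-orthogonal matrix to m
    u, s, vt = torch.linalg.svd(m, full_matrices=False)
    return u @ vt
```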
we can’t run a full svd at every update step though, it’s too costly
we don’t need that much precision anyway, so we can reach for faster iterative algorithms
newton-schulz iteration is one such algorithm
basically, applying an odd matrix polynomial to a matrix is the same as applying that polynomial directly to its singular values, while `U` and `V.T` are left unchanged
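you can check this identity numerically (the coefficients here are arbitrary, purely for illustration):

```python
import torch

torch.manual_seed(0)
m = torch.randn(4, 6, dtype=torch.float64)
u, s, vt = torch.linalg.svd(m, full_matrices=False)

# odd matrix polynomial: p(M) = a*M + b*(M @ M.T) @ M + c*(M @ M.T)^2 @ M
a, b, c = 1.5, -0.5, 0.1  # arbitrary coefficients, purely illustrative
mmt = m @ m.T
pm = a * m + b * (mmt @ m) + c * (mmt @ mmt @ m)

# the same polynomial applied straight to the singular values
ps = a * s + b * s**3 + c * s**5
print(torch.allclose(pm, u @ torch.diag(ps) @ vt))  # True
```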
it would be neat to apply the sign function to the singular values, because then we’d have all 1s on that diagonal
we can’t use sign though: it’s odd, but it’s not a polynomial
we can approximate the sign function by iterating a cubic/quintic polynomial on itself a few times (not perfectly, but close enough around [-1, 1])
5 iterations (or steps) are good enough in practice
if we normalize the update matrix (divide by its frobenius norm) so the singular values land in [0, 1], we’re sure the approximation behaves sign-like and pushes them toward 1
actually, empirically, we don’t even need to land exactly at 1: anywhere in [0.7, 1.3] is good enough
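a quick scalar sanity check, borrowing the quintic coefficients introduced just below: iterate 5 times on a few starting values in (0, 1] and they all land roughly in that band:

```python
a, b, c = 3.4445, -4.7750, 2.0315
f = lambda x: a * x + b * x**3 + c * x**5  # the quintic from below

for x0 in (0.01, 0.1, 0.5, 1.0):
    x = x0
    for _ in range(5):
        x = f(x)
    print(f"{x0} -> {x:.3f}")  # everything ends up roughly in [0.7, 1.3]
```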
the “cursed quintic” gives us an ugly-looking sign function, but it has all the right characteristics + it’s fast
its parameters are `a, b, c = (3.4445, -4.7750, 2.0315)`
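in matrix form the whole thing is only a few lines; a sketch along the lines of the reference `zeropower_via_newtonschulz5` from Keller’s blog (linked below):

```python
import torch

def zeropower_via_newtonschulz5(g: torch.Tensor, steps: int = 5) -> torch.Tensor:
    # quintic newton-schulz: pushes the singular values of g toward ~1
    # while leaving U and V.T untouched
    a, b, c = (3.4445, -4.7750, 2.0315)
    x = g.bfloat16()
    x = x / (x.norm() + 1e-7)  # frobenius-normalize: singular values in [0, 1]
    if g.size(0) > g.size(1):
        x = x.T  # work on the wide orientation, smaller gram matrix
    for _ in range(steps):
        xxt = x @ x.T
        x = a * x + (b * xxt + c * xxt @ xxt) @ x  # the odd quintic, in matrix form
    if g.size(0) > g.size(1):
        x = x.T
    return x
```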
that’s muon
ps:
nesterov momentum works better than plain sgd-momentum here
muon only works on the 2d weight matrices of hidden dense/linear layers
so no biases and no embeddings
convolutional filters are fine, you just need to flatten them back to 2d (see the sketch after this list)
also, empirically, the first and last layers should stay on adamw
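a minimal routing sketch (the toy model and the muon/adamw split below are illustrative, not a full optimizer; the conv reshape is shown as comments):

```python
import torch.nn as nn

# hypothetical toy model, just to illustrate the routing rules
model = nn.Sequential(
    nn.Embedding(1000, 64),  # first layer / embedding -> adamw
    nn.Linear(64, 64),       # hidden 2d weight -> muon (its bias -> adamw)
    nn.Linear(64, 1000),     # last layer -> adamw
)

hidden = model[1]
muon_params = [hidden.weight]
adamw_params = [p for p in model.parameters() if p is not hidden.weight]

# conv filters: flatten (out, in, kh, kw) -> (out, in*kh*kw) before the
# newton-schulz step, then reshape the orthogonalized update back:
#   flat = momentum.view(momentum.size(0), -1)
#   update = zeropower_via_newtonschulz5(flat).view_as(momentum)
```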
32 min step-by-step breakdown here:
big shout out to SJ and Adrian for supporting the content!
also, more reading for a deeper dive:
📌 overview: https://rixhabh.notion.site/Muon-2302d1eca8f6801c88e3f80e43c316da?pvs=74
📌 Keller’s blog: https://kellerjordan.github.io/posts/muon/
📌 Jeremy’s blog: https://jeremybernste.in/writing/deriving-muon
📌 newton-schulz overview: https://docs.modula.systems/algorithms/newton-schulz/