Compare SGD, Adam, and AdamW. What does each actually do, and when would you prefer SGD over Adam despite Adam being adaptive?
formulate your own answer first, then read on
tldr
- SGD (+momentum): one global learning rate for all parameters; often generalizes well in vision settings, but needs careful LR and schedule tuning.
- Adam: per-parameter adaptive learning rates from running gradient moments; converges fast and copes with sparse or unstable gradients, but its weight decay is L2-style (folded into the gradient), so it gets rescaled by the adaptive denominator.
- AdamW: Adam with decoupled weight decay applied directly to the weights; the preferred regularization behavior for modern networks and the default for transformers.
- Rule of thumb: pick SGD+momentum for CNN baselines where you can afford to tune; pick AdamW for transformers and fine-tuning. Be cautious with Adam plus L2-style regularization; AdamW is usually the cleaner choice.
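A minimal PyTorch sketch of how the three are typically constructed; the model and hyperparameters are placeholders, not recommendations:

```python
import torch

model = torch.nn.Linear(128, 10)  # stand-in for a real network

# SGD + momentum: single global LR, usually tuned together with a schedule.
sgd = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-4)

# Adam: per-parameter adaptive LR; weight_decay here is L2-style (added to the
# gradient), so it passes through the adaptive rescaling.
adam = torch.optim.Adam(model.parameters(), lr=3e-4, weight_decay=1e-4)

# AdamW: same adaptive update, but weight decay is applied directly to the
# weights (decoupled), independent of the gradient statistics.
adamw = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)
```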
follow-up
- What is learning rate warmup, why do transformers need it, and what happens if you skip it?
- Why does Adam tend to find sharper minima than SGD, and how does the Lion optimizer try to address this?
- What is gradient clipping, when should you use it, and how does it interact with adaptive optimizers? (A minimal sketch covering warmup and clipping mechanics follows this list.)
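For the warmup and clipping follow-ups, a sketch of the mechanics only (PyTorch; `warmup_steps`, `max_norm`, and `loader` are illustrative assumptions, not tuned values):

```python
import torch

model = torch.nn.Linear(128, 10)  # stand-in for a real network
opt = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)

# Linear warmup: ramp the LR from ~0 up to its base value over warmup_steps.
warmup_steps = 1000
sched = torch.optim.lr_scheduler.LambdaLR(
    opt, lambda step: min(1.0, (step + 1) / warmup_steps)
)

for step, (x, y) in enumerate(loader):  # `loader` is a placeholder DataLoader
    loss = torch.nn.functional.cross_entropy(model(x), y)
    opt.zero_grad()
    loss.backward()
    # Clip the global gradient norm before the optimizer consumes the gradients.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    opt.step()
    sched.step()
```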