Compare SGD, Adam, and AdamW. What does each actually do, and when would you prefer SGD over Adam despite Adam being adaptive?
formulate your own answer first, then read on
tldr
- SGD (+momentum): one global learning rate for all parameters; often generalizes well in vision settings, but needs careful LR and schedule tuning.
- Adam: per-parameter adaptive learning rates from running gradient moments; converges fast and copes with sparse or unstable gradients, but its weight decay is L2-style (folded into the gradient), so it gets rescaled by the adaptive denominator.
- AdamW: Adam with decoupled weight decay applied directly to the weights; the preferred regularization behavior for modern networks and the default for transformers.
- Rule of thumb: pick SGD+momentum for CNN baselines where you can afford to tune; pick AdamW for transformers and fine-tuning. Be cautious with Adam plus L2-style regularization; AdamW is usually the cleaner choice.
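A minimal PyTorch sketch of how the three are typically constructed; the model and hyperparameters are placeholders, not recommendations:

```python
import torch

model = torch.nn.Linear(128, 10)  # stand-in for a real network

# SGD + momentum: single global LR, usually tuned together with a schedule.
sgd = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-4)

# Adam: per-parameter adaptive LR; weight_decay here is L2-style (added to the
# gradient), so it passes through the adaptive rescaling.
adam = torch.optim.Adam(model.parameters(), lr=3e-4, weight_decay=1e-4)

# AdamW: same adaptive update, but weight decay is applied directly to the
# weights (decoupled), independent of the gradient statistics.
adamw = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)
```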
follow-up
- What is learning rate warmup, why do transformers need it, and what happens if you skip it?
- Why does Adam tend to find sharper minima than SGD, and how does the Lion optimizer try to address this?
- What is gradient clipping, when should you use it, and how does it interact with adaptive optimizers? (A minimal sketch covering warmup and clipping mechanics follows this list.)
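For the warmup and clipping follow-ups, a sketch of the mechanics only (PyTorch; `warmup_steps`, `max_norm`, and `loader` are illustrative assumptions, not tuned values):

```python
import torch

model = torch.nn.Linear(128, 10)  # stand-in for a real network
opt = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)

# Linear warmup: ramp the LR from ~0 up to its base value over warmup_steps.
warmup_steps = 1000
sched = torch.optim.lr_scheduler.LambdaLR(
    opt, lambda step: min(1.0, (step + 1) / warmup_steps)
)

for step, (x, y) in enumerate(loader):  # `loader` is a placeholder DataLoader
    loss = torch.nn.functional.cross_entropy(model(x), y)
    opt.zero_grad()
    loss.backward()
    # Clip the global gradient norm before the optimizer consumes the gradients.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    opt.step()
    sched.step()
```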