Walk me through gradient descent variants — SGD, momentum, Adam. How do you decide which to use?
formulate your answer, then —
You mentioned Adam's bias correction — what exactly does it fix, and are there situations where Adam is the wrong choice?
formulate your answer, then —
tldr
Adam combines momentum (a smoothed gradient direction) + RMSProp (an adaptive per-parameter learning rate) + bias correction (a fix for the moment estimates starting at zero, which biases them toward zero early in training). It's the default for transformers and NLP. SGD + momentum often generalizes better for vision, which is commonly attributed to it finding flatter minima. Always prefer AdamW over Adam when weight decay is used.
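As a quick mental model, here is a minimal sketch of a single AdamW step (the function name and hyperparameter defaults are assumptions, loosely mirroring common PyTorch-style values, not a reference implementation); it shows where momentum, the adaptive denominator, bias correction, and decoupled weight decay each enter:

```python
import numpy as np

def adamw_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.0):
    """One illustrative AdamW update for a single parameter array.

    m, v are the running first/second moment estimates (initialized to zeros),
    t is the 1-indexed step count. Setting weight_decay=0.0 reduces this to
    plain Adam with bias correction.
    """
    # Momentum: exponential moving average of the gradient direction.
    m = beta1 * m + (1 - beta1) * grad
    # RMSProp piece: exponential moving average of the squared gradient,
    # giving each parameter its own effective learning rate.
    v = beta2 * v + (1 - beta2) * grad ** 2

    # Bias correction: because m and v start at zero, the early averages are
    # biased toward zero; dividing by (1 - beta^t) rescales them. The effect
    # is largest in the first few hundred steps and fades as t grows.
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)

    # AdamW: weight decay is applied directly to the weights (decoupled from
    # the adaptive update), rather than added to the gradient as L2
    # regularization would be in standard Adam.
    param = param - lr * (m_hat / (np.sqrt(v_hat) + eps) + weight_decay * param)
    return param, m, v
```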
follow-up
- What is learning rate scheduling and how does warmup interact with Adam's bias correction?
- How does weight decay in AdamW differ from L2 regularization in standard Adam?
- What are the practical signs that your optimizer has converged to a sharp minimum, and how would you address it?