Walk me through gradient descent variants — SGD, momentum, Adam. How do you decide which to use?
formulate your answer, then —
You mentioned Adam's bias correction — what exactly does it fix, and are there situations where Adam is the wrong choice?
formulate your answer, then —
tldr
Adam combines momentum (a smoothed gradient direction) + RMSProp (an adaptive per-parameter learning rate) + bias correction (a fix for the moment estimates starting at zero, which biases them toward zero early in training). It's the default for transformers and NLP. SGD + momentum often generalizes better for vision, which is commonly attributed to it finding flatter minima. Always prefer AdamW over Adam when weight decay is used.
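As a quick mental model, here is a minimal sketch of a single AdamW step (the function name and hyperparameter defaults are assumptions, loosely mirroring common PyTorch-style values, not a reference implementation); it shows where momentum, the adaptive denominator, bias correction, and decoupled weight decay each enter:

```python
import numpy as np

def adamw_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.0):
    """One illustrative AdamW update for a single parameter array.

    m, v are the running first/second moment estimates (initialized to zeros),
    t is the 1-indexed step count. Setting weight_decay=0.0 reduces this to
    plain Adam with bias correction.
    """
    # Momentum: exponential moving average of the gradient direction.
    m = beta1 * m + (1 - beta1) * grad
    # RMSProp piece: exponential moving average of the squared gradient,
    # giving each parameter its own effective learning rate.
    v = beta2 * v + (1 - beta2) * grad ** 2

    # Bias correction: because m and v start at zero, the early averages are
    # biased toward zero; dividing by (1 - beta^t) rescales them. The effect
    # is largest in the first few hundred steps and fades as t grows.
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)

    # AdamW: weight decay is applied directly to the weights (decoupled from
    # the adaptive update), rather than added to the gradient as L2
    # regularization would be in standard Adam.
    param = param - lr * (m_hat / (np.sqrt(v_hat) + eps) + weight_decay * param)
    return param, m, v
```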
follow-up
- What is learning rate scheduling and how does warmup interact with Adam's bias correction?
- How does weight decay in AdamW differ from L2 regularization in standard Adam?
- What are the practical signs that your optimizer has converged to a sharp minimum, and how would you address it?