mlprep / ML Breadth · medium · 10 min

Walk me through learning rate scheduling. What are the common strategies, and how do you choose between them? Why do transformers specifically need warmup?

formulate your answer before reading on —

tldr

  • Warmup: start the LR small and ramp linearly to the target. Effectively required for Adam/AdamW because the running moment estimates (especially the second moment, which divides the update) are noisy in the first few hundred steps, so full-size updates can be large and destabilizing.
  • Cosine annealing: smooth decay along a cosine curve from the base LR toward (near) zero; the best general-purpose schedule.
  • Step decay: legacy CNN schedule with abrupt drops at fixed epochs; cosine is usually preferred now.
  • Linear decay: simple; the standard choice for BERT-style fine-tuning.
  • Use an LR range test to find a good base LR.
  • Rules of thumb: transformers → warmup + cosine or linear decay; CNNs → cosine or step.
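The warmup-then-cosine shape above can be sketched as a single pure function of the step index. This is a minimal illustration, not any library's API; the function name and the default hyperparameters (base LR, warmup length, total steps) are made up for the example:

```python
import math

def lr_at_step(step, base_lr=3e-4, warmup_steps=1000,
               total_steps=10_000, min_lr=0.0):
    """Illustrative schedule: linear warmup to base_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        # Linear warmup: ramp from ~0 up to base_lr over warmup_steps.
        return base_lr * (step + 1) / warmup_steps
    # Cosine decay: progress goes 0 -> 1 over the remaining steps,
    # so the LR follows half a cosine from base_lr down to min_lr.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

In practice you would wrap a function like this in your framework's scheduler hook (e.g. a lambda-based LR scheduler) rather than calling it by hand; the point is that the whole schedule is just a deterministic curve over the step counter.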

follow-up

  • How do you set the warmup duration? Is there a principled approach or is it heuristic?
  • What is the "1cycle policy" and why does cycling momentum inversely with LR help?
  • In distributed training across many GPUs, the effective batch size scales up. How should you adjust the learning rate?