Explain mixed precision training. Why does it speed up deep learning, and what numerical problems do you need to handle?
formulate your answer, then read on
tldr
Mixed precision speeds up training by running most tensor operations in FP16/BF16, which halves memory traffic and engages fast low-precision hardware paths (e.g., NVIDIA Tensor Cores), while keeping FP32 where numerical stability demands it, such as master weights and reductions. FP16's narrow exponent range lets small gradients underflow to zero, so it often needs loss scaling; BF16 keeps FP32's exponent range and is more numerically forgiving. Monitor overflows, NaNs, convergence, and model quality against an FP32 baseline.
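As a concrete illustration, here is a minimal FP16 training-loop sketch in PyTorch, assuming a CUDA device and a toy model: `torch.autocast` runs eligible ops in FP16 while `GradScaler` provides the dynamic loss scaling mentioned above.

```python
import torch
from torch import nn

# Toy setup; any model/optimizer follows the same pattern.
device = "cuda"
model = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 10)).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()  # dynamic loss scaling for FP16

for step in range(100):
    x = torch.randn(64, 1024, device=device)
    y = torch.randint(0, 10, (64,), device=device)
    optimizer.zero_grad(set_to_none=True)

    # Forward pass in FP16 where safe; autocast keeps sensitive ops in FP32.
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = nn.functional.cross_entropy(model(x), y)

    scaler.scale(loss).backward()  # scale the loss so small gradients don't underflow
    scaler.step(optimizer)         # unscales gradients; skips the step on inf/NaN
    scaler.update()                # grows/shrinks the scale based on overflow history
```

With BF16 you would typically pass `dtype=torch.bfloat16` to `autocast` and drop the scaler entirely, since BF16's exponent range makes gradient underflow far less likely.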
follow-up
- Why is BF16 often more stable than FP16?
- What is dynamic loss scaling? (sketched after this list)
- Which parts of Adam training are risky to store only in low precision?
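For the dynamic loss scaling follow-up, a hand-rolled sketch of the mechanism (the function name and `state` dict are hypothetical, for illustration only): scale the loss up before backward, unscale the gradients, skip the step and shrink the scale on overflow, and grow the scale again after a run of clean steps.

```python
import torch

def dynamic_loss_scale_step(model, loss, optimizer, state,
                            growth_factor=2.0, backoff_factor=0.5,
                            growth_interval=2000):
    """One training step with hand-rolled dynamic loss scaling.

    `state` carries {"scale": float, "good_steps": int} across steps.
    """
    (loss * state["scale"]).backward()  # amplify loss so tiny FP16 grads survive

    # Unscale gradients back to their true magnitude and check for overflow.
    found_inf = False
    for p in model.parameters():
        if p.grad is not None:
            p.grad.div_(state["scale"])
            if not torch.isfinite(p.grad).all():
                found_inf = True

    if found_inf:
        # Overflow: drop this step and back off the scale.
        state["scale"] *= backoff_factor
        state["good_steps"] = 0
    else:
        optimizer.step()
        state["good_steps"] += 1
        # After a long run of clean steps, probe a larger scale again.
        if state["good_steps"] % growth_interval == 0:
            state["scale"] *= growth_factor
    optimizer.zero_grad(set_to_none=True)
```

This is the same scheme `GradScaler` implements internally: back off fast on overflow, grow slowly when gradients stay finite.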