Explain mixed precision training. Why does it speed up deep learning, and what numerical problems do you need to handle?
formulate your answer, then read on
tldr
Mixed precision speeds up training by running most tensor operations in FP16/BF16, which halves memory traffic and engages fast low-precision hardware paths (e.g., NVIDIA Tensor Cores), while keeping FP32 where numerical stability demands it, such as master weights and reductions. FP16's narrow exponent range lets small gradients underflow to zero, so it often needs loss scaling; BF16 keeps FP32's exponent range and is more numerically forgiving. Monitor overflows, NaNs, convergence, and model quality against an FP32 baseline.
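As a concrete illustration, here is a minimal FP16 training-loop sketch in PyTorch, assuming a CUDA device and a toy model: `torch.autocast` runs eligible ops in FP16 while `GradScaler` provides the dynamic loss scaling mentioned above.

```python
import torch
from torch import nn

# Toy setup; any model/optimizer follows the same pattern.
device = "cuda"
model = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 10)).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()  # dynamic loss scaling for FP16

for step in range(100):
    x = torch.randn(64, 1024, device=device)
    y = torch.randint(0, 10, (64,), device=device)
    optimizer.zero_grad(set_to_none=True)

    # Forward pass in FP16 where safe; autocast keeps sensitive ops in FP32.
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = nn.functional.cross_entropy(model(x), y)

    scaler.scale(loss).backward()  # scale the loss so small gradients don't underflow
    scaler.step(optimizer)         # unscales gradients; skips the step on inf/NaN
    scaler.update()                # grows/shrinks the scale based on overflow history
```

With BF16 you would typically pass `dtype=torch.bfloat16` to `autocast` and drop the scaler entirely, since BF16's exponent range makes gradient underflow far less likely.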
follow-up
- Why is BF16 often more stable than FP16?
- What is dynamic loss scaling? (sketched after this list)
- Which parts of Adam training are risky to store only in low precision?
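For the dynamic loss scaling follow-up, a hand-rolled sketch of the mechanism (the function name and `state` dict are hypothetical, for illustration only): scale the loss up before backward, unscale the gradients, skip the step and shrink the scale on overflow, and grow the scale again after a run of clean steps.

```python
import torch

def dynamic_loss_scale_step(model, loss, optimizer, state,
                            growth_factor=2.0, backoff_factor=0.5,
                            growth_interval=2000):
    """One training step with hand-rolled dynamic loss scaling.

    `state` carries {"scale": float, "good_steps": int} across steps.
    """
    (loss * state["scale"]).backward()  # amplify loss so tiny FP16 grads survive

    # Unscale gradients back to their true magnitude and check for overflow.
    found_inf = False
    for p in model.parameters():
        if p.grad is not None:
            p.grad.div_(state["scale"])
            if not torch.isfinite(p.grad).all():
                found_inf = True

    if found_inf:
        # Overflow: drop this step and back off the scale.
        state["scale"] *= backoff_factor
        state["good_steps"] = 0
    else:
        optimizer.step()
        state["good_steps"] += 1
        # After a long run of clean steps, probe a larger scale again.
        if state["good_steps"] % growth_interval == 0:
            state["scale"] *= growth_factor
    optimizer.zero_grad(set_to_none=True)
```

This is the same scheme `GradScaler` implements internally: back off fast on overflow, grow slowly when gradients stay finite.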