What is focal loss? Why was it introduced, and how does it differ from cross-entropy? When would you reach for it?
formulate your answer, then —
tldr
Focal loss is cross-entropy scaled by a modulating factor: FL(p_t) = -(1 - p_t)^γ log(p_t), where p_t is the model's predicted probability of the true class. The factor downweights easy, well-classified examples (p_t → 1) so hard examples dominate the gradient. γ = 0 recovers plain CE; γ = 2 is the standard default, usually combined with a class-balancing weight α_t (α = 0.25 in the RetinaNet paper, where the loss was introduced for single-stage detection). Reach for it in dense detection, where tens of thousands of mostly-background anchors would otherwise swamp the loss, or in any task with severe class imbalance where easy negatives drown out hard positives. Simple weighted cross-entropy is still the first thing to try.
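A minimal NumPy sketch of the binary α-balanced focal loss described above (the function name and signature are my own; the formula follows the tldr):

```python
import numpy as np

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary alpha-balanced focal loss: -alpha_t * (1 - p_t)^gamma * log(p_t).

    p: predicted probability of the positive class; y: label in {0, 1}.
    gamma=0 and alpha=1 recover plain cross-entropy for positives.
    """
    p_t = np.where(y == 1, p, 1 - p)           # probability of the true class
    alpha_t = np.where(y == 1, alpha, 1 - alpha)
    return -alpha_t * (1 - p_t) ** gamma * np.log(p_t)

# An easy, confident negative (p=0.01, y=0) is heavily downweighted by
# (1 - p_t)^2, while a hard positive (p=0.3, y=1) keeps most of its CE loss:
easy = focal_loss(0.01, 0)
hard = focal_loss(0.3, 1)
```

In practice you would compute this on logits with a numerically stable log-sigmoid rather than on probabilities directly.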
follow-up
- How would you set the α and γ hyperparameters? Is there a systematic way?
- Focal loss was designed for single-stage detectors. Could you use it for NLP tasks with class imbalance?
- What are alternatives to focal loss for handling class imbalance, and how do they compare?