What is focal loss? Why was it introduced, and how does it differ from cross-entropy? When would you reach for it?
formulate your answer, then —
tldr
Focal loss is cross-entropy scaled by a modulating factor: FL(p_t) = -(1 - p_t)^γ log(p_t), where p_t is the model's predicted probability of the true class. The factor downweights easy, well-classified examples (p_t → 1) so hard examples dominate the gradient. γ = 0 recovers plain CE; γ = 2 is the standard default, usually combined with a class-balancing weight α_t (α = 0.25 in the RetinaNet paper, where the loss was introduced for single-stage detection). Reach for it in dense detection, where tens of thousands of mostly-background anchors would otherwise swamp the loss, or in any task with severe class imbalance where easy negatives drown out hard positives. Simple weighted cross-entropy is still the first thing to try.
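A minimal NumPy sketch of the binary α-balanced focal loss described above (the function name and signature are my own; the formula follows the tldr):

```python
import numpy as np

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary alpha-balanced focal loss: -alpha_t * (1 - p_t)^gamma * log(p_t).

    p: predicted probability of the positive class; y: label in {0, 1}.
    gamma=0 and alpha=1 recover plain cross-entropy for positives.
    """
    p_t = np.where(y == 1, p, 1 - p)           # probability of the true class
    alpha_t = np.where(y == 1, alpha, 1 - alpha)
    return -alpha_t * (1 - p_t) ** gamma * np.log(p_t)

# An easy, confident negative (p=0.01, y=0) is heavily downweighted by
# (1 - p_t)^2, while a hard positive (p=0.3, y=1) keeps most of its CE loss:
easy = focal_loss(0.01, 0)
hard = focal_loss(0.3, 1)
```

In practice you would compute this on logits with a numerically stable log-sigmoid rather than on probabilities directly.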
follow-up
- How would you set the α and γ hyperparameters? Is there a systematic way?
- Focal loss was designed for single-stage detectors. Could you use it for NLP tasks with class imbalance?
- What are alternatives to focal loss for handling class imbalance, and how do they compare?