Explain label smoothing and knowledge distillation. Why can softer targets improve generalization, and what are the tradeoffs?
Formulate your own answer first, then compare with the notes below.
tldr
Label smoothing replaces one-hot targets with a mixture of the true label and a uniform distribution, discouraging overconfident predictions. Knowledge distillation trains a student on a teacher's softened output probabilities, transferring the teacher's inter-class similarity structure and enabling cheaper serving. Both forms of softer targets can improve generalization, but label smoothing can distort calibration, and a distilled student inherits the teacher's biases.
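
A minimal sketch of both ideas, assuming PyTorch is available; the function names, the smoothing strength `eps`, the temperature `T`, and the mixing weight `alpha` are illustrative choices, not fixed conventions.

```python
import torch
import torch.nn.functional as F

def smoothed_targets(labels: torch.Tensor, num_classes: int, eps: float = 0.1) -> torch.Tensor:
    """Replace one-hot targets with (1 - eps) on the true class and eps / K spread elsewhere."""
    one_hot = F.one_hot(labels, num_classes).float()
    return one_hot * (1.0 - eps) + eps / num_classes

def distillation_loss(student_logits, teacher_logits, labels, T: float = 4.0, alpha: float = 0.5):
    """Blend hard-label cross-entropy with KL divergence to the teacher's temperature-softened probabilities."""
    hard = F.cross_entropy(student_logits, labels)
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # rescale so soft-target gradients stay comparable across temperatures
    return alpha * soft + (1.0 - alpha) * hard

# Toy usage with random logits, just to show the shapes involved.
student_logits = torch.randn(8, 10)          # batch of 8, 10 classes
teacher_logits = torch.randn(8, 10)          # stand-in for a larger teacher's outputs
labels = torch.randint(0, 10, (8,))

soft_labels = smoothed_targets(labels, num_classes=10, eps=0.1)
ls_loss = F.cross_entropy(student_logits, soft_labels)  # recent PyTorch accepts probability targets
kd_loss = distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5)
```

Recent PyTorch also exposes label smoothing directly via the `label_smoothing` argument of `F.cross_entropy`; the explicit target construction above is only meant to make the mechanism visible.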
follow-up
- Why does temperature help in distillation?
- When can label smoothing hurt calibration?
- How would you validate a distilled model before replacing a larger teacher in production?