Explain label smoothing and knowledge distillation. Why can softer targets improve generalization, and what are the tradeoffs?
Formulate your own answer first, then compare with the notes below.
tldr
Label smoothing replaces one-hot targets with a mixture of the true label and a uniform distribution, discouraging overconfident predictions. Knowledge distillation trains a student on a teacher's softened output probabilities, transferring the teacher's inter-class similarity structure and enabling cheaper serving. Both forms of softer targets can improve generalization, but label smoothing can distort calibration, and a distilled student inherits the teacher's biases.
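
A minimal sketch of both ideas, assuming PyTorch is available; the function names, the smoothing strength `eps`, the temperature `T`, and the mixing weight `alpha` are illustrative choices, not fixed conventions.

```python
import torch
import torch.nn.functional as F

def smoothed_targets(labels: torch.Tensor, num_classes: int, eps: float = 0.1) -> torch.Tensor:
    """Replace one-hot targets with (1 - eps) on the true class and eps / K spread elsewhere."""
    one_hot = F.one_hot(labels, num_classes).float()
    return one_hot * (1.0 - eps) + eps / num_classes

def distillation_loss(student_logits, teacher_logits, labels, T: float = 4.0, alpha: float = 0.5):
    """Blend hard-label cross-entropy with KL divergence to the teacher's temperature-softened probabilities."""
    hard = F.cross_entropy(student_logits, labels)
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # rescale so soft-target gradients stay comparable across temperatures
    return alpha * soft + (1.0 - alpha) * hard

# Toy usage with random logits, just to show the shapes involved.
student_logits = torch.randn(8, 10)          # batch of 8, 10 classes
teacher_logits = torch.randn(8, 10)          # stand-in for a larger teacher's outputs
labels = torch.randint(0, 10, (8,))

soft_labels = smoothed_targets(labels, num_classes=10, eps=0.1)
ls_loss = F.cross_entropy(student_logits, soft_labels)  # recent PyTorch accepts probability targets
kd_loss = distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5)
```

Recent PyTorch also exposes label smoothing directly via the `label_smoothing` argument of `F.cross_entropy`; the explicit target construction above is only meant to make the mechanism visible.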
follow-up
- Why does temperature help in distillation?
- When can label smoothing hurt calibration?
- How would you validate a distilled model before replacing a larger teacher in production?