mlprep
ML Breadth · medium · 10 min

What is KL divergence? How does it relate to cross-entropy? Where do you see it used in ML?

formulate your answer first, then compare with the tldr below.

tldr

KL(p||q) = H(p,q) - H(p): the extra bits you pay for coding with q when the data actually follow p. Not symmetric. Forward KL (what cross-entropy training minimizes) is mean-seeking: the model spreads mass over every mode of p. Reverse KL (variational inference) is mode-seeking: the model commits to a single mode. Shows up in VAEs (regularizing the latent posterior toward the prior), RLHF (KL penalty keeping the policy close to the reference model), and distillation (matching the teacher's full output distribution).
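
To make the identity concrete: KL(p||q) = sum_x p(x) log(p(x)/q(x)) = -H(p) + H(p,q), so KL is exactly cross-entropy minus entropy. A minimal numeric sketch below checks this and the asymmetry on two small categorical distributions; the values of p and q are made-up toy numbers, not from the card.

```python
import numpy as np

def entropy(p):
    # H(p) = -sum_x p(x) log p(x), in nats
    return -np.sum(p * np.log(p))

def cross_entropy(p, q):
    # H(p, q) = -sum_x p(x) log q(x)
    return -np.sum(p * np.log(q))

def kl(p, q):
    # KL(p || q) = sum_x p(x) log(p(x) / q(x))
    return np.sum(p * np.log(p / q))

# toy categorical distributions (hypothetical values, for illustration only)
p = np.array([0.7, 0.2, 0.1])
q = np.array([0.4, 0.4, 0.2])

print(kl(p, q))                          # ~0.184 nats
print(cross_entropy(p, q) - entropy(p))  # same value: KL(p||q) = H(p,q) - H(p)
print(kl(q, p))                          # ~0.192 nats: KL is not symmetric
```

Natural log gives nats; swap in log base 2 to get the "extra bits" phrasing used in the tldr.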

follow-up

  • In a VAE, what goes wrong if you remove the KL term? What goes wrong if the KL term dominates?
  • Why is forward KL used for supervised learning and reverse KL common in variational inference?
  • How does Jensen-Shannon divergence differ from KL, and why was it preferred in the original GAN objective?