mlprep
ML Breadth · medium · 10 min

What is KL divergence? How does it relate to cross-entropy? Where do you see it used in ML?

formulate your answer first, then compare with the tldr below.

tldr

KL(p||q) = H(p,q) - H(p): the extra bits you pay for coding with q when the data actually follow p. Not symmetric. Forward KL (what cross-entropy training minimizes) is mean-seeking: the model spreads mass over every mode of p. Reverse KL (variational inference) is mode-seeking: the model commits to a single mode. Shows up in VAEs (regularizing the latent posterior toward the prior), RLHF (KL penalty keeping the policy close to the reference model), and distillation (matching the teacher's full output distribution).
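
To make the identity concrete: KL(p||q) = sum_x p(x) log(p(x)/q(x)) = -H(p) + H(p,q), so KL is exactly cross-entropy minus entropy. A minimal numeric sketch below checks this and the asymmetry on two small categorical distributions; the values of p and q are made-up toy numbers, not from the card.

```python
import numpy as np

def entropy(p):
    # H(p) = -sum_x p(x) log p(x), in nats
    return -np.sum(p * np.log(p))

def cross_entropy(p, q):
    # H(p, q) = -sum_x p(x) log q(x)
    return -np.sum(p * np.log(q))

def kl(p, q):
    # KL(p || q) = sum_x p(x) log(p(x) / q(x))
    return np.sum(p * np.log(p / q))

# toy categorical distributions (hypothetical values, for illustration only)
p = np.array([0.7, 0.2, 0.1])
q = np.array([0.4, 0.4, 0.2])

print(kl(p, q))                          # ~0.184 nats
print(cross_entropy(p, q) - entropy(p))  # same value: KL(p||q) = H(p,q) - H(p)
print(kl(q, p))                          # ~0.192 nats: KL is not symmetric
```

Natural log gives nats; swap in log base 2 to get the "extra bits" phrasing used in the tldr.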

follow-up

  • In a VAE, what goes wrong if you remove the KL term? What goes wrong if the KL term dominates?
  • Why is forward KL used for supervised learning and reverse KL common in variational inference?
  • How does Jensen-Shannon divergence differ from KL, and why was it preferred in the original GAN objective?