Explain maximum likelihood estimation. How does it connect to cross-entropy loss in classification? What are its assumptions and when do they break down?
tldr
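MLE picks the parameters θ that maximize the likelihood of the observed data, p(data | θ). In practice we maximize the log-likelihood instead: for i.i.d. data the product of per-example probabilities becomes a sum, which avoids numerical underflow and gives simpler gradients. Cross-entropy loss is the negative log-likelihood under a Bernoulli (binary) or categorical (multiclass) model. Minimizing MSE is MLE under Gaussian noise with fixed variance. Assumptions: the data are i.i.d. and the model family is correctly specified. Under those assumptions and standard regularity conditions, MLE is consistent and asymptotically efficient, but these are large-sample guarantees. With small data MLE overfits because nothing regularizes it, and with a misspecified model it converges to the member of the model family closest to the true distribution in KL divergence rather than to the truth. MAP adds a log-prior to the log-likelihood, which makes it regularized MLE: a Gaussian prior gives L2 regularization and a Laplace prior gives L1.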
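The two loss equivalences can be checked numerically. The sketch below is illustrative only: it assumes NumPy, and the logits, labels, and helper names (`softmax`, `categorical_nll`, `cross_entropy`, `gaussian_nll`) are hypothetical, made up for this example.

```python
import numpy as np

def softmax(logits):
    # Subtract the row-wise max before exponentiating for numerical stability.
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def categorical_nll(probs, labels):
    # Mean negative log-likelihood of the observed labels under a categorical model.
    n = labels.shape[0]
    return -np.log(probs[np.arange(n), labels]).mean()

def cross_entropy(probs, labels):
    # Mean cross-entropy between one-hot targets and predicted probabilities.
    one_hot = np.eye(probs.shape[1])[labels]
    return -(one_hot * np.log(probs)).sum(axis=1).mean()

def gaussian_nll(y, y_hat, sigma=1.0):
    # Summed negative log-likelihood of y under y ~ Normal(y_hat, sigma^2).
    n = y.shape[0]
    return 0.5 * n * np.log(2 * np.pi * sigma**2) + ((y - y_hat) ** 2).sum() / (2 * sigma**2)

rng = np.random.default_rng(0)

# Classification: hypothetical model outputs and class labels.
logits = rng.normal(size=(5, 3))
labels = np.array([0, 2, 1, 1, 0])
probs = softmax(logits)
nll = categorical_nll(probs, labels)
ce = cross_entropy(probs, labels)

# Regression: hypothetical targets and predictions.
y = rng.normal(size=8)
y_hat = rng.normal(size=8)
n = y.shape[0]
mse = ((y - y_hat) ** 2).mean()
# With sigma fixed at 1, the Gaussian NLL is an affine function of the MSE,
# so both objectives share the same minimizer.
affine_in_mse = 0.5 * n * np.log(2 * np.pi) + n * mse / 2

print(np.isclose(nll, ce), np.isclose(gaussian_nll(y, y_hat), affine_in_mse))
```

The one-hot target zeroes out every term except the log-probability of the true class, which is why the two classification losses coincide.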
follow-up
- Why does L2 regularization correspond to a Gaussian prior in the MAP framework?
- What is the Fisher information matrix and how does it relate to the variance of MLE estimates?
- When would you prefer MAP estimation over MLE for a neural network?
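
As a starting point for the first follow-up, here is a sketch of the derivation. It assumes a zero-mean isotropic Gaussian prior; the symbols θ (parameters), D (data), τ (prior standard deviation), and λ (regularization strength) are introduced here for illustration.

```latex
\begin{aligned}
\hat{\theta}_{\text{MAP}}
  &= \arg\max_{\theta}\; \log p(D \mid \theta) + \log p(\theta),
  \qquad p(\theta) = \mathcal{N}(\theta \mid 0, \tau^2 I) \\
  &= \arg\max_{\theta}\; \log p(D \mid \theta) - \frac{1}{2\tau^2}\lVert\theta\rVert_2^2 + \text{const} \\
  &= \arg\min_{\theta}\; -\log p(D \mid \theta) + \lambda \lVert\theta\rVert_2^2,
  \qquad \lambda = \frac{1}{2\tau^2}
\end{aligned}
```

A tighter prior (smaller τ) means a larger λ, so stronger regularization. Replacing the Gaussian with a Laplace prior turns the squared norm into an L1 norm by the same argument.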