Explain maximum likelihood estimation. How does it connect to cross-entropy loss in classification? What are its assumptions and when do they break down?

Question

Accepted Answer

The core idea

Given observed data X = (x_1, ..., x_n) and a parametric model p(x | θ), find the parameters θ that make the observed data most probable:

θ_MLE = argmax_θ  L(θ; X) = argmax_θ  Π_i p(x_i | θ)

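The likelihood L(θ; X) is the joint probability (or density) of all observations, viewed as a function of θ: the data is fixed and θ varies. Writing it as a product assumes the observations are independent and identically distributed. MLE asks: under which θ is the observed data most probable?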
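As a small illustration, here is a sketch for a Bernoulli model (coin flips). The data in `flips` is made up and the helper `likelihood` is a name chosen for this example only. It compares a brute-force search over candidate θ values with the closed-form Bernoulli MLE, which is the sample mean.

```python
import numpy as np

# Hypothetical data: 10 coin flips, 7 heads (1 = heads).
flips = np.array([1, 1, 0, 1, 1, 0, 1, 1, 0, 1])

def likelihood(theta, data):
    """Joint probability of i.i.d. Bernoulli observations for a given theta."""
    return np.prod(theta ** data * (1 - theta) ** (1 - data))

# Evaluate the likelihood on a grid of candidate parameters (step 0.01).
grid = np.linspace(0.01, 0.99, 99)
values = np.array([likelihood(t, flips) for t in grid])
theta_grid = grid[np.argmax(values)]

# Closed-form Bernoulli MLE: the fraction of heads.
theta_closed = flips.mean()

print(theta_grid, theta_closed)
```

Both approaches should agree on θ ≈ 0.7, the observed fraction of heads.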

Why log-likelihood

The product Π_i p(x_i | θ) is numerically catastrophic for large n — multiplying 1000 probabilities gives numbers near zero, causing underflow. Take the log:

log L(θ; X) = Σ_i log p(x_i | θ)

Log is monotone so argmax is preserved. The sum is numerically stable. Gradients are sums, not products — much easier to differentiate.
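The underflow is easy to reproduce. The sketch below uses randomly generated stand-in probabilities (not output from any real model) and compares the raw product with the sum of logs in float64.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-observation probabilities, each in [0.01, 0.5).
probs = rng.uniform(0.01, 0.5, size=1000)

# The raw product is far below the smallest positive float64 and collapses to 0.0.
product = np.prod(probs)

# The sum of logs is an ordinary negative number.
log_sum = np.sum(np.log(probs))

print(product, log_sum)
```

Once the product has collapsed to zero, every θ looks equally bad, so the argmax is meaningless. The log-likelihood keeps the comparison intact.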

MLE = minimizing cross-entropy

For binary classification with logistic regression, the model outputs p(y=1 | x; θ). Log-likelihood:

log L = Σ_i [y_i log p_i + (1 - y_i) log(1 - p_i)]

Maximizing this is identical to minimizing binary cross-entropy loss:

L_CE = -Σ_i [y_i log p_i + (1 - y_i) log(1 - p_i)]

Training a logistic regression or any classification model with cross-entropy loss is exactly MLE under the assumption that labels are drawn from a Bernoulli (binary) or categorical (multi-class) distribution.
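To make the equivalence concrete, here is a minimal sketch on synthetic data. The weights `true_w`, the learning rate, and the iteration count are arbitrary choices for illustration. It fits logistic regression by gradient descent on the cross-entropy and then checks that the loss equals the negated Bernoulli log-likelihood.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bernoulli_log_likelihood(y, p):
    # log of prod_i p_i^y_i (1 - p_i)^(1 - y_i): pick p_i when y_i = 1, else 1 - p_i.
    return np.sum(np.log(np.where(y == 1, p, 1 - p)))

def binary_cross_entropy(y, p):
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Synthetic data: labels are Bernoulli draws with p = sigmoid(x . true_w).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
true_w = np.array([1.5, -2.0])
y = (rng.uniform(size=200) < sigmoid(X @ true_w)).astype(float)

# Gradient descent on the mean cross-entropy, i.e. gradient ascent on the log-likelihood.
w = np.zeros(2)
for _ in range(500):
    p = sigmoid(X @ w)
    grad = X.T @ (p - y) / len(y)  # gradient of mean BCE with respect to w
    w -= 0.1 * grad

p = sigmoid(X @ w)
print(binary_cross_entropy(y, p), -bernoulli_log_likelihood(y, p))
```

The two printed numbers should match up to floating-point error: the loss and the likelihood are the same objective with opposite sign.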

MLE for regression = MSE

Assume y_i = f(x_i; θ) + ε, where ε ~ N(0, σ²). The log-likelihood:

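log L = -Σ_i (y_i - f(x_i; θ))² / (2σ²) - (n/2) log(2πσ²)

The second term does not depend on θ, so for fixed σ, maximizing log L over θ is equivalent to minimizing the sum of squared errors, and hence the mean squared error. MSE loss in regression is MLE under an i.i.d. Gaussian noise assumption with constant variance.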
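A quick numerical check of this claim, as a sketch: the linear model, the noise level 0.3, and the function name `gaussian_log_likelihood` are assumptions made for the example. The least-squares solution should also be the point where the Gaussian log-likelihood is highest.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data from y = 2x + 0.5 + Gaussian noise (sigma = 0.3).
x = rng.uniform(-1, 1, size=100)
y = 2.0 * x + 0.5 + rng.normal(scale=0.3, size=100)
A = np.column_stack([x, np.ones_like(x)])  # design matrix: slope and intercept

def gaussian_log_likelihood(theta, sigma=0.3):
    resid = y - A @ theta
    n = len(y)
    return -np.sum(resid ** 2) / (2 * sigma ** 2) - 0.5 * n * np.log(2 * np.pi * sigma ** 2)

# Ordinary least squares minimizes the sum of squared residuals.
theta_ls, *_ = np.linalg.lstsq(A, y, rcond=None)

# Moving away from the least-squares solution lowers the log-likelihood.
print(gaussian_log_likelihood(theta_ls))
print(gaussian_log_likelihood(theta_ls + np.array([0.05, 0.05])))
```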

Properties of MLE

Consistency: as n → ∞, θ_MLE → θ_true in probability, provided the model is correctly specified and identifiable and standard regularity conditions hold.

Asymptotic efficiency: under the same regularity conditions, the asymptotic variance of MLE attains the Cramér-Rao lower bound, so no other regular estimator has lower variance in large samples.

Asymptotic normality: for large n, θ_MLE is approximately normally distributed around θ_true, with covariance equal to the inverse of the per-observation Fisher information divided by n. This enables confidence intervals on model parameters.
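These properties can be checked by simulation. The sketch below assumes a Bernoulli model with θ = 0.3, where the MLE is the sample mean and the Fisher information is 1 / (θ(1 - θ)). The sample size and replicate count are arbitrary. If the theory holds, the standardized estimates should have mean near 0 and standard deviation near 1.

```python
import numpy as np

rng = np.random.default_rng(0)
theta_true, n, reps = 0.3, 500, 2000

# Each replicate: n Bernoulli draws, summarized by the count of ones.
# The MLE for each replicate is the sample mean.
theta_hats = rng.binomial(n, theta_true, size=reps) / n

# Cramér-Rao bound on the standard deviation: sqrt(1 / (n * I(theta))),
# with I(theta) = 1 / (theta * (1 - theta)) for a Bernoulli model.
crb_std = np.sqrt(theta_true * (1 - theta_true) / n)

# Standardize the estimates; asymptotic normality predicts roughly N(0, 1).
z = (theta_hats - theta_true) / crb_std
print(z.mean(), z.std())
```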

When MLE breaks down

Overfitting with small data: MLE maximizes fit to the observed data only. With n=10 data points and 100 parameters, MLE can fit the training data perfectly yet generalize poorly, because vanilla MLE includes no regularization.

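Model misspecification: MLE finds the best parameters within the model family. If the true distribution isn't in your family (e.g., data is actually bimodal but you fit a Gaussian), MLE still returns an estimate, but it converges to the member of the family closest to the truth in KL divergence rather than to any true parameter, and the usual standard errors are no longer valid.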

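Unidentifiability: if multiple θ values give the same likelihood (e.g., swapping the component labels in a mixture model), the MLE isn't unique. EM can still find a local maximum of a mixture likelihood, but it does not remove the ambiguity; that requires a constraint, such as an ordering on the component means.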

Heavy tails: the Gaussian noise assumption (implicit in MSE) is a poor fit for outlier-heavy data. Because the Gaussian penalty is quadratic in the residual, a few outliers can dominate the fit. Alternative: MLE under Laplace noise, which gives MAE loss.
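The heavy-tail point shows up even when estimating a single location parameter. In this sketch the data is made up: five values near 1 and one outlier at 50. The Gaussian MLE of the location is the mean and the Laplace MLE is the median.

```python
import numpy as np

# Hypothetical measurements with one gross outlier.
data = np.array([1.0, 1.2, 0.9, 1.1, 1.0, 50.0])

def sum_squared_error(mu):
    return np.sum((data - mu) ** 2)

def sum_absolute_error(mu):
    return np.sum(np.abs(data - mu))

mu_gauss = data.mean()          # Gaussian MLE: minimizes squared error
mu_laplace = np.median(data)    # Laplace MLE: minimizes absolute error

print(mu_gauss, mu_laplace)
```

The mean is dragged to about 9.2 by the single outlier, while the median stays at 1.05, close to the bulk of the data.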

MAP = MLE + regularization

Maximum A Posteriori (MAP) adds a prior p(θ):

θ_MAP = argmax_θ [log p(X | θ) + log p(θ)]

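L2 regularization = MAP with a zero-mean Gaussian prior on weights. L1 regularization = MAP with a zero-mean Laplace prior. Regularization isn't ad hoc: it has a principled Bayesian interpretation, although MAP is still a point estimate rather than full posterior inference.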
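For linear regression the L2 case can be verified directly. The sketch below assumes Gaussian noise with known variance `sigma2` and a zero-mean Gaussian prior with variance `tau2` on each weight; both values and the synthetic data are arbitrary. Under those assumptions the MAP estimate is the ridge solution with penalty λ = σ² / τ².

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression problem: 20 observations, 5 weights.
A = rng.normal(size=(20, 5))
y = A @ np.array([1.0, 0.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.5, size=20)

sigma2, tau2 = 0.25, 1.0   # noise variance and prior variance (assumed known)
lam = sigma2 / tau2        # equivalent ridge penalty

def neg_log_posterior(theta):
    # Negative log-likelihood plus negative log-prior, dropping constants.
    nll = np.sum((y - A @ theta) ** 2) / (2 * sigma2)
    neg_log_prior = np.sum(theta ** 2) / (2 * tau2)
    return nll + neg_log_prior

# Ridge closed form: (A^T A + lam I)^-1 A^T y.
theta_map = np.linalg.solve(A.T @ A + lam * np.eye(5), A.T @ y)

# Unregularized MLE (ordinary least squares) for comparison.
theta_mle = np.linalg.lstsq(A, y, rcond=None)[0]

# Gradient of the negative log-posterior; it should vanish at theta_map.
grad_at_map = -A.T @ (y - A @ theta_map) / sigma2 + theta_map / tau2
print(np.linalg.norm(theta_map), np.linalg.norm(theta_mle))
```

The MAP weights have a smaller norm than the MLE weights, which is the shrinkage that L2 regularization is known for.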