Implement softmax and cross-entropy loss in NumPy. Then tell me: what goes wrong with the naive implementation, and how do you fix it?
formulate your answer, then —
Your cross-entropy implementation handles multi-class classification. How would you modify it for binary classification, and is there a numerical stability issue there too?
formulate your answer, then —
tldr
Naive softmax overflows for large logits (np.exp(1000) is inf even in float64). Fix: subtract the row max before exponentiating; softmax is shift-invariant, so the output is unchanged. For cross-entropy, compute log-softmax directly in log-space as (x - m) - log(sum(exp(x - m))) rather than log(softmax(x)), so an underflowed probability never reaches log(0). Binary cross-entropy has the same issue: use the max(x, 0) - x*y + log(1 + exp(-|x|)) form that PyTorch's BCEWithLogitsLoss uses internally.
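A minimal NumPy sketch of all three fixes; the function names and the small demo at the bottom are my own illustration, not any library's API:

```python
import numpy as np

def softmax(logits):
    """Stable softmax: subtract the row max so np.exp sees only values <= 0."""
    shifted = logits - logits.max(axis=-1, keepdims=True)  # shift-invariant
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def log_softmax(logits):
    """Log-softmax in log-space: (x - m) - log(sum(exp(x - m))), m = max(x)."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def cross_entropy(logits, labels):
    """Mean cross-entropy from raw logits and integer class labels."""
    logp = log_softmax(logits)
    return -logp[np.arange(len(labels)), labels].mean()

def bce_with_logits(logits, targets):
    """Stable binary cross-entropy: max(x, 0) - x*y + log(1 + exp(-|x|)).

    Using -|x| keeps np.exp's argument non-positive, so neither sign of
    the logit can overflow; np.log1p stays accurate when exp(-|x|) is tiny.
    """
    return (np.maximum(logits, 0) - logits * targets
            + np.log1p(np.exp(-np.abs(logits)))).mean()

logits = np.array([[1000.0, 0.0, -1000.0]])        # naive exp would overflow
print(softmax(logits))                             # [[1. 0. 0.]], no NaNs
print(cross_entropy(logits, np.array([0])))        # ~0.0, not nan
print(bce_with_logits(np.array([100.0, -100.0]),
                      np.array([1.0, 0.0])))       # ~0.0
```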
follow-up
- How would you implement the backward pass for softmax cross-entropy? What's the surprisingly clean form of the gradient? (A standalone sketch follows this list.)
- What is label smoothing and how do you modify the cross-entropy loss to implement it?
- If you were implementing this in a production training loop, what would you do differently from a pure NumPy implementation?
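For the first follow-up, a hedged standalone sketch (helper names are mine) of the claim it hints at: the gradient of mean softmax cross-entropy with respect to the logits collapses to `(softmax(x) - one_hot(y)) / N`, which a quick finite-difference check confirms.

```python
import numpy as np

def _log_softmax(x):
    shifted = x - x.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def xent(logits, labels):
    """Mean softmax cross-entropy from logits and integer labels."""
    return -_log_softmax(logits)[np.arange(len(labels)), labels].mean()

def xent_grad(logits, labels):
    """The clean form: d(mean CE)/d(logits) = (softmax(x) - one_hot(y)) / N."""
    probs = np.exp(_log_softmax(logits))
    probs[np.arange(len(labels)), labels] -= 1.0  # subtract the one-hot target
    return probs / len(labels)

# Finite-difference spot check of one logit entry.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))
y = np.array([0, 2, 1, 1])
g = xent_grad(x, y)
eps = 1e-6
xp = x.copy()
xp[0, 0] += eps
numeric = (xent(xp, y) - xent(x, y)) / eps
assert abs(numeric - g[0, 0]) < 1e-4
print("analytic", g[0, 0], "numeric", numeric)
```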