mlprep / ML Breadth · medium · 12 min

Walk me through backpropagation. How does a neural network actually learn from a mistake?

formulate your answer, then —

You mentioned the activation derivative. What happens to gradients in very deep networks, and how do residual connections address the problem?

formulate your answer, then —

tldr

Backprop applies the chain rule backward through the network, computing ∂L/∂w at each layer from cached forward-pass activations. Vanishing gradients occur when small activation derivatives are multiplied across many layers, shrinking the signal geometrically; ReLU mitigates this because its derivative is 1 for positive inputs. Residual connections add a direct identity gradient path that bypasses the activation derivatives entirely, which is what makes very deep networks trainable.
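
A minimal sketch of both ideas in the tldr, assuming numpy and with all shapes and names purely illustrative: part (1) runs the chain rule by hand through a tiny two-layer MLP, reusing the cached forward activations, and checks one gradient entry with a finite difference; part (2) backpropagates a unit gradient through a deep tanh stack with and without an identity skip path to show how the product of activation derivatives behaves.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- (1) Manual backprop through a 2-layer MLP with a squared-error loss ---
x = rng.normal(size=(4,))             # input
y = rng.normal(size=(2,))             # target
W1 = 0.5 * rng.normal(size=(3, 4))
W2 = 0.5 * rng.normal(size=(2, 3))

# Forward pass: cache pre-activations (z1, z2) and the activation (a1) for reuse backward.
z1 = W1 @ x
a1 = np.maximum(z1, 0.0)              # ReLU
z2 = W2 @ a1
loss = 0.5 * np.sum((z2 - y) ** 2)

# Backward pass: chain rule, layer by layer, using the cached values.
dz2 = z2 - y                          # dL/dz2
dW2 = np.outer(dz2, a1)               # dL/dW2 needs the cached activation a1
da1 = W2.T @ dz2                      # gradient pushed back through W2
dz1 = da1 * (z1 > 0)                  # ReLU derivative: 1 where z1 > 0, else 0
dW1 = np.outer(dz1, x)                # dL/dW1 needs the cached input x

# Finite-difference check on one weight to confirm the chain-rule result.
eps = 1e-5
W1p = W1.copy()
W1p[0, 0] += eps
loss_p = 0.5 * np.sum((W2 @ np.maximum(W1p @ x, 0.0) - y) ** 2)
print("analytic dW1[0,0]:", dW1[0, 0], "  numeric:", (loss_p - loss) / eps)

# --- (2) Vanishing gradients in a deep stack vs. a residual identity path ---
depth, width = 50, 16
Ws = [rng.normal(size=(width, width)) / np.sqrt(width) for _ in range(depth)]

def input_grad_norm(residual: bool) -> float:
    """Norm of dL/dh_0 after backpropagating a unit gradient through `depth` layers."""
    h = rng.normal(size=(width,))
    zs = []
    for W in Ws:                      # forward: cache each pre-activation
        z = W @ h
        zs.append(z)
        h = (h + np.tanh(z)) if residual else np.tanh(z)
    g = np.ones(width)                # pretend dL/dh_depth is all ones
    for W, z in zip(reversed(Ws), reversed(zs)):          # backward
        through_block = W.T @ ((1.0 - np.tanh(z) ** 2) * g)   # tanh' then W^T
        g = (g + through_block) if residual else through_block  # identity path only if residual
    return float(np.linalg.norm(g))

print("plain stack    |dL/dh_0| ~", input_grad_norm(residual=False))
print("residual stack |dL/dh_0| ~", input_grad_norm(residual=True))
```

With these illustrative settings the analytic and numeric gradients should agree to several digits, and the plain 50-layer stack's input gradient norm should collapse toward zero while the residual stack's stays on the order of 1, which is the behavior the answer above describes.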

follow-up

  • How does batch normalization interact with backpropagation, and why does it help with training stability?
  • What's the difference between gradient checkpointing and standard backprop, and when would you use it?
  • How would you debug a network where training loss isn't decreasing at all from epoch 1?