Explain how RNNs work and where they still fit in the ML landscape today.
formulate your answer, then —
You mentioned vanishing gradients — how do LSTMs actually solve this, and what's the core mechanism that makes the difference?
formulate your answer, then —
tldr
RNNs model sequences with a recurrent hidden state but suffer from vanishing gradients over long sequences. LSTMs fix this with a cell state that is updated additively instead of being rewritten through the recurrent weights and a squashing nonlinearity at every step: the forget gate acts as a gradient valve, and when it stays near 1, error signals flow backward across hundreds of steps with little attenuation.
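A minimal sketch of the mechanism, assuming a single NumPy LSTM cell step with hypothetical weight names and shapes (not tied to any particular framework): the new cell state is the old cell state scaled by the forget gate plus a gated candidate, so the backward path through the cell state is a sum rather than another pass through the recurrent weight matrix and tanh.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM cell step (illustrative sketch).

    Hypothetical shapes: x (d_in,), h_prev and c_prev (d_h,),
    W (4*d_h, d_h + d_in), b (4*d_h,).
    """
    # Single matrix produces all four gate pre-activations.
    z = W @ np.concatenate([h_prev, x]) + b
    i, f, o, g = np.split(z, 4)

    i = sigmoid(i)   # input gate: how much of the candidate to write
    f = sigmoid(f)   # forget gate: how much of the old cell state to keep
    o = sigmoid(o)   # output gate: how much of the cell state to expose
    g = np.tanh(g)   # candidate values

    # Additive cell-state update: the gradient w.r.t. c_prev is scaled only
    # by f, so with f near 1 it passes back through many steps intact.
    c = f * c_prev + i * g
    h = o * np.tanh(c)
    return h, c
```

Contrast this with the vanilla RNN update, h = tanh(W_h @ h_prev + W_x @ x + b): there, every backward step multiplies the gradient by the recurrent weights and the tanh derivative, and it is this repeated multiplication that drives gradients toward zero over long sequences.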
follow-up
- How would you decide between an LSTM and a transformer for a new time-series forecasting task?
- What is gradient clipping and why is it necessary even with LSTMs?
- How do modern state space models like Mamba differ from LSTMs, and what problem are they trying to solve?