Explain mixture-of-experts models. Why can they increase model capacity without proportional inference cost, and what are the hard parts?
formulate your own answer first, then compare with the notes below
tl;dr
MoE models use sparse activation: a layer holds many expert subnetworks, but a learned router sends each token to only a few of them (typically the top-k by router score). Total parameter count therefore grows with the number of experts while per-token compute stays roughly constant; Mixtral 8x7B, for example, has about 47B total parameters but activates only about 13B per token with top-2 routing. The hard parts are routing quality, load balancing, expert collapse (a few experts absorbing most traffic while the rest go untrained), distributed all-to-all communication, and serving reliability.
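A minimal sketch of top-k routing with a Switch-Transformer-style load-balancing auxiliary loss, assuming PyTorch. The class name `TopKMoE`, the expert architecture, and all dimensions are illustrative; production implementations replace the Python dispatch loop with batched scatter/gather (and, across devices, all-to-all) kernels.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Toy top-k MoE layer: each token is routed to k of n_experts FFNs."""

    def __init__(self, d_model: int, d_hidden: int, n_experts: int, k: int = 2):
        super().__init__()
        self.k = k
        self.n_experts = n_experts
        self.router = nn.Linear(d_model, n_experts)  # learned gating scores
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor):
        # x: (tokens, d_model) -- batch and sequence dims flattened upstream
        probs = F.softmax(self.router(x), dim=-1)           # (tokens, n_experts)
        topk_p, topk_i = probs.topk(self.k, dim=-1)         # keep k experts/token
        topk_p = topk_p / topk_p.sum(dim=-1, keepdim=True)  # renormalize weights

        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            for slot in range(self.k):
                mask = topk_i[:, slot] == e                 # tokens whose slot hit e
                if mask.any():
                    out[mask] += topk_p[mask, slot:slot + 1] * expert(x[mask])

        # Switch-style load-balancing loss: fraction of tokens whose top-1
        # choice is each expert (f) times the mean router probability per
        # expert (P); uniform routing minimizes it.
        f = F.one_hot(topk_i[:, 0], self.n_experts).float().mean(0)
        P = probs.mean(0)
        aux_loss = self.n_experts * torch.sum(f * P)
        return out, aux_loss

moe = TopKMoE(d_model=64, d_hidden=256, n_experts=8, k=2)
y, aux = moe(torch.randn(32, 64))  # only 2 of 8 experts run per token
```

Note how the sketch makes the capacity/compute split concrete: adding experts grows the parameter count linearly, but each token still touches exactly `k` expert FFNs, and the `aux_loss` term is one standard answer to the load-balancing and expert-collapse problems listed above.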
follow-up
- Why does MoE parameter count overstate inference compute?
- What metrics would you monitor to detect expert imbalance?
- When would a dense model be preferable to an MoE model?