Explain mixture-of-experts models. Why can they increase model capacity without proportional inference cost, and what are the hard parts?
formulate your own answer first, then compare with the notes below
tl;dr
MoE models use sparse activation: a layer holds many expert subnetworks, but a learned router sends each token to only a few of them (typically the top-k by router score). Total parameter count therefore grows with the number of experts while per-token compute stays roughly constant; Mixtral 8x7B, for example, has about 47B total parameters but activates only about 13B per token with top-2 routing. The hard parts are routing quality, load balancing, expert collapse (a few experts absorbing most traffic while the rest go untrained), distributed all-to-all communication, and serving reliability.
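A minimal sketch of top-k routing with a Switch-Transformer-style load-balancing auxiliary loss, assuming PyTorch. The class name `TopKMoE`, the expert architecture, and all dimensions are illustrative; production implementations replace the Python dispatch loop with batched scatter/gather (and, across devices, all-to-all) kernels.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Toy top-k MoE layer: each token is routed to k of n_experts FFNs."""

    def __init__(self, d_model: int, d_hidden: int, n_experts: int, k: int = 2):
        super().__init__()
        self.k = k
        self.n_experts = n_experts
        self.router = nn.Linear(d_model, n_experts)  # learned gating scores
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor):
        # x: (tokens, d_model) -- batch and sequence dims flattened upstream
        probs = F.softmax(self.router(x), dim=-1)           # (tokens, n_experts)
        topk_p, topk_i = probs.topk(self.k, dim=-1)         # keep k experts/token
        topk_p = topk_p / topk_p.sum(dim=-1, keepdim=True)  # renormalize weights

        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            for slot in range(self.k):
                mask = topk_i[:, slot] == e                 # tokens whose slot hit e
                if mask.any():
                    out[mask] += topk_p[mask, slot:slot + 1] * expert(x[mask])

        # Switch-style load-balancing loss: fraction of tokens whose top-1
        # choice is each expert (f) times the mean router probability per
        # expert (P); uniform routing minimizes it.
        f = F.one_hot(topk_i[:, 0], self.n_experts).float().mean(0)
        P = probs.mean(0)
        aux_loss = self.n_experts * torch.sum(f * P)
        return out, aux_loss

moe = TopKMoE(d_model=64, d_hidden=256, n_experts=8, k=2)
y, aux = moe(torch.randn(32, 64))  # only 2 of 8 experts run per token
```

Note how the sketch makes the capacity/compute split concrete: adding experts grows the parameter count linearly, but each token still touches exactly `k` expert FFNs, and the `aux_loss` term is one standard answer to the load-balancing and expert-collapse problems listed above.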
follow-up
- Why does MoE parameter count overstate inference compute?
- What metrics would you monitor to detect expert imbalance?
- When would a dense model be preferable to an MoE model?