mlprep / ML Breadth · hard · 14 min

Explain mixture-of-experts models. Why can they increase model capacity without proportional inference cost, and what are the hard parts?

formulate your answer, then compare with the tldr below

tldr

MoE models use sparse activation: an MoE layer replaces one dense FFN with many expert FFNs plus a learned router, and each token is dispatched to only its top-k experts (often k = 1 or 2). Total parameter count grows with the number of experts, but per-token FLOPs scale with k, so capacity increases without a proportional rise in inference compute. The hard parts are routing quality, load balancing (usually enforced with an auxiliary loss), expert collapse, all-to-all communication in distributed training, and serving reliability, since every expert must stay resident in memory even though most sit idle for any given token.
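A minimal sketch of top-k routing under toy assumptions: each "expert" is a single linear map standing in for an FFN, and all names, sizes, and the numpy-only setup are illustrative rather than taken from any particular MoE library.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes for illustration.
d_model, n_experts, top_k, n_tokens = 64, 8, 2, 16

# Router: a learned linear map from token representation to expert logits.
W_router = rng.normal(scale=0.02, size=(d_model, n_experts))
# Each "expert" is just a linear layer here, standing in for a full FFN.
W_experts = rng.normal(scale=0.02, size=(n_experts, d_model, d_model))

tokens = rng.normal(size=(n_tokens, d_model))

def moe_layer(x):
    logits = x @ W_router                          # (n_tokens, n_experts)
    probs = np.exp(logits - logits.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)          # softmax over experts
    top = np.argsort(-probs, axis=-1)[:, :top_k]   # top-k expert ids per token
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        # Only top_k of n_experts run per token, so per-token compute stays
        # roughly constant even as n_experts (and parameter count) grows.
        gate = probs[t, top[t]]
        gate = gate / gate.sum()                   # renormalize over chosen experts
        for g, e in zip(gate, top[t]):
            out[t] += g * (x[t] @ W_experts[e])
    return out, top, probs

y, chosen, router_probs = moe_layer(tokens)
print(y.shape, chosen[:4])
```

The loop makes the sparsity explicit: each token touches k weight matrices out of n_experts, which is why parameter count overstates inference FLOPs.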

follow-up

  • Why does MoE parameter count overstate inference compute?
  • What metrics would you monitor to detect expert imbalance? (see the sketch after this list)
  • When would a dense model be preferable to an MoE model?
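To make the imbalance follow-up concrete, here is a hypothetical monitoring sketch, not part of the card itself: it computes per-expert load fractions, a Switch-Transformer-style auxiliary balance term, and load entropy from router probabilities. Function name, shapes, and the toy data are all assumptions.

```python
import numpy as np

def expert_balance_stats(router_probs):
    """Illustrative sketch: summarize how evenly tokens are spread over experts."""
    n_tokens, n_experts = router_probs.shape
    top1 = router_probs.argmax(axis=-1)
    # f_i: fraction of tokens whose top-1 choice is expert i.
    f = np.bincount(top1, minlength=n_experts) / n_tokens
    # P_i: mean router probability mass assigned to expert i.
    p = router_probs.mean(axis=0)
    aux_loss = n_experts * float((f * p).sum())      # equals 1.0 at perfect balance
    entropy = float(-(f[f > 0] * np.log(f[f > 0])).sum())  # load entropy, max = log(n_experts)
    return f, aux_loss, entropy

# Toy usage: a skewed router that favours expert 0.
rng = np.random.default_rng(0)
logits = rng.normal(size=(1024, 8))
logits[:, 0] += 2.0
probs = np.exp(logits)
probs /= probs.sum(-1, keepdims=True)
print(expert_balance_stats(probs))
```

A rising auxiliary term or falling load entropy is one signal that routing is collapsing onto a few experts.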