mlprep / MLOps · hard · 15 min

How do you serve an ML model at low latency and high throughput in production? What are the main levers?

formulate your answer, then —

You mentioned int8 quantization requires calibration — what does calibration mean here, and what goes wrong without it?

formulate your answer, then —

tldr

Model serving optimizes for latency (single-request speed) and throughput (requests served per second), and the two trade off: batching raises throughput but adds queueing delay to individual requests. Profile first, since feature retrieval and preprocessing often dominate end-to-end latency rather than the model forward pass. Dynamic batching, float16 quantization, and keeping the model warm in memory are the highest-leverage quick wins. Int8 quantization needs calibration data to set activation scales; without representative inputs, the chosen ranges clip outliers or waste precision, degrading accuracy. Distillation and pruning are the heavier tools for throughput at scale.
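Two of these levers benefit from a concrete sketch. First, dynamic batching: let concurrent requests queue briefly so one model call amortizes across many of them. Below is a minimal asyncio sketch; `DynamicBatcher`, `model_fn`, and the knob names are hypothetical, not any particular serving framework's API.

```python
import asyncio

class DynamicBatcher:
    """Coalesce concurrent requests into one batched model call."""

    def __init__(self, model_fn, max_batch=32, max_wait_ms=5.0):
        self.model_fn = model_fn            # batched inference: list -> list
        self.max_batch = max_batch          # flush when this many queued...
        self.max_wait = max_wait_ms / 1000  # ...or when this much time passes
        self.queue = asyncio.Queue()

    async def infer(self, x):
        # Each caller parks on a future; the batching loop resolves it.
        fut = asyncio.get_running_loop().create_future()
        await self.queue.put((x, fut))
        return await fut

    async def run(self):
        while True:
            x, fut = await self.queue.get()  # wait for the first request
            batch, futs = [x], [fut]
            deadline = asyncio.get_running_loop().time() + self.max_wait
            while len(batch) < self.max_batch:
                remaining = deadline - asyncio.get_running_loop().time()
                if remaining <= 0:
                    break
                try:
                    x, fut = await asyncio.wait_for(self.queue.get(), remaining)
                except asyncio.TimeoutError:
                    break
                batch.append(x)
                futs.append(fut)
            # One forward pass serves the whole batch (blocking here is fine
            # for a sketch; a real server would offload to a worker thread).
            for f, y in zip(futs, self.model_fn(batch)):
                f.set_result(y)
```

Callers start the loop once with `asyncio.create_task(batcher.run())` and then simply `await batcher.infer(x)` per request; `max_wait_ms` is exactly the latency you spend to buy throughput.

Second, int8 calibration: run representative traffic through an instrumented copy of the model so observers record activation ranges, then freeze those ranges into scale/zero-point pairs. A sketch using PyTorch's eager-mode post-training static quantization; `TinyNet` and the random calibration batches are placeholders for the real model and real traffic.

```python
import torch
import torch.nn as nn

class TinyNet(nn.Module):
    """Placeholder model; the stubs mark where int8 begins and ends."""
    def __init__(self):
        super().__init__()
        self.quant = torch.ao.quantization.QuantStub()      # fp32 -> int8
        self.fc1 = nn.Linear(128, 64)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(64, 10)
        self.dequant = torch.ao.quantization.DeQuantStub()  # int8 -> fp32

    def forward(self, x):
        x = self.quant(x)
        x = self.relu(self.fc1(x))
        x = self.fc2(x)
        return self.dequant(x)

torch.backends.quantized.engine = "fbgemm"  # x86 server backend
model = TinyNet().eval()
model.qconfig = torch.ao.quantization.get_default_qconfig("fbgemm")

# prepare() inserts observers that track the min/max of each activation.
prepared = torch.ao.quantization.prepare(model)

# Calibration: replay representative inputs so the observers see realistic
# activation ranges. Unrepresentative or missing data is where it goes
# wrong: ranges that are too narrow clip outliers, while ranges that are
# too wide spend the 256 int8 levels on values that never occur.
with torch.no_grad():
    for _ in range(100):
        prepared(torch.randn(32, 128))

# convert() freezes the observed ranges into scale/zero-point pairs and
# swaps in int8 kernels.
quantized = torch.ao.quantization.convert(prepared)
```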

follow-up

  • How would you design a serving system that serves both a fast, simple model and a slow, accurate model, routing between them?
  • What is speculative decoding in LLM serving and what problem does it solve?
  • How do you set an SLO for a model serving endpoint, and what do you do when the model can't meet it?