How do you serve an ML model at low latency and high throughput in production? What are the main levers?
You mentioned int8 quantization requires calibration — what does calibration mean here, and what goes wrong without it?
tldr
Model serving optimizes for latency (single-request speed) and throughput (requests served per second), and the two trade off: batching, for instance, raises throughput but adds per-request wait. Profile first, because feature retrieval and preprocessing often dominate end-to-end latency rather than the model's forward pass. Dynamic batching, float16 quantization, and keeping the model warm in memory are the highest-leverage quick wins. Int8 quantization needs calibration data to set activation scales; without it, clipping or precision loss degrades accuracy. Distillation and pruning are the heavier tools for throughput at scale.
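
To make "dynamic batching" concrete, here is a minimal sketch of the idea: requests queue up briefly so one forward pass serves many callers. The `DynamicBatcher` class, the `model` callable, and the `max_batch_size` / `max_wait_ms` parameters are illustrative names, not from any particular serving framework.

```python
import queue
import threading
import time

class DynamicBatcher:
    """Toy dynamic batcher: collect requests for up to max_wait_ms,
    then run one batched forward pass (illustrative sketch only)."""

    def __init__(self, model, max_batch_size=32, max_wait_ms=5):
        self.model = model                      # callable: list of inputs -> list of outputs
        self.max_batch_size = max_batch_size
        self.max_wait_s = max_wait_ms / 1000.0
        self.requests = queue.Queue()
        threading.Thread(target=self._loop, daemon=True).start()

    def predict(self, features):
        # Each caller blocks on its own single-slot queue until the
        # batched result comes back.
        result_q = queue.Queue(maxsize=1)
        self.requests.put((features, result_q))
        return result_q.get()

    def _loop(self):
        while True:
            # Block for the first request, then wait at most max_wait_s
            # for more, so latency stays bounded even at low traffic.
            batch = [self.requests.get()]
            deadline = time.monotonic() + self.max_wait_s
            while len(batch) < self.max_batch_size:
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    break
                try:
                    batch.append(self.requests.get(timeout=remaining))
                except queue.Empty:
                    break
            inputs = [features for features, _ in batch]
            outputs = self.model(inputs)        # one forward pass for the whole batch
            for (_, result_q), output in zip(batch, outputs):
                result_q.put(output)
```

The `max_wait_ms` knob is the latency/throughput trade-off in miniature: a larger window fills bigger batches (more throughput) at the cost of added per-request wait.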
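
And to make the calibration point concrete, a toy numpy-only sketch of what max-abs int8 calibration computes; `calibrate_scale` and the example numbers are hypothetical, and production toolkits (e.g. TensorRT, ONNX Runtime) offer more robust percentile or entropy calibrators.

```python
import numpy as np

def calibrate_scale(activations):
    # Max-abs calibration: pick the scale that maps the observed
    # activation range onto the int8 range [-127, 127].
    return np.max(np.abs(activations)) / 127.0

def quantize(x, scale):
    return np.clip(np.round(x / scale), -127, 127).astype(np.int8)

def dequantize(q, scale):
    return q.astype(np.float32) * scale

# Activations recorded from a representative calibration set:
calib = np.random.randn(10_000).astype(np.float32) * 1.5
scale = calibrate_scale(calib)

x = np.array([0.5, -2.0, 1.2], dtype=np.float32)
x_hat = dequantize(quantize(x, scale), scale)   # close to x; small rounding error

# Without calibration, a guessed scale fails in one of two ways:
# too small and large activations clip, too large and the int8
# range is wasted, so small values lose precision.
bad = dequantize(quantize(x, scale=0.01), scale=0.01)
# -> [0.5, -1.27, 1.2]: the -2.0 activation clipped to -1.27
```

This is the failure mode the tldr names: without calibration data the scales are wrong, and either clipping or precision loss shows up directly as accuracy degradation.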
follow-up
- How would you design a serving system that serves both a fast simple model and a slow accurate model, routing between them?
- What is speculative decoding in LLM serving and what problem does it solve?
- How do you set an SLO for a model serving endpoint, and what do you do when the model can't meet it?