Design an LLM serving system for a product used by millions of users — think ChatGPT or Claude at scale. Walk me through how you'd handle request routing, GPU scheduling, KV cache management, latency SLOs, and cost control. What breaks first as you scale?
tldr
LLM serving is compute-bound at prefill and memory-bandwidth-bound at decode, so the two phases need different optimization strategies. Continuous batching (scheduling at the iteration level rather than the request level) is the highest-leverage throughput improvement. PagedAttention allocates the KV cache in fixed-size blocks, which nearly eliminates memory fragmentation. Disaggregating prefill and decode onto separate GPU fleets improves utilization. Prefix caching amortizes the cost of shared system prompts across users. Key SLOs: TTFT (time to first token, dominated by queueing and prefill latency) and TPOT (time per output token, i.e. per-token decode latency). Cost controls: quantization, speculative decoding, smaller models + RAG.
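A back-of-the-envelope sketch of why KV cache capacity and memory bandwidth dominate decode. The function names and the model and GPU numbers in the usage note are illustrative assumptions, not measurements of any real deployment, and the bandwidth bound ignores compute, kernel overhead, and communication.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """Bytes of KV cache one token occupies across all layers."""
    # Factor of 2: one key vector and one value vector per layer per KV head.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def max_cached_tokens(gpu_mem_bytes, weight_bytes, per_token_bytes, reserve_frac=0.1):
    """Tokens of KV cache that fit after weights and a safety reserve."""
    # reserve_frac is an assumed headroom for activations and allocator overhead.
    usable = gpu_mem_bytes * (1 - reserve_frac) - weight_bytes
    return max(0, int(usable // per_token_bytes))


def blocks_needed(num_tokens, block_size=16):
    """Fixed-size cache blocks a sequence needs (paged allocation, ceiling division)."""
    return -(-num_tokens // block_size)


def decode_step_floor_seconds(weight_bytes, kv_bytes_read, mem_bandwidth_bytes_per_s):
    """Lower bound on one decode iteration: every weight and every cached
    K/V entry for the batch must be read from GPU memory at least once."""
    return (weight_bytes + kv_bytes_read) / mem_bandwidth_bytes_per_s
```

For a hypothetical model with 32 layers, 8 KV heads, head dimension 128, and fp16 cache, `kv_cache_bytes_per_token(32, 8, 128)` gives 131072 bytes (128 KiB) per token. With paged allocation, waste is limited to the unused tail of each sequence's last block, and because the weight read in `decode_step_floor_seconds` is shared by the whole batch, larger decode batches raise throughput until KV reads or capacity become the limit.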
follow-up
- How does speculative decoding interact with continuous batching — do they compose well?
- What is chunked prefill, and why does it improve TPOT for in-flight decode requests when a long prefill arrives?
- How would you design a multi-model serving system that routes between a 7B and 70B model based on query complexity?