Design an LLM serving system for a product used by millions of users — think ChatGPT or Claude at scale. Walk me through how you'd handle request routing, GPU scheduling, KV cache management, latency SLOs, and cost control. What breaks first as you scale?

Question

Accepted Answer

How to drive the interview

Don't start with infrastructure. Start by clarifying the contract:
- What latency SLO? Time-to-first-token (TTFT) vs time-per-output-token (TPOT)?
- Streaming outputs or wait-for-complete?
- Average vs max context length?
- Is the model fixed or does the system serve multiple models?
- Cost sensitivity vs latency sensitivity?

A strong answer moves through: compute architecture → request lifecycle → batching strategy → KV cache management → scaling and cost → failure modes.

The fundamental bottleneck: memory bandwidth

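LLM decode is memory-bandwidth-bound. The GPU must stream the model weights and the KV cache from HBM on every step. A 70B FP16 model is 140 GB of weights, which is more than a single 80 GB H100 holds, so in practice it is sharded across GPUs. At H100 HBM bandwidth of ~3.4 TB/s, the minimum time to read the weights once is 140 GB / 3.4 TB/s ≈ 41 ms per token for a batch of 1 (per GPU's worth of bandwidth; sharding across N GPUs divides this by roughly N).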

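At batch size 1, GPU compute utilization is tiny: each weight byte read from HBM feeds roughly one FLOP, so the tensor cores run at around 1% of peak or less. You're paying for a $30k GPU to mostly wait for memory. Batching is what fixes this, because one pass over the weights then serves many requests.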
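The bandwidth arithmetic above can be written down as a small calculator. This is a rough sketch, assuming the step time is set purely by reading the weights once (KV cache reads and inter-GPU communication are ignored); the function names are illustrative, not from any library.

```python
def decode_step_floor_s(weight_bytes: float, hbm_bytes_per_s: float, n_gpus: int = 1) -> float:
    """Lower bound on one decode step: every weight byte is read once.

    Assumes weights are sharded evenly, so aggregate bandwidth scales with n_gpus.
    """
    return weight_bytes / (hbm_bytes_per_s * n_gpus)


def max_tokens_per_s(batch_size: int, step_s: float) -> float:
    """Upper bound on throughput: one token per request per step.

    Only valid while decode stays memory-bound, i.e. step time does not
    grow with batch size.
    """
    return batch_size / step_s


WEIGHT_BYTES = 70e9 * 2   # 70B parameters at 2 bytes each (FP16)
HBM_BW = 3.4e12           # ~3.4 TB/s, the figure used in the text

step = decode_step_floor_s(WEIGHT_BYTES, HBM_BW)
for batch in (1, 8, 64):
    print(f"batch={batch:>2}  step>={step * 1e3:.1f} ms  "
          f"tokens/s<={max_tokens_per_s(batch, step):.0f}")
```

The point of the loop is that the same ~41 ms weight read is shared by every request in the batch, which is why the batching strategy dominates the rest of the design.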

Continuous batching (iteration-level scheduling)

Naïve batching waits for all requests to finish before starting new ones — a long request blocks the batch. Continuous batching (vLLM, TGI) schedules at the iteration level:

- Each decode step, scheduler fills available GPU slots with new requests or continues existing ones
- New requests join the batch as soon as a slot opens (after another request finishes or is preempted)
- Increases GPU utilization from ~20% to ~70-80% in practice

This is the single most impactful optimization for throughput.
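A toy simulation makes the difference concrete. This is a minimal sketch, assuming every decode step takes one time unit and prefill is free; `run_static` and `run_continuous` are hypothetical names and do not reflect how vLLM or TGI implement their schedulers.

```python
from collections import deque


def run_static(requests, max_slots):
    """Naive batching: the next batch starts only when the whole batch is done.

    requests: list of (request_id, output_tokens). Returns {request_id: finish_step}.
    """
    finished, start = {}, 0
    for i in range(0, len(requests), max_slots):
        batch = requests[i:i + max_slots]
        for rid, n_tokens in batch:
            finished[rid] = start + n_tokens
        start += max(n for _, n in batch)  # slots idle until the longest request ends
    return finished


def run_continuous(requests, max_slots):
    """Iteration-level scheduling: free slots are refilled before every step."""
    waiting = deque(requests)
    running = {}   # request_id -> output tokens still to generate
    finished, step = {}, 0
    while waiting or running:
        while waiting and len(running) < max_slots:
            rid, n_tokens = waiting.popleft()
            running[rid] = n_tokens
        step += 1  # one decode iteration for every running request
        for rid in list(running):
            running[rid] -= 1
            if running[rid] == 0:
                finished[rid] = step
                del running[rid]
    return finished


reqs = [("a", 100), ("b", 5), ("c", 5)]
print(run_static(reqs, max_slots=2))      # "c" waits for "a" to finish
print(run_continuous(reqs, max_slots=2))  # "c" takes the slot "b" frees
```

With two slots, the short request `c` finishes at step 105 under static batching and at step 10 under continuous batching, because it no longer waits behind the 100-token request.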

Prefill-decode disaggregation

Prefill (processing the prompt) is compute-bound — all positions processed in parallel. Decode (generating tokens) is memory-bandwidth-bound — one token at a time. These have different optimal batch sizes and GPU configurations.

Disaggregated serving (DistServe, Splitwise): route prefill to compute-optimized GPUs (more tensor cores), decode to memory-optimized GPUs (more HBM bandwidth). Reduces interference between the two phases, improves utilization of each.
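The compute-bound versus memory-bound split can be estimated with a roofline-style check. The sketch below is a simplification, assuming 2 FLOPs per parameter per token and one read of the weights per forward pass (attention FLOPs and KV cache traffic are ignored). The peak FLOP/s value is a hypothetical round number chosen for illustration, not a vendor specification.

```python
def arithmetic_intensity(tokens_per_pass: int, bytes_per_param: float = 2.0) -> float:
    """FLOPs per byte of weights read, for one forward pass.

    tokens_per_pass is the prompt length for prefill, or the batch size for decode.
    """
    return 2.0 * tokens_per_pass / bytes_per_param


def phase_bound(tokens_per_pass: int, peak_flops: float, hbm_bytes_per_s: float) -> str:
    """Classify a pass by comparing its intensity to the hardware ridge point."""
    ridge = peak_flops / hbm_bytes_per_s  # FLOPs the GPU can do per byte it can load
    if arithmetic_intensity(tokens_per_pass) >= ridge:
        return "compute-bound"
    return "memory-bound"


PEAK_FLOPS = 1e15   # hypothetical ~1 PFLOP/s accelerator (assumption)
HBM_BW = 3.4e12     # ~3.4 TB/s, the figure used in the text

print("prefill, 2048-token prompt:", phase_bound(2048, PEAK_FLOPS, HBM_BW))
print("decode, batch of 8:        ", phase_bound(8, PEAK_FLOPS, HBM_BW))
```

Under these assumptions the ridge point sits near 300 tokens per pass, so a long prefill lands well on the compute side while a small decode batch lands well on the memory side. That gap is what makes separate fleets attractive.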

KV cache management with PagedAttention

KV cache grows during decode: you don't know the final context length upfront, so pre-allocating max memory wastes GPU RAM. PagedAttention (vLLM) manages KV cache like virtual memory:
- Fixed-size pages (e.g., 16 tokens per page) allocated on demand
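- Near-zero fragmentation: pages need not be contiguous, so there is no external fragmentation, and internal waste is limited to the unfilled tail of each request's last page
- Page sharing: different requests can map the same physical pages, tracked by reference counts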
- Prefix caching: system prompt processed once, KV cache shared across all requests. 1000 concurrent users with the same 2k-token system prompt pay prefill cost once
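The paging and sharing ideas above can be sketched with a reference-counted block allocator. This is an illustrative toy, not vLLM's implementation: the class and method names are hypothetical, the blocks hold no tensor data, and the shared prefix is assumed to fill whole pages so no copy-on-write is needed.

```python
class BlockAllocator:
    """Hands out fixed-size KV pages and tracks how many sequences use each."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free = list(range(num_blocks))
        self.refcount = {}

    def allocate(self) -> int:
        if not self.free:
            raise MemoryError("KV cache exhausted")
        block = self.free.pop()
        self.refcount[block] = 1
        return block

    def share(self, block: int) -> int:
        self.refcount[block] += 1   # another sequence now maps this page
        return block

    def release(self, block: int) -> None:
        self.refcount[block] -= 1
        if self.refcount[block] == 0:   # last user gone: page is reusable
            del self.refcount[block]
            self.free.append(block)


class Sequence:
    """One request's page table: shared prefix pages first, then private pages."""

    def __init__(self, allocator: BlockAllocator, shared_prefix_blocks=()):
        self.alloc = allocator
        self.blocks = [allocator.share(b) for b in shared_prefix_blocks]
        self.num_tokens = len(self.blocks) * allocator.block_size

    def append_token(self) -> None:
        # Allocate a new page only when every existing page is full.
        if self.num_tokens == len(self.blocks) * self.alloc.block_size:
            self.blocks.append(self.alloc.allocate())
        self.num_tokens += 1

    def free_blocks(self) -> None:
        for block in self.blocks:
            self.alloc.release(block)
        self.blocks = []
```

A usage pattern under these assumptions: build the system prompt once as its own `Sequence`, then pass its `blocks` to each user's `Sequence`. Every user then only allocates pages for tokens past the shared prefix.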

Request routing and load balancing

Multiple inference servers run the same model (data parallel). Routing considerations:
- Route requests with the same prefix to the same server (to reuse prefix cache)
- Balance load to avoid hot spots — monitor GPU memory utilization, not just request count
- Separate routing for different model sizes (3B vs 70B) or quantization levels
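One way to combine the first two considerations is prefix-affinity routing with a load-based fallback. The sketch below is a simplified illustration: `pick_server`, the 256-token prefix window, and the 0.9 load threshold are all assumptions, and the modulo hash would reshuffle most prefixes whenever the server list changes (a consistent-hash ring avoids that in practice).

```python
import hashlib


def pick_server(prompt_tokens, servers, load, prefix_len=256, max_load=0.9):
    """Route by prompt prefix, unless the preferred server is too full.

    prompt_tokens: list of token ids
    servers:       list of server names
    load:          dict of server name -> KV memory utilization in [0, 1]
    """
    # Hash only the leading tokens so prompts sharing a system prompt collide.
    key = ",".join(map(str, prompt_tokens[:prefix_len])).encode()
    digest = int.from_bytes(hashlib.sha256(key).digest()[:8], "big")
    preferred = servers[digest % len(servers)]

    if load[preferred] < max_load:
        return preferred          # likely prefix-cache hit
    # Preferred server is hot: give up the cache hit to protect latency.
    return min(servers, key=lambda s: load[s])


servers = ["gpu-0", "gpu-1", "gpu-2"]
load = {"gpu-0": 0.4, "gpu-1": 0.4, "gpu-2": 0.4}
system_prompt = list(range(256))
print(pick_server(system_prompt + [7, 8, 9], servers, load))
print(pick_server(system_prompt + [42], servers, load))  # same server as above
```

Keying the fallback on memory utilization rather than request count follows the second bullet: a few long-context requests can fill a server that looks idle by request count.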

Latency SLOs: TTFT vs TPOT

Users perceive streaming differently from batch:
- TTFT (Time to First Token): how quickly the stream starts. Dominated by prefill duration. Target: under 1 second for most use cases.
- TPOT (Time Per Output Token): streaming speed. Human reading speed is roughly 5-10 tokens/sec, so the stream only needs to stay ahead of that. Target: under 50 ms/token (at least 20 tokens/sec) for an "instant feel."
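Both metrics fall out of per-token timestamps recorded at the streaming edge. This is a minimal sketch; the function name and the choice to average TPOT over the gaps after the first token are assumptions about how one might define the measurement.

```python
def ttft_and_tpot(request_sent_s: float, token_times_s: list) -> tuple:
    """Compute (TTFT, mean TPOT) in seconds for one streamed response.

    request_sent_s: wall-clock time the request was sent
    token_times_s:  wall-clock arrival time of each output token, in order
    """
    if not token_times_s:
        raise ValueError("no tokens were streamed")
    ttft = token_times_s[0] - request_sent_s
    n = len(token_times_s)
    # TPOT excludes the first token, whose latency is dominated by prefill.
    tpot = (token_times_s[-1] - token_times_s[0]) / (n - 1) if n > 1 else 0.0
    return ttft, tpot


ttft, tpot = ttft_and_tpot(0.0, [0.40, 0.44, 0.48, 0.52])
print(f"TTFT={ttft * 1e3:.0f} ms  TPOT={tpot * 1e3:.0f} ms/token")
```

Tracking the two separately matters because they fail for different reasons: TTFT degrades with queueing and prompt length, TPOT with decode batch pressure.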

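Priority queues: short prompts get lower TTFT, not because the GPU is faster, but because prefill processes all prompt positions in one parallel pass whose cost grows with prompt length. Long prompts with short outputs (summarization) spend nearly all their time in prefill, so they need explicit priority scheduling rather than sharing a FIFO queue with short interactive prompts.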
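A priority queue along these lines can be sketched with a heap. The ordering used here (interactive tier first, then shorter prompts, then arrival order) is one possible policy chosen for illustration, and `AdmissionQueue` is a hypothetical name.

```python
import heapq
import itertools


class AdmissionQueue:
    """Orders waiting requests by tier, then prompt length, then arrival."""

    def __init__(self):
        self._heap = []
        self._arrival = itertools.count()   # tie-breaker keeps FIFO within a key

    def push(self, request_id: str, prompt_tokens: int, interactive: bool = True) -> None:
        tier = 0 if interactive else 1      # lower tuple sorts first
        heapq.heappush(self._heap, (tier, prompt_tokens, next(self._arrival), request_id))

    def pop(self) -> str:
        return heapq.heappop(self._heap)[3]

    def __len__(self) -> int:
        return len(self._heap)


q = AdmissionQueue()
q.push("batch-job", prompt_tokens=200, interactive=False)
q.push("long-doc", prompt_tokens=8000)
q.push("chat-turn", prompt_tokens=50)
print([q.pop() for _ in range(len(q))])
```

A strict shortest-first policy can starve long prompts under sustained load, so a production scheduler would likely add aging or per-tier quotas on top of this.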

Cost control levers

| Lever | Effect |
|---|---|
| Quantization (INT4/INT8) | 2-4× more model capacity per GPU |
| Continuous batching | 4-10× throughput improvement over naïve batching |
| Speculative decoding | 2-3× latency reduction on predictable output |
| Prefix caching | Amortize shared prompt prefill across users |
| Request-level priority | Protect interactive users from batch job interference |
| Smaller model + RAG | Lower GPU cost for knowledge-intensive tasks |
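Most of these levers act on the same quantity: tokens served per GPU-hour. The helper below shows that relationship. Every number passed to it is a made-up placeholder for illustration, not a measured price or throughput.

```python
def cost_per_million_tokens(gpu_hourly_usd: float, n_gpus: int, tokens_per_s: float) -> float:
    """Serving cost in USD per one million output tokens for one replica.

    gpu_hourly_usd: price of one GPU for one hour
    n_gpus:         GPUs in the replica
    tokens_per_s:   sustained aggregate output throughput of the replica
    """
    usd_per_second = gpu_hourly_usd * n_gpus / 3600.0
    return usd_per_second / tokens_per_s * 1e6


# Hypothetical replica: 8 GPUs at $2/hour each, 4000 tokens/s aggregate.
base = cost_per_million_tokens(2.0, 8, 4000)
# A lever that doubles throughput (for example, better batching) halves the cost.
improved = cost_per_million_tokens(2.0, 8, 8000)
print(f"${base:.2f} -> ${improved:.2f} per million tokens")
```

Framing each lever as a multiplier on `tokens_per_s` or a divisor on `n_gpus` makes it easier to compare them in an interview.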

What breaks first at scale

GPU OOM: KV cache fills before requests complete. Fix: lower max concurrent requests, add more GPUs, reduce context length limit, use INT4 KV cache.
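The OOM limit can be estimated ahead of time from the model shape. This is a back-of-the-envelope sketch: the layer count, KV head count, and head dimension below are assumed values for a 70B-class model with grouped-query attention, and the 40 GiB KV budget is a placeholder for whatever is left after weights and activations.

```python
def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                       bytes_per_elem: float = 2) -> float:
    """KV cache bytes stored per token: one K and one V vector per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem


def max_concurrent_requests(kv_budget_bytes: float, context_tokens: int,
                            per_token_bytes: float) -> int:
    """How many full-length contexts fit in the KV budget at once."""
    return int(kv_budget_bytes // (context_tokens * per_token_bytes))


# Assumed model shape (hypothetical 70B-class GQA model).
per_token_fp16 = kv_bytes_per_token(n_layers=80, n_kv_heads=8, head_dim=128)
per_token_8bit = kv_bytes_per_token(80, 8, 128, bytes_per_elem=1)

budget = 40 * 2**30  # 40 GiB reserved for KV cache (assumption)
print(per_token_fp16 / 1024, "KiB per token at FP16")
print(max_concurrent_requests(budget, 4096, per_token_fp16), "requests at 4k context, FP16 KV")
print(max_concurrent_requests(budget, 4096, per_token_8bit), "requests at 4k context, 8-bit KV")
```

The same two functions show why each fix works: halving context length or halving bytes per element doubles the number of requests that fit.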

Prefill spikes: a burst of long-prompt requests (RAG + long docs) saturates compute while decode stalls. Fix: prefill rate limiting, separate prefill fleet.

Latency tail: p99 TTFT blows up under load. Root cause: head-of-line blocking, where short requests queue behind long prompts whose prefill occupies the batch. Fix: request length estimation and priority queuing.

Hot prefix cache: one popular prompt uses disproportionate KV memory. Fix: LRU eviction on prefix cache, cap max prefix cache per key.
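The eviction fix can be sketched with an ordered dictionary. This toy tracks only block counts, not actual KV tensors; `PrefixCache` and its parameters are hypothetical names, and capping per key by rejecting oversized entries is one possible reading of "cap max prefix cache per key".

```python
from collections import OrderedDict


class PrefixCache:
    """LRU cache of prefix -> number of KV blocks, with a per-key size cap."""

    def __init__(self, capacity_blocks: int, max_blocks_per_key: int):
        self.capacity = capacity_blocks
        self.max_blocks_per_key = max_blocks_per_key
        self.used = 0
        self._entries = OrderedDict()   # oldest (least recently used) first

    def get(self, key: str):
        if key not in self._entries:
            return None
        self._entries.move_to_end(key)  # mark as most recently used
        return self._entries[key]

    def put(self, key: str, n_blocks: int) -> bool:
        if n_blocks > self.max_blocks_per_key:
            return False                # too large to cache: recompute instead
        if key in self._entries:
            self.used -= self._entries.pop(key)
        # Evict least recently used prefixes until the new one fits.
        while self.used + n_blocks > self.capacity and self._entries:
            _, evicted_blocks = self._entries.popitem(last=False)
            self.used -= evicted_blocks
        if self.used + n_blocks > self.capacity:
            return False
        self._entries[key] = n_blocks
        self.used += n_blocks
        return True
```

In a real server the evicted entry would also release its pages back to the KV allocator, and entries still referenced by running requests would be skipped during eviction.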