Explain speculative decoding. Why is LLM decode latency hard to improve with more compute, and how does speculative decoding address this? Walk me through the algorithm and what determines whether it actually helps in practice.

Question

Accepted Answer

Why decode latency is hard to parallelize

LLM decode is inherently sequential: each token depends on all the tokens before it, so the next token cannot be computed until the current one has been sampled. More GPUs help with throughput (more requests processed in parallel) but do little for latency (the time to generate one response). A single request with a 32k-token output still has to generate those tokens one at a time.

The decode step is also memory-bandwidth-bound at small batch sizes: the GPU spends most of its time streaming model weights and KV cache from HBM, not doing compute, so its compute units sit underutilized. This means you can do more work per decode step, such as scoring several token positions at once, with little added latency.
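A rough back-of-envelope estimate makes the imbalance concrete. The sketch below compares the time to stream the weights once against the time to do the matching arithmetic for a single token. Every number in it (model size, weight precision, bandwidth, compute rate) is an illustrative assumption rather than a measurement, and it ignores KV cache traffic entirely.

```python
# Back-of-envelope bounds for one decode step at batch size 1.
# All model and hardware numbers below are illustrative assumptions.

def decode_step_bounds(n_params, bytes_per_param, hbm_bytes_per_s, flops_per_s):
    """Return (memory_time_s, compute_time_s) lower bounds for one token."""
    weight_bytes = n_params * bytes_per_param     # each weight is read once per token
    memory_time = weight_bytes / hbm_bytes_per_s  # time to stream the weights from HBM
    flops = 2 * n_params                          # ~2 FLOPs (multiply + add) per weight
    compute_time = flops / flops_per_s
    return memory_time, compute_time

mem_t, cmp_t = decode_step_bounds(
    n_params=70e9,         # assumed 70B-parameter model
    bytes_per_param=2,     # assumed 16-bit weights
    hbm_bytes_per_s=2e12,  # assumed 2 TB/s of aggregate HBM bandwidth
    flops_per_s=500e12,    # assumed 500 TFLOP/s of usable compute
)
print(f"memory-bound time : {mem_t * 1e3:.1f} ms")
print(f"compute-bound time: {cmp_t * 1e3:.2f} ms")
print(f"memory / compute  : {mem_t / cmp_t:.0f}x")
```

Under these assumed numbers the weight traffic dominates by a wide margin, which is the slack that speculative decoding tries to use.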

The core idea

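Speculative decoding exploits that underutilization. Use a small, fast draft model to speculatively generate K tokens ahead, then verify all K tokens in parallel with one forward pass of the large target model. The draft is typically a much smaller model from the same family (e.g., a 7B draft for a 70B target). It does not need the same architecture, but it must share the target's tokenizer and vocabulary so that the two models' token probabilities can be compared.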

The algorithm

1. Draft model generates K tokens autoregressively: d_1, d_2, ..., d_K
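2. Target model runs one forward pass on the prefix + K draft tokens simultaneously, producing its next-token distribution at each of the K+1 positions
3. For each draft token d_i, in order, compare draft probability q(d_i) vs target probability p(d_i):
   - Accept with probability min(1, p(d_i) / q(d_i))
   - If accepted, continue to d_{i+1}
   - If rejected, sample a correction token from the adjusted distribution norm(max(0, p - q)) and stop
4. If all K draft tokens are accepted, sample one extra token from the target's distribution at position K+1
5. The accepted tokens plus the correction or extra token are appended; repeat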
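The steps above can be sketched as a single draft-then-verify round over toy distributions. This is a minimal illustration, not a production implementation: `draft_probs` and `target_probs` are hypothetical callables that map a token sequence to a next-token probability list, the target is queried in a loop where a real system would use one batched forward pass, and the round ends with an extra token sampled from the target when every draft token is accepted.

```python
import random

def sample(dist, rng):
    """Draw an index from a list of probabilities."""
    return rng.choices(range(len(dist)), weights=dist, k=1)[0]

def residual(p, q):
    """Adjusted distribution for a rejected position: normalize max(0, p - q)."""
    diff = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    total = sum(diff)
    return [d / total for d in diff]

def speculative_round(prefix, draft_probs, target_probs, K, rng):
    """Run one draft-then-verify round and return the newly emitted tokens."""
    # Draft K tokens autoregressively, remembering each draft distribution q.
    drafted, q_dists = [], []
    for _ in range(K):
        q = draft_probs(prefix + drafted)
        drafted.append(sample(q, rng))
        q_dists.append(q)

    # Score all K + 1 positions with the target (one batched pass in practice).
    p_dists = [target_probs(prefix + drafted[:i]) for i in range(K + 1)]

    # Accept or reject left to right.
    emitted = []
    for i, tok in enumerate(drafted):
        p, q = p_dists[i], q_dists[i]
        if rng.random() < min(1.0, p[tok] / q[tok]):
            emitted.append(tok)
        else:
            emitted.append(sample(residual(p, q), rng))  # correction token
            return emitted
    # Every draft token accepted: take an extra token from the target.
    emitted.append(sample(p_dists[K], rng))
    return emitted

# Hypothetical toy models over a 3-token vocabulary that ignore their context.
def toy_draft(tokens):
    return [0.6, 0.3, 0.1]

def toy_target(tokens):
    return [0.5, 0.2, 0.3]

rng = random.Random(0)
out = speculative_round([0], toy_draft, toy_target, K=4, rng=rng)
print(out)  # between 1 and K + 1 tokens
```

Each round emits at least one token, because a rejection still yields a correction token drawn with the target's help.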

Why this preserves the target distribution

The accept/reject step is not a heuristic; it is a form of rejection sampling with a correctness proof. As long as the correction token is drawn from the adjusted distribution, the output distribution of speculative decoding is provably identical to sampling from the target model directly. There is no quality loss.

Intuitively: if the draft model is confident and correct, the target model agrees and accepts. If the draft is wrong, the target model rejects and provides its own sample. In both cases, the final distribution matches the target.
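The claim can be checked exactly for a single position. The sketch below adds up the two ways a token can be emitted (drafted and accepted, or drawn as a correction after a rejection) and compares the result with the target distribution. The distributions `p` and `q` are made-up examples, and the correction is assumed to come from the normalized positive part of p - q.

```python
def one_step_output_distribution(p, q):
    """Exact distribution of the token emitted at one speculative position."""
    # Probability that token x is drafted and then accepted: q(x) * min(1, p(x)/q(x)).
    accept = [qi * min(1.0, pi / qi) if qi > 0 else 0.0 for pi, qi in zip(p, q)]
    reject_mass = 1.0 - sum(accept)  # probability that the draft token is rejected
    # Correction tokens come from the normalized positive part of p - q.
    diff = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    total = sum(diff)
    resid = [d / total for d in diff] if total > 0 else [0.0] * len(p)
    return [a + reject_mass * r for a, r in zip(accept, resid)]

p = [0.5, 0.2, 0.3]  # example target distribution
q = [0.6, 0.3, 0.1]  # example draft distribution
result = one_step_output_distribution(p, q)
print(result)
assert all(abs(r - pi) < 1e-12 for r, pi in zip(result, p))
```

The accepted mass for each token is min(p, q), and the rejected mass is redistributed exactly where p exceeds q, so the two parts sum to p.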

What determines speedup

Acceptance rate α: the probability that the target accepts a draft token. If α is high (draft usually matches target), you get close to K+1 tokens per target forward pass. If α is low, you reject early and often and get little benefit. Typical acceptance rates: 0.7–0.9 for a well-matched draft-target pair.

Expected tokens per target call ≈ (1 - α^(K+1)) / (1 - α)

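At α=0.8, K=4: (1 - 0.8^5) / (1 - 0.8) ≈ 3.4 expected tokens per target call vs 1 without speculation. The count includes the correction or extra token that the target contributes each round.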
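The formula is easy to evaluate for other settings. The helper below is a small sketch that assumes each draft token is accepted independently with the same probability α, which is the simplification the formula itself relies on; the function name is illustrative.

```python
def expected_tokens(alpha, K):
    """Expected tokens emitted per target call, assuming i.i.d. acceptance."""
    if alpha == 1.0:
        return float(K + 1)  # limit of the geometric sum when nothing is rejected
    return (1 - alpha ** (K + 1)) / (1 - alpha)

for K in (2, 4, 8):
    print(f"alpha=0.8, K={K}: {expected_tokens(0.8, K):.2f} tokens per target call")
```

The returns diminish as K grows: the value can never exceed 1 / (1 - α), so drafting far ahead mostly adds draft cost.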

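Draft model speed: one round costs K draft steps plus one target verification, and that must be cheaper than generating the same number of tokens with sequential target passes. A draft whose per-token latency approaches the target's would eliminate all benefit, since drafting alone would then cost as much as decoding with the target directly.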
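A simplified cost model shows how draft speed and acceptance rate trade off. It assumes that verifying K+1 positions costs the same as one ordinary target decode step, that `c` is the ratio of draft step latency to target step latency, and that acceptance is independent per token. All three are modeling assumptions, and the names are illustrative.

```python
def expected_speedup(alpha, K, c):
    """Estimated latency speedup over plain target decoding.

    alpha: per-token acceptance probability (assumed independent per token)
    K:     number of drafted tokens per round
    c:     draft step latency divided by target step latency
    """
    if alpha == 1.0:
        tokens = float(K + 1)
    else:
        tokens = (1 - alpha ** (K + 1)) / (1 - alpha)
    round_cost = K * c + 1  # K draft steps plus one target pass, in target-pass units
    return tokens / round_cost

for c in (0.05, 0.1, 0.5, 1.0):
    print(f"c={c:<4}: speedup ~ {expected_speedup(0.8, 4, c):.2f}x")
```

With α=0.8 and K=4, a draft that is as slow as the target (c=1.0) gives a value below 1, meaning speculation would make latency worse under this model.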

Token type matters: speculative decoding helps most on predictable spans (boilerplate, common phrases) and least on high-entropy tokens (first word of a new idea, proper nouns). Some systems use adaptive K: they stop drafting at the first token the draft model is not confident about, rather than always drafting a fixed number of tokens.
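One possible stopping rule for adaptive K is sketched below. It is an assumption about how such a heuristic could look, not a description of any particular system: `draft_probs` is a hypothetical callable, drafting is greedy for simplicity, and the threshold `min_confidence` is an arbitrary tuning knob. This version stops before emitting the low-confidence token.

```python
def draft_adaptive(prefix, draft_probs, max_k, min_confidence):
    """Draft up to max_k tokens, stopping at the first low-confidence position."""
    drafted = []
    for _ in range(max_k):
        q = draft_probs(prefix + drafted)
        tok = max(range(len(q)), key=q.__getitem__)  # greedy pick, for simplicity
        if q[tok] < min_confidence:
            break  # the draft is unsure here, so hand over to the target
        drafted.append(tok)
    return drafted

# Hypothetical toy draft: confident for short contexts, unsure afterwards.
def toy_draft(tokens):
    return [0.9, 0.05, 0.05] if len(tokens) < 3 else [0.4, 0.3, 0.3]

drafted = draft_adaptive([0], toy_draft, max_k=8, min_confidence=0.5)
print(drafted)  # → [0, 0]
```

Stopping early saves draft steps on spans where the tokens would likely have been rejected anyway.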

Variants

Self-speculation: no separate draft model. Use the target model's own early layers (early exit or layer skipping) as the draft, then verify with the full model. Fewer infrastructure changes, but typically a smaller speedup.

Medusa: train multiple lightweight heads on the base model to predict K tokens simultaneously as drafts. Parallelizes drafting within one model.

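Lookahead decoding: no separate draft model. The target model itself generates candidate n-grams in parallel using Jacobi iteration, caches them in an n-gram pool, and verifies candidates looked up from that pool in the same forward pass.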
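The lookup idea alone can be illustrated with a simplified n-gram match over the tokens generated so far. This is not the full lookahead decoding algorithm: it only shows how a multi-token candidate can be proposed without a draft model, and the function name and parameters are illustrative. Any candidate would still have to be verified by the target model.

```python
def propose_from_history(tokens, n, k):
    """Propose up to k tokens by matching the last n tokens earlier in the sequence."""
    if len(tokens) <= n:
        return []
    suffix = tokens[-n:]
    # Scan backwards for the most recent earlier occurrence of the suffix.
    for start in range(len(tokens) - n - 1, -1, -1):
        if tokens[start:start + n] == suffix:
            return tokens[start + n:start + n + k]  # tokens that followed last time
    return []

history = [5, 6, 7, 8, 9, 5, 6]
print(propose_from_history(history, n=2, k=3))  # → [7, 8, 9]
```

Candidates like this are cheap to produce, so the cost of a wrong guess is mostly the wasted verification positions.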