Explain speculative decoding. Why is LLM decode latency hard to improve with more compute, and how does speculative decoding address this? Walk me through the algorithm and what determines whether it actually helps in practice.
tldr
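LLM decoding is sequential and memory-bandwidth-bound: each new token needs a forward pass that streams the model weights, and token t+1 cannot start until token t exists, so extra compute mostly sits idle. Speculative decoding puts that idle compute to work: a fast draft model proposes K tokens, and the target model verifies all of them in one parallel forward pass. An accept/reject rule makes the emitted tokens follow the target distribution exactly, so there is no quality loss. Speedup depends on the acceptance rate α (the probability that the target accepts a draft token), on K, and on the draft model's cost relative to the target. Assuming independent acceptances, one target call yields (1 - α^(K+1)) / (1 - α) tokens in expectation; at α=0.8, K=4 that is ~3.4 tokens per target call vs 1. Most effective on predictable text spans; least effective on high-entropy generation.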
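The accept/reject step and the yield estimate can be sketched in a few lines of standard-library Python. This is a minimal illustration, not a production implementation: the function names (`expected_tokens_per_target_call`, `sample`, `verify`) are hypothetical, distributions are plain lists of probabilities over a toy vocabulary, and the yield formula assumes each draft token is accepted independently with the same probability α.

```python
import random


def expected_tokens_per_target_call(alpha: float, k: int) -> float:
    """Expected tokens emitted per target forward pass.

    Assumption: each draft token is accepted independently with probability alpha.
    The count includes the one token the target always contributes.
    """
    if alpha == 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def sample(weights, rng):
    """Draw an index in proportion to non-negative weights (need not sum to 1)."""
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]


def verify(draft_tokens, draft_probs, target_probs, rng):
    """One round of speculative verification.

    draft_tokens: K token ids proposed by the draft model.
    draft_probs:  K distributions q_i that the draft tokens were sampled from.
    target_probs: K+1 distributions p_i from the single target pass
                  (the last one covers the position after the final draft token).
    Returns the tokens emitted this round: between 1 and K+1 of them.
    """
    out = []
    for i, tok in enumerate(draft_tokens):
        p, q = target_probs[i], draft_probs[i]
        # Accept the draft token with probability min(1, p(tok) / q(tok)).
        if rng.random() < min(1.0, p[tok] / q[tok]):
            out.append(tok)
            continue
        # Rejected: resample from the residual max(0, p - q) and stop the round.
        residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
        out.append(sample(residual, rng))
        return out
    # Every draft token was accepted: take a bonus token from the target.
    out.append(sample(target_probs[-1], rng))
    return out
```

With these definitions, `expected_tokens_per_target_call(0.8, 4)` evaluates to about 3.36. Resampling from the residual on rejection is what keeps the emitted token distributed according to the target rather than the draft, at the cost of discarding every draft token after the first rejection.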
follow-up
- How does the acceptance-rejection step guarantee the output distribution is identical to the target model?
- When would speculative decoding make latency worse rather than better?
- How does Medusa differ from standard speculative decoding and what tradeoff does it make?