Speculative Decoding and Inference-Time Scaling
Scaling up training is not the only way to improve a model-based system. Inference-time techniques can improve speed, throughput, or answer quality after a model is deployed.
Speculative decoding
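Speculative decoding uses a smaller or cheaper draft process (often a small model) to propose several tokens ahead. The larger target model then verifies all of them in a single forward pass, keeps the longest prefix it agrees with, and supplies its own token at the first mismatch.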
draft model proposes: token1 token2 token3 token4
target model verifies: accept accept reject (token4 is discarded)
target model continues: replaces token3 with its own token, then drafting resumes
If many draft tokens are accepted, generation becomes faster, because each forward pass of the target model then yields several tokens instead of one.
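The propose-and-verify round above can be sketched in a few lines. This is a minimal illustration under simplifying assumptions, not a production implementation: both "models" are toy functions that return the next token of a fixed sentence, verification is greedy (a draft token is accepted only if it equals the target's own choice), and the names `speculative_step`, `generate`, `draft_next`, and `target_next` are hypothetical. A real system would check all draft positions in one batched forward pass of the target model rather than calling it once per token.

```python
from typing import Callable, List, Tuple

# A toy "model": any function mapping a token sequence to its next token.
NextToken = Callable[[List[str]], str]

def speculative_step(prefix: List[str], draft_next: NextToken,
                     target_next: NextToken, k: int = 4) -> List[str]:
    """Run one draft-then-verify round and return the tokens to append."""
    # 1. The draft model proposes k tokens autoregressively.
    proposed: List[str] = []
    for _ in range(k):
        proposed.append(draft_next(prefix + proposed))

    # 2. The target checks each proposed position (greedy verification).
    accepted: List[str] = []
    for tok in proposed:
        expected = target_next(prefix + accepted)
        if tok != expected:
            # First mismatch: keep the target's token, drop the rest.
            accepted.append(expected)
            return accepted
        accepted.append(tok)

    # 3. Every draft token was accepted: the target adds one extra token.
    accepted.append(target_next(prefix + accepted))
    return accepted

def generate(draft_next: NextToken, target_next: NextToken,
             k: int = 4, max_tokens: int = 32) -> Tuple[List[str], int]:
    """Repeat rounds until <eos> or max_tokens; return (tokens, rounds)."""
    out: List[str] = []
    rounds = 0
    while len(out) < max_tokens and "<eos>" not in out:
        out += speculative_step(out, draft_next, target_next, k)
        rounds += 1
    if "<eos>" in out:
        out = out[: out.index("<eos>")]
    return out[:max_tokens], rounds

# Toy models: the draft disagrees with the target at one position ("in" vs "on").
TARGET_TEXT = "the cat sat on the mat today".split()
DRAFT_TEXT = "the cat sat in the mat today".split()

def target_next(seq: List[str]) -> str:
    return TARGET_TEXT[len(seq)] if len(seq) < len(TARGET_TEXT) else "<eos>"

def draft_next(seq: List[str]) -> str:
    return DRAFT_TEXT[len(seq)] if len(seq) < len(DRAFT_TEXT) else "<eos>"

tokens, rounds = generate(draft_next, target_next, k=4)
print(" ".join(tokens), "| rounds:", rounds)
```

With greedy verification the output is identical to what the target would produce on its own; the draft only changes how many target rounds are needed. In this toy run the seven-token sentence is produced in two rounds instead of one target call per token.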
When it helps
- high-throughput serving
- long generations
- repeated traffic patterns
- open-weight model hosting
- workloads with predictable language
It helps less when the target model rejects most draft tokens: the drafting work is wasted, and generation can end up slower than plain decoding.
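A rough way to see where the break-even lies is to estimate how many tokens each round yields. The sketch below assumes every draft token is accepted independently with the same probability `acceptance_rate`, which is a simplification, and that one draft step costs a fixed fraction `draft_cost_ratio` of one target pass. Both parameter names and the function names are illustrative, not taken from any library.

```python
def expected_tokens_per_round(acceptance_rate: float, k: int) -> float:
    """Expected tokens produced per target pass with k draft tokens.

    Assumes each draft token is accepted independently with the same
    probability. The sum 1 + a + a^2 + ... + a^k has the closed form below.
    """
    if not 0.0 <= acceptance_rate <= 1.0:
        raise ValueError("acceptance_rate must be in [0, 1]")
    if acceptance_rate == 1.0:
        return float(k + 1)
    return (1 - acceptance_rate ** (k + 1)) / (1 - acceptance_rate)

def estimated_speedup(acceptance_rate: float, k: int,
                      draft_cost_ratio: float) -> float:
    """Tokens per round divided by round cost, in units of one target pass.

    draft_cost_ratio is the assumed cost of one draft step relative to one
    target pass; a round costs k draft steps plus one target pass.
    """
    round_cost = k * draft_cost_ratio + 1.0
    return expected_tokens_per_round(acceptance_rate, k) / round_cost

for rate in (0.2, 0.5, 0.8):
    print(rate, round(estimated_speedup(rate, k=4, draft_cost_ratio=0.1), 2))
```

Under these assumptions, a 0.8 acceptance rate with four draft tokens yields about 3.36 tokens per round, while a 0.2 acceptance rate with a draft that costs a tenth of the target gives an estimated speedup below 1, meaning a slowdown.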
Other inference-time scaling patterns
| Pattern | Goal |
|---|---|
| Best-of-N sampling | generate several answers and choose the best |
| Self-consistency | majority vote over reasoning paths |
| Reranking | score generated answers with a judge or reward model and keep the best |
| Chunked prefill | handle long prompts more efficiently |
| Continuous batching | keep GPUs full across many users |
| Prefix caching | reuse repeated system/context prefixes |
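The two answer-quality patterns in the table reduce to simple selection rules once the candidates exist. The sketch below assumes the candidates have already been sampled; `best_of_n` and `self_consistency` are hypothetical helper names, and the length-based scorer is only a stand-in for a real judge or reward model.

```python
from collections import Counter
from typing import Callable, List

def best_of_n(candidates: List[str], score: Callable[[str], float]) -> str:
    """Return the candidate with the highest score."""
    return max(candidates, key=score)

def self_consistency(final_answers: List[str]) -> str:
    """Return the most common final answer across sampled reasoning paths."""
    return Counter(final_answers).most_common(1)[0][0]

# Final answers extracted from five sampled reasoning paths.
sampled_answers = ["42", "41", "42", "42", "40"]
print(self_consistency(sampled_answers))

# Stand-in scorer (assumption): prefers longer answers. A real system would
# call a judge or reward model here.
drafts = ["short", "a longer answer", "mid one"]
print(best_of_n(drafts, score=len))
```

Either rule requires generating every candidate first, so the generation cost scales with the number of samples regardless of how cheap the selection step is.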
Serving tradeoffs
Optimizing one part of the serving stack often moves the bottleneck to another. Measure each of these before and after a change:
- GPU memory
- KV cache size
- network streaming
- scheduler overhead
- batch latency
- cost per accepted token
Always test with your real prompt lengths and concurrency.
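Prompt length and concurrency matter because the KV cache grows linearly with both. A back-of-the-envelope estimate, assuming a standard transformer that stores one key and one value vector per layer, per KV head, per token, looks like this. The model dimensions in the example are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int,
                   bytes_per_value: int = 2) -> int:
    """Estimate KV cache size in bytes.

    The leading factor 2 accounts for storing both keys and values.
    bytes_per_value defaults to 2 (16-bit floats).
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)

# Hypothetical configuration: 32 layers, 8 KV heads of dimension 128,
# 4096-token sequences, 16 concurrent requests, 16-bit cache.
size = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128,
                      seq_len=4096, batch_size=16)
print(size / 1024**3, "GiB")
```

For this configuration the estimate comes to 8 GiB, and doubling either the sequence length or the number of concurrent requests doubles it.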
Knowledge check
Q1: What is the draft model's role in speculative decoding?
It proposes likely next tokens cheaply so the larger model can verify them.
Q2: Why can best-of-N be expensive?
It generates N candidate answers, so generation token usage grows roughly N-fold, plus the cost of scoring the candidates to pick the best.