Speculative Decoding and Inference-Time Scaling

Improve serving throughput and quality with draft models, parallel sampling, and inference-time optimization

24 min read · speculative decoding · inference · vLLM · serving

Training-time scale is not the only way to improve model systems. Inference-time techniques can improve speed, throughput, or answer quality after a model is deployed.

Speculative decoding

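Speculative decoding uses a smaller, cheaper draft model (or another cheap draft process) to propose several tokens ahead. The larger target model then checks those proposals in a single forward pass, keeps the longest prefix it agrees with, and supplies its own token at the first rejection.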

text
draft model proposes:   token1  token2  token3  token4
target model verifies:  accept  accept  reject  (discarded)
target model continues: its own token replaces token3, then the next draft round starts

Each accepted draft token saves one sequential decoding step of the target model, so generation gets faster as the acceptance rate rises.
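The propose, verify, and correct loop in the diagram can be sketched in a few lines. This is a minimal greedy illustration, not a production implementation: the "models" are hypothetical toy functions (`toy_target`, `toy_draft`) that map a token prefix to the next token, and verification is an exact-match check. Real systems compare token probabilities (typically with a rejection-sampling rule) and score all drafted positions in one batched forward pass rather than in a Python loop.

```python
from typing import Callable, List, Tuple

# Assumption: a "model" is any function from a token prefix to its greedy next token.
NextToken = Callable[[List[int]], int]


def speculative_step(prefix: List[int], draft: NextToken,
                     target: NextToken, k: int) -> List[int]:
    """Run one draft-then-verify round and return the tokens to append."""
    # 1. The draft model proposes k tokens autoregressively.
    proposed: List[int] = []
    ctx = list(prefix)
    for _ in range(k):
        tok = draft(ctx)
        proposed.append(tok)
        ctx.append(tok)

    # 2. The target model checks each proposal in order. (A real server does
    #    this for all k positions in a single forward pass.)
    accepted: List[int] = []
    ctx = list(prefix)
    for tok in proposed:
        expected = target(ctx)
        if tok != expected:
            # First mismatch: keep the target's token, drop the rest of the draft.
            accepted.append(expected)
            return accepted
        accepted.append(tok)
        ctx.append(tok)

    # 3. Every proposal was accepted, so the target adds one extra token.
    accepted.append(target(ctx))
    return accepted


def generate(prefix: List[int], draft: NextToken, target: NextToken,
             k: int, max_new_tokens: int) -> Tuple[List[int], int]:
    """Generate tokens with speculative rounds; also return the round count."""
    out = list(prefix)
    rounds = 0
    while len(out) - len(prefix) < max_new_tokens:
        out.extend(speculative_step(out, draft, target, k))
        rounds += 1
    return out[: len(prefix) + max_new_tokens], rounds


# Hypothetical toy models: the target counts upward modulo 10; the draft
# agrees except right after token 2, where it guesses wrong.
def toy_target(ctx: List[int]) -> int:
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[int]) -> int:
    return 0 if ctx[-1] == 2 else (ctx[-1] + 1) % 10


tokens, rounds = generate([0], toy_draft, toy_target, k=4, max_new_tokens=12)
print(tokens, rounds)
```

In this toy run the first round reproduces the diagram (two accepts, one rejection, one discarded draft token), and the 12 new tokens are produced in 3 verification rounds instead of 12 sequential target steps. The output is identical to what the target model would generate on its own, which is the property the verification step is there to protect.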

When it helps

  • high-throughput serving
  • long generations
  • repeated traffic patterns
  • open-weight model hosting
  • workloads with predictable language

It helps less when the target model rejects most draft tokens, because the compute spent drafting and verifying the rejected tokens is wasted.
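A rough way to see how much the acceptance rate matters is the standard back-of-the-envelope estimate below. It assumes each draft token is accepted independently with the same probability `alpha`, which is a simplification, and it models the draft model's cost as a fixed fraction `draft_cost_ratio` of one target forward pass. Both parameter names are illustrative.

```python
def expected_tokens_per_round(alpha: float, k: int) -> float:
    """Expected tokens emitted per target forward pass.

    Assumes k drafted tokens per round, each accepted independently with
    probability alpha. The round always yields at least one token, because
    the target supplies a correction (or a bonus token) itself.
    """
    if alpha >= 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def estimated_speedup(alpha: float, k: int, draft_cost_ratio: float) -> float:
    """Estimated speedup over plain decoding under the same assumptions.

    One round costs k draft steps plus one target pass, measured in units
    of a target forward pass.
    """
    round_cost = k * draft_cost_ratio + 1.0
    return expected_tokens_per_round(alpha, k) / round_cost


for alpha in (0.2, 0.5, 0.8):
    print(alpha, round(estimated_speedup(alpha, k=4, draft_cost_ratio=0.1), 2))
```

With these assumed numbers, a low acceptance rate such as 0.2 gives an estimate below 1.0, meaning speculative decoding would be slower than plain decoding, while 0.8 gives roughly a 2.4x estimate. Measured speedups depend on hardware, batch size, and how the server schedules draft and target work, so treat this only as a first-order guide.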

Other inference-time scaling patterns

| Pattern | Goal |
| --- | --- |
| Best-of-N sampling | Generate several answers and choose the best |
| Self-consistency | Majority vote over reasoning paths |
| Rerank generated answers | Use a judge/reward model |
| Chunked prefill | Handle long prompts more efficiently |
| Continuous batching | Keep GPUs full across many users |
| Prefix caching | Reuse repeated system/context prefixes |
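The first three rows of the table share one shape: sample several candidates, then pick one. The sketch below illustrates best-of-N with a scoring function and self-consistency with a majority vote. The `sample`, `sample_answer`, and `score` callables are placeholders for a model call, an answer extractor, and a judge or reward model; `make_sampler` is a hypothetical stand-in that replays canned outputs so the example runs without a model.

```python
from collections import Counter
from typing import Callable, Iterable


def best_of_n(prompt: str,
              sample: Callable[[str], str],
              score: Callable[[str, str], float],
              n: int) -> str:
    """Draw n candidates and return the one the scorer rates highest."""
    candidates = [sample(prompt) for _ in range(n)]
    return max(candidates, key=lambda c: score(prompt, c))


def self_consistency(prompt: str,
                     sample_answer: Callable[[str], str],
                     n: int) -> str:
    """Draw n final answers and return the most frequent one."""
    votes = Counter(sample_answer(prompt) for _ in range(n))
    return votes.most_common(1)[0][0]


def make_sampler(outputs: Iterable[str]) -> Callable[[str], str]:
    """Hypothetical stand-in for a model: replays fixed outputs in order."""
    it = iter(outputs)
    return lambda prompt: next(it)


# Toy scorer (an assumption for the demo): longer answers score higher.
best = best_of_n("q", make_sampler(["a", "abcd", "ab"]),
                 score=lambda p, c: float(len(c)), n=3)
voted = self_consistency("q", make_sampler(["42", "41", "42", "42", "7"]), n=5)
print(best, voted)
```

Both functions call the sampler `n` times, so generated-token cost grows roughly linearly with `n`, plus the cost of scoring when a judge or reward model is used.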

Serving tradeoffs

Optimizing one part of the serving stack often moves the bottleneck somewhere else. Resources and metrics to watch:

  • GPU memory
  • KV cache size
  • network streaming
  • scheduler overhead
  • batch latency
  • cost per accepted token

Always test with your real prompt lengths and concurrency.
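Prompt length and concurrency matter partly because the KV cache grows with both. A rough sizing estimate for a standard transformer decoder, assuming the cache stores one key and one value vector per layer, per KV head, per token, looks like this. The model dimensions in the example are hypothetical and not taken from any specific model.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int,
                   bytes_per_elem: int = 2) -> int:
    """Approximate KV cache size in bytes.

    The leading 2 accounts for storing both keys and values.
    bytes_per_elem defaults to 2, as for fp16 or bf16 caches.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


# Hypothetical configuration: 32 layers, 8 KV heads of dimension 128,
# 4096-token sequences, 16 concurrent sequences, 16-bit cache entries.
total = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128,
                       seq_len=4096, batch_size=16)
print(total / 1024**3, "GiB")
```

For this assumed configuration the estimate comes to 8 GiB of cache on top of the model weights. Doubling either the sequence length or the number of concurrent sequences doubles it, which is why a benchmark with short prompts and low concurrency can hide memory limits that appear in production.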

Knowledge check

Q1: What is the draft model's role in speculative decoding?
It proposes likely next tokens cheaply so the larger model can verify them.

Q2: Why can best-of-N be expensive?
It generates multiple candidate answers, multiplying token usage.