Speculative Decoding and Inference-Time Scaling

Training-time scale is not the only way to improve model systems. Inference-time techniques can improve speed, throughput, or answer quality after a model is deployed.

Speculative decoding

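Speculative decoding uses a smaller, cheaper draft model (or another cheap draft process) to propose several tokens ahead. The larger target model then checks those proposals in a single forward pass, keeps the longest prefix it agrees with, and supplies its own token at the first disagreement.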

draft model proposes:   token1 token2 token3 token4
target model verifies:  accept accept reject (token4 is discarded)
target model continues: corrected token...

If many draft tokens are accepted, each target-model pass yields several tokens instead of one, so generation becomes faster.
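The propose-and-verify round can be sketched in a few lines. This is a minimal illustration under strong assumptions, not a production implementation: the "models" are hypothetical toy functions that return one greedy next token, verification is an exact-match check (real systems that sample use a probabilistic acceptance rule instead), and the names `speculative_step`, `make_model`, `TARGET_TEXT`, and `DRAFT_TEXT` are invented for this example.

```python
from typing import Callable, List

# Hypothetical stand-in for a model: maps a token sequence to the single
# next token it would pick greedily.
NextToken = Callable[[List[str]], str]


def speculative_step(prefix: List[str], draft: NextToken,
                     target: NextToken, k: int) -> List[str]:
    """Run one draft-then-verify round and return the tokens to append."""
    # 1. The draft model proposes k tokens autoregressively.
    proposed: List[str] = []
    for _ in range(k):
        proposed.append(draft(prefix + proposed))

    # 2. The target model checks each proposed position. A real system
    #    scores all k positions in one batched forward pass; this loop
    #    only mimics the outcome of that pass.
    accepted: List[str] = []
    for tok in proposed:
        expected = target(prefix + accepted)
        if tok != expected:
            # 3. First mismatch: keep the target's token, drop the rest.
            accepted.append(expected)
            return accepted
        accepted.append(tok)

    # Every draft token matched, so the same target pass also yields
    # one extra token for free.
    accepted.append(target(prefix + accepted))
    return accepted


def make_model(text: List[str]) -> NextToken:
    """Toy model (assumption): always continues a fixed sentence."""
    def model(seq: List[str]) -> str:
        return text[len(seq)] if len(seq) < len(text) else "<eos>"
    return model


TARGET_TEXT = "the cat sat on the mat".split()
DRAFT_TEXT = "the cat lay on the mat".split()  # disagrees at the third token

target = make_model(TARGET_TEXT)
draft = make_model(DRAFT_TEXT)

print(speculative_step([], draft, target, k=4))
# → ['the', 'cat', 'sat']  (accept, accept, reject and correct; 4th draft token dropped)
```

In this sketch the output always equals what the target model would have produced on its own. The draft model affects only how many tokens each round yields, not which tokens are emitted.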

When it helps

high-throughput serving
long generations
repeated traffic patterns
open-weight model hosting
workloads with predictable language

It helps less when the target model rejects most draft tokens, because the draft work is then mostly wasted and each round still costs a full target-model pass.
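To put rough numbers on the effect of rejections, the sketch below estimates how many tokens one target-model pass yields on average. It assumes, as a simplification, that each draft token is accepted independently with the same probability `alpha` and that `k` tokens are drafted per round. Under that assumption the expected yield is (1 - alpha^(k+1)) / (1 - alpha). The function name is hypothetical, and the estimate ignores the cost of running the draft model.

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens emitted per target pass, assuming each of the k
    draft tokens is accepted independently with probability alpha."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if k < 0:
        raise ValueError("k must be non-negative")
    if alpha == 1.0:
        # Every draft token is accepted, plus one bonus token from the target.
        return float(k + 1)
    # Geometric series: 1 + alpha + alpha^2 + ... + alpha^k
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


for alpha in (0.2, 0.5, 0.8):
    print(f"alpha={alpha:.1f}  tokens/pass={expected_tokens_per_pass(alpha, k=4):.2f}")
```

With four drafted tokens, an acceptance rate of 0.8 gives about 3.4 tokens per target pass under this model, while 0.2 gives about 1.2, which is barely better than ordinary decoding before draft cost is counted.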

Other inference-time scaling patterns

Pattern	Goal
Best-of-N sampling	generate several answers and choose the best
Self-consistency	majority vote over reasoning paths
Reranking	score generated answers with a judge or reward model and keep the best
Chunked prefill	handle long prompts more efficiently
Continuous batching	keep GPUs full across many users
Prefix caching	reuse repeated system/context prefixes
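The first two rows of the table differ only in how the winner is picked, which a short sketch can make concrete. The candidate strings, the scores, and the function names here are all made up for illustration. In practice the candidates would come from sampling the model N times, and the scorer would be a judge or reward model.

```python
from collections import Counter
from typing import Callable, Sequence


def best_of_n(candidates: Sequence[str], score: Callable[[str], float]) -> str:
    """Best-of-N: return the candidate with the highest score."""
    return max(candidates, key=score)


def self_consistency(final_answers: Sequence[str]) -> str:
    """Self-consistency: majority vote over the final answers extracted
    from several independently sampled reasoning paths."""
    return Counter(final_answers).most_common(1)[0][0]


# Hypothetical scores standing in for a judge or reward model.
toy_scores = {"Paris": 0.7, "Paris, the capital of France": 0.9, "Lyon": 0.1}
print(best_of_n(list(toy_scores), lambda c: toy_scores[c]))

# Final answers from five hypothetical sampled reasoning paths.
print(self_consistency(["42", "41", "42", "42", "40"]))
```

Both functions need all N candidates before they can choose, so token usage grows roughly linearly with N.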

Serving tradeoffs

Optimizing one part of the serving stack often moves the bottleneck, or the cost, somewhere else. Things to watch:

GPU memory
KV cache size
network streaming
scheduler overhead
batch latency
cost per accepted token
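KV cache size is often the first of these limits to bite, and it can be estimated before any benchmark. The sketch below uses the standard count of one key and one value vector per layer, per KV head, per token. The model configuration in the example is hypothetical and does not describe any particular model.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int, bytes_per_elem: int = 2) -> int:
    """Approximate KV cache size in bytes.

    The factor 2 covers keys and values; bytes_per_elem=2 assumes
    16-bit storage (fp16 or bf16).
    """
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token * seq_len * batch_size


# Hypothetical configuration: 32 layers, 8 KV heads of dimension 128,
# 4096-token sequences, 16 concurrent sequences, 16-bit cache.
size = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128,
                      seq_len=4096, batch_size=16)
print(f"{size / 1024**3:.1f} GiB")
# → 8.0 GiB
```

Because the estimate scales linearly with both sequence length and batch size, doubling concurrency or context length doubles the cache. This is one reason measurements at realistic prompt lengths matter.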

Always benchmark with your real prompt lengths, output lengths, and concurrency, because acceptance rates and batching behavior depend on the workload.

Knowledge check

Q1: What is the draft model's role in speculative decoding?
It proposes likely next tokens cheaply so the larger model can verify them.

Q2: Why can best-of-N be expensive?
It generates multiple candidate answers, multiplying token usage.