Speculative Decoding: A Cheap Model Guessing for an Expensive One

A small draft model proposes a handful of tokens (say 5); the big model verifies them all in a single forward pass. Net effect: typically 2–3× faster decoding with the same output distribution.

Apr 29, 2026 · 3 min read

The autocomplete analogy

You are typing an email. A junior assistant reads over your shoulder and suggests the next five words. You glance — yep, three are right, two are wrong. You accept the three, fix the rest. You finished five words in the time it would have taken to type two.

Speculative decoding works the same way. A small, fast draft model proposes the next 5–10 tokens. The big, slow target model verifies them all in one forward pass. The accepted prefix is kept; the first rejected token and everything after it are discarded, and the target supplies its own token at that position. Net: usually 2–3× faster generation, with an output distribution identical to that of the big model alone.
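
The draft-then-verify loop can be sketched with two toy "models". This is a minimal illustration under greedy decoding, not a production implementation: `speculative_step`, `generate`, `toy_target` and `toy_draft` are hypothetical names, and the verification loop below stands in for what a real target model would compute in a single batched forward pass.

```python
from typing import Callable, List

NextFn = Callable[[List[int]], int]  # maps a token sequence to the next token


def speculative_step(prefix: List[int], draft_next: NextFn,
                     target_next: NextFn, k: int) -> List[int]:
    """One draft-then-verify round (greedy decoding). Returns the new tokens."""
    # 1. The cheap model drafts k tokens, one after another.
    drafted: List[int] = []
    for _ in range(k):
        drafted.append(draft_next(prefix + drafted))

    # 2. Compare each drafted token with the target's own choice.
    #    (A real target scores all k positions in one forward pass.)
    accepted: List[int] = []
    for tok in drafted:
        expected = target_next(prefix + accepted)
        if tok != expected:
            accepted.append(expected)  # target's token replaces the first mismatch
            return accepted            # everything after it is discarded
        accepted.append(tok)

    # 3. All k drafts matched: the same pass also yields one extra token.
    accepted.append(target_next(prefix + accepted))
    return accepted


def generate(prefix: List[int], draft_next: NextFn, target_next: NextFn,
             k: int, n_tokens: int) -> List[int]:
    out = list(prefix)
    while len(out) - len(prefix) < n_tokens:
        out.extend(speculative_step(out, draft_next, target_next, k))
    return out[: len(prefix) + n_tokens]


# Toy models: the target counts upward mod 10; the draft agrees most of the time.
def toy_target(seq: List[int]) -> int:
    return (seq[-1] + 1) % 10


def toy_draft(seq: List[int]) -> int:
    return 0 if seq[-1] % 4 == 3 else (seq[-1] + 1) % 10


print(generate([0], toy_draft, toy_target, k=5, n_tokens=12))
```

Whatever the draft proposes, the result is the sequence the target alone would have produced; the draft only changes how many target passes it takes to get there.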

The trick that makes it lossless

The output is mathematically equivalent to running the big model alone. How? Rejection sampling on the probability ratios.

For each drafted token, you compare the draft model's probability with the target model's probability and accept the token with probability min(1, p_target / p_draft). If you reject, you resample from a corrected distribution: the residual proportional to max(0, p_target − p_draft), taken over the whole vocabulary. The math guarantees the final sample follows exactly the distribution the target model would have sampled from.
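
The accept-or-resample rule for a single drafted token can be sketched as follows. This is a minimal sketch using only the standard library; the function names and the three-token vocabulary are made up for illustration, and the residual distribution is taken to be max(0, p_target − p_draft), renormalized.

```python
import random
from typing import List, Tuple


def accept_or_resample(token: int, p_draft: List[float], p_target: List[float],
                       rng: random.Random) -> Tuple[bool, int]:
    """Decide the fate of one drafted token. Returns (accepted, final_token)."""
    # Accept with probability min(1, p_target / p_draft) for this token.
    ratio = p_target[token] / p_draft[token]
    if rng.random() < min(1.0, ratio):
        return True, token
    # Rejected: sample from the residual max(0, p_target - p_draft).
    # random.choices normalizes the weights for us.
    residual = [max(t - d, 0.0) for t, d in zip(p_target, p_draft)]
    new_token = rng.choices(range(len(residual)), weights=residual, k=1)[0]
    return False, new_token


def sample_via_draft(p_draft: List[float], p_target: List[float],
                     rng: random.Random) -> int:
    """Draw a token from the draft, then accept it or resample it."""
    token = rng.choices(range(len(p_draft)), weights=p_draft, k=1)[0]
    _, final = accept_or_resample(token, p_draft, p_target, rng)
    return final


# Empirical check on a hypothetical 3-token vocabulary.
rng = random.Random(0)
p_draft = [0.6, 0.3, 0.1]
p_target = [0.2, 0.5, 0.3]
n = 20000
counts = [0, 0, 0]
for _ in range(n):
    counts[sample_via_draft(p_draft, p_target, rng)] += 1
print([round(c / n, 3) for c in counts])  # should be close to p_target
```

Even though every sample starts from the draft distribution, the empirical frequencies land on the target distribution, which is the lossless property in miniature.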

You are not cutting corners on quality — you are exploiting the fact that the target model can verify tokens much faster than it can generate them.

Why verification is cheap

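Generating one token takes one forward pass of the target model. Verifying K drafted tokens also takes one forward pass: the target scores all K positions in parallel, reusing the attention state (KV cache) it already holds for the prefix. Because single-request decoding is usually limited by memory bandwidth rather than compute, scoring K positions costs about the same as scoring one. So you spend roughly the cost of one big-model pass and potentially get multiple tokens out of it.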

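If 4 of 5 draft tokens are accepted, the round yields 5 tokens: the 4 accepted ones plus the target's own token at the first rejected position. The price is one big-model pass plus five tiny-model steps. With a draft that is much cheaper than the target, net speedup ≈ 3–4×.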
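
A slightly more careful accounting of one round, as a sketch under stated assumptions: let K be the number of drafted tokens, n the number accepted, and c the cost of one draft step relative to one target pass (the value c = 0.05 below is an assumed figure, not a measurement). The draft runs K times, the target runs once, and the target pass always contributes one token of its own (a correction if n < K, a bonus token if n = K). Verification of K positions is assumed to cost the same as a single decode pass.

```latex
\text{speedup} \approx \frac{n + 1}{1 + K c},
\qquad
n = 4,\ K = 5,\ c = 0.05:\quad
\frac{4 + 1}{1 + 5 \times 0.05} = \frac{5}{1.25} = 4
```

With a costlier draft, say c = 0.1, the same round gives 5 / 1.5 ≈ 3.3, which is why the draft model's speed matters as much as its accuracy.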

Choosing the draft model

A good draft model has two properties:

Fast: ideally at least an order of magnitude cheaper per token than the target.
Predictive: its outputs broadly agree with the target's high-probability tokens.

Common picks:

A distilled version of the target.
A smaller sibling in the same family (Llama-7B drafting for Llama-70B).
Medusa heads: extra prediction heads on the target model itself that propose several tokens at once.
An n-gram cache: for very repetitive contexts (code, structured outputs), even a literal match against earlier text can serve as a draft.
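
The n-gram option needs no model at all. A minimal sketch, assuming tokens are plain integers and that `ngram_draft` (a hypothetical helper, not any library's API) simply copies whatever followed the most recent earlier occurrence of the last n tokens:

```python
from typing import List, Sequence


def ngram_draft(context: Sequence[int], n: int = 3, k: int = 5) -> List[int]:
    """Propose up to k tokens by literal lookup in the existing context.

    Finds the most recent earlier occurrence of the last n tokens and
    returns the tokens that followed it. Returns [] when there is no match.
    """
    if len(context) <= n:
        return []
    suffix = list(context[-n:])
    # Scan right to left, skipping the suffix's own position at the end.
    for start in range(len(context) - n - 1, -1, -1):
        if list(context[start:start + n]) == suffix:
            return list(context[start + n:start + n + k])
    return []


# The sequence 1 2 3 was previously followed by 4 5 6 ..., so that is the draft.
print(ngram_draft([1, 2, 3, 4, 5, 6, 1, 2, 3]))
```

The proposed tokens still go through the normal verification step, so a bad lookup costs a little wasted work but never changes the output.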

When speedup is large vs small

Speedup tracks acceptance rate:

High acceptance (predictable text, structured output): up to 3–4× faster.
Medium acceptance (general prose): 1.8–2.5×.
Low acceptance (creative, surprising output): barely 1.2×, and sometimes slower than plain decoding, because the time spent drafting and verifying tokens that end up rejected is wasted.

The technique shines on factual / template-heavy / code generation. It pays less on highly creative generation.
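
The link between acceptance rate and speedup can be sketched with a simple model. It assumes each drafted token is accepted independently with the same probability `alpha`, that verifying K tokens costs one target pass, and that one draft step costs `draft_cost` target passes; all three are simplifications, and the parameter values below are illustrative rather than measured.

```python
def expected_tokens_per_round(alpha: float, k: int) -> float:
    """Expected tokens produced per target pass with k drafted tokens.

    Under independent acceptance with probability alpha this is the
    geometric sum 1 + alpha + ... + alpha**k.
    """
    if alpha >= 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def expected_speedup(alpha: float, k: int, draft_cost: float) -> float:
    """Tokens per round divided by the round's cost in target-pass units."""
    return expected_tokens_per_round(alpha, k) / (1.0 + k * draft_cost)


for alpha in (0.9, 0.7, 0.3):
    s = expected_speedup(alpha, k=5, draft_cost=0.05)
    print(f"alpha={alpha:.1f}  speedup={s:.2f}x")
```

With these assumed values the three acceptance rates come out at roughly 3.7×, 2.4× and 1.1×. Raising `draft_cost` to 0.2 at `alpha=0.3` pushes the result below 1, which is the slower-than-baseline case.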

Variants you'll meet

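Vanilla speculative decoding (also called speculative sampling, SpS): separate draft and target models.
Medusa: drafting heads bolted onto the target model. No second model to run.
Lookahead decoding: no draft model; the target builds candidate n-grams from its own parallel (Jacobi-style) decoding steps and verifies them.
EAGLE: a newer recipe whose lightweight draft head conditions on the target model's hidden-state features rather than on tokens alone.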

All share the same core idea: propose many, verify cheaply, accept what matches. The implementations differ in how they get cheap good proposals.

Practical caveat

Speculative decoding lowers single-request latency. Throughput is more nuanced: under heavy continuous batching the GPU is already compute-bound, so the extra draft and verification work competes with other requests for the same compute and can reduce total throughput. Many serving stacks let you turn it on per-request based on tenant or SLO.

From the field

Speculative decoding is one of those wins you mostly enjoy without configuring — the provider runs it and your tokens-per-second go up for free. The nuance worth knowing: it speeds the streaming part, not the first token, so it won't rescue a slow TTFT caused by a giant prompt. And the speedup depends on how predictable your text is — boilerplate and code accelerate a lot, genuinely novel prose less so. When I'm puzzling over why one provider's endpoint streams faster than another's on the same model, draft-model speculative decoding is often the hidden reason, not a fundamentally better core model.
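
The TTFT point is easy to see with back-of-the-envelope arithmetic. A minimal sketch with made-up numbers (a 4 s time to first token, 200 output tokens, 50 tokens/s baseline decode, and an assumed 2.5× decode speedup); `response_latency` is a hypothetical helper:

```python
def response_latency(ttft_s: float, n_tokens: int, tokens_per_s: float,
                     decode_speedup: float = 1.0) -> float:
    """Total seconds for a streamed response: prefill wait plus decode time.

    decode_speedup scales only the streaming phase; TTFT is left untouched.
    """
    return ttft_s + n_tokens / (tokens_per_s * decode_speedup)


baseline = response_latency(ttft_s=4.0, n_tokens=200, tokens_per_s=50.0)
with_spec = response_latency(ttft_s=4.0, n_tokens=200, tokens_per_s=50.0,
                             decode_speedup=2.5)
print(f"baseline {baseline:.1f}s, with speculative decoding {with_spec:.1f}s")
```

In this example a 2.5× faster stream shortens the whole response only from 8.0 s to 5.6 s, because half of the baseline time was spent waiting for the first token.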
