Speculative Decoding: A Cheap Model Guessing for an Expensive One

A tiny draft model proposes a few tokens ahead (five, say); the big model verifies them in a single forward pass. Net effect: typically 2–3× faster decoding with the same output distribution.

The autocomplete analogy

You are typing an email. A junior assistant reads over your shoulder and suggests the next five words. You glance — yep, three are right, two are wrong. You accept the three, fix the rest. You finished five words in the time it would have taken to type two.

Speculative decoding works the same way. A small, fast draft model proposes the next 5–10 tokens. The big, slow target model checks them all in one forward pass. Tokens are accepted left to right; at the first rejection, that token and everything after it are discarded, and the target supplies its own token for that position. Net: usually 2–3× faster generation, with an output distribution identical to the big model's alone.
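The draft-then-verify loop can be sketched for the simplest case, greedy decoding, where "accept" just means the draft token equals the target's top choice. Everything here is a toy under stated assumptions: `toy_target` and `toy_draft` are hypothetical stand-ins for real models, and the per-position target calls simulate what a real model would compute in one batched forward pass.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]  # maps a token sequence to the next token


def greedy_speculative_decode(
    target_next: NextToken,
    draft_next: NextToken,
    prompt: List[int],
    max_new_tokens: int,
    k: int = 5,
) -> Tuple[List[int], int]:
    """Return (sequence, number of target verification passes)."""
    seq = list(prompt)
    goal = len(prompt) + max_new_tokens
    target_passes = 0
    while len(seq) < goal:
        # 1. The draft model proposes k tokens, one after another.
        draft: List[int] = []
        for _ in range(k):
            draft.append(draft_next(seq + draft))
        # 2. The target scores all k+1 positions. A real model does this in
        #    one forward pass; the toy calls below only simulate that.
        target_passes += 1
        target_choice = [target_next(seq + draft[:i]) for i in range(k + 1)]
        # 3. Accept the longest prefix on which draft and target agree.
        n_ok = 0
        while n_ok < k and draft[n_ok] == target_choice[n_ok]:
            n_ok += 1
        # 4. Keep the accepted tokens plus the target's own token at the
        #    first mismatch (or one bonus token if everything matched).
        seq.extend(draft[:n_ok])
        seq.append(target_choice[n_ok])
    return seq[:goal], target_passes


# Hypothetical toy models: the draft agrees with the target except after token 4.
def toy_target(seq: List[int]) -> int:
    return (seq[-1] * 3 + 1) % 7


def toy_draft(seq: List[int]) -> int:
    return 0 if seq[-1] == 4 else toy_target(seq)


def plain_greedy(target_next: NextToken, prompt: List[int], n: int) -> List[int]:
    seq = list(prompt)
    for _ in range(n):
        seq.append(target_next(seq))
    return seq
```

Calling `greedy_speculative_decode(toy_target, toy_draft, [1], 12)` should return the same sequence as `plain_greedy(toy_target, [1], 12)` while using fewer target passes than tokens generated, which is the whole point of the technique.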

The trick that makes it lossless

The output is mathematically equivalent to running the big model alone. How? Rejection sampling on the probability ratios.

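For each drafted token, you compare the probability the draft model gave it (p_draft) with the probability the target model gives it (p_target), and accept it with probability min(1, p_target / p_draft). If you reject, you resample that position from a corrected distribution: the positive part of (p_target − p_draft), renormalized. The math guarantees that every emitted token is distributed exactly as if the target model had sampled it.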

You are not cutting corners on quality — you are exploiting the fact that the target model can verify tokens much faster than it can generate them.
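The acceptance rule can be written down in a few lines. This is a minimal sketch using only the standard library; the function names (`sample`, `verify_draft`) and the list-of-lists layout for the distributions are illustrative assumptions, not any library's API. It assumes the target has already produced a distribution for each drafted position plus one extra position.

```python
import random
from typing import List, Sequence


def sample(weights: Sequence[float], rng: random.Random) -> int:
    """Draw an index in proportion to the given (unnormalized) weights."""
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]


def verify_draft(
    draft_tokens: Sequence[int],
    q_rows: Sequence[Sequence[float]],  # draft distribution at each drafted position
    p_rows: Sequence[Sequence[float]],  # target distribution at each position, plus one more
    rng: random.Random,
) -> List[int]:
    """Return the tokens to emit: accepted drafts plus exactly one target token."""
    out: List[int] = []
    for i, tok in enumerate(draft_tokens):
        p, q = p_rows[i], q_rows[i]
        # Accept with probability min(1, p_target / p_draft).
        # q[tok] > 0 because tok was sampled from q.
        if rng.random() < min(1.0, p[tok] / q[tok]):
            out.append(tok)
            continue
        # Rejected: resample from the renormalized positive part of (p - q).
        # rng.choices normalizes the weights for us.
        residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
        out.append(sample(residual, rng))
        return out  # everything after the first rejection is discarded
    # All drafts accepted: take one bonus token from the target's next distribution.
    out.append(sample(p_rows[len(draft_tokens)], rng))
    return out
```

A quick sanity check on this sketch: if the draft and target distributions are identical, every ratio is 1 and all drafted tokens are accepted; if the draft puts all its mass on a token the target gives zero probability, the first token is always rejected and replaced by a sample from the target.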

Why verification is cheap

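Generating one token is one forward pass. Verifying K drafted tokens is also one forward pass: the target model processes all K positions in parallel, the same way it processes a prompt. Because decoding is usually limited by memory bandwidth rather than arithmetic, that pass costs little more than generating a single token. So you spend the cost of one big-model pass and potentially get multiple tokens out of it.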

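If 4 of 5 draft tokens are accepted, you advance 5 tokens (the 4 accepted plus the target's own token at the rejected position) for the price of one big-model pass plus five small-model steps. When the draft model is cheap enough, the net speedup is ≈ 3–4×.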
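A slightly more careful accounting of a single step counts every draft-model call and the one extra token the target always contributes. The sketch below measures cost in units of one target forward pass; `draft_cost` (the cost of one draft step relative to one target pass) is an assumed parameter, and the model ignores any overhead of verifying several positions at once.

```python
def step_speedup(accepted: int, drafted: int, draft_cost: float) -> float:
    """Speedup of one speculative step over plain decoding.

    accepted:   number of drafted tokens the target accepted
    drafted:    number of tokens the draft model proposed
    draft_cost: cost of one draft step as a fraction of one target pass (assumed)
    """
    tokens_out = accepted + 1            # accepted drafts + one token from the target
    cost = 1.0 + drafted * draft_cost    # one target pass + `drafted` draft steps
    return tokens_out / cost             # plain decoding yields 1 token per unit cost


# 4 of 5 accepted, draft step costs 5% of a target pass: 5 / 1.25 = 4.0
print(step_speedup(accepted=4, drafted=5, draft_cost=0.05))
```

With a draft that costs 10% of a target pass instead, the same step gives 5 / 1.5, a little over 3×, which is why the draft model's relative cost matters as much as its accuracy.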

Choosing the draft model

A good draft model has two properties:

  1. Fast: roughly an order of magnitude or more cheaper per token than the target.
  2. Predictive: its outputs broadly agree with the target's high-probability tokens.

Common picks:

  • A distilled version of the target.
  • A smaller sibling in the same family (Llama-7B drafting for Llama-70B).
  • Medusa heads: extra prediction heads on the target model itself that propose several future tokens at once.
  • An n-gram cache: for very repetitive contexts (code, structured outputs), even a literal match against earlier text can serve as a draft.
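The last option in the list needs no model at all. As a minimal sketch, assuming tokens are plain integers and using a hypothetical helper name `ngram_draft`, a draft can be produced by finding the most recent earlier occurrence of the context's final n-gram and proposing whatever followed it:

```python
from typing import List, Sequence


def ngram_draft(context: Sequence[int], ngram: int = 3, k: int = 5) -> List[int]:
    """Propose up to k tokens by matching the last `ngram` tokens earlier in context.

    Returns an empty list when there is no earlier match, in which case the
    caller would fall back to ordinary one-token-at-a-time decoding.
    """
    if len(context) < ngram:
        return []
    suffix = list(context[-ngram:])
    # Scan right to left so the most recent earlier occurrence wins.
    # The range stops short of the suffix's own position.
    for start in range(len(context) - ngram - 1, -1, -1):
        if list(context[start:start + ngram]) == suffix:
            follow = start + ngram
            return list(context[follow:follow + k])
    return []


# The context ends in [1, 2, 3], which also appears at the start,
# so the tokens that followed it there become the draft.
print(ngram_draft([1, 2, 3, 4, 5, 6, 1, 2, 3]))  # [4, 5, 6, 1, 2]
```

The proposed tokens still go through the same verification step as any other draft, so a bad match costs time but never changes the output.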

When speedup is large vs small

Speedup tracks acceptance rate:

  • High acceptance (predictable text, structured output): up to 3–4× faster.
  • Medium acceptance (general prose): 1.8–2.5×.
  • Low acceptance (creative, surprising output): barely 1.2×, and sometimes slower than plain decoding, because the drafting and verification work spent on rejected tokens is wasted.

The technique shines on factual / template-heavy / code generation. It pays less on highly creative generation.
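The link between acceptance rate and speedup can be estimated with a simple model. The sketch below assumes each drafted token is accepted independently with the same probability `alpha`, which is a simplification (real acceptance varies by position and context), and that one draft step costs `draft_cost` target passes. Both parameter names are illustrative. Under that assumption the expected number of tokens per target pass is a geometric sum, 1 + alpha + ... + alpha^k.

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens emitted per target pass with k drafted tokens."""
    if alpha >= 1.0:
        return float(k + 1)  # everything accepted, plus the bonus token
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def expected_speedup(alpha: float, k: int, draft_cost: float) -> float:
    """Expected speedup over plain decoding under the independence assumption."""
    cost = 1.0 + k * draft_cost  # one target pass + k draft steps
    return expected_tokens_per_pass(alpha, k) / cost


for alpha in (0.9, 0.8, 0.3):
    print(f"alpha={alpha}: {expected_speedup(alpha, k=5, draft_cost=0.05):.2f}x")
# alpha=0.9: 3.75x
# alpha=0.8: 2.95x
# alpha=0.3: 1.14x
```

With a slower draft model the low-acceptance case drops below 1×: for example, `expected_speedup(0.1, 5, 0.1)` is about 0.74, meaning speculative decoding would be slower than decoding normally.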

Variants you'll meet

  • Vanilla speculative decoding (also called speculative sampling, SpS): separate draft and target models.
  • Medusa: drafting heads bolted onto the target model. No second model to run.
  • Lookahead decoding: no draft model; the target builds a pool of candidate n-grams from its own parallel decoding steps and verifies them.
  • EAGLE: a newer recipe that conditions a lightweight draft head on the target's hidden states rather than on tokens alone.

All share the same core idea: propose many, verify cheaply, accept what matches. The implementations differ in how they get cheap good proposals.

Practical caveat

Speculative decoding cuts single-request latency. Throughput is more nuanced: under heavy continuous batching the GPU is already busy, and the extra draft and verification work competes for the same compute. Many serving stacks let you turn it on per-request based on tenant or SLO.

Engr Mejba Ahmed
