LLM Streaming: Why First-Token Latency Beats Total Time

Streaming sends tokens as the model produces them. Total wall-clock time is similar, but perceived speed is dramatically better, and you can cut the response off once the answer is good enough.

Apr 29, 2026 · 4 min read

The pizza-delivery analogy

Two pizzerias take 20 minutes to bake. Pizzeria A delivers in one trip after baking. Pizzeria B delivers slice-by-slice as each one comes out of the oven, every 2 minutes. The total food, the total wait, the total slices — identical. But you've been eating for 18 minutes at B. Perceived experience is night and day.

LLM streaming works the same way. The model still spends X seconds generating. But you start seeing output token-by-token — at "2-minute slices" — instead of waiting for the whole answer.

What streaming changes

Time-to-first-token (TTFT) — the metric users actually feel. Streaming makes this the number to optimise.
Cancellation — once the answer is clearly going where you want, cut off and save tokens. Once it's clearly going wrong, cut off even sooner.
Progressive UI — typing animation, partial markdown rendering, incremental tool dispatch.
Backpressure — a large output is no longer buffered in full for 30 seconds before anything is sent; bytes flow as soon as they're produced, and a slow reader can slow the sender down.

What it does not change:

Total cost — you still pay for every token.
Total wall-clock time — the model still takes the same time end-to-end.
Quality — same output as non-streaming, just delivered differently.
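To make the split between "what changes" and "what doesn't" concrete, here is a minimal sketch on a simulated clock. The timings (`PREFILL_MS`, `PER_TOKEN_MS`) and the helper names are assumptions chosen for illustration, not measurements from any real model.

```python
from typing import Iterator, List, Tuple

PREFILL_MS = 400    # assumed delay before the first token appears
PER_TOKEN_MS = 50   # assumed time to generate each later token


def generate(tokens: List[str]) -> Iterator[Tuple[int, str]]:
    """Yield (timestamp_ms, token) pairs on a simulated clock."""
    now = PREFILL_MS
    for i, token in enumerate(tokens):
        if i > 0:
            now += PER_TOKEN_MS
        yield now, token


def measure(tokens: List[str], streaming: bool) -> Tuple[int, int]:
    """Return (ms until the user first sees text, ms until the answer is complete)."""
    stamps = [stamp for stamp, _ in generate(tokens)]
    total = stamps[-1]
    # Without streaming, nothing is visible until the last token is done.
    first_visible = stamps[0] if streaming else total
    return first_visible, total


tokens = "Streaming changes when you see text , not how long it takes".split()
print("streaming:    ", measure(tokens, streaming=True))
print("non-streaming:", measure(tokens, streaming=False))
```

Both runs finish at the same moment and produce the same tokens; only the first number moves.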

How it works on the wire

Server-Sent Events (SSE) is the dominant transport. The HTTP response stays open and emits data: {...} lines.
Each event carries a small JSON delta: a token chunk, a tool-call argument fragment, a stop reason.
The client accumulates deltas; when the stream ends, it has the full response.

WebSockets and chunked JSON are alternatives but rarer. SSE wins because it is plain HTTP, has a built-in reconnection hook (the Last-Event-ID header), and passes through most proxies, as long as they don't buffer the response.
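The accumulate-the-deltas loop described above can be sketched with the standard library alone. The event shape here (`delta` and `stop_reason` keys) is a hypothetical schema for illustration; real APIs each define their own, so treat this as a sketch rather than any vendor's format.

```python
import json
from typing import Iterable, Iterator, Optional, Tuple


def iter_sse_data(lines: Iterable[str]) -> Iterator[str]:
    """Yield the data payload of each complete SSE event."""
    data_parts = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":                    # a blank line ends the event
            if data_parts:
                yield "\n".join(data_parts)
                data_parts = []
        elif line.startswith(":"):        # comment line, often a heartbeat
            continue
        elif line.startswith("data:"):
            value = line[len("data:"):]
            if value.startswith(" "):     # one leading space is stripped
                value = value[1:]
            data_parts.append(value)
        # Other fields (event:, id:, retry:) are ignored in this sketch.
    # An event not terminated by a blank line is dropped.


def accumulate(lines: Iterable[str]) -> Tuple[str, Optional[str]]:
    """Join text deltas and capture the stop reason (hypothetical schema)."""
    text_parts = []
    stop_reason = None
    for payload in iter_sse_data(lines):
        event = json.loads(payload)
        if "delta" in event:
            text_parts.append(event["delta"])
        if "stop_reason" in event:
            stop_reason = event["stop_reason"]
    return "".join(text_parts), stop_reason


SAMPLE = [
    ": heartbeat\n",
    'data: {"delta": "Hel"}\n', "\n",
    'data: {"delta": "lo"}\n', "\n",
    'data: {"stop_reason": "end_turn"}\n', "\n",
]
print(accumulate(SAMPLE))
```

In practice `lines` would come from an HTTP client that iterates the open response line by line; the parsing logic stays the same.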

What you stream

Text deltas — the obvious one.
Tool-call arguments — useful for showing "calling get_user…" while the args are still being decided.
Citations — incremental "this paragraph cites doc 4" markers.
Reasoning / thinking — for models that emit a separate thinking stream.
Metadata — token counts, finish reason, model version, often in a final event.
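Since several kinds of delta share one stream, the client usually needs a small reducer that routes each event by type. The sketch below assumes a made-up schema (`text_delta`, `tool_args_delta`, `metadata`, `call_id`, `fragment`); none of these names come from a specific API.

```python
import json
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class StreamState:
    """Accumulates the different delta kinds from one response stream."""
    text: List[str] = field(default_factory=list)
    tool_args: Dict[str, List[str]] = field(default_factory=dict)
    metadata: Dict[str, object] = field(default_factory=dict)

    def apply(self, event: dict) -> None:
        kind = event.get("type")
        if kind == "text_delta":
            self.text.append(event["text"])
        elif kind == "tool_args_delta":
            # Argument JSON arrives in fragments; keep them per tool call.
            self.tool_args.setdefault(event["call_id"], []).append(event["fragment"])
        elif kind == "metadata":
            self.metadata.update(event["data"])
        # Unknown event types are ignored so new ones don't break old clients.

    def final_text(self) -> str:
        return "".join(self.text)

    def final_tool_args(self, call_id: str) -> dict:
        # Only parse once the stream has ended and the JSON is complete.
        return json.loads("".join(self.tool_args[call_id]))


events = [
    {"type": "text_delta", "text": "Looking up "},
    {"type": "tool_args_delta", "call_id": "c1", "fragment": '{"user_'},
    {"type": "text_delta", "text": "the user."},
    {"type": "tool_args_delta", "call_id": "c1", "fragment": 'id": 42}'},
    {"type": "metadata", "data": {"finish_reason": "tool_use", "output_tokens": 17}},
]

state = StreamState()
for event in events:
    state.apply(event)

print(state.final_text())
print(state.final_tool_args("c1"))
print(state.metadata)
```

The UI can show "calling a tool…" as soon as the first `tool_args_delta` arrives, while the actual dispatch waits for `final_tool_args`.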

Engineering streaming well

First-byte fast. Send your initial event (even just an empty heartbeat) immediately; clients hate slow opening connections.
Heartbeats every ~15s. Keeps proxies and load balancers from killing idle connections.
Idempotent retries. If the stream dies mid-flow, re-issue and dedupe by request ID.
Clean cancellation. When the user closes the tab, propagate cancel to the upstream API. Otherwise you pay for tokens nobody reads.
Buffering tradeoff. Tiny chunks fragment too much; huge chunks defeat the point. ~20–50 tokens per visible flush feels right.
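Two of these points, the buffering tradeoff and clean cancellation, fit in one small generator. This is a sketch under assumptions: `FakeUpstream` is a stand-in for a real upstream API stream, and `flush_every` is a hypothetical knob, not a parameter of any library.

```python
from typing import Iterable, Iterator, List


def coalesce(tokens: Iterable[str], flush_every: int = 20) -> Iterator[str]:
    """Group tokens into chunks of up to flush_every tokens per flush.

    If the consumer stops early, the upstream source is closed too,
    so no further tokens are requested (or paid for).
    """
    if flush_every < 1:
        raise ValueError("flush_every must be at least 1")
    buffer: List[str] = []
    try:
        for token in tokens:
            buffer.append(token)
            if len(buffer) >= flush_every:
                yield "".join(buffer)
                buffer = []
        if buffer:                      # flush whatever is left at the end
            yield "".join(buffer)
    finally:
        close = getattr(tokens, "close", None)
        if close is not None:
            close()                     # propagate cancellation upstream


class FakeUpstream:
    """Stand-in for an upstream stream; counts how many tokens it produced."""

    def __init__(self, limit: int) -> None:
        self.limit = limit
        self.produced = 0
        self.closed = False

    def __iter__(self) -> "FakeUpstream":
        return self

    def __next__(self) -> str:
        if self.closed or self.produced >= self.limit:
            raise StopIteration
        self.produced += 1
        return f"t{self.produced} "

    def close(self) -> None:
        self.closed = True


upstream = FakeUpstream(limit=1000)
chunks = coalesce(upstream, flush_every=20)
first_flush = next(chunks)   # the user sees the first 20 tokens
chunks.close()               # the user closes the tab
print(upstream.produced, upstream.closed)
```

Closing the outer generator runs the `finally` block, which is where a real client would abort the HTTP request to the model provider.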

Streaming with structured outputs

JSON streaming is trickier — you receive partial JSON that may be invalid. Solutions:

Streaming JSON parsers — libraries like partial-json parse what's complete and tolerate dangling tokens.
Field-by-field reveal — the API streams completed fields as discrete events.
Final-only delivery for critical paths — sometimes it's worth waiting for the whole structured response and accepting the latency cost.
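To show what "tolerate dangling tokens" means, here is a deliberately naive stdlib-only sketch. It is not the `partial-json` library's API or algorithm: it closes any open string, array, or object, and if that still does not parse it trims one character and tries again. That is quadratic in the worst case, so it suits small display payloads only.

```python
import json
from typing import Any, Optional


def _closers(text: str) -> str:
    """Return the characters needed to close open strings and brackets."""
    stack = []
    in_string = False
    escaped = False
    for ch in text:
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch in "{[":
            stack.append("}" if ch == "{" else "]")
        elif ch in "}]" and stack:
            stack.pop()
    return ('"' if in_string else "") + "".join(reversed(stack))


def parse_partial(prefix: str) -> Optional[Any]:
    """Best-effort parse of a JSON prefix; None if nothing parses yet."""
    for cut in range(len(prefix), 0, -1):
        candidate = prefix[:cut]
        try:
            return json.loads(candidate + _closers(candidate))
        except json.JSONDecodeError:
            continue   # drop one more character and retry
    return None


print(parse_partial('{"name": "Ada", "ag'))
print(parse_partial('{"name": "Ad'))
print(parse_partial('{"scores": [1, 2'))
```

Note the second call: a half-received string value is returned as if it were complete, and the same goes for a half-received number. That is fine for a progressive display and unsafe for anything your code acts on, which is the case for final-only delivery.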

Where streaming wins big

Chat UIs — typing animation makes the assistant feel responsive even on long answers.
Code generation — show the function as it forms; user can interrupt early.
Long content — articles, reports; users don't wait staring at a spinner.
Voice / audio output — speech synthesis can start on the first chunk, not after the full text.
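For the voice case, the usual trick is to hand text to the synthesizer one sentence at a time rather than one token at a time. A minimal sketch, assuming a simple punctuation-based splitter (it will mis-split abbreviations such as "Dr. Smith") and leaving the actual speech call out:

```python
import re
from typing import Iterable, Iterator

# Split on whitespace that follows sentence-ending punctuation.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def sentences(deltas: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences as soon as the text deltas contain them."""
    pending = ""
    for delta in deltas:
        pending += delta
        parts = _SENTENCE_END.split(pending)
        # Everything except the last part is complete; the last may still grow.
        for sentence in parts[:-1]:
            yield sentence
        pending = parts[-1]
    if pending.strip():
        yield pending.strip()   # flush the tail when the stream ends


deltas = ["Hello th", "ere. How are", " you? I am", " fine."]
for sentence in sentences(deltas):
    print(sentence)   # a real client would pass this to speech synthesis
```

The first sentence is available after the second delta, long before the full answer exists.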

Where it doesn't matter

Background jobs — nobody is watching; just take the response.
Tiny outputs — TTFT and total time converge; complexity isn't worth it.
Strict batch pipelines — non-streaming is simpler when you're going to wait anyway.

In one line

Streaming doesn't make your model faster. It makes the wait feel half as long, and lets you cut off the half you didn't need.

From the field

Streaming is the single best perceived-latency trick there is — users tolerate a long answer if something appears fast — but it quietly complicates everything off the happy path. You can't validate a JSON object that's still arriving, so for structured responses I either don't stream or stream a display version and assemble the strict one separately. And mid-stream errors are nastier than errors before it: the user has already seen half an answer when the connection drops, so you need a way to recover or clearly mark it incomplete. Stream the user-facing text; be careful streaming anything your code later has to parse.
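One way to "clearly mark it incomplete" is to have the collection step return a completeness flag alongside the text. The sketch below uses hypothetical names (`StreamResult`, `collect`, `render`) and a fake stream that fails partway; a real client should also treat a stream that ends without a finish reason as incomplete.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass
class StreamResult:
    text: str
    complete: bool
    error: str = ""


def collect(deltas: Iterable[str]) -> StreamResult:
    """Gather text deltas, keeping what arrived if the connection drops."""
    parts = []
    try:
        for delta in deltas:
            parts.append(delta)
    except ConnectionError as exc:
        return StreamResult("".join(parts), complete=False, error=str(exc))
    return StreamResult("".join(parts), complete=True)


def render(result: StreamResult) -> str:
    """Show the text, flagging a partial answer instead of hiding the failure."""
    if result.complete:
        return result.text
    return result.text.rstrip() + " [response incomplete]"


def flaky_stream() -> Iterator[str]:
    """Fake stream that dies after two deltas."""
    yield "The answer is "
    yield "probably "
    raise ConnectionError("connection reset mid-stream")


result = collect(flaky_stream())
print(render(result))
```

Downstream code can then branch on `result.complete` before parsing or storing anything, which keeps half an answer from being mistaken for a whole one.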
