LLM Streaming: Why First-Token Latency Beats Total Time

Streaming sends tokens as the model produces them. Total wall-clock time is about the same, but perceived speed is dramatically better, and you can cut generation off as soon as the answer is good enough.

The pizza-delivery analogy

Two pizzerias each take 20 minutes to bake your order. Pizzeria A delivers everything in one trip after baking. Pizzeria B delivers slice by slice as each one comes out of the oven, one every 2 minutes. The total food, the total wait, and the total number of slices are identical. But at B you have already been eating for 18 minutes by the time A's single delivery arrives. The perceived experience is night and day.

LLM streaming works the same way. The model still spends the same X seconds generating, but you see output token by token, the equivalent of those 2-minute slices, instead of waiting for the whole answer.

What streaming changes

  • Time-to-first-token (TTFT) — the metric users actually feel. Streaming makes this the number to optimise.
  • Cancellation — once you have what you need, cut the stream off and save the remaining tokens. Once the answer is clearly going wrong, cut it off even sooner.
  • Progressive UI — typing animation, partial markdown rendering, incremental tool dispatch.
  • Backpressure — a large output no longer arrives as one lump after 30 seconds of silence; bytes flow as soon as they're produced, and the consumer can read them at its own pace.

What it does not change:

  • Total cost — you still pay for every token the model generates; streaming itself earns no discount.
  • Total wall-clock time — the model still takes the same time end-to-end.
  • Quality — same output as non-streaming, just delivered differently.
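
To put numbers on the two lists above, here is a minimal sketch that replays the pizza figures (10 chunks, one every 2 time units) on a virtual clock. The function names and the constant per-token rate are illustrative assumptions, not measurements of any real model.

```python
from typing import Dict, List, Optional


def timeline(num_tokens: int, seconds_per_token: float) -> List[float]:
    """Virtual time at which each token is ready, assuming a constant rate."""
    return [(i + 1) * seconds_per_token for i in range(num_tokens)]


def latency_report(num_tokens: int, seconds_per_token: float,
                   streaming: bool) -> Dict[str, float]:
    """Compare when the user first sees output with when generation ends."""
    ready = timeline(num_tokens, seconds_per_token)
    total = ready[-1]
    # Streaming shows the first token as soon as it exists;
    # non-streaming shows nothing until the last token is done.
    first_visible = ready[0] if streaming else total
    return {"first_visible_s": first_visible, "total_s": total}


def tokens_generated(num_tokens: int, cancel_after: Optional[int] = None) -> int:
    """Tokens produced (and paid for) if the client cancels after `cancel_after`."""
    if cancel_after is None:
        return num_tokens
    return min(num_tokens, cancel_after)
```

With these assumptions, `latency_report(10, 2.0, streaming=True)` reports a first visible output at 2.0 and a total of 20.0, while the non-streaming call reports 20.0 for both. `tokens_generated(10, cancel_after=4)` returns 4: cancellation is the only item here that reduces cost, and only because fewer tokens are generated.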

How it works on the wire

  • Server-Sent Events (SSE) is the dominant transport. The HTTP response stays open and emits data: {...} lines.
  • Each event carries a small JSON delta: a token chunk, a tool-call argument fragment, a stop reason.
  • The client accumulates deltas; when the stream ends, it has the full response.

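WebSockets and chunked JSON are alternatives but rarer. SSE wins because it is plain HTTP with no protocol upgrade, it can resume through the Last-Event-ID header when the server supports it, and it passes through most proxies as long as response buffering is turned off.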
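
The client side of the steps above can be sketched as a small SSE reader that accumulates deltas. The payload shape (`{"delta": "..."}`) and the `[DONE]` end marker are assumptions chosen for illustration; real providers each define their own event schema.

```python
import json
from typing import Iterable, Iterator, List


def iter_sse_data(lines: Iterable[str]) -> Iterator[str]:
    """Yield the data payload of each SSE event from an iterable of text lines.

    An event ends at a blank line, lines starting with ':' are comments
    (commonly used as heartbeats), and several data: lines inside one event
    are joined with newlines.
    """
    data_parts: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            if data_parts:
                yield "\n".join(data_parts)
                data_parts = []
            continue
        if line.startswith(":"):
            continue  # comment / heartbeat, carries no data
        field, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]  # one optional space after the colon is dropped
        if field == "data":
            data_parts.append(value)
    # An event that is still incomplete when the stream ends is discarded.


def accumulate_text(lines: Iterable[str]) -> str:
    """Concatenate text deltas until the (assumed) end marker arrives."""
    pieces: List[str] = []
    for payload in iter_sse_data(lines):
        if payload == "[DONE]":  # hypothetical end-of-stream sentinel
            break
        event = json.loads(payload)
        pieces.append(event.get("delta", ""))  # hypothetical field name
    return "".join(pieces)
```

Feeding it the lines `data: {"delta": "Hel"}`, a blank line, `data: {"delta": "lo"}`, a blank line, then `data: [DONE]` and a final blank line produces the string `Hello`.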

What you stream

  • Text deltas — the obvious one.
  • Tool-call arguments — useful for showing "calling get_user…" while the args are still being decided.
  • Citations — incremental "this paragraph cites doc 4" markers.
  • Reasoning / thinking — for models that emit a separate thinking stream.
  • Metadata — token counts, finish reason, model version, often in a final event.
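
Since one stream can interleave several of these kinds, a client usually routes each event by type. The sketch below assumes a hypothetical event schema (`text_delta`, `tool_call_start`, `tool_args_delta`, `done`); none of these names come from a specific API.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Iterable


@dataclass
class StreamState:
    """Everything the client has assembled from the stream so far."""
    text: str = ""
    tool_name: str = ""
    tool_args_json: str = ""  # raw fragments; only valid JSON once complete
    metadata: Dict[str, Any] = field(default_factory=dict)


def apply_events(events: Iterable[Dict[str, Any]]) -> StreamState:
    """Fold a sequence of typed events into one StreamState."""
    state = StreamState()
    for ev in events:
        kind = ev.get("type")
        if kind == "text_delta":
            state.text += ev["text"]
        elif kind == "tool_call_start":
            # Enough to show "calling get_user…" before the args exist.
            state.tool_name = ev["name"]
        elif kind == "tool_args_delta":
            state.tool_args_json += ev["fragment"]
        elif kind == "done":
            state.metadata = ev.get("metadata", {})
        # Unknown kinds are ignored so new event types don't break old clients.
    return state
```

The tool name is available as soon as the `tool_call_start` event arrives, while `tool_args_json` should only be parsed once the stream signals completion.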

Engineering streaming well

  • First-byte fast. Send your initial event (even just an empty heartbeat) immediately; clients hate slow opening connections.
  • Heartbeats every ~15s. Keeps proxies and load balancers from killing idle connections.
  • Idempotent retries. If the stream dies mid-flow, re-issue and dedupe by request ID.
  • Clean cancellation. When the user closes the tab, propagate cancel to the upstream API. Otherwise you pay for tokens nobody reads.
  • Buffering tradeoff. Tiny chunks fragment too much; huge chunks defeat the point. ~20–50 tokens per visible flush feels right.
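
Two of the points above, buffering and clean cancellation, fit in one small relay. This is a sketch built on plain Python generators: `FakeUpstream` is a hypothetical stand-in for a model API stream, and a real server would also need heartbeats and retry handling, which are left out here.

```python
from typing import Generator, Iterable, List


class FakeUpstream:
    """Hypothetical upstream that records how many tokens it produced."""

    def __init__(self, tokens: Iterable[str]) -> None:
        self.tokens = list(tokens)
        self.produced = 0
        self.closed_early = False

    def stream(self) -> Generator[str, None, None]:
        try:
            for token in self.tokens:
                self.produced += 1
                yield token
        finally:
            # Runs on normal exhaustion and when the consumer calls close().
            self.closed_early = self.produced < len(self.tokens)


def batched_relay(upstream: Generator[str, None, None],
                  flush_every: int) -> Generator[str, None, None]:
    """Forward tokens in chunks of `flush_every`; closing the relay closes upstream."""
    if flush_every < 1:
        raise ValueError("flush_every must be at least 1")
    buffer: List[str] = []
    try:
        for token in upstream:
            buffer.append(token)
            if len(buffer) >= flush_every:
                yield "".join(buffer)
                buffer = []
        if buffer:
            yield "".join(buffer)  # flush the final partial chunk
    finally:
        # Propagate cancellation so no further tokens are generated.
        upstream.close()
```

If the consumer reads one chunk from `batched_relay(up.stream(), 2)` and then calls `close()` on the relay, the upstream stops after the two tokens already produced instead of running to the end.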

Streaming with structured outputs

JSON streaming is trickier — you receive partial JSON that may be invalid. Solutions:

  • Streaming JSON parsers — libraries like partial-json parse what's complete and tolerate dangling tokens.
  • Field-by-field reveal — the API streams completed fields as discrete events.
  • Final-only delivery for critical paths — sometimes it's worth waiting for the whole structured response and accepting the latency cost.
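
To illustrate the first option, here is a deliberately naive sketch of tolerant parsing: close whatever strings and brackets are still open, and if that still isn't valid JSON, drop trailing characters until it is. It is a hand-rolled illustration, not how partial-json or any other library is implemented, and the function names are made up for this example.

```python
import json
from typing import Any, List


def _closing_suffix(prefix: str) -> str:
    """Return the characters needed to close open strings and brackets."""
    stack: List[str] = []
    in_string = False
    escaped = False
    for ch in prefix:
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch == "{":
            stack.append("}")
        elif ch == "[":
            stack.append("]")
        elif ch in "}]" and stack:
            stack.pop()
    return ('"' if in_string else "") + "".join(reversed(stack))


def parse_partial_json(prefix: str) -> Any:
    """Best-effort parse of a JSON prefix; returns None if nothing is usable.

    Quadratic in the worst case, which is fine for a sketch but not for
    large payloads. Note that a partial number such as 12 (of 123) is
    indistinguishable from a complete one.
    """
    for cut in range(len(prefix), 0, -1):
        candidate = prefix[:cut]
        try:
            return json.loads(candidate + _closing_suffix(candidate))
        except json.JSONDecodeError:
            continue  # drop one more dangling character and retry
    return None
```

Given the prefix `{"a": 1, "b`, the dangling key is discarded and the result is `{"a": 1}`; given `{"name": "Ada", "tags": ["x", "y`, the open string and brackets are closed and both fields come back.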

Where streaming wins big

  • Chat UIs — typing animation makes the assistant feel responsive even on long answers.
  • Code generation — show the function as it forms; user can interrupt early.
  • Long content — articles, reports; users don't wait staring at a spinner.
  • Voice / audio output — speech synthesis can start on the first chunk, not after the full text.

Where it doesn't matter

  • Background jobs — nobody is watching; just take the response.
  • Tiny outputs — TTFT and total time converge; complexity isn't worth it.
  • Strict batch pipelines — non-streaming is simpler when you're going to wait anyway.

In one line

Streaming doesn't make your model faster. It makes the wait feel half as long, and lets you cut off the half you didn't need.

Engr Mejba Ahmed
