AI Latency: P50, P99, and Why TTFT Matters Most

Users feel TTFT (time to first token), not total time. Optimise for it. P99 hides the customers who actually churn — track it like your job depends on it.

Apr 29, 2026 · 2 min read

The restaurant-wait analogy

You're starving in a restaurant. The chef sends out bread in 2 minutes — you're happy, you settle in, dinner can take 30 more minutes. Send nothing for 10 minutes and you're walking out, even if dinner would have arrived eventually.

LLM apps run on the same psychology. Time to first token (TTFT) is the bread. Total time is the dinner. Get the bread out fast and users tolerate a long generation. Make them stare at a spinner and they bail within seconds.

The latency metrics that matter

TTFT (time to first token) — wall-time from request start to first token streamed. The user-felt latency.
Tokens per second — how fast the rest streams. Affects "this feels fast" vs "this feels grinding."
Total latency — request start to final token. It is what matters for batch jobs; interactive users feel it less.
Tool-call round-trip — for agents, each tool call adds wall-time. Counts as latency.

Track all four, broken down by P50, P95, P99. The averages lie; the tails are where users churn.
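A minimal sketch of how the first three metrics could be captured around any streaming client. The names `time_stream`, `StreamTiming`, and `fake_stream` are illustrative and not part of any SDK, and each streamed chunk is counted as one token, which real providers do not guarantee. Tool-call round-trips could be timed the same way around each call.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, Optional


@dataclass
class StreamTiming:
    ttft_s: Optional[float]        # request start to first token; None if nothing streamed
    total_s: float                 # request start to final token
    tokens: int                    # streamed chunks, counted as tokens (an approximation)
    tokens_per_s: Optional[float]  # streaming rate after the first token


def time_stream(open_stream: Callable[[], Iterable[str]]) -> StreamTiming:
    """Start the clock, open the stream, and record first and last token arrival."""
    start = time.perf_counter()
    first = None
    last = start
    count = 0
    for _ in open_stream():
        last = time.perf_counter()
        if first is None:
            first = last
        count += 1
    if first is None:
        return StreamTiming(None, time.perf_counter() - start, 0, None)
    decode_s = last - first
    rate = (count - 1) / decode_s if decode_s > 0 else None
    return StreamTiming(first - start, last - start, count, rate)


def fake_stream(n: int, first_delay_s: float, gap_s: float) -> Iterator[str]:
    """Hypothetical stand-in for a provider stream: slow first token, then steady ones."""
    time.sleep(first_delay_s)
    for i in range(n):
        if i:
            time.sleep(gap_s)
        yield f"tok{i}"


if __name__ == "__main__":
    print(time_stream(lambda: fake_stream(n=20, first_delay_s=0.3, gap_s=0.02)))
```

Passing a function that opens the stream, rather than an already-open stream, keeps the clock start ahead of the request itself, so queueing and network time land inside TTFT where the user feels them.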

Why P99 matters more than P50

P50 (the median) is your typical request. P99 marks your slowest 1%: one request in a hundred takes at least that long. The two describe very different experiences.

A P50 of 800ms with a P99 of 30s means 1 request in 100 takes 30 seconds or longer, and the users who hit those requests are the loudest. They post angry tweets and open support tickets.
A P50 of 1.2s with a P99 of 3s looks worse on charts but is consistently fine for everyone.

You cannot ship a "great UX on average" feature. You ship one that's acceptable for everyone. Tail latency is the spec.
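The two profiles above can be reproduced with synthetic numbers. This sketch uses a nearest-rank percentile (one of several common definitions) on made-up samples, and adds a small helper that assumes requests are independent to estimate how many sessions touch the tail. The sample data and the function names are illustrative only.

```python
import math
from statistics import fmean
from typing import Dict, Sequence


def percentile(samples: Sequence[float], p: float) -> float:
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def summarize(samples: Sequence[float]) -> Dict[str, float]:
    """Mean plus the three percentiles the article recommends tracking."""
    return {
        "mean": fmean(samples),
        "p50": percentile(samples, 50),
        "p95": percentile(samples, 95),
        "p99": percentile(samples, 99),
    }


def share_of_sessions_hitting_tail(requests_per_session: int,
                                   tail_fraction: float = 0.01) -> float:
    """Chance a session sees at least one tail request, assuming independent requests."""
    return 1 - (1 - tail_fraction) ** requests_per_session


# Synthetic latencies in seconds, chosen to mirror the two profiles in the text.
SPIKY = [0.8] * 98 + [30.0] * 2    # fast median, brutal tail
STEADY = [1.2] * 90 + [3.0] * 10   # slower median, tight tail

if __name__ == "__main__":
    print("spiky ", summarize(SPIKY))
    print("steady", summarize(STEADY))
    print("sessions hitting P99 in 20 requests:", share_of_sessions_hitting_tail(20))
```

Both synthetic sets have a mean of about 1.38s, yet their P99s are 30s and 3s. Under the independence assumption, a session of 20 requests has roughly an 18% chance of hitting at least one P99-or-worse request, so the "unhappy 1%" of requests reaches far more than 1% of users.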

What controls TTFT

Prefill time — grows with input length, and attention cost grows roughly quadratically with sequence length, so long context = slow first token. Caching a shared prefix avoids recomputing it.
Queueing on the provider side — under load, your request waits. Higher-tier accounts typically see less queueing.
Network round-trips — region matters. A request from Tokyo to a US-only model pays 200ms+ before any work starts.
Cold starts — first request to a self-hosted endpoint that just spun up.
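One way to reason about these four contributors is as an additive budget. The sketch below is a deliberately crude model: it treats prefill as linear in uncached tokens (ignoring the quadratic attention term), assumes cached tokens cost nothing, and uses invented numbers. `TtftBudget` and every value in the example are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class TtftBudget:
    network_rtt_s: float         # round-trip before any work starts
    queue_s: float               # time waiting on the provider side
    cold_start_s: float          # 0.0 when the endpoint is already warm
    prompt_tokens: int           # total input tokens
    cached_tokens: int           # leading tokens served from a prefix cache
    prefill_tokens_per_s: float  # assumed prefill speed (linear simplification)

    def prefill_s(self) -> float:
        uncached = max(self.prompt_tokens - self.cached_tokens, 0)
        return uncached / self.prefill_tokens_per_s

    def breakdown(self) -> Dict[str, float]:
        return {
            "network": self.network_rtt_s,
            "queue": self.queue_s,
            "cold_start": self.cold_start_s,
            "prefill": self.prefill_s(),
        }

    def total_s(self) -> float:
        return sum(self.breakdown().values())


if __name__ == "__main__":
    # Hypothetical request: 8,000-token prompt, 6,000 tokens already cached.
    budget = TtftBudget(0.2, 0.1, 0.0, 8000, 6000, 4000.0)
    print(budget.breakdown(), budget.total_s())
```

Filling in the breakdown with measured values shows which term dominates before you pick a lever.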

Levers that actually move TTFT

Prompt caching — cached prefixes process drastically faster. Often the single biggest TTFT lever.
Trim the prompt — every removed token shortens prefill.
Smaller model — Haiku TTFT << Opus TTFT for the same prompt. Combine with routing.
Region pinning — match your serving region to your users.
Provider tier — higher tiers get better priority.
Speculative decoding — accelerates per-token generation; less impact on TTFT but improves total latency.
Streaming — doesn't reduce TTFT itself, but perceived latency drops sharply because the UI starts updating at the first token instead of waiting for the full response.
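Prompt caching generally matches on the leading part of the prompt, so ordering matters: content that is identical across requests should come first and per-request content last. The sketch below illustrates the idea with whitespace-split words as a crude stand-in for tokens. `build_prompt` and `shared_prefix_len` are hypothetical helpers, and real providers have their own rules for what qualifies as a cacheable prefix.

```python
from typing import List, Sequence


def shared_prefix_len(a: Sequence[str], b: Sequence[str]) -> int:
    """Number of leading items that are identical in both sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def build_prompt(system: str, docs: List[str], user_msg: str,
                 stable_first: bool = True) -> List[str]:
    """Return the prompt as whitespace-split words (a rough proxy for tokens)."""
    stable = " ".join([system] + docs)
    text = f"{stable} {user_msg}" if stable_first else f"{user_msg} {stable}"
    return text.split()


SYSTEM = "You are a support assistant."
DOCS = ["Refund policy: 30 days.", "Shipping: 3 to 5 days."]

if __name__ == "__main__":
    for stable_first in (True, False):
        a = build_prompt(SYSTEM, DOCS, "Where is my order?", stable_first)
        b = build_prompt(SYSTEM, DOCS, "Can I get a refund?", stable_first)
        print(f"stable_first={stable_first}: shared prefix = {shared_prefix_len(a, b)} words")
```

With the stable content first, the two requests share all 14 words of system prompt and documents. With the user message first, they share none, so a prefix cache has nothing to reuse.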

What controls total latency

Output length — set max_tokens, ask for terser responses. Sometimes the cheapest latency win.
Tool calls — every agent step is a network round-trip. Cap max steps.
Sequential vs parallel — independent tool calls done in parallel save real wall-time.
Self-consistency, ToT, reflexion — more samples = more wall-time. Use only where the quality lift justifies it.
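The sequential-versus-parallel point is easy to see with `asyncio`. In this sketch `call_tool` is a hypothetical tool whose sleep stands in for a network round-trip, and the calls are assumed to be independent of each other.

```python
import asyncio
import time
from typing import List, Tuple

Call = Tuple[str, float]  # (tool name, simulated round-trip in seconds)


async def call_tool(name: str, delay_s: float) -> str:
    """Hypothetical tool call; the sleep stands in for a network round-trip."""
    await asyncio.sleep(delay_s)
    return f"{name}:ok"


async def run_sequential(calls: List[Call]) -> List[str]:
    """Await each call before starting the next: wall-time is the sum of delays."""
    return [await call_tool(name, delay) for name, delay in calls]


async def run_parallel(calls: List[Call]) -> List[str]:
    """Start all calls together: wall-time is roughly the slowest delay."""
    return list(await asyncio.gather(*(call_tool(n, d) for n, d in calls)))


async def timed(coro):
    """Await a coroutine and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = await coro
    return result, time.perf_counter() - start


if __name__ == "__main__":
    calls = [("search", 0.3), ("database", 0.3), ("weather", 0.3)]
    _, seq_s = asyncio.run(timed(run_sequential(calls)))
    _, par_s = asyncio.run(timed(run_parallel(calls)))
    print(f"sequential: {seq_s:.2f}s, parallel: {par_s:.2f}s")
```

`asyncio.gather` returns results in the order the calls were passed in, so downstream code does not need to change when the calls are parallelised.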

Architectural moves

Multi-region deployment. Route by user geography.
Pre-warm pools — keep some replicas hot so cold starts don't hit live traffic.
Hedging — fire two requests, take whichever lands first. Doubles cost but shrinks tail latency dramatically. Worth it for the most user-facing paths.
Async fallback — if P99 is unacceptable, render a quick partial answer and resolve the full one async ("we're working on the rest of your answer…").
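Hedging as described above can be sketched in a few lines of `asyncio`: start identical copies of a request, return the first one that succeeds, and cancel the rest. The `hedged` helper and the `flaky_backend` simulator are hypothetical, and the request is assumed to be safe to issue more than once.

```python
import asyncio
import time
from typing import Awaitable, Callable, Iterable, TypeVar

T = TypeVar("T")


async def hedged(make_request: Callable[[], Awaitable[T]], copies: int = 2) -> T:
    """Fire `copies` identical requests and return the first successful result."""
    if copies < 1:
        raise ValueError("copies must be at least 1")
    tasks = {asyncio.ensure_future(make_request()) for _ in range(copies)}
    last_error = None
    try:
        while tasks:
            done, tasks = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
            for task in done:
                if task.exception() is None:
                    return task.result()
                last_error = task.exception()
        raise last_error  # every copy failed
    finally:
        for task in tasks:  # cancel whatever is still in flight
            task.cancel()


def flaky_backend(delays: Iterable[float]) -> Callable[[], Awaitable[str]]:
    """Simulated backend: each call takes the next delay from the list."""
    it = iter(delays)

    async def request() -> str:
        delay = next(it)
        await asyncio.sleep(delay)
        return f"answered after {delay}s"

    return request


if __name__ == "__main__":
    start = time.perf_counter()
    # First copy is a slow tail request, second copy is a normal one.
    print(asyncio.run(hedged(flaky_backend([2.0, 0.1]))))
    print(f"elapsed: {time.perf_counter() - start:.2f}s")
```

Every hedged call is billed for each copy that reaches the provider, which is why the text reserves this for the paths where tail latency hurts most.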

Latency vs throughput

These pull against each other. Bigger batches = more throughput per dollar; bigger batches = higher per-request latency. Production-grade serving stacks (vLLM, TGI, TRT-LLM) make this knob explicit.

For interactive UX, prioritise latency. For offline pipelines, prioritise throughput. Don't run them on the same configuration.
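The tension shows up even in a toy model. The sketch below assumes each decoding step costs a fixed overhead plus a small amount per sequence in the batch. Both constants are invented for illustration and do not describe vLLM, TGI, TRT-LLM, or any real hardware.

```python
FIXED_S = 0.020    # assumed per-step overhead, paid once per batch
PER_SEQ_S = 0.002  # assumed extra cost per sequence in the batch


def step_time_s(batch_size: int) -> float:
    """Time for one decoding step: what each request waits per token."""
    return FIXED_S + PER_SEQ_S * batch_size


def tokens_per_s(batch_size: int) -> float:
    """Aggregate throughput: one token per sequence per step."""
    return batch_size / step_time_s(batch_size)


if __name__ == "__main__":
    for batch in (1, 8, 32, 128):
        print(f"batch={batch:>3}  per-token latency={step_time_s(batch) * 1000:.0f}ms  "
              f"throughput={tokens_per_s(batch):.0f} tok/s")
```

Under these assumptions, raising the batch size increases both numbers at once: the server emits more tokens per second overall while each individual request waits longer for every token.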

What NOT to chase

Sub-100ms TTFT for hard prompts on big models. Physics. Use a smaller model or split the work.
Zero variance. Some variance is inherent. Aim for tight P95 / P99, not constant times.
Latency wins that hurt quality. A 200ms cut by switching to a worse model that produces 10% more support tickets is a bad trade.

In one line

Optimise for TTFT and P99, not P50 and total time. The user feels the bread; the angry 1% is the test of fitness.

From the field

The cheapest latency win is almost always perceptual, not technical. The moment I stream tokens and show something — even a skeleton or a "thinking…" line — within a few hundred milliseconds, complaints about speed mostly vanish, even when total time is unchanged. After that, prompt caching is the biggest real TTFT lever I've measured; trimming the prompt is second. I save hedging — firing duplicate requests and taking the fastest — for the few user-facing paths where tail latency genuinely hurts, because it doubles spend. And I never "optimise" latency by swapping in a weaker model without checking support-ticket volume: a faster feature that's wrong more often is slower for the business.
