
AI Latency: P50, P99, and Why TTFT Matters Most

Users feel TTFT (time to first token), not total time. Optimise for it. Averages and medians hide the customers who actually churn; they sit in the P99 tail, so track it like your job depends on it.

The restaurant-wait analogy

You're starving in a restaurant. The chef sends out bread in 2 minutes — you're happy, you settle in, dinner can take 30 more minutes. Send nothing for 10 minutes and you're walking out, even if dinner would have arrived eventually.

LLM apps run on the same psychology. Time to first token (TTFT) is the bread. Total time is the dinner. Get the bread out fast and users tolerate a long generation. Make them stare at a spinner and they bail at second 8.

The latency metrics that matter

  • TTFT (time to first token): wall-clock time from request start to the first streamed token. The user-felt latency.
  • Tokens per second: how fast the rest streams. Affects "this feels fast" vs "this feels grinding."
  • Total latency: request start to final token. This is the number that matters for batch jobs; interactive users feel it less than TTFT.
  • Tool-call round-trip: for agents, each tool call adds wall-clock time. Counts as latency.

Track all four, broken down by P50, P95, P99. The averages lie; the tails are where users churn.
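As a minimal sketch of how the first three metrics could be captured, the helper below wraps any iterable of streamed tokens and records timings as it consumes them. The names `StreamTiming` and `time_stream` are hypothetical, not part of any provider SDK, and the sketch assumes the stream is lazy, so the request only starts once iteration begins. If your client sends the request earlier, start the clock there instead.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class StreamTiming:
    ttft_s: Optional[float]        # request start to first token; None if nothing arrived
    total_s: float                 # request start to final token
    tokens: int                    # number of streamed tokens
    tokens_per_s: Optional[float]  # decode rate measured after the first token


def time_stream(stream: Iterable[str],
                clock: Callable[[], float] = time.perf_counter) -> StreamTiming:
    """Consume a token stream and record TTFT, total latency, and decode rate."""
    start = clock()
    first = None
    last = start
    count = 0
    for _ in stream:
        last = clock()
        if first is None:
            first = last           # timestamp of the first token
        count += 1
    if first is None:
        return StreamTiming(None, clock() - start, 0, None)
    decode_s = last - first
    rate = (count - 1) / decode_s if count > 1 and decode_s > 0 else None
    return StreamTiming(first - start, last - start, count, rate)
```

Log one `StreamTiming` per request, then aggregate each field into P50, P95, and P99 rather than averaging. The injectable `clock` is there only so the function can be tested without real waiting.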

Why P99 matters more than P50

P50 (the median) is your typical user. P99 is your unhappy 1%. They are not the same people.

  • A P50 of 800ms with a P99 of 30s means 1% of users wait 30 seconds or longer, and they are the loudest 1%. They post angry tweets and open support tickets.
  • A P50 of 1.2s with a P99 of 3s looks worse on a median chart but is consistently fine for everyone.

You cannot ship a "great UX on average" feature. You ship one that's acceptable for everyone. Tail latency is the spec.
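To see how the mean and the median hide the tail, here is a small sketch using a nearest-rank percentile. The sample values are made up for illustration (98 fast requests and 2 slow ones), and `percentile` is a hand-rolled helper, not a library function.

```python
import math
from statistics import mean


def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical TTFT samples in seconds: 98 fast requests, 2 stuck in the tail.
latencies = [0.8] * 98 + [30.0] * 2

summary = {
    "mean": round(mean(latencies), 3),
    "p50": percentile(latencies, 50),
    "p95": percentile(latencies, 95),
    "p99": percentile(latencies, 99),
}
print(summary)
```

With these samples the mean, P50, and P95 all look healthy, and only P99 surfaces the 30-second requests. That is the argument for alerting on the tail rather than the average.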

What controls TTFT

  • Prefill time: grows with input length, and the attention portion scales quadratically with sequence length (prompt caching mitigates this for repeated prefixes). Long context = slow first token.
  • Queueing on the provider side: under load, your request waits. Higher-tier accounts typically see less queueing.
  • Network round-trips: region matters. A request from Tokyo to a US-only model pays 200ms+ before any work starts.
  • Cold starts: the first request to a self-hosted endpoint that just spun up.

Levers that actually move TTFT

  • Prompt caching — cached prefixes process drastically faster. Often the single biggest TTFT lever.
  • Trim the prompt — every removed token shortens prefill.
  • Smaller model — Haiku TTFT << Opus TTFT for the same prompt. Combine with routing.
  • Region pinning — match your serving region to your users.
  • Provider tier — higher tiers get better priority.
  • Speculative decoding — accelerates per-token generation; less impact on TTFT but improves total latency.
  • Streaming — doesn't reduce TTFT mathematically but makes the perceived latency much lower because UI starts updating.
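The "trim the prompt" lever can be sketched as a history-trimming helper that keeps the system message and the most recent turns that fit a token budget. This is an illustration under stated assumptions: `rough_token_count` counts whitespace-separated words as a crude stand-in for a real tokenizer, and the message shape (`role` / `content` dictionaries) is assumed rather than taken from any specific API.

```python
from typing import Dict, List

Message = Dict[str, str]  # assumed shape: {"role": ..., "content": ...}


def rough_token_count(text: str) -> int:
    """Crude stand-in for a tokenizer: counts whitespace-separated words."""
    return len(text.split())


def trim_history(messages: List[Message], budget: int) -> List[Message]:
    """Keep system messages plus the most recent turns that fit within the budget."""
    system = [m for m in messages if m["role"] == "system"]
    turns = [m for m in messages if m["role"] != "system"]
    used = sum(rough_token_count(m["content"]) for m in system)
    kept: List[Message] = []
    for msg in reversed(turns):            # walk from newest to oldest
        cost = rough_token_count(msg["content"])
        if used + cost > budget:
            break                          # older turns are dropped
        kept.append(msg)
        used += cost
    return system + list(reversed(kept))   # restore chronological order
```

In a real system you would swap in the tokenizer that matches your model and decide whether dropped turns should be summarised instead of discarded.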

What controls total latency

  • Output length — set max_tokens, ask for terser responses. Sometimes the cheapest latency win.
  • Tool calls — every agent step is a network round-trip. Cap max steps.
  • Sequential vs parallel — independent tool calls done in parallel save real wall-time.
  • Self-consistency, ToT, reflexion — more samples = more wall-time. Use only where the quality lift justifies it.
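The sequential-vs-parallel point is easy to demonstrate with `asyncio`. In this sketch `call_tool` is a hypothetical tool whose sleep stands in for a network round-trip; the tool names and delays are invented for the example.

```python
import asyncio
import time


async def call_tool(name: str, delay_s: float) -> str:
    """Hypothetical tool call; the sleep stands in for a network round-trip."""
    await asyncio.sleep(delay_s)
    return f"{name}:done"


async def run_sequential(calls):
    # Each call waits for the previous one, so wall-clock time is the sum of delays.
    return [await call_tool(name, delay) for name, delay in calls]


async def run_parallel(calls):
    # Independent calls overlap, so wall-clock time is roughly the slowest delay.
    # gather returns results in the same order as its inputs.
    return await asyncio.gather(*(call_tool(name, delay) for name, delay in calls))


def timed(coro):
    start = time.perf_counter()
    result = asyncio.run(coro)
    return result, time.perf_counter() - start


calls = [("search", 0.05), ("weather", 0.05), ("calendar", 0.05)]
seq_result, seq_s = timed(run_sequential(calls))
par_result, par_s = timed(run_parallel(calls))
print(f"sequential: {seq_s:.3f}s, parallel: {par_s:.3f}s")
```

This only helps when the calls are genuinely independent. If one tool needs another's output, that pair stays sequential.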

Architectural moves

  • Multi-region deployment. Route by user geography.
  • Pre-warm pools — keep some replicas hot so cold starts don't hit live traffic.
  • Hedging — fire two requests, take whichever lands first. Doubles cost but shrinks tail latency dramatically. Worth it for the most user-facing paths.
  • Async fallback — if P99 is unacceptable, render a quick partial answer and resolve the full one async ("we're working on the rest of your answer…").
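Hedging, as described above, can be sketched with `asyncio.wait` and `FIRST_COMPLETED`. The `hedged` helper and `fake_request` are hypothetical names for illustration; a production version would also need to handle the case where the first task to finish raised an error, which this sketch simply re-raises.

```python
import asyncio


async def hedged(make_request, copies: int = 2):
    """Start `copies` identical requests, return the first result, cancel the rest."""
    tasks = [asyncio.create_task(make_request()) for _ in range(copies)]
    try:
        done, _pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
        return next(iter(done)).result()
    finally:
        for task in tasks:
            task.cancel()  # no-op for tasks that already finished
        # Wait for cancellations to settle so no task is left dangling.
        await asyncio.gather(*tasks, return_exceptions=True)


async def demo():
    delays = iter([0.5, 0.01])  # hypothetical: the first copy is slow, the second is fast

    async def fake_request():
        delay = next(delays)
        await asyncio.sleep(delay)
        return delay

    return await hedged(fake_request)


winner = asyncio.run(demo())
print(winner)
```

A common refinement is to delay the second request until the first has exceeded a latency threshold, which cuts the cost overhead; that variant is not shown here.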

Latency vs throughput

These pull against each other. Bigger batches = more throughput per dollar; bigger batches = higher per-request latency. Production-grade serving stacks (vLLM, TGI, TRT-LLM) make this knob explicit.

For interactive UX, prioritise latency. For offline pipelines, prioritise throughput. Don't run them on the same configuration.
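A toy model makes the tension concrete. It assumes one decode step costs a fixed overhead plus a small per-request cost; the constants `base_ms` and `per_request_ms` are invented for illustration and do not describe any real serving stack.

```python
def batch_tradeoff(batch_size: int, base_ms: float = 20.0, per_request_ms: float = 1.0):
    """Toy model: step time is a fixed overhead plus a per-request cost (both hypothetical)."""
    step_ms = base_ms + per_request_ms * batch_size  # per-token latency seen by each request
    tokens_per_s = batch_size * 1000.0 / step_ms     # aggregate throughput across the batch
    return step_ms, tokens_per_s


rows = [(b, *batch_tradeoff(b)) for b in (1, 8, 32, 128)]
for batch, step_ms, tps in rows:
    print(f"batch={batch:4d}  step={step_ms:6.1f} ms  throughput={tps:7.1f} tok/s")
```

Under this model both columns rise together as the batch grows: more tokens per second overall, and a longer wait per token for every individual request. Real stacks behave less linearly, but the direction of the trade is the same.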

What NOT to chase

  • Sub-100ms TTFT for hard prompts on big models. Physics. Use a smaller model or split the work.
  • Zero variance. Some variance is inherent. Aim for tight P95 / P99, not constant times.
  • Latency wins that hurt quality. A 200ms cut by switching to a worse model that produces 10% more support tickets is a bad trade.

In one line

Optimise for TTFT and P99, not P50 and total time. The user feels the bread; the angry 1% is the test of fitness.

Engr Mejba Ahmed
