Batching: How Inference Servers Serve a Thousand Users at Once

GPUs are starved on a single request — most of the chip is idle. Batching packs many requests into one forward pass for huge throughput wins.

Apr 29, 2026 · 3 min read

The bus-vs-taxi analogy

A taxi takes one passenger A to B. A bus takes 50. The bus has more wait at each stop and is less personal — but per passenger, it is dramatically cheaper and uses the road more efficiently.

A GPU running one request at a time is a taxi: most of the chip's thousands of cores sit idle while it serves one user. Batching turns it into a bus: 50 requests share the same forward pass. Each request waits a hair longer, but throughput can rise 10–50×, depending on the model and hardware.

Why GPUs need batching

Modern GPUs have thousands of compute cores. A single small request, say generating one token for one user, uses a sliver of them. The rest sit stalled, because each decode step has to stream the model's weights from memory no matter how many requests share it. Batching amortizes that cost: the same kernel call does useful work for many users in parallel, with very little extra latency per user.

Throughput at batch size 1: wasted silicon. Throughput at batch size 32: GPU happy.
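
To put rough numbers on the batch size 1 versus 32 contrast, here is a minimal back-of-the-envelope sketch in Python. Every constant is an illustrative assumption (a 7B-parameter model in 16-bit precision, 1 TB/s of memory bandwidth, 200 TFLOP/s of compute), and the model ignores KV cache reads and kernel overheads, so treat it as a cartoon rather than a benchmark.

```python
# Toy cost model for one decode step. All hardware numbers are assumptions.
WEIGHT_BYTES = 14e9        # assumed: 7B parameters at 2 bytes each
MEM_BANDWIDTH = 1.0e12     # assumed: 1 TB/s of memory bandwidth
PEAK_FLOPS = 2.0e14        # assumed: 200 TFLOP/s of usable compute
FLOPS_PER_TOKEN = 14e9     # roughly 2 FLOPs per parameter per generated token


def decode_step_seconds(batch_size: int) -> float:
    """Time for one decode step: the slower of streaming weights and doing the math."""
    memory_time = WEIGHT_BYTES / MEM_BANDWIDTH                 # paid once per step
    compute_time = batch_size * FLOPS_PER_TOKEN / PEAK_FLOPS   # grows with the batch
    return max(memory_time, compute_time)


def tokens_per_second(batch_size: int) -> float:
    """Each step emits one token for every request in the batch."""
    return batch_size / decode_step_seconds(batch_size)


if __name__ == "__main__":
    for b in (1, 8, 32, 128, 512):
        print(f"batch={b:4d}  step={decode_step_seconds(b) * 1e3:6.2f} ms  "
              f"throughput={tokens_per_second(b):9.0f} tok/s")
```

Under these assumed numbers the step time stays flat at 14 ms until the batch reaches about 200 requests, so every extra request below that point is close to free. Past it, compute becomes the bottleneck and per-step latency starts to climb.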

Static batching (the naive way)

Wait for N requests, run them together, return all answers when the slowest finishes.

Problems:

One long generation blocks N–1 short ones.
A new request that arrives 10 ms after the batch starts waits for the whole batch to finish.
GPU sits idle when there are not enough requests in the queue to fill a batch.

This is what early inference servers did. It is rarely good enough in production.
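
The first problem in the list, one long generation blocking the rest, can be sketched with a tiny simulation. The function names and the example lengths are hypothetical, and the sketch counts decode steps only, ignoring prefill and queueing time.

```python
from typing import List


def static_batch_steps(output_lengths: List[int], batch_size: int) -> int:
    """Decode steps needed when each batch runs until its longest request finishes."""
    total = 0
    for start in range(0, len(output_lengths), batch_size):
        batch = output_lengths[start:start + batch_size]
        total += max(batch)   # everyone in the batch waits for the slowest request
    return total


def wasted_slot_steps(output_lengths: List[int], batch_size: int) -> int:
    """Slot-steps that produced no token: slots that finished early or were never filled."""
    useful = sum(output_lengths)
    return static_batch_steps(output_lengths, batch_size) * batch_size - useful


if __name__ == "__main__":
    lengths = [500, 20, 30, 25]   # one long generation, three short ones
    print(static_batch_steps(lengths, batch_size=4))   # steps until the batch returns
    print(wasted_slot_steps(lengths, batch_size=4))    # idle slot-steps in that time
```

With one 500-token generation and three short ones in a batch of four, the batch holds all four slots for 500 steps, and 1425 of the 2000 available slot-steps produce nothing.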

Continuous (in-flight) batching

The modern approach (vLLM, TGI, TRT-LLM):

The scheduler runs one decode step per active request per iteration.
When a request finishes, its slot is freed immediately and a new request slides in mid-flight.
Requests of wildly different lengths happily coexist; the slowest one does not block the rest.

This is the single biggest serving improvement of the last few years. Going from static to continuous batching often delivers a 5–10× throughput lift on the same hardware.
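
The scheduling loop described above can be sketched in a few lines. This is a simplified illustration, not how vLLM, TGI, or TRT-LLM implement it: the function name is hypothetical, all requests are assumed to be queued up front, and prefill cost and memory limits are ignored.

```python
from collections import deque
from typing import List


def continuous_batch_steps(output_lengths: List[int], max_batch_size: int) -> int:
    """Count decode iterations when freed slots are refilled from the queue every step."""
    waiting = deque(output_lengths)   # tokens each queued request still has to generate
    running: List[int] = []           # remaining tokens for requests in the batch
    steps = 0
    while waiting or running:
        # Admit queued requests into any free slots before the next iteration.
        while waiting and len(running) < max_batch_size:
            running.append(waiting.popleft())
        # One iteration: every running request produces one token.
        running = [remaining - 1 for remaining in running]
        # Finished requests leave immediately, which frees their slots.
        running = [remaining for remaining in running if remaining > 0]
        steps += 1
    return steps


if __name__ == "__main__":
    lengths = [500, 20, 20, 20] * 4   # hypothetical mixed-length workload
    print(continuous_batch_steps(lengths, max_batch_size=4))
```

For this workload of sixteen requests, the loop finishes in 620 iterations. Static batching with the same batch size would run four batches of 500 steps each, 2000 in total, because every batch contains one long request.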

The knobs you'll actually tune

Max batch size: how many requests share one forward pass. Higher means more throughput, more memory, and slightly higher per-request latency.
Max waiting tokens: how many decode steps the running batch may take before queued prompts are forced into a prefill. Tunes the latency / utilisation trade-off.
Prefill vs decode scheduling: prefill is heavy, and running it alongside many decodes can cause latency spikes. Some servers separate them onto different streams or even different replicas.
Token budget per step: caps the amount of work in any one forward pass to avoid head-of-line blocking.
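
As a rough illustration of how the batch-size cap and the per-step token budget interact, here is a minimal admission sketch. The names `SchedulerConfig`, `max_batch_size`, `token_budget_per_step`, and `plan_step` are hypothetical and chosen for this example; real servers expose their own flags and use more elaborate policies.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class SchedulerConfig:
    # Hypothetical knob names for this sketch; real servers use their own flags.
    max_batch_size: int = 32            # most requests allowed in one forward pass
    token_budget_per_step: int = 4096   # most tokens processed in one forward pass


def plan_step(num_decoding: int, waiting_prompt_lens: List[int],
              cfg: SchedulerConfig) -> Tuple[int, List[int]]:
    """Pick which waiting prompts to prefill this step, in arrival order.

    Each decoding request costs one token of budget; each admitted prompt
    costs its full length. Returns (tokens_scheduled, admitted_prompt_lens).
    """
    tokens = num_decoding                          # running decodes are scheduled first
    slots_left = cfg.max_batch_size - num_decoding
    admitted: List[int] = []
    for prompt_len in waiting_prompt_lens:
        if slots_left <= 0 or tokens + prompt_len > cfg.token_budget_per_step:
            break                                  # stop at the first prompt that does not fit
        admitted.append(prompt_len)
        tokens += prompt_len
        slots_left -= 1
    return tokens, admitted


if __name__ == "__main__":
    cfg = SchedulerConfig(max_batch_size=4, token_budget_per_step=1000)
    print(plan_step(2, [300, 600, 100], cfg))   # third prompt is blocked by the slot cap
    print(plan_step(2, [300, 800, 100], cfg))   # second prompt is blocked by the token budget
```

The sketch admits strictly in arrival order, so a prompt longer than the whole budget would wait forever. It leaves that case out on purpose; a real scheduler needs some way to handle it, for example by splitting long prompts across steps.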

What batching does not fix

Memory pressure from KV cache — see the KV cache explainer. More concurrent requests = more cache.
Tail latency from long generations — a request asking for 8000 output tokens is just slow.
Cold start — first request to an empty server is always painful. Pre-warm critical pipelines.
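
The first point, more concurrent requests means more cache, can be made concrete with a little arithmetic. The sketch below uses the standard per-token estimate (one key vector and one value vector per layer) with an assumed model shape of 32 layers, 32 KV heads, head dimension 128, and 16-bit values. The 20 GiB cache budget and 2048 tokens per request are likewise assumptions for illustration.

```python
def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int,
                             head_dim: int, bytes_per_value: int = 2) -> int:
    """Bytes of KV cache one token occupies: a key and a value vector per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def max_concurrent_requests(cache_budget_bytes: int, tokens_per_request: int,
                            bytes_per_token: int) -> int:
    """How many requests of a given length fit in the cache budget at once."""
    return cache_budget_bytes // (tokens_per_request * bytes_per_token)


if __name__ == "__main__":
    per_token = kv_cache_bytes_per_token(num_layers=32, num_kv_heads=32, head_dim=128)
    print(per_token)   # bytes per cached token for the assumed model shape
    print(max_concurrent_requests(20 * 2**30, 2048, per_token))
```

For the assumed shape, each token takes 512 KiB of cache, so a 2048-token request takes 1 GiB and a 20 GiB budget holds 20 such requests. In a setup like this the memory limit, not the batch-size knob, caps concurrency.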

Latency vs throughput in one chart

Optimise for throughput when the workload is bursty and large (offline jobs, batch summarisation). Optimise for latency when the workload is interactive (chat, autocomplete). Continuous batching gets you most of both at once — but you still pick a side at the margin.

This is the core trade-off. Every serving config you see is a point on the latency-throughput curve.
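
One way to picture picking a point on that curve: fix a per-step latency budget and take the largest batch that still meets it. The sketch below assumes a hypothetical linear latency model (a fixed 14 ms per step plus 0.1 ms per request in the batch). Real curves are not this tidy, so measure your own before trusting any such number.

```python
BASE_STEP_US = 14_000    # assumed fixed cost per decode step, in microseconds
PER_REQUEST_US = 100     # assumed extra cost per request in the batch


def step_latency_us(batch_size: int) -> int:
    """Per-step latency under the assumed linear model."""
    return BASE_STEP_US + PER_REQUEST_US * batch_size


def throughput_tok_per_s(batch_size: int) -> float:
    """Tokens per second: one token per request per step."""
    return batch_size * 1_000_000 / step_latency_us(batch_size)


def largest_batch_within(latency_budget_us: int, max_batch_size: int = 256) -> int:
    """Largest batch size whose per-step latency fits the budget (0 if none does)."""
    best = 0
    for b in range(1, max_batch_size + 1):
        if step_latency_us(b) <= latency_budget_us:
            best = b
    return best


if __name__ == "__main__":
    for budget_us in (20_000, 40_000):
        b = largest_batch_within(budget_us)
        print(f"budget={budget_us / 1000:.0f} ms  batch={b}  "
              f"throughput={throughput_tok_per_s(b):.0f} tok/s")
```

An interactive workload would pick a tight budget and accept the smaller batch; an offline job would drop the budget and run at the cap.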

From the field

Unless you self-host you never set these knobs, but batching still explains a mystery every API user hits: why the same prompt is fast at 3am and slow at peak. You're sharing a batch with everyone else. When you do self-host, the one decision batching forces is picking a serving stack that does continuous (in-flight) batching by default: static batching looks fine in a benchmark and falls apart under real, variable-length traffic. Seeing that cured me of benchmarking inference with identical prompts; real load is mixed lengths, and that's exactly what separates a stack that holds up from one that doesn't.
