The bus-vs-taxi analogy
A taxi takes one passenger from A to B. A bus takes 50. The bus waits longer at each stop and is less personal, but per passenger it is dramatically cheaper and uses the road more efficiently.
A GPU running one request at a time is a taxi: most of the chip's thousands of cores sit idle while it serves one user. Batching turns it into a bus: 50 requests share the same forward pass. The wait per request is a hair longer, but throughput can go up 10–50×.
Why GPUs need batching
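Modern GPUs have thousands of compute cores. A single small request, say generating one token for one user, uses a sliver of them while the rest sit idle. Each decode step also has to read the model's weights from memory whether it serves one request or many, so that cost is paid once per step rather than once per user. Batching multiple requests together makes the same kernel call do useful work for many users in parallel, with very little extra latency per user.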
Throughput at batch size 1: wasted silicon. Throughput at batch size 32: GPU happy.
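A rough cost model makes the point concrete. The sketch below assumes each decode step pays a fixed cost to stream the weights plus a small per-request cost; both numbers are made-up placeholders, not measurements from any real GPU.

```python
def decode_step_ms(batch_size: int,
                   weight_read_ms: float = 20.0,    # assumed fixed cost per step
                   per_request_ms: float = 0.25):   # assumed marginal cost per request
    """Time for one decode step that advances every request by one token."""
    return weight_read_ms + per_request_ms * batch_size


def tokens_per_second(batch_size: int) -> float:
    """Aggregate throughput: each request in the batch emits one token per step."""
    return batch_size * 1000.0 / decode_step_ms(batch_size)


if __name__ == "__main__":
    for b in (1, 8, 32):
        print(f"batch={b:>2}  step={decode_step_ms(b):5.2f} ms  "
              f"throughput={tokens_per_second(b):7.1f} tok/s")
```

With these placeholder numbers, going from batch size 1 to 32 stretches each step from 20.25 ms to 28 ms while aggregate throughput rises roughly 23×. Real ratios depend on the model, the hardware, and sequence lengths.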
Static batching (the naive way)
Wait for N requests, run them together, return all answers when the slowest finishes.
Problems:
- One long generation blocks N–1 short ones.
- A new request that arrives 10 ms after the batch starts waits for the whole batch to finish.
- GPU sits idle when there are not enough requests in the queue to fill a batch.
This is what early inference servers did. It is rarely good enough in production.
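The first problem is easy to see in a toy simulation. This is a minimal sketch, assuming every request is already queued at step 0 and that one "step" generates one token per request; `static_batching` is a hypothetical helper, not any server's API.

```python
def static_batching(output_lengths, batch_size):
    """Return (finish step per request, total decode steps) under static batching.

    output_lengths[i] is the number of tokens request i generates.
    """
    finish = []
    clock = 0
    for start in range(0, len(output_lengths), batch_size):
        batch = output_lengths[start:start + batch_size]
        clock += max(batch)                   # the batch runs until its slowest member ends
        finish.extend([clock] * len(batch))   # every answer is returned together
    return finish, clock


if __name__ == "__main__":
    finish, total = static_batching([100, 5, 5, 5, 5, 5], batch_size=2)
    print(finish, total)
```

In this example the 5-token request that shares a batch with the 100-token request is held until step 100, and every later batch queues behind it.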
Continuous (in-flight) batching
The modern approach (vLLM, TGI, TRT-LLM):
- The scheduler runs one decode step per active request per iteration.
- When a request finishes, its slot is freed immediately and a new request slides in mid-flight.
- Requests of wildly different lengths happily coexist; the slowest one does not block the rest.
This is the single biggest serving improvement of the last few years. Going from static to continuous batching often delivers a 5–10× throughput lift on the same hardware.
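The scheduling loop described above can be sketched in a few lines. This is a simplified simulation, not the scheduler of any of the servers named: it ignores prefill cost and memory limits, and `continuous_batching` is a hypothetical helper.

```python
from collections import deque


def continuous_batching(output_lengths, max_batch_size):
    """Return (finish step per request, total decode steps).

    Each loop iteration is one decode step: every active request emits one token.
    A finished request frees its slot and the next queued request takes it.
    """
    waiting = deque(enumerate(output_lengths))
    active = {}                               # request id -> tokens still to generate
    finish = [0] * len(output_lengths)
    step = 0
    while waiting or active:
        # Fill any free slots before running the step.
        while waiting and len(active) < max_batch_size:
            rid, remaining = waiting.popleft()
            active[rid] = remaining
        step += 1
        for rid in list(active):
            active[rid] -= 1
            if active[rid] <= 0:
                finish[rid] = step            # returned as soon as it is done
                del active[rid]
    return finish, step


if __name__ == "__main__":
    finish, total = continuous_batching([100, 5, 5, 5, 5, 5], max_batch_size=2)
    print(finish, total)
```

Here the short requests finish at steps 5, 10, 15, 20 and 25 while the 100-token request keeps its slot throughout. Under static batching with the same batch size, the short request paired with the long one would not return until step 100.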
The knobs you'll actually tune
- Max batch size: how many requests share one forward pass. Higher means more throughput and more memory, at slightly higher per-request latency.
- Max waiting tokens: how many decode steps the running batch may take before queued requests are forced in with a prefill. Tunes the latency / utilisation trade-off.
- Prefill vs decode scheduling: prefill processes the whole prompt at once and is heavy; running it alongside many decodes can cause latency spikes. Some servers separate them onto different streams or even different replicas.
- Token budget per step: caps the total number of tokens processed in any one forward pass, so a long prefill cannot stall the decodes sharing that step (head-of-line blocking).
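To show how two of these knobs interact, here is a minimal admission sketch. The names `SchedulerConfig`, `max_batch_size`, `token_budget_per_step` and `plan_step` are hypothetical and chosen for illustration; real servers spell these options differently and apply more rules.

```python
from dataclasses import dataclass


@dataclass
class SchedulerConfig:
    # Hypothetical knob names, not taken from any specific server.
    max_batch_size: int = 8
    token_budget_per_step: int = 512


def plan_step(num_running, waiting_prompt_lens, cfg):
    """Pick which queued prompts to prefill in the next forward pass.

    num_running: requests already decoding; each costs one token this step.
    waiting_prompt_lens: prompt lengths of queued requests, in arrival order.
    Returns the indices of the queued requests admitted this step.
    """
    budget = cfg.token_budget_per_step - num_running   # running decodes are served first
    free_slots = cfg.max_batch_size - num_running
    admitted = []
    for i, prompt_len in enumerate(waiting_prompt_lens):
        if free_slots <= 0 or prompt_len > budget:
            break                                      # keep arrival order; do not skip ahead
        admitted.append(i)
        budget -= prompt_len
        free_slots -= 1
    return admitted


if __name__ == "__main__":
    cfg = SchedulerConfig()
    print(plan_step(4, [200, 300, 50], cfg))
```

In this sketch a prompt longer than the whole budget is never admitted. Servers that enforce a token budget typically handle that case by splitting the prefill into chunks, which is left out here.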
What batching does not fix
- Memory pressure from KV cache — see the KV cache explainer. More concurrent requests = more cache.
- Tail latency from long generations — a request asking for 8000 output tokens is just slow.
- Cold start — first request to an empty server is always painful. Pre-warm critical pipelines.
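The memory-pressure point can be made concrete with a little arithmetic: each token stores one key vector and one value vector per layer. The sketch below assumes a standard transformer layout; the model dimensions and the 40 GiB cache budget in the example are illustrative values, not a specific deployment.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes of KV cache one token occupies (a key and a value per layer)."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def max_concurrent_requests(cache_budget_bytes, tokens_per_request, **model_dims):
    """How many requests of a given total length fit in the cache budget."""
    per_request = tokens_per_request * kv_cache_bytes_per_token(**model_dims)
    return cache_budget_bytes // per_request


if __name__ == "__main__":
    dims = dict(num_layers=32, num_kv_heads=32, head_dim=128)  # illustrative, 16-bit values
    print(kv_cache_bytes_per_token(**dims))                    # bytes per token
    print(max_concurrent_requests(40 * 2**30, 4096, **dims))   # requests of 4096 tokens
```

With these dimensions one token costs 512 KiB of cache, so a 4096-token request needs 2 GiB and a 40 GiB budget holds 20 such requests, however large the batch-size setting is.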
Latency vs throughput in one chart
Optimise for throughput when the workload is bursty and large (offline jobs, batch summarisation). Optimise for latency when the workload is interactive (chat, autocomplete). Continuous batching gets you most of both at once — but you still pick a side at the margin.
This is the core trade-off. Every serving config you see is a point on the latency-throughput curve.