Batching: How Inference Servers Serve a Thousand Users at Once

A GPU serving a single request is starved for work: most of the chip sits idle. Batching packs many requests into one forward pass, which raises throughput substantially.

The bus-vs-taxi analogy

A taxi takes one passenger from A to B. A bus takes 50. The bus waits longer at each stop and is less personal, but per passenger it is dramatically cheaper and uses the road more efficiently.

A GPU running one request at a time is a taxi: most of the chip's thousands of cores sit idle while it serves one user. Batching turns it into a bus: 50 requests share the same forward pass. The wait per request is a hair longer, but throughput can rise 10–50×, depending on the model and hardware.

Why GPUs need batching

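Modern GPUs have thousands of compute cores. A single small request, say generating one token for one user, uses a sliver of them while the rest sit stalled. Batching multiple requests together makes the same kernel call do useful work for many users in parallel. Because each decode step has to read the model weights from memory once no matter how many requests share it, the extra latency per user stays small until the batch is large enough to saturate compute.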

Throughput at batch size 1: wasted silicon. Throughput at batch size 32: GPU happy.
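To make the "wasted silicon" point concrete, here is a rough back-of-the-envelope model of one decode step. It assumes the step takes as long as the slower of two things: reading the weights from memory, or doing the arithmetic for every request in the batch. The function names and all hardware numbers are hypothetical, and the model ignores KV cache reads and kernel overheads, so treat it as a sketch of the shape of the curve rather than a prediction.

```python
def decode_step_seconds(batch_size, weight_bytes, mem_bandwidth, flops_per_token, peak_flops):
    """Estimate one decode step as the slower of memory traffic and compute."""
    # Weights are read once per step, whatever the batch size.
    memory_time = weight_bytes / mem_bandwidth
    # Arithmetic grows linearly with the number of requests in the batch.
    compute_time = batch_size * flops_per_token / peak_flops
    return max(memory_time, compute_time)


def tokens_per_second(batch_size, **hw):
    """Each request in the batch produces one token per step."""
    return batch_size / decode_step_seconds(batch_size, **hw)


# Hypothetical numbers: 7B parameters in 16-bit, 1 TB/s memory, 100 TFLOP/s compute.
HW = dict(weight_bytes=14e9, mem_bandwidth=1e12, flops_per_token=14e9, peak_flops=100e12)

for b in (1, 8, 32, 128):
    step_ms = decode_step_seconds(b, **HW) * 1e3
    print(f"batch={b:4d}  step={step_ms:6.2f} ms  throughput={tokens_per_second(b, **HW):8.0f} tok/s")
```

With these made-up numbers the step time is the same at batch size 1 and batch size 32, so throughput grows about 32× for free. Only at larger batch sizes does the compute term take over and the step start to slow down.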

Static batching (the naive way)

Wait for N requests, run them together, return all answers when the slowest finishes.

Problems:

  • One long generation blocks N–1 short ones.
  • A new request that arrives 10 ms after the batch starts waits for the whole batch to finish.
  • GPU sits idle when there are not enough requests in the queue to fill a batch.

This is what early inference servers did. It is rarely good enough in production.
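The first problem is easy to see in a toy simulation. The sketch below counts decode steps for static batching and how many slot-steps are spent on requests that have already finished. The function name and the output lengths are made up for illustration, and it assumes one token per request per step.

```python
def static_batching_steps(output_lengths, batch_size):
    """Return (total_steps, wasted_slot_steps) for fixed batches run to completion."""
    total_steps = 0
    wasted = 0
    for start in range(0, len(output_lengths), batch_size):
        batch = output_lengths[start:start + batch_size]
        steps = max(batch)  # the batch runs until its longest request finishes
        total_steps += steps
        # Slots that sit idle after their own request has finished.
        wasted += sum(steps - n for n in batch)
        # Slots that were never filled in a partial final batch.
        wasted += (batch_size - len(batch)) * steps
    return total_steps, wasted


lengths = [10, 200, 15, 20, 12, 18, 25, 30]  # hypothetical output lengths in tokens
total, wasted = static_batching_steps(lengths, batch_size=4)
useful = sum(lengths)
print(f"steps={total} wasted_slot_steps={wasted} utilisation={useful / (useful + wasted):.0%}")
```

For these lengths the run takes 230 steps and wastes 590 of 920 slot-steps: the single 200-token request holds three short requests hostage.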

Continuous (in-flight) batching

The modern approach (vLLM, TGI, TRT-LLM):

  • On every iteration, the scheduler runs one decode step for each active request.
  • When a request finishes, its slot is freed immediately and a new request slides in mid-flight.
  • Requests of wildly different lengths happily coexist; the slowest one does not block the rest.

This is the single biggest serving improvement of the last few years. Going from static to continuous batching often delivers a 5–10× throughput lift on the same hardware, with the gain depending on how much request lengths vary.
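A minimal sketch of the slot-refill idea, assuming every active request produces exactly one token per iteration and all output lengths are positive. It is a toy model, not how vLLM, TGI or TRT-LLM are implemented; both function names are hypothetical, and a run-to-completion baseline is included only for comparison.

```python
from collections import deque


def continuous_batching_steps(output_lengths, max_slots):
    """Count decode iterations when freed slots are refilled immediately."""
    queue = deque(output_lengths)  # waiting requests, in arrival order
    active = []                    # tokens still to generate per running request
    steps = 0
    while queue or active:
        # Fill any free slots before the next iteration.
        while queue and len(active) < max_slots:
            active.append(queue.popleft())
        # One iteration: every active request generates one token.
        active = [n - 1 for n in active]
        steps += 1
        # Finished requests leave and free their slot right away.
        active = [n for n in active if n > 0]
    return steps


def run_to_completion_steps(output_lengths, batch_size):
    """Baseline: fixed batches, each running until its longest request ends."""
    return sum(
        max(output_lengths[i:i + batch_size])
        for i in range(0, len(output_lengths), batch_size)
    )


lengths = [200, 10, 10, 10, 200, 10, 10, 10]  # hypothetical output lengths
print("static:    ", run_to_completion_steps(lengths, 4))
print("continuous:", continuous_batching_steps(lengths, 4))
```

With these lengths the baseline needs 400 iterations and the continuous scheduler 210, because the second long request starts as soon as a short one frees a slot instead of waiting for the first batch to drain.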

The knobs you'll actually tune

  • Max batch size — how many requests share one forward pass. Higher = more throughput, more memory, slightly higher per-request latency.
  • Max waiting tokens — how many decode steps the running batch may complete while new prompts wait in the queue before the scheduler pauses decoding to prefill them. Tunes the latency / utilisation trade-off.
  • Prefill vs decode scheduling — prefill is heavy; running it alongside many decodes can cause latency spikes. Some servers separate them onto different streams or even different replicas.
  • Token budget per step — caps the amount of work in any one forward pass to avoid head-of-line blocking.
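To show how a batch-size cap and a per-step token budget can interact, here is a small planning function. It assumes a scheduler that serves running decodes first (one token each) and spends whatever budget is left on prefill, splitting long prompts across steps. `Request`, `plan_step` and the parameter names are hypothetical and do not correspond to any particular server's API.

```python
from dataclasses import dataclass


@dataclass
class Request:
    name: str
    prompt_tokens_left: int  # prompt tokens not yet prefilled


def plan_step(decoding, waiting, token_budget, max_batch_size):
    """Plan one forward pass: decodes first, then prefill chunks.

    Returns (decode_names, prefill_chunks), where prefill_chunks is a
    list of (name, tokens) pairs. Mutates prompt_tokens_left in place.
    """
    decode_names = [r.name for r in decoding[:max_batch_size]]
    budget = token_budget - len(decode_names)  # each decode costs one token
    slots = max_batch_size - len(decode_names)
    prefill_chunks = []
    for req in waiting:
        if budget <= 0 or slots <= 0:
            break
        if req.prompt_tokens_left == 0:
            continue  # already fully prefilled
        chunk = min(req.prompt_tokens_left, budget)  # split long prompts
        prefill_chunks.append((req.name, chunk))
        req.prompt_tokens_left -= chunk
        budget -= chunk
        slots -= 1
    return decode_names, prefill_chunks


decoding = [Request("a", 0), Request("b", 0), Request("c", 0)]
waiting = [Request("d", 1000), Request("e", 50)]
for step in range(2):
    print(step, plan_step(decoding, waiting, token_budget=512, max_batch_size=8))
```

In this example the 1000-token prompt is prefilled over two steps (509 tokens, then 491), so the three running decodes keep producing a token every step instead of stalling behind one large prefill.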

What batching does not fix

  • Memory pressure from KV cache — see the KV cache explainer. More concurrent requests = more cache.
  • Tail latency from long generations — a request asking for 8000 output tokens is just slow.
  • Cold start — first request to an empty server is always painful. Pre-warm critical pipelines.
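The KV cache point is worth putting numbers on. The sketch below uses the standard size estimate for a transformer's key/value cache: two tensors per layer, each holding one vector per KV head per token. The model shape in the example is hypothetical, and the function name is made up for illustration.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch_size, bytes_per_value=2):
    """Estimate KV cache size; bytes_per_value=2 assumes 16-bit storage."""
    # The leading 2 accounts for one key tensor and one value tensor per layer.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_value


GIB = 2 ** 30
# Hypothetical model shape: 32 layers, 32 KV heads, head dimension 128.
shape = dict(num_layers=32, num_kv_heads=32, head_dim=128)

per_token = kv_cache_bytes(seq_len=1, batch_size=1, **shape)
print(f"per token:              {per_token / 2**20:.2f} MiB")
for batch in (1, 8, 32):
    total = kv_cache_bytes(seq_len=4096, batch_size=batch, **shape)
    print(f"batch={batch:2d} at 4096 tokens: {total / GIB:.0f} GiB")
```

For this shape the cache costs 0.5 MiB per token, so 32 concurrent requests at 4096 tokens each need 64 GiB for the cache alone. That is why the maximum batch size is usually bounded by memory before it is bounded by compute.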

Latency vs throughput in one chart

Optimise for throughput when the workload is bursty and large (offline jobs, batch summarisation). Optimise for latency when the workload is interactive (chat, autocomplete). Continuous batching gets you most of both at once — but you still pick a side at the margin.

This is the core trade-off: every serving config you see is a point on the latency-throughput curve.

Engr Mejba Ahmed
