Quantization: Shrinking Models Without Killing Them

Store every weight in 4 bits instead of 16, fit a 70B model on one GPU, and lose almost no quality.

Apr 29, 2026 · 3 min read

The map-resolution analogy

A topographical map can be ultra-detailed (every contour at 1m) or coarse (every 10m). Walking from town A to town B, you do not need every blade of grass — you just need the path. The coarser map is 5–10× smaller and still gets you there.

Quantization is the same idea for AI weights. The full-precision values are like the ultra-detailed map. Most of that precision is not needed — the model still does the right thing if every weight is rounded to a much coarser grid.
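To make "rounded to a much coarser grid" concrete, here is a minimal sketch of symmetric round-to-nearest quantization with a single shared scale. The function names and the toy weights are illustrative assumptions, not taken from any particular library.

```python
def quantize(weights, bits):
    """Symmetric round-to-nearest quantization with one shared scale."""
    qmax = 2 ** (bits - 1) - 1            # 127 for 8-bit, 7 for 4-bit, 1 for 2-bit
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Each weight becomes a small integer code on the grid {-qmax, ..., qmax}.
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate real values."""
    return [c * scale for c in codes]


def max_error(weights, bits):
    """Largest absolute rounding error after a quantize/dequantize round trip."""
    codes, scale = quantize(weights, bits)
    restored = dequantize(codes, scale)
    return max(abs(w - r) for w, r in zip(weights, restored))


weights = [0.12, -0.53, 0.98, -0.07, 0.31, -0.88]  # toy values for illustration
for bits in (8, 4, 2):
    codes, _ = quantize(weights, bits)
    print(f"{bits}-bit codes={codes} max error={max_error(weights, bits):.4f}")
```

The rounding error per weight is bounded by half the grid spacing, so each bit removed roughly doubles the worst-case error.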

The precision ladder

Precision	Bits/weight	Size of 7B model	Speed (relative)	Quality loss
FP32	32	28 GB	1×	none
FP16 / BF16	16	14 GB	1.5×	negligible
INT8	8	7 GB	2×	tiny
INT4	4	3.5 GB	3–4×	small (with care)
INT2	2	1.75 GB	4–5×	usually noticeable

Modern training is mostly BF16/FP16. Modern serving is increasingly INT8 or INT4.
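The size column in the table is plain arithmetic: parameters times bits per weight, divided by 8 bits per byte. A short sketch, assuming 1 GB = 10^9 bytes and counting weights only (no scale factors, activations, or KV cache):

```python
def weight_memory_gb(n_params, bits_per_weight):
    """Approximate weight storage in GB (1 GB = 1e9 bytes), weights only."""
    return n_params * bits_per_weight / 8 / 1e9


ladder = [("FP32", 32), ("FP16/BF16", 16), ("INT8", 8), ("INT4", 4), ("INT2", 2)]
for name, bits in ladder:
    small = weight_memory_gb(7e9, bits)    # 7B-parameter model
    large = weight_memory_gb(70e9, bits)   # 70B-parameter model
    print(f"{name:10s} 7B: {small:6.2f} GB   70B: {large:6.1f} GB")
```

Real quantized checkpoints come out slightly larger than this estimate because they also store the scale factors.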

Two flavours

Post-training quantization (PTQ) — take a trained FP16 model, quantize. Cheap (minutes), some quality loss.
Quantization-aware training (QAT) — train as if weights were quantized. Higher quality at the same bits, but needs retraining.
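One common way to "train as if weights were quantized" is fake quantization with a straight-through estimator: the forward pass uses the rounded weight, while the gradient updates a full-precision copy. The sketch below shows this on a one-weight toy problem. The fixed grid spacing, learning rate, and target value are assumptions chosen for illustration, and real QAT setups differ in many details.

```python
def fake_quant(w, scale=0.25, bits=4):
    """Round w to the nearest point of a fixed signed grid (forward pass only)."""
    qmax = 2 ** (bits - 1) - 1
    return max(-qmax, min(qmax, round(w / scale))) * scale


def train_qat(x, target, steps=200, lr=0.1):
    """Fit y = w * x with squared error while the forward pass sees fake_quant(w)."""
    w = 0.0                                  # full-precision "shadow" weight
    for _ in range(steps):
        w_q = fake_quant(w)                  # the model computes with the quantized weight
        grad = 2 * (w_q * x - target) * x    # d(loss)/d(w_q) for squared error
        w -= lr * grad                       # straight-through: treat d(w_q)/d(w) as 1
    return w, fake_quant(w)


shadow, quantized = train_qat(x=1.0, target=0.6)
print(f"shadow weight={shadow:.3f}  quantized weight={quantized}")
```

The ideal weight (0.6) is not on the grid, so training settles on one of its two grid neighbours while the shadow weight hovers near the rounding boundary between them.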

How "lossless 4-bit" is even possible

Naive INT4, with one scale shared across all weights, destroys the model. The tricks that make 4-bit work:

Group-wise quantization — different scale factors per small group of weights, not per layer.
Outlier handling — keep the rare extreme values in higher precision (the famous 0.1% that cause 90% of damage).
Calibration — pick scales using a small dataset of real activations, not just min/max of weights.

GPTQ, AWQ, and bitsandbytes are the popular implementations. Each uses a slightly different recipe.
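The effect of the first two tricks can be seen on a toy weight vector with a single outlier. This is a simplified sketch with made-up values and hypothetical helper names. It is not the GPTQ, AWQ, or bitsandbytes algorithm, and it leaves out calibration entirely.

```python
def quantize_dequantize(values, bits=4):
    """Symmetric round-to-nearest with one scale for all given values."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max((abs(v) for v in values), default=0.0)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]


def groupwise(values, group_size, bits=4):
    """Quantize each consecutive group with its own scale."""
    out = []
    for start in range(0, len(values), group_size):
        out.extend(quantize_dequantize(values[start:start + group_size], bits))
    return out


def keep_outliers(values, threshold, bits=4):
    """Quantize values with |v| <= threshold; pass larger ones through unchanged."""
    inliers = [v for v in values if abs(v) <= threshold]
    restored = iter(quantize_dequantize(inliers, bits))
    return [v if abs(v) > threshold else next(restored) for v in values]


def mean_abs_error(original, restored):
    return sum(abs(a - b) for a, b in zip(original, restored)) / len(original)


# Toy weights: mostly small values plus one outlier (8.0).
weights = [0.10, -0.20, 0.15, -0.05, 0.30, -0.25, 0.12, 8.00]

err_tensor = mean_abs_error(weights, quantize_dequantize(weights))
err_group = mean_abs_error(weights, groupwise(weights, group_size=4))
err_outlier = mean_abs_error(weights, keep_outliers(weights, threshold=1.0))

print(f"one scale: {err_tensor:.4f}  per group: {err_group:.4f}  outlier kept: {err_outlier:.4f}")
```

With one shared scale, the outlier stretches the grid so far that every small weight rounds to zero. Per-group scales confine that damage to the outlier's own group, and keeping the outlier in full precision removes it.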

Activation quantization is the harder half

Quantizing weights is the easy part. Quantizing activations during inference is harder because they vary per input and have wild outliers. INT8 weight + FP16 activation is a common compromise; full INT8 needs more care.
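The weight-only compromise can be sketched as a matrix-vector product in which the weights are stored as INT8 codes with one scale per output row, while the activations stay in floating point. The helper names and numbers are illustrative assumptions. Python floats stand in for FP16 here, and production kernels fuse the dequantization step rather than looping like this.

```python
def quantize_rows_int8(matrix):
    """Per-row symmetric INT8: integer codes plus one scale per output row."""
    codes, scales = [], []
    for row in matrix:
        max_abs = max(abs(w) for w in row)
        scale = max_abs / 127 if max_abs > 0 else 1.0
        codes.append([round(w / scale) for w in row])
        scales.append(scale)
    return codes, scales


def matvec_weight_only(codes, scales, activations):
    """y = W x with W held as INT8 codes; activations are never quantized."""
    return [scale * sum(c * x for c, x in zip(row, activations))
            for row, scale in zip(codes, scales)]


W = [[0.20, -0.50, 0.10],
     [1.50, 0.30, -0.70]]   # toy weight matrix
x = [1.0, 2.0, -3.0]        # toy activations, left in floating point

codes, scales = quantize_rows_int8(W)
exact = [sum(w * xi for w, xi in zip(row, x)) for row in W]
approx = matvec_weight_only(codes, scales, x)
print("exact :", exact)
print("approx:", approx)
```

Because the weights are fixed, their scales can be computed once offline. Activation scales would have to be chosen per input, which is where the outlier problem described above shows up.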

What you actually feel

Memory — drops linearly with bits. Lets bigger models fit on smaller GPUs.
Throughput — usually goes up because the bottleneck is memory bandwidth, not compute.
Latency — often improves, especially for memory-bound steps like decode.
Quality — drops mildly at INT8, modestly at INT4, sharply below.
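The throughput and latency points follow from a back-of-the-envelope bound: if single-stream decode has to read every weight once per token, tokens per second cannot exceed memory bandwidth divided by the weight size. The sketch below assumes a hypothetical accelerator with 1000 GB/s of bandwidth and ignores the KV cache, batching, and dequantization overhead, so measured gains will be smaller than this bound suggests.

```python
def decode_tokens_per_second_bound(n_params, bits_per_weight, bandwidth_gb_s):
    """Upper bound for single-stream decode: bandwidth / bytes of weights read per token."""
    weight_bytes = n_params * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / weight_bytes


BANDWIDTH_GB_S = 1000  # hypothetical accelerator, not a measured figure

for bits in (16, 8, 4):
    bound = decode_tokens_per_second_bound(7e9, bits, BANDWIDTH_GB_S)
    print(f"{bits:2d}-bit 7B model: at most ~{bound:.0f} tokens/s per stream")
```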

When not to quantize

Your eval is brittle and small drops matter (tight prod metrics).
You're below ~1B params — quantization noise hurts small models more than big ones.
You're training, not just serving. Use mixed-precision training (BF16 + FP32 master weights) instead.

The rule of thumb

Start with FP16. Try INT8: it almost always costs no measurable quality. Try INT4 with AWQ or GPTQ: usually 1–3% quality loss for ~4× memory savings relative to FP16. Do not go to INT2 or below without strong evidence the eval still holds.
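As a sketch of how this rule could be automated, the helper below picks the smallest bit-width whose eval score stays within an allowed relative drop from the FP16 baseline. The function name, the 2% threshold, and the scores are all hypothetical. The scores in particular are invented for illustration and are not measurements.

```python
def pick_bit_width(eval_scores, baseline_bits=16, max_relative_drop=0.02):
    """Return the smallest bit-width whose score stays within the allowed drop."""
    baseline = eval_scores[baseline_bits]
    floor = baseline * (1 - max_relative_drop)
    passing = [bits for bits, score in eval_scores.items() if score >= floor]
    return min(passing)  # the baseline itself always passes, so this is never empty


# Hypothetical results from your own eval (bits -> score), made up for illustration.
scores = {16: 0.812, 8: 0.810, 4: 0.799, 2: 0.640}

print("tolerating a 2% drop  :", pick_bit_width(scores))
print("tolerating a 0.5% drop:", pick_bit_width(scores, max_relative_drop=0.005))
```

A tighter tolerance pushes the choice back toward higher precision, which is the "brittle eval" case described earlier.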

From the field

Quantization is one of the rare free lunches in serving. Dropping to INT8 has cost me essentially no quality on every model I've shipped, while roughly halving the memory so it fits on a cheaper GPU. INT4 is usually fine too with AWQ or GPTQ — but "usually" is the operative word: I never trust a published benchmark for this and always re-run my own eval after quantizing, because the quality hit lands unevenly across tasks. The one place I don't quantize is small models — there's no precision to spare. Start at the smallest bit-width your own eval still passes, not the one a leaderboard blessed.
