The map-resolution analogy
A topographical map can be ultra-detailed (every contour at 1m) or coarse (every 10m). Walking from town A to town B, you do not need every blade of grass — you just need the path. The coarser map is 5–10× smaller and still gets you there.
Quantization is the same idea for AI weights. The full-precision values are like the ultra-detailed map. Most of that precision is not needed — the model still does the right thing if every weight is rounded to a much coarser grid.
The precision ladder
| Precision | Bits/weight | Size of 7B model | Speed (relative) | Quality loss |
|---|---|---|---|---|
| FP32 | 32 | 28 GB | 1× | none |
| FP16 / BF16 | 16 | 14 GB | 1.5× | negligible |
| INT8 | 8 | 7 GB | 2× | tiny |
| INT4 | 4 | 3.5 GB | 3–4× | small (with care) |
| INT2 | 2 | 1.75 GB | 4–5× | usually noticeable |
Modern training is mostly BF16/FP16. Modern serving is increasingly INT8 or INT4.
Two flavours
- Post-training quantization (PTQ) — take a trained FP16 model, quantize. Cheap (minutes), some quality loss.
- Quantization-aware training (QAT) — train as if weights were quantized. Higher quality at the same bits, but needs retraining.
How "lossless 4-bit" is even possible
Naive INT4 averaged across all weights destroys the model. The tricks that make 4-bit work:
- Group-wise quantization — different scale factors per small group of weights, not per layer.
- Outlier handling — keep the rare extreme values in higher precision (the famous 0.1% that cause 90% of damage).
- Calibration — pick scales using a small dataset of real activations, not just min/max of weights.
GPTQ, AWQ, and bitsandbytes are the popular implementations. Each uses a slightly different recipe.
Activation quantization is the harder half
Quantizing weights is the easy part. Quantizing activations during inference is harder because they vary per input and have wild outliers. INT8 weight + FP16 activation is a common compromise; full INT8 needs more care.
What you actually feel
- Memory — drops linearly with bits. Lets bigger models fit on smaller GPUs.
- Throughput — usually goes up because the bottleneck is memory bandwidth, not compute.
- Latency — often improves, especially for memory-bound steps like decode.
- Quality — drops mildly at INT8, modestly at INT4, sharply below.
When not to quantize
- Your eval is brittle and small drops matter (tight prod metrics).
- You're below ~1B params — quantization noise hurts small models more than big ones.
- You're training, not just serving. Use mixed-precision training (BF16 + FP32 master weights) instead.
The rule of thumb
Start FP16. Try INT8 — almost always free quality. Try INT4 with AWQ or GPTQ — usually 1–3% quality loss for ~4× memory savings. Do not go to INT2 or below without strong evidence the eval still holds.