Skip to main content
Inference & Optimization Crawler graph 3 Slider

Quantization: Shrinking Models Without Killing Them

Store every weight in 4 bits instead of 16, fit a 70B model on one GPU, and lose almost no quality. Tune precision to feel the trade-off.

· 2 Min. Lesezeit
Zum Lab springen
▸ Selbst ausprobieren

Zieh einen Slider — das Diagramm reagiert in Echtzeit.

FR /100
¶ Die Analogie

The map-resolution analogy

A topographical map can be ultra-detailed (every contour at 1m) or coarse (every 10m). Walking from town A to town B, you do not need every blade of grass — you just need the path. The coarser map is 5–10× smaller and still gets you there.

Quantization is the same idea for AI weights. The full-precision values are like the ultra-detailed map. Most of that precision is not needed — the model still does the right thing if every weight is rounded to a much coarser grid.

The precision ladder

Precision Bits/weight Size of 7B model Speed (relative) Quality loss
FP32 32 28 GB none
FP16 / BF16 16 14 GB 1.5× negligible
INT8 8 7 GB tiny
INT4 4 3.5 GB 3–4× small (with care)
INT2 2 1.75 GB 4–5× usually noticeable

Modern training is mostly BF16/FP16. Modern serving is increasingly INT8 or INT4.

Two flavours

  • Post-training quantization (PTQ) — take a trained FP16 model, quantize. Cheap (minutes), some quality loss.
  • Quantization-aware training (QAT) — train as if weights were quantized. Higher quality at the same bits, but needs retraining.

How "lossless 4-bit" is even possible

Naive INT4 averaged across all weights destroys the model. The tricks that make 4-bit work:

  • Group-wise quantization — different scale factors per small group of weights, not per layer.
  • Outlier handling — keep the rare extreme values in higher precision (the famous 0.1% that cause 90% of damage).
  • Calibration — pick scales using a small dataset of real activations, not just min/max of weights.

GPTQ, AWQ, and bitsandbytes are the popular implementations. Each uses a slightly different recipe.

Activation quantization is the harder half

Quantizing weights is the easy part. Quantizing activations during inference is harder because they vary per input and have wild outliers. INT8 weight + FP16 activation is a common compromise; full INT8 needs more care.

What you actually feel

  • Memory — drops linearly with bits. Lets bigger models fit on smaller GPUs.
  • Throughput — usually goes up because the bottleneck is memory bandwidth, not compute.
  • Latency — often improves, especially for memory-bound steps like decode.
  • Quality — drops mildly at INT8, modestly at INT4, sharply below.

When not to quantize

  • Your eval is brittle and small drops matter (tight prod metrics).
  • You're below ~1B params — quantization noise hurts small models more than big ones.
  • You're training, not just serving. Use mixed-precision training (BF16 + FP32 master weights) instead.

The rule of thumb

Start FP16. Try INT8 — almost always free quality. Try INT4 with AWQ or GPTQ — usually 1–3% quality loss for ~4× memory savings. Do not go to INT2 or below without strong evidence the eval still holds.

Engr Mejba Ahmed

Engr Mejba Ahmed

Claude Code Expert · Online

👋

Hey there!

Quick Actions

WhatsApp Instant reply

Chat on WhatsApp

+880 1723 741224 · Instant reply

Popular Questions

Engr Mejba Ahmed is connected
Engr Mejba Ahmed is typing...
Engr Mejba Ahmed avatar

✉ Want me to follow up? Drop your email

Engr Mejba Ahmed avatar

📞 Connect Directly

Choose how you'd like to reach me

WhatsApp

+880 1723 741224

Email

[email protected]

✓ Details sent! I'll get back to you shortly.

Powered by OpenAI

335+

Blog Posts

25

AI Courses

63

Projects

Services & Expertise

Pricing & Process

Learning & Resources

Connect & Support