## What is Quantization?
Neural network weights are typically stored as 16-bit floating point numbers (FP16 or BF16). An 8-billion parameter model needs 16 GB of memory just for the weights.
Quantization reduces the precision of these numbers — from 16-bit down to 8, 6, 4, or even 2 bits — making models smaller and faster at the cost of some quality.
| Precision | Memory per 8B model | Speed impact | Quality impact |
|---|---|---|---|
| BF16 (baseline) | 16.4 GB | Baseline | Baseline |
| 8-bit (Q8_0) | 8.5 GB | ~1.5x faster | ~99.5% quality |
| 6-bit (Q6_K) | 6.7 GB | ~2x faster | ~98% quality |
| 4-bit (Q4_K_M) | 4.7 GB | ~2.5x faster | ~95% quality |
| 2-bit (IQ2_M) | 2.5 GB | ~3x faster | ~85% quality |
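The core mechanic can be sketched in a few lines of pure Python: map floats to n-bit signed integers with a single scale factor, then reconstruct and measure the error. This is a minimal illustration only; real formats like Q4_K_M and GPTQ use per-block scales and smarter rounding.

```python
# Minimal symmetric quantization sketch (illustrative, not a real format):
# quantize floats to n-bit signed integers with one shared scale factor,
# then dequantize and compare against the originals.

def quantize(weights, bits):
    qmax = 2 ** (bits - 1) - 1          # e.g. 127 for 8-bit, 7 for 4-bit
    scale = max(abs(w) for w in weights) / qmax
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

weights = [0.82, -0.33, 0.05, -0.91, 0.47]

for bits in (8, 4, 2):
    q, scale = quantize(weights, bits)
    restored = dequantize(q, scale)
    err = max(abs(w - r) for w, r in zip(weights, restored))
    print(f"{bits}-bit: max reconstruction error {err:.4f}")
```

The error grows as the bit width shrinks, which is exactly the quality column in the table above.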
The speedup comes from the memory wall — LLM inference is bottlenecked by how fast you can stream weights from VRAM, not by computation. Smaller weights = faster streaming.
## Why Inference is Memory-Bound
A modern GPU like the RTX 5090 has massive compute power:
- CUDA cores can consume: ~103 TB/s of weight data
- VRAM can deliver: 1.8 TB/s
- Gap: ~57x, so the cores sit idle ~98% of the time
During token generation, the entire model is read from memory for each token. Quantization reduces the amount of data to read, directly improving throughput:
| Format | Data to Stream | Time per Token | Max tok/s |
|---|---|---|---|
| BF16 | 16.4 GB | 9.1 ms | ~110 |
| Q6_K | 6.7 GB | 3.7 ms | ~270 |
| Q4_K_M | 4.7 GB | 2.6 ms | ~385 |
| NVFP4 | 6.4 GB | 3.6 ms | ~280 |
These are theoretical maximums. Real-world performance is lower due to overhead, but the relative ratios hold.
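The table's arithmetic is simple enough to reproduce: in the memory-bound regime, each generated token streams the whole weight file from VRAM, so the throughput ceiling is just bandwidth divided by model size. The sketch below uses the ~1.8 TB/s RTX 5090 figure from the text.

```python
# Back-of-envelope decode throughput for a memory-bound model:
# max tokens/s ~= memory bandwidth / model size, because every token
# reads the entire weight file from VRAM. Real throughput is lower.

BANDWIDTH_GB_S = 1800  # ~1.8 TB/s (RTX 5090)

def max_tokens_per_second(model_gb, bandwidth_gb_s=BANDWIDTH_GB_S):
    """Return (tokens/s ceiling, ms per token) for a given model size."""
    time_per_token_ms = model_gb / bandwidth_gb_s * 1000
    return bandwidth_gb_s / model_gb, time_per_token_ms

for fmt, size_gb in [("BF16", 16.4), ("Q6_K", 6.7), ("Q4_K_M", 4.7)]:
    tok_s, ms = max_tokens_per_second(size_gb)
    print(f"{fmt}: {ms:.1f} ms/token, ~{tok_s:.0f} tok/s ceiling")
```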
## The Formats

### GGUF (llama.cpp)
The most widely used format. GGUF uses mixed-precision integer quantization with multiple sub-formats:
| Sub-format | Bits | Quality | Use case |
|---|---|---|---|
| Q8_0 | 8-bit | Excellent (99.5%) | When you have VRAM to spare |
| Q6_K | 6-bit | Very good (98%) | Sweet spot for production |
| Q5_K_M | 5-bit | Good (96-97%) | Balanced |
| Q4_K_M | 4-bit | Acceptable (95%) | When VRAM is tight |
| IQ4_XS | 4-bit | Acceptable (94%) | Aggressive compression |
| IQ2_M | 2-bit | Poor (85%) | Experimental |
**Pros:** Universal compatibility (Ollama, llama.cpp, kobold.cpp), well-tested, many options.
**Cons:** Uses integer tensor cores, leaving FP4/FP8 cores unused on newer GPUs.
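The odd-looking 8.5 GB figure for Q8_0 falls out of its storage layout: llama.cpp's Q8_0 stores weights in blocks of 32 int8 values with one fp16 scale per block, so each weight costs slightly more than 8 bits.

```python
# Why a Q8_0 model of 8B params is 8.5 GB rather than 8.0 GB:
# each block of 32 int8 weights carries one fp16 scale factor.

BLOCK_SIZE = 32          # weights per Q8_0 block
VALUE_BITS = 8           # int8 per weight
SCALE_BITS = 16          # one fp16 scale per block

bits_per_weight = (BLOCK_SIZE * VALUE_BITS + SCALE_BITS) / BLOCK_SIZE
model_gb = 8e9 * bits_per_weight / 8 / 1e9

print(f"{bits_per_weight} bits/weight -> {model_gb:.1f} GB for 8B params")
# 8.5 bits/weight -> 8.5 GB for 8B params
```

The K-quants (Q6_K, Q4_K_M) use more elaborate nested-block schemes, but the same principle applies: their effective bits-per-weight is a little above the nominal bit width.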
### GPTQ
A calibration-based, post-training 4-bit integer format that uses a small sample dataset to minimize per-layer quantization error.
| Property | Value |
|---|---|
| Typical bits | 4-bit |
| Quality | ~95-96% (better than naive 4-bit) |
| Calibration | Required (128-256 samples) |
| Framework | vLLM, transformers, AutoGPTQ |
**Pros:** Good quality at 4-bit, works with vLLM for multi-LoRA serving.
**Cons:** Requires a calibration step, slower to create.
GPTQ-Int4 is the best path for vLLM + LoRA — NVFP4 doesn't support LoRA adapters yet.
### AWQ
Activation-Aware Weight Quantization. Similar to GPTQ but protects important weights identified by analyzing activations.
| Property | Value |
|---|---|
| Typical bits | 4-bit |
| Quality | ~96% (slightly better than GPTQ on some benchmarks) |
| Calibration | Required |
| Framework | vLLM, transformers |
**Pros:** Slightly better quality than GPTQ at 4-bit.
**Cons:** Smaller ecosystem, fewer pre-quantized models available.
### EXL2 (ExLlama v2)
A flexible format that allows per-layer bit allocation — giving more bits to important layers and fewer to redundant ones.
| Property | Value |
|---|---|
| Typical bits | 2.5-6 (configurable per layer) |
| Quality | Best-in-class at any target size |
| Calibration | Required |
| Framework | ExLlama v2 only |
**Pros:** Best quality per bit, fine-grained control.
**Cons:** Limited to the ExLlama v2 runtime, no vLLM support.
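Per-layer bit allocation amounts to a size-weighted average: a few sensitive layers stay at 6-bit while the bulk drops lower, and the effective bits-per-weight lands in between. The layer names and bit assignments below are purely illustrative, not a real EXL2 measurement.

```python
# EXL2-style per-layer bit allocation, sketched: effective bits per
# weight is the parameter-weighted mean over layers. Layer names,
# sizes, and bit choices here are hypothetical.

layers = [
    # (name, parameter count, assigned bits)
    ("embeddings",  500e6, 6.0),
    ("attention",  2500e6, 5.0),
    ("mlp",        4500e6, 3.5),
    ("lm_head",     500e6, 6.0),
]

total_params = sum(n for _, n, _ in layers)
total_bits = sum(n * b for _, n, b in layers)
avg_bpw = total_bits / total_params
size_gb = total_bits / 8 / 1e9

print(f"effective {avg_bpw:.2f} bits/weight, ~{size_gb:.1f} GB")
```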
### NVFP4 (Blackwell FP4)
NVIDIA's 4-bit floating point format, native to Blackwell GPUs (RTX 5090, B100, B200).
| Property | Value |
|---|---|
| Bits | 4-bit floating point |
| Quality | 97.5% on MMLU, but 80-82% on hard reasoning |
| Calibration | Scale factors from first inference |
| Framework | vLLM only |
| Hardware | Blackwell GPUs only |
**Pros:** Uses dedicated FP4 tensor cores (not integer), hardware-native performance.
**Cons:** Degrades on hard reasoning, Blackwell only, no LoRA support yet.
Benchmark data from Qwen3-8B:
| Benchmark | BF16 | NVFP4 | Recovery |
|---|---|---|---|
| MMLU (general) | 74.97 | 73.07 | 97.5% |
| GSM8K (math) | 87.26 | 86.73 | 99.4% |
| MMLU-Pro (hard) | 34.64 | 27.49 | 79.4% |
| AIME24 (math olympiad) | 75.86 | 62.07 | 81.8% |
The drop on hard reasoning tasks is significant. For business applications that require accurate numerical reasoning (prices, calculations), this matters.
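The "Recovery" column above is just the quantized score expressed as a percentage of the BF16 baseline, which is easy to verify:

```python
# Recomputing the Recovery column: recovery = quantized score as a
# percentage of the BF16 baseline score, per benchmark.

scores = {
    # benchmark: (BF16, NVFP4)
    "MMLU":     (74.97, 73.07),
    "GSM8K":    (87.26, 86.73),
    "MMLU-Pro": (34.64, 27.49),
    "AIME24":   (75.86, 62.07),
}

for name, (bf16, nvfp4) in scores.items():
    recovery = nvfp4 / bf16 * 100
    print(f"{name}: {recovery:.1f}% recovery")
```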
## Head-to-Head: Speed vs Quality
All benchmarks on RTX 5090, Qwen3-8B, single user:
| Format | Size | Speed | Quality | Best for |
|---|---|---|---|---|
| Q8_0 (GGUF) | 8.5 GB | 120 tok/s | 99.5% | Maximum quality |
| Q6_K (GGUF) | 6.7 GB | 161 tok/s | 98% | Production sweet spot |
| Q4_K_M (GGUF) | 4.7 GB | 190 tok/s | 95% | VRAM-constrained |
| GPTQ-Int4 | 4.5 GB | 150 tok/s* | 96% | vLLM + LoRA |
| AWQ-Int4 | 4.5 GB | 155 tok/s* | 96% | vLLM alternative |
| NVFP4 | 6.4 GB | 68 tok/s* | 80-97% | Multi-user vLLM |
*vLLM speeds include framework overhead; aggregate throughput is higher with concurrent users.
## Quantization and Non-English Languages
A critical consideration for multilingual deployments: aggressive quantization hurts low-resource languages more.
LLMs allocate weight capacity proportional to training data volume. English dominates training corpora (40-60%), so English tokens get the most model capacity. For languages with less training data:
- 6-bit (Q6_K): Preserves 98-99% quality across languages
- 4-bit (Q4_K_M/GPTQ): Drops to 90-95% for non-English
- 4-bit (NVFP4): Can drop to 80-92% on hard reasoning in non-English
If you're serving customers in languages other than English, Q6_K is the safest choice. The extra 2 GB of VRAM is a small price for maintaining quality across all languages.
## Decision Matrix
| If you need... | Use |
|---|---|
| Maximum quality, have VRAM | Q8_0 GGUF |
| Production balance (speed + quality) | Q6_K GGUF |
| Fit on 8GB GPU | Q4_K_M GGUF |
| vLLM + multiple LoRA adapters | GPTQ-Int4 |
| Best quality per bit | EXL2 |
| Multi-user production on Blackwell | NVFP4 (vLLM) |
| Non-English language support | Q6_K GGUF (minimum) |
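The matrix can be encoded as a small helper function. The VRAM thresholds (18 GB for Q8_0, 10 GB for Q6_K) are illustrative guesses derived from the model sizes in this article, not hard rules.

```python
# The decision matrix above as a toy helper. Thresholds are
# illustrative, based on the 8B-model sizes quoted in the text.

def pick_format(vram_gb, need_lora=False, non_english=False,
                multi_user_blackwell=False):
    if need_lora:
        return "GPTQ-Int4"        # only 4-bit path with LoRA support
    if multi_user_blackwell and not non_english:
        return "NVFP4 (vLLM)"     # hardware-native FP4 throughput
    if vram_gb >= 18:
        return "Q8_0 GGUF"        # room for maximum quality
    if vram_gb >= 10 or non_english:
        return "Q6_K GGUF"        # production sweet spot
    return "Q4_K_M GGUF"          # VRAM-constrained fallback

print(pick_format(32))            # Q8_0 GGUF
print(pick_format(8))             # Q4_K_M GGUF
```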
## Practical Advice
- **Start with Q6_K GGUF.** It's the best all-around choice: excellent quality, fast inference, works everywhere.
- **Drop to Q4_K_M only if VRAM forces you to.** The 3% quality loss is noticeable in edge cases.
- **Use GPTQ-Int4 for multi-LoRA vLLM deployments.** It's the only 4-bit format with LoRA adapter support.
- **NVFP4 is for multi-user production only.** Don't use it for single-user chat; Ollama with Q6_K is about 2.4x faster.
- **Test with your actual data.** Benchmark numbers vary by model family, language, and task type. Always validate on your specific use case before deploying.