
Quantization Methods Compared: GGUF, AWQ, GPTQ, EXL2, NVFP4

ai.rs Feb 5, 2026

What is Quantization?

Neural network weights are typically stored as 16-bit floating point numbers (FP16 or BF16). An 8-billion parameter model needs 16 GB of memory just for the weights.

Quantization reduces the precision of these numbers — from 16-bit down to 8, 6, 4, or even 2 bits — making models smaller and faster at the cost of some quality.

| Precision | Memory per 8B model | Speed impact | Quality impact |
|---|---|---|---|
| BF16 (baseline) | 16.4 GB | Baseline | Baseline |
| 8-bit (Q8_0) | 8.5 GB | ~1.5x faster | ~99.5% quality |
| 6-bit (Q6_K) | 6.7 GB | ~2x faster | ~98% quality |
| 4-bit (Q4_K_M) | 4.7 GB | ~2.5x faster | ~95% quality |
| 2-bit (IQ2_M) | 2.5 GB | ~3x faster | ~85% quality |
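These figures follow from parameter count times effective bits per weight. A rough sketch in Python — the ~8.2B parameter count and the effective bit-widths are approximations for a Qwen3-8B-class model; GGUF "K" quants mix precisions across tensors, so their effective bpw is fractional, and real files carry small extra overheads the sketch ignores:

```python
# Approximate weight memory: parameters * effective bits per weight.
PARAMS = 8.2e9  # Qwen3-8B-class model (approximate)

FORMATS_BPW = {
    "BF16": 16.0,
    "Q8_0": 8.5,    # 8-bit weights plus per-block scale factors
    "Q6_K": 6.56,   # mixed 6-bit K-quant, approximate effective bpw
    "Q4_K_M": 4.85, # mixed 4-bit K-quant, approximate effective bpw
    "IQ2_M": 2.7,   # 2-bit importance-matrix quant, approximate
}

def footprint_gb(params: float, bpw: float) -> float:
    """Weight memory in GB (1 GB = 1e9 bytes)."""
    return params * bpw / 8 / 1e9

for name, bpw in FORMATS_BPW.items():
    print(f"{name:7s} {footprint_gb(PARAMS, bpw):5.1f} GB")
```

The results land close to, but not exactly on, the table's figures, since published file sizes include metadata and per-model variation.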

The speedup comes from the memory wall — LLM inference is bottlenecked by how fast you can stream weights from VRAM, not by computation. Smaller weights = faster streaming.

Why Inference is Memory-Bound

A modern GPU like the RTX 5090 has massive compute power:

CUDA cores want:    103 TB/s of weight data
VRAM delivers:        1.8 TB/s
Gap:                  57x — cores idle 98% of the time
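As a back-of-envelope check, the gap and the idle fraction fall straight out of the two bandwidth figures above:

```python
# Compute/bandwidth gap from the figures above.
demanded_tbs = 103.0    # bandwidth the CUDA cores could consume (TB/s)
delivered_tbs = 1.8     # RTX 5090 VRAM bandwidth (TB/s)

gap = demanded_tbs / delivered_tbs            # ~57x
busy_fraction = delivered_tbs / demanded_tbs  # ~1.7%, i.e. idle ~98%

print(f"gap: {gap:.0f}x, cores busy ~{busy_fraction:.1%} of the time")
```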

During token generation, the entire model is read from memory for each token. Quantization reduces the amount of data to read, directly improving throughput:

| Format | Data to Stream | Time per Token | Max tok/s |
|---|---|---|---|
| BF16 | 16.4 GB | 9.1 ms | ~110 |
| Q6_K | 6.7 GB | 3.7 ms | ~270 |
| Q4_K_M | 4.7 GB | 2.6 ms | ~380 |
| NVFP4 | 6.4 GB | 3.6 ms | ~280 |

These are theoretical maximums. Real-world performance is lower due to overhead, but the relative ratios hold.
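These ceilings come from a one-line roofline model: each decoded token streams the full weight set once, so time per token is roughly bytes to read divided by bandwidth, and throughput is its inverse. A sketch (sizes in GB as quoted in this article; 1 GB/s = 1e9 bytes/s):

```python
# Decode-throughput ceiling: one full pass over the weights per token.
BANDWIDTH_GBS = 1800.0  # RTX 5090 VRAM bandwidth in GB/s

def ceiling_tok_per_sec(model_gb: float) -> float:
    seconds_per_token = model_gb / BANDWIDTH_GBS
    return 1.0 / seconds_per_token

for fmt, gb in [("BF16", 16.4), ("Q6_K", 6.7), ("NVFP4", 6.4)]:
    print(f"{fmt:6s} {ceiling_tok_per_sec(gb):4.0f} tok/s ceiling")
```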

The Formats

GGUF (llama.cpp)

The most widely used format. GGUF uses mixed-precision integer quantization with multiple sub-formats:

| Sub-format | Bits | Quality | Use case |
|---|---|---|---|
| Q8_0 | 8-bit | Excellent (99.5%) | When you have VRAM to spare |
| Q6_K | 6-bit | Very good (98%) | Sweet spot for production |
| Q5_K_M | 5-bit | Good (96-97%) | Balanced |
| Q4_K_M | 4-bit | Acceptable (95%) | When VRAM is tight |
| IQ4_XS | 4-bit | Acceptable (94%) | Aggressive compression |
| IQ2_M | 2-bit | Poor (85%) | Experimental |

Pros: Universal compatibility (Ollama, llama.cpp, kobold.cpp), well-tested, many options
Cons: Uses integer tensor cores, leaving FP4/FP8 cores unused on newer GPUs

GPTQ

A calibration-based 4-bit integer format, introduced in the paper "GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers", that uses sample data to minimize quantization error.

| Property | Value |
|---|---|
| Typical bits | 4-bit |
| Quality | ~95-96% (better than naive 4-bit) |
| Calibration | Required (128-256 samples) |
| Framework | vLLM, transformers, AutoGPTQ |

Pros: Good quality at 4-bit, works with vLLM for multi-LoRA
Cons: Requires calibration step, slower to create

GPTQ-Int4 is the best path for vLLM + LoRA — NVFP4 doesn't support LoRA adapters yet.

AWQ

Activation-Aware Weight Quantization. Similar to GPTQ but protects important weights identified by analyzing activations.

| Property | Value |
|---|---|
| Typical bits | 4-bit |
| Quality | ~96% (slightly better than GPTQ on some benchmarks) |
| Calibration | Required |
| Framework | vLLM, transformers |

Pros: Slightly better quality than GPTQ at 4-bit
Cons: Smaller ecosystem, fewer pre-quantized models available

EXL2 (ExLlama v2)

A flexible format that allows per-layer bit allocation — giving more bits to important layers and fewer to redundant ones.

| Property | Value |
|---|---|
| Typical bits | 2.5-6 (configurable per layer) |
| Quality | Best-in-class at any target size |
| Calibration | Required |
| Framework | ExLlama v2 only |

Pros: Best quality per bit, fine-grained control
Cons: Limited to ExLlama v2 runtime, no vLLM support

NVFP4 (Blackwell FP4)

NVIDIA's 4-bit floating point format, native to Blackwell GPUs (RTX 5090, B100, B200).

| Property | Value |
|---|---|
| Bits | 4-bit floating point |
| Quality | 97.5% on MMLU, but 80-82% on hard reasoning |
| Calibration | Scale factors from first inference |
| Framework | vLLM only |
| Hardware | Blackwell GPUs only |

Pros: Uses dedicated FP4 tensor cores (not integer), hardware-native performance
Cons: Hard reasoning degradation, Blackwell only, no LoRA support yet

Benchmark data from Qwen3-8B:

| Benchmark | BF16 | NVFP4 | Recovery |
|---|---|---|---|
| MMLU (general) | 74.97 | 73.07 | 97.5% |
| GSM8K (math) | 87.26 | 86.73 | 99.4% |
| MMLU-Pro (hard) | 34.64 | 27.49 | 79.4% |
| AIME24 (math olympiad) | 75.86 | 62.07 | 81.8% |

The drop on hard reasoning tasks is significant. For business applications that require accurate numerical reasoning (prices, calculations), this matters.
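The recovery column is simply the NVFP4 score divided by the BF16 baseline:

```python
# Recovery = NVFP4 score / BF16 baseline, using the Qwen3-8B numbers above.
benchmarks = {
    "MMLU":     (74.97, 73.07),
    "GSM8K":    (87.26, 86.73),
    "MMLU-Pro": (34.64, 27.49),
    "AIME24":   (75.86, 62.07),
}

for name, (bf16, nvfp4) in benchmarks.items():
    print(f"{name:9s} recovery {nvfp4 / bf16:.1%}")  # e.g. MMLU-Pro -> 79.4%
```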

Head-to-Head: Speed vs Quality

All benchmarks on RTX 5090, Qwen3-8B, single user:

| Format | Size | Speed | Quality | Best for |
|---|---|---|---|---|
| Q8_0 (GGUF) | 8.5 GB | 120 tok/s | 99.5% | Maximum quality |
| Q6_K (GGUF) | 6.7 GB | 161 tok/s | 98% | Production sweet spot |
| Q4_K_M (GGUF) | 4.7 GB | 190 tok/s | 95% | VRAM-constrained |
| GPTQ-Int4 | 4.5 GB | 150 tok/s* | 96% | vLLM + LoRA |
| AWQ-Int4 | 4.5 GB | 155 tok/s* | 96% | vLLM alternative |
| NVFP4 | 6.4 GB | 68 tok/s* | 80-97% | Multi-user vLLM |

*vLLM speeds include framework overhead; aggregate throughput is higher with concurrent users.

Quantization and Non-English Languages

A critical consideration for multilingual deployments: aggressive quantization hurts low-resource languages more.

LLMs allocate weight capacity proportional to training data volume. English dominates training corpora (40-60%), so English tokens get the most model capacity. For languages with less training data:

  • 6-bit (Q6_K): Preserves 98-99% quality across languages
  • 4-bit (Q4_K_M/GPTQ): Drops to 90-95% for non-English
  • 4-bit (NVFP4): Can drop to 80-92% on hard reasoning in non-English

If you're serving customers in languages other than English, Q6_K is the safest choice. The extra 2 GB of VRAM is a small price for maintaining quality across all languages.

Decision Matrix

| If you need... | Use |
|---|---|
| Maximum quality, have VRAM | Q8_0 GGUF |
| Production balance (speed + quality) | Q6_K GGUF |
| Fit on 8GB GPU | Q4_K_M GGUF |
| vLLM + multiple LoRA adapters | GPTQ-Int4 |
| Best quality per bit | EXL2 |
| Multi-user production on Blackwell | NVFP4 (vLLM) |
| Non-English language support | Q6_K GGUF (minimum) |

Practical Advice

  1. Start with Q6_K GGUF. It's the best all-around choice: excellent quality, fast inference, works everywhere.

  2. Drop to Q4_K_M only if VRAM forces you to. The 3% quality loss is noticeable in edge cases.

  3. Use GPTQ-Int4 for multi-LoRA vLLM deployments. It's the only 4-bit format with LoRA adapter support.

  4. NVFP4 is for multi-user production only. Don't use it for single-user chat — Ollama with Q6_K is 2.4x faster.

  5. Test with your actual data. Benchmark numbers vary by model family, language, and task type. Always validate on your specific use case before deploying.

