
Quantization Methods Compared: GGUF, AWQ, GPTQ, EXL2, NVFP4

ai.rs Feb 5, 2026

What is Quantization?

Neural network weights are typically stored as 16-bit floating point numbers (FP16 or BF16). An 8-billion parameter model needs 16 GB of memory just for the weights.

Quantization reduces the precision of these numbers — from 16-bit down to 8, 6, 4, or even 2 bits — making models smaller and faster at the cost of some quality.
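The core operation is just rounding onto a coarser grid. A minimal sketch of symmetric 8-bit quantization (a toy for illustration, not any specific format):

```python
import numpy as np

# Toy symmetric 8-bit quantization (illustrative, not a specific format):
# store weights as int8 plus one float scale per tensor.
def quantize_int8(w: np.ndarray):
    scale = float(np.abs(w).max()) / 127.0  # map the largest magnitude to 127
    q = np.round(w / scale).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=4096).astype(np.float32)  # fake weight tensor

q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

print(q.nbytes / w.nbytes)                 # 0.25: int8 vs float32 storage
print(float(np.abs(w - w_hat).max()))      # reconstruction error, at most ~scale/2
```

Every real format below refines this basic recipe: finer-grained scales, calibration data to pick them, or a floating-point grid instead of integers.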

Precision         Memory per 8B model   Speed impact   Quality impact
BF16 (baseline)   16.4 GB               Baseline       Baseline
8-bit (Q8_0)      8.5 GB                ~1.5x faster   ~99.5%
6-bit (Q6_K)      6.7 GB                ~2x faster     ~98%
4-bit (Q4_K_M)    4.7 GB                ~2.5x faster   ~95%
2-bit (IQ2_M)     2.5 GB                ~3x faster     ~85%

The speedup comes from the memory wall — LLM inference is bottlenecked by how fast you can stream weights from VRAM, not by computation. Smaller weights = faster streaming.

Why Inference is Memory-Bound

A modern GPU like the RTX 5090 has massive compute power:

CUDA cores want:    103 TB/s of weight data
VRAM delivers:        1.8 TB/s
Gap:                  57x — cores idle 98% of the time

During token generation, the entire model is read from memory for each token. Quantization reduces the amount of data to read, directly improving throughput:

Format    Data to stream   Time per token   Max tok/s
BF16      16.4 GB          9.1 ms           ~110
Q6_K      6.7 GB           3.7 ms           ~270
Q4_K_M    4.5 GB           2.5 ms           ~400
NVFP4     6.4 GB           3.6 ms           ~280

These are theoretical maximums. Real-world performance is lower due to overhead, but the relative ratios hold.
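The table's figures fall out of a single division. A quick sketch of the arithmetic, using the 1.8 TB/s bandwidth figure from the section above:

```python
# Back-of-the-envelope check of the table above: in memory-bound decoding,
# each generated token streams the full weight file once, so
#   time_per_token ~= bytes_to_stream / memory_bandwidth
BANDWIDTH = 1.8e12  # RTX 5090 VRAM bandwidth in bytes/s (from the section above)

sizes_gb = {"BF16": 16.4, "Q6_K": 6.7, "Q4_K_M": 4.5, "NVFP4": 6.4}

for name, gb in sizes_gb.items():
    seconds = gb * 1e9 / BANDWIDTH        # seconds per token
    print(f"{name:7s} {seconds * 1e3:.1f} ms/token, ~{1 / seconds:.0f} tok/s")
```

Halving the bytes streamed halves the time per token, which is why format size maps so directly onto throughput.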

The Formats

GGUF (llama.cpp)

The most widely used format. GGUF uses mixed-precision integer quantization with multiple sub-formats:

Sub-format   Bits    Quality             Use case
Q8_0         8-bit   Excellent (99.5%)   When you have VRAM to spare
Q6_K         6-bit   Very good (98%)     Sweet spot for production
Q5_K_M       5-bit   Good (96-97%)       Balanced
Q4_K_M       4-bit   Acceptable (95%)    When VRAM is tight
IQ4_XS       4-bit   Acceptable (94%)    Aggressive compression
IQ2_M        2-bit   Poor (85%)          Experimental

Pros: Universal compatibility (Ollama, llama.cpp, kobold.cpp), well-tested, many options
Cons: Uses integer operations, leaving the FP4/FP8 tensor cores on newer GPUs unused
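The "K" sub-formats store weights in small blocks that share scale factors. A simplified sketch of the core idea (the real Q4_K_M also stores per-block minimums and nests blocks into super-blocks; group size 32 here is illustrative):

```python
import numpy as np

# Simplified block-wise 4-bit quantization in the spirit of GGUF's K-quants:
# low-bit integers per weight plus one scale per small group. (Real Q4_K_M
# adds per-block minimums and super-block structure on top of this.)
GROUP = 32

def quantize_blockwise(w: np.ndarray):
    groups = w.reshape(-1, GROUP)
    scales = np.abs(groups).max(axis=1, keepdims=True) / 7.0  # symmetric int4: -7..7
    q = np.round(groups / scales).astype(np.int8)
    return q, scales

def dequantize_blockwise(q, scales):
    return (q.astype(np.float32) * scales).reshape(-1)

rng = np.random.default_rng(1)
w = rng.normal(0.0, 0.02, size=4096).astype(np.float32)
q, scales = quantize_blockwise(w)
w_hat = dequantize_blockwise(q, scales)

# Effective size: 4 bits per weight + one fp16 scale per 32 weights = 4.5 bits/weight
print(4 + 16 / GROUP)
print(float(np.square(w - w_hat).mean() / np.square(w).mean()))  # relative MSE
```

Per-group scales are what keep a single outlier weight from blowing up the precision of the whole tensor, at the cost of a fraction of a bit per weight in metadata.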

GPTQ

Short for GPT Quantization: a calibration-based 4-bit integer format that uses sample data to minimize quantization error.

Property       Value
Typical bits   4-bit
Quality        ~95-96% (better than naive 4-bit)
Calibration    Required (128-256 samples)
Frameworks     vLLM, transformers, AutoGPTQ

Pros: Good quality at 4-bit, works with vLLM for multi-LoRA serving
Cons: Requires a calibration step, slower to create

GPTQ-Int4 is the best path for vLLM + LoRA — NVFP4 doesn't support LoRA adapters yet.
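To see why calibration helps at all, here is a toy (emphatically not the real GPTQ algorithm, which applies Hessian-weighted per-column updates): instead of minimizing weight error, search for the quantization scale that minimizes the layer's output error on sample inputs:

```python
import numpy as np

# Toy calibration-based scale search (NOT the actual GPTQ algorithm). The
# point it illustrates: the scale that minimizes *output* error on real
# data is often not the naive round-to-nearest scale for *weight* error.
rng = np.random.default_rng(2)
w = rng.normal(0.0, 0.02, size=64).astype(np.float32)
X = rng.normal(0.0, 1.0, size=(128, 64)).astype(np.float32)  # 128 calibration samples

def output_error(scale: float) -> float:
    q = np.clip(np.round(w / scale), -8, 7) * scale  # 4-bit round trip
    return float(np.square(X @ w - X @ q).mean())

naive = float(np.abs(w).max()) / 7.0
candidates = [naive * t for t in np.linspace(0.5, 1.0, 50)]  # includes naive itself
best = min(candidates, key=output_error)

print(output_error(naive) >= output_error(best))  # True: calibrated is never worse
```

Slightly shrinking the scale clips rare outlier weights but gives every other weight a finer grid, which is frequently a net win on the calibration set.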

AWQ

Activation-Aware Weight Quantization. Similar to GPTQ but protects important weights identified by analyzing activations.

Property       Value
Typical bits   4-bit
Quality        ~96% (slightly better than GPTQ on some benchmarks)
Calibration    Required
Frameworks     vLLM, transformers

Pros: Slightly better quality than GPTQ at 4-bit
Cons: Smaller ecosystem, fewer pre-quantized models available
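The observation that motivates AWQ can be reproduced in a few lines: a small fraction of weight channels sees much larger activations, and protecting just those channels removes most of the output error. (AWQ itself achieves the protection with per-channel scaling rather than the mixed precision used in this sketch, to stay hardware-friendly.)

```python
import numpy as np

# AWQ's motivating observation, in miniature: a few input channels carry
# much larger activations, so quantization error on their weight columns
# dominates the layer's output error. (Real AWQ protects these channels
# via per-channel scales found on calibration data, not mixed precision.)
rng = np.random.default_rng(3)
W = rng.normal(0.0, 0.02, size=(32, 64)).astype(np.float32)  # out x in
X = rng.normal(0.0, 1.0, size=(256, 64)).astype(np.float32)  # calibration inputs
X[:, :4] *= 10.0                                             # 4 "salient" channels

def q4(w):  # plain symmetric 4-bit round trip
    s = np.abs(w).max() / 7.0
    return np.clip(np.round(w / s), -8, 7) * s

salient = np.argsort(np.abs(X).mean(axis=0))[-4:]  # found by looking at activations
W_q = q4(W)
W_mixed = W_q.copy()
W_mixed[:, salient] = W[:, salient]                # keep salient columns exact

err_plain = float(np.square(X @ W.T - X @ W_q.T).mean())
err_mixed = float(np.square(X @ W.T - X @ W_mixed.T).mean())
print(err_plain / err_mixed)  # protecting ~6% of weights cuts error severalfold
```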

EXL2 (ExLlama v2)

A flexible format that allows per-layer bit allocation — giving more bits to important layers and fewer to redundant ones.

Property       Value
Typical bits   2.5-6 (configurable per layer)
Quality        Best-in-class at any target size
Calibration    Required
Framework      ExLlama v2 only

Pros: Best quality per bit, fine-grained control
Cons: Limited to the ExLlama v2 runtime, no vLLM support
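The allocation idea can be sketched as a greedy search: spend a fixed bit budget where an extra bit buys the largest error reduction. The sensitivity numbers below are invented for illustration; ExLlama v2 measures real per-layer error with calibration data, and its cost model is more elaborate.

```python
# Sketch of per-layer bit allocation in the spirit of EXL2 (simplified).
# Sensitivities are made-up illustrative numbers; the real tool measures
# per-layer quantization error on calibration data.
sensitivity = {"attn.q": 3.0, "attn.k": 1.0, "attn.v": 4.0,
               "mlp.up": 2.0, "mlp.down": 5.0}
bits = {name: 2 for name in sensitivity}   # floor: 2 bits everywhere
budget = round(4.0 * len(bits))            # target average: 4.0 bits/weight

while sum(bits.values()) < budget:
    # Crude marginal-gain model: each extra bit halves a layer's error.
    gain = {n: sensitivity[n] / 2 ** bits[n] for n in bits if bits[n] < 8}
    bits[max(gain, key=gain.get)] += 1

print(bits)                                # sensitive layers end up with more bits
print(sum(bits.values()) / len(bits))      # 4.0 average, by construction
```

This is why an "EXL2 4.0 bpw" file is not the same as uniform 4-bit: the average is 4.0, but individual layers sit anywhere between the floor and the ceiling.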

NVFP4 (Blackwell FP4)

NVIDIA's 4-bit floating point format, native to Blackwell GPUs (RTX 5090, B100, B200).

Property      Value
Bits          4-bit floating point
Quality       97.5% on MMLU, but 80-82% on hard reasoning
Calibration   Scale factors from first inference
Framework     vLLM only
Hardware      Blackwell GPUs only

Pros: Uses dedicated FP4 tensor cores (not integer paths), hardware-native performance
Cons: Degrades on hard reasoning, Blackwell-only, no LoRA support yet
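FP4 (E2M1) has only 16 representable values, which makes its behavior easy to inspect. A sketch assuming the standard E2M1 value set and NVFP4-style per-block scaling (NVIDIA's scheme uses 16-value blocks with an FP8 scale; the scale is kept in full precision here for simplicity):

```python
import numpy as np

# The 8 non-negative E2M1 magnitudes; FP4 is a sign bit plus these.
FP4_MAGS = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
FP4_GRID = np.concatenate([-FP4_MAGS[::-1], FP4_MAGS])  # 16 signed values

def quantize_fp4_block(block: np.ndarray) -> np.ndarray:
    # NVFP4 stores one scale per small block (16 values, FP8 scale on
    # Blackwell; a full-precision float here for simplicity).
    scale = np.abs(block).max() / 6.0  # map the block maximum onto +-6
    nearest = np.abs(block / scale - FP4_GRID[:, None]).argmin(axis=0)
    return FP4_GRID[nearest] * scale

x = np.array([0.01, -0.3, 0.7, 1.2, -2.0, 0.0, 0.05, 3.1])
print(quantize_fp4_block(x))
```

Note the grid is non-uniform: steps of 0.5 near zero but a jump from 4 to 6 at the top, so values land on a coarse grid exactly where outliers live. That asymmetry is one plausible contributor to the hard-reasoning losses below.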

Benchmark data from Qwen3-8B:

Benchmark                BF16    NVFP4   Recovery
MMLU (general)           74.97   73.07   97.5%
GSM8K (math)             87.26   86.73   99.4%
MMLU-Pro (hard)          34.64   27.49   79.4%
AIME24 (math olympiad)   75.86   62.07   81.8%

The drop on hard reasoning tasks is significant. For business applications that require accurate numerical reasoning (prices, calculations), this matters.

Head-to-Head: Speed vs Quality

All benchmarks on RTX 5090, Qwen3-8B, single user:

Format          Size     Speed        Quality   Best for
Q8_0 (GGUF)     8.5 GB   120 tok/s    99.5%     Maximum quality
Q6_K (GGUF)     6.7 GB   161 tok/s    98%       Production sweet spot
Q4_K_M (GGUF)   4.7 GB   190 tok/s    95%       VRAM-constrained
GPTQ-Int4       4.5 GB   150 tok/s*   96%       vLLM + LoRA
AWQ-Int4        4.5 GB   155 tok/s*   96%       vLLM alternative
NVFP4           6.4 GB   68 tok/s*    80-97%    Multi-user vLLM

*vLLM speeds include framework overhead; aggregate throughput is higher with concurrent users.

Quantization and Non-English Languages

A critical consideration for multilingual deployments: aggressive quantization hurts low-resource languages more.

LLMs allocate weight capacity proportional to training data volume. English dominates training corpora (40-60%), so English tokens get the most model capacity. For languages with less training data:

  • 6-bit (Q6_K): Preserves 98-99% quality across languages
  • 4-bit (Q4_K_M/GPTQ): Drops to 90-95% for non-English
  • 4-bit (NVFP4): Can drop to 80-92% on hard reasoning in non-English

If you're serving customers in languages other than English, Q6_K is the safest choice. The extra 2 GB of VRAM is a small price for maintaining quality across all languages.

Decision Matrix

If you need...                         Use
Maximum quality, have VRAM             Q8_0 GGUF
Production balance (speed + quality)   Q6_K GGUF
Fit on an 8 GB GPU                     Q4_K_M GGUF
vLLM + multiple LoRA adapters          GPTQ-Int4
Best quality per bit                   EXL2
Multi-user production on Blackwell     NVFP4 (vLLM)
Non-English language support           Q6_K GGUF (minimum)
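The VRAM rows of the matrix follow from simple arithmetic. A rough fit-check sketch, where the effective bits-per-weight figures (which include scale metadata) and the 1.5 GB headroom for KV cache and runtime are ballpark assumptions, not measured values:

```python
# Rough VRAM fit check behind the matrix above. Effective bits/weight and
# the 1.5 GB KV-cache/runtime headroom are ballpark assumptions.
PARAMS = 8e9        # 8B model, as in the tables above
OVERHEAD_GB = 1.5   # KV cache + CUDA context; grows with context length

def fits(bits_per_weight: float, vram_gb: float) -> bool:
    weights_gb = PARAMS * bits_per_weight / 8 / 1e9
    return weights_gb + OVERHEAD_GB <= vram_gb

for name, bpw in [("Q8_0", 8.5), ("Q6_K", 6.6), ("Q4_K_M", 4.8)]:
    print(f"{name:7s} 8 GB: {fits(bpw, 8)}   12 GB: {fits(bpw, 12)}")
```

With these assumptions only Q4_K_M clears an 8 GB card, while all three fit in 12 GB, which matches the matrix.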

Practical Advice

  1. Start with Q6_K GGUF. It's the best all-around choice: excellent quality, fast inference, works everywhere.

  2. Drop to Q4_K_M only if VRAM forces you to. The 3% quality loss is noticeable in edge cases.

  3. Use GPTQ-Int4 for multi-LoRA vLLM deployments. It's the only 4-bit format with LoRA adapter support.

  4. NVFP4 is for multi-user production only. Don't use it for single-user chat — Ollama with Q6_K is 2.4x faster.

  5. Test with your actual data. Benchmark numbers vary by model family, language, and task type. Always validate on your specific use case before deploying.

Deploying AI for your business?

Inference, GPUs, and quantization choices look different in production. See where your business is on the readiness curve.

Take the AI Readiness Check
