## What is Quantization?
Neural network weights are typically stored as 16-bit floating point numbers (FP16 or BF16). An 8-billion parameter model needs 16 GB of memory just for the weights.
Quantization reduces the precision of these numbers — from 16-bit down to 8, 6, 4, or even 2 bits — making models smaller and faster at the cost of some quality.
| Precision | Memory per 8B model | Speed impact | Quality impact |
|---|---|---|---|
| BF16 (baseline) | 16.4 GB | Baseline | Baseline |
| 8-bit (Q8_0) | 8.5 GB | ~1.5x faster | ~99.5% quality |
| 6-bit (Q6_K) | 6.7 GB | ~2x faster | ~98% quality |
| 4-bit (Q4_K_M) | 4.7 GB | ~2.5x faster | ~95% quality |
| 2-bit (IQ2_M) | 2.5 GB | ~3x faster | ~85% quality |
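The core mechanic can be sketched in a few lines of pure Python: map floats to n-bit signed integers with a single scale factor, then reconstruct and measure the error. This is a minimal illustration only; real formats like Q4_K_M and GPTQ use per-block scales and smarter rounding.

```python
# Minimal symmetric quantization sketch (illustrative, not a real format):
# quantize floats to n-bit signed integers with one shared scale factor,
# then dequantize and compare against the originals.

def quantize(weights, bits):
    qmax = 2 ** (bits - 1) - 1          # e.g. 127 for 8-bit, 7 for 4-bit
    scale = max(abs(w) for w in weights) / qmax
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

weights = [0.82, -0.33, 0.05, -0.91, 0.47]

for bits in (8, 4, 2):
    q, scale = quantize(weights, bits)
    restored = dequantize(q, scale)
    err = max(abs(w - r) for w, r in zip(weights, restored))
    print(f"{bits}-bit: max reconstruction error {err:.4f}")
```

The error grows as the bit width shrinks, which is exactly the quality column in the table above.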
The speedup comes from the memory wall — LLM inference is bottlenecked by how fast you can stream weights from VRAM, not by computation. Smaller weights = faster streaming.
## Why Inference is Memory-Bound
A modern GPU like the RTX 5090 has massive compute power:
- CUDA cores can consume: ~103 TB/s of weight data
- VRAM can deliver: 1.8 TB/s
- Gap: ~57x, so the cores sit idle ~98% of the time
During token generation, the entire model is read from memory for each token. Quantization reduces the amount of data to read, directly improving throughput:
| Format | Data to Stream | Time per Token | Max tok/s |
|---|---|---|---|
| BF16 | 16.4 GB | 9.1 ms | ~110 |
| Q6_K | 6.7 GB | 3.7 ms | ~270 |
| Q4_K_M | 4.7 GB | 2.6 ms | ~385 |
| NVFP4 | 6.4 GB | 3.6 ms | ~280 |
These are theoretical maximums. Real-world performance is lower due to overhead, but the relative ratios hold.
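The table's arithmetic is simple enough to reproduce: in the memory-bound regime, each generated token streams the whole weight file from VRAM, so the throughput ceiling is just bandwidth divided by model size. The sketch below uses the ~1.8 TB/s RTX 5090 figure from the text.

```python
# Back-of-envelope decode throughput for a memory-bound model:
# max tokens/s ~= memory bandwidth / model size, because every token
# reads the entire weight file from VRAM. Real throughput is lower.

BANDWIDTH_GB_S = 1800  # ~1.8 TB/s (RTX 5090)

def max_tokens_per_second(model_gb, bandwidth_gb_s=BANDWIDTH_GB_S):
    """Return (tokens/s ceiling, ms per token) for a given model size."""
    time_per_token_ms = model_gb / bandwidth_gb_s * 1000
    return bandwidth_gb_s / model_gb, time_per_token_ms

for fmt, size_gb in [("BF16", 16.4), ("Q6_K", 6.7), ("Q4_K_M", 4.7)]:
    tok_s, ms = max_tokens_per_second(size_gb)
    print(f"{fmt}: {ms:.1f} ms/token, ~{tok_s:.0f} tok/s ceiling")
```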
## The Formats

### GGUF (llama.cpp)
The most widely used format. GGUF uses mixed-precision integer quantization with multiple sub-formats:
| Sub-format | Bits | Quality | Use case |
|---|---|---|---|
| Q8_0 | 8-bit | Excellent (99.5%) | When you have VRAM to spare |
| Q6_K | 6-bit | Very good (98%) | Sweet spot for production |
| Q5_K_M | 5-bit | Good (96-97%) | Balanced |
| Q4_K_M | 4-bit | Acceptable (95%) | When VRAM is tight |
| IQ4_XS | 4-bit | Acceptable (94%) | Aggressive compression |
| IQ2_M | 2-bit | Poor (85%) | Experimental |
**Pros:** Universal compatibility (Ollama, llama.cpp, kobold.cpp), well-tested, many options.
**Cons:** Uses integer tensor cores, leaving FP4/FP8 cores unused on newer GPUs.
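The odd-looking 8.5 GB figure for Q8_0 falls out of its storage layout: llama.cpp's Q8_0 stores weights in blocks of 32 int8 values with one fp16 scale per block, so each weight costs slightly more than 8 bits.

```python
# Why a Q8_0 model of 8B params is 8.5 GB rather than 8.0 GB:
# each block of 32 int8 weights carries one fp16 scale factor.

BLOCK_SIZE = 32          # weights per Q8_0 block
VALUE_BITS = 8           # int8 per weight
SCALE_BITS = 16          # one fp16 scale per block

bits_per_weight = (BLOCK_SIZE * VALUE_BITS + SCALE_BITS) / BLOCK_SIZE
model_gb = 8e9 * bits_per_weight / 8 / 1e9

print(f"{bits_per_weight} bits/weight -> {model_gb:.1f} GB for 8B params")
# 8.5 bits/weight -> 8.5 GB for 8B params
```

The K-quants (Q6_K, Q4_K_M) use more elaborate nested-block schemes, but the same principle applies: their effective bits-per-weight is a little above the nominal bit width.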
### GPTQ
A calibration-based, post-training 4-bit integer format that uses a small sample dataset to minimize per-layer quantization error.
| Property | Value |
|---|---|
| Typical bits | 4-bit |
| Quality | ~95-96% (better than naive 4-bit) |
| Calibration | Required (128-256 samples) |
| Framework | vLLM, transformers, AutoGPTQ |
**Pros:** Good quality at 4-bit, works with vLLM for multi-LoRA serving.
**Cons:** Requires a calibration step, slower to create.
GPTQ-Int4 is the best path for vLLM + LoRA — NVFP4 doesn't support LoRA adapters yet.
### AWQ
Activation-Aware Weight Quantization. Similar to GPTQ but protects important weights identified by analyzing activations.
| Property | Value |
|---|---|
| Typical bits | 4-bit |
| Quality | ~96% (slightly better than GPTQ on some benchmarks) |
| Calibration | Required |
| Framework | vLLM, transformers |
**Pros:** Slightly better quality than GPTQ at 4-bit.
**Cons:** Smaller ecosystem, fewer pre-quantized models available.
### EXL2 (ExLlama v2)
A flexible format that allows per-layer bit allocation — giving more bits to important layers and fewer to redundant ones.
| Property | Value |
|---|---|
| Typical bits | 2.5-6 (configurable per layer) |
| Quality | Best-in-class at any target size |
| Calibration | Required |
| Framework | ExLlama v2 only |
**Pros:** Best quality per bit, fine-grained control.
**Cons:** Limited to the ExLlama v2 runtime, no vLLM support.
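Per-layer bit allocation amounts to a size-weighted average: a few sensitive layers stay at 6-bit while the bulk drops lower, and the effective bits-per-weight lands in between. The layer names and bit assignments below are purely illustrative, not a real EXL2 measurement.

```python
# EXL2-style per-layer bit allocation, sketched: effective bits per
# weight is the parameter-weighted mean over layers. Layer names,
# sizes, and bit choices here are hypothetical.

layers = [
    # (name, parameter count, assigned bits)
    ("embeddings",  500e6, 6.0),
    ("attention",  2500e6, 5.0),
    ("mlp",        4500e6, 3.5),
    ("lm_head",     500e6, 6.0),
]

total_params = sum(n for _, n, _ in layers)
total_bits = sum(n * b for _, n, b in layers)
avg_bpw = total_bits / total_params
size_gb = total_bits / 8 / 1e9

print(f"effective {avg_bpw:.2f} bits/weight, ~{size_gb:.1f} GB")
```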
### NVFP4 (Blackwell FP4)
NVIDIA's 4-bit floating point format, native to Blackwell GPUs (RTX 5090, B100, B200).
| Property | Value |
|---|---|
| Bits | 4-bit floating point |
| Quality | 97.5% on MMLU, but 80-82% on hard reasoning |
| Calibration | Scale factors from first inference |
| Framework | vLLM only |
| Hardware | Blackwell GPUs only |
**Pros:** Uses dedicated FP4 tensor cores (not integer), hardware-native performance.
**Cons:** Degrades on hard reasoning, Blackwell only, no LoRA support yet.
Benchmark data from Qwen3-8B:
| Benchmark | BF16 | NVFP4 | Recovery |
|---|---|---|---|
| MMLU (general) | 74.97 | 73.07 | 97.5% |
| GSM8K (math) | 87.26 | 86.73 | 99.4% |
| MMLU-Pro (hard) | 34.64 | 27.49 | 79.4% |
| AIME24 (math olympiad) | 75.86 | 62.07 | 81.8% |
The drop on hard reasoning tasks is significant. For business applications that require accurate numerical reasoning (prices, calculations), this matters.
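The "Recovery" column above is just the quantized score expressed as a percentage of the BF16 baseline, which is easy to verify:

```python
# Recomputing the Recovery column: recovery = quantized score as a
# percentage of the BF16 baseline score, per benchmark.

scores = {
    # benchmark: (BF16, NVFP4)
    "MMLU":     (74.97, 73.07),
    "GSM8K":    (87.26, 86.73),
    "MMLU-Pro": (34.64, 27.49),
    "AIME24":   (75.86, 62.07),
}

for name, (bf16, nvfp4) in scores.items():
    recovery = nvfp4 / bf16 * 100
    print(f"{name}: {recovery:.1f}% recovery")
```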
## Head-to-Head: Speed vs Quality
All benchmarks on RTX 5090, Qwen3-8B, single user:
| Format | Size | Speed | Quality | Best for |
|---|---|---|---|---|
| Q8_0 (GGUF) | 8.5 GB | 120 tok/s | 99.5% | Maximum quality |
| Q6_K (GGUF) | 6.7 GB | 161 tok/s | 98% | Production sweet spot |
| Q4_K_M (GGUF) | 4.7 GB | 190 tok/s | 95% | VRAM-constrained |
| GPTQ-Int4 | 4.5 GB | 150 tok/s* | 96% | vLLM + LoRA |
| AWQ-Int4 | 4.5 GB | 155 tok/s* | 96% | vLLM alternative |
| NVFP4 | 6.4 GB | 68 tok/s* | 80-97% | Multi-user vLLM |
*vLLM speeds include framework overhead; aggregate throughput is higher with concurrent users.
## Quantization and Non-English Languages
A critical consideration for multilingual deployments: aggressive quantization hurts low-resource languages more.
LLMs allocate weight capacity proportional to training data volume. English dominates training corpora (40-60%), so English tokens get the most model capacity. For languages with less training data:
- 6-bit (Q6_K): Preserves 98-99% quality across languages
- 4-bit (Q4_K_M/GPTQ): Drops to 90-95% for non-English
- 4-bit (NVFP4): Can drop to 80-92% on hard reasoning in non-English
If you're serving customers in languages other than English, Q6_K is the safest choice. The extra 2 GB of VRAM is a small price for maintaining quality across all languages.
## Decision Matrix
| If you need... | Use |
|---|---|
| Maximum quality, have VRAM | Q8_0 GGUF |
| Production balance (speed + quality) | Q6_K GGUF |
| Fit on 8GB GPU | Q4_K_M GGUF |
| vLLM + multiple LoRA adapters | GPTQ-Int4 |
| Best quality per bit | EXL2 |
| Multi-user production on Blackwell | NVFP4 (vLLM) |
| Non-English language support | Q6_K GGUF (minimum) |
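The matrix can be encoded as a small helper function. The VRAM thresholds (18 GB for Q8_0, 10 GB for Q6_K) are illustrative guesses derived from the model sizes in this article, not hard rules.

```python
# The decision matrix above as a toy helper. Thresholds are
# illustrative, based on the 8B-model sizes quoted in the text.

def pick_format(vram_gb, need_lora=False, non_english=False,
                multi_user_blackwell=False):
    if need_lora:
        return "GPTQ-Int4"        # only 4-bit path with LoRA support
    if multi_user_blackwell and not non_english:
        return "NVFP4 (vLLM)"     # hardware-native FP4 throughput
    if vram_gb >= 18:
        return "Q8_0 GGUF"        # room for maximum quality
    if vram_gb >= 10 or non_english:
        return "Q6_K GGUF"        # production sweet spot
    return "Q4_K_M GGUF"          # VRAM-constrained fallback

print(pick_format(32))            # Q8_0 GGUF
print(pick_format(8))             # Q4_K_M GGUF
```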
## Practical Advice
- **Start with Q6_K GGUF.** It's the best all-around choice: excellent quality, fast inference, works everywhere.
- **Drop to Q4_K_M only if VRAM forces you to.** The 3% quality loss is noticeable in edge cases.
- **Use GPTQ-Int4 for multi-LoRA vLLM deployments.** It's the only 4-bit format with LoRA adapter support.
- **NVFP4 is for multi-user production only.** Don't use it for single-user chat; Ollama with Q6_K is about 2.4x faster.
- **Test with your actual data.** Benchmark numbers vary by model family, language, and task type. Always validate on your specific use case before deploying.