Infrastructure 8 min read

4-Bit Quantization Decoded: INT4 QAT, MXFP4, and NVFP4

ai.rs Jul 3, 2026

quantization fp4 nvfp4 mxfp4 int4 blackwell

Open models now ship in 4-bit by default — it's how a 400B-parameter MoE fits on a workstation at all. But "4-bit" hides three different things you keep seeing in model cards: native INT4 (QAT), MXFP4, and NVFP4. They are not the same, and the difference shows up in both accuracy and which hardware runs them fast. Here's the decoder. (For the wider menu — GGUF, AWQ, GPTQ, EXL2 — see Quantization Methods Compared.)

Why 4-bit at all

A weight stored in FP16 takes 16 bits; in 4-bit it takes 4 — 4× smaller. That's often the difference between a model fitting in your VRAM or not (the VRAM math). And on NVIDIA's Blackwell GPUs (RTX 5090, GB10 / HP ZGX, B200), the tensor cores run 4-bit floats at 2× the throughput of FP8 — so 4-bit is faster, not just smaller.

The risk is accuracy: squeezing a 16-bit number into 4 bits loses information. The formats below are all different answers to "how do we keep the accuracy while spending only 4 bits?"

INT4 — and why QAT matters

INT4 stores each weight as a 4-bit integer (16 possible values). How you get there matters:

Post-training quantization (PTQ) — methods like GPTQ and AWQ take a finished FP16 model and round its weights to INT4 afterward. Fast, no retraining, slight accuracy loss. Great for quantizing any model yourself.
Quantization-aware training (QAT) — the model is trained with the 4-bit rounding in the loop, so it learns weights that survive quantization. The result is a native INT4 checkpoint with near-FP accuracy. Kimi K2.6 (and Kimi-K2-Thinking before it) ship exactly this — it's why their 4-bit release isn't a lossy afterthought.

Rule of thumb: if the author shipped a native QAT INT4 build, use it — cheapest high-quality option. If not, PTQ (AWQ/GPTQ) is your DIY route.

FP4 — 4-bit floating point (E2M1)

Integers space their 16 values evenly; floating-point spaces them to handle both large and small magnitudes — which is what neural-net weights actually look like. FP4 packs a tiny float into 4 bits as E2M1: 1 sign bit, 2 exponent bits, 1 mantissa bit.

Four bits can't span a layer's full numeric range alone, so FP4 formats attach a scale to a small block of weights. Two standards do this differently:

MXFP4 (OCP Microscaling)

Groups weights into blocks of 32, each sharing one E8M0 scale (an 8-bit power-of-two). An open OCP standard with low overhead — gpt-oss ships in MXFP4.

NVFP4 (NVIDIA Blackwell)

Groups weights into blocks of 16, each sharing an FP8 (E4M3) scale, plus a per-tensor FP32 global scale. Smaller blocks and a higher-precision scale mean each block fits its data better — so NVFP4 lands lower perplexity / higher accuracy than MXFP4 on the same model, at the cost of ~2× the scale overhead. It's Blackwell-native.

Format	Element	Block	Scale	Notable
INT4 (PTQ)	4-bit int	per-group	per-group	GPTQ / AWQ, DIY
INT4 (QAT)	4-bit int	trained-in	—	native, near-FP accuracy (Kimi)
MXFP4	E2M1	32	E8M0 (pow-2)	open standard, gpt-oss
NVFP4	E2M1	16	FP8 E4M3	most accurate FP4, Blackwell

Which should you use

Self-hosting on Blackwell (5090 / GB10 / B200): prefer NVFP4 — best accuracy at 4-bit, hardware-accelerated.
Portability / open tooling: MXFP4 is the open standard and widely supported.
The model ships native INT4 (QAT): just use it — highest quality-per-bit, zero work.
Quantizing an arbitrary FP16 model yourself: AWQ / GPTQ INT4 (PTQ) is the pragmatic route.

Bottom line

All of them pack weights into ~4 bits for 4× memory savings and (on Blackwell) 2× FP8 speed.
INT4 QAT = trained-in, near-FP accuracy when the author ships it.
FP4 (E2M1) uses a per-block scale; MXFP4 = 32-block / power-of-2 scale (open), NVFP4 = 16-block / FP8 scale (more accurate, Blackwell).
The right pick depends on your hardware and whether the author already did the work.

Deploying AI for your business?

Inference, GPUs, and quantization choices look different in production. See where your business is on the readiness curve.

Take the AI Readiness Check

Share: Post Share

4-Bit Quantization Decoded: INT4 QAT, MXFP4, and NVFP4

Why 4-bit at all

INT4 — and why QAT matters

FP4 — 4-bit floating point (E2M1)

MXFP4 (OCP Microscaling)

NVFP4 (NVIDIA Blackwell)

Which should you use

Bottom line

Deploying AI for your business?

Read next

Quantization Methods Compared: GGUF, AWQ, GPTQ, EXL2, NVFP4

Apricot Jam: Fable 5 vs Sonnet 5 — Which AI Makes the Better Retro Game?

Qwen-AgentWorld: the Open Language World Model for AI Agents

On This Page

Developer Corner

Why 4-bit at all

INT4 — and why QAT matters

FP4 — 4-bit floating point (E2M1)

MXFP4 (OCP Microscaling)

NVFP4 (NVIDIA Blackwell)

Which should you use

Bottom line

Related reading

Deploying AI for your business?

Read next

Quantization Methods Compared: GGUF, AWQ, GPTQ, EXL2, NVFP4

Apricot Jam: Fable 5 vs Sonnet 5 — Which AI Makes the Better Retro Game?

Qwen-AgentWorld: the Open Language World Model for AI Agents

On This Page

Developer Corner