
4-Bit Quantization Decoded: INT4 QAT, MXFP4, and NVFP4

ai.rs Jul 3, 2026

Open models now ship in 4-bit by default — it's how a 400B-parameter MoE fits on a workstation at all. But "4-bit" hides three different things you keep seeing in model cards: native INT4 (QAT), MXFP4, and NVFP4. They are not the same, and the difference shows up in both accuracy and which hardware runs them fast. Here's the decoder. (For the wider menu — GGUF, AWQ, GPTQ, EXL2 — see Quantization Methods Compared.)

Why 4-bit at all

A weight stored in FP16 takes 16 bits; in 4-bit it takes 4, so the weights are roughly 4× smaller (slightly less once per-block scales are counted). That's often the difference between a model fitting in your VRAM or not (see the VRAM math). And on NVIDIA's Blackwell GPUs (RTX 5090, GB10 / HP ZGX, B200), the tensor cores run 4-bit floats at up to 2× the throughput of FP8, so 4-bit is faster, not just smaller.
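As a rough illustration of that memory arithmetic, here is a minimal sketch for the 400B-parameter example from the intro. It counts weights only; KV cache, activations, and per-block scales are ignored, and the helper name `weight_bytes` is made up for this example.

```python
def weight_bytes(n_params: int, bits_per_weight: float) -> float:
    """Bytes needed to store n_params weights at the given average bit width."""
    return n_params * bits_per_weight / 8

GIB = 1024 ** 3
n_params = 400_000_000_000  # the 400B-parameter example from the intro

fp16_gib = weight_bytes(n_params, 16) / GIB
int4_gib = weight_bytes(n_params, 4) / GIB

print(f"FP16 : {fp16_gib:,.0f} GiB")
print(f"4-bit: {int4_gib:,.0f} GiB")
print(f"ratio: {fp16_gib / int4_gib:.1f}x")
```

Real checkpoints land a little above the 4-bit figure, because every format below also stores scales alongside the 4-bit elements.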

The risk is accuracy: squeezing a 16-bit number into 4 bits loses information. The formats below are all different answers to "how do we keep the accuracy while spending only 4 bits?"

INT4 — and why QAT matters

INT4 stores each weight as a 4-bit integer (16 possible values). How you get there matters:

  • Post-training quantization (PTQ) — methods like GPTQ and AWQ take a finished FP16 model and round its weights to INT4 afterward. Fast, no retraining, slight accuracy loss. Great for quantizing any model yourself.
  • Quantization-aware training (QAT) — the model is trained with the 4-bit rounding in the loop, so it learns weights that survive quantization. The result is a native INT4 checkpoint with near-FP accuracy. Kimi K2.6 (and Kimi-K2-Thinking before it) ship exactly this — it's why their 4-bit release isn't a lossy afterthought.

Rule of thumb: if the author shipped a native QAT INT4 build, use it — cheapest high-quality option. If not, PTQ (AWQ/GPTQ) is your DIY route.
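To make the rounding step concrete, here is a minimal sketch of symmetric round-to-nearest INT4 with one scale per group. This is the plain baseline, not GPTQ or AWQ, which add error compensation and activation-aware scaling on top. The `fake_quant` helper shows the quantize-then-dequantize step that QAT setups commonly place in the forward pass; that is an assumption about typical practice, not a description of any particular model's training recipe. All function names are illustrative.

```python
def quantize_int4_group(weights):
    """Symmetric round-to-nearest INT4 for one group sharing one scale."""
    max_abs = max(abs(w) for w in weights)
    # Map the largest magnitude in the group to +/-7.
    scale = max_abs / 7 if max_abs > 0 else 1.0
    # Clamp to the signed 4-bit range [-8, 7].
    codes = [max(-8, min(7, round(w / scale))) for w in weights]
    return codes, scale

def dequantize_int4_group(codes, scale):
    """Recover approximate weights from INT4 codes and the group scale."""
    return [c * scale for c in codes]

def fake_quant(weights):
    """Quantize then dequantize: the rounded weights a QAT forward pass would see."""
    codes, scale = quantize_int4_group(weights)
    return dequantize_int4_group(codes, scale)

group = [0.12, -0.40, 0.03, 0.31, -0.07, 0.22, -0.18, 0.05]
codes, scale = quantize_int4_group(group)
restored = dequantize_int4_group(codes, scale)
max_err = max(abs(a - b) for a, b in zip(group, restored))

print(codes)
print(f"scale={scale:.4f}  max_error={max_err:.4f}")
```

With PTQ the weights are fixed and only the rounding happens; with QAT the optimizer sees the rounded weights during training and can move the originals to values that round well.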

FP4 — 4-bit floating point (E2M1)

Integers space their 16 values evenly; floating-point spaces them to handle both large and small magnitudes — which is what neural-net weights actually look like. FP4 packs a tiny float into 4 bits as E2M1: 1 sign bit, 2 exponent bits, 1 mantissa bit.
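The E2M1 layout is small enough to enumerate. The sketch below decodes all 16 bit patterns, assuming the sign bit is the most significant bit and an exponent bias of 1, as in the OCP Microscaling definition of FP4; `decode_e2m1` is a name made up for this example.

```python
def decode_e2m1(code: int) -> float:
    """Decode a 4-bit E2M1 pattern (sign | 2 exponent bits | 1 mantissa bit)."""
    sign = -1.0 if (code >> 3) & 1 else 1.0
    exp = (code >> 1) & 0b11
    man = code & 1
    if exp == 0:
        # Subnormal: no implicit leading 1, so the only values are 0 and 0.5.
        mag = 0.5 * man
    else:
        # Normal: implicit leading 1, exponent bias of 1.
        mag = (1 + 0.5 * man) * 2.0 ** (exp - 1)
    return sign * mag

values = [decode_e2m1(c) for c in range(16)]
positives = values[:8]
print(positives)  # [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
```

The gaps grow from 0.5 near zero to 2 near the top of the range, which is the uneven spacing described above; an INT4 grid would step by the same amount everywhere.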

Four bits can't span a layer's full numeric range alone, so FP4 formats attach a scale to a small block of weights. Two standards do this differently:

MXFP4 (OCP Microscaling)

Groups weights into blocks of 32, each sharing one E8M0 scale (an 8-bit power-of-two). An open OCP standard with low overhead — gpt-oss ships in MXFP4.
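A minimal sketch of quantizing one MXFP4 block, assuming the scale is chosen by aligning the block maximum with E2M1's largest exponent (one common choice; real kernels may pick the scale differently). Ties here resolve toward the smaller magnitude rather than round-to-nearest-even, and the function names are illustrative.

```python
import math

E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # non-negative E2M1 magnitudes
BLOCK = 32

def nearest_e2m1(x: float) -> float:
    """Round x to the nearest E2M1 value, saturating at +/-6."""
    mag = min(E2M1_GRID, key=lambda g: abs(g - abs(x)))
    return math.copysign(mag, x)

def mxfp4_quantize_block(block):
    """Return (scale_exponent, elements) for one 32-weight block."""
    assert len(block) == BLOCK
    max_abs = max(abs(v) for v in block)
    if max_abs == 0.0:
        return 0, [0.0] * BLOCK
    # Power-of-two scale: line the block maximum up with E2M1's top exponent (2).
    # An E8M0 scale would store this exponent in 8 bits with a bias of 127.
    exp = math.floor(math.log2(max_abs)) - 2
    scale = 2.0 ** exp
    return exp, [nearest_e2m1(v / scale) for v in block]

def mxfp4_dequantize_block(exp, elements):
    """Multiply each E2M1 element by the shared power-of-two scale."""
    return [e * 2.0 ** exp for e in elements]

block = [math.sin(i) * 0.02 for i in range(BLOCK)]  # toy weights
exp, elems = mxfp4_quantize_block(block)
restored = mxfp4_dequantize_block(exp, elems)
max_err = max(abs(a - b) for a, b in zip(block, restored))
print(f"scale=2^{exp}  max_error={max_err:.5f}")
```

Because the scale can only be a power of two, the block maximum rarely lands exactly on the top E2M1 value, which is part of the precision MXFP4 gives up for a cheap scale.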

NVFP4 (NVIDIA Blackwell)

Groups weights into blocks of 16, each sharing an FP8 (E4M3) scale, plus a per-tensor FP32 global scale. Smaller blocks and a higher-precision scale mean each block fits its data better, so NVFP4 typically reaches lower perplexity (higher accuracy) than MXFP4 on the same model, at the cost of ~2× the scale overhead: 8 scale bits per 16 weights instead of per 32. It's Blackwell-native.
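The two-level scaling can be sketched as follows. This is an illustration under stated assumptions, not NVIDIA's implementation: the global scale is picked so the largest block scale lands at the E4M3 maximum (448), the E4M3 rounding is a simplified saturating version, and every function name is hypothetical.

```python
import math

E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
E4M3_MAX = 448.0

def nearest_e2m1(x: float) -> float:
    """Round x to the nearest E2M1 value, saturating at +/-6."""
    mag = min(E2M1_GRID, key=lambda g: abs(g - abs(x)))
    return math.copysign(mag, x)

def round_e4m3(x: float) -> float:
    """Round a non-negative value to the nearest E4M3 number, saturating at 448."""
    if x <= 0.0:
        return 0.0
    x = min(x, E4M3_MAX)
    exp = max(math.floor(math.log2(x)), -6)  # -6 is the smallest normal exponent
    step = 2.0 ** (exp - 3)                  # 3 mantissa bits: 8 steps per binade
    return min(round(x / step) * step, E4M3_MAX)

def nvfp4_quantize(tensor, block=16):
    """Return (global_scale, [(block_scale, elements), ...])."""
    amax = max(abs(v) for v in tensor)
    # Assumed recipe: per-tensor FP32 scale maps the tensor max to 6 * 448.
    global_scale = amax / (6.0 * E4M3_MAX) if amax > 0 else 1.0
    blocks = []
    for i in range(0, len(tensor), block):
        chunk = tensor[i:i + block]
        bmax = max(abs(v) for v in chunk)
        block_scale = round_e4m3(bmax / 6.0 / global_scale)  # stored as FP8 E4M3
        s = block_scale * global_scale
        if s == 0.0:
            elems = [0.0] * len(chunk)
        else:
            elems = [nearest_e2m1(v / s) for v in chunk]
        blocks.append((block_scale, elems))
    return global_scale, blocks

def nvfp4_dequantize(global_scale, blocks):
    """Apply block scale and global scale to every E2M1 element."""
    out = []
    for block_scale, elems in blocks:
        s = block_scale * global_scale
        out.extend(e * s for e in elems)
    return out

tensor = [math.sin(i) * 0.02 for i in range(64)]  # toy weights, 4 blocks of 16
global_scale, blocks = nvfp4_quantize(tensor)
restored = nvfp4_dequantize(global_scale, blocks)
amax = max(abs(v) for v in tensor)
max_err = max(abs(a - b) for a, b in zip(tensor, restored))
print(f"blocks={len(blocks)}  max_error={max_err:.5f}")
```

Unlike a power-of-two scale, the E4M3 block scale has mantissa bits, so it can sit close to each block's actual maximum instead of snapping to the nearest power of two.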

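| Format     | Element   | Block     | Scale        | Notable                         |
|------------|-----------|-----------|--------------|---------------------------------|
| INT4 (PTQ) | 4-bit int | per-group | per-group    | GPTQ / AWQ, DIY                 |
| INT4 (QAT) | 4-bit int |           | trained-in   | native, near-FP accuracy (Kimi) |
| MXFP4      | E2M1      | 32        | E8M0 (pow-2) | open standard, gpt-oss          |
| NVFP4      | E2M1      | 16        | FP8 E4M3     | most accurate FP4, Blackwell    |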
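The block sizes and scale widths above translate directly into storage cost. A small sketch of the arithmetic, treating NVFP4's per-tensor FP32 scale as negligible since it is shared by the whole tensor (`bits_per_weight` is an illustrative helper):

```python
def bits_per_weight(element_bits: int, block_size: int, scale_bits: int) -> float:
    """Average storage per weight: the element plus its share of the block scale."""
    return element_bits + scale_bits / block_size

mxfp4 = bits_per_weight(4, 32, 8)  # one 8-bit E8M0 scale per 32 weights
nvfp4 = bits_per_weight(4, 16, 8)  # one 8-bit E4M3 scale per 16 weights

print(f"MXFP4: {mxfp4} bits/weight")  # 4.25
print(f"NVFP4: {nvfp4} bits/weight")  # 4.5
print(f"scale overhead ratio: {(nvfp4 - 4) / (mxfp4 - 4)}")  # 2.0
```

So NVFP4 spends twice as much on scales, but the total difference is a quarter of a bit per weight.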

Which should you use

  • Self-hosting on Blackwell (5090 / GB10 / B200): prefer NVFP4, the most accurate FP4 format and hardware-accelerated there.
  • Portability / open tooling: MXFP4 is the open standard and widely supported.
  • The model ships native INT4 (QAT): just use it — highest quality-per-bit, zero work.
  • Quantizing an arbitrary FP16 model yourself: AWQ / GPTQ INT4 (PTQ) is the pragmatic route.

Bottom line

  • All of them pack weights into ~4 bits for roughly 4× memory savings; the FP4 formats also run at up to 2× FP8 throughput on Blackwell.
  • INT4 QAT = trained-in, near-FP accuracy when the author ships it.
  • FP4 (E2M1) uses a per-block scale; MXFP4 = 32-block / power-of-2 scale (open), NVFP4 = 16-block / FP8 scale (more accurate, Blackwell).
  • The right pick depends on your hardware and whether the author already did the work.
