
Will This LLM Fit My GPU? VRAM Requirements for Every Model Size

ai.rs Mar 9, 2026

The Question Every Developer Asks

You found a model on Hugging Face. It looks promising. But before you spend 30 minutes downloading it and another 10 watching it crash with an out-of-memory error, you need to answer one question: will it fit on my GPU?

This isn't as simple as "8B parameters = X GB." VRAM usage depends on the data type, quantization format, context length, KV cache overhead, and whether you're running one user or twenty. Let's break it all down.

The VRAM Formula

Total GPU memory for inference has three components:

Total VRAM = Model Weights + KV Cache + Overhead

Component 1: Model Weights

This is the big one. Model weights are the learned parameters stored in files on disk, loaded entirely into VRAM for inference.

| Data Type | Bytes per Parameter | 8B Model | 27B Model | 70B Model |
|---|---|---|---|---|
| FP32 | 4 | 32 GB | 108 GB | 280 GB |
| FP16 / BF16 | 2 | 16 GB | 54 GB | 140 GB |
| Q8_0 (8-bit) | ~1.1 | 8.5 GB | 29 GB | 75 GB |
| Q6_K (6-bit) | ~0.8 | 6.7 GB | 21 GB | 54 GB |
| Q4_K_M (4-bit) | ~0.55 | 4.7 GB | 15 GB | 40 GB |
| Q2_K (2-bit) | ~0.31 | 2.6 GB | 8.5 GB | 22 GB |

The formula is straightforward:

Weight Memory = num_parameters x bytes_per_parameter

For quantized formats like GGUF, the bytes per parameter varies by layer — attention layers might use higher precision than feed-forward layers. The numbers above are averages across the full model.
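The formula can be sketched in a few lines of Python, using the average bytes-per-parameter figures from the table above. The dictionary and function names are illustrative, and the quantized values are per-model averages, so treat the results as estimates rather than exact file sizes.

```python
# Average bytes per parameter from the table above. Quantized formats vary
# by layer, so these are approximate whole-model averages, not exact sizes.
BYTES_PER_PARAM = {
    "FP32": 4.0,
    "FP16": 2.0,
    "Q8_0": 1.1,
    "Q6_K": 0.8,
    "Q4_K_M": 0.55,
    "Q2_K": 0.31,
}

def weight_memory_gb(num_parameters: float, dtype: str) -> float:
    """Weight Memory = num_parameters x bytes_per_parameter."""
    return num_parameters * BYTES_PER_PARAM[dtype] / 1e9

# An 8B model in FP16 needs about 16 GB just for weights.
print(weight_memory_gb(8e9, "FP16"))  # 16.0
```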

MoE models are different. A model like Llama 4 Scout has 109B total parameters but only 17B active per token. You still need VRAM for all 109B parameters — every expert must be in memory even though only a subset fires per token. MoE models are memory-heavy but compute-light.

Component 2: KV Cache

The KV (Key-Value) cache stores attention states for every token in the context window. It grows linearly with sequence length and can consume significant VRAM for long contexts.

KV Cache = 2 x num_layers x num_kv_heads x head_dim x seq_length x dtype_bytes

Where:

  • 2 — one for keys, one for values
  • num_layers — number of transformer layers (e.g., 32 for Qwen3-8B)
  • num_kv_heads — number of key-value heads (often fewer than attention heads due to GQA)
  • head_dim — hidden_size / num_attention_heads (e.g., 4096 / 32 = 128)
  • seq_length — your actual context length in tokens
  • dtype_bytes — 2 for FP16/BF16, 1 for FP8
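The formula above translates directly to code. This sketch uses the article's example numbers for Qwen3-8B (32 layers, 8 KV heads, head dimension 128); the function name is illustrative.

```python
# KV Cache = 2 x num_layers x num_kv_heads x head_dim x seq_length x dtype_bytes
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_length: int, dtype_bytes: int = 2) -> int:
    # The leading 2 accounts for one tensor of keys and one of values.
    return 2 * num_layers * num_kv_heads * head_dim * seq_length * dtype_bytes

# Qwen3-8B example values at 8K context in FP16:
gb = kv_cache_bytes(32, 8, 128, 8192) / 2**30
print(f"{gb:.1f} GiB")  # 1.0 GiB
```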

Here's what KV cache looks like for Qwen3-8B at different context lengths:

| Context Length | FP16 KV Cache | FP8 KV Cache |
|---|---|---|
| 2K tokens | 256 MB | 128 MB |
| 8K tokens | 1.0 GB | 512 MB |
| 32K tokens | 4.1 GB | 2.0 GB |
| 128K tokens | 16.4 GB | 8.2 GB |

At 32K context, the KV cache alone eats 4 GB — half of what the quantized weights use. This is why "my model fits in VRAM" and "my model fits in VRAM with the context length I need" are very different statements.

Multi-user multiplier: Each concurrent user needs their own KV cache. 8 users at 8K context = 8 GB of KV cache in FP16. This is why vLLM's paged attention matters at scale — it avoids pre-allocating the full context for every user.

Component 3: Overhead

Operating system, CUDA runtime, framework buffers, and activation memory during forward passes. Rule of thumb:

| Component | Typical Size |
|---|---|
| CUDA runtime + driver | 300-500 MB |
| Framework buffers (Ollama/vLLM) | 200-500 MB |
| Activation memory | 100-300 MB |
| Total overhead | ~0.5-1.5 GB |

For quick estimates, add 1 GB overhead. For production capacity planning, add 1.5 GB.
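Putting the three components together gives a back-of-envelope estimator. This is a sketch with illustrative names; the 1 GB default overhead and per-user KV multiplication follow the rules of thumb above.

```python
# Total VRAM = Model Weights + KV Cache (per concurrent user) + Overhead
def total_vram_gb(weight_gb: float, kv_cache_gb_per_user: float,
                  num_users: int = 1, overhead_gb: float = 1.0) -> float:
    return weight_gb + kv_cache_gb_per_user * num_users + overhead_gb

# Qwen3-8B at Q6_K (6.7 GB weights), 8K context (1.0 GB KV), one user:
print(round(total_vram_gb(6.7, 1.0), 1))     # 8.7
# The same model serving 4 concurrent users:
print(round(total_vram_gb(6.7, 1.0, 4), 1))  # 11.7
```

Note how concurrency alone pushes the same model from comfortable to tight on a 12 GB card.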

The One-Command Check: hf-mem

Instead of doing math by hand, use hf-mem — a CLI tool that reads Safetensors metadata directly from Hugging Face without downloading the model. It uses HTTP range requests to fetch just the header bytes, so it works instantly even for 100 GB+ models.

Install and Run

# No install needed — run directly with uvx
uvx hf-mem --model-id Qwen/Qwen3-8B

This outputs a breakdown by component: parameter count per dtype, total bytes, and a formatted table showing exactly how much memory the weights require.

With KV Cache Estimation

Add --experimental to include KV cache calculations:

uvx hf-mem --model-id Qwen/Qwen3-8B --experimental

You can customize the estimate for your specific use case:

# 32K context, 4 concurrent users, FP8 cache
uvx hf-mem --model-id Qwen/Qwen3-8B \
  --experimental \
  --max-model-len 32768 \
  --batch-size 4 \
  --kv-cache-dtype fp8

GGUF Quantized Models

For quantized models (which is what most people actually deploy), specify the GGUF file:

# Check a specific quantization
uvx hf-mem --model-id bartowski/Qwen3-8B-GGUF \
  --gguf-file Qwen3-8B-Q6_K.gguf \
  --experimental

JSON Output for Scripts

Get machine-readable output for automation:

uvx hf-mem --model-id Qwen/Qwen3-8B --experimental --json-output

This returns a JSON object with param_count, bytes_count, cache_size, and all component-level detail — useful for building your own capacity planning scripts.
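A capacity check can be scripted on top of that JSON. The sketch below assumes the field names mentioned above (bytes_count, cache_size); the exact schema may differ between hf-mem versions, so verify it against your own --json-output before relying on this.

```python
# Capacity-planning check driven by hf-mem's JSON output.
# Field names (bytes_count, cache_size) are assumptions based on the
# fields described above; confirm against your hf-mem version.
import json
import subprocess

def fits_in_vram(report: dict, vram_gb: float, overhead_gb: float = 1.0) -> bool:
    # Weights + KV cache + overhead, compared against the card's capacity.
    total_gb = (report["bytes_count"] + report.get("cache_size", 0)) / 1e9 + overhead_gb
    return total_gb <= vram_gb

def check_model(model_id: str, vram_gb: float) -> bool:
    out = subprocess.run(
        ["uvx", "hf-mem", "--model-id", model_id, "--experimental", "--json-output"],
        capture_output=True, text=True, check=True,
    )
    return fits_in_vram(json.loads(out.stdout), vram_gb)
```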

How It Works Under the Hood

hf-mem doesn't download model files. It exploits the Safetensors format, which stores tensor metadata (shapes and dtypes) in a header at the beginning of each file. An HTTP range request (bytes=0-100000) fetches just this header — typically under 100 KB even for models with thousands of tensors.

From the header, it extracts every tensor's shape and dtype, multiplies shape dimensions to get parameter count, then multiplies by bytes-per-dtype to get memory. For KV cache, it reads the model's config.json to get layer count, head count, and head dimension.

The whole process takes 1-3 seconds regardless of model size.
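The trick is reproducible in a few lines. A safetensors file begins with an 8-byte little-endian integer giving the length of a JSON header that maps each tensor name to its dtype, shape, and data offsets; that layout is part of the published format. The function names here are illustrative.

```python
import json
import math
import struct
import urllib.request

def params_from_header(raw: bytes) -> int:
    (header_len,) = struct.unpack("<Q", raw[:8])   # first 8 bytes: header size
    header = json.loads(raw[8:8 + header_len])     # then the JSON header itself
    return sum(
        math.prod(t["shape"])                      # multiply shape dims per tensor
        for name, t in header.items()
        if name != "__metadata__"                  # skip the optional metadata entry
    )

def count_params(url: str, max_header: int = 100_000) -> int:
    # Range request: fetch only the first ~100 KB instead of the whole file.
    req = urllib.request.Request(url, headers={"Range": f"bytes=0-{max_header}"})
    return params_from_header(urllib.request.urlopen(req).read())
```

Multiply the parameter count by bytes-per-dtype and you have the weight memory, no download required.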

What Fits on Which GPU

Here's what actually fits, with realistic context lengths and a 1 GB overhead budget:

8 GB GPUs (RTX 4060, RTX 3070)

| Model | Quant | Weights | KV (4K ctx) | Total | Fits? |
|---|---|---|---|---|---|
| Qwen3-8B | Q4_K_M | 4.7 GB | 0.5 GB | 6.2 GB | Yes |
| Qwen3-8B | Q6_K | 6.7 GB | 0.5 GB | 8.2 GB | Tight |
| Llama 3.1 8B | Q4_K_M | 4.9 GB | 0.5 GB | 6.4 GB | Yes |
| Gemma 3 12B | Q4_K_M | 7.2 GB | 0.6 GB | 8.8 GB | No |

Sweet spot: 8B models at Q4_K_M with 4K context. Going to Q6_K is possible but leaves no room for longer contexts.

12 GB GPUs (RTX 4070, RTX 3060 12GB)

| Model | Quant | Weights | KV (8K ctx) | Total | Fits? |
|---|---|---|---|---|---|
| Qwen3-8B | Q6_K | 6.7 GB | 1.0 GB | 8.7 GB | Yes |
| Qwen3-8B | Q8_0 | 8.5 GB | 1.0 GB | 10.5 GB | Yes |
| Gemma 3 12B | Q6_K | 9.2 GB | 1.2 GB | 11.4 GB | Tight |
| Qwen3-14B | Q4_K_M | 8.2 GB | 0.8 GB | 10.0 GB | Yes |

Sweet spot: 8B at Q6_K or Q8_0 with 8K context. Can squeeze in 12-14B at Q4_K_M.

16 GB GPUs (RTX 4080, RTX 5060 Ti)

| Model | Quant | Weights | KV (8K ctx) | Total | Fits? |
|---|---|---|---|---|---|
| Qwen3-14B | Q6_K | 11.2 GB | 0.8 GB | 13.0 GB | Yes |
| Gemma 3 27B | Q4_K_M | 15.2 GB | 1.6 GB | 17.8 GB | No |
| Qwen3-8B | Q6_K | 6.7 GB | 4.1 GB | 11.8 GB | Yes (32K ctx) |

Sweet spot: 14B at Q6_K with 8K context. Or 8B at high quality with very long context.

24 GB GPUs (RTX 4090, RTX 5090, A5000)

| Model | Quant | Weights | KV (8K ctx) | Total | Fits? |
|---|---|---|---|---|---|
| Qwen3.5-27B | Q6_K | 21 GB | 1.6 GB | 23.6 GB | Tight |
| Gemma 3 27B | Q6_K | 20 GB | 1.6 GB | 22.6 GB | Yes |
| Llama 3.1 70B | Q4_K_M | 40 GB | n/a | n/a | No |
| Qwen3-8B | Q8_0 | 8.5 GB | 16.4 GB (128K ctx) | 25.9 GB | No |

Sweet spot: 27B at Q6_K with 8K context. Note that even an 8B model can bust 24 GB if you crank context to 128K.

32 GB GPUs (RTX 5090)

| Model | Quant | Weights | KV (8K ctx) | Total | Fits? |
|---|---|---|---|---|---|
| Qwen3.5-27B | Q8_0 | 29 GB | 1.6 GB | 31.6 GB | Tight |
| Llama 4 Scout | Q6_K | ~89 GB | n/a | n/a | No (all 109B params must be resident) |
| Qwen3.5-27B | Q6_K | 21 GB | 6.4 GB | 28.4 GB | Yes (32K ctx) |

Sweet spot: 27B at Q8_0 for maximum quality, or Q6_K with extended context.

Common Mistakes

1. Ignoring KV Cache

"The model is 6 GB and my GPU has 8 GB, it'll fit." Probably — at 2K context. At 32K context, add another 4 GB for KV cache. Always factor in your actual context length.

2. Confusing Total vs Active Parameters (MoE)

Llama 4 Scout: 109B total, 17B active. Mixtral 8x7B: 47B total, 13B active. You need VRAM for total parameters, not active. MoE models seem efficient in compute but are memory-hungry.

3. Forgetting Multi-User Overhead

One user at 8K context needs 1 GB KV cache. Eight users need 8 GB. If you're deploying for concurrent access, multiply KV cache by your expected concurrency — or use vLLM's PagedAttention which allocates dynamically.

4. Using Reported Size Instead of Measuring

Model cards sometimes report FP16 size when quantized versions are available. Or they report weight-only size without KV cache. Use hf-mem to get the actual number from the actual files.

The Decision Process

1. Pick your model (size + architecture)
2. Pick your quantization (Q6_K is the sweet spot for most)
3. Calculate: weights + KV cache (at your context length) + 1 GB overhead
4. Compare against your GPU VRAM
5. If it doesn't fit: try smaller quant, shorter context, or smaller model

Or skip the math entirely:

uvx hf-mem --model-id <your-model> --experimental --max-model-len <your-context>

The 30 seconds spent checking saves 30 minutes of downloading and debugging OOM errors.

What About CPU Offloading?

If a model doesn't quite fit, some frameworks (llama.cpp, Ollama) can offload layers to system RAM. This works but kills performance — CPU memory bandwidth is 10-20x slower than GPU VRAM. A model that runs at 150 tok/s fully on GPU might drop to 15 tok/s with partial offloading.

Use offloading for experimentation, not production. If you need to offload more than 10-20% of layers, you need a bigger GPU or a smaller model.
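A rough model shows why the slowdown is so steep. Assuming decode is memory-bandwidth-bound and every token streams through every layer, the slow offloaded layers dominate total time per token. The 15x CPU penalty below is an assumption picked from the article's 10-20x range; the function is a sketch, not a benchmark.

```python
# Rough throughput model for partial CPU offload: time per token is the sum
# of time spent in GPU-resident layers and in (much slower) offloaded layers.
def offloaded_tok_s(gpu_tok_s: float, offload_frac: float,
                    cpu_slowdown: float = 15.0) -> float:
    gpu_time = (1 - offload_frac) / gpu_tok_s          # fast, GPU-resident layers
    cpu_time = offload_frac * cpu_slowdown / gpu_tok_s  # slow, offloaded layers
    return 1 / (gpu_time + cpu_time)

print(round(offloaded_tok_s(150, 0.0)))  # 150: fully on GPU
print(round(offloaded_tok_s(150, 0.2)))  # 39: 20% offloaded already costs ~74%
print(round(offloaded_tok_s(150, 0.5)))  # 19: half-and-half is an order of magnitude slower
```

Even a modest offload fraction craters throughput, which is why the 10-20% ceiling above is a practical limit.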

Practical Workflow

Here's the workflow we use when evaluating models:

# 1. Check if it fits
uvx hf-mem --model-id Qwen/Qwen3-8B --experimental --max-model-len 8192

# 2. Check the quantized version you'll actually deploy
uvx hf-mem --model-id bartowski/Qwen3-8B-GGUF \
  --gguf-file Qwen3-8B-Q6_K.gguf --experimental

# 3. If it fits, download and test
ollama pull qwen3:8b-q6_K

# 4. Verify actual VRAM usage
nvidia-smi

The key insight: check before you download. GPU memory is a hard constraint — there's no swap file, no graceful degradation. Either the model fits or it crashes. A 3-second check with hf-mem tells you the answer before committing to a multi-gigabyte download.

For comparing which models give you the best quality within your VRAM budget, see our open model comparison and quantization benchmarks for quality-vs-size tradeoffs at each quantization level.

