Research · 15 min read

The GPU Memory Wall: Why Inference Hardware Matters

ai.rs Feb 26, 2026

The Counterintuitive Truth

GPUs are marketed on compute power — teraflops, CUDA cores, tensor operations per second. But LLM inference doesn't use compute power. It uses memory bandwidth.

Here's the fundamental problem:

RTX 5090 cores can consume:  ~103 TB/s of operands (~103 TFLOPS at one byte per op)
RTX 5090 VRAM delivers:        1.8 TB/s of data
Gap:                           ~57x — cores idle ~98% of the time

During autoregressive token generation, the GPU reads the entire model from VRAM for every single token. An 8B model at 6-bit quantization = 6.7 GB per token. At 1.8 TB/s bandwidth, that's 3.7 ms per token, giving a theoretical maximum of ~270 tokens/second.

No amount of additional compute helps. The bottleneck is the straw, not the reservoir.
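That bandwidth ceiling is easy to sanity-check. A minimal sketch, using the 6.7 GB and 1.8 TB/s figures above:

```python
def decode_ceiling_tok_s(model_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Upper bound on autoregressive decode speed: every generated token
    must stream the full weight set from VRAM once."""
    return bandwidth_bytes_per_s / model_bytes

# Qwen3-8B at Q6_K on an RTX 5090
print(round(decode_ceiling_tok_s(6.7e9, 1.8e12)))  # 269 — the ~270 tok/s ceiling
# SmolLM2-135M at IQ4_XS (96 MB)
print(round(decode_ceiling_tok_s(96e6, 1.8e12)))   # 18750
```

Doubling compute leaves these numbers unchanged; only shrinking the model or widening the memory bus moves them.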

Proving the Memory Wall

We ran an experiment with two models on the same RTX 5090:

Model                  Parameters  VRAM    Achieved tok/s  Theoretical max
Qwen3-8B (Q6_K)        8.2B        6.7 GB  161             ~270
SmolLM2-135M (IQ4_XS)  135M        96 MB   1,110           ~18,750

The tiny model (135M) should be 70x faster based on its 70x smaller memory footprint. Instead, it's only 7x faster. Where does the performance go?

The Four Walls

Detailed profiling revealed the actual bottleneck structure:

Wall 1 — CPU round-trip:     854 μs  (95% of per-token time with CUDA graphs on)  ← REAL BOTTLENECK
Wall 2 — Kernel launches:    725 μs  (fixed with CUDA graphs: 1.8x speedup)
Wall 3 — VRAM bandwidth:      47 μs  (L2 cache would fix: 47 → 7 μs)
Wall 4 — GPU compute:         ~1 μs  (negligible)

Wall 1: The CPU Round-Trip

For every token generated, the process is:

  1. GPU finishes computing logits
  2. Transfer logits to CPU via PCIe
  3. CPU runs sampling (argmax/top-p/top-k)
  4. Transfer selected token back to GPU
  5. GPU embeds token and starts next forward pass

This CPU↔GPU round-trip takes 854 microseconds — regardless of model size. It's a fixed overhead that dominates inference for small models.
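Because the overhead is fixed, it alone caps throughput for any model, however small. A quick sketch with the profiled numbers (854 μs round-trip, 47 μs of VRAM reads for SmolLM2):

```python
CPU_ROUND_TRIP_S = 854e-6  # fixed CPU<->GPU round-trip per token

def tok_s(seconds_per_token: float) -> float:
    return 1.0 / seconds_per_token

# Even a hypothetical zero-parameter model could not exceed:
print(round(tok_s(CPU_ROUND_TRIP_S)))          # 1171
# SmolLM2 adds ~47 us of weight streaming on top:
print(round(tok_s(CPU_ROUND_TRIP_S + 47e-6)))  # 1110 — the measured figure
```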

Wall 2: Kernel Launch Overhead

Each forward pass through the model launches hundreds of GPU kernels. Each launch has ~1-5 μs of overhead, and for a tiny 135M model, this adds up to 725 μs.

CUDA graphs solve this by recording the execution pattern once and replaying it. This improved our SmolLM2 throughput by 1.81x:

Configuration        tok/s  Improvement
Without CUDA graphs    615  Baseline
With CUDA graphs     1,110  1.81x

For the larger Qwen3-8B, CUDA graphs help less (1.17x) because the model is memory-bound, not kernel-launch-bound.
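The four wall times reproduce both measurements almost exactly. A sketch, with the per-token values in μs taken from the profile above:

```python
CPU_US, LAUNCH_US, VRAM_US, COMPUTE_US = 854, 725, 47, 1

def tok_s(step_us: float) -> float:
    return 1e6 / step_us

without_graphs = tok_s(CPU_US + LAUNCH_US + VRAM_US + COMPUTE_US)
with_graphs = tok_s(CPU_US + VRAM_US + COMPUTE_US)  # launch overhead replayed away

print(round(without_graphs))  # 615
print(round(with_graphs))     # 1109 — vs. 1,110 measured
print(f"{with_graphs / without_graphs:.2f}x")  # 1.80x — vs. 1.81x measured
```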

Wall 3: VRAM Bandwidth

For the large model, VRAM bandwidth IS the bottleneck:

  • Qwen3-8B streams 6.7 GB per token → 3.7 ms per token → ~270 tok/s max
  • Actual: 161 tok/s (60% efficiency — typical for real GPU workloads)

For the tiny model, VRAM bandwidth would allow 18,750 tok/s, but Walls 1 and 2 limit us to 1,110.

Wall 4: Compute

GPU compute is effectively free at these model sizes. The matrix multiplications take ~1 μs per token — negligible.

The L2 Cache Hypothesis

The RTX 5090 has a 96 MB L2 cache between the GPU cores and VRAM. If a model fits entirely in L2, it could theoretically avoid VRAM reads entirely:

VRAM bandwidth:  1.8 TB/s → 47 μs per SmolLM2 forward pass
L2 bandwidth:    ~12 TB/s → 7 μs per forward pass
Speedup:         6.7x

But this 6.7x only applies to the memory portion. With CPU overhead at 854 μs, the L2 advantage becomes:

With VRAM:  854 + 47 = 901 μs → 1,110 tok/s
With L2:    854 +  7 = 861 μs → 1,162 tok/s
Speedup:    ~5%

The CPU round-trip dominates so completely that L2 residency barely matters in practice.
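This is Amdahl's law in miniature — the 6.7x only accelerates the 47 μs slice. The arithmetic above as a sketch:

```python
CPU_US = 854             # fixed per-token CPU round-trip
VRAM_US, L2_US = 47, 7   # weight streaming from VRAM vs. (hypothetically) from L2

vram_tok_s = 1e6 / (CPU_US + VRAM_US)
l2_tok_s = 1e6 / (CPU_US + L2_US)

print(round(vram_tok_s))                   # 1110
print(round(l2_tok_s))                     # 1161 — the ~1,162 above, modulo rounding
print(f"{l2_tok_s / vram_tok_s - 1:.1%}")  # 4.6%
```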

What Prefill Reveals

Prefill (processing the input prompt) tells a different story:

Mode SmolLM2 tok/s Parallelism
Prefill 512 tokens 57,789 512x
Prefill 16 tokens 3,938 16x
Generation (1 token) 1,110 1x

During prefill, 512 tokens are processed simultaneously — the GPU achieves 57,789 tok/s. This proves the hardware IS capable of massive throughput. The limitation is the autoregressive nature of generation: each token depends on the previous one, preventing parallelism.

Why Purpose-Built ASICs Win

Inference-specific chips solve the memory wall architecturally:

Groq LPU

  • 230 MB on-chip SRAM (no external memory)
  • 80 TB/s internal bandwidth (44x GPU VRAM)
  • Eliminates CPU round-trip — sampling happens on-die
  • Result: 300+ tok/s on Llama 3 70B

Cerebras WSE-3

  • 44 GB on-chip SRAM
  • 21 PB/s on-chip bandwidth (11,600x GPU VRAM)
  • Entire model lives on-chip
  • Result: Thousands of tok/s

Taalas HC1

  • Model weights encoded directly in silicon (3-bit custom)
  • 17,000 tok/s on Llama 3.1 8B
  • 105x faster than our RTX 5090
  • No memory access at all — weights ARE the hardware

What This Means for GPU Deployments

1. Quantization is the Primary Lever

Since inference is memory-bound, reducing model size directly improves speed:

Quantization   Speed improvement  Why
BF16 → Q6_K    ~2x                Half the data to stream
BF16 → Q4_K_M  ~2.5x              Even less data
BF16 → Q2      ~3x                Diminishing returns (quality drops)

Quantization doesn't sacrifice compute — it reduces the real bottleneck (memory reads).
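For a purely memory-bound decoder, the ideal speedup is just the ratio of bytes streamed per token. A sketch — the bits-per-weight values are approximate figures for llama.cpp K-quants, and the ideal ratio is an upper bound; the measured gains in the table are lower because fixed per-token overheads don't shrink with the weights:

```python
# Approximate effective bits per weight, including quantization-block overhead
BITS_PER_WEIGHT = {"BF16": 16.0, "Q6_K": 6.6, "Q4_K_M": 4.8, "Q2_K": 2.6}

def ideal_speedup(src: str, dst: str) -> float:
    """Memory-bound upper bound: fewer bytes streamed per generated token."""
    return BITS_PER_WEIGHT[src] / BITS_PER_WEIGHT[dst]

for dst in ("Q6_K", "Q4_K_M", "Q2_K"):
    print(f"BF16 -> {dst}: up to {ideal_speedup('BF16', dst):.1f}x")
```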

2. VRAM Amount < VRAM Bandwidth

When choosing a GPU for inference, bandwidth matters more than capacity:

GPU       VRAM   Bandwidth  Expected 8B Q6_K tok/s
RTX 4090  24 GB  1.0 TB/s   ~90
RTX 5090  32 GB  1.8 TB/s   ~160
A100      80 GB  2.0 TB/s   ~180
H100      80 GB  3.35 TB/s  ~300

The H100 has 1.9x the bandwidth of the RTX 5090, which translates directly to ~1.9x the inference speed.
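The "Expected tok/s" column collapses to a single formula: peak bandwidth × achieved efficiency ÷ bytes streamed per token. A sketch, assuming the ~60% bandwidth efficiency observed earlier:

```python
MODEL_BYTES = 6.7e9  # 8B model at Q6_K
EFFICIENCY = 0.60    # fraction of peak bandwidth real decode achieves

def expected_tok_s(bandwidth_tb_per_s: float) -> float:
    return bandwidth_tb_per_s * 1e12 * EFFICIENCY / MODEL_BYTES

for gpu, bw in [("RTX 4090", 1.0), ("RTX 5090", 1.8), ("A100", 2.0), ("H100", 3.35)]:
    print(f"{gpu}: ~{expected_tok_s(bw):.0f} tok/s")
```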

3. Batching is the Only Way to Use Compute

The GPU's compute power only helps with concurrent requests. With 8 concurrent users, the GPU can process 8 tokens simultaneously, filling more of its compute capacity:

Concurrent users  GPU utilization  Aggregate tok/s
1                 ~2%              161
4                 ~8%              ~400
8                 ~15%             ~600
32                ~50%             ~1,500

This is why vLLM with continuous batching matters for production — it's the only way to actually use the GPU you paid for.
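The underlying mechanism: one pass over the weights serves every request in the batch, so the 3.7 ms streaming cost is amortized across users. A simplified sketch — it ignores attention and KV-cache traffic, which grow with batch size and are why the measured scaling in the table is sub-linear:

```python
WEIGHT_STREAM_MS = 3.7  # one pass over 6.7 GB of weights at 1.8 TB/s

def stream_ms_per_token(batch_size: int) -> float:
    """The whole batch shares a single pass over the weights."""
    return WEIGHT_STREAM_MS / batch_size

for n in (1, 4, 8, 32):
    print(f"{n:2d} users: {stream_ms_per_token(n):.2f} ms of weight streaming per token")
```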

The Future: Where This Is Heading

  1. HBM4 (2026) — 6+ TB/s bandwidth on consumer GPUs could double inference speed
  2. On-chip model caching — Larger L2/L3 caches could eventually fit quantized 1B models
  3. Speculative decoding — Use small draft models to generate candidate tokens in parallel, but requires vocabulary-aligned model pairs
  4. Inference ASICs — Dedicated chips that eliminate the CPU round-trip entirely
  5. Hybrid architectures — GPU + inference ASIC combos that handle training and serving optimally

The memory wall isn't going away, but the wall is moving. Every generation of hardware pushes the boundary, and creative software solutions (quantization, batching, speculative decoding) continue to extract more from existing hardware.

Key Takeaway

When planning an LLM deployment, think in terms of memory bandwidth, not compute:

  • Your GPU cores are 98% idle during inference
  • Quantization is the single most impactful optimization
  • Batching (vLLM) is the only way to utilize compute
  • Purpose-built ASICs are 10-100x faster because they solve the architecture problem
  • For most businesses, a well-quantized model on a good GPU is more than sufficient
