The Counterintuitive Truth
GPUs are marketed on compute power — teraflops, CUDA cores, tensor operations per second. But single-stream LLM inference barely touches that compute. What it consumes is memory bandwidth.
Here's the fundamental problem:
RTX 5090 compute: ~103 TFLOPS (trillions of operations per second)
RTX 5090 VRAM delivers: 1.8 TB/s of data
Gap: ~57 operations available per byte delivered — cores idle ~98% of the time
During autoregressive token generation, the GPU reads the entire model from VRAM for every single token. An 8B model at 6-bit quantization = 6.7 GB per token. At 1.8 TB/s bandwidth, that's 3.7 ms per token, giving a theoretical maximum of ~270 tokens/second.
No amount of additional compute helps. The bottleneck is the straw, not the reservoir.
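The bound above is a one-line calculation, using the model size and bandwidth figures from the text:

```python
def max_tokens_per_second(model_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Decode upper bound: every generated token must stream all weights once."""
    return bandwidth_bytes_per_s / model_bytes

# Qwen3-8B at Q6_K (~6.7 GB) on an RTX 5090 (1.8 TB/s VRAM bandwidth)
print(round(max_tokens_per_second(6.7e9, 1.8e12)))  # ~269 tok/s
```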
Proving the Memory Wall
We ran an experiment with two models on the same RTX 5090:
| Model | Parameters | VRAM | Achieved tok/s | Theoretical max |
|---|---|---|---|---|
| Qwen3-8B (Q6_K) | 8.2B | 6.7 GB | 161 | ~270 |
| SmolLM2-135M (IQ4_XS) | 135M | 96 MB | 1,110 | ~18,750 |
The tiny model (135M) should be 70x faster based on its 70x smaller memory footprint. Instead, it's only 7x faster. Where does the performance go?
The Four Walls
Detailed profiling of SmolLM2-135M revealed the actual per-token bottleneck structure:
- Wall 1 — CPU round-trip: 854 μs (95% of optimized time) ← the real bottleneck
- Wall 2 — kernel launches: 725 μs (fixed with CUDA graphs: 1.8x speedup)
- Wall 3 — VRAM bandwidth: 47 μs (L2 cache residency would cut this to ~7 μs)
- Wall 4 — GPU compute: ~1 μs (negligible)
Wall 1: The CPU Round-Trip
For every token generated, the process is:
- GPU finishes computing logits
- Transfer logits to CPU via PCIe
- CPU runs sampling (argmax/top-p/top-k)
- Transfer selected token back to GPU
- GPU embeds token and starts next forward pass
This CPU↔GPU round-trip takes 854 microseconds — regardless of model size. It's a fixed overhead that dominates inference for small models.
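That fixed cost sets a hard ceiling on single-stream throughput no matter how small the model gets, a quick sanity check:

```python
CPU_ROUNDTRIP_S = 854e-6  # measured CPU<->GPU round-trip per token

# even a hypothetical zero-byte model could not decode faster than this
ceiling = 1.0 / CPU_ROUNDTRIP_S
print(round(ceiling))  # ~1171 tok/s; SmolLM2's measured 1,110 is already near it
```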
Wall 2: Kernel Launch Overhead
Each forward pass through the model launches hundreds of GPU kernels. Each launch has ~1-5 μs of overhead, and for a tiny 135M model, this adds up to 725 μs.
CUDA graphs solve this by recording the execution pattern once and replaying it. This improved our SmolLM2 throughput by 1.81x:
| Configuration | tok/s | Improvement |
|---|---|---|
| Without CUDA graphs | 615 | Baseline |
| With CUDA graphs | 1,110 | 1.81x |
For the larger Qwen3-8B, CUDA graphs help less (1.17x) because the model is memory-bound, not kernel-launch-bound.
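The launch-overhead figure is consistent with a simple product of kernel count and per-launch cost. The layer and kernel counts below are illustrative assumptions, not measurements:

```python
def launch_overhead_us(n_layers: int, kernels_per_layer: int, per_launch_us: float) -> float:
    """Total kernel-launch overhead per forward pass, without CUDA graphs."""
    return n_layers * kernels_per_layer * per_launch_us

# e.g. ~30 layers x ~8 kernels/layer x ~3 us/launch lands near the measured 725 us
print(launch_overhead_us(30, 8, 3.0))  # 720.0
```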
Wall 3: VRAM Bandwidth
For the large model, VRAM bandwidth IS the bottleneck:
- Qwen3-8B streams 6.7 GB per token → 3.7 ms per token → ~270 tok/s max
- Actual: 161 tok/s (60% efficiency — typical for real GPU workloads)
For the tiny model, VRAM bandwidth would allow 18,750 tok/s, but Walls 1 and 2 limit us to 1,110.
Wall 4: Compute
GPU compute is effectively free at these model sizes. The matrix multiplications take ~1 μs per token — negligible.
The L2 Cache Hypothesis
The RTX 5090 has a 96 MB L2 cache between the GPU cores and VRAM. If a model fits entirely in L2, it could theoretically avoid VRAM reads entirely:
VRAM bandwidth: 1.8 TB/s → 47 μs per SmolLM2 forward pass
L2 bandwidth: ~12 TB/s → 7 μs per forward pass
Speedup: 6.7x
But this 6.7x only applies to the memory portion. With CPU overhead at 854 μs, the L2 advantage becomes:
With VRAM: 854 + 47 = 901 μs → 1,110 tok/s
With L2: 854 + 7 = 861 μs → ~1,161 tok/s
Speedup: ~5%
The CPU round-trip dominates so completely that L2 residency barely matters in practice.
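This is Amdahl's law in miniature, and worth spelling out with the numbers from the text:

```python
def tok_per_s(overhead_us: float, mem_us: float) -> float:
    """Per-token throughput from fixed overhead plus memory-streaming time."""
    return 1e6 / (overhead_us + mem_us)

vram = tok_per_s(854, 47)  # ~1110 tok/s
l2 = tok_per_s(854, 7)     # ~1161 tok/s
print(f"{l2 / vram:.2f}x")  # a 6.7x memory win shrinks to ~5% overall
```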
What Prefill Reveals
Prefill (processing the input prompt) tells a different story:
| Mode | SmolLM2 tok/s | Parallelism |
|---|---|---|
| Prefill 512 tokens | 57,789 | 512x |
| Prefill 16 tokens | 3,938 | 16x |
| Generation (1 token) | 1,110 | 1x |
During prefill, 512 tokens are processed simultaneously — the GPU achieves 57,789 tok/s. This proves the hardware IS capable of massive throughput. The limitation is the autoregressive nature of generation: each token depends on the previous one, preventing parallelism.
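A toy model of that amortization: the fixed per-pass costs (CPU round-trip plus weight streaming, ~901 μs from the measurements above) are split across every token scored in the pass. It deliberately ignores per-token attention and compute costs, which is why measured prefill falls short of the ideal:

```python
def amortized_tok_per_s(tokens_per_pass: int, pass_cost_us: float = 901.0) -> float:
    """Fixed per-pass cost divided across all tokens scored in that pass."""
    return tokens_per_pass * 1e6 / pass_cost_us

print(round(amortized_tok_per_s(1)))    # ~1110 tok/s: matches generation
print(round(amortized_tok_per_s(512)))  # ideal ceiling; measured prefill is lower
```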
Why Purpose-Built ASICs Win
Inference-specific chips solve the memory wall architecturally:
Groq LPU
- 230 MB on-chip SRAM (no external memory)
- 80 TB/s internal bandwidth (44x GPU VRAM)
- Eliminates CPU round-trip — sampling happens on-die
- Result: 300+ tok/s on Llama 3 70B
Cerebras WSE-3
- 44 GB on-chip SRAM
- 21 PB/s on-chip bandwidth (11,600x GPU VRAM)
- Entire model lives on-chip
- Result: Thousands of tok/s
Taalas HC1
- Model weights encoded directly in silicon (3-bit custom)
- 17,000 tok/s on Llama 3.1 8B
- 105x faster than our RTX 5090
- No memory access at all — weights ARE the hardware
What This Means for GPU Deployments
1. Quantization is the Primary Lever
Since inference is memory-bound, reducing model size directly improves speed:
| Quantization | Speed improvement | Why |
|---|---|---|
| BF16 → Q6_K | ~2x | Half the data to stream |
| BF16 → Q4_K_M | ~2.5x | Even less data |
| BF16 → Q2 | ~3x | Diminishing returns (quality drops) |
Quantization doesn't sacrifice compute — it reduces the real bottleneck (memory reads).
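As a sketch, the ideal speedup is just the ratio of bytes streamed per token. The bits-per-weight values below are approximate llama.cpp figures, and real gains are smaller because of dequantization overhead and fixed costs:

```python
BITS_PER_WEIGHT = {"BF16": 16.0, "Q6_K": 6.56, "Q4_K_M": 4.85, "Q2_K": 2.96}

def ideal_speedup_vs_bf16(quant: str) -> float:
    """Memory-bound decode: speed scales inversely with bytes streamed per token."""
    return BITS_PER_WEIGHT["BF16"] / BITS_PER_WEIGHT[quant]

for q in ("Q6_K", "Q4_K_M", "Q2_K"):
    print(f"{q}: {ideal_speedup_vs_bf16(q):.1f}x")
```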
2. VRAM Amount < VRAM Bandwidth
When choosing a GPU for inference, bandwidth matters more than capacity:
| GPU | VRAM | Bandwidth | Expected 8B Q6_K tok/s |
|---|---|---|---|
| RTX 4090 | 24 GB | 1.0 TB/s | ~90 |
| RTX 5090 | 32 GB | 1.8 TB/s | ~160 |
| A100 | 80 GB | 2.0 TB/s | ~180 |
| H100 | 80 GB | 3.35 TB/s | ~300 |
The H100 has 1.9x the bandwidth of the RTX 5090, which translates directly to ~1.9x the inference speed.
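The whole table follows from one formula — bandwidth times a real-world efficiency factor, divided by model size — where the 60% efficiency comes from the measured RTX 5090 run:

```python
def expected_tok_per_s(bandwidth_tb_s: float, model_gb: float = 6.7,
                       efficiency: float = 0.6) -> float:
    """Estimated decode speed for a memory-bound model."""
    return bandwidth_tb_s * 1e12 * efficiency / (model_gb * 1e9)

for name, bw in [("RTX 4090", 1.0), ("RTX 5090", 1.8), ("A100", 2.0), ("H100", 3.35)]:
    print(f"{name}: ~{expected_tok_per_s(bw):.0f} tok/s")
```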
3. Batching is the Only Way to Use Compute
The GPU's compute power only helps with concurrent requests. With 8 concurrent users, the GPU can process 8 tokens simultaneously, filling more of its compute capacity:
| Concurrent users | GPU utilization | Aggregate tok/s |
|---|---|---|
| 1 | ~2% | 161 |
| 4 | ~8% | ~400 |
| 8 | ~15% | ~600 |
| 32 | ~50% | ~1,500 |
This is why vLLM with continuous batching matters for production — it's the only way to actually use the GPU you paid for.
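A toy batching model shows why aggregate throughput scales sub-linearly: weights stream once per step regardless of batch size, while each extra sequence adds only its own KV-cache reads and sampling cost. The two constants below are fit to the batch-1 and batch-32 rows, not measured directly:

```python
def aggregate_tok_per_s(batch: int, weight_stream_us: float = 5723.0,
                        per_seq_us: float = 488.0) -> float:
    """One decode step serves the whole batch; weights are read once per step."""
    step_us = weight_stream_us + batch * per_seq_us
    return batch * 1e6 / step_us

print(round(aggregate_tok_per_s(1)))   # ~161 tok/s
print(round(aggregate_tok_per_s(32)))  # ~1500 tok/s
```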
The Future: Where This Is Heading
- HBM4 (2026) — 6+ TB/s of bandwidth on next-generation accelerators could roughly double inference speed
- On-chip model caching — Larger L2/L3 caches could eventually fit quantized 1B models
- Speculative decoding — Use small draft models to generate candidate tokens in parallel, but requires vocabulary-aligned model pairs
- Inference ASICs — Dedicated chips that eliminate the CPU round-trip entirely
- Hybrid architectures — GPU + inference ASIC combos that handle training and serving optimally
The memory wall isn't going away, but the wall is moving. Every generation of hardware pushes the boundary, and creative software solutions (quantization, batching, speculative decoding) continue to extract more from existing hardware.
Key Takeaway
When planning an LLM deployment, think in terms of memory bandwidth, not compute:
- Your GPU cores are 98% idle during inference
- Quantization is the single most impactful optimization
- Batching (vLLM) is the only way to utilize compute
- Purpose-built ASICs are 10-100x faster because they solve the architecture problem
- For most businesses, a well-quantized model on a good GPU is more than sufficient