The Counterintuitive Truth
GPUs are marketed on compute power — teraflops, CUDA cores, tensor operations per second. But single-stream LLM inference barely touches that compute. What it consumes is memory bandwidth.
Here's the fundamental problem:
RTX 5090 compute: ~103 TFLOPS (trillions of operations per second)
RTX 5090 VRAM delivers: 1.8 TB/s of data
Gap: ~57 operations available per byte delivered — cores idle ~98% of the time
During autoregressive token generation, the GPU reads the entire model from VRAM for every single token. An 8B model at 6-bit quantization = 6.7 GB per token. At 1.8 TB/s bandwidth, that's 3.7 ms per token, giving a theoretical maximum of ~270 tokens/second.
No amount of additional compute helps. The bottleneck is the straw, not the reservoir.
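The bound above is a one-line calculation, using the model size and bandwidth figures from the text:

```python
def max_tokens_per_second(model_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Decode upper bound: every generated token must stream all weights once."""
    return bandwidth_bytes_per_s / model_bytes

# Qwen3-8B at Q6_K (~6.7 GB) on an RTX 5090 (1.8 TB/s VRAM bandwidth)
print(round(max_tokens_per_second(6.7e9, 1.8e12)))  # ~269 tok/s
```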
Proving the Memory Wall
We ran an experiment with two models on the same RTX 5090:
| Model | Parameters | VRAM | Achieved tok/s | Theoretical max |
|---|---|---|---|---|
| Qwen3-8B (Q6_K) | 8.2B | 6.7 GB | 161 | ~270 |
| SmolLM2-135M (IQ4_XS) | 135M | 96 MB | 1,110 | ~18,750 |
The tiny model (135M) should be 70x faster based on its 70x smaller memory footprint. Instead, it's only 7x faster. Where does the performance go?
The Four Walls
Detailed profiling of SmolLM2-135M revealed the actual per-token bottleneck structure:
- Wall 1 — CPU round-trip: 854 μs (95% of optimized time) ← the real bottleneck
- Wall 2 — kernel launches: 725 μs (fixed with CUDA graphs: 1.8x speedup)
- Wall 3 — VRAM bandwidth: 47 μs (L2 cache residency would cut this to ~7 μs)
- Wall 4 — GPU compute: ~1 μs (negligible)
Wall 1: The CPU Round-Trip
For every token generated, the process is:
- GPU finishes computing logits
- Transfer logits to CPU via PCIe
- CPU runs sampling (argmax/top-p/top-k)
- Transfer selected token back to GPU
- GPU embeds token and starts next forward pass
This CPU↔GPU round-trip takes 854 microseconds — regardless of model size. It's a fixed overhead that dominates inference for small models.
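That fixed cost sets a hard ceiling on single-stream throughput no matter how small the model gets, a quick sanity check:

```python
CPU_ROUNDTRIP_S = 854e-6  # measured CPU<->GPU round-trip per token

# even a hypothetical zero-byte model could not decode faster than this
ceiling = 1.0 / CPU_ROUNDTRIP_S
print(round(ceiling))  # ~1171 tok/s; SmolLM2's measured 1,110 is already near it
```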
Wall 2: Kernel Launch Overhead
Each forward pass through the model launches hundreds of GPU kernels. Each launch has ~1-5 μs of overhead, and for a tiny 135M model, this adds up to 725 μs.
CUDA graphs solve this by recording the execution pattern once and replaying it. This improved our SmolLM2 throughput by 1.81x:
| Configuration | tok/s | Improvement |
|---|---|---|
| Without CUDA graphs | 615 | Baseline |
| With CUDA graphs | 1,110 | 1.81x |
For the larger Qwen3-8B, CUDA graphs help less (1.17x) because the model is memory-bound, not kernel-launch-bound.
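The launch-overhead figure is consistent with a simple product of kernel count and per-launch cost. The layer and kernel counts below are illustrative assumptions, not measurements:

```python
def launch_overhead_us(n_layers: int, kernels_per_layer: int, per_launch_us: float) -> float:
    """Total kernel-launch overhead per forward pass, without CUDA graphs."""
    return n_layers * kernels_per_layer * per_launch_us

# e.g. ~30 layers x ~8 kernels/layer x ~3 us/launch lands near the measured 725 us
print(launch_overhead_us(30, 8, 3.0))  # 720.0
```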
Wall 3: VRAM Bandwidth
For the large model, VRAM bandwidth IS the bottleneck:
- Qwen3-8B streams 6.7 GB per token → 3.7 ms per token → ~270 tok/s max
- Actual: 161 tok/s (60% efficiency — typical for real GPU workloads)
For the tiny model, VRAM bandwidth would allow 18,750 tok/s, but Walls 1 and 2 limit us to 1,110.
Wall 4: Compute
GPU compute is effectively free at these model sizes. The matrix multiplications take ~1 μs per token — negligible.
The L2 Cache Hypothesis
The RTX 5090 has a 96 MB L2 cache between the GPU cores and VRAM. If a model fits entirely in L2, it could theoretically avoid VRAM reads entirely:
VRAM bandwidth: 1.8 TB/s → 47 μs per SmolLM2 forward pass
L2 bandwidth: ~12 TB/s → 7 μs per forward pass
Speedup: 6.7x
But this 6.7x only applies to the memory portion. With CPU overhead at 854 μs, the L2 advantage becomes:
With VRAM: 854 + 47 = 901 μs → 1,110 tok/s
With L2: 854 + 7 = 861 μs → ~1,161 tok/s
Speedup: ~5%
The CPU round-trip dominates so completely that L2 residency barely matters in practice.
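This is Amdahl's law in miniature, and worth spelling out with the numbers from the text:

```python
def tok_per_s(overhead_us: float, mem_us: float) -> float:
    """Per-token throughput from fixed overhead plus memory-streaming time."""
    return 1e6 / (overhead_us + mem_us)

vram = tok_per_s(854, 47)  # ~1110 tok/s
l2 = tok_per_s(854, 7)     # ~1161 tok/s
print(f"{l2 / vram:.2f}x")  # a 6.7x memory win shrinks to ~5% overall
```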
What Prefill Reveals
Prefill (processing the input prompt) tells a different story:
| Mode | SmolLM2 tok/s | Parallelism |
|---|---|---|
| Prefill 512 tokens | 57,789 | 512x |
| Prefill 16 tokens | 3,938 | 16x |
| Generation (1 token) | 1,110 | 1x |
During prefill, 512 tokens are processed simultaneously — the GPU achieves 57,789 tok/s. This proves the hardware IS capable of massive throughput. The limitation is the autoregressive nature of generation: each token depends on the previous one, preventing parallelism.
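A toy model of that amortization: the fixed per-pass costs (CPU round-trip plus weight streaming, ~901 μs from the measurements above) are split across every token scored in the pass. It deliberately ignores per-token attention and compute costs, which is why measured prefill falls short of the ideal:

```python
def amortized_tok_per_s(tokens_per_pass: int, pass_cost_us: float = 901.0) -> float:
    """Fixed per-pass cost divided across all tokens scored in that pass."""
    return tokens_per_pass * 1e6 / pass_cost_us

print(round(amortized_tok_per_s(1)))    # ~1110 tok/s: matches generation
print(round(amortized_tok_per_s(512)))  # ideal ceiling; measured prefill is lower
```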
Why Purpose-Built ASICs Win
Inference-specific chips solve the memory wall architecturally:
Groq LPU
- 230 MB on-chip SRAM (no external memory)
- 80 TB/s internal bandwidth (44x GPU VRAM)
- Eliminates CPU round-trip — sampling happens on-die
- Result: 300+ tok/s on Llama 3 70B
Cerebras WSE-3
- 44 GB on-chip SRAM
- 21 PB/s on-chip bandwidth (11,600x GPU VRAM)
- Entire model lives on-chip
- Result: Thousands of tok/s
Taalas HC1
- Model weights encoded directly in silicon (3-bit custom)
- 17,000 tok/s on Llama 3.1 8B
- 105x faster than our RTX 5090
- No memory access at all — weights ARE the hardware
What This Means for GPU Deployments
1. Quantization is the Primary Lever
Since inference is memory-bound, reducing model size directly improves speed:
| Quantization | Speed improvement | Why |
|---|---|---|
| BF16 → Q6_K | ~2x | Half the data to stream |
| BF16 → Q4_K_M | ~2.5x | Even less data |
| BF16 → Q2 | ~3x | Diminishing returns (quality drops) |
Quantization doesn't sacrifice compute — it reduces the real bottleneck (memory reads).
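As a sketch, the ideal speedup is just the ratio of bytes streamed per token. The bits-per-weight values below are approximate llama.cpp figures, and real gains are smaller because of dequantization overhead and fixed costs:

```python
BITS_PER_WEIGHT = {"BF16": 16.0, "Q6_K": 6.56, "Q4_K_M": 4.85, "Q2_K": 2.96}

def ideal_speedup_vs_bf16(quant: str) -> float:
    """Memory-bound decode: speed scales inversely with bytes streamed per token."""
    return BITS_PER_WEIGHT["BF16"] / BITS_PER_WEIGHT[quant]

for q in ("Q6_K", "Q4_K_M", "Q2_K"):
    print(f"{q}: {ideal_speedup_vs_bf16(q):.1f}x")
```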
2. VRAM Amount < VRAM Bandwidth
When choosing a GPU for inference, bandwidth matters more than capacity:
| GPU | VRAM | Bandwidth | Expected 8B Q6_K tok/s |
|---|---|---|---|
| RTX 4090 | 24 GB | 1.0 TB/s | ~90 |
| RTX 5090 | 32 GB | 1.8 TB/s | ~160 |
| A100 | 80 GB | 2.0 TB/s | ~180 |
| H100 | 80 GB | 3.35 TB/s | ~300 |
The H100 has 1.9x the bandwidth of the RTX 5090, which translates directly to ~1.9x the inference speed.
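The whole table follows from one formula — bandwidth times a real-world efficiency factor, divided by model size — where the 60% efficiency comes from the measured RTX 5090 run:

```python
def expected_tok_per_s(bandwidth_tb_s: float, model_gb: float = 6.7,
                       efficiency: float = 0.6) -> float:
    """Estimated decode speed for a memory-bound model."""
    return bandwidth_tb_s * 1e12 * efficiency / (model_gb * 1e9)

for name, bw in [("RTX 4090", 1.0), ("RTX 5090", 1.8), ("A100", 2.0), ("H100", 3.35)]:
    print(f"{name}: ~{expected_tok_per_s(bw):.0f} tok/s")
```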
3. Batching is the Only Way to Use Compute
The GPU's compute power only helps with concurrent requests. With 8 concurrent users, the GPU can process 8 tokens simultaneously, filling more of its compute capacity:
| Concurrent users | GPU utilization | Aggregate tok/s |
|---|---|---|
| 1 | ~2% | 161 |
| 4 | ~8% | ~400 |
| 8 | ~15% | ~600 |
| 32 | ~50% | ~1,500 |
This is why vLLM with continuous batching matters for production — it's the only way to actually use the GPU you paid for.
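A toy batching model shows why aggregate throughput scales sub-linearly: weights stream once per step regardless of batch size, while each extra sequence adds only its own KV-cache reads and sampling cost. The two constants below are fit to the batch-1 and batch-32 rows, not measured directly:

```python
def aggregate_tok_per_s(batch: int, weight_stream_us: float = 5723.0,
                        per_seq_us: float = 488.0) -> float:
    """One decode step serves the whole batch; weights are read once per step."""
    step_us = weight_stream_us + batch * per_seq_us
    return batch * 1e6 / step_us

print(round(aggregate_tok_per_s(1)))   # ~161 tok/s
print(round(aggregate_tok_per_s(32)))  # ~1500 tok/s
```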
The Future: Where This Is Heading
- HBM4 (2026) — 6+ TB/s of bandwidth on next-generation accelerators could roughly double inference speed
- On-chip model caching — Larger L2/L3 caches could eventually fit quantized 1B models
- Speculative decoding — Use small draft models to generate candidate tokens in parallel, but requires vocabulary-aligned model pairs
- Inference ASICs — Dedicated chips that eliminate the CPU round-trip entirely
- Hybrid architectures — GPU + inference ASIC combos that handle training and serving optimally
The memory wall isn't going away, but the wall is moving. Every generation of hardware pushes the boundary, and creative software solutions (quantization, batching, speculative decoding) continue to extract more from existing hardware.
Key Takeaway
When planning an LLM deployment, think in terms of memory bandwidth, not compute:
- Your GPU cores are 98% idle during inference
- Quantization is the single most impactful optimization
- Batching (vLLM) is the only way to utilize compute
- Purpose-built ASICs are 10-100x faster because they solve the architecture problem
- For most businesses, a well-quantized model on a good GPU is more than sufficient