
vLLM vs Ollama: When Do Advanced Serving Frameworks Win?

ai.rs Jan 22, 2026

The Question Every Deployer Asks

You've fine-tuned your model, built your RAG pipeline, and you're ready to serve users. But which inference engine should you use?

Ollama is the simple choice — install, pull or create a model, serve. It runs GGUF-quantized models and "just works."

vLLM is the production choice — continuous batching, PagedAttention, and optimized GPU scheduling. But it's complex to set up and has a steeper learning curve.

We ran extensive benchmarks on identical hardware to find where each wins.

Test Setup

Component      Specification
GPU            RTX 5090 (32 GB VRAM, Blackwell)
Model          Qwen3-8B
Ollama format  GGUF Q6_K (6.7 GB)
vLLM format    NVFP4 (6.4 GB)
Prompt         256 tokens input, 256 tokens output
Decoding       Greedy (temperature=0)

Both models produce near-identical output quality at these quantization levels.

Single-User Performance

Metric                      Ollama Q6_K  vLLM NVFP4
Token generation speed      161 tok/s    68 tok/s
Time to first token (TTFT)  132 ms       18 ms
Total time (256 tokens)     1.6 s        3.8 s
VRAM usage                  6.7 GB       6.4 GB

For a single user, Ollama generates tokens 2.4x faster. A 256-token response streams in about 1.6 seconds versus vLLM's nearly 4.

vLLM wins on TTFT (18 ms vs 132 ms) thanks to its optimized prefill pipeline, but that head start is swamped when total generation takes more than twice as long.
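The single-user totals follow directly from TTFT plus decode time. A quick sanity check using the numbers from the table:

```python
# End-to-end latency ≈ TTFT + tokens / decode rate.
def total_latency(ttft_s: float, tokens: int, tok_per_s: float) -> float:
    return ttft_s + tokens / tok_per_s

ollama = total_latency(0.132, 256, 161)  # ≈ 1.72 s, close to the measured 1.6 s
vllm = total_latency(0.018, 256, 68)     # ≈ 3.78 s, matching the measured 3.8 s
print(f"Ollama: {ollama:.2f} s, vLLM: {vllm:.2f} s")
```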

Why Is vLLM Slower for a Single User?

vLLM is a production server, not a speed demon. Every request goes through:

  1. Scheduler — Allocates KV cache blocks, manages priority queues
  2. Chunked prefill — Processes input in chunks for better batching
  3. CUDA graph capture — Compiles execution graphs for consistent performance
  4. KV cache management — PagedAttention allocates and frees memory blocks

This machinery roughly doubles the cost of each token step. Ollama's simpler architecture (llama.cpp straight to the GPU) skips all of it.
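The per-token budgets implied by the measured decode rates make this concrete (a back-of-envelope estimate from the benchmark numbers, not a profile of vLLM's internals):

```python
# Per-token time budgets implied by the measured single-user decode rates.
ollama_ms = 1000 / 161  # ≈ 6.2 ms per token
vllm_ms = 1000 / 68     # ≈ 14.7 ms per token
overhead_ms = vllm_ms - ollama_ms  # ≈ 8.5 ms of extra work per token step
print(f"Ollama {ollama_ms:.1f} ms/tok, vLLM {vllm_ms:.1f} ms/tok, "
      f"delta {overhead_ms:.1f} ms")
```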

Multi-User Performance: Where vLLM Wins

Everything changes with concurrent users:

Concurrent users  Ollama (tok/s)  vLLM (tok/s)  Advantage
1                 161             68            Ollama 2.4x
2                 165             145           Close
4                 168             265           vLLM 1.6x
8                 173             332           vLLM 1.9x

Time to First Token Under Load

This is where the gap becomes dramatic:

Concurrent users  Ollama TTFT  vLLM TTFT  Difference
1                 132 ms       18 ms      7x
4                 3,200 ms     22 ms      145x
8                 7,012 ms     26 ms      270x

With 8 concurrent users, Ollama makes users wait 7 seconds before seeing the first token. vLLM serves everyone in 26 milliseconds.

Why the Dramatic Difference?

Ollama (with default settings) processes requests one at a time. User 8 waits for users 1-7 to finish before its generation even starts, so the queue — and the TTFT — grows linearly with the number of users.

vLLM uses continuous batching: it interleaves all active requests on the GPU, sharing compute across users. GPU utilization goes up while latency stays nearly flat.
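A toy queueing model reproduces the shape of the TTFT table (the constants are illustrative, not the benchmark's internals):

```python
# Toy model: sequential serving vs. continuous batching.
# Assumed constants (illustrative only): each request needs a prefill step
# before its first token, then a decode phase. The sequential server finishes
# each request before starting the next; the batched server starts every
# request's prefill almost immediately.
PREFILL_S = 0.13   # time before the first token of one request
DECODE_S = 1.6     # time to stream the remaining tokens

def sequential_ttft(user_index: int) -> float:
    """TTFT for the Nth user (0-based) when requests run one at a time."""
    return user_index * (PREFILL_S + DECODE_S) + PREFILL_S

def batched_ttft(user_index: int) -> float:
    """TTFT under continuous batching: everyone's prefill starts right away."""
    return PREFILL_S  # roughly flat, independent of queue depth

for n in (1, 4, 8):
    print(f"{n} users: sequential worst-case {sequential_ttft(n - 1):.2f} s, "
          f"batched {batched_ttft(n - 1):.2f} s")
```

The sequential curve grows linearly with queue depth while the batched curve stays flat, which is exactly the pattern in the measured numbers.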

The Decision Framework

┌─────────────────────────────────────┐
│ How many concurrent users?          │
├───────────────┬─────────────────────┤
│ 1-2 users     │ → Use Ollama        │
│               │   Simpler, faster   │
├───────────────┼─────────────────────┤
│ 3-7 users     │ → Either works      │
│               │   vLLM if TTFT      │
│               │   matters           │
├───────────────┼─────────────────────┤
│ 8+ users      │ → Use vLLM          │
│               │   Mandatory for     │
│               │   acceptable UX     │
└───────────────┴─────────────────────┘
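The flowchart reduces to a few lines of code (a sketch: the thresholds are this article's, and the `needs_low_ttft` / `multi_lora` flags are hypothetical names):

```python
def choose_engine(concurrent_users: int,
                  needs_low_ttft: bool = False,
                  multi_lora: bool = False) -> str:
    """Pick an inference engine per the decision framework above."""
    if multi_lora or concurrent_users >= 8:
        return "vllm"
    if concurrent_users >= 3:
        # Gray zone: either works; prefer vLLM when TTFT matters.
        return "vllm" if needs_low_ttft else "either"
    return "ollama"

print(choose_engine(2))                       # ollama
print(choose_engine(5, needs_low_ttft=True))  # vllm
print(choose_engine(10))                      # vllm
```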

Setup Complexity Comparison

Ollama Setup (5 minutes)

# Install
pacman -S ollama-cuda  # or curl install script

# Create Modelfile
echo 'FROM ./model.gguf' > Modelfile
ollama create mymodel -f Modelfile

# Serve
ollama serve

Done. Your model is available at localhost:11434.
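From there, the model answers plain HTTP on port 11434. A minimal client sketch against Ollama's `/api/generate` endpoint (`mymodel` matches the Modelfile above; the request only succeeds if the server is actually running, so the sketch degrades gracefully otherwise):

```python
import json
import urllib.error
import urllib.request

# Non-streaming generate request against the local Ollama server.
payload = json.dumps({
    "model": "mymodel",
    "prompt": "Why is the sky blue?",
    "stream": False,
}).encode()

req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=payload,
    headers={"Content-Type": "application/json"},
)

try:
    with urllib.request.urlopen(req, timeout=5) as resp:
        print(json.loads(resp.read())["response"])
except (urllib.error.URLError, OSError):
    print("Ollama is not running on localhost:11434")
```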

vLLM Setup (1-2 hours)

# Requires specific PyTorch version
pip install vllm  # May downgrade your torch

# Launch with NVFP4 quantization
vllm serve model-path \
    --quantization nvfp4 \
    --max-model-len 4096 \
    --gpu-memory-utilization 0.85

Plus: PyTorch version conflicts, CUDA compatibility issues, model format conversion, and debugging startup failures.

Multi-LoRA: vLLM's Hidden Advantage

If you need to serve multiple specialized models, vLLM has a decisive advantage:

Scenario             Ollama                   vLLM
5 different models   5 × 6.7 GB = 33.5 GB     1 base (6.4 GB) + 5 LoRAs ≈ 6.7 GB
Adapter swap time    Load new GGUF (seconds)  Hot-swap LoRA (milliseconds)
Max models on 32 GB  4-5                      20+

vLLM shares one base model across all LoRA adapters, swapping them per-request with near-zero overhead. Ollama must load an entirely new model file for each adapter.
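The VRAM arithmetic is simple (the adapter size is an assumption — low-rank LoRA adapters for an 8B model typically weigh tens of megabytes):

```python
# VRAM needed to serve 5 specialized variants of the same 8B model.
GGUF_MODEL_GB = 6.7     # full Q6_K model per variant (Ollama)
BASE_NVFP4_GB = 6.4     # one shared NVFP4 base (vLLM)
LORA_ADAPTER_GB = 0.06  # assumed ~60 MB per adapter

ollama_vram = 5 * GGUF_MODEL_GB                   # 33.5 GB — over a 32 GB card
vllm_vram = BASE_NVFP4_GB + 5 * LORA_ADAPTER_GB   # ≈ 6.7 GB
print(f"Ollama: {ollama_vram:.1f} GB, vLLM: {vllm_vram:.1f} GB")
```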

Quality Comparison

With greedy decoding (temperature=0), both engines produce near-identical output. The differences come from quantization format, not the engine:

Benchmark        Q6_K (Ollama)  NVFP4 (vLLM)   Verdict
MMLU (general)   ~98% of FP16   97.5% of FP16  Both excellent
GSM8K (math)     ~98%           99.4%          Both excellent
Hard reasoning   ~98%           ~80%           Q6_K wins

For most business applications, both are indistinguishable. Q6_K has an edge on hard reasoning tasks and low-resource language accuracy.

Our Recommendation

Start with Ollama. It's simpler, faster for single users, and you can be up and running in 5 minutes.

Switch to vLLM when:

  1. You have 4+ concurrent users regularly
  2. TTFT under load is unacceptable (> 2 seconds)
  3. You need to serve multiple LoRA adapters efficiently
  4. You're scaling to production with load balancing

The migration path is straightforward: export your model in the right format, write a new config, and swap the endpoint URL. Your frontend doesn't need to change.
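Both engines expose an OpenAI-compatible chat endpoint under `/v1`, so the swap can be as small as one base URL (a sketch; the model name is a placeholder and the ports are the engines' defaults):

```python
import json
import urllib.request

# The only line that changes when migrating from Ollama to vLLM:
BASE_URL = "http://localhost:11434/v1"   # Ollama
# BASE_URL = "http://localhost:8000/v1"  # vLLM (default port)

def chat(prompt: str, model: str = "mymodel") -> urllib.request.Request:
    """Build an OpenAI-style chat request; the frontend code stays identical."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )

req = chat("Hello")
print(req.full_url)  # http://localhost:11434/v1/chat/completions
```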

