The Question Every Deployer Asks
You've fine-tuned your model, built your RAG pipeline, and you're ready to serve users. But which inference engine should you use?
Ollama is the simple choice — install, import model, serve. It runs GGUF quantized models and "just works."
vLLM is the production choice — continuous batching, PagedAttention, and optimized GPU scheduling. But it's complex to set up and has a steeper learning curve.
We ran extensive benchmarks on identical hardware to find where each wins.
Test Setup
| Component | Specification |
|---|---|
| GPU | RTX 5090 (32 GB VRAM, Blackwell) |
| Model | Qwen3-8B |
| Ollama format | GGUF Q6_K (6.7 GB) |
| vLLM format | NVFP4 (6.4 GB) |
| Prompt | 256 tokens input, 256 tokens output |
| Decoding | Greedy (temperature=0) |
Both models produce near-identical output quality at these quantization levels.
Single-User Performance
| Metric | Ollama Q6_K | vLLM NVFP4 |
|---|---|---|
| Token generation speed | 161 tok/s | 68 tok/s |
| Time to first token (TTFT) | 132 ms | 18 ms |
| Total time (256 tokens) | 1.6s | 3.8s |
| VRAM usage | 6.7 GB | 6.4 GB |
For a single user, Ollama generates tokens 2.4x faster. The full 256-token response streams in under 2 seconds, versus nearly 4 seconds for vLLM.
vLLM wins on time to first token (18 ms vs 132 ms) thanks to its optimized prefill pipeline, but that head start is irrelevant when total generation takes more than twice as long.
Why Is vLLM Slower Single-User?
vLLM is a production server, not a speed demon. Every request goes through:
- Scheduler — Allocates KV cache blocks, manages priority queues
- Chunked prefill — Processes input in chunks for better batching
- CUDA graph capture — Compiles execution graphs for consistent performance
- KV cache management — PagedAttention allocates and frees memory blocks
This machinery costs time on every token step. At 68 tok/s, each step takes only about 15 ms, and scheduling overhead accounts for a meaningful share of it. Ollama's simpler architecture (direct llama.cpp → GPU) skips all of this.
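A back-of-the-envelope sketch, using only the throughput figures from the tables above, shows how much time per token the extra machinery could be consuming:

```python
# Back-of-the-envelope check: per-token step time implied by the
# measured throughput figures above (illustrative arithmetic only).

def ms_per_token(tokens_per_second: float) -> float:
    """Average wall-clock milliseconds spent per generated token."""
    return 1000.0 / tokens_per_second

ollama_step = ms_per_token(161)  # ~6.2 ms per token
vllm_step = ms_per_token(68)     # ~14.7 ms per token

# The gap between the two step times is the budget available for
# vLLM's scheduler, chunked prefill, and KV-cache bookkeeping.
overhead_budget = vllm_step - ollama_step  # ~8.5 ms per token

print(f"Ollama: {ollama_step:.1f} ms/token")
print(f"vLLM:   {vllm_step:.1f} ms/token")
print(f"Implied overhead: {overhead_budget:.1f} ms/token")
```

This attributes the entire gap to scheduling overhead, which is a simplification (kernel choice and quantization format also differ), but it puts an upper bound on what the server machinery costs per step.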
Multi-User Performance: Where vLLM Wins
Everything changes with concurrent users:
| Concurrent Users | Ollama (tok/s) | vLLM (tok/s) | vLLM Advantage |
|---|---|---|---|
| 1 | 161 | 68 | Ollama 2.4x |
| 2 | 165 | 145 | Close |
| 4 | 168 | 265 | vLLM 1.6x |
| 8 | 173 | 332 | vLLM 1.9x |
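The "vLLM Advantage" column is just the ratio of the two throughput columns; a quick sketch reproduces it from the measured numbers:

```python
# Recompute the "vLLM Advantage" ratios from the measured throughput
# numbers in the table above (illustrative arithmetic only).

measurements = {  # concurrent users -> (ollama_tok_s, vllm_tok_s)
    1: (161, 68),
    2: (165, 145),
    4: (168, 265),
    8: (173, 332),
}

for users, (ollama, vllm) in measurements.items():
    if vllm >= ollama:
        print(f"{users} users: vLLM {vllm / ollama:.1f}x faster")
    else:
        print(f"{users} users: Ollama {ollama / vllm:.1f}x faster")
```

Note that Ollama's aggregate throughput barely moves (161 → 173 tok/s) as users are added, while vLLM's scales almost linearly until the GPU saturates.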
Time to First Token Under Load
This is where the gap becomes dramatic:
| Concurrent Users | Ollama TTFT | vLLM TTFT | Difference |
|---|---|---|---|
| 1 | 132 ms | 18 ms | 7x |
| 4 | 3,200 ms | 22 ms | 145x |
| 8 | 7,012 ms | 26 ms | 270x |
With 8 concurrent users, Ollama makes users wait 7 seconds before seeing the first token. vLLM serves everyone in 26 milliseconds.
Why the Dramatic Difference?
Ollama processes requests sequentially. User 8 waits for users 1-7 to complete before generation starts. This creates a queue that grows linearly with users.
vLLM uses continuous batching — it processes all requests simultaneously on the GPU, sharing compute across users. The GPU utilization goes up, latency stays flat.
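A toy model of the two scheduling strategies makes the TTFT trend concrete. It deliberately simplifies: every request generates the same number of tokens, prefill is free for the sequential engine, and batching is perfect for the batched one.

```python
# Toy model: TTFT for the *last* user under sequential vs. batched
# scheduling. Simplified assumptions: fixed request length, free
# prefill in the sequential case, perfect continuous batching.

def sequential_ttft(users: int, request_seconds: float) -> float:
    """Sequential queue: the last user waits for everyone ahead."""
    return (users - 1) * request_seconds

def batched_ttft(users: int, prefill_seconds: float) -> float:
    """Continuous batching: everyone starts almost immediately."""
    return prefill_seconds  # roughly flat regardless of load

request_s = 256 / 161  # ~1.6 s per Ollama request, from the tables
for users in (1, 4, 8):
    print(f"{users} users: sequential {sequential_ttft(users, request_s):.1f}s, "
          f"batched {batched_ttft(users, 0.02) * 1000:.0f}ms")
```

The toy model actually overstates Ollama's queue at 8 users (it predicts ~11 s versus the measured 7 s, since the real server overlaps some work), but the linear-vs-flat shape matches the measurements.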
The Decision Framework
┌─────────────────────────────────────┐
│ How many concurrent users? │
├───────────────┬─────────────────────┤
│ 1-2 users │ → Use Ollama │
│ │ Simpler, faster │
├───────────────┼─────────────────────┤
│ 3-7 users │ → Either works │
│ │ vLLM if TTFT │
│ │ matters │
├───────────────┼─────────────────────┤
│ 8+ users │ → Use vLLM │
│ │ Mandatory for │
│ │ acceptable UX │
└───────────────┴─────────────────────┘
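If you want this decision in code, the framework above can be encoded as a small helper. The thresholds mirror the diagram; the function name and signature are our own illustration, not an API of either project.

```python
# Encode the decision framework above as a function. Thresholds
# mirror the diagram; the function itself is just an illustration.

def pick_engine(concurrent_users: int, ttft_critical: bool = False) -> str:
    """Return a recommended engine for a given level of concurrency."""
    if concurrent_users <= 2:
        return "ollama"          # simpler, and faster single-user
    if concurrent_users <= 7:
        return "vllm" if ttft_critical else "either"
    return "vllm"                # mandatory for acceptable UX at 8+

print(pick_engine(1))        # -> ollama
print(pick_engine(5))        # -> either
print(pick_engine(5, True))  # -> vllm
print(pick_engine(12))       # -> vllm
```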
Setup Complexity Comparison
Ollama Setup (5 minutes)
```shell
# Install
pacman -S ollama-cuda   # or the curl install script

# Create a Modelfile pointing at your GGUF
echo 'FROM ./model.gguf' > Modelfile
ollama create mymodel -f Modelfile

# Serve
ollama serve
```
Done. Your model is available at localhost:11434.
vLLM Setup (1-2 hours)
```shell
# Requires a specific PyTorch version
pip install vllm   # may downgrade your torch

# Launch with NVFP4 quantization
vllm serve model-path \
  --quantization nvfp4 \
  --max-model-len 4096 \
  --gpu-memory-utilization 0.85
```
Plus: PyTorch version conflicts, CUDA compatibility issues, model format conversion, and debugging startup failures.
Multi-LoRA: vLLM's Hidden Advantage
If you need to serve multiple specialized models, vLLM has a decisive advantage:
| Scenario | Ollama | vLLM |
|---|---|---|
| 5 different models | 5 × 6.7 GB = 33.5 GB | 1 base + 5 LoRAs = 5.7 GB |
| Adapter swap time | Load new GGUF (seconds) | Hot-swap LoRA (milliseconds) |
| Max models on 32GB | 4-5 | 20+ |
vLLM shares one base model across all LoRA adapters, swapping them per-request with near-zero overhead. Ollama must load an entirely new model file for each adapter.
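The VRAM arithmetic behind the table can be sketched directly. The per-adapter size here is an assumption for illustration (LoRA adapters for an 8B model typically weigh tens to hundreds of megabytes, depending on rank), not a measurement from our benchmark.

```python
# VRAM math for serving N specialized variants. The adapter size is
# an assumed illustrative figure, not a measured one.

FULL_MODEL_GB = 6.7  # one GGUF Q6_K copy (from the test setup)
BASE_MODEL_GB = 6.4  # one NVFP4 base (from the test setup)
ADAPTER_GB = 0.1     # assumed per-LoRA footprint

def ollama_vram(n_models: int) -> float:
    """Ollama loads a full model file per variant."""
    return n_models * FULL_MODEL_GB

def vllm_vram(n_adapters: int) -> float:
    """vLLM shares one base model across all LoRA adapters."""
    return BASE_MODEL_GB + n_adapters * ADAPTER_GB

print(f"Ollama, 5 models: {ollama_vram(5):.1f} GB")  # over a 32 GB card
print(f"vLLM, 5 adapters: {vllm_vram(5):.1f} GB")
```

Exact totals depend on adapter rank and base quantization, but the shape of the result doesn't change: Ollama's footprint grows by a full model per variant, vLLM's by a small adapter.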
Quality Comparison
With greedy decoding (temperature=0), both engines produce near-identical output. The differences come from quantization format, not the engine:
| Benchmark | Q6_K (Ollama) | NVFP4 (vLLM) | Verdict |
|---|---|---|---|
| MMLU (general) | ~98% of FP16 | 97.5% of FP16 | Both excellent |
| GSM8K (math) | ~98% | 99.4% | Both excellent |
| Hard reasoning | ~98% | ~80% | Q6_K wins |
For most business applications, both are indistinguishable. Q6_K has an edge on hard reasoning tasks and low-resource language accuracy.
Our Recommendation
Start with Ollama. It's simpler, faster for single users, and you can be up and running in 5 minutes.
Switch to vLLM when:
- You have 4+ concurrent users regularly
- TTFT under load is unacceptable (> 2 seconds)
- You need to serve multiple LoRA adapters efficiently
- You're scaling to production with load balancing
The migration path is straightforward: export your model in the right format, write a new config, and swap the endpoint URL. Your frontend doesn't need to change.
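Both engines expose an OpenAI-compatible endpoint (Ollama serves one under /v1 on its default port 11434; vLLM's server defaults to port 8000), so the client-side swap can be as small as one base URL. A minimal sketch, with ports taken from the defaults and a helper name of our own:

```python
# Migration sketch: the only client-side change is the base URL.
# Both engines speak the OpenAI-compatible chat API under /v1.
# Ports shown are each engine's default; adjust for your deployment.

ENDPOINTS = {
    "ollama": "http://localhost:11434/v1",
    "vllm": "http://localhost:8000/v1",
}

def client_config(engine: str, model: str) -> dict:
    """Build the connection settings the frontend needs."""
    return {
        "base_url": ENDPOINTS[engine],
        "model": model,
        "api_key": "unused",  # local servers accept a placeholder key
    }

print(client_config("ollama", "mymodel")["base_url"])
print(client_config("vllm", "model-path")["base_url"])
```

Everything else in the request payload (messages, temperature, streaming) stays the same, which is what makes the migration a config change rather than a rewrite.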