What is an LLM?
A Large Language Model (LLM) is a neural network trained on vast amounts of text data that can understand and generate human-like language. Models like GPT-4, Claude, Llama, and Qwen have billions of parameters — the learned weights that encode knowledge about language, reasoning, and the world.
But here's what matters for business: an LLM isn't just a chatbot. It's a reasoning engine that can be specialized to understand your products, your customers, and your domain.
Why Deploy Your Own LLM?
Using a third-party API (OpenAI, Anthropic) is the easiest path, but it comes with trade-offs:
| Approach | Cost per 1M tokens | Latency | Data Privacy | Customization |
|---|---|---|---|---|
| API (GPT-4o) | $2.50-$10 | 200-800ms TTFT | Data leaves your server | Limited |
| Self-hosted (8B model) | ~$0.05 (compute) | 18-130ms TTFT | Stays on your hardware | Full control |
For a business handling thousands of customer interactions daily, self-hosting can reduce costs by 50-200x while keeping sensitive product data and customer conversations on your own infrastructure.
The Architecture
A deployed LLM system has three layers:
User Browser → Your Web Server (PHP/Node/Python)
↓
Inference Engine (Ollama / vLLM)
↓
Fine-tuned Model + RAG Pipeline
1. The Model — A pre-trained LLM (e.g., Qwen3-8B) fine-tuned on your domain data using LoRA adapters. This gives it expertise in your products without retraining the full model.
2. RAG (Retrieval-Augmented Generation) — Before the model generates a response, your system searches a product database and injects relevant information into the prompt. This ensures the model always has accurate, up-to-date data.
3. The Inference Engine — Software that loads the model onto your GPU and serves requests. Ollama is the simplest option for single-user deployments; vLLM handles concurrent users efficiently.
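The flow through those three layers can be sketched in a few lines of Python. This is a minimal illustration, not production code: `retrieve` and `build_prompt` are hypothetical stand-ins for your RAG pipeline, and the naive keyword match would be replaced by BM25 or embedding search in a real system.

```python
def retrieve(query: str, products: list[dict], k: int = 3) -> list[dict]:
    """Naive keyword match; a real system would use BM25 or embeddings."""
    terms = query.lower().split()
    scored = [(sum(t in p["text"].lower() for t in terms), p) for p in products]
    scored.sort(key=lambda sp: sp[0], reverse=True)
    return [p for score, p in scored[:k] if score > 0]

def build_prompt(query: str, hits: list[dict]) -> str:
    """Inject retrieved product facts into the prompt before generation."""
    context = "\n".join(f"- {h['text']}" for h in hits)
    return f"Use only these product facts:\n{context}\n\nCustomer: {query}\nAssistant:"

products = [
    {"text": "Acme X200 router, $89, dual-band, in stock"},
    {"text": "Acme X500 router, $149, tri-band, ships in 2 days"},
]
prompt = build_prompt("How much is the X200 router?",
                      retrieve("X200 router price", products))
```

The assembled `prompt` is what actually reaches the inference engine, which is why the model can quote a correct price it was never trained on.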
Hardware Requirements
The GPU is the critical component. Model size determines minimum VRAM:
| Model Size | Quantization | VRAM Needed | Speed (single user) |
|---|---|---|---|
| 7-8B params | Q6_K (6-bit) | 6-8 GB | 150-200 tok/s |
| 7-8B params | Q4_K_M (4-bit) | 4-6 GB | 180-220 tok/s |
| 13B params | Q6_K | 11-14 GB | 80-120 tok/s |
| 70B params | Q4_K_M | 40-48 GB | 15-25 tok/s |
For most business applications, an 8B parameter model with 6-bit quantization hits the sweet spot — fast enough for real-time chat (150+ tokens/second) and small enough to run on consumer GPUs like the RTX 4090 or RTX 5090.
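The VRAM figures in the table follow from simple arithmetic: parameter count times bits per weight, divided by 8 to get bytes, plus headroom for the KV cache and runtime buffers. A back-of-envelope estimator (the ~15% overhead factor is an assumption for illustration, not a measured constant):

```python
def estimate_vram_gb(params_b: float, bits: int, overhead: float = 1.15) -> float:
    """Rough VRAM estimate: weight bytes = params * bits / 8,
    plus assumed ~15% overhead for KV cache and buffers."""
    weight_bytes = params_b * 1e9 * bits / 8
    return round(weight_bytes * overhead / 1e9, 1)

print(estimate_vram_gb(8, 6))   # 8B model at 6-bit quantization
print(estimate_vram_gb(70, 4))  # 70B model at 4-bit quantization
```

Plugging in the table's rows reproduces its ranges: an 8B model at 6-bit lands near 7 GB, a 70B model at 4-bit near 40 GB.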
Deployment Steps
Step 1: Choose Your Model
For domain-specific business use, start with a strong multilingual base model:
- Qwen3-8B — Excellent multilingual support, strong reasoning
- Llama 3.1 8B — Great English performance, large ecosystem
- Mistral 7B — Good balance of speed and quality
Step 2: Fine-tune with LoRA
LoRA (Low-Rank Adaptation) lets you train a small adapter (~130 MB) instead of the full model (~16 GB). You need:
- 5,000-25,000 training samples covering product Q&A, recommendations, and edge cases
- A GPU with 24-32 GB VRAM for training
- 3-6 hours of training time
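The size savings come from the math behind LoRA: instead of learning a full d×d weight update, it learns two thin matrices B (d×r) and A (r×d) whose product approximates the update, scaled by α/r. A toy plain-Python illustration (dimensions chosen for readability, not realism):

```python
def matmul(X, Y):
    """Plain-Python matrix multiply over lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def lora_update(W, B, A, alpha: float, r: int):
    """Effective weight W' = W + (alpha / r) * B @ A; W stays frozen."""
    scale = alpha / r
    delta = matmul(B, A)
    return [[w + scale * d for w, d in zip(wr, dr)] for wr, dr in zip(W, delta)]

d, r = 4, 1
W = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]  # frozen base
B = [[1.0]] + [[0.0]] * (d - 1)   # d x r, trainable
A = [[0.0, 1.0, 0.0, 0.0]]        # r x d, trainable
W_adapted = lora_update(W, B, A, alpha=2.0, r=r)

# A full update would train d*d = 16 values; LoRA trains d*r + r*d = 8.
```

At real scale (d in the thousands, r of 8-64), that ratio is what shrinks a ~16 GB retrain down to a ~130 MB adapter file.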
Step 3: Set Up RAG
Build a retrieval pipeline that searches your product database on every query:
- Index products with weighted fields (name 3x, category 2x, description 1x)
- Inject the top 3-5 matching products into the prompt context
- This grounds responses in accurate prices, specs, and availability without retraining
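The weighted-field indexing above can be sketched as a simple keyword scorer. The field names and 3x/2x/1x weights mirror the scheme in the list; a production system would layer BM25 or embedding similarity on top of this idea:

```python
FIELD_WEIGHTS = {"name": 3.0, "category": 2.0, "description": 1.0}

def score(product: dict, query: str) -> float:
    """Sum per-field term hits, weighted by each field's importance."""
    terms = query.lower().split()
    return sum(
        weight * sum(term in product.get(field, "").lower() for term in terms)
        for field, weight in FIELD_WEIGHTS.items()
    )

def top_matches(products: list[dict], query: str, k: int = 5) -> list[dict]:
    """Return the k best-scoring products, dropping zero-score entries."""
    ranked = sorted(products, key=lambda p: score(p, query), reverse=True)
    return [p for p in ranked[:k] if score(p, query) > 0]

catalog = [
    {"name": "Trail Runner 2", "category": "running shoes",
     "description": "lightweight"},
    {"name": "City Walker", "category": "casual shoes",
     "description": "running errands in comfort"},
]
hits = top_matches(catalog, "running shoes")
```

Because the `name` and `category` fields dominate the score, a match there outranks an incidental keyword buried in a description.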
Step 4: Deploy the Inference Engine
For a single-user chatbot on a website:
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh
# Import your fine-tuned model
ollama create mymodel -f Modelfile
# Serve on localhost
ollama serve
Your web backend (PHP, Python, Node.js) makes HTTP requests to localhost:11434/api/chat and streams responses to the browser.
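A minimal Python version of that backend call might look like the following. The endpoint and the NDJSON chunk shape (one JSON object per line, token text under `message.content`) match Ollama's /api/chat streaming API; error handling is omitted, and `stream_chat` requires a running `ollama serve`:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"

def extract_token(ndjson_line: str) -> str:
    """Pull the token text out of one streamed NDJSON chunk.
    Returns '' for the final done-marker chunk."""
    chunk = json.loads(ndjson_line)
    return chunk.get("message", {}).get("content", "")

def stream_chat(user_message: str, model: str = "mymodel"):
    """Yield tokens from Ollama as they arrive (needs a live server)."""
    payload = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "stream": True,
    }).encode()
    req = urllib.request.Request(OLLAMA_URL, data=payload,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        for line in resp:
            token = extract_token(line.decode())
            if token:
                yield token

# Parsing a sample chunk, without a live server:
sample = '{"message": {"role": "assistant", "content": "Hello"}, "done": false}'
```

Each token your backend yields can be forwarded to the browser as a Server-Sent Event, which is what drives the streaming chat interface.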
Step 5: Build the Chat Interface
The frontend uses Server-Sent Events (SSE) to stream tokens as they're generated:
const response = await fetch('/api/chat', {
method: 'POST',
body: JSON.stringify({ message: userInput })
});
const reader = response.body.getReader();
const decoder = new TextDecoder();
while (true) {
const { done, value } = await reader.read();
if (done) break;
// Append each decoded chunk to your chat window element
outputEl.textContent += decoder.decode(value, { stream: true });
}
This creates the familiar "typing" effect users expect from modern AI chat.
Performance You Can Expect
On a modern GPU with an 8B model:
| Metric | Typical Value |
|---|---|
| Time to first token | 0.1-0.3 seconds |
| Short response (50 tokens) | < 0.5 seconds |
| Typical response (150 tokens) | 0.8-1.2 seconds |
| Long response (400 tokens) | 2-3 seconds |
| RAG lookup time | < 1 ms |
Users experience near-instant responses — faster than most humans can read.
What's Next
Deploying an LLM is just the beginning. In follow-up articles, we'll cover:
- Fine-tuning strategies for domain expertise
- RAG optimization with BM25 vs. embedding search
- Quantization methods to balance speed and quality
- Security hardening against prompt injection attacks
The key takeaway: you don't need a data science team to deploy AI on your website. A single developer with a good GPU can build, fine-tune, and deploy a domain-specific AI assistant in a week.
Want to skip the infrastructure? See our managed AI service — we handle deployment, you focus on your business.