What is an LLM?
A Large Language Model (LLM) is a neural network trained on vast amounts of text data that can understand and generate human-like language. Models like GPT-4, Claude, Llama, and Qwen have billions of parameters — the learned weights that encode knowledge about language, reasoning, and the world.
But here's what matters for business: an LLM isn't just a chatbot. It's a reasoning engine that can be specialized to understand your products, your customers, and your domain.
Why Deploy Your Own LLM?
Using a third-party API (OpenAI, Anthropic) is the easiest path, but it comes with trade-offs:
| Approach | Cost per 1M tokens | Latency | Data Privacy | Customization |
|---|---|---|---|---|
| API (GPT-4o) | $2.50-$10 | 200-800ms TTFT | Data leaves your server | Limited |
| Self-hosted (8B model) | ~$0.05 (compute) | 18-130ms TTFT | Stays on your hardware | Full control |
For a business handling thousands of customer interactions daily, self-hosting can reduce costs by 50-200x while keeping sensitive product data and customer conversations on your own infrastructure.
The Architecture
A deployed LLM system has three layers:
User Browser → Your Web Server (PHP/Node/Python)
↓
Inference Engine (Ollama / vLLM)
↓
Fine-tuned Model + RAG Pipeline
1. The Model — A pre-trained LLM (e.g., Qwen3-8B) fine-tuned on your domain data using LoRA adapters. This gives it expertise in your products without retraining the full model.
2. RAG (Retrieval-Augmented Generation) — Before the model generates a response, your system searches a product database and injects relevant information into the prompt. This ensures the model always has accurate, up-to-date data.
3. The Inference Engine — Software that loads the model onto your GPU and serves requests. Ollama is the simplest option for single-user deployments; vLLM handles concurrent users efficiently.
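The flow through those three layers can be sketched in a few lines of Python. This is a minimal illustration, not production code: `retrieve` and `build_prompt` are hypothetical stand-ins for your RAG pipeline, and the naive keyword match would be replaced by BM25 or embedding search in a real system.

```python
def retrieve(query: str, products: list[dict], k: int = 3) -> list[dict]:
    """Naive keyword match; a real system would use BM25 or embeddings."""
    terms = query.lower().split()
    scored = [(sum(t in p["text"].lower() for t in terms), p) for p in products]
    scored.sort(key=lambda sp: sp[0], reverse=True)
    return [p for score, p in scored[:k] if score > 0]

def build_prompt(query: str, hits: list[dict]) -> str:
    """Inject retrieved product facts into the prompt before generation."""
    context = "\n".join(f"- {h['text']}" for h in hits)
    return f"Use only these product facts:\n{context}\n\nCustomer: {query}\nAssistant:"

products = [
    {"text": "Acme X200 router, $89, dual-band, in stock"},
    {"text": "Acme X500 router, $149, tri-band, ships in 2 days"},
]
prompt = build_prompt("How much is the X200 router?",
                      retrieve("X200 router price", products))
```

The assembled `prompt` is what actually reaches the inference engine, which is why the model can quote a correct price it was never trained on.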
Hardware Requirements
The GPU is the critical component. Model size determines minimum VRAM:
| Model Size | Quantization | VRAM Needed | Speed (single user) |
|---|---|---|---|
| 7-8B params | Q6_K (6-bit) | 6-8 GB | 150-200 tok/s |
| 7-8B params | Q4_K_M (4-bit) | 4-6 GB | 180-220 tok/s |
| 13B params | Q6_K | 11-14 GB | 80-120 tok/s |
| 70B params | Q4_K_M | 40-48 GB | 15-25 tok/s |
For most business applications, an 8B parameter model with 6-bit quantization hits the sweet spot — fast enough for real-time chat (150+ tokens/second) and small enough to run on consumer GPUs like the RTX 4090 or RTX 5090.
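The VRAM figures in the table follow from simple arithmetic: parameter count times bits per weight, divided by 8 to get bytes, plus headroom for the KV cache and runtime buffers. A back-of-envelope estimator (the ~15% overhead factor is an assumption for illustration, not a measured constant):

```python
def estimate_vram_gb(params_b: float, bits: int, overhead: float = 1.15) -> float:
    """Rough VRAM estimate: weight bytes = params * bits / 8,
    plus assumed ~15% overhead for KV cache and buffers."""
    weight_bytes = params_b * 1e9 * bits / 8
    return round(weight_bytes * overhead / 1e9, 1)

print(estimate_vram_gb(8, 6))   # 8B model at 6-bit quantization
print(estimate_vram_gb(70, 4))  # 70B model at 4-bit quantization
```

Plugging in the table's rows reproduces its ranges: an 8B model at 6-bit lands near 7 GB, a 70B model at 4-bit near 40 GB.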
Deployment Steps
Step 1: Choose Your Model
For domain-specific business use, start with a strong multilingual base model:
- Qwen3-8B — Excellent multilingual support, strong reasoning
- Llama 3.1 8B — Great English performance, large ecosystem
- Mistral 7B — Good balance of speed and quality
Step 2: Fine-tune with LoRA
LoRA (Low-Rank Adaptation) lets you train a small adapter (~130 MB) instead of the full model (~16 GB). You need:
- 5,000-25,000 training samples covering product Q&A, recommendations, and edge cases
- A GPU with 24-32 GB VRAM for training
- 3-6 hours of training time
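The size savings come from the math behind LoRA: instead of learning a full d×d weight update, it learns two thin matrices B (d×r) and A (r×d) whose product approximates the update, scaled by α/r. A toy plain-Python illustration (dimensions chosen for readability, not realism):

```python
def matmul(X, Y):
    """Plain-Python matrix multiply over lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def lora_update(W, B, A, alpha: float, r: int):
    """Effective weight W' = W + (alpha / r) * B @ A; W stays frozen."""
    scale = alpha / r
    delta = matmul(B, A)
    return [[w + scale * d for w, d in zip(wr, dr)] for wr, dr in zip(W, delta)]

d, r = 4, 1
W = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]  # frozen base
B = [[1.0]] + [[0.0]] * (d - 1)   # d x r, trainable
A = [[0.0, 1.0, 0.0, 0.0]]        # r x d, trainable
W_adapted = lora_update(W, B, A, alpha=2.0, r=r)

# A full update would train d*d = 16 values; LoRA trains d*r + r*d = 8.
```

At real scale (d in the thousands, r of 8-64), that ratio is what shrinks a ~16 GB retrain down to a ~130 MB adapter file.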
Step 3: Set Up RAG
Build a retrieval pipeline that searches your product database on every query:
- Index products with weighted fields (name 3x, category 2x, description 1x)
- Inject the top 3-5 matching products into the prompt context
- This grounds responses in accurate prices, specs, and availability without retraining
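The weighted-field indexing above can be sketched as a simple keyword scorer. The field names and 3x/2x/1x weights mirror the scheme in the list; a production system would layer BM25 or embedding similarity on top of this idea:

```python
FIELD_WEIGHTS = {"name": 3.0, "category": 2.0, "description": 1.0}

def score(product: dict, query: str) -> float:
    """Sum per-field term hits, weighted by each field's importance."""
    terms = query.lower().split()
    return sum(
        weight * sum(term in product.get(field, "").lower() for term in terms)
        for field, weight in FIELD_WEIGHTS.items()
    )

def top_matches(products: list[dict], query: str, k: int = 5) -> list[dict]:
    """Return the k best-scoring products, dropping zero-score entries."""
    ranked = sorted(products, key=lambda p: score(p, query), reverse=True)
    return [p for p in ranked[:k] if score(p, query) > 0]

catalog = [
    {"name": "Trail Runner 2", "category": "running shoes",
     "description": "lightweight"},
    {"name": "City Walker", "category": "casual shoes",
     "description": "running errands in comfort"},
]
hits = top_matches(catalog, "running shoes")
```

Because the `name` and `category` fields dominate the score, a match there outranks an incidental keyword buried in a description.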
Step 4: Deploy the Inference Engine
For a single-user chatbot on a website:
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh
# Import your fine-tuned model
ollama create mymodel -f Modelfile
# Serve on localhost
ollama serve
Your web backend (PHP, Python, Node.js) makes HTTP requests to localhost:11434/api/chat and streams responses to the browser.
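A minimal Python version of that backend call might look like the following. The endpoint and the NDJSON chunk shape (one JSON object per line, token text under `message.content`) match Ollama's /api/chat streaming API; error handling is omitted, and `stream_chat` requires a running `ollama serve`:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"

def extract_token(ndjson_line: str) -> str:
    """Pull the token text out of one streamed NDJSON chunk.
    Returns '' for the final done-marker chunk."""
    chunk = json.loads(ndjson_line)
    return chunk.get("message", {}).get("content", "")

def stream_chat(user_message: str, model: str = "mymodel"):
    """Yield tokens from Ollama as they arrive (needs a live server)."""
    payload = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "stream": True,
    }).encode()
    req = urllib.request.Request(OLLAMA_URL, data=payload,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        for line in resp:
            token = extract_token(line.decode())
            if token:
                yield token

# Parsing a sample chunk, without a live server:
sample = '{"message": {"role": "assistant", "content": "Hello"}, "done": false}'
```

Each token your backend yields can be forwarded to the browser as a Server-Sent Event, which is what drives the streaming chat interface.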
Step 5: Build the Chat Interface
The frontend uses Server-Sent Events (SSE) to stream tokens as they're generated:
const response = await fetch('/api/chat', {
method: 'POST',
body: JSON.stringify({ message: userInput })
});
const reader = response.body.getReader();
const decoder = new TextDecoder();
while (true) {
const { done, value } = await reader.read();
if (done) break;
// Append each decoded chunk to your chat window element
outputEl.textContent += decoder.decode(value, { stream: true });
}
This creates the familiar "typing" effect users expect from modern AI chat.
Performance You Can Expect
On a modern GPU with an 8B model:
| Metric | Typical Value |
|---|---|
| Time to first token | 0.1-0.3 seconds |
| Short response (50 tokens) | < 0.5 seconds |
| Typical response (150 tokens) | 0.8-1.2 seconds |
| Long response (400 tokens) | 2-3 seconds |
| RAG lookup time | < 1 ms |
Users experience near-instant responses — faster than most humans can read.
What's Next
Deploying an LLM is just the beginning. In follow-up articles, we'll cover:
- Fine-tuning strategies for domain expertise
- RAG optimization with BM25 vs. embedding search
- Quantization methods to balance speed and quality
- Security hardening against prompt injection attacks
The key takeaway: you don't need a data science team to deploy AI on your website. A single developer with a good GPU can build, fine-tune, and deploy a domain-specific AI assistant in a week.
Want to skip the infrastructure? See our managed AI service — we handle deployment, you focus on your business.