What is an LLM and How to Deploy It on Your Website

ai.rs Dec 25, 2025

What is an LLM?

A Large Language Model (LLM) is a neural network trained on vast amounts of text data that can understand and generate human-like language. Models like GPT-4, Claude, Llama, and Qwen have billions of parameters — the learned weights that encode knowledge about language, reasoning, and the world.

But here's what matters for business: an LLM isn't just a chatbot. It's a reasoning engine that can be specialized to understand your products, your customers, and your domain.

Why Deploy Your Own LLM?

Using a third-party API (OpenAI, Anthropic) is the easiest path, but it comes with trade-offs:

Approach               Cost per 1M tokens   Latency           Data Privacy              Customization
API (GPT-4o)           $2.50-$10            200-800 ms TTFT   Data leaves your server   Limited
Self-hosted (8B model) ~$0.05 (compute)     18-130 ms TTFT    Stays on your hardware    Full control

For a business handling thousands of customer interactions daily, self-hosting can reduce costs by 50-200x while keeping sensitive product data and customer conversations on your own infrastructure.
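
The 50-200x figure is straight arithmetic on the table above:

```python
API_COST = (2.50, 10.00)   # $/1M tokens, GPT-4o pricing tier from the table
SELF_HOSTED = 0.05         # $/1M tokens, amortized compute for an 8B model

low, high = (c / SELF_HOSTED for c in API_COST)
print(f"{low:.0f}x-{high:.0f}x")  # → 50x-200x
```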

The Architecture

A deployed LLM system has three layers:

User Browser  →  Your Web Server (PHP/Node/Python)
                      ↓
                 Inference Engine (Ollama / vLLM)
                      ↓
                 Fine-tuned Model + RAG Pipeline

1. The Model — A pre-trained LLM (e.g., Qwen3-8B) fine-tuned on your domain data using LoRA adapters. This gives it expertise in your products without retraining the full model.

2. RAG (Retrieval-Augmented Generation) — Before the model generates a response, your system searches a product database and injects relevant information into the prompt. This ensures the model always has accurate, up-to-date data.
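
In code, the injection step is just prompt assembly. A minimal sketch, with hypothetical field names (`name`, `price`, `availability`):

```python
def build_prompt(question, products):
    """Prepend retrieved product records to the user's question."""
    context = "\n".join(
        f"- {p['name']}: {p['price']} ({p['availability']})" for p in products
    )
    return (
        "You are a product assistant. Answer using only the data below.\n\n"
        f"Products:\n{context}\n\n"
        f"Customer question: {question}"
    )
```

The model never has to "remember" your catalog; it reads the current records on every request.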

3. The Inference Engine — Software that loads the model onto your GPU and serves requests. Ollama is the simplest option for single-user deployments; vLLM handles concurrent users efficiently.

Hardware Requirements

The GPU is the critical component. Model size determines minimum VRAM:

Model Size    Quantization     VRAM Needed   Speed (single user)
7-8B params   Q6_K (6-bit)     6-8 GB        150-200 tok/s
7-8B params   Q4_K_M (4-bit)   4-6 GB        180-220 tok/s
13B params    Q6_K             11-14 GB      80-120 tok/s
70B params    Q4_K_M           40-48 GB      15-25 tok/s

For most business applications, an 8B parameter model with 6-bit quantization hits the sweet spot — fast enough for real-time chat (150+ tokens/second) and small enough to run on consumer GPUs like the RTX 4090 or RTX 5090.
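
As a rule of thumb, weight memory is parameter count times bits per weight, divided by 8, plus runtime overhead for the KV cache and activations. A rough estimator (the 1 GB overhead is an assumption; long contexts and batching need more):

```python
def vram_gb(params_billion, bits, overhead_gb=1.0):
    """Estimate GPU memory: weights (params * bits/8 bytes) plus runtime overhead."""
    return params_billion * bits / 8 + overhead_gb

print(vram_gb(8, 6))  # → 7.0, inside the table's 6-8 GB range
```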

Deployment Steps

Step 1: Choose Your Model

For domain-specific business use, start with a strong multilingual base model:

  • Qwen3-8B — Excellent multilingual support, strong reasoning
  • Llama 3.1 8B — Great English performance, large ecosystem
  • Mistral 7B — Good balance of speed and quality

Step 2: Fine-tune with LoRA

LoRA (Low-Rank Adaptation) lets you train a small adapter (~130 MB) instead of the full model (~16 GB). You need:

  • 5,000-25,000 training samples covering product Q&A, recommendations, and edge cases
  • A GPU with 24-32 GB VRAM for training
  • 3-6 hours of training time
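
The ~130 MB adapter figure can be sanity-checked: each adapted weight matrix of shape (d_out, d_in) contributes r * (d_in + d_out) trainable parameters. A sketch with illustrative 8B-class dimensions (4096 hidden size, 12288 MLP width, 36 layers, rank 16; the real numbers depend on the model and which modules you target):

```python
def lora_params(rank, layers, shapes):
    """Total trainable params: rank * (d_in + d_out) per adapted matrix, per layer."""
    return layers * sum(rank * (d_in + d_out) for d_out, d_in in shapes)

# q/k/v/o attention projections plus gate/up/down MLP projections
shapes = [(4096, 4096)] * 4 + [(12288, 4096)] * 2 + [(4096, 12288)]
n = lora_params(16, 36, shapes)
print(n, round(n * 2 / 1e6))  # params and fp16 size in MB
```

The exact on-disk size varies with the saved dtype and targeted modules, but it stays two orders of magnitude below the full model.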

Step 3: Set Up RAG

Build a retrieval pipeline that searches your product database on every query:

  • Index products with weighted fields (name 3x, category 2x, description 1x)
  • Inject the top 3-5 matching products into the prompt context
  • This grounds responses in current prices, specs, and availability, even as your catalog changes
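
A minimal sketch of the weighted-field scoring above, using plain keyword overlap (a production pipeline would use BM25 or embedding search):

```python
FIELD_WEIGHTS = {"name": 3, "category": 2, "description": 1}

def score(query, product):
    """Sum field weights for every query term found in that field."""
    terms = query.lower().split()
    return sum(
        weight * sum(term in product.get(field, "").lower() for term in terms)
        for field, weight in FIELD_WEIGHTS.items()
    )

def top_matches(query, products, k=5):
    """Return the k best-scoring products to inject into the prompt."""
    return sorted(products, key=lambda p: score(query, p), reverse=True)[:k]
```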

Step 4: Deploy the Inference Engine

For a single-user chatbot on a website:

# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Import your fine-tuned model
ollama create mymodel -f Modelfile

# Serve on localhost
ollama serve

Your web backend (PHP, Python, Node.js) makes HTTP requests to localhost:11434/api/chat and streams responses to the browser.
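
A minimal backend sketch in Python with only the standard library. The request and streamed-JSON response shapes follow Ollama's /api/chat format; the model name `mymodel` comes from the step above:

```python
import json
from urllib.request import Request, urlopen

def build_payload(user_message, model="mymodel"):
    """Request body for Ollama's /api/chat endpoint."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "stream": True,
    }

def stream_chat(user_message, host="http://localhost:11434"):
    """Yield response tokens as Ollama streams them back, one JSON line at a time."""
    req = Request(
        f"{host}/api/chat",
        data=json.dumps(build_payload(user_message)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urlopen(req) as resp:
        for line in resp:
            chunk = json.loads(line)
            if chunk.get("done"):
                break
            yield chunk["message"]["content"]
```

Your route handler forwards each yielded token to the browser as it arrives.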

Step 5: Build the Chat Interface

The frontend streams tokens as they're generated, here with the Fetch API's ReadableStream (Server-Sent Events work too). `chatBox` is the DOM element that displays the reply:

const response = await fetch('/api/chat', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ message: userInput })
});
const reader = response.body.getReader();
const decoder = new TextDecoder();
while (true) {                          // read chunks as they arrive
    const { done, value } = await reader.read();
    if (done) break;
    chatBox.textContent += decoder.decode(value, { stream: true });
}

This creates the familiar "typing" effect users expect from modern AI chat.

Performance You Can Expect

On a modern GPU with an 8B model:

Metric                          Typical Value
Time to first token             0.1-0.3 seconds
Short response (50 tokens)      < 0.5 seconds
Typical response (150 tokens)   0.8-1.2 seconds
Long response (400 tokens)      2-3 seconds
RAG lookup time                 < 1 ms

Users experience near-instant responses — faster than most humans can read.
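
The table's numbers decompose into time-to-first-token plus generation time. A quick estimator, with defaults taken from the figures above:

```python
def response_time(tokens, ttft=0.2, tok_per_s=150):
    """End-to-end latency: time to first token plus token generation time."""
    return ttft + tokens / tok_per_s

print(response_time(150))  # → 1.2, matching the typical-response row
```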

What's Next

Deploying an LLM is just the beginning. In follow-up articles, we'll cover:

  • Fine-tuning strategies for domain expertise
  • RAG optimization with BM25 vs. embedding search
  • Quantization methods to balance speed and quality
  • Security hardening against prompt injection attacks

The key takeaway: you don't need a data science team to deploy AI on your website. A single developer with a good GPU can build, fine-tune, and deploy a domain-specific AI assistant in a week.

Want to skip the infrastructure? See our managed AI service — we handle deployment, you focus on your business.
