
What is an LLM and How to Deploy It on Your Website

ai.rs Dec 25, 2025

What is an LLM?

A Large Language Model (LLM) is a neural network trained on vast amounts of text data that can understand and generate human-like language. Models like GPT-4, Claude, Llama, and Qwen have billions of parameters — the learned weights that encode knowledge about language, reasoning, and the world.

But here's what matters for business: an LLM isn't just a chatbot. It's a reasoning engine that can be specialized to understand your products, your customers, and your domain.

Why Deploy Your Own LLM?

Using a third-party API (OpenAI, Anthropic) is the easiest path, but it comes with trade-offs:

| Approach | Cost per 1M tokens | Latency (TTFT) | Data Privacy | Customization |
|---|---|---|---|---|
| API (GPT-4o) | $2.50-$10 | 200-800 ms | Data leaves your server | Limited |
| Self-hosted (8B model) | ~$0.05 (compute) | 18-130 ms | Stays on your hardware | Full control |

For a business handling thousands of customer interactions daily, self-hosting can reduce costs by 50-200x while keeping sensitive product data and customer conversations on your own infrastructure.
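The arithmetic behind that multiplier is straightforward. A rough sketch, with illustrative volume numbers (5,000 interactions/day at ~500 tokens each is an assumption, not a benchmark):

```python
def monthly_cost(interactions_per_day, tokens_per_interaction, price_per_1m_tokens):
    """Monthly token spend in dollars, assuming 30 days of traffic."""
    tokens = interactions_per_day * tokens_per_interaction * 30
    return tokens / 1_000_000 * price_per_1m_tokens

# 5,000 interactions/day at ~500 tokens each:
api_cost = monthly_cost(5_000, 500, 2.50)    # low end of the GPT-4o rate
self_hosted = monthly_cost(5_000, 500, 0.05) # amortized compute estimate
```

At the low end of the API price range this is already a 50x gap; against the $10 rate it is 200x.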

The Architecture

A deployed LLM system has three layers:

User Browser  →  Your Web Server (PHP/Node/Python)
                      ↓
                 Inference Engine (Ollama / vLLM)
                      ↓
                 Fine-tuned Model + RAG Pipeline

1. The Model — A pre-trained LLM (e.g., Qwen3-8B) fine-tuned on your domain data using LoRA adapters. This gives it expertise in your products without retraining the full model.

2. RAG (Retrieval-Augmented Generation) — Before the model generates a response, your system searches a product database and injects relevant information into the prompt. This ensures the model always has accurate, up-to-date data.

3. The Inference Engine — Software that loads the model onto your GPU and serves requests. Ollama is the simplest option for single-user deployments; vLLM handles concurrent users efficiently.

Hardware Requirements

The GPU is the critical component. Model size determines minimum VRAM:

| Model Size | Quantization | VRAM Needed | Speed (single user) |
|---|---|---|---|
| 7-8B params | Q6_K (6-bit) | 6-8 GB | 150-200 tok/s |
| 7-8B params | Q4_K_M (4-bit) | 4-6 GB | 180-220 tok/s |
| 13B params | Q6_K | 11-14 GB | 80-120 tok/s |
| 70B params | Q4_K_M | 40-48 GB | 15-25 tok/s |

For most business applications, an 8B parameter model with 6-bit quantization hits the sweet spot — fast enough for real-time chat (150+ tokens/second) and small enough to run on consumer GPUs like the RTX 4090 or RTX 5090.
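The VRAM figures above follow a simple rule of thumb: quantized weight size (parameters × bits per weight / 8) plus roughly 20% for the KV cache and activations. A sketch, with the 20% overhead factor as an assumption that varies with context length:

```python
def estimated_vram_gb(params_billion: float, bits_per_weight: float,
                      overhead: float = 1.2) -> float:
    """Rough VRAM estimate: quantized weight size plus ~20% for
    KV cache and activations (a rule of thumb, not a guarantee)."""
    return params_billion * bits_per_weight / 8 * overhead
```

An 8B model at 6-bit quantization lands around 7.2 GB, inside the 6-8 GB band in the table; a 70B model at 4-bit lands around 42 GB.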

Deployment Steps

Step 1: Choose Your Model

For domain-specific business use, start with a strong multilingual base model:

  • Qwen3-8B — Excellent multilingual support, strong reasoning
  • Llama 3.1 8B — Great English performance, large ecosystem
  • Mistral 7B — Good balance of speed and quality

Step 2: Fine-tune with LoRA

LoRA (Low-Rank Adaptation) lets you train a small adapter (~130 MB) instead of the full model (~16 GB). You need:

  • 5,000-25,000 training samples covering product Q&A, recommendations, and edge cases
  • A GPU with 24-32 GB VRAM for training
  • 3-6 hours of training time
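The ~130 MB adapter size falls out of LoRA's structure: for each target weight matrix, LoRA trains two small matrices A (rank × d_in) and B (d_out × rank) instead of the full matrix. A back-of-the-envelope sketch; the dimensions below (hidden size 4096, 32 layers, 4 attention projections, rank 64) are illustrative assumptions, not Qwen3's exact configuration:

```python
def lora_param_count(d_in: int, d_out: int, rank: int, n_matrices: int) -> int:
    """LoRA adds A (rank x d_in) and B (d_out x rank) per target matrix."""
    return n_matrices * rank * (d_in + d_out)

# Assumed dims: hidden 4096, 32 layers, 4 attention projections per layer, rank 64
params = lora_param_count(4096, 4096, 64, 4 * 32)
adapter_mb = params * 2 / 1024**2  # fp16 = 2 bytes per parameter
```

That works out to about 67M trainable parameters, or roughly 128 MB in fp16, consistent with the ~130 MB figure above.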

Step 3: Set Up RAG

Build a retrieval pipeline that searches your product database on every query:

  • Index products with weighted fields (name 3x, category 2x, description 1x)
  • Inject the top 3-5 matching products into the prompt context
  • This grounds prices, specs, and availability in your live data — no retraining needed when inventory changes
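The steps above can be sketched as a minimal weighted keyword retriever. The product fields and data here are hypothetical, and a production system would typically use BM25 or embedding search instead of raw substring counts:

```python
FIELD_WEIGHTS = {"name": 3, "category": 2, "description": 1}

def score(product: dict, query: str) -> int:
    """Weighted count of query-term occurrences across product fields."""
    terms = query.lower().split()
    total = 0
    for field, weight in FIELD_WEIGHTS.items():
        text = product.get(field, "").lower()
        total += weight * sum(text.count(t) for t in terms)
    return total

def top_products(products: list, query: str, k: int = 3) -> list:
    """Return the k best-matching products, dropping zero-score items."""
    ranked = sorted(products, key=lambda p: score(p, query), reverse=True)
    return [p for p in ranked[:k] if score(p, query) > 0]

def build_context(products: list) -> str:
    """Format matched products for injection into the prompt."""
    return "\n".join(f"- {p['name']}: {p['description']}" for p in products)
```

The string returned by `build_context` is prepended to the system prompt, so the model answers from retrieved facts rather than memorized training data.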

Step 4: Deploy the Inference Engine

For a single-user chatbot on a website:

# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Import your fine-tuned model
ollama create mymodel -f Modelfile

# Serve on localhost
ollama serve

Your web backend (PHP, Python, Node.js) makes HTTP requests to localhost:11434/api/chat and streams responses to the browser.
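In Python, that backend call can be sketched with the standard library alone. Ollama streams newline-delimited JSON, with each line carrying a token in `message.content`; the model name `mymodel` matches the `ollama create` step above:

```python
import json
import urllib.request

def parse_chunk(line: bytes) -> str:
    """Extract the token text from one NDJSON line of Ollama's stream."""
    chunk = json.loads(line)
    return chunk.get("message", {}).get("content", "")

def stream_chat(prompt: str, model: str = "mymodel",
                host: str = "http://localhost:11434"):
    """Yield tokens from /api/chat as they arrive, for relaying
    to the browser via SSE or chunked transfer."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": True,
    }).encode()
    req = urllib.request.Request(host + "/api/chat", data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        for line in resp:
            if line.strip():
                yield parse_chunk(line)
```

Each yielded token is forwarded to the client immediately, which is what makes the streaming UI in the next step possible.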

Step 5: Build the Chat Interface

The frontend uses Server-Sent Events (SSE) to stream tokens as they're generated:

const response = await fetch('/api/chat', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ message: userInput })
});
const reader = response.body.getReader();
const decoder = new TextDecoder();
let reply = '';
while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    reply += decoder.decode(value, { stream: true });  // append each chunk to the chat UI
}

This creates the familiar "typing" effect users expect from modern AI chat.

Performance You Can Expect

On a modern GPU with an 8B model:

| Metric | Typical Value |
|---|---|
| Time to first token | 0.1-0.3 seconds |
| Short response (50 tokens) | < 0.5 seconds |
| Typical response (150 tokens) | 0.8-1.2 seconds |
| Long response (400 tokens) | 2-3 seconds |
| RAG lookup time | < 1 ms |

Users experience near-instant responses — faster than most humans can read.
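These latencies follow directly from the throughput numbers earlier: total response time is roughly time-to-first-token plus token count divided by generation speed. A sketch using mid-range values from the tables above (0.2 s TTFT, 175 tok/s):

```python
def response_time_s(n_tokens: int, ttft_s: float = 0.2,
                    tokens_per_s: float = 175) -> float:
    """Total latency ~= time to first token + generation time."""
    return ttft_s + n_tokens / tokens_per_s
```

For a typical 150-token response this gives about 1.06 seconds, matching the 0.8-1.2 second band in the table.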

What's Next

Deploying an LLM is just the beginning. In follow-up articles, we'll cover:

  • Fine-tuning strategies for domain expertise
  • RAG optimization with BM25 vs. embedding search
  • Quantization methods to balance speed and quality
  • Security hardening against prompt injection attacks

The key takeaway: you don't need a data science team to deploy AI on your website. A single developer with a good GPU can build, fine-tune, and deploy a domain-specific AI assistant in a week.

Want to skip the infrastructure? See our managed AI service — we handle deployment, you focus on your business.

Ready to put this into practice?

Understanding the fundamentals is one thing — building something for your business is another. See where you stand.

Take the AI Readiness Check
