The Knowledge Problem
Every LLM has a fundamental limitation: it only knows what was in its training data. Ask it about your product catalog, and it will either hallucinate an answer or admit it doesn't know.
Fine-tuning helps — the model can learn patterns about your products, your brand voice, and how to handle edge cases. But fine-tuning has a memorization ceiling. A typical LoRA adapter with 174 million trainable parameters can reliably memorize 500-1,000 specific products. Beyond that, details start bleeding together.
| Catalog Size | Fine-tuning Alone | With RAG |
|---|---|---|
| 500 products | Good accuracy | Perfect accuracy |
| 2,000 products | Declining accuracy | Perfect accuracy |
| 10,000 products | Pattern-only | Perfect accuracy |
| 100,000+ products | Same as 10K | Perfect accuracy |
RAG eliminates this ceiling entirely.
How RAG Works
Retrieval-Augmented Generation is a two-step process:
- Retrieve — When a user asks a question, search your database for relevant information
- Generate — Inject the retrieved information into the prompt, then let the LLM generate a response
User: "What red wines do you have under $30?"
↓
[RETRIEVE] Search product database → 5 matching wines
↓
[AUGMENT] Add products to prompt context
↓
Prompt: "Given these products: [Wine A $25, Wine B $28, ...]
Answer the customer's question: What red wines under $30?"
↓
[GENERATE] LLM creates natural response with accurate data
The model never needs to "remember" your catalog. It receives the relevant data fresh with every query.
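The two steps above can be sketched in a few lines. This is a minimal sketch, not a full implementation: `search_products` and `call_llm` are hypothetical stand-ins for your search index and your LLM client.

```python
def answer(question: str, search_products, call_llm, k: int = 5) -> str:
    """Retrieve-augment-generate loop for a product catalog."""
    # RETRIEVE: pull the top-k matching products from the database
    products = search_products(question, limit=k)

    # AUGMENT: inject the retrieved rows into the prompt
    context = "\n".join(f"- {p['name']} {p['price']}" for p in products)
    prompt = (
        f"Given these products:\n{context}\n\n"
        f"Answer the customer's question: {question}"
    )

    # GENERATE: the model answers from fresh data, not from memory
    return call_llm(prompt)
```

Because the catalog data arrives through the prompt, swapping the search backend or the model requires no change to the other half of the pipeline.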
The Search Engine: BM25 vs. Embeddings
The retrieval step needs a search engine. Two main approaches:
BM25 (Keyword Search)
Classic term-frequency matching. If the user says "red wine", it finds products containing those exact words.
Pros: fast (< 1ms), no GPU needed, no model to maintain, deterministic
Cons: misses synonyms, can't understand intent
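For intuition, the whole scorer fits in one function. This is a minimal sketch of Okapi BM25 over pre-tokenized documents; production systems add an inverted index and proper tokenization, but the ranking math is the same.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized doc against a tokenized query with Okapi BM25."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    # document frequency: how many docs contain each term
    df = Counter()
    for d in docs:
        df.update(set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log(1 + (N - df[term] + 0.5) / (df[term] + 0.5))
            f = tf[term]
            score += idf * f * (k1 + 1) / (
                f + k1 * (1 - b + b * len(d) / avgdl)
            )
        scores.append(score)
    return scores
```

The `k1` and `b` defaults here are the conventional starting points; `b` controls how much long documents are penalized.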
Embedding Search (Semantic)
Converts text to vectors and finds similar meanings. "Something fruity for summer" matches light wines even without exact keyword overlap.
Pros: understands meaning, handles vague queries
Cons: requires an embedding model, slower, less predictable
The Surprising Result
In production testing with structured product data, BM25 consistently outperformed embedding search:
| Metric | BM25 | Embeddings |
|---|---|---|
| Exact product matches | 92% | 78% |
| Price accuracy | 100% | 100% |
| Speed | < 1ms | 15-50ms |
| Infrastructure | None | Embedding model required |
Why? Structured product catalogs have consistent naming conventions. When someone asks about "Cabernet Sauvignon", the product is literally called "Cabernet Sauvignon" in the database. Keyword matching finds it instantly.
Embeddings shine with unstructured data (documents, articles, support tickets) where concepts matter more than exact terms.
Building an Effective RAG Pipeline
Step 1: Multi-field Weighted Indexing
Don't just search product descriptions. Weight fields by importance:
name: 3x weight (most important)
brand: 2x weight
category: 2x weight
style: 2x weight
taste: 1x weight
description: 1x weight
This ensures "Merlot" matches the wine category before matching a description that mentions merlot in passing.
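One simple way to apply these weights is to score each field separately and sum the weighted results. A sketch, assuming a per-field scorer such as the BM25 function from earlier (here `score_field` is left pluggable):

```python
# Field weights from the list above
FIELD_WEIGHTS = {
    "name": 3.0,
    "brand": 2.0,
    "category": 2.0,
    "style": 2.0,
    "taste": 1.0,
    "description": 1.0,
}

def weighted_score(query_tokens, product: dict, score_field) -> float:
    """Sum per-field match scores, scaled by each field's importance."""
    return sum(
        weight * score_field(query_tokens, product.get(field, ""))
        for field, weight in FIELD_WEIGHTS.items()
    )
```

With these weights, a hit in `name` counts three times as much as the same hit buried in `description`.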
Step 2: Category Alias Mapping
Users don't always use your exact category names:
"white wine" → search: white wine category
"something sweet" → search: dessert wines, sweet cocktails
"bitter" → search: IPAs, amaro, bitters
"for cooking" → search: cooking wines, olive oils
Map natural language to your taxonomy before searching.
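A sketch of that normalization step, using the aliases above (the table itself is illustrative; yours should mirror your actual taxonomy):

```python
CATEGORY_ALIASES = {
    "white wine": ["white wine"],
    "something sweet": ["dessert wines", "sweet cocktails"],
    "bitter": ["IPAs", "amaro", "bitters"],
    "for cooking": ["cooking wines", "olive oils"],
}

def expand_query(query: str) -> list:
    """Return the raw query plus any catalog categories it maps to."""
    q = query.lower()
    expansions = [query]
    for phrase, categories in CATEGORY_ALIASES.items():
        if phrase in q:
            expansions.extend(categories)
    return expansions
```

Running the expansion before BM25 means keyword search still works even when the user's vocabulary never appears in the catalog.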
Step 3: Rich Context Injection
Don't just pass product names and prices. Include enough context for the model to give helpful answers:
{
    "name": "Château Margaux 2018",
    "price": "$89.99",
    "category": "Red Wine",
    "style": "Full-bodied Bordeaux",
    "taste": "Dark fruit, tobacco, elegant tannins",
    "food_pairing": "Grilled lamb, aged cheese",
    "description": "A structured yet approachable vintage..."
}
This gives the model 60-90 tokens of context per product — enough to make informed recommendations.
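Serializing a product record into prompt context is a one-liner per field. A sketch using the field names from the JSON above, plus a crude token estimate (roughly 4 characters per token for English text) to sanity-check the per-product budget:

```python
def product_context(p: dict) -> str:
    """Flatten one product record into a compact context line."""
    return (
        f"{p['name']} ({p['category']}, {p['price']}). "
        f"Style: {p['style']}. Taste: {p['taste']}. "
        f"Pairs with: {p['food_pairing']}. {p['description']}"
    )

def rough_tokens(text: str) -> int:
    # crude heuristic: ~4 characters per token; use your model's
    # real tokenizer for exact budgeting
    return len(text) // 4
```

Prose-like serialization tends to work better than raw JSON in the prompt: the model reads it the way it reads its training data.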
Step 4: Context Window Management
Most 8B models have 8K-32K token context windows. Budget your context carefully:
| Component | Tokens |
|---|---|
| System prompt | 200-400 |
| RAG results (5 products) | 300-450 |
| Conversation history | 500-2,000 |
| Response space | 200-500 |
| Total | 1,200-3,350 |
Plenty of room for accurate retrieval without hitting context limits.
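The budget check itself is trivial to automate. A sketch using the worst-case figures from the table above:

```python
def fits(budget: dict, window: int) -> bool:
    """True if the summed token budget fits in the model's context window."""
    return sum(budget.values()) <= window

# Worst-case figures from the budgeting table
worst_case = {
    "system_prompt": 400,
    "rag_results": 450,        # 5 products at up to 90 tokens each
    "conversation_history": 2000,
    "response_space": 500,
}
```

Even the worst case (3,350 tokens) leaves most of an 8K window free, which is why retrieving 5 products rather than 50 is usually the right call.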
RAG Solves the Update Problem
Perhaps RAG's biggest advantage: zero-retraining updates.
- New product added? Update the database. RAG finds it immediately.
- Price changed? Update the database. The next query returns the new price.
- Product discontinued? Remove from database. It's gone instantly.
Fine-tuning requires retraining (5 hours) to learn new information. RAG reflects changes in milliseconds.
The Combined Architecture
The optimal production setup uses both fine-tuning and RAG:
| Component | Responsibility |
|---|---|
| Fine-tuning | Brand voice, conversation style, safety guardrails, domain reasoning |
| RAG | Exact product data, prices, availability, specifications |
Fine-tuning teaches the model how to think and talk about your domain. RAG gives it the specific facts to think and talk about.
Together, you get an AI assistant that sounds like an expert and never gets the details wrong.
Getting Started with RAG
- Export your product catalog to JSON (name, price, category, description)
- Build a search index using BM25 (a few lines of code in most languages)
- Write a retrieval function that searches the index and returns top 3-5 matches
- Inject results into the prompt before the user's question
- Test and iterate — add category aliases and field weights as needed
The entire RAG pipeline can be built in a day. No ML expertise required. No additional GPU needed. Just a search index and a well-structured prompt.
This is RAG in a real business context: live pricing, instant product updates, and no retraining.