
What is RAG and Why Your AI Needs It

ai.rs Jan 15, 2026

The Knowledge Problem

Every LLM has a fundamental limitation: it only knows what was in its training data. Ask it about your product catalog, and it will either hallucinate an answer or admit it doesn't know.

Fine-tuning helps — the model can learn patterns about your products, your brand voice, and how to handle edge cases. But fine-tuning has a memorization ceiling. A typical LoRA adapter with 174 million trainable parameters can reliably memorize 500-1,000 specific products. Beyond that, details start bleeding together.

| Catalog Size | Fine-tuning Alone | With RAG |
| --- | --- | --- |
| 500 products | Good accuracy | Perfect accuracy |
| 2,000 products | Declining accuracy | Perfect accuracy |
| 10,000 products | Pattern-only | Perfect accuracy |
| 100,000+ products | Same as 10K | Perfect accuracy |

RAG eliminates this ceiling entirely.

How RAG Works

Retrieval-Augmented Generation is a two-step process:

  1. Retrieve — When a user asks a question, search your database for relevant information
  2. Generate — Inject the retrieved information into the prompt, then let the LLM generate a response
User: "What red wines do you have under $30?"
    ↓
[RETRIEVE] Search product database → 5 matching wines
    ↓
[AUGMENT] Add products to prompt context
    ↓
Prompt: "Given these products: [Wine A $25, Wine B $28, ...]
         Answer the customer's question: What red wines under $30?"
    ↓
[GENERATE] LLM creates natural response with accurate data

The model never needs to "remember" your catalog. It receives the relevant data fresh with every query.
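The flow above can be sketched in a few lines of Python. Everything here is an illustrative assumption — the product list, the field names, and the naive keyword-overlap scoring — and the LLM call itself is left out:

```python
# Minimal retrieve -> augment sketch. The catalog, fields, and scoring
# are illustrative assumptions, not a real API.
PRODUCTS = [
    {"name": "Wine A", "category": "red wine", "price": 25},
    {"name": "Wine B", "category": "red wine", "price": 28},
    {"name": "Wine C", "category": "white wine", "price": 19},
]

def retrieve(query: str, top_k: int = 5) -> list[dict]:
    """Rank products by naive keyword overlap with the query."""
    terms = set(query.lower().split())
    scored = [
        (len(terms & set(f"{p['name']} {p['category']}".lower().split())), p)
        for p in PRODUCTS
    ]
    ranked = sorted(scored, key=lambda s: -s[0])
    return [p for score, p in ranked if score > 0][:top_k]

def augment(question: str, hits: list[dict]) -> str:
    """Inject the retrieved products into the prompt ahead of the question."""
    lines = [f"- {p['name']} (${p['price']}, {p['category']})" for p in hits]
    return (
        "Given these products:\n" + "\n".join(lines)
        + f"\n\nAnswer the customer's question: {question}"
    )

hits = retrieve("red wine", top_k=2)
prompt = augment("What red wines do you have under $30?", hits)
```

In a real pipeline the overlap score would be replaced by BM25 or embeddings, but the retrieve-then-augment shape stays the same.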

The Search Engine: BM25 vs. Embeddings

The retrieval step needs a search engine. Two main approaches:

BM25 Search (Keyword)

Classic term-frequency matching. If the user says "red wine", it finds products containing those exact words.

Pros: Fast (< 1ms), no GPU needed, no model to maintain, deterministic
Cons: Misses synonyms, can't understand intent

Embedding Search (Semantic)

Converts text to vectors and finds similar meanings. "Something fruity for summer" matches light wines even without exact keyword overlap.

Pros: Understands meaning, handles vague queries
Cons: Requires an embedding model, slower, less predictable

The Surprising Result

In production testing with structured product data, BM25 consistently outperformed embedding search:

| Metric | BM25 | Embeddings |
| --- | --- | --- |
| Exact product matches | 92% | 78% |
| Price accuracy | 100% | 100% |
| Speed | < 1ms | 15-50ms |
| Infrastructure | None | Embedding model required |

Why? Structured product catalogs have consistent naming conventions. When someone asks about "Cabernet Sauvignon", the product is literally called "Cabernet Sauvignon" in the database. Keyword matching finds it instantly.

Embeddings shine with unstructured data (documents, articles, support tickets) where concepts matter more than exact terms.
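For readers who want to see what BM25 actually computes, here is a compact, self-contained scorer — standard Okapi BM25 with the usual k1 and b defaults. The three-product corpus is an illustrative assumption:

```python
import math
from collections import Counter

class BM25:
    """Minimal Okapi BM25 over a small in-memory corpus."""

    def __init__(self, docs: list[str], k1: float = 1.5, b: float = 0.75):
        self.k1, self.b = k1, b
        self.docs = [d.lower().split() for d in docs]
        self.n = len(self.docs)
        self.avgdl = sum(len(d) for d in self.docs) / self.n
        # Document frequency: how many docs contain each term.
        self.df = Counter(t for d in self.docs for t in set(d))

    def score(self, query: str, i: int) -> float:
        d = self.docs[i]
        tf = Counter(d)
        s = 0.0
        for t in query.lower().split():
            if tf[t] == 0:
                continue
            idf = math.log(1 + (self.n - self.df[t] + 0.5) / (self.df[t] + 0.5))
            norm = tf[t] + self.k1 * (1 - self.b + self.b * len(d) / self.avgdl)
            s += idf * tf[t] * (self.k1 + 1) / norm
        return s

    def search(self, query: str) -> list[int]:
        """Return document indices ranked best-first."""
        return sorted(range(self.n), key=lambda i: -self.score(query, i))

catalog = ["Cabernet Sauvignon red wine", "Chardonnay white wine", "Pinot Noir red wine"]
idx = BM25(catalog)
```

Note how "Cabernet Sauvignon" trivially ranks the matching product first — exactly the consistent-naming effect described above.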

Building an Effective RAG Pipeline

Step 1: Multi-field Weighted Indexing

Don't just search product descriptions. Weight fields by importance:

name:     3x weight (most important)
brand:    2x weight  
category: 2x weight
style:    2x weight
taste:    1x weight
description: 1x weight

This ensures "Merlot" matches the wine category before matching a description that mentions merlot in passing.
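The weights above can be applied with a simple per-field scorer. This sketch uses weighted term overlap rather than full BM25 to stay short; the two example products are assumptions:

```python
# Field weights from the table above; unlisted fields score zero.
FIELD_WEIGHTS = {"name": 3, "brand": 2, "category": 2, "style": 2,
                 "taste": 1, "description": 1}

def weighted_score(query: str, product: dict) -> int:
    """Sum per-field term overlap, scaled by that field's weight."""
    terms = set(query.lower().split())
    score = 0
    for field, weight in FIELD_WEIGHTS.items():
        tokens = set(product.get(field, "").lower().split())
        score += weight * len(terms & tokens)
    return score

products = [
    {"name": "Merlot Reserve", "category": "red wine",
     "description": "Smooth and rich"},
    {"name": "House Red", "category": "red wine",
     "description": "Pairs well with merlot dishes"},
]
```

With these weights, "Merlot" in the name scores 3 while a passing mention in a description scores only 1, so the actual Merlot ranks first.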

Step 2: Category Alias Mapping

Users don't always use your exact category names:

"white wine"    → search: white wine category
"something sweet" → search: dessert wines, sweet cocktails
"bitter"        → search: IPAs, amaro, bitters
"for cooking"   → search: cooking wines, olive oils

Map natural language to your taxonomy before searching.
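One way to do this is a phrase-to-category table consulted before the search runs. The alias entries below are illustrative assumptions for a drinks catalog:

```python
# Map loose customer phrasing onto the catalog's own taxonomy.
CATEGORY_ALIASES = {
    "white wine": ["white wine"],
    "something sweet": ["dessert wine", "sweet cocktail"],
    "bitter": ["ipa", "amaro", "bitters"],
    "for cooking": ["cooking wine", "olive oil"],
}

def expand_query(query: str) -> list[str]:
    """Replace matched alias phrases with taxonomy categories; else pass through."""
    q = query.lower()
    matched = [cats for phrase, cats in CATEGORY_ALIASES.items() if phrase in q]
    return [c for cats in matched for c in cats] or [q]
```

Unmatched queries pass through unchanged, so the alias table only ever adds recall.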

Step 3: Rich Context Injection

Don't just pass product names and prices. Include enough context for the model to give helpful answers:

{
  "name": "Château Margaux 2018",
  "price": "$89.99",
  "category": "Red Wine",
  "style": "Full-bodied Bordeaux",
  "taste": "Dark fruit, tobacco, elegant tannins",
  "food_pairing": "Grilled lamb, aged cheese",
  "description": "A structured yet approachable vintage..."
}

This gives the model 60-90 tokens of context per product — enough to make informed recommendations.
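Injecting that context is just string assembly. A sketch, assuming the retrieved products arrive as dicts like the one above (the system-prompt wording is an assumption):

```python
import json

def format_context(products: list[dict]) -> str:
    """Serialize retrieved products as one compact JSON object per line."""
    return "\n".join(json.dumps(p, ensure_ascii=False) for p in products)

def build_prompt(question: str, products: list[dict]) -> str:
    return (
        "You are a wine shop assistant. Answer using ONLY these products:\n"
        f"{format_context(products)}\n\n"
        f"Customer question: {question}"
    )
```

Keeping each product on a single JSON line makes the token cost per product easy to measure and budget.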

Step 4: Context Window Management

Most 8B models have 8K-32K token context windows. Budget your context carefully:

| Component | Tokens |
| --- | --- |
| System prompt | 200-400 |
| RAG results (5 products) | 300-450 |
| Conversation history | 500-2,000 |
| Response space | 200-500 |
| Total | 1,200-3,350 |

Plenty of room for accurate retrieval without hitting context limits.
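A cheap sanity check before each request keeps you inside the window. The 4-characters-per-token ratio below is a rough rule of thumb, not a real tokenizer — in production, count tokens with your model's actual tokenizer:

```python
def estimate_tokens(text: str) -> int:
    """Crude estimate: ~4 characters per token. Replace with a real tokenizer."""
    return max(1, len(text) // 4)

def fits_budget(system: str, rag: str, history: str,
                window: int = 8192, reserve: int = 500) -> bool:
    """True if prompt components plus reserved response space fit the window."""
    used = sum(estimate_tokens(t) for t in (system, rag, history))
    return used + reserve <= window
```

When the check fails, the usual fixes are trimming conversation history first and retrieved products second — the system prompt rarely shrinks.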

RAG Solves the Update Problem

Perhaps RAG's biggest advantage: zero-retraining updates.

  • New product added? Update the database. RAG finds it immediately.
  • Price changed? Update the database. The next query returns the new price.
  • Product discontinued? Remove from database. It's gone instantly.

Fine-tuning requires retraining (5 hours) to learn new information. RAG reflects changes in milliseconds.
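In code, those three bullets reduce to ordinary store operations. A sketch, assuming an in-memory store keyed by product name:

```python
# RAG updates are just data updates: change the store, and the very next
# retrieval sees the change. No retraining step exists.
store = {"Wine A": {"name": "Wine A", "price": 25}}

def upsert(product: dict) -> None:
    store[product["name"]] = product   # new product or price change

def discontinue(name: str) -> None:
    store.pop(name, None)              # gone on the next query
```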

The Combined Architecture

The optimal production setup uses both fine-tuning and RAG:

| Component | Responsibility |
| --- | --- |
| Fine-tuning | Brand voice, conversation style, safety guardrails, domain reasoning |
| RAG | Exact product data, prices, availability, specifications |

Fine-tuning teaches the model how to think and talk about your domain. RAG gives it the specific facts to think and talk about.

Together, you get an AI assistant that sounds like an expert and never gets the details wrong.

Getting Started with RAG

  1. Export your product catalog to JSON (name, price, category, description)
  2. Build a search index using BM25 (a few lines of code in most languages)
  3. Write a retrieval function that searches the index and returns top 3-5 matches
  4. Inject results into the prompt before the user's question
  5. Test and iterate — add category aliases and field weights as needed

The entire RAG pipeline can be built in a day. No ML expertise required. No additional GPU needed. Just a search index and a well-structured prompt.

