The Semantic Search Hype
Every AI tutorial tells you to use embedding search. Convert your data to vectors, store them in a vector database, and enjoy "semantic understanding" that keyword search can't match.
But when we deployed both approaches against a real product catalog, the results told a different story.
The Experiment
We tested both search approaches on a production catalog of 2,000+ products across multiple categories:
BM25 Setup:
- Multi-field weighted index (name 3x, brand 2x, category 2x, description 1x)
- Category alias mapping (natural language → taxonomy)
- No ML model required
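The category alias mapping above can be sketched as a simple query-rewrite step. The alias table, taxonomy names, and function below are illustrative assumptions, not the actual production mapping:

```python
# Sketch of category alias mapping: natural-language phrases are
# rewritten to catalog taxonomy terms before the BM25 query runs.
# Aliases and taxonomy names here are made up for illustration.
CATEGORY_ALIASES = {
    "red wine": "wine_red",
    "bubbly": "wine_sparkling",
    "pots and pans": "cookware",
}

def expand_query(query: str) -> str:
    """Append taxonomy terms for any alias found in the query."""
    q = query.lower()
    extra = [tax for alias, tax in CATEGORY_ALIASES.items() if alias in q]
    return " ".join([q] + extra)
```

Because the taxonomy term is appended rather than substituted, the original words still match against names and descriptions.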
Embedding Setup:
- Sentence transformer model (all-MiniLM-L6-v2)
- Product descriptions encoded to 384-dimensional vectors
- Cosine similarity search
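The embedding-side search reduces to a cosine similarity lookup over the encoded vectors. A minimal sketch with NumPy, using random stand-in vectors instead of real all-MiniLM-L6-v2 encodings (which would require the sentence-transformers package and a model download):

```python
import numpy as np

def cosine_top_k(query_vec, doc_vecs, k=5):
    """Return indices of the k most similar document vectors."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    sims = d @ q  # cosine similarity of each doc against the query
    return np.argsort(-sims)[:k]

# Stand-in for model.encode(...) output: 4 docs, 384-dim, random.
rng = np.random.default_rng(0)
docs = rng.normal(size=(4, 384))
query = docs[2] + 0.01 * rng.normal(size=384)  # near-duplicate of doc 2
top = cosine_top_k(query, docs, k=1)  # doc 2 should rank first
```

In production the vectors would come from `model.encode(...)`, but the ranking step is exactly this.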
Test queries: 200 real customer questions collected from user testing.
The Results
| Metric | BM25 | Embeddings | Winner |
|---|---|---|---|
| Exact product match (top 3) | 92% | 78% | BM25 |
| Category accuracy | 95% | 88% | BM25 |
| Price in results | 100% | 100% | Tie |
| Handles vague queries | 71% | 85% | Embeddings |
| Speed (per query) | < 1ms | 15-50ms | BM25 |
| Infrastructure needed | None | Embedding model | BM25 |
BM25 won on the metrics that matter most for e-commerce: finding the right product and getting the category right.
Why BM25 Wins for Structured Data
1. Products Have Consistent Names
When a customer asks about "Merlot wine", the product is literally called "Merlot" in the database. There's no semantic gap to bridge.
Query: "Merlot wine"
BM25: Finds "Merlot" (exact match, 100% confidence)
Embed: Finds "Merlot" (0.89 similarity) + "Pinot Noir" (0.85) + "Cabernet" (0.82)
BM25 gives a decisive match. Embeddings blur the boundaries between similar products.
2. Field Weighting Adds Context
With BM25, you can weight fields differently:
"stainless steel cookware"
BM25 with weights:
- name (3x): "Stainless Steel Pan" → HIGH score
- category (2x): "Cookware" → HIGH score
- description (1x): mentions steel → LOW score
Result: Exact product match
Embeddings flatten everything into a single vector, losing this structural information.
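The per-field weighting can be sketched as scoring each field separately and summing the weighted results. The scorer below is a toy term-overlap stand-in for a real per-field BM25 score, and the product fields are illustrative:

```python
FIELD_WEIGHTS = {"name": 3.0, "brand": 2.0, "category": 2.0, "description": 1.0}

def term_overlap(query: str, text: str) -> float:
    """Toy stand-in for a per-field BM25 score: count shared terms."""
    return float(len(set(query.lower().split()) & set(text.lower().split())))

def weighted_score(query: str, product: dict) -> float:
    """Sum per-field scores, weighted by field importance."""
    return sum(w * term_overlap(query, product.get(f, ""))
               for f, w in FIELD_WEIGHTS.items())

pan = {"name": "Stainless Steel Pan", "category": "Cookware",
       "description": "Durable steel pan for everyday cooking"}
score = weighted_score("stainless steel cookware", pan)  # name hits dominate
```

A match in the product name contributes three times as much as the same match in the description, which is exactly the structural signal a single flattened vector loses.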
3. Numbers and Codes Work Correctly
Query: "model X-500"
BM25: Finds product X-500 (exact term match)
Embed: Finds products with similar descriptions (X-500 is meaningless to the embedding model)
BM25 handles SKUs, model numbers, prices, and alphanumeric codes that embeddings treat as noise.
Where Embeddings Win
Embeddings have a clear advantage for vague, intent-based queries:
Query: "something refreshing for a hot day"
BM25: Matches on "refreshing" if it appears in descriptions
Embed: Understands the concept and finds light wines, sparkling water, citrus drinks
Query: "gift for someone who likes cooking"
BM25: Matches on "gift" and "cooking" separately
Embed: Understands the gifting + cooking intent, finds cookware gift sets
If your customers frequently use vague, conversational language, embeddings add value.
The Hybrid Approach
The best production systems use both:
1. BM25 search → top 10 results (fast, precise)
2. If BM25 results < 3 → fallback to embedding search
3. Re-rank combined results by relevance
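The three steps above can be sketched as follows. The search callables are stubs standing in for real BM25 and embedding backends; because scores from the two systems are not directly comparable, this sketch keeps BM25 hits first and appends embedding hits, while real systems often re-rank with reciprocal rank fusion or an LLM:

```python
def hybrid_search(query, bm25_search, embed_search, top_k=10, min_results=3):
    """BM25 first; fall back to embeddings only when BM25 is too sparse.

    Both search callables return (doc_id, score) pairs, best first.
    """
    results = bm25_search(query, top_k)
    if len(results) < min_results:
        seen = {doc_id for doc_id, _ in results}
        results += [(d, s) for d, s in embed_search(query, top_k)
                    if d not in seen]
    return results[:top_k]
```

Because the fallback only fires when BM25 comes up short, the slow embedding path is skipped for the majority of queries.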
But here's the key insight: when you have an LLM in the pipeline, the model itself acts as a re-ranker.
The LLM receives the BM25 results and uses its own understanding to select the most relevant products for the response. You get semantic understanding from the LLM without needing a separate embedding search.
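One way this might look in practice is packing the BM25 hits into the LLM prompt and letting the model do the selection. The prompt wording and product fields below are illustrative assumptions, and the actual LLM call is left out:

```python
def build_rerank_prompt(question: str, products: list) -> str:
    """Format BM25 hits so the LLM can pick the most relevant ones.

    Product dicts with name/category/price keys are an assumed shape.
    """
    lines = [f"{i + 1}. {p['name']} ({p['category']}, {p['price']})"
             for i, p in enumerate(products)]
    return (
        "Customer question: " + question + "\n\n"
        "Candidate products (from keyword search):\n" + "\n".join(lines) + "\n\n"
        "Answer the question using only the most relevant products above."
    )

prompt = build_rerank_prompt(
    "something for searing steaks",
    [{"name": "Cast Iron Skillet", "category": "Cookware", "price": "$39"}],
)
```

The LLM never sees the full catalog, only the handful of candidates BM25 surfaced, so the precision of the keyword stage still matters.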
Cost of Each Approach
| Component | BM25 | Embeddings |
|---|---|---|
| Runtime infrastructure | Zero | Embedding model (0.5-2 GB) |
| Per-query compute | < 1ms CPU | 15-50ms GPU |
| Index update | Instant | Re-encode modified products |
| Maintenance | None | Model version management |
| Dependencies | Standard library | torch, transformers, vector DB |
BM25 adds zero complexity to your stack. It can be implemented in any language using only the standard library. No GPU, no model downloads, no version conflicts.
Implementation: BM25 in 30 Lines
A production BM25 search for product catalogs:
```python
import math
from collections import Counter

class BM25:
    def __init__(self, documents, k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        self.docs = documents
        self.N = len(documents)
        # Average document length, used for length normalization.
        self.avgdl = sum(len(d.split()) for d in documents) / self.N
        # Document frequency: number of documents containing each term.
        self.df = Counter()
        for doc in documents:
            for term in set(doc.lower().split()):
                self.df[term] += 1

    def score(self, query, doc_idx):
        doc = self.docs[doc_idx].lower().split()
        doc_len = len(doc)
        tf = Counter(doc)
        score = 0.0
        for term in query.lower().split():
            if term not in self.df:
                continue
            # BM25 IDF, with +1 inside the log to keep it non-negative.
            idf = math.log((self.N - self.df[term] + 0.5) / (self.df[term] + 0.5) + 1)
            term_tf = tf.get(term, 0)
            score += idf * (term_tf * (self.k1 + 1)) / (
                term_tf + self.k1 * (1 - self.b + self.b * doc_len / self.avgdl)
            )
        return score

    def search(self, query, top_k=5):
        scores = [(i, self.score(query, i)) for i in range(self.N)]
        return sorted(scores, key=lambda x: -x[1])[:top_k]
```
That's it. No dependencies, no model downloads, no GPU.
When to Use What
| Scenario | Recommendation |
|---|---|
| Structured product catalog | BM25 |
| FAQ / documentation search | Embeddings |
| Multi-language product search | BM25 + aliases |
| Conversational discovery | Embeddings (or LLM re-ranking) |
| Real-time (< 5ms) | BM25 |
| Budget-constrained | BM25 |
| Unstructured knowledge base | Embeddings |
Our Recommendation
Start with BM25. It's simpler, faster, and more accurate for product search. Add multi-field weighting and category aliases for 90%+ accuracy.
Add embeddings later if analytics show customers frequently use vague, intent-based queries that BM25 can't handle.
Let the LLM do the heavy lifting. Your fine-tuned model already has semantic understanding. Feed it BM25 results and let it reason about which products best answer the customer's question.