A company deploys a RAG assistant based on product documentation and, after a week of pilot testing, discovers a problem: queries containing SKU codes, part symbols, or internal system names (e.g., "INTAKE-3B module") return semantically similar but incorrect documents. Vector search "understands the intent" but loses identifiers. Service customers receive descriptions of similar products instead of the specific technical sheet. This is one of the most common reasons why RAG works great in demos but underperforms in production.
Why semantics alone fails
Semantic search operates on the geometry of meaning: the embedding model converts text into a vector and searches for documents with similar representations. The strength of this method is also its weakness: the model averages meaning, compresses information, and loses lexical precision.
Three specific query classes where semantics consistently fails:
Codes and identifiers. SKU-4521, P/N 7742-A, order number ZAM-2024-08811: these are character strings without semantic context. The embedding model treats them as noise and places them close to documents on general product topics, instead of returning the document with an exact match.
Industry acronyms and abbreviations. "IFRS 17", "KSeF", "CMMS"—the model may lack sufficient context for rare, especially Polish, specialized abbreviations. It returns documents about finance or maintenance systems, but not the specific standard.
Proper names and versions. "Comarch ERP XL 2024.2.1" vs. "Comarch ERP Optima"—for embeddings, these are close neighbors. For the user, these are different products. Confusing versions in responses generates real operational errors.
BM25 (Best Match 25) addresses these cases differently: it counts term frequency in the document, normalizes by document length, and promotes results where the queried tokens literally appear. For SKU-4521, BM25 returns the document containing this string in the first position. For a general question like "how to configure integration?", semantics performs better because the user describes intent, not an identifier.
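The scoring described above can be sketched in a few lines of pure Python. This is a minimal illustration, not a production engine: the tokenizer regex, the parameter values k1=1.2 and b=0.75, and the three sample documents are assumptions made for the example.

```python
import math
import re
from collections import Counter

# Assumed tokenizer: keeps codes such as "SKU-4521" or "7742-A" as single tokens.
TOKEN_RE = re.compile(r"[a-z0-9]+(?:[-/.][a-z0-9]+)*")

def tokenize(text):
    return TOKEN_RE.findall(text.lower())

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score every document in `docs` (dict id -> text) against `query`."""
    tokenized = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized.values()) / n_docs
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized.values() for term in set(tokens))
    scores = {}
    for doc_id, tokens in tokenized.items():
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Longer-than-average documents are penalized through b.
            length_norm = 1 - b + b * len(tokens) / avg_len
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
        scores[doc_id] = score
    return scores

# Hypothetical mini-corpus.
docs = {
    "sheet-4521": "Technical sheet for SKU-4521 intake module",
    "sheet-4522": "Technical sheet for SKU-4522 intake module",
    "guide": "General guide to configuring intake modules",
}
scores = bm25_scores("SKU-4521", docs)
best = max(scores, key=scores.get)
print(best)
```

Only the document that literally contains the code gets a non-zero score, which is the behavior an embedding model cannot guarantee. The important design choice is the tokenizer: if it split "SKU-4521" into "sku" and "4521", both technical sheets would match on "sku".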
How RRF fusion works
Reciprocal Rank Fusion is an algorithm for merging rankings from independent search engines. The formula for the score of document d is simple:
RRF(d) = Σ_i 1 / (k + rank_i(d))
where k is a smoothing constant (default 60), and rank_i(d) is the position of the document in the i-th ranking. A document ranked 1st in both engines receives a score of 1/(60+1) + 1/(60+1) ≈ 0.033. A document ranked 1st in BM25 but 50th in semantics receives 1/61 + 1/110 ≈ 0.025. Fusion rewards consistency: if both engines agree a document is relevant, it ranks high. If only one indicates it, it still has a chance, but with a lower score.
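The two worked scores above can be reproduced with a short sketch. The function name `rrf_fuse` and the document ids are illustrative; ranks are assumed to be 1-based, as in the formula.

```python
from collections import defaultdict

def rrf_fuse(rankings, k=60):
    """Fuse ranked lists of document ids; returns (id, score) pairs, best first."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Scenario 1: "A" is ranked 1st by both engines.
both_first = dict(rrf_fuse([["A", "X"], ["A", "Y"]]))

# Scenario 2: "B" is 1st in BM25 but 50th in the semantic ranking.
filler = [f"other-{i}" for i in range(49)]
split = dict(rrf_fuse([["B"], filler + ["B"]]))

print(round(both_first["A"], 3))  # 0.033
print(round(split["B"], 3))       # 0.025
```

A document missing from one list simply contributes nothing for that engine, so no special handling of missing scores is needed.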
The advantage of RRF over weighted score summation: there’s no need to normalize BM25 scores (unbounded, non-negative real numbers that depend on the corpus) against cosine similarity (values in [-1, 1]). RRF operates solely on positions, so scales don’t matter. This is particularly important for dynamic corpora, where BM25 score ranges change as the document base grows.
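A small sketch makes the scale argument concrete. All scores below are invented for illustration: a naive sum of raw scores is dominated by BM25, while rank-based fusion gives the same result even after the BM25 scores are rescaled.

```python
def to_ranking(scores):
    """Turn a dict of doc id -> score into a list of ids, best first."""
    return sorted(scores, key=scores.get, reverse=True)

def rrf(rankings, k=60):
    fused = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    return fused

# Hypothetical raw scores from the two engines.
bm25 = {"a": 14.2, "b": 9.1, "c": 3.3}      # unbounded, corpus-dependent
cosine = {"a": 0.62, "b": 0.40, "c": 0.81}  # bounded to [-1, 1]

# Simulate a grown corpus: BM25 scores shift, but their order stays the same.
bm25_rescaled = {doc_id: score * 0.4 + 1.0 for doc_id, score in bm25.items()}

fused_before = rrf([to_ranking(bm25), to_ranking(cosine)])
fused_after = rrf([to_ranking(bm25_rescaled), to_ranking(cosine)])

# Naive summation of raw scores: the cosine signal barely registers.
naive_sum = {doc_id: bm25[doc_id] + cosine[doc_id] for doc_id in bm25}

print(to_ranking(naive_sum))     # same order as BM25 alone
print(to_ranking(fused_before))  # "c" moves up thanks to the semantic engine
print(fused_before == fused_after)
```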
After RRF fusion, it’s worth adding a reranking step: a cross-encoder model evaluates each candidate in the context of the entire query. A reranker improves precision for complex multi-part queries. More on reranking architecture in the article about RAG search quality.
Table: query type vs. better method
| Query Type | Example | Better Method | Rationale |
|---|---|---|---|
| Identifier / code | "SKU-4521", "P/N 7742-A" | BM25 | Exact token match |
| Industry acronym | "KSeF 2026", "IFRS 17" | BM25 + hybrid | BM25 for token, semantics for context |
| Descriptive question | "how to file a complaint online" | Semantics | Intent more important than words |
| Mixed question | "return procedure for SKU-4521" | Hybrid RRF | Both signals needed |
| Proper name + version | "Comarch ERP XL 2024.2.1" | BM25 + hybrid | Version requires precision |
| Conceptual question | "difference between leasing and renting" | Semantics | No unique tokens |
| Multilingual query | PL code + EN description | Hybrid + BGE-M3 | Semantic language bridge |
Step-by-step configuration
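Implementing hybrid search in an existing RAG stack requires a few precise steps. We assume Qdrant as the vector database and Elasticsearch or Postgres full-text search (tsvector) as the lexical engine. Elasticsearch ranks with BM25 by default; Postgres ts_rank is a lexical substitute rather than true BM25, and pg_trgm provides trigram similarity, which helps with fuzzy matching of codes but is not a BM25 ranking.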
Step 1: Dual-index the corpus. The same text goes into the vector database (BGE-M3 embeddings or another model) and the full-text index. Both indexes must use the same document identifiers so fusion can merge results by key.
Step 2: Execute both queries in parallel. For each user query, send it simultaneously to the semantic engine (top-k=50-100 candidates) and BM25 (top-k=50-100). Limit the number of candidates, as the reranker at a later stage has a limited token budget.
Step 3: RRF fusion. Merge the two candidate lists via RRF. Start with k=60 (default value). If queries in your system are primarily lexical (many codes and SKUs), lower k to 20-30 to strengthen the impact of a strong leader in one ranking. Experiment on a golden set of at least 100 queries with expected answers.
Step 4: Reranker (optional but recommended). Pass the top-20 from RRF to a cross-encoder model (e.g., bge-reranker-v2-m3 locally). The reranker evaluates each (query, fragment) pair and assigns a precise score. Latency increases by 100-300 ms, but precision for complex queries improves by 10-20 MAP points.
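Steps 2 to 4 can be sketched as one function. This is an outline under stated assumptions, not a client integration: `lexical_search`, `semantic_search`, and `rerank` are hypothetical callables that you would back with your Elasticsearch, Qdrant, and cross-encoder code, and the stub engines below only return fixed lists.

```python
from concurrent.futures import ThreadPoolExecutor

def rrf_fuse(rankings, k=60):
    """Return document ids ordered by their RRF score, best first."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

def hybrid_search(query, lexical_search, semantic_search, rerank=None,
                  top_k=50, k=60, rerank_top_n=20):
    """lexical_search / semantic_search: callables (query, top_k) -> list of ids.
    rerank: optional callable (query, ids) -> list of ids, best first."""
    # Step 2: query both engines in parallel.
    with ThreadPoolExecutor(max_workers=2) as pool:
        lexical = pool.submit(lexical_search, query, top_k)
        semantic = pool.submit(semantic_search, query, top_k)
        rankings = [lexical.result(), semantic.result()]
    # Step 3: fuse by rank; both engines must share document ids.
    fused = rrf_fuse(rankings, k=k)
    # Step 4: rerank only the head of the fused list to bound latency.
    head, tail = fused[:rerank_top_n], fused[rerank_top_n:]
    if rerank is not None:
        head = rerank(query, head)
    return head + tail

# Stub engines standing in for real search clients.
def fake_lexical(query, top_k):
    return ["sheet-4521", "returns-policy", "faq"][:top_k]

def fake_semantic(query, top_k):
    return ["returns-policy", "sheet-4522", "sheet-4521"][:top_k]

results = hybrid_search("return procedure for SKU-4521", fake_lexical, fake_semantic)
print(results)
```

Because the two searches run concurrently, the retrieval latency is roughly that of the slower engine rather than the sum of both.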
For optimal tech stack configuration, use the stack selection tool, which considers corpus scale, latency requirements, and hosting environment.
When hybrid doesn’t help
Hybrid isn’t free. Two queries instead of one, fusion, optional reranker: total system latency increases by 80-200 ms (estimate depends on hardware and candidate size). In systems requiring end-to-end responses under 200 ms (e.g., live customer chat), this may be too much.
Don’t implement hybrid when:
- The corpus is homogeneous and descriptive (e.g., only conversational FAQ)—pure semantics suffices.
- All queries are conceptual, and the database contains no identifiers.
- You have fewer than 1,000 documents, and simple semantics achieves recall above 0.9 on the golden set.
- Latency is a hard requirement, and caching results isn’t possible.
Best practice is to measure recall@5 and recall@10 separately for BM25, semantics, and hybrid on a representative golden set before deciding. More on building golden sets and measuring quality in the article about RAG response quality evaluation.
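The comparison can be scripted in a few lines. The golden set and the three result sets below are made up for illustration, and the toy lists are so short that the sketch reports recall@1 and recall@2; on a real golden set you would pass k=5 and k=10.

```python
def recall_at_k(results, expected, k):
    """Share of queries whose expected document appears in the top k results."""
    hits = sum(
        1 for query, doc_id in expected.items()
        if doc_id in results.get(query, [])[:k]
    )
    return hits / len(expected)

# Hypothetical golden set: query -> id of the document that should be returned.
golden = {
    "SKU-4521": "sheet-4521",
    "how to file a complaint": "complaints",
    "return procedure for SKU-4521": "returns-4521",
}

# Hypothetical ranked results per engine.
runs = {
    "bm25": {
        "SKU-4521": ["sheet-4521", "sheet-4522"],
        "how to file a complaint": ["faq", "contact"],
        "return procedure for SKU-4521": ["sheet-4521", "returns-4521"],
    },
    "semantic": {
        "SKU-4521": ["sheet-4522", "sheet-4530"],
        "how to file a complaint": ["complaints", "faq"],
        "return procedure for SKU-4521": ["returns-general", "returns-4521"],
    },
    "hybrid": {
        "SKU-4521": ["sheet-4521", "sheet-4522"],
        "how to file a complaint": ["complaints", "faq"],
        "return procedure for SKU-4521": ["returns-4521", "sheet-4521"],
    },
}

report = {
    name: {k: recall_at_k(results, golden, k) for k in (1, 2)}
    for name, results in runs.items()
}
for name, row in report.items():
    print(name, row)
```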
FAQ
Does BM25 require a separate database?
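Not always. If you’re already using PostgreSQL, you can run lexical search with the built-in tsvector and ts_rank, with no additional infrastructure. Note that ts_rank is a term-frequency ranking rather than true BM25 (it does not apply inverse document frequency), so treat it as a close substitute. Elasticsearch and OpenSearch rank with BM25 by default and offer more tuning options (field boosting, custom tokenizers for product codes), but for small to medium corpora (up to hundreds of thousands of documents), Postgres full-text search is usually sufficient. Qdrant also supports sparse vector search (for example with SPLADE representations), which functionally approaches BM25; sparse vectors have been available since version 1.7, and version 1.10 added a Query API with built-in hybrid fusion.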
How do I adjust the balance between BM25 and semantics?
With standard RRF, you can’t directly adjust a "weight": the algorithm operates on positions. However, you can regulate the number of candidates from each engine (top-k) or use weighted fusion with score normalization (linear interpolation), where the alpha parameter defines the share of semantics. Start with alpha=0.5 (equal share) and experiment on the golden set. Typical findings from deployments: with many codes and SKUs, alpha=0.3 (stronger BM25) improves results; for descriptive texts, alpha=0.7 (stronger semantics) works better.
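A minimal sketch of the interpolation variant, assuming min-max normalization as the way to bring both score types onto [0, 1] (other normalizations are possible). The scores are invented so that the choice of alpha changes the winner.

```python
def min_max(scores):
    """Rescale scores to [0, 1]; if all scores are equal, everything maps to 0.0."""
    lo, hi = min(scores.values()), max(scores.values())
    span = hi - lo
    return {doc_id: (s - lo) / span if span else 0.0 for doc_id, s in scores.items()}

def interpolate(bm25_scores, semantic_scores, alpha=0.5):
    """alpha is the share of the semantic score; 1 - alpha goes to BM25."""
    lexical = min_max(bm25_scores)
    semantic = min_max(semantic_scores)
    combined = {
        doc_id: alpha * semantic.get(doc_id, 0.0) + (1 - alpha) * lexical.get(doc_id, 0.0)
        for doc_id in set(lexical) | set(semantic)
    }
    return sorted(combined, key=combined.get, reverse=True)

# Hypothetical raw scores for the query "SKU-4521".
bm25_scores = {"sheet-4521": 12.4, "sheet-4522": 2.1, "overview": 1.0}
semantic_scores = {"sheet-4521": 0.60, "sheet-4522": 0.74, "overview": 0.52}

lexical_heavy = interpolate(bm25_scores, semantic_scores, alpha=0.3)
semantic_heavy = interpolate(bm25_scores, semantic_scores, alpha=0.7)
print(lexical_heavy[0])   # sheet-4521
print(semantic_heavy[0])  # sheet-4522
```

With alpha=0.3 the exact code match wins; with alpha=0.7 the semantically closest but wrong sheet takes first place, which is the failure mode described at the start of the article.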
Does hybrid work with multilingual queries?
Yes, provided you use a multilingual embedding model (BGE-M3 supports over 100 languages) and a BM25 tokenizer that handles the morphology of the given language. For Polish, lemmatization before BM25 indexing is important (e.g., via Morfologik or a stemmer), because "zamówień", "zamówieniu", and "zamówienie" reduce to the same token after lemmatization. Without lemmatization, BM25 recall drops by 20-40% on Polish queries.
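The effect of lemmatization on lexical matching can be shown with a toy example. The `LEMMAS` table below is a hand-written stand-in for a real lemmatizer such as Morfologik and covers only the forms used here; it is an assumption for illustration, not how Morfologik is called.

```python
# Toy lemma table; a real pipeline would call a lemmatizer instead.
LEMMAS = {
    "zamówień": "zamówienie",
    "zamówieniu": "zamówienie",
    "zamówienia": "zamówienie",
}

def raw_tokens(text):
    """Lowercase and split on whitespace, with no normalization."""
    return text.lower().split()

def lemmatized_tokens(text):
    """Same as raw_tokens, but inflected forms are mapped to their lemma."""
    return [LEMMAS.get(token, token) for token in raw_tokens(text)]

document = "Status zamówień w systemie"
query = "zamówieniu"

raw_overlap = set(raw_tokens(document)) & set(raw_tokens(query))
lemma_overlap = set(lemmatized_tokens(document)) & set(lemmatized_tokens(query))

print(raw_overlap)    # set(): BM25 sees no shared token
print(lemma_overlap)  # {'zamówienie'}
```

The same normalization has to run at indexing time and at query time; otherwise the index and the query still disagree on tokens.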
When is it worth adding a reranker after hybrid RRF?
A reranker is particularly valuable for multi-part queries and questions requiring full context understanding (e.g., "what is the return procedure for goods purchased in a December 2024 promotion for B2B customers"). RRF merges ranking signals but doesn’t understand the query holistically. A cross-encoder evaluates each (query, fragment) pair independently and captures subtle matches. The article on semantic search architecture in a company details when it’s worth integrating a reranker into the pipeline.
How to assess if hybrid actually improved results?
Build a golden set: at least 100 queries with expected documents (start with 50 and expand). Measure recall@5 (whether the correct document is in the top 5) and MRR (Mean Reciprocal Rank, how high the first relevant result appears). Compare three variants: BM25 solo, semantics solo, hybrid RRF. In systems with mixed technical vocabulary, hybrid improves recall@5 by 15-35% compared to semantics alone. If the improvement is less than 5%, first check chunking quality, as the issue often lies there. A useful starting point is the website and system audit, which identifies bottlenecks before implementation.
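MRR and the relative improvement can be computed with a short sketch. It assumes one expected document per query; the four-query golden set and both result sets are invented for illustration.

```python
def mean_reciprocal_rank(results, expected):
    """Average of 1/rank of the expected document; a miss contributes 0."""
    total = 0.0
    for query, doc_id in expected.items():
        ranking = results.get(query, [])
        if doc_id in ranking:
            total += 1.0 / (ranking.index(doc_id) + 1)
    return total / len(expected)

def relative_gain(baseline, candidate):
    """Relative improvement of candidate over baseline, in percent."""
    return (candidate - baseline) / baseline * 100

# Hypothetical golden set and ranked results.
golden = {"q1": "a", "q2": "b", "q3": "c", "q4": "d"}
semantic = {"q1": ["a", "x"], "q2": ["x", "b"], "q3": ["x", "y"], "q4": ["x", "y", "z", "d"]}
hybrid = {"q1": ["a", "x"], "q2": ["b", "x"], "q3": ["x", "c"], "q4": ["x", "d"]}

mrr_semantic = mean_reciprocal_rank(semantic, golden)
mrr_hybrid = mean_reciprocal_rank(hybrid, golden)
print(mrr_semantic, mrr_hybrid)  # 0.4375 0.75
print(round(relative_gain(mrr_semantic, mrr_hybrid), 1))
```

Run the same functions on BM25 alone as the third variant; a gain under the 5% threshold mentioned above is the signal to inspect chunking before tuning fusion.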