Most teams stop at vector top-k and then wonder why the assistant gets roughly every fifth answer wrong. The cause is almost always the same: the search returned chunks that are similar to the query, but not the ones that answer the question. Reranking solves precisely this problem, without rebuilding the entire pipeline.
Why embeddings alone aren't enough
Embedding converts text into a vector, compressing its meaning into a fixed-length list of numbers. The embedding model never sees the question and chunk together: it embeds them separately, and similarity is measured geometrically afterward. This is fast and scalable, but it comes at a cost.
A common production example: a company's knowledge base contains two documents. One describes the "complaint procedure for individual customers," the other the "complaint procedure for business customers." Both have very similar embeddings because most words are shared. The question "how to file a complaint as a company" is also close to both in vector space. ANN search returns both with similar scores. The generative model receives both chunks and can easily mix up the two procedures.
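The effect can be illustrated with a deliberately crude sketch. It uses word counts as a stand-in for dense embeddings (an assumption made only to keep the example self-contained; a real embedding model would capture more meaning than this), and the function name `bow_cosine` is hypothetical.

```python
import math
from collections import Counter

def bow_cosine(a: str, b: str) -> float:
    """Cosine similarity between the word-count vectors of two texts."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

doc_individual = "complaint procedure for individual customers"
doc_business = "complaint procedure for business customers"
query = "how to file a complaint as a company"

# The two documents share 4 of their 5 words, so they sit close together.
print(round(bow_cosine(doc_individual, doc_business), 3))  # 0.8
# The query matches both documents equally: each text is scored in isolation.
print(round(bow_cosine(query, doc_individual), 3))  # 0.141
print(round(bow_cosine(query, doc_business), 3))    # 0.141
```

In this toy setup the query scores identically against both documents, so nothing in the similarity score alone says which procedure applies to a company.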
A cross-encoder reranker works differently: it receives the question and each candidate as a pair and calculates a relevance score with their mutual context in mind. The chunk about business customers gets a higher score. Relevance improves without changing the vector index.
This is also why hybrid search and reranking are a natural pair: hybrid search (BM25 + ANN) widens candidate coverage, while the reranker improves precision within that pool.
How cross-encoder works: mechanism under the hood
A cross-encoder is a transformer model (typically a BERT variant or similar) trained on pairs (question, chunk) with relevance labels. During inference:
- The question and chunk are concatenated as a single input sequence with a special separator token.
- The model processes this pair through all attention layers simultaneously, so every question token can "see" every chunk token.
- The output is a single scalar: the relevance score for that specific pair.
- The pipeline calculates the score for each candidate from the top-k, sorts them in descending order, and returns the top-n to the generative model.
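The steps above can be sketched as a small function. The scorer is passed in as a parameter: in a real pipeline it would be a cross-encoder forward pass, while here `toy_score` is a hypothetical stand-in (word overlap with one hard-coded synonym) so the example runs without a model.

```python
from typing import Callable, List, Tuple

def rerank(query: str, candidates: List[str],
           score_pair: Callable[[str, str], float],
           top_n: int = 3) -> List[Tuple[str, float]]:
    """Score every (query, candidate) pair, sort descending, keep the top_n."""
    scored = [(chunk, score_pair(query, chunk)) for chunk in candidates]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_n]

def toy_score(query: str, chunk: str) -> float:
    """Hypothetical stand-in for a cross-encoder: it looks at query and chunk
    together and returns the share of query words found in the chunk."""
    synonyms = {"company": "business"}  # illustrative only
    q_words = {synonyms.get(w, w) for w in query.lower().split()}
    c_words = set(chunk.lower().split())
    return len(q_words & c_words) / len(q_words) if q_words else 0.0

candidates = [
    "complaint procedure for individual customers",
    "complaint procedure for business customers",
]
best = rerank("how to file a complaint as a company", candidates, toy_score, top_n=1)
print(best[0][0])  # complaint procedure for business customers
```

Keeping the scorer pluggable means the sorting and cut-off logic can be tested without loading a model, and the model can be swapped later without touching this function.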
The cost is one cross-encoder inference per candidate. With top-20 candidates, that's 20 forward passes of the reranking model instead of one. That's why the reranker processes only the candidates returned by search, not the entire index.
| Mechanism | Input | Query-chunk interaction | Speed | Relevance |
|---|---|---|---|---|
| Bi-encoder (ANN) | each text separately | none (post-hoc cosine) | very fast | good |
| Cross-encoder (reranker) | question+chunk pair | full (joint attention over the pair) | slower | high |
| Hybrid BM25+ANN+reranker | both | full on candidates | moderate | highest |
A bi-encoder scales to millions of documents because chunk embeddings are computed once, at indexing time. A cross-encoder doesn't scale that way: it would need a forward pass for every (question, chunk) pair at query time. That's why the production pattern is always the same: fast candidate search first, then an expensive reranker on a small set.
When reranking makes the biggest difference
Not every pipeline needs reranking from day one. It’s worth adding when:
The knowledge base contains documents with similar wording but different content. Procedures for different customer groups, regulations in multiple versions, instructions for different device models. Embedding treats them almost the same; reranker distinguishes them.
Questions are multi-faceted or ambiguous. "How to cancel a subscription if I was on vacation and missed the deadline" has many components. Bi-encoder simplifies it to one vector. Cross-encoder reads the whole thing.
The pipeline answers Polish questions from English documents or vice versa. Multilingual embedding models tend to be weaker when the question and the chunk are in different languages. A cross-encoder trained on multilingual pairs handles contextual relevance better.
You have long chunks (over 512 tokens). A bi-encoder compresses a long chunk into one vector, losing details. A cross-encoder reads the question and the chunk text together, up to its own maximum input length.
If your knowledge base is small (up to a few hundred documents), questions are short and precise, and answers are satisfactory, reranking may not provide a measurable improvement. Always verify results on a test set before deployment.
Reranking models: what to run locally
The choice of reranking model affects quality, latency, and whether data leaves your infrastructure. Basic rule: if the knowledge base contains sensitive or personal data, the reranker must run locally. Sending a document chunk to an external reranking API means data leaves the corporate network.
We use locally run reranking models. Practical options in 2026:
| Model | Size | Languages | Latency (per pair) | Hosting |
|---|---|---|---|---|
| bge-reranker-v2-m3 | 568 MB | multilingual (PL, EN, DE...) | ~30 ms (CPU) | local |
| bge-reranker-large | 1.3 GB | multilingual | ~60 ms (CPU) | local |
| ms-marco-MiniLM-L-12 | 127 MB | EN | ~10 ms (CPU) | local / cloud |
| Cohere Rerank v3 | API | multilingual | ~80 ms (API) | cloud |
For a Polish knowledge base, bge-reranker-v2-m3 is the first choice: multilingual, fits on a standard CPU server, latency around a few dozen milliseconds per pair. With 20 candidates, that’s under 700 ms of additional time, which is acceptable in most enterprise assistant scenarios.
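The arithmetic behind that estimate is simple enough to write down. This is a back-of-the-envelope sketch that assumes pairs are scored one after another with no batching; both function names are hypothetical.

```python
def rerank_latency_ms(num_candidates: int, ms_per_pair: float) -> float:
    """Estimated reranking time when pairs are scored sequentially."""
    return num_candidates * ms_per_pair

def max_candidates(budget_ms: float, ms_per_pair: float) -> int:
    """Largest candidate pool that still fits in the latency budget."""
    return int(budget_ms // ms_per_pair)

# Figures from the table above: ~30 ms per pair on CPU.
print(rerank_latency_ms(20, 30))  # 600
print(max_candidates(700, 30))    # 23
```

Measured latency will differ with hardware, chunk length, and batching, so treat the result as a planning number rather than a guarantee.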
Self-hosting the reranking model also has a GDPR compliance dimension: the document chunk never leaves the infrastructure, even temporarily during relevance assessment. This is important for knowledge bases containing procedures, contracts, or customer data.
Building a pipeline with reranking
A complete pipeline looks like this. Each step has one task and produces clearly defined output.
Step 1: Indexing. Each document is split into chunks (typically 256-512 tokens with a 64-token overlap). Each chunk is embedded with the BGE-M3 model and stored in a vector database (Qdrant) with metadata: category, date, department, document type.
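The chunking part of this step can be sketched as a sliding window. The example assumes the text is already tokenized (a plain list of strings stands in for the embedding model's tokenizer output), and `chunk_tokens` is a hypothetical helper, not part of any library.

```python
from typing import List

def chunk_tokens(tokens: List[str], chunk_size: int = 512,
                 overlap: int = 64) -> List[List[str]]:
    """Split a token list into windows of chunk_size tokens, where each
    window repeats the last `overlap` tokens of the previous one."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the document
    return chunks

tokens = [str(i) for i in range(1000)]  # stand-in for a tokenized document
chunks = chunk_tokens(tokens, chunk_size=512, overlap=64)
print(len(chunks))                 # 3
print([len(c) for c in chunks])    # [512, 512, 104]
```

The overlap keeps a sentence that straddles a boundary visible in both neighboring chunks, at the price of storing some tokens twice.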
Step 2: Hybrid search. For a user query, BM25 (Postgres FTS or Elasticsearch) and ANN (Qdrant) run in parallel. Results are merged via Reciprocal Rank Fusion (RRF) into a pool of 20-50 candidates. This step usually takes under 100 ms.
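Reciprocal Rank Fusion needs only the rank positions from each retriever, not their raw scores. A minimal sketch follows; the constant `k = 60` is the value commonly used for RRF and is an assumption here, as the article does not state which constant the pipeline uses.

```python
from collections import defaultdict
from typing import Dict, List

def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60,
                           pool_size: int = 50) -> List[str]:
    """Merge several ranked lists of document ids.
    Each list contributes 1 / (k + rank) to a document's fused score."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)[:pool_size]

bm25_hits = ["a", "b", "c"]  # hypothetical ids from full-text search
ann_hits = ["b", "d", "a"]   # hypothetical ids from vector search
print(reciprocal_rank_fusion([bm25_hits, ann_hits]))  # ['b', 'a', 'd', 'c']
```

Documents found by both retrievers ("a" and "b") rise above those found by only one, which is the behavior you want before handing the pool to the reranker.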
Step 3: Reranking. Each candidate is evaluated by the cross-encoder as a pair (query, chunk). Results are sorted in descending order. The generative model receives the top-3 to top-5 chunks.
Step 4: Generation with context. The generative model receives the question and selected chunks via an LLM router. Guardrails check the output before returning it to the user. If no chunk exceeds the relevance threshold, the system responds "I don’t know" and escalates to a human (see: human-handoff).
Critical detail: the reranking relevance threshold. If the highest score in the pool is below 0.3 (on a 0-1 scale, which assumes the reranker's scores are normalized, for example with a sigmoid), even the best chunk is insufficiently related to the question. It’s better not to generate an answer than to generate one from low-relevance context. This logic directly limits hallucinations.
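A minimal sketch of this gate, assuming the reranker returns raw logits that need a sigmoid to land on a 0-1 scale (if your model already returns normalized scores, skip that step). The function name `select_context` and the `None` return convention are hypothetical.

```python
import math
from typing import List, Optional, Tuple

def sigmoid(x: float) -> float:
    """Map a raw logit to the 0-1 range (numerically stable for large |x|)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def select_context(scored: List[Tuple[str, float]], threshold: float = 0.3,
                   top_n: int = 3) -> Optional[List[str]]:
    """Return the top_n chunks, or None when even the best chunk scores
    below the threshold (caller answers "I don't know" and escalates)."""
    ranked = sorted(((chunk, sigmoid(logit)) for chunk, logit in scored),
                    key=lambda item: item[1], reverse=True)
    if not ranked or ranked[0][1] < threshold:
        return None
    return [chunk for chunk, _ in ranked[:top_n]]

print(select_context([("chunk A", -3.0), ("chunk B", -2.0)]))             # None
print(select_context([("chunk A", 2.0), ("chunk B", -2.0)], top_n=1))    # ['chunk A']
```

The value 0.3 comes from the text above; the right threshold depends on the model and should be tuned on the evaluation set.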
Reranking and latency: how to maintain response time
Reranking adds time. Whether that time is acceptable depends on the context.
In an enterprise assistant with sub-3-second responses, reranking 20 candidates on CPU takes 600-800 ms. That’s 20-30% of the total response time. In most customer service and internal support scenarios, this is acceptable.
If the pipeline needs to be faster, a few optimizations:
- Reduce the candidate pool. Top-10 instead of top-20 roughly halves reranking time, usually with negligible coverage loss.
- Cache reranking results for repeated questions (with a 24h TTL). In customer service systems, 30-50% of questions are repeated FAQ questions.
- Use a smaller reranking model. MiniLM is roughly 5-6x faster than bge-reranker-large (about 10 ms vs. 60 ms per pair in the table above), with an acceptable relevance loss for simpler knowledge bases.
- Filter by metadata before reranking. If the document category is known from context (e.g., the user is in the "products" section), limit the pool to that category at the ANN stage.
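The caching idea from the list above can be sketched with a small in-memory store. This is an illustration under simple assumptions: a single process, exact-match lookups after whitespace and case normalization, and a hypothetical `TTLCache` class rather than any particular caching library.

```python
import time
from typing import Callable, Dict, List, Optional, Tuple

class TTLCache:
    """In-memory cache of reranked chunk ids, keyed by the normalized query."""

    def __init__(self, ttl_seconds: float = 24 * 3600,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so expiry can be tested without waiting
        self._store: Dict[str, Tuple[float, List[str]]] = {}

    @staticmethod
    def _key(query: str) -> str:
        # Lowercase and collapse whitespace so trivial variants share an entry.
        return " ".join(query.lower().split())

    def get(self, query: str) -> Optional[List[str]]:
        key = self._key(query)
        entry = self._store.get(key)
        if entry is None:
            return None
        stored_at, ranked_ids = entry
        if self.clock() - stored_at > self.ttl:
            del self._store[key]  # expired: drop it and report a miss
            return None
        return ranked_ids

    def put(self, query: str, ranked_ids: List[str]) -> None:
        self._store[self._key(query)] = (self.clock(), ranked_ids)

cache = TTLCache()
cache.put("How do I  file a complaint?", ["chunk-7", "chunk-2"])
print(cache.get("how do i file a complaint?"))  # ['chunk-7', 'chunk-2']
```

A cache like this should also be cleared when the knowledge base is re-indexed, otherwise it can keep serving chunk ids that no longer exist.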
Throughput and latency are dimensions we measure in every deployment. Details on monitoring the pipeline are covered in the article monitoring AI agent quality.
Evaluation: how to check if reranking helped
Don’t add reranking without a test set. Without measurement, you can’t be sure you’ve improved anything.
A minimal evaluation set consists of 50-100 pairs (question, expected chunk). You can build it from production logs (user questions + consultant feedback) or manually for the most important areas of the knowledge base.
Metrics to watch:
- MRR (Mean Reciprocal Rank): how high the first relevant chunk appears on the list, averaged over all questions. MRR@10 measures this on the first 10 results.
- NDCG (Normalized Discounted Cumulative Gain): a rank-weighted measure in which relevant chunks count for more the higher they appear; 1.0 means the ideal ordering.
- Precision@k: the fraction of the top-k chunks that are actually relevant.
- Response time p50/p95 — before and after reranking.
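The three ranking metrics fit in a few lines each. The sketch below assumes binary relevance (a chunk is either relevant or not) and positive `k`; the function names are hypothetical, and the NDCG shown is the binary variant.

```python
import math
from typing import List, Set

def reciprocal_rank(ranked: List[str], relevant: Set[str], k: int = 10) -> float:
    """1 / rank of the first relevant chunk within the top k, else 0."""
    for rank, doc_id in enumerate(ranked[:k], start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

def mrr_at_k(all_ranked: List[List[str]], all_relevant: List[Set[str]],
             k: int = 10) -> float:
    """Mean reciprocal rank over a set of test questions."""
    rrs = [reciprocal_rank(r, rel, k) for r, rel in zip(all_ranked, all_relevant)]
    return sum(rrs) / len(rrs) if rrs else 0.0

def precision_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top k chunks that are relevant (k must be positive)."""
    return sum(1 for doc_id in ranked[:k] if doc_id in relevant) / k

def ndcg_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Binary-relevance NDCG: discounted gain divided by the ideal gain."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, doc_id in enumerate(ranked[:k], start=1)
              if doc_id in relevant)
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0

# Two test questions: the relevant chunk is ranked 2nd, then 1st.
print(mrr_at_k([["a", "b"], ["c", "d"]], [{"b"}, {"c"}]))  # 0.75
print(precision_at_k(["a", "b", "c", "d"], {"a", "c"}, 4))   # 0.5
```

Running the same functions on the ranking before and after the reranker, over the same test questions, gives the before/after comparison directly.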
An A/B benchmark before/after reranking on the same test set gives a concrete answer. In typical projects with diverse knowledge bases, reranking increases MRR@5 by 15-30% compared to ANN alone. This translates to fewer escalations to humans and a higher containment rate.
FAQ
How does reranking differ from regular vector search?
Vector search embeds the question and the chunks separately and only then compares the resulting vectors; the model never sees them together. Reranking processes each pair (question, chunk) together through a cross-encoder model, which takes their mutual context into account. Result: vector search is fast and scales well to millions of documents, while the reranker is slower but assesses much more accurately which candidate actually answers the question.
Is reranking necessary in every RAG?
No. With a small, consistent knowledge base (up to a few hundred well-structured documents) and precise questions, semantic search alone is often sufficient. Reranking is cost-effective when the knowledge base is large or heterogeneous, questions are multi-faceted, or you have documents with similar wording but different content. Before adding reranking, measure MRR and Precision@k on a test set without it, then compare results after adding it. Data decides, not intuition.
Does reranking send data to the cloud?
Only if you use a reranking API (e.g., Cohere Rerank). With a local model (bge-reranker-v2-m3 via Ollama), all chunks and questions are processed on your own server. For knowledge bases containing sensitive or GDPR-protected data, always choose local models. Company obligations when processing personal data in AI are covered in the article AI Act and RODO 2026.
How long does it take to implement reranking in an existing RAG?
If the RAG pipeline is already running, adding reranking typically takes days, not weeks. The search step is extended with a cross-encoder model (downloaded and run locally), sorting logic based on the new score, and optionally a minimum relevance threshold. The bigger effort is the evaluation set and benchmark, which confirm that the change improved quality. Without them, you don’t know what you’ve actually gained.
Does reranking help with questions about numbers, codes, and proper names?
Reranking helps here, but most of the gain comes from hybrid search as the preceding step. BM25 precisely captures contract numbers, SKUs, or unique product names, which embeddings don’t handle well. The reranker then evaluates which candidate (from both BM25 and ANN) actually answers the question. Combining both mechanisms delivers the best results on diverse knowledge bases, as we describe in more detail in the article semantic search and embeddings in enterprise.