We regularly see the same scenario: a team selects an embedding model based on its MTEB ranking, deploys it to production, and only after weeks of user complaints discovers the assistant fails to find obvious documents. Public benchmarks show how a model performs on third-party English data. They say nothing about how it will handle your contracts, ISO procedures, or Polish service requests. The only reliable measurement is evaluation on a dataset that reflects your users' real queries.
Three metrics that actually matter
Evaluating semantic search boils down to one question: for a given query, do the correct fragments appear high in the results list? Three complementary metrics measure this.
Recall@k answers what percentage of expected fragments appeared in the top k results. It’s the “did we find anything at all” metric. If you expect 3 relevant documents and 2 appear in the top 5, recall@5 is 0.67. Recall@k is critical for RAG because the generative model only sees what retrieval provides; a missing fragment means missing context and hallucination risk.
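The recall@5 example above can be reproduced in a few lines. This is a minimal sketch: the function name and the fragment ids are illustrative, and it assumes every golden-set query has at least one expected fragment.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the expected fragments that appear in the top k results."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("a golden-set query needs at least one relevant fragment")
    hits = relevant & set(retrieved_ids[:k])
    return len(hits) / len(relevant)


# Illustrative ids: 3 expected fragments, 2 of them land in the top 5.
retrieved = ["doc7", "doc2", "doc9", "doc4", "doc1", "doc3"]
relevant = ["doc2", "doc4", "doc3"]

score = recall_at_k(retrieved, relevant, k=5)
print(round(score, 2))  # 0.67
```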
MRR (Mean Reciprocal Rank) measures how high the first relevant result appears. For a query where the relevant document ranks 1st, the contribution is 1; 2nd is 1/2; 4th is 1/4. MRR averages these reciprocals across all queries. High MRR means you often hit “on the first try,” letting you provide less context and reduce token costs.
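A short sketch of the same calculation, using the ranks 1, 2, and 4 from the paragraph above. The function names and ids are illustrative, and the choice to score a query with no relevant result as 0 is an assumption (a common convention, not something the text prescribes).

```python
def reciprocal_rank(retrieved_ids, relevant_ids):
    """1 / position of the first relevant result, or 0.0 if none was retrieved."""
    relevant = set(relevant_ids)
    for position, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in relevant:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(runs):
    """runs: list of (retrieved_ids, relevant_ids) pairs, one per query."""
    if not runs:
        raise ValueError("need at least one query")
    return sum(reciprocal_rank(r, g) for r, g in runs) / len(runs)


runs = [
    (["a", "b", "c"], ["a"]),       # first relevant at rank 1, contributes 1
    (["x", "a", "c"], ["a"]),       # rank 2, contributes 1/2
    (["x", "y", "z", "a"], ["a"]),  # rank 4, contributes 1/4
]
mrr = mean_reciprocal_rank(runs)
print(round(mrr, 3))  # 0.583
```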
nDCG (normalized Discounted Cumulative Gain) is the most demanding of the three. It accounts not only for whether a relevant result ranks high but also for relevance degree (you can label fragments as “highly relevant” or “partially relevant”) and penalizes weaker results above better ones. nDCG is worth calculating when you have graded relevance, not just binary “matches/doesn’t match.”
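To make the penalty for misordered results concrete, here is a minimal sketch of nDCG@k. It assumes linear gain (the grade itself) and a log2 position discount; some implementations use an exponential gain instead, so absolute values can differ between tools. The grades 2 and 1 stand for "highly relevant" and "partially relevant", and the ids are illustrative.

```python
import math


def dcg_at_k(gains, k):
    """Discounted cumulative gain: each gain is divided by log2(position + 1)."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))


def ndcg_at_k(retrieved_ids, relevance, k):
    """relevance: dict of fragment id -> grade; ids missing from it count as 0."""
    gains = [relevance.get(doc_id, 0) for doc_id in retrieved_ids]
    ideal_gains = sorted(relevance.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal_gains, k)
    if ideal_dcg == 0:
        return 0.0
    return dcg_at_k(gains, k) / ideal_dcg


relevance = {"a": 2, "b": 1}  # 2 = highly relevant, 1 = partially relevant

perfect = ndcg_at_k(["a", "b", "c"], relevance, k=3)
swapped = ndcg_at_k(["b", "a", "c"], relevance, k=3)
print(round(perfect, 2), round(swapped, 2))  # 1.0 0.86
```

Both lists contain the same fragments, so recall@3 is identical; only nDCG registers that the weaker fragment was ranked above the stronger one.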
| Metric | What it measures | When to use | Typical threshold (PL corporate docs) |
|---|---|---|---|
| Recall@k | Coverage: how many relevant in top k | Always, as the first metric | 0.70-0.80 (recall@5) |
| MRR | Position of first relevant result | When “first try” matters | 0.70-0.85 |
| nDCG@k | Ranking quality with relevance weight | When you have graded relevance | 0.75-0.88 |
| Precision@k | Percentage of relevant in top k | When false positives are costly | Depends on threshold |
The numbers in the “threshold” column are ranges observed in our projects with Polish corporate documents, not universal constants. Treat them as reference points for calibration, not as goals in themselves.
How to build a test set of question-fragment pairs
Without a golden set, you have nothing to measure. A golden set is a list of pairs, each matching a realistic question with the fragments that should appear in the results. The build process looks like this.
Start by collecting 50-100 real questions. The best source is logs of actual assistant queries, support tickets, or interviews with the people who will use the system. Questions invented “at the desk” are too regular and don’t reflect how users actually write: errors, abbreviations, and inflected case forms.
For each question, mark relevant fragments. Here’s a decision point: binary ratings (matches/doesn’t match) are faster; graded ratings (highly relevant / partially relevant / irrelevant) provide richer signals for nDCG. For the first iteration, binary ratings suffice. Annotation should be done by someone familiar with the domain, not a language model, because the model is the subject of evaluation here.
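One possible way to store such annotations is sketched below. The `GoldenPair` class, its field names, the grade scale, and the fragment ids are all hypothetical; the point is only that graded labels can be kept from the start and collapsed to binary when needed.

```python
from dataclasses import dataclass, field


@dataclass
class GoldenPair:
    question: str
    # fragment id -> grade: 2 = highly relevant, 1 = partially relevant
    relevance: dict[str, int] = field(default_factory=dict)
    source: str = "logs"  # where the question came from (logs, ticket, interview)

    def relevant_ids(self, min_grade: int = 1) -> set[str]:
        """Binary view of the graded labels: ids with grade >= min_grade."""
        return {fid for fid, grade in self.relevance.items() if grade >= min_grade}


# Hypothetical example: a question as a user might type it, without diacritics.
pair = GoldenPair(
    question="jak skorygowac fakture po terminie",
    relevance={"proc-fin-012#3": 2, "proc-fin-007#1": 1},
    source="support ticket",
)

print(sorted(pair.relevant_ids()))             # both fragments (binary rating)
print(sorted(pair.relevant_ids(min_grade=2)))  # only the highly relevant one
```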
| Step | Action | Pitfall to avoid |
|---|---|---|
| 1 | Collect 50-100 real questions from logs or interviews | Invented questions are too “clean” |
| 2 | Mark relevant fragments (binary or graded) | Annotation by LLM distorts measurement |
| 3 | Index the corpus with the same pipeline as production | Different chunking = incomparable results |
| 4 | Query with the golden set, calculate metrics | Testing only on “easy” questions |
| 5 | Repeat for 2-3 models in parallel | Changing multiple variables at once |
It’s crucial that the evaluation corpus is indexed exactly like the production one: same chunking strategy, same fragment size, same model. If you change chunking between test and production, the metrics become meaningless. We discuss vector database selection separately in the article on how to choose a vector database.
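Steps 4 and 5 can be sketched as a small harness that runs the same golden set through each candidate. Everything here is a stand-in: `search_fn` represents whatever function queries your index for one model, and the canned result lists replace real retrieval so the example runs on its own.

```python
from statistics import mean


def recall_at_k(retrieved_ids, relevant_ids, k):
    relevant = set(relevant_ids)
    return len(relevant & set(retrieved_ids[:k])) / len(relevant)


def evaluate(search_fn, golden_set, k=5):
    """Average recall@k of one retrieval pipeline over the whole golden set."""
    return mean(recall_at_k(search_fn(q), rel, k) for q, rel in golden_set)


# Golden set: (question, ids of the fragments that should be retrieved).
golden_set = [("q1", {"d1", "d2"}), ("q2", {"d3"})]

# Canned results standing in for two embedding models behind the same pipeline.
results_model_a = {"q1": ["d1", "d9", "d2"], "q2": ["d8", "d3"]}
results_model_b = {"q1": ["d9", "d1", "d7"], "q2": ["d8", "d7"]}
candidates = {"model_a": results_model_a.get, "model_b": results_model_b.get}

# Only the model varies; golden set, chunking, and k stay fixed.
report = {name: evaluate(fn, golden_set) for name, fn in candidates.items()}
print(report)  # {'model_a': 1.0, 'model_b': 0.25}
```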
Offline vs. online evaluation pitfalls
Metrics calculated on the golden set are offline evaluation: controlled, repeatable, cheap. But they have limitations you can’t ignore.
The first pitfall is overfitting to the golden set. If you repeatedly tune the pipeline on the same 80 questions, you eventually optimize for those specific questions, not real queries. Keep a portion of the pairs as a held-out set that you never use for tuning and only score at the end. The second pitfall is data drift: a golden set built on a corpus from six months ago no longer reflects documents added since.
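A minimal sketch of such a split, assuming a 30% held-out share and a fixed seed so the split stays identical between runs (both values are illustrative choices, not recommendations from our projects):

```python
import random


def split_golden_set(pairs, holdout_fraction=0.3, seed=42):
    """Shuffle once with a fixed seed, then set aside a held-out portion."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    n_holdout = round(len(shuffled) * holdout_fraction)
    return shuffled[n_holdout:], shuffled[:n_holdout]


pairs = [f"question-{i}" for i in range(80)]
tuning, holdout = split_golden_set(pairs)
print(len(tuning), len(holdout))  # 56 24
```

Tune chunking, models, and rerankers against `tuning` only. If scores on `holdout` lag clearly behind, the pipeline has probably adapted to the tuning questions.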
Online evaluation measures behavior on live traffic: result click-through rates, percentage of queries escalated to humans, thumbs-up/down ratings, percentage of answers marked as irrelevant. The signal is “noisier” than offline, but it tells the truth about user experience. Best practice is to combine both: offline as a quick gate for every model or chunking change, online as the final production arbiter. We apply the same offline/online split in AI agent evaluation, where synthetic tests verify regressions and production metrics confirm real value.
The third pitfall is specific to RAG: good retrieval is necessary but not sufficient for a good answer. You might have recall@5 at 85%, yet users remain dissatisfied because the generative model poorly summarizes the retrieved context. That’s why you should measure retrieval metrics (recall, MRR, nDCG) separately from answer quality metrics (faithfulness, relevance). Confusing these two layers is the most common diagnostic error we see.
Polish language specifics in measurement
Polish inflection means embedding evaluation for Polish requires caution. A user asks about “faktura” (invoice), but the document contains “faktur,” “fakturze,” “fakturami.” A model that handles Polish poorly places these forms far apart in vector space, so the relevant fragment drops in ranking despite perfect semantic fit. Intentionally include questions in case forms and with careless diacritics (“wez” instead of “weź”) in the golden set, because that’s how real users write.
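Real logs remain the best source of such questions, but a diacritic-free variant of each one can also be generated mechanically and scored alongside the original. The helper below is a sketch using only the standard library; note that “ł” has no Unicode decomposition and must be mapped by hand.

```python
import unicodedata


def strip_polish_diacritics(text):
    """Return text the way a user without Polish keyboard habits might type it."""
    # "ł" / "Ł" do not decompose under NFD, so replace them explicitly first.
    text = text.replace("ł", "l").replace("Ł", "L")
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks (acute, ogonek, dot above) left by decomposition.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_polish_diacritics("weź fakturę"))        # wez fakture
print(strip_polish_diacritics("zażółć gęślą jaźń"))  # zazolc gesla jazn
```

A large gap in recall between the original and the stripped variants of the same questions suggests the model is sensitive to exactly the kind of input real users produce.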
The second factor is a mixed-language validation corpus. Polish corporate documents are full of English terms: compliance, vendor, deliverable, SLA. If your golden set contains only pure Polish sentences, you’ll overestimate model quality on data that’s actually bilingual. When selecting a model, consider one trained on a rich multilingual corpus, like BGE-M3; we discuss this further in the article on embeddings for the Polish language.
The third issue: when pure vector retrieval falls short on Polish industry terms and catalog numbers, measure how much hybrid search adds. Calculate recall@5 for vectors alone, then for vectors combined with BM25 on the same golden set. If the difference exceeds a few percentage points, hybrid pays off. All these comparisons only make sense when you change one variable at a time and measure on an unchanged test set.
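The comparison can be sketched as follows. The text does not say how vector and BM25 results are combined; reciprocal rank fusion (RRF) with the commonly used constant k=60 is assumed here as one widespread option, and the result lists are invented for illustration.

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal rank fusion: each list adds 1 / (k + rank) to a document's score."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


def recall_at_k(retrieved_ids, relevant_ids, k):
    relevant = set(relevant_ids)
    return len(relevant & set(retrieved_ids[:k])) / len(relevant)


# Invented results for one query; "d6" is the fragment with a catalog number
# that the vector search ranks too low and BM25 ranks first.
vector_hits = ["d1", "d2", "d3", "d4", "d5", "d6"]
bm25_hits = ["d6", "d1", "d7", "d8", "d2", "d9"]
relevant = {"d1", "d6"}

vector_recall = recall_at_k(vector_hits, relevant, k=5)
hybrid_recall = recall_at_k(rrf_fuse([vector_hits, bm25_hits]), relevant, k=5)
print(vector_recall, hybrid_recall)  # 0.5 1.0
```

In a real comparison both numbers would be averaged over the whole golden set rather than read from a single query.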
FAQ
How many question-fragment pairs are enough for reliable evaluation?
For a first, rough assessment, 50-100 carefully selected pairs suffice. Such a set catches glaring differences between models and lets you discard obviously weak options. For stable, statistically reliable comparisons between similar models, aim for 200-300 pairs, because with a small sample, a few difficult questions can shift the result by several percentage points.
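A rough way to see why sample size matters is the normal-approximation margin of error for a proportion. This is a simplification: it treats each query as a plain hit or miss, whereas per-query recall@k can be fractional, so read the output as an order of magnitude rather than an exact interval. The 0.75 score is an illustrative value.

```python
import math


def margin_of_error(score, n_queries, z=1.96):
    """Approximate 95% margin for a hit-rate-style metric measured on n queries."""
    return z * math.sqrt(score * (1 - score) / n_queries)


for n in (50, 300):
    print(n, round(margin_of_error(0.75, n), 3))
# 50 0.12
# 300 0.049
```

With 50 pairs, two models scoring 0.75 and 0.80 are statistically hard to tell apart; with 300 pairs, the same gap starts to be meaningful.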
What’s the difference between recall@k and precision@k?
Recall@k tells you what percentage of all relevant fragments appeared in the top k, measuring coverage. Precision@k tells you what percentage of results in the top k are actually relevant, measuring list “purity.” In RAG, recall is usually the priority because missing context hurts more than one unnecessary fragment, which the generative model will ignore anyway. Precision gains importance when false positives are costly or misleading.
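The two metrics differ only in the denominator, which a small side-by-side sketch makes visible. The function names and ids are illustrative; precision is divided by k here, which is the usual definition.

```python
def hits_in_top_k(retrieved_ids, relevant_ids, k):
    return len(set(relevant_ids) & set(retrieved_ids[:k]))


def recall_at_k(retrieved_ids, relevant_ids, k):
    # Denominator: everything that should have been found.
    return hits_in_top_k(retrieved_ids, relevant_ids, k) / len(set(relevant_ids))


def precision_at_k(retrieved_ids, relevant_ids, k):
    # Denominator: everything that was shown in the top k.
    return hits_in_top_k(retrieved_ids, relevant_ids, k) / k


retrieved = ["doc7", "doc2", "doc9", "doc4", "doc1", "doc3"]
relevant = ["doc2", "doc4", "doc3"]

recall = recall_at_k(retrieved, relevant, k=5)
precision = precision_at_k(retrieved, relevant, k=5)
print(round(recall, 2), precision)  # 0.67 0.4
```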
Can I rely on the MTEB or BEIR ranking when choosing a model?
Public benchmarks are useful as an initial filter to narrow the candidate list, but they don’t replace measurement on your data. MTEB and BEIR corpora are largely English and general-domain, so they poorly predict performance on Polish contracts or service requests. Treat the ranking as a starting point, and make the decision based on your own golden set.
How often should I repeat evaluation after deployment?
Recalculate the offline golden-set metrics with every model change, chunking strategy change, or reranker change; it’s a quick regression gate. Independently, refresh the golden set with new production questions every quarter to keep up with data and user language drift. Monitor online metrics (ratings, escalations) continuously, just like in AI agent quality monitoring.
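Such a gate can be as simple as comparing the new metrics with a stored baseline and flagging any drop beyond a tolerance. The function name, the 0.02 tolerance, and the metric values below are illustrative assumptions, not measurements.

```python
def regression_gate(baseline, candidate, tolerance=0.02):
    """Return the metrics where the candidate is worse than baseline by more than tolerance."""
    return [
        name
        for name, base_value in baseline.items()
        if candidate[name] < base_value - tolerance
    ]


baseline = {"recall@5": 0.78, "mrr": 0.74, "ndcg@5": 0.80}   # current production pipeline
candidate = {"recall@5": 0.73, "mrr": 0.75, "ndcg@5": 0.79}  # after a chunking change

failed = regression_gate(baseline, candidate)
print(failed)  # ['recall@5']
```

An empty list means the change may proceed to online evaluation; a non-empty one blocks it until the drop is explained.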
Does good recall@5 guarantee good assistant answers?
No. High recall only means relevant fragments reach the context, which is necessary but not sufficient. The generative model might poorly summarize good context, omit a key detail, or add information outside the source. That’s why you should measure retrieval quality (recall, MRR, nDCG) separately from answer quality (faithfulness, relevance): they’re two different layers with different failure causes.