How to Evaluate a RAG System: Retrieval Metrics, Faithfulness, and Golden Set

We regularly see the same moment: a team deploys a RAG-based assistant, someone reports that “the bot hallucinates,” and the blame game begins. Is the model making things up, or did it simply not receive the right context? Without separate measurement of both layers, this discussion ends in guesswork. RAG evaluation isn’t a single number—it’s two sets of metrics tied to a shared test set, which together show exactly where the system breaks down.

Two Layers That Must Be Measured Separately

A RAG system has two components that can fail independently. Retrieval may not find the right fragment—then generation has no chance, because the model can’t see what it wasn’t given. Or retrieval provides the correct context, but the model still answers off-target, ignores it, or adds facts from its own memory. These are two distinct failures with two distinct repair plans.

That’s why we measure them separately. Retrieval quality tells us whether the right fragments land high in the context. Answer quality tells us whether the model sticks to what it was given. Only combining both gives an end-to-end picture. A common mistake is looking only at the final answer score: when it drops, you don’t know whether retrieval or generation is at fault, so fixes become chaotic.

Layer	Diagnostic Question	Key Metrics	What to Fix When Poor
Retrieval	Did the right fragment make it into context?	recall@k, precision@k, MRR	chunking, embedding model, hybrid search, reranking
Generation	Does the model stick to the context?	faithfulness, relevance, attribution	prompt, guardrails, model selection
End-to-end	Did the user get a good answer?	correctness, satisfaction	depends on the layer below

Retrieval Metrics: Recall@k and Precision

We measure the retrieval layer with the same tools used for a standalone search system. Recall@k tells you what fraction of the expected fragments appeared in the top k results fed to the model. This is a coverage metric: if the right fragment isn’t in the context, the model cannot ground a correct answer in it and is left to guess or hallucinate. That’s why recall@k is more critical for RAG than for a search engine, where the user browses the list themselves.

Precision@k, on the other hand, tells you what fraction of the provided fragments were actually relevant. Low precision means we’re cluttering the context: the model gets noise, token costs rise, and the risk increases that it latches onto an irrelevant fragment. In practice, we balance both, because a larger k raises recall but usually lowers precision. MRR (mean reciprocal rank) complements them by showing how high the first relevant result lands, which matters when we want to feed the model less context.
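The three retrieval metrics can be computed from nothing more than ranked chunk IDs and the set of IDs annotated as relevant. The following is a minimal sketch, not a reference implementation: the function names and the chunk IDs are illustrative, retrieved lists are assumed to contain no duplicates, and precision is divided by the number of fragments actually returned (some definitions always divide by k).

```python
from typing import Iterable, Sequence, Set, Tuple


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the annotated relevant chunks that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def precision_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top k chunks that are annotated as relevant."""
    top = retrieved[:k]
    if not top:
        return 0.0
    return sum(1 for chunk_id in top if chunk_id in relevant) / len(top)


def reciprocal_rank(retrieved: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant chunk, or 0.0 if none was retrieved."""
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: Iterable[Tuple[Sequence[str], Set[str]]]) -> float:
    """Average reciprocal rank over (retrieved, relevant) pairs, one per question."""
    scores = [reciprocal_rank(retrieved, relevant) for retrieved, relevant in runs]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical example: five retrieved chunks, three annotated as relevant.
retrieved = ["c7", "c2", "c9", "c4", "c1"]
relevant = {"c2", "c4", "c8"}
print(recall_at_k(retrieved, relevant, k=5))     # 2 of 3 relevant chunks found
print(precision_at_k(retrieved, relevant, k=5))  # 2 of 5 retrieved chunks relevant
print(reciprocal_rank(retrieved, relevant))      # first relevant chunk at rank 2
```

Running the same functions at several values of k makes the recall and precision trade-off visible for a given corpus.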

Metric	What It Measures	Signal When Poor	Typical Threshold (PL corporate docs)
Recall@k	Coverage of relevant fragments in top k	Model lacks context, hallucinates	0.70-0.80 (at k=5)
Precision@k	Purity of context	Noise, costs, misattributions	0.50-0.70
MRR	Position of first relevant result	Need to provide more context	0.70-0.85

The numbers in the “threshold” column are ranges observed in our projects with Polish corporate documents—not universal constants. Treat them as calibration points. We expand on the methodology for measuring retrieval quality in our article on how to measure embedding quality. When recall is too low, the most common levers are changing the chunking strategy and adding hybrid search (BM25 plus vectors).

Answer Metrics: Faithfulness and Attribution

The second layer is the quality of the answer itself, assuming the context is already there. The key metric is faithfulness (grounding): whether every claim in the answer can be supported by a fragment from the provided context. A faithful answer doesn’t add facts outside the material—this kind of addition is the classic hallucination. We measure faithfulness by breaking the answer into individual statements and checking whether each finds support in the context.
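The statement-by-statement procedure can be sketched as a small scoring function. Everything here is an illustrative assumption: the sentence splitter is a naive regex, and `naive_support_check` is a token-overlap placeholder standing in for a real support check (a human annotator or a calibrated judge model), which you would pass in through the `is_supported` parameter.

```python
import re
from typing import Callable, List


def split_into_claims(answer: str) -> List[str]:
    """Naive claim extraction: one claim per sentence."""
    parts = re.split(r"(?<=[.!?])\s+", answer.strip())
    return [part for part in parts if part]


def naive_support_check(claim: str, context: str, threshold: float = 0.8) -> bool:
    """PLACEHOLDER judge: a claim counts as supported when most of its
    tokens occur in the context. Replace with a real support check."""
    claim_tokens = set(re.findall(r"\w+", claim.lower()))
    context_tokens = set(re.findall(r"\w+", context.lower()))
    if not claim_tokens:
        return True
    return len(claim_tokens & context_tokens) / len(claim_tokens) >= threshold


def faithfulness(
    answer: str,
    context: str,
    is_supported: Callable[[str, str], bool] = naive_support_check,
) -> float:
    """Share of claims in the answer that the context supports."""
    claims = split_into_claims(answer)
    if not claims:
        return 0.0
    supported = sum(1 for claim in claims if is_supported(claim, context))
    return supported / len(claims)


# Hypothetical example: the second sentence is not backed by the context.
context = "The refund window is 30 days. Refunds are paid to the original payment method."
answer = "The refund window is 30 days. Refunds include a 10 percent bonus."
print(faithfulness(answer, context))  # 1 of 2 claims supported
```

Keeping the judge behind a callable means the scoring logic stays the same when the placeholder is swapped for a stronger check.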

The second metric is answer relevance: whether the answer actually addresses the question rather than being correct but off-topic. The third is source attribution: whether the system indicates which document a piece of information comes from and whether those citations are accurate. Attribution isn’t decoration. It is a mechanism that lets users verify answers and gives us an auditable trail during evaluation.

Hallucinations in RAG have two distinct sources. First: retrieval didn’t deliver the answer, so the model “fills in” from memory—this is a retrieval-layer defect, caught by low recall. Second: the context was correct, but the model still strayed beyond it—this is a generation defect, caught by low faithfulness. Separate measurement lets you immediately know which layer to fix, instead of “improving the prompt” when the real issue is a missing chunk.
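The two-source distinction translates into a simple triage rule per question. This is a sketch under stated assumptions: the function name and both threshold values are hypothetical and should be calibrated on your own golden set. Retrieval is checked first, because generation cannot be judged fairly when the context was missing.

```python
def diagnose_layer(
    recall_at_k: float,
    faithfulness: float,
    min_recall: float = 0.7,         # hypothetical threshold
    min_faithfulness: float = 0.9,   # hypothetical threshold
) -> str:
    """Return which layer to investigate for one evaluated question."""
    if recall_at_k < min_recall:
        # The right fragments never reached the model.
        return "retrieval"
    if faithfulness < min_faithfulness:
        # Context was there, but the answer went beyond it.
        return "generation"
    return "ok"


print(diagnose_layer(recall_at_k=0.33, faithfulness=0.95))  # retrieval
print(diagnose_layer(recall_at_k=1.0, faithfulness=0.5))    # generation
```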

Golden Set: Question-Context-Answer Pairs

Without a golden set, you have nothing to measure on either layer. For RAG, the golden set is richer than for pure retrieval because each entry is a triplet: a realistic question, a set of fragments constituting the correct context, and a reference answer (or a set of facts that a correct answer must contain). The question and marked context feed retrieval metrics; the question, context, and reference answer feed generation metrics.
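One possible way to store such a triplet is a small record serialized as one JSON object per line, which diffs cleanly under version control. The field names, the `origin` field, and the sample values below are assumptions for illustration, not a prescribed schema.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class GoldenEntry:
    entry_id: str
    question: str                   # real user wording, typos included
    relevant_chunk_ids: List[str]   # fragments forming the correct context
    reference_answer: str
    required_facts: List[str] = field(default_factory=list)
    origin: str = "support_ticket"  # hypothetical provenance label


def to_jsonl(entries: List[GoldenEntry]) -> str:
    """Serialize entries as JSON Lines, one entry per line."""
    return "\n".join(
        json.dumps(asdict(entry), ensure_ascii=False, sort_keys=True)
        for entry in entries
    )


def from_jsonl(text: str) -> List[GoldenEntry]:
    """Parse JSON Lines back into entries, skipping blank lines."""
    return [GoldenEntry(**json.loads(line)) for line in text.splitlines() if line.strip()]


# Hypothetical entry.
entry = GoldenEntry(
    entry_id="q-001",
    question="how long do i have to return smth?",
    relevant_chunk_ids=["policy-returns-03"],
    reference_answer="Returns are accepted within 30 days of delivery.",
    required_facts=["30 days"],
)
print(to_jsonl([entry]))
```

The question and `relevant_chunk_ids` are enough to compute retrieval metrics, while `reference_answer` and `required_facts` serve the generation metrics.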

Start with 50-100 real questions from assistant logs, support tickets, or interviews with future users. Questions invented “at the desk” are too neat and don’t reflect the language of users who write with abbreviations and errors. For each question, mark the relevant context fragments and write a reference answer—annotation should be done by someone familiar with the domain, not a language model, because the model is often the subject of evaluation here.

Triplet Element	Purpose	Pitfall to Avoid
Question	System input, real language	Overly “clean” invented questions
Context (fragments)	Retrieval metrics, faithfulness check	Annotation by LLM distorts measurement
Reference Answer	Relevance and correctness metrics	Single “correct” version for open-ended questions

Index the evaluation corpus with the exact same pipeline as production: the same chunking strategy, the same embedding model, the same fragment size. If you change anything between test and production, the metrics stop being comparable. Version your golden set and treat it like code—it grows with the system, and every reported production error should return to it as a new test case.

Offline vs. Online: Catching Hallucinations in Production

Offline evaluation on a golden set is repeatable and cheap, but it has a blind spot: it only measures what we anticipated. Real users ask questions outside the set, in forms we didn’t imagine. That’s why offline evaluation is for comparing variants (model A vs. B, chunk size 256 vs. 512) and protecting against regression after each change, but it doesn’t replace production observation.
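Regression protection can be as simple as comparing a candidate variant's metrics against a stored baseline. The sketch below assumes every metric is higher-is-better and uses a hypothetical tolerance of 0.02 to absorb noise. The metric names and values are invented for the example.

```python
from typing import Dict, List


def find_regressions(
    baseline: Dict[str, float],
    candidate: Dict[str, float],
    tolerance: float = 0.02,  # hypothetical noise margin
) -> List[str]:
    """Names of metrics where the candidate is missing or worse than
    the baseline by more than the tolerance (higher is better)."""
    regressed = []
    for name, base_value in baseline.items():
        candidate_value = candidate.get(name)
        if candidate_value is None or candidate_value < base_value - tolerance:
            regressed.append(name)
    return sorted(regressed)


# Hypothetical comparison of two chunking variants on the same golden set.
baseline = {"recall@5": 0.78, "precision@5": 0.60, "faithfulness": 0.93}
candidate = {"recall@5": 0.84, "precision@5": 0.52, "faithfulness": 0.92}
print(find_regressions(baseline, candidate))  # ['precision@5']
```

A non-empty result can fail a CI job, so a change that improves one layer at the expense of another is noticed before deployment.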

Online evaluation works on live traffic. The foundation is observability: we log the question, provided context, answer, and cited sources so every response can be replayed and audited. On this basis, we calculate faithfulness continuously—an automated judge (LLM-as-judge) checks whether the claims in the answer are supported by the logged context and flags unsupported cases as hallucination candidates. This is a signal, not a verdict: some flags require human review, and a sample goes to domain expert assessment.
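The logging and flagging steps might look like the sketch below. The `Trace` fields mirror the four items named above. The judge is passed in as a plain callable returning a faithfulness score between 0 and 1, and both `stub_judge` and the 0.9 cutoff are stand-ins for illustration, not a real judge or a recommended threshold.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Trace:
    trace_id: str
    question: str
    context_chunks: List[str]   # exactly what the model was given
    answer: str
    cited_chunk_ids: List[str]


# A judge maps (answer, context chunks) to a faithfulness score in [0, 1].
Judge = Callable[[str, List[str]], float]


def flag_hallucination_candidates(
    traces: List[Trace],
    judge: Judge,
    min_faithfulness: float = 0.9,  # hypothetical cutoff
) -> List[str]:
    """IDs of traces whose judged faithfulness falls below the cutoff.
    These are candidates for human review, not confirmed hallucinations."""
    return [
        trace.trace_id
        for trace in traces
        if judge(trace.answer, trace.context_chunks) < min_faithfulness
    ]


def stub_judge(answer: str, context_chunks: List[str]) -> float:
    """STAND-IN for an LLM judge: 1.0 only if the answer appears verbatim."""
    return 1.0 if any(answer in chunk for chunk in context_chunks) else 0.0


traces = [
    Trace("t1", "Refund window?", ["Refunds within 30 days."], "Refunds within 30 days.", ["c1"]),
    Trace("t2", "Refund window?", ["Refunds within 30 days."], "Refunds within 90 days.", ["c1"]),
]
print(flag_hallucination_candidates(traces, stub_judge))  # ['t2']
```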

Source attribution closes the loop. If every answer cites the fragments it relied on, verifying hallucinations reduces to checking whether the cited fragments actually contain what the model claims. A mismatch in citation is a strong hallucination signal. Production signals—low faithfulness, thumbs down, escalation to a human—are fed back into the golden set as new test cases, closing the offline-online cycle. We describe how this loop works for a full assistant in our article on monitoring AI agent quality.
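A citation check of this kind can be sketched as follows, assuming each citation carries a chunk ID plus a verbatim quote from that chunk. That is a simplifying assumption: systems that paraphrase their sources would need a semantic comparison rather than substring matching. All names are illustrative.

```python
import re
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Citation:
    chunk_id: str
    quote: str  # assumed to be copied verbatim from the cited chunk


def _normalize(text: str) -> str:
    """Collapse whitespace and lowercase so formatting differences don't matter."""
    return re.sub(r"\s+", " ", text).strip().lower()


def citation_mismatches(
    citations: List[Citation],
    logged_chunks: Dict[str, str],
) -> List[Citation]:
    """Citations whose chunk was never provided, whose quote is empty,
    or whose quote does not occur in the cited chunk."""
    mismatches = []
    for citation in citations:
        chunk_text = logged_chunks.get(citation.chunk_id)
        quote = _normalize(citation.quote)
        if chunk_text is None or not quote or quote not in _normalize(chunk_text):
            mismatches.append(citation)
    return mismatches


# Hypothetical logged context and citations.
chunks = {"c1": "Returns are accepted within 30 days\nof delivery."}
citations = [
    Citation("c1", "within 30 days of delivery"),
    Citation("c1", "within 90 days"),
    Citation("c9", "free shipping"),
]
for bad in citation_mismatches(citations, chunks):
    print(bad.chunk_id, "->", bad.quote)
```

Any citation returned here is a strong signal worth routing to review and, once confirmed, adding to the golden set.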

FAQ

Is one end-to-end metric enough to evaluate RAG?

No. A single end-to-end score shows something is wrong but won’t tell you whether retrieval or generation failed. These are two different issues with different fixes, so you need to measure both layers separately and only then combine them into an end-to-end picture. Without this separation, fixes are guesswork.

How is faithfulness different from answer relevance?

Faithfulness checks whether the answer sticks to the provided context and doesn’t add facts outside it—this guards against hallucination. Relevance checks whether the answer actually addresses the question. An answer can be faithful to the context but irrelevant (correct but off-topic), so both metrics are needed in parallel.

How many entries should a golden set have to start?

From our practice, a reasonable starting point is 50-100 question-context-answer triplets built from real queries. This is enough to catch clear differences between variants and regressions. Treat the set as living: every new production error should be added as a test case, so it grows over time.

Can I use an LLM to evaluate faithfulness?

Yes, LLM-as-judge is a practical way to scale faithfulness evaluation, especially online. But treat it as a signal, not a final verdict: calibrate the judge on a manually assessed sample and periodically audit its decisions. Don’t use the same type of model you’re evaluating to annotate the golden set (marking correct context), as this distorts the measurement.

How does source attribution help catch hallucinations?

When an answer cites the fragments it relied on, verification reduces to checking whether those fragments actually contain what the model claims. A mismatch between citation and fragment content is a strong hallucination signal. Attribution also provides an auditable trail—useful for evaluation and handling user complaints.