The team deploys RAG, the first demo looks great, and management is satisfied. Then, six weeks into production, someone changes the document structure in the knowledge base. Response quality drops. No one notices for two weeks because there are no regression tests. This is a scenario we see in one out of every three initial deployments.
RAG evaluation isn’t an academic add-on. It’s the only method that distinguishes “the system responds” from “the system responds correctly.” Below, I describe how to build it, which metrics to use, and how to maintain quality throughout the system’s lifecycle.
Why RAG evaluation differs from testing a regular API
Testing a REST API is straightforward: you send a request and compare the response to a schema. RAG has three independent sources of error that require separate metrics.
Retrieval error. The system retrieved fragments that don’t match the question. The generative model receives well-formed context, but it’s the wrong context. The response may sound convincing and still be completely incorrect.
Faithfulness error. The system retrieved relevant fragments, but the model extracted non-existent information from them or added its own parametric knowledge. This is a classic hallucination even with correct retrieval.
Relevance error. The system retrieved the correct fragments, the model generated text derived from those fragments, but the response doesn’t address the user’s actual question. This happens when the question is ambiguous or when the knowledge base has thematic gaps.
Unit tests for APIs don’t catch any of these errors. You need separate metrics for each layer.
Golden set: the foundation of RAG evaluation
A golden set is a controlled collection of input-expected-output pairs, built once, maintained continuously, and run automatically. Minimum requirements for production:
| Parameter | Minimum | Recommended | Note |
|---|---|---|---|
| Number of pairs | 50 | 150-200 | Fewer than 50 = too much variance in results |
| Thematic coverage | 80% of base topics | 100% | Gaps = blind spots in evaluation |
| Question difficulty | mix of simple/complex | 30% complex | Only simple questions = false confidence |
| Expected fragment | fragment identifier | fragment + minimum score | Allows measuring retrieval separately |
| Expected answer | key facts (list) | full reference answer | Full answer simplifies LLM-as-judge |
| Update frequency | with base changes | with every change | Outdated questions = false results |
The golden set is built from sources you know: questions from support tickets (most valuable, as they’re real), questions from contact forms, questions asked to consultants. Don’t create it by inventing questions at a desk — questions invented in a vacuum rarely match what users are actually looking for.
Out-of-domain questions (those the knowledge base shouldn’t answer) must also be formally included in the golden set. RAG evaluation without negative tests doesn’t measure whether the system correctly refuses to answer.
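A minimal sketch of how one golden-set entry could be represented, and how the expected fragment lets retrieval be measured on its own. The class name, field names, and the `retrieval_hit_rate` helper are illustrative assumptions, not a prescribed schema.

```python
from dataclasses import dataclass, field


@dataclass
class GoldenPair:
    """One golden-set entry (all field names are hypothetical)."""
    pair_id: str
    question: str
    expected_fragment_ids: list[str] = field(default_factory=list)
    key_facts: list[str] = field(default_factory=list)
    difficulty: str = "simple"      # "simple" or "complex"
    out_of_domain: bool = False     # True = the system is expected to refuse


def retrieval_hit_rate(pairs: list[GoldenPair],
                       retrieved: dict[str, list[str]]) -> float:
    """Share of in-domain pairs where at least one expected fragment
    appears among the fragment ids retrieved for that pair."""
    in_domain = [p for p in pairs if not p.out_of_domain]
    if not in_domain:
        return 0.0
    hits = sum(
        1 for p in in_domain
        if set(p.expected_fragment_ids) & set(retrieved.get(p.pair_id, []))
    )
    return hits / len(in_domain)
```

Out-of-domain pairs are excluded from the hit rate because they have no expected fragment; they are scored separately through refusal behavior.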
Three evaluation metrics: faithfulness, context relevance, answer relevance
RAG evaluation boils down to three numbers you need on your dashboard.
Faithfulness measures whether the response follows from the provided fragments. Scale of 0-1. A value of 1 means every claim in the response is supported by at least one fragment in the context. A value of 0 means no claim is supported by the context: the model responded from its own parametric knowledge and ignored the fragments.
Required value for production: above 0.85. Below this threshold, the system regularly generates information outside the knowledge base.
Context relevance measures what percentage of retrieved fragments is substantively useful for answering the question. Low context relevance with high faithfulness is a classic symptom of overly broad retrieval: the model is faithful to the context, but the context is unnecessary. This requires fixing reranking or search parameters.
Answer relevance measures whether the response addresses the actual question. This metric is independent of faithfulness: the response may be faithful to the fragments but not answer what was asked. Particularly important for multi-aspect or ambiguous questions.
A fourth metric we add to every project is the "I don’t know" rate: the percentage of queries where the system correctly refused to answer due to a lack of relevant fragments. A rate that is too low indicates that guardrails are too permissive and the system hallucinates instead of escalating.
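As a rough illustration of the definitions above, the ratios can be computed from per-claim and per-fragment judgments once a judge (human or model) has produced them. The function names and the boolean-label input format are assumptions made for this sketch; returning 0.0 for empty input is also an arbitrary choice.

```python
def faithfulness(claim_supported: list[bool]) -> float:
    """Fraction of claims in the response that are supported by
    at least one context fragment (1.0 = fully grounded)."""
    if not claim_supported:
        return 0.0
    return sum(claim_supported) / len(claim_supported)


def context_relevance(fragment_useful: list[bool]) -> float:
    """Fraction of retrieved fragments judged useful for the question."""
    if not fragment_useful:
        return 0.0
    return sum(fragment_useful) / len(fragment_useful)


def idk_rate(outcomes: list[tuple[bool, bool]]) -> float:
    """Share of queries where the system refused AND no relevant
    fragment existed. Each item is (refused, had_relevant_fragment)."""
    if not outcomes:
        return 0.0
    correct_refusals = sum(
        1 for refused, had_relevant in outcomes
        if refused and not had_relevant
    )
    return correct_refusals / len(outcomes)
```

For example, a response with four claims of which three are supported scores 0.75 on faithfulness, which would fall below the 0.85 production threshold.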
Measurement methods: LLM-as-judge and human evaluation
There are two practical ways to measure qualitative metrics: LLM-as-judge and domain expert evaluation. Each method has different applications.
LLM-as-judge is a second model (different from the one generating responses) that evaluates the triple (question, response, context) according to a defined rubric. The advantage is scalability: evaluating a thousand pairs takes minutes. The disadvantage is the need for calibration, as different models have different evaluation biases.
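A minimal sketch of a rubric-driven judge call, under the assumption that the judge model is reachable through some callable that takes a prompt string and returns a string. The rubric wording, the JSON reply format, and all function names are hypothetical; no specific vendor API is implied.

```python
import json
from typing import Callable

# Hypothetical rubric; a real one would be tuned to the domain.
RUBRIC = (
    "You are grading a RAG response.\n"
    "Score faithfulness from 0 to 1: the share of claims in the response "
    "that are supported by the context.\n"
    'Reply with JSON only: {"faithfulness": <number>, "reason": "<text>"}'
)


def build_judge_prompt(question: str, response: str,
                       fragments: list[str]) -> str:
    """Assemble rubric, context, question, and response into one prompt."""
    context = "\n".join(f"[{i}] {text}" for i, text in enumerate(fragments, 1))
    return (f"{RUBRIC}\n\nContext:\n{context}\n\n"
            f"Question: {question}\n\nResponse: {response}")


def parse_judge_reply(reply: str) -> float:
    """Extract the score and reject values outside the 0-1 scale."""
    score = float(json.loads(reply)["faithfulness"])
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score out of range: {score}")
    return score


def judge_faithfulness(judge: Callable[[str], str], question: str,
                       response: str, fragments: list[str]) -> float:
    """Run one evaluation with any judge callable (an assumption here)."""
    return parse_judge_reply(
        judge(build_judge_prompt(question, response, fragments)))
```

Validating the reply strictly matters in practice: a judge that returns malformed output should fail loudly rather than silently contribute a default score.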
Calibrating LLM-as-judge: take 100 pairs from the golden set and evaluate them manually (domain expert), then evaluate the same set with LLM-as-judge. A Pearson correlation above 0.8 between human and model evaluation is the threshold at which you can trust automated evaluation on large volumes.
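The calibration check can be sketched as a plain Pearson correlation between the expert scores and the judge scores for the same pairs. The helper names are assumptions; the 0.8 threshold is the one quoted above.

```python
import math


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient of two equally long score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)


def judge_is_calibrated(human: list[float], model: list[float],
                        threshold: float = 0.8) -> bool:
    """True when the judge tracks expert scores closely enough."""
    return pearson(human, model) > threshold
```

Pearson measures linear agreement only; a judge that is consistently half a point too generous can still correlate highly, so comparing mean scores as well is a reasonable extra check.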
Domain expert evaluation is slow and costly but necessary as a reference point. Minimum recommendation: every two weeks, a random sample of 30-50 pairs from production is evaluated by an expert. This sample maintains LLM-as-judge calibration and catches degradation that automatic metrics don’t see.
End-user evaluation (thumbs up/down, short survey) is the simplest way to collect feedback, but it measures perception, not substance. Users give high ratings to stylistically confident responses, even when the content is incorrect. Use it as a supplement, not a replacement.
Regression tests: automatic golden set in CI/CD
A golden set is only valuable if it’s run regularly and automatically. The regression cycle for RAG:
With every change to the knowledge base. Adding new documents, removing outdated ones, changing chunk structure — each of these actions can alter retrieval results for existing questions. The golden set run after a change catches regressions before they reach production.
With changes to model or reranker configuration. Updating the embeddings model, changing reranking parameters, adjusting the faithfulness threshold in guardrails — each of these settings affects metrics. Without regression testing, you won’t know the direction of the change.
Weekly on a production sample. Regardless of infrastructure changes, real user queries can reveal degradation not visible in the golden set (out-of-domain questions, new query types). An automatic sample of 50 production conversations evaluated weekly by LLM-as-judge supplements the golden set with real-world signals.
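A regression gate for CI can be as simple as comparing the current golden-set run against a stored baseline. This is a sketch under assumptions: the metric names, the 0.02 tolerance, and the function name are illustrative; only the 0.85 faithfulness requirement comes from the text.

```python
# Absolute requirements; faithfulness must stay above 0.85 in production.
FLOORS = {"faithfulness": 0.85}


def find_regressions(baseline: dict[str, float],
                     current: dict[str, float],
                     tolerance: float = 0.02) -> list[str]:
    """Return human-readable problems; an empty list means the run passes."""
    problems = []
    for name, base in baseline.items():
        value = current.get(name)
        if value is None:
            problems.append(f"{name}: missing from current run")
            continue
        # Relative check: did the metric drop more than the tolerance?
        if base - value > tolerance:
            problems.append(f"{name}: dropped from {base:.2f} to {value:.2f}")
        # Absolute check: is the metric still above its required floor?
        floor = FLOORS.get(name)
        if floor is not None and value <= floor:
            problems.append(f"{name}: {value:.2f} is not above {floor:.2f}")
    return problems
```

In a pipeline, the job would print the returned problems and exit with a non-zero status when the list is non-empty, which blocks the knowledge base or configuration change from being promoted.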
Combining these approaches with a versioned knowledge base process is described in the article RAG knowledge updates and versioning.
RAG evaluation isn’t just about technical quality. For systems processing personal data or operating in high-risk areas, the evaluation trail is part of the documentation required by the AI Act.
Minimum compliance requirements:
Evaluation logs must include: model version, knowledge base version, test run date, results per metric, and test pair identifiers. Test query content may contain PII taken from production logs, which requires pseudonymization before inclusion in the golden set.
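One possible shape for such a log record, shown as a sketch. The class and field names are assumptions; the set of fields mirrors the list above, and query content is deliberately left out so the record can be kept as an aggregate.

```python
import json
from dataclasses import asdict, dataclass
from datetime import date


@dataclass(frozen=True)
class EvaluationRun:
    """Audit record for one golden-set run (names are hypothetical)."""
    model_version: str
    knowledge_base_version: str
    run_date: date
    metrics: dict[str, float]   # results per metric
    pair_ids: list[str]         # identifiers only, no query content

    def to_json(self) -> str:
        """Serialize with an ISO date so the log is easy to diff and query."""
        record = asdict(self)
        record["run_date"] = self.run_date.isoformat()
        return json.dumps(record, sort_keys=True)
```

Storing pair identifiers rather than query text keeps the long-lived record free of PII while still allowing a result to be traced back to a specific golden-set version.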
For systems subject to DPIA (e.g., HR, finance, healthcare), evaluation must include bias testing: does the system treat the same questions differently based on demographic characteristics present in the context? Results of such tests are documented as part of the risk assessment.
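One way to approach such a bias test is a counterfactual comparison: keep the question fixed, vary only the demographic attribute in the context, and compare the scores. The sketch below assumes a `score_fn` callable that runs the system and returns a quality score, and a context template with an `{attribute}` placeholder; both are hypothetical.

```python
from typing import Callable


def counterfactual_scores(question: str, context_template: str,
                          attribute_values: list[str],
                          score_fn: Callable[[str, str], float]
                          ) -> dict[str, float]:
    """Score the same question once per attribute value.

    Only the "{attribute}" placeholder in the context changes
    between runs, so score differences point at that attribute.
    """
    scores = {}
    for value in attribute_values:
        context = context_template.replace("{attribute}", value)
        scores[value] = score_fn(question, context)
    return scores


def max_spread(scores: dict[str, float]) -> float:
    """Largest score gap between any two attribute values."""
    return max(scores.values()) - min(scores.values())
```

What counts as an acceptable spread is a risk-assessment decision, not a technical constant, and should be recorded alongside the results.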
TTL for evaluation logs: aggregated results (metrics without content) can be stored long-term as an audit trail. Logs with full evaluation pairs containing query content have a shorter TTL, aligned with GDPR (RODO) retention policies, typically 30-90 days.
Lack of documented evaluation for a high-risk system is a potential gap during a supervisory audit. Human oversight must be backed by proof that it was carried out, not merely declared.
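The two retention classes can be expressed as a small expiry check. This is a sketch: the function name and the 90-day default (the upper end of the range above) are assumptions, and the real value should come from the organization's retention policy.

```python
from datetime import date, timedelta

FULL_PAIR_TTL_DAYS = 90  # assumed default; set from your retention policy


def is_expired(created: date, today: date, contains_query_content: bool,
               ttl_days: int = FULL_PAIR_TTL_DAYS) -> bool:
    """Decide whether an evaluation log entry should be deleted."""
    if not contains_query_content:
        # Aggregated metrics are kept long-term as the audit trail.
        return False
    return today - created > timedelta(days=ttl_days)
```

A scheduled cleanup job would call this for each stored entry and delete those for which it returns True.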
Tools: RAGAS, custom pipeline, and what to choose when
RAGAS (Retrieval Augmented Generation Assessment) is the most popular open-source framework for RAG evaluation. It implements faithfulness, answer relevance, context precision, and context recall as ready-to-use metrics. It requires an LLM for evaluation (built-in LLM-as-judge) and can run locally via the OpenClaw router.
When RAGAS is the right choice: when you’re starting evaluation from scratch and want to quickly obtain a set of metrics without writing your own evaluation logic. RAGAS works well for the first 3-6 months of a project.
When to build a custom pipeline: when you have domain-specific requirements (e.g., legal compliance evaluation, technical fact-checking, formal tone in a specific language) that off-the-shelf rubrics don’t support. A custom pipeline also gives better control over token evaluation costs.
Important technical note: the evaluation model (LLM-as-judge) shouldn’t be the same model that generates responses. Different models have different biases, and self-evaluation inflates results. A good pattern is a smaller, faster model as the judge (lower token cost) with a larger generative model.
Evaluating drift: when the golden set is no longer enough
A golden set built at deployment gradually becomes unrepresentative. After six months, user questions may significantly deviate from those in the golden set: new thematic areas, new customer types, changed organizational processes. This query distribution drift is a silent killer of quality: metrics on the golden set remain stable, but actual production quality declines.
Signs that the golden set needs updating:
- The escalation-to-human rate increases for three consecutive weeks without configuration changes.
- A thematic cluster appears in production that isn’t represented in the golden set (detectable by clustering embeddings of real queries).
- User feedback contains response categories that the golden set doesn’t test.
Update the golden set at least quarterly: review 200-300 production conversations, identify new question types, add them to the golden set, and run a benchmark. This isn’t a one-time project; it’s an ongoing process.
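A simpler stand-in for full clustering is a nearest-neighbor coverage check: flag production queries whose embedding is not close to any golden-set question. The sketch below assumes embeddings are already computed as plain lists of floats; the function names and the 0.8 similarity threshold are illustrative and need tuning per embedding model.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def uncovered_queries(production: dict[str, list[float]],
                      golden: list[list[float]],
                      min_similarity: float = 0.8) -> list[str]:
    """Ids of production queries with no sufficiently similar
    golden-set question, i.e. candidates for new golden pairs."""
    return [
        query_id for query_id, vector in production.items()
        if max((cosine(vector, g) for g in golden), default=0.0)
        < min_similarity
    ]
```

A steadily growing share of uncovered queries is a signal to review them during the quarterly update, even when golden-set metrics look stable.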
The article monitoring AI agent quality describes how to integrate golden set evaluation into a broader operational dashboard.
FAQ
How many pairs should a golden set have for a small RAG deployment?
A minimum of 50 pairs allows detecting regressions larger than 10 percentage points in quality metrics. For a knowledge base of up to 200 documents and a narrow thematic scope, this is a sufficient starting point. With 200 pairs, results are statistically stable and allow detecting changes of 3-5 points. If the knowledge base has several distinct thematic areas, the golden set must have representative coverage of each, even if the total number of pairs is small. Details on preparing the knowledge base are described in the article how to prepare company data for AI.
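The sample-size figures can be sanity-checked with a back-of-the-envelope margin of error. This sketch makes a simplifying assumption that is not in the text: each pair is treated as pass/fail, and the normal approximation for a proportion is used with a 95% confidence level (z = 1.96).

```python
import math


def margin_of_error(pass_rate: float, n: int, z: float = 1.96) -> float:
    """Approximate half-width of the confidence interval for a pass rate
    measured on n golden-set pairs (normal approximation)."""
    return z * math.sqrt(pass_rate * (1 - pass_rate) / n)


# With a pass rate around 0.85:
small = margin_of_error(0.85, 50)    # roughly 0.10, i.e. about 10 points
large = margin_of_error(0.85, 200)   # roughly 0.05, i.e. about 5 points
```

Under these assumptions, a change smaller than the margin cannot be distinguished from noise, which is consistent with 50 pairs catching only large regressions and 200 pairs resolving changes of a few points.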
What’s the difference between faithfulness and relevance in RAG evaluation?
Faithfulness measures whether the response follows from the provided context fragments, regardless of whether those fragments were relevant. Relevance measures whether the response actually addresses the user’s question. A system can have high faithfulness (the model faithfully cites fragments) and low relevance (the fragments don’t match the question). Both metrics must be measured separately. The most common mistake is measuring only one and drawing conclusions about the entire system. If faithfulness is high but relevance is low, the problem lies in retrieval, not the generative model. More about search is described in the article semantic search and embeddings in the enterprise.
How often should the golden set be run in production?
Run it with every change to the knowledge base or model configuration. Regardless of changes, run it weekly on a sample of 50 production conversations evaluated by LLM-as-judge. For systems handling over 1,000 queries daily, reduce the interval to 3-4 days. For systems in high-risk sectors (finance, healthcare, HR), every configuration change requires passing the golden set with full documentation of results as part of the AI Act audit trail. Automating runs in CI/CD eliminates the risk of skipping tests during rapid changes.
Is LLM-as-judge reliable without calibration?
Without calibration against expert evaluation, LLM-as-judge often produces results with low correlation to actual domain quality. Models define “relevance” differently for different domains. Calibration involves: manually evaluating 100 pairs, evaluating the same pairs with LLM-as-judge, and measuring correlation. Below a Pearson correlation of 0.75, the model isn’t a reliable judge for your domain; aim for above 0.8 before trusting it on large volumes. Calibration must be repeated after updates to the evaluation model. Most of the cost is incurred once at system launch and is worth every minute, as it allows trusting automated metrics.
How does RAG evaluation connect with AI Act requirements?
For systems classified as high-risk under the AI Act (particularly in HR, finance, healthcare), regulations require documenting system effectiveness and explainability. Golden set evaluation with preserved result logs is direct evidence that the system was regularly controlled. Lack of this documentation makes demonstrating compliance difficult during a supervisory audit. For systems subject to DPIA, add bias tests and refusal-to-answer tests for out-of-domain questions to the evaluation process. Company obligations under the AI Act and GDPR (RODO) are described in the article AI Act and RODO 2026.