AI Agent Evaluation: Pre-Production Testing, Golden Sets, and Benchmarks

A company built a customer service agent. After a week in production, escalations to consultants are dropping. In the third week, it turns out that for three days the agent had been calling get_order_status with an incorrect order ID, generating false “order in transit” responses for half of the queries. No one checked tool call accuracy before deployment because the metrics were limited to latency and the number of handled conversations. This is a real-world pattern that emerges during first agent deployments in Polish and European companies.

AI agent evaluation before production is a distinct discipline from ongoing monitoring. Below, I describe how to build it step by step.

How Agent Evaluation Differs from RAG Evaluation

An agent isn’t just a language model. It consists of a model, a set of tools (tool-use), a decision loop, and often a RAG knowledge base. Each of these components can fail independently of the others. Guardrails may work correctly, yet the agent still calls the wrong tool with the right permissions.

RAG evaluation measures whether the model generates a response faithful to the source document. Agent evaluation measures three additional dimensions:

Tool selection accuracy – Did the agent call the right tool for the query?
Call parameter correctness – Did it pass the correct arguments to the tool?
Task success rate – Did the multi-step task end with the expected outcome?

These three dimensions can be checked before production—but only if you have a golden set with encoded expectations for each.

Golden Set: How to Build It and What to Avoid

A golden set is a collection of pairs: (user query, expected agent behavior). Expected behavior can vary in granularity:

Response level – Expected text or fragment for semantic comparison
Tool level – Expected tool the agent should call
Parameter level – Expected call arguments (e.g., {"order_id": "{{order_id}}", "locale": "pl"})
Task level – Expected end state after a multi-step sequence
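The four levels above could be encoded in a single record per example. The following is a minimal sketch, not a prescribed format: the class name `GoldenExample`, its field names, the `category` labels, and the sample query are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class GoldenExample:
    """One (user query, expected agent behavior) pair. Field names are hypothetical."""
    query: str
    category: str                              # e.g. "typical", "edge_case", "out_of_scope"
    expected_response: Optional[str] = None    # response level
    expected_tool: Optional[str] = None        # tool level
    expected_params: dict[str, Any] = field(default_factory=dict)  # parameter level
    expected_end_state: Optional[str] = None   # task level


def levels_covered(example: GoldenExample) -> list[str]:
    """List the granularity levels for which this example encodes an expectation."""
    levels = []
    if example.expected_response is not None:
        levels.append("response")
    if example.expected_tool is not None:
        levels.append("tool")
    if example.expected_params:
        levels.append("parameter")
    if example.expected_end_state is not None:
        levels.append("task")
    return levels


example = GoldenExample(
    query="Where is my order ORD-12345?",
    category="typical",
    expected_tool="get_order_status",
    expected_params={"order_id": "ORD-12345", "locale": "pl"},
)
print(levels_covered(example))  # ['tool', 'parameter']
```

Keeping every level optional lets one example assert only what matters for it, for instance the tool and its arguments without pinning the exact response text.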

For a customer service agent covering order statuses and return policies, a minimal golden set is 200-300 examples. A rough breakdown: 40% covers typical queries (high frequency), 30% covers edge cases (policy exceptions, missing data), and 30% covers out-of-scope queries (agent should escalate or refuse).
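A quick way to check a set against the 40/30/30 breakdown is to count category labels. This is a sketch under assumptions: the category names and the 5 percentage point tolerance are illustrative choices, not values from this article.

```python
from collections import Counter

# Target shares follow the rough breakdown above; the tolerance is an assumed value.
TARGET_SHARES = {"typical": 0.40, "edge_case": 0.30, "out_of_scope": 0.30}
TOLERANCE = 0.05


def composition_report(categories: list[str]) -> dict[str, tuple[float, bool]]:
    """Return, per category, the actual share and whether it is within tolerance."""
    counts = Counter(categories)
    total = len(categories)
    report = {}
    for name, target in TARGET_SHARES.items():
        share = counts.get(name, 0) / total if total else 0.0
        report[name] = (share, abs(share - target) <= TOLERANCE)
    return report


# A set skewed toward happy-path queries: 70% typical, 20% edge cases, 10% out of scope.
skewed = ["typical"] * 140 + ["edge_case"] * 40 + ["out_of_scope"] * 20
for name, (share, ok) in composition_report(skewed).items():
    print(f"{name}: {share:.0%} {'ok' if ok else 'off target'}")
```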

Pitfalls when building a golden set:

Overrepresentation of documentation examples. If the golden set comes mainly from FAQs or manuals the model saw during fine-tuning or in RAG, results will be inflated compared to production. Supplement it with real anonymized requests from previous channels.

Missing tool error examples. The golden set must include scenarios where the tool returns an error (timeout, missing record, invalid format). Check if the agent handles them gracefully instead of hallucinating results.
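A tool error scenario can be checked on the recorded trace of a run. The sketch below assumes a hypothetical trace format (the `Step` class and its `kind` values) and treats escalation as the only graceful reaction to a tool error, which is a simplification.

```python
from dataclasses import dataclass


@dataclass
class Step:
    """One step of an agent trace. The trace format is a hypothetical assumption."""
    kind: str       # "tool_call", "tool_result", "final_answer", or "escalation"
    payload: dict


def escalates_after_tool_error(trace: list[Step]) -> bool:
    """True if a tool error is followed by an escalation before any final answer.

    A trace without any tool error also returns True, since there is nothing to handle.
    """
    saw_error = False
    for step in trace:
        if step.kind == "tool_result" and step.payload.get("error"):
            saw_error = True
        elif saw_error and step.kind == "final_answer":
            return False  # the agent answered despite the failed tool call
        elif saw_error and step.kind == "escalation":
            return True
    return not saw_error


good = [
    Step("tool_call", {"name": "get_order_status"}),
    Step("tool_result", {"error": "timeout"}),
    Step("escalation", {}),
]
bad = [
    Step("tool_call", {"name": "get_order_status"}),
    Step("tool_result", {"error": "timeout"}),
    Step("final_answer", {"text": "Your order is in transit."}),
]
print(escalates_after_tool_error(good), escalates_after_tool_error(bad))  # True False
```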

Skipping multilingual examples. If the agent serves customers in multiple languages, each language needs separate coverage. Models behave differently for queries in minority languages.

Evaluation Metrics: Four Dimensions with PASS Thresholds

The table below presents the four quality dimensions of agent evaluation, plus two operational indicators (escalation rate and latency), with recommended measurement methods and the minimum threshold before production approval. Thresholds are a starting point, not a guarantee, and depend on process criticality and error risk in your industry.

Evaluation Dimension	Measurement Method	PASS Threshold	Notes
Response faithfulness	LLM-as-judge calibrated on 50 human pairs	≥ 85%	For high-risk systems, min. 92%
Tool selection accuracy	Exact match vs. golden set	≥ 90%	Count per call, not per conversation
Tool parameter correctness	Schema validation + exact/fuzzy match	≥ 88%	Fuzzy for text fields, exact for IDs and dates
Task success rate (multi-step)	Comparison of end state vs. expected	≥ 80%	Lower threshold due to cascading errors
"I don’t know"/escalation rate	Count of out-of-scope responses	10-35%	Too low = agent doesn’t escalate when it should
P95 latency on complex tasks	95th percentile of time from query to response	≤ 12 s	Include tool calls in measurement
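The two tool rows of the table can be scored per call roughly as follows. This is a sketch, not a reference implementation: the set of exact-match fields, the 0.8 fuzzy cut-off, the `reason` parameter, and the choice to count a wrong tool as a parameter miss are all assumptions. It uses `difflib.SequenceMatcher` from the Python standard library for the fuzzy comparison.

```python
from difflib import SequenceMatcher

# Assumptions for this sketch: which fields are compared exactly, and the fuzzy cut-off.
EXACT_FIELDS = {"order_id", "date"}
FUZZY_THRESHOLD = 0.8


def params_match(expected: dict, actual: dict) -> bool:
    """Exact match for IDs and dates, fuzzy match for free-text fields."""
    if expected.keys() != actual.keys():
        return False
    for key, want in expected.items():
        got = actual[key]
        if key in EXACT_FIELDS or not isinstance(want, str):
            if want != got:
                return False
        else:
            ratio = SequenceMatcher(None, want.lower(), str(got).lower()).ratio()
            if ratio < FUZZY_THRESHOLD:
                return False
    return True


def score_calls(pairs: list[tuple[dict, dict]]) -> tuple[float, float]:
    """Return (tool selection accuracy, parameter correctness), counted per call.

    A call with the wrong tool also counts as a parameter miss.
    """
    tool_hits = sum(exp["tool"] == act["tool"] for exp, act in pairs)
    param_hits = sum(
        exp["tool"] == act["tool"] and params_match(exp["params"], act["params"])
        for exp, act in pairs
    )
    return tool_hits / len(pairs), param_hits / len(pairs)


pairs = [
    ({"tool": "get_order_status", "params": {"order_id": "ORD-12345"}},
     {"tool": "get_order_status", "params": {"order_id": "ORD-12345"}}),
    ({"tool": "get_order_status", "params": {"order_id": "ORD-12345"}},
     {"tool": "get_order_status", "params": {"order_id": "ORD-12354"}}),  # transposed ID
    ({"tool": "create_refund", "params": {"order_id": "ORD-10001", "reason": "Package arrived damaged"}},
     {"tool": "create_refund", "params": {"order_id": "ORD-10001", "reason": "package arrived damaged."}}),
    ({"tool": "create_refund", "params": {"order_id": "ORD-10002", "reason": "Wrong size"}},
     {"tool": "get_order_status", "params": {"order_id": "ORD-10002"}}),  # wrong tool
]
print(score_calls(pairs))  # (0.75, 0.5)
```

The transposed order ID in the second pair is the kind of error a fuzzy comparison would let through, which is why IDs and dates are compared exactly.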

Tool selection accuracy below 90% before production is a signal to revise the system prompt or few-shot examples, not to deploy and hope for improvement. Unlike a wrong text response, an incorrect tool call triggers real consequences (data operations, reservations, payments) that often cannot be undone.
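The thresholds from the table can be turned into an automated release gate. The sketch below copies the table values; the metric key names and the function `release_gate` are hypothetical, and the thresholds should be adjusted to your process risk.

```python
# Thresholds copied from the table above; the key names are assumptions of this sketch.
PASS_THRESHOLDS = {
    "faithfulness": 0.85,
    "tool_selection_accuracy": 0.90,
    "tool_parameter_correctness": 0.88,
    "task_success_rate": 0.80,
}
ESCALATION_RANGE = (0.10, 0.35)
MAX_P95_LATENCY_S = 12.0


def release_gate(metrics: dict[str, float]) -> list[str]:
    """Return the list of failed checks; an empty list means PASS."""
    failures = [
        f"{name}: {metrics[name]:.2f} < {minimum:.2f}"
        for name, minimum in PASS_THRESHOLDS.items()
        if metrics[name] < minimum
    ]
    low, high = ESCALATION_RANGE
    rate = metrics["escalation_rate"]
    if not low <= rate <= high:
        failures.append(f"escalation_rate: {rate:.2f} outside {low:.2f}-{high:.2f}")
    latency = metrics["p95_latency_s"]
    if latency > MAX_P95_LATENCY_S:
        failures.append(f"p95_latency_s: {latency:.1f} > {MAX_P95_LATENCY_S:.1f}")
    return failures


metrics = {
    "faithfulness": 0.91,
    "tool_selection_accuracy": 0.87,
    "tool_parameter_correctness": 0.90,
    "task_success_rate": 0.82,
    "escalation_rate": 0.18,
    "p95_latency_s": 9.4,
}
failures = release_gate(metrics)
print("PASS" if not failures else failures)  # ['tool_selection_accuracy: 0.87 < 0.90']
```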

LLM-as-Judge: When It Works and When It Fails

LLM-as-judge is a method where a second model evaluates the quality of the first model’s responses. It speeds up large-scale evaluation (1,000+ pairs daily) but has limitations worth knowing before relying on it.

When it works well:

Evaluating faithfulness (whether the response is faithful to the source document) for RAG systems with clear facts
Flagging obvious hallucinations and completely off-topic responses
Comparing two agent versions (A/B) on the same question set

When it fails:

Evaluating tool parameter correctness (requires deterministic schema validation, not linguistic assessment)
Tasks requiring domain knowledge the judge model lacks (law, medicine, company-specific policies)
Detecting systematic errors from the same provider (if agent and judge use the same base model, the judge may not see the error)

Calibration is mandatory. Before using LLM-as-judge automatically, manually evaluate 50-100 pairs, compare them with the judge’s results, and calculate the agreement: Pearson correlation for numeric scores or Cohen’s kappa for categorical labels. A correlation below 0.75 with human evaluation disqualifies the judge for that dimension. The LLM-as-judge calibration pattern is described in the article on RAG quality evaluation.
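Both agreement measures are short enough to compute without extra dependencies. The sketch below is illustrative: the sample labels are invented, and applying the 0.75 cut-off to Cohen's kappa as well as to Pearson correlation is an assumption, since the text states it for correlation.

```python
from collections import Counter
from math import sqrt


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation for numeric scores (e.g. 1-5 ratings).

    Undefined (division by zero) if either list has zero variance.
    """
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def cohens_kappa(a: list[str], b: list[str]) -> float:
    """Cohen's kappa for categorical labels (e.g. "pass"/"fail").

    Undefined if both raters always use the same single label.
    """
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    return (observed - expected) / (1 - expected)


# Invented sample: ten pairs, the judge disagrees with the human on one of them.
human = ["pass", "pass", "fail", "fail", "pass", "fail", "pass", "pass", "fail", "pass"]
judge = ["pass", "pass", "pass", "fail", "pass", "fail", "pass", "pass", "fail", "pass"]

kappa = cohens_kappa(human, judge)
print(f"kappa = {kappa:.2f}, judge accepted: {kappa >= 0.75}")  # kappa = 0.78, judge accepted: True
```

Note that 90% raw agreement yields a kappa of only about 0.78 here, because kappa discounts the agreement expected by chance.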

Regression Tests: Model Changes Shouldn’t Be a Surprise

Switching the base model (e.g., upgrading to a newer version) or changing the system prompt are the most common causes of unexpected regressions in production agents. Model providers don’t guarantee identical behavior between versions.

Regression tests involve running the golden set on the new version and comparing results with a baseline snapshot. Three steps to make it work in practice:

Freeze the baseline – After passing pre-production evaluation, save the results (faithfulness score, tool accuracy, task success rate) as a version artifact. This is your comparison point for every change.
Automate golden set execution – Integrate the regression test into the CI/CD pipeline or run it manually before every version promotion. For production agents, weekly execution on a core set (50-100 pairs) is the minimum.
Define degradation thresholds – Not every 1 percentage point drop requires blocking. Set thresholds: a faithfulness drop of more than 3 percentage points or tool accuracy below the PASS threshold blocks promotion. Drift between consecutive tests—not a one-time result—is the signal for an audit.
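The comparison against the frozen baseline could look like the sketch below. The blocking rules follow the thresholds in step three (using the 90% PASS threshold from the table for tool accuracy); the metric key names, the function names, and the definition of drift as a decline in three consecutive runs are assumptions.

```python
# Blocking rules from step three; key names are assumptions of this sketch.
MAX_FAITHFULNESS_DROP = 0.03
TOOL_ACCURACY_PASS = 0.90


def blocking_reasons(baseline: dict[str, float], candidate: dict[str, float]) -> list[str]:
    """Return the reasons to block promotion; an empty list means the change may proceed."""
    reasons = []
    # Round to avoid floating-point noise right at the 3 percentage point boundary.
    drop = round(baseline["faithfulness"] - candidate["faithfulness"], 4)
    if drop > MAX_FAITHFULNESS_DROP:
        reasons.append(f"faithfulness dropped by {drop * 100:.1f} percentage points")
    if candidate["tool_accuracy"] < TOOL_ACCURACY_PASS:
        reasons.append(f"tool accuracy {candidate['tool_accuracy']:.2f} below PASS threshold")
    return reasons


def is_drifting(history: list[float], runs: int = 3) -> bool:
    """True if the metric declined in each of the last `runs` consecutive comparisons."""
    recent = history[-(runs + 1):]
    if len(recent) < runs + 1:
        return False
    return all(later < earlier for earlier, later in zip(recent, recent[1:]))


baseline = {"faithfulness": 0.91, "tool_accuracy": 0.94, "task_success_rate": 0.84}
candidate = {"faithfulness": 0.86, "tool_accuracy": 0.92, "task_success_rate": 0.83}
print(blocking_reasons(baseline, candidate))  # ['faithfulness dropped by 5.0 percentage points']
print(is_drifting([0.93, 0.92, 0.91, 0.90]))  # True
```

The second check captures the point about drift: each individual run in that history would pass the gate, yet the steady decline is a reason for an audit.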

The quality drift detection pattern over time is described in the article on AI agent monitoring. The impact of prompt changes on quality is discussed in the article on prompt engineering for businesses.

Limits of Hallucinations in Tool Calls

An agent can hallucinate not only in text responses but also in tool call parameters. Example: the agent calls create_refund(order_id="ORD-12345") for an order that doesn’t exist in the system, because it took the ID from the conversation text rather than from a real record.

Defense against this type of error requires validation on the tool side (not just the agent):

The tool returns a 404 error or error code when the record doesn’t exist
The agent has instructions in the system prompt: “If the tool returns an error, do not retry with different parameters. Escalate to a human.”
Structured output with JSON Schema validation before the arguments are passed to the tool
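The tool-side defenses above can be sketched as follows. Everything here is an assumption made for illustration: the in-memory `ORDERS` store, the `ORD-NNNNN` ID format (inferred from the example ID), the response shape, and the hand-rolled validator, which stands in for real JSON Schema validation.

```python
import re

# Hypothetical in-memory order store standing in for the real system of record.
ORDERS = {"ORD-10001": {"status": "delivered", "refundable": True}}
ORDER_ID_PATTERN = re.compile(r"ORD-\d{5}")


def validate_refund_args(args: dict) -> list[str]:
    """Hand-rolled stand-in for JSON Schema validation of create_refund arguments."""
    errors = []
    if set(args) != {"order_id"}:
        errors.append("expected exactly one field: order_id")
    order_id = args.get("order_id")
    if not isinstance(order_id, str) or not ORDER_ID_PATTERN.fullmatch(order_id):
        errors.append("order_id must match ORD-NNNNN")
    return errors


def create_refund(args: dict) -> dict:
    """Tool-side checks: reject malformed arguments first, then unknown records."""
    errors = validate_refund_args(args)
    if errors:
        return {"ok": False, "code": 400, "errors": errors}
    if args["order_id"] not in ORDERS:
        # A well-formed but nonexistent ID is exactly the hallucination case above.
        return {"ok": False, "code": 404, "errors": ["order not found"]}
    return {"ok": True, "code": 200, "refund_for": args["order_id"]}


print(create_refund({"order_id": "ORD-12345"})["code"])  # 404
print(create_refund({"order_id": "12345"})["code"])      # 400
print(create_refund({"order_id": "ORD-10001"})["code"])  # 200
```

Format validation alone would accept ORD-12345; only the record lookup on the tool side catches an ID that is well-formed but does not exist.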

The article on AI assistant security audits covers the full range of pre-production security tests, including injection and excessive tool permission tests.

FAQ

How many examples should a golden set have before the first deployment?

For a narrow-scope agent (2-3 tools, one process), the minimum is 150-200 pairs. For multi-task agents (5+ tools, several processes), the optimal range is 400-600 pairs. Below 150 pairs, edge case coverage is too low for results to have predictive value. The set’s composition matters more than its size: 30% out-of-scope examples and 30% edge cases are necessary for the golden set to detect real issues, not just confirm happy path functionality.

Can LLM-as-judge be used without calibration on a human sample?

No. Without calibration, you don’t know if the judge measures the same thing as a human. In projects relying on uncalibrated LLM-as-judge, scores were inflated by 8-15 percentage points compared to domain expert evaluations. Calibration requires 50-100 manually evaluated pairs and comparison with the judge’s results. If Pearson correlation is below 0.75, change the judge model or the measurement method for that dimension.

What to check if task success rate drops after a model change?

First, isolate where in the multi-step sequence the error occurs: tool selection accuracy, parameter correctness, or tool result interpretation. If tool selection accuracy is stable but parameters worsened, the issue lies in how the new model extracts data from the conversation structure. The usual fix is adding few-shot examples to the prompt or stricter structured-output validation before calling. The article on why AI projects fail describes systemic causes of regression after configuration changes.
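One way to isolate the failing stage is to record, for every evaluated call, whether each stage passed, and then count the first failure per call. The record format and the stage names below are hypothetical.

```python
from collections import Counter
from typing import Optional

# Stages in the order they occur within a call; the record format is an assumption.
STAGES = ("tool_selection", "parameters", "result_interpretation")


def first_failed_stage(record: dict) -> Optional[str]:
    """Return the earliest stage that failed in this call, or None if all passed."""
    for stage in STAGES:
        if not record[stage]:
            return stage
    return None


def failure_breakdown(records: list[dict]) -> Counter:
    """Count calls by the first stage at which they failed."""
    stages = (first_failed_stage(record) for record in records)
    return Counter(stage for stage in stages if stage is not None)


# Invented results for the new model: tool selection is stable, parameters got worse.
new_model = [
    {"tool_selection": True, "parameters": True, "result_interpretation": True},
    {"tool_selection": True, "parameters": False, "result_interpretation": True},
    {"tool_selection": True, "parameters": False, "result_interpretation": False},
    {"tool_selection": True, "parameters": True, "result_interpretation": False},
]
print(failure_breakdown(new_model))
```

Counting only the first failed stage avoids blaming later stages for cascading errors: in the third record, the wrong interpretation follows from wrong parameters and is attributed to them.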

How to evaluate an agent when there’s no historical data?

Without historical requests, build the golden set from three sources: (1) process documentation and product FAQs, (2) interviews with customer service consultants about typical and difficult queries, (3) synthetic data generated from the process description. Synthetic data requires verification by a domain expert before inclusion in the golden set, as LLM-generated examples may not match the real query distribution. The Agent Blueprint helps define tool scope and scenarios that must be covered.

How often should regression tests be run after deployment?

Weekly for deployments with over 500 daily queries, biweekly for smaller ones. A run is also mandatory before every change to the base model, system prompt, tool set, or RAG knowledge base. Automatically running the golden set in CI/CD for every pull request that changes the agent’s configuration is standard for production systems. The continuous monitoring pattern is described in the article on AI agent quality monitoring.
