A company deploys an AI agent, and the first few weeks look great—the number of inquiries to consultants drops. Then a complaint arrives: the agent provided incorrect pricing information. Another follows: it suggested an unavailable appointment slot. No one noticed because no one measured anything beyond the fact that the agent “responds.” This is standard for most initial deployments in Poland and Central Europe.
Monitoring an AI agent isn’t an option for a later phase. It’s the foundation without which deployment is a gamble. Below, I describe how to build this foundation from scratch and which numbers actually tell the truth.
Four Layers of Monitoring: What to Measure and Why
Effective AI agent monitoring consists of four layers, each answering a different question. All are necessary; none replaces the others.
| Layer | Question | Example Metrics |
|---|---|---|
| Response Quality | Is the agent responding correctly? | RAG accuracy, hallucination rate, user rating |
| Operational Behavior | Is the agent running stably? | p50/p95 latency, errors, token cost, timeout rate |
| Business Outcomes | Is the agent delivering results? | containment rate, escalations, conversions, CSAT |
| Compliance & Security | Is the agent auditable? | log TTL, decision trail, guardrail incidents |
The operational layer is the easiest technically (HTTP logs, Prometheus). The response quality layer is the hardest because it requires assessing whether the answer was substantively correct, not just “delivered.” Layers can’t be reduced to a single number. A company measuring only CSAT won’t see a rising hallucination rate until customers stop asking.
Response Quality KPIs: How to Measure What’s Not Visible in HTTP Logs
Hallucinations are responses that read as fluent and well-formed but are factually incorrect. In RAG systems, their primary cause is low retrieval accuracy: the model receives the wrong fragment and builds its answer on it. That’s why the first quality KPI is retrieval precision: the percentage of queries for which the top-3 fragments returned by the vector database are substantively accurate.
Ways to measure retrieval precision:
- Sampling with human evaluation: every week, a domain expert assesses 50–100 random (query, RAG response) pairs. Accuracy above 85% is a good result for the initial scope.
- LLM-as-judge: a second model evaluates the (fragment, query) pair without generating a response. Effective for quickly scanning large volumes, but it requires calibration on a human-labeled sample.
- End-user rating: thumbs up/down or a short post-resolution survey. The simplest option, but it measures perception, not substance.
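The first two methods can be combined in one weekly calculation. The sketch below is a minimal illustration, assuming each sampled pair carries both an expert verdict and an LLM-as-judge verdict; the `LabeledSample` type and its field names are hypothetical, not part of any specific tool.

```python
from dataclasses import dataclass


@dataclass
class LabeledSample:
    query_id: str
    human_ok: bool  # expert verdict: were the top-3 fragments accurate?
    judge_ok: bool  # LLM-as-judge verdict on the same pair


def retrieval_precision(samples: list[LabeledSample]) -> float:
    """Share of sampled queries whose fragments the expert accepted."""
    if not samples:
        raise ValueError("sample is empty")
    return sum(s.human_ok for s in samples) / len(samples)


def judge_agreement(samples: list[LabeledSample]) -> float:
    """Share of samples where the judge verdict matches the expert verdict."""
    if not samples:
        raise ValueError("sample is empty")
    return sum(s.human_ok == s.judge_ok for s in samples) / len(samples)
```

If agreement between the judge and the expert is low, the judge's numbers on the larger volume should not be trusted yet; that is the calibration step in practice.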
The second quality KPI is the escalation rate—the percentage of conversations the agent hands off to a human. A healthy range is 15–35% for the first narrow deployment. Below 10%, check if guardrails are working correctly—the agent may be responding where it should escalate. Above 50%, the knowledge base is too sparse to cover queries.
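The ranges above translate into a simple classification. This is a sketch only: the function names and labels are hypothetical, and the "watch" label for the bands the text does not describe (10–15% and 35–50%) is my assumption.

```python
def escalation_rate(escalated: int, total: int) -> float:
    """Share of conversations handed off to a human."""
    if total <= 0:
        raise ValueError("total must be positive")
    return escalated / total


def classify_escalation(rate: float) -> str:
    """Map an escalation rate to the ranges for a first narrow deployment."""
    if rate < 0.10:
        return "check_guardrails"  # agent may answer where it should escalate
    if rate > 0.50:
        return "knowledge_base_too_sparse"
    if 0.15 <= rate <= 0.35:
        return "healthy"
    return "watch"  # assumption: bands not covered by the text
```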
Business KPIs: Numbers the Board Understands
The business layer translates technical quality into organizational results. Three numbers are worth having on the dashboard from day one:
Containment rate is the percentage of cases closed by the agent without escalation to a human. For a narrow scope (e.g., FAQs and status updates), 50–70% is a realistic target after 8 weeks. A rising containment rate with stable or improving CSAT proves the system is maturing. A rising containment rate with falling CSAT signals the agent is closing cases it shouldn’t.
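The rule about reading containment together with CSAT can be written down directly. A minimal sketch, assuming you compare two reporting periods; the function names and returned labels are hypothetical.

```python
def containment_rate(closed_by_agent: int, total_cases: int) -> float:
    """Share of cases closed by the agent without escalation."""
    if total_cases <= 0:
        raise ValueError("total_cases must be positive")
    return closed_by_agent / total_cases


def maturity_signal(
    containment_prev: float,
    containment_now: float,
    csat_prev: float,
    csat_now: float,
) -> str:
    """Interpret a containment trend in the light of the CSAT trend."""
    if containment_now > containment_prev:
        if csat_now >= csat_prev:
            return "maturing"
        return "closing_cases_it_should_not"
    return "no_signal"  # the text only covers a rising containment rate
```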
Cost per resolved case (cost-per-case) combines token costs (inference), infrastructure, and human oversight. Calculate it from LLM router telemetry: tokens consumed per query multiplied by the model rate, plus an allocated share of infrastructure and human-oversight costs. Compare it to the cost of handling the same case via a consultant. The difference is your real financial outcome, not an estimate.
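As a worked example of that calculation, here is a sketch over one reporting period. Splitting the model rate into separate input and output prices per 1,000 tokens is my assumption, as are all parameter names; plug in your provider's actual pricing.

```python
def cost_per_case(
    input_tokens: int,
    output_tokens: int,
    price_in_per_1k: float,   # assumed pricing unit: per 1,000 input tokens
    price_out_per_1k: float,  # assumed pricing unit: per 1,000 output tokens
    infra_share: float,       # infrastructure cost allocated to the agent
    oversight_share: float,   # cost of human oversight for the period
    resolved_cases: int,
) -> float:
    """Total cost of the period divided by the cases the agent resolved."""
    if resolved_cases <= 0:
        raise ValueError("resolved_cases must be positive")
    inference = (
        input_tokens / 1000 * price_in_per_1k
        + output_tokens / 1000 * price_out_per_1k
    )
    return (inference + infra_share + oversight_share) / resolved_cases


def saving_per_case(consultant_cost: float, agent_cost: float) -> float:
    """Difference between the human-handled and agent-handled cost of a case."""
    return consultant_cost - agent_cost
```

With 1,000,000 input tokens at 1.0 per 1k, 200,000 output tokens at 2.0 per 1k, 300 of infrastructure, 300 of oversight, and 1,000 resolved cases, the result is 2.0 per case (illustrative numbers, not benchmarks).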
Time to first response is measured at the p50 and p95 percentiles. A p50 (median) below 3 seconds is achievable for most RAG + LLM architectures. A p95 above 15 seconds signals a bottleneck (embedding timeout, overloaded LLM router, no cache). Customers accept 3–4 seconds; they don’t accept 20.
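If latencies are available as a plain list of seconds, both percentiles can be computed without any monitoring stack. The sketch below uses the nearest-rank method, which is one of several percentile definitions; the `latency_status` labels are hypothetical.

```python
import math


def percentile(values: list[float], p: int) -> float:
    """Nearest-rank percentile: smallest value with at least p% of data at or below it."""
    if not values:
        raise ValueError("values is empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def latency_status(p50_s: float, p95_s: float) -> str:
    """Apply the thresholds from the text: p50 below 3 s, p95 not above 15 s."""
    if p95_s > 15:
        return "bottleneck"
    if p50_s >= 3:
        return "slow_median"
    return "ok"
```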
Observability Architecture: What to Collect and Where
Good AI agent monitoring requires three components: real-time data collection, storage with appropriate TTL, and an aggregating view.
Build data collection into the LLM router. Each call logs: timestamp, model, token count (input/output), latency, guardrails result (pass/block/escalate), session ID (anonymized), RAG sources (fragment identifiers, not content). This log is the foundation for all subsequent metrics.
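One way to shape such a log entry is a small record type serialized as a JSON line. This is a sketch under assumptions: the `CallLog` type, its field names, and the salted SHA-256 used to hide the raw session ID are all hypothetical choices, and a salted hash is pseudonymization rather than full anonymization.

```python
import hashlib
import json
from dataclasses import asdict, dataclass

ALLOWED_GUARDRAILS_RESULTS = {"pass", "block", "escalate"}


def pseudonymize(session_id: str, salt: str) -> str:
    """Replace the raw session ID with a truncated salted hash (assumption)."""
    digest = hashlib.sha256((salt + session_id).encode("utf-8")).hexdigest()
    return digest[:16]


@dataclass
class CallLog:
    timestamp: str            # ISO 8601 string
    model: str
    input_tokens: int
    output_tokens: int
    latency_ms: int
    guardrails_result: str    # pass / block / escalate
    session_id: str           # already pseudonymized
    rag_source_ids: list      # fragment identifiers only, never content

    def __post_init__(self) -> None:
        if self.guardrails_result not in ALLOWED_GUARDRAILS_RESULTS:
            raise ValueError(f"unknown guardrails result: {self.guardrails_result}")


def to_json_line(log: CallLog) -> str:
    """Serialize one call as a single JSON line with stable key order."""
    return json.dumps(asdict(log), sort_keys=True)
```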
Storage must have a TTL aligned with your GDPR (RODO) policy: typically 30–90 days for operational logs, longer only for aggregated metrics without PII. Conversation content is stored separately and has its own, shorter TTL (see RODO and AI Act).
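Most log stores enforce TTL natively, so treat the following only as an illustration of the retention rule. It assumes each record is a dictionary with a timezone-aware `timestamp`; the function name and record shape are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def purge_expired(records: list[dict], now: datetime, ttl_days: int) -> list[dict]:
    """Keep only records whose timestamp is within the TTL window."""
    cutoff = now - timedelta(days=ttl_days)
    return [r for r in records if r["timestamp"] >= cutoff]
```

Operational logs and conversation content would each get their own call with a different `ttl_days`.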
The aggregation layer is Prometheus + Grafana or a custom analytics script, depending on scale. For organizations with up to 10,000 queries per month, a simple spreadsheet dashboard with data from the router API is sufficient to start. Above that volume, Prometheus with anomaly alerts is standard.
Alerts: When Monitoring Must Wake a Human
A passive dashboard isn’t enough. You need alerts that react before problems reach customers. Four alerts that should be active from day one in production:
- Escalation rate above threshold — If escalations exceed, e.g., 60% in an hour, something’s wrong with the knowledge base or model. Immediate notification.
- P95 latency above 10 s — Signals an infrastructure bottleneck, not a quality issue. Requires stack investigation, not data review.
- Guardrails block rate tripled in 1 hour — Indicates a prompt attack attempt or input data anomaly. Requires guardrails log review.
- Token costs increased by 50% at constant volume — Signals model behavior change (longer responses, more retrieval fragments) or configuration error.
Alerts shouldn’t all go to the same person. Latency is for DevOps, guardrails block rate for security, escalation rate for the product owner. Human oversight must be designed into the architecture from the start, not bolted on later.
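The four alerts and their routing can be expressed as one evaluation over hourly statistics. A sketch under assumptions: the `HourlyStats` type is hypothetical, "constant volume" is interpreted as within 10% of the baseline, and routing the token-cost alert to DevOps is my choice because the text does not name an owner for it.

```python
from dataclasses import dataclass


@dataclass
class HourlyStats:
    escalation_rate: float
    p95_latency_s: float
    block_rate: float
    baseline_block_rate: float
    token_cost: float
    baseline_token_cost: float
    volume: int
    baseline_volume: int


def evaluate_alerts(s: HourlyStats, volume_tolerance: float = 0.10) -> list[tuple[str, str]]:
    """Return (alert name, recipient) pairs for every threshold that is breached."""
    alerts = []
    if s.escalation_rate > 0.60:
        alerts.append(("escalation_rate", "product_owner"))
    if s.p95_latency_s > 10:
        alerts.append(("p95_latency", "devops"))
    if s.baseline_block_rate > 0 and s.block_rate >= 3 * s.baseline_block_rate:
        alerts.append(("guardrails_block_rate", "security"))
    # Assumption: volume counts as constant if it moved by at most the tolerance.
    volume_stable = abs(s.volume - s.baseline_volume) <= volume_tolerance * s.baseline_volume
    if volume_stable and s.token_cost >= 1.5 * s.baseline_token_cost:
        alerts.append(("token_cost", "devops"))  # assumption: owner not named in the text
    return alerts
```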
Drift Assessment: When the Agent “Diverges” from Reality
Drift is when an agent that worked correctly in week one produces degraded responses three months later, without any code changes. Causes:
- Knowledge base drift — Documents become outdated (new prices, changed procedures), but the vector database wasn’t reindexed.
- Query distribution drift — Customers start asking about new things the knowledge base doesn’t cover.
- Model drift — The provider updated the model without notice, changing behavior.
Drift detection requires regular regression testing. Maintain a set of 50–100 questions with expected answers (golden set) and run it weekly or with every knowledge base or model version change. A retrieval precision drop of more than 5 percentage points signals the need for an audit. The article on RAG and fine-tuning explains when quality degradation requires model retraining versus just updating the knowledge base.
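A golden-set run can be reduced to a few lines. The sketch below assumes each golden question is paired with the identifier of the document that should be retrieved, as a proxy for the expected answer, and that `retrieve` is your own function returning ranked document IDs; both are hypothetical stand-ins.

```python
from typing import Callable


def golden_set_precision(
    golden: list[tuple[str, str]],         # (question, expected document ID)
    retrieve: Callable[[str], list[str]],  # hypothetical retriever: ranked IDs
    k: int = 3,
) -> float:
    """Share of golden questions whose expected document is in the top-k results."""
    if not golden:
        raise ValueError("golden set is empty")
    hits = sum(1 for question, expected in golden if expected in retrieve(question)[:k])
    return hits / len(golden)


def needs_audit(baseline: float, current: float, max_drop_pp: float = 5.0) -> bool:
    """True when precision dropped by more than max_drop_pp percentage points."""
    return (baseline - current) * 100 > max_drop_pp
```

Store the baseline from the last accepted release and compare every new run against it, rather than against the previous run, so that slow degradation does not slip through.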
Compliance and Audit Trail: AI Act and RODO Requirements
An AI agent serving customers is a system on the AI Act radar. Not every agent is high-risk, but every conversational system must meet transparency requirements: the customer must know they’re talking to AI. Logging this information is part of the audit trail.
The audit trail for an AI agent includes at minimum:
- Session ID (anonymized or pseudonymized)
- Model version and guardrails configuration used in the session
- Result of each guardrails check (pass/block/escalate) with timestamp
- RAG source identifiers (not content, only document reference)
- Handoff events to humans with context
This log lets you answer an inspector’s question: “Why did the agent provide this answer on March 15?” without reconstructing the entire system state. For sectors subject to DPIA (finance, healthcare, HR), requirements are stricter—details in the article on AI Act and RODO 2026.
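A simple completeness check helps keep that minimum intact as the system evolves. This is a sketch only; the field names are hypothetical labels for the five items listed above, with model version and guardrails configuration stored as separate fields.

```python
REQUIRED_AUDIT_FIELDS = (
    "session_id",          # anonymized or pseudonymized
    "model_version",
    "guardrails_config",
    "guardrails_checks",   # each check: result plus timestamp
    "rag_source_ids",      # document references only, no content
    "handoff_events",
)


def missing_audit_fields(record: dict) -> list[str]:
    """Return the required audit fields that are absent from a session record."""
    return [name for name in REQUIRED_AUDIT_FIELDS if name not in record]
```

Running it over a sample of stored sessions shows whether the trail would actually answer an inspector's question before anyone asks it.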
FAQ
Which AI agent KPIs should be reported to the board?
Three numbers every board understands: containment rate (percentage of cases handled without human intervention), CSAT after AI handling compared to human channels, and cost per resolved case. Technical metrics like p95 latency or retrieval precision belong with the product owner or engineers, not in board reports. The board needs the trend of these three numbers monthly, not daily granularity.
How often should an AI agent’s response quality be audited?
The minimum rhythm is biweekly for deployments under 1,000 queries daily and weekly above that threshold. The key is maintaining a consistent golden set of questions with expected answers and running it automatically with every knowledge base or model version change. A one-time deployment audit without regular regression tests doesn’t protect against drift. Work patterns with knowledge bases are described in the article on company GPT.
What does a high escalation rate in an AI agent mean?
An escalation rate above 40–50% for a narrow scope usually indicates a knowledge base that’s too small: the agent can’t find sufficiently accurate answers and cautiously hands off cases to humans. This behavior is desirable from a safety perspective but operationally costly. The fix involves expanding and improving document quality in the knowledge base, not lowering the escalation threshold. Assessing knowledge base gaps is easier with the automation finder tool.
How does AI agent monitoring relate to AI Act requirements?
The AI Act requires transparency (disclosing that the interlocutor is AI), explainability of decisions, and full human oversight for high-risk systems. Monitoring is the tool to meet these requirements: the audit trail lets you reconstruct why the agent behaved a certain way, and alerts on guardrails and escalations document that human oversight is active. Lack of monitoring isn’t just a quality risk; it’s a gap in the documentation required by regulators. Details on company obligations in 2026 are in the article AI Act and RODO.
How much does building AI agent monitoring cost?
It depends on scale and what you already have. With an existing stack (Prometheus, Grafana), implementation cost is mainly engineering time. For a greenfield deployment at small volume, basic monitoring (JSON logs, golden set test, simple dashboard) can be built as part of the pilot project without a separate budget line. Costs rise with high-availability requirements, complex structured output pipelines, and LLM-as-judge audits at scale. A realistic cost estimate for your scope is generated by the ROI calculator.