Agentic RAG: an agent that plans searches autonomously

A legal client asks an assistant: “What risks do we face when terminating a contract with a German subcontractor, considering our framework agreement, the current annex, and case law from the past year?” Classic RAG performs one search, returns a few fragments, and generates an answer. It might hit the mark, or it might not. Agentic RAG approaches this differently: it breaks the question into multiple steps and checks after each one whether it has enough information.

What changes in the architecture

In the classic RAG approach, the flow is linear and fixed: one vector query, k fragments into context, one generation. This works well when the question is precise and the answer lies in one place in the knowledge base.

Agentic RAG replaces this fixed flow with a decision loop. The agent receives the question, plans an initial query, evaluates the returned fragments, reformulates the query if the context is incomplete or contradictory, retrieves another batch of data, and repeats until a confidence threshold is met. Only then does it proceed to generation.

Key structural differences resulting from this change:

Multiple queries. The agent may send 2-6 separate queries in one response cycle. Each can use a different formulation, a different search strategy (vector, hybrid BM25 plus vectors, metadata), or target a different document collection.

Sufficiency evaluation. After each iteration, the model assesses whether the gathered fragments are sufficient to answer the question. This isn’t a simple threshold rule: the model checks consistency, coverage, and whether any critical thread is missing.

Query reformulation. If the first query returns fragments that are too broad, the agent narrows it down. If the context appears contradictory, the agent searches for a resolving document. This is reranking and selection done actively, not passively.

Confidence boundary. If the agent still hasn’t gathered sufficient context after the iteration limit, it doesn’t speculate. It escalates to a human with the message: “The question is outside the knowledge base coverage or requires expert verification.”
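The four elements above can be combined into one loop. Below is a minimal sketch, not a reference implementation: the names `run_agentic_rag`, `Assessment`, and `LoopResult` are hypothetical, and `retrieve`, `assess`, and `generate` are assumed to be callables you supply (a search client, an LLM-based sufficiency check, and an LLM generation call).

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

ESCALATION_MESSAGE = (
    "The question is outside the knowledge base coverage "
    "or requires expert verification."
)


@dataclass
class Assessment:
    sufficient: bool
    next_query: Optional[str] = None  # reformulated query when context is incomplete


@dataclass
class LoopResult:
    answer: str
    escalated: bool
    fragments: List[str] = field(default_factory=list)
    queries: List[str] = field(default_factory=list)


def run_agentic_rag(
    question: str,
    retrieve: Callable[[str], List[str]],
    assess: Callable[[str, List[str]], Assessment],
    generate: Callable[[str, List[str]], str],
    max_iterations: int = 6,  # assumed limit; calibrate per project
) -> LoopResult:
    fragments: List[str] = []
    queries: List[str] = []
    query = question
    for _ in range(max_iterations):
        queries.append(query)
        fragments.extend(retrieve(query))       # multiple queries
        verdict = assess(question, fragments)   # sufficiency evaluation
        if verdict.sufficient:
            return LoopResult(generate(question, fragments), False, fragments, queries)
        query = verdict.next_query or question  # query reformulation
    # Confidence boundary: no speculative answer, escalate instead.
    return LoopResult(ESCALATION_MESSAGE, True, fragments, queries)
```

Generation happens only on the success path; when the iteration limit is reached, the result carries the escalation message together with the queries and fragments gathered so far.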

Classic RAG vs. agentic RAG

Dimension	Classic RAG	Agentic RAG
Queries per response	1	2-6 (depends on complexity)
Latency	300-800 ms	2-8 s
LLM cost	1 generation call	2-6 planning and generation calls
Best for	Simple, factual questions	Complex, multi-threaded, ambiguous
Hallucination risk	Medium (no context = no generation)	Lower (iteration reduces context gaps)
Evaluation difficulty	Moderate	High (multiple search paths)
Human-gate	Optional	Required for low confidence

Latency and cost ranges are based on our observations from projects with Polish legal and financial documents and are not guarantees. Actual values depend on document length, model, and iteration count.

Where agentic RAG actually helps

The agent-based approach makes sense when the question has at least one of these characteristics.

Multi-threading. The question combines information from different sources or time periods. Classic RAG returns a fragment from one place; the agent connects several.

Ambiguity. The question is underspecified and requires clarification through context. The agent can ask about the case background, retrieve a general fragment, then drill into details.

Conflicting information in the base. When documents in the collection contain data from different dates or inconsistent procedures, classic RAG might return an older version. The agent iterates and compares fragments to select the current one.

Auditability of the search path. In compliance applications (law, finance, medicine), the value lies not only in the answer but also in the path to it: which documents were checked, in what order, and why they were sufficient. Agentic RAG naturally produces this log.
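For the conflicting-information case, one simple way to pick the current version is to compare fragment metadata. The sketch below assumes each fragment carries a `document_family` key (grouping versions of the same document) and an `effective_date`; both field names are hypothetical and depend on how the collection was indexed.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, List


@dataclass
class Fragment:
    document_family: str   # hypothetical key shared by all versions of one document
    effective_date: date   # hypothetical metadata field set at indexing time
    text: str


def latest_per_family(fragments: List[Fragment]) -> List[Fragment]:
    """Keep only the most recent fragment for each document family."""
    latest: Dict[str, Fragment] = {}
    for fragment in fragments:
        current = latest.get(fragment.document_family)
        if current is None or fragment.effective_date > current.effective_date:
            latest[fragment.document_family] = fragment
    return list(latest.values())
```

A date comparison only resolves conflicts between versions of the same document; contradictions between different documents still need the agent's own comparison step.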

For simple dictionary questions, single-step product lookups, or FAQ navigation, classic RAG is cheaper and sufficient. There’s no point in deploying an agent-based architecture where a fixed flow works.

Cost and evaluation challenges

Agentic RAG isn’t free. Three main costs to consider before implementation:

More LLM calls. Each search iteration requires at least one additional model call to assess sufficiency and reformulate the query. With cloud models, this directly impacts token costs. Our inference calculator helps estimate whether self-hosting is cost-effective for a given volume.

Higher latency. Two seconds is an acceptable response time for an internal B2B assistant. For customer service chat with sub-second expectations, agentic RAG requires buffering, semantic caching, or a hybrid mode: agent for complex questions, classic RAG for simple ones.

Harder evaluation. In classic RAG, you have one set of context fragments and one answer to evaluate. In agentic RAG, the search path varies. The golden set must cover not only the final answer but also the quality of reformulations and sufficiency decisions. At Cashcrown, we build separate golden sets for this mode with path annotation, which is slower but necessary for reliable measurement. Methodology details are in the article on evaluating RAG systems.
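The first of these costs can be estimated with simple arithmetic. The sketch below assumes one assessment call per search iteration plus one final generation call, and treats token counts and prices as inputs you provide; the function name and parameters are hypothetical, and it is not the inference calculator mentioned above.

```python
def estimate_llm_cost(
    iterations: int,
    input_tokens_per_call: int,
    output_tokens_per_call: int,
    price_in_per_1k: float,
    price_out_per_1k: float,
) -> float:
    """Rough per-question LLM cost.

    Assumes one sufficiency/reformulation call per iteration and one
    generation call at the end; iterations=0 models classic RAG.
    """
    calls = iterations + 1
    cost_per_call = (
        input_tokens_per_call / 1000 * price_in_per_1k
        + output_tokens_per_call / 1000 * price_out_per_1k
    )
    return calls * cost_per_call
```

Under this assumption the cost grows linearly with the number of iterations, which is why the iteration limit is also a budget control.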

Guardrails and human-gate: where the agent must stop

An unbounded agent loop is an operational risk. Three guardrails we consider mandatory.

Hard iteration limit. The agent can’t search indefinitely. We set an upper bound (usually 5-7 iterations) and handle exceeding it as escalation, not failure. The log includes the question, all iterations, and the reason for escalation.

Guardrails against hallucinations. Before generating the answer, we check whether every claim the model intends to include is supported by the gathered fragments. Unsupported claims are removed or flagged. The same safeguard described in the article on multi-step agents is applied here to the search layer.

Human-gate for low confidence. If the sufficiency assessment after all iterations is below the acceptance threshold, the agent doesn’t generate a speculative answer. Instead, it returns: “I don’t have sufficient documents to answer confidently. Escalating to an expert.” This handoff must include context: which questions were asked, what was found, and what’s missing.
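The claim-coverage guardrail can be sketched as a filter that runs before the final answer is assembled. The example below uses plain token overlap as a stand-in for the support check; this is an assumption made to keep the sketch self-contained, and a production system would more likely use an entailment model or an LLM judge. The function name and the `min_overlap` value are hypothetical.

```python
import re
from typing import List, Set, Tuple


def _tokens(text: str) -> Set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def split_supported(
    claims: List[str],
    fragments: List[str],
    min_overlap: float = 0.6,  # assumed threshold; calibrate on a golden set
) -> Tuple[List[str], List[str]]:
    """Split claims into (supported, unsupported) by best overlap with any fragment."""
    fragment_tokens = [_tokens(f) for f in fragments]
    supported: List[str] = []
    unsupported: List[str] = []
    for claim in claims:
        claim_tokens = _tokens(claim)
        best = 0.0
        if claim_tokens:
            for ft in fragment_tokens:
                best = max(best, len(claim_tokens & ft) / len(claim_tokens))
        if best >= min_overlap:
            supported.append(claim)
        else:
            unsupported.append(claim)
    return supported, unsupported
```

Unsupported claims can then be removed from the answer or flagged for the reviewer, as described above.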

Observability of the loop is necessary for these guardrails to work. We log every iteration: vector query, returned fragments, sufficiency assessment, and the decision to continue or stop. Without this log, there’s no way to debug escalation cases or calibrate the confidence threshold.
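A per-iteration log record could look like the sketch below. The field names and the JSON-lines serialization are assumptions for illustration, not a fixed schema.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class FragmentRef:
    document_id: str
    position: int  # position of the fragment within the document


@dataclass
class IterationRecord:
    question: str          # original user question
    iteration: int
    query: str             # query actually sent in this iteration
    fragments: List[FragmentRef]
    sufficiency: float     # score from the sufficiency assessment
    decision: str          # "continue", "stop", or "escalate"


def to_log_line(record: IterationRecord) -> str:
    """Serialize one iteration as a single JSON line."""
    return json.dumps(asdict(record), ensure_ascii=False, sort_keys=True)
```

One line per iteration keeps the log easy to append to and to replay when debugging an escalation or recalibrating the threshold.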

Implementation: from pilot to production

A typical agentic RAG pilot for one process in a B2B company takes 4-6 weeks. The first two weeks involve working with data and designing the loop (chunking strategy, sufficiency threshold, iteration limit, log schema). The third and fourth weeks are shadow mode: the agent processes questions in parallel with the team’s answers, and discrepancies go to analysis.

Before production launch, we build a golden set specific to agentic RAG: for each question, we annotate the expected search path and sufficiency decision, not just the final answer. This is slower than a golden set for classic RAG but the only method that allows evaluating planning quality, not just the outcome. How to build this golden set in the context of multi-agent systems is described separately.
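A golden-set entry with path annotation, and a basic score for one run, might be sketched as follows. The structure and the metric names (`path_recall`, `extra_documents`, `sufficiency_correct`) are hypothetical; they illustrate evaluating the search path separately from the final answer.

```python
from dataclasses import dataclass
from typing import Dict, List, Union


@dataclass
class GoldenCase:
    question: str
    expected_documents: List[str]  # documents an ideal search path should reach
    expected_sufficient: bool      # should the loop end as "sufficient"?
    expected_answer: str


def score_path(
    case: GoldenCase,
    visited_documents: List[str],
    judged_sufficient: bool,
) -> Dict[str, Union[float, int, bool]]:
    """Compare one run's search path and sufficiency decision with the annotation."""
    expected = set(case.expected_documents)
    visited = set(visited_documents)
    recall = len(expected & visited) / len(expected) if expected else 1.0
    return {
        "path_recall": recall,                       # share of expected documents reached
        "extra_documents": len(visited - expected),  # documents visited but not annotated
        "sufficiency_correct": judged_sufficient == case.expected_sufficient,
    }
```

Answer quality would be scored separately; this part only measures whether the agent searched in the right places and stopped for the right reason.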

Pilot cost depends on the knowledge base complexity and annotation requirements. The ROI of agent-based deployment is estimated using the inference calculator, which accounts for the cost of additional LLM iterations.

FAQ

How does agentic RAG differ from classic RAG?

Classic RAG executes one fixed cycle: one vector query, k fragments into context, one generation. Agentic RAG replaces this flow with a loop: the agent plans the query, evaluates whether results are sufficient, reformulates, and iterates until a confidence threshold or step limit is reached. The difference is architectural, not just in the number of queries: classic RAG has no step at which it can decide whether the context is sufficient.

When doesn’t agentic RAG make sense?

For simple, single-step factual questions, product lookups, FAQ navigation, and anywhere users expect sub-second responses. Agentic RAG pays off for complex, multi-threaded questions where missing context leads to incorrect answers. For most internal B2B assistants, a reasonable approach is a hybrid mode: classic RAG for simple questions, agentic for complex ones, with a router deciding the path.
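The router in the hybrid mode can start as a simple heuristic. The sketch below is an assumption for illustration only: the marker list, the word limit, and the function name are hypothetical, and a small classifier or an LLM call would likely replace this rule in practice.

```python
import re

# Hypothetical signals that a question combines several threads.
MULTI_PART_MARKERS = ("considering", "compared", "versus", " and ", "as well as")


def route(question: str, max_simple_words: int = 12) -> str:
    """Return "classic" for short single-thread questions, otherwise "agentic"."""
    text = question.lower()
    word_count = len(re.findall(r"\w+", text))
    marker_hits = sum(1 for marker in MULTI_PART_MARKERS if marker in text)
    if word_count <= max_simple_words and marker_hits == 0:
        return "classic"
    return "agentic"
```

Misrouted questions are worth logging: a simple question sent to the agent only costs latency, while a complex one sent to classic RAG can produce an incomplete answer.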

How to assess context sufficiency in the agent loop?

There’s no universal method. Two practical approaches work: an evaluative prompt asking the model to indicate which question threads are covered in the gathered fragments, and a threshold consistency score based on cosine similarity between the question and aggregated fragments. At Cashcrown, we calibrate this threshold on a golden set per project, as the optimal level varies between legal, technical, and commercial knowledge bases.
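The second approach, a threshold on cosine similarity between the question and the aggregated fragments, can be sketched as below. It assumes embeddings are already computed and aggregates fragments by averaging their vectors; the aggregation choice, the function names, and the default threshold of 0.8 are assumptions, since the threshold is calibrated per project.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def mean_vector(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Component-wise mean; assumes a non-empty list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def is_sufficient(
    question_vec: Sequence[float],
    fragment_vecs: Sequence[Sequence[float]],
    threshold: float = 0.8,  # placeholder; calibrate on a golden set
) -> bool:
    if not fragment_vecs:
        return False
    return cosine(question_vec, mean_vector(fragment_vecs)) >= threshold
```

A similarity score alone cannot tell whether every thread of a multi-part question is covered, which is one reason to pair it with the evaluative prompt.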

Does agentic RAG solve the hallucination problem?

Partially. Iterative searching reduces the risk that the model answers without context, as the agent gathers more and better-matched fragments. But it doesn’t eliminate hallucinations entirely: the model can still go beyond the gathered context when generating the answer. A guardrail checking the coverage of each claim in the context is still necessary, as is the option to escalate to a human for low confidence. The architecture of this layer is described in the article on enterprise GPT on a knowledge base.

How to log the search path for audit purposes?

Each iteration generates a record: original question, reformulated vector query, returned fragments (with document ID and position), sufficiency assessment, and the decision to continue or stop. This log goes to separate storage with GDPR-compliant (RODO) retention. In applications covered by the AI Act (finance, law, HR), the log is a mandatory part of the audit trail, not an optional debugging add-on.