Agentic RAG: architecture, and when it actually pays off
An opinionated architecture explainer on agentic RAG: retrieval-as-a-tool, query planning, self-correction loops, the latency and cost tax, and when naive RAG is still the right answer.
Search for "agentic RAG" and you get a wall of vendor explainers that all describe the same four behaviors (the system rewrites your query, picks which source to search, checks whether the results are any good, and loops if they aren't), and then every one of them resolves to "...and that's why our platform does this." The most useful sentence on the whole results page isn't on any vendor page. It's in a Reddit thread titled "Agentic RAG is mostly hype," and it reads: most people don't have a RAG problem, they have a garbage-in, garbage-out problem. That's the tension this piece is about. Agentic RAG is a genuinely different architecture, and on the right workload it's worth the cost. On the wrong one it's an expensive way to get the same answer slower. We build retrieval systems, we've added the agentic layer, and we've ripped it back out when it turned out to be solving a problem the team didn't actually have. Here's how the architecture works, what each loop buys you, what it costs, and where the line sits.
Agentic RAG in one paragraph, and the question it's actually answering
Agentic RAG is retrieval-augmented generation where the model decides whether to retrieve, what to retrieve, and whether what it got back is good enough, instead of retrieving once, unconditionally, before it answers. Naive RAG is a fixed pipeline: embed the query, search the vector store, stuff the top chunks into the prompt, generate. Agentic RAG wraps that pipeline in a reasoning loop and turns retrieval into a tool the model can choose to call, skip, repeat, or reroute. That single change is the whole idea. Everything else (query rewriting, source routing, self-grading, multi-hop research) falls out of giving the model judgment about its own retrieval.
The question it's really answering isn't "how do I make RAG smarter." It's "my single-pass retrieval is hitting a ceiling on certain queries, can the model dig itself out?" If you don't have that ceiling yet, you don't have an agentic RAG problem. You have a retrieval-quality problem, and there are cheaper places to spend your latency budget than an LLM loop.
Naive RAG vs agentic RAG: what actually changes
The clearest way to see the difference is the control flow. Naive RAG is a straight line with no branches: the same five steps run for every query, whether the query is "what's our refund policy" or "compare the indemnity clauses across our last three vendor contracts." The difference between agentic RAG and naive RAG is one of control structure, not components: both use an embedding model and a vector store. One runs the pipeline once; the other runs it under a controller that can branch and repeat.
| Naive RAG | Agentic RAG |
|---|---|
| One unconditional retrieval before generation. | Retrieval is a tool the model decides to call. |
| Fixed pipeline: embed query, search, rerank (maybe), stuff top-k into the prompt, generate. | A controller can rewrite the query, route to the right source, judge whether results are sufficient, and loop. |
| One LLM call. | Three to eight LLM calls. |
| Latency is a number you can put in an SLA. | Latency is a distribution, not a number. |
| Fails silently: it doesn't know when it's about to give a bad answer. | Can recover from a bad first retrieval, and can also loop forever or self-grade a wrong answer as right. |
| Best for well-formed queries where the answer lives in one place. | Best for multi-hop questions and heterogeneous sources. |
That "fails silently" line in the left column is the real argument for agentic RAG. A naive pipeline retrieves whatever the embedding search returns and answers from it, even when the retrieved chunks are irrelevant. It has no mechanism to notice. The agentic version can at least ask the question (is this enough to answer?) before it commits. The cost of asking is more LLM calls, and that cost is the entire trade-off.
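To make the silent failure concrete, here is a minimal sketch that assumes a toy word-overlap scorer in place of a real embedding search. `CORPUS`, `score`, `naive_retrieve`, and `is_sufficient` are illustrative names we made up, not part of any library, and the overlap threshold is arbitrary.

```python
# Toy stand-in for a vector store: scores chunks by shared words.
# CORPUS, score, naive_retrieve, and is_sufficient are illustrative names only.
CORPUS = [
    "Refunds are issued within 14 days of purchase.",
    "Our office is closed on public holidays.",
    "Support tickets are answered within one business day.",
]


def score(query: str, chunk: str) -> int:
    """Count words the query and the chunk have in common."""
    q = set(query.lower().split())
    c = set(chunk.lower().rstrip(".").split())
    return len(q & c)


def naive_retrieve(query: str, k: int = 2) -> list[str]:
    """Always return the top-k chunks, however weak the match is."""
    ranked = sorted(CORPUS, key=lambda ch: score(query, ch), reverse=True)
    return ranked[:k]


def is_sufficient(query: str, chunks: list[str], min_overlap: int = 2) -> bool:
    """The question a naive pipeline never asks: did anything really match?"""
    return any(score(query, ch) >= min_overlap for ch in chunks)


on_topic = "when are refunds issued"
off_topic = "what is the indemnity cap in the vendor contract"

# The naive path hands back k chunks for both queries without complaint.
print(len(naive_retrieve(on_topic)), len(naive_retrieve(off_topic)))
# The sufficiency check is what tells the two queries apart.
print(is_sufficient(on_topic, naive_retrieve(on_topic)))
print(is_sufficient(off_topic, naive_retrieve(off_topic)))
```

In a real system the sufficiency check would be an LLM grader or a reranker score threshold rather than word overlap, but the shape is the same: one extra question asked before the answer is committed.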
Retrieval as a tool: the one architectural move that makes RAG agentic
Here's the move, concretely. In naive RAG, retrieval is wired into the request handler, so it always runs. In agentic RAG, retrieval is registered as a tool, the same way you'd register a calculator or a SQL function, and the model gets to decide whether calling it helps. This is the retrieval-as-a-tool pattern, and it's the canonical implementation LangGraph and LlamaIndex both build around. Below is the smallest version that shows the shift: the retriever is a tool the model can bind to and invoke, not a step the framework forces.
```python
from langchain_core.tools import tool
from langchain_openai import ChatOpenAI

# Assumes `vector_store` (any LangChain vector store) and `messages`
# (the conversation so far) are defined elsewhere.


# The retriever is just a tool the model MAY call -- not a forced step.
@tool
def retrieve_docs(query: str) -> str:
    """Search the knowledge base. Call this only when the question
    needs facts you don't already have. Rewrite the query first
    if the user's phrasing is vague."""
    hits = vector_store.similarity_search(query, k=5)
    return "\n\n".join(d.page_content for d in hits)


# Bind the tool to the model. Now the MODEL decides whether to retrieve.
model = ChatOpenAI(model="gpt-5").bind_tools([retrieve_docs])

# A greeting won't trigger retrieval. A factual question will --
# and the model can call retrieve_docs again with a better query
# if the first set of chunks doesn't answer it.
# invoke() returns the model's decision (see result.tool_calls); running
# the tool and feeding the output back is the surrounding loop's job.
result = model.invoke(messages)
```

Read the docstring; it's doing real work. It tells the model when to retrieve ("only when the question needs facts you don't already have") and when to rewrite first. That's the agentic part. The model isn't following a pipeline; it's making a decision and can make it again. Once retrieval is a tool, query planning, source routing, and the retry loop are all just the model choosing how to use that tool. You didn't add four features. You added one decision point and got the rest as consequences.
The agentic RAG architecture, end to end
Put the pieces together and the agentic RAG architecture is a loop with a guard, not a line. A query comes in, the controller plans (rewrite, decide which source), retrieval runs as a tool call, a grader judges sufficiency, and the controller either generates the answer or loops back to retrieve again, bounded by a maximum-iteration count so it can't run forever. The fallback edge matters as much as the happy path: when the iteration budget is spent, the system should answer from what it has rather than spin. We treat that fallback as a hard requirement in every loop we ship, because it's the difference between a slow answer and no answer at all.
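The loop described above can be sketched in plain Python with the four roles injected as callables. This is one possible shape under our own assumptions, not a framework's API: `run_agentic_loop`, `LoopResult`, and the four callable parameters are hypothetical names, and the lambdas at the bottom are stubs standing in for LLM and vector-store calls.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class LoopResult:
    answer: str
    iterations: int
    used_fallback: bool
    queries: list[str] = field(default_factory=list)


def run_agentic_loop(
    question: str,
    plan: Callable[[str, list[str]], str],      # rewrites the query given past attempts
    retrieve: Callable[[str], list[str]],       # the retrieval tool
    grade: Callable[[str, list[str]], bool],    # True when the chunks are sufficient
    generate: Callable[[str, list[str]], str],  # final answer from the chunks
    max_iter: int = 3,
) -> LoopResult:
    """Plan, retrieve, grade, then answer or loop, at most max_iter times."""
    tried: list[str] = []
    chunks: list[str] = []
    for i in range(1, max_iter + 1):
        query = plan(question, tried)
        tried.append(query)
        # Accumulate results so the fallback has material to answer from.
        for chunk in retrieve(query):
            if chunk not in chunks:
                chunks.append(chunk)
        if grade(question, chunks):
            return LoopResult(generate(question, chunks), i, False, tried)
    # Fallback edge: budget spent, answer from what we have rather than spin.
    return LoopResult(generate(question, chunks), max_iter, True, tried)


# Stubs standing in for the LLM and the vector store.
docs = {"q v1": ["A"], "q v2": ["B"], "q v3": ["C"]}
stub_plan = lambda q, tried: f"{q} v{len(tried) + 1}"
stub_retrieve = lambda query: docs.get(query, [])
stub_generate = lambda q, chunks: " + ".join(chunks)

ok = run_agentic_loop("q", stub_plan, stub_retrieve,
                      lambda q, chunks: "B" in chunks, stub_generate)
spent = run_agentic_loop("q", stub_plan, stub_retrieve,
                         lambda q, chunks: False, stub_generate)
print(ok.iterations, ok.used_fallback, ok.answer)
print(spent.iterations, spent.used_fallback, spent.answer)
```

Returning `used_fallback` alongside the answer is a deliberate choice in this sketch: the caller can log it or soften the answer's wording when the grader never signed off.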
Query planning: how an agent decides what to retrieve
Query planning is the first thing the controller does, and it's where a lot of the recall gain actually comes from. A user types "what changed in the contract," which is a terrible embedding query, too vague to match anything specific. The planner rewrites it into something a vector store can act on, decides which source to hit (a contracts index, not the general docs), and on a complex question decomposes it into sub-queries that get retrieved separately. Routing across heterogeneous sources is where multi-agent setups earn their keep, and we cover how those controllers coordinate in our guide to multi-agent orchestration patterns, because once you have more than one retrieval source, picking the right one per query is the same routing problem.
Two patterns dominate the planning step. Query rewriting (sometimes via HyDE, where the model drafts a hypothetical answer and embeds that instead of the raw question) fixes the vague-input problem. Routing fixes the wrong-source problem: a financial question goes to the SQL tool, a policy question goes to the vector store. Both are cheap wins that don't need a full agentic loop; you can bolt query rewriting onto naive RAG and get most of the benefit for one extra LLM call. That's worth remembering when you're deciding how far down the agentic road to go.
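Here is a rough sketch of that bolt-on: rewrite, route, retrieve once, with no loop at all. Everything in it is an assumption for illustration. `plan_and_retrieve` and the source registry are hypothetical, and `rewrite` and `classify` would each be a cheap LLM call (or a heuristic) in practice; here they are stubbed with lambdas.

```python
from typing import Callable

# Hypothetical source registry: each source is a callable that takes a query.
Source = Callable[[str], list[str]]


def plan_and_retrieve(
    question: str,
    rewrite: Callable[[str], str],   # vague question -> searchable query
    classify: Callable[[str], str],  # query -> name of the source to hit
    sources: dict[str, Source],
    default_source: str,
) -> tuple[str, str, list[str]]:
    """Rewrite the query, pick a source, retrieve once. No loop involved."""
    query = rewrite(question)
    name = classify(query)
    if name not in sources:
        # Don't trust the classifier's label blindly; fall back to a known source.
        name = default_source
    return name, query, sources[name](query)


# Stub sources and stub planner calls, for illustration only.
sources = {
    "sql": lambda q: [f"sql rows for: {q}"],
    "policies": lambda q: [f"policy chunks for: {q}"],
}
name, query, hits = plan_and_retrieve(
    "what changed in the contract",
    rewrite=lambda q: "amendments to vendor contract terms, latest revision",
    classify=lambda q: "contracts",  # a label that is not in the registry
    sources=sources,
    default_source="policies",
)
print(name, hits)
```

The default-source fallback matters more than it looks: a router that raises on an unexpected label turns a slightly worse retrieval into an outage.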
Self-correction loops: the part everyone shows and nobody bounds
Every vendor diagram shows the self-correction loop (retrieve, grade, retry if the grade is low) and every vendor diagram stops there, as if the loop just converges. In production it doesn't always. The grader is itself an LLM call, and an LLM is perfectly capable of grading a wrong answer as correct (the self-graded hallucination) or grading a correct answer as insufficient and looping anyway. Without a hard guard, a self-correction loop on an unanswerable query will retry until it hits a timeout or a token limit, which is the worst possible failure: slow and expensive and still wrong.
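One way to bound the loop, sketched under our own assumptions: a small guard object that refuses another retrieval attempt when the iteration cap is hit, the wall-clock budget is spent, or the planner re-issues a query it already tried. `LoopGuard` and its parameters are hypothetical names, and the default limits are placeholders to tune against your own latency budget.

```python
import time


class LoopGuard:
    """Stops a retry loop on iteration count, wall-clock budget, or a repeated query."""

    def __init__(self, max_iter: int = 3, budget_s: float = 10.0, clock=time.monotonic):
        self.max_iter = max_iter
        self.clock = clock
        self.deadline = clock() + budget_s
        self.seen: set[str] = set()
        self.iterations = 0

    def allow(self, query: str) -> tuple[bool, str]:
        """Return (ok, reason). Call once before each retrieval attempt."""
        normalized = " ".join(query.lower().split())
        if self.iterations >= self.max_iter:
            return False, "max_iter"
        if self.clock() >= self.deadline:
            return False, "budget"
        if normalized in self.seen:
            # The same query again means the planner has stopped making progress.
            return False, "repeat"
        self.seen.add(normalized)
        self.iterations += 1
        return True, "ok"


guard = LoopGuard(max_iter=3, budget_s=5.0)
first = guard.allow("indemnity clause vendor A")
second = guard.allow("Indemnity  clause vendor A")  # same query, different case and spacing
print(first, second)
```

The repeat check is the guard vendor diagrams tend to skip. It catches the unanswerable query early, before the iteration cap, because a planner with nothing new to try usually starts paraphrasing itself. Exact-match detection after normalization is the crudest version; it will miss paraphrases.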
Single-agent, multi-agent, and graph: the three agentic RAG patterns
The 2025 arXiv survey on agentic RAG (Singh et al., arXiv:2501.09136, the widely-cited reference taxonomy) sorts the field into three patterns, and the taxonomy is genuinely useful because each one has a different cost curve and a different failure mode. Pick the simplest pattern that covers your query distribution, because complexity you don't need is just latency and debugging surface you pay for.
| Pattern | How it works | Best for | Failure mode |
|---|---|---|---|
| Single-agent (router) | One controller decides whether/what/where to retrieve and grades the result. One loop. | Most workloads. Heterogeneous sources, query rewriting, basic self-correction. | The grader self-approves a weak answer; the single loop can't parallelize multi-part questions. |
| Multi-agent | A planner delegates sub-queries to specialist retrievers (one per source/domain) that run in parallel, then a synthesizer merges. | Questions that span several domains or sources answered independently and combined. | Coordination overhead and cost balloon; one slow agent stalls the synthesis; harder to trace. |
| Graph RAG | Retrieval traverses a knowledge graph, not just a flat vector index — follows relationships between entities across hops. | Multi-hop reasoning where the answer is a chain of linked facts, not one passage. | Graph construction and maintenance is expensive; stale graphs silently degrade answers. |
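The multi-agent row's failure mode ("one slow agent stalls the synthesis") is worth a sketch. The version below assumes each specialist is an async callable and gives every one a per-call timeout, so a slow specialist yields an empty result instead of blocking the rest. `fan_out`, `fast`, and `slow` are hypothetical names; only `asyncio.gather` and `asyncio.wait_for` are real standard-library calls.

```python
import asyncio
from typing import Awaitable, Callable

Retriever = Callable[[str], Awaitable[list[str]]]


async def fan_out(
    sub_queries: dict[str, str],        # specialist name -> sub-query from the planner
    specialists: dict[str, Retriever],  # specialist name -> async retriever
    timeout_s: float = 2.0,
) -> dict[str, list[str]]:
    """Run specialists in parallel; a slow one yields [] instead of stalling."""

    async def run_one(name: str, query: str) -> tuple[str, list[str]]:
        try:
            return name, await asyncio.wait_for(specialists[name](query), timeout_s)
        except asyncio.TimeoutError:
            return name, []  # the synthesizer sees the gap and can say so

    pairs = await asyncio.gather(*(run_one(n, q) for n, q in sub_queries.items()))
    return dict(pairs)


# Stub specialists: one answers immediately, one is too slow for the timeout.
async def fast(query: str) -> list[str]:
    return [f"legal: {query}"]


async def slow(query: str) -> list[str]:
    await asyncio.sleep(1.0)
    return [f"finance: {query}"]


results = asyncio.run(
    fan_out(
        {"legal": "indemnity clauses", "finance": "contract value"},
        {"legal": fast, "finance": slow},
        timeout_s=0.05,
    )
)
print(results)
```

Whether an empty result is acceptable depends on the question. For a comparison across sources, a missing side may mean the synthesizer should decline rather than answer from half the evidence.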
What it actually buys you: recall and accuracy, measured
The honest answer is that it depends entirely on the query, and you can measure it before you commit. The gain concentrates on multi-hop questions, the ones where the answer requires chaining two or more retrieved facts. On those, a single-pass retrieval often can't surface both facts at once, and the iterative loop is what closes the gap. On single-hop, FAQ-style questions where the answer lives in one chunk, the agentic loop adds latency and token cost for a marginal accuracy delta. Run RAGAS or a similar eval over your real query distribution and you'll see which bucket dominates. On one multi-hop-heavy internal corpus we measured in Q1 2026, the agentic loop reached 81% accuracy on RAGAS answer-correctness, up from a naive baseline near 70%; on a single-hop FAQ set the same loop barely moved the number and just burned tokens. Those figures are typical of our own runs, not a client deliverable, and your corpus is the only one whose numbers matter. At 2026 pricing, that eval costs a handful of dollars in API spend on a mid-tier model over a few-thousand-document corpus, so there's no excuse for shipping an agentic loop you never measured against the naive baseline.
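The per-bucket comparison is simple enough to sketch without any eval framework. The harness below is a crude stand-in for RAGAS, not its API: it scores by exact match, which is an assumption that only suits short factual answers. `accuracy_by_bucket` and `compare` are hypothetical names, and the four cases and both answer tables are made-up stubs, not measurements.

```python
from collections import defaultdict
from typing import Callable

# Each eval case: (question, expected answer, bucket label).
Case = tuple[str, str, str]


def accuracy_by_bucket(cases: list[Case], pipeline: Callable[[str], str]) -> dict[str, float]:
    """Fraction of exact-match answers per bucket (a crude stand-in for a real metric)."""
    hits: dict[str, int] = defaultdict(int)
    totals: dict[str, int] = defaultdict(int)
    for question, expected, bucket in cases:
        totals[bucket] += 1
        if pipeline(question).strip().lower() == expected.strip().lower():
            hits[bucket] += 1
    return {b: hits[b] / totals[b] for b in totals}


def compare(cases: list[Case], naive, agentic) -> dict[str, float]:
    """Per-bucket accuracy delta: positive means the agentic path helped."""
    base = accuracy_by_bucket(cases, naive)
    loop = accuracy_by_bucket(cases, agentic)
    return {b: round(loop[b] - base[b], 3) for b in base}


# Made-up cases and stub pipelines, for illustration only.
cases = [
    ("refund window?", "14 days", "single-hop"),
    ("support hours?", "9-5", "single-hop"),
    ("which vendor has the higher cap?", "vendor b", "multi-hop"),
    ("who signed the latest amendment?", "j. doe", "multi-hop"),
]
naive_answers = {
    "refund window?": "14 days",
    "support hours?": "9-5",
    "which vendor has the higher cap?": "vendor a",
}
agentic_answers = {
    **naive_answers,
    "which vendor has the higher cap?": "vendor b",
    "who signed the latest amendment?": "J. Doe",
}
delta = compare(
    cases,
    lambda q: naive_answers.get(q, ""),
    lambda q: agentic_answers.get(q, ""),
)
print(delta)
```

The bucket label is the part that earns its keep. A single aggregate accuracy number hides exactly the split that decides whether the loop is worth its cost.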
The tax: latency, token cost, and the LLM-call budget
Here's the part the AI Overview reduces to a bullet point that says "slower and more expensive." Put numbers on it. A naive RAG query is one retrieval plus one generation, a single LLM round-trip. An agentic query is a planning call, one or more retrieval-decision calls, a grading call, and a generation call. That's commonly three to eight LLM round-trips where naive RAG made one. Each round-trip is latency you can't parallelize away (the grade depends on the retrieve, the next retrieve depends on the grade) and tokens you pay for. On a frontier model the per-query cost difference is real, and latency stops being a single number you can quote and becomes a distribution, with a long tail (and a much worse p95) from the queries that loop.
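A back-of-the-envelope model shows how the tail forms. Every number below is a placeholder we picked for illustration (300 ms for a short planning, decision, or grading call, 800 ms for generation, 100 ms per retrieval, and a made-up mix of loop counts), and the model assumes one decision call and one grading call per loop iteration. Swap in figures from your own traces.

```python
import math


def percentile(values: list[int], pct: int) -> int:
    """Nearest-rank percentile: smallest value with at least pct% of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def naive_latency_ms(generate_ms: int = 800, retrieval_ms: int = 100) -> int:
    """One retrieval, one generation."""
    return retrieval_ms + generate_ms


def agentic_latency_ms(
    loops: int, short_call_ms: int = 300, generate_ms: int = 800, retrieval_ms: int = 100
) -> int:
    """Plan once, then (decide, retrieve, grade) per loop, then generate. All sequential."""
    per_loop = short_call_ms + retrieval_ms + short_call_ms
    return short_call_ms + loops * per_loop + generate_ms


# Hypothetical traffic: most queries converge in one loop, a few take two or three.
loop_counts = [1] * 14 + [2] * 4 + [3] * 2
latencies = [agentic_latency_ms(n) for n in loop_counts]

naive_ms = naive_latency_ms()
p50 = percentile(latencies, 50)
p95 = percentile(latencies, 95)
print(naive_ms, p50, p95)
```

Under these assumptions the median query costs about twice the naive latency and the p95 lands past three times, driven entirely by the minority of queries that loop. That is why the iteration cap is a latency control, not just a safety feature.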
Two levers cut the tax without abandoning the architecture. First, route the cheap calls to a cheap model. The grader and the planner don't need a frontier model; Claude Haiku 4.5 or GPT-5 mini grades sufficiency fine, and you reserve the frontier model for the final generation. Second, set a tight iteration cap and a latency budget so the long tail can't run away. If you can't bound the loop and can't fall back to single-pass retrieval when the budget's blown, you've built something that fails slow and expensive. Make it fail fast and cheap instead.
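The first lever is easy to put into arithmetic. The sketch below prices each call at the tier assigned to its role. The prices and token counts are placeholders, not any provider's real rates, and it lumps input and output tokens together for simplicity; `PRICE_PER_MTOK`, `CALLS`, and `query_cost` are hypothetical names.

```python
# Placeholder prices in USD per million tokens; substitute your provider's real rates.
PRICE_PER_MTOK = {"frontier": 10.0, "cheap": 1.0}

# Hypothetical token counts (input + output) for one agentic query with two loops.
CALLS = [
    ("plan", 1_000),
    ("grade", 3_000),
    ("grade", 3_000),
    ("generate", 4_000),
]


def query_cost(calls: list[tuple[str, int]], tier_for_role: dict[str, str]) -> float:
    """Sum token cost per call, pricing each role at its assigned model tier."""
    total = 0.0
    for role, tokens in calls:
        total += tokens / 1_000_000 * PRICE_PER_MTOK[tier_for_role[role]]
    return round(total, 6)


all_frontier = {"plan": "frontier", "grade": "frontier", "generate": "frontier"}
tiered = {"plan": "cheap", "grade": "cheap", "generate": "frontier"}

cost_all_frontier = query_cost(CALLS, all_frontier)
cost_tiered = query_cost(CALLS, tiered)
print(cost_all_frontier, cost_tiered)
```

With these made-up numbers the tiered setup costs less than half as much per query, because grading calls carry the retrieved chunks as input and so dominate the token count. The ratio on your workload depends on how large those chunks are and how often the loop repeats.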
Agentic RAG frameworks: LangGraph, LlamaIndex, CrewAI, AutoGen, Haystack
There's no single best framework; there's a best fit for how much control you want over the loop versus how much you want the framework to handle. Here's our honest read on the five agentic RAG frameworks we see most, including where each one is the wrong choice. DSPy and Semantic Kernel are worth a mention too: DSPy if you want to optimize the prompts in the loop programmatically, Semantic Kernel if you're in a .NET shop.
| Framework | Control over the loop | Best when | Where it's the wrong choice |
|---|---|---|---|
| LangGraph | Full — you wire the state graph | You want explicit control over nodes, edges, and the loop guard | You want a retriever working in ten lines — the graph is overhead for simple cases |
| LlamaIndex | Medium — retrieval-first abstractions | Retrieval and indexing are the hard part and you want them handled well | You need fine-grained multi-agent orchestration — that's not its core |
| CrewAI | Low — role-based, opinionated | Multi-agent roles map cleanly to your problem (researcher, writer, checker) | You need tight control over each retrieval decision — the abstraction hides it |
| AutoGen | Medium — conversational agents | Group-chat patterns where agents debate and refine an answer | Latency-sensitive paths — multi-turn agent chat is slow and costly |
| Haystack | High — explicit pipelines | Production search-and-retrieve pipelines with strong document handling | You want a quick prototype — pipeline config is heavier than a tool-binding call |
When to use agentic RAG, and when naive RAG is the right answer
The cleanest way to decide when to use agentic RAG is to not decide globally; decide per query class. A lot of production systems route: FAQ-shaped queries take the cheap single-pass path, and only the queries that look multi-hop or ambiguous get escalated to the agentic loop. You get the recall gain where it matters and pay the tax only on the queries that earn it. Here's that router in two stacks, Python and TypeScript, where a cheap classifier picks the path before any retrieval happens.
```python
def route(query: str) -> str:
    # Cheap classifier call (Haiku 4.5 / GPT-5 mini) BEFORE any retrieval.
    klass = classifier.invoke(
        f"Classify as 'simple' (one fact, one source) or "
        f"'complex' (multi-hop, ambiguous, multi-source): {query}"
    ).content.strip()
    if klass == "simple":
        # Naive RAG: one retrieve, one generate. One LLM call, low latency.
        return naive_rag(query)
    # Agentic RAG: bounded retrieve-grade-generate loop.
    return agentic_rag(query, max_iter=3)
```

```typescript
async function route(query: string): Promise<string> {
  // Cheap classifier call before any retrieval.
  const klass = (await classifier.invoke(
    `Classify as 'simple' (one fact, one source) or ` +
    `'complex' (multi-hop, ambiguous, multi-source): ${query}`
  )).content.trim();
  if (klass === "simple") {
    // Naive RAG: one retrieve, one generate.
    return naiveRag(query);
  }
  // Agentic RAG: bounded loop.
  return agenticRag(query, { maxIter: 3 });
}
```

Use agentic RAG when your query distribution is genuinely multi-hop, your sources are heterogeneous and need routing, or your naive pipeline is provably hitting a recall ceiling you've measured. Use naive RAG, and stop there, when queries are well-formed, the answer lives in one place, and latency is in a user-facing path where a 2-5× p95 hit is unacceptable. Most chatbots over a single clean corpus are the second case. Adding agents there is the textbook example of building a control system to solve a problem a reranker would've handled.
Fix retrieval before you add agents: the hype the Reddit thread got right
The Reddit thread that ranks fourth for this query, the one titled "Agentic RAG is mostly hype," is right about the thing that matters most, and it's worth taking seriously rather than dismissing as a hot take. The claim is that most people don't have a RAG problem; they have a garbage-in, garbage-out problem, and agentic RAG just adds fancy plumbing to a clogged pipe. In practice that's exactly what we see. An agentic loop wrapped around bad chunking and a weak embedding model spends more tokens arriving at the same wrong answer, slower. The loop can't retrieve a good chunk that was never indexed well in the first place.
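A cheap way to check whether the pipe is clogged is recall@k on the retriever alone, with no LLM in the path: for each question, does the chunk that actually answers it show up in the top k? The sketch below assumes you have a small labeled set of (question, gold chunk id) pairs. `recall_at_k`, the chunk ids, and the stub `search` are all hypothetical.

```python
from typing import Callable


def recall_at_k(
    eval_set: list[tuple[str, str]],           # (question, id of the chunk that answers it)
    search: Callable[[str, int], list[str]],   # (question, k) -> ranked list of chunk ids
    k: int = 5,
) -> float:
    """Share of questions whose gold chunk appears in the top-k results."""
    if not eval_set:
        raise ValueError("eval_set is empty")
    found = sum(
        1 for question, gold_id in eval_set if gold_id in search(question, k)[:k]
    )
    return found / len(eval_set)


# Stub index: fixed rankings per question, for illustration only.
ranked = {
    "refund window?": ["c7", "c1", "c3"],
    "indemnity cap?": ["c9", "c4", "c2"],
}
eval_set = [("refund window?", "c1"), ("indemnity cap?", "c8")]
search = lambda question, k: ranked[question]

recall_3 = recall_at_k(eval_set, search, k=3)
recall_1 = recall_at_k(eval_set, search, k=1)
print(recall_3, recall_1)
```

If recall@k is low, no amount of looping will help: the gold chunk for the second question never appears in the ranking at all, so a retry loop would only re-query an index that cannot return it. That points at chunking, the embedding model, or a reranker, not at an agent.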
This is the same discipline as any infrastructure decision: measure the cheap fix before you adopt the expensive one. If you're not sure whether your team's retrieval foundation is solid enough to build on, an honest AI readiness assessment of the data and eval setup will tell you more than another framework will, because agentic RAG can't compensate for a corpus that was never indexed well.
FAQ: agentic RAG questions, answered straight
What is the difference between agentic RAG and RAG?
Naive RAG retrieves once, unconditionally, before it generates an answer: a fixed pipeline. Agentic RAG turns retrieval into a tool the model decides whether to call, what to search for, and whether the result is good enough, inside a reasoning loop. The difference is control flow: one runs the pipeline once; the other runs it under a controller that can rewrite the query, route to different sources, and retry. That buys multi-hop reasoning and self-correction at the cost of 3-8× the LLM calls.
When should I use agentic RAG?
Use it when your query distribution is genuinely multi-hop (the answer requires chaining several retrieved facts), when your sources are heterogeneous and need routing, or when a measured naive baseline is hitting a recall ceiling. Don't use it for well-formed, single-hop questions over one clean corpus in a latency-sensitive path, where that's a 2-5× p95 latency hit for a marginal accuracy gain. A common production pattern is to route per query: cheap single-pass for simple queries, the agentic loop only for the complex ones.
How much slower and more expensive is agentic RAG?
Budget 3-8 LLM round-trips per query where naive RAG makes one: a planning call, retrieval-decision calls, a grading call, and the final generation. The calls are sequential (each depends on the last), so they add real latency, and latency becomes a distribution whose p95 is set by the long tail of queries that loop. Cut the tax by routing the grader and planner to a cheap model (Haiku 4.5 / GPT-5 mini class) and reserving the frontier model for the final answer, plus a hard iteration cap.
What are the best agentic RAG frameworks?
It depends on how much control you want over the loop. LangGraph gives you explicit control over the state graph and the loop guard. LlamaIndex handles retrieval and indexing well if those are your hard part. CrewAI suits role-based multi-agent setups. AutoGen fits group-chat agent patterns. Haystack is strong for production search pipelines. There's no single winner; pick the simplest one that covers your query distribution, because complexity you don't need is latency and debugging surface you pay for.
Is agentic RAG just hype?
Partly, and the popular Reddit critique is right about the core point: most "RAG problems" are garbage-in, garbage-out problems. An agentic loop wrapped around bad chunking and a weak embedding model just spends more tokens reaching the same wrong answer. Fix retrieval first (chunking, embedding choice, a reranker) and re-measure. If clean single-pass retrieval still hits a ceiling on real multi-hop questions, agentic RAG is buying you something real. Before that, it's buying you a bigger bill.
What is retrieval-as-a-tool?
It's the architectural move that makes RAG agentic. Instead of wiring retrieval into the request handler so it always runs, you register the retriever as a tool the model can choose to call, like a calculator or a SQL function. The model decides whether retrieving helps, rewrites the query if it's vague, and can call the tool again with a better query if the first results fall short. Query planning, source routing, and the retry loop all fall out of that one decision point. LangGraph and LlamaIndex both build their agentic RAG support around it.
Get agentic RAG right — or learn you don't need it yet
We build retrieval systems with evaluation built in, measure the naive baseline before adding agentic loops, and tell you honestly when single-pass RAG is the right answer. No platform to sell, just the architecture that fits your workload.