RAG and Agentic System Design: Architect AI Systems Correctly

Published

11 Jun 2026

Author

Khusbu Basnet


Most retrieval-augmented generation systems pass their first two demos and fail on the third type of query. The lookup-style questions work. Then someone asks a comparative question that spans three documents, the retrieval surfaces irrelevant chunks, the model hallucinates a confident answer, and the credibility the system spent six weeks earning evaporates in a single meeting. The engineering team is rarely surprised. The chunking strategy was chosen on a Wednesday afternoon, the embedding model was the framework default, and nobody is monitoring hallucination rate against a benchmark because no benchmark exists.

The same pattern plays out in agentic systems. A multi-step agent works in scripted scenarios, then loops on an ambiguous instruction, makes a tool call it shouldn't, and writes data the team didn't expect it to touch. The result is a refund issued twice, an email sent to the wrong list, or a record updated in a way that takes the data team a week to unwind. The architecture had no boundary on what tools the agent could reach, no checkpoint where a human had to confirm, no audit trail beyond the underlying log line.

RAG architecture and agentic system design is the discipline of making those choices deliberately, upfront, before the system ships. The choices are technical: chunking strategy, embedding model, vector store, retrieval algorithm, re-ranking, agent decision boundaries, tool-use constraints, and human override points. Most teams treat them as defaults rather than as design decisions with downstream consequences measured in weeks of rework. This article walks through the choices that matter, the failure modes most teams hit, and the patterns that hold under production load.

Why The Wrong Architecture Costs More Than The Wrong Model

The dominant narrative around AI failure points at the model. Models are the visible layer, so they take the blame. The architecture underneath is what actually determines whether the system holds.

A system with strong architecture and an average model can be upgraded to a better model in an afternoon — model swap is a configuration change. A system with weak architecture and the best model on the market degrades the moment the use case widens beyond the original scope. The retrieval pipeline that returns adequate context for the first 500 documents fails when the corpus reaches 50,000. The agent that handled three tools cleanly produces unpredictable behaviour when a fourth is added. The chunking strategy that worked on FAQ-style content collapses on long-form policy documents.

The cost compounds across three axes. Rework cost — replacing a flat chunking strategy with hierarchical chunking and re-indexing a corpus is a multi-week exercise that should have been a sprint-one decision. Operational cost — a retrieval pipeline that hits the model with five irrelevant chunks per query pays the token bill for those chunks every time, and at production volume that bill compounds quickly. And credibility cost — leadership conversations after a visible AI failure are slower, more sceptical, and harder to fund. EB Pearls treats RAG and agentic system design as a P03 architecture decision precisely because the choices are difficult to reverse later and routine to make well upfront.

What RAG and Agentic System Design Actually Is

Built to Last™ 2.0 places RAG and agentic system design in the Right Architecture pillar, alongside the Architecture Session and the Three-Horizon Test. RAG architecture is the design of how a system retrieves relevant context from a knowledge base and presents it to a model. Agentic system design is the design of how a model takes actions — calling tools, navigating workflows, making decisions — under defined constraints. Many production systems combine both. The architecture decisions split into four areas.

The retrieval pipeline

The retrieval pipeline is the part of a RAG system that turns a user query into a set of context chunks passed to the model. Four decisions inside it determine whether the system holds: chunking strategy, embedding selection, vector search configuration, and re-ranking.

Chunking strategy is the most consequential decision and the most commonly defaulted. The naive approach, splitting documents into fixed-length chunks of, say, 500 tokens, works on the simplest content and fails everywhere else. A clause that explains a definition is separated from the clause that uses it. A table's header is in one chunk and its rows in another. The model gets fragments and has to reconstruct meaning, which is exactly the load the retrieval pipeline was supposed to remove. Hierarchical chunking (chunks at the section, paragraph, and sentence level, linked so retrieval can fetch the parent context when a child matches) preserves cross-paragraph relationships. Semantic chunking, which uses embedding similarity to find natural breakpoints, tends to hold up better than fixed-length splitting on long-form documents.
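The parent-child linking described above can be sketched in a few lines. This is a minimal illustration, not a production splitter: the `Chunk` type, the `build_hierarchy` and `expand_to_parent` helpers, and the delimiter conventions (a `---` line between sections, blank lines between paragraphs) are all assumptions made for the example.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Chunk:
    chunk_id: int
    level: str                # "section", "paragraph", or "sentence"
    text: str
    parent_id: Optional[int]  # None for top-level sections


def build_hierarchy(document: str) -> list[Chunk]:
    """Split a document into linked section / paragraph / sentence chunks.

    Assumed conventions: sections are separated by a line containing only
    '---', paragraphs by blank lines, sentences by . ! or ? plus whitespace.
    """
    chunks: list[Chunk] = []

    def add(level: str, text: str, parent_id: Optional[int]) -> int:
        chunk = Chunk(len(chunks), level, text, parent_id)
        chunks.append(chunk)
        return chunk.chunk_id

    for section in re.split(r"\n---\n", document):
        section = section.strip()
        if not section:
            continue
        section_id = add("section", section, None)
        for paragraph in re.split(r"\n\s*\n", section):
            paragraph = paragraph.strip()
            if not paragraph:
                continue
            paragraph_id = add("paragraph", paragraph, section_id)
            for sentence in re.split(r"(?<=[.!?])\s+", paragraph):
                if sentence.strip():
                    add("sentence", sentence.strip(), paragraph_id)
    return chunks


def expand_to_parent(chunks: list[Chunk], matched_id: int) -> Chunk:
    """Return the parent of a matched chunk, or the chunk itself at the top level."""
    matched = chunks[matched_id]
    if matched.parent_id is None:
        return matched
    return chunks[matched.parent_id]


doc = (
    "A transfer is a move between subsidiaries. Transfers keep accrued leave.\n\n"
    "Leave is paid out on exit.\n---\nContractors are out of scope."
)
chunks = build_hierarchy(doc)
# Suppose retrieval matched the sentence that uses the term "transfer".
hit = next(c for c in chunks if c.level == "sentence" and "accrued" in c.text)
# Fetching the parent paragraph brings the definition along with it.
context = expand_to_parent(chunks, hit.chunk_id)
```

Here the matched sentence uses a term whose definition sits in the sentence before it; expanding to the parent paragraph hands both to the model together. A real splitter would likely take section boundaries from document structure rather than a delimiter line.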

Embedding selection determines what the system treats as similar. A general-purpose embedding model trained on web text may underperform on legal, medical, or technical corpora where domain terminology carries specific meaning. The decision criteria are concrete: retrieval quality on a representative test set, latency at production query volume, cost per million embeddings, and whether the model can be self-hosted if data sovereignty requires it. Benchmarking two or three candidates against a labelled evaluation set is a one-day exercise that prevents months of low retrieval quality.
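A benchmark of this kind reduces to scoring each candidate on the same labelled set. The sketch below computes recall@k for two embedders; the `recall_at_k` helper and the vocabulary-count "embedders" are toy stand-ins invented for illustration, and a real comparison would plug in actual embedding models behind the same `Embedder` signature.

```python
import math
from typing import Callable

Embedder = Callable[[str], list[float]]


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def recall_at_k(
    embed: Embedder,
    corpus: dict[str, str],
    labelled: list[tuple[str, set[str]]],
    k: int,
) -> float:
    """Mean fraction of each query's relevant documents found in the top k."""
    doc_vectors = {doc_id: embed(text) for doc_id, text in corpus.items()}
    total = 0.0
    for query, relevant_ids in labelled:
        query_vector = embed(query)
        ranked = sorted(
            doc_vectors,
            key=lambda d: cosine(query_vector, doc_vectors[d]),
            reverse=True,
        )
        total += len(set(ranked[:k]) & relevant_ids) / len(relevant_ids)
    return total / len(labelled)


def vocab_embedder(vocab: list[str]) -> Embedder:
    """Toy embedder: one dimension per vocabulary word (illustration only)."""
    def embed(text: str) -> list[float]:
        tokens = text.lower().split()
        return [float(tokens.count(word)) for word in vocab]
    return embed


general = ["leave", "travel", "staff", "expenses"]
domain = general + ["secondment", "entitlement", "subsidiaries"]

corpus = {
    "leave": "annual leave entitlement for category x staff",
    "transfer": "secondment and transfer between subsidiaries",
    "expenses": "travel expenses and claims",
}
labelled = [
    ("leave entitlement", {"leave"}),
    ("secondment rules", {"transfer"}),
]

scores = {
    "general": recall_at_k(vocab_embedder(general), corpus, labelled, k=1),
    "domain": recall_at_k(vocab_embedder(domain), corpus, labelled, k=1),
}
```

The "general" stand-in has no dimension for the domain term "secondment", so it cannot rank the transfer document first. That is the kind of gap a labelled evaluation set surfaces before launch.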

Vector search configuration covers the index type, similarity metric, and recall-precision trade-off. An approximate nearest neighbour index returns top results quickly but can miss relevant chunks at the edges of the embedding space. The right configuration depends on corpus size, query volume, and how much latency the use case can absorb — a tuning exercise informed by measurement, not a binary choice.
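The recall side of that trade-off can be measured by comparing an approximate index against exact brute-force search on the same queries. The sketch below uses a deliberately simple partitioned index, where `nprobe` is the number of partitions searched; `PartitionedIndex` and the random test vectors are assumptions for illustration, not the behaviour of any particular vector store.

```python
import math
import random


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def normalise(v: list[float]) -> list[float]:
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def exact_top_k(query: list[float], vectors: list[list[float]], k: int) -> list[int]:
    """Brute-force search: the ground truth the approximate index is judged against."""
    order = sorted(range(len(vectors)), key=lambda i: dot(query, vectors[i]), reverse=True)
    return order[:k]


class PartitionedIndex:
    """Toy approximate index: each vector lives in its nearest centroid's partition."""

    def __init__(self, vectors: list[list[float]], centroids: list[list[float]]):
        self.vectors = vectors
        self.centroids = centroids
        self.partitions: list[list[int]] = [[] for _ in centroids]
        for i, vector in enumerate(vectors):
            self.partitions[self._nearest(vector, 1)[0]].append(i)

    def _nearest(self, vector: list[float], n: int) -> list[int]:
        order = sorted(
            range(len(self.centroids)),
            key=lambda c: dot(vector, self.centroids[c]),
            reverse=True,
        )
        return order[:n]

    def search(self, query: list[float], k: int, nprobe: int) -> list[int]:
        # Only the nprobe closest partitions are scanned: faster, but lossy.
        candidates = [i for c in self._nearest(query, nprobe) for i in self.partitions[c]]
        return sorted(candidates, key=lambda i: dot(query, self.vectors[i]), reverse=True)[:k]


random.seed(7)
dim, k = 16, 5
vectors = [normalise([random.gauss(0, 1) for _ in range(dim)]) for _ in range(400)]
centroids = random.sample(vectors, 8)
index = PartitionedIndex(vectors, centroids)
queries = [normalise([random.gauss(0, 1) for _ in range(dim)]) for _ in range(20)]


def mean_recall(nprobe: int) -> float:
    total = 0.0
    for query in queries:
        exact = set(exact_top_k(query, vectors, k))
        approx = set(index.search(query, k, nprobe))
        total += len(exact & approx) / k
    return total / len(queries)


recall_narrow = mean_recall(nprobe=1)
recall_full = mean_recall(nprobe=len(centroids))
```

Scanning every partition reproduces the exact result, while scanning one partition trades some recall for fewer comparisons. Plotting recall against latency across `nprobe` values is one way to pick the setting by measurement.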

Re-ranking is the step most retrieval pipelines skip. After the vector search returns a candidate set — typically 20 to 50 chunks — a re-ranker scores them against the query using a more expensive model, and the top three to five are passed to the generation model. Re-ranking corrects the precision problem that vector search alone produces, particularly on comparative or multi-part queries. Adding a re-ranker to a struggling RAG system is the single highest-leverage change available; it routinely lifts answer quality more than swapping the foundation model.
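The two-stage shape is straightforward to sketch. In the example below, `retrieve_then_rerank` is a hypothetical helper and both scorers are toy stand-ins: a term-frequency count plays the role of vector search, and query-term coverage plays the role of the more expensive re-ranking model.

```python
from typing import Callable

Scorer = Callable[[str, str], float]


def retrieve_then_rerank(
    query: str,
    chunks: list[str],
    first_stage: Scorer,
    reranker: Scorer,
    candidates: int = 20,
    final: int = 3,
) -> list[str]:
    """Cheap scorer builds the shortlist; expensive scorer orders it."""
    shortlist = sorted(chunks, key=lambda c: first_stage(query, c), reverse=True)[:candidates]
    return sorted(shortlist, key=lambda c: reranker(query, c), reverse=True)[:final]


def term_frequency(query: str, chunk: str) -> float:
    """Stand-in for vector search: counts every occurrence of any query term."""
    terms = set(query.lower().split())
    return float(sum(1 for token in chunk.lower().split() if token in terms))


def query_coverage(query: str, chunk: str) -> float:
    """Stand-in for a re-ranking model: fraction of distinct query terms covered."""
    terms = set(query.lower().split())
    return len(terms & set(chunk.lower().split())) / len(terms)


query = "leave entitlement after transfer"
chunks = [
    "leave leave leave leave leave forms",
    "after a transfer the leave entitlement carries over",
    "travel expenses policy",
]

first_stage_top = max(chunks, key=lambda c: term_frequency(query, c))
result = retrieve_then_rerank(
    query, chunks, term_frequency, query_coverage, candidates=2, final=1
)
```

The first stage prefers the chunk that repeats one query term many times; the second stage promotes the chunk that actually addresses the whole question. The shortlist size bounds how many calls the expensive scorer has to make.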

Agent design and decision boundaries

Agentic systems extend the RAG pattern with action. The model plans, calls tools, observes results, and continues. The architecture choices are about constraint. Three matter most: tool-use scope (see the OWASP Top 10 for Large Language Model Applications), planning loop structure, and termination conditions.

Tool-use scope is the inventory of actions the agent can take. A narrow scope — three tools, all read-only — is easy to reason about and hard to break. A wide scope — fifteen tools, several with write access — is powerful and a security and reliability surface area in equal measure. The design choice is to define the smallest tool set the use case requires, separate reads from writes, and route writes through a confirmation step or a constrained API rather than direct database access.
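One way to make that scope explicit is a registry that is the only path to any tool and that refuses unconfirmed writes. The sketch below is illustrative: `Tool`, `ToolRegistry`, the order data, and the `approved` list standing in for a human confirmation step are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Tool:
    name: str
    func: Callable[..., object]
    writes: bool  # True if the tool changes state anywhere


class ToolRegistry:
    """The agent can only reach tools registered here."""

    def __init__(self, confirm: Callable[[str, dict], bool]):
        self._tools: dict[str, Tool] = {}
        self._confirm = confirm

    def register(self, tool: Tool) -> None:
        self._tools[tool.name] = tool

    def call(self, name: str, **kwargs):
        if name not in self._tools:
            raise PermissionError(f"tool not in scope: {name}")
        tool = self._tools[name]
        # Reads run directly; writes must pass the confirmation step first.
        if tool.writes and not self._confirm(name, kwargs):
            raise PermissionError(f"write not confirmed: {name}")
        return tool.func(**kwargs)


orders = {1042: {"total": 80.0, "refunded": False}}


def lookup_order(order_id: int) -> dict:
    return orders[order_id]


def issue_refund(order_id: int) -> str:
    orders[order_id]["refunded"] = True
    return "refunded"


approved: list[tuple[str, int]] = []  # stands in for a human approval record
registry = ToolRegistry(
    confirm=lambda name, args: (name, args["order_id"]) in approved
)
registry.register(Tool("lookup_order", lookup_order, writes=False))
registry.register(Tool("issue_refund", issue_refund, writes=True))

total = registry.call("lookup_order", order_id=1042)["total"]
try:
    registry.call("issue_refund", order_id=1042)
    blocked = False
except PermissionError:
    blocked = True
```

The read succeeds immediately, the refund is refused until an approval exists, and a tool that was never registered cannot be called at all.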

Planning loop structure determines how the agent decides what to do next. ReAct-style loops, where the model alternates reasoning and acting, work for short chains. Plan-then-execute structures, where the model produces a plan first and the system executes the steps with checkpoints, work better for longer chains. Long loops accumulate error — small misinterpretations compound into wrong actions by step five.

Termination conditions stop agents from looping forever. A hard step limit, a confidence threshold below which the agent stops and escalates, and a defined error path together prevent the failure mode where an agent rephrases the same wrong sub-goal twenty times and burns through a budget. Production agents need all three.
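The three conditions fit naturally into the loop that drives the agent. The sketch below assumes a hypothetical `next_step` callable that wraps the model and reports a confidence score with each step. How that score is produced is outside the example, and the default limits are placeholders, not recommendations.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class Outcome(Enum):
    COMPLETED = "completed"
    STEP_LIMIT = "step_limit"
    LOW_CONFIDENCE = "low_confidence"
    ERROR = "error"


@dataclass
class Step:
    action: str
    confidence: float
    done: bool = False


def run_agent(
    next_step: Callable[[list[Step]], Step],
    max_steps: int = 8,
    min_confidence: float = 0.6,
) -> tuple[Outcome, list[Step]]:
    history: list[Step] = []
    for _ in range(max_steps):               # 1. hard step limit
        try:
            step = next_step(history)
        except Exception:                    # 3. defined error path (broad on purpose)
            return Outcome.ERROR, history
        if step.confidence < min_confidence:  # 2. confidence threshold
            return Outcome.LOW_CONFIDENCE, history
        history.append(step)
        if step.done:
            return Outcome.COMPLETED, history
    return Outcome.STEP_LIMIT, history


def stuck_planner(history: list[Step]) -> Step:
    # Simulates an agent that keeps rephrasing the same sub-goal.
    return Step(action="search again", confidence=0.9)


outcome, history = run_agent(stuck_planner, max_steps=5)
```

Every exit returns a named outcome, so the caller can route step-limit, low-confidence, and error cases to different handlers instead of treating them all as a generic failure.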

Human override and audit

The two decisions that separate prototype-grade from production-grade systems are where humans override and what gets logged. Human override is not a single feature; it is a set of choices about which decisions the system makes autonomously, which it surfaces for confirmation, and which it escalates. A common pattern: read operations autonomous, low-impact writes autonomous with logging, high-impact writes require explicit confirmation, and edge-case detection routes to a queue. The pattern only works when the categories are defined at design time, not when the on-call engineer is paged at 2am.
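Defining the categories at design time can be as simple as one routing function that every proposed action passes through. The sketch below encodes the common pattern described above. The `Route` names and the three boolean inputs are assumptions, as is the choice to let edge-case detection take precedence over the other rules.

```python
from enum import Enum


class Route(Enum):
    AUTONOMOUS = "autonomous"
    AUTONOMOUS_LOGGED = "autonomous_with_logging"
    CONFIRM = "explicit_confirmation"
    QUEUE = "escalation_queue"


def route_action(is_write: bool, high_impact: bool, edge_case: bool) -> Route:
    """Decide how much autonomy a proposed action gets."""
    if edge_case:
        return Route.QUEUE              # anything unusual goes to a human queue
    if not is_write:
        return Route.AUTONOMOUS         # reads run without intervention
    if high_impact:
        return Route.CONFIRM            # high-impact writes wait for a person
    return Route.AUTONOMOUS_LOGGED      # low-impact writes run, with logging


refund_route = route_action(is_write=True, high_impact=True, edge_case=False)
```

How an action is classified as high-impact or as an edge case is the harder design question, and it belongs in the same design-time conversation.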

Audit is the trail the system leaves behind every action: query, retrieval result, tool calls, model output, override if any, final action. It is invisible until something goes wrong, at which point it is the difference between a five-minute investigation and a five-day reconstruction. The EU AI Act imposes record-keeping obligations on high-risk systems, and ISO/IEC 42001 and the NIST AI Risk Management Framework, though voluntary, treat traceability and accountability as core expectations. For higher-risk use cases the architecture has to support audit trails from the start.
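A structured record per action is enough to make that trail queryable later. The sketch below mirrors the fields listed above. The `AuditRecord` shape, the field names, and the in-memory `sink` (which stands in for an append-only log) are assumptions for illustration.

```python
import io
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Optional, TextIO


@dataclass
class AuditRecord:
    query: str
    retrieved_chunk_ids: list[str]
    tool_calls: list[dict]
    model_output: str
    final_action: str
    override: Optional[str] = None  # who or what overrode the system, if anyone
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def write_audit(record: AuditRecord, sink: TextIO) -> None:
    """Write one record as a single JSON line."""
    sink.write(json.dumps(asdict(record), sort_keys=True) + "\n")


sink = io.StringIO()  # stands in for an append-only log file or stream
write_audit(
    AuditRecord(
        query="refund order 1042",
        retrieved_chunk_ids=["refund-policy#3"],
        tool_calls=[{"tool": "issue_refund", "args": {"order_id": 1042}}],
        model_output="Refund is within policy.",
        final_action="refund_issued",
        override="approved_by:ops-queue",
    ),
    sink,
)
logged = json.loads(sink.getvalue().splitlines()[0])
```

One line per action keeps the log easy to ship to whatever storage and retention policy the use case requires.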

Failure modes even when the component is present

A team can make all four sets of decisions and still ship a fragile system. Three failure modes recur. The test set used to measure retrieval quality is too narrow — it covers the queries the team thought to write, not the queries users actually ask. The agent's tool descriptions are inconsistent or ambiguous, so the model picks the wrong tool under load. The human override is technically present but operationally absent — the queue is never staffed, so escalations stall and the team learns about the failures from customers. The component being present is necessary; the component being operational is the harder problem.

How to Implement RAG and Agentic Design Without Stalling Delivery

Implementing the discipline does not require pausing the build. It requires sequencing the right decisions before code locks. A realistic four-to-six-week implementation runs in parallel with the early sprints of an AI engagement.

Week one is corpus and use-case characterisation. The team catalogues the document types in scope — fixed-format records, unstructured policy documents, transcripts, structured data — and the query types the system must handle. The query types matter more than the corpus; they tell you whether you need re-ranking, whether single-document retrieval is enough, and whether the agent needs tool use at all. Many systems that started as agents could have been simpler retrieval systems if this step had been run. The output is a written characterisation that goes into the Locked Scope Document™.

Week two is the evaluation set. A team that cannot agree on what "working" means cannot build a system that works. The evaluation set is a labelled collection of queries with the expected retrieval results and, where applicable, the expected agent actions. It does not need to be large — 50 to 200 well-chosen queries cover most use cases — but it needs to span the query types catalogued in week one. The set runs against every architecture candidate, and against every release once the system ships. Without it, system quality is a matter of demo enthusiasm.
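In code, an evaluation set of this kind is little more than labelled cases plus two checks: does it span the catalogued query types, and how does retrieval score on each. The sketch below uses hypothetical names (`EvalCase`, `coverage_gaps`, `hit_rate_by_type`) and a fake retriever that always returns the same single document.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    query: str
    query_type: str               # e.g. "lookup", "comparative", "conditional"
    expected_ids: frozenset[str]  # documents that must be retrieved


def coverage_gaps(cases: list[EvalCase], required_types: set[str]) -> set[str]:
    """Query types catalogued in week one that have no evaluation case yet."""
    return required_types - {case.query_type for case in cases}


def hit_rate_by_type(
    cases: list[EvalCase], retrieve: Callable[[str], list[str]]
) -> dict[str, float]:
    """Share of cases per query type where every expected document came back."""
    hits: dict[str, list[bool]] = defaultdict(list)
    for case in cases:
        retrieved = set(retrieve(case.query))
        hits[case.query_type].append(case.expected_ids <= retrieved)
    return {query_type: sum(v) / len(v) for query_type, v in hits.items()}


cases = [
    EvalCase(
        "what is the leave entitlement for staff in category X",
        "lookup",
        frozenset({"leave-policy"}),
    ),
    EvalCase(
        "what changes when an employee transfers between subsidiaries",
        "comparative",
        frozenset({"leave-policy", "transfer-policy"}),
    ),
]


def fake_retrieve(query: str) -> list[str]:
    # Stand-in retriever that only ever finds one document.
    return ["leave-policy"]


gaps = coverage_gaps(cases, {"lookup", "comparative", "conditional"})
rates = hit_rate_by_type(cases, fake_retrieve)
```

Breaking the score down by query type is what exposes a system that is strong on lookups and weak on multi-document questions. A single aggregate number would hide it.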

Weeks three and four are the architecture experiments. The team runs at least two chunking strategies against the evaluation set, two embedding models, and, if relevant, two agent loop structures. The experiments are scoped to days, not sprints. The output is a written architecture decision for each area, captured as an Architecture Decision Record. Together these records become the artefact the next engineer reads when they join in month six and ask why the system was built this way.

Weeks five and six are productionisation. The retrieval pipeline goes behind an API. The agent's tool inventory is locked. The audit trail is wired through. The evaluation set runs in continuous integration so any change that drops retrieval quality is caught before deployment. The system is now ready to enter the Production Readiness Review™.
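The continuous integration check can be a plain test that compares the current evaluation scores with a stored baseline. In the sketch below, `find_regressions`, the metric names, the scores, and the 0.02 tolerance are all illustrative assumptions. A real pipeline would load the baseline from a file and compute the current scores by running the evaluation set.

```python
def find_regressions(
    baseline: dict[str, float],
    current: dict[str, float],
    tolerance: float = 0.02,
) -> dict[str, tuple[float, float]]:
    """Return metrics that dropped by more than the tolerance, as (baseline, current)."""
    regressions: dict[str, tuple[float, float]] = {}
    for metric, base in baseline.items():
        now = current.get(metric, 0.0)  # a missing metric counts as a failure
        if now < base - tolerance:
            regressions[metric] = (base, now)
    return regressions


def test_retrieval_quality_has_not_regressed() -> None:
    # Collected by a test runner such as pytest; values are inlined here.
    baseline = {"lookup_recall@5": 0.92, "comparative_recall@5": 0.70}
    current = {"lookup_recall@5": 0.93, "comparative_recall@5": 0.71}
    assert find_regressions(baseline, current) == {}


# A candidate change that helps nothing and hurts comparative queries:
baseline = {"lookup_recall@5": 0.92, "comparative_recall@5": 0.70}
candidate = {"lookup_recall@5": 0.91, "comparative_recall@5": 0.55}
regressions = find_regressions(baseline, candidate)
```

The small tolerance absorbs run-to-run noise, so only meaningful drops block a deployment.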

Three obstacles recur. The first is skipping the evaluation set because demo answers look fine; do that once and you ship a system whose quality you cannot measure or defend. The second is over-scoping the agent tool inventory for flexibility; every tool added is surface area, so start narrow and expand based on observed needs. The third is treating human override as a feature added at the end; override designed in week one is structural, override designed in week six is bolt-on. Implementation also depends on three other Built to Last components: the AI Evaluation Framework supplies the test cases, Prompt Version Control treats the orchestration prompts as code, and the Production Readiness Review confirms the system is ready to go live.

A RAG System Redesigned for the Queries It Actually Got

An Australian enterprise we worked with had built an internal knowledge RAG over its policy and procedure library. The system demoed well to senior leadership. Six weeks after rollout, usage had dropped sharply. The system handled lookup-style queries — "what is the leave entitlement for staff in category X" — accurately. It failed on comparative and conditional queries — "what changes when an employee transfers between subsidiaries" — because the answers spanned several documents whose interactions sat across paragraph boundaries that the chunking strategy didn't preserve.

The redesign was scoped to four weeks. The team replaced fixed-length chunking with hierarchical chunking that preserved section, paragraph, and sentence relationships. A re-ranker was added behind the vector search. The evaluation set was rebuilt against the actual queries that had failed, drawn from the system's logs. Two embedding models were benchmarked; the domain-tuned one outperformed the default on policy-specific terminology. The agent layer was deliberately not introduced; the use case did not need actions, only better retrieval. Three months after relaunch, query volume had returned to the original target and the comparative-query category — the one the original system had failed on — was the second-most-used path through the system. The lesson was not that hierarchical chunking is always right; it was that the chunking strategy is a design decision, and treating it as a default cost six months.

When This Component Is Critical, and When You Can Defer It

The discipline is critical the moment the system handles a query type whose failure is visible to users: wrong answers, hallucinated content, comparative questions surfacing incorrect comparisons. It is critical for any agentic system that can write to a system of record, send external communications, or affect a customer-facing process. Under the EU AI Act, auditability and human oversight are regulatory requirements for high-risk systems rather than design choices, and ISO/IEC 42001 and the NIST AI Risk Management Framework set the same expectation as voluntary standards.

It can be deferred — though not skipped — for narrow internal proofs-of-concept where the user base is small and the use case will be retired before production scale. Even there, the evaluation set is worth building from the start. A proof-of-concept that succeeds is a system that will be asked to handle more queries, and a team that measures late finds itself blind exactly when the system is most valuable to keep aligned. The discipline is also less load-bearing where AI is a supporting feature rather than the core function.

What to Do Next

If you have a RAG or agentic system in flight without an evaluation set, an explicit chunking and embedding strategy, defined tool-use scope, or a documented human-override pattern, the implementation sequence above is the structured way to close the gap. For a wider view of how AI architecture sits inside delivery, see how we deliver agentic AI. The next BTL component most teams need alongside this one is the AI Evaluation Framework — the test infrastructure that keeps retrieval quality honest as the corpus and the model evolve.

Frequently Asked Questions

How do we retrieve relevant context reliably?

Treat retrieval as a pipeline, not a single step. The minimum production pattern is a chunking strategy chosen against the actual corpus, an embedding model benchmarked against a labelled evaluation set, vector search tuned for the recall-precision trade-off the use case needs, and a re-ranker on the candidate set. Re-ranking in particular is the highest-leverage addition for systems struggling on comparative or multi-part queries; it routinely lifts answer quality more than swapping the foundation model.

What embedding model should we use?

The one that performs best on a labelled evaluation set drawn from your actual query distribution. General-purpose embedding models trained on web text tend to underperform on domain-specific corpora where terminology carries specific meaning. Benchmark two or three candidates against retrieval quality, query latency at production volume, cost per million embeddings, and data sovereignty constraints. The framework default is almost never the right answer for production.

How do we constrain agent behaviour?

Three controls, designed together. Scope the tool inventory to the smallest set the use case needs, separating reads from writes. Structure the planning loop with a hard step limit, a confidence threshold, and a defined error path. Route high-impact writes through an explicit confirmation step rather than direct execution. Wide tool scopes and loose loops are the failure pattern that produces unexpected actions at production scale.

Where do humans override?

At the boundaries you define before the system ships. A workable default: reads autonomous, low-impact writes autonomous with audit logging, high-impact writes require explicit human confirmation, edge-case detection routes to a staffed queue. The override has to be operational, not just present — a queue that is never staffed is the same architecture as no override at all. The EU AI Act and NIST AI Risk Management Framework set baseline expectations for higher-risk use cases.

When does an agentic architecture earn its complexity?

When the workflow genuinely requires sequenced actions that depend on intermediate observations. A use case that retrieves information and presents an answer is a RAG system, not an agent — adding the agent layer adds failure surface for no benefit. The agent earns its complexity when the system must call tools, observe results, and adapt its next action accordingly.

How do we measure whether retrieval is good enough?

Against a labelled evaluation set, every release. The set covers the query types you expect — lookup, comparative, conditional, multi-document — with the retrieval results that should appear. Retrieval quality is then a measurable metric, not a matter of demo opinion. The set runs in continuous integration so any change that degrades retrieval is caught before deployment. Without it, you cannot improve the system because you cannot tell whether you have.

What's the realistic timeline to redesign a struggling RAG system?

Four to six weeks for most use cases. Week one is corpus and query characterisation. Week two builds the evaluation set against real user queries, including the ones that failed. Weeks three and four run architecture experiments — chunking, embeddings, re-ranking — against the evaluation set. Weeks five and six wire the new pipeline through to production with the evaluation set running in continuous integration. The work runs in parallel with continued operation of the existing system.

Khusbu Basnet

Khusbu ensures top-quality project delivery while fostering growth. Her dedication to excellence drives her to be a best-in-class Project Manager.