Most retrieval-augmented generation systems pass their first two demos and fail on the third type of query. The lookup-style questions work. Then someone asks a comparative question that spans three documents, the retrieval surfaces irrelevant chunks, the model hallucinates a confident answer, and the credibility the system spent six weeks earning evaporates in a single meeting. The engineering team is rarely surprised. The chunking strategy was chosen on a Wednesday afternoon, the embedding model was the framework default, and nobody is monitoring hallucination rate against a benchmark because no benchmark exists.
The same pattern plays out in agentic systems. A multi-step agent works in scripted scenarios, then loops on an ambiguous instruction, makes a tool call it shouldn't, and writes data the team didn't expect it to touch. The result is a refund issued twice, an email sent to the wrong list, or a record updated in a way that takes the data team a week to unwind. The architecture had no boundary on what tools the agent could reach, no checkpoint where a human had to confirm, no audit trail beyond the underlying log line.
RAG architecture and agentic system design is the discipline of making those choices deliberately, upfront, before the system ships. The choices are technical: chunking strategy, embedding model, vector store, retrieval algorithm, re-ranking, agent decision boundaries, tool-use constraints, and human override points. Most teams treat them as defaults rather than as design decisions with downstream consequences measured in weeks of rework. This article walks through the choices that matter, the failure modes most teams hit, and the patterns that hold under production load.
Why The Wrong Architecture Costs More Than The Wrong Model
The dominant narrative around AI failure points at the model. Models are the visible layer, so they take the blame. The architecture underneath is what actually determines whether the system holds.
A system with strong architecture and an average model can be upgraded to a better model in an afternoon — model swap is a configuration change. A system with weak architecture and the best model on the market degrades the moment the use case widens beyond the original scope. The retrieval pipeline that returns adequate context for the first 500 documents fails when the corpus reaches 50,000. The agent that handled three tools cleanly produces unpredictable behaviour when a fourth is added. The chunking strategy that worked on FAQ-style content collapses on long-form policy documents.
The cost compounds across three axes. Rework cost — replacing a flat chunking strategy with hierarchical chunking and re-indexing a corpus is a multi-week exercise that should have been a sprint-one decision. Operational cost — a retrieval pipeline that hits the model with five irrelevant chunks per query pays the token bill for those chunks every time, and at production volume that bill compounds quickly. And credibility cost — leadership conversations after a visible AI failure are slower, more sceptical, and harder to fund. EB Pearls treats RAG and agentic system design as a P03 architecture decision precisely because the choices are difficult to reverse later and routine to make well upfront.
What RAG and Agentic System Design Actually Is
Built to Last™ 2.0 places RAG and agentic system design in the Right Architecture pillar, alongside the Architecture Session and the Three-Horizon Test. RAG architecture is the design of how a system retrieves relevant context from a knowledge base and presents it to a model. Agentic system design is the design of how a model takes actions — calling tools, navigating workflows, making decisions — under defined constraints. Many production systems combine both. The architecture decisions split into four areas.
The retrieval pipeline
The retrieval pipeline is the part of a RAG system that turns a user query into a set of context chunks passed to the model. Four decisions inside it determine whether the system holds: chunking strategy, embedding selection, vector search configuration, and re-ranking.
Chunking strategy is the most consequential decision and the most commonly defaulted. The naive approach (split documents into fixed-length chunks of, say, 500 tokens) works on the simplest content and fails everywhere else. A clause that explains a definition is separated from the clause that uses it. A table's header is in one chunk and its rows in another. The model gets fragments and has to reconstruct meaning, which is exactly the load the retrieval pipeline was supposed to remove. Hierarchical chunking (chunks at the section, paragraph, and sentence level, linked so retrieval can fetch the parent context when a child matches) preserves cross-paragraph relationships. Semantic chunking, which uses embedding similarity to find natural breakpoints, tends to outperform fixed-length splitting on long-form documents.
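A minimal sketch of the parent-child linking described above, simplified to two levels (section and paragraph) rather than three. The names `Chunk`, `build_hierarchy`, and `expand_to_parent` are illustrative, and the blank-line paragraph separator is an assumption about the source documents, not a requirement of the technique.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Chunk:
    chunk_id: str
    level: str                      # "section" or "paragraph"
    text: str
    parent_id: Optional[str] = None


def build_hierarchy(doc_id: str, sections: dict[str, str]) -> dict[str, Chunk]:
    """Index each section and each of its paragraphs, linking child to parent.

    `sections` maps a section title to its body. Paragraphs are assumed to be
    separated by blank lines.
    """
    index: dict[str, Chunk] = {}
    for s_num, (title, body) in enumerate(sections.items()):
        section_id = f"{doc_id}/s{s_num}"
        index[section_id] = Chunk(section_id, "section", f"{title}\n{body}")
        paragraphs = [p.strip() for p in body.split("\n\n") if p.strip()]
        for p_num, paragraph in enumerate(paragraphs):
            para_id = f"{section_id}/p{p_num}"
            index[para_id] = Chunk(para_id, "paragraph", paragraph, section_id)
    return index


def expand_to_parent(index: dict[str, Chunk], matched_id: str) -> Chunk:
    """Return the parent section of a matched paragraph, or the chunk itself."""
    chunk = index[matched_id]
    return index[chunk.parent_id] if chunk.parent_id else chunk
```

In this arrangement the small paragraph chunks are what get embedded and matched, while `expand_to_parent` decides what the model actually sees, so a definition and the clause that uses it travel together.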
Embedding selection determines what the system treats as similar. A general-purpose embedding model trained on web text may underperform on legal, medical, or technical corpora where domain terminology carries specific meaning. The decision criteria are concrete: retrieval quality on a representative test set, latency at production query volume, cost per million embeddings, and whether the model can be self-hosted if data sovereignty requires it. Benchmarking two or three candidates against a labelled evaluation set is a one-day exercise that prevents months of low retrieval quality.
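The benchmarking exercise can be sketched as a small harness that scores each candidate on the same labelled set. This is an illustration under stated assumptions: each candidate is wrapped as a hypothetical `retrieve(query, k)` function returning chunk ids, recall at k is used as the quality metric, and every labelled query lists at least one expected chunk.

```python
from typing import Callable, Iterable

# (query, ids of the chunks a correct answer needs)
LabelledQuery = tuple[str, set[str]]
Retriever = Callable[[str, int], list[str]]


def recall_at_k(retrieve: Retriever, labelled: Iterable[LabelledQuery], k: int = 5) -> float:
    """Mean fraction of expected chunks that appear in the top-k results."""
    scores = []
    for query, expected in labelled:
        found = set(retrieve(query, k)) & expected
        scores.append(len(found) / len(expected))
    return sum(scores) / len(scores)


def compare_candidates(
    candidates: dict[str, Retriever],
    labelled: list[LabelledQuery],
    k: int = 5,
) -> list[tuple[str, float]]:
    """Score every candidate on the same labelled set, best first."""
    results = [(name, recall_at_k(r, labelled, k)) for name, r in candidates.items()]
    return sorted(results, key=lambda item: item[1], reverse=True)
```

Latency and cost per million embeddings would be recorded alongside the recall figure; they are left out here to keep the sketch short.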
Vector search configuration covers the index type, similarity metric, and recall-precision trade-off. An approximate nearest neighbour index returns top results quickly but trades some recall for that speed, so it can miss relevant chunks that an exact search would have returned. The right configuration depends on corpus size, query volume, and how much latency the use case can absorb; it is a tuning exercise informed by measurement, not a binary choice.
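One way to ground that tuning in measurement is to compare the approximate index against a brute-force search on a sample of queries. The sketch below assumes cosine similarity and a corpus sample small enough to hold in memory; `exact_top_k` and `index_recall` are illustrative names, and the approximate results would come from whichever vector store is in use.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def exact_top_k(query: list[float], vectors: dict[str, list[float]], k: int) -> list[str]:
    """Brute-force ground truth: score every vector, keep the k most similar."""
    ranked = sorted(vectors, key=lambda cid: cosine(query, vectors[cid]), reverse=True)
    return ranked[:k]


def index_recall(approx_ids: list[str], exact_ids: list[str]) -> float:
    """Share of the true top-k that the approximate index also returned."""
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)
```

Averaging `index_recall` over a few hundred sampled queries, at each index setting under consideration, gives the recall side of the trade-off to set against measured latency.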
Re-ranking is the step most retrieval pipelines skip. After the vector search returns a candidate set — typically 20 to 50 chunks — a re-ranker scores them against the query using a more expensive model, and the top three to five are passed to the generation model. Re-ranking corrects the precision problem that vector search alone produces, particularly on comparative or multi-part queries. Adding a re-ranker to a struggling RAG system is the single highest-leverage change available; it routinely lifts answer quality more than swapping the foundation model.
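The two-stage shape can be sketched as follows. The `term_overlap` scorer is a deliberately crude stand-in so the example runs without dependencies; in a real pipeline it would be replaced by the more expensive re-ranking model, and the `(chunk_id, text)` candidate format is an assumption.

```python
from typing import Callable

Scorer = Callable[[str, str], float]


def term_overlap(query: str, passage: str) -> float:
    """Toy stand-in for a re-ranking model: fraction of query terms in the passage."""
    terms = set(query.lower().split())
    words = set(passage.lower().split())
    return len(terms & words) / len(terms) if terms else 0.0


def rerank(
    query: str,
    candidates: list[tuple[str, str]],
    score: Scorer = term_overlap,
    keep: int = 5,
) -> list[str]:
    """Re-score (chunk_id, text) candidates against the query and keep the best few."""
    ranked = sorted(candidates, key=lambda c: score(query, c[1]), reverse=True)
    return [chunk_id for chunk_id, _ in ranked[:keep]]
```

The vector search supplies the 20 to 50 candidates, and `keep` controls how many reach the generation model, which is also what bounds the token bill per query.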
Agent design and decision boundaries
Agentic systems extend the RAG pattern with action. The model plans, calls tools, observes results, and continues. The architecture choices are about constraint. Three matter most: tool-use scope (see the OWASP Top 10 for Large Language Model Applications), planning loop structure, and termination conditions.
Tool-use scope is the inventory of actions the agent can take. A narrow scope — three tools, all read-only — is easy to reason about and hard to break. A wide scope — fifteen tools, several with write access — is powerful and a security and reliability surface area in equal measure. The design choice is to define the smallest tool set the use case requires, separate reads from writes, and route writes through a confirmation step or a constrained API rather than direct database access.
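A minimal sketch of a scoped tool inventory, assuming tools are plain Python callables. `Tool`, `ToolRegistry`, and the `confirmed` flag are hypothetical names; how confirmation is actually obtained (a human click, a constrained API) sits outside the sketch.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Tool:
    name: str
    run: Callable[[dict], str]
    writes: bool = False        # write tools never run without confirmation


class ToolRegistry:
    """The only path by which the agent can reach a tool."""

    def __init__(self, tools: list[Tool]):
        self._tools = {t.name: t for t in tools}

    def call(self, name: str, args: dict, confirmed: bool = False) -> str:
        if name not in self._tools:
            raise PermissionError(f"tool not in scope: {name}")
        tool = self._tools[name]
        if tool.writes and not confirmed:
            raise PermissionError(f"{name} writes data and needs confirmation")
        return tool.run(args)
```

Because anything outside the registry raises rather than silently falling through, adding a tool becomes an explicit change to the inventory rather than a side effect of a prompt edit.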
Planning loop structure determines how the agent decides what to do next. ReAct-style loops, where the model alternates reasoning and acting, work for short chains. Plan-then-execute structures, where the model produces a plan first and the system executes the steps with checkpoints, work better for longer chains. Long loops accumulate error — small misinterpretations compound into wrong actions by step five.
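A plan-then-execute loop with checkpoints might look like the sketch below. It assumes the plan has already been produced (in practice by the model) as a list of described steps, and that `checkpoint` is a hypothetical validation hook supplied by the system; both are illustrative rather than any particular framework's API.

```python
from typing import Callable

Step = tuple[str, Callable[[], str]]   # (description, action)


def execute_plan(plan: list[Step], checkpoint: Callable[[str, str], bool]) -> list[str]:
    """Run steps in order and stop as soon as a checkpoint rejects a result."""
    results: list[str] = []
    for description, action in plan:
        outcome = action()
        if not checkpoint(description, outcome):
            results.append(f"halted after: {description}")
            break
        results.append(outcome)
    return results
```

Stopping at the first rejected step is what keeps a misinterpretation at step two from becoming a wrong action at step five.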
Termination conditions prevent agents looping forever. A hard step limit, a confidence threshold, and a defined error path together prevent the failure mode where an agent rephrases the same wrong sub-goal twenty times and burns through a budget. Production agents need all three.
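The three conditions can be combined in a single loop, sketched here under the assumption that each agent step reports an optional answer and a confidence score between 0 and 1. `StepResult`, `run_agent`, and the default limits are illustrative values, not recommendations.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class StepResult:
    answer: Optional[str]   # None while the agent is still working
    confidence: float       # assumed to be reported in the range 0..1


def run_agent(
    step: Callable[[int], StepResult],
    max_steps: int = 8,
    min_confidence: float = 0.7,
) -> str:
    for i in range(max_steps):                       # hard step limit
        try:
            result = step(i)
        except Exception as exc:                     # defined error path
            return f"escalated: {exc}"
        if result.answer is not None:
            if result.confidence >= min_confidence:  # confidence threshold
                return result.answer
            return "escalated: low confidence"
    return "escalated: step limit reached"
```

Every exit from the loop is either an answer above the threshold or an explicit escalation, so there is no path on which the agent keeps spending budget unobserved.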
Human override and audit
The two decisions that separate prototype-grade from production-grade systems are where humans override and what gets logged. Human override is not a single feature; it is a set of choices about which decisions the system makes autonomously, which it surfaces for confirmation, and which it escalates. A common pattern: read operations autonomous, low-impact writes autonomous with logging, high-impact writes require explicit confirmation, and edge-case detection routes to a queue. The pattern only works when the categories are defined at design time, not when the on-call engineer is paged at 2am.
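The common pattern above can be written down as a routing function so the categories exist in code at design time. The three boolean inputs are an assumption about how actions are classified upstream, and the `Route` names are illustrative.

```python
from enum import Enum


class Route(Enum):
    AUTONOMOUS = "autonomous"
    AUTONOMOUS_LOGGED = "autonomous_logged"
    CONFIRM = "confirm"
    QUEUE = "queue"


def route_action(is_write: bool, high_impact: bool, edge_case: bool) -> Route:
    """Map an action's design-time category to the human involvement it gets."""
    if edge_case:
        return Route.QUEUE                  # detected edge cases go to a queue
    if not is_write:
        return Route.AUTONOMOUS             # reads run without intervention
    return Route.CONFIRM if high_impact else Route.AUTONOMOUS_LOGGED
```

Checking `edge_case` first is a design choice in this sketch: an unusual read is still routed to a person rather than answered autonomously.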
Audit is the trail the system leaves behind for every action: query, retrieval result, tool calls, model output, override if any, final action. It is invisible until something goes wrong, at which point it is the difference between a five-minute investigation and a five-day reconstruction. Under the EU AI Act, ISO 42001, and the NIST AI Risk Management Framework, audit trails are not optional for higher-risk use cases; the architecture has to support them from the start.
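As a sketch, one audit record per action could carry exactly the fields listed above and be written as a single JSON line. The field names and the JSON-lines format are assumptions for illustration; what a given regulation or standard requires should be checked against its own text.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class AuditRecord:
    query: str
    retrieved_ids: list[str]
    tool_calls: list[str]
    model_output: str
    final_action: str
    override: Optional[str] = None      # who or what overrode the system, if anyone
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def to_log_line(record: AuditRecord) -> str:
    """Serialise one record as a single JSON line for an append-only log."""
    return json.dumps(asdict(record), sort_keys=True)
```

Keeping the whole chain in one record, rather than scattered across service logs, is what makes the five-minute investigation possible.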
Failure modes even when the component is present
A team can make all four sets of decisions and still ship a fragile system. Three failure modes recur. The test set used to measure retrieval quality is too narrow — it covers the queries the team thought to write, not the queries users actually ask. The agent's tool descriptions are inconsistent or ambiguous, so the model picks the wrong tool under load. The human override is technically present but operationally absent — the queue is never staffed, so escalations stall and the team learns about the failures from customers. The component being present is necessary; the component being operational is the harder problem.
How to Implement RAG and Agentic Design Without Stalling Delivery
Implementing the discipline does not require pausing the build. It requires sequencing the right decisions before code locks. A realistic four-to-six-week implementation runs in parallel with the early sprints of an AI engagement.
Week one is corpus and use-case characterisation. The team catalogues the document types in scope — fixed-format records, unstructured policy documents, transcripts, structured data — and the query types the system must handle. The query types matter more than the corpus; they tell you whether you need re-ranking, whether single-document retrieval is enough, and whether the agent needs tool use at all. Many systems that started as agents could have been simpler retrieval systems if this step had been run. The output is a written characterisation that goes into the Locked Scope Document™.
Week two is the evaluation set. A team that cannot agree on what "working" means cannot build a system that works. The evaluation set is a labelled collection of queries with the expected retrieval results and, where applicable, the expected agent actions. It does not need to be large — 50 to 200 well-chosen queries cover most use cases — but it needs to span the query types catalogued in week one. The set runs against every architecture candidate, and against every release once the system ships. Without it, system quality is a matter of demo enthusiasm.
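A labelled evaluation case and a coverage check against the week-one query types might be sketched like this. `EvalCase`, `coverage_gaps`, and the `min_per_type` default are hypothetical; the threshold in particular is a placeholder for the team to set.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class EvalCase:
    query: str
    query_type: str                     # e.g. "lookup", "comparative"
    expected_chunks: set[str]
    expected_actions: list[str] = field(default_factory=list)


def coverage_gaps(
    cases: list[EvalCase],
    required_types: set[str],
    min_per_type: int = 5,
) -> dict[str, int]:
    """Return each required query type that has too few cases, with its count."""
    counts = Counter(case.query_type for case in cases)
    return {t: counts[t] for t in sorted(required_types) if counts[t] < min_per_type}
```

An empty result means every catalogued query type is represented; a non-empty one names the types the set would leave unmeasured.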
Weeks three and four are the architecture experiments. The team runs at least two chunking strategies against the evaluation set, two embedding models, and — if relevant — two agent loop structures. The experiments are scoped to days, not sprints. The output is a written architecture decision for each area, captured as an Architecture Decision Record. These become the artefact the next engineer reads when they join in month six and ask why the system was built this way.
Weeks five and six are productionisation. The retrieval pipeline goes behind an API. The agent's tool inventory is locked. The audit trail is wired through. The evaluation set runs in continuous integration so any change that drops retrieval quality is caught before deployment. The system is now ready to enter the Production Readiness Review™.
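The continuous-integration check can be as small as a comparison of the current evaluation scores against a stored baseline. The metric names, the `tolerance` default, and the idea of keeping the baseline as a simple mapping are all assumptions made for this sketch.

```python
def quality_gate(
    current: dict[str, float],
    baseline: dict[str, float],
    tolerance: float = 0.02,
) -> list[str]:
    """List every metric that fell more than `tolerance` below its baseline."""
    failures = []
    for metric, expected in baseline.items():
        actual = current.get(metric, 0.0)   # a missing metric counts as zero
        if actual < expected - tolerance:
            failures.append(f"{metric}: {actual:.2f} < baseline {expected:.2f}")
    return failures
```

A CI job would run the evaluation set, call `quality_gate`, and fail the build when the returned list is non-empty, which is what stops a quality regression before deployment.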
Three obstacles recur. Skipping the evaluation set because demo answers look fine — do that once and you ship a system whose quality you cannot measure or defend. Over-scoping the agent tool inventory for flexibility; every tool added is surface area, so start narrow and expand based on observed needs. Treating human override as a feature added at the end; override designed in week one is structural, override designed in week six is bolt-on. Implementation depends on three other Built to Last components: the AI Evaluation Framework supplies the test cases, Prompt Version Control treats the orchestration prompts as code, and the Production Readiness Review confirms the system is ready to go live.
A RAG System Redesigned for the Queries It Actually Got
An Australian enterprise we worked with had built an internal knowledge RAG over its policy and procedure library. The system demoed well to senior leadership. Six weeks after rollout, usage had dropped sharply. The system handled lookup-style queries — "what is the leave entitlement for staff in category X" — accurately. It failed on comparative and conditional queries — "what changes when an employee transfers between subsidiaries" — because the answers spanned several documents whose interactions sat across paragraph boundaries that the chunking strategy didn't preserve.
The redesign was scoped to four weeks. The team replaced fixed-length chunking with hierarchical chunking that preserved section, paragraph, and sentence relationships. A re-ranker was added behind the vector search. The evaluation set was rebuilt against the actual queries that had failed, drawn from the system's logs. Two embedding models were benchmarked; the domain-tuned one outperformed the default on policy-specific terminology. The agent layer was deliberately not introduced; the use case did not need actions, only better retrieval. Three months after relaunch, query volume had returned to the original target and the comparative-query category — the one the original system had failed on — was the second-most-used path through the system. The lesson was not that hierarchical chunking is always right; it was that the chunking strategy is a design decision, and treating it as a default cost six months.
When This Component Is Critical, and When You Can Defer It
The discipline is critical the moment the system handles a query type whose failure is visible to users: wrong answers, hallucinated content, comparative questions surfacing incorrect comparisons. It is critical for any agentic system that can write to a system of record, send external communications, or affect a customer-facing process. For higher-risk use cases under the EU AI Act, ISO 42001, and the NIST AI Risk Management Framework, auditability and override architecture is a regulatory expectation rather than a design choice.
It can be deferred — though not skipped — for narrow internal proofs-of-concept where the user base is small and the use case will be retired before production scale. Even there, the evaluation set is worth building from the start. A proof-of-concept that succeeds is a system that will be asked to handle more queries, and a team that measures late finds itself blind exactly when the system is most valuable to keep aligned. The discipline is also less load-bearing where AI is a supporting feature rather than the core function.
What to Do Next
If you have a RAG or agentic system in flight without an evaluation set, an explicit chunking and embedding strategy, defined tool-use scope, or a documented human-override pattern, the implementation sequence above is the structured way to close the gap. For a wider view of how AI architecture sits inside delivery, see how we deliver agentic AI. The next BTL component most teams need alongside this one is the AI Evaluation Framework — the test infrastructure that keeps retrieval quality honest as the corpus and the model evolve.
Frequently Asked Questions
How do we retrieve relevant context reliably?
What embedding model should we use?
How do we constrain agent behaviour?
Where do humans override?
When does an agentic architecture earn its complexity?
How do we measure whether retrieval is good enough?
What's the realistic timeline to redesign a struggling RAG system?
Khusbu ensures top-quality project delivery while fostering growth. Her dedication to excellence drives her to be a best-in-class Project Manager.