How to Vet an AI Agency: The Questions That Separate Real From Theatre

Published: 17 Jun 2026

Author: Roshan Manandhar

You are about to hand someone between $30K and $200K to build an AI system. The agency's website says they do AI. Their case studies mention AI. Their team page lists AI engineers. None of that tells you whether they have actually shipped AI into production — or whether they built a demo, put it in a slide deck, and called it a case study.

The difference between an agency that has delivered production AI and one that has delivered impressive prototypes is not visible on a website. It is visible in the answers to twelve questions. This article gives you those questions, explains what good and bad answers sound like, and tells you what to do with the information.

Why We Wrote This

We are an AI agency. We have a direct commercial interest in you choosing the right one — ideally us, but honestly, a competent competitor is better for the industry than another failed project blamed on "AI not being ready." Every failed AI engagement makes the next founder more sceptical and the market harder for everyone.

We wrote this because we have seen what happens when the wrong agency gets a production AI brief. The founder burns six months and a significant budget, concludes AI does not work for their use case, and either abandons the initiative or starts over with trust debt that the next agency has to repay. Most of that was avoidable.

Why the Standard Vetting Process Fails for AI

The way most founders vet a software agency — check the portfolio, read the reviews, talk to a reference, compare quotes — breaks down for AI work. Here is why.

Portfolios lie by omission.

An agency can show you a chatbot they built. They cannot easily show you whether that chatbot is still running in production, what its accuracy rate is, or whether the client replaced it six months later. A demo and a production system look identical in a case study.

Reviews lag reality.

AI capability is changing quarter by quarter. An agency that was genuinely leading-edge eighteen months ago may be running the same playbook while the technology has moved. Clutch reviews from 2024 tell you about capability in 2023. A five-star review for a traditional software project tells you nothing about AI capability.

Quotes hide scope.

AI project quotes routinely cover only the build — the application layer and the prompt engineering. Production hardening, monitoring, ongoing model management, and the iteration cycles that AI systems require are often left out. The cheapest quote is frequently the most expensive project.

The standard vetting process does not account for these gaps. You need a different set of questions.

The Twelve Questions

These are arranged in three groups: questions about what they have built, questions about how they build, and questions about what happens after.

Group 1: What Have You Actually Shipped?

Question 1: Show me an AI system you built that is still running in production today.

Not a demo. Not a prototype. Not a proof of concept that got positive feedback. A system that real users interact with, that has been running for at least three months, and that the client still pays to operate.

What a good answer sounds like: a specific system, a specific client (or anonymised with enough detail to be credible), specific metrics — accuracy rate, daily usage, uptime. The agency can describe what went wrong post-launch and how they fixed it.

What a bad answer sounds like: "We've built several AI solutions for clients across industries." Vague, plural, no specifics. If they cannot name one system that is running right now, they have not shipped production AI.

Question 2: What was the hardest production issue you encountered in an AI system, and how did you resolve it?

This question is a filter. Anyone who has shipped AI into production has a war story. Accuracy that dropped when real data looked different from test data. Latency that was acceptable in staging and unacceptable under load. A hallucination pattern that appeared only with a specific type of input.

If the answer is polished and generic — "We encountered some accuracy challenges and optimised our prompts" — they are describing a demo, not production. Real production issues are specific, ugly, and instructive.

Question 3: What AI system did you recommend a client *not* build?

An agency that says yes to every AI brief is either desperate for work or does not understand the technology well enough to know when it is the wrong answer. The most valuable thing an AI agency can do in the first conversation is tell you that AI is not the right solution for your problem. If they have never done that, they are selling technology, not solving problems.

Group 2: How Do You Build?

Question 4: Walk me through your process from brief to production. Where does AI-specific work happen?

You are looking for evidence that the agency has a structured process that accounts for the ways AI projects differ from traditional software: data assessment, accuracy benchmarking, prompt engineering as an iterative discipline, production hardening for AI-specific failure modes, and monitoring designed for model behaviour.

At EB Pearls, this structure is codified in the Built to Last™ 2.0 framework. P01 (The Right Problem) includes AI Validation — a structured assessment of whether AI is the right solution before a model is selected. P02 (The Right Infrastructure) includes AI-specific production readiness: drift detection, cost monitoring, accuracy tracking. P03 (The Right Architecture) includes RAG and agentic system design with documented architecture decisions.

What a good answer sounds like: a defined process with named stages, where AI-specific activities are explicit — not bolted onto a generic software delivery process.

What a bad answer sounds like: "We follow agile methodology" with no AI-specific activities mentioned. AI projects run on sprints like everything else, but the *activities within those sprints* are different.

Question 5: How do you define and measure accuracy for an AI system?

This is where theatre collapses. An agency doing real AI work can describe how they establish accuracy benchmarks before building, how they measure accuracy in production, and what they do when accuracy drops below the threshold.

What a good answer includes: a defined evaluation framework, agreement on what "good enough" means before development starts, continuous monitoring against that benchmark post-launch, and a process for prompt tuning or model adjustment when accuracy drifts.

What a bad answer includes: "We test thoroughly before launch." Testing is not measuring. Testing tells you the system works today. Measuring tells you when it stops working tomorrow.
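To make the difference between testing and measuring concrete, here is a minimal sketch of an evaluation harness in Python. Everything in it is illustrative: `EvalCase`, `toy_classifier`, the four sample cases, and the 0.90 threshold are assumptions made for this example, not a description of any particular agency's framework. A real evaluation set would be far larger and agreed with you before development starts.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    """One labelled example from the agreed evaluation set."""
    prompt: str
    expected: str


def accuracy(system: Callable[[str], str], cases: list[EvalCase]) -> float:
    """Fraction of cases where the system output matches the expected label."""
    if not cases:
        raise ValueError("evaluation set is empty")
    correct = sum(
        1
        for case in cases
        if system(case.prompt).strip().lower() == case.expected.strip().lower()
    )
    return correct / len(cases)


def meets_benchmark(score: float, threshold: float) -> bool:
    """True when measured accuracy is at or above the agreed threshold."""
    return score >= threshold


def toy_classifier(prompt: str) -> str:
    """Hypothetical stand-in for the real AI system: a keyword matcher."""
    return "refund" if "money back" in prompt.lower() else "other"


EVAL_SET = [
    EvalCase("I want my money back", "refund"),
    EvalCase("Where is my parcel?", "other"),
    EvalCase("Please reimburse me", "refund"),  # the toy system misses this one
    EvalCase("Change my address", "other"),
]

score = accuracy(toy_classifier, EVAL_SET)
print(f"accuracy={score:.2f}, passes={meets_benchmark(score, threshold=0.90)}")
# prints: accuracy=0.75, passes=False
```

The point of the sketch is that the same function can run before launch and on a schedule afterwards. A one-off test is a single call to `accuracy`; measurement is that call repeated against fresh production samples and compared with the threshold every time.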

Question 6: How do you handle hallucinations in production?

Every LLM hallucinates. The question is not whether the agency can prevent it — they cannot — but whether they have a systematic approach to detecting it, reducing it, and handling it gracefully when it occurs.

Look for: output validation, retrieval quality scoring in RAG systems, confidence thresholds below which the system escalates to a human or declines to answer, monitoring that tracks hallucination rates over time, and a defined response when rates increase.

Avoid: "Our prompt engineering minimises hallucinations." Prompt engineering reduces hallucination. It does not eliminate it. An agency that claims otherwise has not dealt with production-scale input diversity.
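A confidence threshold with escalation can be sketched in a few lines. The version below is a simplified illustration, assuming a RAG system that exposes a retrieval score between 0 and 1 and a count of cited sources. The names `Draft` and `route`, and the cut-off values 0.75 and 0.40, are hypothetical; real thresholds would be tuned against your own evaluation data.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    ANSWER = "answer"
    ESCALATE = "escalate_to_human"
    DECLINE = "decline"


@dataclass(frozen=True)
class Draft:
    """A candidate answer plus the signals used to decide whether to send it."""
    text: str
    retrieval_score: float  # assumed to be in the range 0..1
    cited_sources: int      # number of retrieved passages the answer cites


def route(draft: Draft, answer_at: float = 0.75, escalate_at: float = 0.40) -> Action:
    """Decide what to do with a draft answer.

    Assumed policy for this sketch:
    - no cited sources, or very weak retrieval: decline to answer
    - middling retrieval: hand over to a human
    - strong retrieval with citations: answer
    """
    if draft.cited_sources == 0 or draft.retrieval_score < escalate_at:
        return Action.DECLINE
    if draft.retrieval_score < answer_at:
        return Action.ESCALATE
    return Action.ANSWER


print(route(Draft("Your plan renews on 1 March.", 0.91, 2)).value)  # answer
print(route(Draft("Probably covered.", 0.55, 1)).value)             # escalate_to_human
print(route(Draft("Yes, definitely.", 0.95, 0)).value)              # decline
```

Logging every `Action` alongside the input gives you the raw material for the hallucination-rate tracking described above: a rising share of declines or escalations is itself a signal worth alerting on.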

Question 7: What does your monitoring cover for AI systems specifically?

Standard application monitoring (uptime, error rates, response times) is necessary but insufficient for AI. You need monitoring that covers accuracy, hallucination rates, cost per call, latency distribution (not just average), drift detection, and usage patterns that indicate the system is being used in unexpected ways.

At EB Pearls, this is the Observability and Monitoring Framework within P02, extended for AI with accuracy tracking, cost alerting, and drift detection.

What a good answer includes: specific metrics they track for AI systems beyond standard application monitoring.

What a bad answer includes: "We use Datadog" or "We set up CloudWatch." Those are tools, not answers. The question is what they monitor, not what dashboard they display it on.
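Two of those metrics, latency distribution and cost per call, are easy to illustrate. The sketch below is an assumption-laden example rather than a monitoring product: `CallRecord`, the sample traffic, and the per-token prices are all invented for illustration, and real prices vary by provider and model.

```python
import math
from dataclasses import dataclass
from statistics import fmean


@dataclass(frozen=True)
class CallRecord:
    """One logged model call."""
    latency_ms: float
    input_tokens: int
    output_tokens: int


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    if not values:
        raise ValueError("no values recorded")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def cost_per_call(records: list[CallRecord],
                  usd_per_1k_input: float,
                  usd_per_1k_output: float) -> float:
    """Average spend per call, given per-1,000-token prices."""
    if not records:
        raise ValueError("no calls recorded")
    total = sum(
        r.input_tokens / 1000 * usd_per_1k_input
        + r.output_tokens / 1000 * usd_per_1k_output
        for r in records
    )
    return total / len(records)


# Hypothetical traffic: nine fast calls and one very slow one.
records = [CallRecord(200.0, 1000, 500) for _ in range(9)]
records.append(CallRecord(4000.0, 1000, 500))
latencies = [r.latency_ms for r in records]

print("mean latency:", fmean(latencies))          # 580.0
print("p50 latency:", percentile(latencies, 50))  # 200.0
print("p95 latency:", percentile(latencies, 95))  # 4000.0
print("cost per call (USD):",
      round(cost_per_call(records, usd_per_1k_input=0.01, usd_per_1k_output=0.03), 4))
```

The average of 580 ms describes no call that actually happened. The median shows what most users experience and the 95th percentile shows what the unlucky ones experience, which is why the distribution matters more than the mean.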

Group 3: What Happens After?

Question 8: What does post-launch support look like for AI specifically?

AI systems require ongoing attention that traditional software does not. Models update, prompts need tuning, accuracy drifts, cost patterns change, new edge cases surface. Post-launch support for AI is not just bug fixes — it is active maintenance of a system whose behaviour changes over time even if the code does not.

What a good answer includes: a defined support period (at EB Pearls, this is the 90-day Post-Launch Accountability Period), specific AI maintenance activities (prompt tuning cycles, accuracy reviews, cost optimisation), and clarity on what is included versus what requires additional investment.

What a bad answer includes: "We offer ongoing support packages." That is a billing arrangement, not a capability description.

Question 9: What happens if the foundation model provider changes pricing or deprecates the model we are using?

This is a stress test for architectural thinking. The agency's answer tells you whether they build systems that are resilient to vendor changes or systems that are tightly coupled to one provider's API.

What a good answer includes: abstraction layers that allow model swapping, evaluation frameworks that can benchmark a new model against the existing one, and a documented process for migration.

What a bad answer includes: "We'll cross that bridge when we come to it." You will come to it. OpenAI, Anthropic, Google, and every other provider adjust pricing and deprecate models. The question is whether the system is built to absorb the change.
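An abstraction layer of this kind can be small. The sketch below shows one possible shape, assuming the application only ever talks to a `TextModel` interface. `StubModelA` and `StubModelB` are hypothetical stand-ins with made-up behaviour, and no real provider SDK is called. In a real system each class would wrap one vendor's client.

```python
from typing import Protocol


class TextModel(Protocol):
    """The only interface the application is allowed to depend on."""
    name: str

    def complete(self, prompt: str) -> str: ...


class StubModelA:
    """Hypothetical current model (stand-in for a real provider client)."""
    name = "provider-a-model"

    def complete(self, prompt: str) -> str:
        return prompt.upper()


class StubModelB:
    """Hypothetical replacement model that fails on longer prompts."""
    name = "provider-b-model"

    def complete(self, prompt: str) -> str:
        return prompt.upper() if len(prompt) < 12 else "UNKNOWN"


def benchmark(model: TextModel, cases: list[tuple[str, str]]) -> float:
    """Share of (prompt, expected) cases the model gets exactly right."""
    if not cases:
        raise ValueError("benchmark set is empty")
    hits = sum(1 for prompt, expected in cases if model.complete(prompt) == expected)
    return hits / len(cases)


def choose_model(current: TextModel, candidate: TextModel,
                 cases: list[tuple[str, str]], tolerance: float = 0.0) -> TextModel:
    """Switch only if the candidate scores within `tolerance` of the current model."""
    if benchmark(candidate, cases) >= benchmark(current, cases) - tolerance:
        return candidate
    return current


CASES = [("hello", "HELLO"), ("invoice overdue", "INVOICE OVERDUE")]
selected = choose_model(StubModelA(), StubModelB(), CASES)
print(selected.name)  # provider-a-model
```

Because the application depends on `TextModel` rather than on a vendor SDK, a deprecation becomes a matter of writing one new wrapper and re-running `benchmark`. Here the candidate scores lower on the same cases, so the system stays on the current model.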

Question 10: Who owns the prompts, the evaluation data, and the fine-tuned models?

Intellectual property in AI is more complex than in traditional software. The code is yours — that is standard. But what about the prompts that were iteratively refined over months? The evaluation datasets that define accuracy? The fine-tuned model weights if you went down that path? The vector embeddings of your proprietary data?

If the agency has not thought about this, it will become a problem at contract end.

Question 11: What does your handover package include for AI systems?

For traditional software, a handover package includes code, documentation, and deployment guides. For AI, it should also include: prompt libraries with version history, evaluation frameworks and benchmark data, monitoring configuration, model configuration and fallback logic, data pipeline documentation, and a runbook for common operational scenarios (accuracy drop, cost spike, model deprecation).

At EB Pearls, this is the Structured Handover Package in P06 (The Right Team), extended for AI-specific artefacts.

Question 12: Can I talk to a client whose AI system you built at least six months ago?

Not a recent launch. Not a project in progress. A system that has been running long enough to reveal whether it works at production scale over time. Six months is the minimum useful reference window for AI — it takes that long for drift, edge cases, and operational realities to surface.

If the agency cannot produce a six-month reference, they may have AI capability but not a production AI track record. That is a meaningful distinction.

The Red Flags

Beyond the twelve questions, watch for these patterns during the vetting process.

The demo loop

The agency wants to show you what AI can do rather than discuss what it should do for your specific problem. Impressive demos are easy to build. They prove capability, not judgment.

The model-first pitch

The conversation starts with technology — "We use GPT-4o" or "We've built a custom RAG pipeline" — rather than with your business problem. The model is a tool. Leading with it is like an architecture firm leading with their CAD software.

The everything-is-AI agency

Six months ago they were a web development shop. Now every project on their site has "AI-powered" in the title. AI capability at the agency level requires investment in infrastructure, tooling, evaluation frameworks, and team expertise. It does not materialise because someone completed an online course.

No scoping before quoting

If an agency gives you a price without a structured discovery or scoping process, they are guessing. AI projects have more unknowns than traditional software — data quality, accuracy requirements, integration complexity — and guessing about these produces quotes that are either padded beyond reason or underquoted into project failure.

What to Do With This Information

Score each agency on the twelve questions. You do not need a perfect score — you need a pattern. An agency that answers eight questions well and four poorly probably has real capability with gaps. An agency that answers four well and eight poorly is presenting capability they do not yet have.

Weight Group 1 (what they have shipped) most heavily. Groups 2 and 3 can be developed; Group 1 cannot be faked. An agency with production experience and an imperfect process will learn. An agency with a perfect process and no production experience will discover their process is wrong the first time something breaks.
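If you want to keep the scoring honest across several agencies, a simple weighted tally is enough. The sketch below is one way to do it, not a prescribed formula: the weights (3 for Group 1, 1 for Groups 2 and 3) are an assumption chosen only to reflect "weight Group 1 most heavily", and each answer is reduced to a plain good or poor.

```python
# Question numbers per group, as laid out in this article.
GROUPS = {
    "shipped": range(1, 4),   # Questions 1-3
    "build": range(4, 8),     # Questions 4-7
    "after": range(8, 13),    # Questions 8-12
}

# Assumed weights: Group 1 counts three times as much as the others.
WEIGHTS = {"shipped": 3.0, "build": 1.0, "after": 1.0}


def weighted_score(answers: dict[int, bool]) -> float:
    """Return a 0..1 score; answers maps question number to True for a good answer."""
    missing = set(range(1, 13)) - set(answers)
    if missing:
        raise ValueError(f"unscored questions: {sorted(missing)}")
    earned = 0.0
    possible = 0.0
    for group, questions in GROUPS.items():
        for question in questions:
            possible += WEIGHTS[group]
            if answers[question]:
                earned += WEIGHTS[group]
    return earned / possible


# Two hypothetical agencies, each with eight good answers out of twelve.
agency_a = {q: q <= 8 for q in range(1, 13)}        # strong on what they shipped
agency_b = {q: 4 <= q <= 11 for q in range(1, 13)}  # weak on what they shipped

print(f"Agency A: {weighted_score(agency_a):.2f}")  # 0.78
print(f"Agency B: {weighted_score(agency_b):.2f}")  # 0.44
```

Both agencies answer eight questions well, but the weighting separates them: the one with gaps in Group 1 drops sharply. Treat the number as a prompt for discussion rather than a verdict.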

Then check the contract. Ensure it covers IP ownership for AI-specific artefacts, defines accuracy benchmarks and what happens when they are missed, includes a post-launch period with AI-specific maintenance, and specifies what the handover package contains.

Frequently Asked Questions

How many AI agencies should I evaluate?

Three is the practical minimum. With fewer than three, you lack a baseline for comparison. With more than five, the evaluation overhead exceeds the value of the extra information. Go deep on three agencies rather than shallow on seven.

Should I require an AI agency to build a proof of concept before committing to a full project?

Yes, but define what the POC must prove. A POC that demonstrates the AI can work on curated data proves almost nothing. A POC that tests accuracy on your actual data, under realistic conditions, against a defined benchmark is genuinely useful. Expect to pay for a real POC. Free POCs tend to run on curated data, so they prove capability, not fit.

What is a reasonable timeline for an AI project?

It depends on scope, but most production AI systems require three to six months from discovery to launch. Anyone promising production-grade AI in four weeks is either building a very narrow feature or underestimating production hardening. The iteration cycles alone — prompt tuning, accuracy benchmarking, edge case handling — take weeks.

How do I know if an agency is genuinely strong at AI or just reselling API access?

Ask them to describe a technical architecture decision they made on an AI project and why. Anyone can call an API. The decisions around how to chunk documents, which embedding model to use, how to handle multi-turn context, where to place guardrails, and how to design fallback logic — these require genuine expertise. If the answer sounds like API documentation, that is what they are working from.

What certifications or standards should I look for?

ISO/IEC 27001 (information security) and ISO 9001 (quality management) are meaningful baseline certifications for any agency handling your data. For AI specifically, ask whether they align their work to the NIST AI Risk Management Framework, ISO/IEC 42001 (AI management systems), or the EU AI Act risk classifications. These frameworks are relatively new and formal certification is rare, but awareness of them signals maturity.

Should the agency's team include data scientists or just software engineers?

For most production AI projects using foundation models, software engineers with AI systems experience are more valuable than traditional data scientists. You need people who understand prompt engineering, retrieval pipelines, model orchestration, and production operations — not people who train models from scratch. If your project requires custom model training or fine-tuning, data science capability matters. For API-based and RAG-based systems, production engineering capability matters more.

Roshan Manandhar, Solution Architect

Roshan drives digital transformation at EB Pearls, leveraging AI, blockchain, and emerging tech to enhance efficiency, productivity, and innovation.
