You are about to hand someone between $30K and $200K to build an AI system. The agency's website says they do AI. Their case studies mention AI. Their team page lists AI engineers. None of that tells you whether they have actually shipped AI into production — or whether they built a demo, put it in a slide deck, and called it a case study.
The difference between an agency that has delivered production AI and one that has delivered impressive prototypes is not visible on a website. It is visible in the answers to about twelve questions. This article gives you those questions, explains what good and bad answers sound like, and tells you what to do with the information.
Why We Wrote This
We wrote this because we have seen what happens when the wrong agency gets a production AI brief. The founder burns six months and a significant budget, concludes AI does not work for their use case, and either abandons the initiative or starts over with trust debt that the next agency has to repay. Most of that was avoidable.
Why the Standard Vetting Process Fails for AI
The way most founders vet a software agency — check the portfolio, read the reviews, talk to a reference, compare quotes — breaks down for AI work. Here is why.
Portfolios lie by omission.
An agency can show you a chatbot they built. They cannot easily show you whether that chatbot is still running in production, what its accuracy rate is, or whether the client replaced it six months later. A demo and a production system look identical in a case study.
Reviews lag reality.
AI capability is changing quarter by quarter. An agency that was genuinely leading-edge eighteen months ago may be running the same playbook while the technology has moved. Clutch reviews from 2024 tell you about capability in 2023. A five-star review for a traditional software project tells you nothing about AI capability.
Quotes hide scope.
AI project quotes routinely cover only the build — the application layer and the prompt engineering. Production hardening, monitoring, ongoing model management, and the iteration cycles that AI systems require are often left out. The cheapest quote is frequently the most expensive project.
The standard vetting process does not account for these gaps. You need a different set of questions.
The Twelve Questions
These are arranged in three groups: questions about what they have built, questions about how they build, and questions about what happens after.
Group 1: What Have You Actually Shipped?
Question 1: Show me an AI system you built that is still running in production today.
Not a demo. Not a prototype. Not a proof of concept that got positive feedback. A system that real users interact with, that has been running for at least three months, and that the client still pays to operate.
What a good answer sounds like: a specific system, a specific client (or anonymised with enough detail to be credible), specific metrics — accuracy rate, daily usage, uptime. The agency can describe what went wrong post-launch and how they fixed it.
What a bad answer sounds like: "We've built several AI solutions for clients across industries." Vague, plural, no specifics. If they cannot name one system that is running right now, they have not shipped production AI.
Question 2: What was the hardest production issue you encountered in an AI system, and how did you resolve it?
This question is a filter. Anyone who has shipped AI into production has a war story. Accuracy that dropped when real data looked different from test data. Latency that was acceptable in staging and unacceptable under load. A hallucination pattern that appeared only with a specific type of input.
If the answer is polished and generic — "We encountered some accuracy challenges and optimised our prompts" — they are describing a demo, not production. Real production issues are specific, ugly, and instructive.
Question 3: What AI system did you recommend a client *not* build?
An agency that says yes to every AI brief is either desperate for work or does not understand the technology well enough to know when it is the wrong answer. The most valuable thing an AI agency can do in the first conversation is tell you that AI is not the right solution for your problem. If they have never done that, they are selling technology, not solving problems.
Group 2: How Do You Build?
Question 4: Walk me through your process from brief to production. Where does AI-specific work happen?
You are looking for evidence that the agency has a structured process that accounts for the ways AI projects differ from traditional software: data assessment, accuracy benchmarking, prompt engineering as an iterative discipline, production hardening for AI-specific failure modes, and monitoring designed for model behaviour.
At EB Pearls, this structure is codified in the Built to Last™ 2.0 framework. P01 (The Right Problem) includes AI Validation — a structured assessment of whether AI is the right solution before a model is selected. P02 (The Right Infrastructure) includes AI-specific production readiness: drift detection, cost monitoring, accuracy tracking. P03 (The Right Architecture) includes RAG and agentic system design with documented architecture decisions.
What a good answer sounds like: a defined process with named stages, where AI-specific activities are explicit — not bolted onto a generic software delivery process.
What a bad answer sounds like: "We follow agile methodology" with no AI-specific activities mentioned. AI projects run on sprints like everything else, but the *activities within those sprints* are different.
Question 5: How do you define and measure accuracy for an AI system?
This is where AI theatre collapses. An agency doing real AI work can describe how they establish accuracy benchmarks before building, how they measure accuracy in production, and what they do when accuracy drops below the agreed threshold.
What a good answer includes: a defined evaluation framework, agreement on what "good enough" means before development starts, continuous monitoring against that benchmark post-launch, and a process for prompt tuning or model adjustment when accuracy drifts.
What a bad answer includes: "We test thoroughly before launch." Testing is not measuring. Testing tells you the system works today. Measuring tells you when it stops working tomorrow.
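To make the difference between testing and measuring concrete, here is a minimal sketch of an evaluation check that can be rerun on a schedule after launch. The names (`EvalCase`, `accuracy`, `meets_benchmark`, `toy_predict`) and the 0.9 threshold are illustrative assumptions, not any agency's actual framework, and exact-match scoring is a deliberate simplification; real systems often need task-specific scoring.

```python
from dataclasses import dataclass


@dataclass
class EvalCase:
    prompt: str
    expected: str


def accuracy(cases, predict):
    """Fraction of cases where the system's answer matches the expected answer."""
    if not cases:
        raise ValueError("evaluation set is empty")
    correct = sum(
        1
        for case in cases
        if predict(case.prompt).strip().lower() == case.expected.strip().lower()
    )
    return correct / len(cases)


def meets_benchmark(score, threshold):
    """True when measured accuracy is at or above the agreed threshold."""
    return score >= threshold


def toy_predict(prompt):
    # Hypothetical stand-in for the deployed AI system.
    answers = {
        "capital of France?": "Paris",
        "2 + 2?": "4",
        "largest planet?": "Saturn",  # deliberately wrong
    }
    return answers.get(prompt, "")


cases = [
    EvalCase("capital of France?", "paris"),
    EvalCase("2 + 2?", "4"),
    EvalCase("largest planet?", "Jupiter"),
    EvalCase("boiling point of water in C?", "100"),
]

score = accuracy(cases, toy_predict)
print(f"accuracy={score:.2f}, meets benchmark: {meets_benchmark(score, 0.9)}")
```

The point of the sketch is that the threshold is agreed before development starts and the same evaluation set is scored again in production, so a drop shows up as a number rather than as a user complaint.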
Question 6: How do you handle hallucinations in production?
Every LLM hallucinates. The question is not whether the agency can prevent it — they cannot — but whether they have a systematic approach to detecting it, reducing it, and handling it gracefully when it occurs.
Look for: output validation, retrieval quality scoring in RAG systems, confidence thresholds below which the system escalates to a human or declines to answer, monitoring that tracks hallucination rates over time, and a defined response when rates increase.
Avoid: "Our prompt engineering minimises hallucinations." Prompt engineering reduces hallucination. It does not eliminate it. An agency that claims otherwise has not dealt with production-scale input diversity.
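The confidence-threshold idea above can be sketched as a small routing function. This is an assumption-laden illustration: the `Draft` fields, the `route` function, and the 0.75 and 0.4 cut-offs are hypothetical, and a real system would calibrate its thresholds against evaluation data.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    ANSWER = "answer"
    ESCALATE = "escalate_to_human"
    DECLINE = "decline"


@dataclass
class Draft:
    text: str
    retrieval_score: float  # 0..1: how well the retrieved sources match the question
    grounded: bool          # did output validation find source support for the claims?


def route(draft, answer_at=0.75, escalate_at=0.4):
    """Answer only when the draft is grounded and retrieval quality is high."""
    if draft.grounded and draft.retrieval_score >= answer_at:
        return Action.ANSWER
    if draft.retrieval_score >= escalate_at:
        return Action.ESCALATE  # plausible but unverified: hand to a human
    return Action.DECLINE       # too little evidence to say anything useful


print(route(Draft("Refunds take 30 days.", 0.9, True)))
print(route(Draft("Refunds take 30 days.", 0.9, False)))
print(route(Draft("Refunds take 30 days.", 0.1, True)))
```

An agency with a systematic approach should be able to describe its own version of this decision, including who receives the escalations and how the rates of each outcome are tracked over time.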
Question 7: What does your monitoring cover for AI systems specifically?
Standard application monitoring (uptime, error rates, response times) is necessary but insufficient for AI. You need monitoring that covers accuracy, hallucination rates, cost per call, latency distribution (not just average), drift detection, and usage patterns that indicate the system is being used in unexpected ways.
At EB Pearls, this is the Observability and Monitoring Framework within P02, extended for AI with accuracy tracking, cost alerting, and drift detection.
What a good answer includes: specific metrics they track for AI systems beyond standard application monitoring.
What a bad answer includes: "We use Datadog" or "We set up CloudWatch." Those are tools, not answers. The question is what they monitor, not what dashboard they display it on.
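As a rough illustration of why latency distribution matters more than the average, the sketch below summarises a batch of call records. The record fields (`latency_ms`, `cost_usd`) and the sample numbers are invented for the example; the percentile uses the simple nearest-rank method.

```python
import math
from statistics import mean


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarise(calls):
    """Summarise latency distribution and cost per call for a batch of AI calls."""
    latencies = [call["latency_ms"] for call in calls]
    return {
        "mean_latency_ms": mean(latencies),
        "p50_latency_ms": percentile(latencies, 50),
        "p95_latency_ms": percentile(latencies, 95),
        "cost_per_call_usd": sum(call["cost_usd"] for call in calls) / len(calls),
    }


# Nine fast, cheap calls and one slow, expensive one (made-up sample data).
calls = [{"latency_ms": 400, "cost_usd": 0.002}] * 9 + [
    {"latency_ms": 6000, "cost_usd": 0.02}
]

summary = summarise(calls)
print(summary)
```

In this sample the mean latency is 960 ms, which looks tolerable, while the 95th percentile is 6,000 ms: one user in ten waited six seconds. That gap is what average-only dashboards hide.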
Group 3: What Happens After?
Question 8: What does post-launch support look like for AI specifically?
AI systems require ongoing attention that traditional software does not. Models update, prompts need tuning, accuracy drifts, cost patterns change, new edge cases surface. Post-launch support for AI is not just bug fixes — it is active maintenance of a system whose behaviour changes over time even if the code does not.
What a good answer includes: a defined support period (at EB Pearls, this is the 90-day Post-Launch Accountability Period), specific AI maintenance activities (prompt tuning cycles, accuracy reviews, cost optimisation), and clarity on what is included versus what requires additional investment.
What a bad answer includes: "We offer ongoing support packages." That is a billing arrangement, not a capability description.
Question 9: What happens if the foundation model provider changes pricing or deprecates the model we are using?
This is a stress test for architectural thinking. The agency's answer tells you whether they build systems that are resilient to vendor changes or systems that are tightly coupled to one provider's API.
What a good answer includes: abstraction layers that allow model swapping, evaluation frameworks that can benchmark a new model against the existing one, and a documented process for migration.
What a bad answer includes: "We'll cross that bridge when we come to it." You will come to it. OpenAI, Anthropic, Google, and every other provider adjust pricing and deprecate models. The question is whether the system is built to absorb the change.
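A minimal sketch of what an abstraction layer plus a benchmark-before-switching step might look like. Everything here is hypothetical: `TextModel`, `StubModel`, `pass_rate`, and `choose_model` are illustrative names, and the stub stands in for whatever wrapper a team writes around a real provider SDK.

```python
from typing import Protocol


class TextModel(Protocol):
    """The only interface the application depends on."""

    name: str

    def complete(self, prompt: str) -> str: ...


class StubModel:
    """Hypothetical stand-in for a provider SDK wrapper; returns canned answers."""

    def __init__(self, name, answers):
        self.name = name
        self._answers = answers

    def complete(self, prompt):
        return self._answers.get(prompt, "")


def pass_rate(model, cases):
    """Share of evaluation cases the model answers exactly as expected."""
    hits = sum(1 for prompt, expected in cases if model.complete(prompt) == expected)
    return hits / len(cases)


def choose_model(current, candidate, cases):
    """Switch only if the candidate scores at least as well on the same evaluation set."""
    if pass_rate(candidate, cases) >= pass_rate(current, cases):
        return candidate
    return current


cases = [("refund policy?", "30 days"), ("support hours?", "9-5")]
current = StubModel("current-model", {"refund policy?": "30 days", "support hours?": "9-5"})
candidate = StubModel("candidate-model", {"refund policy?": "30 days"})

chosen = choose_model(current, candidate, cases)
print(f"keep using: {chosen.name}")
```

Because the application talks only to the `TextModel` interface, a deprecation or price change becomes a matter of writing one new wrapper and rerunning the evaluation set, rather than rewriting call sites throughout the codebase.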
Question 10: Who owns the prompts, the evaluation data, and the fine-tuned models?
Intellectual property in AI is more complex than in traditional software. The code is yours — that is standard. But what about the prompts that were iteratively refined over months? The evaluation datasets that define accuracy? The fine-tuned model weights if you went down that path? The vector embeddings of your proprietary data?
If the agency has not thought about this, it will become a problem at contract end.
Question 11: What does your handover package include for AI systems?
For traditional software, a handover package includes code, documentation, and deployment guides. For AI, it should also include: prompt libraries with version history, evaluation frameworks and benchmark data, monitoring configuration, model configuration and fallback logic, data pipeline documentation, and a runbook for common operational scenarios (accuracy drop, cost spike, model deprecation).
At EB Pearls, this is the Structured Handover Package in P06 (The Right Team), extended for AI-specific artefacts.
Question 12: Can I talk to a client whose AI system you built at least six months ago?
Not a recent launch. Not a project in progress. A system that has been running long enough to reveal whether it works at production scale over time. Six months is the minimum useful reference window for AI — it takes that long for drift, edge cases, and operational realities to surface.
If the agency cannot produce a six-month reference, they may have AI capability but not a production AI track record. That is a meaningful distinction.
The Red Flags
Beyond the twelve questions, watch for these patterns during the vetting process.
The demo loop
The model-first pitch
The everything-is-AI agency
No scoping before quoting
What to Do With This Information
Score each agency on the twelve questions. You do not need a perfect score — you need a pattern. An agency that answers eight questions well and four poorly probably has real capability with gaps. An agency that answers four well and eight poorly is presenting capability they do not yet have.
Weight Group 1 (what they have shipped) most heavily. Groups 2 and 3 can be developed; Group 1 cannot be faked. An agency with production experience and an imperfect process will learn. An agency with a perfect process and no production experience will discover their process is wrong the first time something breaks.
Then check the contract. Ensure it covers IP ownership for AI-specific artefacts, defines accuracy benchmarks and what happens when they are missed, includes a post-launch period with AI-specific maintenance, and specifies what the handover package contains.
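If it helps to put numbers on the scoring described above, one possible approach is a weighted pass rate. The grouping follows the article (questions 1 to 3, 4 to 7, and 8 to 12); the weights are purely an assumption for illustration, with Group 1 counted double.

```python
# Question numbers per group, as laid out in this article.
GROUPS = {1: range(1, 4), 2: range(4, 8), 3: range(8, 13)}

# Assumed weights: Group 1 (what they have shipped) counts double.
WEIGHTS = {1: 2.0, 2: 1.0, 3: 1.0}


def weighted_score(answers):
    """answers maps question number (1-12) to True if the agency answered well."""
    earned = 0.0
    possible = 0.0
    for group, questions in GROUPS.items():
        for question in questions:
            possible += WEIGHTS[group]
            if answers.get(question, False):
                earned += WEIGHTS[group]
    return earned / possible


# Agency A answers questions 1-8 well; Agency B answers only 9-12 well.
agency_a = {q: q <= 8 for q in range(1, 13)}
agency_b = {q: q >= 9 for q in range(1, 13)}

print(f"Agency A: {weighted_score(agency_a):.2f}")
print(f"Agency B: {weighted_score(agency_b):.2f}")
```

The number is a prompt for judgment, not a substitute for it: two agencies with similar scores can still differ sharply on Group 1, which is the group that matters most.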
Frequently Asked Questions
How many AI agencies should I evaluate?
Should I require an AI agency to build a proof of concept before committing to a full project?
What is a reasonable timeline for an AI project?
How do I know if an agency is genuinely strong at AI or just reselling API access?
What certifications or standards should I look for?
Should the agency's team include data scientists or just software engineers?
Roshan drives digital transformation at EB Pearls, leveraging AI, blockchain, and emerging tech to enhance efficiency, productivity, and innovation.