A foundation model provider released an update. The changelog looked routine — performance improvements, expanded context window, minor behavioural refinements. The engineering team deployed it to production the same afternoon. No evaluation suite. No staged rollout. No comparison against the previous version on production query types. The system kept running. Latency was stable. No errors surfaced in the logs. Two weeks later, a product manager flagged that a critical query type — the one responsible for the highest-value user interactions — had degraded. Accuracy on that query class had dropped 6% since the update. Not catastrophically wrong. Just consistently worse, in ways that eroded user trust one interaction at a time.
An untested model update is a production incident waiting for traffic. The model did not crash. It did something worse — it got subtly wrong, in a way that no infrastructure monitor would catch and no error log would surface. The team spent two weeks serving degraded responses to their most important query type because the deployment process had no evaluation gate between "new model available" and "new model in production."
This is not a tooling problem. It is a process problem. Teams that treat model updates like software patches — deploy, monitor for errors, move on — will keep breaking production in ways that take weeks to notice and months to remediate. Model retraining requires a protocol: a defined cadence, testing gates, approval checkpoints, and specific procedures for when the foundation model underneath your application changes without warning.
At EB Pearls, model retraining is governed by protocol, not instinct. With 360+ AI-native developers and 900+ projects delivered across 1,400+ businesses, we have watched teams learn the hard way that the moment you skip evaluation is the moment production starts degrading. Built to Last™ delivery treats every model change — whether a full retrain, a fine-tuning update, or a foundation model version bump — as a controlled release that must pass defined gates before reaching users.
Why Model Updates Break Production
Conventional software updates are largely deterministic. You change the code, you test the code, you deploy the code. If the tests pass, the new version behaves as specified for every case the tests cover. Model updates are probabilistic. A retrained model that passes aggregate metrics can still fail on specific query types, edge cases, or user segments that the evaluation set does not adequately represent. The failure mode is not a crash. It is a distribution shift in output quality that manifests differently depending on what you ask the model to do.
Three categories of model change create production risk, and each requires a different response protocol.
Full retraining replaces the model weights entirely. New training data, new optimisation cycles, new learned representations. Even when the architecture stays the same, the model's behaviour can shift in unpredictable ways — better on some inputs, worse on others, subtly different in tone or formatting in ways that downstream systems depend on.
Fine-tuning updates modify a subset of the model's weights using new data. The risk surface is smaller than full retraining but not negligible. Fine-tuning on biased or unrepresentative data can degrade performance on the exact query types the update was meant to improve.
Foundation model updates are the most dangerous because you control none of the variables. The provider changes the underlying model — new weights, new training data, potentially new architecture — and your application inherits those changes. Your prompts, your retrieval pipelines, your output parsers — all were calibrated against the previous version. The new version may handle them differently in ways that are invisible without systematic evaluation.
Google's ML best practices documentation identifies untested model updates as a primary failure mode in production ML systems — not because teams lack the tools to test, but because they lack the process discipline to require testing before every deployment. The tooling exists. The protocol often does not.
What a Model Retraining Protocol Actually Looks Like
A retraining protocol is not a checklist taped to a wall. It is an enforceable process with defined triggers, evaluation gates, approval authorities, and rollback procedures. Every model change — regardless of size — passes through the same pipeline. The gates get stricter as the change gets larger, but no change skips the pipeline entirely.
Retraining Triggers
A protocol must define when retraining happens. Ad hoc retraining — "the model seems off, let's retrain" — is how teams waste compute on models that do not need updating and miss degradation on models that do.
Drift-triggered retraining fires when monitoring detects that input distributions or output quality have shifted beyond defined thresholds. This is the most efficient trigger because it retrains only when the data says the model needs it. It requires a functioning drift detection and monitoring system that tracks feature distributions, prediction distributions, and accuracy on labelled samples.
Scheduled retraining operates on a fixed cadence (weekly, monthly, or quarterly) regardless of detected drift. This suits domains where data changes continuously and the cost of monitoring is high relative to the cost of retraining. The risk is twofold: spending compute on retraining that was not needed, or missing degradation that develops between cycles.
Event-triggered retraining fires when a known external change occurs: a new data source is integrated, a business rule changes, a foundation model provider releases an update, or a significant shift in user behaviour is observed. These triggers require human judgement to identify the event but automated processes to execute the response.
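The three triggers above can be combined into a single decision function. What follows is a minimal sketch, not a production monitor. It assumes one numeric feature, uses the population stability index (PSI) as the drift score, and treats the 0.2 threshold and 30-day maximum interval as illustrative defaults rather than recommended values. The names psi and retrain_reason are hypothetical.

```python
import math
from typing import Optional, Sequence


def psi(baseline: Sequence[float], current: Sequence[float], bins: int = 10) -> float:
    """Population stability index between two samples of one numeric feature."""
    lo, hi = min(baseline), max(baseline)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if the baseline is constant

    def shares(values: Sequence[float]) -> list:
        counts = [0] * bins
        for v in values:
            # Values outside the baseline range are clamped into the edge bins.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # A small floor keeps the logarithm defined for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    b, c = shares(baseline), shares(current)
    return sum((ci - bi) * math.log(ci / bi) for bi, ci in zip(b, c))


def retrain_reason(drift_score: float, days_since_last: int, external_event: bool,
                   drift_threshold: float = 0.2,
                   max_interval_days: int = 30) -> Optional[str]:
    """Return which trigger fired, or None if no retraining is due."""
    if external_event:                        # event-triggered
        return "event"
    if drift_score > drift_threshold:         # drift-triggered
        return "drift"
    if days_since_last >= max_interval_days:  # scheduled backstop
        return "scheduled"
    return None
```

In practice the drift score would be computed per feature and per prediction distribution, and the thresholds would be set for the domain in question.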
The agentic AI delivery process at EB Pearls defines retraining triggers during the Discovery Workshop, calibrated to the specific domain and data velocity of each project. A fraud detection model operating on daily transaction data has different retraining triggers than a content recommendation system operating on weekly engagement patterns.
Evaluation Gates
Every model change must pass evaluation before reaching production. The evaluation suite is not a single accuracy number — it is a structured set of tests that cover different dimensions of model behaviour.
Aggregate metrics measure overall performance: accuracy, precision, recall, F1, or domain-specific metrics like BLEU, ROUGE, or mean reciprocal rank. The new model must meet or exceed the current production model on these metrics. This is the minimum gate — necessary but not sufficient.
Segment-level evaluation breaks performance down by query type, user segment, input category, or any other dimension that matters to the business. A model that improves aggregate accuracy by 2% while degrading accuracy on the highest-value query type by 6% is a net negative. Segment-level evaluation catches regressions that aggregate metrics hide.
Regression tests are specific input-output pairs where the expected behaviour is well-defined. These are the model equivalent of unit tests. If the current production model handles a critical query correctly and the candidate model does not, the candidate fails — regardless of what aggregate metrics show.
Behavioural consistency checks compare the candidate model's outputs against the production model's outputs on a representative sample. Large divergences in output format, tone, length, or structure indicate that downstream systems may break even if accuracy metrics are stable.
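The aggregate, segment-level, and regression gates above can be expressed as one function that returns the reasons a candidate fails. This is a sketch under stated assumptions: both models are scored on the same labelled cases in the same order, accuracy is the metric, and the one-point segment tolerance is illustrative. EvalResult and gate are hypothetical names.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class EvalResult:
    segment: str            # e.g. a query type such as "billing"
    correct: bool           # whether the model's answer matched the label
    critical: bool = False  # True for hand-picked regression cases


def accuracy_by_segment(results: List[EvalResult]) -> Dict[str, float]:
    totals, hits = defaultdict(int), defaultdict(int)
    for r in results:
        totals[r.segment] += 1
        hits[r.segment] += int(r.correct)
    return {seg: hits[seg] / totals[seg] for seg in totals}


def gate(production: List[EvalResult], candidate: List[EvalResult],
         max_segment_drop: float = 0.01) -> List[str]:
    """Return failure reasons; an empty list means the candidate passes."""
    failures = []
    # Gate 1: aggregate accuracy must not fall.
    prod_acc = sum(r.correct for r in production) / len(production)
    cand_acc = sum(r.correct for r in candidate) / len(candidate)
    if cand_acc < prod_acc:
        failures.append(f"aggregate accuracy fell: {prod_acc:.3f} -> {cand_acc:.3f}")
    # Gate 2: no segment may regress by more than the tolerance.
    prod_seg = accuracy_by_segment(production)
    cand_seg = accuracy_by_segment(candidate)
    for seg, p_acc in prod_seg.items():
        c_acc = cand_seg.get(seg, 0.0)
        if p_acc - c_acc > max_segment_drop:
            failures.append(f"segment '{seg}' regressed: {p_acc:.3f} -> {c_acc:.3f}")
    # Gate 3: critical cases that production gets right must still pass.
    for p, c in zip(production, candidate):
        if p.critical and p.correct and not c.correct:
            failures.append(f"critical regression in segment '{p.segment}'")
    return failures
```

A candidate that lifts aggregate accuracy while dropping one segment still fails here, which is the situation described in the opening example.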
Approval Process
Evaluation produces data. Approval requires a human decision. The protocol must define who has authority to approve model deployments and what information they need to make that decision.
For routine updates where the candidate model passes all evaluation gates with clear margins, approval can be delegated to the ML engineering lead. For updates where the candidate model shows mixed results — improvements on some segments, regressions on others — approval should escalate to include product stakeholders who can assess the business impact of the trade-offs. For foundation model updates, where the change surface is large and partially unknown, approval should include a cross-functional review that encompasses engineering, product, and operations.
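The escalation rule above is simple enough to encode, so that the pipeline, rather than habit, decides who must sign off. The sketch below is illustrative: the role names are placeholders, and it treats any negative segment delta as a "mixed result", which is a simplifying assumption.

```python
from enum import Enum
from typing import Dict, List


class ChangeType(Enum):
    RETRAIN = "retrain"
    FINE_TUNE = "fine_tune"
    FOUNDATION_UPDATE = "foundation_update"


def required_approvers(change: ChangeType, segment_deltas: Dict[str, float]) -> List[str]:
    """segment_deltas maps segment -> candidate accuracy minus production accuracy."""
    if change is ChangeType.FOUNDATION_UPDATE:
        # Large, partially unknown change surface: cross-functional review.
        return ["ml_engineering_lead", "product", "operations"]
    if any(delta < 0 for delta in segment_deltas.values()):
        # Mixed results: product must weigh the trade-off.
        return ["ml_engineering_lead", "product"]
    return ["ml_engineering_lead"]
```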
The approval artefact — the evaluation report, the comparison data, the decision rationale — must be stored alongside the model version. When someone asks in six months why the model changed, the answer should be in the version history, not in someone's memory.
Rollback Procedures
Every deployment must have a defined rollback path. If the new model passes evaluation but degrades in production on patterns the evaluation suite did not cover, the team needs to revert to the previous version within minutes, not hours.
This requires model versioning infrastructure that maintains at least two deployable versions at all times: the current production model and the previous production model. DevOps practices at EB Pearls enforce immutable model artefacts — every deployed model version is stored with its weights, configuration, evaluation results, and deployment metadata, so rollback is a configuration change, not a rebuild.
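To make "rollback is a configuration change" concrete, here is a minimal sketch of a deployment pointer that tracks the current and previous version IDs. It is not a description of any particular registry product. DeploymentPointer is a hypothetical name, and a real system would persist this state rather than hold it in memory.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DeploymentPointer:
    """Tracks which immutable model version is live; artefacts are never edited."""
    current: Optional[str] = None
    previous: Optional[str] = None
    history: List[str] = field(default_factory=list)

    def promote(self, version_id: str) -> None:
        # The outgoing version stays deployable as the rollback target.
        self.previous, self.current = self.current, version_id
        self.history.append(f"promote:{version_id}")

    def rollback(self) -> str:
        if self.previous is None:
            raise RuntimeError("no previous version available to roll back to")
        self.current, self.previous = self.previous, self.current
        self.history.append(f"rollback:{self.current}")
        return self.current
```

Because rollback only swaps two identifiers, its duration is bounded by how quickly the serving layer picks up the new pointer, not by how long a model takes to rebuild.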
How to Handle Foundation Model Updates
Foundation model updates deserve their own section because they are fundamentally different from updates you control. When OpenAI ships a new GPT version, when Anthropic updates Claude, when Google releases a new Gemini variant — your application inherits changes to the reasoning engine underneath it. Your prompts were engineered against the previous version's behaviour. Your output parsers expect the previous version's formatting. Your evaluation benchmarks reflect the previous version's capabilities.
Pin your model version. Every production deployment should specify an exact model version, not a floating alias. If your API calls target "gpt-4" rather than a specific snapshot, you are opting into uncontrolled updates. Pin the version, then upgrade deliberately.
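One way to enforce pinning is a configuration check that fails the build when a floating alias appears. The sketch below rests on a loud assumption: that pinned identifiers end in a date-like suffix. Provider naming conventions vary, so treat the pattern as a placeholder to adapt after checking your provider's documentation. The function names are hypothetical.

```python
import re

# Assumption: a pinned identifier ends in a date-like suffix
# (MMDD, YYYYMMDD, or YYYY-MM-DD). Adjust for your provider.
_SNAPSHOT_SUFFIX = re.compile(r"-(\d{4}-\d{2}-\d{2}|\d{8}|\d{4})$")


def is_pinned(model_id: str) -> bool:
    return bool(_SNAPSHOT_SUFFIX.search(model_id))


def assert_pinned(config: dict) -> None:
    """Fail fast (for example in CI) if any configured model is a floating alias."""
    floating = [name for name, model_id in config.items() if not is_pinned(model_id)]
    if floating:
        raise ValueError(
            f"floating model alias configured for: {', '.join(sorted(floating))}"
        )
```

With this in place, a configuration entry such as "gpt-4" is rejected, while an identifier carrying a snapshot suffix passes.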
Maintain a shadow evaluation pipeline. When a new foundation model version is available, deploy it in a shadow environment that mirrors production traffic but does not serve responses to users. Run your full evaluation suite against the shadow deployment. Compare results against the pinned production version. This is where the 6% accuracy drop gets caught — in a test cycle, not in production.
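The core of a shadow pipeline is small: answer from production, record what the candidate would have said, and never let the candidate affect the user. The sketch below is synchronous for readability and compares answers by exact string match. Both are simplifications, since a real pipeline would call the candidate asynchronously and score answers with the evaluation suite. All names are hypothetical.

```python
from typing import Callable, List, Tuple

ModelFn = Callable[[str], str]
ShadowLog = List[Tuple[str, str, str]]  # (query, live answer, shadow answer)


def serve_with_shadow(query: str, production: ModelFn, candidate: ModelFn,
                      shadow_log: ShadowLog) -> str:
    """Answer from production; record the candidate's answer for offline comparison."""
    live_answer = production(query)
    try:
        shadow_answer = candidate(query)
    except Exception as exc:  # a failing candidate must never affect the user
        shadow_answer = f"<error: {exc}>"
    shadow_log.append((query, live_answer, shadow_answer))
    return live_answer


def disagreement_rate(shadow_log: ShadowLog) -> float:
    """Share of logged queries where the two models answered differently."""
    if not shadow_log:
        return 0.0
    differing = sum(1 for _, live, shadow in shadow_log if live != shadow)
    return differing / len(shadow_log)
```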
Evaluate prompt compatibility. Foundation model updates can change how models interpret prompts. A prompt that produced structured JSON reliably on the previous version may produce slightly different formatting on the new version. Test every production prompt template against the candidate version and verify that output parsing, tool calling, and downstream integrations still function correctly.
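A prompt compatibility test can be as plain as running every production prompt through the candidate and checking that the output still satisfies the parser's contract. The sketch below assumes the contract is "a JSON object containing certain keys". The model is passed in as a callable so that no particular provider SDK is implied. check_json_contract is a hypothetical name.

```python
import json
from typing import Callable, Dict, List


def check_json_contract(model: Callable[[str], str], prompts: List[str],
                        required_keys: List[str]) -> Dict[str, str]:
    """Return {prompt: problem} for each prompt whose output breaks the contract."""
    problems: Dict[str, str] = {}
    for prompt in prompts:
        raw = model(prompt)
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError as exc:
            problems[prompt] = f"not valid JSON: {exc.msg}"
            continue
        if not isinstance(parsed, dict):
            problems[prompt] = "top-level value is not an object"
            continue
        missing = [key for key in required_keys if key not in parsed]
        if missing:
            problems[prompt] = f"missing keys: {missing}"
    return problems
```

An empty result means every prompt still produces output the downstream parser can consume. Anything else is a reason to hold the upgrade.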
Stage the rollout. Even after evaluation passes, deploy the new foundation model version to a canary population first — a small percentage of production traffic. Monitor canary metrics for at least 48 hours before expanding to full production. This catches interaction effects between the new model version and real user behaviour that shadow evaluation cannot fully replicate.
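Canary routing is usually done by hashing a stable identifier so that each user consistently sees the same model for the duration of the rollout. The sketch below shows that idea using only the standard library. The salt value and function name are illustrative.

```python
import hashlib


def routes_to_canary(user_id: str, canary_percent: float, salt: str = "rollout-1") -> bool:
    """Deterministically assign a user to the canary group.

    The same user_id and salt always give the same answer, so a user does not
    flip between model versions mid-conversation. Changing the salt reshuffles
    the assignment for the next rollout.
    """
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999
    return bucket < canary_percent * 100
```

Calling routes_to_canary(user_id, 5) sends roughly five percent of users to the candidate. Raising the percentage keeps the users already in the canary and adds more.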
Research from Microsoft on foundation model deployment emphasises that evaluation suites must be refreshed when the foundation model changes, not just when the application changes. A benchmark that adequately tested the previous model version may have blind spots for the new version's different failure modes.
The Update That Slipped Through
A mid-sized SaaS platform had built an AI-powered customer support system on top of a foundation model. The system handled tier-one support queries — account questions, billing inquiries, feature explanations — and escalated complex issues to human agents. Performance had been strong for four months. Resolution rates were high, escalation rates were low, and customer satisfaction scores on AI-handled queries were within acceptable ranges.
The foundation model provider released a minor version update. The engineering team reviewed the changelog — improved reasoning, better instruction following, reduced hallucination rates. Every line item sounded like an improvement. The team updated the model version in their API configuration and deployed to production. No shadow evaluation. No segment-level testing. No canary period.
The system continued to function. Aggregate metrics looked stable for the first week. But buried in the data, a pattern was forming. The new model version handled billing queries differently — not incorrectly in most cases, but with subtle changes in phrasing and structure that confused the downstream classification system responsible for routing escalations. Queries that should have been escalated were being resolved with technically accurate but practically insufficient responses. Customers were getting answers that addressed the letter of their question but missed the intent.
The degradation went unnoticed for two weeks. When the support team finally identified the pattern, they had accumulated a backlog of under-served customers and a measurable dip in satisfaction scores for the billing query segment. A gated evaluation protocol (shadow deployment, segment-level testing, canary rollout) would have caught the regression in a single test cycle. The team's evaluation suite already included billing query benchmarks; it simply was not run against the new version. Had it been, the divergence would have shown up and the team would have investigated before deploying.
When Each Protocol Element Matters Most
Pin model versions from day one if your application depends on specific model behaviours: output formatting, tool calling patterns, reasoning chains, or anything else that downstream systems parse or depend on. Running floating model versions in production is an uncontrolled experiment on your users.
Implement segment-level evaluation if your model serves multiple query types, user segments, or use cases. Aggregate metrics hide segment-level regressions. The higher the variance in your query distribution, the more essential segment-level testing becomes.
Require cross-functional approval if model changes affect user-facing behaviour. Engineers evaluate technical performance. Product managers evaluate business impact. Neither perspective alone is sufficient for changes that affect how users experience the product.
Build canary deployment infrastructure if your evaluation suite cannot fully replicate production conditions. Shadow evaluation tests the model against representative inputs. Canary deployment tests the model against real users at controlled scale. Both are necessary for high-stakes applications. The concept-to-launch delivery process at EB Pearls includes canary infrastructure as a standard deployment component for AI systems, not an optional upgrade.
Where to Start
Audit your current model deployment process. Ask three questions. First: if you needed to roll back to the previous model version right now, how long would it take? If the answer is more than fifteen minutes, you do not have adequate versioning. Second: when was the last time you evaluated a model update on segment-level metrics before deploying? If the answer is never, your evaluation suite has a critical gap. Third: who approved the last model change, and where is the approval documented? If the answer involves uncertainty, your approval process is informal at best.
When you are ready to build a retraining protocol that catches regressions before they reach production, talk to our team. We design model update pipelines with evaluation gates, approval workflows, and rollback procedures — because the conversation about why the model broke is always more expensive than the protocol that prevents it.
Frequently Asked Questions
How often should we retrain our AI model?
Retraining frequency depends on data velocity and drift patterns, not arbitrary schedules. Models operating on rapidly changing data — fraud detection, real-time pricing, trending content — may need weekly or even daily retraining. Models in stable domains — document classification, medical imaging — may perform well for months between retraining cycles. The correct answer is to retrain when drift detection indicates the model needs it, supplemented by a maximum interval that ensures no model goes longer than a defined period without evaluation.
What evaluation metrics should we use before deploying a retrained model?
Use a layered evaluation approach. Start with aggregate metrics relevant to your domain — accuracy, F1, BLEU, or custom business metrics. Then evaluate at the segment level, breaking performance down by query type, user segment, and input category. Include regression tests on critical input-output pairs. Finally, compare output distributions between the candidate model and the current production model to catch behavioural shifts that accuracy metrics miss. No single metric is sufficient.
Who should approve model deployments in production?
Approval authority should scale with the risk of the change. Routine retraining updates where the candidate clearly outperforms the current model across all segments can be approved by the ML engineering lead. Updates with mixed results — some segments improve, others regress — require product stakeholder input to assess business trade-offs. Foundation model version changes should involve cross-functional review including engineering, product, and operations, because the change surface is large and partially unpredictable.
How do we handle foundation model updates we cannot control?
Pin your production deployments to specific model versions rather than floating aliases. When a new version is available, deploy it in a shadow environment and run your full evaluation suite against it. Test prompt compatibility explicitly — foundation model updates can change how prompts are interpreted. Stage the rollout through a canary deployment before full production. Treat every foundation model update as an external dependency change that requires the same rigour as a major version upgrade in any other critical dependency.
What should a model rollback procedure look like?
A rollback procedure requires three elements: versioned model artefacts stored in an immutable registry, a deployment configuration that can switch between model versions without rebuilding infrastructure, and pre-defined rollback triggers that specify when to revert. The previous production model should always be deployable within minutes. Rollback triggers should include accuracy drops below defined thresholds on production monitoring, anomalous output distributions detected by drift monitoring, and critical regression test failures identified post-deployment.
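The three rollback triggers can be written down as an explicit policy, so that the decision to revert does not depend on who is on call. This is a sketch with illustrative thresholds. RollbackPolicy and rollback_reasons are hypothetical names, and the drift score is assumed to come from whatever drift monitor is already in place.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class RollbackPolicy:
    min_accuracy: float = 0.90      # floor on labelled production samples
    max_drift_score: float = 0.2    # ceiling on the output drift score
    max_critical_failures: int = 0  # post-deployment regression test failures allowed


def rollback_reasons(policy: RollbackPolicy, accuracy: float,
                     drift_score: float, critical_failures: int) -> List[str]:
    """Return every trigger that fired; any non-empty result means revert."""
    reasons = []
    if accuracy < policy.min_accuracy:
        reasons.append(f"accuracy {accuracy:.3f} below {policy.min_accuracy:.3f}")
    if drift_score > policy.max_drift_score:
        reasons.append(f"drift score {drift_score:.3f} above {policy.max_drift_score:.3f}")
    if critical_failures > policy.max_critical_failures:
        reasons.append(f"{critical_failures} critical regression test failure(s)")
    return reasons
```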
What is a canary deployment for AI models?
A canary deployment routes a small percentage of production traffic — typically 5 to 10 percent — to the candidate model while the majority continues to be served by the current production model. Canary metrics are monitored for a defined period, usually 48 to 72 hours, comparing the candidate's real-world performance against the production baseline. If the canary metrics meet or exceed the baseline, traffic is gradually shifted to the candidate. If metrics degrade, the canary is terminated and all traffic reverts to the production model. This catches issues that offline evaluation cannot replicate.
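The promote-or-terminate decision described above reduces to a small function. The sketch below assumes a single higher-is-better metric, such as resolution rate, and an illustrative traffic ramp. Real canaries compare several metrics and apply statistical tests before trusting a difference. The function names are hypothetical.

```python
from typing import Sequence


def canary_verdict(baseline: float, canary: float, hours_observed: float,
                   min_hours: float = 48.0, tolerance: float = 0.0) -> str:
    """Return 'abort', 'wait', or 'expand' for one higher-is-better metric."""
    if canary < baseline - tolerance:
        return "abort"   # degrade: terminate the canary, revert all traffic
    if hours_observed < min_hours:
        return "wait"    # not enough observation time yet
    return "expand"      # meets or exceeds baseline: shift more traffic


def next_traffic_percent(current: float,
                         steps: Sequence[float] = (5, 25, 50, 100)) -> float:
    """Next rung of an illustrative ramp; stays at the top once fully rolled out."""
    for step in steps:
        if step > current:
            return step
    return steps[-1]
```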
How do we version AI models effectively?
Model versioning requires tracking more than weights. Each version should be stored as an immutable artefact that includes model weights, training configuration, training data reference, evaluation results, deployment metadata, and the approval decision. Use a model registry — MLflow, Weights & Biases, or a cloud-native equivalent — that enforces immutability and provides lineage tracking. Every production deployment should reference a specific version ID, and the registry should maintain the full history of which version was deployed when, by whom, and based on what evaluation data.
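To show what such a record might hold, here is a minimal in-memory sketch. It is not the API of MLflow, Weights & Biases, or any other registry. The field names mirror the list above and are otherwise hypothetical, and a real registry would add persistence and access control.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)  # frozen: fields cannot be reassigned after creation
class ModelVersion:
    version_id: str
    weights_sha256: str
    training_config_ref: str
    training_data_ref: str
    evaluation_report_ref: str
    approved_by: Tuple[str, ...]
    decision_rationale: str


class Registry:
    def __init__(self) -> None:
        self._versions: Dict[str, ModelVersion] = {}
        self._deployments: List[Tuple[str, str, str]] = []  # append-only

    def register(self, version: ModelVersion) -> None:
        if version.version_id in self._versions:
            raise ValueError(f"version {version.version_id} exists and is immutable")
        self._versions[version.version_id] = version

    def record_deployment(self, version_id: str, deployed_by: str, timestamp: str) -> None:
        if version_id not in self._versions:
            raise KeyError(f"unknown version {version_id}")
        self._deployments.append((version_id, deployed_by, timestamp))

    def lineage(self) -> List[Tuple[str, str, str]]:
        """Which version was deployed when and by whom, oldest first."""
        return list(self._deployments)
```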
Discover custom app development and AI trends with Nikesh Maharjan, EB Pearls' Senior Engineering Manager. Learn how we build innovative solutions.