A production database failed on a Tuesday morning. The engineering team opened the disaster recovery runbook, a document created eighteen months earlier during the original infrastructure build. The runbook referenced a secondary database server that had been decommissioned six months before the outage. It described a failover process for an architecture that no longer existed. The backup restoration steps assumed a tool the team had replaced during a platform migration. The documented recovery time was two hours. The actual recovery took fourteen.
Fourteen hours of downtime. Not because the team lacked a disaster recovery plan — they had one. Not because the backups were missing — they existed. Because the plan had never been tested against the infrastructure it was supposed to recover. Every assumption in the document was correct at the time of writing and wrong at the time of execution. The gap between documented recovery and actual recovery was twelve hours of improvisation under pressure.
This is not an edge case. It is the default state of disaster recovery across most organisations. At EB Pearls, Disaster Recovery Design and Testing™ is embedded into the operational lifecycle of every system we build. Across 900+ projects delivered for over 1,400 businesses, we have seen this pattern repeatedly: teams invest in creating DR plans but never invest in proving they work. Our approach treats DR testing as a recurring operational practice, not a one-time documentation exercise, because a disaster recovery plan that has never been tested is a document, not a plan.
This article covers how to design disaster recovery for real-world execution and how to test it before you need it.
Why Untested DR Plans Fail When You Need Them Most
Disaster recovery plans decay. They decay because infrastructure changes, because team members leave, because dependencies are added and removed, because the architecture evolves and the documentation does not. The plan written during the original build reflects the system as it was, not the system as it is.
This decay is invisible until execution. A DR plan sitting in a wiki looks exactly the same whether its instructions are current or eighteen months stale. Unless the team builds one, no automated check verifies that the documented failover target still exists, that the backup restoration process still works with the current database version, or that the person named as the recovery lead still works at the organisation.
The consequences of untested DR plans compound in specific, predictable ways. Backup files exist but have never been restored — and when restoration is attempted during an actual incident, the backup format is incompatible with the current database version. Failover infrastructure is provisioned but has never received production traffic — and when failover is triggered, the secondary environment lacks configuration changes deployed to production over the past year. Recovery procedures reference manual steps — and the engineer performing those steps has never executed them before and is doing so at 3 AM under incident pressure.
The project delivery framework at EB Pearls treats disaster recovery validation as an ongoing operational obligation. DR plans are not signed off at project delivery and forgotten. They are tested, updated, and retested on a defined schedule because the system they describe is constantly changing.
Designing Disaster Recovery That Can Actually Execute
Effective disaster recovery design starts with the question that most DR plans skip: how will this actually be executed during an incident? Not what the architecture diagram looks like, but what commands will be run, in what order, by whom, using what credentials, against what infrastructure.
Define Recovery Objectives Before Architecture
Every DR design begins with two numbers: Recovery Time Objective (RTO) and Recovery Point Objective (RPO). RTO defines the maximum acceptable downtime. RPO defines the maximum acceptable data loss. These are business decisions, not technical ones — they should be set by stakeholders who understand the cost of downtime per hour and the cost of data loss per transaction.
The architecture follows from these numbers. A four-hour RTO allows for cold standby restoration from backups. A fifteen-minute RTO requires warm standby infrastructure that can receive traffic with minimal promotion time. A near-zero RTO requires active-active multi-region deployment with automated failover. Each tier carries different infrastructure costs, operational complexity, and testing requirements.
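As a rough illustration of how the architecture tier follows from the RTO, the mapping can be written as a small lookup. The function name and the exact cut-offs between tiers are assumptions made for this sketch; the text above gives only three example points (four hours, fifteen minutes, near zero), and real thresholds depend on the platform and on what promotion of a standby actually costs in time.

```python
from datetime import timedelta


def standby_tier(rto: timedelta) -> str:
    """Suggest a recovery architecture tier for a given RTO.

    The boundaries (one minute, one hour) are illustrative assumptions,
    not values taken from the article or from any standard.
    """
    if rto < timedelta(minutes=1):
        return "active-active multi-region"
    if rto < timedelta(hours=1):
        return "warm standby"
    return "cold standby from backups"


print(standby_tier(timedelta(hours=4)))
print(standby_tier(timedelta(minutes=15)))
```

Whatever the thresholds end up being, the point is that the business sets the input and the tier is derived from it, not the other way round.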
Design for the Recovery Operator
DR plans fail when they assume expert knowledge that the recovery operator does not have. The person executing recovery at 3 AM during a production outage may not be the person who designed the system. They may be a junior engineer on the weekend on-call rotation. They may be unfamiliar with the specific database engine, the cloud provider's console, or the deployment tooling.
Effective DR procedures are written for the least experienced person who might need to execute them. Every step is explicit. Every command is copy-pasteable. Every decision point includes the criteria for choosing each path. Environment-specific values — hostnames, connection strings, credentials locations — are parameterised and documented in a single, maintained location rather than scattered across the procedure.
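One way to keep environment-specific values in a single maintained location is to store them once and render the runbook commands from templates. The sketch below uses Python's standard `string.Template`; the hostnames, bucket name, secrets path, and the `fetch-backup` and `restore-db` commands are all hypothetical placeholders, not real tooling.

```python
from string import Template

# Hypothetical environment values, maintained in exactly one place.
ENVIRONMENT = {
    "replica_host": "db-replica.example.internal",
    "backup_bucket": "example-backups",
    "secrets_path": "dr/prod/db",
}

# Hypothetical runbook steps. The command names are placeholders.
STEPS = [
    "fetch-backup --bucket $backup_bucket --latest",
    "restore-db --host $replica_host --credentials-from $secrets_path",
]


def render_steps(steps, environment):
    """Return copy-pasteable commands.

    Template.substitute raises KeyError when a step references a value
    that is not defined, so a missing parameter fails at render time
    rather than in the middle of an incident.
    """
    return [Template(step).substitute(environment) for step in steps]


for command in render_steps(STEPS, ENVIRONMENT):
    print(command)
```

Rendering the procedure as part of a scheduled check is a cheap way to notice when a step refers to a value nobody maintains any more.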
Automate Recovery Where Possible
Manual recovery steps are the highest-risk component of any DR plan. Each manual step is an opportunity for human error under pressure. Automation reduces this risk by encoding the recovery procedure into executable scripts that have been tested and validated.
Infrastructure-as-code tools — Terraform, Pulumi, CloudFormation — allow recovery infrastructure to be provisioned from version-controlled definitions rather than manually recreated from documentation. Database restoration scripts that accept parameters for target environment, backup source, and validation checks can be executed consistently regardless of who runs them. Automated health verification after restoration confirms that the recovered system is functional before traffic is redirected.
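The shape of such a script can be sketched as follows. This is a minimal outline under stated assumptions: `RestoreRequest`, `run_recovery`, and the three injected callables are hypothetical names, and the actual restore, health check, and traffic switch would wrap whatever tooling the system really uses.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class RestoreRequest:
    target_environment: str  # where the backup is restored
    backup_source: str       # which backup to restore


def run_recovery(
    request: RestoreRequest,
    restore: Callable[[RestoreRequest], None],
    health_check: Callable[[str], bool],
    redirect_traffic: Callable[[str], None],
) -> bool:
    """Restore, verify, and only then redirect traffic.

    Returns False without touching traffic if the health check fails,
    leaving the next decision to a human.
    """
    restore(request)
    if not health_check(request.target_environment):
        return False
    redirect_traffic(request.target_environment)
    return True
```

Passing the steps in as parameters keeps the ordering logic testable on its own, so the same function can be exercised in a drill with stand-in steps and in an incident with real ones.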
The goal is not full automation of every recovery scenario — some incidents require human judgement. The goal is automating the mechanical steps so that human attention is reserved for the decisions that require it.
Building a DR Testing Programme
DR testing is not a single event. It is a programme with defined scope, frequency, and escalation. The testing programme should exercise increasingly realistic scenarios over time, building team confidence and identifying gaps progressively.
Tabletop Exercises
The simplest form of DR testing requires no infrastructure changes. A tabletop exercise gathers the team responsible for recovery, presents a failure scenario, and walks through the documented procedure step by step. The team does not execute any commands — they read the procedure and identify gaps.
Tabletop exercises consistently reveal problems that are invisible in the documentation. Steps that reference decommissioned infrastructure. Procedures that assume access credentials that have been rotated. Contact lists with people who have changed roles. Dependencies on services that have been replaced. These exercises should run after any significant infrastructure change and on a regular schedule: quarterly at minimum for critical systems.
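Some of these discrepancies can also be surfaced mechanically between exercises. The sketch below assumes the names a runbook refers to (hosts, tools, contacts) have already been extracted into sets and that a current inventory is available in the same shape; how either is obtained is outside this example, and the function name and sample values are hypothetical.

```python
def stale_references(runbook_refs, current_inventory):
    """Return runbook references that no longer exist, grouped by kind.

    Both arguments map a kind (for example "hosts") to a set of names.
    """
    stale = {}
    for kind, names in runbook_refs.items():
        missing = names - current_inventory.get(kind, set())
        if missing:
            stale[kind] = missing
    return stale


# Hypothetical example data.
runbook = {"hosts": {"db-primary", "db-secondary"}, "tools": {"old-backup-tool"}}
inventory = {"hosts": {"db-primary"}, "tools": {"new-backup-tool"}}
print(stale_references(runbook, inventory))
```

A check like this does not replace the exercise, since it cannot judge whether a procedure still makes sense, but it can flag the need for one.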
Backup Verification Testing
Backups that have never been restored are not backups — they are assumptions. Backup verification testing restores a backup to an isolated environment and confirms that the restored data is complete, consistent, and usable.
This testing should verify several dimensions. Data completeness: does the restored database contain all expected tables, schemas, and records? Data consistency: are referential integrity constraints satisfied? Application compatibility: can the current application version connect to and operate against the restored database? Performance: does the restored database perform within acceptable parameters, or does it require index rebuilds or statistics updates before it can serve production traffic?
Backup verification should run on a defined schedule — monthly at minimum for critical systems. The results should be recorded and any failures should trigger immediate investigation, because a backup that fails verification today will fail restoration during an incident tomorrow.
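The completeness and consistency checks can be sketched in a few lines. The example below uses SQLite purely as a stand-in because it ships with Python; a production system would run the equivalent checks for its own database engine. The function name and the idea of passing minimum expected row counts per table are assumptions made for illustration.

```python
import sqlite3


def verify_restored_db(conn, expected_min_rows):
    """Return a list of problems found; an empty list means the checks passed.

    expected_min_rows maps table name to the minimum row count expected.
    """
    problems = []
    tables = {
        row[0]
        for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
    }
    for table, min_rows in expected_min_rows.items():
        if table not in tables:
            problems.append(f"missing table: {table}")
            continue
        count = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]
        if count < min_rows:
            problems.append(f"{table}: {count} rows, expected at least {min_rows}")
    # Each row returned by foreign_key_check is a referential integrity violation.
    if conn.execute("PRAGMA foreign_key_check").fetchall():
        problems.append("foreign key violations found")
    # integrity_check returns a single row containing 'ok' when the file is sound.
    if conn.execute("PRAGMA integrity_check").fetchone()[0] != "ok":
        problems.append("integrity_check failed")
    return problems
```

Recording the returned list on every scheduled run gives the history needed to spot a backup that quietly stopped being restorable. Application compatibility and performance still need their own checks against the restored copy.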
Controlled Failover Drills
Controlled failover drills execute the actual recovery procedure against real infrastructure, but under controlled conditions rather than during an incident. The team triggers failover to the secondary environment, verifies that the secondary is serving traffic correctly, operates on the secondary for a defined period, and then fails back to the primary.
These drills test components that tabletop exercises and backup verification cannot reach. Network routing changes. DNS propagation timing. Application configuration differences between primary and secondary environments. Session handling during failover. Data synchronisation lag between regions. Load balancer health check behaviour during the transition.
Controlled failover drills should run at least twice per year for critical systems. The drill should be scheduled during business hours with the full incident response team available — not because failure is expected, but because the drill's purpose is to identify problems when the team has time to investigate and resolve them.
Multi-Region Failover Testing
For systems with multi-region architectures, failover testing must verify that the recovery region can handle production load, that data replication lag is within RPO, and that the application behaves correctly when operating from the recovery region.
Multi-region testing introduces complexities that single-region failover does not. Latency changes when users connect to a geographically distant region. Data replication lag means the recovery region may be slightly behind the primary. Third-party API integrations may behave differently from a different source region. CDN and caching layers need reconfiguration or invalidation. These factors must be tested, not assumed. As Google's Site Reliability Engineering handbook documents, even well-resourced engineering organisations discover unexpected failure modes during planned DR exercises that would have caused extended outages during unplanned incidents.
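The check that replication lag is within RPO reduces to a comparison between two timestamps and the objective. The sketch below assumes the timestamp of the last change applied in the recovery region is available from the replication tooling; how it is obtained varies by database, and the function name is hypothetical.

```python
from datetime import datetime, timedelta, timezone


def replication_within_rpo(last_replicated_at, now, rpo):
    """Return (within_rpo, lag) for the recovery region.

    lag is how far the recovery region trails the primary. Data written
    during that window would be lost if failover happened at `now`.
    """
    lag = now - last_replicated_at
    return lag <= rpo, lag


now = datetime(2025, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
last_applied = now - timedelta(seconds=90)
print(replication_within_rpo(last_applied, now, rpo=timedelta(minutes=5)))
```

Sampling this during a failover test, under production-like load, matters more than sampling it at idle, because lag tends to grow with write volume.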
The Twelve-Hour Gap a DR Test Would Have Revealed
The composite scenario described in the opening is instructive because of what it reveals about DR plan decay. The plan was created during the original infrastructure build. At that time, the secondary database server existed, the backup tool was installed, and the recovery procedure was accurate. Over eighteen months, three changes occurred independently: the secondary server was decommissioned during a cost optimisation exercise, the backup tool was replaced during a platform migration, and the original infrastructure lead left the organisation.
No single change broke the DR plan. Each change was reasonable in isolation and was executed competently within its own scope. But no change triggered a review of the DR documentation. The result was a plan that referenced infrastructure, tools, and people that no longer existed — and this was invisible until the moment of execution.
Had the team run a tabletop exercise after any of those three changes, the discrepancies would have been identified immediately. Had they run a backup verification test, they would have discovered the tool incompatibility weeks or months before it mattered. Had they run a controlled failover drill, they would have found that the failover target did not exist. The fourteen-hour recovery was not caused by a catastrophic failure — it was caused by the accumulated drift between a static document and a dynamic system.
The gap between the documented two hours and the actual fourteen hours represents exactly the kind of risk that business continuity planning must account for: not the failure itself, but the failure of the recovery.
When to Invest in DR Testing and How Much
DR testing is non-negotiable if your system handles financial transactions, personal data, or operations where downtime carries regulatory, contractual, or significant reputational consequences. Any system with an RTO measured in minutes rather than hours needs validated, practised recovery procedures — not just documented ones.
A lighter approach is acceptable if your system is stateless or easily reproducible, with data that can be reconstructed from source systems. A static marketing website backed by a version-controlled repository has a fundamentally different DR profile than a transactional database with no upstream source of truth.
Scale your testing to your risk profile. Systems with aggressive RTO and RPO targets need quarterly controlled failover drills and monthly backup verification. Systems with relaxed recovery targets can operate with semi-annual tabletop exercises and quarterly backup verification. But every system, regardless of criticality, needs some form of DR validation. Infrastructure practice in 2025 increasingly favours automated, continuous DR validation over periodic manual exercises, and the tooling to support this approach is maturing rapidly.
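The cadence described above can be encoded so that overdue tests are flagged automatically. In this sketch the profile names, test names, and day counts are assumptions: "quarterly" is approximated as 91 days, "monthly" as 30, and "semi-annual" as 182.

```python
from datetime import date, timedelta

# Intervals in days, approximating the cadence described in the text.
SCHEDULES = {
    "aggressive": {"controlled_failover_drill": 91, "backup_verification": 30},
    "relaxed": {"tabletop_exercise": 182, "backup_verification": 91},
}


def overdue_tests(profile, last_run, today):
    """Return the tests for a risk profile that are overdue or have never run.

    last_run maps test name to the date it last completed.
    """
    overdue = []
    for test, interval_days in SCHEDULES[profile].items():
        last = last_run.get(test)
        if last is None or today - last > timedelta(days=interval_days):
            overdue.append(test)
    return overdue


print(overdue_tests("aggressive", {"backup_verification": date(2025, 1, 1)}, date(2025, 3, 1)))
```

Treating a test that has never run as overdue is deliberate: it is the case this article is about.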
Where to Start
Pick one system — the one that would hurt most if it went down tomorrow. Pull up its disaster recovery documentation. Read it with the question: could a competent engineer who has never seen this system execute this procedure at 3 AM and recover within the documented RTO? If the answer is not a confident yes, that is your starting point.
Restore the most recent backup to an isolated environment. Verify the data is complete and the application can run against it. Time the process. Compare that time to the documented RTO. The gap between those two numbers is the gap between your DR plan and reality.
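Timing the restore and computing the gap takes only a few lines. In this sketch `restore` stands for whatever callable performs the restoration into the isolated environment; both function names are hypothetical.

```python
import time
from datetime import timedelta


def time_restore(restore):
    """Run the restore callable and return how long it took."""
    start = time.monotonic()
    restore()
    return timedelta(seconds=time.monotonic() - start)


def recovery_gap(actual, documented_rto):
    """Positive result: recovery took longer than the documented RTO."""
    return actual - documented_rto


# The opening scenario: fourteen hours actual against a documented two.
print(recovery_gap(timedelta(hours=14), timedelta(hours=2)))
```

A restore measured in a quiet test environment is a lower bound. An incident adds diagnosis, coordination, and decision time on top of it.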
When you are ready to build disaster recovery that works under pressure — tested, automated, and maintained — talk to our DevOps team. We design recovery architectures that have been proven before they are needed, because the worst time to discover your DR plan does not work is during the disaster.
Tiffany brings creativity, adapts quickly to new tools, and leads atomic design principles to enhance UI/UX efficiency.