The plan was simple: migrate 14 applications to the cloud over a single weekend. Friday evening, the team kicked off the cutover. By Saturday afternoon, eight applications were running in the new environment. By Sunday evening, three of the remaining six had issues — a payment processing service couldn't reach its database, an internal reporting tool was throwing authentication errors, and a customer-facing portal was returning intermittent 502s. The rollback plan existed on paper, but it hadn't accounted for shared database dependencies between the applications that had already migrated and the ones that hadn't. Rolling back one application meant breaking another that was already live in the new environment.
Monday morning was chaotic. The operations team was triaging three broken applications while simultaneously fielding calls from internal stakeholders and customers. The applications that had migrated cleanly were working, but the ones that failed were caught between two environments — too far into the migration to roll back cleanly, not far enough to push through. It took until Wednesday to stabilise everything. The weekend migration had turned into a five-day incident.
This is the failure mode of big-bang migrations. The problem isn't that migrations are inherently risky. The problem is that moving everything at once means every dependency, every configuration difference, and every environmental assumption gets tested simultaneously with no isolation between failures. At EB Pearls™, we've seen this pattern repeatedly across the 900+ projects we've delivered — and the organisations that migrate successfully are the ones that sequence by risk, run in parallel, and prove before they cut.
Why Big-Bang Migrations Fail
The appeal of a big-bang cloud migration is understandable. One cutover window. One weekend of disruption. One clean break from legacy to modern. It sounds efficient. In practice, it creates a blast radius that encompasses every application, every integration, and every user simultaneously.
The core issue is dependency density. Enterprise workloads don't exist in isolation. Application A writes to a database that Application B reads from. Application C authenticates through a service that Application D hosts. Application E generates events that Applications F, G, and H consume. When you migrate all of them at once, you're not running one migration — you're running dozens of interconnected migrations where the failure of any single component can cascade across the entire portfolio.
Big-bang migrations also compress your learning window to zero. In a sequenced migration, you learn from migrating the first workload and apply those lessons to the second. You discover that DNS propagation takes longer than expected, that a particular firewall rule needs adjusting, that the new environment handles connection pooling differently. Each migration makes the next one safer. A big-bang approach eliminates that feedback loop entirely. Every lesson arrives at the same time as every problem.
The rollback problem is equally severe. A rollback plan for a single application is straightforward — revert DNS, restore the database, restart services. A rollback plan for 14 interdependent applications is a migration plan in reverse, with the added complexity that some applications are now generating data in the new environment that needs to be reconciled with the old. The weekend migration composite above failed not because the team lacked a rollback plan, but because the rollback plan assumed each application could be rolled back independently. Shared dependencies made that assumption false.
Sequencing Workloads by Risk and Dependency
The foundation of a safe cloud migration architecture is sequencing — deciding which workloads move first, which move last, and why. This isn't a technical decision alone. It's a risk management decision that accounts for business criticality, dependency complexity, and your team's capacity to absorb problems.
Map Dependencies Before You Sequence
Before you decide what moves first, you need to know what connects to what. Dependency mapping should capture four layers: network dependencies (what talks to what over which ports), data dependencies (what reads from and writes to which databases and queues), authentication dependencies (what relies on which identity providers and token services), and operational dependencies (what monitoring, logging, and alerting systems each workload uses).
This mapping exercise consistently reveals surprises. A service that the team considers standalone turns out to depend on a shared certificate authority. A database that supposedly serves one application is actually queried by three others through an undocumented API. These discoveries are far better made during planning than during a live cutover.
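As a minimal sketch of what such a map might look like in code, the example below records the four layers per workload and reports every dependency used by more than one workload. The `Workload` class, the `shared_dependencies` helper, and the inventory entries are hypothetical names chosen for illustration; a real inventory would be populated from network flow logs, connection strings, and interviews with the owning teams.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Workload:
    """One workload and what it depends on, split into the four layers."""
    name: str
    network: set[str] = field(default_factory=set)         # hosts and ports it talks to
    data: set[str] = field(default_factory=set)            # databases and queues
    authentication: set[str] = field(default_factory=set)  # identity providers, token services
    operational: set[str] = field(default_factory=set)     # monitoring, logging, alerting


LAYERS = ("network", "data", "authentication", "operational")


def shared_dependencies(workloads: list[Workload]) -> dict[tuple[str, str], list[str]]:
    """Return every (layer, dependency) pair used by more than one workload."""
    users = defaultdict(list)
    for workload in workloads:
        for layer in LAYERS:
            for dependency in getattr(workload, layer):
                users[(layer, dependency)].append(workload.name)
    return {key: sorted(names) for key, names in users.items() if len(names) > 1}


# Hypothetical inventory, for illustration only.
inventory = [
    Workload("payments", data={"orders-db"}, authentication={"sso"}),
    Workload("reporting", data={"orders-db"}, authentication={"sso"}),
    Workload("wiki", authentication={"sso"}),
]

shared = shared_dependencies(inventory)
for (layer, dependency), names in sorted(shared.items()):
    print(f"{layer}: {dependency} is shared by {', '.join(names)}")
```

Anything that appears in this report is a dependency that cannot be rolled back for one workload without affecting another, which is exactly the kind of surprise worth finding on paper.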
The Risk-Based Sequencing Framework
Once dependencies are mapped, sequence workloads into three migration waves, ordered by risk and dependency complexity.
Wave 1: Low risk, low dependency. Start with workloads that have minimal integration points, are not customer-facing, and where downtime is tolerable. Internal tools, development environments, batch processing jobs. These are your learning migrations. You'll discover the friction points — DNS propagation delays, firewall misconfigurations, unexpected latency differences — in a context where the consequences of discovery are low.
Wave 2: Medium risk, moderate dependency. Move to workloads that have some integration points but where those integrations can operate across environments temporarily. Back-office systems, internal APIs, reporting platforms. These migrations test your ability to run services across the legacy and new environments simultaneously.
Wave 3: High risk, high dependency. Customer-facing applications, payment processing, real-time data systems. These move last, after you've validated the environment with lower-risk workloads and established that cross-environment connectivity works. By the time you reach Wave 3, your team has migrated multiple workloads, resolved environmental issues, and built confidence in the rollback procedures.
Each wave should include a defined stabilisation period — typically one to two weeks — before the next wave begins. This gives the team time to monitor the migrated workloads, catch latent issues, and apply lessons learned to the next wave's planning and delivery.
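The wave assignment can be sketched as a small rule-based function. This is an illustration under stated assumptions, not a prescribed scoring model: the `Candidate` fields, the `assign_wave` and `plan_waves` names, and the integration-point thresholds (2 and 5) are all hypothetical and would need tuning against your own dependency map.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    customer_facing: bool
    downtime_tolerable: bool
    integration_points: int  # count taken from the dependency map


def assign_wave(candidate: Candidate, low: int = 2, high: int = 5) -> int:
    """Place a workload in wave 1, 2 or 3. The thresholds are illustrative."""
    if candidate.customer_facing or candidate.integration_points > high:
        return 3  # high risk, high dependency: move last
    if candidate.downtime_tolerable and candidate.integration_points <= low:
        return 1  # low risk, low dependency: the learning migrations
    return 2      # everything in between


def plan_waves(candidates: list[Candidate]) -> dict[int, list[str]]:
    """Group workload names by wave, preserving input order within a wave."""
    waves: dict[int, list[str]] = {1: [], 2: [], 3: []}
    for candidate in candidates:
        waves[assign_wave(candidate)].append(candidate.name)
    return waves


# Hypothetical portfolio, for illustration only.
portfolio = [
    Candidate("dev-environment", customer_facing=False, downtime_tolerable=True, integration_points=1),
    Candidate("reporting", customer_facing=False, downtime_tolerable=False, integration_points=4),
    Candidate("payments", customer_facing=True, downtime_tolerable=False, integration_points=7),
    Candidate("batch-export", customer_facing=False, downtime_tolerable=True, integration_points=2),
]

plan = plan_waves(portfolio)
```

Keeping the rules this explicit makes the sequencing decision reviewable by non-engineers, which matters because the ordering is a risk decision as much as a technical one.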
The Parallel Run: Prove Before You Cut
A parallel run is the practice of running a workload in both the legacy and new environments simultaneously, comparing outputs to validate that the new environment produces correct results before cutting over. It's the single most effective risk reduction technique in workload migration.
How Parallel Runs Work
The mechanics vary by workload type, but the principle is consistent. For data processing workloads, you feed the same input data to both environments and compare the outputs. For API services, you route a percentage of traffic to the new environment (shadow traffic or canary routing) and compare response payloads and latency. For batch jobs, you run the job in both environments and diff the results.
The parallel run answers the question that no amount of pre-migration testing can fully address: does this workload behave identically in the new environment under real production conditions? Testing environments approximate production. Parallel runs use production itself as the test.
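For the API case, one common way to select a fixed percentage of requests is to hash a stable request identifier into a bucket. The sketch below assumes each request carries such an identifier; `routes_to_new` is a hypothetical helper, and in practice this decision usually lives in the load balancer or service mesh rather than in application code.

```python
import hashlib


def routes_to_new(request_id: str, percent_to_new: int) -> bool:
    """Deterministically decide whether a request goes to the new environment.

    The same request_id always lands in the same bucket, so retries of one
    request are not split across environments.
    """
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # bucket in 0..99
    return bucket < percent_to_new


# Hypothetical request identifiers, for illustration only.
sample_ids = [f"req-{i}" for i in range(10_000)]
share_to_new = sum(routes_to_new(rid, 10) for rid in sample_ids) / len(sample_ids)
print(f"share routed to new environment: {share_to_new:.1%}")
```

The same selection function works for shadow traffic: instead of routing the selected requests, mirror them to the new environment and discard its responses after comparison.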
What to Compare
Focus comparisons on three dimensions. Functional correctness — does the new environment produce the same outputs for the same inputs? Performance — does the new environment meet the same latency, throughput, and resource utilisation benchmarks? Error behaviour — does the new environment handle edge cases and failure scenarios the same way?
Discrepancies in any dimension need investigation before cutover. A parallel run that reveals a three per cent difference in calculated values isn't a minor issue; it's a data integrity problem that will compound daily after cutover.
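A functional-correctness comparison can be as simple as diffing keyed outputs from the two environments. The sketch below assumes both runs produce a mapping from record identifier to a numeric value; `diff_outputs` and the invoice data are hypothetical, and real outputs will usually need normalising (timestamps, ordering, rounding) before they can be compared.

```python
from typing import Optional


def diff_outputs(
    legacy: dict[str, float],
    new: dict[str, float],
    tolerance: float = 0.0,
) -> dict[str, tuple[Optional[float], Optional[float]]]:
    """Return records whose values differ, or that exist on only one side."""
    discrepancies = {}
    for key in legacy.keys() | new.keys():
        old_value = legacy.get(key)
        new_value = new.get(key)
        if old_value is None or new_value is None:
            discrepancies[key] = (old_value, new_value)  # missing on one side
        elif abs(old_value - new_value) > tolerance:
            discrepancies[key] = (old_value, new_value)  # values disagree
    return discrepancies


# Hypothetical batch outputs for the same input, for illustration only.
legacy_run = {"inv-1": 100.00, "inv-2": 250.50, "inv-3": 80.00}
new_run = {"inv-1": 100.00, "inv-2": 258.02, "inv-4": 12.00}

mismatches = diff_outputs(legacy_run, new_run)
for key in sorted(mismatches):
    print(key, mismatches[key])
```

A default tolerance of zero is deliberate: any allowance for drift should be an explicit, justified decision rather than something the comparison quietly permits.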
Duration and Exit Criteria
Parallel runs should cover at least one full business cycle. For most applications, that means a minimum of one week — enough to capture daily batch processes, weekly reports, and typical traffic patterns. For workloads with monthly cycles (billing, payroll, financial reporting), a full month of parallel running is warranted.
Define exit criteria before the parallel run begins. These should be specific and measurable: zero functional discrepancies for five consecutive business days, P95 latency within ten per cent of legacy, zero unhandled exceptions. Don't start with vague goals like "everything looks good." Define what "good" means numerically, and hold to it.
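Exit criteria like these are straightforward to encode, which removes any temptation to negotiate them once the run is under way. The sketch below uses the example numbers from the paragraph above; `DayResult`, `day_passes`, and `exit_criteria_met` are hypothetical names, and it assumes "within ten per cent" means the new environment may be at most ten per cent slower than legacy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DayResult:
    """Summary of one business day of the parallel run."""
    functional_discrepancies: int
    unhandled_exceptions: int
    legacy_p95_ms: float
    new_p95_ms: float


def day_passes(day: DayResult, max_latency_regression: float = 0.10) -> bool:
    """A day passes with no discrepancies, no exceptions, and acceptable P95."""
    latency_ok = day.new_p95_ms <= day.legacy_p95_ms * (1 + max_latency_regression)
    return (
        day.functional_discrepancies == 0
        and day.unhandled_exceptions == 0
        and latency_ok
    )


def exit_criteria_met(days: list[DayResult], required_clean_days: int = 5) -> bool:
    """True when the most recent `required_clean_days` days all pass."""
    if len(days) < required_clean_days:
        return False
    return all(day_passes(day) for day in days[-required_clean_days:])


# Hypothetical results, for illustration only.
clean_day = DayResult(0, 0, legacy_p95_ms=100.0, new_p95_ms=105.0)
slow_day = DayResult(0, 0, legacy_p95_ms=100.0, new_p95_ms=120.0)

ready = exit_criteria_met([slow_day] + [clean_day] * 5)
```

Because only the most recent consecutive days count, a single failing day resets the clock, which matches the intent of "five consecutive business days".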
The Cutover Plan
The cutover is the moment you redirect production traffic from the legacy environment to the new one. A well-sequenced migration with successful parallel runs makes the cutover itself low-drama — which is exactly the goal.
Pre-Cutover Checklist
Before cutting over any workload, confirm: the parallel run met all exit criteria, the rollback procedure has been tested (not just documented — actually tested), the operations team knows the escalation path, monitoring and alerting are configured in the new environment, and stakeholders have been notified of the cutover window.
The Cutover Window
Schedule cutovers during low-traffic periods, but don't mistake low traffic for zero risk. Even during off-peak hours, real users are generating real transactions. The cutover should be designed to be reversible for a defined period — typically 24 to 72 hours — during which the legacy environment remains available as a fallback.
DNS-based cutovers using low TTL values allow quick rollback by redirecting traffic back to the legacy environment. Load balancer routing provides even faster switching. The specific mechanism depends on the workload, but the principle is the same: maintain the ability to revert quickly until you're confident the new environment is stable under full production load.
Post-Cutover Monitoring
The first 48 hours after cutover are the highest-risk period. Assign dedicated monitoring to the migrated workload. Watch error rates, latency percentiles, database connection counts, queue depths, and any workload-specific health metrics. Establish clear thresholds for automatic rollback — for example, if error rates exceed two per cent for more than fifteen minutes, trigger the rollback procedure without waiting for a decision meeting.
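The example threshold above can be expressed as a small check over a series of error-rate samples. This is a sketch only: `should_roll_back` is a hypothetical function, the samples are assumed to arrive in chronological order from whatever monitoring system you use, and the two per cent and fifteen minute values are the illustrative figures from the text rather than recommended defaults.

```python
from datetime import datetime, timedelta


def should_roll_back(
    samples: list[tuple[datetime, float]],
    threshold: float = 0.02,
    sustained: timedelta = timedelta(minutes=15),
) -> bool:
    """Return True once the error rate has stayed above `threshold`
    for longer than `sustained`. Samples must be in chronological order."""
    breach_started = None
    for timestamp, error_rate in samples:
        if error_rate > threshold:
            if breach_started is None:
                breach_started = timestamp  # start of the current breach
            if timestamp - breach_started > sustained:
                return True
        else:
            breach_started = None  # breach ended, reset the clock
    return False


def five_minute_series(rates: list[float]) -> list[tuple[datetime, float]]:
    """Helper for the example: one sample every five minutes."""
    start = datetime(2024, 1, 1, 0, 0)
    return [(start + timedelta(minutes=5 * i), rate) for i, rate in enumerate(rates)]


# Five samples above two per cent span twenty minutes, so this triggers.
trigger = should_roll_back(five_minute_series([0.03, 0.03, 0.04, 0.03, 0.05]))
```

Wiring a check like this to the tested rollback procedure is what turns the threshold from a guideline into an automatic response.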
This kind of structured cutover discipline is part of how sound software development practice manages complexity — not by eliminating risk, but by making risk visible and controllable.
Managing the In-Between State
Every sequenced migration creates a period where some workloads are in the new environment and some remain in the legacy environment. This in-between state is unavoidable and needs to be managed deliberately, not treated as a temporary inconvenience.
Cross-Environment Connectivity
Workloads in the new environment still need to communicate with workloads in the legacy environment. This requires network connectivity between environments — VPN tunnels, direct connections, or API gateways — that is often underestimated in migration planning. Latency between environments will be higher than within either environment. Services that depend on low-latency communication need to migrate together or have their communication patterns adapted.
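Deciding which services must migrate together is a grouping problem: any services joined by a latency-sensitive link belong in the same move. A minimal sketch using union-find follows; `co_migration_groups`, the service names, and the links are hypothetical, and it assumes you have already marked which links are latency-sensitive during dependency mapping.

```python
def co_migration_groups(
    services: list[str],
    latency_sensitive_links: list[tuple[str, str]],
) -> list[list[str]]:
    """Group services that are connected, directly or transitively,
    by latency-sensitive links and so should migrate together."""
    parent = {service: service for service in services}

    def find(service: str) -> str:
        # Walk up to the group's representative, shortening the path as we go.
        while parent[service] != service:
            parent[service] = parent[parent[service]]
            service = parent[service]
        return service

    for a, b in latency_sensitive_links:
        parent[find(a)] = find(b)  # merge the two groups

    groups: dict[str, list[str]] = {}
    for service in services:
        groups.setdefault(find(service), []).append(service)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical services and links, for illustration only.
services = ["payments", "payments-db", "portal", "reporting", "sso"]
links = [("payments", "payments-db"), ("portal", "payments")]

groups = co_migration_groups(services, links)
```

Any group that spans two planned waves is a signal to either move the whole group into one wave or adapt the communication pattern so the link tolerates cross-environment latency.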
Data Consistency
The most complex aspect of the in-between state is data. When a workload migrates but its database doesn't (or vice versa), you need a data synchronisation strategy. Options include real-time replication, event-driven synchronisation, or dual-write patterns. Each has trade-offs in complexity, latency, and consistency guarantees. AWS's migration best practices documentation provides detailed guidance on data synchronisation patterns during migration.
Operational Complexity
During the in-between state, your operations team is managing two environments. Monitoring, logging, alerting, and incident response need to work across both. This doubles the operational surface area and requires clear documentation of which workloads are where. A migration tracker — updated in real time — is essential. Without it, incident responders waste critical minutes determining which environment a failing workload is running in.
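A migration tracker does not need to be elaborate to be useful. The sketch below is one possible shape, assuming an in-memory store: `MigrationTracker`, `Environment`, and the workload names are hypothetical, and a real tracker would persist its state and be updated by the cutover tooling itself. It answers the two questions responders ask first: where is this workload running, and which of its dependencies sit in the other environment.

```python
from enum import Enum


class Environment(Enum):
    LEGACY = "legacy"
    NEW = "new"


class MigrationTracker:
    """Records where each workload runs and what it depends on."""

    def __init__(self) -> None:
        self._location: dict[str, Environment] = {}
        self._depends_on: dict[str, tuple[str, ...]] = {}

    def register(self, workload: str, environment: Environment,
                 depends_on: tuple[str, ...] = ()) -> None:
        self._location[workload] = environment
        self._depends_on[workload] = tuple(depends_on)

    def move(self, workload: str, environment: Environment) -> None:
        self._location[workload] = environment

    def where(self, workload: str) -> Environment:
        return self._location[workload]

    def split_dependencies(self) -> list[tuple[str, str]]:
        """Pairs (workload, dependency) currently in different environments.
        Every dependency must itself have been registered."""
        return sorted(
            (workload, dependency)
            for workload, dependencies in self._depends_on.items()
            for dependency in dependencies
            if self._location[workload] is not self._location[dependency]
        )


# Hypothetical state part-way through a migration, for illustration only.
tracker = MigrationTracker()
tracker.register("payments-db", Environment.LEGACY)
tracker.register("sso", Environment.LEGACY)
tracker.register("payments", Environment.LEGACY, depends_on=("payments-db", "sso"))
tracker.move("payments", Environment.NEW)

split = tracker.split_dependencies()
```

The list of split dependencies doubles as a risk register for the in-between state: each pair is a link that now crosses environments and needs connectivity, latency, and synchronisation accounted for.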
The Weekend Migration That Became a Five-Day Incident
The composite in the opening illustrates what happens when migration architecture is absent. The team had migrated workloads but hadn't sequenced them by dependency. They had a rollback plan but hadn't tested it against shared database dependencies. They had a cutover window but no parallel run to validate workload behaviour before the cut.
The three applications that failed shared a pattern: each depended on a service or database that was in a different migration state. The payment processing service had migrated, but its database hadn't — and the cross-environment latency introduced transaction timeouts that didn't exist when both were co-located. The reporting tool had migrated, but the authentication service it used was still in the legacy environment, and the temporary authentication bridge had a configuration error. The customer portal was caught in a half-migrated state when the team paused migration to investigate the first two failures.
Every one of these issues was discoverable. Dependency mapping would have identified the shared database. A parallel run would have exposed the latency-induced timeouts. Testing the rollback procedure would have revealed the shared dependency problem. The team wasn't unskilled — they simply lacked a migration architecture that made risk visible before it materialised. This is a recurring theme across application delivery at scale: the practices that prevent incidents are planning practices, not heroic recovery practices.
Guidance from Google's SRE team repeatedly emphasises that the safety of any system change, migrations included, depends more on the quality of the rollback strategy than on confidence in the roll-forward plan. You don't need to be certain the migration will succeed. You need to be certain that if it doesn't, you can recover without data loss or extended downtime.
What to Do Next
If you're planning a cloud migration, start with dependency mapping. Document every integration point for every workload in scope. Then sequence those workloads into waves — low risk first, high risk last. Define parallel run criteria for each wave. Test your rollback procedure before you need it.
If you're mid-migration and things feel fragile, pause. The most dangerous migration is the one that continues forward because the team feels committed to the timeline. A migration that takes twelve weeks and works is better than one that takes four weeks and creates a five-day incident.
If you need help building a migration architecture that sequences risk, validates through parallel runs, and maintains rollback capability throughout — EB Pearls' DevOps team can help you plan and execute a migration that moves your workloads without moving your risk. We've supported migrations across the emerging technology landscape and can bring that experience to your migration planning.
Discover custom app development and AI trends with Nikesh Maharjan, EB Pearls' Senior Engineering Manager. Learn how we build innovative solutions.