Migration Architecture: Move Workloads Without Moving Risk

Published: 19 Jun 2026

Author: Nikesh Maharjan


The plan was simple: migrate 14 applications to the cloud over a single weekend. Friday evening, the team kicked off the cutover. By Saturday afternoon, eight applications were running in the new environment. By Sunday evening, three of the remaining six had issues — a payment processing service couldn't reach its database, an internal reporting tool was throwing authentication errors, and a customer-facing portal was returning intermittent 502s. The rollback plan existed on paper, but it hadn't accounted for shared database dependencies between the applications that had already migrated and the ones that hadn't. Rolling back one application meant breaking another that was already live in the new environment.

Monday morning was chaotic. The operations team was triaging three broken applications while simultaneously fielding calls from internal stakeholders and customers. The applications that had migrated cleanly were working, but the ones that failed were caught between two environments — too far into the migration to roll back cleanly, not far enough to push through. It took until Wednesday to stabilise everything. The weekend migration had turned into a five-day incident.

This is the failure mode of big-bang migrations. The problem isn't that migrations are inherently risky. The problem is that moving everything at once means every dependency, every configuration difference, and every environmental assumption gets tested simultaneously with no isolation between failures. At EB Pearls™, we've seen this pattern repeatedly across the 900+ projects we've delivered — and the organisations that migrate successfully are the ones that sequence by risk, run in parallel, and prove before they cut.

Why Big-Bang Migrations Fail

The appeal of a big-bang cloud migration architecture is understandable. One cutover window. One weekend of disruption. One clean break from legacy to modern. It sounds efficient. In practice, it creates a blast radius that encompasses every application, every integration, and every user simultaneously.

The core issue is dependency density. Enterprise workloads don't exist in isolation. Application A writes to a database that Application B reads from. Application C authenticates through a service that Application D hosts. Application E generates events that Applications F, G, and H consume. When you migrate all of them at once, you're not running one migration — you're running dozens of interconnected migrations where the failure of any single component can cascade across the entire portfolio.

Big-bang migrations also compress your learning window to zero. In a sequenced migration, you learn from migrating the first workload and apply those lessons to the second. You discover that DNS propagation takes longer than expected, that a particular firewall rule needs adjusting, that the new environment handles connection pooling differently. Each migration makes the next one safer. A big-bang approach eliminates that feedback loop entirely. Every lesson arrives at the same time as every problem.

The rollback problem is equally severe. A rollback plan for a single application is straightforward — revert DNS, restore the database, restart services. A rollback plan for 14 interdependent applications is a migration plan in reverse, with the added complexity that some applications are now generating data in the new environment that needs to be reconciled with the old. The weekend migration composite above failed not because the team lacked a rollback plan, but because the rollback plan assumed each application could be rolled back independently. Shared dependencies made that assumption false.
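
The independence assumption can be checked mechanically before the cutover window opens. The following is a minimal sketch, assuming a hypothetical inventory in which each application lists the shared resources it uses; the application names, resource names, and the `rollback_conflicts` helper are invented for illustration and do not describe any particular tool.

```python
from collections import defaultdict

# Hypothetical inventory: application -> shared resources it reads or writes.
USES = {
    "payments": {"orders_db"},
    "reporting": {"orders_db", "auth_service"},
    "portal": {"auth_service", "content_db"},
    "wiki": {"wiki_db"},
}

def rollback_conflicts(uses, migrated, candidate):
    """Return already-migrated apps that share a resource with `candidate`.

    A non-empty result means that rolling back `candidate` together with its
    shared resources would affect applications that are live in the new
    environment, so the rollback is not independent.
    """
    by_resource = defaultdict(set)
    for app, resources in uses.items():
        for resource in resources:
            by_resource[resource].add(app)

    conflicts = set()
    for resource in uses[candidate]:
        conflicts |= (by_resource[resource] & migrated) - {candidate}
    return conflicts

migrated = {"payments", "wiki"}
print(rollback_conflicts(USES, migrated, "reporting"))  # {'payments'}
print(rollback_conflicts(USES, migrated, "wiki"))       # set()
```

Any application that returns a non-empty set needs a coordinated rollback plan with the applications it conflicts with, rather than a standalone one.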

Sequencing Workloads by Risk and Dependency

The foundation of a safe cloud migration architecture is sequencing — deciding which workloads move first, which move last, and why. This isn't a technical decision alone. It's a risk management decision that accounts for business criticality, dependency complexity, and your team's capacity to absorb problems.

Map Dependencies Before You Sequence

Before you decide what moves first, you need to know what connects to what. Dependency mapping should capture four layers: network dependencies (what talks to what over which ports), data dependencies (what reads from and writes to which databases and queues), authentication dependencies (what relies on which identity providers and token services), and operational dependencies (what monitoring, logging, and alerting systems each workload uses).

This mapping exercise consistently reveals surprises. A service that the team considers standalone turns out to depend on a shared certificate authority. A database that supposedly serves one application is actually queried by three others through an undocumented API. These discoveries are far better made during planning than during a live cutover.
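
One way to surface these surprises is to compare what the documentation says against what is actually observed (for example, from connection logs). The sketch below assumes a hypothetical `Workload` record that holds both views across the four layers; the field names and sample data are illustrative only.

```python
from dataclasses import dataclass, field

# The four dependency layers described above.
LAYERS = ("network", "data", "auth", "operational")

@dataclass
class Workload:
    name: str
    # Each dict maps a layer name to a set of dependency names (hypothetical data).
    documented: dict = field(default_factory=dict)
    observed: dict = field(default_factory=dict)

def undocumented(workload):
    """Return {layer: dependencies observed in practice but missing from the docs}."""
    gaps = {}
    for layer in LAYERS:
        seen = workload.observed.get(layer, set())
        known = workload.documented.get(layer, set())
        missing = seen - known
        if missing:
            gaps[layer] = missing
    return gaps

reporting = Workload(
    name="reporting",
    documented={"data": {"reports_db"}, "auth": {"sso"}},
    observed={
        "data": {"reports_db", "orders_db"},
        "auth": {"sso"},
        "network": {"internal_ca:443"},
    },
)
print(undocumented(reporting))
# {'network': {'internal_ca:443'}, 'data': {'orders_db'}}
```

How the observed side is collected is outside this sketch; the point is that the gap between the two views is the list of surprises to resolve during planning.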

The Risk-Based Sequencing Framework

Once dependencies are mapped, sequence workloads into three migration waves, ordered by risk and dependency.

Wave 1: Low risk, low dependency. Start with workloads that have minimal integration points, are not customer-facing, and where downtime is tolerable. Internal tools, development environments, batch processing jobs. These are your learning migrations. You'll discover the friction points — DNS propagation delays, firewall misconfigurations, unexpected latency differences — in a context where the consequences of discovery are low.

Wave 2: Medium risk, moderate dependency. Move to workloads that have some integration points but where those integrations can operate across environments temporarily. Back-office systems, internal APIs, reporting platforms. These migrations test your ability to run services across the legacy and new environments simultaneously.

Wave 3: High risk, high dependency. Customer-facing applications, payment processing, real-time data systems. These move last, after you've validated the environment with lower-risk workloads and established that cross-environment connectivity works. By the time you reach Wave 3, your team has migrated multiple workloads, resolved environmental issues, and built confidence in the rollback procedures.

Each wave should include a defined stabilisation period — typically one to two weeks — before the next wave begins. This gives the team time to monitor the migrated workloads, catch latent issues, and apply lessons learned to the next wave's planning and delivery.
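
The wave rules above can be expressed as a small classification function. This is a sketch under stated assumptions: the `Workload` fields and the numeric thresholds `low` and `high` are hypothetical, and a real assessment would weigh business criticality and team capacity as well.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Workload:
    name: str
    customer_facing: bool
    downtime_tolerable: bool
    integration_points: int

def assign_wave(w, low=2, high=5):
    """Map a workload to wave 1, 2 or 3.

    `low` and `high` are illustrative integration-point thresholds,
    not values taken from the article.
    """
    if w.customer_facing or w.integration_points > high:
        return 3  # high risk, high dependency: move last
    if w.downtime_tolerable and w.integration_points <= low:
        return 1  # low risk, low dependency: learning migrations
    return 2      # everything in between

workloads = [
    Workload("payments", True, False, 8),
    Workload("dev-env", False, True, 0),
    Workload("reporting", False, False, 3),
    Workload("nightly-batch", False, True, 1),
]

waves = {1: [], 2: [], 3: []}
for w in workloads:
    waves[assign_wave(w)].append(w.name)
print(waves)
# {1: ['dev-env', 'nightly-batch'], 2: ['reporting'], 3: ['payments']}
```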

The Parallel Run: Prove Before You Cut

A parallel run is the practice of running a workload in both the legacy and new environments simultaneously, comparing outputs to validate that the new environment produces correct results before cutting over. It's the single most effective risk reduction technique in workload migration.

How Parallel Runs Work

The mechanics vary by workload type, but the principle is consistent. For data processing workloads, you feed the same input data to both environments and compare the outputs. For API services, you either mirror a copy of production traffic to the new environment and discard its responses (shadow traffic) or route a small percentage of live requests to it (canary routing), then compare response payloads and latency against legacy. For batch jobs, you run the job in both environments and diff the results.

The parallel run answers the question that no amount of pre-migration testing can fully address: does this workload behave identically in the new environment under real production conditions? Testing environments approximate production. Parallel runs use production itself as the test.

What to Compare

Focus comparisons on three dimensions. Functional correctness — does the new environment produce the same outputs for the same inputs? Performance — does the new environment meet the same latency, throughput, and resource utilisation benchmarks? Error behaviour — does the new environment handle edge cases and failure scenarios the same way?

Discrepancies in any dimension need investigation before cutover. A parallel run that reveals a three-per-cent difference in calculated values isn't a minor issue — it's a data integrity problem that will compound daily after cutover.
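
For the functional-correctness dimension, the comparison can be as simple as a keyed diff of the two environments' outputs. The sketch below assumes outputs are numeric values keyed by a record identifier; the `compare_outputs` helper and the sample invoices are hypothetical.

```python
import math

def compare_outputs(legacy, new, rel_tol=0.0):
    """Compare keyed numeric outputs from the legacy and new environments.

    Returns (key, legacy_value, new_value) for every mismatch. A key that is
    present in only one environment is reported with None on the other side.
    """
    mismatches = []
    for key in sorted(legacy.keys() | new.keys()):
        a, b = legacy.get(key), new.get(key)
        if a is None or b is None or not math.isclose(a, b, rel_tol=rel_tol):
            mismatches.append((key, a, b))
    return mismatches

# Hypothetical sample: inv-2 differs by three per cent, inv-3 and inv-4
# each exist in only one environment.
legacy = {"inv-1": 100.0, "inv-2": 250.0, "inv-3": 80.0}
new = {"inv-1": 100.0, "inv-2": 257.5, "inv-4": 12.0}
print(compare_outputs(legacy, new))
# [('inv-2', 250.0, 257.5), ('inv-3', 80.0, None), ('inv-4', None, 12.0)]
```

The default tolerance is zero on purpose: any allowance for differences should be an explicit decision, not a side effect of the comparison tool.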

Duration and Exit Criteria

Parallel runs should cover at least one full business cycle. For most applications, that means a minimum of one week — enough to capture daily batch processes, weekly reports, and typical traffic patterns. For workloads with monthly cycles (billing, payroll, financial reporting), a full month of parallel running is warranted.

Define exit criteria before the parallel run begins. These should be specific and measurable: zero functional discrepancies for five consecutive business days, P95 latency within ten per cent of legacy, zero unhandled exceptions. Don't start with vague goals like "everything looks good." Define what "good" means numerically, and hold to it.
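
Numeric exit criteria are straightforward to encode so that the go/no-go decision is not a matter of opinion. The sketch below uses the example thresholds from the paragraph above; the `DayResult` record is hypothetical, and "within ten per cent" is interpreted here as "new P95 no more than ten per cent slower than legacy", which is an assumption.

```python
from dataclasses import dataclass

@dataclass
class DayResult:
    discrepancies: int
    unhandled_exceptions: int
    legacy_p95_ms: float
    new_p95_ms: float

def exit_criteria_met(days, clean_days_required=5, latency_margin=0.10):
    """Return True if the most recent `clean_days_required` business days all pass.

    `days` is ordered oldest first, one entry per business day.
    """
    if len(days) < clean_days_required:
        return False
    recent = days[-clean_days_required:]
    return all(
        d.discrepancies == 0
        and d.unhandled_exceptions == 0
        and d.new_p95_ms <= d.legacy_p95_ms * (1 + latency_margin)
        for d in recent
    )

# One bad day followed by five clean days passes.
history = [DayResult(2, 0, 200.0, 210.0)] + [DayResult(0, 0, 200.0, 215.0)] * 5
print(exit_criteria_met(history))  # True

# A sixth day with P95 fifteen per cent slower than legacy fails the check.
history.append(DayResult(0, 0, 200.0, 230.0))
print(exit_criteria_met(history))  # False
```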

The Cutover Plan

The cutover is the moment you redirect production traffic from the legacy environment to the new one. A well-sequenced migration with successful parallel runs makes the cutover itself low-drama — which is exactly the goal.

Pre-Cutover Checklist

Before cutting over any workload, confirm: the parallel run met all exit criteria, the rollback procedure has been tested (not just documented — actually tested), the operations team knows the escalation path, monitoring and alerting are configured in the new environment, and stakeholders have been notified of the cutover window.

The Cutover Window

Schedule cutovers during low-traffic periods, but don't mistake low traffic for zero risk. Even during off-peak hours, real users are generating real transactions. The cutover should be designed to be reversible for a defined period — typically 24 to 72 hours — during which the legacy environment remains available as a fallback.

DNS-based cutovers using low TTL values allow quick rollback by redirecting traffic back to the legacy environment, provided the TTL is lowered well before the cutover so that resolvers are no longer holding the older, longer-lived records. Load balancer routing provides even faster switching because it does not depend on DNS caches expiring. The specific mechanism depends on the workload, but the principle is the same: maintain the ability to revert quickly until you're confident the new environment is stable under full production load.

Post-Cutover Monitoring

The first 48 hours after cutover are the highest-risk period. Assign dedicated monitoring to the migrated workload. Watch error rates, latency percentiles, database connection counts, queue depths, and any workload-specific health metrics. Establish clear thresholds for automatic rollback — for example, if error rates exceed two per cent for more than fifteen minutes, trigger the rollback procedure without waiting for a decision meeting.
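
A sustained-threshold rule like the one in the example can be written as a small function over per-minute samples. This is a sketch, assuming one error-rate sample per minute supplied by whatever monitoring system is in use; the `should_roll_back` name and its parameters are hypothetical.

```python
def should_roll_back(error_rates, threshold=0.02, sustained_minutes=15):
    """Decide whether the rollback procedure should be triggered.

    `error_rates` holds one sample per minute, oldest first, as fractions
    (0.02 means two per cent). Returns True once the rate has stayed above
    `threshold` for more than `sustained_minutes` consecutive samples.
    """
    streak = 0
    for rate in error_rates:
        streak = streak + 1 if rate > threshold else 0
        if streak > sustained_minutes:
            return True
    return False

print(should_roll_back([0.03] * 15))                       # False: exactly 15 minutes
print(should_roll_back([0.03] * 16))                       # True: more than 15 minutes
print(should_roll_back([0.03] * 10 + [0.01] + [0.03] * 10))  # False: streak was broken
```

Requiring a consecutive streak avoids rolling back on a brief spike while still acting without a meeting once the problem is clearly sustained.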

This kind of structured cutover discipline is part of how sound software development practice manages complexity — not by eliminating risk, but by making risk visible and controllable.

Managing the In-Between State

Every sequenced migration creates a period where some workloads are in the new environment and some remain in the legacy environment. This in-between state is unavoidable and needs to be managed deliberately, not treated as a temporary inconvenience.

Cross-Environment Connectivity

Workloads in the new environment still need to communicate with workloads in the legacy environment. This requires network connectivity between environments — VPN tunnels, direct connections, or API gateways — that is often underestimated in migration planning. Latency between environments will be higher than within either environment. Services that depend on low-latency communication need to migrate together or have their communication patterns adapted.

Data Consistency

The most complex aspect of the in-between state is data. When a workload migrates but its database doesn't (or vice versa), you need a data synchronisation strategy. Options include real-time replication, event-driven synchronisation, or dual-write patterns. Each has trade-offs in complexity, latency, and consistency guarantees. AWS's migration best practices documentation provides detailed guidance on data synchronisation patterns during migration.

Operational Complexity

During the in-between state, your operations team is managing two environments. Monitoring, logging, alerting, and incident response need to work across both. This doubles the operational surface area and requires clear documentation of which workloads are where. A migration tracker — updated in real time — is essential. Without it, incident responders waste critical minutes determining which environment a failing workload is running in.
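
At its simplest, a migration tracker is a lookup from workload to environment, and combining it with the dependency map shows which calls currently cross environments. The sketch below is illustrative only: the `tracker` contents, the `depends_on` edges, and the workload names are hypothetical.

```python
from enum import Enum

class Env(Enum):
    LEGACY = "legacy"
    NEW = "new"

# Hypothetical tracker state: workload -> environment currently serving production.
tracker = {
    "payments": Env.NEW,
    "payments-db": Env.LEGACY,
    "reporting": Env.NEW,
    "auth": Env.LEGACY,
    "wiki": Env.NEW,
}

# Hypothetical dependency edges as (caller, dependency) pairs.
depends_on = [
    ("payments", "payments-db"),
    ("reporting", "auth"),
    ("wiki", "reporting"),
]

def cross_environment_calls(tracker, depends_on):
    """Return dependency edges whose two ends are in different environments."""
    return [(a, b) for a, b in depends_on if tracker[a] != tracker[b]]

print(tracker["payments"].value)  # new
print(cross_environment_calls(tracker, depends_on))
# [('payments', 'payments-db'), ('reporting', 'auth')]
```

The first lookup answers the incident responder's question of where a workload is running; the second lists the connections exposed to cross-environment latency at that moment.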

The Weekend Migration That Became a Five-Day Incident

The composite in the opening illustrates what happens when migration architecture is absent. The team had migrated workloads but hadn't sequenced them by dependency. They had a rollback plan but hadn't tested it against shared database dependencies. They had a cutover window but no parallel run to validate workload behaviour before the cut.

The three applications that failed shared a pattern: each depended on a service or database that was in a different migration state. The payment processing service had migrated, but its database hadn't — and the cross-environment latency introduced transaction timeouts that didn't exist when both were co-located. The reporting tool had migrated, but the authentication service it used was still in the legacy environment, and the temporary authentication bridge had a configuration error. The customer portal was caught in a half-migrated state when the team paused migration to investigate the first two failures.

Every one of these issues was discoverable. Dependency mapping would have identified the shared database. A parallel run would have exposed the latency-induced timeouts. Testing the rollback procedure would have revealed the shared dependency problem. The team wasn't unskilled — they simply lacked a migration architecture that made risk visible before it materialised. This is a recurring theme across application delivery at scale: the practices that prevent incidents are planning practices, not heroic recovery practices.

Google's SRE guidance makes a related point: the safety of any system change, migrations included, depends more on the ability to roll back quickly than on confidence in the roll-forward plan. You don't need to be certain the migration will succeed. You need to be certain that if it doesn't, you can recover without data loss or extended downtime.

What to Do Next

If you're planning a cloud migration, start with dependency mapping. Document every integration point for every workload in scope. Then sequence those workloads into waves — low risk first, high risk last. Define parallel run criteria for each wave. Test your rollback procedure before you need it.

If you're mid-migration and things feel fragile, pause. The most dangerous migration is the one that continues forward because the team feels committed to the timeline. A migration that takes twelve weeks and works is better than one that takes four weeks and creates a five-day incident.

If you need help building a migration architecture that sequences risk, validates through parallel runs, and maintains rollback capability throughout — EB Pearls' DevOps team can help you plan and execute a migration that moves your workloads without moving your risk. We've supported migrations across the emerging technology landscape and can bring that experience to your migration planning.

Frequently Asked Questions

What is cloud migration architecture?

Cloud migration architecture is the structured approach to planning, sequencing, and executing the movement of workloads from one environment to another — typically from on-premises infrastructure to cloud platforms. It encompasses dependency mapping, risk-based sequencing, parallel run validation, cutover planning, and rollback strategies. The goal is to reduce migration risk by making each step verifiable and reversible rather than relying on a single high-stakes cutover event.

What order should we migrate workloads in?

Sequence workloads by risk and dependency. Start with low-risk, low-dependency workloads — internal tools, development environments, batch jobs — to build migration experience and validate the target environment. Move to medium-risk workloads with moderate integration points next. Migrate high-risk, customer-facing, and high-dependency workloads last, after your team has resolved environmental issues and validated procedures through earlier waves.

What is a parallel run in migration?

A parallel run involves operating a workload simultaneously in both the legacy and new environments, comparing outputs to verify that the new environment produces correct results under real production conditions. It answers the question that pre-migration testing cannot fully address: does this workload behave identically in production? Parallel runs should cover at least one full business cycle before cutover is approved.

How do we handle rollback if something goes wrong?

Design rollback procedures for each workload before migration begins, and test them — not just document them. Use DNS-based or load-balancer-based routing to enable quick traffic redirection back to the legacy environment. Maintain the legacy environment as a functional fallback for a defined period (typically 24 to 72 hours) after cutover. Critically, ensure rollback plans account for shared dependencies between workloads that may be in different migration states.

How long does a cloud migration typically take?

Duration depends on the number of workloads, their complexity, and their interdependencies. A sequenced migration with proper parallel runs typically takes longer on the calendar than a big-bang approach, but it spends far less time in incident recovery. As a rough guide, a migration of 10 to 20 workloads, sequenced into three to four waves with stabilisation periods between them, takes eight to sixteen weeks. Rushing the timeline is one of the most common causes of migration failure.

What is the biggest risk during cloud migration?

The biggest risk is shared dependencies between workloads that are in different migration states. When some workloads have migrated and others haven't, cross-environment communication introduces latency, connectivity, and data consistency challenges that don't exist when everything runs in one environment. Thorough dependency mapping before migration begins is the primary mitigation for this risk.
