Phased Migration Protocol: Move Workloads in the Right Order

Published

19 Jun 2026

Author
Roshan Manandhar


A migration team started with the payment system because the CFO wanted it moved first. The logic was straightforward: the payment system was the most expensive workload on legacy infrastructure, so moving it first would deliver the fastest cost savings. The team spent three weeks preparing the cutover. On migration night, a configuration error in the new environment's load balancer caused the payment service to drop transactions intermittently. For four hours, the most critical system in the organisation was unreliable. Customer-facing transactions failed. Internal reconciliation broke. The incident response team scrambled to roll back, but the rollback procedure had been designed for a clean failure, not a partial one — the system was processing some transactions successfully in the new environment while dropping others.

The four-hour outage cost more than a year of the infrastructure savings the early migration was supposed to deliver. And the root cause wasn't technical incompetence. The team was skilled. The configuration error was a single misaligned health check parameter that would have been caught in a parallel run. The real failure was sequencing: they started with the highest-risk, highest-consequence workload and gave themselves zero room for the kind of mistake that every migration produces at least once.

At EB Pearls™, we've delivered over 900 projects across infrastructure, application delivery, and cloud migration. The phased migration protocol we walk through below is the approach we use — and the approach we've seen work — when the goal is a migration where mistakes happen early, on low-consequence workloads, and every step is proven before the next one begins.

This walkthrough covers the full protocol: classifying workloads by risk, building a sequenced migration order, running parallel environments to prove correctness, testing rollback at every stage, and executing cutovers that are deliberately anticlimactic.

Step 1 — Inventory and Classify Every Workload

Before you sequence anything, you need a complete picture of what you're migrating. The inventory isn't a list of server names. It's a classification of every workload along three dimensions: business criticality, dependency complexity, and migration difficulty.

Business criticality answers the question: what happens if this workload is unavailable for four hours? For a payment processing system, the answer involves lost revenue, failed transactions, and regulatory exposure. For an internal wiki, the answer is mild inconvenience. Classify each workload as critical (revenue or compliance impact within minutes), important (operational impact within hours), or standard (impact measured in days).

Dependency complexity maps what each workload connects to. A workload that reads from two databases, writes to a message queue, authenticates through a shared identity provider, and exposes an API consumed by four other services has high dependency complexity. A standalone batch job that reads a file from storage and writes results to another location has low complexity. Document every integration point — network, data, authentication, and operational monitoring — for every workload in scope.

Migration difficulty accounts for the technical factors that make a specific workload harder to move. Stateful services are harder than stateless ones. Workloads with real-time latency requirements are harder than batch processes. Systems with undocumented configurations are harder than those managed through infrastructure as code.

The output of this step is a workload register: every workload listed with its criticality rating, dependency map, and difficulty assessment. This register becomes the input for sequencing.
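One way to hold the register is as a small typed record per workload. The sketch below is illustrative only: the class names, the 1-to-3 numeric scale, and the sample entries are assumptions made for this example, not a prescribed schema.

```python
from dataclasses import dataclass, field
from enum import IntEnum


class Criticality(IntEnum):
    STANDARD = 1   # impact measured in days
    IMPORTANT = 2  # operational impact within hours
    CRITICAL = 3   # revenue or compliance impact within minutes


class Level(IntEnum):
    LOW = 1
    MODERATE = 2
    HIGH = 3


@dataclass
class Workload:
    name: str
    criticality: Criticality
    dependency_complexity: Level
    migration_difficulty: Level
    # Integration points: network, data, authentication, monitoring.
    depends_on: list[str] = field(default_factory=list)


# Hypothetical entries, for illustration only.
register = [
    Workload("payments", Criticality.CRITICAL, Level.HIGH, Level.HIGH,
             depends_on=["orders-db", "message-queue", "identity-provider"]),
    Workload("internal-wiki", Criticality.STANDARD, Level.LOW, Level.LOW),
    Workload("reporting-api", Criticality.IMPORTANT, Level.MODERATE, Level.MODERATE,
             depends_on=["orders-db", "identity-provider"]),
]

# Sorting by the three ratings gives a first view of what is safest to move.
by_risk = sorted(
    register,
    key=lambda w: (w.criticality, w.dependency_complexity, w.migration_difficulty),
)
print([w.name for w in by_risk])
```

Keeping the ratings as ordered values rather than free text makes the register directly usable as input to sequencing.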

Step 2 — Sequence into Migration Waves by Risk

The sequencing principle is simple: migrate low-risk workloads first and high-risk workloads last. The reason is equally simple — every migration surfaces unexpected issues, and you want those issues to appear on workloads where the consequences are manageable, not catastrophic.

Wave 1: Low Risk, Low Dependency

Start with workloads classified as standard criticality, low dependency complexity, and low migration difficulty. Typical candidates are internal tools, development and staging environments, documentation systems, and batch processing jobs that run during off-hours. These are your learning migrations.

Wave 1 exists to teach your team how the target environment actually behaves — not how the documentation says it behaves. You'll discover that DNS propagation takes longer than expected. You'll find that a firewall rule that worked in testing doesn't work in production. You'll learn that the new environment handles connection pooling differently. Every one of these discoveries is a lesson that makes subsequent waves safer.

Plan for two to four workloads in Wave 1. Allow a one-to-two-week stabilisation period after the wave completes before beginning Wave 2.

Wave 2: Medium Risk, Moderate Dependency

Move to workloads with important criticality and moderate dependency complexity. Typical candidates are back-office systems, internal APIs, reporting platforms, and non-customer-facing data processing. These workloads have real integration points that need to operate across environments during the migration period.

Wave 2 tests your cross-environment architecture. Workloads in the new environment need to communicate with workloads still in the legacy environment. Latency between environments will be higher. Authentication tokens need to work across both. Data synchronisation patterns need to function under real load. These are the problems you need to solve before migrating anything customer-facing.

Plan for three to six workloads in Wave 2, potentially split into sub-waves if dependency chains require it. Allow a two-week stabilisation period before Wave 3.

Wave 3: High Risk, High Dependency

Customer-facing applications, payment processing, real-time transaction systems, and any workload where downtime has immediate revenue or regulatory consequences. These move last.

By the time you reach Wave 3, your team has migrated multiple workloads across two waves. The target environment has been validated under real conditions. Cross-environment connectivity has been proven. Rollback procedures have been tested — ideally exercised for real during an earlier wave. The team knows the environment's quirks. They've built the muscle memory for migration operations.

This is exactly why the composite scenario in the opening went wrong. The team started at Wave 3 difficulty with Wave 1 experience. The phased migration protocol exists to ensure you never face high-consequence migration decisions before you've earned the operational knowledge to make them well.
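The wave definitions above can be sketched as a simple assignment rule. This is a minimal illustration under one stated assumption: a workload lands in the wave of its highest rating, so a single high rating is enough to push it to Wave 3. The function name, the 1-to-3 scale, and the sample workloads are hypothetical.

```python
def assign_wave(criticality: int, dependency: int, difficulty: int) -> int:
    """Return 1, 2, or 3. Each rating is 1 (low) to 3 (high).

    Assumption: the highest of the three ratings decides the wave.
    """
    for rating in (criticality, dependency, difficulty):
        if rating not in (1, 2, 3):
            raise ValueError(f"rating must be 1, 2, or 3, got {rating}")
    return max(criticality, dependency, difficulty)


# Hypothetical workloads: (criticality, dependency complexity, difficulty).
workloads = {
    "internal-wiki": (1, 1, 1),
    "nightly-batch": (1, 1, 2),
    "reporting-api": (2, 2, 1),
    "payments": (3, 3, 3),
}

waves: dict[int, list[str]] = {1: [], 2: [], 3: []}
for name, ratings in workloads.items():
    waves[assign_wave(*ratings)].append(name)

print(waves)
```

Taking the maximum is deliberately conservative. A team could reasonably weight the dimensions differently, as long as the rule is agreed before sequencing starts rather than negotiated per workload.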

Step 3 — Establish Parallel Run Criteria for Each Wave

A parallel run is the practice of operating a workload simultaneously in both environments, comparing outputs to verify the new environment produces correct results before you commit to the cutover. It's the single most effective technique for catching problems before they affect users.

Define What You're Comparing

For each workload, specify what a successful parallel run looks like across three dimensions.

Functional correctness. Does the new environment produce the same outputs for the same inputs? For an API service, this means response payloads match. For a data processing job, this means output datasets are identical. For a reporting system, this means reports contain the same figures.

Performance equivalence. Does the new environment meet latency and throughput benchmarks? Define specific thresholds: P95 response time within ten per cent of legacy, throughput within five per cent, CPU and memory utilisation within acceptable bounds.

Error behaviour parity. Does the new environment handle edge cases and failures the same way? Feed it malformed inputs, simulate upstream failures, test timeout scenarios. The goal is confirming that error handling in the new environment matches or improves on the legacy environment.
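As a rough sketch of the first two comparisons, the snippet below checks payload equality per request and tests whether P95 latency in the new environment is within ten per cent of legacy. The capture format (a dict with `responses` and `latencies_ms`) is an assumption made for the example, and "within ten per cent" is interpreted here as "no more than ten per cent slower".

```python
import statistics


def p95(samples: list[float]) -> float:
    """95th percentile using the standard library's inclusive method."""
    return statistics.quantiles(samples, n=100, method="inclusive")[94]


def compare_run(legacy: dict, new: dict, max_p95_regression: float = 0.10) -> dict:
    """Compare one set of parallel-run samples from each environment.

    Each argument maps "responses" to {request_id: payload} and
    "latencies_ms" to a list of floats (a hypothetical capture format).
    """
    # A request missing from the new environment also counts as a mismatch.
    mismatched = sorted(
        rid for rid, payload in legacy["responses"].items()
        if new["responses"].get(rid) != payload
    )
    legacy_p95 = p95(legacy["latencies_ms"])
    new_p95 = p95(new["latencies_ms"])
    return {
        "mismatched_ids": mismatched,
        "p95_ok": new_p95 <= legacy_p95 * (1 + max_p95_regression),
    }


legacy = {"responses": {"r1": {"total": 10}, "r2": {"total": 20}},
          "latencies_ms": [100.0] * 20}
new = {"responses": {"r1": {"total": 10}, "r2": {"total": 21}},
       "latencies_ms": [105.0] * 20}
result = compare_run(legacy, new)
print(result)
```

Error behaviour parity can be checked with the same comparison by replaying malformed inputs and simulated upstream failures and treating the error responses as the payloads.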

Define Duration and Exit Criteria

Parallel runs should cover at least one full business cycle — a minimum of one week for most workloads, a full month for workloads with monthly processing cycles such as billing or financial reporting. Define exit criteria before the parallel run begins: zero functional discrepancies for five consecutive business days, performance within defined thresholds for the same period, no unhandled exceptions.

These exit criteria must be specific, measurable, and agreed by the migration team and the workload owner before the parallel run starts. "It looks fine" is not an exit criterion. "Zero discrepancies across 50,000 compared transactions over five business days" is.
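Exit criteria written this precisely can be checked mechanically. The sketch below assumes one summary record per business day; the `DayResult` fields and the function name are hypothetical, and the defaults mirror the example figures above (five clean days, 50,000 compared transactions).

```python
from dataclasses import dataclass


@dataclass
class DayResult:
    compared: int             # transactions compared that business day
    discrepancies: int        # functional mismatches found
    within_thresholds: bool   # performance stayed inside agreed limits
    unhandled_exceptions: int


def exit_criteria_met(days: list[DayResult], required_clean_days: int = 5,
                      min_compared: int = 50_000) -> bool:
    """True when the most recent `required_clean_days` business days are all
    clean and together cover at least `min_compared` transactions."""
    if len(days) < required_clean_days:
        return False
    window = days[-required_clean_days:]
    clean = all(
        d.discrepancies == 0 and d.within_thresholds and d.unhandled_exceptions == 0
        for d in window
    )
    return clean and sum(d.compared for d in window) >= min_compared


# One day with discrepancies, followed by five clean days.
history = [DayResult(9_000, 2, True, 0)] + [DayResult(11_000, 0, True, 0) for _ in range(5)]
ready = exit_criteria_met(history)
print(ready)
```

Because only the trailing window is evaluated, a single discrepancy restarts the count of consecutive clean days.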

The project delivery framework that governs how we plan work at EB Pearls applies the same principle: define success criteria before execution, not after.

Step 4 — Test Rollback Before You Need It

Every workload in every wave needs a tested rollback procedure. Not a documented procedure — a tested one. The distinction matters because rollback plans that exist only on paper routinely fail when executed under pressure.

What Rollback Testing Looks Like

For each workload, execute the full rollback sequence in a non-production environment that mirrors production topology. Migrate the workload forward to the new environment. Verify it's running. Then execute the rollback: redirect traffic back to legacy, confirm data consistency, verify that dependent services reconnect correctly.

Pay particular attention to data state during rollback. If the workload processed transactions in the new environment before the rollback was triggered, what happens to those transactions? Are they replicated back to the legacy database? Are they lost? Are they duplicated? The payment system outage in the opening composite was a partial failure — some transactions succeeded in the new environment while others failed. A rollback in that state needs to account for data that exists in the new environment but not the old one.
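The questions above reduce to comparing what each environment has committed. The following is a minimal sketch, assuming you can export the set of transaction IDs committed in each environment since the cutover began; how those sets are collected, and the category names, are left as assumptions.

```python
def reconcile_for_rollback(legacy_ids: set[str], new_ids: set[str]) -> dict[str, set[str]]:
    """Classify transaction IDs before rolling back."""
    return {
        # Committed only in the new environment: these must be replayed
        # into legacy, or they are lost on rollback.
        "replay_to_legacy": new_ids - legacy_ids,
        # Present in both: replaying these would create duplicates.
        "already_in_both": new_ids & legacy_ids,
        # Only in legacy: nothing to do.
        "legacy_only": legacy_ids - new_ids,
    }


plan = reconcile_for_rollback(
    legacy_ids={"t1", "t2", "t3"},
    new_ids={"t3", "t4", "t5"},
)
print(plan)
```

Running this classification as part of the rollback test, not only during a real incident, is what exposes a partial-failure state before it matters.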

Rollback for Shared Dependencies

The most dangerous rollback scenarios involve workloads that share dependencies across migration states. If Workload A has migrated and depends on Database X, and Workload B is still in legacy and also depends on Database X, rolling back Workload A might break Workload B if the database schema or connection configuration changed during migration. Map these shared dependency rollback paths explicitly and test them. AWS's migration prescriptive guidance details several patterns for managing database dependencies across environments during phased migrations.
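Shared dependencies of this kind can be listed directly from the workload register. The sketch below assumes each workload records its current environment and its dependencies; the dictionary layout and the names are hypothetical.

```python
from collections import defaultdict


def shared_dependency_risks(workloads: dict[str, dict]) -> dict[str, dict[str, list[str]]]:
    """Find dependencies used by workloads in more than one environment.

    `workloads` maps a workload name to
    {"env": "legacy" or "new", "depends_on": [dependency names]}.
    """
    users: dict[str, dict[str, list[str]]] = defaultdict(lambda: defaultdict(list))
    for name, info in workloads.items():
        for dep in info["depends_on"]:
            users[dep][info["env"]].append(name)
    # Keep only dependencies with users on both sides of the migration.
    return {
        dep: {env: sorted(names) for env, names in by_env.items()}
        for dep, by_env in users.items()
        if len(by_env) > 1
    }


risks = shared_dependency_risks({
    "workload-a": {"env": "new", "depends_on": ["database-x"]},
    "workload-b": {"env": "legacy", "depends_on": ["database-x"]},
    "workload-c": {"env": "legacy", "depends_on": ["file-store"]},
})
print(risks)
```

Every dependency this returns is a rollback path that needs its own explicit test.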

Automate Where Possible

Rollback procedures that require a human to execute fifteen manual steps under pressure at two in the morning will fail. Automate the rollback sequence: DNS reversion, load balancer switching, database failover, service restart. The trigger can be manual — someone decides to roll back — but the execution should be scripted and tested.
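A scripted rollback can be as simple as an ordered list of steps that stops at the first failure. In this sketch the step actions are placeholders that always succeed; in practice each would call your own DNS, load balancer, database, and service tooling, none of which is specified here.

```python
from typing import Callable


def run_rollback(steps: list[tuple[str, Callable[[], bool]]]) -> list[str]:
    """Run rollback steps in order and stop at the first step that fails.

    Returns the names of the completed steps. Each step is a callable
    that returns True on success.
    """
    completed: list[str] = []
    for name, action in steps:
        if not action():
            raise RuntimeError(f"rollback step failed: {name}; completed: {completed}")
        completed.append(name)
    return completed


# Placeholder actions (assumptions). Replace with real tooling calls.
steps = [
    ("revert DNS", lambda: True),
    ("switch load balancer", lambda: True),
    ("fail over database", lambda: True),
    ("restart services", lambda: True),
]

completed = run_rollback(steps)
print(completed)
```

The decision to roll back stays with a person. The script removes the fifteen manual steps, and a failed step reports exactly how far the sequence got.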

Step 5 — Execute the Cutover for Each Wave

With parallel runs validated and rollback tested, the cutover itself should be the least dramatic part of the migration. That's the goal — a cutover so well-prepared that it's procedural rather than heroic.

Pre-Cutover Verification

Before cutting over any workload, confirm five things. The parallel run met all exit criteria. The rollback procedure has been tested within the last week (environments drift; a rollback tested a month ago may not work today). The operations team knows the escalation path and has the rollback runbook accessible. Monitoring and alerting are configured and verified in the new environment. Stakeholders have been notified of the cutover window and the expected duration.
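The five checks above lend themselves to a go/no-go gate. The sketch below is one possible shape, with hypothetical check names: four checks are plain booleans, and the rollback test is evaluated by age so that a stale test blocks the cutover.

```python
from datetime import date, timedelta


def pre_cutover_blockers(checks: dict[str, bool], rollback_tested_on: date,
                         today: date, max_age_days: int = 7) -> list[str]:
    """Return the reasons a cutover must not proceed (empty list means go)."""
    blockers = [name for name, passed in checks.items() if not passed]
    if today - rollback_tested_on > timedelta(days=max_age_days):
        blockers.append("rollback test is older than one week")
    return blockers


checks = {
    "parallel run met exit criteria": True,
    "operations team has runbook and escalation path": True,
    "monitoring and alerting verified": False,
    "stakeholders notified": True,
}
blockers = pre_cutover_blockers(
    checks, rollback_tested_on=date(2026, 6, 1), today=date(2026, 6, 19)
)
print(blockers)
```

An empty list is the only result that permits the cutover to start.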

Execute the Cut

Schedule cutovers during low-traffic periods appropriate to the workload. Use DNS-based routing with low TTL values or load balancer weighted routing to redirect traffic gradually. A common pattern is 10-25-50-100 per cent traffic shifting over a period of hours, monitoring error rates and latency at each stage. If any stage breaches defined thresholds, halt the traffic shift and investigate before proceeding.
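The staged shift can be sketched as a loop that advances only while metrics stay inside the agreed thresholds. The `set_weight` and `read_metrics` callables are stand-ins for whatever your load balancer and monitoring expose; their names, the metric keys, and the threshold values are assumptions for the example.

```python
from typing import Callable

STAGES = (10, 25, 50, 100)  # per cent of traffic sent to the new environment


def shift_traffic(set_weight: Callable[[int], None],
                  read_metrics: Callable[[], dict],
                  max_error_rate: float, max_p95_ms: float) -> int:
    """Walk through STAGES and return the last percentage that passed checks.

    Halts as soon as a stage breaches a threshold. The weight is left at
    the breaching stage so the team can investigate or roll back.
    """
    last_good = 0
    for percent in STAGES:
        set_weight(percent)
        metrics = read_metrics()  # in practice, sampled after a soak period
        if metrics["error_rate"] > max_error_rate or metrics["p95_ms"] > max_p95_ms:
            return last_good
        last_good = percent
    return last_good


# Simulated run: the third stage breaches both thresholds.
applied: list[int] = []
readings = iter([
    {"error_rate": 0.001, "p95_ms": 180},
    {"error_rate": 0.002, "p95_ms": 190},
    {"error_rate": 0.030, "p95_ms": 400},
])
reached = shift_traffic(applied.append, lambda: next(readings),
                        max_error_rate=0.01, max_p95_ms=250)
print(reached, applied)
```

In the simulated run the shift stops at the 50 per cent stage and reports 25 as the last stage that passed.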

For workloads that cannot tolerate split-traffic routing, typically those with strong consistency requirements, a hard cutover with immediate rollback capability is appropriate. Switch DNS to the new environment in a single step, monitor for a defined observation period (typically 30 to 60 minutes under full production load), and confirm stability before decommissioning the rollback path.

Post-Cutover Observation

The first 48 hours after cutover are the highest-risk window. Assign dedicated monitoring to the migrated workload. Define automatic rollback triggers: if error rates exceed a defined threshold for more than a defined duration, execute the rollback without waiting for a consensus meeting. The threshold and duration should be agreed before the cutover — not debated during an incident.
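A trigger of this kind is a sustained-breach check: a brief spike should not fire it, but a breach that persists past the agreed duration should. The sketch below assumes time-ordered error-rate samples; the function name and the sample values are illustrative.

```python
def should_roll_back(samples: list[tuple[float, float]],
                     threshold: float, duration_s: float) -> bool:
    """`samples` is a time-ordered list of (timestamp_seconds, error_rate).

    Returns True once the error rate has stayed above `threshold`
    continuously for more than `duration_s` seconds.
    """
    breach_started = None
    for ts, rate in samples:
        if rate > threshold:
            if breach_started is None:
                breach_started = ts
            if ts - breach_started > duration_s:
                return True
        else:
            # Any sample back under the threshold resets the clock.
            breach_started = None
    return False


sustained = [(0, 0.001), (60, 0.05), (120, 0.06), (180, 0.07), (240, 0.08)]
spiky = [(0, 0.001), (60, 0.05), (120, 0.001), (180, 0.05), (240, 0.001)]

print(should_roll_back(sustained, threshold=0.02, duration_s=120))
print(should_roll_back(spiky, threshold=0.02, duration_s=120))
```

Both parameters are exactly the values that should be agreed before the cutover, which is why they are arguments rather than constants.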

After the observation period, conduct a brief post-cutover review. What worked? What was unexpected? What should change for the next wave? These lessons feed directly into the next wave's preparation, which is the compounding advantage of a phased approach over a big-bang migration where all lessons arrive simultaneously with all problems.

Step 6 — Manage the Cross-Environment Period

Between the first Wave 1 cutover and the final Wave 3 decommission of legacy, your organisation operates across two environments. This period can last weeks or months. Managing it deliberately is essential.

Network and Latency

Workloads in the new environment communicate with workloads still in legacy through VPN tunnels, direct connections, or API gateways. Latency between environments will be higher than within either environment. Services that depend on sub-millisecond communication latency need to migrate together within the same wave, or their communication patterns need adaptation.
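One way to find the services that must move together is to treat latency-sensitive links as edges and group everything they connect. This is a sketch using a small union-find; the service names and links are hypothetical, and every name in a link is assumed to appear in the service list.

```python
def migration_groups(services: list[str],
                     latency_sensitive_links: list[tuple[str, str]]) -> list[list[str]]:
    """Group services joined, directly or indirectly, by latency-sensitive links."""
    parent = {s: s for s in services}

    def find(s: str) -> str:
        # Follow parents to the group root, shortening the path as we go.
        while parent[s] != s:
            parent[s] = parent[parent[s]]
            s = parent[s]
        return s

    for a, b in latency_sensitive_links:
        parent[find(a)] = find(b)

    groups: dict[str, list[str]] = {}
    for s in services:
        groups.setdefault(find(s), []).append(s)
    return sorted(sorted(g) for g in groups.values())


groups = migration_groups(
    services=["cache", "checkout", "pricing", "wiki"],
    latency_sensitive_links=[("checkout", "pricing"), ("pricing", "cache")],
)
print(groups)
```

Each resulting group should be assigned to a single wave, taking the highest-risk member's wave if the members were rated differently.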

Data Synchronisation

When a workload migrates but its database consumers haven't, or when a database migrates but its application consumers are split across environments, you need a data synchronisation strategy. Real-time replication, change data capture, event-driven sync, or dual-write patterns each have trade-offs in complexity and consistency guarantees. Choose the simplest pattern that meets the workload's consistency requirements. Over-engineering the sync layer creates its own failure modes.

Operational Visibility

During the cross-environment period, your operations team manages two environments with different monitoring stacks, different logging pipelines, and different alerting configurations. A migration status tracker — updated in real time — is essential. When an incident occurs, the first question is which environment the affected workload is running in. If that question takes more than thirty seconds to answer, your tracker isn't working. This kind of operational discipline is what separates well-delivered projects from ones that accumulate confusion over time.

Step 7 — Decommission Legacy and Close the Migration

A migration isn't complete when the last workload cuts over. It's complete when the legacy environment is decommissioned and the organisation is running entirely in the new environment with no residual dependencies.

Confirm No Residual Traffic

After the final wave's observation period, verify that no traffic is reaching the legacy environment. Check DNS records, load balancer logs, database connection logs, and network flow data. Shadow dependencies — services that were missed in the inventory or that connect to the legacy environment through undocumented paths — frequently surface at this stage. Resolve every one before decommissioning.
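Network flow data can be screened for residual traffic with a simple filter on destination address. The sketch below assumes flow records have already been parsed into dictionaries with `src` and `dst` fields and that the legacy environment is described by one or more CIDR ranges; both the record format and the addresses are made up for the example.

```python
import ipaddress
from collections import Counter


def residual_legacy_traffic(flow_records: list[dict], legacy_cidrs: list[str]) -> Counter:
    """Count flows whose destination is still inside a legacy network range."""
    networks = [ipaddress.ip_network(cidr) for cidr in legacy_cidrs]
    hits: Counter = Counter()
    for record in flow_records:
        destination = ipaddress.ip_address(record["dst"])
        if any(destination in network for network in networks):
            hits[(record["src"], record["dst"])] += 1
    return hits


flows = [
    {"src": "10.20.1.5", "dst": "10.10.0.8"},
    {"src": "10.20.1.5", "dst": "10.10.0.8"},
    {"src": "10.20.1.9", "dst": "10.20.3.4"},
]
hits = residual_legacy_traffic(flows, legacy_cidrs=["10.10.0.0/16"])
print(hits)
```

Each source address in the result points at a shadow dependency to trace and resolve before decommissioning.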

Archive and Decommission

Archive legacy environment configurations, data backups, and runbooks. Then decommission the infrastructure. The decommission should be staged: stop services first, monitor for any unexpected breakage for one to two weeks, then tear down the infrastructure. Keeping legacy infrastructure running "just in case" indefinitely is a cost and security liability. Set a hard decommission date and hold to it.

Conduct a Migration Retrospective

Document what worked, what failed, what was learned. Capture the lessons in a format that future migration teams — or future projects — can use. Every migration generates institutional knowledge. Capturing it is the difference between an organisation that gets better at migrations and one that repeats the same mistakes. Understanding the broader technology trends shaping infrastructure decisions ensures those lessons remain relevant as the platform landscape evolves.

The Payment System That Should Have Moved Last

The composite scenario illustrates the most common way a phased migration goes wrong: letting business urgency override risk sequencing. The CFO's request to migrate the payment system first was reasonable from a cost perspective. But cost optimisation and risk management are different objectives, and in a migration context, risk management must lead.

Had the team followed the protocol, the payment system would have been a Wave 3 workload — high criticality, high dependency complexity, high migration difficulty. The configuration error that caused the four-hour outage — a misaligned health check parameter on the load balancer — would almost certainly have been discovered during Wave 1 or Wave 2 migrations, where the same load balancer configuration was in play but the workload consequences were orders of magnitude lower. The team would have fixed the configuration, updated the migration runbook, and migrated the payment system weeks later with the issue already resolved.

The Google Cloud Architecture Framework describes this principle as "failing cheap" — engineering your process so that failures occur where their cost is lowest. A phased migration protocol is the practical implementation of that principle.

What to Do Next

Start with the workload inventory. Document every workload in scope, classify each by criticality, dependency complexity, and migration difficulty, and build the wave sequence. Don't skip the inventory because you think you already know what's connected to what. Every migration team that skips this step discovers dependencies during cutover that they could have mapped in planning.

If you're mid-migration and the sequence feels wrong — if the next workload in the plan is high-risk and your team hasn't built confidence from lower-risk migrations — pause and re-sequence. A migration that takes sixteen weeks and succeeds is vastly preferable to one that takes eight weeks and produces a four-hour outage on your most critical system.

If you need help building a phased migration protocol that sequences workloads by risk, validates each step through parallel runs, and maintains tested rollback capability throughout, EB Pearls' DevOps team can help you plan and execute. Across 900+ projects and 1400+ businesses, we've supported migrations at every scale — and the protocol described here is the one we use.

Frequently Asked Questions

What is a phased migration protocol?

A phased migration protocol is a structured approach to moving workloads from one environment to another in sequenced waves, ordered by risk and dependency. Rather than migrating everything at once, workloads are classified, sequenced from low risk to high risk, validated through parallel runs, and cut over individually with tested rollback procedures at every stage. The protocol ensures that migration mistakes happen on low-consequence workloads before high-consequence ones are moved.

What order should we migrate workloads in?

Sequence by risk, not by business urgency or cost savings. Start with low-criticality, low-dependency workloads — internal tools, development environments, batch jobs. Move to medium-criticality workloads with moderate integration points next. Migrate customer-facing, revenue-critical, and highly interconnected workloads last, after your team has resolved environmental issues and built operational confidence through earlier waves.

How do we test each migration step?

Use parallel runs: operate the workload simultaneously in both the legacy and new environments and compare outputs across functional correctness, performance, and error behaviour. Define specific, measurable exit criteria before each parallel run begins. A typical minimum is one full business cycle (one week for most workloads, a full month for workloads with monthly cycles such as billing) with zero discrepancies before cutover is approved.

What if we need to roll back a migration?

Design and test rollback procedures for every workload before migration begins. Use DNS-based or load-balancer-based routing for quick traffic redirection. Maintain the legacy environment as a functional fallback for at least the 48-hour post-cutover observation window, and up to 72 hours, after each cutover. Critically, test rollback against shared dependencies: workloads in different migration states sharing databases or services create the most dangerous rollback scenarios.

How long does a phased migration take?

A phased migration takes longer in calendar time than a big-bang approach, but it typically means far less time spent recovering from incidents. A typical migration of 10 to 20 workloads, sequenced into three to four waves with stabilisation periods between them, takes ten to twenty weeks. The stabilisation periods between waves, typically one to two weeks, are where lessons from each wave are applied to the next.

What is the biggest mistake in cloud migration sequencing?

Starting with the highest-risk workload. Business stakeholders often want the most expensive or most visible system migrated first to demonstrate ROI. But the first migration is where your team has the least operational experience with the target environment. Configuration errors, environmental differences, and procedural gaps are most likely to surface during early migrations. Those discoveries should happen on workloads where a four-hour outage is inconvenient, not catastrophic.
