Minimising Downtime During a Cloud Migration

A cloud migration is surgery on the heart of your IT. You're moving systems your organisation runs on daily to a different environment. The question isn't whether there's risk of downtime, but how you keep that risk manageable.

Downtime costs money. Estimates attributed to Gartner range from $137 per minute for smaller companies to $9,000 per minute for multinationals. For the Eurozone, the average is put at roughly €4,600 per minute. Even for an SME with a webshop generating €50,000 per month, four hours offline means a few hundred euros in missed orders on an average day and easily a few thousand during a sales peak, plus the indirect damage: customers who drop off, search engines penalising a slow or unreachable site, employees twiddling their thumbs.
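A back-of-the-envelope check of the webshop example might look like the sketch below. It assumes revenue is spread evenly over a 30-day month; the `peak_factor` parameter is a hypothetical knob for outages that hit a busy period, and the function name is illustrative only. The direct loss depends heavily on when the outage happens.

```python
def missed_revenue(monthly_revenue: float, outage_hours: float,
                   peak_factor: float = 1.0, days_per_month: int = 30) -> float:
    """Estimate direct revenue lost during an outage.

    Assumes revenue is spread evenly over the month. peak_factor scales the
    hourly average for an outage during a busy period (an assumption, not data).
    """
    hourly_average = monthly_revenue / (days_per_month * 24)
    return hourly_average * outage_hours * peak_factor


average_day = missed_revenue(50_000, 4)                 # about 278
sales_peak = missed_revenue(50_000, 4, peak_factor=10)  # about 2,778
print(f"average day: €{average_day:.0f}, sales peak: €{sales_peak:.0f}")
```

The indirect damage (lost customers, idle staff) is not modelled here and usually outweighs the direct figure.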

The good news: with the right approach you can execute a migration with minimal to zero downtime. This article shows you how.

Why migrations fail

Most migration failures fall into three categories.

Incomplete inventory. You don't know exactly what's running, what dependencies exist, and which systems communicate with each other. An application seems standalone, but turns out to make an API call to an internal service you haven't migrated yet. Result: the application starts up in the cloud and fails immediately.

Big bang cutover. Moving everything at once on a Friday evening and hoping it works Monday morning. This is the riskiest strategy there is. If something goes wrong (and something always goes wrong), you have no working environment to fall back on.

No rollback plan. If the new environment doesn't work as expected, you need to be able to go back within minutes. Without a tested rollback plan you're stuck: the old environment is already shut down, the new one doesn't work, and your team is improvising under pressure.
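The incomplete-inventory failure can be caught before cutover with a simple dependency check. The sketch below assumes you have recorded, for each system, which other systems it calls; the system names and the `unmet_dependencies` helper are hypothetical.

```python
from __future__ import annotations


def unmet_dependencies(dependencies: dict[str, set[str]],
                       migrated: set[str]) -> dict[str, set[str]]:
    """For each migrated system, list the dependencies that were left behind."""
    return {
        system: missing
        for system in sorted(migrated)
        if (missing := dependencies.get(system, set()) - migrated)
    }


# Hypothetical inventory: system -> systems it calls.
inventory = {
    "webshop": {"orders-api", "auth-service"},
    "orders-api": {"database"},
    "auth-service": set(),
    "database": set(),
}

print(unmet_dependencies(inventory, migrated={"webshop", "auth-service"}))
# {'webshop': {'orders-api'}}
```

An empty result means every migrated system can reach what it needs in the new environment, provided the inventory itself is complete.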

The four strategies

There are four proven methods to minimise downtime. Each method suits a different type of workload.

1. Blue-green deployment. You build the complete new environment (green) alongside the existing one (blue). Both run simultaneously. Once green is tested and ready, you switch traffic via DNS or a load balancer. With a load balancer the cutover takes seconds; via DNS it takes as long as the record's TTL. If green doesn't function, you switch back to blue. Downside: you temporarily pay double infrastructure costs, and databases must stay synchronised during the parallel period.

2. Canary deployment. You send a small percentage of your traffic (5-10%) to the new environment while the rest goes via the old one. Monitor the new environment; if everything looks good, gradually increase the share: 25%, 50%, 75%, 100%. When problems arise, you dial the percentage back to zero. This is safer than blue-green because only a small share of users is exposed to any problem, but it requires a load balancer that supports weighted routing.

3. Rolling migration. You migrate components one by one. First the static website, then the API, then the database, then the background processes. Each component is tested separately before you proceed. This spreads the risk, but takes longer and requires that old and new components can temporarily communicate with each other.

4. Database-first. The database is almost always the trickiest part of a migration. With this approach you migrate the database first, with replication between old and new. The application keeps running on the old servers but reads and writes to the new database. Once the database is stable, you move the application. This isolates the most difficult part of the migration.
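The canary ramp can be sketched as weighted routing plus a health gate at each step. This is a minimal illustration, not a load balancer configuration: `is_healthy` is a hypothetical callback standing in for whatever monitoring check you use, and the ramp values are the percentages mentioned above.

```python
import random

RAMP = [5, 25, 50, 75, 100]  # percent of traffic sent to the new environment


def pick_backend(new_weight: int, rng: random.Random) -> str:
    """Weighted routing: send roughly new_weight percent of requests to 'new'."""
    return "new" if rng.random() * 100 < new_weight else "old"


def run_canary(is_healthy, ramp=RAMP) -> int:
    """Walk the ramp and dial back to 0 on the first failed health check.

    is_healthy(weight) is a hypothetical callback that inspects monitoring
    after traffic has run at that weight. Returns the final weight.
    """
    for weight in ramp:
        if not is_healthy(weight):
            return 0  # roll back: all traffic returns to the old environment
    return ramp[-1]


print(run_canary(lambda weight: True))         # 100
print(run_canary(lambda weight: weight < 50))  # 0
```

Seen this way, a blue-green switch is the special case of a ramp with a single step, `ramp=[100]`.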

Figure: 4 strategies for minimal downtime. Choose based on risk tolerance and workload.

Blue-Green: two complete environments. Cutover in seconds, 2x cost.
Canary: 5% → 25% → 50% → 100%. Low risk, takes days.
Rolling: component by component. Flexible, but complex.
Database-first: move data, then application. Isolated, requires replication.

The database: the hardest part

In almost every migration, the database is where things get tricky. Copying files is relatively simple. Moving a running database without data loss is not.

The standard approach works in three steps. First, a full copy (initial sync) of the database to the new environment. That can take hours depending on size, but the application keeps running on the old database. Then you enable replication: every change on the old database is automatically applied to the new one. MySQL, PostgreSQL and most modern databases support this natively. Finally, once replication is stable and the lag stays under a few seconds, you do the cutover: you point the application to the new database.
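"Stable" is easier to enforce when it is defined as a rule. The sketch below assumes you collect replication lag samples (in seconds) from your database's own replication status at regular intervals; the 5-second threshold and the window of 10 samples are illustrative assumptions.

```python
from __future__ import annotations


def ready_for_cutover(lag_samples: list[float], max_lag_s: float = 5.0,
                      window: int = 10) -> bool:
    """True when the most recent `window` lag samples are all below max_lag_s."""
    recent = lag_samples[-window:]
    return len(recent) == window and all(lag < max_lag_s for lag in recent)


# Lag was high during the initial sync, then settled.
samples = [30.0, 12.0, 6.0] + [0.4] * 10
print(ready_for_cutover(samples))  # True
```

Requiring a full window of good samples avoids starting the cutover on a single lucky measurement.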

The cutover itself typically takes 30 seconds to 2 minutes. During that window writes are paused, so neither database accepts changes while the last replicated transactions arrive on the new one. After that, everything points to the new location.
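The order of operations during the cutover window can be sketched as follows. Everything here is a stand-in: `FakeDatabase` is an in-memory list, `copy_missing_rows` imitates replication catching up, and the class and function names are hypothetical. The point is the sequence: pause writes, drain replication, verify, repoint, resume.

```python
class FakeDatabase:
    """In-memory stand-in for a database; real replication is not modelled."""

    def __init__(self):
        self.rows = []


class App:
    def __init__(self, database):
        self.database = database
        self.writes_paused = False

    def write(self, row):
        if self.writes_paused:
            raise RuntimeError("writes are paused during cutover")
        self.database.rows.append(row)


def copy_missing_rows(source, target):
    """Imitates replication delivering the rows the target has not seen yet."""
    target.rows.extend(source.rows[len(target.rows):])


def cutover(app, old, new, replicate):
    """Pause writes, drain replication, repoint the app, resume writes."""
    app.writes_paused = True       # 1. stop accepting writes
    try:
        replicate(old, new)        # 2. let the new database catch up fully
        if new.rows != old.rows:   # 3. go/no-go: data must match
            raise RuntimeError("replica not in sync; staying on old database")
        app.database = new         # 4. point the application at the new database
    finally:
        app.writes_paused = False  # 5. resume writes (on old if the check failed)


old, new = FakeDatabase(), FakeDatabase()
app = App(old)
app.write("order-1")
cutover(app, old, new, copy_missing_rows)
app.write("order-2")
print(new.rows)  # ['order-1', 'order-2']
```

If the sync check fails, the application stays on the old database and writes resume there, which is the rollback path.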

Two pitfalls here. One: test replication lag under production load, not just at a quiet moment. A database that keeps up at 100 queries per second may build up lag at 1,000 queries per second, and that complicates the cutover. Two: version differences. If the new environment runs a slightly different database version, test whether all queries return identical results. A difference in how a JOIN is processed can cause subtle data issues that only surface days later.
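One way to test for version differences is to run the same queries against both databases and compare the results. The sketch below uses Python's built-in `sqlite3` purely as a stand-in so that it runs anywhere; in practice you would open the two connections with your own database driver. The table, the queries and the `differing_queries` helper are hypothetical.

```python
import sqlite3


def differing_queries(old_conn, new_conn, queries):
    """Run each query on both connections; return those whose rows differ.

    Rows are compared in returned order, so an ORDER BY that behaves
    differently also counts as a difference.
    """
    mismatches = []
    for sql in queries:
        old_rows = old_conn.execute(sql).fetchall()
        new_rows = new_conn.execute(sql).fetchall()
        if old_rows != new_rows:
            mismatches.append(sql)
    return mismatches


def make_db(rows):
    """Build a throwaway in-memory database with a small orders table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER, total REAL)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)
    return conn


old_db = make_db([(1, 10.0), (2, 20.0)])
new_db = make_db([(1, 10.0), (2, 20.5)])  # one value drifted
checks = ["SELECT COUNT(*) FROM orders", "SELECT SUM(total) FROM orders"]
print(differing_queries(old_db, new_db, checks))
# ['SELECT SUM(total) FROM orders']
```

Running a representative set of production queries this way during the dry run is cheaper than discovering a mismatch days after the cutover.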

DNS: the underestimated factor

DNS records have a TTL (Time To Live) that determines how long they're cached. The default TTL at many providers is 3600 seconds (1 hour) or higher. If you do a cutover and change DNS, it can take up to an hour before all users are sent to the new server. In the meantime, part of the traffic goes to the old server and part to the new one.

The solution is simple but often forgotten: lower the DNS TTL 24 to 48 hours before the planned migration to a low value, for example 60 or 120 seconds. After the cutover you wait a few minutes and then virtually all traffic points to the new environment. After a stable period you raise the TTL again.
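The timing follows from the TTL itself: resolvers may hold the old record for one full TTL, so the lowered value only takes effect everywhere once the old TTL has expired. A minimal sketch of that arithmetic, with a 24-hour safety margin as an assumption and hypothetical function names:

```python
from datetime import datetime, timedelta


def latest_ttl_change(cutover_at: datetime, current_ttl_s: int,
                      safety_margin: timedelta = timedelta(hours=24)) -> datetime:
    """Latest moment to lower the TTL so old cached records expire before cutover."""
    return cutover_at - timedelta(seconds=current_ttl_s) - safety_margin


def worst_case_stale(ttl_s: int) -> timedelta:
    """Upper bound on how long a well-behaved resolver serves the old record."""
    return timedelta(seconds=ttl_s)


cutover_at = datetime(2025, 3, 11, 22, 0)
print(latest_ttl_change(cutover_at, current_ttl_s=3600))  # 2025-03-10 21:00:00
print(worst_case_stale(60))                               # 0:01:00
```

Resolvers that ignore very low TTLs can exceed the bound, which is why a few minutes of mixed traffic should still be expected.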

DNS providers enforce their own minimum TTL, so check yours before you plan: Cloudflare, for example, has a floor that depends on plan and record type. With traditional DNS providers you can often go as low as 30 seconds, but not every resolver respects extremely low TTL values. Count on a transition period of 2 to 5 minutes after the DNS change.

The migration playbook

A migration without a playbook is a migration that goes wrong. The playbook describes exactly what needs to happen, in what order, by whom, and what the fallback plan is at each step.

A basic structure:

Week -2: Preparation. Inventory all systems, dependencies and data stores. Lower DNS TTL. Set up monitoring on both environments. Document current performance as a baseline (response times, error rates, throughput). Schedule the cutover at a quiet moment (not Friday afternoon, not before a peak period).

Week -1: Dry run. Execute the full migration in a test environment that mirrors production. Test the cutover, test the rollback plan, and time each step. Document everything that didn't go as expected.

D-day: Execution. Follow the playbook step by step. Each step has a go/no-go point: if the check fails, you stop and execute the rollback plan. Communicate clearly to stakeholders (beforehand: "we're migrating tonight", afterwards: "everything runs on the new environment").

Week +1: Stabilisation. Monitor intensively. Compare performance with the baseline. Keep the old environment active for at least a week as fallback option. Only remove the old environment once everything is stable.
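The go/no-go rule on D-day can be made mechanical. The sketch below models a playbook as a list of steps, each with its own check and rollback action; when a check fails, the completed steps are undone in reverse order. The `Step` structure and the step names are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable


@dataclass
class Step:
    name: str
    execute: Callable[[], None]
    check: Callable[[], bool]    # go/no-go check after the step
    rollback: Callable[[], None]


def run_playbook(steps: list[Step]) -> bool:
    """Run steps in order; on a failed check, undo completed steps in reverse."""
    done: list[Step] = []
    for step in steps:
        step.execute()
        done.append(step)
        if not step.check():
            for finished in reversed(done):
                finished.rollback()
            return False
    return True


log = []
steps = [
    Step("pause writes", lambda: log.append("pause"), lambda: True,
         lambda: log.append("resume")),
    Step("switch DNS", lambda: log.append("switch"), lambda: False,
         lambda: log.append("switch back")),
]
print(run_playbook(steps), log)
# False ['pause', 'switch', 'switch back', 'resume']
```

Running the same step list in the dry run and on D-day means the rollback path has been exercised before it is needed.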

Figure: Migration playbook, from preparation to cutover. At least 3 weeks lead time for a safe migration. The four phases shown are Week -2: Preparation, Week -1: Dry run, D-day: Cutover, and Week +1: Stabilisation.

The cutover night checklist

On the night of the cutover you want no surprises. This list prevents the most common mistakes:

Beforehand: Was the DNS TTL lowered at least 48 hours ago? Is database replication running with lag under 5 seconds? Is monitoring active on both environments? Was the rollback plan tested in the dry run? Does everyone involved know their role, and are they reachable? Has communication been sent to users?

During: follow the playbook; don't improvise. After each step, confirm that its check passes. On a failure: stop, roll back, analyse, and schedule a new date. Don't push through.

Afterwards: compare error rates, response times and throughput with the baseline. Monitor actively for at least 24 hours. Keep the old environment available until you're certain.
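The comparison with the baseline can be automated so that "looks fine" becomes a concrete rule. The sketch below flags any metric that is more than 10% worse than the baseline; the tolerance, the metric names and the sample values are assumptions for illustration.

```python
from __future__ import annotations


def regressions(baseline: dict[str, float], current: dict[str, float],
                tolerance: float = 0.10) -> list[str]:
    """Return metrics that are more than `tolerance` worse than the baseline.

    Higher is worse for error_rate and response_ms; lower is worse for
    any other metric, such as throughput.
    """
    higher_is_worse = {"error_rate", "response_ms"}
    failed = []
    for metric, base in baseline.items():
        now = current[metric]
        if metric in higher_is_worse:
            worse = now > base * (1 + tolerance)
        else:
            worse = now < base * (1 - tolerance)
        if worse:
            failed.append(metric)
    return failed


baseline = {"error_rate": 0.010, "response_ms": 200, "throughput_rps": 500}
after_cutover = {"error_rate": 0.012, "response_ms": 205, "throughput_rps": 495}
print(regressions(baseline, after_cutover))  # ['error_rate']
```

A non-empty result during the first 24 hours is a reason to investigate while the old environment is still available as a fallback.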

The core

Zero downtime during a cloud migration isn't a marketing promise, it's an engineering discipline. The core is simple: build the new environment alongside the old one, test thoroughly, switch over gradually, and keep a fallback plan ready. The database is the hardest part. DNS TTL is most often forgotten. And a dry run is not optional.

Plan at least three weeks lead time. Not because the technical migration takes that long, but because preparation and testing make the difference between a smooth transition and a nightmare.
