How Minecraft Realms moved its databases from AWS to Azure
In our recent blog post, we announced the completion of the Minecraft Realms migration from AWS to Azure. In this post, we take a deep dive into how the Realms team moved all of its database components to Azure. We hope you can learn from our successful migration, battle scars, and tips, and use them to unlock Azure's database offerings for your organization. It's easy!
The Minecraft Realms service maintains a sizeable amount of business-critical data, including game world state information, subscription metadata, and telemetry. With three pre-production and two production environments, and up to three databases (DBs) in each environment, our mission was to reliably migrate over 1 TB of data distributed over 13 DBs (the largest storing over 450 GB) from AWS Aurora to Azure Database for MySql. This business-critical operation was further complicated by the ~6k service requests per second (RPS) that kept arriving during the migration.
A combination of precise tooling on Azure, namely the Azure Database Migration Service (DMS), and expert support from the Azure FastTrack team was critical to making this move a success. Azure DMS, our tool of choice, simplifies the migration of data, schemas, and objects from disparate sources to Azure. The customized guidance we received from the Azure FastTrack team, a team of Azure engineering experts, accelerated, unblocked, and streamlined our deployment to Azure and saved us significant costs.
Preparing for the Migration
The Azure DMS documentation captures all prerequisites and configuration steps for your existing DB technology. Since we were migrating MySql data from AWS to Azure, we took the following steps:
- Enabled replication logging in AWS Aurora.
- Provisioned Azure Database for MySql servers in the target Azure subscription. Though Azure DMS is capable of migrating both data and schema, we chose to deploy our schemas to the target MySql servers using our existing deployment tooling.
- Finally, created a set of Azure DMS instances in the target Azure subscription. Each Azure DMS instance can concurrently migrate up to two DB servers, so we created a total of 7 instances for our 13 DBs.
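The instance count in the last step is a ceiling division. A minimal sketch in Python, where the function name is hypothetical and the limit of two servers per instance is simply the figure quoted above:

```python
import math


def dms_instances_needed(db_servers: int, servers_per_instance: int = 2) -> int:
    """Smallest number of DMS instances that covers every DB server.

    servers_per_instance defaults to 2, the concurrency limit quoted in the text.
    """
    if db_servers < 0:
        raise ValueError("db_servers must be non-negative")
    if servers_per_instance < 1:
        raise ValueError("servers_per_instance must be at least 1")
    # Ceiling division: a partially filled instance still has to be provisioned.
    return math.ceil(db_servers / servers_per_instance)


print(dms_instances_needed(13))  # 7
```

Six instances would cover only 12 DBs, so the thirteenth needs a seventh instance that runs half empty.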
Because Azure DMS is free for the first 6 months of use*, using it had no impact on the Realms team's budget, which was a significant financial bonus.
*See Azure Database Migration Service pricing details.
Migrating the Databases
With all preparatory steps completed, we were ready to start the actual migration process. To enable robust testing and minimize risk, we adopted the following strategy:
First, migrate the pre-production environments in the order of least active to most active environments.
Next, hook up the migrated pre-production databases in Azure to the AWS front-end and let the setup bake for ~1 week with replication enabled, to ensure data consistency between the AWS Aurora pre-production DB and the corresponding Azure MySql DB target.
Note: As of today, the front end is also completely on Azure.
After the validation, sanity checks and load testing, switch to Azure MySql as the primary store.
We followed this strategy for the 3 pre-production environments first, and then for the 2 production environments.
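The rollout order can be expressed as a simple sort. The sketch below is illustrative only: the environment names and RPS figures are made up, peak RPS is an assumed stand-in for "activity", and ordering the production environments by activity as well is our assumption rather than something the strategy above prescribes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Environment:
    name: str          # hypothetical label
    production: bool
    peak_rps: float    # assumed measure of how active the environment is


def migration_order(environments):
    """Pre-production before production, least active first within each group."""
    # False sorts before True, so pre-production environments come first.
    return sorted(environments, key=lambda env: (env.production, env.peak_rps))


# Illustrative values only; none of these figures come from the Realms service.
environments = [
    Environment("prod-a", True, 4000),
    Environment("preprod-c", False, 150),
    Environment("preprod-a", False, 5),
    Environment("prod-b", True, 2000),
    Environment("preprod-b", False, 40),
]

for env in migration_order(environments):
    print(env.name)
```

Starting with the quietest environment means the first mistakes land where they cost the least.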
Establishing connectivity over virtual private networks: To maintain backend security, our MySql servers are only reachable by our machines and are not visible to the public internet. We needed network connectivity between our Azure DMS instances and both the source AWS Aurora DB and the target Azure MySql DB. While establishing connectivity within the Azure network was trivial using a simple VNet rule on the target server, connecting to the Aurora source servers was significantly more complicated. With quick guidance from the Azure FastTrack team, we created a Site-to-Site (S2S) VPN between the Azure VNet hosting our DMS instances and the AWS Virtual Private Cloud (VPC) hosting our Aurora DBs. This gave us connectivity between the DMS instances and both the source and the target DB servers.
After creating the S2S VPN connections, we created Private Link entries for each DB server. This gave us access to our MySql servers from within the Azure VNet and enabled routing from the AWS front-end servers, over the S2S VPN, to the Azure MySql DBs.
Figure 1: Replication setup. The dotted black lines indicate the replication datapath from AWS to Azure.
Highlights from the replication and migration: The replication process was smooth and Azure DMS gave actionable error messages each time we ran into an issue. The migration gave us an opportunity to maintain DB hygiene as we cleaned up some abandoned tables hiding in our schemas on the AWS side. Once the schemas matched, the actual replication was rock solid and fast, thanks to Azure DMS.
We let the replication sit and soak for ~1 week to gain confidence in the replication and its ability to keep up with our service demand. After the initial data load, which took up to 2 days for our largest DB table (over a billion rows!), we never saw a replication delay over a few seconds, despite the 6k RPS to the service. After seeing replication keep up over a full weekend (the busiest time for gaming), we were comfortable switching to Azure MySql as our primary data store.
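A soak criterion like this can be written down as an explicit check. The following is a minimal sketch, not the tooling we used: it assumes replication lag has already been sampled into (timestamp, lag in seconds) pairs by some monitoring system, and the 5-second threshold is an assumed stand-in for "a few seconds".

```python
from datetime import datetime, timedelta


def soak_passed(samples, max_lag_seconds=5.0, min_duration=timedelta(days=7)):
    """Return True if lag stayed within the threshold over a long enough window.

    samples: iterable of (datetime, lag_seconds) pairs taken after the
    initial data load has finished.
    """
    samples = list(samples)
    if not samples:
        return False
    timestamps = [ts for ts, _ in samples]
    covered = max(timestamps) - min(timestamps)
    worst_lag = max(lag for _, lag in samples)
    return covered >= min_duration and worst_lag <= max_lag_seconds


# Illustrative data: one sample per hour for a little over a week, lag of 2 seconds.
start = datetime(2020, 1, 6)
steady = [(start + timedelta(hours=h), 2.0) for h in range(24 * 7 + 1)]
print(soak_passed(steady))  # True
```

A single spike above the threshold, or a window shorter than a week, makes the check fail, which is the conservative behavior you want before switching the primary store.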
Each week, we migrated a new environment from AWS Aurora to Azure MySql. We took a small downtime (~20 mins for most environments), during which we completed the migration in Azure DMS. When we brought the environment back online after deployment, our DBs were hosted in Azure! We ran our pre-production environments in this configuration for several weeks, before moving our production environments.
Figure 2: Architecture post-DB migration. Here we see the front-end servers directly reading/writing data to/from the Azure MySql DBs.
VNet Gateway SKU Matters: While establishing the S2S VPN connections between the service VPCs in AWS and the VNets containing our Azure MySql DBs, we learnt the importance of the VNet gateway SKU. Though the default VNet gateway SKU (VpnGw1) was sufficient for our pre-production environments, the same SKU could not keep up with our production traffic. Each gateway SKU offers different performance and throughput, so we recommend researching and load-testing the chosen SKU prior to migration.
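A back-of-the-envelope bandwidth estimate is a reasonable first step before load testing. In the sketch below, the 20 KB average payload, the 70% headroom factor, and the gateway capacity figures are all placeholders chosen for illustration; take real throughput numbers for each SKU from the current Azure documentation and real payload sizes from your own traffic measurements.

```python
def required_mbps(requests_per_second: float, avg_payload_bytes: float) -> float:
    """Approximate sustained bandwidth in megabits per second."""
    return requests_per_second * avg_payload_bytes * 8 / 1_000_000


def gateway_sufficient(needed_mbps: float, gateway_mbps: float, headroom: float = 0.7) -> bool:
    """True if the need fits within a fraction (headroom) of the gateway's rated capacity."""
    return needed_mbps <= gateway_mbps * headroom


# 6k RPS is the service load quoted in this post; 20 KB per request is an assumption.
needed = required_mbps(6000, 20_000)
print(needed)  # 960.0

# Placeholder capacities, not official SKU figures.
for capacity in (650, 2000):
    print(capacity, gateway_sufficient(needed, capacity))
```

An estimate like this only narrows the candidate SKUs; it does not replace a load test with a simulated production workload.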
DB Version Matters: While moving our final production environment, we observed blocking performance issues with our target Azure MySql DB server, which was unable to handle the production traffic load. Since we had been running on MySql 5.6 for years, we incorrectly assumed that AWS Aurora MySql 5.6 and Azure MySql 5.6 were equivalent implementations. With the help of the Azure MySql team, we addressed the performance bottlenecks by upgrading to Azure MySql 5.7, having realized that there are significant performance differences between MySql 5.6 and 5.7, some of which turned out to be necessary for our service.
Over a couple of weeks, we upgraded the target MySql server for our final production environment, cleared the DB, restarted our replication via Azure DMS and ran several load tests. We also added the ability to put our service in read-only mode. We chose a day for the final migration and flipped the switch. With everything we'd learned, and our intense preparation, we were able to migrate the final production environment smoothly in less than an hour. We had successfully migrated a terabyte of data spread over thirteen databases from AWS to Azure!
Battle scars and tips
- Assuming DB version equivalence between AWS Aurora MySql 5.6 and Azure MySql 5.6 caused us some avoidable downtime. We learnt the hard way that v5.7 has significant performance upgrades and enhancements over v5.6, and we strongly encourage anyone running on MySql 5.6 to investigate switching to 5.7. The performance gains allowed us to reduce our cost of goods sold (COGS) dramatically, with our Azure MySql servers running at ~10-15% utilization, compared to the overloaded state we started from.
- Load testing prior to our final production migration would have identified the performance issues that we eventually faced with the S2S VPN gateway SKU we had chosen. We learnt that SKU choice matters and recommend load testing with a simulated production workload before migration.
- After our first failed migration, we had to spend significant time reconciling data between the AWS-hosted and Azure-hosted MySql servers, since we were allowing writes to happen as soon as the service came back up. We recommend putting your service in read-only mode, if possible, before flipping the migration switch.
- The seamless migrations of our pre-production environments lulled us into a false sense of confidence, so we didn't invest as much as we should have in our rollback plan. We highly recommend having a sound rollback plan in case things go awry.
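To make the read-only recommendation concrete, here is a minimal sketch of a write guard. It is not how the Realms service implements read-only mode; the class and method names are hypothetical, and a real service would drive the flag from shared configuration rather than an in-process event.

```python
import threading


class ReadOnlyError(RuntimeError):
    """Raised when a write is attempted while the service is read-only."""


class ServiceMode:
    def __init__(self):
        # Event is thread-safe, so request handlers can check it concurrently.
        self._read_only = threading.Event()

    def enter_read_only(self):
        self._read_only.set()

    def exit_read_only(self):
        self._read_only.clear()

    def guard_write(self, operation, *args, **kwargs):
        """Run a write operation unless the service is in read-only mode."""
        if self._read_only.is_set():
            raise ReadOnlyError("service is read-only during migration")
        return operation(*args, **kwargs)


mode = ServiceMode()
store = []  # stand-in for the primary data store

mode.guard_write(store.append, "before-cutover")
mode.enter_read_only()  # flip this before switching the primary store
try:
    mode.guard_write(store.append, "during-cutover")
except ReadOnlyError as err:
    print("rejected:", err)
mode.exit_read_only()   # re-enable writes once the new primary is verified
```

Rejecting writes during the cutover means both stores hold identical data if you have to roll back, so there is nothing to reconcile afterwards.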
Read about the team's whole migration journey in parts 1 and 3 of this blog series.