How Minecraft Realms moved compute+storage from AWS to Azure
In a previous post, we announced the successful migration of Minecraft Realms from AWS to Azure and dove into the Realms database migration. Here, we conclude the series with a final post on the compute and storage migration. We hope that your organizations and teams can learn from our experience and tap into Azure's storage and compute services faster and more effectively.
The essence of Minecraft Realms' rich gameplay is captured in the beautifully constructed virtual worlds, where players come together for private online gaming. Realms persists 20+ petabytes of these gameplay-critical Minecraft virtual worlds.
Less visible, but equally crucial to gameplay, is the fleet of compute clusters running the Realms service code, which handles requests from Minecraft clients. These Front-End Servers are responsible for quickly handling client requests, securely accessing user data, and managing cloud-hosted gameplay servers – providing exactly what our players need to join these gameplay servers quickly and seamlessly.
Leveraging best-in-class Azure tools such as AzCopy, and services such as Azure Blob Storage and Azure Virtual Machine Scale Sets, we were able to successfully migrate Realms' Storage and Compute workloads from AWS to Azure. Now on Azure, we maintain a significantly smaller code footprint and have simpler deployments and a more flexible architecture than before – saving precious developer time. Furthermore, lower latency through Azure's global footprint has improved gameplay – allowing our gamers to enjoy Realms even more.
The Realms service code handles all Realms requests from Minecraft clients by performing crucial tasks such as recording metadata updates, processing Realms purchases, and provisioning/deprovisioning gameplay servers. The entire Realms service code was initially hosted on AWS EC2, spread across 60+ VMs running a gamut of core services. We migrated these Realms front-end servers from AWS EC2 to Azure VMs in the following steps:
- Customize the standard Azure Linux VM image with the necessary dependencies
- Create several Azure Virtual Machine Scale Sets (VMSS), associated with the customized images
- Deploy Realms service code to the VMSS instances at instance creation/warmup time
To ensure a seamless update for our players, we adopted a blue-green deployment strategy (Figure 2), where a release was deployed simultaneously to both Azure and AWS, and user traffic was switched between them using DNS updates. This approach allowed us to fully create and test each environment in Azure before pointing test and customer traffic to it. Since both the AWS and Azure deployments were up and running at the same time, whenever user traffic hit an issue we could quickly switch traffic back and fix the problem. We highly recommend this approach, as it allows quick and safe rollback, minimizes downtime, and enables robust validation. For details on how the AWS front-end servers connected to the Azure DBs, please see our post on the Realms DB migration.
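The switch-and-rollback logic described above can be sketched as a small model. This is only an illustration: the `BlueGreenSwitch` class, the hostnames, and the health-check callback are hypothetical, and no real DNS API is called.

```python
class BlueGreenSwitch:
    """Models which environment a DNS name points at (hypothetical; no real DNS calls)."""

    def __init__(self, blue, green):
        self.targets = {"blue": blue, "green": green}
        self.active = "blue"      # environment currently receiving traffic
        self.previous = None      # environment to fall back to on rollback

    def resolve(self):
        return self.targets[self.active]

    def switch_to(self, name, health_check):
        # Only repoint traffic if the new environment passes validation.
        if not health_check(self.targets[name]):
            return False
        self.previous, self.active = self.active, name
        return True

    def rollback(self):
        # Both environments stay deployed, so rollback is just another repoint.
        if self.previous is not None:
            self.active, self.previous = self.previous, None


# Hypothetical hostnames standing in for the AWS and Azure deployments.
switch = BlueGreenSwitch("realms-aws.example.com", "realms-azure.example.com")
switch.switch_to("green", health_check=lambda host: True)
current = switch.resolve()
```

Because the old environment keeps running, `rollback()` needs no redeployment; in a real setup the speed of the switch would also depend on DNS TTLs.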
Tips from the compute migration experience:
- To load-balance multi-region services, we recommend Azure Front Door; for single-region services, Application Gateway may be used. Azure offers various other load-balancing options as well.
- To protect public IP addresses and ports on VM instances, Network Security Groups may be used.
- To build custom VM images, check out Azure Image Builder.
Player world data is critical to each gaming session on Minecraft Realms. The world data is loaded each session from a persistent store and used to initialize the game server runtime. Prior to migration, the world data was hosted in AWS S3 buckets - storing Petabytes of data spread across millions of files.
Due to the low-latency, high-throughput, and high-frequency access required for a good player experience, a hybrid model using an AWS S3 store with Azure Compute was not a viable option, for two reasons: (a) copying data across clouds is much slower than copying data within the same datacenter, so a hybrid cloud model would increase the startup time for each game session; and (b) over time, the egress cost of moving data out of AWS S3 would really start to add up. We thus needed a holistic migration of the world data to Azure, so that compute and storage could run together with optimal performance.
We planned to move all the world data from AWS S3 buckets to Azure Blob Storage with minimal impact on gameplay. Bulk copying the data from AWS S3 to Azure Blob Storage would have meant downtime for active players, since a player would not be able to play their Realm while its data was being moved. We adopted the following strategy to avoid downtime:
"The first time an existing active Realms world is accessed, read the existing world data from S3 and then write all updates to Azure Blob Storage. On subsequent accesses, use Azure blob exclusively."
As a result, all active Realms were migrated automatically at runtime. Players would experience an imperceptible delay (on the order of milliseconds) only when starting their Realm on Azure for the first time. This left only the inactive world data (~300TB), belonging to users who had not been active for over a year, to be bulk copied later.
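A minimal sketch of this migrate-on-access strategy is shown below. It is an illustration under stated assumptions, not the Realms implementation: `InMemoryStore` is a hypothetical stand-in for the S3 and Azure Blob Storage clients, and `LazyMigratingWorldStore` and its method names are invented for the example.

```python
from dataclasses import dataclass, field


@dataclass
class InMemoryStore:
    """Stand-in for an object store (hypothetical; not a real SDK client)."""
    blobs: dict = field(default_factory=dict)

    def get(self, key):
        return self.blobs.get(key)

    def put(self, key, data):
        self.blobs[key] = data


class LazyMigratingWorldStore:
    """Reads fall back to the legacy store until a world exists in the new one."""

    def __init__(self, legacy, target):
        self.legacy = legacy   # plays the role of AWS S3
        self.target = target   # plays the role of Azure Blob Storage

    def load_world(self, world_id):
        data = self.target.get(world_id)
        if data is not None:
            return data                    # already migrated: new store only
        return self.legacy.get(world_id)   # first access: read the legacy copy

    def save_world(self, world_id, data):
        # Every write goes to the new store, so a world is migrated
        # the first time it is saved after the cutover.
        self.target.put(world_id, data)


s3 = InMemoryStore({"realm-1": b"old world"})
azure = InMemoryStore()
store = LazyMigratingWorldStore(s3, azure)

world = store.load_world("realm-1")              # served from the legacy store
store.save_world("realm-1", world + b" + edits")
migrated = store.load_world("realm-1")           # now served from the new store
```

Worlds that are never opened never pass through `save_world`, which is why an inactive remainder is left over for a separate bulk copy.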
Leveraging AzCopy, we were able to bulk copy the remaining inactive users' world data from the source AWS S3 buckets into the target Azure Blob Storage using simple scripts. The inactive users' world data was completely transferred to Azure Blob Storage accounts within ~1 week. Finally, after verification, the databases were updated to point to the Azure Blob Storage world data and the S3 buckets were cleared. We were now fully on Azure!
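For the bulk copy step, a script of roughly the following shape could build the AzCopy invocation. This is a sketch, not the team's actual script: the bucket, storage account, container, and SAS token values are placeholders, and it assumes AzCopy picks up the AWS credentials from the `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` environment variables.

```python
import shlex


def build_azcopy_command(bucket, account, container, sas_token):
    """Build an `azcopy copy` command that copies a whole S3 bucket to a blob container."""
    source = f"https://s3.amazonaws.com/{bucket}"
    destination = f"https://{account}.blob.core.windows.net/{container}?{sas_token}"
    # --recursive=true copies every object under the source path.
    return ["azcopy", "copy", source, destination, "--recursive=true"]


# Placeholder values; substitute real names and a real SAS token.
cmd = build_azcopy_command("example-bucket", "exampleaccount", "worlds", "<SAS-token>")
print(shlex.join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```

Building the command as a list rather than a single string avoids shell-quoting problems with the `?` and `&` characters in a SAS URL.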
Learnings from the storage migration:
- Prepare for increased cost during the migration period. Keep in mind that data duplication across cloud providers can effectively double the storage cost for the span of the migration. Furthermore, ingress and egress charges on each cloud provider add to the bill.
- Copying 300+TB of data can take a long time, and so can deleting it.
- AzCopy worked without a hitch, even when copying a ridiculous number of files.
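To put the copy time in perspective, here is a back-of-the-envelope calculation of the average throughput implied by moving ~300TB in ~1 week. Both inputs are the approximate figures quoted above, and decimal units (1 TB = 10^12 bytes) are an assumption, so treat the result as a rough order of magnitude only.

```python
def average_throughput(total_bytes, seconds):
    """Return the average rate as (megabytes per second, gigabits per second)."""
    bytes_per_second = total_bytes / seconds
    return bytes_per_second / 1e6, bytes_per_second * 8 / 1e9


TB = 10**12                 # decimal terabyte (assumption)
WEEK = 7 * 24 * 3600        # seconds in one week

mb_per_s, gbit_per_s = average_throughput(300 * TB, WEEK)
print(f"about {mb_per_s:.0f} MB/s ({gbit_per_s:.1f} Gbit/s) sustained")
# prints "about 496 MB/s (4.0 Gbit/s) sustained"
```

The same function can be used to estimate how long a larger or smaller transfer would take at a comparable sustained rate.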
Final Tips on Azure migration:
- Use ARM templates for creating and deploying Azure resources/infrastructure such as Load Balancers, VM Scale Sets, NSGs, etc.
- Do try to set up at least some level of load testing in Azure before pushing the big button to switch over completely. Don't rely on successful migrations of pre-production environments, or on Azure's assurances around performance. Further, always have a rollback plan, preferably one that avoids data loss.
- You will almost certainly need code changes to move.
Through the journey covered in this blog series, we've successfully migrated the entire Realms service: the Front-end servers (Compute), the virtual world data (Storage), and all the databases. We have also abstracted gameplay server management through Azure PlayFab. Our move to Azure has not only improved the efficiency of Realms' core business, but also enhanced the experience for Realms gamers. Building on the robust Azure toolset, we've reduced service maintenance cost through a smaller code footprint, simpler deployments, and better abstractions. At the same time, we've improved gameplay by leveraging Azure's global footprint.
Unlock the power of Azure for your organization today!
Read about the team's whole migration journey in parts 1 and 2 of this blog series.