PlayFab’s PlayStream now powered by Azure
PlayFab PlayStream makes sophisticated, high-end automation of live game operations available to all game developers. PlayStream provides a unified event pipeline, allowing you to use events to enhance gameplay while the game is being played. It is a highly available production service, supporting live traffic of thousands of games. Popular games such as Rainbow 6 Siege, Doom Eternal, and Minecraft are actively using PlayStream to route and process in-game events.
PlayFab game services automatically insert events into PlayStream as things happen in the game. Events such as 'a player logged in' or 'a player made a purchase' can be routed through PlayStream. In addition to built-in events, you can also add custom events via dedicated APIs. Using PlayStream's capabilities, such as the event pipeline, event visualizer, real-time segmentation system, and event archiver, you can act upon these game events to modify the player experience and analyze the game.
Figure 1: PlayStream captures all your game's events as they occur during gaming
Since PlayFab's acquisition in 2018, we've been on a journey, moving our infrastructure from AWS to Azure. For the first couple of years, we focused on delivering new services and improving existing services for our end users. Just over a year ago, we decided to start moving PlayStream to Azure. Now completely on Azure, PlayStream scales faster, runs cheaper, and performs better for our customers (see the section Modernizing PlayStream's architecture on Azure).
In this post, we'll walk you through PlayStream's migration journey to Azure, how PlayStream works under the hood, and how we were able to modernize it using Azure's cloud-native services. If you're considering cloud infrastructure in the development of your game, looking to migrate or modernize critical production services to the cloud, or just looking to learn more about how PlayStream works, this article is for you.
PlayStream Challenges and Opportunities
To fully appreciate PlayStream's modernization journey on Azure and why it runs and scales better than before, let's take a quick look at the original design on AWS. PlayStream was originally built on AWS using the following major components:
- Kinesis – The data stream into which events are ingested and from which they are read for processing.
- EC2 – To run the feature processors on Windows compute instances (in an Auto Scaling Group).
- DynamoDB – Used to store the checkpoints that record the processing state of each Kinesis shard.
- SNS/SQS – Used to queue messages for further processing.
Figure 2: PlayStream's original architecture on AWS, built using Kinesis, EC2 VMs, SNS/SQS (not shown), and DynamoDB
Figure 2 shows the original design, in which data from clients was ingested into Kinesis from PlayFab servers. The head dispatcher running on EC2 VMs funneled the data to delegate Kinesis streams, which fed into the feature processors.
While Azure's rich toolset offered ready-made tools for lift-and-shift, we used the migration as an opportunity to look at the pain points of our design on AWS and modernize it into a cloud-native architecture. This allowed us to leverage the true power of the cloud with Azure's fully managed services and address several challenges, such as:
- Parallelism – Parallelism in Kinesis is increased by adding shards (each providing 1 MB/s ingress and 2 MB/s egress), and the cost grows linearly with the number of allocated shards. In scenarios requiring high parallelism but low ingress (e.g., long IO operations), Kinesis forces you to spin up more shards than the throughput requires, resulting in redundant cost. We wanted to save on this expense.
- Availability – It can take several minutes to reboot a single AWS EC2 VM (with additional overhead when using Auto Scaling Groups). We wanted to recover dead and deadlocked compute units in seconds.
- Long build times – We used Jenkins on a hosted VM for our CI/CD automation, building a deployable Amazon Machine Image. The process took at least an hour. We wanted to limit this to just a few minutes to enable rapid patching, rollbacks, and deployments.
- Heavy Resource Utilization – On AWS, we used a mix of CPU-heavy, IO-heavy, and memory-heavy instance types. Since every feature processor ran in isolation, the spare memory of the CPU-heavy processors could not be used by the other processors that needed it. We wanted to improve global resource utilization.
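The shard arithmetic behind the parallelism pain point can be sketched in a few lines of Python. This is an illustrative calculation only: the per-shard limits are the figures quoted above, while the function names and the simplifying assumption of one concurrent reader per shard are ours and are not part of any AWS API.

```python
import math

# Per-shard limits quoted in the text above.
SHARD_INGRESS_MB_S = 1.0
SHARD_EGRESS_MB_S = 2.0


def shards_for_throughput(ingress_mb_s: float, egress_mb_s: float) -> int:
    """Shards needed to carry the data volume alone."""
    for_ingress = math.ceil(ingress_mb_s / SHARD_INGRESS_MB_S)
    for_egress = math.ceil(egress_mb_s / SHARD_EGRESS_MB_S)
    return max(1, for_ingress, for_egress)


def shards_required(ingress_mb_s: float, egress_mb_s: float, parallel_readers: int) -> int:
    """Shards needed once the desired parallelism is also taken into account.

    Simplifying assumption: one concurrent reader per shard, so the desired
    parallelism is a lower bound on the shard count.
    """
    return max(shards_for_throughput(ingress_mb_s, egress_mb_s), parallel_readers)


def surplus_shards(ingress_mb_s: float, egress_mb_s: float, parallel_readers: int) -> int:
    """Shards allocated only to gain parallelism, not to carry data."""
    needed = shards_required(ingress_mb_s, egress_mb_s, parallel_readers)
    return needed - shards_for_throughput(ingress_mb_s, egress_mb_s)
```

Under these assumptions, a workload that ingests 2 MB/s, reads 4 MB/s, and wants 64 parallel readers for slow IO needs 64 shards, although 2 would carry the data. The remaining 62 are paid for purely to gain parallelism.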
Modernizing PlayStream's architecture on Azure
While planning the modernization effort, we had a few prerequisites: (1) the right set of services had to be found on Azure to support the PlayStream workloads reliably and securely; (2) all SLAs had to be maintained not only post-migration but also during migration, including PlayStream's best effort at processing each event exactly once; and (3) the end result had to be better for us and our customers, or else the move wouldn't make sense.
Azure provided a clear line of sight to directly address our pain points and meet all our prerequisites. Figure 3 shows our current architecture running 100% on Azure. We replaced our existing AWS EC2 compute layer with Azure Kubernetes Service (AKS), Kinesis with Event Hubs, and DynamoDB with Cosmos DB, and moved our CI/CD pipelines to Azure DevOps. Incoming game events are funneled in real time through Event Hubs and a head dispatcher running on AKS. Events are subsequently dispatched to delegate Event Hubs, which serve as feeders into specific PlayStream feature processors, such as the Actions Processor and the Event Archiver, running at scale on AKS clusters. The data funneling provides high availability and fault tolerance through processing isolation.
Figure 3: This illustration shows the current architecture of PlayStream on Azure, powered by Azure Kubernetes Service (AKS), Event Hubs, and Cosmos DB. The incoming events are dispatched and delegated to feature processors running on AKS clusters. Cosmos DB handles the state checkpoints while Event Hubs handles the event streams.
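To make the funneling idea concrete, here is a minimal in-memory sketch of a head dispatcher that fans events out to one delegate stream per feature processor. It is a toy model, not PlayStream's code: the class and function names are hypothetical, plain Python queues stand in for the delegate Event Hubs, and we assume for simplicity that every event goes to every delegate.

```python
from collections import deque
from typing import Callable, Deque, Dict, List

Event = Dict[str, str]


class HeadDispatcher:
    """Fans each incoming event out to one delegate stream per feature processor."""

    def __init__(self, processor_names: List[str]) -> None:
        # One in-memory queue stands in for each delegate stream.
        self.delegates: Dict[str, Deque[Event]] = {
            name: deque() for name in processor_names
        }

    def dispatch(self, event: Event) -> None:
        for queue in self.delegates.values():
            # Each delegate gets its own copy, so processors never share state.
            queue.append(dict(event))


def drain(dispatcher: HeadDispatcher, name: str, handler: Callable[[Event], None]) -> int:
    """Process everything queued for one processor and return the count handled."""
    queue = dispatcher.delegates[name]
    handled = 0
    while queue:
        handler(queue.popleft())
        handled += 1
    return handled
```

Because each processor reads only from its own delegate, a stalled processor simply accumulates a backlog in its own queue while the others keep draining theirs. That is the processing isolation the paragraph above refers to.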
We noted clear technical and cost advantages of adopting these Azure services over our original design on AWS:
- Azure Kubernetes Service was the clear choice for our new compute layer. Using a shared Kubernetes cluster to run all our PlayStream workloads immediately resulted in better resource pooling, providing a 2x improvement in efficiency. Horizontal and vertical pod autoscaling, along with cluster autoscaling, gives us better elasticity than before. Pod restarts are rapid and unnoticeable in terms of processing latency, making the experience smoother for our end users. We also use pod liveness checks to detect deadlocks and self-heal, eliminating the manual intervention that was required before.
- Azure Event Hubs was chosen to replace Kinesis, offering a few clear advantages. While Kinesis forces you to pay per shard regardless of how much data you ingress or egress, Event Hubs lets you partition your data as much as you need to get sufficient parallelism, while you pay only for Throughput Units (TUs), which can auto-scale with your data. You pay only for your currently allocated TUs, regardless of the number of partitions. Event Hubs also offers the flexibility of tiering based on workload type, namely Basic, Standard, and Dedicated hubs. We use Dedicated hubs for high-demand, mission-critical production loads and Standard hubs for the rest.
- Azure DevOps now hosts all our CI/CD pipelines, reducing deployment times to AKS from over an hour to only a few minutes. Our production deployment, including end-to-end tests, is automatically triggered on a cadence, eliminating the manual intervention that was required in our previous design. We no longer spend time patching and maintaining build agents; this is all taken care of by Azure DevOps. The Kubernetes cluster and other resources are created through Terraform and Helm charts.
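As an illustration of how a liveness check can surface a deadlocked processor, the sketch below tracks a heartbeat that the processing loop refreshes whenever it makes progress. The class, the function, and the threshold parameter are hypothetical and are not taken from PlayStream. It relies only on the standard Kubernetes behavior that an HTTP liveness probe treats status codes from 200 to 399 as success and anything else as failure.

```python
import time
from typing import Callable


class Heartbeat:
    """Tracks when the processing loop last made progress."""

    def __init__(self, max_silence_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._max_silence_s = max_silence_s
        self._clock = clock  # injectable so the logic can be tested without sleeping
        self._last_beat = clock()

    def beat(self) -> None:
        """Call from the processing loop after each unit of work."""
        self._last_beat = self._clock()

    def is_alive(self) -> bool:
        """True while the loop has reported progress recently enough."""
        return (self._clock() - self._last_beat) <= self._max_silence_s


def liveness_status(heartbeat: Heartbeat) -> int:
    """HTTP status a liveness endpoint could return: 200 if healthy, 503 if stalled."""
    return 200 if heartbeat.is_alive() else 503
```

A deadlocked loop stops calling beat(), the endpoint starts returning 503, and once the probe's failure threshold is reached the container is restarted without anyone being paged.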
Code Migration – Migrating to AKS and building on top of Linux required all our source code to be ported from .NET Framework to .NET Core. Please refer to our .NET migration guide for instructions.
Abstracting Data Dependencies – PlayStream depends on several data layers. Migrating these required carefully mapping inter-layer dependencies, establishing a migration order, and making the application layer agnostic to the underlying data source. To achieve this, we abstracted the data layer, removing AWS specifics from it and adding implementations for other data sources (e.g., replacing DynamoDB with Azure Cosmos DB). The abstracted data layer was responsible for (a) correctly initializing the connection to the data source, passing in any secret or identity based on the execution context, and (b) implementing CRUD operations against the specific data source. The abstraction greatly simplified both the live and offline data migration, ensuring that reads and writes went to the correct database.
Abstracting Shared Infrastructure – Much of our shared infrastructure (such as configuration, logging, metrics, and billing meters) was Windows-specific. We abstracted the common infrastructure code and added platform-specific implementations. To minimize risk, we rigorously tested our common infrastructure and data access layers before migrating our traffic to Azure. The common infrastructure was reused in moving several different components, including the Main Server (API Front Door).
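A data-layer abstraction of this kind might look roughly like the following Python sketch (the production code is .NET, so this only shows the shape). Every name here is hypothetical: the interface covers the two responsibilities listed above, connection setup and CRUD, and an in-memory backend stands in for the real DynamoDB and Cosmos DB implementations, whose SDK calls we deliberately leave out.

```python
from abc import ABC, abstractmethod
from typing import Dict, Optional, Type


class CheckpointStore(ABC):
    """Storage-agnostic interface the application layer codes against."""

    @abstractmethod
    def connect(self, credentials: Dict[str, str]) -> None:
        """Initialize the connection using context-specific secrets or identity."""

    @abstractmethod
    def put(self, key: str, value: str) -> None:
        """Create or update a record."""

    @abstractmethod
    def get(self, key: str) -> Optional[str]:
        """Read a record, or return None if it is absent."""

    @abstractmethod
    def delete(self, key: str) -> None:
        """Delete a record if it exists."""


class InMemoryCheckpointStore(CheckpointStore):
    """Stand-in backend; a real one would wrap a cloud database SDK."""

    def __init__(self) -> None:
        self._records: Dict[str, str] = {}
        self.connected = False

    def connect(self, credentials: Dict[str, str]) -> None:
        self.connected = True

    def put(self, key: str, value: str) -> None:
        self._records[key] = value

    def get(self, key: str) -> Optional[str]:
        return self._records.get(key)

    def delete(self, key: str) -> None:
        self._records.pop(key, None)


# Hypothetical registry: configuration selects the backend by name.
BACKENDS: Dict[str, Type[CheckpointStore]] = {"memory": InMemoryCheckpointStore}


def open_store(backend: str, credentials: Dict[str, str]) -> CheckpointStore:
    store = BACKENDS[backend]()
    store.connect(credentials)
    return store


def save_checkpoint(store: CheckpointStore, partition_id: str, offset: int) -> None:
    # Application code sees only the interface, never the concrete backend.
    store.put(f"checkpoint/{partition_id}", str(offset))


def load_checkpoint(store: CheckpointStore, partition_id: str) -> Optional[int]:
    raw = store.get(f"checkpoint/{partition_id}")
    return None if raw is None else int(raw)
```

With this shape, switching the data source becomes a configuration change (a different key in the registry) rather than a change to the code that saves and loads checkpoints.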
Traffic Migration – PlayStream is constantly servicing live traffic from thousands of games. Maintaining SLAs during migration without impacting the games was key. We followed a metric-driven approach to achieve this, setting up the same metrics in Azure for all the feature processors as in AWS. We closely tracked SLA metrics such as (1) the number of events processed per second, ensuring that it did not exhibit spikes outside our adaptive thresholds, and (2) event processing latency, ensuring that it stayed within the sub-second range. This setup enabled us to make sure that the processors operated at the same (or better) quality as traffic switched over.
To gradually migrate traffic from AWS to Azure, we stamped events with where they should be processed. Events were stamped according to config rules that we could patch dynamically without deploying the entire service. In the head dispatcher, we marked the target delegate processors to handle each event, guaranteeing that the same event would never be processed on both clouds. Based on the game ID and the target feature processor, we were able to finely control where the event would be processed, i.e., on Azure or on AWS. Figure 4 shows a snapshot of the architecture mid-migration, where game events are processed either 100% in Azure (see Processor 1), fully in AWS (see Processor 3), or selectively in Azure or AWS based on the game ID (see Processor 2). Traffic routing was also set up so that we could immediately roll back to AWS for any processor in case of unexpected behavior or delays. Step by step, we migrated all the traffic to Azure and wound down our AWS backend!
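The two SLA checks described above can be sketched as follows. This is only one possible reading of an adaptive threshold, namely a band of k standard deviations around the recent mean. The function names, the choice of k, and the 1-second latency ceiling are illustrative assumptions rather than PlayStream's actual alerting logic.

```python
from statistics import mean, pstdev
from typing import Sequence, Tuple


def adaptive_band(history: Sequence[float], k: float = 3.0) -> Tuple[float, float]:
    """Band of k standard deviations around the mean of recent samples."""
    center = mean(history)
    spread = pstdev(history)
    return center - k * spread, center + k * spread


def sla_ok(events_per_sec: float, history: Sequence[float], latency_s: float,
           k: float = 3.0, max_latency_s: float = 1.0) -> bool:
    """True when throughput stays inside the band and latency is sub-second."""
    low, high = adaptive_band(history, k)
    return low <= events_per_sec <= high and latency_s < max_latency_s
```

Feeding the same check with the same metric from both clouds gives a like-for-like comparison while traffic is being shifted.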
Figure 4: Traffic migration by tagging events with where they should be processed
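The stamping scheme can be sketched as a small rule table keyed by feature processor, with per-game overrides. All names and the rule format below are hypothetical. The sketch only demonstrates the properties described above: exactly one target cloud per event and processor, control by game ID, and rules that can be patched at runtime (which is also how a rollback would work).

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable

AWS = "aws"
AZURE = "azure"


@dataclass
class ProcessorRule:
    """Where one feature processor's events run, with per-game overrides."""
    default_cloud: str = AWS
    overrides: Dict[str, str] = field(default_factory=dict)  # game_id -> cloud


class RoutingConfig:
    def __init__(self, rules: Dict[str, ProcessorRule]) -> None:
        self._rules = dict(rules)

    def patch(self, processor: str, rule: ProcessorRule) -> None:
        """Swap a rule at runtime, e.g. to advance or roll back a processor."""
        self._rules[processor] = rule

    def target_cloud(self, processor: str, game_id: str) -> str:
        # Unknown processors fall back to the pre-migration cloud.
        rule = self._rules.get(processor, ProcessorRule())
        return rule.overrides.get(game_id, rule.default_cloud)


def stamp(event: Dict[str, object], config: RoutingConfig,
          processors: Iterable[str]) -> Dict[str, object]:
    """Return a copy of the event tagged with one target cloud per processor."""
    stamped = dict(event)
    game_id = str(event["game_id"])
    stamped["targets"] = {p: config.target_cloud(p, game_id) for p in processors}
    return stamped
```

Since each processor resolves to a single cloud for a given event, a consumer on either cloud can skip any event not stamped for it, which prevents double processing.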
Figure 5 recaps a condensed view of the architecture before, during and after the migration.
Figure 5: Architecture evolution: before, during and after migration
Now on Azure, PlayStream works better for us and for our customers. Switching from AWS EC2 VMs to Azure Kubernetes Service improved efficiency 2x through better resource utilization. At the same time, switching from Kinesis to Event Hubs now lets us maintain a high degree of parallelism while paying only for the throughput that is used. Thanks to replica sets on AKS, we can now scale up or down in a matter of seconds, reacting faster to changes in customers' traffic patterns. We can now roll out new features and fixes to customers much faster, down from over an hour per deployment to under 5 minutes, using Docker container deployments in Azure DevOps. Finally, using AKS pod liveness checks, we've made our service self-healing and cut downtime originating from unhealthy machines or bad application states to zero.
Azure not only powers PlayFab PlayStream but also offers you a rich ecosystem of services to scale up your business smoothly. Launch your next game title, hassle-free, with PlayFab and scale your business swiftly with Azure today!
Azure For Gaming - https://azure.microsoft.com/solutions/gaming
PlayFab Technical Guide - https://azure.microsoft.com/services/playfab/
Azure for AWS Professionals - https://docs.microsoft.com/azure/architecture/aws-professional/