GDC 2024: How Minecraft Puts Players at the Center of the Universe of Big Data

How do you represent the views of millions of Minecraft players around the world? With data, of course!

March 21, 2024

In case you've been living under a rock for the past 15 years or so, Minecraft is one of the original open world sandbox video games. It doesn't have a predetermined story; there’s no beginning, no end. Players are enabled to do whatever they want in the game, and we see all sorts of behaviors: engaging in creative game mode, survival game modes, playing with their friends, engaging in battles with mobs, being super creative — you name it. It's a really, really big ecosystem.

Minecraft allows players to engage across a number of platforms, be it a mobile phone or a tablet, console or PC. We also have titles available on XCloud. So from a data standpoint, you can probably imagine that when we're operating with 22 different endpoints, it makes it a pretty serious task to normalize data that's coming from all of them.

Minecraft is also a very global game. We see engagement from all corners of the world. At one point we saw activity happening in Antarctica, and we wondered what could possibly be going on there. Is there a research station over there that's just happily playing Minecraft?

Last but not least, Minecraft is the best-selling game of all time, with over 300,000,000 copies sold. This milestone was announced last year, and we're super proud to be able to cater to such a large audience. But you can imagine the task we have when it comes to representing at least 300,000,000 players at scale, using data.

Data: The Intersection of Art and Science

I want to talk a little bit about the intersection of art and science, and how Minecraft is really both. It's a product that’s heavy on art; my analytics and data science teams work very closely with people from creative backgrounds who make decisions on a day-to-day basis about very abstract questions. But there's also a science aspect to making decisions in video games, and really trying to understand which players need updates from the game team, for example, or things like connectivity, which can hugely impact the quality of the play experience. That's the science part, and we try hard to bring in people that can live in that intersection.

I'm going to rewind a bit and tell you a little bit of a story around the journey that we’ve embarked on over the past four or five years, trying to level up our data ecosystem to be able to represent our players at scale.

Back in 2019, we were capturing data from our various endpoints, and it was just too much data for our systems to be able to process. I’ve watched this happen to many teams over the years: in video game analytics, we sometimes have a tendency to capture too much data, to the point where our analysts, data scientists and engineers don't know what to do with it. We were also living in a world where the data was very fragmented: there are multiple game teams at the studio, each of which can make its own decisions about the services that support its games, and there can be multiple client implementations of telemetry. Other teams produce data too: commerce, business, marketing, and others all produce data that is not always consumed.

The focus at the time really wasn't on harmonizing, but rather on supporting small efforts across the landscape — this was, in fact, creating a ton of overhead. In order for our data scientists and analysts to actually create a snapshot that was meaningful — that looked at the player lifecycle, for example — they were spending a ton of time writing SQL or HiveQL code to be able to merge and wrangle data. This was all very painful for our data scientists, and morale wasn't super high.

Here's one example: understanding friends features in the game. The studio wanted to know the average number of friends a Minecraft player has, so one of our data scientists went into many systems to answer this question and came up with a very low average number of friends per player. The data was right, but somewhat counterintuitive.

After looking at this data, we actually felt like the ecosystem was very lonely. Just Alex and Steve. Were we looking at the right metric? Or even the right problem?

Fast forward a few years, and we’ve built up incredible capabilities. It turns out that what we were seeing back then was really just the average number of friends on a friends list, so the number wasn’t telling us whether the ecosystem was social, just that players were not using the friends list feature.

Six Degrees of Kevin Bacon

As some of you know, there was a famous study at one point that talked about social networks, basically illustrating that people in society are connected to Kevin Bacon by up to six degrees of separation. So, we ran a similar exercise, but this one was called “six degrees of Alex and Steve”. We were trying to understand how social the ecosystem really is, moving away from simple averages and descriptive stats. 

This study was driven through our modern data ecosystem. It follows the relationships between players, tracing how they interact with others in the ecosystem across a number of degrees of separation, and the result can be represented visually as a network graph.

The results were mind-blowing. We saw that a random sample of 40 players resulted in connections to 330 players, and that group of 330 players connected to an additional 12,000 players in Minecraft. So if we were to follow those 12,000 players and then the next few degrees of separation, we’d eventually cover the entire globe with Minecraft connections.
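The degree-by-degree expansion described here can be sketched as a breadth-first walk over a player graph. This is a minimal illustration rather than the study's actual code: the `players_by_degree` function, the adjacency-list format, and the toy player names are all assumptions made for the example.

```python
def players_by_degree(graph, seeds, max_degree):
    """Return {degree: players first reached at that degree of separation}."""
    seen = set(seeds)
    frontier = set(seeds)
    layers = {0: set(seeds)}
    for degree in range(1, max_degree + 1):
        next_frontier = set()
        for player in frontier:
            for neighbour in graph.get(player, ()):
                if neighbour not in seen:
                    seen.add(neighbour)
                    next_frontier.add(neighbour)
        if not next_frontier:
            break  # nobody new was reached, so stop early
        layers[degree] = next_frontier
        frontier = next_frontier
    return layers


# Hypothetical toy graph: player -> players they have interacted with.
toy_graph = {
    "alex": ["steve", "p1"],
    "steve": ["alex", "p2", "p3"],
    "p1": ["alex", "p4"],
    "p2": ["steve"],
    "p3": ["steve", "p4", "p5"],
    "p4": ["p1", "p3"],
    "p5": ["p3"],
}

layers = players_by_degree(toy_graph, seeds=["alex"], max_degree=3)
reach = {degree: len(players) for degree, players in layers.items()}
print(reach)
```

Counting the newly reached players per degree gives the same kind of growth curve as the 40, 330, 12,000 progression, just on a much smaller scale.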

We learned that the ecosystem is highly social, and that social butterflies in the ecosystem are important to make the game more fun and engaging for others, just like in real life. Lastly, we ended up encountering a lot of information about how player networks form: they coalesce around country and platform, but often form around shared interests more than anything.

The LAN Parties of Today

Back in the day, we used to have the old LAN parties, right? Some of you probably remember doing this on weekends — actually bringing your PCs together with other folks, hooking up to an Ethernet port and spending hours playing games with each other. And even then, there were one or two people that got the group together, securing the space, passing around flyers, etc. And the same applies in Minecraft: what we see is social butterflies — players that are really good at connecting players with each other.

We see players on social media who are trying to gain a following and create an ecosystem by sharing Minecraft Realms links with each other.  Many times we see that social butterflies aren’t necessarily the ones that you’d normally see driving millions of views on YouTube, but they are super important to their own networks of friends and connections. The old LAN party has become the new personal servers party, and we see that this is thriving on social media and intersecting with personal servers in Minecraft Realms.

We would never have known this had we not looked a little bit differently at the question of whether players are interacting with each other in the game. Were we still only looking at averages, we’d still be under the assumption that Minecraft is a bit of a lonely ecosystem, whereas in reality there’s a lot of activity happening.

Modernizing the System

So how do we enable this sort of information, and what enables a more nuanced view of players? I'm going to fast-forward to 2024, where we have a data ecosystem that is much more player-centric. All the decisions that we make around how to capture data and what data sources to ingest are really around the fundamental question of: “can we make the player experience better, more fun, and more engaging?”

We also have flexible tech. We’re 100% cloud-based at this point, and our processing of data is detached from our data storage, which turns out to be very powerful.

We virtualize the data ecosystem so that it becomes somewhat agnostic to the underlying technology.  Lastly, our focus as a team is on insights, predictions, and most recently doubling down on Artificial Intelligence and Machine Learning.

For our technical stack, we use Azure technology. Azure Blob and Azure Data Lake are our primary storage technologies. We also have a great partnership with Azure Databricks, so there are a couple of options for how to virtualize a data ecosystem. We're in the process of migrating to the Azure Databricks Unity Catalog (that’s a bit of a work-in-progress), and last but not least, what's really important is the consumption layer.

A Virtual Source of Truth

At a very high level, you can think about this idea as a single place where we have information around product, social media, player support, game, health, marketing, etc. Not all of the data is physically stored there, but all of it can be consumed through a single endpoint. It's not enough to collect data; we must harmonize it, store it, and create this virtual source of truth.
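One way to picture a single endpoint over data that lives elsewhere is a small catalog that stores only loaders, not the data itself. The sketch below is an assumption-laden illustration in plain Python, not the Azure or Databricks setup described in this article: `VirtualCatalog`, its methods, and the two sample sources are all hypothetical.

```python
from typing import Callable, Dict, Iterable, List

Row = Dict[str, object]


class VirtualCatalog:
    """Single read endpoint over sources that stay in their own stores."""

    def __init__(self) -> None:
        self._loaders: Dict[str, Callable[[], Iterable[Row]]] = {}

    def register(self, name: str, loader: Callable[[], Iterable[Row]]) -> None:
        # Only the loader is kept; no data is copied at registration time.
        self._loaders[name] = loader

    def read(self, name: str) -> List[Row]:
        if name not in self._loaders:
            raise KeyError(f"unknown source: {name}")
        return list(self._loaders[name]())

    def join(self, left: str, right: str, key: str) -> List[Row]:
        """Inner-join two sources on a shared key such as a player id."""
        index: Dict[object, List[Row]] = {}
        for row in self.read(right):
            index.setdefault(row[key], []).append(row)
        return [
            {**left_row, **right_row}
            for left_row in self.read(left)
            for right_row in index.get(left_row[key], [])
        ]


# Hypothetical in-memory stand-ins for two separately stored sources.
catalog = VirtualCatalog()
catalog.register("support", lambda: [{"player_id": 1, "tickets": 2}])
catalog.register("product", lambda: [
    {"player_id": 1, "platform": "console"},
    {"player_id": 2, "platform": "mobile"},
])

merged = catalog.join("product", "support", key="player_id")
print(merged)
```

The point of the design is that consumers only ever talk to the catalog, so a source can move to a different store by swapping its loader without changing any downstream query.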

While technology is important, it's critical to build subject matter expertise, so that analysts, data scientists, and power users understand how to tap into all of the data that the studio owns. We hire people that are typically subject matter experts in the domains where the demand for data is: product, commerce, social support, game help, and marketing, for instance.

The Power of Storytelling with Data Insights

Around 2011, I remember watching a National Geographic documentary that created a composite portrait of “the most typical person in the world”. I thought that was really interesting: Based on statistics and averages of the world population and its growth, we could actually get to a composite of what the average person looked like back in 2011.

Likewise, the sort of standard video documentary that National Geographic uses to tell stories is a great example of data storytelling.  I have always really enjoyed how they find examples of average or expected behavior in nature, and truly document the science using media and storytelling.

We try to do something very similar with Minecraft: we try to understand the life of the “average” Minecraft player. That's probably the wrong way of describing it, though, so we actually call it the life of the “most likely” player. This doesn't mean that every single Minecraft player goes through the same process or experiences; obviously on the micro level, experiences are very unpredictable, but on average and in aggregate, the world is relatively stable.

One of the things that we try to look into is a continuum: What are some of the things that players are more or less likely to experience? In our case, we broke out the life cycle of a player into four different stages, and most games probably see a combination of some of these different stages. Looking at our life cycle from a horizontal standpoint is actually really powerful, because that way we can manage the player experience and better understand what players are more likely to encounter as they become more sophisticated in the game.

For example, we’ll start with onboarding, where we ask ourselves some of the big questions: What makes a player onboard successfully? What are some of the features that should be presented early in the lifecycle? What are some of the things that they're likely to experience next? Are they going to start interacting with other players? Are they going to start pushing the boundaries of their hardware to such an extent that they're more likely to encounter crashes? This is a very important stage, because it will really dictate whether we’re able to build a lifetime relationship with our player or not.

As players become a little bit more sophisticated, what are some of the things that tell us that the player is becoming more sophisticated? What are some of the experiences that we should be thinking about from a pathing standpoint? How do we build a lifetime relationship with these players? What happens after they master the game, or after they build a social network? They've experienced a lot of the different components that the Minecraft ecosystem has to offer, but how do we continue to drive an everlasting relationship through new experiences, even outside of the main Minecraft game and into parallel experiences including Minecraft Dungeons and Minecraft Legends?

So this is one way of leveraging the data insights ecosystem: it helps us understand, from a product standpoint, what players are experiencing, and how we can optimize the game to eventually get to a point where we can create a lifetime relationship with our players.
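The question of what a player is "most likely" to experience next can be approximated by counting which event tends to follow which. The sketch below uses simple first-order transition counts; it is only an illustration, and the event names, session data, and function names are hypothetical rather than taken from Minecraft telemetry.

```python
from collections import Counter, defaultdict


def transition_counts(sessions):
    """Count how often each event is directly followed by each other event."""
    counts = defaultdict(Counter)
    for events in sessions:
        for current, following in zip(events, events[1:]):
            counts[current][following] += 1
    return counts


def most_likely_next(counts, event):
    """Return (next_event, share) for the most common follower, or None."""
    following = counts.get(event)
    if not following:
        return None
    total = sum(following.values())
    next_event, n = following.most_common(1)[0]
    return next_event, n / total


# Hypothetical event sequences, one list per player.
sessions = [
    ["first_launch", "tutorial", "first_world", "multiplayer"],
    ["first_launch", "tutorial", "first_world", "crash"],
    ["first_launch", "first_world", "multiplayer"],
    ["first_launch", "tutorial", "first_world", "multiplayer"],
]

counts = transition_counts(sessions)
print(most_likely_next(counts, "first_world"))
```

Individual sequences differ, but the aggregate counts are stable enough to say which experience most players in a given state encounter next.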

Speaking at Scale

Another big thing for us is figuring out how we can leverage social media to listen to feedback at scale, through different languages and on different platforms, and how we can apply some of that information to product launches.

A few years ago, we established a partnership with a group of students at Purdue University, and this group helped us start to build capabilities in this space. We use very advanced technologies (including Azure Cognitive Services) to help us either translate or understand the sentiment of social media conversations, and create an understanding of whether our players are enjoying new experiences or can provide constructive feedback.

Last year, the Minecraft Legends team released a gameplay trailer on YouTube and the analytics teams were able to gather a lot of feedback and information from different interactions happening on social media.  We use Natural Language techniques, as well as LLMs, to be able to understand what players are telling us and represent feedback at scale.

Feedback is everywhere, including app stores, our website at Minecraft.net, and even our support channels. As an example, players can use a form on help.minecraft.net to tell us they're experiencing a problem or to open a technical support ticket. We can leverage machine learning to categorize tickets and create visualizations that help us prioritize areas that could impact the player experience.

It's also very helpful for us to use just traditional natural language processing and topic modeling to create clusters of different problems. We can come up with seven or eight different problems that we're seeing coming from the help website, and work directly with some of our services teams to get more insight.
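Real topic modeling normally relies on a dedicated library. As a rough stand-in that shows the idea of grouping tickets into a handful of problem clusters, the sketch below greedily groups texts by word overlap (Jaccard similarity). The ticket texts, the stopword list, and the 0.25 threshold are all assumptions chosen for the example.

```python
import re

# Small illustrative stopword list; a real pipeline would use a fuller one.
STOPWORDS = {"the", "a", "i", "my", "to", "on", "in", "is", "and",
             "when", "for", "from"}


def tokens(text):
    """Lowercase a ticket and keep its distinct non-stopword words."""
    words = re.findall(r"[a-z']+", text.lower())
    return {w for w in words if w not in STOPWORDS}


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_tickets(tickets, threshold=0.25):
    """Assign each ticket to the most similar existing cluster, or start one."""
    clusters = []  # each cluster: {"vocab": set of words, "tickets": list}
    for text in tickets:
        words = tokens(text)
        best = max(clusters, key=lambda c: jaccard(words, c["vocab"]),
                   default=None)
        if best is not None and jaccard(words, best["vocab"]) >= threshold:
            best["tickets"].append(text)
            best["vocab"] |= words
        else:
            clusters.append({"vocab": set(words), "tickets": [text]})
    return clusters


# Hypothetical support tickets.
tickets = [
    "Game crashes when loading my world",
    "Cannot sign in to my account",
    "World crashes on loading screen",
    "Sign in fails for my account",
    "Purchase missing from marketplace",
]

clusters = cluster_tickets(tickets)
for cluster in clusters:
    print(len(cluster["tickets"]), cluster["tickets"])
```

Each resulting cluster can then be counted and charted, which is the kind of view that helps decide which problem area to raise with a services team first.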

Leveraging Machine Learning for Recommendations and Prediction

Minecraft Marketplace is a fantastic way for players to expand their experience: it offers worlds, skins, textures, and add-ons. Some of it is for sale, some of it is for free.

The data scientists at Minecraft try to enable discoverability of content by using advanced machine learning. What we do is similar to what a TV subscription service does when it recommends movies to you as you browse your favorite TV app. We also apply machine learning to churn prediction.

Our data scientists are working hand-in-hand with some of our game producers to create the right treatments and content curation.  Data scientists are able to create algorithms on-the-fly inside the data ecosystem, and push those into the game. 
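To make the idea of content recommendations concrete, here is a minimal item co-occurrence sketch: items that are often owned together get recommended to players who own only one of them. It is a deliberately simple technique and is not presented as the approach the Minecraft team uses; the item names, player histories, and function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def co_occurrence(histories):
    """Count, for each ordered item pair, how many players own both items."""
    pairs = Counter()
    for items in histories.values():
        for a, b in combinations(sorted(set(items)), 2):
            pairs[(a, b)] += 1
            pairs[(b, a)] += 1
    return pairs


def recommend(histories, pairs, player, top_n=2):
    """Rank items the player lacks by co-occurrence with items they own."""
    owned = set(histories[player])
    scores = Counter()
    for (a, b), n in pairs.items():
        if a in owned and b not in owned:
            scores[b] += n
    # Sort by score (descending), then by name for a deterministic order.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:top_n]]


# Hypothetical Marketplace histories: player -> items they own.
histories = {
    "p1": ["castle_world", "dragon_skin", "sky_texture"],
    "p2": ["castle_world", "dragon_skin"],
    "p3": ["castle_world", "farm_addon"],
    "p4": ["dragon_skin", "sky_texture"],
    "p5": ["castle_world"],
}

pairs = co_occurrence(histories)
recs = recommend(histories, pairs, "p5")
print(recs)
```

Production recommenders add far more signal than ownership alone, but the ranking step has the same shape: score candidate items for a player, then surface the top few.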

Miguel Gonzalez-Fierro, a Principal Data Scientist at Minecraft, in collaboration with several data scientists at Microsoft, recently transferred ownership of the Recommenders GitHub repository to the Linux Foundation. This repository contains a collection of dozens of open-sourced recommendation algorithms.

If you have a data scientist on your team, you are likely to have somebody that’s interested in recommendations, and they’ve likely stumbled into this repository for learning. This is one of the most important recommendation repositories in the world; we're super proud of it, and continue to invest in it.

The Road Ahead

The right data strategy in games is one that looks forward, harmonizes, and iterates to improve.

One of the most forward-looking decisions we have made from a data strategy standpoint was the decoupling of data technologies so that the ecosystem could remain future-proof. Today, we store data outside of the technologies that we use to process data, and that creates some flexibility. But technology is only one component in a holistic data strategy for games: people and talent are the most important ingredient.

Creating harmony between people and data technology requires a good understanding of how data scientists and analysts consume data, and building consumption endpoints that are practical and easy to use.

Lastly, the world of games moves extremely fast, and it is important to maintain a flexible and nimble approach when it comes to data. Many teams and studios around the world approach their data technology efforts with very heavy and rigid DevOps processes; our approach over the years has shifted towards one of rapid iteration: fail fast, learn fast.