Diablo IV debugs Linux core dumps from Visual Studio
Blizzard is using Visual Studio 2019 to debug Linux core dumps on WSL. The following blog post is written by Bill Randolph, a Senior Software Engineer at Blizzard working on the development of Diablo IV. Thanks for your partnership, Bill!
On Diablo IV we develop all our code on Windows and compile for multiple platforms. This includes our servers, which run on Linux. (The code includes conditional compilation and custom platform-specific code where necessary.) There are multiple reasons for this workflow. For one, our team's core competency is on Windows. Even our server programmers are most familiar with Windows development, and we appreciate the ability for all the programmers on our team to use a common toolset and knowledge base.
The other, and most important, reason that we develop on Windows is the functionality and robust toolset provided by Visual Studio. There is nothing quite comparable in the Linux world, even if we were to develop natively in Linux.
However, this presents us with some challenges when a deployed server crashes and we want to debug the resulting core dump. There is the option to log in remotely to the VM (or more specifically the container) that crashed and run gdb to diagnose the crash there. But there are numerous disadvantages to this. For one, we don't deploy source with our binaries, so source is not available in a gdb session on a VM or container.
Another hurdle is gdb itself: unless you use gdb on a very regular basis, you don't retain the level of proficiency that makes it convenient to use. Put simply, our developers would much rather use familiar tools to debug. Since only 2 or 3 of our developers have much proficiency with gdb, they become the de facto resource for diagnosing production crashes, and that's not optimal.
We have always wanted a more intuitive approach for debugging our Linux cores. That's why we are so excited to be able to utilize the new Visual Studio feature that lets us do just that in the familiar environment of Visual Studio! It really is not an exaggeration to say that this is a dream come true.
Our debugging workflow
The Visual Studio Linux core debug workflow is enabled only if you install WSL or add a Linux connection to the Connection Manager. All our server developers install WSL, using the distribution we deploy on. We run a script I wrote that also installs all the development tools and support libraries needed to build our server within WSL.
(As a brief side topic, I want to emphasize that we have found WSL to be the best available Linux environment for developers to test their changes in a Linux build. It's incredibly convenient to hop into WSL, cd into the shared code directory, and build right from there. This is a much better solution than running a VM, or even a container. If you are building with CMake, then you can also leverage Visual Studio's native support for WSL.)
Let me provide a little background about our build. We develop our code on Windows and have a Windows version of our servers that can run under Windows. This is useful for normal feature development. However, we deploy our servers on Linux, which requires a build generated on Linux itself. The Linux build is generated on a build farm, where a build system on a Linux box builds the server and the container that gets deployed. The Linux executable is only deployed in a container, and developers normally don't have access to it.
When a server crashes in our infrastructure, an automated process notifies us and the core file is archived to a network share. To debug a core, whether on Linux or in Visual Studio, you must have the executable that was running; it also helps to debug with the exact shared libraries used on the deployed container. We use another script to obtain these files. First, we copy the core to our local machine, then run the script and point it to the core. The script downloads the Docker container that was built with that version and extracts the server binary from it, along with certain shared runtime libraries for gdb's use. (This avoids gdb compatibility problems you may encounter if your WSL version does not exactly match the deployed Linux version.) The script writes to ~/.gdbinit to set up the shared libraries as system libraries for the debug session.
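To make the ~/.gdbinit step concrete, here is a minimal sketch in Python of what such a file might contain. It is not our actual script. It assumes the shared libraries have been extracted into a single directory, and it uses gdb's standard `set sysroot` and `set solib-search-path` commands to point gdb at them. The function names and the example directory are hypothetical.

```python
from pathlib import Path


def build_gdbinit(lib_dir: str) -> str:
    """Return .gdbinit text telling gdb to resolve shared libraries from lib_dir.

    Assumption: lib_dir holds the libraries extracted from the deployed
    container. 'set sysroot' and 'set solib-search-path' are standard gdb
    commands; the directory layout is hypothetical.
    """
    lines = [
        "# Generated for a core dump debug session",
        f"set sysroot {lib_dir}",
        f"set solib-search-path {lib_dir}",
    ]
    return "\n".join(lines) + "\n"


def write_gdbinit(lib_dir: str, target: Path) -> None:
    """Write the generated settings to target (for example, Path.home() / '.gdbinit')."""
    target.write_text(build_gdbinit(lib_dir))
```

Calling `write_gdbinit("/home/dev/core-libs", Path.home() / ".gdbinit")` would overwrite any existing file, so a real script would probably back it up or append instead.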
Then we switch over to Visual Studio, where the fun begins. We load the solution to build our Windows version of our servers. Then we open the new debug dialog under Debug -> Other Debug Targets -> Debug Linux Core Dump with Native Only. We enable the checkbox that says "Debug on WSL" and fill in the (WSL-specific!) paths to both the core file and the server binary. After that, we hit Debug and watch the show!
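The dialog wants WSL paths rather than Windows ones, which is easy to get wrong. As an illustration, the sketch below converts a Windows drive path to its WSL form. It assumes the default WSL setup, where Windows drives are mounted under /mnt/<drive letter>. The function name and example paths are hypothetical. Inside WSL itself, the `wslpath` utility does the same conversion.

```python
from pathlib import PureWindowsPath


def windows_to_wsl_path(win_path: str) -> str:
    """Translate a Windows drive-letter path to its default WSL mount path.

    Assumption: drives are automounted under /mnt/<lowercase drive letter>,
    which is the WSL default but can be changed in wsl.conf.
    """
    path = PureWindowsPath(win_path)
    drive = path.drive  # "C:" for a drive-letter path, "" or a UNC share otherwise
    if len(drive) != 2 or drive[1] != ":":
        raise ValueError(f"expected a drive-letter path, got {win_path!r}")
    mount = f"/mnt/{drive[0].lower()}"
    # parts[0] is the drive anchor; the remaining parts are path components.
    tail = "/".join(path.parts[1:])
    return f"{mount}/{tail}" if tail else mount
```

For example, a core copied to `C:\cores\core.1234` would be entered in the dialog as `/mnt/c/cores/core.1234`.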
Visual Studio invokes gdb in our WSL behind the scenes. After some disk activity, up pops a call stack for the crash with the instruction pointer on the relevant line of code. It's a brave new world!
So next comes the task of identifying the crash. We have a crash handler that intercepts the crash to perform some housekeeping, so in a single-threaded server the actual crash will be further down the call stack. However, some of our servers are multi-threaded, and the crash could have originated from any of those threads. Our crash handler logs the source file and line number of the crash, so examining those variables gives us our first lead; we will look for the call stack that was executing that code.
In the old days of a few weeks ago, we would use gdb to get a backtrace of all threads and peruse the resulting list to see which thread had the call stack most likely to have crashed. For example, if a thread is just sleeping, it is most likely not the crashed thread. We would look for a stack with more content than a few frames ending in a "sleep" and examine the code to see if a problem was evident, or go into gdb itself to examine the process state.
However, Visual Studio gives us considerably more powerful options than that. For a multi-threaded core you can open the Threads window in your debug session and poke around in each thread to see what the stack looks like. This is pretty similar to the gdb approach, and if there are 50 threads it can be very tedious. Fortunately, there is a feature that makes this much easier: Parallel Stacks.
I confess most of us did not know about Parallel Stacks until Erika Sweet and her team told us about it. Invoking Debug -> Windows -> Parallel Stacks (only available during your debug session) opens a new window that shows the call stack of every thread in your process. It's a fascinating 30,000-foot view of your entire process space. You can double-click any stack frame in any thread, and Visual Studio will jump to that frame in both source and the call stack window. This is a huge time-saver for us.
Once we can see the code near the crash, we can inspect variables using mouse-hover, QuickWatch, or any of the plethora of other tools in Visual Studio. It's true that in a Release build many variables are optimized out, but at the same time, many are not! We can home in on a problem much faster using Visual Studio's interface than we ever could have using just gdb.
Our team is very excited about the ability to debug Linux cores from our production environment in Visual Studio! It is a game changer for us, as it allows many more developers to actively diagnose problems "in the wild", and it makes the powerful toolset of Visual Studio debugging available to all of us. Once our initial setup is complete, it only takes a minute or so to be in a debugging session in Visual Studio. This feature will make finding problems in our code much faster and more efficient! Thanks to Erika and her team for working with us on this!