Spatial sound design

If you've ever played Marco Polo, or had someone call your phone to help you locate it, you are already familiar with the importance of spatial sound. We use sound cues in our daily lives to locate objects, get someone's attention, or get a better understanding of our environment. The more closely your app's sound behaves the way it does in the real world, the more convincing and engaging your holograms will be. Spatial sound does four key things for holographic development:

  1. Grounding: Just like real objects, you want to be able to hear holograms even when you can't see them, and you want to be able to locate them anywhere around you. Just as holograms need to be grounded visually to blend with your real world, they also need to be grounded audibly. Spatial sound seamlessly blends your real world audio environment with the holographic audio environment.
  2. User attention: People are used to having their attention drawn by sound - we instinctually look toward an object that we hear around us. When you want to direct your user's gaze to a particular place, rather than using an arrow to point them visually, placing a sound in that location is a very natural and fast way to guide them.
  3. Immersion: When objects move or collide, we usually hear those interactions between materials. So when your objects don't make the same sound they would in the real world, a level of immersion is lost - like watching a scary movie with the volume all the way down. Spatialized sounds make up the "feel" of a place beyond what we can see.
  4. Interaction design: In most traditional interactive experiences, interaction sounds like UI sound effects are played in standard mono or stereo. But because everything in HoloLens exists in 3D space - including the UI - these objects benefit from spatialized sounds. When we press a button in the real world, the sound we hear comes from that button. By spatializing interaction sounds, we again provide a more natural and realistic user experience.

A few best practices when using spatial sound:

  1. Real sounds work better than synthesized or unnatural sounds. The more familiar your user is with a type of sound, the more real it will feel, and the more easily they will be able to locate it in their environment. A human voice, for example, is a very common type of sound, and your users will locate it just as quickly as a real person in the room talking to them.
  2. Expectation trumps simulation. If you are used to a sound coming from a particular direction, your attention will be guided in that direction regardless of spatial cues. For example, most of the time that we hear birds, they are above us. Playing the sound of a bird will most likely cause your user to look up, even if you place the sound below them. This can be confusing; for a more natural experience, work with expectations like these rather than against them.
  3. Most sounds should be spatialized. As mentioned above, everything in HoloLens exists in 3D space - your sounds should as well. Even music can sometimes benefit from spatialization, particularly when it's tied to a menu or some other UI.
  4. Avoid invisible emitters. Because we've been conditioned to look at sounds that we hear around us, it can be an unnatural and even unnerving experience to locate a sound that has no visual presence. Sounds in the real world don't come from empty space, so be sure that any audio emitter placed within the user's immediate environment can also be seen.
  5. Avoid spatial masking. Spatial sound relies on very subtle acoustic cues that can be overpowered by other sounds. If you do have stereo music or ambient sounds, make sure they sit low enough in the mix to leave room for the details of your spatialized sounds, so that users can locate them easily and they continue to sound realistic and natural.

Some general concepts to keep in mind when using spatial sound:

Spatial sound is a simulation. The most frequent use of spatial sound is making a sound seem as though it is emanating from a real or virtual object in the augmented world. Thus, spatialized sounds may make the most sense coming from such objects.

Note that the perceived accuracy of spatial sound means that a sound shouldn't necessarily emit from the center of an object, as the difference will be noticeable depending on the size of the object and distance from the user. With small objects, the center point of the object is usually sufficient. For larger objects, you may want a sound emitter or multiple emitters at the location on the object that is supposed to be producing the sound.

Normalize all sounds. Distance attenuation happens quickly within the first meter from the device user, as it does in the real world. All audio files should be normalized, and most sounds should be played at unity gain. The spatial audio engine will handle the attenuation necessary for a sound to "feel" like it's at a certain distance (we call this "distance cues"), and applying any attenuation on top of that could reduce the effect. Outside of simulating a real object, the initial distance decay of spatialized sounds will likely be more than enough for a proper mix of your audio. If you feel like you need to attenuate sounds, it's likely that the source is too close to the user, and the distance should be adjusted rather than the volume of the audio file or the emitter.
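The normalization step described above amounts to a simple peak-normalization pass. In practice you would do this in your audio tool or engine; the Python sketch below is only an illustration of the idea, using a plain list of floats as the sample buffer:

```python
def peak_normalize(samples, target_peak=1.0):
    """Scale a buffer of audio samples so its loudest sample reaches
    target_peak. `samples` is a list of floats in [-1.0, 1.0]; this is
    an illustrative sketch, not a replacement for a real normalizer."""
    peak = max(abs(s) for s in samples)
    if peak == 0.0:
        return list(samples)  # silent buffer: nothing to scale
    gain = target_peak / peak
    return [s * gain for s in samples]

# A quiet buffer peaking at 0.5 is brought up to full scale. The
# normalized file is then played back at unity gain, and the spatial
# audio engine alone applies the distance attenuation.
normalized = peak_normalize([0.1, -0.5, 0.25])
```

The point of normalizing first is that every source then starts from a known level, so the only level differences the user hears are the distance cues the engine generates.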

Spatial sound emitter movement. Because spatial sound is tied to the movement of the user's head, no sound emitter movement is needed for an accurate positional effect - the user's own head movement (even very slight) will provide the necessary cues for a sound's position.

If sound emitter motion is desired (e.g., a bird in flight), left/right movement is most effective for spatial sound, and should be incorporated into emitter motion whenever appropriate. For instance, if a sound moves from in front to behind the user, moving by on one side of the user will produce the best effect. Elevation changes are less obvious, so emitters should be close to eye level unless a simulated object is meant to be above or below the user.
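One reason left/right motion reads so clearly is the interaural time difference (ITD): sound reaches the nearer ear slightly before the farther one, and that delay grows with azimuth but barely changes with elevation. As a rough illustration, the classic Woodworth spherical-head approximation can be sketched in Python (the head radius and speed of sound are typical assumed values, not HoloLens specifics):

```python
import math

def woodworth_itd(azimuth_rad, head_radius=0.0875, speed_of_sound=343.0):
    """Approximate interaural time difference in seconds for a source
    at the given azimuth (0 = straight ahead, pi/2 = directly to one
    side), using the Woodworth spherical-head model. head_radius (m)
    and speed_of_sound (m/s) are assumed typical values."""
    theta = abs(azimuth_rad)
    return (head_radius / speed_of_sound) * (math.sin(theta) + theta)

# A source straight ahead produces no delay between the ears, while
# one directly to the side yields roughly 0.65 ms - the cue that makes
# left/right emitter motion so much easier to track than elevation.
```

Because elevation changes leave this delay nearly constant, a sound rising straight overhead gives the listener far less to latch onto than the same sound sweeping past one ear.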

Distance attenuation/dynamic compression. It can be tempting to reduce (or even nullify) the amount of distance attenuation for important sounds. However, distance attenuation is important for positionality, and the Min Gain should be kept to a low value (below -20). This is mostly because the "Min Gain" property for distance attenuation only applies to the direct path - all reflections will decay naturally regardless of this setting. This means that with no attenuation, your sound will become less reflective rather than more reflective, throwing the positional simulation off and making everything sound very close to the user's head. It is best to keep distance attenuation as natural as possible and add dynamic compression when needed for important sounds.
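The recommendation above - keep attenuation natural and rely on compression for important sounds - can be illustrated with a minimal static gain computer: levels above a threshold are scaled down by a ratio, while quieter (more distant) sounds pass through untouched, so distance cues survive. A Python sketch with arbitrary example values (the threshold and ratio here are not HoloLens engine parameters):

```python
def compress_db(level_db, threshold_db=-20.0, ratio=4.0):
    """Static compressor curve: levels above threshold_db are reduced
    by the given ratio; levels at or below it pass through unchanged.
    threshold_db and ratio are illustrative example values."""
    if level_db <= threshold_db:
        return level_db
    # Above threshold, only 1/ratio of the excess level gets through.
    return threshold_db + (level_db - threshold_db) / ratio

# A loud -10 dB peak is tamed to -17.5 dB, while a distant -30 dB
# sound keeps its natural level, preserving its distance cues.
```

This is the opposite of raising Min Gain: instead of propping up far-away sounds (and losing positionality), you pull down only the loudest moments of the sounds that matter.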

Sounds should be spatialized. On the HoloLens, as in the real world, experiences exist in 3D space. Sound emitting objects, including user interface elements, should be locatable using sound as well as sight. If a portion of an experience occurs outside of the user's view, for example music or a voice-over during a scene transition, there can be benefits to spatializing this audio as well. Using spatial sound on these objects provides a natural way for users to identify where their attention should be focused.

Object discovery and user interfaces. When using audio cues to direct the user's attention beyond their current view, the sound should be audible in the mix. For sounds and music that are associated with an element of the user interface (e.g. a menu), the sound emitter should be attached to that object. Stereo and other non-positional audio playing at the same time can make spatialized elements difficult for users to locate.

Use spatial sound over standard 3D sound as much as possible. On the HoloLens, for the best user experience, 3D audio should be achieved using spatial sound rather than legacy 3D audio technologies. In general, the improved spatialization is worth the small CPU cost over standard 3D sound.

Stream music, voice-overs, and long ambience tracks. To preserve system resources, longer sounds and sounds that don't always need to be loaded in memory for instant access should be streamed. Voice-overs are a great example, as they are often played only once, such as during a cut scene.