How 3D Spatialized Audio Bottlenecks Virtual Reality Video

Gabor Szanto

Virtual Reality will create another computing boom. VR is insatiable in demand for better batteries, more bandwidth, more processing power.

-Naval Ravikant, AngelList

Naval is 100% right. VR is an unrelenting, greedy God, shamelessly demanding the entire capacity (and MOAR!) of your smartphone’s CPU, GPU, RAM and memory bus for its needs.

But the computational capacity of your smartphone isn’t anything like your laptop’s, i.e. cheap and virtually infinitely abundant. Computational capacity on mobile SoCs is a scarce and expensive resource. So scarce that the Samsung Gear VR throttles back performance when it reaches thermal limits.

John Carmack of Oculus noted,

When I first brought up the system in the most straightforward way with the UI and video layers composited together every frame, the phone overheated to the thermal limit in less than 20 minutes. It was then a process of finding out what work could be avoided with minimal loss in quality.

And various ways of preventing overheating of the Samsung Gear VR became a ‘hot’ topic (excuse the pun) in the VR community.

Fix Overheating Issues with the Samsung Gear VR
It is amazing to think that normal smartphones are powering the Samsung Gear VR. The requirements are so demanding that in some cases, your phone can give you an overheating warning if it detects temperatures above normal.

Not long ago, Android Nougat introduced an entirely new mode, the Sustained Performance API, which helps ameliorate the thermal management problems caused by VR’s insatiable appetite for computational resources.

Jevons Paradox in the Computing of Virtual Reality

As the computational capacity of mobile devices grows, oddly, so does the demand for computation. In economics, this is known as the Jevons Paradox. As mobile SoCs become more efficient and deliver more performance per watt, computation itself becomes less expensive. As computation becomes less expensive, it induces greater demand and more use of computation.

This suggests that the era of mobile computing didn’t begin with the iPhone in 2007. It is just beginning now.

Video and audio used to be just played back on mobile devices, but as mobile hardware becomes better, video and audio are increasingly processed, rendered and transformed on the mobile device.

This important shift from client playback to client processing is already underway, and is accelerating. The difference between playback and rendering is the difference between simply eating a cake (playback) and baking that cake (rendering). In 2004, your Motorola RAZR would only let you eat cake (play MP3s). Today’s Android and iOS devices allow you to bake a cake by rendering video and audio in real-time. For example, DJs now perform computationally expensive audio transformations like time-stretching and pitch-shifting with almost zero latency on iPhones. A few years ago, this was unthinkable: you needed a desktop-class processor or dedicated DJ hardware.

Soon, the supercomputers in our pockets will be doing 1000x more computation and rendering than they are doing now, as processing done in the cloud moves to the mobile device itself. This is obvious if you think about it: for applications with low latency needs (VR and driverless cars, to name two), even if the processing ‘in the cloud’ were somehow instantaneous, you still have to account for network latency.

VR is pushing this shift even further. And VR must use spatialized audio to enable true immersion. 3D spatialized audio allows VR users to localize and perceive sounds in 3D space around them, just as one does in meatspace.

However, there is one ironic problem with spatialized audio. As the Unity SDK notes:

Sound occlusion is a very hard problem to solve in terms of computation power. Whereas in global illumination you may consider the movement of light as effectively instantaneous, sound is moving very slowly. Therefore calculating the way sound actually moves around (as waves) in a room is not feasible computationally. For the same reason there are many approaches towards spatialisation tackling different problems to various extents.

For VR to fulfill its promise, it must offer users spatialized audio, yet spatialized audio increases the demand for computation and power by orders of magnitude. (See our technical addendum 3D Audio HRTF Processing Memory Bus Calculation as well as The 1% Rule for Mobile App Power Consumption.)

Hence, within every VR app, audio competes with video rendering and physics engines in a resource war for access to computation.

The Yin and Yang of VR Video and VR Audio

The ways in which we perceive wine and VR are surprisingly similar.

When we drink wine, we focus on the flavor. Experienced wine-tasters can perceive all sorts of ancillary flavors such as cherries, plums, prunes and even the taste of jam in a glass of zinfandel. However, if we have a cold and our sense of smell is impeded, no matter how good the wine, we cannot taste it. For that glass of zin to be an enjoyable experience, we must be able to smell it because smell and taste are inextricably linked.

VR is the same.

When we place a VR headset over our eyes, we naturally focus on the visuals. Yet no matter how good the video, no matter how many FPS, if the audio isn’t low latency, high quality and performant (and right now, it mostly isn’t), no VR user will be able to immerse. The user won’t be able to tell you why they couldn’t immerse but only that something ‘was off’. As cinema legend George Lucas famously noted,

The sound and music are 50% of the entertainment in a movie.

Because, just like taste and smell, video and audio are inextricably linked.

Allocate too much CPU and memory bus bandwidth to audio, and you suffer insufficient FPS for video. Allocate too much to video, and audio becomes low quality. Either way, VR enthusiasts lose.

How do you strike a balance that makes for immersive experiences, one that takes into account the quality and computational limits of both video and audio?

This article will explain why VR adoption is being bottlenecked by the current state of 3D spatialized audio.

While we’ll discuss the functions of a few algorithms intrinsic to audio processing, this article is written for developers, engineering leads, project managers, product managers, content directors, and business development and sales roles. Readers need not have technical audio experience or a deep academic understanding.

Audio Programmers are Right and Wrong

In the world of media, video has historically been favored. Given a choice, developers will almost always give video the lion’s share of system resources.

This has not gone unnoticed by crusty audio programmers, who say that what is needed is to follow George Lucas’s lead and accord audio as much importance as video in games, apps and VR.

After all, which would you rather choose?

  1. Skype call with bad video and good audio?
  2. Skype call with good video and bad audio?

It turns out that most of you will choose the first. Users tend to tolerate low quality video if the audio is adequate.

So, the audio developers aren’t wrong. But neither are they entirely right.

Given the length of hardware cycles, the audio developers are absolutely wrong in thinking that audio-related processing deserves more access to computational resources.

It doesn’t. But with better software, audio can be performant, high quality and low-power.

The Two Types of Spatialized Audio for VR

You can categorize the current state of 3D/spatialized audio into two distinct types: cinematic and object-based.

Cinematic VR audio is used for pre-generated video and film content, and is easy to stream. Object-based VR audio is used mostly for games.

We’ll illustrate the distinction between the two by showing how one would spatialize the sound of a chirping bird circling around a VR user.

Spatializing Audio in VR

Cinematic spatial audio in VR is implemented with ambisonics technology. Ambisonics has a fixed number of continuously streamed audio channels with 3D audio information. The most popular audio format is ‘first order B-format’, where four channels carry one main sound pressure channel and three directivity channels along the X, Y and Z axes of the 3D space.

Then a virtual sound-field surrounding the user is created, with virtual speakers around his head. In order to provide an immersive experience, the number of virtual speakers is greater than 4. The B-format signal is mixed down separately to each virtual speaker, and as you rotate your head, the speakers move along virtually.
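To make this concrete, here is a minimal sketch of a first-order B-format decoder feeding a set of virtual speakers. It assumes the traditional FuMa channel convention and a simple projection decoder; the struct and function names are illustrative, not any particular SDK’s API.

```cpp
#include <cmath>
#include <cstddef>

// Illustrative sketch of a first-order ambisonic (B-format) projection decoder.
// W is the omnidirectional pressure channel; X, Y, Z are the directivity channels.
struct VirtualSpeaker {
    float azimuth;   // radians, direction of the virtual speaker around the head
    float elevation; // radians, 0 = horizontal plane
};

void decodeFirstOrderBFormat(const float *W, const float *X, const float *Y, const float *Z,
                             const VirtualSpeaker *speakers, size_t numSpeakers,
                             float *output, // interleaved: numSamples * numSpeakers
                             size_t numSamples) {
    for (size_t s = 0; s < numSpeakers; s++) {
        // Unit vector pointing from the listener toward this virtual speaker.
        float dx = std::cos(speakers[s].azimuth) * std::cos(speakers[s].elevation);
        float dy = std::sin(speakers[s].azimuth) * std::cos(speakers[s].elevation);
        float dz = std::sin(speakers[s].elevation);

        for (size_t n = 0; n < numSamples; n++) {
            // Each speaker feed is the pressure channel plus the sound field
            // projected onto the speaker's direction (FuMa W is scaled by 1/sqrt(2)).
            output[n * numSpeakers + s] =
                0.5f * (std::sqrt(2.0f) * W[n] + dx * X[n] + dy * Y[n] + dz * Z[n]);
        }
    }
}
```

Head tracking is cheap in this model: rotating the listener’s head only requires applying a 3x3 rotation to the X, Y and Z channels before decoding, no matter how many sounds are in the mix.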

As you can see below, with cinematic spatial audio the sound of the chirping bird circling our user is distributed amongst the virtual speakers in the sound field. This is how audio in Google Cardboard works.

[Figure: cinematic VR audio, with the chirping bird’s sound distributed amongst a fixed set of virtual speakers surrounding the user]

On the other hand, in object-based VR audio, the virtual speakers aren’t fixed in number or in their position relative to you. Every sound source gets its own virtual speaker. For example, if the chirping bird is flying around you in a virtual space, it is as if a speaker were affixed to the bird.

[Figure: object-based VR audio, with a dedicated virtual speaker attached to each sound source, such as the chirping bird]

It turns out that in cinematic VR, sounds are often positioned to occur between virtual speakers. For example, as the bird flies in front of our user from his left to his right, the chirps are distributed across the left speaker, the front speaker and the right speaker, with the effect that the user hears chirps in the spaces between the virtual speakers.

Unfortunately, the result is usually an unintentionally dampened and ‘phasey’ sound. This is sometimes also referred to as ‘processed’ or ‘foggy’ sound. When audio is phasey, processed or foggy, it makes for poor user experiences that lack the needed presence and immersion.

[This is because of inescapable problems with something called the Head Related Transfer Function (“HRTF”). We’ll revisit the HRTF and phasey sound below.]

The noteworthy advantage of object-based VR audio is that each sound source only contains one virtual speaker’s error. The downside is that for each audio source, you need a virtual speaker.

The more virtual speakers, the more immersive the audio; the more immersive the audio, the better the experience. But more virtual speakers also mean more computation for audio processing and, of course, a hotter device. The hotter the device gets, the less efficiently it processes.

VR Audio is Too Convoluted

The typical implementation of one virtual speaker emitting sound from a point outside of your head is based on the Head Related Transfer Function (HRTF).

HRTF works by measuring the sound arriving at the human ear canal from a sound source at a given point in 3D space, and consists of a dataset containing impulse responses for a few hundred locations around the measurement subject’s head. Commonly used HRTF datasets include MIT KEMAR and IRCAM.

The audio is then combined (“convolved”) with this impulse response to provide the sensation of audio coming from a specific point in space, outside your head. But convolution-based audio is a poor choice for high quality 3D sound.
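For illustration, here is what the convolution step looks like in its most naive time-domain form for a single source. Real engines use FFT-based (partitioned) convolution, but the underlying work is the same; the function name and buffers below are hypothetical.

```cpp
#include <cstddef>

// Naive time-domain HRTF convolution sketch for one mono source.
// hrirLeft/hrirRight are the measured impulse responses (e.g. 512 taps each)
// for the source's current direction.
void spatializeWithHRTF(const float *input, size_t numSamples,
                        const float *hrirLeft, const float *hrirRight, size_t hrirTaps,
                        float *outLeft, float *outRight) {
    for (size_t n = 0; n < numSamples; n++) {
        float left = 0.0f, right = 0.0f;
        // Every output sample reads hrirTaps coefficients and hrirTaps input
        // samples per ear: this is where the CPU and memory bus cost comes from.
        for (size_t k = 0; k < hrirTaps && k <= n; k++) {
            left  += hrirLeft[k]  * input[n - k];
            right += hrirRight[k] * input[n - k];
        }
        outLeft[n] = left;
        outRight[n] = right;
    }
}
```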

If you look at the frequency response of HRTF-spatialized audio in a spectrum analyzer, where you can see the frequency content, one frequency may have a high magnitude, the next a low one, and the next a high one again.

In other words, the frequency response of HRTF is what audio engineers call “jagged”. This has the unfortunate result of creating phasey sound, one of the side effects of convolution with HRTF data. If you’re familiar with DJ software, you know exactly what ‘phasey’ sound is -- the older time-stretching algorithms suffered from this (e.g. Pioneer CDJ). Phasey sound acts on us subconsciously and doesn’t register as realistic or natural -- it stands as a direct impediment to immersive audio experiences.

Phasey Audio Example: Listen to this audio file. The first vocalization is processed by a widely available spatializer that makes it phasey; the second version is processed by Superpowered.

Professional musicians have learned by trial-and-error to take great care not to produce music that sounds phasey because it unconsciously irritates listeners. Current implementations of HRTF do not take care to account for this sort of implicit musicality.

Furthermore, because HRTF measurements are taken at fixed points in space, a sound source will almost always fall in between the measured impulse responses, so the sound cannot be “continuous”. Instead, it ends up convolved with a combination of four “jagged” responses.
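A sketch of what that blending typically looks like: the four measured impulse responses surrounding the source direction are mixed with bilinear weights. The blend itself is smooth, but each of the blended responses is already jagged, so the result inherits their fluctuations. The function name and grid layout are assumptions for illustration.

```cpp
#include <cstddef>

// Bilinear blend of the four measured HRIRs surrounding a source direction
// that falls between grid points. azFrac and elFrac are 0..1 positions
// between the neighboring measured azimuths/elevations.
void blendNearestHRIRs(const float *hrir00, const float *hrir01,
                       const float *hrir10, const float *hrir11,
                       float azFrac, float elFrac,
                       float *blended, size_t taps) {
    float w00 = (1.0f - azFrac) * (1.0f - elFrac);
    float w01 = (1.0f - azFrac) * elFrac;
    float w10 = azFrac * (1.0f - elFrac);
    float w11 = azFrac * elFrac;
    for (size_t k = 0; k < taps; k++)
        blended[k] = w00 * hrir00[k] + w01 * hrir01[k] + w10 * hrir10[k] + w11 * hrir11[k];
}
```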

This jagged, phasey result is a grave mistake that needs to be addressed if VR is to be widely adopted.

The more jagged the frequency response, the more “processed” the sound becomes. And as the sound moves in the virtual space, the wild variation of the frequencies can create a “jet” effect (AKA flanging). The jet effect sounds cool in small doses at the right time in a club dance song, but sounds terrible if every sound is spatialized with a small trailing “jet” sound.

Enter The Human Ear

Sound localization is based on a number of factors; here are the three most important:

  1. Inter-aural time delay (“ITD”) is the time difference between your ears. When sound comes from your left, you hear it about 0.7 milliseconds earlier than on your right. That difference provides an important directivity cue for your brain. It doesn’t work for frequencies whose wavelength is shorter than the size of your head, which happens somewhere above 1500 Hz.
  2. The shape of your pinna (your visible outer ear) causes several small time-delays at different frequencies, changing the frequency response of the incoming sound. Your brain learns these typical response curves, providing another directivity cue.
  3. Head shadow: for frequencies whose wavelength is shorter than the size of your head, your head reduces the amplitude of high frequency sounds coming from certain directions. When sound comes from your left, your right ear receives much less high frequency content because of your head’s shadow effect.

There are other factors, but these three have the largest impact on your ability to process sound in terms of directivity cues and sound localization.
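As a worked example of the first cue, the ~0.7 millisecond figure falls out of a simple spherical-head approximation (Woodworth’s formula). The head radius and speed of sound below are assumed averages, not measured data.

```cpp
#include <cmath>
#include <cstdio>

// Interaural time delay via Woodworth's spherical-head approximation:
// ITD = (r / c) * (sin(theta) + theta), theta = source azimuth from straight ahead.
double interauralTimeDelaySeconds(double azimuthRadians) {
    const double headRadius = 0.0875;   // meters, average adult head (assumption)
    const double speedOfSound = 343.0;  // meters per second at room temperature
    return (headRadius / speedOfSound) * (std::sin(azimuthRadians) + azimuthRadians);
}

int main() {
    const double halfPi = 1.5707963267948966;
    // A source directly to one side yields roughly 0.66 ms, in line with the
    // ~0.7 ms figure mentioned above.
    std::printf("ITD at 90 degrees: %.2f ms\n", interauralTimeDelaySeconds(halfPi) * 1000.0);
    return 0;
}
```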

The shape of your pinna has various elements, but its main cavity’s resonance has the most effect on frequency response. That main cavity is called the “concha”. Research has revealed that the other elements of your pinna have an effect that is orders of magnitude smaller, and as there are big differences between human ears (and between the HRTF data for each subject), optimizing for those small effects is not useful: such an optimization may work for a small subset of users with similar ear shapes, while sounding unnatural to most others.

Practically, the effects of the concha and head-shadow on frequency response are the most significant directivity cues — this is why it is unnecessary and suboptimal to modulate audio with, for example, 512 wildly fluctuating frequency bins, as is commonly done.

The human variability in ears is the reason why we cannot rely on 1, 2 or even 1000 HRTF data-sets to create good spatialization data.

The Holy Grail: Notches in the Frequency Response

The aforementioned directivity cues make ‘notches’ in the frequency response. What if someone were to research the most significant notches and map them to the pinna and head-shadow?

This means that we don’t need to modulate the frequency response with 512 bins; we need only apply nice, smooth notches. In doing so, we avoid a highly fluctuating and jagged frequency response (i.e. phasey sound). Again, the problem with phasey sound is that it doesn’t sound real. If your frequency curve is nice and smooth, you don’t get processed sound. You get the Holy Grail. You get immersion.
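A minimal sketch of this idea: instead of convolving with a long, jagged impulse response, apply one or two smooth parametric notches whose center frequency and depth track the source direction. The coefficients below follow the widely used Audio EQ Cookbook peaking-EQ formulas; how a direction maps to centerHz and gainDb is the hard research part and is not shown, and the values in the usage comment are placeholders.

```cpp
#include <cmath>

// One smooth, direction-dependent notch implemented as a peaking EQ biquad
// (Audio EQ Cookbook formulas). A negative gainDb carves a notch.
struct NotchFilter {
    float b0, b1, b2, a1, a2;       // coefficients, normalized so a0 == 1
    float z1 = 0.0f, z2 = 0.0f;     // transposed direct form II state

    void setup(float sampleRate, float centerHz, float Q, float gainDb) {
        const float pi = 3.14159265358979f;
        float A = std::pow(10.0f, gainDb / 40.0f);
        float w0 = 2.0f * pi * centerHz / sampleRate;
        float alpha = std::sin(w0) / (2.0f * Q);
        float a0 = 1.0f + alpha / A;
        b0 = (1.0f + alpha * A) / a0;
        b1 = -2.0f * std::cos(w0) / a0;
        b2 = (1.0f - alpha * A) / a0;
        a1 = -2.0f * std::cos(w0) / a0;
        a2 = (1.0f - alpha / A) / a0;
    }

    float process(float input) {
        float output = b0 * input + z1;
        z1 = b1 * input - a1 * output + z2;
        z2 = b2 * input - a2 * output;
        return output;
    }
};

// Example: a gentle notch around a typical pinna-related frequency (placeholder values).
// NotchFilter notch; notch.setup(48000.0f, 8000.0f, 2.0f, -12.0f);
```

A couple of biquads cost a handful of multiply-adds per sample, versus hundreds of operations (and memory reads) per sample for a 512-tap convolution.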

In order to deliver the biggest impact for as broad an audience as possible (so spatialization works for everybody), we’ll need some “psycho-acoustic enhancements” as well. Typically, when you read something like “psycho-acoustic enhancements”, you should think “we couldn’t figure it out with science, so we hacked the sound here and there”.

These sound hacks are critically necessary in this case, as pure, super smooth and natural spatialized sound may not be immersive enough to get the sort of user experience/effect needed for VR. You don’t want the most natural chirping bird sound; you actually want the cleanest and most 3D-like bird sound. You want to amaze the listener. Which is a long way of saying that the results of academic research may not be good enough for a cool 3D game sound.

As a result, here at Superpowered we’re continuously researching and tweaking the sounds created by music producers (!), and offering alternative sound options in our coming spatializer to suit every need.

The Resource War for the CPU and Memory Bus Bandwidth

... mobile graphics pipelines rely on a pretty fast CPU that is connected to a pretty fast GPU by a pretty slow bus and/or memory controller...
(source)

The other tremendous benefit of the ‘notches’ approach, beyond better quality audio, is that it takes orders of magnitude fewer CPU and memory bandwidth resources to process audio. Memory bus bandwidth is an often overlooked performance bottleneck on mobile devices. For high performance VR/AR graphics on resource-constrained mobile devices, memory access/transfer is an incredibly expensive resource, as the CPU and GPU often sit completely idle in a “wait state” for data to process. This idleness is pure waste.

Worse yet, most developers don’t realize that while audio processing in a VR app typically uses just a fraction of the CPU, it can easily use a significant chunk of the available memory bus bandwidth, resulting in dropped frames.
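To see why, here is a back-of-envelope sketch (an illustration, not a measurement) of the data traffic generated by naive time-domain HRTF convolution. Every number below is an assumption you can swap out.

```cpp
#include <cstdio>

// Back-of-envelope estimate of memory traffic for naive time-domain HRTF
// convolution. All inputs are illustrative assumptions.
int main() {
    const double sampleRate = 48000.0;  // output samples per second
    const double hrirTaps = 512.0;      // impulse response length per ear
    const double sources = 16.0;        // simultaneous spatialized sources
    const double ears = 2.0;
    const double bytesPerSample = 4.0;  // 32-bit float

    // Each output sample reads hrirTaps coefficients plus hrirTaps input samples, per ear.
    double bytesPerSecond = sources * ears * sampleRate * hrirTaps * 2.0 * bytesPerSample;
    std::printf("Naive HRTF convolution reads roughly %.1f GB/s\n", bytesPerSecond / 1e9);
    // ~6.3 GB/s, a sizeable slice of a mobile SoC's total memory bandwidth, before
    // video, physics or the OS get a byte. Caches absorb much of this in practice,
    // but the working set still competes with the GPU for the same bus.
    return 0;
}
```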

Therefore, it is memory bus bandwidth, not CPU or GPU clock speed, that is typically the bottleneck for high performance graphics. This suggests that the answers to the performance challenges in VR and AR lie not just in waiting for better hardware, but in better software.

(See here for a conceptual model of how the memory bus bandwidth and load interact with the CPU, GPU, RAM and audio chip in mobile audio processing.)

Developers shouldn’t have to spend time considering and optimizing audio sources in a virtual soundfield. Developers shouldn’t have to fight a resource war over the CPU and memory bus.

Superpowered technology not only accounts for inherent musicality, but also for the fundamental hardware constraints in mobile devices.

Better audio with less CPU and less memory bus. Two (chirping) birds with one Superpowered stone.

Download it today:

  • iOS
  • Android