3D Audio HRTF Processing Memory Bus Calculation

Gabor Szanto

This post serves as a technical addendum to How 3D Spatialized Audio Bottlenecks Virtual Reality Video.

Below we walk you through the memory bus bandwidth cost of HRTF (3D audio) processing, and how it easily exceeds 3 MB/s per channel.

Assumptions

Let's imagine a VR app running on a mobile device with an ARM SoC (RAM, CPU, GPU). The Apple A9 has 64 KB of L1 cache for each of its two cores, a 3 MB L2 cache shared by both cores, and a 4 MB L3 cache servicing the entire SoC. The Snapdragon 820 has only 1 MB of L2 and no L3.

As readers know, audio is processed in buffer-sized chunks; 1024 samples is a common value in game engines. So we have 48000 Hz >> 10 = 46 "turns" every second. The cache is completely overwritten with other data between turns, since huge amounts of other data, graphics for example, are processed in the meantime.
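As a quick sanity check, here is that arithmetic in C (the 48 kHz sample rate and 1024-sample buffer are the assumptions stated above):

```c
#include <stdio.h>

int main(void) {
    const int sample_rate = 48000; /* assumed device sample rate */

    /* 48000 >> 10 is integer division by the 1024-sample buffer:
       the number of buffer-sized "turns" the pipeline runs per second. */
    const int turns_per_second = sample_rate >> 10;
    printf("turns per second: %d\n", turns_per_second); /* prints 46 */
    return 0;
}
```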

Let's imagine we have 8 stereo sounds in the 3D space, and they are all available as raw PCM data in memory, so we don't really need any processing to "produce" them.

Audio engines don't batch audio DSP tasks together: every sound source is processed one after another, and after that job is done, they are mixed together. FMOD in Unity transfers mono sources as stereo, that is, even for mono there are always 2 channels.
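Schematically, one turn looks like this (a minimal sketch, not FMOD's actual API; `process_hrtf` and `mix_into` are hypothetical stand-ins):

```c
#include <stddef.h>

#define BUFFER_SAMPLES (1024 * 2) /* 1024 frames x 2 channels, even for mono sources */
#define NUM_SOURCES 8

/* Hypothetical stand-in for the per-source HRTF chain detailed below. */
static void process_hrtf(const float *in, float *out, size_t n) {
    for (size_t i = 0; i < n; i++) out[i] = in[i]; /* placeholder */
}

/* Sum one processed source into the master bus. */
static void mix_into(float *bus, const float *src, size_t n) {
    for (size_t i = 0; i < n; i++) bus[i] += src[i];
}

void audio_turn(const float in[NUM_SOURCES][BUFFER_SAMPLES],
                float master[BUFFER_SAMPLES]) {
    static float scratch[NUM_SOURCES][BUFFER_SAMPLES];
    /* Each source is processed on its own; nothing is batched across sources. */
    for (int s = 0; s < NUM_SOURCES; s++)
        process_hrtf(in[s], scratch[s], BUFFER_SAMPLES);
    /* Only after all sources are done are they mixed together. */
    for (int s = 0; s < NUM_SOURCES; s++)
        mix_into(master, scratch[s], BUFFER_SAMPLES);
}
```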

Audio samples are processed as 32-bit floating point numbers, not 16-bit signed integers. This is true for other audio engines (Wwise, Core Audio, etc.) as well: 16-bit is enough for storing audio, but not enough for high quality audio processing.
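The conversion at that storage-to-processing boundary typically looks like this (a sketch; the 1/32768 scale factor is one common convention):

```c
#include <stdint.h>
#include <stddef.h>

/* 16-bit integers are fine for storing audio, but the DSP runs in
   32-bit float, so decoded samples are converted on the way in. */
static void int16_to_float(const int16_t *in, float *out, size_t n) {
    const float scale = 1.0f / 32768.0f;
    for (size_t i = 0; i < n; i++)
        out[i] = (float)in[i] * scale;
}
```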

How one audio source is processed for HRTF (3D positional audio) with regard to the memory bus

  1. Input 1024 samples into a ring buffer or audio buffer chain, with panning and spread calculation: 8192 bytes in + 8192 bytes out = 16384 bytes
  2. HRTF processing typically happens with 4:1 overlap and a 2^10 (1024-sample) FFT size.
  3. Apply windowing and overlapping, and re-organize the samples in preparation for the FFT (the fastest FFTs don't demand bit-reversal, which would make processing even more expensive):
    • First round: 768 samples in (256 cached) + 1024 samples out = 1792 samples = 7168 bytes
    • Second round: 256 in (768 cached) + 1024 out = 1280 samples = 5120 bytes
    • Third round: 5120 bytes
    • Fourth round: 5120 bytes
    Sum: 22528 bytes
  4. Forward real FFT: 512 complex bins out = 4096 bytes per round, four rounds = 16384 bytes
    (not counting the FFT coefficients, since we'll assume those stay in cache).
  5. HRTF multiplication: 16384 bytes of HRTF filter data in (FFT result cached), 8192 bytes out = 24576 bytes
  6. Inverse real FFT: 8192 bytes out
  7. Windowing and overlapping (same as input): 22528 bytes
  8. Audio source output (panning, spatial mix; step 7's output is cached, which is 8192 bytes): 8192 bytes
  9. Mixing the output into the global reverb's input buffer (the 8192-byte input is cached): 8192 bytes; all of the above is tallied in the sketch below.
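Putting these steps together as one worked tally (the byte counts are taken straight from the list above; `48000 >> 10` is the 46 turns per second from earlier):

```c
#include <stdio.h>

int main(void) {
    /* Per-turn memory traffic of one HRTF source, in bytes, per the steps above. */
    const long step_bytes[] = {
        16384, /* 1. input with panning/spread       */
        22528, /* 3. windowing/overlap before FFT    */
        16384, /* 4. forward real FFT, four rounds   */
        24576, /* 5. HRTF multiplication             */
         8192, /* 6. inverse real FFT                */
        22528, /* 7. windowing/overlap on the output */
         8192, /* 8. audio source output             */
         8192  /* 9. mix into global reverb input    */
    };
    long per_turn = 0;
    for (size_t i = 0; i < sizeof(step_bytes) / sizeof(step_bytes[0]); i++)
        per_turn += step_bytes[i];

    const int turns_per_second = 48000 >> 10; /* 46 */
    printf("per turn:   %ld bytes\n", per_turn);                    /* 126976 */
    printf("per second: %ld bytes\n", per_turn * turns_per_second); /* 5840896, ~5.6 MB/s */
    return 0;
}
```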

Result

We have processed 1 audio source using 16384 + 22528 + 16384 + 24576 + 8192 + 22528 + 8192 + 8192 = 126976 bytes of memory bandwidth.

This 126k was calculated with a conservative "everything stays in cache" assumption; in the real world, however, we rarely, if ever, see an audio implementation optimized this well, with perfectly placed PLD instructions (and in Assembly code).

As you recall, we have 46 turns, so that's 126976 * 46 = 5,840,896 bytes, roughly 5.6 MB/s of memory bandwidth per audio source!

Furthermore, in the case of 8 audio sources, before mixing we're already using about 44.6 MB/s of bandwidth (5.6 * 8).

Mixing is easy: we take the 1024 samples * 2 channels * 4 bytes = 8192 bytes result, which calls for 8192 * 46 = 368 KB/s.
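The mix itself is a plain sum into the output buffer; with 8 sources that's 8 input reads plus 1 output write per turn, which is where the 9 * 8192 figure further below comes from (a sketch, assuming one interleaved stereo buffer per source):

```c
#define FRAMES   1024
#define CHANNELS 2
#define SOURCES  8

/* Mix 8 processed stereo sources into one output buffer:
   reads 8 x 8192 bytes and writes 1 x 8192 bytes per turn. */
static void mix(const float in[SOURCES][FRAMES * CHANNELS],
                float out[FRAMES * CHANNELS]) {
    for (int i = 0; i < FRAMES * CHANNELS; i++) {
        float sum = 0.0f;
        for (int s = 0; s < SOURCES; s++)
            sum += in[s][i];
        out[i] = sum;
    }
}
```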

[Please note, we haven't done any early reflections or occlusion processing at all in this case.]

Finally, we may have some global reverb effect. For brevity's sake, I won't go into details here. Suffice it to say, a good quality traditional (non-convolution) reverb has at least 12 buffers inside for its comb and all-pass filters. Let's assume 368 KB/s * 13 (the 12 internal buffers plus the output), which is about 5 MB/s for the reverb.
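To see why each internal buffer adds a full read-plus-write stream of its own, here's a minimal comb filter sketch (one of the at-least-12 such buffers; the delay length and feedback value are illustrative, not taken from any particular reverb):

```c
#define COMB_LEN 1687 /* illustrative delay length in samples */

typedef struct {
    float buf[COMB_LEN]; /* one of the reverb's internal buffers */
    int   pos;
    float feedback;      /* e.g. 0.8f, illustrative */
} comb_t;

/* Every sample reads the delay buffer once and writes it once, so each
   comb or all-pass filter is its own memory traffic stream. */
static float comb_process(comb_t *c, float in) {
    float delayed = c->buf[c->pos];              /* read  */
    c->buf[c->pos] = in + delayed * c->feedback; /* write */
    if (++c->pos >= COMB_LEN) c->pos = 0;
    return delayed;
}
```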

So, as you can see, HRTF 3D audio processing can easily overwrite all of the cache on an ARM SoC, multiple times.

Therefore, the super-conservative calculated estimate of roughly 50 MB/s bandwidth for the entire 8-source process looks understated. Just mixing the 8 sources together will take 9 * 8192 * 46 = 3.3 MB/s of bandwidth (8 input reads plus 1 output write per turn).

Not to mention, we often lose the FFT coefficients from L1 and L2 cache (there is no L3 on the Snapdragon), as well as other intermediate results, which then have to be loaded into cache again, taking up memory bus bandwidth.

Conclusion

Processing eight 3D audio sources with the very best optimized implementation (hand-crafted Assembly, the best algorithms) will take more than 100 MB/s of bandwidth. If we add loads from the cache as well (no doubt these must generate some heat too), we can easily go above 200 MB/s.

Our real-life testing suggests that the actual numbers are even larger than these theoretical ones. And we still haven't accounted for the thermal effects of CPU processing. If we were to factor those in as well, approaching 1 GB/s is plausible, which is big enough to have a significant effect on other processing needs, such as texture transfers for graphics.
