This post serves as a technical addendum to How 3D Spatialized Audio Bottlenecks Virtual Reality Video.
Below we walk through the memory bus bandwidth cost of HRTF (3D audio) processing, and how it easily exceeds 3 MB/s per channel.
Let's imagine a VR app running on a mobile device with an ARM SoC (RAM, CPU and GPU on one chip). The Apple A9 has a 64 KB L1 cache for each of its two cores, a 3 MB L2 cache shared by both cores, and a 4 MB L3 cache servicing the entire SoC. The Snapdragon 820 has only 1 MB of L2 and no L3.
As readers know, audio is processed in buffer-sized chunks; 1024 samples is a common buffer size in game engines. So we have 48000 Hz >> 10 = 46 "turns" every second. Between turns the cache is completely overwritten with other data, because the device is also processing huge amounts of graphics, for example.
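The turn arithmetic can be sketched in a few lines of C (the function name is mine; 48000 Hz and 1024-sample buffers are the values assumed in the text):

```c
// Audio is processed in buffer-sized "turns"; at 48000 Hz with
// 1024-sample buffers that is 48000 >> 10 = 46 turns per second
// (the right shift by 10 is integer division by 1024).
static int turns_per_second(int sample_rate, int buffer_samples) {
    return sample_rate / buffer_samples;
}
```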
Let's imagine we have 8 stereo sounds in the 3D space, and they are all available as raw PCM data in memory, so we don't really need any processing to "produce" them.
Audio engines don’t batch audio DSP tasks together: every sound source is processed one after another, and when that's done, the results are mixed together. FMOD in Unity transfers mono sources as stereo, that is, even a mono source always occupies 2 channels.
Audio samples are processed as 32-bit floating point numbers, not 16-bit signed integers. This is true of other audio engines (Wwise, Core Audio, etc.) as well: 16 bits are enough for storing audio, but not enough for high-quality audio processing.
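As an illustration of that format difference, a typical decode step (a sketch; the function name is hypothetical) expands each 16-bit sample into a 32-bit float before any DSP runs, doubling the in-flight data size:

```c
#include <stdint.h>
#include <stddef.h>

// Expand 16-bit signed PCM into 32-bit floats in the -1.0..1.0 range.
// Engines perform some variant of this before processing, which is why
// the data moving through the DSP chain is twice the size of the
// stored 16-bit source material.
static void pcm16_to_float(const int16_t *in, float *out, size_t samples) {
    for (size_t i = 0; i < samples; i++)
        out[i] = (float)in[i] / 32768.0f;
}
```

For a 1024-sample stereo buffer, this turns 4096 bytes of stored data into 8192 bytes of processing data.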
We have processed 1 audio source using 16384 + 22528 + 16384 + 24576 + 8192 + 22528 + 8192 + 8192 = 126,976 bytes of memory bus traffic.
This 126k figure was calculated under a conservative "everything stays in cache" assumption; in the real world, however, we rarely, if ever, see an audio implementation optimized this well, with perfectly placed PLD instructions and hand-written assembly.
As you recall, we have 46 turns per second, so that's 126,976 * 46 = 5,840,896 bytes = 5.7 MB/s of memory bandwidth per audio source!
Furthermore, with 8 audio sources we're already using 45.6 MB/s of bandwidth (5.7 * 8) before any mixing happens.
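These per-source figures can be reproduced directly from the eight per-step byte counts quoted above (the text does not label the individual steps, so the array below just carries the raw numbers; the function names are mine):

```c
// The eight per-step traffic figures quoted above, in bytes per
// 1024-sample turn, for one 3D audio source.
static const long kStepBytes[8] = {16384, 22528, 16384, 24576,
                                   8192, 22528, 8192, 8192};

// Total memory traffic for one source per turn.
static long source_bytes_per_turn(void) {
    long sum = 0;
    for (int i = 0; i < 8; i++) sum += kStepBytes[i];
    return sum;                                  // 126976 bytes
}

// Total memory traffic for one source per second, at 46 turns/second.
static long source_bytes_per_second(void) {
    return source_bytes_per_turn() * 46;         // 5840896 bytes, ~5.7 MB/s
}
```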
Mixing itself is cheap: the result buffer is 1024 samples * 2 channels * 4 bytes = 8192 bytes per turn, so writing it calls for 8192 * 46 = 368 KB/s.
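A minimal sketch of that mixing stage (plain C, no SIMD; the function name and buffer layout are my assumptions, using the 8-source, 1024-sample, stereo-float setup from the text):

```c
#include <stddef.h>

#define SOURCES 8
#define FRAMES  1024   /* samples per channel per turn */

// Sum 8 interleaved stereo float buffers into one output buffer.
// Per turn this reads 8 * 8192 bytes and writes 8192 bytes.
static void mix_sources(const float *in[SOURCES], float out[FRAMES * 2]) {
    for (size_t i = 0; i < FRAMES * 2; i++) {
        float sum = 0.0f;
        for (int s = 0; s < SOURCES; s++) sum += in[s][i];
        out[i] = sum;
    }
}
```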
[Please note, we haven’t even done any early reflections processing or occlusion at all in this case.]
Finally, we may have some global reverb effect. For brevity’s sake, I won’t go into details here; suffice it to say, a good-quality traditional (non-convolution) reverb has at least 12 buffers inside for its comb and all-pass filters. Let's assume 368 KB/s * 13 is about 5 MB/s for the reverb.
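For context, one feedback comb stage of such a reverb can be sketched like this (a generic textbook structure, not any particular engine's code; the type and function names are mine); each of the 12+ stages streams its own delay buffer through the cache on every turn:

```c
#include <stddef.h>

// One feedback comb filter stage of a Schroeder-style reverb.
// The output taps the delay line; input plus scaled feedback is
// written back, so the whole delay buffer is read and written
// once per turn, adding to memory traffic.
typedef struct {
    float *delay;     // delay line, 'length' samples, zero-initialized
    size_t length;
    size_t pos;
    float feedback;   // e.g. 0.8f
} comb_filter;

static void comb_process(comb_filter *c, const float *in,
                         float *out, size_t n) {
    for (size_t i = 0; i < n; i++) {
        float delayed = c->delay[c->pos];
        out[i] = delayed;
        c->delay[c->pos] = in[i] + delayed * c->feedback;
        if (++c->pos == c->length) c->pos = 0;
    }
}
```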
So, as you can see, HRTF 3D audio processing can easily overwrite the entire cache of an ARM SoC, multiple times.
Therefore, even the super-conservative calculated estimate of roughly 50 MB/s bandwidth for the entire 8-source process looks understated. Just mixing the 8 sources together, reading all eight buffers and writing the result, takes 9 * 8192 * 46 = 3.3 MB/s of bandwidth.
Not to mention that we often lose the FFT coefficients from the L1 and L2 caches (there is no L3 on the Snapdragon), as well as other intermediate results, which then have to be loaded into cache again, taking up memory bus bandwidth.
Processing eight 3D audio sources with the very best optimized implementation (hand-crafted assembly, the best algorithms) will take more than 100 MB/s of bandwidth. If we count loads from the cache as well (no doubt these generate some heat too), we can easily go above 200 MB/s.
Our real-life testing shows that the actual numbers are probably even larger than these theoretical ones. And we still haven’t accounted for the thermal effects of CPU processing. If we were to calculate those as well, approaching 1 GB/s is plausible, which is enough to have a significant effect on other processing needs, such as texture transfers for graphics.