Enhancing conversations with spatial audio

Surround sound has been a major feature of movie theaters for years. When sounds come from all over the theater, it really makes you feel like you’re in the middle of the action. Spatial audio is a newer technology that tries to recreate that same feeling when you’re just wearing a pair of headphones. For group communication apps like Clubhouse, spatial audio doesn’t just sound great; it actually makes it easier to follow the conversation.

Overview

I started looking into the possibilities with spatial audio soon after joining Clubhouse. Our ears are great at locating where a sound is coming from, and they do so via a few different methods, including comparing the timing at which the sound arrives at each ear. It turns out that if you apply the same subtle timing shifts to a sound being played by an app, you can make it seem like that sound is coming from whatever position you want.
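To make the timing idea concrete, here is a minimal sketch of how an arrival-time difference could be computed and applied. It assumes Woodworth's spherical-head approximation, an average head radius of 8.75 cm, and a 48 kHz sample rate; the function names and constants are illustrative assumptions, not values taken from the Clubhouse implementation.

```python
import math

SPEED_OF_SOUND_M_S = 343.0  # assumed: dry air at roughly 20 °C
HEAD_RADIUS_M = 0.0875      # assumed average head radius


def itd_seconds(azimuth_deg):
    """Estimate the interaural time difference with Woodworth's formula.

    azimuth_deg: 0 is straight ahead, +90 is hard right, -90 is hard left.
    The result is positive when the sound reaches the right ear first.
    """
    theta = math.radians(max(-90.0, min(90.0, azimuth_deg)))
    return (HEAD_RADIUS_M / SPEED_OF_SOUND_M_S) * (theta + math.sin(theta))


def pan_by_delay(mono, azimuth_deg, sample_rate=48000):
    """Return (left, right) sample lists, delaying the ear farther from the source."""
    delay = round(abs(itd_seconds(azimuth_deg)) * sample_rate)
    delayed = [0.0] * delay + list(mono)  # late copy for the far ear
    direct = list(mono) + [0.0] * delay   # padded so both channels match in length
    if azimuth_deg >= 0:                  # source on the right: left ear is late
        return delayed, direct
    return direct, delayed
```

Under these assumptions, a source placed hard right arrives at the left ear roughly 0.66 ms late, which is about 31 samples at 48 kHz. Timing is only one of the cues our ears use, so a delay alone gives a much rougher effect than a full HRTF.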

This effect is produced by a Head-Related Transfer Function (HRTF), a filter that processes an audio signal to make it seem like it’s arriving at your head from a specific angle. Even with an ordinary pair of headphones, you can position audio in two dimensions, making it seem to come from the left, right, or even behind a listener! The HRTF makes some assumptions about one’s head and ear geometry, but for most listeners, the effect is quite convincing.
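In the time domain, an HRTF is commonly applied as a pair of impulse responses, one per ear, that are convolved with the mono signal. The sketch below shows that idea with a direct convolution. The two impulse responses are made-up toy values chosen only to illustrate a source on the listener's right; real ones come from measured datasets and are far longer.

```python
def convolve(signal, kernel):
    """Direct-form convolution of two sample lists."""
    out = [0.0] * max(len(signal) + len(kernel) - 1, 0)
    for i, s in enumerate(signal):
        for j, k in enumerate(kernel):
            out[i + j] += s * k
    return out


def apply_hrir(mono, hrir_left, hrir_right):
    """Filter one mono stream into a (left, right) pair, one impulse response per ear."""
    return convolve(mono, hrir_left), convolve(mono, hrir_right)


# Toy, made-up impulse responses for a source on the listener's right:
# the right ear hears the sound immediately at full level, while the
# left ear hears it two samples later, quieter and slightly smeared.
TOY_HRIR_RIGHT_EAR = [1.0, 0.0, 0.0, 0.0]
TOY_HRIR_LEFT_EAR = [0.0, 0.0, 0.6, 0.1]

left, right = apply_hrir([1.0, 0.5], TOY_HRIR_LEFT_EAR, TOY_HRIR_RIGHT_EAR)
```

A production implementation would typically use FFT-based convolution and interpolate between measured directions, but the per-ear filtering shown here is the core of the technique.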

In typical audio communication apps, if there are multiple users speaking, their audio is simply mixed together as a mono stream before being played out. However, at Clubhouse, we can apply a spatial audio HRTF for each audio stream, and by doing so, position each speaker in their own unique space. This can make you feel like you’re in the midst of a conversation, rather than just listening in to a conference call.

Spatial Audio Benefits

In addition to making the experience more immersive, research shows that spatial audio improves intelligibility and reduces the cognitive load needed to follow a conversation.

Without spatial cues, listeners need to rely on other audio characteristics to determine who is speaking, typically voice timbre or amplitude. It’s true that these cues work fairly well, which is why we're all mostly able to follow along on a conventional conference call. However, differentiation by timbre comes at a substantial cognitive cost, especially when voices are similar (e.g., of similar age and gender). This process requires conscious attention and consequently a nontrivial cognitive load.

Spatial cues, on the other hand, can be distinguished even in the absence of attention. This leads to a reduction in cognitive load, even while increasing intelligibility.

Intuitively, this makes sense. Our brains have been using spatial cues to understand real-world conversations for thousands of years, but we've been forgoing the use of that specialized ability in conventional conference calls and meeting apps. Clubhouse’s use of spatial audio provides these critical cues, making conversations on the app feel more human and, at the same time, more effortless to follow.

Integration Complexities

Audio Plumbing

To integrate the HRTF technology into Clubhouse, we needed access to the raw received audio streams. We chose to do the processing on the client rather than the server to avoid adding delay to the audio (as would be unavoidable if the server had to decode, process, and re-encode each packet). With the client approach, we needed to hook into the received audio streams from our conferencing service, resample from the incoming audio’s native sampling rate to the HRTF sampling rate, and then apply the HRTF to each incoming audio stream. This resulted in a set of stereo output streams, which we then mixed and sent to the playout device.
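The client-side flow described above (resample, spatialize each stream, then mix) can be sketched roughly as follows. Everything here is an illustrative assumption rather than the actual Clubhouse code: the function names are hypothetical, the resampler uses simple linear interpolation where real code would use a band-limited filter, and `spatialize` stands in for whatever applies the HRTF to a single stream.

```python
def resample_linear(samples, src_rate, dst_rate):
    """Linear-interpolation resampler (illustrative only, not band-limited)."""
    if not samples or src_rate == dst_rate:
        return list(samples)
    n_out = int(len(samples) * dst_rate / src_rate)
    last = len(samples) - 1
    out = []
    for n in range(n_out):
        pos = n * src_rate / dst_rate      # position in the source signal
        i = min(int(pos), last)
        frac = pos - i
        nxt = samples[min(i + 1, last)]    # hold the final sample at the end
        out.append(samples[i] * (1.0 - frac) + nxt * frac)
    return out


def mix_stereo(streams):
    """Sum (left, right) pairs sample by sample and clamp to [-1.0, 1.0]."""
    length = max((max(len(l), len(r)) for l, r in streams), default=0)
    out_left = [0.0] * length
    out_right = [0.0] * length
    for left, right in streams:
        for i, v in enumerate(left):
            out_left[i] += v
        for i, v in enumerate(right):
            out_right[i] += v

    def clamp(v):
        return max(-1.0, min(1.0, v))

    return [clamp(v) for v in out_left], [clamp(v) for v in out_right]


def render(incoming, hrtf_rate=48000):
    """Produce one stereo frame from several mono streams.

    incoming: iterable of (mono_samples, sample_rate, spatialize), where
    spatialize(mono) returns a (left, right) pair for that speaker.
    """
    stereo = []
    for mono, rate, spatialize in incoming:
        stereo.append(spatialize(resample_linear(mono, rate, hrtf_rate)))
    return mix_stereo(stereo)
```

Keeping each stream separate until after spatialization is the important design point: once streams are mixed to mono, there is no way to place speakers at different positions.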

Note that since the output of the HRTF is different for the left and right ear (as that’s where much of the positioning comes from), it’s critical that stereo playout is used. This is relatively straightforward when using wired headphones, but gets much more complicated when Bluetooth headphones are involved. Only the A2DP Bluetooth audio profile natively supports stereo playout; the HSP and HFP profiles typically used by communication apps do not.

At present, there is no Bluetooth profile that supports both stereo playout and microphone input, which limits when spatial audio can be used while wearing Bluetooth headphones. We hope to work with the mobile OS providers to address this issue in the near future.
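One way to picture the resulting constraint is as a small decision function. This is only a sketch: the `Route` enum and both function names are hypothetical and do not correspond to any mobile OS API, and treating the built-in speaker as a non-spatial route is an assumption on my part.

```python
from enum import Enum


class Route(Enum):
    """Hypothetical labels for the active audio route."""
    WIRED_HEADPHONES = "wired"
    BLUETOOTH_A2DP = "a2dp"   # stereo playout, no microphone input
    BLUETOOTH_HFP = "hfp"     # microphone input, mono playout
    BLUETOOTH_HSP = "hsp"     # microphone input, mono playout
    BUILT_IN_SPEAKER = "speaker"


# Routes assumed to deliver separate left and right channels to each ear.
STEREO_HEADPHONE_ROUTES = {Route.WIRED_HEADPHONES, Route.BLUETOOTH_A2DP}


def should_spatialize(route):
    """Spatial audio only makes sense when each ear gets its own channel."""
    return route in STEREO_HEADPHONE_ROUTES


def bluetooth_route_for(needs_headset_mic):
    """Using the headset microphone forces a mono profile."""
    return Route.BLUETOOTH_HFP if needs_headset_mic else Route.BLUETOOTH_A2DP
```

In this sketch, a listener on Bluetooth headphones gets the spatial effect, but as soon as the headset microphone is needed the route falls back to a mono profile and the effect is lost.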

Psychoacoustics

As we thought about the product aspects of this feature, we had a lot of things to consider. How should speakers be placed in virtual audio space, and how does this change as the number of room participants grows? We ended up trying a number of different tunings, and some we were able to cross out right away — given the effectiveness of the technology, positions too far to the side just felt strange. However, bunching speakers too closely together reduced the value of the spatial positioning.

Once again, the approach that worked best took cues from real life. We considered how people usually arrange themselves when speaking in a group, and placed the first few participants accordingly. Then, when a new speaker arrived in a room, we positioned them in the largest remaining space, similar to how someone would enter into a typical conversation. We also applied a subtle transition so that if someone started talking who was a bit off to the side, your positioning would gradually update to face them (again, just like in real life).
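The placement and transition rules can be sketched as below. The arc limits of ±60 degrees, the smoothing factor, and the function names are all illustrative assumptions; the values actually used in the app were tuned by ear and are not given here.

```python
def next_azimuth(occupied, arc=(-60.0, 60.0)):
    """Pick an angle for a new speaker in the middle of the largest open gap.

    occupied: azimuths (degrees) already taken; 0 is straight ahead.
    arc: assumed limits, since positions too far to the side felt strange.
    """
    lo, hi = arc
    if not occupied:
        return (lo + hi) / 2.0
    points = [lo] + sorted(occupied) + [hi]
    # Each neighbouring pair is a gap; on ties, max() keeps the first one found.
    gap_start, gap_end = max(zip(points, points[1:]), key=lambda g: g[1] - g[0])
    return (gap_start + gap_end) / 2.0


def ease_toward(current_deg, target_deg, smoothing=0.2):
    """Move the listener's heading a fraction of the way toward the active speaker."""
    return current_deg + smoothing * (target_deg - current_deg)


# Seat four speakers one at a time, then gradually turn toward the second one.
seats = []
for _ in range(4):
    seats.append(next_azimuth(seats))

heading = 0.0
for _ in range(10):
    heading = ease_toward(heading, seats[1])
```

With these assumptions the first four speakers land at 0, -30, 30, and -45 degrees. Calling `ease_toward` once per audio frame or timer tick gives a gradual turn rather than an abrupt jump.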

This all took a fair amount of trial and error, and we tuned the exact values through listening tests and controlled experiments. Check out some of the reactions of our users.

Music

Everything I’ve discussed so far assumes that Clubhouse participants can be modeled as monophonic sources. However, on Clubhouse there are a number of rooms where people perform music live, taking advantage of the app’s ability to transmit stereo audio. This posed a unique problem for spatial audio: how could we support stereo sources while also trying to position everyone at precise locations in the room?

The approach we landed on was to essentially make each stereo source into two mono sources, spaced a predetermined distance apart (kind of like a boom box). There were a number of challenges here — figuring out exactly when to engage this mode isn’t straightforward, as some sources can dynamically change from mono to stereo — but this technique preserved the richness of the stereo effect while also allowing stereo streams to come from distinct spatial positions.
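A rough sketch of the boom-box idea follows. The 20-degree spread, the sample-comparison check for deciding whether a stream is really stereo, and the function names are assumptions made for illustration; as noted above, deciding when to engage this mode is harder in practice than this check suggests.

```python
def is_effectively_mono(left, right, tolerance=1e-3):
    """Naive check: treat the stream as mono if both channels match closely."""
    return len(left) == len(right) and all(
        abs(l - r) <= tolerance for l, r in zip(left, right)
    )


def virtual_sources(left, right, center_deg, spread_deg=20.0):
    """Map one incoming stream to a list of (mono_samples, azimuth_deg) sources.

    A mono-like stream becomes a single source at center_deg. A true stereo
    stream becomes two mono sources placed spread_deg apart around center_deg.
    """
    if is_effectively_mono(left, right):
        mono = [(l + r) / 2.0 for l, r in zip(left, right)]
        return [(mono, center_deg)]
    half = spread_deg / 2.0
    return [(list(left), center_deg - half), (list(right), center_deg + half)]
```

Each returned source can then go through the same per-stream HRTF processing as any other speaker, which is what lets a stereo performer keep their stereo image while still occupying a distinct spot in the room.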

Wrapping up

So that’s a whirlwind tour of spatial audio in Clubhouse. We think it’s a great way to make conversations on the app feel more authentic, and I hope you all are enjoying the feature.

If you missed our live conversation about spatial audio on Clubhouse, no worries! Check out the replay here. Also, if you’re someone who enjoys solving tough problems, join our team by checking out our job openings today!

Justin Uberti, Head of Streaming Technology

This post is part of our engineering blog series, Technically Speaking.