Every day, people on Clubhouse are learning together, laughing often, and connecting with people from around the world. We wanted to give our community the tools to share these incredible moments with people outside of Clubhouse. That’s why last year, we decided to launch Clips, which allowed users to capture the last 30 seconds of a room and share it on other social channels.
Rooms happen in real time, which presents some technical challenges: we had to get creative with how we stored the audio and pulled it into a shareable format.
Today, we’re sharing why Clips was an important feature to build, how we got started, and some of the challenges we encountered along the way. We’ll also cover some of the differences between building on iOS vs. Android. Let’s dive in.
The technical challenge of creating Clips
Our Client Engineers strive to keep our clients as simple and stateless as possible. Most screens in the app fetch information from our backend, render that state into a user-friendly UI, and only allow the user to mutate that state by calling another endpoint. Since most of the business logic is stored on the backend, it can be shared between our iOS and Android clients and updated much more quickly via a server deployment rather than waiting for a mobile app update.
However, building Clips posed a different set of constraints than most projects. Our backend app servers don’t process any room audio until rooms are ended. So, in order to render the short video for Clips on-demand for the user, we must capture everything we need to render it client side.
This meant that any time the user entered a room with clips enabled, we had to record a trailing buffer of audio along with all of the metadata about who was on stage and who was speaking the loudest. Given that this runs in the background for many users, the code needed to be written with performance in mind, ensuring we don’t keep data around longer than necessary.
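To illustrate the idea of a trailing buffer that never keeps more than the clip window in memory, here is a minimal Python sketch. The class name, frame sizes, and callback are hypothetical, not Clubhouse's actual implementation; the real clients do this with native audio frames.

```python
from collections import deque

class TrailingAudioBuffer:
    """Keeps only the most recent `seconds` of audio frames in memory.

    Hypothetical sketch: frame rate and the capture callback are
    illustrative stand-ins for the native audio pipeline.
    """

    def __init__(self, seconds: int = 30, frames_per_second: int = 50):
        # A deque with maxlen silently drops the oldest frame on append,
        # so memory use stays bounded no matter how long the room runs.
        self._frames = deque(maxlen=seconds * frames_per_second)

    def on_audio_frame(self, frame: bytes) -> None:
        # Called from the audio capture path for every incoming frame.
        self._frames.append(frame)

    def snapshot(self) -> bytes:
        # Called when the user taps the clip button: concatenate the
        # buffered frames into one contiguous chunk for rendering.
        return b"".join(self._frames)
```

The key design choice is that eviction is automatic and O(1) per frame, so the capture path does no extra work while the user simply listens.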
Additionally, when the user actually taps on the clip button, we need to render the clip — first for the preview and then for the final exported artifact. It wouldn’t be possible to round trip this to the server and have it render the video while the user is waiting, so it's necessary to do this rendering on the client side. This has its own platform-specific concerns we’ll detail later on.
Designing the feature & building a prototype
Before we got to work, we needed to design the feature. Our product team created a detailed brief for what they wanted the feature to do which could be passed on to the development team.
Our product management team and design team collaborated on the design of Clips. A few things they considered:
- Length of clips
- Editing experience
- Sharing experience
We decided we did not want a full editing experience and did not want to drop users into an editor and take them away from the Room. We determined that the simplest way to ensure users stayed within the experience would be to implement a button that allowed users to record the last 30 seconds of a room.
The team then created a prototype to validate the idea and to test different clip lengths on actual content. For the prototype, we built a duration picker into the UI so that, after selecting the clip, you could choose from different lengths (e.g. 5s, 15s, 30s, 45s, 60s, …). While this implementation was very crude and slow, it gave us a lot of confidence that the clip format was interesting and that 30 seconds was the right length. Neither of these was something we could judge just by looking at designs, without trying them out on real-world examples.
Clips on iOS
We ran into a few challenges when implementing clips on iOS. First, to render the actual visual animation we had designed, we needed to collect all the metadata for when users join or leave the stage, as well as whether or not they’re speaking. Thankfully, we were able to adapt some of the code we had for rendering replays in-app to collect the same data during live rooms.
Those events then were stored in a log so that we could stop at any point in time and replay the last 30 seconds of events to generate the animation. It was important to make sure that this metadata capture code was very efficient since it would need to run in the background whenever the user was in a room eligible for clips. One important optimization was to truncate this log periodically so memory usage didn’t grow unbounded while you stayed longer in a room.
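A minimal sketch of that event log, with periodic truncation, might look like the following Python. The class and event shapes are hypothetical; the point is that both truncation and replay are cheap because the log is kept sorted by timestamp.

```python
import bisect

WINDOW_SECONDS = 30.0  # length of a clip

class StageEventLog:
    """Append-only log of (timestamp, event) pairs, e.g. a speaker
    joining, leaving, or changing volume. Names are illustrative.
    """

    def __init__(self):
        self._times = []   # sorted timestamps, parallel to _events
        self._events = []

    def append(self, timestamp: float, event: dict) -> None:
        # Events arrive in order, so the timestamp list stays sorted.
        self._times.append(timestamp)
        self._events.append(event)

    def truncate(self, now: float) -> None:
        # Run periodically: drop events older than the clip window so
        # the log doesn't grow unbounded during a long room.
        cutoff = bisect.bisect_left(self._times, now - WINDOW_SECONDS)
        del self._times[:cutoff]
        del self._events[:cutoff]

    def replay(self, now: float) -> list:
        # Everything within the last WINDOW_SECONDS drives the clip
        # animation when the user taps the clip button.
        start = bisect.bisect_left(self._times, now - WINDOW_SECONDS)
        return self._events[start:]
```

Because `truncate` only ever removes a prefix, it can run on a timer without racing the capture path in any interesting way.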
Next, we had the issue of actually getting the audio content for the clip. In live rooms it was easy enough to record a trailing buffer of the conversation as you were listening. When the user is playing back a room replay, however, we take advantage of the HTTP Live Streaming (HLS) support built into AVPlayer. This gives us a ton of great features for free, but it also prevents us from accessing the audio content directly.
To work around this, when we start generating a clip while the user is listening to a replay, we actually have to hit the network, re-download the audio chunks before the current position, and then demultiplex the HLS stream to get access to the audio directly.
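The segment-selection part of that workaround can be sketched in a few lines of Python. This is a simplified model, not the client code: it assumes we already have the playlist's segment durations, and it ignores the demultiplexing step entirely.

```python
def segments_for_clip(segment_durations, position, window=30.0):
    """Given an HLS playlist's segment durations (seconds, in order)
    and the listener's current playback position, return the indices
    of the segments to re-download to cover the trailing window.

    Illustrative sketch: the real client must then demultiplex the
    downloaded segments to extract the raw audio.
    """
    # Compute the start time of each segment.
    starts = []
    t = 0.0
    for d in segment_durations:
        starts.append(t)
        t += d

    clip_start = max(0.0, position - window)
    # A segment is needed if it overlaps [clip_start, position].
    return [
        i for i, (s, d) in enumerate(zip(starts, segment_durations))
        if s < position and s + d > clip_start
    ]
```

Note that the clip window rarely lines up with segment boundaries, so the first and last downloaded segments are partially trimmed after demultiplexing.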
The last challenge we ran into was one of performance. We wanted to make sure that creating a clip was fast, since any delay while you're in a live room means you could miss someone asking you to speak, or make it hard to effectively moderate the conversation. In the prototype we used an AVAssetWriter and AVAssetWriterInputPixelBufferAdaptor to render the actual video for the clip, and then used an AVAssetExportSession to mix that video with the audio and prep it for sharing.
After some fiddling with different AVFoundation APIs, we eventually found a better way to combine the steps using an AVMutableVideoCompositionLayerInstruction, so we could insert the same layer and animations we use in our preview UI directly into the AVAssetExportSession. This cut the time to render a clip from 10+ seconds in the prototype to under 200ms on the fastest phones and just under a second on the slowest devices we support: an important performance win that makes clips usable while you are on stage.
Clips on Android
iOS had much nicer APIs to do the animated "drawing" required for the video than Android. Implementing this on Android required a lot of work and setup.
The APIs that actually generate video (with audio) are much lower-level on Android, which adds complexity. Plus, the sheer range of Android devices and OS versions means we can't rely on specific codecs being available at runtime, so our implementation needs to be a lot more resilient on Android.
On Android, here’s how we implemented Clips:
- Collect all the metadata in a rolling buffer (who's in the room, who's speaking, etc.)
- Collect the audio itself: for live rooms, content is streamed into a local file; for replays, we have to parse an MPEG-TS/HLS stream
- Set up a Surface to draw our video content on
- Draw the actual video based on the collected metadata
- Render a video (with no audio)
- Composite this video with the audio
- Save this video locally
- Use the Android Share intent to share off platform
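The steps above can be sketched as a simple pipeline. Everything here is a hypothetical stand-in: on Android the real stages are Surface/Canvas drawing, MediaCodec encoding, and MediaMuxer compositing, which are far more involved than these placeholders.

```python
# Hypothetical stand-ins for the platform stages.
def draw_frames(events):
    # Steps 3-4: draw one frame per metadata event onto a surface.
    return [f"frame:{e}" for e in events]

def encode_video(frames):
    # Step 5: encode the drawn frames into a video with no audio track.
    return {"video": frames}

def mux(video, audio):
    # Step 6: composite the video with the collected audio.
    return {**video, "audio": audio}

def make_clip(events, audio):
    """Orchestrates the pipeline; inputs are the rolling metadata
    buffer (steps 1) and the trailing audio (step 2)."""
    frames = draw_frames(events)
    video = encode_video(frames)
    clip = mux(video, audio)
    # Steps 7-8: the finished clip is saved locally, then handed to
    # the Android Share intent.
    return clip
```

The value of structuring it this way is that each stage can be tested and swapped independently, which matters when codec availability varies across devices.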
Although iOS and Android engineers at Clubhouse tackle a wide range of projects, we were particularly jazzed about Clips, as rendering audio and video together from live content presented new challenges that we don’t always get to tackle.
Today, thanks to our efforts and the rest of the team, Clubhouse users can easily share 30-second snippets of public room conversations, showing off what’s happening on the platform and encouraging new users to join in.
If you enjoyed reading this article, please join us on Clubhouse on Tuesday, December 6 at 3:00 PM PST for a live conversation about how we built clips on mobile. RSVP here today!
P.S. If you’re someone that enjoys solving tough problems, come join our team by checking out our job openings today! This post is part of our engineering blog series, Technically Speaking. If you liked this post and want to read more, click here.
Written by: Daniel Hammond (iOS Engineering Manager) and Daniel Grech (Android Engineer)