When I joined Clubhouse in March, everyone was working really hard to keep up with our explosive growth. Going from tens of thousands of users to millions practically overnight means that some of the product assumptions you made early on are suddenly obsolete. As we were working toward opening the Clubhouse community to even more people by dropping the invitation system and building our Android app, it became apparent that our notification philosophy would not scale.

Clubhouse is a live social audio platform, so it is important for members of the community to know when people they follow are talking. The main way we did that early on was by sending a notification whenever someone they followed started a room or joined as a speaker. While this worked well in the early days, as the community grew and people made more and more friends, what had at first been a helpful nudge to come listen to a friend slowly turned into an annoyance for our most engaged users.

All of a sudden, we had tied a behavior that makes the experience of using Clubhouse better (deepening connections on the platform) to a negative consequence (receiving too many notifications), so we set out to decouple those two things while making sure that people did not miss conversations they wanted to join.

Formulating the problem

At this point, we had a rough idea of our high level goal: reduce the number of noisy notifications while at the same time continuing to send the relevant ones. In other words, we have a set of events (notifications) which can be successful (they are “relevant”) or not, and we want to identify which of these fall in the former category. This sounds a lot like a classification problem, and — after unsuccessfully trying out a few heuristics which ended up hurting engagement — we decided it was time to train a predictive model.

Therefore, the first step was to properly define which events we were going to make predictions for, and how we would define success. This part is interesting because it showcases the nuances involved in building products that rely on machine learning: if you misformulate your problem or set the wrong target to optimize for, you will likely miss the mark. Yes, that’s foreshadowing!

Let’s take a look at how the notification pipeline works:

  • First, an event happens: someone creates a room or joins the stage in an existing one;
  • Next, we generate a list of candidates who could be interested and whom we might notify;
  • Then we use certain heuristics to avoid spamming people (for example, if someone has set their notifications to “send fewer notifications”, we might skip this one when we’ve already sent them a few recently);
  • We then send the notification;
  • The notification is delivered to the recipient’s phone;
  • And maybe the recipient sees the notification and decides whether they want to tap on it.
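To make the flow concrete, here is a minimal sketch of that pipeline in Python. All names, data structures, and the skip heuristic are illustrative assumptions, not our actual implementation:

```python
from dataclasses import dataclass

# Hypothetical stand-ins for real services (follow graph, send counters, push gateway).
FOLLOWERS = {42: [1, 2, 3]}          # actor_id -> follower user ids
RECENT_SENDS = {1: 0, 2: 5, 3: 1}    # user_id -> notifications sent recently
MAX_RECENT_SENDS = 3                 # illustrative cap, not a real setting
SEND_LOG = []

@dataclass
class RoomEvent:
    room_id: int
    actor_id: int   # who created the room or joined the stage as a speaker

def should_skip(user_id: int) -> bool:
    # Heuristic guardrails, e.g. honoring a "send fewer notifications" setting
    # or capping how many pushes a user received recently.
    return RECENT_SENDS.get(user_id, 0) >= MAX_RECENT_SENDS

def notify_for_event(event: RoomEvent) -> None:
    for user_id in FOLLOWERS.get(event.actor_id, []):   # candidate generation
        if should_skip(user_id):
            continue
        # The send is the last step we can observe reliably; delivery and
        # whether the recipient ever sees it is up to the mobile OS.
        SEND_LOG.append((user_id, event.room_id))

notify_for_event(RoomEvent(room_id=7, actor_id=42))
print(SEND_LOG)   # [(1, 7), (3, 7)] -- user 2 was skipped by the heuristic
```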

In an ideal situation, the events we’d consider would be “someone sees a notification”, but in practice, because of how mobile operating systems are designed, we do not reliably know whether a notification is successfully delivered — and have no way of knowing whether it was actually seen by the recipient. The only certainty we have is that we sent the notification, so in the absence of better data, the events we would have to classify are notification sends.

The second part was defining success. Our first instinct was to count taps on notifications, but after talking to the team and members of the community, we decided to broaden this to also count joining the room after the notification was sent, regardless of whether it was tapped.
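In code, the label for a single send might look something like the sketch below; the one-hour attribution window is an assumption used here only to make the definition concrete:

```python
from datetime import datetime, timedelta

# A sent notification counts as "relevant" if the recipient tapped it, or
# joined the room within some window after the send even without tapping.
# The one-hour window is an illustrative assumption.
ATTRIBUTION_WINDOW = timedelta(hours=1)

def is_relevant(sent_at: datetime, tapped: bool,
                room_join_times: list[datetime]) -> bool:
    if tapped:
        return True
    return any(sent_at <= join <= sent_at + ATTRIBUTION_WINDOW
               for join in room_join_times)

# Example: no tap, but the recipient joined the room 20 minutes after the send.
sent = datetime(2021, 6, 1, 18, 0)
print(is_relevant(sent, tapped=False,
                  room_join_times=[sent + timedelta(minutes=20)]))  # True
```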

Training the model and deploying to production

One of the decisions we made early on was to use Google BigQuery for all our data warehousing needs. While there are really powerful alternatives, we were drawn to BigQuery because the operational burden is minimal.

An additional benefit was the ability to leverage BigQuery ML to quickly train and evaluate models: because we already logged all notification send events, and because we could compute offline features (such as the number of notifications sent to a user in the previous week), we were able to quickly train an XGBoost model that performed well enough for us to move forward and deploy it to production for an online experiment.
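For illustration, training a boosted tree classifier in BigQuery ML boils down to a single SQL statement. The dataset, table, and feature names below are hypothetical, and in practice the training query lived in our pipeline rather than an ad hoc script:

```python
from google.cloud import bigquery

client = bigquery.Client()

# BOOSTED_TREE_CLASSIFIER is BigQuery ML's XGBoost-based model type.
client.query("""
CREATE OR REPLACE MODEL `notifications.relevance_model`
OPTIONS (
  model_type = 'BOOSTED_TREE_CLASSIFIER',
  input_label_cols = ['is_relevant']
) AS
SELECT
  is_relevant,                       -- 1 if tapped or joined the room, else 0
  sends_to_user_last_7d,             -- offline feature computed in the warehouse
  joins_from_notifications_last_7d,  -- offline feature
  local_hour_of_day                  -- derivable at send time
FROM `notifications.training_examples`
""").result()
```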

Because it was important to us that members of the community miss as few conversations as possible, we picked a threshold such that we’d achieve a 90% recall of relevant notifications, which meant we’d be able to cut 40% of notifications sent.
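Here is roughly how such a threshold can be chosen on a held-out set; the data below is synthetic and the exact numbers are placeholders:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Synthetic stand-ins for held-out labels and model scores.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=10_000)                           # 1 = relevant
y_score = np.clip(y_true * 0.4 + rng.random(10_000) * 0.6, 0, 1)   # fake scores

precision, recall, thresholds = precision_recall_curve(y_true, y_score)
# recall[i] is the recall when keeping sends with score >= thresholds[i], and it
# decreases as the threshold grows; take the largest threshold with recall >= 0.90.
ok = np.where(recall[:-1] >= 0.90)[0]
threshold = thresholds[ok[-1]]

keep = y_score >= threshold
print(f"threshold={threshold:.3f}, "
      f"sends kept={keep.mean():.0%}, "
      f"recall of relevant={y_true[keep].sum() / y_true.sum():.0%}")
```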

Online evaluation & hopes crushed

We exported the XGBoost model and incorporated it into our Django app. The features we used to train the model were either historical features computed offline in the data warehouse and stored in memcache once a day, or features we could derive at send time, such as the local time in the potential recipient’s timezone. This made the integration easier from an infrastructure perspective.
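A sketch of what send-time scoring can look like inside a Django app is below; the model path, cache keys, feature names, and threshold are all assumptions:

```python
import numpy as np
import xgboost as xgb
from django.core.cache import cache   # backed by memcache in this sketch

# Load the exported booster once at startup (hypothetical path).
_booster = xgb.Booster()
_booster.load_model("models/notification_relevance.bst")

FEATURES = ["sends_to_user_last_7d", "joins_from_notifications_last_7d",
            "local_hour_of_day"]

def relevance_score(user_id: int, local_hour: int) -> float:
    # Historical features are computed once a day in the warehouse and pushed
    # to memcache; send-time features (like local hour) are derived on the spot.
    offline = cache.get(f"notif_features:{user_id}") or {}
    row = [
        offline.get("sends_to_user_last_7d", 0),
        offline.get("joins_from_notifications_last_7d", 0),
        local_hour,
    ]
    dmatrix = xgb.DMatrix(np.array([row], dtype=float), feature_names=FEATURES)
    return float(_booster.predict(dmatrix)[0])

def should_send(user_id: int, local_hour: int, threshold: float = 0.2) -> bool:
    # The threshold value here is illustrative, not the one we shipped.
    return relevance_score(user_id, local_hour) >= threshold
```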

Like most features we release, we ran an A/B test, using the model to filter notifications sent to a small random subset of users and comparing their behavior to a control group for whom we did not filter notifications. After running this for a few days, we looked at the results... and they were disappointing: while we were sending about 40% fewer notifications, and while the average number of room joins from sent notifications stayed roughly level, the test group was overall less engaged. We had to go back to the drawing board.
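For context, the gating for such an experiment can be as simple as a stable hash of the user id; the experiment name, allocation, and hashing scheme below are assumptions for illustration, not a description of our experimentation system:

```python
import hashlib

def in_test_group(user_id: int, experiment: str = "notif_model_v1",
                  allocation: float = 0.05) -> bool:
    # Deterministic bucketing: the same user always lands in the same group.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 10_000 < allocation * 10_000

def maybe_send(user_id: int, score: float, threshold: float) -> bool:
    if in_test_group(user_id) and score < threshold:
        return False   # test group: drop sends the model deems noisy
    return True        # control group, and high-scoring sends, behave as before

# Roughly 5% of users end up in the test group.
print(sum(in_test_group(u) for u in range(100_000)) / 100_000)
```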

Reformulating the problem

Remember the ominous foreshadowing above? It took us a while to track down why using this model hurt our metrics when it was doing exactly what we had trained it to do. It boiled down to two reasons: one subtle, and one very obvious in hindsight.

The subtle reason is that not all notification sends are equal: when your friend Sally creates a room, we send you a notification, and when your other friend George joins as a speaker, we send you another one; but instead of showing you two notifications for the same room, which would be redundant, we update the first one. This means that, based on our definition of success above, if you join that room at any point after we notified you, we consider both of those notifications successful. Overall, this skews our positive set towards popular rooms that get lots of speakers in a short period of time.

The more obvious reason the model didn’t perform as we intended is that the prediction was made at the level of an individual notification, yet not every missed notification costs the same: not sending a relevant notification to a user who gets 10 of them on a given day hurts far less than not sending the one notification someone else would have received that day.

To alleviate both of those issues, we decided that instead of giving the same weight to all notifications, we would discount repeated notifications: if someone got only one relevant notification in a day, that notification would have a score of 1, but if they got 100 relevant notifications, each would be considered to contribute 0.01. We then used the same features to train a regression model predicting that score.
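A minimal sketch of that discounting, assuming we group relevant sends per recipient per day:

```python
from collections import defaultdict

def discounted_labels(sends):
    """sends: list of (user_id, day, notification_id, is_relevant) tuples."""
    # Count how many relevant notifications each user received on each day.
    relevant_per_user_day = defaultdict(int)
    for user_id, day, _, is_relevant in sends:
        if is_relevant:
            relevant_per_user_day[(user_id, day)] += 1

    # Each relevant send is worth 1 / (relevant sends that user got that day).
    labels = {}
    for user_id, day, notification_id, is_relevant in sends:
        if is_relevant:
            labels[notification_id] = 1.0 / relevant_per_user_day[(user_id, day)]
        else:
            labels[notification_id] = 0.0
    return labels

sends = [
    ("alice", "2021-06-01", 1, True),    # Alice's only relevant send -> 1.0
    ("bob",   "2021-06-01", 2, True),    # Bob got two relevant sends -> 0.5 each
    ("bob",   "2021-06-01", 3, True),
    ("bob",   "2021-06-01", 4, False),   # irrelevant send -> 0.0
]
print(discounted_labels(sends))   # {1: 1.0, 2: 0.5, 3: 0.5, 4: 0.0}
```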

Thankfully, because we had built our training pipeline using dbt and the dbt_ml package, adapting the code and retraining the model was trivial. After a couple of weeks of experimenting with different thresholds, we were able to cut roughly 50% of the notifications we were sending while keeping constant both the number of rooms joined from notifications and the number of users returning to Clubhouse thanks to a notification.

Conclusion

If building deterministic software is akin to putting together a puzzle (which might sometimes be a very difficult puzzle), training and shipping a machine learning model is like putting together a puzzle where someone handed you a bag containing pieces from different puzzles, an old candy wrapper, and a couple of LEGO plates.

This project highlights the benefits of leveraging managed services to accelerate development by focusing on the unique parts of the work, as well as the benefits of being a full stack data scientist/machine learning engineer: it touched all parts of the process and infrastructure, from figuring out what to optimize (and going back to the drawing board) to building a training pipeline, doing the feature engineering, and coming up with a way to serve the model within our infrastructure.

Thanks for reading! If you missed our live conversation about the blog on Clubhouse, tap here to listen to the replay and join the conversation. And if you enjoy working on challenges like these, check out our job openings and join us!

Adrien Friggeri, Data Scientist

This post is part of our engineering blog series, Technically Speaking. If you liked this post and want to read more, click here.