Motivation for Dynamite
One of the central parts of building any web service is storing user data efficiently, correctly, and safely.
This is particularly critical and challenging at a social networking startup. We’re iterating faster than ever on making new products and improving features for our community. At the same time, we need to ensure the reliability of cross-cutting features like account deactivation, and have to safeguard a deeply-nested social graph with nuanced behavior and granular user controls impacting which users can read and write different pieces of data.
About 9 months ago, we hit limits in traditional tooling we were relying on and wanted to move faster.
At Clubhouse, we love operationally-simple, predictable tools that reduce surprises. So, we’ve been building our newer products and migrating existing ones to AWS DynamoDB. It’s fully-managed, and offers the right pairing of excellent performance characteristics for common cases along with predictable limits and fine-grained control over hot query paths.
It was critical to make the developer experience on the transition outstanding for product teams.
We built Dynamite, a compact Python client library for DynamoDB that improves on the state-of-the-art around object mapping libraries. It offers thread- and type-safety, automated schema validation, integration with regulatory tooling used by our legal & policy team, a practical version of row-level security, and tools to help ensure data deletion coverage.
Best of all, there’s no codegen: engineers get instant feedback as they write code, without needing to repeatedly run a script.
The first version of Dynamite
DynamoDB is a NoSQL database with an extremely lightweight schema system that tracks only the fields that have defined indexes.
This makes it very simple to start out with.
Performance was the core objective when we started our migration to DynamoDB, and we had collective experience with performance surprises from ORMs. We wanted the simplest, most predictable solution we could get. We started with a simple wrapper around AWS’s boto library that let us issue tightly optimized queries to DynamoDB in a thread-safe manner. Over time, we developed internal norms around how to use Python dataclasses to pass around items from DynamoDB in our code, in ways that would offer compile-time type coverage via mypy.
This was a huge performance improvement.
The need for schemas
But as more developers started to use it, we saw recurring classes of bugs: setting up a new table required boilerplate, with room for mistakes. The most common mistake was hand-written serialization logic whose compile-time data types would drift out of sync with the run-time data types. Here’s an example:
A developer might try to save an integer timestamp. Our wrapper around boto would coerce this to the Decimal type that boto expects. But then, because we didn’t have schema information, we would read it back as a Decimal. Developers would forget that the int they saved came back as a Decimal, and hit downstream bugs in code expecting ints but getting Decimal objects. And beyond this, an int isn’t even the right data type here; idiomatic Python would represent a timestamp like this as a datetime object.
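A minimal sketch of this round-trip pitfall (the timestamp value and variable names are illustrative):

```python
from decimal import Decimal
from datetime import datetime, timezone

created_at = 1622505600            # developer saves an int timestamp
stored = Decimal(created_at)       # boto coerces ints to Decimal on write
read_back = stored                 # without a schema, it reads back as Decimal

# Downstream code expecting an int now gets a Decimal. Ideally the
# library would convert once, at the boundary, to a proper datetime:
restored = datetime.fromtimestamp(int(read_back), tz=timezone.utc)
```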
Ideally, we could implement this once.
We looked at options for getting schemas added to the codebase. We wanted something that was layered on top of our existing library for sensitive perf cases, had a clear migration path from current code with minimal work, and would provide compile-time type information. Developers like IDE autocompletion, and we leverage mypy for compile-time type-checking in the surrounding codebase. We also wanted to be able to build systems solutions for solving cross-cutting data privacy needs on top of this.
Given these requirements, it made sense to pursue the simplest in-house solution possible.
We developed the first version in 2 days, and by end-of-week, had 3 separate products leveraging it, with developers in different areas of the company providing helpful feedback. Over the following months, we incrementally iterated on the syntax based on the experience of developers.
Here’s an example of what a schema looks like today:
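The real schema syntax is internal to Clubhouse, so this is a sketch under assumed names: RoomTopicItem, RoomTopicsTable, and the serialization helpers below are all illustrative, not the actual API.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from decimal import Decimal

@dataclass(frozen=True)
class RoomTopicItem:              # plays the role of a DynamiteItem:
    room_id: str                  # partition key
    topic_id: str                 # sort key
    created_at: datetime          # stored as a DynamoDB number

class RoomTopicsTable:            # plays the role of a DynamiteTable:
    table_name = "room_topics"

    @staticmethod
    def serialize(item: RoomTopicItem) -> dict:
        # Convert typed fields into the wire types boto expects.
        return {
            "room_id": item.room_id,
            "topic_id": item.topic_id,
            "created_at": Decimal(int(item.created_at.timestamp())),
        }

    @staticmethod
    def deserialize(raw: dict) -> RoomTopicItem:
        # Convert wire types back into compile-time types, once,
        # so callers never see a stray Decimal.
        return RoomTopicItem(
            room_id=raw["room_id"],
            topic_id=raw["topic_id"],
            created_at=datetime.fromtimestamp(
                int(raw["created_at"]), tz=timezone.utc
            ),
        )

item = RoomTopicItem("room-1", "topic-7", datetime(2021, 6, 1, tzinfo=timezone.utc))
round_tripped = RoomTopicsTable.deserialize(RoomTopicsTable.serialize(item))
```

Because the item is an ordinary dataclass, mypy and IDE autocompletion work on it with no generated code.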
A developer implements two classes. A DynamiteItem tracks the set of fields that an item has. DynamiteTable tracks configuration for the table itself, and provides standard read and write methods to fetch and write DynamiteItems. This plays nicely with mypy out-of-the-box, and there’s no codegen needed to generate typed accessors.
We use these schemas to power a variety of internal tools for developers and cross-functional partners.
During development, the schemas are used to power type-safe serialization logic. When a developer is ready to deploy, they run a one-time script that generates Terraform configs to deploy the table in production. We developed a script that can efficiently and accurately sync the schemas into Transcend’s data mapping tooling, to reduce manual work needed to generate compliance documentation like a Record of Processing Activities.
Read & write policies
After the initial rollout of Dynamite, we explored solving more security and privacy tasks with it.
One of the most important concerns with any collected user data is making sure only the correct users can read and write it. A standard method for this in many web applications is endpoint gating: you add logic to each endpoint that verifies whether the authenticated user is allowed to access the data on that endpoint. This works well for applications like CMSes, where there’s a limited set of admin endpoints with clear data per page and coarse permissions.
The need for read & write policies
This strategy doesn’t scale well for social networks, though.
Social networks typically have deeply nested hierarchies of data, and granular user controls on who can see what data. If you add a new endpoint, you need to think about what data might be on the endpoint, and keep the gating logic in sync with that. To make a social network robust, gating really needs to happen at each read or write callsite, so the gating logic remains in sync with the data requested. Clubhouse initially solved this with wrapper classes that implement gating logic for each database query, and a norm of calling those wrappers instead of the database directly.
This is a very effective solution, but similar to schemas, there’s room to simplify code and make it more robust, especially around such a sensitive area.
Protecting data with read & write policies
A traditional generalization of this is Row-Level Security (RLS) in databases like PostgreSQL. Gating logic manifests in read and write policies defining which users can access or edit data.
It’s great in concept, but a key challenge is that the policy specifications are frequently too restrictive for real-world applications, to the point that they may introduce risk. Policies are typically written in a database-specific language like SQL, when in reality gating logic typically needs to stay in sync with some amount of business logic. This introduces code duplication, or work to add user-defined functions. Additionally, RLS policies are tightly tied to your database, which creates issues if you need to add a caching layer or want to use a different database.
Dynamite has a practical version of RLS that we call read and write policies, inspired by the open-source Ent library. It looks like this:
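The exact internal syntax isn’t public, so this is a minimal sketch of the waterfall-rule idea; the Verdict enum, rule names, and dict-shaped viewer/item are all illustrative.

```python
from enum import Enum, auto

class Verdict(Enum):
    ALLOW = auto()
    DENY = auto()
    SKIP = auto()

# Each rule takes a viewer and an item, and returns a verdict.
def deny_if_blocked(viewer, item):
    return Verdict.DENY if item["owner_id"] in viewer["blocked_by"] else Verdict.SKIP

def allow_if_owner(viewer, item):
    return Verdict.ALLOW if viewer["id"] == item["owner_id"] else Verdict.SKIP

def allow_if_public(viewer, item):
    return Verdict.ALLOW if item["visibility"] == "public" else Verdict.SKIP

# A policy is an ordered list of rules.
READ_POLICY = [deny_if_blocked, allow_if_owner, allow_if_public]

def check(policy, viewer, item) -> bool:
    # Waterfall evaluation: the first ALLOW or DENY wins; SKIP defers
    # to the next rule. If every rule skips, default-deny.
    for rule in policy:
        verdict = rule(viewer, item)
        if verdict is Verdict.ALLOW:
            return True
        if verdict is Verdict.DENY:
            return False
    return False

allowed = check(READ_POLICY,
                {"id": "u2", "blocked_by": set()},
                {"owner_id": "u1", "visibility": "public"})
```

Defaulting to deny when every rule skips is the safe choice for a policy engine: forgetting a rule fails closed, not open.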
Each read or write policy is a list of sequential waterfall rules. Each rule is a function that takes a viewer object (e.g. the current authenticated user for the endpoint) along with the DynamiteItem being read or the write operation being executed. Each rule returns allow, deny, or skip. Allow grants access. Deny blocks access. Skip defers to the next rule in the list.
Rules are reusable, and policies are easy to enforce – each read or write method on DynamiteTable checks the read or write policy automatically.
It also helps prevent bugs around batch writes. If we make 3 calls to wrapper methods, and gating logic on the 2nd call fails, we might end up with the database in a weird spot. In order to fix this, you’d need to refactor your code to have a bulk write method, with a bulk gating check. With Dynamite, we can offer this for free – before a bulk write, we can coalesce the list of attempted writes, and check all relevant policies before actually persisting any data.
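A sketch of the check-then-persist idea, with an illustrative Table class and write rule standing in for the real DynamiteTable machinery:

```python
class Table:
    def __init__(self, write_rule):
        self.write_rule = write_rule    # returns True if the viewer may write the item
        self.rows = []                  # stand-in for DynamoDB storage

    def bulk_write(self, viewer, items):
        # Phase 1: check the write policy for every pending item first.
        denied = [i for i in items if not self.write_rule(viewer, i)]
        if denied:
            raise PermissionError(f"{len(denied)} item(s) failed the write policy")
        # Phase 2: only now persist, so a mid-batch denial
        # can no longer strand the database in a partial state.
        self.rows.extend(items)

table = Table(write_rule=lambda viewer, item: item["owner_id"] == viewer)

# Every item passes the policy, so both are written:
table.bulk_write("u1", [{"owner_id": "u1", "v": 1}, {"owner_id": "u1", "v": 2}])

# One item fails the policy, so nothing from this batch is written:
try:
    table.bulk_write("u1", [{"owner_id": "u1", "v": 3}, {"owner_id": "u2", "v": 4}])
except PermissionError:
    pass
```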
Erasure policies
Another key privacy task is data deletion. Users expect social networks to have a working “Deactivate Account” button. And it’s even becoming a key requirement of Apple App Store review.
The need for erasure policies
The challenge of a “Deactivate Account” button is usually not in adding the button, but orchestrating what happens after.
Unlike a SQL database, DynamoDB doesn’t have a built-in concept of cascading deletes. The naive solution is a method that implements the cascade by hand: you sequentially call a bunch of methods that delete objects, and those deletion methods call other delete methods, and so on.
But! We’re dealing with a social network here.
There’s a deeply-nested graph, with cycles. We absolutely don’t want to forget to delete data. And we also don’t want to accidentally delete something that we shouldn’t (e.g. a user that didn’t request deletion). It’s very easy to over- or under-traverse the graph by mistake. And verifying that the naive solution is correct can be time-consuming and error-prone.
Ensuring coverage with erasure policies
We added erasure policies to Dynamite, to provide a declarative, automatically-verifiable approach for defining deletion cascades in social graphs.
When discussing with developers internally, and auditing code, we realized that the typical developer makes three sequential design decisions:
- Do I need to delete this? Global system configuration that’s not related to a user definitely shouldn’t be deleted – that would break the app! But most other data should be.
- When should it be deleted? Developers need to define what deletions should cascade to rows in a table. For example, if a room is deleted, we should also delete the room’s associated topics.
- What should be deleted? Most of the time, the entire row should be deleted. But there are cases when only a field should be deleted (e.g. when a user is associated with content they don’t own), and we should disassociate them without deleting the entire row.
Erasure policies convert this series of design decisions into declarative code:
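The real declarative syntax is internal, so this is a hedged sketch of what such a policy might encode; the enum values, field names, and example tables are illustrative, not Dynamite’s actual API.

```python
from dataclasses import dataclass
from enum import Enum, auto

class ErasureAction(Enum):
    RETAIN = auto()        # "Do I need to delete this?" -> no (e.g. global config)
    DELETE_ROW = auto()    # "What should be deleted?" -> the entire row
    CLEAR_FIELDS = auto()  # -> only specific fields; keep the row, disassociate

@dataclass(frozen=True)
class ErasurePolicy:
    cascades_from: str           # "When?": parent whose deletion triggers this
    action: ErasureAction        # "What?": row vs. fields
    fields: tuple = ()           # only used with CLEAR_FIELDS

# If a room is deleted, delete its associated topics outright:
room_topics_policy = ErasurePolicy(
    cascades_from="rooms",
    action=ErasureAction.DELETE_ROW,
)

# If a user is deleted, disassociate them from content they don't own
# by clearing the reference field instead of deleting the row:
mention_policy = ErasurePolicy(
    cascades_from="users",
    action=ErasureAction.CLEAR_FIELDS,
    fields=("mentioned_user_id",),
)
```

Because each table declares its policy as data rather than imperative delete calls, a verifier can walk every schema and flag tables whose policy is missing or suspicious.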
The declarative format makes it possible to have schema verification rules that check that there’s a sufficient erasure policy, with CI either passing or triggering manual review depending on the policy.
Dynamite is one of the workhorse tools that backend and full stack engineers use every day at Clubhouse to iterate quickly on products, empowered with easy-to-use, privacy-aware infrastructure.
P.S. If you’re someone that enjoys solving tough problems, come join our team by checking out our job openings today! This post is part of our engineering blog series, Technically Speaking. If you liked this post and want to read more, click here.
Written by: Jonathan Wilde (Privacy Engineer)