This post has been more than 4 years since it was last updated.
If you are reading this article, probably you are interested to work with a NoSQL database and have heard about this awesome and most popular NoSQL database of choice called Cassandra. One of the most important things about creating a project based on cassandra is modeling your data. A data model can be the difference between a successful Apache Cassandra project and a failed one. A good portion of this article is devoted to understanding how to do it right and putting you in right track.
Design Differences Between RDBMS and Cassandra
Before we start creating our Cassandra data model, let’s take a minute to highlight some of the key differences in doing data modeling for Cassandra versus a relational database.
You cannot perform joins in Cassandra. If you need something like a join, you’ll have to either do the work on the client side, or create a denormalized second table that represents the join results for you.
NO REFERENTIAL INTEGRITY
Although Cassandra supports features such as lightweight transactions and batches, Cassandra itself has no concept of referential integrity across tables. In a relational database, you could specify foreign keys in a table to reference the primary key of a record in another table. But Cassandra does not enforce this. It is still a common design requirement to store IDs related to other entities in your tables, but operations such as cascading deletes are not available.
In relational database design, we are often taught the importance of normalization. This is not an advantage when working with Cassandra because it performs best when the data model is denormalized. It is often the case that many applications end up denormalizing data in relational database. There are two common reasons for this. One is performance. You simply can’t get the performance you need when you have to do so many joins on years’ worth of data, so you denormalize along the lines of known queries. This ends up working, but ultimately makes one question whether using a relational database is the best approach in these circumstances.
In Cassandra you don’t start with the data model, you start with the query model. Instead of modeling the data first and then writing queries, you model the queries and let the data be organized around them. Note down the most common query paths your application will use, and then create the tables that you need to support them.
Step 1: Determine What Queries to Support
Try to determine exactly what queries you need to support. This can include a lot of considerations that you may not think of at first. For example, you may need to think about:
- Grouping by an attribute
- Ordering by an attribute
- Filtering based on some set of conditions
- Enforcing uniqueness in the result set etc …
Changes to just one of these query requirements will frequently warrant a data model change for maximum efficiency.
Step 2: Try to create a table where you can satisfy your query by reading (roughly) one partition
In practice, this generally means you will use roughly one table per query pattern. If you need to support multiple query patterns, you usually need more than one table.
To put this another way, each table should pre-build the “answer” to a high-level query that you need to support. If you need different types of answers, you usually need different tables. This is how you optimize for reads.
Data duplication is okay. Many of your tables may repeat the same data.
Rules by Examples
Example 1: User Lookup
The high-level requirement is “we have users and want to look them up”. Let’s go through the steps:
Step 1: Determine what specific queries to support Let’s say we want to either be able to look up a user by their username or their email. With either lookup method, we should get the full set of user details.
Step 2: Try to create a table where you can satisfy your query by reading one partition Since we want to know the full details for user with either lookup method, it’s ideal to use two tables:
CREATE TABLE users_by_username ( username text PRIMARY KEY, email text, age int ) CREATE TABLE users_by_email ( email text PRIMARY KEY, username text, age int )
Now, let’s check two fundamental rules for this model:
Data spreads evenly? Each user gets their own partition, so yes.
Minimal partitions read? We only have to read one partition, so yes.
Example 2: User Club
Now the high-level requirement has changed. Users are in clubs, and we want to get all users in a club.
Step 1: Determine what specific queries to support We want to get the full user info for every user in a particular club. Order of users does not matter.
Step 2: Try to create a table where you can satisfy your query by reading one partition How do we fit a club into a partition? We can use a compound PRIMARY KEY for this:
CREATE TABLE clubs ( clubname text, username text, email text, age int, PRIMARY KEY (clubname, username) )
Note that the PRIMARY KEY has two components: clubname, which is the partitioning key, and username, which is called the clustering key. This will give us one partition per clubname. Within a particular partition (club), rows will be ordered by username. Fetching a club is as simple as doing the following:
SELECT * FROM clubs WHERE clubname = ?
This satisfies the goal of minimizing the number of partitions that are read, because we only need to read one partition. However, it doesn’t do so well with the first goal of evenly spreading data around the cluster. If we have thousands or millions of small clubs with hundreds of users each, we’ll get a pretty even spread. But if there’s one club with millions of users in it, the entire burden will be shouldered by one node (or one set of replicas).
If we want to spread the load more evenly, there are a few strategies we can use. The basic technique is to add another column to the PRIMARY KEY to form a compound partition key. Here’s one example:
CREATE TABLE clubs ( clubname text, username text, email text, age int, hash_prefix int, PRIMARY KEY ((clubname, hash_prefix), username) )
The new column, hash_prefix, holds a prefix of a hash of the username. For example, it could be the first byte of the hash modulo four. Together with clubname, these two columns form the compound partition key. Instead of a club residing on one partition, it’s now spread across four partitions. Our data is more evenly spread out, but we now have to read four times as many partitions. This is an example of the two goals conflicting. You need to find a good balance for your particular use case. If you do a lot of reads and clubs don’t get too large, maybe changing the modulo value from four to two would be a good choice. On the other hand, if you do very few reads, but any given club can grow very large, changing from four to ten would be a better choice.
There are other ways to split up a partition, which I will cover in the next example.
Before we move on, let me point out something else about this data model: we’re duplicating user info potentially many times, once for each club. You might be tempted to try a data model like this to reduce duplication:
CREATE TABLE users ( id uuid PRIMARY KEY, username text, email text, age int )
CREATE TABLE clubs ( clubname text, user_id uuid, PRIMARY KEY (clubname, user_id) )
Obviously, this minimizes duplication. But how many partitions do we need to read? If a club has 1000 users, we need to read 1001 partitions. This is probably 100x more expensive to read than our first data model. If reads need to be efficient at all, this is not a good model. On the other hand, if reads are extremely infrequent, but updates to user info (say, the username) are extremely common, this data model might actually make sense. We have to make sure to take your read/update ratio into account when designing schema.
We’ve covered a few fundamental practices and walked through a detailed example to help you get started with Cassandra data model design. Here are the key takeaways:
- Don’t think of a relational table, but think of a nested sorted map data structure while designing Cassandra column families.
- Model column families around query patterns. But start your design with entities and relationships, if you can.
- De-normalize and duplicate for read performance. But don’t de-normalize if you don’t need to.
There are many ways to model. The best way depends on your use case and query patterns.
All Rights Reserved