Using Cassandra

Liam Brandt edited this page Nov 8, 2018 · 1 revision

The most important thing in Cassandra is to know its rules and play by them. A few of them may seem counterintuitive at first, but once you embrace them you'll wonder why it isn't always done this way.

First, some non-goals. These are things you might be inclined to try to achieve, but in Cassandra they are a waste of your time.

  1. Replication of Data: Storing the same data in several places in the database may seem bad: if it is already written somewhere, why not just use that copy? In practice this is of little concern, because disk space is rarely the limiting factor on today's computers.
  2. Write Queries: Keeping the number of write queries low is nice, but writes are quite cheap in Cassandra and almost all of them are equally efficient. If writing to more than one place makes your reads more efficient, go for it.

An actual goal that our team should treat as law is "Read queries MUST access only one partition" (we will discuss partitions in a moment). We achieve this by designing each table and its primary key so that it is efficient for the most important queries. Each table has a primary key that uniquely identifies every row in that table. The basic makeup of a primary key is as follows:

PRIMARY KEY = PARTITION KEY + CLUSTERING KEY

The primary key is made up of the partition key and the clustering key. The partition key is a sequence of one or more columns of the table; the clustering key is also a sequence of columns of the table, and it may be left out entirely.

Basic makeup example:

PARTITION KEY = COLUMN 1 + COLUMN 2 + COLUMN 5

CLUSTERING KEY = COLUMN 3 + COLUMN 4
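As a rough illustration of the makeup above, the sketch below splits a row into its partition-key and clustering-key values. The column names (`column1` through `column6`) and the `primary_key` helper are hypothetical, chosen only to mirror the example; the CQL in the comment shows how the same grouping would typically be declared in a table definition.

```python
from typing import NamedTuple

# Hypothetical column names mirroring the example above. In CQL the same
# grouping is declared as:
#   PRIMARY KEY ((column1, column2, column5), column3, column4)
PARTITION_KEY = ("column1", "column2", "column5")
CLUSTERING_KEY = ("column3", "column4")


class PrimaryKey(NamedTuple):
    partition: tuple
    clustering: tuple


def primary_key(row: dict) -> PrimaryKey:
    """Split a row into its partition-key and clustering-key values."""
    return PrimaryKey(
        partition=tuple(row[c] for c in PARTITION_KEY),
        clustering=tuple(row[c] for c in CLUSTERING_KEY),
    )


row = {
    "column1": "a", "column2": "b", "column3": 1,
    "column4": 2, "column5": "c", "column6": "payload",
}
key = primary_key(row)
```

Here `key.partition` decides which partition the row lives in, and `key.clustering` decides where the row sits inside that partition. `column6` is ordinary data and takes no part in the key.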

The tables themselves are made of rows and columns. The columns are what is defined in the schema. This is kind of like defining the type of object that will be in the table. The rows are like the objects in the table: they have members whose types and names are given by the schema.

The rows are split into partitions, and the rows inside each partition are stored in sorted order. There is one partition for each distinct partition key value, and within a partition the rows are ordered (either ascending or descending) by the clustering key.
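A toy in-memory model can make this layout concrete. It is only a sketch of the idea, not how Cassandra stores data on disk: the `ToyTable` class and the sample keys are made up for illustration. Each distinct partition key gets its own bucket, and reading a bucket returns its rows ordered by clustering key.

```python
from collections import defaultdict


class ToyTable:
    """In-memory stand-in for one table (illustrative only)."""

    def __init__(self):
        # partition key -> {clustering key -> value}
        self._partitions = defaultdict(dict)

    def write(self, partition_key, clustering_key, value):
        # Writing an existing primary key overwrites the row (an upsert).
        self._partitions[partition_key][clustering_key] = value

    def read_partition(self, partition_key, descending=False):
        # A read touches exactly one partition and returns its rows
        # ordered by clustering key.
        rows = self._partitions.get(partition_key, {})
        return sorted(rows.items(), reverse=descending)


snapshots = ToyTable()
snapshots.write("system-a", 3, "third")
snapshots.write("system-a", 1, "first")
snapshots.write("system-b", 2, "other system")
snapshots.write("system-a", 2, "second")

ascending = snapshots.read_partition("system-a")
newest_first = snapshots.read_partition("system-a", descending=True)
```

Rows written out of order still come back sorted, and rows for `system-b` never appear in a read of `system-a` because they live in a different partition.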

So now we have a better understanding of what our goal means. When accessing the database, construct your query so that it fixes the partition key and varies only the clustering key. Even if a table has the data you seek, that data may not all be stored in the same partition. This is where our non-goals come into play. We should ignore the idea that replication of data is bad, and feel entirely free to make tables that contain the same data as others, but structured in a way that adheres to our rule. For example, reading the latest snapshot of every system and reading every snapshot of one system are different access patterns, so we would need two tables to make both of those reads efficient.
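The snapshot example can be sketched with two toy tables that hold the same data under different keys. Everything here is an assumption made for illustration: the table names, the `group` column used as the partition key of the second table, and the simplification that snapshots arrive in time order so the most recent write is the latest snapshot.

```python
from collections import defaultdict

# Table 1: partition key = system_id, clustering key = taken_at.
# Serves "every snapshot of one system" from a single partition.
snapshots_by_system = defaultdict(dict)

# Table 2: partition key = group (hypothetical), clustering key = system_id.
# Serves "latest snapshot of every system" from a single partition.
latest_by_group = defaultdict(dict)


def record_snapshot(group, system_id, taken_at, data):
    """Write the same snapshot to both tables (duplication is fine)."""
    snapshots_by_system[system_id][taken_at] = data
    # Assumes snapshots arrive in time order, so overwriting keeps the latest.
    latest_by_group[group][system_id] = (taken_at, data)


def all_snapshots(system_id):
    """Single-partition read: every snapshot of one system, oldest first."""
    return sorted(snapshots_by_system.get(system_id, {}).items())


def latest_snapshots(group):
    """Single-partition read: the latest snapshot of every system."""
    return sorted(latest_by_group.get(group, {}).items())


record_snapshot("lab", "sys-a", 1, "a1")
record_snapshot("lab", "sys-b", 1, "b1")
record_snapshot("lab", "sys-a", 2, "a2")

history = all_snapshots("sys-a")
latest = latest_snapshots("lab")
```

Each write costs two inserts instead of one, which the non-goals above say is acceptable, and in exchange both reads stay inside one partition.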
