WTF is a SuperColumn? An Intro to the Cassandra Data Model
For the last month or two the Digg engineering team has spent quite a bit of time looking into, playing with and finally deploying Cassandra in production. It’s been a super fun project to take on – but even before the fun began we had to spend quite a bit of time figuring out Cassandra’s data model… the phrase “WTF is a ’super column’” was uttered quite a few times.
If you’re coming from an RDBMS background (which is almost everyone) you’ll probably trip over some of the naming conventions while learning about Cassandra’s data model. It took me and my team members at Digg a couple of days of talking things out before we “got it”. In recent weeks a bikeshed went down in the dev mailing list proposing a completely new naming scheme to alleviate some of the confusion. Throughout this discussion I kept thinking: “maybe if there were some decent examples out there people wouldn’t get so confused by the naming.” So, this is my stab at explaining Cassandra’s data model; it’s intended to help you get your feet wet & doesn’t go into every single detail but, hopefully, it helps clarify a few things.
BTW: this is long. If you’d rather have a PDF version of this you can download it here.
The Pieces
Let’s first go thru the building blocks before we see how they can all be stuck together:
Column
The column is the lowest/smallest increment of data. It’s a tuple (triplet) that contains a name, a value and a timestamp.
Here’s a column represented in JSON-ish notation:
{ // this is a column
name: "emailAddress",
value: "arin@example.com",
timestamp: 123456789
}
That’s all it is. For simplicity’s sake let’s ignore the timestamp. Just think of it as a name/value pair.
Also, it’s worth noting that the name and value are both binary (technically byte[]) and can be of any length.
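If it helps to see that as code, here’s a minimal Python sketch of a column. The Column class is purely illustrative (it’s not part of any Cassandra client API); it just mirrors the name/value/timestamp triplet described above, using bytes for the name and value.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    """Illustrative stand-in for a Cassandra Column (not a real client class)."""
    name: bytes      # binary, any length
    value: bytes     # binary, any length
    timestamp: int   # travels with the name and value


email = Column(name=b"emailAddress", value=b"arin@example.com", timestamp=123456789)

# Ignoring the timestamp, a Column is just a name/value pair.
pair = (email.name, email.value)
```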
SuperColumn
A SuperColumn is a tuple w/ a binary name & a value which is a map containing an unbounded number of Columns – keyed by the Column’s name. Keeping with the JSON-ish notation we get:
{ // this is a SuperColumn
name: "homeAddress",
// with an infinite list of Columns
value: {
// note the key is the name of the Column
street: {name: "street", value: "1234 x street", timestamp: 123456789},
city: {name: "city", value: "san francisco", timestamp: 123456789},
zip: {name: "zip", value: "94107", timestamp: 123456789},
}
}
Column vs SuperColumn
Columns and SuperColumns are both tuples w/ a name & value. The key difference is that a standard Column’s value is a “string” and in a SuperColumn the value is a Map of Columns. That’s the main difference… their values contain different types of data. Another minor difference is that SuperColumns don’t have a timestamp component to them.
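Here’s a rough Python sketch of that difference. Both classes are hypothetical stand-ins written for this post, not real Cassandra types: Column holds a plain value plus a timestamp, while SuperColumn holds a map of Columns keyed by name and has no timestamp of its own.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Column:
    name: bytes
    value: bytes     # a plain binary value
    timestamp: int


@dataclass
class SuperColumn:
    name: bytes
    # The value is a map of Columns keyed by each Column's name.
    # Note: no timestamp field on the SuperColumn itself.
    value: Dict[bytes, Column] = field(default_factory=dict)

    def add(self, column: Column) -> None:
        self.value[column.name] = column


home = SuperColumn(name=b"homeAddress")
home.add(Column(b"street", b"1234 x street", 123456789))
home.add(Column(b"city", b"san francisco", 123456789))
home.add(Column(b"zip", b"94107", 123456789))

city = home.value[b"city"].value
```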
Before We Get Rolling
Before I move on I wanna simplify our notation a couple ways: 1) ditch the timestamps from Columns & 2) pull the Columns’ & SuperColumns’ names component out so that it looks like a key/value pair. So we’re gonna go from:
{ // this is a super column
name: "homeAddress",
// with an infinite list of columns
value: {
street: {name: "street", value: "1234 x street", timestamp: 123456789},
city: {name: "city", value: "san francisco", timestamp: 123456789},
zip: {name: "zip", value: "94107", timestamp: 123456789},
}
}
to
homeAddress: {
street: "1234 x street",
city: "san francisco",
zip: "94107",
}
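If you want to see that simplification spelled out, here’s a small Python sketch. The simplify function is a made-up helper for illustration; it takes the verbose form (as nested dicts) and produces the shorthand by dropping the timestamps and pulling the names out as keys.

```python
def simplify(super_column: dict) -> dict:
    """Collapse the verbose form into {name: {column_name: column_value}}."""
    columns = {
        col["name"]: col["value"]   # keep name and value, drop the timestamp
        for col in super_column["value"].values()
    }
    return {super_column["name"]: columns}


verbose = {
    "name": "homeAddress",
    "value": {
        "street": {"name": "street", "value": "1234 x street", "timestamp": 123456789},
        "city": {"name": "city", "value": "san francisco", "timestamp": 123456789},
        "zip": {"name": "zip", "value": "94107", "timestamp": 123456789},
    },
}

simple = simplify(verbose)
```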
Grouping ‘Em
There’s a single structure used to group both the Columns and SuperColumns… this structure is called a ColumnFamily and comes in 2 varieties: Standard & Super.
ColumnFamily
A ColumnFamily is a structure that contains an infinite number of Rows. Huh, did you say Rows? Ya – rows. To make it sit easier in your head just think of it as a table in an RDBMS.
OK – each Row has a client supplied (that means you) key & contains a map of Columns. Again, the keys in the map are the names of the Columns and the values are the Columns themselves:
UserProfile = { // this is a ColumnFamily
phatduckk: { // this is the key to this Row inside the CF
// now we have an infinite # of columns in this row
username: "phatduckk",
email: "phatduckk@example.com",
phone: "(900) 976-6666"
}, // end row
ieure: { // this is the key to another row in the CF
// now we have another infinite # of columns in this row
username: "ieure",
email: "ieure@example.com",
phone: "(888) 555-1212",
age: "66",
gender: "undecided"
},
}
Remember: for simplicity we're only showing the value of the Column but in reality the values in the map are the entire Column.
You can think of it as a HashMap/dictionary or associative array. If you start thinking that way then you’re on the right track.
One thing I want to point out is that there’s no schema enforced at this level. The Rows do not have a predefined list of Columns that they contain. In our example above you see that the row with the key “ieure” has Columns with names “age” and “gender” whereas the row identified by the key “phatduckk” doesn’t. It’s 100% flexible: one Row may have 1,989 Columns whereas the other has 2. One Row may have a Column called “foo” whereas none of the rest do. This is the schemaless aspect of Cassandra.
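To play with the “dictionary of dictionaries” idea, here’s a sketch that uses a plain Python dict as a stand-in for the UserProfile ColumnFamily. This is only a mental model, not how you’d talk to Cassandra; the “foo” Column is just the example name from the paragraph above.

```python
# A plain dict standing in for a Standard ColumnFamily: row key -> {column name: value}
user_profile = {
    "phatduckk": {
        "username": "phatduckk",
        "email": "phatduckk@example.com",
        "phone": "(900) 976-6666",
    },
    "ieure": {
        "username": "ieure",
        "email": "ieure@example.com",
        "phone": "(888) 555-1212",
        "age": "66",
        "gender": "undecided",
    },
}

# Each Row carries its own set of Column names; nothing forces them to match.
ieure_only = set(user_profile["ieure"]) - set(user_profile["phatduckk"])

# Adding a Column to one Row changes nothing about the other Rows.
user_profile["phatduckk"]["foo"] = "bar"
rows_with_foo = [key for key, row in user_profile.items() if "foo" in row]
```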
A ColumnFamily Can Be Super Too
Now, a ColumnFamily can be of type Standard or Super.
What we just went over was an example of the Standard type. What makes it Standard is the fact that all the Rows contain a map of normal (aka not-Super) Columns… there are no SuperColumns scattered about.
When a ColumnFamily is of type Super we have the opposite: each Row contains a map of SuperColumns. The map is keyed with the name of each SuperColumn and the value is the SuperColumn itself. And, just to be clear, since this ColumnFamily is of type Super, there are no standard Columns sitting directly in its Rows. Here’s an example:
AddressBook = { // this is a ColumnFamily of type Super
phatduckk: { // this is the key to this row inside the Super CF
// the key here is the name of the owner of the address book
// now we have an infinite # of super columns in this row
// the keys inside the row are the names for the SuperColumns
// each of these SuperColumns is an address book entry
friend1: {street: "8th street", zip: "90210", city: "Beverley Hills", state: "CA"},
// this is the address book entry for John in phatduckk's address book
John: {street: "Howard street", zip: "94404", city: "FC", state: "CA"},
Kim: {street: "X street", zip: "87876", city: "Balls", state: "VA"},
Tod: {street: "Jerry street", zip: "54556", city: "Cartoon", state: "CO"},
Bob: {street: "Q Blvd", zip: "24252", city: "Nowhere", state: "MN"},
...
// we can have an infinite # of SuperColumns (aka address book entries)
}, // end row
ieure: { // this is the key to another row in the Super CF
// all the address book entries for ieure
joey: {street: "A ave", zip: "55485", city: "Hell", state: "NV"},
William: {street: "Armpit Dr", zip: "93301", city: "Bakersfield", state: "CA"},
},
}
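In the same dictionary mindset, a Super ColumnFamily is just one more level of nesting. Here’s a sketch with a trimmed-down copy of the AddressBook data as nested Python dicts; again, this is only an illustration of the shape, not a real client call.

```python
# Row key -> SuperColumn name -> Column name -> value
address_book = {
    "phatduckk": {
        "friend1": {"street": "8th street", "zip": "90210", "city": "Beverley Hills", "state": "CA"},
        "John": {"street": "Howard street", "zip": "94404", "city": "FC", "state": "CA"},
    },
    "ieure": {
        "joey": {"street": "A ave", "zip": "55485", "city": "Hell", "state": "NV"},
        "William": {"street": "Armpit Dr", "zip": "93301", "city": "Bakersfield", "state": "CA"},
    },
}

# Three lookups: the Row, then the SuperColumn, then the Column.
johns_zip = address_book["phatduckk"]["John"]["zip"]

# All the address book entries (SuperColumn names) in ieure's Row.
ieure_entries = list(address_book["ieure"])
```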
Keyspace
A Keyspace is the outermost grouping of your data. All your ColumnFamilies go inside a Keyspace. Your Keyspace will probably be named after your application.
Now, a Keyspace can have multiple ColumnFamilies but that doesn’t mean there’s an imposed relationship between them. For example: they’re not like tables in MySQL… you can’t join them. Also, just because ColumnFamily_1 has a Row with key “phatduckk” that doesn’t mean ColumnFamily_2 has one too.
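One way to picture this: a Keyspace is a dict of independent ColumnFamilies. The sketch below is a toy model with made-up, cut-down data, and lookup is a hypothetical helper, but it shows the “no joins” point: pulling data from two ColumnFamilies means two separate lookups, and a Row key that exists in one may be missing from the other.

```python
# A toy Keyspace: ColumnFamily name -> ColumnFamily (row key -> row)
keyspace = {
    "UserProfile": {
        "phatduckk": {"username": "phatduckk"},
        "ieure": {"username": "ieure"},
    },
    "AddressBook": {
        "phatduckk": {"John": {"city": "FC"}},
    },
}


def lookup(cf_name, row_key):
    """Fetch one Row from one ColumnFamily; returns None if the Row is absent."""
    return keyspace[cf_name].get(row_key)


# Two independent lookups; nothing ties the two ColumnFamilies together.
profile = lookup("UserProfile", "ieure")
book = lookup("AddressBook", "ieure")   # no Row with this key in AddressBook
```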
Sorting
