Cassandra


Intro

Cassandra was strongly influenced by Amazon's Dynamo, a distributed key-value store. It's important to understand a few items before getting into too much detail about Cassandra. One of the important concepts to understand is ACID:

  • A--Atomic, this means all or nothing. A transaction either commits, or it fails.
  • C--Consistent, this means that everyone reading the data gets the same result, assuming we are looking at the exact same record.
  • I--Isolated, this means that transactions are isolated from each other; two transactions cannot operate on the same data at the same time.
  • D--Durable, this means that once a transaction is complete, it is written to disk and survives a failure.
  • Transactions are great for those that need absolute consistency, however transactions do not scale out well on a large scale, especially across multiple nodes. This is where Cassandra can help.
  • Cassandra is based on a shared-nothing architecture. This means that there are no master, slave, or control servers. Each node does exactly the same job as every other node.

Basic Overview

Cassandra's strong points are as follows; note how some of these points fit into the ACID idea.

Distributed Cassandra is meant to run on as many nodes as possible to achieve the most performance. Any write can go to any node, and any read can go to any node, it doesn't matter, Cassandra knows what's up.

Decentralized Cassandra has no single point of failure. All nodes are the same. A peer to peer protocol is used to communicate and it uses gossip to keep track of any dead nodes in the cluster. This means that nodes can fail and the cluster will keep on keeping on, with a bit of a performance hit, but with many nodes this is minimal.

Elastic Scalability This means that the cluster seamlessly scales up or down if nodes are added or removed from the cluster. This all happens automatically, and it's awesome.

Consistency Reads always return the most recent value no matter what. Consistency is tunable, and the client can make this more or less consistent depending on the results that they want returned. This can affect performance, but again, this is set per client, so the cluster as a whole does not really care.

Levels of Consistency

There are 3 types of consistency that Cassandra can use:

Strict Any read will always return the most recent written value. However, issues can arise with this, especially when multiple nodes are involved, or located in different parts of the world. What would happen if 4 nodes are all off by 3 seconds? Bad things, that's what. Use NTP to keep the cluster in sync.

Causal If recent writes are causally related, they are read in the order that they were written.

Weak (eventual) This means that all updates will eventually propagate across the cluster.

  • Cassandra prefers to always be writable, which means that there are never any locks in place preventing a write from taking place, cough, cough, MyISAM.

There are some variables that can be set that affect consistency and replication; they are as follows:

Replication Factor The number of nodes across the cluster that you want updates to propagate to when a write happens. This determines how much performance is sacrificed for more consistency.

Consistency level Specified for every operation done by the client. How many replicas across the cluster must respond when a write or a read is performed. The number of nodes specified must all respond, otherwise the operation will fail.

  • An example of this would be a consistency level of 10 on a 10 node cluster. An operation must hear back from all 10 nodes to be considered complete. The sketch below shows the idea.
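
To make the replication factor / consistency level relationship concrete, here is a minimal Python sketch (the node list and the write_with_consistency helper are invented for illustration) of a coordinator counting replica acknowledgements against the requested consistency level:

# Minimal sketch of a coordinator enforcing a consistency level.
# The replica list and "alive" flags are made up for illustration.

def write_with_consistency(replicas, consistency_level):
    """Return True only if at least `consistency_level` replicas acknowledge."""
    acks = 0
    for replica in replicas:
        if replica.get("alive", False):   # pretend a live replica always acks
            acks += 1
    return acks >= consistency_level

replicas = [{"name": "node%d" % i, "alive": True} for i in range(10)]
replicas[3]["alive"] = False              # one dead node in a 10 node cluster

# Consistency level 10 on a 10 node cluster: every replica must respond.
print(write_with_consistency(replicas, 10))   # False, the operation fails
print(write_with_consistency(replicas, 9))    # True, 9 replicas responded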

Cassandra versus other types of systems

This section is meant to list a few different systems and the software that is used. Hopefully this will clear up some confusion on what is what.

Relational

  • MySQL
  • SQL Server
  • Postgres

Amazon Dynamo based

  • Cassandra
  • Voldemort
  • CouchDB
  • Riak

Google Big Table based

  • MongoDB
  • Hadoop
  • Hypertable
  • Redis

Rows versus Columns

Cassandra is column-oriented, however each row can have as many columns in it as needed. This means that one row could have 100 columns in it, and another row could have 5 columns in it. This is a very important thing to remember, and it is one of the major differences between typical relational databases and NoSQL databases.

  • Each row has a Unique Key that makes it accessible
  • Data is not restricted by schema, there is no need to pre-determine schema ahead of time.
  • A Keyspace is the container for column families. You can think of this as Keyspace (database) and column family (table). Keep in mind that this is just a way to visualize how data is stored; these two items are not the same as databases and tables, but they are similar. The sketch below shows a rough mental model.
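
Here is that mental model as plain Python dictionaries (illustrative only; the keys and values are made up, and this is not Cassandra's actual storage format):

# Rough mental model: keyspace -> column family -> rows, where each row,
# keyed by its unique key, can hold a completely different set of columns.
keyspace = {
    "MyKeyspace": {                      # keyspace, roughly a database
        "User": {                        # column family, roughly a table
            "mjung": {"fname": "mike", "email": "[email protected]"},
            "bob":   {"fname": "bob", "phone": "555-0100",
                      "city": "Lansing", "state": "MI", "zip": "48901"},
        }
    }
}

# One row has 2 columns, the other has 5, and no schema had to be declared.
for key, columns in keyspace["MyKeyspace"]["User"].items():
    print(key, "has", len(columns), "columns")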

Directories

Cassandra has a few different directories that are of interest. I'm listing them here to be thorough.

bin This holds the binaries to run Cassandra. This also holds some scripts such as nodetool, which are used to get information on the cluster.

conf Contains storage-conf.xml which defines the data store and things like keyspaces and column families. The main conf file is also located here.

interface This contains the API for accessing Cassandra

javadoc Contains Java documentation

lib Contains external libraries that Cassandra needs to run.

Basic Commands

This starts the Cassandra server in the foreground, which can be useful for testing or troubleshooting.

bin/cassandra -f

This starts the command line interface used to interact with Cassandra

bin/cassandra-cli

This displays the help menu when in the cli

>help

You can also connect directly to the cluster using this command

bin/cassandra-cli localhost/9160

Here are some other commands that can be used to find information


>show cluster name

>show keyspaces

>show api version

Here are some commands to create and use keyspaces

###Create a keyspace (database) named "MyKeyspace" where each piece of data is stored on 1 node (1 copy total)

>create keyspace MyKeyspace with replication_factor=1

###Same as use database in MySQL land

>use MyKeyspace

###Similar to create table $tablename

>create column family User

###Gives information on what is in the keyspace (database)

>describe keyspace MyKeyspace

Inserting and selecting

###The following commands are inserting some columns in the row 'mjung' which these commands create. 

>set User ['mjung']['fname'] = 'mike'
>set User ['mjung']['email'] = '[email protected]'

###['fname'] and ['email'] are columns for the key ['mjung'] in this case

###The following command should now return two results, which were added above

>count User['mjung']

###The following command will return the two sets of data from above, but will display the values instead of the count

>get User['mjung']

###This command will remove my email record

>del User['mjung'] ['email']

###This command removes everything in the key, so it would remove me as a user

>del User['mjung'] 

Cassandra's Data Model(bottom up view)

  • Before we get into Cassandra's data model, we should take a quick look at the relational model
  • Databases contain tables, which contain columns, and rows insert data into those columns.
  • Each row has a Unique Key assigned to it.
  • Data that is inserted is restricted by the table's schema, so you cannot add a column for a row if the column has not already been defined.

Now forget all of the above

Column Families

At the basic level, this is what a data store looks like in Cassandra's world:

[value1] [value2] [value3]

This is useful to store things, but not so useful if you want to keep track of what each value is. To fix this, the following is done.

[name1]   [name2] [name3]
[value1] [value2] [value3]

This means that we can now say that [name1] = [firstname] or [email] or anything you want. The values are then inserted, so that we know [email protected] is not a phone number but an email address.

  • This pair of name/value is a column.
  • The entity that holds data using this pair is called a row.
  • The unique identifier for each row is called a key.
  • Column Families are containers for these rows, which are made up of columns.

Keys

  • Key names can be strings, integers, UUIDs, or whatever else. You cannot do this with relational databases.

Rows

  • You don't need to store a value for every column in a row. This means that instead of using a NULL value for information that you don't have, you can simply not have a column for that value. This saves space since NULL actually uses space.

Columns

  • Every column will always have a timestamp associated with it, however, this is more like a metadata timestamp and it cannot be queried against. So you can't really do something like "select * from $column ORDER BY timestamp". Not in Cassandra's world at least.

Super Column Families

  • You can create a group of related columns, which is basically like making a map of maps (yo dawg). This is useful for nesting things; see the sketch below.
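
As a rough illustration (plain Python dicts, names invented), a super column adds one more level of nesting:

# Row key -> super column name -> column name -> value (a map of maps of maps).
address_book = {
    "mjung": {
        "home": {"street": "123 Main St", "city": "Lansing"},
        "work": {"street": "456 Oak Ave", "city": "Detroit"},
    }
}
print(address_book["mjung"]["work"]["city"])   # Detroit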

Cassandra's Data Model(top down view)

Cluster

  • The outer-most layer of Cassandra is the cluster. The cluster is made up of the nodes (servers). You can also think of a cluster as a ring.
  • Data is assigned to each node according to its position in the ring, as the sketch below illustrates.
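
A toy Python sketch of the ring idea (node names, token values, and the hashing scheme are all made up; real Cassandra partitioners and token assignment are more involved): a row key hashes to a position on the ring, and the first node at or after that position owns it.

import hashlib
from bisect import bisect_left

# Toy ring: each node owns a token, a row key hashes to a position on the
# ring, and the first node at or after that position is responsible for it.
nodes = {0: "node1", 64: "node2", 128: "node3", 192: "node4"}
tokens = sorted(nodes)

def token_for(row_key):
    return hashlib.md5(row_key.encode()).digest()[0]   # 0..255 toy token space

def node_for(row_key):
    t = token_for(row_key)
    i = bisect_left(tokens, t)
    return nodes[tokens[i % len(tokens)]]               # wrap around the ring

for key in ("mjung", "bob", "alice"):
    print(key, "->", node_for(key))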

Node(replica)

  • Each node holds replicas for ranges of data in the cluster. This is meant to accomplish fault-tolerance. The number of nodes which act as replicas is set by the client.

Keyspaces

  • Each cluster contains at least one Keyspace, there can be as many keyspaces in a cluster as you want, but generally, you want to create one Keyspace per application.
  • Multiple keyspaces can all co-exist together in the cluster without any issues.
  • You can set different replication factors and settings for each Keyspace in a cluster

Replication Factor The number of nodes that act as replicas for each row of data. This is transparent to the client. The higher the replication factor, the more consistent the data, but the lower the performance. The lower the factor, the higher the performance, but the data is less consistent.

Replica Placement Strategy How the replicas are placed in the ring.

  • Simple -- This means the ring is NOT rack aware
  • Old Network -- This means that the ring is rack aware
  • New Network -- This means that the ring is data center aware.
  • Please note that rack and data center don't necessarily mean that they are physically different locations.

Column Families

Column Families represent the structure of data.

  • Column families contain columns, however the columns are not defined up front; the rows define what columns there are.
  • You can add any column you want in a row, they do not have to all match. This allows for very flexible data structures.

There are two attributes for each column family

  • Name
  • Comparator, which defines how columns are sorted during a query.

There are two types of column families

  • Standard which is the default. This is just some columns that are related.
  • Super which is the nesting of columns. Another way to look at this is a column of columns.

There are a few options that can be changed per column family

keys_cached The number of locations that keys are cached per SSTable.

rows_cached The number of rows whose entire contents are cached in RAM.

comment This is used to add information on how column families are defined. Set by the developer to keep track of things.

read_repair_chance A value between 0 and 1. If enabled, each read operation is run against two or more nodes (replicas), and if one of those results is out of date or different, it is repaired with the most current record. You can disable this if the amount of reads is much greater than the amount of writes; this would improve performance across the cluster.

preload_row_cache This will force the cache to rebuild and get moved to RAM if a node is starting up for the first time, or if it was restarted. This can prevent a "cold cache" that sometimes happens with MySQL.

Columns(rows define columns)

  • A column is the smallest unit of data structure for Cassandra.
  • A column consists of name, value and timestamp. These are the "meta data" values associated with each column.
  • Rows for the same column family are stored on disk in the same file.

There can be two types of rows, at least at the extremes:

  • Skinny rows contain fewer columns, but there are a lot of rows. (Think hotdog instead of hamburger)
  • Wide rows mean fewer rows overall, but many, many more columns in each row (Hamburger instead of hotdog).

Sorting methods for columns

ascii Directly compares bytes that are parsed as ascii

bytes This is the default sorting type. This will properly sort most kinds of data.

Lexical UUID 16 byte UUID

Long Type 8 byte long numeric

Integer Type Same as long type, but faster and more flexible.

Time UUID Based on a timestamp and MAC address

UTF8 A UTF-8 encoded string; useful when you want the data to be validated

Custom Whatever you want
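
A small Python illustration of why the comparator matters: the same column names sort differently as raw strings/bytes than as numbers.

# Column names "1", "2", "10" sorted two ways, mimicking the difference
# between a bytes/ascii comparator and a numeric (long/integer) comparator.
names = ["10", "2", "1"]

print(sorted(names))            # ['1', '10', '2']  byte/ascii order
print(sorted(names, key=int))   # ['1', '2', '10']  numeric order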

Super Columns

  • Useful for keeping "related" columns together in the same file on disk, which helps to improve performance.

What Cassandra does not offer

  • There is no query language built in. You can use the API, but there is no built in SQL.
  • There is no concept of JOINS. Cassandra does not need to join tables, since it is not relational.
  • There are no secondary indexes. Since there are no joins, this would not work anyway. You can, however, create another column family that holds lookup data (see the sketch after this list).
  • You cannot ORDER BY or GROUP BY in the sense that you would with a relational database.
  • Cassandra performs best when data is denormalized.
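
A minimal sketch of the lookup column family pattern mentioned above, using plain Python dicts (the users / users_by_state names are invented): the application writes the same fact twice, once into the data and once into a lookup row, so it can later "query" by that value without a secondary index.

# Two hand-maintained "column families": the data and a lookup by state.
users = {}
users_by_state = {}

def add_user(key, columns):
    users[key] = columns
    # Denormalize: also write the key into the lookup column family.
    users_by_state.setdefault(columns["state"], {})[key] = ""

add_user("mjung", {"fname": "mike", "state": "MI"})
add_user("bob",   {"fname": "bob",  "state": "CA"})

# "Query" by state without a secondary index: read the lookup row.
print(list(users_by_state["MI"].keys()))   # ['mjung']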

Cassandra Architecture

System Keyspace

  • This is named "system" and it stores metadata about the cluster to help aid with operations. This is similar to the mysql database that MySQL uses to keep track of things.
  • The meta data includes the following

  • node's token
  • cluster name
  • keyspace definition
  • migration data
  • whether the node is bootstrapped

Peer to Peer

  • There is no master or slave with Cassandra. Each node talks to each other to learn how the ring is formed. Because of this, new nodes are automatically detected and dead nodes are removed.

Gossip

  • The gossip protocol uses an intra-ring connection to exchange each node's information, which is used to detect failures.
  • The gossiper runs on a timer, once every second, to check this.

There are a few steps in the gossiper process, this is much like the TCP 3 way handshake

  • A node chooses another random node in the cluster. It starts a gossip session.
  • The gossip initiator sends its chosen friend a SYN message.
  • The friend receives this (or doesn't if it's dead) and replies with an ACK message to the initiator.
  • The initiator receives this and sends back an ACK2.

If this does not complete, the initiator marks the node as dead and records this information.

This is not always cut and dried; the protocol keeps a suspicion level for each node and understands that network latency could be high at times.
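
A toy Python simulation of the SYN / ACK / ACK2 exchange described above (peer names and liveness flags are made up; real gossip also exchanges digests of cluster state, which this skips):

import random

# Toy gossip round: the initiator picks a random peer, sends SYN, expects
# ACK, and finishes with ACK2. A peer that never answers gets marked dead.
peers = {"node2": True, "node3": True, "node4": False}   # False = dead

def gossip_round(initiator="node1"):
    friend = random.choice(list(peers))
    print(initiator, "-> SYN ->", friend)
    if not peers[friend]:                 # no ACK ever comes back
        print(friend, "did not respond, marking it as suspect/dead")
        return
    print(friend, "-> ACK ->", initiator)
    print(initiator, "-> ACK2 ->", friend)

gossip_round()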

Read Repair

  • When a client reads from a node in the cluster, and the read repair is enabled, it will read against multiple nodes to make sure that the results are consistent. If the results are consistent, the read operation is returned to the client.
  • Depending on the consistency level (set by the client), if the results are NOT the same on multiple nodes, a read repair will happen either before or after the results are returned to the client. The sketch below shows the idea.
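
A minimal Python sketch of the read repair idea (replica names, values, and timestamps are invented): ask several replicas, return the value with the newest timestamp, and overwrite any stale copy.

# Each replica stores (value, timestamp) for a column; read repair compares
# the answers and overwrites any stale copy with the newest one.
replicas = {
    "node1": ("mike", 100),
    "node2": ("mike", 100),
    "node3": ("mikey", 90),    # stale
}

def read_with_repair(column_replicas):
    newest = max(column_replicas.values(), key=lambda vt: vt[1])
    for node, value in column_replicas.items():
        if value != newest:
            column_replicas[node] = newest    # repair the stale replica
    return newest[0]

print(read_with_repair(replicas))   # 'mike', and node3 is repaired in place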

Writes

  • When a write operation is performed, it is immediately written to the commit log, which is stored on disk and is another mechanism which helps prevent crashes, or lost data.
  • A write does not count, and is considered to have failed, if it does not reach the commit log.
  • Once the write has gone to the commit log, it gets sent to the memtable, which is stored in RAM.
  • Once the objects stored in the memtable reach a certain limit, they get flushed to disk in a file called an SSTable. This is a non-blocking operation and will not lock up reads or writes.
  • This process is repeated over and over.

There is only one commit log per server

The benefit of this process is that writes are done sequentially and are always in "append" mode. Cassandra never writes to the same data file more than once.

  • Reads always look at the memtable first, since that is where the most recent writes will be, which avoids hitting the disk when possible. The sketch below puts these steps together.
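
Here is a toy Python sketch of the write path (the size limit and data are made up): an append-only commit log, an in-memory memtable, a flush to an immutable SSTable once a limit is hit, and reads that check the memtable first.

# Toy write path: commit log (append only) -> memtable (RAM) -> SSTable flush.
commit_log = []          # append-only; survives a crash in real Cassandra
memtable = {}            # most recent writes, held in RAM
sstables = []            # immutable, sorted files on disk
MEMTABLE_LIMIT = 3

def write(key, columns):
    commit_log.append((key, columns))        # the write fails if this fails
    memtable[key] = columns
    if len(memtable) >= MEMTABLE_LIMIT:      # flush threshold reached
        sstables.append(dict(sorted(memtable.items())))
        memtable.clear()

def read(key):
    if key in memtable:                      # newest data is in RAM
        return memtable[key]
    for sstable in reversed(sstables):       # then newest SSTable first
        if key in sstable:
            return sstable[key]
    return None

for i in range(4):
    write("user%d" % i, {"n": i})
print(read("user3"), len(sstables))   # {'n': 3} found in memtable, 1 flush so far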

Hinted Handoff

  • Writes that are aimed for a node in the cluster that happens to be dead are handled using the hinted handoff method.
  • The write goes to another node, with a note on it saying that "this write is not meant for you, but the other node is dead, can you hang on to this write until the other node is up, then pass this over to them so they can handle this?".
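
A toy Python sketch of hinted handoff (node names and the hint format are invented): a write aimed at a dead node is parked on a live node as a hint and replayed once the target comes back.

# Toy hinted handoff: a write aimed at a dead node is parked as a "hint"
# on a live node and replayed once the target comes back up.
hints = []   # (target_node, key, columns) held by a live node

def write_to(node, alive, key, columns, storage):
    if alive:
        storage[key] = columns
    else:
        hints.append((node, key, columns))   # hang on to it for later

def replay_hints(node, storage):
    for target, key, columns in [h for h in hints if h[0] == node]:
        storage[key] = columns
        hints.remove((target, key, columns))

node2_data = {}
write_to("node2", False, "mjung", {"fname": "mike"}, node2_data)
print(node2_data, hints)          # {} -- the write is parked as a hint
replay_hints("node2", node2_data)
print(node2_data, hints)          # hint delivered, hint list is empty again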

Compaction

  • This process merges SSTables together. This helps to save space by compacting these files together. This also reduces the amount of seeks the disk has to do since most of the "related" data should be in the same file on disk. Compaction is the process that handles this in the background.
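
A toy Python sketch of compaction (the SSTable contents and timestamps are invented): several SSTables are merged into one, keeping only the newest value for each key, so future reads touch fewer files.

# Toy compaction: merge several SSTables into one, keeping only the newest
# (value, timestamp) for each key.
sstable_a = {"mjung": ("mike", 100), "bob": ("robert", 90)}
sstable_b = {"bob": ("bobby", 120), "alice": ("alice", 110)}

def compact(*sstables):
    merged = {}
    for table in sstables:
        for key, (value, ts) in table.items():
            if key not in merged or ts > merged[key][1]:
                merged[key] = (value, ts)
    return merged

print(compact(sstable_a, sstable_b))
# {'mjung': ('mike', 100), 'bob': ('bobby', 120), 'alice': ('alice', 110)}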

Bloom Filters

  • Bloom filters improve performance by mapping the values in a data set into a compact digest string. This reduces the size of the values, which means they can be stored in RAM instead of on disk, reducing disk access and making things much faster.
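
A tiny Bloom filter sketch in Python (the sizes and hashing scheme are chosen arbitrarily for illustration): a few hash functions set bits in a small bit array, a lookup that finds any bit unset means the key was definitely never added, and the whole structure is small enough to keep in RAM.

import hashlib

# Tiny Bloom filter: K hash positions are set in an M-bit array. A lookup
# that finds any bit unset means "definitely not stored"; all bits set means
# "probably stored", so most needless disk reads can be skipped.
M, K = 256, 3
bits = [False] * M

def positions(key):
    digest = hashlib.sha256(key.encode()).digest()
    return [digest[i] % M for i in range(K)]

def add(key):
    for p in positions(key):
        bits[p] = True

def might_contain(key):
    return all(bits[p] for p in positions(key))

add("mjung")
print(might_contain("mjung"))    # True  (probably present)
print(might_contain("nobody"))   # almost certainly False (definitely absent)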

Tombstones

  • This is a "soft delete" that is built into Cassandra and happens automatically. When a record is deleted, Cassandra marks it with a tombstone instead of removing it outright. The record is not accessible, but Cassandra keeps the marker for 10 days in case the delete was headed for a node that is down. Once the node is back online, the delete gets sent to it and the write is then completed.
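
A toy Python sketch of the tombstone idea (the column layout and grace-period handling are simplified): a delete writes a marker instead of removing the data, reads hide the tombstoned column, and the data is only purged after the grace period.

import time

# Toy tombstone: a delete writes a marker instead of removing the data, and
# the column is only purged after a grace period (10 days in the text above).
GRACE_SECONDS = 10 * 24 * 60 * 60
row = {"email": ("[email protected]", 100, False)}   # (value, timestamp, is_tombstone)

def delete(column):
    value, _, _ = row[column]
    row[column] = (value, time.time(), True)      # mark, don't remove

def get(column):
    value, ts, dead = row[column]
    return None if dead else value                # tombstoned data is hidden

def purge():
    for column, (value, ts, dead) in list(row.items()):
        if dead and time.time() - ts > GRACE_SECONDS:
            del row[column]                       # finally removed for real

delete("email")
print(get("email"))   # None -- hidden immediately, purged only after 10 days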