Cassandra Tutorial — Hadoop and NoSQL Part 2
March 09, 2018
Apache Cassandra is an open-source, distributed, and decentralized database that is ideal for handling huge amounts of structured data across multiple sites in multiple regions.
It provides high availability with no single point of failure, so it is continuously online and well suited to business applications that need high degrees of uptime.
Cassandra is also very easy to scale by nature. Just like Hadoop, when you scale Cassandra out because you need more space, you also get more compute resources, so query response times can stay low even as the data grows.
Cassandra offers some ACID-style guarantees. Remember, ACID stands for Atomicity, Consistency, Isolation, and Durability. These guarantees come from the relational database world; in Cassandra, writes are atomic, isolated, and durable at the row level, but it does not support full multi-row ACID transactions the way a relational database does.
In particular, Cassandra is eventually consistent, meaning that after a write, the data you wrote might not be immediately available for reading on another node; it takes time to replicate data across many nodes. A relational database, on the other hand, has much stricter write rules, which makes it more attractive for transactional flows in which immediate consistency is needed, i.e. the data needs to get back to the customer immediately. Cassandra also resembles Hadoop in that it is designed to run on commodity hardware.
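In practice, you can tune this trade-off yourself. For example, in cqlsh (which we set up later in this tutorial) you can change the consistency level for your session before running queries; QUORUM, which waits for a majority of replicas, is just one of several levels you could pick:
CONSISTENCY QUORUM;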
Cassandra’s Architecture
Cassandra is a peer-to-peer distributed system: data is distributed across all of the servers in the cluster, and every server has the same role. Each server is independent of the others yet interconnected with them at the same time. Every server in the cluster can accept reads and writes, regardless of which data is actually located on that server. This also means that if a server goes down, reads and writes can be requested from any of the other servers. All of this works because of the data replication that Cassandra offers: one or more of the servers act as replicas for a given piece of data. Cassandra ensures that correct data is returned to the client, or else it performs a read repair to bring all of the replicas up to date.
Cassandra Terminology
- Node — Server that has an installation of Cassandra that stores data
- Data Center — Collection of nodes that work together
- Cluster — One or more data centers
- CQL — Cassandra Query Language
- Cqlsh — the command-line shell for Cassandra where you enter CQL commands and queries (see the example right after this list)
- Keyspace — Collection of tables (i.e. database in the relational world)
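Once you have cqlsh open (we install it later in this tutorial), a couple of built-in commands tie these terms together. For example, you can list every keyspace in the cluster and the tables in the current keyspace:
DESCRIBE KEYSPACES;
DESCRIBE TABLES;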
How Cassandra Handles Writes
Remember that the client can reach out to any node to initiate a read or a write.
When a client wants to write, the activity is first recorded in the commit log on the node. The commit log is a crash-recovery mechanism: it is an append-only log of every write operation, so if a node goes down it can start up right where it left off by replaying the commit log. After being written to the commit log, the data is stored in the mem-table, an in-memory data structure that holds recent writes. Once the mem-table is full, its contents are written to an SSTable, a file on disk. The admin controls how large the mem-table can grow through configuration settings. When the mem-table is written to the SSTable, the data is flushed from the mem-table to make room for new data coming in from the commit logs.
How Cassandra Handles Reads
On a read, Cassandra gets values from the mem-tables and uses a bloom filter to find the SSTables that may contain the requested data. A bloom filter is a fast, probabilistic data structure that determines whether an element might be a member of a set, so Cassandra can skip SSTables that definitely do not hold the data and go straight to the correct one.
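You can actually watch this read path from cqlsh by turning tracing on before a query; the trace output (which varies by version and by how your data is laid out) shows which mem-tables and SSTables were consulted. Here we use the guests table that we create later in the example section:
TRACING ON;
SELECT * FROM guests WHERE guest_id = 1;
TRACING OFF;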
How Cassandra Handles Data
As we stated above, a data center is a group of nodes that work together to form an instance of Cassandra. The data on each node is replicated to other nodes, which can take over if that node goes down.
A keyspace is roughly the equivalent of a database. It contains the tables that hold the data, along with a few attributes that are very important.
The replication factor is the number of machines in the cluster that will hold copies of the same piece of data at any one time.
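You pick the replication factor when you create a keyspace. The simple example later in this tutorial uses SimpleStrategy, but with NetworkTopologyStrategy you can set a different factor per data center. Note that the keyspace and data center names below (lounge_multi_dc, dc1, dc2) are made up for illustration, and the data center names must match your cluster's configuration:
CREATE KEYSPACE lounge_multi_dc WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 3, 'dc2': 2};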
Cassandra also has column families. A column family contains rows, and each row holds ordered columns. Just like in HBase, these column families define the structure of the data.
There are also two kinds of columns: regular columns and super columns. A regular column is the most basic structure and consists of a key (the column name), a value, and a timestamp. A super column is still a key-value pair; the only difference is that its value can be a map of other columns.
Why is this even a thing?
Well, when storing data, Cassandra tries to keep a column family stored together in a file on disk. This is very helpful when querying: if you know two values will often be queried together, it makes sense to put them in the same file (i.e. the same column family). Super columns simply help ensure that related data is stored together, improving the performance of your queries.
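Super columns come from Cassandra's older, pre-CQL data model. If you are working in CQL, the closest everyday analogue is a collection column such as a map, which also groups related key-value pairs inside a single row. This is only a rough sketch for illustration, and the table and column names are made up:
CREATE TABLE guest_preferences (
    guest_id int PRIMARY KEY,
    -- the map behaves a bit like a super column: a named group of key/value pairs
    preferences map<text, text>
);
INSERT INTO guest_preferences (guest_id, preferences) VALUES (1, {'room': 'non-smoking', 'floor': 'high'});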
Cassandra Example
Let’s quickly install Cassandra into our Cloudera Quickstart image. We will be using yum to install the database.
The first thing we need to do is add the repo file to the yum repos directory (located at /etc/yum.repos.d/cassandra.repo).
Type the following to create and open the file:
vi /etc/yum.repos.d/cassandra.repo
Press i to enter insert mode so that you can edit the file, then copy in the following:
[cassandra]
name=Apache Cassandra
baseurl=https://www.apache.org/dist/cassandra/redhat/311x/
gpgcheck=1
repo_gpgcheck=1
gpgkey=https://www.apache.org/dist/cassandra/KEYS
Press the Esc key, then type :wq to save and exit the file. Now, type in the following:
yum install cassandra
service cassandra start
Great job, you just installed Cassandra. Now let’s install Python so you can have some fun using cqlsh.
If you run python -V and see a version of 2.7 or higher, you can skip these steps. If it is lower than 2.7, follow the steps below.
yum update ## optional
yum install scl-utils
yum install centos-release-scl-rh
yum install python27
scl enable python27 bash
export PYTHONPATH="/usr/lib/python2.7/site-packages/":$PYTHONPATH
pip install cqlsh
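If you want to make sure your local install is working, open cqlsh and run a quick sanity-check query against the built-in system.local table (the exact version string you see will depend on the version you installed):
SELECT release_version FROM system.local;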
After this you should be all set to start playing around with Cassandra. It’s a great tool to get familiar with. We’re going to use a Cassandra Docker image to show a couple of commands inside of Cassandra.
Let’s start by getting the docker containers up and running.
docker run --name cassandra -d cassandra
docker run -it --link cassandra:cassandra --rm cassandra cqlsh cassandra
Now you should have cqlsh running against the Cassandra Docker container. Let’s create a keyspace to put a table inside.
CREATE KEYSPACE commonlounge WITH replication = {'class':'SimpleStrategy', 'replication_factor' : 3};
We’ve created a keyspace named commonlounge with the SimpleStrategy replication strategy and a replication factor of 3. Let’s create a table and load some data into it.
use commonlounge;
CREATE TABLE guests(guest_id int PRIMARY KEY, guest_name text, guest_address text, guest_number text);
INSERT INTO guests (guest_id, guest_name, guest_address, guest_number) VALUES (1, 'Papa Smurf', '231 Maple Road', '5558298483');
INSERT INTO guests (guest_id, guest_name, guest_address, guest_number) VALUES (2, 'Snow White', '829 Oak Tree Lane', '55573892832');
INSERT INTO guests (guest_id, guest_name, guest_address, guest_number) VALUES (3, 'MJ', '7493 Champions Lane', '5554219854');
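If you want to double-check the schema you just created, cqlsh can print it back for you (the exact output formatting depends on your Cassandra version):
DESCRIBE TABLE guests;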
Now that we have some data inside, let’s run some queries.
select * from guests;
guest_id | guest_address | guest_name | guest_number
----------+---------------------+------------+--------------
1 | 231 Maple Road | Papa Smurf | 5558298483
2 | 829 Oak Tree Lane | Snow White | 55573892832
3 | 7493 Champions Lane | MJ | 5554219854
select * from guests where guest_name = 'MJ' ALLOW FILTERING;
guest_id | guest_address | guest_name | guest_number
----------+---------------------+------------+--------------
3 | 7493 Champions Lane | MJ | 5554219854
Notice that to use a WHERE clause on a field that isn’t part of the primary key, we have to add ALLOW FILTERING to the query.
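ALLOW FILTERING forces Cassandra to scan and filter data, which can get slow on large tables. If you expect to query by guest_name often, one alternative (a quick sketch, not required for this tutorial) is to add a secondary index on that column, after which the same WHERE clause works without ALLOW FILTERING:
CREATE INDEX ON guests (guest_name);
SELECT * FROM guests WHERE guest_name = 'MJ';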