Cassandra, a project of the Apache Software Foundation, is a distributed
database that can store large amounts of data thanks to its horizontal
scalability. This term refers to the ability, offered by the Cassandra
architecture, to add new machines, which are called nodes. The machines
used are typically commodity hardware, that is to say, machines that
offer the best quality/price ratio.
Cassandra is inspired by a famous Amazon paper that came out in 2007
about a system called Dynamo. It's reliable, it's performant, and it's
always on. Cassandra has a node-based architecture, and this plays an
important part in that always-on aspect as well as in performance. In a
Cassandra cluster, the lowest level is a node, and a node represents a
single Cassandra instance. A cluster is a peer-to-peer set of nodes, and
there is no single point of failure.
Collectively, both of these data centers, with their racks and nodes,
make up the Cassandra cluster. In this node-based architecture, at the
lowest level, there is a node, and on the node there is a partition.
A partition is a unit of data, and it's the lowest level at which data
is placed on a node. Then you have a rack, which is a set of nodes, and
then you have a data center, which is a set of racks.
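To make the hierarchy concrete, here is a minimal Python sketch of nodes
grouped into racks and racks grouped into data centers. The class names and
the example topology (dc1, rack1, n1, and so on) are made up for
illustration and are not Cassandra APIs. Grouping racks under a data center
is an assumption of this sketch.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One Cassandra instance; holds the partitions placed on it."""
    name: str
    partitions: dict = field(default_factory=dict)  # partition key -> data


@dataclass
class Rack:
    """A set of nodes (hypothetical model, not a Cassandra class)."""
    name: str
    nodes: list = field(default_factory=list)


@dataclass
class DataCenter:
    """A set of racks (assumption of this sketch)."""
    name: str
    racks: list = field(default_factory=list)


@dataclass
class Cluster:
    data_centers: list = field(default_factory=list)

    def all_nodes(self):
        """Flatten the hierarchy: every node is a peer, whatever its rack."""
        return [node
                for dc in self.data_centers
                for rack in dc.racks
                for node in rack.nodes]


# Illustrative topology: two data centers, three racks, four nodes.
cluster = Cluster([
    DataCenter("dc1", [Rack("rack1", [Node("n1"), Node("n2")]),
                       Rack("rack2", [Node("n3")])]),
    DataCenter("dc2", [Rack("rack1", [Node("n4")])]),
])
node_names = [node.name for node in cluster.all_nodes()]
```

Note that nothing in the model marks one node as special: the rack and data
center only describe where a node sits, not what role it plays.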
Cassandra is also a shared-nothing environment, meaning that there is
no central controller. As you've seen in the diagrams so far, there is no
master and there is no slave. All the nodes are independent and all the
nodes are the same.
So this is a very important distinction compared with other
distributed systems you'll look at. Cassandra has no master and no slave,
so you can read from or write to any node. This makes life a lot simpler
when you think about administering a cluster.
You can bring up a node, you can take down a node; they are all
the same, they are all independent. They communicate, as we'll learn in
future segments, but there is no central controller, no master and no
slave. It's also a fully replicated environment, so Cassandra provides
data redundancy and failover. As we said, there is no master or slave;
the nodes are all independent, but they communicate, and data is stored
redundantly across the cluster. And this is something you can control.
There's a replication factor that you can configure to specify the
number of copies of the data across the cluster. All copies, just like all
nodes, are equal; there is no primary. So you don't have a primary version
of the data and a backup. You have data, and it can be read or written on
any node. There are preferred nodes for data, but that doesn't change where
you can write or read your data.
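As a rough illustration of what a replication factor means, the Python
sketch below picks the nodes that would hold the copies of one piece of
data. The function name, the node names, and the rule "start at the
preferred node and take the next ones in order" are assumptions made for
this example; real placement depends on the replication strategy configured
for the cluster.

```python
def replica_nodes(nodes, preferred_index, replication_factor):
    """Return the nodes holding copies: the preferred node, then the next
    ones in list order, wrapping around at the end (simplified rule)."""
    if not 1 <= replication_factor <= len(nodes):
        raise ValueError("replication factor must be between 1 and the node count")
    return [nodes[(preferred_index + i) % len(nodes)]
            for i in range(replication_factor)]


nodes = ["n1", "n2", "n3", "n4", "n5", "n6"]
copies = replica_nodes(nodes, preferred_index=4, replication_factor=3)
print(copies)  # ['n5', 'n6', 'n1']
```

All three copies are equal in this model too: the function returns a plain
list, and nothing marks the first entry as a primary.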
In Cassandra, data is transparently partitioned across nodes;
there's nothing that you need to do. Data is sent to a node, so you are
writing to a node. The partition key of the data is hashed, and the data
is then sent to a partition based upon the hash.
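The hash-then-place step can be sketched in a few lines of Python. This is
a simplified model, not Cassandra's implementation: MD5 is used here only
as a convenient stand-in hash, node tokens are derived from made-up node
names, and the `Ring` class and `token_for` function are hypothetical names
for this example.

```python
import bisect
import hashlib


def token_for(key: str) -> int:
    """Hash a key to a 64-bit token (MD5 is a stand-in for illustration)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class Ring:
    """Nodes placed on a token ring; each owns the range ending at its token."""

    def __init__(self, node_names):
        self.entries = sorted((token_for(name), name) for name in node_names)
        self.tokens = [token for token, _ in self.entries]

    def owner(self, key: str) -> str:
        """Find the first node whose token is >= the key's token, wrapping around."""
        index = bisect.bisect_left(self.tokens, token_for(key))
        return self.entries[index % len(self.entries)][1]


ring = Ring(["n1", "n2", "n3", "n4"])
owner = ring.owner("user:42")  # the same key always maps to the same node
```

The caller never chooses a node: the key alone decides where the data
lands, which is what makes the partitioning transparent.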
There are two general partitioning strategies with Cassandra: one is
random partitioning and one is ordered partitioning. In the previous
slide, remember that data is hashed and sent. That is what happens with
random partitioning; with ordered partitioning, it's not.
There is a big difference between random and ordered partitioning. With
random partitioning, when you write a piece of data to Cassandra, it's
going to hash it and distribute it across the cluster accordingly.
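A small Python experiment can illustrate the difference. It is a sketch
under stated assumptions: four nodes, integer keys in a made-up key space of
one million values, MD5 as a stand-in hash, and a workload of 1000
consecutive keys (such as freshly generated sequential IDs). None of the
names below come from Cassandra.

```python
import hashlib
from collections import Counter

NODE_COUNT = 4
KEY_SPACE = 1_000_000  # hypothetical range of possible integer keys


def ordered_node(key: int) -> int:
    """Ordered placement: each node owns a contiguous slice of the key space."""
    return key * NODE_COUNT // KEY_SPACE


def hashed_node(key: int) -> int:
    """Random placement: hash the key, then split the token space evenly."""
    digest = hashlib.md5(str(key).encode("utf-8")).digest()
    token = int.from_bytes(digest[:8], "big")  # 0 <= token < 2**64
    return token * NODE_COUNT // 2**64


recent_keys = range(1000)  # consecutive keys, e.g. sequential IDs
ordered_load = Counter(ordered_node(k) for k in recent_keys)
hashed_load = Counter(hashed_node(k) for k in recent_keys)

print("ordered:", dict(ordered_load))  # every key lands on node 0
print("hashed: ", dict(hashed_load))   # keys spread over all four nodes
```

With ordered placement the whole burst of writes hits a single node, while
hashing spreads the same keys over every node in the cluster.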
The idea is that you get a nice distribution of data across all
these nodes, which, again, make up the Cassandra cluster. This random
partitioning, spreading data across the cluster, helps you get higher
throughput as you increase the number of nodes, because reading and
writing are very distributed. Ordered partitioning, on the other hand, is not
There are two ways to do it, and two partitioners that you'll find
in Cassandra for it: a byte-ordered partitioner and an order-preserving
partitioner. Neither of them is a recommended approach. The reason is
that if you're writing data in an ordered fashion, then as the writes come
in, Cassandra puts each one on a node.
What's going to happen is that you're trying to preserve that order
across the cluster, and inevitably you're going to end up with hot spots.
You're not going to be able to get the data distributed evenly across the
cluster. Say you have 24 nodes in your cluster: maybe 8 of them have the
majority of the data, and that's not a desirable outcome when you look at
the throughput of reading and writing. You set up partitioning in
cassandra.yaml; that's where we have the partitioner option. A beautiful
thing is that there are no other mechanics.
With Cassandra, there is no sharding to set up and, usually, no other
work you have to do. Because all the nodes are equal, you bring up a node
and you're ready to go. You bring up multiple nodes, they find each other
based on the configuration, and you're ready to go. It's very simple, and
the partitioning, again, is controllable via the YAML file.
The data on a node (a node is an instance of Cassandra) is
automatically replicated to other nodes (different machines). If a node is
down, its data remains available through other nodes. The replication
factor is the number of nodes on which the data is replicated. Furthermore,
the Cassandra architecture defines a cluster as a group of at least two
nodes, and a data center as a remotely located cluster.
Cassandra ensures replication across different data centers. Nodes
that have failed can be replaced without service unavailability.
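In practice, replication across data centers is declared per keyspace in
CQL, using the NetworkTopologyStrategy replication class. The Python helper
below only builds such a statement as a string; the helper name, the
keyspace name `shop`, and the data center names `dc_east` and `dc_west` are
made up for this example, and the statement would still have to be run
through a CQL client.

```python
def create_keyspace_cql(keyspace: str, replicas_per_dc: dict) -> str:
    """Build a CREATE KEYSPACE statement with one replica count per data center."""
    for dc, count in replicas_per_dc.items():
        if not isinstance(count, int) or count < 1:
            raise ValueError(f"replica count for {dc!r} must be a positive integer")
    options = ["'class': 'NetworkTopologyStrategy'"]
    options += [f"'{dc}': {count}" for dc, count in replicas_per_dc.items()]
    body = ", ".join(options)
    return ("CREATE KEYSPACE IF NOT EXISTS " + keyspace
            + " WITH replication = {" + body + "};")


# Hypothetical layout: three copies in one data center, two in the other.
statement = create_keyspace_cql("shop", {"dc_east": 3, "dc_west": 2})
print(statement)
```

Each data center gets its own replica count, so losing a node, or even a
whole data center, still leaves copies of the data elsewhere.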
In a cluster, all nodes are equal. There is no concept of master or
slave, no process in charge of management, and not even a bottleneck at the
Rich data model:
The data model proposed by Cassandra, based on the concept of
key/value, supports many use cases in the Web world.
Scalability is linear: write and read throughput increases
linearly when a new server is added to the cluster. Furthermore, Cassandra
ensures that there will be no system downtime or interruption to the
You can specify the level of consistency for reads and for writes.
This is called tunable consistency. Apache Cassandra has no transactions
in the relational sense.
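One common way to reason about tunable consistency is the overlap rule: if
the number of replicas that acknowledge a write plus the number consulted
on a read exceeds the replication factor, every read touches at least one
replica that has the latest write. The Python sketch below just encodes
that arithmetic; the function names are made up for this example, and the
rule is a general rule of thumb rather than something stated in this text.

```python
def quorum(replication_factor: int) -> int:
    """Smallest majority of the replicas."""
    return replication_factor // 2 + 1


def read_sees_latest_write(replication_factor: int,
                           write_replicas: int,
                           read_replicas: int) -> bool:
    """True when the read set and the write set must share a replica."""
    return read_replicas + write_replicas > replication_factor


rf = 3
strong = read_sees_latest_write(rf, quorum(rf), quorum(rf))  # 2 + 2 > 3
weak = read_sees_latest_write(rf, 1, 1)                      # 1 + 1 <= 3
print(quorum(rf), strong, weak)  # 2 True False
```

Choosing fewer replicas per operation trades that guarantee for lower
latency, which is the tuning the consistency level exposes.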
Writing data is very fast compared to the world of relational
databases.
Cassandra is fully replicated. There is no master, no slave. It's
always on, it's performant, and these are some of the features and
characteristics of Cassandra that make it a fantastic solution to the big
data problem.