
Overview of Apache Cassandra

By Nezha EL GOURII. Published on 21/10/2016 at 15:30:42.
Approved by the review committee

Introduction

Cassandra, a project of the Apache Software Foundation, is a distributed database that can store large amounts of data thanks to its horizontal scalability.

Horizontal scalability refers to the ability, offered by Cassandra's architecture, to add new machines, called nodes, to a cluster. These machines are typically commodity hardware, that is to say, machines that offer the best quality/price ratio.

Cassandra is inspired by a famous Amazon paper, published in 2007, about a system called Dynamo. It is reliable, it is performant, and it is always on.

Concept of Cassandra

Cassandra has a node-based architecture, and this plays an important part in both its always-on aspect and its performance. In a Cassandra cluster, the lowest level is a node, and a node represents a single Cassandra instance.

A cluster is a peer-to-peer set of nodes, and there is no single point of failure.

Collectively, the data centers, the racks, and the nodes make up the Cassandra cluster. In this node-based architecture, the lowest level is a node, and data is stored in partitions on that node.

A partition is a unit of data, and it is the lowest level at which data is placed on a node. Above the node there is the rack, which is a set of nodes, and above the rack there is the data center, which is a set of racks.
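
As a rough illustration of this hierarchy, the sketch below models a cluster as data centers that contain racks, which in turn contain nodes. The class names and the sample topology (`dc1`, `rack1`, `n1`, and so on) are made up for the example; they are not Cassandra APIs.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    name: str  # one node stands for one Cassandra instance


@dataclass
class Rack:
    name: str
    nodes: List[Node] = field(default_factory=list)


@dataclass
class DataCenter:
    name: str
    racks: List[Rack] = field(default_factory=list)


@dataclass
class Cluster:
    name: str
    data_centers: List[DataCenter] = field(default_factory=list)

    def all_nodes(self) -> List[Node]:
        # Walk the hierarchy: data center -> rack -> node.
        return [n for dc in self.data_centers for r in dc.racks for n in r.nodes]


# Hypothetical topology: two data centers, three racks, five nodes.
cluster = Cluster("demo", [
    DataCenter("dc1", [
        Rack("rack1", [Node("n1"), Node("n2")]),
        Rack("rack2", [Node("n3")]),
    ]),
    DataCenter("dc2", [
        Rack("rack1", [Node("n4"), Node("n5")]),
    ]),
])

print([n.name for n in cluster.all_nodes()])
```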

Cassandra is also a shared-nothing environment, which means that there is no central controller. There is no master and there is no slave. All the nodes are independent and all the nodes are the same.

This is a very important distinction from other distributed systems you may look at. Because Cassandra has no master and no slave, you can read from or write to any node. This makes life a lot simpler when it comes to administering a cluster.

You can bring up a node or take down a node; they are all the same and all independent. The nodes communicate with each other, but there is no central controller, no master and no slave. Cassandra is also a fully replicated environment, so it provides data redundancy and failover. Although the nodes are independent, they communicate, and data is stored redundantly across the cluster. This redundancy is something you can control.

There is a replication factor that you can configure to specify the number of copies of the data across the cluster. All copies, just like all nodes, are equal: there is no primary. So you do not have a primary version of the data and a backup. You have data, and it can be read or written on any node.

There are preferred nodes for a given piece of data, but this does not change where you can write or read your data.
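
To make the replication factor concrete, here is a minimal sketch that assumes the simplest placement rule: nodes are arranged in a ring, and the copies of a piece of data go to its preferred node and then to the next nodes clockwise. The function name `replicas_for` and the node names are hypothetical, and a real cluster may use a rack-aware or data-center-aware placement instead.

```python
from typing import List


def replicas_for(ring_nodes: List[str], preferred_index: int,
                 replication_factor: int) -> List[str]:
    """Return the nodes that hold a copy, starting at the preferred node."""
    if not 1 <= replication_factor <= len(ring_nodes):
        raise ValueError("replication factor must be between 1 and the number of nodes")
    # Walk clockwise around the ring, wrapping at the end of the list.
    return [ring_nodes[(preferred_index + i) % len(ring_nodes)]
            for i in range(replication_factor)]


ring = ["n1", "n2", "n3", "n4"]
print(replicas_for(ring, preferred_index=3, replication_factor=3))  # ['n4', 'n1', 'n2']
```

With a replication factor of 3 on this four-node ring, any single node can go down and every piece of data still has at least two live copies.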

The Partitioning Process

In Cassandra, data is transparently partitioned across nodes; there is nothing that you need to do. When you write, the data is sent to a node. Its key is hashed, and the data is then sent to a partition based upon the hash.

There are two general partitioning strategies in Cassandra: random partitioning and ordered partitioning. The hashing just described is what random partitioning does; ordered partitioning does not hash the data.

There is a big difference between the two. With random partitioning, when you write a piece of data to Cassandra, it hashes it and distributes it across the cluster accordingly.
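
The hash-then-place idea can be sketched as a small token ring. This is an illustration only: MD5 is used as the hash, the token space is taken to be the 128-bit digest range, and each node is given one evenly spaced token. These are simplifying assumptions of the sketch, and the names `token_for`, `build_ring`, and `owner` are hypothetical.

```python
import bisect
import hashlib
from typing import List, Tuple

RING_SIZE = 2 ** 128  # assumption: MD5 digests read as 128-bit integers


def token_for(key: str) -> int:
    """Hash a key to a position (token) on the ring."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest(), "big")


def build_ring(node_names: List[str]) -> List[Tuple[int, str]]:
    """Give each node one evenly spaced token; the list is sorted by token."""
    step = RING_SIZE // len(node_names)
    return [((i + 1) * step - 1, name) for i, name in enumerate(node_names)]


def owner(ring: List[Tuple[int, str]], key: str) -> str:
    """Return the first node whose token is >= the key's token, wrapping around."""
    tokens = [t for t, _ in ring]
    idx = bisect.bisect_left(tokens, token_for(key)) % len(ring)
    return ring[idx][1]


ring = build_ring(["n1", "n2", "n3", "n4"])
for key in ("user:1", "user:2", "user:3"):
    print(key, "->", owner(ring, key))
```

Because the token depends only on the key, any node can compute where a given piece of data belongs without asking a central controller.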

The idea is that you get an even distribution of data across all the nodes of the Cassandra cluster. Spreading data across the cluster in this way helps you get higher throughput as you increase the number of nodes, because reads and writes are highly distributed. Ordered partitioning, on the other hand, is not recommended.

Cassandra provides two partitioners for ordered partitioning: the byte-ordered partitioner and the order-preserving partitioner. Neither of them is a recommended approach. The reason is that if you are writing data in an ordered fashion, then as the writes come in, Cassandra places them on nodes according to that order.

What happens is that, as you try to preserve that order across the cluster, you inevitably end up with hotspots.

You will not get the data distributed evenly across the cluster. If you have, say, 24 nodes in your cluster, maybe 8 of them will hold the majority of the data, which is not the desired outcome for read and write throughput. You set up partitioning in cassandra.yaml, which is where the partitioner option lives. A beautiful thing is that there are no other mechanics.
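
The hotspot effect of ordered partitioning can be illustrated with a small simulation. It compares a deliberately naive ordered scheme (each node owns an equal slice of the first-byte range) with a hashed scheme, using timestamp-like keys. Both placement functions and the four node names are hypothetical simplifications, not Cassandra's actual partitioners.

```python
import hashlib
from collections import Counter

NODES = ["n1", "n2", "n3", "n4"]


def ordered_owner(key: str) -> str:
    # Naive ordered placement: split the 0-255 range of the first byte
    # into equal slices, one slice per node.
    first_byte = key.encode("utf-8")[0]
    return NODES[first_byte * len(NODES) // 256]


def hashed_owner(key: str) -> str:
    # Hashed placement: split the 128-bit MD5 range into equal slices.
    token = int.from_bytes(hashlib.md5(key.encode("utf-8")).digest(), "big")
    return NODES[(token * len(NODES)) >> 128]


# 3600 keys written in order, one per second of a single hour.
keys = [f"2016-10-21T15:{m:02d}:{s:02d}" for m in range(60) for s in range(60)]

ordered = Counter(ordered_owner(k) for k in keys)
hashed = Counter(hashed_owner(k) for k in keys)

print("ordered:", dict(ordered))  # every key starts with "2", so all 3600 land on n1
print("hashed: ", dict(hashed))   # spread across all four nodes
```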

With Cassandra, there is no sharding and usually no other work you have to do. Because all the nodes are equal, you bring up a node and you are ready to go. You bring up multiple nodes, they find each other based on the configuration, and you are ready to go. It is very simple, and the partitioning, again, is controllable via the YAML file.

Advantages

Fault tolerance:

The data of a node (a node is an instance of Cassandra) is automatically replicated to other nodes (different machines). If a node is down, its data remains available through the other nodes. The replication factor is the number of nodes on which the data is replicated. Furthermore, Cassandra's architecture defines a cluster as a group of at least two nodes, and data centers as clusters located in different places.

Cassandra ensures replication across different data centers. Nodes that have failed can be replaced without service unavailability.

Decentralized:

In a cluster, all nodes are equal. There is no concept of master or slave, no process in charge of overall management, and no bottleneck at the network level.

Rich data model:

The data model proposed by Cassandra, based on the concept of key/value, supports many use cases in the Web world.

Elastic:

Scalability is linear: write and read throughput increases linearly when a new server is added to the cluster. Furthermore, Cassandra ensures that there is no system downtime or interruption at the application level.

High availability:

You can specify the level of consistency for reads and writes. This is called tunable consistency. Apache Cassandra has no transactions. Writes are very fast compared to the world of relational databases.
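
A common way to reason about tunable consistency is the overlap rule: with a replication factor RF, a read that contacts R replicas is guaranteed to reach at least one replica that acknowledged a write to W replicas when R + W > RF. The helper names below are hypothetical; this is a sketch of the arithmetic, not a driver API.

```python
def quorum(replication_factor: int) -> int:
    """Smallest number of replicas that forms a majority."""
    return replication_factor // 2 + 1


def read_overlaps_write(replication_factor: int, write_replicas: int,
                        read_replicas: int) -> bool:
    """True when every read set must share at least one replica with every write set."""
    return read_replicas + write_replicas > replication_factor


rf = 3
print(quorum(rf))                                  # 2
print(read_overlaps_write(rf, quorum(rf), quorum(rf)))  # True: quorum writes and quorum reads
print(read_overlaps_write(rf, 1, 1))               # False: one replica each may miss the write
```

Lower values of R and W favor availability and latency, while higher values favor consistency, which is the trade-off the consistency level lets you choose per operation.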

Conclusion

Cassandra is fully replicated. There is no master and no slave. It is always on, it is performant, and these are some of the features and characteristics that make Cassandra a fantastic solution to the big data challenge.
