Introduction to Apache Cassandra ( Part 1 )

Disclaimer : This is not a well-formatted blog/article; it's just the running notes I took during the session Introduction to Cassandra.

Link : Intro to Cassandra

Part 1

1) What is Cassandra?
-> NoSQL database ("Not only SQL", i.e. a non-relational database).
-> Distributed database (Cassandra runs on multiple servers). A Cassandra server is called a node, and multiple nodes together are called a cluster.
-> Decentralised database (there is no single master/head node, i.e. every node has the same responsibilities).
-> Every node is able to communicate with every other node.

2) To understand how Cassandra scales, let us first look at how databases scale in general.

What is scaling? : Database scalability is the ability of a database to handle changing demands by adding/removing resources.

-> Databases scale in 2 ways :
-> Vertical scaling : Replacing the server with a more powerful one (in terms of computing power) to handle increasing load on the application.
-> Horizontal scaling : Adding more servers, typically with the same computing power as the ones that already exist, and spreading the load across them.

-> Cassandra scales horizontally.
-> Cassandra scales linearly, meaning the cluster performance increases roughly in proportion to the number of nodes (Cassandra servers).


3) Architecture of Cassandra

-> Data is distributed in Cassandra, meaning no single server holds all the data; the data is partitioned across multiple Cassandra nodes.
Data distribution is what makes scaling out possible, since each new node takes over part of the data.

-> Data is replicated. Every piece of data is stored as many times as the replication factor configured for the current data centre (in Cassandra, the replication factor is set per keyspace). Ex : if the replication factor is 3, every piece of data is stored on 3 different nodes.

-> One downside of this replication is that it requires more disk space.

-> The replication factor (RF) depends on the type of data we are storing. Usually, if the data is very important, like financial data, we increase the RF to be on the safer side.
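
A minimal Python sketch of the two ideas above (partitioning and replication). It is a simplification, not how Cassandra actually places data: real Cassandra uses a token ring with its own partitioner, while the hash function, the modulo placement, the function names and the node names here are all assumptions made up for illustration.

```python
import hashlib


def token(partition_key):
    """Hash a key to an integer (illustrative hash, NOT Cassandra's partitioner)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def replicas_for(partition_key, nodes, replication_factor):
    """Return the nodes that would store this key in the simplified model.

    The key's hash picks one owning node; the remaining RF - 1 copies go to
    the next nodes in the list, wrapping around like a ring.
    """
    if not 1 <= replication_factor <= len(nodes):
        raise ValueError("replication factor must be between 1 and the node count")
    start = token(partition_key) % len(nodes)
    return [nodes[(start + i) % len(nodes)] for i in range(replication_factor)]


# Hypothetical 4-node cluster with RF = 3.
nodes = ["node-a", "node-b", "node-c", "node-d"]
for key in ["user:1", "user:2", "user:3"]:
    print(key, "->", replicas_for(key, nodes, replication_factor=3))
```

Each key lands on 3 of the 4 nodes, so no node holds everything (distribution), while every key survives the loss of up to 2 nodes (replication) at the cost of storing it 3 times.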


4) CAP Theorem
It states that : in a distributed system, when a failure (a network partition) occurs, at most two of the three qualities (Consistency, Availability, Partition Tolerance) can be guaranteed at the same time.

Explanation of the 3 qualities :

-> Availability : Getting an answer when we ask the database a question, i.e. the database is available whenever we need it. Because data is replicated, Cassandra is highly available: if one replica is down, another one can still answer.

-> Consistency : Do we always get the most up-to-date data? Since the same data is stored on several replicas in Cassandra, we can build a mechanism to check that the copy we are getting is the latest version.

-> Partition Tolerance : Ability of a distributed system to keep working when some of its servers cannot communicate with each other. For example, say we have one server in India, one in the USA and one in Europe,
and there is a network outage in the USA, so the other servers are not able to communicate with the server in the USA. Such a condition is called a NETWORK PARTITION.


5) Is Cassandra AP (Available and Partition tolerant) or CP (Consistent and Partition tolerant)?
-> Cassandra is configurably consistent. We have different consistency levels.

Different consistency levels :
1) ONE
2) QUORUM
3) ALL, etc.

-> Consistency level ONE means that the data sent by the client is written to all the replica servers, but we wait for only one confirmation.
-> Consistency level QUORUM means that the data sent by the client is written to all the replica servers, but we wait for confirmation from a majority of them (the size of this majority depends on the replication factor).
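
To make the "majority" concrete: a quorum is floor(RF / 2) + 1 replicas. The small Python sketch below computes how many confirmations each of the three levels listed above waits for. The function names are hypothetical helpers written for these notes, not part of any Cassandra driver.

```python
def required_acks(consistency_level, replication_factor):
    """Number of replica confirmations to wait for at a given consistency level."""
    level = consistency_level.upper()
    if level == "ONE":
        return 1
    if level == "QUORUM":
        # Majority of the replicas: floor(RF / 2) + 1
        return replication_factor // 2 + 1
    if level == "ALL":
        return replication_factor
    raise ValueError("unsupported consistency level: " + consistency_level)


def write_succeeds(consistency_level, replication_factor, replicas_up):
    """True if enough replicas are reachable to confirm the write."""
    return replicas_up >= required_acks(consistency_level, replication_factor)


# With RF = 3 and one replica down (2 reachable):
for level in ["ONE", "QUORUM", "ALL"]:
    print(level, required_acks(level, 3), write_succeeds(level, 3, replicas_up=2))
```

With RF = 3, QUORUM waits for 2 confirmations, so a write still succeeds with one replica down, while ALL would fail. This is the trade-off behind "configurably consistent": lower levels favour availability, higher levels favour consistency.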

END OF PART 1