APACHE CASSANDRA: ARCHITECTURE AND INSTALLATION.

This post has not been updated in 10 years.

Before taking a look at Apache Cassandra, we should understand the concept of a NoSQL database.

What is NoSQL?

A NoSQL ("Not Only SQL") database provides a mechanism for storing and retrieving data that is modeled by means other than the tabular relations used in relational databases. Motivations for this approach include simplicity of design, horizontal scaling, and finer control over availability. Because the data structure (e.g. key-value, graph, or document) differs from that of an RDBMS, some operations are faster in NoSQL and others are faster in an RDBMS. The suitability of a given NoSQL database therefore depends on the problem to be solved.

What are the differences between NoSQL and a traditional RDBMS?

Aspect	SQL	NoSQL
Known as	Relational database management system (RDBMS)	Not Only SQL, non-relational database, or distributed database
Schema	Predefined schema for storing structured data	Dynamic schema that adapts to the data elements being stored
Scalability	SQL databases typically scale vertically: to scale, we upgrade the hardware of the server on which the DBMS is installed	NoSQL databases scale horizontally: to scale, we add more nodes and build a distributed network sized to our needs and required capacity
Data retrieval	Data is defined and manipulated with SQL (Structured Query Language), which is very powerful	Queries focus on collections and documents; the query language is sometimes called UnQL (Unstructured Query Language) and varies from vendor to vendor
Classification	Two major types: open source databases and closed source databases	Classified by the way data is stored: key-value store, graph database, document store, column store, XML store
Best fit for	Relational data, or sometimes object-oriented data. Excellent for heavy-duty transactional applications, as it is more stable and mostly satisfies the ACID properties	Hierarchical and document-based data. Preferred for large data sets; not primarily meant for transaction-based applications, as its main focus is large document-based data sets

Why do we need a NoSQL database?

Interactive applications have changed dramatically over the last 15 years, and so have the data management needs of those apps. Today, three interrelated megatrends – Big Data, Big Users, and Cloud Computing – are driving the adoption of NoSQL technology.

Big users: Today, with the growth in global Internet use, the increased number of hours users spend online, and the growing popularity of smartphones and tablets, it’s not uncommon for apps to have millions of users a day. Supporting large numbers of concurrent users is important but, because app usage requirements are hard to predict, it’s just as important to dynamically support rapidly growing (or shrinking) numbers of concurrent users.
Big data: Developers want a very flexible database that easily accommodates new data types and isn’t disrupted by content structure changes from third-party data providers. Much of the new data is unstructured or semi-structured, so developers also need a database that can store it efficiently. Unfortunately, the rigidly defined, schema-based approach used by relational databases makes it hard to incorporate new types of data quickly, and is a poor fit for unstructured and semi-structured data. NoSQL provides a data model that maps better to these needs.
Cloud computing: Today, most new applications (both consumer and business) use a three-tier Internet architecture, run in a public or private cloud, and support large numbers of users. At the database tier, relational databases were originally the popular choice. Their use became increasingly problematic, however, because they are a centralized, share-everything technology that scales up rather than out. This made them a poor fit for applications that require easy and dynamic scalability. NoSQL databases have been built from the ground up to be distributed, scale-out technologies and therefore fit better with the highly distributed nature of the three-tier Internet architecture.

What is Apache Cassandra?

Apache Cassandra is a massively scalable open source NoSQL database. Cassandra is well suited to managing large amounts of structured, semi-structured, and unstructured data across multiple data centers and the cloud. Cassandra delivers continuous availability, linear scalability, and operational simplicity across many commodity servers with no single point of failure, along with a powerful dynamic data model designed for maximum flexibility and fast response times.

The benefits Cassandra delivers to your business

Elastic scalability: Allows you to add capacity to accommodate more customers and more data whenever you need.
Always-on architecture: Contains no single point of failure (unlike traditional master/slave RDBMSs and some other NoSQL solutions), resulting in continuous availability for business-critical applications that can’t afford to go down.
Fast linear-scale performance: Enables sub-second response times with linear scalability (double your throughput with two nodes, quadruple it with four, and so on) to deliver the response times your customers have come to expect.
Flexible data storage: Easily accommodates the full range of data formats – structured, semi-structured and unstructured – that run through today’s modern applications. Also dynamically accommodates changes to your data structures as your data needs evolve.
Easy data distribution: Gives you maximum flexibility to distribute data where you need by replicating data across multiple datacenters, the cloud and even mixed cloud/on-premise environments. Read and write to any node with all changes being automatically synchronized across a cluster.
Operational simplicity: With all nodes in a cluster being the same, there is no complex configuration to manage, so administration duties are greatly simplified.
Transaction support: Delivers the “AID” in ACID. Atomicity and isolation are provided for writes within a single partition, and durability comes from a commit log that captures all writes plus built-in redundancy that protects data in the event of hardware failures. Consistency is tunable.
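As a rough illustration of what "tunable" consistency means, the sketch below applies the usual replica-overlap rule of thumb: if the number of replicas that acknowledge a write (W) plus the number consulted on a read (R) exceeds the replication factor (RF), every read overlaps at least one replica that has the latest write. The function names are hypothetical and are not part of any Cassandra API; the rule itself is an assumption added here, not something the text above states.

```python
def quorum(replication_factor: int) -> int:
    """Smallest number of replicas that forms a majority."""
    return replication_factor // 2 + 1


def reads_overlap_writes(replication_factor: int, write_replicas: int, read_replicas: int) -> bool:
    """True when any read set must share a replica with any write set (R + W > RF)."""
    return read_replicas + write_replicas > replication_factor


rf = 3                       # same replication factor as the keyspace example later on
q = quorum(rf)
print(q)                                  # 2
print(reads_overlap_writes(rf, q, q))     # True: quorum writes + quorum reads
print(reads_overlap_writes(rf, 1, 1))     # False: fastest, but reads may be stale
```

Choosing smaller R and W trades consistency for latency and availability, which is the knob the benefit list refers to.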

Cassandra Architecture

“Masterless” structure: Cassandra employs a peer-to-peer distributed system in which all nodes are the same and data is distributed among all nodes in the cluster. Each node exchanges state information with other nodes in the cluster every second (via a gossip protocol).
Partition: Data is automatically and transparently distributed across all nodes that participate in a cluster.
Replication: Stores redundant copies of data across nodes that participate in a cluster. If any node in a cluster goes down, one or more copies of that node’s data are available on other machines in the cluster (no single point of failure).
Linear scalability: Capacity can be added simply by bringing new nodes online. For example, if 2 nodes can handle 100,000 transactions per second, 4 nodes will support 200,000 transactions/sec and 8 nodes will handle 400,000 transactions/sec.
Memtable and SSTable: Data is written to an in-memory structure, called a memtable, which resembles a write-back cache. Once the memory structure is full, the data is written to disk in an SSTable data file. All writes are automatically partitioned and replicated throughout the cluster.
Compaction: Cassandra periodically consolidates SSTables, discards tombstones (an indicator that a column was deleted), and regenerates the index in the SSTable.
cqlsh: Any authorized user can connect to any node in any data center in the cluster and access data using CQL (Cassandra Query Language), whose syntax is similar to SQL.
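To make the partition and replication items above more concrete, here is a minimal sketch of a token ring. It is a simplification under stated assumptions: each node gets a single token derived from an MD5 hash of its name (real clusters use a configurable partitioner and may assign many tokens per node), and replicas are simply the next nodes clockwise on the ring, in the spirit of SimpleStrategy. The `Ring` class and `token_for` function are hypothetical names used only for illustration.

```python
import bisect
import hashlib


def token_for(value: str) -> int:
    """Map a string to a ring position (MD5 is used here only for illustration)."""
    return int.from_bytes(hashlib.md5(value.encode("utf-8")).digest(), "big")


class Ring:
    """Toy token ring: one token per node, replicas chosen clockwise."""

    def __init__(self, nodes):
        if not nodes:
            raise ValueError("a ring needs at least one node")
        # Sort nodes by their position on the ring.
        self._entries = sorted((token_for(node), node) for node in nodes)
        self._tokens = [token for token, _ in self._entries]

    def replicas(self, partition_key: str, replication_factor: int):
        """Return the nodes holding copies of this key, walking clockwise from its token."""
        size = len(self._entries)
        count = min(replication_factor, size)
        # First node whose token is >= the key's token; wrap around past the end.
        start = bisect.bisect_left(self._tokens, token_for(partition_key)) % size
        return [self._entries[(start + i) % size][1] for i in range(count)]


ring = Ring(["node-a", "node-b", "node-c", "node-d"])
owners = ring.replicas("alice", replication_factor=3)
print(owners)                                      # three distinct nodes
print([n for n in owners if n != owners[0]])       # copies left if the first owner fails
```

Because every node can compute the same ring, any node can route a request for any key, which is what lets clients read and write through any node.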

Cassandra Key Structures

Cluster: A group of nodes where you store your data. A cluster contains one or more data centers.
Data center: A group of related nodes configured together within a cluster for replication and workload-segregation purposes; it is not necessarily a physical data center.
Commit log: All data is written first to the commit log for durability. Once all the data in a commit log segment has been flushed to SSTables, that segment can be archived, deleted, or recycled.
Table: A collection of ordered columns fetched by row. A row consists of columns and has a primary key. The first part of the primary key is the partition key, which determines which nodes store the row.
SSTable: A sorted string table (SSTable) is an immutable data file to which Cassandra writes memtables periodically. SSTables are append-only, stored on disk sequentially, and maintained per Cassandra table.
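The way these structures cooperate on a write can be sketched with a toy model. This is not how Cassandra is implemented; it only mirrors the sequence described above (commit log, then memtable, then flush to an immutable SSTable, then compaction). The `TinyTable` class, its `memtable_limit` parameter, and the choice to clear the whole log on flush are assumptions made for brevity. Real compaction also keeps tombstones for a grace period before discarding them, which this sketch skips.

```python
class TinyTable:
    """Toy model of one table's write path: commit log -> memtable -> SSTables."""

    TOMBSTONE = object()  # marker meaning "this key was deleted"

    def __init__(self, memtable_limit=2):
        self.memtable_limit = memtable_limit
        self.commit_log = []   # append-only record of writes not yet flushed
        self.memtable = {}     # in-memory view of the most recent writes
        self.sstables = []     # immutable sorted tuples, oldest first

    def write(self, key, value):
        self.commit_log.append((key, value))    # log first, for durability
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def delete(self, key):
        self.write(key, self.TOMBSTONE)          # a delete is a write of a tombstone

    def flush(self):
        if not self.memtable:
            return
        rows = sorted(self.memtable.items(), key=lambda kv: kv[0])
        self.sstables.append(tuple(rows))        # written once, never modified
        self.memtable = {}
        self.commit_log.clear()                  # flushed data no longer needs the log

    def read(self, key):
        if key in self.memtable:
            value = self.memtable[key]
        else:
            value = None
            for sstable in reversed(self.sstables):   # newest SSTable wins
                rows = dict(sstable)
                if key in rows:
                    value = rows[key]
                    break
        return None if value is self.TOMBSTONE else value

    def compact(self):
        merged = {}
        for sstable in self.sstables:             # oldest first, newer values overwrite
            merged.update(sstable)
        live = {k: v for k, v in merged.items() if v is not self.TOMBSTONE}
        self.sstables = [tuple(sorted(live.items(), key=lambda kv: kv[0]))] if live else []


table = TinyTable(memtable_limit=2)
table.write("alice", "f")
table.write("bob", "m")        # memtable full -> flushed as the first SSTable
table.write("alice", "x")
table.delete("bob")            # memtable full -> flushed as the second SSTable
print(len(table.sstables))     # 2
print(table.read("alice"))     # x
print(table.read("bob"))       # None
table.compact()
print(table.sstables)          # [(('alice', 'x'),)]
```

Writes never modify an existing SSTable, so an update or delete is just a newer entry; reads pick the newest one, and compaction later merges the files and drops what is no longer needed.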

Cassandra installation (for Linux)

#apt-get install openjdk-7-jdk
#echo "deb http://debian.datastax.com/community stable main" | tee -a /etc/apt/sources.list.d/cassandra.sources.list
#curl -L http://debian.datastax.com/debian/repo_key | apt-key add -
#apt-get update
#apt-get install dsc20

Cassandra 2.0 requires Java (JDK 7) to run, so you need to set it as the default Java version after installation.

Running Cassandra

Execute the command #cassandra to run it in the background, or #cassandra -f to run it in the foreground and watch what happens inside Cassandra through the log output.

Note that sometimes you have to kill a Cassandra process that is already running in the background before these commands will work; otherwise you will see errors such as “Java Runtime Error”.

Using Cqlsh

Simply run the command: cqlsh -u "username_if_any" -p "password_if_any" to access the Cassandra query language shell.
Create a new keyspace: create keyspace yourkeyspace with replication = { 'class' : 'SimpleStrategy', 'replication_factor' : 3 };
Select your newly created keyspace: use yourkeyspace;
Create a new table:

CREATE TABLE users (
user_name varchar PRIMARY KEY,
password varchar,
gender varchar,
session_token varchar,
state varchar,
birth_year bigint
);

For more query examples, see the website of DataStax, the primary contributor to the Apache Cassandra project.
