[Data Modeling] Apache Cassandra

HJChung 2021. 8. 15. 15:48

1. What is Apache Cassandra?

Apache Cassandra is an open source NoSQL distributed database trusted by thousands of companies for scalability and high availability without compromising performance.

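That is, Apache Cassandra is an open-source, distributed NoSQL database management system that is well suited to scalability and high availability and is used by a great many companies.

Still, the definition alone did not make much sense to me, because I did not know what a distributed database is, or in what way Cassandra is scalable and highly available enough to be called optimized for both. So let me go through these concepts first.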

1) Distributed Database

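A distributed database is a database whose data is physically spread across multiple computers on a network but is logically integrated and shared through a distributed database management system. Work is processed in a distributed way, and the system provides transparency so that users perceive it as a single database.

The kinds of transparency provided by a distributed database management system are listed below.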

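Fragmentation transparency: the global query entered by the user is converted into multiple fragment queries, so the user does not need to know how the global schema was fragmented.
Location transparency: the user does not need to know the physical location of any data in the distributed database, and can access the same data with the same command regardless of where the data or the issuing system is located.
Replication transparency: even though database objects are replicated across multiple systems, data consistency is maintained without the user having to be aware of it.
Failure transparency: even if a system or the network fails in one of the regions where the database is distributed, data integrity is guaranteed.
Concurrency transparency: even when applications from multiple users run transactions against the distributed database at the same time, the results are not affected.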

2) CAP Theorem

A theorem in computer science that states it is impossible for a distributed data store to simultaneously provide more than two out of the following three guarantees of consistency, availability, and partition tolerance.

The CAP theorem states that a distributed system cannot satisfy all three properties of consistency, availability, and partition tolerance at once; at most two of them can be guaranteed at the same time.

Consistency: 'Every read from the database gets the latest (and correct) piece of data or an error. All clients have the same view of the data.' In other words, every read must return the most recently written, correct data, and all clients must see the same data at the same time.
Consistency in ACID and consistency in CAP are different things. ACID consistency is a data-integrity condition: 'data must always remain in a consistent state, and integrity must not be violated even after the data is manipulated.' More details can be found in the linked reference.

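Availability: 'The system continues to operate in the presence of node failure.' That is, the service must remain available even if a particular node fails.
Put differently, if a request is received, a response must be given. But beyond whether availability exists or not, how good the availability is also matters. The post '[DB/Distributed] CAP Theory for Beginners' explains this more simply and clearly: if the response arrives hours after the request, or only after waiting for the network to recover in order to preserve consistency, can we really call that good availability? This seems to be why the Cassandra introduction puts 'high' in front of availability.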

Partition tolerance: 'The system continues to work regardless of losing network connectivity between nodes.' The system must keep operating even when network communication between nodes fails.

Of these three, Apache Cassandra is designed to satisfy availability and partition tolerance.

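Most NoSQL databases follow the BASE properties (Basically Available, Soft State, Eventually Consistent), so availability takes priority over consistency. The reasoning is that responding promptly even when a node fails, and keeping the service running even when the network is cut, matters more than nodes holding slightly different data for a while; the nodes are then expected to reach eventual consistency. In Cassandra, the gossip protocol spreads cluster and node state between nodes, while mechanisms such as hinted handoff and read repair bring the replicated data back in sync.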
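To make the AP trade-off a little more concrete, here is a toy sketch, not Cassandra's actual replication code. Two in-memory replicas (the `Replica` class and its methods are hypothetical names made up for this illustration) keep answering reads during a simulated partition, return different values for a while, and converge once they exchange state using last-write-wins timestamps.

```python
class Replica:
    """A toy replica that stores key -> (timestamp, value). Hypothetical helper."""

    def __init__(self, name):
        self.name = name
        self.data = {}

    def write(self, key, value, timestamp):
        # Keep the value with the newest timestamp (last write wins).
        current = self.data.get(key)
        if current is None or timestamp > current[0]:
            self.data[key] = (timestamp, value)

    def read(self, key):
        # Always answer from local state, even if it may be stale.
        entry = self.data.get(key)
        return entry[1] if entry else None

    def sync_from(self, other):
        # Pull every entry from the other replica and keep the newest one.
        for key, (timestamp, value) in other.data.items():
            self.write(key, value, timestamp)


a, b = Replica("a"), Replica("b")
a.write("album", "Let it Be", timestamp=1)
b.sync_from(a)

# Simulated partition: only replica "a" receives the update.
a.write("album", "Rubber Soul", timestamp=2)
during_partition = (a.read("album"), b.read("album"))

# The partition heals and the replicas exchange state in both directions.
a.sync_from(b)
b.sync_from(a)
after_heal = (a.read("album"), b.read("album"))

print(during_partition)  # the two replicas disagree
print(after_heal)        # the two replicas agree again
```

Both replicas stay available the whole time; the price is that a read from replica "b" during the partition returns the older value.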

3) Scalability

Scalability means adding resources to a database so that it can handle more throughput. Here I want to focus on two questions: when is it needed, and what about Cassandra makes it scale well?

We'll define scalability as the ability to add computational resources to a database in order to gain more throughput. We'll look specifically at the two types of scalability available - vertical and horizontal - and provide a discussion of each in this context.

Cassandra uses a ring structure based on consistent hashing together with the gossip protocol (through which nodes share information with each other), so nodes can be added and removed freely. Its data replication policies can also take data centers into account, which lets it scale horizontally and safely.

Elastic scalability − Cassandra is highly scalable; it allows to add more hardware to accommodate more customers and more data as per requirement.
Fast linear-scale performance − Cassandra is linearly scalable, i.e., it increases your throughput as you increase the number of nodes in the cluster. Therefore it maintains a quick response time.
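Why consistent hashing makes adding nodes cheap can be sketched in a few lines. This is an illustration under stated assumptions, not Cassandra's implementation: `stable_hash`, `build_ring`, and `owner` are hypothetical helpers, md5 stands in for the real partitioner hash, and 64 virtual nodes per machine is an arbitrary choice.

```python
import bisect
import hashlib


def stable_hash(text):
    # md5 is used only as a stable, well-spread hash for illustration.
    return int(hashlib.md5(text.encode("utf-8")).hexdigest(), 16)


def build_ring(nodes, vnodes=64):
    # Each node owns several points (virtual nodes) on the ring.
    return sorted(
        (stable_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
    )


def owner(ring, key):
    # The first ring point clockwise from the key's hash owns the key.
    points = [point for point, _ in ring]
    index = bisect.bisect_left(points, stable_hash(key)) % len(ring)
    return ring[index][1]


keys = [f"key-{i}" for i in range(1000)]
old_nodes = ["node-a", "node-b", "node-c"]
new_nodes = old_nodes + ["node-d"]

old_ring, new_ring = build_ring(old_nodes), build_ring(new_nodes)
moved = [k for k in keys if owner(old_ring, k) != owner(new_ring, k)]

# A key only changes owner if the newly added node takes it over.
print(len(moved), "of", len(keys), "keys moved")
```

Only the keys that land on the new node move; every other key stays where it was, which is what allows nodes to be added or removed without reshuffling the whole data set.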

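※ While looking into this, I also came across the concept of database sharding. If I get the chance, I would like to study it in more depth and write it up.

- What is database sharding?

- Sharding for distributed DB processing

- LINE Manga database sharding: database engineer edition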

2. Features of Apache Cassandra

Elastic scalability − Cassandra is highly scalable; it allows to add more hardware to accommodate more customers and more data as per requirement.
Always on architecture − Cassandra has no single point of failure and it is continuously available for business-critical applications that cannot afford a failure.
Fast linear-scale performance − Cassandra is linearly scalable, i.e., it increases your throughput as you increase the number of nodes in the cluster. Therefore it maintains a quick response time.
Flexible data storage − Cassandra accommodates all possible data formats including: structured, semi-structured, and unstructured. It can dynamically accommodate changes to your data structures according to your need.
Easy data distribution − Cassandra provides the flexibility to distribute data where you need by replicating data across multiple data centers.
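Transaction support − Cassandra does not offer relational-style ACID transactions with rollback or locking; instead it provides atomic, isolated, and durable writes at the partition level, with tunable consistency.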
Fast writes − Cassandra was designed to run on cheap commodity hardware. It performs blazingly fast writes and can store hundreds of terabytes of data, without sacrificing the read efficiency.

When modeling data with Cassandra, you must take a 'query first approach' to database design. Each table should therefore reflect the query results we want to get. If it does not, the query may be impossible to execute, or it may be very slow.

3. Cassandra Data Structure

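Earlier we saw that Cassandra has a ring structure based on consistent hashing and that its nodes communicate via the gossip protocol.

What is the ring structure? How are the ring and its nodes organized, and how exactly do the nodes communicate?

1) Cassandra's basic data structure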

Source: https://getto215.github.io/cassandra_intro/

There is a keyspace (logical data storage), and under a keyspace there are tables. A table consists of many rows, and each row consists of columns made up of a key and a value. (Some sources call it a column family and others a table; with the introduction of CQL there were many changes, including the renaming of Column Family to Table. For details, see the overview of Cassandra's history in 'A Close Look at Apache Cassandra - Part 1' and the related Stack Overflow discussion.)

Source: https://blog.gft.com/blog/2017/01/24/the-distributed-architecture-behind-apache-cassandra/

Mapping the concepts onto each other as in the figure from the source above makes this a little easier to understand.
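The nesting described above (keyspace, then table, then row, then key/value columns) can be pictured with plain Python dictionaries. This is only a mental model, not how Cassandra stores data on disk; `get_column` and the keyspace name `music` are hypothetical.

```python
# keyspace -> table -> row key -> {column name: value}
keyspaces = {
    "music": {
        "music_library": {
            (1970, "The Beatles"): {"album_name": "Let it Be"},
            (1965, "The Who"): {"album_name": "My Generation"},
        }
    }
}


def get_column(keyspace, table, row_key, column):
    # Walk down the hierarchy one level at a time.
    return keyspaces[keyspace][table][row_key][column]


album = get_column("music", "music_library", (1970, "The Beatles"), "album_name")
print(album)
```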

2) Ring structure and keys

Cassandra distributes data across the nodes that make up the ring according to the partition key.

Cassandra partitions the data on something called the partition key.

Every piece of data with the same partition key will be stored on the same node in the cluster.

The partition key is the key Cassandra uses to access data.

Cassandra runs the partition key through a hash function to produce a token, and this token determines which node stores the data. Each node is assigned a token and is responsible for storing data whose token is less than or equal to its own token but greater than the previous node's token.

Source: https://www.scnsoft.com/blog/cassandra-performance

And if we have the replication factor of 3 (usually it is 3, but it’s tunable for each keyspace), the next two tokens' nodes (or the ones that are physically closer to the first node) also store the data.
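The token lookup and replica placement just described can be sketched as follows. Everything here is an assumption for illustration: the four-node ring, the 0..99 token space, and `toy_token` (a stand-in for the real partitioner hash). Only the simple "next nodes clockwise" placement is modeled, not data-center-aware placement.

```python
import bisect

# Hypothetical ring: each node is assigned one token, and tokens span 0..99.
ring = [(25, "node-1"), (50, "node-2"), (75, "node-3"), (99, "node-4")]
tokens = [token for token, _ in ring]


def toy_token(partition_key):
    # Stand-in for the partitioner's hash function.
    return sum(partition_key.encode("utf-8")) % 100


def replicas_for_token(token, replication_factor=3):
    # The first replica is the first node whose token is >= the data's token.
    start = bisect.bisect_left(tokens, token) % len(ring)
    # The remaining replicas are the next nodes clockwise around the ring.
    return [ring[(start + i) % len(ring)][1] for i in range(replication_factor)]


def replicas(partition_key, replication_factor=3):
    return replicas_for_token(toy_token(partition_key), replication_factor)


print(toy_token("The Who"), replicas("The Who"))
print(replicas_for_token(80))  # wraps around the end of the ring
```

Because the token depends only on the partition key, every row with the same partition key ends up on the same set of nodes.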

Types of keys

partition key
- The partition key's row value will be hashed and stored on the node in the system that holds that range of values.
- That is, the partition key determines the distribution of data across the system.

clustering key (clustering column)
- The columns of the primary key other than the partition key
- The clustering key determines the sort order within a partition.
- By default, the clustering column sorts the data in ascending order.
- More than one clustering column can be added (or none!)
- With several clustering columns, the data is sorted in the order in which they were added to the primary key.

primary key
- The key that identifies a row as unique
- A Simple PRIMARY KEY is just one column that is also the PARTITION KEY.
- A Composite PRIMARY KEY is made up of more than one column and will assist in creating a unique value and in your retrieval queries. (That is, the primary key is made up of either just the partition key or the partition key plus clustering columns.)

composite (compound) key
- A primary key made up of two or more columns
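The roles of the partition key and the clustering column can be imitated in memory. This is a minimal sketch assuming a table with `PRIMARY KEY (artist_name, year)`; grouping with a dictionary and sorting a list only mimic what Cassandra does when it places and orders rows.

```python
from collections import defaultdict

rows = [
    {"artist_name": "The Beatles", "year": 1970, "album_name": "Let it Be"},
    {"artist_name": "The Who", "year": 1965, "album_name": "My Generation"},
    {"artist_name": "The Beatles", "year": 1965, "album_name": "Rubber Soul"},
]

# PRIMARY KEY (artist_name, year): artist_name is the partition key,
# year is the clustering column.
partitions = defaultdict(list)
for row in rows:
    partitions[row["artist_name"]].append(row)

# Within each partition, rows are kept sorted by the clustering column.
for partition_rows in partitions.values():
    partition_rows.sort(key=lambda r: r["year"])

beatles_years = [r["year"] for r in partitions["The Beatles"]]
print(beatles_years)  # [1965, 1970]
```

Reading everything by one artist touches a single partition and comes back already ordered by year, which is why the partition key should match the filter of the query the table is built for.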

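3) Communication between nodes - Peer to Peer Architecture, Gossip Protocol

All nodes in Cassandra communicate with each other through a peer-to-peer communication protocol called the gossip protocol, which broadcasts information about data and node state. (See the node communication section of https://blog.gft.com/blog/2017/01/24/the-distributed-architecture-behind-apache-cassandra/)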
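The general idea of gossip can be sketched with a small simulation. This is a simplified model, not Cassandra's wire protocol: in each round every node picks one random peer and the two merge what they know, keeping the highest version seen per node. The function names, the five node names, and the fixed seed are all assumptions for illustration.

```python
import random


def gossip_round(states, rng):
    # Each node picks one random peer and the two merge what they know.
    names = list(states)
    for name in names:
        peer = rng.choice([n for n in names if n != name])
        merged = dict(states[peer])
        for node, version in states[name].items():
            # Keep the highest version (heartbeat) seen for every node.
            merged[node] = max(version, merged.get(node, 0))
        states[name] = dict(merged)
        states[peer] = dict(merged)


def run_until_converged(node_names, seed=0, max_rounds=1000):
    rng = random.Random(seed)
    # Initially each node only knows its own heartbeat version.
    states = {name: {name: 1} for name in node_names}
    for round_number in range(1, max_rounds + 1):
        gossip_round(states, rng)
        if all(len(view) == len(node_names) for view in states.values()):
            return round_number, states
    return None, states


rounds, final_states = run_until_converged(["n1", "n2", "n3", "n4", "n5"])
print("converged after", rounds, "rounds")
```

No node ever talks to a central coordinator, yet after a few rounds every node has heard about every other node.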

4. Hands-on practice

Creating a table in Apache Cassandra and inserting rows of data

Example: creating a Music Library of albums
Query first approach
1. Give me every album in my music library that was released in a given year
TODO. Create the music_library table with year as the partition key
2. Give me every album in my music library that was created by a given artist
TODO. Create the album_library table with artist_name as the partition key
TODO. year is the clustering column

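# import the Cluster class from the Apache Cassandra Python driver (pip install cassandra-driver)
from cassandra.cluster import Cluster

# create a connection to a locally running Cassandra instance
try:
    cluster = Cluster(['127.0.0.1'])
    session = cluster.connect()
except Exception as e:
    print(e)

# tables must live in a keyspace; the name "music" is an assumption, not part of the original exercise
try:
    session.execute(
        "CREATE KEYSPACE IF NOT EXISTS music "
        "WITH REPLICATION = {'class': 'SimpleStrategy', 'replication_factor': 1}"
    )
    session.set_keyspace('music')
except Exception as e:
    print(e)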
    
# create both tables
# music_library: year is the partition key, artist_name is the clustering column
query = "CREATE TABLE IF NOT EXISTS music_library "
query = query + "(year int, artist_name text, album_name text, PRIMARY KEY (year, artist_name))"
try:
    session.execute(query)
except Exception as e:
    print(e)

# album_library: artist_name is the partition key, year is the clustering column
query = "CREATE TABLE IF NOT EXISTS album_library "
query = query + "(year int, artist_name text, album_name text, PRIMARY KEY (artist_name, year))"
try:
    session.execute(query)
except Exception as e:
    print(e)
    
# insert some data into both tables
query = "INSERT INTO music_library (year, artist_name, album_name)"
query = query + " VALUES (%s, %s, %s)"

query1 = "INSERT INTO album_library (artist_name, year, album_name)"
query1 = query1 + " VALUES (%s, %s, %s)"

try:
    session.execute(query, (1970, "The Beatles", "Let it Be"))
except Exception as e:
    print(e)
    
try:
    session.execute(query, (1965, "The Beatles", "Rubber Soul"))
except Exception as e:
    print(e)
    
try:
    session.execute(query, (1965, "The Who", "My Generation"))
except Exception as e:
    print(e)

    
try:
    session.execute(query1, ("The Beatles", 1970, "Let it Be"))
except Exception as e:
    print(e)
    
try:
    session.execute(query1, ("The Beatles", 1965, "Rubber Soul"))
except Exception as e:
    print(e)
    
try:
    session.execute(query1, ("The Who", 1965, "My Generation"))
except Exception as e:
    print(e)

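# validate the data model: each query filters on its table's partition key
query = "SELECT * FROM music_library WHERE year = 1970"
try:
    rows = session.execute(query)
    # iterate inside the try block so a failed query does not leave rows undefined
    for row in rows:
        print(row.year, row.artist_name, row.album_name)
except Exception as e:
    print(e)
# with only the rows inserted above: 1970 The Beatles Let it Be

query = "SELECT * FROM album_library WHERE artist_name = 'The Beatles'"
try:
    rows = session.execute(query)
    for row in rows:
        print(row.artist_name, row.year, row.album_name)
except Exception as e:
    print(e)
# with only the rows inserted above (ordered by the clustering column year):
# The Beatles 1965 Rubber Soul
# The Beatles 1970 Let it Be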