Cassandra - A Decentralized Structured Storage System
http://www.cs.cornell.edu/projects/ladis2009/papers/lakshman-ladis2009.pdf (English)
http://www.dbthink.com/?p=372 (Chinese translation)
I have not studied Cassandra in depth. On the data-server side it copies Bigtable, while for managing the distributed nodes it copies Dynamo's decentralized architecture, so it can be thought of as a Bigtable with a decentralized design.
Of course, while copying Bigtable's data model and SSTable mechanism, it also made some changes to fit its mail-search scenario: there is effectively no notion of multiple tables, only a single key space, and it supports column sorting and super columns; see Cassandra Vs HBase for details.
Personally I think that unless there is a special reason, such as having the same kind of scenario, there is not much reason to choose Cassandra; HBase is more general-purpose and more mature.
1. Introduction
Facebook runs the largest social networking platform in the world, serving hundreds of millions of users with tens of thousands of servers spread across many data centers around the world.
The Facebook platform has strict operational requirements in terms of performance, reliability, efficiency, and the high scalability needed to support continuous growth. Dealing with failures in an infrastructure comprised of thousands of components is our standard mode of operation; at any given time there are always a number of server or network components failing. Software systems therefore need to be built so that they treat failures as the norm rather than the exception. To meet the reliability and scalability requirements described above, Facebook developed Cassandra.
To achieve scalability and reliability, Cassandra combines several well-known techniques.
Cassandra was originally designed to solve the storage needs of the Inbox Search problem.
At Facebook this meant the system had to handle a very high write throughput, billions of writes per day, and to keep scaling with the number of users. Since users are served from geographically distributed data centers, being able to replicate data across data centers was key to keeping search latencies down. When Inbox Search was launched in June 2008 there were 100 million users; today there are roughly 250 million, and Cassandra has kept up with its promises. Multiple services inside Facebook now use Cassandra as their backend storage system.
2. Related Work
The Google File System (GFS)[9] is another distributed file system, built to store the state of Google's internal applications. GFS uses a simple design: a single master server stores all the metadata, and the data is split into chunks stored on chunk servers. The GFS master has since been made fault tolerant using the Chubby[3] abstraction.
Dynamo[6] is a storage system developed by Amazon to store and retrieve user shopping carts. Dynamo's Gossip-based membership algorithm lets every node maintain information about every other node, so Dynamo can be seen as a structured overlay with at most one-hop request routing. Dynamo detects update conflicts using a vector clock scheme, but prefers client-side conflict resolution. To manage the vector timestamps, a write in Dynamo also requires a read to be performed, which can become a bottleneck in a system that must handle a very high write throughput.
Bigtable[4] provides both structure and data distribution, but it relies on a distributed file system for its durability.
3. Data Model
A table in Cassandra is a distributed multi-dimensional map indexed by a key.
The row key in a table is a string with no size restrictions, although typically 16 to 36 bytes long.
Every operation under a single row key is atomic per replica no matter how many columns are being read or written into.
Columns are grouped together into sets called column families, very much like in the Bigtable[4] system. Cassandra exposes two kinds of column families, Simple and Super column families. Super column families can be visualized as a column family within a column family.
Furthermore, applications can specify the sort order of columns within a Super Column or Simple Column family.
The system allows columns to be sorted either by time or by name. Time sorting of columns is exploited by applications like Inbox Search where the results are always displayed in time sorted order.
Typically applications use a dedicated Cassandra cluster and manage it as part of their service. Although the system supports the notion of multiple tables, all deployments have only one table in their schema.
The data model is very similar to Bigtable's...
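To make the data model concrete, here is a tiny Python sketch of how one row with a simple and a super column family can be pictured. This is my own illustration, not code from the paper; the row key and column names (user:19001, MessagesByTerm, RecentMessages, ...) are made up.

```python
import time

# Purely illustrative nesting:
#   row key -> column family -> column name -> (value, timestamp)
# A super column family adds one more level:
#   row key -> super column family -> super column -> column -> (value, timestamp)
now = time.time()
mailbox = {
    "user:19001": {                          # row key (a string, ~16-36 bytes)
        "MessagesByTerm": {                  # super column family
            "hello": {                       # super column (e.g. a search term)
                "msg-001": ("1", now),       # column -> (value, timestamp)
                "msg-042": ("1", now),
            },
        },
        "RecentMessages": {                  # simple column family
            "msg-042": ("subject: hi", now),
            "msg-077": ("subject: re: hi", now + 5),
        },
    },
}

# Columns in a family can be kept sorted by name or by time; Inbox Search
# relies on the time-sorted order so the newest results come first.
def columns_by_name(row, family):
    return sorted(mailbox[row][family].items())

def columns_by_time(row, family):
    return sorted(mailbox[row][family].items(), key=lambda kv: kv[1][1], reverse=True)

print(columns_by_time("user:19001", "RecentMessages")[0][0])   # -> msg-077 (newest)
```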
4. API
The Cassandra API consists of the following three simple methods.
insert(table; key; rowMutation)
get(table; key; columnName)
delete(table; key; columnName)
columnName can refer to a specific column within a column family, a column family, a super column family, or a column within a super column.
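Below is a toy, in-memory stand-in for this three-call API, only to show how columnName can address a whole column family, a super column, or a single column inside one. The colon-separated path syntax and the example names are assumptions of this sketch, not Cassandra's actual interface.

```python
store = {}   # table -> row key -> nested column family / super column / column dicts

def insert(table, key, row_mutation):
    row = store.setdefault(table, {}).setdefault(key, {})
    for family, columns in row_mutation.items():
        row.setdefault(family, {}).update(columns)

def get(table, key, column_name):
    node = store[table][key]
    for part in column_name.split(":"):      # walk family / super column / column
        node = node[part]
    return node

def delete(table, key, column_name):
    *path, last = column_name.split(":")
    node = store[table][key]
    for part in path:
        node = node[part]
    node.pop(last, None)

insert("Mailbox", "user:19001", {"MessagesByTerm": {"hello": {"msg-001": "1"}}})
print(get("Mailbox", "user:19001", "MessagesByTerm:hello"))          # a super column
print(get("Mailbox", "user:19001", "MessagesByTerm:hello:msg-001"))  # one column
delete("Mailbox", "user:19001", "MessagesByTerm:hello:msg-001")
```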
5. SYSTEM ARCHITECTURE
The architecture of a storage system that needs to operate in a production setting is complex.
In addition to the actual data persistence component, the system needs to have the following characteristics:
- scalable and robust solutions for load balancing,
- membership and failure detection,
- failure recovery,
- replica synchronization,
- overload handling,
- state transfer,
- concurrency and job scheduling,
- request marshalling,
- request routing,
- system monitoring and alarming,
- and configuration management.
Describing the details of each of the solutions is beyond the scope of this paper, so we will focus on the core distributed systems techniques used in Cassandra:
Partitioning, Replication, Membership, Failure handling and scaling.
5.1 Partitioning
One of the key design features for Cassandra is the ability to scale incrementally. This requires the ability to dynamically partition the data over the set of nodes (i.e., storage hosts) in the cluster. Cassandra partitions data across the cluster using consistent hashing [11] but uses an order preserving hash function to do so.
The basic consistent hashing algorithm presents some challenges. First, the random position assignment of each node on the ring leads to non-uniform data and load distribution. Second, the basic algorithm is oblivious to the heterogeneity in the performance of nodes. Typically there exist two ways to address this issue: one is for nodes to get assigned to multiple positions in the circle (as in Dynamo), and the second is to analyze load information on the ring and have lightly loaded nodes move on the ring to alleviate heavily loaded nodes, as described in [17]. Cassandra opts for the latter as it makes the design and implementation very tractable and helps to make very deterministic choices about load balancing.
This is the same overall strategy as Dynamo except for load balancing: Dynamo addresses it by giving different nodes different numbers of virtual nodes, whereas Cassandra addresses it by dynamically moving a node's position on the ring.
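A minimal sketch of order-preserving consistent hashing with movable node positions. It is only an illustration of the idea: plain string tokens compared lexicographically stand in for the real order-preserving hash, and the node names are made up.

```python
import bisect

class Ring:
    def __init__(self):
        self.tokens = []             # sorted node tokens (positions on the ring)
        self.nodes = {}              # token -> node name

    def add_node(self, token, name):
        bisect.insort(self.tokens, token)
        self.nodes[token] = name

    def move_node(self, old_token, new_token):
        # A lightly loaded node can be moved on the ring to relieve a hot one.
        name = self.nodes.pop(old_token)
        self.tokens.remove(old_token)
        self.add_node(new_token, name)

    def coordinator(self, key):
        # The first node whose token is >= the key, wrapping around the ring.
        i = bisect.bisect_left(self.tokens, key)
        return self.nodes[self.tokens[i % len(self.tokens)]]

ring = Ring()
for token, name in [("g", "node-1"), ("p", "node-2"), ("z", "node-3")]:
    ring.add_node(token, name)
print(ring.coordinator("alice"))    # -> node-1
print(ring.coordinator("victor"))   # -> node-3
ring.move_node("z", "t")            # shrink node-3's range to offload it
print(ring.coordinator("victor"))   # -> node-1 (keys past "t" now wrap around)
```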
5.2 Replication
Cassandra uses replication to achieve high availability and durability.
Each data item is replicated at N hosts, where N is the replication factor configured "per-instance". Each key, k, is assigned to a coordinator node (described in the previous section). The coordinator is in charge of the replication of the data items that fall within its range. In addition to locally storing each key within its range, the coordinator replicates these keys at the N-1 nodes in the ring.
Cassandra provides the client with various options for how data needs to be replicated. Cassandra provides various replication policies such as "Rack Unaware", "Rack Aware" (within a datacenter) and "Datacenter Aware". Replicas are chosen based on the replication policy chosen by the application.
The Cassandra system elects a leader amongst its nodes using a system called Zookeeper[13].
All nodes, on joining the cluster, contact the leader, who tells them the ranges for which they are replicas, and the leader makes a concerted effort to maintain the invariant that no node is responsible for more than N-1 ranges in the ring. The metadata about the ranges a node is responsible for is cached locally at each node and in a fault-tolerant manner inside Zookeeper; this way a node that crashes and comes back up knows what ranges it was responsible for.
We borrow from Dynamo parlance and deem the nodes that are responsible for a given range the “preference list" for the range.
Data center failures happen due to power outages, cooling failures, network failures, and natural disasters.
Cassandra is configured such that each row is replicated across multiple data centers. In essence, the preference list of a key is constructed such that the storage nodes are spread across multiple datacenters. These datacenters are connected through high speed network links. This scheme of replicating across multiple datacenters allows us to handle entire data center failures without any outage.
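To illustrate how a preference list could be built, here is a small sketch that walks the ring from the coordinator. The plain case corresponds to "Rack Unaware" (just the next N-1 distinct nodes); the datacenter-aware variant is my own guess at the idea of spreading replicas across datacenters, not the exact algorithm from the paper.

```python
def preference_list(ring_nodes, coordinator_index, n,
                    datacenters=None, datacenter_aware=False):
    """ring_nodes: node names in ring (token) order; returns n replica nodes."""
    replicas = [ring_nodes[coordinator_index]]
    used_dcs = {datacenters[replicas[0]]} if datacenter_aware else set()
    all_dcs = set(datacenters.values()) if datacenter_aware else set()
    i = coordinator_index
    while len(replicas) < n:
        i = (i + 1) % len(ring_nodes)
        node = ring_nodes[i]
        if node in replicas:
            break                          # fewer distinct nodes than n: stop
        if datacenter_aware and datacenters[node] in used_dcs and used_dcs != all_dcs:
            continue                       # first spread replicas across datacenters
        replicas.append(node)
        if datacenter_aware:
            used_dcs.add(datacenters[node])
    return replicas

nodes = ["node-1", "node-2", "node-3", "node-4"]          # ring order
dcs = {"node-1": "dc-east", "node-2": "dc-east",
       "node-3": "dc-west", "node-4": "dc-west"}
print(preference_list(nodes, 0, 3))                              # "Rack Unaware"
print(preference_list(nodes, 0, 3, dcs, datacenter_aware=True))  # datacenter aware
```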
5.3 Membership
Cluster membership in Cassandra is based on Scuttlebutt[19], a very efficient anti-entropy Gossip based mechanism.
The salient feature of Scuttlebutt is that it has very efficient CPU utilization and very efficient utilization of the gossip channel.
Within the Cassandra system Gossip is not only used for membership but also to disseminate other system related control state.
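This is not Scuttlebutt itself, only a sketch of the general anti-entropy gossip shape it refines: each node keeps version-numbered state and reconciles it with a random peer each round, so membership and other control state spread through the cluster. The node names and number of rounds are arbitrary.

```python
import random

class GossipNode:
    def __init__(self, name):
        self.name = name
        # key -> (value, version); the higher version wins on reconciliation
        self.state = {f"endpoint:{name}": ("up", 1)}

    def update(self, key, value):
        _, version = self.state.get(key, (None, 0))
        self.state[key] = (value, version + 1)

    def gossip_with(self, peer):
        # Exchange state both ways; for each key the newer version wins.
        for key in set(self.state) | set(peer.state):
            mine = self.state.get(key, (None, 0))
            theirs = peer.state.get(key, (None, 0))
            newest = mine if mine[1] >= theirs[1] else theirs
            self.state[key] = peer.state[key] = newest

nodes = [GossipNode(f"node-{i}") for i in range(5)]
nodes[4].update("endpoint:node-4", "restarting")   # node-4 bumps its own state
for _ in range(20):                                # a few random gossip rounds
    a, b = random.sample(nodes, 2)
    a.gossip_with(b)
print(nodes[0].state.get("endpoint:node-4"))       # very likely ('restarting', 2)
```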
5.3.1 Failure Detection
Failure detection is a mechanism by which a node can locally determine if any other node in the system is up or down.
In Cassandra failure detection is also used to avoid attempts to communicate with unreachable nodes during various operations.
Cassandra uses a modified version of the Φ Accrual Failure Detector[8].
The idea of an Accrual Failure Detection is that the failure detection module doesn't emit a Boolean value stating a node is up or down. Instead the failure detection module emits a value which represents a suspicion level for each of the monitored nodes.
This value is defined as Φ. The basic idea is to express the value of Φ on a scale that is dynamically adjusted to reflect network and load conditions at the monitored nodes.
Φ has the following meaning:
Given some threshold Φ, and assuming that we decide to suspect a node A when Φ = 1, then the likelihood that we will make a mistake (i.e., the decision will be contradicted in the future by the reception of a late heartbeat) is about 10%. The likelihood is about 1% with Φ = 2, 0.1% with Φ = 3, and so on. Every node in the system maintains a sliding window of inter-arrival times of gossip messages from other nodes in the cluster.
The distribution of these inter-arrival times is determined and Φ is calculated. Although the original paper suggests that the distribution is approximated by the Gaussian distribution we found the Exponential Distribution to be a better approximation, because of the nature of the gossip channel and its impact on latency. To our knowledge our implementation of the Accrual Failure Detection in a Gossip based setting is the first of its kind. Accrual Failure Detectors are very good in both their accuracy and their speed and they also adjust well to network conditions and server load conditions.
Each node maintains a sliding window of the inter-arrival times of gossip messages from the other nodes and uses it to compute Φ (an exponential distribution is used instead of a Gaussian, giving a better approximation). The longer the gap since the last message relative to the observed inter-arrival times, the larger Φ grows and the more likely the node is considered to have failed. This failure-detection approach adapts dynamically to network and node conditions, which is more reasonable than a fixed timeout. It does seem to be original to Cassandra; Dynamo has no comparable mechanism.
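A minimal sketch of a Φ accrual detector under the exponential approximation mentioned above: with mean inter-arrival time m, the probability that a heartbeat is still coming after a gap t is exp(-t/m), so Φ = -log10(exp(-t/m)) = t / (m * ln 10). The window size is an arbitrary choice for the sketch; the Φ = 5 threshold in the comment is the value quoted later in the practical-experience section.

```python
import math
import time
from collections import deque

class PhiAccrualDetector:
    def __init__(self, window_size=100):
        self.intervals = deque(maxlen=window_size)   # sliding window of gaps
        self.last_heartbeat = None

    def heartbeat(self, now=None):
        now = time.time() if now is None else now
        if self.last_heartbeat is not None:
            self.intervals.append(now - self.last_heartbeat)
        self.last_heartbeat = now

    def phi(self, now=None):
        now = time.time() if now is None else now
        if not self.intervals:
            return 0.0
        mean = sum(self.intervals) / len(self.intervals)
        elapsed = now - self.last_heartbeat
        # P(next heartbeat arrives later than `elapsed`) ~= exp(-elapsed/mean),
        # and phi is -log10 of that probability.
        return elapsed / (mean * math.log(10))

detector = PhiAccrualDetector()
for t in range(10):                        # heartbeats arriving once per second
    detector.heartbeat(now=float(t))
print(round(detector.phi(now=10.0), 2))    # ~0.43: nothing suspicious yet
print(round(detector.phi(now=21.5), 2))    # ~5.4: past a conservative PHI = 5
```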
5.4 Bootstrapping
When a node starts for the first time, it chooses a random token for its position in the ring. For fault tolerance, the mapping is persisted to disk locally and also in Zookeeper.
The token information is then gossiped around the cluster.
This is how we know about all nodes and their respective positions in the ring. This enables any node to route a request for a key to the correct node in the cluster.
In the bootstrap case, when a node needs to join a cluster, it reads its configuration file which contains a list of a few contact points within the cluster.
We call these initial contact points, seeds of the cluster. Seeds can also come from a configuration service like Zookeeper.
5.5 Scaling the Cluster
When a new node is added into the system, it gets assigned a token such that it can alleviate a heavily loaded node.
This results in the new node splitting a range that some other node was previously responsible for.
The Cassandra bootstrap algorithm is initiated from any other node in the system by an operator using either a command line utility or the Cassandra web dashboard.
The node giving up the data streams the data over to the new node using kernel-kernel copy techniques. Operational experience has shown that data can be transferred at the rate of 40 MB/sec from a single node. We are working on improving this by having multiple replicas take part in the bootstrap transfer thereby parallelizing the effort, similar to Bittorrent.
5.6 Local Persistence
The Cassandra system relies on the local file system for data persistence.
The data is represented on disk using a format that lends itself to efficient data retrieval.
Typical write operation involves a write into a commit log for durability and recoverability and an update into an in-memory data structure. The write into the in-memory data structure is performed only after a successful write into the commit log. We have a dedicated disk on each machine for the commit log since all writes into the commit log are sequential and so we can maximize disk throughput. When the in-memory data structure crosses a certain threshold, calculated based on data size and number of objects, it dumps itself to disk.
This write is performed on one of many commodity disks that machines are equipped with.
All writes are sequential to disk and also generate an index for efficient lookup based on row key. These indices are also persisted along with the data file.
Over time many such files could exist on disk and a merge process runs in the background to collate the different files into one file.
This process is very similar to the compaction process that happens in the Bigtable system (SSTable).
A typical read operation first queries the in-memory data structure before looking into the files on disk. The files are looked at in the order of newest to oldest. When a disk lookup occurs we could be looking up a key in multiple files on disk. In order to prevent lookups into files that do not contain the key, a bloom filter, summarizing the keys in the file, is also stored in each data file and also kept in memory.
A key in a column family could have many columns. Some special indexing is required to retrieve columns which are further away from the key. In order to prevent scanning of every column on disk we maintain column indices which allow us to jump to the right chunk on disk for column retrieval. As the columns for a given key are being serialized and written out to disk we generate indices at every 256K chunk boundary. This boundary is configurable, but we have found 256K to work well for us in our production workloads.
The local persistence layer basically follows the same strategy as Bigtable...
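A compact sketch of this local write/read path: append to a commit log, update a memtable, flush it to a sorted on-disk file plus a small Bloom filter once a threshold is crossed, and serve reads from the memtable first, then from data files newest to oldest, skipping files whose Bloom filter rules the key out. The JSON file format, the toy Bloom filter, and the tiny flush threshold are assumptions made for this sketch, not Cassandra's actual formats.

```python
import hashlib, json, os, tempfile

class BloomFilter:
    def __init__(self, bits=1024, hashes=3):
        self.bits, self.hashes, self.bitmap = bits, hashes, 0

    def _positions(self, key):
        for i in range(self.hashes):
            digest = hashlib.md5(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.bits

    def add(self, key):
        for pos in self._positions(key):
            self.bitmap |= 1 << pos

    def might_contain(self, key):
        return all(self.bitmap & (1 << pos) for pos in self._positions(key))

class LocalStore:
    def __init__(self, directory, flush_threshold=4):
        self.dir = directory
        self.commit_log = open(os.path.join(directory, "commit.log"), "a")
        self.memtable = {}                    # row key -> value (latest wins)
        self.flush_threshold = flush_threshold
        self.data_files = []                  # [(path, bloom)], oldest first

    def write(self, key, value):
        # Durability first: append to the commit log, then update the memtable.
        self.commit_log.write(json.dumps({"key": key, "value": value}) + "\n")
        self.commit_log.flush()
        self.memtable[key] = value
        if len(self.memtable) >= self.flush_threshold:
            self._flush()

    def _flush(self):
        # Dump the memtable as a sorted data file plus its Bloom filter.
        path = os.path.join(self.dir, f"data-{len(self.data_files)}.json")
        bloom = BloomFilter()
        for key in self.memtable:
            bloom.add(key)
        with open(path, "w") as f:
            json.dump(dict(sorted(self.memtable.items())), f)
        self.data_files.append((path, bloom))
        self.memtable = {}

    def read(self, key):
        if key in self.memtable:              # memtable holds the latest data
            return self.memtable[key]
        for path, bloom in reversed(self.data_files):   # newest file first
            if not bloom.might_contain(key):  # skip files that cannot have it
                continue
            with open(path) as f:
                data = json.load(f)
            if key in data:
                return data[key]
        return None

store = LocalStore(tempfile.mkdtemp())
for i in range(10):
    store.write(f"user:{i}", f"value-{i}")
print(store.read("user:3"), store.read("user:999"))   # value-3 None
```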
5.7 Implementation Details
The Cassandra process on a single machine primarily consists of the following abstractions:
the partitioning module, the cluster membership and failure detection module, and the storage engine module.
Each of these modules relies on an event driven substrate where the message processing pipeline and the task pipeline are split into multiple stages along the lines of the SEDA[20] architecture.
Each of these modules has been implemented from the ground up using Java.
The cluster membership and failure detection module is built on top of a network layer which uses non-blocking I/O.
All system control messages rely on UDP based messaging while the application related messages for replication and request routing rely on TCP.
The request routing modules are implemented using a certain state machine.
When a read/write request arrives at any node in the cluster, the state machine morphs through the following states (a read-side sketch follows the list):
(i) identify the node(s) that own the data for the key
(ii) route the requests to the nodes and wait on the responses to arrive
(iii) if the replies do not arrive within a configured timeout value fail the request and return to the client
(iv) figure out the latest response based on timestamp
(v) schedule a repair of the data at any replica if they do not have the latest piece of data.
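Here is a sketch of the read side of this state machine: route to the replicas of the key, wait for a quorum of responses within a timeout, pick the newest response by timestamp, and repair any stale replica. The replicas are plain in-process dicts standing in for remote nodes, and the quorum/timeout handling is simplified.

```python
import time

def coordinate_read(key, replicas, quorum, timeout=1.0):
    """replicas: list of dicts mapping key -> (value, timestamp)."""
    responses = []                                    # (replica_index, (value, ts))
    deadline = time.time() + timeout
    for i, replica in enumerate(replicas):            # (i)+(ii) route and collect
        if time.time() > deadline:
            break
        if key in replica:
            responses.append((i, replica[key]))
        if len(responses) >= quorum:
            break
    if len(responses) < quorum:                       # (iii) fail the request
        raise RuntimeError(f"only {len(responses)} of {quorum} replies arrived")
    latest = max(responses, key=lambda item: item[1][1])   # (iv) newest ts wins
    value, ts = latest[1]
    for i, (_, t) in responses:                       # (v) repair stale replicas
        if t < ts:
            replicas[i][key] = (value, ts)
    return value

replicas = [
    {"user:7": ("old-name", 100)},                    # a stale replica
    {"user:7": ("new-name", 250)},
    {"user:7": ("new-name", 250)},
]
print(coordinate_read("user:7", replicas, quorum=2))  # -> new-name
print(replicas[0]["user:7"])                          # repaired to ('new-name', 250)
```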
The system can be configured to perform either synchronous or asynchronous writes. For certain systems that require high throughput we rely on asynchronous replication. Here the writes far exceed the reads that come into the system. During the synchronous case we wait for a quorum of responses before we return a result to the client.
In any journaled system there needs to exist a mechanism for purging commit log entries.
In Cassandra we use a rolling commit log where a new commit log is rolled out after an older one exceeds a particular, configurable, size. We have found that rolling the commit log after it reaches 128MB works very well in our production workloads.
Every commit log has a header which is basically a bit vector whose size is fixed and typically more than the number of column families that a particular system will ever handle.
In our implementation we have an in-memory data structure and a data file that is generated per column family.
Every time the in-memory data structure for a particular column family is dumped to disk we set its bit in the commit log stating that this column family has been successfully persisted to disk.
This is an indication that this piece of information is already committed. These bit vectors are per commit log and also maintained in memory.
Every time a commit log is rolled its bit vector and all the bit vectors of commit logs rolled prior to it are checked.
If it is deemed that all the data has been successfully persisted to disk then these commit logs are deleted.
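A sketch of this purging logic. Instead of a bit vector it tracks the complement, i.e. the set of column families that still have unflushed writes in each commit-log segment, which is equivalent for the purpose of the example; segment naming and the API are my own illustration.

```python
class CommitLogSegment:
    def __init__(self, name):
        self.name = name
        self.dirty = set()       # column families with writes not yet flushed

    def record_write(self, column_family):
        self.dirty.add(column_family)

    def mark_flushed(self, column_family):
        # The column family's memtable reached disk: its data here is safe.
        self.dirty.discard(column_family)

    def fully_persisted(self):
        return not self.dirty

def purge_obsolete(segments):
    """segments: oldest first. Drop leading segments whose data is all on disk."""
    kept = list(segments)
    while kept and kept[0].fully_persisted():
        kept.pop(0)              # this commit log (and all older ones) can go
    return kept

old, new = CommitLogSegment("log-0"), CommitLogSegment("log-1")
old.record_write("MessagesByTerm"); old.record_write("RecentMessages")
new.record_write("MessagesByTerm")
old.mark_flushed("MessagesByTerm"); old.mark_flushed("RecentMessages")
print([s.name for s in purge_obsolete([old, new])])   # -> ['log-1']
```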
The write operation into the commit log can either be in normal mode or in fast sync mode. In the fast sync mode the writes to the commit log are buffered.
This implies that there is a potential of data loss on machine crash. In this mode we also dump the in-memory data structure to disk in a buffered fashion. Traditional databases are not designed to handle particularly high write throughput.
Cassandra morphs all writes to disk into sequential writes thus maximizing disk write throughput.
Since the files dumped to disk are never mutated no locks need to be taken while reading them. The server instance of Cassandra is practically lockless for read/write operations. Hence we do not need to deal with or handle the concurrency issues that exist in B-Tree based database implementations.
The Cassandra system indexes all data based on primary key.
The data file on disk is broken down into a sequence of blocks. Each block contains at most 128 keys and is demarcated by a block index. The block index captures the relative offset of a key within the block and the size of its data.
When an in-memory data structure is dumped to disk, a block index is generated and the block offsets are written out to disk as indices.
This index is also maintained in memory for fast access. A typical read operation always looks up data first in the in-memory data structure. If found the data is returned to the application since the in-memory data structure contains the latest data for any key. If not found then we perform disk I/O against all the data files on disk in reverse time order.
Since we are always looking for the latest data we look into the latest file first and return if we find the data. Over time the number of data files will increase on disk. We perform a compaction process, very much like the Bigtable system, which merges multiple files into one; essentially merge sort on a bunch of sorted data files.
The system will always compact files that are close to each other with respect to size, i.e. there will never be a situation where a 100GB file is compacted with a file which is less than 50GB. Periodically a major compaction process is run to compact all related data files into one big file. This compaction process is a disk I/O intensive operation. Many optimizations can be put in place to not affect incoming read requests.
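A sketch of the compaction idea: merge-sort several already sorted data files into one, letting the newest file win when a key appears more than once, and only group files of roughly similar size. The similarity rule (within a factor of two) is an illustrative assumption, not Cassandra's exact policy.

```python
import heapq

def merge_sorted_files(files_newest_first):
    """Each file is a dict {row_key: value} whose keys are already sorted.
    Returns one merged mapping in which the newest file wins duplicate keys."""
    merged = {}
    # heapq.merge performs the merge sort; iterating oldest-first lets newer
    # files overwrite older values for the same row key.
    streams = [sorted(f.items()) for f in reversed(files_newest_first)]
    for key, value in heapq.merge(*streams, key=lambda kv: kv[0]):
        merged[key] = value
    return merged

def similar_sized_groups(file_sizes, ratio=2.0):
    """Bucket files so that e.g. a 100GB file never gets compacted with a tiny one."""
    buckets = []
    for name, size in sorted(file_sizes.items(), key=lambda kv: kv[1]):
        if buckets and size <= buckets[-1][-1][1] * ratio:
            buckets[-1].append((name, size))
        else:
            buckets.append([(name, size)])
    return [[name for name, _ in bucket] for bucket in buckets if len(bucket) > 1]

newest = {"user:1": "v2"}
oldest = {"user:1": "v1", "user:2": "v1"}
print(merge_sorted_files([newest, oldest]))          # {'user:1': 'v2', 'user:2': 'v1'}
print(similar_sized_groups({"a.db": 40, "b.db": 50, "c.db": 4000}))  # [['a.db', 'b.db']]
```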
6. PRACTICAL EXPERIENCES
In the process of designing, implementing and maintaining Cassandra we gained a lot of useful experience and learned numerous lessons.
One very fundamental lesson learned was not to add any new feature without understanding the effects of its usage by applications.
Most problematic scenarios do not stem from just node crashes and network partitions. We share just a few interesting scenarios here.
- Before launching the Inbox Search application we had to index 7TB of inbox data for over 100M users, then stored in our MySQL[1] infrastructure, and load it into the Cassandra system. The whole process involved running Map/Reduce[7] jobs against the MySQL data files, indexing them, and then storing the reverse-index in Cassandra. The M/R process actually behaves as the client of Cassandra. We exposed some background channels for the M/R process to aggregate the reverse-index per user and send the serialized data over to the Cassandra instance, to avoid the serialization/deserialization overhead. This way the Cassandra instance is only bottlenecked by network bandwidth.
- Most applications only require atomic operation per key per replica. However there have been some applications that have asked for transactional semantics, mainly for the purpose of maintaining secondary indices. Most developers with years of development experience working with RDBMSs find this a very useful feature to have. We are working on a mechanism to expose such atomic operations.
- We experimented with various implementations of Failure Detectors such as the ones described in [15] and [5]. Our experience had been that the time to detect failures increased beyond an acceptable limit as the size of the cluster grew. In one particular experiment, in a cluster of 100 nodes, the time taken to detect a failed node was on the order of two minutes. This is practically unworkable in our environments. With the accrual failure detector, with a slightly conservative value of PHI set to 5, the average time to detect failures in the above experiment was about 15 seconds.
- Monitoring is not to be taken for granted. The Cassandra system is well integrated with Ganglia[12], a distributed performance monitoring tool. We expose various system level metrics to Ganglia and this has helped us understand the behavior of the system when subject to our production workload. Disks fail for no apparent reasons. The bootstrap algorithm has some hooks to repair nodes when disks fail. This is however an administrative operation.
- Although Cassandra is a completely decentralized system, we have learned that having some amount of coordination is essential to making the implementation of some distributed features tractable. For example Cassandra is integrated with Zookeeper, which can be used for various coordination tasks in large scale distributed systems. We intend to use the Zookeeper abstraction for some key features which actually do not come in the way of applications that use Cassandra as the storage engine.
Cassandra Vs HBase