Columnar Storage

http://the-paper-trail.org/blog/columnar-storage/

 

You’re going to hear a lot about columnar storage formats in the next few months, as a variety of distributed execution engines are beginning to consider them for their IO efficiency, and the optimisations that they open up for query execution. In this post, I’ll explain why we care so much about IO efficiency and show how columnar storage – which is a simple idea – can drastically improve performance for certain workloads.

 

磁盘仍然是bigdata查询的最大的瓶颈.
所以disk-based query engines需要解决的最大问题, 仍然是如何提高从磁盘读取数据的效率.
思路, 可以从优化磁盘寻道时间和效率入手, 这个比较困难,进展缓慢
更多的, 应该考虑怎样尽量从磁盘中读取有效的数据, 即减少查询所需要的磁盘读取量 

Disks are still the major bottleneck in query execution over large datasets.
Therefore, the best thing an engineer can do to improve the performance of disk-based query engines (like RDBMs and Impala) usually is to improve the performance of reading bytes from disk.

1. The traditional way to improve disk bandwidth has been to wait, and allow disks to get faster.
However, disks are not getting faster very quickly (having settled at roughly 100 MB/s, with ~12 disks per server), and SSDs can’t yet achieve the storage density to be directly competitive with HDDs on a per-server basis.

 

2. The other way to improve disk performance is to maximise the ratio of ‘useful’ bytes read to total bytes read. The idea is not to read more data than is absolutely necessary to serve a query, so the useful bandwidth realised is increased without actually improving the performance of the IO subsystem.

下面就'怎样最大化在磁盘读出数据中的有用数据的比例?'思路来阐述

Columns VS Rows

传统基于row的存储方式

Traditional database file format store data in rows, where each row is comprised of a contiguous collection of column values.

On disk, that looks roughly like the following:

image

This row-major layout usually has a header for each row that describes, for example, which columns in the row are NULL. Each column value is then stored contiguously after the header, followed by another row with its own header, and so on.

 

SELECT * FROM table

requires returning every single column of every single row in the table

SELECT <col_set> FROM table WHERE <predicate_set>

the set of rows returned by the evaluation of the predicates over the table is a large proportion of the total set of rows (i.e. the selectivity is high)

the predicate_set requires reading a large subset of the set of columns or b) col_set is a large subset of the set of columns (i.e. the projectivity is high)

 

More simply, a query is going to be efficient if it requires reading most of the columns of most of the rows.
In these cases, row-major formats allow the query execution engine to achieve good IO efficiency.
总结的很简单, 对于row-based, 无论实际需要多少column, 总是需要读出整行
所以如果确实需要读出大部分column (无论在select部分或where部分), 那么row-major是高效的方式

 

基于column的存储方式

However, there is a general consensus that these SELECT * kinds of queries are not representative of typical analytical workloads; instead either a large number of columns are not projected, or they are projected only for a small subset of rows where only a few columns are required to decide which rows to return.

Coupled with a general trend towards very wide tables with high column counts, the total number of bytes that are required to satisfy a query are often a relatively small fraction of the size on disk of the target table. In these cases, row-major formats often are quite wasteful in the amount of IO they require to execute a query.

但是在现实中, 很多场景往往只需要很少一部分的column, 那么row-major就显得非常低效
自然的思路是, 既然我只需要某一个column, 那么把每个column分开存放并读取就显得更合理

Instead of a format that makes it efficient to read entire rows, it’s advantageous for analytical workloads to make it efficient to read entire columns at once.
Based on our understanding of what makes disks efficient, we can see that the obvious approach is to store columns values densely and contiguously on disk.
This is the basic idea behind columnar file formats. The following diagram shows what this looks like on disk:

image

 

QUERY EXECUTION

The diagram below shows what a simple query plan for SELECT col_b FROM table WHERE col_a > 5

Reading from a traditional row-major file format

对于row-major, 需要读出所有的row来完成这个查询, 明显是比较低效的, 如果row中包含大量的columns

image

Reading from columnar storage

image

 

This, then, is the basic idea of columnar storage: we recognise that analytical workloads rarely require full scans of all table data, but do often require full scans of a small subset of the columns, and so we arrange to make column scans cheap at the expense of extra cost reading individual rows.

 

THE COST OF COLUMNAR

Is this a free lunch? Should every analytical database go out and change every file format to be column-major?
Obviously the story is more complicated than that. There are some query archetypes that suffer when data is stored in a columnar format.

天下没有免费的午餐, 在bigdata领域, 没有one thing fits all
所以使用columnar方案的问题如下

1. It is expensive to reassemble a row, since the separate values that comprise it are spread far across the disk. Every column included in a projection implies an extra disk seek, and this can add up when the projectivity of a query is high. 所以columnar只适用于读取少量column的case, 否则reassemble的耗费会很高

2. Row-major formats have good in-memory spatial locality, and there are common operations that benefit enormously from this. 比如, 做两个column的相加, 对row-major就很简单, 而column的逻辑就复杂许多

所以对这样的情况, 比如HBase, 就采取column family的策略, 把经常一起使用的column放在一起, 达到很好的优化效果

posted on 2013-03-26 17:31  fxjwind  阅读(1146)  评论(0编辑  收藏  举报