ORC File Format

Optimized Row Columnar(ORC)文件格式,提供了一种高效的方式来存储hive数据。它被设计主要是为了克服其他hive文件格式的限制。

主要有以下几个优点:

  • 每一个task只有一个单一的文件
  • hive类型支持datetime,decimal,complex types(struct,list,map and union)
    • skip row groups that don't pass predicate filtering
    • seek to a given row
  • block-mode compression based on data type
    • run-length encoding for integer columns
    • dictionary encoding for string columns
  • concurrent reads of the same file using separate RecordReaders(使用独立的RecordReader同时读取相同的文件)
  • ability to split files without scanning for markers
  • bound the amount of memory needed for reading or writing
  • metadata stored using Protocol Buffers,which allows addition and removal of fields

File Structure

一个ORC文件包含行数据的组叫做stripes(条),包括文件页脚的附加信息(file footer)。在文件的末尾,一个postscript(后记)保存压缩参数,和压缩页脚的大小。

(stripes)默认条纹的大小是250MB,尽可能大的空间可以优化hdfs的读取

(file footer)文件的页脚包括一组条纹,每个条纹的行数,每个列的数据类型,也包括列级别的聚合。

 

Stripe Structure

 

一个stripes包括index data,row data,stripe footer.

The stripe footed contains a directory of stream locations.

Row data在表扫描中使用。

index data 包括每一个列的最小和最大的值,列的行位置。行索引提供偏移可以查找到正确的压缩块在非压缩块中。

Note that ORC indexes are used only for the selection of stripes and row groups and not for answering queries.

Having relatively frequent row index entries enables row-skipping within a stripe for rapid reads,despite large stripe sizes.By default every 10000 rows can

be skipped

 

HiveQL Syntax

File formats are specified at the table(or partition) level. You can specify the ORC file format with HiveQL statements such as these:

  • CREATE TABLE .... STORED AS ORC
  • ALTER TABLE ...[PARTITION partition_spec] SET FILEFORMAT ORC
  • SET hive.default.fileformat=Orc

Serialization and Compression

Integer Column Serialization

Integer columns are serialized in two streams.

  1. present bit stream:is the value non-null?
  2. data stream:a stream of integers

Interger data is serialized in a way that takes advantage of the common distribution of numbers:

  • Integers are encoded using a variable-width encoding that has fewed bytes for small integers
  • Repea

 

posted @ 2016-10-21 15:49  dalu610  阅读(250)  评论(0编辑  收藏  举报