[Hive] Hive ORC-Related Configuration

See: the official documentation

 

ORC File Format

The ORC file format was introduced in Hive 0.11.0. See ORC Files for details.

Besides the configuration properties listed in this section, some properties in other sections are also related to ORC.

hive.exec.orc.memory.pool
  • Default Value: 0.5
  • Added In: Hive 0.11.0 with HIVE-4248

Maximum fraction of heap that can be used by ORC file writers.
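
As a rough illustration of what this fraction means, the sketch below computes the pool size from a heap size. The helper names are hypothetical, and the proportional scaling rule in `allocation_scale` is an assumption about how such a cap could be enforced, not something stated above.

```python
GIB = 1024 ** 3

def orc_memory_pool_bytes(max_heap_bytes, fraction=0.5):
    """Bytes of heap available to all ORC writers combined."""
    return int(max_heap_bytes * fraction)

def allocation_scale(pool_bytes, requested_bytes):
    """Assumed rule: shrink every writer proportionally once the pool is oversubscribed."""
    if requested_bytes <= pool_bytes:
        return 1.0
    return pool_bytes / requested_bytes

# A 4 GiB heap with the default 0.5 leaves 2 GiB for ORC writers.
pool = orc_memory_pool_bytes(4 * GIB)
print(pool, allocation_scale(pool, 4 * GIB))
```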

hive.exec.orc.write.format
  • Default Value: (empty)
  • Added In: Hive 0.12.0 with HIVE-4123; default changed from 0.11 to null with HIVE-5091 (also in Hive 0.12.0)

Define the version of the file to write. Possible values are 0.11 and 0.12. If this parameter is not defined, ORC will use the run length encoding (RLE) introduced in Hive 0.12. Any value other than 0.11 results in the 0.12 encoding.

Additional values may be introduced in the future (see HIVE-6002).
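
The fallback rule for hive.exec.orc.write.format can be written out as a small Python sketch. `resolve_orc_write_format` is a hypothetical helper used only for illustration; it is not part of Hive.

```python
def resolve_orc_write_format(value=None):
    """Return the encoding version implied by hive.exec.orc.write.format."""
    # Only the exact string "0.11" selects the older encoding. An unset or
    # empty value, "0.12", or any other string results in the 0.12 encoding.
    return "0.11" if value == "0.11" else "0.12"

for setting in (None, "", "0.11", "0.12", "anything-else"):
    print(repr(setting), "->", resolve_orc_write_format(setting))
```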

hive.exec.orc.base.delta.ratio
  • Default Value: 8
  • Added In: Hive 1.3.0 and 2.1.0 with HIVE-13563

Define the ratio of base writer and delta writer in terms of STRIPE_SIZE and BUFFER_SIZE.
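
A minimal sketch of the arithmetic, assuming the ratio is applied by dividing the base writer's stripe size and buffer size to obtain the delta writer's sizes. The function name is hypothetical, and the division itself is an assumption about how the ratio is used.

```python
def delta_writer_sizes(base_stripe_size, base_buffer_size, ratio=8):
    """Return (stripe_size, buffer_size) for a delta writer, in bytes."""
    # Assumption: delta sizes are the base sizes divided by the ratio.
    return base_stripe_size // ratio, base_buffer_size // ratio

# With a 64 MB base stripe and a 256 KB base buffer, the default ratio of 8
# gives an 8 MB stripe and a 32 KB buffer for the delta writer.
stripe, buffer_size = delta_writer_sizes(64 * 1024 * 1024, 256 * 1024)
print(stripe, buffer_size)  # prints 8388608 32768
```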

hive.exec.orc.default.stripe.size
  • Default Value: 256*1024*1024 (268,435,456) in 0.13.0; 64*1024*1024 (67,108,864) in 0.14.0
  • Added In: Hive 0.13.0 with HIVE-5425; default changed in 0.14.0 with HIVE-7231 and HIVE-7490

Define the default ORC stripe size, in bytes.

hive.exec.orc.default.block.size
  • Default Value: 256*1024*1024 (268,435,456)
  • Added In: Hive 0.14.0 with HIVE-7231

Define the default file system block size for ORC files.

hive.exec.orc.dictionary.key.size.threshold
  • Default Value: 0.8
  • Added In: Hive 0.12.0 with HIVE-4324

If the number of keys in a dictionary is greater than this fraction of the total number of non-null rows, turn off dictionary encoding.  Use 1 to always use dictionary encoding.
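
The threshold check described above can be sketched as follows. `use_dictionary_encoding` is a hypothetical name, and the code illustrates only the stated rule, not the writer's actual implementation.

```python
def use_dictionary_encoding(distinct_keys, non_null_rows, threshold=0.8):
    """Keep dictionary encoding unless keys exceed threshold * non-null rows."""
    # Dictionary encoding is turned off when the number of distinct keys is
    # greater than the given fraction of non-null rows.
    return distinct_keys <= threshold * non_null_rows

print(use_dictionary_encoding(1_000, 10_000))   # few distinct values
print(use_dictionary_encoding(9_000, 10_000))   # nearly unique column
print(use_dictionary_encoding(10_000, 10_000, threshold=1))
```

With a threshold of 1 the key count can never exceed the non-null row count, which is why that value always keeps dictionary encoding on.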

hive.exec.orc.default.row.index.stride
  • Default Value: 10000
  • Added In: Hive 0.13.0 with HIVE-5728

Define the default ORC index stride in number of rows. (Stride is the number of rows an index entry represents.)
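
To make the stride concrete, this sketch counts how many index entries a column would have for a given number of rows, assuming a final partial stride gets its own entry. The function name is hypothetical.

```python
def row_index_entries(rows, stride=10000):
    """Number of index entries needed to cover `rows` rows."""
    # Integer ceiling division: a partial final stride still needs an entry.
    return (rows + stride - 1) // stride

print(row_index_entries(25_000))  # prints 3
print(row_index_entries(10_000))  # prints 1
```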

hive.exec.orc.default.buffer.size
  • Default Value: 256*1024 (262,144)
  • Added In: Hive 0.13.0 with HIVE-5728

Define the default ORC buffer size, in bytes.

hive.exec.orc.default.block.padding
  • Default Value: true
  • Added In: Hive 0.13.0 with HIVE-5728

Define the default block padding. Block padding was added in Hive 0.12.0 (HIVE-5091, "ORC files should have an option to pad stripes to the HDFS block boundaries").

hive.exec.orc.block.padding.tolerance
  • Default Value: 0.05
  • Added In: Hive 0.14.0 with HIVE-7231

Define the tolerance for block padding as a decimal fraction of the stripe size (for example, the default value 0.05 is 5% of the stripe size). With the defaults of a 64 MB ORC stripe and 256 MB HDFS blocks, at most 3.2 MB is reserved for padding within each 256 MB block. If the space remaining in the block is larger than 3.2 MB, a new, smaller stripe is written to fit within that space. This ensures that no stripe crosses a block boundary, which would cause remote reads within a node-local task.
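
The padding arithmetic above can be checked with a short sketch. The helper names are hypothetical, and the decision function only mirrors the rule as described here.

```python
MB = 1024 * 1024

def max_padding_bytes(stripe_size=64 * MB, tolerance=0.05):
    """Largest gap, in bytes, that will be filled with padding."""
    return int(stripe_size * tolerance)

def should_insert_smaller_stripe(available_in_block, stripe_size=64 * MB,
                                 tolerance=0.05):
    """True if the leftover space is too large to pad and gets a small stripe."""
    return available_in_block > max_padding_bytes(stripe_size, tolerance)

print(max_padding_bytes() / MB)               # roughly 3.2 MB
print(should_insert_smaller_stripe(10 * MB))  # leftover too big to pad
print(should_insert_smaller_stripe(1 * MB))   # small enough to pad
```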

hive.exec.orc.default.compress
  • Default Value: ZLIB
  • Added In: Hive 0.13.0 with HIVE-5728

Define the default compression codec for ORC files.

hive.exec.orc.encoding.strategy
  • Default Value: SPEED
  • Added In: Hive 0.14.0 with HIVE-7219

Define the encoding strategy to use while writing data. Changing this only affects the lightweight encoding for integers. This flag does not change the compression level of the higher-level compression codec (like ZLIB). Possible options are SPEED and COMPRESSION.

hive.orc.splits.include.file.footer

If turned on, splits generated by ORC will include metadata about the stripes in the file. This data is read remotely (from the client or HiveServer2 machine) and sent to all the tasks.

hive.orc.cache.stripe.details.size

Cache size for keeping meta information about ORC splits cached in the client.

hive.orc.cache.use.soft.references
  • Default Value: false
  • Added In: Hive 1.3.0, Hive 2.1.1, Hive 2.2.0 with HIVE-13985

By default, the cache that ORC input format uses to store the ORC file footer uses hard references for the cached object. Setting this to true can help avoid out-of-memory issues under memory pressure (in some cases) at the cost of slight unpredictability in overall query performance.

hive.io.sarg.cache.max.weight.mb
  • Default Value: 10
  • Added In: Hive 2.2.1, Hive 2.3.1, Hive 2.4.0, Hive 3.0.0 with HIVE-17669

The maximum weight allowed for the SearchArgument cache, in megabytes. By default, the cache allows a maximum weight of 10 MB, after which entries are evicted. Set to 0 to disable SearchArgument caching entirely.

hive.orc.compute.splits.num.threads

How many threads ORC should use to create splits in parallel.

hive.exec.orc.split.strategy
  • Default Value: HYBRID
  • Added In: Hive 1.2.0 with HIVE-10114

What strategy ORC should use to create splits for execution. The available options are "BI", "ETL" and "HYBRID".

The HYBRID mode reads the footers for all files if there are fewer files than the expected mapper count, switching over to generating one split per file if the average file size is smaller than the default HDFS block size. The ETL strategy always reads the ORC footers before generating splits, while the BI strategy generates per-file splits quickly without reading any data from HDFS.
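
One possible reading of the HYBRID rule, written as a Python sketch. The function and its parameters are hypothetical, and the handling of the case "many large files" (falling back to ETL) is an assumption that the paragraph above does not spell out. Hive's actual implementation may differ.

```python
MB = 1024 * 1024

def choose_split_strategy(strategy, num_files, total_bytes,
                          expected_mappers, block_size=256 * MB):
    """Return the effective strategy ("BI" or "ETL") for a set of files."""
    strategy = strategy.upper()
    if strategy in ("BI", "ETL"):
        return strategy              # explicit choices are used as-is
    if strategy != "HYBRID":
        raise ValueError(f"unknown split strategy: {strategy}")
    if num_files < expected_mappers:
        return "ETL"                 # few files: reading footers is cheap
    avg_file_size = total_bytes / max(num_files, 1)
    if avg_file_size < block_size:
        return "BI"                  # many small files: one split per file
    return "ETL"                     # assumption: large files need footers

print(choose_split_strategy("HYBRID", 5, 5 * 512 * MB, expected_mappers=10))
print(choose_split_strategy("HYBRID", 1000, 1000 * MB, expected_mappers=10))
```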

hive.exec.orc.skip.corrupt.data
  • Default Value: false
  • Added In: Hive 0.13.0 with HIVE-6382

If the ORC reader encounters corrupt data, this value determines whether to skip the corrupt data or throw an exception. The default behavior is to throw an exception.

hive.exec.orc.zerocopy

Use zerocopy reads with ORC. (This requires Hadoop 2.3 or later.)

hive.merge.orcfile.stripe.level
  • Default Value: true
  • Added In: Hive 0.14.0 with HIVE-7509

When hive.merge.mapfiles, hive.merge.mapredfiles or hive.merge.tezfiles is enabled while writing a table with ORC file format, enabling this configuration property will do stripe-level fast merge for small ORC files. Note that enabling this configuration property will not honor the padding tolerance configuration (hive.exec.orc.block.padding.tolerance).

hive.orc.row.index.stride.dictionary.check
  • Default Value: true
  • Added In: Hive 0.14.0 with HIVE-7832

If enabled, the dictionary check happens after the first row index stride (default 10000 rows); otherwise, the dictionary check happens before writing the first stripe. In both cases, the decision whether to use a dictionary is retained thereafter.

hive.exec.orc.compression.strategy
  • Default Value: SPEED
  • Added In: Hive 0.14.0 with HIVE-7859

Define the compression strategy to use while writing data. This changes the compression level of the higher-level compression codec (like ZLIB).

Value can be SPEED or COMPRESSION.


 

Posted 2022-06-30 10:10 by 梦醒江南·Infinite