hive SerDe序列化和反序列序列化表
1 什么是SerDe
SerDe 是两个单词的拼写 serialized(序列化) 和 deserialized(反序列化)。 什么是序列化和反序列化呢?
当进程在进行远程通信时,彼此可以发送各种类型的数据,无论是什么类型的数据都会以 二进制序列的形式在网络上传送。发送方需要把对象转化为字节序列才可在网络上传输, 称为对象序列化;接收方则需要把字节序列恢复为对象,称为对象的反序列化。
Hive的反序列化是对key/value反序列化成hive table的每个列的值。Hive可以方便 的将数据加载到表中而不需要对数据进行转换,这样在处理海量数据时可以节省大量的时间。
Hive SerDe
What is a SerDe?
- SerDe is a short name for "Serializer and Deserializer."
- Hive uses SerDe (and FileFormat) to read and write table rows.
- HDFS files --> InputFileFormat --> <key, value> --> Deserializer --> Row object (读流程)
- Row object --> Serializer --> <key, value> --> OutputFileFormat --> HDFS files (写流程)
----参考官网:https://cwiki.apache.org/confluence/display/Hive/DeveloperGuide#DeveloperGuide-HiveSerDe
2 Built-in SerDes(SerDe包括内置类型)
- Avro (Hive 0.9.1 and later)
- ORC (Hive 0.11 and later)
- RegEx
- Thrift
- Parquet (Hive 0.13 and later)
- CSV (Hive 0.14 and later)
- JsonSerDe (Hive 0.12 and later in hcatalog-core)
----参考官网:https://cwiki.apache.org/confluence/display/Hive/SerDe
3 序列化的使用
3.1 建表时指定序列化方式
· RegexSerDe
CREATE TABLE apachelog ( host STRING, identity STRING, user STRING, time STRING, request STRING, status STRING, size STRING, referer STRING, agent STRING) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe' WITH SERDEPROPERTIES ( "input.regex" = "([^]*) ([^]*) ([^]*) (-|\\[^\\]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\".*\") ([^ \"]*|\".*\"))?" ) STORED AS TEXTFILE;
· JsonSerDe
ADD JAR /usr/lib/hive-hcatalog/lib/hive-hcatalog-core.jar; CREATE TABLE my_table(a string, b bigint, ...) ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe' STORED AS TEXTFILE;
· CSVSerDe
CREATE TABLE my_table(a string, b string) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde' WITH SERDEPROPERTIES ("separatorChar" = "\t","quoteChar"= "'","escapeChar"= "\\") STORED AS TEXTFILE;
·ORCSerDe
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde' STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
参考官网:
Registration of Native SerDes
As of Hive 0.14 a registration mechanism has been introduced for native Hive SerDes. This allows dynamic binding between a "STORED AS" keyword in place of a triplet of {SerDe, InputFormat, and OutputFormat} specification, in CreateTable statements.
The following mappings have been added through this registration mechanism:
Syntax
|
Equivalent
|
---|---|
Syntax
|
Equivalent
|
STORED AS AVRO / STORED AS AVROFILE |
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe' STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat' |
STORED AS ORC / STORED AS ORCFILE |
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde ' STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat ' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat ' |
STORED AS PARQUET / STORED AS PARQUETFILE |
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe ' STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat ' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat ' |
STORED AS RCFILE |
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.RCFileInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.RCFileOutputFormat' |
STORED AS SEQUENCEFILE |
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.SequenceFileInputFormat' OUTPUTFORMAT 'org.apache.hadoop.mapred.SequenceFileOutputFormat' |
STORED AS TEXTFILE |
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat' |