hive的元数据
简介:
hive是建立在hadoop之上的数据仓库,一般用于对大型数据集的读写和管理,存在hive里的数据实际上就是存在HDFS上,都是以文件的形式存在,不能进行读写操作,所以我们需要元数据或者说叫schem来对hdfs上的数据进行管理。那元数据表之间有没有什么关联呢?答案是肯定的。hive默认元数据表是存储在derby中的,但是derby是单session的,所以我们一般会修改会mysql
那么该如何启用mysql来管理元数据呢?
1 <configuration> 2 <property> 3 <name>javax.jdo.option.ConnectionURL</name> 4 <value>jdbc:mysql://hadoop001:3306/ruoze_d6?createDatabaseIfNotExist=true&characterEncoding=UTF-8</value> 5 </property> 6 <property> 7 <name>javax.jdo.option.ConnectionDriverName</name> 8 <value>com.mysql.jdbc.Driver</value> 9 </property> 10 <property> 11 <name>javax.jdo.option.ConnectionUserName</name> 12 <value>root</value> 13 </property> 14 <property> 15 <name>javax.jdo.option.ConnectionPassword</name> 16 <value>123456</value> 17 </property> 18 </configuration>
以上的配置就会启用MYSQL管理元数据
第4行的配置是配置了mysql里的数据库名字叫ruoze_d6,第11行和第16行式配置了MySQL的登录用户名和密码,并且ruoze_d6这个库不需要在mysql中特别建立
mysql> use ruoze_d6;
Database changed
mysql>
1 mysql> show tables; 2 +---------------------------+ 3 | Tables_in_ruoze_d6 | 4 +---------------------------+ 5 | bucketing_cols | 6 | cds | 7 | columns_v2 | 8 | database_params | 9 | dbs | 10 | func_ru | 11 | funcs | 12 | global_privs | 13 | groupinfor | 14 | idxs | 15 | index_params | 16 | makedata_job | 17 | part_col_privs | 18 | part_col_stats | 19 | part_privs | 20 | partition_key_vals | 21 | partition_keys | 22 | partition_params | 23 | partitions | 24 | roles | 25 | sd_params | 26 | sds | 27 | sequence_table | 28 | serde_params | 29 | serdes | 30 | skewed_col_names | 31 | skewed_col_value_loc_map | 32 | skewed_string_list | 33 | skewed_string_list_values | 34 | skewed_values | 35 | sort_cols | 36 | tab_col_stats | 37 | table_params | 38 | tbl_col_privs | 39 | tbl_privs | 40 | tbls | 41 | version | 42 | | 43 +---------------------------+ 44 37 rows in set (0.00 sec)
这里一共有37张表, 我们捡主次分析一下
-
version(存储Hive版本的元数据表)
mysql> select * from version ; +--------+----------------+----------------------------------------+ | VER_ID | SCHEMA_VERSION | VERSION_COMMENT | +--------+----------------+----------------------------------------+ | 11 | 1.1.0 | Set by MetaStore hadoop@172.16.202.233 | +--------+----------------+----------------------------------------+ 1 row in set (0.00 sec)
- 说明:
- 第一列是ID主键;第二列是hive的版本,第三列是版本说明,并且这张表里只有一条数据,且只能有一条数据,如果这张表被删除,当启动Hive-Cli时候,就会报错”Table ‘hive.version’ doesn’t exist”。
- 但是前提示关闭某个参数,如果那个参数开着,那么你如果删除了这张表或者说清空这张表,他都会自动建立,那个参数我忘记是啥了,回头想起来会来补上
-
DBS(hive数据库相关的元数据表)
mysql> select * from DBS \G; *************************** 1. row *************************** DB_ID: 1 DESC: Default Hive database DB_LOCATION_URI: hdfs://hadoop001:9000/user/hive/warehouse NAME: default OWNER_NAME: public OWNER_TYPE: ROLE *************************** 2. row *************************** DB_ID: 6 DESC: NULL DB_LOCATION_URI: hdfs://hadoop001:9000/user/hive/warehouse/hadoop_g6.db NAME: hadoop_g6 OWNER_NAME: hadoop OWNER_TYPE: USER *************************** 3. row *************************** DB_ID: 11 DESC: NULL DB_LOCATION_URI: hdfs://hadoop001:9000/user/hive/warehouse/ruoze_d6.db NAME: ruoze_d6 OWNER_NAME: hadoop OWNER_TYPE: USER 3 rows in set (0.00 sec)
- 说明:该表存储Hive中所有数据库的基本信息
列名 | 解释 |
DB_ID |
数据库ID |
DESC |
数据库描述 |
DB_LOCATION_URI |
数据库HDFS路径 |
NAME |
数据库名 |
OWNER_NAME |
数据库所有者用户名 |
OWNER_TYPE |
所有者角色 |
-
database_params(hive数据库相关的元数据表)
mysql> desc database_params; +-------------+---------------+------+-----+---------+-------+ | Field | Type | Null | Key | Default | Extra | +-------------+---------------+------+-----+---------+-------+ | DB_ID | bigint(20) | NO | PRI | NULL | | | PARAM_KEY | varchar(180) | NO | PRI | NULL | | | PARAM_VALUE | varchar(4000) | YES | | NULL | | +-------------+---------------+------+-----+---------+-------+
- 说明:该表存储数据库的相关参数,在CREATE DATABASE时候用 WITH DBPROPERTIES (property_name=property_value, …)指定的参数
字段 | 说明 | 示例 |
DB_ID |
数据库ID | 11 |
PARAM_KEY |
参数名 | createby |
PARAM_VALUE |
参数值 | root |
-
TBLS(Hive表和视图相关的元数据表)
mysql> select * from TBLS \G; *************************** 1. row *************************** TBL_ID: 37 CREATE_TIME: 1555494334 DB_ID: 1 LAST_ACCESS_TIME: 0 OWNER: hadoop RETENTION: 0 SD_ID: 37 TBL_NAME: makedata_job TBL_TYPE: MANAGED_TABLE VIEW_EXPANDED_TEXT: NULL VIEW_ORIGINAL_TEXT: NULL
- 说明:该表中存储Hive表、视图、索引表的基本信息。
TBL_ID
|
表ID |
CREATE_TIME
|
创建时间 |
DB_ID
|
数据库ID |
LAST_ACCESS_TIME
|
上次访问时间 |
OWNER
|
所有者 |
RETENTION
|
保留字段 |
SD_ID
|
序列化配置信息(对应SDS表中的SD_ID) |
TBL_NAME
|
表名 |
TBL_TYPE
|
表类型 |
VIEW_EXPANDED_TEXT
|
视图详细的HQL语句 |
VIEW_ORIGINAL_TEXT
|
视图原始的HQL语句 |
-
table_params(Hive表和视图相关的元数据表)
mysql> select * from table_params; +--------+-----------------------+-------------+ | TBL_ID | PARAM_KEY | PARAM_VALUE | +--------+-----------------------+-------------+ | 37 | COLUMN_STATS_ACCURATE | true | | 37 | numFiles | 5 | | 37 | numRows | 0 | | 37 | rawDataSize | 0 | | 37 | totalSize | 2921282 | | 37 | transient_lastDdlTime | 1555551458 | | 42 | EXTERNAL | TRUE | | 42 | transient_lastDdlTime | 1555555620 | | 46 | COLUMN_STATS_ACCURATE | true | | 46 | numFiles | 1 | | 46 | numRows | 500000 | | 46 | rawDataSize | 72051224 | | 46 | totalSize | 30284817 | | 46 | transient_lastDdlTime | 1555557177 | | 51 | EXTERNAL | TRUE | | 51 | transient_lastDdlTime | 1555772013 | | 52 | COLUMN_STATS_ACCURATE | true | | 52 | numFiles | 1 | | 52 | numRows | 500000 | | 52 | rawDataSize | 67551224 | | 52 | totalSize | 75265591 | | 52 | transient_lastDdlTime | 1555772485 | | 56 | COLUMN_STATS_ACCURATE | true | | 56 | numFiles | 1 | | 56 | numRows | 500000 | | 56 | rawDataSize | 64051224 | | 56 | totalSize | 64641768 | | 56 | transient_lastDdlTime | 1555773864 | | 66 | COLUMN_STATS_ACCURATE | true | | 66 | numFiles | 1 | | 66 | numRows | 500000 | | 66 | rawDataSize | 359000000 | | 66 | totalSize | 17782969 | | 66 | transient_lastDdlTime | 1555775575 | | 67 | COLUMN_STATS_ACCURATE | true | | 67 | numFiles | 1 | | 67 | numRows | 500000 | | 67 | orc.compress | NONE | | 67 | rawDataSize | 359000000 | | 67 | totalSize | 53967047 | | 67 | transient_lastDdlTime | 1555775880 | | 68 | COLUMN_STATS_ACCURATE | true | | 68 | numFiles | 1 | | 68 | numRows | 500000 | | 68 | rawDataSize | 4000000 | | 68 | totalSize | 61117546 | | 68 | transient_lastDdlTime | 1555776185 | | 69 | COLUMN_STATS_ACCURATE | true | | 69 | numFiles | 1 | | 69 | numRows | 500000 | | 69 | rawDataSize | 4000000 | | 69 | totalSize | 16854027 | | 69 | transient_lastDdlTime | 1555776356 | | 71 | COLUMN_STATS_ACCURATE | true | | 71 | numFiles | 1 | | 71 | numRows | 1 | | 71 | rawDataSize | 0 | | 71 | totalSize | 1 | | 71 | transient_lastDdlTime | 1555809751 | | 76 | transient_lastDdlTime | 1555836141 | | 77 | COLUMN_STATS_ACCURATE | true | | 77 | numFiles | 1 | | 77 | numRows | 0 | | 77 | rawDataSize | 0 | | 77 | totalSize | 366 | | 77 | transient_lastDdlTime | 1555837173 | +--------+-----------------------+-------------+
- 说明:该表存储表/视图的属性信息。
字段 | dec |
TBL_ID |
表ID(对应TBLS中的TBL_ID) |
PARAM_KEY |
属性名 |
PARAM_VALUES |
属性值 |
-
TBL_PRIVS 该表存储表/视图的授权信息(不做详细说明)
mysql> desc TBL_PRIVS; +----------------+--------------+------+-----+---------+-------+ | Field | Type | Null | Key | Default | Extra | +----------------+--------------+------+-----+---------+-------+ | TBL_GRANT_ID | bigint(20) | NO | PRI | NULL | | | CREATE_TIME | int(11) | NO | | NULL | | | GRANT_OPTION | smallint(6) | NO | | NULL | | | GRANTOR | varchar(128) | YES | | NULL | | | GRANTOR_TYPE | varchar(128) | YES | | NULL | | | PRINCIPAL_NAME | varchar(128) | YES | | NULL | | | PRINCIPAL_TYPE | varchar(128) | YES | | NULL | | | TBL_PRIV | varchar(128) | YES | | NULL | | | TBL_ID | bigint(20) | YES | MUL | NULL | | +----------------+--------------+------+-----+---------+-------+ 9 rows in set (0.01 sec)
TBL_ID对应TBLS中的TBL_ID
-
sds(Hive文件存储信息相关的元数据表)
mysql> desc sds; +---------------------------+---------------+------+-----+---------+-------+ | Field | Type | Null | Key | Default | Extra | +---------------------------+---------------+------+-----+---------+-------+ | SD_ID | bigint(20) | NO | PRI | NULL | | | CD_ID | bigint(20) | YES | MUL | NULL | | | INPUT_FORMAT | varchar(4000) | YES | | NULL | | | IS_COMPRESSED | bit(1) | NO | | NULL | | | IS_STOREDASSUBDIRECTORIES | bit(1) | NO | | NULL | | | LOCATION | varchar(4000) | YES | | NULL | | | NUM_BUCKETS | int(11) | NO | | NULL | | | OUTPUT_FORMAT | varchar(4000) | YES | | NULL | | | SERDE_ID | bigint(20) | YES | MUL | NULL | | +---------------------------+---------------+------+-----+---------+-------+
- 说明:文件存储的基本信息:
SD_ID |
|
CD_ID |
字段信息ID |
INPUT_FORMAT |
文件输入格式 |
IS_COMPRESSED |
是否压缩 |
IS_STOREDASSUBDIRECTORIES |
是否以子目录存储 |
LOCATION |
HDFS路径 |
NUM_BUCKETS |
分桶数量 |
OUTPUT_FORMAT |
文件输出格式 |
SERDE_ID |
序列化类ID |
字段 | 说明 |
-
SD_PARAMS(Hive文件存储信息相关的元数据表)
mysql> desc SD_PARAMS; +-------------+---------------+------+-----+---------+-------+ | Field | Type | Null | Key | Default | Extra | +-------------+---------------+------+-----+---------+-------+ | SD_ID | bigint(20) | NO | PRI | NULL | | | PARAM_KEY | varchar(256) | NO | PRI | NULL | | | PARAM_VALUE | varchar(4000) | YES | | NULL | | +-------------+---------------+------+-----+---------+-------+ 3 rows in set (0.00 sec)
- 说明:该表存储Hive存储的属性信息,在创建表时候使用
STORED BY ‘storage.handler.class.name’ [WITH SERDEPROPERTIES (…)指定。
-
serdes(Hive文件存储信息相关的元数据表)
mysql> select * from serdes; +----------+------+-------------------------------------------------------------+ | SERDE_ID | NAME | SLIB | +----------+------+-------------------------------------------------------------+ | 37 | NULL | org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe | | 42 | NULL | org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe | | 43 | NULL | org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe | | 46 | NULL | org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe | | 51 | NULL | org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe | | 52 | NULL | org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe | | 56 | NULL | org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe | | 66 | NULL | org.apache.hadoop.hive.ql.io.orc.OrcSerde | | 67 | NULL | org.apache.hadoop.hive.ql.io.orc.OrcSerde | | 68 | NULL | org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe | | 69 | NULL | org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe | | 71 | NULL | org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe | | 76 | NULL | org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe | | 77 | NULL | org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe | +----------+------+-------------------------------------------------------------+ 14 rows in set (0.00 sec)
- 说明:该表存储序列化使用的类信息
字段 | 字段说明 |
SERDE_ID |
序列化类配置ID(对应SDS的SERDE_ID ) |
NAME |
序列化类别名 |
SLIB |
序列化类 |
-
serde_params(Hive文件存储信息相关的元数据表)
mysql> select * from serde_params; +----------+----------------------+-------------+ | SERDE_ID | PARAM_KEY | PARAM_VALUE | +----------+----------------------+-------------+ | 37 | field.delim | | | 37 | serialization.format | | | 42 | field.delim | | | 42 | serialization.format | | | 43 | field.delim | | | 43 | serialization.format | | | 46 | serialization.format | 1 | | 51 | field.delim | | | 51 | serialization.format | | | 52 | serialization.format | 1 | | 56 | serialization.format | 1 | | 66 | serialization.format | 1 | | 67 | serialization.format | 1 | | 68 | serialization.format | 1 | | 69 | serialization.format | 1 | | 71 | serialization.format | 1 | | 76 | field.delim | | | 76 | serialization.format | | | 77 | field.delim | | | 77 | serialization.format | | +----------+----------------------+-------------+ 20 rows in set (0.00 sec)
- 说明:该表存储序列化的一些属性、格式信息,比如:行、列分隔符
字段 | 字段说明 |
SERDE_ID |
序列化类配置ID(对应SDS的SERDE_ID ) |
PARAM_KEY |
属性名 |
PARAM_VALUE |
属性值 |
-
columns_v2Hive表字段相关的元数据表
mysql> select * from columns_v2; +-------+---------+-------------+--------------+-------------+ | CD_ID | COMMENT | COLUMN_NAME | TYPE_NAME | INTEGER_IDX | +-------+---------+-------------+--------------+-------------+ | 37 | NULL | ip | varchar(20) | 4 | | 37 | NULL | levelnm | varchar(6) | 2 | | 37 | NULL | region | varchar(6) | 1 | | 37 | NULL | time_random | varchar(20) | 3 | | 37 | NULL | traffic | varchar(12) | 7 | | 37 | NULL | urlid | varchar(100) | 6 | | 37 | NULL | urlnm | varchar(6) | 0 | | 37 | NULL | urlym | varchar(20) | 5 | | 42 | NULL | cdn | string | 0 | | 42 | NULL | domain | string | 5 | | 42 | NULL | ip | string | 4 | | 42 | NULL | level | string | 2 | | 42 | NULL | region | string | 1 | | 42 | NULL | time | string | 3 | | 42 | NULL | traffic | bigint | 7 | | 42 | NULL | url | string | 6 | +-------+---------+-------------+--------------+-------------+ 17 rows in set (0.00 sec)
- 说明:表的字段信息
字段 | 字段说明 |
CD_ID |
字段信息ID(对应表SDS的CD_ID) |
COMMENT |
字段注释 |
COLUMN_NAME |
字段名 |
TYPE_NAME |
字段类型 |
INTEGER_IDX |
字段顺序 |
-
partitions(Hive表分区相关的元数据表)
mysql> select * from partitions ; +---------+-------------+------------------+--------------+-------+--------+ | PART_ID | CREATE_TIME | LAST_ACCESS_TIME | PART_NAME | SD_ID | TBL_ID | +---------+-------------+------------------+--------------+-------+--------+ | 21 | 1555555926 | 0 | day=20190418 | 43 | 42 | +---------+-------------+------------------+--------------+-------+--------+ 1 row in set (0.00 sec)
- 说明:分区的基本信息
字段 | 字段说明 |
PART_ID |
分区ID |
CREATE_TIME |
分区创建时间 |
LAST_ACCESS_TIME |
最后一次访问时间 |
PART_NAME |
分区名称 |
SD_ID |
分区存储ID |
TBL_ID |
表ID |
-
partition_keys(Hive表分区相关的元数据表)
mysql> select * from partition_keys; +--------+--------------+-----------+-----------+-------------+ | TBL_ID | PKEY_COMMENT | PKEY_NAME | PKEY_TYPE | INTEGER_IDX | +--------+--------------+-----------+-----------+-------------+ | 42 | NULL | day | string | 0 | +--------+--------------+-----------+-----------+-------------+ 1 row in set (0.00 sec)
- 说明:分区的字段信息
字段名称 | 字段说明 |
TBL_ID |
表ID |
PKEY_COMMENT |
分区字段说明 |
PKEY_NAME |
分区字段名称 |
PKEY_TYPE |
分区字段类型 |
INTEGER_IDX |
分区字段顺序 |
-
partition_key_vals(Hive表分区相关的元数据表)
mysql> select * from partition_key_vals; +---------+--------------+-------------+ | PART_ID | PART_KEY_VAL | INTEGER_IDX | +---------+--------------+-------------+ | 21 | 20190418 | 0 | +---------+--------------+-------------+ 1 row in set (0.00 sec)
- 说明:该表存储分区字段值
字段 | 字段说明 |
PART_ID |
分区ID |
PART_KEY_VAL |
分区字段值 |
INTEGER_IDX |
分区字段值顺序 |
-
partition_params(Hive表分区相关的元数据表)
mysql> select * from partition_params; +---------+-----------------------+-------------+ | PART_ID | PARAM_KEY | PARAM_VALUE | +---------+-----------------------+-------------+ | 21 | COLUMN_STATS_ACCURATE | true | | 21 | numFiles | 1 | | 21 | totalSize | 29975501 | | 21 | transient_lastDdlTime | 1555556171 | +---------+-----------------------+-------------+ 4 rows in set (0.00 sec)
- 说明:该表存储分区的属性信息.
字段 | 字段说明 |
PART_ID |
分区ID |
PARAM_KEY |
分区属性名 |
PARAM_VALUE |
分区属性值 |
-
其他不常用的元数据表
此图转载于https://mp.weixin.qq.com/s/c2C4SYaj-GUP6hTkPNV_hQ
参考博客:https://mp.weixin.qq.com/s/c2C4SYaj-GUP6hTkPNV_hQ