hive的元数据

 简介:

hive是建立在hadoop之上的数据仓库,一般用于对大型数据集的读写和管理,存在hive里的数据实际上就是存在HDFS上,都是以文件的形式存在,不能进行读写操作,所以我们需要元数据或者说叫schem来对hdfs上的数据进行管理。那元数据表之间有没有什么关联呢?答案是肯定的。hive默认元数据表是存储在derby中的,但是derby是单session的,所以我们一般会修改会mysql

那么该如何启用mysql来管理元数据呢?

 1 <configuration>
 2  <property>
 3   <name>javax.jdo.option.ConnectionURL</name>
 4   <value>jdbc:mysql://hadoop001:3306/ruoze_d6?createDatabaseIfNotExist=true&amp;characterEncoding=UTF-8</value>
 5  </property>
 6 <property>
 7       <name>javax.jdo.option.ConnectionDriverName</name>
 8         <value>com.mysql.jdbc.Driver</value>
 9  </property>
10 <property>
11   <name>javax.jdo.option.ConnectionUserName</name>
12     <value>root</value>
13     </property>
14 <property>
15   <name>javax.jdo.option.ConnectionPassword</name>
16     <value>123456</value>
17     </property>
18 </configuration>
以上的配置就会启用MYSQL管理元数据
第4行的配置是配置了mysql里的数据库名字叫ruoze_d6,第11行和第16行式配置了MySQL的登录用户名和密码,并且ruoze_d6这个库不需要在mysql中特别建立

mysql> use ruoze_d6;
Database changed
mysql>

 

 1 mysql> show tables;
 2 +---------------------------+
 3 | Tables_in_ruoze_d6        |
 4 +---------------------------+
 5 | bucketing_cols            |
 6 | cds                       |
 7 | columns_v2                |
 8 | database_params           |
 9 | dbs                       |
10 | func_ru                   |
11 | funcs                     |
12 | global_privs              |
13 | groupinfor                |
14 | idxs                      |
15 | index_params              |
16 | makedata_job              |
17 | part_col_privs            |
18 | part_col_stats            |
19 | part_privs                |
20 | partition_key_vals        |
21 | partition_keys            |
22 | partition_params          |
23 | partitions                |
24 | roles                     |
25 | sd_params                 |
26 | sds                       |
27 | sequence_table            |
28 | serde_params              |
29 | serdes                    |
30 | skewed_col_names          |
31 | skewed_col_value_loc_map  |
32 | skewed_string_list        |
33 | skewed_string_list_values |
34 | skewed_values             |
35 | sort_cols                 |
36 | tab_col_stats             |
37 | table_params              |
38 | tbl_col_privs             |
39 | tbl_privs                 |
40 | tbls                      |
41 | version                   |
42 |                   |
43 +---------------------------+
44 37 rows in set (0.00 sec)

这里一共有37张表, 我们捡主次分析一下

  • version(存储Hive版本的元数据表)

mysql> select * from version  ;
+--------+----------------+----------------------------------------+
| VER_ID | SCHEMA_VERSION | VERSION_COMMENT                        |
+--------+----------------+----------------------------------------+
|     11 | 1.1.0          | Set by MetaStore hadoop@172.16.202.233 |
+--------+----------------+----------------------------------------+
1 row in set (0.00 sec)
  • 说明
  1. 第一列是ID主键;第二列是hive的版本,第三列是版本说明,并且这张表里只有一条数据,且只能有一条数据,如果这张表被删除,当启动Hive-Cli时候,就会报错”Table ‘hive.version’ doesn’t exist”。
  2. 但是前提示关闭某个参数,如果那个参数开着,那么你如果删除了这张表或者说清空这张表,他都会自动建立,那个参数我忘记是啥了,回头想起来会来补上

 

  • DBS(hive数据库相关的元数据表)

mysql> select * from DBS \G;
*************************** 1. row ***************************
          DB_ID: 1
           DESC: Default Hive database
DB_LOCATION_URI: hdfs://hadoop001:9000/user/hive/warehouse
           NAME: default
     OWNER_NAME: public
     OWNER_TYPE: ROLE
*************************** 2. row ***************************
          DB_ID: 6
           DESC: NULL
DB_LOCATION_URI: hdfs://hadoop001:9000/user/hive/warehouse/hadoop_g6.db
           NAME: hadoop_g6
     OWNER_NAME: hadoop
     OWNER_TYPE: USER
*************************** 3. row ***************************
          DB_ID: 11
           DESC: NULL
DB_LOCATION_URI: hdfs://hadoop001:9000/user/hive/warehouse/ruoze_d6.db
           NAME: ruoze_d6
     OWNER_NAME: hadoop
     OWNER_TYPE: USER
3 rows in set (0.00 sec)
  • 说明:该表存储Hive中所有数据库的基本信息
列名 解释
DB_ID
数据库ID
DESC
数据库描述
DB_LOCATION_URI
数据库HDFS路径
NAME
数据库名
 OWNER_NAME
数据库所有者用户名
OWNER_TYPE
所有者角色

 

  • database_params(hive数据库相关的元数据表)

mysql> desc database_params;
+-------------+---------------+------+-----+---------+-------+
| Field       | Type          | Null | Key | Default | Extra |
+-------------+---------------+------+-----+---------+-------+
| DB_ID       | bigint(20)    | NO   | PRI | NULL    |       |
| PARAM_KEY   | varchar(180)  | NO   | PRI | NULL    |       |
| PARAM_VALUE | varchar(4000) | YES  |     | NULL    |       |
+-------------+---------------+------+-----+---------+-------+
  • 说明:该表存储数据库的相关参数,在CREATE DATABASE时候用 WITH DBPROPERTIES (property_name=property_value, …)指定的参数
字段 说明 示例
DB_ID
数据库ID 11
PARAM_KEY
参数名 createby
PARAM_VALUE 
参数值 root

 

 

  •  TBLS(Hive表和视图相关的元数据表)

mysql> select * from TBLS \G;
*************************** 1. row ***************************
            TBL_ID: 37
       CREATE_TIME: 1555494334
             DB_ID: 1
  LAST_ACCESS_TIME: 0
             OWNER: hadoop
         RETENTION: 0
             SD_ID: 37
          TBL_NAME: makedata_job
          TBL_TYPE: MANAGED_TABLE
VIEW_EXPANDED_TEXT: NULL
VIEW_ORIGINAL_TEXT: NULL
  • 说明:该表中存储Hive表、视图、索引表的基本信息。
TBL_ID
表ID
 CREATE_TIME
创建时间
 DB_ID
数据库ID
LAST_ACCESS_TIME
上次访问时间
OWNER
所有者
RETENTION
保留字段
 SD_ID
序列化配置信息(对应SDS表中的SD_ID
TBL_NAME
表名
TBL_TYPE
表类型
VIEW_EXPANDED_TEXT
视图详细的HQL语句
VIEW_ORIGINAL_TEXT
视图原始的HQL语句
   

 

 

  • table_params(Hive表和视图相关的元数据表)

mysql> select * from table_params;
+--------+-----------------------+-------------+
| TBL_ID | PARAM_KEY             | PARAM_VALUE |
+--------+-----------------------+-------------+
|     37 | COLUMN_STATS_ACCURATE | true        |
|     37 | numFiles              | 5           |
|     37 | numRows               | 0           |
|     37 | rawDataSize           | 0           |
|     37 | totalSize             | 2921282     |
|     37 | transient_lastDdlTime | 1555551458  |
|     42 | EXTERNAL              | TRUE        |
|     42 | transient_lastDdlTime | 1555555620  |
|     46 | COLUMN_STATS_ACCURATE | true        |
|     46 | numFiles              | 1           |
|     46 | numRows               | 500000      |
|     46 | rawDataSize           | 72051224    |
|     46 | totalSize             | 30284817    |
|     46 | transient_lastDdlTime | 1555557177  |
|     51 | EXTERNAL              | TRUE        |
|     51 | transient_lastDdlTime | 1555772013  |
|     52 | COLUMN_STATS_ACCURATE | true        |
|     52 | numFiles              | 1           |
|     52 | numRows               | 500000      |
|     52 | rawDataSize           | 67551224    |
|     52 | totalSize             | 75265591    |
|     52 | transient_lastDdlTime | 1555772485  |
|     56 | COLUMN_STATS_ACCURATE | true        |
|     56 | numFiles              | 1           |
|     56 | numRows               | 500000      |
|     56 | rawDataSize           | 64051224    |
|     56 | totalSize             | 64641768    |
|     56 | transient_lastDdlTime | 1555773864  |
|     66 | COLUMN_STATS_ACCURATE | true        |
|     66 | numFiles              | 1           |
|     66 | numRows               | 500000      |
|     66 | rawDataSize           | 359000000   |
|     66 | totalSize             | 17782969    |
|     66 | transient_lastDdlTime | 1555775575  |
|     67 | COLUMN_STATS_ACCURATE | true        |
|     67 | numFiles              | 1           |
|     67 | numRows               | 500000      |
|     67 | orc.compress          | NONE        |
|     67 | rawDataSize           | 359000000   |
|     67 | totalSize             | 53967047    |
|     67 | transient_lastDdlTime | 1555775880  |
|     68 | COLUMN_STATS_ACCURATE | true        |
|     68 | numFiles              | 1           |
|     68 | numRows               | 500000      |
|     68 | rawDataSize           | 4000000     |
|     68 | totalSize             | 61117546    |
|     68 | transient_lastDdlTime | 1555776185  |
|     69 | COLUMN_STATS_ACCURATE | true        |
|     69 | numFiles              | 1           |
|     69 | numRows               | 500000      |
|     69 | rawDataSize           | 4000000     |
|     69 | totalSize             | 16854027    |
|     69 | transient_lastDdlTime | 1555776356  |
|     71 | COLUMN_STATS_ACCURATE | true        |
|     71 | numFiles              | 1           |
|     71 | numRows               | 1           |
|     71 | rawDataSize           | 0           |
|     71 | totalSize             | 1           |
|     71 | transient_lastDdlTime | 1555809751  |
|     76 | transient_lastDdlTime | 1555836141  |
|     77 | COLUMN_STATS_ACCURATE | true        |
|     77 | numFiles              | 1           |
|     77 | numRows               | 0           |
|     77 | rawDataSize           | 0           |
|     77 | totalSize             | 366         |
|     77 | transient_lastDdlTime | 1555837173  |
+--------+-----------------------+-------------+
  • 说明:该表存储表/视图的属性信息。
字段 dec
TBL_ID
表ID(对应TBLS中的TBL_ID)
PARAM_KEY
属性名
PARAM_VALUES
属性值

 

  •  TBL_PRIVS 该表存储表/视图的授权信息(不做详细说明)

mysql> desc TBL_PRIVS;
+----------------+--------------+------+-----+---------+-------+
| Field          | Type         | Null | Key | Default | Extra |
+----------------+--------------+------+-----+---------+-------+
| TBL_GRANT_ID   | bigint(20)   | NO   | PRI | NULL    |       |
| CREATE_TIME    | int(11)      | NO   |     | NULL    |       |
| GRANT_OPTION   | smallint(6)  | NO   |     | NULL    |       |
| GRANTOR        | varchar(128) | YES  |     | NULL    |       |
| GRANTOR_TYPE   | varchar(128) | YES  |     | NULL    |       |
| PRINCIPAL_NAME | varchar(128) | YES  |     | NULL    |       |
| PRINCIPAL_TYPE | varchar(128) | YES  |     | NULL    |       |
| TBL_PRIV       | varchar(128) | YES  |     | NULL    |       |
| TBL_ID         | bigint(20)   | YES  | MUL | NULL    |       |
+----------------+--------------+------+-----+---------+-------+
9 rows in set (0.01 sec)
TBL_ID对应TBLS中的TBL_ID
  • sds(Hive文件存储信息相关的元数据表)

mysql> desc sds;
+---------------------------+---------------+------+-----+---------+-------+
| Field                     | Type          | Null | Key | Default | Extra |
+---------------------------+---------------+------+-----+---------+-------+
| SD_ID                     | bigint(20)    | NO   | PRI | NULL    |       |
| CD_ID                     | bigint(20)    | YES  | MUL | NULL    |       |
| INPUT_FORMAT              | varchar(4000) | YES  |     | NULL    |       |
| IS_COMPRESSED             | bit(1)        | NO   |     | NULL    |       |
| IS_STOREDASSUBDIRECTORIES | bit(1)        | NO   |     | NULL    |       |
| LOCATION                  | varchar(4000) | YES  |     | NULL    |       |
| NUM_BUCKETS               | int(11)       | NO   |     | NULL    |       |
| OUTPUT_FORMAT             | varchar(4000) | YES  |     | NULL    |       |
| SERDE_ID                  | bigint(20)    | YES  | MUL | NULL    |       |
+---------------------------+---------------+------+-----+---------+-------+

 

  • 说明:文件存储的基本信息:
SD_ID
 
CD_ID 
字段信息ID
INPUT_FORMAT
文件输入格式
IS_COMPRESSED
是否压缩
IS_STOREDASSUBDIRECTORIES 
是否以子目录存储
LOCATION
HDFS路径
 NUM_BUCKETS
分桶数量
 OUTPUT_FORMAT 
文件输出格式
SERDE_ID 
序列化类ID
字段 说明

 

  • SD_PARAMS(Hive文件存储信息相关的元数据表)

mysql> desc SD_PARAMS;
+-------------+---------------+------+-----+---------+-------+
| Field       | Type          | Null | Key | Default | Extra |
+-------------+---------------+------+-----+---------+-------+
| SD_ID       | bigint(20)    | NO   | PRI | NULL    |       |
| PARAM_KEY   | varchar(256)  | NO   | PRI | NULL    |       |
| PARAM_VALUE | varchar(4000) | YES  |     | NULL    |       |
+-------------+---------------+------+-----+---------+-------+
3 rows in set (0.00 sec)
  • 说明:该表存储Hive存储的属性信息,在创建表时候使用 

STORED BY ‘storage.handler.class.name’ [WITH SERDEPROPERTIES (…)指定。

  • serdes(Hive文件存储信息相关的元数据表)

mysql> select * from serdes; 
+----------+------+-------------------------------------------------------------+
| SERDE_ID | NAME | SLIB                                                        |
+----------+------+-------------------------------------------------------------+
|       37 | NULL | org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe          |
|       42 | NULL | org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe          |
|       43 | NULL | org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe          |
|       46 | NULL | org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe          |
|       51 | NULL | org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe          |
|       52 | NULL | org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe          |
|       56 | NULL | org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe        |
|       66 | NULL | org.apache.hadoop.hive.ql.io.orc.OrcSerde                   |
|       67 | NULL | org.apache.hadoop.hive.ql.io.orc.OrcSerde                   |
|       68 | NULL | org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe |
|       69 | NULL | org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe |
|       71 | NULL | org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe          |
|       76 | NULL | org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe          |
|       77 | NULL | org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe          |
+----------+------+-------------------------------------------------------------+
14 rows in set (0.00 sec)
  • 说明:该表存储序列化使用的类信息
字段 字段说明
SERDE_ID 
序列化类配置ID(对应SDS的SERDE_ID )

NAME 
序列化类别名
SLIB
序列化类
  • serde_params(Hive文件存储信息相关的元数据表)

mysql> select * from serde_params; 
+----------+----------------------+-------------+
| SERDE_ID | PARAM_KEY            | PARAM_VALUE |
+----------+----------------------+-------------+
|       37 | field.delim          |                |
|       37 | serialization.format |                |
|       42 | field.delim          |                |
|       42 | serialization.format |                |
|       43 | field.delim          |                |
|       43 | serialization.format |                |
|       46 | serialization.format | 1           |
|       51 | field.delim          |                |
|       51 | serialization.format |                |
|       52 | serialization.format | 1           |
|       56 | serialization.format | 1           |
|       66 | serialization.format | 1           |
|       67 | serialization.format | 1           |
|       68 | serialization.format | 1           |
|       69 | serialization.format | 1           |
|       71 | serialization.format | 1           |
|       76 | field.delim          |                |
|       76 | serialization.format |                |
|       77 | field.delim          |                |
|       77 | serialization.format |                |
+----------+----------------------+-------------+
20 rows in set (0.00 sec)
  • 说明:该表存储序列化的一些属性、格式信息,比如:行、列分隔符
字段 字段说明
SERDE_ID 
序列化类配置ID(对应SDS的SERDE_ID )
PARAM_KEY          
属性名
PARAM_VALUE 
属性值

 

  • columns_v2Hive表字段相关的元数据表

mysql> select * from columns_v2; 
+-------+---------+-------------+--------------+-------------+
| CD_ID | COMMENT | COLUMN_NAME | TYPE_NAME    | INTEGER_IDX |
+-------+---------+-------------+--------------+-------------+
|    37 | NULL    | ip          | varchar(20)  |           4 |
|    37 | NULL    | levelnm     | varchar(6)   |           2 |
|    37 | NULL    | region      | varchar(6)   |           1 |
|    37 | NULL    | time_random | varchar(20)  |           3 |
|    37 | NULL    | traffic     | varchar(12)  |           7 |
|    37 | NULL    | urlid       | varchar(100) |           6 |
|    37 | NULL    | urlnm       | varchar(6)   |           0 |
|    37 | NULL    | urlym       | varchar(20)  |           5 |
|    42 | NULL    | cdn         | string       |           0 |
|    42 | NULL    | domain      | string       |           5 |
|    42 | NULL    | ip          | string       |           4 |
|    42 | NULL    | level       | string       |           2 |
|    42 | NULL    | region      | string       |           1 |
|    42 | NULL    | time        | string       |           3 |
|    42 | NULL    | traffic     | bigint       |           7 |
|    42 | NULL    | url         | string       |           6 |
+-------+---------+-------------+--------------+-------------+
17 rows in set (0.00 sec)
  • 说明:表的字段信息
字段 字段说明
CD_ID 
字段信息ID(对应表SDS的CD_ID)
COMMENT 
字段注释
COLUMN_NAME 
字段名
TYPE_NAME    
字段类型
INTEGER_IDX 
字段顺序

 

 

 

  • partitions(Hive表分区相关的元数据表)

mysql> select * from partitions ; 
+---------+-------------+------------------+--------------+-------+--------+
| PART_ID | CREATE_TIME | LAST_ACCESS_TIME | PART_NAME    | SD_ID | TBL_ID |
+---------+-------------+------------------+--------------+-------+--------+
|      21 |  1555555926 |                0 | day=20190418 |    43 |     42 |
+---------+-------------+------------------+--------------+-------+--------+
1 row in set (0.00 sec)
  • 说明:分区的基本信息
 字段 字段说明 
PART_ID
 分区ID
CREATE_TIME 
 分区创建时间
LAST_ACCESS_TIME 
 最后一次访问时间
PART_NAME
 分区名称
 
SD_ID 
 分区存储ID
 
TBL_ID 
 表ID

 

 

  • partition_keys(Hive表分区相关的元数据表)

mysql> select * from partition_keys; 
+--------+--------------+-----------+-----------+-------------+
| TBL_ID | PKEY_COMMENT | PKEY_NAME | PKEY_TYPE | INTEGER_IDX |
+--------+--------------+-----------+-----------+-------------+
|     42 | NULL         | day       | string    |           0 |
+--------+--------------+-----------+-----------+-------------+
1 row in set (0.00 sec)
  • 说明:分区的字段信息
字段名称 字段说明
TBL_ID 
表ID
PKEY_COMMENT 
分区字段说明
PKEY_NAME 
分区字段名称
PKEY_TYPE 
分区字段类型
INTEGER_IDX 
分区字段顺序

 

 

  • partition_key_vals(Hive表分区相关的元数据表)

mysql> select * from partition_key_vals; 
+---------+--------------+-------------+
| PART_ID | PART_KEY_VAL | INTEGER_IDX |
+---------+--------------+-------------+
|      21 | 20190418     |           0 |
+---------+--------------+-------------+
1 row in set (0.00 sec)
  • 说明:该表存储分区字段值
字段 字段说明
PART_ID 
分区ID
PART_KEY_VAL 
分区字段值
INTEGER_IDX 
分区字段值顺序

 

  • partition_params(Hive表分区相关的元数据表)

mysql> select * from partition_params; 
+---------+-----------------------+-------------+
| PART_ID | PARAM_KEY             | PARAM_VALUE |
+---------+-----------------------+-------------+
|      21 | COLUMN_STATS_ACCURATE | true        |
|      21 | numFiles              | 1           |
|      21 | totalSize             | 29975501    |
|      21 | transient_lastDdlTime | 1555556171  |
+---------+-----------------------+-------------+
4 rows in set (0.00 sec)
  • 说明:该表存储分区的属性信息.
字段 字段说明
PART_ID 
分区ID
PARAM_KEY 
分区属性名
PARAM_VALUE 
分区属性值

 

 

 

  • 其他不常用的元数据表


此图转载于https://mp.weixin.qq.com/s/c2C4SYaj-GUP6hTkPNV_hQ

参考博客:https://mp.weixin.qq.com/s/c2C4SYaj-GUP6hTkPNV_hQ

posted @ 2019-04-23 13:50  任重而道远的小蜗牛  阅读(10621)  评论(0编辑  收藏  举报