Hive Partitioned Tables and Bucketed Tables
1. Creating a partitioned table
hive> create table weather_list(year int,data int) partitioned by (createtime string,area string) row format delimited fields terminated by ",";
Altering the table (renaming columns):

hive> alter table weather_list change data new_data int;
hive> alter table weather_list change year new_year int;
1.1 Loading data
hive> load data local inpath '/home/hadoop/sampler/w2.csv' into table weather_list partition(createtime='2011-01-01',area='bj');
Loading data to table busdata.weather_list partition (createtime=2011-01-01, area=bj)
OK
Time taken: 1.455 seconds
hive> load data local inpath '/home/hadoop/sampler/w3.csv' into table weather_list partition(createtime='2011-01-02',area='sc');
Loading data to table busdata.weather_list partition (createtime=2011-01-02, area=sc)
OK
Time taken: 1.394 seconds
hive> load data local inpath '/home/hadoop/sampler/w4.csv' into table weather_list partition(createtime='2011-01-03',area='tj');
Loading data to table busdata.weather_list partition (createtime=2011-01-03, area=tj)
OK
Time taken: 1.568 seconds
hive> load data local inpath '/home/hadoop/sampler/w4.csv' into table weather_list partition(createtime='2011-01-04',area='sz');
Loading data to table busdata.weather_list partition (createtime=2011-01-04, area=sz)
OK
Time taken: 1.209 seconds
hive> load data local inpath '/home/hadoop/sampler/w5.csv' into table weather_list partition(createtime='2011-01-05',area='gz');
Loading data to table busdata.weather_list partition (createtime=2011-01-05, area=gz)
OK
Time taken: 1.148 seconds
hive> load data local inpath '/home/hadoop/sampler/w5.csv' into table weather_list partition(createtime='2011-01-01',area='gz');
Loading data to table busdata.weather_list partition (createtime=2011-01-01, area=gz)
OK
Time taken: 1.278 seconds
The partition columns appear only in the storage directory names; they have nothing to do with the fields actually stored in the data files.
hive> dfs -ls /hive/warehouse/busdata.db/weather_list/createtime=2011-01-01/area=gz;
Found 6 items
-rw-r--r--   1 hadoop supergroup      18018 2019-03-05 22:14 /hive/warehouse/busdata.db/weather_list/createtime=2011-01-01/area=gz/w1.csv
-rw-r--r--   1 hadoop supergroup      18022 2019-03-05 22:14 /hive/warehouse/busdata.db/weather_list/createtime=2011-01-01/area=gz/w2.csv
-rw-r--r--   1 hadoop supergroup      18028 2019-03-05 22:14 /hive/warehouse/busdata.db/weather_list/createtime=2011-01-01/area=gz/w3.csv
-rw-r--r--   1 hadoop supergroup      18022 2019-03-05 22:14 /hive/warehouse/busdata.db/weather_list/createtime=2011-01-01/area=gz/w4.csv
-rw-r--r--   1 hadoop supergroup      18027 2019-03-05 22:12 /hive/warehouse/busdata.db/weather_list/createtime=2011-01-01/area=gz/w5.csv
-rw-r--r--   1 hadoop supergroup      18027 2019-03-05 22:14 /hive/warehouse/busdata.db/weather_list/createtime=2011-01-01/area=gz/w5_copy_1.csv
hive> dfs -ls /hive/warehouse/busdata.db/weather_list/createtime=2011-01-01;
Found 2 items
drwxr-xr-x   - hadoop supergroup          0 2019-03-05 22:09 /hive/warehouse/busdata.db/weather_list/createtime=2011-01-01/area=bj
drwxr-xr-x   - hadoop supergroup          0 2019-03-05 22:14 /hive/warehouse/busdata.db/weather_list/createtime=2011-01-01/area=gz
1.2 Showing partition information
hive> show partitions weather_list;
OK
createtime=2010-01-01/area=bj
createtime=2010-01-01/area=sh
createtime=2010-01-01/area=yn
createtime=2010-01-02/area=sh
createtime=2011-01-01/area=bj
createtime=2011-01-01/area=gz
createtime=2011-01-02/area=sc
createtime=2011-01-03/area=tj
createtime=2011-01-04/area=sz
createtime=2011-01-05/area=gz
Time taken: 0.584 seconds, Fetched: 10 row(s)
1.3 Partition columns are real columns of the table, but their values are not stored in the data files; they are read back from the directory names.
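How partition values come out of the directory layout rather than the file contents can be illustrated with a short Python sketch (the helper below is purely illustrative, not part of Hive; the path is copied from the dfs -ls listing above):

```python
def partition_values(path: str) -> dict:
    # Every "key=value" segment of the warehouse path is a partition column;
    # the data file itself (w1.csv) contains neither column.
    return dict(seg.split("=", 1) for seg in path.split("/") if "=" in seg)

p = "/hive/warehouse/busdata.db/weather_list/createtime=2011-01-01/area=gz/w1.csv"
print(partition_values(p))  # {'createtime': '2011-01-01', 'area': 'gz'}
```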
hive> select * from weather_list where area='bj' limit 10;
OK
1999    71    2010-01-01    bj
1994    57    2010-01-01    bj
1995    33    2010-01-01    bj
1993    44    2010-01-01    bj
1994    99    2010-01-01    bj
1994    83    2010-01-01    bj
1995    59    2010-01-01    bj
1991    32    2010-01-01    bj
1992    74    2010-01-01    bj
2000    56    2010-01-01    bj
Time taken: 2.527 seconds, Fetched: 10 row(s)
2. Bucketed tables
2.1 Check the bucketing property. Enabling it lets Hive bucket the data automatically: each row is assigned to a bucket by hashing the bucket column and taking the remainder modulo the number of buckets. Alternatively, you can bucket the data yourself and then import it.
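The hash-and-modulo rule can be sketched in Python. This mirrors Hive's legacy hash (bucketing version 1), where the hash of an int column is the value itself; note that Hive 3 tables with bucketing_version=2 use Murmur3 instead, so actual assignments there can differ:

```python
def bucket_for(key: int, num_buckets: int) -> int:
    # Legacy Hive bucketing: bucket = (hash & Integer.MAX_VALUE) % num_buckets.
    # For an int column the legacy hash is the value itself.
    return (key & 0x7FFFFFFF) % num_buckets

userids = [1, 2, 3, 4, 5, 6]
by_bucket = {b: [u for u in userids if bucket_for(u, 2) == b] for b in (0, 1)}
print(by_bucket)  # {0: [2, 4, 6], 1: [1, 3, 5]}
```

With two buckets on an int key, this rule splits rows by parity: bucket 0 becomes file 000000_0 and bucket 1 becomes 000001_0.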
hive> set hive.enforce.bucketing;
hive.enforce.bucketing=false
hive> set hive.enforce.bucketing=true;
hive> set hive.enforce.bucketing;
hive.enforce.bucketing=true
2.2 Creating a bucketed table
hive> create table bucket_userinfo(userid int,username string) clustered by (userid) sorted by (userid asc) into 2 buckets row format delimited fields terminated by ",";
hive> desc formatted bucket_userinfo;
OK
# col_name            	data_type           	comment
userid              	int
username            	string

# Detailed Table Information
Database:           	busdata
OwnerType:          	USER
Owner:              	hadoop
CreateTime:         	Wed Mar 06 23:11:37 CST 2019
LastAccessTime:     	UNKNOWN
Retention:          	0
Location:           	hdfs://bigdata-senior01.home.com:9000/hive/warehouse/busdata.db/bucket_userinfo
Table Type:         	MANAGED_TABLE
Table Parameters:
	COLUMN_STATS_ACCURATE	{\"BASIC_STATS\":\"true\",\"COLUMN_STATS\":{\"userid\":\"true\",\"username\":\"true\"}}
	SORTBUCKETCOLSPREFIX	TRUE
	bucketing_version   	2
	numFiles            	0
	numRows             	0
	rawDataSize         	0
	totalSize           	0
	transient_lastDdlTime	1551885097

# Storage Information
SerDe Library:      	org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat:        	org.apache.hadoop.mapred.TextInputFormat
OutputFormat:       	org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed:         	No
Num Buckets:        	2
Bucket Columns:     	[userid]
Sort Columns:       	[Order(col:userid, order:1)]
Storage Desc Params:
	field.delim         	,
	serialization.format	,
Time taken: 0.379 seconds, Fetched: 34 row(s)
2.3 Letting Hive bucket the data automatically; this applies when the source data has already been loaded into Hive.
hive> insert overwrite table bucket_userinfo select userid,username from userinfo;

Hive then launches a job that buckets the data as it is inserted. This example uses two buckets, so two files are produced, with the rows split by the parity of userid.
hive> dfs -ls /hive/warehouse/busdata.db/bucket_userinfo;
Found 2 items
-rw-r--r--   1 hadoop supergroup        106 2019-03-06 23:13 /hive/warehouse/busdata.db/bucket_userinfo/000000_0
-rw-r--r--   1 hadoop supergroup        103 2019-03-06 23:13 /hive/warehouse/busdata.db/bucket_userinfo/000001_0
hive> dfs -cat /hive/warehouse/busdata.db/bucket_userinfo/000000_0;
2,xu.dm
4,user123
6,user2
8,user4
10,user6
14,user8
16,user10
18,user12
20,user14
22,soldier2
24,soldier4
hive> dfs -cat /hive/warehouse/busdata.db/bucket_userinfo/000001_0;
1,admin
3,myuser
5,user1
7,user3
9,user5
13,user7
15,user9
17,user11
19,user13
21,soldier1
23,soldier3
hive> select * from bucket_userinfo;
OK
2	xu.dm
4	user123
6	user2
8	user4
10	user6
14	user8
16	user10
18	user12
20	user14
22	soldier2
24	soldier4
1	admin
3	myuser
5	user1
7	user3
9	user5
13	user7
15	user9
17	user11
19	user13
21	soldier1
23	soldier3
Time taken: 0.238 seconds, Fetched: 22 row(s)
2.4 Loading data from an external file; the result is the same as above.
hive> create table bucket_userinfo2(userid int,username string) clustered by (userid) sorted by (userid) into 2 buckets row format delimited fields terminated by ",";
hive> load data local inpath '/home/hadoop/userinfo2.txt' into table bucket_userinfo2;
hive> select * from bucket_userinfo2;
OK
2	xu.dm
4	user123
6	user2
8	user4
10	user6
14	user8
16	user10
18	user12
20	user14
22	soldier2
24	soldier4
1	admin
3	myuser
5	user1
7	user3
9	user5
13	user7
15	user9
17	user11
19	user13
21	soldier1
23	soldier3
hive> dfs -ls /hive/warehouse/busdata.db/bucket_userinfo2;
Found 2 items
-rw-r--r--   1 hadoop supergroup        106 2019-03-07 22:44 /hive/warehouse/busdata.db/bucket_userinfo2/000000_0
-rw-r--r--   1 hadoop supergroup        103 2019-03-07 22:44 /hive/warehouse/busdata.db/bucket_userinfo2/000001_0
2.5 Sampling bucketed data
hive> select * from bucket_userinfo tablesample(bucket 1 out of 2 on userid);
OK
2	xu.dm
6	user2
10	user6
20	user14
3	myuser
7	user3
17	user11
19	user13
21	soldier1
Time taken: 0.077 seconds, Fetched: 9 row(s)
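The selection rule behind TABLESAMPLE(BUCKET x OUT OF y ON col) is: keep the rows whose hash(col) % y equals x - 1. A minimal sketch, taking the hash value as an input rather than computing it (the actual hash function depends on the bucketing version; with Murmur3 hashing in newer Hive releases the sampled rows are not simply the even userids, which matches the mixed output above):

```python
def in_sample(hashed: int, x: int, y: int) -> bool:
    # BUCKET x OUT OF y: keep rows whose (hash % y) == x - 1.
    # 'hashed' stands in for hash(col), whatever hash the table uses.
    return (hashed & 0x7FFFFFFF) % y == x - 1

# With an identity stand-in hash, bucket 1 out of 2 selects the even keys.
sample = [k for k in range(1, 11) if in_sample(k, 1, 2)]
print(sample)  # [2, 4, 6, 8, 10]
```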