A simple hands-on Hive project
Unzip user.zip:
[root@hadoop1 test]# unzip user.zip -d /test/bigdatacase/dataset
Archive: user.zip
inflating: /test/bigdatacase/dataset/raw_user.csv
inflating: /test/bigdatacase/dataset/small_user.csv
List the two extracted files, then look at the first few lines of raw_user.csv:
[root@hadoop1 dataset]# ll
总用量 1025444
-rw-r--r--. 1 root root 1036519392 3月 3 2016 raw_user.csv
-rw-r--r--. 1 root root 13530111 11月 26 2016 small_user.csv
[root@hadoop1 dataset]# head -5 raw_user.csv
user_id,item_id,behavior_type,user_geohash,item_category,time
10001082,285259775,1,97lk14c,4076,2014-12-08 18
10001082,4368907,1,,5503,2014-12-12 12
10001082,4368907,1,,5503,2014-12-12 12
10001082,53616768,1,,9762,2014-12-02 15
[root@hadoop1 dataset]#
Each record contains 6 comma-separated fields, with the following meanings:
user_id (user ID)
item_id (item ID)
behavior_type (the user action: browse, favorite, add to cart, or purchase, encoded as 1, 2, 3, 4 respectively)
user_geohash (geohash of the user's location; it is missing in some records, so the preprocessing script below drops this field entirely)
item_category (item category)
time (when the record was generated)
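Each line should therefore split into exactly 6 comma-separated fields. A quick awk check on the sample rows above confirms this (a hypothetical sanity check, not part of the original walkthrough):

```shell
# NF is awk's per-record field count; with -F ',' an empty user_geohash
# (the ",," in the second row) still counts as a field.
printf '%s\n' \
  '10001082,285259775,1,97lk14c,4076,2014-12-08 18' \
  '10001082,4368907,1,,5503,2014-12-12 12' |
awk -F ',' '{ print NF }'
```

Both rows print 6; the space inside the time field does not matter, because awk splits on commas only.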
I. Preprocessing the dataset
1. Delete the first line of each file (the header row)
The first line of both raw_user.csv and small_user.csv is the header row. Hive does not want the header when the data is loaded into the warehouse, so delete it as part of preprocessing:
[root@hadoop1 dataset]# sed -i '1d' raw_user.csv
[root@hadoop1 dataset]# head -5 raw_user.csv
10001082,285259775,1,97lk14c,4076,2014-12-08 18
10001082,4368907,1,,5503,2014-12-12 12
10001082,4368907,1,,5503,2014-12-12 12
10001082,53616768,1,,9762,2014-12-02 15
10001082,151466952,1,,5232,2014-12-12 11
[root@hadoop1 dataset]#
Do the same for small_user.csv:
[root@hadoop1 dataset]# sed -i '1d' small_user.csv
[root@hadoop1 dataset]# head -5 small_user.csv
10001082,285259775,1,97lk14c,4076,2014-12-08 18
10001082,4368907,1,,5503,2014-12-12 12
10001082,4368907,1,,5503,2014-12-12 12
10001082,53616768,1,,9762,2014-12-02 15
10001082,151466952,1,,5232,2014-12-12 11
[root@hadoop1 dataset]#
From here on, the small dataset small_user.csv is used to save time. Once the whole pipeline runs end to end, you can repeat it with the full raw_user.csv dataset.
2. Preprocess the fields
Next, preprocess the fields: add an id field to each record (to make records unique), add a province field (for later visualization), and drop the user_geohash field (not needed in the analysis).
Create a script named pre_deal.sh in the dataset directory, alongside small_user.csv:
[root@hadoop1 dataset]# cd /test/bigdatacase/dataset
[root@hadoop1 dataset]# vim pre_deal.sh
#!/bin/bash
# The first argument passed to pre_deal.sh is the input file name
infile=$1
# The second argument is the output file name
outfile=$2
# awk -F "," 'processing logic' $infile > $outfile
awk -F "," 'BEGIN{
srand();
id=0;
Province[0]="山东";Province[1]="山西";Province[2]="河南";Province[3]="河北";Province[4]="陕西";Province[5]="内蒙古";Province[6]="上海市";
Province[7]="北京市";Province[8]="重庆市";Province[9]="天津市";Province[10]="福建";Province[11]="广东";Province[12]="广西";Province[13]="云南";
Province[14]="浙江";Province[15]="贵州";Province[16]="新疆";Province[17]="西藏";Province[18]="江西";Province[19]="湖南";Province[20]="湖北";
Province[21]="黑龙江";Province[22]="吉林";Province[23]="辽宁"; Province[24]="江苏";Province[25]="甘肃";Province[26]="青海";Province[27]="四川";
Province[28]="安徽"; Province[29]="宁夏";Province[30]="海南";Province[31]="香港";Province[32]="澳门";Province[33]="台湾";
}
{
id=id+1;
value=int(rand()*34);
print id"\t"$1"\t"$2"\t"$3"\t"$5"\t"substr($6,1,10)"\t"Province[value]
}' $infile > $outfile
In this script, the -F "," option tells awk that the fields in each record are separated by commas.
srand() seeds the random number generator. id is a new field added to the dataset; it is incremented by 1 for each record, which guarantees that every record gets a unique value.
A province field is also added to each record to support the later visualization step. The Province[] array holds the names of China's provinces and regions. For each record, value=int(rand()*34) generates a random integer between 0 and 33, which is used as an index into Province[] to pick the province name appended to that record.
substr($6,1,10) truncates the time field to its year-month-day part, so it can later be stored as a DATE.
Each input record has 6 fields, and the 6th is the time field, a string like "2014-12-08 18" (i.e., 18:00 on December 8, 2014). substr($6,1,10) takes the first 10 characters of that field, discarding the hour and keeping only the date.
In the line print id"\t"$1"\t"$2"\t"$3"\t"$5"\t"substr($6,1,10)"\t"Province[value], the 4th field (user_geohash) is dropped, as explained above. The generated file is tab-separated.
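Before running the script over the whole file, the awk transformation can be spot-checked on a couple of sample lines. The sketch below reproduces the same field handling but leaves out the random province column, so the output is deterministic (an uninitialized awk variable starts at 0, so no BEGIN block is needed here):

```shell
# Same transformation as pre_deal.sh minus the random province:
# add an incrementing id, drop field 4 (user_geohash),
# and truncate field 6 (time) to its first 10 characters (the date).
printf '%s\n' \
  '10001082,285259775,1,97lk14c,4076,2014-12-08 18' \
  '10001082,4368907,1,,5503,2014-12-12 12' |
awk -F ',' '{ id = id + 1; print id"\t"$1"\t"$2"\t"$3"\t"$5"\t"substr($6, 1, 10) }'
```

Each output line should show the new id, the kept fields, and a bare date where the time field used to be.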
Run the script to preprocess small_user.csv:
[root@hadoop1 dataset]# ./pre_deal.sh small_user.csv user_table.txt
-bash: ./pre_deal.sh: Permission denied
# add execute permission
[root@hadoop1 dataset]# chmod +x pre_deal.sh
[root@hadoop1 dataset]# ./pre_deal.sh small_user.csv user_table.txt
[root@hadoop1 dataset]# head -10 user_table.txt
1 10001082 285259775 1 4076 2014-12-08 香港
2 10001082 4368907 1 5503 2014-12-12 山西
3 10001082 4368907 1 5503 2014-12-12 湖北
4 10001082 53616768 1 9762 2014-12-02 河北
5 10001082 151466952 1 5232 2014-12-12 江西
6 10001082 53616768 4 9762 2014-12-02 广东
7 10001082 290088061 1 5503 2014-12-12 西藏
8 10001082 298397524 1 10894 2014-12-12 湖北
9 10001082 32104252 1 6513 2014-12-12 甘肃
10 10001082 323339743 1 10894 2014-12-12 江西
3. Import into Hive
Finally, load the data in user_table.txt into the Hive data warehouse: upload user_table.txt to HDFS, then create an external table in Hive over that location.
Upload user_table.txt from the local Linux file system to the HDFS directory "/bigdatacase/dataset": create a new directory bigdatacase under the HDFS root, with a dataset subdirectory inside it:
[root@hadoop1 dataset]# hdfs dfs -mkdir -p /bigdatacase/dataset
[root@hadoop1 dataset]# hdfs dfs -put /test/bigdatacase/dataset/user_table.txt /bigdatacase/dataset
Check the upload:
[root@hadoop1 dataset]# hdfs dfs -cat /bigdatacase/dataset/user_table.txt | head -10
1 10001082 285259775 1 4076 2014-12-08 香港
2 10001082 4368907 1 5503 2014-12-12 山西
3 10001082 4368907 1 5503 2014-12-12 湖北
4 10001082 53616768 1 9762 2014-12-02 河北
5 10001082 151466952 1 5232 2014-12-12 江西
6 10001082 53616768 4 9762 2014-12-02 广东
7 10001082 290088061 1 5503 2014-12-12 西藏
8 10001082 298397524 1 10894 2014-12-12 湖北
9 10001082 32104252 1 6513 2014-12-12 甘肃
10 10001082 323339743 1 10894 2014-12-12 江西
cat: Unable to write to output stream.
[root@hadoop1 dataset]#
(The "Unable to write to output stream" message is harmless: head exits after 10 lines and closes the pipe, so hdfs dfs -cat can no longer write to it.)
Create an external table
First create a database, then create an external table bigdata_user in it with the columns (id, uid, item_id, behavior_type, item_category, visit_date, province). Enter the following at the hive prompt:
hive (default)> CREATE DATABASE dblab;
OK
hive (default)> use dblab;
OK
Time taken: 0.692 seconds
CREATE EXTERNAL TABLE bigdata_user(
id INT,
uid STRING,
item_id STRING,
behavior_type INT,
item_category STRING,
visit_data DATE,
province STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/bigdatacase/dataset';
The data under the HDFS directory "/bigdatacase/dataset" is now available in Hive. Note that the date column was accidentally created as visit_data instead of visit_date; rename it:
hive (dblab)> ALTER TABLE bigdata_user CHANGE visit_data visit_date date;
OK
Time taken: 9.862 seconds
hive (dblab)> select * from bigdata_user limit 10;
OK
bigdata_user.id bigdata_user.uid bigdata_user.item_id bigdata_user.behavior_type bigdata_user.item_category bigdata_user.visit_date bigdata_user.province
1 10001082 285259775 1 4076 2014-12-08 香港
2 10001082 4368907 1 5503 2014-12-12 山西
3 10001082 4368907 1 5503 2014-12-12 湖北
4 10001082 53616768 1 9762 2014-12-02 河北
5 10001082 151466952 1 5232 2014-12-12 江西
6 10001082 53616768 4 9762 2014-12-02 广东
7 10001082 290088061 1 5503 2014-12-12 西藏
8 10001082 298397524 1 10894 2014-12-12 湖北
9 10001082 32104252 1 6513 2014-12-12 甘肃
10 10001082 323339743 1 10894 2014-12-12 江西
Time taken: 12.896 seconds, Fetched: 10 row(s)
hive (dblab)> select behavior_type from bigdata_user limit 10;
OK
behavior_type
1
1
1
1
1
4
1
1
1
1
Time taken: 6.927 seconds, Fetched: 10 row(s)
Querying through Beeline gives neater, tabular output:
0: jdbc:hive2://hadoop001:10000> select * from bigdata_user limit 10;
+------------------+-------------------+-----------------------+-----------------------------+-----------------------------+--------------------------+------------------------+
| bigdata_user.id | bigdata_user.uid | bigdata_user.item_id | bigdata_user.behavior_type | bigdata_user.item_category | bigdata_user.visit_date | bigdata_user.province |
+------------------+-------------------+-----------------------+-----------------------------+-----------------------------+--------------------------+------------------------+
| 1 | 10001082 | 285259775 | 1 | 4076 | 2014-12-08 | 香港 |
| 2 | 10001082 | 4368907 | 1 | 5503 | 2014-12-12 | 山西 |
| 3 | 10001082 | 4368907 | 1 | 5503 | 2014-12-12 | 湖北 |
| 4 | 10001082 | 53616768 | 1 | 9762 | 2014-12-02 | 河北 |
| 5 | 10001082 | 151466952 | 1 | 5232 | 2014-12-12 | 江西 |
| 6 | 10001082 | 53616768 | 4 | 9762 | 2014-12-02 | 广东 |
| 7 | 10001082 | 290088061 | 1 | 5503 | 2014-12-12 | 西藏 |
| 8 | 10001082 | 298397524 | 1 | 10894 | 2014-12-12 | 湖北 |
| 9 | 10001082 | 32104252 | 1 | 6513 | 2014-12-12 | 甘肃 |
| 10 | 10001082 | 323339743 | 1 | 10894 | 2014-12-12 | 江西 |
+------------------+-------------------+-----------------------+-----------------------------+-----------------------------+--------------------------+------------------------+
10 rows selected (1.055 seconds)
0: jdbc:hive2://hadoop001:10000> select behavior_type from bigdata_user limit 10;
+----------------+
| behavior_type |
+----------------+
| 1 |
| 1 |
| 1 |
| 1 |
| 1 |
| 4 |
| 1 |
| 1 |
| 1 |
| 1 |
+----------------+
10 rows selected (6.337 seconds)
0: jdbc:hive2://hadoop001:10000>
Inspect the full table definition with show create table bigdata_user; and the column list with desc bigdata_user;:
hive (dblab)> show create table bigdata_user;
OK
createtab_stmt
CREATE EXTERNAL TABLE `bigdata_user`(
`id` int,
`uid` string,
`item_id` string,
`behavior_type` int,
`item_category` string,
`visit_date` date,
`province` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
'field.delim'='\t',
'serialization.format'='\t')
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
'hdfs://192.168.1.11:8020/bigdatacase/dataset'
TBLPROPERTIES (
'bucketing_version'='2',
'transient_lastDdlTime'='1605450738')
Time taken: 35.032 seconds, Fetched: 22 row(s)
hive (dblab)>
hive (dblab)> desc bigdata_user;
OK
col_name data_type comment
id int
uid string
item_id string
behavior_type int
item_category string
visit_date date
province string
Time taken: 14.897 seconds, Fetched: 7 row(s)
hive (dblab)>
II. Simple query analysis
Start with a few simple queries.
Query the behavior type of the first 10 records:
SELECT behavior_type FROM bigdata_user LIMIT 10;
hive (dblab)> SELECT behavior_type FROM bigdata_user LIMIT 10;
OK
behavior_type
1
1
1
1
1
4
1
1
1
1
Time taken: 10.666 seconds, Fetched: 10 row(s)
Query the visit date and item category of the first 20 records:
SELECT visit_date,item_category FROM bigdata_user LIMIT 20;
hive (dblab)> SELECT visit_date,item_category FROM bigdata_user LIMIT 20;
OK
visit_date item_category
2014-12-08 4076
2014-12-12 5503
2014-12-12 5503
2014-12-02 9762
2014-12-12 5232
2014-12-02 9762
2014-12-12 5503
2014-12-12 10894
2014-12-12 6513
2014-12-12 10894
2014-12-12 2825
2014-11-28 2825
2014-12-15 3200
2014-12-03 10576
2014-11-20 10576
2014-12-13 10576
2014-12-08 10576
2014-12-14 7079
2014-12-02 6669
2014-12-12 5232
Time taken: 7.74 seconds, Fetched: 20 row(s)
Queries can also use nested subqueries with column aliases:
hive (dblab)> SELECT e.bh, e.it FROM (SELECT behavior_type AS bh, item_category AS it FROM bigdata_user) AS e LIMIT 20;
OK
e.bh e.it
1 4076
1 5503
1 5503
1 9762
1 5232
4 9762
1 5503
1 10894
1 6513
1 10894
1 2825
1 2825
1 3200
1 10576
1 10576
1 10576
1 10576
1 7079
1 6669
1 5232
Time taken: 5.872 seconds, Fetched: 20 row(s)
III. Row-count statistics
(1) Use the aggregate function count() to compute the number of rows in the table:
SELECT COUNT(*) FROM bigdata_user;
hive (dblab)> SELECT COUNT(*) FROM bigdata_user;
Query ID = root_20201115231503_8cce31a9-81cb-4ef3-92ca-94bafb2086b4
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
2020-11-15 23:17:28,850 INFO [6d9dd317-47d9-447c-bba6-8b3d0a9f3ae9 main] client.ConfiguredRMFailoverProxyProvider: Failing over to rm2
Starting Job = job_1605437253399_0001, Tracking URL = http://192.168.1.12:8089/proxy/application_1605437253399_0001/
Kill Command = /program/hadoop-3.2.1/bin/mapred job -kill job_1605437253399_0001
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2020-11-15 23:20:28,863 Stage-1 map = 0%, reduce = 0%
2020-11-15 23:21:29,392 Stage-1 map = 0%, reduce = 0%
2020-11-15 23:22:27,725 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 5.91 sec
2020-11-15 23:23:10,606 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 11.72 sec
MapReduce Total cumulative CPU time: 11 seconds 970 msec
Ended Job = job_1605437253399_0001
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 11.97 sec HDFS Read: 15603831 HDFS Write: 106 SUCCESS
Total MapReduce CPU Time Spent: 11 seconds 970 msec
OK
_c0
300000
Time taken: 501.726 seconds, Fetched: 1 row(s)
(2) Add DISTINCT inside the function to count how many distinct uid values there are:
SELECT COUNT(DISTINCT uid) FROM bigdata_user;
hive (dblab)> SELECT COUNT(DISTINCT uid) FROM bigdata_user;
Query ID = root_20201115232536_31d143aa-c5ed-4b53-b0a6-8c64f14f8634
Starting Job = job_1605437253399_0002, Tracking URL = http://192.168.1.12:8089/proxy/application_1605437253399_0002/
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
...
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 10.77 sec HDFS Read: 15599879 HDFS Write: 103 SUCCESS
Total MapReduce CPU Time Spent: 10 seconds 770 msec
OK
_c0
270
Time taken: 253.202 seconds, Fetched: 1 row(s)
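The same distinct-uid count can be cross-checked outside Hive with coreutils. The sketch below runs the pipeline on a tiny tab-separated sample; run the same `cut -f2 user_table.txt | sort -u | wc -l` against the real file and it should report 270, matching Hive (uid is column 2 of user_table.txt):

```shell
# Build a three-row sample with two distinct uids in column 2,
# then count distinct uids the coreutils way: cut | sort -u | wc -l.
printf '1\t10001082\tx\n2\t10001082\ty\n3\t10003537\tz\n' > /tmp/sample_user_table.txt
cut -f2 /tmp/sample_user_table.txt | sort -u | wc -l
```

This prints 2 for the sample, the coreutils analogue of SELECT COUNT(DISTINCT uid).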
(3) Count the rows that occur exactly once (to screen out possible spam/duplicate orders):
SELECT COUNT(*) FROM (SELECT
uid,item_id,behavior_type,item_category,visit_date,province
FROM
bigdata_user
GROUP BY
uid,item_id,behavior_type,item_category,visit_date,province
HAVING COUNT(*)=1)a;
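The idea behind this query (keep only the groups that occur exactly once, then count them) is the same as `sort | uniq -u` in the shell. A toy illustration, not the Hive execution path:

```shell
# uniq -u emits only lines that occur exactly once in sorted input,
# mirroring GROUP BY all-columns HAVING COUNT(*) = 1; wc -l then
# plays the role of the outer COUNT(*).
printf '%s\n' 'a,1' 'a,1' 'b,2' 'c,3' | sort | uniq -u | wc -l
```

Here the duplicated row a,1 is discarded entirely, and the count of once-only rows is 2.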
hive (dblab)> SELECT COUNT(*) FROM (SELECT
            > uid,item_id,behavior_type,item_category,visit_date,province
            > FROM bigdata_user
            > GROUP BY uid,item_id,behavior_type,item_category,visit_date,province
            > HAVING COUNT(*)=1)a;
Query ID = root_20201116003743_ffd71e21-1df6-4876-b80e-1672d4c3f19d
Total jobs = 2
Starting Job = job_1605437253399_0004, Tracking URL = http://192.168.1.12:8089/proxy/application_1605437253399_0004/
...
Starting Job = job_1605437253399_0005, Tracking URL = http://192.168.1.12:8089/proxy/application_1605437253399_0005/
...
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 25.68 sec HDFS Read: 15606542 HDFS Write: 117 SUCCESS
Stage-Stage-2: Map: 1 Reduce: 1 Cumulative CPU: 8.67 sec HDFS Read: 6997 HDFS Write: 106 SUCCESS
Total MapReduce CPU Time Spent: 34 seconds 350 msec
OK
_c0
284368
Time taken: 494.126 seconds, Fetched: 1 row(s)
IV. Conditional query analysis
1. Queries conditioned on a value range. A WHERE clause narrows the scope of the analysis; a few examples follow.
(1) Count the browse actions that occurred between 2014-12-10 and 2014-12-13:
hive (dblab)> SELECT COUNT(*) FROM bigdata_user WHERE
            > behavior_type='1'
            > AND visit_date<'2014-12-13'
            > AND visit_date>'2014-12-10';
Query ID = root_20201116090806_6b2a5b9c-f996-4caf-baa7-4bea6812cfda
Starting Job = job_1605437253399_0008, Tracking URL = http://192.168.1.12:8089/proxy/application_1605437253399_0008/
...
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 16.35 sec HDFS Read: 15606581 HDFS Write: 105 SUCCESS
Total MapReduce CPU Time Spent: 16 seconds 350 msec
OK
_c0
26329
Time taken: 629.506 seconds, Fetched: 1 row(s)
(2) For each day of the month, show the number of distinct users who made a purchase on that day:
SELECT COUNT(distinct uid), day(visit_date) FROM bigdata_user WHERE
behavior_type='4' GROUP BY day(visit_date);
hive (dblab)> SELECT COUNT(distinct uid), day(visit_date) FROM bigdata_user WHERE
            > behavior_type='4' GROUP BY day(visit_date);
Query ID = root_20201116115335_d26e8b96-2962-4e18-8e18-adbc7324d127
Starting Job = job_1605437253399_0009, Tracking URL = http://192.168.1.12:8089/proxy/application_1605437253399_0009/
...
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 14.01 sec HDFS Read: 15601205 HDFS Write: 618 SUCCESS
Total MapReduce CPU Time Spent: 14 seconds 10 msec
OK
_c0	_c1
37	1
48	2
42	3
38	4
42	5
33	6
42	7
36	8
34	9
40	10
43	11
98	12
39	13
43	14
42	15
44	16
42	17
66	18
38	19
50	20
33	21
34	22
32	23
47	24
34	25
31	26
30	27
34	28
39	29
38	30
Time taken: 212.641 seconds, Fetched: 30 row(s)
2. Queries that fix a key field to a given value
For a given date and province, count the purchase actions recorded for that province on that day:
SELECT COUNT(*) FROM bigdata_user WHERE
province='江西' AND visit_date='2014-12-12' AND behavior_type='4';
hive (dblab)> SELECT COUNT(*) FROM bigdata_user WHERE
            > province='江西' AND visit_date='2014-12-12' AND behavior_type='4';
Query ID = root_20201116120451_c13ae714-6cf1-400c-b9c0-d6afdd25ac7a
Starting Job = job_1605437253399_0010, Tracking URL = http://192.168.1.12:8089/proxy/application_1605437253399_0010/
...
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 14.14 sec HDFS Read: 15606342 HDFS Write: 102 SUCCESS
Total MapReduce CPU Time Spent: 14 seconds 140 msec
OK
_c0
12
Time taken: 166.15 seconds, Fetched: 1 row(s)
V. User-behavior analysis
1. The purchase or browse rate on a given day
For example, count the purchases made on 2014-12-11:
SELECT COUNT(*) FROM bigdata_user WHERE visit_date='2014-12-11'
AND behavior_type='4';
hive (dblab)> SELECT COUNT(*) FROM bigdata_user WHERE visit_date='2014-12-11'
            > AND behavior_type='4';
Query ID = root_20201116122136_fbb9be2d-7c88-409d-ada8-0e24ca4e912d
Starting Job = job_1605437253399_0011, Tracking URL = http://192.168.1.12:8089/proxy/application_1605437253399_0011/
...
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 16.99 sec HDFS Read: 15605958 HDFS Write: 102 SUCCESS
Total MapReduce CPU Time Spent: 16 seconds 990 msec
OK
_c0
69
Time taken: 242.638 seconds, Fetched: 1 row(s)
Then count how many records the site logged on that same day:
SELECT COUNT(*) FROM bigdata_user WHERE
visit_date='2014-12-11';
Dividing the purchase count by the total count from these two queries gives that day's purchase rate.
2. The share of one user's clicks in all clicks on a given day (clicks include browsing, adding to cart, favoriting, and purchasing)
//Count user 10001082's clicks on 2014-12-12
SELECT COUNT(*) FROM bigdata_user WHERE uid=10001082 AND visit_date='2014-12-12';
hive (dblab)> SELECT COUNT(*) FROM bigdata_user WHERE
            > uid=10001082 AND visit_date='2014-12-12';
Query ID = root_20201116124039_d60d642d-b253-49a4-a5d6-b938e307157b
Starting Job = job_1605437253399_0013, Tracking URL = http://192.168.1.12:8089/proxy/application_1605437253399_0013/
...
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 15.17 sec HDFS Read: 15606459 HDFS Write: 102 SUCCESS
Total MapReduce CPU Time Spent: 15 seconds 170 msec
OK
_c0
69
Time taken: 95.548 seconds, Fetched: 1 row(s)
//Count all users' clicks on the site that day
SELECT COUNT(*) FROM bigdata_user WHERE
visit_date='2014-12-12';
hive (dblab)> SELECT COUNT(*) FROM bigdata_user WHERE
            > visit_date='2014-12-12';
Query ID = root_20201116124543_a447db6c-52cd-4be9-adb6-1a57f5163eb3
Starting Job = job_1605437253399_0014, Tracking URL = http://192.168.1.12:8089/proxy/application_1605437253399_0014/
...
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 17.28 sec HDFS Read: 15605296 HDFS Write: 105 SUCCESS
Total MapReduce CPU Time Spent: 17 seconds 280 msec
OK
_c0
17494
Time taken: 454.833 seconds, Fetched: 1 row(s)
Dividing the first result by the second gives the required ratio.
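The final division does not need another Hive job; plugging the two counts above (69 and 17494) into awk gives the ratio directly:

```shell
# User 10001082's share of all clicks on 2014-12-12: 69 / 17494.
awk 'BEGIN { printf "%.4f\n", 69 / 17494 }'
```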
3. Given a purchase-count threshold, find the users who bought more than that many times on a given day
//Find the ids of users who made more than 5 purchases on 2014-12-12
SELECT uid FROM bigdata_user WHERE behavior_type='4' AND visit_date='2014-12-12' GROUP BY uid HAVING count(behavior_type='4')>5;
hive (dblab)> SELECT uid FROM bigdata_user WHERE
            > behavior_type='4' AND visit_date='2014-12-12'
            > GROUP BY uid having count(behavior_type='4')>5;
Query ID = root_20201116130147_72230e9f-b748-4851-974c-1ca5dcceb37c
Starting Job = job_1605437253399_0015, Tracking URL = http://192.168.1.12:8089/proxy/application_1605437253399_0015/
...
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 17.71 sec HDFS Read: 15607239 HDFS Write: 478 SUCCESS
Total MapReduce CPU Time Spent: 17 seconds 710 msec
OK
uid
100226515
100300684
100555417
100605
10095384
10142625
101490976
101982646
102011320
102030700
102079825
102349447
102612580
102650143
103082347
103139791
103794013
103995979
Time taken: 533.86 seconds, Fetched: 18 row(s)
VI. Real-time user query analysis
Count the number of browse actions on the site per province for the day
//Create a new table to store the result
CREATE TABLE scan(province STRING,scan INT) COMMENT 'This is the search of bigdataday'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE;
hive (dblab)> CREATE TABLE scan(province STRING,scan INT) COMMENT 'This is the search of bigdataday'
> ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE;
OK
Time taken: 47.899 seconds
//Load the data
INSERT OVERWRITE TABLE scan SELECT province,COUNT(behavior_type)
FROM bigdata_user WHERE behavior_type='1' GROUP BY province;
//Check the result
SELECT * FROM scan;
hive (dblab)> INSERT OVERWRITE TABLE scan SELECT province,COUNT(behavior_type)
            > FROM bigdata_user WHERE behavior_type='1' GROUP BY province;
Query ID = root_20201116133805_d502f968-f966-4de9-ac2b-075942cd23dc
Total jobs = 2
Starting Job = job_1605437253399_0016, Tracking URL = http://192.168.1.12:8089/proxy/application_1605437253399_0016/
...
Loading data to table dblab.scan
Starting Job = job_1605437253399_0017, Tracking URL = http://192.168.1.12:8089/proxy/application_1605437253399_0017/
...
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 13.42 sec HDFS Read: 15608313 HDFS Write: 913 SUCCESS
Stage-Stage-3: Map: 1 Reduce: 1 Cumulative CPU: 9.3 sec HDFS Read: 11053 HDFS Write: 515 SUCCESS
Total MapReduce CPU Time Spent: 22 seconds 720 msec
OK
_col0	_col1
Time taken: 336.34 seconds
hive (dblab)> SELECT * FROM scan;
OK
scan.province	scan.scan
上海市	8394
云南	8357
内蒙古	8146
北京市	8328
台湾	8186
吉林	8397
四川	8412
天津市	8557
宁夏	8414
安徽	8222
山东	8404
山西	8342
广东	8321
广西	8265
新疆	8311
江苏	8356
江西	8344
河北	8356
河南	8475
浙江	8321
海南	8417
湖北	8231
湖南	8390
澳门	8342
甘肃	8367
福建	8426
西藏	8355
贵州	8236
辽宁	8316
重庆市	8197
陕西	8364
青海	8384
香港	8203
黑龙江	8380
Time taken: 8.221 seconds, Fetched: 34 row(s)
hive (dblab)>