A simple hands-on Hive project
Unzip user.zip:
[root@hadoop1 test]# unzip user.zip -d /test/bigdatacase/dataset
Archive: user.zip
inflating: /test/bigdatacase/dataset/raw_user.csv
inflating: /test/bigdatacase/dataset/small_user.csv
List the two extracted files, then look at the first few lines of raw_user.csv:
[root@hadoop1 dataset]# ll
总用量 1025444
-rw-r--r--. 1 root root 1036519392 3月 3 2016 raw_user.csv
-rw-r--r--. 1 root root 13530111 11月 26 2016 small_user.csv
[root@hadoop1 dataset]# head -5 raw_user.csv
user_id,item_id,behavior_type,user_geohash,item_category,time
10001082,285259775,1,97lk14c,4076,2014-12-08 18
10001082,4368907,1,,5503,2014-12-12 12
10001082,4368907,1,,5503,2014-12-12 12
10001082,53616768,1,,9762,2014-12-02 15
[root@hadoop1 dataset]#
Each record contains 6 comma-separated fields, with the following meanings:
user_id (user ID)
item_id (item ID)
behavior_type (the user action: browse, favorite, add to cart, or purchase, encoded as 1, 2, 3, 4 respectively)
user_geohash (geohash of the user's location; it is missing in some records, so the preprocessing script below drops this field entirely)
item_category (item category)
time (when the record was generated)
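Each line should therefore split into exactly 6 comma-separated fields. A quick awk check on the sample rows above confirms this (a hypothetical sanity check, not part of the original walkthrough):

```shell
# NF is awk's per-record field count; with -F ',' an empty user_geohash
# (the ",," in the second row) still counts as a field.
printf '%s\n' \
  '10001082,285259775,1,97lk14c,4076,2014-12-08 18' \
  '10001082,4368907,1,,5503,2014-12-12 12' |
awk -F ',' '{ print NF }'
```

Both rows print 6; the space inside the time field does not matter, because awk splits on commas only.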
I. Preprocessing the dataset
1. Delete the first line of each file (the header row)
The first line of both raw_user.csv and small_user.csv is the header row. Hive does not want the header when the data is loaded into the warehouse, so delete it as part of preprocessing:
[root@hadoop1 dataset]# sed -i '1d' raw_user.csv
[root@hadoop1 dataset]# head -5 raw_user.csv
10001082,285259775,1,97lk14c,4076,2014-12-08 18
10001082,4368907,1,,5503,2014-12-12 12
10001082,4368907,1,,5503,2014-12-12 12
10001082,53616768,1,,9762,2014-12-02 15
10001082,151466952,1,,5232,2014-12-12 11
[root@hadoop1 dataset]#
Do the same for small_user.csv:
[root@hadoop1 dataset]# sed -i '1d' small_user.csv
[root@hadoop1 dataset]# head -5 small_user.csv
10001082,285259775,1,97lk14c,4076,2014-12-08 18
10001082,4368907,1,,5503,2014-12-12 12
10001082,4368907,1,,5503,2014-12-12 12
10001082,53616768,1,,9762,2014-12-02 15
10001082,151466952,1,,5232,2014-12-12 11
[root@hadoop1 dataset]#
From here on, the small dataset small_user.csv is used to save time. Once the whole pipeline runs end to end, you can repeat it with the full raw_user.csv dataset.
2. Preprocess the fields
Next, preprocess the fields: add an id field to each record (to make records unique), add a province field (for later visualization), and drop the user_geohash field (not needed in the analysis).
Create a script named pre_deal.sh in the dataset directory, alongside small_user.csv:
[root@hadoop1 dataset]# cd /test/bigdatacase/dataset
[root@hadoop1 dataset]# vim pre_deal.sh
#!/bin/bash
# The first argument passed to pre_deal.sh is the input file name
infile=$1
# The second argument is the output file name
outfile=$2
# awk -F "," 'processing logic' $infile > $outfile
awk -F "," 'BEGIN{
srand();
id=0;
Province[0]="山东";Province[1]="山西";Province[2]="河南";Province[3]="河北";Province[4]="陕西";Province[5]="内蒙古";Province[6]="上海市";
Province[7]="北京市";Province[8]="重庆市";Province[9]="天津市";Province[10]="福建";Province[11]="广东";Province[12]="广西";Province[13]="云南";
Province[14]="浙江";Province[15]="贵州";Province[16]="新疆";Province[17]="西藏";Province[18]="江西";Province[19]="湖南";Province[20]="湖北";
Province[21]="黑龙江";Province[22]="吉林";Province[23]="辽宁"; Province[24]="江苏";Province[25]="甘肃";Province[26]="青海";Province[27]="四川";
Province[28]="安徽"; Province[29]="宁夏";Province[30]="海南";Province[31]="香港";Province[32]="澳门";Province[33]="台湾";
}
{
id=id+1;
value=int(rand()*34);
print id"\t"$1"\t"$2"\t"$3"\t"$5"\t"substr($6,1,10)"\t"Province[value]
}' $infile > $outfile
In this script, the -F "," option tells awk that the fields in each record are separated by commas.
srand() seeds the random number generator. id is a new field added to the dataset; it is incremented by 1 for each record, which guarantees that every record gets a unique value.
A province field is also added to each record to support the later visualization step. The Province[] array holds the names of China's provinces and regions. For each record, value=int(rand()*34) generates a random integer between 0 and 33, which is used as an index into Province[] to pick the province name appended to that record.
substr($6,1,10) truncates the time field to its year-month-day part, so it can later be stored as a DATE.
Each input record has 6 fields, and the 6th is the time field, a string like "2014-12-08 18" (i.e., 18:00 on December 8, 2014). substr($6,1,10) takes the first 10 characters of that field, discarding the hour and keeping only the date.
In the line print id"\t"$1"\t"$2"\t"$3"\t"$5"\t"substr($6,1,10)"\t"Province[value], the 4th field (user_geohash) is dropped, as explained above. The generated file is tab-separated.
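Before running the script over the whole file, the awk transformation can be spot-checked on a couple of sample lines. The sketch below reproduces the same field handling but leaves out the random province column, so the output is deterministic (an uninitialized awk variable starts at 0, so no BEGIN block is needed here):

```shell
# Same transformation as pre_deal.sh minus the random province:
# add an incrementing id, drop field 4 (user_geohash),
# and truncate field 6 (time) to its first 10 characters (the date).
printf '%s\n' \
  '10001082,285259775,1,97lk14c,4076,2014-12-08 18' \
  '10001082,4368907,1,,5503,2014-12-12 12' |
awk -F ',' '{ id = id + 1; print id"\t"$1"\t"$2"\t"$3"\t"$5"\t"substr($6, 1, 10) }'
```

Each output line should show the new id, the kept fields, and a bare date where the time field used to be.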
Run the script to preprocess small_user.csv:
[root@hadoop1 dataset]# ./pre_deal.sh small_user.csv user_table.txt
-bash: ./pre_deal.sh: Permission denied
# add execute permission
[root@hadoop1 dataset]# chmod +x pre_deal.sh
[root@hadoop1 dataset]# ./pre_deal.sh small_user.csv user_table.txt
[root@hadoop1 dataset]# head -10 user_table.txt
1 10001082 285259775 1 4076 2014-12-08 香港
2 10001082 4368907 1 5503 2014-12-12 山西
3 10001082 4368907 1 5503 2014-12-12 湖北
4 10001082 53616768 1 9762 2014-12-02 河北
5 10001082 151466952 1 5232 2014-12-12 江西
6 10001082 53616768 4 9762 2014-12-02 广东
7 10001082 290088061 1 5503 2014-12-12 西藏
8 10001082 298397524 1 10894 2014-12-12 湖北
9 10001082 32104252 1 6513 2014-12-12 甘肃
10 10001082 323339743 1 10894 2014-12-12 江西
3. Import into Hive
Finally, load the data in user_table.txt into the Hive data warehouse: upload user_table.txt to HDFS, then create an external table in Hive over that location.
Upload user_table.txt from the local Linux file system to the HDFS directory "/bigdatacase/dataset": create a new directory bigdatacase under the HDFS root, with a dataset subdirectory inside it:
[root@hadoop1 dataset]# hdfs dfs -mkdir -p /bigdatacase/dataset
[root@hadoop1 dataset]# hdfs dfs -put /test/bigdatacase/dataset/user_table.txt /bigdatacase/dataset
Check the upload:
[root@hadoop1 dataset]# hdfs dfs -cat /bigdatacase/dataset/user_table.txt | head -10
1 10001082 285259775 1 4076 2014-12-08 香港
2 10001082 4368907 1 5503 2014-12-12 山西
3 10001082 4368907 1 5503 2014-12-12 湖北
4 10001082 53616768 1 9762 2014-12-02 河北
5 10001082 151466952 1 5232 2014-12-12 江西
6 10001082 53616768 4 9762 2014-12-02 广东
7 10001082 290088061 1 5503 2014-12-12 西藏
8 10001082 298397524 1 10894 2014-12-12 湖北
9 10001082 32104252 1 6513 2014-12-12 甘肃
10 10001082 323339743 1 10894 2014-12-12 江西
cat: Unable to write to output stream.
[root@hadoop1 dataset]#
(The "Unable to write to output stream" message is harmless: head exits after 10 lines and closes the pipe, so hdfs dfs -cat can no longer write to it.)
Create an external table
First create a database, then create an external table bigdata_user in it with the columns (id, uid, item_id, behavior_type, item_category, visit_date, province). Enter the following at the hive prompt:
hive (default)> CREATE DATABASE dblab;
OK
hive (default)> use dblab;
OK
Time taken: 0.692 seconds
CREATE EXTERNAL TABLE bigdata_user(
id INT,
uid STRING,
item_id STRING,
behavior_type INT,
item_category STRING,
visit_data DATE,
province STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/bigdatacase/dataset';
The data under the HDFS directory "/bigdatacase/dataset" is now available in Hive. Note that the date column was accidentally created as visit_data instead of visit_date; rename it:
hive (dblab)> ALTER TABLE bigdata_user CHANGE visit_data visit_date date;
OK
Time taken: 9.862 seconds
hive (dblab)> select * from bigdata_user limit 10;
OK
bigdata_user.id bigdata_user.uid bigdata_user.item_id bigdata_user.behavior_type bigdata_user.item_category bigdata_user.visit_date bigdata_user.province
1 10001082 285259775 1 4076 2014-12-08 香港
2 10001082 4368907 1 5503 2014-12-12 山西
3 10001082 4368907 1 5503 2014-12-12 湖北
4 10001082 53616768 1 9762 2014-12-02 河北
5 10001082 151466952 1 5232 2014-12-12 江西
6 10001082 53616768 4 9762 2014-12-02 广东
7 10001082 290088061 1 5503 2014-12-12 西藏
8 10001082 298397524 1 10894 2014-12-12 湖北
9 10001082 32104252 1 6513 2014-12-12 甘肃
10 10001082 323339743 1 10894 2014-12-12 江西
Time taken: 12.896 seconds, Fetched: 10 row(s)
hive (dblab)> select behavior_type from bigdata_user limit 10;
OK
behavior_type
1
1
1
1
1
4
1
1
1
1
Time taken: 6.927 seconds, Fetched: 10 row(s)
Querying through Beeline gives neater, tabular output:
0: jdbc:hive2://hadoop001:10000> select * from bigdata_user limit 10;
+------------------+-------------------+-----------------------+-----------------------------+-----------------------------+--------------------------+------------------------+
| bigdata_user.id | bigdata_user.uid | bigdata_user.item_id | bigdata_user.behavior_type | bigdata_user.item_category | bigdata_user.visit_date | bigdata_user.province |
+------------------+-------------------+-----------------------+-----------------------------+-----------------------------+--------------------------+------------------------+
| 1 | 10001082 | 285259775 | 1 | 4076 | 2014-12-08 | 香港 |
| 2 | 10001082 | 4368907 | 1 | 5503 | 2014-12-12 | 山西 |
| 3 | 10001082 | 4368907 | 1 | 5503 | 2014-12-12 | 湖北 |
| 4 | 10001082 | 53616768 | 1 | 9762 | 2014-12-02 | 河北 |
| 5 | 10001082 | 151466952 | 1 | 5232 | 2014-12-12 | 江西 |
| 6 | 10001082 | 53616768 | 4 | 9762 | 2014-12-02 | 广东 |
| 7 | 10001082 | 290088061 | 1 | 5503 | 2014-12-12 | 西藏 |
| 8 | 10001082 | 298397524 | 1 | 10894 | 2014-12-12 | 湖北 |
| 9 | 10001082 | 32104252 | 1 | 6513 | 2014-12-12 | 甘肃 |
| 10 | 10001082 | 323339743 | 1 | 10894 | 2014-12-12 | 江西 |
+------------------+-------------------+-----------------------+-----------------------------+-----------------------------+--------------------------+------------------------+
10 rows selected (1.055 seconds)
0: jdbc:hive2://hadoop001:10000> select behavior_type from bigdata_user limit 10;
+----------------+
| behavior_type |
+----------------+
| 1 |
| 1 |
| 1 |
| 1 |
| 1 |
| 4 |
| 1 |
| 1 |
| 1 |
| 1 |
+----------------+
10 rows selected (6.337 seconds)
0: jdbc:hive2://hadoop001:10000>
Inspect the full table definition with show create table bigdata_user; and the column list with desc bigdata_user;:
hive (dblab)> show create table bigdata_user;
OK
createtab_stmt
CREATE EXTERNAL TABLE `bigdata_user`(
`id` int,
`uid` string,
`item_id` string,
`behavior_type` int,
`item_category` string,
`visit_date` date,
`province` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
'field.delim'='\t',
'serialization.format'='\t')
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
'hdfs://192.168.1.11:8020/bigdatacase/dataset'
TBLPROPERTIES (
'bucketing_version'='2',
'transient_lastDdlTime'='1605450738')
Time taken: 35.032 seconds, Fetched: 22 row(s)
hive (dblab)>
hive (dblab)> desc bigdata_user;
OK
col_name data_type comment
id int
uid string
item_id string
behavior_type int
item_category string
visit_date date
province string
Time taken: 14.897 seconds, Fetched: 7 row(s)
hive (dblab)>
II. Simple query analysis
Start with a few simple queries.
Query the behavior type of the first 10 records:
SELECT behavior_type FROM bigdata_user LIMIT 10;
hive (dblab)> SELECT behavior_type FROM bigdata_user LIMIT 10;
OK
behavior_type
1
1
1
1
1
4
1
1
1
1
Time taken: 10.666 seconds, Fetched: 10 row(s)
Query the visit date and item category of the first 20 records:
SELECT visit_date,item_category FROM bigdata_user LIMIT 20;
hive (dblab)> SELECT visit_date,item_category FROM bigdata_user LIMIT 20;
OK
visit_date item_category
2014-12-08 4076
2014-12-12 5503
2014-12-12 5503
2014-12-02 9762
2014-12-12 5232
2014-12-02 9762
2014-12-12 5503
2014-12-12 10894
2014-12-12 6513
2014-12-12 10894
2014-12-12 2825
2014-11-28 2825
2014-12-15 3200
2014-12-03 10576
2014-11-20 10576
2014-12-13 10576
2014-12-08 10576
2014-12-14 7079
2014-12-02 6669
2014-12-12 5232
Time taken: 7.74 seconds, Fetched: 20 row(s)
Queries can also use nested subqueries with column aliases:
hive (dblab)> SELECT e.bh, e.it FROM (SELECT behavior_type AS bh, item_category AS it FROM bigdata_user) AS e LIMIT 20;
OK
e.bh e.it
1 4076
1 5503
1 5503
1 9762
1 5232
4 9762
1 5503
1 10894
1 6513
1 10894
1 2825
1 2825
1 3200
1 10576
1 10576
1 10576
1 10576
1 7079
1 6669
1 5232
Time taken: 5.872 seconds, Fetched: 20 row(s)
III. Row-count statistics
(1) Use the aggregate function count() to compute the number of rows in the table:
SELECT COUNT(*) FROM bigdata_user;
hive (dblab)> SELECT COUNT(*) FROM bigdata_user;
Query ID = root_20201115231503_8cce31a9-81cb-4ef3-92ca-94bafb2086b4
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
2020-11-15 23:17:28,850 INFO [6d9dd317-47d9-447c-bba6-8b3d0a9f3ae9 main] client.ConfiguredRMFailoverProxyProvider: Failing over to rm2
Starting Job = job_1605437253399_0001, Tracking URL = http://192.168.1.12:8089/proxy/application_1605437253399_0001/
Kill Command = /program/hadoop-3.2.1/bin/mapred job -kill job_1605437253399_0001
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2020-11-15 23:20:28,863 Stage-1 map = 0%, reduce = 0%
2020-11-15 23:21:29,392 Stage-1 map = 0%, reduce = 0%
2020-11-15 23:22:27,725 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 5.91 sec
2020-11-15 23:23:10,606 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 11.72 sec
MapReduce Total cumulative CPU time: 11 seconds 970 msec
Ended Job = job_1605437253399_0001
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 11.97 sec HDFS Read: 15603831 HDFS Write: 106 SUCCESS
Total MapReduce CPU Time Spent: 11 seconds 970 msec
OK
_c0
300000
Time taken: 501.726 seconds, Fetched: 1 row(s)
(2) Add DISTINCT inside the function to count how many distinct uid values there are:
SELECT COUNT(DISTINCT uid) FROM bigdata_user;
hive (dblab)> SELECT COUNT(DISTINCT uid) FROM bigdata_user;
Query ID = root_20201115232536_31d143aa-c5ed-4b53-b0a6-8c64f14f8634
Starting Job = job_1605437253399_0002, Tracking URL = http://192.168.1.12:8089/proxy/application_1605437253399_0002/
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
...
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 10.77 sec HDFS Read: 15599879 HDFS Write: 103 SUCCESS
Total MapReduce CPU Time Spent: 10 seconds 770 msec
OK
_c0
270
Time taken: 253.202 seconds, Fetched: 1 row(s)
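The same distinct-uid count can be cross-checked outside Hive with coreutils. The sketch below runs the pipeline on a tiny tab-separated sample; run the same `cut -f2 user_table.txt | sort -u | wc -l` against the real file and it should report 270, matching Hive (uid is column 2 of user_table.txt):

```shell
# Build a three-row sample with two distinct uids in column 2,
# then count distinct uids the coreutils way: cut | sort -u | wc -l.
printf '1\t10001082\tx\n2\t10001082\ty\n3\t10003537\tz\n' > /tmp/sample_user_table.txt
cut -f2 /tmp/sample_user_table.txt | sort -u | wc -l
```

This prints 2 for the sample, the coreutils analogue of SELECT COUNT(DISTINCT uid).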
(3) Count the rows that occur exactly once (to screen out possible spam/duplicate orders):
SELECT COUNT(*) FROM (SELECT
uid,item_id,behavior_type,item_category,visit_date,province
FROM
bigdata_user
GROUP BY
uid,item_id,behavior_type,item_category,visit_date,province
HAVING COUNT(*)=1)a;
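The idea behind this query (keep only the groups that occur exactly once, then count them) is the same as `sort | uniq -u` in the shell. A toy illustration, not the Hive execution path:

```shell
# uniq -u emits only lines that occur exactly once in sorted input,
# mirroring GROUP BY all-columns HAVING COUNT(*) = 1; wc -l then
# plays the role of the outer COUNT(*).
printf '%s\n' 'a,1' 'a,1' 'b,2' 'c,3' | sort | uniq -u | wc -l
```

Here the duplicated row a,1 is discarded entirely, and the count of once-only rows is 2.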
hive (dblab)> SELECT COUNT(*) FROM (SELECT
            > uid,item_id,behavior_type,item_category,visit_date,province
            > FROM bigdata_user
            > GROUP BY uid,item_id,behavior_type,item_category,visit_date,province
            > HAVING COUNT(*)=1)a;
Query ID = root_20201116003743_ffd71e21-1df6-4876-b80e-1672d4c3f19d
Total jobs = 2
Starting Job = job_1605437253399_0004, Tracking URL = http://192.168.1.12:8089/proxy/application_1605437253399_0004/
...
Starting Job = job_1605437253399_0005, Tracking URL = http://192.168.1.12:8089/proxy/application_1605437253399_0005/
...
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 25.68 sec HDFS Read: 15606542 HDFS Write: 117 SUCCESS
Stage-Stage-2: Map: 1 Reduce: 1 Cumulative CPU: 8.67 sec HDFS Read: 6997 HDFS Write: 106 SUCCESS
Total MapReduce CPU Time Spent: 34 seconds 350 msec
OK
_c0
284368
Time taken: 494.126 seconds, Fetched: 1 row(s)
IV. Conditional query analysis
1. Queries conditioned on a value range. A WHERE clause narrows the scope of the analysis; a few examples follow.
(1) Count the browse actions that occurred between 2014-12-10 and 2014-12-13:
hive (dblab)> SELECT COUNT(*) FROM bigdata_user WHERE
            > behavior_type='1'
            > AND visit_date<'2014-12-13'
            > AND visit_date>'2014-12-10';
Query ID = root_20201116090806_6b2a5b9c-f996-4caf-baa7-4bea6812cfda
Starting Job = job_1605437253399_0008, Tracking URL = http://192.168.1.12:8089/proxy/application_1605437253399_0008/
...
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 16.35 sec HDFS Read: 15606581 HDFS Write: 105 SUCCESS
Total MapReduce CPU Time Spent: 16 seconds 350 msec
OK
_c0
26329
Time taken: 629.506 seconds, Fetched: 1 row(s)
(2) For each day of the month, show the number of distinct users who made a purchase on that day:
SELECT COUNT(distinct uid), day(visit_date) FROM bigdata_user WHERE
behavior_type='4' GROUP BY day(visit_date);
hive (dblab)> SELECT COUNT(distinct uid), day(visit_date) FROM bigdata_user WHERE
            > behavior_type='4' GROUP BY day(visit_date);
Query ID = root_20201116115335_d26e8b96-2962-4e18-8e18-adbc7324d127
Starting Job = job_1605437253399_0009, Tracking URL = http://192.168.1.12:8089/proxy/application_1605437253399_0009/
...
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 14.01 sec HDFS Read: 15601205 HDFS Write: 618 SUCCESS
Total MapReduce CPU Time Spent: 14 seconds 10 msec
OK
_c0	_c1
37	1
48	2
42	3
38	4
42	5
33	6
42	7
36	8
34	9
40	10
43	11
98	12
39	13
43	14
42	15
44	16
42	17
66	18
38	19
50	20
33	21
34	22
32	23
47	24
34	25
31	26
30	27
34	28
39	29
38	30
Time taken: 212.641 seconds, Fetched: 30 row(s)
2. Queries that fix a key field to a given value
For a given date and province, count the purchase actions recorded for that province on that day:
SELECT COUNT(*) FROM bigdata_user WHERE
province='江西' AND visit_date='2014-12-12' AND behavior_type='4';
hive (dblab)> SELECT COUNT(*) FROM bigdata_user WHERE
            > province='江西' AND visit_date='2014-12-12' AND behavior_type='4';
Query ID = root_20201116120451_c13ae714-6cf1-400c-b9c0-d6afdd25ac7a
Starting Job = job_1605437253399_0010, Tracking URL = http://192.168.1.12:8089/proxy/application_1605437253399_0010/
...
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 14.14 sec HDFS Read: 15606342 HDFS Write: 102 SUCCESS
Total MapReduce CPU Time Spent: 14 seconds 140 msec
OK
_c0
12
Time taken: 166.15 seconds, Fetched: 1 row(s)
V. User-behavior analysis
1. The purchase or browse rate on a given day
For example, count the purchases made on 2014-12-11:
SELECT COUNT(*) FROM bigdata_user WHERE visit_date='2014-12-11'
AND behavior_type='4';
hive (dblab)> SELECT COUNT(*) FROM bigdata_user WHERE visit_date='2014-12-11'
            > AND behavior_type='4';
Query ID = root_20201116122136_fbb9be2d-7c88-409d-ada8-0e24ca4e912d
Starting Job = job_1605437253399_0011, Tracking URL = http://192.168.1.12:8089/proxy/application_1605437253399_0011/
...
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 16.99 sec HDFS Read: 15605958 HDFS Write: 102 SUCCESS
Total MapReduce CPU Time Spent: 16 seconds 990 msec
OK
_c0
69
Time taken: 242.638 seconds, Fetched: 1 row(s)
Then count how many records the site logged on that same day:
SELECT COUNT(*) FROM bigdata_user WHERE
visit_date='2014-12-11';
Dividing the purchase count by the total count from these two queries gives that day's purchase rate.
2. The share of one user's clicks in all clicks on a given day (clicks include browsing, adding to cart, favoriting, and purchasing)
//Count user 10001082's clicks on 2014-12-12
SELECT COUNT(*) FROM bigdata_user WHERE uid=10001082 AND visit_date='2014-12-12';
hive (dblab)> SELECT COUNT(*) FROM bigdata_user WHERE
            > uid=10001082 AND visit_date='2014-12-12';
Query ID = root_20201116124039_d60d642d-b253-49a4-a5d6-b938e307157b
Starting Job = job_1605437253399_0013, Tracking URL = http://192.168.1.12:8089/proxy/application_1605437253399_0013/
...
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 15.17 sec HDFS Read: 15606459 HDFS Write: 102 SUCCESS
Total MapReduce CPU Time Spent: 15 seconds 170 msec
OK
_c0
69
Time taken: 95.548 seconds, Fetched: 1 row(s)
//Count all users' clicks on the site that day
SELECT COUNT(*) FROM bigdata_user WHERE
visit_date='2014-12-12';
hive (dblab)> SELECT COUNT(*) FROM bigdata_user WHERE
            > visit_date='2014-12-12';
Query ID = root_20201116124543_a447db6c-52cd-4be9-adb6-1a57f5163eb3
Starting Job = job_1605437253399_0014, Tracking URL = http://192.168.1.12:8089/proxy/application_1605437253399_0014/
...
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 17.28 sec HDFS Read: 15605296 HDFS Write: 105 SUCCESS
Total MapReduce CPU Time Spent: 17 seconds 280 msec
OK
_c0
17494
Time taken: 454.833 seconds, Fetched: 1 row(s)
Dividing the first result by the second gives the required ratio.
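The final division does not need another Hive job; plugging the two counts above (69 and 17494) into awk gives the ratio directly:

```shell
# User 10001082's share of all clicks on 2014-12-12: 69 / 17494.
awk 'BEGIN { printf "%.4f\n", 69 / 17494 }'
```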
3. Given a purchase-count threshold, find the users who bought more than that many times on a given day
//Find the ids of users who made more than 5 purchases on 2014-12-12
SELECT uid FROM bigdata_user WHERE behavior_type='4' AND visit_date='2014-12-12' GROUP BY uid HAVING count(behavior_type='4')>5;
hive (dblab)> SELECT uid FROM bigdata_user WHERE
            > behavior_type='4' AND visit_date='2014-12-12'
            > GROUP BY uid having count(behavior_type='4')>5;
Query ID = root_20201116130147_72230e9f-b748-4851-974c-1ca5dcceb37c
Starting Job = job_1605437253399_0015, Tracking URL = http://192.168.1.12:8089/proxy/application_1605437253399_0015/
...
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 17.71 sec HDFS Read: 15607239 HDFS Write: 478 SUCCESS
Total MapReduce CPU Time Spent: 17 seconds 710 msec
OK
uid
100226515
100300684
100555417
100605
10095384
10142625
101490976
101982646
102011320
102030700
102079825
102349447
102612580
102650143
103082347
103139791
103794013
103995979
Time taken: 533.86 seconds, Fetched: 18 row(s)
VI. Real-time user query analysis
Count the number of browse actions on the site per province for the day
//Create a new table to store the result
CREATE TABLE scan(province STRING,scan INT) COMMENT 'This is the search of bigdataday'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE;
hive (dblab)> CREATE TABLE scan(province STRING,scan INT) COMMENT 'This is the search of bigdataday'
> ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE;
OK
Time taken: 47.899 seconds
//Load the data
INSERT OVERWRITE TABLE scan SELECT province,COUNT(behavior_type)
FROM bigdata_user WHERE behavior_type='1' GROUP BY province;
//Check the result
SELECT * FROM scan;
hive (dblab)> INSERT OVERWRITE TABLE scan SELECT province,COUNT(behavior_type)
            > FROM bigdata_user WHERE behavior_type='1' GROUP BY province;
Query ID = root_20201116133805_d502f968-f966-4de9-ac2b-075942cd23dc
Total jobs = 2
Starting Job = job_1605437253399_0016, Tracking URL = http://192.168.1.12:8089/proxy/application_1605437253399_0016/
...
Loading data to table dblab.scan
Starting Job = job_1605437253399_0017, Tracking URL = http://192.168.1.12:8089/proxy/application_1605437253399_0017/
...
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 13.42 sec HDFS Read: 15608313 HDFS Write: 913 SUCCESS
Stage-Stage-3: Map: 1 Reduce: 1 Cumulative CPU: 9.3 sec HDFS Read: 11053 HDFS Write: 515 SUCCESS
Total MapReduce CPU Time Spent: 22 seconds 720 msec
OK
_col0	_col1
Time taken: 336.34 seconds
hive (dblab)> SELECT * FROM scan;
OK
scan.province	scan.scan
上海市	8394
云南	8357
内蒙古	8146
北京市	8328
台湾	8186
吉林	8397
四川	8412
天津市	8557
宁夏	8414
安徽	8222
山东	8404
山西	8342
广东	8321
广西	8265
新疆	8311
江苏	8356
江西	8344
河北	8356
河南	8475
浙江	8321
海南	8417
湖北	8231
湖南	8390
澳门	8342
甘肃	8367
福建	8426
西藏	8355
贵州	8236
辽宁	8316
重庆市	8197
陕西	8364
青海	8384
香港	8203
黑龙江	8380
Time taken: 8.221 seconds, Fetched: 34 row(s)
hive (dblab)>