TPC-H数据导入Hive方案
参考博文:
https://blog.csdn.net/haoxiaoyan/article/details/53033870 -->在这篇博文中,前面步骤对我有很大帮助,但后面将tpc-h生成的数据导入到hive的步骤,我没用看懂,我会在后面介绍另外一种方案导入数据。
https://github.com/maomao1994/TPC-H/blob/master/Hive-HiveQL/Hive-HiveQL.md -->我使用了该作者提供的源代码,实现了往hive中导入tpc-h测试集数据。
https://www.ibm.com/support/knowledgecenter/STXKQY_BDA_SHR/bl1bda_tpch.htm -->ibm介绍的关于将tpch数据导入hive方案。
使用系统:centos 7
正文:
1:下载TPC-H源码,用来生成测试数据:
地址:http://www.tpc.org/tpc_documents_current_versions/current_specifications.asp -->选择下载tpc-h那栏.zip格式的文件。
2:将下载好的.zip文件解压到一个文件夹下:
最好将该解压文件夹放到一个存储空间大的文件夹下,别放在根目录下。我起初放在了一个小文件夹下,最后生成测试集数据后,启动hive报错。
错误类型如下,最后更改了存放位置,就OK啦。
Resources are low on NN. Please add or free up more resources then turn off safe mode manually.
3:进入该解压文件夹下,进入dbgen目录:(我的文件夹名字为:tpc-h,解压到了home目录下)
cd /home/tpc-h/dbgen
找到一个名字叫makefile.suite的文件,用vim打开并修改:
vi makefile.suite
在makefile.suite文件里找到对应行,更改为如下形式:
CC = gcc # Current values for DATABASE are: INFORMIX, DB2, TDAT (Teradata) # SQLSERVER, SYBASE # Current values for MACHINE are: ATT, DOS, HP, IBM, ICL, MVS, # SGI, SUN, U2200, VMS, LINUX, WIN32 # Current values for WORKLOAD are: TPCH DATABASE= SQLSERVER MACHINE = LINUX WORKLOAD = TPCH
更改完毕,保存文件。在当前目录下,更改makefile.suite文件为makefile.代码如下:(目的是为了后面使用make命令)
mv makefile.suite makefile
检查你的系统是否含有gcc编译器,若没有,则需要安装。代码如下:
yum install gcc
4:修改dbgen文件夹下的tpcd.h文件:
vi tpcd.h
在该文件里,找到SQLSERVER段落,修改为下面内容:
#ifdef SQLSERVER #define GEN_QUERY_PLAN "EXPLAIN;" #define START_TRAN "START TRANSACTION;\n" #define END_TRAN "COMMIT;\n" #define SET_OUTPUT "" #define SET_ROWCOUNT "limit %d;\n" #define SET_DBASE "use %s;\n" #endif
5:执行make命令,键入如下代码:
make
对,在命令行输入make四个字母就可以了,接着会出现很多.o格式的文件。
6:键入如下代码,生成测试集数据(要在dbgen目录下执行):
./dbgen -s 1
代码解释:-s为数据规模因子,1代表1G,如果改为10就表示10G。看你要多少。执行完毕后你会发现在当前目录下,出现了8个.tbl文件。这就是生成的数据。
7:将tpc-h数据导入hive:
7.1 输入如下命令,启动hive:(下面的命令均在hive里进行)
hive
7.2 先创建一个数据库:
CREATE DATABASE tpch;
然后使用该数据库:
USE tpch;
7.3 创建8张数据表,名字和我们之前生成的8个.tbl文件的名字保持相同。代码如下:(代码来自于参考博文之第2篇)
Create external table lineitem (L_ORDERKEY INT, L_PARTKEY INT,L_SUPPKEY INT,L_LINENUMBER INT,L_QUANTITY DOUBLE,L_EXTENDEDPRICE DOUBLE,L_DISCOUNT DOUBLE,L_TAX DOUBLE,L_RETURNFLAG STRING,L_LINESTATUS STRING,L_SHIPDATE STRING,L_COMMITDATE STRING,L_RECEIPTDATE STRING,L_SHIPINSTRUCT STRING,L_SHIPMODE STRING,L_COMMENT STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE LOCATION '/tpch/lineitem'; Create external table nation (N_NATIONKEY INT,N_NAME STRING,N_REGIONKEY INT,N_COMMENT STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE LOCATION '/tpch/nation'; Create external table region (R_REGIONKEY INT,R_NAME STRING,R_COMMENT STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE LOCATION '/tpch/region'; Create external table part (P_PARTKEY INT , P_NAME STRING , P_MFGR STRING, P_BRAND STRING, P_TYPE STRING , P_SIZE INT , P_CONTAINER STRING, P_RETAILPRICE DOUBLE, P_COMMENT STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE LOCATION '/tpch/part'; Create external table supplier (S_SUPPKEY INT , S_NAME STRING ,S_ADDRESS STRING , S_NATIONKEY INT , S_PHONE STRING, S_ACCTBAL DOUBLE, S_COMMENT STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE LOCATION '/tpch/supplier'; Create external table partsupp (PS_PARTKEY INT , PS_SUPPKEY INT , PS_AVAILQTY INT , PS_SUPPLYCOST DOUBLE, PS_COMMENT STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE LOCATION '/tpch/partsupp'; Create external table customer (C_CUSTKEY INT , C_NAME STRING , C_ADDRESS STRING , C_NATIONKEY INT , C_PHONE STRING , C_ACCTBAL DOUBLE , C_MKTSEGMENT STRING , C_COMMENT STRING ) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE LOCATION '/tpch/customer'; Create external table orders (O_ORDERKEY INT , O_CUSTKEY INT ,O_ORDERSTATUS STRING,O_TOTALPRICE DOUBLE, O_ORDERDATE DATE , O_ORDERPRIORITY STRING, O_CLERK STRING, O_SHIPPRIORITY INT , O_COMMENT STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE LOCATION '/tpch/orders';
7.4 将数据导入hive,代码如下:(下面 /home/mao/tpch_workplace/tpc_h_tool,应替换你测试集所在的实际地址)
LOAD DATA LOCAL INPATH '/home/mao/tpch_workplace/tpc_h_tool/dbgen/region.tbl' INTO TABLE region;
LOAD DATA LOCAL INPATH '/home/mao/tpch_workplace/tpc_h_tool/dbgen/nation.tbl' INTO TABLE nation;
LOAD DATA LOCAL INPATH '/home/mao/tpch_workplace/tpc_h_tool/dbgen/part.tbl' INTO TABLE part;
LOAD DATA LOCAL INPATH '/home/mao/tpch_workplace/tpc_h_tool/dbgen/supplier.tbl' INTO TABLE supplier;
LOAD DATA LOCAL INPATH '/home/mao/tpch_workplace/tpc_h_tool/dbgen/partsupp.tbl' INTO TABLE partsupp;
LOAD DATA LOCAL INPATH '/home/mao/tpch_workplace/tpc_h_tool/dbgen/customer.tbl' INTO TABLE customer;
LOAD DATA LOCAL INPATH '/home/mao/tpch_workplace/tpc_h_tool/dbgen/orders.tbl' INTO TABLE orders;
LOAD DATA LOCAL INPATH '/home/mao/tpch_workplace/tpc_h_tool/dbgen/lineitem.tbl' INTO TABLE lineitem;
7.5 完毕,可以使用SQL查询语句,看数据表中是否有数据。导入过程耗时:10G数据大概6分钟。
7.6 除了这种导入数据的方案,还有其他几种方案,移步文章开头介绍的参考博文:第一篇博文和第三篇博文。