TPC-H数据导入Hive方案

参考博文:

https://blog.csdn.net/haoxiaoyan/article/details/53033870    -->在这篇博文中,前面步骤对我有很大帮助,但后面将tpc-h生成的数据导入到hive的步骤,我没用看懂,我会在后面介绍另外一种方案导入数据。

https://github.com/maomao1994/TPC-H/blob/master/Hive-HiveQL/Hive-HiveQL.md    -->我使用了该作者提供的源代码,实现了往hive中导入tpc-h测试集数据。

https://www.ibm.com/support/knowledgecenter/STXKQY_BDA_SHR/bl1bda_tpch.htm    -->ibm介绍的关于将tpch数据导入hive方案。

使用系统:centos 7

 

 

正文:

1:下载TPC-H源码,用来生成测试数据:

      地址:http://www.tpc.org/tpc_documents_current_versions/current_specifications.asp  -->选择下载tpc-h那栏.zip格式的文件。

 

2:将下载好的.zip文件解压到一个文件夹下:

      最好将该解压文件夹放到一个存储空间大的文件夹下,别放在根目录下。我起初放在了一个小文件夹下,最后生成测试集数据后,启动hive报错。

      错误类型如下,最后更改了存放位置,就OK啦。

    Resources are low on NN. Please add or free up more resources then turn off safe mode manually.

 

 

3:进入该解压文件夹下,进入dbgen目录:(我的文件夹名字为:tpc-h,解压到了home目录下)

    cd /home/tpc-h/dbgen

       找到一个名字叫makefile.suite的文件,用vim打开并修改:

    vi makefile.suite

       在makefile.suite文件里找到对应行,更改为如下形式:

     CC      = gcc

     # Current values for DATABASE are: INFORMIX, DB2, TDAT (Teradata)  

     #                                  SQLSERVER, SYBASE  

     # Current values for MACHINE are:  ATT, DOS, HP, IBM, ICL, MVS, 

     #                                  SGI, SUN, U2200, VMS, LINUX, WIN32   

     # Current values for WORKLOAD are:  TPCH 

     DATABASE= SQLSERVER  

     MACHINE = LINUX  

     WORKLOAD = TPCH

      更改完毕,保存文件。在当前目录下,更改makefile.suite文件为makefile.代码如下:(目的是为了后面使用make命令)

     mv  makefile.suite  makefile

     检查你的系统是否含有gcc编译器,若没有,则需要安装。代码如下:

     yum install gcc

 

4:修改dbgen文件夹下的tpcd.h文件:

     vi tpcd.h

 在该文件里,找到SQLSERVER段落,修改为下面内容:

     #ifdef  SQLSERVER
     #define GEN_QUERY_PLAN        "EXPLAIN;"
     #define START_TRAN                "START TRANSACTION;\n"
     #define END_TRAN                "COMMIT;\n"
     #define SET_OUTPUT                ""
     #define SET_ROWCOUNT                "limit %d;\n"
     #define SET_DBASE                "use %s;\n"
     #endif

 

 5:执行make命令,键入如下代码:

     make

 对,在命令行输入make四个字母就可以了,接着会出现很多.o格式的文件。

 

6:键入如下代码,生成测试集数据(要在dbgen目录下执行):

    ./dbgen -s 1

 代码解释:-s为数据规模因子,1代表1G,如果改为10就表示10G。看你要多少。执行完毕后你会发现在当前目录下,出现了8个.tbl文件。这就是生成的数据。

 

7:将tpc-h数据导入hive:

     7.1 输入如下命令,启动hive:(下面的命令均在hive里进行)   

    hive

 

    7.2 先创建一个数据库:

    CREATE DATABASE tpch;

         
          然后使用该数据库:

    USE tpch;

 

    7.3 创建8张数据表,名字和我们之前生成的8个.tbl文件的名字保持相同。代码如下:(代码来自于参考博文之第2篇)

 

   Create external table lineitem (L_ORDERKEY INT, L_PARTKEY INT,L_SUPPKEY INT,L_LINENUMBER INT,L_QUANTITY DOUBLE,L_EXTENDEDPRICE DOUBLE,L_DISCOUNT DOUBLE,L_TAX DOUBLE,L_RETURNFLAG STRING,L_LINESTATUS STRING,L_SHIPDATE STRING,L_COMMITDATE STRING,L_RECEIPTDATE STRING,L_SHIPINSTRUCT STRING,L_SHIPMODE STRING,L_COMMENT STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE LOCATION '/tpch/lineitem';

   Create external table nation (N_NATIONKEY INT,N_NAME STRING,N_REGIONKEY INT,N_COMMENT STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE LOCATION '/tpch/nation';

   Create external table region (R_REGIONKEY INT,R_NAME STRING,R_COMMENT STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE LOCATION '/tpch/region';

   Create external table part (P_PARTKEY INT , P_NAME STRING , P_MFGR STRING, P_BRAND  STRING, P_TYPE STRING , P_SIZE INT , P_CONTAINER  STRING, P_RETAILPRICE DOUBLE, P_COMMENT STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE LOCATION '/tpch/part';

   Create external table supplier (S_SUPPKEY INT , S_NAME STRING ,S_ADDRESS STRING , S_NATIONKEY  INT , S_PHONE  STRING, S_ACCTBAL DOUBLE, S_COMMENT STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE LOCATION '/tpch/supplier';

   Create external table partsupp (PS_PARTKEY INT , PS_SUPPKEY INT , PS_AVAILQTY INT , PS_SUPPLYCOST DOUBLE, PS_COMMENT STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE LOCATION '/tpch/partsupp';

   Create external table customer (C_CUSTKEY INT , C_NAME STRING , C_ADDRESS STRING , C_NATIONKEY  INT , C_PHONE  STRING , C_ACCTBAL DOUBLE , C_MKTSEGMENT STRING , C_COMMENT STRING ) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE LOCATION '/tpch/customer';

   Create external table orders (O_ORDERKEY INT , O_CUSTKEY INT ,O_ORDERSTATUS STRING,O_TOTALPRICE DOUBLE, O_ORDERDATE DATE , O_ORDERPRIORITY STRING, O_CLERK  STRING,  O_SHIPPRIORITY  INT ,  O_COMMENT STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE LOCATION '/tpch/orders';

 

     7.4 将数据导入hive,代码如下:(下面 /home/mao/tpch_workplace/tpc_h_tool,应替换你测试集所在的实际地址)

   LOAD DATA LOCAL INPATH '/home/mao/tpch_workplace/tpc_h_tool/dbgen/region.tbl' INTO TABLE region;
   LOAD DATA LOCAL INPATH '/home/mao/tpch_workplace/tpc_h_tool/dbgen/nation.tbl' INTO TABLE nation;
   LOAD DATA LOCAL INPATH '/home/mao/tpch_workplace/tpc_h_tool/dbgen/part.tbl'   INTO TABLE part;
   LOAD DATA LOCAL INPATH '/home/mao/tpch_workplace/tpc_h_tool/dbgen/supplier.tbl' INTO TABLE supplier;
   LOAD DATA LOCAL INPATH '/home/mao/tpch_workplace/tpc_h_tool/dbgen/partsupp.tbl' INTO TABLE partsupp;
   LOAD DATA LOCAL INPATH '/home/mao/tpch_workplace/tpc_h_tool/dbgen/customer.tbl' INTO TABLE customer;
   LOAD DATA LOCAL INPATH '/home/mao/tpch_workplace/tpc_h_tool/dbgen/orders.tbl' INTO TABLE orders;
   LOAD DATA LOCAL INPATH '/home/mao/tpch_workplace/tpc_h_tool/dbgen/lineitem.tbl' INTO TABLE lineitem;

 

     7.5 完毕,可以使用SQL查询语句,看数据表中是否有数据。导入过程耗时:10G数据大概6分钟。

     7.6 除了这种导入数据的方案,还有其他几种方案,移步文章开头介绍的参考博文:第一篇博文和第三篇博文。

 

posted @ 2018-10-17 19:42  星动OvO  阅读(1657)  评论(2编辑  收藏  举报