Linux-数据集 TPC-H、TPC-DS的导入和使用(MySQL)
一. TPC-H 数据集
1、数据集下载
TPC-H数据集: https://github.com/gregrahn/tpch-kit
可采用gcc下载或者直接下载zip包,然后解压即可。
具体使用方法可以查看 dbgen文件目录下的:README文件
2、数据表的创建与约束
在下载的文件包下的dbgen文件夹下,有2个文件,分别是dss.ddl、dss.ri。
(1)dss.ddl是创建表的语句,将里面的内容放入到dss.sql中(你自己创建的sql文件),执行表中的语句即可创建表。
\i /pathA/dss.sql
# 该文件对应的路径
(2)dss.ri 是表之间的限制条件,也将内容复制到sql文件下dss_ri.sql,然后执行。
注意:数据约束这里有问题,需要修改下表内的内容,把外键这里的语句改成如下格式。
ALTER TABLE REGION ADD PRIMARY KEY (R_REGIONKEY); ALTER TABLE NATION ADD PRIMARY KEY (N_NATIONKEY); ALTER TABLE NATION ADD CONSTRAINT NATION_FK1 FOREIGN KEY (N_REGIONKEY) REFERENCES REGION (R_REGIONKEY); COMMIT WORK; ALTER TABLE PART ADD PRIMARY KEY (P_PARTKEY); COMMIT WORK; ALTER TABLE SUPPLIER ADD PRIMARY KEY (S_SUPPKEY); ALTER TABLE SUPPLIER ADD CONSTRAINT SUPPLIER_FK1 FOREIGN KEY(S_NATIONKEY) references NATION(N_NATIONKEY); COMMIT WORK; ALTER TABLE PARTSUPP ADD PRIMARY KEY (PS_PARTKEY,PS_SUPPKEY); COMMIT WORK; ALTER TABLE CUSTOMER ADD PRIMARY KEY (C_CUSTKEY); ALTER TABLE CUSTOMER ADD CONSTRAINT CUSTOMER_FK1 FOREIGN KEY(C_NATIONKEY) references NATION(N_NATIONKEY); COMMIT WORK;
ALTER TABLE LINEITEM ADD PRIMARY KEY (L_ORDERKEY,L_LINENUMBER); COMMIT WORK; ALTER TABLE ORDERS ADD PRIMARY KEY (O_ORDERKEY); COMMIT WORK; ALTER TABLE PARTSUPP ADD CONSTRAINT PARTSUPP_FK1 FOREIGN KEY (PS_SUPPKEY) references SUPPLIER(S_SUPPKEY); COMMIT WORK; ALTER TABLE PARTSUPP ADD CONSTRAINT PARTSUPP_FK2 FOREIGN KEY (PS_PARTKEY) references PART (P_PARTKEY); COMMIT WORK; ALTER TABLE ORDERS ADD CONSTRAINT ORDERS_FK1 FOREIGN KEY (O_CUSTKEY) references CUSTOMER(C_CUSTKEY); COMMIT WORK; ALTER TABLE LINEITEM ADD CONSTRAINT LINEITEM_FK1 FOREIGN KEY (L_ORDERKEY) references ORDERS(O_ORDERKEY); COMMIT WORK; ALTER TABLE LINEITEM ADD CONSTRAINT LINEITEM_FK2 FOREIGN KEY (L_PARTKEY,L_SUPPKEY) references PARTSUPP (PS_PARTKEY,PS_SUPPKEY); COMMIT WORK;
执行命令
\i /pathA/dss_ri.sql
# 该文件对应的路径
3、数据的生成与导入
(1)生成dbgen、qgen
通过下面命令,生成两个文件,分别是dbgen、qgen。分别用于产生数据和查询。
make -f Makefile
(2)生成数据
生成8个tbl文件,生成以后修改八个文件的可执行权限,将这八个tbl文件放入创建好的文件目录tbl下(注意1表示1G数据量,0.1表示100M数据量)。
./dbgen -s 1 -f chmod 777 tbl/*.tbl
或者单独生成每个表的数据
./dbgen -vf -s 10 -T r | tee -a region.tbl ./dbgen -vf -s 10 -T n | tee -a nation.tbl ./dbgen -vf -s 10 -T c | tee -a customer.tbl ./dbgen -vf -s 10 -T L | tee -a lineitem.tbl ./dbgen -vf -s 10 -T O | tee -a orders.tbl ./dbgen -vf -s 10 -T P | tee -a part.tbl ./dbgen -vf -s 10 -T S | tee -a partsupp.tbl ./dbgen -vf -s 10 -T s | tee -a supplier.tbl
(3)数据导入
copy customer from '/home/ywb/Data/tpch-kit-master/dbgen/tbl/customer.tbl' with delimiter as '|' NULL ''; copy lineitem from '/home/ywb/Data/tpch-kit-master/dbgen/tbl/lineitem.tbl' with delimiter as '|' NULL ''; copy nation from '/home/ywb/Data/tpch-kit-master/dbgen/tbl/nation.tbl' with delimiter as '|' NULL ''; copy orders from '/home/ywb/Data/tpch-kit-master/dbgen/tbl/orders.tbl' with delimiter as '|' NULL ''; copy part from '/home/ywb/Data/tpch-kit-master/dbgen/tbl/part.tbl' with delimiter as '|' NULL ''; copy partsupp from '/home/ywb/Data/tpch-kit-master/dbgen/tbl/partsupp.tbl' with delimiter as '|' NULL ''; copy region from '/home/ywb/Data/tpch-kit-master/dbgen/tbl/region.tbl' with delimiter as '|' NULL ''; copy supplier from '/home/ywb/Data/tpch-kit-master/dbgen/tbl/supplier.tbl' with delimiter as '|' NULL '';
或者命令行
load data local infile '/home/Data/tpch-kit-master/dbgen/tbl/region.tbl' into table region fields terminated by '|' lines terminated by '\n'; load data local infile '/home/Data/tpch-kit-master/dbgen/tbl/nation.tbl' into table nation fields terminated by '|' lines terminated by '\n'; load data local infile '/home/Data/tpch-kit-master/dbgen/tbl/customer.tbl' into table customer fields terminated by '|' lines terminated by '\n'; load data local infile '/home/Data/tpch-kit-master/dbgen/tbl/supplier.tbl' into table supplier fields terminated by '|' lines terminated by '\n'; load data local infile '/home/Data/tpch-kit-master/dbgen/tbl/part.tbl' into table part fields terminated by '|' lines terminated by '\n'; load data local infile '/home/Data/tpch-kit-master/dbgen/tbl/orders.tbl' into table orders fields terminated by '|' lines terminated by '\n'; load data local infile '/home/Data/tpch-kit-master/dbgen/tbl/partsupp.tbl' into table partsupp fields terminated by '|' lines terminated by '\n'; load data local infile '/home/Data/tpch-kit-master/dbgen/tbl/lineitem.tbl' into table lineitem fields terminated by '|' lines terminated by '\n';
4、查询语句生成
(1)将dists.dss、qgen复制到queries下。
(2)批量生成sql语句(在queries目录下执行命令)
for i in {1..22} do name="d$i.sql" echo $name ./qgen -d $i >$name done
二. TPC-DS数据集
1、数据下载
TPC-DS数据集: https://github.com/gregrahn/tpcds-kit
采用gcc下载或者直接下载zip包,然后解压即可。
具体使用方法可以查看 tools文件目录下的:How_To_Guide-DS-V2.0.0.docx
2、数据表创建与约束
在下载的文件包下的tools文件夹下,有3个sql文件,分别是tpcds.sql、tocds_ri.sql、tpcds_source.sql。
(1)tpcds.sql是创建表的语句,执行表中的语句即可创建表。
\i /pathA/tpcds.sql
# 该文件对应的路径
(2)tpcds_ri.sql 是表之间的限制条件
注意这一步表的约束最好放到导入数据完以后再执行,否则主外键的的错误。
\i /pathA/tpcds_ri.sql
# 该文件对应的路径
3、数据生成与导入
(1)数据生成(在tools目录下)
dsdgen、dsqgen分别用于生成数据和查询模板。
注意:这里可能make的时候没有生成dsqgen这个,原因是少了两个包,根据错误提示下载即可。
sudo apt-get install byacc # 或者 apt-get install yacc sudo apt-get install flex
make -f Makefile.suite # dsdgen、dsqgen 生成
./dsdgen -sc 1 -DIR 'data_output_path' # ./dsdgen -sc 1 -DIR 'data/' # data是我创建的数据目录位置,在tools下 # ./dsdgen -h 查看用法 # data_output_path: 数据生成路径 # 如果没有dsdgen,则执行命令: # make -f Makefile.suite
(2)修改文件权限
修改data/下所有.dat文件的可执行权限。
chmod 777 data/*.dat
(3) 数据导入
注意:生成的数据中会多一个 ’ | ’ 字符,导致导入数据报错,所以需要先删除每行最后一个 |,直接在dat文件目录下终端执行该命令。
删除多余的 |
for i in `ls *.dat` do name=$i echo $name sed -i 's#|$##g' $name done
导入数据
copy call_center from '/home/ywb/Data/tpcds-kit-master/tools/data/call_center.dat' with delimiter as '|' NULL ''; copy catalog_page from '/home/ywb/Data/tpcds-kit-master/tools/data/catalog_page.dat' with delimiter as '|' NULL ''; copy catalog_returns from '/home/ywb/Data/tpcds-kit-master/tools/data/catalog_returns.dat' with delimiter as '|' NULL ''; copy catalog_sales from '/home/ywb/Data/tpcds-kit-master/tools/data/catalog_sales.dat' with delimiter as '|' NULL ''; copy customer from '/home/ywb/Data/tpcds-kit-master/tools/data/customer.dat' with delimiter as '|' NULL ''; copy customer_address from '/home/ywb/Data/tpcds-kit-master/tools/data/customer_address.dat' with delimiter as '|' NULL ''; copy customer_demographics from '/home/ywb/Data/tpcds-kit-master/tools/data/customer_demographics.dat' with delimiter as '|' NULL ''; copy date_dim from '/home/ywb/Data/tpcds-kit-master/tools/data/date_dim.dat' with delimiter as '|' NULL ''; copy dbgen_version from '/home/ywb/Data/tpcds-kit-master/tools/data/dbgen_version.dat' with delimiter as '|' NULL ''; copy household_demographics from '/home/ywb/Data/tpcds-kit-master/tools/data/household_demographics.dat' with delimiter as '|' NULL ''; copy income_band from '/home/ywb/Data/tpcds-kit-master/tools/data/income_band.dat' with delimiter as '|' NULL ''; copy inventory from '/home/ywb/Data/tpcds-kit-master/tools/data/inventory.dat' with delimiter as '|' NULL ''; copy item from '/home/ywb/Data/tpcds-kit-master/tools/data/item.dat' with delimiter as '|' NULL ''; copy promotion from '/home/ywb/Data/tpcds-kit-master/tools/data/promotion.dat' with delimiter as '|' NULL ''; copy reason from '/home/ywb/Data/tpcds-kit-master/tools/data/reason.dat' with delimiter as '|' NULL ''; copy ship_mode from '/home/ywb/Data/tpcds-kit-master/tools/data/ship_mode.dat' with delimiter as '|' NULL ''; copy store from '/home/ywb/Data/tpcds-kit-master/tools/data/store.dat' with delimiter as '|' NULL ''; copy store_returns from '/home/ywb/Data/tpcds-kit-master/tools/data/store_returns.dat' with delimiter as '|' NULL ''; copy store_sales from '/home/ywb/Data/tpcds-kit-master/tools/data/store_sales.dat' with delimiter as '|' NULL ''; copy time_dim from '/home/ywb/Data/tpcds-kit-master/tools/data/time_dim.dat' with delimiter as '|' NULL ''; copy warehouse from '/home/ywb/Data/tpcds-kit-master/tools/data/warehouse.dat' with delimiter as '|' NULL ''; copy web_page from '/home/ywb/Data/tpcds-kit-master/tools/data/web_page.dat' with delimiter as '|' NULL ''; copy web_returns from '/home/ywb/Data/tpcds-kit-master/tools/data/web_returns.dat' with delimiter as '|' NULL ''; copy web_sales from '/home/ywb/Data/tpcds-kit-master/tools/data/web_sales.dat' with delimiter as '|' NULL ''; copy web_site from '/home/ywb/Data/tpcds-kit-master/tools/data/web_site.dat' with delimiter as '|' NULL '';
4、生成查询
for i in `seq 1 99` do ./dsqgen -DIRECTORY ../query_templates/ -TEMPLATE "query${i}.tpl" -DIALECT netezza -FILTER Y > ../sql/query${i}.sql done
-DIRECTORY:SQL模板的路径。 这个是查询模板的文件目录位置 …/query_templates/注意:因为query_templates只有99个查询模板,所以最多一次只能生成99个查询语句,但是如果生成多次,生成的99个查询语句是一样
-TEMPLATE:SQL模板的名称
-DIALECT:include query dialect defintions found in < s >.tpl 。-DIALECT postgresql 是哪种数据库的查询语句
注意:DIALECT支持oracle,db2,sqlserver,netezza,ansi,但是没有postgresql。可以去query_templates文件目录下看是否有对应的模板文件,推荐netezza,pg可以解析。
-FILTER:重定向到标准输出,即…/sql/query${i}.sql 是输出路径和文件类型。
原文链接:https://blog.csdn.net/qq_41619524/article/details/120940683
本文来自博客园,作者:业余砖家,转载请注明原文链接:https://www.cnblogs.com/yeyuzhuanjia/p/16435502.html