Linux-数据集 TPC-H、TPC-DS的导入和使用(MySQL)

. TPC-H 数据集

1数据集下载

TPC-H数据集: https://github.com/gregrahn/tpch-kit

可采用gcc下载或者直接下载zip包,然后解压即可。

具体使用方法可以查看 dbgen文件目录下的:README文件

2、数据表的创建与约束

在下载的文件包下的dbgen文件夹下,有2个文件,分别是dss.ddl、dss.ri。

1)dss.ddl是创建表的语句,将里面的内容放入到dss.sql中(你自己创建的sql文件),执行表中的语句即可创建表。

\i  /pathA/dss.sql

# 该文件对应的路径

2)dss.ri 是表之间的限制条件也将内容复制到sql文件下dss_ri.sql,然后执行。

注意:数据约束这里有问题,需要修改下表内的内容,把外键这里的语句改成如下格式。

ALTER TABLE REGION  ADD PRIMARY KEY (R_REGIONKEY);

ALTER TABLE NATION  ADD PRIMARY KEY (N_NATIONKEY);
ALTER TABLE NATION  ADD CONSTRAINT NATION_FK1 FOREIGN KEY (N_REGIONKEY) REFERENCES REGION (R_REGIONKEY);
COMMIT WORK;

ALTER TABLE PART  ADD PRIMARY KEY (P_PARTKEY);
COMMIT WORK;

ALTER TABLE SUPPLIER  ADD PRIMARY KEY (S_SUPPKEY);
ALTER TABLE SUPPLIER  ADD CONSTRAINT SUPPLIER_FK1 FOREIGN KEY(S_NATIONKEY) references NATION(N_NATIONKEY);
COMMIT WORK;

ALTER TABLE PARTSUPP  ADD PRIMARY KEY (PS_PARTKEY,PS_SUPPKEY);
COMMIT WORK;

ALTER TABLE CUSTOMER  ADD PRIMARY KEY (C_CUSTKEY);
ALTER TABLE CUSTOMER  ADD CONSTRAINT CUSTOMER_FK1 FOREIGN KEY(C_NATIONKEY) references NATION(N_NATIONKEY);
COMMIT WORK;
ALTER TABLE LINEITEM ADD PRIMARY KEY (L_ORDERKEY,L_LINENUMBER); COMMIT WORK; ALTER TABLE ORDERS ADD PRIMARY KEY (O_ORDERKEY); COMMIT WORK; ALTER TABLE PARTSUPP ADD CONSTRAINT PARTSUPP_FK1 FOREIGN KEY (PS_SUPPKEY) references SUPPLIER(S_SUPPKEY); COMMIT WORK; ALTER TABLE PARTSUPP ADD CONSTRAINT PARTSUPP_FK2 FOREIGN KEY (PS_PARTKEY) references PART (P_PARTKEY); COMMIT WORK; ALTER TABLE ORDERS ADD CONSTRAINT ORDERS_FK1 FOREIGN KEY (O_CUSTKEY) references CUSTOMER(C_CUSTKEY); COMMIT WORK; ALTER TABLE LINEITEM ADD CONSTRAINT LINEITEM_FK1 FOREIGN KEY (L_ORDERKEY) references ORDERS(O_ORDERKEY); COMMIT WORK; ALTER TABLE LINEITEM ADD CONSTRAINT LINEITEM_FK2 FOREIGN KEY (L_PARTKEY,L_SUPPKEY) references PARTSUPP (PS_PARTKEY,PS_SUPPKEY); COMMIT WORK;

执行命令

\i /pathA/dss_ri.sql  

# 该文件对应的路径

3、数据的生成与导入

1)生成dbgen、qgen

通过下面命令,生成两个文件,分别是dbgen、qgen。分别用于产生数据和查询。

make  -f  Makefile

2)生成数据

生成8个tbl文件,生成以后修改八个文件的可执行权限,将这八个tbl文件放入创建好的文件目录tbl下(注意1表示1G数据量,0.1表示100M数据量)。

./dbgen  -s 1 -f
chmod 777 tbl/*.tbl

或者单独生成每个表的数据

./dbgen -vf -s 10 -T r | tee -a region.tbl
./dbgen -vf -s 10 -T n | tee -a nation.tbl
./dbgen -vf -s 10 -T c | tee -a customer.tbl
./dbgen -vf -s 10 -T L | tee -a lineitem.tbl
./dbgen -vf -s 10 -T O | tee -a orders.tbl
./dbgen -vf -s 10 -T P | tee -a part.tbl
./dbgen -vf -s 10 -T S | tee -a partsupp.tbl
./dbgen -vf -s 10 -T s | tee -a supplier.tbl

3)数据导入

copy customer from '/home/ywb/Data/tpch-kit-master/dbgen/tbl/customer.tbl' with delimiter as '|' NULL '';
copy lineitem from '/home/ywb/Data/tpch-kit-master/dbgen/tbl/lineitem.tbl' with delimiter as '|' NULL '';
copy nation from '/home/ywb/Data/tpch-kit-master/dbgen/tbl/nation.tbl' with delimiter as '|' NULL '';
copy orders from '/home/ywb/Data/tpch-kit-master/dbgen/tbl/orders.tbl' with delimiter as '|' NULL '';
copy part from '/home/ywb/Data/tpch-kit-master/dbgen/tbl/part.tbl' with delimiter as '|' NULL '';
copy partsupp from '/home/ywb/Data/tpch-kit-master/dbgen/tbl/partsupp.tbl' with delimiter as '|' NULL '';
copy region from '/home/ywb/Data/tpch-kit-master/dbgen/tbl/region.tbl' with delimiter as '|' NULL '';
copy supplier from '/home/ywb/Data/tpch-kit-master/dbgen/tbl/supplier.tbl' with delimiter as '|' NULL '';

或者命令行 

load data local infile '/home/Data/tpch-kit-master/dbgen/tbl/region.tbl' into table region fields terminated by '|' lines terminated by '\n';
load data local infile '/home/Data/tpch-kit-master/dbgen/tbl/nation.tbl' into table nation fields terminated by '|' lines terminated by '\n';
load data local infile '/home/Data/tpch-kit-master/dbgen/tbl/customer.tbl' into table customer fields terminated by '|' lines terminated by '\n';
load data local infile '/home/Data/tpch-kit-master/dbgen/tbl/supplier.tbl' into table supplier fields terminated by '|' lines terminated by '\n';
load data local infile '/home/Data/tpch-kit-master/dbgen/tbl/part.tbl' into table part fields terminated by '|' lines terminated by '\n';
load data local infile '/home/Data/tpch-kit-master/dbgen/tbl/orders.tbl' into table orders fields terminated by '|' lines terminated by '\n';
load data local infile '/home/Data/tpch-kit-master/dbgen/tbl/partsupp.tbl' into table partsupp fields terminated by '|' lines terminated by '\n';
load data local infile '/home/Data/tpch-kit-master/dbgen/tbl/lineitem.tbl' into table lineitem fields terminated by '|' lines terminated by '\n';

4、查询语句生成 

1)将dists.dss、qgen复制到queries下

2)批量生成sql语句(在queries目录下执行命令)

for i  in {1..22}
do
  name="d$i.sql"
  echo $name
  ./qgen -d $i >$name
done

 

. TPC-DS数据集 

1、数据下载

TPC-DS数据集: https://github.com/gregrahn/tpcds-kit

采用gcc下载或者直接下载zip包,然后解压即可。

具体使用方法可以查看 tools文件目录下的:How_To_Guide-DS-V2.0.0.docx

2、数据表创建与约束

在下载的文件包下的tools文件夹下,有3个sql文件,分别是tpcds.sql、tocds_ri.sql、tpcds_source.sql。

1)tpcds.sql是创建表的语句,执行表中的语句即可创建表。

\i /pathA/tpcds.sql  

# 该文件对应的路径

2)tpcds_ri.sql 是表之间的限制条件 

注意这一步表的约束最好放到导入数据完以后再执行,否则主外键的的错误。

\i /pathA/tpcds_ri.sql  

# 该文件对应的路径

3、数据生成与导入 

1)数据生成(在tools目录下)

dsdgen、dsqgen分别用于生成数据和查询模板。

注意:这里可能make的时候没有生成dsqgen这个,原因是少了两个包,根据错误提示下载即可

sudo apt-get install byacc     # 或者  apt-get install yacc  
sudo apt-get install flex
make -f Makefile.suite  # dsdgen、dsqgen 生成
./dsdgen -sc 1 -DIR 'data_output_path'

# ./dsdgen -sc 1 -DIR 'data/'  # data是我创建的数据目录位置,在tools下
# ./dsdgen -h 查看用法
# data_output_path: 数据生成路径
# 如果没有dsdgen,则执行命令:

# make -f Makefile.suite

2)修改文件权限 

修改data/下所有.dat文件的可执行权限。

chmod 777 data/*.dat

3) 数据导入 

注意:生成的数据中会多一个 ’ | ’ 字符,导致导入数据报错,所以需要先删除每行最后一个 |,直接在dat文件目录下终端执行该命令。

删除多余的 |

for i in `ls *.dat`
do
     name=$i
     echo $name
     sed -i 's#|$##g' $name
done

导入数据 

copy call_center from '/home/ywb/Data/tpcds-kit-master/tools/data/call_center.dat' with delimiter as '|' NULL '';
copy catalog_page from '/home/ywb/Data/tpcds-kit-master/tools/data/catalog_page.dat' with delimiter as '|' NULL '';
copy catalog_returns from '/home/ywb/Data/tpcds-kit-master/tools/data/catalog_returns.dat' with delimiter as '|' NULL '';
copy catalog_sales from '/home/ywb/Data/tpcds-kit-master/tools/data/catalog_sales.dat' with delimiter as '|' NULL '';
copy customer from '/home/ywb/Data/tpcds-kit-master/tools/data/customer.dat' with delimiter as '|' NULL '';
copy customer_address from '/home/ywb/Data/tpcds-kit-master/tools/data/customer_address.dat' with delimiter as '|' NULL '';
copy customer_demographics from '/home/ywb/Data/tpcds-kit-master/tools/data/customer_demographics.dat' with delimiter as '|' NULL '';
copy date_dim from '/home/ywb/Data/tpcds-kit-master/tools/data/date_dim.dat' with delimiter as '|' NULL '';
copy dbgen_version from '/home/ywb/Data/tpcds-kit-master/tools/data/dbgen_version.dat' with delimiter as '|' NULL '';
copy household_demographics from '/home/ywb/Data/tpcds-kit-master/tools/data/household_demographics.dat' with delimiter as '|' NULL '';
copy income_band from '/home/ywb/Data/tpcds-kit-master/tools/data/income_band.dat' with delimiter as '|' NULL '';
copy inventory from '/home/ywb/Data/tpcds-kit-master/tools/data/inventory.dat' with delimiter as '|' NULL '';
copy item from '/home/ywb/Data/tpcds-kit-master/tools/data/item.dat' with delimiter as '|' NULL '';
copy promotion from '/home/ywb/Data/tpcds-kit-master/tools/data/promotion.dat' with delimiter as '|' NULL '';
copy reason from '/home/ywb/Data/tpcds-kit-master/tools/data/reason.dat' with delimiter as '|' NULL '';
copy ship_mode from '/home/ywb/Data/tpcds-kit-master/tools/data/ship_mode.dat' with delimiter as '|' NULL '';
copy store from '/home/ywb/Data/tpcds-kit-master/tools/data/store.dat' with delimiter as '|' NULL '';
copy store_returns from '/home/ywb/Data/tpcds-kit-master/tools/data/store_returns.dat' with delimiter as '|' NULL '';
copy store_sales from '/home/ywb/Data/tpcds-kit-master/tools/data/store_sales.dat' with delimiter as '|' NULL '';
copy time_dim from '/home/ywb/Data/tpcds-kit-master/tools/data/time_dim.dat' with delimiter as '|' NULL '';
copy warehouse from '/home/ywb/Data/tpcds-kit-master/tools/data/warehouse.dat' with delimiter as '|' NULL '';
copy web_page from '/home/ywb/Data/tpcds-kit-master/tools/data/web_page.dat' with delimiter as '|' NULL '';
copy web_returns from '/home/ywb/Data/tpcds-kit-master/tools/data/web_returns.dat' with delimiter as '|' NULL '';
copy web_sales from '/home/ywb/Data/tpcds-kit-master/tools/data/web_sales.dat' with delimiter as '|' NULL '';
copy web_site from '/home/ywb/Data/tpcds-kit-master/tools/data/web_site.dat' with delimiter as '|' NULL '';

4生成查询 

for i in `seq 1 99`
do
  ./dsqgen  -DIRECTORY ../query_templates/ -TEMPLATE "query${i}.tpl" -DIALECT netezza -FILTER Y > ../sql/query${i}.sql
done

-DIRECTORY:SQL模板的路径。 这个是查询模板的文件目录位置 …/query_templates/注意:因为query_templates只有99个查询模板,所以最多一次只能生成99个查询语句,但是如果生成多次,生成的99个查询语句是一样

-TEMPLATE:SQL模板的名称

-DIALECT:include query dialect defintions found in < s >.tpl 。-DIALECT postgresql 是哪种数据库的查询语句

注意:DIALECT支持oracle,db2,sqlserver,netezza,ansi,但是没有postgresql。可以去query_templates文件目录下看是否有对应的模板文件,推荐netezza,pg可以解析。

-FILTER:重定向到标准输出,即…/sql/query${i}.sql 是输出路径和文件类型。

 

原文链接:https://blog.csdn.net/qq_41619524/article/details/120940683

posted @ 2022-07-01 17:46  业余砖家  阅读(1630)  评论(0编辑  收藏  举报