Fast Data Loading with pg_bulkload
Git repository: https://github.com/ossc-db/pg_bulkload
Introduction
PostgreSQL ships with copy, a convenient built-in data loading command: copy loads data from a file into a table and unloads a table back to a file. pg_bulkload is a high-speed data loading tool for PostgreSQL, and its biggest advantage over copy is speed: in direct mode it skips the shared buffer and WAL buffer and writes the data files directly. Because a failed direct-mode load can leave the table inconsistent, pg_bulkload also includes a data recovery function that must be run after such a failure.
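Recovery is invoked with the -r mode shown in the usage output below. A minimal sketch, assuming $PGDATA points at the cluster's data directory:

# run as the cluster owner, before restarting PostgreSQL after a
# failed direct-mode load
pg_bulkload -r -D $PGDATA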
Architecture
pg_bulkload consists of two main modules: a reader and a writer. The reader reads the input file and parses it into tuples; the writer writes the parsed tuples to the output target. The earliest versions of pg_bulkload did nothing but load data; version 3.1 added data filtering.
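As a sketch of that filtering feature (the function name and body here are illustrative, not from the original post): a filter is a SQL function that receives the parsed fields of each input record and returns a record shaped like the target table, wired in through the FILTER option:

-- hypothetical filter for the test table used later: uppercase the
-- text column, pass the other fields through
create function test_filter(integer, text, timestamp)
returns record
as $$ select $1, upper($2), $3 $$
language sql;

-- then enable it with -o "FILTER=test_filter" or a FILTER line in a
-- control file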
Installation
unzip pg_bulkload-master.zip
cd pg_bulkload-master
make              # requires pg_config of the target PostgreSQL in PATH
make install
After installation, create the extension in the target database:
create extension pg_bulkload;
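A quick way to confirm that the extension is installed in the current database, from psql:

db_sqlfx=# \dx pg_bulkload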
pg_bulkload options
[pg@sdbserver1 pg_bulkload-master]$ pg_bulkload --help
pg_bulkload is a bulk data loading tool for PostgreSQL
Usage:
Dataload: pg_bulkload [dataload options] control_file_path
Recovery: pg_bulkload -r [-D DATADIR]
Dataload options:
-i, --input=INPUT INPUT path or function
-O, --output=OUTPUT OUTPUT path or table
-l, --logfile=LOGFILE LOGFILE path
-P, --parse-badfile=* PARSE_BADFILE path
-u, --duplicate-badfile=* DUPLICATE_BADFILE path
-o, --option="key=val" additional option
Recovery options:
-r, --recovery execute recovery
-D, --pgdata=DATADIR database directory
Connection options:
-d, --dbname=DBNAME database to connect
-h, --host=HOSTNAME database server host or socket directory
-p, --port=PORT database server port
-U, --username=USERNAME user name to connect as
-w, --no-password never prompt for password
-W, --password force password prompt
Generic options:
-e, --echo echo queries
-E, --elevel=LEVEL set output message level
--help show this help, then exit
--version output version information, then exit
Read the website for details. http://github.com/ossc-db/pg_bulkload
Report bugs to http://github.com/ossc-db/pg_bulkload/issues.
The keys accepted by -o are not listed in the help output; the test.log file produced by the load below shows which parameters are available. The same keys can also be collected in a control file, as sketched after the first load.
Using pg_bulkload
db_sqlfx=# create table test(id int,info text ,crt_time timestamp(0));
CREATE TABLE
db_sqlfx=# insert into test select generate_series(1,5000000),md5(random()::text),clock_timestamp();
INSERT 0 5000000
db_sqlfx=# create index i_test_id on test(id);
CREATE INDEX
db_sqlfx=# copy test to '/home/pg/test.csv' with(format 'csv');
COPY 5000000
-- Load directly, without dropping the index
[pg@sdbserver1 pg_bulkload-master]$ pg_bulkload -i /home/pg/test.csv -O test -l /home/pg/test.log -p 6543 -d db_sqlfx -o "TYPE=CSV" -o "WRITER=PARALLEL" -o "TRUNCATE=YES"
NOTICE: BULK LOAD START
NOTICE: BULK LOAD END
0 Rows skipped.
5000000 Rows successfully loaded.
0 Rows not loaded due to parse errors.
0 Rows not loaded due to duplicate errors.
0 Rows replaced with new rows.
Note that the copy export must specify with(format 'csv'); otherwise the TYPE=CSV load will fail with parse errors.
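The three -o pairs used above can equally be collected in a control file and passed as the positional control_file_path argument from the usage output. A minimal sketch using the same paths:

# /home/pg/test.ctl -- equivalent to the -o flags above
INPUT = /home/pg/test.csv
OUTPUT = public.test
LOGFILE = /home/pg/test.log
TYPE = CSV
WRITER = PARALLEL
TRUNCATE = YES

pg_bulkload /home/pg/test.ctl -p 6543 -d db_sqlfx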
pg_bulkload elapsed time
21.03 sec
Note that in the log below, the WRITER=PARALLEL option from the command line is recorded as WRITER = DIRECT plus MULTI_PROCESS = YES: PARALLEL is the direct writer driven by a separate reader process.
[pg@sdbserver1 pg_bulkload-master]$ cat /home/pg/test.log
pg_bulkload 3.1.18 on 2021-09-22 18:13:26.168808+08
INPUT = /home/pg/test.csv
PARSE_BADFILE = /home/pg/abdata/7.0/abase1/pg_bulkload/20210922181326_db_sqlfx_public_test.prs.csv
LOGFILE = /home/pg/test.log
LIMIT = INFINITE
PARSE_ERRORS = 0
CHECK_CONSTRAINTS = NO
TYPE = CSV
SKIP = 0
DELIMITER = ,
QUOTE = """
ESCAPE = """
NULL =
OUTPUT = public.test
MULTI_PROCESS = YES
VERBOSE = NO
WRITER = DIRECT
DUPLICATE_BADFILE = /home/pg/abdata/7.0/abase1/pg_bulkload/20210922181326_db_sqlfx_public_test.dup.csv
DUPLICATE_ERRORS = 0
ON_DUPLICATE_KEEP = NEW
TRUNCATE = YES
0 Rows skipped.
5000000 Rows successfully loaded.
0 Rows not loaded due to parse errors.
0 Rows not loaded due to duplicate errors.
0 Rows replaced with new rows.
Run began on 2021-09-22 18:13:26.168808+08
Run ended on 2021-09-22 18:13:47.19649+08
CPU 0.66s/8.85u sec elapsed 21.03 sec
copy elapsed time
-- Export the data
db_sqlfx=# copy test to '/home/pg/test.sql';
COPY 5000000
-- Truncate the table
db_sqlfx=# truncate table test;
TRUNCATE TABLE
Time: 95.247 ms
-- Import the CSV
db_sqlfx=# copy test from '/home/pg/test.csv';
COPY 5000000
Time: 38811.062 ms (00:38.811)
With the index in place, pg_bulkload (21.0 s) is clearly faster than copy (38.8 s).
How do the two compare when the index does not exist?
-- Drop the index
drop index i_test_id;
-- copy takes about 24 s
db_sqlfx=# copy test from '/home/pg/test.csv';
COPY 5000000
Time: 24227.334 ms (00:24.227)
-- pg_bulkload takes about 10 s
pg_bulkload: 10 s
-- Rebuilding the index takes about 10 s
db_sqlfx=# CREATE INDEX i_test_id ON public.test USING btree (id);
CREATE INDEX
Time: 10552.171 ms (00:10.552)
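Summing up answers the question above: without the index, copy plus the index rebuild comes to roughly 24.2 s + 10.6 s ≈ 35 s, while pg_bulkload plus the rebuild is about 10 s + 10.6 s ≈ 21 s, essentially the same as its result when loading with the index kept in place.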
unlogged table
-- Keep the index and clear the data, then flip the table and its index to unlogged:
update pg_class set relpersistence='u' where relname='test';
update pg_class set relpersistence='u' where relname='i_test_id';
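Updating pg_class by hand like this is an unsafe catalog hack, acceptable only for a throwaway test. On PostgreSQL 9.5 and later, the supported equivalent is a sketch like:

-- rewrites the table and switches its indexes to unlogged as well
alter table test set unlogged;
-- ... load ...
alter table test set logged;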
-- pg_bulkload: 17 s
Run began on 2021-09-23 16:22:52.192337+08
Run ended on 2021-09-23 16:23:09.726125+08
CPU 0.63s/7.43u sec elapsed 17.53 sec
--copy
db_sqlfx=# copy test from '/home/pg/test.csv';
COPY 5000000
Time: 27020.258 ms (00:27.020)
With WAL logging off, copy improves considerably, while pg_bulkload gains little, because its direct mode already bypasses shared_buffers and the WAL.
Finally, a batch loading script:
-bash-4.1$ cat load.sh
#!/bin/sh

# $1: data file name (gzipped CSV)

file=$1

if [ ! -f "$file" ]
then
    echo "File does not exist"
    exit 1
fi

echo "-----------------------------------------------------------------"

# table name = file name up to the first dot
tbname=$(echo "$file" | cut -d . -f1)
echo "Table name is: $tbname"

zcat "$file" | pg_bulkload -i stdin -O public.$tbname -l $tbname.log -o "TYPE=CSV" -o "WRITER=PARALLEL" -d sgdw

echo "load complete"
echo "-----------------------------------------------------------------"
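Assuming a gzipped CSV named after its target table (test.csv.gz is a hypothetical example), a run looks like:

sh load.sh test.csv.gz    # tbname becomes "test", so it loads into public.test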
Summary
1. With an index in place, pg_bulkload is roughly twice as fast as copy; without the index it is still clearly faster.
2. copy improves on an unlogged table, but still trails pg_bulkload.
3. pg_bulkload's advantage is structural: direct mode bypasses the shared buffer and WAL buffer and writes the data files directly. Like an unlogged table, it writes no WAL, so with streaming-replication standbys the standby must be rebuilt after a direct-mode load, and base backups must be taken again.
4. Load performance stays very good even for very large tables.