Large Object Export Performance Test

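When large-object data is stored out of line (external storage), exporting it in custom format turns out to be very slow. This article tests several export scenarios and measures the export efficiency of each.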

To keep the results comparable, all three scenarios use the same nominal data volume (3.2G); only the size of a single row differs.
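As a quick sanity check of the nominal volumes, the sketch below multiplies row count by nominal row size for the two CLOB scenarios used in this test (3,200,000 rows of 1K and 100,000 rows of 32K). The helper name `nominal_kb` is hypothetical and exists only for this illustration. The BLOB scenario is left out because the size of its source file is not stated.

```shell
#!/bin/sh
# nominal_kb ROWS KB_PER_ROW: print the nominal data volume in KB.
# Hypothetical helper, used only to check the arithmetic in the text.
nominal_kb() {
    echo $(( $1 * $2 ))
}

inline_kb=$(nominal_kb 3200000 1)     # inline scenario: 3,200,000 rows of 1K
external_kb=$(nominal_kb 100000 32)   # out-of-line CLOB scenario: 100,000 rows of 32K
echo "inline:   ${inline_kb} KB"
echo "external: ${external_kb} KB"
```

Both scenarios come to 3,200,000 KB, which is the 3.2G quoted above when 1G is counted as 1,000,000K.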

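I. Export efficiency with inline storage

Inline storage means the value is kept inside the row, so a row cannot exceed one page. Export performance is then no different from that of a table without large objects. For example: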

create table t_clob(id integer,content clob);
alter table t_clob alter column content set storage main;

--each row: 32 * 32 = 1k
insert into t_clob select generate_series(1,3200000),
   (select array_to_string(array(select crypt(substring('0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz' FROM (ceil(random()*62))::int FOR 1),gen_salt('md5')) FROM generate_series(1, 32)), ''));

Export timings:

[c5@dbhost03 temp]$ date
Sat Dec 18 15:55:39 CST 2021
[c5@dbhost03 temp]$ sys_dump -Fp -f 1.dmp -t t_clob -d test -U system 
[c5@dbhost03 temp]$ date
Sat Dec 18 15:56:19 CST 2021

[c5@dbhost03 temp]$ date
Sat Dec 18 15:53:57 CST 2021
[c5@dbhost03 temp]$ sys_dump -Fc -f 1.dmp -t t_clob -d test -U system 
[c5@dbhost03 temp]$ date
Sat Dec 18 15:54:37 CST 2021

For inline storage, custom and plain exports take the same time (40 seconds each).
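The elapsed times in this article are read off pairs of `date` outputs. A minimal sketch of that subtraction, assuming both timestamps fall on the same day; the function name `elapsed_hms` is hypothetical:

```shell
#!/bin/sh
# elapsed_hms START END: seconds between two HH:MM:SS times on the same day.
elapsed_hms() {
    awk -v s="$1" -v e="$2" 'BEGIN {
        split(s, a, ":"); split(e, b, ":")
        print (b[1]*3600 + b[2]*60 + b[3]) - (a[1]*3600 + a[2]*60 + a[3])
    }'
}

elapsed_hms 15:55:39 15:56:19   # plain export of the inline table
elapsed_hms 15:53:57 15:54:37   # custom export of the inline table
```

The same helper can be applied to any of the timestamp pairs in the later sections.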

II. Out-of-line storage: exporting CLOB data

1. Build the test data

create table t_clob(id integer,content clob);

--keep the data as random as possible. Each row: 32 * 1024 = 32K
insert into t_clob select generate_series(1,100000),
(select array_to_string(array(select crypt(substring('0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz' FROM (ceil(random()*62))::int FOR 1),gen_salt('md5')) FROM generate_series(1, 1024)), ''));

test=# \d+ t_clob
                                   Table "public.t_clob"
 Column  |  Type   | Collation | Nullable | Default | Storage  | Stats target | Description 
---------+---------+-----------+----------+---------+----------+--------------+-------------
 id      | integer |           |          |         | plain    |              | 
 content | clob    |           |          |         | extended |              | 
Access method: heap

extended: the value is stored externally (out of line) and compressed
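For reference, the four column storage strategies can be summarized as a small lookup. The descriptions follow PostgreSQL's TOAST documentation; that KingbaseES behaves identically is an assumption here, and the function name `storage_desc` is hypothetical.

```shell
#!/bin/sh
# storage_desc MODE: describe a column storage strategy.
# Descriptions follow PostgreSQL's TOAST documentation; identical
# behavior in KingbaseES is assumed, not verified.
storage_desc() {
    case "$1" in
        plain)    echo "inline only, no compression" ;;
        main)     echo "inline preferred, compression allowed" ;;
        external) echo "out-of-line allowed, no compression" ;;
        extended) echo "out-of-line allowed, compression allowed" ;;
        *)        echo "unknown storage mode: $1" >&2; return 1 ;;
    esac
}

storage_desc main       # strategy set for the inline test table
storage_desc extended   # default for the clob column shown above
```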

2. Export speed without the compression option

--plain format (-Fp): 33 seconds
[c5@dbhost03 temp]$ date
Sat Dec 18 10:54:26 CST 2021
[c5@dbhost03 temp]$ sys_dump -Fp -f 1.dmp -t t_clob -d test -U system 
[c5@dbhost03 temp]$ date
Sat Dec 18 10:54:59 CST 2021
[c5@dbhost03 temp]$ ls -l
total 4194240
-rw-rw-r-- 1 c5 c5 3482289796 Dec 18 10:54 1.dmp

--tar format (-Ft): 45 seconds
[c5@dbhost03 temp]$ date
Sat Dec 18 10:56:11 CST 2021
[c5@dbhost03 temp]$ sys_dump -Ft -f 1.dmp -t t_clob -d test -U system 
[c5@dbhost03 temp]$ date
Sat Dec 18 10:56:56 CST 2021
[c5@dbhost03 temp]$ ls -l
total 4194240
-rw-rw-r-- 1 c5 c5 3482295296 Dec 18 10:56 1.dmp

--custom format (-Fc): 251 seconds
[c5@dbhost03 temp]$ date
Sat Dec 18 11:03:07 CST 2021
[c5@dbhost03 temp]$ sys_dump -Fc -f 1.dmp -t t_clob -d test -U system 
[c5@dbhost03 temp]$ date
Sat Dec 18 11:07:18 CST 2021
[c5@dbhost03 temp]$ ls -l
total 2474068
-rw-rw-r-- 1 c5 c5 2533441731 Dec 18 11:07 1.dmp

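Conclusion: plain export is the fastest, taking 33 seconds for 100,000 rows. Custom export is the slowest, taking 4 minutes 11 seconds for the same 100,000 rows, but it produces the smallest file because custom format compresses its output.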
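To put the three timings on a common scale, the sketch below converts them to throughput. It takes the size of the plain dump (3482289796 bytes) as the logical amount of data each run had to process, which is an assumption made for comparison; the function name `mb_per_sec` is hypothetical.

```shell
#!/bin/sh
# mb_per_sec BYTES SECONDS: throughput in MB/s (1 MB = 1,000,000 bytes).
mb_per_sec() {
    awk -v b="$1" -v s="$2" 'BEGIN { printf "%.1f\n", b / s / 1000000 }'
}

data_bytes=3482289796   # size of the plain dump, taken as the logical data size
echo "plain:  $(mb_per_sec "$data_bytes" 33) MB/s"
echo "tar:    $(mb_per_sec "$data_bytes" 45) MB/s"
echo "custom: $(mb_per_sec "$data_bytes" 251) MB/s"
```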

3. Export with compression

Compressing directly in the dump (-Z 1): 175 seconds

[c5@dbhost03 temp]$ date
Sat Dec 18 11:09:36 CST 2021
[c5@dbhost03 temp]$ sys_dump -Fp -f 1.dmp -t t_clob -d test -U system -Z 1
[c5@dbhost03 temp]$ date
Sat Dec 18 11:12:31 CST 2021
[c5@dbhost03 temp]$ ls -l
total 2544552
-rw-rw-r-- 1 c5 c5 2605617264 Dec 18 11:12 1.dmp

Compressing through a named pipe while dumping: 126 seconds

[c5@dbhost03 temp]$ mknod 1.dmp.p p

[c5@dbhost03 temp]$ date
Sat Dec 18 11:56:54 CST 2021
[c5@dbhost03 temp]$ sys_dump -Fp -f 1.dmp.p -t t_clob -d test -U system &
[1] 36191
[c5@dbhost03 temp]$ compress < 1.dmp.p > 1.dmp.Z
date
[1]+  Done                    sys_dump -Fp -f 1.dmp.p -t t_clob -d test -U system
[c5@dbhost03 temp]$ date
Sat Dec 18 11:59:00 CST 2021
[c5@dbhost03 temp]$  ls -l
total 1762932
-rw-rw-r-- 1 c5 c5 1805240015 Dec 18 11:59 1.dmp.Z
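The named-pipe pattern above can be tried without a database. The following is a minimal sketch under stated assumptions: `mkfifo` creates the same kind of node as `mknod ... p`, `gzip -1` stands in for `compress`, and `seq` is a stand-in producer for the `sys_dump` command, so the output size says nothing about real dump data.

```shell
#!/bin/sh
workdir=$(mktemp -d)
fifo="$workdir/dump.fifo"
out="$workdir/dump.gz"

mkfifo "$fifo"                      # same kind of node as `mknod 1.dmp.p p`

# Start the compressor first; it blocks until a writer opens the pipe.
gzip -1 < "$fifo" > "$out" &
gzip_pid=$!

# Stand-in producer. In the real test this would be the
# `sys_dump -Fp -f <fifo> ...` command shown above.
seq 1 100000 > "$fifo"

wait "$gzip_pid"                    # make sure the compressor has finished
ls -l "$out"
```

The transcript above starts the dump in the background and runs the compressor in the foreground; the order does not matter, because each side blocks until the other end of the pipe is opened.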

III. Out-of-line storage: exporting BLOB data

1. Prepare the data

create table t_blob(id integer,content blob);

insert into t_blob select generate_series(1,100000),blob_import('/data/c5/temp/1.txt.tar');

2. Export without the compression option

--plain format (-Fp): 73 seconds
[c5@dbhost03 temp]$ date
Sat Dec 18 14:21:16 CST 2021
[c5@dbhost03 temp]$ sys_dump -Fp -f 1.dmp -t t_blob -d test -U system 
[c5@dbhost03 temp]$ date
Sat Dec 18 14:22:29 CST 2021
[c5@dbhost03 temp]$ ls -l 1.dmp
-rw-rw-r-- 1 c5 c5 7346689796 Dec 18 14:22 1.dmp

--tar format (-Ft): 96 seconds
[c5@dbhost03 temp]$ date
Sat Dec 18 14:23:42 CST 2021
[c5@dbhost03 temp]$ sys_dump -Ft -f 1.dmp -t t_blob -d test -U system 
[c5@dbhost03 temp]$ date
Sat Dec 18 14:25:18 CST 2021
[c5@dbhost03 temp]$ ls -l 1.dmp
-rw-rw-r-- 1 c5 c5 7346695168 Dec 18 14:25 1.dmp

--custom format (-Fc): 93 seconds
[c5@dbhost03 temp]$ date
Sat Dec 18 14:29:46 CST 2021
[c5@dbhost03 temp]$ sys_dump -Fc -f 1.dmp -t t_blob -d test -U system 
[c5@dbhost03 temp]$ date
Sat Dec 18 14:31:19 CST 2021
[c5@dbhost03 temp]$ ls -l
total 423636
-rw-rw-r-- 1 c5 c5 433723954 Dec 18 14:31 1.dmp

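Conclusion: plain is the fastest but produces the largest file. Because every row holds the same data, custom format achieves a very high compression ratio, so the output file is small and the overall export is also fairly fast.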
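The difference in compressibility between the two data sets can be quantified from the `ls -l` sizes reported above, dividing the plain dump size by the custom dump size. The function name `ratio` is hypothetical.

```shell
#!/bin/sh
# ratio PLAIN_BYTES COMPRESSED_BYTES: how many times smaller the compressed dump is.
ratio() {
    awk -v p="$1" -v c="$2" 'BEGIN { printf "%.1f\n", p / c }'
}

echo "CLOB: $(ratio 3482289796 2533441731)x"   # random text, compresses poorly
echo "BLOB: $(ratio 7346689796 433723954)x"    # identical rows, compresses well
```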

3. Export with compression

Compressing directly in the dump (-Z 1): 88 seconds

[c5@dbhost03 temp]$ date
Sat Dec 18 14:51:44 CST 2021
[c5@dbhost03 temp]$ sys_dump -Fp -f 1.dmp -t t_blob -d test -U system -Z 1
[c5@dbhost03 temp]$ date
Sat Dec 18 14:53:12 CST 2021
[c5@dbhost03 temp]$ ls -l
total 524300
-rw-rw-r-- 1 c5 c5 500398983 Dec 18 14:53 1.dmp

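IV. Conclusions

The file produced by a plain export reflects the actual (uncompressed) size of the data, which is usually larger than the space the table occupies in the database, because the default extended storage compresses values.
Plain export is not slow. What is slow is custom export of externally stored data; with inline storage, custom export is not slow.
Custom export compresses its output, so when the data compresses well the reduced write I/O improves performance, as the BLOB test above shows.
With plain export, the compression option can cost a great deal of time; the actual effect depends on the compression ratio. When the ratio is high, the I/O saved offsets the extra CPU cost. In the CLOB test, level-1 compression took about 5 times as long as the uncompressed export (175 vs. 33 seconds); in the BLOB test, where every row is identical and the compression ratio is high, the time was close to that of plain export (88 vs. 73 seconds).
If compression is needed, prefer the named-pipe approach and compress while dumping.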
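The relative cost of compression in these conclusions can be recomputed from the measured times. A small sketch, with the hypothetical function name `slowdown`, dividing each compressed run by the corresponding plain run:

```shell
#!/bin/sh
# slowdown COMPRESSED_SECONDS PLAIN_SECONDS: cost of compression relative to plain.
slowdown() {
    awk -v z="$1" -v p="$2" 'BEGIN { printf "%.1f\n", z / p }'
}

echo "CLOB, -Z 1:       $(slowdown 175 33)x"
echo "CLOB, named pipe: $(slowdown 126 33)x"
echo "BLOB, -Z 1:       $(slowdown 88 73)x"
```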

posted @ 2021-12-21 19:14 KINGBASE研究院
