What Is Deepgreen DB?

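Deepgreen DB, in full Vitesse Deepgreen DB, is a scalable massively parallel processing (MPP) data warehouse solution derived from the open-source data warehouse project Greenplum DB (commonly called GP or GPDB). Anyone already familiar with GP can switch to Deepgreen seamlessly.

It has almost all of GP's features. While keeping all of GP's strengths, Deepgreen optimizes the original query processing engine. The new-generation engine adds:

Superior join and aggregation algorithms
A new spill-handling subsystem
JIT-based query optimization, vectorized scans, and data path optimization

The main features of Deepgreen are introduced briefly below, mostly in comparison with Greenplum: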

1. 100% GPDB

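Deepgreen is almost 100% identical to Greenplum. I say "almost" because Deepgreen also drops a few Greenplum features of marginal value, such as MapReduce support, so what remains is the essential part. From SQL syntax and stored procedure syntax, to the data storage format, to components such as gpstart and gpfdist, Deepgreen keeps the migration impact to a minimum for users moving over from Greenplum, especially in the following respects:

Data does not need to be reloaded, except for data compressed with quicklz, which must be changed
DML and DDL statements are unchanged
UDF (user-defined function) syntax is unchanged
Stored procedure syntax is unchanged
Connection and authorization protocols such as JDBC/ODBC are unchanged
Operational scripts (for example, backup scripts) are unchanged

So where does Deepgreen differ from Greenplum? In one word: fast! Fast! Fast! (Important things get said three times.) Because most OLAP workloads are bound by CPU performance, Deepgreen, with its CPU-oriented optimizations, can run 3 to 5 times faster than the original Greenplum in performance tests.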

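2. Faster Decimal types

Deepgreen provides two more precise Decimal types, Decimal64 and Decimal128, which are more efficient than Greenplum's existing Decimal type (Numeric). Because they are more precise than the float/double types, they are a better fit for banking and other workloads that demand accurate numbers.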
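
To illustrate why exact decimal types matter for monetary data, here is a minimal Python sketch. It runs outside the database and uses Python's standard `decimal` module as a stand-in for a decimal column; it says nothing about how Decimal64 or Decimal128 are implemented.

```python
from decimal import Decimal

# Binary floating point cannot represent 0.1 or 0.2 exactly,
# so their sum is not exactly 0.3.
float_sum = 0.1 + 0.2

# Decimal arithmetic keeps base-10 fractions exact.
decimal_sum = Decimal("0.1") + Decimal("0.2")

print(float_sum == 0.3)               # False
print(decimal_sum == Decimal("0.3"))  # True
```

The same effect, at a much larger scale, is what makes a double precision aggregate drift away from the exact result.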

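Installation:

After the database has been initialized, these two data types must be loaded into each database that needs them with the following commands: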
dgadmin@flash:~$ source deepgreendb/greenplum_path.sh
dgadmin@flash:~$ cd $GPHOME/share/postgresql/contrib/
dgadmin@flash:~/deepgreendb/share/postgresql/contrib$ psql postgres -f pg_decimal.sql

A quick test:

Query: select avg(x), sum(2*x) from table
Rows: 1 million
dgadmin@flash:~$ psql -d postgres
psql (8.2.15)
Type "help" for help.

postgres=# drop table if exists tt;
NOTICE:  table "tt" does not exist, skipping
DROP TABLE
postgres=# create table tt(
postgres(# ii bigint,
postgres(#  f64 double precision,
postgres(# d64 decimal64,
postgres(# d128 decimal128,
postgres(# n numeric(15, 3))
postgres-# distributed randomly;
CREATE TABLE
postgres=# insert into tt
postgres-# select i,
postgres-#     i + 0.123,
postgres-#     (i + 0.123)::decimal64,
postgres-#     (i + 0.123)::decimal128,
postgres-#     i + 0.123
postgres-# from generate_series(1, 1000000) i;
INSERT 0 1000000
postgres=# \timing on
Timing is on.
postgres=# select count(*) from tt;
  count
---------
 1000000
(1 row)

Time: 161.500 ms
postgres=# set vitesse.enable=1;
SET
Time: 1.695 ms
postgres=# select avg(f64),sum(2*f64) from tt;
       avg        |       sum
------------------+------------------
 500000.622996815 | 1000001245993.63
(1 row)

Time: 45.368 ms
postgres=# select avg(d64),sum(2*d64) from tt;
    avg     |        sum
------------+-------------------
 500000.623 | 1000001246000.000
(1 row)

Time: 135.693 ms
postgres=# select avg(d128),sum(2*d128) from tt;
    avg     |        sum
------------+-------------------
 500000.623 | 1000001246000.000
(1 row)

Time: 148.286 ms
postgres=# set vitesse.enable=1;
SET
Time: 11.691 ms
postgres=# select avg(n),sum(2*n) from tt;
         avg         |        sum
---------------------+-------------------
 500000.623000000000 | 1000001246000.000
(1 row)

Time: 154.189 ms
postgres=# set vitesse.enable=0;
SET
Time: 1.426 ms
postgres=# select avg(n),sum(2*n) from tt;
         avg         |        sum
---------------------+-------------------
 500000.623000000000 | 1000001246000.000
(1 row)

Time: 296.291 ms
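
As a cross-check on the query results above, the exact answers can be worked out in closed form, since each row holds i + 0.123 for i from 1 to 1,000,000. The Python sketch below does this outside the database; the function names are illustrative only.

```python
from decimal import Decimal

N = 1_000_000
OFFSET = Decimal("0.123")

def expected_avg(n, offset):
    # avg(i + offset) for i = 1..n is (n + 1) / 2 + offset
    return Decimal(n + 1) / 2 + offset

def expected_sum_doubled(n, offset):
    # sum(2 * (i + offset)) for i = 1..n is n * (n + 1) + 2 * offset * n
    return Decimal(n) * (n + 1) + 2 * offset * n

print(expected_avg(N, OFFSET))
print(expected_sum_doubled(N, OFFSET))
```

The exact values are 500000.623 and 1000001246000, which is what the decimal64, decimal128, and numeric columns return. The double precision column returns 500000.622996815 and 1000001245993.63, so it is faster but not exact.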

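Results:

45 ms - 64-bit float
136 ms - decimal64
148 ms - decimal128
154 ms - Deepgreen numeric (vitesse.enable=1)
296 ms - Greenplum numeric (vitesse.enable=0)

In this test, decimal64 (136 ms) is faster than Deepgreen numeric (154 ms) and about twice as fast as Greenplum numeric (296 ms); in production environments the speedup is more than 5x.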
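
The speedup factors for this single run can be computed directly from the timings in the transcript. The Python sketch below uses the numeric run with vitesse.enable=0 as the baseline; the dictionary keys are labels chosen here for illustration.

```python
# Timings in milliseconds, copied from the psql session above.
timings_ms = {
    "float64": 45.368,
    "decimal64": 135.693,
    "decimal128": 148.286,
    "deepgreen numeric": 154.189,
    "greenplum numeric": 296.291,
}

baseline = timings_ms["greenplum numeric"]

# Speedup of each type relative to numeric with vitesse.enable=0.
speedups = {name: baseline / ms for name, ms in timings_ms.items()}

for name, factor in speedups.items():
    print(f"{name}: {factor:.2f}x")
```

On these numbers decimal64 comes out at roughly 2.2x and Deepgreen numeric at roughly 1.9x. This is one run on one small machine, so the factors are indicative only.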

3. JSON support

Deepgreen supports the JSON type, though not completely. The unsupported functions are: json_each, json_each_text, json_extract_path, json_extract_path_text, json_object_keys, json_populate_record, json_populate_recordset, json_array_elements, and json_agg.

Installation:

Run the following command to add JSON support:

dgadmin@flash:~$ psql postgres -f $GPHOME/share/postgresql/contrib/json.sql

A quick test:

dgadmin@flash:~$ psql postgres
psql (8.2.15)
Type "help" for help.

postgres=# select '[1,2,3]'::json->2;
 ?column?
----------
 3
(1 row)

postgres=# create temp table mytab(i int, j json) distributed by (i);
CREATE TABLE
postgres=# insert into mytab values (1, null), (2, '[2,3,4]'), (3, '[3000,4000,5000]');
INSERT 0 3
postgres=#
postgres=# insert into mytab values (1, null), (2, '[2,3,4]'), (3, '[3000,4000,5000]');
INSERT 0 3
postgres=# select i, j->2 from mytab;
 i | ?column?
---+----------
 2 | 4
 2 | 4
 1 |
 3 | 5000
 1 |
 3 | 5000
(6 rows)
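
For readers less familiar with the `->` operator, the Python sketch below mimics what `j->2` does in the session above: it takes the element at zero-based index 2 of a JSON array and yields nothing for a NULL input. The helper `json_array_element` is a hypothetical name used only here, and this is an analogy rather than Deepgreen's implementation.

```python
import json

def json_array_element(doc, index):
    """Return the JSON text of doc[index], or None when there is no such element."""
    if doc is None:
        return None  # NULL in, NULL out
    value = json.loads(doc)
    if not isinstance(value, list) or not 0 <= index < len(value):
        return None
    return json.dumps(value[index])

# Same rows as the mytab example.
rows = [(1, None), (2, "[2,3,4]"), (3, "[3000,4000,5000]")]
for i, j in rows:
    print(i, json_array_element(j, 2))
```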

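4. Efficient compression algorithms

Deepgreen retains Greenplum's zlib compression for storage. In addition, Deepgreen provides two compression formats better suited to database workloads: zstd and lz4.

If you need a better compression ratio for column-oriented or append-only heap tables, choose zstd. Compared with zlib, zstd achieves a better compression ratio and uses the CPU more efficiently.

If you have a read-heavy workload, choose lz4, because its decompression speed is remarkable. The compression ratio of lz4 is not as good as that of zlib or zstd, but the trade-off is worthwhile for a heavy read load.

For details on these two algorithms, see their home pages:

zstd home page: http://facebook.github.io/zstd/
lz4 home page: http://lz4.github.io/lz4/

A quick test:

This is only a simple test of four settings: no compression, zlib, zstd, and lz4. My machine is not very powerful, so all results are for reference only: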

postgres=# create temp table ttnone (
postgres(#     i int,
postgres(#     t text,
postgres(#     default column encoding (compresstype=none))
postgres-# with (appendonly=true, orientation=column)
postgres-# distributed by (i);
CREATE TABLE
postgres=# \timing on
Timing is on.
postgres=# create temp table ttzlib(
postgres(#     i int,
postgres(#     t text,
postgres(#     default column encoding (compresstype=zlib, compresslevel=1))
postgres-# with (appendonly=true, orientation=column)
postgres-# distributed by (i);
CREATE TABLE
Time: 762.596 ms
postgres=# create temp table ttzstd (
postgres(#     i int,
postgres(#     t text,
postgres(#     default column encoding (compresstype=zstd, compresslevel=1))
postgres-# with (appendonly=true, orientation=column)
postgres-# distributed by (i);
CREATE TABLE
Time: 827.033 ms
postgres=# create temp table ttlz4 (
postgres(#     i int,
postgres(#     t text,
postgres(#     default column encoding (compresstype=lz4))
postgres-# with (appendonly=true, orientation=column)
postgres-# distributed by (i);
CREATE TABLE
Time: 845.728 ms
postgres=# insert into ttnone select i, 'user '||i from generate_series(1, 100000000) i;
INSERT 0 100000000
Time: 104641.369 ms
postgres=# insert into ttzlib select i, 'user '||i from generate_series(1, 100000000) i;
INSERT 0 100000000
Time: 99557.505 ms
postgres=# insert into ttzstd select i, 'user '||i from generate_series(1, 100000000) i;
INSERT 0 100000000
Time: 98800.567 ms
postgres=# insert into ttlz4 select i, 'user '||i from generate_series(1, 100000000) i;
INSERT 0 100000000
Time: 96886.107 ms
postgres=# select pg_size_pretty(pg_relation_size('ttnone'));
 pg_size_pretty
----------------
 1708 MB
(1 row)

Time: 83.411 ms
postgres=# select pg_size_pretty(pg_relation_size('ttzlib'));
 pg_size_pretty
----------------
 374 MB
(1 row)

Time: 4.641 ms
postgres=# select pg_size_pretty(pg_relation_size('ttzstd'));
 pg_size_pretty
----------------
 325 MB
(1 row)

Time: 5.015 ms
postgres=# select pg_size_pretty(pg_relation_size('ttlz4'));
 pg_size_pretty
----------------
 785 MB
(1 row)

Time: 4.483 ms
postgres=# select sum(length(t)) from ttnone;
    sum
------------
 1288888898
(1 row)

Time: 4414.965 ms
postgres=# select sum(length(t)) from ttzlib;
    sum
------------
 1288888898
(1 row)

Time: 4500.671 ms
postgres=# select sum(length(t)) from ttzstd;
    sum
------------
 1288888898
(1 row)

Time: 3849.648 ms
postgres=# select sum(length(t)) from ttlz4;
    sum
------------
 1288888898
(1 row)

Time: 3160.477 ms
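
To make the comparison easier to read, the Python sketch below derives compression ratios from the table sizes reported above and also confirms the sum(length(t)) value independently. The function name and dictionary layout are illustrative choices, not anything provided by Deepgreen.

```python
def total_text_length(n, prefix_len=5):
    """Sum of len('user ' + str(i)) for i in 1..n, counted by digit width."""
    total = prefix_len * n
    digits, low = 1, 1
    while low <= n:
        high = min(n, low * 10 - 1)          # last number with this many digits
        total += digits * (high - low + 1)
        digits += 1
        low *= 10
    return total

# Sizes in MB and scan times in ms, copied from the psql session above.
sizes_mb = {"none": 1708, "zlib": 374, "zstd": 325, "lz4": 785}
scan_ms = {"none": 4414.965, "zlib": 4500.671, "zstd": 3849.648, "lz4": 3160.477}

# Compression ratio relative to the uncompressed table.
ratios = {name: sizes_mb["none"] / mb for name, mb in sizes_mb.items()}

for name in sizes_mb:
    print(f"{name}: ratio {ratios[name]:.2f}, scan {scan_ms[name]:.0f} ms")
print(total_text_length(100_000_000))
```

On this data zstd compresses best (about 5.3x against about 4.6x for zlib), while lz4 compresses least (about 2.2x) but gives the fastest scan, which matches the guidance above.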

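5. Data sampling

Starting with Deepgreen version 16.16, true data sampling through SQL is built in. You can sample either by specifying a number of rows or by specifying a sampling percentage: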

SELECT {select-clauses} LIMIT SAMPLE {n} ROWS;
SELECT {select-clauses} LIMIT SAMPLE {n} PERCENT;

A quick test:

postgres=# select count(*) from ttlz4;
   count
-----------
 100000000
(1 row)

Time: 903.661 ms
postgres=# select * from ttlz4 limit sample 0.00001 percent;
    i     |       t
----------+---------------
  3442917 | user 3442917
  9182620 | user 9182620
  9665879 | user 9665879
 13791056 | user 13791056
 15669131 | user 15669131
 16234351 | user 16234351
 19592531 | user 19592531
 39097955 | user 39097955
 48822058 | user 48822058
 83021724 | user 83021724
  1342299 | user 1342299
 20309120 | user 20309120
 34448511 | user 34448511
 38060122 | user 38060122
 69084858 | user 69084858
 73307236 | user 73307236
 95421406 | user 95421406
(17 rows)

Time: 4208.847 ms
postgres=# select * from ttlz4 limit sample 10 rows;
    i     |       t
----------+---------------
 78259144 | user 78259144
 85551752 | user 85551752
 90848887 | user 90848887
 53923527 | user 53923527
 46524603 | user 46524603
 31635115 | user 31635115
 19030885 | user 19030885
 97877732 | user 97877732
 33238448 | user 33238448
 20916240 | user 20916240
(10 rows)

Time: 3578.031 ms
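
One detail in the output above is worth a note: 0.00001 percent of 100,000,000 rows is 10 rows, yet the PERCENT query returned 17, while the ROWS query returned exactly 10. The Python sketch below shows one way such a difference can arise, under the assumption (not confirmed by this post) that a percentage is applied as a per-row probability. All function names are hypothetical and this is not Deepgreen's sampling code.

```python
import random
from decimal import Decimal

def expected_rows(total_rows, percent):
    """Expected sample size for a percentage, computed exactly."""
    return Decimal(total_rows) * Decimal(percent) / 100

def bernoulli_sample(rows, percent, rng):
    """Keep each row independently with probability percent / 100."""
    p = percent / 100.0
    return [row for row in rows if rng.random() < p]

def fixed_size_sample(rows, n, rng):
    """Return exactly n rows (or all rows if there are fewer than n)."""
    return rng.sample(rows, min(n, len(rows)))

rng = random.Random(42)
rows = list(range(1, 100_001))  # small stand-in table

print(expected_rows(100_000_000, "0.00001"))   # expected size for the query above
print(len(bernoulli_sample(rows, 1.0, rng)))   # varies around 1000
print(len(fixed_size_sample(rows, 10, rng)))   # always 10
```

With per-row sampling the returned count only approximates the requested percentage, whereas a row-count request returns an exact number of rows.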

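6. TPC-H performance

For a performance comparison between Deepgreen and Greenplum, see my two other posts:

"Deepgreen vs. Greenplum TPC-H Performance Comparison (Using 德哥's Scripts)"

"Deepgreen vs. Greenplum TPC-H Performance Comparison (Using VitesseData's Scripts)"

I will cover Xdrive, the high-performance component that ships with Deepgreen, in a later post.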

End~

Posted 2017-06-16 14:15 by _夜枫
