Lookup joins against dimension tables are the main way the DW layer of a data warehouse widens stream tables. The dimension-table stores used most in our projects are HBase, MySQL, and Redis, covering three lookup-join scenarios: large tables, small tables, and high-performance point queries.

I recently spent some time evaluating and using StarRocks, and it seemed to work quite well as a dimension-table store too, so I ran this test: what TPS can HBase / MySQL / StarRocks / Redis reach as a dimension table, and can StarRocks replace MySQL or HBase?

Test servers

CDH 6.2.1
Servers: 3 nodes, 64 GB RAM and 16 cores each
All CDH components are installed on these 3 servers

Stream data

The test stream reads from Kafka and contains the following fields:

CREATE TABLE user_log
(
    user_id     STRING,
    item_id     STRING,
    category_id STRING,
    behavior    STRING,
    page        STRING,
    `position`  STRING,
    sort        STRING,
    last_page   STRING,
    next_page   STRING,
    ts          TIMESTAMP(3),
    process_time as proctime(),
    WATERMARK FOR ts AS ts - INTERVAL '5' SECOND
) WITH (
      'connector' = 'kafka'
      ,'topic' = 'user_behavior'
      ,'properties.bootstrap.servers' = 'localhost:9092'
      ,'properties.group.id' = 'user_log'
      ,'scan.startup.mode' = 'latest-offset'
      ,'format' = 'json'
);

Dimension table design

The dimension table contains the following fields:

user_id: a 10-digit numeric string, zero-padded to 10 digits and then reversed (the reversal spreads rowkeys evenly across HBase regions)
sex: man, woman, unknown
age: integer from 1 to 100
degree: one of 小学 / 初中 / 高中 / 本科 / 研究生 / 博士生 (primary school through PhD)
address: UUID
work_address: UUID
income_range: random integer from 1 to 10
default_shipping_address: UUID
register_date: write time
udpate_date: write time

The dimension-table write stream:


CREATE TABLE user_info
(
    user_id     STRING,
    sex                      STRING,
    age                      INTEGER,
    degree                   STRING,
    address                  STRING,
    work_address             STRING,
    income_range             STRING,
    default_shipping_address STRING,
    register_date            TIMESTAMP(3),
    udpate_date              TIMESTAMP(3)
) WITH (
      'connector' = 'kafka'
      ,'topic' = 'user_behavior'
      ,'properties.bootstrap.servers' = 'dcmp10:9092,dcmp11:9092,dcmp12:9092'
      ,'properties.group.id' = 'user_log'
      ,'scan.startup.mode' = 'latest-offset'
      ,'format' = 'json'
);
  • Note: the dimension table is designed for 1,000,000 users, but only 990,000 rows are actually written; the remaining 10,000 simulate lookup misses during the stream join (a sketch of generating this data follows below).
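The original generator isn't shown in the post; below is a minimal sketch of producing the 990,000 rows with Flink's datagen connector and the expressions implied by the field spec, writing into the user_info Kafka table above (the table name user_info_gen and the write rate are assumptions):

-- Hypothetical generator: a bounded datagen sequence feeding user_info.
CREATE TEMPORARY TABLE user_info_gen
(
    id BIGINT
) WITH (
      'connector' = 'datagen'
      ,'rows-per-second' = '10000'
      ,'fields.id.kind' = 'sequence'
      ,'fields.id.start' = '1'
      ,'fields.id.end' = '990000'
);

INSERT INTO user_info
SELECT REVERSE(LPAD(CAST(id AS STRING), 10, '0'))             -- 10 digits, zero-padded, reversed
     , CASE MOD(id, 3) WHEN 0 THEN 'man' WHEN 1 THEN 'woman' ELSE 'unknown' END
     , CAST(MOD(id, 100) + 1 AS INT)                          -- age 1-100
     , CASE MOD(id, 6) WHEN 0 THEN '小学' WHEN 1 THEN '初中' WHEN 2 THEN '高中'
           WHEN 3 THEN '本科' WHEN 4 THEN '研究生' ELSE '博士生' END
     , UUID()                                                 -- address
     , UUID()                                                 -- work address
     , CAST(MOD(id, 10) + 1 AS STRING)                        -- income range 1-10
     , UUID()                                                 -- default shipping address
     , LOCALTIMESTAMP                                         -- register_date: write time
     , LOCALTIMESTAMP                                         -- udpate_date: write time
FROM user_info_gen;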

HBase table

Create the HBase table:


create 'user_info', { NAME => 'f',IN_MEMORY => 'true'}

Write the data:


drop table if exists hbase_user_info_sink;
CREATE TABLE hbase_user_info_sink
(
    user_id STRING,
    f      ROW(sex                      STRING,
        age                      INTEGER,
        degree                   STRING,
        address                  STRING,
        work_address             STRING,
        income_range             STRING,
        default_shipping_address STRING,
        register_date            TIMESTAMP(3),
        udpate_date              TIMESTAMP(3))
) WITH (
      'connector' = 'hbase-2.2'
      ,'zookeeper.quorum' = 'dcmp10:2181,dcmp11:2181,dcmp12:2181'
      ,'zookeeper.znode.parent' = '/hbase'
      ,'table-name' = 'user_info'
      ,'sink.buffer-flush.max-size' = '10mb'
      ,'sink.buffer-flush.max-rows' = '2000'
      );

insert into hbase_user_info_sink
select user_id, row(sex, age, degree, address, work_address, income_range,default_shipping_address, register_date, udpate_date)
from user_info;

HBase data

hbase(main):002:0> count 'user_info',INTERVAL=100000
(hbase):2: warning: already initialized constant INTERVAL
Current count: 100000, row: 1010090000                                                             
Current count: 200000, row: 2020190000                                                             
Current count: 300000, row: 3030290000                                                             
Current count: 400000, row: 4040390000                                                             
Current count: 500000, row: 5050490000                                                             
Current count: 600000, row: 6060590000                                                             
Current count: 700000, row: 7070690000                                                             
Current count: 800000, row: 8080790000                                                             
Current count: 900000, row: 9090890000                                                             
990000 row(s)
Took 30.5924 seconds                                                                               
=> 990000

TPS test

The test SQL:

-- Lookup Source: Async Mode ('lookup.async' = 'true' below)
-- kafka source
CREATE TABLE user_log
(
    user_id     STRING,
    item_id     STRING,
    category_id STRING,
    behavior    STRING,
    page        STRING,
    `position`    STRING,
    sort        STRING,
    last_page   STRING,
    next_page   STRING,
    ts          TIMESTAMP(3),
    process_time as proctime(),
    WATERMARK FOR ts AS ts - INTERVAL '5' SECOND
) WITH (
      'connector' = 'kafka'
      ,'topic' = 'user_log'
      ,'properties.bootstrap.servers' = 'dcmp10:9092,dcmp11:9092,dcmp12:9092'
      ,'properties.group.id' = 'user_log'
      ,'scan.startup.mode' = 'latest-offset'
      ,'format' = 'json'
      );

drop table if exists hbase_behavior_conf;
CREATE TEMPORARY TABLE hbase_behavior_conf (
    user_id STRING,
    f      ROW(sex                      STRING,
        age                      INTEGER,
        degree                   STRING,
        address                  STRING,
        work_address             STRING,
        income_range             STRING,
        default_shipping_address STRING,
        register_date            TIMESTAMP(3),
        udpate_date              TIMESTAMP(3))
) WITH (
      'connector' = 'hbase-2.2'
      ,'zookeeper.quorum' = 'dcmp10:2181,dcmp11:2181,dcmp12:2181'
      ,'zookeeper.znode.parent' = '/hbase'
      ,'table-name' = 'user_info'
      ,'lookup.cache.max-rows' = '100000'
      ,'lookup.cache.ttl' = '10 minute' -- max time-to-live of each cached row
      ,'lookup.async' = 'true'
);

-- sink table
CREATE TABLE user_log_sink
(
    user_id                  STRING,
    item_id                  STRING,
    category_id              STRING,
    behavior                 STRING,
    page                     STRING,
    `position`                 STRING,
    sort                     STRING,
    last_page                STRING,
    next_page                STRING,
    ts                       TIMESTAMP(3),
    sex                      STRING,
    age                      INTEGER,
    degree                   STRING,
    address                  STRING,
    work_address             STRING,
    income_range             STRING,
    default_shipping_address STRING,
    register_date            TIMESTAMP(3),
    udpate_date              TIMESTAMP(3)
--   ,primary key (user_id) not enforced
) WITH (
      'connector' = 'kafka'
      ,'topic' = 'user_log_sink'
      ,'properties.bootstrap.servers' = 'dcmp10:9092,dcmp11:9092,dcmp12:9092'
      ,'properties.group.id' = 'user_log'
      ,'scan.startup.mode' = 'group-offsets'
      ,'format' = 'json'
      );

INSERT INTO user_log_sink
SELECT a.user_id
     ,a.item_id
     ,a.category_id
     ,a.behavior
     ,a.page
     ,a.`position`
     ,a.sort
     ,a.last_page
     ,a.next_page
     ,a.ts
     ,b.sex
     ,b.age
     ,b.degree
     ,b.address
     ,b.work_address
     ,b.income_range
     ,b.default_shipping_address
     ,b.register_date
     ,b.udpate_date
FROM user_log a
         left join hbase_behavior_conf FOR SYSTEM_TIME AS OF a.process_time AS b
                   ON a.user_id = b.user_id
where a.behavior is not null;
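Because 10,000 user_ids were deliberately left out of the dimension table, roughly 1% of the joined rows come through with NULL dimension columns. A sketch for watching the miss rate (an updating aggregate; COUNT on a dimension column only counts matched rows):

-- Hypothetical miss-rate check: matched_rows / total_rows should approach 99%.
SELECT COUNT(*)     AS total_rows
     , COUNT(b.sex) AS matched_rows
FROM user_log a
         left join hbase_behavior_conf FOR SYSTEM_TIME AS OF a.process_time AS b
                   ON a.user_id = b.user_id;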

Parallelism 1

The cache is set to 100,000 rows (10% of the table) with a 10-minute TTL.

Since this tests the lookup-join TPS, only the LookupJoin operator's throughput matters here; TPS holds steady at around 25,000.

The TaskManager is configured with 4 GB of memory; its GC looks somewhat heavy, but still acceptable.

Parallelism 3

With the same configuration, TPS peaks at about 60,000, with overall server CPU above 40%.
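The jobs in this post are submitted through sqlSubmit, so the exact mechanism differs, but in Flink SQL client terms the per-run settings amount to something like this sketch (both keys are standard Flink options):

-- Session settings for the parallelism-3 run (SQL client syntax, assumed).
SET 'parallelism.default' = '3';
SET 'taskmanager.memory.process.size' = '4g';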

MySQL

MySQL table structure:

-- auto-generated definition
create table user_info
(
    id                       int auto_increment
        primary key,
    user_id                  varchar(12) null,
    sex                      varchar(10) null,
    age                      int         null,
    degree                   varchar(10) null,
    address                  varchar(50) null,
    work_address             varchar(50) null,
    income_range             varchar(50) null,
    default_shipping_address varchar(50) null,
    register_date            datetime    null,
    udpate_date              datetime    null
)
    comment '用户信息';

create index index_user_info_user_id
    on user_info (user_id);

Row count: 990,000.
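On every cache miss the JDBC lookup source issues a point query keyed on user_id, so the index above is what keeps lookup latency bounded. A quick sanity check that MySQL actually uses it (the sample key is hypothetical):

-- EXPLAIN should report access type `ref` on index_user_info_user_id, not a full scan.
EXPLAIN
SELECT user_id, sex, age, degree, address, work_address,
       income_range, default_shipping_address, register_date, udpate_date
FROM user_info
WHERE user_id = '0000000001';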

TPS test


drop table if exists mysql_behavior_conf;
CREATE TEMPORARY TABLE mysql_behavior_conf (
    user_id STRING,
    sex                      STRING,
    age                      INTEGER,
    degree                   STRING,
    address                  STRING,
    work_address             STRING,
    income_range             STRING,
    default_shipping_address STRING,
    register_date            TIMESTAMP(3),
    udpate_date              TIMESTAMP(3)
) WITH (
   'connector' = 'jdbc'
   ,'url' = 'jdbc:mysql://10.201.0.166:3306/shell1'
   ,'table-name' = 'user_info'
   ,'username' = 'root'
   ,'password' = 'daas2020'
   ,'lookup.cache.max-rows' = '100000'
   ,'lookup.cache.ttl' = '10 minute' -- max time-to-live of each cached row
);


-- sink table
CREATE TABLE user_log_sink
(
    user_id                  STRING,
    item_id                  STRING,
    category_id              STRING,
    behavior                 STRING,
    page                     STRING,
    `position`               STRING,
    sort                     STRING,
    last_page                STRING,
    next_page                STRING,
    ts                       TIMESTAMP(3),
    sex                      STRING,
    age                      INTEGER,
    degree                   STRING,
    address                  STRING,
    work_address             STRING,
    income_range             STRING,
    default_shipping_address STRING,
    register_date            TIMESTAMP(3),
    udpate_date              TIMESTAMP(3)
--   ,primary key (user_id) not enforced
) WITH (
      'connector' = 'kafka'
      ,'topic' = 'user_log_sink'
      ,'properties.bootstrap.servers' = 'dcmp10:9092,dcmp11:9092,dcmp12:9092'
      ,'properties.group.id' = 'user_log'
      ,'scan.startup.mode' = 'group-offsets'
      ,'format' = 'json'
      );

INSERT INTO user_log_sink
SELECT a.user_id
     ,a.item_id
     ,a.category_id
     ,a.behavior
     ,a.page
     ,a.`position`
     ,a.sort
     ,a.last_page
     ,a.next_page
     ,a.ts
     ,b.sex
     ,b.age
     ,b.degree
     ,b.address
     ,b.work_address
     ,b.income_range
     ,b.default_shipping_address
     ,b.register_date
     ,b.udpate_date
FROM user_log a
         left join mysql_behavior_conf FOR SYSTEM_TIME AS OF a.process_time AS b
                   ON a.user_id = b.user_id
where a.behavior is not null;

With a million rows in MySQL (100k behaves about the same), and with the index plus the lookup cache, TPS holds steady at around 3,600 (GC looked fine, so no screenshot).

  • A single parallelism already hits MySQL's performance ceiling, so multi-parallelism scenarios were not tested.

StarRocks

Completely unexpectedly, StarRocks as a dimension-table store turned out to be a weakling with a combat power of only 5: initial tests reached barely 200-odd TPS. That is actually consistent with the official positioning; high-QPS point lookup just isn't its use case. Skip, skip.
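The StarRocks wiring isn't shown in the post; since StarRocks speaks the MySQL protocol on the FE query port (9030 by default), one plausible setup is to reuse the plain JDBC lookup source, e.g. (host, database, and credentials are placeholders):

-- Hypothetical: JDBC lookup source pointed at a StarRocks FE.
drop table if exists starrocks_behavior_conf;
CREATE TEMPORARY TABLE starrocks_behavior_conf (
    user_id                  STRING,
    sex                      STRING,
    age                      INTEGER,
    degree                   STRING,
    address                  STRING,
    work_address             STRING,
    income_range             STRING,
    default_shipping_address STRING,
    register_date            TIMESTAMP(3),
    udpate_date              TIMESTAMP(3)
) WITH (
   'connector' = 'jdbc'
   ,'url' = 'jdbc:mysql://fe-host:9030/test'  -- placeholder FE address and database
   ,'table-name' = 'user_info'
   ,'username' = 'root'
   ,'password' = '******'
   ,'lookup.cache.max-rows' = '100000'
   ,'lookup.cache.ttl' = '10 minute'
);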

Redis

No test environment available, so skipped.

Full code reference: github sqlSubmit

Welcome to follow the Flink菜鸟 WeChat public account, which posts Flink development content from time to time.
