StarRocks-性能优化(基础)

优化方案

1、数据模型设计优化
1-1、分区与分桶策略 (高频使用)
分区(Partitioning):
-- 按时间或业务逻辑字段(如日期、地区)对表进行分区,减少扫描数据量。
CREATE TABLE ads_order (
    order_id BIGINT,
    user_id INT,
    amount DECIMAL(10,2),
    dt DATE
)
PARTITION BY RANGE(dt) (
    PARTITION p202301 VALUES [('2023-01-01'), ('2023-02-01')),
    PARTITION p202302 VALUES [('2023-02-01'), ('2023-03-01'))
);

分桶(Bucketing):
-- 选择高基数字段(如 user_id、order_id)作为分桶键,确保数据均匀分布,避免数据倾斜。
DISTRIBUTED BY HASH(user_id) BUCKETS 32;

1-2、数据冗余与宽表设计 (高频使用)
对高频关联的小表(如维度表),通过 Broadcast Join 或 Colocate Join 避免 Shuffle。
若多表关联逻辑固定,可设计宽表(预关联表),减少运行时 Join 开销。
2、查询优化 (高频使用)
2-1、Join 策略选择
Broadcast Join-- 适用于小表(如维度表)与大表的关联,将小表广播到所有节点,避免 Shuffle。
SELECT /*+ BROADCAST(dim_user) */ 
    o.order_id, 
    u.user_name, 
    o.amount
FROM ads_order o
JOIN dim_user u ON o.user_id = u.user_id;

Shuffle Join-- 大表间关联时,按 Join Key 重新分布数据,确保相同 Key 的数据在同一节点。
SELECT /*+ SHUFFLE_JOIN */ 
    o.order_id, 
    p.product_name, 
    o.amount
FROM ads_order o
JOIN ads_product p ON o.product_id = p.product_id;

Colocate Join-- 若多表的分桶键和分桶数一致,直接本地 Join,无需网络传输。
-- 创建表时指定相同分桶规则
CREATE TABLE ads_order (...) DISTRIBUTED BY HASH(user_id) BUCKETS 32;
CREATE TABLE dim_user (...) DISTRIBUTED BY HASH(user_id) BUCKETS 32;

-- 查询时自动触发 Colocate Join
SELECT o.order_id, u.user_name
FROM ads_order o
JOIN dim_user u ON o.user_id = u.user_id;
2-2、谓词下推与过滤
-- 在 Join 前尽量过滤数据,减少参与关联的数据量:
SELECT o.order_id, u.user_name
FROM (SELECT * FROM ads_order WHERE dt = '2023-01-01') o
JOIN dim_user u ON o.user_id = u.user_id;
2-3、避免笛卡尔积
-- 确保 Join 条件包含有效关联字段,避免全表扫描。
3、物化视图加速 (高频使用)
-- 对高频复杂查询,通过 物化视图(Materialized View) 预计算关联结果:
-- 创建物化视图
CREATE MATERIALIZED VIEW mv_order_user
PARTITION BY dt
DISTRIBUTED BY HASH(order_id)
AS
SELECT o.order_id, u.user_name, o.amount, o.dt
FROM ads_order o
JOIN dim_user u ON o.user_id = u.user_id;

-- 查询时自动命中物化视图
SELECT user_name, SUM(amount)
FROM mv_order_user
WHERE dt = '2023-01-01'
GROUP BY user_name;
4、执行计划调优
分析执行计划
使用 EXPLAIN 命令查看查询计划,重点关注:
  数据分布:是否触发 Broadcast/Shuffle/Colocate Join。
  谓词下推:过滤条件是否提前应用。
  资源消耗:是否存在内存溢出(如 EXCEEDED_MEMORY_LIMIT)。

调整并行度
  通过 set parallel_fragment_exec_instance_num = 8; 增加并行度,提升计算资源利用率。
  调整 BE(Backend)节点的资源配置,避免资源竞争。
5、统计信息收集
定期收集表统计信息,帮助优化器生成高效执行计划:

-- 手动收集统计信息
ANALYZE TABLE ads_order COMPUTE STATISTICS;
6、资源隔离与优先级
为 ADS 层查询分配独立资源组(Resource Group),避免与其他任务竞争资源:

CREATE RESOURCE GROUP ads_group 
TO ('user_ads') 
WITH (
    'cpu_core_limit' = '16',
    'mem_limit' = '80%'
);
7、冷热数据分层
对历史冷数据使用 冷存储策略(如对象存储),降低存储成本。
对热数据保留在本地 SSD,提升查询性能。

Explain(逻辑执行计划) 操作案例:

对一下三个语句进行执行计划分析对比,查询优化
------------------------------------------
--
EXPLAIN
select custId,commOwner,commSchool,commTime,commStatus,commprojIntention from (
    select a.custId ,a.commOwner ,a.commSchool ,a.commTime ,a.commStatus, b.commprojIntention, ROW_NUMBER() over(PARTITION by b.custId order by b.commTime) as rnk
        from datawarehouse.dwd_ct_CtComm a
        join datawarehouse.dwd_ct_CtCommproj b on a.commId = b.commId and a.commDelstatus = 'N' and a.isPhysicsDel = 2 and b.commprojDelstatus = 'N' and b.isPhysicsDel = 2
    ) t where rnk = 1 and custId <> ''

-----------------------------------------
--
EXPLAIN
select c.custId, c.commOwner, c.commSchool, c.commTime, c.commStatus,a.commprojIntention
from ( select commId ,min(commprojIntention) commprojIntention from datawarehouse.dwd_ct_CtCommproj where isPhysicsDel = 2  group by commId  ) a
join  (
select a.commId ,a.custId, a.commOwner, a.commSchool, a.commTime, a.commStatus
from  datawarehouse.dwd_ct_CtComm a  
where  commId in  ( select min(commId) as commId  from datawarehouse.dwd_ct_CtComm where   commDelstatus = 'N' AND isPhysicsDel = 2  group by custId   )
) c on a.commId =c.commId


-----------------------------------------
--
EXPLAIN
select a.custId, a.commOwner, a.commSchool, a.commTime, a.commStatus, c.commprojIntention
from datawarehouse.dwd_ct_CtComm a
join (select min(commId) commId from datawarehouse.dwd_ct_CtComm where commDelstatus = 'N' AND isPhysicsDel = 2  group by custId) b  on a.commId =b.commId
join (select commId,min(commprojIntention) commprojIntention from  datawarehouse.dwd_ct_CtCommproj where  isPhysicsDel = 2  group by commId  ) c  on b.commId =c.commId
执行计划:
-----------------------------------------
--
PLAN FRAGMENT 0
OUTPUT EXPRS:8: custId | 7: commOwner | 19: commSchool | 9: commTime | 10: commStatus | 36: commprojIntention
  PARTITION: UNPARTITIONED

  RESULT SINK

  13:EXCHANGE

PLAN FRAGMENT 1   -- 任务碎片
OUTPUT EXPRS:    -- 输出表达式
  PARTITION: HASH_PARTITIONED: 33: custId  -- 分区

  STREAM DATA SINK    
    EXCHANGE ID: 13  
    UNPARTITIONED      --  未分区

  12:Project
  |  <slot 7> : 7: commOwner
  |  <slot 8> : 8: custId
  |  <slot 9> : 9: commTime
  |  <slot 10> : 10: commStatus
  |  <slot 19> : 19: commSchool
  |  <slot 36> : 36: commprojIntention
  |  
  11:SELECT
  |  predicates: 44: row_number() = 1, 8: custId != ''
  |  
  10:ANALYTIC
  |  functions: [, row_number(), ]
  |  partition by: 33: custId
  |  order by: 31: commTime ASC
  |  window: ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
  |  
  9:SORT
  |  order by: <slot 33> 33: custId ASC, <slot 31> 31: commTime ASC
  |  offset: 0
  |  
  8:EXCHANGE

PLAN FRAGMENT 2
OUTPUT EXPRS:
  PARTITION: RANDOM

  STREAM DATA SINK
    EXCHANGE ID: 08
    HASH_PARTITIONED: 33: custId

  7:PARTITION-TOP-N
  |  partition by: 33: custId
  |  partition limit: 1
  |  order by: <slot 33> 33: custId ASC, <slot 31> 31: commTime ASC
  |  offset: 0
  |  
  6:Project
  |  <slot 7> : 7: commOwner
  |  <slot 8> : 8: custId
  |  <slot 9> : 9: commTime
  |  <slot 10> : 10: commStatus
  |  <slot 19> : 19: commSchool
  |  <slot 31> : 31: commTime
  |  <slot 33> : 33: custId
  |  <slot 36> : 36: commprojIntention
  |  
  5:HASH JOIN
  |  join op: INNER JOIN (BUCKET_SHUFFLE)
  |  colocate: false, reason:
  |  equal join conjunct: 24: commId = 1: commId
  |  
  |----4:EXCHANGE
  |    
  1:Project
  |  <slot 24> : 24: commId
  |  <slot 31> : 31: commTime
  |  <slot 33> : 33: custId
  |  <slot 36> : 36: commprojIntention
  |  
  0:OlapScanNode
     TABLE: dwd_ct_CtCommproj
     PREAGGREGATION: OFF. Reason: None aggregate function
     PREDICATES: 38: commprojDelstatus = 'N', 42: isPhysicsDel = 2
     partitions=1/1
     rollup: dwd_ct_CtCommproj
     tabletRatio=60/60
     tabletList=76203,76207,76211,76215,76219,76223,76227,76231,76235,76239 ...
     cardinality=150433003
     avgRowSize=19.56558
     numNodes=0

PLAN FRAGMENT 3
OUTPUT EXPRS:
  PARTITION: RANDOM

  STREAM DATA SINK
    EXCHANGE ID: 04
    BUCKET_SHUFFLE_HASH_PARTITIONED: 1: commId

  3:Project
  |  <slot 1> : 1: commId
  |  <slot 7> : 7: commOwner
  |  <slot 8> : 8: custId
  |  <slot 9> : 9: commTime
  |  <slot 10> : 10: commStatus
  |  <slot 19> : 19: commSchool
  |  
  2:OlapScanNode
     TABLE: dwd_ct_CtComm  -- 表名
     PREAGGREGATION: OFF. Reason: Has can not pre-aggregation Join
     PREDICATES: 18: commDelstatus = 'N', 22: isPhysicsDel = 2
     partitions=1/1
     rollup: dwd_ct_CtComm
     tabletRatio=60/60
     tabletList=75066,75070,75074,75078,75082,75086,75090,75094,75098,75102 ...
     cardinality=103274862   -- 扫描表的数据总行数
     avgRowSize=28.0   -- 扫描数据行的平均大小
     numNodes=0

-- 另外两个省略
分析图如下:

alt text

alt text

alt text

posted on   cloud_wh  阅读(13)  评论(0编辑  收藏  举报

相关博文:
阅读排行:
· winform 绘制太阳,地球,月球 运作规律
· 震惊!C++程序真的从main开始吗?99%的程序员都答错了
· AI与.NET技术实操系列(五):向量存储与相似性搜索在 .NET 中的实现
· 超详细:普通电脑也行Windows部署deepseek R1训练数据并当服务器共享给他人
· 【硬核科普】Trae如何「偷看」你的代码?零基础破解AI编程运行原理
点击右上角即可分享
微信分享提示