Hive Data Skew

Day 5 Notes

Using WITH ... AS in Hive

-- Previous approach: nested subqueries
select  t.id
        ,t.name
        ,t.clazz
        ,t.score_id
        ,t.score
        ,c.subject_name
from(
    select  a.id
        ,a.name
        ,a.clazz
        ,b.score_id
        ,b.score
    from (
        select  id
                ,name
                ,clazz
        from students
    ) a left join (
    select  id
            ,score_id
            ,score
    from score
    ) b
    on a.id = b.id
) t left join (
    select  subject_id
            ,subject_name 
    from subject
) c on t.score_id = c.subject_id
limit 10;

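-- WITH ... AS lifts subqueries out into named blocks, which makes the query logic clearer.
-- The gain is mainly readability; whether the query also runs faster depends on the Hive version and settings.
-- A WITH clause cannot run on its own; it must be followed directly by the statement that uses it.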
with tmp1 as (
    select  id
            ,name
            ,clazz
    from students
), tmp2 as ( 
    select  score_id
            ,id
            ,score
    from score
), tmp1Jointmp2 as (
    select  a.id
            ,a.name
            ,a.clazz
            ,b.score_id
            ,b.score
    from tmp1 a
    left join tmp2 b
    on a.id = b.id
), tmp3 as (
    select  subject_id
            ,subject_name
    from subject
)
select  t.id
        ,t.name
        ,t.clazz
        ,t.score_id
        ,t.score
        ,c.subject_name
from tmp1Jointmp2 t left join tmp3 c
on t.score_id = c.subject_id
limit 10;
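
A named block is most useful when it is referenced more than once. The sketch below is an illustration, not part of the original notes: it assumes the students(id, name, clazz) and score(id, score_id, score) tables used above, and that score is numeric. The names total_score and clazz_avg are made up for the example.

```sql
-- total_score is defined once and used twice: as the input of clazz_avg
-- and again in the final join.
with total_score as (
    -- one row per student: the sum of all subject scores
    select  a.id
            ,a.name
            ,a.clazz
            ,sum(b.score) as total
    from students a
    join score b
    on a.id = b.id
    group by a.id, a.name, a.clazz
), clazz_avg as (
    -- average total per class, computed from the first block
    select  clazz
            ,avg(total) as avg_total
    from total_score
    group by clazz
)
-- students whose total is above their class average
select  t.id
        ,t.name
        ,t.clazz
        ,t.total
        ,c.avg_total
from total_score t
join clazz_avg c
on t.clazz = c.clazz
where t.total > c.avg_total
limit 10;
```

Written with nested subqueries, the total_score query would have to be repeated in both places.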

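Hive Data Skew

Cause:

Keys are unevenly distributed: a few key values occur far more often than the others, so the reducers that receive those keys get most of the rows.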

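Symptoms:

The job progress stays at 99% (or 100%) for a long time. The job monitoring page shows that only a few reduce tasks (one or several) are unfinished, because they are processing far more data than the other reducers.

The record count of a single reducer differs greatly from the average, often by 3x or more, and its longest running time is far above the average running time.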

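Solutions:

1. Optimize at the data source, at the business level.
2. Find the specific key values that repeat heavily, split them with a hash so they spread over several reducers, and sum in two stages: partial counts first, then the final total.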

create table data_skew(
    key string
    ,col string
) row format delimited fields terminated by ',';

-- Direct GROUP BY count
select key,count(*) from data_skew group by key;
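
Before splitting anything, it helps to know which key values are actually hot. A minimal sketch, assuming the data_skew table above has been loaded; the limit of 10 is an arbitrary choice:

```sql
-- List the heaviest keys first
select  key
        ,count(*) as cnt
from data_skew
group by key
order by cnt desc
limit 10;
```

The key values at the top of the result are the candidates for splitting in solution 2.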


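-- Two-stage count: hash the hot keys into buckets, count per (key, bucket), then sum the partial counts
select  key
        ,sum(cnt) as sum_cnt
from (
    select  key
            ,hash_key
            ,count(*) as cnt
    from (
        select  key
                ,col
                -- hot keys get one of 6 random buckets; every other key uses bucket 0
                -- 'null' is the literal string here; real NULLs would need "key is null"
                ,if(key = '84401' or key = 'null', hash(floor(rand() * 6)), 0) as hash_key
        from data_skew
    ) t1
    group by key, hash_key
) tt1
group by tt1.key;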
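
As an alternative to splitting hot keys by hand, Hive has built-in settings aimed at skewed GROUP BY queries. The fragment below is a sketch and not the method used in these notes; whether it helps depends on the Hive version and execution engine.

```sql
-- Pre-aggregate on the map side so fewer rows per key reach the reducers
set hive.map.aggr=true;
-- Plan the GROUP BY as two jobs: the first spreads rows over reducers at random
-- and computes partial aggregates, the second merges them per key
set hive.groupby.skewindata=true;

select key, count(*) as cnt from data_skew group by key;
```

The idea is the same as in the manual query: partial aggregation first, final aggregation second.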

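Schema on Read vs. Schema on Write in Hive

Can Hive be queried while data is being loaded?

Hive is schema-on-read: when data is loaded, Hive does not check whether it conforms to the table definition. The data is parsed against the schema only when it is read.

MySQL is schema-on-write: data is checked against the schema when it is written.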
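
A small sketch of what schema-on-read looks like in practice. The table name read_mode_demo and the file path /tmp/read_mode_demo.txt are hypothetical, chosen only for illustration.

```sql
create table read_mode_demo(
    id int
    ,name string
) row format delimited fields terminated by ',';

-- Suppose /tmp/read_mode_demo.txt contains these two lines:
--   1,tom
--   abc,jerry
-- LOAD DATA only copies the file into the table directory; no row is checked or rejected.
load data local inpath '/tmp/read_mode_demo.txt' into table read_mode_demo;

-- Parsing happens at query time. A field that cannot be converted to the column type
-- is returned as NULL, so the second row comes back with id = NULL.
select id, name from read_mode_demo;
```

The load succeeds even though 'abc' is not an int; the mismatch only shows up as a NULL when the table is queried.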

posted @ 2021-08-31 17:09 tonggang_bigdata

