How to deduplicate Hive tables after using UNION ALL

The business scenario is roughly this: there are two Hive tables, tableA and tableB, and both have the same columns:
uid cate1 cate2

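In Hive QL, UNION removes duplicates automatically, but only when entire rows are identical. What we need here is deduplication by uid.
In other words, we may have a situation like this:
1234 teacher singing
1234 teacher dancing
Of these two rows in the Hive table, we only want to keep one.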
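
To make the difference concrete, here is a minimal Python sketch (plain Python, not Hive). It assumes some made-up sample rows, including one exact duplicate that is not in the example above, and compares row-level deduplication (what UNION does) with uid-level deduplication (what we want).

```python
# Hypothetical sample rows: (uid, cate1, cate2)
rows = [
    ("1234", "teacher", "singing"),
    ("1234", "teacher", "dancing"),
    ("1234", "teacher", "dancing"),  # exact duplicate of the previous row
]

# Row-level dedupe, like UNION: only the exact duplicate disappears.
distinct_rows = sorted(set(rows))

# uid-level dedupe: one row per uid; a later row overwrites an earlier one.
by_uid = {}
for uid, cate1, cate2 in rows:
    by_uid[uid] = (uid, cate1, cate2)

print(len(distinct_rows))  # 2 rows survive row-level dedupe
print(len(by_uid))         # 1 row survives uid-level dedupe
```

Row-level deduplication still leaves two rows for uid 1234, which is why UNION alone is not enough here.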

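For this case, our rough approach is to add an artificial flag column while reading from the two tables, and then use Python code to decide which row to keep based on that flag.
To do the deduplication we wrote two pieces of code: a shell script that fetches the Hive data, and a Python script that processes it.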
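
The tagging step can be pictured with a small Python sketch. This only imitates what the UNION ALL subquery produces; the function name tag_and_union and the sample rows are hypothetical, not part of the original scripts.

```python
def tag_and_union(rows_a, rows_b):
    """Imitate the UNION ALL subquery: tableA rows get flag "0", tableB rows get flag "1"."""
    tagged = [row + ("0",) for row in rows_a]
    tagged += [row + ("1",) for row in rows_b]
    return tagged

# Hypothetical sample data: (uid, cate1, cate2)
table_a = [("1234", "teacher", "singing")]
table_b = [("1234", "teacher", "dancing"), ("5678", "student", "drawing")]

merged = tag_and_union(table_a, table_b)
for row in merged:
    print("\t".join(row))
```

Nothing is dropped at this stage; the flag only records which table each row came from, so a later step can choose between rows that share a uid.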

vim get_data.sh
function merge(){
cat <<EOF
add file ./process.py;
    select transform(a.*) using 'python process.py' as uid,cate1,cate2 from

    (select * from
    (select uid,cate1,cate2,"0" as flag from tableA where dt='sth1'
    union all
    select uid,cate1,cate2,"1" as flag from tableB where dt='sth2'
    )ts
    distribute by uid sort by uid,flag asc
    )a
EOF
}

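One thing in the code above deserves special attention:

distribute by uid sort by uid,flag asc

To understand this line, I looked up its explanation in the reference documentation.
In short, distribute by uid means that all rows with the same uid are sent to the same reducer. sort by uid,flag asc then orders the rows inside each reducer, so rows sharing a uid are adjacent and the flag "0" row (tableA) comes before the flag "1" row (tableB).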
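
The effect of that clause can be imitated in plain Python. This is only a sketch: the function name distribute_and_sort, the reducer count, and the use of zlib.crc32 as the partitioning hash are assumptions for illustration, and Hive's real hash function differs.

```python
from collections import defaultdict
import zlib

def distribute_and_sort(rows, num_reducers=2):
    """rows are (uid, cate1, cate2, flag) tuples; returns {reducer_id: sorted rows}."""
    partitions = defaultdict(list)
    for row in rows:
        # DISTRIBUTE BY uid: the same uid always hashes to the same reducer.
        # crc32 is a stand-in here, not the hash Hive actually uses.
        reducer_id = zlib.crc32(row[0].encode("utf-8")) % num_reducers
        partitions[reducer_id].append(row)
    # SORT BY uid, flag ASC: ordering happens inside each reducer, not globally.
    return {
        rid: sorted(part, key=lambda r: (r[0], r[3]))
        for rid, part in partitions.items()
    }

# Hypothetical sample rows, deliberately out of order
rows = [
    ("1234", "teacher", "dancing", "1"),
    ("5678", "student", "drawing", "1"),
    ("1234", "teacher", "singing", "0"),
]
result = distribute_and_sort(rows)
for rid, part in sorted(result.items()):
    print(rid, part)
```

Whichever reducer receives uid 1234, it sees both rows for that uid next to each other with flag "0" first, which is exactly what the streaming script relies on.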

vim process.py

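#!/usr/bin/env python
# -*- coding: utf-8 -*-
import sys

# Input arrives sorted by uid, then flag (ascending), so all rows for one uid
# are adjacent and the row with the largest flag comes last.
lastuid = ""
cate1 = ""
cate2 = ""

for line in sys.stdin:
    line = line.replace("\n", "").replace(" ", "")
    v = line.split("\t")
    # Skip malformed rows: expect uid, cate1, cate2, flag with a numeric uid.
    if len(v) != 4 or not v[0].isdigit():
        continue
    uid = v[0]
    # A new uid means the previous uid is complete, so emit its last row.
    if lastuid != "" and lastuid != uid:
        print(lastuid + "\t" + cate1 + "\t" + cate2)
    # Later rows overwrite earlier ones, so the row with the highest flag wins.
    lastuid, cate1, cate2 = uid, v[1], v[2]

# Emit the final uid, which the loop never flushes. This is the same pattern
# as the classic Python word count example. Skip it if no valid row was read.
if lastuid != "":
    print(lastuid + "\t" + cate1 + "\t" + cate2)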
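
For checking the logic locally without Hive, the same "keep the last row per uid" rule can be written as a function over a list of lines. This is a sketch using itertools.groupby rather than the script's manual lastuid bookkeeping; the function name dedupe_sorted and the sample lines are hypothetical.

```python
from itertools import groupby

def dedupe_sorted(lines):
    """Take tab-separated uid/cate1/cate2/flag lines, already sorted by uid then flag.

    Returns one tab-separated uid/cate1/cate2 string per uid, taken from the
    last row of that uid (the one with the highest flag).
    """
    parsed = (line.rstrip("\n").split("\t") for line in lines)
    # Drop malformed rows: wrong field count or non-numeric uid.
    valid = (v for v in parsed if len(v) == 4 and v[0].isdigit())
    out = []
    for uid, group in groupby(valid, key=lambda v: v[0]):
        last = list(group)[-1]
        out.append("\t".join(last[:3]))
    return out

sample = [
    "1234\tteacher\tsinging\t0\n",
    "1234\tteacher\tdancing\t1\n",
    "5678\tstudent\tdrawing\t1\n",
    "bad line\n",
]
result = dedupe_sorted(sample)
for row in result:
    print(row)
```

To prefer the tableA row instead, one option would be to take the first element of each group rather than the last, or to sort by flag in descending order in the query.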

Posted 2019-03-15 12:24 by DUDUDA
