How to load an external file in the Python code used when creating a Hive table

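The business scenario is roughly this: I need to run word segmentation on users' blog posts (for that step, see the article "How to use a third-party library that is not installed when calling Python from Hive - how to use external python library in hadoop").
After segmenting each post, I need to remove stop words from the result. However, the company's Hadoop cluster does not have the stop-word file we need. Solving this is very similar to the article mentioned above: the question is how to use our own files or packages inside a Hive custom function.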

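The solution is roughly as follows.
First, add the statement add file ./stop_word.txt; to the shell script. add file registers the file as a Hive resource, so it is shipped along with the query and the Python script can open it by its bare file name: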

function zida(){
cat <<EOF
add file ./jieba.mod;
add file ./stop_word.txt;
add file ./zida.py;

    select transform(tmp.*) using 'python zida.py test'
    AS uid,bowen
    FROM(
        select *  from hive_table)tmp
EOF
}

hive -e "`zida`"
echo "zida"
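
For context, here is a minimal sketch of what the skeleton of zida.py could look like on the receiving end of the transform clause. By default Hive feeds each row to the script on stdin as one tab-separated line and reads tab-separated lines back from stdout. The function names transform_rows and main, and the assumption that the first two columns of hive_table are uid and the post text, are mine for illustration and are not taken from the real script; the process argument is only a placeholder for the segmentation and stop-word step.

```python
# -*- coding: utf-8 -*-
import sys


def transform_rows(lines, process=lambda text: text):
    """Yield one 'uid<TAB>bowen' output line per well-formed input row.

    Assumption: the first two tab-separated fields are uid and the post text.
    `process` is a hypothetical hook for segmentation / stop-word removal.
    """
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip rows that do not have both columns
        uid, bowen = fields[0], fields[1]
        yield uid + "\t" + process(bowen)


def main():
    # Hive TRANSFORM streams rows on stdin and collects stdout as the result.
    for out in transform_rows(sys.stdin):
        print(out)

# In the real script, main() would be called at the bottom of the file.
```

Keeping the row handling in a function that takes any iterable of lines makes it possible to try the logic locally on a small list of strings before submitting the Hive job.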

Then add the corresponding code to the Python script:

import io
stopwords = [line.strip() for line in io.open('stop_word.txt','r',encoding='utf-8').readlines()]

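On my first attempt this approach raised an error. The company's Python runtime is fairly old (most likely Python 2, whose built-in open() does not accept an encoding argument), so reading the Chinese text file failed.
The original code looked like this: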

stopwords = [line.strip() for line in open('stop_word.txt','r',encoding='utf-8').readlines()]

It raised this error:
'encoding' is an invalid keyword argument for this function

The fix is to use io.open, which accepts the encoding argument on Python 2.6+ as well as on Python 3:

import io
stopwords = [line.strip() for line in io.open('stop_word.txt','r',encoding='utf-8').readlines()]
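
To show how the loaded list is then used, here is a small sketch of the stop-word removal step itself. It assumes the segmentation step has already produced a list of tokens; the helper names load_stopwords and remove_stopwords are hypothetical and not part of the original script. It loads the words into a set rather than a list, since the membership test runs once per token.

```python
# -*- coding: utf-8 -*-
import io


def load_stopwords(path="stop_word.txt"):
    """Read one stop word per line; io.open takes `encoding` on Python 2.6+ and 3."""
    with io.open(path, "r", encoding="utf-8") as f:
        # Strip whitespace and drop empty lines.
        return set(line.strip() for line in f if line.strip())


def remove_stopwords(tokens, stopwords):
    """Return the tokens that are not stop words, preserving their order."""
    return [t for t in tokens if t not in stopwords]
```

The with block also closes the file handle once loading is done, which the one-line version leaves to the interpreter.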

The fix for this problem was taken from an online reference.

References:
A very good summary of this approach: hive+python数据分析入门 (an introduction to data analysis with Hive and Python)
Accessing external file in Python UDF

posted @ 2019-03-15 12:21 DUDUDA
