Accessing Hive with impyla

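How do you access Hive from Python? Below I summarize the problems I ran into along the way.

(A note: to get past the pitfalls in this post I read through countless threads online, and the decisive hint was often a single sentence tucked into the corner of some GitHub thread; without it, it is very easy to go wrong. There were too many posts to link them all individually.)

First, the choice of components: I use impyla + Python 3.7 to connect to the Hadoop (Hive) database. If your setup is different, the rest of this post probably will not apply to you.

The code being run is: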

import impala.dbapi as ipdb
conn = ipdb.connect(host="192.168.XX.XXX",port=10000,user="xxx",password="xxxxxx",database="xxx",auth_mechanism='PLAIN')
cursor = conn.cursor()
cursor.execute('select * From xxxx')
print(cursor.description)  # prints the result set's schema
for rowData in cursor.fetchall():
    print(rowData)
conn.close()

Pitfall 1: a syntax error is reported
Symptom:

/Users/wangxxin/miniconda3/bin/python3.7 /Users/wangxxin/Documents/Python/PythonDataAnalyze/project/knDt/pyHiveTest.py
Traceback (most recent call last):
  File "/Users/wangxxin/Documents/Python/PythonDataAnalyze/project/knDt/pyHiveTest.py", line 1, in <module>
    import impala.dbapi as ipdb
  File "/Users/wangxxin/miniconda3/lib/python3.7/site-packages/impala/dbapi.py", line 28, in <module>
    import impala.hiveserver2 as hs2
  File "/Users/wangxxin/miniconda3/lib/python3.7/site-packages/impala/hiveserver2.py", line 340
    async=True)

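Solution: rename every async parameter to "async_" (any name works, as long as it is used consistently and is not a keyword). Reason: async became a reserved keyword in Python 3.7, so using it as a parameter name raises a syntax error. The places to change include at least the following:

# hiveserver2.py, around line 338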
op = self.session.execute(self._last_operation_string,
                                  configuration,
                                  async_=True)
# hiveserver2.py, around line 1022
def execute(self, statement, configuration=None, async_=False):
    req = TExecuteStatementReq(sessionHandle=self.handle,
                               statement=statement,
                               confOverlay=configuration,
                               runAsync=async_)
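
To check whether a name is safe to use as a parameter on your interpreter, the standard library can answer directly. The following is only an illustrative sketch; the helper name is_valid_parameter_name is made up for this example and is not part of impyla.

```python
import keyword


def is_valid_parameter_name(name):
    """Return True if `name` can be used as a function parameter name."""
    # A parameter name must be an identifier and must not be a reserved keyword.
    return name.isidentifier() and not keyword.iskeyword(name)


# On Python 3.7 and later, `async` is reserved, so the first check is False.
print("async :", is_valid_parameter_name("async"))
print("async_:", is_valid_parameter_name("async_"))
```

If the first line prints False, any library source that still uses async as a parameter name will fail at import time, as in the traceback above.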

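Pitfall 2: the Parser.py file that ships with thriftpy has a problem and raises an error when it is loaded
This appears to matter on Windows, where the drive letter of a path is parsed as a URL scheme.
Solution:

# Adjusted from the original code, following advice found online:
# one-letter schemes ('c', 'd', ...) are drive letters, so treat them as local files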
elif url_scheme in ('c', 'd', 'e', 'f'):
    with open(path) as fh:
        data = fh.read()
elif url_scheme in ('http', 'https'):
    data = urlopen(path).read()
else:
    raise ThriftParserError('ThriftPy does not support generating module '
                            'with path in protocol \'{}\''.format(
                                url_scheme))
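
Why would single letters such as 'c' or 'd' show up as a URL scheme at all? The sketch below shows how urlparse reads the drive letter of a Windows path as the scheme. The helper is_local_path and the sample paths are hypothetical, and treating any one-letter scheme as a drive letter is my own generalization of the hard-coded tuple above, not thriftpy's code.

```python
from urllib.parse import urlparse


def is_local_path(path):
    """Guess whether `path` is a local file rather than a URL."""
    scheme = urlparse(path).scheme
    # '' means a plain POSIX path; a single letter is assumed to be a drive letter.
    return scheme == "" or len(scheme) == 1


samples = [
    r"C:\work\example.thrift",         # hypothetical Windows path
    "/Users/me/example.thrift",        # hypothetical POSIX path
    "https://example.com/api.thrift",  # a real URL scheme
]
for sample in samples:
    print(repr(urlparse(sample).scheme), is_local_path(sample), sample)
```

A check like this would also cover drives beyond F:, which the tuple ('c', 'd', 'e', 'f') does not.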

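I recommend applying the fixes for pitfalls 1 and 2 right away; both changes are definitely required.

Pitfall 3: with the two problems above fixed, running again produces the following error: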

TProtocolException: TProtocolException(type=4)

Solution:
The cause is that the connect call is missing the parameter auth_mechanism='PLAIN'. Change it as follows:

import impala.dbapi as ipdb
conn = ipdb.connect(host="192.168.XX.XXX",port=10000,user="xxx",password="xxxxxx",database="xxx",auth_mechanism='PLAIN')

Pitfall 4: with problem 3 fixed, run the program again and you get yet another error:

AttributeError: 'TSocket' object has no attribute 'isOpen'

Solution:
The installed thrift-sasl version (0.3.0) is too new, so downgrade thrift-sasl to 0.2.1:

pip uninstall thrift-sasl
pip install thrift-sasl==0.2.1

Pitfall 5: after that fix, running again fails again (at this point I was close to giving up, but hang in there, you are actually very near the end):

thriftpy.transport.TTransportException: TTransportException(type=1, message="Could not start SASL: b'Error in sasl_client_start (-4) SASL(-4): no mechanism available: Unable to find a callback: 2'")

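Solution: this one is the most troublesome, and currently the hardest to find a fix for. The hint that worked came from the following reply: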

I solved the issue, had to uninstall the package SASL and install PURE-SASL, when impyla can´t find the sasl package it works with pure-sasl and then everything goes well.

The root cause is a conflict between sasl and pure-sasl. In that situation, simply uninstalling the sasl package is enough.

pip uninstall SASL
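
Before and after uninstalling, it can help to see which of the two SASL packages Python can actually import. This is a minimal sketch using only the standard library. It assumes the import names are sasl and puresasl (the module names I believe these two packages install), and the helper name importable is made up for illustration.

```python
import importlib.util


def importable(module_name):
    """Return True if `module_name` can be found on the current Python path."""
    return importlib.util.find_spec(module_name) is not None


# Assumed import names: `sasl` for the sasl package, `puresasl` for pure-sasl.
for name in ("sasl", "puresasl"):
    print(name, "importable:", importable(name))
```

After the uninstall, the goal is for sasl to report False and puresasl to report True.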

Pitfall 6: even after that, running again may still fail:

TypeError: can't concat str to bytes

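The last frame of the traceback points to line 94 of init.py (presumably thrift_sasl's __init__.py; the line was highlighted in the original post). The fix is to encode body when it is a str: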

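# Following a fix suggested online: convert a str body to bytes.
# Encoding before building the header makes len(body) count bytes, not characters.
if isinstance(body, str):
    body = body.encode()
header = struct.pack(">BI", status, len(body))
self._trans.write(header + body)
self._trans.flush()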
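
The error can be reproduced in isolation, which makes the fix easier to see. The sketch below is not thrift_sasl's code; build_frame is a hypothetical stand-in that uses the same struct.pack(">BI", ...) header as the patch above and encodes a str body before measuring its length.

```python
import struct


def build_frame(status, body):
    """Build a frame: 1-byte status, 4-byte big-endian length, then the body."""
    if isinstance(body, str):
        body = body.encode()  # UTF-8 by default
    header = struct.pack(">BI", status, len(body))
    return header + body


# Concatenating the bytes header with a str body is what raises the TypeError.
try:
    struct.pack(">BI", 1, 3) + "abc"
except TypeError as exc:
    print("without encoding:", exc)

print("with encoding:", build_frame(1, "abc"))
```

Measuring the length after encoding matters only for non-ASCII bodies, where the number of bytes differs from the number of characters.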

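After these steps you should be able to connect to Hive and query data without further problems.

To wrap up: connecting to Hadoop involves quite a few dependency packages, so please check them carefully. Ideally your set of relevant packages matches the list below exactly, no more and no fewer; that really does avoid a lot of problems.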

No.	Package	Version	Install command
1	pure_sasl	0.5.1	pip install pure_sasl==0.5.1 -i https://pypi.tuna.tsinghua.edu.cn/simple
2	thrift	0.9.3	pip install thrift==0.9.3 -i https://pypi.tuna.tsinghua.edu.cn/simple
3	bitarray	0.8.3	pip install bitarray==0.8.3 -i https://pypi.tuna.tsinghua.edu.cn/simple
4	thrift_sasl	0.2.1	pip install thrift_sasl==0.2.1 -i https://pypi.tuna.tsinghua.edu.cn/simple
5	thriftpy	0.3.9	pip install thriftpy==0.3.9 -i https://pypi.tuna.tsinghua.edu.cn/simple
6	impyla	0.14.1	pip install impyla==0.14.1 -i https://pypi.tuna.tsinghua.edu.cn/simple
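
To compare an environment against the table above without reading pip output by hand, a small script can do the check. This is a sketch under a few assumptions: it needs importlib.metadata (Python 3.8+; on 3.7 the importlib_metadata backport would be needed instead), the helper names are made up, and how underscores versus hyphens in package names are matched may vary between Python versions.

```python
from importlib import metadata  # standard library from Python 3.8

# Versions copied from the table above.
PINS = {
    "pure_sasl": "0.5.1",
    "thrift": "0.9.3",
    "bitarray": "0.8.3",
    "thrift_sasl": "0.2.1",
    "thriftpy": "0.3.9",
    "impyla": "0.14.1",
}


def installed_version(package):
    """Return the installed version of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(pins, get_version=installed_version):
    """Return {package: (wanted, found)} for every package that differs."""
    mismatches = {}
    for package, wanted in pins.items():
        found = get_version(package)
        if found != wanted:
            mismatches[package] = (wanted, found)
    return mismatches


for package, (wanted, found) in find_mismatches(PINS).items():
    print(f"{package}: expected {wanted}, found {found}")
```

No output means every package in the table is installed at the listed version.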

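I recommend installing in the order listed. I had dependency problems earlier and in the end installed through conda.
Installing the thriftpy, thrift_sasl and impyla packages with pip raised errors. Since I had conda, I used conda install directly, which downloads the required dependencies automatically, as shown below (for reference by readers without a conda environment):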

package	build	size
ply-3.11	py37_0	80 KB
conda-4.6.1	py37_0	1.7 MB
thriftpy-0.3.9	py37h1de35cc_2	171 KB

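Good luck! If you still run into problems in practice, please leave a comment.

I have also seen someone report that the following set of dependencies works without problems: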

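python3.6.4, bitarray==1.1.0, thrift==0.9.3, thrift-sasl==0.2.1, six==1.12.0, pure-sasl==0.6.2, impyla==0.15.0

One final tip:
Do not end the SQL statement with a semicolon, or it will raise an error. This is not an environment problem, though. The error looks like this: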

impala.error.HiveServer2Error: Error while compiling statement: FAILED: ParseException line 2:83 cannot recognize input near ';' '<EOF>' '<EOF>' in expression specification
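
If the SQL text comes from files or other tools that append a semicolon, it can be stripped before the statement is passed to cursor.execute. A minimal sketch; the helper name strip_trailing_semicolons is hypothetical and not part of impyla.

```python
def strip_trailing_semicolons(sql):
    """Remove trailing semicolons and whitespace from a single SQL statement."""
    # Only the end of the string is touched; semicolons inside the statement stay.
    return sql.rstrip("; \t\r\n")


print(strip_trailing_semicolons("select * from xxxx;\n"))

# Intended use with an existing cursor:
# cursor.execute(strip_trailing_semicolons(sql))
```

This only handles a single statement; text containing several statements separated by semicolons would still need to be split and executed one at a time.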

posted @ 2020-08-26 16:31 DB乐之者
