Troubleshooting Common Impala SQL Errors: Notes and Fixes
Background
I am currently the sole owner of a data product that executes user-submitted Impala SQL over Hive-JDBC to power datasets, data monitoring, data push, and related features. The Hive-JDBC version is 2.1.1-cdh6.2.1:
<dependency>
    <groupId>org.apache.hive</groupId>
    <artifactId>hive-jdbc</artifactId>
    <version>2.1.1-cdh6.2.1</version>
</dependency>
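For context, here is a minimal sketch of how such a product might submit user SQL over Hive-JDBC. The host, port, and query are placeholders, and the real code goes through a Druid connection pool, as the stack traces later in this post show:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ImpalaJdbcSketch {
    public static void main(String[] args) throws Exception {
        // Impala daemons speak the HiveServer2 protocol (default port 21050),
        // so the hive-jdbc driver and a jdbc:hive2:// URL are used.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        String url = "jdbc:hive2://impalad-host:21050/default"; // placeholder host
        try (Connection con = DriverManager.getConnection(url);
             Statement st = con.createStatement();
             ResultSet rs = st.executeQuery("select 1")) {      // placeholder query
            while (rs.next()) {
                System.out.println(rs.getInt(1));
            }
        }
    }
}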
Errors
ParseException: Syntax error in line 34: sum(enable). Encountered: ENABLE. Expected: ALL…
SQL that ran without issue on the previous CDH version started failing after the cluster was upgraded to CDH 7.1.7. The full error message:
ParseException: Syntax error in line 34:\nsum(enable) as 辅助\n ^\nEncountered: ENABLE\nExpected: ALL, CASE, CAST, DATE, DEFAULT, DISTINCT, EXISTS, FALSE, IF, INTERVAL, LEFT, NOT, NULL, REPLACE, RIGHT, TRUNCATE, TRUE, IDENTIFIER\n\nCAUSED BY: Exception: Syntax error
In other words, enable has become a reserved keyword, so the parser reports a syntax error.
Fix: wrap the identifier in backticks, i.e.:
sum(`enable`)
Note: adding set ENABLE_EXPR_REWRITES = 0; as the first line of the SQL does not fix this. For background on set ENABLE_EXPR_REWRITES = 0;, see the separate post on set ENABLE_EXPR_REWRITES = 0 and the "GROUP BY expression must not contain aggregate functions" problem.
default.url_decode() unknown for database default. Currently this db has 4 functions.
The SQL uses the url_decode() function and fails to execute. This is not a hive-jdbc problem: running the same query in the visual query UI fails as well.
TransmitData() to ip:27000 failed: Network error: Client connection negotiation failed: client connection to ip:27000: connect: Connection refused (error 111)
A TCP communication problem between nodes; retry manually. See the reference: TransmitData 27000 Network-error.
Failed to close HDFS file. The DiskSpace quota of is exceeded
The full error message:
Failed to close HDFS file: hdfs://ppdhdpha/user/hive/warehouse/lkzc.db/jh_100701_cdd_noapply_cutdown_user/_impala_insert_staging//.0.parq\\nError(255): Unknown error 255\\nRoot cause: RemoteException: The DiskSpace quota of /user/hive/warehouse/lkzc.db is exceeded: quota = 65970697666560 B = 60 TB but diskspace consumed = 65971183008777 B = 60.00 TB
The lkzc business line's warehouse directory has hit its disk-space quota, so no more data can be written; the _impala_insert_staging path in the error above shows the failure happened while inserting.
Fix: drop unused large tables to free up space.
The library hdfs:/udf/hive.udf-1.0.2.jar last modified time does not match the expected last modified time. Run 'refresh functions <db name>'.
A SQL statement submitted to the big data platform via hive-jdbc failed with:
java.lang.Exception: java.sql.SQLException: The library hdfs:///udf/hive.udf-1.0.2.jar last modified time 1670929674 does not match the expected last modified time 1653377949. Run 'refresh functions <db name>'.
A rather baffling error that only occurs occasionally; re-running the statement succeeds. A sketch of automating the refresh suggested by the message follows.
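Since the error message itself names the remedy, one option is to catch it, run the suggested refresh, and retry once. This is only a rough sketch under that assumption, not code from the project:

import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

class UdfRefreshRetry {
    // If the UDF jar last-modified-time check fails, run the
    // "refresh functions <db>" that the error message asks for, then retry once.
    static void execute(Connection con, String db, String sql) throws SQLException {
        try (Statement st = con.createStatement()) {
            st.execute(sql);
        } catch (SQLException e) {
            String msg = String.valueOf(e.getMessage());
            if (!msg.contains("does not match the expected last modified time")) {
                throw e; // unrelated failure: do not mask it
            }
            try (Statement st = con.createStatement()) {
                st.execute("refresh functions " + db);
            }
            try (Statement st = con.createStatement()) {
                st.execute(sql); // single retry
            }
        }
    }
}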
Failed to close HDFS file. Unable to close file because the last blockBP does not have enough number of replicas.
The error message:
java.sql.SQLException: Failed to close HDFS file: hdfs:/user/hive/warehouse/lkzc.db/jh_061201_ffhywkt_chuda/_impala_insert_staging/
Error(255): Unknown error 255
Root cause: IOException: Unable to close file because the last blockBP does not have enough number of replicas.
Failed due to unreachable impalad(s): hadoopcbd008156.ppdgdsl.com:2200
The SQL was submitted to the Impala cluster, and one of the impalad nodes it needed happened to be unreachable (a network hiccup or similar).
No great solution for now:
- Retry manually: resubmit the SQL, and it will most likely (?) not land on that node again;
- Add automatic retry logic to the application, but only for specific kinds of errors, and with a small retry count (one or two attempts at most). The platform handles tens of thousands of user-submitted SQL statements, so blanket automatic retries would noticeably raise the load on the Impala cluster; a sketch follows this list.
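A rough sketch of such a guarded retry. The whitelist of error fragments, and the idea of taking a fresh connection per attempt so a retry may land on a different node behind the load balancer, are my own assumptions rather than the platform's actual code:

import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.Arrays;
import java.util.List;
import javax.sql.DataSource;

class TransientRetry {
    // Illustrative whitelist: only errors that look like transient
    // infrastructure failures are retried; everything else fails fast.
    private static final List<String> RETRYABLE = Arrays.asList(
            "Failed due to unreachable impalad",
            "TransmitData() to",
            "does not have enough number of replicas");

    // Keep maxRetries small (1-2): with tens of thousands of user SQLs,
    // aggressive retries would add real load on the Impala cluster.
    static void execute(DataSource ds, String sql, int maxRetries) throws SQLException {
        SQLException last = null;
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            // Fresh connection per attempt, so a retry may hit another node.
            try (Connection con = ds.getConnection();
                 Statement st = con.createStatement()) {
                st.execute(sql);
                return;
            } catch (SQLException e) {
                last = e;
                String msg = String.valueOf(e.getMessage());
                if (RETRYABLE.stream().noneMatch(msg::contains)) {
                    throw e;
                }
            }
        }
        throw last; // still failing after the bounded retries
    }
}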
org.apache.hive.service.cli.HiveSQLException: Invalid or unknown query handle
SQLException: Cancelled
The full error message:
java.sql.SQLException: Cancelled
at org.apache.hive.jdbc.HiveStatement.waitForOperationToComplete(HiveStatement.java:381)
at org.apache.hive.jdbc.HiveStatement.execute(HiveStatement.java:260)
at org.apache.hive.jdbc.HiveStatement.executeQuery(HiveStatement.java:473)
at com.alibaba.druid.pool.DruidPooledStatement.executeQuery(DruidPooledStatement.java:308)
A SQL statement submitted via hive-jdbc fails with nothing more than "Cancelled" in the logs. Cause: the platform's nodes were short on resources and simply discarded the task.
This is distinct from the "Cancelled from Impala's debug web interface" error below.
Fix: retry and optimize the SQL. There does not seem to be a better option.
Cancelled in ScannerContext
Cancelled from Impala’s debug web interface
The full error message:
java.sql.SQLException: Cancelled from Impala's debug web interface
at org.apache.hive.jdbc.HiveStatement.waitForOperationToComplete(HiveStatement.java:381)
at org.apache.hive.jdbc.HiveQueryResultSet.next(HiveQueryResultSet.java:356)
at com.alibaba.druid.pool.DruidPooledResultSet.next(DruidPooledResultSet.java:68)
Fix: retry and optimize the SQL. There does not seem to be a better option.
Invalid query handle: xxxx.yyyy
The error message:
ERROR c.a.druid.pool.DruidPooledStatement - clearResultSet error
org.apache.hive.service.cli.HiveSQLException: Invalid query handle: d84d9133d8a6ce9c:9a77cd100000000
at org.apache.hive.jdbc.Utils.verifySuccess(Utils.java:266)
at org.apache.hive.jdbc.Utils.verifySuccessWithInfo(Utils.java:252)
at org.apache.hive.jdbc.HiveStatement.closeStatementIfNeeded(HiveStatement.java:210)
at org.apache.hive.jdbc.HiveStatement.closeClientOperation(HiveStatement.java:221)
at org.apache.hive.jdbc.HiveQueryResultSet.close(HiveQueryResultSet.java:308)
at com.alibaba.druid.pool.DruidPooledResultSet.close(DruidPooledResultSet.java:86)
at com.alibaba.druid.pool.DruidPooledStatement.clearResultSet(DruidPooledStatement.java:206)
at com.alibaba.druid.pool.DruidPooledStatement.close(DruidPooledStatement.java:514)
The code fragment where it occurs:
finally {
    if (ps != null) {
        // Statement.close() -> closeClientOperation() is where the
        // "Invalid query handle" HiveSQLException above is thrown.
        ps.close();
    }
    if (con != null) {
        con.close(); // never reached if ps.close() throws
    }
}
The exception is thrown at Statement.close().
The query had been killed after exceeding its time limit. Platform resources are finite, so a user's query cannot be allowed to run indefinitely; the kill rules weigh three factors: amount of data scanned, execution time, and degree of completion.
On the data-volume side, avoid select * wherever possible: list only the columns you actually need, and add a limit clause (e.g. limit 10000) to cap the number of rows returned.
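On the client side, one way to keep this close-time exception from obscuring what actually happened is to close resources quietly: if the server has already killed the query, there is nothing left to release anyway. A sketch under that assumption, not the project's actual code:

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

class QuietClose {
    // Closing a statement whose query was already killed server-side throws
    // "Invalid query handle"; log it and move on instead of letting it propagate.
    static void closeQuietly(AutoCloseable c) {
        if (c == null) {
            return;
        }
        try {
            c.close();
        } catch (Exception e) {
            System.err.println("ignoring close error: " + e.getMessage());
        }
    }

    static void run(Connection con, String sql) throws SQLException {
        Statement ps = null;
        ResultSet rs = null;
        try {
            ps = con.createStatement();
            rs = ps.executeQuery(sql);
            while (rs.next()) {
                // process the row ...
            }
        } finally {
            closeQuietly(rs);
            closeQuietly(ps);
            closeQuietly(con);
        }
    }
}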
SQLException: Sender timed out waiting for receiver fragment instance: , dest node: 66
The full error message:
java.sql.SQLException: Sender timed out waiting for receiver fragment instance: 394c696029ddcce6:a51b7cab000007cc, dest node: 66
at org.apache.hive.jdbc.HiveStatement.waitForOperationToComplete(HiveStatement.java:381)
at org.apache.hive.jdbc.HiveStatement.execute(HiveStatement.java:260)
at org.apache.hive.jdbc.HiveStatement.executeQuery(HiveStatement.java:473)
at com.alibaba.druid.pool.DruidPooledStatement.executeQuery(DruidPooledStatement.java:308)
Retries sometimes succeed and sometimes fail. Logging into the Hadoop cluster and checking the query execution logs showed that this SQL was doing a full scan of a table with more than 12,000 partitions, reading 20.1 TB of data.
Fix: retry and optimize the SQL. There does not seem to be a better option.
Failed to open HDFS file. No such file or directory. Root cause: RemoteException: File does not exist:
The query fails with the following error:
java.sql.SQLException: Disk I/O error: Failed to open HDFS file hdfs://ppdhdpha/user/hive/warehouse/test.db/chengzhangquanyi_huolizhiguoqi_chuda/2c43254ab60211d3-cf0e47d200000235_298950249_data.0.
Error(2): No such file or directory
Root cause: RemoteException: File does not exist: /user/hive/warehouse/test.db/chengzhangquanyi_huolizhiguoqi_chuda/2c43254ab60211d3-cf0e47d200000235_298950249_data.0.
at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:85)
at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:75)
at org.apache.hadoop.hdfs.server.namenode.FSDirStatAndListingOp.getBlockLocations(FSDirStatAndListingOp.java:152)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1909)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:735)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:415)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:523)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:869)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:815)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1875)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2675)
Cause: the data file referenced by Impala's cached metadata no longer exists on HDFS (typically because the table was rebuilt or overwritten). Things to try: refresh <table> or invalidate metadata <table>, then re-run the query; a JDBC sketch follows.
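A sketch of that remediation over JDBC. The table name in the usage note is taken from the error above; whether refresh alone is enough or a full invalidate metadata is needed depends on how the file disappeared, and this is my reading rather than a claim from the original post:

import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

class StaleMetadataFix {
    // Reload Impala's cached file metadata for the table, then re-run the query.
    // If refresh is not enough, fall back to "invalidate metadata", which drops
    // the cached metadata entirely and forces a full reload on next access.
    static void refreshAndRetry(Connection con, String table, String sql) throws SQLException {
        try (Statement st = con.createStatement()) {
            st.execute("refresh " + table);
            // st.execute("invalidate metadata " + table); // heavier fallback
            st.execute(sql);
        }
    }
}

For the error above, the call would look like refreshAndRetry(con, "test.chengzhangquanyi_huolizhiguoqi_chuda", sql).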
Error(s) deleting partition directories. First error (of 37) was: Hdfs op. Input/output error
java.sql.SQLException: Error(s) deleting partition directories. First error (of 37) was: Hdfs op (DELETE hdfs://ppdhdpha/user/hive/warehouse/cszc.db/zmj_sop_m0_base_snp/1d460f3a4d87ea14-a4c4521100000091_870332460_data.0.) failed, error was: hdfs://ppdhdpha/user/hive/warehouse/cszc.db/zmj_sop_m0_base_snp/1d460f3a4d87ea14-a4c4521100000091_870332460_data.0.
Error(5): Input/output error
Fix: retry and optimize the SQL. There does not seem to be a better option.
java.sql.SQLException: Admission for query exceeded timeout 180000ms in pool root.report. Queued reason: queue is not empty (size 5); queued queries are executed first.
The full error message:
ERROR c.alibaba.druid.pool.DruidDataSource - create connection SQLException, url: jdbc:hive2://111.222.333.66:21050/default, errorCode 0, state HY000
java.sql.SQLException: Admission for query exceeded timeout 180000ms in pool root.report. Queued reason: queue is not empty (size 4); queued queries are executed first.
at org.apache.hive.jdbc.HiveStatement.waitForOperationToComplete(HiveStatement.java:381)
at org.apache.hive.jdbc.HiveStatement.execute(HiveStatement.java:260)
at org.apache.hive.jdbc.HiveStatement.executeQuery(HiveStatement.java:473)
at com.alibaba.druid.pool.DruidAbstractDataSource.validateConnection(DruidAbstractDataSource.java:1434)
at com.alibaba.druid.pool.DruidAbstractDataSource.createPhysicalConnection(DruidAbstractDataSource.java:1733)
at com.alibaba.druid.pool.DruidDataSource$CreateConnectionThread.run(DruidDataSource.java:2838)
SQL submitted to the platform has to queue for admission; anything queued for more than 3 minutes (180000 ms) is rejected.
GetConnectionTimeoutException & Rejected query from pool root.report: queue full, limit=200, num_queued=200.
The full error message:
java.lang.Exception: java.lang.Exception: com.alibaba.druid.pool.GetConnectionTimeoutException: wait millis 120000, active 1, maxActive 150, creating 0, createErrorCount 2
at com.xy.cloudiview.common.util.JdbcUtil.getConnection(JdbcUtil.java:136)
at com.xy.cloudiview.datasetsubscript.business.service.impl.TableWarnServiceImpl.getSqlCount(TableWarnServiceImpl.java:244)
at com.xy.cloudiview.datasetsubscript.business.service.impl.TableWarnServiceImpl.executeTableWarnJob(TableWarnServiceImpl.java:66)
at com.xy.cloudiview.datasetsubscript.business.xxljob.IviewTableWarnJobHandler.execute(IviewTableWarnJobHandler.java:45)
at com.ppdai.job.core.thread.JobThread.run(JobThread.java:142)
Caused by: java.lang.Exception: com.alibaba.druid.pool.GetConnectionTimeoutException: wait millis 120000, active 1, maxActive 150, creating 0, createErrorCount 2
at com.xy.cloudiview.common.util.JdbcUtil.getConnection(JdbcUtil.java:122)
... 4 common frames omitted
Caused by: com.alibaba.druid.pool.GetConnectionTimeoutException: wait millis 120000, active 1, maxActive 150, creating 0, createErrorCount 2
at com.alibaba.druid.pool.DruidDataSource.getConnectionInternal(DruidDataSource.java:1773)
at com.alibaba.druid.pool.DruidDataSource.getConnectionDirect(DruidDataSource.java:1427)
at com.alibaba.druid.pool.DruidDataSource.getConnection(DruidDataSource.java:1407)
at com.alibaba.druid.pool.DruidDataSource.getConnection(DruidDataSource.java:1397)
at com.alibaba.druid.pool.DruidDataSource.getConnection(DruidDataSource.java:100)
at com.xy.cloudiview.common.util.JdbcUtil.getConnection(JdbcUtil.java:118)
... 4 common frames omitted
Caused by: java.sql.SQLException: Rejected query from pool root.report: queue full, limit=200, num_queued=200.
at org.apache.hive.jdbc.HiveStatement.waitForOperationToComplete(HiveStatement.java:381)
at org.apache.hive.jdbc.HiveStatement.execute(HiveStatement.java:260)
at org.apache.hive.jdbc.HiveStatement.executeQuery(HiveStatement.java:473)
at com.alibaba.druid.pool.DruidAbstractDataSource.validateConnection(DruidAbstractDataSource.java:1434)
at com.alibaba.druid.pool.DruidAbstractDataSource.createPhysicalConnection(DruidAbstractDataSource.java:1733)
at com.alibaba.druid.pool.DruidDataSource$CreateConnectionThread.run(DruidDataSource.java:2838)
The platform caps SQL submissions: at most 150 statements run concurrently and 200 can queue, with a queue timeout of 5 minutes. Statements submitted beyond that wait for a connection, time out, and fail with the errors above.
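The numbers in the stack trace map directly onto the Druid pool settings. A hedged sketch of a configuration that would produce them; the URL is a placeholder and the values are read off the error message rather than taken from the project's real config:

import com.alibaba.druid.pool.DruidDataSource;

class ImpalaPoolConfig {
    static DruidDataSource build() {
        DruidDataSource ds = new DruidDataSource();
        ds.setUrl("jdbc:hive2://impalad-host:21050/default");   // placeholder host
        ds.setDriverClassName("org.apache.hive.jdbc.HiveDriver");
        ds.setMaxActive(150);     // "maxActive 150" in the GetConnectionTimeoutException
        ds.setMaxWait(120000);    // "wait millis 120000" before giving up on a connection
        ds.setValidationQuery("select 1"); // runs when creating/validating a connection,
                                           // which is why HiveStatement frames appear under
                                           // createPhysicalConnection in the trace
        return ds;
    }
}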
ImpalaRuntimeException: Error making 'updateTableColumnStatistics' RPC to Hive Metastore: CAUSED BY: MetaException: Could not create "increment"/"table" value-generation container SEQUENCE_TABLE since autoCreate flags do not allow it.
ImpalaRuntimeException: Error making 'dropTable' RPC to Hive Metastore: \nCAUSED BY: MetaException: One or more instances could not be deleted
Failed to close HDFS file: hdfs://ppdhdpha/user/hive/warehouse/xkyy.db/cdjk_apibg_willa/_impala_insert_staging/.0.parq\nError(255): Unknown error 255\nRoot cause: RemoteException: The DiskSpace quota of /user/hive/warehouse/xkyy.db is exceeded: quota = 43980465111040 B = 40 TB but diskspace consumed = 43981119305382 B = 40.00 TB
File 'hdfs://ppdcdpha/user/hive/warehouse/usrt.db/willa_liubiao_daily' has an invalid Parquet version number: 336\n\n. Please check that it is a valid Parquet file. This error can also occur due to stale metadata. If you believe this is a valid Parquet file, try running refresh usrt.willa_liubiao_daily
Table not found
There are two kinds of error message:
- Table does not exist:
- Could not resolve table reference
What exactly distinguishes the two is still unclear to me.
Decompressor: invalid compressed length. Data is likely corrupt.
The failing query depends on the table rhino_man_checked_hitrule, whose underlying storage file is corrupt; the table has to be regenerated.
Expected to read a compressed text file of size 5553487 bytes. But only read 5401212 bytes. This may indicate data file corruption. (file: hdfs://hdpha/user/hive/warehouse/rpt.db/rhino_man_checked_hitrule/000000_0.snappy).
Same cause as above: the storage file backing rhino_man_checked_hitrule is corrupt, and the table needs to be regenerated.
References
Sender timed out waiting for receiver fragment instance: , dest node: 66
Invalid query handle (not particularly helpful here, in my view)