Cloudera Impala First Use: A Collection of Problems (Part 1)

After compiling the Impala source code following the documentation at https://github.com/cloudera/impala, I ran the scripts below and hit a series of problems:

${IMPALA_HOME}/bin/start-impalad.sh -use_statestore=false
${IMPALA_HOME}/bin/impala-shell.sh
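
Before digging into the individual errors, it can help to confirm that impalad actually came up; a minimal check, assuming the usual impalad debug web UI port of 25000 and that curl is available:

# Is the impalad process actually running?
ps aux | grep [i]mpalad

# Does the debug web UI answer? (port 25000 is the common impalad
# webserver default, but treat it as an assumption for your build)
curl -s http://localhost:25000/ | head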

Problem 1:

Although the Hive metastore for the local cluster was already configured, and impala-shell.sh started successfully, show databases did not list the test.db database that had already been created in Hive. The script also generated a derby.log file and a metastore.db directory in the directory it was launched from; these belong to the embedded Derby metastore that Impala falls back on, so the problem was clear: Impala was not picking up the Hive metastore configuration I had set up.
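
A quick way to see the mismatch, assuming the hive CLI on the same machine already points at the real metastore:

# Hive, reading the cluster's hive-site.xml, sees the real metastore
hive -e 'show databases'

# impala-shell, at this point, has only created a fresh embedded Derby
# metastore in its working directory, which is why the database lists differ
ls derby.log metastore.db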

According to the official documentation, the way to tell Impala which HDFS, HBase, and Hive metastore configuration to use is to place the corresponding configuration files under fe/src/test/resources; that directory is put on the classpath by ${IMPALA_HOME}/bin/set-classpath.sh, whose contents are shown below:

#!/bin/sh
# Copyright 2012 Cloudera Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#   http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
 
# This script explicitly sets the CLASSPATH for embedded JVMs (e.g. in
# Impalad or in runquery) Because embedded JVMs do not honour
# CLASSPATH wildcard expansion, we have to add every dependency jar
# explicitly to the CLASSPATH.
 
CLASSPATH=\
$IMPALA_HOME/fe/src/test/resources:\
$IMPALA_HOME/fe/target/classes:\
$IMPALA_HOME/fe/target/dependency:\
$IMPALA_HOME/fe/target/test-classes:\
${HIVE_HOME}/lib/datanucleus-core-2.0.3.jar:\
${HIVE_HOME}/lib/datanucleus-enhancer-2.0.3.jar:\
${HIVE_HOME}/lib/datanucleus-rdbms-2.0.3.jar:\
${HIVE_HOME}/lib/datanucleus-connectionpool-2.0.3.jar:${CLASSPATH}
 
 
for jar in `ls ${IMPALA_HOME}/fe/target/dependency/*.jar`; do
  CLASSPATH=${CLASSPATH}:$jar
done
 
export CLASSPATH

But here a problem appeared: in my successfully built source tree there was no resources directory under fe/src. So I downloaded one from elsewhere, placed it in the corresponding location, and edited the three configuration files core-site.xml, hdfs-site.xml, and hive-site.xml to match the cluster configuration. After running source bin/set-classpath.sh, the first problem was solved!
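
A minimal sketch of that procedure, assuming the cluster configuration files live under /etc/hadoop/conf and /etc/hive/conf (those paths are assumptions; adjust them to your layout):

# Copy the cluster configs into the directory that set-classpath.sh
# puts at the front of the CLASSPATH
mkdir -p ${IMPALA_HOME}/fe/src/test/resources
cp /etc/hadoop/conf/core-site.xml \
   /etc/hadoop/conf/hdfs-site.xml \
   /etc/hive/conf/hive-site.xml \
   ${IMPALA_HOME}/fe/src/test/resources/

# Rebuild the CLASSPATH and check that the resources directory is on it
source ${IMPALA_HOME}/bin/set-classpath.sh
echo "${CLASSPATH}" | tr ':' '\n' | grep 'test/resources'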

Problem 2:

Once that problem was solved, I ran the same two scripts as before and executed the following commands:

Welcome to the Impala shell. Press TAB twice to see a list of available commands.
 
Copyright (c) 2012 Cloudera, Inc. All rights reserved.
 
(Build version: build version not available)
[Not connected] > connect hadoop-01
[hadoop-01:21000] > show databases;
default
test_impala
[hadoop-01:21000] > use test_impala;
 
[hadoop-01:21000] > show tables;
tab1
tab2
tab3
[hadoop-01:21000] > select * from tab3;
 
[hadoop-01:21000] > select * from tab1;
ERROR: Failed to open HDFS file hdfs://hadoop-01.localdomain:8030/user/impala/warehouse/test_impala.db/tab1/tab1.csv
Error(255): Unknown error 255
ERROR: Invalid query handle
[hadoop-01:21000] > select * from tab1;
ERROR: Failed to open HDFS file hdfs://hadoop-01.localdomain:8030/user/impala/warehouse/test_impala.db/tab1/tab1.csv
Error(255): Unknown error 255
ERROR: Invalid query handle
[hadoop-01:21000] > quit

The impalad log in the background showed the following:

13/01/18 11:50:46 INFO service.Frontend: createExecRequest for query select * from tab1
13/01/18 11:50:46 INFO service.JniFrontend: Plan Fragment 0
  UNPARTITIONED
  EXCHANGE (1)
    TUPLE IDS: 0
 
Plan Fragment 1
  RANDOM
  STREAM DATA SINK
    EXCHANGE ID: 1
    UNPARTITIONED
 
  SCAN HDFS table=test_impala.tab1 (0)
    TUPLE IDS: 0
 
13/01/18 11:50:46 INFO service.JniFrontend: returned TQueryExecRequest2: TExecRequest(stmt_type:QUERY, sql_stmt:select * from tab1, request_id:TUniqueId(hi:-6897121767931491435, lo:-4792011001236606993), query_options:TQueryOptions(abort_on_error:false, max_errors:0, disable_codegen:false, batch_size:0, return_as_ascii:true, num_nodes:0, max_scan_range_length:0, num_scanner_threads:0, max_io_buffers:0, allow_unsupported_formats:false, partition_agg:false), query_exec_request:TQueryExecRequest(desc_tbl:TDescriptorTable(slotDescriptors:[TSlotDescriptor(id:0, parent:0, slotType:INT, columnPos:0, byteOffset:4, nullIndicatorByte:0, nullIndicatorBit:1, slotIdx:1, isMaterialized:true), TSlotDescriptor(id:1, parent:0, slotType:BOOLEAN, columnPos:1, byteOffset:1, nullIndicatorByte:0, nullIndicatorBit:0, slotIdx:0, isMaterialized:true), TSlotDescriptor(id:2, parent:0, slotType:DOUBLE, columnPos:2, byteOffset:8, nullIndicatorByte:0, nullIndicatorBit:2, slotIdx:2, isMaterialized:true), TSlotDescriptor(id:3, parent:0, slotType:TIMESTAMP, columnPos:3, byteOffset:16, nullIndicatorByte:0, nullIndicatorBit:3, slotIdx:3, isMaterialized:true)], tupleDescriptors:[TTupleDescriptor(id:0, byteSize:32, numNullBytes:1, tableId:1)], tableDescriptors:[TTableDescriptor(id:1, tableType:HDFS_TABLE, numCols:4, numClusteringCols:0, hdfsTable:THdfsTable(hdfsBaseDir:hdfs://hadoop-01.localdomain:8030/user/impala/warehouse/test_impala.db/tab1, partitionKeyNames:[], nullPartitionKeyValue:__HIVE_DEFAULT_PARTITION__, partitions:{-1=THdfsPartition(lineDelim:10, fieldDelim:44, collectionDelim:44, mapKeyDelim:44, escapeChar:0, fileFormat:TEXT, partitionKeyExprs:[], blockSize:0, compression:NONE), 1=THdfsPartition(lineDelim:10, fieldDelim:44, collectionDelim:44, mapKeyDelim:44, escapeChar:0, fileFormat:TEXT, partitionKeyExprs:[], blockSize:0, compression:NONE)}), tableName:tab1, dbName:test_impala)]), fragments:[TPlanFragment(plan:TPlan(nodes:[TPlanNode(node_id:1, node_type:EXCHANGE_NODE, num_children:0, limit:-1, row_tuples:[0], nullable_tuples:[false], compact_data:false)]), output_exprs:[TExpr(nodes:[TExprNode(node_type:SLOT_REF, type:INT, num_children:0, slot_ref:TSlotRef(slot_id:0))]), TExpr(nodes:[TExprNode(node_type:SLOT_REF, type:BOOLEAN, num_children:0, slot_ref:TSlotRef(slot_id:1))]), TExpr(nodes:[TExprNode(node_type:SLOT_REF, type:DOUBLE, num_children:0, slot_ref:TSlotRef(slot_id:2))]), TExpr(nodes:[TExprNode(node_type:SLOT_REF, type:TIMESTAMP, num_children:0, slot_ref:TSlotRef(slot_id:3))])], partition:TDataPartition(type:UNPARTITIONED, partitioning_exprs:[])), TPlanFragment(plan:TPlan(nodes:[TPlanNode(node_id:0, node_type:HDFS_SCAN_NODE, num_children:0, limit:-1, row_tuples:[0], nullable_tuples:[false], compact_data:false, hdfs_scan_node:THdfsScanNode(tuple_id:0))]), output_sink:TDataSink(type:DATA_STREAM_SINK, stream_sink:TDataStreamSink(dest_node_id:1, output_partition:TDataPartition(type:UNPARTITIONED, partitioning_exprs:[]))), partition:TDataPartition(type:RANDOM, partitioning_exprs:[]))], dest_fragment_idx:[0], per_node_scan_ranges:{0=[TScanRangeLocations(scan_range:TScanRange(hdfs_file_split:THdfsFileSplit(path:hdfs://hadoop-01.localdomain:8030/user/impala/warehouse/test_impala.db/tab1/tab1.csv, offset:0, length:192, partition_id:1)), locations:[TScanRangeLocation(server:THostPort(hostname:192.168.1.2, ipaddress:192.168.1.2, port:50010), volume_id:0)])]}, query_globals:TQueryGlobals(now_string:2013-01-18 11:50:46.000000862)), result_set_metadata:TResultSetMetadata(columnDescs:[TColumnDesc(columnName:id, 
columnType:INT), TColumnDesc(columnName:col_1, columnType:BOOLEAN), TColumnDesc(columnName:col_2, columnType:DOUBLE), TColumnDesc(columnName:col_3, columnType:TIMESTAMP)]))
hdfsOpenFile(hdfs://hadoop-01.localdomain:8030/user/impala/warehouse/test_impala.db/tab1/tab1.csv): FileSystem#open((Lorg/apache/hadoop/fs/Path;I)Lorg/apache/hadoop/fs/FSDataInputStream;) error:
java.lang.IllegalArgumentException: Wrong FS: hdfs://hadoop-01.localdomain:8030/user/impala/warehouse/test_impala.db/tab1/tab1.csv, expected: hdfs://localhost:20500
    at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:547)
    at org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:169)
    at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:245)
    at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSyst

The issue is the Wrong FS error: expected: hdfs://localhost:20500. The core-site.xml in my resources directory clearly specifies the namenode address and port as 8030, so why localhost:20500? After looking through the Impala source I found that be/src/runtime/hdfs-fs-cache.cc defines default values for nn and nn_port:

DEFINE_string(nn, "localhost", "hostname or ip address of HDFS namenode");
DEFINE_int32(nn_port, 20500, "namenode port");
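
Before restarting, it can be worth double-checking what the cluster itself advertises as its default filesystem; a small check, assuming a Hadoop 2.x client is on the PATH (the getconf key name is an assumption about your Hadoop version):

# What the cluster advertises as the default filesystem
hdfs getconf -confKey fs.defaultFS

# Or read it straight out of the core-site.xml copied into resources earlier
grep -A1 'fs.default' ${IMPALA_HOME}/fe/src/test/resources/core-site.xml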

So when starting the impalad service, nn and nn_port must also be set to the address and port configured for the cluster, as shown below:

${IMPALA_HOME}/bin/start-impalad.sh -use_statestore=false -nn=hadoop-01.localdomain -nn_port=8030

With that, the second problem, the expected: hdfs://localhost:20500 error, was also solved, and every query afterwards ran without issue!
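
Putting both fixes together, a full restart sequence might look like the sketch below; the hostname, port, and flags come from this post, everything else is an assumption about your environment:

# Pick up the cluster configs placed in fe/src/test/resources
source ${IMPALA_HOME}/bin/set-classpath.sh

# Start impalad against the real namenode instead of the built-in defaults
${IMPALA_HOME}/bin/start-impalad.sh -use_statestore=false \
    -nn=hadoop-01.localdomain -nn_port=8030

# Reconnect and re-run the query that failed before
${IMPALA_HOME}/bin/impala-shell.sh
# [Not connected] > connect hadoop-01
# [hadoop-01:21000] > select * from tab1;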

posted on 2013-01-18 13:03 by Loogn_qiang