Using Spark SQL to Work with Hive Data
Prerequisites:
1. Start HDFS; Hive's data is stored in HDFS.
2. Start the metastore service with hive --service metastore, so the metadata lives on a remote server and can be accessed remotely.
3. Add a hive-site.xml configuration file under Spark's conf directory with the following content:
<configuration>
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://node1:9083</value>
  </property>
</configuration>
Write a Scala test program:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object Hive {
  def main(args: Array[String]) {
    val conf = new SparkConf()
      .setAppName("HiveDataSource")
      .setMaster("spark://node1:7077")
    val sc = new SparkContext(conf)
    val hiveContext = new HiveContext(sc)
    hiveContext.sql("SHOW TABLES").show()
    sc.stop()
  }
}
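Once SHOW TABLES works, the same HiveContext can run full queries and return ordinary DataFrames. A minimal sketch of the next step, assuming a Hive table named src exists (the table name, query, and object name here are placeholders, not part of the original setup):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object HiveQuery {
  def main(args: Array[String]) {
    val conf = new SparkConf()
      .setAppName("HiveQuery")
      .setMaster("spark://node1:7077")
    val sc = new SparkContext(conf)
    val hiveContext = new HiveContext(sc)

    // Query a hypothetical table; replace "src" with one of your own tables.
    val df = hiveContext.sql("SELECT * FROM src LIMIT 10")
    df.show()

    // The result is a regular DataFrame, so it can be filtered, joined,
    // or written back to Hive, e.g.:
    // df.write.saveAsTable("src_copy")

    sc.stop()
  }
}
```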
Package the program, copy the jar to the Spark host, and run it with the spark-submit command:
./bin/spark-submit --class com.spark.test.Hive --master spark://node1:7077 ./jar/Test.jar
For details on the spark-submit command, see the official documentation:
http://spark.apache.org/docs/1.6.0/submitting-applications.html
Notes:
1. With --deploy-mode cluster, make sure the jar file is in HDFS or in a location that exists on every node.
In cluster mode the driver program is shipped to a worker node and runs there. In client mode the driver runs only on the node where the job was submitted, and the output appears only in that terminal.
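For example, a cluster-mode submission of the program above might look like the following; the HDFS upload path is an assumption for illustration, not from the original setup:

```shell
# Cluster mode: the driver runs on a worker node, so the jar must be
# readable from every node -- e.g. uploaded to HDFS first (paths illustrative).
hdfs dfs -put ./jar/Test.jar /user/hadoop/jars/
./bin/spark-submit \
  --class com.spark.test.Hive \
  --master spark://node1:7077 \
  --deploy-mode cluster \
  hdfs://node1/user/hadoop/jars/Test.jar
```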
--------------------------------------------------------------------------------------------------------------
After reconfiguring the application for the CDH distribution, calling Hive from Spark failed with missing jars and a missing configuration file.
Error messages:
WARN [Driver] metastore.HiveMetaStore: Retrying creating default database after error: Class org.datanucleus.api.jdo.JDOPersistenceManagerFactory was not found.
javax.jdo.JDOFatalUserException: Class org.datanucleus.api.jdo.JDOPersistenceManagerFactory was not found.
ERROR [Driver] yarn.ApplicationMaster: User class threw exception: java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
This can be fixed by passing extra arguments to spark-submit:
/home/hadoop/app/spark-1.6.0-cdh5.10.0/bin/spark-submit \
--class HiveSql \
--master yarn-cluster \
--executor-memory 512m \
--num-executors 2 \
--files /home/hadoop/app/spark-1.6.0-cdh5.10.0/conf/hive-site.xml \
--jars /home/hadoop/lib/datanucleus-rdbms-3.2.9.jar,/home/hadoop/lib/datanucleus-core-3.2.10.jar,/home/hadoop/lib/datanucleus-api-jdo-3.2.6.jar \
spark-vmware-sql.jar
Include the following three jars, which can be found under the lib directories of Spark and Hive:
datanucleus-core-3.2.10.jar
datanucleus-api-jdo-3.2.6.jar
datanucleus-rdbms-3.2.9.jar
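The --jars option takes a single comma-separated list, which is easy to get wrong by hand. A small sketch of building that list from a directory; it uses a scratch directory with empty placeholder files so it runs anywhere, but in practice LIB_DIR would be /home/hadoop/lib or similar:

```shell
# Scratch directory with placeholder jar names, for illustration only.
LIB_DIR=$(mktemp -d)
touch "$LIB_DIR/datanucleus-core-3.2.10.jar" \
      "$LIB_DIR/datanucleus-api-jdo-3.2.6.jar" \
      "$LIB_DIR/datanucleus-rdbms-3.2.9.jar"

# Join the jar paths with commas, the format spark-submit's --jars expects.
JARS=$(ls "$LIB_DIR"/datanucleus-*.jar | paste -sd, -)
echo "--jars $JARS"
```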