Spark/Hive data inconsistency: Spark defaults to a local metastore, so Spark cannot insert into Hive tables
Scenario: Spark and Hive are deployed in a client/server split. Whether or not the Hive service is running on the server, spark-sql, spark-submit, and spark-shell launched from the client all operate on a local data source. This bothered me for a week before I finally found a solution.
Reproducing the problem (submitting via spark-submit):
from pyspark import SparkConf, SparkContext
from pyspark.sql import HiveContext

conf = SparkConf().setAppName("My app")
sc = SparkContext(conf=conf)
hive_context = HiveContext(sc)
hive_context.sql(''' show tables ''').show()
+--------+---------+-----------+
|database|tableName|isTemporary|
+--------+---------+-----------+
| default| camera| false|
| default| src| false|
+--------+---------+-----------+
hive_context.sql(''' select * from camera ''').show()
+---+-------+---------+
| id|test_id|camera_id|
+---+-------+---------+
+---+-------+---------+
hive_context.sql(''' insert into table camera values(1,"3","145") ''').show()
++
||
++
++
hive_context.sql(''' select * from camera ''').show()
+---+-------+---------+
| id|test_id|camera_id|
+---+-------+---------+
+---+-------+---------+
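A quick way to confirm this fallback (a hypothetical diagnostic, not part of the original steps): when Spark cannot find hive-site.xml it spins up an embedded Derby metastore, which leaves a `metastore_db` directory and a `derby.log` file in the directory where the application was launched.

```python
import os

def using_embedded_derby(workdir="."):
    """Heuristic: an embedded Derby metastore leaves these artifacts
    in the directory where the Spark application was started."""
    artifacts = ("metastore_db", "derby.log")
    return any(os.path.exists(os.path.join(workdir, a)) for a in artifacts)

if using_embedded_derby():
    print("Spark is likely using a local Derby metastore, not the Hive service")
```

If this prints the warning, the session never talked to the remote metastore, which matches the empty query results above.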
Running spark-sql shows the same thing: the Hive tables are not visible.
Solution:
Add the configuration explicitly in the PySpark program. My suspicion is that Spark does not read the hive-site.xml file under conf/ at startup.
spark = SparkSession \
.builder \
.appName("Python Spark SQL Hive integration example") \
.config("spark.sql.warehouse.dir", "/usr/hive/warehouse") \
.config("hive.metastore.uris","thrift://slave1:9083") \
.config("fs.defaultFS","hdfs://master:9000") \
.enableHiveSupport() \
.getOrCreate()
# spark is an existing SparkSession
spark.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING) USING hive")
spark.sql(''' insert into table src values(1,"145") ''').show()
# Queries are expressed in HiveQL
spark.sql("SELECT * FROM src").show()
+---+-----+
|key|value|
+---+-----+
| 1| 145|
+---+-----+
Finally the data shows up. Note that on the server you must first start the metastore service in the background: nohup hive --service metastore &
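Before launching the Spark job, it can also help to verify that the metastore's Thrift port is actually reachable from the client (a minimal sketch; the host and port follow the config above):

```python
import socket

def metastore_reachable(host, port=9083, timeout=3):
    """Return True if a TCP connection to the Hive metastore port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# e.g. metastore_reachable("slave1")  # should be True once the service is running
```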
Finally, here are the hive-site.xml files for the server and the client.
Server slave1 (/usr/hive/conf/) hive-site.xml:
<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://master:3306/metastore?createDatabaseIfNotExist=true</value>
    <description>the URL of the MySQL database</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>root</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>123456</value>
  </property>
  <property>
    <name>datanucleus.autoCreateSchema</name>
    <value>false</value>
  </property>
  <property>
    <name>datanucleus.fixedDatastore</name>
    <value>true</value>
  </property>
  <property>
    <name>datanucleus.autoStartMechanism</name>
    <value>SchemaTable</value>
  </property>
  <property>
    <name>hive.metastore.schema.verification</name>
    <value>true</value>
  </property>
  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>/usr/hive/warehouse</value>
    <description>location of default database for the warehouse</description>
  </property>
</configuration>
Client master (/usr/spark/conf/ and /usr/hive/conf/) hive-site.xml:
<configuration>
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://slave1:9083</value>
    <description></description>
  </property>
  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>/usr/hive/warehouse</value>
    <description>location of default database for the warehouse</description>
  </property>
  <property>
    <name>hive.exec.scratchdir</name>
    <value>/usr/hive/tmp</value>
  </property>
  <property>
    <name>hive.querylog.location</name>
    <value>/usr/hive/log</value>
  </property>
</configuration>
The configuration files themselves are not the key point, and parts of them may even be wrong. What matters is that the three .config(...) lines above must be included when writing the PySpark program.
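A quick way to sanity-check that those three settings actually took effect at runtime (a sketch; it works with anything that exposes a dict-style .get, such as spark.conf — with Hive support enabled, spark.sql.catalogImplementation should read "hive"):

```python
def check_hive_config(conf):
    """conf: anything with .get(key, default), e.g. spark.conf or a plain dict.
    Returns the settings that control which metastore Spark talks to."""
    return {
        "metastore": conf.get("hive.metastore.uris", "not set"),
        "warehouse": conf.get("spark.sql.warehouse.dir", "not set"),
        "catalog": conf.get("spark.sql.catalogImplementation", "not set"),
    }

# Usage in the session above:
#   check_hive_config(spark.conf)
# If "metastore" comes back "not set", Spark will fall back to a local metastore.
```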
Done!