Creating a DataFrame from Hive with HQL Statements -- the Common Approach

  1. By default, SparkSession can neither read data from Hive nor execute HQL syntax.
    To read data from Hive, Hive support must be enabled by calling
    enableHiveSupport() when building the SparkSession:
    val sparkConf = new SparkConf().setMaster("local[2]").setAppName("demo")
    val session = SparkSession.builder().config(sparkConf).enableHiveSupport().getOrCreate()
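    A quick way to confirm Hive support took effect (a sketch, not part of the
    original post): with enableHiveSupport(), the catalog implementation reported
    by the session switches from "in-memory" to "hive".
    // should print "hive" once enableHiveSupport() is in effect
    println(session.conf.get("spark.sql.catalogImplementation"))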

  2. Add the dependencies Spark needs to connect to and operate on Hive, plus the MySQL driver:
<dependency>
    <groupId>mysql</groupId>
    <artifactId>mysql-connector-java</artifactId>
    <version>8.0.18</version>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-hive_2.11</artifactId>
    <version>2.3.1</version>
</dependency>
  3. Spark SQL operates on Hive by connecting to Hive's metastore, so hive-site.xml, the configuration file that points Hive at its metastore database, must be placed under the project's resources directory:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
	<property>
	  <name>javax.jdo.option.ConnectionURL</name>
	  <value>jdbc:mysql://node1:3306/hive_metastore?serverTimezone=UTC&amp;createDatabaseIfNotExist=true</value>
	  <description>JDBC connect string for a JDBC metastore</description>
	</property>

	<property>
	  <name>javax.jdo.option.ConnectionDriverName</name>
	  <value>com.mysql.cj.jdbc.Driver</value>
	  <description>Driver class name for a JDBC metastore</description>
	</property>

	<property>
	  <name>javax.jdo.option.ConnectionUserName</name>
	  <value>root</value>
	  <description>username to use against metastore database</description>
	</property>

	<property>
	  <name>javax.jdo.option.ConnectionPassword</name>
	  <value>Jsq123456...</value>
	  <description>password to use against metastore database</description>
	</property>
	<property> 
	  <name>hive.metastore.warehouse.dir</name> 
	  <value>/user/hive/warehouse</value> 
	  <description>location of default database for the warehouse</description> 
 	</property>
	<property> 
	  <name>hive.cli.print.header</name> 
	  <value>true</value> 
 	</property> 
	<property> 
	  <name>hive.cli.print.current.db</name> 
	  <value>true</value> 
	</property>
	<property> 
  	  <name>hive.server2.authentication</name> 
 	  <value>NONE</value> 
	</property> 
	<property> 
	  <name>hive.server2.thrift.bind.host</name> 
	  <value>node1</value> 
 	</property> 
	<property> 
	  <name>hive.server2.thrift.port</name> 
	  <value>10000</value> 
	  <description>TCP port number to listen on, default 10000</description> 
	</property> 
	<property> 
	  <name>hive.server2.thrift.http.port</name> 
	  <value>10001</value> 
	</property> 
	<property> 
	  <name>hive.server2.thrift.client.user</name> 
	  <value>root</value> 
	  <description>Username to use against thrift client</description> 
	</property> 
	<property> 
	  <name>hive.server2.thrift.client.password</name> 
	  <value>root</value> 
	  <description>Password to use against thrift client</description> 
	</property>
</configuration>
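With hive-site.xml in place, a quick way to verify the metastore connection (a minimal sketch, not from the original post) is to list the databases through the session's Catalog API:

// lists the databases registered in the Hive metastore; "default" should
// appear if the metastore connection succeeded
session.catalog.listDatabases().show(truncate = false)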
  4. Write the code:
import org.apache.spark.SparkConf
import org.apache.spark.sql.{DataFrame, SparkSession}

object HiveCreateDataFrame {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setMaster("local[2]").setAppName("demo")
    sparkConf.set("spark.sql.warehouse.dir", "hdfs://node1:9000/user/hive/warehouse")

    /*
    Hive support must be enabled in order to connect to and operate on Hive.

    Once Spark SQL is connected to Hive, the sql() function accepts HQL
    statements -- not only queries, but also DDL such as CREATE statements.

    Note, however, that databases created from Spark SQL are by default
    stored under a local path next to the code, not on HDFS.
     */
    val session = SparkSession.builder().config(sparkConf).enableHiveSupport().getOrCreate()

    // sql() returns a DataFrame for any statement, not just SELECT;
    // querying a Hive table (e.g. project.area_pvs) works the same way
    val dataFrame0: DataFrame = session.sql("create database if not exists aa")
    val dataFrame1: DataFrame = session.sql("show databases")
    dataFrame0.show()
    dataFrame1.show()

    session.stop()
  }
}
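As the comment above hints, a SELECT builds a DataFrame from a Hive table the same way. A minimal sketch, continuing inside main() before the session is stopped (the table name project.area_pvs is taken from the comment and assumed to exist):

// assumes a Hive table project.area_pvs exists (name taken from the
// comment above); both forms yield an equivalent DataFrame
val byTable: DataFrame = session.table("project.area_pvs")
val bySql: DataFrame = session.sql("select * from project.area_pvs")
bySql.show()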
  5. Notes
  • Databases created from Spark SQL are, by default, not stored on HDFS but under a local path next to the code.
    -- To store them on HDFS, either set sparkConf.set("spark.sql.warehouse.dir", "hdfs://node1:9000/user/hive/warehouse") in code, or set hive.metastore.warehouse.dir to hdfs://node1:9000/user/hive/warehouse in hive-site.xml; see the sketch below.
    -- For local storage, nothing needs to be set.
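A minimal sketch of the in-code option mentioned above (same imports as the example; the address hdfs://node1:9000 follows this cluster's setup and should be adjusted to yours):

// set the warehouse location before the first SparkSession is created;
// otherwise the default local spark-warehouse path is used
val session = SparkSession.builder()
  .master("local[2]")
  .appName("demo")
  .config("spark.sql.warehouse.dir", "hdfs://node1:9000/user/hive/warehouse")
  .enableHiveSupport()
  .getOrCreate()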