SparkSQL+Hive

I'm using Spark 3.0.1 and Hive 3.1.2 here.

First, start the Hive metastore (on node2):

nohup /usr/local/hive/bin/hive --service metastore &

jps

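Besides jps, you can check that the metastore is actually listening; a quick sketch, assuming the default metastore port 9083 (the same port the code later in this post points at with thrift://master:9083):

ss -lnt | grep 9083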

Copy Hive's configuration file hive-site.xml into spark/conf, and put the MySQL driver jar into spark/jars (on node1). You can also distribute the config file and the jar to other machines and use SparkSQL to operate on Hive from there.
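A minimal sketch of this copy step, assuming Hive is installed under /usr/local/hive and Spark under /usr/local/spark as in the commands elsewhere in this post; the exact MySQL connector jar name and location depend on your installation:

cp /usr/local/hive/conf/hive-site.xml /usr/local/spark/conf/
cp /usr/local/hive/lib/mysql-connector-java-*.jar /usr/local/spark/jars/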

My hive-site.xml:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <property>
        <name>javax.jdo.option.ConnectionURL</name>
         <value>jdbc:mysql://master:3306/hive?createDatabaseIfNotExist=true&amp;useUnicode=true&amp;characterEncoding=UTF-8&amp;useSSL=false</value>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionDriverName</name>
        <value>com.mysql.jdbc.Driver</value>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionUserName</name>
        <value>root</value>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionPassword</name>
        <value>123456</value>
    </property>
    <property>
        <name>datanucleus.readOnlyDatastore</name>
        <value>false</value>
    </property>
    <property>
        <name>datanucleus.fixedDatastore</name>
        <value>false</value>
    </property>
    <property>
        <name>datanucleus.autoCreateSchema</name>
        <value>true</value>
    </property>
    <property>
        <name>datanucleus.schema.autoCreateAll</name>
        <value>true</value>
    </property>
    <property>
        <name>datanucleus.autoCreateTables</name>
        <value>true</value>
    </property>
    <property>
        <name>datanucleus.autoCreateColumns</name>
        <value>true</value>
    </property>
    <property>
        <name>hive.metastore.local</name>
        <value>true</value>
    </property>
    <!-- Show column headers in query output -->
    <property>
        <name>hive.cli.print.header</name>
        <value>true</value>
    </property>
    <!-- Show the current database name in the CLI prompt -->
    <property>
        <name>hive.cli.print.current.db</name>
        <value>true</value>
    </property>
    <property>
        <name>hive.server2.thrift.port</name>
        <value>10000</value>
    </property>
    <property>
        <name>hive.server2.thrift.bind.host</name>
        <value>192.168.58.128</value>
    </property>
    <property>
        <name>hive.server2.session.check.interval</name>
        <value>0</value>
    </property>
</configuration>


Start the HDFS service (Hadoop's start-all.sh, which brings up YARN as well):

start-all.sh

Launch the spark-sql command line under spark/bin:

cd /usr/local/spark/bin
spark-sql
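spark-sql accepts the usual spark-submit style options, so for a small test you could, for example, also cap the shuffle partition count from the command line (mirroring the setting used in the code later); the values here are just illustrative:

./spark-sql --master local[*] --conf spark.sql.shuffle.partitions=4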

Execute SQL statements (on node1):

show databases;
show tables;
CREATE TABLE person (id int, name string, age int) row format delimited fields terminated by ' ';
LOAD DATA LOCAL INPATH 'file:///home/hadoop/文档/person.txt' INTO TABLE person;
show tables;
select * from person;


Contents of person.txt:

1 zhangsan 20
2 lisi 29
3 wangwu 25
4 zhaoliu 30
5 tianqi 35
6 kobe 40


Integrating Hive in code

Add the dependency:

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-hive_2.12</artifactId>
    <version>${spark.version}</version>
</dependency>

Note: the Hive metastore must be started first:

nohup /usr/local/hive/bin/hive --service metastore &

Write the code:

package cn.itcast.sql

import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession

/**
 * Author itcast
 * Desc Demonstrates SparkSQL operating on Hive tables
 */
object Demo09_Hive {
  def main(args: Array[String]): Unit = {
    //TODO 0. Prepare the environment -- extra configuration is needed, plus enabling Hive SQL support
    val spark: SparkSession = SparkSession.builder().appName("sparksql").master("local[*]")
      .config("spark.sql.shuffle.partitions", "4")//keep the shuffle partition count small for this test; tune it to your cluster size in production (default is 200)
      .config("spark.sql.warehouse.dir", "hdfs://master:9000/user/hive/warehouse")//location of the Hive warehouse on HDFS
      .config("hive.metastore.uris", "thrift://master:9083")
      .enableHiveSupport()//enable support for Hive SQL syntax
      .getOrCreate()
    val sc: SparkContext = spark.sparkContext
    sc.setLogLevel("WARN")
    import spark.implicits._

    //TODO 1. Operate on Hive
    spark.sql("show databases").show(false)
    spark.sql("show tables").show(false)
    spark.sql("CREATE TABLE person4 (id int, name string, age int) row format delimited fields terminated by ' '")
    spark.sql("LOAD DATA LOCAL INPATH 'file:///D:/person.txt' INTO TABLE person4")
    spark.sql("show tables").show(false)
    spark.sql("select * from person4").show(false)

    spark.stop()
  }
}


Spark as a distributed SQL engine


Likewise, SparkSQL supports the following access methods:

1. The SparkSQL command line (spark-sql)

2. Start SparkSQL's ThriftServer on one machine, then access it from other machines with beeline

Connecting with the beeline client

Keep Hive's metastore service running:

nohup /usr/local/hive/bin/hive --service metastore &

Start SparkSQL's ThriftServer (analogous to HiveServer2):

cd /usr/local/spark/sbin/
start-thriftserver.sh
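start-thriftserver.sh accepts the same options as spark-submit plus --hiveconf overrides; a sketch, with the port and bind host simply mirroring the hive-site.xml above (adjust the --master URL to your own cluster, or omit it):

./start-thriftserver.sh \
  --hiveconf hive.server2.thrift.port=10000 \
  --hiveconf hive.server2.thrift.bind.host=192.168.58.128 \
  --master local[*]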

On another machine (here I do everything on master), use beeline to connect to SparkSQL's ThriftServer.

Open a new terminal:

cd /usr/local/spark/bin
beeline
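Inside beeline you still need to connect to the ThriftServer; the JDBC URL below matches the one used in the JDBC code later, and the username/password are whatever your setup expects:

!connect jdbc:hive2://master:10000

Alternatively, pass the URL when launching beeline:

beeline -u jdbc:hive2://master:10000 -n root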


Execute SQL:

show databases;
show tables;
select * from person;


View the Spark jobs in the web UI at http://master:4040/jobs/


Accessing via code

Add the dependency:

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-hive-thriftserver_2.12</artifactId>
    <version>${spark.version}</version>
</dependency>

Use JDBC code to access the SparkSQL ThriftServer:

package cn.itcast.sql

import java.sql.{Connection, DriverManager, PreparedStatement, ResultSet}

/**
 * Author itcast
 * Desc Demonstrates accessing SparkSQL's ThriftServer via JDBC
 */
object Demo10_ThriftServer{
  def main(args: Array[String]): Unit = {
    //0. Load the driver
    Class.forName("org.apache.hive.jdbc.HiveDriver")

    //1. Get a connection
    val conn: Connection = DriverManager.getConnection(
      "jdbc:hive2://master:10000/default", //looks like HiveServer2, but behind it is Spark's ThriftServer
      "root",
      "4135"
    )

    //2. Write the SQL
    val sql = """select id,name,age from person"""

    //3. Create a prepared statement
    val ps: PreparedStatement = conn.prepareStatement(sql)

    //4. Execute the query
    val rs: ResultSet = ps.executeQuery()

    //5. Process the results
    while (rs.next()){
      val id: Int = rs.getInt("id")
      val name: String = rs.getString("name")
      val age: Int = rs.getInt("age")
      println(s"id=${id},name=${name},age=${age}")
    }

    //6. Close resources
    if(rs != null) rs.close()
    if(ps != null) ps.close()
    if(conn != null) conn.close()
  }
}

Shutting down:

Run on the virtual machine:

cd /usr/local/spark/sbin
stop-thriftserver.sh
cd ~
stop-all.sh

