SparkStreaming

1. Course Objectives

  • Understand the principles and architecture of SparkStreaming
  • Master the common DStream operations
  • Master integrating SparkStreaming with Flume
  • Master integrating SparkStreaming with Kafka

2. SparkStreaming Concepts

  • SparkStreaming is used to develop real-time (streaming) data processing programs.
    • SparkStreaming can ingest data from different real-time sources such as Kafka, Flume, and sockets. Once started, a SparkStreaming application does not stop on its own; it runs until it is stopped manually or brought down by a machine failure or a data error.
  • SparkStreaming features:
    • Ease of use: supports multiple development languages.
    • Fault tolerance: Spark Streaming recovers both lost work and operator state (such as sliding windows) out of the box, without any extra code.
    • Easy integration: SparkStreaming integrates with GraphX and Spark's machine learning libraries, and can also be combined with Kafka, Flume, file systems, SparkCore, and SparkSQL.

3. SparkStreaming Principles

  • SparkStreaming is a near-real-time (micro-batch) processing framework.
    	"Micro" means the batch interval is very small.
    	"Batch" means the data in each interval is processed like an offline batch job.
    	
    1. SparkStreaming shrinks the processing interval down to second level or even millisecond level.
    2. SparkStreaming turns the data received in each interval into an RDD, applies transformations to that RDD, and finally submits a job.
    SparkStreaming submits these jobs automatically, once per batch interval.
    
  • SparkStreaming latency
    • SparkStreaming can currently reach a latency of roughly 100 milliseconds.
      	For projects with very strict latency requirements, such as securities trading or news publishing, consider Storm, Flink, or Spark's Structured Streaming during technology selection.
        1. Splitting a job into tasks takes time (e.g. 50 tasks per second, about 20 ms per task).
        2. Distributing the tasks to different worker nodes takes time (network transfer + serialization/deserialization + execution).
        Together these two steps take at least 100 milliseconds.
      In practice, SparkStreaming is usually run with second-level batch intervals.
      
  • SparkStreaming fault tolerance (a short sketch follows this list)
    • Fault tolerance based on the RDD lineage mechanism
    • Fault tolerance based on the RDD cache mechanism
    • Fault tolerance based on RDD checkpointing
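
    A minimal sketch of where those hooks appear in code; the HDFS path, host, port, and intervals are placeholders rather than values from the course environment:

      import org.apache.spark.SparkConf
      import org.apache.spark.storage.StorageLevel
      import org.apache.spark.streaming.{Seconds, StreamingContext}

      object FaultToleranceSketch {
        def main(args: Array[String]): Unit = {
          val ssc = new StreamingContext(
            new SparkConf().setAppName("FaultToleranceSketch").setMaster("local[2]"), Seconds(5))
          ssc.checkpoint("hdfs://node-01:8020/spark-ck")   // checkpoint-based recovery
          val lines = ssc.socketTextStream("node-01", 9999)
          lines.persist(StorageLevel.MEMORY_ONLY_SER)      // cache-based recovery of the batch RDDs
          lines.checkpoint(Seconds(25))                    // checkpoint this DStream's RDDs periodically
          // lineage-based recovery needs no extra code: each batch RDD records its dependencies
          lines.count().print()
          ssc.start()
          ssc.awaitTermination()
        }
      }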

4. SparkStreaming Programming Entry Point

  • Add the pom dependency

    • <!-- SparkStreaming dependency -->
      <dependency>
          <groupId>org.apache.spark</groupId>
          <artifactId>spark-streaming_2.11</artifactId>
          <version>2.2.0</version>
      </dependency>
      
  • Programming entry points:
    SparkCore:       SparkContext
    SparkSQL:        SparkSession
    SparkStreaming:  StreamingContext
    
    What a StreamingContext does:
    	1. Connects to the data source and creates the DStream
    	2. StreamingContext.start() starts the streaming application
    	3. StreamingContext.awaitTermination() blocks the main thread and watches for exceptions; if one occurs, stop() is called
    	4. StreamingContext.stop() stops the application and releases its resources
    
    Creating a StreamingContext requires a SparkConf and a batch interval, as in the sketch below.
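    
    A minimal lifecycle sketch; host, port, and the timeout are placeholders, and a real job would normally wait indefinitely:
    
      import org.apache.spark.SparkConf
      import org.apache.spark.streaming.{Seconds, StreamingContext}
      
      object StreamingContextLifecycle {
        def main(args: Array[String]): Unit = {
          // a SparkConf plus a batch interval are the two things a StreamingContext needs
          val conf = new SparkConf().setAppName("StreamingContextLifecycle").setMaster("local[2]")
          val ssc = new StreamingContext(conf, Seconds(5))
          val lines = ssc.socketTextStream("localhost", 9999)  // 1. attach a source, get a DStream
          lines.print()                                        // at least one output operator is required
          ssc.start()                                          // 2. start the streaming application
          // 3. block the main thread; here we wait at most 10 minutes instead of forever
          ssc.awaitTerminationOrTimeout(10 * 60 * 1000)
          // 4. stop the application and release the SparkContext, letting in-flight batches finish
          ssc.stop(stopSparkContext = true, stopGracefully = true)
        }
      }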
    

5. SparkStreaming Programming Model: DStream

  • DStream (discretized stream) is the basic programming abstraction of SparkStreaming. A DStream is essentially a sequence of RDDs, one per batch interval.
    DStream transformation operators: map, flatMap, reduceByKey, groupByKey, transform, etc. Every transformation produces a new DStream.
    DStream output operators: print, foreachRDD, saveAsTextFiles, etc. (A short example contrasting the two kinds of operators follows at the end of this section.)
    
    Three key properties of DStreams:
    	1. Every DStream depends on other DStreams
    		def dependencies: List[DStream[_]]
    	2. An RDD is generated for every batch interval
    		var generatedRDDs = new HashMap[Time, RDD[T]]()
    	3. The function passed to a transformation operator is applied to the RDD of each batch interval,
    		through the DStream's compute method (shown here for map):
    		override def compute(validTime: Time): Option[RDD[U]] = {
        parent.getOrCompute(validTime).map(_.map[U](mapFunc))
      }
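    
    A brief hedged sketch of the two kinds of operators (host and port are placeholders): transform is a transformation that exposes each batch's RDD and returns a new DStream, while foreachRDD is an output operator that actually triggers a job on every batch.
    
      import org.apache.spark.SparkConf
      import org.apache.spark.streaming.{Seconds, StreamingContext}
      
      object DStreamOperatorsSketch {
        def main(args: Array[String]): Unit = {
          val ssc = new StreamingContext(
            new SparkConf().setAppName("DStreamOperatorsSketch").setMaster("local[2]"), Seconds(5))
          val lines = ssc.socketTextStream("node-01", 9999)
          // transformation: work on each batch's RDD and produce a new DStream
          val nonEmpty = lines.transform(rdd => rdd.filter(_.nonEmpty))
          // output operator: runs once per batch interval and triggers the job
          nonEmpty.foreachRDD(rdd => rdd.take(10).foreach(println))
          ssc.start()
          ssc.awaitTermination()
        }
      }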
    

6. SparkStreaming Hands-On Programming in IDEA

  • pom dependency
    • <!-- SparkStreaming dependency -->
      <dependency>
          <groupId>org.apache.spark</groupId>
          <artifactId>spark-streaming_2.11</artifactId>
          <version>2.2.0</version>
      </dependency>
      

6.1 Ingesting Socket Data

  • Preparation
    • Install netcat to act as the socket data source: yum -y install nc
    • Start it: nc -lk <port>
  • Write the program
    • package cn.itcast.sparkStreaming
      
      import org.apache.spark.SparkConf
      import org.apache.spark.streaming.dstream.DStream
      import org.apache.spark.streaming.{Seconds, StreamingContext}
      
      object SocketSparkStreaming {
        def main(args: Array[String]): Unit = {
          // create the SparkConf
          val conf = new SparkConf().setAppName("SocketSparkStreaming").setMaster("local[2]")
          // build the StreamingContext with a 5-second batch interval
          val sc: StreamingContext = new StreamingContext(conf, Seconds(5))
          sc.sparkContext.setLogLevel("OFF")
      
          // connect to the data source and create the DStream
          val linesDstream: DStream[String] = sc.socketTextStream("node-01", 9999)
          // input lines are comma-separated; compute a word count
          val wordCountDstream: DStream[(String, Int)] = linesDstream.flatMap(_.split(",")).map((_, 1)).reduceByKey(_ + _)
      
          // print each batch's result
          wordCountDstream.print()
      
          // start the streaming application
          sc.start()
          sc.awaitTermination()
        }
      }
      
      Note: by default SparkStreaming only processes the data of the current batch; it keeps no state from previous batches.
      
  • To keep state across batches (e.g. a running word count), use updateStateByKey
    • package cn.itcast.sparkStreaming
      
      import org.apache.spark.SparkConf
      import org.apache.spark.streaming.dstream.DStream
      import org.apache.spark.streaming.{Seconds, StreamingContext}
      
      object UpdateStateSparkStreaming {
        /**
          * @param newValues    the values for this key in the current batch (one entry per record)
          * @param runningState the count accumulated for this key over all previous batches
          * @return the updated state for this key
          */
        def updateFunc(newValues: Seq[Int], runningState: Option[Int]): Option[Int] = {
          val value = newValues.sum + runningState.getOrElse(0)
          Option(value)
        }
      
      
        def main(args: Array[String]): Unit = {
          val conf = new SparkConf().setAppName("UpdateStateSparkStreaming").setMaster("local[2]")
          // create the StreamingContext
          val context = new StreamingContext(conf, Seconds(5))
          context.sparkContext.setLogLevel("OFF")
          // set the checkpoint path (use an HDFS path in production)
          context.checkpoint("D:/ck")
          // connect to the data source
          val linesDstream: DStream[String] = context.socketTextStream("node-01", 9999)
      
          val wordcountDStream: DStream[(String, Int)] = linesDstream.flatMap(_.split(",")).map((_, 1)).reduceByKey(_ + _)
      
          // maintain the running state and print it to the console
          wordcountDStream.updateStateByKey(updateFunc _).print()
          context.start()
          context.awaitTermination()
        }
      }
      
  • With the code above, the accumulated state is lost if the driver goes down.
    • To keep the state across a restart, recover the StreamingContext from the checkpoint directory:
      • package cn.itcast.sparkStreaming
        
        import cn.itcast.sparkStreaming.UpdateStateSparkStreaming.updateFunc
        import org.apache.spark.SparkConf
        import org.apache.spark.streaming.{Seconds, StreamingContext}
        import org.apache.spark.streaming.dstream.DStream
        
        object SparkStreamingUpdate {
        
          def createFunc(): StreamingContext = {
            val conf = new SparkConf().setAppName("UpdateStateSparkStreaming").setMaster("local[2]")
            // create the StreamingContext
            val context = new StreamingContext(conf, Seconds(10))
        
            context.sparkContext.setLogLevel("OFF")
            // set the checkpoint path (use an HDFS path in production)
            context.checkpoint("D:/ck")
            // connect to the data source
            val linesDstream: DStream[String] = context.socketTextStream("node-01", 9999)
        
            val wordcountDStream: DStream[(String, Int)] = linesDstream.flatMap(_.split(",")).map((_, 1)).reduceByKey(_ + _)
        
            // maintain the running state and print it to the console
            wordcountDStream.updateStateByKey(updateFunc _).print()
            // return the context
            context
          }
          def main(args: Array[String]): Unit = {
            // recover from the checkpoint path; in production this is usually an HDFS path
            val streamingContext=StreamingContext.getOrCreate("D:/ck", createFunc _)
            streamingContext.start()
            streamingContext.awaitTermination()
          }
        }
        
  • Requirement: compute the hot words of the last 20 seconds, once every 10 seconds, with a SparkStreaming batch interval of 5 seconds. This is a window operation in SparkStreaming. (An incremental variant is sketched after the example below.)
    • object SparkStreamingWindow {
        def main(args: Array[String]): Unit = {
          val conf = new SparkConf().setAppName("UpdateStateSparkStreaming").setMaster("local[2]")
          // create the StreamingContext
          val context = new StreamingContext(conf, Seconds(5))
      
          context.sparkContext.setLogLevel("OFF")
      
          // connect to the data source
          val linesDstream: DStream[String] = context.socketTextStream("node-01", 9999)
      
          val wordcountDStream: DStream[(String, Int)] = linesDstream.flatMap(_.split(",")).map((_, 1))
      
            .reduceByKeyAndWindow((x: Int, y: Int) => x + y, Seconds(20), Seconds(10))
            // the window length and the slide interval must both be integer multiples of the batch interval
          wordcountDStream.print()
          context.start()
          context.awaitTermination()
      
        }
      }
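      
      For long windows there is also an incremental form of reduceByKeyAndWindow that takes an inverse ("subtract") function, so each slide only adds the batches entering the window and subtracts the ones leaving it; this variant requires checkpointing. A hedged sketch, using the same placeholder host and port:
      
        import org.apache.spark.SparkConf
        import org.apache.spark.streaming.{Seconds, StreamingContext}
        
        object SparkStreamingWindowInc {
          def main(args: Array[String]): Unit = {
            val context = new StreamingContext(
              new SparkConf().setAppName("SparkStreamingWindowInc").setMaster("local[2]"), Seconds(5))
            context.sparkContext.setLogLevel("OFF")
            context.checkpoint("D:/ck")   // checkpointing is mandatory for the incremental variant
            context.socketTextStream("node-01", 9999)
              .flatMap(_.split(",")).map((_, 1))
              // add the counts of batches entering the window, subtract those leaving it
              .reduceByKeyAndWindow(_ + _, _ - _, Seconds(20), Seconds(10))
              .print()
            context.start()
            context.awaitTermination()
          }
        }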
      

6.2 Other Data Sources for SparkStreaming

  • SparkStreaming can also read files from a file system; textFileStream monitors a directory and picks up files added after the job starts
    • def main(args: Array[String]): Unit = {
        val context = new StreamingContext(new SparkConf().setAppName("SparkStreamingFile").setMaster("local[*]"), Seconds(5))
        context.sparkContext.setLogLevel("OFF")
        // read newly arriving files from the monitored directory
        val dstream = context.textFileStream("D:/data/")
        dstream.print()
        context.start()
        context.awaitTermination()
      }
      

6.3 Reading a Data Source with a Custom Receiver

  • Define a class that extends Receiver and implement its abstract methods onStart and onStop
  • package cn.itcast.sparkStreaming
    
    import java.io.{BufferedReader, InputStreamReader}
    import java.net.Socket
    
    import org.apache.spark.SparkConf
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.receiver.Receiver
    
    // a custom receiver that reads data from a socket
    class MyReceiver(val host: String, val port: Int) extends Receiver[String](StorageLevel.MEMORY_ONLY) {
      var socket: Socket = _
    
      override def onStart(): Unit = {
        // connect to the data source
        socket = new Socket(host, port)
        new Thread() {
          override def run(): Unit = {
            // read the data and hand it to Spark for storage
            receiverMsg()
          }
        }.start()
      }
    
      def receiverMsg(): Unit = {
        // get the socket's input stream
        val inputStream = socket.getInputStream()
        // build the reader
        val reader = new InputStreamReader(inputStream)
        val bufferedReader = new BufferedReader(reader)
    
        var line: String = bufferedReader.readLine()
        while(!isStopped() && line != null) {
          // store the record in Spark's memory
          store(line)
          line = bufferedReader.readLine()
        }
      }
    
      override def onStop(): Unit = {
        // release resources
        socket.close()
      }
    }
    
    object SparkStreamingReceiver {
      def main(args: Array[String]): Unit = {
        val context = new StreamingContext(new SparkConf().setAppName("SparkStreamingFile").setMaster("local[*]"), Seconds(5))
        context.sparkContext.setLogLevel("OFF")
        val dstream = context.receiverStream(new MyReceiver("node-01", 9999))
        dstream.print()
        context.start()
        context.awaitTermination()
      }
    }
    

7. SparkStreaming Reading Data from Flume

  • Preparation for Flume log collection
    • Replace spark-streaming-flume-sink*.jar with spark-streaming-flume-sink_2.11-2.2.0.jar
    • Replace scala-library-2.10.5.jar with scala-library-2.11.8.jar
  • Configure the pom file
    • <dependency>
          <groupId>org.apache.spark</groupId>
          <artifactId>spark-streaming-flume_2.11</artifactId>
          <version>2.2.0</version>
      </dependency>
      
  • SparkStreaming can read data from Flume in two ways
    • poll (pull mode): Flume delivers the collected data to a special Spark sink listening on a port of the Flume machine, and SparkStreaming pulls from that port

    • push mode: Flume pushes the collected data to the machine where the Spark application runs

  • Steps for reading data (poll mode)
    • Create the flume-poll.properties file
      • a1.sources = r1
        a1.sinks = k1
        a1.channels = c1
        #source
        a1.sources.r1.channels = c1
        a1.sources.r1.type = spooldir
        a1.sources.r1.spoolDir = /home/data
        a1.sources.r1.fileHeader = true
        #channel
        a1.channels.c1.type =memory
        a1.channels.c1.capacity = 20000
        a1.channels.c1.transactionCapacity=5000
        #sinks
        a1.sinks.k1.channel = c1
        a1.sinks.k1.type = org.apache.spark.streaming.flume.sink.SparkSink
        a1.sinks.k1.hostname=node-03
        a1.sinks.k1.port = 9999
        a1.sinks.k1.batchSize= 2000
        
      • package cn.itcast.sparkStreaming
        
        import org.apache.spark.SparkConf
        import org.apache.spark.streaming.flume.{FlumeUtils, SparkFlumeEvent}
        import org.apache.spark.streaming.{Seconds, StreamingContext}
        
        object SparkStreamingFlume {
          def main(args: Array[String]): Unit = {
            // create the StreamingContext
            val conf = new SparkConf().setMaster("local[*]").setAppName("SparkStreamingFlume")
            val context = new StreamingContext(conf, Seconds(10))
            context.sparkContext.setLogLevel("OFF")
            // read data from Flume
            /**
              * ssc: StreamingContext,
              * hostname: String
              * port: Int,
              */
            val flumeDstream = FlumeUtils.createPollingStream(context, "node-03", 9999)
        
            val wordcountDstream = flumeDstream.map((sfe: SparkFlumeEvent) => {
              val body = sfe.event.getBody
              new String(body.array())
            }).flatMap(_.split(",")).map((_, 1)).reduceByKey(_ + _)
        
            // sort by count, descending
            val sortDstream = wordcountDstream.transform(rdd => {
              rdd.sortBy(t => t._2, false)
            })
            // top N: print the first three hot words of each batch
            sortDstream.print(3)
          
            context.start()
            context.awaitTermination()
        
          }
        }
        
  • Consuming data in push mode
    • Create the flume-push.properties file
      • a1.sources = r1
        a1.sinks = k1
        a1.channels = c1
        #source
        a1.sources.r1.channels = c1
        a1.sources.r1.type = spooldir
        a1.sources.r1.spoolDir = /home/data/
        a1.sources.r1.fileHeader = true
        #channel
        a1.channels.c1.type =memory
        a1.channels.c1.capacity = 20000
        a1.channels.c1.transactionCapacity=5000
        #sinks
        a1.sinks.k1.channel = c1
        a1.sinks.k1.type = avro
        #note: this IP must be the address of the machine where the Spark program runs (here, the developer's Windows machine)
        a1.sinks.k1.hostname=192.168.23.22
        a1.sinks.k1.port = 8888
        a1.sinks.k1.batchSize= 2000
        
    • package cn.itcast.sparkStreaming
      
      import org.apache.spark.SparkConf
      import org.apache.spark.streaming.dstream.DStream
      import org.apache.spark.streaming.flume.{FlumeUtils, SparkFlumeEvent}
      import org.apache.spark.streaming.{Seconds, StreamingContext}
      
      object SparkStreamingFlume2 {
        def main(args: Array[String]): Unit = {
          // create the StreamingContext
          val conf = new SparkConf().setMaster("local[*]").setAppName("SparkStreamingFlume")
          val context = new StreamingContext(conf, Seconds(10))
          context.sparkContext.setLogLevel("OFF")
      
          /**
            * ssc: StreamingContext,
            * hostname: String,
            * port: Int
            */
          val dstream:DStream[SparkFlumeEvent] = FlumeUtils.createStream(context, "192.168.23.22", 8888)
          dstream.map(sfe=>{
            new String(sfe.event.getBody.array())
          }).print()
          context.start()
          context.awaitTermination()
      
        }
      }
      

8. SparkStreaming Consuming Data from Kafka

  • Steps
    • Start the Kafka cluster (the matching spark-streaming-kafka-0-8_2.11 or spark-streaming-kafka-0-10_2.11 dependency also needs to be on the classpath)
  • Consuming Kafka data through a receiver
    • package cn.itcast.sparkStreaming
      
      import org.apache.spark.SparkConf
      import org.apache.spark.streaming.kafka.KafkaUtils
      import org.apache.spark.streaming.{Seconds, StreamingContext}
      
      /**
        * With createStream, the Kafka offsets are stored in ZooKeeper and managed automatically by the program.
        * The receiver consumes Kafka with multiple threads; the thread count follows the per-topic value in the
        * topics map, normally the number of partitions, e.g. topic1 -> 3 and topic2 -> 3 gives 6 threads.
        */
      object SparkStreamingKakfa_8 {
        def main(args: Array[String]): Unit = {
          //创建StreamingContext
          val conf = new SparkConf().setMaster("local[*]").setAppName("SparkStreamingFlume")
          val context = new StreamingContext(conf, Seconds(10))
          context.sparkContext.setLogLevel("OFF")
      
          // connect to the Kafka data source
          /**
            * ssc: StreamingContext,
            * zkQuorum: String,
            * groupId: String,
            * topics: Map[String, Int],
            */
          val zkQuorum: String = "node-01:2181,node-02:2181,node-03:2181"
          val groupId: String = "shenzhen"
          val topics: Map[String, Int] = Map("shenzhen_itcast" -> 3)
      
          val kafkaDstream =KafkaUtils.createStream(context, zkQuorum, groupId, topics)
          val wcDstream = kafkaDstream.map(_._2).flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
          wcDstream.print()
          context.start()
          context.awaitTermination()
        }
      }
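    
    For receiver-based consumption (createStream above, or the custom receiver in 6.3), the write-ahead log mentioned under the delivery semantics below is enabled with a configuration flag plus a checkpoint directory. A minimal hedged sketch; the HDFS path is only illustrative:
    
      import org.apache.spark.SparkConf
      import org.apache.spark.storage.StorageLevel
      import org.apache.spark.streaming.kafka.KafkaUtils
      import org.apache.spark.streaming.{Seconds, StreamingContext}
      
      object ReceiverWithWAL {
        def main(args: Array[String]): Unit = {
          val conf = new SparkConf()
            .setMaster("local[*]")
            .setAppName("ReceiverWithWAL")
            // write received blocks to a write-ahead log before they are processed
            .set("spark.streaming.receiver.writeAheadLog.enable", "true")
          val ssc = new StreamingContext(conf, Seconds(10))
          ssc.checkpoint("hdfs://node-01:8020/wal-ck")   // the WAL is written under the checkpoint dir
          // with the WAL enabled, in-memory replication is unnecessary, so use MEMORY_AND_DISK_SER
          val kafkaDstream = KafkaUtils.createStream(
            ssc,
            "node-01:2181,node-02:2181,node-03:2181",
            "shenzhen",
            Map("shenzhen_itcast" -> 3),
            StorageLevel.MEMORY_AND_DISK_SER)
          kafkaDstream.map(_._2).count().print()
          ssc.start()
          ssc.awaitTermination()
        }
      }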
      
  • /**
      * Direct mode: offsets are read from the Kafka topic partitions directly (no receiver, no ZooKeeper);
      * the application itself is responsible for tracking and storing them.
      */
    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils, OffsetRange}
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    
    object SparkStreamingKafkaDriect_8 {
      def main(args: Array[String]): Unit = {
        //创建StreamingContext
        val conf = new SparkConf().setMaster("local[*]").setAppName("SparkStreamingFlume")
        val context = new StreamingContext(conf, Seconds(10))
        context.sparkContext.setLogLevel("OFF")
        // consume Kafka data in Direct mode
        /**
          * ssc: StreamingContext,
          * kafkaParams: Map[String, String],
          * topics: Set[String]
          */
        val kafkaParams: Map[String, String] = Map(
          "metadata.broker.list" -> "node-01:9092,node-02:9092,node-03:9092",
          "groupId" -> "shenzhen"
        )
        val topics: Set[String] = Set("shenzhen_itcast")
        val directStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](context, kafkaParams, topics)
    
    
        var offsetRanges = Array.empty[OffsetRange]
    
        directStream.transform { rdd =>
          offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
          rdd
        }.foreachRDD { rdd =>
    
          for (o <- offsetRanges) {
            println(s"${o.topic} ${o.partition} ${o.fromOffset} ${o.untilOffset}")
          }
          // in practice the per-partition offsets are persisted to Redis; see the sketch after this example
        }
        context.start()
        context.awaitTermination()
      }
    }
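    
    A hedged sketch of persisting those offsets to Redis with the Jedis client; the jedis dependency, the Redis host/port, and the key layout are assumptions, not part of the course environment. saveOffsets("shenzhen", offsetRanges) could be called inside the foreachRDD above once a batch is processed, and on restart the stored values could be read back and passed to createDirectStream through its fromOffsets overload.
    
      import org.apache.spark.streaming.kafka.OffsetRange
      import redis.clients.jedis.Jedis
      
      object RedisOffsetStore {
        // store each partition's untilOffset in one Redis hash per consumer group and topic
        def saveOffsets(group: String, ranges: Array[OffsetRange]): Unit = {
          val jedis = new Jedis("node-01", 6379)   // assumed Redis host and port
          try {
            ranges.foreach { o =>
              jedis.hset(s"kafka:offsets:$group:${o.topic}", o.partition.toString, o.untilOffset.toString)
            }
          } finally {
            jedis.close()
          }
        }
      }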
    
  • Based on Kafka 0.10 (the spark-streaming-kafka-0-10 integration)
    
    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.{SparkConf, TaskContext}
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka010._
    import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
    import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
    
    object SparkStreamingKafka10 {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setMaster("local[*]").setAppName("SparkStreamingFlume")
        val context = new StreamingContext(conf, Seconds(10))
        context.sparkContext.setLogLevel("OFF")
        val kafkaParams = Map[String, Object](
          "bootstrap.servers" -> "node-01:9092,node-02:9092,node-03:9092",
          "key.deserializer" -> classOf[StringDeserializer],
          "value.deserializer" -> classOf[StringDeserializer],
          "group.id" -> "use_a_separate_group_id_for_each_stream",
          "auto.offset.reset" -> "latest",  //从哪里开始消费
          "enable.auto.commit" -> (false: java.lang.Boolean)  //设置不自动提交offset
        )
    
        val topics = Array("shenzhen_itcast")
        val stream = KafkaUtils.createDirectStream[String, String](
          context,
          PreferConsistent,
          Subscribe[String, String](topics, kafkaParams)
        )
        stream.foreachRDD { rdd =>
          val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
          rdd.foreachPartition { iter =>
            val o: OffsetRange = offsetRanges(TaskContext.get.partitionId)
            println(s"${o.topic} ${o.partition} ${o.fromOffset} ${o.untilOffset}")
          }
          // commit the offsets back to Kafka; in production offsets are often managed externally (e.g. in Redis)
          stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
        }
        context.start()
        context.awaitTermination()
    /*
    shenzhen_itcast 2 10 10
    shenzhen_itcast 0 11 11
    shenzhen_itcast 1 11 11
     */
      }
    }
    
  • Delivery semantics when consuming data
    • At-least-once (records may be processed more than once)
      • With receiver-based consumption, data is read into SparkStreaming but the offset has not been committed yet; after a failure the same records can be read again.
      • To avoid losing received data, enable the WAL (write-ahead log) so that received records are also written to HDFS; combined with idempotent output this is the basis for exactly-once processing.
    • At-most-once (records may be lost)
      • With receiver-based consumption, the program exits abnormally before the data is processed, but the offset has already been committed.
    • Exactly-once
      • Comparable to a transaction in a relational database: the output and the offset update must succeed or fail together (a hedged sketch follows this list).
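      
      A hedged sketch of that pattern with the Kafka 0.10 direct stream: make the write idempotent and commit the offsets only after the write succeeds. saveBatchIdempotently is a hypothetical placeholder for an idempotent sink (for example an upsert keyed by topic, partition, and offset), not a real API, and `stream` is assumed to be the InputDStream created in the Kafka 0.10 example above.
      
        import org.apache.kafka.clients.consumer.ConsumerRecord
        import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}
        
        // hypothetical idempotent sink: replaying a batch after a failure overwrites the same rows
        def saveBatchIdempotently(records: Iterator[ConsumerRecord[String, String]]): Unit =
          records.foreach(r => println(s"${r.topic}-${r.partition}-${r.offset}: ${r.value}"))
        
        stream.foreachRDD { rdd =>
          val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
          rdd.foreachPartition(records => saveBatchIdempotently(records))
          // commit only after the write succeeded: a crash before this line replays the batch,
          // and the idempotent write makes the replay harmless
          stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
        }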