SparkStreaming

1. Course Objectives

  • Understand the principles and architecture of SparkStreaming
  • Master the common DStream operations
  • Master integrating SparkStreaming with Flume
  • Master integrating SparkStreaming with Kafka

2. SparkStreaming Concepts

  • SparkStreaming is used to develop real-time (streaming) data processing programs.
    • SparkStreaming can ingest data from different real-time sources such as Kafka, Flume, and sockets. Once started, a SparkStreaming application does not stop on its own; it runs until it is stopped manually or brought down by a machine failure or a data error.
  • SparkStreaming features:
    • Ease of use: supports multiple development languages.
    • Fault tolerance: Spark Streaming recovers both lost work and operator state (such as sliding windows) out of the box, without any extra code.
    • Easy integration: SparkStreaming integrates with GraphX and Spark's machine learning libraries, and can also be combined with Kafka, Flume, file systems, SparkCore, and SparkSQL.

3. SparkStreaming Principles

  • SparkStreaming is a near-real-time (micro-batch) processing framework.
    	"Micro" means the batch interval is very small.
    	"Batch" means the data in each interval is processed like an offline batch job.
    	
    1. SparkStreaming shrinks the processing interval down to second level or even millisecond level.
    2. SparkStreaming turns the data received in each interval into an RDD, applies transformations to that RDD, and finally submits a job.
    SparkStreaming submits these jobs automatically, once per batch interval.
    
  • SparkStreaming latency
    • SparkStreaming can currently reach a latency of roughly 100 milliseconds.
      	For projects with very strict latency requirements, such as securities trading or news publishing, consider Storm, Flink, or Spark's Structured Streaming during technology selection.
        1. Splitting a job into tasks takes time (e.g. 50 tasks per second, about 20 ms per task).
        2. Distributing the tasks to different worker nodes takes time (network transfer + serialization/deserialization + execution).
        Together these two steps take at least 100 milliseconds.
      In practice, SparkStreaming is usually run with second-level batch intervals.
      
  • SparkStreaming fault tolerance (a short sketch follows this list)
    • Fault tolerance based on the RDD lineage mechanism
    • Fault tolerance based on the RDD cache mechanism
    • Fault tolerance based on RDD checkpointing
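
    A minimal sketch of where those hooks appear in code; the HDFS path, host, port, and intervals are placeholders rather than values from the course environment:

      import org.apache.spark.SparkConf
      import org.apache.spark.storage.StorageLevel
      import org.apache.spark.streaming.{Seconds, StreamingContext}

      object FaultToleranceSketch {
        def main(args: Array[String]): Unit = {
          val ssc = new StreamingContext(
            new SparkConf().setAppName("FaultToleranceSketch").setMaster("local[2]"), Seconds(5))
          ssc.checkpoint("hdfs://node-01:8020/spark-ck")   // checkpoint-based recovery
          val lines = ssc.socketTextStream("node-01", 9999)
          lines.persist(StorageLevel.MEMORY_ONLY_SER)      // cache-based recovery of the batch RDDs
          lines.checkpoint(Seconds(25))                    // checkpoint this DStream's RDDs periodically
          // lineage-based recovery needs no extra code: each batch RDD records its dependencies
          lines.count().print()
          ssc.start()
          ssc.awaitTermination()
        }
      }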

4. SparkStreaming Programming Entry Point

  • Add the pom dependency

    • <!-- SparkStreaming dependency -->
      <dependency>
          <groupId>org.apache.spark</groupId>
          <artifactId>spark-streaming_2.11</artifactId>
          <version>2.2.0</version>
      </dependency>
      
  • Programming entry points:
    SparkCore:       SparkContext
    SparkSQL:        SparkSession
    SparkStreaming:  StreamingContext
    
    What a StreamingContext does:
    	1. Connects to the data source and creates the DStream
    	2. StreamingContext.start() starts the streaming application
    	3. StreamingContext.awaitTermination() blocks the main thread and watches for exceptions; if one occurs, stop() is called
    	4. StreamingContext.stop() stops the application and releases its resources
    
    Creating a StreamingContext requires a SparkConf and a batch interval, as in the sketch below.
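    
    A minimal lifecycle sketch; host, port, and the timeout are placeholders, and a real job would normally wait indefinitely:
    
      import org.apache.spark.SparkConf
      import org.apache.spark.streaming.{Seconds, StreamingContext}
      
      object StreamingContextLifecycle {
        def main(args: Array[String]): Unit = {
          // a SparkConf plus a batch interval are the two things a StreamingContext needs
          val conf = new SparkConf().setAppName("StreamingContextLifecycle").setMaster("local[2]")
          val ssc = new StreamingContext(conf, Seconds(5))
          val lines = ssc.socketTextStream("localhost", 9999)  // 1. attach a source, get a DStream
          lines.print()                                        // at least one output operator is required
          ssc.start()                                          // 2. start the streaming application
          // 3. block the main thread; here we wait at most 10 minutes instead of forever
          ssc.awaitTerminationOrTimeout(10 * 60 * 1000)
          // 4. stop the application and release the SparkContext, letting in-flight batches finish
          ssc.stop(stopSparkContext = true, stopGracefully = true)
        }
      }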
    

5. SparkStreaming Programming Model: DStream

  • DStream (discretized stream) is the basic programming abstraction of SparkStreaming. A DStream is essentially a sequence of RDDs, one per batch interval.
    DStream transformation operators: map, flatMap, reduceByKey, groupByKey, transform, etc. Every transformation produces a new DStream.
    DStream output operators: print, foreachRDD, saveAsTextFiles, etc. (A short example contrasting the two kinds of operators follows at the end of this section.)
    
    Three key properties of DStreams:
    	1. Every DStream depends on other DStreams
    		def dependencies: List[DStream[_]]
    	2. An RDD is generated for every batch interval
    		var generatedRDDs = new HashMap[Time, RDD[T]]()
    	3. The function passed to a transformation operator is applied to the RDD of each batch interval,
    		through the DStream's compute method (shown here for map):
    		override def compute(validTime: Time): Option[RDD[U]] = {
        parent.getOrCompute(validTime).map(_.map[U](mapFunc))
      }
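    
    A brief hedged sketch of the two kinds of operators (host and port are placeholders): transform is a transformation that exposes each batch's RDD and returns a new DStream, while foreachRDD is an output operator that actually triggers a job on every batch.
    
      import org.apache.spark.SparkConf
      import org.apache.spark.streaming.{Seconds, StreamingContext}
      
      object DStreamOperatorsSketch {
        def main(args: Array[String]): Unit = {
          val ssc = new StreamingContext(
            new SparkConf().setAppName("DStreamOperatorsSketch").setMaster("local[2]"), Seconds(5))
          val lines = ssc.socketTextStream("node-01", 9999)
          // transformation: work on each batch's RDD and produce a new DStream
          val nonEmpty = lines.transform(rdd => rdd.filter(_.nonEmpty))
          // output operator: runs once per batch interval and triggers the job
          nonEmpty.foreachRDD(rdd => rdd.take(10).foreach(println))
          ssc.start()
          ssc.awaitTermination()
        }
      }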
    

6. SparkStreaming Hands-On Programming in IDEA

  • pom dependency
    • <!-- SparkStreaming dependency -->
      <dependency>
          <groupId>org.apache.spark</groupId>
          <artifactId>spark-streaming_2.11</artifactId>
          <version>2.2.0</version>
      </dependency>
      

6.1 Ingesting Socket Data

  • Preparation
    • Install netcat to act as the socket data source: yum -y install nc
    • Start it: nc -lk <port>
  • Write the program
    • package cn.itcast.sparkStreaming
      
      import org.apache.spark.SparkConf
      import org.apache.spark.streaming.dstream.DStream
      import org.apache.spark.streaming.{Seconds, StreamingContext}
      
      object SocketSparkStreaming {
        def main(args: Array[String]): Unit = {
          // create the SparkConf
          val conf = new SparkConf().setAppName("SocketSparkStreaming").setMaster("local[2]")
          // build the StreamingContext with a 5-second batch interval
          val sc: StreamingContext = new StreamingContext(conf, Seconds(5))
          sc.sparkContext.setLogLevel("OFF")
      
          // connect to the data source and create the DStream
          val linesDstream: DStream[String] = sc.socketTextStream("node-01", 9999)
          // input lines are comma-separated; compute a word count
          val wordCountDstream: DStream[(String, Int)] = linesDstream.flatMap(_.split(",")).map((_, 1)).reduceByKey(_ + _)
      
          // print each batch's result
          wordCountDstream.print()
      
          // start the streaming application
          sc.start()
          sc.awaitTermination()
        }
      }
      
      Note: by default SparkStreaming only processes the data of the current batch; it keeps no state from previous batches.
      
  • To keep state across batches (e.g. a running word count), use updateStateByKey
    • package cn.itcast.sparkStreaming
      
      import org.apache.spark.SparkConf
      import org.apache.spark.streaming.dstream.DStream
      import org.apache.spark.streaming.{Seconds, StreamingContext}
      
      object UpdateStateSparkStreaming {
        /**
          * @param newValues    the values for this key in the current batch (one entry per record)
          * @param runningState the count accumulated for this key over all previous batches
          * @return the updated state for this key
          */
        def updateFunc(newValues: Seq[Int], runningState: Option[Int]): Option[Int] = {
          val value = newValues.sum + runningState.getOrElse(0)
          Option(value)
        }
      
      
        def main(args: Array[String]): Unit = {
          val conf = new SparkConf().setAppName("UpdateStateSparkStreaming").setMaster("local[2]")
          // create the StreamingContext
          val context = new StreamingContext(conf, Seconds(5))
          context.sparkContext.setLogLevel("OFF")
          // set the checkpoint path (use an HDFS path in production)
          context.checkpoint("D:/ck")
          // connect to the data source
          val linesDstream: DStream[String] = context.socketTextStream("node-01", 9999)
      
          val wordcountDStream: DStream[(String, Int)] = linesDstream.flatMap(_.split(",")).map((_, 1)).reduceByKey(_ + _)
      
          // maintain the running state and print it to the console
          wordcountDStream.updateStateByKey(updateFunc _).print()
          context.start()
          context.awaitTermination()
        }
      }
      
  • With the code above, the accumulated state is lost if the driver goes down.
    • To keep the state across a restart, recover the StreamingContext from the checkpoint directory:
      • package cn.itcast.sparkStreaming
        
        import cn.itcast.sparkStreaming.UpdateStateSparkStreaming.updateFunc
        import org.apache.spark.SparkConf
        import org.apache.spark.streaming.{Seconds, StreamingContext}
        import org.apache.spark.streaming.dstream.DStream
        
        object SparkStreamingUpdate {
        
          def createFunc(): StreamingContext = {
            val conf = new SparkConf().setAppName("UpdateStateSparkStreaming").setMaster("local[2]")
            // create the StreamingContext
            val context = new StreamingContext(conf, Seconds(10))
        
            context.sparkContext.setLogLevel("OFF")
            // set the checkpoint path (use an HDFS path in production)
            context.checkpoint("D:/ck")
            // connect to the data source
            val linesDstream: DStream[String] = context.socketTextStream("node-01", 9999)
        
            val wordcountDStream: DStream[(String, Int)] = linesDstream.flatMap(_.split(",")).map((_, 1)).reduceByKey(_ + _)
        
            // maintain the running state and print it to the console
            wordcountDStream.updateStateByKey(updateFunc _).print()
            // return the context
            context
          }
          def main(args: Array[String]): Unit = {
            // recover from the checkpoint path; in production this is usually an HDFS path
            val streamingContext=StreamingContext.getOrCreate("D:/ck", createFunc _)
            streamingContext.start()
            streamingContext.awaitTermination()
          }
        }
        
  • Requirement: compute the hot words of the last 20 seconds, once every 10 seconds, with a SparkStreaming batch interval of 5 seconds. This is a window operation in SparkStreaming. (An incremental variant is sketched after the example below.)
    • object SparkStreamingWindow {
        def main(args: Array[String]): Unit = {
          val conf = new SparkConf().setAppName("UpdateStateSparkStreaming").setMaster("local[2]")
          // create the StreamingContext
          val context = new StreamingContext(conf, Seconds(5))
      
          context.sparkContext.setLogLevel("OFF")
      
          // connect to the data source
          val linesDstream: DStream[String] = context.socketTextStream("node-01", 9999)
      
          val wordcountDStream: DStream[(String, Int)] = linesDstream.flatMap(_.split(",")).map((_, 1))
      
            .reduceByKeyAndWindow((x: Int, y: Int) => x + y, Seconds(20), Seconds(10))
            // the window length and the slide interval must both be integer multiples of the batch interval
          wordcountDStream.print()
          context.start()
          context.awaitTermination()
      
        }
      }
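      
      For long windows there is also an incremental form of reduceByKeyAndWindow that takes an inverse ("subtract") function, so each slide only adds the batches entering the window and subtracts the ones leaving it; this variant requires checkpointing. A hedged sketch, using the same placeholder host and port:
      
        import org.apache.spark.SparkConf
        import org.apache.spark.streaming.{Seconds, StreamingContext}
        
        object SparkStreamingWindowInc {
          def main(args: Array[String]): Unit = {
            val context = new StreamingContext(
              new SparkConf().setAppName("SparkStreamingWindowInc").setMaster("local[2]"), Seconds(5))
            context.sparkContext.setLogLevel("OFF")
            context.checkpoint("D:/ck")   // checkpointing is mandatory for the incremental variant
            context.socketTextStream("node-01", 9999)
              .flatMap(_.split(",")).map((_, 1))
              // add the counts of batches entering the window, subtract those leaving it
              .reduceByKeyAndWindow(_ + _, _ - _, Seconds(20), Seconds(10))
              .print()
            context.start()
            context.awaitTermination()
          }
        }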
      

6.2 Other Data Sources for SparkStreaming

  • SparkStreaming can also read files from a file system; textFileStream monitors a directory and picks up files added after the job starts
    • def main(args: Array[String]): Unit = {
        val context = new StreamingContext(new SparkConf().setAppName("SparkStreamingFile").setMaster("local[*]"), Seconds(5))
        context.sparkContext.setLogLevel("OFF")
        // read newly arriving files from the monitored directory
        val dstream = context.textFileStream("D:/data/")
        dstream.print()
        context.start()
        context.awaitTermination()
      }
      

6.3 Reading a Data Source with a Custom Receiver

  • Define a class that extends Receiver and implement its abstract methods onStart and onStop
  • package cn.itcast.sparkStreaming
    
    import java.io.{BufferedReader, InputStreamReader}
    import java.net.Socket
    
    import org.apache.spark.SparkConf
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.receiver.Receiver
    
    // a custom receiver that reads data from a socket
    class MyReceiver(val host: String, val port: Int) extends Receiver[String](StorageLevel.MEMORY_ONLY) {
      var socket: Socket = _
    
      override def onStart(): Unit = {
        // connect to the data source
        socket = new Socket(host, port)
        new Thread() {
          override def run(): Unit = {
            // read the data and hand it to Spark for storage
            receiverMsg()
          }
        }.start()
      }
    
      def receiverMsg(): Unit = {
        // get the socket's input stream
        val inputStream = socket.getInputStream()
        // build the reader
        val reader = new InputStreamReader(inputStream)
        val bufferedReader = new BufferedReader(reader)
    
        var line: String = bufferedReader.readLine()
        while(!isStopped() && line != null) {
          // store the record in Spark's memory
          store(line)
          line = bufferedReader.readLine()
        }
      }
    
      override def onStop(): Unit = {
        // release resources
        socket.close()
      }
    }
    
    object SparkStreamingReceiver {
      def main(args: Array[String]): Unit = {
        val context = new StreamingContext(new SparkConf().setAppName("SparkStreamingFile").setMaster("local[*]"), Seconds(5))
        context.sparkContext.setLogLevel("OFF")
        val dstream = context.receiverStream(new MyReceiver("node-01", 9999))
        dstream.print()
        context.start()
        context.awaitTermination()
      }
    }
    

7. SparkStreaming Reading Data from Flume

  • Preparation for Flume log collection
    • Replace spark-streaming-flume-sink*.jar with spark-streaming-flume-sink_2.11-2.2.0.jar
    • Replace scala-library-2.10.5.jar with scala-library-2.11.8.jar
  • Configure the pom file
    • <dependency>
          <groupId>org.apache.spark</groupId>
          <artifactId>spark-streaming-flume_2.11</artifactId>
          <version>2.2.0</version>
      </dependency>
      
  • SparkStreaming can read data from Flume in two ways
    • poll (pull mode): Flume delivers the collected data to a special Spark sink listening on a port of the Flume machine, and SparkStreaming pulls from that port

    • push mode: Flume pushes the collected data to the machine where the Spark application runs

  • Steps for reading data (poll mode)
    • Create the flume-poll.properties file
      • a1.sources = r1
        a1.sinks = k1
        a1.channels = c1
        #source
        a1.sources.r1.channels = c1
        a1.sources.r1.type = spooldir
        a1.sources.r1.spoolDir = /home/data
        a1.sources.r1.fileHeader = true
        #channel
        a1.channels.c1.type =memory
        a1.channels.c1.capacity = 20000
        a1.channels.c1.transactionCapacity=5000
        #sinks
        a1.sinks.k1.channel = c1
        a1.sinks.k1.type = org.apache.spark.streaming.flume.sink.SparkSink
        a1.sinks.k1.hostname=node-03
        a1.sinks.k1.port = 9999
        a1.sinks.k1.batchSize= 2000
        
      • package cn.itcast.sparkStreaming
        
        import org.apache.spark.SparkConf
        import org.apache.spark.streaming.flume.{FlumeUtils, SparkFlumeEvent}
        import org.apache.spark.streaming.{Seconds, StreamingContext}
        
        object SparkStreamingFlume {
          def main(args: Array[String]): Unit = {
            // create the StreamingContext
            val conf = new SparkConf().setMaster("local[*]").setAppName("SparkStreamingFlume")
            val context = new StreamingContext(conf, Seconds(10))
            context.sparkContext.setLogLevel("OFF")
            // read data from Flume
            /**
              * ssc: StreamingContext,
              * hostname: String
              * port: Int,
              */
            val flumeDstream = FlumeUtils.createPollingStream(context, "node-03", 9999)
        
            val wordcountDstream = flumeDstream.map((sfe: SparkFlumeEvent) => {
              val body = sfe.event.getBody
              new String(body.array())
            }).flatMap(_.split(",")).map((_, 1)).reduceByKey(_ + _)
        
            // sort by count, descending
            val sortDstream = wordcountDstream.transform(rdd => {
              rdd.sortBy(t => t._2, false)
            })
            // top N: print the first three hot words of each batch
            sortDstream.print(3)
          
            context.start()
            context.awaitTermination()
        
          }
        }
        
  • Consuming data in push mode
    • Create the flume-push.properties file
      • a1.sources = r1
        a1.sinks = k1
        a1.channels = c1
        #source
        a1.sources.r1.channels = c1
        a1.sources.r1.type = spooldir
        a1.sources.r1.spoolDir = /home/data/
        a1.sources.r1.fileHeader = true
        #channel
        a1.channels.c1.type =memory
        a1.channels.c1.capacity = 20000
        a1.channels.c1.transactionCapacity=5000
        #sinks
        a1.sinks.k1.channel = c1
        a1.sinks.k1.type = avro
        #note: this IP must be the address of the machine where the Spark program runs (here, the developer's Windows machine)
        a1.sinks.k1.hostname=192.168.23.22
        a1.sinks.k1.port = 8888
        a1.sinks.k1.batchSize= 2000
        
    • package cn.itcast.sparkStreaming
      
      import org.apache.spark.SparkConf
      import org.apache.spark.streaming.dstream.DStream
      import org.apache.spark.streaming.flume.{FlumeUtils, SparkFlumeEvent}
      import org.apache.spark.streaming.{Seconds, StreamingContext}
      
      object SparkStreamingFlume2 {
        def main(args: Array[String]): Unit = {
          // create the StreamingContext
          val conf = new SparkConf().setMaster("local[*]").setAppName("SparkStreamingFlume")
          val context = new StreamingContext(conf, Seconds(10))
          context.sparkContext.setLogLevel("OFF")
      
          /**
            * ssc: StreamingContext,
            * hostname: String,
            * port: Int
            */
          val dstream:DStream[SparkFlumeEvent] = FlumeUtils.createStream(context, "192.168.23.22", 8888)
          dstream.map(sfe=>{
            new String(sfe.event.getBody.array())
          }).print()
          context.start()
          context.awaitTermination()
      
        }
      }
      

8. SparkStreaming Consuming Data from Kafka

  • Steps
    • Start the Kafka cluster (the matching spark-streaming-kafka-0-8_2.11 or spark-streaming-kafka-0-10_2.11 dependency also needs to be on the classpath)
  • Consuming Kafka data through a receiver
    • package cn.itcast.sparkStreaming
      
      import org.apache.spark.SparkConf
      import org.apache.spark.streaming.kafka.KafkaUtils
      import org.apache.spark.streaming.{Seconds, StreamingContext}
      
      /**
        * With createStream, the Kafka offsets are stored in ZooKeeper and managed automatically by the program.
        * The receiver consumes Kafka with multiple threads; the thread count follows the per-topic value in the
        * topics map, normally the number of partitions, e.g. topic1 -> 3 and topic2 -> 3 gives 6 threads.
        */
      object SparkStreamingKakfa_8 {
        def main(args: Array[String]): Unit = {
          //创建StreamingContext
          val conf = new SparkConf().setMaster("local[*]").setAppName("SparkStreamingFlume")
          val context = new StreamingContext(conf, Seconds(10))
          context.sparkContext.setLogLevel("OFF")
      
          // connect to the Kafka data source
          /**
            * ssc: StreamingContext,
            * zkQuorum: String,
            * groupId: String,
            * topics: Map[String, Int],
            */
          val zkQuorum: String = "node-01:2181,node-02:2181,node-03:2181"
          val groupId: String = "shenzhen"
          val topics: Map[String, Int] = Map("shenzhen_itcast" -> 3)
      
          val kafkaDstream =KafkaUtils.createStream(context, zkQuorum, groupId, topics)
          val wcDstream = kafkaDstream.map(_._2).flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
          wcDstream.print()
          context.start()
          context.awaitTermination()
        }
      }
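    
    For receiver-based consumption (createStream above, or the custom receiver in 6.3), the write-ahead log mentioned under the delivery semantics below is enabled with a configuration flag plus a checkpoint directory. A minimal hedged sketch; the HDFS path is only illustrative:
    
      import org.apache.spark.SparkConf
      import org.apache.spark.storage.StorageLevel
      import org.apache.spark.streaming.kafka.KafkaUtils
      import org.apache.spark.streaming.{Seconds, StreamingContext}
      
      object ReceiverWithWAL {
        def main(args: Array[String]): Unit = {
          val conf = new SparkConf()
            .setMaster("local[*]")
            .setAppName("ReceiverWithWAL")
            // write received blocks to a write-ahead log before they are processed
            .set("spark.streaming.receiver.writeAheadLog.enable", "true")
          val ssc = new StreamingContext(conf, Seconds(10))
          ssc.checkpoint("hdfs://node-01:8020/wal-ck")   // the WAL is written under the checkpoint dir
          // with the WAL enabled, in-memory replication is unnecessary, so use MEMORY_AND_DISK_SER
          val kafkaDstream = KafkaUtils.createStream(
            ssc,
            "node-01:2181,node-02:2181,node-03:2181",
            "shenzhen",
            Map("shenzhen_itcast" -> 3),
            StorageLevel.MEMORY_AND_DISK_SER)
          kafkaDstream.map(_._2).count().print()
          ssc.start()
          ssc.awaitTermination()
        }
      }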
      
  • /**
      * Direct mode: offsets are read from the Kafka topic partitions directly (no receiver, no ZooKeeper);
      * the application itself is responsible for tracking and storing them.
      */
    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils, OffsetRange}
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    
    object SparkStreamingKafkaDriect_8 {
      def main(args: Array[String]): Unit = {
        //创建StreamingContext
        val conf = new SparkConf().setMaster("local[*]").setAppName("SparkStreamingFlume")
        val context = new StreamingContext(conf, Seconds(10))
        context.sparkContext.setLogLevel("OFF")
        // consume Kafka data in Direct mode
        /**
          * ssc: StreamingContext,
          * kafkaParams: Map[String, String],
          * topics: Set[String]
          */
        val kafkaParams: Map[String, String] = Map(
          "metadata.broker.list" -> "node-01:9092,node-02:9092,node-03:9092",
          "groupId" -> "shenzhen"
        )
        val topics: Set[String] = Set("shenzhen_itcast")
        val directStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](context, kafkaParams, topics)
    
    
        var offsetRanges = Array.empty[OffsetRange]
    
        directStream.transform { rdd =>
          offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
          rdd
        }.foreachRDD { rdd =>
    
          for (o <- offsetRanges) {
            println(s"${o.topic} ${o.partition} ${o.fromOffset} ${o.untilOffset}")
          }
          // in practice the per-partition offsets are persisted to Redis; see the sketch after this example
        }
        context.start()
        context.awaitTermination()
      }
    }
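    
    A hedged sketch of persisting those offsets to Redis with the Jedis client; the jedis dependency, the Redis host/port, and the key layout are assumptions, not part of the course environment. saveOffsets("shenzhen", offsetRanges) could be called inside the foreachRDD above once a batch is processed, and on restart the stored values could be read back and passed to createDirectStream through its fromOffsets overload.
    
      import org.apache.spark.streaming.kafka.OffsetRange
      import redis.clients.jedis.Jedis
      
      object RedisOffsetStore {
        // store each partition's untilOffset in one Redis hash per consumer group and topic
        def saveOffsets(group: String, ranges: Array[OffsetRange]): Unit = {
          val jedis = new Jedis("node-01", 6379)   // assumed Redis host and port
          try {
            ranges.foreach { o =>
              jedis.hset(s"kafka:offsets:$group:${o.topic}", o.partition.toString, o.untilOffset.toString)
            }
          } finally {
            jedis.close()
          }
        }
      }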
    
  • Based on Kafka 0.10 (the spark-streaming-kafka-0-10 integration)
    
    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.{SparkConf, TaskContext}
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka010._
    import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
    import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
    
    object SparkStreamingKafka10 {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setMaster("local[*]").setAppName("SparkStreamingFlume")
        val context = new StreamingContext(conf, Seconds(10))
        context.sparkContext.setLogLevel("OFF")
        val kafkaParams = Map[String, Object](
          "bootstrap.servers" -> "node-01:9092,node-02:9092,node-03:9092",
          "key.deserializer" -> classOf[StringDeserializer],
          "value.deserializer" -> classOf[StringDeserializer],
          "group.id" -> "use_a_separate_group_id_for_each_stream",
          "auto.offset.reset" -> "latest",  //从哪里开始消费
          "enable.auto.commit" -> (false: java.lang.Boolean)  //设置不自动提交offset
        )
    
        val topics = Array("shenzhen_itcast")
        val stream = KafkaUtils.createDirectStream[String, String](
          context,
          PreferConsistent,
          Subscribe[String, String](topics, kafkaParams)
        )
        stream.foreachRDD { rdd =>
          val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
          rdd.foreachPartition { iter =>
            val o: OffsetRange = offsetRanges(TaskContext.get.partitionId)
            println(s"${o.topic} ${o.partition} ${o.fromOffset} ${o.untilOffset}")
          }
          // commit the offsets back to Kafka; in production offsets are often managed externally (e.g. in Redis)
          stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
        }
        context.start()
        context.awaitTermination()
    /*
    shenzhen_itcast 2 10 10
    shenzhen_itcast 0 11 11
    shenzhen_itcast 1 11 11
     */
      }
    }
    
  • Delivery semantics when consuming data
    • At-least-once (records may be processed more than once)
      • With receiver-based consumption, data is read into SparkStreaming but the offset has not been committed yet; after a failure the same records can be read again.
      • To avoid losing received data, enable the WAL (write-ahead log) so that received records are also written to HDFS; combined with idempotent output this is the basis for exactly-once processing.
    • At-most-once (records may be lost)
      • With receiver-based consumption, the program exits abnormally before the data is processed, but the offset has already been committed.
    • Exactly-once
      • Comparable to a transaction in a relational database: the output and the offset update must succeed or fail together (a hedged sketch follows this list).
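      
      A hedged sketch of that pattern with the Kafka 0.10 direct stream: make the write idempotent and commit the offsets only after the write succeeds. saveBatchIdempotently is a hypothetical placeholder for an idempotent sink (for example an upsert keyed by topic, partition, and offset), not a real API, and `stream` is assumed to be the InputDStream created in the Kafka 0.10 example above.
      
        import org.apache.kafka.clients.consumer.ConsumerRecord
        import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}
        
        // hypothetical idempotent sink: replaying a batch after a failure overwrites the same rows
        def saveBatchIdempotently(records: Iterator[ConsumerRecord[String, String]]): Unit =
          records.foreach(r => println(s"${r.topic}-${r.partition}-${r.offset}: ${r.value}"))
        
        stream.foreachRDD { rdd =>
          val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
          rdd.foreachPartition(records => saveBatchIdempotently(records))
          // commit only after the write succeeded: a crash before this line replays the batch,
          // and the idempotent write makes the replay harmless
          stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
        }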