Ways to Create a Spark Streaming DStream

1. Creating a DStream from an RDD Queue

For testing, you can create a DStream with ssc.queueStream(queueOfRDDs); every RDD pushed into the queue is consumed as one batch of the resulting stream.

Example

  import org.apache.spark.SparkConf
  import org.apache.spark.rdd.RDD
  import org.apache.spark.streaming.dstream.{DStream, InputDStream}
  import org.apache.spark.streaming.{Seconds, StreamingContext}

  import scala.collection.mutable

  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setMaster("local[*]").setAppName("SparkStreaming")
    // StreamingContext takes two arguments: the SparkConf and the micro-batch interval (here 3 seconds)
    val ssc = new StreamingContext(sparkConf, Seconds(3))
    // Declare the queue that RDDs will be pushed into
    val rddQueue = new mutable.Queue[RDD[Int]]()
    // oneAtATime: if true (the default), at most one RDD is dequeued per batch interval;
    // if false, all RDDs currently in the queue are merged into one batch
    val inputStream: InputDStream[Int] = ssc.queueStream(rddQueue, oneAtATime = false)
    val mapStream: DStream[(Int, Int)] = inputStream.map((_, 1))
    val reduceStream: DStream[(Int, Int)] = mapStream.reduceByKey(_ + _)
    reduceStream.print()

    // Start the receiver
    ssc.start()
    for (i <- 1 to 5) {
      // Push an RDD into the queue
      rddQueue += ssc.sparkContext.makeRDD(seq = 1 to 5, numSlices = 10)
      Thread.sleep(2000)
    }
    // Wait for the streaming context to terminate
    ssc.awaitTermination()
  }

Sample output. Because RDDs are enqueued every 2 s while the batch interval is 3 s (and oneAtATime = false), some batches consume two queued RDDs (counts of 2) and others only one (counts of 1):

-------------------------------------------
Time: 1650099129000 ms
-------------------------------------------
(4,2)
(1,2)
(5,2)
(2,2)
(3,2)

-------------------------------------------
Time: 1650099132000 ms
-------------------------------------------
(4,1)
(1,1)
(5,1)
(2,1)
(3,1)

-------------------------------------------
Time: 1650099135000 ms
-------------------------------------------
(4,2)
(1,2)
(5,2)
(2,2)
(3,2)
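For comparison, with the default oneAtATime = true each batch dequeues at most one RDD, so every non-empty batch would print counts of 1. A minimal sketch of the only line that changes:

    // With the default oneAtATime = true, each 3 s batch takes at most one RDD from the queue
    val inputStream: InputDStream[Int] = ssc.queueStream(rddQueue, oneAtATime = true)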

2. Custom Data Source

To build a custom data source, extend Receiver and implement the onStart and onStop methods.

Implementation

  import java.io.{BufferedReader, InputStreamReader}
  import java.net.{ConnectException, Socket}
  import java.nio.charset.StandardCharsets

  import org.apache.spark.SparkConf
  import org.apache.spark.storage.StorageLevel
  import org.apache.spark.streaming.dstream.ReceiverInputDStream
  import org.apache.spark.streaming.receiver.Receiver
  import org.apache.spark.streaming.{Seconds, StreamingContext}

  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setMaster("local[*]").setAppName("SparkStreaming")
    // StreamingContext takes two arguments: the SparkConf and the micro-batch interval (here 3 seconds)
    val ssc = new StreamingContext(sparkConf, Seconds(3))

    val line: ReceiverInputDStream[String] = ssc.receiverStream(new MyReceiver("hadoop103", 9999))
    line
      .flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKey(_ + _)
      .print()
    // Start the receiver
    ssc.start()

    // Wait for the streaming context to terminate
    ssc.awaitTermination()
  }

  /**
   * Custom data receiver:
   * extend Receiver with a type parameter for the element type,
   * pass the connection parameters through the constructor,
   * and choose a storage level for the received data.
   */
  private class MyReceiver(host: String, port: Int) extends Receiver[String](StorageLevel.MEMORY_ONLY) {
    private var socket: Socket = _

    override def onStart(): Unit = {
      // onStart must return immediately, so read the socket on a separate daemon thread
      new Thread("Socket Receiver") {
        setDaemon(true)

        override def run(): Unit = {
          receive()
        }
      }.start()
    }

    def receive(): Unit = {
      try {
        // Connect to the socket and read it line by line
        socket = new Socket(host, port)
        val bf: BufferedReader = new BufferedReader(new InputStreamReader(socket.getInputStream, StandardCharsets.UTF_8))
        var line: String = bf.readLine()
        while (line != null) {
          // store() is provided by Receiver; it hands the line to Spark for buffering
          store(line)
          line = bf.readLine()
        }
      } catch {
        case e: ConnectException =>
          restart(s"Error connecting to $host:$port...", e)
      }
    }

    override def onStop(): Unit = {
      synchronized {
        if (socket != null) {
          socket.close()
          socket = null
        }
      }
    }
  }
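To try this receiver, start a plain socket server on hadoop103, for example with `nc -lk 9999`, type some space-separated words, and watch the word counts printed every 3 seconds. Note that a running receiver permanently occupies one core, so the local master must provide at least two threads (local[2] or, as above, local[*]).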

3. Kafka Source

ReceiverAPI: a dedicated Executor receives the data and then passes it to other Executors for computation. The problem is that the receiving Executor and the computing Executors run at different speeds; in particular, when the receiver is faster than the computation, the computing nodes can run out of memory. DirectAPI: the computing Executors actively consume the data from Kafka themselves, so each one controls its own rate.

Implementation

  import org.apache.kafka.clients.consumer.{ConsumerConfig, ConsumerRecord}
  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.dstream.InputDStream
  import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}
  import org.apache.spark.streaming.{Seconds, StreamingContext}

  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setMaster("local[*]").setAppName("SparkStreaming")
    // StreamingContext takes two arguments: the SparkConf and the micro-batch interval (here 3 seconds)
    val ssc: StreamingContext = new StreamingContext(sparkConf, Seconds(3))

    val kafkaPara: Map[String, Object] = Map[String, Object](
      // Kafka cluster host:port list
      ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> "hadoop103:9092,hadoop104:9092,hadoop105:9092",
      ConsumerConfig.GROUP_ID_CONFIG -> "hui",
      ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG -> "org.apache.kafka.common.serialization.StringDeserializer",
      ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG -> "org.apache.kafka.common.serialization.StringDeserializer")
    // Read data from Kafka
    val kfkDataDS: InputDStream[ConsumerRecord[String, String]] = KafkaUtils.createDirectStream[String, String](
      ssc, // streaming context
      LocationStrategies.PreferConsistent, // location strategy: distribute partitions evenly across executors
      ConsumerStrategies.Subscribe[String, String](
        Set("tbg"), // consumer strategy: subscribe to the Kafka topic "tbg"
        kafkaPara // Kafka consumer configuration
      ))
    kfkDataDS
      .flatMap(_.value().split(" "))
      .map((_, 1))
      .reduceByKey(_ + _)
      .print()

    /**
     * Handy shell commands for testing:
     * bin/kafka-topics.sh --bootstrap-server hadoop103:9092 --list
     * bin/kafka-console-producer.sh --bootstrap-server hadoop103:9092 --topic tbg
     */
    // Start the receiver
    ssc.start()

    // Wait for the streaming context to terminate
    ssc.awaitTermination()
  }
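Since the direct API leaves consumption under the consumers' control, you can also track offsets yourself. A minimal sketch, assuming the same kfkDataDS stream as above and placed inside main before ssc.start(): it reads each batch's offset ranges and commits them back to Kafka after the batch is processed (HasOffsetRanges and CanCommitOffsets come from the spark-streaming-kafka-0-10 module).

    import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}

    kfkDataDS.foreachRDD { rdd =>
      // RDDs produced by createDirectStream carry their Kafka offset ranges
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      // ... process the batch here ...
      // Commit the offsets asynchronously once the batch has been handled
      kfkDataDS.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
    }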