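Ways to Create a DStream in Spark Streaming
1. Creating a DStream from an RDD queue
For testing, you can create a DStream with ssc.queueStream(queueOfRDDs). Each RDD pushed into the queue is processed as a batch of data in that DStream.
How to create it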
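import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.dstream.{DStream, InputDStream}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import scala.collection.mutable

def main(args: Array[String]): Unit = {
  val sparkConf = new SparkConf().setMaster("local[*]").setAppName("SparkStreaming")
  // StreamingContext takes two arguments: the Spark configuration and the micro-batch interval (3 seconds here)
  val ssc = new StreamingContext(sparkConf, Seconds(3))
  // Declare the queue of RDDs
  val rddQueue = new mutable.Queue[RDD[Int]]()
  // oneAtATime = true (the default) consumes one RDD per batch interval;
  // oneAtATime = false consumes every RDD currently in the queue
  val inputStream: InputDStream[Int] = ssc.queueStream(rddQueue, oneAtATime = false)
  val mapStream: DStream[(Int, Int)] = inputStream.map((_, 1))
  val reduceStream: DStream[(Int, Int)] = mapStream.reduceByKey(_ + _)
  reduceStream.print()
  // Start the streaming computation
  ssc.start()
  for (_ <- 1 to 5) {
    // Push an RDD into the queue every 2 seconds; the queue is also read by the streaming thread
    rddQueue.synchronized {
      rddQueue += ssc.sparkContext.makeRDD(seq = 1 to 5, numSlices = 10)
    }
    Thread.sleep(2000)
  }
  // Block until the streaming computation terminates
  ssc.awaitTermination()
}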
Output
-------------------------------------------
Time: 1650099129000 ms
-------------------------------------------
(4,2)
(1,2)
(5,2)
(2,2)
(3,2)
-------------------------------------------
Time: 1650099132000 ms
-------------------------------------------
(4,1)
(1,1)
(5,1)
(2,1)
(3,1)
-------------------------------------------
Time: 1650099135000 ms
-------------------------------------------
(4,2)
(1,2)
(5,2)
(2,2)
(3,2)
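The alternating counts (2, 1, 2) follow from the timing in the example: an RDD is pushed every 2 seconds, a batch closes every 3 seconds, and with oneAtATime = false each batch takes every RDD queued since the previous one. The sketch below reproduces that arithmetic in plain Scala, without Spark. The object name, the helper rddsPerBatch, and the assumption that the first batch closes about 2.5 seconds after the first push are all illustrative; the real offset depends on when the program happens to start relative to the batch boundaries.

```scala
object QueueBatchSketch {
  /** Counts how many pushed RDDs fall into each batch window (start, end]. */
  def rddsPerBatch(pushTimesMs: Seq[Long], firstBatchEndMs: Long, batchMs: Long, batches: Int): Seq[Int] =
    (0 until batches).map { i =>
      val end = firstBatchEndMs + i * batchMs
      val start = end - batchMs
      pushTimesMs.count(t => t > start && t <= end)
    }

  def main(args: Array[String]): Unit = {
    // Five pushes, 2 seconds apart, as in the loop with Thread.sleep(2000)
    val pushes = (0 until 5).map(_ * 2000L)
    // Assumed offset: the first 3-second batch closes 2.5 s after the first push
    val counts = rddsPerBatch(pushes, firstBatchEndMs = 2500L, batchMs = 3000L, batches = 3)
    println(counts.mkString(", ")) // prints "2, 1, 2"
  }
}
```

Because every pushed RDD contains the values 1 to 5 exactly once, the count printed for each key in a batch equals the number of RDDs that batch picked up.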
2. Custom data source
To collect data from a custom source, extend Receiver and implement its onStart and onStop methods.
Implementation
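import java.io.{BufferedReader, InputStreamReader}
import java.net.{ConnectException, Socket}
import java.nio.charset.StandardCharsets
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.dstream.ReceiverInputDStream
import org.apache.spark.streaming.receiver.Receiver
import org.apache.spark.streaming.{Seconds, StreamingContext}

def main(args: Array[String]): Unit = {
  val sparkConf = new SparkConf().setMaster("local[*]").setAppName("SparkStreaming")
  // StreamingContext takes two arguments: the Spark configuration and the micro-batch interval (3 seconds here)
  val ssc = new StreamingContext(sparkConf, Seconds(3))
  val line: ReceiverInputDStream[String] = ssc.receiverStream(new myReceiver("hadoop103", 9999))
  line
    .flatMap(_.split(" "))
    .map((_, 1))
    .reduceByKey(_ + _)
    .print()
  // Start the streaming computation
  ssc.start()
  // Block until the streaming computation terminates
  ssc.awaitTermination()
}

/**
 * Custom data receiver.
 * Extends Receiver, fixes the element type, and passes the storage level to the parent constructor.
 */
private class myReceiver(host: String, port: Int) extends Receiver[String](StorageLevel.MEMORY_ONLY) {
  private var socket: Socket = _

  // onStart must not block, so the reading happens on a separate daemon thread
  override def onStart(): Unit = {
    new Thread("socket Receiver") {
      setDaemon(true)
      override def run(): Unit = receiver()
    }.start()
  }

  def receiver(): Unit = {
    try {
      // Connect to the port and read it line by line
      socket = new Socket(host, port)
      val bf = new BufferedReader(new InputStreamReader(socket.getInputStream, StandardCharsets.UTF_8))
      // In Scala an assignment evaluates to Unit, so read before the loop and again at the end of each pass
      var line: String = bf.readLine()
      while (!isStopped() && line != null) {
        // store() is provided by Receiver and hands the record to Spark
        store(line)
        line = bf.readLine()
      }
    } catch {
      case e: ConnectException =>
        restart(s"Error connecting to $host:$port", e)
      case t: Throwable =>
        restart("Error receiving data", t)
    }
  }

  override def onStop(): Unit = {
    synchronized {
      if (socket != null) {
        socket.close()
        socket = null
      }
    }
  }
}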
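One detail of the receiver worth isolating is the line-reading loop. In Scala an assignment evaluates to Unit, so the Java idiom while ((line = reader.readLine()) != null) never compares the line itself against null and does not stop at the end of the stream. A minimal sketch of one alternative, using Iterator.continually, is shown below; the object name ReadLinesSketch and the helper readAll are illustrative, and the StringReader stands in for the socket input stream.

```scala
import java.io.{BufferedReader, StringReader}

object ReadLinesSketch {
  /** Reads lines until readLine() returns null, which signals end of stream. */
  def readAll(reader: BufferedReader): List[String] =
    Iterator.continually(reader.readLine()).takeWhile(_ != null).toList

  def main(args: Array[String]): Unit = {
    // A StringReader replaces the socket so the sketch runs without a server
    val reader = new BufferedReader(new StringReader("hello spark\nhello streaming\n"))
    readAll(reader).foreach(println)
  }
}
```

Inside a receiver, each line would be passed to store(line) as it arrives rather than collected into a list.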
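3. Kafka data source
Receiver API: a dedicated Executor receives the data and then forwards it to other Executors for computation. The drawback is that the receiving Executor and the computing Executors run at different speeds. In particular, when the receiver ingests data faster than the computing Executors can process it, the computing nodes can run out of memory.
Direct API: the computing Executors consume the data from Kafka themselves, so they control the consumption rate.
Implementation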
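As a rough illustration of controlling the consumption rate, Spark exposes configuration properties for it. The fragment below is a sketch only: the object name and the value 1000 are hypothetical, and suitable values depend on the workload. spark.streaming.backpressure.enabled lets Spark adjust the ingestion rate based on recent batch processing times, and spark.streaming.kafka.maxRatePerPartition caps the records read per second from each Kafka partition when the direct stream is used.

```scala
import org.apache.spark.SparkConf

object RateLimitSketch {
  // Hypothetical values; tune them for the actual workload
  val sparkConf: SparkConf = new SparkConf()
    .setMaster("local[*]")
    .setAppName("SparkStreaming")
    // Let Spark adapt the ingestion rate to how fast batches are processed
    .set("spark.streaming.backpressure.enabled", "true")
    // Upper bound on records per second read from each Kafka partition
    .set("spark.streaming.kafka.maxRatePerPartition", "1000")
}
```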
import org.apache.kafka.clients.consumer.{ConsumerConfig, ConsumerRecord}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.InputDStream
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}
import org.apache.spark.streaming.{Seconds, StreamingContext}

def main(args: Array[String]): Unit = {
  val sparkConf = new SparkConf().setMaster("local[*]").setAppName("SparkStreaming")
  // StreamingContext takes two arguments: the Spark configuration and the micro-batch interval (3 seconds here)
  val ssc: StreamingContext = new StreamingContext(sparkConf, Seconds(3))
  val kafkaPara: Map[String, Object] = Map[String, Object](
    // host:port of the brokers in the Kafka cluster
    ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> "hadoop103:9092,hadoop104:9092,hadoop105:9092",
    ConsumerConfig.GROUP_ID_CONFIG -> "hui",
    "key.deserializer" -> "org.apache.kafka.common.serialization.StringDeserializer",
    "value.deserializer" -> "org.apache.kafka.common.serialization.StringDeserializer")
  // Read data from Kafka
  val kfkDataDS: InputDStream[ConsumerRecord[String, String]] =
    KafkaUtils.createDirectStream[String, String](
      ssc, // the StreamingContext
      LocationStrategies.PreferConsistent, // location strategy: spread partitions evenly across the available executors
      // consumer strategy: subscribe to the topic "tbg" using the Kafka configuration above
      ConsumerStrategies.Subscribe[String, String](Set("tbg"), kafkaPara))
  kfkDataDS
    .flatMap(_.value().split(" "))
    .map((_, 1))
    .reduceByKey(_ + _)
    .print()
  /**
   * List the topics, then produce test messages to "tbg":
   * bin/kafka-topics.sh --bootstrap-server hadoop103:9092 --list
   * bin/kafka-console-producer.sh --bootstrap-server hadoop103:9092 --topic tbg
   */
  // Start the streaming computation
  ssc.start()
  // Block until the streaming computation terminates
  ssc.awaitTermination()
}
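For intuition about what the flatMap / map / reduceByKey pipeline computes within a single batch, here is a rough equivalent on plain Scala collections, with no Spark or Kafka involved. The object name, the helper wordCount, and the sample messages are made up for illustration; groupBy followed by a size count stands in for reduceByKey(_ + _).

```scala
object WordCountSketch {
  /** Counts words across the messages of one batch, splitting on single spaces. */
  def wordCount(messages: Seq[String]): Map[String, Int] =
    messages
      .flatMap(_.split(" "))                    // one element per word
      .groupBy(identity)                        // word -> all its occurrences
      .map { case (word, occurrences) => (word, occurrences.size) }

  def main(args: Array[String]): Unit = {
    // Hypothetical messages typed into the console producer during one batch
    val batch = Seq("hello spark", "hello kafka")
    wordCount(batch).foreach(println)
  }
}
```

In the streaming job the same counting restarts for every 3-second batch, so the counts are per batch rather than cumulative.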