spark版本定制十八:Spark Streaming中空RDD处理及流处理程序优雅的停止
本期内容:
1、Spark Streaming中RDD的空处理
2、StreamingContext程序的停止
一、Spark Streaming中RDD的空处理
案例代码:
Scala代码:
package com.dt.spark.sparkstreaming
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
/**
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
/**
* 使用Scala开发集群运行的Spark Streaming的foreachRDD把处理后的数据写入外部存储系统中
*
*
* 背景描述:在广告点击计费系统中,我们在线过滤掉黑名单的点击,进而保护广告商的利益,只进行有效的广告点击计费
* 或者在防刷评分(或者流量)系统,过滤掉无效的投票或者评分或者流量;
* 实现技术:使用transform Api直接基于RDD编程,进行join操作
*
*/
object OnlineForeachRDD2DB {
def main(args: Array[String]){
val conf = new SparkConf() //创建SparkConf对象
conf.setAppName("OnlineWordcount") //设置应用程序的名称,在程序运行的监控界面可以看到名称
conf.setMaster("spark://Master:7077") //此时,程序在Spark集群
/**
* 设置batchDuration时间间隔来控制Job生成的频率并且创建Spark Streaming执行的入口
*/
val ssc = new StreamingContext(conf, Seconds(300))
val lines = ssc.socketTextStream("Master", 9999)
val words = lines.flatMap(line => line.split(" "))
val wordCounts = words.map(word => (word,1)).reduceByKey(_ + _)
wordCounts.foreachRDD{ rdd =>
/**
* 例如:rdd为空,rdd为空会产生什么问题呢?
* rdd没有任何元素,但是也会做做foreachPartition,也会进行写数据库的操作或者把数据写到HDFS上,
* rdd里面没有任何记录,但是还会获取计算资源,然后计算一下,消耗计算资源,这个时候纯属浪费资源,
* 所以必须对空rdd进行处理;
*
* 背景描述:在广告点击计费系统中,我们在线过滤掉黑名单的点击,进而保护广告商的利益,只进行有效的广告点击计费
* 或者在防刷评分(或者流量)系统,过滤掉无效的投票或者评分或者流量;
* 实现技术:使用transform Api直接基于RDD编程,进行join操作
*
*/
object OnlineForeachRDD2DB {
def main(args: Array[String]){
val conf = new SparkConf() //创建SparkConf对象
conf.setAppName("OnlineWordcount") //设置应用程序的名称,在程序运行的监控界面可以看到名称
conf.setMaster("spark://Master:7077") //此时,程序在Spark集群
/**
* 设置batchDuration时间间隔来控制Job生成的频率并且创建Spark Streaming执行的入口
*/
val ssc = new StreamingContext(conf, Seconds(300))
val lines = ssc.socketTextStream("Master", 9999)
val words = lines.flatMap(line => line.split(" "))
val wordCounts = words.map(word => (word,1)).reduceByKey(_ + _)
wordCounts.foreachRDD{ rdd =>
/**
* 例如:rdd为空,rdd为空会产生什么问题呢?
* rdd没有任何元素,但是也会做做foreachPartition,也会进行写数据库的操作或者把数据写到HDFS上,
* rdd里面没有任何记录,但是还会获取计算资源,然后计算一下,消耗计算资源,这个时候纯属浪费资源,
* 所以必须对空rdd进行处理;
*
* 例如:使用rdd.count()>0,但是rdd.count()会触发一个Job;
* 使用rdd.isEmpty()的时候,take也会触发Job;
* def isEmpty(): Boolean = withScope {
* partitions.length == 0 || take(1).length == 0
* }
*
* rdd.partitions.isEmpty里判断的是length是否等于0,就代表是否有partition
* def isEmpty: Boolean = { length == 0 }
*
*
* 注:rdd.isEmpty()和rdd.partitions.isEmpty是两种概念;
*/
rdd.partitions.isEmpty
if(rdd.isEmpty()) {
rdd.foreachPartition{ partitonOfRecord =>
val connection = ConnectionPool.getConnection()
partitonOfRecord.foreach(record => {
val sql = "insert into streaming_itemcount(item,rcount) values('" + record._1 + "'," + record._2 + ")"
val stmt = connection.createStatement()
stmt.executeUpdate(sql)
stmt.close()
})
ConnectionPool.returnConnection(connection)
}}
}
* rdd.partitions.isEmpty里判断的是length是否等于0,就代表是否有partition
* def isEmpty: Boolean = { length == 0 }
*
*
* 注:rdd.isEmpty()和rdd.partitions.isEmpty是两种概念;
*/
rdd.partitions.isEmpty
if(rdd.isEmpty()) {
rdd.foreachPartition{ partitonOfRecord =>
val connection = ConnectionPool.getConnection()
partitonOfRecord.foreach(record => {
val sql = "insert into streaming_itemcount(item,rcount) values('" + record._1 + "'," + record._2 + ")"
val stmt = connection.createStatement()
stmt.executeUpdate(sql)
stmt.close()
})
ConnectionPool.returnConnection(connection)
}}
}
ssc.start()
ssc.awaitTermination()
}
}
ssc.awaitTermination()
}
}
二、StreamingContext程序的停止
第一种停止方式是不管接受到数据是否处理完成,直接被停止掉,第二种方式是接受到数据全部处理完成才停止掉,一般采用第二种方式。
第一种停止方式:
/**
* Stop the execution of the streams immediately (does not wait for all received data
* to be processed). By default, if `stopSparkContext` is not specified, the underlying
* SparkContext will also be stopped. This implicit behavior can be configured using the
* SparkConf configuration spark.streaming.stopSparkContextByDefault.
*
* 把streams的执行直接停止掉(并不会等待所有接受到的数据处理完成),默认情况下SparkContext也会被停止掉,
* 隐式的行为可以做配置,配置参数为spark.streaming.stopSparkContextByDefault。
*
* @param stopSparkContext If true, stops the associated SparkContext. The underlying SparkContext
* will be stopped regardless of whether this StreamingContext has been
* started.
*/
def stop(
stopSparkContext: Boolean = conf.getBoolean("spark.streaming.stopSparkContextByDefault", true)
): Unit = synchronized {
stop(stopSparkContext, false)
* Stop the execution of the streams immediately (does not wait for all received data
* to be processed). By default, if `stopSparkContext` is not specified, the underlying
* SparkContext will also be stopped. This implicit behavior can be configured using the
* SparkConf configuration spark.streaming.stopSparkContextByDefault.
*
* 把streams的执行直接停止掉(并不会等待所有接受到的数据处理完成),默认情况下SparkContext也会被停止掉,
* 隐式的行为可以做配置,配置参数为spark.streaming.stopSparkContextByDefault。
*
* @param stopSparkContext If true, stops the associated SparkContext. The underlying SparkContext
* will be stopped regardless of whether this StreamingContext has been
* started.
*/
def stop(
stopSparkContext: Boolean = conf.getBoolean("spark.streaming.stopSparkContextByDefault", true)
): Unit = synchronized {
stop(stopSparkContext, false)
}
第二种停止方式:
/**
* Stop the execution of the streams, with option of ensuring all received data
* has been processed.
*
* Stop the execution of the streams, with option of ensuring all received data
* has been processed.
*
* 所有接受到的数据全部被处理完成,才把streams的执行停止掉
*
* @param stopSparkContext if true, stops the associated SparkContext. The underlying SparkContext
* will be stopped regardless of whether this StreamingContext has been
* started.
* @param stopGracefully if true, stops gracefully by waiting for the processing of all
* received data to be completed
*/
def stop(stopSparkContext: Boolean, stopGracefully: Boolean): Unit = {
var shutdownHookRefToRemove: AnyRef = null
if (AsynchronousListenerBus.withinListenerThread.value) {
throw new SparkException("Cannot stop StreamingContext within listener thread of" +
" AsynchronousListenerBus")
}
synchronized {
try {
state match {
case INITIALIZED =>
logWarning("StreamingContext has not been started yet")
case STOPPED =>
logWarning("StreamingContext has already been stopped")
case ACTIVE =>
scheduler.stop(stopGracefully)
// Removing the streamingSource to de-register the metrics on stop()
env.metricsSystem.removeSource(streamingSource)
uiTab.foreach(_.detach())
StreamingContext.setActiveContext(null)
waiter.notifyStop()
if (shutdownHookRef != null) {
shutdownHookRefToRemove = shutdownHookRef
shutdownHookRef = null
}
logInfo("StreamingContext stopped successfully")
}
} finally {
// The state should always be Stopped after calling `stop()`, even if we haven't started yet
state = STOPPED
}
}
if (shutdownHookRefToRemove != null) {
ShutdownHookManager.removeShutdownHook(shutdownHookRefToRemove)
}
// Even if we have already stopped, we still need to attempt to stop the SparkContext because
// a user might stop(stopSparkContext = false) and then call stop(stopSparkContext = true).
if (stopSparkContext) sc.stop()
}
* @param stopSparkContext if true, stops the associated SparkContext. The underlying SparkContext
* will be stopped regardless of whether this StreamingContext has been
* started.
* @param stopGracefully if true, stops gracefully by waiting for the processing of all
* received data to be completed
*/
def stop(stopSparkContext: Boolean, stopGracefully: Boolean): Unit = {
var shutdownHookRefToRemove: AnyRef = null
if (AsynchronousListenerBus.withinListenerThread.value) {
throw new SparkException("Cannot stop StreamingContext within listener thread of" +
" AsynchronousListenerBus")
}
synchronized {
try {
state match {
case INITIALIZED =>
logWarning("StreamingContext has not been started yet")
case STOPPED =>
logWarning("StreamingContext has already been stopped")
case ACTIVE =>
scheduler.stop(stopGracefully)
// Removing the streamingSource to de-register the metrics on stop()
env.metricsSystem.removeSource(streamingSource)
uiTab.foreach(_.detach())
StreamingContext.setActiveContext(null)
waiter.notifyStop()
if (shutdownHookRef != null) {
shutdownHookRefToRemove = shutdownHookRef
shutdownHookRef = null
}
logInfo("StreamingContext stopped successfully")
}
} finally {
// The state should always be Stopped after calling `stop()`, even if we haven't started yet
state = STOPPED
}
}
if (shutdownHookRefToRemove != null) {
ShutdownHookManager.removeShutdownHook(shutdownHookRefToRemove)
}
// Even if we have already stopped, we still need to attempt to stop the SparkContext because
// a user might stop(stopSparkContext = false) and then call stop(stopSparkContext = true).
if (stopSparkContext) sc.stop()
}
特别感谢王家林老师的独具一格的讲解:
王家林老师名片:
中国Spark第一人
新浪微博:http://weibo.com/ilovepains
微信公众号:DT_Spark
博客:http://blog.sina.com.cn/ilovepains
QQ:1740415547
YY课堂:每天20:00现场授课频道68917580
posted on 2016-06-02 16:33 Harvey.Sun 阅读(716) 评论(0) 编辑 收藏 举报