Flume+Kafka+SparkStreaming+Hbase+可视化(三)

SparkStreaming
  1).概述:可扩展、高可用、容错性
        一站式解决方案
    2).原理
    粗粒度:Spark Streaming 接受实时流数据,将数据切分为批次数据,交由Spark Engine处理数据。组RDD,微批处理。
    细粒度:
  3).核心
  StreamingContext:
    start 之后不能 add StreamingContext
    Stop 之后不能 restart
    一个JVM仅仅能有一个StreamingContext
    stop(false) 能仅仅停止StreamingContext而不停止SparkContext。stop() 两者全部停止
    ssc = new StreamingContext(conf, Seconds(2))
  Dstream:
    来源于Dstrem或者Source
    一系列的RDD,每个批次
    Dstream操作底层为RDD操作
    Input Dstream & Receviers:
  sources,除了文件系统源,所有的input dstream都需要关联Receivers(ReceiverDstream)
  基于Receiver的Dstream线程数必须大于 Receiver数量
  transform Output Operator
 
  4).带状态的算子 updateStateByKey
  
package com.spark.train

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

/*
spark streaming 有状态的WordCount
updateStateByKey
 */

object statefulWordCount {

  def main(args: Array[String]): Unit = {

    val conf  = new SparkConf().setMaster("local[2]").setAppName("updateStateByKey for wordCount")
    val ssc = new StreamingContext(conf, Seconds(4))

    // need a checkPoint
    ssc.checkpoint("/opt/datas/spark_data/updateStateByKey1")

    val lines = ssc.socketTextStream("localhost", 9999)
    val words = lines.flatMap(_.split(" ")).map((_, 1))
    val wordCount = words.updateStateByKey[Int](updateFunc _)

    wordCount.print()


    ssc.start()
    ssc.awaitTermination()

  }

  /** updateStateByKey update function
    * currentValues: new count
    * preValues: previous count
    *
    * Some(): som
    */
  def updateFunc(currentValues:Seq[Int], preValuse:Option[Int]):Option[Int] = {

    val newCount = currentValues.sum
    val preCount = preValuse.getOrElse(0)
    Some(newCount + preCount)
  }
}

  

  5).黑名单过滤
  
package com.spark.train

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}


/**
  * 黑名单过滤
  */

object filterBlackListSparkStreaming {

  def main(args: Array[String]): Unit = {

    val conf = new SparkConf().setMaster("local[2]").setAppName("filter black list")
    val ssc = new StreamingContext(conf, Seconds(5))

    /**
      * 黑名单转 RDD
      *
      */
    val blackList = List("ls","zs")
    val rddBlackList = ssc.sparkContext.parallelize(blackList).map(x => (x, true))

    val logs = ssc.socketTextStream("localhost", 9999)
      .map(x => (x.split(",")(1), x))
      .transform(rdd => {
        rdd.leftOuterJoin(rddBlackList)
          .filter(x => x._2._2.getOrElse(false) != true)
          .map(x => x._2._1)
      })

    logs.print()

    ssc.start()
    ssc.awaitTermination()
  }
}

  

  Flume和Spark Streaming集成两种方式:代码如下:
  
package com.spark.test

import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.flume.FlumeUtils

object flumeSparkStreaming {
  
  def main(args: Array[String]): Unit = {

    val conf = new SparkConf().setMaster("local[2]").setAppName("flumeDstream")
    val ssc = new StreamingContext(conf, Seconds(5))

    /** 第一种方式
      * define flumeDstream
      * 先启动spark streaming ,在启动flume,最好telnet 开始wordCount
      */
    val flumeDstream = FlumeUtils.createStream(ssc, "bigdata", 3333)
    val wordCnt = flumeDstream.map(x => new String(x.event.getBody.array()).trim())
        .flatMap(_.split(" "))
        .map((_,1)).reduceByKey(_ + _)

    wordCnt.print()

    
    /**第二种方式
      * createPollingStream
      * 需要重新设置flume sink为 org.apache.spark.streaming.flume.sink.SparkSink
      * 其他配置一样,需要先启动flume,数据会缓存,然后streaming 去pull 数据
      */
//    val XXX = FlumeUtils.createPollingStream(ssc, "bigdata", 3333)
//      .map(x => new String(x.event.getBody.array()).trim)
//      .flatMap(_.split(""))
//      .map((_, 1))
//      .reduceByKey(_ + _)

    ssc.start()
    ssc.awaitTermination()
  }
}

posted @ 2018-09-18 14:39  麻辣小虾  阅读(836)  评论(0编辑  收藏  举报