Mr、Hive、Spark、SparkStreaming实现wordCount

Hive：

https://www.cnblogs.com/Coeus-P/p/13356392.html

Spark：

def main(args: Array[String]): Unit = {
    //println("Hello, WordCount")
    val conf = new SparkConf()
    conf.setAppName("wordCount")
    conf.setMaster("local")

    val sc = new SparkContext(conf)
    //A B
    val lines: RDD[String] = sc.textFile("./data/words")
    //A
    //B
    val words: RDD[String] = lines.flatMap(lines => {
      lines.split(" ")
    })
    //A 1
    //B 1
    val pairWords: RDD[(String, Int)] = words.map(words=>{new Tuple2(words, 1)})
    //A 1
    //B 2
    val result:RDD[(String, Int)] = pairWords.reduceByKey((v1:Int, v2:Int)=>{v1+v2})
    val result1 = result.sortBy(tp=>{tp._2})
    result1.foreach(one=>{
      println(one)
    })
    sc.stop()
  }

SparkStreaming：

/**
  * SparkStreaming 注意：
  * 1.需要设置local[2]，因为一个线程是读取数据，一个线程是处理数据
  * 2.创建StreamingContext两种方式，如果采用的是StreamingContext(conf,Durations.seconds(5))这种方式，不能在new SparkContext
  * 3.Durations 批次间隔时间的设置需要根据集群的资源情况以及监控每一个job的执行时间来调节出最佳时间。
  * 4.SparkStreaming所有业务处理完成之后需要有一个output operato操作
  * 5.StreamingContext.start()straming框架启动之后是不能在次添加业务逻辑
  * 6.StreamingContext.stop()无参的stop方法会将sparkContext一同关闭，stop(false) ,默认为true，会一同关闭
  * 7.StreamingContext.stop()停止之后是不能在调用start
  */
object WordCountFromSocket {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
    conf.setAppName("WordCountOnLine").setMaster("local[2]")
    val ssc = new StreamingContext(conf,Durations.seconds(5))
    ssc.sparkContext.setLogLevel("ERROR")
    //使用new StreamingContext(conf,Durations.seconds(5)) 这种方式默认会创建SparkContext
//    val sc = new SparkContext(conf)
    //从ssc中获取SparkContext()
//    val context: SparkContext = ssc.sparkContext

    val lines: ReceiverInputDStream[String] = ssc.socketTextStream("c7node5",9999)
    val words: DStream[String] = lines.flatMap(line=>{line.split(" ")})
    val pairWords: DStream[(String, Int)] = words.map(word=>{(word,1)})
    val result: DStream[(String, Int)] = pairWords.reduceByKey((v1, v2)=>{v1+v2})

    result.print()

    /**
      * foreachRDD 注意事项：
      * 1.foreachRDD中可以拿到DStream中的RDD，对RDD进行操作，但是一点要使用RDD的action算子触发执行，不然DStream的逻辑也不会执行
      * 2.froeachRDD算子内，拿到的RDD算子操作外，这段代码是在Driver端执行的，可以利用这点做到动态的改变广播变量
      *
      */
//    result.foreachRDD(wordCountRDD=>{
//
//      println("******* produce in Driver *******")
//
//      val sortRDD: RDD[(String, Int)] = wordCountRDD.sortByKey(false)
//      val result: RDD[(String, Int)] = sortRDD.filter(tp => {
//        println("******* produce in Executor *******")
//        true
//      })
//      result.foreach(println)
//    })

    ssc.start()
    ssc.awaitTermination()
    ssc.stop(false)

  }
}

posted @ 2020-10-05 12:14 Loading~ 阅读(144) 评论(0) 编辑收藏举报

刷新页面返回顶部

No pain No gain

本博客用于知识的总结与梳理

Mr、Hive、Spark、SparkStreaming实现wordCount

Hive：

Spark：

SparkStreaming：

公告