Study Notes: Getting Started with Spark Streaming

1. Overview

Spark Streaming is an extension of the Spark Core API that enables scalable, high-throughput, fault-tolerant processing of live data streams.

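As the architecture diagram shows, data can come from many sources, such as Kafka, Flume, Twitter, HDFS/S3, and Kinesis (the last one is used less often). The ingested data can be processed with complex algorithms expressed through high-level functions (map, reduce, and so on). After Spark Streaming has processed the data, the results can be pushed to file systems, databases, or live dashboards. Beyond that, Spark's machine learning and graph processing algorithms can also be applied to the data streams.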

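My own short definition of Spark Streaming: it takes data from different sources, processes it with the Spark Streaming framework, and writes the results to an external file system.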
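To make the source, processing, and output stages concrete, here is a minimal sketch. Everything specific in it is an assumption made for illustration: the object name `PipelineSketch`, the `host LEVEL message` line format, the helper `hostOf`, the `/tmp/pipeline-sketch/errors` output prefix, and the use of `queueStream` as a stand-in for a real source such as Kafka or Flume.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}
import scala.collection.mutable

object PipelineSketch {
  /** Hypothetical helper: extracts the host from a "host LEVEL message" line. */
  def hostOf(line: String): String = line.split(" ")(0)

  def main(args: Array[String]): Unit = {
    // local[2] is an assumption for a local test run
    val conf = new SparkConf().setMaster("local[2]").setAppName("PipelineSketch")
    val ssc = new StreamingContext(conf, Seconds(1))

    // Source: a queue of RDDs stands in for a real source (assumption)
    val queue = new mutable.Queue[RDD[String]]()
    val lines = ssc.queueStream(queue)

    // Processing: high-level operators (filter, map, reduceByKey)
    val errorCounts = lines
      .filter(_.contains("ERROR"))
      .map(line => (hostOf(line), 1))
      .reduceByKey(_ + _)

    // Output: one directory per batch, named "<prefix>-<batch time in ms>"
    errorCounts.saveAsTextFiles("/tmp/pipeline-sketch/errors")

    ssc.start()
    queue.synchronized {
      queue += ssc.sparkContext.parallelize(
        Seq("hostA ERROR disk full", "hostB INFO ok", "hostA ERROR timeout"))
    }
    // Let a few batches run, then shut down
    ssc.awaitTerminationOrTimeout(5000)
    ssc.stop(stopSparkContext = true, stopGracefully = false)
  }
}
```

Swapping the queue for a real input stream and the output prefix for an HDFS path would keep the same three-stage shape.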

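Features:

Low latency

Recovers efficiently from failures (fault-tolerant)

Scales to hundreds or thousands of nodes

Can be combined with batch processing, machine learning, graph processing, and other Spark sub-frameworks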

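2. Use cases

Real-time monitoring of electronic devices

Real-time detection of financial fraud during transactions

Recommendations in e-commerce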

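3. Using Spark Streaming with the rest of the Spark ecosystem

Spark SQL, Spark Streaming, MLlib, and GraphX are all extensions built on top of Spark Core. So how do they interact with each other? (To be filled in later.)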
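As a partial answer for Spark SQL: each batch of a DStream is an RDD, so inside `foreachRDD` it can be turned into a DataFrame and queried with SQL. The sketch below assumes the `spark-sql` module is on the classpath. The names `StreamingSqlSketch` and `countWords`, the temporary view name `words`, and the `queueStream` test source are all made up for illustration.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import scala.collection.mutable

object StreamingSqlSketch {
  /** Hypothetical helper: counts words in one batch with Spark SQL. */
  def countWords(spark: SparkSession, rdd: RDD[String]): DataFrame = {
    import spark.implicits._
    // Turn the batch RDD into a single-column DataFrame and query it
    rdd.toDF("word").createOrReplaceTempView("words")
    spark.sql("SELECT word, COUNT(*) AS total FROM words GROUP BY word")
  }

  def main(args: Array[String]): Unit = {
    // local[2] is an assumption for a local test run
    val conf = new SparkConf().setMaster("local[2]").setAppName("StreamingSqlSketch")
    val ssc = new StreamingContext(conf, Seconds(1))

    // A queue of RDDs stands in for a real source (assumption)
    val queue = new mutable.Queue[RDD[String]]()
    val words = ssc.queueStream(queue).flatMap(_.split(" "))

    words.foreachRDD { rdd =>
      // Reuse (or lazily create) a SparkSession on the same SparkContext
      val spark = SparkSession.builder.config(rdd.sparkContext.getConf).getOrCreate()
      countWords(spark, rdd).show()
    }

    ssc.start()
    queue.synchronized {
      queue += ssc.sparkContext.parallelize(Seq("a b a", "b c"))
    }
    ssc.awaitTerminationOrTimeout(5000)
    ssc.stop(stopSparkContext = true, stopGracefully = false)
  }
}
```

The same `foreachRDD` hook is where a batch could be handed to MLlib, since MLlib also works on RDDs and DataFrames.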

4. History of Spark

5. Getting started with Spark Streaming through word count

Running with spark-submit (development)

package org.apache.spark.examples.streaming

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}

/**
 * Counts words in UTF8 encoded, '\n' delimited text received from the network every second.
 *
 * Usage: NetworkWordCount <hostname> <port>
 * <hostname> and <port> describe the TCP server that Spark Streaming would connect to receive data.
 *
 * To run this on your local machine, you need to first run a Netcat server
 *    `$ nc -lk 9999`
 * and then run the example
 *    `$ bin/run-example org.apache.spark.examples.streaming.NetworkWordCount localhost 9999`
 */
object NetworkWordCount {
  def main(args: Array[String]): Unit = {
    if (args.length < 2) {
      System.err.println("Usage: NetworkWordCount <hostname> <port>")
      System.exit(1)
    }

    StreamingExamples.setStreamingLogLevels()

    // Create the context with a 1 second batch size
    val sparkConf = new SparkConf().setAppName("NetworkWordCount")
    val ssc = new StreamingContext(sparkConf, Seconds(1))

    // Create a socket stream on target ip:port and count the
    // words in input stream of \n delimited text (eg. generated by 'nc')
    // Note that no duplication in storage level only for running locally.
    // Replication necessary in distributed scenario for fault tolerance.
    val lines = ssc.socketTextStream(args(0), args(1).toInt, StorageLevel.MEMORY_AND_DISK_SER)
    val words = lines.flatMap(_.split(" "))
    val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
    wordCounts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}


The command for submitting with spark-submit is as follows (if anything is unclear, see the usage notes in the comment at the top of the code):
./spark-submit --master local[2] --class org.apache.spark.examples.streaming.NetworkWordCount --name NetworkWordCount /home/hadoop/app/spark/examples/jars/spark-examples_2.11-2.2.0.jar hadoop000 9999

Running in spark-shell (testing)

import org.apache.spark.streaming.{Seconds, StreamingContext}

// spark-shell already provides a SparkContext named sc, so reuse it
val ssc = new StreamingContext(sc, Seconds(1))
val lines = ssc.socketTextStream("hadoop000", 9999)
val words = lines.flatMap(_.split(" "))
val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
wordCounts.print()
ssc.start()
ssc.awaitTermination()

Just start ./spark-shell --master local[2], then paste the code into the shell and run it.

6. How it works

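Coarse-grained view: Spark Streaming receives a live data stream and cuts it into small blocks according to a configured time interval (each small block is handled as an RDD). It then hands these blocks to the Spark engine for processing, and the results also come back batch by batch.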
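The slicing idea can be imitated in plain Scala without Spark. This is only a toy model under stated assumptions, not how Spark implements it: `Event`, `sliceIntoBatches`, and `wordCount` are hypothetical names, and timestamps are made-up arrival times in milliseconds.

```scala
object MicroBatchSketch {
  // A record with a (hypothetical) arrival time in milliseconds
  final case class Event(timestampMs: Long, line: String)

  /** Groups events into batches of `batchMs` milliseconds, ordered by batch index. */
  def sliceIntoBatches(events: Seq[Event], batchMs: Long): Seq[(Long, Seq[Event])] =
    events
      .groupBy(e => e.timestampMs / batchMs) // batch index = time / interval
      .toSeq
      .sortBy(_._1)

  /** The per-batch computation: a plain word count. */
  def wordCount(batch: Seq[Event]): Map[String, Int] =
    batch
      .flatMap(_.line.split(" "))
      .groupBy(identity)
      .map { case (word, occurrences) => (word, occurrences.size) }

  def main(args: Array[String]): Unit = {
    val events = Seq(
      Event(100, "a b"),
      Event(900, "a"),
      Event(1200, "b c"),
      Event(2500, "c c")
    )
    // One result per batch, mirroring how results come back batch by batch
    for ((batchIndex, batch) <- sliceIntoBatches(events, batchMs = 1000)) {
      println(s"batch $batchIndex: ${wordCount(batch)}")
    }
  }
}
```

One difference from the real thing: this toy skips intervals that contain no events, whereas Spark Streaming still produces a (possibly empty) RDD for every batch interval.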

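Fine-grained view: an application holds two contexts, SparkContext and StreamingContext, and uses receivers to take in data. The receivers are run as tasks on the executors. As a receiver takes in data, it splits the data into blocks by time interval and stores them in executor memory. If replication is configured, the blocks are also copied to other executors as backups (replicate blocks). The receiver on the executor reports the block information to the StreamingContext. At every batch interval, the StreamingContext notifies the SparkContext to launch jobs, and these jobs are distributed to the executors for execution.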
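The two contexts and the replication setting are both visible in user code. The sketch below is a minimal illustration under assumptions: the object name `TwoContextsSketch`, the helper `buildStreamingContext`, the 5-second batch interval, and a Netcat server on `localhost:9999` (started with `nc -lk 9999`, as in the word count example) are all chosen for the example.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}

object TwoContextsSketch {
  /** Hypothetical helper: builds a StreamingContext on top of an existing SparkContext. */
  def buildStreamingContext(sc: SparkContext, batchSeconds: Long): StreamingContext =
    new StreamingContext(sc, Seconds(batchSeconds))

  def main(args: Array[String]): Unit = {
    // At least 2 local threads: the receiver occupies one as a long-running task,
    // so the other is needed to process the received batches
    val conf = new SparkConf().setMaster("local[2]").setAppName("TwoContextsSketch")
    val sc = new SparkContext(conf)
    val ssc = buildStreamingContext(sc, batchSeconds = 5)

    // The StreamingContext wraps the very same SparkContext
    assert(ssc.sparkContext eq sc)

    // The "_2" storage level asks for each received block to be kept on two nodes;
    // in local mode there is no second executor, so it only matters on a cluster
    val lines = ssc.socketTextStream("localhost", 9999, StorageLevel.MEMORY_AND_DISK_SER_2)

    // Every batch interval, a job counting the lines of that batch is launched
    lines.count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

This is also why the commands earlier use `local[2]` rather than `local` or `local[1]`: with a single thread the receiver would take it, and no batch would ever be processed.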

posted on 2019-03-20 14:32 xiaoshiyun
