JIANGzihao0222

This session covers the DStream stateless transformation operations.

A DStream is a sequence of RDDs.

map(func): transforms each element of the source DStream with the function func, producing a new DStream.

flatMap(func): similar to map, but each input item can be mapped to zero or more output items.

filter(func): returns a new DStream containing only the items of the source DStream for which func returns true.

repartition(numPartitions): changes the degree of parallelism of the DStream by creating more or fewer partitions.

reduce(func): aggregates the elements of each RDD of the source DStream with func and returns a new DStream of single-element RDDs.

count(): counts the number of elements in each RDD of the source DStream.

union(otherStream): returns a new DStream containing the elements of the source DStream and the other DStream.

countByValue(): when applied to a DStream of elements of type K, returns a new DStream of (K, Long) pairs in which the value of each key is its number of occurrences in each RDD of the source DStream.

Example of stateless transformations:

The word-frequency count introduced in the earlier "socket stream" section uses stateless transformations: each run counts only the words that arrive in the current batch, independently of previous batches, with no accumulation.
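
For reference, here is a minimal sketch of such a stateless word count; the host, port and 5-second batch interval are illustrative assumptions mirroring the socket-stream example:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StatelessNetworkWordCount {
  def main(args: Array[String]) {
    val sparkConf = new SparkConf().setMaster("local[2]").setAppName("StatelessNetworkWordCount")
    val ssc = new StreamingContext(sparkConf, Seconds(5))
    val lines = ssc.socketTextStream("localhost", 9999)
    // flatMap, map and reduceByKey are all stateless: each 5-second batch
    // is counted on its own, with no carry-over from earlier batches
    val wordCounts = lines.flatMap(_.split(" ")).map(x => (x, 1)).reduceByKey(_ + _)
    wordCounts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}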

DStream stateful transformation operations

These operations carry historical information across batches.

Sliding window transformations

First set the length of the sliding window (i.e. the duration of the window) and the sliding interval (how often the computation is performed). The window then slides over the source DStream at the specified interval. At each position where the window stops, a portion of the DStream (i.e. a group of RDDs) falls inside the window, forming a small DStream segment, and the computation can be launched on that segment.

 

window(windowLength, slideInterval): returns a new DStream computed from windowed batches of the source DStream.

countByWindow(windowLength, slideInterval): returns a sliding-window count of the elements in the stream.

reduceByWindow(func, windowLength, slideInterval): returns a new single-element stream created by aggregating the elements of the stream over the sliding interval with func. func must be associative so that the computation can be parallelized.

reduceByKeyAndWindow(func, windowLength, slideInterval, [numTasks]): when applied to a DStream of (K, V) pairs, returns a new DStream of (K, V) pairs in which the values for each key are aggregated with the given reduce function func over the batches in the window. Note: by default this uses Spark's default number of parallel tasks for the grouping; a different number can be set with the numTasks argument.

reduceByKeyAndWindow(func, invFunc, windowLength, slideInterval, [numTasks]): a more efficient version of reduceByKeyAndWindow in which each window's reduce value is computed incrementally from the previous window's value: it reduces the new data entering the window and "inverse reduces" the old data leaving it. It can only be used with "invertible reduce functions", i.e. reduce functions that have a corresponding inverse (passed as the invFunc parameter); see the sketch after this list.

countByValueAndWindow(windowLength, slideInterval, [numTasks]): when applied to a DStream of (K, V) pairs, returns a new DStream of (K, Long) pairs in which the value of each key is its frequency within the sliding window.
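
As an illustration, here is a minimal sketch of a windowed word count using the incremental form of reduceByKeyAndWindow. The 30-second window, 10-second slide and checkpoint path are assumptions made for this example (a checkpoint directory is required when invFunc is used):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object WindowedWordCount {
  def main(args: Array[String]) {
    val sparkConf = new SparkConf().setMaster("local[2]").setAppName("WindowedWordCount")
    val ssc = new StreamingContext(sparkConf, Seconds(5))
    ssc.checkpoint("file:///tmp/spark-window-checkpoint")   // required by the invFunc variant (assumed path)
    val pairs = ssc.socketTextStream("localhost", 9999).flatMap(_.split(" ")).map(x => (x, 1))
    // the first function adds counts for batches entering the window,
    // the second ("inverse reduce") subtracts counts for batches leaving it
    val windowedCounts = pairs.reduceByKeyAndWindow(
      (a: Int, b: Int) => a + b,
      (a: Int, b: Int) => a - b,
      Seconds(30), Seconds(10))
    windowedCounts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}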

The updateStateByKey operation

The input parameter updateFunc of updateStateByKey(updateFunc) is a function of type (Seq[V], Option[S]) => Option[S], where V and S are data types such as Int. The first parameter, of type Seq[V], holds all the values for the current key in the current batch; the second parameter, of type Option[S], holds the key's historical state; and the return value, of type Option[S], is the key's new state.

Whenever state must be maintained across batches, the updateStateByKey operation must be used.

Word-count example:

With stateful transformations, the word counts of the current batch are accumulated on top of the results of previous batches, so the final counts are the total word frequencies across all batches.

The procedure is similar to the socket-stream example.

First, create the directory structure:

cd /usr/local/spark/mycode/streaming
mkdir stateful
cd stateful/
mkdir -p src/main/scala
cd src/main/scala
vim NetworkWordCountStateful.scala
import org.apache.spark._
import org.apache.spark.streaming._
object NetworkWordCountStateful {
  def main(args: Array[String]) {
    // define the state update function
    val updateFunc = (values: Seq[Int], state: Option[Int]) => {
      val currentCount = values.foldLeft(0)(_ + _)
      val previousCount = state.getOrElse(0)
      Some(currentCount + previousCount)
    }
      
    val sparkConf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCountStateful")
    val sc = new SparkContext(sparkConf)
    sc.setLogLevel("ERROR")    
    val ssc = new StreamingContext(sc, Seconds(5))
    ssc.checkpoint("file:///usr/local/spark/mycode/streaming/stateful/")    // set the checkpoint directory; checkpointing provides fault tolerance
    val lines = ssc.socketTextStream("localhost", 9999)
    val words = lines.flatMap(_.split(" "))
    val wordDstream = words.map(x => (x, 1))
    val stateDstream = wordDstream.updateStateByKey[Int](updateFunc)
    stateDstream.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
cd /usr/local/spark/mycode/streaming/stateful
vim simple.sbt
name := "Simple Project"
version := "1.0"                    
scalaVersion := "2.12.15"
libraryDependencies += "org.apache.spark" %% "spark-core" % "3.2.0"
libraryDependencies += "org.apache.spark" % "spark-streaming_2.12" % "3.2.0" % "provided"
/usr/local/sbt/sbt package
/usr/local/spark/bin/spark-submit --class "NetworkWordCountStateful" ./target/scala-2.12/simple-project_2.12-1.0.jar

The following output appears (the "Connection refused" errors are expected, because the netcat server on port 9999 has not been started yet):

2024-01-25 23:23:09,506 ERROR scheduler.ReceiverTracker: Deregistered receiver for stream 0: Restarting receiver with delay 2000ms: Error connecting to localhost:9999 - java.net.ConnectException: 拒绝连接 (Connection refused)
    at java.net.PlainSocketImpl.socketConnect(Native Method)
    at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
    at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
    at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
    at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
    at java.net.Socket.connect(Socket.java:589)
    at java.net.Socket.connect(Socket.java:538)
    at java.net.Socket.<init>(Socket.java:434)
    at java.net.Socket.<init>(Socket.java:211)
    at org.apache.spark.streaming.dstream.SocketReceiver.onStart(SocketInputDStream.scala:61)
    at org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:149)
    at org.apache.spark.streaming.receiver.ReceiverSupervisor.$anonfun$restartReceiver$1(ReceiverSupervisor.scala:198)
    at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
    at scala.concurrent.Future$.$anonfun$apply$1(Future.scala:659)
    at scala.util.Success.$anonfun$map$1(Try.scala:255)
    at scala.util.Success.map(Try.scala:213)
    at scala.concurrent.Future.$anonfun$map$1(Future.scala:292)
    at scala.concurrent.impl.Promise.liftedTree1$1(Promise.scala:33)
    at scala.concurrent.impl.Promise.$anonfun$transform$1(Promise.scala:33)
    at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:64)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

-------------------------------------------
Time: 1706196190000 ms
-------------------------------------------

(the same "Connection refused" stack trace is printed again every 2 seconds until the netcat server is started)

-------------------------------------------
Time: 1706196195000 ms
-------------------------------------------

Open another terminal:

[atguigu@hadoop102 ~]$ nc -lk 9999
Hdao
Hdoop
a
Hadoop
Spark
Hbase
Hadoop

Result:

-------------------------------------------
Time: 1706196250000 ms
-------------------------------------------
(H,1)

-------------------------------------------
Time: 1706196255000 ms
-------------------------------------------
(Hdoop,1)
(H,1)
(,1)

-------------------------------------------
Time: 1706196260000 ms
-------------------------------------------
(Hdoop,1)
(H,1)
(Spark,1)
(,1)
(Hadoop,1)

-------------------------------------------
Time: 1706196265000 ms

Output operations

Simply modify the source file:

import org.apache.spark._
import org.apache.spark.streaming._
object NetworkWordCountStatefulTxt {
  def main(args: Array[String]) {
    // define the state update function
    val updateFunc = (values: Seq[Int], state: Option[Int]) => {
      val currentCount = values.foldLeft(0)(_ + _)
      val previousCount = state.getOrElse(0)
      Some(currentCount + previousCount)
    }
      
    val sparkConf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCountStateful")
    val sc = new SparkContext(sparkConf)
    sc.setLogLevel("ERROR")
    val ssc = new StreamingContext(sc, Seconds(5))
    ssc.checkpoint("file:///usr/local/spark/mycode/streaming/stateful/")    // set the checkpoint directory; checkpointing provides fault tolerance
    val lines = ssc.socketTextStream("localhost", 9999)
    val words = lines.flatMap(_.split(" "))
    val wordDstream = words.map(x => (x, 1))
    val stateDstream = wordDstream.updateStateByKey[Int](updateFunc)
    stateDstream.print()
    // the following new statement saves the DStream to text files
    stateDstream.saveAsTextFiles("file:///usr/local/spark/mycode/streaming/dstreamoutput/output")
    ssc.start()
    ssc.awaitTermination()
  }
}

In other words, it is enough to add this statement to the program: stateDstream.saveAsTextFiles("file:///usr/local/spark/mycode/streaming/dstreamoutput/output")

This generates multiple output files, one set per batch interval:

output-1479951955000
output-1479951960000
output-1479951965000
output-1479951970000
output-1479951975000
output-1479951980000
output-1479951985000

Writing to MySQL

Start the MySQL server and create the database and table:

[atguigu@hadoop102 stateful]$ sudo service mysqld start
Redirecting to /bin/systemctl start mysqld.service
[atguigu@hadoop102 stateful]$ mysql -u root -p
Enter password: 
Welcome to the MySQL monitor.  Commands end with ; or \g.
...
mysql> use spark
mysql> create table wordcount (word char(20), count int(4));

This creates a table named "wordcount" in the previously created "spark" database.

[atguigu@hadoop102 streaming]$ mkdir statefulmysql
[atguigu@hadoop102 streaming]$ cd statefulmysql/
[atguigu@hadoop102 statefulmysql]$ mkdir -p src/main/scala
[atguigu@hadoop102 statefulmysql]$ cd src/main/scala
[atguigu@hadoop102 scala]$ vim NetworkWordCountStatefulMySQL.scala
[atguigu@hadoop102 scala]$ cd ..
[atguigu@hadoop102 main]$ cd ..
[atguigu@hadoop102 src]$ cd ..
[atguigu@hadoop102 statefulmysql]$ vim simple.sbt

NetworkWordCountStatefulMySQL.scala:

import java.sql.{PreparedStatement, Connection, DriverManager}
import java.util.concurrent.atomic.AtomicInteger
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.streaming.{Seconds, StreamingContext}


object NetworkWordCountStatefulMySQL {
  def main(args: Array[String]) {
    // define the state update function
    val updateFunc = (values: Seq[Int], state: Option[Int]) => {
      val currentCount = values.foldLeft(0)(_ + _)
      val previousCount = state.getOrElse(0)
      Some(currentCount + previousCount)
    }
    val sparkConf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCountStatefulMySQL")
    val sc = new SparkContext(sparkConf)
    sc.setLogLevel("ERROR")
    val ssc = new StreamingContext(sc, Seconds(5))
    ssc.checkpoint("file:///usr/local/spark/mycode/streaming/dstreamoutput/")    // set the checkpoint directory; checkpointing provides fault tolerance
    val lines = ssc.socketTextStream("localhost", 9999)
    val words = lines.flatMap(_.split(" "))
    val wordDstream = words.map(x => (x, 1))
    val stateDstream = wordDstream.updateStateByKey[Int](updateFunc)
    stateDstream.print()

    // the following new code saves the DStream to the MySQL database
    stateDstream.foreachRDD(rdd => {
      // inner function: writes one partition of records to MySQL
      def func(records: Iterator[(String, Int)]) {
        var conn: Connection = null
        var stmt: PreparedStatement = null
        try {
          val url = "jdbc:mysql://localhost:3306/spark"
          val user = "root"
          val password = "123456" // the database password is 123456
          conn = DriverManager.getConnection(url, user, password)
          records.foreach(p => {
            val sql = "insert into wordcount(word,count) values (?,?)"
            stmt = conn.prepareStatement(sql);
            stmt.setString(1, p._1.trim)
            stmt.setInt(2, p._2.toInt)
            stmt.executeUpdate()
          })
        } catch {
          case e: Exception => e.printStackTrace()
        } finally {
          if (stmt != null) {
            stmt.close()
          }
          if (conn != null) {
            conn.close()
          }
        }
      }

      // repartition to a few partitions so that only a small number of
      // database connections are opened, then write each partition in one pass
      val repartitionedRDD = rdd.repartition(3)
      repartitionedRDD.foreachPartition(func)
    })


    ssc.start()
    ssc.awaitTermination()
  }
}
simple.sbt:

name := "Simple Project"
version := "1.0"                    
scalaVersion := "2.12.15"
libraryDependencies += "org.apache.spark" %% "spark-core" % "3.2.0"
libraryDependencies += "org.apache.spark" % "spark-streaming_2.12" % "3.2.0" % "provided"

Package with sbt:

/usr/local/sbt/sbt package

Submit the application:

/usr/local/spark/bin/spark-submit --jars /usr/local/spark/jars/mysql-connector-java-5.1.37.jar --class "NetworkWordCountStatefulMySQL" /usr/local/spark/mycode/streaming/statefulmysql/target/scala-2.12/simple-project_2.12-1.0.jar

Open another terminal and type some input:

[atguigu@hadoop102 scala]$ nc -lk 9999
woyaosuijiao
suijio^Ha
suijioa
woyaosuijiao
woyaosuijiao

The screen shows:

-------------------------------------------
Time: 1706199380000 ms
-------------------------------------------
(suijioa,1)
(suijia,1)
(woyaosuijiao,1)

-------------------------------------------
Time: 1706199385000 ms
-------------------------------------------
(suijioa,1)
(suijia,1)
(woyaosuijiao,3)

Result in MySQL:

[atguigu@hadoop102 ~]$ sudo service mysqld start
Redirecting to /bin/systemctl start mysqld.service
[atguigu@hadoop102 ~]$ mysql -u root -p
Enter password: 
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 30
Server version: 5.7.28 MySQL Community Server (GPL)

Copyright (c) 2000, 2019, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> use spark
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
mysql> select * from wordcount;
+--------------+-------+
| word         | count |
+--------------+-------+
| suijia       |     1 |
| woyaosuijiao |     1 |
| suijioa      |     1 |
| suijia       |     1 |
| woyaosuijiao |     1 |
| suijioa      |     1 |
| suijia       |     1 |
| woyaosuijiao |     1 |
| suijioa      |     1 |
| suijia       |     1 |
| woyaosuijiao |     1 |
| suijioa      |     1 |
| suijia       |     1 |
| woyaosuijiao |     3 |
+--------------+-------+

 
