2020年寒假学习进度第六天

今天主要进行了spark实验六的学习，Spark Streaming 编程初级实践

Flume 是非常流行的日志采集系统，可以作为 Spark Streaming 的高级数据源。请把 Flume Source 设置为 netcat 类型，从终端上不断给 Flume Source 发送各种消息，Flume 把消息汇集到 Sink，这里把 Sink 类型设置为 avro，由 Sink 把消息推送给 Spark Streaming，由自己编写的 Spark Streaming 应用程序对消息进行处理

⑴配置 Flume 数据源

1.cd /usr/local/flume

2.cd conf

3.vim flume-to-spark.conf

flume-to-spark.conf 文件中写入如下内容：

#flume-to-spark.conf: A single-node Flume configuration
 # Name the components on this agent
 a1.sources = r1
 a1.sinks = k1
 a1.channels = c1
 # Describe/configure the source
 a1.sources.r1.type = netcat
 a1.sources.r1.bind = localhost
 a1.sources.r1.port = 33333
# Describe the sink
 a1.sinks.k1.type = avro
 a1.sinks.k1.hostname = localhost
 a1.sinks.k1.port =44444
 # Use a channel which buffers events in memory
 a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000000
 a1.channels.c1.transactionCapacity = 1000000
 # Bind the source and sink to the channel
 a1.sources.r1.channels = c1
 a1.sinks.k1.channel = c1

(2)Spark 的准备工作

1.cd /usr/local/spark

2. ./bin/spark-shell

启动成功后，在 spark-shell 中执行下面 import 语句：

1.scala> import org.apache.spark.streaming.flume._

你可以看到，马上会报错，因为找不到相关的 jar 包。所以，现在我们就需要下载spark-streaming-flume_2.11-2.1.0.jar，其中2.11表示对应的Scala版本号，2.1.0表示Spark版本号。现在请在 Linux 系统中，打开一个火狐浏览器，打开下方的网址http://mvnrepository.com/artifact/org.apache.spark/spark-streaming-flume_2.11/2.1.0，里面有提供 spark-streaming-flume_2.11-2.1.0.jar 文件的下载。

随后：

1.cd /usr/local/spark/jars

2.mkdir flume

3.cd ~

4.cd 下载

5.cp ./spark-streaming-flume_2.11-2.1.0.jar

6./usr/local/spark/jars/flume

7.cd /usr/local/flume/lib

8.ls

9.cp ./* /usr/local/spark/jars/flume

⑶编写 Spark 程序使用 Flume 数据源

1.cd /usr/local/spark/mycode

2.mkdir flume

3.cd flume

4.mkdir -p src/main/scala

5.cd src/main/scala

6.vim FlumeEventCount.scala

package org.apache.spark.examples.streaming
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming._
import org.apache.spark.streaming.flume._
import org.apache.spark.util.IntParam
object FlumeEventCount {
 def main(args: Array[String]) {
 if (args.length < 2) {
 System.err.println(
 "Usage: FlumeEventCount <host> <port>")
 System.exit(1)
 }
StreamingExamples.setStreamingLogLevels()
 val Array(host, IntParam(port)) = args
 val batchInterval = Milliseconds(2000)
 // Create the context and set the batch size
 val sparkConf = new SparkConf().setAppName("FlumeEventCount").setMaster("local
[2]")
 val ssc = new StreamingContext(sparkConf, batchInterval)
 // Create a flume stream
 val stream = FlumeUtils.createStream(ssc, host, port, StorageLevel.MEMORY_ONLY_S
ER_2)
 // Print out the count of events received from this server in each batch
 stream.count().map(cnt => "Received " + cnt + " flume events." ).print()
 ssc.start()
 ssc.awaitTermination()
 } }

然后再使用 vim 编辑器新建 StreamingExamples.scala 文件，输入如下代码，用于控

制日志输出格式：

package org.apache.spark.examples.streaming
import org.apache.log4j.{Level, Logger}
import org.apache.spark.internal.Logging
object StreamingExamples extends Logging {
 /** Set reasonable logging levels for streaming if the user has not configured log4
j. */
 def setStreamingLogLevels() {
 val log4jInitialized = Logger.getRootLogger.getAllAppenders.hasMoreElements
 if (!log4jInitialized) {
 // We first log something to initialize Spark's default logging, then we overri
de the
 // logging level.
 logInfo("Setting log level to [WARN] for streaming example." +
 " To override add a custom log4j.properties to the classpath.")
 Logger.getRootLogger.setLevel(Level.WARN)
 }
 }
}

然后，请执行下面命令新建一个 simple.sbt 文件：

name := "Simple Project"
version := "1.0"
scalaVersion := "2.11.8"
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.1.0"
libraryDependencies += "org.apache.spark" % "spark-streaming_2.11" % "2.1.0"
libraryDependencies += "org.apache.spark" % "spark-streaming-flume_2.11" % "2.1.0"

　　进行打包：

1.cd /usr/local/spark/mycode/flume/

2./usr/local/sbt/sbt package#执行这一步之前需要先安装 sbt 插件

⑷测试程序效果

关闭之前打开的所有终端。首先，请新建第 1 个 Linux 终端，启动 Spark Streaming 应用程序，命令如下：

1.cd /usr/local/spark

2. ./bin/spark-submit --driver-class-path /usr/local/spark/jars/*:/usr/local/spark/jars/flume/* --class "org.apache.spark.examples.streaming.FlumeEventCount" /usr/local/spark/mycode/flume/target/scala-2.11/simple-project_2.11-1.0.jar localhost 44444

通过上面命令，我们为应用程序提供 host 和 port 两个参数的值分别为 localhost 和44444，程序会对 localhost 的 44444 端口进行监听，Milliseconds(2000)设置了时间间隔为2 秒，所以，该程序每隔 2 秒就会从指定的端口中获取由 Flume Sink 发给该端口的消息，然后进行处理，对消息进行统计，打印出“Received 0 flume events.”这样的信息。执行该命令后，屏幕上会显示程序运行的相关信息，并会每隔 2 秒钟刷新一次信息，大量信息中会包含如下重要信息：

-------------------------------------------

Time: 1488029430000 ms

-------------------------------------------

Received 0 flume events.

因为目前 Flume 还没有启动，没有给 FlumeEventCount 发送任何消息，所以 FlumeEvents 的数量是 0。第 1 个终端不要关闭，让它一直处于监听状态。现在，我们可以再另外新建第 2 个终端，在这个新的终端中启动 Flume Agent，命令如下：

1.cd /usr/local/flume

2.bin/flume-ng agent --conf ./conf --conf-file ./conf/flume-to-spark.conf --name a1 -Dflume.root.logger=INFO,console

启动 agent 以后，该 agent 就会一直监听 localhost 的 33333 端口，这样，我们下面就可以通过“telnet localhost 33333”命令向 Flume Source 发送消息。第 2 个终端也不要关闭，让它一直处于监听状态。

请另外新建第 3 个终端，执行如下命令：

1.telnet localhost 33333

执行该命令以后，就可以在这个窗口里面随便敲入若干个字符和若干个回车，这些消息都会被 Flume 监听到，Flume 把消息采集到以后汇集到 Sink，然后由 Sink 发送给 Spark的 FlumeEventCount 程序进行处理。然后，你就可以在运行 FlumeEventCount 的前面那个终端窗口内看到类似如下的统计结果：

-------------------------------------------

Time: 1488029430000 ms

-------------------------------------------

Received 0 flume events.

#这里省略了其他屏幕信息

-------------------------------------------

Time: 1488029432000 ms

-------------------------------------------

Received 8 flume events.

#这里省略了其他屏幕信息

-------------------------------------------

Time: 1488029434000 ms

-------------------------------------------

Received 21 flume events.

以上是关于实验的一些操作步骤，最后经过实验出现了一个问题，启动Flume之后，就应该能接受到Received 21 flume events这个提示，但是我做实验启动flume之后，依然是0，不知原因，此问题还未解决。

posted @ 2020-02-05 13:29 生活依旧阅读(173) 评论(0) 编辑收藏举报

会员力量，点亮园子希望

刷新页面返回顶部

2020年寒假学习进度第六天

公告