Push-based integration
- Flume actively pushes the collected data to the Spark program; if the Spark receiver is not up or cannot keep pace, receiving the data easily goes wrong. Push-based integration is implemented by pointing the Flume sink at an avro port (an avro sink).
- Add the Spark Streaming and Flume integration dependencies:
| <dependencies> |
| <dependency> |
| <groupId>org.apache.spark</groupId> |
| <artifactId>spark-streaming_2.11</artifactId> |
| <version>2.3.1</version> |
| </dependency> |
| <dependency> |
| <groupId>org.apache.spark</groupId> |
| <artifactId>spark-streaming-flume_2.11</artifactId> |
| <version>2.3.1</version> |
| </dependency> |
| </dependencies> |
- Define the Flume agent configuration file, setting the sink to an avro sink that writes to a port:
| [root@node1 data] |
| |
| a1.sources = s1 |
| a1.sinks = k1 |
| a1.channels = c1 |
| |
| a1.sources.s1.type = netcat |
| a1.sources.s1.bind = node1 |
| a1.sources.s1.port = 44444 |
| |
| a1.sinks.k1.type = avro |
| a1.sinks.k1.hostname = node1 |
| a1.sinks.k1.port = 8888 |
| a1.sinks.k1.batch-size = 1 |
| |
| a1.channels.c1.type = memory |
| a1.channels.c1.capacity = 1000 |
| a1.channels.c1.transactionCapacity = 100 |
| |
| a1.sources.s1.channels = c1 |
| a1.sinks.k1.channel = c1 |
- Use FlumeUtils.createStream to read, from the avro port, the real-time data that Flume pushes there (a sketch of decoding the event payload follows the note at the end of this section):
| package SparkStreaming.flume |
| |
| import org.apache.spark.SparkConf |
| import org.apache.spark.storage.StorageLevel |
| import org.apache.spark.streaming.flume.FlumeUtils |
| import org.apache.spark.streaming.{Seconds, StreamingContext} |
| |
| object ByFlumePush { |
| def main(args: Array[String]): Unit = { |
| val conf = new SparkConf().setMaster("local[3]").setAppName("hdfs") |
| // batch interval of 10 seconds |
| val ssc: StreamingContext = new StreamingContext(conf, Seconds(10)) |
| // push-based receiver: listens on node1:8888, where the avro sink pushes events |
| val ds = FlumeUtils.createStream(ssc, "node1", 8888, StorageLevel.MEMORY_ONLY) |
| ds.print() |
| ssc.start() |
| ssc.awaitTermination() |
| } |
| } |
| 1. Start Flume |
| flume-ng agent -n a1 -f portToSpark.conf -Dflume.root.logger=INFO,console |
| 2. Run the main class: package the code into a jar and upload it to node1 |
| spark-submit --class SparkStreaming.flume.ByFlumePush ssc.jar |
| 3. Open the netcat port and send some test lines |
| [root@node1 ~]# telnet node1 44444 |
- Note:
The Spark Streaming program and the Flume agent must run on the same node (the avro sink pushes to the fixed hostname/port that the receiver listens on), and the jar submitted to Spark must bundle all of the spark-streaming-flume_2.11:2.3.1 dependencies.
(A jar built by someone else was used here, saved at G://shixun//ssc.jar.)
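- The DStream returned by createStream carries SparkFlumeEvent objects, so ds.print() only shows event wrappers rather than the text typed into netcat. Below is a minimal sketch, assuming the netcat source sends plain UTF-8 lines, of decoding each event body and counting words per batch; the object name ByFlumePushWordCount and the app name are illustrative, not part of the original code.
| package SparkStreaming.flume |
|  |
| import java.nio.charset.StandardCharsets |
|  |
| import org.apache.spark.SparkConf |
| import org.apache.spark.storage.StorageLevel |
| import org.apache.spark.streaming.flume.FlumeUtils |
| import org.apache.spark.streaming.{Seconds, StreamingContext} |
|  |
| // Illustrative sketch (not from the original post): decode the body of every |
| // Flume event into a text line and count words per 10-second batch. |
| object ByFlumePushWordCount { |
| def main(args: Array[String]): Unit = { |
| val conf = new SparkConf().setMaster("local[3]").setAppName("flume-push-wordcount") |
| val ssc = new StreamingContext(conf, Seconds(10)) |
| val events = FlumeUtils.createStream(ssc, "node1", 8888, StorageLevel.MEMORY_ONLY) |
| // each SparkFlumeEvent wraps an AvroFlumeEvent; its body holds the raw netcat line |
| val lines = events.map(e => StandardCharsets.UTF_8.decode(e.event.getBody).toString) |
| val wordCounts = lines.flatMap(_.split("\\s+")).map((_, 1)).reduceByKey(_ + _) |
| wordCounts.print() |
| ssc.start() |
| ssc.awaitTermination() |
| } |
| } |
With this in place, lines typed into the telnet session should appear as (word, count) pairs in the driver output every 10 seconds.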
Pull-based integration
- Flume delivers the collected data to the sink, but the sink does not hand it to Spark immediately; it buffers the data first, and the Spark receiver then pulls data from the sink on demand.
Pull-based integration is implemented with a Spark sink and is the recommended approach.
- Add the dependencies:
| <dependencies> |
| <dependency> |
| <groupId>org.apache.spark</groupId> |
| <artifactId>spark-streaming_2.11</artifactId> |
| <version>2.3.1</version> |
| </dependency> |
| <dependency> |
| <groupId>org.apache.spark</groupId> |
| <artifactId>spark-streaming-flume_2.11</artifactId> |
| <version>2.3.1</version> |
| </dependency> |
| </dependencies> |
- Define the Flume config file, the same as above except that the sink type is changed to SparkSink:
| [root@node1 data] |
| |
| a1.sources = s1 |
| a1.sinks = k1 |
| a1.channels = c1 |
| |
| a1.sources.s1.type = netcat |
| a1.sources.s1.bind = node1 |
| a1.sources.s1.port = 44444 |
| |
| a1.sinks.k1.type = org.apache.spark.streaming.flume.sink.SparkSink |
| a1.sinks.k1.hostname = node1 |
| a1.sinks.k1.port = 8888 |
| a1.sinks.k1.batch-size = 1 |
| |
| a1.channels.c1.type = memory |
| a1.channels.c1.capacity = 1000 |
| a1.channels.c1.transactionCapacity = 100 |
| |
| a1.sources.s1.channels = c1 |
| a1.sinks.k1.channel = c1 |
- Use FlumeUtils.createPollingStream to pull the buffered data from the SparkSink (a multi-sink variant is sketched after this block):
| package SparkStreaming.flume |
| |
| import org.apache.spark.SparkConf |
| import org.apache.spark.storage.StorageLevel |
| import org.apache.spark.streaming.flume.FlumeUtils |
| import org.apache.spark.streaming.{Seconds, StreamingContext} |
| |
| object ByFlumePush { |
| def main(args: Array[String]): Unit = { |
| val conf = new SparkConf().setMaster("local[3]").setAppName("hdfs") |
| // batch interval of 10 seconds |
| val ssc: StreamingContext = new StreamingContext(conf, Seconds(10)) |
| // pull-based receiver: polls the SparkSink on node1:8888 for buffered events |
| val ds = FlumeUtils.createPollingStream(ssc, "node1", 8888, StorageLevel.MEMORY_ONLY) |
| ds.print() |
| ssc.start() |
| ssc.awaitTermination() |
| } |
| } |
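- createPollingStream also has an overload that takes a list of sink addresses, so a single receiver can poll several SparkSinks. A minimal sketch, under the assumption that a second agent exposes a SparkSink on node2:8888 (node2, the object name ByFlumePullMulti, and the storage level are illustrative choices, not from the original setup):
| package SparkStreaming.flume |
|  |
| import java.net.InetSocketAddress |
|  |
| import org.apache.spark.SparkConf |
| import org.apache.spark.storage.StorageLevel |
| import org.apache.spark.streaming.flume.FlumeUtils |
| import org.apache.spark.streaming.{Seconds, StreamingContext} |
|  |
| // Illustrative sketch (not from the original post): poll two SparkSinks with one receiver. |
| object ByFlumePullMulti { |
| def main(args: Array[String]): Unit = { |
| val conf = new SparkConf().setMaster("local[3]").setAppName("flume-pull-multi") |
| val ssc = new StreamingContext(conf, Seconds(10)) |
| // addresses of the SparkSinks to poll; node2:8888 is assumed, not part of the original configs |
| val addresses = Seq(new InetSocketAddress("node1", 8888), new InetSocketAddress("node2", 8888)) |
| // replicated, serialized storage gives the pulled data some fault tolerance |
| val ds = FlumeUtils.createPollingStream(ssc, addresses, StorageLevel.MEMORY_AND_DISK_SER_2) |
| ds.print() |
| ssc.start() |
| ssc.awaitTermination() |
| } |
| } |
Each additional address only requires another Flume agent configured with a SparkSink on that host and port.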
- [Note]:
Copy the Spark Streaming dependency jar into Flume's lib directory, and also put the spark-streaming-flume dependency jar into Flume's lib directory, so the SparkSink class is on Flume's classpath (the Spark documentation additionally lists spark-streaming-flume-sink_2.11, scala-library, and commons-lang3 as jars Flume needs for this sink).
| [root@node1 jars]# pwd |
| /opt/app/spark-2.3.1/jars |
| [root@node1 jars]# cp spark-streaming_2.11-2.3.1.jar /opt/app/flume-1.8.0/lib/ |
| [root@node1 data]# pwd |
| /opt/data |
| [root@node1 data]# cp ssc.jar /opt/app/flume-1.8.0/lib/ |
(ssc.jar is a jar built by someone else, saved at G://shixun//ssc.jar.)
| [root@node1 data]# flume-ng agent -n a1 -f portToSpark2.conf -Dflume.root.logger=INFO,console |
| [root@node1 data]# spark-submit --class SparkStreaming.flume.ByFlumePush ssc2.jar |
| [root@node1 data]# telnet node1 44444 |