Flume Sources, Channels, and Sinks: Overview and Example Usage
Commonly used components; the official documentation lists many more#
I. Sources#
avro#
Mostly used to chain agents together in replicating (a1.sources.r1.selector.type = replicating), multiplexing (a1.sources.r1.selector.type = multiplexing), load-balancing, and failover (a1.sinkgroups.g1.processor.type = failover) topologies.
exec#
Typically used to tail the output of a command, e.g. monitoring a log file such as hive.log.
netcat#
Listens on a port and turns each line received into an event.
spooling directory#
Watches a directory and ingests the files placed in it; once a file is fully ingested it is renamed with the .COMPLETED suffix. Files must not be modified after being dropped into the directory, otherwise the source raises an error.
taildir#
Description:
Tails a set of files in near real time; the files may keep being appended to, and the file groups are specified with regular expressions.
The Taildir source maintains a JSON-format position file and periodically writes the latest read offset of each tailed file into it, so it can resume from where it left off after a restart.
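A minimal sketch of the properties involved (the paths are placeholders; case 4 below shows a complete agent configuration):

```
# Tail every file matching the regex; track read offsets in the JSON position file
a1.sources.r1.type = TAILDIR
a1.sources.r1.positionFile = /opt/module/flume/tail_dir.json
a1.sources.r1.filegroups = f1
a1.sources.r1.filegroups.f1 = /opt/module/flume/files/.*log.*
```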
II. Channels#
Memory Channel#
Kafka Channel (writes events straight into Kafka)#
File Channel#
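Neither the Kafka channel nor the file channel appears in the cases below. As a rough sketch (the broker address, topic name, and directories are placeholder assumptions), they are configured along these lines:

```
# Kafka channel: events are buffered in a Kafka topic rather than in the agent's memory
a1.channels.c1.type = org.apache.flume.channel.kafka.KafkaChannel
a1.channels.c1.kafka.bootstrap.servers = Linux201:9092
a1.channels.c1.kafka.topic = flume-channel
a1.channels.c1.parseAsFlumeEvent = false

# File channel: events are persisted on local disk, trading some throughput for durability
a1.channels.c2.type = file
a1.channels.c2.checkpointDir = /opt/module/flume/checkpoint
a1.channels.c2.dataDirs = /opt/module/flume/data
```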
III. Sinks#
avro#
Sends events to a downstream avro source; this is how agents are chained into multi-hop topologies.
File Roll#
Stores events on the local filesystem.
HBase#
hdfs#
Writes events to HDFS.
logger#
Logs events at INFO level, typically to the console; useful for testing and debugging.
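The HBase sink is the only sink above that does not appear in the cases below. A minimal sketch (the table name, column family, and serializer choice are assumptions, not taken from this article):

```
# Write events into an HBase table
a1.sinks.k1.type = hbase
a1.sinks.k1.table = flume_events
a1.sinks.k1.columnFamily = cf
# Optional: a serializer that maps event bodies to columns
a1.sinks.k1.serializer = org.apache.flume.sink.hbase.RegexHbaseEventSerializer
a1.sinks.k1.channel = c1
```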
IV. Case Studies#
1. Monitoring port data#
Requirement: use Flume to listen on a port, collect the data arriving on that port, and print it to the console.
(source: netcat, channel: memory, sink: logger)
```
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# netcat source: listen on localhost:44444
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# logger sink: print events to the console
a1.sinks.k1.type = logger

# memory channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
```
2. Monitoring a single appended file in real time#
Requirement: monitor the Hive log in real time and upload it to HDFS.
(source: exec, channel: memory, sink: hdfs)
```
a1.sources = r1
a1.channels = c1
a1.sinks = k1

a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /opt/module/hive-3.1.2/logs/hive.log

a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 10000

a1.sinks.k1.type = hdfs
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
# HDFS path the files are written to
a1.sinks.k1.hdfs.path = hdfs://Linux201:8020/flume/hive-events/%y-%m-%d/%H
# File name prefix
a1.sinks.k1.hdfs.filePrefix = logs-
# Roll to a new file every 60 s
a1.sinks.k1.hdfs.rollInterval = 60
# Roll to a new file once it reaches 128 MB
a1.sinks.k1.hdfs.rollSize = 134217728
# Never roll based on the number of events
a1.sinks.k1.hdfs.rollCount = 0
# Number of events written to HDFS per batch (unrelated to file rolling)
a1.sinks.k1.hdfs.batchSize = 100
# Plain output; DataStream is uncompressed (use CompressedStream plus hdfs.codeC for compression)
a1.sinks.k1.hdfs.fileType = DataStream
# Round down the timestamp used for the directory path
a1.sinks.k1.hdfs.round = true
# Create a new directory for every 1 roundUnit
a1.sinks.k1.hdfs.roundValue = 1
# Unit used for rounding
a1.sinks.k1.hdfs.roundUnit = hour
# Use the local time instead of a timestamp from the event header
a1.sinks.k1.hdfs.useLocalTimeStamp = true
```
3. Monitoring new files in a directory in real time#
Requirement: use Flume to watch a whole directory for new files and upload them to HDFS.
(source: spooldir, channel: memory, sink: hdfs)
```
a3.sources = r3
a3.sinks = k3
a3.channels = c3

# Describe/configure the source
a3.sources.r3.type = spooldir
a3.sources.r3.spoolDir = /opt/module/flume/upload
a3.sources.r3.fileSuffix = .COMPLETED
a3.sources.r3.fileHeader = true
# Ignore (do not upload) files ending in .tmp
a3.sources.r3.ignorePattern = ([^ ]*\.tmp)

# Describe the sink
a3.sinks.k3.type = hdfs
a3.sinks.k3.hdfs.path = hdfs://Linux201:8020/flume/upload/%Y%m%d/%H
# File name prefix
a3.sinks.k3.hdfs.filePrefix = upload-
# Round down the timestamp used for the directory path
a3.sinks.k3.hdfs.round = true
# Create a new directory for every 1 roundUnit
a3.sinks.k3.hdfs.roundValue = 1
# Unit used for rounding
a3.sinks.k3.hdfs.roundUnit = hour
# Use the local timestamp instead of one from the event header
a3.sinks.k3.hdfs.useLocalTimeStamp = true
# Number of events accumulated before flushing to HDFS
a3.sinks.k3.hdfs.batchSize = 100
# Plain output; DataStream is uncompressed
a3.sinks.k3.hdfs.fileType = DataStream
# Roll to a new file every 60 s
a3.sinks.k3.hdfs.rollInterval = 60
# Roll size just under the 128 MB HDFS block size
a3.sinks.k3.hdfs.rollSize = 134217700
# Never roll based on the number of events
a3.sinks.k3.hdfs.rollCount = 0

# Use a channel which buffers events in memory
a3.channels.c3.type = memory
a3.channels.c3.capacity = 1000
a3.channels.c3.transactionCapacity = 100

# Bind the source and sink to the channel
a3.sources.r3.channels = c3
a3.sinks.k3.channel = c3
```
4. Monitoring multiple appended files in a directory in real time#
Requirement: use Flume to watch a whole directory of files that are appended to in real time, and upload them to HDFS.
(source: taildir, channel: memory, sink: hdfs)
```
a3.sources = r3
a3.sinks = k3
a3.channels = c3

# Describe/configure the source
a3.sources.r3.type = TAILDIR
a3.sources.r3.positionFile = /opt/module/flume/tail_dir.json
a3.sources.r3.filegroups = f1 f2
a3.sources.r3.filegroups.f1 = /opt/module/flume/files/.*file.*
a3.sources.r3.filegroups.f2 = /opt/module/flume/files/.*log.*

# Describe the sink
a3.sinks.k3.type = hdfs
a3.sinks.k3.hdfs.path = hdfs://Linux201:8020/flume/upload2/%Y%m%d/%H
# File name prefix
a3.sinks.k3.hdfs.filePrefix = upload-
# Round down the timestamp used for the directory path
a3.sinks.k3.hdfs.round = true
# Create a new directory for every 1 roundUnit
a3.sinks.k3.hdfs.roundValue = 1
# Unit used for rounding
a3.sinks.k3.hdfs.roundUnit = hour
# Use the local timestamp instead of one from the event header
a3.sinks.k3.hdfs.useLocalTimeStamp = true
# Number of events accumulated before flushing to HDFS
a3.sinks.k3.hdfs.batchSize = 100
# Plain output; DataStream is uncompressed
a3.sinks.k3.hdfs.fileType = DataStream
# Roll to a new file every 60 s
a3.sinks.k3.hdfs.rollInterval = 60
# Roll size just under the 128 MB HDFS block size
a3.sinks.k3.hdfs.rollSize = 134217700
# Never roll based on the number of events
a3.sinks.k3.hdfs.rollCount = 0

# Use a channel which buffers events in memory
a3.channels.c3.type = memory
a3.channels.c3.capacity = 1000
a3.channels.c3.transactionCapacity = 100

# Bind the source and sink to the channel
a3.sources.r3.channels = c3
a3.sinks.k3.channel = c3
```
5. Replicating and multiplexing#
Requirement:
Flume-1 monitors a file for changes and passes the new content to Flume-2, which stores it in HDFS. At the same time, Flume-1 passes the new content to Flume-3, which writes it to the local filesystem.
flume1: (source: exec, channel: memory, sink: avro)
flume2: (source: avro, channel: memory, sink: hdfs)
flume3: (source: avro, channel: memory, sink: file_roll)
flume1:#
```
# Name the components on this agent
a1.sources = r1
a1.sinks = k1 k2
a1.channels = c1 c2

# Replicate the data flow to all channels
a1.sources.r1.selector.type = replicating

# Describe/configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /opt/module/hive/logs/hive.log
a1.sources.r1.shell = /bin/bash -c

# Describe the sinks
# An avro sink acts as a data sender
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = Linux201
a1.sinks.k1.port = 4141

a1.sinks.k2.type = avro
a1.sinks.k2.hostname = Linux201
a1.sinks.k2.port = 4142

# Describe the channels
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

a1.channels.c2.type = memory
a1.channels.c2.capacity = 1000
a1.channels.c2.transactionCapacity = 100

# Bind the source and sinks to the channels
a1.sources.r1.channels = c1 c2
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c2
```
flume2:#
```
# Name the components on this agent
a2.sources = r1
a2.sinks = k1
a2.channels = c1

# Describe/configure the source
# An avro source acts as a data-receiving service
a2.sources.r1.type = avro
a2.sources.r1.bind = Linux201
a2.sources.r1.port = 4141

# Describe the sink
a2.sinks.k1.type = hdfs
a2.sinks.k1.hdfs.path = hdfs://Linux201:8020/flume2/%Y%m%d/%H
# File name prefix
a2.sinks.k1.hdfs.filePrefix = flume2-
# Round down the timestamp used for the directory path
a2.sinks.k1.hdfs.round = true
# Create a new directory for every 1 roundUnit
a2.sinks.k1.hdfs.roundValue = 1
# Unit used for rounding
a2.sinks.k1.hdfs.roundUnit = hour
# Use the local timestamp instead of one from the event header
a2.sinks.k1.hdfs.useLocalTimeStamp = true
# Number of events accumulated before flushing to HDFS
a2.sinks.k1.hdfs.batchSize = 100
# Plain output; DataStream is uncompressed
a2.sinks.k1.hdfs.fileType = DataStream
# Roll to a new file every 600 s
a2.sinks.k1.hdfs.rollInterval = 600
# Roll size just under the 128 MB HDFS block size
a2.sinks.k1.hdfs.rollSize = 134217700
# Never roll based on the number of events
a2.sinks.k1.hdfs.rollCount = 0

# Describe the channel
a2.channels.c1.type = memory
a2.channels.c1.capacity = 1000
a2.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a2.sources.r1.channels = c1
a2.sinks.k1.channel = c1
```
flume3:#
Note: the local output directory must already exist; if it does not, the file_roll sink will not create it.
```
# Name the components on this agent
a3.sources = r1
a3.sinks = k1
a3.channels = c2

# Describe/configure the source
a3.sources.r1.type = avro
a3.sources.r1.bind = Linux201
a3.sources.r1.port = 4142

# Describe the sink
a3.sinks.k1.type = file_roll
a3.sinks.k1.sink.directory = /opt/module/datas/flume3

# Describe the channel
a3.channels.c2.type = memory
a3.channels.c2.capacity = 1000
a3.channels.c2.transactionCapacity = 100

# Bind the source and sink to the channel
a3.sources.r1.channels = c2
a3.sinks.k1.channel = c2
```
6. Load balancing and failover#
Requirement: Flume1 listens on one port; the sinks in its sink group connect to Flume2 and Flume3 respectively, and the FailoverSinkProcessor is used to provide failover.
flume1: (source: netcat, channel: memory, sink: avro)
flume2: (source: avro, channel: memory, sink: logger)
flume3: (source: avro, channel: memory, sink: logger)
flume1:#
```
# Name the components on this agent
a1.sources = r1
a1.channels = c1
a1.sinkgroups = g1
a1.sinks = k1 k2

# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

a1.sinkgroups.g1.processor.type = failover
a1.sinkgroups.g1.processor.priority.k1 = 5
a1.sinkgroups.g1.processor.priority.k2 = 10
a1.sinkgroups.g1.processor.maxpenalty = 10000

# Describe the sinks
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = Linux201
a1.sinks.k1.port = 4141

a1.sinks.k2.type = avro
a1.sinks.k2.hostname = Linux201
a1.sinks.k2.port = 4142

# Describe the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sinks to the channel
a1.sources.r1.channels = c1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c1
```
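The case title also mentions load balancing. To switch the same sink group from failover to load balancing, only the processor settings change; a minimal sketch (the selector can be round_robin or random):

```
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
# Load-balancing sink processor instead of failover
a1.sinkgroups.g1.processor.type = load_balance
a1.sinkgroups.g1.processor.backoff = true
a1.sinkgroups.g1.processor.selector = round_robin
```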
flume2:#
```
# Name the components on this agent
a2.sources = r1
a2.sinks = k1
a2.channels = c1

# Describe/configure the source
a2.sources.r1.type = avro
a2.sources.r1.bind = Linux201
a2.sources.r1.port = 4141

# Describe the sink
a2.sinks.k1.type = logger

# Describe the channel
a2.channels.c1.type = memory
a2.channels.c1.capacity = 1000
a2.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a2.sources.r1.channels = c1
a2.sinks.k1.channel = c1
```
flume3:#
```
# Name the components on this agent
a3.sources = r1
a3.sinks = k1
a3.channels = c2

# Describe/configure the source
a3.sources.r1.type = avro
a3.sources.r1.bind = Linux201
a3.sources.r1.port = 4142

# Describe the sink
a3.sinks.k1.type = logger

# Describe the channel
a3.channels.c2.type = memory
a3.channels.c2.capacity = 1000
a3.channels.c2.transactionCapacity = 100

# Bind the source and sink to the channel
a3.sources.r1.channels = c2
a3.sinks.k1.channel = c2
```
Use jps -ml to check the Flume processes.
7. Aggregation#
Requirement:
Flume-1 on Linux201 monitors the file /opt/module/group.log,
Flume-2 on Linux202 monitors the data stream arriving on a port,
Flume-1 and Flume-2 send their data to Flume-3 on Linux203, and Flume-3 prints the final data to the console.
flume1: (source: exec, channel: memory, sink: avro)
flume2: (source: netcat, channel: memory, sink: avro)
flume3: (source: avro, channel: memory, sink: logger)
flume1:
```
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /opt/module/group.log
a1.sources.r1.shell = /bin/bash -c

# Describe the sink
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = Linux203
a1.sinks.k1.port = 4141

# Describe the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
```
flume2:
```
# Name the components on this agent
a2.sources = r1
a2.sinks = k1
a2.channels = c1

# Describe/configure the source
a2.sources.r1.type = netcat
a2.sources.r1.bind = Linux202
a2.sources.r1.port = 44444

# Describe the sink
a2.sinks.k1.type = avro
a2.sinks.k1.hostname = Linux203
a2.sinks.k1.port = 4141

# Use a channel which buffers events in memory
a2.channels.c1.type = memory
a2.channels.c1.capacity = 1000
a2.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a2.sources.r1.channels = c1
a2.sinks.k1.channel = c1
```
flume3:
```
# Name the components on this agent
a3.sources = r1
a3.sinks = k1
a3.channels = c1

# Describe/configure the source
a3.sources.r1.type = avro
a3.sources.r1.bind = Linux203
a3.sources.r1.port = 4141

# Describe the sink
a3.sinks.k1.type = logger

# Describe the channel
a3.channels.c1.type = memory
a3.channels.c1.capacity = 1000
a3.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a3.sources.r1.channels = c1
a3.sinks.k1.channel = c1
```
V. Custom Interceptor#
Requirement#
Use Flume to collect local server logs; depending on the log type, different kinds of logs need to be sent to different analysis systems. In this case, records starting with a letter and records starting with a digit are sent to different consoles.
Analysis#
In real projects, a single server can produce many types of logs, and different types may need to go to different analysis systems. This is where Flume's multiplexing topology comes in: based on the value of a certain key in the event header, the multiplexing channel selector sends different events to different channels. We therefore need a custom interceptor that assigns a different header value to each type of event.
In this case we simulate logs with port data, using single digits and single letters to stand for the different log types. The custom interceptor distinguishes digits from letters and routes them to their respective analysis systems (channels).
Implementation steps#
(1) Create a Maven project and add the following dependency
```
<dependency>
    <groupId>org.apache.flume</groupId>
    <artifactId>flume-ng-core</artifactId>
    <version>1.9.0</version>
</dependency>
```
(2) Define a custom interceptor class that implements the Interceptor interface
```
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

import java.io.UnsupportedEncodingException;
import java.util.List;
import java.util.Map;

/**
 * Adds a different header depending on whether the first character
 * of the event body is a letter or a digit.
 */
public class MyInterceptor implements Interceptor {

    /**
     * Initialization method.
     */
    public void initialize() {
    }

    /**
     * Modify a single event: insert a header.
     *
     * @param event the event to modify
     * @return the modified event
     */
    public Event intercept(Event event) {
        // Get the event's headers and body
        Map<String, String> headers = event.getHeaders();
        byte[] body = event.getBody();  // the body is a byte array

        // Handle the event differently depending on the first character of the body
        String line = null;
        try {
            // Convert the byte array to a String; without "utf-8" the platform default charset is used
            line = new String(body, "utf-8");
        } catch (UnsupportedEncodingException e) {
            e.printStackTrace();
        }
        char first = line.charAt(0);
        if ((first >= 'a' && first <= 'z') || (first >= 'A' && first <= 'Z')) {
            // Letter
            headers.put("AAA", "XXX");
        } else if (first >= '0' && first <= '9') {
            // Digit
            headers.put("AAA", "YYY");
        } else {
            // Neither letter nor digit
            headers.put("AAA", "ZZZ");
        }
        return event;
    }

    /**
     * Modify a batch of events.
     *
     * @param events the events to modify
     * @return the modified events
     */
    public List<Event> intercept(List<Event> events) {
        for (Event event : events) {
            intercept(event);
        }
        return events;
    }

    /**
     * Release resources.
     */
    public void close() {
    }

    /**
     * Builder used by Flume to construct the interceptor instance.
     */
    public static class MyBuilder implements Interceptor.Builder {

        // Build the interceptor instance
        public Interceptor build() {
            return new MyInterceptor();
        }

        /**
         * Configuration method.
         *
         * @param context the configuration
         */
        public void configure(Context context) {
        }
    }
}
```
Package it into a jar and put it under Flume's lib directory.
(3) Edit the Flume configuration files (a1.sources.r1.interceptors.i1.type = MyInterceptor$MyBuilder; use the fully qualified class name of the Builder).
Configure Flume1 on Linux201 with one netcat source and three avro sinks (one per channel), together with the corresponding channel selector and interceptor.
```
a1.sources = r1
a1.sinks = k1 k2 k3
a1.channels = c1 c2 c3

a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = MyInterceptor$MyBuilder
a1.sources.r1.selector.type = multiplexing
a1.sources.r1.selector.header = AAA
a1.sources.r1.selector.mapping.XXX = c1
a1.sources.r1.selector.mapping.YYY = c2
a1.sources.r1.selector.mapping.ZZZ = c3

a1.sinks.k1.type = avro
a1.sinks.k1.hostname = Linux201
a1.sinks.k1.port = 4141

a1.sinks.k2.type = avro
a1.sinks.k2.hostname = Linux202
a1.sinks.k2.port = 4242

a1.sinks.k3.type = avro
a1.sinks.k3.hostname = Linux203
a1.sinks.k3.port = 4343

a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

a1.channels.c2.type = memory
a1.channels.c2.capacity = 1000
a1.channels.c2.transactionCapacity = 100

a1.channels.c3.type = memory
a1.channels.c3.capacity = 1000
a1.channels.c3.transactionCapacity = 100

a1.sources.r1.channels = c1 c2 c3
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c2
a1.sinks.k3.channel = c3
```
Configure Flume2 on Linux201 with one avro source and one logger sink.
```
a2.sources = r1
a2.sinks = k1
a2.channels = c1

a2.sources.r1.type = avro
a2.sources.r1.bind = Linux201
a2.sources.r1.port = 4141

a2.sinks.k1.type = logger

a2.channels.c1.type = memory
a2.channels.c1.capacity = 1000
a2.channels.c1.transactionCapacity = 100

a2.sinks.k1.channel = c1
a2.sources.r1.channels = c1
```
Configure Flume3 on Linux202 with one avro source and one logger sink.
```
a3.sources = r1
a3.sinks = k1
a3.channels = c1

a3.sources.r1.type = avro
a3.sources.r1.bind = Linux202
a3.sources.r1.port = 4242

a3.sinks.k1.type = logger

a3.channels.c1.type = memory
a3.channels.c1.capacity = 1000
a3.channels.c1.transactionCapacity = 100

a3.sinks.k1.channel = c1
a3.sources.r1.channels = c1
```
Configure Flume4 on Linux203 with one avro source and one logger sink.
```
a4.sources = r1
a4.sinks = k1
a4.channels = c1

a4.sources.r1.type = avro
a4.sources.r1.bind = Linux203
a4.sources.r1.port = 4343

a4.sinks.k1.type = logger

a4.channels.c1.type = memory
a4.channels.c1.capacity = 1000
a4.channels.c1.transactionCapacity = 100

a4.sinks.k1.channel = c1
a4.sources.r1.channels = c1
```
(4) Start the Flume processes on Linux201 (two agents), Linux202, and Linux203, paying attention to the start-up order.
```
Linux201:
bin/flume-ng agent -c conf/ -n a1 -f job/flume-file-flume
bin/flume-ng agent -c conf/ -n a2 -f job/flume-flume-console1 -Dflume.root.logger=INFO,console

Linux202:
bin/flume-ng agent -c conf/ -n a3 -f job/flume-flume-console2 -Dflume.root.logger=INFO,console

Linux203:
bin/flume-ng agent -c conf/ -n a4 -f job/flume-flume-console3 -Dflume.root.logger=INFO,console
```
(5) On Linux201, use netcat to send letters and digits to localhost:44444.
(6) Observe the logs printed on Linux201, Linux202, and Linux203.
VI. Custom Source#
The official documentation also describes the interface for custom sources:
https://flume.apache.org/FlumeDeveloperGuide.html#source
According to it, a custom source (MySource) needs to extend the AbstractSource class and implement the Configurable and PollableSource interfaces.
Typical use case: reading data from MySQL or some other system.
1. Requirement#
Use Flume to receive data, add a prefix to every record, and print it to the console. The prefix is configurable in the Flume configuration file.
2. Methods to implement#
getBackOffSleepIncrement(): the increment added to the back-off time after each failed poll
getMaxBackOffSleepInterval(): the maximum back-off time between polls
configure(Context context): initialization from the context (reads the configuration file)
process(): fetches data, wraps it into events, and writes them to the channel; this method is called in a loop
3. Code#
(1) Add the POM dependency
```
<dependencies>
    <dependency>
        <groupId>org.apache.flume</groupId>
        <artifactId>flume-ng-core</artifactId>
        <version>1.9.0</version>
    </dependency>
</dependencies>
```
(2) Write the code
```
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.PollableSource;
import org.apache.flume.channel.ChannelProcessor;
import org.apache.flume.conf.Configurable;
import org.apache.flume.event.SimpleEvent;
import org.apache.flume.source.AbstractSource;

/**
 * @version 1.0
 * @Author: zls
 * @Date: 2020/10/10 12:40
 * @Desc:
 */
public class MySource extends AbstractSource implements Configurable, PollableSource {

    // Prefix added to every event body
    private String prefix;
    // Interval between generated events
    private Long interval;

    /**
     * Called by the framework to pull data and process it.
     *
     * @return the processing status
     * @throws EventDeliveryException
     */
    public Status process() throws EventDeliveryException {
        Status status = null;

        // Get the ChannelProcessor
        ChannelProcessor channelProcessor = getChannelProcessor();
        try {
            // Build an event and hand it to the channel(s)
            Event e = getSomeData();
            channelProcessor.processEvent(e);
            status = Status.READY;
        } catch (Exception e) {
            // Handle the exception
            e.printStackTrace();
            status = Status.BACKOFF;
        }
        return status;
    }

    /**
     * How this custom source obtains its data.
     *
     * @return the generated event
     */
    private Event getSomeData() throws InterruptedException {
        Event event = new SimpleEvent();
        event.setBody((prefix + "Test content").getBytes());
        Thread.sleep(interval);
        return event;
    }

    /**
     * Increment added to the back-off sleep time after each failure.
     *
     * @return the increment in milliseconds
     */
    public long getBackOffSleepIncrement() {
        return 1000;
    }

    /**
     * Maximum back-off sleep time between polls.
     *
     * @return the maximum back-off in milliseconds
     */
    public long getMaxBackOffSleepInterval() {
        return 10000;
    }

    /**
     * Configure the custom source.
     *
     * @param context the configuration
     */
    public void configure(Context context) {
        prefix = context.getString("XXX", "DD");
        interval = context.getLong("YYY", 500L);
    }
}
```
4. Testing#
(1) Package
Package the code into a jar and put it into Flume's lib directory.
(2) Write the configuration file (a1.sources.r1.type = MySource; use the fully qualified class name)
```
a1.sources = r1
a1.channels = c1
a1.sinks = k1

a1.sources.r1.type = MySource
a1.sources.r1.XXX = Myprefix
a1.sources.r1.YYY = 1000

a1.sinks.k1.type = logger

a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
```
(3) Start Flume and test
```
[zls@Linux201 flume]$ bin/flume-ng agent -c conf/ -f job/mysource.conf -n a1 -Dflume.root.logger=INFO,console
```
Result: a record with the configured prefix is printed every 1 s (the interval configured above).
VII. Custom Sink#
The official documentation also describes the interface for custom sinks:
https://flume.apache.org/FlumeDeveloperGuide.html#sink
According to it, a custom sink (MySink) needs to extend the AbstractSink class and implement the Configurable interface.
Methods to implement:
configure(Context context): initialization from the context (reads the configuration file)
process(): takes data (events) from the channel; this method is called in a loop
Typical use case: reading data from the channel and writing it to MySQL or some other system.
Requirement#
Use Flume to receive data and, in the sink, add a prefix and a suffix to every record before printing it to the console. The prefix and suffix are configurable in the Flume job configuration file.
Code#
```
import org.apache.flume.*;
import org.apache.flume.conf.Configurable;
import org.apache.flume.sink.AbstractSink;

import java.io.IOException;

/**
 * @version 1.0
 * @Author: zls
 * @Date: 2020/10/10 15:07
 * @Desc: print every event it takes to the console
 */
public class MySink extends AbstractSink implements Configurable {

    private String prefix;
    private String suffix;

    /**
     * The sink pulls data from the channel and processes it.
     *
     * @return the processing status
     * @throws EventDeliveryException
     */
    public Status process() throws EventDeliveryException {
        Status status = null;

        // Get the channel bound to this sink
        Channel channel = getChannel();

        // Transaction handling
        Transaction transaction = channel.getTransaction();
        transaction.begin();
        try {
            // 1. Take an event from the channel
            Event take = channel.take();
            // 2. Write the event to the sink's destination
            storeSomeData(take);

            status = Status.READY;
            transaction.commit();
        } catch (Exception e) {
            // Handle the exception
            status = Status.BACKOFF;
            transaction.rollback();
        } finally {
            transaction.close();
        }
        return status;
    }

    /**
     * Process (store or consume) the event.
     *
     * @param take the event taken from the channel
     */
    private void storeSomeData(Event take) throws IOException, InterruptedException {
        if (take != null) {
            System.out.print(prefix);
            System.out.write(take.getBody());
            System.out.println(suffix);
        } else {
            Thread.sleep(5000L);
        }
    }

    /**
     * Configuration method: used to configure the sink.
     *
     * @param context the configuration
     */
    public void configure(Context context) {
        prefix = context.getString("XXX", "DPre");
        suffix = context.getString("YYY", "DSuf");
    }
}
```
Testing#
(1) Package
Package the code into a jar and put it into Flume's lib directory.
(2) Configuration file
```
a1.sources = r1
a1.sinks = k1
a1.channels = c1

a1.sources.r1.type = netcat
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 44444

a1.sinks.k1.type = MySink
a1.sinks.k1.XXX = zls:
a1.sinks.k1.YYY = :zls

a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
```
(3) Start the job
```
[zls@Linux201 flume]$ bin/flume-ng agent -c conf/ -f job/mysink.conf -n a1 -Dflume.root.logger=INFO,console

[zls@Linux201 ~]$ nc localhost 44444
1
OK
hello
OK
```
Result: each line received is printed to the console wrapped with the configured prefix and suffix.