Flume

Flume Basics

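Flume is a log collection system: it gathers data from a variety of places and delivers it to a destination you specify.
A Flume agent is built from sources, channels, and sinks: the source is where data comes from, the channel is the pipe that buffers it, and the sink is where it ends up. Once these three are configured, the agent can start collecting data.
To start an agent, run: bin/flume-ng agent -n a1 -c conf -f <your config file> -Dflume.root.logger=INFO,console
Flume is commonly paired with Kafka: Flume collects the data and Kafka stores it.
An event is the smallest unit of data Flume handles; each record collected is one event.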

Summary: after unpacking Flume, write a configuration file that defines the sources, channels, and sinks, then start the agent.

# example.conf: A single-node Flume configuration

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Describe the sink
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

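Common Flume Sources and Sinks

A source defines where data is read from, and there are many options, such as a port or a file. A sink defines where the data is written.

Avro (Source/Sink)

Connects multiple Flume agents to one another so that data can flow through several tiers.

Source: receives data from the previous tier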

a1.sources = r1
a1.channels = c1
a1.sources.r1.type = avro
a1.sources.r1.channels = c1
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 4141

Sink: sends data on to the next tier

a1.channels = c1
a1.sinks = k1
a1.sinks.k1.type = avro
a1.sinks.k1.channel = c1
a1.sinks.k1.hostname = 10.10.10.10
a1.sinks.k1.port = 4545
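
As a rough sketch of how the two halves fit together, the fragment below wires a first-tier agent's Avro sink to a second-tier agent's Avro source. The agent names (tier1, tier2), the host name collector-host, and port 4141 are placeholders chosen for illustration; the point is that the sink's hostname and port have to match the address the downstream source binds to.

```properties
# --- tier1.conf (hypothetical first-tier agent, runs where the data is produced) ---
tier1.sources = r1
tier1.channels = c1
tier1.sinks = k1

tier1.sources.r1.type = netcat
tier1.sources.r1.bind = localhost
tier1.sources.r1.port = 44444
tier1.sources.r1.channels = c1

tier1.channels.c1.type = memory

# Forward events to the second tier; host and port are assumptions
tier1.sinks.k1.type = avro
tier1.sinks.k1.hostname = collector-host
tier1.sinks.k1.port = 4141
tier1.sinks.k1.channel = c1

# --- tier2.conf (hypothetical second-tier agent, runs on collector-host) ---
tier2.sources = r1
tier2.channels = c1
tier2.sinks = k1

# Listen on the same port the tier1 sink sends to
tier2.sources.r1.type = avro
tier2.sources.r1.bind = 0.0.0.0
tier2.sources.r1.port = 4141
tier2.sources.r1.channels = c1

tier2.channels.c1.type = memory

tier2.sinks.k1.type = logger
tier2.sinks.k1.channel = c1
```

Starting tier2 first means its Avro source is already listening when tier1's sink tries to connect. Each agent is started with the -n value that matches its prefix (tier1 or tier2).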

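Exec (Source)

Collects data by running a command and reading its output. It is usually given a tail -F <file path> command so that lines appended to a log file are picked up.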

a1.sources = r1
a1.channels = c1
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/secure
a1.sources.r1.channels = c1

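Kafka (Source/Sink)

The Kafka source reads data from Kafka topics; it is effectively a Kafka consumer that reads messages from the topics.

The source has two ways of specifying which topics to consume.

The first is a comma-separated list of topics: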

a1.sources.r1.type = org.apache.flume.source.kafka.KafkaSource
a1.sources.r1.kafka.bootstrap.servers = localhost:9092
a1.sources.r1.kafka.topics = test1, test2
a1.sources.r1.batchSize = 5000
a1.sources.r1.batchDurationMillis = 2000
a1.sources.r1.kafka.consumer.group.id = test-consumer-group
a1.sources.r1.channels = c1

The second uses a regular expression to match multiple topics:

a1.sources.r1.type = org.apache.flume.source.kafka.KafkaSource
a1.sources.r1.kafka.bootstrap.servers = localhost:9092
a1.sources.r1.kafka.topics.regex = ^topic[0-9]$
a1.sources.r1.channels = c1

The sink can specify the topic and the partition that data is written to

a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.kafka.topic = topic
a1.sinks.k1.kafka.bootstrap.servers = localhost:9092
a1.sinks.k1.defaultPartitionId = 4
a1.sinks.k1.channel = c1
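
A fixed partition is not the only option. As a hedged sketch, the fragment below lets each event choose its partition through a header and falls back to a default otherwise. The header name partition is hypothetical (something upstream, such as an interceptor, would need to set it to a valid partition number), and the topic and broker address are reused from the example above.

```properties
a1.sinks = k1
a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.channel = c1
a1.sinks.k1.kafka.bootstrap.servers = localhost:9092
a1.sinks.k1.kafka.topic = topic
# Fallback partition for events that carry no partition header
a1.sinks.k1.defaultPartitionId = 0
# Hypothetical header name; its value is read as the target partition ID
a1.sinks.k1.partitionIdHeader = partition
# Properties prefixed with kafka.producer. are handed to the Kafka producer
a1.sinks.k1.kafka.producer.acks = 1
```

If neither defaultPartitionId nor partitionIdHeader is set, partitioning is left to the Kafka producer's own partitioner, which is usually what you want unless ordering within one partition matters.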

NetCat TCP (Source)

Listens on a TCP port, turns each line of text it receives into an event, and writes it to the channel

a1.sources = r1
a1.channels = c1
a1.sources.r1.type = netcat
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 6666
a1.sources.r1.channels = c1

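Spooling Directory (Source)

Watches a directory and turns each new file that appears there into events written to the channel. Files placed in the directory must be immutable: they must not be modified after being dropped in, and a file name must not be reused. If either rule is broken, Flume logs an error and stops processing.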

a1.channels = c1
a1.sources = r1

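# Subdirectories inside the monitored directory are not picked up by default
a1.sources.r1.type = spooldir
a1.sources.r1.channels = c1
# Path of the directory to monitor
a1.sources.r1.spoolDir = /var/log/apache/flumeSpool
a1.sources.r1.fileHeader = true
# Suffix appended to a file once it has been fully ingested
a1.sources.r1.fileSuffix = .ok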

Taildir (Source)

Tails multiple text files. Whenever new data is appended to a file matched by one of the file groups, the newly written lines are read.

a1.sources = r1
a1.channels = c1
a1.sources.r1.type = TAILDIR
a1.sources.r1.channels = c1
a1.sources.r1.positionFile = /var/log/flume/taildir_position.json
a1.sources.r1.filegroups = f1 f2
a1.sources.r1.filegroups.f1 = /var/log/test1/example.log
a1.sources.r1.headers.f1.headerKey1 = value1
a1.sources.r1.filegroups.f2 = /var/log/test2/.*log.*
a1.sources.r1.headers.f2.headerKey1 = value2
a1.sources.r1.headers.f2.headerKey2 = value2-2
a1.sources.r1.fileHeader = true
a1.sources.r1.maxBatchCount = 1000

HDFS (Sink)

Writes the collected data to HDFS

a1.channels = c1
a1.sinks = k1
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
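# Escapes such as %y-%m-%d are expanded from the event's timestamp header (or the local time if hdfs.useLocalTimeStamp = true) to name the HDFS directories
a1.sinks.k1.hdfs.path = /flume/events/%y-%m-%d/%H%M/%S
# Prefix and suffix for the files that are created
a1.sinks.k1.hdfs.filePrefix = FlumeData
a1.sinks.k1.hdfs.fileSuffix = xxx
# Prefix and suffix for files that are still being written
a1.sinks.k1.hdfs.inUsePrefix = xxx
a1.sinks.k1.hdfs.inUseSuffix = .tmp
# Roll the file after a fixed time, in seconds (0 disables this trigger)
a1.sinks.k1.hdfs.rollInterval = 30
# Roll the file once it reaches this size, in bytes (0 disables this trigger)
a1.sinks.k1.hdfs.rollSize = 1024
# Roll the file after this many events (0 disables this trigger)
a1.sinks.k1.hdfs.rollCount = 10
# Number of events written to the file before it is flushed to HDFS (in my tests this seemed to have little visible effect; only one flush was observed)
a1.sinks.k1.hdfs.batchSize = 100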
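
Two things about the HDFS sink are easy to miss: time escapes in the path need a timestamp, and the default output format is a SequenceFile rather than plain text. The fragment below is a minimal sketch of a sink that writes readable text files into 10-minute directories; the path is only an example.

```properties
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = /flume/events/%y-%m-%d/%H%M
# Use the agent's local clock for the escapes above instead of a timestamp header
a1.sinks.k1.hdfs.useLocalTimeStamp = true
# Round the timestamp down to a 10-minute boundary before expanding %H%M
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 10
a1.sinks.k1.hdfs.roundUnit = minute
# Write raw event bodies as text rather than the default SequenceFile
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.writeFormat = Text
```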

Hive (Sink)

Writes data into Hive

a1.channels = c1
a1.channels.c1.type = memory
a1.sinks = k1
a1.sinks.k1.type = hive
a1.sinks.k1.channel = c1
a1.sinks.k1.hive.metastore = thrift://127.0.0.1:9083
a1.sinks.k1.hive.database = logsdb
a1.sinks.k1.hive.table = weblogs
a1.sinks.k1.hive.partition = asia,%{country},%y-%m-%d-%H-%M
a1.sinks.k1.useLocalTimeStamp = false
a1.sinks.k1.round = true
a1.sinks.k1.roundValue = 10
a1.sinks.k1.roundUnit = minute
a1.sinks.k1.serializer = DELIMITED
a1.sinks.k1.serializer.delimiter = "\t"
a1.sinks.k1.serializer.serdeSeparator = "\t"
a1.sinks.k1.serializer.fieldnames =id,,msg

Logger (Sink)

Logs the collected events, which is how they end up on the console

a1.channels = c1
a1.sinks = k1
a1.sinks.k1.type = logger
a1.sinks.k1.channel = c1

HBase (Sink)

Writes the collected data into HBase

a1.channels = c1
a1.sinks = k1
a1.sinks.k1.type = hbase
a1.sinks.k1.table = foo_table
a1.sinks.k1.columnFamily = bar_cf
a1.sinks.k1.serializer = org.apache.flume.sink.hbase.RegexHbaseEventSerializer
a1.sinks.k1.channel = c1

Flume Channels

A channel holds data temporarily; it acts as a buffer between the source and the sink

File Channel

Uses files on disk as the staging area

a1.channels = c1
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /mnt/flume/checkpoint
a1.channels.c1.dataDirs = /mnt/flume/data

Memory Channel

Keeps the event queue in memory; the amount of memory it may use is configurable

a1.channels = c1
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 10000
a1.channels.c1.byteCapacityBufferPercentage = 20
a1.channels.c1.byteCapacity = 800000

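Collecting Files into HDFS with Flume

With the default settings the sink leaves too many small files in HDFS, so the roll parameters need to be adjusted to make the files larger

# type and path are required; all files are stored under this one directory
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /test/xxx
# Disable time-based rolling (seconds)
a1.sinks.k1.hdfs.rollInterval = 0
# Disable rolling by event count (events)
a1.sinks.k1.hdfs.rollCount = 0
# Roll the file once it reaches a given size (bytes); 134217728 is 128 MB, whereas 1024 is the default and still yields tiny files
a1.sinks.k1.hdfs.rollSize = 134217728
# Date/time escapes in the file name need a timestamp, so use the agent's local time
a1.sinks.k1.hdfs.filePrefix = %Y-%m-%d
a1.sinks.k1.hdfs.useLocalTimeStamp = true
# Number of events to collect before writing them to the file
a1.sinks.k1.hdfs.batchSize = 100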
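
The sink settings on their own are not a runnable agent. The following is a sketch of one possible complete configuration, assuming a Taildir source and a file channel; the log path /var/log/app/app.log and the channel directories are placeholders. It also sets hdfs.idleTimeout, because with size-only rolling a quiet file would otherwise stay open until it finally reaches the size limit.

```properties
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Tail a log file; the path is a placeholder
a1.sources.r1.type = TAILDIR
a1.sources.r1.positionFile = /var/log/flume/taildir_position.json
a1.sources.r1.filegroups = f1
a1.sources.r1.filegroups.f1 = /var/log/app/app.log
a1.sources.r1.channels = c1

# File channel, so buffered events survive an agent restart
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /mnt/flume/checkpoint
a1.channels.c1.dataDirs = /mnt/flume/data

a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = /test/xxx
a1.sinks.k1.hdfs.filePrefix = %Y-%m-%d
a1.sinks.k1.hdfs.useLocalTimeStamp = true
a1.sinks.k1.hdfs.fileType = DataStream
# Roll on size only (128 MB)
a1.sinks.k1.hdfs.rollInterval = 0
a1.sinks.k1.hdfs.rollCount = 0
a1.sinks.k1.hdfs.rollSize = 134217728
# Close a file that has received no events for 10 minutes (value in seconds)
a1.sinks.k1.hdfs.idleTimeout = 600
```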

Configuring an Interceptor

First, write the interceptor in Java

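import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

public class interceptor1 implements Interceptor {
    // Called once when the interceptor starts
    @Override
    public void initialize() {
        System.out.println("Interceptor started");
    }

    // Tag each event with an "id" header based on the first comma-separated field of its body
    @Override
    public Event intercept(Event event) {
        Map<String, String> headers = event.getHeaders();
        String body = new String(event.getBody(), StandardCharsets.UTF_8);
        if (body.split(",")[0].equals("0")) {
            headers.put("id", "zero");
        } else {
            headers.put("id", "rest");
        }
        System.out.println(headers + "---" + body);
        return event;
    }

    // Apply intercept(Event) to every event in the batch.
    // A fresh list is built per call; a shared static list would keep growing and re-emit old events.
    @Override
    public List<Event> intercept(List<Event> list) {
        List<Event> events = new ArrayList<>(list.size());
        for (Event e : list) {
            events.add(intercept(e));
        }
        return events;
    }

    // Called once when the interceptor shuts down
    @Override
    public void close() {
        System.out.println("Interceptor closed");
    }

    // Builder that Flume uses to create the interceptor; this is the class named in the configuration
    public static class Builder implements Interceptor.Builder {
        // Return a new interceptor instance
        @Override
        public Interceptor build() {
            return new interceptor1();
        }

        // Receives the interceptor's configuration properties; none are needed here
        @Override
        public void configure(Context context) { }
    }
}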

Once written, package it as a jar and put it in the flume/lib directory, then write the configuration file

# Declare the sources, channels, and sinks
a1.sources = r1
a1.channels = c1 c2
a1.sinks = k1 k2
# Source configuration
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F <fileName>
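# Interceptor configuration: the type is the fully qualified name of the interceptor's Builder class
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = <package of the Java interceptor>.interceptor1$Builder
# Multiplexing channel selector (routes events by the header set in the custom interceptor)
a1.sources.r1.selector.type = multiplexing
# Must be the header key set in the code ("id" in the interceptor above)
a1.sources.r1.selector.header = id
# selector.mapping.<header value set in the code> = <channel to route to>
a1.sources.r1.selector.mapping.rest = c1
a1.sources.r1.selector.mapping.zero = c2
# Channel configuration (give the channels enough capacity and transaction capacity)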
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 1000
a1.channels.c2.type = memory
a1.channels.c2.capacity = 1000
a1.channels.c2.transactionCapacity = 1000
# Sink configuration
a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.kafka.topic = order1
a1.sinks.k1.kafka.bootstrap.servers = master:9092
a1.sinks.k1.defaultPartitionId = 3
a1.sinks.k2.type = hdfs
a1.sinks.k2.hdfs.path = /user/test/flumebackup
a1.sinks.k2.hdfs.rollInterval = 30
a1.sinks.k2.hdfs.rollCount = 0
a1.sinks.k2.hdfs.rollSize = 0
# Wire the components together
a1.sources.r1.channels = c1 c2
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c2
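
One gap worth keeping in mind with a multiplexing selector: as far as I understand it, an event whose header is missing, or whose value matches none of the mappings, is not routed to any channel unless a default is configured. A minimal sketch, assuming c1 should act as the catch-all channel:

```properties
a1.sources.r1.selector.type = multiplexing
a1.sources.r1.selector.header = id
a1.sources.r1.selector.mapping.zero = c2
# Events whose "id" header is missing or has any other value fall back to c1
a1.sources.r1.selector.default = c1
```

With a default in place, an explicit mapping for every other header value becomes optional.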

posted @ 2022-06-29 20:42 耿集
