Flume Deployment and Basic Usage
1. Flume Deployment and Basic Usage
- Deployment
1) tar -zxvf apache-flume-1.9.0-bin.tar.gz -C /app, which extracts the tarball into the /app directory
2) Add the environment variables:
export FLUME_HOME=/app/apache-flume-1.9.0-bin
export PATH=$PATH:$FLUME_HOME/bin
3) In $FLUME_HOME/conf, rename flume-env.sh.template to flume-env.sh and set JAVA_HOME:
cd $FLUME_HOME/conf
mv flume-env.sh.template flume-env.sh
vim flume-env.sh -> export JAVA_HOME=/opt/lagou/servers/jdk1.8.0_231
4) Tune the Flume heap size to limit GC pressure. Xms and Xmx are usually set to the same value so the heap does not resize at runtime, which would otherwise cause performance-degrading memory churn:
vim flume-env.sh -> export JAVA_OPTS="-Xms2048m -Xmx2048m -Dcom.sun.management.jmxremote"
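After these steps, the installation can be sanity-checked; this assumes the variables above were added to a profile script such as /etc/profile:
source /etc/profile
flume-ng version    # should report Flume 1.9.0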
- Basic usage
Common sources: exec source, netcat source, kafka source, taildir source. The taildir source is the most commonly used: it can tail multiple files and provides high reliability with no data loss, because it persists read offsets in a position file (see the sketch after this list).
Common channels: memory channel, file channel, kafka channel, jdbc channel
Common sinks: HDFS sink, Hive sink, kafka sink
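As a point of comparison with the Kafka source used in the test case below, a minimal taildir source sketch looks like the following; the position-file path and log pattern are placeholders:
a1.sources.r1.type = TAILDIR
a1.sources.r1.positionFile = /app/flume/taildir_position.json
a1.sources.r1.filegroups = f1
a1.sources.r1.filegroups.f1 = /var/log/app/.*\.log
The positionFile records the read offset of every tailed file, which is what makes the source resumable and lossless across agent restarts.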
Test case: consume data from Kafka and write it to HDFS:
a1.sources = r1
a1.channels = c1
a1.sinks = k1
a1.sources.r1.type = org.apache.flume.source.kafka.KafkaSource
a1.sources.r1.kafka.bootstrap.servers = node01:9092,node02:9092,node03:9092
a1.sources.r1.kafka.topics = event_test_topic
a1.sources.r1.kafka.consumer.group.id = custom.g.id
a1.sources.r1.kafka.consumer.auto.offset.reset = earliest
# Interceptor configuration (custom interceptor, defined in section 3)
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = org.bigdata.com.interceptor.CustomerInterceptor$Builder
# Channel configuration (transactionCapacity must not exceed capacity)
a1.channels.c1.type = memory
a1.channels.c1.capacity = 2000
a1.channels.c1.transactionCapacity = 1000
# Sink configuration
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /flume/log/p_dymd=%{log_time}
a1.sinks.k1.hdfs.filePrefix = event
a1.sinks.k1.hdfs.rollCount=0
a1.sinks.k1.hdfs.rollSize=134217728
a1.sinks.k1.hdfs.rollInterval=7200
a1.sinks.k1.hdfs.minBlockReplicas = 1
a1.sinks.k1.hdfs.threadsPoolSize = 10
a1.sinks.k1.hdfs.callTimeout = 20000
# Compression (uncomment the two lines below; hdfs.codeC requires fileType = CompressedStream)
# a1.sinks.k1.hdfs.codeC = bzip2
# a1.sinks.k1.hdfs.fileType = CompressedStream
a1.sinks.k1.hdfs.fileType = DataStream
# Bind source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
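Assuming the agent definition above is saved as kafka-to-hdfs.conf (the file name is only an example), the agent can be started with:
flume-ng agent --name a1 --conf $FLUME_HOME/conf --conf-file kafka-to-hdfs.conf -Dflume.root.logger=INFO,console
--name must match the agent prefix used in the file (a1), and -Dflume.root.logger sends the agent's logs to the console for easier debugging.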
2. Handling Too Many Small Files When Writing to HDFS
Impact:
1) Metadata: each small file has its own metadata entry (file path, file name, owner, group, permissions, creation time, etc.), all of which is held in NameNode memory. Each entry takes roughly 150 bytes, so a large number of small files consumes substantial NameNode heap (10 million files already amount to about 1.5 GB of metadata) and degrades NameNode performance and stability.
2) Computation: MapReduce launches one map task per file, so many small files hurt job performance and also increase addressing overhead.
Mitigation:
1) With the official defaults for hdfs.rollInterval, hdfs.rollSize, and hdfs.rollCount, writing to HDFS produces many small files. Tune each of them to the actual workload, for example: hdfs.rollInterval=3600, hdfs.rollSize=134217728, hdfs.rollCount=0, hdfs.roundValue=3600, hdfs.roundUnit=second (see the property sketch below).
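Expressed as sink properties, the tuning above would be written as follows; note that hdfs.roundValue/hdfs.roundUnit only take effect when hdfs.round is enabled, and they round the timestamp used in hdfs.path rather than the roll trigger:
a1.sinks.k1.hdfs.rollInterval = 3600
a1.sinks.k1.hdfs.rollSize = 134217728
a1.sinks.k1.hdfs.rollCount = 0
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 3600
a1.sinks.k1.hdfs.roundUnit = second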
3. Custom Flume Interceptor
package org.bigdata.com.interceptor;

import com.alibaba.fastjson.JSON;
import com.alibaba.fastjson.JSONObject;
import com.google.common.base.Charsets;
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/**
 * Custom interceptor: parses the event body as JSON, extracts the
 * server_time field, and writes the formatted date into the "log_time"
 * header so the HDFS sink can partition by it via %{log_time}.
 *
 * @author shydow
 * @date 2021-04-14
 */
public class CustomerInterceptor implements Interceptor {

    // Note: SimpleDateFormat is not thread-safe; this is fine as long as
    // a single source thread invokes intercept().
    private static SimpleDateFormat format = new SimpleDateFormat("yyyy-MM-dd");

    @Override
    public void initialize() {
    }

    @Override
    public Event intercept(Event event) {
        // Read the event body
        String eventBody = new String(event.getBody(), Charsets.UTF_8);
        // Read the headers
        Map<String, String> headers = event.getHeaders();
        try {
            JSONObject jsonObject = JSON.parseObject(eventBody);
            String trigger_time = jsonObject.getString("server_time");
            // server_time is an epoch-millisecond string; format it as yyyy-MM-dd
            headers.put("log_time", format.format(Long.parseLong(trigger_time)));
            event.setHeaders(headers);
        } catch (Exception e) {
            // Malformed events go to an "unknown" partition instead of being dropped
            headers.put("log_time", "unknown");
            event.setHeaders(headers);
        }
        return event;
    }

    @Override
    public List<Event> intercept(List<Event> list) {
        ArrayList<Event> out = new ArrayList<>();
        for (Event event : list) {
            Event outEvent = intercept(event);
            if (null != outEvent) {
                out.add(outEvent);
            }
        }
        return out;
    }

    @Override
    public void close() {
    }

    public static class Builder implements Interceptor.Builder {
        @Override
        public Interceptor build() {
            return new CustomerInterceptor();
        }

        @Override
        public void configure(Context context) {
        }
    }
}
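To deploy the interceptor, package the class into a jar and put it on the Flume classpath, either in $FLUME_HOME/lib or under a plugins.d directory; the artifact name below is only an example:
mvn clean package
cp target/custom-interceptor-1.0.jar $FLUME_HOME/lib/
After that, the fully qualified Builder class (org.bigdata.com.interceptor.CustomerInterceptor$Builder) can be referenced from the agent configuration, as done in the test case in section 1.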