Flume Study Notes
Chapter 1: Flume Overview
- Flume is a highly available, highly reliable, distributed system provided by Cloudera for collecting, aggregating, and transporting massive volumes of log data. Flume is based on a streaming architecture and is flexible and simple.
- Basic architecture
- An Agent is a JVM process that moves data from a source to a destination in the form of events. It consists of three main parts: Source, Channel, and Sink
- The Source collects data into Flume and supports many types: TAILDIR, avro, exec, spooldir, netcat
TAILDIR: monitors multiple dynamically changing files at the same time, supports resuming from a checkpoint, and does not lose data
exec: runs a command, e.g. tail -F file_name
spooldir: monitors a directory for new files
netcat: listens on a port for data
- The Sink continuously polls the Channel for events and removes them in batches, writing them to a storage system or to another Flume agent. Supported types include: hdfs, logger, avro, file, HBase
- The Channel is a buffer between the Source and the Sink, so the Source and the Sink can run at different rates. A Channel is thread-safe and can handle writes from several Sources and reads from several Sinks at the same time. Flume ships with Memory Channel and File Channel, as well as Kafka Channel
- Event: the basic unit of data transfer in Flume; data travels from source to destination as Events. An Event consists of a Header and a Body; the Header stores the event's attributes as key-value pairs (see the sketch after this list)
- Sink processor types: load balancing, failover, and the default (a single sink); load balancing and failover require configuring sink groups
- Load balancing policies: round-robin, random, ...
- Failover: based on sink priorities
- Drawback: Flume agents cannot be added dynamically; the configuration must be changed and the agent restarted
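A minimal Java sketch of the Event structure described above, built with Flume's EventBuilder; the header key "timestamp" and the body text are only illustrative:

import java.util.HashMap;
import java.util.Map;

import org.apache.flume.Event;
import org.apache.flume.event.EventBuilder;

public class EventDemo {
    public static void main(String[] args) {
        // Header: key-value attributes describing the event
        Map<String, String> headers = new HashMap<>();
        headers.put("timestamp", String.valueOf(System.currentTimeMillis()));
        // Body: the payload, carried as a byte array
        Event event = EventBuilder.withBody("hello flume".getBytes(), headers);
        System.out.println(event.getHeaders() + " / " + new String(event.getBody()));
    }
}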
Chapter 2: Flume Quick Start
Configuration reference: the official user guide, http://flume.apache.org/releases/content/1.7.0/FlumeUserGuide.html
Installation and deployment
- Extract the tarball and set JAVA_HOME to its absolute path in flume-env.sh
Case 1: Monitoring a port
- Install the netcat network tool: sudo yum install -y nc. Test: run nc -l localhost 44444 to listen on port 44444 as the server; on another machine, run nc localhost102 44444 to send data to that port; check whether the port is in use with sudo netstat -tunlp | grep 44444
- Create a jobs directory under the Flume home and add the job file netcat-flume-logger.conf, which listens for port data and prints it to the console (netcat source, logger sink, memory channel)
- A job file has five parts: component names, source, sink, channel, and bindings
# example.conf: A single-node Flume configuration
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
# Describe the sink
a1.sinks.k1.type = logger
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
- Start Flume listening on the port: flume-ng agent -c conf -f /opt/module/hive/jobs/xxxx.conf -n a1 -Dflume.root.logger=INFO,console
- Use netcat to send data to port 44444 on the local machine; the data shows up in the Flume console
Case 2: Monitoring a single file
- Monitor the Hive log in real time and upload it to HDFS. The relevant Hadoop jars (needed by the HDFS sink) must be copied into flume/lib first
- Configure the job file file-flume-logger.conf to monitor a single file and print it to the console (exec source running tail -F, logger sink)
- Run Flume: bin/flume-ng agent -c conf/ -n a1 -f jobs/file-flume-logger.conf
- Configure the job file file-flume-hdfs.conf to monitor a single file and upload it to HDFS (exec source running tail -F, hdfs sink)
- Run Flume: bin/flume-ng agent -c conf/ -n a2 -f jobs/file-flume-hdfs.conf
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /opt/module/hive/logs/hive.log
# Describe the sink
a1.sinks.k1.type = logger
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
# Name the components on this agent
a2.sources = r2
a2.sinks = k2
a2.channels = c2
# Describe/configure the source
a2.sources.r2.type = exec
a2.sources.r2.command = tail -F /opt/module/hive/logs/hive.log
# Describe the sink
a2.sinks.k2.type = hdfs
a2.sinks.k2.hdfs.path = hdfs://localhost102:9000/flume/%Y%m%d/%H
# Prefix for uploaded files
a2.sinks.k2.hdfs.filePrefix = logs
# Whether to round down the event time for the folder path
a2.sinks.k2.hdfs.round = true
# Number of time units per new folder
a2.sinks.k2.hdfs.roundValue = 1
# The time unit used for rounding
a2.sinks.k2.hdfs.roundUnit = hour
# Whether to use the local timestamp
a2.sinks.k2.hdfs.useLocalTimeStamp = true
# Number of Events to accumulate before flushing to HDFS
a2.sinks.k2.hdfs.batchSize = 1000
# File type; compressed formats are also supported
a2.sinks.k2.hdfs.fileType = DataStream
# Interval (seconds) after which a new file is rolled
a2.sinks.k2.hdfs.rollInterval = 30
# Roll size of each file in bytes
a2.sinks.k2.hdfs.rollSize = 134217700
# Rolling does not depend on the number of Events
a2.sinks.k2.hdfs.rollCount = 0
# Use a channel which buffers events in memory
a2.channels.c2.type = memory
a2.channels.c2.capacity = 1000
a2.channels.c2.transactionCapacity = 100
# Bind the source and sink to the channel
a2.sources.r2.channels = c2
a2.sinks.k2.channel = c2
Case 3: Monitoring a directory for new files and uploading them to HDFS
- Configure the job file dir-flume-hdfs.conf to monitor a directory and upload new files to HDFS (spooldir source, hdfs sink)
- Run Flume: bin/flume-ng agent -c conf/ -n a3 -f jobs/dir-flume-hdfs.conf
- Notes:
- When using the Spooling Directory Source, do not create files in the monitored directory and keep writing to them
- Do not drop in files whose names have already been processed; their contents are still uploaded to HDFS, but Flume cannot rename them locally (append the .COMPLETED suffix)
- The directory is scanned every 500 ms
# Name the components on this agent
a3.sources = r3
a3.sinks = k3
a3.channels = c3
# Describe/configure the source
a3.sources.r3.type = spooldir
a3.sources.r3.spoolDir = /opt/module/flume/upload
a3.sources.r3.fileSuffix = .COMPLETED
a3.sources.r3.fileHeader = true
# Ignore all files ending in .tmp; do not upload them
a3.sources.r3.ignorePattern = ([^ ]*\.tmp)
# Describe the sink
a3.sinks.k3.type = hdfs
a3.sinks.k3.hdfs.path = hdfs://localhost102:9000/flume/%Y%m%d/%H
# Prefix for uploaded files
a3.sinks.k3.hdfs.filePrefix = upload-
# Whether to round down the event time for the folder path
a3.sinks.k3.hdfs.round = true
# Number of time units per new folder
a3.sinks.k3.hdfs.roundValue = 1
# The time unit used for rounding
a3.sinks.k3.hdfs.roundUnit = hour
# Whether to use the local timestamp
a3.sinks.k3.hdfs.useLocalTimeStamp = true
# Number of Events to accumulate before flushing to HDFS
a3.sinks.k3.hdfs.batchSize = 100
# File type; compressed formats are also supported
a3.sinks.k3.hdfs.fileType = DataStream
# Interval (seconds) after which a new file is rolled
a3.sinks.k3.hdfs.rollInterval = 60
# Roll size of each file in bytes, roughly 128 MB
a3.sinks.k3.hdfs.rollSize = 134217700
# Rolling does not depend on the number of Events
a3.sinks.k3.hdfs.rollCount = 0
# Use a channel which buffers events in memory
a3.channels.c3.type = memory
a3.channels.c3.capacity = 1000
a3.channels.c3.transactionCapacity = 100
# Bind the source and sink to the channel
a3.sources.r3.channels = c3
a3.sinks.k3.channel = c3
Case 4: Monitoring multiple appended files with TAILDIR
- Supports resuming from a checkpoint (via the position file) and can monitor multiple dynamically changing files (TAILDIR source, logger sink)
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = TAILDIR
a1.sources.r1.filegroups = f1 f2
a1.sources.r1.filegroups.f1 = /opt/module/flume/files/file1.txt
a1.sources.r1.filegroups.f2 = /opt/module/flume/files/file2.txt
a1.sources.r1.positionFile = /opt/module/flume/position/position.json
# Describe the sink
a1.sinks.k1.type = logger
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Chapter 3: Flume Advanced Topics
Flume transactions
- Put transaction: doPut writes the incoming data into the put buffer, doCommit commits the transaction, doRollback rolls it back
- Take transaction: doTake pulls data from the channel into the take buffer, doCommit commits the transaction, doRollback rolls it back (sketched below)
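A minimal Java sketch of the take transaction as seen from a sink's process() method (the put side is driven symmetrically by the ChannelProcessor on the source side); the class name MySink is only illustrative:

import org.apache.flume.Channel;
import org.apache.flume.Event;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.Transaction;
import org.apache.flume.sink.AbstractSink;

public class MySink extends AbstractSink {
    @Override
    public Status process() throws EventDeliveryException {
        Channel channel = getChannel();
        Transaction tx = channel.getTransaction();
        tx.begin();
        try {
            // doTake: pull an event from the channel into the take buffer
            Event event = channel.take();
            if (event != null) {
                // write event.getBody() to the external store here
            }
            // doCommit: the take buffer is cleared and the events leave the channel
            tx.commit();
            return event != null ? Status.READY : Status.BACKOFF;
        } catch (Throwable t) {
            // doRollback: the buffered events are put back into the channel
            tx.rollback();
            throw new EventDeliveryException("Failed to deliver event", t);
        } finally {
            tx.close();
        }
    }
}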
How a Flume Agent works internally
- The Channel Selector decides which Channel(s) an Event is sent to
- Replicating (copy the event to every channel)
- Multiplexing (route by event header; see the configuration sketch after this list)
- SinkProcessor
- DefaultSinkProcessor: a single sink
- LoadBalancingSinkProcessor: load balancing
- FailoverSinkProcessor: failover
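A minimal configuration sketch of a multiplexing selector; the header name "type" and the mapping values are only illustrative (an interceptor, such as the one in the custom Interceptor section below, would set that header):

a1.sources.r1.selector.type = multiplexing
a1.sources.r1.selector.header = type
a1.sources.r1.selector.mapping.letter = c1
a1.sources.r1.selector.mapping.other = c2
a1.sources.r1.selector.default = c2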
Flume topologies
- Simple chain (agents in series)
- Replication and multiplexing
- Load balancing and failover
- Aggregation
Enterprise development cases
- 1. Replication and multiplexing (multiplexing requires a custom Interceptor): copy file changes to both HDFS and the local filesystem
- Selector types: replicating (the default) and multiplexing
- flume1 monitors the file changes: TAILDIR source, avro sinks
- flume2 reads the events and writes them to HDFS: avro source, hdfs sink
- flume3 writes them to the local filesystem: avro source, file_roll sink
- 2. Load balancing and failover: one channel, multiple sinks; a sink group must be configured (see the sink-group sketch after this list)
- flume1 monitors a port: netcat source, multiple avro sinks in a sink group of type load_balance or failover
- flume2 and flume3 receive the events from this channel, distributed according to the load-balancing policy or the failover priorities
- 3. Aggregation: data from multiple sources is sent to port(s) on the same machine and aggregated by one (or more) sources
- flume1 monitors file changes: TAILDIR source, avro sink, acting as a client that writes to a given host and port
- flume2 monitors a port: netcat source, avro sink
- flume3 does the aggregation: avro source listening on one or more ports, with a logger, hdfs, or file_roll sink
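A minimal sink-group sketch for case 2, assuming an agent a1 whose sinks k1 and k2 are avro sinks pointing at flume2 and flume3; switch processor.type between failover and load_balance as needed:

a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
# failover: the sink with the highest priority is used; the other takes over when it fails
a1.sinkgroups.g1.processor.type = failover
a1.sinkgroups.g1.processor.priority.k1 = 5
a1.sinkgroups.g1.processor.priority.k2 = 10
a1.sinkgroups.g1.processor.maxpenalty = 10000
# load_balance alternative: pick sinks round_robin or random
# a1.sinkgroups.g1.processor.type = load_balance
# a1.sinkgroups.g1.processor.selector = round_robin
# a1.sinkgroups.g1.processor.backoff = true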
Replication configuration (for the other cases, refer to the official documentation):
# flume1
# Name the components on this agent
a1.sources = r1
a1.sinks = k1 k2
a1.channels = c1 c2
# Describe/configure the source
a1.sources.r1.type = TAILDIR
a1.sources.r1.filegroups = f1
a1.sources.r1.filegroups.f1 = /opt/module/data/group1/hive.log
a1.sources.r1.positionFile = /opt/module/flume/position/position1.json
# Replicate the data flow to all channels
a1.sources.r1.selector.type = replicating
# Describe the sink
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = localhost102
a1.sinks.k1.port = 4141
a1.sinks.k2.type = avro
a1.sinks.k2.hostname = localhost102
a1.sinks.k2.port = 4142
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
a1.channels.c2.type = memory
a1.channels.c2.capacity = 1000
a1.channels.c2.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1 c2
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c2
# flume2
# Name the components on this agent
a2.sources = r1
a2.sinks = k1
a2.channels = c1
# Describe/configure the source (avro source fed by flume1's k1 sink)
a2.sources.r1.type = avro
a2.sources.r1.bind = localhost102
a2.sources.r1.port = 4141
# Describe the sink
a2.sinks.k1.type = hdfs
# HDFS path (same pattern as in the earlier HDFS examples)
a2.sinks.k1.hdfs.path = hdfs://localhost102:9000/flume/%Y%m%d/%H
# Prefix for uploaded files
a2.sinks.k1.hdfs.filePrefix = upload-
# Whether to round down the event time for the folder path
a2.sinks.k1.hdfs.round = true
# Number of time units per new folder
a2.sinks.k1.hdfs.roundValue = 1
# The time unit used for rounding
a2.sinks.k1.hdfs.roundUnit = hour
# Whether to use the local timestamp
a2.sinks.k1.hdfs.useLocalTimeStamp = true
# Number of Events to accumulate before flushing to HDFS
a2.sinks.k1.hdfs.batchSize = 100
# File type; compressed formats are also supported
a2.sinks.k1.hdfs.fileType = DataStream
# Interval (seconds) after which a new file is rolled
a2.sinks.k1.hdfs.rollInterval = 60
# Roll size of each file in bytes, roughly 128 MB
a2.sinks.k1.hdfs.rollSize = 134217700
# Rolling does not depend on the number of Events
a2.sinks.k1.hdfs.rollCount = 0
# Use a channel which buffers events in memory
a2.channels.c1.type = memory
a2.channels.c1.capacity = 1000
a2.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a2.sources.r1.channels = c1
a2.sinks.k1.channel = c1
# flume3
# Name the components on this agent
a3.sources = r1
a3.sinks = k1
a3.channels = c1
# Describe/configure the source
a3.sources.r1.type = avro
a3.sources.r1.bind = localhost102
a3.sources.r1.port = 4142
# Describe the sink
a3.sinks.k1.type = file_roll
a3.sinks.k1.sink.directory = /opt/module/data/group1-output
# Use a channel which buffers events in memory
a3.channels.c1.type = memory
a3.channels.c1.capacity = 1000
a3.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a3.sources.r1.channels = c1
a3.sinks.k1.channel = c1
Custom Interceptor
- For multiplexing, an Interceptor and a Channel Selector must be configured on the source, so that events are routed to different channels and sinks based on the event body
- To write a custom Interceptor: implement the Interceptor interface and its methods, and add a static inner class implementing Interceptor.Builder that constructs the interceptor object (a sketch follows)
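A minimal sketch of such an interceptor; the package name, class name, and the "type"-header routing logic are only illustrative:

package com.example.flume.interceptor;

import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.Map;

import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

public class TypeInterceptor implements Interceptor {

    @Override
    public void initialize() { }

    @Override
    public Event intercept(Event event) {
        // Route by body content: put a marker into the header for the multiplexing selector
        Map<String, String> headers = event.getHeaders();
        String body = new String(event.getBody(), StandardCharsets.UTF_8);
        headers.put("type", body.contains("hello") ? "letter" : "other");
        return event;
    }

    @Override
    public List<Event> intercept(List<Event> events) {
        for (Event event : events) {
            intercept(event);
        }
        return events;
    }

    @Override
    public void close() { }

    // Flume uses this Builder to construct the interceptor from the agent configuration
    public static class Builder implements Interceptor.Builder {
        @Override
        public Interceptor build() {
            return new TypeInterceptor();
        }

        @Override
        public void configure(Context context) { }
    }
}

It would then be wired into the source with a1.sources.r1.interceptors = i1 and a1.sources.r1.interceptors.i1.type = com.example.flume.interceptor.TypeInterceptor$Builder, alongside the multiplexing selector shown earlier.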
Custom Source
- Extend the abstract source class, fetch the data, and wrap it into Events (a sketch follows)
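A minimal sketch of a pollable custom source that generates its own data; the class name and the prefix property are only illustrative:

import org.apache.flume.Context;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.PollableSource;
import org.apache.flume.conf.Configurable;
import org.apache.flume.event.EventBuilder;
import org.apache.flume.source.AbstractSource;

public class MySource extends AbstractSource implements Configurable, PollableSource {

    private String prefix;

    @Override
    public void configure(Context context) {
        // Read a property from the job file, e.g. a1.sources.r1.prefix = log-
        prefix = context.getString("prefix", "log-");
    }

    @Override
    public Status process() throws EventDeliveryException {
        try {
            for (int i = 0; i < 5; i++) {
                // Wrap the data into an Event; the ChannelProcessor runs the put transaction
                getChannelProcessor().processEvent(
                        EventBuilder.withBody((prefix + i).getBytes()));
            }
            Thread.sleep(2000);
            return Status.READY;
        } catch (Exception e) {
            return Status.BACKOFF;
        }
    }

    @Override
    public long getBackOffSleepIncrement() {
        return 1000;
    }

    @Override
    public long getMaxBackOffSleepInterval() {
        return 5000;
    }
}

Packaged as a jar under flume/lib, it would be referenced in a job file with a1.sources.r1.type set to the fully qualified class name.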
Custom Sink
- Extend the abstract sink class, take Events from the channel, and write them out (the take-transaction sketch in the Flume transactions section above shows the typical process() loop)
Flume data flow monitoring
- Ganglia, a third-party framework, can monitor Flume in real time
Install the httpd service and php
sudo yum -y install httpd php
Install other dependencies
sudo yum -y install rrdtool perl-rrdtool rrdtool-devel
sudo yum -y install apr-devel
Install ganglia
sudo rpm -Uvh http://dl.fedoraproject.org/pub/epel/6/x86_64/epel-release-6-8.noarch.rpm
sudo yum -y install ganglia-gmetad
sudo yum -y install ganglia-web
sudo yum -y install ganglia-gmond
Edit the configuration file
sudo vim /etc/httpd/conf.d/ganglia.conf
# Change "Deny from all" to "Allow from all"
Edit the configuration file
sudo vim /etc/ganglia/gmetad.conf
data_source "localhost102" 192.168.33.102
Edit the configuration file /etc/ganglia/gmond.conf
sudo vim /etc/ganglia/gmond.conf
name="localhost102"
host=192.168.33.102
bind=192.168.33.102
Edit the configuration file
sudo vim /etc/selinux/config
SELINUX=disabled  ## takes effect after a reboot; run sudo setenforce 0 to apply it immediately for the current session
Start ganglia
sudo service httpd start
sudo service gmetad start
sudo service gmond start
Open the ganglia page in a browser
http://192.168.33.102/ganglia
If permissions are insufficient, change the permissions of the /var/lib/ganglia directory:
sudo chmod -R 777 /var/lib/ganglia
Edit the flume-env.sh configuration under /opt/module/flume/conf:
JAVA_OPTS="-Dflume.monitoring.type=ganglia
-Dflume.monitoring.hosts=192.168.33.102:8649
-Xms100m
-Xmx200m"
Start Flume
bin/flume-ng agent -c conf/ -n a1 -f jobs/xxx -Dflume.root.logger=INFO,console
-Dflume.monitoring.type=ganglia
-Dflume.monitoring.hosts=192.168.33.102:8649
Enterprise interview questions
---
This article is from cnblogs, by Bingmous. When reposting, please cite the original link: https://www.cnblogs.com/bingmous/p/15643707.html