Flume Study Notes

Chapter 1: Flume Overview

  • Flume is a highly available, highly reliable, distributed system provided by Cloudera for collecting, aggregating, and transporting massive amounts of log data. Flume is built on a streaming architecture and is flexible and simple.
  • Basic architecture

  • An Agent is a JVM process that moves data from a source to a destination in the form of events. It has three main components: Source, Channel, and Sink.
    • The Source collects data into Flume and can handle many input types: TAILDIR, avro, exec, spooldir, netcat
      TAILDIR: monitors multiple dynamically changing files at once, supports resuming from a checkpoint, and does not lose data
      exec: runs a command, e.g. tail -F file_name
      spooldir: monitors a directory for new files
      netcat: listens on a port for incoming data
    • The Sink continuously polls the Channel for events and removes them in batches, writing them to a storage system or to another Flume agent. Supported sink types include hdfs, logger, avro, file, and HBase
    • The Channel is a buffer between the Source and the Sink, so it allows the Source and the Sink to run at different rates. A Channel is thread-safe and can handle writes from several Sources and reads from several Sinks at the same time. Flume ships with two channel types, Memory Channel and File Channel, and also provides a Kafka Channel
    • Event: the basic unit of data transfer in Flume; data travels from source to destination as Events. An Event consists of a Header and a Body; the Header holds event attributes as key-value pairs
  • Sink processor types: load balancing, failover, and default (a single sink); load balancing and failover require configuring sinkgroups (see the configuration sketch in the enterprise development cases below)
    • Load balancing strategies: round-robin, random, ...
    • Failover: based on sink priority
  • Drawback: Flume agents cannot be added dynamically; the topology has to be reconfigured

Chapter 2: Flume Quick Start

http://flume.apache.org/releases/content/1.7.0/FlumeUserGuide.html — refer to the official user guide for configuration details

Installation and deployment

  • Untar the distribution and set JAVA_HOME to an absolute path in flume-env.sh

Case 1: Monitoring a port

  • Install the netcat tool: sudo yum install -y nc (a network utility). Test: nc -l localhost 44444 listens on port 44444 as a server; on another machine, nc localhost102 44444 connects and sends data to that port. Check whether the port is already in use: sudo netstat -tunlp | grep 44444
  • Create a jobs directory under the Flume home and add the job file netcat-flume-logger.conf, which listens on a port and prints the data to the console (netcat source, logger sink, memory channel)
  • A job file has five parts: component names, source, sink, channel, and bindings
# example.conf: A single-node Flume configuration

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Describe the sink
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
  • Start the Flume agent listening on the port: flume-ng agent -c conf -f /opt/module/flume/jobs/xxxx.conf -n a1 -Dflume.root.logger=INFO,console
  • Use the netcat tool to send data to port 44444 on the local machine; the content shows up in the Flume console output

Case 2: Monitoring a single file

  • Monitor the Hive log in real time and upload it to HDFS. The Hadoop-related jar files must be copied into flume/lib
  • Configure the job file file-flume-logger.conf to monitor a single file and print it to the console (exec source running tail -F, logger sink)
  • Run Flume: bin/flume-ng agent -c conf -f jobs/file-flume-logger.conf -n a1 -Dflume.root.logger=INFO,console
  • Configure the job file file-flume-hdfs.conf to monitor a single file and upload it to HDFS (exec source running tail -F, hdfs sink)
  • Run Flume: bin/flume-ng agent -c conf -f jobs/file-flume-hdfs.conf -n a2
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /opt/module/hive/logs/hive.log

# Describe the sink
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
# Name the components on this agent
a2.sources = r2
a2.sinks = k2
a2.channels = c2

# Describe/configure the source
a2.sources.r2.type = exec
a2.sources.r2.command = tail -F /opt/module/hive/logs/hive.log

# Describe the sink
a2.sinks.k2.type = hdfs
a2.sinks.k2.hdfs.path = hdfs://localhost102:9000/flume/%Y%m%d/%H
# prefix for files uploaded to HDFS
a2.sinks.k2.hdfs.filePrefix = logs
# whether to roll over to a new folder based on time
a2.sinks.k2.hdfs.round = true
# how many time units before creating a new folder
a2.sinks.k2.hdfs.roundValue = 1
# the time unit used for rounding
a2.sinks.k2.hdfs.roundUnit = hour
# whether to use the local timestamp
a2.sinks.k2.hdfs.useLocalTimeStamp = true
# number of events to accumulate before flushing to HDFS
a2.sinks.k2.hdfs.batchSize = 1000
# file type; compression is supported
a2.sinks.k2.hdfs.fileType = DataStream
# how often (seconds) to roll a new file
a2.sinks.k2.hdfs.rollInterval = 30
# roll the file when it reaches this size (just under 128 MB)
a2.sinks.k2.hdfs.rollSize = 134217700
# file rolling is independent of the number of events
a2.sinks.k2.hdfs.rollCount = 0

# Use a channel which buffers events in memory
a2.channels.c2.type = memory
a2.channels.c2.capacity = 1000
a2.channels.c2.transactionCapacity = 100

# Bind the source and sink to the channel
a2.sources.r2.channels = c2
a2.sinks.k2.channel = c2

Case 3: Monitoring a directory for new files and uploading them to HDFS

  • Configure the job file dir-flume-hdfs.conf to monitor a directory and upload new files to HDFS (spooldir source, hdfs sink)
  • Run Flume: bin/flume-ng agent -c conf -f jobs/dir-flume-hdfs.conf -n a3
  • Notes:
    • When using the Spooling Directory Source, do not create and keep modifying files inside the monitored directory
    • Do not drop in files with duplicate names; the data is still uploaded to HDFS, but the source cannot rename the file locally (marking it .COMPLETED fails)
    • The directory is scanned for new files every 500 ms
# Name the components on this agent
a3.sources = r3
a3.sinks = k3
a3.channels = c3

# Describe/configure the source
a3.sources.r3.type = spooldir
a3.sources.r3.spoolDir = /opt/module/flume/upload
a3.sources.r3.fileSuffix = .COMPLETED
a3.sources.r3.fileHeader = true

# ignore files ending in .tmp (do not upload them)
a3.sources.r3.ignorePattern = ([^ ]*\.tmp)
# Describe the sink
a3.sinks.k3.type = hdfs
a3.sinks.k3.hdfs.path = hdfs://localhost102:9000/flume/%Y%m%d/%H

# prefix for files uploaded to HDFS
a3.sinks.k3.hdfs.filePrefix = upload-
# whether to roll over to a new folder based on time
a3.sinks.k3.hdfs.round = true
# how many time units before creating a new folder
a3.sinks.k3.hdfs.roundValue = 1
# the time unit used for rounding
a3.sinks.k3.hdfs.roundUnit = hour
# whether to use the local timestamp
a3.sinks.k3.hdfs.useLocalTimeStamp = true
# number of events to accumulate before flushing to HDFS
a3.sinks.k3.hdfs.batchSize = 100
# file type; compression is supported
a3.sinks.k3.hdfs.fileType = DataStream
# how often (seconds) to roll a new file
a3.sinks.k3.hdfs.rollInterval = 60
# roll the file when it reaches this size (just under 128 MB)
a3.sinks.k3.hdfs.rollSize = 134217700
# file rolling is independent of the number of events
a3.sinks.k3.hdfs.rollCount = 0
# Use a channel which buffers events in memory
a3.channels.c3.type = memory
a3.channels.c3.capacity = 1000
a3.channels.c3.transactionCapacity = 100

# Bind the source and sink to the channel
a3.sources.r3.channels = c3
a3.sinks.k3.channel = c3

Case 4: Monitoring multiple appended files with TAILDIR

  • Supports resuming from a checkpoint and can monitor multiple dynamically changing files (TAILDIR source, logger sink)
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = TAILDIR
a1.sources.r1.filegroups = f1 f2
a1.sources.r1.filegroups.f1 = /opt/module/flume/files/file1.txt
a1.sources.r1.filegroups.f2 = /opt/module/flume/files/file2.txt
a1.sources.r1.positionFile = /opt/module/flume/position/position.json

# Describe the sink
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Chapter 3: Flume Advanced Topics

Flume transactions

  • Put transaction: doPut writes the data into a temporary put buffer, doCommit commits the transaction, doRollback rolls it back
  • Take transaction: doTake pulls data from the channel into a temporary take buffer, doCommit commits the transaction, doRollback rolls it back (a code sketch of how a channel transaction is driven follows)
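
A minimal sketch (Java) of how these transactions are driven against a Channel, using Flume's Channel/Transaction API; in a real source the put side is normally handled by the ChannelProcessor, so the helper methods here are purely illustrative:

import org.apache.flume.Channel;
import org.apache.flume.Event;
import org.apache.flume.Transaction;

public class TransactionSketch {

    // put transaction: how events are handed to a channel
    public static void putExample(Channel channel, Event event) {
        Transaction txn = channel.getTransaction();
        txn.begin();
        try {
            channel.put(event);    // doPut: event goes into the put buffer
            txn.commit();          // doCommit: events become visible in the channel
        } catch (RuntimeException e) {
            txn.rollback();        // doRollback: discard the buffered events
            throw e;
        } finally {
            txn.close();
        }
    }

    // take transaction: how a sink pulls events from a channel
    public static void takeExample(Channel channel) {
        Transaction txn = channel.getTransaction();
        txn.begin();
        try {
            Event event = channel.take();  // doTake: event moves into the take buffer
            // ... write the event to its destination here ...
            txn.commit();                  // doCommit: event is removed from the channel
        } catch (RuntimeException e) {
            txn.rollback();                // doRollback: event goes back to the channel
            throw e;
        } finally {
            txn.close();
        }
    }
}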

Flume Agent internals

  • The Channel Selector decides which Channel(s) an Event will be sent to
    • Replicating (copy): the event is sent to every channel
    • Multiplexing: the event is routed to a channel based on a header value
  • SinkProcessor
    • DefaultSinkProcessor: a single sink
    • LoadBalancingSinkProcessor: load balancing
    • FailoverSinkProcessor: failover

Flume topologies

  • Simple chaining

  • Replication and multiplexing

  • Load balancing and failover

  • Aggregation

Enterprise development cases

  • 1. Replication and multiplexing (multiplexing requires a custom Interceptor); here, file changes are replicated to both HDFS and the local filesystem:
    • Configure the selector type: replicating (the default) or multiplexing
    • flume1 monitors the file changes: TAILDIR source, avro sinks
    • flume2 reads the events and writes them to HDFS: avro source, hdfs sink
    • flume3 writes them to the local filesystem: avro source, file_roll sink
  • 2. Load balancing and failover: one channel and multiple sinks, which requires configuring sinkgroups (a sinkgroups configuration sketch follows the replication configs below)
    • flume1 monitors a port: netcat source, multiple sinks grouped with sinkgroups, processor type load_balance or failover
    • flume2 and flume3 receive the events from this channel according to the load-balancing strategy or the failover priority
  • 3. Aggregation: data from several sources is sent to the same machine (on the same or different ports) and aggregated by one source (or several)
    • flume1 monitors file changes: TAILDIR source, avro sink acting as a client writing to a given host and port
    • flume2 monitors a port: netcat source, avro sink
    • flume3 has an avro source listening on one or more ports, with a logger, hdfs, or file_roll sink

Replication configuration (for the other cases, refer to the official documentation):

# flume1
# Name the components on this agent
a1.sources = r1
a1.sinks = k1 k2
a1.channels = c1 c2

# Describe/configure the source
a1.sources.r1.type = TAILDIR
a1.sources.r1.filegroups = f1
a1.sources.r1.filegroups.f1 = /opt/module/data/group1/hive.log
a1.sources.r1.positionFile = /opt/module/flume/position/position1.json

# replicate the data flow to all channels
a1.sources.r1.selector.type = replicating

# Describe the sink
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = localhost102
a1.sinks.k1.port = 4141

a1.sinks.k2.type = avro
a1.sinks.k2.hostname = localhost102
a1.sinks.k2.port = 4142

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

a1.channels.c2.type = memory
a1.channels.c2.capacity = 1000
a1.channels.c2.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1 c2
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c2
# flume2
# NOTE: the original notes only kept the hdfs sink rolling properties for this agent;
# the component names, the avro source, and the hdfs path below are reconstructed from
# the case description (avro source on localhost102:4141, hdfs sink) and are an example only.
# Name the components on this agent
a2.sources = r1
a2.sinks = k1
a2.channels = c1

# Describe/configure the source
a2.sources.r1.type = avro
a2.sources.r1.bind = localhost102
a2.sources.r1.port = 4141

# Describe the sink
a2.sinks.k1.type = hdfs
a2.sinks.k1.hdfs.path = hdfs://localhost102:9000/flume2/%Y%m%d/%H
# prefix for files uploaded to HDFS
a2.sinks.k1.hdfs.filePrefix = upload-
# whether to roll over to a new folder based on time
a2.sinks.k1.hdfs.round = true
# how many time units before creating a new folder
a2.sinks.k1.hdfs.roundValue = 1
# the time unit used for rounding
a2.sinks.k1.hdfs.roundUnit = hour
# whether to use the local timestamp
a2.sinks.k1.hdfs.useLocalTimeStamp = true
# number of events to accumulate before flushing to HDFS
a2.sinks.k1.hdfs.batchSize = 100
# file type; compression is supported
a2.sinks.k1.hdfs.fileType = DataStream
# how often (seconds) to roll a new file
a2.sinks.k1.hdfs.rollInterval = 60
# roll the file when it reaches this size (just under 128 MB)
a2.sinks.k1.hdfs.rollSize = 134217700
# file rolling is independent of the number of events
a2.sinks.k1.hdfs.rollCount = 0

# Use a channel which buffers events in memory
a2.channels.c1.type = memory
a2.channels.c1.capacity = 1000
a2.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a2.sources.r1.channels = c1
a2.sinks.k1.channel = c1
# flume3
# Name the components on this agent
a3.sources = r1
a3.sinks = k1
a3.channels = c1

# Describe/configure the source
a3.sources.r1.type = avro
a3.sources.r1.bind = localhost102
a3.sources.r1.port = 4142

# Describe the sink
a3.sinks.k1.type = file_roll
a3.sinks.k1.sink.directory = /opt/module/data/group1-output


# Use a channel which buffers events in memory
a3.channels.c1.type = memory
a3.channels.c1.capacity = 1000
a3.channels.c1.transactionCapacity = 100


# Bind the source and sink to the channel
a3.sources.r1.channels = c1
a3.sinks.k1.channel = c1
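
The load-balancing/failover case (case 2 above) is not spelled out in these notes. Below is a minimal sketch of the flume1 side only, assuming a netcat source on port 44444 and two downstream avro agents at localhost102:4141 and localhost102:4142 (hostnames and ports are placeholders); for the load-balancing variant, set processor.type = load_balance and processor.selector = round_robin (or random) instead of the failover properties.

# flume1 (sketch): one netcat source, one channel, two avro sinks in a sink group
a1.sources = r1
a1.channels = c1
a1.sinks = k1 k2

a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# both sinks forward to downstream agents over avro
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = localhost102
a1.sinks.k1.port = 4141
a1.sinks.k2.type = avro
a1.sinks.k2.hostname = localhost102
a1.sinks.k2.port = 4142

# group the two sinks; failover always uses the live sink with the highest priority
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = failover
a1.sinkgroups.g1.processor.priority.k1 = 10
a1.sinkgroups.g1.processor.priority.k2 = 5
a1.sinkgroups.g1.processor.maxpenalty = 10000

# both sinks read from the same channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c1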

Custom Interceptor

  • For multiplexing, configure an Interceptor and a Selector on the source so that events go to different channels (and sinks) depending on the event body
  • To write a custom Interceptor, implement the Interceptor interface and its methods, and add a static inner class implementing Interceptor.Builder that builds the interceptor object (a minimal sketch follows)
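
A minimal sketch, assuming events whose body starts with a letter should be tagged type=letter and everything else type=number; the header key, values, and class name are made up for illustration, while Interceptor and Interceptor.Builder are Flume's interfaces:

import java.util.List;
import java.util.Map;
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

public class TypeInterceptor implements Interceptor {

    @Override
    public void initialize() { }

    // add a header based on the first byte of the body; the multiplexing
    // selector routes the event on this header value
    @Override
    public Event intercept(Event event) {
        byte[] body = event.getBody();
        Map<String, String> headers = event.getHeaders();
        if (body.length > 0 && Character.isLetter((char) body[0])) {
            headers.put("type", "letter");
        } else {
            headers.put("type", "number");
        }
        return event;
    }

    @Override
    public List<Event> intercept(List<Event> events) {
        for (Event event : events) {
            intercept(event);
        }
        return events;
    }

    @Override
    public void close() { }

    // Flume instantiates interceptors through this builder
    public static class Builder implements Interceptor.Builder {
        @Override
        public Interceptor build() {
            return new TypeInterceptor();
        }

        @Override
        public void configure(Context context) { }
    }
}

The matching source configuration would then be along the lines of a1.sources.r1.interceptors = i1, a1.sources.r1.interceptors.i1.type = com.example.TypeInterceptor$Builder (package name is a placeholder), a1.sources.r1.selector.type = multiplexing, a1.sources.r1.selector.header = type, a1.sources.r1.selector.mapping.letter = c1, a1.sources.r1.selector.mapping.number = c2.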

Custom Source

  • Extend the abstract source class, fetch the data, and wrap it in an Event (a minimal sketch follows)
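
A minimal sketch of a pollable custom source that simply fabricates a few events per poll; the class name and the prefix property are made up, while AbstractSource, Configurable, PollableSource, EventBuilder, and getChannelProcessor() are Flume's API (the two backoff methods are required as of Flume 1.7):

import org.apache.flume.Context;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.PollableSource;
import org.apache.flume.conf.Configurable;
import org.apache.flume.event.EventBuilder;
import org.apache.flume.source.AbstractSource;

public class MySource extends AbstractSource implements Configurable, PollableSource {

    private String prefix;

    @Override
    public void configure(Context context) {
        // read settings from the agent's .conf file; "prefix" is a made-up property
        prefix = context.getString("prefix", "log-");
    }

    @Override
    public Status process() throws EventDeliveryException {
        try {
            // fetch (here: fabricate) data, wrap it in an Event, hand it to the channel
            for (int i = 0; i < 5; i++) {
                byte[] body = (prefix + i).getBytes();
                getChannelProcessor().processEvent(EventBuilder.withBody(body));
            }
            Thread.sleep(2000);
            return Status.READY;
        } catch (Exception e) {
            return Status.BACKOFF;
        }
    }

    @Override
    public long getBackOffSleepIncrement() {
        return 1000;
    }

    @Override
    public long getMaxBackOffSleepInterval() {
        return 5000;
    }
}

In the job file it would be referenced by its fully-qualified class name, e.g. a1.sources.r1.type = com.example.MySource (placeholder package), with the packaged jar placed in flume/lib.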

Custom Sink

  • Extend the abstract sink class, take events from the channel, and write them out (a minimal sketch follows)
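
A minimal sketch of a custom sink that takes one event per transaction and just prints it to stdout; the class name is made up, while AbstractSink, Channel, Transaction, and Status come from Flume's API:

import org.apache.flume.Channel;
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.Transaction;
import org.apache.flume.conf.Configurable;
import org.apache.flume.sink.AbstractSink;

public class MySink extends AbstractSink implements Configurable {

    @Override
    public void configure(Context context) {
        // read sink settings from the agent's .conf file if needed
    }

    @Override
    public Status process() throws EventDeliveryException {
        Channel channel = getChannel();
        Transaction txn = channel.getTransaction();
        txn.begin();
        try {
            Event event = channel.take();
            if (event == null) {
                // nothing in the channel right now; commit and back off
                txn.commit();
                return Status.BACKOFF;
            }
            // "write out" the event; here it is just printed to stdout
            System.out.println(new String(event.getBody()));
            txn.commit();
            return Status.READY;
        } catch (Exception e) {
            txn.rollback();
            throw new EventDeliveryException(e);
        } finally {
            txn.close();
        }
    }
}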

Flume data flow monitoring

  • Ganglia: a third-party framework for monitoring Flume in real time
Install the httpd service and PHP
sudo yum -y install httpd php

Install the other dependencies
sudo yum -y install rrdtool perl-rrdtool rrdtool-devel
sudo yum -y install apr-devel

Install Ganglia
sudo rpm -Uvh http://dl.fedoraproject.org/pub/epel/6/x86_64/epel-release-6-8.noarch.rpm
sudo yum -y install ganglia-gmetad
sudo yum -y install ganglia-web
sudo yum -y install ganglia-gmond

Edit the configuration file
sudo vim /etc/httpd/conf.d/ganglia.conf
# change "Deny from all" to "Allow from all"

Edit the configuration file
sudo vim /etc/ganglia/gmetad.conf
data_source "localhost102" 192.168.33.102

Edit the configuration file /etc/ganglia/gmond.conf
sudo vim /etc/ganglia/gmond.conf
name="localhost102"
host=192.168.33.102
bind=192.168.33.102

Edit the configuration file
sudo vim /etc/selinux/config
SELINUX=disabled   # takes effect after a reboot; sudo setenforce 0 applies it for the current session only

Start Ganglia
sudo service httpd start
sudo service gmetad start
sudo service gmond start

Open the Ganglia page in a browser
http://192.168.33.102/ganglia
If you get a permission error, change the permissions of the /var/lib/ganglia directory:
sudo chmod -R 777 /var/lib/ganglia

Edit flume-env.sh under /opt/module/flume/conf:
JAVA_OPTS="-Dflume.monitoring.type=ganglia
-Dflume.monitoring.hosts=192.168.33.102:8649
-Xms100m
-Xmx200m"

Start Flume
bin/flume-ng agent -c conf/ -n a1 -f jobs/xxx -Dflume.root.logger=INFO,console
-Dflume.monitoring.type=ganglia
-Dflume.monitoring.hosts=192.168.33.102:8649

Enterprise interview questions