Flume Study Notes

Chapter 1: Flume Overview

  • Flume is a highly available, highly reliable, distributed system provided by Cloudera for collecting, aggregating, and transporting massive amounts of log data. Flume is built on a streaming architecture and is flexible and simple.
  • Basic architecture

  • An Agent is a JVM process that moves data from a source to a destination in the form of events. It has three main components: Source, Channel, and Sink.
    • The Source collects data into Flume and can handle many input types: TAILDIR, avro, exec, spooldir, netcat
      TAILDIR: monitors multiple dynamically changing files at once, supports resuming from a checkpoint, and does not lose data
      exec: runs a command, e.g. tail -F file_name
      spooldir: monitors a directory for new files
      netcat: listens on a port for incoming data
    • The Sink continuously polls the Channel for events and removes them in batches, writing them to a storage system or to another Flume agent. Supported sink types include hdfs, logger, avro, file, and HBase
    • The Channel is a buffer between the Source and the Sink, so it allows the Source and the Sink to run at different rates. A Channel is thread-safe and can handle writes from several Sources and reads from several Sinks at the same time. Flume ships with two channel types, Memory Channel and File Channel, and also provides a Kafka Channel
    • Event: the basic unit of data transfer in Flume; data travels from source to destination as Events. An Event consists of a Header and a Body; the Header holds event attributes as key-value pairs
  • Sink processor types: load balancing, failover, and default (a single sink); load balancing and failover require configuring sinkgroups (see the configuration sketch in the enterprise development cases below)
    • Load balancing strategies: round-robin, random, ...
    • Failover: based on sink priority
  • Drawback: Flume agents cannot be added dynamically; the topology has to be reconfigured

Chapter 2: Flume Quick Start

http://flume.apache.org/releases/content/1.7.0/FlumeUserGuide.html — refer to the official user guide for configuration details

Installation and deployment

  • Untar the distribution and set JAVA_HOME to an absolute path in flume-env.sh

Case 1: Monitoring a port

  • Install the netcat tool: sudo yum install -y nc (a network utility). Test: nc -l localhost 44444 listens on port 44444 as a server; on another machine, nc localhost102 44444 connects and sends data to that port. Check whether the port is already in use: sudo netstat -tunlp | grep 44444
  • Create a jobs directory under the Flume home and add the job file netcat-flume-logger.conf, which listens on a port and prints the data to the console (netcat source, logger sink, memory channel)
  • A job file has five parts: component names, source, sink, channel, and bindings
# example.conf: A single-node Flume configuration

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Describe the sink
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
  • Start the Flume agent listening on the port: flume-ng agent -c conf -f /opt/module/flume/jobs/xxxx.conf -n a1 -Dflume.root.logger=INFO,console
  • Use the netcat tool to send data to port 44444 on the local machine; the content shows up in the Flume console output

Case 2: Monitoring a single file

  • Monitor the Hive log in real time and upload it to HDFS. The Hadoop-related jar files must be copied into flume/lib
  • Configure the job file file-flume-logger.conf to monitor a single file and print it to the console (exec source running tail -F, logger sink)
  • Run Flume: bin/flume-ng agent -c conf -f jobs/file-flume-logger.conf -n a1 -Dflume.root.logger=INFO,console
  • Configure the job file file-flume-hdfs.conf to monitor a single file and upload it to HDFS (exec source running tail -F, hdfs sink)
  • Run Flume: bin/flume-ng agent -c conf -f jobs/file-flume-hdfs.conf -n a2
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /opt/module/hive/logs/hive.log

# Describe the sink
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
# Name the components on this agent
a2.sources = r2
a2.sinks = k2
a2.channels = c2

# Describe/configure the source
a2.sources.r2.type = exec
a2.sources.r2.command = tail -F /opt/module/hive/logs/hive.log

# Describe the sink
a2.sinks.k2.type = hdfs
a2.sinks.k2.hdfs.path = hdfs://localhost102:9000/flume/%Y%m%d/%H
# prefix for files uploaded to HDFS
a2.sinks.k2.hdfs.filePrefix = logs
# whether to roll over to a new folder based on time
a2.sinks.k2.hdfs.round = true
# how many time units before creating a new folder
a2.sinks.k2.hdfs.roundValue = 1
# the time unit used for rounding
a2.sinks.k2.hdfs.roundUnit = hour
# whether to use the local timestamp
a2.sinks.k2.hdfs.useLocalTimeStamp = true
# number of events to accumulate before flushing to HDFS
a2.sinks.k2.hdfs.batchSize = 1000
# file type; compression is supported
a2.sinks.k2.hdfs.fileType = DataStream
# how often (seconds) to roll a new file
a2.sinks.k2.hdfs.rollInterval = 30
# roll the file when it reaches this size (just under 128 MB)
a2.sinks.k2.hdfs.rollSize = 134217700
# file rolling is independent of the number of events
a2.sinks.k2.hdfs.rollCount = 0

# Use a channel which buffers events in memory
a2.channels.c2.type = memory
a2.channels.c2.capacity = 1000
a2.channels.c2.transactionCapacity = 100

# Bind the source and sink to the channel
a2.sources.r2.channels = c2
a2.sinks.k2.channel = c2

Case 3: Monitoring a directory for new files and uploading them to HDFS

  • Configure the job file dir-flume-hdfs.conf to monitor a directory and upload new files to HDFS (spooldir source, hdfs sink)
  • Run Flume: bin/flume-ng agent -c conf -f jobs/dir-flume-hdfs.conf -n a3
  • Notes:
    • When using the Spooling Directory Source, do not create and keep modifying files inside the monitored directory
    • Do not drop in files with duplicate names; the data is still uploaded to HDFS, but the source cannot rename the file locally (marking it .COMPLETED fails)
    • The directory is scanned for new files every 500 ms
# Name the components on this agent
a3.sources = r3
a3.sinks = k3
a3.channels = c3

# Describe/configure the source
a3.sources.r3.type = spooldir
a3.sources.r3.spoolDir = /opt/module/flume/upload
a3.sources.r3.fileSuffix = .COMPLETED
a3.sources.r3.fileHeader = true

# ignore files ending in .tmp (do not upload them)
a3.sources.r3.ignorePattern = ([^ ]*\.tmp)
# Describe the sink
a3.sinks.k3.type = hdfs
a3.sinks.k3.hdfs.path = hdfs://localhost102:9000/flume/%Y%m%d/%H

# prefix for files uploaded to HDFS
a3.sinks.k3.hdfs.filePrefix = upload-
# whether to roll over to a new folder based on time
a3.sinks.k3.hdfs.round = true
# how many time units before creating a new folder
a3.sinks.k3.hdfs.roundValue = 1
# the time unit used for rounding
a3.sinks.k3.hdfs.roundUnit = hour
# whether to use the local timestamp
a3.sinks.k3.hdfs.useLocalTimeStamp = true
# number of events to accumulate before flushing to HDFS
a3.sinks.k3.hdfs.batchSize = 100
# file type; compression is supported
a3.sinks.k3.hdfs.fileType = DataStream
# how often (seconds) to roll a new file
a3.sinks.k3.hdfs.rollInterval = 60
# roll the file when it reaches this size (just under 128 MB)
a3.sinks.k3.hdfs.rollSize = 134217700
# file rolling is independent of the number of events
a3.sinks.k3.hdfs.rollCount = 0
# Use a channel which buffers events in memory
a3.channels.c3.type = memory
a3.channels.c3.capacity = 1000
a3.channels.c3.transactionCapacity = 100

# Bind the source and sink to the channel
a3.sources.r3.channels = c3
a3.sinks.k3.channel = c3

Case 4: Monitoring multiple appended files with TAILDIR

  • Supports resuming from a checkpoint and can monitor multiple dynamically changing files (TAILDIR source, logger sink)
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = TAILDIR
a1.sources.r1.filegroups = f1 f2
a1.sources.r1.filegroups.f1 = /opt/module/flume/files/file1.txt
a1.sources.r1.filegroups.f2 = /opt/module/flume/files/file2.txt
a1.sources.r1.positionFile = /opt/module/flume/position/position.json

# Describe the sink
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Chapter 3: Flume Advanced Topics

Flume transactions

  • Put transaction: doPut writes the data into a temporary put buffer, doCommit commits the transaction, doRollback rolls it back
  • Take transaction: doTake pulls data from the channel into a temporary take buffer, doCommit commits the transaction, doRollback rolls it back (a code sketch of how a channel transaction is driven follows)
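
A minimal sketch (Java) of how these transactions are driven against a Channel, using Flume's Channel/Transaction API; in a real source the put side is normally handled by the ChannelProcessor, so the helper methods here are purely illustrative:

import org.apache.flume.Channel;
import org.apache.flume.Event;
import org.apache.flume.Transaction;

public class TransactionSketch {

    // put transaction: how events are handed to a channel
    public static void putExample(Channel channel, Event event) {
        Transaction txn = channel.getTransaction();
        txn.begin();
        try {
            channel.put(event);    // doPut: event goes into the put buffer
            txn.commit();          // doCommit: events become visible in the channel
        } catch (RuntimeException e) {
            txn.rollback();        // doRollback: discard the buffered events
            throw e;
        } finally {
            txn.close();
        }
    }

    // take transaction: how a sink pulls events from a channel
    public static void takeExample(Channel channel) {
        Transaction txn = channel.getTransaction();
        txn.begin();
        try {
            Event event = channel.take();  // doTake: event moves into the take buffer
            // ... write the event to its destination here ...
            txn.commit();                  // doCommit: event is removed from the channel
        } catch (RuntimeException e) {
            txn.rollback();                // doRollback: event goes back to the channel
            throw e;
        } finally {
            txn.close();
        }
    }
}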

Flume Agent internals

  • The Channel Selector decides which Channel(s) an Event will be sent to
    • Replicating (copy): the event is sent to every channel
    • Multiplexing: the event is routed to a channel based on a header value
  • SinkProcessor
    • DefaultSinkProcessor: a single sink
    • LoadBalancingSinkProcessor: load balancing
    • FailoverSinkProcessor: failover

Flume topologies

  • Simple chaining

  • Replication and multiplexing

  • Load balancing and failover

  • Aggregation

Enterprise development cases

  • 1. Replication and multiplexing (multiplexing requires a custom Interceptor); here, file changes are replicated to both HDFS and the local filesystem:
    • Configure the selector type: replicating (the default) or multiplexing
    • flume1 monitors the file changes: TAILDIR source, avro sinks
    • flume2 reads the events and writes them to HDFS: avro source, hdfs sink
    • flume3 writes them to the local filesystem: avro source, file_roll sink
  • 2. Load balancing and failover: one channel and multiple sinks, which requires configuring sinkgroups (a sinkgroups configuration sketch follows the replication configs below)
    • flume1 monitors a port: netcat source, multiple sinks grouped with sinkgroups, processor type load_balance or failover
    • flume2 and flume3 receive the events from this channel according to the load-balancing strategy or the failover priority
  • 3. Aggregation: data from several sources is sent to the same machine (on the same or different ports) and aggregated by one source (or several)
    • flume1 monitors file changes: TAILDIR source, avro sink acting as a client writing to a given host and port
    • flume2 monitors a port: netcat source, avro sink
    • flume3 has an avro source listening on one or more ports, with a logger, hdfs, or file_roll sink

Replication configuration (for the other cases, refer to the official documentation):

# flume1
# Name the components on this agent
a1.sources = r1
a1.sinks = k1 k2
a1.channels = c1 c2

# Describe/configure the source
a1.sources.r1.type = TAILDIR
a1.sources.r1.filegroups = f1
a1.sources.r1.filegroups.f1 = /opt/module/data/group1/hive.log
a1.sources.r1.positionFile = /opt/module/flume/position/position1.json

# replicate the data flow to all channels
a1.sources.r1.selector.type = replicating

# Describe the sink
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = localhost102
a1.sinks.k1.port = 4141

a1.sinks.k2.type = avro
a1.sinks.k2.hostname = localhost102
a1.sinks.k2.port = 4142

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

a1.channels.c2.type = memory
a1.channels.c2.capacity = 1000
a1.channels.c2.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1 c2
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c2
# flume2
# NOTE: the original notes only kept the hdfs sink rolling properties for this agent;
# the component names, the avro source, and the hdfs path below are reconstructed from
# the case description (avro source on localhost102:4141, hdfs sink) and are an example only.
# Name the components on this agent
a2.sources = r1
a2.sinks = k1
a2.channels = c1

# Describe/configure the source
a2.sources.r1.type = avro
a2.sources.r1.bind = localhost102
a2.sources.r1.port = 4141

# Describe the sink
a2.sinks.k1.type = hdfs
a2.sinks.k1.hdfs.path = hdfs://localhost102:9000/flume2/%Y%m%d/%H
# prefix for files uploaded to HDFS
a2.sinks.k1.hdfs.filePrefix = upload-
# whether to roll over to a new folder based on time
a2.sinks.k1.hdfs.round = true
# how many time units before creating a new folder
a2.sinks.k1.hdfs.roundValue = 1
# the time unit used for rounding
a2.sinks.k1.hdfs.roundUnit = hour
# whether to use the local timestamp
a2.sinks.k1.hdfs.useLocalTimeStamp = true
# number of events to accumulate before flushing to HDFS
a2.sinks.k1.hdfs.batchSize = 100
# file type; compression is supported
a2.sinks.k1.hdfs.fileType = DataStream
# how often (seconds) to roll a new file
a2.sinks.k1.hdfs.rollInterval = 60
# roll the file when it reaches this size (just under 128 MB)
a2.sinks.k1.hdfs.rollSize = 134217700
# file rolling is independent of the number of events
a2.sinks.k1.hdfs.rollCount = 0

# Use a channel which buffers events in memory
a2.channels.c1.type = memory
a2.channels.c1.capacity = 1000
a2.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a2.sources.r1.channels = c1
a2.sinks.k1.channel = c1
# flume3
# Name the components on this agent
a3.sources = r1
a3.sinks = k1
a3.channels = c1

# Describe/configure the source
a3.sources.r1.type = avro
a3.sources.r1.bind = localhost102
a3.sources.r1.port = 4142

# Describe the sink
a3.sinks.k1.type = file_roll
a3.sinks.k1.sink.directory = /opt/module/data/group1-output


# Use a channel which buffers events in memory
a3.channels.c1.type = memory
a3.channels.c1.capacity = 1000
a3.channels.c1.transactionCapacity = 100


# Bind the source and sink to the channel
a3.sources.r1.channels = c1
a3.sinks.k1.channel = c1
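
The load-balancing/failover case (case 2 above) is not spelled out in these notes. Below is a minimal sketch of the flume1 side only, assuming a netcat source on port 44444 and two downstream avro agents at localhost102:4141 and localhost102:4142 (hostnames and ports are placeholders); for the load-balancing variant, set processor.type = load_balance and processor.selector = round_robin (or random) instead of the failover properties.

# flume1 (sketch): one netcat source, one channel, two avro sinks in a sink group
a1.sources = r1
a1.channels = c1
a1.sinks = k1 k2

a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# both sinks forward to downstream agents over avro
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = localhost102
a1.sinks.k1.port = 4141
a1.sinks.k2.type = avro
a1.sinks.k2.hostname = localhost102
a1.sinks.k2.port = 4142

# group the two sinks; failover always uses the live sink with the highest priority
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = failover
a1.sinkgroups.g1.processor.priority.k1 = 10
a1.sinkgroups.g1.processor.priority.k2 = 5
a1.sinkgroups.g1.processor.maxpenalty = 10000

# both sinks read from the same channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c1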

Custom Interceptor

  • For multiplexing, configure an Interceptor and a Selector on the source so that events go to different channels (and sinks) depending on the event body
  • To write a custom Interceptor, implement the Interceptor interface and its methods, and add a static inner class implementing Interceptor.Builder that builds the interceptor object (a minimal sketch follows)
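
A minimal sketch, assuming events whose body starts with a letter should be tagged type=letter and everything else type=number; the header key, values, and class name are made up for illustration, while Interceptor and Interceptor.Builder are Flume's interfaces:

import java.util.List;
import java.util.Map;
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

public class TypeInterceptor implements Interceptor {

    @Override
    public void initialize() { }

    // add a header based on the first byte of the body; the multiplexing
    // selector routes the event on this header value
    @Override
    public Event intercept(Event event) {
        byte[] body = event.getBody();
        Map<String, String> headers = event.getHeaders();
        if (body.length > 0 && Character.isLetter((char) body[0])) {
            headers.put("type", "letter");
        } else {
            headers.put("type", "number");
        }
        return event;
    }

    @Override
    public List<Event> intercept(List<Event> events) {
        for (Event event : events) {
            intercept(event);
        }
        return events;
    }

    @Override
    public void close() { }

    // Flume instantiates interceptors through this builder
    public static class Builder implements Interceptor.Builder {
        @Override
        public Interceptor build() {
            return new TypeInterceptor();
        }

        @Override
        public void configure(Context context) { }
    }
}

The matching source configuration would then be along the lines of a1.sources.r1.interceptors = i1, a1.sources.r1.interceptors.i1.type = com.example.TypeInterceptor$Builder (package name is a placeholder), a1.sources.r1.selector.type = multiplexing, a1.sources.r1.selector.header = type, a1.sources.r1.selector.mapping.letter = c1, a1.sources.r1.selector.mapping.number = c2.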

Custom Source

  • Extend the abstract source class, fetch the data, and wrap it in an Event (a minimal sketch follows)
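
A minimal sketch of a pollable custom source that simply fabricates a few events per poll; the class name and the prefix property are made up, while AbstractSource, Configurable, PollableSource, EventBuilder, and getChannelProcessor() are Flume's API (the two backoff methods are required as of Flume 1.7):

import org.apache.flume.Context;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.PollableSource;
import org.apache.flume.conf.Configurable;
import org.apache.flume.event.EventBuilder;
import org.apache.flume.source.AbstractSource;

public class MySource extends AbstractSource implements Configurable, PollableSource {

    private String prefix;

    @Override
    public void configure(Context context) {
        // read settings from the agent's .conf file; "prefix" is a made-up property
        prefix = context.getString("prefix", "log-");
    }

    @Override
    public Status process() throws EventDeliveryException {
        try {
            // fetch (here: fabricate) data, wrap it in an Event, hand it to the channel
            for (int i = 0; i < 5; i++) {
                byte[] body = (prefix + i).getBytes();
                getChannelProcessor().processEvent(EventBuilder.withBody(body));
            }
            Thread.sleep(2000);
            return Status.READY;
        } catch (Exception e) {
            return Status.BACKOFF;
        }
    }

    @Override
    public long getBackOffSleepIncrement() {
        return 1000;
    }

    @Override
    public long getMaxBackOffSleepInterval() {
        return 5000;
    }
}

In the job file it would be referenced by its fully-qualified class name, e.g. a1.sources.r1.type = com.example.MySource (placeholder package), with the packaged jar placed in flume/lib.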

Custom Sink

  • Extend the abstract sink class, take events from the channel, and write them out (a minimal sketch follows)
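
A minimal sketch of a custom sink that takes one event per transaction and just prints it to stdout; the class name is made up, while AbstractSink, Channel, Transaction, and Status come from Flume's API:

import org.apache.flume.Channel;
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.Transaction;
import org.apache.flume.conf.Configurable;
import org.apache.flume.sink.AbstractSink;

public class MySink extends AbstractSink implements Configurable {

    @Override
    public void configure(Context context) {
        // read sink settings from the agent's .conf file if needed
    }

    @Override
    public Status process() throws EventDeliveryException {
        Channel channel = getChannel();
        Transaction txn = channel.getTransaction();
        txn.begin();
        try {
            Event event = channel.take();
            if (event == null) {
                // nothing in the channel right now; commit and back off
                txn.commit();
                return Status.BACKOFF;
            }
            // "write out" the event; here it is just printed to stdout
            System.out.println(new String(event.getBody()));
            txn.commit();
            return Status.READY;
        } catch (Exception e) {
            txn.rollback();
            throw new EventDeliveryException(e);
        } finally {
            txn.close();
        }
    }
}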

Flume data flow monitoring

  • Ganglia: a third-party framework for monitoring Flume in real time
Install the httpd service and PHP
sudo yum -y install httpd php

Install the other dependencies
sudo yum -y install rrdtool perl-rrdtool rrdtool-devel
sudo yum -y install apr-devel

Install Ganglia
sudo rpm -Uvh http://dl.fedoraproject.org/pub/epel/6/x86_64/epel-release-6-8.noarch.rpm
sudo yum -y install ganglia-gmetad
sudo yum -y install ganglia-web
sudo yum -y install ganglia-gmond

Edit the configuration file
sudo vim /etc/httpd/conf.d/ganglia.conf
# change "Deny from all" to "Allow from all"

Edit the configuration file
sudo vim /etc/ganglia/gmetad.conf
data_source "localhost102" 192.168.33.102

Edit the configuration file /etc/ganglia/gmond.conf
sudo vim /etc/ganglia/gmond.conf
name="localhost102"
host=192.168.33.102
bind=192.168.33.102

Edit the configuration file
sudo vim /etc/selinux/config
SELINUX=disabled   # takes effect after a reboot; sudo setenforce 0 applies it for the current session only

Start Ganglia
sudo service httpd start
sudo service gmetad start
sudo service gmond start

Open the Ganglia page in a browser
http://192.168.33.102/ganglia
If you get a permission error, change the permissions of the /var/lib/ganglia directory:
sudo chmod -R 777 /var/lib/ganglia

Edit flume-env.sh under /opt/module/flume/conf:
JAVA_OPTS="-Dflume.monitoring.type=ganglia
-Dflume.monitoring.hosts=192.168.33.102:8649
-Xms100m
-Xmx200m"

Start Flume
bin/flume-ng agent -c conf/ -n a1 -f jobs/xxx -Dflume.root.logger=INFO,console
-Dflume.monitoring.type=ganglia
-Dflume.monitoring.hosts=192.168.33.102:8649

Enterprise interview questions