Flume

1. Flume Installation and Deployment

1.1 Download Addresses

1) Flume official website

http://flume.apache.org/

2) Documentation

http://flume.apache.org/FlumeUserGuide.html

3) Download location

http://archive.apache.org/dist/flume/

1.2 Installation and Deployment

1) Upload apache-flume-1.7.0-bin.tar.gz to the /opt/software directory on the Linux server
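
For example, from the machine holding the tarball (a sketch; the host name hadoop102 matches the prompts below, adjust the user and paths to your environment):

scp apache-flume-1.7.0-bin.tar.gz hadoop@hadoop102:/opt/software/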
 
2) Extract apache-flume-1.7.0-bin.tar.gz into the /opt/module/ directory
[hadoop@hadoop102 software]$ tar -zxf apache-flume-1.7.0-bin.tar.gz -C /opt/module/
 
3) Rename apache-flume-1.7.0-bin to flume
[hadoop@hadoop102 module]$ mv apache-flume-1.7.0-bin flume
 
4) Rename the flume-env.sh.template file under flume/conf to flume-env.sh, then configure flume-env.sh
[hadoop@hadoop102 conf]$ mv flume-env.sh.template flume-env.sh
[hadoop@hadoop102 conf]$ vi flume-env.sh
export JAVA_HOME=/opt/module/jdk1.8.0_144
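
A quick sanity check that the configured JDK path actually exists (a minimal check, assuming the path above):

[hadoop@hadoop102 conf]$ /opt/module/jdk1.8.0_144/bin/java -version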
 
 


5) Configure the environment variables

[root@hadoop ~]# cat ~/.bashrc
# .bashrc

# User specific aliases and functions

alias rm='rm -i'
alias cp='cp -i'
alias mv='mv -i'

# Source global definitions
if [ -f /etc/bashrc ]; then
    . /etc/bashrc
fi
export PYTHON_HOME=/usr/local/python3
export PATH=$PYTHON_HOME/bin:$PYTHON_HOME/sbin:$PATH

export HADOOP_HOME=/bigdata/hadoop-2.10.1
export PATH=$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH

export HIVE_HOME=/bigdata/apache-hive-1.2.2-bin
export PATH=$HIVE_HOME/bin:$PATH

export HBASE_HOME=/bigdata/hbase-2.3.7
export PATH=$HBASE_HOME/bin:$HBASE_HOME/sbin:$PATH

export MAVEN_HOME=/bigdata/apache-maven-3.3.9
export PATH=$MAVEN_HOME/bin:$MAVEN_HOME/sbin:$PATH


export SPARK_HOME=/bigdata/spark-3.2.0-bin-hadoop3.2
export PATH=$SPARK_HOME/bin:$SPARK_HOME/sbin:$PATH

export ZOOKEEPER_HOME=/bigdata/zookeeper   # adjust to your actual ZooKeeper install path
export PATH=$ZOOKEEPER_HOME/bin:$PATH

export SQOOP_HOME=/bigdata/sqoop
export PATH=$SQOOP_HOME/bin:$PATH

export FLUME_HOME=/bigdata/flume
export PATH=$FLUME_HOME/bin:$PATH
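
After saving, reload the shell configuration so the new variables take effect in the current session:

[root@hadoop ~]# source ~/.bashrc
[root@hadoop ~]# echo $FLUME_HOME
/bigdata/flume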
 

6) Check the version

[hadoop@hadoop102 ~]$ flume-ng version
Flume 1.7.0
Source code repository: https://git-wip-us.apache.org/repos/asf/flume.git
Revision: 511d868555dd4d16e6ce4fedc72c2d1454546707
Compiled by bessbd on Wed Oct 12 20:51:10 CEST 2016
From source with checksum 0d21b3ffdc55a07e1d08875872c00523

2: Using Flume to collect logs from business server A into Hive on the Hadoop server's HDFS

1: Flume configuration

Official documentation: https://flume.apache.org/releases/content/1.7.0/FlumeUserGuide.html

  • Go to the flume/conf directory

Create a file named collect_click.conf and put the Flume configuration below into it:

  • sources: tail the end of the log file in real time; the interceptors keep only JSON lines and add a timestamp header
  • channels: a memory channel; the batch (PutList/TakeList) size is set by transactionCapacity and the total channel capacity by capacity (see the parameters below)
  • sink: write directly to HDFS at the given path; files roll by size (rollSize, default 1024 bytes, set to 10240 here) or by event count (rollCount, 0 disables it), and idle files are closed after idleTimeout seconds

collect_click.conf

[root@hadoop conf]# cat collect_click.conf 
a1.sources = s1
a1.sinks = k1
a1.channels = c1

a1.sources.s1.channels= c1
a1.sources.s1.type = exec
# log file to monitor
a1.sources.s1.command = tail -F /root/logs/userClick.log
a1.sources.s1.interceptors=i1 i2
a1.sources.s1.interceptors.i1.type=regex_filter
# keep only JSON lines
a1.sources.s1.interceptors.i1.regex=\\{.*\\}
a1.sources.s1.interceptors.i2.type=timestamp

# channel1
a1.channels.c1.type=memory
a1.channels.c1.capacity=30000
a1.channels.c1.transactionCapacity=1000

# k1
a1.sinks.k1.type=hdfs
a1.sinks.k1.channel=c1
# HDFS target directory, one subdirectory per day
a1.sinks.k1.hdfs.path=hdfs://192.168.50.129:9000/user/hive/warehouse/profile.db/user_action/%Y-%m-%d
a1.sinks.k1.hdfs.useLocalTimeStamp = true
a1.sinks.k1.hdfs.fileType=DataStream
a1.sinks.k1.hdfs.writeFormat=Text
a1.sinks.k1.hdfs.rollInterval=0
a1.sinks.k1.hdfs.rollSize=10240
a1.sinks.k1.hdfs.rollCount=0
a1.sinks.k1.hdfs.idleTimeout=60
[root@hadoop conf]#
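
Once the agent is running (step 3 below), you can exercise the pipeline end to end by appending a sample line to the monitored log; the record below is hypothetical but matches the fields the user_action table defined next expects:

echo '{"actionTime":"2022-02-17 10:00:00","readTime":"","channelId":18,"param":{"action":"click","userId":"1","articleId":"14299","algorithmCombine":"C2"}}' >> /root/logs/userClick.log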

 

2: Hive configuration

To keep each day's data queryable, partition the storage by date. Two things are needed:

  • adjust the table structure (add a dt partition column)
  • adjust the Flume sink path (the %Y-%m-%d suffix shown above)

Here we create a new database named profile for user-related data; the computed user profiles will also be stored in it.

create database if not exists profile comment "user action" location '/user/hive/warehouse/profile.db/';

Create the user_action table in the profile database, specifying its format:

create table user_action(
actionTime STRING comment "user actions time",
readTime STRING comment "user reading time",
channelId INT comment "article channel id",
param map<STRING, STRING> comment "action parameter")
COMMENT "user primitive action"
PARTITIONED BY(dt STRING)  -- partition column
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION '/user/hive/warehouse/profile.db/user_action';
  • ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe': registers a JSON SerDe so each log line is parsed as a JSON record (see the classpath note below)
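
Note: this SerDe ships in the hive-hcatalog-core jar, which is not on Hive's classpath by default. If the CREATE TABLE or later queries fail with a ClassNotFoundException, register the jar first; the path below assumes the Hive 1.2.2 install from the environment variables above:

hive> ADD JAR /bigdata/apache-hive-1.2.2-bin/hcatalog/share/hcatalog/hive-hcatalog-core-1.2.2.jar;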

Important: at this point the Hive user_action table exists and Hadoop has the corresponding directory. Flume will create the dated subdirectories automatically, but to read the contents (e.g. through Spark SQL) each day's directory still has to be attached as a partition manually; a later section covers this.

# After Flume creates a dated directory, attach it manually as a partition
alter table user_action add partition (dt='2018-12-11') location "/user/hive/warehouse/profile.db/user_action/2018-12-11/";
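
Because a new directory appears every day, the attach step can be scripted; below is a minimal sketch (a hypothetical helper, not part of the original setup) that could run from cron once a day:

#!/bin/bash
# attach today's Flume output directory as a Hive partition
dt=$(date +%F)   # e.g. 2022-02-17, matches the %Y-%m-%d pattern in the sink path
hive -e "alter table profile.user_action add if not exists partition (dt='${dt}') location '/user/hive/warehouse/profile.db/user_action/${dt}/';"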

3: Start the collection command

flume-ng agent -c /root/bigdata/flume/conf -f /root/bigdata/flume/conf/collect_click.conf -Dflume.root.logger=INFO,console -n a1
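
The command above runs in the foreground and logs to the console. For a long-running collector you would typically detach it, e.g. with nohup (same paths assumed; the log file name is an example):

nohup flume-ng agent -c /root/bigdata/flume/conf -f /root/bigdata/flume/conf/collect_click.conf -n a1 > /root/logs/flume-agent.out 2>&1 &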

 

View the result in the HDFS web UI:  http://192.168.50.129:50070/explorer.html#/user/hive/warehouse/profile.db/user_action/2022-02-17
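
Equivalently, from the command line (assuming the same NameNode):

hdfs dfs -ls /user/hive/warehouse/profile.db/user_action/2022-02-17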

Query in Hive:

hive> alter table user_action add partition (dt='2022-02-17') location "/user/hive/warehouse/profile.db/user_action/2022-02-17/";
OK
Time taken: 0.317 seconds
hive> select * from user_action where dt='2022-02-17';
OK
2019-04-10 18:15:32        18    {"action":"click","userId":"2","articleId":"18005","algorithmCombine":"C2"}    2022-02-17
2019-04-10 18:15:34    1053    18    {"action":"read","userId":"2","articleId":"18005","algorithmCombine":"C2"}    2022-02-17
2019-04-10 18:15:36        18    {"action":"click","userId":"2","articleId":"18577","algorithmCombine":"C2"}    2022-02-17
2019-04-10 18:15:38    1621    18    {"action":"read","userId":"2","articleId":"18577","algorithmCombine":"C2"}    2022-02-17
2019-04-10 18:15:39        18    {"action":"click","userId":"1","articleId":"14299","algorithmCombine":"C2"}    2022-02-17
2019-04-10 18:15:39        18    {"action":"click","userId":"2","articleId":"14299","algorithmCombine":"C2"}    2022-02-17
2019-04-10 18:15:41    914    18    {"action":"read","userId":"2","articleId":"14299","algorithmCombine":"C2"}    2022-02-17
2019-04-10 18:15:47    7256    18    {"action":"read","userId":"1","articleId":"14299","algorithmCombine":"C2"}    2022-02-17
2019-04-10 21:04:39        18    {"action":"click","userId":"2","articleId":"14299","algorithmCombine":"C2"}    2022-02-17
2019-04-10 21:04:39        18    {"action":"click","userId":"2","articleId":"14299","algorithmCombine":"C2"}    2022-02-17
2019-04-10 21:04:39        18    {"action":"click","userId":"2","articleId":"14299","algorithmCombine":"C2"}    2022-02-17
2019-04-10 21:04:39        18    {"action":"click","userId":"2","articleId":"14299","algorithmCombine":"C2"}    2022-02-17
2019-04-10 21:04:39        18    {"action":"click","userId":"2","articleId":"14299","algorithmCombine":"C2"}    2022-02-17
2019-04-10 21:04:39        18    {"action":"click","userId":"2","articleId":"14299","algorithmCombine":"C2"}    2022-02-17
2019-04-10 21:04:39        18    {"action":"click","userId":"2","articleId":"14299","algorithmCombine":"C2"}    2022-02-17
Time taken: 0.358 seconds, Fetched: 15 row(s)
hive>