Flume

1. Flume Installation and Deployment

1.1 Download Addresses

1) Flume official website

http://flume.apache.org/

2) Documentation

http://flume.apache.org/FlumeUserGuide.html

3) Download location

http://archive.apache.org/dist/flume/

1.2 Installation and Deployment

1) Upload apache-flume-1.7.0-bin.tar.gz to the /opt/software directory on the Linux server
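
For example, from the machine holding the tarball (a sketch; the host name hadoop102 matches the prompts below, adjust the user and paths to your environment):

scp apache-flume-1.7.0-bin.tar.gz hadoop@hadoop102:/opt/software/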
 
2) Extract apache-flume-1.7.0-bin.tar.gz into the /opt/module/ directory
[hadoop@hadoop102 software]$ tar -zxf apache-flume-1.7.0-bin.tar.gz -C /opt/module/
 
3) Rename apache-flume-1.7.0-bin to flume
[hadoop@hadoop102 module]$ mv apache-flume-1.7.0-bin flume
 
4) Rename the flume-env.sh.template file under flume/conf to flume-env.sh, then configure flume-env.sh
[hadoop@hadoop102 conf]$ mv flume-env.sh.template flume-env.sh
[hadoop@hadoop102 conf]$ vi flume-env.sh
export JAVA_HOME=/opt/module/jdk1.8.0_144
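
A quick sanity check that the configured JDK path actually exists (a minimal check, assuming the path above):

[hadoop@hadoop102 conf]$ /opt/module/jdk1.8.0_144/bin/java -version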
 
 


5) Configure the environment variables

[root@hadoop ~]# cat ~/.bashrc
# .bashrc

# User specific aliases and functions

alias rm='rm -i'
alias cp='cp -i'
alias mv='mv -i'

# Source global definitions
if [ -f /etc/bashrc ]; then
    . /etc/bashrc
fi
export PYTHON_HOME=/usr/local/python3
export PATH=$PYTHON_HOME/bin:$PYTHON_HOME/sbin:$PATH

export HADOOP_HOME=/bigdata/hadoop-2.10.1
export PATH=$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH

export HIVE_HOME=/bigdata/apache-hive-1.2.2-bin
export PATH=$HIVE_HOME/bin:$PATH

export HBASE_HOME=/bigdata/hbase-2.3.7
export PATH=$HBASE_HOME/bin:$HBASE_HOME/sbin:$PATH

export MAVEN_HOME=/bigdata/apache-maven-3.3.9
export PATH=$MAVEN_HOME/bin:$MAVEN_HOME/sbin:$PATH


export SPARK_HOME=/bigdata/spark-3.2.0-bin-hadoop3.2
export PATH=$SPARK_HOME/bin:$SPARK_HOME/sbin:$PATH

export ZOOKEEPER_HOME=/bigdata/zookeeper   # adjust to your actual ZooKeeper install path
export PATH=$ZOOKEEPER_HOME/bin:$PATH

export SQOOP_HOME=/bigdata/sqoop
export PATH=$SQOOP_HOME/bin:$PATH

export FLUME_HOME=/bigdata/flume
export PATH=$FLUME_HOME/bin:$PATH
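
After saving, reload the shell configuration so the new variables take effect in the current session:

[root@hadoop ~]# source ~/.bashrc
[root@hadoop ~]# echo $FLUME_HOME
/bigdata/flume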
 

6) Check the version

[hadoop@hadoop102 ~]$ flume-ng version
Flume 1.7.0
Source code repository: https://git-wip-us.apache.org/repos/asf/flume.git
Revision: 511d868555dd4d16e6ce4fedc72c2d1454546707
Compiled by bessbd on Wed Oct 12 20:51:10 CEST 2016
From source with checksum 0d21b3ffdc55a07e1d08875872c00523

2: Using Flume to collect logs from business server A into Hive on the Hadoop server's HDFS

1: Flume configuration

Official documentation: https://flume.apache.org/releases/content/1.7.0/FlumeUserGuide.html

  • Go to the flume/conf directory

Create a file named collect_click.conf and put the Flume configuration below into it:

  • sources: tail the end of the log file in real time; the interceptors keep only JSON lines and add a timestamp header
  • channels: a memory channel; the batch (PutList/TakeList) size is set by transactionCapacity and the total channel capacity by capacity (see the parameters below)
  • sink: write directly to HDFS at the given path; files roll by size (rollSize, default 1024 bytes, set to 10240 here) or by event count (rollCount, 0 disables it), and idle files are closed after idleTimeout seconds

collect_click.conf

[root@hadoop conf]# cat collect_click.conf 
a1.sources = s1
a1.sinks = k1
a1.channels = c1

a1.sources.s1.channels= c1
a1.sources.s1.type = exec
# log file to monitor
a1.sources.s1.command = tail -F /root/logs/userClick.log
a1.sources.s1.interceptors=i1 i2
a1.sources.s1.interceptors.i1.type=regex_filter
# keep only JSON lines
a1.sources.s1.interceptors.i1.regex=\\{.*\\}
a1.sources.s1.interceptors.i2.type=timestamp

# channel1
a1.channels.c1.type=memory
a1.channels.c1.capacity=30000
a1.channels.c1.transactionCapacity=1000

# k1
a1.sinks.k1.type=hdfs
a1.sinks.k1.channel=c1
# HDFS target directory, one subdirectory per day
a1.sinks.k1.hdfs.path=hdfs://192.168.50.129:9000/user/hive/warehouse/profile.db/user_action/%Y-%m-%d
a1.sinks.k1.hdfs.useLocalTimeStamp = true
a1.sinks.k1.hdfs.fileType=DataStream
a1.sinks.k1.hdfs.writeFormat=Text
a1.sinks.k1.hdfs.rollInterval=0
a1.sinks.k1.hdfs.rollSize=10240
a1.sinks.k1.hdfs.rollCount=0
a1.sinks.k1.hdfs.idleTimeout=60
[root@hadoop conf]#
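
Once the agent is running (step 3 below), you can exercise the pipeline end to end by appending a sample line to the monitored log; the record below is hypothetical but matches the fields the user_action table defined next expects:

echo '{"actionTime":"2022-02-17 10:00:00","readTime":"","channelId":18,"param":{"action":"click","userId":"1","articleId":"14299","algorithmCombine":"C2"}}' >> /root/logs/userClick.log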

 

2: Hive configuration

To keep each day's data queryable, partition the storage by date. Two things are needed:

  • adjust the table structure (add a dt partition column)
  • adjust the Flume sink path (the %Y-%m-%d suffix shown above)

Here we create a new database named profile for user-related data; the computed user profiles will also be stored in it.

create database if not exists profile comment "user action" location '/user/hive/warehouse/profile.db/';

Create the user_action table in the profile database, specifying its format:

create table user_action(
actionTime STRING comment "user actions time",
readTime STRING comment "user reading time",
channelId INT comment "article channel id",
param map<STRING, STRING> comment "action parameter")
COMMENT "user primitive action"
PARTITIONED BY(dt STRING)  -- partition column
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION '/user/hive/warehouse/profile.db/user_action';
  • ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe': registers a JSON SerDe so each log line is parsed as a JSON record (see the classpath note below)
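
Note: this SerDe ships in the hive-hcatalog-core jar, which is not on Hive's classpath by default. If the CREATE TABLE or later queries fail with a ClassNotFoundException, register the jar first; the path below assumes the Hive 1.2.2 install from the environment variables above:

hive> ADD JAR /bigdata/apache-hive-1.2.2-bin/hcatalog/share/hcatalog/hive-hcatalog-core-1.2.2.jar;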

Important: at this point the Hive user_action table exists and Hadoop has the corresponding directory. Flume will create the dated subdirectories automatically, but to read the contents (e.g. through Spark SQL) each day's directory still has to be attached as a partition manually; a later section covers this.

# After Flume creates a dated directory, attach it manually as a partition
alter table user_action add partition (dt='2018-12-11') location "/user/hive/warehouse/profile.db/user_action/2018-12-11/";
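
Because a new directory appears every day, the attach step can be scripted; below is a minimal sketch (a hypothetical helper, not part of the original setup) that could run from cron once a day:

#!/bin/bash
# attach today's Flume output directory as a Hive partition
dt=$(date +%F)   # e.g. 2022-02-17, matches the %Y-%m-%d pattern in the sink path
hive -e "alter table profile.user_action add if not exists partition (dt='${dt}') location '/user/hive/warehouse/profile.db/user_action/${dt}/';"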

3: Start the collection command

flume-ng agent -c /root/bigdata/flume/conf -f /root/bigdata/flume/conf/collect_click.conf -Dflume.root.logger=INFO,console -n a1
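
The command above runs in the foreground and logs to the console. For a long-running collector you would typically detach it, e.g. with nohup (same paths assumed; the log file name is an example):

nohup flume-ng agent -c /root/bigdata/flume/conf -f /root/bigdata/flume/conf/collect_click.conf -n a1 > /root/logs/flume-agent.out 2>&1 &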

 

View the result in the HDFS web UI:  http://192.168.50.129:50070/explorer.html#/user/hive/warehouse/profile.db/user_action/2022-02-17
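
Equivalently, from the command line (assuming the same NameNode):

hdfs dfs -ls /user/hive/warehouse/profile.db/user_action/2022-02-17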

Query in Hive:

hive> alter table user_action add partition (dt='2022-02-17') location "/user/hive/warehouse/profile.db/user_action/2022-02-17/";
OK
Time taken: 0.317 seconds
hive> select * from user_action where dt='2022-02-17';
OK
2019-04-10 18:15:32        18    {"action":"click","userId":"2","articleId":"18005","algorithmCombine":"C2"}    2022-02-17
2019-04-10 18:15:34    1053    18    {"action":"read","userId":"2","articleId":"18005","algorithmCombine":"C2"}    2022-02-17
2019-04-10 18:15:36        18    {"action":"click","userId":"2","articleId":"18577","algorithmCombine":"C2"}    2022-02-17
2019-04-10 18:15:38    1621    18    {"action":"read","userId":"2","articleId":"18577","algorithmCombine":"C2"}    2022-02-17
2019-04-10 18:15:39        18    {"action":"click","userId":"1","articleId":"14299","algorithmCombine":"C2"}    2022-02-17
2019-04-10 18:15:39        18    {"action":"click","userId":"2","articleId":"14299","algorithmCombine":"C2"}    2022-02-17
2019-04-10 18:15:41    914    18    {"action":"read","userId":"2","articleId":"14299","algorithmCombine":"C2"}    2022-02-17
2019-04-10 18:15:47    7256    18    {"action":"read","userId":"1","articleId":"14299","algorithmCombine":"C2"}    2022-02-17
2019-04-10 21:04:39        18    {"action":"click","userId":"2","articleId":"14299","algorithmCombine":"C2"}    2022-02-17
2019-04-10 21:04:39        18    {"action":"click","userId":"2","articleId":"14299","algorithmCombine":"C2"}    2022-02-17
2019-04-10 21:04:39        18    {"action":"click","userId":"2","articleId":"14299","algorithmCombine":"C2"}    2022-02-17
2019-04-10 21:04:39        18    {"action":"click","userId":"2","articleId":"14299","algorithmCombine":"C2"}    2022-02-17
2019-04-10 21:04:39        18    {"action":"click","userId":"2","articleId":"14299","algorithmCombine":"C2"}    2022-02-17
2019-04-10 21:04:39        18    {"action":"click","userId":"2","articleId":"14299","algorithmCombine":"C2"}    2022-02-17
2019-04-10 21:04:39        18    {"action":"click","userId":"2","articleId":"14299","algorithmCombine":"C2"}    2022-02-17
Time taken: 0.358 seconds, Fetched: 15 row(s)
hive>