Overview
When a user clicks a page, the click record is stored in a.log. (This project skips that step; the data is already present in a.log.)
Java code replays the records from a.log into project.log.
Flume collects the log by monitoring project.log for new content and writes the newly appended user records out to HDFS.
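Putting the steps together, the data flows as follows (every component is shown in the sections below):
a.log → SimPro (Java, appends records to project.log with random delays) → project.log → Flume exec source (tail -F) → memory channel → HDFS sink → hdfs://node1:9000/project/%Y%m%d/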
Existing data in a.log
120.191.181.178 - - 2018-02-18 20:24:39 "POST https://www.taobao.com/item/b HTTP/1.1" 203 69172 https:
149.74.183.133 - - 2018-09-24 19:38:17 "GET https://www.taobao.com/register HTTP/1.0" 300 72815 https:
58.9.92.122 - - 2018-08-30 11:28:15 "GET https://www.taobao.com/list/ HTTP/1.0" 203 17119 https:
77.56.72.210 - - 2018-05-13 18:11:22 "POST https://www.taobao.com/category/b HTTP/1.0" 201 17843 https:
217.147.196.74 - - 2018-08-22 18:06:01 "GET https://www.taobao.com/category/c HTTP/1.0" 501 95033 https:
37.146.124.65 - - 2018-05-08 02:12:24 "POST https://www.taobao.com/category/d HTTP/1.1" 203 47329 https:
167.108.24.171 - - 2018-08-20 12:19:01 "POST https://www.taobao.com/recommand HTTP/1.1" 302 80056 https:
94.69.229.202 - - 2018-07-27 13:46:37 "GET https://www.taobao.com/recommand HTTP/1.1" 501 63116 https:
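Each record is a simplified access-log line: client IP, two placeholder fields, a date and time, the quoted request line, a status code, a byte count, and a referrer (truncated to https: in the sample above). As a minimal sketch of how one line breaks into those fields, the illustrative class below (not part of the project code) cuts out the quoted request first and splits the rest on whitespace:
// Illustrative only: split one access-log record from a.log into its fields.
public class LogLineDemo {
    public static void main(String[] args) {
        String line = "120.191.181.178 - - 2018-02-18 20:24:39 \"POST https://www.taobao.com/item/b HTTP/1.1\" 203 69172 https:";
        // The request line contains spaces, so extract the quoted part before splitting the rest.
        int quoteStart = line.indexOf('"');
        int quoteEnd = line.lastIndexOf('"');
        String request = line.substring(quoteStart + 1, quoteEnd);   // POST https://www.taobao.com/item/b HTTP/1.1
        String[] head = line.substring(0, quoteStart).trim().split("\\s+");
        String[] tail = line.substring(quoteEnd + 1).trim().split("\\s+");
        String ip = head[0];                       // 120.191.181.178
        String dateTime = head[3] + " " + head[4]; // 2018-02-18 20:24:39
        String status = tail[0];                   // 203
        String bytes = tail[1];                    // 69172
        System.out.println(ip + " | " + dateTime + " | " + request + " | " + status + " | " + bytes);
    }
}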
Writing the log collection script
On node1, in the dataCollect directory, create the agent configuration dataCollect.conf (the file passed to flume-ng below):
# Agent components
dataCollect.sources = s1
dataCollect.channels = c1
dataCollect.sinks = k1

# Source: tail project.log and turn every newly appended line into an event
dataCollect.sources.s1.type = exec
dataCollect.sources.s1.command = tail -F /opt/project/dataCollect/project.log

# Channel: in-memory buffer between source and sink
dataCollect.channels.c1.type = memory
dataCollect.channels.c1.capacity = 20000
dataCollect.channels.c1.transactionCapacity = 10000
dataCollect.channels.c1.byteCapacity = 1048576000

# Sink: write events to HDFS, bucketed into one directory per day
dataCollect.sinks.k1.type = hdfs
dataCollect.sinks.k1.hdfs.path = hdfs://node1:9000/project/%Y%m%d/
dataCollect.sinks.k1.hdfs.filePrefix = project-
dataCollect.sinks.k1.hdfs.fileSuffix = .log
dataCollect.sinks.k1.hdfs.round = true
dataCollect.sinks.k1.hdfs.roundValue = 24
dataCollect.sinks.k1.hdfs.roundUnit = hour
dataCollect.sinks.k1.hdfs.useLocalTimeStamp = true
dataCollect.sinks.k1.hdfs.batchSize = 5000
dataCollect.sinks.k1.hdfs.fileType = DataStream
# Roll files by time (6 h) or size (~128 MB), never by event count
dataCollect.sinks.k1.hdfs.rollInterval = 21600
dataCollect.sinks.k1.hdfs.rollSize = 134217700
dataCollect.sinks.k1.hdfs.rollCount = 0
dataCollect.sinks.k1.hdfs.minBlockReplicas = 1

# Wiring
dataCollect.sources.s1.channels = c1
dataCollect.sinks.k1.channel = c1
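With this configuration, events are bucketed by the day they are collected (useLocalTimeStamp plus rounding to 24 hours and the %Y%m%d pattern). For example, records collected on 2018-02-18 would land under hdfs://node1:9000/project/20180218/ in files whose names start with project- and end in .log (Flume inserts a counter between the two), each file rolling after 6 hours (21600 s) or roughly 128 MB (134217700 bytes), never by event count.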
Writing the user-behavior data into project.log with Java
package com.sxuek;

import java.io.*;
import java.util.Random;

public class SimPro {
    public static void main(String[] args) throws IOException, InterruptedException {
        // Source of the prepared click records
        BufferedReader br = new BufferedReader(new InputStreamReader(
                new FileInputStream("/root/project/dataCollect/a.log"), "UTF-8"));
        // Append to project.log; this path must match the file tailed by the Flume
        // exec source (the configuration above tails /opt/project/dataCollect/project.log)
        BufferedWriter bw = new BufferedWriter(new OutputStreamWriter(
                new FileOutputStream("/root/project/dataCollect/project.log", true), "UTF-8"));
        Random random = new Random();
        String line = br.readLine();
        while (line != null) {
            int time = random.nextInt(5000);          // pause up to 5 s between batches
            int count = 1 + random.nextInt(10000);    // 1..10000 clicks in this batch
            Thread.sleep(time);
            System.out.println("Waited " + time + " ms; " + count
                    + " users clicked the site, generating user-behavior log records");
            for (int i = 0; i < count && line != null; i++) {
                bw.write(line);
                bw.newLine();
                bw.flush();                           // flush so tail -F sees each line immediately
                line = br.readLine();
            }
        }
        bw.close();
        br.close();
    }
}
Package the class as a jar, upload it to the server, and run it (commands below).
Start collecting. Launch the Flume agent first and then the jar, so the agent is already tailing project.log when records start arriving.
flume-ng agent -n dataCollect -f dataCollect.conf -Dflume.root.logger=INFO,console
java -cp untitled.jar com.sxuek.SimPro
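To confirm that data is arriving, listing the target directory on HDFS (a quick check, not part of the original steps) should show the dated subdirectories and the project-*.log files inside them:
hdfs dfs -ls -R /project/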