数据处理,简易流程开发
跟项目经理,确认业务
自己思考,写出大概流程,画出草体,跟同事交流,
写出具体流程,画出具体流程图,
确定业务,跟项目经理交流,进一步确定项目,
开始进行项目的开发
0 、创建项目scala项目:G A V ,构建项目目录树,导入依赖
向pom.xml中导入依赖
<scala.version>2.11.8</scala.version>
<kafka.version>0.9.0.0</kafka.version>
<spark.version>2.2.0</spark.version>
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-library</artifactId>
<version>${scala.version}</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_2.11</artifactId>
<version>${spark.version}</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming-kafka-0-8_2.11</artifactId>
<version>${spark.version}</version>
</dependency>
<dependency>
<groupId>org.apache.flume.flume-ng-clients</groupId>
<artifactId>flume-ng-log4jappender</artifactId>
<version>1.6.0</version>
</dependency>
1 、编写log4j,进行本地测试,测试log4j是否生成日志
src/main/java/com/imooc/LoggerGenerator.java
import org.apache.log4j.Logger;
/**
* 模拟日志生成
*/
public class GenerateLog4j {
private static Logger logger = Logger.getLogger(LoggerGenerator.class.getName());
public static void main(String[] args) throws InterruptedException {
int index = 0;
while (true) {
Thread.sleep(1000);
logger.info("current : " + index++);
}
}
}
src/resources/log4j.properties:
log4j.rootLogger=info,stdout,flume
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.Target=System.out
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%d{yyyy-MM-dd HH:mm:ss,SSS} [%t] [%c] [%p] - %m%n
测试正确后,下一步
2 、flume从log4j收集数据
在flume官网中到log4jAppender中对接flume的属性值
放入本地资源共享库resources内的log4j.properties中
src/resources/log4j.properties:
log4j.rootLogger=info,stdout,flume
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.Target=System.out
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%d{yyyy-MM-dd HH:mm:ss,SSS} [%t] [%c] [%p] - %m%n
#log4j对接flume,输出到控制台
log4j.appender.flume = org.apache.flume.clients.log4jappender.Log4jAppender
log4j.appender.flume.Hostname = Master
log4j.appender.flume.Port = 41414
log4j.appender.flume.UnsafeMode = true
编写log4j对接flume配置的测试,log4j-channels-flume.conf
然后进行控制台进行测试
Note:编写flume的配置文件时,直接取官网查找,然后粘贴复制,最好不要手敲,容易出错的
log4j-channels-flume.conf:
agent1.sources=avro-source
agent1.channels=logger-channel
agent1.sinks=log-sink
#define source
agent1.sources.avro-source.type = avro
agent1.sources.avro-source.bind = 0.0.0.0
agent1.sources.avro-source.port = 41414
#define channel
agent1.channels.logger-channel.type = memory
#define sink
agent1.sinks.log-sink.type = logger
#linked sources and sinks
agent1.sources.avro-source.channels = logger-channel
agent1.sinks.log-sink.channel = logger-channel
启动flume:
flume-ng agent \
--conf /Users/hadoop/app/flume/conf \
--conf-file /Users/hadoop/app/flume/Configure/log4j-channels-flume.conf \
--name agent1 \
-Dflume.root.logger=INFO,console
测试正确后,下一步
3 、flume输送数据到kafka
先启动zookeeper服务器,然后启动kafka服务器,创建并查看kafka的topic
编写flume配置文件flume-channels-kafka.conf,去官网找kafka sink的配置信息,进行更改
编写kafka对接SparkStreaming的测试代码。
启动flume,
启动kafka的消费者
启动log4jGenerated.java程序,
启动kafkaStreaming.scala程序
观察控制台和kafka消费者终端的信息。
$ ./zkServer.sh start
$ ./kafka-server-start.sh -daemon /Users/hadoop/app/kafka/config/server.properties &
$ ./kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic streamingtopicdemo
$ ./kafka-topics.sh --list --zookeeper localhost:2181
$ ./kafka-topics.sh --describe --zookeeper localhost:2181 --topic streamingtopicdemo
$ ./kafka-console-consumer.sh --zookeeper Master:2181 --topic streamingtopic
$ flume-ng agent --conf $FLUME_HOME/conf --conf-file $FLUME_HOME/Configure/flume-channels-kafka.conf --name agent1 -Dflume.root.logger=INFO,consol
IDEA下 运行log4jGenerated.java
IDEA下 运行kafkaStreaming.scala
/src/main/scala/com/imooc/KafkaStreaming.scala
Program arguments: Master:2181 test streamingtopic 1
import org.apache.spark.SparkConf
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}
/**
* Kafka对接Spark Streaming
*/
object KafkaStreaming {
def main(args: Array[String]): Unit = {
if (args.length != 4) {
System.err.println("Usage: <zkQuorum> <groupId> <topics> <numThreads>")
}
val Array(zkQuorum, groupId, topics, numThreads) = args
val sparkConf = new SparkConf().setAppName("KafkaStreaming").setMaster("local[2]");
val ssc = new StreamingContext(sparkConf,Seconds(5))
val topicsMap = topics.split(",").map((_, numThreads.toInt)).toMap
// TODO... Spark Streaming如何对接Kafka
val message = KafkaUtils.createStream(ssc, zkQuorum, groupId, topicsMap)
// TODO... Spark Streaming的测试代码
message.map(_._2).count().print()
ssc.start()
ssc.awaitTermination()
}
}
测试正确后,下一步
4 、开始SparkStreaming代码的正式开发
现在是在本地进行测试的,在IDEA中运行LoggerGenerator,
然后使用Flume、Kafka以及Spark Streaming进行处理操作。在生产上肯定不是这么干的,怎么干呢?
1)打包jar,执行LoggerGenerator类
2)Flume、Kafka和我们的测试是一样的
3)Spark Streaming的代码也是需要打成jar包,然后使用spark-submit的方式进行提交到环境上执行
到底采用什么方式? 可以根据实际情况选择运行模式: local/yarn/standalone/mesos在生产上,整个流处理的流程都一样的,区别在于业务逻辑的复杂性