Flink WordCount Explained
Preparing the pom dependencies:
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.flink</groupId>
    <artifactId>flinkdemo</artifactId>
    <version>1.0-SNAPSHOT</version>

    <dependencies>
        <!-- https://mvnrepository.com/artifact/org.apache.flink/flink-streaming-scala -->
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-streaming-scala_2.11</artifactId>
            <version>1.11.1</version>
            <scope>provided</scope>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.apache.flink/flink-clients
             Since Flink 1.11, flink-streaming-java no longer depends on flink-clients,
             so it must be added explicitly -->
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-clients_2.11</artifactId>
            <version>1.11.1</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.apache.flink/flink-scala -->
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-scala_2.11</artifactId>
            <version>1.11.1</version>
        </dependency>
    </dependencies>
</project>
Flink WordCount comes in two flavors: batch processing and stream processing.
1. Batch processing:
import org.apache.flink.api.scala.ExecutionEnvironment

object DataSetFlink {
  def main(args: Array[String]): Unit = {
    val environment = ExecutionEnvironment.getExecutionEnvironment // obtain the batch execution environment
    import org.apache.flink.api.scala._ // implicit TypeInformation required by the Scala API
    val data = environment.readTextFile("D:\\a.txt")
    val result = data.flatMap(_.split(" "))
      .filter(_.nonEmpty)
      .map((_, 1))
      .groupBy(0) // group by the first tuple field (the word)
      .sum(1)     // sum the second tuple field (the count)
    result.print() // print() triggers execution; once it returns, the job is done
  }
}
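To see what each operator in the chain computes without a Flink runtime, the same pipeline can be mimicked with plain Scala collections. The sample lines below are made up for illustration and stand in for the contents of D:\a.txt:

```scala
object WordCountSketch {
  def main(args: Array[String]): Unit = {
    // Hypothetical stand-in for the lines read from the input file
    val lines = Seq("hello flink", "hello world", "")

    val counts = lines
      .flatMap(_.split(" "))   // split each line into words
      .filter(_.nonEmpty)      // drop empty strings from blank lines
      .map((_, 1))             // pair each word with an initial count of 1
      .groupBy(_._1)           // like groupBy(0): group pairs by the word
      .map { case (word, pairs) => (word, pairs.map(_._2).sum) } // like sum(1)

    counts.toSeq.sortBy(_._1).foreach(println) // (flink,1) (hello,2) (world,1)
  }
}
```

The collection version makes the tuple-index arguments concrete: groupBy(0) keys on the word and sum(1) accumulates the ones attached to it.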
Result:
2. Stream processing
Prepare a machine (e.g. a virtual machine) and open a listening port, for example: nc -l 7777 (some netcat versions need -k to keep listening after a client disconnects).
import org.apache.flink.api.java.utils.ParameterTool
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment

object StreamingFlink {
  def main(args: Array[String]): Unit = {
    val environment = StreamExecutionEnvironment.getExecutionEnvironment // obtain the stream execution environment
    // wrap the external program arguments
    val tool = ParameterTool.fromArgs(args)
    val host: String = tool.get("host")
    val port = tool.getInt("port")
    val datastream = environment.socketTextStream(host, port)
    import org.apache.flink.api.scala._
    val value = datastream
      .flatMap(_.split(" "))
      .filter(_.nonEmpty)
      .map(a => (a, 1))
      .keyBy(0) // stream processing uses keyBy, whereas batch uses groupBy
      .sum(1)
      .print()
    environment.execute() // execute() starts the job and keeps the streaming environment listening
  }
}
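Unlike the batch job, the keyed sum in a stream emits a new running total every time a record for that key arrives, rather than a single final count. That per-record behavior can be sketched with plain Scala collections (no Flink needed; the arriving words are a made-up sample):

```scala
object RunningCountSketch {
  def main(args: Array[String]): Unit = {
    // Words as they might arrive one by one over the socket (hypothetical)
    val arriving = Seq("hello", "flink", "hello")

    // Like keyBy(0).sum(1): keep per-key state, emit an updated count per record
    val emitted = arriving
      .scanLeft(Map.empty[String, Int]) { (state, word) =>
        state.updated(word, state.getOrElse(word, 0) + 1)
      }
      .drop(1) // drop the initial empty state
      .zip(arriving)
      .map { case (state, word) => (word, state(word)) }

    emitted.foreach(println) // (hello,1) (flink,1) (hello,2)
  }
}
```

Typing "hello" twice into the nc session therefore produces both (hello,1) and (hello,2) in the Flink output.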
Result:
PS: in the streaming output, the number printed before the ">" next to each word is the index of the parallel subtask that produced it; the default parallelism equals the machine's number of CPU cores.
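A local Flink environment typically derives that default parallelism from the number of logical cores the JVM sees, which can be checked directly:

```scala
object DefaultParallelismSketch {
  def main(args: Array[String]): Unit = {
    // Logical cores visible to the JVM; a local Flink environment uses this
    // as its default parallelism unless configured otherwise
    val cores = Runtime.getRuntime.availableProcessors()
    println(s"default parallelism (cores): $cores")
  }
}
```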
posted on 2020-11-05 19:11 RICH-ATONE