RICH-ATONE

Flink WordCount Explained in Detail

 

Preparing the pom dependencies:

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.flink</groupId>
    <artifactId>flinkdemo</artifactId>
    <version>1.0-SNAPSHOT</version>

    <dependencies>

        <!-- https://mvnrepository.com/artifact/org.apache.flink/flink-streaming-scala -->
        <!-- "provided" assumes the cluster supplies this jar at runtime; to run from an IDE, put provided-scope dependencies on the run classpath or drop the scope -->
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-streaming-scala_2.11</artifactId>
            <version>1.11.1</version>
            <scope>provided</scope>
        </dependency>

        <!-- https://mvnrepository.com/artifact/org.apache.flink/flink-clients -->
        <!-- Starting with Flink 1.11, flink-streaming-java no longer depends on flink-clients, so it must be declared explicitly -->
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-clients_2.11</artifactId>
            <version>1.11.1</version>
        </dependency>

        <!-- https://mvnrepository.com/artifact/org.apache.flink/flink-scala -->
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-scala_2.11</artifactId>
            <version>1.11.1</version>
        </dependency>

    </dependencies>

</project>
  

Flink WordCount comes in two main forms: batch processing and stream processing.

1. Batch approach:

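import org.apache.flink.api.scala._ // ExecutionEnvironment plus the implicit TypeInformation

object DataSetFlink {

  def main(args: Array[String]): Unit = {

    // Obtain the batch execution environment
    val environment = ExecutionEnvironment.getExecutionEnvironment

    // Read the input file line by line
    val data = environment.readTextFile("D:\\a.txt")

    val result = data
      .flatMap(_.split(" ")) // split each line into words
      .filter(_.nonEmpty)    // drop empty tokens
      .map((_, 1))           // pair each word with a count of 1
      .groupBy(0)            // group by the word (tuple field 0)
      .sum(1)                // sum the counts (tuple field 1)

    // For a DataSet job, print() triggers execution; the program ends once it returns
    result.print()
  }
}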

Result: [screenshot of the console output]
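
To see what the batch pipeline computes without starting Flink, here is a minimal sketch of the same steps on an ordinary Scala collection. The name WordCountSketch and the sample lines are made up for illustration, and groupBy here is the standard-library method rather than Flink's groupBy(0); the sketch only mirrors the logic.

```scala
// Plain-Scala sketch of the WordCount logic; no Flink involved.
// WordCountSketch and the sample input are hypothetical.
object WordCountSketch {

  // Same steps as the Flink job: split, drop empties, pair with 1, group, sum.
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.split(" "))
      .filter(_.nonEmpty)
      .map((_, 1))
      .groupBy(_._1)
      .map { case (word, pairs) => (word, pairs.map(_._2).sum) }

  def main(args: Array[String]): Unit = {
    // The double space yields an empty token, which the filter removes.
    val lines = Seq("hello flink", "hello  world")
    wordCount(lines).foreach(println)
  }
}
```

For this sample input the counts are hello = 2, flink = 1 and world = 1; the order in which a Map prints its entries is not guaranteed.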

 

 

2. Stream approach:

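Prepare a virtual machine and open a listening port on it, for example: nc -l 7777 (some netcat builds need nc -lk 7777 or nc -l -p 7777 instead).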

import org.apache.flink.api.java.utils.ParameterTool
import org.apache.flink.streaming.api.scala._ // StreamExecutionEnvironment plus the implicit TypeInformation

object StreamingFlink {

  def main(args: Array[String]): Unit = {

    // Obtain the stream execution environment
    val environment = StreamExecutionEnvironment.getExecutionEnvironment

    // Wrap the command-line arguments, e.g. --host <hostname> --port 7777
    val tool = ParameterTool.fromArgs(args)
    val host: String = tool.get("host")
    val port: Int = tool.getInt("port")

    // Read lines of text from the socket
    val datastream = environment.socketTextStream(host, port)

    datastream
      .flatMap(_.split(" "))
      .filter(_.nonEmpty)
      .map(a => (a, 1))
      .keyBy(_._1) // stream processing uses keyBy, batch processing uses groupBy; keyBy(0) is deprecated as of Flink 1.11
      .sum(1)
      .print()

    // Start the job; nothing is read from the socket until execute() is called
    environment.execute()
  }
}

 

Result: [screenshot of the console output]
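
Unlike the batch job, which prints one final count per word, the keyed sum(1) in the streaming job emits an updated count every time a word arrives. Below is a minimal plain-Scala sketch of that behaviour. The name RunningCountSketch and the sample words are hypothetical, and where Flink keeps this state per key inside the job, the sketch simply threads a local map through a fold.

```scala
// Plain-Scala sketch of a rolling per-key sum; no Flink involved.
// RunningCountSketch and the sample words are hypothetical.
object RunningCountSketch {

  // For every incoming word, emit (word, count so far).
  def runningCounts(words: Seq[String]): Seq[(String, Int)] = {
    val initial = (Map.empty[String, Int], Vector.empty[(String, Int)])
    val (_, emitted) = words.foldLeft(initial) { case ((counts, out), word) =>
      val updated = counts.getOrElse(word, 0) + 1
      (counts.updated(word, updated), out :+ (word -> updated))
    }
    emitted
  }

  def main(args: Array[String]): Unit =
    // prints (hello,1), (flink,1), (hello,2), one per line
    runningCounts(Seq("hello", "flink", "hello")).foreach(println)
}
```

The second "hello" produces (hello,2) rather than replacing the earlier (hello,1), which is why the same word can appear several times with growing counts in the streaming output.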

 

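P.S. In the streaming output, the number in front of the ">" next to each word is the index of the parallel subtask that printed the record. When the job runs locally, the default parallelism equals the number of CPU cores on the machine, so these numbers range from 1 up to the core count.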
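
As a rough check of which prefix numbers to expect on a given machine, the sketch below prints the number of processors the JVM reports, which is the value a locally started Flink job uses as its default parallelism. It uses only the JDK, and CoreCountSketch is a made-up name.

```scala
// JDK-only sketch; CoreCountSketch is a hypothetical name.
object CoreCountSketch {

  // Number of processors available to this JVM.
  def availableCores: Int = Runtime.getRuntime.availableProcessors()

  def main(args: Array[String]): Unit =
    println(s"available cores: $availableCores")
}
```

To route all output through a single subtask instead, one would typically call setParallelism(1) on the environment or on the sink returned by print().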

 

 

Posted on 2020-11-05 19:11 by RICH-ATONE
