Big Data Learning (32): Developing Flink in IDEA

I couldn't be bothered to set up a cluster on Linux VMs; tutorials for that are all over the internet. Let's just run a localhost setup directly in IDEA.

Flink is typically used to process unbounded streams: once the service starts, it never shuts down. Let's simulate a wordcount that receives an endless stream of words.

The main contents of the pom file are as follows; the parts to focus on are the Flink dependencies.

  <properties>
    <scala.version>2.11.0</scala.version>
    <scala.binary.version>2.11</scala.binary.version>
    <flink.version>1.6.1</flink.version>
  </properties>

  <repositories>
    <repository>
      <id>scala-tools.org</id>
      <name>Scala-Tools Maven2 Repository</name>
      <url>http://scala-tools.org/repo-releases</url>
    </repository>
  </repositories>

  <pluginRepositories>
    <pluginRepository>
      <id>scala-tools.org</id>
      <name>Scala-Tools Maven2 Repository</name>
      <url>http://scala-tools.org/repo-releases</url>
    </pluginRepository>
  </pluginRepositories>

  <dependencies>
    <dependency>
      <groupId>org.scala-lang</groupId>
      <artifactId>scala-library</artifactId>
      <version>${scala.version}</version>
    </dependency>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>4.4</version>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>org.specs</groupId>
      <artifactId>specs</artifactId>
      <version>1.2.5</version>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>org.apache.flink</groupId>
      <artifactId>flink-scala_${scala.binary.version}</artifactId>
      <version>${flink.version}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.flink</groupId>
      <artifactId>flink-streaming-scala_${scala.binary.version}</artifactId>
      <version>${flink.version}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.flink</groupId>
      <artifactId>flink-table_${scala.binary.version}</artifactId>
      <version>${flink.version}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.flink</groupId>
      <artifactId>flink-clients_${scala.binary.version}</artifactId>
      <version>${flink.version}</version>
    </dependency>
  </dependencies>
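The pom above only lists dependencies. If you build with Maven itself rather than relying on IDEA's Scala plugin to compile, you would typically also need a Scala compiler plugin in the `<build>` section. A minimal sketch using the commonly used scala-maven-plugin (the version number here is illustrative, not taken from the original project):

```xml
  <build>
    <plugins>
      <!-- compiles Scala sources during the Maven build; version is illustrative -->
      <plugin>
        <groupId>net.alchim31.maven</groupId>
        <artifactId>scala-maven-plugin</artifactId>
        <version>3.2.2</version>
        <executions>
          <execution>
            <goals>
              <goal>compile</goal>
              <goal>testCompile</goal>
            </goals>
          </execution>
        </executions>
      </plugin>
    </plugins>
  </build>
```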

Alright, let's write the wordcount in Scala; the logic is very simple. It resembles Spark's RDD API, and the two technologies share a lot of similar syntax.

The steps are roughly as follows:

  • Create the env and set parameters
  • Source
  • Transform
  • Sink
  • Execute

import org.apache.flink.streaming.api.scala._

object WordCount {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // run source and transform with a single parallel task
    env.setParallelism(1)

    // source: read lines from a local socket
    val ds = env.socketTextStream("localhost", 6666)
    // transform: split into words, pair with 1, key by word, keep a running sum
    val rs = ds.flatMap(_.split(" ")).map((_, 1)).keyBy(0).sum(1)

    // sink: print with two parallel subtasks (the 1> / 2> prefixes below)
    rs.print().setParallelism(2)

    env.execute("wordcount")
  }
}
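To see the same flatMap/map/keyed-sum logic without a Flink runtime, here is a plain-Scala sketch. The sample lines and the mutable running-count map are my own stand-ins, not Flink API:

```scala
// Plain-Scala sketch of the pipeline above (no Flink needed): flatMap splits
// lines into words, map pairs each word with 1, and a mutable map keeps a
// running per-key sum, mimicking keyBy(0).sum(1). The sample input is made up.
val lines = List("thank you", "are you ok")
val counts = scala.collection.mutable.Map.empty[String, Int]

lines.flatMap(_.split(" ")).map((_, 1)).foreach { case (word, one) =>
  counts(word) = counts.getOrElse(word, 0) + one  // update the keyed state
  println(s"($word,${counts(word)})")             // emit updated count, like sum(1)
}
```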

Using the Windows cmd prompt, open a listening port with nc, send some data, then start the program.

C:\Users\localhost>nc -L -t -p 6666
thank you
are you ok
hello
thank you
thank you very much
hello
thank you
thank you very much
how are you indian mi fans

Let's see what the two output threads printed to the console.

2> (you,1)
1> (thank,1)
1> (are,1)
2> (you,2)
1> (ok,1)
2> (hello,1)
2> (you,3)
1> (thank,2)
2> (you,4)
2> (much,1)
1> (thank,3)
1> (very,1)
1> (hello,2)
2> (thank,4)
1> (you,5)
2> (thank,5)
2> (very,2)
1> (you,6)
1> (much,2)
2> (how,1)
1> (are,2)
1> (indian,1)
2> (you,7)
1> (fans,1)
2> (mi,1)

From the output, the job really does process each record as it arrives, and each word carries keyed state, so it knows what the previous count was. But the words within a single input line have no fixed output order; the two print subtasks run in parallel.
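Why do the counts stay consistent even with two parallel print subtasks? Because keyBy routes every record with the same key to the same subtask. The sketch below is a simplified illustration of that idea; Flink's real routing uses murmur hashing over key groups, not a plain hashCode modulo:

```scala
// Simplified illustration (NOT Flink's actual key-group algorithm): records
// with the same key always land on the same parallel subtask, so each key's
// running count lives in exactly one place, even though output interleaves.
def subtaskFor(key: String, parallelism: Int): Int =
  math.abs(key.hashCode % parallelism)

val words = List("thank", "you", "thank", "you")
// the same word is always printed with the same subtask index
words.foreach(w => println(s"$w -> subtask ${subtaskFor(w, 2)}"))
```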

This is just a basic Flink application; there is much more to learn, and the official website has excellent documentation. If you genuinely need Flink, it's worth studying in depth.

posted on 2021-11-30 10:47 by 别样风景天
