Flink Learning (Java) - 1 - Local Batch and Stream Processing

        I studied Java fundamentals a while back but have forgotten most of it; my day job is still writing SQL every day, and skills fade when you don't use them. So I'm following along with a project to write more code and pick up Flink on the way.

    I'm mainly following the Shangguigu tutorial: 尚硅谷Java版Flink(武老师清华硕士,原IBM-CDL负责人)_哔哩哔哩_bilibili. Thanks to Shangguigu for the free course (a weekend class in Shanghai would be even better), and thanks to Bilibili.

1. First, configure the pom file (copied from the tutorial):

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.shihuo</groupId>
    <artifactId>FlinkTutorial</artifactId>
    <version>1.0-SNAPSHOT</version>

    <dependencies>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-java</artifactId>
            <version>1.10.1</version>
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-streaming-java_2.12</artifactId>
            <version>1.10.1</version>
        </dependency>
    </dependencies>

</project>
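A small optional tweak of my own (not from the tutorial): pull the version number into a Maven <properties> block so the two Flink dependencies always stay on the same version.

    <properties>
        <flink.version>1.10.1</flink.version>
    </properties>

Then use <version>${flink.version}</version> in both dependencies instead of hard-coding 1.10.1.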

2. Code for the batch job

package com.shihuo.wc;


import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.util.Collector;

// Batch word-count example (DataSet API)
public class WorldCount {
    public static void main(String[] args) throws Exception {
        // Create the batch execution environment
        ExecutionEnvironment executionEnvironment = ExecutionEnvironment.getExecutionEnvironment();

        // Read data from a file
        String inputPath = "/Users/wangyuhang/Desktop/FlinkTutorial/src/main/resources/hello.txt";
        DataSet<String> stringDataSet = executionEnvironment.readTextFile(inputPath);

        // Process the data set: split each line on spaces, turn each word into a (word, 1) tuple, then count
        DataSet<Tuple2<String, Integer>> resultSet = stringDataSet.flatMap(new MyFlatMapper())
                .groupBy(0)     // group by the first field (the word)
                .sum(1);        // sum the second field (the count)

        resultSet.print();      // in the DataSet API, print() also triggers execution
    }

    // Custom class implementing the FlatMapFunction interface
    public static class MyFlatMapper implements FlatMapFunction<String, Tuple2<String, Integer>> {
        @Override
        public void flatMap(String value, Collector<Tuple2<String, Integer>> out) throws Exception {
            // Split the line on spaces
            String[] words = value.split(" ");
            // Emit each word as a (word, 1) tuple
            for (String word : words) {
                out.collect(new Tuple2<>(word, 1));
            }
        }
    }
}

Then the result. A pile of log output shows up in the middle, which I first mistook for an error.
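The original screenshot isn't reproduced here, so as a made-up illustration only: assuming hello.txt contains the two lines below (my own sample, not the tutorial's file), the DataSet print() output is the final aggregated tuples, in no particular order.

hello.txt (hypothetical):

hello world
hello flink

Batch output (order may vary):

(world,1)
(flink,1)
(hello,2)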

3. Stream processing example

package com.shihuo.wc;

import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class StreamingWorldCount {
    public static void main(String[] args) throws Exception {
        // Create the stream execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Read data from a file
        String inputPath = "/Users/wangyuhang/Desktop/FlinkTutorial/src/main/resources/hello.txt";
        DataStream<String> stringDataStream = env.readTextFile(inputPath);

        // Transformations on the data stream: reuse the batch job's flat-mapper, key by the word, sum the counts
        SingleOutputStreamOperator<Tuple2<String, Integer>> resultStream = stringDataStream.flatMap(new WorldCount.MyFlatMapper())
                .keyBy(0)
                .sum(1);

        resultStream.print();

        // Required for streaming: print() only registers a sink, execute() actually launches the job
        env.execute();
    }
}

The result looks like this; the number at the front of each line is the subtask (thread) that produced it, using the default parallelism:
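Again using the hypothetical two-line hello.txt from above, the streaming output would look roughly like this: every incoming word emits an updated running count, each line is prefixed with the subtask index that produced it, and keyBy routes the same word to the same subtask (the exact prefix numbers depend on hashing and on the machine's default parallelism).

2> (hello,1)
4> (world,1)
2> (hello,2)
1> (flink,1)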

You can add this to set the parallelism (number of threads):

env.setParallelism(4);
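A minimal sketch of where this call sits in the streaming job above (4 is an arbitrary choice here, not a recommendation). Parallelism can also be overridden per operator, for example to keep the printed output on a single thread:

// Global default parallelism for every operator in this job
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(4);

// Operator-level parallelism overrides the global default, e.g. a single-threaded print sink
resultStream.print().setParallelism(1);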

posted @ 2020-12-27 23:43  活不明白