Flink Common Operators: Implementations in Scala and Java

map: Scala implementation

def main(args: Array[String]): Unit = {
  val env = ExecutionEnvironment.getExecutionEnvironment
  mapFunction(env)
}

def mapFunction(env: ExecutionEnvironment):Unit = {
  val data = env.fromCollection(List(1,2,3,4,5))
  data.map((x:Int)=>x+1).print()
}

Output:
2
3
4
5
6

Scala syntax shorthand (all four forms are equivalent):

data.map((x:Int)=>x+1).print()
println("----")
data.map((x)=>x+1).print()
println("----")
data.map(x=>x+1).print()
println("----")
data.map(_+1).print()
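For readers coming from Java, a plain-Java sketch (using `java.util.stream`, not Flink) of the same shorthand idea: the verbose anonymous-class form and the short lambda form express the same `x -> x + 1` mapping. The class and method names here are illustrative only.

```java
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

public class MapForms {
    // verbose form: an explicit anonymous Function, like the Flink MapFunction below
    static List<Integer> viaAnonymousClass(List<Integer> data) {
        return data.stream().map(new Function<Integer, Integer>() {
            @Override
            public Integer apply(Integer x) {
                return x + 1;
            }
        }).collect(Collectors.toList());
    }

    // short form: a lambda, analogous to Scala's _ + 1
    static List<Integer> viaLambda(List<Integer> data) {
        return data.stream().map(x -> x + 1).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Integer> data = List.of(1, 2, 3, 4, 5);
        System.out.println(viaAnonymousClass(data)); // [2, 3, 4, 5, 6]
        System.out.println(viaLambda(data));         // [2, 3, 4, 5, 6]
    }
}
```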

map: Java implementation

public static void main(String[] args) throws Exception {
    ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
    mapFunction(env);
}

public static void mapFunction(ExecutionEnvironment env) throws Exception {
    List<Integer> list = new ArrayList<Integer>();
    for (int i = 1; i <= 5; i++) {
        list.add(i);
    }
    env.fromCollection(list).map(new MapFunction<Integer, Integer>() {
        @Override
        public Integer map(Integer input) {
            return input + 1;
        }
    }).print();
}
    
Output:
2
3
4
5
6

filter: Scala implementation

The filter operator keeps only the elements that satisfy the given predicate.

def main(args: Array[String]): Unit = {
  val env = ExecutionEnvironment.getExecutionEnvironment
  filterFunction(env)
}

def filterFunction(env: ExecutionEnvironment):Unit = {
  env.fromCollection(List(1,2,3,4,5))
    .map(_+1)
    .filter(_>3)
    .print()
}

Output:
4
5
6

filter: Java implementation

public static void main(String[] args) throws Exception{
    ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
    filterFunction(env);
}

public static void filterFunction(ExecutionEnvironment env) throws Exception {
    List<Integer> list = new ArrayList<Integer>();
    for (int i = 1; i <= 5; i++) {
        list.add(i);
    }
    env.fromCollection(list).map(new MapFunction<Integer, Integer>() {
        @Override
        public Integer map(Integer input) throws Exception {
            return input + 1;
        }
    }).filter(new FilterFunction<Integer>() {
        @Override
        public boolean filter(Integer input) throws Exception{
            return input > 3;
        }
    }).print();
}

Output:
4
5
6

mapPartition: Scala implementation

mapPartition invokes the user function once per partition, instead of once per element as map does.

import scala.util.Random
// A utility object that simulates acquiring and returning database connections
object DBUtils {
  def getConnection() = {
    // simulate acquiring a database connection
    new Random().nextInt(10)
  }

  def returnConnection(connection: String) = {
    // simulate writing the data and returning the connection
  }
}

If map were used here, every element would request its own database connection; requests that frequent can bring the database down. mapPartition instead requests one connection per partition, and with a suitable parallelism this greatly reduces the load on the database.

def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
    //filterFunction(env)
    mapPartitionFunction(env)
  }

  def mapPartitionFunction(env:ExecutionEnvironment):Unit = {
    val students = new ListBuffer[String]
    for(i <-1 to 100) {
      students.append("student " + i)
    }

    val data = env.fromCollection(students).setParallelism(4)

    data.mapPartition(x=>{
      val connection = DBUtils.getConnection()
      println(connection + "......")
      x
    }).print();

//    data.map(x=>{
//      //every element needs its own connection before it can be written
//      val connection = DBUtils.getConnection() + "...."
//
//      //write the data to the database
//      DBUtils.returnConnection(connection)
//    }).print();

  }

With this setup, map would request a connection 100 times while mapPartition requests only 4, greatly reducing the pressure on the database.
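The 100-vs-4 count can be checked with a plain-Java sketch (no Flink): a hypothetical partitioning of 100 elements into 4 chunks, counting how many "connections" each style would open. Class and method names here are illustrative.

```java
import java.util.ArrayList;
import java.util.List;

public class ConnectionCount {
    // map-style: one "connection" per element
    static int mapStyleConnections(List<String> elements) {
        int connections = 0;
        for (String ignored : elements) {
            connections++; // each element would call DBUtils.getConnection()
        }
        return connections;
    }

    // mapPartition-style: one "connection" per partition
    static int mapPartitionStyleConnections(List<String> elements, int partitions) {
        int connections = 0;
        int size = (elements.size() + partitions - 1) / partitions; // partition size
        for (int start = 0; start < elements.size(); start += size) {
            connections++; // one connection serves the whole partition
        }
        return connections;
    }

    public static void main(String[] args) {
        List<String> students = new ArrayList<>();
        for (int i = 1; i <= 100; i++) {
            students.add("student " + i);
        }
        System.out.println("map: " + mapStyleConnections(students));                      // map: 100
        System.out.println("mapPartition: " + mapPartitionStyleConnections(students, 4)); // mapPartition: 4
    }
}
```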

mapPartition: Java implementation

public static void main(String[] args) throws Exception{
    ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

    mapPartition(env);
}

public static void mapPartition(ExecutionEnvironment env) throws Exception {
    List<String> list = new ArrayList<String>();

    for (int i = 1; i <=100 ; i++) {
        list.add("Student " + i);
    }
    DataSource<String> data = env.fromCollection(list);

    data.map(new MapFunction<String, String>(){
        @Override
        public String map(String input) throws Exception {
            String connection = DBUtils.getConnection() + "";
            System.out.println("connection: [ " + connection + " ]");
            DBUtils.returnConnection(connection);
            return input;
        }
    }).print();
}

Now the same job with mapPartition:

public static void main(String[] args) throws Exception{
    ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

    mapPartition(env);
}

public static void mapPartition(ExecutionEnvironment env) throws Exception {
    List<String> list = new ArrayList<String>();

    for (int i = 1; i <=100 ; i++) {
        list.add("Student " + i);
    }
    DataSource<String> data = env.fromCollection(list).setParallelism(4);

    data.mapPartition(new MapPartitionFunction<String, String>() {
        @Override
        public void mapPartition(Iterable<String> inputs, Collector<String> collector) {
            // one connection per partition; the elements are deliberately not
            // forwarded here, so print() shows only the connection log below
            String connection =  DBUtils.getConnection() + "";
            System.out.println("connect: [ " + connection + " ]");
            DBUtils.returnConnection(connection);
        }
    }).print();
}

Output:
connect: [ 9 ]
connect: [ 2 ]
connect: [ 2 ]
connect: [ 3 ]

Only 4 connections are created.

first(n): Scala implementation

def main(args: Array[String]): Unit = {
  val env = ExecutionEnvironment.getExecutionEnvironment
  firstFunction(env)
}

def firstFunction(env: ExecutionEnvironment) : Unit = {
  val info = ListBuffer[(Int,String)]()
  info.append((1,"Hadoop"))
  info.append((1,"Spark"))
  info.append((1,"Flink"))
  info.append((2,"Java"))
  info.append((2,"Spring"))
  info.append((3,"Linux"))
  info.append((4,"VUE"))

  val data = env.fromCollection(info)

  data.first(3).print()
Output:
(1,Hadoop)
(1,Spark)
(1,Flink)

data.groupBy(0).first(2).print()
Output:
(3,Linux)
(1,Hadoop)
(1,Spark)
(2,Java)
(2,Spring)
(4,VUE)

data.groupBy(0).sortGroup(1,Order.DESCENDING).first(2).print();
Output:
(3,Linux)
(1,Spark)
(1,Hadoop)
(2,Spring)
(2,Java)
(4,VUE)

data.groupBy(0).sortGroup(1,Order.ASCENDING).first(2).print();
Output:
(3,Linux)
(1,Flink)
(1,Hadoop)
(2,Java)
(2,Spring)
(4,VUE)

}

first: Java implementation

public static void main(String[] args) throws Exception{
    ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

    firstFunction(env);
}

public static void firstFunction(ExecutionEnvironment env) throws Exception {
    List<Tuple2<Integer, String>> info = new ArrayList<Tuple2<Integer, String>>();
    info.add(new Tuple2(1,"Hadoop"));
    info.add(new Tuple2(1,"Spark"));
    info.add(new Tuple2(1,"Flink"));
    info.add(new Tuple2(2,"Java"));
    info.add(new Tuple2(2,"Spring"));
    info.add(new Tuple2(3,"Linux"));
    info.add(new Tuple2(4,"VUE"));

    DataSource<Tuple2<Integer,String>> data = env.fromCollection(info);

    data.first(3).print();
    System.out.println("~~~~~~~");
    data.groupBy(0).first(2).print();
    System.out.println("~~~~~~~");
    data.groupBy(0).sortGroup(1, Order.DESCENDING).first(2).print();
    System.out.println("~~~~~~~");
    data.groupBy(0).sortGroup(1, Order.ASCENDING).first(2).print();
}

flatMap: Scala implementation

flatMap: takes one element and produces zero, one, or more elements.

def main(args: Array[String]): Unit = {
  val env = ExecutionEnvironment.getExecutionEnvironment

  flatMapFunction(env)
}

def flatMapFunction(env: ExecutionEnvironment) : Unit = {
  val info = ListBuffer[String]()
  info.append("hadoop,spark")
  info.append("flink,spark")
  info.append("hadoop,flink,spark")

  env.fromCollection(info).flatMap(_.split(",")).print()
  
Output:
hadoop
spark
flink
spark
hadoop
flink
spark
  env.fromCollection(info).flatMap(_.split(",")).map((_,1)).groupBy(0).sum(1).print()
Output:
(hadoop,2)
(flink,2)
(spark,3)

}

flatMap: Java implementation

public static void main(String[] args) throws Exception{
    ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

    flatMapFunction(env);
}

public static void flatMapFunction(ExecutionEnvironment env) throws Exception {
    List<String> list = new ArrayList<String>();
    list.add("spark,hadoop,flink");
    list.add("sqoop,flink,spark");
    list.add("strom,flink");

    DataSource<String> data = env.fromCollection(list);

    data.flatMap(new FlatMapFunction<String, String>() {
        @Override
        public void flatMap(String input, Collector<String> collector) throws Exception{
            String splits[] = input.split(",");
            for (String split : splits){
                collector.collect(split);
            }
        }
    }).map(new MapFunction<String, Tuple2<String,Integer>>() {
        @Override
        public Tuple2<String, Integer> map(String s) throws Exception {
            return new Tuple2<String, Integer>(s, 1);
        }
    }).groupBy(0).sum(1).print();
}
Output:
(hadoop,1)
(flink,3)
(sqoop,1)
(spark,2)
(strom,1)
Note: this pattern is worth writing out several times until it is second nature.

distinct: Scala implementation

def main(args: Array[String]): Unit = {
  val env = ExecutionEnvironment.getExecutionEnvironment

  distinctFunction(env)
}

def distinctFunction(env: ExecutionEnvironment) :Unit = {
  val info = ListBuffer[String]()
  info.append("hadoop,spark")
  info.append("flink,spark")
  info.append("hadoop,flink,spark")

  env.fromCollection(info).flatMap(_.split(",")).distinct().print()
}

Output:
hadoop
flink
spark

distinct: Java implementation

public static void main(String[] args) throws Exception{
    ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

    distinctFunction(env);
}

public static void distinctFunction(ExecutionEnvironment env) throws Exception {
    List<String> list = new ArrayList<String>();
    list.add("spark,hadoop,flink");
    list.add("sqoop,flink,spark");
    list.add("strom,flink");

    DataSource<String> data = env.fromCollection(list);
    data.flatMap(new FlatMapFunction<String, String>(){
        @Override
        public void flatMap(String input, Collector<String> collector) throws Exception {
            String[] splits = input.split(",");
            for (String split : splits){
                collector.collect(split);
            }
        }
    }).distinct().print();
}
Output:
hadoop
flink
sqoop
spark
strom

join: Scala implementation

val result = input1.join(input2).where(0).equalTo(1)
Here 0 is a field position in the first input and 1 a field position in the second input: field 0 of input1 is joined against field 1 of input2.
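The field-index semantics can be sketched in plain Java (no Flink), with hypothetical data chosen only to show which fields are compared under `where(0).equalTo(1)`:

```java
import java.util.ArrayList;
import java.util.List;

public class JoinFields {
    // Joins field 0 of each left tuple against field 1 of each right tuple,
    // i.e. the semantics of where(0).equalTo(1).
    static List<String> joinWhere0EqualTo1(List<Object[]> input1, List<Object[]> input2) {
        List<String> joined = new ArrayList<>();
        for (Object[] l : input1) {
            for (Object[] r : input2) {
                if (l[0].equals(r[1])) { // left field 0 vs right field 1
                    joined.add("(" + l[0] + "," + l[1] + "," + r[0] + ")");
                }
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        List<Object[]> input1 = List.of(new Object[]{1, "a"}, new Object[]{2, "b"}); // field 0 is the key
        List<Object[]> input2 = List.of(new Object[]{"x", 1}, new Object[]{"y", 3}); // field 1 is the key
        System.out.println(joinWhere0EqualTo1(input1, input2)); // [(1,a,x)]
    }
}
```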

def main(args: Array[String]): Unit = {
  val env = ExecutionEnvironment.getExecutionEnvironment

  joinFunction(env)
}

def joinFunction(env: ExecutionEnvironment):Unit = {
  val info1 = ListBuffer[(Int,String)]()  // (id, name)
  info1.append((1,"张三"))
  info1.append((2,"李四"))
  info1.append((3,"王五"))
  info1.append((4,"小强"))

  val info2 = ListBuffer[(Int,String)]()  // (id, city)
  info2.append((1,"北京"))
  info2.append((2,"上海"))
  info2.append((3,"成都"))
  info2.append((5,"武汉"))

  val data1 = env.fromCollection(info1)
  val data2 = env.fromCollection(info2)

  data1.join(data2).where(0).equalTo(0).apply((first,second)=>{
    (first._1,first._2,second._2)
  }).print();
}
Output:
(3,王五,成都)
(1,张三,北京)
(2,李四,上海)

join: Java implementation

public static void main(String[] args) throws Exception{
    ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

    joinFunction(env);
}

public static void joinFunction(ExecutionEnvironment env) throws Exception{
    List <Tuple2<Integer, String>> info1 = new ArrayList<Tuple2<Integer, String>>();
    info1.add(new Tuple2(1,"张三"));  // (id, name)
    info1.add(new Tuple2(2,"李四"));
    info1.add(new Tuple2(3,"王五"));
    info1.add(new Tuple2(4,"小强"));

    List <Tuple2<Integer, String>> info2 = new ArrayList<Tuple2<Integer, String>>();
    info2.add(new Tuple2(1,"北京"));  // (id, city)
    info2.add(new Tuple2(2,"上海"));
    info2.add(new Tuple2(3,"成都"));
    info2.add(new Tuple2(5,"杭州"));

    DataSource<Tuple2<Integer,String>> data1 = env.fromCollection(info1);
    DataSource<Tuple2<Integer,String>> data2 = env.fromCollection(info2);

    data1.join(data2).where(0).equalTo(0).with(new JoinFunction<Tuple2<Integer,String>, Tuple2<Integer,String>, Tuple3<Integer,String,String>>(){
        @Override
        public Tuple3<Integer, String, String> join(Tuple2<Integer, String> first, Tuple2<Integer, String> second) throws Exception{
            return new Tuple3<Integer, String, String>(first.f0, first.f1, second.f1);
        }
    }).print();
}

Output:
(3,王五,成都)
(1,张三,北京)
(2,李四,上海)

outerJoin: Scala implementation

  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment

    outjoinFunction(env)
  }

  def outjoinFunction(env: ExecutionEnvironment):Unit = {
    val info1 = ListBuffer[(Int,String)]()  // (id, name)
    info1.append((1,"张三"))
    info1.append((2,"李四"))
    info1.append((3,"王五"))
    info1.append((4,"小强"))

    val info2 = ListBuffer[(Int,String)]()  // (id, city)
    info2.append((1,"北京"))
    info2.append((2,"上海"))
    info2.append((3,"成都"))
    info2.append((5,"武汉"))

    val data1 = env.fromCollection(info1)
    val data2 = env.fromCollection(info2)

    data1.leftOuterJoin(data2).where(0).equalTo(0).apply((first,second)=> {
      if (second == null) {
        (first._1, first._2, "null")
      } else {
        (first._1, first._2, second._2)
      }
    }).print();
}
Output:
(3,王五,成都)
(1,张三,北京)
(2,李四,上海)
(4,小强,null)

outerJoin: Java implementation

public static void main(String[] args) throws Exception{
    ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

    outerjoinFunction(env);
}

public static void outerjoinFunction(ExecutionEnvironment env) throws Exception{
    List <Tuple2<Integer, String>> info1 = new ArrayList<Tuple2<Integer, String>>();
    info1.add(new Tuple2(1,"张三"));  // (id, name)
    info1.add(new Tuple2(2,"李四"));
    info1.add(new Tuple2(3,"王五"));
    info1.add(new Tuple2(4,"小强"));

    List <Tuple2<Integer, String>> info2 = new ArrayList<Tuple2<Integer, String>>();
    info2.add(new Tuple2(1,"北京"));  // (id, city)
    info2.add(new Tuple2(2,"上海"));
    info2.add(new Tuple2(3,"成都"));
    info2.add(new Tuple2(5,"杭州"));

    DataSource<Tuple2<Integer,String>> data1 = env.fromCollection(info1);
    DataSource<Tuple2<Integer,String>> data2 = env.fromCollection(info2);

    data1.leftOuterJoin(data2).where(0).equalTo(0).with(new JoinFunction<Tuple2<Integer,String>, Tuple2<Integer,String>, Tuple3<Integer,String,String>>(){
        @Override
        public Tuple3<Integer, String, String> join(Tuple2<Integer, String> first, Tuple2<Integer, String> second) throws Exception{
            if (second == null) {
                return new Tuple3<Integer, String, String>(first.f0,first.f1,"null");
            } else {
                return new Tuple3<Integer, String, String>(first.f0, first.f1, second.f1);
            }
        }
    }).print();
}
Output:
(3,王五,成都)
(1,张三,北京)
(2,李四,上海)
(4,小强,null)

cross: Scala implementation

def main(args: Array[String]): Unit = {
  val env = ExecutionEnvironment.getExecutionEnvironment

  crossFunction(env)
}

def crossFunction(env: ExecutionEnvironment):Unit = {
  val info1 = ListBuffer[String]()
  info1.append("长城")
  info1.append("长安")

  val info2 = ListBuffer[Int]()
  info2.append(1)
  info2.append(2)
  info2.append(3)

  val data1 = env.fromCollection(info1)
  val data2 = env.fromCollection(info2)

  data1.cross(data2).print()
}
Output:
(长城,1)
(长城,2)
(长城,3)
(长安,1)
(长安,2)
(长安,3)

cross: Java implementation

public static void main(String[] args) throws Exception{
    ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

    crossFunction(env);
}

public static void crossFunction(ExecutionEnvironment env) throws Exception{
    List<String> info1 = new ArrayList<String>();
    info1.add("张三");  
    info1.add("李四");

    List<Integer> info2 = new ArrayList<Integer>();
    info2.add(1);
    info2.add(2);
    info2.add(3);

    DataSource<String> data1 = env.fromCollection(info1);
    DataSource<Integer> data2 = env.fromCollection(info2);

    data1.cross(data2).print();
}
Output:
(张三,1)
(张三,2)
(张三,3)
(李四,1)
(李四,2)
(李四,3)

sink: Scala implementation

def main(args: Array[String]): Unit = {
  val env = ExecutionEnvironment.getExecutionEnvironment
  val data = 1.to(10)
  val text = env.fromCollection(data)
  val path = "/Users/zhiyingliu/tmp/flink/ouput"
  text.writeAsText(path,WriteMode.OVERWRITE).setParallelism(3)
  env.execute("sinkTest")
}

sink: Java implementation

public static void main(String[] args) throws Exception{
    ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

    List<Integer> info = new ArrayList<Integer>();

    for (int i = 0; i < 10; i++) {
        info.add(i);
    }

    DataSource<Integer> data = env.fromCollection(info);

    String filePath = "/Users/zhiyingliu/tmp/flink/ouput-java/";
    data.writeAsText(filePath, FileSystem.WriteMode.OVERWRITE);
    env.execute("java-sink");
}
posted @ 2020-08-31 11:38  水木青楓