Exercise: Flink Interview Question
Problem description:
A mobile app's user click log has the following columns: time, user ID, product code, clicked feature code, email, province/city, elapsed time, and parameter detail. Use Flink batch processing to clean the data and compute windowed statistics. Sample data:
(omitted here; available via the download link under Environment below)
Notes:
- Columns are comma-separated; the detail parameter is JSON
- Some rows contain dirty data
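For illustration only, a row in this format might look like the following (hypothetical values; the real sample is in the linked archive). Note that the JSON detail itself contains commas, so a naive comma split yields more than eight fields and the cleaning code must re-join the tail:

2021-05-01 10:00:00,u1001,P01,F03,foo@bar.com,浙江杭州,120,{"isman":"1","level":"2","ts":"1619834400","ver":"1"}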
Environment:
- The machine has internet access, and an IDEA development environment is on the desktop; you must add the Flink dependencies yourself (Flink 1.11.1 or later is required)
- Sample data is available at:
Link: https://pan.baidu.com/s/18n4PjyXHsrXwWz6rdBzKIw?pwd=ct6c
Extraction code: ct6c
- Note: avoid restarting the machine during the test, or your answers may be reset
- Time limit: 2 hours
pom
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>org.example</groupId>
    <artifactId>FlinkTrue</artifactId>
    <version>1.0-SNAPSHOT</version>

    <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <maven.compiler.source>1.8</maven.compiler.source>
        <maven.compiler.target>1.8</maven.compiler.target>
        <flink.version>1.13.0</flink.version>
        <hadoop.version>3.1.3</hadoop.version>
        <!-- Note: scala.version (2.12) is what is actually used as the Scala binary
             suffix below; scala.binary.version (2.11) is referenced only by
             flink-connector-filesystem. -->
        <scala.version>2.12</scala.version>
        <scala.binary.version>2.11</scala.binary.version>
    </properties>

    <dependencies>
        <!-- Flink core, Java API, streaming, clients -->
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-core</artifactId>
            <version>${flink.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-java</artifactId>
            <version>${flink.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-streaming-java_${scala.version}</artifactId>
            <version>${flink.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-clients_${scala.version}</artifactId>
            <version>${flink.version}</version>
        </dependency>
        <dependency>
            <groupId>org.projectlombok</groupId>
            <artifactId>lombok</artifactId>
            <version>1.16.22</version>
        </dependency>
        <!-- Jedis -->
        <dependency>
            <groupId>redis.clients</groupId>
            <artifactId>jedis</artifactId>
            <version>3.3.0</version>
        </dependency>
        <!-- fastjson -->
        <dependency>
            <groupId>com.alibaba</groupId>
            <artifactId>fastjson</artifactId>
            <version>1.2.60</version>
        </dependency>
        <!-- Flink SQL / Table API -->
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-table-api-java-bridge_${scala.version}</artifactId>
            <version>${flink.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-table-planner-blink_${scala.version}</artifactId>
            <version>${flink.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-table-common</artifactId>
            <version>${flink.version}</version>
        </dependency>
        <!-- CEP -->
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-cep-scala_${scala.version}</artifactId>
            <version>${flink.version}</version>
        </dependency>
        <!-- CSV -->
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-csv</artifactId>
            <version>${flink.version}</version>
        </dependency>
        <!-- Kafka sink -->
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-connector-kafka_${scala.version}</artifactId>
            <version>${flink.version}</version>
        </dependency>
        <!-- HDFS sink -->
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-connector-filesystem_${scala.binary.version}</artifactId>
            <version>1.4.2</version>
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-hadoop-compatibility_${scala.version}</artifactId>
            <version>${flink.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>${hadoop.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-common</artifactId>
            <version>${hadoop.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-hdfs</artifactId>
            <version>${hadoop.version}</version>
        </dependency>
        <!-- MySQL sink -->
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-jdbc_${scala.version}</artifactId>
            <version>1.9.2</version>
        </dependency>
        <dependency>
            <groupId>mysql</groupId>
            <artifactId>mysql-connector-java</artifactId>
            <version>5.1.38</version>
        </dependency>
        <!-- HBase sink -->
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-hbase_${scala.version}</artifactId>
            <version>1.8.1</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hbase</groupId>
            <artifactId>hbase-client</artifactId>
            <version>2.4.3</version>
        </dependency>
        <!-- ClickHouse JDBC sink (declared once; the original pom listed
             flink-connector-jdbc and clickhouse-jdbc twice each, which Maven
             flags as duplicate declarations) -->
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-connector-jdbc_${scala.version}</artifactId>
            <version>${flink.version}</version>
        </dependency>
        <dependency>
            <groupId>ru.yandex.clickhouse</groupId>
            <artifactId>clickhouse-jdbc</artifactId>
            <version>0.2.4</version>
            <exclusions>
                <exclusion>
                    <groupId>com.fasterxml.jackson.core</groupId>
                    <artifactId>jackson-databind</artifactId>
                </exclusion>
                <exclusion>
                    <groupId>com.fasterxml.jackson.core</groupId>
                    <artifactId>jackson-core</artifactId>
                </exclusion>
            </exclusions>
        </dependency>
        <!-- Guava: core libraries widely used across Google's Java projects -->
        <dependency>
            <groupId>com.google.guava</groupId>
            <artifactId>guava</artifactId>
            <version>30.1.1-jre</version>
        </dependency>
        <!-- Redis connector (Bahir) -->
        <dependency>
            <groupId>org.apache.bahir</groupId>
            <artifactId>flink-connector-redis_2.11</artifactId>
            <version>1.0</version>
        </dependency>
        <!-- Elasticsearch sink -->
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-connector-elasticsearch7_${scala.version}</artifactId>
            <version>1.10.1</version>
        </dependency>
    </dependencies>
</project>
bean
import lombok.AllArgsConstructor;
import lombok.Data;
import lombok.NoArgsConstructor;

@Data
@AllArgsConstructor
@NoArgsConstructor
public class Log {
    // event time (epoch millis)
    private Long time;
    // user ID
    private String uid;
    // product code
    private String sid;
    // clicked feature code
    private String exeid;
    // email
    private String email;
    // province/city
    private String province;
    // elapsed time
    private String spend;
    // parameter detail (JSON)
    private String detail;
}
import lombok.AllArgsConstructor;
import lombok.NoArgsConstructor;

// @lombok.Data is fully qualified to avoid clashing with this class's own name.
@lombok.Data
@NoArgsConstructor
@AllArgsConstructor
public class Data {
    private Long time;
    // private Integer isman;
    private String level;
    // private Integer ts;
    // private Integer ver;
    private Integer count;
}
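As a minimal sketch of the JSON handling (the detail payload shape here is assumed from the commented-out fields above, not taken from the real data), fastjson maps the level key onto Data like this:

import com.alibaba.fastjson.JSON;
import com.alibaba.fastjson.JSONObject;

public class DetailParseDemo {
    public static void main(String[] args) {
        // Hypothetical detail string; real rows come from simple.etl.csv.
        String detail = "{\"isman\":\"1\",\"level\":\"2\",\"ts\":\"1619834400\",\"ver\":\"1\"}";
        JSONObject obj = JSON.parseObject(detail);
        // count starts at 0; the window aggregate fills it in later.
        Data d = new Data(System.currentTimeMillis(), obj.getString("level"), 0);
        System.out.println(d); // lombok @Data generates toString()
    }
}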
test
import com.alibaba.fastjson.JSON;
import com.alibaba.fastjson.JSONObject;
import org.apache.flink.api.common.eventtime.SerializableTimestampAssigner;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.functions.AggregateFunction;
import org.apache.flink.api.common.functions.FilterFunction;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.windowing.ProcessWindowFunction;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;

import java.text.SimpleDateFormat;
import java.time.Duration;

public class FlinkTrue {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);
        // Event time is the default since Flink 1.12, so the deprecated
        // setStreamTimeCharacteristic(TimeCharacteristic.EventTime) is not needed
        // (requirement 5 forbids deprecated methods).

        // DataStreamSource<String> data = env.readTextFile(FlinkTrue.class.getClassLoader().getResource("simple.etl.csv").getPath());
        DataStreamSource<String> data = env.readTextFile(
                "C:\\Users\\liuyuan\\Desktop\\实训3\\FlinkTrue\\src\\main\\resources\\simple.etl.csv");

        SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");

        // Parse each CSV line into a Log. The JSON detail itself contains commas,
        // so the four trailing fields are re-joined. Rows with too few columns
        // yield an empty Log whose time stays null.
        SingleOutputStreamOperator<Log> parsed = data.map(new MapFunction<String, Log>() {
            @Override
            public Log map(String s) throws Exception {
                String[] split = s.split(",");
                Log log = new Log();
                if (split.length >= 11) {
                    log = new Log(sdf.parse(split[0]).getTime(), split[1], split[2], split[3],
                            split[4], split[5], split[6],
                            split[7] + "," + split[8] + "," + split[9] + "," + split[10]);
                }
                return log;
            }
        });

        // Drop dirty rows, i.e. those that failed parsing above (requirement 1).
        SingleOutputStreamOperator<Log> filtered = parsed.filter(new FilterFunction<Log>() {
            @Override
            public boolean filter(Log log) throws Exception {
                return log.getTime() != null;
            }
        });

        // Extract the level field from the JSON detail.
        SingleOutputStreamOperator<Data> mapData = filtered.map(new MapFunction<Log, Data>() {
            @Override
            public Data map(Log log) throws Exception {
                String detail = log.getDetail();
                // Strip stray quotes left over from the CSV split; fastjson
                // tolerates unquoted keys and values.
                String s = detail.replaceAll("[\"]", "");
                JSONObject jsonObject = JSON.parseObject(s);
                return new Data(log.getTime(), String.valueOf(jsonObject.get("level")), 0);
            }
        });

        // 5s bounded-out-of-orderness watermark (requirement 4), using the
        // non-deprecated WatermarkStrategy API instead of the deprecated
        // BoundedOutOfOrdernessTimestampExtractor.
        SingleOutputStreamOperator<Data> stream = mapData.assignTimestampsAndWatermarks(
                WatermarkStrategy.<Data>forBoundedOutOfOrderness(Duration.ofSeconds(5))
                        .withTimestampAssigner(new SerializableTimestampAssigner<Data>() {
                            @Override
                            public long extractTimestamp(Data data, long l) {
                                return data.getTime();
                            }
                        }));

        // 3s tumbling event-time window per level (requirements 2 and 3),
        // counting occurrences with an incremental AggregateFunction.
        SingleOutputStreamOperator<String> aggregate = stream.keyBy(d -> d.getLevel())
                .window(TumblingEventTimeWindows.of(Time.seconds(3)))
                .aggregate(new AggregateFunction<Data, Integer, Integer>() {
                    @Override
                    public Integer createAccumulator() {
                        return 0;
                    }

                    @Override
                    public Integer add(Data data, Integer acc) {
                        return acc + 1;
                    }

                    @Override
                    public Integer getResult(Integer acc) {
                        return acc;
                    }

                    @Override
                    public Integer merge(Integer acc, Integer acc1) {
                        // Must return the combined count, not null, so that
                        // merging windows would also work correctly.
                        return acc + acc1;
                    }
                }, new ProcessWindowFunction<Integer, String, String, TimeWindow>() {
                    @Override
                    public void process(String key, Context context, Iterable<Integer> iterable,
                                        Collector<String> collector) throws Exception {
                        long start = context.window().getStart();
                        long end = context.window().getEnd();
                        Integer count = iterable.iterator().next();
                        collector.collect("timewindow: " + start + "-" + end
                                + " key: " + key + " count: " + count);
                    }
                });

        aggregate.print();

        // Requirements:
        // 1. Discard or repair data as necessary
        // 2. Use column 1 (the timestamp) of the sample data as event time
        // 3. Build a 3s window and count occurrences of each level in the detail parameter
        // 4. Set a 5s watermark to filter out some late data
        // 5. Deprecated methods are not allowed
        // 6. Implement in Java
        // Console output: window time + level + count, e.g. "timewindow: key: count:"

        env.execute();
    }
}
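The ProcessWindowFunction above prints raw epoch millis for the window bounds. If human-readable timestamps are preferred, a minimal sketch (the output format is an assumption; the exam only specifies "timewindow: key: count:"):

import java.text.SimpleDateFormat;
import java.util.Date;

public class WindowBoundsFormat {
    // Formats a window's [start, end) bounds as "yyyy-MM-dd HH:mm:ss" strings.
    static String format(long startMillis, long endMillis) {
        SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
        return sdf.format(new Date(startMillis)) + " ~ " + sdf.format(new Date(endMillis));
    }

    public static void main(String[] args) {
        // Hypothetical 3s window starting at epoch millis 1_600_000_000_000L.
        System.out.println(format(1_600_000_000_000L, 1_600_000_003_000L));
    }
}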
Evaluation criteria:

| Scoring area | A | B | C |
| --- | --- | --- | --- |
| Data discarding & JSON handling | Correct result, clean code | Correct result, code has minor flaws | Not met |
| Flink window usage | Correct result, clean code | Correct result, code has minor flaws | Not met |
| Flink watermark usage | Correct result, clean code | Correct result, code has minor flaws | Not met |