Flink Real-Time Data Warehouse

Overall Architecture

User Behavior Data Collection Platform

Same as the offline data warehouse: [Hive offline data warehouse]

Business Data Collection Platform

Largely the same as the offline data warehouse: [Hive offline data warehouse]

Differences:

1) Maxwell synchronizes all tables; no filtering is needed

2) All tables are synchronized into a single Kafka topic

Note: if the offline data warehouse is to reuse this collection platform, the Flume timestamp interceptor must also add the table name to the event headers so that each table can be written to its own HDFS directory.
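A hedged sketch of such an interceptor, assuming the Maxwell JSON carries `table` and `ts` (seconds) fields; the class name, header keys, and the corresponding %{tableName} escape in the HDFS sink path are illustrative assumptions, not the project's actual code:

import com.alibaba.fastjson.JSONObject;
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.Map;

public class TimestampAndTableInterceptor implements Interceptor {

    @Override
    public void initialize() { }

    @Override
    public Event intercept(Event event) {
        String body = new String(event.getBody(), StandardCharsets.UTF_8);
        JSONObject json = JSONObject.parseObject(body);
        Map<String, String> headers = event.getHeaders();
        //Maxwell's ts is in seconds; the HDFS sink's "timestamp" header expects milliseconds
        headers.put("timestamp", String.valueOf(json.getLong("ts") * 1000));
        //table name header, so the HDFS sink path can contain %{tableName} (assumed header key)
        headers.put("tableName", json.getString("table"));
        return event;
    }

    @Override
    public List<Event> intercept(List<Event> events) {
        for (Event event : events) {
            intercept(event);
        }
        return events;
    }

    @Override
    public void close() { }

    public static class Builder implements Interceptor.Builder {
        @Override
        public Interceptor build() {
            return new TimestampAndTableInterceptor();
        }

        @Override
        public void configure(Context context) { }
    }
}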

Data Warehouse

ODS

The data collected into the Kafka topics topic_log and topic_db constitutes the ODS layer of the real-time data warehouse. This layer keeps the data as-is and serves as a backup.

DIM

DIM layer design points

(1) The DIM layer is designed according to dimensional modeling theory; it stores the dimension tables of the dimensional model.

(2) DIM layer data is stored in HBase tables.

DIM tables are used for dimension lookups: given a primary key, we fetch the related dimension attributes. A key-value store is the most efficient choice for this access pattern. Common K-V stores are Redis and HBase; since Redis keeps all data resident in memory and would put significant pressure on it, HBase is chosen to store the dimension data.
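For intuition, dimension association is essentially a single-row primary-key query. A minimal sketch of such a lookup through Phoenix (a hedged illustration, not project code; the table and column names are assumptions, while the GmallConfig constants are the ones used later in this document):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class DimLookupDemo {
    public static void main(String[] args) throws Exception {
        //connect to HBase through the Phoenix JDBC driver
        Connection connection = DriverManager.getConnection(GmallConfig.PHOENIX_SERVER);
        //point query on the primary key of a DIM table (illustrative table/column names)
        PreparedStatement ps = connection.prepareStatement(
                "select * from " + GmallConfig.HBASE_SCHEMA + ".dim_sku_info where id = ?");
        ps.setString(1, "12");
        try (ResultSet rs = ps.executeQuery()) {
            while (rs.next()) {
                System.out.println(rs.getString("sku_name"));
            }
        }
        ps.close();
        connection.close();
    }
}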

(3) DIM table names follow the convention dim_<table name>

DIM layer main tasks

(1) Consume the Kafka data and filter out null records

(2) Dynamically split out the dimension tables

(3) Save the data in the stream to the corresponding dimension table

Consuming the Kafka data and filtering out null records

Perform ETL on the data captured by Maxwell: keep what is useful and filter out the rest.

(1) Write a Kafka utility class

public class MyKafkaUtil {
    
    public static FlinkKafkaConsumer<String> getKafkaConsumer(String topic, String group_id){

        Properties properties = new Properties();   
        properties.setProperty(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "hadoop102:9092");
        properties.setProperty(ConsumerConfig.GROUP_ID_CONFIG, group_id);

        return new FlinkKafkaConsumer<String>(topic,
                new KafkaDeserializationSchema<String>() {
                    @Override
                    public boolean isEndOfStream(String s) {
                        return false;
                    }
                    @Override
                    public String deserialize(ConsumerRecord<byte[], byte[]> record) throws Exception {
                        if(record == null || record.value() == null){
                            return "";
                        }else{
                            return new String(record.value());
                        }
                    }
                    @Override
                    public TypeInformation<String> getProducedType() {
                        return BasicTypeInfo.STRING_TYPE_INFO;
                    }
                },
                properties);
    }
}

Note: new SimpleStringSchema() is not used because it throws an exception when it encounters a null (tombstone) record.

(2) Read the data, convert it into a JSON stream, and send dirty records to a side output

//TODO Get the execution environment
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(1);//match the number of Kafka topic partitions
env.setStateBackend(new HashMapStateBackend());//state backend
env.enableCheckpointing(5000L);
env.getCheckpointConfig().setCheckpointTimeout(10000L);
env.getCheckpointConfig().setCheckpointStorage("hdfs://hadoop102:8020/checkpoint");
//TODO Read the Kafka topic_db topic and create a stream
DataStreamSource<String> kafkaDS = env.addSource(MyKafkaUtil.getKafkaConsumer("topic_db", "dim_app"));
//TODO Filter non-JSON records into a side output
OutputTag<String> dirtyDataTag = new OutputTag<String>("Dirty") {};
SingleOutputStreamOperator<JSONObject> jsonObjDS = kafkaDS.process(new ProcessFunction<String, JSONObject>() {
    @Override
    public void processElement(String s, Context context, Collector<JSONObject> collector) throws Exception {
        try {
            JSONObject jsonObject = JSON.parseObject(s);
            collector.collect(jsonObject);
        } catch (Exception e) {
            context.output(dirtyDataTag, s);
        }
    }
});
//取出脏数据
DataStream<String> sideOutput = jsonObjDS.getSideOutput(dirtyDataTag);
sideOutput.print("Dirty>");

Dynamically splitting out the dimension tables

Dynamic configuration options:

MySQL + scheduled task + JDBC

MySQL + FlinkCDC

File + Flume + Kafka + FlinkKafkaConsumer

Zookeeper + Watch

MySQL + FlinkCDC is chosen here: MySQL makes it easy to initialize and maintain the configuration data, and FlinkCDC reads the configuration table into a configuration stream that is broadcast and connected to the main stream (keyBy + connect would also work, but it risks data skew).

(1) Create the configuration table in MySQL and a matching entity class

@Data
public class TableProcess {
    //source table
    String sourceTable;
    //sink table
    String sinkTable;
    //columns to write to the sink
    String sinkColumns;
    //primary key of the sink table
    String sinkPk;
    //table creation extension clause
    String sinkExtend;
}

(2) Read the configuration table with FlinkCDC, turn it into a broadcast stream, and connect it with the main stream

//TODO Use FlinkCDC to read the configuration table from MySQL
MySqlSource<String> mySqlSource = MySqlSource.<String>builder()
        .hostname("hadoop102")
        .port(3306)
        .username("root")
        .password("000000")
        .databaseList("gmall-config")
        .tableList("gmall-config.table_process")
        .deserializer(new JsonDebeziumDeserializationSchema())
        .startupOptions(StartupOptions.initial())
        .build();
DataStreamSource<String> mysqlSourceDS = env.fromSource(mySqlSource, WatermarkStrategy.noWatermarks(), "MysqlSource");
//TODO Turn the configuration stream into a broadcast stream
//Downstream subtasks store the broadcast rules as state -- the so-called "broadcast state"
MapStateDescriptor<String, TableProcess> mapStateDescriptor = new MapStateDescriptor<>("map-state", String.class, TableProcess.class);
BroadcastStream<String> broadcastStream = mysqlSourceDS.broadcast(mapStateDescriptor);
//TODO Connect the main stream with the broadcast stream
BroadcastConnectedStream<JSONObject, String> connectedStream = jsonObjDS.connect(broadcastStream);

(3) Process the main stream according to the broadcast stream

SingleOutputStreamOperator<JSONObject> hbaseDS = connectedStream.process(new TableProcessFunction(mapStateDescriptor));

Custom class TableProcessFunction

public class TableProcessFunction extends BroadcastProcessFunction<JSONObject, String, JSONObject> {

    private Connection connection;

    MapStateDescriptor<String, TableProcess> mapStateDescriptor;

    public TableProcessFunction(MapStateDescriptor<String, TableProcess> mapStateDescriptor) {
        this.mapStateDescriptor = mapStateDescriptor;
    }

    @Override
    public void open(Configuration parameters) throws Exception {
        connection = DriverManager.getConnection(GmallConfig.PHOENIX_SERVER);
    }


    //处理广播流
    @Override
    public void processBroadcastElement(String value, Context ctx, Collector<JSONObject> collector) throws Exception {
        //获取并解析数据为JavaBean对象
        JSONObject jsonObject = JSON.parseObject(value);
        TableProcess tableProcess = JSON.parseObject(jsonObject.getString("after"), TableProcess.class);
        //校验表是否存在,不存在建表
        checkTable(
            tableProcess.getSinkTable(),
            tableProcess.getSinkColumns(),
            tableProcess.getSinkPk(),
            tableProcess.getSinkExtend()
        );
        //将数据写入状态
        String key = tableProcess.getSourceTable();
        BroadcastState<String, TableProcess> broadcastState = ctx.getBroadcastState(mapStateDescriptor);
        broadcastState.put(key,tableProcess);
    }

    //建表
    private void checkTable(String sinkTable, String sinkColumns, String sinkPk, String sinkExtend) {
        PreparedStatement preparedStatement = null;
        try {
            if(sinkPk == null || sinkPk.equals("")){
                sinkPk = "id";
            }
            if(sinkExtend == null){
                sinkExtend = "";
            }
            StringBuilder sql = new StringBuilder("create table if not exists ")
                    .append(GmallConfig.HBASE_SCHEMA)
                    .append(".")
                    .append(sinkTable)
                    .append("(");
            String[] columns = sinkColumns.split(",");
            for (int i = 0; i < columns.length; i++) {
                String column = columns[i];
                if (sinkPk.equals(column)){
                    sql.append(column).append(" varchar primary key");
                }else {
                    sql.append(column).append(" varchar");
                }
                if(i < columns.length - 1){
                    sql.append(",");
                }
            }
            sql.append(")").append(sinkExtend);

            preparedStatement = connection.prepareStatement(sql.toString());
            preparedStatement.execute();


        }catch (SQLException e){
            throw new RuntimeException("建表" + sinkTable + "失败");
        }finally {
            if(preparedStatement != null){
                try {
                    preparedStatement.close();
                } catch (SQLException e) {
                    e.printStackTrace();
                }
            }

        }

    }

    //处理主流
    @Override
    public void processElement(JSONObject value, ReadOnlyContext ctx, Collector<JSONObject> out) throws Exception {
        //获取广播配置信息
        ReadOnlyBroadcastState<String, TableProcess> broadcastState = ctx.getBroadcastState(mapStateDescriptor);
        TableProcess tableProcess = broadcastState.get(value.getString("table"));
        String type = value.getString("type");
        if(tableProcess != null && ("bootstrap-insert".equals(type) || "insert".equals(type) || "update".equals(type))){
            //根据配置信息过滤字段
            filter(value.getJSONObject("data"), tableProcess.getSinkColumns());
            //补充sinkTable字段写出
            value.put("sinkTable", tableProcess.getSinkTable());
            out.collect(value);
        }else{
            System.out.println("过滤数据:" + value);
        }
    }


    //根据配置信息过滤字段
    private void filter(JSONObject data, String sinkColumns) {

        String[] split = sinkColumns.split(",");
        List<String> columnsList = Arrays.asList(split);

        Iterator<Map.Entry<String, Object>> iterator = data.entrySet().iterator();
        while (iterator.hasNext()){
            Map.Entry<String, Object> next = iterator.next();
            if(!columnsList.contains(next.getKey())){
                iterator.remove();
            }
        }

    }
}

Saving the stream data into the corresponding dimension table

The dimension data is saved into HBase tables (via Phoenix).

//TODO Write the data out to Phoenix
hbaseDS.addSink(new DimSinkFunction());
//TODO Start the job
env.execute("DimApp");

Custom class DimSinkFunction

public class DimSinkFunction extends RichSinkFunction<JSONObject> {

    private Connection connection;

    @Override
    public void open(Configuration parameters) throws Exception {
        connection = DriverManager.getConnection(GmallConfig.PHOENIX_SERVER);
    }

    @Override
    public void invoke(JSONObject value, Context context) throws Exception {
        PreparedStatement preparedStatement = null;
        try {
            String sql = genUpsertSql(value.getString("sinkTable"), value.getJSONObject("data"));
            preparedStatement = connection.prepareStatement(sql);
            preparedStatement.execute();
            connection.commit();
        } catch (SQLException e) {
            System.out.println("插入数据失败");
        } finally {
            if(preparedStatement != null){
                preparedStatement.close();
            }
        }
    }

    private String genUpsertSql(String sinkTable, JSONObject data) {

        Set<String> columns = data.keySet();
        Collection<Object> values = data.values();

        return "upsert into " + GmallConfig.HBASE_SCHEMA + "." + sinkTable + "(" +
                StringUtils.join(columns,",") + ") values ('" +
                StringUtils.join(values,"','") + "')";
    }
}

DWD

DWD layer design points

(1) The DWD layer is designed according to dimensional modeling theory; it stores the fact tables of the dimensional model.

(2) DWD table names follow the convention dwd_<data domain>_<table name>

Traffic domain: unrefined transaction fact tables

Main tasks

1) Data cleansing (ETL)

2) Repair of the new/returning visitor flag

3) Splitting the stream

Data cleansing

//TODO Get the execution environment
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(1);//match the number of Kafka topic partitions
env.setStateBackend(new HashMapStateBackend());
env.enableCheckpointing(5000L);
env.getCheckpointConfig().setCheckpointTimeout(10000L);
env.getCheckpointConfig().setCheckpointStorage("hdfs://hadoop102:8020/checkpoint");
//TODO Read the Kafka topic_log topic and create a stream
DataStreamSource<String> kafkaDS = env.addSource(MyKafkaUtil.getKafkaConsumer("topic_log", "base_log_app"));
//TODO Convert the data to JSON and filter non-JSON records into a side output
OutputTag<String> dirtyTag = new OutputTag<String>("Dirty"){};
SingleOutputStreamOperator<JSONObject> jsonObjDS = kafkaDS.process(new ProcessFunction<String, JSONObject>() {
    @Override
    public void processElement(String value, Context ctx, Collector<JSONObject> out) throws Exception {
        try {
            JSONObject jsonObject = JSON.parseObject(value);
            out.collect(jsonObject);
        } catch (Exception e) {
            ctx.output(dirtyTag, value);
        }
    }
});
DataStream<String> dirtyDS = jsonObjDS.getSideOutput(dirtyTag);
dirtyDS.print("Dirty>");

Repairing the new/returning visitor flag

1) Date formatting utility class

public class DateFormatUtil {

    private static final DateTimeFormatter dtf = DateTimeFormatter.ofPattern("yyyy-MM-dd");
    private static final DateTimeFormatter dtfFull = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");

    public static Long toTs(String dtStr, boolean isFull) {

        LocalDateTime localDateTime = null;
        if (!isFull) {
            dtStr = dtStr + " 00:00:00";
        }
        localDateTime = LocalDateTime.parse(dtStr, dtfFull);

        return localDateTime.toInstant(ZoneOffset.of("+8")).toEpochMilli();
    }

    public static Long toTs(String dtStr) {
        return toTs(dtStr, false);
    }

    public static String toDate(Long ts) {
        Date dt = new Date(ts);
        LocalDateTime localDateTime = LocalDateTime.ofInstant(dt.toInstant(), ZoneId.systemDefault());
        return dtf.format(localDateTime);
    }

    public static String toYmdHms(Long ts) {
        Date dt = new Date(ts);
        LocalDateTime localDateTime = LocalDateTime.ofInstant(dt.toInstant(), ZoneId.systemDefault());
        return dtfFull.format(localDateTime);
    }

    public static void main(String[] args) {
        System.out.println(toYmdHms(System.currentTimeMillis()));
    }
}

Note: SimpleDateFormat is not used because its format() method is not thread-safe.
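For intuition, a small sketch of the difference (illustrative demo, not project code): a shared SimpleDateFormat must be synchronized externally, while DateTimeFormatter is immutable and safe to share across threads:

import java.text.SimpleDateFormat;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Date;
import java.util.stream.IntStream;

public class FormatterThreadSafetyDemo {
    //SimpleDateFormat keeps a mutable Calendar inside; concurrent use can corrupt its output
    private static final SimpleDateFormat SDF = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
    //DateTimeFormatter is immutable and documented as thread-safe
    private static final DateTimeFormatter DTF = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");

    public static void main(String[] args) {
        IntStream.range(0, 1000).parallel().forEach(i -> {
            DTF.format(LocalDateTime.now());   //always correct
            SDF.format(new Date());            //may occasionally produce garbled results
        });
    }
}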

(2) Use keyed state to verify the new/returning visitor flag

//TODO 使用状态编程做新老用户校验
KeyedStream<JSONObject, String> keyedByMidStream = jsonObjDS.keyBy(jsonObject -> jsonObject.getJSONObject("common").getString("mid"));
SingleOutputStreamOperator<JSONObject> jsonObjWithNewFlagDS = keyedByMidStream.map(new RichMapFunction<JSONObject, JSONObject>() {

    private ValueState<String> lastVisitDtState;

    @Override
    public void open(Configuration parameters) throws Exception {
        lastVisitDtState = getRuntimeContext().getState(new ValueStateDescriptor<String>("last-visit", String.class));
    }

    @Override
    public JSONObject map(JSONObject value) throws Exception {
        String isNew = value.getJSONObject("common").getString("is_new");
        String lastVisitDt = lastVisitDtState.value();
        Long ts = value.getLong("ts");
        if ("1".equals(isNew)) {
            String curDt = DateFormatUtil.toDate(ts);
            if (lastVisitDt == null) {
                lastVisitDtState.update(curDt);
            } else if (!lastVisitDt.equals(curDt)) {
                value.getJSONObject("common").put("is_new", "0");
            }
        } else if (lastVisitDt == null) {
            String yesterday = DateFormatUtil.toDate(ts - 24 * 60 * 60 * 1000L);
            lastVisitDtState.update(yesterday);
        }
        return value;
    }
});

Note: accessing state requires the runtime context, which is only available in rich functions (Rich Function).

Splitting the stream

Split the stream when the data volume is large; when it is small, downstream jobs can simply filter for the data they need.

(1) Use side outputs to split the stream

//TODO Use side outputs to split the stream
OutputTag<String> startTag = new OutputTag<String>("start") {};
OutputTag<String> displayTag = new OutputTag<String>("display") {};
OutputTag<String> actionTag = new OutputTag<String>("action") {};
OutputTag<String> errorTag = new OutputTag<String>("error") {};
SingleOutputStreamOperator<String> pageDS = jsonObjWithNewFlagDS.process(new ProcessFunction<JSONObject, String>() {
    @Override
    public void processElement(JSONObject value, Context ctx, Collector<String> out) throws Exception {
        String jsonString = value.toJSONString();
        //error log
        String error = value.getString("err");
        if (error != null) {
            ctx.output(errorTag, jsonString);
        }
        //start (app launch) log
        String start = value.getString("start");
        if (start != null) {
            ctx.output(startTag, jsonString);
        } else {
            //page log: flatten displays and actions, enriching each element with page_id, ts and common
            String pageId = value.getJSONObject("page").getString("page_id");
            Long ts = value.getLong("ts");
            String common = value.getString("common");
            JSONArray displays = value.getJSONArray("displays");
            if (displays != null && displays.size() > 0) {
                for (int i = 0; i < displays.size(); i++) {
                    JSONObject display = displays.getJSONObject(i);
                    display.put("page_id", pageId);
                    display.put("ts", ts);
                    display.put("common", common);
                    ctx.output(displayTag, display.toJSONString());
                }
            }
            JSONArray actions = value.getJSONArray("actions");
            if (actions != null && actions.size() > 0) {
                for (int i = 0; i < actions.size(); i++) {
                    JSONObject action = actions.getJSONObject(i);
                    action.put("page_id", pageId);
                    action.put("ts", ts);
                    action.put("common", common);
                    ctx.output(actionTag, action.toJSONString());
                }
            }
            //keep only the page-level fields in the main (page) stream
            value.remove("displays");
            value.remove("actions");
            out.collect(value.toJSONString());
        }
    }
});

Note: why do the output tags end with {}? An anonymous subclass is used so that the generic type information is preserved despite type erasure.
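A minimal illustration of that point (illustrative demo, not project code):

import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.util.OutputTag;

public class OutputTagDemo {
    public static void main(String[] args) {
        //anonymous subclass: the String type argument is captured in the class metadata,
        //so Flink can still extract it after erasure
        OutputTag<String> ok = new OutputTag<String>("start") {};

        //plain instantiation: the type argument is erased, so the single-argument constructor
        //fails at construction time; passing the TypeInformation explicitly also works
        //OutputTag<String> broken = new OutputTag<>("start");              //would throw
        OutputTag<String> explicit = new OutputTag<>("start", Types.STRING);

        System.out.println(ok + " / " + explicit);
    }
}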

(2) Update the Kafka utility class

public class MyKafkaUtil {

    private static Properties properties = new Properties();
    static {
        properties.setProperty(CommonClientConfigs.BOOTSTRAP_SERVERS_CONFIG, "hadoop102:9092");
    }

    public static FlinkKafkaConsumer<String> getKafkaConsumer(String topic, String group_id){

        properties.setProperty(ConsumerConfig.GROUP_ID_CONFIG, group_id);

        return new FlinkKafkaConsumer<String>(topic,
                new KafkaDeserializationSchema<String>() {
                    @Override
                    public boolean isEndOfStream(String s) {
                        return false;
                    }
                    @Override
                    public String deserialize(ConsumerRecord<byte[], byte[]> record) throws Exception {
                        if(record == null || record.value() == null){
                            return "";
                        }else{
                            return new String(record.value());
                        }
                    }
                    @Override
                    public TypeInformation<String> getProducedType() {
                        return BasicTypeInfo.STRING_TYPE_INFO;
                    }
                },
                properties);
    }

    public static FlinkKafkaProducer<String> getKafkaProducer(String topic){
        return new FlinkKafkaProducer<String>(topic,
                new SimpleStringSchema(),
                properties
        );
    }
}

(3) Extract each side-output stream and write them to Kafka

//TODO 提取各个流的数据
DataStream<String> startDS = pageDS.getSideOutput(startTag);
DataStream<String> errorDS = pageDS.getSideOutput(errorTag);
DataStream<String> displayDS = pageDS.getSideOutput(displayTag);
DataStream<String> actionDS = pageDS.getSideOutput(actionTag);
//TODO 写出到Kafka
String page_topic = "dwd_traffic_page_log";
String start_topic = "dwd_traffic_start_log";
String display_topic = "dwd_traffic_display_log";
String action_topic = "dwd_traffic_action_log";
String error_topic = "dwd_traffic_error_log";
pageDS.addSink(MyKafkaUtil.getKafkaProducer(page_topic));
startDS.addSink(MyKafkaUtil.getKafkaProducer(start_topic));
errorDS.addSink(MyKafkaUtil.getKafkaProducer(error_topic));
displayDS.addSink(MyKafkaUtil.getKafkaProducer(display_topic));
actionDS.addSink(MyKafkaUtil.getKafkaProducer(action_topic));
//TODO 启动
env.execute("BaseLogApp");

Traffic domain: unique visitor transaction fact table

Main tasks

1) Filter out records whose last_page_id is not null

2) Select the unique visitor records

3) Configure the state time-to-live (TTL)

Filtering out records whose last_page_id is not null

A unique visitor's qualifying page view must be the first page of a session, so its last_page_id must be null. Filtering out records where last_page_id != null reduces the data volume and improves efficiency.

//TODO Get the execution environment
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(1);//match the number of Kafka topic partitions
env.setStateBackend(new HashMapStateBackend());
env.enableCheckpointing(5000L);
env.getCheckpointConfig().setCheckpointTimeout(10000L);
env.getCheckpointConfig().setCheckpointStorage("hdfs://hadoop102:8020/checkpoint");
//TODO Read the Kafka dwd_traffic_page_log topic and create a stream
String sourceTopic = "dwd_traffic_page_log";
String groupId = "dwd_traffic_unique_visitor_detail";
DataStreamSource<String> kafkaDS = env.addSource(MyKafkaUtil.getKafkaConsumer(sourceTopic, groupId));
//TODO Convert each line into a JSONObject
SingleOutputStreamOperator<JSONObject> jsonObjDS = kafkaDS.map(line -> JSON.parseObject(line));
//TODO Filter out records whose last_page_id is not null
SingleOutputStreamOperator<JSONObject> filterDS = jsonObjDS.filter(new FilterFunction<JSONObject>() {
    @Override
    public boolean filter(JSONObject value) throws Exception {
        return value.getJSONObject("page").getString("last_page_id") == null;
    }
});

Selecting the unique visitor records

If the last visit date in state is null or is not today, this is the mid's first visit of the day: keep the record and update the last visit date to today. Otherwise it is not the first visit of the day, so drop the record.

//TODO 按照mid分组
KeyedStream<JSONObject, String> keyedByMidStream = filterDS.keyBy(json -> json.getJSONObject("common").getString("mid"));
//TODO 使用状态编程进行每日登录数据去重
SingleOutputStreamOperator<JSONObject> uvDetailDS = keyedByMidStream.filter(new RichFilterFunction<JSONObject>() {
    private ValueState<String> visitDtState;

    @Override
    public void open(Configuration parameters) throws Exception {
        ValueStateDescriptor<String> stringValueStateDescriptor = new ValueStateDescriptor<>("visit-dt", String.class);
        visitDtState = getRuntimeContext().getState(stringValueStateDescriptor);
    }

    @Override
    public boolean filter(JSONObject value) throws Exception {
        String dt = visitDtState.value();
        String curDt = DateFormatUtil.toDate(value.getLong("ts"));
        if (dt == null || !dt.equals(curDt)) {
            visitDtState.update(curDt);
            return true;
        } else {
            return false;
        }

    }
});
//TODO 将数据写出到Kafka
String targetTopic = "dwd_traffic_unique_visitor_detail";
FlinkKafkaProducer<String> kafkaProducer = MyKafkaUtil.getKafkaProducer(targetTopic);
uvDetailDS.map(JSONAware::toJSONString).addSink(kafkaProducer);

//TODO 启动任务
env.execute("DwdTrafficUniqueVisitorDetail");

Configuring the state time-to-live (TTL)

The state is only needed to deduplicate the current day's data, so it should be cleared promptly the next day.

Option 1: clear it with a timer at midnight; this requires a process function.

Option 2: set the state TTL to 1 day, counted from when the state was created. Problem: a record at 09:00 on 4-1 is kept; a record at 08:00 on 4-2 is kept; the state then expires at 09:00 on 4-2, so a record at 10:00 on 4-2 is kept again and double-counted.

Option 3: set the state TTL to 1 day with update type OnCreateAndWrite, meaning the TTL is reset whenever the state is created or written.

@Override
public void open(Configuration parameters) throws Exception {
    ValueStateDescriptor<String> stringValueStateDescriptor = new ValueStateDescriptor<>("visit-dt", String.class);
    StateTtlConfig stateTtlConfig = new StateTtlConfig.Builder(Time.days(1))
            .setUpdateType(StateTtlConfig.UpdateType.OnCreateAndWrite)
            .build();
    stringValueStateDescriptor.enableTimeToLive(stateTtlConfig);
    visitDtState = getRuntimeContext().getState(stringValueStateDescriptor);
}

Traffic domain: user bounce transaction fact table

A bounce is a visit whose session contains only one page view. If we could see all the pages of each session, we would only need to select the sessions with exactly one page to obtain the bounce detail data.

(1) In the offline warehouse we have a full day of data, so by combining the visit time, page_id and last_page_id we can split the page log into sessions, count the pages per session, and keep the sessions with exactly one page to extract the bounce details;

(2) In real-time computation we cannot look at the data set as a whole, so splitting page views into sessions is difficult. In addition, the simulated log data in this project has no session_id field, so we cannot simply group by session_id to count pages per session.

Possible solutions:

(1) Session windows: consecutive bounces within a short time span introduce errors.

(2) Keyed state + timers: out-of-order data introduces errors.

(3) Keyed state + session windows: complex to implement; FlinkCEP can be used instead.

(4) FlinkCEP

Main tasks

(1) Consume the data and key it by mid

(2) Apply the pattern to form a pattern stream

(3) Extract the two result streams and union the output

Consuming the data and keying it by mid

//TODO Get the execution environment
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(1);//match the number of Kafka topic partitions
env.setStateBackend(new HashMapStateBackend());
env.enableCheckpointing(5000L);
env.getCheckpointConfig().setCheckpointTimeout(10000L);
env.getCheckpointConfig().setCheckpointStorage("hdfs://hadoop102:8020/checkpoint");
//TODO Read the Kafka dwd_traffic_page_log topic and create a stream
String sourceTopic = "dwd_traffic_page_log";
String groupId = "dwd_traffic_user_jump_detail";
DataStreamSource<String> kafkaDS = env.addSource(MyKafkaUtil.getKafkaConsumer(sourceTopic, groupId));
//TODO Convert the data into JSON objects
SingleOutputStreamOperator<JSONObject> jsonObjDS = kafkaDS.map(JSON::parseObject);
//TODO 提取事件时间生成WaterMark
SingleOutputStreamOperator<JSONObject> jsonObjWithWmDS = jsonObjDS.assignTimestampsAndWatermarks(WatermarkStrategy.<JSONObject>forBoundedOutOfOrderness(Duration.ofSeconds(2))
        .withTimestampAssigner(new SerializableTimestampAssigner<JSONObject>() {
            @Override
            public long extractTimestamp(JSONObject element, long l) {
                return element.getLong("ts");
            }
        }));
//TODO 按照mid分组
KeyedStream<JSONObject, String> keyedByMidStream = jsonObjWithWmDS.keyBy(json -> json.getJSONObject("common").getString("mid"));

Forming the pattern stream

//TODO 定义模式序列
Pattern<JSONObject, JSONObject> pattern = Pattern.<JSONObject>begin("first").where(new SimpleCondition<JSONObject>() {
    @Override
    public boolean filter(JSONObject value) throws Exception {
        return value.getJSONObject("page").getString("last_page_id") == null;
    }
}).next("second").where(new SimpleCondition<JSONObject>() {
    @Override
    public boolean filter(JSONObject value) throws Exception {
        return value.getJSONObject("page").getString("last_page_id") == null;
    }
}).within(Time.seconds(10));
//TODO 将模式序列作用到流上
PatternStream<JSONObject> patternStream = CEP.pattern(keyedByMidStream, pattern);

Extracting the two streams and unioning the output

//TODO 提取匹配事件以及超时事件
OutputTag<JSONObject> timeOutTag = new OutputTag<JSONObject>("time-out"){};
SingleOutputStreamOperator<JSONObject> selectDS = patternStream.select(timeOutTag, new PatternTimeoutFunction<JSONObject, JSONObject>() {
    @Override
    public JSONObject timeout(Map<String, List<JSONObject>> map, long l) throws Exception {
        return map.get("first").get(0);
    }
}, new PatternSelectFunction<JSONObject, JSONObject>() {
    @Override
    public JSONObject select(Map<String, List<JSONObject>> map) throws Exception {
        return map.get("first").get(0);
    }
});
DataStream<JSONObject> timeoutDS = selectDS.getSideOutput(timeOutTag);
//TODO 合并两个流
DataStream<JSONObject> unionDS = selectDS.union(timeoutDS);
//TODO 将数据写出到Kafka
String targetTopic = "dwd_traffic_user_jump_detail";
unionDS.map(JSONAware::toJSONString)
        .addSink(MyKafkaUtil.getKafkaProducer(targetTopic));
//TODO 启动
env.execute("DwdTrafficUserJumpDetail");

Trade domain: add-to-cart transaction fact table

Extract add-to-cart operations to build the add-to-cart table, degenerate the relevant dictionary-table dimensions into it, and write it to the corresponding Kafka topic.

Main tasks

(1) Read the cart table data

(2) Create a MySQL lookup dictionary table

(3) Join the cart table with the dictionary table (dimension degeneration)

(4) Write the data back to Kafka

Reading the cart table data

(1) Extend the Kafka utility class

/**
 * Kafka source DDL clause
 *
 * @param topic   source topic
 * @param groupId consumer group
 * @return the assembled WITH clause for a Kafka source table
 */
public static String getKafkaDDL(String topic, String groupId) {
    return " with ('connector' = 'kafka', " +
            " 'topic' = '" + topic + "'," +
            " 'properties.bootstrap.servers' = '" + BOOTSTRAP_SERVERS + "', " +
            " 'properties.group.id' = '" + groupId + "', " +
            " 'format' = 'json', " +
            " 'scan.startup.mode' = 'latest-offsets')";
}

public static String getTopicDbDDL(String groupId){
    return "create table topic_db( " +
            "   `database` string, " +
            "   `table` string, " +
            "   `type` string, " +
            "   `data` map<string, string>, " +
            "   `old` map<string, string>, " +
            "   `ts` string, " +
            "   `proc_time` as PROCTIME() " +
            ")" + MyKafkaUtil.getKafkaDDL("topic_db", groupId);
}

(2) Read the Kafka topic_db topic with a DDL statement

//TODO Get the execution environment
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

env.setParallelism(1);
env.setStateBackend(new HashMapStateBackend());
env.enableCheckpointing(5000L);
env.getCheckpointConfig().setCheckpointTimeout(10000L);
env.getCheckpointConfig().setCheckpointStorage("hdfs://hadoop102:8020/checkpoint");

StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env);
tableEnv.getConfig().setIdleStateRetention(Duration.ofSeconds(5));
//TODO Read the Kafka topic_db topic with a DDL statement
tableEnv.executeSql(MyKafkaUtil.getTopicDbDDL("dwd_trade_cart_add"));

(3) Filter out the add-to-cart records

//TODO 过滤出加购数据
Table cartAdd = tableEnv.sqlQuery("" +
        "select " +
        "   data['id'] id, " +
        "   data['user_id'] user_id, " +
        "   data['sku_id'] sku_id, " +
        "   data['source_id'] source_id, " +
        "   data['source_type'] source_type, " +
        "   if(`type` = 'insert', " +
        "   data['sku_num'],cast((cast(data['sku_num'] as int) - cast(`old`['sku_num'] as int)) as string)) sku_num, " +
        "   ts, " +
        "   proc_time " +
        "from `topic_db`  " +
        "where `table` = 'cart_info' " +
        "   and (`type` = 'insert' " +
        "   or (`type` = 'update'  " +
        "   and `old`['sku_num'] is not null  " +
        "   and cast(data['sku_num'] as int) > cast(`old`['sku_num'] as int)))");
tableEnv.createTemporaryView("cart_add", cartAdd);

Note: wrap reserved keywords in backticks, e.g. `table`.

Creating the MySQL lookup dictionary table

Why use a MySQL lookup table? The dimension data must not expire, and when a dimension row is updated, only that updated row itself can be associated.

(1) Write a SQL utility class

public class MysqlUtil {
    public static String getBaseDicLookUpDDL() {
        return "create table `base_dic`( " +
                "`dic_code` string, " +
                "`dic_name` string, " +
                "`parent_code` string, " +
                "`create_time` timestamp, " +
                "`operate_time` timestamp, " +
                "primary key(`dic_code`) not enforced " +
                ")" + MysqlUtil.mysqlLookUpTableDDL("base_dic");
    }

    public static String mysqlLookUpTableDDL(String tableName) {
        return "WITH ( " +
                "'connector' = 'jdbc', " +
                "'url' = 'jdbc:mysql://hadoop102:3306/gmall', " +
                "'table-name' = '" + tableName + "', " +
                "'lookup.cache.max-rows' = '10', " +
                "'lookup.cache.ttl' = '1 hour', " +
                "'username' = 'root', " +
                "'password' = '000000', " +
                "'driver' = 'com.mysql.cj.jdbc.Driver' " +
                ")";
    }
}

Note: the lookup cache improves throughput, but updates to the underlying table may cause temporary inconsistency. Since base_dic here is only ever appended to, never modified, this is not a problem.

(2) Read the base_dic table from MySQL as a lookup dimension table

//TODO 读取MySQL中的base_dic表构建维表
tableEnv.executeSql(MysqlUtil.getBaseDicLookUpDDL());

Joining the cart table with the dictionary table (dimension degeneration)

//TODO 关联两张表(维度退化)
Table resultTable = tableEnv.sqlQuery("" +
        "select " +
        "   cadd.id, " +
        "   user_id, " +
        "   sku_id, " +
        "   source_id, " +
        "   source_type, " +
        "   dic_name source_type_name, " +
        "   sku_num, " +
        "   ts " +
        "from cart_add cadd " +
        "   left join base_dic for system_time as of cadd.proc_time as dic " +
        "   on cadd.source_type=dic.dic_code");
tableEnv.createTemporaryView("result_table", resultTable);

Note: the lookup join from the main table needs the processing-time field proc_time, so that when the dimension table has been updated and multiple versions exist, the appropriate version is joined.

Writing the data back to Kafka

//TODO 将数据写回到Kafka
String sinkTopic = "dwd_trade_cart_add";
tableEnv.executeSql("" +
        "create table dwd_trade_cart_add( " +
        "   id string, " +
        "   user_id string, " +
        "   sku_id string, " +
        "   source_id string, " +
        "   source_type_code string, " +
        "   source_type_name string, " +
        "   sku_num string, " +
        "   ts string, " +
        "   primary key(id) not enforced " +
        ")" + MyKafkaUtil.getKafkaDDL(sinkTopic, ""));
tableEnv.executeSql("insert into dwd_trade_cart_add select * from result_table");

Trade domain: order pre-processing table

Join the four business fact tables (order detail, order info, order detail activity, order detail coupon) with the dictionary table (a dimensional business table) to form the order pre-processing table, and write it to the corresponding Kafka topic.

Main tasks

(1) Read the business data from the Kafka topic_db topic;

(2) Select the order detail data;

(3) Select the order info data;

(4) Select the order-detail-activity data;

(5) Select the order-detail-coupon data;

(6) Create the MySQL lookup dictionary table;

(7) Join the five tables above into an order wide table and write it to a Kafka topic

Reading the business data from the Kafka topic_db topic

//TODO Get the execution environment
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

env.setParallelism(1);
env.setStateBackend(new HashMapStateBackend());
env.enableCheckpointing(5000L);
env.getCheckpointConfig().setCheckpointTimeout(10000L);
env.getCheckpointConfig().setCheckpointStorage("hdfs://hadoop102:8020/checkpoint");

StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env);
tableEnv.getConfig().setIdleStateRetention(Duration.ofDays(3));
//TODO Read the Kafka topic_db topic with a DDL statement
tableEnv.executeSql(MyKafkaUtil.getTopicDbDDL("dwd_trade_cart_add"));

Selecting the order detail data

//TODO 过滤出订单明细数据
Table orderDetail = tableEnv.sqlQuery("" +
        "select  " +
        "   data['id'] id, " +
        "   data['order_id'] order_id, " +
        "   data['sku_id'] sku_id, " +
        "   data['sku_name'] sku_name, " +
        "   data['create_time'] create_time, " +
        "   data['source_id'] source_id, " +
        "   data['source_type'] source_type, " +
        "   data['sku_num'] sku_num, " +
        "   cast(cast(data['sku_num'] as decimal(16,2)) * " +
        "   cast(data['order_price'] as decimal(16,2)) as String) split_original_amount, " +
        "   data['split_total_amount'] split_total_amount, " +
        "   data['split_activity_amount'] split_activity_amount, " +
        "   data['split_coupon_amount'] split_coupon_amount, " +
        "   ts od_ts, " +
        "   proc_time " +
        "from `topic_db` where `table` = 'order_detail' " +
        "and `type` = 'insert' ");
tableEnv.createTemporaryView("order_detail", orderDetail);

Selecting the order info data

//TODO 过滤出订单数据
Table orderInfo = tableEnv.sqlQuery("" +
        "select  " +
        "   data['id'] id, " +
        "   data['user_id'] user_id, " +
        "   data['province_id'] province_id, " +
        "   data['operate_time'] operate_time, " +
        "   data['order_status'] order_status, " +
        "   `type`, " +
        "   `old`, " +
        "   ts oi_ts " +
        "from `topic_db` " +
        "where `table` = 'order_info' " +
        "and (`type` = 'insert' or `type` = 'update')");
tableEnv.createTemporaryView("order_info", orderInfo);

Selecting the order-detail-activity data

//TODO 过滤出订单明细活动数据
Table orderDetailActivity = tableEnv.sqlQuery("" +
        "select  " +
        "   data['order_detail_id'] order_detail_id, " +
        "   data['activity_id'] activity_id, " +
        "   data['activity_rule_id'] activity_rule_id " +
        "from `topic_db` " +
        "where `table` = 'order_detail_activity' " +
        "and `type` = 'insert' ");
tableEnv.createTemporaryView("order_detail_activity", orderDetailActivity);

Selecting the order-detail-coupon data

//TODO 过滤出订单明细购物券数据
Table orderDetailCoupon = tableEnv.sqlQuery("" +
        "select " +
        "   data['order_detail_id'] order_detail_id, " +
        "   data['coupon_id'] coupon_id " +
        "from `topic_db` " +
        "where `table` = 'order_detail_coupon' " +
        "and `type` = 'insert' ");
tableEnv.createTemporaryView("order_detail_coupon", orderDetailCoupon);

Creating the MySQL lookup dictionary table

//TODO 构建MySQL-Lookup表 base_dic
tableEnv.executeSql(MysqlUtil.getBaseDicLookUpDDL());

Joining the five tables above into an order wide table and writing it to Kafka

(1) Join the five tables

//TODO Join the five tables
Table resultTable = tableEnv.sqlQuery("" +
        "select  " +
        "   od.id, " +
        "   od.order_id, " +
        "   oi.user_id, " +
        "   oi.order_status, " +
        "   od.sku_id, " +
        "   od.sku_name, " +
        "   oi.province_id, " +
        "   act.activity_id, " +
        "   act.activity_rule_id, " +
        "   cou.coupon_id, " +
        "   date_format(od.create_time, 'yyyy-MM-dd') date_id, " +
        "   od.create_time, " +
        "   date_format(oi.operate_time, 'yyyy-MM-dd') operate_date_id, " +
        "   oi.operate_time, " +
        "   od.source_id, " +
        "   od.source_type, " +
        "   dic.dic_name source_type_name, " +
        "   od.sku_num, " +
        "   od.split_original_amount, " +
        "   od.split_activity_amount, " +
        "   od.split_coupon_amount, " +
        "   od.split_total_amount, " +
        "   oi.`type`, " +
        "   oi.`old`, " +
        "   od.od_ts, " +
        "   oi.oi_ts, " +
        "   current_row_timestamp() row_op_ts " +
        "from order_detail od  " +
        "join order_info oi " +
        "   on od.order_id = oi.id " +
        "left join order_detail_activity act " +
        "   on od.id = act.order_detail_id " +
        "left join order_detail_coupon cou " +
        "   on od.id = cou.order_detail_id " +
        "left join `base_dic` for system_time as of od.proc_time as dic " +
        "   on od.source_type = dic.dic_code");
tableEnv.createTemporaryView("result_table", resultTable);

(2) Extend the Kafka utility class

/**
 * Upsert-Kafka sink DDL clause
 *
 * @param topic target Kafka topic
 * @return the assembled WITH clause for an upsert-kafka sink table
 */
public static String getUpsertKafkaDDL(String topic) {
    return "WITH ( " +
            "  'connector' = 'upsert-kafka', " +
            "  'topic' = '" + topic + "', " +
            "  'properties.bootstrap.servers' = '" + BOOTSTRAP_SERVERS + "', " +
            "  'key.format' = 'json', " +
            "  'value.format' = 'json' " +
            ")";
}

Note: the left join produces a retract stream (a lookup join does not), so an upsert-kafka sink is required. An upsert-kafka table must declare a primary key, and retraction messages show up in Kafka as records with a null value.
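One practical consequence, sketched here as an assumption about downstream usage rather than the project's actual code: if this topic is later consumed with the MyKafkaUtil consumer defined above, the tombstone records arrive as empty strings (the custom deserializer maps null values to ""), so they should be filtered out before JSON parsing. The consumer group name below is illustrative.

//assumed downstream usage of the upsert-kafka topic written in step (3)
DataStreamSource<String> preProcessDS = env.addSource(
        MyKafkaUtil.getKafkaConsumer("dwd_trade_order_pre_process", "dwd_trade_order_detail"));
SingleOutputStreamOperator<JSONObject> preProcessJsonDS = preProcessDS
        //drop the tombstone records that the deserializer turned into empty strings
        .filter(line -> line != null && !line.isEmpty())
        .map(JSON::parseObject);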

(3) Write out the data

//TODO 创建 upsert-kafka表
tableEnv.executeSql("" +
        "create table dwd_trade_order_pre_process( " +
        "   id string, " +
        "   order_id string, " +
        "   user_id string, " +
        "   order_status string, " +
        "   sku_id string, " +
        "   sku_name string, " +
        "   province_id string, " +
        "   activity_id string, " +
        "   activity_rule_id string, " +
        "   coupon_id string, " +
        "   date_id string, " +
        "   create_time string, " +
        "   operate_date_id string, " +
        "   operate_time string, " +
        "   source_id string, " +
        "   source_type string, " +
        "   source_type_name string, " +
        "   sku_num string, " +
        "   split_original_amount string, " +
        "   split_activity_amount string, " +
        "   split_coupon_amount string, " +
        "   split_total_amount string, " +
        "   `type` string, " +
        "   `old` map<string,string>, " +
        "   od_ts string, " +
        "   oi_ts string, " +
        "   row_op_ts timestamp_ltz(3), " +
        "primary key(id) not enforced " +
        ")" + MyKafkaUtil.getUpsertKafkaDDL("dwd_trade_order_pre_process"));
//TODO 将数据写出
tableEnv.executeSql("" +
        "insert into dwd_trade_order_pre_process \n" +
        "select * from result_table");

Trade domain: order placement transaction fact table

Read the order pre-processing table from Kafka, select the order detail data, and write it to the corresponding Kafka topic.

Main tasks

(1) Read the order pre-processing data from the Kafka dwd_trade_order_pre_process topic;

(2) Select the order detail data: newly inserted rows, i.e. records whose order-table operation type is insert, are the order detail data (see the sketch after this list);

(3) Write to the Kafka order detail topic.
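A hedged sketch of step (2), assuming a Kafka source table named dwd_trade_order_pre_process has been registered over the pre-processing topic with the same schema as the sink above (the column subset and view name are illustrative, not the project's verbatim code):

//TODO Select order detail records: rows whose order-table operation type is 'insert'
Table orderDetail = tableEnv.sqlQuery("" +
        "select " +
        "   id, " +
        "   order_id, " +
        "   user_id, " +
        "   sku_id, " +
        "   sku_name, " +
        "   sku_num, " +
        "   split_total_amount, " +
        "   od_ts ts, " +
        "   row_op_ts " +
        "from dwd_trade_order_pre_process " +
        "where `type` = 'insert'");
tableEnv.createTemporaryView("dwd_trade_order_detail", orderDetail);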

Trade domain: order cancellation transaction fact table

Read the order pre-processing table from Kafka, select the cancelled-order detail data, and write it to the corresponding Kafka topic.

Main tasks

(1) Read the order pre-processing data from the Kafka dwd_trade_order_pre_process topic;

(2) Select the cancelled-order detail data: keep records whose order_status field was modified and whose new value is "1003" (see the sketch after this list);

(3) Write to the Kafka order cancellation topic.
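A hedged sketch of the cancellation filter under the same assumption as above (the `old` map carries the pre-update values captured by Maxwell):

//TODO Keep rows whose order_status was changed and whose new value is '1003' (cancelled)
Table cancelDetail = tableEnv.sqlQuery("" +
        "select " +
        "   id, " +
        "   order_id, " +
        "   user_id, " +
        "   sku_id, " +
        "   sku_num, " +
        "   split_total_amount, " +
        "   oi_ts ts, " +
        "   row_op_ts " +
        "from dwd_trade_order_pre_process " +
        "where `type` = 'update' " +
        "and `old`['order_status'] is not null " +
        "and order_status = '1003'");
tableEnv.createTemporaryView("dwd_trade_cancel_detail", cancelDetail);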

Trade domain: payment success transaction fact table

Select payment success data from the Kafka topic_db topic, read the order fact data from the dwd_trade_order_detail topic, and build the MySQL lookup dictionary table; join the three into a payment success wide table and write it to the Kafka payment success topic.

Main tasks

1) Obtain the order detail data

2) Select the payment table data

3) Build the MySQL lookup dictionary table

4) Join the three tables above into a payment success wide table and write it to the Kafka payment success topic

Trade domain: order return transaction fact table

Read the business data from Kafka, select the order return table data, select the qualifying order table data, build the MySQL lookup dictionary table, and join the three tables into an order return detail wide table.

Main tasks

1) Select the order return table data

2) Select the order table data

3) Build the MySQL lookup dictionary table

4) Join these tables into an order return detail wide table and write it to the Kafka order return detail topic

Trade domain: refund success transaction fact table

Main tasks

1) Extract refund-success data from the refund payment table

2) Extract refund-success order data from the order table

3) Extract refund-success detail data from the order return table

4) Build the MySQL lookup dictionary table

5) Join these tables into a refund detail wide table and write it to the Kafka refund detail topic

Tool domain: coupon acquisition transaction fact table

Main tasks

1) Select coupon acquisition records from the coupon use table

2) Write to the Kafka coupon acquisition topic

Tool domain: coupon usage (order placement) transaction fact table

Main tasks

1) Select the coupon use table data and wrap it as a stream

2) Filter the qualifying coupon order-placement records and wrap them back into a table

3) Write to the Kafka coupon order-placement topic

Note: what we need here are rows whose old map contains the using_time key. The SQL predicate `old`['using_time'] is not null would filter out exactly the rows we need (the previous value of using_time is null), so the table is first converted to a stream, filtered there, and then converted back into a table.

//读取优惠券领用表数据,封装为流
Table couponUseOrder = tableEnv.sqlQuery("select " +
        "data['id'] id, " +
        "data['coupon_id'] coupon_id, " +
        "data['user_id'] user_id, " +
        "data['order_id'] order_id, " +
        "date_format(data['using_time'],'yyyy-MM-dd') date_id, " +
        "data['using_time'] using_time, " +
        "`old`, " +
        "ts " +
        "from topic_db " +
        "where `table` = 'coupon_use' " +
        "and `type` = 'update' ");
DataStream<CouponUseOrderBean> couponUseOrderDS = tableEnv.toAppendStream(couponUseOrder, CouponUseOrderBean.class);

// 过滤满足条件的优惠券下单数据,封装为表
SingleOutputStreamOperator<CouponUseOrderBean> filteredDS = couponUseOrderDS.filter(
        couponUseOrderBean -> {
            String old = couponUseOrderBean.getOld();
            if(old != null) {
                Map oldMap = JSON.parseObject(old, Map.class);
                Set changeKeys = oldMap.keySet();
                return changeKeys.contains("using_time");
            }
            return false;
        }
);
Table resultTable = tableEnv.fromDataStream(filteredDS);
tableEnv.createTemporaryView("result_table", resultTable);

Tool domain: coupon usage (payment) transaction fact table

Main tasks

1) Select coupon payment records from the coupon use table

2) Write to the Kafka coupon payment topic

Interaction domain: favorite product transaction fact table

Main tasks

1) Select the favorites data

2) Write to the Kafka favorites topic

Interaction domain: review transaction fact table

Main tasks

1) Select the review data

2) Build the MySQL lookup dictionary table

3) Join these two tables and write to the Kafka review topic

User domain: user registration transaction fact table

Main tasks

1) Select the user registration data

2) Write to the Kafka user registration topic

DWS

DWS layer design points

(1) The DWS layer is designed with reference to the metric system.

(2) DWS table names follow the convention dws_<data domain>_<granularity>_<business process>_<window>

Note: window refers to the time range covered by the window.

Traffic domain: source-keyword granularity page view window summary table

Read data from the Kafka page view detail topic, filter search behavior, and use a custom UDTF (one-to-many) function to segment the search text into keywords. Count how often each keyword appears in each window and write the result to ClickHouse.

Main tasks

1) Consume the data and set the watermark

2) Filter search behavior

3) Segment the keywords

4) Group, window and aggregate

5) Convert the result to a stream and write it to ClickHouse

Consuming the data and setting the watermark

//TODO 获取执行环境
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(4);
StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env);

env.enableCheckpointing(3000L, CheckpointingMode.EXACTLY_ONCE);
env.getCheckpointConfig().setCheckpointTimeout(60 * 1000L);
env.getCheckpointConfig().setMinPauseBetweenCheckpoints(3000L);
env.getCheckpointConfig().enableExternalizedCheckpoints(
        CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION
);
env.setRestartStrategy(RestartStrategies.failureRateRestart(
        3, Time.days(1), Time.minutes(1)
));
env.setStateBackend(new HashMapStateBackend());
env.getCheckpointConfig().setCheckpointStorage(
        "hdfs://hadoop102:8020/ck"
);
System.setProperty("HADOOP_USER_NAME", "atguigu");
//TODO 使用DDL方式读取 DWD层页面浏览日志 创建表,同时获取事件时间生成Watermark
String topic = "dwd_traffic_page_log";
String groupId = "dws_traffic_source_keyword_page_view_window";
tableEnv.executeSql("" +
        "create table page_log( " +
        "   `page` map<string, string>, " +
        "   `ts` bigint, " +
        "   row_time AS TO_TIMESTAMP(FROM_UNIXTIME(ts/1000, 'yyyy-MM-dd HH:mm:ss')), " +
        "WATERMARK FOR row_time AS row_time - INTERVAL '3' SECOND " +
        ")" + MyKafkaUtil.getKafkaDDL(topic, groupId));

Filtering search behavior

//TODO 过滤出搜索数据
Table searchTable = tableEnv.sqlQuery("" +
        "select " +
        "   page['item'] full_word, " +
        "   row_time " +
        "from page_log " +
        "where page['item'] is not null " +
        "and page['last_page_id'] = 'search' " +
        "and page['item_type'] = 'keyword'");
tableEnv.createTemporaryView("search_table", searchTable);

Word segmentation

(1) Write a keyword segmentation utility class

public class KeywordUtil {
    public static List<String> splitKeyWord(String keyword) throws IOException {
        ArrayList<String> result = new ArrayList<>();
        //创建分词器对象
        StringReader reader = new StringReader(keyword);
        IKSegmenter ikSegmenter = new IKSegmenter(reader, false);
        //提取分词
        Lexeme next = ikSegmenter.next();
        while (next != null){
            String word = next.getLexemeText();
            result.add(word);
            next = ikSegmenter.next();
        }
        return result;
    }
    
}

(2) Write the segmentation table function

@FunctionHint(output = @DataTypeHint("Row<word STRING>"))
public class SplitFunction extends TableFunction<Row> {
    public void eval(String keyword){
        List<String> list;
        try {
            list = KeywordUtil.splitKeyWord(keyword);
            for (String word : list) {
                collect(Row.of(word));
            }
        } catch (IOException e) {
            collect(Row.of(keyword));
        }

    }
}

(3) Use the segmentation function

//TODO 使用自定义函数分词处理
tableEnv.createTemporaryFunction("SplitFunction", SplitFunction.class);
Table splitTable = tableEnv.sqlQuery("" +
        "select " +
        "   keyword, " +
        "   row_time  " +
        "from search_table, " +
        "lateral table(SplitFunction(full_word)) " +
        "as t(keyword)");
tableEnv.createTemporaryView("split_table", splitTable);

Grouping, windowing and aggregation

//TODO 分组开窗聚合
Table KeywordBeanSearch = tableEnv.sqlQuery("" +
        "select " +
        "   DATE_FORMAT(TUMBLE_START(row_time, INTERVAL '10' SECOND),'yyyy-MM-dd HH:mm:ss') stt, " +
        "   DATE_FORMAT(TUMBLE_END(row_time, INTERVAL '10' SECOND),'yyyy-MM-dd HH:mm:ss') edt, " +
        "   'search' source, " +
        "   keyword, " +
        "   count(*) keyword_count, " +
        "   UNIX_TIMESTAMP()*1000 ts " +
        "from split_table " +
        "GROUP BY TUMBLE(row_time, INTERVAL '10' SECOND),keyword");

Note: ts serves as the version column for the later idempotent write into the ReplacingMergeTree table; without a version column, the row kept during merges defaults to the last inserted one.

Converting the data to a stream and writing it to ClickHouse

(1) Write a ClickHouse utility class

public class MyClickHouseUtil {
    public static <T> SinkFunction<T> getClickHouseSink(String sql){

        return JdbcSink.<T>sink(sql,
                new JdbcStatementBuilder<T>() {
                    @SneakyThrows
                    @Override
                    public void accept(PreparedStatement preparedStatement, T t) throws SQLException {
                        //使用反射的方式提取字段
                        Class<?> clz = t.getClass();
                        Field[] fields = clz.getDeclaredFields();
                        int offset = 0;
                        for (int i = 0; i < fields.length; i++) {
                            Field field = fields[i];
                            field.setAccessible(true);
                            TransientSink transientSink = field.getAnnotation(TransientSink.class);
                            if(transientSink != null){
                                offset++;
                                continue;
                            }
                            //获取数据并给占位符赋值
                            Object value = field.get(t);
                            preparedStatement.setObject(i + 1 - offset, value);
                        }
                    }
                }, new JdbcExecutionOptions.Builder()
                        .withBatchSize(5)
                        .withBatchIntervalMs(1000L)
                        .build(),
                new JdbcConnectionOptions.JdbcConnectionOptionsBuilder()
                        .withDriverName(GmallConfig.CLICKHOUSE_DRIVER)
                        .withUrl(GmallConfig.CLICKHOUSE_URL)
                        .build()
        );
    }
}

(2) Convert the table to a stream

//TODO Convert the table to an append-only DataStream
DataStream<KeywordBean> keywordBeanDS = tableEnv.toAppendStream(KeywordBeanSearch, KeywordBean.class);

Note: the query is a windowed aggregation, so an append-only stream is sufficient here.

(3) Create the ClickHouse table

drop table if exists dws_traffic_source_keyword_page_view_window;
create table if not exists dws_traffic_source_keyword_page_view_window
(
    stt           DateTime,
    edt           DateTime,
    source        String,
    keyword       String,
    keyword_count UInt64,
    ts            UInt64
) engine = ReplacingMergeTree(ts)
      partition by toYYYYMMDD(stt)
      order by (stt, edt, source, keyword);

(4) Write to ClickHouse

//TODO Write out to ClickHouse
keywordBeanDS.addSink(MyClickHouseUtil.getClickHouseSink("" +
        "insert into dws_traffic_source_keyword_page_view_window values(?,?,?,?,?,?)"));
//TODO Start the job
env.execute("DwsTrafficSourceKeywordPageViewWindow");

Traffic domain: version-channel-region-visitor-category granularity page view window summary table

The DWS layer serves the ADS layer. From an analysis of the metric system, the summary table in this section needs five measures: session count, page view count, total view duration, unique visitor count, and bounce session count. The task here is to compute these five metrics and write both the dimensions and the measures into the ClickHouse summary table.

Main tasks

1) Read the three topics and create three streams

2) Normalize the data format and union the streams

3) Group, window and aggregate

4) Write to ClickHouse

Reading the three topics and creating three streams

//TODO Read the three topics and create three streams
String page_topic = "dwd_traffic_page_log";
String uv_topic = "dwd_traffic_unique_visitor_detail";
String uj_topic = "dwd_traffic_user_jump_detail";
String groupId = "dws_traffic_vc_ch_ar_isnew_pageview_window";
DataStreamSource<String> pageStringDS = env.addSource(MyKafkaUtil.getKafkaConsumer(page_topic, groupId));
DataStreamSource<String> uvStringDS = env.addSource(MyKafkaUtil.getKafkaConsumer(uv_topic, groupId));
DataStreamSource<String> ujStringDS = env.addSource(MyKafkaUtil.getKafkaConsumer(uj_topic, groupId));

Normalizing the data format and unioning the streams

//TODO 3个流统一数据格式
SingleOutputStreamOperator<TrafficPageViewBean> trafficPageViewWithUvDS = uvStringDS.map(line -> {
    JSONObject jsonObject = JSON.parseObject(line);
    JSONObject common = jsonObject.getJSONObject("common");
    return new TrafficPageViewBean("", "",
            common.getString("vc"),
            common.getString("ch"),
            common.getString("ar"),
            common.getString("is_new"),
            1L, 0L, 0L, 0L, 0L,
            jsonObject.getLong("ts")
    );
});
SingleOutputStreamOperator<TrafficPageViewBean> trafficPageViewWithUjDS = ujStringDS.map(line -> {
    JSONObject jsonObject = JSON.parseObject(line);
    JSONObject common = jsonObject.getJSONObject("common");
    return new TrafficPageViewBean("", "",
            common.getString("vc"),
            common.getString("ch"),
            common.getString("ar"),
            common.getString("is_new"),
            0L, 0L, 0L, 0L, 1L,
            jsonObject.getLong("ts")
    );
});
SingleOutputStreamOperator<TrafficPageViewBean> trafficPageViewWithPvDS = pageStringDS.map(line -> {
    JSONObject jsonObject = JSON.parseObject(line);
    JSONObject common = jsonObject.getJSONObject("common");
    JSONObject page = jsonObject.getJSONObject("page");
    String lastPageId = page.getString("last_page_id");
    long sv = 0L;
    if(lastPageId == null){
        sv = 1;
    }
    return new TrafficPageViewBean("", "",
            common.getString("vc"),
            common.getString("ch"),
            common.getString("ar"),
            common.getString("is_new"),
            0L,
            sv,
            1L,
            jsonObject.getJSONObject("page").getLong("during_time"),
            0L,
            jsonObject.getLong("ts")
    );
});
//TODO 合并3个流并提取事件时间生成Watermark
SingleOutputStreamOperator<TrafficPageViewBean> unionDS = trafficPageViewWithPvDS.union(
        trafficPageViewWithUjDS,
        trafficPageViewWithUvDS
).assignTimestampsAndWatermarks(WatermarkStrategy.<TrafficPageViewBean>forBoundedOutOfOrderness(Duration.ofSeconds(2))
        .withTimestampAssigner(new SerializableTimestampAssigner<TrafficPageViewBean>() {
            @Override
            public long extractTimestamp(TrafficPageViewBean element, long l) {
                return element.getTs();
            }
        }));

Grouping, windowing and aggregation

//TODO 分组开窗聚合
KeyedStream<TrafficPageViewBean, Tuple4<String, String, String, String>> keyedStream = unionDS.keyBy(new KeySelector<TrafficPageViewBean, Tuple4<String, String, String, String>>() {
    @Override
    public Tuple4<String, String, String, String> getKey(TrafficPageViewBean value) throws Exception {
        return new Tuple4<>(value.getAr(), value.getCh(), value.getIsNew(), value.getVc());
    }
});
WindowedStream<TrafficPageViewBean, Tuple4<String, String, String, String>, TimeWindow> windowedStream =
        keyedStream.window(TumblingEventTimeWindows.of(Time.seconds(10)));
SingleOutputStreamOperator<TrafficPageViewBean> reduceDS = windowedStream.reduce(new ReduceFunction<TrafficPageViewBean>() {
    @Override
    public TrafficPageViewBean reduce(TrafficPageViewBean value1, TrafficPageViewBean value2) throws Exception {
        value1.setUvCt(value1.getUvCt() + value2.getUvCt());
        value1.setSvCt(value1.getSvCt() + value2.getSvCt());
        value1.setPvCt(value1.getPvCt() + value2.getPvCt());
        value1.setDurSum(value1.getDurSum() + value2.getDurSum());
        value1.setUjCt(value1.getUjCt() + value2.getUjCt());
        return value1;
    }
}, new WindowFunction<TrafficPageViewBean, TrafficPageViewBean, Tuple4<String, String, String, String>, TimeWindow>() {
    @Override
    public void apply(Tuple4<String, String, String, String> key, TimeWindow window, Iterable<TrafficPageViewBean> input, Collector<TrafficPageViewBean> out) throws Exception {
        //get the pre-aggregated record
        TrafficPageViewBean trafficPageViewBean = input.iterator().next();
        //get the window bounds
        long start = window.getStart();
        long end = window.getEnd();
        //fill in the window start/end as full timestamps, matching the DateTime columns
        trafficPageViewBean.setStt(DateFormatUtil.toYmdHms(start));
        trafficPageViewBean.setEdt(DateFormatUtil.toYmdHms(end));
        //emit the result
        out.collect(trafficPageViewBean);
    }
});

Writing to ClickHouse

(1) Create the table

drop table if exists dws_traffic_vc_ch_ar_is_new_page_view_window;
create table if not exists dws_traffic_vc_ch_ar_is_new_page_view_window
(
    stt     DateTime,
    edt     DateTime,
    vc      String,
    ch      String,
    ar      String,
    is_new  String,
    uv_ct   UInt64,
    sv_ct   UInt64,
    pv_ct   UInt64,
    dur_sum UInt64,
    uj_ct   UInt64,
    ts      UInt64
) engine = ReplacingMergeTree(ts)
      partition by toYYYYMMDD(stt)
      order by (stt, edt, vc, ch, ar, is_new);

(2) Write to ClickHouse

//TODO 写出到ClickHouse
reduceDS.addSink(MyClickHouseUtil.getClickHouseSink("" +
        "insert into dws_traffic_vc_ch_ar_is_new_page_view_window values(?,?,?,?,?,?,?,?,?,?,?,?)"));

Known issue

The bounce (uj) data is windowed twice: first by the CEP within() window, then by this aggregation window. Because of the delay introduced by the first stage, by the time a uj record reaches the second window, that window may already be closed, so the record is never emitted.

Solutions: (1) increase the watermark delay of the second windowing stage (a sketch follows); (2) compute the uj metric in a separate job.
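A minimal sketch of option (1): a modified version of the union step above with a larger out-of-orderness bound. The 12-second value is an assumption meant to cover the 10-second CEP timeout plus the upstream 2-second delay and would need tuning.

//a hedged sketch: widen the watermark delay on the union stream so late uj records still fall into open windows
SingleOutputStreamOperator<TrafficPageViewBean> unionDS = trafficPageViewWithPvDS.union(
        trafficPageViewWithUjDS,
        trafficPageViewWithUvDS
).assignTimestampsAndWatermarks(WatermarkStrategy.<TrafficPageViewBean>forBoundedOutOfOrderness(Duration.ofSeconds(12))
        .withTimestampAssigner(new SerializableTimestampAssigner<TrafficPageViewBean>() {
            @Override
            public long extractTimestamp(TrafficPageViewBean element, long recordTimestamp) {
                return element.getTs();
            }
        }));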

Traffic domain: page view window summary table

Read data from the Kafka page log topic and count the daily unique visitors of the home page and the product detail page.

Main tasks

1) Filter out home page and product detail page records

2) Use keyed state to compute unique visitors of the home page and the product detail page

3) Window, aggregate and output

Filtering out home page and product detail page records

//TODO 读取Kafka 页面日志主题 数据创建流
String page_topic = "dwd_traffic_page_log";
String groupId = "dws_traffic_page_view_window";
DataStreamSource<String> pageStringDS = env.addSource(MyKafkaUtil.getKafkaConsumer(page_topic, groupId));
//TODO 将数据转换为JSON对象
SingleOutputStreamOperator<JSONObject> jsonObjDS = pageStringDS.map(JSON::parseObject);
//TODO 过滤数据,只需要访问主页跟商品详情页的数据
SingleOutputStreamOperator<JSONObject> homeAndDetailPageDS = jsonObjDS.filter(new FilterFunction<JSONObject>() {
    @Override
    public boolean filter(JSONObject value) throws Exception {
        String pageId = value.getJSONObject("page").getString("page_id");
        return "good_detail".equals(pageId) || "home".equals(pageId);
    }
});
//TODO 提取事件时间生成Watermark
SingleOutputStreamOperator<JSONObject> homeAndDetailPageWithWmDS = homeAndDetailPageDS.assignTimestampsAndWatermarks(
        WatermarkStrategy.<JSONObject>forBoundedOutOfOrderness(Duration.ofSeconds(2))
        .withTimestampAssigner(new SerializableTimestampAssigner<JSONObject>() {
            @Override
            public long extractTimestamp(JSONObject element, long l) {
                return element.getLong("ts");
            }
        }));

Using keyed state to compute unique visitors of the home page and the product detail page

//TODO 按照Mid分组
KeyedStream<JSONObject, String> keyedStream = homeAndDetailPageWithWmDS.keyBy(jsonObject -> jsonObject.getJSONObject("common").getString("mid"));
//TODO 使用状态编程计算主页及商品详情页的独立访客
SingleOutputStreamOperator<TrafficHomeDetailPageViewBean> trafficHomeDetailDS = keyedStream.flatMap(new RichFlatMapFunction<JSONObject, TrafficHomeDetailPageViewBean>() {
    private ValueState<String> homeLastVisitDt;
    private ValueState<String> detailLastVisitDt;

    @Override
    public void open(Configuration parameters) throws Exception {
        //状态TTL
        StateTtlConfig stateTtlConfig = new StateTtlConfig.Builder(Time.days(1)).setUpdateType(StateTtlConfig.UpdateType.OnCreateAndWrite).build();

        ValueStateDescriptor<String> homeDtDescriptor = new ValueStateDescriptor<>("home-dt", String.class);
        homeDtDescriptor.enableTimeToLive(stateTtlConfig);
        homeLastVisitDt = getRuntimeContext().getState(homeDtDescriptor);

        ValueStateDescriptor<String> detailDtDescriptor = new ValueStateDescriptor<>("detail-dt", String.class);
        detailDtDescriptor.enableTimeToLive(stateTtlConfig);
        detailLastVisitDt = getRuntimeContext().getState(detailDtDescriptor);
    }

    @Override
    public void flatMap(JSONObject value, Collector<TrafficHomeDetailPageViewBean> out) throws Exception {
        //取出当前页面信息及时间
        String pageId = value.getJSONObject("page").getString("page_id");
        Long ts = value.getLong("ts");
        String curDt = DateFormatUtil.toDate(ts);
        //过滤
        long homeUvCt = 0L;
        long detailUvCt = 0L;
        if ("home".equals(pageId)) {
            String homeLastDt = homeLastVisitDt.value();
            if (homeLastDt == null || !homeLastDt.equals(curDt)) {
                homeUvCt = 1L;
                homeLastVisitDt.update(curDt);
            }
        } else {
            String detailLastDt = detailLastVisitDt.value();
            if (detailLastDt == null || !detailLastDt.equals(curDt)) {
                detailUvCt = 1L;
                detailLastVisitDt.update(curDt);
            }
        }
        if (homeUvCt != 0L || detailUvCt != 0L) {
            out.collect(new TrafficHomeDetailPageViewBean("", "",
                    homeUvCt,
                    detailUvCt,
                    System.currentTimeMillis()));
        }
    }
});

Windowing, aggregating and outputting

(1) Window and aggregate

//TODO 开窗聚合
SingleOutputStreamOperator<TrafficHomeDetailPageViewBean> reduceDS = trafficHomeDetailDS.windowAll(TumblingEventTimeWindows.of(org.apache.flink.streaming.api.windowing.time.Time.seconds(10)))
        .reduce(new ReduceFunction<TrafficHomeDetailPageViewBean>() {
            @Override
            public TrafficHomeDetailPageViewBean reduce(TrafficHomeDetailPageViewBean value1, TrafficHomeDetailPageViewBean value2) throws Exception {
                value1.setHomeUvCt(value1.getHomeUvCt() + value2.getHomeUvCt());
                value1.setGoodDetailUvCt(value1.getGoodDetailUvCt() + value2.getGoodDetailUvCt());
                return value1;
            }
        }, new AllWindowFunction<TrafficHomeDetailPageViewBean, TrafficHomeDetailPageViewBean, TimeWindow>() {
            @Override
            public void apply(TimeWindow window, Iterable<TrafficHomeDetailPageViewBean> values, Collector<TrafficHomeDetailPageViewBean> out) throws Exception {
                TrafficHomeDetailPageViewBean pageViewBean = values.iterator().next();
                pageViewBean.setStt(DateFormatUtil.toYmdHms(window.getStart()));
                pageViewBean.setEdt(DateFormatUtil.toYmdHms(window.getEnd()));
                out.collect(pageViewBean);
            }
        });

(2) Create the table

drop table if exists dws_traffic_page_view_window;
create table if not exists dws_traffic_page_view_window
(
    stt               DateTime,
    edt               DateTime,
    home_uv_ct        UInt64,
    good_detail_uv_ct UInt64,
    ts                UInt64
) engine = ReplacingMergeTree(ts)
      partition by toYYYYMMDD(stt)
      order by (stt, edt);

(3) Write to ClickHouse

//TODO Write the aggregated result to ClickHouse
reduceDS.addSink(MyClickHouseUtil.getClickHouseSink("insert into dws_traffic_page_view_window values(?,?,?,?,?)"));

User Domain: User Login Window Summary Table

Read data from the Kafka page-log topic and compute the number of 7-day returning users and the number of daily unique (active) users.

Main tasks

1) Filter out user login records

2) Derive the returning-user and unique-user counts

3) Window aggregation and write to ClickHouse

Filter out user login records

//TODO Read the Kafka page-log topic and create a stream
String page_topic = "dwd_traffic_page_log";
String groupId = "dws_user_user_login_window";
DataStreamSource<String> pageStringDS = env.addSource(MyKafkaUtil.getKafkaConsumer(page_topic, groupId));
//TODO Convert each record to a JSONObject
SingleOutputStreamOperator<JSONObject> jsonObjDS = pageStringDS.map(JSON::parseObject);
//TODO Keep only records whose uid is not null and whose last_page_id is null or equals "login"
SingleOutputStreamOperator<JSONObject> filterDS = jsonObjDS.filter(
        new FilterFunction<JSONObject>() {
            @Override
            public boolean filter(JSONObject value) throws Exception {
                JSONObject common = value.getJSONObject("common");
                JSONObject page = value.getJSONObject("page");
                return common.getString("uid") != null
                        && (page.getString("last_page_id") == null
                            || page.getString("last_page_id").equals("login"));
            }
        }
);
//TODO Extract the event time and generate watermarks
SingleOutputStreamOperator<JSONObject> filterWithWmDS = filterDS.assignTimestampsAndWatermarks(
        WatermarkStrategy.<JSONObject>forBoundedOutOfOrderness(Duration.ofSeconds(2))
                .withTimestampAssigner(new SerializableTimestampAssigner<JSONObject>() {
                    @Override
                    public long extractTimestamp(JSONObject element, long l) {
                        return element.getLong("ts");
                    }
                }));

Derive the returning-user and unique-user counts

//TODO Key the stream by uid
KeyedStream<JSONObject, String> keyedStream = filterWithWmDS.keyBy(jsonObject -> jsonObject.getJSONObject("common").getString("uid"));
//TODO Use keyed state to extract returning users and daily unique users
SingleOutputStreamOperator<UserLoginBean> uvDS = keyedStream.process(new KeyedProcessFunction<String, JSONObject, UserLoginBean>() {
    private ValueState<String> lastVisitDt;

    @Override
    public void open(Configuration parameters) throws Exception {
        lastVisitDt = getRuntimeContext().getState(new ValueStateDescriptor<String>("last-dt", String.class));
    }

    @Override
    public void processElement(JSONObject value, Context ctx, Collector<UserLoginBean> out) throws Exception {
        //Read the last visit date from state
        String lastDt = lastVisitDt.value();
        //Get the date of the current record
        Long ts = value.getLong("ts");
        String curDt = DateFormatUtil.toDate(ts);
        long uuCt = 0;
        long backCt = 0;
        if (lastDt == null) {
            uuCt = 1L;
            lastVisitDt.update(curDt);
        } else {
            if (!lastDt.equals(curDt)) {
                uuCt = 1L;
                lastVisitDt.update(curDt);
                Long lastTs = DateFormatUtil.toTs(lastDt);
                long days = (ts - lastTs) / (1000L * 60 * 60 * 24);
                if (days >= 8L) {
                    backCt = 1;
                }
            }
        }
        if (uuCt == 1L) {
            out.collect(new UserLoginBean("", "", backCt, uuCt, System.currentTimeMillis()));
        }
    }
});

Window aggregation and write to ClickHouse

(1) Window aggregation

//TODO Window aggregation
SingleOutputStreamOperator<UserLoginBean> resultDS = uvDS.windowAll(TumblingEventTimeWindows.of(org.apache.flink.streaming.api.windowing.time.Time.seconds(10)))
        .reduce(new ReduceFunction<UserLoginBean>() {
            @Override
            public UserLoginBean reduce(UserLoginBean value1, UserLoginBean value2) throws Exception {
                value1.setUuCt(value1.getUuCt() + value2.getUuCt());
                value1.setBackCt(value1.getBackCt() + value2.getBackCt());
                return value1;
            }
        }, new AllWindowFunction<UserLoginBean, UserLoginBean, TimeWindow>() {
            @Override
            public void apply(TimeWindow window, Iterable<UserLoginBean> values, Collector<UserLoginBean> out) throws Exception {
                UserLoginBean userLoginBean = values.iterator().next();
                userLoginBean.setStt(DateFormatUtil.toYmdHms(window.getStart()));
                userLoginBean.setEdt(DateFormatUtil.toYmdHms(window.getEnd()));
                out.collect(userLoginBean);
            }
        });

(2) Create the ClickHouse table

drop table if exists dws_user_user_login_window;
create table if not exists dws_user_user_login_window
(
    stt     DateTime,
    edt     DateTime,    
    back_ct UInt64,
    uu_ct   UInt64,
    ts      UInt64
) engine = ReplacingMergeTree(ts)
      partition by toYYYYMMDD(stt)
      order by (stt, edt);

(3) Write to ClickHouse

//TODO Write to ClickHouse
resultDS.addSink(MyClickHouseUtil.getClickHouseSink("insert into dws_user_user_login_window values(?,?,?,?,?)"));

User Domain: User Registration Window Summary Table

Read data from the DWD user registration table, count registered users per window, and write the result to ClickHouse.

Main task

Window aggregation (a minimal sketch follows)
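The original post gives no code for this job; a minimal sketch in the style of the other DWS jobs is shown below. The DWD topic name dwd_user_register, the consumer group, the create_time field, and a UserRegisterBean(stt, edt, registerCt, ts) are assumptions, not confirmed by the source.

//TODO Read the DWD register topic, assign watermarks, then window and count (sketch)
SingleOutputStreamOperator<UserRegisterBean> registerResultDS = env
        .addSource(MyKafkaUtil.getKafkaConsumer("dwd_user_register", "dws_user_user_register_window"))
        .map(JSON::parseObject)
        .assignTimestampsAndWatermarks(
                WatermarkStrategy.<JSONObject>forBoundedOutOfOrderness(Duration.ofSeconds(2))
                        .withTimestampAssigner((element, recordTs) -> DateFormatUtil.toTs(element.getString("create_time"), true)))
        //Each register record contributes a count of 1
        .map(json -> new UserRegisterBean("", "", 1L, System.currentTimeMillis()))
        .windowAll(TumblingEventTimeWindows.of(org.apache.flink.streaming.api.windowing.time.Time.seconds(10)))
        .reduce(new ReduceFunction<UserRegisterBean>() {
            @Override
            public UserRegisterBean reduce(UserRegisterBean value1, UserRegisterBean value2) throws Exception {
                value1.setRegisterCt(value1.getRegisterCt() + value2.getRegisterCt());
                return value1;
            }
        }, new AllWindowFunction<UserRegisterBean, UserRegisterBean, TimeWindow>() {
            @Override
            public void apply(TimeWindow window, Iterable<UserRegisterBean> values, Collector<UserRegisterBean> out) throws Exception {
                UserRegisterBean bean = values.iterator().next();
                bean.setStt(DateFormatUtil.toYmdHms(window.getStart()));
                bean.setEdt(DateFormatUtil.toYmdHms(window.getEnd()));
                out.collect(bean);
            }
        });
registerResultDS.addSink(MyClickHouseUtil.getClickHouseSink("insert into dws_user_user_register_window values(?,?,?,?)"));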

 

Trade Domain: Cart-Add Window Summary Table

Read cart-add detail data from Kafka, count the number of unique cart-add users per window, and write the result to ClickHouse.

Main task

Use keyed state to keep only each day's first cart-add record per user

Note: when assigning watermarks, use operate_time if it is present, otherwise fall back to create_time (see the sketch below).
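A minimal sketch of that watermark assignment is shown below; the input stream name cartAddJsonObjDS is an assumption used for illustration.

//TODO Prefer operate_time (set on updates) over create_time (set on inserts) when extracting the event time (sketch)
SingleOutputStreamOperator<JSONObject> cartAddWithWmDS = cartAddJsonObjDS.assignTimestampsAndWatermarks(
        WatermarkStrategy.<JSONObject>forBoundedOutOfOrderness(Duration.ofSeconds(2))
                .withTimestampAssigner(new SerializableTimestampAssigner<JSONObject>() {
                    @Override
                    public long extractTimestamp(JSONObject element, long recordTimestamp) {
                        String operateTime = element.getString("operate_time");
                        if (operateTime != null) {
                            return DateFormatUtil.toTs(operateTime, true);
                        }
                        return DateFormatUtil.toTs(element.getString("create_time"), true);
                    }
                }));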

Trade Domain: Order Window Summary Table

Read data from the Kafka order-detail topic, filter out null data and deduplicate, compute the daily unique order users and the new order users, wrap the results in an entity class, and write them to ClickHouse.

Note: both the null records and the duplicate records are produced upstream by the combination of left join and the upsert-kafka connector.

Main tasks

1) Filter out null data and deduplicate

2) Compute the daily unique order users and the new order users

3) Write to ClickHouse

Filter out null data and deduplicate

(1) Filter out null data

//TODO Read the Kafka DWD order-detail topic
String topic = "dwd_trade_order_detail";
String groupId = "dws_trade_order_window";
DataStreamSource<String> orderDetailStrDS = env.addSource(MyKafkaUtil.getKafkaConsumer(topic, groupId));
//TODO Filter out empty records and convert the rest to JSON
SingleOutputStreamOperator<JSONObject> JSONObjDS = orderDetailStrDS.flatMap(new FlatMapFunction<String, JSONObject>() {
    @Override
    public void flatMap(String value, Collector<JSONObject> out) throws Exception {
        if (!"".equals(value)) {
            JSONObject jsonObject = JSON.parseObject(value);
            out.collect(jsonObject);
        }
    }
});

(2) Timestamp comparison utility

public class TimestampLtz3CompareUtil {

    public static int compare(String timestamp1, String timestamp2) {
        // Input format: 2022-04-01 10:20:47.302Z
        // 1. Strip the trailing time-zone marker; 'Z' stands for UTC
        String cleanedTime1 = timestamp1.substring(0, timestamp1.length() - 1);
        String cleanedTime2 = timestamp2.substring(0, timestamp2.length() - 1);
        // 2. Extract the sub-second part
        String[] timeArr1 = cleanedTime1.split("\\.");
        String[] timeArr2 = cleanedTime2.split("\\.");
        String milliseconds1 = new StringBuilder(timeArr1[timeArr1.length - 1])
                .append("000").toString().substring(0, 3);
        String milliseconds2 = new StringBuilder(timeArr2[timeArr2.length - 1])
                .append("000").toString().substring(0, 3);
        int milli1 = Integer.parseInt(milliseconds1);
        int milli2 = Integer.parseInt(milliseconds2);
        // 3. Extract the yyyy-MM-dd HH:mm:ss part
        String date1 = timeArr1[0];
        String date2 = timeArr2[0];
        Long ts1 = DateFormatUtil.toTs(date1, true);
        Long ts2 = DateFormatUtil.toTs(date2, true);
        // 4. Build millisecond-precision timestamps
        long milliTs1 = ts1 + milli1;
        long milliTs2 = ts2 + milli2;

        long divTs = milliTs1 - milliTs2;

        return divTs < 0 ? -1 : divTs == 0 ? 0 : 1;
    }
}
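A quick usage example (not from the original post): the sub-second parts "302" and "41" are right-padded to 302 ms and 410 ms on the same second, so the first timestamp is smaller and compare returns -1.

public static void main(String[] args) {
    // 302 ms vs 410 ms on the same second -> the first timestamp is earlier
    System.out.println(TimestampLtz3CompareUtil.compare(
            "2022-04-01 10:20:47.302Z",
            "2022-04-01 10:20:47.41Z")); // prints -1
}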

(3) Deduplicate

//TODO Key the stream by order_detail_id
KeyedStream<JSONObject, String> keyedStream = JSONObjDS.keyBy(jsonObject -> jsonObject.getString("order_detail_id"));
//TODO Deduplicate, keeping the record with the larger row_op_ts (the more complete one)
SingleOutputStreamOperator<JSONObject> orderDetailJSONObjDS = keyedStream.process(new KeyedProcessFunction<String, JSONObject, JSONObject>() {
    private ValueState<JSONObject> orderDetailState;

    @Override
    public void open(Configuration parameters) throws Exception {
        orderDetailState = getRuntimeContext().getState(new ValueStateDescriptor<JSONObject>("order-detail", JSONObject.class));
    }

    @Override
    public void processElement(JSONObject value, Context ctx, Collector<JSONObject> out) throws Exception {
        JSONObject orderDetail = orderDetailState.value();
        if (orderDetail == null) {
            orderDetailState.update(value);
            ctx.timerService().registerProcessingTimeTimer(ctx.timerService().currentProcessingTime() + 2000L);
        } else {
            String stateTs = orderDetail.getString("row_op_ts");
            String curTs = value.getString("row_op_ts");
            int compare = TimestampLtz3CompareUtil.compare(stateTs, curTs);
            if (compare != 1) {
                orderDetailState.update(value);
            }
        }
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<JSONObject> out) throws Exception {
        JSONObject orderDetail = orderDetailState.value();
        out.collect(orderDetail);
    }
});

Note: row_op_ts is generated upstream by Flink SQL's current_row_timestamp() function; it records when the order-detail row was joined with the other tables, so the record with the larger row_op_ts is the more complete one and is the one kept.

Compute the daily unique order users and the new order users

(1) Extract unique order users and convert the records to JavaBean objects (the deduplicated stream is first assigned watermarks, producing jsonObjWithWmDS, which is not shown here)

//TODO Key the stream by user_id
KeyedStream<JSONObject, String> keyedByUidStream = jsonObjWithWmDS.keyBy(jsonObject -> jsonObject.getString("user_id"));
//TODO Extract unique order users and convert the records to JavaBean objects
SingleOutputStreamOperator<TradeOrderBean> tradeOrderDS = keyedByUidStream.flatMap(new RichFlatMapFunction<JSONObject, TradeOrderBean>() {
    private ValueState<String> lastOrderDt;

    @Override
    public void open(Configuration parameters) throws Exception {
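        // No TTL is set on this state: the last order date must survive across days so that brand-new users can be told apart from returning ones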
        lastOrderDt = getRuntimeContext().getState(new ValueStateDescriptor<String>("last-order", String.class));
    }

    @Override
    public void flatMap(JSONObject value, Collector<TradeOrderBean> out) throws Exception {
        String lastOrder = lastOrderDt.value();
        String curDt = value.getString("create_time").split(" ")[0];
        long orderUniqueUserCount = 0L;
        long orderNewUserCount = 0L;
        if (lastOrder == null) {
            orderUniqueUserCount = 1L;
            orderNewUserCount = 1L;
        } else {
            if (!lastOrder.equals(curDt)) {
                orderUniqueUserCount = 1L;
            }
        }
        lastOrderDt.update(curDt);
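        // split_activity_amount and split_coupon_amount come from upstream left joins and may be null, so default them to 0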
        Double splitActivityAmount = value.getDouble("split_activity_amount");
        if (splitActivityAmount == null) splitActivityAmount = 0.0D;
        Double splitCouponAmount = value.getDouble("split_coupon_amount");
        if (splitCouponAmount == null) splitCouponAmount = 0.0D;
        Double splitOriginalAmount = value.getDouble("split_original_amount");
        out.collect(new TradeOrderBean(
                "",
                "",
                orderUniqueUserCount,
                orderNewUserCount,
                splitActivityAmount,
                splitCouponAmount,
                splitOriginalAmount,
                0L
        ));
    }
});

(2) Window aggregation

//TODO Window aggregation
AllWindowedStream<TradeOrderBean, TimeWindow> windowedStream = tradeOrderDS.windowAll(TumblingEventTimeWindows.of(org.apache.flink.streaming.api.windowing.time.Time.seconds(10)));
SingleOutputStreamOperator<TradeOrderBean> resultDS = windowedStream.reduce(new ReduceFunction<TradeOrderBean>() {
    @Override
    public TradeOrderBean reduce(TradeOrderBean value1, TradeOrderBean value2) throws Exception {
        value1.setOrderUniqueUserCount(value1.getOrderUniqueUserCount() + value2.getOrderUniqueUserCount());
        value1.setOrderNewUserCount(value1.getOrderNewUserCount() + value2.getOrderNewUserCount());
        value1.setOrderActivityReduceAmount(value1.getOrderActivityReduceAmount() + value2.getOrderActivityReduceAmount());
        value1.setOrderCouponReduceAmount(value1.getOrderCouponReduceAmount() + value2.getOrderCouponReduceAmount());
        value1.setOrderOriginalTotalAmount(value1.getOrderOriginalTotalAmount() + value2.getOrderOriginalTotalAmount());
        return value1;
    }
}, new AllWindowFunction<TradeOrderBean, TradeOrderBean, TimeWindow>() {
    @Override
    public void apply(TimeWindow window, Iterable<TradeOrderBean> values, Collector<TradeOrderBean> out) throws Exception {
        TradeOrderBean orderBean = values.iterator().next();
        orderBean.setStt(DateFormatUtil.toYmdHms(window.getStart()));
        orderBean.setEdt(DateFormatUtil.toYmdHms(window.getEnd()));
        orderBean.setTs(System.currentTimeMillis());
        out.collect(orderBean);
    }
});

Write to ClickHouse

(1) Create the ClickHouse table

drop table if exists dws_trade_order_window;
create table if not exists dws_trade_order_window
(
    stt                          DateTime,
    edt                          DateTime,
    order_unique_user_count      UInt64,
    order_new_user_count         UInt64,
    order_activity_reduce_amount Decimal(38, 20),
    order_coupon_reduce_amount   Decimal(38, 20),
    order_origin_total_amount    Decimal(38, 20),
    ts                           UInt64
) engine = ReplacingMergeTree(ts)
      partition by toYYYYMMDD(stt)
      order by (stt, edt);

(2) Write to ClickHouse

//TODO Write to ClickHouse
resultDS.addSink(MyClickHouseUtil.getClickHouseSink("insert into dws_trade_order_window values(?,?,?,?,?,?,?,?)"));

Trade Domain: Trademark-Category-User-SPU Granularity Order Window Summary Table

Read data from the Kafka order-detail topic, filter out null data, deduplicate by the unique key, join dimension information, group by the dimensions, compute the order count and order amount per dimension combination and window, and write the result to the ClickHouse trademark-category-user-SPU order window summary table.

Main tasks

1) Get the filtered OrderDetail stream

2) Convert the data structure

3) Write a DIM lookup utility

4) Optimization: side cache (cache-aside pattern)

5) Dimension join with async I/O

6) Group, window, and aggregate

7) Write to ClickHouse

Get the filtered OrderDetail stream

(1) Extract a reusable helper method

public class OrderDetailFilterFunction {
    public static SingleOutputStreamOperator<JSONObject> getDwdOrderDetail(StreamExecutionEnvironment env, String groupId){
        //TODO Read the Kafka DWD order-detail topic
        String topic = "dwd_trade_order_detail";
        DataStreamSource<String> orderDetailStrDS = env.addSource(MyKafkaUtil.getKafkaConsumer(topic, groupId));
        //TODO Filter out empty records and convert the rest to JSON
        SingleOutputStreamOperator<JSONObject> JSONObjDS = orderDetailStrDS.flatMap(new FlatMapFunction<String, JSONObject>() {
            @Override
            public void flatMap(String value, Collector<JSONObject> out) throws Exception {
                if (!"".equals(value)) {
                    JSONObject jsonObject = JSON.parseObject(value);
                    out.collect(jsonObject);
                }
            }
        });
        //TODO Key the stream by order_detail_id
        KeyedStream<JSONObject, String> keyedStream = JSONObjDS.keyBy(jsonObject -> jsonObject.getString("order_detail_id"));
        //TODO Deduplicate, keeping the record with the larger row_op_ts
        SingleOutputStreamOperator<JSONObject> orderDetailJSONObjDS = keyedStream.process(new KeyedProcessFunction<String, JSONObject, JSONObject>() {
            private ValueState<JSONObject> orderDetailState;

            @Override
            public void open(Configuration parameters) throws Exception {
                orderDetailState = getRuntimeContext().getState(new ValueStateDescriptor<JSONObject>("order-detail", JSONObject.class));
            }

            @Override
            public void processElement(JSONObject value, Context ctx, Collector<JSONObject> out) throws Exception {
                JSONObject orderDetail = orderDetailState.value();
                if (orderDetail == null) {
                    orderDetailState.update(value);
                    ctx.timerService().registerProcessingTimeTimer(ctx.timerService().currentProcessingTime() + 2000L);
                } else {
                    String stateTs = orderDetail.getString("row_op_ts");
                    String curTs = value.getString("row_op_ts");
                    int compare = TimestampLtz3CompareUtil.compare(stateTs, curTs);
                    if (compare != 1) {
                        orderDetailState.update(value);
                    }
                }
            }

            @Override
            public void onTimer(long timestamp, OnTimerContext ctx, Collector<JSONObject> out) throws Exception {
                JSONObject orderDetail = orderDetailState.value();
                out.collect(orderDetail);
            }
        });

        return orderDetailJSONObjDS;
    }
}

(2) Call the helper method to get the filtered stream

//TODO Get the filtered OrderDetail stream
String groupId = "sku_user_order_window";
SingleOutputStreamOperator<JSONObject> orderDetailJSONObjDS = OrderDetailFilterFunction.getDwdOrderDetail(env, groupId);

Convert the data structure

(1) Define a JavaBean

@Data
@AllArgsConstructor
@Builder
public class TradeTrademarkCategoryUserSpuOrderBean {
    // window start time
    String stt;
    // window end time
    String edt;
    // trademark ID
    String trademarkId;
    // trademark name
    String trademarkName;
    // category-1 ID
    String category1Id;
    // category-1 name
    String category1Name;
    // category-2 ID
    String category2Id;
    // category-2 name
    String category2Name;
    // category-3 ID
    String category3Id;
    // category-3 name
    String category3Name;

    // sku_id
    @TransientSink
    String skuId;
    // order_id set, used to deduplicate orders
    @TransientSink
    Set<String> orderIdSet;
    // user ID
    String userId;
    // spu_id
    String spuId;
    // spu name
    String spuName;
    // order count
    Long orderCount;
    // order amount
    Double orderAmount;
    // timestamp
    Long ts;

    public static void main(String[] args) {
        TradeTrademarkCategoryUserSpuOrderBean build = builder().build();
        System.out.println(build);
    }
}

Note: @TransientSink marks fields that should be skipped when writing to ClickHouse; the builder pattern (@Builder) makes it easy to create objects with only some of the fields populated.
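@TransientSink is a custom marker annotation, not something provided by Flink or ClickHouse. If it is not already defined elsewhere in the project, it is typically declared as in the sketch below, and MyClickHouseUtil skips annotated fields when building the insert statement:

import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

// Marker annotation: fields carrying it are not written to ClickHouse
@Target(ElementType.FIELD)
@Retention(RetentionPolicy.RUNTIME)
public @interface TransientSink {
}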

(2) Convert the data structure

//TODO Convert the records to JavaBeans
SingleOutputStreamOperator<TradeTrademarkCategoryUserSpuOrderBean> skuUserOrderDS = orderDetailJSONObjDS.map(json -> TradeTrademarkCategoryUserSpuOrderBean.builder()
        .skuId(json.getString("sku_id"))
        .userId(json.getString("user_id"))
        .orderCount(1L)
        .orderAmount(json.getDouble("split_total_amount"))
        .ts(DateFormatUtil.toTs(json.getString("create_time"), true))
        .build());

Note: setting orderCount to 1 directly would double-count orders that span multiple order-detail rows; keeping the order_ids in a Set instead lets the later aggregation deduplicate them.

//TODO Convert the records to JavaBeans (deduplicating with an order_id set)
SingleOutputStreamOperator<TradeTrademarkCategoryUserSpuOrderBean> skuUserOrderDS = orderDetailJSONObjDS.map(json->{
    HashSet<String> orderIds = new HashSet<String>();
    orderIds.add(json.getString("order_id"));
    return TradeTrademarkCategoryUserSpuOrderBean.builder()
        .skuId(json.getString("sku_id"))
        .userId(json.getString("user_id"))
        .orderIdSet(orderIds)
        .orderAmount(json.getDouble("split_total_amount"))
        .ts(DateFormatUtil.toTs(json.getString("create_time"), true))
        .build();
});

Write the DIM lookup utility

(1) Write a JDBC utility

public class JdbcUtil {
    public static <T> List<T> queryList(Connection connection, String querySql, Class<T> clz, boolean underScoreToCamel) throws Exception{
        ArrayList<T> list = new ArrayList<>();

        //Precompile the SQL
        PreparedStatement preparedStatement = connection.prepareStatement(querySql);
        //Execute the query
        ResultSet resultSet = preparedStatement.executeQuery();
        //Process the result set
        ResultSetMetaData metaData = resultSet.getMetaData();
        int columnCount = metaData.getColumnCount();
        while (resultSet.next()){ //iterate over rows
            T t = clz.newInstance();
            for (int i = 0; i < columnCount; i++) {//iterate over columns
                //column name
                String columnName = metaData.getColumnName(i + 1);
                //column value
                Object value = resultSet.getObject(columnName);
                //convert the column name from under_score to camelCase if requested
                if(underScoreToCamel){
                    columnName = CaseFormat.LOWER_UNDERSCORE.to(CaseFormat.LOWER_CAMEL, columnName.toLowerCase());
                }
                //set the property on the bean
                BeanUtils.setProperty(t, columnName, value);
            }
            list.add(t);
        }
        //Release resources
        resultSet.close();
        preparedStatement.close();

        return list;
    }
}

(2) Write the DIM utility

public class DimUtil {
    public static JSONObject getDimInfo(Connection connection, String tableName, String id) throws Exception {
        String querySql = "select * from " + GmallConfig.HBASE_SCHEMA + "." + tableName + " where id = '" + id + "'";
        List<JSONObject> list = JdbcUtil.queryList(connection, querySql, JSONObject.class, false);
        return list.get(0);
    }
}

Optimization: side cache (cache-aside)

(1) Modify the DIM utility

public class DimUtil {
    public static JSONObject getDimInfo(Connection connection, String tableName, String id) throws Exception {

        //1. Look up Redis first
        Jedis jedis = JedisUtil.getJedis();
        String redisKey = "DIM:" + tableName + ":" + id;
        String dimInfoStr = jedis.get(redisKey);
        if(dimInfoStr != null){
            jedis.expire(redisKey, 24 * 60 * 60);
            jedis.close();
            return JSON.parseObject(dimInfoStr);
        }

        //2. On a cache miss, query Phoenix (HBase)
        String querySql = "select * from " + GmallConfig.HBASE_SCHEMA + "." + tableName + " where id = '" + id + "'";
        List<JSONObject> list = JdbcUtil.queryList(connection, querySql, JSONObject.class, false);
        JSONObject dimInfo = list.get(0);

        //3. Write the result back to Redis with a 24-hour TTL
        jedis.set(redisKey, dimInfo.toJSONString());
        jedis.expire(redisKey, 24 * 60 * 60);
        jedis.close();

        //4. Return the result
        return dimInfo;
    }

    public static void delDimInfo(String tableName, String id){
        Jedis jedis = JedisUtil.getJedis();
        String redisKey = "DIM:" + tableName + ":" + id;
        jedis.del(redisKey);
        jedis.close();
    }
}

Note: the cached entries must have an expiration time, and changes to the dimension data must be handled by invalidating the cached entry on update. The DimUtil above also depends on a JedisUtil helper that hands out pooled Redis connections; a sketch is given below in case it is not defined elsewhere in the project.
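A minimal JedisUtil sketch (the Redis host hadoop102 and port 6379 are assumptions):

import redis.clients.jedis.Jedis;
import redis.clients.jedis.JedisPool;
import redis.clients.jedis.JedisPoolConfig;

public class JedisUtil {

    private static JedisPool jedisPool;

    private static void initJedisPool() {
        JedisPoolConfig poolConfig = new JedisPoolConfig();
        poolConfig.setMaxTotal(100);       // max connections in the pool
        poolConfig.setMaxIdle(5);          // max idle connections
        poolConfig.setMinIdle(5);          // min idle connections
        poolConfig.setBlockWhenExhausted(true);
        poolConfig.setMaxWaitMillis(2000); // max wait when the pool is exhausted
        poolConfig.setTestOnBorrow(true);  // validate connections on borrow
        jedisPool = new JedisPool(poolConfig, "hadoop102", 6379, 10000);
    }

    public static Jedis getJedis() {
        if (jedisPool == null) {
            initJedisPool();
        }
        return jedisPool.getResource();
    }
}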

(2) Modify DimSinkFunction to delete the cached entry when a dimension record is updated

public class DimSinkFunction extends RichSinkFunction<JSONObject> {

    private Connection connection;

    @Override
    public void open(Configuration parameters) throws Exception {
        connection = DriverManager.getConnection(GmallConfig.PHOENIX_SERVER);
    }

    @Override
    public void invoke(JSONObject value, Context context) throws Exception {
        PreparedStatement preparedStatement = null;
        try {
            String sinkTable = value.getString("sinkTable");
            JSONObject data = value.getJSONObject("data");
            String sql = genUpsertSql(sinkTable, data);
            preparedStatement = connection.prepareStatement(sql);
            //If this is an update, the stale Redis cache entry must be deleted
            if("update".equals(value.getString("type"))){
                DimUtil.delDimInfo(sinkTable.toUpperCase(), data.getString("id"));
            }
            preparedStatement.execute();
            connection.commit();
        } catch (SQLException e) {
            System.out.println("插入数据失败");
        } finally {
            if(preparedStatement != null){
                preparedStatement.close();
            }
        }
    }

    private String genUpsertSql(String sinkTable, JSONObject data) {

        Set<String> columns = data.keySet();
        Collection<Object> values = data.values();

        return "upsert into " + GmallConfig.HBASE_SCHEMA + "." + sinkTable + "(" +
                StringUtils.join(columns,",") + ") values ('" +
                StringUtils.join(values,"','") + "')";
    }

}

Dimension join with async I/O

When interacting with an external system, communication latency becomes the dominant cost. A naive lookup of external data, for example inside a MapFunction, is a synchronous interaction: a request is sent to the database and the MapFunction blocks until the response arrives. In many cases this waiting takes up the vast majority of the function's time.

Async I/O lets a single parallel function instance handle many requests concurrently and receive the responses as they complete. The waiting time is overlapped with sending other requests and receiving other responses, or at least amortized across multiple requests, which usually yields much higher streaming throughput.

Implementing async I/O requires a database client that supports asynchronous requests.

If no such client exists, a synchronous client can be turned into a limited-concurrency client by creating multiple client instances and dispatching the synchronous calls through a thread pool. This is usually less efficient than a proper asynchronous client, but it is the approach used below.

(1) Write a thread-pool utility

public class ThreadPoolUtil {

    private static ThreadPoolExecutor threadPoolExecutor = null;

    private ThreadPoolUtil() {
    }

    public static ThreadPoolExecutor getThreadPoolExecutor() {
        //Double-checked locking for lazy singleton initialization
        if (threadPoolExecutor == null) {
            synchronized (ThreadPoolUtil.class){
                if (threadPoolExecutor == null) {
                    threadPoolExecutor = new ThreadPoolExecutor(
                            4,//core pool size
                            20,//maximum pool size; extra threads are created only when the work queue is full
                            60,//idle-thread keep-alive time
                            TimeUnit.SECONDS,
                            new LinkedBlockingDeque<>()//unbounded work queue: tasks wait here when all core threads are busy (so the pool effectively never grows past 4 threads)
                    );
                }
            }
        }
        return threadPoolExecutor;
    }
}

(2) Write the asynchronous dimension-join function

public abstract class DimAsyncFunction<T> extends RichAsyncFunction<T, T> implements DimJoinFunction<T>{

    private Connection connection;
    private ThreadPoolExecutor threadPoolExecutor;
    private String tableName;

    public DimAsyncFunction(String tableName) {
        this.tableName = tableName;
    }

    @Override
    public void open(Configuration parameters) throws Exception {
        connection = DriverManager.getConnection(GmallConfig.PHOENIX_SERVER);
        threadPoolExecutor = ThreadPoolUtil.getThreadPoolExecutor();
    }

    @Override
    public void asyncInvoke(T input, ResultFuture<T> resultFuture) throws Exception {
        threadPoolExecutor.execute(new Runnable() {
            @SneakyThrows
            @Override
            public void run() {
                //Query the dimension table
                JSONObject dimInfo = DimUtil.getDimInfo(connection, tableName, getKey(input));
                //Fill the dimension fields into the JavaBean
                join(input, dimInfo);
                //Emit the enriched record
                resultFuture.complete(Collections.singletonList(input));
            }
        });
    }

    @Override
    public void timeout(T input, ResultFuture<T> resultFuture) throws Exception {
        //On timeout, the lookup could be retried here; for now the record is only logged
        System.out.println("timeout:" + input);
    }
}

Note: generics, an abstract class, and an interface are combined here so that each call site only has to supply the table-specific pieces (the lookup key and the join logic).

public interface DimJoinFunction<T> {

    public String getKey(T input);

    public void join(T input, JSONObject dimInfo);

}

(3) Join the dimensions

//TODO Dimension joins with async I/O
//Join DIM_SKU_INFO
SingleOutputStreamOperator<TradeTrademarkCategoryUserSpuOrderBean> withSkuDS = AsyncDataStream.unorderedWait(
        skuUserOrderDS,
        new DimAsyncFunction<TradeTrademarkCategoryUserSpuOrderBean>("DIM_SKU_INFO") {
            @Override
            public String getKey(TradeTrademarkCategoryUserSpuOrderBean input) {
                return input.getSkuId();
            }

            @Override
            public void join(TradeTrademarkCategoryUserSpuOrderBean input, JSONObject dimInfo) {
                if (dimInfo != null) {
                    input.setSpuId(dimInfo.getString("SPU_ID"));
                    input.setTrademarkId(dimInfo.getString("TM_ID"));
                    input.setCategory3Id(dimInfo.getString("CATEGORY3_ID"));
                }
            }
        },
        60,
        TimeUnit.SECONDS
);
//Join DIM_SPU_INFO (see the sketch below)

//Join DIM_BASE_TRADEMARK

//Join DIM_BASE_CATEGORY3

//Join DIM_BASE_CATEGORY2

//Join DIM_BASE_CATEGORY1
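The remaining joins all follow the same AsyncDataStream.unorderedWait + DimAsyncFunction pattern as DIM_SKU_INFO, each consuming the stream produced by the previous step. As an illustration, the DIM_SPU_INFO step might look like the sketch below (the SPU_NAME column name is an assumption based on the spu_info source table):

//Join DIM_SPU_INFO (sketch)
SingleOutputStreamOperator<TradeTrademarkCategoryUserSpuOrderBean> withSpuDS = AsyncDataStream.unorderedWait(
        withSkuDS,
        new DimAsyncFunction<TradeTrademarkCategoryUserSpuOrderBean>("DIM_SPU_INFO") {
            @Override
            public String getKey(TradeTrademarkCategoryUserSpuOrderBean input) {
                return input.getSpuId();
            }

            @Override
            public void join(TradeTrademarkCategoryUserSpuOrderBean input, JSONObject dimInfo) {
                if (dimInfo != null) {
                    input.setSpuName(dimInfo.getString("SPU_NAME"));
                }
            }
        },
        60,
        TimeUnit.SECONDS
);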

Group, window, and aggregate

//TODO Extract the event time and generate watermarks
SingleOutputStreamOperator<TradeTrademarkCategoryUserSpuOrderBean> tradeTrademarkCategoryUserSpuOrderWithWmDS =
        withCategory1DS.assignTimestampsAndWatermarks(
                WatermarkStrategy.<TradeTrademarkCategoryUserSpuOrderBean>forBoundedOutOfOrderness(Duration.ofSeconds(2))
                        .withTimestampAssigner(new SerializableTimestampAssigner<TradeTrademarkCategoryUserSpuOrderBean>() {
                            @Override
                            public long extractTimestamp(TradeTrademarkCategoryUserSpuOrderBean element, long l) {
                                return element.getTs();
                            }
                        }));
//TODO Key by the full dimension combination, then window and aggregate
KeyedStream<TradeTrademarkCategoryUserSpuOrderBean, String> keyedStream = tradeTrademarkCategoryUserSpuOrderWithWmDS.keyBy(new KeySelector<TradeTrademarkCategoryUserSpuOrderBean, String>() {
    @Override
    public String getKey(TradeTrademarkCategoryUserSpuOrderBean value) throws Exception {
        return value.getUserId() + "-" +
                value.getCategory1Id() + "-" +
                value.getCategory1Name() + "-" +
                value.getCategory2Id() + "-" +
                value.getCategory2Name() + "-" +
                value.getCategory3Id() + "-" +
                value.getCategory3Name() + "-" +
                value.getSpuId() + "-" +
                value.getSpuName() + "-" +
                value.getTrademarkId() + "-" +
                value.getTrademarkName();
    }
});
WindowedStream<TradeTrademarkCategoryUserSpuOrderBean, String, TimeWindow> windowedStream =
        keyedStream.window(TumblingEventTimeWindows.of(org.apache.flink.streaming.api.windowing.time.Time.seconds(10)));

SingleOutputStreamOperator<TradeTrademarkCategoryUserSpuOrderBean> resultDS = windowedStream.reduce(new ReduceFunction<TradeTrademarkCategoryUserSpuOrderBean>() {
    @Override
    public TradeTrademarkCategoryUserSpuOrderBean reduce(TradeTrademarkCategoryUserSpuOrderBean value1, TradeTrademarkCategoryUserSpuOrderBean value2) throws Exception {
        value1.setOrderAmount(value1.getOrderAmount() + value2.getOrderAmount());
        //Deduplicate orders by merging the order_id sets; the distinct order count is derived from the set size in the window function
        value1.getOrderIdSet().addAll(value2.getOrderIdSet());
        return value1;
    }
}, new WindowFunction<TradeTrademarkCategoryUserSpuOrderBean, TradeTrademarkCategoryUserSpuOrderBean, String, TimeWindow>() {
    @Override
    public void apply(String s, TimeWindow window, Iterable<TradeTrademarkCategoryUserSpuOrderBean> input, Collector<TradeTrademarkCategoryUserSpuOrderBean> out) throws Exception {
        TradeTrademarkCategoryUserSpuOrderBean orderBean = input.iterator().next();
        orderBean.setStt(DateFormatUtil.toYmdHms(window.getStart()));
        orderBean.setEdt(DateFormatUtil.toYmdHms(window.getEnd()));
        orderBean.setTs(System.currentTimeMillis());
        orderBean.setOrderCount((long) orderBean.getOrderIdSet().size());
        out.collect(orderBean);
    }
});

Write to ClickHouse

(1) Create the ClickHouse table

drop table if exists dws_trade_trademark_category_user_spu_order_window;
create table if not exists dws_trade_trademark_category_user_spu_order_window
(
    stt            DateTime,
    edt            DateTime,
    trademark_id   String,
    trademark_name String,
    category1_id   String,
    category1_name String,
    category2_id   String,
    category2_name String,
    category3_id   String,
    category3_name String,
    user_id        String,
    spu_id         String,
    spu_name       String,
    order_count    UInt64,
    order_amount   Decimal(38, 20),
    ts             UInt64
) engine = ReplacingMergeTree(ts)
      partition by toYYYYMMDD(stt)
      order by (stt, edt, trademark_id, trademark_name, category1_id,
                category1_name, category2_id, category2_name, category3_id, category3_name, user_id);

(2) Write to ClickHouse

//TODO Write out to ClickHouse
resultDS.addSink(MyClickHouseUtil.getClickHouseSink("insert into dws_trade_trademark_category_user_spu_order_window values(?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)"));

Trade Domain: Province Granularity Order Window Summary Table

Read order-detail data from Kafka, filter out null data, deduplicate by the unique key, compute the order count and order amount per province and window, and write the result to the ClickHouse province-granularity order window summary table.

Main tasks

1) Get the filtered OrderDetail stream

2) Convert the data structure

3) Group, window, and aggregate

4) Join the dimension table

5) Write to ClickHouse

Note: here the stream can be grouped, windowed, and aggregated first and only then joined with the province dimension, which greatly reduces the number of dimension lookups (see the sketch below).
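A minimal sketch of the aggregate-then-join idea, assuming a TradeProvinceOrderBean with provinceId, provinceName, orderIdSet, orderCount, orderAmount, stt, edt, and ts fields (the bean and stream names are assumptions):

//TODO Key by province_id, window, and aggregate before the province dimension join (sketch)
SingleOutputStreamOperator<TradeProvinceOrderBean> provinceReduceDS = provinceOrderDS
        .keyBy(TradeProvinceOrderBean::getProvinceId)
        .window(TumblingEventTimeWindows.of(org.apache.flink.streaming.api.windowing.time.Time.seconds(10)))
        .reduce(new ReduceFunction<TradeProvinceOrderBean>() {
            @Override
            public TradeProvinceOrderBean reduce(TradeProvinceOrderBean value1, TradeProvinceOrderBean value2) throws Exception {
                value1.getOrderIdSet().addAll(value2.getOrderIdSet());
                value1.setOrderAmount(value1.getOrderAmount() + value2.getOrderAmount());
                return value1;
            }
        }, new WindowFunction<TradeProvinceOrderBean, TradeProvinceOrderBean, String, TimeWindow>() {
            @Override
            public void apply(String provinceId, TimeWindow window, Iterable<TradeProvinceOrderBean> input, Collector<TradeProvinceOrderBean> out) throws Exception {
                TradeProvinceOrderBean bean = input.iterator().next();
                bean.setStt(DateFormatUtil.toYmdHms(window.getStart()));
                bean.setEdt(DateFormatUtil.toYmdHms(window.getEnd()));
                bean.setOrderCount((long) bean.getOrderIdSet().size());
                bean.setTs(System.currentTimeMillis());
                out.collect(bean);
            }
        });
//The province name is then filled in with AsyncDataStream.unorderedWait + DimAsyncFunction("DIM_BASE_PROVINCE"), as in the previous section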

Trade Domain: Payment Window Summary Table

Read the payment-success topic of the trade domain from Kafka and compute the number of unique payment-success users and of first-time payment-success users.

Main tasks

1) Filter out null data and deduplicate

2) Compute the unique and first-time payment-success user counts

3) Write to ClickHouse

Note: the deduplication here does not need a timer; keyed state can simply keep the first record per key, because the later steps do not need any of the columns filled in by the upstream left join.

//TODO Key the stream by the unique key (order_detail_id)
KeyedStream<JSONObject, String> keyedByDetailIdStream = jsonObjDS.keyBy(json -> json.getString("order_detail_id"));
//TODO Deduplicate with keyed state, keeping the first record
SingleOutputStreamOperator<JSONObject> filterDS = keyedByDetailIdStream.filter(new RichFilterFunction<JSONObject>() {
    private ValueState<String> valueState;
    @Override
    public void open(Configuration parameters) throws Exception {
        ValueStateDescriptor<String> descriptor = new ValueStateDescriptor<>("value", String.class);
        StateTtlConfig stateTtlConfig = new StateTtlConfig.Builder(Time.seconds(5))
                .setUpdateType(StateTtlConfig.UpdateType.OnReadAndWrite)
                .build();
        //Set a state TTL: the duplicate records from upstream arrive back-to-back, so a small TTL is enough
        descriptor.enableTimeToLive(stateTtlConfig);
        valueState = getRuntimeContext().getState(descriptor);
    }

    @Override
    public boolean filter(JSONObject value) throws Exception {
        String state = valueState.value();
        if (state == null) {
            valueState.update("1");
            return true;
        } else {
            return false;
        }
    }
});

Trade Domain: Trademark-Category-User Granularity Refund Window Summary Table

Read refund-detail data from Kafka, filter out null data, deduplicate by the unique key, join dimension information, group by the dimensions, compute the refund count per dimension combination and window, and write the result to the ClickHouse trademark-category-user refund window summary table.

Main tasks

1) Filter out null data and deduplicate by the unique key

2) Convert the data structure

3) Join the dimension tables

4) Group, window, and aggregate

5) Write to ClickHouse (a skeleton sketch follows)
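No code is given for this table in the original post; the skeleton mirrors the order-window job. For instance, the bean conversion could look roughly like the sketch below (the TradeTrademarkCategoryUserRefundBean and its fields are assumptions):

//TODO Convert refund-detail records to beans; refunds are deduplicated via an order_id set, as in the order job (sketch)
SingleOutputStreamOperator<TradeTrademarkCategoryUserRefundBean> refundBeanDS = refundJsonObjDS.map(json -> {
    HashSet<String> orderIds = new HashSet<>();
    orderIds.add(json.getString("order_id"));
    return TradeTrademarkCategoryUserRefundBean.builder()
            .skuId(json.getString("sku_id"))
            .userId(json.getString("user_id"))
            .orderIdSet(orderIds)
            .ts(DateFormatUtil.toTs(json.getString("create_time"), true))
            .build();
});
//Then: async-join DIM_SKU_INFO / DIM_BASE_TRADEMARK / DIM_BASE_CATEGORY*, key by the trademark/category/user
//fields, open 10s tumbling event-time windows, set refundCount from the merged orderIdSet size,
//and sink to dws_trade_trademark_category_user_refund_window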

ADS

Traffic topic

Traffic statistics by channel

Period      | Granularity | Metric                           | Description
Current day | Channel     | Unique visitors                  | number of distinct visitors
Current day | Channel     | Total sessions                   | total number of sessions
Current day | Channel     | Average pages viewed per session | average number of pages viewed per session
Current day | Channel     | Average session duration         | average dwell time per session
Current day | Channel     | Bounce rate                      | share of sessions that viewed only one page

select toYYYYMMDD(stt)      dt,
       ch,
       sum(uv_ct)           uv_ct,
       sum(sv_ct)           sv_ct,
       sum(pv_ct) / sv_ct   pv_per_session,
       sum(dur_sum) / sv_ct dur_per_session,
       sum(uj_ct) / sv_ct   uj_rate
from dws_traffic_channel_page_view_window
where toYYYYMMDD(stt) = toYYYYMMDD(cast('2022-02-21 12:00:00' as DateTime))
group by toYYYYMMDD(stt), ch;

Keyword statistics

Period      | Granularity | Metric        | Description
Current day | Keyword     | Keyword score | score weighted by source type and frequency

select toYYYYMMDD(stt) dt,
       keyword,
       sum(keyword_count * multiIf(
               source = 'SEARCH', 10,
               source = 'ORDER', 5,
               source = 'CART', 2,
               source = 'CLICK', 1, 0
           ))          keyword_count
from dws_traffic_source_keyword_page_view_window
where toYYYYMMDD(stt) = toYYYYMMDD(cast('2022-06-14 12:00:00' as DateTime))
group by toYYYYMMDD(stt), keyword;

User topic

User change statistics

Period      | Metric               | Description
Current day | Returning user count | users who were active before, went inactive for a period (churned), and became active again today; this metric counts the total number of such returning users

select toYYYYMMDD(stt) dt,
       sum(back_ct)    back_ct
from dws_user_user_login_window
where toYYYYMMDD(stt) = toYYYYMMDD(cast('2022-06-14 12:00:00' as DateTime))
group by toYYYYMMDD(stt);

New and active user statistics

Period      | Metric            | Description
Current day | New user count    |
Current day | Active user count |

select t1.dt,
       uu_ct,
       register_ct
from (select toYYYYMMDD(stt) dt,
             sum(uu_ct)      uu_ct
      from dws_user_user_login_window
      where toYYYYMMDD(stt) = toYYYYMMDD(cast('2022-06-14 12:00:00' as DateTime))
      group by toYYYYMMDD(stt)) t1
         full outer join
     (
         select toYYYYMMDD(stt)  dt,
                sum(register_ct) register_ct
         from dws_user_user_register_window
         where toYYYYMMDD(stt) = toYYYYMMDD(cast('2022-02-21 12:00:00' as DateTime))
         group by toYYYYMMDD(stt)) t2
     on t1.dt = t2.dt;

User behavior funnel analysis

Period      | Metric                       | Description
Current day | Home page visitors           |
Current day | Product detail page visitors |
Current day | Cart-add users               |
Current day | Order users                  |
Current day | Payment users                | users who paid successfully

(1) Daily home page visitors and daily product detail page visitors

select toYYYYMMDD(stt)        dt,
       sum(home_uv_ct)        home_uv_count,
       sum(good_detail_uv_ct) good_detail_uv_ct
from dws_traffic_page_view_window
where toYYYYMMDD(stt) = toYYYYMMDD(cast('2022-06-14 12:00:00' as DateTime))
group by toYYYYMMDD(stt);

(2) Daily cart-add users

select toYYYYMMDD(stt)     dt,
       sum(cart_add_uu_ct) cart_add_uu_ct
from dws_trade_cart_add_uu_window
where toYYYYMMDD(stt) = toYYYYMMDD(cast('2022-06-14 12:00:00' as DateTime))
group by toYYYYMMDD(stt);

(3) Daily order users

select toYYYYMMDD(stt)              dt,
       sum(order_unique_user_count) order_uu_count
from dws_trade_order_window
where toYYYYMMDD(stt) = toYYYYMMDD(cast('2022-02-21 12:00:00' as DateTime))
group by toYYYYMMDD(stt);

(4) Daily payment users

select toYYYYMMDD(stt)                    dt,
       sum(payment_suc_unique_user_count) payment_suc_uu_count
from dws_trade_payment_suc_window
where toYYYYMMDD(stt) = toYYYYMMDD(cast('2022-02-21 12:00:00' as DateTime))
group by toYYYYMMDD(stt);

New trading user statistics

Period      | Metric                   | Description
Current day | First-time order users   |
Current day | First-time payment users |

select dt,
       order_new_user_count,
       pay_suc_new_user_count
from (select toYYYYMMDD(stt)           dt,
             sum(order_new_user_count) order_new_user_count
      from dws_trade_order_window
      where toYYYYMMDD(stt) = toYYYYMMDD(cast('2022-02-21 12:00:00' as DateTime))
      group by toYYYYMMDD(stt)) oct
         full outer join
     (select toYYYYMMDD(stt)             dt,
             sum(payment_new_user_count) pay_suc_new_user_count
      from dws_trade_payment_suc_window
      where toYYYYMMDD(stt) = toYYYYMMDD(cast('2022-02-21 12:00:00' as DateTime))
      group by toYYYYMMDD(stt)) pay_suc
     on oct.dt = pay_suc.dt;

Product topic

Trade statistics by trademark

Period      | Granularity | Metric            | Description
Current day | Trademark   | Order count       |
Current day | Trademark   | Order user count  |
Current day | Trademark   | Refund count      |
Current day | Trademark   | Refund user count |

select dt,
       trademark_id,
       trademark_name,
       order_count,
       uu_count,
       order_amount,
       refund_count,
       refund_uu_count
from (select toYYYYMMDD(stt)         dt,
             trademark_id,
             trademark_name,
             sum(order_count)        order_count,
             count(distinct user_id) uu_count,
             sum(order_amount)       order_amount
      from dws_trade_trademark_category_user_spu_order_window
      where toYYYYMMDD(stt) = toYYYYMMDD(cast('2022-02-21 12:00:00' as DateTime))
      group by toYYYYMMDD(stt), trademark_id, trademark_name) oct
         full outer join
     (select toYYYYMMDD(stt)         dt,
             trademark_id,
             trademark_name,
             sum(refund_count)       refund_count,
             count(distinct user_id) refund_uu_count
      from dws_trade_trademark_category_user_refund_window
      where toYYYYMMDD(stt) = toYYYYMMDD(cast('2022-02-21 12:00:00' as DateTime))
      group by toYYYYMMDD(stt), trademark_id, trademark_name) rct
     on oct.dt = rct.dt
         and oct.trademark_id = rct.trademark_id;

Trade statistics by category

Period      | Granularity | Metric            | Description
Current day | Category    | Order count       |
Current day | Category    | Order user count  |
Current day | Category    | Order amount      |
Current day | Category    | Refund count      |
Current day | Category    | Refund user count |

select dt,
       category1_id,
       category1_name,
       category2_id,
       category2_name,
       category3_id,
       category3_name,
       order_count,
       uu_count,
       order_amount,
       refund_count,
       refund_uu_count
from (select toYYYYMMDD(stt)         dt,
             category1_id,
             category1_name,
             category2_id,
             category2_name,
             category3_id,
             category3_name,
             sum(order_count)        order_count,
             count(distinct user_id) uu_count,
             sum(order_amount)       order_amount
      from dws_trade_trademark_category_user_spu_order_window
      where toYYYYMMDD(stt) = toYYYYMMDD(cast('2022-02-21 12:00:00' as DateTime))
      group by toYYYYMMDD(stt),
               category1_id,
               category1_name,
               category2_id,
               category2_name,
               category3_id,
               category3_name) oct
         full outer join
     (select toYYYYMMDD(stt)         dt,
             category1_id,
             category1_name,
             category2_id,
             category2_name,
             category3_id,
             category3_name,
             sum(refund_count)       refund_count,
             count(distinct user_id) refund_uu_count
      from dws_trade_trademark_category_user_refund_window
      where toYYYYMMDD(stt) = toYYYYMMDD(cast('2022-02-21 12:00:00' as DateTime))
      group by toYYYYMMDD(stt),
               category1_id,
               category1_name,
               category2_id,
               category2_name,
               category3_id,
               category3_name) rct
     on oct.dt = rct.dt
         and oct.category1_id = rct.category1_id
         and oct.category2_id = rct.category2_id
         and oct.category3_id = rct.category3_id;

Trade statistics by SPU

select toYYYYMMDD(stt)         dt,
       spu_id,
       spu_name,
       sum(order_count)        order_count,
       count(distinct user_id) uu_count,
       sum(order_amount)       order_amount
from dws_trade_trademark_category_user_spu_order_window
where toYYYYMMDD(stt) = toYYYYMMDD(cast('2022-02-21 12:00:00' as DateTime))
group by toYYYYMMDD(stt), spu_id, spu_name;

Trade topic

Overall trade statistics

Period      | Metric             | Description
Current day | Total order amount | final (post-discount) order amount
Current day | Order count        |
Current day | Order user count   |
Current day | Refund count       |
Current day | Refund user count  |

(1) Daily total order amount

select toYYYYMMDD(stt)   dt,
       sum(order_amount) order_total_amount
from dws_trade_province_order_window
where toYYYYMMDD(stt) = toYYYYMMDD(cast('2022-02-21 12:00:00' as DateTime))
group by toYYYYMMDD(stt);

(2) Daily order count and order user count

select toYYYYMMDD(stt)         dt,
       sum(order_count)        order_count,
       count(distinct user_id) order_uu_count
from dws_trade_trademark_category_user_spu_order_window
where toYYYYMMDD(stt) = toYYYYMMDD(cast('2022-02-21 12:00:00' as DateTime))
group by toYYYYMMDD(stt);

(3) Daily refund count and refund user count

select toYYYYMMDD(stt)         dt,
       sum(refund_count)       refund_count,
       count(distinct user_id) refund_uu_count
from dws_trade_trademark_category_user_refund_window
where toYYYYMMDD(stt) = toYYYYMMDD(cast('2022-02-21 12:00:00' as DateTime))
group by toYYYYMMDD(stt);

Trade statistics by province

Period      | Granularity | Metric       | Description
Current day | Province    | Order count  |
Current day | Province    | Order amount |

select toYYYYMMDD(stt)   dt,
       province_id,
       province_name,
       sum(order_count)  order_count,
       sum(order_amount) order_amount
from dws_trade_province_order_window
where toYYYYMMDD(stt) = toYYYYMMDD(cast('2022-02-21 12:00:00' as DateTime))
group by toYYYYMMDD(stt), province_id, province_name;

Coupon topic

Daily coupon subsidy rate

Period      | Granularity | Metric       | Description
Current day | Coupon      | Subsidy rate | total coupon reduction amount of order details that used a coupon / total original amount

select toYYYYMMDD(stt)                                     dt,
       sum(order_coupon_reduce_amount)                     coupon_reduce_amount,
       sum(order_origin_total_amount)                      origin_total_amount,
       round(round(toFloat64(coupon_reduce_amount), 5) /
             round(toFloat64(origin_total_amount), 5), 20) coupon_subsidy_rate
from dws_trade_order_window
where toYYYYMMDD(stt) = toYYYYMMDD(cast('2022-02-21 12:00:00' as DateTime))
group by toYYYYMMDD(stt);

Activity topic

Daily promotion subsidy rate

Period      | Granularity | Metric       | Description
Current day | Activity    | Subsidy rate | total activity reduction amount of order details that joined a promotion / total original amount

select toYYYYMMDD(stt)                                     dt,
       sum(order_activity_reduce_amount)                   activity_reduce_amount,
       sum(order_origin_total_amount)                      origin_total_amount,
       round(round(toFloat64(activity_reduce_amount), 5) /
             round(toFloat64(origin_total_amount), 5), 20) subsidyRate
from dws_trade_order_window
where toYYYYMMDD(stt) = toYYYYMMDD(cast('2022-02-21 12:00:00' as DateTime))
group by toYYYYMMDD(stt);