Notes on a Self-Built Offline Data Warehouse Based on Hadoop + Hive + Spark

Preface

1. What is an offline data warehouse, why build one, and why can't MySQL serve as its storage?

First, be clear about the goal: we are not doing OLTP work for a business system; we want to run OLAP over massive amounts of data. MySQL might just about manage the distributed storage of that data, but it cannot handle the distributed computation, which is exactly what Hadoop's HDFS and MapReduce were built for.

To raise developer productivity, Hive was then created on top of that: you write SQL, which is parsed and automatically executed as MapReduce jobs.

MapReduce's disk-spill I/O overhead in turn gave rise to the Spark engine. With Hive on Spark, MapReduce is replaced; DAG (directed acyclic graph) execution plans and in-memory computation speed up the distributed processing considerably.
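If Hive on Spark is already configured on the cluster, switching the execution engine is just a setting; a minimal sketch (run inside a Hive session, or put it in hive-site.xml to make it the default):

    -- assumes the Spark dependencies for Hive on Spark are already deployed on the cluster
    set hive.execution.engine=spark;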

Spark SQL goes one step further and replaces Hive's SQL parsing entirely, using only Hive's metastore.

Looking back: Spark relies on Hadoop's YARN for resource scheduling. Could you hook Spark up to MySQL for distributed computation? In theory yes, but in practice nobody does that.

 

OLTP (Online Transaction Processing): typical OLTP operations are simple, mostly inserting, updating, deleting and querying individual records in the database; the actors are usually the product's end users.

OLAP (Online Analytical Processing): analyzing the data in the database to draw conclusions, e.g. reports for management, user-behavior statistics for market development, or aggregated results along different dimensions. The actors are usually operations, sales and marketing teams.

 

2. In an offline warehouse, what are incremental sync and full (offline) sync?

Incremental sync: large tables with relatively small daily changes; generally used for fact tables; synced in real time.

Full (offline) sync: small tables where a large share of the rows may change; generally used for dimension tables; synced once a day on a schedule.

In the offline warehouse, the load from HDFS into the Hive ODS layer is always a scheduled job, so seen as a whole, everything in the offline warehouse is a T+1 offline sync.

 

Overall architecture diagram

 

Service inventory

| Component | hadoop1 (172.23.112.160) | hadoop2 (172.23.112.161) | hadoop3 (172.23.112.162) |
| --- | --- | --- | --- |
| HDFS | NameNode, DataNode (NameNode Web UI: http://hadoop1:9870/dfshealth.html#tab-overview, DataNode Web UI port 9864) | DataNode | SecondaryNameNode, DataNode (SecondaryNameNode Web UI: http://hadoop3:9868/status.html) |
| YARN | NodeManager (port 8042) | ResourceManager, NodeManager (http://hadoop2:8088/cluster) | NodeManager |
| JobHistory | historyserver (http://hadoop1:19888/jobhistory) | | |
| Log mock | user_action.jar | user_action.jar | |
| Flume real-time job to Kafka | Application | Application | |
| Flume real-time job to HDFS | | | Application-log |
| Business-data mock | business_db.jar | | |
| Maxwell real-time job to Kafka | Maxwell | | |
| Maxwell offline increment + Flume real-time job | | | Application-db |
| DataX offline full sync | daily scheduled jobs | | |
| Hive | metaStoreServer, hive2Server | metaStoreServer, hive2Server | metaStoreServer, hive2Server |
| Spark / Spark SQL | port 4040 | | |
| DolphinScheduler | MasterServer, WorkerServer, LoggerServer, AlertServer, ApiApplicationServer (http://hadoop1:12345/dolphinscheduler) | WorkerServer, LoggerServer | WorkerServer, LoggerServer |

Simplified data flow diagram

 

Walkthrough by step

Step 1.1 in the diagram, mysql -> maxwell -> kafka:

Maxwell tails the business MySQL's binlog (row mode) and streams every insert/update/delete as a JSON record into a designated Kafka topic in real time. This is the incremental sync.

The full Maxwell configuration is not repeated here. Remember to initialize the maxwell metadata database (it mainly stores the binlog position being tracked); the input is a MySQL instance with binlog enabled and the output is the Kafka cluster.

producer=kafka
kafka.bootstrap.servers=kafka1:9092,kafka2:9092
# kafka topic
kafka_topic=topic_db

# mysql login info
host=hadoop1
user=root
password=root123456
jdbc_options=useSSL=false&serverTimezone=Asia/Shanghai
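For reference, a Maxwell change record published to topic_db looks roughly like the following (an illustrative sample, values made up); the type, ts, data and old fields are what the _inc ODS tables further below parse with the JSON SerDe:

    {"database":"gmall","table":"order_info","type":"update","ts":1592191246,"xid":1281,"commit":true,
     "data":{"id":"1","order_status":"1002","user_id":"25","total_amount":299.00,"create_time":"2020-06-15 10:00:46"},
     "old":{"order_status":"1001"}}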

 

Step 1.2 in the diagram, kafka -> flume -> hdfs:

Flume consumes the Kafka topic and delivers the data to the designated HDFS directory.

echo " -------- starting business-data flume on hadoop3 -------"
  ssh hadoop3 "nohup /opt/module/flume/bin/flume-ng agent -n a1 -c /opt/module/flume/conf -f /opt/module/flume/job/kafka_to_hdfs_db.conf >/dev/null 2>&1 &"
a1.sources = r1
a1.channels = c1
a1.sinks = k1

a1.sources.r1.type = org.apache.flume.source.kafka.KafkaSource
a1.sources.r1.batchSize = 5000
a1.sources.r1.batchDurationMillis = 2000
a1.sources.r1.kafka.bootstrap.servers = kafka1:9092,kafka2:9092
a1.sources.r1.kafka.topics = topic_db
a1.sources.r1.kafka.consumer.group.id = flume
a1.sources.r1.setTopicHeader = true
a1.sources.r1.topicHeader = topic
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = com.zfyz.flume.interceptor.TimeStampAndTableNameInterceptor$Builder

a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /opt/module/flume/checkpoint/behavior2
a1.channels.c1.dataDirs = /opt/module/flume/data/behavior2/
a1.channels.c1.maxFileSize = 2146435071
a1.channels.c1.capacity = 1000000
a1.channels.c1.keep-alive = 6

## sink1
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /origin_data/gmall/db/%{tableName}_inc/%Y-%m-%d
a1.sinks.k1.hdfs.filePrefix = db
a1.sinks.k1.hdfs.round = false


a1.sinks.k1.hdfs.rollInterval = 10
a1.sinks.k1.hdfs.rollSize = 134217728
a1.sinks.k1.hdfs.rollCount = 0


a1.sinks.k1.hdfs.fileType = CompressedStream
a1.sinks.k1.hdfs.codeC = gzip

## assemble
a1.sources.r1.channels = c1
a1.sinks.k1.channel= c1

 

The HDFS output is gzip-compressed and the path is split per day via /%Y-%m-%d.
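Combined with the tableName header set by the custom interceptor, each table's change stream lands in its own per-table, per-day directory, roughly like this (file names are illustrative):

    /origin_data/gmall/db/order_info_inc/2020-06-15/db.1592191246001.gz
    /origin_data/gmall/db/order_detail_inc/2020-06-15/db.1592191246002.gz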

 

Step 2.1 in the diagram, mysql -> dataX -> hdfs:

The tables that need full sync are read in one pass by DataX and written to the HDFS cluster in text format. This is the full (offline) sync.

DATAX_HOME=/opt/module/datax

# if a date is passed in, use it as do_date; otherwise default to yesterday
if [ -n "$2" ] ;then
    do_date=$2
else
    do_date=`date -d "-1 day" +%F`
fi

# handle the target path: create it if it does not exist, clear it if it does, so the sync job can be rerun safely
handle_targetdir() {
  hadoop fs -test -e $1
  if [[ $? -eq 1 ]]; then
    echo "路径$1不存在,正在创建......"
    hadoop fs -mkdir -p $1
  else
    echo "路径$1已经存在"
    fs_count=$(hadoop fs -count $1)
    content_size=$(echo $fs_count | awk '{print $3}')
    if [[ $content_size -eq 0 ]]; then
      echo "路径$1为空"
    else
      echo "路径$1不为空,正在清空......"
      hadoop fs -rm -r -f $1/*
    fi
  fi
}

# data sync
import_data() {
  datax_config=$1
  target_dir=$2

  handle_targetdir $target_dir
  python $DATAX_HOME/bin/datax.py -p"-Dtargetdir=$target_dir" $datax_config
}

# example invocation
import_data /opt/module/datax/job/import/gmall.activity_info.json /origin_data/gmall/db/activity_info_full/$do_date

gmall.activity_info.json

{
    "job":{
        "content":[
            {
                "writer":{
                    "parameter":{
                        "writeMode":"append",
                        "fieldDelimiter":"\t",
                        "column":[
                            {
                                "type":"bigint",
                                "name":"id"
                            },
                            {
                                "type":"string",
                                "name":"activity_name"
                            },
                            {
                                "type":"string",
                                "name":"activity_type"
                            },
                            {
                                "type":"string",
                                "name":"activity_desc"
                            },
                            {
                                "type":"string",
                                "name":"start_time"
                            },
                            {
                                "type":"string",
                                "name":"end_time"
                            },
                            {
                                "type":"string",
                                "name":"create_time"
                            }
                        ],
                        "path":"${targetdir}",
                        "fileType":"text",
                        "defaultFS":"hdfs://hadoop1:8020",
                        "compress":"gzip",
                        "fileName":"activity_info"
                    },
                    "name":"hdfswriter"
                },
                "reader":{
                    "parameter":{
                        "username":"root",
                        "column":[
                            "id",
                            "activity_name",
                            "activity_type",
                            "activity_desc",
                            "start_time",
                            "end_time",
                            "create_time"
                        ],
                        "connection":[
                            {
                                "table":[
                                    "activity_info"
                                ],
                                "jdbcUrl":[
                                    "jdbc:mysql://hadoop1:3306/gmall"
                                ]
                            }
                        ],
                        "password":"root123456",
                        "splitPk":""
                    },
                    "name":"mysqlreader"
                }
            }
        ],
        "setting":{
            "speed":{
                "channel":3
            },
            "errorLimit":{
                "record":0,
                "percentage":0.02
            }
        }
    }
}


The DataX sync configs can all be generated in one go by a Python script.

gen_import_config.py

# coding=utf-8
import json
import getopt
import os
import sys
import MySQLdb

# MySQL connection settings; adjust to your environment
mysql_host = "hadoop1"
mysql_port = "3306"
mysql_user = "root"
mysql_passwd = "root123456"

# HDFS NameNode settings; adjust to your environment
hdfs_nn_host = "hadoop1"
hdfs_nn_port = "8020"

# target directory for the generated config files; adjust as needed
output_path = "/opt/module/datax/job/import"


def get_connection():
    return MySQLdb.connect(host=mysql_host, port=int(mysql_port), user=mysql_user, passwd=mysql_passwd)


def get_mysql_meta(database, table):
    connection = get_connection()
    cursor = connection.cursor()
    sql = "SELECT COLUMN_NAME,DATA_TYPE from information_schema.COLUMNS WHERE TABLE_SCHEMA=%s AND TABLE_NAME=%s ORDER BY ORDINAL_POSITION"
    cursor.execute(sql, [database, table])
    fetchall = cursor.fetchall()
    cursor.close()
    connection.close()
    return fetchall


def get_mysql_columns(database, table):
    return list(map(lambda x: x[0], get_mysql_meta(database, table)))


def get_hive_columns(database, table):
    def type_mapping(mysql_type):
        mappings = {
            "bigint": "bigint",
            "int": "bigint",
            "smallint": "bigint",
            "tinyint": "bigint",
            "decimal": "string",
            "double": "double",
            "float": "float",
            "binary": "string",
            "char": "string",
            "varchar": "string",
            "datetime": "string",
            "time": "string",
            "timestamp": "string",
            "date": "string",
            "text": "string"
        }
        return mappings[mysql_type]

    meta = get_mysql_meta(database, table)
    return list(map(lambda x: {"name": x[0], "type": type_mapping(x[1].lower())}, meta))


def generate_json(source_database, source_table):
    job = {
        "job": {
            "setting": {
                "speed": {
                    "channel": 3
                },
                "errorLimit": {
                    "record": 0,
                    "percentage": 0.02
                }
            },
            "content": [{
                "reader": {
                    "name": "mysqlreader",
                    "parameter": {
                        "username": mysql_user,
                        "password": mysql_passwd,
                        "column": get_mysql_columns(source_database, source_table),
                        "splitPk": "",
                        "connection": [{
                            "table": [source_table],
                            "jdbcUrl": ["jdbc:mysql://" + mysql_host + ":" + mysql_port + "/" + source_database]
                        }]
                    }
                },
                "writer": {
                    "name": "hdfswriter",
                    "parameter": {
                        "defaultFS": "hdfs://" + hdfs_nn_host + ":" + hdfs_nn_port,
                        "fileType": "text",
                        "path": "${targetdir}",
                        "fileName": source_table,
                        "column": get_hive_columns(source_database, source_table),
                        "writeMode": "append",
                        "fieldDelimiter": "\t",
                        "compress": "gzip"
                    }
                }
            }]
        }
    }
    if not os.path.exists(output_path):
        os.makedirs(output_path)
    with open(os.path.join(output_path, ".".join([source_database, source_table, "json"])), "w") as f:
        json.dump(job, f)


def main(args):
    source_database = ""
    source_table = ""

    options, arguments = getopt.getopt(args, '-d:-t:', ['sourcedb=', 'sourcetbl='])
    for opt_name, opt_value in options:
        if opt_name in ('-d', '--sourcedb'):
            source_database = opt_value
        if opt_name in ('-t', '--sourcetbl'):
            source_table = opt_value

    generate_json(source_database, source_table)


if __name__ == '__main__':
    main(sys.argv[1:])
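A minimal usage sketch of the generator above (a wrapper script would simply loop over every table that needs full sync):

    # generate the DataX config for table activity_info in database gmall
    python gen_import_config.py -d gmall -t activity_info
    # output: /opt/module/datax/job/import/gmall.activity_info.json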

 

Step 3.1 in the diagram, logFile -> flume -> kafka:

Flume tails the corresponding event-tracking log directory and ships the JSON log records to a Kafka topic in real time. This is the incremental sync.

echo " -------- starting collection flume on $i -------"
                ssh $i "nohup /opt/module/flume/bin/flume-ng agent -n a1 -c /opt/module/flume/conf/ -f /opt/module/flume/job/file_to_kafka.conf >/dev/null 2>&1 &"
# define the components
a1.sources = r1
a1.channels = c1

# source config
a1.sources.r1.type = TAILDIR
a1.sources.r1.filegroups = f1
a1.sources.r1.filegroups.f1 = /opt/module/applog/log/app.*
a1.sources.r1.positionFile = /opt/module/flume/taildir_position.json
a1.sources.r1.interceptors =  i1
a1.sources.r1.interceptors.i1.type = com.zfyz.flume.interceptor.ETLInterceptor$Builder

# channel config
a1.channels.c1.type = org.apache.flume.channel.kafka.KafkaChannel
a1.channels.c1.kafka.bootstrap.servers = kafka1:9092,kafka2:9092
a1.channels.c1.kafka.topic = topic_log
a1.channels.c1.parseAsFlumeEvent = false

# assemble
a1.sources.r1.channels = c1

 

With a KafkaChannel no sink needs to be configured; the channel itself writes the events to Kafka.

 

Step 3.2 in the diagram, kafka -> flume -> hdfs:

Flume consumes the Kafka topic and delivers the data to the designated HDFS directory.

echo " -------- starting log-data flume on hadoop3 -------"
        ssh hadoop3 "nohup /opt/module/flume/bin/flume-ng agent -n a1 -c /opt/module/flume/conf -f /opt/module/flume/job/kafka_to_hdfs_log.conf >/dev/null 2>&1 &"
# define the components
a1.sources=r1
a1.channels=c1
a1.sinks=k1

# source config
a1.sources.r1.type = org.apache.flume.source.kafka.KafkaSource
a1.sources.r1.batchSize = 5000
a1.sources.r1.batchDurationMillis = 2000
a1.sources.r1.kafka.bootstrap.servers = kafka1:9092,kafka2:9092,kafka3:9092
a1.sources.r1.kafka.topics=topic_log
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = com.zfyz.flume.interceptor.TimeStampInterceptor$Builder

# channel config
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /opt/module/flume/checkpoint/behavior1
a1.channels.c1.dataDirs = /opt/module/flume/data/behavior1
a1.channels.c1.maxFileSize = 2146435071
a1.channels.c1.capacity = 1000000
a1.channels.c1.keep-alive = 6

# sink config
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /origin_data/gmall/log/topic_log/%Y-%m-%d
a1.sinks.k1.hdfs.filePrefix = log
a1.sinks.k1.hdfs.round = false


a1.sinks.k1.hdfs.rollInterval = 10
a1.sinks.k1.hdfs.rollSize = 134217728
a1.sinks.k1.hdfs.rollCount = 0

# control the output file type
a1.sinks.k1.hdfs.fileType = CompressedStream
a1.sinks.k1.hdfs.codeC = gzip

# assemble
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

 

Again the output is gzip-compressed and the path rolls over by date.

Warehouse layering with Hive on Spark

Layered warehouse development normally proceeds from the bottom up: the product team raises reporting requirements, which are then analyzed and decomposed layer by layer into fact and summary tables. To keep things easy to follow, and to illustrate the whole through one small example, we go the other way here and work backwards from the ADS layer down to ODS, as follows.

 

 

DDL and daily-load statements

============================================================================================ads============================================================================================
Trade subject area
=============Per-province order statistics ads_order_by_province
    DDL
    DROP TABLE IF EXISTS ads_order_by_province;
    CREATE EXTERNAL TABLE ads_order_by_province
    (
        `dt`                 STRING COMMENT '统计日期',
        `recent_days`        BIGINT COMMENT '最近天数,1:最近1天,7:最近7天,30:最近30天',
        `province_id`        STRING COMMENT '省份ID',
        `province_name`      STRING COMMENT '省份名称',
        `area_code`          STRING COMMENT '地区编码',
        `iso_code`           STRING COMMENT '国际标准地区编码',
        `iso_code_3166_2`    STRING COMMENT '国际标准地区编码',
        `order_count`        BIGINT COMMENT '订单数',
        `order_total_amount` DECIMAL(16, 2) COMMENT '订单金额'
    ) COMMENT '各地区订单统计'
        ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
        LOCATION '/warehouse/gmall/ads/ads_order_by_province/';

    Data load
    insert overwrite table ads_order_by_province
    select * from ads_order_by_province
    union
    select
        '2020-06-14' dt,
        1 recent_days,
        province_id,
        province_name,
        area_code,
        iso_code,
        iso_3166_2,
        order_count_1d,
        order_total_amount_1d
    from dws_trade_province_order_1d
    where dt='2020-06-14'
    union
    select
        '2020-06-14' dt,
        recent_days,
        province_id,
        province_name,
        area_code,
        iso_code,
        iso_3166_2,
        sum(order_count),
        sum(order_total_amount)
    from
    (
        select
            recent_days,
            province_id,
            province_name,
            area_code,
            iso_code,
            iso_3166_2,
            case recent_days
                when 7 then order_count_7d
                when 30 then order_count_30d
            end order_count,
            case recent_days
                when 7 then order_total_amount_7d
                when 30 then order_total_amount_30d
            end order_total_amount
        from dws_trade_province_order_nd lateral view explode(array(7,30)) tmp as recent_days
        where dt='2020-06-14'
    )t1
    group by recent_days,province_id,province_name,area_code,iso_code,iso_3166_2;

Dependencies
dws_trade_province_order_1d
dws_trade_province_order_nd 
============================================================================================dws============================================================================================

=============Trade domain, province-grained, last-1-day order summary table dws_trade_province_order_1d
    DDL
    DROP TABLE IF EXISTS dws_trade_province_order_1d;
    CREATE EXTERNAL TABLE dws_trade_province_order_1d
    (
        `province_id`               STRING COMMENT '省份id',
        `province_name`             STRING COMMENT '省份名称',
        `area_code`                 STRING COMMENT '地区编码',
        `iso_code`                  STRING COMMENT '旧版ISO-3166-2编码',
        `iso_3166_2`                STRING COMMENT '新版ISO-3166-2编码',
        `order_count_1d`            BIGINT COMMENT '最近1日下单次数',
        `order_original_amount_1d`  DECIMAL(16, 2) COMMENT '最近1日下单原始金额',
        `activity_reduce_amount_1d` DECIMAL(16, 2) COMMENT '最近1日下单活动优惠金额',
        `coupon_reduce_amount_1d`   DECIMAL(16, 2) COMMENT '最近1日下单优惠券优惠金额',
        `order_total_amount_1d`     DECIMAL(16, 2) COMMENT '最近1日下单最终金额'
    ) COMMENT '交易域省份粒度订单最近1日汇总事实表'
        PARTITIONED BY (`dt` STRING)
        STORED AS ORC
        LOCATION '/warehouse/gmall/dws/dws_trade_province_order_1d'
        TBLPROPERTIES ('orc.compress' = 'snappy');
    Daily load
    insert overwrite table dws_trade_province_order_1d partition(dt='2020-06-15')
    select
        province_id,
        province_name,
        area_code,
        iso_code,
        iso_3166_2,
        order_count_1d,
        order_original_amount_1d,
        activity_reduce_amount_1d,
        coupon_reduce_amount_1d,
        order_total_amount_1d
    from
    (
        select
            province_id,
            count(distinct(order_id)) order_count_1d,
            sum(split_original_amount) order_original_amount_1d,
            sum(nvl(split_activity_amount,0)) activity_reduce_amount_1d,
            sum(nvl(split_coupon_amount,0)) coupon_reduce_amount_1d,
            sum(split_total_amount) order_total_amount_1d
        from dwd_trade_order_detail_inc
        where dt='2020-06-15'
        group by province_id
    )o
    left join
    (
        select
            id,
            province_name,
            area_code,
            iso_code,
            iso_3166_2
        from dim_province_full
        where dt='2020-06-15'
    )p
    on o.province_id=p.id;

=============Trade domain, province-grained, last-n-day order summary table dws_trade_province_order_nd
    DDL
    DROP TABLE IF EXISTS dws_trade_province_order_nd;
    CREATE EXTERNAL TABLE dws_trade_province_order_nd
    (
        `province_id`                STRING COMMENT '省份id',
        `province_name`              STRING COMMENT '省份名称',
        `area_code`                  STRING COMMENT '地区编码',
        `iso_code`                   STRING COMMENT '旧版ISO-3166-2编码',
        `iso_3166_2`                 STRING COMMENT '新版ISO-3166-2编码',
        `order_count_7d`             BIGINT COMMENT '最近7日下单次数',
        `order_original_amount_7d`   DECIMAL(16, 2) COMMENT '最近7日下单原始金额',
        `activity_reduce_amount_7d`  DECIMAL(16, 2) COMMENT '最近7日下单活动优惠金额',
        `coupon_reduce_amount_7d`    DECIMAL(16, 2) COMMENT '最近7日下单优惠券优惠金额',
        `order_total_amount_7d`      DECIMAL(16, 2) COMMENT '最近7日下单最终金额',
        `order_count_30d`            BIGINT COMMENT '最近30日下单次数',
        `order_original_amount_30d`  DECIMAL(16, 2) COMMENT '最近30日下单原始金额',
        `activity_reduce_amount_30d` DECIMAL(16, 2) COMMENT '最近30日下单活动优惠金额',
        `coupon_reduce_amount_30d`   DECIMAL(16, 2) COMMENT '最近30日下单优惠券优惠金额',
        `order_total_amount_30d`     DECIMAL(16, 2) COMMENT '最近30日下单最终金额'
    ) COMMENT '交易域省份粒度订单最近n日汇总事实表'
        PARTITIONED BY (`dt` STRING)
        STORED AS ORC
        LOCATION '/warehouse/gmall/dws/dws_trade_province_order_nd'
        TBLPROPERTIES ('orc.compress' = 'snappy');
    Data load
    insert overwrite table dws_trade_province_order_nd partition(dt='2020-06-14')
    select
        province_id,
        province_name,
        area_code,
        iso_code,
        iso_3166_2,
        sum(if(dt>=date_add('2020-06-14',-6),order_count_1d,0)),
        sum(if(dt>=date_add('2020-06-14',-6),order_original_amount_1d,0)),
        sum(if(dt>=date_add('2020-06-14',-6),activity_reduce_amount_1d,0)),
        sum(if(dt>=date_add('2020-06-14',-6),coupon_reduce_amount_1d,0)),
        sum(if(dt>=date_add('2020-06-14',-6),order_total_amount_1d,0)),
        sum(order_count_1d),
        sum(order_original_amount_1d),
        sum(activity_reduce_amount_1d),
        sum(coupon_reduce_amount_1d),
        sum(order_total_amount_1d)
    from dws_trade_province_order_1d
    where dt>=date_add('2020-06-14',-29)
    and dt<='2020-06-14'
    group by province_id,province_name,area_code,iso_code,iso_3166_2;

Dependencies
dwd_trade_order_detail_inc
dim_province_full
============================================================================================dwd============================================================================================
=============Trade domain order-placement transaction fact table dwd_trade_order_detail_inc
    DDL
    DROP TABLE IF EXISTS dwd_trade_order_detail_inc;
    CREATE EXTERNAL TABLE dwd_trade_order_detail_inc
    (
        `id`                    STRING COMMENT '编号',
        `order_id`              STRING COMMENT '订单id',
        `user_id`               STRING COMMENT '用户id',
        `sku_id`                STRING COMMENT '商品id',
        `province_id`           STRING COMMENT '省份id',
        `activity_id`           STRING COMMENT '参与活动id',
        `activity_rule_id`      STRING COMMENT '参与活动规则id',
        `coupon_id`             STRING COMMENT '使用优惠券id',
        `date_id`               STRING COMMENT '下单日期id',
        `create_time`           STRING COMMENT '下单时间',
        `source_id`             STRING COMMENT '来源编号',
        `source_type_code`      STRING COMMENT '来源类型编码',
        `source_type_name`      STRING COMMENT '来源类型名称',
        `sku_num`               BIGINT COMMENT '商品数量',
        `split_original_amount` DECIMAL(16, 2) COMMENT '原始价格',
        `split_activity_amount` DECIMAL(16, 2) COMMENT '活动优惠分摊',
        `split_coupon_amount`   DECIMAL(16, 2) COMMENT '优惠券优惠分摊',
        `split_total_amount`    DECIMAL(16, 2) COMMENT '最终价格分摊'
    ) COMMENT '交易域下单明细事务事实表'
        PARTITIONED BY (`dt` STRING)
        ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
        STORED AS ORC
        LOCATION '/warehouse/gmall/dwd/dwd_trade_order_detail_inc/'
        TBLPROPERTIES ('orc.compress' = 'snappy');

    Daily load
    insert overwrite table dwd_trade_order_detail_inc partition (dt='2020-06-15')
    select
        od.id,
        order_id,
        user_id,
        sku_id,
        province_id,
        activity_id,
        activity_rule_id,
        coupon_id,
        date_id,
        create_time,
        source_id,
        source_type,
        dic_name,
        sku_num,
        split_original_amount,
        split_activity_amount,
        split_coupon_amount,
        split_total_amount
    from
    (
        select
            data.id,
            data.order_id,
            data.sku_id,
            date_format(data.create_time, 'yyyy-MM-dd') date_id,
            data.create_time,
            data.source_id,
            data.source_type,
            data.sku_num,
            data.sku_num * data.order_price split_original_amount,
            data.split_total_amount,
            data.split_activity_amount,
            data.split_coupon_amount
        from ods_order_detail_inc
        where dt = '2020-06-15'
        and type = 'insert'
    ) od
    left join
    (
        select
            data.id,
            data.user_id,
            data.province_id
        from ods_order_info_inc
        where dt = '2020-06-15'
        and type = 'insert'
    ) oi
    on od.order_id = oi.id
    left join
    (
        select
            data.order_detail_id,
            data.activity_id,
            data.activity_rule_id
        from ods_order_detail_activity_inc
        where dt = '2020-06-15'
        and type = 'insert'
    ) act
    on od.id = act.order_detail_id
    left join
    (
        select
            data.order_detail_id,
            data.coupon_id
        from ods_order_detail_coupon_inc
        where dt = '2020-06-15'
        and type = 'insert'
    ) cou
    on od.id = cou.order_detail_id
    left join
    (
        select
            dic_code,
            dic_name
        from ods_base_dic_full
        where dt='2020-06-15'
        and parent_code='24'
    )dic
    on od.source_type=dic.dic_code;

Dependencies
ods_order_detail_inc
ods_order_info_inc
ods_order_detail_activity_inc
ods_order_detail_coupon_inc
ods_base_dic_full
============================================================================================dim============================================================================================

=============Region (province) dimension table dim_province_full
    DDL
    DROP TABLE IF EXISTS dim_province_full;
    CREATE EXTERNAL TABLE dim_province_full
    (
        `id`            STRING COMMENT 'id',
        `province_name` STRING COMMENT '省市名称',
        `area_code`     STRING COMMENT '地区编码',
        `iso_code`      STRING COMMENT '旧版ISO-3166-2编码,供可视化使用',
        `iso_3166_2`    STRING COMMENT '新版ISO-3166-2编码,供可视化使用',
        `region_id`     STRING COMMENT '地区id',
        `region_name`   STRING COMMENT '地区名称'
    ) COMMENT '地区维度表'
        PARTITIONED BY (`dt` STRING)
        STORED AS ORC
        LOCATION '/warehouse/gmall/dim/dim_province_full/'
        TBLPROPERTIES ('orc.compress' = 'snappy');
    Data load
    insert overwrite table dim_province_full partition(dt='2020-06-14')
    select
        province.id,
        province.name,
        province.area_code,
        province.iso_code,
        province.iso_3166_2,
        region_id,
        region_name
    from
    (
        select
            id,
            name,
            region_id,
            area_code,
            iso_code,
            iso_3166_2
        from ods_base_province_full
        where dt='2020-06-14'
    )province
    left join
    (
        select
            id,
            region_name
        from ods_base_region_full
        where dt='2020-06-14'
    )region
    on province.region_id=region.id;

Dependencies
ods_base_province_full
ods_base_region_full
============================================================================================ods============================================================================================
=============Order detail table (incremental) ods_order_detail_inc
    DROP TABLE IF EXISTS ods_order_detail_inc;
    CREATE EXTERNAL TABLE ods_order_detail_inc
    (
        `type` STRING COMMENT '变动类型',
        `ts`   BIGINT COMMENT '变动时间',
        `data` STRUCT<id :STRING,order_id :STRING,sku_id :STRING,sku_name :STRING,img_url :STRING,order_price
                      :DECIMAL(16, 2),sku_num :BIGINT,create_time :STRING,source_type :STRING,source_id :STRING,split_total_amount
                      :DECIMAL(16, 2),split_activity_amount :DECIMAL(16, 2),split_coupon_amount
                      :DECIMAL(16, 2)> COMMENT '数据',
        `old`  MAP<STRING,STRING> COMMENT '旧值'
    ) COMMENT '订单明细表'
        PARTITIONED BY (`dt` STRING)
        ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.JsonSerDe'
        LOCATION '/warehouse/gmall/ods/ods_order_detail_inc/';

=============Order table (incremental) ods_order_info_inc
    DROP TABLE IF EXISTS ods_order_info_inc;
    CREATE EXTERNAL TABLE ods_order_info_inc
    (
        `type` STRING COMMENT '变动类型',
        `ts`   BIGINT COMMENT '变动时间',
        `data` STRUCT<id :STRING,consignee :STRING,consignee_tel :STRING,total_amount :DECIMAL(16, 2),order_status :STRING,user_id
                      :STRING,payment_way :STRING,delivery_address :STRING,order_comment :STRING,out_trade_no :STRING,trade_body
                      :STRING,create_time :STRING,operate_time :STRING,expire_time :STRING,process_status :STRING,tracking_no
                      :STRING,parent_order_id :STRING,img_url :STRING,province_id :STRING,activity_reduce_amount
                      :DECIMAL(16, 2),coupon_reduce_amount :DECIMAL(16, 2),original_total_amount :DECIMAL(16, 2),freight_fee
                      :DECIMAL(16, 2),freight_fee_reduce :DECIMAL(16, 2),refundable_time :DECIMAL(16, 2)> COMMENT '数据',
        `old`  MAP<STRING,STRING> COMMENT '旧值'
    ) COMMENT '订单表'
        PARTITIONED BY (`dt` STRING)
        ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.JsonSerDe'
        LOCATION '/warehouse/gmall/ods/ods_order_info_inc/';    

=============Order detail-activity link table (incremental) ods_order_detail_activity_inc
    DROP TABLE IF EXISTS ods_order_detail_activity_inc;
    CREATE EXTERNAL TABLE ods_order_detail_activity_inc
    (
        `type` STRING COMMENT '变动类型',
        `ts`   BIGINT COMMENT '变动时间',
        `data` STRUCT<id :STRING,order_id :STRING,order_detail_id :STRING,activity_id :STRING,activity_rule_id :STRING,sku_id
                      :STRING,create_time :STRING> COMMENT '数据',
        `old`  MAP<STRING,STRING> COMMENT '旧值'
    ) COMMENT '订单明细活动关联表'
        PARTITIONED BY (`dt` STRING)
        ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.JsonSerDe'
        LOCATION '/warehouse/gmall/ods/ods_order_detail_activity_inc/';

=============Order detail-coupon link table (incremental) ods_order_detail_coupon_inc
    DROP TABLE IF EXISTS ods_order_detail_coupon_inc;
    CREATE EXTERNAL TABLE ods_order_detail_coupon_inc
    (
        `type` STRING COMMENT '变动类型',
        `ts`   BIGINT COMMENT '变动时间',
        `data` STRUCT<id :STRING,order_id :STRING,order_detail_id :STRING,coupon_id :STRING,coupon_use_id :STRING,sku_id
                      :STRING,create_time :STRING> COMMENT '数据',
        `old`  MAP<STRING,STRING> COMMENT '旧值'
    ) COMMENT '订单明细优惠券关联表'
        PARTITIONED BY (`dt` STRING)
        ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.JsonSerDe'
        LOCATION '/warehouse/gmall/ods/ods_order_detail_coupon_inc/';

=============Code dictionary table (full) ods_base_dic_full
    DROP TABLE IF EXISTS ods_base_dic_full;
    CREATE EXTERNAL TABLE ods_base_dic_full
    (
        `dic_code`     STRING COMMENT '编号',
        `dic_name`     STRING COMMENT '编码名称',
        `parent_code`  STRING COMMENT '父编号',
        `create_time`  STRING COMMENT '创建日期',
        `operate_time` STRING COMMENT '修改日期'
    ) COMMENT '编码字典表'
        PARTITIONED BY (`dt` STRING)
        ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
        NULL DEFINED AS ''
        LOCATION '/warehouse/gmall/ods/ods_base_dic_full/';

=============Province table (full) ods_base_province_full
    DROP TABLE IF EXISTS ods_base_province_full;
    CREATE EXTERNAL TABLE ods_base_province_full
    (
        `id`         STRING COMMENT '编号',
        `name`       STRING COMMENT '省份名称',
        `region_id`  STRING COMMENT '地区ID',
        `area_code`  STRING COMMENT '地区编码',
        `iso_code`   STRING COMMENT '旧版ISO-3166-2编码,供可视化使用',
        `iso_3166_2` STRING COMMENT '新版ISO-3166-2编码,供可视化使用'
    ) COMMENT '省份表'
        PARTITIONED BY (`dt` STRING)
        ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
        NULL DEFINED AS ''
        LOCATION '/warehouse/gmall/ods/ods_base_province_full/';

=============Region table (full) ods_base_region_full
    DROP TABLE IF EXISTS ods_base_region_full;
    CREATE EXTERNAL TABLE ods_base_region_full
    (
        `id`          STRING COMMENT '编号',
        `region_name` STRING COMMENT '地区名称'
    ) COMMENT '地区表'
        PARTITIONED BY (`dt` STRING)
        ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
        NULL DEFINED AS ''
        LOCATION '/warehouse/gmall/ods/ods_base_region_full/';

 

As you can see, the ODS layer is used as the landing layer that stays faithful to the sources and keeps the raw data as far as possible: the text-format data from DataX and the JSON data from Flume.
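The daily HDFS-to-ODS step itself is then just a scheduled load per table and per day; a minimal sketch, assuming the target paths used by the collection jobs above (the full-table path follows the same <table>_full/<date> convention as the DataX script):

    load data inpath '/origin_data/gmall/db/order_detail_inc/2020-06-15'
        overwrite into table ods_order_detail_inc partition (dt = '2020-06-15');

    load data inpath '/origin_data/gmall/db/base_province_full/2020-06-14'
        overwrite into table ods_base_province_full partition (dt = '2020-06-14');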

 

 

DIM dimension tables are extracted from ODS for later use by the DWS layer; pay attention to the dimension tables' compression and storage format.

TEXTFILE and SEQUENCEFILE are row-oriented storage formats;

ORC and PARQUET are column-oriented.
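To double-check which format and compression a given table actually uses, Hive's DESC FORMATTED output is enough (a quick sketch):

    -- shows the InputFormat/OutputFormat, the SerDe and table properties such as orc.compress
    DESC FORMATTED dim_province_full;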

 

 

Fact tables for the DWD layer are built by joining the ODS fact table with its several related tables, again for later use by the DWS layer; this step needs no further commentary here.

 

 

Aggregating the combination of DWD and DIM yields the DWS summary layer.

 

 

Finally, according to the requirements, the relevant summary records are taken from the DWS layer, some bespoke statistics such as period-over-period and year-over-year comparisons are computed, and the results land in the ADS layer for the application layer to use.
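For example, a day-over-day comparison on top of ads_order_by_province could look roughly like this (a hypothetical sketch, not one of the project's scripts):

    select
        dt,
        province_name,
        order_total_amount,
        order_total_amount - lag(order_total_amount) over (partition by province_id order by dt) as day_over_day_diff
    from ads_order_by_province
    where recent_days = 1;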

 

 

 

The date dimension table and zipper (slowly changing dimension) tables in DIM
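A zipper table typically carries a validity range per record version and parks the currently valid versions in a dt='9999-12-31' partition, so the latest snapshot can be read from a single partition. A minimal sketch, using a hypothetical dim_user_zip table (not shown elsewhere in this post):

    CREATE EXTERNAL TABLE dim_user_zip
    (
        `id`         STRING COMMENT 'user id',
        `login_name` STRING COMMENT 'login name',
        `start_date` STRING COMMENT 'date this record version became valid',
        `end_date`   STRING COMMENT 'date this record version expired (9999-12-31 = still valid)'
    ) COMMENT 'user dimension zipper table (sketch)'
        PARTITIONED BY (`dt` STRING)
        STORED AS ORC
        LOCATION '/warehouse/gmall/dim/dim_user_zip/'
        TBLPROPERTIES ('orc.compress' = 'snappy');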

First-day (initial) data load of the offline warehouse

For incremental tables, the records produced by Maxwell's first-time (bootstrap) sync carry type bootstrap-insert.
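Triggering that first full pull is done with Maxwell's bootstrap utility, roughly as follows (the install path and flags are assumptions based on a standard Maxwell setup; check them against your version):

    /opt/module/maxwell/bin/maxwell-bootstrap --database gmall --table user_info --config /opt/module/maxwell/config.properties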
