First-Order Analysis (Part 2)

1. Data Collection with Maxwell

1.1. Introduction to Maxwell

Maxwell is an open-source tool from the US company Zendesk, written in Java, that captures MySQL changes in real time. Like Canal, it works by reading the binlog.

1.2. Maxwell vs. Canal

  • Maxwell has no server+client architecture like Canal; there is only a server, which sends data to a message queue or Redis. To run multiple instances, start multiple processes with different configuration files.
  • Maxwell's standout feature is bootstrapping: Canal can only capture new changes and has no way to handle pre-existing historical data, whereas Maxwell's bootstrap function can export the complete historical data for initialization (see the example after this list). Very handy.
  • Maxwell does not support HA directly, but it does support resuming from a breakpoint: once a failure is resolved, a restart continues reading from where it left off.
  • Maxwell only outputs JSON, while Canal in server+client mode lets you customize the output format.
  • Maxwell is more lightweight than Canal.
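
Maxwell ships with a maxwell-bootstrap utility for the historical-data export mentioned above. A minimal sketch of an invocation, assuming the gmall0423 database and a user_info table from this project (the connection values come from the setup in sections 1.4 and 1.5; adjust to your environment):

[hui@hadoop201 maxwell1.25]$ bin/maxwell-bootstrap --user maxwell --password 123456 --host hadoop201 \
 --database gmall0423 --table user_info --client_id maxwell_1

Note that --client_id must match the client_id in config.properties (section 1.5), and that bootstrapped rows arrive with type "bootstrap-insert" (framed by "bootstrap-start" and "bootstrap-complete" markers), so consumers that filter on the type field need to take this into account.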

1.3. Installing Maxwell

[hui@hadoop201 software]$ tar -zxvf maxwell-1.25.0.tar.gz -C /opt/module/
[hui@hadoop201 module]$ mv maxwell-1.25.0 maxwell1.25

1.4. Preparation before using Maxwell

# Create a maxwell database in MySQL to store Maxwell's metadata
[hui@hadoop201 module]$ mysql -uroot -p123456
CREATE DATABASE maxwell;
# Create an account with full privileges on that database
GRANT ALL ON maxwell.* TO 'maxwell'@'%' IDENTIFIED BY '123456';
# Grant the account the privileges needed to monitor the other databases
GRANT SELECT, REPLICATION SLAVE, REPLICATION CLIENT ON *.* TO 'maxwell'@'%';

In addition, the MySQL binlog must be enabled:

[hui@hadoop201 ~]$ locate my.cnf
/etc/my.cnf
/etc/my.cnf.d
[hui@hadoop201 ~]$ less /etc/my.cnf
# For advice on how to change settings please see
# http://dev.mysql.com/doc/refman/5.7/en/server-configuration-defaults.html

[mysqld]
#
# Remove leading # and set to the amount of RAM for the most important data
# cache in MySQL. Start at 70% of total RAM for dedicated server, else 10%.
# innodb_buffer_pool_size = 128M
#
# Remove leading # to turn on a very important data integrity option: logging
# changes to the binary log between backups.
# log_bin
#
# Remove leading # to set options mainly useful for reporting servers.
# The server defaults are faster for transactions and fast SELECTs.
# Adjust sizes as needed, experiment to find the optimal values.
# join_buffer_size = 128M
# sort_buffer_size = 2M
# read_rnd_buffer_size = 2M
datadir=/var/lib/mysql
socket=/var/lib/mysql/mysql.sock

# Disabling symbolic-links is recommended to prevent assorted security risks
symbolic-links=0

log-error=/var/log/mysqld.log
pid-file=/var/run/mysqld/mysqld.pid

max_allowed_packet=1024M

# server id
server-id = 1
## enable the binlog; this value is used as the prefix of the binlog file names
log-bin=mysql-bin
## binlog format; Maxwell requires row
binlog_format=row
## database(s) to write to the binlog; adjust to your setup. To watch several databases, add one line per database
binlog-do-db=gmall0423
# uncomment the following line to watch one more database
#binlog-do-db=gmall
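
These settings only take effect after MySQL is restarted. A sketch, assuming MySQL runs as the mysqld service under systemd (as on a typical CentOS 7 install):

[hui@hadoop201 ~]$ sudo systemctl restart mysqld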

1.5. Editing the Maxwell configuration file

[hui@hadoop201 maxwell1.25]$ vim config.properties
producer=kafka
kafka.bootstrap.servers=hadoop201:9092,hadoop202:9092,hadoop203:9092
# mysql login info
host=hadoop201
user=maxwell
password=123456
# must be added: the target Kafka topic
kafka_topic=gmall0423_db_maxwell
# must be added: used later by the bootstrap initialization
client_id=maxwell_1
producer_partition_by=primary_key

Note: by default Maxwell writes everything to a single partition of the target Kafka topic, because consuming multiple partitions in parallel could scramble the binlog order. To increase parallelism, first set the topic's partition count to more than 1, then set the producer_partition_by property.

Valid values: producer_partition_by=database|table|primary_key|random|column
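
For illustration, partitioning by table keeps each table's changes in order within a single partition, while partitioning by a column requires two extra settings. A sketch (producer_partition_columns and producer_partition_by_fallback are standard Maxwell options, but the user_id column here is only an example):

# keep per-table ordering
producer_partition_by=table

# or partition by a column value, falling back to table when the column is absent
#producer_partition_by=column
#producer_partition_columns=user_id
#producer_partition_by_fallback=table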

Maxwell startup script:

[hui@hadoop201 bin]$ less maxwell.sh

/opt/module/maxwell1.25/bin/maxwell --config /opt/module/maxwell1.25/config.properties >/dev/null 2>&1 &
[hui@hadoop201 ~]$ jps
2675 Maxwell

Start the mock data generator and watch the topic to verify the output:

[hui@hadoop201 kafka]$ bin/kafka-console-consumer.sh --bootstrap-server hadoop201:9092 --topic gmall0423_db_maxwell

The data format:

{
    "database":"gmall0423",
    "table":"comment_info",
    "type":"insert",
    "ts":1651873950,
    "xid":2229,
    "xoffset":1392,
    "data":{
        "id":1522695828560359428,
        "user_id":23,
        "sku_id":12,
        "spu_id":10,
        "order_id":3515,
        "appraise":"1204",
        "comment_txt":"评论内容:23985229577423662931538863626755831166131361537121",
        "create_time":"2022-04-23 05:52:30",
        "operate_time":null
    }
}
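
The splitting job in section 1.6.3 consumes records of exactly this shape. As a standalone illustration of how the fields are extracted with fastjson (a minimal sketch reusing the record above; JSON.parseObject is the same call the job uses):

import com.alibaba.fastjson.{JSON, JSONObject}

object MaxwellRecordDemo {
  def main(args: Array[String]): Unit = {
    val line =
      """{"database":"gmall0423","table":"comment_info","type":"insert",
        |"ts":1651873950,"data":{"id":1522695828560359428,"user_id":23,"sku_id":12}}""".stripMargin
    val jsonObj: JSONObject = JSON.parseObject(line)
    // Maxwell puts the row image under "data" and the operation under "type"
    val opType: String = jsonObj.getString("type")        // insert / update / delete
    val table: String = jsonObj.getString("table")        // source table name
    val data: JSONObject = jsonObj.getJSONObject("data")  // the affected row
    println(s"op=$opType table=$table row=$data")
  }
}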

1.6. ODS-layer processing with Maxwell

1.6.1. Comparing Canal and Maxwell data formats across operations

Insert

INSERT INTO z_user_info VALUES(30,'zhang3','13810001010'),(31,'li4','1389999999');

canal:

{"data":[{"id":"30","user_name":"zhang3","tel":"13810001010"},{"id":"31","user_name":"li4","tel":"1389999999"}],"database":"gmall-2020-04","es":1589385314000,"id":2,"isDdl":false,"mysqlType":{"id":"bigint(20)","user_name":"varchar(20)","tel":"varchar(20)"},"old":null,"pkNames":["id"],"sql":"","sqlType":{"id":-5,"user_name":12,"tel":12},"table":"z_user_info","ts":1589385314116,"type":"INSERT"}

maxwell:

{"database":"gmall-2020-04","table":"z_user_info","type":"insert","ts":1589385314,"xid":82982,"xoffset":0,"data":{"id":30,"user_name":"zhang3","tel":"13810001010"}}

{"database":"gmall-2020-04","table":"z_user_info","type":"insert","ts":1589385314,"xid":82982,"commit":true,"data":{"id":31,"user_name":"li4","tel":"1389999999"}}

Update

UPDATE z_user_info SET user_name='wang55' WHERE id IN(30,31);

canal:

{"data":[{"id":"30","user_name":"wang55","tel":"13810001010"},{"id":"31","user_name":"wang55","tel":"1389999999"}],"database":"gmall-2020-04","es":1589385508000,"id":3,"isDdl":false,"mysqlType":{"id":"bigint(20)","user_name":"varchar(20)","tel":"varchar(20)"},"old":[{"user_name":"zhang3"},{"user_name":"li4"}],"pkNames":["id"],"sql":"","sqlType":{"id":-5,"user_name":12,"tel":12},"table":"z_user_info","ts":1589385508676,"type":"UPDATE"}

maxwell:

{"database":"gmall-2020-04","table":"z_user_info","type":"update","ts":1589385508,"xid":83206,"xoffset":0,"data":{"id":30,"user_name":"wang55","tel":"13810001010"},"old":{"user_name":"zhang3"}}

{"database":"gmall-2020-04","table":"z_user_info","type":"update","ts":1589385508,"xid":83206,"commit":true,"data":{"id":31,"user_name":"wang55","tel":"1389999999"},"old":{"user_name":"li4"}}

Delete

DELETE FROM z_user_info WHERE id IN(30,31);

canal:

{"data":[{"id":"30","user_name":"wang55","tel":"13810001010"},{"id":"31","user_name":"wang55","tel":"1389999999"}],"database":"gmall-2020-04","es":1589385644000,"id":4,"isDdl":false,"mysqlType":{"id":"bigint(20)","user_name":"varchar(20)","tel":"varchar(20)"},"old":null,"pkNames":["id"],"sql":"","sqlType":{"id":-5,"user_name":12,"tel":12},"table":"z_user_info","ts":1589385644829,"type":"DELETE"}

maxwell:

{"database":"gmall-2020-04","table":"z_user_info","type":"delete","ts":1589385644,"xid":83367,"xoffset":0,"data":{"id":30,"user_name":"wang55","tel":"13810001010"}}

{"database":"gmall-2020-04","table":"z_user_info","type":"delete","ts":1589385644,"xid":83367,"commit":true,"data":{"id":31,"user_name":"wang55","tel":"1389999999"}}

1.6.2. Summary of the format comparison

  • Log structure: Canal produces one log entry per SQL statement; if the statement affects multiple rows, they are gathered into an array inside that entry (even a single row is wrapped in an array). Maxwell produces one log entry per affected row. To tell whether several Maxwell entries came from the same statement, compare their xid fields: entries with the same xid belong to the same transaction (see the sketch after this list).
  • Numeric types: when the original value is numeric, Maxwell keeps it numeric instead of quoting it into a string; Canal converts everything to strings.
  • Schema information: Canal embeds the table's column definitions in every entry; Maxwell is more compact.
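
To make the xid point concrete, here is a minimal sketch (plain Scala, no Spark) that groups Maxwell events back into per-transaction batches; the sample strings mirror the insert example in section 1.6.1:

import com.alibaba.fastjson.{JSON, JSONObject}

object GroupByXid {
  def main(args: Array[String]): Unit = {
    // two Maxwell events produced by the same INSERT, hence the same xid
    val lines = Seq(
      """{"table":"z_user_info","type":"insert","xid":82982,"data":{"id":30,"user_name":"zhang3"}}""",
      """{"table":"z_user_info","type":"insert","xid":82982,"data":{"id":31,"user_name":"li4"}}"""
    )
    val events: Seq[JSONObject] = lines.map(l => JSON.parseObject(l))
    // events sharing an xid came from the same transaction
    val byXid: Map[Long, Seq[JSONObject]] = events.groupBy(_.getLongValue("xid"))
    byXid.foreach { case (xid, evs) =>
      println(s"xid=$xid affected ${evs.size} row(s)")
    }
  }
}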

1.6.3. SparkStreaming splitting logic, Maxwell version

import com.alibaba.fastjson.{JSON, JSONObject}
import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.kafka.common.TopicPartition
import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.{DStream, InputDStream}
import org.apache.spark.streaming.kafka010.{HasOffsetRanges, OffsetRange}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.wdh01.gmall.realtime.util.{MyKafkaSink, MyKafkaUtil, OffsetManagerUtil}

/**
 * Split the table-change stream captured by Maxwell into per-table ODS topics
 */
object BaseDBMaxwellApp {
  def main(args: Array[String]): Unit = {
    val conf: SparkConf = new SparkConf().setAppName("BaseDBMaxwellApp").setMaster("local[4]")
    val ssc: StreamingContext = new StreamingContext(conf, Seconds(5))

    val topic: String = "gmall0423_db_maxwell"
    val groupid: String = "base_db_db_maxwell_group"

    var recordDStream: InputDStream[ConsumerRecord[String, String]] = null
    // read the saved offsets from Redis
    val offsetMap: Map[TopicPartition, Long] = OffsetManagerUtil.getOffset(topic, groupid)
    if (offsetMap != null && offsetMap.nonEmpty) {
      // saved offsets found: resume from that position
      recordDStream = MyKafkaUtil.getKafkaStream(topic, ssc, offsetMap, groupid)
    } else {
      // no saved offsets: start from the default position
      recordDStream = MyKafkaUtil.getKafkaStream(topic, ssc, groupid)
    }
    // offset ranges of the current batch
    var ranges: Array[OffsetRange] = Array.empty[OffsetRange]
    val offsetDstream: DStream[ConsumerRecord[String, String]] = recordDStream.transform { rdd =>
      // capture the offset ranges before any downstream transformation
      ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      rdd
    }
    // convert structure: ConsumerRecord<K,V> ==> V(jsonStr) ==> V(jsonObj)
    val jsonObjDStream: DStream[JSONObject] = offsetDstream.map { record =>
      val jsonStr: String = record.value()
      // parse the JSON string into a JSONObject
      JSON.parseObject(jsonStr)
    }
    jsonObjDStream.foreachRDD { rdd =>
      rdd.foreach { jsonObj =>
        // the affected row
        val dataJsonObj: JSONObject = jsonObj.getJSONObject("data")
        // the operation type (insert / update / delete / bootstrap-insert ...)
        val opType: String = jsonObj.getString("type")

        if (dataJsonObj != null && !dataJsonObj.isEmpty && "insert".equals(opType)) {
          // route the row to an ODS topic named after the source table
          val tabName: String = jsonObj.getString("table")
          val sendTopic = "ods_" + tabName
          MyKafkaSink.send(sendTopic, dataJsonObj.toString())
        }
      }
      // save the offsets back to Redis once the batch has been processed
      OffsetManagerUtil.saveOffset(topic, groupid, ranges)
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
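
OffsetManagerUtil, MyKafkaUtil and MyKafkaSink are project helper classes not shown in this post. For orientation, here is a minimal sketch of what the offset helper could look like on top of a Redis hash. This is an assumption about its shape, not the project's actual code; the Jedis connection values and the offset:topic:groupid key layout are illustrative:

import java.util
import org.apache.kafka.common.TopicPartition
import org.apache.spark.streaming.kafka010.OffsetRange
import redis.clients.jedis.Jedis

import scala.collection.JavaConverters._

object OffsetManagerUtilSketch {
  // one Redis hash per (topic, group): field = partition, value = offset
  private def key(topic: String, groupId: String) = s"offset:$topic:$groupId"

  def getOffset(topic: String, groupId: String): Map[TopicPartition, Long] = {
    val jedis = new Jedis("hadoop201", 6379) // hypothetical Redis host/port
    try {
      val fields: util.Map[String, String] = jedis.hgetAll(key(topic, groupId))
      fields.asScala.map { case (partition, offset) =>
        new TopicPartition(topic, partition.toInt) -> offset.toLong
      }.toMap
    } finally jedis.close()
  }

  def saveOffset(topic: String, groupId: String, ranges: Array[OffsetRange]): Unit = {
    if (ranges == null || ranges.isEmpty) return
    val fields = new util.HashMap[String, String]()
    // untilOffset is the first offset of the next batch, so it is the resume point
    ranges.foreach(r => fields.put(r.partition.toString, r.untilOffset.toString))
    val jedis = new Jedis("hadoop201", 6379)
    try jedis.hmset(key(topic, groupId), fields) finally jedis.close()
  }
}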