First-Order Analysis (Part 2)
1. Data Collection: Maxwell Implementation
1.1 Introduction to Maxwell
Maxwell is a real-time MySQL change-capture tool open-sourced by Zendesk (US) and written in Java. Like Canal, it captures changes by reading the binlog.
1.2 Maxwell vs. Canal
- Maxwell has no server+client architecture like Canal; there is only a single server that sends data to a message queue or Redis. To run multiple instances, start multiple processes with different configuration files.
- Maxwell's standout feature is bootstrap: Canal can only capture new changes and has no way to handle pre-existing historical data, whereas Maxwell's bootstrap can export the complete historical data for initialization, which is very convenient (see the example command after this list).
- Maxwell does not support HA directly, but it does support resuming from a checkpoint: after a failure is resolved, it restarts and continues reading from where it left off.
- Maxwell outputs JSON only, while Canal in server+client mode allows a custom format.
- Maxwell is more lightweight than Canal.
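For reference, the bootstrap feature mentioned above is driven by the maxwell-bootstrap tool that ships with Maxwell. A typical invocation looks like the following (the table user_info is only an example here; --client_id must match the client_id set later in config.properties):
[hui@hadoop201 maxwell1.25]$ bin/maxwell-bootstrap --user maxwell --password 123456 --host hadoop201 --database gmall0423 --table user_info --client_id maxwell_1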
1.3 Installing Maxwell
[hui@hadoop201 software]$ tar -zxvf maxwell-1.25.0.tar.gz -C /opt/module/
[hui@hadoop201 module]$ mv maxwell-1.25.0 maxwell1.25
1.4 Preparation Before Using Maxwell
# Create a maxwell database in MySQL to store Maxwell's metadata
[hui@hadoop201 module]$ mysql -uroot -p123456
CREATE DATABASE maxwell;
# Create an account for operating on that database
GRANT ALL ON maxwell.* TO 'maxwell'@'%' IDENTIFIED BY '123456';
# Grant the account the privileges needed to monitor the other databases
GRANT SELECT, REPLICATION SLAVE, REPLICATION CLIENT ON *.* TO maxwell@'%';
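Before moving on, it may be worth confirming that the grants took effect (a quick sanity check; the exact output depends on the MySQL version):
mysql> SHOW GRANTS FOR 'maxwell'@'%';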
In addition, the MySQL binlog must be enabled:
[hui@hadoop201 ~]$ locate my.cnf
/etc/my.cnf
/etc/my.cnf.d
[hui@hadoop201 ~]$ less /etc/my.cnf
# For advice on how to change settings please see
# http://dev.mysql.com/doc/refman/5.7/en/server-configuration-defaults.html
[mysqld]
#
# Remove leading # and set to the amount of RAM for the most important data
# cache in MySQL. Start at 70% of total RAM for dedicated server, else 10%.
# innodb_buffer_pool_size = 128M
#
# Remove leading # to turn on a very important data integrity option: logging
# changes to the binary log between backups.
# log_bin
#
# Remove leading # to set options mainly useful for reporting servers.
# The server defaults are faster for transactions and fast SELECTs.
# Adjust sizes as needed, experiment to find the optimal values.
# join_buffer_size = 128M
# sort_buffer_size = 2M
# read_rnd_buffer_size = 2M
datadir=/var/lib/mysql
socket=/var/lib/mysql/mysql.sock
# Disabling symbolic-links is recommended to prevent assorted security risks
symbolic-links=0
log-error=/var/log/mysqld.log
pid-file=/var/run/mysqld/mysqld.pid
max_allowed_packet=1024M
# Server id
server-id = 1
## Enable binlog; this value becomes the binlog file name prefix
log-bin=mysql-bin
## Binlog format; Maxwell requires row
binlog_format=row
## Database(s) to write to the binlog; adjust to your setup. To monitor
## several databases, add one binlog-do-db line per database.
binlog-do-db=gmall0423
# Uncomment the following line to monitor one more database
#binlog-do-db=gmall
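After editing my.cnf, restart MySQL so the binlog settings take effect, then verify them (the commands below assume a systemd-managed MySQL 5.7; adjust for your environment):
[hui@hadoop201 ~]$ sudo systemctl restart mysqld
mysql> show variables like 'binlog_format';  -- should report ROW
mysql> show master status;                   -- the mysql-bin file and position should appear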
1.5 Editing the Maxwell Configuration File
[hui@hadoop201 maxwell1.25]$ vim config.properties
producer=kafka
kafka.bootstrap.servers=hadoop201:9092,hadoop202:9092,hadoop203:9092
# mysql login info (must match the account created above)
host=hadoop201
user=maxwell
password=123456
# must be added
kafka_topic=gmall0423_db_maxwell
# must be added; used later by the bootstrap initialization
client_id=maxwell_1
producer_partition_by=primary_key
Note: by default Maxwell still writes to a single partition of the target Kafka topic, because parallel writes to multiple partitions could scramble the binlog order. To increase parallelism, first set the topic's Kafka partition count to more than 1, then set the producer_partition_by property, as shown below.
Allowed values: producer_partition_by=database|table|primary_key|random|column
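For example, to raise the parallelism you could pre-create the topic with more than one partition. The partition and replication counts below are illustrative, and on Kafka versions before 2.2 you would pass --zookeeper instead of --bootstrap-server:
[hui@hadoop201 kafka]$ bin/kafka-topics.sh --create --bootstrap-server hadoop201:9092 --topic gmall0423_db_maxwell --partitions 4 --replication-factor 2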
Maxwell startup script:
[hui@hadoop201 bin]$ less maxwell.sh
/opt/module/maxwell1.25/bin/maxwell --config /opt/module/maxwell1.25/config.properties >/dev/null 2>&1 &
[hui@hadoop201 ~]$ jps
2675 Maxwell
Start the mock data generator and observe the result:
[hui@hadoop201 kafka]$ bin/kafka-console-consumer.sh --bootstrap-server hadoop201:9092 --topic gmall0423_db_maxwell
Data format:
{ "database":"gmall0423", "table":"comment_info", "type":"insert", "ts":1651873950, "xid":2229, "xoffset":1392, "data":{ "id":1522695828560359428, "user_id":23, "sku_id":12, "spu_id":10, "order_id":3515, "appraise":"1204", "comment_txt":"评论内容:23985229577423662931538863626755831166131361537121", "create_time":"2022-04-23 05:52:30", "operate_time":null } }
1.6 ODS-Layer Processing, Maxwell Version
1.6.1 Canal vs. Maxwell Data Format for Each Operation Type
Insert
INSERT INTO z_user_info VALUES(30,'zhang3','13810001010'),(31,'li4','1389999999');
canal:
{"data":[{"id":"30","user_name":"zhang3","tel":"13810001010"},{"id":"31","user_name":"li4","tel":"1389999999"}],"database":"gmall-2020-04","es":1589385314000,"id":2,"isDdl":false,"mysqlType":{"id":"bigint(20)","user_name":"varchar(20)","tel":"varchar(20)"},"old":null,"pkNames":["id"],"sql":"","sqlType":{"id":-5,"user_name":12,"tel":12},"table":"z_user_info","ts":1589385314116,"type":"INSERT"}
maxwell:
{"database":"gmall-2020-04","table":"z_user_info","type":"insert","ts":1589385314,"xid":82982,"xoffset":0,"data":{"id":30,"user_name":"zhang3","tel":"13810001010"}}
{"database":"gmall-2020-04","table":"z_user_info","type":"insert","ts":1589385314,"xid":82982,"commit":true,"data":{"id":31,"user_name":"li4","tel":"1389999999"}}
Update
UPDATE z_user_info SET user_name='wang55' WHERE id IN(30,31);
canal:
{"data":[{"id":"30","user_name":"wang55","tel":"13810001010"},{"id":"31","user_name":"wang55","tel":"1389999999"}],"database":"gmall-2020-04","es":1589385508000,"id":3,"isDdl":false,"mysqlType":{"id":"bigint(20)","user_name":"varchar(20)","tel":"varchar(20)"},"old":[{"user_name":"zhang3"},{"user_name":"li4"}],"pkNames":["id"],"sql":"","sqlType":{"id":-5,"user_name":12,"tel":12},"table":"z_user_info","ts":1589385508676,"type":"UPDATE"}
maxwell:
{"database":"gmall-2020-04","table":"z_user_info","type":"update","ts":1589385508,"xid":83206,"xoffset":0,"data":{"id":30,"user_name":"wang55","tel":"13810001010"},"old":{"user_name":"zhang3"}}
{"database":"gmall-2020-04","table":"z_user_info","type":"update","ts":1589385508,"xid":83206,"commit":true,"data":{"id":31,"user_name":"wang55","tel":"1389999999"},"old":{"user_name":"li4"}}
Delete
DELETE FROM z_user_info WHERE id IN(30,31);
canal:
{"data":[{"id":"30","user_name":"wang55","tel":"13810001010"},{"id":"31","user_name":"wang55","tel":"1389999999"}],"database":"gmall-2020-04","es":1589385644000,"id":4,"isDdl":false,"mysqlType":{"id":"bigint(20)","user_name":"varchar(20)","tel":"varchar(20)"},"old":null,"pkNames":["id"],"sql":"","sqlType":{"id":-5,"user_name":12,"tel":12},"table":"z_user_info","ts":1589385644829,"type":"DELETE"}
maxwell:
{"database":"gmall-2020-04","table":"z_user_info","type":"delete","ts":1589385644,"xid":83367,"xoffset":0,"data":{"id":30,"user_name":"wang55","tel":"13810001010"}}
{"database":"gmall-2020-04","table":"z_user_info","type":"delete","ts":1589385644,"xid":83367,"commit":true,"data":{"id":31,"user_name":"wang55","tel":"1389999999"}}
1.6.2 Summary of the Data Format Comparison
- Log structure: Canal produces one log entry per SQL statement; if the statement affects multiple rows, they are collected into an array inside that single entry (even a single row is wrapped in an array). Maxwell produces one log entry per affected row. To tell which entries came from the same SQL statement, compare the xid field: entries with the same xid originate from the same statement.
- Numeric types: Maxwell respects the original column type and does not quote numbers into strings; Canal converts every value to a string.
- Schema information: Canal's output carries the column definitions of the table; Maxwell's output is more concise. (See the parsing sketch after this list for how the two formats differ in code.)
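The structural difference matters when parsing: with Canal you must iterate over a JSON array, while with Maxwell each entry carries exactly one row object. A minimal fastjson sketch of the difference, using abbreviated sample strings from the tables above:

import com.alibaba.fastjson.{JSON, JSONObject}

object FormatDiffDemo {
  def main(args: Array[String]): Unit = {
    val canalLine   = """{"table":"z_user_info","type":"INSERT","data":[{"id":"30"},{"id":"31"}]}"""
    val maxwellLine = """{"table":"z_user_info","type":"insert","data":{"id":30}}"""

    // canal: "data" is always a JSON array, one element per affected row,
    // and numeric values arrive quoted as strings ("30")
    val canalRows = JSON.parseObject(canalLine).getJSONArray("data")
    println(canalRows.size()) // 2

    // maxwell: "data" is a single JSON object per log entry,
    // and numeric values keep their original type (30)
    val maxwellRow: JSONObject = JSON.parseObject(maxwellLine).getJSONObject("data")
    println(maxwellRow.getIntValue("id")) // 30
  }
}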
1.6.3 Modifying the SparkStreaming Splitting Logic for Maxwell
import com.alibaba.fastjson.{JSON, JSONObject}
import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.kafka.common.TopicPartition
import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.{DStream, InputDStream}
import org.apache.spark.streaming.kafka010.{HasOffsetRanges, OffsetRange}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.wdh01.gmall.realtime.util.{MyKafkaSink, MyKafkaUtil, OffsetManagerUtil}

/**
 * Split the business data collected by Maxwell into per-table ODS topics
 */
object BaseDBMaxwellApp {
  def main(args: Array[String]): Unit = {
    val conf: SparkConf = new SparkConf().setAppName("BaseDBMaxwellApp").setMaster("local[4]")
    val ssc: StreamingContext = new StreamingContext(conf, Seconds(5))

    val topic: String = "gmall0423_db_maxwell"
    val groupid: String = "base_db_db_maxwell_group"

    var recordDStream: InputDStream[ConsumerRecord[String, String]] = null
    // Read the saved offsets from Redis
    val offsetMap: Map[TopicPartition, Long] = OffsetManagerUtil.getOffset(topic, groupid)
    if (offsetMap != null && offsetMap.size > 0) {
      // Resume reading from the saved position
      recordDStream = MyKafkaUtil.getKafkaStream(topic, ssc, offsetMap, groupid)
    } else {
      // No saved offsets: read from the default position
      recordDStream = MyKafkaUtil.getKafkaStream(topic, ssc, groupid)
    }

    // Capture the offset ranges of the current batch
    var ranges: Array[OffsetRange] = Array.empty[OffsetRange]
    val offsetDstream: DStream[ConsumerRecord[String, String]] = recordDStream.transform {
      rdd => {
        ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
        rdd
      }
    }

    // Convert the structure: ConsumerRecord[K, V] ==> V (jsonStr) ==> V (jsonObj)
    val jsonObjDStream: DStream[JSONObject] = offsetDstream.map {
      record => {
        val jsonStr: String = record.value()
        // Parse jsonStr into a JSONObject
        val jsonObj: JSONObject = JSON.parseObject(jsonStr)
        jsonObj
      }
    }

    jsonObjDStream.foreachRDD {
      rdd => {
        rdd.foreach {
          jsonObj => {
            // The affected row
            val dataJsonObj: JSONObject = jsonObj.getJSONObject("data")
            // The operation type
            val opType: String = jsonObj.getString("type")
            if (dataJsonObj != null && !dataJsonObj.isEmpty && "insert".equals(opType)) {
              // The table name determines the target ODS topic
              val tabName: String = jsonObj.getString("table")
              val sendTopic = "ods_" + tabName
              MyKafkaSink.send(sendTopic, dataJsonObj.toString())
            }
          }
        }
        // Commit the offsets
        OffsetManagerUtil.saveOffset(topic, groupid, ranges)
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
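One caveat if you later use Maxwell's bootstrap to load historical data through this job: bootstrapped rows arrive with type values such as bootstrap-insert (framed by bootstrap-start and bootstrap-complete markers), so the filter above, which only accepts "insert", would silently drop them. Extend the condition to also accept "bootstrap-insert" if initialization data should flow into the ODS topics as well.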