记录去重保留一条和联表删除的排坑过程
因为接触的数据库比较多,各个库之间有些细节有出入没有注意就会踩坑
一、场景要求
生产库中有一张主表tbl_invoice_xbk5
和明细表tbl_invoice_item_xbk5
,关联字段是主表的INV_ID
和明细表的ID
对应,由于前期设计缺陷,发现主表中INVOICE_NO
字段有重复数据,需要去重只保留一条,且对应的明细表也需要删除关联数据,有限制条件为主表INVOICE_CODE
值是'数电票',基本的主键和索引都有。
二、踩坑过程
1、基本思路是查找出主表中的需要删除的重复数据,关联明细表之后先将明细表的数据删除再删除主表数据,即准备先查找出主表中所需删除数据,这里呢也有两个方法:
-
首先去重最先想到的是group by having句式的使用,查找所有重复数据
select * from tbl_invoice_xbk5 where invoice_code = '数电票' group by invoice_no having count(1) > 1;
结果报错如下:
这是因为mysql5.7开始有设置比较严格的sql_mode,用
show variables like 'sql_mode%';
确认后临时修改其值去掉'only_full_group_by'再次查询,得到的结果也并非是我们想要的结果,这个查询的结果只是显示重复数据中的一条,只能看到我们有多少条数据是重复的,具体每一条重复数据是有多少条重复数量是看不到的,需要修改sql语句。[NOTE]
第一坑就是对于sql_mode
中'only_full_group_by'的配置相关问题,含义就是sql中select后面的字段必须出现在group by后面,或者被聚合函数包裹,不然会抛出上面的错误 。于是在默认'only_full_group_by'模式下重新修改语句:
SELECT * FROM tbl_invoice_xbk5 WHERE invoice_no IN ( SELECT invoice_no FROM tbl_invoice_xbk5 WHERE invoice_code = '数电票' GROUP BY invoice_no HAVING count( 1 ) > 1 );
得到了所有的重复数据,要求是保留一条,主表中ID是唯一键,于是想保留ID最小的一条,其余删除,即查找所有重复数据中ID不等于min(id)的那些:
SELECT * FROM tbl_invoice_xbk5 WHERE invoice_no IN ( SELECT invoice_no FROM tbl_invoice_xbk5 WHERE invoice_code = '数电票' GROUP BY invoice_no HAVING count( 1 ) > 1 ) AND id NOT IN ( SELECT min( id ) FROM tbl_invoice_xbk5 WHERE invoice_code = '数电票' GROUP BY invoice_no HAVING count( 1 ) > 1 ) ;
用explain查看执行计划,子查询都走了索引,但是主查询走的是全盘扫描,根据业务场景在主查询中再添加invoice_code的条件限制后有优化,最终的SQL语句是:
SELECT * FROM tbl_invoice_xbk5 WHERE invoice_no IN ( SELECT invoice_no FROM tbl_invoice_xbk5 WHERE invoice_code = '数电票' GROUP BY invoice_no HAVING count( 1 ) > 1 ) AND id NOT IN ( SELECT min( id ) FROM tbl_invoice_xbk5 WHERE invoice_code = '数电票' GROUP BY invoice_no HAVING count( 1 ) > 1 ) AND invoice_code = '数电票';
[NOTE]
这里涉及到in语句的sql优化,相比较的还有exists以及not in和not exists的对比。 -
上述第一种方法语句看上去就比较复杂,另外在最后删除的时候由于Mysql无法支持在查询一张表时又更新这张表的操作,还需要再添加一张第三方表,语句会更加复杂容易出错,于是想着优化一下。实际上我们用重复字段invoice_code进行group by时得到对应min(id),只要主表id不在这部分中即可,因为即使没有重复的数据它的id就是它对应的min(id),优化SQL如下:
SELECT * FROM tbl_invoice_xbk5 WHERE id NOT IN ( SELECT min( id ) FROM tbl_invoice_xbk5 WHERE invoice_code = '数电票' GROUP BY invoice_no ) AND invoice_code = '数电票';
2、主表中要删除的数据已经找到,然后需要先删除对应明细表中的数据,但是在写删除语句时有前面所提到的mysql的限制,这里就先对此做修改:
-
上述语句如果将select * 直接修改成delete进行删除会报错如下:
我们通过加入第三方表的方式来解决:
DELETE FROM tbl_invoice_xbk5 WHERE id NOT IN ( SELECT t.minid FROM ( SELECT min( id ) AS minid FROM tbl_invoice_xbk5 WHERE invoice_code = '数电票' GROUP BY invoice_no ) t ) AND invoice_code = '数电票';
3、关联明细表,找出明细表中对应所需删除数据进行删除
-
第一种直接用in来关联:
SELECT * FROM tbl_invoice_item_xbk5 WHERE INV_ID IN ( SELECT id FROM tbl_invoice_xbk5 WHERE id NOT IN ( SELECT min( id ) FROM tbl_invoice_xbk5 WHERE INVOICE_CODE = '数电票' GROUP BY invoice_no ) AND invoice_code = '数电票' );
用explain查看执行计划,可以看到走了索引且rows也不大,实际执行也能很快出结果:
但是将select * 改为delete进行删除时却很慢,再查看执行计划,发现进行delete时走的是全盘扫面:
非常的痛苦!!!
[NOTE]
这里是后期记录时写的,没有实际执行,如果在执行时有出现上面的限制报错,可以直接加上第三方表。 -
换用join的方式来关联查询:
SELECT * FROM tbl_invoice_item_xbk5 t1 LEFT JOIN tbl_invoice_xbk5 t2 ON t1.INV_ID = t2.ID WHERE t2.INVOICE_CODE = '数电票' AND t2.id NOT IN ( SELECT min( id ) FROM tbl_invoice_xbk5 WHERE invoice_code = '数电票' GROUP BY INVOICE_NO );
不放心的先看了下执行计划,一切正常,于是将select * 改为delete再执行,再次报错:
这里直接查看是显示的是语法错误,查了资料后才知道mysql在做delete时如果使用了表别名是需要在delete和from直接加上这个表别名,发现了新大陆了!!画重点!!于是修改下删除语句后再看执行计划:
DELETE t1 FROM tbl_invoice_item_xbk5 t1 LEFT JOIN tbl_invoice_xbk5 t2 ON t1.INV_ID = t2.ID WHERE t2.INVOICE_CODE = '数电票' AND t2.id NOT IN ( SELECT min( id ) FROM tbl_invoice_xbk5 WHERE invoice_code = '数电票' GROUP BY INVOICE_NO );
也不报错了,效率也高很多很多,一切都显得非常平静,peace & love ~
-
到此基本是已经可以完成业务要求了,先删除明细表再删除主表即可,不过联想到表别名的使用再看前面那个非常痛苦的in语句使用,想着是否可以用表别名来优化delete的效率:
DELETE t FROM tbl_invoice_item_xbk5 t WHERE t.INV_ID IN ( SELECT id FROM tbl_invoice_xbk5 WHERE id NOT IN ( SELECT min( id ) FROM tbl_invoice_xbk5 WHERE INVOICE_CODE = '数电票' GROUP BY invoice_no ) AND invoice_code = '数电票' );
我只能说,非常的丝滑~
4、进阶
再解决上面问题的时候,还有意外的惊喜,就是Mysql的两表联合删除,即一条语句删除两张表中所需删除的数据,基本语法是:
DELETE FROM table1,table2
WHERE table1.column1=table2.column2;
按照上面的语法,我们可以在两张表之间建立联系,然后在WHERE子句中给出删除的限制条件。再结合表别名的使用,于是乎最后我是使用如下的语句来完成业务的要求:
DELETE ta,tb
FROM
tbl_invoice_item_xbk5 ta,
tbl_invoice_xbk5 tb
WHERE
ta.INV_ID = tb.ID
AND tb.ID NOT IN ( SELECT t.minid FROM ( SELECT min( id ) AS minid FROM tbl_invoice_xbk5 WHERE INVOICE_CODE = '数电票' GROUP BY invoice_no ) t )
AND tb.invoice_code = '数电票';
主表删除20条重复数据,明细表对应的需要删除23条数据,于Affected rows: 43也吻合。