记录去重保留一条和联表删除的排坑过程

因为接触的数据库比较多,各个库之间有些细节有出入没有注意就会踩坑

一、场景要求

生产库中有一张主表tbl_invoice_xbk5和明细表tbl_invoice_item_xbk5,关联字段是主表的INV_ID和明细表的ID对应,由于前期设计缺陷,发现主表中INVOICE_NO字段有重复数据,需要去重只保留一条,且对应的明细表也需要删除关联数据,有限制条件为主表INVOICE_CODE值是'数电票',基本的主键和索引都有。

二、踩坑过程

1、基本思路是查找出主表中的需要删除的重复数据,关联明细表之后先将明细表的数据删除再删除主表数据,即准备先查找出主表中所需删除数据,这里呢也有两个方法:

  • 首先去重最先想到的是group by having句式的使用,查找所有重复数据

    select * from tbl_invoice_xbk5 where invoice_code = '数电票' group by invoice_no having count(1) > 1;
    

    结果报错如下:

    这是因为mysql5.7开始有设置比较严格的sql_mode,用show variables like 'sql_mode%';确认后临时修改其值去掉'only_full_group_by'再次查询,得到的结果也并非是我们想要的结果,这个查询的结果只是显示重复数据中的一条,只能看到我们有多少条数据是重复的,具体每一条重复数据是有多少条重复数量是看不到的,需要修改sql语句。

    [NOTE]
    第一坑就是对于sql_mode中'only_full_group_by'的配置相关问题,含义就是sql中select后面的字段必须出现在group by后面,或者被聚合函数包裹,不然会抛出上面的错误 。

    于是在默认'only_full_group_by'模式下重新修改语句:

    SELECT
        * 
    FROM
        tbl_invoice_xbk5  
    WHERE
        invoice_no IN ( SELECT invoice_no FROM tbl_invoice_xbk5 WHERE invoice_code = '数电票' GROUP BY invoice_no HAVING count( 1 ) > 1 );
    

    得到了所有的重复数据,要求是保留一条,主表中ID是唯一键,于是想保留ID最小的一条,其余删除,即查找所有重复数据中ID不等于min(id)的那些:

    SELECT
        * 
    FROM
        tbl_invoice_xbk5  
    WHERE
        invoice_no IN ( SELECT invoice_no FROM tbl_invoice_xbk5 WHERE invoice_code = '数电票' GROUP BY invoice_no HAVING count( 1 ) > 1 ) 
        AND id NOT IN ( SELECT min( id ) FROM tbl_invoice_xbk5 WHERE invoice_code = '数电票' GROUP BY invoice_no HAVING count( 1 ) > 1 ) ;
    

    用explain查看执行计划,子查询都走了索引,但是主查询走的是全盘扫描,根据业务场景在主查询中再添加invoice_code的条件限制后有优化,最终的SQL语句是:

    SELECT
        * 
    FROM
        tbl_invoice_xbk5  
    WHERE
        invoice_no IN ( SELECT invoice_no FROM tbl_invoice_xbk5 WHERE invoice_code = '数电票' GROUP BY invoice_no HAVING count( 1 ) > 1 ) 
        AND id NOT IN ( SELECT min( id ) FROM tbl_invoice_xbk5 WHERE invoice_code = '数电票' GROUP BY invoice_no HAVING count( 1 ) > 1 ) 
        AND invoice_code = '数电票';
    

    [NOTE]
    这里涉及到in语句的sql优化,相比较的还有exists以及not in和not exists的对比。

  • 上述第一种方法语句看上去就比较复杂,另外在最后删除的时候由于Mysql无法支持在查询一张表时又更新这张表的操作,还需要再添加一张第三方表,语句会更加复杂容易出错,于是想着优化一下。实际上我们用重复字段invoice_code进行group by时得到对应min(id),只要主表id不在这部分中即可,因为即使没有重复的数据它的id就是它对应的min(id),优化SQL如下:

    SELECT
        * 
    FROM
        tbl_invoice_xbk5  
    WHERE
        id NOT IN ( SELECT min( id ) FROM tbl_invoice_xbk5 WHERE invoice_code = '数电票' GROUP BY invoice_no ) 
        AND invoice_code = '数电票';
    

2、主表中要删除的数据已经找到,然后需要先删除对应明细表中的数据,但是在写删除语句时有前面所提到的mysql的限制,这里就先对此做修改:

  • 上述语句如果将select * 直接修改成delete进行删除会报错如下:

    我们通过加入第三方表的方式来解决:

    DELETE 
    FROM
        tbl_invoice_xbk5 
    WHERE
        id NOT IN ( SELECT t.minid FROM ( SELECT min( id ) AS minid FROM tbl_invoice_xbk5 WHERE invoice_code = '数电票' GROUP BY invoice_no ) t ) 
        AND invoice_code = '数电票';
    

3、关联明细表,找出明细表中对应所需删除数据进行删除

  • 第一种直接用in来关联:

    SELECT
      * 
    FROM
        tbl_invoice_item_xbk5 
    WHERE
        INV_ID IN (
    SELECT
        id 
    FROM
        tbl_invoice_xbk5 
    WHERE
        id NOT IN ( SELECT min( id ) FROM tbl_invoice_xbk5 WHERE INVOICE_CODE = '数电票' GROUP BY invoice_no ) 
        AND invoice_code = '数电票' 
        );
    

    用explain查看执行计划,可以看到走了索引且rows也不大,实际执行也能很快出结果:

    但是将select * 改为delete进行删除时却很慢,再查看执行计划,发现进行delete时走的是全盘扫面:

    非常的痛苦!!!

    [NOTE]
    这里是后期记录时写的,没有实际执行,如果在执行时有出现上面的限制报错,可以直接加上第三方表。

  • 换用join的方式来关联查询:

      SELECT
          * 
      FROM
          tbl_invoice_item_xbk5 t1
          LEFT JOIN tbl_invoice_xbk5 t2 ON t1.INV_ID = t2.ID 
      WHERE
          t2.INVOICE_CODE = '数电票' 
          AND t2.id NOT IN ( SELECT min( id ) FROM tbl_invoice_xbk5 WHERE invoice_code = '数电票' GROUP BY INVOICE_NO );
    

    不放心的先看了下执行计划,一切正常,于是将select * 改为delete再执行,再次报错:

    这里直接查看是显示的是语法错误,查了资料后才知道mysql在做delete时如果使用了表别名是需要在delete和from直接加上这个表别名,发现了新大陆了!!画重点!!于是修改下删除语句后再看执行计划:

    DELETE t1 
    FROM
        tbl_invoice_item_xbk5 t1
        LEFT JOIN tbl_invoice_xbk5 t2 ON t1.INV_ID = t2.ID 
    WHERE
        t2.INVOICE_CODE = '数电票' 
        AND t2.id NOT IN ( SELECT min( id ) FROM tbl_invoice_xbk5 WHERE invoice_code = '数电票' GROUP BY INVOICE_NO );
    

    也不报错了,效率也高很多很多,一切都显得非常平静,peace & love ~

  • 到此基本是已经可以完成业务要求了,先删除明细表再删除主表即可,不过联想到表别名的使用再看前面那个非常痛苦的in语句使用,想着是否可以用表别名来优化delete的效率:

    DELETE t 
    FROM
        tbl_invoice_item_xbk5 t 
    WHERE
        t.INV_ID IN (
    SELECT
        id 
    FROM
        tbl_invoice_xbk5 
    WHERE
        id NOT IN ( SELECT min( id ) FROM tbl_invoice_xbk5 WHERE INVOICE_CODE = '数电票' GROUP BY invoice_no ) 
        AND invoice_code = '数电票' 
        );
    

    我只能说,非常的丝滑~

4、进阶

再解决上面问题的时候,还有意外的惊喜,就是Mysql的两表联合删除,即一条语句删除两张表中所需删除的数据,基本语法是:

DELETE FROM table1,table2
WHERE table1.column1=table2.column2;

按照上面的语法,我们可以在两张表之间建立联系,然后在WHERE子句中给出删除的限制条件。再结合表别名的使用,于是乎最后我是使用如下的语句来完成业务的要求:

DELETE ta,tb 
FROM
    tbl_invoice_item_xbk5 ta,
    tbl_invoice_xbk5 tb 
WHERE
    ta.INV_ID = tb.ID 
    AND tb.ID NOT IN ( SELECT t.minid FROM ( SELECT min( id ) AS minid FROM tbl_invoice_xbk5 WHERE INVOICE_CODE = '数电票' GROUP BY invoice_no ) t ) 
    AND tb.invoice_code = '数电票';

主表删除20条重复数据,明细表对应的需要删除23条数据,于Affected rows: 43也吻合。

posted @ 2023-06-26 09:56  灬夜露沁心灬  阅读(76)  评论(0编辑  收藏  举报