Mysql删除重复的数据

最近在做一个多线程的爬虫程序,由于队列中有重复的数据,尽管程序中有判断不存在则插入,但由于多个线程并发,导致数据库中存在部分重复的数据。

 

 程序中的bug已经修复,但重新爬一遍耗时耗力,于是就选择删除重复的数据,只保留一条有效数据

解决的思路就是根据确定其数据唯一的聚合字段进行分组,然后只保留一条有效数据

1.查询重复数据

select * FROM ZYZBBData
WHERE (code,year,report_type) IN (SELECT
                          code,
                          year,
                          report_type
                        FROM (SELECT
                                code,
                                year,
                                report_type
                              FROM ZYZBBData
                              GROUP BY code,year,report_type
                              HAVING COUNT( * ) > 1) a)

 

 2.只保留Id最小的1条数据,过滤出要被删除的数据

select * FROM ZYZBBData
WHERE (code,year,report_type) IN (SELECT
                          code,
                          year,
                          report_type
                        FROM (SELECT
                                code,
                                year,
                                report_type
                              FROM ZYZBBData
                              GROUP BY code,year,report_type
                              HAVING COUNT( * ) > 1) a)
    AND id NOT IN(SELECT
                    id
                  FROM (SELECT
                          MIN(id) AS id
                        FROM ZYZBBData
                        GROUP BY code,year,report_type
                        HAVING COUNT( * ) > 1) b)

 

3.删除重复的数据

DELETE
FROM ZYZBBData
WHERE (code,year,report_type) IN (SELECT
                          code,
                          year,
                          report_type
                        FROM (SELECT
                                code,
                                year,
                                report_type
                              FROM ZYZBBData
                              GROUP BY code,year,report_type
                              HAVING COUNT( * ) > 1) a)
    AND id NOT IN(SELECT
                   id
                  FROM (SELECT
                          MIN(id) AS id
                        FROM ZYZBBData
                        GROUP BY code,year,report_type
                        HAVING COUNT( * ) > 1) b)

 

 数据正常

 

posted @ 2021-11-06 11:12  hello-*-world  阅读(1809)  评论(0编辑  收藏  举报