GreenPlum高效去除表重复数据
1.针对PostgreSQL数据库表的去重复方法基本有三种,这是在网上查找的方法,在附录1给出。但是这些方法对GreenPlum来说都不管用。
2.数据表分布在不同的节点上,每个节点的ctid是唯一的,但是不同的节点就有ctid重复的可能,因此GreenPlum必须借助gp_segment_id来进行去重复处理。
3.在网上找到了一个相对繁琐的方法,在附录2给出:
4.最终的方法是:
delete from test where (gp_segment_id, ctid) not in (select gp_segment_id, min(ctid) from test group by x, gp_segment_id);
验证通过。
附录1:PostgreSQL数据表去重复的三种方法:
引用自:http://my.oschina.net/swuly302/blog/144933
采用PostgreSQL 9.2 官方文档例子为例:
CREATE TABLE weather ( city varchar(80), temp_lo int, -- low temperature temp_hi int, -- high temperature prcp real, -- precipitation date date ); INSERT INTO weather VALUES ('San Francisco', 46, 50, 0.25, '1994-11-27'), ('San Francisco', 43, 57, 0, '1994-11-29'), ('Hayward', 37, 54, NULL, '1994-11-29'), ('Hayward', 37, 54, NULL, '1994-11-29'); --- duplicated row
这里有3中方法:
第一种:替换法 -- 剔除重复行的数据转存到新表weather_temp SELECT DISTINCT city, temp_lo, temp_hi, prcp, date INTO weather_temp FROM weather; -- 删除原表 DROP TABLE weather; -- 将新表重命名为weather ALTER TABLE weather_temp RENAME TO weather; 或者 -- 创建与weather一样的表weather_temp CREATE TABLE weather_temp (LIKE weather INCLUDING CONSTRAINTS); -- 用剔除重复行的数据填充到weather_temp中 INSERT INTO weather_temp SELECT DISTINCT * FROM weather; -- 删除原表 DROP TABLE weather; -- 将新重命名为weather. ALTER TABLE weather_temp RENAME TO weather; 通俗易懂,有很多毁灭性的操作如DROP,而且当数据量大时,耗时耗空间。不推荐。 第二种: 添加字段法 -- 添加一个新字段,类型为serial ALTER TABLE weather ADD COLUMN id SERIAL; -- 删除重复行 DELETE FROM weather WHERE id NOT IN ( SELECT max(id) FROM weather GROUP BY city, temp_lo, temp_hi, prcp, date ); -- 删除添加的字段 ALTER TABLE weather DROP COLUMN id; 需要添加字段,「暂时不知道Postgres是如何处理添加字段的,是直接在原表追加呢,还是复制原表组成新表呢?」,如果是原表追加,可能就会因为新字段的加入而导致分页(一般block: 8k),如果是复制的话那就罪过了。不好。 第三种:系统字段[查看 System Columns] DELETE FROM weather WHERE ctid NOT IN ( SELECT max(ctid) FROM weather GROUP BY city, temp_lo, temp_hi, prcp, date ); 针对性强[Postgres独有],但是简单。
----------------但是对GreenPlum的表来说,表分割在各个节点上,不能单纯的用ctid来做去重复处理。
附录2:
https://discuss.pivotal.io/hc/zh-cn/community/posts/206428018-What-is-the-most-efficient-way-of-deleting-duplicate-records-from-a-table-
What is the most efficient way of deleting duplicate records from a table?
Currently we use Primary Keys to avoid loading duplicate data into our tables, but PK brings many restrictions. Since we can’t easily identify or prevent duplicates arriving from the variety of 3rd party upstream systems, we wanted to investigate the ‘load everything, remove duplicates afterwards’ approach.
In Postgres, you can use an efficient method such as:
DELETE FROM test WHERE ctid NOT IN ( SELECT min(ctid) FROM test GROUP BY x); (where 'x' is the unique column list)
However in Greenplum ‘ctid’ is only unique per segment.
One approach would be:
DELETE FROM test USING (select gp_segment_id, ctid from (select gp_segment_id, ctid, rank() over (partition by x order by gp_segment_id, ctid) as rk from test ) foo WHERE rk <> 1) rows_to_delete WHERE test.gp_segment_id=rows_to_delete.gp_segment_id AND test.ctid=rows_to_delete.ctid;
But the use of window functions, subqueries etc. feels pretty inefficient.
Is there a better form?
Note that in our use case our unique column list varies up to ~10 columns so we don’t have a single unique key field – hence the RANK in the example. I suppose adding a sequence column could be used, but how much overhead does this add when doing bulk data loading?
【推荐】国内首个AI IDE,深度理解中文开发场景,立即下载体验Trae
【推荐】编程新体验,更懂你的AI,立即体验豆包MarsCode编程助手
【推荐】抖音旗下AI助手豆包,你的智能百科全书,全免费不限次数
【推荐】轻量又高性能的 SSH 工具 IShell:AI 加持,快人一步
· 10年+ .NET Coder 心语,封装的思维:从隐藏、稳定开始理解其本质意义
· .NET Core 中如何实现缓存的预热?
· 从 HTTP 原因短语缺失研究 HTTP/2 和 HTTP/3 的设计差异
· AI与.NET技术实操系列:向量存储与相似性搜索在 .NET 中的实现
· 基于Microsoft.Extensions.AI核心库实现RAG应用
· 10年+ .NET Coder 心语 ── 封装的思维:从隐藏、稳定开始理解其本质意义
· 地球OL攻略 —— 某应届生求职总结
· 提示词工程——AI应用必不可少的技术
· Open-Sora 2.0 重磅开源!
· 周边上新:园子的第一款马克杯温暖上架