【编程技巧】结合JPA通用分页与排序技术，支持百万级以上数据的DML批量处理方案

Posted on 2024-06-22 16:53 召冠阅读(50) 评论(0) 编辑收藏举报

百万级以上的大数据量的批量更新或删除，如果直接采用传统update、delete等DML，在高并发的生产系统有可能导致数据库宕机乃至触发数据库bug导致数据不一致等问题。

考虑了一下可行的解决方案：

方案一，只保留少量数据的删除场景，可以先将保留的数据查出并插入中间表，truncate原表然后再将中间表插入回原表

方案描述已经很明确了，只能适用于非常特定的场景，保留的数据量也很多，或者是数据的更新场景都没法使用

方案二，最朴素的想法，两层循环：外循环每次查询1w条，内循环每次删除1k条delete from t where id in (0…999) 或 delete from t where id= ?1 or id = ?2 …..

此方案的弊端是每批次操作的数据量太少、应用与DB的交互次数太多，并且根据实际验证SQL脚本太长、in或or的数据量超过500+等情况，在部分数据库上的性能急剧下降。

方案三，采用各数据库的大批量删除方案

此方案的问题是各数据库的方言差异很大，对支持多种数据库的通用产品来说，需要逐个适配。

Oracle：
     delete from products where update_time < to_timestamp(‘2010-01-01 00:00:00’, ‘yyyy-mm-dd hh24:mi:ss’) and rownum <=100000;

SqlServer：
     DELETE TOP (100000)     FROM Purchasing.PurchaseOrderDetail     WHERE DueDate < '20020701';

MySQL：
   delete from t where DueDate < '20020701' limit = 100000;

PostgreSQL：
   delete from t where id in (select id from t where DueDate < '20020701'  limit 100000);

方案四，即另一种方案实践：排序字段分页查询与DML方案+JPA屏蔽数据库差异+数据库索引与批量优势

1、批量删除的场景

// Service层示例代码如下：
String parentId = "50d29225-7fc1-4a5d-a77a-d89e0a315158";
Sort sort = Sort.by(Sort.Direction.ASC, "Id");
Pageable pageable = PageRequest.of(10000, 10, sort);
List list = null;
do {
     list = repository.findByParentId(parentId, pageable);
     if (list == null || list.size() < 10) {  // 不足10w数据，直接删除
         repository.deleteAllByParentId(parentId);
     }
     else {  // 超过10w数据，仅删除前10w数据
         repository.deleteAllByParentIdAndLessThanId(parentId, list.get(list.size()));
     }
} while (list != null && list.size() >= 10);


// Repository层示例代码如下：
@Transactional
@Modifying
@Query(value=”delete from entityName where parentId = ?1")
void deleteAllByParentId(String parentId);

@Transactional
@Modifying
@Query(value=”delete from entityName where parentId = ?1 and id <= ?2")
void deleteAllByParentIdAndLessThanId(String parentId, String id);

@Query(value=”select * from entityName where parentId = ?1")
List queryByParentId(String parentId, Pageable pageable);

2、批量更新的场景

// Service层示例代码：
String parentId = "50d29225-7fc1-4a5d-a77a-d89e0a315158";
Sort sort = Sort.by(Sort.Direction.ASC, "Id");
Pageable pageable = PageRequest.of(10000, 10, sort);
List list = null;
String preid = “”;
do {
     list = repository.queryByParentIdAndGreaterThanId(parentId, preid, pageable);

     if (list == null || list.size() < 10) {  // 不足10w数据，统一更新
        repository.updateByParentIdAndBetweenId(parentId, preid, null);
     }
    else {  // 超过10w数据，仅更新两次id之间的10w数据
        repository.updateByParentIdAndBetweenId(parentId, preid, list.get(list.size()));

        preid = list.get(list.size());
     }
} while (list != null && list.size() >= 10);



// Repository层示例代码：
@Transactional
@Modifying
@Query(value=”update entityName set qty = qty+1 where parentId = ?1 and id > ?2 and (id <= ?3 or  ?3 is null)")
void updateByParentIdAndBetweenId(String parentId, String preId, String nextId);

// 批量更新场景，查询时需要排除已经处理过的数据
@Query(value=”select * from entityName where parentId = ?1 and id > ?2)
List queryByParentIdAndGreaterThanId(String parentId, String id, Pageable pageable);