1.问题描述

  • 7月4号,18:53分,个人在值班时,发现别的开发组的服务WMS-SEVICE出现服务负载飙高的告警。

  • 自动生成的Dump

2.问题分析排查过程:

  • 看下服务监控

    • 如下图,在18:50~18:55分这段时间内,确实服务CUP很高。

 为什么这么高啊?有大的网络IO操作吗?继续排查。

  • 分析jstack文件

    • 打开jstack文件,搜索"com.pupu",发现是截图所示的代码可能导致的这个问题

"http-nio-7143-exec-5" #282 daemon prio=5 os_prio=0 tid=0x000055b9e71fc000 nid=0x250 runnable [0x00007fdb5ba8b000]

   java.lang.Thread.State: RUNNABLE

	at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:174)

	at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1384)

	at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)

	at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)

	at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)

	at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)

	at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:566)

	at com.pupu.wms.domain.reserve.config.repository.ReserveResourceConfigRepository.updateCapacityConfig(ReserveResourceConfigRepository.java:140)
  • 线程正在执行updateCapacityConfig方法.
  • 该方法内部使用了java流(java.util.stream包)来处理数据。
  • 流操作涉及到了accept方法(在ReferencePipeline$2$1类中),这通常与流的forEach,map,filter等中间操作相关。
  • 流正在处理一个ArrayList(在ArrayList$ArrayListSpliterator类中),并且正在执行forEachRemaining方法来遍历列表中的元素。
  • 最终,流操作会导致数据的收集(collect方法),这可能是将流中的元素聚合到一个新的集合中。

简单理解就是updateCapacityConfig()方法中流操作导致的。

找到对应的代码:

@Transactional(rollbackFor = Exception.class)
    public Integer updateCapacityConfig(List<ReserveReceivingCapacityConfigPO> pos) {
        if (CollectionUtils.isEmpty(pos)) {
            return 0;
        }
        List<ReserveReceivingCapacityConfig> poReserveReceivingCapacityConfigs = ConverterUtil.convert(pos, ReserveReceivingCapacityConfig.class);
        List<String> storeIds = poReserveReceivingCapacityConfigs.stream().map(ReserveReceivingCapacityConfig::getStoreId).distinct().collect(Collectors.toList());
        List<Integer> dates = poReserveReceivingCapacityConfigs.stream().map(ReserveReceivingCapacityConfig::getConfigDate).distinct().sorted().collect(Collectors.toList());

        LambdaQueryWrapper<ReserveReceivingCapacityConfig> query = Wrappers.lambdaQuery();
        query.eq(ReserveReceivingCapacityConfig :: getHasDeleted,false);
        query.in(ReserveReceivingCapacityConfig :: getStoreId,storeIds);
        query.between(ReserveReceivingCapacityConfig :: getConfigDate,dates.get(0),dates.get(dates.size() - 1));
        List<ReserveReceivingCapacityConfig> reserveReceivingCapacityConfigs = reserveReceivingCapacityConfigMapper.selectList(query);

        Iterator<ReserveReceivingCapacityConfig> iterator = poReserveReceivingCapacityConfigs.iterator();
        List<Long> removeIds = Lists.newArrayList();

        while (iterator.hasNext()) {
            ReserveReceivingCapacityConfig po = iterator.next();
            List<ReserveReceivingCapacityConfig> list = reserveReceivingCapacityConfigs.stream().filter(r -> r.getStoreId().equals(po.getStoreId()) &&
                    r.getConfigDate().equals(po.getConfigDate()) &&
                    r.getConfigHour().equals(po.getConfigHour()))
                    .collect(Collectors.toList());
            if (CollectionUtils.isNotEmpty(list)) {
                // 如果完全相同则不更新
                List<ReserveReceivingCapacityConfig> removes = list.stream().filter(r ->  r.getSkuCount().equals(po.getSkuCount()) && r.getProductCount().equals(po.getProductCount())).collect(Collectors.toList());
                if (CollectionUtils.isNotEmpty(removes)) {
                    iterator.remove();
                } else {
                    // 如果配置发生变化,则移除旧的保存新的
                    removeIds.addAll(list.stream().map(ReserveReceivingCapacityConfig::getId).distinct().collect(Collectors.toList()));
                }
            }
        }
        if (CollectionUtils.isNotEmpty(removeIds)) {
            log.warn("更新预约容量配置:removeIds = {}", JSONObject.toJSONString(removeIds));
            LambdaUpdateWrapper<ReserveReceivingCapacityConfig> updateWrapper = Wrappers.lambdaUpdate();
            updateWrapper
                    .in(ReserveReceivingCapacityConfig::getId, removeIds)
                    .set(ReserveReceivingCapacityConfig::getHasDeleted, true)
                    .set(ReserveReceivingCapacityConfig::getTimeUpdate,System.currentTimeMillis());
            reserveReceivingCapacityConfigMapper.update(updateWrapper);
        }
        if (CollectionUtils.isNotEmpty(poReserveReceivingCapacityConfigs)) {
            log.warn("更新预约容量配置:pos = {}", JSONObject.toJSONString(poReserveReceivingCapacityConfigs));
            return reserveReceivingCapacityConfigMapper.insertBatchSomeColumn(poReserveReceivingCapacityConfigs);
        }
        return 0;
    }

 jstack文件提示的updateCapacityConfig方法中的140行如下:

 List<ReserveReceivingCapacityConfig> list = reserveReceivingCapacityConfigs.stream().filter(r -> r.getStoreId().equals(po.getStoreId()) &&
                    r.getConfigDate().equals(po.getConfigDate()) &&
                    r.getConfigHour().equals(po.getConfigHour()))
                    .collect(Collectors.toList());

确实是stream流操作.继续往下走.

看下这段代码的调用流程.下面是这段代码所在的方法,从该方法可以看到该代码位于循环内,且循环会遍历每个po对象。

假设下面的poReserveReceivingCapacityConfigs集合有200条数据,查询结果reserveReceivingCapacityConfigs集合有100条数据,

这段代码在每次批量调用中可能会被执行高达2万次(200个po对象 * 每次查询返回的约100条记录)

@Transactional(rollbackFor = Exception.class)
    public Integer updateCapacityConfig(List<ReserveReceivingCapacityConfigPO> pos) {
        if (CollectionUtils.isEmpty(pos)) {
            return 0;
        }
        List<ReserveReceivingCapacityConfig> poReserveReceivingCapacityConfigs = ConverterUtil.convert(pos, ReserveReceivingCapacityConfig.class);
        List<String> storeIds = poReserveReceivingCapacityConfigs.stream().map(ReserveReceivingCapacityConfig::getStoreId).distinct().collect(Collectors.toList());
        List<Integer> dates = poReserveReceivingCapacityConfigs.stream().map(ReserveReceivingCapacityConfig::getConfigDate).distinct().sorted().collect(Collectors.toList());

        LambdaQueryWrapper<ReserveReceivingCapacityConfig> query = Wrappers.lambdaQuery();
        query.eq(ReserveReceivingCapacityConfig :: getHasDeleted,false);
        query.in(ReserveReceivingCapacityConfig :: getStoreId,storeIds);
        query.between(ReserveReceivingCapacityConfig :: getConfigDate,dates.get(0),dates.get(dates.size() - 1));
        List<ReserveReceivingCapacityConfig> reserveReceivingCapacityConfigs = reserveReceivingCapacityConfigMapper.selectList(query);
        Iterator<ReserveReceivingCapacityConfig> iterator = poReserveReceivingCapacityConfigs.iterator();
        List<Long> removeIds = Lists.newArrayList();

        while (iterator.hasNext()) {
            ReserveReceivingCapacityConfig po = iterator.next();
            List<ReserveReceivingCapacityConfig> list = reserveReceivingCapacityConfigs.stream().filter(r -> r.getStoreId().equals(po.getStoreId()) &&
                    r.getConfigDate().equals(po.getConfigDate()) &&
                    r.getConfigHour().equals(po.getConfigHour()))
                    .collect(Collectors.toList());
            if (CollectionUtils.isNotEmpty(list)) {
                // 如果完全相同则不更新
                List<ReserveReceivingCapacityConfig> removes = list.stream().filter(r ->  r.getSkuCount().equals(po.getSkuCount()) && r.getProductCount().equals(po.getProductCount())).collect(Collectors.toList());
                if (CollectionUtils.isNotEmpty(removes)) {
                    iterator.remove();
                } else {
                    // 如果配置发生变化,则移除旧的保存新的
                    removeIds.addAll(list.stream().map(ReserveReceivingCapacityConfig::getId).distinct().collect(Collectors.toList()));
                }
            }
        }
        if (CollectionUtils.isNotEmpty(removeIds)) {
            log.warn("更新预约容量配置:removeIds = {}", JSONObject.toJSONString(removeIds));
            LambdaUpdateWrapper<ReserveReceivingCapacityConfig> updateWrapper = Wrappers.lambdaUpdate();
            updateWrapper
                    .in(ReserveReceivingCapacityConfig::getId, removeIds)
                    .set(ReserveReceivingCapacityConfig::getHasDeleted, true)
                    .set(ReserveReceivingCapacityConfig::getTimeUpdate,System.currentTimeMillis());
            reserveReceivingCapacityConfigMapper.update(updateWrapper);
        }
        if (CollectionUtils.isNotEmpty(poReserveReceivingCapacityConfigs)) {
            log.warn("更新预约容量配置:pos = {}", JSONObject.toJSONString(poReserveReceivingCapacityConfigs));
            return reserveReceivingCapacityConfigMapper.insertBatchSomeColumn(poReserveReceivingCapacityConfigs);
        }
        return 0;
    }

看下这个方法在哪里调用的?如下截图所示,每100分批调用的.分批调用了多少次?根据log.warn("更新预约配置")的日志,查询到调用了75次。意思就是这个要分批的saveList集合总共有7500条.其实这个方法主要干的事情就是做数据的更新操作.

再回到调用的代码处,由上面的分析得出,updateCapacityConfig(List<ReserveReceivingCapacityConfigPO> pos)这个方法被调用了75次,每次调用方法入参集合是100条数据,reserveReceivingCapacityConfigMapper.selectList(query)我抽取了3次调用的结果,看了一下返回结果集合,有返回100条的,有返回95条的,有返回97条的,这3次调用就取个平均值97.reserveReceivingCapacityConfigs集合大小可以看做是97.

@Transactional(rollbackFor = Exception.class)
    public Integer updateCapacityConfig(List<ReserveReceivingCapacityConfigPO> pos) {
        if (CollectionUtils.isEmpty(pos)) {
            return 0;
        }
        List<ReserveReceivingCapacityConfig> poReserveReceivingCapacityConfigs = ConverterUtil.convert(pos, ReserveReceivingCapacityConfig.class);
        List<String> storeIds = poReserveReceivingCapacityConfigs.stream().map(ReserveReceivingCapacityConfig::getStoreId).distinct().collect(Collectors.toList());
        List<Integer> dates = poReserveReceivingCapacityConfigs.stream().map(ReserveReceivingCapacityConfig::getConfigDate).distinct().sorted().collect(Collectors.toList());

        LambdaQueryWrapper<ReserveReceivingCapacityConfig> query = Wrappers.lambdaQuery();
        query.eq(ReserveReceivingCapacityConfig :: getHasDeleted,false);
        query.in(ReserveReceivingCapacityConfig :: getStoreId,storeIds);
        query.between(ReserveReceivingCapacityConfig :: getConfigDate,dates.get(0),dates.get(dates.size() - 1));
        List<ReserveReceivingCapacityConfig> reserveReceivingCapacityConfigs = reserveReceivingCapacityConfigMapper.selectList(query);
        Iterator<ReserveReceivingCapacityConfig> iterator = poReserveReceivingCapacityConfigs.iterator();
        List<Long> removeIds = Lists.newArrayList();

        while (iterator.hasNext()) {
            ReserveReceivingCapacityConfig po = iterator.next();
            List<ReserveReceivingCapacityConfig> list = reserveReceivingCapacityConfigs.stream().filter(r -> r.getStoreId().equals(po.getStoreId()) &&
                    r.getConfigDate().equals(po.getConfigDate()) &&
                    r.getConfigHour().equals(po.getConfigHour()))
                    .collect(Collectors.toList());
            if (CollectionUtils.isNotEmpty(list)) {
                // 如果完全相同则不更新
                List<ReserveReceivingCapacityConfig> removes = list.stream().filter(r ->  r.getSkuCount().equals(po.getSkuCount()) && r.getProductCount().equals(po.getProductCount())).collect(Collectors.toList());
                if (CollectionUtils.isNotEmpty(removes)) {
                    iterator.remove();
                } else {
                    // 如果配置发生变化,则移除旧的保存新的
                    removeIds.addAll(list.stream().map(ReserveReceivingCapacityConfig::getId).distinct().collect(Collectors.toList()));
                }
            }
        }
        if (CollectionUtils.isNotEmpty(removeIds)) {
            log.warn("更新预约容量配置:removeIds = {}", JSONObject.toJSONString(removeIds));
            LambdaUpdateWrapper<ReserveReceivingCapacityConfig> updateWrapper = Wrappers.lambdaUpdate();
            updateWrapper
                    .in(ReserveReceivingCapacityConfig::getId, removeIds)
                    .set(ReserveReceivingCapacityConfig::getHasDeleted, true)
                    .set(ReserveReceivingCapacityConfig::getTimeUpdate,System.currentTimeMillis());
            reserveReceivingCapacityConfigMapper.update(updateWrapper);
        }
        if (CollectionUtils.isNotEmpty(poReserveReceivingCapacityConfigs)) {
            log.warn("更新预约容量配置:pos = {}", JSONObject.toJSONString(poReserveReceivingCapacityConfigs));
            return reserveReceivingCapacityConfigMapper.insertBatchSomeColumn(poReserveReceivingCapacityConfigs);
        }
        return 0;
    }

那结果就是在每次循环中,都会对reserveReceivingCapacityConfigs集合列表执行一次过滤操作。由于集合列表大小约为97条记录,且每次批量调用都会执行100次循环,因此这段代码在每次调用中可能会被执行高达9700次过滤操作(100次循环 * 97条记录).

List<ReserveReceivingCapacityConfig> list = reserveReceivingCapacityConfigs.stream().filter(r -> r.getStoreId().equals(po.getStoreId()) &&
                    r.getConfigDate().equals(po.getConfigDate()) &&
                    r.getConfigHour().equals(po.getConfigHour()))
                    .collect(Collectors.toList());
  • 调用链路

    • 整个链路调用,总共耗时619908毫秒.

  • 数据库DB监控

    • 看下出问题的时间段的数据库DB监控:证明这段时间“reservecapacityconfigs”这个表,确实有大量的数据库更新查询操作.

  • jmap文件

    • 再看看jmap文件:用"com.pupu"去搜索,产生了很多新的对象(ReserveReceivingCapacityConfig).如下图所示,其实就是出问题的那段代码导致的.每次循环产生了1个新的List<ReserveReceivingCapacityConfig>集合对象.循环了75*100=7500次.

  • List<ReserveReceivingCapacityConfig> list = reserveReceivingCapacityConfigs.stream().filter(r -> r.getStoreId().equals(po.getStoreId()) && r.getConfigDate().equals(po.getConfigDate()) && r.getConfigHour().equals(po.getConfigHour())) .collect(Collectors.toList());
  • 其实上面看它的执行链路,grafna上看数据库DB情况,这个只能证明出问题的确实是在updateCapacityConfig()这个方法中,匹配了这个方法的执行流程.比如这个方法中确实执行了很多次数据库的更新和查询操作,但定位这个问题需要结合jstack文件和jmap文件.jstack文件展示了这个线程堆栈跟踪的情况,jmap文件显示了Java堆中对象的统计信息,包括类名、对象数量以及占用的内存大小.

3.问题定位

  • 经过上面的这些排查,大概可以确定是下面如图所示的循环下(每次调用执行9700次过滤操作)的流操作引起的.
while (iterator.hasNext()) {
            ReserveReceivingCapacityConfig po = iterator.next();
            List<ReserveReceivingCapacityConfig> list = reserveReceivingCapacityConfigs.stream().filter(r -> r.getStoreId().equals(po.getStoreId()) &&
                    r.getConfigDate().equals(po.getConfigDate()) &&
                    r.getConfigHour().equals(po.getConfigHour()))
                    .collect(Collectors.toList());
  • 对于reserveReceivingCapacityConfigs列表中的每个元素,都会调用getStoreId().equals(),getConfigDate().equals()和getConfigHour().equals()方法。这些方法的执行需要CPU计算.虽然单个 equals方法的调用可能消耗的CPU资源很少,但在大量数据或频繁调用(9700次)的情况下,这种消耗会变得显著.
  • 过滤后的元素需要通过collect(Collectors.toList())收集到一个新的列表中.collect方法会遍历过滤后的元素,并将它们添加到一个新的ArrayList中。这个遍历和添加操作也需要CPU资源,在大量数据或频繁调用的情况下,这种消耗会变得显著.

4.优化建议

  • 因为这块代码是其他开发组的,只单纯的针对于这个问题,是否可以不要在while循环里面每次遍历reserveReceivingCapacityConfigs这整个集合,可不可以把reserveReceivingCapacityConfigs这个集合转换成1个Map,map的key,就是那些比较字段合在一起,比如:storeId_configdate_configHour,value就是对应的每个对象.然后while循环去遍历的时候,直接拼装成key,去map中取就行了,不用每次都要再遍历整个reserveReceivingCapacityConfigs集合了.

5.问题处理

  • 停掉该接口对应的定时任务,不让其继续运行.

6.处理结果

  • 1.确认对业务没有影响,新建的表,新写的接口,业务还没有开始使用
  • 2.第二天,优化后发版上线
  • 3.生产问题复盘
posted on 2024-11-16 12:07  路飞_lufei  阅读(2)  评论(0编辑  收藏  举报