Data Cleaning with Parallel Multithreading

1. Overview

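While developing a data-cleaning job, I had an ES data set of 6 million records, each with dozens of child objects, and I needed the deduplicated set of those child objects. The ES data was pulled in batches, 535 batches in total. I started with a plain List; the key code is as follows: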

            var specListAll = new List<SpecInfo>();
            for (int i = 0; i < batchCount; i++)
            {
                // Fetch one batch of data from ES
                // Extract each record's child collection into list
                // Deduplicate, then add to the new collection
                foreach (var specDesc in list)
                {
                    if (specListAll.Count(w => w.NameJoinValue == specDesc.NameJoinValue) == 0)
                        specListAll.Add(specDesc);
                }
            }

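Measured with a timer, the first batch took 3 minutes 29 seconds, and 15542 items ended up in specListAll after deduplication. Extrapolated to 535 batches, the whole run would take roughly 31.2 hours.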
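The timing and extrapolation described above can be sketched with System.Diagnostics.Stopwatch. This is only an illustration: the class name BatchTiming and the methods Measure and EstimateTotal are hypothetical and not from the original code, and the estimate assumes every batch costs the same as the measured one.

```csharp
using System;
using System.Diagnostics;

public static class BatchTiming
{
    // Runs one batch (passed in as a delegate) and returns the elapsed wall-clock time.
    public static TimeSpan Measure(Action batch)
    {
        var sw = Stopwatch.StartNew();
        batch();
        sw.Stop();
        return sw.Elapsed;
    }

    // Linear extrapolation: assumes each batch takes as long as the measured one.
    public static TimeSpan EstimateTotal(TimeSpan oneBatch, int batchCount)
    {
        return TimeSpan.FromTicks(oneBatch.Ticks * batchCount);
    }
}
```

Plugging in the 3 minutes 29 seconds measured above with 535 batches gives about 31 hours. Because the Count-based check scans all of specListAll every time, later batches are likely slower than the first, so a linear estimate of this kind is probably optimistic.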

Next, I used Parallel multithreading to deduplicate and add to the new collection. The key code is as follows:

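            var specListAll = new ConcurrentBag<SpecInfo>();
            for (int i = 0; i < batchCount; i++)
            {
                // Fetch one batch of data from ES
                // Extract each record's child collection into list
                // Deduplicate, then add to the new collection
                Parallel.ForEach(list, specDesc =>
                {
                    // Note: the check and the Add are not atomic, so two threads
                    // can both see a count of 0 and add the same NameJoinValue.
                    if (specListAll.Count(w => w.NameJoinValue == specDesc.NameJoinValue) == 0)
                        specListAll.Add(specDesc);
                });
            }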

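Measured with a timer, the first batch took 2 minutes 19 seconds, and 15542 items ended up in specListAll after deduplication. Extrapolated to 535 batches, the whole run would take roughly 20.8 hours.

Finally, looking at the CPU, usage was about 30% higher with Parallel multithreading.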
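One caveat with checking Count and then calling Add inside Parallel.ForEach is that the two steps are not atomic, so duplicates can slip in under contention. A minimal sketch of an alternative, assuming NameJoinValue is a non-null string that identifies a duplicate, is to key a ConcurrentDictionary on it and rely on TryAdd. The SpecInfo class below is a hypothetical stand-in (only the NameJoinValue property comes from the code above), and ParallelDedup.AddDistinct is an illustrative name.

```csharp
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical stand-in for the article's SpecInfo type.
public class SpecInfo
{
    public string NameJoinValue { get; set; } = "";
}

public static class ParallelDedup
{
    // Adds each item of the batch to 'seen' unless its key is already present.
    public static ICollection<SpecInfo> AddDistinct(
        ConcurrentDictionary<string, SpecInfo> seen,
        IEnumerable<SpecInfo> batch)
    {
        Parallel.ForEach(batch, spec =>
        {
            // TryAdd is atomic per key: only one thread can insert a given
            // NameJoinValue, and the others get false and skip the item.
            seen.TryAdd(spec.NameJoinValue, spec);
        });
        return seen.Values;
    }
}
```

The same dictionary would be passed in for every batch, so the lookup per item is a hash probe rather than a scan of everything collected so far.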

2. Improvement

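During the cleaning run I found that deduplicating with specListAll.Count was very time-consuming. After the change below, the whole cleaning finished in a little over 2 hours. The code is as follows: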

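            var specListAll = new ConcurrentBag<SpecInfo>();
            for (int i = 0; i < batchCount; i++)
            {
                // Fetch one batch of data from ES
                // Extract each record's child collection into list
                // Add every item to the bag without checking for duplicates
                Parallel.ForEach(list, specDesc =>
                {
                    //if (specListAll.Count(w => w.NameJoinValue == specDesc.NameJoinValue) == 0)
                    specListAll.Add(specDesc);
                });
            }
            // Deduplicate once at the end.
            // Note: Distinct() uses SpecInfo's default equality, so this assumes
            // SpecInfo overrides Equals/GetHashCode to compare by NameJoinValue.
            var specListAll2 = specListAll.Distinct().ToList();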
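Distinct() without a comparer uses the element type's default equality, which for a class that does not override Equals and GetHashCode is reference equality. The source does not show whether SpecInfo does, so as a hedged alternative, here is a minimal sketch of deduplicating by an explicit key. KeyDedup.DistinctByKey is a hypothetical helper, not part of the original code; on .NET 6 and later the built-in Enumerable.DistinctBy covers the same need.

```csharp
using System;
using System.Collections.Generic;

public static class KeyDedup
{
    // Returns the first item seen for each key, in encounter order.
    public static List<T> DistinctByKey<T, TKey>(
        IEnumerable<T> items, Func<T, TKey> keySelector)
    {
        var seen = new HashSet<TKey>();
        var result = new List<T>();
        foreach (var item in items)
        {
            // HashSet.Add returns false when the key is already present,
            // so each key costs a hash lookup instead of a full scan.
            if (seen.Add(keySelector(item)))
                result.Add(item);
        }
        return result;
    }
}
```

Applied to the code above, this would read KeyDedup.DistinctByKey(specListAll, s => s.NameJoinValue). Since a ConcurrentBag does not guarantee any ordering, which of several duplicates is kept is not deterministic.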

posted on 2024-03-21 18:09 by 花阴偷移
