代码背景:

  • 由于Excel本身无法简单的比较两个Excel数据的异同,所以用VBA编写代码的方式来实现。
  • 这里的比较条件是:数据行为单位,假设对应Sheet中没有重复数据,对应数据行的所有列的数据都相等,即为此行数据相同。
  • 这里的两个Sheet的数据行量级别大约为:50000 * 50000,数据列大约:50,对应Cell中的字符串大约100以内,中英文混合。
  • 如何在Excel中调出VBA的编写工具,请参考如下链接: https://jingyan.baidu.com/article/63f236281f17650208ab3d97.html

 

整体来说,需求非常明确,若是不考虑效率的话,代码逻辑比较简单,循环比较即可。

相关代码:

Sub CompareData()

    Dim i As Long
    Dim j As Long
    
    Dim fullSheetName As String
    fullSheetName = "Sheet1"
    Set fullSheet = Sheets(fullSheetName)
    Dim fullDataRange As Variant
    fullDataRange = fullSheet.Range("A1", "AT80000").CurrentRegion.Value
    Dim fullSheetRowMax As Long
    fullSheetRowMax = fullSheet.Range("A1", "AT80000").CurrentRegion.Rows.Count
    
    Dim partialSheetName As String
    partialSheetName = "Sheet2"
    Set partialSheet = Sheets(partialSheetName)
    Dim partialDataRange As Variant
    partialDataRange = partialSheet.Range("A1", "AT80000").CurrentRegion.Value
    Dim partialSheetRowMax As Long
    partialSheetRowMax = partialSheet.Range("A1", "AT80000").CurrentRegion.Rows.Count

    Dim columnMax As Integer
    columnMax = 46
    
    Dim columnMark As Integer
    columnMark = 48
    
    Dim sameRow As Boolean
    
    For i = 1 To fullSheetRowMax
        For j = 1 To partialSheetRowMax
            sameRow = True
            For columnIndex = 1 To columnMax
                If fullDataRange(i, columnIndex) <> partialDataRange(j, columnIndex) Then
                    sameRow = False
                    Exit For
                End If
            Next columnIndex
            
            If sameRow Then
                fullSheet.Cells(i, columnMark) = 1
                Exit For
            End If
        Next j
    Next i
    
    MsgBox "Successfully!"
End Sub
View Code

上述代码实际运行大约用30分钟完成此数量级的比较,大约1000亿次的比较。

 

当然了我们需要更快的比较方式,那么就需要对大数据进行结构优化,即:将partial的sheet中的数据进行分组,比如每1000个row放到一个组里,然后用一个标志位标记这个组里1000个row是否都有相同的数据,如有都有的话,那么下次再比较的时候就可以跳过这个组,进行下一组的1000个row的循环遍历。相同数量级,大约2分钟比较完成。

注:实际数据是按照时间进行抽取出来的,所以partial的sheet数据 大致都在full的sheet的前半部分相同,如果数据无规律,非常混乱,那么还要对每一个row的数据进行结构优化,即:用另外一个标记为进行标记此row是否有相同的数据,判断的时候先判断这个标记位】

相关代码如下:

【注:函数中的一些变量都是HardCode的,要根据具体数据进行修改】

Public Type PartialBasedModule
    IsAllSame As Boolean
    SheetDataRange As Variant
    SameCount As Long
End Type


Sub CompareData()

    Dim i As Long
    Dim j As Long
    Dim k As Long
    
    Dim fullSheetName As String
    fullSheetName = "Sheet1"
    Set fullSheet = Sheets(fullSheetName)
    Dim fullDataRange As Variant
    fullDataRange = fullSheet.Range("A1", "AT80000").CurrentRegion.Value
    Dim fullSheetRowMax As Long
    fullSheetRowMax = fullSheet.Range("A1", "AT80000").CurrentRegion.Rows.Count
    
    Dim partialSheetName As String
    partialSheetName = "Sheet2"
    Set partialSheet = Sheets(partialSheetName)
    Dim PartialDataRange As Variant
    PartialDataRange = partialSheet.Range("A1", "AT80000").CurrentRegion.Value
    Dim partialSheetRowMax As Long
    partialSheetRowMax = partialSheet.Range("A1", "AT80000").CurrentRegion.Rows.Count
    
    Dim partialSheetPages() As PartialBasedModule
    partialSheetPages = SeparatePartialSheet(PartialDataRange, partialSheetRowMax)

    Dim columnMax As Integer
    columnMax = 46
    
    Dim columnMark As Integer
    columnMark = 48
    
    Dim sameRow As Boolean
    
    For i = 1 To fullSheetRowMax
        For j = 1 To UBound(partialSheetPages)
            If partialSheetPages(j).SameCount < 1000 Then
                For k = 1 To UBound(partialSheetPages(j).SheetDataRange)
                    sameRow = True
                    For ColumnIndex = 1 To columnMax
                        If fullDataRange(i, ColumnIndex) <> partialSheetPages(j).SheetDataRange(k, ColumnIndex) Then
                            sameRow = False
                            Exit For
                        End If
                    Next ColumnIndex
                    
                    If sameRow Then
                        fullSheet.Cells(i, columnMark) = 1
                        partialSheetPages(j).SameCount = partialSheetPages(j).SameCount + 1
                        Exit For
                    End If
                Next k
            Else
                sameRow = False
            End If
            
            If sameRow Then
                Exit For
            End If
        Next j
    Next i
    
    MsgBox "Successfully!"
End Sub

Public Function SeparatePartialSheet(ByRef PartialDataRange As Variant, ByVal rowCount As Long) As PartialBasedModule()
    Dim eachPageCount As Long
    eachPageCount = 1000
    Dim pageCount As Integer
    pageCount = Int(rowCount / eachPageCount) + 1
    Dim pageIndex As Long
    
    Dim pageArr() As PartialBasedModule
    Dim startIndex As Long
    Dim endIndex As Long
    
    For pageIndex = 1 To pageCount
        Dim seperatedDataRange(1 To 1000, 1 To 46) As Variant
        Dim seperatedIndex As Long
        seperatedIndex = 1
        Dim colIndex As Integer
        
        If pageIndex < pageCount Then
            endIndex = pageIndex * eachPageCount
        Else
            endIndex = rowCount
        End If
        
        
        For startIndex = (pageIndex - 1) * eachPageCount + 1 To endIndex
            For colIndex = 1 To 46
                seperatedDataRange(seperatedIndex, colIndex) = PartialDataRange(startIndex, colIndex)
            Next colIndex
            seperatedIndex = seperatedIndex + 1
        Next startIndex
        
        Dim pageData As PartialBasedModule
        pageData.SheetDataRange = seperatedDataRange
        pageData.SameCount = 0
        pageData.IsAllSame = False
        
        ReDim Preserve pageArr(pageIndex)
        pageArr(pageIndex) = pageData
    Next pageIndex
    
    
    SeparatePartialSheet = pageArr
End Function
View Code

 

给每个Row都加上标记的代码如下所示:【相同界别的数据,大约1分钟完成比较】

Public Type RowModule
    IsSame As Boolean
    RowData As Variant
End Type

Public Type PartialBasedModule
    IsAllSame As Boolean
    SheetDataRange() As RowModule
    SameCount As Long
End Type


Sub CompareData()

    Dim i As Long
    Dim j As Long
    Dim k As Long
    
    Dim fullSheetName As String
    fullSheetName = "Sheet1"
    Set fullSheet = Sheets(fullSheetName)
    Dim fullDataRange As Variant
    fullDataRange = fullSheet.Range("A1", "AT80000").CurrentRegion.Value
    Dim fullSheetRowMax As Long
    fullSheetRowMax = fullSheet.Range("A1", "AT80000").CurrentRegion.Rows.Count
    
    Dim partialSheetName As String
    partialSheetName = "Sheet2"
    Set partialSheet = Sheets(partialSheetName)
    Dim PartialDataRange As Variant
    PartialDataRange = partialSheet.Range("A1", "AT80000").CurrentRegion.Value
    Dim partialSheetRowMax As Long
    partialSheetRowMax = partialSheet.Range("A1", "AT80000").CurrentRegion.Rows.Count
    
    Dim partialSheetPages() As PartialBasedModule
    partialSheetPages = SeparatePartialSheet(PartialDataRange, partialSheetRowMax)

    Dim columnMax As Integer
    columnMax = 46
    
    Dim columnMark As Integer
    columnMark = 48
    
    Dim sameRow As Boolean
    
    For i = 1 To fullSheetRowMax
        For j = 1 To UBound(partialSheetPages)
            If partialSheetPages(j).SameCount < 1000 Then
                For k = 1 To UBound(partialSheetPages(j).SheetDataRange)
                    sameRow = True
                    
                    If partialSheetPages(j).SheetDataRange(k).IsSame Then
                        sameRow = False
                    Else
                        For ColumnIndex = 1 To columnMax
                            If fullDataRange(i, ColumnIndex) <> partialSheetPages(j).SheetDataRange(k).RowData(ColumnIndex) Then
                                sameRow = False
                                Exit For
                            End If
                        Next ColumnIndex
                        
                        If sameRow Then
                            fullSheet.Cells(i, columnMark) = 1
                            partialSheetPages(j).SheetDataRange(k).IsSame = True
                            partialSheetPages(j).SameCount = partialSheetPages(j).SameCount + 1
                            Exit For
                        End If
                    End If
                Next k
            Else
                sameRow = False
            End If
            
            If sameRow Then
                Exit For
            End If
        Next j
    Next i
    
    MsgBox "Successfully!"
End Sub

Public Function SeparatePartialSheet(ByRef PartialDataRange As Variant, ByVal rowCount As Long) As PartialBasedModule()
    Dim eachPageCount As Long
    eachPageCount = 1000
    Dim pageCount As Integer
    pageCount = Int(rowCount / eachPageCount) + 1
    Dim pageIndex As Long
    
    Dim pageArr() As PartialBasedModule
    Dim startIndex As Long
    Dim endIndex As Long
    
    For pageIndex = 1 To pageCount
        Dim seperatedDataRange(1 To 1000) As RowModule
        Dim dataRows(1 To 1000) As Variant
        Dim seperatedIndex As Long
        seperatedIndex = 1
        Dim colIndex As Integer
        
        If pageIndex < pageCount Then
            endIndex = pageIndex * eachPageCount
        Else
            endIndex = rowCount
        End If
        
        
        For startIndex = (pageIndex - 1) * eachPageCount + 1 To endIndex
            Dim dataRow(1 To 46) As Variant
            For colIndex = 1 To 46
                dataRow(colIndex) = PartialDataRange(startIndex, colIndex)
            Next colIndex
            
            Dim currentRowModule As RowModule
            currentRowModule.RowData = dataRow
            currentRowModule.IsSame = False
            
            seperatedDataRange(seperatedIndex) = currentRowModule
            seperatedIndex = seperatedIndex + 1
        Next startIndex
        
        Dim pageData As PartialBasedModule
        pageData.SheetDataRange = seperatedDataRange
        pageData.SameCount = 0
        pageData.IsAllSame = False
        
        ReDim Preserve pageArr(pageIndex)
        pageArr(pageIndex) = pageData
    Next pageIndex
    
    
    SeparatePartialSheet = pageArr
End Function
View Code

 

最终的一个简单的数据结构如下图所示:

 

posted @ 2020-05-01 11:51 Eric Sun 阅读(4198) 评论(0) 推荐(0) 编辑
摘要: 在实际项目的开发过程中,会有这样的功能需求:要求创建一些Job定时触发运行,比如进行一些数据的同步。 那么在 .Net Framework 中如何实现这个Timer Job的功能呢? 这里所讲的是借助第三方的组件 Quartz.Net 来实现(源码位置:https://github.com/quar 阅读全文
posted @ 2018-05-13 20:17 Eric Sun 阅读(11015) 评论(0) 推荐(1) 编辑
摘要: 之前的一篇文章讲述了如何通过 Quartz .NET 实现 Timer Job (http://www.cnblogs.com/mingmingruyuedlut/p/8037263.html) 在此基础上如何将实现的Timer Job注册成为Windows Service,请看如下步骤: 1):在 阅读全文
posted @ 2018-05-13 20:16 Eric Sun 阅读(2348) 评论(0) 推荐(0) 编辑
摘要: 某个项目中涉及到这样一个情景: VB/C#写的原始项目要调用Python的一些方法完成特殊的操作, 那么这就涉及到了,在.Net Framework中如何调用Python的脚本方法。 具体步骤流程如下所示: 1): 展示一个简单的Python代码,即传递一个参数,然后返回修改后字符串,此文件名称为 阅读全文
posted @ 2018-05-09 11:28 Eric Sun 阅读(9389) 评论(0) 推荐(0) 编辑
摘要: 请参考如下连接的开源项目 源代码在本地测试好用。 注: 这里所用的Python 版本是 2.7.11 https://github.com/dmroeder/pylogix 阅读全文
posted @ 2018-05-08 08:02 Eric Sun 阅读(675) 评论(0) 推荐(0) 编辑
摘要: 用C#如何生成二维码,我们可以通过现有的第三方dll直接来实现,下面列出几种不同的生成方法: 1):通过QrCodeNet(Gma.QrCodeNet.Encoding.dll)来实现 1.1):首先通过VS2015的NuGet下载对应的第三方组件,如下图所示: 1.2):具体生成二维码方法如下 更 阅读全文
posted @ 2016-12-01 09:34 Eric Sun 阅读(27357) 评论(3) 推荐(5) 编辑
摘要: 应用场景: 有的时候需要通过条件来判断某个字段是否可以被序列化,例如:再写WebApi的时候,只有当Api方法出错的时候,才将error的具体信息返回去,如果方法正常运行就不返回error字段。 第一种方式(JSON): 我们可以用JSON.NET的 ShouldSerialize 语法 详细信息请 阅读全文
posted @ 2016-11-10 15:46 Eric Sun 阅读(1559) 评论(0) 推荐(0) 编辑
摘要: 应用场景: 很多时候我们要在Web.Config中添加appSettings的键值对来标识一些全局的信息,比如:调用service的domain,跳转其他网站页面的url 等等; 那么此时就涉及到了一个问题,如何在JS中去访问这些config信息? 我们主要的思路就是:在后台读取appSetting 阅读全文
posted @ 2016-11-10 14:41 Eric Sun 阅读(4597) 评论(0) 推荐(0) 编辑
摘要: 通过手机短信验证码的方式进行身份验证越来越普遍, 这里就简单的介绍一下 如何通过 “云之讯” 平台完成短信验证码的发送。“云之讯”平台只负责往指定的手机号码中发送短信,具体的短信模板需要在“云之讯”管理平台设定好,并且其余的逻辑控制都需要自己处理, 比如:验证码多长时间过期,保证验证码的唯一性 等等... 阅读全文
posted @ 2016-01-15 15:34 Eric Sun 阅读(4609) 评论(2) 推荐(0) 编辑
摘要: Serilog 是一种非常简便记录log 的处理方式,使用Serilog可以生成本地的text文件, 也可以通过 Seq 来在Web界面中查看具体的log内容。接下来就简单的介绍一下在Asp.Net MVC中如何配置是Serilog 生效:1):下载并且安装Seq,具体的下载URL 为 【http:... 阅读全文
posted @ 2016-01-15 14:59 Eric Sun 阅读(3919) 评论(1) 推荐(2) 编辑
点击右上角即可分享
微信分享提示