Understanding the .get_opcodes() method of SequenceMatcher in Python's difflib library

The method returns a list of tuples like the following:

('equal', 0, 16, 0, 16) ('replace', 16, 19, 16, 19), i.e. the differences
How should a result like this be read?

from difflib import SequenceMatcher
def compare_texts(text1, text2):
    matcher = SequenceMatcher(None, text1, text2)
    return matcher.get_opcodes()  # returns a list of tuples such as ('equal', 0, 16, 0, 16), ('replace', 16, 19, 16, 19), i.e. the differences

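Below is the source code of .get_opcodes(). Setting the algorithm aside for now, the fields of each returned tuple mean the following:

opcode (called tag in the source): the type of operation (equal, insert, delete, or replace).
i1, i2: the range a[i1:i2] in the first text that the operation covers.
j1, j2: the range b[j1:j2] in the second text that the operation covers.

Focus on the docstring first. Together with its example, it makes the meaning of the return value easy to understand.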

    def get_opcodes(self):
        """Return list of 5-tuples describing how to turn a into b.

        Each tuple is of the form (tag, i1, i2, j1, j2).  The first tuple
        has i1 == j1 == 0, and remaining tuples have i1 == the i2 from the
        tuple preceding it, and likewise for j1 == the previous j2.

        The tags are strings, with these meanings:

        'replace':  a[i1:i2] should be replaced by b[j1:j2]
        'delete':   a[i1:i2] should be deleted.
                    Note that j1==j2 in this case.
        'insert':   b[j1:j2] should be inserted at a[i1:i1].
                    Note that i1==i2 in this case.
        'equal':    a[i1:i2] == b[j1:j2]

        >>> a = "qabxcd"
        >>> b = "abycdf"
        >>> s = SequenceMatcher(None, a, b)
        >>> for tag, i1, i2, j1, j2 in s.get_opcodes():
        ...    print(("%7s a[%d:%d] (%s) b[%d:%d] (%s)" %
        ...           (tag, i1, i2, a[i1:i2], j1, j2, b[j1:j2])))
         delete a[0:1] (q) b[0:0] ()
          equal a[1:3] (ab) b[0:2] (ab)
        replace a[3:4] (x) b[2:3] (y)
          equal a[4:6] (cd) b[3:5] (cd)
         insert a[6:6] () b[5:6] (f)
        """

        if self.opcodes is not None:
            return self.opcodes
        i = j = 0
        self.opcodes = answer = []
        for ai, bj, size in self.get_matching_blocks():
            # invariant:  we've pumped out correct diffs to change
            # a[:i] into b[:j], and the next matching block is
            # a[ai:ai+size] == b[bj:bj+size].  So we need to pump
            # out a diff to change a[i:ai] into b[j:bj], pump out
            # the matching block, and move (i,j) beyond the match
            tag = ''
            if i < ai and j < bj:
                tag = 'replace'
            elif i < ai:
                tag = 'delete'
            elif j < bj:
                tag = 'insert'
            if tag:
                answer.append( (tag, i, ai, j, bj) )
            i, j = ai+size, bj+size
            # the list of matching blocks is terminated by a
            # sentinel with size 0
            if size:
                answer.append( ('equal', ai, i, bj, j) )
        return answer
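
One way to check your reading of the four tags is to follow the opcodes and rebuild b from a. The following is a minimal sketch, not part of difflib; apply_opcodes is a hypothetical helper name introduced only for illustration.

```python
from difflib import SequenceMatcher

def apply_opcodes(a, b):
    """Rebuild b from a by following the opcodes (hypothetical helper)."""
    pieces = []
    for tag, i1, i2, j1, j2 in SequenceMatcher(None, a, b).get_opcodes():
        if tag == "equal":
            pieces.append(a[i1:i2])   # unchanged span: copy it from a
        elif tag in ("replace", "insert"):
            pieces.append(b[j1:j2])   # new content comes from b
        # 'delete': a[i1:i2] is dropped, so nothing is appended
    return "".join(pieces)

rebuilt = apply_opcodes("qabxcd", "abycdf")
print(rebuilt)  # abycdf
```

The 'equal' branch could just as well copy b[j1:j2], since the two slices are identical by definition.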

Let me illustrate this passage with a concrete example:

        Each tuple is of the form (tag, i1, i2, j1, j2).  The first tuple
        has i1 == j1 == 0, and remaining tuples have i1 == the i2 from the
        tuple preceding it, and likewise for j1 == the previous j2.

Suppose we have two strings, a = "abcde" and b = "acdf", and we want the sequence of operations that turns a into b. Calling get_opcodes gives the following list of opcodes:

[
 ('equal', 0, 1, 0, 1),   # a[0:1] ('a') == b[0:1] ('a')
 ('delete', 1, 2, 1, 1),  # a[1:2] ('b') should be deleted
 ('equal', 2, 4, 1, 3),   # a[2:4] ('cd') == b[1:3] ('cd')
 ('replace', 4, 5, 3, 4)  # a[4:5] ('e') should be replaced by b[3:4] ('f')
]

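Now let us go through the passage piece by piece:

"Each tuple is of the form (tag, i1, i2, j1, j2)." Every tuple has five elements: the tag, i1, i2, j1, and j2.
"The first tuple has i1 == j1 == 0." In the first tuple both i1 and j1 are 0, which means the operations start at the beginning of both strings. In our example the first tuple is ('equal', 0, 1, 0, 1), which does satisfy i1 == j1 == 0.
"Remaining tuples have i1 == the i2 from the tuple preceding it, and likewise for j1 == the previous j2." In every later tuple, i1 is the previous tuple's i2 and j1 is the previous tuple's j2. We can check that the rule holds in our example:
- The first tuple is ('equal', 0, 1, 0, 1), so i2 is 1 and j2 is also 1.
- The second tuple is ('delete', 1, 2, 1, 1). Here i1 is 1, exactly the previous tuple's i2, and j1 is also 1, exactly the previous tuple's j2.
- Likewise, the third tuple's i1 is 2, exactly the previous tuple's i2, and its j1 is 1, exactly the previous tuple's j2.
- The pattern continues through the last tuple.

The example shows how the tuples chain together: each tuple's start indices match the previous tuple's end indices, so together the tuples cover both strings with no gaps.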
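
The chaining rule can also be checked mechanically. Below is a small sketch under the assumption that we only want to verify the rule quoted from the docstring; is_contiguous is a hypothetical helper, not a difflib function.

```python
from difflib import SequenceMatcher

def is_contiguous(opcodes):
    """Check the chaining rule from the docstring (hypothetical helper)."""
    prev_i2 = prev_j2 = 0  # the first tuple must start at i1 == j1 == 0
    for _tag, i1, i2, j1, j2 in opcodes:
        if i1 != prev_i2 or j1 != prev_j2:
            return False
        prev_i2, prev_j2 = i2, j2
    return True

a, b = "abcde", "acdf"
opcodes = SequenceMatcher(None, a, b).get_opcodes()
chained = is_contiguous(opcodes)

# The last tuple ends exactly at the end of both strings.
covers_all = opcodes[-1][2] == len(a) and opcodes[-1][4] == len(b)

print(chained, covers_all)  # True True
```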

Here is a more complete example:

from difflib import SequenceMatcher

def compare_texts(text1, text2):
    # The first parameter is named isjunk; passing None means no element is junk.
    matcher = SequenceMatcher(None, text1, text2)
    return matcher.get_opcodes()

str1 = 'kitten'
str2 = 'sitting'

print(compare_texts(str1, str2))

Result:

[('replace', 0, 1, 0, 1), 
('equal', 1, 4, 1, 4), 
('replace', 4, 5, 4, 5), 
('equal', 5, 6, 5, 6),
('insert', 6, 6, 6, 7)]
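
To make the raw indices easier to read, the same result can be printed next to the slices it refers to. This is only a formatting sketch in the spirit of the docstring example; the layout of each output line is my own choice.

```python
from difflib import SequenceMatcher

a, b = "kitten", "sitting"
lines = []
for tag, i1, i2, j1, j2 in SequenceMatcher(None, a, b).get_opcodes():
    # Show each opcode together with the slices of a and b it covers.
    lines.append(f"{tag:7} a[{i1}:{i2}]={a[i1:i2]!r} b[{j1}:{j2}]={b[j1:j2]!r}")
print("\n".join(lines))
# replace a[0:1]='k' b[0:1]='s'
# equal   a[1:4]='itt' b[1:4]='itt'
# replace a[4:5]='e' b[4:5]='i'
# equal   a[5:6]='n' b[5:6]='n'
# insert  a[6:6]='' b[6:7]='g'
```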

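Follow-up question: what does the isjunk parameter of SequenceMatcher mean?

The isjunk parameter (the actual name has no underscore) tells SequenceMatcher which elements to treat as "junk" when comparing two sequences, typically things like whitespace or blank lines.

Junk elements are not used as starting points when the matcher searches for the longest matching block. They are not removed from the input, though: a matching block can still be extended across junk that sits right next to it, and junk still counts toward the sequence lengths used by ratio().

Put simply, isjunk is either None or a function that takes one argument (an element of the sequence) and returns a boolean. If it returns True, the element is treated as junk; if it returns False, the element is considered significant and takes part in the comparison as usual. Passing None means no element is junk.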
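
As a small sketch of how the predicate is applied, the bjunk attribute of a SequenceMatcher holds the elements of the second sequence that the predicate flagged as junk. The function name is_whitespace and the two sample strings are assumptions made for illustration.

```python
from difflib import SequenceMatcher

def is_whitespace(ch):
    """Hypothetical junk predicate: treat spaces and newlines as junk."""
    return ch in " \n"

s = SequenceMatcher(is_whitespace, "a b\nc", "a  b c\n")

# bjunk is the set of elements of the second sequence flagged as junk.
junk_found = s.bjunk
print(sorted(junk_found))
```

Only the second sequence is analyzed this way, so an element that appears only in the first sequence never shows up in bjunk.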

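For example:

Suppose we have two strings, each holding a code snippet, and we want to compare how similar they are without letting whitespace characters (spaces and newlines) drive the matching.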

from difflib import SequenceMatcher

a = "def my_function(x):\n    return x + 1\n"
b = "def my_function(x):\n    return x + 2\n"

# Create a SequenceMatcher instance with an isjunk function
s = SequenceMatcher(lambda x: x in " \n", a, b)

# Print the similarity ratio
print(s.ratio())

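In the example above, the isjunk argument is a lambda that checks whether an element is a space or a newline. If it is, the element is treated as junk and is not used to anchor a match.

When the instance s compares a and b, it therefore builds its matching blocks around the non-whitespace characters. a and b differ only in the value returned on the last line, so the two strings are still judged to be structurally very similar.

Calling s.ratio() returns a float between 0 and 1 that expresses how similar the two sequences are. In this example a and b differ in a single character and most of the structure is identical, so the ratio is high.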
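
The documented definition of ratio() is 2.0*M / T, where M is the number of matched elements and T is the total number of elements in both sequences. The sketch below recomputes that value from get_matching_blocks() and, for comparison, builds a second matcher without a junk predicate; the variable names are my own, and the exact ratio values are deliberately left unstated.

```python
from difflib import SequenceMatcher

a = "def my_function(x):\n    return x + 1\n"
b = "def my_function(x):\n    return x + 2\n"

s = SequenceMatcher(lambda x: x in " \n", a, b)

# M = total size of all matching blocks, T = combined length of both inputs.
matched = sum(size for _i, _j, size in s.get_matching_blocks())
total = len(a) + len(b)
manual_ratio = 2.0 * matched / total

print(manual_ratio == s.ratio())  # True

# Same inputs with no junk predicate, to see how isjunk shifts the result.
plain = SequenceMatcher(None, a, b)
print(s.ratio(), plain.ratio())
```

Because T includes every character, whitespace marked as junk still counts toward the ratio, so the two matchers can report different values.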

By using the isjunk parameter, we can control the comparison more precisely so that the result better fits what we actually need.

posted @ 2024-12-30 14:22 AlphaGeek
