Computing Edit Distance and String Similarity in Python

Levenshtein module

distance(s1, s2, *, weights=(1, 1, 1), processor=None, score_cutoff=None)

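  • Parameters: s1 and s2 are the two strings to compare; the three values in weights=(1, 1, 1) are the costs of insertion, deletion and substitution, in that order; processor is an optional function that preprocesses each input before comparison.
  • Returns the edit distance between s1 and s2 as an integer.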
from Levenshtein import distance
distance("lewenstein", "levenshtein")  # out:2
distance(["lewenstein"], ["levenshtein"], processor=lambda s: s[0])  # out:2

As the second call shows, once a processor is specified, s1 and s2 can also be passed as lists.
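To make the weights parameter concrete, here is a minimal pure-Python sketch of a weighted edit distance. It is only an illustration of the idea, not the library's implementation, and the function name weighted_edit_distance is made up for this example. With substitution priced at 2, a substitution is never cheaper than a deletion plus an insertion, so the result becomes the Indel distance.

```python
def weighted_edit_distance(s1, s2, weights=(1, 1, 1)):
    """Edit distance from s1 to s2 with per-operation costs.

    weights is (insertion, deletion, substitution).
    Illustrative sketch only; hypothetical helper, not a library function.
    """
    ins, dele, sub = weights
    # prev[j] = cost of turning the processed prefix of s1 into s2[:j]
    prev = [j * ins for j in range(len(s2) + 1)]
    for i, c1 in enumerate(s1, start=1):
        curr = [i * dele]  # turning s1[:i] into the empty string
        for j, c2 in enumerate(s2, start=1):
            curr.append(min(
                prev[j] + dele,                          # delete c1
                curr[j - 1] + ins,                       # insert c2
                prev[j - 1] + (0 if c1 == c2 else sub),  # keep or substitute
            ))
        prev = curr
    return prev[-1]


print(weighted_edit_distance("lewenstein", "levenshtein"))             # 2
print(weighted_edit_distance("lewenstein", "levenshtein", (1, 1, 2)))  # 3
```

The first call agrees with the distance() result above; the second shows how the same pair scores once substitutions cost as much as a deletion plus an insertion.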

ratio(s1, s2, *, processor=None, score_cutoff=None)

  • Returns the normalized similarity of s1 and s2, a float between 0 and 1.

seqratio(strlist1, strlist2);setratio(strlist1, strlist2)

  • Compute the similarity of two lists of strings, compared as sequences (seqratio) or as sets (setratio).

Rapidfuzz

process.cdist(queries,choices,scorer,processor,score_cutoff,workers)

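  • queries and choices are both lists of strings; scorer selects the scoring function and defaults to ratio; workers sets how many CPU cores to use, with -1 meaning all available cores.
  • Returns a matrix of scores with one row per query and one column per choice.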
from rapidfuzz import process
process.cdist(["北京","南京","东京"],['北北北',"南京市","日本东京"])
Out[10]:
array([[40.      , 40.      , 33.333332],
       [ 0.      , 80.      , 33.333332],
       [ 0.      , 40.      , 66.666664]], dtype=float32)
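The numbers in this matrix can be checked by hand. The sketch below assumes the default ratio scorer equals 100 * 2 * LCS / (len(s1) + len(s2)), where LCS is the length of the longest common subsequence; that formula is the usual definition of normalized Indel similarity, and the helper names lcs_length and indel_similarity are made up for this example. It is a plain-Python illustration, not how Rapidfuzz computes the result.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def indel_similarity(a, b):
    """Normalized Indel similarity on a 0-100 scale (assumed formula)."""
    total = len(a) + len(b)
    if total == 0:
        return 100.0  # two empty strings are treated as identical
    return 100.0 * 2 * lcs_length(a, b) / total


queries = ["北京", "南京", "东京"]
choices = ["北北北", "南京市", "日本东京"]

# One row per query, one column per choice, like the cdist result.
matrix = [[round(indel_similarity(q, c), 2) for c in choices] for q in queries]
for row in matrix:
    print(row)
# [40.0, 40.0, 33.33]
# [0.0, 80.0, 33.33]
# [0.0, 40.0, 66.67]
```

For example, "南京" and "南京市" share a common subsequence of length 2 out of 5 characters in total, which gives 100 * 4 / 5 = 80.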

The remaining functions, such as extract and extractOne, work much like their counterparts in Thefuzz.

Rapidfuzz.distance

  • Mainly contains functions related to the various edit operations.

rapidfuzz.distance.Levenshtein.distance(s1,s2,weights,processor,score_cutoff)

  • Does the same job as the function of the same name in the Levenshtein module, but according to Rapidfuzz this version is faster.

rapidfuzz.fuzz.ratio(s1,s2,processor,score_cutoff)

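  • Calculates the normalized Indel similarity, on a scale from 0 to 100.
  • Note that fuzz.ratio is not based on the Levenshtein distance: the Indel distance allows only insertions and deletions, not substitutions.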

Functions in Thefuzz

extractWithoutOrder

  • Select the best match in a list or dictionary of choices.
  • Find best matches in a list or dictionary of choices, return a generator of tuples containing the match and its score.
  • If a dictionary is used, also returns the key for each match.

extract

  • Select the best match in a list or dictionary of choices.
  • Find best matches in a list or dictionary of choices, return a list of tuples containing the match and its score.
  • If a dictionary is used, also returns the key for each match.

extractBests

  • Get a list of the best matches to a collection of choices.
  • Convenience function for getting the choices with best scores

extractOne

  • Parameters: (query, choices, processor=default_processor, scorer=default_scorer, score_cutoff=0)
  • Find the single best match above a score in a list of choices.
  • This is a convenience method which returns the single best choice.
  • See extract() for the full arguments list.
  • query: A string to match against.
  • choices: A list or dictionary of choices, suitable for use with extract().
  • processor: Optional function for transforming choices before matching. See extract().
  • scorer: Scoring function for extract().
  • score_cutoff: Optional argument for score threshold. If the best match is found, but it is not greater than this number, then return None anyway ("not a good enough match"). Defaults to 0.

dedupe

  • This convenience function takes a list of strings containing duplicates and uses fuzzy matching to identify and remove duplicates.
  • Specifically, it uses the process.extract to identify duplicates that score greater than a user defined threshold.
  • Then, it looks for the longest item in the duplicate list since we assume this item contains the most entity information and returns that.
  • It breaks string length ties on an alphabetical sort.
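The steps described for dedupe can be sketched in plain Python as follows. This is an illustration under stated assumptions, not Thefuzz's code: difflib.SequenceMatcher stands in for the library's scorer, the threshold of 70 is only an example value, and the direction of the alphabetical tie-break (earlier string wins) is an assumption. The names stand_in_score and dedupe_sketch are made up for this example.

```python
from difflib import SequenceMatcher


def stand_in_score(a, b):
    """Similarity on a 0-100 scale; a stand-in, not Thefuzz's scorer."""
    return 100 * SequenceMatcher(None, a, b).ratio()


def dedupe_sketch(items, threshold=70, scorer=stand_in_score):
    """Keep one representative per group of near-duplicates (sketch)."""
    kept = set()
    for item in items:
        # Everything scoring above the threshold counts as a duplicate of
        # item; item itself always qualifies because it scores 100.
        group = [other for other in items if scorer(item, other) > threshold]
        # Longest string first; equal lengths fall back to alphabetical order.
        group.sort(key=lambda s: (-len(s), s))
        kept.add(group[0])
    return sorted(kept)


names = ["Frodo Baggins", "Frodo Baggin", "Samwise Gamgee"]
print(dedupe_sketch(names))  # ['Frodo Baggins', 'Samwise Gamgee']
```

"Frodo Baggin" is folded into "Frodo Baggins" because the longer string is assumed to carry more information, while "Samwise Gamgee" scores well below the threshold against both and is kept as is.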
posted @ 2023-03-15 16:38  梁书源