Computing Edit Distance and String Similarity in Python

Levenshtein module

distance(s1, s2, *, weights=(1, 1, 1), processor=None, score_cutoff=None)

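Parameters: s1 and s2 are the two strings to compare; weights=(1, 1, 1) gives the costs of insertion, deletion, and substitution, in that order; processor is an optional callable that preprocesses both inputs before they are compared.
Returns the edit distance between s1 and s2 as an integer.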

from Levenshtein import distance
distance("lewenstein", "levenshtein")  # 2
distance(["lewenstein"], ["levenshtein"], processor=lambda s: s[0])  # 2

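As the second call shows, when a processor is given, s1 and s2 can be passed as lists; here the processor picks the string out of each list before the distance is computed.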
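To make the role of weights and processor concrete, here is a minimal pure-Python sketch of a weighted edit distance. It is not the library's implementation: the name weighted_distance is hypothetical, and the sketch only assumes that the weights are ordered (insertion, deletion, substitution) and that the processor is applied to both inputs.

```python
def weighted_distance(s1, s2, weights=(1, 1, 1), processor=None):
    """Cost of turning s1 into s2 (hypothetical helper, not the Levenshtein API)."""
    if processor is not None:
        # Assumption: the processor is applied to both inputs.
        s1, s2 = processor(s1), processor(s2)
    insert_cost, delete_cost, replace_cost = weights

    # prev[j] is the cost of turning the empty prefix of s1 into s2[:j].
    prev = [j * insert_cost for j in range(len(s2) + 1)]
    for i, c1 in enumerate(s1, start=1):
        # Turning s1[:i] into the empty string takes i deletions.
        curr = [i * delete_cost]
        for j, c2 in enumerate(s2, start=1):
            substitution = prev[j - 1] + (0 if c1 == c2 else replace_cost)
            deletion = prev[j] + delete_cost
            insertion = curr[j - 1] + insert_cost
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]


uniform = weighted_distance("lewenstein", "levenshtein")
costly_substitution = weighted_distance("lewenstein", "levenshtein", weights=(1, 1, 2))
from_lists = weighted_distance(["lewenstein"], ["levenshtein"], processor=lambda s: s[0])
```

With uniform weights the pair needs one substitution (w to v) and one insertion (h). Raising the substitution cost to 2 makes a substitution as expensive as a deletion plus an insertion, so the total rises by one. The weights are also asymmetric: with weights=(5, 1, 1), turning "abc" into "abcd" costs 5, while the reverse direction costs 1.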

ratio(s1, s2, *, processor=None, score_cutoff=None)

Returns the similarity between s1 and s2 as a float in the range [0, 1].

seqratio(strlist1, strlist2)；setratio(strlist1, strlist2)

Compute the similarity of two lists of strings (seqratio) or of two sets of strings (setratio).

Rapidfuzz

process.cdist(queries,choices,scorer,processor,score_cutoff,workers)

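queries and choices are both lists of strings; scorer selects the scoring function and defaults to ratio; workers sets the number of CPU cores to use, with -1 meaning all available cores.
Returns a matrix of scores, with one row per query and one column per choice.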

from rapidfuzz import process
process.cdist(["北京","南京","东京"],['北北北',"南京市","日本东京"])
Out[10]:
array([[40.      , 40.      , 33.333332],
       [ 0.      , 80.      , 33.333332],
       [ 0.      , 40.      , 66.666664]], dtype=float32)
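Conceptually, the matrix holds one score per (query, choice) pair. The sketch below reproduces the values above in pure Python, assuming the default scorer is the normalized Indel similarity 100 * 2 * LCS / (len(a) + len(b)), where LCS is the length of the longest common subsequence. The names lcs_length, indel_ratio, and pairwise_scores are hypothetical; the real cdist returns a NumPy array and can spread the work over several cores.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def indel_ratio(a, b):
    """Normalized Indel similarity on a 0 to 100 scale."""
    total = len(a) + len(b)
    if total == 0:
        # Choice made for this sketch: two empty strings count as identical.
        return 100.0
    return 100.0 * 2 * lcs_length(a, b) / total


def pairwise_scores(queries, choices, scorer=indel_ratio):
    """One row per query, one column per choice (plain nested lists)."""
    return [[scorer(query, choice) for choice in choices] for query in queries]


matrix = pairwise_scores(["北京", "南京", "东京"], ["北北北", "南京市", "日本东京"])
for row in matrix:
    print([round(score, 2) for score in row])
```

For example, "南京" and "南京市" share a common subsequence of length 2 out of 5 characters in total, which gives 100 * 4 / 5 = 80.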

The remaining functions, such as extract and extractOne, work much like their Thefuzz counterparts.

Rapidfuzz.distance

This module mainly contains functions related to the various edit operations.

rapidfuzz.distance.Levenshtein.distance(s1,s2,weights,processor,score_cutoff)

It does the same job as the function of the same name in the Levenshtein module, but according to Rapidfuzz this implementation is faster.

rapidfuzz.fuzz.ratio(s1,s2,processor,score_cutoff)

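Calculates the normalized Indel similarity, a score between 0 and 100. The Indel distance counts only insertions and deletions.
Note that fuzz.ratio does not compute the Levenshtein distance!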
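The difference is easy to see on a small example. The following sketch computes both distances with the same dynamic program, once with substitutions allowed (Levenshtein) and once with insertions and deletions only (Indel), and then normalizes the Indel distance to a 0 to 100 score. The function name edit_distance and its allow_substitution flag are hypothetical and exist only for this illustration.

```python
def edit_distance(s1, s2, allow_substitution=True):
    """Unit-cost edit distance; without substitutions it is the Indel distance."""
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        curr = [i]
        for j, c2 in enumerate(s2, start=1):
            if c1 == c2:
                # Matching characters cost nothing.
                curr.append(prev[j - 1])
            else:
                best = min(prev[j], curr[j - 1]) + 1  # delete or insert
                if allow_substitution:
                    best = min(best, prev[j - 1] + 1)  # substitute
                curr.append(best)
        prev = curr
    return prev[-1]


s1, s2 = "lewenstein", "levenshtein"
levenshtein = edit_distance(s1, s2)
indel = edit_distance(s1, s2, allow_substitution=False)
# Normalized Indel similarity on a 0 to 100 scale.
indel_score = 100.0 * (1 - indel / (len(s1) + len(s2)))
print(levenshtein, indel, round(indel_score, 2))
```

Replacing w with v counts as one edit for Levenshtein, but as a deletion plus an insertion for Indel, so the two distances differ (2 versus 3) and the resulting score is about 85.71 rather than a value derived from the Levenshtein distance.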

Functions in Thefuzz

extractWithoutOrder

Select the best match in a list or dictionary of choices.
Find best matches in a list or dictionary of choices, return a generator of tuples containing the match and its score.
If a dictionary is used, also returns the key for each match.

extract

Select the best match in a list or dictionary of choices.
Find best matches in a list or dictionary of choices, return a list of tuples containing the match and its score.
If a dictionary is used, also returns the key for each match.
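The behaviour described above can be sketched in a few lines. This is an illustration only, not Thefuzz's code: extract_sketch and stand_in_scorer are hypothetical names, the scorer is a stand-in built on the standard library's difflib rather than Thefuzz's default scorer, and the limit parameter with a default of 5 is an assumption.

```python
from difflib import SequenceMatcher


def stand_in_scorer(a, b):
    """Stand-in similarity on a 0 to 100 scale (not Thefuzz's default scorer)."""
    return 100.0 * SequenceMatcher(None, a, b).ratio()


def extract_sketch(query, choices, scorer=stand_in_scorer, limit=5):
    """Return the best matches, highest score first."""
    if isinstance(choices, dict):
        # For a dictionary, each result also carries the key of the match.
        scored = [(value, scorer(query, value), key) for key, value in choices.items()]
    else:
        scored = [(choice, scorer(query, choice)) for choice in choices]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored if limit is None else scored[:limit]


menu = {"a": "apple pie", "b": "apple", "c": "banana"}
top_two = extract_sketch("apple", menu, limit=2)
```

Here top_two starts with the exact match "apple" under key "b", followed by "apple pie" under key "a"; "banana" is dropped by the limit.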

extractBests

Get a list of the best matches to a collection of choices.
Convenience function for getting the choices with best scores

extractOne

Parameters: (query, choices, processor=default_processor, scorer=default_scorer, score_cutoff=0)
Find the single best match above a score in a list of choices.
This is a convenience method which returns the single best choice.
See extract() for the full arguments list.
query: A string to match against
choices: A list or dictionary of choices, suitable for use with extract().
processor: Optional function for transforming choices before matching. See extract().
scorer: Scoring function for extract().
score_cutoff: Optional argument for score threshold. If the best match is found, but it is not greater than this number, then return None anyway ("not a good enough match"). Defaults to 0.
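A rough sketch of how score_cutoff interacts with the best match is given below. It is not Thefuzz's implementation: extract_one_sketch and stand_in_scorer are hypothetical names, the scorer is a difflib-based stand-in, the processor is assumed to be applied to the query as well as to the choices, and a score exactly equal to the cutoff is assumed to be accepted.

```python
from difflib import SequenceMatcher


def stand_in_scorer(a, b):
    """Stand-in similarity on a 0 to 100 scale (not Thefuzz's default scorer)."""
    return 100.0 * SequenceMatcher(None, a, b).ratio()


def extract_one_sketch(query, choices, processor=None, scorer=stand_in_scorer, score_cutoff=0):
    """Return (choice, score) for the best match, or None if nothing reaches the cutoff."""
    if processor is None:
        processor = lambda s: s  # no preprocessing
    processed_query = processor(query)
    best = None
    for choice in choices:
        score = scorer(processed_query, processor(choice))
        # Assumption: a score equal to the cutoff still counts as good enough.
        if score >= score_cutoff and (best is None or score > best[1]):
            best = (choice, score)
    return best


teams = ["Atlanta Falcons", "New York Jets", "New York Giants", "Dallas Cowboys"]
match = extract_one_sketch("new york jet", teams, processor=str.lower)
no_match = extract_one_sketch("new york jet", teams, processor=str.lower, score_cutoff=99)
```

The first call returns the original, unprocessed choice "New York Jets" together with its score; the second returns None because no choice reaches a score of 99.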

dedupe

This convenience function takes a list of strings containing duplicates and uses fuzzy matching to identify and remove duplicates.
Specifically, it uses process.extract to identify duplicates that score greater than a user-defined threshold.
Then, it looks for the longest item in the duplicate list since we assume this item contains the most entity information and returns that.
It breaks string length ties on an alphabetical sort.
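The steps above can be sketched as follows. This is an illustration of the described procedure, not Thefuzz's code: dedupe_sketch and stand_in_scorer are hypothetical names, the scorer is a difflib-based stand-in, and the default threshold of 70 is an assumption.

```python
from difflib import SequenceMatcher


def stand_in_scorer(a, b):
    """Stand-in similarity on a 0 to 100 scale (not Thefuzz's default scorer)."""
    return 100.0 * SequenceMatcher(None, a, b).ratio()


def dedupe_sketch(items, threshold=70, scorer=stand_in_scorer):
    """Keep one representative per group of near-duplicates."""
    kept = []
    for item in items:
        # Entries scoring above the threshold count as duplicates of item.
        # Fall back to the item itself if the threshold excludes everything.
        duplicates = [other for other in items if scorer(item, other) > threshold] or [item]
        # Prefer the longest entry; break ties on length alphabetically.
        representative = min(duplicates, key=lambda s: (-len(s), s))
        if representative not in kept:
            kept.append(representative)
    return kept


names = ["Frodo Baggins", "Frodo Baggin", "Samwise Gamgee", "Frodo Baggins"]
unique_names = dedupe_sketch(names)
```

In this example "Frodo Baggin" and both copies of "Frodo Baggins" form one group, represented by the longest entry, while "Samwise Gamgee" has no near-duplicates and is kept as is.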

posted @ 2023-03-15 16:38 梁书源
