Computing edit distance and string similarity in Python
Levenshtein module
distance(s1, s2, *, weights=(1, 1, 1), processor=None, score_cutoff=None)
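- Parameters: s1 and s2 are the two strings to compare; the three values in weights=(1, 1, 1) are the costs of insertion, deletion, and substitution, in that order; processor is an optional callable applied to s1 and s2 before they are compared.
- Returns the edit distance between s1 and s2 as an integer.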
from Levenshtein import distance
distance("lewenstein", "levenshtein") # out:2
distance(["lewenstein"], ["levenshtein"], processor=lambda s: s[0]) # out:2
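As the second call shows, when a processor is supplied, s1 and s2 can be passed as lists: the processor (here, one that takes the first element) turns each argument into the string that is actually compared.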
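To make the role of weights concrete, here is a minimal pure-Python sketch of the dynamic-programming recurrence behind a weighted edit distance. The function name weighted_levenshtein is hypothetical and the code is for illustration only; it is not how the Levenshtein module is implemented.

```python
def weighted_levenshtein(s1, s2, weights=(1, 1, 1)):
    """Cost of turning s1 into s2; weights = (insertion, deletion, substitution)."""
    insert_cost, delete_cost, substitute_cost = weights
    # prev[j] = cost of turning the empty prefix of s1 into s2[:j] (j insertions)
    prev = [j * insert_cost for j in range(len(s2) + 1)]
    for i, c1 in enumerate(s1, start=1):
        cur = [i * delete_cost]  # turning s1[:i] into "" takes i deletions
        for j, c2 in enumerate(s2, start=1):
            cur.append(min(
                prev[j] + delete_cost,                               # drop c1
                cur[j - 1] + insert_cost,                            # add c2
                prev[j - 1] + (0 if c1 == c2 else substitute_cost),  # keep or replace
            ))
        prev = cur
    return prev[-1]


print(weighted_levenshtein("lewenstein", "levenshtein"))                     # 2
print(weighted_levenshtein("lewenstein", "levenshtein", weights=(1, 1, 2)))  # 3
```

With a substitution weight of 2, replacing w with v is no cheaper than a deletion plus an insertion, so the distance rises from 2 to 3.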
ratio(s1, s2, *, processor=None, score_cutoff=None)
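- Returns the similarity of s1 and s2 as a float between 0 and 1 (a normalized Indel similarity, where 1 means the strings are identical).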
seqratio(strlist1, strlist2);setratio(strlist1, strlist2)
- Compute the similarity of two lists of strings (seqratio) or of two sets of strings (setratio).
Rapidfuzz
process.cdist(queries,choices,scorer,processor,score_cutoff,workers)
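- queries and choices are both lists of strings; scorer selects the scoring function and defaults to ratio; workers sets how many parallel workers are used, and -1 means use all available CPU cores.
- Returns a matrix (a NumPy array) with one row per query and one column per choice.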
from rapidfuzz import process
process.cdist(["北京","南京","东京"],['北北北',"南京市","日本东京"])
Out[10]:
array([[40. , 40. , 33.333332],
[ 0. , 80. , 33.333332],
[ 0. , 40. , 66.666664]], dtype=float32)
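The layout of that matrix can be reproduced by hand. The sketch below is a plain-Python illustration, not Rapidfuzz's implementation: indel_similarity and pairwise_scores are hypothetical names, and the scorer assumes the default ratio is the normalized Indel similarity on a 0 to 100 scale, derived here from the length of the longest common subsequence (LCS).

```python
def indel_similarity(s1, s2):
    """Normalized Indel similarity on a 0-100 scale, computed from the LCS length."""
    total = len(s1) + len(s2)
    if total == 0:
        return 100.0
    # Rolling-row dynamic programming for the longest common subsequence length.
    prev = [0] * (len(s2) + 1)
    for c1 in s1:
        cur = [0]
        for j, c2 in enumerate(s2, start=1):
            cur.append(prev[j - 1] + 1 if c1 == c2 else max(prev[j], cur[j - 1]))
        prev = cur
    # Every character outside the LCS must be deleted or inserted.
    indel_distance = total - 2 * prev[-1]
    return 100.0 * (1 - indel_distance / total)


def pairwise_scores(queries, choices, scorer=indel_similarity):
    """Row i, column j holds scorer(queries[i], choices[j])."""
    return [[scorer(q, c) for c in choices] for q in queries]


for row in pairwise_scores(["北京", "南京", "东京"], ["北北北", "南京市", "日本东京"]):
    print([round(score, 2) for score in row])
```

Rounded to two decimals, the three printed rows agree with the cdist output above: each row belongs to one query and each column to one choice.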
The remaining functions, such as extract and extractOne, work much like their counterparts in Thefuzz.
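rapidfuzz.distance
- Mainly contains the functions related to the various edit operations and distance metrics.
rapidfuzz.distance.Levenshtein.distance(s1,s2,weights,processor,score_cutoff)
- Behaves the same as the function of the same name in the Levenshtein module, but according to Rapidfuzz this implementation is faster.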
rapidfuzz.fuzz.ratio(s1,s2,processor,score_cutoff)
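- Calculates the normalized Indel similarity, on a scale from 0 to 100.
- Note that fuzz.ratio is not based on the Levenshtein distance: the Indel distance allows only insertions and deletions, so a substitution counts as two operations.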
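The difference is easy to see on a small example. The following sketch is plain Python rather than Rapidfuzz code, and edit_distance is a hypothetical helper. It relies on the fact that the Indel distance equals an edit distance in which a substitution costs as much as one deletion plus one insertion, and it normalizes the Levenshtein distance by the longer length and the Indel distance by the sum of the lengths.

```python
def edit_distance(s1, s2, substitution_cost=1):
    """Edit distance with unit insert/delete cost and a configurable substitution cost."""
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        cur = [i]
        for j, c2 in enumerate(s2, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (0 if c1 == c2 else substitution_cost)))
        prev = cur
    return prev[-1]


s1, s2 = "lewenstein", "levenshtein"
levenshtein = edit_distance(s1, s2)               # substitutions cost 1
indel = edit_distance(s1, s2, substitution_cost=2)  # substitution = delete + insert

# Each distance is normalized by its own maximum before becoming a similarity.
levenshtein_similarity = 100 * (1 - levenshtein / max(len(s1), len(s2)))
indel_similarity = 100 * (1 - indel / (len(s1) + len(s2)))
print(levenshtein, round(levenshtein_similarity, 2))  # 2 81.82
print(indel, round(indel_similarity, 2))              # 3 85.71
```

The same pair of strings therefore gets a different score depending on which metric sits behind the scorer.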
Functions in Thefuzz
extractWithoutOrder
- Select the best match in a list or dictionary of choices.
- Find best matches in a list or dictionary of choices, return a generator of tuples containing the match and its score.
- If a dictionary is used, also returns the key for each match.
extract
- Select the best match in a list or dictionary of choices.
- Find best matches in a list or dictionary of choices, return a list of tuples containing the match and its score.
- If a dictionary is used, also returns the key for each match.
extractBests
- Get a list of the best matches to a collection of choices.
- Convenience function for getting the choices with best scores
extractOne
- Parameters: (query, choices, processor=default_processor, scorer=default_scorer, score_cutoff=0)
- Find the single best match above a score in a list of choices.
- This is a convenience method which returns the single best choice.
- See extract() for the full arguments list.
- query: A string to match against.
- choices: A list or dictionary of choices, suitable for use with extract().
- processor: Optional function for transforming choices before matching. See extract().
- scorer: Scoring function for extract().
- score_cutoff: Optional argument for score threshold. If the best match is found, but it is not greater than this number, then return None anyway ("not a good enough match"). Defaults to 0.
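The selection logic described for extractOne can be sketched in a few lines. This is an illustration only: extract_one_sketch and stand_in_scorer are hypothetical names, the scorer is a difflib-based stand-in rather than Thefuzz's default scorer, and how a score exactly equal to score_cutoff is treated is an assumption of the sketch.

```python
from difflib import SequenceMatcher


def stand_in_scorer(a, b):
    """Stand-in 0-100 score built on difflib; NOT Thefuzz's default scorer."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def extract_one_sketch(query, choices, scorer=stand_in_scorer, score_cutoff=0):
    """Return (choice, score) for the best-scoring choice, or None if it misses the cutoff."""
    best = None
    for choice in choices:
        score = scorer(query, choice)
        if best is None or score > best[1]:
            best = (choice, score)
    # Assumption: a score exactly equal to the cutoff still counts as a match.
    if best is None or best[1] < score_cutoff:
        return None
    return best


teams = ["Atlanta Falcons", "New York Jets", "New York Giants", "Dallas Cowboys"]
print(extract_one_sketch("new york jets", teams))          # best match and its score
print(extract_one_sketch("zzz", teams, score_cutoff=50))   # None
```

The second call returns None because no choice shares a character with the query, so even the best score stays below the cutoff.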
dedupe
- This convenience function takes a list of strings containing duplicates and uses fuzzy matching to identify and remove duplicates.
- Specifically, it uses process.extract to identify duplicates that score greater than a user-defined threshold.
- Then, it looks for the longest item in the duplicate list since we assume this item contains the most entity information and returns that.
- It breaks string length ties on an alphabetical sort.
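The steps listed for dedupe can be sketched as follows. The names dedupe_sketch and stand_in_scorer are hypothetical, the difflib-based scorer is a stand-in and not the scorer Thefuzz uses, and the threshold of 70 is chosen for illustration; only the grouping and tie-breaking rules follow the description above.

```python
from difflib import SequenceMatcher


def stand_in_scorer(a, b):
    """Stand-in 0-100 score built on difflib; NOT the scorer Thefuzz uses."""
    return 100 * SequenceMatcher(None, a, b).ratio()


def dedupe_sketch(items, threshold=70, scorer=stand_in_scorer):
    """Keep one representative per group of near-duplicates."""
    kept = []
    for item in items:
        # Every string scoring above the threshold (normally including item itself)
        # is treated as a duplicate of item; fall back to item if nothing qualifies.
        group = [other for other in items if scorer(item, other) > threshold] or [item]
        # The longest string wins; length ties are broken alphabetically.
        representative = min(group, key=lambda s: (-len(s), s))
        if representative not in kept:
            kept.append(representative)
    return kept


print(dedupe_sketch(["apple pie", "apple pies", "banana"]))  # ['apple pies', 'banana']
print(dedupe_sketch(["color", "colon"]))                     # ['colon']
```

In the second call both strings have the same length, so the alphabetical tie-break decides which one is kept.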