String similarity in R: the stringdist package
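String similarity can be computed with the adist function in the utils package, the stringDist function in the MKmisc package, or distance functions such as jarowinkler in the RecordLinkage package. This article covers the stringdist and stringdistmatrix functions of the stringdist package.
The stringdist package is written by Mark van der Loo.
stringdist computes the distance between the strings in a and b element by element; when one argument has fewer elements than the other, the shorter one is recycled. stringdistmatrix returns the full distance matrix, with the strings of a as rows and the strings of b as columns.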
stringdist(a, b, method = c("osa", "lv", "dl", "hamming", "lcs", "qgram", "cosine", "jaccard", "jw", "soundex"), useBytes = FALSE, weight = c(d = 1, i = 1, s = 1, t = 1), maxDist = Inf, q = 1, p = 0, nthread = getOption("sd_num_thread"))
stringdistmatrix(a, b, method = c("osa", "lv", "dl", "hamming", "lcs", "qgram", "cosine", "jaccard", "jw", "soundex"), useBytes = FALSE, weight = c(d = 1, i = 1, s = 1, t = 1), maxDist = Inf, q = 1, p = 0, useNames = c("none", "strings", "names"), ncores = 1, cluster = NULL, nthread = getOption("sd_num_thread"))
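Parameters:
a, b: the target objects; character vectors, or objects that can be coerced to character.
method: the distance method; defaults to "osa". The other values are "lv", "dl", "hamming", "lcs", "qgram", "cosine", "jaccard", "jw" (Jaro-Winkler) and "soundex".
useBytes: compare byte by byte instead of character by character.
weight: the penalties for deletion, insertion, substitution and transposition, in that order; each weight must be positive and no greater than 1.
maxDist: upper limit on the distance.
q: size of the q-grams; used only with method = "qgram", "jaccard" or "cosine"; must be non-negative.
p: penalty factor for the Jaro-Winkler distance; defaults to 0 and must lie between 0 and 0.25. With p = 0 the result is the plain Jaro distance.
nthread: maximum number of threads.
useNames: how the rows and columns of the output are labelled: not at all ("none"), with the strings themselves ("strings"), or with the names of the input vectors ("names").
ncores: number of cores to use.
cluster: a custom cluster object, such as one created with makeCluster from the parallel package.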
Examples:
> stringdistmatrix(c("foo","bar","boo"),c("baz","buz"))
     [,1] [,2]
[1,]    3    3
[2,]    1    2
[3,]    2    2
> # string distance matching is case sensitive:
> stringdist("ABC","abc")
[1] 3
>
> # so you may want to normalize a bit:
> stringdist(tolower("ABC"),"abc")
[1] 0
>
> # stringdist recycles the shortest argument:
> stringdist(c('a','b','c'),c('a','c'))
[1] 0 1 1
Warning message: longer object length is not a multiple of shorter object length
>
> # different edit operations may be weighted; e.g. weighted transposition:
> stringdist('ab','ba',weight=c(1,1,1,0.5))
[1] 0.5
>
> # Non-unit weights for insertion and deletion make the distance metric asymmetric
> stringdist('ca','abc')
[1] 3
> stringdist('abc','ca')
[1] 3
> stringdist('ca','abc',weight=c(0.5,1,1,1))
[1] 2
> stringdist('abc','ca',weight=c(0.5,1,1,1))
[1] 2.5
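The asymmetry above can be reproduced with a few lines of base R. What follows is a minimal sketch of a weighted Levenshtein distance, not the package's implementation: weighted_lev is a hypothetical helper, transpositions are left out (they play no role for these two strings), and it assumes the convention that the edits turn b into a, which is the reading that matches the outputs above.

```r
# Weighted Levenshtein distance, written for clarity rather than speed.
# Assumption: the edits turn `b` into `a`. `d` is the cost of deleting a
# character of `b`, `i` of inserting one, `s` of substituting one.
weighted_lev <- function(a, b, weight = c(d = 1, i = 1, s = 1)) {
  x <- strsplit(a, "")[[1]]
  y <- strsplit(b, "")[[1]]
  n <- length(x)
  m <- length(y)
  # D[r + 1, k + 1] holds the cost of turning the first k characters of b
  # into the first r characters of a.
  D <- matrix(0, n + 1, m + 1)
  D[, 1] <- (0:n) * weight[["i"]]  # empty b prefix: insert everything
  D[1, ] <- (0:m) * weight[["d"]]  # empty a prefix: delete everything
  for (r in seq_len(n)) {
    for (k in seq_len(m)) {
      D[r + 1, k + 1] <- min(
        D[r + 1, k] + weight[["d"]],              # delete y[k]
        D[r, k + 1] + weight[["i"]],              # insert x[r]
        D[r, k] + (x[r] != y[k]) * weight[["s"]]  # substitute or match
      )
    }
  }
  D[n + 1, m + 1]
}

weighted_lev("ca", "abc", weight = c(d = 0.5, i = 1, s = 1))  # 2
weighted_lev("abc", "ca", weight = c(d = 0.5, i = 1, s = 1))  # 2.5
```

Turning 'abc' into 'ca' needs two deletions and one insertion (0.5 + 0.5 + 1), while turning 'ca' into 'abc' needs one deletion and two insertions (0.5 + 1 + 1), hence the different results.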
> # q-grams are based on the difference between occurrences of q consecutive characters
> # in string a and string b.
> # Since each character abc occurs in 'abc' and 'cba', the q=1 distance equals 0:
> stringdist('abc','cba',method='qgram',q=1)
[1] 0
>
> # since the first string consists of 'ab','bc' and the second
> # of 'cb' and 'ba', the q=2 distance equals 4 (they have no q=2 grams in common):
> stringdist('abc','cba',method='qgram',q=2)
[1] 4
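To make the q-gram counts above concrete, here is a small base R sketch that tabulates the q-grams of each string and sums the absolute differences of the counts. qgram_dist is a hypothetical helper written for illustration; it is not how the package computes the distance internally.

```r
# q-gram distance: sum over all q-grams of |count in a - count in b|.
qgram_dist <- function(a, b, q = 1) {
  # All substrings of length q, in order of appearance.
  grams <- function(s) {
    n <- nchar(s)
    if (n < q) return(character(0))
    substring(s, 1:(n - q + 1), q:n)
  }
  ga <- grams(a)
  gb <- grams(b)
  keys <- union(ga, gb)
  # Count every q-gram over the shared key set so that absent ones count as 0.
  count_a <- table(factor(ga, levels = keys))
  count_b <- table(factor(gb, levels = keys))
  sum(abs(count_a - count_b))
}

qgram_dist("abc", "cba", q = 1)  # 0: both contain a, b and c once
qgram_dist("abc", "cba", q = 2)  # 4: ab, bc versus cb, ba
```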
> stringdist('MARTHA','MATHRA',method='jw')
[1] 0.08333333
> # Note that stringdist gives a _distance_ where wikipedia gives the corresponding
> # _similarity measure_. To get the wikipedia result:
> 1 - stringdist('MARTHA','MATHRA',method='jw')
[1] 0.9166667
>
> # The corresponding Jaro-Winkler distance can be computed by setting p=0.1
> stringdist('MARTHA','MATHRA',method='jw',p=0.1)
[1] 0.06666667
> # or, as a similarity measure
> 1 - stringdist('MARTHA','MATHRA',method='jw',p=0.1)
[1] 0.9333333
>
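The two Jaro-Winkler values above can be checked by hand. The arithmetic below is a sketch that assumes the usual definitions: Jaro similarity (m/|a| + m/|b| + (m - t)/m) / 3, with m the number of matching characters and t half the number of matched characters that are out of order, and Jaro-Winkler distance equal to the Jaro distance scaled by (1 - p * l), with l the length of the common prefix. The variable names are chosen for illustration only.

```r
len_a <- 6                  # nchar("MARTHA")
len_b <- 6                  # nchar("MATHRA")
m <- 6                      # all six characters match within the search window
half_transpositions <- 3 / 2  # R, T, H appear in a different order

jaro_sim  <- (m / len_a + m / len_b + (m - half_transpositions) / m) / 3
jaro_dist <- 1 - jaro_sim   # 0.08333333, the method = 'jw', p = 0 result

prefix_len <- 2             # common prefix "MA"
p <- 0.1
jw_dist <- jaro_dist * (1 - p * prefix_len)  # 0.06666667, the p = 0.1 result
```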
> # This gives distance 1 since Euler and Gauss translate to different soundex codes.
> stringdist('Euler','Gauss',method='soundex')
[1] 1
> # Euler and Ellery translate to the same code and have distance 0
> stringdist('Euler','Ellery',method='soundex')
[1] 0
>
————————————————
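The adist function computes the Levenshtein edit distance. It can be turned into a similarity measure, for example 1 - (Levenshtein edit distance / length of the longer string).
The levenshteinSim function in the RecordLinkage package does this directly, and may be faster than adist.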
> library(RecordLinkage)
> levenshteinSim("apple", "apple")
[1] 1
> levenshteinSim("apple", "aaple")
[1] 0.8
> levenshteinSim("apple", "appled")
[1] 0.8333333
> levenshteinSim("appl", "apple")
[1] 0.8
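For comparison, the same similarity can be written with adist alone, which avoids the extra package. This is a minimal sketch; lev_sim is a hypothetical name, and the helper compares a and b element by element rather than building the full matrix that adist returns for vector input.

```r
# Levenshtein similarity: 1 - distance / length of the longer string.
# Note: two empty strings give 0 / 0, i.e. NaN.
lev_sim <- function(a, b) {
  d <- mapply(function(x, y) adist(x, y)[1, 1], a, b, USE.NAMES = FALSE)
  1 - d / pmax(nchar(a), nchar(b))
}

lev_sim("apple", "apple")                           # 1
lev_sim("apple", "aaple")                           # 0.8
lev_sim(c("apple", "appl"), c("appled", "apple"))   # 0.8333333 0.8000000
```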
ETA: Interestingly, while levenshteinDist in the RecordLinkage package appears to be slightly faster than adist, levenshteinSim is considerably slower than either. Using the rbenchmark package:
> benchmark(levenshteinDist("applesauce", "aaplesauce"), replications=100000)
                                         test replications elapsed relative user.self sys.self user.child sys.child
1 levenshteinDist("applesauce", "aaplesauce")       100000   4.012        1     3.583    0.452          0         0
> benchmark(adist("applesauce", "aaplesauce"), replications=100000)
                               test replications elapsed relative user.self sys.self user.child sys.child
1 adist("applesauce", "aaplesauce")       100000   4.277        1     3.707    0.461          0         0
> benchmark(levenshteinSim("applesauce", "aaplesauce"), replications=100000)
                                        test replications elapsed relative user.self sys.self user.child sys.child
1 levenshteinSim("applesauce", "aaplesauce")       100000   7.206        1      6.49    0.743          0         0
This overhead comes entirely from the code of levenshteinSim, which is just a wrapper around levenshteinDist:
> levenshteinSim
function (str1, str2)
{
    return(1 - (levenshteinDist(str1, str2)/pmax(nchar(str1),
        nchar(str2))))
}
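FYI: if you always compare two single strings rather than vectors, you can write a new version that uses max instead of pmax, which cuts the run time by roughly a quarter (about 22% in the benchmark below, 7.206 s versus 5.608 s):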
# Scalar-only variant: max() instead of pmax(), so it is valid only for two single strings.
mylevsim <- function(str1, str2) {
    1 - levenshteinDist(str1, str2) / max(nchar(str1), nchar(str2))
}
> benchmark(mylevsim("applesauce", "aaplesauce"), replications=100000)
                                  test replications elapsed relative user.self sys.self user.child sys.child
1 mylevsim("applesauce", "aaplesauce")       100000   5.608        1     4.987    0.627          0         0
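Long story short, adist and levenshteinDist differ very little in performance, although the former is preferable if you do not want to add a package dependency. How you convert the distance into a similarity measure does have some effect on performance.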