转：什么是Shingling算法

shingling算法用于计算两个文档的相似度，例如，用于网页去重。维基百科对w-shingling的定义如下：

In natural language processing a w-shingling is a set of unique "shingles"—contiguous subsequences of tokens in a document —that can be used to gauge the similarity of two documents. The w denotes the number of tokens in each shingle in the set.

维基百科用一个浅显的例子讲解了shingling算法的原理。比如，一个文档

"a rose is a rose is a rose"
分词后的词汇(token，语汇单元)集合是

(a,rose,is,a,rose,is, a, rose)
那么w=4的4-shingling就是集合:

{ (a,rose,is,a), (rose,is,a,rose), (is,a,rose,is), (a,rose,is,a), (rose,is,a,rose) }
去掉重复的子集合：

{ (a,rose,is,a), (rose,is,a,rose), (is,a,rose,is) }
给定shingle的大小,两个文档A和B的相似度 r 定义为:

r(A,B)=|S(A)∩S(B)| / |S(A)∪S(B)|
其中|A|表示集合A的大小。

因此,相似度是介于0和1之间的一个数值，且r(A,A)=1,即一个文档和它自身 100%相似。

posted @ 2020-07-25 12:50 HuangB2ydjm 阅读(337) 评论(0) 收藏举报

刷新页面返回顶部

大风起兮云飞扬😋

何不食肉糜

转：什么是Shingling算法

公告