Similarity algorithms for short texts and strings

Similarity measures for short texts and strings are generally not based on deep semantics, so they are relatively fast.

Approaches are commonly grouped as string-based, corpus-based, and knowledge-based (e.g. WordNet).

Common algorithms include:

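LCS (longest common substring)
Jaro distance: based on the number and order of matching characters, taking spelling deviations into account
Jaro-Winkler distance: an extension of Jaro distance that adds a bonus for a shared prefix
Cosine distance: split the string into words on whitespace (for English) or into characters (for strings without space delimiters), count frequencies, represent each string as a vector of word/character frequencies, and compute the cosine similarity of the vectors (i.e. bag-of-words model plus cosine distance)
Euclidean distance: like cosine, but computes the Euclidean distance between the frequency vectors instead of the cosine distance
Hamming distance: a metric for strings of equal length
Edit distance (usually meaning Levenshtein distance)
Jaccard similarity: commonly used to measure set overlap; also applicable to strings
Dice distance: commonly used to measure set overlap; also applicable to strings
BM25: a word-level algorithm that relies on TF-IDF-style term weighting
N-gram: uses N consecutive words/characters as the basic unit of analysis
LSA (latent semantic analysis)
LDA (latent Dirichlet allocation)
ESA (explicit semantic analysis)
WMD (word mover's distance): word2vec + earth mover's distance
TF-IDF based.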
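
A few of the string-based measures in the list above are short enough to sketch directly. The following is a minimal Python illustration of the textbook definitions, not code from any particular library; the function names (`levenshtein`, `hamming`, `char_ngrams`, `jaccard`, `dice`) and the default of character bigrams are assumptions made for this example.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def hamming(a: str, b: str) -> int:
    """Number of differing positions; only defined for equal-length strings."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))


def char_ngrams(s: str, n: int = 2) -> set:
    """Set of all runs of n consecutive characters in s."""
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def jaccard(a: str, b: str, n: int = 2) -> float:
    """Jaccard similarity |X & Y| / |X | Y| over character n-gram sets."""
    x, y = char_ngrams(a, n), char_ngrams(b, n)
    if not x and not y:
        return 1.0  # convention chosen here: two empty sets are identical
    return len(x & y) / len(x | y)


def dice(a: str, b: str, n: int = 2) -> float:
    """Dice coefficient 2|X & Y| / (|X| + |Y|) over character n-gram sets."""
    x, y = char_ngrams(a, n), char_ngrams(b, n)
    if not x and not y:
        return 1.0  # same convention as in jaccard
    return 2 * len(x & y) / (len(x) + len(y))
```

Working on character n-grams rather than whitespace-separated words means the set-based measures behave the same way for space-delimited and non-space-delimited text.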

Notes on how these algorithms behave in apache commons-text:1.4:

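JaroWinkler: note that "trump" and "donald trump" have a similarity of 0 (watch out for strings that contain spaces). Jaro only matches characters that lie within floor(max(len)/2) - 1 = 5 positions of each other, and every character of "trump" sits 7 positions away from its counterpart in "donald trump", so nothing matches.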
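
The zero score can be reproduced with a plain implementation of the textbook Jaro and Jaro-Winkler definitions. This is a sketch for illustration only, not the commons-text source, and the library may differ in details; the names `jaro` and `jaro_winkler` and the parameters `p` and `max_prefix` (set to Winkler's usual 0.1 and 4) are assumptions of this example.

```python
def jaro(s1: str, s2: str) -> float:
    """Textbook Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters only count as matching if they are at most `window` apart.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, c in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == c:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order // 2  # transpositions
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1: str, s2: str, p: float = 0.1, max_prefix: int = 4) -> float:
    """Jaro similarity boosted by the length of the common prefix."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return j + prefix * p * (1 - j)


print(jaro_winkler("trump", "donald trump"))  # no character falls inside the window
print(jaro_winkler("trump", "trump donald"))  # same words, shared prefix
```

Because the prefix bonus multiplies the Jaro score, a Jaro score of 0 stays 0 regardless of the prefix. Comparing token by token (for example, the best score over the whitespace-separated words) is one way to avoid this when one string contains the other.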

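Cosine: "trump" <--> "realdonaldtrump" has a similarity of 0, since each string is a single token and the two tokens differ. For Chinese text the similarity is always 0, even when the two strings are identical.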
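
Both observations are what a word-level bag-of-words cosine would produce if its tokenizer only recognized ASCII word characters. The sketch below assumes exactly that: the ASCII-only `\w+` tokenizer is a guess chosen to reproduce the reported behavior, not a confirmed description of commons-text, and `word_tokens`, `char_tokens`, and `cosine_similarity` are hypothetical names. Switching to character tokens, as suggested in the algorithm list for non-space-delimited text, gives non-zero scores.

```python
import math
import re
from collections import Counter

# Assumption: a tokenizer that only sees ASCII letters, digits and underscore.
ASCII_WORD = re.compile(r"\w+", re.ASCII)


def word_tokens(text: str) -> list:
    """Split into runs of ASCII word characters; CJK text yields no tokens."""
    return ASCII_WORD.findall(text)


def char_tokens(text: str) -> list:
    """Treat every non-whitespace character as its own token."""
    return [ch for ch in text if not ch.isspace()]


def cosine_similarity(a: str, b: str, tokenize=word_tokens) -> float:
    """Cosine of the angle between the token-frequency vectors of a and b."""
    va, vb = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector has no direction; report no similarity
    return dot / (norm_a * norm_b)


# Word level: each string is one token, and the tokens differ.
print(cosine_similarity("trump", "realdonaldtrump"))
# Word level on Chinese: no tokens at all, so even identical strings score 0.
print(cosine_similarity("相似度", "相似度"))
# Character level: shared characters now contribute to the dot product.
print(cosine_similarity("相似度", "相似度", tokenize=char_tokens))
print(cosine_similarity("trump", "realdonaldtrump", tokenize=char_tokens))
```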

Posted 2022-07-06 23:25 by 二球悬铃木