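Computing String Similarity
Today I was working on fuzzy string matching and needed an algorithm for measuring string similarity, so I took some time to read up on the topic.
Introductions to string similarity algorithms: http://blog.csdn.net/dongle2001/archive/2007/01/02/1472235.aspx
http://blog.csdn.net/tanhua103292/archive/2009/08/06/4419494.aspx
Levenshtein distance algorithm: the Wikipedia article explains it fairly clearly: http://en.wikipedia.org/wiki/Levenshtein_distance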
The Algorithm
Steps
Step | Description |
---|---|
1 | Set n to be the length of s. Set m to be the length of t. If n = 0, return m and exit. If m = 0, return n and exit. Construct a matrix d containing rows 0..n and columns 0..m, so that d[i,j] refers to row i and column j. |
2 | Initialize the first row to 0..m. Initialize the first column to 0..n. |
3 | Examine each character of s (i from 1 to n). |
4 | Examine each character of t (j from 1 to m). |
5 | If s[i] equals t[j], the cost is 0. If s[i] doesn't equal t[j], the cost is 1. |
6 | Set cell d[i,j] of the matrix equal to the minimum of: a. The cell immediately above plus 1: d[i-1,j] + 1. b. The cell immediately to the left plus 1: d[i,j-1] + 1. c. The cell diagonally above and to the left plus the cost: d[i-1,j-1] + cost. |
7 | After the iteration steps (3, 4, 5, 6) are complete, the distance is found in cell d[n,m]. |
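Steps 2, 5 and 6 can be restated as a single recurrence. The notation below is only a compact rewrite of the table, with i indexing s and j indexing t; the subscript form d_{i,j} and the cost function name are chosen here for illustration and assume the amsmath package.

```latex
\[
d_{i,0} = i, \qquad d_{0,j} = j, \qquad
d_{i,j} = \min
\begin{cases}
d_{i-1,j} + 1 \\
d_{i,j-1} + 1 \\
d_{i-1,j-1} + \operatorname{cost}(i,j)
\end{cases}
\qquad
\operatorname{cost}(i,j) =
\begin{cases}
0 & \text{if } s_i = t_j \\
1 & \text{otherwise}
\end{cases}
\]
```

The distance between s and t is then d_{n,m}, as in step 7.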
Pseudocode (taken from Wikipedia):
int LevenshteinDistance(char s[1..m], char t[1..n])
{
  // for all i and j, d[i,j] will hold the Levenshtein distance between
  // the first i characters of s and the first j characters of t;
  // note that d has (m+1)x(n+1) values
  declare int d[0..m, 0..n]

  for i from 0 to m
    d[i, 0] := i // the distance of any first string to an empty second string
  for j from 0 to n
    d[0, j] := j // the distance of any second string to an empty first string

  for j from 1 to n
  {
    for i from 1 to m
    {
      if s[i] = t[j] then
        d[i, j] := d[i-1, j-1]        // no operation required
      else
        d[i, j] := minimum
                   (
                     d[i-1, j] + 1,   // a deletion
                     d[i, j-1] + 1,   // an insertion
                     d[i-1, j-1] + 1  // a substitution
                   )
    }
  }

  return d[m,n]
}
Analysis of the Levenshtein distance algorithm:
Proof of correctness
As mentioned earlier, the invariant is that we can transform the initial segment s[1..i] into t[1..j] using a minimum of d[i,j] operations. This invariant holds since:
- It is initially true on row and column 0 because s[1..i] can be transformed into the empty string t[1..0] by simply dropping all i characters. Similarly, we can transform s[1..0] to t[1..j] by simply adding all j characters.
- If s[i] = t[j], and we can transform s[1..i-1] to t[1..j-1] in k operations, then we can do the same to s[1..i] and just leave the last character alone, giving k operations.
- Otherwise, the distance is the minimum of the three possible ways to do the transformation:
  - If we can transform s[1..i] to t[1..j-1] in k operations, then we can simply add t[j] afterwards to get t[1..j] in k+1 operations (insertion).
  - If we can transform s[1..i-1] to t[1..j] in k operations, then we can remove s[i] and then do the same transformation, for a total of k+1 operations (deletion).
  - If we can transform s[1..i-1] to t[1..j-1] in k operations, then we can do the same to s[1..i], and exchange the original s[i] for t[j] afterwards, for a total of k+1 operations (substitution).
- The operations required to transform s[1..n] into t[1..m] is of course the number required to transform all of s into all of t, and so d[n,m] holds our result.
This proof fails to validate that the number placed in d[i,j] is in fact minimal; this is more difficult to show, and involves an argument by contradiction in which we assume d[i,j] is smaller than the minimum of the three, and use this to show one of the three is not minimal.
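One step of the argument can be spot-checked in code: when s[i] = t[j], the pseudocode takes the diagonal cell directly instead of the minimum over all three neighbours. The sketch below compares that shortcut against a direct recursion over the full three-way minimum. It is not a proof of minimality, only a sanity check on short inputs, and the names naiveDistance, dpDistance and checkAgreement are hypothetical helpers introduced here.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Full three-way recurrence over the first i characters of s and the first j
// characters of t. Exponential time, so only suitable for very short strings.
int naiveDistance(const std::string& s, std::size_t i,
                  const std::string& t, std::size_t j) {
    if (i == 0) return static_cast<int>(j);  // add all j characters
    if (j == 0) return static_cast<int>(i);  // drop all i characters
    int cost = (s[i - 1] == t[j - 1]) ? 0 : 1;
    return std::min({naiveDistance(s, i - 1, t, j) + 1,           // deletion
                     naiveDistance(s, i, t, j - 1) + 1,           // insertion
                     naiveDistance(s, i - 1, t, j - 1) + cost});  // match or substitution
}

// Table-based version that uses the shortcut for matching characters.
int dpDistance(const std::string& s, const std::string& t) {
    std::vector<std::vector<int>> d(s.size() + 1, std::vector<int>(t.size() + 1, 0));
    for (std::size_t i = 0; i <= s.size(); ++i) d[i][0] = static_cast<int>(i);
    for (std::size_t j = 0; j <= t.size(); ++j) d[0][j] = static_cast<int>(j);
    for (std::size_t i = 1; i <= s.size(); ++i) {
        for (std::size_t j = 1; j <= t.size(); ++j) {
            if (s[i - 1] == t[j - 1])
                d[i][j] = d[i - 1][j - 1];  // no operation required
            else
                d[i][j] = 1 + std::min({d[i - 1][j], d[i][j - 1], d[i - 1][j - 1]});
        }
    }
    return d[s.size()][t.size()];
}

// Aborts via assert if the two computations disagree for this pair.
void checkAgreement(const std::string& s, const std::string& t) {
    assert(dpDistance(s, t) == naiveDistance(s, s.size(), t, t.size()));
}
```

Calling checkAgreement("kitten", "sitting") passes, with both functions returning 3.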
Implementation (C++):
#include <algorithm> // std::min
#include <cstring>   // std::strlen, NULL

int LevenshteinDistance(const char *source, const char *target)
{
    // step 1: a NULL pointer is treated as an empty string
    int n, m;
    if (source == NULL) n = 0;
    else n = static_cast<int>(std::strlen(source));
    if (target == NULL) m = 0;
    else m = static_cast<int>(std::strlen(target));
    if (m == 0) return n;
    if (n == 0) return m;
    // Construct a matrix with (n+1) rows and (m+1) columns
    int **matrix = new int*[n+1];
    for (int i = 0; i <= n; ++i)
        matrix[i] = new int[m+1];
    // step 2: initialize the first column and the first row
    for (int i = 0; i <= n; ++i) matrix[i][0] = i;
    for (int j = 0; j <= m; ++j) matrix[0][j] = j;
    // step 3
    char si, tj;
    for (int i = 1; i <= n; ++i) {
        si = source[i-1];
        // step 4
        for (int j = 1; j <= m; ++j) {
            tj = target[j-1];
            // step 5,6
            if (si == tj)
                matrix[i][j] = matrix[i-1][j-1];
            else
                matrix[i][j] = std::min(matrix[i-1][j] + 1,
                               std::min(matrix[i][j-1] + 1, matrix[i-1][j-1] + 1));
        }
    }
    // step 7: read the result, then release the matrix
    int result = matrix[n][m];
    for (int i = 0; i <= n; ++i)
        delete[] matrix[i];
    delete[] matrix;
    return result;
}
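For fuzzy matching, the raw distance is often turned into a score between 0 and 1. The sketch below shows one common normalization, 1 - distance / max(length); this formula is an assumption made for illustration and does not come from the sources above. The names editDistance and similarity are hypothetical, and editDistance keeps only two rows of the matrix rather than the full table. Strings are compared byte by byte, so a multi-byte character in UTF-8 text counts once per byte.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Levenshtein distance keeping only the previous and the current row.
int editDistance(const std::string& s, const std::string& t) {
    std::vector<int> prev(t.size() + 1), curr(t.size() + 1);
    for (std::size_t j = 0; j <= t.size(); ++j) prev[j] = static_cast<int>(j);
    for (std::size_t i = 1; i <= s.size(); ++i) {
        curr[0] = static_cast<int>(i);
        for (std::size_t j = 1; j <= t.size(); ++j) {
            int cost = (s[i - 1] == t[j - 1]) ? 0 : 1;
            curr[j] = std::min({prev[j] + 1,           // deletion
                                curr[j - 1] + 1,       // insertion
                                prev[j - 1] + cost});  // match or substitution
        }
        std::swap(prev, curr);  // the row just filled becomes the previous row
    }
    return prev[t.size()];
}

// Assumed normalization: 1.0 for identical strings, 0.0 when every position
// of the longer string has to be edited.
double similarity(const std::string& s, const std::string& t) {
    std::size_t longest = std::max(s.size(), t.size());
    if (longest == 0) return 1.0;  // two empty strings are treated as identical
    int dist = editDistance(s, t);
    // The distance can never exceed the length of the longer string.
    assert(static_cast<std::size_t>(dist) <= longest);
    return 1.0 - static_cast<double>(dist) / static_cast<double>(longest);
}
```

With this definition, similarity("kitten", "sitting") is 1 - 3/7, roughly 0.57. Whether that counts as a match depends on a threshold chosen for the application.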