Computing String Similarity

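Today I was working on fuzzy string matching, which requires a string similarity algorithm, so I took some time to read up on the topic.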

Introductions to string similarity algorithms:

http://blog.csdn.net/dongle2001/archive/2007/01/02/1472235.aspx

http://blog.csdn.net/tanhua103292/archive/2009/08/06/4419494.aspx

The Levenshtein distance algorithm. Wikipedia explains it fairly clearly: http://en.wikipedia.org/wiki/Levenshtein_distance

The Algorithm

Steps

 

Step  Description
1     Set n to be the length of s.
      Set m to be the length of t.
      If n = 0, return m and exit.
      If m = 0, return n and exit.
      Construct a matrix d containing 0..n rows and 0..m columns.
2     Initialize the first row to 0..m.
      Initialize the first column to 0..n.
3     Examine each character of s (i from 1 to n).
4     Examine each character of t (j from 1 to m).
5     If s[i] equals t[j], the cost is 0.
      If s[i] doesn't equal t[j], the cost is 1.
6     Set cell d[i,j] of the matrix equal to the minimum of:
      a. The cell immediately above plus 1: d[i-1,j] + 1.
      b. The cell immediately to the left plus 1: d[i,j-1] + 1.
      c. The cell diagonally above and to the left plus the cost: d[i-1,j-1] + cost.
7     After the iteration steps (3, 4, 5, 6) are complete, the distance is found in cell d[n,m].
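
To see the table these steps produce, here is a minimal C++ sketch that fills the whole matrix and returns it so that it can be inspected or printed. The function name buildDistanceMatrix is made up for illustration and is not part of the original post. It assumes s runs along the rows and t along the columns, and it skips the early return of step 1 because the initialized borders already give the right answer for empty strings.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not from the original post): builds the full
// (n+1) x (m+1) matrix d described in steps 1-7, with s along the rows
// and t along the columns.
std::vector<std::vector<int>> buildDistanceMatrix(const std::string& s,
                                                  const std::string& t) {
    const std::size_t n = s.size();
    const std::size_t m = t.size();
    std::vector<std::vector<int>> d(n + 1, std::vector<int>(m + 1, 0));

    // Step 2: the first column holds 0..n, the first row holds 0..m.
    for (std::size_t i = 0; i <= n; ++i) d[i][0] = static_cast<int>(i);
    for (std::size_t j = 0; j <= m; ++j) d[0][j] = static_cast<int>(j);

    // Steps 3-6: fill every remaining cell from its three neighbours.
    for (std::size_t i = 1; i <= n; ++i) {
        for (std::size_t j = 1; j <= m; ++j) {
            // Step 5: strings are 0-indexed in C++, hence the -1.
            const int cost = (s[i - 1] == t[j - 1]) ? 0 : 1;
            d[i][j] = std::min({d[i - 1][j] + 1,          // cell above
                                d[i][j - 1] + 1,          // cell to the left
                                d[i - 1][j - 1] + cost}); // diagonal cell
        }
    }
    return d;  // Step 7: the distance is d[n][m].
}
```

For s = "kitten" and t = "sitting" the bottom-right cell d[6][7] is 3: substitute k with s, substitute e with i, and insert g.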

Pseudocode (from Wikipedia):

int LevenshteinDistance(char s[1..m], char t[1..n])
 {
   // for all i and j, d[i,j] will hold the Levenshtein distance between
   // the first i characters of s and the first j characters of t;
   // note that d has (m+1)x(n+1) values
   declare int d[0..m, 0..n]
  
   for i from 0 to m
     d[i, 0] := i // the distance of any first string to an empty second string
   for j from 0 to n
     d[0, j] := j // the distance of any second string to an empty first string
  
   for j from 1 to n
   {
     for i from 1 to m
     {
       if s[i] = t[j] then  
         d[i, j] := d[i-1, j-1]       // no operation required
       else
         d[i, j] := minimum
                    (
                      d[i-1, j] + 1,  // a deletion
                      d[i, j-1] + 1,  // an insertion
                      d[i-1, j-1] + 1 // a substitution
                    )
     }
   }
  
   return d[m,n]
 }

Analysis of the Levenshtein distance algorithm:

Proof of correctness

As mentioned earlier, the invariant is that we can transform the initial segment s[1..i] into t[1..j] using a minimum of d[i,j] operations. This invariant holds since:

  • It is initially true on row and column 0 because s[1..i] can be transformed into the empty string t[1..0] by simply dropping all i characters. Similarly, we can transform s[1..0] to t[1..j] by simply adding all j characters.
  • If s[i] = t[j], and we can transform s[1..i-1] to t[1..j-1] in k operations, then we can do the same to s[1..i] and just leave the last character alone, giving k operations.
  • Otherwise, the distance is the minimum of the three possible ways to do the transformation:
    • If we can transform s[1..i] to t[1..j-1] in k operations, then we can simply add t[j] afterwards to get t[1..j] in k+1 operations (insertion).
    • If we can transform s[1..i-1] to t[1..j] in k operations, then we can remove s[i] and then do the same transformation, for a total of k+1 operations (deletion).
    • If we can transform s[1..i-1] to t[1..j-1] in k operations, then we can do the same to s[1..i], and exchange the original s[i] for t[j] afterwards, for a total of k+1 operations (substitution).
  • The number of operations required to transform s[1..m] into t[1..n] is of course the number required to transform all of s into all of t, and so d[m,n] holds our result.

This proof fails to validate that the number placed in d[i,j] is in fact minimal; this is more difficult to show, and involves an argument by contradiction in which we assume d[i,j] is smaller than the minimum of the three, and use this to show one of the three is not minimal.
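
The cases in the proof can also be read as a recursive definition of the distance. The sketch below writes that definition down directly in C++. The name naiveDistance is hypothetical, and the function takes exponential time, so it is only meant for checking a table-based implementation on short strings, not for real use.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Hypothetical helper (not from the original post): distance between the
// first i characters of s and the first j characters of t, computed
// straight from the cases used in the proof. Exponential time.
int naiveDistance(const std::string& s, std::size_t i,
                  const std::string& t, std::size_t j) {
    if (i == 0) return static_cast<int>(j);  // add all j characters
    if (j == 0) return static_cast<int>(i);  // drop all i characters

    // Last characters match: leave them alone, no extra operation.
    if (s[i - 1] == t[j - 1]) return naiveDistance(s, i - 1, t, j - 1);

    // Otherwise take the cheapest of the three transformations.
    return 1 + std::min({naiveDistance(s, i - 1, t, j),        // deletion
                         naiveDistance(s, i, t, j - 1),        // insertion
                         naiveDistance(s, i - 1, t, j - 1)});  // substitution
}
```

Calling naiveDistance(s, s.size(), t, t.size()) on a handful of short inputs and comparing the result with the matrix version is a cheap sanity check of the recurrence.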

Implementation (C++):

#include <algorithm> // std::min
#include <cstring>   // strlen

int LevenshteinDistance(const char *source, const char *target)
{
    // step 1: a NULL pointer is treated as an empty string
    const int n = (source == NULL) ? 0 : static_cast<int>(strlen(source));
    const int m = (target == NULL) ? 0 : static_cast<int>(strlen(target));

    if (m == 0) return n;
    if (n == 0) return m;

    // Construct an (n+1) x (m+1) matrix
    int **matrix = new int*[n + 1];
    for (int i = 0; i <= n; ++i)
        matrix[i] = new int[m + 1];

    // step 2: initialize the first column and the first row
    for (int i = 0; i <= n; ++i) matrix[i][0] = i;
    for (int j = 0; j <= m; ++j) matrix[0][j] = j;

    // step 3: walk over the characters of source
    for (int i = 1; i <= n; ++i) {
        const char si = source[i - 1];

        // step 4: walk over the characters of target
        for (int j = 1; j <= m; ++j) {
            const char tj = target[j - 1];

            // steps 5, 6: equal characters cost nothing, otherwise take
            // the cheapest of deletion, insertion and substitution
            if (si == tj)
                matrix[i][j] = matrix[i - 1][j - 1];
            else
                matrix[i][j] = std::min(matrix[i - 1][j] + 1,
                                        std::min(matrix[i][j - 1] + 1,
                                                 matrix[i - 1][j - 1] + 1));
        }
    }

    // step 7: read the result, then release the matrix
    const int result = matrix[n][m];
    for (int i = 0; i <= n; ++i)
        delete[] matrix[i];
    delete[] matrix;

    return result;
}
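
For fuzzy matching it is often handier to turn the distance into a score between 0 and 1. The sketch below is one common convention and is an assumption on my part, not something from the sources quoted above: divide the distance by the length of the longer string and subtract the result from 1. The names editDistance and similarity are made up for illustration. editDistance is a variant that keeps only two rows of the matrix, which is enough because each cell depends only on the current and the previous row.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: Levenshtein distance using two rows instead of the
// full matrix. prev is row i-1, curr is row i.
int editDistance(const std::string& s, const std::string& t) {
    std::vector<int> prev(t.size() + 1), curr(t.size() + 1);
    for (std::size_t j = 0; j <= t.size(); ++j) prev[j] = static_cast<int>(j);

    for (std::size_t i = 1; i <= s.size(); ++i) {
        curr[0] = static_cast<int>(i);
        for (std::size_t j = 1; j <= t.size(); ++j) {
            if (s[i - 1] == t[j - 1])
                curr[j] = prev[j - 1];                    // no operation
            else
                curr[j] = 1 + std::min({prev[j],          // deletion
                                        curr[j - 1],      // insertion
                                        prev[j - 1]});    // substitution
        }
        std::swap(prev, curr);  // the row just filled becomes "previous"
    }
    return prev[t.size()];
}

// Assumed normalization (not from the original post):
// 1.0 means identical strings, 0.0 means nothing in common.
double similarity(const std::string& s, const std::string& t) {
    const std::size_t longest = std::max(s.size(), t.size());
    if (longest == 0) return 1.0;  // two empty strings are identical
    return 1.0 - static_cast<double>(editDistance(s, t)) /
                     static_cast<double>(longest);
}
```

With this convention, "kitten" and "sitting" have distance 3 and a longest length of 7, so the score is 1 - 3/7, roughly 0.57. Where to put the match threshold depends on the data and has to be tuned.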


posted @ 2011-05-26 20:20  LittleAnt