Computing String Similarity

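Today I was working on fuzzy string matching and needed an algorithm for computing string similarity, so I took some time to read up on it.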

Introductions to string similarity algorithms: http://blog.csdn.net/dongle2001/archive/2007/01/02/1472235.aspx

http://blog.csdn.net/tanhua103292/archive/2009/08/06/4419494.aspx

The Wikipedia article explains the Levenshtein distance algorithm fairly clearly: http://en.wikipedia.org/wiki/Levenshtein_distance

The Algorithm

Steps

Step	Description
1	Set n to be the length of s. Set m to be the length of t. If n = 0, return m and exit. If m = 0, return n and exit. Construct a matrix d containing rows 0..n and columns 0..m.
2	Initialize the first column to 0..n (d[i,0] = i). Initialize the first row to 0..m (d[0,j] = j).
3	Examine each character of s (i from 1 to n).
4	Examine each character of t (j from 1 to m).
5	If s[i] equals t[j], the cost is 0. If s[i] doesn't equal t[j], the cost is 1.
6	Set cell d[i,j] of the matrix equal to the minimum of: a. The cell immediately above plus 1: d[i-1,j] + 1. b. The cell immediately to the left plus 1: d[i,j-1] + 1. c. The cell diagonally above and to the left plus the cost: d[i-1,j-1] + cost.
7	After the iteration steps (3, 4, 5, 6) are complete, the distance is found in cell d[n,m].
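
The steps above can be sketched in C++ roughly as follows. This is only an illustration under a few assumptions: the function name buildLevenshteinMatrix is hypothetical, the strings are treated as plain byte sequences, and the whole matrix is returned so that intermediate cells can be inspected.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical helper (not from the original post): builds the full
// (n+1) x (m+1) matrix d, where d[i][j] is the distance between the
// first i characters of s and the first j characters of t.
std::vector<std::vector<int>> buildLevenshteinMatrix(const std::string& s,
                                                     const std::string& t) {
    const std::size_t n = s.size();  // step 1
    const std::size_t m = t.size();
    std::vector<std::vector<int>> d(n + 1, std::vector<int>(m + 1, 0));

    // Step 2: first column holds 0..n, first row holds 0..m.
    for (std::size_t i = 0; i <= n; ++i) d[i][0] = static_cast<int>(i);
    for (std::size_t j = 0; j <= m; ++j) d[0][j] = static_cast<int>(j);

    // Steps 3-6: fill the remaining cells row by row.
    for (std::size_t i = 1; i <= n; ++i) {
        for (std::size_t j = 1; j <= m; ++j) {
            const int cost = (s[i - 1] == t[j - 1]) ? 0 : 1;  // step 5
            d[i][j] = std::min({d[i - 1][j] + 1,            // cell above
                                d[i][j - 1] + 1,            // cell to the left
                                d[i - 1][j - 1] + cost});   // diagonal cell
        }
    }
    return d;  // step 7: the distance is d[n][m]
}
```

For s = "kitten" and t = "sitting", the bottom-right cell d[6][7] of the returned matrix is 3.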

Pseudocode (from Wikipedia):

int LevenshteinDistance(char s[1..m], char t[1..n])
 {
   // for all i and j, d[i,j] will hold the Levenshtein distance between
   // the first i characters of s and the first j characters of t;
   // note that d has (m+1)x(n+1) values
   declare int d[0..m, 0..n]
  
   for i from 0 to m
     d[i, 0] := i // the distance of any first string to an empty second string
   for j from 0 to n
     d[0, j] := j // the distance of any second string to an empty first string
  
   for j from 1 to n
   {
     for i from 1 to m
     {
       if s[i] = t[j] then  
         d[i, j] := d[i-1, j-1]       // no operation required
       else
         d[i, j] := minimum
                    (
                      d[i-1, j] + 1,  // a deletion
                      d[i, j-1] + 1,  // an insertion
                      d[i-1, j-1] + 1 // a substitution
                    )
     }
   }
  
   return d[m,n]
 }

Analysis of the Levenshtein distance algorithm:

Proof of correctness

As mentioned earlier, the invariant is that we can transform the initial segment s[1..i] into t[1..j] using a minimum of d[i,j] operations. This invariant holds since:

It is initially true on row and column 0 because s[1..i] can be transformed into the empty string t[1..0] by simply dropping all i characters. Similarly, we can transform s[1..0] to t[1..j] by simply adding all j characters.
If s[i] = t[j], and we can transform s[1..i-1] to t[1..j-1] in k operations, then we can do the same to s[1..i] and just leave the last character alone, giving k operations.
Otherwise, the distance is the minimum of the three possible ways to do the transformation:
- If we can transform s[1..i] to t[1..j-1] in k operations, then we can simply add t[j] afterwards to get t[1..j] in k+1 operations (insertion).
- If we can transform s[1..i-1] to t[1..j] in k operations, then we can remove s[i] and then do the same transformation, for a total of k+1 operations (deletion).
- If we can transform s[1..i-1] to t[1..j-1] in k operations, then we can do the same to s[1..i], and exchange the original s[i] for t[j] afterwards, for a total of k+1 operations (substitution).
The number of operations required to transform s[1..m] into t[1..n] is of course the number required to transform all of s into all of t, and so d[m,n] holds our result.

This proof fails to validate that the number placed in d[i,j] is in fact minimal; this is more difficult to show, and involves an argument by contradiction in which we assume d[i,j] is smaller than the minimum of the three, and use this to show one of the three is not minimal.
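
The case analysis above can also be read directly as a recursive definition. The sketch below is an assumption-laden illustration rather than part of the quoted proof: naiveDistance is a hypothetical name, and the function takes exponential time, so it is only useful for cross-checking the matrix version on short strings.

```cpp
#include <algorithm>
#include <string>

// Hypothetical helper: distance between the first i characters of s and
// the first j characters of t, written straight from the case analysis.
// Exponential time; meant only for checking small inputs.
int naiveDistance(const std::string& s, std::size_t i,
                  const std::string& t, std::size_t j) {
    if (i == 0) return static_cast<int>(j);  // add all j characters
    if (j == 0) return static_cast<int>(i);  // drop all i characters
    if (s[i - 1] == t[j - 1]) {
        // Last characters match: leave them alone, no extra operation.
        return naiveDistance(s, i - 1, t, j - 1);
    }
    return 1 + std::min({naiveDistance(s, i, t, j - 1),        // insertion
                         naiveDistance(s, i - 1, t, j),        // deletion
                         naiveDistance(s, i - 1, t, j - 1)});  // substitution
}
```

Calling naiveDistance(s, s.size(), t, t.size()) should agree with the value the dynamic-programming table stores in its last cell.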

Implementation (C++):

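#include <cstring>  // strlen, NULL

// The outer parentheses keep the macro safe inside larger expressions.
#define MIN(a,b) ((a) < (b) ? (a) : (b))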
int LevenshteinDistance(const char *source, const char *target)
{
  //step 1
  int n, m;
  if (source == NULL) n = 0;
  else n = strlen(source);

  if (target == NULL) m = 0;
  else m = strlen(target);

  if (m == 0) return n;
  if (n == 0) return m;

  //Construct a matrix
  int **matrix = new int*[n+1];
  for (int i = 0; i <= n; ++i)
    matrix[i] = new int[m+1];

  //step 2 Initialize
  matrix[0][0] = 0;
  for (int i = 0; i <= n; ++i) matrix[i][0] = i;
  for (int i = 0; i <= m; ++i) matrix[0][i] = i;

  //step 3
  char si,tj;
  for (int i = 1; i <= n; ++i) {
    si = source[i-1];

    //step 4
    for (int j = 1; j <= m; ++j) {
      tj = target[j-1];

      //steps 5, 6
      if (si == tj)
        matrix[i][j] = matrix[i-1][j-1];
      else
        matrix[i][j] = MIN(matrix[i-1][j] + 1,
                       MIN(matrix[i][j-1] + 1, matrix[i-1][j-1] + 1));
    }
  }

  //step 7
  int result = matrix[n][m];
  for (int i = 0; i <= n; ++i)
    delete[] matrix[i];
  delete[] matrix;

  return result;
}
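
Since the goal was a similarity score for fuzzy matching, the distance still has to be turned into one. The sketch below uses one common convention, 1 - distance / max(length), which is my assumption and not something the sources above prescribe. The names levenshtein and similarity are hypothetical, and the distance is computed with two rolling rows instead of the full matrix, a variant of the algorithm above that gives the same result.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Levenshtein distance using two rolling rows instead of the full matrix.
int levenshtein(const std::string& s, const std::string& t) {
    std::vector<int> prev(t.size() + 1), curr(t.size() + 1);
    for (std::size_t j = 0; j <= t.size(); ++j) prev[j] = static_cast<int>(j);

    for (std::size_t i = 1; i <= s.size(); ++i) {
        curr[0] = static_cast<int>(i);
        for (std::size_t j = 1; j <= t.size(); ++j) {
            const int cost = (s[i - 1] == t[j - 1]) ? 0 : 1;
            curr[j] = std::min({prev[j] + 1,          // deletion
                                curr[j - 1] + 1,      // insertion
                                prev[j - 1] + cost}); // match or substitution
        }
        std::swap(prev, curr);  // the row just filled becomes the previous row
    }
    return prev[t.size()];
}

// Assumed normalization: 1.0 means identical, 0.0 means every position differs.
double similarity(const std::string& s, const std::string& t) {
    const std::size_t longest = std::max(s.size(), t.size());
    if (longest == 0) return 1.0;  // two empty strings are treated as identical
    return 1.0 - static_cast<double>(levenshtein(s, t)) /
                     static_cast<double>(longest);
}
```

With this convention, "kitten" and "sitting" have distance 3 and a similarity of 1 - 3/7, roughly 0.57. Other normalizations, such as dividing by the sum of the lengths, are equally possible; which one fits depends on the matching task.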

Posted 2011-05-26 20:20 by LittleAnt
