文本比较算法Ⅵ——用线性空间计算最大公共子序列(翻译贴)
研究文本比较算法有一段时间了。近日研读了《A Linear Space Algorithm for Computing Maximal Common Subsequences》(D.S.Hirschberg著)。文章写于1975年。很多其他的论文都会引用这篇论文,可见这篇论文的质量。同时,该文作者D.S.Hirschberg也写了很多有关LCS的文章,也都是经典中的经典。
在研读这篇文章之后,我将它翻译成中文。由于本人的英语与文法都还不行,故翻译的质量也就一般了,也欢迎广大网友指正。
Introduction
导论
The problem of finding a longest common subsequence of two strings has been solved in quadratic time and space. For stings of length 1000 (assuming coefficients of 1 microsecond and 1 byte) the solution would require 106 microseconds (one second) and 106 bytes (1000K bytes). The former is easily accommodated, the latter is not so easily obtainable. If the strings were of length of 10000, the problem might not to be solvable in main memory for lack of space.
过去求解两个字符串的最长公共子序列的问题,需要花费二次的时间和空间。在求解两个长1000的字符串(假定时间系数为1微秒,空间系数是1字节)过程中需要106微秒(1秒)和106字节(1000K字节)。前者很容易解决,而后者不是很容易满足的。如果两个字符串的长度为10000,则可能没有足够的主内存空间来求解这个问题。
注:文章写于1975年,以当时的计算机的能力来看,1000K是个天量了。不过,就算是现在的计算机,如果没有良好的算法,在大容量的文本比较时就会出问题。比方说,文本是1M的,在O(MN)的情况下,需要1T的容量。这个可是够惊人的。
We present an algorithm which will solve this problem in quadratic time and in linear space. For Example, assuming coefficients of 2 microseconds and 10 bytes , for strings of length 1000 we would require 2 seconds and 10K bytes; for strings of length 10000 we would require a little over 3 minutes and 100K bytes.
我们提出一个解决该问题算法,该算法花费二次的时间和线性空间。举例来说,假定时间系数是2微秒,空间系数为10字节。求解两个长1000的字符串,我们要花费2秒和10K字节;求解两个长10000的字符串,我们花费仅仅增加到3分钟和100K字节。
String C=c1c2……cp is a subsequence of string A=a1a2……am if and only if there is a mapping F:{1,2,……,p}→{1,2,……,m} such that f(i)=k only if ci is ak and F is a monotone strictly increasing function(i.e. F(i)=u,F(j)=v,and i<j imply that u<v)
字符串C=c1c2……cp是字符串A=a1a2……am的子序列,指的是存在一个映射F:{1,2,……,p}→{1,2,……,m},当f(i)=k,则ci=ak,并且F是一个严格单调递增函数(举例说明:若F(i)=u,F(j)=v,当i<j 则u<v)
String C is a common subsequence of strings A and B if and only if C is a subsequence of A and C is a subsequence of B
字符串C是字符串A和B的公共子序列,当且仅当C既是A的子序列,同时C又是B的子序列
The problem can be stated as follows : Given strings A=a1a2……am and B=b1b2……bn (over alphabet Σ), find a string C= c1c2……cp such that C is a common subsequence of A and B and p is maximized
求解最长公共子序列问题定义如下:给定字符串A=a1a2……am和B=b1b2……bn(覆盖字符集Σ),找到一个字符串C=c1c2……cp,C是A和B的公共子序列之中p最大的那个
We call C an example of a maximal common subsequence.
我们又把C称作最大公共子序列
Notation. For string D=d1d2……dr , Dkt is dkdk+1……dt if k≤t; dkdk-1……dt if k≥1. When k>t , we shall write ~Dkt so as to make clear that we are referring to a “reverse substring” of D
标记。对于字符串D=d1d2……dr,Dkt表示为dkdk+1……dt (k≤t);dkdk-1……dt(k≥1),当k>t时,我们标记为~Dkt,称为D的“反转子串”
L(i,j) is the maximum length possible of any common subsequence of A1i and B1j
L(i,j)表示为A1i和B1j的所有可能的公共子序列的长度中最大值。
x||y is the concatenation of strings x and y
x||y表示为字符串x和y的连接。
We present the algorithm described in [3], which takes quadratic time and space
我们提到的算法出自[3],它花费二次时间和空间
Algorithm A
算法A
Algorithm A accepts as input strings A1m and B1n and produces as output the matrix L (where the element L(i,j) corresponds to our notation of maximum length possible of any common subsequence of A1i and B1j)
算法A接受输入字符串A1m 和B1n,并且计算输出矩阵L(矩阵元素L(i,j)如标记中所称,表示为A1i和B1j的所有可能的公共子序列的长度中最大值。)
ALG A(m,n,A,B,L)
1. Initialization: L(i,0)←0 [i=0……m];
L(0,j)←0 [j=0……n];
2. for i←1 to m do
begin
3. for j←1 to n do
if A(i)=B(j) then L(i,j)←L(i-1,j-1)+1
else L(i,j)←max{L(i,j-1),L(i-1,j)}
end
Proof of correctness of Algorithm A
论证算法A的正确性
To find L(i,j) ,let a common subsequence of that length be denoted by S(i,j)=c1c2……cp , If ai=bj, we can do no better than by taking cp=ai and looking for c1……cp-1 as a common subsequence of length L(i,j)-1 of string A1,i-1 and B1,j-1. Thus , in this case ,L(i,j)=L(i-1,j-1)+1
为了计算L(i,j),把长度和其相等的公共子序列定义为S(i,j)=c1c2……cp,如果ai=bj,则cp=ai,并且c1……cp-1是A1,i-1和B1,j-1的最长公共子序列,长度为L(i,j)-1。因此,在这种情况下,L(i,j)=L(i-1,j-1)+1
If ai≠bj ,then cp is ai,bj, or neither (but not both). If cp is ai , then a solution C to problem(A1i,B1j) [written P(i,j)] will be a solution to P(i,j-1) since bj is not used. Similarly , if cp is bj , then we can get a solution to P(i,j) by solving P(i-1,j). If cp is neither, then a solution to either P(i-1,j) or P(i,j-1) will suffice . In determining the length of the solution, it is seen that L(i,j) [corresponding to P(i,j)] will be the maximum of L(i-1,j) and L(i,j-1).
如果ai≠bj,则cp要么是ai,要么是bj,要么两者都不是(肯定不会都是)。如果cp=ai,因为bj不是C的元素,则求解C的问题(A1i,B1j)[写作P(i,j)]等同于求解P(i,j-1)。同样的,如果cp=bj,求解P(i,j)等同于求解P(i-1,j)。如果,cp两者都不是,则必是P(i-1,j)和P(i,j-1)中的一个。求解的长度称为L(i,j)[和P(i,j)相一致]将会是L(i-1,j) 和L(i,j-1)中的最大值。
Time and Space Analysis of Algorithm A
算法A的时间和空间分析
The if statement in Algorithm A will be executed exactly mn times. Input and output arrays require m+n+(m+1)(n+1) locations. Thus Algorithm A requires O(mn) time and O(mn) space.
算法A中的判断语句将会精确的执行mn次。输入和输出占用的空间需要m+n+(m+1)(n+1)位置。因此,算法A需要O(mn)时间和O(mn)空间。
Algorithm B
算法B
In Algorithm A, the derivation of row i of matrix L(L(i,1), L(i,2),…… ,L(i,n)) requires only row i-1 of matrix L. Thus , a slight modification yields Algorithm B, which accepts as input strings A1m and B1n and produces as output vector LL where LL(j) will have the value L(m,j)
在算法A中,推导出L矩阵中的第i行(L(i,1), L(i,2), ……,L(i,n))只需要矩阵L的第i-1行。一个细小的改观生成了算法B,该算法接受输入字符串A1m 和B1n,输出向量LL,LL(j)的值就是矩阵L中的L(m,j)。
ALG B(m,n,A,B,LL)
1. Initialization:K(1,j)←0 [j=0……n];
2. for i←1 to n do
begin
3. K(0,j) ←K(1,j) [j=0……n]
4. for j←1 to n do
if A(i)=B(j) then K(1,j) ←K(0,j-1)+1
else K(1,j) ←max{K(1,j-1),K(0,j)}
end
5. LL(j) ←K(1,j) [j=0……n]
Proof of Correctness of Algorithm B
论证算法B的正确性
Algorithm B is Algorithm A with K(0,j) in statement 4 of ALG B having the same value as L(i-1,j) in statement 3 of ALG A and K(1,j) receiving the same value as L(i,j). We show this by induction on i.
算法B和算法A等价,就像ALG B中的第四步计算K(0,j)的值和ALG A中的第三步计算L(i-1,j)的值是一样的。同理,K(1,j)的值和L(i,j)的值一样。下面我们将根据i的值进行归纳说明。
For i=1 , L(i-1,j) is zero (initialized in statement 1 of ALG A). In ALG B, K(0,j) received in statement 3 the value of K(1,j) , which was just initialized to zero in statement 1.
当i=1,L(i-1,j)为0(在ALG A中的第一步初始化数据后)。在ALG B中的第3步中,K(0,j)从K(1,j)获得0值,因为K(1,j)在第一步中就已经初始化为0了。
Assuming K(0,j) has the same value as does L(i-1,j). Then K(1,j) receives the same value as L(i,j) since the assignment statement within the inner loops of ALG A and ALG B are equivalent . For the next iteration, K(0,j) receives (in statement 3 of ALG B) the value of K(1,j) which has the value of L(i,j) as shown above.
假定K(0,j)和L(i-1,j)值一样。那么K(1,j)像L(i,j)一样获取同样的值,因为在ALG A和ALG B中指定的循环步骤是一致的。在下一个循环之前,K(0,j)获取K(1,j)的值(在ALG B中的第3步),就像上面所示,K(1,j)的值就是L(i,j)。
Time and Space Analysis of Algorithm B
算法B的时间和空间分析
As in Algorithm A , the if statement in Algorithm B is executed exactly mn times. Input and output arrays require m+n+(n+1) locations. Local storage requires 2(n+1) locations. Thus Algorithm B requires O(mn) time and O(m+n) space.
和算法A一样,算法B的判断语句也会精确的执行mn次。输入和输出占用m+n+(n+1)位置,算法内部占用2(n+1)位置。因此算法B需要O(mn)时间和O(m+n)空间。
注:该算法还可以优化,使得LL()只占用n个位置,而不是2n个位置
We shall show that using Algorithm B for appropriate substrings of A and B will enable us to recover a maximal common subsequence of A and B in linear space.
接下来,我们要用算法B在线性空间中利用A和B的合适子串来找回A和B的最大公共子序列。
Define L*(i,j) to be the maximum length of common subsequence of Ai+1,m and Bj+1,n.
定义L*(i,j)是Ai+1,m和Bj+1,n的最大公共子序列
We note that L(i,j) j=0……n are the maximum lengths of common subsequence of A1i and various prefixes of B1n. We also note that L*(i,j) j=0……n are the maximum lengths of common subsequence of ~Am,i+1 and various prefixes of ~Bn,1 .Choosing i to be m/2 and using the theorem below , we shall be able to determine a prefix B1 of B which can be matched with the first half A1 of A (and the corresponding suffix B2 of B matched with the last half A2 of A) such that a maximal common subsequence (mcs) of A1 and B1 concatenated with an mcs of A2 and B2 will be an mcs of A and B
我们注意到L(i,j) j=0……n表示A1i和B1n的一些前缀的公共子序列的长度最大值。我们同时注意到L*(i,j) j=0……n表示~Am,i+1和~Bn,1的一些前缀的公共子序列的长度最大值。在下面的定理中,令i为m/2,我们能确定B的一个前缀B1能和A的前半部分A1匹配(同时相对应的B的后缀B2和A的后半部分A2匹配)。如此,A1和B1的最大公共子序列和A2和B2的最大公共子序列连接起来就是A和B的最大公共子序列。
Define M(i)=max{L(i,j)+L*(i,j)} 0≤i≤n
THEOREM For 0≤i≤m, M(i)=L(m,n)
PROOF . Let M(i)=L(i,j)+L*(i,j) for some j. Let S(i,j) be any maximal common subsequence of A1i and B1j; let S*(i,j) be any maximal common subsequence of Ai+1,m and Bj+1,n . Then C=S(i,j) || S*(i,j) is a common subsequence of A1m and B1n of length M(i). Thus L(m,n)≥M(i)
证明。当j取某些值的时候,M(i)=L(i,j)+L*(i,j)。让S(i,j)是某个A1i和B1j的最大公共子序列;让S*(i,j)是某个Ai+1,m和Bj+1,n的最大公共子序列。那么C=S(i,j) || S*(i,j)就是A1m和B1n的一个公共子序列,且长度为M(i)。因此L(m,n)≥M(i)
Let S(m,n) be any maximal common subsequence of A1m and B1n. S(m,n) is a subsequence of B that is S1 (a subsequence of A1i) || S2 ( a subsequence of Ai+1,m). Then there exists j such that S1 is a subsequence of B1j and S2 is a subsequence of Bj+1,n . By definition of L and L*, |S1|≤L(i,j) and |S2|≤L*(i,j). Thus L(m,n)=|S(m,n)|=|S1|+|S2|≤L(i,j)+L*(i,j) ≤M(i)
让S(m,n)是某个A1m和B1n的最大公共子序列。S(m,n)是B的一个子序列,且就是S1 (A1i的一个子序列) || S2 (Ai+1,m的一个子序列)。那么,必存在j,使得S1是B1j的子序列,同时S2 是Bj+1,n的子序列。根据L和L*的定义,|S1|≤L(i,j) 同时|S2|≤L*(i,j)。因此,L(m,n)=|S(m,n)|=|S1|+|S2|≤L(i,j)+L*(i,j) ≤M(i)
注:典型的数学证明法。要证明A=B。先证明A≥B,再证明A≤B。
Algorithm C
算法C
We now apply the above theorem recursively to divide a given problem into two smaller problems until we obtain a trivial subproblem.
现在,我们根据上面的定理递归的将一个给定的问题分成两个小问题直到能获得一个不成问题的子问题。
Algorithm C accepts as input stings A and B (of length m and n) and produces as output a common subsequence C of A and B that is of Maximum length p.
算法C获得输入字符串A和B(长度分别为m和n),然后计算输出A和B的一个公共子序列C,且C是最长的,长度为p
ALG C(m,n,A,B,C)
1. If problem is trivial , solve it:
If n=0 then C←e (e is empty string)
Else if m=1 then if 存在j≤n such that A(1)=B(j) 注:存在的标记没法打出来,故用了中文
Then C←A(1)
Else C←e
2. Otherwise, split problem:
Else begin i←[m/2]
3. Evaluate L(i,j) and L*(i,j) [j=0……n]:
ALG B(i,n,A1i,B1n,L1)
ALG B(m-i,n,~An,i+1,~Bn1,L2)
4. Find j such that L(i,j)+L*(i,j)=L(m,n) using theorem:
M←max{L1(j)+L2(n-j)};0≤i≤n
k←min j such that L1(j)+L2(n-j)=M
5. Solve simpler problems:
ALG C (i,k,A1i,B1k,C1)
ALG C (m-i,n-k,Ai+1m,Bk+1,n,C2)
6. Give output
C←C1 || C2
end
Proof of Correctness of Algorithm C
论证算法C的正确性
L1(j) produced by the first call to ALG B in line 3 is equal to L(i,j). This was shown in the proof of correctness of Algorithm B. Similarly , L2(j) is equal to the maximum length of common subsequence (max lcs) of ~Am,i+1 and ~Bn,n-j+1 by the proof of correctness of Algorithm B .
在第3行第一次调用ALG B计算出的L1(j)等价于L(i,j),这个在“论证算法B的正确性”中就说明了。同样的,L2(j)等同于~Am,i+1和~Bn,n-j+1的公共子序列中的长度最大值也在“论证算法B的正确性”中说明了。
L2(n-j)=max lcs of ~Am,i+1 and ~Bn,j+1=max lcs of Ai+1,m and Bj+1,n=L*(i,j)
By our theorem , we can find k (as in line 4) such that L(i,k)+L*(i,k)=L(m,n). So there must exist solutions C1 and C2 to the subproblems (A1i, B1k) and (Ai+1,m,Bk+1,n) such that C1 || C2 will be a common subsequence of A and B of length L(m,n). The solutions to the subproblems are obtained in line 5 and are added together in line 6 to obtain the final output .
根据我们的定理,我们能找到k(在第4步),使得L(i,k)+L*(i,k)=L(m,n)。那么就一定存在C1和C2,分别是子问题(A1i, B1k)和子问题(Ai+1,m,Bk+1,n)的解,而C1 || C2是A 和B的长度为L(m,n)的公共子序列。求解过程在第5步获得子问题的解,并在第6步将两个子问题的解连接起来并最终输出。
Time Analysis of Algorithm C
算法C的时间分析
For P(1,n) we look for a single match . For some constants c1 and c2 this is time-bounded by c1n+c2
针对P(1,n),我们找到一个单独的匹配。给定常量c1和c2,时间临界点为c1n+c2
For P(2m,n) , let operations on vectors that are linear in m or n be time-bounded by c3m+c4n+c5. That leaves two calls to ALG B and two calls to ALG C. the calls to ALG B are bounded by c6mn by time analysis of ALG B . Assume P(m,n) is time-bounded by d1mn+d2(d1≥c1,d2≥c2). Then the calls to ALG C will be time-bounded by d1mk+d2 and d1m(n-k)+d2. Thus a total time-bound T for P(2m,n) will be T=(d1+c6)mn+c3m+c4n+c5+2d2. For n≥1,T≤(d1+c6+c3+c4+c5+d2)mn+d2. For n=0 , let T≤d2 . Then to be consistent with our assumption on the time-bound of P(m,n) , we must have d1+c6+c3+c4+c5+d2≤2d1 , which is realizable by letting d1=c6+c3+c4+c5+d2.
针对P(2m,n),在向量的操作上,时间是关于m或者n线性的,时间临界点是c3m+c4n+c5。还要执行两次ALG B和两次ALG C。根据ALG B的时间分析,执行ALG B的时间临界点为c6mn。假定P(m,n)的时间临界点为d1mn+d2(d1≥c1,d2≥c2)。那么执行两次ALG C的时间临界点分别为d1mk+d2和d1m(n-k)+d2。因此,P(2m,n)的总共时间临界点T,将会是T=(d1+c6)mn+c3m+c4n+c5+2d2。当n≥1,T≤(d1+c6+c3+c4+c5+d2)mn+d2。当n=0时,T≤d2。那么就象我们始终如一的假定P(m,n)的时间临界点,我们一定会得到d1+c6+c3+c4+c5+d2≤2d1。不妨写成d1=c6+c3+c4+c5+d2。
Thus Algorithm C has an O(mn) time bound.
因此算法C需要时间O(mn)。
注:由于英文水平有限,这一段翻的很干涩。其实当看到T=(d1+c6)mn+c3m+c4n+c5+2d2时,就知道算法C的时间为O(mn)。因为式子的最高次是mn
Space Analysis of Algorithm C
算法C的空间分析
We assume that vectors A and B are in common storage and substrings can be transferred as arguments by giving initial and final locations.
我们假设向量A和B存储在公共空间并且他们的子串能作为传递参数在初始化过程和确定的位置。
Then ,during execution, the calls to ALG B use temporary storage which is linear in m and n (see space analysis of Algorithm B) . It is seen that ,exclusive of recursive calls to ALG C , ALG C uses a constant amount of memory space. There are 2m-1 calls to ALG C (proven below) , and so ALG C requires memory space proportional to m and n , i.e. O(m+n) space.
那么在整个计算过程中,调用ALG B临时存储空间是和m和n的线性相关的(在算法B的空间分析里说明)。这就像是,独享的递归调用ALG C,ALG C占用一块总量固定的内存空间。一共要调用2m-1次ALG C(在后面证明),那么所以ALG C需要的内存空间为和m和n成比例,也就是O(m+n)空间
Proof That There Are 2m-1 Calls to ALG C
证明,一共调用2m-1次ALG C
Let m≤2r. If r is zero , then m is one , and there are 21-1=1 call to ALG C
设m<=2r。如果r是0,那么m是1,那么一共有21-1=1次调用ALG C
Assume that for m≤2r=M there are 2m-1 to calls to ALG C. For m’ ≤2r+1=2M, i will be set equal to at most M in line 2. There will be two calls to ALG C with first parameters m1 and m2 such that m1+m2=m’ and both m1 and m2 are at most M . By assumption , each for these calls will generate a total of 2mi-1 calls to ALG C . Adding in the initial call results in a total of :(2m1-1)+2(m2-1)+1=2(m1+m2)-1=2m’-1 calls.
假设m≤2r=M成立,那么一共有2m-1次调用ALG C。当m’ ≤2r+1=2M时,那么在第2步,i将会等于接近M的值。一共调用2次第一个参数分别是m1和m2的ALG C,且m1+m2=m’且m1和m2都接近M,根据假设,调用这2次ALG C则一共需要调用2mi-1次ALG C。加上第一次的调用,总计为(2m1-1)+2(m2-1)+1=2(m1+m2)-1=2m’-1次调用。
注:典型的数学归纳法证明。先证明r=0成立,再假设2r=M时成立,再证明2r+1=2M时成立
Algorithm C can be modified to find the edit distance between two strings (as defined in [3]). In this case we would seek to minimize D(m,n), the cost of our theorem would be : for all I,
算法C能修改成找寻两个字符串的编辑距离(在[3]中的定义)。在这种情况下,我们设定最小的D(m,n)。在算法的定理中将会是
D(m,n)=min{D(i,j)+D*(i,j)} 0≤i≤n
The modified Algorithm C would split problems in half by the above theorem, using a modified Algorithm B to evaluate D(i,j) and D*(i,j), and call itself recursively.
在上面的定理中,修改过的算法C将会对半分割问题,用修改过的算法B计算D(i,j) 和D*(i,j),然后再自身递归调用。
Received May 1974;revised November 1974
References
参考文章
1. Chvatal, V.,Klarner, D.A., and Knuth, D.E. Selected combinatorial research problems. STAN-CS-72-292, Stanford U., (June 1972),26
2. Private communication from D.Knuth to J.D. Ullman.
3. Wagner, R.A., and Fischer, M.J. The string-to-string correction problem. J. ACM 21 , 1 (Jan ,1974) 168-173
4. Aho, A. V., Hirschberg, D.S., and Ullman, J.D. Bounds on the complexity of the longest common subsequence problem. Proc . 15th Ann. Symp. on Swiching and Automata Theory, 1974,pp.104-109