
3 Algorithms

From the document Combinatorial Pattern Matching (pages 196-200)

In this section, we present several algorithms for computing $\mathrm{LCSqS}(A, B)$. To avoid processing unnecessary characters, we assume that the input strings $A$ and $B$ have already been preprocessed by an alphabet reduction technique [16], as follows. First, we compute the lexicographical ranks of the characters in $A$ and $B$. Assuming that $A$ and $B$ are drawn from an integer alphabet of size $n^{O(1)}$, this can be done in $O(n)$ time with radix sort. We then replace each character in $A$ and $B$ with its rank, turning $A$ and $B$ into strings over the integer alphabet $[1, 2n]$. Finally, we remove every character that appears in only one of $A$ and $B$. This preprocessing essentially preserves the common subsequences of the original $A$ and $B$, and thus has no negative effect on computing $\mathrm{LCSqS}(A, B)$. Note that $n \le |M|$ holds after alphabet reduction, while $|M| = O(n^2)$ still holds.
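The preprocessing step can be illustrated with a minimal sketch. The function name `reduce_alphabet` is hypothetical, and the sketch departs from the text in two labeled ways: it uses Python's comparison-based `sorted` instead of radix sort, and it ranks only the characters common to both strings, so it does not achieve the $O(n)$ bound.

```python
def reduce_alphabet(a, b):
    """Sketch of alphabet reduction (assumption: comparison sort, not radix sort).

    Characters that occur in only one of the two strings are dropped, and the
    remaining characters are replaced by their lexicographical ranks (from 1).
    """
    common = set(a) & set(b)
    rank = {ch: r for r, ch in enumerate(sorted(common), start=1)}
    reduced_a = [rank[ch] for ch in a if ch in common]
    reduced_b = [rank[ch] for ch in b if ch in common]
    return reduced_a, reduced_b


if __name__ == "__main__":
    # 'x' and 'c' occur only in the first string, 'y' only in the second.
    print(reduce_alphabet("abxcab", "bayab"))
```

Since no common subsequence can contain a character missing from either string, dropping such characters leaves the set of common subsequences unchanged up to renaming.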

3.1 Simple Algorithm

Our first algorithm considers $\Theta(n^2)$ pairs of partitions of $A$ and $B$. Namely, we have

$$\mathrm{LCSqS}(A, B) = \max_{1 \le i < n,\ 1 \le j < n} \{\, 2 \times \mathrm{LCS}(A[1..i], A[i+1..n], B[1..j], B[j+1..n]) \,\}.$$

This immediately implies an $O(n^6)$-time, $O(n^4)$-space algorithm for computing $\mathrm{LCSqS}(A, B)$, since the LCS of four strings can be computed in $O(n^4)$ time and space by standard DP.
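As a minimal sketch of this simple algorithm, the code below takes the maximum over all partition pairs and computes the LCS of the four resulting pieces by memoized recursion. The names `lcs4` and `lcsqs_simple` are hypothetical, and the sketch is meant only for short inputs.

```python
from functools import lru_cache


def lcs4(w, x, y, z):
    """Length of a longest common subsequence of four strings (standard DP)."""

    @lru_cache(maxsize=None)
    def go(p, q, r, s):
        # go(p, q, r, s) = LCS length of the prefixes w[:p], x[:q], y[:r], z[:s].
        if p == 0 or q == 0 or r == 0 or s == 0:
            return 0
        if w[p - 1] == x[q - 1] == y[r - 1] == z[s - 1]:
            return go(p - 1, q - 1, r - 1, s - 1) + 1
        return max(go(p - 1, q, r, s), go(p, q - 1, r, s),
                   go(p, q, r - 1, s), go(p, q, r, s - 1))

    return go(len(w), len(x), len(y), len(z))


def lcsqs_simple(a, b):
    """Try every split point i of a and j of b, as in the max formula."""
    best = 0
    for i in range(1, len(a)):
        for j in range(1, len(b)):
            best = max(best, 2 * lcs4(a[:i], a[i:], b[:j], b[j:]))
    return best


if __name__ == "__main__":
    print(lcsqs_simple("abcabc", "acbacb"))
```

Each of the $\Theta(n^2)$ partition pairs triggers one four-string LCS computation, which is where the $O(n^6)$ bound comes from.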

The $O(n^6)$-time complexity can be improved as follows. For any matching point $(i, j) \in M$, let $i'$ (resp. $j'$) be the smallest position such that $i < i'$, $j < j'$, and $(i', j') \in M$. If no such $(i', j')$ exists, then let $i' = j' = n$.

Observation 1. For any $i \le k < i'$ and $j \le h < j'$, $\mathrm{LCS}(A[1..k], A[k+1..n], B[1..h], B[h+1..n]) = \mathrm{LCS}(A[1..i], A[i+1..n], B[1..j], B[j+1..n])$.

By Observation 1, it is sufficient to consider only $|M|$ partition points between $A$ and $B$. Hence, we can compute $\mathrm{LCSqS}(A, B)$ in $O(|M| n^4)$ time and $O(n^4)$ space.

3.2 $O(\sigma|M|^3 + n)$-time algorithm

Here we present our $O(\sigma|M|^3 + n)$-time algorithm for computing $\mathrm{LCSqS}(A, B)$, where $\sigma$ is the number of distinct characters occurring in $A$ and $B$. This algorithm is based on Inenaga and Hyyrö's algorithm [16], which computes (the length of) a longest palindromic common subsequence of two given strings in $O(\sigma|M|^2 + n)$ time. Consider a 2D plane where the string $A$ corresponds to the vertical axis, directed upward (i.e., $A[1]$ is at the bottom and $A[n]$ is at the top), and the string $B$ corresponds to the horizontal axis, directed rightward (i.e., $B[1]$ is at the left end and $B[n]$ is at the right end). Our key idea is to represent each common square subsequence of $A$ and $B$ by matching rectangles, defined as follows.

For $1 \le i < j \le n$ and $1 \le k < l \le n$, a tuple $r = (i, j, k, l)$ is said to be a matching rectangle iff $A[i] = A[j] = B[k] = B[l]$, and more specifically a $c$-matching rectangle iff $A[i] = A[j] = B[k] = B[l] = c$. For a matching rectangle $r = (i, j, k, l)$, $(i, k)$ is said to be the left-bottom corner of $r$, and $(j, l)$ is said to be the right-upper corner of $r$. Let $R$ denote the set of matching rectangles of $A$ and $B$. Notice that $|R| = O(|M|^2)$. For two matching rectangles $r = (i, j, k, l)$ and $r' = (i', j', k', l')$, let

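$r = r' \iff i = i',\ j = j',\ k = k',\ \text{and } l = l'$,

$r < r' \iff i < i',\ j < j',\ k < k',\ \text{and } l < l'$,

$r \vartriangleleft r' \iff i \le i',\ j \le j',\ k \le k',\ l \le l',\ \text{and } r \ne r'$.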

For two $c$-matching rectangles $r = (i, j, k, l)$ and $r' = (i', j', k', l')$, let $r \trianglelefteq r' \iff i \le i',\ j \le j',\ k \le k'$, and $l \le l'$.

A sequence $\langle r_1, \ldots, r_m \rangle$ of matching rectangles is said to be a sequence of diagonally overlapping matching rectangles (DOMRs) iff $r_x < r_{x+1}$ for all $1 \le x < m$, $i_m < j_1$, and $k_m < l_1$, where we use the notation $r_h = (i_h, j_h, k_h, l_h)$ for all $h = 1, \ldots, m$. The size of a sequence $\langle r_1, \ldots, r_m \rangle$ of DOMRs is the number $m$ of overlapping rectangles in it.

The following observation lays the foundation for the algorithms of this subsection (and for the one of the following subsection as well):

Observation 2. There is a common square subsequence $T$ of length $2m$ of strings $A$ and $B$ iff there exists a sequence $\langle r_1, \ldots, r_m \rangle$ of DOMRs of size $m$.
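To make the definitions concrete, here is a minimal sketch that checks the DOMR conditions for a list of 1-based tuples $(i, j, k, l)$ and reads off the corresponding square. The function names are hypothetical and the code is an illustration of the definitions, not part of the algorithms that follow.

```python
def is_matching_rectangle(a, b, r):
    """True iff r = (i, j, k, l) (1-based) satisfies A[i]=A[j]=B[k]=B[l]."""
    i, j, k, l = r
    return (1 <= i < j <= len(a) and 1 <= k < l <= len(b)
            and a[i - 1] == a[j - 1] == b[k - 1] == b[l - 1])


def is_domr_sequence(a, b, rects):
    """Check r_x < r_{x+1} for all x, plus i_m < j_1 and k_m < l_1."""
    if not rects or not all(is_matching_rectangle(a, b, r) for r in rects):
        return False
    increasing = all(all(p < q for p, q in zip(r, s))
                     for r, s in zip(rects, rects[1:]))
    i_m, _, k_m, _ = rects[-1]
    _, j_1, _, l_1 = rects[0]
    return increasing and i_m < j_1 and k_m < l_1


def square_from_domrs(a, rects):
    """The common square subsequence spelled out by a sequence of DOMRs."""
    half = "".join(a[r[0] - 1] for r in rects)
    return half + half


if __name__ == "__main__":
    seq = [(1, 3, 1, 3), (2, 4, 2, 4)]
    print(is_domr_sequence("abab", "abab", seq), square_from_domrs("abab", seq))
```

The conditions $i_m < j_1$ and $k_m < l_1$ ensure that the first copy of the half $T$ ends before the second copy begins in both strings; in `"aabb"` the rectangles `(1, 2, 1, 2)` and `(3, 4, 3, 4)` are increasing but violate this condition.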

See Figure 1, which depicts the relationship between common square subsequences and DOMRs for two strings $A$ and $B$. By Observation 2, the problem of computing $\mathrm{LCSqS}(A, B)$ reduces to the problem of finding a longest sequence of DOMRs.

Figure 1. Illustration of the relationship between common square subsequences and DOMRs.

The basic idea of our algorithm is to extend a given sequence $S = \langle r_1, \ldots, r_m \rangle$ of DOMRs by adding a new matching rectangle to its right end. We say that a $c$-matching rectangle $r = (i, j, k, l)$ is a $c$-extension of $S$ if $\langle r_1, \ldots, r_m, r \rangle$ is a sequence of DOMRs. A $c$-extension $r$ of $S$ is dominant if $r \trianglelefteq r'$ holds for every $c$-extension $r'$ of $S$. The algorithms in this subsection are based on the following lemmas.

Lemma 3. Let $S = \langle r_1, \ldots, r_m \rangle$ be any sequence of DOMRs. If $S$ has at least one $c$-extension, then $S$ has a unique dominant $c$-extension $r'$. Furthermore, any such $r'$ can be computed in $O(1)$ time after an initial preprocessing of $A$ and $B$ in $O(\sigma n)$ time and space.

Proof. Consider $r' = (i', j', k', l')$, where $i' = \min(\{i \mid i_m < i < j_1, A[i] = c\} \cup \{n+1\})$, $j' = \min(\{j \mid j_m < j, A[j] = c\} \cup \{n+1\})$, $k' = \min(\{k \mid k_m < k < l_1, B[k] = c\} \cup \{n+1\})$, and $l' = \min(\{l \mid l_m < l, B[l] = c\} \cup \{n+1\})$. If any of $i'$, $j'$, $k'$, and $l'$ holds the sentinel value $n+1$, which corresponds to the non-existence of a further suitable match with $c$, then $S$ cannot have any $c$-extension. Otherwise $A[i'] = A[j'] = B[k'] = B[l'] = c$ and $r'$ is a $c$-matching rectangle. Furthermore, $i_m < i'$, $j_m < j'$, $k_m < k'$, $l_m < l'$, $i' < j_1$, and $k' < l_1$, so $r'$ is a $c$-extension of $S$. If we assume the existence of another $c$-extension $r''$ of $S$ such that $r' \trianglelefteq r''$ does not hold, then at least one of the definitions of $i'$, $j'$, $k'$, and $l'$ above is contradicted. Hence $r'$ must be dominant. Finally, $r'$ must clearly be unique: if $r'' \ne r'$ is also a dominant $c$-extension, then both $r' \trianglelefteq r''$ and $r'' \trianglelefteq r'$ must hold, but this is possible only if $r'' = r'$.

The values $i'$ and $j'$ can be computed in $O(1)$ time by using a precomputed table $P_A$ of size $\sigma \times n$ that holds the values $P_A[c, h] = \min(\{i \mid h < i, A[i] = c\} \cup \{n+1\})$ for all $c \in \Sigma$ and $1 \le h \le n$. The values $k'$ and $l'$ can be computed in $O(1)$ time by using an analogous precomputed table $P_B$ with values $P_B[c, h] = \min(\{i \mid h < i, B[i] = c\} \cup \{n+1\})$. Both tables can be precomputed in $O(\sigma n)$ time and space in a straightforward manner. ∎

Note that the proof of Lemma 3 refers only to $r_1$ and $r_m$ when determining the unique dominant extension of $\langle r_1, \ldots, r_m \rangle$: no inner rectangle $r_i$ with $1 < i < m$ needs to be considered. Thus all sequences of DOMRs that begin with the rectangle $r_1$ and end with the rectangle $r_m$ share the same unique dominant extensions.
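The table lookup in the proof of Lemma 3 can be sketched as follows. The names `next_occurrence_table` and `dominant_extension` are hypothetical, positions are 1-based as in the text, and the table is also given a row entry for $h = 0$ purely for convenience.

```python
def next_occurrence_table(s, alphabet):
    """table[c][h] = smallest 1-based position i > h with s[i] = c, else n + 1."""
    n = len(s)
    table = {}
    for c in alphabet:
        row = [n + 1] * (n + 1)           # row[n] = n + 1: nothing lies beyond n
        for h in range(n - 1, -1, -1):
            # 1-based position h + 1 is s[h] in Python's 0-based indexing.
            row[h] = h + 1 if s[h] == c else row[h + 1]
        table[c] = row
    return table


def dominant_extension(pa, pb, first, last, c):
    """Dominant c-extension of any DOMR sequence from `first` to `last`, or None."""
    _, j1, _, l1 = first
    im, jm, km, lm = last
    i2, j2 = pa[c][im], pa[c][jm]
    k2, l2 = pb[c][km], pb[c][lm]
    n_a, n_b = len(pa[c]) - 1, len(pb[c]) - 1
    # A sentinel n + 1 in i2 or k2 is caught by the comparisons with j1 and l1.
    if i2 >= j1 or k2 >= l1 or j2 > n_a or l2 > n_b:
        return None
    return (i2, j2, k2, l2)


if __name__ == "__main__":
    pa = next_occurrence_table("abab", "ab")
    pb = next_occurrence_table("abab", "ab")
    print(dominant_extension(pa, pb, (1, 3, 1, 3), (1, 3, 1, 3), "b"))
```

Only the first and last rectangles are passed in, mirroring the remark that inner rectangles play no role in determining the dominant extension.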

Lemma 4. Let $S = \langle r_1, \ldots, r_m \rangle$ be any sequence of at least two DOMRs. If any $c$-matching rectangle $r_h$ with $1 < h \le m$ is replaced by the dominant $c$-extension of $\langle r_1, \ldots, r_{h-1} \rangle$, the resulting sequence of matching rectangles is also a sequence of DOMRs.

Proof. The lemma clearly holds if $h = m$, so consider the case $1 < h < m$. Let $(i', j', k', l')$ be the dominant $c$-extension of $\langle r_1, \ldots, r_{h-1} \rangle$, and let $S' = \langle r'_1, \ldots, r'_m \rangle$ denote the sequence obtained from $S$ by replacing $r_h$ with $(i', j', k', l')$. Then $i'_m = i_m < j_1 = j'_1$, $k'_m = k_m < l_1 = l'_1$, and $r_x < r_{x+1}$ for $1 \le x < m$. On the other hand, $r'_{h-1} < r'_h$, since $\langle r'_1, \ldots, r'_h \rangle = \langle r_1, \ldots, r_{h-1}, (i', j', k', l') \rangle$ is also a sequence of DOMRs. Because $r'_h$ is dominant, we have $r'_{h-1} < r'_h \trianglelefteq r_h < r_{h+1} = r'_{h+1}$, which in turn implies that $r'_x < r'_{x+1}$ for $1 \le x < m$, and hence $S'$ fulfills all conditions of a sequence of DOMRs. ∎

Basic algorithm. The basic principle of our first rectangle-based algorithm, Algorithm 1, is to fix the first left-bottom matching rectangle $r_b$, and then try to extend it as far as possible in the right-upper direction. For each such starting rectangle $r_b$, we compute a dynamic programming table $DP_{r_b}$ of size $O(|M|^2)$ such that $DP_{r_b}[r_e]$ will finally store the length of the longest sequence of DOMRs beginning with $r_b$ and ending with $r_e$, where $r_e$ is either $r_b$ itself or a dominant extension. In more detail, Algorithm 1 works as follows:

Algorithm 1:

Preprocessing: Compute a list $L$ of all matching rectangles sorted according to $<$ and $\trianglelefteq$, by radix sorting all rectangles $(i, j, k, l)$ as 4-digit numbers.

Compute longest sequence of DOMRs: For each matching rectangle $r_b$ (in any order), perform the following:

(1) For each $r_e$ ($\ne r_b$), initialize $DP_{r_b}[r_e] \leftarrow 0$. Let $DP_{r_b}[r_b] \leftarrow 2$.

(2) Suppose $r_b$ is the $i$th element of $L$. For each $j = i, \ldots, |L|$ in increasing order, let $r = L[j]$ and attempt to extend a sequence $\langle r_b, \ldots, r \rangle$ of DOMRs as follows:

(a) If $DP_{r_b}[r] = 0$, then no sequence of DOMRs of the form $\langle r_b, \ldots, r \rangle$ exists.

(b) Otherwise, for each character $c$, try to compute the unique dominant $c$-extension $r'$ of any sequence $\langle r_b, \ldots, r \rangle$ of DOMRs which begins with $r_b$ and ends with $r$. If such $r'$ exists, set $DP_{r_b}[r'] \leftarrow \max\{DP_{r_b}[r'], DP_{r_b}[r] + 2\}$.

(3) If the maximum value in $DP_{r_b}$ exceeds the current best solution, then update it.
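The steps of Algorithm 1 can be sketched as below. This is an illustrative reading of the description, not a reference implementation: the name `lcsqs_rectangles` is hypothetical, the table $DP_{r_b}$ is a dictionary in which absent keys play the role of 0 entries, and `list.sort` replaces radix sort, so the stated time bounds are not met exactly.

```python
from itertools import combinations


def lcsqs_rectangles(a, b):
    """Sketch of Algorithm 1: grow DOMR sequences by dominant extensions only."""
    alphabet = sorted(set(a) & set(b))
    n_a, n_b = len(a), len(b)

    def next_table(s):
        # table[c][h] = smallest 1-based position i > h with s[i] = c, else n + 1.
        n = len(s)
        table = {}
        for c in alphabet:
            row = [n + 1] * (n + 1)
            for h in range(n - 1, -1, -1):
                row[h] = h + 1 if s[h] == c else row[h + 1]
            table[c] = row
        return table

    pa, pb = next_table(a), next_table(b)

    # Preprocessing: all matching rectangles (1-based), in lexicographic order.
    rects = []
    for c in alphabet:
        pos_a = [p + 1 for p, ch in enumerate(a) if ch == c]
        pos_b = [p + 1 for p, ch in enumerate(b) if ch == c]
        for i, j in combinations(pos_a, 2):
            for k, l in combinations(pos_b, 2):
                rects.append((i, j, k, l))
    rects.sort()

    best = 0
    for start, rb in enumerate(rects):
        dp = {rb: 2}                      # step (1); missing keys act as 0
        _, j1, _, l1 = rb
        for r in rects[start:]:           # step (2), beginning with r = rb
            if r not in dp:               # step (2a): unreachable from rb
                continue
            im, jm, km, lm = r
            for c in alphabet:            # step (2b): dominant c-extension
                ext = (pa[c][im], pa[c][jm], pb[c][km], pb[c][lm])
                if ext[0] < j1 and ext[2] < l1 and ext[1] <= n_a and ext[3] <= n_b:
                    dp[ext] = max(dp.get(ext, 0), dp[r] + 2)
        best = max(best, max(dp.values()))  # step (3)
    return best


if __name__ == "__main__":
    print(lcsqs_rectangles("abcabc", "acbacb"))
```

Every extension is strictly larger than the rectangle it extends in all four coordinates, so it appears later in the sorted list and its table entry is final by the time it is processed.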

Let us explain the correctness of Algorithm 1. Lemma 4 guarantees that an optimal sequence of DOMRs can be constructed by considering only dominant extensions. Consider any such optimal sequence of DOMRs $S = \langle r_1, \ldots, r_m \rangle$. The outer loop of Algorithm 1 will at some point select $r_b = r_1$. As the rectangles $r = L[j]$ are processed in increasing order of $j$, the sorting order of $L$ guarantees that the rectangles $r_i$ of $S$ will be selected as the current $r$ in the order $i = 1, \ldots, m$. For each such $r = r_i$, the algorithm uses Lemma 3 to consider all possible dominant extensions, including the extension $r_{i+1}$ if $i < m$. A simple inductive argument shows that the values $DP_{r_1}[r_i]$ will be computed correctly in the order $i = 1, \ldots, m$.

Let us analyze the efficiency of Algorithm 1. Constructing the tables $P_A$ and $P_B$ takes $O(\sigma n)$ time and space. Note that alphabet reduction guarantees that $O(\sigma n) = O(\sigma|M|)$. Since $1 \le i, j, k, l \le n$ for each matching rectangle $(i, j, k, l)$, we obtain a sorted list $L$ of all $O(|M|^2)$ matching rectangles in $O(|M|^2 + n)$ time and space by radix sort. Hence the preprocessing takes $O(|M|^2 + n)$ total time and space. We test no more than $\sigma$ characters for any cell $DP_{r_b}[r]$ of the dynamic programming table $DP_{r_b}$. By Lemma 3, we can compute a unique dominant $c$-extension in $O(1)$ time, if it exists. Since there are $O(|M|^2)$ candidates for $r_b$ and $O(|R|) = O(|M|^2)$ candidates for $r$, Algorithm 1 takes $O(\sigma|M|^4 + n)$ time and $O(|M|^2 + n)$ space overall.
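As a sanity check, the time bound can be read off as the sum of the preprocessing cost and the product of the loop sizes. The following is only a restatement of the quantities given above, with $T$ a hypothetical name for the total running time:

```latex
\begin{align*}
T &= \underbrace{O(\sigma |M|)}_{\text{tables } P_A,\, P_B}
   + \underbrace{O(|M|^2 + n)}_{\text{sorting } L}
   + \underbrace{O(|M|^2)}_{\text{choices of } r_b}
     \cdot \underbrace{O(|M|^2)}_{\text{choices of } r}
     \cdot \underbrace{\sigma}_{\text{characters } c}
     \cdot \underbrace{O(1)}_{\text{Lemma 3}} \\
  &= O(\sigma |M|^4 + n).
\end{align*}
```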

Improved algorithm. Now we show how to reduce the number of candidates for the starting rectangle $r_b$. We give a proof of Lemma 5; Lemmas 6 and 7 can be proven similarly.
