3 Algorithms - Combinatorial Pattern Matching

In this section, we present several algorithms for computing LCSqS(A, B). In order to avoid processing unnecessary characters, we assume that the input strings A and B have already been preprocessed by an alphabet reduction technique [16] as follows: First, we compute the lexicographical ranks of the characters in A and B. Assuming that A and B are drawn from an integer alphabet of size n^O(1), this can be done in O(n) time with radix sort. We then replace each character in A and B with its rank, turning A and B into strings over the integer alphabet [1, 2n]. Then we remove every character that appears in only one of A and B. This preprocessing preserves the common subsequences of the original A and B, since a removed character cannot occur in any common subsequence, and thus has no negative effect on computing LCSqS(A, B). Note that n ≤ M holds after alphabet reduction, while M = O(n²) still holds.
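The alphabet reduction described above can be sketched as follows. This is a minimal illustration rather than the preprocessing of [16]: the function name reduce_alphabet is hypothetical, Python's sorted() stands in for the linear-time radix sort, and ranks are assigned only to the characters that survive the removal step.

```python
def reduce_alphabet(a, b):
    """Rank-rename characters and drop those occurring in only one string.

    Assumption: sorted() replaces the radix sort of the text, so this
    sketch runs in O(n log n) rather than O(n) time.
    """
    common = set(a) & set(b)
    # Lexicographic ranks (starting at 1) of the characters shared by a and b.
    rank = {ch: r for r, ch in enumerate(sorted(common), start=1)}
    reduced_a = [rank[ch] for ch in a if ch in common]
    reduced_b = [rank[ch] for ch in b if ch in common]
    return reduced_a, reduced_b
```

For instance, reduce_alphabet("xabca", "bcady") drops x, d and y and returns ([1, 2, 3, 1], [2, 3, 1]).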

3.1 Simple Algorithm

Our first algorithm considers Θ(n²) pairs of partition points of A and B. Namely, we have that

LCSqS(A, B) = max_{1≤i<n, 1≤j<n} {2 × LCS(A[1..i], A[i+1..n], B[1..j], B[j+1..n])}.

This immediately implies an O(n⁶)-time O(n⁴)-space algorithm for computing LCSqS(A, B), since the LCS of four strings can be computed in O(n⁴) time and space by standard DP.
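As an illustration of the formula above, the following sketch tries every pair of partition points and runs a memoized four-string LCS on each. The names lcs4 and lcsqs_bruteforce are hypothetical, the recursion is only suitable for short inputs, and the two strings are allowed to have different lengths.

```python
from functools import lru_cache

def lcs4(w, x, y, z):
    """Length of a longest common subsequence of four strings (standard DP)."""
    @lru_cache(maxsize=None)
    def go(p, q, r, s):
        # go(p, q, r, s) is the LCS length of the prefixes w[:p], x[:q], y[:r], z[:s].
        if p == 0 or q == 0 or r == 0 or s == 0:
            return 0
        if w[p - 1] == x[q - 1] == y[r - 1] == z[s - 1]:
            return go(p - 1, q - 1, r - 1, s - 1) + 1
        return max(go(p - 1, q, r, s), go(p, q - 1, r, s),
                   go(p, q, r - 1, s), go(p, q, r, s - 1))
    return go(len(w), len(x), len(y), len(z))

def lcsqs_bruteforce(a, b):
    """Maximize 2 * LCS(A[1..i], A[i+1..n], B[1..j], B[j+1..n]) over all i, j."""
    best = 0
    for i in range(1, len(a)):
        for j in range(1, len(b)):
            best = max(best, 2 * lcs4(a[:i], a[i:], b[:j], b[j:]))
    return best
```

Because it follows the definition directly, such a routine can serve as a reference when testing the faster algorithms on small inputs.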

The O(n⁶)-time complexity can be improved as follows. For any matching point (i, j) ∈ M, let i′ (resp. j′) be the smallest position such that i < i′, j < j′, and (i′, j′) ∈ M. If such (i′, j′) does not exist, then let i′ = j′ = n.

▶ Observation 1. For any i ≤ k < i′ and j ≤ h < j′, LCS(A[1..k], A[k+1..n], B[1..h], B[h+1..n]) = LCS(A[1..i], A[i+1..n], B[1..j], B[j+1..n]).

By Observation 1, it is sufficient for us to consider only |M| partition points between A and B. Hence, we can compute LCSqS(A, B) in O(|M|n⁴) time and O(n⁴) space.

3.2 O(σ|M|³ + n)-time algorithm

Here we present our O(σ|M|³ + n)-time algorithm for computing LCSqS(A, B), where σ is the number of distinct characters occurring in A and B. This algorithm is based on Inenaga and Hyyrö's algorithm [16], which computes (the length of) a longest palindromic common subsequence of two given strings in O(σ|M|² + n) time. Consider a 2D plane where the string A corresponds to the vertical axis upward (i.e., A[1] is at the bottom and A[n] is at the top), and the string B corresponds to the horizontal axis rightward (i.e., B[1] is at the left end and B[n] is at the right end). Our key idea is to represent each common square subsequence of strings A and B by matching rectangles defined as follows:

For 1 ≤ i < j ≤ n and 1 ≤ k < l ≤ n, a tuple r = (i, j, k, l) is said to be a matching rectangle iff A[i] = A[j] = B[k] = B[l], and more specifically a c-matching rectangle iff A[i] = A[j] = B[k] = B[l] = c. For a matching rectangle r = (i, j, k, l), (i, k) is said to be the left-bottom corner of r, and (j, l) is said to be the right-upper corner of r. Let R denote the set of matching rectangles of A and B. Notice that |R| = O(|M|²). For two matching rectangles r = (i, j, k, l) and r′ = (i′, j′, k′, l′), let

r = r′ ⇐⇒ i = i′, j = j′, k = k′, and l = l′

r < r′ ⇐⇒ i < i′, j < j′, k < k′, and l < l′

r ⊲ r′ ⇐⇒ i ≤ i′, j ≤ j′, k ≤ k′, l ≤ l′, and r ≠ r′.

For two c-matching rectangles r = (i, j, k, l) and r′ = (i′, j′, k′, l′), let r ⊴ r′ ⇐⇒ i ≤ i′, j ≤ j′, k ≤ k′, and l ≤ l′.

A sequence ⟨r₁, …, r_m⟩ of matching rectangles is said to be a sequence of diagonally overlapping matching rectangles (DOMRs) iff r_x < r_{x+1} for all 1 ≤ x < m, i_m < j₁ and k_m < l₁, where we use the notation r_h = (i_h, j_h, k_h, l_h) for all h = 1, …, m. The size of a sequence ⟨r₁, …, r_m⟩ of DOMRs is the number m of overlapping rectangles in it.

The following observation lays the foundation for the algorithms of this subsection (and for the one of the following subsection as well):

▶ Observation 2. There is a common square subsequence T of length 2m of strings A and B iff there exists a sequence ⟨r₁, …, r_m⟩ of DOMRs of size m.

See Figure 1, which depicts the relationship between common square subsequences and DOMRs for two strings A and B. By Observation 2, the problem of computing LCSqS(A, B) reduces to the problem of finding a longest sequence of DOMRs.

Figure 1: Illustration of the relationship between common square subsequences and DOMRs.
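To make the correspondence of Observation 2 concrete, the sketch below checks the DOMR conditions for a list of rectangles and reads off the common square subsequence they represent. The function names are hypothetical, rectangles are given as 1-based tuples (i, j, k, l) as in the text, and the empty sequence is rejected by convention.

```python
def is_domr_sequence(a, b, rects):
    """Check whether rects = [(i, j, k, l), ...] is a sequence of DOMRs of a and b."""
    if not rects:
        return False  # convention of this sketch: a sequence has size m >= 1
    for (i, j, k, l) in rects:
        # Each tuple must be a matching rectangle.
        if not (1 <= i < j <= len(a) and 1 <= k < l <= len(b)):
            return False
        if not (a[i - 1] == a[j - 1] == b[k - 1] == b[l - 1]):
            return False
    # Consecutive rectangles must satisfy r_x < r_{x+1} in all four coordinates.
    for r, s in zip(rects, rects[1:]):
        if not all(x < y for x, y in zip(r, s)):
            return False
    # Overlap condition: i_m < j_1 and k_m < l_1.
    i_m, _, k_m, _ = rects[-1]
    _, j_1, _, l_1 = rects[0]
    return i_m < j_1 and k_m < l_1

def square_from_domrs(a, rects):
    """Common square subsequence represented by a DOMR sequence of size m (length 2m)."""
    half = "".join(a[i - 1] for (i, _, _, _) in rects)
    return half + half
```

For A = B = abab, the rectangles (1, 3, 1, 3) and (2, 4, 2, 4) form a sequence of DOMRs of size 2, which corresponds to the common square subsequence abab of length 4.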

The basic idea of our algorithm is to extend a given sequence S = ⟨r₁, …, r_m⟩ of DOMRs by adding a new matching rectangle to its right end. We say that a c-matching rectangle r = (i, j, k, l) is a c-extension of S if ⟨r₁, …, r_m, r⟩ is a sequence of DOMRs. A c-extension r of S is dominant if r ⊴ r′ holds for every c-extension r′ of S. The algorithms in this subsection are based on the following lemmas.

▶ Lemma 3. Let S = ⟨r₁, …, r_m⟩ be any sequence of DOMRs. If S has at least one c-extension, then S has a unique dominant c-extension r′. It is furthermore possible to compute any such r′ in O(1) time after initial preprocessing of A and B in O(σn) time and space.

Proof. Consider r′ = (i′, j′, k′, l′), where i′ = min({i | i_m < i < j₁, A[i] = c} ∪ {n+1}), j′ = min({j | j_m < j, A[j] = c} ∪ {n+1}), k′ = min({k | k_m < k < l₁, B[k] = c} ∪ {n+1}) and l′ = min({l | l_m < l, B[l] = c} ∪ {n+1}). If any of i′, j′, k′ and l′ holds the sentinel value n+1, which corresponds to non-existence of a further suitable match with c, then S cannot have any c-extension. Otherwise A[i′] = A[j′] = B[k′] = B[l′] = c and r′ is a c-matching rectangle. Furthermore i_m < i′, j_m < j′, k_m < k′, l_m < l′, i′ < j₁ and k′ < l₁, so r′ is a c-extension of S. If we assume the existence of another c-extension r″ of S such that r′ ⊴ r″ does not hold, then at least one of the definitions of i′, j′, k′ and l′ above is contradicted. Hence r′ must be dominant. Finally, r′ must clearly be unique: if also r″ ≠ r′ is a dominant c-extension, then both r′ ⊴ r″ and r″ ⊴ r′ must hold, but this is possible only if r″ = r′.

The values i′ and j′ can be computed in O(1) time by using a precomputed table P_A of size σ × n that holds the values P_A[c, h] = min({i | h < i, A[i] = c} ∪ {n+1}) for all c ∈ Σ and 1 ≤ h ≤ n. The values k′ and l′ can be computed in O(1) time by using an analogous precomputed table P_B with values P_B[c, h] = min({i | h < i, B[i] = c} ∪ {n+1}). Both tables can be precomputed in O(σn) time and space in a straightforward manner. ◀

Note that the proof of Lemma 3 refers only to r₁ and r_m when determining the unique dominant extension of ⟨r₁, …, r_m⟩: any inner rectangle r_i for 1 < i < m does not need to be considered. Thus all sequences of DOMRs that begin with the rectangle r₁ and end with the rectangle r_m share the same unique dominant extensions.
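The constant-time lookup used in the proof of Lemma 3 can be sketched as follows, assuming strings given as lists of integers in [1, σ] (as after alphabet reduction). The function names and the list-of-lists table layout are illustrative choices; the table additionally stores an entry for h = 0, which the text does not require.

```python
def next_occurrence_table(s, sigma):
    """nxt[c][h] = smallest 1-based position p > h with s[p] = c, or len(s) + 1.

    Built right to left in O(sigma * n) time and space; row 0 is unused
    because characters are assumed to lie in 1..sigma.
    """
    n = len(s)
    nxt = [[n + 1] * (n + 1) for _ in range(sigma + 1)]
    for h in range(n - 1, -1, -1):
        for c in range(1, sigma + 1):
            nxt[c][h] = nxt[c][h + 1]
        nxt[s[h]][h] = h + 1  # 0-based s[h] is the character at position h + 1
    return nxt

def dominant_extension(pa, pb, n_a, n_b, first, last, c):
    """Dominant c-extension of any DOMR sequence from `first` to `last`, or None."""
    _, j1, _, l1 = first
    im, jm, km, lm = last
    i2, j2 = pa[c][im], pa[c][jm]
    k2, l2 = pb[c][km], pb[c][lm]
    # i2 and k2 must stay below j1 and l1; j2 and l2 must not be the sentinel.
    if i2 >= j1 or j2 > n_a or k2 >= l1 or l2 > n_b:
        return None
    return (i2, j2, k2, l2)
```

Only the first and the last rectangle of the sequence are passed in, which mirrors the remark that inner rectangles play no role in determining the dominant extension.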

▶ Lemma 4. Let S = ⟨r₁, …, r_m⟩ be any sequence of at least two DOMRs. If any c-matching rectangle r_h with 1 < h ≤ m is replaced by the dominant c-extension of ⟨r₁, …, r_{h−1}⟩, the resulting sequence of matching rectangles is also a sequence of DOMRs.

Proof. The lemma clearly holds if h = m, so consider the case 1 < h < m. Let (i′, j′, k′, l′) be the dominant c-extension of ⟨r₁, …, r_{h−1}⟩, and let S′ = ⟨r′₁, …, r′_m⟩ denote the resulting sequence. Since h is neither 1 nor m and S is a sequence of DOMRs, we have i′_m = i_m < j₁ = j′₁, k′_m = k_m < l₁ = l′₁, and r_x < r_{x+1} for 1 ≤ x < m. On the other hand, r′_{h−1} < r′_h, as ⟨r′₁, …, r′_h⟩ = ⟨r₁, …, r_{h−1}, (i′, j′, k′, l′)⟩ is also a sequence of DOMRs. Because r′_h is dominant, we have r′_{h−1} < r′_h ⊴ r_h < r_{h+1} = r′_{h+1}, which in turn implies that r′_x < r′_{x+1} for 1 ≤ x < m, and hence S′ fulfills all conditions of a sequence of DOMRs. ◀

Basic algorithm. The basic principle of our first rectangle-based algorithm, Algorithm 1, is to fix the first left-bottom matching rectangle r_b, and then try to extend it as long as possible in the right-upper direction. For each such starting rectangle r_b, we compute a dynamic programming table DP_{r_b} of size O(|M|²) such that DP_{r_b}[r_e] will finally store the length of the common square subsequence represented by the longest sequence of DOMRs beginning with r_b and ending with r_e (i.e., twice the size of that sequence), where r_e is either r_b itself or a dominant extension. In more detail, Algorithm 1 works as follows:

Algorithm 1:

Preprocessing: Compute a list L of all matching rectangles sorted according to < and ⊴ by radix sorting all rectangles (i, j, k, l) as 4-digit numbers.

Compute longest sequence of DOMRs: For each matching rectangle r_b (in any order), perform the following:

(1) For each r_e (≠ r_b), we initialize DP_{r_b}[r_e] ← 0. We let DP_{r_b}[r_b] ← 2.

(2) Suppose r_b is the i-th element of L. For each j = i, i+1, …, |L| in increasing order, let r ← L[j] and attempt to extend a sequence ⟨r_b, …, r⟩ of DOMRs as follows:

(a) If DP_{r_b}[r] = 0, then no sequence of DOMRs of the form ⟨r_b, …, r⟩ exists.

(b) Otherwise, for each character c, try to compute the unique dominant c-extension r′ of any sequence ⟨r_b, …, r⟩ of DOMRs which begins with r_b and ends with r. If such r′ exists, set DP_{r_b}[r′] ← max{DP_{r_b}[r′], DP_{r_b}[r] + 2}.

(3) If the maximum value in DP_{r_b} exceeds the current best solution, then update it.

Let us explain the correctness of Algorithm 1. Lemma 4 guarantees that an optimal sequence of DOMRs can be constructed by considering only dominant extensions. Consider any such optimal sequence of DOMRs S = ⟨r₁, …, r_m⟩. The outer loop of Algorithm 1 will at some point select r_b = r₁. As r ← L[j] are processed in increasing order of j, the sorting order of L guarantees that the rectangles r_i of S will be selected as the current r in the order i = 1, …, m. For each such r = r_i, the algorithm uses Lemma 3 to consider all possible dominant extensions, including the extension r_{i+1} if i < m. A simple inductive argument shows that the values DP_{r₁}[r_i] will be correctly computed in the order i = 1, …, m.

Let us analyze the efficiency of Algorithm 1. Constructing the tables P_A and P_B takes O(σn) time and space. Note that alphabet reduction guarantees that O(σn) = O(σ|M|). Since 1 ≤ i, j, k, l ≤ n for each matching rectangle (i, j, k, l), we obtain a sorted list L of all O(|M|²) matching rectangles in O(|M|² + n) time and space by radix sort. Hence the preprocessing takes O(|M|² + n) total time and space. We test no more than σ characters for any cell DP_{r_b}[r] of the dynamic programming table DP_{r_b}. By Lemma 3, we can compute a unique dominant c-extension in O(1) time, if it exists. Since there are O(|M|²) candidates for r_b and O(|R|) = O(|M|²) candidates for r, Algorithm 1 takes overall O(σ|M|⁴ + n) time and O(|M|² + n) space.
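The steps of Algorithm 1 can be put together in a compact sketch. The following is an illustration under several assumptions rather than the authors' implementation: the names next_table and lcsqs_algorithm1 are hypothetical, Python's built-in sort replaces the radix sort, a dictionary replaces the table DP_{r_b}, the alphabet reduction is the simplified one that keeps only shared characters, and the scan over L includes r_b itself so that the one-rectangle sequence ⟨r_b⟩ can be extended.

```python
from itertools import combinations

def next_table(s, sigma):
    # nxt[c][h]: smallest 1-based position p > h with s[p] = c, else len(s) + 1.
    n = len(s)
    nxt = [[n + 1] * (n + 1) for _ in range(sigma)]
    for h in range(n - 1, -1, -1):
        for c in range(sigma):
            nxt[c][h] = nxt[c][h + 1]
        nxt[s[h]][h] = h + 1
    return nxt

def lcsqs_algorithm1(a, b):
    # Simplified alphabet reduction: keep shared characters, rename to 0..sigma-1.
    rank = {ch: r for r, ch in enumerate(sorted(set(a) & set(b)))}
    a = [rank[ch] for ch in a if ch in rank]
    b = [rank[ch] for ch in b if ch in rank]
    sigma, n_a, n_b = len(rank), len(a), len(b)
    pa, pb = next_table(a, sigma), next_table(b, sigma)

    # Enumerate all matching rectangles (i, j, k, l) with 1-based positions.
    rects = []
    for c in range(sigma):
        xs = [p + 1 for p, ch in enumerate(a) if ch == c]
        ys = [p + 1 for p, ch in enumerate(b) if ch == c]
        rects.extend((i, j, k, l)
                     for i, j in combinations(xs, 2)
                     for k, l in combinations(ys, 2))
    rects.sort()  # lexicographic order; every extension of r comes after r

    best = 0
    for start, rb in enumerate(rects):
        _, j1, _, l1 = rb
        dp = {rb: 2}  # absent keys play the role of the value 0
        for r in rects[start:]:
            value = dp.get(r)
            if value is None:
                continue  # no DOMR sequence from rb to r was found
            im, jm, km, lm = r
            for c in range(sigma):
                # Dominant c-extension via the next-occurrence tables (Lemma 3).
                i2, j2 = pa[c][im], pa[c][jm]
                k2, l2 = pb[c][km], pb[c][lm]
                if i2 < j1 and j2 <= n_a and k2 < l1 and l2 <= n_b:
                    ext = (i2, j2, k2, l2)
                    dp[ext] = max(dp.get(ext, 0), value + 2)
        best = max(best, max(dp.values()))
    return best
```

On A = B = abab the sketch starts from r_b = (1, 3, 1, 3), finds the dominant b-extension (2, 4, 2, 4), and reports 4. Since the list of all matching rectangles is materialized, it is only practical for short inputs.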

Improved algorithm. Now we show how to reduce the number of candidates for the starting rectangle r_b. We give a proof for Lemma 5; Lemmas 6 and 7 can be proven similarly.
