
The Longest Filled Common Subsequence Problem

Mauro Castelli¹, Riccardo Dondi², Giancarlo Mauri³, and Italo Zoppis⁴

1 NOVA IMS, Universidade Nova de Lisboa, Lisbon, Portugal


2 Dipartimento di Lettere, Filosofia, Comunicazione, Università degli Studi di Bergamo, Bergamo, Italy

3 Dipartimento di Informatica, Sistemistica e Comunicazione, Università degli Studi di Milano-Bicocca, Milano, Italy

4 Dipartimento di Informatica, Sistemistica e Comunicazione, Università degli Studi di Milano-Bicocca, Milano, Italy

Abstract

Inspired by a recent approach for genome reconstruction from incomplete data, we consider a variant of the longest common subsequence problem for the comparison of two sequences, one of which is incomplete, i.e., it has some missing elements. The new combinatorial problem, called Longest Filled Common Subsequence, given two sequences $A$ and $B$, and a multiset $\mathcal{M}$ of symbols missing in $B$, asks for a sequence $B^*$ obtained by inserting the symbols of $\mathcal{M}$ into $B$ so that $B^*$ induces a common subsequence with $A$ of maximum length.

First, we investigate the computational and approximation complexity of the problem and we show that it is NP-hard and APX-hard when $A$ contains at most two occurrences of each symbol. Then, we give a $\frac{3}{5}$-approximation algorithm for the problem. Finally, we present a fixed-parameter algorithm, when the problem is parameterized by the number of symbols inserted in $B$ that “match” symbols of $A$.

1998 ACM Subject Classification F.2.2 Nonnumerical Algorithms and Problems, G.2.1 Combinatorics

Keywords and phrases longest common subsequence, approximation algorithms, computational complexity, fixed-parameter algorithms

Digital Object Identifier 10.4230/LIPIcs.CPM.2017.14

1 Introduction

The comparison of sequences via Longest Common Subsequence (LCS) has been applied in several contexts where we want to retrieve the maximum number of elements that appear in the same order in two or more sequences. Well-known fields of application of LCS include scheduling and data compression; a notable example is the diff utility, which computes the differences between two files.

The extraction of common subsequences has been widely applied to compare molecular sequences in bioinformatics [17, 14]. For example, the comparison of biological sequences provides a measure of their similarities and differences, aiming at understanding whether they encode similar or different functionalities. Different approaches for the comparison of two genomes based on LCS have been considered in recent years, leading to variants of the longest common subsequence problem, like the constrained longest common subsequence [13, 8, 18, 11, 4] or the repetition-free longest common subsequence and variants thereof [7, 1, 6, 12].

The approaches based on LCS for genome comparison assume that the input sequences are complete, that is, there are no missing data. However, while Next Generation Sequencing technologies are able to produce a huge amount of DNA/RNA fragments, the cost of reconstructing a complete genome is still high [10]. Hence, released genomes often contain errors or are incomplete. These incomplete genomes are called scaffolds. One approach to the reconstruction of a genome is to fill scaffolds with missing genes, based on the comparison of an incomplete genome with a reference genome [16, 15, 9, 19]. Given an incomplete genome $B$, a set of missing genes (symbols) $\mathcal{M}$ and a reference genome $A$, the goal is to insert the missing symbols in $B$ so that the number of common adjacencies between the resulting genome $B^*$ and $A$ is maximized. We have a common adjacency when two genes $a$, $b$ are consecutive both in $A$ and $B^*$, independently of the order. We mention briefly that there is also a variant of the scaffold filling approach that compares two incomplete genomes [15, 9].

Inspired by methods for genome comparison based on LCS and by the scaffold filling approach, we introduce a new variant of the LCS problem, called the Longest Filled Common Subsequence problem, for the comparison of a complete genome $A$ and an incomplete genome $B$. The goal of the problem is to find the maximum number of genes that appear in the same order in both genomes. However, since some of the genes in $B$ are missing (a multiset $\mathcal{M}$ of symbols), we have to compute a longest common subsequence of $A$ and of a filling $B^*$ of $B$, that is, of a sequence obtained from $B$ by inserting the symbols of $\mathcal{M}$ into $B$. Notice that while the scaffold filling problem aims to reconstruct a complete genome from an incomplete one by maximizing the number of common adjacencies, here we aim to infer only those elements (genes) that appear in the same order in the complete genome $A$ and in the completed genome $B^*$.

In this paper, we investigate different algorithmic and complexity aspects of the Longest Filled Common Subsequence problem. First, in Section 3 we prove that it is NP-hard and APX-hard, even when genome $A$ contains at most two occurrences of each symbol. Notice that bounding the maximum number of occurrences of symbols in a sequence is relevant in this case, as usually the number of copies of a gene inside a genome is bounded. Then, in Section 4 we present a polynomial-time approximation algorithm of factor $\frac{3}{5}$. In Section 5, we give a fixed-parameter algorithm, where the parameter is the number of inserted symbols that lead to a “match” with symbols of sequence $A$. Such a parameter can be of interest when the number of missing elements, and in particular those that lead to a “match” with symbols of $A$, is moderate, as the complexity of the algorithm depends exponentially only on this parameter.

Some of the proofs are omitted due to the page limit.

2 Preliminaries

Figure 1: The threading schema of two sequences $A$ and $B$: lines connect matched positions of $A$ and $B$.

A subsequence of $S$ is a sequence $S'$ that is obtained from $S$ by deleting some symbols (possibly none). A common subsequence $S$ of two sequences $A$ and $B$ is a subsequence of both $A$ and $B$. A longest common subsequence of $A$ and $B$ is a common subsequence of $A$ and $B$ having maximum length.
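As an illustration of the definition above, the length of a longest common subsequence can be computed with the textbook quadratic dynamic program. The sketch below is not taken from the paper; the function name `lcs_length` is hypothetical, and sequences are assumed to be Python strings or lists.

```python
def lcs_length(a, b):
    """Length of a longest common subsequence of sequences a and b."""
    # prev[j] is the LCS length of the already processed prefix of a and b[:j].
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                # x and y can be matched: extend the LCS of the shorter prefixes.
                curr.append(prev[j - 1] + 1)
            else:
                # Otherwise drop the last symbol of one of the two prefixes.
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]
```

Only two rows of the table are kept, since the length alone is required here; recovering the matched positions would need the full table.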

Given two sequences $A$ and $B$, a common subsequence can be defined by aligning $A$ and $B$ and by connecting two positions of $A$ and $B$ containing an identical symbol with a line, such that there is no pair of crossing lines. This is called a threading schema (see Fig. 1). Given a threading schema for sequences $A$, $B$, a connection between two symbols in $A$ and $B$, respectively, is called a match, and the two positions incident to a line are said to be matched. Given a sequence $S$ and a multiset of symbols $\mathcal{M}$, we define a filling of $S$ with $\mathcal{M}$ as a sequence $S'$ obtained by inserting a subset $\mathcal{M}'$ of symbols of $\mathcal{M}$ into $S$. Notice that in a filling of $S$ with $\mathcal{M}$ not all the symbols of $\mathcal{M}$ have to be inserted in $S$. Informally, we may leave out those symbols that do not induce matches, which simplifies the algorithms we describe in Section 4 and in Section 5. Now, we are ready to present the formal definition of Longest Filled Common Subsequence.

◮ Problem 1. Longest Filled Common Subsequence (LFCS)

Instance: two sequences $A$ and $B$ over an alphabet $\Sigma$, and a multiset $\mathcal{M}$ over $\Sigma$.

Solution: a filling $B^*$ of $B$ with $\mathcal{M}$.

Measure: the length of a longest common subsequence of $A$ and $B^*$ (to be maximized).
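To make the problem definition concrete, the following sketch computes the optimal LFCS value by exhaustive search with memoization. It is not an algorithm from the paper and takes time exponential in $|\mathcal{M}|$; it relies on the observation that a position of $A$ is either unmatched, matched by alignment with a position of $B$, or matched with an inserted symbol that is still available. The name `lfcs_exact` is hypothetical.

```python
from collections import Counter
from functools import lru_cache


def lfcs_exact(a, b, m):
    """Optimal LFCS value for (a, b, m); exponential in the size of m."""
    symbols = sorted(set(m))
    counts = Counter(m)
    start = tuple(counts[s] for s in symbols)

    @lru_cache(maxsize=None)
    def best(i, j, left):
        # left[t] = copies of symbols[t] that can still be inserted.
        if i == len(a):
            return 0
        # Leave position i of A unmatched.
        res = best(i + 1, j, left)
        if j < len(b):
            # Leave position j of B unmatched.
            res = max(res, best(i, j + 1, left))
            # Match by alignment.
            if a[i] == b[j]:
                res = max(res, 1 + best(i + 1, j + 1, left))
        # Match by insertion: insert a copy of a[i] just before position j of B.
        if a[i] in symbols:
            t = symbols.index(a[i])
            if left[t] > 0:
                rest = left[:t] + (left[t] - 1,) + left[t + 1:]
                res = max(res, 1 + best(i + 1, j, rest))
        return res

    return best(0, 0, start)
```

Such a brute-force routine is only practical on tiny instances, but it can serve as a reference when testing faster heuristics.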

Given two sequences $A$, $B$ and a multiset $\mathcal{M}$ over $\Sigma$, let $B^*$ be a filling of $B$ with $\mathcal{M}$. Consider a common subsequence of $A$ and $B^*$, and their corresponding threading schema. A position of $A$ can have two possible kinds of matches (see Fig. 2): a match with a position of $B^*$ that contains a symbol of $\mathcal{M}$ inserted in $B$, called match by insertion, or a match with a position of $B^*$ not involved in an insertion, called match by alignment. We can easily compute in polynomial time two upper bounds on the number of positions of $A$ that can be matched by alignment and by insertion, which will be useful in Section 4. The first upper bound is related to a longest common subsequence $L$ of $A$ and $B$, which can be computed in polynomial time. In fact, the maximum number of positions of $A$ (and of a filling $B^*$ of $B$ with $\mathcal{M}$) that are matched by alignment is at most the length of $L$.

Next we show how to compute in polynomial time an upper bound on the number of positions of a sequence $A$ that can be matched by insertion. First, given a multiset $\mathcal{M}$ of symbols, we define an ordering of $\mathcal{M}$ as a sequence obtained by fixing an order among the elements of $\mathcal{M}$, that is, among the occurrences of the symbols of $\mathcal{M}$.

Figure 2: A filling $B^*$ of sequence $B$ in Fig. 1, computed by inserting a symbol in position 2 (symbol $b$) and a symbol in position 3 (symbol $c$), both in grey. A subsequence of $A$ and $B^*$ is induced by the threading schema of $A$ and $B^*$, where straight lines represent matches by alignment and dashed lines represent matches by insertion.

Algorithm 1

Data: $A$, $\mathcal{M}$

Result: a subsequence $A'$ of $A$ that matches the maximum number of symbols of a sequence obtained by ordering $\mathcal{M}$

    i := 1;
    A′ is an empty sequence;
    while i ≤ |A| do
        if there exists α ∈ M with A[i] = α then
            A′ := A′ · α;
            M := M \ {α};
        i := i + 1;

Next, we prove the correctness of Algorithm 1.

◮ Lemma 1. Given a sequence $A$, a multiset $\mathcal{M}$ on $\Sigma$, and a substring $A[1, i]$ of $A$, Algorithm 1 computes a subsequence of $A[1, i]$ that matches the maximum number of symbols of an ordering of $\mathcal{M}$.
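The greedy scan of Algorithm 1 can be sketched in a few lines of Python. This is an illustrative transcription rather than the authors' code: the name `match_by_insertion` is hypothetical, positions are 0-based (the paper counts from 1), and the function returns the matched positions of $A$ instead of the subsequence $A'$ itself.

```python
from collections import Counter


def match_by_insertion(a, m):
    """Positions of a (0-based) matched greedily with symbols of the multiset m."""
    remaining = Counter(m)  # copies of each symbol of m not yet used
    positions = []
    for i, symbol in enumerate(a):
        if remaining[symbol] > 0:
            # a[i] occurs in what is left of m: match it and consume one copy.
            positions.append(i)
            remaining[symbol] -= 1
    return positions
```

The subsequence $A'$ of Algorithm 1 is obtained by reading the symbols of `a` at the returned positions.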

3 Complexity of LFCS

In this section, we investigate the computational (and approximation) complexity of the LFCS problem, and we prove that it is APX-hard when $A$ contains at most two occurrences of each symbol in $\Sigma$ (we denote this restriction of LFCS by 2-LFCS). We prove the result by an L-reduction from the Maximum Independent Set problem on Cubic Graphs (Max-ISC), which is known to be APX-hard [2] (see [5] for details on L-reductions). Max-ISC, given a cubic graph $G = (V, E)$, asks for a maximum cardinality subset $V' \subseteq V$ such that for each pair $v_i, v_j \in V'$ it holds that $\{v_i, v_j\} \notin E$.

Given a cubic graph $G = (V, E)$, with $V = \{v_1, v_2, \dots, v_n\}$ and $|E| = m$, in the following we show how to construct an instance $(A, B, \mathcal{M})$ of 2-LFCS. Define an order on the edges incident on a vertex $v_i \in V$ assuming $\{v_i, v_j\} < \{v_i, v_h\}$ if $j < h$. Given a vertex $v_i$ and the edges $\{v_i, v_j\}, \{v_i, v_h\}, \{v_i, v_z\} \in E$, with $j < h < z$, we say that $\{v_i, v_j\}$ ($\{v_i, v_h\}$, $\{v_i, v_z\}$, respectively) is the first (second, third, respectively) edge incident on $v_i$.

First, we define the alphabet $\Sigma$:

$$\Sigma = \{x_{i,j} : v_i \in V, 1 \le j \le 3\} \cup \{y_{i,j} : v_i \in V, 1 \le j \le 2\} \cup \{z_{i,j} : 1 \le i \le n+m-1, 1 \le j \le 4\}.$$

The input sequences $A$ and $B$ are built by concatenating several substrings. For each $v_i \in V$, we define the following substrings of the input sequences $A$, $B$:

$$A(v_i) = y_{i,1}\, y_{i,2}\, x_{i,1}\, x_{i,2}\, x_{i,3}, \qquad B(v_i) = x_{i,1}\, x_{i,2}\, x_{i,3}\, y_{i,1}\, y_{i,2}.$$

For each $\{v_i, v_j\} \in E$, with $i < j$ (which is the $p$-th edge, $1 \le p \le 3$, incident on $v_i$ and the $q$-th edge, $1 \le q \le 3$, incident on $v_j$), define the following substrings of $A$, $B$:

$$A(\{v_i, v_j\}) = x_{i,p}\, x_{j,q}, \qquad B(\{v_i, v_j\}) = x_{j,q}\, x_{i,p}.$$

Finally, define $2(n+m-1)$ additional substrings $S_{A,1}, S_{A,2}, \dots, S_{A,m+n-1}$, $S_{B,1}, S_{B,2}, \dots, S_{B,m+n-1}$, where $S_{A,i}$, $S_{B,i}$, with $1 \le i \le m+n-1$, are defined as follows:

$$S_{A,i} = S_{B,i} = z_{i,1}\, z_{i,2}\, z_{i,3}\, z_{i,4}.$$

Now, we are able to define the input sequences $A$ and $B$ by concatenating the substrings previously defined, where substrings associated with edges of $G$ are concatenated assuming some edge ordering (we assume that $\{v_1, v_w\}$ is the first edge, while $\{v_r, v_t\}$ is the last edge according to the ordering):

$$A = A(v_1) \cdot S_{A,1} \cdot A(v_2) \cdots S_{A,n-1} \cdot A(v_n) \cdot S_{A,n} \cdot A(\{v_1, v_w\}) \cdots S_{A,n+m-1} \cdot A(\{v_r, v_t\}),$$

$$B = B(v_1) \cdot S_{B,1} \cdot B(v_2) \cdots S_{B,n-1} \cdot B(v_n) \cdot S_{B,n} \cdot B(\{v_1, v_w\}) \cdots S_{B,n+m-1} \cdot B(\{v_r, v_t\}).$$

Notice that each substring associated with an edge $\{v_i, v_j\}$ appears exactly once in both $A$ and $B$.

The multiset $\mathcal{M}$ (which in this case is a set) is defined as follows: $\mathcal{M} = \{x_{i,t} : v_i \in V, 1 \le t \le 3\}$.
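The construction can be sketched programmatically as follows. This is an illustrative reading of the reduction, not code from the paper: symbols are encoded as tuples such as `("x", i, p)`, vertices are numbered from 1, edges are given as pairs, and the edge ordering (left unspecified in the text) is assumed to be lexicographic. The name `build_2lfcs_instance` is hypothetical.

```python
def build_2lfcs_instance(n, edges):
    """Return (A, B, M) for a cubic graph with vertices 1..n and the given edges."""
    # Assumption: edges are ordered lexicographically; the text allows any order.
    edges = sorted(tuple(sorted(e)) for e in edges)
    # rank[(i, j)] = p when {v_i, v_j} is the p-th edge incident on v_i.
    rank = {}
    for v in range(1, n + 1):
        neighbours = sorted(j if i == v else i for i, j in edges if v in (i, j))
        if len(neighbours) != 3:
            raise ValueError("the graph is assumed to be cubic")
        for p, u in enumerate(neighbours, start=1):
            rank[(v, u)] = p

    def xs(i):
        return [("x", i, t) for t in (1, 2, 3)]

    def ys(i):
        return [("y", i, t) for t in (1, 2)]

    def separator(i):
        return [("z", i, q) for q in (1, 2, 3, 4)]

    # Vertex blocks A(v_i), B(v_i), followed by edge blocks A(e), B(e).
    blocks_a = [ys(i) + xs(i) for i in range(1, n + 1)]
    blocks_b = [xs(i) + ys(i) for i in range(1, n + 1)]
    for i, j in edges:  # i < j after normalisation
        xi, xj = ("x", i, rank[(i, j)]), ("x", j, rank[(j, i)])
        blocks_a.append([xi, xj])
        blocks_b.append([xj, xi])

    a, b = [], []
    for k, (block_a, block_b) in enumerate(zip(blocks_a, blocks_b)):
        if k > 0:  # S_{A,k} = S_{B,k} separates consecutive blocks
            a += separator(k)
            b += separator(k)
        a += block_a
        b += block_b
    missing = [x for i in range(1, n + 1) for x in xs(i)]
    return a, b, missing
```

For the complete graph on four vertices ($n = 4$, $m = 6$), both sequences have length $5n + 2m + 4(n+m-1) = 68$, and no symbol occurs more than twice in $A$.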

First, we prove that $(A, B, \mathcal{M})$ is an instance of 2-LFCS, that is, we prove that each symbol has at most two occurrences in $A$.

◮ Lemma 2. Each symbol of $\Sigma$ occurs at most twice in $A$.

Proof. Notice that each symbol appearing in a substring $S_{A,i}$, $1 \le i \le m+n-1$, does not appear in any other substring of $A$. Now, consider a symbol $y_{i,t}$, $1 \le i \le n$ and $1 \le t \le 2$, appearing in substring $A(v_i)$; $y_{i,t}$ does not appear in any other substring of $A$. Finally, consider a symbol $x_{i,t}$, $1 \le i \le n$ and $1 \le t \le 3$; $x_{i,t}$ has one occurrence in exactly two substrings of $A$: substring $A(v_i)$ and substring $A(\{v_i, v_j\})$ (where $\{v_i, v_j\}$ is the $t$-th edge incident on $v_i$). ◭

Let $B^*$ be a solution of 2-LFCS over instance $(A, B, \mathcal{M})$. We denote by $S_{B^*,i}$ ($B^*(v_i)$, $B^*(\{v_i, v_j\})$, respectively) the substring of a solution $B^*$ corresponding (after some insertion) to the substring $S_{B,i}$ ($B(v_i)$, $B(\{v_i, v_j\})$, respectively) of $B$.

Next, we show that we can assume that in a solution $B^*$ of 2-LFCS over instance $(A, B, \mathcal{M})$, a longest common subsequence of $A$ and $B^*$ matches by alignment a position of a substring $S_{A,i}$, $1 \le i \le m+n-1$, only with a position of $S_{B^*,i}$.

◮ Lemma 3. Given a cubic graph $G$, let $(A, B, \mathcal{M})$ be the corresponding instance of 2-LFCS, and $B^*$ a solution of 2-LFCS over $(A, B, \mathcal{M})$. Then a longest common subsequence of $A$ and $B^*$ contains each symbol $z_{t,q}$, with $1 \le t \le m+n-1$ and $1 \le q \le 4$.

Proof. Consider a solution $B^*$ of 2-LFCS over instance $(A, B, \mathcal{M})$ and assume that a longest common subsequence of $A$ and $B^*$ does not contain a symbol $z_{t,q}$, with $1 \le t \le m+n-1$ and $1 \le q \le 4$. By construction a longest common subsequence of $B^*$ and $A$ matches by alignment a position of $A(v_i)$ either with a […]

First, we prove that a longest common subsequence between $A$ and $B^*$ matches by alignment a position of $A(v_i)$ only with a position of $B(v_i)$. Assume that $i$ is the minimum value such that a longest common subsequence $S$ of $A$ and $B^*$ matches by alignment a position of $A(v_i)$ and a position of $B^*(\{v_i, v_j\})$. Notice that, by construction of $(A, B, \mathcal{M})$, no position of $S_{A,i}$ can be matched. Now, starting from $S$ we can compute a common subsequence $S'$ of $A$ and $B^*$, with $|S'| > |S|$, by modifying the alignment of $S$ as follows: (i) match by alignment the positions of $A(v_i)$ and the positions of $B^*(v_i)$ containing symbols $y_{i,1}$, $y_{i,2}$; (ii) match by alignment the positions of substring $S_{A,i}$ containing symbol $z_{i,q}$ with the positions of substring $S_{B,i}$ containing symbol $z_{i,q}$; (iii) any other match is not modified. It follows that the number of positions in $A(v_i)$ matched by $S'$ with respect to $S$ is decreased by at most three, since the positions of $A(v_i)$ containing symbols $x_{i,1}$, $x_{i,2}$, $x_{i,3}$ may no longer be matched. The number of positions in $S_{A,i}$ matched by $S'$ with respect to $S$ is increased by at least four, since each position of $S_{A,i}$ is not matched by $S$ and is matched by $S'$. By iterating this procedure, we eventually find a longest common subsequence $S'$ of $A$ and $B^*$ in which every position of $A(v_i)$ matched by alignment is matched with a position of $B(v_i)$. By the maximality of $S'$, this implies that each position of $A$ containing a symbol $z_{t,q}$, with $1 \le t \le m+n-1$ and $1 \le q \le 4$, matches a position of $B^*$ containing symbol $z_{t,q}$. ◭

Consider a vertex $v_i \in V$ and the corresponding substrings $A(v_i)$, $B(v_i)$ of $A$ and $B$. Moreover, let $\{v_i, v_j\}, \{v_i, v_h\}, \{v_i, v_z\} \in E$ be the three edges of $G$ incident on $v_i$ and consider the corresponding substrings $A(\{v_i, v_j\})$, $A(\{v_i, v_h\})$, $A(\{v_i, v_z\})$ ($B(\{v_i, v_j\})$, $B(\{v_i, v_h\})$, $B(\{v_i, v_z\})$, respectively) of $A$ (of $B$, respectively). Informally, the reduction shows that there are essentially two possible configurations (called I-configuration and C-configuration) of the substring $B^*(v_i)$ (and possibly $B^*(\{v_i, v_j\})$, $B^*(\{v_i, v_h\})$ and $B^*(\{v_i, v_z\})$) of a filling $B^*$ of $B$. A substring $B^*(v_i)$ having an I-configuration is related to the vertex $v_i$ in an independent set of $G$, while a substring $B^*(v_i)$ having a C-configuration is related to the vertex $v_i$ in a vertex cover of $G$.

We now define the two possible configurations, called I-configuration and C-configuration, for $B^*(v_i)$ and, possibly, for the substrings $B^*(\{v_i, v_j\})$, $B^*(\{v_i, v_h\})$ and $B^*(\{v_i, v_z\})$ of a filling $B^*$ of $B$. An I-configuration for the substrings $B^*(v_i)$, $B^*(\{v_i, v_j\})$, $B^*(\{v_i, v_h\})$ and $B^*(\{v_i, v_z\})$ is defined as follows:

$B^*(v_i) = B(v_i)$ (hence there is no insertion in $B(v_i)$).

For each $\{v_i, v_t\}$, with $t \in \{j, h, z\}$, where $\{v_i, v_t\}$ is the $p$-th edge incident on $v_i$, $1 \le p \le 3$, and the $q$-th edge incident on $v_t$, $1 \le q \le 3$, $B^*(\{v_i, v_t\}) = x_{i,p}\, x_{t,q}\, x_{i,p}$ (hence $x_{i,p}$ is inserted in $B(\{v_i, v_t\})$).

If $B^*(v_i)$, $B^*(\{v_i, v_j\})$, $B^*(\{v_i, v_h\})$, $B^*(\{v_i, v_z\})$ have an I-configuration, a longest common subsequence of $B^*(v_i)$ and $A(v_i)$ has length three (it matches the positions containing $x_{i,1}$, $x_{i,2}$, $x_{i,3}$), and a longest common subsequence of $A(\{v_i, v_t\})$ and $B^*(\{v_i, v_t\})$, with $t \in \{j, h, z\}$, has length two (it matches the positions containing $x_{i,p}$, $x_{t,q}$).

A C-configuration for the substring $B^*(v_i)$ is defined as follows:

$B^*(v_i) = x_{i,1}\, x_{i,2}\, x_{i,3}\, y_{i,1}\, y_{i,2}\, x_{i,1}\, x_{i,2}\, x_{i,3}$ (hence $B^*(v_i) = B(v_i) \cdot x_{i,1}\, x_{i,2}\, x_{i,3}$).

If $B^*(v_i)$ has a C-configuration, a longest common subsequence of $B^*(v_i)$ and $A(v_i)$ has length five: it matches the positions containing $y_{i,1}$, $y_{i,2}$, $x_{i,1}$, $x_{i,2}$, $x_{i,3}$.

Next, we present the main lemmata of this section.

◮ Lemma 4. Let $G$ be a cubic graph, instance of Max-ISC, and let $(A, B, \mathcal{M})$ be the corresponding instance of 2-LFCS. Then, given an independent set $I$ of $G$, we can compute in polynomial time a solution $B^*$ of 2-LFCS over instance $(A, B, \mathcal{M})$ inducing a longest common subsequence with $A$ of length $4(m+n-1) + 6|I| + 5(n - |I|) + m$.

Proof. Consider an independent set $I$ and define a solution $B^*$ of 2-LFCS over instance $(A, B, \mathcal{M})$ as follows. For each $v_i \in I$, with $\{v_i, v_j\}, \{v_i, v_h\}, \{v_i, v_z\} \in E$ the three edges of $G$ incident on $v_i$, define an I-configuration for $B^*(v_i)$, $B^*(\{v_i, v_j\})$, $B^*(\{v_i, v_h\})$, $B^*(\{v_i, v_z\})$. For each $v_i \in V \setminus I$, define a C-configuration for $B^*(v_i)$. For each edge $\{v_i, v_j\} \in E$, if $v_i, v_j \in V \setminus I$, then $B^*(\{v_i, v_j\}) = B(\{v_i, v_j\})$; notice that in this case a longest common subsequence of $A(\{v_i, v_j\})$ and $B^*(\{v_i, v_j\})$ has length one, as it matches exactly one position containing either $x_{i,p}$ or $x_{j,q}$. Finally, each position of $A$ in the substring $S_{A,i}$, with $1 \le i \le m+n-1$, is matched by alignment with the corresponding position of $S_{B^*,i}$.

Notice that the solution $B^*$ is well-defined, as each $B^*(\{v_i, v_j\})$, with $\{v_i, v_j\} \in E$, can belong to an I-configuration of at most one of $B^*(v_i)$ and $B^*(v_j)$, since at most one of $v_i$, $v_j$ belongs to $I$.

Now, consider a longest common subsequence $S$ of $A$ and $B^*$. $S$ matches $4(m+n-1)$ positions in substrings $S_{A,1}, \dots, S_{A,m+n-1}$, since all the positions of these substrings are matched and, by construction, the overall length of $S_{A,1}, \dots, S_{A,m+n-1}$ is $4(m+n-1)$. Moreover, by definition of I-configuration and C-configuration, for each $v_i \in I$, $S$ matches 3 positions of $A(v_i)$ and 2 positions of each $A(\{v_i, v_j\})$, with $\{v_i, v_j\} \in E$; for each $v_i \in V \setminus I$, $S$ matches 5 positions of $A(v_i)$; for each $\{v_i, v_j\} \in E$, with $v_i, v_j \in V \setminus I$, $S$ matches one position of $A(\{v_i, v_j\})$. Hence, $S$ matches $4(m+n-1) + 6|I| + 5(n - |I|) + m$ positions of $A$ and $B^*$. ◭

Based on Lemma 3, we can prove the following result.

◮ Lemma 5. Let $G$ be a cubic graph, instance of Max-ISC, and let $(A, B, \mathcal{M})$ be the corresponding instance of 2-LFCS. Then, given a solution $B^*$ of 2-LFCS over instance $(A, B, \mathcal{M})$ of length $4(m+n-1) + 6p + 5(n-p) + m$, we can compute in polynomial time an independent set of $G$ of size at least $p$.

By Lemmata 4 and 5, and by the APX-hardness of Max-ISC [2] we can conclude that the 2-LFCS problem is APX-hard.

◮ Theorem 6. 2-LFCS is APX-hard.

4 Approximating LFCS

In this section we give a polynomial-time approximation algorithm for LFCS of factor $\frac{3}{5}$.

The approximation algorithm picks the largest number of matched positions returned by two polynomial-time algorithms, Approx-Algorithm-1 and Approx-Algorithm-2. Notice that neither algorithm returns a filling of $B$ with $\mathcal{M}$; each returns two disjoint subsets of positions of $A$ that have to be matched by alignment and by insertion, respectively, by a subsequence of $A$ and of a filling of $B$ with $\mathcal{M}$. We can easily compute in polynomial time a filling $B^*$ of $B$ with $\mathcal{M}$ so that there exists a common subsequence of $A$ and $B^*$ that matches these two subsets of positions.

Both algorithms consist of two phases.

Approx-Algorithm-1. In the first phase, Approx-Algorithm-1 computes in polynomial time a longest common subsequence of $A$ and $B$. Denote by $R_{1,a}$ the positions of $A$ matched by this phase, and by $A'$ the subsequence of $A$ obtained by removing the positions of $R_{1,a}$. The second phase computes a subset $R_{1,i}$ of positions of $A'$ of maximum size that matches $\mathcal{M}$ by insertion, applying Algorithm 1 on $(A', \mathcal{M})$. Denote by $R_1 = R_{1,a} \cup R_{1,i}$ the set of positions returned by Approx-Algorithm-1.

Figure 3: The input sequence $A$ and the positions matched by solution $R_1$ (dashed) and by solution $R_2$ (in grey). In the upper part, brackets represent the subsets $R_{1,a}$ and $R_{1,i}$ of $R_1$, and $R_{2,a}$ and $R_{2,i}$ of $R_2$. In the lower part, the brackets represent the positions matched by $OPT$.

Approx-Algorithm-2. In the first phase, Approx-Algorithm-2 computes a subset $R_{2,i}$ of positions of $A$ of maximum size that matches $\mathcal{M}$ by insertion, applying Algorithm 1 on $(A, \mathcal{M})$. Denote by $A''$ the subsequence of $A$ obtained by removing the positions of $R_{2,i}$. The second phase computes a longest common subsequence of $B$ and $A''$; denote by $R_{2,a}$ the set of positions of $A''$ (and $A$) matched by this phase. Denote by $R_2 = R_{2,a} \cup R_{2,i}$ the set of positions returned by Approx-Algorithm-2.
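The two algorithms can be sketched together as follows. This is an illustrative implementation under stated assumptions, not the authors' code: positions are 0-based, the helper names `lcs_positions`, `greedy_insertion` and `approx_lfcs` are hypothetical, and ties between the two solutions are broken in favour of Approx-Algorithm-1.

```python
from collections import Counter


def lcs_positions(a, b):
    """Positions of a matched by one longest common subsequence of a and b."""
    n, m = len(a), len(b)
    # t[i][j] = LCS length of the suffixes a[i:] and b[j:].
    t = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if a[i] == b[j]:
                t[i][j] = 1 + t[i + 1][j + 1]
            else:
                t[i][j] = max(t[i + 1][j], t[i][j + 1])
    positions, i, j = [], 0, 0
    while i < n and j < m:
        if a[i] == b[j]:
            positions.append(i)
            i, j = i + 1, j + 1
        elif t[i + 1][j] >= t[i][j + 1]:
            i += 1
        else:
            j += 1
    return positions


def greedy_insertion(a, positions, m):
    """Algorithm 1 restricted to the given positions of a."""
    left = Counter(m)
    chosen = []
    for i in positions:
        if left[a[i]] > 0:
            chosen.append(i)
            left[a[i]] -= 1
    return chosen


def approx_lfcs(a, b, m):
    """Return (positions matched by alignment, positions matched by insertion)."""
    # Approx-Algorithm-1: alignment first, then insertion on the unmatched rest.
    r1a = lcs_positions(a, b)
    used = set(r1a)
    r1i = greedy_insertion(a, [i for i in range(len(a)) if i not in used], m)
    # Approx-Algorithm-2: insertion first, then alignment on the unmatched rest.
    r2i = greedy_insertion(a, range(len(a)), m)
    used = set(r2i)
    keep = [i for i in range(len(a)) if i not in used]
    r2a = [keep[p] for p in lcs_positions([a[i] for i in keep], b)]
    if len(r1a) + len(r1i) >= len(r2a) + len(r2i):
        return r1a, r1i
    return r2a, r2i
```

The returned pair corresponds to $(R_{1,a}, R_{1,i})$ or $(R_{2,a}, R_{2,i})$, whichever is larger in total; building an actual filling of $B$ from these positions is left out of the sketch.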

Next, we show that the maximum number of positions matched by one of Approx-Algorithm-1 and Approx-Algorithm-2 gives a $\frac{3}{5}$-approximate solution. First, we introduce some notation (see Fig. 3). Let $B^{opt}$ be an optimal solution of LFCS on instance $(A, B, \mathcal{M})$, and let $OPT$ be a longest common subsequence of $A$ and $B^{opt}$. We consider the following sets of positions of $OPT$. Denote by $OPT_a$ the set of positions of $A$ matched by alignment in $OPT$ and by $OPT_i$ the set of positions of $A$ matched by insertion in $OPT$. Notice that by construction it holds that $OPT_a \cap OPT_i = \emptyset$.

Define $OPT_{a,o} = OPT_a \cap (R_{1,a} \cup R_{2,i})$ and $OPT_{i,o} = OPT_i \cap (R_{1,a} \cup R_{2,i})$. Moreover, define $OPT_{a,e} = OPT_a \setminus OPT_{a,o}$ and $OPT_{i,e} = OPT_i \setminus OPT_{i,o}$. Informally, $OPT_{a,e}$ ($OPT_{i,e}$, respectively) is the set of positions of $A$ matched by alignment (by insertion, respectively) in $OPT$ that are matched neither in the first phase of Approx-Algorithm-1 nor in the first phase of Approx-Algorithm-2. Finally, define $OPT'_{i,o} = OPT_{i,o} \setminus R_{1,a}$ and $OPT'_{a,o} = OPT_{a,o} \setminus R_{2,i}$.

By definition of $OPT$, $OPT_{a,o}$, $OPT_{i,o}$, $OPT_{a,e}$ and $OPT_{i,e}$, it holds that $|OPT| = |OPT_{a,o}| + |OPT_{a,e}| + |OPT_{i,o}| + |OPT_{i,e}|$.

We will show that the largest set between $R_1$ and $R_2$ gives a $\frac{3}{5}$-approximate solution, that is, $\max(|R_1|, |R_2|) \ge \frac{3}{5}|OPT|$. We start by showing two bounds on $OPT_i$ and $OPT_a$.

◮ Lemma 7. $|R_{1,a}| \ge |OPT_a|$ and $|R_{2,i}| \ge |OPT_i|$.

Proof. First, we prove that $|R_{1,a}| \ge |OPT_a|$. Consider the set of positions in $OPT_a$. Since each position in $OPT_a$ is a position of $A$ matched by alignment, it follows that the set $OPT_a$ induces a common subsequence of $A$ and $B$. Since the set $R_{1,a}$ of positions of $A$ induces a longest common subsequence of $A$ and $B$, it follows that $|R_{1,a}| \ge |OPT_a|$.

Now, we prove that $|R_{2,i}| \ge |OPT_i|$. Consider the set of positions in $OPT_i$. Each position in $OPT_i$ is matched by insertion with a symbol of $\mathcal{M}$. By Lemma 1, $R_{2,i}$ is a set of positions of $A$ of maximum cardinality that can be matched by insertion with symbols of $\mathcal{M}$, hence $|R_{2,i}| \ge |OPT_i|$. ◭

As a consequence of Lemma 7, it follows that $|R_{1,a}| + |R_{2,i}| \ge |OPT_i| + |OPT_a| \ge |OPT|$. Hence the maximum of $|R_1|$, $|R_2|$ is at least $\frac{1}{2}|OPT|$. In the following, we show with a more refined analysis that the maximum of $|R_1|$, $|R_2|$ is at least $\frac{3}{5}|OPT|$.

We prove some bounds on $R_{1,i}$ and $R_{2,a}$, then we consider three cases depending on the values of $OPT_{a,o}$, $OPT_{i,o}$, $OPT_{a,e}$, $OPT_{i,e}$, $OPT'_{i,o}$ and $OPT'_{a,o}$. First, the following result holds.

◮ Lemma 8. $|R_{1,i}| \ge |OPT'_{i,o}| + |OPT_{i,e}|$ and $|R_{2,a}| \ge |OPT'_{a,o}| + |OPT_{a,e}|$.

Now, in the analysis of the approximation factor of Approx-Algorithm-1 and Approx-Algorithm-2, we consider three cases, depending on the values of $OPT_{i,e}$, $OPT_{i,o}$, $OPT'_{i,o}$.

Case 1

◮ Lemma 9. Assume that $|OPT_{i,e}| + |OPT'_{i,o}| \ge \frac{1}{2}|OPT_{i,o}|$, then $|R_1| \ge \frac{3}{5}|OPT|$.

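Proof. Since $|R_{1,i}| \ge |OPT'_{i,o}| + |OPT_{i,e}|$ by Lemma 8, it follows that

$$|R_{1,a}| + |R_{1,i}| \ge |R_{1,a}| + |OPT'_{i,o}| + |OPT_{i,e}| \ge \tfrac{3}{5}(|R_{1,a}| + |OPT_{i,e}|) + \tfrac{2}{5}(|R_{1,a}| + |OPT_{i,e}|) + |OPT'_{i,o}|.$$

By Lemma 7 it follows that $|R_{1,a}| \ge |OPT_a|$ and, since $|OPT_a| = |OPT_{a,o}| + |OPT_{a,e}|$, it follows that $|R_{1,a}| \ge |OPT_{a,o}| + |OPT_{a,e}|$, hence

$$\tfrac{3}{5}(|R_{1,a}| + |OPT_{i,e}|) + \tfrac{2}{5}(|R_{1,a}| + |OPT_{i,e}|) + |OPT'_{i,o}| \ge \tfrac{3}{5}(|OPT_{a,o}| + |OPT_{a,e}| + |OPT_{i,e}|) + \tfrac{2}{5}(|R_{1,a}| + |OPT_{i,e}|) + |OPT'_{i,o}|.$$

Hence, it holds

$$|R_{1,a}| + |R_{1,i}| \ge \tfrac{3}{5}(|OPT_{a,o}| + |OPT_{a,e}| + |OPT_{i,e}|) + \tfrac{2}{5}(|R_{1,a}| + |OPT_{i,e}|) + |OPT'_{i,o}|. \quad (1)$$

Notice that $|R_{1,a}| + |OPT'_{i,o}| \ge |OPT_{i,o}|$, since, by construction, each position in $OPT_{i,o}$ is either in $OPT'_{i,o}$ or in $R_{1,a}$. Then,

$$\tfrac{2}{5}(|R_{1,a}| + |OPT'_{i,o}|) \ge \tfrac{2}{5}|OPT_{i,o}|. \quad (2)$$

Moreover, by the hypothesis $|OPT_{i,e}| + |OPT'_{i,o}| \ge \frac{1}{2}|OPT_{i,o}|$, it holds

$$\tfrac{2}{5}(|OPT_{i,e}| + |OPT'_{i,o}|) \ge \tfrac{1}{5}|OPT_{i,o}|. \quad (3)$$

Since $\frac{2}{5}(|R_{1,a}| + |OPT_{i,e}|) + |OPT'_{i,o}| \ge \frac{2}{5}(|R_{1,a}| + |OPT'_{i,o}|) + \frac{2}{5}(|OPT_{i,e}| + |OPT'_{i,o}|)$, combining (1), (2) and (3) we obtain

$$|R_{1,a}| + |R_{1,i}| \ge \tfrac{3}{5}(|OPT_{a,o}| + |OPT_{a,e}| + |OPT_{i,e}|) + \tfrac{2}{5}(|R_{1,a}| + |OPT'_{i,o}|) + \tfrac{2}{5}(|OPT_{i,e}| + |OPT'_{i,o}|) \ge \tfrac{3}{5}(|OPT_{a,o}| + |OPT_{a,e}| + |OPT_{i,o}| + |OPT_{i,e}|).$$

It follows that, under the hypothesis $|OPT_{i,e}| + |OPT'_{i,o}| \ge \frac{1}{2}|OPT_{i,o}|$, it holds $|R_1| \ge \frac{3}{5}|OPT|$. ◭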

Case 2

The following result holds.

◮ Lemma 10. Assume that $|OPT_{a,e}| + |OPT'_{a,o}| \ge \frac{1}{2}|OPT_{a,o}|$, then $|R_2| \ge \frac{3}{5}|OPT|$.

Case 3

Assume that neither Case 1 nor Case 2 holds. Then,

$$|OPT_{i,e}| + |OPT'_{i,o}| < \tfrac{1}{2}|OPT_{i,o}| \quad \text{and} \quad |OPT_{a,e}| + |OPT'_{a,o}| < \tfrac{1}{2}|OPT_{a,o}|.$$

Since $|OPT| = |OPT_{a,o}| + |OPT_{i,o}| + |OPT_{a,e}| + |OPT_{i,e}|$, it follows that

$$|OPT| \le \tfrac{3}{2}(|OPT_{a,o}| + |OPT_{i,o}|).$$

We show that $|R_1| \ge |OPT_{a,o}| + |OPT_{i,o}|$, thus implying that $|R_1| \ge \frac{3}{5}|OPT|$.

◮ Lemma 11. $|R_{1,a} \cup R_{1,i}| \ge |OPT_{a,o}| + |OPT_{i,o}|$.

Since $|OPT| \le \frac{3}{2}(|OPT_{a,o}| + |OPT_{i,o}|)$, it follows from Lemma 11 that $|R_1| = |R_{1,a} \cup R_{1,i}| \ge \frac{2}{3}|OPT| \ge \frac{3}{5}|OPT|$.

From Lemma 9, Lemma 10 and Lemma 11, the main result of this section follows.

◮ Theorem 12. Given an instance $(A, B, \mathcal{M})$ of LFCS, the largest solution returned by Approx-Algorithm-1 and Approx-Algorithm-2 is an approximate solution of factor $\frac{3}{5}$.

Proof. From Lemma 9, Lemma 10 and Lemma 11, it follows that $\max(|R_1|, |R_2|) \ge \frac{3}{5}|OPT|$. We can compute a filling $B_1$ of $B$ with $\mathcal{M}$ that matches at least $|R_1|$ positions with $A$ as follows: we consider the positions in $R_{1,a}$ as matched by alignment, and we insert symbols of $\mathcal{M}$ in $B$ in order to match by insertion the positions in $R_{1,i}$. It follows that a longest common subsequence of $A$ and $B_1$ matches at least $|R_1|$ positions.

Similarly, we can compute a filling $B_2$ of $B$ with $\mathcal{M}$ that matches at least $|R_2|$ positions of $A$. We insert symbols of $\mathcal{M}$ in $B$ so that the positions in $R_{2,i}$ are matched by insertion. Consider the subsequence $A''$ obtained after the removal of the positions in $R_{2,i}$; a longest common subsequence of $A''$ and $B$ matches at least $|R_{2,a}|$ positions. It follows that a longest common subsequence of $A$ and $B_2$ matches at least $|R_2|$ positions. ◭

5 An FPT Algorithm

In this section, we present an FPT algorithm for LFCS parameterized by the number $k$ of positions of $A$ matched by insertions. Notice that $k \le |\mathcal{M}|$. Here we assume that the input sequences $A$ and $B$ have been extended by adding two symbols $\$_A, \$_B \notin \Sigma$, respectively, in position 0 of $A$ and $B$, respectively. Hence we assume that position 0 of $A$ and of a filling $B^*$ of $B$ with $\mathcal{M}$ is not matched by alignment or by insertion by any solution of LFCS of length greater than zero.

The algorithm we present is based on the color-coding technique [3]. Next, we present the definition of perfect families of hash functions for a multiset of symbols, on which our color-coding approach is based.

◮ Definition 13. Let $\mathcal{M}$ be a multiset of symbols and let $F$ be a family of hash functions from $\mathcal{M}$ to a set $\{c_1, \dots, c_k\}$ of colors. $F$ is called perfect if for any subset $W \subseteq \mathcal{M}$, such that $|W| = k$, there exists a function $f \in F$ which is injective on $W$.

A perfect family $F$ of hash functions from $\mathcal{M}$ to $\{c_1, \dots, c_k\}$, having size $O(\log|\mathcal{M}| \cdot 2^{O(k)})$, can be constructed in time $O(2^{O(k)} |\mathcal{M}| \log|\mathcal{M}|)$ (see [3]).

Consider a perfect family of hash functions $F : \mathcal{M} \to \{c_1, \dots, c_k\}$. Let $f \in F$ be an injective function, and define $L[i, j, C, l]$, with $C \subseteq \{c_1, \dots, c_k\}$, $0 \le i, l \le |A|$ and $0 \le j \le |B|$, as follows:

$L[i, j, C, l] = 1$ if and only if there exists a common subsequence of $A[0, i]$ and of a filling $B^*$ of $B[0, j]$ with $\mathcal{M}$ having length $l$, such that there exist $|C|$ symbols of $\mathcal{M}$ inserted in $B[0, j]$, each one associated with a distinct color of $C$ and matched by insertion with a position of $A$;

otherwise $L[i, j, C, l] = 0$.

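Next, we define the recurrence to compute $L[i, j, C, l]$, where $i \ge 1$ and $j \ge 0$.

$$L[i, j, C, l] = \max \begin{cases} L[i-1, j, C, l] & \\ L[i, j-1, C, l] & \text{if } j \ge 1 \\ L[i-1, j-1, C, l-1] & \text{if } A[i] = B[j] \text{ and } j \ge 1 \\ L[i-1, j, C \setminus \{c\}, l-1] & \text{if } A[i] = \alpha \text{ and there exists } \alpha \in \mathcal{M} \text{ with } f(\alpha) = c \in C \end{cases} \quad (4)$$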

For the base case, since we have extended $A$ and $B$ so that position 0 in $A$ and in the filling of $B$ cannot be matched by insertions or by alignment, it holds that $L[0, 0, C, l] = 1$ if $C = \emptyset$ and $l = 0$, else $L[0, 0, C, l] = 0$. Next, we prove the correctness of the recurrence.

◮ Lemma 14. Let $F : \mathcal{M} \to \{c_1, \dots, c_k\}$ be a perfect family of hash functions, let $f \in F$ be an injective function and let $C$ be a subset of $\{c_1, \dots, c_k\}$. Then there exists a common subsequence of length $l$, $l \ge 0$, of $A[0, i]$, $0 \le i \le |A|$, and of a filling of $B[0, j]$, $0 \le j \le |B|$, with $\mathcal{M}' \subseteq \mathcal{M}$ such that each symbol of $\mathcal{M}'$ matched by insertion is associated with a distinct color in $C$ if and only if $L[i, j, C, l] = 1$.
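The dynamic program for a single coloring can be sketched as below. This is a simplified illustration and departs from the paper in three labeled ways: the coloring is passed in explicitly as a list `colors` (one color in `0..k-1` per element of `m`) instead of being drawn from a perfect family of hash functions; the table stores, for each color set $C$, the maximum length $l$ with $L[i, j, C, l] = 1$ rather than one boolean per $l$; and prefix lengths replace the sentinel position 0. The name `lfcs_fixed_coloring` is hypothetical.

```python
def lfcs_fixed_coloring(a, b, m, colors, k):
    """Max common-subsequence length using exactly k inserted symbols of m
    with pairwise distinct colors, or None if no such solution exists."""
    neg = float("-inf")
    # colors_of[s] = bitmask of the colors assigned to occurrences of s in m.
    colors_of = {}
    for symbol, c in zip(m, colors):
        colors_of[symbol] = colors_of.get(symbol, 0) | (1 << c)
    full = 1 << k
    # prev[j][C] = best length for the processed prefix of a, b[:j], color set C.
    prev = [[0 if C == 0 else neg for C in range(full)]
            for _ in range(len(b) + 1)]
    for x in a:
        curr = []
        for j in range(len(b) + 1):
            row = list(prev[j])  # current position of a left unmatched
            for C in range(full):
                if j >= 1:
                    # Last position of b[:j] left unmatched.
                    row[C] = max(row[C], curr[j - 1][C])
                    if x == b[j - 1]:
                        # Match by alignment.
                        row[C] = max(row[C], prev[j - 1][C] + 1)
                # Match by insertion with an occurrence of x whose color c is in C.
                avail = colors_of.get(x, 0) & C
                c = 0
                while avail >> c:
                    if (avail >> c) & 1:
                        row[C] = max(row[C], prev[j][C ^ (1 << c)] + 1)
                    c += 1
            curr.append(row)
        prev = curr
    best = prev[len(b)][full - 1]
    return None if best == neg else best
```

To obtain the full algorithm, this routine would be run once per function of a perfect family (or, as a commonly used randomized alternative, on sufficiently many random colorings), keeping the best value found.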

Now, we are able to prove the main result of this section.

◮ Theorem 15. Let $A$, $B$ be two sequences and $\mathcal{M}$ a multiset of symbols. Then it is possible to determine in time $2^{O(k)} \mathrm{poly}(|A| + |B| + |\mathcal{M}|)$ whether there exists a solution $B^*$ of LFCS over instance $(A, B, \mathcal{M})$ such that a longest common subsequence $S$ of $A$ and $B^*$ has length $l$ and matches by insertion $k$ positions of $A$.

Proof. The correctness of the algorithm follows from Lemma 14 and from the fact that entry $L[|A|, |B|, C, l] = 1$ if and only if there exists a solution of LFCS over instance $(A, B, \mathcal{M})$ having length $l$ that matches by insertion $k$ positions of $A$.

Next, we consider the time complexity of the algorithm. A perfect family of hash functions that color-codes the symbols of $\mathcal{M}$ can be computed in time $2^{O(k)} \mathrm{poly}(|\mathcal{M}|)$. Then, the algorithm iterates through $2^{O(k)} \mathrm{poly}(|\mathcal{M}|)$ color-codings. For each color-coding, the table $L[i, j, C, l]$ is computed in time $O(2^k |A|^2 |B| k)$ (where $l \le |A|$), since for each of the at most $O(2^k |A|^2 |B|)$ entries we need to look for at most $k$ possible entries. The overall complexity is then $2^{O(k)} \mathrm{poly}(|A| + |B| + |\mathcal{M}|)$. ◭

6 Conclusion

We have introduced a variant of the LCS problem, called Longest Filled Common Subsequence (LFCS), to compare a sequence $A$ with an incomplete sequence $B$ to be filled with a multiset $\mathcal{M}$ of symbols. We have shown that the problem is APX-hard (hence NP-hard), even when each symbol occurs at most twice in the input sequence $A$. Then, we have given an approximation algorithm of factor $\frac{3}{5}$ and a fixed-parameter algorithm, where the parameter is the number of symbols in $\mathcal{M}$ matched by insertion.

There are some interesting open problems related to LFCS. It would be interesting to extend LFCS to the comparison of two incomplete sequences, similar to what has been done for Scaffold Filling [15]. Moreover, it would be interesting to design more efficient parameterized algorithms for LFCS, for example by considering the algebraic technique used for the repetition-free longest common subsequence [6]. Another open problem is whether LFCS is NP-hard on a constant size alphabet.

References

1 Said Sadique Adi, Marília D. V. Braga, Cristina G. Fernandes, Carlos Eduardo Ferreira, Fábio Viduani Martinez, Marie-France Sagot, Marco A. Stefanes, Christian Tjandraatmadja, and Yoshiko Wakabayashi. Repetition-free longest common subsequence. Discrete Appl. Math., 158(12):1315–1324, 2010. doi:10.1016/j.dam.2009.04.023.

2 Paola Alimonti and Viggo Kann. Some APX-completeness results for cubic graphs. Theor. Comput. Sci., 237(1-2):123–134, 2000. doi:10.1016/S0304-3975(98)00158-3.

3 Noga Alon, Raphael Yuster, and Uri Zwick. Color-coding. J. ACM, 42(4):844–856, 1995. doi:10.1145/210332.210337.

4 Abdullah N. Arslan and Ömer Egecioglu. Algorithms for the constrained longest common subsequence problems. Int. J. Found. Comput. Sci., 16(6):1099–1109, 2005. doi:10.1142/S0129054105003674.

5 Giorgio Ausiello, Pierluigi Crescenzi, Giorgio Gambosi, Viggo Kann, Alberto Marchetti-Spaccamela, and Marco Protasi. Complexity and Approximation: Combinatorial Optimization Problems and Their Approximability Properties. Springer-Verlag, Heidelberg, 1999. doi:10.1007/978-3-642-58412-1.

6 Guillaume Blin, Paola Bonizzoni, Riccardo Dondi, and Florian Sikora. On the parameterized complexity of the repetition free longest common subsequence problem. Inf. Process. Lett., 112(7):272–276, 2012. doi:10.1016/j.ipl.2011.12.009.

8 Paola Bonizzoni, Gianluca Della Vedova, Riccardo Dondi, and Yuri Pirola. Variants of constrained longest common subsequence. Inf. Process. Lett., 110(20):877–881, 2010. doi:10.1016/j.ipl.2010.07.015.

9 Laurent Bulteau, Anna Paola Carrieri, and Riccardo Dondi. Fixed-parameter algorithms for scaffold filling. Theor. Comput. Sci., 568:72–83, 2015. doi:10.1016/j.tcs.2014.12.005.

10 P. G. S. Chain et al. Genomics. Genome project standards in a new era of sequencing. Science, 326:236–237, 2009. doi:10.1126/SCIENCE.1180614.

11 Francis Y. L. Chin, Alfredo De Santis, Anna Lisa Ferrara, N. L. Ho, and S. K. Kim. A simple algorithm for the constrained sequence problems. Inf. Process. Lett., 90(4):175–179, 2004. doi:10.1016/j.ipl.2004.02.008.

12 Carlos Eduardo Ferreira and Christian Tjandraatmadja. A branch-and-cut approach to the repetition-free longest common subsequence problem. Electron. Notes Discrete Math., 36:527–534, 2010. doi:10.1016/j.endm.2010.05.067.

13 Zvi Gotthilf, Danny Hermelin, and Moshe Lewenstein. Constrained LCS: hardness and approximation. In Paolo Ferragina and Gad M. Landau, editors, Proceedings of the 19th Annual Symposium on Combinatorial Pattern Matching (CPM 2008), volume 5029 of LNCS, pages 255–262. Springer, 2008. doi:10.1007/978-3-540-69068-9_24.

14 Tao Jiang and Ming Li. On the approximation of shortest common supersequences and longest common subsequences. SIAM J. Comput., 24(5):1122–1139, 1995. doi:10.1137/S009753979223842X.

15 Nan Liu, Haitao Jiang, Daming Zhu, and Binhai Zhu. An improved approximation algorithm for scaffold filling to maximize the common adjacencies. IEEE/ACM Trans. Comput. Biol. Bioinform., 10(4):905–913, 2013. doi:10.1109/TCBB.2013.100.

16 Adriana Muñoz, Chunfang Zheng, Qian Zhu, Victor A. Albert, Steve Rounsley, and David Sankoff. Scaffold filling, contig fusion and comparative gene order inference. BMC Bioinformatics, 11:304, 2010. doi:10.1186/1471-2105-11-304.

17 Temple F. Smith and Michael S. Waterman. Identification of common molecular subsequences. J. Mol. Biol., 147(1):195–197, 1981. doi:10.1016/0022-2836(81)90087-5.

18 Yin-Te Tsai. The constrained longest common subsequence problem. Inf. Process. Lett., 88(4):173–176, 2003. doi:10.1016/j.ipl.2003.07.001.