The Approximate Period Problem

(1)

The Approximate Period Problem

Vladimir Yu. Popov

∗†

Abstract—We show that the approximate period problem is NP-complete for metric and alphabet car-dinality 7.

Keywords: approximate periods, computational com-plexity

1 Introduction

Most of the current work in music information retrieval has been done with monophonic sources. In monophonic music, no new note begins until the current note has fin-ished sounding. The most basic approach to monophonic feature extraction reduces notes to a single dimension. Pitch is extracted and duration is ignored, or vice versa. Both pitch and duration information may be used in the final retrieval system, but as features they are treated separately. Most music information retrieval researchers favor relative measures because a change in tempo or transposition across keys does not significantly alter the music information expressed [1], [2], [3], [4], [5], [6], [7].

Relative pitch has three standard expressions: exact in-terval, rough contour, and simple contour. Exact interval is the signed magnitude between two contiguous pitches. Simple contour keeps the sign and discards the magni-tude. Rough contour keeps the sign and groups the mag-nitude into a number of equivalence classes. Collectively, this techniques are known as unigrams. Long sequences, orn-grams, are constructed from an initial sequence of in-terval or ratio unigrams. A most sophisticated approach to n-gram extraction is the detection of repeating

pat-terns [8], [9], [10]. String matching n-gram extraction

techniques use notions such as insertions, deletions, and substitutions.

Regularities in experimentally obtained data often reveal important knowledge about the underlying physical sys-tem. Regularities in a biological sequence can be used to identify the sequence among other sequences, or to in-fer information about the evolution of the sequence. The genomes of eukaryotes, i.e. higher order organisms such humans, contain many regularities. Tandem repeats, or

∗_Ural _State _University, _Department _of Mathemat-ics and Mechanics, 620083 Ekaterinburg, RUSSIA Email: Vladimir.Popov@usu.ru

†_{The work partially supported by Grant of President of the} Russian Federation MD-1687.2008.9 and Analytical Departmen-tal Program ”Developing the scientific potential of high school” 2.1.1/1775.

tandem arrays, which are consecutive occurrences of the same string, are the most frequent. For example, the

six nucleotidesT T AGGGappear at the end of every

hu-man chromosome in tandem arrays that contain between one and two thousand copies [11]. Finding occurrences of repeated substrings in string is a widely studied prob-lem. In biological sequence analysis searching for tandem repeats is used to reveal structural and functional infor-mation.

DNA and protein sequences can be seen as long texts over specific alphabets. Those sequences represent the genetic code of living beings. Searching specific sequences over those texts appeared as a fundamental operation for prob-lems such as looking for given features in DNA chains, or determining how different two genetic sequences were. This was modeled as searching for given “patterns” in a “text”. However, exact searching was of no use for this application, since the patterns rarely matched the text exactly. The genetic sequences of two members of the same species are not identical, they are just very similar. Moreover, searching for exact tandem repeats can be too restrictive because of experimental errors. This gave a motivation to “search allowing errors”.

2 Preliminaries

Traditionally the alignment notation has been used to illustrate a comparison between two or more sequences. Given a set of strings

X ={x1, x2, . . . , xk}

on an alphabet Σ, a multiple alignment of X is a set of

strings

A={A1, A2, . . . , Ak},

|A1|=|A2|=. . .|Ak|=n, on augmented alphabet Γ =

Σ∪{∆}such that each stringAiis a copy ofxiinto which

n− |xi| copies of special symbol ∆ have been inserted.

Symbol ∆ is called an indel and represents the insertion or deletion of a particular symbol in one string relative to another [12].

A conventional way to measure the approximate simi-larity between two sequences a1. . . amandb1. . . bn is to

(2)

calculate local transformations or costs of local transfor-mations. Usually the considered local transformations are the following:

• substitution: ai→bj;

• insertion: ∆→bj;

• deletion: ai→∆.

To define a distance between sequences, one should first fix the set of local transformations and non-negative

val-ued cost function δ that gives for each transformation

a→b a costδ(a, b). A penalty matrix specifies the

sub-stitution cost for each pair of characters and the inser-tion/deletion cost for each character. The differences ap-pearing in the considered two sequences can be viewed differently, e.g., one substitution can be viewed as one insertion and one deletion. Therefore, it is natural to observe the minimum number of such differences. The

weighted edit distance between xand y is the minimum

cost to convertxtoyusing a penalty matrix. The general

framework of the edit distance and its many variations is given in [13].

3 Problem Definition

We consider the notion of approximate periods which is an approximate version of periods. This notion is first

discussed in [14]. Let δ be a distance function which

specified by a penalty matrix. Given two stringsx, pand

distance functionδ, we define approximate periods as

fol-lows. If there exists a partition ofxinto disjoint blocks of substrings, i.e.,x=p1. . . pr,pi6=ǫ, such thatδ(p, pi)≤t

for 1≤i < r, and δ(p′_{, p}_i₎_≤_t_where _p′ _{is some prefix of}

p, we say thatpis at-approximate period ofx(see [14]).

Let us consider the following problem:

The approximate period problem (AP)

Instance: A finite alphabet Γ, a string x from Γ∗_{, a}

penalty matrixM, and a positive integer t.

Question: Is there a string u ∈ Γ∗ _{such that} _u _{is a}

t-approximate period ofx?

4 The Main Result

If |Γ| ≥ 9 then there exists δ such that δ(a, a) = 0,

δ(a, b) =δ(b, a) for alla, b∈Γ, andAP problem isNP

-complete [14]. It is shown in [15] that AP problem is

NP_{-complete for} _|_Γ_{| ≥} _{5 and penalty matrix} _M _{as in}

figure 1.

Theorem. _If _|_Γ_{| ≥} _{7 then there exists} _δ _{such that}

δ(a, a) = 0, δ(a, b) = δ(b, a) for all a, b ∈ Γ, and AP

problem isNP_-complete.

A G C T ∆

A 0 m d t+ 1 t+ 1

G m 0 d t+ 1 t+ 1

C d d 2d t+ 1 t+ 1

T t+ 1 t+ 1 t+ 1 0 t+ 1

∆ t+ 1 t+ 1 t+ 1 t+ 1 0

Figure 1: The penalty matrix M

Proof. _{It is easy to see that the}_AP_{problem is in}NP_.

Consider the familiar Hamming distance, D, where the

local transformations are of the form a → b with cost

D(a, a) = 0 andD(a, b) = 1, fora6=b. Lety1 andy2 be

finite strings. Let us consider the following problem:

The closest string problem (CS)

Instance: A finite alphabet Σ, a set S={s1, s2, . . . , sn}

of strings each of lengthm,S⊆Σ∗_{, and a positive integer}

d.

Question: Is there a string s of length m such that for every stringsi∈S, D(s, si)≤d?

TheCSproblem isNP_{-complete even for the restriction}

to a binary alphabet [16], [17]. We will assume that Σ =

{0,1}. Now we transform an instance of theCSproblem

to an instance of theAPproblem as follows.

• Γ = Σ∪ {2,3,4, T,∆}

• x=T2m_T2₃m_T2₂m_T2₂m_T2₃m_T2₃m_T2₄m_T2_s

1T2s2T2

s3T . . . T snT

• t=md. Note that we can assume thatd >m

2.

• Define the penalty matrixM as in figure 2.

0 1 2 3 4 T ∆

0 0 m d d d t+ 1 t+ 1

1 m 0 d d d t+ 1 t+ 1

2 d d 0 2d 2d t+ 1 t+ 1

3 d d 2d 0 2d t+ 1 t+ 1

4 d d 2d 2d 0 t+ 1 t+ 1

T t+ 1 t+ 1 t+ 1 t+ 1 t+ 1 0 t+ 1

∆ t+ 1 t+ 1 t+ 1 t+ 1 t+ 1 t+ 1 0

Figure 2: The penalty matrix M

It is easy to see that this transformation can be done in polynomial time.

Let us show that if there is a string usuch that u is a

t-approximate period ofx, then u=T u′_T _where _u′ _{is a}

string in alphabet{0,1}.

Let us consider a partition of x into disjoint blocks of

(3)

substrings: x=p1. . . pr.

First suppose that u has no T. Clearly, there exists a

partition block of x which has at least one T, and the

distance between u and the partition block is greater

then t. Therefore, u must have at least one T.

Sup-pose that u has no more than one T. In this case,

u = u′_(0,_1,_2,_3,_4,_{∆)T u}′′_(0,_1,_2,_3,_4,_{∆). Consider the}

alignment ofp1, . . . , prinduced byu. It is easy to see that

p1=p′1(∆)T p′′1(2,∆). It is clear that eitherδ(u, p1)> t

or

δ(u′_(0,_1,_2,_3,_4,_{∆), p}′

1(∆)) +

+δ(u′′_(0,_1,_2,_3,_4,_{∆), p}′′

1(2,∆))≤t. (1)

Since u is a t-approximate period of x, δ(u, p1)≤ t. In

view of (1),

δ(u′_(0,_1,_2,_3,_4,_{∆), p}′

1(∆))≤t (2)

and

δ(u′′_(0,_1,_2,_3,_4,_{∆), p}′′

1(2,∆))≤t.

Sinceδ(∆, y) =t+ 1,y ∈ {0,1,2,3,4}, in view of (2), it is easy to see that

u′_(0,_1,_2,_3,_4,_∆)_{∈ {}_∆_}∗_. ₍₃₎

Since p1 =p′1(∆)T p′′1(2,∆), p2 =p′2(2,∆)T p′′2(∆). It is

clear that eitherδ(u, p2)> tor

δ(u′_(0,_1,_2,_3,_4,_{∆), p}′

2(2,∆)) +

+δ(u′′_(0,_1,_2,_3,_4,_{∆), p}′′

2(∆))≤t. (4)

Since u is a t-approximate period of x, δ(u, p2) ≤ t.

Therefore, in view of (4),

δ(u′_(0,_1,_2,_3,_4,_{∆), p}′

2(2,∆))≤t (5)

and

δ(u′′_(0,_1,_2,_3,_4,_{∆), p}′′

2(∆))≤t. (6)

Since δ(∆, y) = t+ 1, y ∈ {0,1,2,3,4}, in view of (3) and (5), p′

2(2,∆) ∈ {∆}∗. Similarly, in view of (6),

u′′_(0,_1,_2,_3,_4,_∆) _{∈ {}_∆_}∗ _and _p′′

1(2,∆) ∈ {∆}∗. Since

p′′

1(2,∆) ∈ {∆}∗, p2′(2,∆) ∈ {∆}∗ and 2m is a

subse-quence of p′′

1(2,∆)p′2(2,∆),umust have at least twoT.

Now suppose that

u=u′_(0,_1,_2,_3,_4,_{∆)T u}′′_(0,_1,_2,_3,_4,_{∆)T . . .}

. . . u(k)_(0,_1,_2,_3,_4,_{∆)T u}(k+1)_(0,_1,_2,_3,_4,_∆).

It is easy too see that in this case

p1=p′1T p′′1T . . . p (k) 1 T p

(k+1) 1 ,

where p′

1, p′′1, . . . , p (k) 1 , p

(k+1)

1 ∈ {0,1,2,3,4,∆}∗. Since u

is at-approximate period ofx,δ(u, p1)≤t. Therefore,

δ(u′_(0,_1,_2,_3,_4,_{∆), p}′

1) +

+δ(u′′_(0,_1,_2,_3,_4,_{∆), p}′′

1) +. . .

+δ(u(k)_(0,_1,_2,_3,_4,_{∆), p}(k) 1 ) +

+δ(u(k+1)_(0,_1,_2,_3,_4,_{∆), p}(k+1)

1 )≤t. (7)

In view of (7),

δ(u′_(0,_1,_2,_3,_4,_{∆), p}′

1)≤t,

δ(u′′_(0,_1,_2,_3,_4,_{∆), p}′′

1)≤t,

. . .

δ(u(k)_(0,_1,_2,_3,_4,_{∆), p}(k) 1 )≤t,

δ(u(k+1)_(0,_1,_2,_3,_4,_{∆), p}(k+1)

1 )≤t. (8)

Suppose thatk= 2p+ 1 wherep≥1. In this case

p2=p′2T p′′2T . . . p (k) 2 T p

(k+1) 2 ,

wherep′′

2 ∈ {∆}∗. It is easy to see thatp′′1 ∈ {2,∆}∗, and

2m_{is a subsequence of}_p′′

1. Sinceδ(∆,2) =t+ 1, In view

of (8), it is easy to check thatu′′_(0,_1,_2,_3,_4,_∆)_{∈ {}_/ _∆_}∗_.

Sinceδ(u, p2)≤t,

δ(u′_(0,_1,_2,_3,_4,_{∆), p}′

2) +

+δ(u′′_(0,_1,_2,_3,_4,_{∆), p}′′

2) +. . .

+δ(u(k)_(0,_1,_2,_3,_4,_{∆), p}(k) 2 ) +

+δ(u(k+1)_(0,_1,_2,_3,_4,_{∆), p}(k+1)

2 )≤t. (9)

In view of (9),

(4)

δ(u′′_(0,_1,_2,_3,_4,_{∆), p}′′

2)≤t.

Since k = 2p + 1, it is easy to check that p′′

2 ∈

{∆}∗_. _{In view of} _{δ(∆, y) =} _t _{+ 1,} _y _{∈ {}_0,_1,_2,₃_}_,

u′′_(0,_1,_2,_3,_4,_∆)_{∈ {}_∆_}∗_{. Therefore,}_k₆_{= 2p}_{+ 1.}

Suppose that k= 2pwherep≥7. It is easy to see that

in this case

p1=W1(∆)T(2m)∆T W2(∆)T(3m)∆T W3(∆)T(2m)∆

T W4(∆)T(2m)∆T W5(∆)T(3m)∆T W6(∆)T(3m)∆

T W7(∆)T(4m)∆T W8(∆)T(s1)∆T W9(∆)T(s2)∆T . . .

. . . T(sp₋7)∆T Wp+1(∆),

whereWj(∆)∈ {∆}∗, 1≤j≤p+ 1,

(Y)∆= ∆q1Y1∆q2Y2. . .∆qrYr∆qr+1,

Y =Y1Y2. . . Yr.

Therefore,

δ(u′_(0,_1,_2,_3,_4,_{∆), W}

1(∆)) +

+δ(u′′_(0,_1,_2,_3,_4,_∆),₍₂m₎

∆) +

+δ(u′′′_(0,_1,_2,_3,_4,_{∆), W}₂_{(∆)) +}

+δ(u(4)_(0,_1,_2,_3,_4,_∆),₍₃m₎

∆) +

+δ(u(5)_(0,_1,_2,_3,_4,_{∆), W}

3(∆)) +

+δ(u(6)_(0,_1,_2,_3,_4,_∆),₍₂m₎

∆) +

+δ(u(7)_(0,_1,_2,_3,_4,_{∆), W}

4(∆)) +

+δ(u(8)_(0,_1,_2,_3,_4,_∆),₍₂m₎

∆) +

+δ(u(9)_(0,_1,_2,_3,_4,_{∆), W}

5(∆)) +

+δ(u(10)_(0,_1,_2,_3,_4,_∆),₍₃m₎

∆) +

+δ(u(11)_(0,_1,_2,_3,_4,_{∆), W}

6(∆)) +

+δ(u(12)(0,1,2,3,4,∆),(3m₎

∆) +

+δ(u(13)_(0,_1,_2,_3,_4,_{∆), W}

7(∆)) +

+δ(u(14)_(0,_1,_2,_3,_4,_∆),₍₄m₎

∆) +

+δ(u(15)_(0,_1,_2,_3,_4,_{∆), W}

8(∆)) +

+δ(u(16)_(0,_1,_2,_3,_4,_∆),_(s 1)∆) +

+δ(u(17)_(0,_1,_2,_3,_4,_{∆), W}

9(∆)) +

+. . .+δ(u(k)_(0,_1,_2,_3,_4,_∆),_(s

p−7)∆) +

+δ(u(k+1)_(0,_1,_2,_3,_4,_{∆), W}

p+1(∆))≤t (10)

and

δ(u′_(0,_1,_2,_3,_4,_{∆), W}₁_(∆))_≤_t

δ(u′′_(0,_1,_2,_3,_4,_∆),₍₂m₎

∆)≤t

δ(u′′′_(0,_1,_2,_3,_4,_{∆), W}₂_(∆))_≤_t

δ(u(4)_(0,_1,_2,_3,_4,_∆),₍₃m₎

∆)≤t

δ(u(5)_(0,_1,_2,_3,_4,_{∆), W}

3(∆))≤t

δ(u(6)_(0,_1,_2,_3,_4,_∆),₍₂m₎

∆)≤t

δ(u(7)_(0,_1,_2,_3,_4,_{∆), W}

4(∆))≤t

δ(u(8)_(0,_1,_2,_3,_4,_∆),₍₂m₎

∆)≤t

δ(u(9)_(0,_1,_2,_3,_4,_{∆), W}

5(∆))≤t

δ(u(10)_(0,_1,_2,_3,_4,_∆),₍₃m₎

∆)≤t

δ(u(11)_(0,_1,_2,_3,_4,_{∆), W}

6(∆))≤t

δ(u(12)_(0,_1,_2,_3,_4,_∆),₍₃m₎

∆)≤t

δ(u(13)_(0,_1,_2,_3,_4,_{∆), W}

7(∆))≤t

δ(u(14)_(0,_1,_2,_3,_4,_∆),₍₄m₎

∆)≤t

δ(u(15)_(0,_1,_2,_3,_4,_{∆), W}

8(∆))≤t

δ(u(16)_(0,_1,_2,_3,_4,_∆),_(s

1)∆)≤t

δ(u(17)_(0,_1,_2,_3,_4,_{∆), W}

9(∆))≤t

. . . δ(u(k)_(0,_1,_2,_3,_4,_∆),_(s

p−7)∆)≤t

δ(u(k+1)_(0,_1,_2,_3,_4,_{∆), W}

p+1(∆))≤t (11)

Since Wj(∆) ∈ {∆}∗, 1 ≤ j ≤ p+ 1, in view of (11),

without loss of generality, we can assume that Wj(∆),

u(2j−1)_(0,_1,_2,_3,_4,_{∆), 1}_≤ _j _≤_p_{+ 1, are empty words.}

In view of (11), it is easy to check that if (2m₎

∆) =

Y1Y2. . . Yr and u′′(0,1,2,3,4,∆) = Z1Z2. . . Zr, where

Y1, Y2, . . . , Yr ∈ {2,∆}, Z1, Z2, . . . , Zr∈ {0,1,2,3,4,∆},

then

Yi= ∆⇔Zi= ∆,1≤i≤r.

Therefore, without loss of generality, we can assume that (2m₎

∆) ∈ {2}∗ and u′′(0,1,2,3,4,∆) ∈ {0,1,2,3,4}∗.

Since (2m₎

∆)∈ {2}∗, it is easy to check that (2m)∆) =

2m_{. Similarly, we can assume that} _u2j_(0,_1,_2,_3,_∆) _∈

{0,1,2,3}∗_{, (3}m₎

∆) = 3m, (4m)∆) = 4m, (si)∆ = si,

where 1< j≤p, 1≤i≤p−2.

Let occ(y, v) denote the number of

occur-rences of the letter y in the word v. Since

u′′_(0,_1,_2,_3,_4,_{∆), u}(4)_(0,_1,_2,_3,_4,_∆) _∈ _{_0,_1,_2,_3,₄_}∗

(5)

and (2m₎

∆) = 2m, (3m)∆) = 3m, in view of (10), it is

easy to see that

d·occ(0, u′′_(0,_1,_2,_3,_4,_∆))+

+d·occ(1, u′′_(0,_1,_2,_3,_4,_∆))+

+2d·occ(3, u′′_(0,_1,_2,_3,_4,_{∆)) =}

=δ(u′′_(0,_1,_2,_3,_4,_∆),₂m_),

d·occ(0, u(4)(0,1,2,3,4,∆))+

+d·occ(1, u(4)(0,1,2,3,4,∆))+

+2d·occ(2, u(4)(0,1,2,3,4,∆)) =

=δ(u(4)_(0,_1,_2,_3,_4,_∆),₃m_),

d·occ(0, u(6)(0,1,2,3,4,∆))+

+d·occ(1, u(6)(0,1,2,3,4,∆))+

+2d·occ(3, u(6)(0,1,2,3,4,∆)) =

=δ(u(6)(0,1,2,3,4,∆),2m_),

d·occ(0, u(8)(0,1,2,3,4,∆))+

+d·occ(1, u(8)_(0,_1,_2,_3,_4,_∆))+

+2d·occ(3, u(8)_(0,_1,_2,_3,_4,_{∆)) =}

=δ(u(8)(0,1,2,3,4,∆),2m_),

d·occ(0, u(10)(0,1,2,3,4,∆))+

+d·occ(1, u(10)(0,1,2,3,4,∆))+

+2d·occ(2, u(10)(0,1,2,3,4,∆)) =

=δ(u(10)(0,1,2,3,4,∆),3m),

d·occ(0, u(12)_(0,_1,_2,_3,_4,_∆))+

+d·occ(1, u(12)(0,1,2,3,4,∆))+

+2d·occ(2, u(12)(0,1,2,3,4,∆)) =

=δ(u(12)(0,1,2,3,4,∆),3m_),

and

P6

i=1occ(2, u(2i)(0,1,2,3,4,∆)) +

+occ(3, u(2i)_(0,_1,_2,_3,_4,_∆))_≥_5m. ₍₁₂₎

It is not hard to check that

p2=T sp−6T2sp−5T2. . . s2p−8T2s2p−7T.

Since δ(a, b) =d,a ∈ {0,1}, b ∈ {2,3}, in view of (12), δ(u, p2)≥5t. Therefore, p <7. Similarly, it is not hard

to check thatp <3.

Suppose thatp= 2. It is easy to see that in this case

p1=W1(∆)T(2m)∆T W2(∆)T(3m)∆T W3(∆).

Therefore,

δ(u′_(0,_1,_2,_3,_4,_{∆), W}

1(∆)) +

+δ(u′′_(0,_1,_2,_3,_4,_∆),₍₂m₎

∆) +

+δ(u′′′_(0,_1,_2,_3,_4,_{∆), W}₂_{(∆)) +}

+δ(u(4)_(0,_1,_2,_3,_4,_∆),₍₃m₎

∆) +

+δ(u(5)_(0,_1,_2,_3,_4,_{∆), W}

3(∆))≤t (13)

and

δ(u′_(0,_1,_2,_3,_4,_{∆), W}

1(∆))≤t

δ(u′′_(0,_1,_2,_3,_4,_∆),₍₂m₎

∆)≤t

δ(u′′′_(0,_1,_2,_3,_4,_{∆), W}₂_(∆))_≤_t

δ(u(4)_(0,_1,_2,_3,_4,_∆),₍₃m₎

∆)≤t

δ(u(5)_(0,_1,_2,_3,_4,_{∆), W}

3(∆))≤t (14)

Since Wj(∆) ∈ {∆}∗, 1 ≤ j ≤ p+ 1, in view of (14),

without loss of generality, we can assume that Wj(∆),

u(2j₋1)_(0,_1,_2,_3,_4,_{∆), 1}_≤ _j _≤_p_{+ 1, are empty words.}

In view of (14), it is easy to check that if (2m₎

∆) =

Y1Y2. . . Yr and u′′(0,1,2,3,4,∆) = Z1Z2. . . Zr, where

Y1, Y2, . . . , Yr ∈ {2,∆}, Z1, Z2, . . . , Zr∈ {0,1,2,3,4,∆},

then

Yi= ∆⇔Zi= ∆,1≤i≤r.

Therefore, without loss of generality, we can assume that (2m₎

∆) ∈ {2}∗ and u′′(0,1,2,3,4,∆) ∈ {0,1,2,3,4}∗.

Since (2m₎

∆)∈ {2}∗, it is easy to check that (2m)∆) =

2m_{. Similarly, we can assume that} _u2j_(0,_1,_2,_3,_4,_∆) _∈

{0,1,2,3,4}∗_{, (3}m₎

∆) = 3m, where 1< j≤p.

Since

u′′_(0,_1,_2,_3,_4,_{∆), u}(4)_(0,_1,_2,_3,_4,_∆)_{∈ {}_0,_1,_2,_3,₄_}∗

(6)

and (2m₎

∆) = 2m, (3m)∆) = 3m, in view of (13), it is

easy to see that

d·occ(0, u′′_(0,_1,_2,_3,_4,_∆))+

+d·occ(1, u′′_(0,_1,_2,_3,_4,_∆))+

+2d·occ(3, u′′_(0,_1,_2,_3,_4,_{∆)) =}

=δ(u′′_(0,_1,_2,_3,_4,_∆),₂m_),

d·occ(0, u(4)(0,1,2,3,4,∆))+

+d·occ(1, u(4)(0,1,2,3,4,∆))+

+2d·occ(2, u(4)(0,1,2,3,4,∆)) =

=δ(u(4)_(0,_1,_2,_3,_4,_∆),₃m_),

and

2

X

i=1

occ(2, u(2i)_(0,_1,_2,_3,_4,_∆))+

+occ(3, u(2i)_(0,_1,_2,_3,_4,_∆))_≥_m.

p2=T2mT22mT, p3=T3mT23mT.

Therefore,

occ(2, u(2)(0,1,2,3,4,∆))+

+occ(2, u(4)(0,1,2,3,4,∆))≥m,

occ(3, u(2)(0,1,2,3,4,∆))+

+occ(3, u(4)_(0,_1,_2,_3,_4,_∆))_≥_m,

and

P2

i=1occ(2, u(2

i)_(0,_1,_2,_3,_4,_{∆)) +}

+occ(3, u(2i)_(0,_1,_2,_3,_4,_∆))_≥_2m. ₍₁₅₎

p5=T s2T2s3T.

Since δ(a, b) =d,a ∈ {0,1}, b ∈ {2,3}, in view of (15), δ(u, p5)≥2t. Therefore,p= 1.

Since p= 1, it is easy to check that u = T u′_T_{, where}

u′ _{is a string in alphabet} _{_0,_1,_2,_3,₄_}_{, and} _p₁ ₌_T2m_T_,

p2=T3mT,p7=T4mT. Clearly,

δ(u, p1)≤t, δ(u, p2)≤t, δ(u, p7)≤t.

Therefore,

d·occ(0, u) +d·occ(1, u) + +2d·occ(3, u) + 2d·occ(4, u)≤t,

d·occ(0, u) +d·occ(1, u) + +2d·occ(2, u) + 2d·occ(4, u)≤t,

d·occ(0, u) +d·occ(1, u) +

+2d·occ(2, u) + 2d·occ(3, u)≤t. (16)

In view of (16),

occ(0, u) +occ(1, u) +

+2occ(3, u) + 2occ(4, u)≤m,

occ(0, u) +occ(1, u) +

+2occ(2, u) + 2occ(4, u)≤m,

occ(0, u) +occ(1, u) +

+2occ(2, u) + 2occ(3, u)≤m. (17)

In view of (17),

3occ(0, u) + 3occ(1, u) + 4occ(2, u)+

+4occ(3, u) + 4occ(4, u)≤3m.

Since u = T u′_T_{, where} _u′ _{is a string in alphabet}

{0,1,2,3,4},

occ(0, u) +occ(1, u) +occ(2, u)+

+occ(3, u) +occ(4, u) =m.

Therefore,

occ(2, u) +occ(3, u) +occ(4, u) = 0,

andu=T u′_T_{, where}_u′ _{is a string in alphabet}_{_0,₁_}_.

Since u=T u′_T_{, where}_u′ _{is a string in alphabet} _{_0,₁_}_,

for every stringsi ∈S,δ(u′, si)≤t. In view ofδ(0,1) =

δ(1,0) =m,D(u′_{, s}_i₎_≤_{d. Therefore,}_s₌_u′_.

Now suppose that there is a string s of length m such

that for every string si ∈S, D(s, si)≤ d. It is easy to

see thatu=T sT ist-approximate period ofx.

(7)

References

[1] Blackburn, S., DeRoure, D., “A tool for

content-based navigation of music,” Proceedings of ACM

International Multimedia Conference (ACMMM), Bristol, UK, pp. 361-368 9/98.

[2] Dillon, M., Hunter, M., “Automated identification

of melodic variants in folk music,” Computers and

the Humanities, V16, N2 pp. 107-117 6/82.

[3] Ghias, A., Logan, J., Chamberlin, D., Smith, B., “Query by humming-musical information retrieval

in an audio database,”Proceedings of ACM

Interna-tional Multimedia Conference (ACMMM), San Fran-cisco, CA, pp. 231-236 11/95.

[4] Kornst¨adt, A., “Themefinder: A web-based melodic

search tool,” Melodic Comparison: Concepts,

Pro-cesure, and Applications, Computing in Musicology, V11, pp. 231-236 98.

[5] Lemstr¨om, K., Laine, P., Perttu, S., “Using relative

interval slope in music information retrieval,”

Pro-ceedings of the International Computer Music Con-ference (ICMC), Beijing, China, pp. 317-320 9/99.

[6] Lindsay, A.T.,Using contour as mid-level

represen-tation of melody, Master’s thesis, MIT Media Lab, 1996.

[7] Selfridge-Field, E., “Conceptual and

representa-tional issues in melodie comparison,”Melodic

Com-parison: Concepts, Procesure, and Applications, Computing in Musicology, V11, pp. 3-64 98.

[8] Hsu, J.-L., Liu, C.-C., Chen, A.L.P., “Efficient

re-peating pattern finding in music databases,”

Pro-ceedings of the 1998 ACM CIKM International Con-ference on Information and Knowledge

Manage-ment, Bethesda, Maryland, USA, pp. 281-288 11/

98.

[9] Liu, C.-C., Hsu, J.-L., Chen, A.L.P., “Efficient theme and non-trivial repeating pattern discovering

in music databases,”Proceedings 15th International

Conference on Data Engineering, Sydney, Australia, IEEE Computer Society, pp. 14-21 3/99.

[10] Tseng, Y.-H., “Content-based retrieval for music

col-lections,” Proceedings of the 22nd annual

interna-tional ACM SIGIR conference on Research and de-velopment in information retrieval (SIGIR), Berke-ley, CA, pp. 176-182 8/99.

[11] Moyzis, R.K., “The human telomere,” Scientific

American, pp. 48-55 8/91.

[12] Pevzner, P.A., “Multiple alignment, communication

cost, and graph matching,” SIAM Journal on

Ap-plied Mathematics, V52, N6, pp. 1763-1779 12/92.

[13] Ukkonen, E., “Algorithms for approximate string

matching”Information and Control, V64, N1-3 pp.

100-118 1-3/85.

[14] Sim, J.S., Iliopoulos, C.S., Park, K., Smyth, W.F.,

“Approximate periods of strings,”Theoretical

Com-puter science, V262, N1-2, pp. 557-568 7/01.

[15] Popov, V.Yu., “The approximate period problem

for DNA alphabet,” Theoretical Computer Science,

V304, N1-3, pp. 443-447 7/03.

[16] Frances, M., Litman, A., “On covering problems of

codes,”Theory of Computer Systems, V30, N2, pp.

113-119 3/97.

[17] Lanctot, J.K., Li, M., Ma, B., Wang, S., Zhang, L.,

“Distinguishing string selection problems,”

Proceed-ings of 10th ACM-SIAM Symposium on Discrete Al-gorithms, pp. 633-642 1/99.