Analysis of patterns and minimal embeddings of non-Markovian sequences


(1)

Analysis of patterns and minimal embeddings of non-Markovian sequences

Manuel.Lladser@Colorado.EDU
Department of Applied Mathematics
University of Colorado Boulder

AofA, April 13, 2008

(2)

NOTATION & TERMINOLOGY.

A is a finite alphabet

A∗ is the set of all words of finite length

A language is a set L ⊂ A∗

X = (Xn)n≥1 is a sequence of A-valued random variables; X may be non-Markovian

X1 · · · Xl models a random word of length l

(3)

PARADIGM.

For various probabilistic models of X and languages L, the frequency statistics of L are asymptotically normal, where

S_n^L := number of prefixes of X1 · · · Xn that belong to the language L

The paradigm applies for:

generalized patterns, i.i.d. models [BenKoch93]

simple patterns, stationary Markovian models [RegSzp98]

primitive patterns, k-order Markovian models [NicSalFla02, Nic03]

primitive patterns, nice dynamical sources [BouVal02, BouVal06]

hidden patterns, i.i.d. models [FlaSpaVal06]
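As a sanity check on the definition of S_n^L, here is a minimal sketch (the function name and the example language are my own illustrative choices, not from the references above) that counts qualifying prefixes directly:

```python
# Sketch: S_n^L counts the non-empty prefixes of X1...Xn that belong to L.
# L is given here as a plain membership predicate; the language
# "words ending in 'ba'" is just an illustrative choice.

def S_n(word, in_L):
    """Number of non-empty prefixes of `word` satisfying the predicate `in_L`."""
    return sum(1 for k in range(1, len(word) + 1) if in_L(word[:k]))

ends_in_ba = lambda w: w.endswith("ba")

print(S_n("abbab", ends_in_ba))  # prefixes a, ab, abb, abba, abbab -> 1
```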

(4)

THE MARKOV CHAIN EMBEDDING TECHNIQUE.

IF X is a homogeneous Markov chain,

IF L is a regular language,

IF G = (V, A, f, q, T) is a DFA that recognizes L,

IF the embedding of X into G, i.e. the stochastic process X_n^G := f(q, X1 · · · Xn), is a first-order homogeneous Markov chain, THEN

S_n^L = number of visits the embedded process X^G makes to T in the first n steps
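The THEN-clause can be exercised mechanically: run the DFA on the word and count visits to T. A minimal sketch with a hypothetical two-state DFA recognizing {a, b}∗{a} (words ending in a); all names are illustrative:

```python
# Sketch of the embedding identity: the number of visits that
# X^G_n := f(q, X1...Xn) makes to T equals S_n^L.
# The DFA below (a hypothetical example) recognizes {a,b}*{a}.

DELTA = {(0, "a"): 1, (0, "b"): 0, (1, "a"): 1, (1, "b"): 0}
START, ACCEPT = 0, {1}

def visits_to_T(word):
    state, visits = START, 0
    for ch in word:                  # advance the embedded process X^G
        state = DELTA[(state, ch)]
        visits += state in ACCEPT    # one visit per accepting prefix
    return visits

word = "abaab"
# Cross-check against the direct prefix count S_n^L:
assert visits_to_T(word) == sum(word[:k].endswith("a") for k in range(1, len(word) + 1))
print(visits_to_T(word))  # -> 3
```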

(5)

EXAMPLE.

Consider a 1st-order Markov chain X such that

P[X1 = a] = µ; P[X1 = b] = (1 − µ);

P[Xn+1 = a | Xn = a] = p; P[Xn+1 = b | Xn = a] = (1 − p);

P[Xn+1 = a | Xn = b] = q; P[Xn+1 = b | Xn = b] = (1 − q).

Then the embedding of X into the Aho-Corasick automaton

Figure. The Aho-Corasick automaton with states ǫ, a, b, ab, ba, abb, abba and edges labeled by a and b.

that recognizes matches with the regular expression {a, b}∗{ba, abba}, i.e. all words of the form x = ...ba or x = ...abba, is a 1st-order Markov chain.
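Under the stated model this embedding can be simulated directly. A sketch (the transition table was worked out by hand from the longest-suffix rule of the Aho-Corasick construction, and the parameter values are arbitrary, so treat both as illustrations):

```python
import random

# Sketch: embed the 2-state Markov source (P[a|a] = p, P[a|b] = q) into the
# Aho-Corasick automaton of {ba, abba}. States are the pattern prefixes;
# delta(s, c) = longest suffix of s+c that is again a state.
DELTA = {
    ("e", "a"): "a",      ("e", "b"): "b",
    ("a", "a"): "a",      ("a", "b"): "ab",
    ("b", "a"): "ba",     ("b", "b"): "b",
    ("ab", "a"): "ba",    ("ab", "b"): "abb",
    ("ba", "a"): "a",     ("ba", "b"): "ab",
    ("abb", "a"): "abba", ("abb", "b"): "b",
    ("abba", "a"): "a",   ("abba", "b"): "ab",
}
T = {"ba", "abba"}  # states where a match with {a,b}*{ba, abba} ends

def sample_word(n, mu=0.3, p=0.6, q=0.2, seed=0):
    rng = random.Random(seed)
    w = ["a" if rng.random() < mu else "b"]
    for _ in range(n - 1):
        stay = p if w[-1] == "a" else q       # P[next = a | current letter]
        w.append("a" if rng.random() < stay else "b")
    return "".join(w)

word = sample_word(1000)
state, visits = "e", 0
for ch in word:
    state = DELTA[(state, ch)]
    visits += state in T
# Direct count: prefixes ending in 'ba' or 'abba' (the latter also ends in 'ba').
direct = sum(word[:k].endswith("ba") for k in range(1, len(word) + 1))
assert visits == direct
```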

(6)

Figure. Transition probabilities of the embedded chain on the automaton states ǫ, a, b, ab, ba, abb, abba: µ, (1 − µ), p, (1 − p), q, (1 − q).

(7)

What about a completely general sequence X?

(8)

EXAMPLE. A seemingly unbiased coin.

Let 0 < p < 1/2

Consider the random binary sequence X = (Xn)n≥1 such that

Xn+1 =d
  Bernoulli(p),     if (1/n) Σ_{i=1}^{n} Xi > 1/2;
  Bernoulli(1/2),   if (1/n) Σ_{i=1}^{n} Xi = 1/2;
  Bernoulli(1 − p), if (1/n) Σ_{i=1}^{n} Xi < 1/2.

Question. Is there a Markovian structure into which X can be embedded so as to analyze the asymptotic distribution of the frequency statistics of a given language?
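A quick simulation shows why the coin looks fair. This is a sketch: the parameter values are arbitrary, and since the slides leave X1 unspecified, the "mean equals 1/2" case is taken to cover the first toss as fair:

```python
import random

# Sketch of the "seemingly unbiased" coin: the next bit is Bernoulli(p),
# Bernoulli(1/2), or Bernoulli(1 - p) according to whether the running
# mean exceeds, equals, or falls below 1/2.
def simulate(n, p=0.1, seed=1):
    rng = random.Random(seed)
    x, ones = [], 0
    for k in range(n):                # k bits drawn so far
        if 2 * ones == k:             # running mean == 1/2 (incl. k == 0)
            prob = 0.5
        elif 2 * ones > k:            # running mean > 1/2: bias downward
            prob = p
        else:                         # running mean < 1/2: bias upward
            prob = 1 - p
        bit = 1 if rng.random() < prob else 0
        x.append(bit)
        ones += bit
    return x

x = simulate(10_000)
print(sum(x) / len(x))  # hugs 1/2 even though no single toss is fair
```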

(9)

GENERAL SETTING.

Given

a possibly non-Markovian sequence X

a possibly non-regular language L

a transformation R : A∗ → S

define X^R to be the stochastic process

X_n^R := R(X1 · · · Xn)

Question 1. What conditions are necessary and sufficient for X^R to be Markovian?

Question 2. Given a pattern L, is there a transformation R such that X^R is Markovian but also informative of the distribution of the frequency statistics of L?

(10)

REMARK.

The Markovianity or non-Markovianity of

X_n^R := R(X1 · · · Xn), n ≥ 1,

does not really depend on the range of R.

This motivates thinking of R : A∗ → S as an equivalence relation over A∗:

u R v ⇐⇒ R(u) = R(v)

• R(u) is the unique equivalence class of R that contains u

• c ∈ R means that c is an equivalence class of R

(11)

DEFINITION. X is embeddable w.r.t. R provided that, for all u, v ∈ A∗ and c ∈ R, if u R v then

Σ_{α ∈ A : R(uα) = c} P[X = uα... | X = u...] = Σ_{α ∈ A : R(vα) = c} P[X = vα... | X = v...]

Figure. Schematic partition of {0,1,2}∗ into equivalence classes: equivalent words u, v and their one-letter extensions u0, u1, u2, v0, v1, v2, with example transition weights .4, .3, .7.

(16)

THEOREM A. X is embeddable w.r.t. R if and only if, for every x ∈ A∗, conditioned on X = x... the stochastic process

X_n^R := R(X1 · · · Xn), n ≥ |x|,

is a first-order homogeneous Markov chain with transition probabilities that do not depend on x.

THEOREM B. For each equivalence relation R on A∗, there exists a unique coarsest refinement R0 of R w.r.t. which X is embeddable.

(17)

APPLICATION/QUESTION. What is the smallest state-space for studying the frequency statistics of a language L in X?

X = a b b a b . . . (original sequence)

X^R = 1 0 0 1 0 . . . (non-Markovian encoding)

X^{R0} = 0 4 6 3 4 . . . (optimal Markovian encoding)

X^Q = 6 3 18 15 10 . . . (any other Markovian encoding)

Figure. Partition R = {L, A∗ \ L} (illustrated with the words a, ab, abb, abba, abbab) s.t. X^R is non-Markovian.

(18)

Figure. Coarsest refinement R0 of R (equivalence classes labeled (0)-(6)) w.r.t. which X is embeddable; X^{R0} is the optimal Markovian encoding.

Figure. An arbitrary finer refinement Q of R (equivalence classes labeled (0)-(18)) w.r.t. which X is embeddable; X^Q is another Markovian encoding.

(20)

REMARK. The optimal refinement R0 of R w.r.t. which X is embeddable is obtained through a limiting process; this makes it almost impossible to characterize the equivalence classes of R0. Motivated by this, we next introduce an embedding which, while not optimal, is analytically tractable (!)

(21)

DEFINITION. The Markov relation induced by X on A∗ is the equivalence relation RX defined as

u RX v ⇐⇒ (∀w ∈ A∗) : P[X = uw... | X = u...] = P[X = vw... | X = v...]

(22)

Figure. Weighted-tree visualization of the definition with A = {0,1}: nodes ε, 0, 1, 00, 01, 10, 11, edge weights .8, .2, .4, .6, .5, .5, and highlighted words u, v.

(23)

ε 0

1 00 01

10 11 .8

.2 .4 .6

.5 .5

u

v

An equivalence relation R is said to be right-invariant if, for all u, v ∈ A∗ and α ∈ A:

R(u) = R(v) =⇒ R(uα) = R(vα)

THEOREM C. X is embeddable w.r.t. any right-invariant equivalence relation that is a refinement of RX; in particular, X is embeddable w.r.t. RX.
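Right-invariance is finitely checkable on bounded-length words. A sketch (the parity-of-ones map is my illustrative choice of relation, not one from the talk):

```python
from itertools import product

# Sketch: exhaustively check right-invariance,
#   R(u) = R(v)  =>  R(u a) = R(v a),
# over all pairs of words up to a fixed length.
A = "01"

def right_invariant(R, max_len):
    words = ["".join(t) for n in range(max_len + 1) for t in product(A, repeat=n)]
    return all(
        R(u + a) == R(v + a)
        for u in words for v in words if R(u) == R(v)
        for a in A
    )

parity = lambda w: w.count("1") % 2     # parity of the number of 1s

print(right_invariant(parity, 5))  # -> True: parity is right-invariant
```

A map that is not right-invariant fails the check: for R(w) = [w contains "11"], the words u = "1" and v = "0" are equivalent but u1 and v1 are not.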

(24)

EXAMPLE. Back to the seemingly unbiased coin.

For 0 < p < 1/2, define

Xn+1 =d
  Bernoulli(p),     if (1/n) Σ_{i=1}^{n} Xi > 1/2;
  Bernoulli(1/2),   if (1/n) Σ_{i=1}^{n} Xi = 1/2;
  Bernoulli(1 − p), if (1/n) Σ_{i=1}^{n} Xi < 1/2.

We aim to understand the frequency statistics of

L1 = {0,1}∗{1},

L2 = {0}∗{1}{0}∗({1}{0}∗{1}{0}∗)∗

within X.

(25)

PROPOSITION. R : {0,1}∗ → Z defined as

R(x) := 2 { Σ_{i=1}^{|x|} xi − |x|/2 } = Σ_{i=1}^{|x|} xi − Σ_{i=1}^{|x|} (1 − xi)

is a right-invariant refinement of RX. In particular, X_n^R := R(X1 · · · Xn) is a first-order homogeneous Markov chain.

Figure. X^R is a random walk on Z: from a state n > 0 it steps +1 with probability p and −1 with probability (1 − p); from n < 0 it steps +1 with probability (1 − p) and −1 with probability p; from 0 it steps ±1 with probability 1/2 each.

X^R is recurrent, with period 2. Because 0 < p < 1/2, X^R is positive recurrent; in particular, there exists a stationary distribution π.

SnL1 =

n

X

i=1

Xi
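The two expressions for R agree, R(x) has the parity of |x|, and appending a letter moves R by exactly ±1, which is what makes X^R a walk on Z. A sketch checking these elementary identities:

```python
import random

# Sketch: R(x) = 2*(sum(x) - |x|/2) = (#1s in x) - (#0s in x).
def R(x):
    return sum(x) - (len(x) - sum(x))

rng = random.Random(2)
for _ in range(100):
    x = [rng.randint(0, 1) for _ in range(rng.randint(0, 30))]
    assert R(x) == 2 * (sum(x) - len(x) / 2)   # the two expressions agree
    assert (R(x) - len(x)) % 2 == 0            # R(x) has the parity of |x|
    for b in (0, 1):
        assert abs(R(x + [b]) - R(x)) == 1     # each letter is a +/-1 step
print("identities hold")
```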

(26)

S_n^{L1} = Σ_{i=1}^{n} Xi

COROLLARY A. If U and V are Z-valued random variables such that

P[U = n] = 2 · π(n), for n ≡ 0 (mod 2);

P[V = n] = 2 · π(n), for n ≡ 1 (mod 2);

then for L1 := {0,1}∗{1} it holds that

lim_{n→∞, n ≡ 0 (mod 2)} 2n · { S_n^{L1}/n − 1/2 } =d U;

lim_{n→∞, n ≡ 1 (mod 2)} 2n · { S_n^{L1}/n − 1/2 } =d V.
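The normalization in Corollary A is transparent: 2n(S_n^{L1}/n − 1/2) = 2 S_n − n, which is exactly the walk position R(X1 · · · Xn), and its parity is locked to that of n; hence the two subsequential limits U and V. A sketch of this identity:

```python
# Sketch: 2n*(S_n/n - 1/2) equals the walk position 2*S_n - n, and that
# quantity has the same parity as n (the source of the phases U and V).
def check(bits):
    s = 0
    for n, b in enumerate(bits, start=1):
        s += b
        normalized = 2 * n * (s / n - 0.5)
        assert abs(normalized - (2 * s - n)) < 1e-9   # same quantity
        assert (2 * s - n) % 2 == n % 2               # parity locked to n
    return True

print(check([1, 0, 0, 1, 1, 0, 1, 1, 0, 0]))  # -> True
```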

(27)

L2 is recognized by the two-state automaton with states A and B in which each letter 1 toggles the state between A and B and each letter 0 is a self-loop; A is the start state and B is accepting.

By the Myhill-Nerode theorem, Q : {0,1}∗ → {A, B} defined as

Q(x) := the state in the automaton where the path associated with x ends when starting at A

is right-invariant.

Hence R × Q is also right-invariant and a refinement of RX. In particular, X_n^{R×Q} := (X_n^R, X_n^Q) is a first-order homogeneous Markov chain.
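A sketch of the product encoding: the Q-component runs the parity automaton for L2 (A for an even number of 1s, B for odd, matching the two-state machine above), the R-component runs the walk, and S_n^{L2} counts the prefixes whose Q-component is the accepting state B. The regular-expression cross-check via Python's re module is only an illustration:

```python
import re
import random

# Sketch: the product chain (R, Q). Q toggles between A and B on each 1;
# S_n^{L2} = number of prefixes landing in the accepting state B.
L2 = re.compile(r"0*10*(10*10*)*")       # words with an odd number of 1s

def S_L2(bits):
    r, q, visits = 0, "A", 0
    for b in bits:
        r += 1 if b else -1                         # R-component: walk on Z
        q = ({"A": "B", "B": "A"}[q]) if b else q   # Q-component: parity
        visits += q == "B"
    return visits

rng = random.Random(3)
bits = [rng.randint(0, 1) for _ in range(200)]
word = "".join(map(str, bits))
direct = sum(bool(L2.fullmatch(word[:k])) for k in range(1, len(word) + 1))
assert S_L2(bits) == direct
print(S_L2(bits))
```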

(28)

Figure. Transition diagram of X^{R×Q} on Z × {A, B} (first coordinates −4, . . . , 4 shown), with transition probabilities p, (1 − p), and 1/2.

X^{R×Q} is positive recurrent, with period 4. Return times to a fixed state have finite second moment. This allows us to use the central limit theorem for additive functionals of Markov chains to obtain the following result.

COROLLARY B. There exists σ > 0 such that

lim_{n→∞} √n · { S_n^{L2}/n − 1/2 } =d σ · W,

where W is a standard normal random variable.

(29)

CONCLUSION. For the same non-Markovian sequence X, non-Gaussian (discrete, with phases) and Gaussian limits are obtained for the frequency statistics of different regular languages.

(30)

(More details in the 2008 ANALCO proceedings.)

... Thank you (!)
