Stochastic chains with memory of variable length
Antonio Galves
Universidade de São Paulo
2009, Année de la France au Brésil
Stochastic modeling and linguistic rhythm retrieval from written texts
1. A discussion about stochastic modeling
2. The model is a stochastic chain with memory of variable length
3. A linguistic case study
4. Joint work with Charlotte Galves, Nancy Garcia and Florencia Leonardi.
Linguistic motivation
- A long-standing conjecture says that Brazilian Portuguese (BP) and European Portuguese (EP) implement different rhythms.
- But there is no satisfactory formal notion of linguistic rhythm.
- This is a challenging and important problem in linguistics.
- Even more difficult: we want to retrieve rhythmic patterns looking only at written texts of BP and EP!
A few facts about BP and EP
- BP and EP share the same lexicon.
- Even if they have different syntaxes, BP and EP produce a great number of superficially identical sentences.
- We are looking for a needle in a haystack!
How to retrieve evidence that BP and EP have different rhythms?
- Recipe:
  - Get samples of BP and EP rhythmic sequences.
  - Find a good class of models for these samples.
  - See if the models which best fit the BP and EP samples coincide.
Getting samples of BP and EP rhythmic sequences
- The data we analyzed is an encoded corpus of newspaper articles.
- This corpus contains all the 365 editions of the years 1994 and 1995 from the daily newspapers Folha de São Paulo (Brazil) and O Público (Portugal).
Encoding hypothetical rhythmic features
We encode the words by assigning one of four symbols to each syllable according to whether
(i) it is stressed or not;
(ii) it is the beginning of a prosodic word or not.
By prosodic word we mean a lexical word together with the non-stressed functional words which precede it.
A five-symbol alphabet
This double 0-1 classification can be represented by the four-symbol alphabet {0,1,2,3}, where
- 0 = non-stressed, non prosodic-word-initial syllable;
- 1 = stressed, non prosodic-word-initial syllable;
- 2 = non-stressed, prosodic-word-initial syllable;
- 3 = stressed, prosodic-word-initial syllable.
Additionally, we assign an extra symbol (4) to encode the end of each sentence. We call A = {0,1,2,3,4} the alphabet obtained in this way.
An example
Example: “O menino já comeu o doce” (The boy already ate the candy)
Sentence:  O   me  ni  no  já  co  meu  o   do  ce  .
Code:      2   0   1   0   3   2   1    2   1   0   4
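To make the encoding concrete, here is a minimal Python sketch that reproduces the codes of the example above. The syllabification, stress marks and prosodic-word boundaries are assumed to be given as input (hard-coded here for the example sentence); the function name `encode` and the data layout are ours, not part of the original study.

```python
# A minimal sketch of the syllable encoding, assuming syllabification,
# stress and prosodic-word boundaries are already annotated.
# Symbols: 0 = unstressed, non-initial; 1 = stressed, non-initial;
#          2 = unstressed, prosodic-word initial; 3 = stressed, initial;
#          4 = end of sentence.

def encode(syllables):
    """syllables: list of (syllable, stressed, word_initial) triples."""
    codes = []
    for _, stressed, initial in syllables:
        codes.append(2 * int(initial) + int(stressed))
    codes.append(4)  # sentence-final marker
    return codes

# "O menino já comeu o doce" with hand-annotated stress and boundaries.
sentence = [
    ("O",  False, True),  ("me", False, False), ("ni", True, False),
    ("no", False, False), ("já", True,  True),  ("co", False, True),
    ("meu", True, False), ("o",  False, True),  ("do", True, False),
    ("ce", False, False),
]

print(encode(sentence))  # [2, 0, 1, 0, 3, 2, 1, 2, 1, 0, 4]
```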
Modeling samples of symbolic sequences
- The encoding described above produced sequences taking values in the alphabet A.
- At first sight we can't see any kind of regular (deterministic) behavior in these sequences.
- Apparently the same subsequences may appear in BP and EP texts.
- What can be a model for these sequences?
- Answer: use a probability measure on the set of infinite sequences of symbols in the alphabet A.
Chains with memory of variable length
- Introduced by Rissanen (1983) as a universal system for data compression.
- He called this model a finitely generated source or a tree machine.
- Statisticians call it a variable length Markov chain (Bühlmann and Wyner 1999).
- It is also called a prediction suffix tree in bioinformatics (Bejerano and Yona 2001).
Heuristics
- When we have a symbolic chain describing
  - a syntactic structure,
  - a prosodic contour,
  - a DNA sequence,
  - a protein, ...
- it is natural to assume that each symbol depends only on a finite suffix of the past,
- whose length depends on the past.
Warning!
- We are not making the usual Markovian assumption:
- at each step we are under the influence of a suffix of the past whose length depends on the past itself.
- Even if it is finite, in general the length of the relevant part of the past is not bounded above!
- This means that in general these are chains of infinite order, not Markov chains.
Contexts
- Rissanen called the relevant suffixes of the past contexts.
- The set of all contexts should have the suffix property: no context is a proper suffix of another context.
- This means that we can identify the end of each context without knowing what happened earlier in the past.
- The suffix property implies that the set of all contexts can be represented as a rooted tree with finite branches.
Chains with variable length memory
It is a stationary stochastic chain (X_n) taking values in a finite alphabet A and characterized by two elements:
- The tree of all contexts.
- A family of transition probabilities associated to each context.
- Given a context, its associated transition probability gives the distribution of the next symbol occurring immediately after the context.
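A minimal sketch of how such a pair (context tree, family of transition probabilities) can be represented and simulated. The class name, methods and conventions below are ours, for illustration only; contexts are stored as tuples written in time order (oldest symbol first).

```python
import random

class ProbabilisticContextTree:
    """A context tree together with its transition probabilities.
    Contexts are tuples written in time order (oldest symbol first)."""

    def __init__(self, transition):
        # transition: dict mapping each context (a tuple) to a dict
        # {symbol: probability of that symbol right after the context}.
        self.transition = transition

    def context_of(self, past):
        """Return the context that is a suffix of `past`.
        The suffix property guarantees there is at most one such context."""
        for length in range(1, len(past) + 1):
            suffix = tuple(past[-length:])
            if suffix in self.transition:
                return suffix
        raise ValueError("no context is a suffix of the given past")

    def next_symbol(self, past):
        """Sample the next symbol given the past."""
        probs = self.transition[self.context_of(past)]
        symbols, weights = zip(*probs.items())
        return random.choices(symbols, weights=weights)[0]

    def simulate(self, n, initial_past):
        """Generate n symbols, starting from a past long enough to contain a context."""
        seq = list(initial_past)
        for _ in range(n):
            seq.append(self.next_symbol(seq))
        return seq[len(initial_past):]
```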
Chains with variable length memory
It is a stationary stochastic chain(Xn) taking values on a finite alphabetAand characterized by two elements:
I The tree of all contexts.
I A family of transition probabilities associated to each context.
I Given a context, its associated transition probability gives the distribution of occurrence of the next symbol immediately after the context.
Chains with variable length memory
It is a stationary stochastic chain(Xn) taking values on a finite alphabetAand characterized by two elements:
I The tree of all contexts.
I A family of transition probabilities associated to each context.
I Given a context, its associated transition probability gives the distribution of occurrence of the next symbol immediately after the context.
Stochastic chains with variable length memory
For example: if (X_t) is a Markov chain of order 2 on the alphabet {0,1}, then

τ = {00, 01, 10, 11}.

This set can be identified with the complete binary tree of depth 2:

        root
       /    \
      0      1
     / \    / \
    0   1  0   1
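With the sketch above, this order-2 Markov chain is just the special case in which every context has length 2; the transition probabilities below are arbitrary illustrative values, not taken from the slides.

```python
# The order-2 Markov chain of the example: every context has length 2.
order2 = ProbabilisticContextTree({
    (0, 0): {0: 0.5, 1: 0.5},
    (0, 1): {0: 0.3, 1: 0.7},
    (1, 0): {0: 0.8, 1: 0.2},
    (1, 1): {0: 0.6, 1: 0.4},
})
sample = order2.simulate(20, initial_past=[0, 1])
```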
Example: the renewal process on Z
A = {0, 1}

τ = {1, 10, 100, 1000, ...}

p(1 | 0^k 1) = q_k, where 0 < q_k < 1 for any k ≥ 0, and

    Σ_{k ≥ 0} q_k = +∞.
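The renewal chain fits the same scheme, except that the context tree is unbounded, so instead of listing the contexts the relevant suffix can be computed directly as the number of zeros since the last 1. The choice q_k = 1/(k+2) below is an arbitrary illustrative sequence satisfying 0 < q_k < 1 with a divergent sum; it is not prescribed by the slides.

```python
import random

def q(k):
    # Any choice with 0 < q_k < 1 and a divergent sum fits the slide;
    # q_k = 1 / (k + 2) is an arbitrary illustrative example.
    return 1.0 / (k + 2)

def next_symbol(past):
    """The context of the past is a 1 followed by k zeros (k zeros since
    the last 1); the next symbol is 1 with probability q_k, otherwise 0."""
    k = 0
    for x in reversed(past):
        if x == 1:
            break
        k += 1
    return 1 if random.random() < q(k) else 0

# Simulate a stretch of the renewal chain, starting right after a renewal.
seq = [1]
for _ in range(50):
    seq.append(next_symbol(seq))
```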
A mathematical question
- Given a probabilistic context tree (τ, p), does there exist at least one (at most one) stationary chain (X_n) compatible with it?
- First answer: verify whether the infinite order transition probabilities defined by (τ, p) satisfy the sufficient conditions which assure the existence and uniqueness of a chain of infinite order.
- But this is a bad answer: what we really want to know is whether there exists a stochastic process whose contexts are almost surely finite.
- Recently A. Gallo, in his PhD dissertation, gave sufficient conditions for this.
Back to our case study
- How do we assign probabilistic context trees to the samples of BP and EP encoded texts?
- Obvious answer: for each sample, choose the one which maximizes the probability of the sample!
- Bad answer: this is just too naive...
- A bigger model will always give a bigger probability to the sample!
A basic statistical question
Given a sample, is it possible to estimate the smallest probabilistic context tree generating it?
In the case of finite context trees, Rissanen (1983) introduced the algorithm Context to estimate the probabilistic context tree from a sample in a consistent way.
The algorithm Context
- Starting with a finite sample (X_0, ..., X_{n-1}), the goal is to estimate the context at step n.
- Start with a candidate context (X_{n-k(n)}, ..., X_{n-1}), where k(n) = log n.
- Then decide whether or not to shorten this candidate context, using some gain function, for instance the log-likelihood ratio statistic.
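A heavily simplified sketch of this pruning step: compare the empirical next-symbol distribution after the candidate context with the one after its shortened version, and shorten whenever the log-likelihood ratio gain stays below a threshold of the form C·log n. This is our schematic rendering of the idea, not Rissanen's exact procedure; the function names are ours, and the role of the constant C is discussed on the next slide.

```python
import math
from collections import Counter

def counts(sample, w):
    """N(w, a): how often the string w is immediately followed by each symbol a."""
    k, c = len(w), Counter()
    for i in range(k, len(sample)):
        if tuple(sample[i - k:i]) == w:
            c[sample[i]] += 1
    return c

def shorten(sample, w, C):
    """Drop the oldest symbol of the candidate context w as long as the
    log-likelihood ratio gain of keeping it stays below C * log(n)."""
    n = len(sample)
    while len(w) > 1:
        v = w[1:]                              # shortened candidate
        cw, cv = counts(sample, w), counts(sample, v)
        nw, nv = sum(cw.values()), sum(cv.values())
        if nw == 0:                            # candidate never observed: shorten
            w = v
            continue
        gain = sum(m * (math.log(m / nw) - math.log(cv[a] / nv))
                   for a, m in cw.items())
        if gain >= C * math.log(n):            # keeping the longer context pays off
            return w
        w = v
    return w

# Usage, in the notation of the slide: the candidate starts with length k(n) ~ log n.
# k = max(1, int(math.log(len(sample))))
# estimated_context = shorten(sample, tuple(sample[-k:]), C=1.0)
```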
Good and bad news
- Recently this algorithm was extended to the case of unbounded trees and its consistency was proved by several authors (Csiszár and Talata, Galves and Leonardi, Ferrari and Wyner, ...).
- The hidden difficulty: there is always a threshold constant C in the gain function that we use to decide whether or not to shorten the candidate context.
- For asymptotic consistency results, the specific value of C is irrelevant.
- But if you are an applied statistician and you must select the context tree based on a finite sample, the choice of C matters!
The smallest maximizer criterion
- Assume that the sample was really produced by a probabilistic context tree (τ*, p*).
- Consider now the set of candidate context trees maximizing the probability of the sample for each number of degrees of freedom.
- It turns out that this set of champion trees is totally ordered and contains the tree τ*.
- Moreover, there is a change of regime in the gain of likelihood at τ*.
- In the case where the tree τ* is bounded, this is a rigorous result.
- A similar result for a different class of models was recently pointed out by Massart and co-authors.
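A rough way to locate the change of regime numerically, assuming the champion trees and their log-likelihoods are already available: pick the tree after which the likelihood gain per additional leaf collapses. This is only a heuristic for reading the plot, written by us for illustration; it is not the formal smallest maximizer criterion, and the case study below additionally compares consecutive champion trees on resamples of increasing size.

```python
def change_of_regime(champions):
    """champions: list of (number_of_leaves, log_likelihood) pairs for the
    champion trees, sorted by increasing number of leaves.  Returns the
    number of leaves after which the likelihood gain per added leaf drops
    the most -- a crude numerical proxy for the change of regime."""
    slopes = [(ll1 - ll0) / (k1 - k0)
              for (k0, ll0), (k1, ll1) in zip(champions, champions[1:])]
    drops = [s0 - s1 for s0, s1 in zip(slopes, slopes[1:])]
    return champions[drops.index(max(drops)) + 1][0]
```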
A simulation study
We simulate a sequence x_1, ..., x_n over the alphabet A = {0,1,2,3,4} using the following context tree:
[Figure: the context tree over A = {0,1,2,3,4} used in the simulation.]
To perform the simulation we assign transition probabilities to each branch of the tree.
Using the tree and the transition probabilities we simulate 100,000 symbols.
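With the `ProbabilisticContextTree` sketch given earlier, such a simulation could look as follows. The tree and probabilities below are illustrative placeholders satisfying the suffix property; they are not the tree and probabilities actually used in the study (those are shown only in the figure).

```python
# Illustrative placeholder only -- NOT the tree of the figure.  Symbols
# 1, 2, 3, 4 are contexts by themselves; after a 0 the previous symbol
# also matters, so the contexts ending in 0 have length two.
toy_tree = ProbabilisticContextTree({
    (1,): {0: 0.4, 2: 0.3, 3: 0.2, 4: 0.1},
    (2,): {0: 0.5, 1: 0.5},
    (3,): {0: 0.6, 2: 0.4},
    (4,): {2: 0.5, 3: 0.5},
    (0, 0): {1: 0.5, 3: 0.5},
    (1, 0): {2: 0.6, 4: 0.4},
    (2, 0): {0: 0.3, 1: 0.7},
    (3, 0): {0: 0.5, 2: 0.5},
    (4, 0): {3: 1.0},
})
sample = toy_tree.simulate(100_000, initial_past=[3])
```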
A simulation study
- The candidate champion trees have successively 1, 8, 11, 13, 16, 17, ... leaves. The tree with 13 leaves corresponds to the correct tree (the tree we used to simulate the data).
- When we plot the log-likelihood of the sample as a function of the number of leaves, we see a change of regime, as stated by our theorem.
A simulation study
[Figure: change of regime of the log-likelihood function; the log-likelihood is plotted against the tree index.]
Application to the linguistic data set
- The sample consists of 80 articles randomly selected from the 1994 and 1995 editions.
- We chose 20 articles from each year for each newspaper.
- We ended up with a sample of 97,750 symbols for BP and 105,326 symbols for EP.
Application to the linguistic data set
[Figures: log-likelihood as a function of the number of leaves of the champion trees, for the BP and EP samples, each shown over the full range and zoomed in.]
Application to the linguistic data set
[Figures: BP sample. Difference in log-likelihood divided by the resample size, plotted against the resample size, for the champion trees with 13 and 14, 14 and 15, 15 and 22, and 22 and 25 leaves.]
Application to the linguistic data set
[Figures: EP sample. Difference in log-likelihood divided by the resample size, plotted against the resample size, for the champion trees with 14 and 17, 17 and 18, 18 and 20, and 20 and 23 leaves.]
Application to the linguistic data set
[Figure: the selected BP context tree and the selected EP context tree.]