
NON-ASYMPTOTIC BEHAVIOR OF MAXIMUM LIKELIHOOD ESTIMATORS IN HIDDEN MARKOV MODELS

By Tiago Montanher^∗ and Arnold Neumaier
Faculty of Mathematics, University of Vienna

Abstract

This paper studies the maximum likelihood estimator of discrete time and finite state-space hidden Markov models in the absence of asymptotic assumptions. We use interval analysis to derive a condition that must be satisfied in every box containing local maxima of the likelihood function.

We also present an efficient way to evaluate the interval extension of the likelihood function and its derivatives. Our approach requires only 2 rounding mode switches, regardless of the number of observations or parameters in the model. The new condition is used to empirically determine the minimum sample size needed to reliably recover the parameters θ of a model from its observations. Numerical experiments show that for models with hidden and observable states ranging from two to eight, at least 300 observations are needed to recover θ with 95% probability. The code necessary to perform the numerical experiments is available in a C++ library as well as in a Matlab toolbox.

1. Introduction. Hidden Markov models (HMMs) form a large class of stochastic processes with a wide range of applications in applied mathematics and statistics. A hidden Markov model describes the relation between two stochastic processes. In the simplest form, the first process is a non-observable Markov chain that is assumed to be discrete in time and to have a finite number of states. The second process is observable and its distribution is given, at each moment, by the current state of the Markov chain. In this paper we are mainly interested in the case where the observable process assumes a finite number of states. However, as we will show in two examples, our results can be extended to more general processes. For a comprehensive introduction to hidden Markov models the interested reader should see Cappé, Moulines and Rydén (2005) and Zucchini and MacDonald (2009).

∗This research was supported through research grant number 205557/2014-7 of the National Council for Scientific and Technological Development of Brazil (CNPq).

MSC 2010 subject classifications: Primary 62M05, 60J10; secondary 65G40

Keywords and phrases: Hidden Markov models, Interval arithmetic, Non-asymptotic analysis

We denote the set of indices 1, . . . , N by 1:N. We write $A_{i:}$ to represent the ith row of the matrix A and $A_{:j}$ to denote the jth column. We denote by Diag(v) the diagonal matrix whose non-zero entries are given by the vector v. The transpose of the matrix A is denoted by $A^T$. If $x \in \mathbb{R}^n$ then the set

$x(\varepsilon) := \{y \in \mathbb{R}^n \mid |y_i - x_i| < \varepsilon,\ i = 1:n\}$

is called the ε-neighborhood of x.

Let $\{y_n\} := \{y_n,\ n = 1, 2, \ldots\}$ be a time series assuming values in the set 1:M and $\{x_n\} := \{x_n,\ n = 1, 2, \ldots\}$ a discrete-time Markov chain on the state space 1:N. The pair of stochastic processes $\{(x_n, y_n)\}$ is a hidden Markov model if $x_n$ is not observable and

(1) $\Pr(x_{n+1} \mid x_{1:n}) = \Pr(x_{n+1} \mid x_n)$,

(2) $\Pr(y_n \mid y_{1:n-1}, x_{1:n}) = \Pr(y_n \mid x_n)$,

where Pr(· | ·) denotes conditional probability. We call $\{x_n\}$ the hidden process and $\{y_n\}$ the observable process. We denote a hidden Markov model with N states in the hidden process and M states in the observable process by HMM(N, M). Arranging equations (1) and (2) in matrices, we identify the hidden Markov model with the parameter pair θ = (A, B) where

$A_{ij} := \Pr(x_{n+1} = j \mid x_n = i)$, i, j = 1:N, and

$B_{ij} := \Pr(y_n = j \mid x_n = i)$, i = 1:N, j = 1:M.
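As an illustration of the definitions above, the following sketch draws a random parameter pair and simulates observations from it. It is only a minimal example: the helper names `random_stochastic` and `sample_hmm`, the 0-based state labels, the uniform initial distribution and the way the random rows are generated are assumptions made here, not details taken from the paper.

```python
import random

def random_stochastic(rows, cols, rng):
    """Random matrix with non-negative rows that sum to 1 (illustrative choice)."""
    m = []
    for _ in range(rows):
        w = [rng.random() + 1e-3 for _ in range(cols)]
        s = sum(w)
        m.append([x / s for x in w])
    return m

def sample_hmm(A, B, pi, T, rng):
    """Draw y_1..y_T (0-based symbols) from the HMM (A, B) with initial law pi."""
    states = range(len(A))
    symbols = range(len(B[0]))
    x = rng.choices(states, weights=pi)[0]      # hidden state x_1
    ys = []
    for _ in range(T):
        ys.append(rng.choices(symbols, weights=B[x])[0])  # y_n depends on x_n only
        x = rng.choices(states, weights=A[x])[0]          # x_{n+1} depends on x_n only
    return ys

rng = random.Random(0)
A = random_stochastic(2, 2, rng)   # hidden transition matrix, N = 2
B = random_stochastic(2, 3, rng)   # emission matrix, M = 3
y = sample_hmm(A, B, [0.5, 0.5], 10, rng)
```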

An important problem when modeling with HMMs is parameter estimation. Given N, M and the observations $y_{1:T}$, find θ = (A, B) that solves the following problem

(3) $\max_{\theta}\ z := L_{y_{1:T}}(\theta)$ subject to (4)-(6),

(4) $\sum_{j=1}^{N} A_{ij} = 1$ for i = 1:N,

(5) $\sum_{j=1}^{M} B_{ij} = 1$ for i = 1:N,

(6) $A_{ij},\ B_{ij} \ge 0$,

where $L_{y_{1:T}}(\theta)$ is the likelihood function. We say that a point θ that solves (3) is a maximum likelihood estimator of the model. Two common approaches to solve (3) are the Baum-Welch procedure, described in Baum et al. (1970) and Zucchini and MacDonald (2009), and Newton-type methods as proposed in Cappé and Moulines (2005), Bulla, Bulla and Nenadić (2010) and Turner (2008). Note that the estimation problem has several local maximizers, as pointed out by Rabiner (1989), and to the best of our knowledge there is no method available in the literature that is suitable for solving the estimation problem from a rigorous global optimization point of view.

The consistency of the maximum likelihood estimator for hidden Markov models was proved by Baum and Petrie (1966) and Leroux (1992). Let HMM(N, M) be a hidden Markov model with the parameter pair θ = (A, B) and $y_{1:T}$ a time series taken from this model. The consistency of the maximum likelihood estimator states that we recover θ by finding the global maximizers of (3) if T is sufficiently large. We note that consistency results are typically asymptotic and give no indication of the sample size T needed to obtain a recovery.

This work aims to give empirical lower bounds on the sample size T necessary to recover the parameters θ. Our approach is based on methods from rigorous computation. We use interval arithmetic (Moore, Kearfott and Cloud (2009), Neumaier (1990), Kearfott (1996)) and the simplex structure of constraints (4)-(6) to derive a condition that must hold in every box that contains local maxima of (3). We apply the new test in the following procedure:

1. Set a counter variable c to zero;

2. Given N and M, generate a random HMM(N, M) with parameters θ = (A, B);

3. Generate observations $y_{1:T}$ from the model;

4. Given a constant ε > 0, build the ε-neighborhood θ(ε) of θ;

5. Apply the test to θ(ε). If it is not able to prove that θ(ε) has no local maxima, set c to c + 1;

6. Repeat steps 3-5 J times.

After completion, c/J gives a lower bound on the recovery ratio of θ with samples of size T. We run the procedure above a large number of times and with different values of T. Moreover, we consider different types of hidden Markov models while generating θ. Specifically, we have considered general HMMs, models where the hidden process is a forward chain, Markov chains (the special case where B is known to be the identity matrix) as well as models with Poisson and normal outcomes.

We call the procedure above rigorous because it is based on interval arithmetic. We also present a heuristic approach to assign probabilities of recovering θ depending on T. We compare both methods and show that the rigorous one is more conservative, underestimating the value of T. On the other hand we show that the minimum sample size required to recover θ with probability 95% is 300. We note that this is greater than the sample size of several time series in applications.

We outline this paper as follows. In Section 2 we introduce the basics of interval arithmetic, the main tool of our analysis. Section 3 describes the heuristic approach. Section 4 presents the elimination test necessary for step 5 of the procedure above. An efficient and tight interval extension of the likelihood function is the subject of Section 5. We conclude with numerical experiments in Section 6 and compare the heuristic approach with the rigorous one. The code used to perform the numerical experiments is available as a C++ library and as a Matlab toolbox upon request to the first author.

2. Interval Analysis. This section introduces interval arithmetic, a numerical analysis technique that allows rigorous computations. Interval arithmetic is a natural tool to design global optimization algorithms, as described in Kearfott (1996), Hansen and Walster (2004) and Domes (2009). It has been used in statistics for parameter estimation of several models, see Wright and Kennedy (2000), Ahn, Kim and Chen (2012) and Algahtani (2011). A comprehensive approach to the topics covered here is given in Neumaier (1990). We mostly follow the notation established in Kearfott et al. (2010).

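2.1. Definitions. Let $\underline{a}, \overline{a} \in \mathbb{R}$ with $-\infty < \underline{a} \le \overline{a} < \infty$. Then $\mathbf{a} = [\underline{a}, \overline{a}]$ denotes a real interval with $\inf \mathbf{a} = \underline{a}$ and $\sup \mathbf{a} = \overline{a}$. The set of all compact intervals is denoted by

$\mathbb{IR} := \{[\underline{a}, \overline{a}] \mid \underline{a} \le \overline{a},\ \underline{a}, \overline{a} \in \mathbb{R}\}$.

An interval vector $\mathbf{x}$, or box, is the Cartesian product of the compact real intervals $\mathbf{x}_i := [\underline{x}_i, \overline{x}_i] \in \mathbb{IR}$, representing an axis-parallel box in $\mathbb{R}^n$. We also define interval matrices in a similar way and denote the set of all compact real interval matrices by $\mathbb{IR}^{n \times m}$.

The width of the interval $\mathbf{a} = [\underline{a}, \overline{a}]$ is given by $\mathrm{wid}\,\mathbf{a} = \overline{a} - \underline{a}$ and the norm by $\|\mathbf{a}\| = \max(|\inf \mathbf{a}|, |\sup \mathbf{a}|)$. An interval is called degenerate if $\mathrm{wid}\,\mathbf{a} = 0$. We also denote the midpoint and radius of $\mathbf{a}$ respectively by $\hat{a} = (\underline{a} + \overline{a})/2$ and $\check{a} = (\overline{a} - \underline{a})/2$. Operations defined for intervals are interpreted component-wise when applied to boxes or matrices.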

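Let $\mathbf{a}, \mathbf{b} \in \mathbb{IR}$. The elementary real operations $\circ \in \{+, -, /, *\}$ are extended to interval arguments $\mathbf{a}, \mathbf{b}$ by defining the result of an elementary interval operation to be the set of real numbers that results from combining any two numbers contained in $\mathbf{a}$ and $\mathbf{b}$. Formally,

$\mathbf{a} \circ \mathbf{b} := \{a \circ b \mid a \in \mathbf{a},\ b \in \mathbf{b}, \text{ and } a \circ b \text{ is defined}\}$

for $\circ \in \{+, -, /, *\}$. Elementary interval operations have the so-called inclusion property, which means

$\mathbf{a} \subseteq \mathbf{a}',\ \mathbf{b} \subseteq \mathbf{b}' \ \Rightarrow\ \mathbf{a} \circ \mathbf{b} \subseteq \mathbf{a}' \circ \mathbf{b}'$.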

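We call the function $\mathbf{f} : \mathbb{IR}^n \to \mathbb{IR}$ an inclusion function of $f : \mathbb{R}^n \to \mathbb{R}$ if $\mathbf{x} \subseteq \mathbf{y} \Rightarrow \mathbf{f}(\mathbf{x}) \subseteq \mathbf{f}(\mathbf{y})$. We also say that the interval function $\mathbf{f} : \mathbb{IR}^n \to \mathbb{IR}$ is the interval extension of $f : D \subseteq \mathbb{R}^n \to \mathbb{R}$ if

$\mathbf{f}(x) = f(x)$ for all $x \in D$, and $f(x) \in \mathbf{f}(\mathbf{x})$ for all $x \in \mathbf{x}$ with $\mathbf{x} \subseteq D$.

Let $\mathbf{x} \in \mathbb{IR}^n$ and $f : D \subseteq \mathbb{R}^n \to \mathbb{R}$. We define the set $\Box f(\mathbf{x}) := \{f(x) \mid x \in \mathbf{x} \cap D\}$ and call it the range of f over $\mathbf{x}$. We extend the range to a function on $\mathbb{IR}$ by $f(\mathbf{x}) := \Box f(\mathbf{x})$, also called the range of f.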

One of the most important results in rigorous computations is the fundamental theorem of interval arithmetic, which is proved in Moore (1966) or Moore, Kearfott and Cloud (2009).

Theorem 2.1 (Fundamental theorem of interval analysis). If $\mathbf{f}$ has the inclusion property and is the interval extension of f then $\Box f(\mathbf{x}) \subseteq \mathbf{f}(\mathbf{x})$.

As shown in Moore, Kearfott and Cloud (2009), if the interval function $\mathbf{F} : \mathbb{IR}^n \to \mathbb{IR}^m$ is a composition of functions that satisfy the inclusion property then $\mathbf{F}$ also has the inclusion property. Formally,

Proposition 2.1. If $\mathbf{F} : \mathbb{IR}^n \to \mathbb{IR}^m$ and $\mathbf{G} : \mathbb{IR}^m \to \mathbb{IR}^p$ are interval functions with the inclusion property then $\mathbf{G}(\mathbf{F}(\mathbf{x}))$ has the inclusion property.

We call the interval function $\mathbf{f} : \mathbb{IR}^n \to \mathbb{IR}^n$ the natural extension of $f : \mathbb{R}^n \to \mathbb{R}^n$ if $\mathbf{f}$ is obtained by replacing real operations by interval ones.

The next proposition states that the natural extension of a polynomial has the inclusion property.

Proposition 2.2. If $f : \mathbb{R}^n \to \mathbb{R}$ is a polynomial and $\mathbf{f}$ is its natural extension then $\mathbf{f}$ has the inclusion property.

Proof. The result follows from Proposition 2.1 since $\mathbf{f}$ is a composition of elementary operations and the natural extension of elementary operations has the inclusion property.

Throughout this paper, all interval functions are natural extensions of polynomials and therefore have the inclusion property.
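The natural extension of a polynomial can be illustrated with a small sketch. The example below uses exact rational endpoints so that rounding plays no role here; the class `Interval`, the function `f_nat` and the polynomial f(x) = x − x·x are choices made for this illustration only and are not taken from the paper.

```python
from fractions import Fraction as Fr

class Interval:
    """Closed interval [lo, hi] with exact rational endpoints."""
    def __init__(self, lo, hi=None):
        self.lo = Fr(lo)
        self.hi = Fr(lo if hi is None else hi)
    def __add__(self, o):
        return Interval(self.lo + o.lo, self.hi + o.hi)
    def __sub__(self, o):
        return Interval(self.lo - o.hi, self.hi - o.lo)
    def __mul__(self, o):
        p = [self.lo * o.lo, self.lo * o.hi, self.hi * o.lo, self.hi * o.hi]
        return Interval(min(p), max(p))
    def contains(self, o):
        return self.lo <= o.lo and o.hi <= self.hi

def f_nat(x):
    """Natural extension of f(x) = x - x*x: real operations become interval ones."""
    return x - x * x

box = Interval(0, 1)
enc = f_nat(box)                            # encloses the range of f over [0, 1]
sub = f_nat(Interval(Fr(1, 4), Fr(1, 2)))   # enclosure over a sub-box
```

The true range of f over [0, 1] is [0, 1/4], while the natural extension returns the wider interval [−1, 1]; the enclosure is rigorous but not tight. The enclosure over the sub-box is contained in the enclosure over the full box, as the inclusion property requires.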

2.2. Computational issues. We implement interval arithmetic over M(F) instead of $\mathbb{R}$, where M(F) is the finite set of representable numbers in a machine with floating-point type F. Typically F is float or double in C-like languages.

The following example illustrates the problem that occurs with rigorous calculations on M(F).

Figure 1. The need of switching the rounding mode.

Suppose that the result of a calculation is the non-representable number r in Figure 1. Assume that x and y are representable numbers, i.e., x, y ∈ M(F). If we do not set the rounding mode properly, we may return the interval $\mathbf{r} = [y, y + \Delta]$ as the enclosure of r, which is not rigorous. On the other hand, if we are cautious and evaluate $\inf \mathbf{r}$ carefully, we obtain $\mathbf{r} = [x, y + \Delta]$, which is a rigorous enclosure of r.

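The IEEE 754 standard (IEEE (1985)) for floating-point operations is implemented in almost every processor nowadays. It establishes three main rounding policies: to nearest (which in Figure 1 is the point y), upward (y again) and downward (which gives x). We denote downward operations by $\triangledown(\cdot)$ and upward operations by $\triangle(\cdot)$. If $\mathbf{x} = [\underline{x}, \overline{x}]$ and $\mathbf{y} = [\underline{y}, \overline{y}]$ then the elementary operations are given by

$\mathbf{x} + \mathbf{y} = [\triangledown(\underline{x} + \underline{y}),\ \triangle(\overline{x} + \overline{y})]$,

$\mathbf{x} - \mathbf{y} = [\triangledown(\underline{x} - \overline{y}),\ \triangle(\overline{x} - \underline{y})]$,

$\mathbf{x} * \mathbf{y} = [\triangledown(\min\{\underline{x}\,\underline{y}, \underline{x}\,\overline{y}, \overline{x}\,\underline{y}, \overline{x}\,\overline{y}\}),\ \triangle(\max\{\underline{x}\,\underline{y}, \underline{x}\,\overline{y}, \overline{x}\,\underline{y}, \overline{x}\,\overline{y}\})]$,

$1/\mathbf{y} = [\triangledown(1/\overline{y}),\ \triangle(1/\underline{y})]$, $0 \notin \mathbf{y}$.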

Note that interval multiplication requires 4 real multiplications and 2 rounding mode switches. Moreover, a rounding mode switch costs on average 13 times more than a double precision floating-point multiplication in C++. This number shows that we should avoid rounding mode switches whenever possible. To evaluate standard functions with specified rounding policies, see Fousse et al. (2007).
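The effect of directed rounding can be imitated in a language that does not expose the hardware rounding mode. The sketch below is an assumption-laden stand-in, not the method of the paper: instead of switching the rounding mode, it computes in round-to-nearest and then moves each bound one floating-point number outward with `math.nextafter` (Python 3.9 or later). The result is slightly wider than true directed rounding but still a valid enclosure. The helper names `down`, `up`, `iadd`, `imul` and `contains` are hypothetical.

```python
import math
from fractions import Fraction

def down(v):
    """Next float below v: a safe replacement for a downward-rounded result."""
    return math.nextafter(v, -math.inf)

def up(v):
    """Next float above v: a safe replacement for an upward-rounded result."""
    return math.nextafter(v, math.inf)

def iadd(x, y):
    """Interval sum of x = (lo, hi) and y = (lo, hi) with outward widening."""
    return (down(x[0] + y[0]), up(x[1] + y[1]))

def imul(x, y):
    """Interval product using the four endpoint products."""
    p = [x[0] * y[0], x[0] * y[1], x[1] * y[0], x[1] * y[1]]
    return (down(min(p)), up(max(p)))

def contains(iv, q):
    """Exact check that the rational number q lies in the interval iv."""
    return Fraction(iv[0]) <= q <= Fraction(iv[1])

tenth = (0.1, 0.1)                       # degenerate interval at the float 0.1
s = iadd(iadd(tenth, tenth), tenth)      # encloses three times the float 0.1
```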

We finish this section by noting that there are several good implementations of interval arithmetic for a number of programming languages used in scientific computing. Recently, IEEE defined the standard IEEE 1788 for implementations of interval arithmetic (IEEE (2014)). A first implementation of the interval arithmetic proposed by the standard can be found in Nehmeier (2014).

3. Heuristic approach. This section provides a heuristic approach to determine the probability of recovering θ as a function of the number of observations T. We apply the procedure to hidden Markov models with 2 hidden states and 2 observable values. Let θ be the parameters of HMM(2,2) and $\theta^*_T$ the solution of (3) for a given $y_{1:T}$ taken from θ. Since the maximum likelihood estimator is consistent, we have

(7) $\lim_{T \to \infty} \Pr(\|\theta^*_T - \theta\|_\infty < \varepsilon) = 1$ for any ε > 0.

This property allows us to derive a procedure where we fix the parameter pair θ, generate observations $y_{1:T}$ from θ and apply a local optimization procedure like the Baum-Welch algorithm in order to obtain the estimate θ* of θ.

We apply the local optimization algorithm with multiple starting points in order to increase the chance of finding the global maximizer of (3). If we follow this procedure a large number of times while increasing the sample size T, Equation (7) gives a natural way to estimate the probability of recovering θ from $y_{1:T}$.

We note that the parameters of hidden Markov models are symmetric. If θ = (A, B) and θ′ = (A′, B′) are parameter pairs such that

(8) $A' = \begin{pmatrix} A_{22} & A_{21} \\ A_{12} & A_{11} \end{pmatrix}$ and $B' = \begin{pmatrix} B_{21} & B_{22} \\ B_{11} & B_{12} \end{pmatrix}$

then $L_{y_{1:T}}(\theta) = L_{y_{1:T}}(\theta')$ for any $y_{1:T}$. Therefore one can find θ′ as the global maximizer for the model with parameter pair θ, which increases the norm of the difference in (7). Algorithm 1 describes the heuristic experiment, taking the symmetry of HMM(2,2) into account.
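The symmetry in (8) can be checked numerically. The following sketch evaluates the likelihood as a product of matrices and compares a model with its relabeled counterpart. Two assumptions are made here that the text does not state: the initial distribution π is swapped together with the hidden states, and symbols are indexed from 0. The function names and the numerical values are illustrative only.

```python
def likelihood(A, B, pi, y):
    """L(theta) as the product pi^T P(y_1) A P(y_2) ... A P(y_T) 1."""
    n = len(A)
    alpha = [pi[i] * B[i][y[0]] for i in range(n)]        # pi^T P(y_1)
    for obs in y[1:]:
        alpha = [sum(alpha[i] * A[i][j] for i in range(n)) * B[j][obs]
                 for j in range(n)]                       # multiply by A P(y_t)
    return sum(alpha)                                     # multiply by the ones vector

def swap_states(A, B, pi):
    """Relabel the two hidden states as in (8); pi is swapped as well."""
    A2 = [[A[1][1], A[1][0]], [A[0][1], A[0][0]]]
    B2 = [B[1][:], B[0][:]]
    return A2, B2, [pi[1], pi[0]]

A = [[0.7, 0.3], [0.2, 0.8]]
B = [[0.9, 0.1], [0.4, 0.6]]
pi = [0.3, 0.7]
y = [0, 1, 1, 0, 1]
L1 = likelihood(A, B, pi, y)
L2 = likelihood(*swap_states(A, B, pi), y)
```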


Algorithm 1 Heuristic approach

Input: The parameter θ, the sample size T, the resample factor R and the tolerance τ.
Output: The vector D of norm differences.

for n = 1, . . . , R do
    Generate a sequence $y_{1:T}$ of size T from θ;
    df_min ← −∞;
    i ← 0;
    i_min ← 0;
    θ0 ← θ;
    while true do
        i ← i + 1;
        θ* ← BaumWelch(θ0, τ);
        df ← $L_{y_{1:T}}(\theta^*)$;
        if df > df_min then
            df_min ← df;
            i_min ← i;
            Apply (8) to θ* and store the result in θ′;
            D(n) ← min(‖θ − θ*‖∞, ‖θ − θ′‖∞);
        end if
        if i > i_min + 10 then
            break;
        end if
        θ0 ← generate_random_model();
    end while
end for

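4. Rigorous approach. Here and throughout this paper, we denote interval parameters of hidden Markov models by $\boldsymbol{\theta} = (\mathbf{A}, \mathbf{B})$. We say that the parameter pair θ = (A, B) is feasible if (4)-(6) hold for θ, and that the interval parameter pair $\boldsymbol{\theta}$ is consistent if at least one element $\theta \in \boldsymbol{\theta}$ is feasible.

Moreover, given $\boldsymbol{\theta}$ and the ith row of the interval matrix $\mathbf{A}$, denoted by $\mathbf{A}_{i:}$, we define the set of entries that may contain zero by $I_{A_{i:}}(\boldsymbol{\theta})$. Formally we write

$I_{A_{i:}}(\boldsymbol{\theta}) := \{j \in 1:N \mid \inf \mathbf{A}_{ij} = 0\}$ for all i = 1:N.

The complement of $I_{A_{i:}}(\boldsymbol{\theta})$ is $\overline{I}_{A_{i:}}(\boldsymbol{\theta}) := (1:N) \setminus I_{A_{i:}}(\boldsymbol{\theta})$. We also define $I_{B_{i:}}(\boldsymbol{\theta})$ and $\overline{I}_{B_{i:}}(\boldsymbol{\theta})$ in a similar way for every row $\mathbf{B}_{i:}$ of $\mathbf{B}$.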

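Proposition 4.1. If θ is a feasible parameter pair then there exist ε > 0 and an interval pair $\boldsymbol{\theta}$ such that $\theta \in \boldsymbol{\theta}$, $\max(\mathrm{wid}\,\boldsymbol{\theta}) \ge \varepsilon$ and both $\overline{I}_{A_{i:}}(\boldsymbol{\theta})$ and $\overline{I}_{B_{i:}}(\boldsymbol{\theta})$ are non-empty for each i = 1:N.

Proof. Since θ = (A, B) is feasible, at least one entry of each row $A_{i:}$ and $B_{i:}$ is greater than zero. Let $\boldsymbol{\theta}$ be the interval parameter pair given by

$\mathbf{A}_{ij} = \left[\max\left(0,\ A_{ij} - \frac{\|A_{i:}\|}{N}\right),\ \min\left(1,\ A_{ij} + \frac{\|A_{i:}\|}{N}\right)\right]$

and

$\mathbf{B}_{ij} = \left[\max\left(0,\ B_{ij} - \frac{\|B_{i:}\|}{M}\right),\ \min\left(1,\ B_{ij} + \frac{\|B_{i:}\|}{M}\right)\right]$.

The result follows from taking $\varepsilon = \max(\|A\|, \|B\|)$.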

The result below relies on the first-order Kuhn-Tucker optimality conditions. For a review of this topic, see Nocedal and Wright (2006).

Proposition 4.2. Let θ* = (A*, B*) be a feasible parameter pair. If θ* is a local maximum of (3) then there exist vectors $\lambda^* \in \mathbb{R}^{2N}$ and $\mu^* \in \mathbb{R}^{N^2 + NM}$ such that

(9) $\frac{\partial L}{\partial A_{ij}}(\theta^*) - \lambda^*_{A_{i:}} - \mu^*_{A_{ij}} = 0$, i, j = 1:N,

(10) $\frac{\partial L}{\partial B_{ij}}(\theta^*) - \lambda^*_{B_{i:}} - \mu^*_{B_{ij}} = 0$, i = 1:N, j = 1:M,

(11) $\sum_{j=1}^{N} A^*_{ij} = 1$, i = 1:N,

(12) $\sum_{j=1}^{M} B^*_{ij} = 1$, i = 1:N,

(13) $\mu^*_{A_{ij}} A^*_{ij} = 0$, i, j = 1:N,

(14) $\mu^*_{B_{ij}} B^*_{ij} = 0$, i = 1:N, j = 1:M,

(15) $\mu^*_{A_{ij}},\ \mu^*_{B_{ij}} \ge 0$.

Proof. Follows from applying Kuhn-Tucker conditions of first order to problem (3).

The vectors λ* and μ* are called Lagrange multipliers of θ*.


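Theorem 4.1 (Lagrangian bounds). Let θ* be a local maximizer of (3) associated with the vectors λ* and μ*. If $\boldsymbol{\theta}$ is an interval parameter pair such that $\theta^* \in \boldsymbol{\theta}$ and $\overline{I}_{A_{i:}}(\boldsymbol{\theta}) \ne \emptyset$ then

(16) $\boldsymbol{\lambda}_{A_{i:}} := \bigcap_{j \in \overline{I}_{A_{i:}}(\boldsymbol{\theta})} \frac{\partial L}{\partial A_{ij}}(\boldsymbol{\theta})$,

(17) $\boldsymbol{\mu}_{A_{ij}} := [0, 0]$, $j \in \overline{I}_{A_{i:}}(\boldsymbol{\theta})$, and

(18) $\boldsymbol{\mu}_{A_{ij}} := [0, \infty] \cap \left(\frac{\partial L}{\partial A_{ij}}(\boldsymbol{\theta}) - \boldsymbol{\lambda}_{A_{i:}}\right)$, $j \in I_{A_{i:}}(\boldsymbol{\theta})$,

are rigorous bounds for $\lambda^*_{A_{i:}}$ and $\mu^*_{A_{ij}}$.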

Proof. Since $j \in \overline{I}_{A_{i:}}(\boldsymbol{\theta})$ implies that $\inf \mathbf{A}_{ij} > 0$, Equation (17) follows from (13). Let $\overline{I}_{A_{i:}}(\boldsymbol{\theta})$ be a non-empty set. Then (9) and (17) imply that

$\frac{\partial L}{\partial A_{ij}}(\theta^*) = \lambda^*_{A_{i:}}$, $j \in \overline{I}_{A_{i:}}(\boldsymbol{\theta})$.

By hypothesis $\boldsymbol{\theta}$ is an interval parameter pair containing the local maximum θ*, and relation (16) follows from the inclusion property. Relation (18) follows from the definition of elementary operations applied to the interval counterpart of (9) and from constraint (15).

The same result holds for the Lagrange multipliers associated with the observable process; the proof is obtained by replacing Equations (9) and (13) in the proof above by (10) and (14) respectively.

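Corollary 4.1 (Elimination test). Let $\boldsymbol{\theta}$ be an interval parameter pair with $\overline{I}_{A_{i:}}(\boldsymbol{\theta}) \ne \emptyset$, and let $\boldsymbol{\lambda}_{A_{i:}}$ and $\boldsymbol{\mu}_{A_{ij}}$ be given by (16) and (18) respectively. If $\boldsymbol{\lambda}_{A_{i:}} = \emptyset$ or $\boldsymbol{\mu}_{A_{ij}} = \emptyset$ for any row i then there is no local maximizer of (3) in $\boldsymbol{\theta}$.

Proof. Follows immediately from the bounds on the Lagrange multipliers and the fundamental theorem of interval analysis.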

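The elimination test states that for a saddle point θ*,

$\frac{\partial L}{\partial A_{ij}}(\theta^*) = \frac{\partial L}{\partial A_{ik}}(\theta^*)$

for every j and k such that $A_{ij}$ and $A_{ik}$ are positive. Therefore, given $\boldsymbol{\theta}$ such that $\inf \mathbf{A}_{ij} > 0$, $\inf \mathbf{A}_{ik} > 0$ and $\frac{\partial L}{\partial A_{ij}}(\boldsymbol{\theta}) \cap \frac{\partial L}{\partial A_{ik}}(\boldsymbol{\theta}) = \emptyset$, we can discard $\boldsymbol{\theta}$. Note that the elimination test says nothing if $\overline{I}_{A_{i:}}(\boldsymbol{\theta})$ and $\overline{I}_{B_{i:}}(\boldsymbol{\theta})$ are empty for every i = 1:N. However, Proposition 4.1 states that given a local maximum θ* we can always find $\boldsymbol{\theta} \ni \theta^*$ for which $\overline{I}_{A_{i:}}(\boldsymbol{\theta})$ and $\overline{I}_{B_{i:}}(\boldsymbol{\theta})$ are non-empty. Algorithm 2 applies the elimination test to $\boldsymbol{\theta}$.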

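Algorithm 2 Elimination test

Input: The box $\boldsymbol{\theta}$.
Output: false if $\boldsymbol{\theta}$ does not contain a maximizer of (3) and true if it may contain one.

if $\overline{I}_{A_{i:}}(\boldsymbol{\theta}) = \emptyset$ and $\overline{I}_{B_{i:}}(\boldsymbol{\theta}) = \emptyset$ for all i = 1:N then
    return true;
end if
for i = 1:N do
    $\boldsymbol{\lambda}_{A_i} \leftarrow \bigcap_{j \in \overline{I}_{A_{i:}}(\boldsymbol{\theta})} \frac{\partial L}{\partial A_{ij}}(\boldsymbol{\theta})$;
    $\boldsymbol{\lambda}_{B_i} \leftarrow \bigcap_{j \in \overline{I}_{B_{i:}}(\boldsymbol{\theta})} \frac{\partial L}{\partial B_{ij}}(\boldsymbol{\theta})$;
    if $\boldsymbol{\lambda}_{A_i} = \emptyset$ or $\boldsymbol{\lambda}_{B_i} = \emptyset$ then
        return false;
    end if
    for $j \in I_{A_{i:}}(\boldsymbol{\theta})$ do
        $\boldsymbol{\mu}_{A_{ij}} \leftarrow [0, \infty] \cap \left(\frac{\partial L}{\partial A_{ij}}(\boldsymbol{\theta}) - \boldsymbol{\lambda}_{A_i}\right)$;
        if $\boldsymbol{\mu}_{A_{ij}} = \emptyset$ then
            return false;
        end if
    end for
    for $j \in I_{B_{i:}}(\boldsymbol{\theta})$ do
        $\boldsymbol{\mu}_{B_{ij}} \leftarrow [0, \infty] \cap \left(\frac{\partial L}{\partial B_{ij}}(\boldsymbol{\theta}) - \boldsymbol{\lambda}_{B_i}\right)$;
        if $\boldsymbol{\mu}_{B_{ij}} = \emptyset$ then
            return false;
        end if
    end for
end for
return true;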
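The row-wise logic of the elimination test can be sketched as follows for a single row of the interval matrix A. The sketch assumes that interval enclosures of the partial derivatives over the box are already available as `(lo, hi)` pairs; how they are obtained is outside this example. The names `intersect` and `row_test` are hypothetical, and the interval subtraction below is done in ordinary floating-point arithmetic without outward rounding, so the sketch shows the logic only and is not rigorous.

```python
import math

def intersect(a, b):
    """Intersection of two closed intervals (lo, hi); None when it is empty."""
    lo, hi = max(a[0], b[0]), min(a[1], b[1])
    return (lo, hi) if lo <= hi else None

def row_test(grad_row, inf_row):
    """Return False if this row proves that the box holds no local maximizer.

    grad_row[j]: enclosure (lo, hi) of dL/dA_ij over the box.
    inf_row[j]:  lower bound of the interval entry A_ij.
    """
    positive = [j for j, lo in enumerate(inf_row) if lo > 0]
    if not positive:
        return True                       # no information for this row
    lam = (-math.inf, math.inf)
    for j in positive:                    # bound (16): intersect the enclosures
        lam = intersect(lam, grad_row[j])
        if lam is None:
            return False
    for j, lo in enumerate(inf_row):      # bound (18) for entries that may be zero
        if lo == 0:
            diff = (grad_row[j][0] - lam[1], grad_row[j][1] - lam[0])
            if intersect((0.0, math.inf), diff) is None:
                return False
    return True
```

For instance, two strictly positive entries whose derivative enclosures are disjoint, such as (1, 2) and (3, 4), make the row test return False, so the box can be discarded.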

Note that the elimination test is conservative in the sense that it does not prove that $\boldsymbol{\theta}$ contains local maxima. Whenever the test returns true we cannot say anything about the existence of local maxima in $\boldsymbol{\theta}$. The elimination test requires only the derivatives of the interval likelihood function. The next section presents an algorithm to evaluate such derivatives efficiently.

5. Interval Evaluation. This section addresses the problem of evaluating the likelihood function using interval arithmetic. Subsection 5.1 reviews the real case since it is the basis for interval calculations. Subsection 5.2 shows that the interval likelihood function can be evaluated with only two rounding mode switches.

5.1. Real case revision. As stated by Rabiner (1989), given the feasible parameter pair θ = (A, B), the initial distribution π for the hidden process and the sequence of observations $y_{1:T}$, the associated likelihood function is

(19) $L_{y_{1:T}}(\theta) := \pi^T P(y_1) A P(y_2) \cdots A P(y_T) \mathbf{1}$

where $P(y_i) := \mathrm{Diag}(B_{:y_i})$. Whenever the sequence $y_{1:T}$ is fixed and no misunderstanding can arise, we write L(θ) instead of $L_{y_{1:T}}(\theta)$.

Note that (19) depends on the initial distribution vector π. Throughout this paper we assume that π is given and therefore we do not include it in the notation of the likelihood function.

To implement the likelihood function, we use the so-called backward recursion defined by the auxiliary time series β, as follows:

$\beta_T = \mathbf{1}$,

$\beta_n = A P(y_{n+1}) \beta_{n+1}$, n = T − 1, . . . , 1,

$L(\theta) = \pi^T P(y_1) \beta_1$.
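A direct transcription of the backward recursion may help to fix the indices. The sketch below is only an illustration: the function names are hypothetical, symbols are indexed from 0, and the brute-force sum over all hidden paths is included solely as an independent check for very short sequences.

```python
from itertools import product

def backward_likelihood(A, B, pi, y):
    """L(theta) via beta_T = 1, beta_n = A P(y_{n+1}) beta_{n+1}."""
    n = len(A)
    beta = [1.0] * n                                   # beta_T
    for t in range(len(y) - 1, 0, -1):                 # y[t] plays the role of y_{n+1}
        beta = [sum(A[i][j] * B[j][y[t]] * beta[j] for j in range(n))
                for i in range(n)]
    return sum(pi[i] * B[i][y[0]] * beta[i] for i in range(n))  # pi^T P(y_1) beta_1

def brute_force_likelihood(A, B, pi, y):
    """Sum of the joint probability over every hidden path (exponential cost)."""
    n, total = len(A), 0.0
    for path in product(range(n), repeat=len(y)):
        p = pi[path[0]] * B[path[0]][y[0]]
        for t in range(1, len(y)):
            p *= A[path[t - 1]][path[t]] * B[path[t]][y[t]]
        total += p
    return total

A = [[0.7, 0.3], [0.2, 0.8]]
B = [[0.9, 0.1], [0.4, 0.6]]
pi = [0.5, 0.5]
y = [0, 1, 1, 0, 1]
L_rec = backward_likelihood(A, B, pi, y)
L_ref = brute_force_likelihood(A, B, pi, y)
```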

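We also evaluate the derivatives of L(θ) in terms of β. Note that $\frac{\partial \beta_T}{\partial A_{ij}}(\theta) = \frac{\partial \beta_T}{\partial B_{ij}}(\theta) = 0$ for every $A_{ij}$ and $B_{ij}$. Applying the chain rule to the recursion step for each variable $A_{ij}$ gives

$\frac{\partial \beta_n}{\partial A_{ij}}(\theta) = \chi_{ij} P(y_{n+1}) \beta_{n+1} + A P(y_{n+1}) \frac{\partial \beta_{n+1}}{\partial A_{ij}}$

where $\chi_{ij}$ is the indicator matrix with value 1 on entry (i, j) and 0 elsewhere. Doing the same for the variables $B_{ij}$ we obtain

$\frac{\partial \beta_n}{\partial B_{ij}}(\theta) = A \frac{\partial P(y_{n+1})}{\partial B_{ij}} \beta_{n+1} + A P(y_{n+1}) \frac{\partial \beta_{n+1}}{\partial B_{ij}}$

for i = 1:N and j = 1:M. Finally, applying the chain rule to the last term of the recursion gives

$\frac{\partial L}{\partial A_{ij}}(\theta) = \pi^T P(y_1) \frac{\partial \beta_1}{\partial A_{ij}}$,

$\frac{\partial L}{\partial B_{ij}}(\theta) = \pi^T \frac{\partial P(y_1)}{\partial B_{ij}} \beta_1 + \pi^T P(y_1) \frac{\partial \beta_1}{\partial B_{ij}}$.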
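The derivative recursion for the entries of A can be sketched and checked against a finite difference. In this illustration the entries of A are treated as independent variables, that is, the rows are not renormalized when one entry is perturbed. The function name `backward_with_dA` and the test values are hypothetical; symbols are indexed from 0.

```python
def backward_with_dA(A, B, pi, y, i, j):
    """Return (L, dL/dA_ij) from the backward recursion and its derivative."""
    n = len(A)
    beta = [1.0] * n
    dbeta = [0.0] * n                                   # d beta_T / dA_ij = 0
    for t in range(len(y) - 1, 0, -1):
        g = [B[k][y[t]] * beta[k] for k in range(n)]    # P(y_{n+1}) beta_{n+1}
        dg = [B[k][y[t]] * dbeta[k] for k in range(n)]  # P(y_{n+1}) d beta_{n+1}
        new_beta = [sum(A[r][k] * g[k] for k in range(n)) for r in range(n)]
        new_dbeta = [sum(A[r][k] * dg[k] for k in range(n)) for r in range(n)]
        new_dbeta[i] += g[j]                            # chi_ij P(y_{n+1}) beta_{n+1}
        beta, dbeta = new_beta, new_dbeta
    L = sum(pi[k] * B[k][y[0]] * beta[k] for k in range(n))
    dL = sum(pi[k] * B[k][y[0]] * dbeta[k] for k in range(n))
    return L, dL

A = [[0.7, 0.3], [0.2, 0.8]]
B = [[0.9, 0.1], [0.4, 0.6]]
pi = [0.5, 0.5]
y = [0, 1, 1, 0, 1]
L_val, dL_01 = backward_with_dA(A, B, pi, y, 0, 1)
```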


The computation of L(θ) requires a considerable number of multiplications with values in [0,1]. Therefore, we need to avoid underflow while evaluating β and its derivatives. There are several normalization procedures in the literature to properly handle this problem. A common approach (see for example Zucchini and MacDonald (2009) and Bulla, Bulla and Nenadić (2010)) is the inclusion of the auxiliary variables

$w_n = \sum_{i=1}^{N} \beta_n(i)$, $\phi_n = \frac{\beta_n}{w_n}$.

Algorithm 3 was adapted from Zucchini and MacDonald (2009). It evaluates log L(θ) using the variables $w_n$ and $\phi_n$.

Algorithm 3 log backwards evaluation

Input: The feasible parameter pair θ = (A, B), the initial distribution vector π and the set of observations $y_{1:T}$.
Output: l := log L(θ).

$\phi_T \leftarrow \mathbf{1}$;
l ← 0;
for n = T − 1, T − 2, . . . , 1 do
    $\beta \leftarrow A P(y_{n+1}) \phi_{n+1}$;
    $w \leftarrow \beta^T \mathbf{1}$;
    l ← l + log(w);
    $\phi_n \leftarrow \beta / w$;
end for
$w \leftarrow \pi^T P(y_1) \phi_1$;
l ← l + log(w);
return l;
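The normalized recursion can be sketched in a few lines. This is an illustration under the conventions used in the text (backward recursion, $\phi_T$ set to the vector of ones), with hypothetical names and 0-based symbols; it is not the code distributed with the paper.

```python
import math

def log_likelihood(A, B, pi, y):
    """log L(theta) with the backward vector renormalized at every step."""
    n = len(A)
    phi = [1.0] * n                      # phi_T = 1
    l = 0.0
    for t in range(len(y) - 1, 0, -1):   # y[t] plays the role of y_{n+1}
        beta = [sum(A[i][j] * B[j][y[t]] * phi[j] for j in range(n))
                for i in range(n)]
        w = sum(beta)                    # w_n
        l += math.log(w)
        phi = [b / w for b in beta]      # phi_n = beta_n / w_n
    w = sum(pi[i] * B[i][y[0]] * phi[i] for i in range(n))
    return l + math.log(w)

A = [[0.7, 0.3], [0.2, 0.8]]
B = [[0.9, 0.1], [0.4, 0.6]]
pi = [0.5, 0.5]
long_y = [0, 1] * 2000                   # T = 4000; the plain product would underflow
l_long = log_likelihood(A, B, pi, long_y)
```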

We observe that Algorithm 3 requires a log evaluation at every iteration. Since we are interested in generalizing the real procedure to intervals, we should avoid standard functions because each interval evaluation of a non-elementary function typically needs at least two rounding mode switches. For the logarithmic function in C++, MPFR (Fousse et al. (2007)) allows us to compute the endpoints of the natural extension with two rounding switches. However, this is a costly operation. In this paper we evaluate the backward recursion with a scaled numerical type that represents numbers of the form

$x = s \times 2^{exp}$

where exp is an integer and s is a floating-point number (typically a double precision number). Note that this is an extension of double precision. Double precision, as defined by IEEE 754, represents numbers with an exponent of 11 bits while our scaled type allows exponents with 32 bits. Algorithms 4 and 5 show how to compute the backward recursion with the scaled type.

Algorithm 4 Scaled backwards

Input: A feasible model θ and a set of observations $y_{1:T}$.
Output: An integer exp and a floating-point number s such that $s \cdot 2^{exp} = L(\theta)$.

$\beta_T \leftarrow \mathbf{1}$; exp ← 0;
for n = T − 1, . . . , 1 do
    $\beta_n \leftarrow A P(y_{n+1}) \beta_{n+1}$;
    exp ← exp + scale($\beta_n$);
end for
$s \leftarrow \pi^T P(y_1) \beta_1$;
exp ← exp + scale(s);
return (exp, s);

Algorithm 5 relies on the ldexp and frexp functions of C/C++. The interested reader should see Lippman, Lajoie and Moo (2005).

Algorithm 5 Scale vector

Input: A vector of floating-point numbers v.
Output: An integer mexp and a scaled vector v* of the same type and size as v.

v* ← v;
[mexp, ·] ← frexp(v*(1));
for n = 2, . . . , size(v) do
    [tmpexp, ·] ← frexp(v*(n));
    mexp ← max(mexp, tmpexp);
end for
for n = 1, . . . , size(v) do
    v*(n) ← ldexp(v*(n), −mexp);
end for
return (mexp, v*);
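The scaled evaluation can be imitated in Python, whose `math.frexp` and `math.ldexp` mirror the C functions of the same names. The sketch below is an illustration only: the function names, the 0-based symbols and the use of an unbounded Python integer for the exponent (instead of a 32-bit integer) are assumptions made here.

```python
import math

def scale(v):
    """Return (mexp, v*) with v = v* * 2**mexp and the largest entry of v* in [0.5, 1)."""
    mexp = max(math.frexp(x)[1] for x in v)
    return mexp, [math.ldexp(x, -mexp) for x in v]

def scaled_likelihood(A, B, pi, y):
    """Return (s, exp) such that L(theta) = s * 2**exp, without underflow."""
    n = len(A)
    beta, exp = [1.0] * n, 0
    for t in range(len(y) - 1, 0, -1):
        beta = [sum(A[i][j] * B[j][y[t]] * beta[j] for j in range(n))
                for i in range(n)]
        e, beta = scale(beta)            # pull the common power of two out of beta
        exp += e
    s = sum(pi[i] * B[i][y[0]] * beta[i] for i in range(n))
    m, e = math.frexp(s)                 # normalize the final scalar as well
    return m, exp + e

A = [[0.7, 0.3], [0.2, 0.8]]
B = [[0.9, 0.1], [0.4, 0.6]]
pi = [0.5, 0.5]
s_short, e_short = scaled_likelihood(A, B, pi, [0, 1])
s_long, e_long = scaled_likelihood(A, B, pi, [0, 1] * 2000)
```

Multiplying or dividing by a power of two is exact in binary floating point as long as no underflow or overflow occurs, which is why the scaling does not add rounding error of its own.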

Table 1 shows that both approaches have equivalent accuracy when evaluating the likelihood function. Figure 2 gives the time needed to evaluate the likelihood with Algorithms 3 and 4. We implement both algorithms in C++11 and perform the experiment on a Core i7 processor with 6 GB of RAM. For each algorithm we fix the feasible parameter pair θ = (A, B) of an HMM(2,2) and generate sequences of size T at random. Each entry of columns 2 and 3 gives the likelihood of θ.

5.2. Fast interval evaluation. We now show that Algorithms 4 and 5 extend naturally to intervals. Consider again the step of the backward recursion,

$\beta_n = A P(y_{n+1}) \beta_{n+1}$.

Table 1
Comparison of accuracy between Algorithms 3 and 4 to evaluate L(θ).

T     scaled         log            |log − scaled|
10    0.00616949     0.00616949     7.80626e-18
20    3.06144e-06    3.06144e-06    9.74088e-21
50    2.49836e-15    2.49836e-15    2.32714e-29
100   3.70022e-30    3.70022e-30    6.30584e-45
150   6.93633e-42    6.93633e-42    4.11655e-55
200   3.56547e-55    3.56547e-55    2.14438e-68
250   2.72636e-72    2.72636e-72    3.93607e-85
300   3.31647e-85    3.31647e-85    2.97748e-98
400   3.44266e-116   3.44266e-116   1.02e-129

Figure 2. Time comparison of Algorithms 3 and 4 (axis label: Time (s)).

Since $\beta_T = \mathbf{1}$ and every entry in A and $P(y_{n+1})$ is a probability, each $\beta_n$ is non-negative. Moreover, the ith element of $\beta_n$ is given by

$\beta_n(i) = \sum_{j=1}^{N} A_{ij} \gamma_j(y_{n+1})$

where

$\gamma_j(y_{n+1}) = B_{j y_{n+1}} \beta_{n+1}(j)$.

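Therefore, the β variables and the likelihood function are evaluated by multiplication and summation of the $A_{ij}$ and $B_{ij}$ variables. In such cases, we can simplify interval operations. Note that if $\mathbf{x}$ and $\mathbf{y}$ are intervals with $\inf \mathbf{x} \ge 0$ and $\inf \mathbf{y} \ge 0$ then

(20) $\mathbf{x} * \mathbf{y} = [\triangledown(\inf \mathbf{x} \inf \mathbf{y}),\ \triangle(\sup \mathbf{x} \sup \mathbf{y})]$

and

(21) $\mathbf{x} + \mathbf{y} = [\triangledown(\inf \mathbf{x} + \inf \mathbf{y}),\ \triangle(\sup \mathbf{x} + \sup \mathbf{y})]$.

Combining the recursion step with equations (20) and (21) leads to

$\inf \boldsymbol{\beta}_n = \inf \mathbf{A} \inf \mathbf{P}(y_{n+1}) \inf \boldsymbol{\beta}_{n+1}$,

$\sup \boldsymbol{\beta}_n = \sup \mathbf{A} \sup \mathbf{P}(y_{n+1}) \sup \boldsymbol{\beta}_{n+1}$.

Moreover, writing the last term of the recursion with (20) and (21) gives

$\inf \mathbf{L}(\boldsymbol{\theta}) = \inf \boldsymbol{\pi}^T \inf \mathbf{P}(y_1) \inf \boldsymbol{\beta}_1$

and

$\sup \mathbf{L}(\boldsymbol{\theta}) = \sup \boldsymbol{\pi}^T \sup \mathbf{P}(y_1) \sup \boldsymbol{\beta}_1$.

Algorithm 6 extends Algorithm 4 to intervals. It requires the scale function as defined by Algorithm 5.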

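Algorithm 6 Scaled interval backwards

Input: The consistent interval parameter pair $\boldsymbol{\theta}$, the interval initial distribution $\boldsymbol{\pi}$ and the set of observations $y_{1:T}$.
Output: The integer exp and the interval $\mathbf{L}$ such that $\mathbf{L}(\boldsymbol{\theta}) := \mathbf{L} \cdot 2^{exp}$.

$\beta_T \leftarrow \mathbf{1}$, exp ← 0, exp_lower ← 0;
round_mode ← $\triangledown$;
for n = T − 1, . . . , 1 do
    $\beta_n \leftarrow \inf \mathbf{A} \inf \mathbf{P}(y_{n+1}) \beta_{n+1}$;
    exp_lower ← exp_lower + scale($\beta_n$);
end for
s_lower ← $\inf \boldsymbol{\pi}^T \inf \mathbf{P}(y_1) \beta_1$;
exp_lower ← exp_lower + scale(s_lower);
$\beta_T \leftarrow \mathbf{1}$;
round_mode ← $\triangle$;
for n = T − 1, . . . , 1 do
    $\beta_n \leftarrow \sup \mathbf{A} \sup \mathbf{P}(y_{n+1}) \beta_{n+1}$;
    exp ← exp + scale($\beta_n$);
end for
s_upper ← $\sup \boldsymbol{\pi}^T \sup \mathbf{P}(y_1) \beta_1$;
exp ← exp + scale(s_upper);
s_lower ← s_lower $\cdot\ 2^{exp\_lower - exp}$;
$\mathbf{L}$ ← [s_lower, s_upper];
return (exp, $\mathbf{L}$);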
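The two-pass structure (one pass for the lower bound, one for the upper bound) can be sketched as follows. Several simplifications are assumed here and are not part of the paper: directed rounding is imitated by stepping one float outward with `math.nextafter` after every operation (Python 3.9 or later) instead of switching the hardware rounding mode, the power-of-two scaling is omitted, and the initial distribution is a point vector rather than an interval. All names are hypothetical.

```python
import math

def dn(v):
    """Lower bound for a non-negative rounded result (stand-in for rounding down)."""
    return math.nextafter(v, -math.inf) if v > 0.0 else 0.0

def up(v):
    """Upper bound for a rounded result (stand-in for rounding up)."""
    return math.nextafter(v, math.inf)

def one_sided(A, B, pi, y, rnd):
    """Backward recursion on one endpoint matrix pair, applying rnd after each step."""
    n = len(A)
    beta = [1.0] * n
    for t in range(len(y) - 1, 0, -1):
        new = []
        for i in range(n):
            acc = 0.0
            for j in range(n):
                acc = rnd(acc + rnd(rnd(A[i][j] * B[j][y[t]]) * beta[j]))
            new.append(acc)
        beta = new
    acc = 0.0
    for i in range(n):
        acc = rnd(acc + rnd(rnd(pi[i] * B[i][y[0]]) * beta[i]))
    return acc

def interval_likelihood(A_lo, A_hi, B_lo, B_hi, pi, y):
    """Enclosure [lo, hi] of L over the box: lower endpoints down, upper endpoints up."""
    return (one_sided(A_lo, B_lo, pi, y, dn), one_sided(A_hi, B_hi, pi, y, up))

A = [[0.7, 0.3], [0.2, 0.8]]
B = [[0.9, 0.1], [0.4, 0.6]]
pi = [0.5, 0.5]
y = [0, 1, 1, 0, 1]
eps = 1e-4
A_lo = [[max(0.0, v - eps) for v in r] for r in A]
A_hi = [[min(1.0, v + eps) for v in r] for r in A]
B_lo = [[max(0.0, v - eps) for v in r] for r in B]
B_hi = [[min(1.0, v + eps) for v in r] for r in B]
lo, hi = interval_likelihood(A_lo, A_hi, B_lo, B_hi, pi, y)
```

Because every quantity is non-negative, the lower bound depends only on the lower endpoints and the upper bound only on the upper endpoints, which is what allows each pass to run under a single rounding direction.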

The same analysis is valid for the derivatives of β since $\frac{\partial \beta_n}{\partial A_{ij}}$ and $\frac{\partial \beta_n}{\partial B_{ij}}$ involve only non-negative products and summations. Therefore we can easily extend Algorithm 6 to evaluate the derivatives of L(θ).

The simplification presented here remains valid for the likelihood of HMMs with a general observable process. However, it is not true that the derivative can be evaluated with 2 rounding mode switches, since the derivatives $\frac{\partial P(y_{n+1})}{\partial A_{ij}}$ and $\frac{\partial P(y_{n+1})}{\partial B_{ij}}$ are not necessarily positive.

6. Numerical experiments. We apply Algorithm 2 to study the non-asymptotic behavior of the likelihood function. The experiment we perform is as follows. Let θ = (A, B) be a feasible parameter pair of HMM(N, M). We generate the interval parameter pair $\boldsymbol{\theta}$ with maximum diameter ε by the following rule:

(22) $\mathbf{A}_{ij} = [\max(0, A_{ij} - \varepsilon),\ \min(1, A_{ij} + \varepsilon)]$

for i, j = 1:N and

(23) $\mathbf{B}_{ij} = [\max(0, B_{ij} - \varepsilon),\ \min(1, B_{ij} + \varepsilon)]$

for i = 1:N and j = 1:M.

We also generate K sequences of observations from θ, each of size T. For each sequence of observations we apply Algorithm 2 to $\boldsymbol{\theta}$ and count the number of times it returns true. If we gradually increase the sample size of the observations, the consistency of the MLE guarantees that the proportion of successes will tend to 1. Therefore, the elimination test allows us to estimate a lower bound on the sample size needed to recover θ. See Algorithm 7.

We apply Algorithm 7 in a number of different scenarios. We divide the rest of this section as follows. Subsection 6.1 presents the results for the case where the matrices A and B have no zero entries. Subsection 6.2 discusses the case where A is a random matrix and B is the identity matrix. In Subsection 6.3 we present the results for hidden Markov models where A is a diagonal upper triangular matrix. Finally, Subsection 6.4 generalizes the results obtained in this paper to models with Poisson and normal outcomes.

6.1. General Hidden Markov models. Assume that the matrices A and B have no zero entries. We run the counting experiment described in Algorithm 7 with the number of hidden states N ranging in 2:8, the number of observable values M in N:10, K = 1000 generated samples, R = 30 resamples and diameter ε = 10⁻⁴. Table 2 displays the sample size needed to recover θ with a given probability. The table must be read as follows. For hidden Markov models with N hidden states and M observable values, the average size needed to recover θ in 50% of the experiments is given in column 3 and the standard deviation in column 4. The same holds for the 75% and 95% levels. Therefore, for models of the form HMM(2,2) the average sample size needed to recover θ with probability 50% is 209 and to recover the parameters with probability 95% is 376.

Algorithm 7 Counting experiment

Input: The numbers N and M, the sample size T, the number of sequences K, the resample parameter R and the tolerance ε.
Output: The vector V of size R with proportions C/K, where C is the number of times Algorithm 2 returned true.

V ← zeros(1, R);
for r = 1, . . . , R do
    Generate the parameter pair θ with N hidden states and M observable values;
    Generate $\boldsymbol{\theta}$ from equations (22) and (23);
    C ← 0;
    for k = 1, . . . , K do
        Generate a sequence of size T from θ;
        result ← elimination_test($\boldsymbol{\theta}$);
        if result == true then
            C ← C + 1;
        end if
    end for
    V(r) ← C/K;
end for
return V;

We note that at least 320 observations are needed, on average, to achieve 95% of successes for any hidden Markov model tested. All results in the first 8 columns of Table 2 were taken from hidden Markov models where the initial distribution π satisfies $\sum_{i=1}^{N} \pi_i = 1$. We perform a second experiment where samples were taken from models with the initial distribution π equal to the invariant measure of A. To evaluate π in this case we solve the following linear system (see Zucchini and MacDonald (2009))

$(I - A^T + U)\pi = \mathbf{1}$

where I is the identity matrix and U is the N × N matrix of ones. The last column of Table 2 (p) displays the p-value of the Kolmogorov-Smirnov hypothesis test that compares purely random π and invariant measures. Table 3 presents the differences between both experiments in those cases where p < 5%.
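The linear system for the invariant measure can be solved with any dense solver. The sketch below uses a small hand-written Gaussian elimination so that it has no dependencies; the function names `solve` and `stationary` and the example matrix are illustrative and not taken from the paper.

```python
def solve(M, b):
    """Solve M x = b by Gaussian elimination with partial pivoting."""
    n = len(M)
    aug = [row[:] + [b[k]] for k, row in enumerate(M)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(aug[r][c]))   # pivot row
        aug[c], aug[p] = aug[p], aug[c]
        for r in range(c + 1, n):
            f = aug[r][c] / aug[c][c]
            for k in range(c, n + 1):
                aug[r][k] -= f * aug[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):                           # back substitution
        s = sum(aug[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (aug[r][n] - s) / aug[r][r]
    return x

def stationary(A):
    """Invariant measure of A from (I - A^T + U) pi = 1, U the matrix of ones."""
    n = len(A)
    M = [[(1.0 if i == j else 0.0) - A[j][i] + 1.0 for j in range(n)]
         for i in range(n)]
    return solve(M, [1.0] * n)

A = [[0.7, 0.3], [0.2, 0.8]]
pi_inv = stationary(A)
```

For this example the invariant measure is (0.4, 0.6), which can be verified by hand from π = Aᵀπ together with the normalization of π.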

We compare the rigorous approach in models of the form HMM(2,2) with the heuristic method presented in Section 3. We run Algorithm 1 with θ generated at random and with $A_{ij} > 0.1$ and $B_{ij} > 0.1$ for i, j = 1:2. We set T ∈ {300, 600, . . . , 2100}, R = 30 and τ = 10⁻⁴.

The rigorous approach proposed in Algorithm 7 is conservative when compared to the heuristic one presented in Algorithm 1. Note that with 378 observations, the rigorous approach states that we would expect a 95% probability of recovering θ. On the other hand, the heuristic method gives only a 50% probability with 300 observations. This is due to the fact that the rigorous method relies only on first-order information about the likelihood function. We also observe that even the conservative estimate is greater than the length of several time series used in applications.

Figure 3. (a) Average and standard deviation of $\|\theta^* - \theta\|_\infty$ for the outcomes of Algorithm 1 with T = {300, 600, . . . , 2100}, R = 30 and τ = 10⁻⁴. [Plot not reproduced; horizontal axis: sample size, from 300 to 2100.]

We finish the subsection by noting that the number of observations needed to recover θ decreases as the number of hidden states increases.

6.2. Markov chains. A Markov chain is a special case of a hidden Markov model where M = N and the matrix B is known to be the identity. We perform the counting experiment of Algorithm 7 with the number of states ranging over N = 2 : 8, the number of generated samples K = 1000, the number of resamples R = 30 and the diameter ε = 10⁻⁴. In this case, the matrix A is generated at random and we generate the interval matrix $\mathbf{A}$ following (22). On the other hand, the interval matrix $\mathbf{B}$ is given by

$$\mathbf{B}_{ij} = [1 - \varepsilon, 1], \quad i = j,$$

and

$$\mathbf{B}_{ij} = [0, \varepsilon], \quad i \neq j.$$

We apply the elimination test described in Algorithm 2 to θ = (A, B).
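The interval matrix B used for Markov chains can be sketched in a few lines of Python. Each entry is stored as a plain (lower, upper) tuple; the function name `identity_enclosure` and the argument name `eps` (the diameter, 10⁻⁴ in this experiment) are assumptions of this sketch, and no outward rounding is performed, so it illustrates the construction but is not a substitute for an interval library.

```python
def identity_enclosure(n, eps):
    """Return an n x n matrix of (lower, upper) pairs enclosing the identity.

    Diagonal entries are [1 - eps, 1]; off-diagonal entries are [0, eps].
    """
    return [[(1.0 - eps, 1.0) if i == j else (0.0, eps) for j in range(n)]
            for i in range(n)]


# Three states and the diameter used in the experiment described above.
B_box = identity_enclosure(3, 1e-4)
```

Every entry of the identity matrix lies inside the corresponding interval, so the box contains the emission matrix of a Markov chain.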

Table 4 must be read as noted in Subsection 6.1. The table shows that the average sample size needed to recover θ is greater for Markov chains than that observed for hidden Markov models.

Table 2

Minimum sample size needed to recover θ. Av stands for the average number needed to recover θ with a given probability. Sd is the standard deviation of the experiment described in Subsection 6.1. The last column p displays the p-value of the Kolmogorov-Smirnov test for stochastic and invariant measure hidden Markov models.

N   M  Av(50%)  Sd(50%)  Av(75%)  Sd(75%)  Av(95%)  Sd(95%)     p
2   2    209.2    31.51    288.2    36.85    376.2    46.28  0.05
2   3    297.6    59.79    371.0    75.64    458.0   102.25  0.23
2   4    282.0    11.55    336.6    16.63    396.0    17.50  0.00
2   5    321.0    38.86    377.4    51.30    442.6    55.32  0.01
2   6    306.4    12.54    354.0    22.55    413.6    34.20  0.12
2   7    346.0    58.38    410.4    88.47    497.8   136.08  0.00
2   8    392.4   187.09    452.2   226.38    512.4   250.05  0.12
2   9    302.8    11.46    340.2     9.63    389.6    13.91  0.02
2  10    324.8    16.86    367.2    26.77    416.6    38.80  0.64
3   3    262.4    13.70    316.2    17.22    378.8    13.17  0.41
3   4    300.2    27.02    352.0    33.20    421.6    49.28  0.00
3   5    295.0    11.99    344.2    17.95    397.6    25.62  0.02
3   6    309.4    15.63    350.0    17.20    401.8    25.61  0.41
3   7    320.6     8.93    362.8    13.00    413.2    15.67  0.00
3   8    307.8    15.42    345.8    17.83    393.8    16.79  0.64
3   9    306.2     8.45    341.2    12.27    389.4    14.88  0.00
3  10    305.2    11.04    338.2     8.65    379.6    12.58  0.87
4   4    274.6    14.06    318.2    17.61    368.0    20.21  0.41
4   5    277.6    11.10    319.2    14.48    362.4    16.65  0.00
4   6    286.2    15.56    323.2    19.47    372.8    19.69  0.23
4   7    289.8    12.20    324.0    12.99    364.0    17.85  0.05
4   8    298.8    10.03    331.8    15.74    372.2    16.21  0.12
4   9    287.8     9.47    318.4    12.05    356.6    13.97  0.00
4  10    296.4    10.16    331.2    11.84    371.4    11.68  0.05
5   5    272.2    11.19    309.4    11.39    354.4    15.50  0.05
5   6    274.0    12.58    308.8    10.34    351.4    11.95  0.87
5   7    276.4     6.21    309.6    10.60    347.0    13.92  0.41
5   8    294.2    10.67    328.8    11.11    368.8    12.77  0.99
5   9    288.8     7.26    324.4     9.28    367.0    18.14  0.05
5  10    288.2     6.27    315.4     6.28    354.0    10.10  0.87
6   6    267.2     9.14    297.6     8.43    339.0    10.61  0.12
6   7    272.0     8.54    307.0     9.13    343.2    11.54  0.12
6   8    269.6     9.00    299.8    10.65    336.2    11.93  0.41
6   9    276.8     5.18    306.4     5.50    343.0     9.24  0.00
6  10    273.8     6.66    304.2     8.62    340.4    13.61  0.64
7   7    256.6     7.32    288.8     8.57    323.4     9.97  0.05
7   8    266.8     6.60    293.2     6.90    330.0     8.16  0.05
7   9    265.6     4.64    295.4     4.98    330.4     8.77  0.87
7  10    270.8     5.14    298.6     6.38    333.6     6.85  0.00
8   8    253.8     7.11    281.0     8.66    313.6    10.75  0.23
8   9    260.2     5.30    286.0     7.36    315.4     8.03  0.99
8  10    264.6     6.44    293.0     5.40    320.0     6.77  0.23


Table 3

Minimum sample size needed to recover θ. N is the number of hidden states, M is the number of observable values, Av stands for the average number needed to recover θ with a given probability. Sd is the standard deviation of the experiment described in Subsection 6.1. Only results that differ significantly between stochastic and stationary HMMs are shown.

N   M  Av(50%)  Sd(50%)  Av(75%)  Sd(75%)  Av(95%)  Sd(95%)
2   4    344.8    58.28    407.4    73.53    485.2    86.73
2   5    290.8    13.90    338.2    18.14    395.2    24.52
2   7    312.8    22.04    364.2    31.48    423.2    46.59
2   9    315.2    24.00    354.2    28.53    410.2    34.08
3   4    284.8    11.59    332.2     9.36    385.6    11.58
3   5    291.2     9.71    334.0    11.90    381.2    13.41
3   7    300.8    10.77    337.2    12.25    386.0    19.26
3   9    299.6     6.91    335.8    11.06    377.4    16.21
4   5    281.6     6.41    321.8    10.69    375.6    11.84
4   9    296.2     6.96    329.4     7.12    369.2     7.73
6   9    271.0     5.20    298.6     4.45    333.6     7.43
7  10    267.6     6.31    295.2     8.23    325.2     9.52

Table 4

Minimum sample size needed to recover θ for Markov chains. N is the number of hidden states, Av stands for the average number needed to recover θ with a given probability. Sd is the standard deviation of the experiment described in Subsection 6.2.

N  Av(50%)  Sd(50%)  Av(75%)  Sd(75%)  Av(95%)  Sd(95%)
2    414.2   165.31    505.4   193.44    601.0   254.24
3    542.8   207.50    623.4   255.79    722.4   340.12
4    532.4    56.86    600.6    75.30    668.8   103.06
5    652.6   117.08    764.2   201.41    911.8   368.84
6   1785.0  1184.69   2337.0  1448.58   2646.2  1403.31
7   1172.2   618.91   1543.6   941.67   1758.6   920.93
8   1979.8   495.76   2585.4   285.61   2893.0    29.76

6.3. Forward chains. We say that a Markov chain is a forward chain if $a_{i,i}$ and $a_{i,i+1}$ are the only non-zero entries of the transition matrix A. There are several applications of forward chains, as described in Rabiner (1989) and Leroux (1992). In this section we run the counting experiment on hidden Markov models where the hidden process is a forward chain. We generate the interval matrix A as follows

$$\mathbf{A}_{ij} = [\max(0, A_{ij} - \varepsilon), \min(1, A_{ij} + \varepsilon)],$$