NON-ASYMPTOTIC BEHAVIOR OF MAXIMUM LIKELIHOOD ESTIMATORS IN HIDDEN MARKOV
MODELS
By Tiago Montanher∗, and Arnold Neumaier Faculty of Mathematics, University of Vienna
Abstract
This paper studies the maximum likelihood estimator of discrete time and finite state-space hidden Markov models in the absence of asymptotic assumptions. We use interval analysis to derive a condition that must be satisfied in every box containing local maxima of the likelihood function.
We also present an efficient way to evaluate the interval extension of the likelihood function and its derivatives. Our approach requires only two rounding mode switches, regardless of the number of observations or parameters in the model. The new condition is used to empirically determine the minimum sample size needed to reliably recover the parameters θ of a model from its observations. Numerical experiments show that for models with hidden and observable states ranging from two to eight, at least 300 observations are needed to recover θ with 95% probability. The code necessary to perform the numerical experiments is available as a C++ library as well as a Matlab toolbox.
1. Introduction. Hidden Markov models (HMMs) form a large class of stochastic processes with a wide range of applications in applied mathematics and statistics. A hidden Markov model describes the relation between two stochastic processes. In the simplest form, the first process is a non-observable Markov chain that is assumed to be discrete in time and to have a finite number of states. The second process is observable and its distribution is given, at each moment, by the current state of the Markov chain. In this paper we are mainly interested in the case where the observable process assumes a finite number of states. However, as we will show in two examples, our results can be extended to more general processes. For a comprehensive introduction to hidden Markov models the interested reader should see
∗This research was supported through the research grant number 205557/2014-7 of the National Council for Scientific and Technological Development of Brazil (CNPq).
MSC 2010 subject classifications: Primary 62M05, 60J10; secondary 65G40
Keywords and phrases: Hidden Markov models, Interval arithmetic, Non-asymptotic analysis
T. MONTANHER AND A. NEUMAIER
Cappé, Moulines and Rydén (2005) and Zucchini and MacDonald (2009).
We denote the set of indices $1, \dots, N$ by $1:N$. We write $A_{i:}$ to represent the $i$th row of the matrix $A$ and $A_{:j}$ to denote the $j$th column. We denote by $\mathrm{Diag}(v)$ the diagonal matrix whose non-zero entries are given by the vector $v$. The transpose of the matrix $A$ is denoted by $A^T$. If $x \in \mathbb{R}^n$ and $\epsilon > 0$ then the set
$$x(\epsilon) := \{y \in \mathbb{R}^n \mid |y_i - x_i| < \epsilon,\ i = 1:n\}$$
is called the $\epsilon$-neighborhood of $x$.
Let $\{y_n\} := \{y_n,\ n = 1, 2, \dots\}$ be a time series assuming values on the set $1:M$ and $\{x_n\} := \{x_n,\ n = 1, 2, \dots\}$ a discrete time Markov chain on the state-space $1:N$. The pair of stochastic processes $\{(x_n, y_n)\}$ is a hidden Markov model if $x_n$ is not observable and
$$\Pr(x_{n+1} \mid x_{1:n}) = \Pr(x_{n+1} \mid x_n), \tag{1}$$
$$\Pr(y_n \mid y_{1:n-1}, x_{1:n}) = \Pr(y_n \mid x_n), \tag{2}$$
where $\Pr(\cdot \mid \cdot)$ denotes conditional probabilities. We call $\{x_n\}$ the hidden process and $\{y_n\}$ the observable process. We denote a hidden Markov model with $N$ states in the hidden process and $M$ states in the observable process by $HMM(N, M)$. Arranging equations (1) and (2) in matrices we identify the hidden Markov model with the parameter pair $\theta = (A, B)$ where
$$A_{ij} := \Pr(x_{n+1} = j \mid x_n = i), \quad i, j = 1:N,$$
and
$$B_{ij} := \Pr(y_n = j \mid x_n = i), \quad i = 1:N,\ j = 1:M.$$
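An important problem when modeling with HMMs is parameter estimation. Given $N$, $M$ and the observations $y_{1:T}$, find $\theta = (A, B)$ that solves the following problem
$$\max_{\theta}\ z := L_{y_{1:T}}(\theta) \quad \text{s.t. (4)-(6)} \tag{3}$$
$$\sum_{j=1}^{N} A_{ij} = 1 \quad \text{for } i = 1:N, \tag{4}$$
$$\sum_{j=1}^{M} B_{ij} = 1 \quad \text{for } i = 1:N, \tag{5}$$
$$A_{ij},\ B_{ij} \ge 0, \tag{6}$$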
where $L_{y_{1:T}}(\theta)$ is the likelihood function. We say that a point θ that solves (3) is a maximum likelihood estimator of the model. Two common approaches to solve (3) are the Baum-Welch procedure, described in Baum et al. (1970) and Zucchini and MacDonald (2009), and Newton-type methods, as proposed in Cappé and Moulines (2005), Bulla, Bulla and Nenadić (2010) and Turner (2008). Note that the estimation problem has several local maximizers, as pointed out by Rabiner (1989), and to the best of our knowledge there is no method available in the literature that solves the estimation problem from a rigorous global optimization point of view.
The consistency of the maximum likelihood estimator for hidden Markov models was proved by Baum and Petrie (1966) and Leroux (1992). Let HMM(N, M) be a hidden Markov model with the parameter pair θ = (A, B) and y1:T a time series taken from this model. Consistency of the maximum likelihood estimator means that we recover θ by finding the global maximizers of (3) if T is sufficiently large. We note that consistency results are typically asymptotic and give no indication of the sample size T needed to obtain a recovery.
This work aims to give empirical lower bounds on the sample size T necessary to recover the parameters θ. Our approach is based on methods from rigorous computation. We use interval arithmetic (Moore, Kearfott and Cloud (2009), Neumaier (1990), Kearfott (1996)) and the simplex structure of constraints (4)-(6) to derive a condition that must hold in every box that contains local maxima of (3). We apply the new test in the following procedure:
1. Set a counter variable c to zero;
2. Given N and M, generate a random HMM(N, M) with parameters θ = (A, B);
3. Generate observations y1:T from the model;
4. Given a constant ε > 0, build the ε-neighborhood θ(ε) of θ;
5. Apply the test to θ(ε). If it is not able to prove that θ(ε) has no local maxima, set c to c + 1;
6. Repeat steps 3-5 J times.
After completion, c/J gives a lower bound on the recovery ratio of θ with samples of size T. We run the procedure above a large number of times and with different values of T. Moreover, we consider different types of hidden Markov models while generating θ. Specifically, we consider general HMMs, models where the hidden process is a forward chain, Markov chains (the special case where B is known to be the identity matrix), as well as models with Poisson and normal outcomes.
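Steps 2 and 3 of the procedure can be sketched as follows. This is only an illustration and not the authors' C++ or Matlab code: the helper names `random_stochastic_matrix` and `sample_observations`, the uniform initial distribution and the 0-based symbols are all assumptions made for the example.

```python
import random

def random_stochastic_matrix(rows, cols, rng):
    """Return a rows x cols matrix with non-negative rows that sum to 1."""
    matrix = []
    for _ in range(rows):
        weights = [rng.random() + 1e-12 for _ in range(cols)]
        total = sum(weights)
        matrix.append([w / total for w in weights])
    return matrix

def sample_observations(A, B, pi, T, rng):
    """Draw y_1..y_T (0-based symbols) from the HMM (A, B) with initial law pi."""
    states = list(range(len(A)))
    symbols = list(range(len(B[0])))
    x = rng.choices(states, weights=pi)[0]
    y = []
    for _ in range(T):
        y.append(rng.choices(symbols, weights=B[x])[0])   # emit from state x
        x = rng.choices(states, weights=A[x])[0]          # move the hidden chain
    return y

rng = random.Random(0)
N, M, T = 3, 4, 50
A = random_stochastic_matrix(N, N, rng)
B = random_stochastic_matrix(N, M, rng)
pi = [1.0 / N] * N            # uniform initial distribution (an assumption)
y = sample_observations(A, B, pi, T, rng)
```

The generated pair (A, B) satisfies constraints (4)-(6) by construction, so it is a feasible parameter pair in the sense used later in the paper.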
We call the procedure above rigorous because it is based on interval arithmetic. We also present a heuristic approach to assign probabilities of recovering θ depending on T. We compare both methods and show that the rigorous one is more conservative, underestimating the value of T. On the other hand, we show that the minimum sample size required to recover θ with 95% probability is 300. We note that this is greater than the sample size of several time series used in applications.
We outline this paper as follows. In Section 2 we introduce the basics of interval arithmetic, the main tool of our analysis. Section 3 describes the heuristic approach. Section 4 presents the elimination test necessary for step 5 of the algorithm above. An efficient and tight interval extension of the likelihood function is the subject of Section 5. We conclude with numerical experiments in Section 6 and compare the heuristic approach with the rigorous one. The code used to perform the numerical experiments is available as a C++ library and as a Matlab toolbox upon request to the first author.
2. Interval Analysis. This section introduces interval arithmetic, a numerical analysis technique that allows rigorous computations. Interval arithmetic is a natural tool to design global optimization algorithms, as described in Kearfott (1996), Hansen and Walster (2004) and Domes (2009). It has been used in statistics for parameter estimation of several models; see Wright and Kennedy (2000), Ahn, Kim and Chen (2012) and Algahtani (2011). A comprehensive approach to the topics covered here is given in Neumaier (1990). We mostly follow the notation established in Kearfott et al. (2010).
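2.1. Definitions. Let $\underline{a}, \overline{a} \in \mathbb{R}$ with $-\infty < \underline{a} \le \overline{a} < \infty$; then $\mathbf{a} = [\underline{a}, \overline{a}]$ denotes a real interval with $\inf \mathbf{a} = \underline{a}$ and $\sup \mathbf{a} = \overline{a}$. The set of all compact intervals is denoted by
$$\mathbb{IR} := \{[\underline{a}, \overline{a}] \mid \underline{a} \le \overline{a},\ \underline{a}, \overline{a} \in \mathbb{R}\}.$$
An interval vector $\mathbf{x}$, or box, is the Cartesian product of the compact real intervals $\mathbf{x}_i := [\underline{x}_i, \overline{x}_i] \in \mathbb{IR}$, representing an axis-parallel box in $\mathbb{R}^n$. We define interval matrices in a similar way and denote the set of all compact real interval matrices by $\mathbb{IR}^{n \times m}$.
The width of the interval $\mathbf{a} = [\underline{a}, \overline{a}]$ is given by $\mathrm{wid}\,\mathbf{a} = \overline{a} - \underline{a}$ and the norm by $\|\mathbf{a}\| = \max(|\inf \mathbf{a}|, |\sup \mathbf{a}|)$. An interval is called degenerate if $\mathrm{wid}\,\mathbf{a} = 0$. We also denote the midpoint and radius of $\mathbf{a}$ respectively by $\hat{a} = (\underline{a} + \overline{a})/2$ and $\check{a} = (\overline{a} - \underline{a})/2$. Operations defined for intervals are interpreted component-wise when applied to boxes or matrices.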
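Let $\mathbf{a}, \mathbf{b} \in \mathbb{IR}$. The elementary real operations $\circ \in \{+, -, /, *\}$ are extended to interval arguments $\mathbf{a}$, $\mathbf{b}$ by defining the result of an elementary interval operation to be the set of real numbers that results from combining any two numbers contained in $\mathbf{a}$ and $\mathbf{b}$. Formally,
$$\mathbf{a} \circ \mathbf{b} := \{a \circ b \mid a \in \mathbf{a},\ b \in \mathbf{b},\ \text{and } a \circ b \text{ is defined}\}$$
for $\circ \in \{+, -, /, *\}$. Elementary interval operations have the so-called inclusion property, that is,
$$\mathbf{a} \subseteq \mathbf{a}',\ \mathbf{b} \subseteq \mathbf{b}' \Rightarrow \mathbf{a} \circ \mathbf{b} \subseteq \mathbf{a}' \circ \mathbf{b}'.$$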
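We call the function $\mathbf{f} : \mathbb{IR}^n \to \mathbb{IR}$ an inclusion function of $f : \mathbb{R}^n \to \mathbb{R}$ if $\mathbf{x} \subseteq \mathbf{y} \Rightarrow \mathbf{f}(\mathbf{x}) \subseteq \mathbf{f}(\mathbf{y})$. We also say that the interval function $\mathbf{f} : \mathbb{IR}^n \to \mathbb{IR}$ is the interval extension of $f : D \subseteq \mathbb{R}^n \to \mathbb{R}$ if
$$\mathbf{f}(x) = f(x) \text{ for all } x \in D, \qquad f(x) \in \mathbf{f}(\mathbf{x}) \text{ for all } x \in \mathbf{x} \subseteq D.$$
Let $\mathbf{x} \in \mathbb{IR}^n$ and $f : D \subseteq \mathbb{R}^n \to \mathbb{R}$. We define the set $\square f(\mathbf{x}) := \{f(x) \mid x \in \mathbf{x} \cap D\}$ and call it the range of $f$ over $\mathbf{x}$. We extend the range to a function on $\mathbb{IR}^n$ by $f(\mathbf{x}) := \square f(\mathbf{x})$, also called the range of $f$.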
One of the most important results in rigorous computations is the fundamental theorem of interval arithmetic, which is proved in Moore (1966) and Moore, Kearfott and Cloud (2009).
Theorem 2.1 (Fundamental theorem of interval analysis). If $\mathbf{f}$ has the inclusion property and is the interval extension of $f$ then $\square f(\mathbf{x}) \subseteq \mathbf{f}(\mathbf{x})$.
As shown in Moore, Kearfott and Cloud (2009), if the interval function $\mathbf{F} : \mathbb{IR}^n \to \mathbb{IR}^m$ is composed of functions that satisfy the inclusion property then $\mathbf{F}$ also has the inclusion property. Formally,
Proposition 2.1. If $\mathbf{F} : \mathbb{IR}^n \to \mathbb{IR}^m$ and $\mathbf{G} : \mathbb{IR}^m \to \mathbb{IR}^p$ are interval functions with the inclusion property then $\mathbf{G}(\mathbf{F}(\mathbf{x}))$ has the inclusion property.
We call the interval function $\mathbf{f} : \mathbb{IR}^n \to \mathbb{IR}^n$ the natural extension of $f : \mathbb{R}^n \to \mathbb{R}^n$ if $\mathbf{f}$ is obtained by replacing real operations by interval ones.
The next proposition states that the natural extension of a polynomial has the inclusion property.
Proposition 2.2. If $f : \mathbb{R}^n \to \mathbb{R}$ is a polynomial and $\mathbf{f}$ is its natural extension then $\mathbf{f}$ has the inclusion property.
Proof. The result follows from Proposition 2.1, since f is a composition of elementary operations and the natural extension of each elementary operation has the inclusion property.
Throughout this paper, all interval functions are natural extensions of polynomials and therefore have the inclusion property.
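The definitions of this subsection can be illustrated with a small sketch. The class `Interval` and the polynomial f(x) = x^2 - x are illustrative choices, not part of the paper, and rounding errors are deliberately ignored here (they are the subject of the next subsection). The example shows that the natural extension encloses the true range, as Theorem 2.1 states, although it may overestimate it.

```python
class Interval:
    """Closed interval [lo, hi]; rounding errors are ignored in this sketch."""

    def __init__(self, lo, hi):
        assert lo <= hi
        self.lo, self.hi = lo, hi

    def __add__(self, other):
        return Interval(self.lo + other.lo, self.hi + other.hi)

    def __sub__(self, other):
        return Interval(self.lo - other.hi, self.hi - other.lo)

    def __mul__(self, other):
        p = [self.lo * other.lo, self.lo * other.hi,
             self.hi * other.lo, self.hi * other.hi]
        return Interval(min(p), max(p))

    def contains(self, other):
        return self.lo <= other.lo and other.hi <= self.hi

def f(x):
    # Same expression for reals and intervals: the natural extension.
    return x * x - x

x = Interval(0.0, 1.0)
enclosure = f(x)                      # [0,1]*[0,1] - [0,1] = [-1, 1]
true_range = Interval(-0.25, 0.0)     # exact range of x^2 - x over [0, 1]
sub_enclosure = f(Interval(0.25, 0.5))
```

Here `enclosure` contains `true_range`, and evaluating f on the sub-box [0.25, 0.5] gives an interval contained in `enclosure`, which is the inclusion property.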
2.2. Computational issues. We implement interval arithmetic over M(F) instead of R, where M(F) is the finite set of numbers representable in a machine with floating-point type F. Typically F is float or double in C-like languages.
The following example illustrates the problem that occurs with rigorous calculations on M(F).
Figure 1. The need of switching the rounding mode.
Suppose that the result of a calculation is the non-representable number r in Figure 1. Assume that x and y are representable numbers, i.e., x, y ∈ M(F). If we do not set the rounding mode properly, we may return the interval r = [y, y+M] as the enclosure of r, which destroys rigor. On the other hand, if we are cautious and evaluate inf r carefully, we obtain r = [x, y+M], which is a rigorous enclosure of r.
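The IEEE 754 standard (IEEE (1985)) for floating point operations is implemented in almost every processor nowadays. It establishes three main rounding policies: to nearest (which in Figure 1 is the point y), upward (y again) and downward (which gives x). We denote downward operations by $\bigtriangledown(\cdot)$ and upward operations by $\bigtriangleup(\cdot)$. If $\mathbf{x} = [\underline{x}, \overline{x}]$ and $\mathbf{y} = [\underline{y}, \overline{y}]$ then the elementary operations are given by
$$\mathbf{x} + \mathbf{y} = [\bigtriangledown(\underline{x} + \underline{y}),\ \bigtriangleup(\overline{x} + \overline{y})],$$
$$\mathbf{x} - \mathbf{y} = [\bigtriangledown(\underline{x} - \overline{y}),\ \bigtriangleup(\overline{x} - \underline{y})],$$
$$\mathbf{x} * \mathbf{y} = [\bigtriangledown(\min\{\underline{x}\underline{y}, \underline{x}\overline{y}, \overline{x}\underline{y}, \overline{x}\overline{y}\}),\ \bigtriangleup(\max\{\underline{x}\underline{y}, \underline{x}\overline{y}, \overline{x}\underline{y}, \overline{x}\overline{y}\})],$$
$$\frac{1}{\mathbf{y}} = \left[\bigtriangledown\!\left(\frac{1}{\overline{y}}\right),\ \bigtriangleup\!\left(\frac{1}{\underline{y}}\right)\right], \quad 0 \notin \mathbf{y}.$$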
Note that interval multiplication requires 4 real multiplications and 2 rounding mode switches. Moreover, a rounding mode switch costs on average 13 times more than a floating point multiplication in double precision in C++. This number shows that we should avoid rounding mode switches whenever possible. To evaluate standard functions with specified rounding policies, see Fousse et al. (2007).
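The effect of directed rounding can be imitated without touching the rounding mode. The sketch below is a substitute technique, not the method of this paper: it computes each endpoint in round-to-nearest and then widens it by one unit in the last place with `math.nextafter` (available in Python 3.9 and later). The function names are hypothetical, and the resulting enclosures are valid but slightly wider than those obtained with true downward and upward rounding.

```python
import math
from fractions import Fraction

def add_outward(x, y):
    """Enclose x + y for intervals given as (lo, hi) pairs."""
    lo = math.nextafter(x[0] + y[0], -math.inf)   # push lower bound down
    hi = math.nextafter(x[1] + y[1], math.inf)    # push upper bound up
    return lo, hi

def mul_outward_nonneg(x, y):
    """Enclose x * y when both intervals have non-negative lower bounds."""
    assert x[0] >= 0.0 and y[0] >= 0.0
    lo = max(0.0, math.nextafter(x[0] * y[0], -math.inf))
    hi = math.nextafter(x[1] * y[1], math.inf)
    return lo, hi

a, b = (0.1, 0.1), (0.2, 0.2)
s = add_outward(a, b)
p = mul_outward_nonneg(a, b)

# Exact rational values of the operations on the stored doubles.
exact_sum = Fraction(0.1) + Fraction(0.2)
exact_prod = Fraction(0.1) * Fraction(0.2)
```

Both `s` and `p` contain the exact results `exact_sum` and `exact_prod`, which a plain round-to-nearest evaluation does not guarantee.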
We finish this section noting that there are several good implementations of interval arithmetic for a number of programming languages used in scientific computing. Recently, IEEE defined the standard IEEE 1788 for implementations of interval arithmetic (IEEE (2014)). A first implementation of the interval arithmetic proposed by the standard can be found in Nehmeier (2014).
3. Heuristic approach. This section provides a heuristic approach to determine the probability of recovering θ as a function of the number of observations T. We apply the procedure to hidden Markov models with 2 hidden states and 2 observable values. Let θ be the parameters of HMM(2, 2) and $\theta^*_T$ the solution of (3) for a given $y_{1:T}$ taken from θ. Since the maximum likelihood estimator is consistent, we have
$$\lim_{T \to \infty} \Pr(\|\theta^*_T - \theta\|_\infty < \epsilon) = 1 \quad \text{for any } \epsilon > 0. \tag{7}$$
This property allows us to derive a procedure where we fix the parameter pair θ, generate observations $y_{1:T}$ from θ and apply a local optimization procedure like the Baum-Welch algorithm in order to obtain the estimate θ∗ of θ.
We apply the local optimization algorithm with multiple starting points in order to increase the chance of finding the global maximizer of (3). If we repeat this procedure a large number of times with increasing sample size T, Equation (7) gives a natural way to estimate the probability of recovering θ from $y_{1:T}$.
We note that the parameters of hidden Markov models are symmetric. If $\theta = (A, B)$ and $\theta' = (A', B')$ are parameter pairs such that
$$A' = \begin{pmatrix} A_{22} & A_{21} \\ A_{12} & A_{11} \end{pmatrix} \quad \text{and} \quad B' = \begin{pmatrix} B_{21} & B_{22} \\ B_{11} & B_{12} \end{pmatrix} \tag{8}$$
then $L_{y_{1:T}}(\theta) = L_{y_{1:T}}(\theta')$ for any $y_{1:T}$. Therefore one can find $\theta'$ as the global maximizer for the model with parameter pair $\theta$, which increases the norm of the difference in (7). Algorithm 1 describes the heuristic experiment taking the symmetry of HMM(2, 2) into account.
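The symmetry (8) can be checked numerically. The sketch below evaluates the likelihood with the backward recursion of Section 5.1 and compares θ with its relabelled counterpart. The numerical values, the function names and the uniform initial distribution are assumptions made for the illustration; a uniform π is used because it is unchanged when the two hidden states are swapped.

```python
def likelihood(A, B, pi, y):
    """Likelihood via the backward recursion; y holds 0-based symbols."""
    N = len(A)
    beta = [1.0] * N
    for n in range(len(y) - 1, 0, -1):
        g = [B[j][y[n]] * beta[j] for j in range(N)]
        beta = [sum(A[i][j] * g[j] for j in range(N)) for i in range(N)]
    return sum(pi[i] * B[i][y[0]] * beta[i] for i in range(N))

def swap_states(A, B):
    """Apply the relabelling (8) of the two hidden states."""
    A2 = [[A[1][1], A[1][0]], [A[0][1], A[0][0]]]
    B2 = [B[1][:], B[0][:]]
    return A2, B2

A = [[0.9, 0.1], [0.3, 0.7]]
B = [[0.8, 0.2], [0.25, 0.75]]
pi = [0.5, 0.5]                  # uniform, hence invariant under the swap
y = [0, 1, 1, 0, 1, 0, 0, 1]

A2, B2 = swap_states(A, B)
L1 = likelihood(A, B, pi, y)
L2 = likelihood(A2, B2, pi, y)   # agrees with L1 up to rounding
```

Although `L1` and `L2` coincide, the distance between (A, B) and (A2, B2) in the max norm is 0.6 for A and 0.55 for B, which is why Algorithm 1 takes the minimum over both labellings.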
Algorithm 1 Heuristic approach
Input: The parameter θ, the sample size T, the resample factor R and the tolerance τ.
Output: The vector D of norm differences.
for n = 1, ..., R do
    Generate a sequence y1:T of size T from θ;
    df_min ← −∞;
    i ← 0;
    i_min ← 0;
    θ0 ← θ;
    while true do
        i ← i + 1;
        θ∗ ← Baum_Welch(θ0, τ);
        df ← Ly1:T(θ∗);
        if df > df_min then
            df_min ← df;
            i_min ← i;
            Apply (8) to θ∗ and store the result in θ';
            D(n) ← min(‖θ − θ∗‖∞, ‖θ − θ'‖∞);
        end if
        if i > i_min + 10 then
            break;
        end if
        θ0 ← generate_random_model();
    end while
end for
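4. Rigorous approach. Here and throughout this paper, we denote interval parameters of hidden Markov models by $\boldsymbol{\theta} = (\mathbf{A}, \mathbf{B})$. We say that the parameter pair $\theta = (A, B)$ is feasible if (4)-(6) hold for $\theta$, and the interval parameter pair $\boldsymbol{\theta}$ is consistent if at least one element $\theta \in \boldsymbol{\theta}$ is feasible.
Moreover, given $\boldsymbol{\theta}$ and the $i$th row of the interval matrix $\mathbf{A}$, denoted by $\mathbf{A}_{i:}$, we define the set of entries that may contain zero by $I_{A_{i:}}(\boldsymbol{\theta})$. Formally we write
$$I_{A_{i:}}(\boldsymbol{\theta}) := \{j \in 1:N \mid \inf \mathbf{A}_{ij} = 0\} \quad \text{for all } i = 1:N.$$
The complement of $I_{A_{i:}}(\boldsymbol{\theta})$ is $\overline{I}_{A_{i:}}(\boldsymbol{\theta}) := 1:N \setminus I_{A_{i:}}(\boldsymbol{\theta})$. We also define $I_{B_{i:}}(\boldsymbol{\theta})$ and $\overline{I}_{B_{i:}}(\boldsymbol{\theta})$ in a similar way for every row $\mathbf{B}_{i:}$ of $\mathbf{B}$.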
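Proposition 4.1. If $\theta$ is a feasible parameter pair then there exist $\epsilon > 0$ and an interval pair $\boldsymbol{\theta}$ such that $\theta \in \boldsymbol{\theta}$, $\max(\mathrm{wid}\,\boldsymbol{\theta}) \ge \epsilon$ and both $\overline{I}_{A_{i:}}(\boldsymbol{\theta})$ and $\overline{I}_{B_{i:}}(\boldsymbol{\theta})$ are non-empty for each $i = 1:N$.
Proof. Since $\theta = (A, B)$ is feasible, at least one entry of each row $A_{i:}$ and $B_{i:}$ is greater than zero. Let $\boldsymbol{\theta}$ be the interval parameter pair given by
$$\mathbf{A}_{ij} = \left[\max\left(0,\ A_{ij} - \frac{\|A_{i:}\|}{N}\right),\ \min\left(1,\ A_{ij} + \frac{\|A_{i:}\|}{N}\right)\right]$$
and
$$\mathbf{B}_{ij} = \left[\max\left(0,\ B_{ij} - \frac{\|B_{i:}\|}{M}\right),\ \min\left(1,\ B_{ij} + \frac{\|B_{i:}\|}{M}\right)\right].$$
The result follows from taking $\epsilon = \max(\|A\|, \|B\|)$.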
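The result below relies on the first-order Kuhn-Tucker optimality conditions. For a review of this topic, see Nocedal and Wright (2006).
Proposition 4.2. Let $\theta^* = (A^*, B^*)$ be a feasible parameter pair. If $\theta^*$ is a local maximum of (3) then there exist vectors $\lambda^* \in \mathbb{R}^{2N}$ and $\mu^* \in \mathbb{R}^{N^2 + NM}$ such that
$$\frac{\partial L}{\partial A_{ij}}(\theta^*) - \lambda^*_{A_{i:}} - \mu^*_{A_{ij}} = 0, \quad i, j = 1:N, \tag{9}$$
$$\frac{\partial L}{\partial B_{ij}}(\theta^*) - \lambda^*_{B_{i:}} - \mu^*_{B_{ij}} = 0, \quad i = 1:N,\ j = 1:M, \tag{10}$$
$$\sum_{j=1}^{N} A^*_{ij} = 1, \quad i = 1:N, \tag{11}$$
$$\sum_{j=1}^{M} B^*_{ij} = 1, \quad i = 1:N, \tag{12}$$
$$\mu^*_{A_{ij}} A^*_{ij} = 0, \quad i, j = 1:N, \tag{13}$$
$$\mu^*_{B_{ij}} B^*_{ij} = 0, \quad i = 1:N,\ j = 1:M, \tag{14}$$
$$\mu^*_{A_{ij}},\ \mu^*_{B_{ij}} \ge 0. \tag{15}$$
Proof. Follows from applying the first-order Kuhn-Tucker conditions to problem (3).
The vectors $\lambda^*$ and $\mu^*$ are called Lagrange multipliers of $\theta^*$.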
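Theorem 4.1 (Lagrangian bounds). Let $\theta^*$ be a local maximizer of (3) associated with the vectors $\lambda^*$ and $\mu^*$. If $\boldsymbol{\theta}$ is an interval parameter pair such that $\theta^* \in \boldsymbol{\theta}$ and $\overline{I}_{A_{i:}}(\boldsymbol{\theta}) \ne \emptyset$ then
$$\boldsymbol{\lambda}_{A_{i:}} := \bigcap_{j \in \overline{I}_{A_{i:}}(\boldsymbol{\theta})} \frac{\partial L}{\partial A_{ij}}(\boldsymbol{\theta}), \tag{16}$$
$$\boldsymbol{\mu}_{A_{ij}} := [0, 0], \quad j \in \overline{I}_{A_{i:}}(\boldsymbol{\theta}), \tag{17}$$
and
$$\boldsymbol{\mu}_{A_{ij}} := [0, \infty] \cap \left(\frac{\partial L}{\partial A_{ij}}(\boldsymbol{\theta}) - \boldsymbol{\lambda}_{A_{i:}}\right), \quad j \in I_{A_{i:}}(\boldsymbol{\theta}), \tag{18}$$
are rigorous bounds for $\lambda^*_{A_{i:}}$ and $\mu^*_{A_{ij}}$.
Proof. Since $j \in \overline{I}_{A_{i:}}(\boldsymbol{\theta})$ implies that $\inf \mathbf{A}_{ij} > 0$, Equation (17) follows from (13). Let $\overline{I}_{A_{i:}}(\boldsymbol{\theta})$ be a non-empty set. Proposition 4.2 implies that
$$\frac{\partial L}{\partial A_{ij}}(\theta^*) = \lambda^*_{A_{i:}}, \quad j \in \overline{I}_{A_{i:}}(\boldsymbol{\theta}).$$
By hypothesis $\boldsymbol{\theta}$ is an interval parameter pair containing the local maximum $\theta^*$, and relation (16) follows from the inclusion property. Relation (18) follows from the definition of elementary operations applied to the interval counterpart of (9) and from constraint (15).
The same result holds for the Lagrange multipliers associated with the observable process; the proof only replaces Equations (9) and (13) by (10) and (14) respectively.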
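Corollary 4.1 (Elimination test). Let $\boldsymbol{\theta}$ be an interval parameter pair with $\overline{I}_{A_{i:}}(\boldsymbol{\theta}) \ne \emptyset$, and let $\boldsymbol{\lambda}_{A_{i:}}$ and $\boldsymbol{\mu}_{A_{ij}}$ be given by (16) and (18) respectively. If $\boldsymbol{\lambda}_{A_{i:}} = \emptyset$ or $\boldsymbol{\mu}_{A_{ij}} = \emptyset$ for any row $i$ then there is no local maximizer of (3) in $\boldsymbol{\theta}$.
Proof. Follows immediately from the bounds on the Lagrange multipliers and the fundamental theorem of interval analysis.
The elimination test relies on the fact that, for a point $\theta^*$ satisfying the conditions of Proposition 4.2,
$$\frac{\partial L}{\partial A_{ij}}(\theta^*) = \frac{\partial L}{\partial A_{ik}}(\theta^*)$$
for every $j$ and $k$ such that $A^*_{ij}$ and $A^*_{ik}$ are positive. Therefore, given $\boldsymbol{\theta}$ such that $\inf \mathbf{A}_{ij} > 0$, $\inf \mathbf{A}_{ik} > 0$ and $\frac{\partial L}{\partial A_{ij}}(\boldsymbol{\theta}) \cap \frac{\partial L}{\partial A_{ik}}(\boldsymbol{\theta}) = \emptyset$, we can discard $\boldsymbol{\theta}$. Note that the elimination test says nothing if $\overline{I}_{A_{i:}}(\boldsymbol{\theta})$ and $\overline{I}_{B_{i:}}(\boldsymbol{\theta})$ are empty for every $i = 1:N$. However, Proposition 4.1 states that given a local maximum $\theta^*$ we can always find $\boldsymbol{\theta} \ni \theta^*$ for which $\overline{I}_{A_{i:}}(\boldsymbol{\theta})$ and $\overline{I}_{B_{i:}}(\boldsymbol{\theta})$ are non-empty. Algorithm 2 applies the elimination test to $\boldsymbol{\theta}$.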
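Algorithm 2 Elimination test
Input: The box $\boldsymbol{\theta}$.
Output: false if $\boldsymbol{\theta}$ does not contain a maximizer of (3) and true if it may contain one.
if $\overline{I}_{A_i}(\boldsymbol{\theta}) = \emptyset$ and $\overline{I}_{B_i}(\boldsymbol{\theta}) = \emptyset$ for all $i = 1:N$ then
    return true;
end if
for $i = 1:N$ do
    $\boldsymbol{\lambda}_{A_i} \leftarrow \bigcap_{j \in \overline{I}_{A_i}(\boldsymbol{\theta})} \frac{\partial L}{\partial A_{ij}}(\boldsymbol{\theta})$;
    $\boldsymbol{\lambda}_{B_i} \leftarrow \bigcap_{j \in \overline{I}_{B_i}(\boldsymbol{\theta})} \frac{\partial L}{\partial B_{ij}}(\boldsymbol{\theta})$;
    if $\boldsymbol{\lambda}_{A_i} = \emptyset$ or $\boldsymbol{\lambda}_{B_i} = \emptyset$ then
        return false;
    end if
    for $j \in I_{A_i}(\boldsymbol{\theta})$ do
        $\boldsymbol{\mu}_{A_{ij}} = [0, \infty] \cap \left(\frac{\partial L}{\partial A_{ij}}(\boldsymbol{\theta}) - \boldsymbol{\lambda}_{A_i}\right)$;
        if $\boldsymbol{\mu}_{A_{ij}} = \emptyset$ then
            return false;
        end if
    end for
    for $j \in I_{B_i}(\boldsymbol{\theta})$ do
        $\boldsymbol{\mu}_{B_{ij}} = [0, \infty] \cap \left(\frac{\partial L}{\partial B_{ij}}(\boldsymbol{\theta}) - \boldsymbol{\lambda}_{B_i}\right)$;
        if $\boldsymbol{\mu}_{B_{ij}} = \emptyset$ then
            return false;
        end if
    end for
end for
return true;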
Note that the elimination test is conservative in the sense that it does not prove that θ contains local maxima. Whenever the test returns true we cannot say anything about the existence of local maxima in θ. The elimination test requires only the derivatives of the interval likelihood function. The next section presents an algorithm to evaluate such derivatives efficiently.
5. Interval Evaluation. This section addresses the problem of evaluating the likelihood function using interval arithmetic. Subsection 5.1 reviews the real case since it is the base for interval calculations. Subsection 5.2 shows that the interval likelihood function can be evaluated with only two rounding mode switches.
5.1. Real case revision. As stated by Rabiner (1989), given the feasible parameter pair $\theta = (A, B)$, the initial distribution $\pi$ for the hidden process and the sequence of observations $y_{1:T}$, the associated likelihood function is
$$L_{y_{1:T}}(\theta) := \pi^T P(y_1) A \cdots A P(y_T) \mathbf{1} \tag{19}$$
where $P(y_i) := \mathrm{Diag}(B_{:y_i})$. Whenever the sequence $y_{1:T}$ is fixed and there is no risk of confusion we write $L(\theta)$ instead of $L_{y_{1:T}}(\theta)$.
Note that (19) depends on the initial distribution vector $\pi$. Throughout this paper we assume that $\pi$ is given and therefore we do not include it in the notation of the likelihood function.
To implement the likelihood function, we use the so-called backward recursion defined by the auxiliary time series $\beta$, as follows:
$$\beta_T = \mathbf{1},$$
$$\beta_n = A P(y_{n+1}) \beta_{n+1}, \quad n = T-1, \dots, 1,$$
$$L(\theta) = \pi^T P(y_1) \beta_1.$$
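As a sanity check of the recursion, the sketch below compares it with a direct summation of the joint probability over all hidden paths, which is only feasible for very short sequences. The function names, the numerical values and the 0-based indexing of states and symbols are choices made for this illustration.

```python
from itertools import product

def likelihood_backward(A, B, pi, y):
    """L(theta) via beta_T = 1, beta_n = A P(y_{n+1}) beta_{n+1}."""
    N = len(A)
    beta = [1.0] * N
    for n in range(len(y) - 1, 0, -1):
        g = [B[j][y[n]] * beta[j] for j in range(N)]          # P(y_{n+1}) beta
        beta = [sum(A[i][j] * g[j] for j in range(N)) for i in range(N)]
    return sum(pi[i] * B[i][y[0]] * beta[i] for i in range(N))

def likelihood_bruteforce(A, B, pi, y):
    """Sum Pr(x_{1:T}, y_{1:T}) over all hidden paths; cost grows as N**T."""
    N, total = len(A), 0.0
    for path in product(range(N), repeat=len(y)):
        p = pi[path[0]] * B[path[0]][y[0]]
        for n in range(1, len(y)):
            p *= A[path[n - 1]][path[n]] * B[path[n]][y[n]]
        total += p
    return total

A = [[0.9, 0.1], [0.3, 0.7]]
B = [[0.8, 0.2], [0.25, 0.75]]
pi = [0.6, 0.4]
y = [0, 1, 1, 0, 1]

L_rec = likelihood_backward(A, B, pi, y)
L_sum = likelihood_bruteforce(A, B, pi, y)
```

The recursion needs on the order of T N^2 multiplications, whereas the brute-force sum visits N^T paths; both return the same value up to rounding.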
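We also evaluate the derivatives of $L(\theta)$ in terms of $\beta$. Note that $\frac{\partial \beta_T}{\partial A_{ij}}(\theta) = \frac{\partial \beta_T}{\partial B_{ij}}(\theta) = 0$ for every $A_{ij}$ and $B_{ij}$. Applying the chain rule to the recursion step for each variable $A_{ij}$ gives
$$\frac{\partial \beta_n}{\partial A_{ij}}(\theta) = \chi_{ij} P(y_{n+1}) \beta_{n+1} + A P(y_{n+1}) \frac{\partial \beta_{n+1}}{\partial A_{ij}}$$
where $\chi_{ij}$ is the indicator matrix with value 1 on entry $(i, j)$ and 0 elsewhere.
Doing the same for the variables $B_{ij}$ we obtain
$$\frac{\partial \beta_n}{\partial B_{ij}}(\theta) = A \frac{\partial P(y_{n+1})}{\partial B_{ij}} \beta_{n+1} + A P(y_{n+1}) \frac{\partial \beta_{n+1}}{\partial B_{ij}}$$
for $i = 1:N$ and $j = 1:M$. Finally, applying the chain rule to the last term of the recursion gives
$$\frac{\partial L}{\partial A_{ij}}(\theta) = \pi^T P(y_1) \frac{\partial \beta_1}{\partial A_{ij}},$$
$$\frac{\partial L}{\partial B_{ij}}(\theta) = \pi^T \frac{\partial P(y_1)}{\partial B_{ij}} \beta_1 + \pi^T P(y_1) \frac{\partial \beta_1}{\partial B_{ij}}.$$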
The computation of $L(\theta)$ requires a considerable number of multiplications with values in $[0, 1]$. Therefore, we need to avoid underflow while evaluating $\beta$ and its derivatives. There are several normalization procedures in the literature to handle this problem properly. A common approach (see for example Zucchini and MacDonald (2009) and Bulla, Bulla and Nenadić (2010)) is the inclusion of the auxiliary variables
$$w_n = \sum_{i=1}^{N} \beta_n(i), \qquad \phi_n = \frac{\beta_n}{w_n}.$$
Algorithm 3 was adapted from Zucchini and MacDonald (2009). It evaluates $\log L(\theta)$ using the variables $w_n$ and $\phi_n$.
Algorithm 3 log backwards evaluation
Input: The feasible parameter pair θ = (A, B), the initial distribution vector π and the set of observations y1:T.
Output: l := log L(θ).
φ_T ← 1;
l ← 0;
for n = T−1, T−2, ..., 1 do
    β ← A P(y_{n+1}) φ_{n+1};
    w ← β^T 1;
    l ← l + log(w);
    φ_n ← β / w;
end for
w ← π^T P(y_1) φ_1;
l ← l + log(w);
return l;
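A direct transcription of this normalization scheme might look as follows. It is a sketch under the same conventions as the recursion of this subsection (0-based symbols, `math.log` for the logarithm), with illustrative parameter values; it is not the authors' implementation.

```python
import math

def log_likelihood(A, B, pi, y):
    """log L(theta) with the normalised backward variables phi_n = beta_n / w_n."""
    N = len(A)
    phi = [1.0] * N
    log_l = 0.0
    for n in range(len(y) - 1, 0, -1):
        g = [B[j][y[n]] * phi[j] for j in range(N)]
        beta = [sum(A[i][j] * g[j] for j in range(N)) for i in range(N)]
        w = sum(beta)                     # normalisation constant w_n
        log_l += math.log(w)
        phi = [b / w for b in beta]
    w = sum(pi[i] * B[i][y[0]] * phi[i] for i in range(N))
    return log_l + math.log(w)

A = [[0.9, 0.1], [0.3, 0.7]]
B = [[0.8, 0.2], [0.25, 0.75]]
pi = [0.5, 0.5]

ll_short = log_likelihood(A, B, pi, [0, 1])      # L = 0.175125 by hand
ll_long = log_likelihood(A, B, pi, [0, 1] * 3000)
```

For the sequence of length 6000 the likelihood itself is far below the smallest positive double, yet `ll_long` remains finite because only the logarithms of the normalization constants are accumulated.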
We observe that Algorithm 3 requires a log evaluation at every iteration. Since we are interested in generalizing the real procedure to intervals, we should avoid standard functions because the interval evaluation of a non-elementary function typically needs at least two rounding mode switches. For the logarithmic function in C++, MPFR (Fousse et al. (2007)) allows one to compute the endpoints of the natural extension with two rounding mode switches. However, this is a costly operation. In this paper we evaluate the backward recursion with a scaled numerical type that represents numbers of the form
$$x = s \times 2^{exp}$$
where exp is an integer and s is a floating point number (typically a double precision number). Note that this is an extension of double precision. Double precision, as defined by IEEE 754, represents numbers with an exponent of 11 bits, while our scaled type allows exponents with 32 bits. Algorithms 4 and 5 show how to compute the backward recursion with the scaled type.
Algorithm 4 Scaled backwards
Input: A feasible model θ and a set of observations y1:T.
Output: An integer exp and a floating point number s such that s ∗ 2^exp = L(θ).
β_T ← 1; exp ← 0;
for n = T−1, ..., 1 do
    β_n ← A P(y_{n+1}) β_{n+1};
    exp ← exp + scale(β_n);
end for
s ← π^T P(y_1) β_1;
exp ← exp + scale(s);
return (exp, s);
Algorithm 5 relies on the ldexp and frexp functions of C/C++. The interested reader should see Lippman, Lajoie and Moo (2005).
Algorithm 5 Scale vector
Input: A vector of floating point numbers v.
Output: An integer mexp and a scaled vector v∗ of the same type and size as v, such that v = v∗ ∗ 2^mexp.
[mexp, ·] ← frexp(v(1));
for n = 2, ..., size(v) do
    [tmpexp, ·] ← frexp(v(n));
    mexp ← max(mexp, tmpexp);
end for
for n = 1, ..., size(v) do
    v∗(n) ← ldexp(v(n), −mexp);
end for
return (mexp, v∗);
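Python exposes the same two primitives as `math.frexp` and `math.ldexp`, so the scaled recursion can be sketched directly. The code below is an illustration under the conventions used earlier (0-based symbols, illustrative parameter values); skipping zero entries when choosing the common exponent is a design choice of this sketch, not something stated in the paper.

```python
import math

def scale(v):
    """Return (mexp, v_scaled) with v = v_scaled * 2**mexp, largest entry in [0.5, 1)."""
    exps = [math.frexp(x)[1] for x in v if x != 0.0]
    mexp = max(exps) if exps else 0
    return mexp, [math.ldexp(x, -mexp) for x in v]

def scaled_likelihood(A, B, pi, y):
    """Return (s, exp) such that L(theta) = s * 2**exp."""
    N = len(A)
    beta, exp = [1.0] * N, 0
    for n in range(len(y) - 1, 0, -1):
        g = [B[j][y[n]] * beta[j] for j in range(N)]
        beta = [sum(A[i][j] * g[j] for j in range(N)) for i in range(N)]
        e, beta = scale(beta)             # pull the common power of two out
        exp += e
    s = sum(pi[i] * B[i][y[0]] * beta[i] for i in range(N))
    e, (s,) = scale([s])
    return s, exp + e

A = [[0.9, 0.1], [0.3, 0.7]]
B = [[0.8, 0.2], [0.25, 0.75]]
pi = [0.5, 0.5]

s_short, e_short = scaled_likelihood(A, B, pi, [0, 1])
s_long, e_long = scaled_likelihood(A, B, pi, [0, 1] * 3000)
```

Multiplying by a power of two is exact in binary floating point (barring underflow), so the scaling introduces no rounding error of its own. For the long sequence the exponent `e_long` lies well below the double precision range, while the mantissa `s_long` stays in [0.5, 1).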
Table 1 shows that both approaches evaluate the likelihood function with equivalent accuracy. Figure 2 gives the time needed to evaluate the likelihood with Algorithms 3 and 4. We implement both algorithms in C++11 and perform the experiment on a Core i7 processor with 6 GB of RAM. For each algorithm we fix the feasible parameter pair θ = (A, B) of an HMM(2, 2) and generate sequences of size T at random. Each entry of columns 2 and 3 gives the likelihood of θ.
5.2. Fast interval evaluation. We show now that Algorithms 4 and 5 extend naturally to intervals. Consider again the step of the backward recursion,
$$\beta_n = A P(y_{n+1}) \beta_{n+1}.$$
Table 1
Comparison of accuracy between Algorithms 3 and 4 to evaluate L(θ).
T scaled log |log − scaled|
10 0.00616949 0.00616949 7.80626e-18
20 3.06144e-06 3.06144e-06 9.74088e-21
50 2.49836e-15 2.49836e-15 2.32714e-29
100 3.70022e-30 3.70022e-30 6.30584e-45
150 6.93633e-42 6.93633e-42 4.11655e-55
200 3.56547e-55 3.56547e-55 2.14438e-68
250 2.72636e-72 2.72636e-72 3.93607e-85
300 3.31647e-85 3.31647e-85 2.97748e-98
400 3.44266e-116 3.44266e-116 1.02e-129
Figure 2. Time comparison of Algorithms 3 and 4 (axis label: Time (s)).
Since $\beta_T = \mathbf{1}$ and every entry in $A$ and $P(y_{n+1})$ is a probability, each $\beta_n$ is non-negative. Moreover, the $i$th element of $\beta_n$ is given by
$$\beta_n(i) = \sum_{j=1}^{N} A_{ij} \gamma_j(y_{n+1})$$
where
$$\gamma_j(y_{n+1}) = B_{j y_{n+1}} \beta_{n+1}(j).$$
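Therefore, the $\beta$ variables and the likelihood function are evaluated by multiplications and summations of the variables $A_{ij}$ and $B_{ij}$. In such cases, we can simplify the interval operations. Note that if $\mathbf{x}$ and $\mathbf{y}$ are intervals with $\inf \mathbf{x} \ge 0$ and $\inf \mathbf{y} \ge 0$ then
$$\mathbf{x} * \mathbf{y} = [\bigtriangledown(\inf \mathbf{x} \inf \mathbf{y}),\ \bigtriangleup(\sup \mathbf{x} \sup \mathbf{y})] \tag{20}$$
and
$$\mathbf{x} + \mathbf{y} = [\bigtriangledown(\inf \mathbf{x} + \inf \mathbf{y}),\ \bigtriangleup(\sup \mathbf{x} + \sup \mathbf{y})]. \tag{21}$$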
Combining the recursion step with equations (20) and (21) leads to
$$\inf \beta_n = \inf A\ \inf P(y_{n+1})\ \inf \beta_{n+1},$$
$$\sup \beta_n = \sup A\ \sup P(y_{n+1})\ \sup \beta_{n+1}.$$
Moreover, writing the last term of the recursion with (20) and (21) gives
$$\inf L(\theta) = \inf \pi^T\ \inf P(y_1)\ \inf \beta_1$$
and
$$\sup L(\theta) = \sup \pi^T\ \sup P(y_1)\ \sup \beta_1.$$
Algorithm 6 extends Algorithm 4 to intervals. It requires the scale function as defined by Algorithm 5.
Algorithm 6 Scaled interval backwards
Input: The consistent interval parameter pair θ, the interval initial distribution π and the set of observations y1:T.
Output: The integer exp and the interval L, such that L(θ) := L ∗ 2^exp.
β_T ← 1, exp ← 0, exp_lower ← 0;
round_mode ← ▽();
for n = T−1, ..., 1 do
    β_n ← inf A inf P(y_{n+1}) β_{n+1};
    exp_lower ← exp_lower + scale(β_n);
end for
s_lower ← inf π^T inf P(y_1) β_1;
exp_lower ← exp_lower + scale(s_lower);
β_T ← 1;
round_mode ← △();
for n = T−1, ..., 1 do
    β_n ← sup A sup P(y_{n+1}) β_{n+1};
    exp ← exp + scale(β_n);
end for
s_upper ← sup π^T sup P(y_1) β_1;
exp ← exp + scale(s_upper);
s_lower ← s_lower ∗ 2^(exp_lower − exp);
L ← [s_lower, s_upper];
return (exp, L);
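The two-pass structure of the interval evaluation can be sketched without directed rounding. The code below runs the real backward recursion once on the lower endpoints and once on the upper endpoints of an ε-box around a point model. It is not rigorous, since Python's standard library offers no rounding mode control, and it omits the scaling step for brevity; the names `epsilon_box` and `interval_likelihood` and the point-valued π are assumptions of this sketch.

```python
def likelihood(A, B, pi, y):
    """Real backward recursion; y holds 0-based symbols."""
    N = len(A)
    beta = [1.0] * N
    for n in range(len(y) - 1, 0, -1):
        g = [B[j][y[n]] * beta[j] for j in range(N)]
        beta = [sum(A[i][j] * g[j] for j in range(N)) for i in range(N)]
    return sum(pi[i] * B[i][y[0]] * beta[i] for i in range(N))

def epsilon_box(P, eps):
    """Entrywise bounds [max(0, p - eps), min(1, p + eps)] of a matrix P."""
    lo = [[max(0.0, p - eps) for p in row] for row in P]
    hi = [[min(1.0, p + eps) for p in row] for row in P]
    return lo, hi

def interval_likelihood(A, B, pi, y, eps):
    """Endpoints of L over the box: one pass on infima, one on suprema."""
    A_lo, A_hi = epsilon_box(A, eps)
    B_lo, B_hi = epsilon_box(B, eps)
    return likelihood(A_lo, B_lo, pi, y), likelihood(A_hi, B_hi, pi, y)

A = [[0.9, 0.1], [0.3, 0.7]]
B = [[0.8, 0.2], [0.25, 0.75]]
pi = [0.5, 0.5]
y = [0, 1, 1, 0, 1, 0, 0, 1]

lo, hi = interval_likelihood(A, B, pi, y, 1e-4)
mid = likelihood(A, B, pi, y)
```

Because all quantities are non-negative, the recursion is monotone in every entry of A and B, so the likelihood at the centre of the box lies between `lo` and `hi`. A rigorous version would set the rounding mode downward before the first pass and upward before the second, which gives the two switches counted in this section.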
The same analysis is valid for the derivatives of $\beta$, since $\frac{\partial \beta_n}{\partial A_{ij}}$ and $\frac{\partial \beta_n}{\partial B_{ij}}$ involve only non-negative products and summations. Therefore we can easily extend Algorithm 6 to evaluate the derivatives of $L(\theta)$.
The simplification presented here remains valid for the likelihood of HMMs with a general observable process. However, in that case it is not true that the derivatives can be evaluated with 2 rounding mode switches, since the derivatives $\frac{\partial P(y_{n+1})}{\partial A_{ij}}$ and $\frac{\partial P(y_{n+1})}{\partial B_{ij}}$ are not necessarily positive.
6. Numerical experiments. We apply Algorithm 2 to study the non-asymptotic behavior of the likelihood function. The experiment we perform is as follows. Let $\theta = (A, B)$ be a feasible parameter pair of $HMM(N, M)$. We generate the interval parameter pair $\boldsymbol{\theta}$ with maximum diameter $\epsilon$ by the following rule:
$$\mathbf{A}_{ij} = [\max(0, A_{ij} - \epsilon),\ \min(1, A_{ij} + \epsilon)] \tag{22}$$
for $i, j = 1:N$ and
$$\mathbf{B}_{ij} = [\max(0, B_{ij} - \epsilon),\ \min(1, B_{ij} + \epsilon)] \tag{23}$$
for $i = 1:N$ and $j = 1:M$.
We also generate K sequences of observations from θ, each of size T. For each sequence of observations we apply Algorithm 2 to the interval pair and count the number of times it returns true. If we gradually increase the sample size of the observations, the consistency of the MLE guarantees that the proportion of successes will tend to 1. Therefore, the elimination test allows us to estimate a lower bound on the sample size needed to recover θ.
See Algorithm 7.
We apply Algorithm 7 in a number of different scenarios. We divide the rest of this section as follows. Subsection 6.1 presents the results for the case where the matrices A and B have no zero entries. Subsection 6.2 discusses the case where A is a random matrix and B is the identity matrix. In Subsection 6.3 we present the results for hidden Markov models where A is a diagonal upper triangular matrix. Finally, Subsection 6.4 generalizes the results obtained in this paper to models with Poisson and normal outcomes.
6.1. General Hidden Markov models. Assume that the matrices A and B have non-zero entries. We run the counting experiment described in Algorithm 7 with the number of hidden states N ranging in 2 : 8, the number of observable values M ranging in N : 10, the number of samples generated K = 1000, the number of resamples R = 30 and the diameter ε = 10^−4. Table 2 displays the sample size needed to recover θ with a given probability. The table must be read as follows. For hidden Markov models with N hidden states and M observable values, the average size needed to recover θ in 50% of the experiments is given in column 3 and the standard deviation
Algorithm 7 Counting experiment
Input: The numbers N and M, the sample size T, the number of sequences K, the resample parameter R and the tolerance ε.
Output: The vector V of size R with proportions C/K, where C is the number of times Algorithm 2 returned true.
V ← zeros(1, R);
for r = 1, ..., R do
    Generate the parameter pair θ with N hidden states and M observable values;
    Generate the interval parameter pair from θ using equations (22) and (23);
    C ← 0;
    for k = 1, ..., K do
        Generate a sequence of size T from θ;
        result ← elimination_test(interval parameter pair);
        if result == true then
            C ← C + 1;
        end if
    end for
    V(r) ← C/K;
end for
return V;
in column 4. The same holds for the 75% and 95% levels. Therefore, for models of the form HMM(2, 2), the average sample size needed to recover θ with probability 50% is 209 and to recover the parameters with probability 95% is 376.
We note that at least 320 observations are needed, on average, to achieve 95% of successes for any hidden Markov model tested. All results in the first 8 columns of Table 2 were taken from hidden Markov models where the initial distribution $\pi$ satisfies $\sum_{i=1}^{N} \pi_i = 1$. We perform a second experiment where samples were taken from models with the initial distribution $\pi$ equal to the invariant measure of $A$. To evaluate $\pi$ in this case we solve the following linear system (see Zucchini and MacDonald (2009)):
$$(I - A^T + U)\pi = \mathbf{1}$$
where $I$ is the identity matrix and $U$ is the matrix of ones of dimension $N$. The last column of Table 2 (p) displays the p-value of the Kolmogorov-Smirnov hypothesis test that compares purely random $\pi$ and invariant measures.
Table 3 presents the differences between both experiments in those cases where p < 5%.
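For a small transition matrix the linear system for the invariant measure can be solved with a few lines of Gaussian elimination. The sketch below uses plain Python lists; the helper names and the example matrix are illustrative, and in practice a linear algebra library would be used instead.

```python
def solve(M, b):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(M)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[p] = aug[p], aug[c]
        for r in range(c + 1, n):
            f = aug[r][c] / aug[c][c]
            for k in range(c, n + 1):
                aug[r][k] -= f * aug[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x

def invariant_distribution(A):
    """Solve (I - A^T + U) pi = 1, with U the all-ones matrix."""
    n = len(A)
    M = [[(1.0 if i == j else 0.0) - A[j][i] + 1.0 for j in range(n)]
         for i in range(n)]
    return solve(M, [1.0] * n)

A = [[0.9, 0.1], [0.3, 0.7]]
pi = invariant_distribution(A)       # approximately [0.75, 0.25]
```

Adding U to I − A^T builds the normalization of π into the system: for a stationary π the term Uπ equals the vector of ones, so the solution satisfies both πA = π and the sum-to-one condition.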
We compare the rigorous approach in models of the form HMM(2, 2) with the heuristic method presented in Section 3. We run Algorithm 1 with θ generated at random and with Aij > 0.1 and Bij > 0.1 for i, j = 1 : 2. We set T ∈ {300, 600, . . . , 2100}, R = 30 and τ = 10^−4.
The rigorous approach proposed in Algorithm 7 is conservative when compared to the heuristic one presented in Algorithm 1. Note that with 376 observations, the rigorous approach states that we would expect a 95% probability of recovering θ. On the other hand, the heuristic method gives only a 50% probability with 300 observations. This is due to the fact that the rigorous method relies only on the first-order information of the likelihood function. We also observe that even the conservative estimate is greater than the length of several time series used in applications.
Figure 3. Average and standard deviation of ‖θ∗ − θ‖∞ for the outcomes of Algorithm 1 with T ∈ {300, 600, . . . , 2100}, R = 30 and τ = 10^−4 (horizontal axis: sample size).
We finish the subsection noting that the number of observations needed to recover θ decreases as the number of hidden states increases.
6.2. Markov chains. A Markov chain is a special case of a hidden Markov model where M = N and the matrix B is known to be the identity. We perform the counting experiment of Algorithm 7 with the number of states ranging over N = 2 : 8, the number of generated samples K = 1000, the number of resamples R = 30 and the diameter $\varepsilon = 10^{-4}$. In this case, the matrix A is generated at random and we generate its interval enclosure $\mathbf{A}$ following (22). On the other hand, the interval matrix $\mathbf{B}$ is given by
$$\mathbf{B}_{ij} = [1 - \varepsilon,\, 1], \quad i = j, \qquad \text{and} \qquad \mathbf{B}_{ij} = [0,\, \varepsilon], \quad i \neq j.$$
We apply the elimination test described in Algorithm 2 to θ = (A, B).
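The construction of the interval parameters can be illustrated with a short Python sketch. Several points are assumptions rather than statements from the paper: the diameter is called `eps`, the enclosure of A is taken to be the clipping rule [max(0, a − eps), min(1, a + eps)], intervals are represented as plain `(lo, hi)` tuples, and the function names and the matrix `A` are hypothetical. The sketch also ignores outward rounding, which a rigorous interval implementation would need.

```python
def enclose_transition(A, eps):
    """Entrywise interval [max(0, a - eps), min(1, a + eps)] around A."""
    return [[(max(0.0, a - eps), min(1.0, a + eps)) for a in row]
            for row in A]


def enclose_identity_emission(n, eps):
    """Interval matrix around the identity: [1 - eps, 1] on the diagonal,
    [0, eps] elsewhere."""
    return [[(1.0 - eps, 1.0) if i == j else (0.0, eps) for j in range(n)]
            for i in range(n)]


eps = 1e-4
# Hypothetical row-stochastic matrix with entries close to 0 and 1,
# so that the clipping to [0, 1] becomes visible.
A = [[0.7, 0.3],
     [0.00005, 0.99995]]

A_box = enclose_transition(A, eps)
B_box = enclose_identity_emission(len(A), eps)
```

Clipping keeps every bound inside [0, 1], so each interval entry remains a valid range for a probability.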
Table 4 must be read as described in Subsection 6.1. The table shows that the average sample size needed to recover θ is greater for Markov chains than for hidden Markov models.
Table 2
Minimum sample size needed to recover θ. Av stands for the average number needed to recover θ with a given probability. Sd is the standard deviation of the experiment described in Subsection 6.1. The last column p displays the p-value of the Kolmogorov-Smirnov test for stochastic and invariant measure hidden Markov models.
N M Av(50%) Sd(50%) Av(75%) Sd(75%) Av(95%) Sd(95%) p
2 2 209.2 31.51 288.2 36.85 376.2 46.28 0.05
2 3 297.6 59.79 371.0 75.64 458.0 102.25 0.23
2 4 282.0 11.55 336.6 16.63 396.0 17.50 0.00
2 5 321.0 38.86 377.4 51.30 442.6 55.32 0.01
2 6 306.4 12.54 354.0 22.55 413.6 34.20 0.12
2 7 346.0 58.38 410.4 88.47 497.8 136.08 0.00
2 8 392.4 187.09 452.2 226.38 512.4 250.05 0.12
2 9 302.8 11.46 340.2 9.63 389.6 13.91 0.02
2 10 324.8 16.86 367.2 26.77 416.6 38.80 0.64
3 3 262.4 13.70 316.2 17.22 378.8 13.17 0.41
3 4 300.2 27.02 352.0 33.20 421.6 49.28 0.00
3 5 295.0 11.99 344.2 17.95 397.6 25.62 0.02
3 6 309.4 15.63 350.0 17.20 401.8 25.61 0.41
3 7 320.6 8.93 362.8 13.00 413.2 15.67 0.00
3 8 307.8 15.42 345.8 17.83 393.8 16.79 0.64
3 9 306.2 8.45 341.2 12.27 389.4 14.88 0.00
3 10 305.2 11.04 338.2 8.65 379.6 12.58 0.87
4 4 274.6 14.06 318.2 17.61 368.0 20.21 0.41
4 5 277.6 11.10 319.2 14.48 362.4 16.65 0.00
4 6 286.2 15.56 323.2 19.47 372.8 19.69 0.23
4 7 289.8 12.20 324.0 12.99 364.0 17.85 0.05
4 8 298.8 10.03 331.8 15.74 372.2 16.21 0.12
4 9 287.8 9.47 318.4 12.05 356.6 13.97 0.00
4 10 296.4 10.16 331.2 11.84 371.4 11.68 0.05
5 5 272.2 11.19 309.4 11.39 354.4 15.50 0.05
5 6 274.0 12.58 308.8 10.34 351.4 11.95 0.87
5 7 276.4 6.21 309.6 10.60 347.0 13.92 0.41
5 8 294.2 10.67 328.8 11.11 368.8 12.77 0.99
5 9 288.8 7.26 324.4 9.28 367.0 18.14 0.05
5 10 288.2 6.27 315.4 6.28 354.0 10.10 0.87
6 6 267.2 9.14 297.6 8.43 339.0 10.61 0.12
6 7 272.0 8.54 307.0 9.13 343.2 11.54 0.12
6 8 269.6 9.00 299.8 10.65 336.2 11.93 0.41
6 9 276.8 5.18 306.4 5.50 343.0 9.24 0.00
6 10 273.8 6.66 304.2 8.62 340.4 13.61 0.64
7 7 256.6 7.32 288.8 8.57 323.4 9.97 0.05
7 8 266.8 6.60 293.2 6.90 330.0 8.16 0.05
7 9 265.6 4.64 295.4 4.98 330.4 8.77 0.87
7 10 270.8 5.14 298.6 6.38 333.6 6.85 0.00
8 8 253.8 7.11 281.0 8.66 313.6 10.75 0.23
8 9 260.2 5.30 286.0 7.36 315.4 8.03 0.99
8 10 264.6 6.44 293.0 5.40 320.0 6.77 0.23
Table 3
Minimum sample size needed to recover θ. N is the number of hidden states, M is the number of observable values, Av stands for the average number needed to recover θ with a given probability. Sd is the standard deviation of the experiment described in Subsection 6.1. Only cases where stochastic and stationary HMMs give significantly different results are listed.
N M Av(50%) Sd(50%) Av(75%) Sd(75%) Av(95%) Sd(95%)
2 4 344.8 58.28 407.4 73.53 485.2 86.73
2 5 290.8 13.90 338.2 18.14 395.2 24.52
2 7 312.8 22.04 364.2 31.48 423.2 46.59
2 9 315.2 24.00 354.2 28.53 410.2 34.08
3 4 284.8 11.59 332.2 9.36 385.6 11.58
3 5 291.2 9.71 334.0 11.90 381.2 13.41
3 7 300.8 10.77 337.2 12.25 386.0 19.26
3 9 299.6 6.91 335.8 11.06 377.4 16.21
4 5 281.6 6.41 321.8 10.69 375.6 11.84
4 9 296.2 6.96 329.4 7.12 369.2 7.73
6 9 271.0 5.20 298.6 4.45 333.6 7.43
7 10 267.6 6.31 295.2 8.23 325.2 9.52
Table 4
Minimum sample size needed to recover θ for Markov chains. N is the number of hidden states, Av stands for the average number needed to recover θ with a given probability. Sd is the standard deviation of the experiment described in Subsection 6.2.
N Av(50%) Sd(50%) Av(75%) Sd(75%) Av(95%) Sd(95%)
2 414.2 165.31 505.4 193.44 601.0 254.24
3 542.8 207.50 623.4 255.79 722.4 340.12
4 532.4 56.86 600.6 75.30 668.8 103.06
5 652.6 117.08 764.2 201.41 911.8 368.84
6 1785.0 1184.69 2337.0 1448.58 2646.2 1403.31
7 1172.2 618.91 1543.6 941.67 1758.6 920.93
8 1979.8 495.76 2585.4 285.61 2893.0 29.76
6.3. Forward chains. We say that a Markov chain is a forward chain if $a_{ii}$ and $a_{i,i+1}$ are the only non-zero entries of the transition matrix A. There are several applications of forward chains, as described in Rabiner (1989) and Leroux (1992). In this section we run the counting experiment on hidden Markov chains where the hidden process is a forward chain. We generate the interval matrix $\mathbf{A}$ as follows
$$\mathbf{A}_{ij} = [\max(0, A_{ij} - \varepsilon),\ \min(1, A_{ij} + \varepsilon)],$$