• Nenhum resultado encontrado

Outliers detection in INAR(1) model with negative binomial innovations

N/A
N/A
Protected

Academic year: 2021

Share "Outliers detection in INAR(1) model with negative binomial innovations"

Copied!
86
0
0

Texto

(1)

Universidade de Aveiro Departamento de Matem´atica, 2012

Soheila

Aghlmandi

Dete¸

ao bayesiana de outliers aditivos em processos

INAR(1) com inova¸

oes binomiais negativas.

Outliers detection in IN AR(1) model with negative

binomial innovations.

(2)
(3)

Universidade de Aveiro Departamento de Matem´atica, 2012

Soheila

Aghlmandi

Dete¸

ao bayesiana de outliers aditivos em processos

INAR(1) com inova¸

oes binomiais negativas.

Outliers detection in IN AR(1) model with negative

binomial innovations.

Disserta¸c˜ao apresentada ´a Universidade de Aveiro para cumprimento dos requisitos necess´arios ´a obten¸c˜ao do grau de Mestre em Matem´atica e Aplica¸c˜oes, realizada sob a orienta¸c˜ao cient´ıfica da Doutora Isabel Maria Sim˜oes Pereira, Professora Auxiliar do Departamento de Matem´atica da Universidade de Aveiro.

Thesis submitted to the Department of Mathematics, University of Aveiro, in partial fulfillment of the requirements for the degree of Master of Science under the scientific supervision of Professor Isabel Pereira from Department of Mathematics, University of Aveiro.

(4)
(5)

O j´uri

Presidente Prof. Doutor Agostinho Miguel Mendes Agra

Prof. Auxiliar do Departamento de Matematica da Universidade de Aveiro.

Arguente Prof. Doutora Isabel Maria Marques da Silva Magalhaes

Prof. Auxiliar do Departamento de Engenharia Civil da Faculdade de Engenharia da Universidade do Porto.

Orientador Prof. Doutora Isabel Maria Simoes Pereira

(6)
(7)

Acknowledgements

First, and foremost, I would like to greatly acknowledge my supervisor, Prof. Isabel Pereira, for motivating me to work on Bayesian statistics and teach me how to think and write critically in a scientific research. Additionally, I would like to express my gratitude to Prof. Eduarda Silva for giving me an opportunity to work with her in the computational part of this study. Last but not least, I kindly appreciate the critiques from my husband, Shakoor Pooseh, and his time in reading this thesis.

(8)
(9)

Resumo Os processos de contagem, apesar de serem largamente usados na pr´atica, continuam a ser alvo de investiga¸c˜ao.

Neste trabalho considera-se o processo de contagem autorregressivo de 1a ordem - INAR(1). O objetivo principal consiste em tratar o problema da dete¸c˜ao de outliers aditivos em processos INAR(1), considerando uma dis-tribui¸c˜ao binomial negativa para o processo de inova¸c˜oes. Aplica-se a abor-dagem bayesiana, atrav´es da amostragem de Gibbs, para estimar a proba-bilidade de que uma observa¸c˜ao seja afetada por um outlier. A metodologia proposta ´e ilustrada atrav´es de v´arios exemplos simulados e conjuntos de dados reais.

(10)
(11)

Abstract Discrete-valued, or so called Integer-valued, time series is widely used in practice; but still it can be considered as a new subject for research nowa-days. In this context, the variables of the process take place on finite or countable infinite sets.

In this work, we study first-order INteger-valued AutoRegressive, IN AR(1), processes. The main goal, however, is to develop the statistical expressions for detecting outliers for the model, by considering the distributions of in-novations as negative binomial. The Binomial thinning operator is used in process. This work considers a Bayesian approach to the problem of modeling a negative binomial integer-valued autoregressive time series con-taminated with additive outliers.

Furthermore, we focus on computational part of detecting the outliers of IN AR(1) process where we use R software. We show how Gibbs sampling can be used to detect outlying observations in IN AR(1) processes.

(12)
(13)

Contents

Contents i

List of Figures iii

List of Tables v

Introduction 1

1 Preliminaries 4

1.1 Bayesian Methodology . . . 4

1.1.1 Brief history of Bayesian Methodology . . . 4

1.1.2 Bayesian Methodology . . . 4

1.1.3 The rule of Bayes . . . 5

1.2 Applicability of Bayesian paradigm . . . 7

1.2.1 Markov Chain Monte Carlo(MCMC) . . . 7

1.2.2 Gibbs sampling algorithm . . . 8

1.2.3 Rejection Sampling . . . 9

1.2.4 ARS . . . 10

1.2.5 Hastings–Metropolis Algorithm . . . 12

1.2.6 ARMS . . . 13

1.3 Thinning operator . . . 15

1.3.1 Binomial thinning operator . . . 15

1.3.2 Properties of Binomial thinning operator . . . 15

1.4 The IN AR(1) process . . . 16

1.4.1 The IN AR(1) process with Binomial thinning operator . . . 16

1.5 The likelihood function . . . 18

2 Estimation procedures 19 2.1 Definition of IN AR(1) models with additive outliers . . . 19

2.2 Primary estimation procedures for parameters . . . 19

2.2.1 The Conditional Least Square estimators . . . 19

2.2.2 The Yule–Walker estimation . . . 22

2.3 Bayesian estimation procedure for IN AR(1) model with and without additive outliers . . . 25

2.3.1 The conditional maximum likelihood estimation . . . 25

(14)

3 Computational illustrations 33 3.1 Illustration . . . 34 3.1.1 Simulated Data set . . . 34 3.2 Final considerations . . . 45

4 Future work 46

4.1 Studying the additive outlier of IN AR(1) with negative binomial for innova-tions using iterated thinning operator . . . 46 4.1.1 Iterated thinning operator . . . 46 4.1.2 The IN AR(1) process with Iterated thinning operator . . . 47

(15)

List of Figures

1 Simulation of IN AR(1) using Binomial thinning operator. . . 3

1.1 Procedure of Bayesian scheme . . . 6

1.2 Adaptive rejection function h4(x) (−) constructed from equation (1.3) for a long-concave function f (x) (Gilks et al. (1995)). . . 11

1.3 Updated current set S5 and rejection function h5(x) (−) after incorporating X in Figure 1.2 (Gilks et al. (1995)). . . 12

1.4 Adaptive rejection function h5(x) (−) for a non-log-concave function f (x), constructed from equation (1.5) (Gilks et al. (1995)). . . 14

3.1 Model with α = 0.2, θ = 0.6, r = 3; η21= 6, η54= 8, η74= 4 . . . 35

3.2 Model with α = 0.5, θ = 0.6, r = 3; without any outlier . . . 37

3.3 Model with α = 0.5, θ = 0.6, r = 3; η35= 4, η61= 8 . . . 38

3.4 Model with α = 0.85, θ = 0.6, r = 3; η11= 16, η78= 5, η90= 9 . . . 39

3.5 Model with α = 0.2, θ = 0.4, r = 2; η218= 4 and n = 300 . . . 41

3.6 Model with α = 0.5, θ = 0.4, r = 2; η208= 4 and n = 300 . . . 43

(16)
(17)

List of Tables

3.1 Table of initial values for parameters, Yule–Walker initials and obtained Bayes parameters (α = 0.2, η21, η54 and η74) . . . 35

3.2 Probabilities of the outliers for α = 0.2 . . . 36 3.3 Table of initial values for parameters, Yule–Walker initials and obtained Bayes

parameters (α = 0.5, ηS = 0) . . . 36

3.4 Probabilities of the outliers (α = 0.5) . . . 36 3.5 Table of initial values for parameters, Yule–Walker initials and obtained Bayes

parameters (α = 0.5, η35= 4, η61= 8.) . . . 37

3.6 Probabilities of the outliers (α = 0.85) . . . 39 3.7 Table of initial values for parameters, Yule–Walker initials and obtained Bayes

parameters (α = 0.85, η11= 16, η78= 5, η90= 9.) . . . 40

3.8 Probability of the outlier (α = 0.2 and n = 300) . . . 41 3.9 Table of initial values for parameters, Yule–Walker initials and obtained Bayes

parameters (α = 0.2, n = 300 and η218= 4.) . . . 42

3.10 Outlier probability (α = 0.5 and n = 300) . . . 42 3.11 Table of initial values for parameters, Yule–Walker initials and obtained Bayes

parameters (α = 0.5, n = 300 and η208= 4.) . . . 42

3.12 Outlier probability (α = 0.85 and n = 200) . . . 43 3.13 Table of initial values for parameters, Yule–Walker initials and obtained Bayes

(18)
(19)

Introduction

The subject of detecting Additive Outliers (AO) has been studied in Silva & Pereira (2012). This paper introduces a Bayesian approach to the problem of modeling a Poisson integer-valued autoregressive time series contaminated with additive outliers. In Silva & Pereira (2012), the authors develop the additive outlier model for IN AR(1) with the Poisson distribution for innovations. It also shows that how Gibbs sampling can be used to detect outlier observations in IN AR(1) processes. In the count time series context it is worth to mention the works of Barczy et al. (2010) who considered the conditional least square estima-tion of the parameters of an IN AR(1) contaminated at known time periods with innovaestima-tional and additive outliers, respectively.

In a Poisson distribution, as we know, the values of mean and variance are equal while in real world data the variance is greater than mean. So, we propose to extend this model with negative binomial distribution for innovations. We choose the negative binomial distribution for innovations to satisfy the condition of real world data. In this study we use the Bayesian approach to the problem of modeling a negative binomial integer-valued autoregressive time series contaminated with additive outliers.

Outliers and structure changes, are commonly encountered in time-series data analysis. The presence of those extraordinary events could easily mislead the conventional time series analysis procedure resulting in erroneous conclusions. The impact of those events is often overlooked, however, there are useful methods available to deal with the dynamic behavior of those events in the underlying series. Tsay (1988) considered unified methods for detecting and handling outliers and structure changes in a univariate time series. The treated outliers, are the additive outlier (AO) and the innovational outlier (IO).

Several approaches have been considered in the literature for handling outliers in a time series.

• Abraham & Box (1979) used a Bayesian method, Martin (1985) treated outliers as contamination generated from a given probability distribution, and Fox (1972) proposed two parametric models for studying outliers. Chang (1982) adopted Fox’s models and proposed an iterative procedure to detect multiple outliers. In recent years, this iterative procedure has been widely used with encouraging results, see Chang et al. (1988), Hillmer et al. (1983), and Tsay (1988).

• The methods mentioned above may be regarded as batch-type procedures for detecting outliers, because the full data set is used in detecting the existence of outliers. On the other hand, Harrison & Stevens (1976), Smith & West (1983), West et al. (1985) and West (1986) have considered sequential detecting methods for handling outliers. These sequential methods assume probabilistic models for outlier disturbances.

(20)

• The third method for handling outliers, is the robust procedure advocated by Denby & Martin (1979). This approach is summarized in Martin (1985). However, the study of Chang et al. (1988) shows that the robust procedure of Denby & Martin (1979) is not powerful in handling innovational outliers. Note that the effect of a single IO on estimation is usually negligible provided that the IO is not close to the end of the ob-servational period. The effect of multiple IOs, however, could be serious. See Chang (1982). There is no comparison available between the batch-type and the sequential pro-cedures in handling outliers. The probabilistic treatment has its appeal but may not be easy to implement as it requires prior information of the underlying model to begin with. The well-known assertion of George Box that while all models are wrong some are useful, motivates that we approach the issue of modeling outliers in integer-valued time series focusing on the integer-valued autoregressive model of order one. In fact, this model introduced independently by Al-Osh & Alzaid (1987) and McKenzie (1985) to model time series of counts, has been extensively studied in the literature and applied to many real-world problems including statistical process control because of its simplicity and easiness of interpretation, see for example Weiß (2007) and Fokianos & Fried (2010).

To motivate our approach, we make a quick review of the model. Let Xtdefine as IN AR(1)

model, where ”◦” represents the thinning operator, Xt= α ◦ Xt−1+ t

with t, the arrival process, a sequence of independent identically distributed negative binomial

variables t ∼ N B(r, θ); t means the number of failures until the r-th success happens, θ is

the probability of success and r is the number of success. In our model we use the binomial thinning operator (see 1.3.1). Also we know that Xt−1 and t are independent.

When additive Outliers (AO) occur at times τ1, τ2, . . . , τk, with sizes ω1, ω2, . . . , ωk, Xt is

unobservable and it is assumed that the observed series{Yt} satisfies

Yt= Xt+ k

X

i=1

It,τiωi,

where k ∈ N is the number of outliers and It,s is an indicator function define as

It,s=



1 , t = s 0, otherwise.

Roughly speaking, an additive outlier can be interpreted as a measurement error or as an impulse due to some unspecified exogenous source at time τi, i = 1, 2, . . . , k.

Using the above information, we generate a IN AR(1) model using Binomial thinning op-erator in R program, considering the negative binomial as the distribution of the innovations, we need to define the initial values in order to generate a data for this model. We choose the initial values as

α = 0.5, θ = 0.6, r = 3

and we present the data set without any outlier i.e., ωS = 0, and by considering two outliers

i.e., ω35 and ω61 with the values of four and eight, respectively. We present the generated

data sets in Figure 1. Figure 1(b) is the result with two outliers, and Figure 1(a) without any outlier.

(21)

(a) Simulation without additive outliers

(b) Simulation with two additive outliers

Figure 1: Simulation of IN AR(1) using Binomial thinning operator.

Here we present the structure of our work. In Chapter 1, we provide the necessary definitions and properties, from both theoretical and computational point of view associated to IN AR(1) models and the Bayesian methodology. In Chapter 2, we go through the estimation procedure using Conditional Least Square (CLS), Yule–Walker estimation methods. Then we complete the chapter with the Bayesian estimation procedure for our model. In Chapter 3 we present the result of the computational study in R software considering the model under study and finally, in chapter 4, we point out some future work.

(22)

Chapter 1

Preliminaries

In this section we provide a short review of relevant basics. The perspective of our study is from Bayesian point of view. Therefore, in first part, we present a brief review of Bayesian methodology. Then we go through the applicability of Bayesian approach, which leads us to determine the path that we will follow in the practical part. In the next step, we discuss the thinning operator and finally we discuss the IN AR(1) process. One can find definitions and detailed information about several types of distributions in Appendix B.

1.1

Bayesian Methodology

1.1.1 Brief history of Bayesian Methodology

Traditionally, the starting point is associated with the paper of British clergyman and amateur mathematician, Thomas Bayes (1701-1761), but it was indeed Richard Price (1763) with the publication of ”An essay towards solving a problem in the doctrine of chances”, (see Amaral-Turkman (1981)) who gave impetus to the use of Bayes theorem as the ideal tool for making inferences from a data set. In his famous memoir on the ”probabilities of causes” (Laplace, 1774), he had already pointed out the fundamental difference between pure mathematics and scientific reasoning. The former relies on logical deduction as the tool for making inferences, with the result that there is no room for propositions of the type ”This is probably the cause of that event”.

1.1.2 Bayesian Methodology

Bernardo (2003) discusses that the mathematical statistics uses two major paradigms: conventional or frequentist and Bayesian. Bayesian methods provide a complete paradigm for both statistical inference and decision making under uncertainty. Bayesian methods contain as particular cases of many of the more often used frequentist procedures, solve many of the difficulties faced by conventional statistical methods and extend the applicability of statistical methods. In particular, Bayesian methods make it possible to incorporate scientific hypoth-esis in the analysis (by means of prior distribution) and may be applied to problems whose structure is too complex for conventional methods to be able to handle.

Bayesian methods of data analysis have a very long history, and have been used with great success in many disciplines, from Physics to Econometrics. The Bayesian paradigm is based on an interpretation of probability as a rational, conditional measure of uncertainty, which

(23)

closely matches the sense of the word ’probability’ in ordinary language. Statistical inference about a quantity of interest is described as the modification of the uncertainty about its value in the light of evidence, and Bayes’ theorem precisely specifies how this modification should be made.

The main difference between the two methodologies is that, from frequentist point of view, the parameter of the model is fixed while in contrast, from the Bayesian perspective θ, the parameter, is considered as a random variable.

1.1.3 The rule of Bayes

In the simplest form, one can find the Bayes’ theorem as follows

Theorem 1.1.1. Consider Ai, i = 1, 2, . . . and B events where p(Ai), i = 1, 2, . . . , a partition

of event space and p(B) denote the probability of occurrence for Ai and B, respectively. The

p(Ai, B) is called the joint probability of Ai and B. According to Bayes,

p(Ai | B) =

p(Ai, B)

p(B) , i = 1, 2, . . .

This relation implies that p(Ai, B) = p(B | Ai)p(Ai), so we have the posterior probability of

Ai as

p(Ai| B) =

p(B | Ai)p(Ai)

p(B) .

Now we can define

p(B) =X i p(B | Ai)p(Ai), i = 1, 2, . . . so p(Ai| B) = p(B | Ai)p(Ai) P ip(B | Ai)p(Ai) , i = 1, 2, . . . .

According to Bayes, p(Ai | B) is called the posterior probability of Ai and p(Ai) is stated

as prior probability of Ai. The Theorem has been defined for Ai discreetly distributed.

In Bayesian paradigm, the process of learning from the data is systematically implemented. Using Bayes’ theorem, we combine the available prior information with the information pro-vided by data, to produce the required posterior distribution. Computation of posterior densities is often facilitated by noting that Bayes’ theorem may be simply expressed as

p(Ai| B) ∝ p(B | Ai)p(Ai),

where ∝ stands for ”proportional to”.

Now, let θ be a continuous parameter. So one can define a prior probability distribution function of θ as p(θ). In Bayesian model, we define the probabilistic model as

P = {f (x | θ) : θ ∈ Θ},

where x = (x1, . . . , xn) are our observations and θ is a parameter and f (x | θ) is the conditional

probability distribution function of X with fixed parameter θ. We know that fX,Θ(x, θ) = f (x | θ)p(θ),

(24)

where p(θ) is a distribution of θ (before knowing x). After realization of an experience, we observe X = x. Applying the Bayes’ theorem we have

p(θ | x) = R f (x | θ)p(θ)

Θf (x | θ)p(θ)dθ

, θ ∈ Θ. (1.1)

Remember that p(θ | x) is called the posterior distribution of θ. Also the denominator of (1.1) is

f (x) = Z

Θ

f (x | θ)p(θ)dθ, where f (x) is a predictive distribution of θ and R

Θp(θ | x)dθ = 1. The Bayesian paradigm is

p(θ | x) ∝ f (x | θ)p(θ). (1.2)

We can interpret (1.2) as

Posterior distribution of θ ∝ likelihood function × prior distribution of θ.

So we verify the importance of likelihood function in Bayes formula. The likelihood function can be interpreted as an expression of information about θ, provided by given x. We can present the Bayesian scheme in the following diagram, Figure 1.1.

Figure 1.1: Procedure of Bayesian scheme

Briefly, Bayesians posterior distribution incorporates via the Bayes theorem for all avail-able information about a parameter (Initial information + Experience information). We know that p(θ | x) provides all necessary information about θ so we can say that all Bayesian in-ference procedures are based exclusively on p(θ | x). We should keep in mind that all these inferences are made partially using the calculation of probabilities.

(25)

1.2

Applicability of Bayesian paradigm

In sharp contrast to most conventional statistical methods, which may only be exactly applied to a handful of relatively simple stylized situations, Bayesian methods are totally general. Indeed, for a given probability model and prior distribution over its parameters, the derivation of posterior distributions is a well-defined mathematical exercise. In particular, Bayesian methods do not require any particular regularity conditions on the probability model, do not depend on the existence of sufficient statistics of finite dimension, do not rely on asymptotic relations, and do not require the derivation of any sampling distribution, nor the existence of a pivotal statistic whose sampling distribution is independent of the parameters. However, when used in complex models with many parameters, Bayesian methods often require the computation of multidimensional definite integrals and, for a long time in the past, this requirement effectively placed practical limits on the complexity of the problems which could be handled. This has dramatically changed in recent years with the general avail-ability of large computing power, and the parallel development of simulation-based numerical integration strategies like importance sampling or Markov chain Monte Carlo (MCMC).

Bernardo (2003) also mentioned that above methods provide a structure within which many complex models may be analyzed using generic software. MCMC is numerical integra-tion using Markov chains. Monte Carlo integraintegra-tion proceeds by drawing samples from the required distributions, and computing sample averages to approximate expectations. MCMC methods draw the required samples by running appropriately defined Markov chains for a long time; specific methods to construct those chains include the Gibbs sampler and the Metropolis algorithm, originated in the 1950s in the literature of statistical physics. The development of improved algorithms and appropriate diagnostic tools to establish their convergence, remains a very active research area.

1.2.1 Markov Chain Monte Carlo(MCMC)

A major limitation towards more widespread implementation of Bayesian approaches is that obtaining the posterior distribution often requires the integration of high-dimensional functions. This can be computationally very difficult, but several approaches short of direct integration have been proposed. A review can be found in Smith (1991), Evans & Swartz (1995), and Tanner (1996). Before launching into our discussion, we present a quick review of the problem. The problem is to simulate observations from a posterior distribution, obtained via Bayes’ Rule as

p(θ| x) ∝ f (x | θ)p(θ),

where f (x | θ) denoted the likelihood function and p(θ), the prior distribution for the vector of k model parameters θi, θ = (θ1, . . . , θk).

Since 1990s, the application of Bayesian approach was increased sharply, and proportion-ally, the Markov Chain Monte Carlo (MCMC) algorithms has become more and more im-portant. In the last decades MCMC used as the main tool to cope with statistical problems where a distribution is known up to a proportionality constant and some approximation of its features is sought for. In particular the large availability of easy-to-implement algorithms such as the Gibbs sampler and some of its variants allowed for an exponential growth of the routine application of simulation-based Bayesian inference. The theoretical development of these techniques and the enlargement of their scope of application have been boosted by their

(26)

extension to distributions supported on subspaces of variable dimension starting from Carlin & Chib (1995) and the Reversible Jump (RJ) of Green (1995).

Since then a lot of research efforts have been devoted to enlarging the availability of MCMC simulation techniques as well as the applicability of the available ones and also for trying to overcome some of their intrinsic difficulties; another crucial point still open to improvements is related to finding effective diagnostics to monitor appropriate convergence of the chain. Currently there is a substantial presence in the literature of applications of Bayesian inference via MCMC methods. However, it is apparent that a lot of expertise with these techniques is needed for a safe and successful implementation and often problem-specific difficulties can arise.

1.2.2 Gibbs sampling algorithm

The Gibbs sampler, introduced by Geman & Geman (1984), is a special case of Metropolis-Hastings sampling where the random value is always accepted, see 1.2.5. The task remains to specify how to construct a Markov chain whose values converge to the target distribution. The key to the Gibbs sampler is that, one only considers univariate conditional distributions. The univariate distribution is the distribution when the variable is random. Such conditional distributions are far easier to simulate than complex joint distributions and usually have simple forms often being normals, inverse χ2, or other common prior distributions. Thus, one simulates n random variables sequentially from the n univariate conditionals rather than generating a single n-dimensional vector in a single pass using the full joint distribution.

We know that Gibbs sampling, in its basic incarnation, is a special case of the Metropolis-Hastings algorithm. The point of Gibbs sampling is that, given a multivariate distribution, it is simpler to sample from a conditional distribution than to marginalize by integrating over a joint distribution. Suppose that we want to obtain N samples of x = (x1, . . . , xn) from a

joint distribution, f (x1, . . . , xn). Denote the ith sample by xi = (xi1, . . . , xin) . We proceed as

follows:

1. We begin with some initial values for each variable.

2. For each sample i = 1, 2, . . . , N , sample each variable xij from the conditional distri-bution f (xij | xi

−j) where xi−j means all the components xi1, . . . , xin except jth. That

is, sample each variable from the distribution of that variable conditioned on all other variables, making use of the most recent values and updating the variable with its new value as soon as it has been sampled.

The samples obtained are similar to the samples obtained if we were using the joint distri-bution of all variables, because, the marginal distridistri-bution of any subset of variables can be approximated by simply examining the samples for that subset of variables, ignoring the rest. In addition, the expected value of any variable can be approximated by averaging over all the samples. And the following steps in order to complete the Gibbs algorithm

• The initial values of the variables can be determined randomly or by some other algo-rithms such as expectation-maximization.

• It is not actually necessary to determine an initial value for the first variable sampled. • It is common to ignore some number of samples at the beginning (the so-called burn-in

(27)

an expectation. For example, the first 1, 000 samples might be ignored, and then every 20th sample averaged, throwing away all the rest. The reason for this is that successive samples are not independent of each other but form a Markov chain with some amount of correlation; the stationary distribution of the Markov chain is the desired joint dis-tribution over the variables, but it may take a while for that stationary disdis-tribution to be reached, so we must have a burn-in time. Sometimes, algorithms can be used to determine the amount of autocorrelation between samples and the value of n (the period between samples that are actually used) computed from this, but in practice there is a fair amount of ”black magic” involved.

• The process of simulated annealing is often used to reduce the ”random walk” behavior in the early part of the sampling process i.e., the tendency to move slowly around the sample space, with a high amount of autocorrelation between samples, rather than moving around quickly, as is desired. Other techniques that may reduce autocorrelation are collapsed Gibbs sampling, blocked Gibbs sampling, and ordered over-relaxation; In this part of our study, we need to use the estimators(see the estimation procedure in chapter 2) in order to do a computational study. Our study is from Bayesian point of view so obviously we use one of MCMC methods. In computational area we can use the Gibbs sampler which is one of the popular and useful methods of MCMC. According to Geman & Geman (1984), the Gibbs sampling is an MCMC technique for drawing dependent samples from complex high dimensional distributions. Gibbs sampling is involve little more than sampling from full conditional distributions, which can be both complex and computationally expensive to evaluate. Gilks & Wild (1992) have shown that in practice full conditionals are often log-concave distributions.

Since we may be facing a non-concave distribution functions, we use a method called Adaptive Rejection sampling introduced by Gilks & Wild (1992). The Adaptive Rejection Sampling is highly used for the problems that the probability density functions are log-concave. The same Authors, Gilks et al. (1995), introduced the generalization of ARS method which is called Adaptive Rejection Metropolis Sampling (ARMS).

ARMS method is introduced to deal with non-log-concave full conditional distributions. Gilks and Wild generalized ARS to include a Hastings–Metropolis algorithm step. Following Gilks (1992) in the next sections we introduce the Adaptive Rejection Sampling (ARS) algo-rithm, and then the Adaptive Rejection Metropolis Sampling (ARMS) algorithm in details. Here before describing ARS, we first describe the rejection sampling.

1.2.3 Rejection Sampling

Rejection Sampling, Ripley (1987), is a method for drawing independent samples from a distribution which is proportional to f (x). For this, we require a sampling distribution g(x), from which samples can be drawn and furthermore, there is a finite constant m such that

mg(x) = f (x), ∀x ∈ D,

where D is the domain of f . Gilks et al. (1995) introduced the following algorithm for Rejection Sampling (RS).

(28)

Step 2 : sample U from uniform(0,1); Step 3 : if U > f (X)/mg(X) then { rejection step; go back to step 1; } else { acceptance step: set XR= X; }; Step 4 : return XR.

Further iterations of steps 1 − 4 will produce independent samples from f .

1.2.4 ARS

When we are facing the Adaptive Rejection Sampling (ARS), the first question is ’what is Adaptive Rejection (AR)?’

”AR is adaptive” means that the envelope and the squeezing function which form upper and lower bounds to f (x), respectively converge to the density f (x) as sampling proceeds. The envelope and squeezing functions are piecewise exponentials. The adaptive nature of our technique enables samples to be drawn with few evaluations of f(x); therefore it will be useful in situations where the evaluation of f (x) is computationally expensive.

Now that we know what is adaptive Rejection Sampling method, we need to study ARS procedure. ARS reduces the number of evaluations of f (X) in Rejection Sampling, (RS), by improving the sampling density g(x) after each rejection so that m decreases monotonically. The improvement is made by incorporating into g(x), information about f (x) obtained at each of the previously rejected points. For univariate log-concave densities, this can be done by the method of Gilks & Wild (1992), or alternatively, by the method of Gilks (1992). For this, the domain D of f is an interval of the real line densities with respect to Lebesque measure, and we define log-concavity of f as

logf (a) − 2logf (b) + logf (c) < 0, ∀a, b, c ∈ D such that a < b < c.

This definition does not assume continuity in derivatives of f and includes, for example, linear and piecewise linear continuous functions.

Let Sn = xi, i = 0, ..., n + 1 denote a current set of distance of a coordinate from the the

vertical axis, (y − axis) of a graph, measured on a line which is parallel to the horizontal axis (x − axis) in ascending order, where x0 and xn+1 are the possible infinite lower and upper

limits of D. For 1 < i < j < n, let Lij(x; Sn) denote the straight line through the points

[xi, logf (xi)] and [xj, logf (xj)], and for other (i, j) let Lij(x; Sn) be undefined. In this part

one can define a piecewise linear function hn(x) as

hn(x) = min[Li−1,i(x; Sn), Li+1, i+2(x; Sn)], xi≤ x < xi+1, (1.3)

where we notationally suppress the dependence of hn(x) on Sn. Here we establish the

conven-tion that if b is undefined then min(a, b) = min(b, a) = a. As a consequence of the assumed log-concavity of f (x), hn(x) is an envelope for logf (x), i.e., hn(x) > logf (x) everywhere in

(29)

Figure 1.2: Adaptive rejection function h4(x) (−) constructed from equation (1.3) for a

long-concave function f (x) (Gilks et al. (1995)).

We can now perform rejection sampling with the sampling distribution given by gn(x) = 1 mn exp hn(X), (1.4) where mn= Z exp hn(X)dx.

Note that gn(x) is a piecewise exponential distribution and can be sampled directly (Gilks &

Wild 1992).

The important feature of the sampling distribution gn(x), defined in equation (1.4), is that

it can be updated each time that f (X) is evaluated. We have then the following algorithm for ARS:

Step 0 : initialize n and Sn;

Step 1 : sample X from gn(x);

Step 2 : sample U from uniform(0,1); Step 3 : if U > f (X)/ exp hn(X) then {

rejection step;

set Sn+1= SnS{X};

relabel points in Sn in ascending order;

(30)

else {

acceptance step: set XA= X; };

Step 4 : return XA.

The relabeling in step 3 is for notational consistency with equation (1.3). At each iteration of ARS, the number of points of contact between logf (x) and hn(x) is increased by 1, thereby

reducing m, and decreasing the probability of rejection at step 3. This is illustrated in Figure 1.3.

Figure 1.3: Updated current set S5 and rejection function h5(x) (−) after incorporating X

in Figure 1.2 (Gilks et al. (1995)).

Further iterations of steps 1 − 4 will produce independent samples from f , while hn(x) is

continually improving, making rejections increasingly less likely.

Gilks et al. (1995) showed that for densities f (x), which are not log-concave, ARS cannot be used as hn(x) may not be an envelope for log f (x). To deal with non-log-concave densities

they proposed to append a Hastings–Metropolis algorithm step to ARS. So first, we briefly describe the Hastings–Metropolis algorithm; then we discuss the ARMS.

1.2.5 Hastings–Metropolis Algorithm

The Metropolis algorithm, Metropolis et al. (1953), like the Gibbs sampler is an MCMC method. We describe the generalization of the algorithm given by Hastings (1970), which requires a proposal distribution π(. | Z) from which samples X can be drawn for any Z in D. Gilks et al. (1995) defined the Hastings–Metropolis algorithm as follows:

(31)

Step 1 : sample X from π(x | xi);

Step 2 : sample U from uniform(0,1); Step 3 : if U > min 1,f (x)π(xi|X) f (x)π(X|xi) { rejection step; set xi+1= xi; } else { acceptance step: set xi+1= X; }

Step 4 : increment i and go back to step 1.

After suitably many iterations of this algorithm, the samples xi can be considered to be

dependent samples from f (x). Tierney (1991) suggested the use of the Hastings–Metropolis algorithm within Gibbs sampling to sample from full conditional distributions. Indeed, this was the original form of the Metropolis algorithm. For this, x0 should be the value of x at

the beginning of the current Gibbs iteration, and xl will be the new value for x. Just one

iteration of steps 1 − 4 suffices to preserve the stationary distribution of the Gibbs chain. However, this chain may be slower to converge through rejections at step 3.

1.2.6 ARMS

As we mentioned before (in section 1.2.4) ARS can not be used to sample from non-log-concave distribution. To sample from such distributions, we could abandon rejection sampling in favor of the Hastings–Metropolis algorithm, applied to update one parameter (or one set of parameters) at a time. However, to avoid high probabilities of rejection (and hence slower convergence of the chain) it may be helpful to adapt the proposal density π to the shape of the full conditional density f . Since ARS provides a way of adapting a function to f , we propose to use ARS to create a good proposal density. We then append to ARS a single Hastings–Metropolis step, thus creating an ARMS within Gibbs chain. However, unlike ARS, ARMS will not produce independent samples from f .

Let (x, y) denote the complete set of variables being sampled by the Gibbs sampler. As before, x is the current variable to be sampled from its full conditional density proportional to f (x), where we notationally suppress the conditioning on y. Let Xcur denote the current

value of x at a given iteration of the Gibbs sampler. The aim then is to replace Xcur with a

new value XM from f .

For ARMS, we construct a function hn(X) which is slightly more complex than in

expres-sion (1.3)

hn(x) = max[Li,i+1(x, Sn), min{Li−1,i(x, Sn), Li+1,i+2(x, Sn)}], xi ≤ x ≤ xi+1, (1.5)

where, if b is undefined then

min(a, b) = min(b, a) = max(a, b) = max(b, a) = a.

In general, hn(x) will not be an envelope of log f (x), as illustrated in Figure 1.4, Gilks et al.

(1995).

The sampling density gn(x) is given by equation (1.4) as before. The algorithm defined

(32)

Figure 1.4: Adaptive rejection function h5(x) (−) for a non-log-concave function f (x),

constructed from equation (1.5) (Gilks et al. (1995)).

Step 0 : initialize n and Snindependently of Xcur;

Step 1 : sample X from gn(x);

Step 2 : sample U from uniform(0,1); Step 3 : if U > f (X)/exp hn(X) then {

ARS rejection step; rejection step;

set Sn+1= SnS{X};

relabel points in Sn in ascending order;

increment n and go back to step 1; } else {

ARS acceptance step; acceptance step: set XA= X; };

Step 4 : sample U from uniform(0,1);

Step 5 : if U > min 1,f (XA)min{f (Xcur), exp hn(Xcur)}

f (XA)min{f (XA), exp hn(XA)} then { Hastings–Metropolis rejection step;

set xM = xCU R; }

else {

Hastings–Metropolis acceptance step: set XM = XA; };

(33)

1.3

Thinning operator

We have different types of thinning operators which help us to develop different models for counting series, such as Binomial thinning operator, iterated thinning operator and etc. In this work, however, we use the Binomial thinning operator to develop our model. In the next section we introduce the Binomial thinning operator and some of its properties.

1.3.1 Binomial thinning operator

The thinning operation is used to count variables, where in a set of elements each element is selected (or eliminated) with a certain probability. For example, consider a cup full of balls, red and blue; a probability of taking out red ball is α, α ◦ X represent the number of red balls removed from the cup with replacement in X extraction.

The thinning operation is about a non-negative integer-valued random variable, X, which is defined by α ◦ X = X X i=1 Yi

where {Yi} is a sequence of independent identically distributed random variables with

param-eter α of Bernoulli distribution, p(Yi = 1) = α and 1 − p(Yi = 0) = α, independent of X.

A sequence Y1, Y2, . . . designed by countable series α ◦ X, and note that given X, α ◦ X has

Binomial distribution of parameters (X, α).

1.3.2 Properties of Binomial thinning operator

The thinning operation was introduced by Steutel & van Harn (1979) and studied by several authors. The properties of the thinning operation are presented in the next lemma, see Oliveira (2000).

Lemma 1.3.1 (Properties of thinning operation).

Let X, Y , and Z be non-negative integer-valued identically distributed random variables and α, β ∈ [0, 1], are non-negative real constants. So

0 ◦ Y = 0, 1 ◦ Y = Y,

and α ◦ (β ◦ Y ) = (αβ) ◦ Y , where ’=’ denote an equality in distribution.

(34)

of X, Y , and Z then

(i) E(α ◦ Y ) = αE(Y )

(ii) E(α ◦ Y )2 = α2E(Y2) + α(1 − α)E(Y ) (iii) E[X(α ◦ Y )] = αE(XY )

(iv) E[(α ◦ Y )(β ◦ Z)] = αβE(Y Z)

(v) E(α ◦ Y )3 = α3E(Y3) + 3α2(1 − α)E(Y2) + α(1 − α)(1 − 2α)E(Y ) (vi) E[X(α ◦ Y )2] = α2E(XY2) + α(1 − α)E(XY )

(vii) E[XY (β ◦ Z)] = βE(XY Z)

(viii) E[(α ◦ Y )2X] = α2E[XY2] + α(1 − α)E[XY ] (iix) E[X(α ◦ Y )(β ◦ Z)] = αβE[XY Z].

The proof of the lemma is available in Appendix A.

1.4

The IN AR(1) process

In the last decade there has been a growing interest in studying integer-valued time series and, in particular, time series of counts. In autoregressive process, we study a variable whose value at time t is denoted by Xtand in face of integer-valued variables are defining on a finite

or countably infinite integer sets. Examples of this process are the number of patients in an emergency of the hospital at a specific time interval, or the number of persons in a queue waiting for a bus.

We can define the IN AR(1) process as

Xt= α ◦ Xt−1+ t, t = 0, ±1, ±2, . . . , (1.6)

where α is a coefficient that belongs to interval [0, 1], t is a sequence of uncorrelated

non-negative integer-valued random variables which are independent identically distributed (i.i.d) with mean µ and finite variance σ2. We know also that Xt−1 and t are independent. The

”◦” is called thinning operator. In this work we study IN AR(1) using Binomial thinning operator.

First order Integer-valued AutoRegressive model, IN AR(1), was studied by Al-Osh & Alzaid (1987) where they studied the model based on the binomial thinning operator by considering the Poisson as a distribution of innovations. In Al-Osh & Aly (1992), they studied it using the Iterated thinning operator considering the distribution of innovations as Negative Binomial. The IN AR(1) model is used mostly for modeling and generating sequences of depending counting processes.

1.4.1 The IN AR(1) process with Binomial thinning operator

The Binomial thinning operator lets us to find that a realization of Xtcontain two random

components: the survivors of the elements of the process at time t − 1, Xt−1, each with

probability of survival α, denoted by α ◦ Xt−1, and the elements which enter in the system

in the interval ]t − 1, t], as innovation term(t). In this study we consider ’◦’ as a Binomial

(35)

In Al-Osh & Alzaid (1987), the authors defined the marginal distribution of (1.6) in terms of the innovation sequence {t} as

Xt=d ∞

X

j=0

αj◦ t−j. (1.7)

For α ∈ (0, 1), it can be seen from (1.7) that the dependency of Xt on the sequence {t}

decays exponentially with the time lag of this sequence. They also point out that

(Xt, Xt−k) = (αk◦ Xt−k+ k−1

X

i=0

αi◦ t−i, Xt−k).

The mean and the variance of the process {Xt} as defined in (1.6) are simply

E(Xt) = αE(Xt−1) + E(t) = αtE(X0) + E(t) t−1

X

j=0

αj, (1.8)

and

var(Xt) = α2var(Xt−1) + α(1 − α)E(Xt−1) + σ2

= α2tvar(X0) + (1 − α) t X j=1 α2j−1E(Xt−j) + σ2 t X j=1 α2(j−1). (1.9)

It can be seen from (1.8) and (1.9) that second-order stationary requires the initial value of the process, X0, as E(X0) = µ (1 − α), var(X0) = (αµ + σ2) (1 − α2) .

For any non-negative integer k, the covariance at lag k, γ(k), is γ(k) = cov(Xt−k, Xt) = cov(Xt−k, α ◦ Xt−k) + cov(Xt−k, k−1 X j=0 αj ◦ t−j) = αkvar(Xt−k) + k−1 X j=0 αjcov(Xt−k, t−k) = αkγ(0)

The above equation shows that auto-covariance function, γ(k), is always positive for INAR(1). Al-Osh & Alzaid (1987) showed that Xt has a Poisson distribution if and only if t has a

(36)

1.5

The likelihood function

The likelihood function defines, using the fact that an IN AR(1) process is a Markov process. Let x = (x1, x2, . . . , xn), be an observed sample from an IN AR(1) process, and let

Θ substitute by Θ = (α, θ, r), be the set of parameters for the IN AR(1) process. Leonenko et al. (2007) showed that the likelihood function is

L(x; Θ) = P (X1 = x1) Πnt=2 P (Xt= xt|Xt−1 = xt−1) = P (X1 = x1) Πnt=2 P (Xt= α ◦ xt−1+ t) = P (X1 = x1) Πnt=2 min(xt,xt−1) X r=0 xt−1 r  αr(1 − α)xt−1−rP ( t= xt− r)

Using this expression for the likelihood function the Maximum Likelihood estimator can be computed numerically.

(37)

Chapter 2

Estimation procedures

This chapter is devoted to the estimation procedure. First, we give some basic information about IN AR(1) with additive outliers. Then, we derive the appropriate estimators related to each parameter using conditional least square and Yule–Walker methods. Finally, in order to start the Bayesian estimation procedure, we need to construct the conditional maximum likelihood estimation.

2.1

Definition of IN AR(1) models with additive outliers

Assume that the observed time series Y1, Y2, . . . , Yn, is generated by

Yt= Xt+ ηtδt, 1 ≤ t ≤ n,

where Xt is a IN AR(1) process with negative binomial distribution for innovations and is

defined as

Xt= α ◦ Xt−1+ t

where ” ◦ ” is the Binomial thinning operator, t has a negative binomial distribution,

δ1, δ2, . . . , δn, are independent and identically distributed as Bernoulli with probability ,

Xt−1 and tare independents. This means that if, δt= 1, the observation Ytis contaminated

with probability , and with an Additive Outlier (AO) of magnitude ηt.

2.2

Primary estimation procedures for parameters

2.2.1 The Conditional Least Square estimators

For estimating IN AR(1), we face a complicated procedure. The complication arises from the fact that the conditional distribution of Xt given Xt−1 in the IN AR(1) process is the

convolution of the distribution of t with a Binomial with parameters (Xt−1, α). We assume

that the sequence {t} has a negative binomial distribution with parameters θ and r. So the

mean and the variance of tis defined as

E(t) = r(1 − θ) θ V ar(t) = r(1 − θ) θ2

(38)

In an IN AR(1) process, given Xt−1 and t, Xtis still a random variable. The conditional

mean of Xt given Xt−1 is given by

E(Xt| Xt−1) = αXt−1+

r(1 − θ)

θ ≡ g(θ, Xt−1),

where θ = (α, θ, r) is the set of parameters to be estimated. Klimko & Nelson (1978) devel-oped an estimation procedure which we are going to use here. The estimation is based on minimization of the sum of squared deviations about the conditional expectation. Therefore, the conditional least square (CLS) estimates for α, θ and r are those variables which minimize

Qn(θ) = n

X

t=1

[(Xt− g(θ, Xt−1))]2,

with respect to θ. The partial derivatives of Qn(θ) with respect to α, θ and r, are

∂Qn(θ) ∂α = −2 n X t=1  Xt−1  Xt− αXt−1− r(1 − θ) θ  ∂Qn(θ) ∂θ = −2 n X t=1  1 − θ θ  Xt− αXt−1− r(1 − θ) θ  ∂Qn(θ) ∂r = −2 r θ2 n X t=1  Xt− αXt−1− r(1 − θ) θ 

In order to find a minimizer for α, r, θ of Qn(θ), which is a multi-variable function, the

following should hold:

∂Qn(θ) ∂α = 0 (2.1) ∂Qn(θ) ∂θ = 0 ∂Qn(θ) ∂r = 0 Now (2.1) gives −2 n X t=1  Xt−1  Xt− αXt−1− r(1 − θ) θ  = 0, −2 n X t=1  (1 − θ) θ  Xt− αXt−1− r(1 − θ) θ  = 0, −2r θ2 n X t=1  Xt− αXt−1− r(1 − θ) θ  = 0,

(39)

and then n X t=1 Xt−1Xt− α n X t=1 Xt−12 −r(1 − θ) θ n X t=1 Xt−1 = 0 n X t=1 Xt−1Xt− α n X t=1 Xt−12 −nr(1 − θ) θ = 0 n X t=1 Xt−1Xt− α n X t=1 Xt−12 −nr(1 − θ) θ = 0

The CLS estimator is not capable of distinguish the parameters of the negative binomial, we can only obtain ˆµ = E(t) as the estimator of E(t). In order to obtain an estimation for

E(t) and the parameter α, we can apply the conditional expectation of Xt with respect to

Xt−1. So, we define ˆµ = r(1−θ)θ and one obtains

ˆ α = Pn t=1Xt−1Xt− ( Pn t=1Xt Pn t=1Xt−1) /n Pn t=1Xt−12 − ( Pn t=1Xt−1) 2 /n = n Pn t=1Xt−1Xt− (Pt=1n XtPnt=1Xt−1) nPn t=1Xt−12 − ( Pn t=1Xt−1) 2 ˆ µ = Pn t=1Xt− ˆαPnt=1Xt−1 n .

Naturally, we would expect that, for stationary processes, the estimates for the mean and the variance calculated from the first n observations, are very close to those calculated from the overall (n + 1) observations regarding the above mentioned relations, one observes that conditional least square in ˆµ and ˆα are very close to the estimators of other methods such as Yule–Walker.

It can be easily verified that the functions g, ∂α∂g, ∂g∂µ and ∂α∂µ∂2g satisfy all the regularity conditions found in Klimko & Nelson (1978) for g(θ). Consequently the conditional least square in ˆµ and ˆα are strongly consistent. Furthermore, we know that

E[(t− µ)3] =

r(θ − 1)(θ − 2)

θ3 , where 0 < θ < 1 and r > 0,

which tells us that E[3t] < ∞. Using the Klimko and Nelson conditions, ( ˆα, ˆµ) are asymp-totically normally distributed as

n1/2( ˆθ − θ0) ∼ M V N 0, V−1W V−1

where ˆθ = ( ˆα, ˆµ)0, and θ0 = α0, µ0 denotes the ’true’ value of the parameters. The matrix

V is a 2 × 2 matrix with elements Vij = E  ∂g(θ0, X t−1) ∂θi ,∂g(θ 0, X t−1) ∂θj  , i, j = 1, 2. by considering (θ1, θ2) = (α, µ), the elements of W have the form

Wij = E  u2t(θ0)∂g(θ 0, X t−1) ∂θi ,∂g(θ 0, X t−1) ∂θj  , i, j = 1, 2, with u2t(θ0) = Xt− g(θ0, Xt−1).

(40)

2.2.2 The Yule–Walker estimation

Yule–Walker estimation method for IN AR(1) parameters, was introduced by Al-Osh & Alzaid (1987). The Yule–Walker method, also called a method of moments, consists in re-placing the theoretical auto-covariance function by the correspondent sample auto-covariance function. In order to follow Silva (2005), we have to consider IN AR(p) model which was introduced by Alzaid & Al-Osh (1990). According to Du & Li (1991) definition, an IN AR(p) model is defined as Xt= p X i=1 αi◦ xt−i+ t, αi ≥ 0, i = 1, . . . , p − 1; αp> 0

where t is an i.i.d. sequence as before,

αi◦ xt−i∼ B(Xt−i, αi)

mutually independent and independent of t.

Let R(.) be the auto-covariance function of the IN AR(p) process {Xt},

R(k) = E [(Xt+k− E(X))(Xt− E(X))] .

Then R(.) satisfies a set of Yule–Walker type of different equations, which can be written respectively, in scalar and vectorial form, through

 R(0) = Vp+Ppi=1αiR(i) R(k) =Pp i=1αiR(i − k) ⇐⇒ Rpα =      −Vp 0 .. . 0      . Here Rp= [R(j, k)] = [R(p + 1 − k, p + 1 − j)] = [R(j − k)] . Thus, let ˆ R(k) = 1 N N −k X t=0 Xt− ¯X  Xt+k− ¯X , k ∈ Z,

be the sample auto-covariance function of X, where ¯X = N1 PN

t=1Xt, is the sample mean.

The Yule–Walker estimators of α1, α2, . . . , αp are obtained by solving the following system of

linear equations for IN AR(p):

ˆ Rp−1α = ˆˆ Rp ⇔      ˆ R(0) R(1)ˆ · · · R(p − 1)ˆ ˆ R(1) R(0)ˆ · · · R(p − 2)ˆ .. . ... · · · ... ˆ R(p − 1) R(p − 2)ˆ · · · R(0)ˆ           ˆ α1 ˆ α2 .. . ˆ αp      =      ˆ R(1) ˆ R(2) .. . ˆ R(p)      .

in order to the parameters.

The estimators for µ and σ2 are ˆ µ = X¯ 1 − p X i=1 ˆ αi ! , ˆ σ2 = Vˆp− ¯X p X i=1 ˆ σi2,

(41)

where ˆ Vp = R(0) −ˆ p X i=1 ˆ αiR(i),ˆ

and ˆσ2i is an estimator of the variance of the counting series for the ith thinning operation. Here we consider the Yule–Walker estimation for IN AR(1) time series with Binomial thinning operator and negative binomial distribution for innovations. So we have

Xt= α ◦ Xt−1+ t,

where t ∼ N B(r, θ). Also keep in mind that in the case of Binomial thinning operator,

α ◦ Xt−1 has Binomial distribution. So we can start using the Yule–Walker estimators in

particular case of p = 1, which are ˆ R(1) = α ˆˆ R(0), ˆ µ = X (1 − ˆ¯ α) , ˆ σ2 = R(0) − ˆˆ α ˆR(1) − ¯X ˆσE2, where ˆσE2 = ˆα(1 − ˆα) ˆ R(0) = 1 n n X t=1 (Xt− ¯X)2, ˆ R(1) = 1 n n−1 X t=1 (Xt− ¯X)(Xt−1− ¯X), and ˆ µ = E(ˆ t) = ˆr 1 − ˆθ ˆ θ , ˆ σ2 = V ar(ˆ t) = ˆr 1 − ˆθ ˆ θ2 .

So we have the following system              ˆ R(1) = ˆα ˆR(0) ˆ r1−ˆˆθ θ = ¯X (1 − ˆα) ˆ r1−ˆˆθ θ2 = ˆR(0) − ˆα ˆR(1) − ¯X ˆα(1 − ˆα) then                  ˆ α = R(1)ˆˆ R(0) ˆ r1−ˆˆθ θ = ¯X  1 −R(1)ˆˆ R(0)  ˆ r1−ˆˆθ θ2 = ˆR(0) − ˆ R(1)2 ˆ R(0) − ¯X ˆ R(1) ˆ R(0)(1 − ˆ R(1) ˆ R(0))

(42)

For the sake of brevity we call A = X¯ 1 −R(1)ˆ ˆ R(0) ! , B = R(0) −ˆ R(1)ˆ 2 ˆ R(0) − ¯X ˆ R(1) ˆ R(0)(1 − ˆ R(1) ˆ R(0)), and we have                ˆ α = R(1)ˆˆ R(0) ˆ r1−ˆˆθ θ = A ˆ r1−ˆˆθ θ2 = B ⇐⇒              ˆ α = R(1)ˆˆ R(0) ˆ r(1 − ˆθ) = Aˆθ ˆ r(1 − ˆθ) = B ˆθ2

where ˆθ 6= 0. Now, we need to find the relative expression for ˆα, ˆr and ˆθ. Therefore,                  ˆ α = R(1)ˆˆ R(0) ˆ r = A θˆ (1−ˆθ) A θˆ (1−ˆθ) (1−ˆθ) ˆ θ2 = B ⇐⇒                ˆ α = R(1)ˆˆ R(0) ˆ r = A θˆ (1−ˆθ) ˆ θ = AB Finally, we obtain the Yule–Walker estimators as

               ˆ α = R(1)ˆˆ R(0) ˆ r = B−AA2 ˆ θ = AB Substituting for A and B, we have

                         ˆ α = R(1)ˆˆ R(0) ˆ r = ¯ X  1−R(1)ˆˆ R(0)  ˆ R(0)−( ˆR(1))2ˆ R(0) − ¯X ˆ R(1) ˆ R(0)  1−R(1)ˆˆ R(0)  ˆ θ = ¯ X21−R(1)ˆ ˆ R(0) 2 ˆ R(0)−( ˆR(1))2ˆ R(0) − ¯X ˆ R(1) ˆ R(0)  1−R(1)ˆˆ R(0)  − ¯X1−R(1)ˆˆ R(0) 

By simplifying the above expressions, we have the Yule–Walker estimators as: •

ˆ

α = R(1)ˆ ˆ R(0)

(43)

• ˆ r = ¯ X1 −R(1)ˆˆ R(0)  ˆ R(0) −( ˆR(1))ˆ 2 R(0) − ¯X ˆ R(1) ˆ R(0)  1 −R(1)ˆˆ R(0)  • ˆ θ = ¯ X21 −R(1)ˆˆ R(0) 2 ˆ R(0) −( ˆR(1))ˆ 2 R(0) − ¯X  1 − ˆ R(1) ˆ R(0) 2

Now that we have obtained the expressions of parameters using CLS and Yule–Walker we can decide which one can be used in practical part. We use this estimators to obtain the initial value for our parameters in R. Using CLS, we have obtained two expression: one for ˆ

α, and the other one for ˆµ = r(1−θ)θ . These estimates can not be used, because we need to find the initial value for ˆα, ˆr and ˆθ. We can find an expression for each one of the parameters using CLS, but we need to use a nonlinear method. Since in this work, we only use linear methods to find an estimator for the parameters, we choose the Yule–Walker estimators to use in the simulation of our model in practical part of our study.

2.3

Bayesian estimation procedure for IN AR(1) model with

and without additive outliers

2.3.1 The conditional maximum likelihood estimation

In this part we give some necessary information for calculating the conditional maximum likelihood. First we need to see what is the distribution of

Xt|Xt−1

To answer these question we note that

Xt= α ◦ Xt−1+ t.

So we have the distribution of Xt|Xt−1 :

Xt| Xt−1 = α ◦ Xt−1| Xt−1+ t| Xt−1.

Since t and Xt−1 are independent, then

Xt| Xt−1 = α ◦ Xt−1| Xt−1 + t.

We also know that

α ◦ Xt−1| (Xt−1= xt−1) ∼ B(xt−1, α),

and

(44)

Now we call Z = α ◦ Xt−1| (Xt−1= z) and W = t , then

Z ∼ B(z, α), and

W ∼ N B(r, θ).

With this information, we can start auxiliary calculations for p(Z + W = l).

p(Z + W = l) = z X i=0 p(Z = i) · p(W = l − i | Z = i) (2.2) = z X i=0 p(Z = i) · p(W = l − i) = z X i=0 z i  αi(1 − α)z−i×r + l − i − 1 l − i  θr(1 − θ)l−i. The following conditional probabilities are computed using (2.2)

p(X2 = x2| X1 = x1) = x1 X i=0 x1 i  αi(1 − α)x1−i×r + x2− i − 1 x2− i  θr(1 − θ)x2−i .. . p(Xn= xn| Xn−1= xn−1) = n Y t=2 min(xn−1,xn) X i=0 xn−1 i  αi(1 − α)xn−1−i ×r + xn− i − 1 xn− i  θr(1 − θ)xn−i. Therefore, the conditional likelihood function of Θ, δ, η,  is

L(Θ, δ, η, ) = n Y t=2 min(xt−1,xt) X i=0 xt−1 i  αi(1 − α)xt−1−ir + xt− i − 1 xt− i  θr(1 − θ)xt−i. where Θ = (θ, α, r), θ is the probability of success, and r is the number of successes. Fur-thermore, we know that δ = (δ1, δ2, . . . , δn) are independent identical distributed Bernoulli

variables and η = (η1, . . . , ηn) are i.i.d Poisson variables. We can obtain conditional likelihood

estimators by looking forward the expressions which minimizes L in order to α, r and θ.

2.3.2 Bayesian estimation procedure for parameters

Consider an outlier model as

Yt= Xt+ ηtδt, 1 ≤ t ≤ n,

with

(45)

Let 0 < α < 1, and t∼ N B(r, θ). We know that the probability function has the following form. P (t= et) = r + et− 1 et  θr(1 − θ)et, e t= 0, 1, 2, . . . ,

where θ is the probability of success, and r is the number of successes. Furthermore, we know that δ1, δ2, . . . , δn are independent identical distributed Bernoulli variables with

p(δt= 1) = ,

i.e.,

δt∼ B(1, ).

Moreover, the variables η1, η2, . . . , ηn, are independent and identically Poisson distributed,

i.e.,

ηt∼ P o(β), t = 1, 2, . . . , n,

where ηt, is the magnitude of the outlier in time t, and β is a positive integer number.

Now, to apply Bayesian approach we need prior distributions for the following parameters (α, r, θ | {z } Θ , δ1, . . . , δn | {z } δ , η1, . . . , ηn | {z } η , ).

The variable α has a Beta distribution, i.e.,

α ∼ Be(a, b), a, b > 0, One should also consider r as a Poisson distribution,

r ∼ P o(µ), µ > 0 where θ has a Beta distribution,

θ ∼ Be(c, d), c, d > 0. Finally, we consider  as a Beta distribution,

 ∼ Be(f, g), f, g > 0

So the set of hyper-parameters a, b, c, d, r, β, f, g, µ are assumed to be known. Also keep in mind that mean of the Bayesian estimates of the parameters will be used. Let π(Θ, δ, η, ) denote the prior distribution for (Θ, δ, η, ). Using the above information, about independency of parameters, one finds the prior distribution for (Θ, δ, η, ) as

$$\pi(\Theta, \delta, \eta, \epsilon) = \pi(\Theta)\cdot\pi(\delta)\cdot\pi(\eta)\cdot\pi(\epsilon),$$
where $0 < \alpha < 1$, $r \in \mathbb{N}_0$, $0 < \theta < 1$, $\eta_t \in \mathbb{N}_0$ and $0 < \epsilon < 1$.


Also, we have
$$\pi(\alpha) \propto \alpha^{a-1}(1-\alpha)^{b-1}, \quad 0 < \alpha < 1,$$
$$\pi(r) = e^{-\mu}\frac{\mu^{r}}{r!}, \quad r = 0, 1, 2, \ldots,$$
$$\pi(\theta) \propto \theta^{c-1}(1-\theta)^{d-1}, \quad 0 < \theta < 1,$$
$$\pi(\eta) = \prod_{t=1}^{n}\pi(\eta_t) = \prod_{t=1}^{n} e^{-\beta}\frac{\beta^{\eta_t}}{\eta_t!}, \quad \eta_t = 0, 1, \ldots, \quad t = 1, 2, \ldots, n,$$
$$\pi(\epsilon) \propto \epsilon^{f-1}(1-\epsilon)^{g-1}, \quad 0 < \epsilon < 1.$$
Therefore $\pi(\Theta, \delta, \eta, \epsilon)$ is proportional to

$$\alpha^{a-1}(1-\alpha)^{b-1}\cdot e^{-\mu}\frac{\mu^{r}}{r!}\cdot \theta^{c-1}(1-\theta)^{d-1}\cdot \prod_{t=1}^{n} e^{-\beta}\frac{\beta^{\eta_t}}{\eta_t!}\cdot \epsilon^{f-1}(1-\epsilon)^{g-1}.$$
As we have seen in Section 2.3.1, the conditional likelihood function for $x_1, x_2, \ldots, x_n$ is

$$L(\Theta, \delta, \eta, \epsilon) = \prod_{t=2}^{n} P(X_t = x_t \mid X_{t-1} = x_{t-1}), \qquad \text{where } X_t = Y_t - \eta_t\delta_t,$$
and finally
$$L(\Theta, \delta, \eta, \epsilon) = \prod_{t=2}^{n}\sum_{i=0}^{M_t}\binom{x_{t-1}}{i}\alpha^{i}(1-\alpha)^{x_{t-1} - i}\binom{r + x_t - i - 1}{x_t - i}\theta^{r}(1-\theta)^{x_t - i},$$
where $M_t = \min(x_{t-1}, x_t)$.

The posterior distribution of $(\Theta, \delta, \eta, \epsilon)$ is given by
$$\pi(\Theta, \delta, \eta, \epsilon \mid Y) \propto L(\Theta, \delta, \eta, \epsilon)\cdot\pi(\Theta, \delta, \eta, \epsilon).$$
Considering the distributions of the variables, $\pi(\Theta, \delta, \eta, \epsilon \mid Y)$ is proportional to
$$\alpha^{a-1}(1-\alpha)^{b-1}\cdot e^{-\mu}\frac{\mu^{r}}{r!}\cdot \theta^{c-1}(1-\theta)^{d-1}\cdot \prod_{t=1}^{n} e^{-\beta}\frac{\beta^{\eta_t}}{\eta_t!}\cdot \epsilon^{f-1}(1-\epsilon)^{g-1}\cdot L(\Theta, \delta, \eta, \epsilon),$$
with $0 < \alpha < 1$, $0 < \epsilon < 1$, $\beta > 0$, $\mu > 0$, $\eta_t = 0, 1, \ldots$ and $t = 2, 3, \ldots, n$.
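This kernel can be evaluated directly on the log scale, which is convenient for the Metropolis steps used later. The sketch below is again ours: it reuses trans_prob() from the earlier sketch, takes illustrative default values for the hyper-parameters, and implements exactly the expression displayed above, up to an additive constant.

```r
# Unnormalized log-posterior of (alpha, r, theta, eta, epsilon) given y and delta,
# with x_t = y_t - eta_t * delta_t; reuses trans_prob() from the earlier sketch.
log_posterior <- function(alpha, r, theta, eps, eta, delta, y,
                          a = 1, b = 1, c = 1, d = 1, f = 1, g = 1,
                          mu = 2, beta = 8) {
  x <- y - eta * delta
  n <- length(y)
  loglik <- sum(log(mapply(trans_prob, xt = x[2:n], xtm1 = x[1:(n - 1)],
                           MoreArgs = list(alpha = alpha, r = r, theta = theta))))
  logprior <- dbeta(alpha, a, b, log = TRUE) +     # Be(a, b) prior on alpha
              dpois(r, mu, log = TRUE) +           # Po(mu) prior on r
              dbeta(theta, c, d, log = TRUE) +     # Be(c, d) prior on theta
              sum(dpois(eta, beta, log = TRUE)) +  # Po(beta) priors on eta_t
              dbeta(eps, f, g, log = TRUE)         # Be(f, g) prior on epsilon
  loglik + logprior
}
```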

In this part we need the full conditional distributions of $\alpha$, $r$, $\theta$, $\delta$, $\eta$ and $\epsilon$ in order to apply Gibbs sampling to estimate the parameters. These distributions are given by:

1. For $\alpha$: $\pi(\alpha \mid Y, r, \theta, \delta, \eta, \epsilon) \propto L(\Theta, \delta, \eta, \epsilon)\cdot\pi(\alpha)$.

2. For $r$: $\pi(r \mid Y, \alpha, \theta, \delta, \eta, \epsilon) \propto L(\Theta, \delta, \eta, \epsilon)\cdot\pi(r)$.

3. For $\theta$: $\pi(\theta \mid Y, r, \alpha, \delta, \eta, \epsilon) \propto L(\Theta, \delta, \eta, \epsilon)\cdot\pi(\theta)$.


We know that
$$\pi(\alpha) \propto \alpha^{a-1}(1-\alpha)^{b-1}, \quad 0 < \alpha < 1, \qquad \pi(r) = e^{-\mu}\frac{\mu^{r}}{r!}, \quad r = 0, 1, 2, \ldots, \qquad \pi(\theta) \propto \theta^{c-1}(1-\theta)^{d-1}, \quad 0 < \theta < 1.$$
Using the likelihood function we have:

1. The full conditional probability function for $\alpha$, with $0 < \alpha < 1$, is
$$\pi(\alpha \mid Y, r, \theta, \delta, \eta, \epsilon) \propto \prod_{t=2}^{n}\sum_{i=0}^{M_t}\binom{x_{t-1}}{i}\alpha^{i}(1-\alpha)^{x_{t-1} - i}\binom{r + x_t - i - 1}{x_t - i}\theta^{r}(1-\theta)^{x_t - i} \times \alpha^{a-1}(1-\alpha)^{b-1}.$$

2. The full conditional probability function for $r$, with $r = 0, 1, \ldots$, is
$$\pi(r \mid Y, \alpha, \theta, \delta, \eta, \epsilon) \propto \prod_{t=2}^{n}\sum_{i=0}^{M_t}\binom{x_{t-1}}{i}\alpha^{i}(1-\alpha)^{x_{t-1} - i}\binom{r + x_t - i - 1}{x_t - i}\theta^{r}(1-\theta)^{x_t - i} \times e^{-\mu}\frac{\mu^{r}}{r!}.$$

3. The full conditional probability function for $\theta$, with $0 < \theta < 1$, is
$$\pi(\theta \mid Y, r, \alpha, \delta, \eta, \epsilon) \propto \prod_{t=2}^{n}\sum_{i=0}^{M_t}\binom{x_{t-1}}{i}\alpha^{i}(1-\alpha)^{x_{t-1} - i}\binom{r + x_t - i - 1}{x_t - i}\theta^{r}(1-\theta)^{x_t - i} \times \theta^{c-1}(1-\theta)^{d-1}.$$

4. The full conditional probability function for $p_j$.

For each $j = 2, 3, \ldots, n$, $\delta_j \mid (Y, r, \alpha, \theta, \eta, \epsilon, \delta_{(-j)})$ has a Bernoulli distribution with parameter $p_j$, where $\delta_{(-j)}$ denotes the vector $\delta$ with the $j$th component deleted.

Accordingly, we have
$$p_j = p(\delta_j = 1 \mid Y, r, \alpha, \theta, \eta, \epsilon, \delta_{(-j)}) = \frac{p(\delta_j = 1, Y \mid r, \alpha, \theta, \eta, \epsilon, \delta_{(-j)})}{p(Y \mid r, \alpha, \theta, \eta, \epsilon, \delta_{(-j)})}.$$
Since
$$p(\delta_j = 1 \mid r, \alpha, \theta, \eta, \epsilon, \delta_{(-j)}) = \epsilon,$$
the denominator of the above relation becomes
$$p(Y = y \mid r, \alpha, \theta, \eta, \epsilon, \delta_{(-j)}) = \epsilon\, f(Y \mid \delta_j = 1, r, \alpha, \theta, \eta, \epsilon, \delta_{(-j)}) + (1-\epsilon)\, f(Y \mid \delta_j = 0, r, \alpha, \theta, \eta, \epsilon, \delta_{(-j)}).$$
Therefore,
$$p_j = \frac{\epsilon\, f(Y \mid \delta_j = 1, r, \alpha, \theta, \eta, \epsilon, \delta_{(-j)})}{\epsilon\, f(Y \mid \delta_j = 1, r, \alpha, \theta, \eta, \epsilon, \delta_{(-j)}) + (1-\epsilon)\, f(Y \mid \delta_j = 0, r, \alpha, \theta, \eta, \epsilon, \delta_{(-j)})},$$
where $C = f(Y \mid \delta_j = 0, r, \alpha, \theta, \eta, \epsilon, \delta_{(-j)})$. To compute $f(Y \mid \delta_j = 1, r, \alpha, \theta, \eta, \epsilon, \delta_{(-j)})$ we note that

$$p(Y \mid \delta_j = 1, r, \alpha, \theta, \eta, \epsilon, \delta_{(-j)}) = f(x_j, x_{j+1} \mid \delta_j = 1, r, \alpha, \theta, \eta, \epsilon, \delta_{(-j)}).$$
We know that $x_j$ and $x_{j+1}$ are not independent, so
$$f(x_j, x_{j+1} \mid \delta_j = 1, r, \alpha, \theta, \eta, \epsilon, \delta_{(-j)}) = f(x_j \mid x_{j-1}, \delta_j = 1, r, \alpha, \theta, \eta, \epsilon, \delta_{(-j)})\cdot f(x_{j+1} \mid x_j, \delta_j = 1, r, \alpha, \theta, \eta, \epsilon, \delta_{(-j)}).$$
Let
$$A = f(x_j \mid x_{j-1}, \delta_j = 1, r, \alpha, \theta, \eta, \epsilon, \delta_{(-j)}) \qquad \text{and} \qquad B = f(x_{j+1} \mid x_j, \delta_j = 1, r, \alpha, \theta, \eta, \epsilon, \delta_{(-j)}).$$

To find the expressions of $A$ and $B$ we perform the following calculations.

(a) When there is an outlier at instant $j$, i.e., $\delta_j = 1$:

• For $A$:

$$A = f(x_j \mid \delta_j = 1, x_{j-1}, r, \alpha, \theta, \eta, \epsilon, \delta_{(-j)}) = p(X_j = x_j \mid \delta_j = 1, x_{j-1}, r, \alpha, \theta, \eta, \epsilon, \delta_{(-j)}) = p(\alpha \circ X_{j-1} + e_j = x_j \mid \delta_j = 1, x_{j-1}, r, \alpha, \theta, \eta, \epsilon, \delta_{(-j)}).$$
We know that when $\delta_j = 1$ we have $X_j = y_j - \eta_j$. Therefore,
$$A = p(\alpha \circ X_{j-1} + e_j = y_j - \eta_j \mid \delta_j = 1, x_{j-1}, r, \alpha, \theta, \eta, \epsilon, \delta_{(-j)}) = \sum_{i=0}^{M_j} p(\alpha \circ X_{j-1} = i,\; e_j = y_j - \eta_j - i \mid \delta_j = 1, x_{j-1}, r, \alpha, \theta, \eta, \epsilon, \delta_{(-j)}),$$
where $x_{j-1} < x_j$. Using the independence between $\alpha \circ X_{j-1} = i$ and $e_j = y_j - \eta_j - i$, we have
$$A = \sum_{i=0}^{M_j} p(\alpha \circ X_{j-1} = i \mid \delta_j = 1, x_{j-1}, r, \alpha, \theta, \eta, \epsilon, \delta_{(-j)}) \times p(e_j = y_j - \eta_j - i \mid \delta_j = 1, x_{j-1}, r, \alpha, \theta, \eta, \epsilon, \delta_{(-j)}).$$
Using the distributions that we defined for the parameters, we obtain
$$A = \sum_{i=0}^{M_j} \binom{x_{j-1}}{i}\alpha^{i}(1-\alpha)^{x_{j-1} - i}\,\binom{r + y_j - \eta_j - i - 1}{y_j - \eta_j - i}\theta^{r}(1-\theta)^{y_j - \eta_j - i},$$
where $i = 0, 1, 2, \ldots, x_{j-1}$; $y_j - \eta_j = i, i+1, i+2, \ldots$; and $M_j = \min(x_{j-1}, x_j)$, which, using $X_j = y_j - \eta_j$, can be rewritten as $M_j = \min(x_{j-1},\, y_j - \eta_j)$.


• For $B$:
$$B = f(x_{j+1} \mid x_j, \delta_j = 1, r, \alpha, \theta, \eta, \epsilon, \delta_{(-j)}) = p(X_{j+1} = x_{j+1} \mid \delta_j = 1, x_j, r, \alpha, \theta, \eta, \epsilon, \delta_{(-j)}) = p(\alpha \circ X_j + e_{j+1} = x_{j+1} \mid \delta_j = 1, x_j, r, \alpha, \theta, \eta, \epsilon, \delta_{(-j)}) = \sum_{i=0}^{M_j^{*}} p(\alpha \circ X_j = i,\; e_{j+1} = x_{j+1} - i \mid \delta_j = 1, x_j, r, \alpha, \theta, \eta, \epsilon, \delta_{(-j)}).$$
Since $\alpha \circ X_j = i$ and $e_{j+1} = x_{j+1} - i$ are independent,
$$B = \sum_{i=0}^{M_j^{*}} p(\alpha \circ X_j = i \mid \delta_j = 1, x_j, r, \alpha, \theta, \eta, \epsilon, \delta_{(-j)})\cdot p(e_{j+1} = x_{j+1} - i \mid \delta_j = 1, x_j, r, \alpha, \theta, \eta, \epsilon, \delta_{(-j)}),$$
and, using the distributions of the parameters,
$$B = \sum_{i=0}^{M_j^{*}} \binom{x_j}{i}\alpha^{i}(1-\alpha)^{x_j - i}\,\binom{r + x_{j+1} - i - 1}{x_{j+1} - i}\theta^{r}(1-\theta)^{x_{j+1} - i} = \sum_{i=0}^{M_j^{*}} \binom{y_j - \eta_j}{i}\alpha^{i}(1-\alpha)^{y_j - \eta_j - i}\,\binom{r + y_{j+1} - \eta_{j+1} - i - 1}{y_{j+1} - \eta_{j+1} - i}\theta^{r}(1-\theta)^{y_{j+1} - \eta_{j+1} - i},$$
where $x_j = y_j - \eta_j$ and $i = 0, 1, 2, \ldots, M_j^{*}$, with $M_j^{*} = \min(x_j, x_{j+1}) = \min(y_j - \eta_j,\, y_{j+1} - \eta_{j+1})$.

(b) If there is no outlier at instant $j$, i.e., $\delta_j = 0$, then $x_j = y_j$. Now we can give the expression of $C$:

• For $C$:
$$C = f(Y \mid \delta_j = 0, r, \alpha, \theta, \eta, \epsilon, \delta_{(-j)}) = \prod_{t=j}^{j+1} p(X_t = x_t \mid X_{t-1} = x_{t-1}, \delta_j = 0, r, \alpha, \theta, \eta, \epsilon, \delta_{(-j)}).$$

We know that $X_t = \alpha \circ X_{t-1} + e_t$, so we have
$$C = \prod_{t=j}^{j+1} p(\alpha \circ X_{t-1} + e_t = x_t \mid X_{t-1} = x_{t-1}, \delta_j = 0, r, \alpha, \theta, \eta, \epsilon, \delta_{(-j)}),$$
and, since $\alpha \circ X_{t-1} = i$ and $e_t = x_t - i$ are independent,
$$C = \prod_{t=j}^{j+1}\sum_{i=0}^{M_t} p(\alpha \circ X_{t-1} = i \mid X_{t-1} = x_{t-1}, \delta_j = 0, r, \alpha, \theta, \eta, \epsilon, \delta_{(-j)}) \times p(e_t = x_t - i \mid X_{t-1} = x_{t-1}, \delta_j = 0, r, \alpha, \theta, \eta, \epsilon, \delta_{(-j)}).$$


Now, using the distributions of the parameters,
$$C = \prod_{t=j}^{j+1}\sum_{i=0}^{M_t} \binom{x_{t-1}}{i}\alpha^{i}(1-\alpha)^{x_{t-1} - i}\,\binom{r + x_t - i - 1}{x_t - i}\theta^{r}(1-\theta)^{x_t - i},$$
where $M_t = \min(x_t, x_{t-1})$. In this case $x_t$ and $x_{t-1}$ are given.

5. The full conditional probability function for $\eta_j$.

To estimate the magnitude of the outlier, $\eta_j$, $j = 2, 3, \ldots, n$, we need to consider two situations.

(a) If there is no outlier at $t = j$, i.e., $\delta_j = 0$, then
$$\eta_j \mid Y, \alpha, r, \theta, \epsilon, \delta, \eta_{(-j)} \;\equiv\; \eta_j \mid \alpha, r, \theta, \epsilon, \delta_j = 0,$$
and the full conditional of $\eta_j$ reduces to its prior distribution, $\eta_j \sim Po(\beta)$, defined in the first part of this chapter.

(b) If there is an outlier at $t = j$, i.e., $\delta_j = 1$, we need the distribution of (2.3),
$$\eta_j \mid Y, \alpha, r, \theta, \epsilon, \delta_j = 1. \qquad (2.3)$$
The distribution of (2.3) is
$$p(\eta_j \mid Y, \alpha, r, \theta, \epsilon, \delta_j = 1) = \frac{p(\eta_j, Y \mid \alpha, r, \theta, \epsilon, \delta_j = 1)}{p(Y = y \mid \alpha, r, \theta, \epsilon, \delta_j = 1)},$$
which can be rewritten as
$$\frac{p(\eta_j \mid \alpha, r, \theta, \epsilon, \delta_j = 1)\cdot p(Y \mid \alpha, r, \theta, \epsilon, \delta_j = 1, \eta_j)}{\sum_{\eta_j = 0}^{+\infty} p(\eta_j \mid \alpha, r, \theta, \epsilon, \delta_j = 1)\cdot p(Y = y \mid \alpha, r, \theta, \epsilon, \delta_j = 1, \eta_j)}. \qquad (2.4)$$
The denominator of (2.4) is a constant; moreover, $\eta_j$ and $\delta_j$ are independent, therefore
$$p(\eta_j \mid Y, \alpha, r, \theta, \epsilon, \delta_j = 1) \propto p(\eta_j \mid \alpha, r, \theta, \epsilon)\cdot p(Y \mid \alpha, r, \theta, \epsilon, \delta_j = 1, \eta_j),$$

where $\eta_j = 0, 1, 2, \ldots$. Now, using the distributions defined above, we have
$$p(\eta_j \mid \alpha, r, \theta, \epsilon) = e^{-\beta}\frac{\beta^{\eta_j}}{\eta_j!},$$
and
$$p(Y \mid \alpha, r, \theta, \epsilon, \delta_j = 1, \eta_j) = f(X_j, X_{j+1} \mid \alpha, r, \theta, \epsilon, \delta_j = 1, \eta_j) = A \times B,$$
where $A$ and $B$ have been computed before.

6. The full conditional probability function for $\epsilon$.

Since the prior distribution of $\epsilon$ is $Be(f, g)$, the conditional posterior is
$$\epsilon \mid Y, \alpha, r, \theta, \eta, \delta \sim Be(f + k,\; g + n - 1 - k),$$
where $k$ is the number of components of $\delta$ that are equal to 1. A short sketch in R of these updates for $\delta_j$, $\eta_j$ and $\epsilon$ is given below.
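The following sketch shows how the full conditionals of items 4–6 translate into one sweep of the sampler, for fixed $(\alpha, r, \theta)$. It reuses trans_prob() from the earlier sketch, loops only over $j = 2, \ldots, n-1$ (the last time point, which involves only the factor $A$, is omitted for brevity), and all names are ours rather than those of Appendix D.

```r
# One sweep over the outlier components for fixed (alpha, r, theta):
# delta_j from its Bernoulli full conditional, eta_j from its discrete full
# conditional when delta_j = 1 (prior draw otherwise), and epsilon from its
# Beta full conditional. A sketch only; reuses trans_prob().
update_outliers <- function(y, delta, eta, alpha, r, theta, eps,
                            beta = 8, f = 1, g = 1) {
  n <- length(y)
  for (j in 2:(n - 1)) {
    x <- y - eta * delta                        # current underlying series
    # A * B: density of (x_j, x_{j+1}) when there is an outlier at j (x_j = y_j - eta_j)
    AB <- trans_prob(y[j] - eta[j], x[j - 1], alpha, r, theta) *
          trans_prob(x[j + 1], y[j] - eta[j], alpha, r, theta)
    # C: the same density when there is no outlier at j (x_j = y_j)
    C <- trans_prob(y[j], x[j - 1], alpha, r, theta) *
         trans_prob(x[j + 1], y[j], alpha, r, theta)
    pj <- eps * AB / (eps * AB + (1 - eps) * C)
    delta[j] <- rbinom(1, 1, pj)
    if (delta[j] == 1) {
      # discrete full conditional of eta_j, proportional to Po(beta) x A x B,
      # restricted to the feasible grid 0, ..., y_j
      grid <- 0:y[j]
      w <- dpois(grid, beta) *
           sapply(grid, function(h) trans_prob(y[j] - h, x[j - 1], alpha, r, theta) *
                                    trans_prob(x[j + 1], y[j] - h, alpha, r, theta))
      eta[j] <- grid[sample.int(length(grid), 1, prob = w)]
    } else {
      eta[j] <- rpois(1, beta)                  # reduces to the prior when delta_j = 0
    }
  }
  k   <- sum(delta)                             # number of delta_j equal to 1
  eps <- rbeta(1, f + k, g + n - 1 - k)         # Be(f + k, g + n - 1 - k)
  list(delta = delta, eta = eta, eps = eps)
}
```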


Chapter 3

Computational illustrations

In this chapter we use the full conditional distributions of $\alpha$, $\theta$, $r$, $\delta = (\delta_2, \delta_3, \ldots, \delta_n)$, $\eta = (\eta_2, \eta_3, \ldots, \eta_n)$ and $\epsilon$ to draw a sample of a Markov chain which converges to the joint posterior distribution of the parameters. The purpose of the computational study is to simulate the INAR(1) model contaminated with additive outliers, using the expressions of Chapter 2, and to check, with the programs written in R (see Appendix D), whether the proposed procedure works properly, i.e., whether it detects the outliers or not.

In most cases, we cannot sample directly from the full conditionals. Since these conditional densities are not log-concave, we use a Metropolis step within the Gibbs methodology. In particular, the Adaptive Rejection Metropolis Sampling - ARMS, Gilks et al. (1995) - is used inside the Gibbs sampler. When the number of iterations is sufficiently large, the Gibbs draws can be regarded as a sample from the joint posterior distribution. Accordingly, there are two key issues in a successful implementation of this methodology:

1. Deciding the length of the chain.

2. The burn-in period and establishing the convergence of the chain.

We use a burn-in period of M iterations; then we iterate the Gibbs sampler for a further N iterations, retaining only every Lth value. This thinning strategy reduces the autocorrelation within the chain (Silva & Pereira, 2012). The Bayesian estimates of the parameters are the sample means of the retained values.
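As one possible skeleton of this scheme, the following R sketch runs Metropolis-within-Gibbs with independence proposals drawn from the priors for $\alpha$, $\theta$ and $r$; this is a simple, valid stand-in for the ARMS updates, not the implementation of Appendix D. Here M is the burn-in, N the number of further iterations, L the thinning lag, and the hyper-parameter defaults and starting values are illustrative; it reuses cond_loglik() and update_outliers() from the earlier sketches.

```r
# Metropolis-within-Gibbs skeleton: independence proposals from the priors for
# alpha, theta and r (acceptance then depends only on the conditional likelihood
# ratio), followed by the outlier and epsilon updates.
run_chain <- function(y, M = 1000, N = 10000, L = 5,
                      a = 1, b = 1, c = 1, d = 1, f = 1, g = 1,
                      mu = 2, beta = 8) {
  n <- length(y)
  alpha <- 0.5; theta <- 0.5; r <- max(1, rpois(1, mu)); eps <- 0.05
  delta <- rep(0L, n); eta <- rpois(n, beta)
  draws <- matrix(NA_real_, floor(N / L), 4,
                  dimnames = list(NULL, c("alpha", "r", "theta", "eps")))
  p_hat <- rep(0, n); kept <- 0
  for (it in seq_len(M + N)) {
    x  <- y - eta * delta
    ll <- cond_loglik(x, alpha, r, theta)
    # alpha: propose from its Be(a, b) prior; accept with the likelihood ratio
    ap <- rbeta(1, a, b); llp <- cond_loglik(x, ap, r, theta)
    if (log(runif(1)) < llp - ll) { alpha <- ap; ll <- llp }
    # theta: propose from its Be(c, d) prior
    tp <- rbeta(1, c, d); llp <- cond_loglik(x, alpha, r, tp)
    if (log(runif(1)) < llp - ll) { theta <- tp; ll <- llp }
    # r: propose from its Po(mu) prior; rp = 0 is skipped (NB needs r >= 1)
    rp <- rpois(1, mu)
    if (rp >= 1) {
      llp <- cond_loglik(x, alpha, rp, theta)
      if (log(runif(1)) < llp - ll) r <- rp
    }
    # outlier indicators, magnitudes and epsilon
    upd <- update_outliers(y, delta, eta, alpha, r, theta, eps, beta, f, g)
    delta <- upd$delta; eta <- upd$eta; eps <- upd$eps
    # keep every L-th sweep after the burn-in of M iterations
    if (it > M && (it - M) %% L == 0) {
      kept <- kept + 1
      draws[kept, ] <- c(alpha, r, theta, eps)
      p_hat <- p_hat + (delta - p_hat) / kept   # running mean of the indicators
    }
  }
  list(draws = draws, p_hat = p_hat)
}
```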

Once the posterior probability of outlier occurrence at each time point,
$$p_j = P(\delta_j = 1 \mid Y, \alpha, \theta, r, \eta, \epsilon, \delta_{(-j)}),$$
is estimated, a cut-off point of 0.5 is used for detecting the outliers, i.e., there is a possible outlier at time $j$ when $\hat{p}_j > 0.5$.
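With the output of the sketch above (again, the names res, p_hat and draws are ours), this detection rule amounts to a single comparison:

```r
res      <- run_chain(y)               # y: observed, possibly contaminated, series
detected <- which(res$p_hat > 0.5)     # time points flagged as additive outliers
parm_hat <- colMeans(res$draws)        # posterior means of (alpha, r, theta, eps)
```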

We now discuss another relevant issue in the proposed Bayesian approach: the choice of the hyper-parameters of the prior distributions. Recall from the previous section that the prior distributions of $\alpha$, $\theta$ and $r$ are

$$\alpha \sim Be(a, b), \qquad \theta \sim Be(c, d), \qquad r \sim Po(\mu).$$
