METHODS FOR THE ANALYSIS OF SPATIO-TEMPORAL MULTI-STATE PROCESSES

by

Farouk S. Nathoo

M.Math (Statistics), University of Waterloo, 2000

B.Sc. (Mathematics and Statistics), University of British Columbia, 1998

a thesis submitted in partial fulfillment of the requirements for the degree of

Doctor of Philosophy in the Department

of

Statistics and Actuarial Science

© Farouk S. Nathoo 2005

SIMON FRASER UNIVERSITY

Fall 2005

All rights reserved. This work may not be reproduced in whole or in part, by photocopy or other means, without the permission of the author.

Name: Farouk S. Nathoo

Degree: Doctor of Philosophy

Title of thesis: Methods for the Analysis of Spatio-Temporal Multi-State Processes

Examining Committee:

Dr. Richard Lockhart, Chair

Dr. C. Dean, Senior Supervisor

Dr. Alain Vanasse, Supervisory Committee

Dr. Tim B. Swartz, Supervisory Committee

Dr. Carl J. Schwarz, Internal External Examiner

Dr. Lance A. Waller, External Examiner, Emory University

Date Approved: December 9, 2005

Studies of recurring infection or chronic disease often collect longitudinal data on the disease status of subjects. Multi-state transitional models are commonly used for describing the development of such longitudinal data. In this setting, we model a stochastic process, which at any point in time will occupy one of a discrete set of states and interest centers on the transition process between states. For example, states may refer to the number of recurrences of an event or the stage of a disease.

Geographic referencing of data collected in longitudinal studies is progressively more common as scientific databases are being linked with GIS systems. This has created a need for statistical methods addressing the resulting spatial-longitudinal structure of the data. In this thesis, we develop hierarchical mixed multi-state models for the analysis of such longitudinal data when the processes corresponding to different subjects may be correlated spatially over a region. Methodological developments have been strongly driven by studies in forestry and spatial epidemiology.

Motivated by an application in forest ecology studying pine weevil infestations, the second chapter develops methods for handling mixtures of populations for spatial discrete-time two-state processes. The two-state discrete-time transitional model, often used for studying chronic conditions in human populations, is extended to settings where subjects are spatially arranged. A mixed spatially correlated mover-stayer model is developed. Here, clustering of infection is modelled by a spatially correlated random effect reflecting the density or closeness of the individuals under study. Analysis is carried out using maximum likelihood with a Monte Carlo EM algorithm for implementation and also using a fully Bayesian analysis.

The third chapter presents continuous-time spatial multi-state models. Here, joint modelling of both spatial correlation as well as correlation between different transition rates is required and a multivariate spatial approach is employed. A proportional intensities frailty model is developed where baseline intensity functions are modelled using both parametric Weibull forms as well as flexible representations based on cubic B-splines. The methodology is applied to a study of invasive cardiac procedure in Quebec examining readmission and mortality rates over a four-year period.

Finally, in the fourth chapter we return to the two-state discrete-time setting. An extension of the mixed mover-stayer model is motivated and developed within the Bayesian framework. Here, a multivariate conditional autoregressive (MCAR) model is incorporated providing flexible joint correlation structures. We also consider a test for the number of mixture components, quantifying the existence of a hidden subgroup of ‘stayers’ within the population. Posterior summarization is based on a Metropolis-Hastings sampler and methods for assessing the model goodness-of-fit are based on posterior predictive comparisons.

I would like to begin by thanking my senior supervisor Dr. Charmaine Dean for her support and guidance in countless ways. I feel very fortunate to have had Dr. Dean as my supervisor. I also wish to thank Dr. Alain Vanasse, Dr. Théophile Niyonsenga, Dr. Abbas Hemiari, Dr. Josiane Courteau and all other members of our GEOIDE research team for their support and friendship.

Many thanks go to the faculty and staff of the Department of Statistics and Actuarial Science of Simon Fraser University for providing an excellent environment for graduate studies. I have many fond memories of my time here, especially playing on the departmental basketball team. A special thanks also goes to my friends and fellow students: Laurie Ainsworth, David Beaudoin, Suman Jiwani, Crystal Linkletter, Chunfang Lin, Jason Nielsen, Pritam Ranjan, Giovanni Silva, Darby Thompson and many more...

I would also like to acknowledge Dr. Bovas Abraham from the Department of Statistics and Actuarial Science at the University of Waterloo and Dr. Harry Joe and Dr. Jim Zidek from the Department of Statistics at the University of British Columbia all for encouraging me to pursue statistics at a higher level.

I am grateful for the financial support provided by Simon Fraser University and the GEOIDE network throughout the course of my studies.

Finally, and most importantly, I would like to thank my family, especially my dad, who has been a constant source of support and wisdom, and Yasmin Vasanji for her love and patience.

Approval

Abstract

Dedication

Acknowledgments

Contents

List of Tables

List of Figures

1 Introduction
1.1 Background
1.1.1 Multi-State Models
1.1.2 Hierarchical Spatial Modelling
1.1.3 Markov chain Monte Carlo
1.1.4 Splines and Temporal Smoothing
1.2 Outline of Thesis
1.2.1 Chapter 2
1.2.2 Chapter 3
1.2.3 Chapter 4
1.2.4 Chapter 5

2.2 Inference Procedures
2.2.1 Observed Information Matrix
2.3 Study of Weevil Infestation
2.4 Discussion

3 Continuous-Time Spatial Multi-State Processes
3.1 Spatial Continuous-Time Multi-State Models
3.1.1 A Joint Spatial Model for Random Effects
3.1.2 Computational Implementation
3.2 Study of Invasive Cardiac Procedure
3.3 Discussion

4 Extending the Spatial Mover-Stayer Model
4.1 Motivation
4.2 An Extended Spatial Mover-Stayer Model
4.2.1 Hypothesis Testing for Stayers
4.2.2 Computational Implementation
4.3 Analysis of Synthetic Data
4.4 Analysis of Tree Infection data
4.4.1 Model Validation
4.5 Discussion

5 Future Work
5.1 Spatially Correlated Mover-Stayer Allocations
5.2 Spatial Adaptive Splines and P-Splines
5.3 Accelerated Failure Time Models with Spatial Frailties
5.4 Spatial Finite Mixtures

Bibliography

2.1 Parameter estimates for the spatial mover-stayer model. Bayes estimates are posterior means and standard deviations. Maximum likelihood estimates are obtained using the Monte Carlo EM algorithm.

3.1 DIC scores (after subtracting 340,000) and $p_D$ for the eight models considered for Quebec cardiac data.

3.2 Posterior summaries of regression coefficients associated with each of the three transitions associated with mortality.

3.3 Posterior summaries of regression coefficients associated with each of the two transitions associated with readmission.

3.4 Posterior summaries for the conditional covariance matrix, $\Sigma$, obtained from the final chosen model.

4.1 Calibration of the posterior probability assuming a fair prior $\Pr(H_0) = \Pr(H_1) = \frac{1}{2}$. The table is adapted from Raftery (1996), where it was used to calibrate the Bayes factor.

4.2 Posterior summaries obtained from the analysis of three simulated datasets. For each parameter we give the 95% credible interval. Estimates of $\Pr(p_M = 1 \mid Y)$ are also given along with Monte Carlo standard errors (computed using the method of batch means employing 100 batches each of size 2000).

4.3 Posterior summaries obtained from fitting the extended spatial mover-stayer model to the weevil infestation data. Here, we have defined $\sigma_{b_0} = \sqrt{\Sigma_{11}}$, $\sigma_{b_1} = \sqrt{\Sigma_{22}}$ and $\rho = \Sigma_{12} / \sqrt{\Sigma_{11}\Sigma_{22}}$.

1.1 State structures commonly employed for modelling chronic diseases (a) a typical state structure used for modelling a recurring disease process; (b) a state structure for joint modelling of disease and mortality.

2.1 Positions of trees within the plantation. The boundary is taken to be the convex hull of these positions.

2.2 Raw estimates of the conditional probabilities of infection in each year.

2.3 Estimates from initial model exploring spatial scale and spatial correlation.

2.4 Monte Carlo EM estimates by iteration for selected parameters in the spatial mover-stayer model (a) $\beta_{01}$; (b) $\beta_{02}$; (c) $\beta_{11}$; (d) $\beta_{12}$; (e) $\sigma_{b_0}$; (f) $\sigma_{b_1}$; (g) $P_I$; (h) $P_M$.

2.5 Maximum likelihood estimates of temporal trends from the spatial mover-stayer model with 95% confidence intervals (a) $g_0(t, \alpha_0)$; (b) $g_1(t, \alpha_1)$. Bayesian (posterior mean) estimates of temporal trends with 95% credible sets (c) $g_0(t, \alpha_0)$; (d) $g_1(t, \alpha_1)$.

2.6 Locations of the 100 largest (triangles) and smallest (circles) estimated random effects from the spatial mover-stayer model (a) $b_0$ - Likelihood Analysis; (b) $b_1$ - Likelihood Analysis; (c) $b_0$ - Bayesian Analysis; (d) $b_1$ - Bayesian Analysis.

2.7 Locations of the 100 trees having the highest estimated posterior probability of resistance (a) Likelihood Analysis - $\Pr(z_i = 0 \mid Y, \hat{\Theta})$; (b) Bayesian Analysis - $\Pr(z_i = 0 \mid Y)$.

3.3 Estimated cumulative baseline intensities associated with mortality after 0, 1 and 2 readmissions (posterior means and 95% credible intervals) a) $Q_{014}(t)$ - Spline; b) $Q_{014}(t)$ - Weibull; c) $Q_{024}(t)$ - Spline; d) $Q_{024}(t)$ - Weibull; e) $Q_{034}(t)$ - Spline; f) $Q_{034}(t)$ - Weibull. For comparison, the step-function estimates obtained from the semiparametric analysis are indicated within each plot by the grey curve.

3.4 Estimated cumulative baseline intensities of first and second readmission (posterior means and 95% credible intervals) a) $Q_{012}(t)$ - Spline; b) $Q_{012}(t)$ - Weibull; c) $Q_{023}(t)$ - Spline; d) $Q_{023}(t)$ - Weibull. For comparison, the step-function estimates obtained from the semiparametric analysis are indicated within each plot by the grey curve.

3.5 Matrix scatter plot comparing the posterior mean estimates of $b_{14}$, $b_{24}$, $b_{34}$, $b_{12}$ and $b_{23}$.

3.6 Boxplots (arranged in increasing order by posterior median) obtained from posterior samples of random effects associated with mortality a) $b_{14}$; b) $b_{24}$; c) $b_{34}$.

3.7 Posterior mean maps of random effects associated with mortality a) $b_{14}$; b) $b_{24}$; c) $b_{34}$.

3.8 Boxplots (arranged in increasing order by posterior median) obtained from posterior samples of random effects associated with readmission a) $b_{12}$; b) $b_{23}$.

3.9 Posterior mean maps of those random effects associated with first and second readmission a) $b_{12}$; b) $b_{23}$.

4.1 Scatter plot comparing posterior mean estimates of $b_0$ and $b_1$: a) Estimates obtained in Chapter 2 assuming $b_l \overset{ind}{\sim} \mathrm{IAR}(\sigma_l)$, $l = 0, 1$; b) Estimates obtained from expanded model assuming $b \sim \mathrm{2CAR}(\kappa, \Sigma)$.

4.2 [...] to larger values (a) $\exp(b_{0i})$, $i = 1, \ldots, 400$; (b) $\exp(b_{1i})$, $i = 1, \ldots, 400$.

4.3 Scatter plot comparing simulated values of $b_0$ and $b_1$.

4.4 Posterior mean estimates of the temporal trends with 95% credible sets (a) $g_0(t, \alpha_0)$; (b) $g_1(t, \alpha_1)$.

4.5 Locations of the 100 largest (triangles) and smallest (circles) estimated random effects from the extended spatial mover-stayer model (a) $b_0$; (b) $b_1$.

4.6 Locations of trees which may be resistant (a) Those 100 trees having the highest posterior probability of resistance $\Pr(z_i = 0 \mid Y)$; (b) Those 715 trees which were never infected.

4.7 Posterior predictive distributions obtained from the extended spatial mover-stayer model (a) $T_1(Y)$ - the total number of $0 \to 1$ transitions; (b) $T_2(Y)$ - the total number of $1 \to 0$ transitions; (c) $T_3(Y)$ - the overall total number of transitions; (d) $T_4(Y)$ - total number of trees which were never infected; the dashed vertical line on each histogram represents the observed value of the test statistic.

4.8 Posterior predictive distributions obtained from the submodel which sets $p_M = 1$ (a) $T_1(Y)$ - the total number of $0 \to 1$ transitions; (b) $T_2(Y)$ - the total number of $1 \to 0$ transitions; (c) $T_3(Y)$ - the overall total number of transitions; (d) $T_4(Y)$ - total number of trees which were never infected; the dashed vertical line on each histogram represents the observed value of the test statistic.

Introduction

1.1 Background

Multi-state modelling is a powerful and convenient approach for describing the progression of longitudinal data. The framework is broad and encompasses techniques for the analysis of multivariate censored event-time data as well as methods for the analysis of longitudinal discrete data. In this thesis, multi-state transitional models are considered in a spatial setting. In essence, the work blends ideas adapted from longitudinal data analysis and spatial statistics. Methodological developments have been strongly motivated by studies in: 1) forest ecology, where interest lies in managing trees, forests and their associated resources for human benefit, and 2) epidemiology, where investigators are interested in the spatial distribution of health-related states or events.

In the forest ecological setting, we have developed methods for analysis of data arising from a study of recurrent white pine weevil (Pissodes strobi) infestation in a white pine plantation in British Columbia. In this seven-year longitudinal study, conducted by the Ministry of Forests in British Columbia, each tree within the plantation was inspected each fall for the presence of infection. Our main interest was to describe the pattern of weevil infestation throughout the area over the seven years of observation. White pine weevil infection poses a significant threat to British Columbia forests and there has been enormous recent investment in studying this disease.

In the spatial epidemiological setting, we have developed techniques for analysis of spatial data arising from a study of revascularization intervention in Quebec. In this four-year longitudinal study, patients hospitalized for acute coronary syndrome were followed over time and information regarding subsequent hospital readmissions and mortality was obtained. Additional demographic and treatment information was also obtained for each patient along with the local health unit in which the subject resides. The local health units serve as a geographical stratification of the subjects involved in the study. Here, interest lies in the identification of spatial heterogeneity in both mortality and readmission rates across the various local health units of the province.

We begin, in this section, with a review of some preliminary ideas that form the basis for model building and inference in later chapters. We then outline the remainder of the thesis in the next section.

1.1.1 Multi-State Models

In the multi-state modelling framework we assume that individuals in some population will occupy one of states 1, ..., k over a period of time. As subjects are observed over time, they may make changes from one state to another and we refer to such changes of state as transitions. Examining transitions can give insight into the dynamic aspects of the process under consideration. The state structure, often depicted graphically, specifies the states and which state-to-state transitions are possible. Figure 1.1 gives examples of two state structures which are often employed for models of chronic disease. The state structure depicted in Figure 1.1a will be employed in Chapters 2 and 4 where we consider models for recurring tree infection. The structure shown in Figure 1.1b, the so-called illness-death model (Hougaard 2000), is appropriate for modelling both disease and mortality simultaneously.

Figure 1.1: State structures commonly employed for modelling chronic diseases (a) a typical state structure used for modelling a recurring disease process; (b) a state structure for joint modelling of disease and mortality. [Panel (a) states: Disease Free (0), Diseased (1). Panel (b) states: Disease Free (0), Diseased (1), Dead (2).]

Models describing the evolution of a discrete-time process $Y(t)$, $t = 0, 1, 2, \ldots$ are typically specified through transition probabilities

$$p_{ij}(t) = \Pr(Y(t) = j \mid Y(t-1) = i, H(t)) \tag{1.1}$$

where $H(t) = \{Y(u), u = 0, \ldots, t-1\}$ denotes the history of the process up to time $t$. In the continuous-time setting, the transition process is governed by transition intensity functions which are defined by

$$q_{ij}(t) = \lim_{s \to 0} \frac{\Pr(Y(t+s) = j \mid Y(t) = i, H(t))}{s} \tag{1.2}$$

where $H(t) = \{Y(u), 0 \le u < t\}$. Typically, a Markov assumption is made where it is assumed that the entire history of the process is captured by the current state. In this case, (1.1) reduces to $p_{ij}(t) = \Pr(Y(t) = j \mid Y(t-1) = i)$ (for a first-order chain) and (1.2) reduces to $q_{ij}(t) = \lim_{s \to 0} \Pr(Y(t+s) = j \mid Y(t) = i)/s$. In more complex situations, Markov processes can be used as building blocks in a hierarchical framework used to specify mixed Markov models. This approach was considered by Cook and Ng (1997), Ng and Cook (1997) and Albert and Waclawiw (1998) in non-spatial settings.
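As a minimal illustration of the transition probabilities in (1.1) under the first-order Markov assumption, the following Python sketch simulates a two-state chain (0 = disease free, 1 = diseased) and recovers the transition probabilities from the observed transitions. It assumes a time-homogeneous chain; the transition matrix, chain length, seed and function names are hypothetical choices made for illustration and are not taken from the studies analysed in this thesis.

```python
import numpy as np

def simulate_chain(P, n_steps, y0, rng):
    """Simulate a first-order Markov chain with transition matrix P."""
    y = np.empty(n_steps + 1, dtype=int)
    y[0] = y0
    for t in range(1, n_steps + 1):
        # Row y[t-1] of P holds Pr(Y(t) = j | Y(t-1) = i) for each state j.
        y[t] = rng.choice(P.shape[0], p=P[y[t - 1]])
    return y

def estimate_transitions(y, n_states):
    """Row-normalised transition counts (the usual maximum likelihood estimate)."""
    counts = np.zeros((n_states, n_states))
    for i, j in zip(y[:-1], y[1:]):
        counts[i, j] += 1
    return counts / counts.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
# Hypothetical transition probabilities, chosen only for illustration.
P = np.array([[0.9, 0.1],   # disease free -> (disease free, diseased)
              [0.4, 0.6]])  # diseased     -> (disease free, diseased)
y = simulate_chain(P, n_steps=5000, y0=0, rng=rng)
P_hat = estimate_transitions(y, n_states=2)
print(np.round(P_hat, 3))
```

With a long enough chain the rows of `P_hat` should lie close to the rows of `P`; regression or random-effects structure on the transition probabilities is deliberately left out of this sketch.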

1.1.2 Hierarchical Spatial Modelling

The modelling of non-Gaussian spatially correlated data typically proceeds in a hierarchical framework. Within such a framework, observations are assumed conditionally independent at the lowest level of the hierarchy and dependence is introduced at the second level through spatially correlated random effects. The random effects account for heterogeneity and, in many settings, represent covariates that are missing from the model. Within the realm of generalized linear models, the incorporation of spatial random effects has been studied extensively (see e.g., Besag et al., 1991, Bernardinelli and Montomoli, 1992, Best et al., 1999, Zhang 2002). In this thesis we adopt a similar approach, introducing random effects into the second level of hierarchical multi-state Markov processes.

A convenient distributional form for a vector of $N$ spatially correlated random effects $b = (b_1, \ldots, b_N)$ is the multivariate Gaussian with mean 0 and spatially structured covariance matrix $\Sigma$. Here, associated with each $b_i$ is either a point location $(x_i, y_i) \in \mathbb{R}^2$ or a position on a (possibly irregular) lattice. Typically, one of two approaches is adopted for specifying $\Sigma$. A direct and simple approach, known as geostatistical modelling (Cressie, 1993, Diggle et al. 1998), requires knowledge of the point locations $(x_i, y_i)$, $i = 1, \ldots, N$, and specifies $\Sigma$ as a parametric function of these locations. A simple example is the exponential form which sets $\Sigma_{ij} = \sigma^2 \exp(-\rho\, d(i, j))$ and is based on two parameters $\rho > 0$ and $\sigma^2 > 0$ and the distance, $d(i, j)$, between points $i$ and $j$. An alternative approach, known as conditional autoregressive modelling (CAR) (Besag, 1974, Cressie 1993, Carlin and Louis 1996) is the spatial analogue of autoregressive time-series modelling and specifies $\Sigma$ indirectly through the set of conditional distributions $b_i \mid b_{j \ne i} \sim N(\mu_i, \sigma_i^2)$, $i = 1, \ldots, N$. Here, each conditional distribution is assumed univariate normal with conditional variance $\sigma_i^2 > 0$ and conditional mean $\mu_i = \sum_{j=1}^{N} w_{ij} b_j$ with $w_{ii} = 0$, $i = 1, \ldots, N$. The weight $w_{ij} \ge 0$ can be based on either the distance between units $i$ and $j$ or indicators for their adjacency on a lattice and reflects the influence of $b_j$ on the conditional mean of $b_i$. With these conditional specifications, the results of Besag (1974) can be used to show that $\Sigma = (I - W)^{-1} M$ where $W = (w_{ij})$, $M = \mathrm{diag}\{\sigma_1^2, \ldots, \sigma_N^2\}$ and we impose the restriction $w_{ij}\sigma_j^2 = w_{ji}\sigma_i^2$ to ensure the symmetry of $\Sigma$. A special case which has been used extensively in disease mapping is the intrinsic autoregression which sets $w_{ij} = C_{ij}/C_{i\cdot}$ and $\sigma_i^2 = \sigma^2/C_{i\cdot}$ where the $C_{ij}$'s are known user defined weights and $C_{i\cdot} = \sum_j C_{ij}$. This specification leads to a singular multivariate Gaussian distribution for $b$.
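The two specifications of the covariance matrix can be sketched numerically as follows. The code builds the exponential (geostatistical) form from point locations and the CAR form $(I - W)^{-1}M$ from a matrix of adjacency weights. Because the intrinsic autoregression is singular, the sketch introduces a shrinkage parameter `kappa` with $|\kappa| < 1$ so that the inverse exists; this parameter, the four-site lattice, the coordinates and all function names are assumptions made purely for illustration.

```python
import numpy as np

def exponential_covariance(coords, sigma2, rho):
    """Geostatistical form: Sigma_ij = sigma2 * exp(-rho * d(i, j))."""
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    return sigma2 * np.exp(-rho * d)

def car_covariance(C, sigma2, kappa):
    """CAR form: Sigma = (I - W)^{-1} M with w_ij = kappa * C_ij / C_i.
    and sigma_i^2 = sigma2 / C_i.

    kappa = 1 corresponds to the intrinsic autoregression, for which
    I - W is singular; |kappa| < 1 is an assumed 'proper' variant.
    """
    row_sums = C.sum(axis=1)
    W = kappa * C / row_sums[:, None]
    M = np.diag(sigma2 / row_sums)
    return np.linalg.solve(np.eye(len(C)) - W, M)

# Hypothetical 2 x 2 lattice with rook adjacency (pairs 0-1, 0-2, 1-3, 2-3).
C = np.array([[0, 1, 1, 0],
              [1, 0, 0, 1],
              [1, 0, 0, 1],
              [0, 1, 1, 0]], dtype=float)
coords = np.array([[0, 0], [1, 0], [0, 1], [1, 1]], dtype=float)

Sigma_geo = exponential_covariance(coords, sigma2=2.0, rho=0.5)
Sigma = car_covariance(C, sigma2=1.0, kappa=0.9)

# With kappa = 1 every row of W sums to one, so I - W loses rank.
W_intrinsic = C / C.sum(axis=1)[:, None]
rank_intrinsic = np.linalg.matrix_rank(np.eye(4) - W_intrinsic)
print(rank_intrinsic)
```

The symmetry restriction $w_{ij}\sigma_j^2 = w_{ji}\sigma_i^2$ holds here because `C` is symmetric, which is why `Sigma` comes out symmetric.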

Geostatistical models, due to their direct specification of $\Sigma$, are easily interpreted; whereas, CAR models are most sensibly interpreted in a conditional sense. On the other hand, CAR models can be based on either point locations or derived at a lower spatial resolution using only the adjacency structure of a lattice. In addition, the conditional specification of CAR models makes them ideal for use with Markov chain Monte Carlo methods described in the next subsection. We adopt here the CAR modelling approach. Finally, we note that multivariate generalizations of the CAR modelling framework have been developed (see e.g. Kim et al. 2001, Carlin and Banerjee, 2002, Gelfand and Vounatsou 2003) which allow for the joint spatial modelling of $k > 1$ random effects associated with each spatial unit. We incorporate such joint spatial structures into our multi-state modelling framework in Chapters 3 and 4 of this thesis.

1.1.3 Markov chain Monte Carlo

Markov chain Monte Carlo (MCMC) is a collection of numerical simulation methods which allow the approximation of integrals that are analytically intractable. Even though the theory behind MCMC was developed much earlier (Metropolis et al. 1953), the techniques have become increasingly popular within the last decade, a result owing to the availability of cheap computing power.

The principle behind MCMC is the ergodic theorem applied to Markov chains. Inference with respect to some target distribution $\pi$ is based on the construction of a Markov chain having $\pi$ as its invariant distribution. The ergodic theorem then links expected values under $\pi$ with observations, $x^0, x^1, x^2, \ldots$, from the Markov chain via

$$\lim_{J \to \infty} \frac{1}{J+1} \sum_{i=0}^{J} f(x^i) = E_\pi[f(x)]$$

for any function $f$, integrable with respect to $\pi$. Expected values under $\pi$ may then be approximated using realizations of the Markov chain. The technique is most useful when drawing realizations directly from $\pi$ is not feasible and $\pi$ is sufficiently high dimensional and complex so that importance sampling methods cannot be employed. This is typically the case with hierarchical spatial models involving large numbers of random effects.

The two most common MCMC algorithms are the Gibbs sampler (Geman and Geman 1984) and the Metropolis-Hastings algorithm (Hastings 1970). The Gibbs sampler is based on drawing from full conditional distributions. Suppose $x = (x_1, \ldots, x_N)$ and it is feasible to obtain realizations from the full conditional distribution $\pi(x_j \mid x_{-j})$ where $x_{-j}$ denotes $x$ with $x_j$ removed. The Gibbs sampler changes the state of the chain $x^i$ to $x^{i+1}$ by updating each $x_j$, $j = 1, \ldots, N$, in turn by sampling the replacement value from the corresponding full conditional distribution $\pi(\cdot \mid x_1^{i+1}, \ldots, x_{j-1}^{i+1}, x_{j+1}^{i}, \ldots, x_N^{i})$. The algorithm depends on the ability to draw from full conditional distributions. Often, the full conditional distributions will not take standard forms but the corresponding densities will be log-concave. In this case, adaptive rejection sampling (Gilks and Wild 1992) may be employed.
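The Gibbs update can be illustrated on a toy target for which both full conditionals are available in closed form. The sketch below assumes a standard bivariate normal with correlation `r`, so that each component given the other is normal with mean `r` times the other component and variance `1 - r**2`. The target, the value of `r`, the chain length and the burn-in length are illustrative assumptions and are unrelated to the models fitted in later chapters.

```python
import numpy as np

def gibbs_bivariate_normal(r, n_iter, rng, start=(0.0, 0.0)):
    """Gibbs sampler for a standard bivariate normal with correlation r."""
    x1, x2 = start
    sd = np.sqrt(1.0 - r ** 2)  # conditional standard deviation
    draws = np.empty((n_iter, 2))
    for i in range(n_iter):
        x1 = rng.normal(r * x2, sd)  # draw from pi(x1 | x2) using the current x2
        x2 = rng.normal(r * x1, sd)  # draw from pi(x2 | x1) using the updated x1
        draws[i] = x1, x2
    return draws

rng = np.random.default_rng(1)
draws = gibbs_bivariate_normal(r=0.8, n_iter=20000, rng=rng)
kept = draws[1000:]                      # drop an initial portion of the chain
corr_hat = np.corrcoef(kept.T)[0, 1]     # ergodic estimate of the correlation
print(kept.mean(axis=0), corr_hat)
```

Note that the second update conditions on the value of `x1` just drawn, which is the component-wise scheme described above.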

In situations where it is difficult to sample from full conditional distributions, the Metropolis-Hastings algorithm, a generalization of the Gibbs sampler, may be used. In this case, the update from $x^i$ to $x^{i+1}$ proceeds by first generating a candidate $x'$ from a proposal distribution $q(\cdot \mid x^i)$. The proposed value, $x'$, is accepted as the new state of the chain, $x^{i+1}$, with probability

$$\alpha(x^i, x') = \min\left\{1, \frac{\pi(x')\, q(x^i \mid x')}{\pi(x^i)\, q(x' \mid x^i)}\right\} \tag{1.3}$$

Otherwise we set $x^{i+1} = x^i$ and the chain does not move. Note also that $\pi$ only needs to be known up to a normalizing constant. When the proposal density is symmetric (1.3) reduces to $\alpha(x^i, x') = \min\{1, \pi(x')/\pi(x^i)\}$ and the resulting special case is referred to as the Metropolis algorithm. Often, it is not feasible to update the whole of $x$ in one step. In this case, as in the Gibbs sampler, we divide $x$ into components $x = (x_1, \ldots, x_N)$ and apply a Metropolis-Hastings step to each component. This scheme includes, as a special case, the so-called hybrid samplers (Gilks et al. 1996) that update some components via Gibbs steps and others using Metropolis-Hastings steps.

In practice, an initial portion of the realized Markov chain is discarded as burn-in, a period required for the chain to ‘forget’ its initial state and converge to the stationary distribution. Convergence is best assessed by running multiple chains, each initialized at a different point in the sample space of π. The output of each chain is then compared using diagnostics (see e.g. Gelman and Rubin 1992) and through the examination of sample trace plots, most importantly plots which display the value of log(π) (up to an additive constant) at each state of the chain.
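A minimal sketch of the Metropolis special case, with burn-in and two dispersed starting points, is given below. It assumes a Gaussian random-walk proposal (which is symmetric, so the proposal densities cancel in the acceptance probability) and an unnormalised standard bivariate normal target. The target, step size, chain length, burn-in length and starting points are hypothetical values chosen for illustration only.

```python
import numpy as np

def metropolis(log_target, x0, n_iter, step, rng):
    """Random-walk Metropolis sampler working on the log scale."""
    x = np.asarray(x0, dtype=float)
    log_p = log_target(x)
    chain = np.empty((n_iter, x.size))
    n_accept = 0
    for i in range(n_iter):
        candidate = x + step * rng.standard_normal(x.size)  # symmetric proposal
        log_p_cand = log_target(candidate)
        # Accept with probability min{1, pi(candidate) / pi(x)}.
        if np.log(rng.uniform()) < log_p_cand - log_p:
            x, log_p = candidate, log_p_cand
            n_accept += 1
        chain[i] = x  # if rejected, the chain does not move
    return chain, n_accept / n_iter

def log_target(x):
    # Log density of a standard bivariate normal, up to an additive constant.
    return -0.5 * np.sum(x ** 2)

rng = np.random.default_rng(2)
starts = ([-5.0, 5.0], [5.0, -5.0])  # two dispersed starting points
chains = []
for x0 in starts:
    chain, accept_rate = metropolis(log_target, x0, n_iter=20000, step=1.0, rng=rng)
    chains.append(chain[2000:])  # discard the first 2000 states as burn-in
means = [c.mean(axis=0) for c in chains]
print(means, accept_rate)
```

Agreement between the post-burn-in summaries of the two chains is only an informal check; a formal diagnostic of the kind cited above is not implemented here.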

In Chapter 2, we employ the Gibbs sampler within each iteration of an EM algorithm to approximate the conditional expectations required at each E-step. In Chapters 3 and 4 we use the Metropolis-Hastings algorithm to draw samples from posterior distributions arising from Bayesian model specifications.

1.1.4 Splines and Temporal Smoothing

Splines provide a conceptually simple approach for approximating complex nonlinear functions (De Boor 1978). In this thesis, splines of one variable are employed for modelling temporal variation in the process governing the state-to-state transitions of a multi-state model. The basic idea behind splines is the representation of a possibly complicated curve through a combination of relatively simple smooth segments where each segment is represented by a polynomial of degree $D$. To ensure smoothness of the composite curve, constraints are imposed on each segment at the joining points which are called inner knots. Given a set of $L$ inner knots $t_1 < t_2 < \ldots < t_L$ a spline $S$ of degree $D$ may be written as

$$S(x) = \sum_{l=0}^{L} S_l(x)\, I(x \in [t_l, t_{l+1})) \tag{1.4}$$

where

$$S_l(x) = \sum_{d=0}^{D} a_{ld}(x - t_l)^d, \quad l = 0, \ldots, L,$$

$t_0 < t_{L+1}$ are boundary knots typically defined by the range of the data, and the $a_{ld}$'s are constrained to ensure that $S$ has continuous derivatives of all orders $\le D - 1$.

As discussed by MacNab (1999), the collection of all functions taking the form (1.4) forms a linear space of dimension $D + L + 1$. It is therefore spanned by any $D + L + 1$ linearly independent members of the space forming a basis. The approximation of a function $f$ over the interval $[t_0, t_{L+1}]$ using a spline of degree $D$ and inner knots $t_1 < t_2 < \ldots < t_L$ is therefore given by

$$f(t) \approx \sum_{j=0}^{D+L} \alpha_j p_j(t) \tag{1.5}$$

where $p(t) = \{p_0(t), \ldots, p_{D+L}(t)\}$ is any such basis and $\alpha_0, \ldots, \alpha_{D+L}$ are unknown parameters. A convenient choice, which we employ in this thesis, is the B-spline basis that is easily computed using the recursive algorithm of De Boor (1978). In addition, our functional approximations will take $D = 3$ and incorporate an intercept yielding

$$f(t) \approx \alpha_0 + \sum_{j=1}^{3+L} \alpha_j p_j(t)$$

where we exclude the first B-spline basis function $p_0(t)$ in order to identify the intercept.
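The recursive construction of a B-spline basis can be sketched as follows. The code evaluates the $D + L + 1$ basis functions by the standard Cox-de Boor recursion and then forms a design matrix with an intercept in place of the first basis function. Repeating each boundary knot $D + 1$ times is a common convention assumed here rather than something stated in the text, and the knot positions, time grid and function name are hypothetical.

```python
import numpy as np

def bspline_basis(t, inner_knots, lower, upper, degree=3):
    """Evaluate the degree + L + 1 B-spline basis functions at the points t."""
    t = np.asarray(t, dtype=float)
    # Assumed convention: each boundary knot is repeated degree + 1 times.
    knots = np.concatenate([[lower] * (degree + 1), inner_knots,
                            [upper] * (degree + 1)])
    n_basis = len(knots) - degree - 1
    # Degree-0 functions: indicators of the half-open knot intervals.
    B = np.zeros((len(t), len(knots) - 1))
    for j in range(len(knots) - 1):
        B[:, j] = (knots[j] <= t) & (t < knots[j + 1])
    # Close the last non-empty interval so that t = upper is covered.
    B[t == upper, n_basis - 1] = 1.0
    # Cox-de Boor recursion, raising the degree one step at a time.
    for d in range(1, degree + 1):
        B_new = np.zeros((len(t), len(knots) - d - 1))
        for j in range(len(knots) - d - 1):
            left = knots[j + d] - knots[j]
            right = knots[j + d + 1] - knots[j + 1]
            if left > 0:
                B_new[:, j] += (t - knots[j]) / left * B[:, j]
            if right > 0:
                B_new[:, j] += (knots[j + d + 1] - t) / right * B[:, j + 1]
        B = B_new
    return B

# Hypothetical setting: cubic basis on [0, 7] with L = 2 inner knots.
t = np.linspace(0.0, 7.0, 50)
B = bspline_basis(t, inner_knots=[2.0, 4.0], lower=0.0, upper=7.0)
# Intercept plus the basis functions p_1, ..., p_{3+L}; p_0 is excluded.
X = np.column_stack([np.ones_like(t), B[:, 1:]])
print(B.shape, X.shape)
```

Because the basis functions sum to one at every point, including both an intercept and all $D + L + 1$ functions would make the design matrix rank deficient, which is the identifiability issue that dropping $p_0(t)$ avoids.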

1.2 Outline of Thesis

This thesis consists of three projects examining spatial variation in longitudinal multi-state processes. Chapters 2 and 4 deal with processes in discrete-time and methods developed therein are applied to the aforementioned forest ecological study. Chapter 3 develops methods for continuous-time processes in a spatial epidemiological setting.

Each chapter is written in a style similar to that for publication and the summaries of the chapters provided below reflect the corresponding abstracts. As a result, some introductory material is repeated.

1.2.1 Chapter 2

Studies of recurring infection or chronic disease often collect longitudinal data on the disease status of subjects. Two-state transitional models are useful for analysis in such studies where, at any point in time, an individual may be said to occupy either a diseased or disease-free state and interest centers on the transition process between the two states. Here, two additional features are present. The data are spatially arranged and it is important to account for spatial correlation in the transitional processes corresponding to different subjects. In addition, there are subgroups of individuals with different mechanisms of transitions. These subgroups are not known a priori and hence group membership must be estimated. Covariates modulating transitions are included in a logistic additive framework. Inference for the resulting mixture spatial Markov regression model is not straightforward. We develop here a Monte Carlo EM algorithm for maximum likelihood estimation and a Markov chain Monte Carlo sampling scheme for summarizing the posterior distribution in a Bayesian analysis. The methodology is applied to a study of recurrent weevil infestation in British Columbia forests.

1.2.2 Chapter 3

Follow-up medical studies often collect longitudinal data on patients. Multi-state models can be employed for analysis in such studies where at any point in time, individuals may be said to occupy one of a discrete set of states and it is of interest to examine the process governing state-to-state transitions. For example, states may refer to the number of recurrences of an event, or the stage of a disease. We develop a hierarchical Bayesian model for the analysis of such longitudinal data when the processes corresponding to different subjects may be correlated spatially over a region. Continuous-time Markov chains incorporating spatially correlated random effects are introduced. Here, joint modelling of both spatial correlation as well as correlation between different transition rates is required and a multivariate spatial approach is employed. A proportional intensities frailty model is developed, where baseline intensity functions are modelled using both parametric Weibull forms and flexible representations based on cubic B-splines. The methodology is applied to a study of revascularization intervention in Quebec. We consider patients admitted for acute coronary syndrome throughout the 139 local health units of the province and examine readmission and mortality rates over a four-year period.

1.2.3 Chapter 4

In this final chapter we return to the discrete-time setting of Chapter 2 and develop an extended model with inference conducted from a Bayesian perspective. A joint spatial random effects model is incorporated into the transitional process of a hierarchical mover-stayer model. In this case, the random effects allow for two types of correlation. In addition to allowing for spatial correlation, we also permit correlation between subject specific transition probabilities. This flexible correlation structure is accommodated through a multivariate conditional autoregressive (MCAR) model. The chapter also develops a test for the number of mixture components, quantifying the existence of a hidden subgroup within the population. That is, we develop a test for ‘stayers’ in the mover-stayer model. The test is based on assigning a discrete mass prior to the mixing probability. Testing of this point null hypothesis was of substantial interest to investigators in our forest ecological application. Inference is based on samples drawn from the posterior distribution using a Metropolis-Hastings algorithm. Finally, methods for assessing the model goodness-of-fit are developed based on posterior predictive comparisons.

1.2.4 Chapter 5

The thesis closes with a discussion of future work.

A Discrete-Time Spatial Two-State Process

Studies of recurring infection or chronic disease often collect longitudinal data on the disease status of subjects. In many such studies, subjects are observed at regular time intervals and assessed for the presence/absence of a condition, such as a disease. Statistical analysis of the resulting longitudinal binary data is conveniently conducted through the use of two-state transitional models; in particular, when interest lies in the probabilities of transition between the diseased and disease-free states. In such analyses, it is typically assumed that individuals under observation are independent.

Markov chain modelling is a commonly used approach for describing a process which yields temporally dependent binary sequences. Inference in such models was considered in an early paper by Anderson and Goodman (1957) for the simple case where all subjects share the same transition probabilities. Muenz and Rubinstein (1985) allow the transition process to vary from subject to subject through regression modelling of the transition probabilities. In many scenarios there exists extra variation which is not explained by the available covariates. To account for this extra variation, two-stage, conditionally Markov processes can be employed. At the first stage, the data obtained from each subject are assumed to be drawn from a two-state Markov chain. At the second stage, continuous mixing distributions are used to model heterogeneity in transitions. Cook and Ng (1997) develop such a model incorporating bivariate Gaussian random effects into transition probabilities and conduct maximum likelihood estimation using numerical integration. Albert and Waclawiw (1998) develop a similar model but specify only the first two moments of the independent random effects and conduct inference using generalized estimating equations.

An alternative approach to account for heterogeneity in a Markov chain analysis is based on finite mixtures. In particular, this is useful when it is thought that a subgroup of individuals, known as ‘stayers’, will remain in their initial state throughout the course of observation. Such an approach can be employed for studying disease in populations where it is hypothesized that some members are resistant or immune to the condition studied. Models of this sort have been discussed by Frydman (1984), Fuchs and Greenhouse (1988) and Cook, Kalbfleisch and Yi (2002).

Independence between subjects is an assumption that is made in all models discussed above. In the application we consider, the subjects under observation are spatially arranged and it is essential to describe the spatial correlation. Our motivating example is a study of recurrent weevil infestation in a white pine plantation in British Columbia. In this seven-year longitudinal study conducted by the Ministry of Forests in British Columbia, each tree within the plantation was examined in the fall for the presence of infection. The primary interest was in describing the pattern of weevil infestation throughout the area over the seven years of observation. White pine weevil infection poses a significant threat to British Columbia forests and there has been enormous investment recently in studying this disease.

In this chapter we present a transitional model for spatio-temporal two-state processes. There are several features of this model and our analysis which distinguish them from the usual two-state model analysis. Importantly, spatial random effects are incorporated into transition probabilities to accommodate correlation. In our study it was hypothesized that heterogeneity might arise through the presence of trees which were resistant to infection. In fact, a major scientific objective in a follow-up analysis would be to identify and characterize such resistant trees with the goal of populating secondary forests with such qualities. To address this statistically, excess heterogeneity is accommodated by allowing for a subgroup of individuals whose initial state is absorbing. The resulting two-component model is of the mover-stayer type with spatially correlated random effects introduced into the component associated with ‘movers’. With the two layers of mixing distributions, one discrete and one involving a high dimensional spatial mixture, analytic tools for conducting inference need careful consideration. We develop these in both the likelihood and Bayesian frameworks and present a comparison of such methods in our analysis. Estimation is approached by maximum likelihood using a Monte Carlo EM algorithm in the classical framework and through Markov chain Monte Carlo summarization of the posterior distribution in the Bayesian setting.

The remainder of the chapter is organized as follows. In Section 2.1 we specify the mixed spatially correlated mover-stayer model. Section 2.2 develops maximum likelihood inference for our model. A spatio-temporal analysis of weevil infestation in a white pine plantation in British Columbia is discussed in Section 2.3. In Section 2.4 we discuss extensions involving multivariate spatial processes and continuous time modelling.

2.1 Spatio-Temporal Mixed Two-State Model

Suppose there are $N$ subjects, spatially arranged throughout some region, and subject $i$ is observed over a sequence of $n_i$ equally spaced time points. Upon observation, each individual will occupy one of two possible states representing, say, the presence or absence of some condition, for example an infection. We let state 1 denote the infected state and state 0 the infection-free state. Let $y_i(t)$ be the binary variable denoting the state occupied by subject $i$ at time $t$ and $\mathbf{y}_i = (y_i(0), \ldots, y_i(n_i - 1))'$ the sequence of states occupied by subject $i$, $i = 1, \ldots, N$.

The mixed mover-stayer model is specified hierarchically where, at the first stage of the model, we assume each response vector, $\mathbf{y}_i$, is independently drawn from a compartmental model having density

$$
f_{MS}(\mathbf{y}_i \mid \mathbf{Z}, \mathbf{b}_0, \mathbf{b}_1) =
\begin{cases}
I(\mathbf{y}_i = \mathbf{0}) & \text{if } z_i = 0, \\
f_{MC}(\mathbf{y}_i \mid \mathbf{b}_0, \mathbf{b}_1) & \text{if } z_i = 1,
\end{cases}
\tag{2.1}
$$

where $\mathbf{Z} = \{z_1, z_2, \ldots, z_N\}$ is a vector of latent variables with $z_i \in \{0, 1\}$ allocating subject $i$ into one of two mixture components and we adopt independent allocations to these components, $z_i \overset{ind}{\sim} \text{Bernoulli}(p_{Mi})$, $i = 1, \ldots, N$, where $p_{Mi} = \Pr(z_i = 1)$.

Extensions, allowing the components of $\mathbf{Z}$ to be spatially correlated, are considered in section 5. In (2.1), one mixture component places all its mass on the zero vector while the other component distributes mass according to the density $f_{MC}(\mathbf{y}_i \mid \mathbf{b}_0, \mathbf{b}_1)$, which is that of a first-order, two-state Markov chain given by

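$$
f_{MC}(\mathbf{y}_i \mid \mathbf{b}_0, \mathbf{b}_1) = p_{Ii}^{y_i(0)} (1 - p_{Ii})^{1 - y_i(0)} \prod_{t \in D_{0i}} p_{01i}(t)^{y_i(t)} (1 - p_{01i}(t))^{1 - y_i(t)} \times \prod_{t \in D_{1i}} p_{10i}(t)^{1 - y_i(t)} (1 - p_{10i}(t))^{y_i(t)}
\tag{2.2}
$$

where $D_{li} = \{t > 0 \mid y_i(t - 1) = l\}$, $l = 0, 1$, $p_{Ii}$ is an initial state probability and $p_{01i}(t)$ and $p_{10i}(t)$ are transition probabilities. The transition probabilities are modelled using additive logistic specifications

$$
\begin{aligned}
\text{logit}(p_{01i}(t)) &= \boldsymbol{\beta}_0' \mathbf{x}_i(t) + g_0(t, \boldsymbol{\alpha}_0) + b_{0i}, \\
\text{logit}(p_{10i}(t)) &= \boldsymbol{\beta}_1' \mathbf{x}_i(t) + g_1(t, \boldsymbol{\alpha}_1) + b_{1i},
\end{aligned}
\tag{2.3}
$$

$i = 1, \ldots, N$, $t = 1, \ldots, n_i - 1$, where $\mathbf{x}_i(t)$ is a $p$-vector of covariates associated with subject $i$ at time $t$; $\boldsymbol{\beta}_0$ and $\boldsymbol{\beta}_1$ are vectors of regression parameters; $g_0(t, \boldsymbol{\alpha}_0)$ and $g_1(t, \boldsymbol{\alpha}_1)$ are functions of time describing temporal trends in transitions and $b_{0i}$ and $b_{1i}$ are random effects accounting for spatial correlation.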
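As a minimal sketch of how the density (2.2) with logistic transition probabilities (2.3) might be evaluated for a single subject, the following assumes that the linear predictors (covariate effects, temporal trend and random effect already summed) are supplied as plain lists indexed by $t$. The function names `expit` and `markov_chain_loglik` are illustrative and are not taken from the thesis.

```python
import math

def expit(x):
    """Inverse logit, arranged to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def markov_chain_loglik(y, p_init, eta01, eta10):
    """Log of the two-state Markov chain density (2.2) for one subject.

    y      : list of 0/1 states y(0), ..., y(n-1)
    p_init : initial probability of state 1 (p_I)
    eta01  : linear predictors for logit p01(t), indexed by t (entry 0 unused)
    eta10  : linear predictors for logit p10(t), indexed by t (entry 0 unused)
    """
    ll = math.log(p_init) if y[0] == 1 else math.log(1.0 - p_init)
    for t in range(1, len(y)):
        if y[t - 1] == 0:                 # t belongs to D_0
            p01 = expit(eta01[t])
            ll += math.log(p01) if y[t] == 1 else math.log(1.0 - p01)
        else:                             # t belongs to D_1
            p10 = expit(eta10[t])
            ll += math.log(p10) if y[t] == 0 else math.log(1.0 - p10)
    return ll

# Hypothetical subject: all linear predictors zero, so p01 = p10 = 0.5
ll = markov_chain_loglik([0, 1, 1, 0], 0.2, [0.0] * 4, [0.0] * 4)
```

With these hypothetical inputs the likelihood is $0.8 \times 0.5^3 = 0.1$, so `ll` equals $\log(0.1)$.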

Several types of temporal trends can be considered. We allow for flexible forms using cubic B-splines. The cubic B-spline representations used here are given by

$$
g_l(t, \boldsymbol{\alpha}_l) = \alpha_{l0} + \sum_{j=1}^{K_l + 3} \alpha_{lj}\, p_{lj}(t), \quad l = 0, 1
\tag{2.4}
$$

where $\boldsymbol{\alpha}_l = (\alpha_{l0}, \ldots, \alpha_{l K_l + 3})$, $l = 0, 1$, are vectors of unknown coefficients and $\{p_{l1}(t), \ldots, p_{l K_l + 3}(t)\}$, $l = 0, 1$, are sets of known B-spline basis functions with $K_l$, $l = 0, 1$, representing the number of inner knots used in the representations. For the spatial random effects, $\mathbf{b}_0 = (b_{01}, \ldots, b_{0N})'$ and $\mathbf{b}_1 = (b_{11}, \ldots, b_{1N})'$, we adopt Gaussian intrinsic autoregressive (IAR) models based on conditional specifications of the form

$$
b_{li} \mid \{b_{lj},\ j \neq i\} \sim N\left( \frac{\sum_j C_{ij} b_{lj}}{\sum_j C_{ij}},\ \frac{\sigma_{bl}^2}{\sum_j C_{ij}} \right), \quad l = 0, 1
\tag{2.5}
$$

where $C_{ij}$ are user-defined weights measuring the closeness or adjacency of subjects $i$ and $j$ ($C_{ii} = 0$) with $\mathbf{b}_0$, $\mathbf{b}_1$ and $\mathbf{Z}$ assumed independent. As a result of (2.5), each vector $\mathbf{b}_l$, $l = 0, 1$, is distributed as $N(\mathbf{0}, \Sigma_l)$, with $\Sigma_l$ having generalized inverse $\Sigma_l^{-1} = \frac{1}{\sigma_{bl}^2}(D - C)$, $l = 0, 1$; $C = (C_{ij})$ is often called the neighbourhood matrix and $D = \text{diag}\{C_{1.}, C_{2.}, \ldots, C_{N.}\}$ with $C_{i.} = \sum_{j=1}^{N} C_{ij}$.
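To make the IAR structure concrete, the sketch below builds $(D - C)/\sigma_{bl}^2$ from a neighbourhood matrix and evaluates the conditional mean and variance in (2.5). The four-site line layout, the numerical values and the function names are assumptions chosen purely for illustration.

```python
def iar_precision(C, sigma2):
    """Return (D - C) / sigma2 for a symmetric weight matrix C with zero diagonal."""
    n = len(C)
    Q = [[0.0] * n for _ in range(n)]
    for i in range(n):
        row_sum = sum(C[i])               # C_i. = sum_j C_ij
        for j in range(n):
            diag = row_sum if i == j else 0.0
            Q[i][j] = (diag - C[i][j]) / sigma2
    return Q

def iar_conditional(C, b, i, sigma2):
    """Mean and variance of b_i given all other effects, as in (2.5)."""
    row_sum = sum(C[i])
    mean = sum(C[i][j] * b[j] for j in range(len(b))) / row_sum
    return mean, sigma2 / row_sum

# Hypothetical layout: four subjects on a line, adjacent subjects are neighbours
C = [[0, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 1, 0, 1],
     [0, 0, 1, 0]]
Q = iar_precision(C, sigma2=2.0)
mean_1, var_1 = iar_conditional(C, [1.0, 2.0, 4.0, 8.0], 1, sigma2=2.0)
```

Every row of `Q` sums to zero, so the matrix is singular. This is why $\Sigma_l$ is defined only through a generalized inverse and the joint IAR distribution is improper.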

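Under this model, the marginal likelihood for $\mathbf{Y} = \{\mathbf{y}_1, \ldots, \mathbf{y}_N\}$ takes the form

$$
L(\boldsymbol{\Theta}) = E\left[ \prod_{i=1}^{N} f_{MS}(\mathbf{y}_i \mid \mathbf{Z}, \mathbf{b}_0, \mathbf{b}_1) \right]
\tag{2.6}
$$

where $\boldsymbol{\Theta} = \{\boldsymbol{\beta}_0, \boldsymbol{\beta}_1, \sigma_{b0}^2, \sigma_{b1}^2, \{p_{Ii}\}, \{p_{Mi}\}, \boldsymbol{\alpha}_0, \boldsymbol{\alpha}_1\}$ denotes the model parameters and the expectation in (2.6) is taken with respect to the distributions of $\mathbf{b}_0$, $\mathbf{b}_1$ and $\mathbf{Z}$.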

There are two situations where the above compartmental model may be considered. Empirically, the data may suggest that several individuals never change states over time; additionally, scientific considerations may point to a need to address the presence of subgroups even if this is not empirically obvious. The mixed mover-stayer model allows for a subgroup of subjects whose initial state is absorbing. These so-called ‘stayers’ can represent individuals who are immune to infection and will therefore be observed in the disease-free state (state 0) at all times.
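The two-component structure can be illustrated by summing the latent allocation $z_i$ out of (2.1) for a single subject. The sketch below is a simplification that assumes constant transition probabilities, with no covariates, temporal trend or random effects; the function names and numerical values are hypothetical.

```python
from itertools import product

def chain_density(y, p_init, p01, p10):
    """Two-state Markov chain density with constant transition probabilities."""
    dens = p_init if y[0] == 1 else 1.0 - p_init
    for prev, cur in zip(y, y[1:]):
        if prev == 0:
            dens *= p01 if cur == 1 else 1.0 - p01
        else:
            dens *= p10 if cur == 0 else 1.0 - p10
    return dens

def mover_stayer_density(y, p_mover, p_init, p01, p10):
    """Density of y after summing over z: stayers put all mass on y = 0."""
    stayer_part = 0.0 if any(y) else 1.0   # I(y = 0)
    mover_part = chain_density(y, p_init, p01, p10)
    return (1.0 - p_mover) * stayer_part + p_mover * mover_part

# The mixture is a proper distribution over all 2^3 sequences of length 3
total = sum(mover_stayer_density(y, 0.6, 0.3, 0.5, 0.4)
            for y in product([0, 1], repeat=3))
```

A sequence of all zeros receives mass from both components, which is why such a subject cannot be classified as a stayer with certainty, whereas any sequence containing an infection can only have come from the ‘mover’ component.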

2.2 Inference Procedures

In this section we outline procedures for maximum likelihood inference. As the marginal likelihood function (2.6) is analytically intractable, we develop a Monte Carlo maximum likelihood scheme.

The EM algorithm (Dempster et al., 1977) is a popular tool for conducting maximum likelihood inference in situations involving missing data. We outline maximum likelihood procedures based on a Monte Carlo implementation of the algorithm (Wei and Tanner, 1990), where the random effects and latent variables are treated as missing data. In situations where the E-step of the EM algorithm does not admit a closed form, Wei and Tanner (1990), among others, proposed that the E-step can be carried out using Monte Carlo integration, resulting in what has been called the MCEM algorithm. This algorithm is useful for estimation in our model since the expectation in (2.6) involves an integral of high dimension. Alternative approaches could use analytic approximations to these integrals. For example, MacNab and Dean (2000) investigate the use of penalized quasi-likelihood (Breslow and Clayton, 1993) for estimation in spatial random effect models. Such an approach, while being less computationally intensive than MCEM, can yield severely biased estimates in the case of binary data (Lin and Breslow, 1993). Previous applications of the MCEM algorithm include Chan and Ledolter (1995), who study a model for count data incorporating temporally correlated random effects, and Chan and Kuk (1997), who examine probit-linear mixed models with correlated random effects. Most recently, Zhang (2002) developed a Monte Carlo version of the EM gradient algorithm in a geostatistical setting.

Procedures for Bayesian inference are more carefully detailed in the following section where the application is considered. To simplify the presentation we assume $p_{Ii} = p_I$ and $p_{Mi} = p_M$, $i = 1, \ldots, N$. Permitting variation, for example, regression modelling of the initial probabilities, is easily accommodated. Note that a prime focus here is on investigating transition probabilities so we direct attention to modelling these.

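Under the mixed mover-stayer model, a sufficient statistic for $\boldsymbol{\Theta}$ is given by $\mathbf{T} = (\mathbf{Y}_0, \mathbf{Y}_1)$ where

$$
\mathbf{Y}_0 = \{ y_i(0),\ I\{\mathbf{y}_i = \mathbf{0}\},\ \{ y_i(t) \mid t \in D_{0i} \};\ i = 1, \ldots, N \}
\tag{2.7}
$$

and

$$
\mathbf{Y}_1 = \{ \{ y_i(t) \mid t \in D_{1i} \};\ i = 1, \ldots, N \}.
\tag{2.8}
$$

The marginal likelihood function for $\mathbf{Y}$, given in (2.6), can be correspondingly factorized into two terms

$$
L(\boldsymbol{\Theta}, \mathbf{Y}) = L_0(\boldsymbol{\Theta}_0, \mathbf{Y}_0) \times L_1(\boldsymbol{\Theta}_1, \mathbf{Y}_1)
$$

where $\boldsymbol{\Theta}_0 = (\boldsymbol{\beta}_0, \sigma_{b0}, p_I, p_M, \boldsymbol{\alpha}_0)'$ and $\boldsymbol{\Theta}_1 = (\boldsymbol{\beta}_1, \sigma_{b1}, \boldsymbol{\alpha}_1)'$ divide the model parameters into two disjoint sets. As a result, maximum likelihood estimates can be obtained by maximizing $L_0(\cdot)$ and $L_1(\cdot)$ separately with

$$
L_1(\cdot) = E\left[ \prod_{i=1}^{N} \prod_{t \in D_{1i}} \frac{\exp(\boldsymbol{\beta}_1' \mathbf{x}_i(t) + g_1(t, \boldsymbol{\alpha}_1) + b_{1i})^{1 - y_i(t)}}{1 + \exp(\boldsymbol{\beta}_1' \mathbf{x}_i(t) + g_1(t, \boldsymbol{\alpha}_1) + b_{1i})} \right]
\tag{2.9}
$$

where the expectation in (2.9) is taken with respect to the distribution of $\mathbf{b}_1$ and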

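$$
L_0(\cdot) = E\left[ q(\mathbf{Y}_0, \mathbf{Z}, p_I) \prod_{i=1}^{N} \prod_{t \in D_{0i}} \frac{\exp(\boldsymbol{\beta}_0' \mathbf{x}_i(t) + g_0(t, \boldsymbol{\alpha}_0) + b_{0i})^{z_i y_i(t)}}{[1 + \exp(\boldsymbol{\beta}_0' \mathbf{x}_i(t) + g_0(t, \boldsymbol{\alpha}_0) + b_{0i})]^{z_i}} \right]
\tag{2.10}
$$

where

$$
q(\mathbf{Y}_0, \mathbf{Z}, p_I) = p_I^{\sum_{i=1}^{N} z_i y_i(0)} (1 - p_I)^{\sum_{i=1}^{N} z_i (1 - y_i(0))} \prod_{\{i \mid z_i = 0\}} I\{\mathbf{y}_i = \mathbf{0}\}
$$

and the expectation in (2.10) is taken with respect to the distributions of $\mathbf{b}_0$ and $\mathbf{Z}$.

Both (2.9) and (2.10) are maximized using separate MCEM algorithms; however, we note that the form of (2.10) reduces to that of (2.9) when $\mathbf{Z} = \mathbf{1}$ and $q(\mathbf{Y}_0, \mathbf{Z}, p_I) \equiv 1$. We therefore outline the MCEM procedure for (2.10), maximization of (2.9) being a special case.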

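Treating $\mathbf{b}_0$ and $\mathbf{Z}$ as missing data, the complete-data loglikelihood associated with (2.10) takes the form:

$$
l_c(\boldsymbol{\Theta}_0, \mathbf{Y}_0, \mathbf{b}_0, \mathbf{Z}) = l_c^{(1)}(p_I, \mathbf{Y}_0, \mathbf{Z}) + l_c^{(2)}(p_M, \mathbf{Z}) + l_c^{(3)}(\boldsymbol{\beta}_0, \boldsymbol{\alpha}_0, \mathbf{Y}_0, \mathbf{b}_0, \mathbf{Z}) + l_c^{(4)}(\sigma_{b0}, \mathbf{b}_0)
$$

where

$$
l_c^{(1)}(p_I, \mathbf{Y}_0, \mathbf{Z}) = \log(p_I) \sum_{i=1}^{N} z_i y_i(0) + \log(1 - p_I) \sum_{i=1}^{N} z_i (1 - y_i(0)),
$$

$$
l_c^{(2)}(p_M, \mathbf{Z}) = \log(p_M) \sum_{i=1}^{N} z_i + \log(1 - p_M) \Big( N - \sum_{i=1}^{N} z_i \Big),
$$

$$
l_c^{(3)}(\boldsymbol{\beta}_0, \boldsymbol{\alpha}_0, \mathbf{Y}_0, \mathbf{b}_0, \mathbf{Z}) = \sum_{i=1}^{N} \sum_{t \in D_{0i}} \Big\{ z_i y_i(t) [\boldsymbol{\beta}_0' \mathbf{x}_i(t) + g_0(t, \boldsymbol{\alpha}_0) + b_{0i}] - z_i \log(1 + \exp(\boldsymbol{\beta}_0' \mathbf{x}_i(t) + g_0(t, \boldsymbol{\alpha}_0) + b_{0i})) \Big\},
$$

$$
l_c^{(4)}(\sigma_{b0}, \mathbf{b}_0) = -(N - 1) \log(\sigma_{b0}) - \frac{\mathbf{b}_0' (D - C) \mathbf{b}_0}{2 \sigma_{b0}^2}.
$$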

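Starting with initial parameter estimates, $\boldsymbol{\Theta}_0^{(0)}$, and setting $h = 0$, the algorithm consists of four steps:

1. Run the Gibbs sampler and generate $J$ realizations, $\mathbf{b}_0^{(1)}, \mathbf{Z}^{(1)}, \ldots, \mathbf{b}_0^{(J)}, \mathbf{Z}^{(J)}$, from the conditional distribution $f(\mathbf{b}_0, \mathbf{Z} \mid \mathbf{Y}_0, \boldsymbol{\Theta}_0^{(h)})$.

2. Calculate $Q(\boldsymbol{\Theta}_0 \mid \boldsymbol{\Theta}_0^{(h)}) = \frac{1}{J} \sum_{k=1}^{J} l_c(\boldsymbol{\Theta}_0, \mathbf{Y}_0, \mathbf{b}_0^{(k)}, \mathbf{Z}^{(k)})$.

3. Maximize $Q(\boldsymbol{\Theta}_0 \mid \boldsymbol{\Theta}_0^{(h)})$ over $\boldsymbol{\Theta}_0$ to obtain $\boldsymbol{\Theta}_0^{(h+1)}$.

4. Assess convergence. If convergence has been achieved then stop. Otherwise set $h = h + 1$ and go to step 1.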
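For some components of $\boldsymbol{\Theta}_0$ the maximization in step 3 has a closed form. Setting the derivatives of the Monte Carlo averages of $l_c^{(1)}$, $l_c^{(2)}$ and $l_c^{(4)}$ to zero gives $\hat{p}_I = \sum_k \sum_i z_i^{(k)} y_i(0) / \sum_k \sum_i z_i^{(k)}$, $\hat{p}_M = \sum_k \sum_i z_i^{(k)} / (JN)$ and $\hat{\sigma}_{b0}^2 = \sum_k \mathbf{b}_0^{(k)\prime}(D - C)\mathbf{b}_0^{(k)} / \{J(N - 1)\}$. The sketch below computes these three updates from Gibbs draws; it is an illustration only, the function name and toy inputs are hypothetical, and the update of $\boldsymbol{\beta}_0$ and $\boldsymbol{\alpha}_0$ (which requires numerical optimization of $l_c^{(3)}$) is not shown.

```python
def m_step_closed_form(z_draws, b_draws, y0, C):
    """Closed-form M-step updates for p_I, p_M and sigma_b0^2.

    z_draws : J sampled allocation vectors Z^(k), each of length N
    b_draws : J sampled random-effect vectors b_0^(k), each of length N
    y0      : initial states y_i(0), length N
    C       : symmetric neighbourhood matrix with zero diagonal
    Assumes at least one sampled z_i equals 1.
    """
    J, N = len(z_draws), len(y0)
    total_z = sum(sum(z) for z in z_draws)
    p_init = sum(z[i] * y0[i] for z in z_draws for i in range(N)) / total_z
    p_mover = total_z / (J * N)

    def quad_form(b):
        # b'(D - C)b equals the sum over pairs i < j of C_ij (b_i - b_j)^2
        return sum(C[i][j] * (b[i] - b[j]) ** 2
                   for i in range(N) for j in range(i + 1, N))

    sigma2 = sum(quad_form(b) for b in b_draws) / (J * (N - 1))
    return p_init, p_mover, sigma2

# Toy inputs: N = 3 subjects on a line, J = 2 Monte Carlo draws
C = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
p_init, p_mover, sigma2 = m_step_closed_form(
    z_draws=[[1, 1, 0], [1, 0, 0]],
    b_draws=[[0.0, 1.0, 3.0], [1.0, 1.0, 1.0]],
    y0=[1, 0, 0],
    C=C,
)
```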

In implementing this algorithm, two important issues arise: first is the choice of Monte Carlo sample size, $J$, to be used at each iteration and second is monitoring convergence of the algorithm. Wei and Tanner (1990) suggest using small values of $J$ in the initial stages of the algorithm and increasing $J$ as the algorithm moves closer to convergence. Regarding convergence, they recommend plotting estimates at each iteration of the algorithm. Convergence is then indicated by the stabilization of the process with random fluctuations about some fixed value.

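Gibbs sampling at the $(h + 1)$st iteration requires simulation from the full conditional distributions, $[b_{0i} \mid \mathbf{Y}_0, \mathbf{b}_0^{(-i)}, \mathbf{Z}, \boldsymbol{\Theta}_0^{(h)}]$ and $[z_i \mid \mathbf{Y}_0, \mathbf{Z}^{(-i)}, \mathbf{b}_0, \boldsymbol{\Theta}_0^{(h)}]$, $i = 1, \ldots, N$. These full conditional distributions are given by

$$
f_i(b_{0i} \mid \mathbf{Y}_0, \mathbf{b}_0^{(-i)}, \mathbf{Z}, \boldsymbol{\Theta}_0^{(h)}) \propto
\begin{cases}
\exp\left( -\dfrac{C_{i.} \big( b_{0i} - \frac{1}{C_{i.}} \sum_{j=1}^{N} C_{ij} b_{0j} \big)^2}{2 \sigma_{b0}^{(h)2}} \right) & \text{if } z_i = 0, \\[3ex]
\exp\left( -\dfrac{C_{i.} \big( b_{0i} - \frac{1}{C_{i.}} \sum_{j=1}^{N} C_{ij} b_{0j} \big)^2}{2 \sigma_{b0}^{(h)2}} \right) \displaystyle\prod_{t \in D_{0i}} \dfrac{\exp(\boldsymbol{\beta}_0^{(h)\prime} \mathbf{x}_i(t) + g_0(t, \boldsymbol{\alpha}_0^{(h)}) + b_{0i})^{y_i(t)}}{1 + \exp(\boldsymbol{\beta}_0^{(h)\prime} \mathbf{x}_i(t) + g_0(t, \boldsymbol{\alpha}_0^{(h)}) + b_{0i})} & \text{if } z_i = 1,
\end{cases}
\tag{2.11}
$$

and $[z_i \mid \mathbf{Y}_0, \mathbf{Z}^{(-i)}, \mathbf{b}_0, \boldsymbol{\Theta}_0^{(h)}] \sim \text{Bernoulli}(p_{z_i})$ with

$$
p_{z_i} =
\begin{cases}
1 & \text{if } \mathbf{y}_i \neq \mathbf{0}, \\[1ex]
\dfrac{p_M^{(h)} (1 - p_I^{(h)}) \prod_{t=1}^{n_i - 1} [1 + \exp(\boldsymbol{\beta}_0^{(h)\prime} \mathbf{x}_i(t) + g_0(t, \boldsymbol{\alpha}_0^{(h)}) + b_{0i})]^{-1}}{(1 - p_M^{(h)}) + p_M^{(h)} (1 - p_I^{(h)}) \prod_{t=1}^{n_i - 1} [1 + \exp(\boldsymbol{\beta}_0^{(h)\prime} \mathbf{x}_i(t) + g_0(t, \boldsymbol{\alpha}_0^{(h)}) + b_{0i})]^{-1}} & \text{if } \mathbf{y}_i = \mathbf{0}.
\end{cases}
\tag{2.12}
$$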
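The allocation probability in (2.12) follows from Bayes' rule: a subject with any infection must be a mover, while for an all-zero sequence the prior odds of being a mover are weighted by the probability that a mover starts in state 0 and never leaves it, which is $(1 - p_I)$ times the product of $1 - p_{01i}(t)$ over $t$. A minimal sketch, assuming the linear predictors for $\text{logit}(p_{01i}(t))$ are supplied as a list indexed by $t$ and using a hypothetical function name:

```python
import math

def mover_probability(y, eta01, p_mover, p_init):
    """Full-conditional probability that z_i = 1, following (2.12).

    y       : list of 0/1 states y(0), ..., y(n-1)
    eta01   : linear predictors for logit p01(t), indexed by t (entry 0 unused)
    p_mover : current estimate of p_M
    p_init  : current estimate of p_I
    """
    if any(y):
        return 1.0                         # an infected subject cannot be a stayer
    # Probability that a mover produces the all-zero sequence
    stay = 1.0 - p_init
    for t in range(1, len(y)):
        stay *= 1.0 / (1.0 + math.exp(eta01[t]))   # 1 - p01(t)
    num = p_mover * stay
    return num / ((1.0 - p_mover) + num)

# Hypothetical subject never observed infected, with p01(t) = 0.5 throughout
p_z = mover_probability([0, 0, 0], [0.0, 0.0, 0.0], p_mover=0.5, p_init=0.2)
```

With these hypothetical values the mover component assigns probability $0.8 \times 0.5^2 = 0.2$ to the all-zero sequence, giving `p_z` equal to $0.1 / 0.6 = 1/6$. In a Gibbs sweep, $z_i$ would then be drawn as a Bernoulli variable with this success probability.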
