
1. In Section 5.2, we looked at a classical data set modelled by autoregression techniques, the lynx data. Another such set involves the annual Wölfer sunspot numbers between 1770 and 1869 (Hand et al., 1994, pp. 85–86) (read across rows):

101 82 66 35 31 7 20 92 154 125

85 68 38 23 10 24 83 132 131 118

90 67 60 47 41 21 16 6 4 7

14 34 45 43 48 42 28 10 8 2

0 1 5 12 14 35 46 41 30 24

16 7 4 2 8 17 36 50 62 67

71 48 28 8 13 57 122 138 103 86

63 37 24 11 15 40 62 98 124 96

66 64 54 39 21 7 4 23 55 94

96 77 59 44 47 30 16 7 37 74

They measure the average number of sunspots on the sun each year.

Can you find a Markov model of an appropriate order to describe these data adequately? A number of different suggestions have been made in the literature.

2. The following table gives the enrollment at Yale University, 1796–1975 (Anscombe, 1981, p. 130) (read across rows):

115 123 168 195 217 217 242 233 200

222 204 196 183 228 255 305 313 328

350 352 298 333 349 376 412 407 481

473 459 470 454 501 474 496 502 469

485 536 514 572 570 564 561 608 574

550 537 559 542 588 584 522 517 531

555 558 604 594 605 619 598 565 578

641 649 599 617 632 644 682 709 699

724 736 755 809 904 955 1031 1051 1021

1039 1022 1003 1037 1042 1096 1092 1086 1075

1134 1245 1365 1477 1645 1784 1969 2202 2350

2415 2615 2645 2674 2684 2542 2712 2816 3142

3138 3806 3605 3433 3450 3312 3282 3229 3288

3272 3310 3267 3262 2006 2554 3306 3820 3930

4534 4461 5155 5316 5626 5457 5788 6184 5914

5815 5631 5475 5362 5493 5483 5637 5747 5744

5694 5454 5036 5080 4056 3363 8733 8991 9017

8519 7745 7688 7567 7555 7369 7353 7664 7488

7665 7793 8129 8221 8404 8333 8614 8539 8654

8666 8665 9385 9214 9231 9219 9427 9661 9721

The primary irregularities in these data occur during the two World Wars. Develop an adequate Markov model for these count data. Among other possibilities, compare a normal model that uses differences in enrollment between years with a Poisson model that involves ratios of successive enrollment rates. Is there evidence of overdispersion?

3. Annual snowfall (inches) in Buffalo, New York, USA, was recorded from 1910 to 1972 (Parzen, 1979) (read across rows):

126.4 82.4 78.1 51.1 90.9 76.2 104.5 87.4 110.5 25.0

69.3 53.5 39.8 63.6 46.7 72.9 79.7 83.6 80.7 60.3

79.0 74.4 49.6 54.7 71.8 49.1 103.9 51.6 82.4 83.6

77.8 79.3 89.6 85.5 58.0 120.7 110.5 65.4 39.9 40.1

88.7 71.4 83.0 55.9 89.9 84.8 105.2 113.7 124.7 114.5

115.6 102.4 101.4 89.8 71.5 70.9 98.3 55.5 66.1 78.4

120.5 97.0 110.0

Find an appropriate model to describe these time series data. Is there evidence of a trend or of a cyclical phenomenon?

4. Beveridge (1936) also gives the average daily wages (pence) of several other classes of labourers each decade from 1250 to 1459 on several Winchester manors in England. Those for carpenters and masons (both in Taunton manor) are shown below, as well as the rates for agricultural labourers and the price of wheat used in the example above.

Decade  Agricultural rate  Carpenter’s wage  Mason’s wage  Wheat price

1250– 3.30 3.01 2.91 4.95

1260– 3.37 3.08 2.95 4.52

1270– 3.45 3.00 3.23 6.23

1280– 3.62 3.04 3.11 5.00

1290– 3.57 3.05 3.30 6.39

1300– 3.85 3.14 2.93 5.68

1310– 4.05 3.12 3.13 7.91

1320– 4.62 3.03 3.27 6.79

1330– 4.92 2.91 3.10 5.17

1340– 5.03 2.94 2.89 4.79

1350– 5.18 3.47 3.80 6.96

1360– 6.10 3.96 4.13 7.98

1370– 7.00 4.02 4.04 6.67

1380– 7.22 3.98 4.00 5.17

1390– 7.23 4.01 4.00 5.45

1400– 7.31 4.06 4.29 6.39

1410– 7.35 4.08 4.30 5.84

1420– 7.34 4.11 4.31 5.54

1430– 7.30 4.51 4.75 7.34

1440– 7.33 5.13 5.15 4.86

1450– 7.25 4.27 5.26 6.01

Can models similar to those for agricultural labourers be developed for the other two types of workers? Does it make any difference that these are wages and not rates for piece work?

5. A number of women in the United States of America were followed over five years, from 1967 to 1971, in the University of Michigan Panel Study of Income Dynamics. The sample consisted of white women who were continuously married to the same husband over the five-year period. Having worked in a given year is defined as having earned any money during that year. The sample paths of labour force participation are given in the following table (Heckman and Willis, 1977):

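1970  1969  1968  1967  1971: Yes  1971: No

Yes  Yes  Yes  Yes  426  38

No  Yes  Yes  Yes  16  47

Yes  No  Yes  Yes  11  2

No  No  Yes  Yes  12  28

Yes  Yes  No  Yes  21  7

No  Yes  No  Yes  0  9

Yes  No  No  Yes  8  3

No  No  No  Yes  5  43

Yes  Yes  Yes  No  73  11

No  Yes  Yes  No  7  17

Yes  No  Yes  No  9  3

No  No  Yes  No  5  24

Yes  Yes  No  No  54  16

No  Yes  No  No  6  28

Yes  No  No  No  36  24

No  No  No  No  35  559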

Study how the most recent employment record of each woman depends on her previous history. Is there any indication of heterogeneity among the women? Notice that here there are two types of stable behaviour that might be classified as stayers.

6. The numbers of deaths by horse kicks in the Prussian army from 1875 to 1894 for 14 corps (Andrews and Herzberg, 1985, p. 18) are as follows:

Corps  Year (1875 to 1894, left to right)

G 0 2 2 1 0 0 1 1 0 3 0 2 1 0 0 1 0 1 0 1

I 0 0 0 2 0 3 0 2 0 0 0 1 1 1 0 2 0 3 1 0

II 0 0 0 2 0 2 0 0 1 1 0 0 2 1 1 0 0 2 0 0

III 0 0 0 1 1 1 2 0 2 0 0 0 1 0 1 2 1 0 0 0

IV 0 1 0 1 1 1 1 0 0 0 0 1 0 0 0 0 1 1 0 0

V 0 0 0 0 2 1 0 0 1 0 0 1 0 1 1 1 1 1 1 0

VI 0 0 1 0 2 0 0 1 2 0 1 1 3 1 1 1 0 3 0 0

VII 1 0 1 0 0 0 1 0 1 1 0 0 2 0 0 2 1 0 2 0

VIII 1 0 0 0 1 0 0 1 0 0 0 0 1 0 0 0 1 1 0 1

IX 0 0 0 0 0 2 1 1 1 0 2 1 1 0 1 2 0 1 0 0

X 0 0 1 1 0 1 0 2 0 2 0 0 0 0 2 1 3 0 1 1

XI 0 0 0 0 2 4 0 1 3 0 1 1 1 1 2 1 3 1 3 1

XIV 1 1 2 1 1 3 0 4 0 1 0 3 2 1 0 2 1 1 0 0

XV 0 1 0 0 0 0 0 1 0 1 1 0 0 0 2 2 0 0 0 0

G indicates the guard corps. This and corps I, VI, and XI have different organizations than the others. Can you detect any trends with time? Are there systematic differences among the corps?

7. Reanalyze the data on children wheezing in Exercise 2.5, taking into account the longitudinal aspect of the data.

6 Survival Data

6.1 General Concepts

6.1.1 Skewed Distributions

A duration is the time until some event occurs. Thus, the response is a non-negative random variable. If the special case of a survival time is being observed, the event is considered to be absorbing, so that observation of that individual must stop when it occurs. We first consider this case, although most of the discussion applies directly to more general durations such as the times between repeated events, called event histories (Chapter 7). Usually, the distribution of durations will not be symmetric, but will have a form like that in Figure 6.1 (this happens to be a log normal distribution). This restricts the choice of possible distributions to be used. For example, a normal distribution would not be appropriate. Suitable distributions within the generalized linear model family include the log normal, gamma, and inverse Gaussian.
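The asymmetry described above can be made concrete with a small numerical sketch. It evaluates a log normal density and compares its mode, median, and mean. The parameter values `mu` and `sigma` are illustrative assumptions, not values taken from the text or from Figure 6.1.

```python
import math

def lognormal_pdf(t, mu, sigma):
    """Density of a log normal duration: log(T) ~ Normal(mu, sigma^2)."""
    if t <= 0:
        return 0.0  # durations are non-negative
    z = (math.log(t) - mu) / sigma
    return math.exp(-0.5 * z * z) / (t * sigma * math.sqrt(2.0 * math.pi))

# Illustrative parameter values (assumptions, not taken from the text).
mu, sigma = 2.0, 0.8

mode = math.exp(mu - sigma ** 2)        # where the density peaks
median = math.exp(mu)                   # half of the durations are shorter
mean = math.exp(mu + 0.5 * sigma ** 2)  # pulled to the right by the long tail

print(f"mode={mode:.2f}  median={median:.2f}  mean={mean:.2f}")
print(f"density at mode={lognormal_pdf(mode, mu, sigma):.4f}")
```

For a symmetric distribution such as the normal, these three summaries would coincide; here they are ordered mode, median, mean, which is the right skew typical of durations.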

6.1.2 Censoring

Because individuals are to be observed over time, until the prescribed event, and because time is limited and costly, not all individuals may be followed until an event. Such data are called censored. Censored observations are incomplete, but they still contain important information. We know that the event did not occur before the end of the observation period.

FIGURE 6.1. A typical density function for a survival curve. (Horizontal axis: survival time; vertical axis: probability density.)

Censoring can occur for a number of reasons. For example, the protocol for the study may specify observation over a fixed period of time or individual cases may disappear from the study for some reason.

Planned censoring may occur in two main ways:

If recording of an event must stop after a fixed time interval, we have Type I or time censoring, most often used in medical studies.

If the study must continue until complete information is available on a fixed number of cases, we have Type II or failure censoring, most common in industrial testing.

However, cases may drop out for reasons not connected with the study or beyond the control of the research worker. These may or may not be linked to the response or explanatory variables; for example, through side effects under a medical treatment. If they are related, the way in which censoring occurs cannot simply be ignored in the model. Thus, in complex cases, where censoring is not random, a model will need to be constructed for it, although generally very little information will be available in the data about such a model.

It is important to distinguish situations where censoring only depends on information already available at the censoring point from other possibilities because, for that case, such modelling may be possible. For example, the censoring indicator could be made to depend on the available explanatory variables in some form of regression model.

Even with ignorable causes of censoring, the analysis is further complicated because we cannot simply use the density function as it stands.
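The two planned censoring schemes of this section can be sketched as follows. This is a minimal illustration, not a prescribed procedure: the function names `type_i_censor` and `type_ii_censor` are hypothetical, the durations are simulated, and the indicator uses 1 for an observed event and 0 for a censored one.

```python
import random

def type_i_censor(times, limit):
    """Time censoring: recording stops at the fixed time `limit`.
    Returns (recorded time, indicator) pairs; indicator 1 = event observed."""
    return [(min(t, limit), 1 if t <= limit else 0) for t in times]

def type_ii_censor(times, r):
    """Failure censoring: the study stops when the r-th event has occurred.
    All remaining cases are censored at the time of that r-th event."""
    ordered = sorted(times)
    stop = ordered[r - 1]
    return [(t, 1) for t in ordered[:r]] + [(stop, 0) for _ in ordered[r:]]

random.seed(1)
# Simulated durations with mean 10 (an assumption made only for illustration).
durations = [random.expovariate(0.1) for _ in range(10)]

print(type_i_censor(durations, limit=15.0))
print(type_ii_censor(durations, r=6))
```

Under Type I censoring the number of observed events is random and the study length is fixed; under Type II censoring it is the other way round.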

6.1.3 Probability Functions

If the probability density function is $f(t)$ and the cumulative probability function is $F(t)$, then the survivor function is

$$S(t) = 1 - F(t)$$

and the hazard function is

$$h(t) = \frac{f(t)}{S(t)} = -\frac{d \log S(t)}{dt}$$

This is the rate or intensity of the point processes of the previous chapter.

Then, we have

$$S(t) = \exp\left[-\int_0^t h(u)\,du\right]$$

$$f(t) = h(t) \exp\left[-\int_0^t h(u)\,du\right]$$

where $\int_0^t h(u)\,du$ is called the integrated hazard or intensity.
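The relations among the density, survivor function, and hazard can be checked numerically. The sketch below uses a Weibull distribution purely as a convenient assumption (it is not singled out in the text), with illustrative shape and scale values, and approximates the integrated hazard by a midpoint rule.

```python
import math

# Weibull duration with illustrative shape and scale (assumed values).
SHAPE, SCALE = 1.5, 10.0

def hazard(t):
    """h(t) for the Weibull distribution."""
    return (SHAPE / SCALE) * (t / SCALE) ** (SHAPE - 1.0)

def survivor(t):
    """S(t) = 1 - F(t) for the Weibull distribution."""
    return math.exp(-((t / SCALE) ** SHAPE))

def density(t):
    """f(t) = h(t) S(t)."""
    return hazard(t) * survivor(t)

def integrated_hazard(t, steps=20000):
    """Midpoint-rule approximation to the integral of h(u) from 0 to t."""
    width = t / steps
    return sum(hazard((i + 0.5) * width) for i in range(steps)) * width

t = 8.0
# Each pair should agree closely: closed form versus exp(-integrated hazard).
print(survivor(t), math.exp(-integrated_hazard(t)))
print(density(t), hazard(t) * math.exp(-integrated_hazard(t)))
```

The same check applies to any continuous duration distribution once its hazard function is written down.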

Suppose that $I_i$ is a code or indicator variable for censoring, with $I_i = 1$ if the observation $i$ is completely observed and $I_i = 0$ if it is censored.

Then, the probability for a sample of $n$ individuals will be approximately (because the density assumes that one can actually observe in continuous time) proportional to

$$\prod_{i=1}^{n} [f(t_i)]^{I_i} [S(t_i)]^{1-I_i} \qquad (6.1)$$

and a likelihood function can be derived from this. In most cases, this does not yield a generalized linear model.
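As a hedged illustration of how expression (6.1) is used, the sketch below evaluates its logarithm for the 6-mercaptopurine remission times of Table 6.1. The exponential distribution is assumed here only because its likelihood has a closed-form maximum; the text does not propose it for these data.

```python
import math

# 6-mercaptopurine remission times (weeks) from Table 6.1.
# Indicator 1 = event observed, 0 = censored (asterisk in the table).
times = [6, 6, 6, 6, 7, 9, 10, 10, 11, 13, 16,
         17, 19, 20, 22, 23, 25, 32, 32, 34, 35]
observed = [1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1,
            0, 0, 0, 1, 1, 0, 0, 0, 0, 0]

def log_likelihood(rate):
    """Log of expression (6.1) under an assumed exponential model:
    f(t) = rate * exp(-rate * t) and S(t) = exp(-rate * t)."""
    total = 0.0
    for t, i in zip(times, observed):
        log_f = math.log(rate) - rate * t
        log_s = -rate * t
        total += i * log_f + (1 - i) * log_s
    return total

# For the exponential, the maximum is events divided by total time at risk.
rate_hat = sum(observed) / sum(times)
print(rate_hat, log_likelihood(rate_hat))
```

Censored cases contribute only through the survivor function, so they add to the total time at risk in the denominator without adding to the event count in the numerator.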

6.2 “Nonparametric” Estimation

Before looking at specific parametric models for a set of data, it is often useful to explore the data by means of a “nonparametric” estimation procedure similar to those used in some of the previous chapters. As usual, such models are, in fact, highly parametrized, generally being saturated models.

The most commonly used for survival data is the Kaplan–Meier product limit estimate (Kaplan and Meier, 1958).

TABLE 6.1. Remission times (weeks) from acute leukaemia under two treatments, with censored times indicated by asterisks. (Gehan, 1965, from Freireich et al.)

6-mercaptopurine

6 6 6 6* 7 9* 10 10* 11* 13 16 17* 19* 20* 22 23 25* 32* 32* 34* 35*

Placebo

1 1 2 2 3 4 4 5 5 8 8 8 8 11 11 12 12 15 17 22 23

If $\pi_j$ is the probability of having an event at time $t_j$, conditional on not having an event until then, that is, on surviving to that time, the likelihood function is

$$L(\pi) = \prod_{j=1}^{k} \pi_j^{d_j} (1 - \pi_j)^{n_j - d_j}$$

where $n_j$ is the number having survived and still under observation, and hence still known to be at risk just prior to $t_j$, called the risk set, $d_j$ is the number having the event at time $t_j$, and $\pi_j$ is the hazard or intensity at $t_j$. This is a special application of the binomial distribution, with maximum likelihood estimates $\hat{\pi}_j = d_j/n_j$. Then, the product limit estimate of the survivor function is just the product of the estimated probabilities of not having the event at all time points up to the one of interest:

$$\hat{S}(t) = \prod_{j \mid t_j < t} \frac{n_j - d_j}{n_j}$$

a special application of Equation (5.1). This may be plotted in various ways (Lindsey, 1992, pp. 52–57) to explore what form of parametric model might fit the data. It provides a saturated model to which others can be compared.

Example

Table 6.1 gives a classical data set on the time maintained in remission for cases of acute leukaemia under two treatments. In this trial, conducted sequentially so that patients were entering the study over time, 6-mercaptopurine was compared to a placebo. The results in the table are from one year after the start of the study, with an upper limit of the observation time of about 35 weeks.

The Kaplan–Meier estimates of the survivor functions for these two groups are plotted in Figure 6.2. We see how the treatment group has longer estimated survival times than the placebo group.

FIGURE 6.2. Kaplan–Meier curves for the survival data of Table 6.1. (Horizontal axis: time, 0 to 35 weeks; vertical axis: survival probability, 0 to 1; curves: placebo and 6-mercaptopurine.)
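The product limit estimate can be computed directly from the data of Table 6.1. The following is a minimal sketch, not necessarily the procedure used to draw the figure. The function name `kaplan_meier` is hypothetical, and a case censored at an event time is counted as still at risk at that time, which is a common convention assumed here.

```python
def kaplan_meier(times, observed):
    """Product limit estimate. Returns (t_j, n_j, d_j, S) for each distinct
    event time, where S is the estimated survivor function just after t_j."""
    event_times = sorted({t for t, i in zip(times, observed) if i == 1})
    surv = 1.0
    rows = []
    for tj in event_times:
        # Risk set: cases still under observation just prior to t_j.
        nj = sum(1 for t in times if t >= tj)
        # Number of events at t_j (censored cases at t_j are excluded).
        dj = sum(1 for t, i in zip(times, observed) if t == tj and i == 1)
        surv *= (nj - dj) / nj
        rows.append((tj, nj, dj, surv))
    return rows

# Table 6.1; indicator 1 = event observed, 0 = censored (asterisk).
mp_times = [6, 6, 6, 6, 7, 9, 10, 10, 11, 13, 16,
            17, 19, 20, 22, 23, 25, 32, 32, 34, 35]
mp_obs = [1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1,
          0, 0, 0, 1, 1, 0, 0, 0, 0, 0]
placebo_times = [1, 1, 2, 2, 3, 4, 4, 5, 5, 8, 8,
                 8, 8, 11, 11, 12, 12, 15, 17, 22, 23]
placebo_obs = [1] * len(placebo_times)  # no censoring in the placebo group

for label, t, i in (("6-mercaptopurine", mp_times, mp_obs),
                    ("Placebo", placebo_times, placebo_obs)):
    print(label)
    for tj, nj, dj, s in kaplan_meier(t, i):
        print(f"  t={tj:2d}  at risk={nj:2d}  events={dj}  S={s:.3f}")
```

Because the estimate changes only at observed event times, it is a step function; the censored cases enter solely through the size of the risk set.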

Source: Applying Generalized Linear Models (pages 116–125)