More Complex Models - Applying Generalized Linear Models

The exponential, logistic, and Gompertz models, in addition to polynomials (which are rarely justified), are those most frequently used for growth curves.

However, more complex functions of time can also be used; an example is given in Chapter 10. Obviously, if other supplementary information is available, this should also be incorporated in the model. In most cases, this will involve explanatory variables, but, in some cases, multivariate processes will be required.

A simple function for a bivariate Poisson process can be written

$$\lambda(t, u) = \lambda_{Dt}\,\lambda_{Ru} \qquad (4.3)$$

where, in our example, t will be the diagnosis time with mean incidence, λ_Dt, and u the reporting delay, with mean rate, λ_Ru, both growth functions of unspecified form. In this model, the two processes are assumed to be independent, so that the reporting delay distribution is the same at all points in time; that is, it is stationary. More complex models can be built up from this, usually by adding the appropriate interaction terms.
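One way to see why independence implies a stationary delay distribution is to look at the share of cases diagnosed at time t that are reported with delay u. As a short sketch (the summation index v over delay categories is notation introduced here, not taken from the text):

$$\frac{\lambda(t, u)}{\sum_{v} \lambda(t, v)} = \frac{\lambda_{Dt}\,\lambda_{Ru}}{\lambda_{Dt} \sum_{v} \lambda_{Rv}} = \frac{\lambda_{Ru}}{\sum_{v} \lambda_{Rv}}$$

The diagnosis term cancels, so this share does not depend on t. An interaction between the two time dimensions is what breaks this cancellation.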

Example

For the above data on the growth of the AIDS epidemic, the problem of reporting delays was mentioned. We shall now incorporate this information in the model. In this way, we shall be able to use all of the observations, not just the earlier, complete ones. One problem that was not discussed above is the large number of observations, 132,170 in the complete data set. This will mean that the usual inference procedures, including the standard AIC, will point to very complex models. In spite of this, I shall continue in the usual way, but discuss the consequences further below.

Data on incidence of AIDS with reporting delays take the form of a rectangular contingency table with observations in one triangular corner missing (see Exercise 4.7 for an example). The two dimensions of such a table are the diagnosis period and the reporting delay. The margin for diagnosis period (as in Table 4.2) gives the total incidence over time, but with the most recent values too small because of the missing triangle of values not yet reported. Hay and Wolak (1994) give diagnoses and reporting delays in the United States of America by quarter, except for an additional zero delay category, yielding a 34×17 table with a triangle of 120 missing cells (not reproduced here, although one margin was given in Table 4.2).

We are now simultaneously interested in how AIDS cases are occurring in the population over time (as above) and how these cases, occurring at a given time point, subsequently arrive at the central office. The latter process may be evolving over time, for example, if reporting improves or if the increasing number of cases swamps available facilities. Then our model will be a bivariate process, as described above.

The bivariate Poisson process of Equation (4.3), where we do not specify how the number of cases is growing, just corresponds to a log linear model for independence that can be fitted as

FACDELAY+FACQUARTER

where FACDELAY and FACQUARTER are appropriate factor variables. When there are missing cells, we have a quasi-independence model where the triangle of missing data is weighted out (to obtain predictions of incidence below, it must not be simply left out of the model). The fitted values for the diagnosis-time margin in this “nonparametric” (quasi-) stationary model are plotted as the solid line in Figure 4.5. We discover a predicted leveling off of AIDS cases for 1990. For this model, we obtain a deviance of 5381.6 (AIC 5481.6) with 392 d.f., indicating substantial nonstationarity.
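The idea of weighting out the missing triangle while still predicting it can be illustrated with a minimal sketch. It assumes a tiny invented table (not the Hay and Wolak data) and uses iterative proportional fitting, which solves the Poisson likelihood equations for a row-plus-column log linear model; the function name and the use of `None` for unreported cells are choices made here for illustration.

```python
def fit_quasi_independence(table, n_iter=200):
    """Fit a multiplicative row x column model using only the observed cells.

    `table` is a list of rows; None marks a cell not yet reported.
    Such cells contribute nothing to the fit (weight zero) but still
    receive a fitted value, which serves as the prediction.
    """
    n_rows, n_cols = len(table), len(table[0])
    row = [1.0] * n_rows   # diagnosis-period effects
    col = [1.0] * n_cols   # reporting-delay effects
    for _ in range(n_iter):
        # Match each observed row total, summing over observed cells only.
        for i in range(n_rows):
            seen = [j for j in range(n_cols) if table[i][j] is not None]
            row[i] = sum(table[i][j] for j in seen) / sum(col[j] for j in seen)
        # Match each observed column total in the same way.
        for j in range(n_cols):
            seen = [i for i in range(n_rows) if table[i][j] is not None]
            col[j] = sum(table[i][j] for i in seen) / sum(row[i] for i in seen)
    return [[row[i] * col[j] for j in range(n_cols)] for i in range(n_rows)]


# Invented 4 x 3 table: rows are diagnosis periods, columns are delays.
table = [
    [5, 3, 2],
    [10, 6, 4],
    [15, 9, None],
    [20, None, None],
]
fitted = fit_quasi_independence(table)

# Completed diagnosis margin: observed counts plus predictions for missing cells.
completed = [
    sum(table[i][j] if table[i][j] is not None else fitted[i][j]
        for j in range(len(table[0])))
    for i in range(len(table))
]
print([round(x, 3) for x in completed])
```

Because the invented counts follow an exact product structure, the fit reproduces them and fills the missing corner by extrapolating the same delay distribution to the most recent periods, which is precisely the stationarity assumption.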

A completely “nonparametric” nonstationary model, that is, the saturated model (FACDELAY*FACQUARTER), will not provide estimates for the missing triangle of values. Some assumption must be made about the evolution of reporting delays over time, the strongest being stationarity, that is, no change, that we have just used. One possible simple nonstationary model is the following interaction model:

FACDELAY+FACQUARTER+LINDELAY·FACQUARTER+LINQUART·FACDELAY

where LINDELAY and LINQUART are linear, instead of factor, variables. With such grouped data for LINDELAY, as described above, we use centres of three-month quarterly periods, but with an arbitrary 0.1 value for the zero reporting delay category (to allow for logarithms below). Because we are interested in rates or intensities per unit time (months), we use an offset of log(3) for all delay periods, in all models, except for an arbitrary log(0.2) for the zero reporting delay. For these data, the deviance decreases by 2670.0 (AIC 2905.6) on 47 d.f. with respect to the (quasi-) stationary model, a strong indication of nonstationarity. For comparison with subsequent AICs, those for these two models are given in line (1) of Table 4.3.

FIGURE 4.5. Estimated AIDS incidence in the U.S.A. taking into account reporting delays. [Plot of cases (0 to 20,000) against year (1982 to 1992), with curves for the nonparametric stationary, nonparametric nonstationary, semiparametric, and parametric models.]
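The coding of the linear delay variable and its offset, as described for these models, might be sketched as follows. The function name and the indexing of delay categories are hypothetical, and the period centres of 1.5, 4.5, ... months are an assumed reading of "centres of three-month quarterly periods"; only the values 0.1, log(0.2), and log(3) come from the text.

```python
import math


def delay_covariates(k):
    """Covariates for delay category k (0 = zero delay, 1 = first quarter, ...)."""
    if k == 0:
        lindelay = 0.1          # arbitrary value quoted in the text
        exposure = 0.2          # arbitrary exposure behind the log(0.2) offset
    else:
        lindelay = 3 * k - 1.5  # assumed centre, in months, of the k-th period
        exposure = 3.0          # each delay period covers three months
    return {
        "LINDELAY": lindelay,
        "LOGDELAY": math.log(lindelay),  # defined for k = 0 only because 0.1 > 0
        "OFFSET": math.log(exposure),
    }


for k in (0, 1, 2, 16):
    print(k, delay_covariates(k))
```

Giving the zero-delay category a small positive value rather than 0 is what keeps the logarithm finite for that category.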

This nonstationary model has been plotted as the dashed line in Figure 4.5. It yields completely unstable predictions for the last quarters. Notice that this and the previous model follow the diagnosed cases exactly for the period until 1987 (24 quarters), where there were no missing cases due to reporting delays.

A simple parametric bivariate model could have an exponential growth in one dimension and a Weibull distribution for the other (Chapters 6 and 7). This can be written

$$\lambda(t, u) = \alpha\, e^{\beta_1 t}\, u^{\beta_2}$$

This can be fitted as a log linear model using a linear time variable for the first dimension and the logarithm of the time for the second, instead of the factor variables above. Again, this is a (quasi-) stationary model. Notice that we do not include factor variables to fix the marginal totals of the contingency table at their observed values.
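Taking logarithms makes the log linear structure explicit. As a sketch, writing β_1 for the coefficient of diagnosis time and β_2 for the Weibull-type exponent of the delay (this labelling of the coefficients is an assumption):

$$\log \lambda(t, u) = \log \alpha + \beta_1 t + \beta_2 \log u$$

The linear predictor thus contains an intercept, a linear term in t, and a term in log u, with no interaction between the two time dimensions.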

TABLE 4.3. Deviances for a series of models fitted to the reporting delay data of Hay and Wolak (1994).

                        Stationary         Nonstationary
Model                   d.f.    AIC        d.f.    AIC
(1) “Nonparametric”     392     5481.6     345     2905.6
(2) “Semiparametric”    422     6083.4     390     3385.5
(3) Parametric          435     6295.3     426     4020.8

We can generalize this model by taking other transformations of time. Again, nonstationarity can be introduced by means of interactions between the (transformed) time variables for the two time dimensions. We might consider time, its reciprocal, and its logarithm. Such a bivariate (quasi-) stationary model would be written

$$\lambda(t, u) = \alpha\, e^{\beta_1 t + \beta_2/t}\, t^{\beta_3}\, e^{\beta_4 u + \beta_5/u}\, u^{\beta_6}$$

Example

If we apply this model to the AIDS reporting delays, we use

LINQUART+RECQUART+LOGQUART+LINDELAY+RECDELAY+LOGDELAY

The deviance for this model is 899.7 larger than that for the corresponding stationary “nonparametric” model, on 43 d.f., indicating a considerably poorer model. If we add all nine interaction terms to yield a nonstationary model, we obtain a deviance 2292.5 on nine d.f. smaller than the previous one, but 1277.2 larger than the corresponding nonstationary “nonparametric” model above, however with 81 fewer parameters. Again, the AICs for these two parametric models are summarized in line (3) of Table 4.3.
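The AICs in line (3) of Table 4.3 can be rebuilt from the deviance differences quoted in the text. The sketch below assumes AIC = deviance + 2p and takes the parameter count of the stationary “nonparametric” model to be the one implied by its quoted deviance (5381.6) and AIC (5481.6); the variable names are illustrative.

```python
def aic(deviance, n_params):
    """AIC as deviance plus twice the number of estimated parameters."""
    return deviance + 2 * n_params


# Stationary "nonparametric" model: deviance and AIC as quoted in the text.
dev_np_stat, aic_np_stat = 5381.6, 5481.6
p_np_stat = round((aic_np_stat - dev_np_stat) / 2)  # implied parameter count

# Stationary parametric model: deviance 899.7 larger, 43 fewer parameters.
dev_par_stat = dev_np_stat + 899.7
p_par_stat = p_np_stat - 43

# Nonstationary "nonparametric" model: deviance 2670.0 smaller, 47 more parameters.
dev_np_non = dev_np_stat - 2670.0
p_np_non = p_np_stat + 47

# Nonstationary parametric model: deviance 1277.2 larger, 81 fewer parameters.
dev_par_non = dev_np_non + 1277.2
p_par_non = p_np_non - 81

aic_par_stat = round(aic(dev_par_stat, p_par_stat), 1)
aic_par_non = round(aic(dev_par_non, p_par_non), 1)
print(aic_par_stat, aic_par_non)  # 6295.3 4020.8
```

Both values agree with line (3) of Table 4.3. The parametric models are penalized very little for complexity, so their much larger AICs come almost entirely from lack of fit.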

The fitted values for the margin for diagnoses for this nonstationary parametric model have been plotted, as the dotted line, in Figure 4.5. Notice how the curve no longer follows the random fluctuations of the completed diagnosis counts up until 1987. This accounts for much of the lack of fit of this model. Not surprisingly, the predictions no longer indicate a levelling off.

Much of the apparent lack of fit of the parametric models may be due to the form of the intensity function for delays, which is high for short delays but then descends rapidly for longer delays. This can be checked by fitting a parametric intensity for diagnosis and a “nonparametric” one for the delay.

For (quasi-) stationarity, we find a reasonable model to be

LINQUART+RECQUART+LOGQUART+FACDELAY

where FACDELAY is again a factor variable, and for nonstationarity

RECQUART+LOGQUART+FACDELAY+FACDELAY·(LINQUART+LOGQUART)

with AICs as given in line (2) of Table 4.3. The fitted values for this nonstationary “semiparametric” model give substantially lower predictions than those for our nonstationary parametric model, as can be seen in Figure 4.5. The difference in deviance between them is 707.3 with 36 d.f. These are both quite different from the very unstable nonstationary “nonparametric” model, although the latter has a deviance 569.9 on 45 d.f. smaller than the “semiparametric” model.

FIGURE 4.6. The estimated rates of AIDS reporting in the United States of America, from the nonstationary parametric model. [Surface plot of rate against quarter and delay.]

Other nonstationary models, parametric for diagnoses, give fairly similar results for the bivariate intensity. By experimenting with various transformations of time, a somewhat better parametric model can be found, with predictions very close to those for the “semiparametric” model. We discover that returns with short delays are rapidly increasing with time, but those for longer delays are growing even more rapidly. This can be seen from the plot in Figure 4.6. In contrast, a stationary model, such as that used by Hay and Wolak (1994), has reporting rates increasing at all delays in the same way. Thus, the missing triangle will contain estimates that are too low, explaining the leveling off indicated by such a model in 1990.

It is interesting to compare these results with those for England, as seen in Figure 4.7, where delays are becoming shorter so that the stationary model, which ignores this, predicts a rise in AIDS cases. See also Lindsey (1996a) and Exercise 4.7 below.

FIGURE 4.7. The estimated rates of AIDS reporting in the United Kingdom, from the nonstationary parametric model. [Surface plot of rate against quarter and delay.]

Note that we have not attempted to construct a likelihood interval for the estimates, as we did above for the marginal data without reporting delays. This would be an essential further step that needs to be carried out before the projections from these models are useful. However, they depend critically on the model, as we saw above. Indeed, the intervals published in the papers cited above for the two sets of data, using stationary models in both cases, do not even cover the point projection from our better fitting nonstationary models!

In this example, the AIC for the saturated model is 916. We saw in Table 4.3 that none of the models fitted comes close to this value. This is due to the large number of observations in this data set. For such models to be selected, a factor considerably greater than two times the number of estimated parameters would have to be added to the deviance. In other words, to compensate for the extremely large number of observations, the smoothing factor, a, of Section A.1.4 would have to be smaller.
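A rough sense of how much larger the penalty would need to be can be had from a short calculation. The sketch below assumes a criterion of the form deviance + c × (number of parameters) and takes the residual d.f. in Table 4.3 as the number of extra parameters in the saturated model (both are assumptions made here); the deviances are assembled from the differences quoted in the text. A model is then preferred to the saturated one only when c exceeds deviance / d.f.

```python
def break_even_penalty(deviance, df):
    """Penalty per parameter, c, at which deviance + c*p equals c*(p + df),
    the criterion value of the saturated model (whose deviance is zero)."""
    return deviance / df


# Nonstationary models: residual d.f. from Table 4.3, deviances from the text.
dev_nonparametric = 5381.6 - 2670.0                # 47 d.f. below the stationary fit
dev_semiparametric = dev_nonparametric + 569.9     # 45 d.f. difference
dev_parametric = dev_semiparametric + 707.3        # 36 d.f. difference

thresholds = {
    "nonparametric": break_even_penalty(dev_nonparametric, 345),
    "semiparametric": break_even_penalty(dev_semiparametric, 390),
    "parametric": break_even_penalty(dev_parametric, 426),
}
for name, c in thresholds.items():
    print(f"{name}: c must exceed about {c:.2f}")
```

Under these assumptions each threshold is several times the value c = 2 used by the standard AIC, in line with the remark that a considerably larger factor would be needed.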

Summary

In this chapter, we have been primarily concerned with the nonlinear form of the regression curve for growth data. So far, we have ignored the dependence within the series of values due to observations coming from the same individual(s). We look at this in the next chapter. Combining nonlinear regression models with dependence among observations is only slowly becoming feasible (Chapter 10) in the generalized linear model context. The books on longitudinal data cited in the next chapter are useful references.
