Exercises - Applying Generalized Linear Models

1. (a) Figures 1.1 and 1.2 show respectively how the normal and binomial distributions change as the mean changes. Although informative, these graphics are, in some ways, fundamentally different. Discuss why.

(b) Construct a similar plot for the Poisson distribution. Is it more similar to the normal or to the binomial plot?

2. Choose some data set from a linear regression or analysis of variance course that you have had and suggest some more appropriate model for it than ones based on the normal distribution. Explain how the model may be useful in understanding the underlying data generating mechanism.

3. Why is intrinsic alias more characteristic of models for designed experiments, whereas extrinsic alias arises most often in observational studies such as sample surveys?

4. (a) Plot the likelihood function for the mean parameter of a Poisson distribution when the estimated mean is $\bar{y}_\bullet = 2.5$ for $n = 10$ observations. Give an appropriate likelihood interval about the mean.

(b) Repeat for the same estimated mean when $n = 30$ and compare the results in the two cases.

(d) How do these results relate to Fisher information? To the use of standard errors as a measure of estimation precision?

2 Discrete Data

2.1 Log Linear Models

Traditionally, the study of statistics begins with models based on the normal distribution. This approach gives students a biased view of what is possible in statistics because, as we shall see, the most fundamental models are those for discrete data. As well, the latter are now by far the most commonly used in applied statistics. Thus, we begin our presentation of generalized linear regression modelling with the study of log linear models.

Log linear models and their special case for binary responses, logistic models, are designed for the modelling of frequency and count data, that is, those where the response variable involves discrete categories, as described in Section 1.1.4. Because they are based on the exponential family of distributions, they constitute a direct extension of traditional regression and analysis of variance. The latter models are based on the normal distribution (Chapter 9), whereas logistic and log linear models are based on the Poisson or multinomial distributions and their special cases, such as the binomial distribution. Thus, they are all members of the generalized linear model family.

Usually, although not necessarily, one models either the frequencies of occurrence of the various categories or the counts of events. Occasionally, as in some logistic regression models, the individual indicator variables of the categories are modelled. However, when both individual and grouped frequency data are available, they both give identical results. Thus, for the moment, we can concentrate, here, on grouped frequency data.

TABLE 2.1. A two-way table for change over time.

                Time 2
                A     B
Time 1    A     45    13
          B     12    54

2.1.1 Simple Models

In order to provide a brief and simple introduction to logistic and log linear models, I have chosen concrete applications to modelling changes over time. However, the same principles that we shall study here also apply to cross-sectional data with a set of explanatory variables.

Observations over Time

Consider a simple two-way contingency table, Table 2.1, where some response variable with two possible values, A and B, was recorded at two points in time. A first characteristic that we may note is a relative stability over time, as indicated by the large frequencies on the diagonal. In other words, response at time 2 depends heavily on that at time 1, most often being the same.

As a simple model, we might consider that the responses at time 2 have a binomial distribution and that this distribution depends on what response was given at time 1. Thus, we might have the simple linear regression model

$$\log\left(\frac{\pi_{1|j}}{\pi_{2|j}}\right) = \beta_0 + \beta_1 x_j$$

where $x_j$ is the response at time 1 and $\pi_{i|j}$ is the conditional probability of response $i$ at time 2 given the observed value of $x_j$ at time 1. Then, if $\beta_1 = 0$, this indicates independence, that is, that the second response does not depend on the first. In the Wilkinson and Rogers (1973) notation, the model can be written simply as the name of the variable:

TIME1

If the software also required specification of the response variable at the same time, this would become

TIME2 ~ TIME1

where TIME2 represents a 2×2 matrix of the frequencies in the table, with columns corresponding to the two possible response values at the second time point.

TABLE 2.2. A two-way table of clustered data.

                Right eye
                A     B
Left eye  A     45    13
          B     12    54

This logistic regression model, with a logit link (the logarithm of the ratio of probabilities), is the direct analogue of classical (normal theory) linear regression. On the other hand, if $x_j$ is coded (−1, 1) or (0, 1), we may rewrite this as

$$\log\left(\frac{\pi_{1|j}}{\pi_{2|j}}\right) = \mu + \alpha_j$$

where $\mu = \beta_0$, the direct analogue of an analysis of variance model, with the appropriate constraints. With suitable software, TIME1 would simply be declared as a factor variable having two levels.

Example

The parameter estimates for Table 2.1 are $\hat{\beta}_0 = \hat{\mu} = 1.242$ and $\hat{\beta}_1 = \hat{\alpha}_1 = -2.746$, when $x_j$ is coded (0, 1), with an AIC of 4. (The deviance is zero and there are two parameters.) That with $\alpha_1 = \beta_1 = 0$, that is, independence, has AIC 48.8. Thus, in comparing the two models, the first, with dependence on the previous response, is much superior, as indicated by the smaller AIC. □
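The figures in this example can be checked by hand, because the model with dependence is saturated and so reproduces the two observed logits exactly. The following is a minimal sketch in Python, not the author's code. It assumes, as the quoted numbers suggest, that the AIC is the deviance plus twice the number of parameters, that $x_j = 0$ codes response A at time 1, and that the logit compares A with B at time 2. All variable and function names are illustrative.

```python
import math

# Table 2.1: keys are the response at time 1, values the (A, B) counts at time 2.
table = {"A": (45, 13), "B": (12, 54)}

# Saturated logistic model: the fitted logits equal the observed logits.
logit_a = math.log(table["A"][0] / table["A"][1])
logit_b = math.log(table["B"][0] / table["B"][1])
beta0 = logit_a            # intercept: x_j = 0 (response A at time 1)
beta1 = logit_b - logit_a  # change in the logit when x_j = 1


def binomial_deviance(rows, fitted_probs):
    """Deviance of a binomial model with one fitted probability per row."""
    dev = 0.0
    for (y1, y2), p in zip(rows, fitted_probs):
        n = y1 + y2
        dev += 2 * (y1 * math.log(y1 / (n * p))
                    + y2 * math.log(y2 / (n * (1 - p))))
    return dev


rows = list(table.values())

# Independence (beta1 = 0): one common probability of response A at time 2.
p_common = sum(r[0] for r in rows) / sum(sum(r) for r in rows)
deviance_indep = binomial_deviance(rows, [p_common, p_common])

# Assumed convention: AIC = deviance + 2 * (number of parameters).
aic_dependence = 0.0 + 2 * 2
aic_independence = deviance_indep + 2 * 1

print(round(beta0, 3), round(beta1, 3))
print(aic_dependence, round(aic_independence, 1))
```

Under these assumptions the printed values should agree, after rounding, with the estimates and AICs quoted in the example.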

Clustered Observations

Let us momentarily leave data over time and consider, instead, the same table, now Table 2.2, as some data on the two eyes of people. We again have repeated observations on the same individuals, but here they may be considered as being made simultaneously rather than sequentially. Again, there will usually be a large number with similar responses, resulting from the dependence between the two similar eyes of each person.

Here, we would be more inclined to model the responses simultaneously as a multinomial distribution over the four response combinations, with joint probability parameters, $\pi_{ij}$. In that way, we can look at the association between them. Thus, we might use a log link such that

$$\log(\pi_{ij}) = \phi + \mu_i + \nu_j + \alpha_{ij} \qquad (2.1)$$

With the appropriate constraints, this is again an analogue of classical analysis of variance. It is called a log linear model. If modelled by the Poisson representation (Section 2.1.2), it could be given in one of two equivalent ways:

REYE * LEYE

REYE + LEYE + REYE · LEYE

With specification of the response variable, the latter becomes

FREQ ~ REYE + LEYE + REYE · LEYE

where FREQ is a four-element vector containing the frequencies in the table. Notice that, in this representation of the multinomial distribution, the “response variable”, FREQ, is not really a variable of direct interest at all.

Example

Here, the parameter estimates for Table 2.1 are $\hat{\phi} = 2.565$, $\hat{\nu}_1 = 1.424$, $\hat{\mu}_1 = 1.242$, and $\hat{\alpha}_{11} = -2.746$, with an AIC of 8. (Again, the deviance is zero, but here there are four parameters.) That with $\alpha_{11} = 0$ has AIC 52.8. (This is 4 larger than in the previous case because the model has two more parameters, but the difference in AIC is the same.) The conclusion is identical, that the independence model is much inferior to that with dependence. □

Log Linear and Logistic Models

The two models just described have a special relationship to each other. With the same constraints, the dependence parameter, $\alpha$, is identical in the two cases because

$$\log\left(\frac{\pi_{1|1}\,\pi_{2|2}}{\pi_{1|2}\,\pi_{2|1}}\right) = \log\left(\frac{\pi_{11}\,\pi_{22}}{\pi_{12}\,\pi_{21}}\right)$$

The normed profile likelihoods for $\alpha = 0$ are also identical, although the AICs are not because of the different numbers of parameters explicitly estimated in the two models (differences in AIC are, however, the same). This is a general result: in cases where both are applicable, logistic and log linear models yield the same conclusions. The choice is a matter of convenience.
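The equality of the two cross-product ratios, and of the AIC differences, can be verified numerically for the counts of Tables 2.1 and 2.2. The sketch below is illustrative only. It assumes the AIC is the deviance plus twice the number of parameters, and it computes the log cross-product ratio directly from the cell probabilities, so its sign depends on which categories are taken as the baseline and may be the opposite of the estimate reported under a particular coding.

```python
import math

# Counts from Tables 2.1 and 2.2: rows index the first response
# (time 1 or left eye), columns the second (time 2 or right eye).
counts = [[45, 13], [12, 54]]
total = sum(map(sum, counts))
row_totals = [sum(r) for r in counts]
col_totals = [sum(c) for c in zip(*counts)]

# Conditional probabilities of the second response given the first,
# and joint probabilities over the four cells.
cond = [[counts[j][i] / row_totals[j] for i in range(2)] for j in range(2)]
joint = [[counts[j][i] / total for i in range(2)] for j in range(2)]


def log_cross_product_ratio(p):
    """Log of (p11 * p22) / (p12 * p21) for a 2x2 array of probabilities."""
    return math.log(p[0][0] * p[1][1] / (p[0][1] * p[1][0]))


log_or_cond = log_cross_product_ratio(cond)    # logistic (conditional) form
log_or_joint = log_cross_product_ratio(joint)  # log linear (joint) form

# Deviance of the independence model, identical for both formulations.
deviance_indep = 2 * sum(
    counts[j][i] * math.log(counts[j][i] * total / (row_totals[j] * col_totals[i]))
    for j in range(2) for i in range(2)
)

# Assumed convention: AIC = deviance + 2 * (number of parameters).
aic_diff_logistic = (deviance_indep + 2 * 1) - (0 + 2 * 2)
aic_diff_loglinear = (deviance_indep + 2 * 3) - (0 + 2 * 4)

print(round(log_or_cond, 3), round(log_or_joint, 3))
print(round(aic_diff_logistic, 1), round(aic_diff_loglinear, 1))
```

The row totals cancel in the conditional version and the grand total cancels in the joint version, which is why both ratios reduce to the same function of the four counts.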

This is a very important property, because it means that such models can be used for retrospective sampling. Common examples of this include, in medicine, case-control studies, and, in the social sciences, mobility studies.

These results extend directly to larger tables, including higher dimensional tables. There, direct analogues of classical regression and ANOVA models are still applicable. Thus, complex models of dependence among categorical variables can be built up by means of multiple regression. Explanatory variables can be discrete or continuous (at least if the data are not aggregated in a contingency table).

2.1.2 Poisson Representation

With a log linear model, we may have more than two categories for the response variable(s), so that we require a multinomial, instead of a binomial, distribution. This cannot generally be directly fitted by standard generalized linear modelling software. However, an important relationship exists between the multinomial and Poisson distributions that makes fitting such models possible.

Consider independent Poisson distributions with means $\mu_k$ and corresponding numbers of events $n_k$. Let us condition on the observed total number of events, $n_\bullet = \sum_k n_k$. From the properties of the Poisson distribution, this total will also have a Poisson distribution, with mean $\mu_\bullet = \sum_k \mu_k$. Then, the conditional distribution will be

$$\frac{\prod_{k=1}^{K} e^{-\mu_k}\mu_k^{n_k}/n_k!}{e^{-\mu_\bullet}\mu_\bullet^{n_\bullet}/n_\bullet!} = \binom{n_\bullet}{n_1 \cdots n_K} \prod_{k=1}^{K} \left(\frac{\mu_k}{\mu_\bullet}\right)^{n_k}$$

a multinomial distribution with probabilities $\pi_k = \mu_k/\mu_\bullet$. Thus, any multinomial distribution can be fitted as a product of independent Poisson distributions with the appropriate conditioning on the total number of events.
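This relationship is easy to confirm numerically. The following sketch uses invented Poisson means and counts (they are not from the text) and compares the joint Poisson probability, divided by the Poisson probability of the total, with the multinomial probability for $\pi_k = \mu_k/\mu_\bullet$. The function names are illustrative.

```python
import math


def poisson_pmf(n, mu):
    """Probability of n events under a Poisson distribution with mean mu."""
    return math.exp(-mu) * mu ** n / math.factorial(n)


def multinomial_pmf(counts, probs):
    """Multinomial probability of the given counts with cell probabilities probs."""
    coef = math.factorial(sum(counts))
    for c in counts:
        coef //= math.factorial(c)  # exact: each partial quotient is an integer
    return coef * math.prod(p ** c for p, c in zip(probs, counts))


# Hypothetical values chosen only for illustration.
mu = [2.0, 0.5, 3.5]  # Poisson means mu_k
counts = [3, 1, 4]    # observed numbers of events n_k

mu_total = sum(mu)
n_total = sum(counts)

# Product of independent Poisson probabilities, conditioned on the total.
joint = math.prod(poisson_pmf(c, m) for c, m in zip(counts, mu))
conditional = joint / poisson_pmf(n_total, mu_total)

# Multinomial probability with pi_k = mu_k / mu_total.
multinomial = multinomial_pmf(counts, [m / mu_total for m in mu])

print(conditional, multinomial)
```

The two printed numbers should agree up to floating-point rounding, whatever positive means and non-negative counts are substituted.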

Specifically, this means that, when fitting such models, the product of all explanatory variables must be included in the minimal log linear model, in order to fix the appropriate marginal totals in the table:

R1 + R2 + · · · + E1 * E2 * · · · (2.2)

where Ri represents a response variable and Ej an explanatory variable. This ensures that all responses have proper probability distributions (Lindsey, 1995b). Much of log linear modelling involves searching for simple structural models of relationships among responses (Ri) and of dependencies of responses on explanatory variables.
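To see how the minimal model fixes the marginal totals of the explanatory variables, consider one response R and two explanatory variables E1 and E2, so that (2.2) reduces to R + E1 * E2. For this particular model the maximum likelihood fitted values have the standard closed form $n_{i\bullet\bullet}\,n_{\bullet jk}/n_{\bullet\bullet\bullet}$. The sketch below applies it to invented counts (not data from the text) and checks that the fitted E1 by E2 margin reproduces the observed one.

```python
# Invented counts n[i][j][k]: response R = i, explanatory E1 = j, E2 = k.
n = [
    [[20, 15], [10, 30]],
    [[25, 5], [40, 35]],
]
levels = range(2)
total = sum(n[i][j][k] for i in levels for j in levels for k in levels)

# Observed margins: response totals and the E1 x E2 table of totals.
r_margin = [sum(n[i][j][k] for j in levels for k in levels) for i in levels]
e_margin = [[sum(n[i][j][k] for i in levels) for k in levels] for j in levels]

# Closed-form fitted values for R + E1*E2 (R does not depend on E1 or E2).
fitted = [
    [[r_margin[i] * e_margin[j][k] / total for k in levels] for j in levels]
    for i in levels
]

# Summing the fitted values over the response recovers the observed
# E1 x E2 totals, so each combination of explanatory values keeps its
# own sample size and the response probabilities sum to one within it.
fitted_e_margin = [
    [sum(fitted[i][j][k] for i in levels) for k in levels] for j in levels
]

print(e_margin)
print(fitted_e_margin)
```

Dropping the E1 · E2 interaction from the model would, in general, no longer reproduce these totals, which is why the full product of explanatory variables belongs in every model considered.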
