4.8 GENERALIZED ADDITIVE MODELS*

4.8.1 Smoothing Data

or (0.54, 0.78). This is wider, however, than the Wald interval of (0.59, 0.73) for comparing independent proportions, which ignores the overdispersion.

INTRODUCTION TO GENERALIZED LINEAR MODELS

not smooth so much that it suppresses interesting patterns. This approach may suggest that a linear model is adequate with a particular link or suggest ways to improve on linearity. Some software packages that do not have GAMs can smooth the data by employing a type of regression that gives greater weight to nearby observations in predicting the value at a given point; such locally weighted least squares regression is often referred to as lowess. We prefer GAMs because they recognize explicitly the form of the response. For instance, with a binary response, lowess can give predicted values below 0 or above 1, which cannot happen with a GAM.
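The point about lowess and binary responses can be checked with a small sketch. The function below is a simplified locally weighted linear fit (tricube weights, a fixed bandwidth `h`, no robustness iterations), not the full lowess procedure of any particular package. The name `local_linear_fit` and the four data points are hypothetical, chosen only for illustration.

```python
def tricube(u):
    """Tricube weight: largest at u = 0, zero for |u| >= 1."""
    u = abs(u)
    return (1.0 - u ** 3) ** 3 if u < 1.0 else 0.0


def local_linear_fit(x, y, x0, h):
    """Weighted least squares line around x0, evaluated at x0.

    Points within distance h of x0 get tricube weights; others get weight 0.
    """
    w = [tricube((xi - x0) / h) for xi in x]
    sw = sum(w)
    if sw == 0.0:
        raise ValueError("no observations within bandwidth h of x0")
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    slope = sxy / sxx if sxx > 0.0 else 0.0
    return ybar + slope * (x0 - xbar)


# Hypothetical binary responses, for illustration only.
x = [0.0, 1.0, 2.0, 3.0]
y = [0, 0, 1, 1]
fit_at_3 = local_linear_fit(x, y, x0=3.0, h=4.0)
print(fit_at_3)  # exceeds 1, which is not a valid probability
```

Because the local line is fitted on the scale of the response itself, nothing constrains it to the unit interval. A GAM with a logit link smooths on the linear-predictor scale and transforms back, so its fitted probabilities stay between 0 and 1.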

Even when one plans to use GLMs, a GAM can be helpful for exploratory analysis. For instance, for continuous X with continuous responses, scatter diagrams provide visual information about the dependence of Y on X. For binary responses, the following example shows that such diagrams are not very informative. Plotting the fitted smooth function for a predictor may reveal a general trend without assuming a particular functional relationship.

FIGURE 4.7 Whether satellites are present (1, yes; 0, no), by width of female crab, with smoothing fit of generalized additive model.

4.8.2 GAMs for Horseshoe Crab Example

In Section 4.3.2, Figure 4.4 showed the trend relating number of satellites for horseshoe crabs to their width. This smooth curve is the fit of a generalized additive model, assuming a Poisson distribution and using the log link.
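As a reminder of what such a fit assumes, a Poisson GAM with log link and a single predictor can be written as follows. The symbols $\alpha$ and $s(\cdot)$ are generic notation introduced here for illustration, with $s$ an unspecified smooth function estimated from the data:

```latex
\[
  Y \sim \text{Poisson}\bigl(\mu(x)\bigr), \qquad
  \log \mu(x) = \alpha + s(x).
\]
```

Taking $s(x) = \beta x$ recovers the ordinary Poisson loglinear GLM, so the smooth fit can be read as a check on that linear form.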

In the next chapter we’ll use logistic regression to model the probability that a crab has at least one satellite. For crab $i$, let $y_i = 1$ if she has at least one satellite and $y_i = 0$ otherwise. Figure 4.7 plots these data against $x$ = crab width. It consists of a set of points with $y_i = 1$ and a second set of points with $y_i = 0$. The numbered symbols indicate the number of observations at each point. It appears that $y_i = 1$ tends to occur relatively more often at higher $x$ values. Figure 4.7 also shows a curve based on smoothing the data using a GAM, assuming a binomial response and logit link. This curve shows a roughly increasing trend and is more informative than viewing the binary data alone. It suggests that an S-shaped regression function may describe this relationship relatively well.
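To see why a smoothed curve is easier to read than two rows of 0/1 points, one can compute a kernel-weighted running proportion. This is a cruder smoother than the GAM behind Figure 4.7, not a reproduction of it. The function name `running_proportion`, the bandwidth, and the widths and outcomes below are all made up for illustration and are not the horseshoe crab data.

```python
import math


def running_proportion(x, y, x0, bandwidth):
    """Gaussian-kernel weighted proportion of y = 1 near x0.

    A weighted average of 0/1 values, so the result always lies in [0, 1].
    """
    w = [math.exp(-0.5 * ((xi - x0) / bandwidth) ** 2) for xi in x]
    return sum(wi * yi for wi, yi in zip(w, y)) / sum(w)


# Made-up widths and satellite indicators, NOT the horseshoe crab data.
width = [22.0, 23.0, 24.0, 25.0, 26.0, 27.0, 28.0, 29.0]
has_satellite = [0, 0, 1, 0, 1, 1, 1, 1]

grid = [22.0, 24.0, 26.0, 28.0]
smooth = [running_proportion(width, has_satellite, g, bandwidth=1.5) for g in grid]
for g, p in zip(grid, smooth):
    print(f"width {g:.0f}: smoothed proportion {p:.2f}")
```

The smoothed values summarize how often $y_i = 1$ occurs near each width, which is the kind of trend the raw binary scatter hides.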

NOTES

Section 4.1: Generalized Linear Model

4.2. Distribution (4.1) is called a natural or linear exponential family to distinguish it from a more general exponential family that replaces $y$ by $r(y)$ in the exponential term. For other generalizations, see Jorgensen (1987). Books on GLMs and related models, in approximate order of technical level from highest to lowest, are McCullagh and Nelder (1989), Fahrmeir and Tutz (2001), Aitkin et al. (1989), Dobson (2002), and Gill (2000).

deviation $\pi/(\beta\sqrt{3})$.

4.21 Show representation (4.18) for the binomial distribution.

4.22 Let $Y_i$ be a bin$(n_i, \pi_i)$ variate for group $i$, $i = 1, \ldots, N$, with $\{Y_i\}$ independent. Consider the model that $\pi_1 = \cdots = \pi_N$. Denote that common value by $\pi$. For observations $\{y_i\}$, show that $\hat{\pi} = (\sum_i y_i)/(\sum_i n_i)$. When all $n_i = 1$, for testing this model’s fit in the $N \times 2$ table, show that $X^2 = n$. Thus, goodness-of-fit statistics can be completely uninformative for ungrouped data. (See also Problem 5.37.)

4.23 Suppose that $Y_i$ is Poisson with $g(\mu_i) = \alpha + \beta x_i$, where $x_i = 1$ for $i = 1, \ldots, n_A$ from group A and $x_i = 0$ for $i = n_A + 1, \ldots, n_A + n_B$ from group B. Show that for any link function $g$, the likelihood equations (4.22) imply that fitted means $\hat{\mu}_A$ and $\hat{\mu}_B$ equal the sample means.

4.24 For binary data with sample proportion $y_i$ based on $n_i$ trials, we use quasi-likelihood to fit a model using variance function (4.46). Show that parameter estimates are the same as for the binomial GLM but that the covariance matrix multiplies by $\phi$.

4.25 A binomial GLM $\pi_i = \Phi(\sum_j \beta_j x_{ij})$ with arbitrary inverse link function $\Phi$ assumes that $n_i Y_i$ has a bin$(n_i, \pi_i)$ distribution. Find $w_i$ in (4.27) and hence $\text{cov}(\hat{\beta})$. For logistic regression, show that $w_i = n_i \pi_i (1 - \pi_i)$.

4.26 A GLM has parameter $\beta$ with sufficient statistic $S$. A goodness-of-fit test statistic $T$ has observed value $t_o$. If $\beta$ were known, a P-value is $P = P(T \ge t_o; \beta)$. Explain why $P(T \ge t_o \mid S)$ is the uniform minimum variance unbiased estimator of $P$.

4.27 Let $y_{ij}$ be observation $j$ of a count variable for group $i$, $i = 1, \ldots, I$, $j = 1, \ldots, n_i$. Suppose that $\{Y_{ij}\}$ are independent Poisson with $E(Y_{ij}) = \mu_i$.

a. Show that the ML estimate of $\mu_i$ is $\hat{\mu}_i = \bar{y}_i = \sum_j y_{ij}/n_i$.

b. Simplify the expression for the deviance for this model. [For testing this model, it follows from Fisher (1970, p. 58, originally published in 1925) that the deviance and the Pearson statistic $\sum_i \sum_j (y_{ij} - \bar{y}_i)^2/\bar{y}_i$ have approximate chi-squared distributions with df $= \sum_i (n_i - 1)$. For a single group, Cochran (1954) referred to $\sum_j (y_{1j} - \bar{y}_1)^2/\bar{y}_1$ as the variance test for the fit of a Poisson distribution, since it compares the sample variance to the estimated Poisson variance $\bar{y}_1$.]

4.28 Conditional on $\lambda$, $Y$ has a Poisson distribution with mean $\lambda$. Values of $\lambda$ vary according to gamma density (13.12), which has $E(\lambda) = \mu$, $\text{var}(\lambda) = \mu^2/k$. Show that marginally $Y$ has the negative binomial distribution (4.12). Explain why the negative binomial model is a way to handle overdispersion for the Poisson.

4.29 Consider the class of binary models (4.8) and (4.9). Suppose that the standard cdf $\Phi$ corresponds to a probability density function $\phi$ that is symmetric around 0.

a. Show that $x$ at which $\pi(x) = 0.5$ is $x = -\alpha/\beta$.

b. Show that the rate of change in $\pi(x)$ when $\pi(x) = 0.5$ is $\beta\phi(0)$. Show this is $0.25\beta$ for the logit link and $\beta/\sqrt{2\pi}$ (where $\pi = 3.14\ldots$) for the probit link.

c. Show that the probit regression curve has the shape of a normal cdf with mean $-\alpha/\beta$ and standard deviation $1/|\beta|$.

4.30 Show the normal distribution $N(\mu, \sigma^2)$ with fixed $\sigma$ satisfies family (4.1), and identify the components. Formulate the ordinary regression model as a GLM.

4.31 In Problem 4.30, when $\sigma$ is also a parameter, show that it satisfies the exponential dispersion family (4.14).

4.32 For binary observations, consider the model $\pi(x) = \frac{1}{2} + (1/\pi)\tan^{-1}(\alpha + \beta x)$. Which distribution has cdf of this form? Explain when a GLM using this curve might be more appropriate than logistic regression.

In the document Categorical Data Analysis (pages 168-178)