

From the document Categorical Data Analysis (pages 168-178)


4.8 GENERALIZED ADDITIVE MODELS*

4.8.1 Smoothing Data

or (0.54, 0.78). This is wider, however, than the Wald interval of (0.59, 0.73) for comparing independent proportions, which ignores the overdispersion.

INTRODUCTION TO GENERALIZED LINEAR MODELS


not smooth so much that it suppresses interesting patterns. This approach may suggest that a linear model is adequate with a particular link or suggest ways to improve on linearity. Some software packages that do not have GAMs can smooth the data by employing a type of regression that gives greater weight to nearby observations in predicting the value at a given point;

such locally weighted least squares regression is often referred to as lowess. We prefer GAMs because they recognize explicitly the form of the response. For instance, with a binary response, lowess can give predicted values below 0 or above 1, which cannot happen with a GAM.
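The idea of locally weighted least squares can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the algorithm of any particular package: the function name `local_linear_fit`, the tricube weight function, and the `span` parameter are choices made here for illustration. The fit at a point is an ordinary weighted straight-line fit, so nothing in it restricts the result to the interval [0, 1] when the response is binary.

```python
import math

def local_linear_fit(x0, xs, ys, span=0.5):
    """Locally weighted straight-line fit evaluated at x0 (illustrative sketch)."""
    n = len(xs)
    k = min(n, max(3, math.ceil(span * n)))        # neighbours inside the window
    # Window half-width: slightly beyond the k-th nearest point, so that
    # point still receives a small positive weight.
    h = sorted(abs(x - x0) for x in xs)[k - 1] * 1.001 or 1.0
    # Tricube weights: near 1 close to x0, falling to 0 at distance h.
    w = [max(0.0, 1 - (abs(x - x0) / h) ** 3) ** 3 for x in xs]
    sw = sum(w)
    xbar = sum(wi * x for wi, x in zip(w, xs)) / sw
    ybar = sum(wi * y for wi, y in zip(w, ys)) / sw
    sxx = sum(wi * (x - xbar) ** 2 for wi, x in zip(w, xs))
    if sxx == 0:                                   # all weight on one x value
        return ybar
    sxy = sum(wi * (x - xbar) * (y - ybar) for wi, x, y in zip(w, xs, ys))
    return ybar + (sxy / sxx) * (x0 - xbar)

# Hypothetical binary responses, for illustration only (not the crab data).
xs = [float(i) for i in range(10)]
ys = [0, 0, 0, 1, 0, 1, 1, 1, 1, 1]
for x0 in (0.0, 4.5, 9.0):
    print(x0, round(local_linear_fit(x0, xs, ys, span=0.7), 3))
```

Because each fitted value is a point on a weighted least squares line, checking the printed values against 0 and 1 is a quick way to see whether a given span produces out-of-range predictions for a given data set.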

Even when one plans to use GLMs, a GAM can be helpful for exploratory analysis. For instance, for continuous X with continuous responses, scatter diagrams provide visual information about the dependence of Y on X. For binary responses, the following example shows that such diagrams are not very informative. Plotting the fitted smooth function for a predictor may reveal a general trend without assuming a particular functional relationship.

FIGURE 4.7 Whether satellites are present (1, yes; 0, no), by width of female crab, with smoothing fit of generalized additive model.

4.8.2 GAMs for Horseshoe Crab Example

In Section 4.3.2, Figure 4.4 showed the trend relating number of satellites for horseshoe crabs to their width. This smooth curve is the fit of a generalized additive model, assuming a Poisson distribution and using the log link.

In the next chapter we'll use logistic regression to model the probability that a crab has at least one satellite. For crab $i$, let $y_i = 1$ if she has at least one satellite and $y_i = 0$ otherwise. Figure 4.7 plots these data against $x$ = crab width. It consists of a set of points with $y_i = 1$ and a second set of points with $y_i = 0$. The numbered symbols indicate the number of observations at each point. It appears that $y_i = 1$ tends to occur relatively more often at higher $x$ values. Figure 4.7 also shows a curve based on smoothing the data using a GAM, assuming a binomial response and logit link. This curve shows a roughly increasing trend and is more informative than viewing the binary data alone. It suggests that an S-shaped regression function may describe this relationship relatively well.

NOTES

Section 4.1: Generalized Linear Model

4.1. Distribution (4.1) is called a natural or linear exponential family to distinguish it from a more general exponential family that replaces $y$ by $r(y)$ in the exponential term. For other generalizations, see Jørgensen (1987). Books on GLMs and related models, in approximate order of technical level from highest to lowest, are McCullagh and Nelder (1989), Fahrmeir and Tutz (2001), Aitkin et al. (1989), Dobson (2002), and Gill (2000). See also Firth (1991).

Section 4.3: Generalized Linear Models for Counts

4.2. For further discussion of Poisson regression and related models for count data, see Breslow (1984), Cameron and Trivedi (1998), Frome (1983), Hinde (1982), Lawless (1987), and Seeber (1998) and references therein.

Section 4.4: Moments and Likelihood for Generalized Linear Models

4.3. The function $b(\cdot)$ in (4.14) is called the cumulant function, since when $a(\phi) = 1$ its derivatives yield the cumulants of the distribution (Jørgensen 1987).

For many GLMs, including Poisson models with log link and binary models with logit link, with full-rank model matrix the Hessian is negative definite and the log likelihood is a strictly concave function. Then ML estimates of model parameters exist and are unique under quite general conditions (Wedderburn 1976).

Section 4.5: Inference for Generalized Linear Models

4.4. The matrix $W$ used in $\mathrm{cov}(\hat{\beta})$ [see (4.28)], in the hat matrix for standardized Pearson residuals [see (4.38)], and in Fisher scoring [see (4.40)] is the inverse of the covariance matrix of the linearized form of $g(y)$ (see Section 4.6.3).

McCullagh and Nelder (1989, Chap. 12) discussed model checking for GLMs. For discussions about residuals, see also Green (1984), Pierce and Schafer (1986), Pregibon (1980, 1981), and Williams (1987). Pregibon (1982) showed that the squared standardized Pearson residual is the score statistic for testing whether the observation is an outlier. Davison and Hinkley (1997, Sec. 7.2) discussed bootstrapping in GLMs.

Section 4.6: Fitting Generalized Linear Models

4.5. Fisher (1935b) introduced the Fisher scoring method to calculate ML estimates for probit models. For further discussion of GLM model fitting and the relationship between iterative reweighted least squares and ML estimation, see Green (1984), Jørgensen (1983), McCullagh and Nelder (1989), and Nelder and Wedderburn (1972). Green (1984), Jørgensen (1983), and Palmgren and Ekholm (1987) also discussed this relation for exponential family nonlinear models.

Section 4.7: Quasi-likelihood and Generalized Linear Models

4.6. For more on quasi-likelihood, see Sections 11.4, 12.6.4, and 13.3, Breslow (1984), Cox (1983), Firth (1987), Hinde and Demétrio (1998), McCullagh (1983), McCullagh and Nelder (1989), Nelder and Pregibon (1987), and Wedderburn (1974, 1976). See Heyde (1997) for a theoretical perspective.

Section 4.8: Generalized Additive Models

4.7. Besides GAMs, other nonparametric smoothing methods can describe the dependence of a binary response on a predictor. For instance, see Copas (1983), Lloyd (1999, Chap. 5), and Section 15.3.3 for kernel smoothing and Kauermann and Tutz (2001) for models with random effects.
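As a rough illustration of kernel smoothing for a binary response, the following Python sketch computes a kernel-weighted running proportion (a Nadaraya-Watson estimate). It is a minimal sketch under stated assumptions: the Gaussian kernel, the `bandwidth` value, the function name `kernel_smooth`, and the data are all hypothetical and chosen here for illustration, not taken from the references above. Because each estimate is a weighted average of 0s and 1s, it always lies between 0 and 1.

```python
import math

def kernel_smooth(x0, xs, ys, bandwidth=1.5):
    """Kernel-weighted proportion of y = 1 near x0 (Gaussian kernel)."""
    weights = [math.exp(-0.5 * ((x - x0) / bandwidth) ** 2) for x in xs]
    return sum(w * y for w, y in zip(weights, ys)) / sum(weights)

# Hypothetical widths and satellite indicators, for illustration only.
widths = [21, 22, 23, 24, 25, 26, 27, 28, 29, 30]
has_satellite = [0, 0, 1, 0, 0, 1, 1, 1, 1, 1]

smoothed = [kernel_smooth(x, widths, has_satellite) for x in widths]
for x, p in zip(widths, smoothed):
    print(x, round(p, 3))
```

A smaller bandwidth follows the data more closely and a larger one smooths more heavily, which mirrors the trade-off described in Section 4.8.1.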

PROBLEMS

Applications

4.1 In the 2000 U.S. presidential election, Palm Beach County in Florida was the focus of unusual voting patterns (including a large number of illegal double votes) apparently caused by a confusing "butterfly ballot." Many voters claimed that they voted mistakenly for the Reform Party candidate, Pat Buchanan, when they intended to vote for Al Gore.

Figure 4.8 shows the total number of votes for Buchanan plotted against the number of votes for the Reform Party candidate in 1996 (Ross Perot), by county in Florida. (For details, see A. Agresti and B. Presnell, J. Law Public Policy, Volume 13, Fall 2001, 117-134.)

FIGURE 4.8 Total vote, by county in Florida, for Reform Party candidates Buchanan in 2000 and Perot in 1996.

a. In county $i$, let $\pi_i$ denote the proportion of the vote for Buchanan and let $x_i$ denote the proportion of the vote for Perot in 1996. For the linear probability model fitted to all counties except Palm Beach County, $\hat{\pi}_i = -0.0003 + 0.0304x_i$. Give the value of $P$ in the interpretation: The estimated proportion vote for Buchanan in 2000 was roughly $P$% of that for Perot in 1996.

b. For Palm Beach County, $\pi_i = 0.0079$ and $x_i = 0.0774$. Does this result appear to be an outlier? Explain.

c. For logistic regression, $\log[\hat{\pi}_i/(1 - \hat{\pi}_i)] = -7.164 + 12.219x_i$. Find $\hat{\pi}_i$ in Palm Beach County. Is that county an outlier for this model?

4.2 For games in baseball’s National League during nine decades, Table 4.6 shows the percentage of times that the starting pitcher pitched a complete game.

TABLE 4.6 Data for Problem 4.2

Decade      Percent Complete    Decade      Percent Complete    Decade      Percent Complete
1900-1909   72.7                1930-1939   44.3                1960-1969   27.2
1910-1919   63.4                1940-1949   41.6                1970-1979   22.5
1920-1929   50.0                1950-1959   32.8                1980-1989   13.3

Source: Data from George Will, Newsweek, Apr. 10, 1989.


a. Treating the number of games as the same in each decade, the ML fit of the linear probability model is $\hat{\pi} = 0.7578 - 0.0694x$, where $x$ = decade ($x = 1, 2, \ldots, 9$). Interpret 0.7578 and $-0.0694$.

b. Substituting $x = 10, 11, 12$, predict the percentages of complete games for the next three decades. Are these predictions plausible? Why?

c. The ML fit with logistic regression is $\hat{\pi} = \exp(1.148 - 0.315x)/[1 + \exp(1.148 - 0.315x)]$. Obtain $\hat{\pi}_i$ for $x = 10, 11, 12$. Are these more plausible?
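The two fitted equations in Problem 4.2 can be evaluated side by side with a short Python sketch. The function names `linear_fit` and `logistic_fit` are introduced here for illustration; the coefficients are the ones quoted in parts (a) and (c). The sketch only does the arithmetic and leaves the interpretation to the reader.

```python
import math

def linear_fit(x):
    """Linear probability model from part (a)."""
    return 0.7578 - 0.0694 * x

def logistic_fit(x):
    """Logistic regression fit from part (c)."""
    eta = 1.148 - 0.315 * x          # linear predictor on the logit scale
    return math.exp(eta) / (1 + math.exp(eta))

for x in (10, 11, 12):
    print(x, round(linear_fit(x), 4), round(logistic_fit(x), 4))
```

Comparing each column with the permissible range for a probability is a natural starting point for the plausibility questions in parts (b) and (c).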

4.3 For Table 3.7 with scores (0, 0.5, 1.5, 4.0, 7.0) for alcohol consumption, ML fitting of the linear probability model for malformation has output:

Parameter   Estimate   Std Error   Wald 95% Conf Limits
Intercept   0.0025     0.0003       0.0019   0.0032
Alcohol     0.0011     0.0007      -0.0003   0.0025

Interpret the model fit. Use it to estimate the relative risk of malformation for alcohol consumption levels 0 and 7.0.

4.4 For Table 4.2, refit the linear probability model or the logistic regression model using the scores (a) (0, 2, 4, 6), (b) (0, 1, 2, 3), and (c) (1, 2, 3, 4). Compare $\hat{\beta}$ for the three choices. Compare fitted values. Summarize the effect of linear transformations of scores, which preserve relative sizes of spacings between scores.

4.5 For Table 4.3, let $Y = 1$ if a crab has at least one satellite, and $Y = 0$ otherwise. Using $x$ = weight, fit the linear probability model.

a. Use ordinary least squares. Interpret the parameter estimates. Find the estimated probability at the highest observed weight (5.20 kg). Comment.

b. Try to fit the model using ML, treating $Y$ as binomial. [The failure is due to a fitted probability falling outside the (0, 1) range. The fit in part (a) is ML for a normal random component, for which fitted values outside this range are permissible.]

c. Fit the logistic regression model. Show that the fitted probability at a weight of 5.20 kg equals 0.9968.

d. Fit the probit model. Find the fitted probability at 5.20 kg.

4.6 An experiment analyzes imperfection rates for two processes used to fabricate silicon wafers for computer chips. For treatment A applied to 10 wafers, the numbers of imperfections are 8, 7, 6, 6, 3, 4, 7, 2, 3, 4.

Treatment B applied to 10 other wafers has 9, 9, 8, 14, 8, 13, 11, 5, 7, 6 imperfections. Treat the counts as independent Poisson variates having means $\mu_A$ and $\mu_B$.

a. Fit the model $\log \mu = \alpha + \beta x$, where $x = 1$ for treatment B and $x = 0$ for treatment A. Show that $\exp(\beta) = \mu_B/\mu_A$, and interpret its estimate.

b. Test $H_0$: $\mu_A = \mu_B$ with the Wald or likelihood ratio test of $H_0$: $\beta = 0$. Interpret.

c. Construct a 95% confidence interval for $\mu_B/\mu_A$. (Hint: First construct one for $\beta$.)

d. Test $H_0$: $\mu_A = \mu_B$ based on this result: If $Y_1$ and $Y_2$ are independent Poisson with means $\mu_1$ and $\mu_2$, then $(Y_1 \mid Y_1 + Y_2)$ is binomial with $n = Y_1 + Y_2$ and $\pi = \mu_1/(\mu_1 + \mu_2)$.
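For a model with a single binary predictor, the ML fit can be written in closed form, which makes the quantities in parts (a) and (c) easy to check numerically. The Python sketch below is one possible way to do so, under assumptions stated here rather than in the problem: it uses the fact (see Problem 4.23) that the fitted group means equal the sample means, the standard large-sample result that $\hat{\beta}$ has estimated variance $1/\sum y_A + 1/\sum y_B$ for this model, and the normal quantile 1.96. The variable names are illustrative.

```python
import math

# Imperfection counts from Problem 4.6.
a = [8, 7, 6, 6, 3, 4, 7, 2, 3, 4]
b = [9, 9, 8, 14, 8, 13, 11, 5, 7, 6]

mean_a = sum(a) / len(a)
mean_b = sum(b) / len(b)

# Closed-form ML estimates for log(mu) = alpha + beta * x with x = 1 for B.
alpha_hat = math.log(mean_a)
beta_hat = math.log(mean_b / mean_a)

# Assumed large-sample standard error of beta_hat for this two-group model.
se_beta = math.sqrt(1 / sum(a) + 1 / sum(b))

z = 1.96                                    # approximate 97.5% normal quantile
ci_beta = (beta_hat - z * se_beta, beta_hat + z * se_beta)
ci_ratio = tuple(math.exp(v) for v in ci_beta)

print("exp(beta_hat):", round(math.exp(beta_hat), 3))
print("SE(beta_hat):", round(se_beta, 4))
print("95% CI for mu_B/mu_A:", tuple(round(v, 3) for v in ci_ratio))
```

Fitting the same model with GLM software should reproduce these values up to rounding, which gives a useful cross-check on either approach.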

4.7 For Table 4.3, Table 4.7 shows SAS output for a Poisson loglinear model fit using $X$ = weight and $Y$ = number of satellites.

a. Estimate $E(Y)$ for female crabs of average weight, 2.44 kg.

b. Use $\hat{\beta}$ to describe the weight effect. Show how to construct the reported confidence interval.

c. Construct a Wald test that $Y$ is independent of $X$. Interpret.

d. Can you conduct a likelihood-ratio test of this hypothesis? If not, what else do you need?

e. Is there evidence of overdispersion? If necessary, adjust standard errors and interpret.

TABLE 4.7 SAS Output for Problem 4.7

Criterion            DF    Value
Deviance             171   560.8664
Pearson Chi-Square   171   535.8957
Log Likelihood             71.9524

Parameter   Estimate   Std Error   Wald 95% Conf Limits   Chi-Sq   Pr > ChiSq
Intercept   -0.4284    0.1789      -0.7791   -0.0777      5.73     0.0167
weight       0.5893    0.0650       0.4619    0.7167      82.15    <.0001
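The entries of Table 4.7 are linked by simple arithmetic, and a short Python sketch can make those links explicit. It is a hedged illustration: it assumes the usual Wald construction (estimate plus or minus 1.96 standard errors, and the squared ratio of estimate to standard error) and the common overdispersion adjustment that scales standard errors by the square root of the Pearson statistic divided by its degrees of freedom. Variable names are illustrative, and small discrepancies from the table reflect rounding of the printed estimates.

```python
import math

# Values read from Table 4.7.
alpha_hat, beta_hat, se_beta = -0.4284, 0.5893, 0.0650
pearson, df = 535.8957, 171

z = 1.96
wald_ci = (beta_hat - z * se_beta, beta_hat + z * se_beta)
wald_chisq = (beta_hat / se_beta) ** 2

# Fitted mean at the average weight of 2.44 kg (log link).
mean_at_avg = math.exp(alpha_hat + beta_hat * 2.44)

# Assumed overdispersion adjustment: inflate SE by sqrt(X^2 / df).
dispersion = pearson / df
adjusted_se = se_beta * math.sqrt(dispersion)

print("Wald CI:", tuple(round(v, 4) for v in wald_ci))
print("Wald chi-square:", round(wald_chisq, 2))
print("Fitted mean at 2.44 kg:", round(mean_at_avg, 3))
print("Dispersion estimate:", round(dispersion, 3))
print("Adjusted SE:", round(adjusted_se, 4))
```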

4.8 Refer to Problem 4.7. Using the identity link with $x$ = weight, $\hat{\mu} = -2.60 + 2.264x$, where $\hat{\beta} = 2.264$ has SE = 0.228. Repeat parts (a) through (c).

4.9 Refer to Table 4.3.

a. Fit a Poisson loglinear model using both $W$ = weight and $C$ = color to predict $Y$ = number of satellites. Assigning dummy variables, treat $C$ as a nominal factor. Interpret parameter estimates.

b. Estimate $E(Y)$ for female crabs of average weight (2.44 kg) that are (i) medium light, and (ii) dark.

c. Test whether color is needed in the model. (Hint: From Section 4.5.4, the likelihood-ratio statistic comparing models is the difference in deviances.)

d. The estimated color effects are monotone across the four categories. Fit a simpler model that treats $C$ as quantitative and assumes a linear effect. Interpret its color effect and repeat the analyses of parts (b) and (c). Compare the fit to the model in part (a). Interpret.

e. Add width to the model. What effect does the strong positive correlation between width and weight have? Are both needed in the model?

4.10 In Section 4.3.2, refer to the Poisson model with identity link. The fit using least squares is $\hat{\mu} = -10.42 + 0.51x$ (SE = 0.11). Explain why the parameter estimates differ and why the SE values are so different.

4.11 For the negative binomial model fitted to the crab satellite counts with log link and width predictor, $\hat{\alpha} = -4.05$, $\hat{\beta} = 0.192$ (SE = 0.048), $\hat{k}^{-1} = 1.106$ (SE = 0.197). Interpret. Why is SE for $\hat{\beta}$ so different from SE = 0.020 for the corresponding Poisson GLM in Section 4.3.2? Which is more appropriate? Why?

4.12 Refer to Problem 4.6. The sample mean and variance are 5.0 and 4.2 for treatment A and 9.0 and 8.4 for treatment B.

a. Is there evidence of overdispersion for the Poisson model having a dummy variable for treatment? Explain.

b. Fit the negative binomial loglinear model. Note that the estimated dispersion parameter is 0 and that estimates of treatment means and standard errors are the same as with the Poisson loglinear GLM.

c. For the overall sample of 20 observations, the sample mean and variance are 7.0 and 10.2. Fit the loglinear model having only an intercept term under Poisson and negative binomial assumptions.

Compare results, and compare confidence intervals for the overall mean response. Why do they differ? (Note: This shows how the Poisson model can deteriorate when an important covariate is unmeasured.)

4.13 Table 4.8 shows the free-throw shooting, by game, of Shaq O'Neal of the Los Angeles Lakers during the 2000 NBA (basketball) playoffs. Commentators remarked that his shooting varied dramatically from game to game. In game $i$, suppose that $Y_i$ = number of free throws made out of $n_i$ attempts is a bin($n_i, \pi_i$) variate and the $\{Y_i\}$ are independent.

TABLE 4.8 Data for Problem 4.13

        Number   Number of           Number   Number of           Number   Number of
Game    Made     Attempts    Game    Made     Attempts    Game    Made     Attempts
1       4        5           9       4        12          17      8        12
2       5        11          10      1        4           18      1        6
3       5        14          11      13       27          19      18       39
4       5        12          12      5        17          20      3        13
5       2        7           13      6        12          21      10       17
6       7        10          14      9        9           22      1        6
7       6        14          15      7        12          23      3        12
8       9        15          16      3        10

Source: www.nba.com.

a. Fit the model, $\pi_i = \alpha$, and find and interpret $\hat{\alpha}$ and its standard error. Does the model appear to fit adequately? (Note: You could check this with a small-sample test of independence of the 23 × 2 table of game and the binary outcome.)

b. Adjust the standard error for overdispersion. Using the original SE and its correction, find and compare 95% confidence intervals for $\alpha$. Interpret.
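The quantities in Problem 4.13 follow directly from the game-by-game counts, and the Python sketch below shows one way to organize the arithmetic. It rests on assumptions stated here: the ML estimate under $\pi_i = \alpha$ is the pooled proportion (see Problem 4.22), the Pearson statistic for the 23 × 2 table is used as the fit measure with 22 degrees of freedom, and the overdispersion correction multiplies the binomial SE by the square root of the Pearson statistic divided by its df. Variable names are illustrative.

```python
import math

# Table 4.8, games 1 through 23.
made = [4, 5, 5, 5, 2, 7, 6, 9, 4, 1, 13, 5, 6, 9, 7, 3, 8, 1, 18, 3, 10, 1, 3]
attempts = [5, 11, 14, 12, 7, 10, 14, 15, 12, 4, 27, 17, 12, 9, 12, 10,
            12, 6, 39, 13, 17, 6, 12]

total_made, total_attempts = sum(made), sum(attempts)
alpha_hat = total_made / total_attempts
se = math.sqrt(alpha_hat * (1 - alpha_hat) / total_attempts)

# Pearson statistic for the 23 x 2 table of game by outcome.
pearson = sum((y - n * alpha_hat) ** 2 / (n * alpha_hat * (1 - alpha_hat))
              for y, n in zip(made, attempts))
df = len(made) - 1

# Assumed overdispersion correction: scale the SE by sqrt(X^2 / df).
adjusted_se = se * math.sqrt(pearson / df)

z = 1.96
for label, s in (("binomial", se), ("adjusted", adjusted_se)):
    print(label, round(alpha_hat - z * s, 3), round(alpha_hat + z * s, 3))
```

Several games have few attempts, so the chi-squared approximation for the Pearson statistic is rough here, which is why the problem suggests a small-sample test.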

4.14 Refer to Table 13.6. Fit a loglinear model with a dummy variable for race, (a) assuming a Poisson distribution, and (b) allowing overdispersion with a quasi-likelihood approach. Compare results.

4.15 Refer to Problem 4.6. The wafers are also classified by thickness of silicon coating ($z = 0$, low; $z = 1$, high). The first five imperfection counts reported for each treatment refer to $z = 0$ and the last five refer to $z = 1$. Analyze these data.

4.16 Refer to Table 13.9 on frequency of sexual intercourse. Analyze these data.

Theory and Methods

4.17 Describe the purpose of the link function of a GLM. What is the identity link? Explain why it is not often used with binomial or Poisson responses.

4.18 For known $k$, show that the negative binomial distribution (4.12) has exponential family form (4.1) with natural parameter $\log[\mu/(\mu + k)]$.


4.19 For binary data, define a GLM using the log link. Show that effects refer to the relative risk. Why do you think this link is not often used? (Hint: What happens if the linear predictor takes a positive value?)

4.20 For the logistic regression model (4.6) with $\beta > 0$, show that (a) as $x \to \infty$, $\pi(x)$ is monotone increasing, and (b) the curve for $\pi(x)$ is the cdf of a logistic distribution having mean $-\alpha/\beta$ and standard deviation $\pi/(\beta\sqrt{3})$.

4.21 Show representation (4.18) for the binomial distribution.

4.22 Let $Y_i$ be a bin($n_i, \pi_i$) variate for group $i$, $i = 1, \ldots, N$, with $\{Y_i\}$ independent. Consider the model that $\pi_1 = \cdots = \pi_N$. Denote that common value by $\pi$. For observations $\{y_i\}$, show that $\hat{\pi} = (\sum y_i)/(\sum n_i)$. When all $n_i = 1$, for testing this model's fit in the $N \times 2$ table, show that $X^2 = n$. Thus, goodness-of-fit statistics can be completely uninformative for ungrouped data. (See also Problem 5.37.)

4.23 Suppose that $Y_i$ is Poisson with $g(\mu_i) = \alpha + \beta x_i$, where $x_i = 1$ for $i = 1, \ldots, n_A$ from group A and $x_i = 0$ for $i = n_A + 1, \ldots, n_A + n_B$ from group B. Show that for any link function $g$, the likelihood equations (4.22) imply that fitted means $\hat{\mu}_A$ and $\hat{\mu}_B$ equal the sample means.

4.24 For binary data with sample proportion $y_i$ based on $n_i$ trials, we use quasi-likelihood to fit a model using variance function (4.46). Show that parameter estimates are the same as for the binomial GLM but that the covariance matrix multiplies by $\phi$.

4.25 A binomial GLM $\pi_i = \Phi(\sum_j \beta_j x_{ij})$ with arbitrary inverse link function $\Phi$ assumes that $n_i Y_i$ has a bin($n_i, \pi_i$) distribution. Find $w_i$ in (4.27) and hence $\mathrm{cov}(\hat{\beta})$. For logistic regression, show that $w_i = n_i \pi_i (1 - \pi_i)$.

4.26 A GLM has parameter $\beta$ with sufficient statistic $S$. A goodness-of-fit test statistic $T$ has observed value $t_o$. If $\beta$ were known, a P-value is $P = P(T \ge t_o; \beta)$. Explain why $P(T \ge t_o \mid S)$ is the uniform minimum variance unbiased estimator of $P$.

4.27 Let $y_{ij}$ be observation $j$ of a count variable for group $i$, $i = 1, \ldots, I$, $j = 1, \ldots, n_i$. Suppose that $\{Y_{ij}\}$ are independent Poisson with $E(Y_{ij}) = \mu_i$.

a. Show that the ML estimate of $\mu_i$ is $\hat{\mu}_i = \bar{y}_i = \sum_j y_{ij}/n_i$.

b. Simplify the expression for the deviance for this model. [For testing this model, it follows from Fisher (1970, p. 58, originally published in 1925) that the deviance and the Pearson statistic $\sum_i \sum_j (y_{ij} - \bar{y}_i)^2/\bar{y}_i$ have approximate chi-squared distributions with df $= \sum_i (n_i - 1)$. For a single group, Cochran (1954) referred to $\sum_j (y_{1j} - \bar{y}_1)^2/\bar{y}_1$ as the variance test for the fit of a Poisson distribution, since it compares the sample variance to the estimated Poisson variance $\bar{y}_1$.]

4.28 Conditional on $\lambda$, $Y$ has a Poisson distribution with mean $\lambda$. Values of $\lambda$ vary according to gamma density (13.12), which has $E(\lambda) = \mu$, $\mathrm{var}(\lambda) = \mu^2/k$. Show that marginally $Y$ has the negative binomial distribution (4.12). Explain why the negative binomial model is a way to handle overdispersion for the Poisson.

4.29 Consider the class of binary models (4.8) and (4.9). Suppose that the standard cdf $\Phi$ corresponds to a probability density function $\phi$ that is symmetric around 0.

a. Show that $x$ at which $\pi(x) = 0.5$ is $x = -\alpha/\beta$.

b. Show that the rate of change in $\pi(x)$ when $\pi(x) = 0.5$ is $\beta\phi(0)$. Show this is $0.25\beta$ for the logit link and $\beta/\sqrt{2\pi}$ (where $\pi = 3.14\ldots$) for the probit link.

c. Show that the probit regression curve has the shape of a normal cdf with mean $-\alpha/\beta$ and standard deviation $1/|\beta|$.

4.30 Show the normal distribution $N(\mu, \sigma^2)$ with fixed $\sigma$ satisfies family (4.1), and identify the components. Formulate the ordinary regression model as a GLM.

4.31 In Problem 4.30, when $\sigma$ is also a parameter, show that it satisfies the exponential dispersion family (4.14).

4.32 For binary observations, consider the model $\pi(x) = \frac{1}{2} + (1/\pi)\tan^{-1}(\alpha + \beta x)$. Which distribution has cdf of this form? Explain when a GLM using this curve might be more appropriate than logistic regression.
