ROBUST REGRESSION METHODS - Applied Bayesian Modelling

Table 3.14 Kyphosis outcome, standard Probit and general additive models

                        Mean     St. devn.  2.5%     Median   97.5%
Probit regression
  Deviance              64.9     2.7        61.6     64.3     71.8
  Intercept             -1.130   0.744      -2.615   -1.102   0.277
  Number                0.229    0.110      0.020    0.227    0.448
  Start                 -0.125   0.038      -0.202   -0.125   -0.053
  Age                   0.006    0.004      -0.001   0.006    0.013
Additive form on age
  Deviance              59.4     4.1        52.1     58.8     70.8
  Intercept             -1.337   1.206      -4.301   -1.295   1.270
  Number                0.249    0.127      -0.033   0.247    0.553
  Start                 -0.147   0.045      -0.255   -0.146   -0.046
  Age (Linear)          0.004    0.013      -0.024   0.004    0.033
  τ²                    0.0013   0.0011     0.0002   0.0009   0.0057

The deviance of the GAM model improves only slightly on the linear model, despite being more heavily parameterised; the penalised fit (as may be verified) therefore deteriorates. The plot of the smooth suggests a low order polynomial might be sufficient to model the nonlinearity, and this would be less heavily parameterised than a GAM.

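the posterior density is proportional to

$$\sigma^{-(n+1)} \prod_{i=1}^{n} \left[ 1 + \frac{(y_i - \beta x_i)^2}{\nu \sigma^2} \right]^{-(\nu+1)/2}$$

Similarly, if the outcome y is multivariate Student t of dimension q with dispersion $\Sigma$, and $p(\beta, \Sigma) \propto |\Sigma|^{-1}$, the posterior is proportional to

$$|\Sigma|^{-(n+1)} \prod_{i=1}^{n} \left[ \nu + (y_i - \beta x_i) \Sigma^{-1} (y_i - \beta x_i) \right]^{-(\nu+q)/2}$$

where $\Sigma$ is a q × q dispersion matrix. Here n denotes the number of observations and $\nu$ the degrees of freedom.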

The equivalent scale mixture specification in either case involves unknown weight parameters $v_i$ that scale the overall variance or dispersion parameter(s) of the Normal.

Thus, for a univariate outcome, the Student t model may be expressed as

$$y_i = \beta x_i + e_i, \qquad e_i \sim N(0, \sigma^2 / v_i), \qquad v_i \sim G(\nu/2, \nu/2) \qquad (3.23)$$

The multivariate version of this again takes $v_i \sim G(\nu/2, \nu/2)$, and takes the ith vector observation $y_i$ to be sampled from a multivariate Normal with dispersion matrix

$$\Sigma_i = \Sigma / v_i$$

Suspect observations (i.e. potential outliers) with small weights $v_i$ and large distances $y_i - \beta x_i$ from the regression model have their effects

$$(y_i - \beta x_i) \Sigma_i^{-1} (y_i - \beta x_i)$$

on the posterior density down-weighted, with the degree of down-weighting usually being enhanced for smaller values of $\nu$.
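A short numerical sketch may help to show how this down-weighting operates. Under the scale mixture (3.23), standard Normal-gamma conjugacy gives the conditional density of each weight as G((ν+1)/2, (ν + e_i²/σ²)/2), with mean (ν+1)/(ν + e_i²/σ²). The Python function below evaluates that conditional mean; the function name, the unit scale and the residual values are illustrative assumptions, not part of the original analysis.

```python
def expected_weight(residual, sigma, nu):
    """Conditional mean of the weight v_i given the residual e_i.

    Follows from v_i ~ G(nu/2, nu/2) and e_i ~ N(0, sigma^2 / v_i):
    v_i | e_i ~ G((nu + 1)/2, (nu + (e_i/sigma)^2)/2).
    """
    z2 = (residual / sigma) ** 2
    return (nu + 1.0) / (nu + z2)


# Illustrative residuals (in units of sigma) under three degrees-of-freedom values
for nu in (1.7, 8.0, 50.0):
    weights = [round(expected_weight(r, 1.0, nu), 3) for r in (0.0, 1.0, 3.0, 6.0)]
    print(nu, weights)
```

For a fixed large residual the expected weight falls as ν falls, which is the sense in which smaller ν enhances the down-weighting.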

Other densities with heavier (or possibly lighter) tails than the Normal are obtained with alternative densities for the $v_i$ or other types of mixing of densities. For instance, Smith (1981) considers a model for calculating marginal likelihoods, which involves choosing between the Normal, Double Exponential or Uniform densities, with prior probabilities of 1/3 on each density. The M-estimators of Huber (1981), which form the base for much work on robust estimation, are based on densities

$$P(y \mid \mu, \sigma, k) \propto \exp[-U(d)]$$

where $d = (y - \mu)/\sigma$, and

$$U(d) = 0.5 d^2 \ \text{if } |d| < k, \qquad U(d) = k|d| - 0.5 k^2 \ \text{if } |d| \ge k$$

As k tends to infinity, this form tends to the Normal, while for k near zero it approximates the double exponential.
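The piecewise form of U(d) is easy to check numerically. The sketch below codes U(d) and the corresponding unnormalised density exactly as defined above; the function names and the test values of k are illustrative choices rather than anything prescribed by Huber (1981).

```python
import math


def huber_u(d, k):
    """Huber's U(d): quadratic for |d| < k, linear in |d| beyond k."""
    if abs(d) < k:
        return 0.5 * d * d
    return k * abs(d) - 0.5 * k * k


def unnormalised_density(y, mu, sigma, k):
    """exp[-U(d)] with d = (y - mu) / sigma (normalising constant omitted)."""
    return math.exp(-huber_u((y - mu) / sigma, k))


# With a very large k the kernel coincides with the Normal kernel exp(-d^2 / 2)
print(unnormalised_density(2.0, 0.0, 1.0, k=1e6), math.exp(-2.0))
# With a small k the penalty grows roughly linearly in |d|
print(huber_u(3.0, k=0.1), huber_u(6.0, k=0.1))
```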

3.6.1 Binary selection models for robustness

A variant of the scale mixture approach for metric dependent variables takes v as binary, with probability $\lambda = \Pr(v = 1)$ of selecting a Normal density with mean $\mu$ and standard deviation $\sigma$. On the other hand, if v = 0, then an overdispersed alternative is selected with the same mean but standard deviation $k\sigma$, where k considerably exceeds 1.

For example, taking $\lambda$ to be small, e.g. $\lambda \sim U(0, 0.1)$, and $k \sim U(2, 3)$ allows protection against a low level of contamination (of up to 10% of the observations, which reads $\lambda$ here as the prior probability of the overdispersed component) and variance inflation in that contaminated component of between four and nine times the overall level, since $k^2$ then lies between 4 and 9. Setting $\lambda$ to a very low level, e.g. $\lambda = 0.01$, and k positive with unrestricted ceiling, allows for a small number of extreme outliers.
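To see how such a two-component model flags an observation, Bayes' rule can be applied to a single point with the parameters held fixed. The sketch below is only illustrative: the function name, the argument `eps` (taken here to be the prior probability of the inflated-variance component) and the numerical settings are assumptions, and a full analysis would average over the posterior of all parameters.

```python
from statistics import NormalDist


def prob_contaminated(y, mu, sigma, k, eps):
    """Posterior probability that y was drawn from the N(mu, (k*sigma)^2) component.

    eps is the prior probability of that overdispersed component;
    the regular component is N(mu, sigma^2) with prior probability 1 - eps.
    """
    inflated = eps * NormalDist(mu, k * sigma).pdf(y)
    regular = (1.0 - eps) * NormalDist(mu, sigma).pdf(y)
    return inflated / (inflated + regular)


# A point at the mean looks regular; a point five standard deviations out does not
print(prob_contaminated(0.0, 0.0, 1.0, k=3.0, eps=0.05))
print(prob_contaminated(5.0, 0.0, 1.0, k=3.0, eps=0.05))
```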

Outlier resistant models may be seen as allowing for measurement error or (for categorical outcomes) misclassification. Thus in regression for binary outcomes, an outlier in the response y can be seen as possibly due to a transposition 0 → 1 or 1 → 0. These two cases occur when (a) the observed outcome is $y_i = 1$ despite the model probability for y = 1 being close to zero, i.e.

$$\Pr(y_i = 1 \mid x_i) = p_i \approx 0$$

and (b) when $y_i = 0$ despite $p_i$ being close to 1. If residuals are defined as $\eta_i = y_i - p_i$, then an outlier is indicated if the absolute residual is close to 1. The sensitivity of a binary regression to outliers in part depends on the assumed link (e.g. the logit and complementary log-log links have heavier tails than the probit).

Regardless of link, a model for resistant binary regression may also be specified, including a mechanism for transposition between 0 and 1. Let $y_i$ be the recorded response, and also define $\tilde{p}_i = \Pr(\tilde{y}_i = 1 \mid x_i)$, where $\tilde{y}_i$ is the true response. Following Copas (1988), assume a transposition¹¹ occurs with a small probability $\gamma$, such that the probability of the actually recorded response being 1 is given by

$$\begin{aligned}
p_i &= \Pr(y_i = 1 \mid x_i) \\
    &= \Pr(y_i = 1 \mid \tilde{y}_i = 0)\Pr(\tilde{y}_i = 0 \mid x_i) + \Pr(y_i = 1 \mid \tilde{y}_i = 1)\Pr(\tilde{y}_i = 1 \mid x_i) \\
    &= (1 - \gamma)\tilde{p}_i + \gamma(1 - \tilde{p}_i) \qquad (3.24)
\end{aligned}$$

where (for example)

$$\text{logit}(\tilde{p}_i) = \beta x_i$$

The likelihood ratio that y is an outlier (or more particularly, misrecorded as a result of transcription) as against it belonging to the data (or being a genuine observation) is then given by

$$R_i = y_i (1 - \tilde{p}_i)/\tilde{p}_i + (1 - y_i)\tilde{p}_i/(1 - \tilde{p}_i) \qquad (3.25)$$
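Equations (3.24) and (3.25) can be evaluated directly once the true-response probability is known. The Python sketch below does this for a few made-up probabilities; the function names and input values are hypothetical and serve only to show the behaviour of the two formulas.

```python
def recorded_prob(p_true, gamma):
    """Equation (3.24): probability that the recorded response equals 1."""
    return (1.0 - gamma) * p_true + gamma * (1.0 - p_true)


def misrecording_ratio(y, p_true):
    """Equation (3.25): likelihood ratio of misrecording against a genuine response."""
    return y * (1.0 - p_true) / p_true + (1 - y) * p_true / (1.0 - p_true)


# y = 1 recorded where the model gives the true response little chance of being 1
print(recorded_prob(0.01, gamma=0.003))
print(misrecording_ratio(1, 0.01))
# y = 0 recorded where the model also expects 0: the ratio is well below 1
print(misrecording_ratio(0, 0.01))
```

The ratio is large exactly when the recorded response disagrees strongly with the fitted true-response probability, matching cases (a) and (b) described earlier.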

3.6.2 Diagnostics for discordant observations

The search for robust or resistant fits in general linear models extends to consider outlying points in the design space (of the X variables) as well as outlying responses (y).

Logistic models, binary or multiple, may be especially sensitive to such outliers.

Adjusting for such outliers may help to avoid unjustified rejection of models in their entirety, or of particular explanatory variates within models, because of distortions due to a few unusual data points.

11 In BUGS this involves the coding, for a single predictor x[i]:

    y[i] ~ dbern(pstar[i])                          # recorded response
    pstar[i] <- (1-gamma)*p[i] + gamma*(1-p[i])     # transposition, Equation (3.24)
    logit(p[i]) <- beta[1] + x[i]*beta[2]           # model for the true response

Methods for outlier detection or for down-weighting influential observations may be based on appropriately defined residual terms (Copas, 1988). For example, for a logit regression for binary outcomes with probability $p_i$, the components of deviance are

$$d_i = -2\log(1 - \hat{p}_i) \ \text{when } y_i = 0, \qquad d_i = -2\log(\hat{p}_i) \ \text{when } y_i = 1$$

and taking $\hat{e}_i = y_i - E(y_i \mid x_i) = y_i - \hat{p}_i$, in line with a general definition of residuals, gives

$$d_i = -2\log(1 - |\hat{e}_i|) \qquad (3.26)$$

One may modify the usual likelihoods or deviances to downweight influential observations, via influence functions g(u) which penalise large values of u (Pregibon, 1982).

Whereas maximum likelihood estimation is equivalently minimisation of $\sum_i d_i$, where $d_i$ is the deviance component of the ith case, robust estimation instead minimises $\sum_i g(d_i)$, where $g(d_i) < d_i$ for large $d_i$, so lessening the influence of cases with large deviances. The $g(d_i)$ may be based on the estimated residuals, as in Equation (3.26).
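The deviance components and their residual form (3.26) can be compared numerically, and a simple influence function applied to them. In the sketch below the capped function is a hypothetical choice of g, used only because it satisfies g(d) < d for large d; it is not the form proposed by Pregibon (1982), and the fitted probabilities are made up.

```python
import math


def deviance_component(y, p_hat):
    """d_i for a binary outcome: -2 log of the fitted probability of the observed y."""
    return -2.0 * math.log(p_hat if y == 1 else 1.0 - p_hat)


def deviance_from_residual(y, p_hat):
    """The same quantity via Equation (3.26), using the residual y - p_hat."""
    return -2.0 * math.log(1.0 - abs(y - p_hat))


def capped_influence(d, cap=4.0):
    """Hypothetical g(d): equal to d up to `cap`, flat beyond it."""
    return min(d, cap)


# (observed y, fitted probability): the last two cases are poorly fitted
cases = [(1, 0.9), (0, 0.2), (1, 0.02), (0, 0.97)]
for y, p_hat in cases:
    d = deviance_component(y, p_hat)
    print(y, p_hat, round(d, 3), round(capped_influence(d), 3))
```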

Another approach to obtaining residuals from a binary regression involves the latent variable method of Albert and Chib (1993), which is equivalent either to probit or logit regression: for example, if cases have different and known weights under the probit option, then a homogeneous variance of 1 may be replaced by variances (inverse of the weights) averaging 1.

One may also consider the effect of each observation on the fitted model; for example, regression coefficients may be sensitive to particular points with unusual configurations of design variables, $x_{i1}, x_{i2}, \ldots, x_{ip}$. Thus estimates of a coefficient $\beta$ when all cases are included may be compared with the same coefficient estimate $\beta_{[i]}$ when case i is excluded (Weiss, 1994; Geisser, 1990). The differences

$$\Delta\beta_i = \beta_{[i]} - \beta$$

may then be plotted in order of the observations. A cross-validatory approach to model assessment omitting a single case at a time therefore has the advantage not just of providing a pseudo marginal likelihood and pseudo Bayes factor, but of providing a measure of the sensitivity of the regression coefficients to exclusion of certain observations. One may obtain posterior summaries of the $\Delta\beta_i$, ascertain which are most clearly negative or positive, and so produce the most distortion of the all cases estimate $\beta$.
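The case-deletion differences can be illustrated with a deliberately simple stand-in for the posterior estimate. The sketch below uses an ordinary least squares slope in place of a posterior mean, purely to keep the example self-contained; the data are invented, with the last case made discordant, and the function names are hypothetical.

```python
def ols_slope(xs, ys):
    """Least squares slope of y on x (a stand-in for a posterior mean of beta)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


def case_deletion_differences(xs, ys):
    """Delta beta_i = beta_[i] - beta, refitting with each case i left out in turn."""
    full = ols_slope(xs, ys)
    deltas = []
    for i in range(len(xs)):
        xs_i = xs[:i] + xs[i + 1:]
        ys_i = ys[:i] + ys[i + 1:]
        deltas.append(ols_slope(xs_i, ys_i) - full)
    return deltas


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.1, 1.9, 3.2, 3.9, 5.1, 12.0]  # invented data; last case is discordant
deltas = case_deletion_differences(xs, ys)
most_influential = max(range(len(deltas)), key=lambda i: abs(deltas[i]))
print([round(d, 3) for d in deltas], most_influential)
```

In a Bayesian fit the same loop would be run over posterior summaries from each reduced data set, which is what a full single-case cross-validation provides.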

Example 3.12 Group means from contaminated sampling Chaloner (1994) and Chaloner and Brant (1988) discuss the identification of outliers at two levels within a one way analysis of variance problem. Specifically, Chaloner (1994) considers an outlier as an observation with residual $e_i$ exceeding a threshold k appropriately defined. For Normal data with n observations, setting

$$k = \Phi^{-1}\{0.5 + 0.5(0.95^{1/n})\}$$

ensures that the prior probability of no outliers is 0.95. Chaloner then analyses data from Sharples (1990) using both inflated variance and mean shift contaminated mixture models for outliers (see Table 3.15 for the data concerned). For a one-way data set with I = 5 groups (and $n_i = 6$ observations within the ith group), within group errors were sampled with probability 0.1 from a gamma distribution with mean 5.5. In addition the group means used to generate the data were 25 for groups 1 to 4, but 50 for group 5.
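The threshold formula follows from requiring that all n standardised residuals lie within ±k with joint prior probability 0.95, i.e. (2Φ(k) - 1)^n = 0.95. A small Python check, using the standard library Normal distribution, is sketched below; taking n = 30 (the total number of observations in this example) is an illustrative assumption about which n is intended.

```python
from statistics import NormalDist


def outlier_threshold(n, prob_no_outliers=0.95):
    """k such that n independent N(0, 1) residuals all satisfy |e| <= k
    with the stated prior probability."""
    return NormalDist().inv_cdf(0.5 + 0.5 * prob_no_outliers ** (1.0 / n))


k = outlier_threshold(30)
# Recover the prior probability of no outliers implied by k
implied = (2.0 * NormalDist().cdf(k) - 1.0) ** 30
print(round(k, 3), round(implied, 4))
```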


Table 3.15 Data generated with contamination process (Sharples, 1990) (outliers starred; the averages shown exclude the starred values)

                     Observations in group
Group    1        2        3        4        5        6        Average
1        24.80    26.90    26.65    30.93    33.77    63.31*   28.61
2        23.96    28.92    28.19    26.16    21.34    29.46    26.34
3        18.30    23.67    14.47    24.45    24.89    28.95    22.46
4        51.42*   27.97    24.76    26.67    17.58    24.29    24.25
5        34.12    46.87    58.59*   38.11    47.59    44.67    42.27

Three of the 30 observations are identified as a priori outliers (known to be sampled from the contamination gamma density). The question is then to identify the probability of outliers among the data $y_{ij}$ ($i = 1, \ldots, n_j$; $j = 1, \ldots, J$) and among the group means.

Specifically, suppose

$$y_{ij} \sim N(\theta_j, \tau_w^2), \qquad \theta_j \sim N(\mu, \tau_b^2)$$

Then the first and second stage residuals are defined as

$$e_{ij} = (y_{ij} - \theta_j)/\tau_w$$

and

$$e_j = (\theta_j - \mu)/\tau_b$$

and compared to k above. If $\Pr(e_{ij} > k)$ or $\Pr(e_j > k)$ exceeds the prior probability of 0.05, then an outlier is indicated. Prior specification of $\tau_w^2$ and $\tau_b^2$ is important, as certain priors may allow excessive shrinkage ($\tau_b^2$ too small). Note that the observed between group variance is around 60.
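The quoted between group variance of around 60 can be reproduced from Table 3.15 under one reading of how it was obtained, namely as the sample variance of the five group averages after dropping the starred observations. That reading is an assumption; the sketch below simply carries out the arithmetic with the tabulated data.

```python
from statistics import mean, variance

# Observations from Table 3.15, keyed by group
data = {
    1: [24.80, 26.90, 26.65, 30.93, 33.77, 63.31],
    2: [23.96, 28.92, 28.19, 26.16, 21.34, 29.46],
    3: [18.30, 23.67, 14.47, 24.45, 24.89, 28.95],
    4: [51.42, 27.97, 24.76, 26.67, 17.58, 24.29],
    5: [34.12, 46.87, 58.59, 38.11, 47.59, 44.67],
}
# (group, position) of the three starred a priori outliers
starred = {(1, 6), (4, 1), (5, 3)}

clean_means = {
    j: mean([y for i, y in enumerate(obs, start=1) if (j, i) not in starred])
    for j, obs in data.items()
}
between = variance(list(clean_means.values()))  # sample variance, divisor J - 1
print({j: round(m, 2) for j, m in clean_means.items()}, round(between, 1))
```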

Consider first a prior (prior A) for the overall variance $\tau^2 = \tau_w^2 + I\tau_b^2$, with the proportion assigned to within group variation then decided by a ratio p with a flat beta prior.

Prior B is used by Chaloner, namely

$$\Pr(\tau_w^2, \tau_b^2) \propto \tau_w^{-2} (\tau_w^2 + I\tau_b^2)^{-1}$$

which in BUGS involves a double grid prior scaled to ensure total mass of one. The grid takes account of the actual value of the between group variance to the extent that small overall variances (below 5) are excluded. Finally, prior C is a double grid prior with equal probability over the pairs of values of $\tau_w^2$ and $\tau_b^2$ in the grid.

Whatever the prior, it is clear that the likelihood for $\tau_b^2$ is relatively flat, but prior C results in a higher mean estimate for between group variance than the other two. Despite this, probabilities of individual observation outliers are relatively similar regardless of the prior adopted, as are the estimates of the true group means $\theta_1, \theta_2, \ldots, \theta_5$. The probabilities that $y_{1,6}$ and $y_{4,1}$ are outliers are, respectively, {0.38, 0.028}, {0.42, 0.037} and {0.35, 0.031} under the three priors, whereas Chaloner, who uses a Normal approximation in conjunction with a Laplace approximation, finds values of 0.44 and 0.05. Priors A and B give a posterior probability of group 5 being an outlier of around 0.05 to 0.06; this compares to 0.019 cited by Chaloner.

Example 3.13 Stack loss To illustrate the effect of robust alternatives to the Normal for metric outcomes based on the Student t density, consider n = 21 points in the classic data set for stack loss, y, and predictors $x_1$ = air flow, $x_2$ = temperature and $x_3$ = acid.

A simple Normal errors model gives the estimated equation (posterior means)

$$y = -43.6 + 0.72 x_1 + 1.28 x_2 - 0.11 x_3$$

with posterior standard deviations of (10.5), (0.12), (0.32) and (0.13) for the intercept and the three coefficients respectively.

Lange et al. (1989) show a small improvement in likelihood in moving from a Normal (degrees of freedom $\nu = \infty$) to a Cauchy density ($\nu = 0.5$), with the maximum likelihood estimate of the degrees of freedom provided by $\nu = 1.1$. Here an exponential prior for $\nu$ is assumed with parameter $\eta$ (so that the prior mean of $\nu$ is $1/\eta$), which is itself assigned a uniform prior¹² between 0.01 and 1. It may be noted that a proper prior is needed on $\nu$ to avoid relapsing to the Normal (see Geweke, 1993, p. S27).

Adopting this approach (Model B in Program 3.13) gives a median value of around 1.7 for $\nu$, with the credible interval ranging from 0.5 to 60. The estimated equation is now

$$y = -39.4 + 0.81 x_1 + 0.71 x_2 - 0.09 x_3$$

with posterior standard deviations of (7.6), (0.12), (0.37) and (0.11), so that the coefficient on $x_2$ is considerably reduced.

We then take the scale mixture approach to the Student t but with $\nu$ known (as 1.7). This shows the lowest weights ($v_i$ as in (3.23)), namely 0.78 and 0.67, for observations 4 and 21. Adopting the MLE value of 1.1 instead gives the lowest weights (all under 0.25) on observations 3, 4 and 21 (cf. Lange et al., 1989, p. 883).

Finally Model D in Program 3.13¹³ also uses a scale mixture approach, but with $\nu$ also unknown. The outcome of this model, from a run of 10 000 iterations, is a median for $\nu$ of 1.3, and weights of 0.14 and 0.10 on observations 4 and 21. The coefficient on $x_2$ is further reduced to average 0.61 (compare Lange et al., 1989, Table 1).
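The two-stage prior on the degrees of freedom used in this example can be simulated to see what range of values it favours. The sketch below assumes the exponential prior is parameterised by its rate, so that the prior mean of the degrees of freedom given the rate is its reciprocal (between 1 and 100 when the rate lies between 0.01 and 1, in line with footnote 12). The function name, the seed and the number of draws are arbitrary choices.

```python
import random


def draw_degrees_of_freedom(rng, rate_low=0.01, rate_high=1.0):
    """One draw from the two-stage prior: rate ~ U(rate_low, rate_high),
    then degrees of freedom ~ Exponential(rate), whose mean is 1 / rate."""
    rate = rng.uniform(rate_low, rate_high)
    nu = rng.expovariate(rate)
    return rate, nu


rng = random.Random(1)
draws = [draw_degrees_of_freedom(rng) for _ in range(20000)]
implied_means = [1.0 / rate for rate, _ in draws]
nus = sorted(nu for _, nu in draws)
# Range of conditional prior means, and the simulated prior median of nu
print(round(min(implied_means), 2), round(max(implied_means), 2))
print(round(nus[len(nus) // 2], 2))
```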

Example 3.14 Leukaemia survival To illustrate the transposition model (3.24) for a binary outcome, we consider the leukaemia data of Feigl and Zelen (1965), but with the response for n = 33 subjects being defined according to whether they survived for a year or more (y = 1 for deaths at more than a year). The two covariates are $x_1$ = white blood cell count (WBC) and $x_2$ = positive or negative AG (presence or absence of certain morphologic characteristics in the white cells). A standard logistic regression shows both covariates negatively related to extended survival (Table 3.20), with the WBC coefficient being -0.04.

12 This is broadly equivalent to assuming the degrees of freedom is between 1 and 100.

13 Run in version 1.2.

Table 3.20 Leukaemia data, logit model and logit under transposition, parameter summary

                     Mean     St. devn.  2.5%     Median   97.5%
Standard Model
  Intercept          1.073    0.708      -0.231   1.049    2.538
  WBC                -0.041   0.022      -0.092   -0.038   -0.005
  Negative AG        -2.49    1.036      -4.688   -2.432   -0.601
Transposition Model
  Intercept          3.68     2.04       0.32     3.39     11.07
  WBC                -0.32    0.18       -0.89    -0.30    -0.02
  Negative AG        -3.38    1.89       -10.47   -3.12    -0.30
  γ                  0.0066   0.0049     0.0003   0.0055   0.0228

Likelihood Ratio (Misrecording vs. Genuine) (see Equation (3.25))

       Mean      Median            Mean      Median
D1     0.1268    0.0689     D18    53.7300   2.6940
D2     0.0969    0.0433     D19    48.4700   1.7380
D3     0.1906    0.1263     D20    0.6457    0.4211
D4     0.1342    0.0753     D21    1.8150    0.9181
D5     18.82     4.73       D22    0.1486    0.0917
D6     1.2400    0.7816     D23    0.4155    0.2796
D7     1.0180    0.6772     D24    0.1177    0.0680
D8     0.5933    0.1937     D25    0.0260    0.0047
D9     -39.61    5.65       D26    0.0120    0.0005
D10    0.3770    0.2835     D27    0.0111    0.0004
D11    0.8145    0.5736     D28    0.0091    0.0002
D12    0.0970    0.0022     D29    0.0130    0.0007
D13    0.0809    0.0010     D30    0.0206    0.0026
D14    0.0114    0.0001     D31    0.0017    0.0001
D15    0.0114    0.0001     D32    0.0011    0.0001
D16    0.0383    0.0001     D33    0.0011    0.0001
D17    8966      9999

In adopting the alternative model allowing for possible contamination, it is to be noted that Copas (1988, p. 245) found by maximum likelihood methods that a value of $\gamma = 0.003$ in (3.24) leads to a large increase in the absolute size of the WBC coefficient. This means the predicted longer term survival chances of patients with low WBC counts are increased. However, further increases in $\gamma$ had a relatively small impact. Therefore, a prior on $\gamma$ may be set drawing on this analysis, and specifically

$$\gamma \sim E(300)$$

A three chain run of 25 000 iterations shows convergence from around 5000 iterations, and the summary is based on iterations 5000 to 25 000. The posterior mean for the WBC coefficient now stands at -0.34, with a 95% interval confined to negative values. The posterior mean for $\gamma$ itself is just under 0.007. The likelihood ratios (3.25) are highest (hence chance of outlier status greatest) for observation 17, which has y = 1 (the patient survived 65 weeks) but a WBC measure of 100. The coefficient on the dummy index for AG status is also increased in absolute size, though like the WBC coefficient it is estimated less precisely (i.e. the posterior standard deviation is increased over the standard logit).

Example 3.15 Travel for shopping To illustrate alternative approaches to robustness with binary outcomes to the contamination/misclassification model, consider data from Guy et al. (1983) for a panel survey of shopping behaviour. This involved 84 family households in suburban Cardiff, approximately equidistant from the city centre, with the response $y_i$ being whether or not the household used a city centre store during a particular week. The predictors are income (Inc), household size (Hsz, for number of children) and whether the wife was working (WW). The first two covariates are respectively ordinal (with levels 1 for income under £1000 up to 8 for incomes over £15 000) and discrete, but both are taken as continuous.

Wrigley and Dunn (1986) consider issues of resistant and robust logit regression against a substantive background, whereby positive effects of income and working wife on central city shopping are expected, but a negative effect of household size.

They argue that it is preferable to work with all the explanatory variables considered relevant and that exclusion of predictors because of one or two outliers or influential observations should be avoided. Let

$$p_i = \Pr(y_i = 1)$$

Wrigley and Dunn cite estimates from a maximum likelihood fit as follows

$$\text{logit}(p_i) = -0.72 + 0.14\,\text{Inc} - 0.56\,\text{Hsz} + 0.83\,\text{WW} \qquad (3.27)$$

with standard errors of (0.91), (0.23), (0.19) and (0.54) for the intercept, Inc, Hsz and WW respectively.

So the significance of the working wife variable is only marginal (i.e. income is significant at 5% only if a one-tail test is used).
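To give a feel for the size of these effects, the fitted logit in Equation (3.27) can be converted to a probability for a particular household. The household profile used below (income level 4, two children) is hypothetical and chosen only for illustration; the coefficients are the maximum likelihood estimates quoted in (3.27).

```python
import math

# Maximum likelihood estimates quoted in Equation (3.27)
COEF = {"intercept": -0.72, "Inc": 0.14, "Hsz": -0.56, "WW": 0.83}


def shopping_probability(inc, hsz, ww):
    """Fitted probability of using a city centre store, from the logit in (3.27)."""
    eta = COEF["intercept"] + COEF["Inc"] * inc + COEF["Hsz"] * hsz + COEF["WW"] * ww
    return 1.0 / (1.0 + math.exp(-eta))


# Hypothetical household: income level 4, two children, wife working or not
p_working = shopping_probability(inc=4, hsz=2, ww=1)
p_not_working = shopping_probability(inc=4, hsz=2, ww=0)
print(round(p_working, 3), round(p_not_working, 3))
```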

Here we adopt mildly informative priors (cf. Chib, 1995) in a logit link model: N(0.75, 25) priors on $\beta_{Inc}$ and $\beta_{WW}$ are taken in line with an expected positive effect on central city shopping of income and female labour activity, while a N(-0.75, 25) prior on $\beta_{Hsz}$ reflects the expected negative impact of household size.

From a 10 000 iteration three chain run a stronger effect of income is obtained than in Equation (3.27). The analogous equation to that above is

$$\text{logit}(p_i) = -0.56 + 0.39\,\text{Inc} - 0.61\,\text{Hsz} + 1.07\,\text{WW}$$

with posterior standard deviations of (0.97), (0.25), (0.20) and (0.59) for the intercept, Inc, Hsz and WW respectively.

The 90% credible intervals on both the income and working wife variables are entirely confined to positive values, though this is not true for the 95% intervals. The highest deviance components are obtained for observations 5, 55, 58, 71 and 83. These points account for about 20% of the total deviance (minus twice the log likelihood, which averages about -50). The highest deviance is for case 55. The conditional predictive ordinates are lowest for cases 55 and 71.

Similarly, case 55 has the highest average residual under the latent utility logit model of Albert and Chib (1993) (see Model B in Program 3.15). The logit is matched by a latent utility following Student t errors with 8 degrees of freedom (logistic errors are 'heavy tailed' like the Student t density). Trying Student t sampling with smaller degrees of freedom (e.g. t(2) errors) makes little difference to any conclusions about credible intervals for the income and working wife coefficients: the 95% intervals still straddle zero.
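The latent utility idea can be sketched for the simpler probit case, where the latent variable is Normal with unit variance and is truncated to be positive when y = 1 and non-positive when y = 0. The code below draws such latent values by the inverse-CDF method and averages the resulting residuals. It is a minimal illustration with a made-up linear predictor, not the Student t version used in Program 3.15, and the function name is hypothetical.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()


def draw_latent(y, mu, rng):
    """Draw z ~ N(mu, 1) truncated to z > 0 if y == 1, or z <= 0 if y == 0."""
    p_neg = STD_NORMAL.cdf(-mu)  # Pr(z <= 0) under N(mu, 1)
    if y == 1:
        u = rng.uniform(p_neg, 1.0)
    else:
        u = rng.uniform(0.0, p_neg)
    # Keep u strictly inside (0, 1) so that inv_cdf is defined
    u = min(max(u, 1e-12), 1.0 - 1e-12)
    return mu + STD_NORMAL.inv_cdf(u)


rng = random.Random(42)
mu = -1.5  # hypothetical linear predictor for a case that nevertheless has y = 1
draws = [draw_latent(1, mu, rng) for _ in range(5000)]
mean_residual = sum(z - mu for z in draws) / len(draws)
print(round(mean_residual, 2))
```

A case whose observed response conflicts with its linear predictor, as here, yields a large average latent residual, which is how discordant cases show up under this approach.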


Note that either the Normal or the Student t latent utility approach of Albert and Chib allows different known weights for each point. This approach might therefore be used to downweight certain observations using an influence function based on deviance or leverage contributions (see the Exercises).

As an illustration of potential distortion from particular data points, Model C in Program 3.15 applies the full cross-validation methodology based on single case omission. The differences between $\beta_{Inc}$ and $\beta_{Inc[i]}$ for income (Del.beta.Inc[] in Model C) show that major changes in this coefficient are caused by exclusion of particular points.

The average value of $\beta_{Inc}$ from the standard logit link model is 0.39 with posterior standard deviation of 0.25. Exclusion of case 71 raises this coefficient by over half this standard deviation, to around 0.58, while exclusion of case 58 raises it to around 0.47. By contrast excluding case 29 lowers the coefficient to around 0.26. There might therefore be grounds for excluding case 71 at least, as it figures as an outlier and is influential on the regression. As discussed above, other options to assess robustness of inferences may be used, which retain the suspect case(s), but model them via contaminated priors or discrete mixture regressions.

From the document Applied Bayesian Modelling (pages 129-137)