Table 3.14 Kyphosis outcome, standard Probit and general additive models

                        Mean      St. devn.   2.5%      Median    97.5%
Probit regression
  Deviance              64.9      2.7         61.6      64.3      71.8
  Intercept             -1.130    0.744       -2.615    -1.102    0.277
  Number                0.229     0.110       0.020     0.227     0.448
  Start                 -0.125    0.038       -0.202    -0.125    -0.053
  Age                   0.006     0.004       -0.001    0.006     0.013
Additive form on age
  Deviance              59.4      4.1         52.1      58.8      70.8
  Intercept             -1.337    1.206       -4.301    -1.295    1.270
  Number                0.249     0.127       -0.033    0.247     0.553
  Start                 -0.147    0.045       -0.255    -0.146    -0.046
  Age (Linear)          0.004     0.013       -0.024    0.004     0.033
  τ²                    0.0013    0.0011      0.0002    0.0009    0.0057

The deviance of the GAM model improves only slightly on the linear model, despite being more heavily parameterised; the penalised fit (as may be verified) therefore deteriorates. The plot of the smooth suggests a low order polynomial might be sufficient to model the nonlinearity, and this would be less heavily parameterised than a GAM.
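the posterior density is proportional to
$$\sigma^{-(n+1)} \prod_{i=1}^{n} \left[ 1 + \frac{(y_i - \beta x_i)^2}{\nu \sigma^2} \right]^{-(\nu+1)/2}$$
Similarly, if the outcome $y$ is multivariate Student $t$ of dimension $q$ with dispersion $\Sigma$, and $p(\beta, \Sigma) \propto |\Sigma|^{-1}$, the posterior is proportional to
$$|\Sigma|^{-(n+1)} \prod_{i=1}^{n} \left[ 1 + \frac{1}{\nu} (y_i - \beta x_i)^{\prime} \Sigma^{-1} (y_i - \beta x_i) \right]^{-(\nu+q)/2}$$
where $\Sigma$ is a $q \times q$ dispersion matrix.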
The equivalent scale mixture specification in either case involves unknown weight parameters $v_i$ that scale the overall variance or dispersion parameter(s) of the Normal.
Thus, for a univariate outcome, the Student $t$ model may be expressed as
$$y_i = \beta x_i + e_i, \qquad e_i \sim N(0, \sigma^2 / v_i), \qquad v_i \sim G(\nu/2, \nu/2) \tag{3.23}$$
The multivariate version of this again takes $v_i$ as $G(\nu/2, \nu/2)$, and takes the $i$th vector observation $y_i$ to be sampled from a multivariate Normal with dispersion matrix
$$\Sigma_i = \Sigma / v_i$$
Suspect observations (i.e. potential outliers) with small weights $v_i$ and large distances $y_i - \beta x_i$ from the regression model have their effects
$$(y_i - \beta x_i)^{\prime} \Sigma_i^{-1} (y_i - \beta x_i)$$
on the posterior density down-weighted, with the degree of down-weighting usually being enhanced for smaller values of $\nu$.
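The down-weighting can be made concrete with a short Python sketch. It relies on a standard conjugacy result that the text does not state explicitly: under (3.23), the full conditional density of $v_i$ given the residual $e_i$ is Gamma with shape $(\nu+1)/2$ and rate $(\nu + e_i^2/\sigma^2)/2$, so its mean is $(\nu+1)/(\nu + e_i^2/\sigma^2)$. The function name `expected_weight` is illustrative only.

```python
def expected_weight(resid, sigma, nu):
    """Conditional posterior mean of the weight v_i under the scale mixture (3.23).

    Assumes v_i ~ Gamma(nu/2, rate nu/2) and e_i | v_i ~ N(0, sigma^2 / v_i),
    which gives v_i | e_i ~ Gamma((nu + 1)/2, rate (nu + (e_i/sigma)^2)/2).
    """
    z2 = (resid / sigma) ** 2
    return (nu + 1.0) / (nu + z2)


# Weights for residuals of 0, 1 and 3 standard deviations at several values of nu.
for nu in (2, 8, 30):
    print(nu, [round(expected_weight(e, 1.0, nu), 3) for e in (0.0, 1.0, 3.0)])
```

For a fixed large residual the expected weight falls as $\nu$ falls, which is the sense in which smaller degrees of freedom enhance the down-weighting.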
Other densities with heavier (or possibly lighter) tails than the Normal are obtained with alternative densities for the $v_i$ or other types of mixing of densities. For instance, Smith (1981) considers a model for calculating marginal likelihoods, which involves choosing between the Normal, Double Exponential or Uniform densities, with prior probabilities of 1/3 on each density. The M-estimators of Huber (1981), which form the base for much work on robust estimation, are based on densities
$$P(y \mid \mu, \sigma, k) \propto \exp[-U(d)]$$
where $d = (y - \mu)/\sigma$, and
$$U(d) = \begin{cases} 0.5 d^2 & \text{if } |d| < k \\ k|d| - 0.5 k^2 & \text{if } |d| \geq k \end{cases}$$
As $k$ tends to infinity, this form tends to the Normal, while for $k$ near zero it approximates the double exponential.
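As a minimal sketch of the Huber form above, the following Python functions evaluate $U(d)$ and the unnormalised density $\exp[-U(d)]$. The function names are illustrative, and no normalising constant is computed.

```python
import math


def huber_u(d, k):
    """Huber's U(d): quadratic for |d| < k, linear in |d| from k onwards."""
    if abs(d) < k:
        return 0.5 * d * d
    return k * abs(d) - 0.5 * k * k


def huber_kernel(y, mu, sigma, k):
    """Unnormalised density exp[-U(d)] with d = (y - mu) / sigma."""
    return math.exp(-huber_u((y - mu) / sigma, k))


# A large k reproduces the Normal kernel; a small k gives a much heavier tail.
print(huber_kernel(3.0, 0.0, 1.0, 100.0), huber_kernel(3.0, 0.0, 1.0, 0.5))
```

The two branches agree at $|d| = k$, so the kernel is continuous; only the rate of tail decay changes with $k$.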
3.6.1 Binary selection models for robustness
A variant of the scale mixture approach for metric dependent variables takes $v$ as binary with probability $\lambda = \Pr(v = 1)$ of selecting a Normal density with mean $\mu$ and
ROBUST REGRESSION METHODS 119
standard deviation $\sigma$. On the other hand, if $v = 0$, then an overdispersed alternative is selected with the same mean but standard deviation $k\sigma$, where $k$ considerably exceeds 1.
For example, taking $\lambda$ to be small, e.g. $\lambda \sim U(0, 0.1)$ and $k \sim U(2, 3)$, allows protection against a low level of contamination (of up to 10% of the observations) and variance inflation in that contaminated component of between four and nine times the overall level. Setting $\lambda$ to a very low level, e.g. $\lambda = 0.01$, and $k$ positive with unrestricted ceiling, allows for a small number of extreme outliers.
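The two-component selection model can be sketched in Python as a mixture density. The sketch is written in terms of a contamination probability `eps` (the chance of drawing from the inflated component), which is this sketch's own parameterisation rather than the text's notation; all function names are illustrative.

```python
import math


def normal_pdf(y, mu, sd):
    """Normal density with mean mu and standard deviation sd."""
    z = (y - mu) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def contaminated_pdf(y, mu, sigma, eps, k):
    """Mixture of N(mu, sigma^2) and the inflated N(mu, (k*sigma)^2).

    eps is the (assumed) probability of the inflated component.
    """
    return (1.0 - eps) * normal_pdf(y, mu, sigma) + eps * normal_pdf(y, mu, k * sigma)


def contaminated_variance(sigma, eps, k):
    """Marginal variance of the mixture (both components share the mean)."""
    return sigma ** 2 * ((1.0 - eps) + eps * k ** 2)


# Density five standard deviations out, with and without 10% contamination at k = 3.
print(normal_pdf(5.0, 0.0, 1.0), contaminated_pdf(5.0, 0.0, 1.0, 0.1, 3.0))
```

Even a small contamination share raises the density far out in the tails by several orders of magnitude, which is what protects the regression fit from isolated extreme points.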
Outlier resistant models may be seen as allowing for measurement error or (for categorical outcomes) misclassification. Thus in regression for binary outcomes, an outlier in the response $y$ can be seen as possibly due to a transposition $0 \to 1$ or $1 \to 0$. These two cases occur when (a) the observed outcome is $y_i = 1$ despite the model probability for $y = 1$ being close to zero, i.e.
$$\Pr(y_i = 1 \mid x_i) = \pi_i \approx 0$$
and (b) when $y_i = 0$ despite $\pi_i$ being close to 1. If residuals are defined as $\eta_i = y_i - \pi_i$, then an outlier is indicated if the absolute residual is close to 1. The sensitivity of a binary regression to outliers in part depends on the assumed link (e.g. the logit and complementary log-log links have heavier tails than the probit).
Regardless of link, a model for resistant binary regression may also be specified, including a mechanism for transposition between 0 and 1. Let $y_i$ be the recorded response, and also define $\tilde{\pi}_i = \Pr(\tilde{y}_i = 1 \mid x_i)$, where $\tilde{y}_i$ is the true response. Following Copas (1988), assume a transposition¹¹ occurs with a small probability $\gamma$, such that the probability of the actually recorded response being 1 is given by
$$\begin{aligned} \pi_i &= \Pr(y_i = 1 \mid x_i) \\ &= \Pr(y_i = 1 \mid \tilde{y}_i = 0)\Pr(\tilde{y}_i = 0 \mid x_i) + \Pr(y_i = 1 \mid \tilde{y}_i = 1)\Pr(\tilde{y}_i = 1 \mid x_i) \\ &= (1 - \gamma)\tilde{\pi}_i + \gamma(1 - \tilde{\pi}_i) \end{aligned} \tag{3.24}$$
where (for example)
$$\mathrm{logit}(\tilde{\pi}_i) = \beta x_i$$
The likelihood ratio that $y_i$ is an outlier (or more particularly, misrecorded as a result of transcription) as against it belonging to the data (or being a genuine observation) is then given by
$$R_i = y_i (1 - \tilde{\pi}_i)/\tilde{\pi}_i + (1 - y_i)\tilde{\pi}_i/(1 - \tilde{\pi}_i) \tag{3.25}$$
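Equations (3.24) and (3.25) translate directly into code. The Python sketch below is illustrative only (the function names are not from the text) and treats the true-response probability as already computed from the logit model.

```python
import math


def inv_logit(eta):
    """Inverse logit, mapping a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))


def recorded_prob(pi_true, gamma):
    """Equation (3.24): probability that the recorded response equals 1."""
    return (1.0 - gamma) * pi_true + gamma * (1.0 - pi_true)


def misrecording_ratio(y, pi_true):
    """Equation (3.25): likelihood ratio of misrecording against a genuine value."""
    return y * (1.0 - pi_true) / pi_true + (1 - y) * pi_true / (1.0 - pi_true)


# A recorded y = 1 when the true-response probability is only 0.1.
print(recorded_prob(0.1, 0.01), misrecording_ratio(1, 0.1))
```

The ratio is large exactly when the recorded value disagrees with a confident model probability, which matches cases (a) and (b) described earlier in this section.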
3.6.2 Diagnostics for discordant observations
The search for robust or resistant fits in general linear models extends to consider outlying points in the design space (of the $X$ variables) as well as outlying responses ($y$).
Logistic models, binary or multiple, may be especially sensitive to such outliers.
Adjusting for such outliers may help to avoid unjustified rejection of models in their entirety, or of particular explanatory variates within models, because of distortions due to a few unusual data points.
11 In BUGS this involves the coding, for a single predictor x[i]:
    y[i] ~ dbern(pstar[i])
    pstar[i] <- (1-gamma)*p[i] + gamma*(1-p[i])
    logit(p[i]) <- beta[1] + x[i]*beta[2]
Methods for outlier detection or for down-weighting influential observations may be based on appropriately defined residual terms (Copas, 1988). For example, for a logit regression for binary outcomes with probability $\pi_i$, the components of deviance are
$$d_i = \begin{cases} -2 \log(1 - \hat{\pi}_i) & \text{when } y_i = 0 \\ -2 \log(\hat{\pi}_i) & \text{when } y_i = 1 \end{cases}$$
and taking $\hat{e}_i = y_i - E(y_i \mid x_i) = y_i - \hat{\pi}_i$, in line with a general definition of residuals, gives
$$d_i = -2 \log(1 - |\hat{e}_i|) \tag{3.26}$$
One may modify the usual likelihoods or deviances to downweight influential observations, via influence functions $g(u)$ which penalise large values of $u$ (Pregibon, 1982).
Whereas maximum likelihood estimation is equivalently minimisation of $\sum_i d_i$, where $d_i$ is the deviance component of the $i$th case, robust estimation instead minimises $\sum_i g(d_i)$, where $g(d_i) < d_i$ for large $d_i$, so lessening the influence of cases with large deviances. The $g(d_i)$ may be based on the estimated residuals, as in Equation (3.26).
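The equivalence of the two deviance expressions, and the effect of an influence function, can be checked numerically. In the Python sketch below the tapering function `tapered` is one illustrative choice of $g$ satisfying $g(d) < d$ for large $d$; the text does not prescribe a particular form, and the cut-off `c` is a hypothetical tuning constant.

```python
import math


def deviance_component(y, p_hat):
    """Deviance component: -2 log(p_hat) if y = 1, -2 log(1 - p_hat) if y = 0."""
    return -2.0 * math.log(p_hat if y == 1 else 1.0 - p_hat)


def deviance_from_residual(y, p_hat):
    """Equation (3.26): the same quantity written via the residual y - p_hat."""
    return -2.0 * math.log(1.0 - abs(y - p_hat))


def tapered(d, c):
    """Illustrative g(d): identity up to c, square-root growth beyond c."""
    if d <= c:
        return d
    return 2.0 * math.sqrt(c * d) - c


# A poorly fitted case (y = 1 with fitted probability 0.02) before and after tapering.
d = deviance_component(1, 0.02)
print(d, tapered(d, 2.0))
```

Because the tapered function is continuous at `c` and grows only with the square root of the deviance beyond it, poorly fitted cases still contribute to the criterion but cannot dominate it.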
Another approach to obtaining residuals from a binary regression involves the latent variable method of Albert and Chib (1993), which is equivalent either to probit or logit regression: for example, if cases have different and known weights under the probit option, then a homogeneous variance of 1 may be replaced by variances (inverse of the weights) averaging 1.
One may also consider the effect of each observation on the fitted model; for example, regression coefficients may be sensitive to particular points with unusual configurations of design variables, $x_{i1}, x_{i2}, \ldots, x_{ip}$. Thus estimates of a coefficient $\beta$ when all cases are included may be compared with the same coefficient estimate $\beta_{[i]}$ when case $i$ is excluded (Weiss, 1994; Geisser, 1990). The differences
$$\Delta\beta_i = \beta_{[i]} - \beta$$
may then be plotted in order of the observations. A cross-validatory approach to model assessment omitting a single case at a time therefore has the advantage not just of providing a pseudo marginal likelihood and pseudo Bayes factor, but of providing a measure of the sensitivity of the regression coefficients to exclusion of certain observations. One may obtain posterior summaries of the $\Delta\beta_i$, and ascertain which are most clearly negative or positive, and so produce the most distortion of the all cases estimate $\beta$.
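The case-deletion differences can be illustrated with a deliberately simple stand-in. The Python sketch below uses a least squares slope in place of a posterior mean, so it is not the Bayesian cross-validation described above, only the same arithmetic of refitting with each case left out. The toy data are made up for illustration.

```python
def ols_slope(x, y):
    """Least squares slope of y on x (with intercept)."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
    sxx = sum((a - xbar) ** 2 for a in x)
    return sxy / sxx


def delta_beta(x, y):
    """Differences beta_[i] - beta: slope with case i omitted minus full-data slope."""
    full = ols_slope(x, y)
    return [ols_slope(x[:i] + x[i + 1:], y[:i] + y[i + 1:]) - full
            for i in range(len(x))]


# Made-up toy data: the last case has an unusual design value and response.
x_toy = [1.0, 2.0, 3.0, 4.0, 5.0, 10.0]
y_toy = [1.1, 1.9, 3.2, 3.9, 5.1, 2.0]
deltas = delta_beta(x_toy, y_toy)
most_influential = max(range(len(deltas)), key=lambda i: abs(deltas[i]))
print(most_influential, [round(d, 3) for d in deltas])
```

Plotting `deltas` against the case index gives the kind of display described above, with the isolated design point standing out from the rest.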
Example 3.12 Group means from contaminated sampling. Chaloner (1994) and Chaloner and Brant (1988) discuss the identification of outliers at two levels within a one way analysis of variance problem. Specifically, Chaloner (1994) considers an outlier as an observation with residual $e_i$ exceeding an appropriately defined threshold $k$. For Normal data with $n$ observations, setting
$$k = \Phi^{-1}\{0.5 + 0.5(0.95^{1/n})\}$$
ensures that the prior probability of no outliers is 0.95. Chaloner then analyses data from Sharples (1990) using both inflated variance and mean shift contaminated mixture models for outliers (see Table 3.15 for the data concerned). For a one-way data set with $I = 5$ groups (and $n_i = 6$ observations within the $i$th group), within group errors were sampled with probability 0.1 from a gamma distribution with mean 5.5. In addition, the group means used to generate the data were 25 for groups 1 to 4, but 50 for group 5.
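The threshold is easy to evaluate with the Python standard library. The sketch below reads the formula as a two-sided statement about $|e_i|$ for $n$ independent standard Normal residuals, an interpretation the text does not spell out, and checks that the implied prior probability of no outliers is recovered. The function names are illustrative.

```python
from statistics import NormalDist


def outlier_threshold(n, prior_no_outliers=0.95):
    """Threshold k = Phi^{-1}{0.5 + 0.5 * p^(1/n)} for n observations."""
    return NormalDist().inv_cdf(0.5 + 0.5 * prior_no_outliers ** (1.0 / n))


def prob_no_outliers(k, n):
    """Probability that all n independent N(0, 1) residuals satisfy |e| <= k."""
    return (2.0 * NormalDist().cdf(k) - 1.0) ** n


# The Sharples data have 5 groups of 6 observations, so n = 30.
k30 = outlier_threshold(30)
print(round(k30, 3), round(prob_no_outliers(k30, 30), 4))
```

The threshold grows slowly with $n$, since each additional observation adds another chance of a large residual arising under the Normal model alone.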
Table 3.15 Data generated with contamination process (Sharples, 1990) (outliers starred)

           Observations in group
Group      1         2         3         4         5         6         Average
1          24.80     26.90     26.65     30.93     33.77     63.31*    28.61
2          23.96     28.92     28.19     26.16     21.34     29.46     26.34
3          18.30     23.67     14.47     24.45     24.89     28.95     22.46
4          51.42*    27.97     24.76     26.67     17.58     24.29     24.25
5          34.12     46.87     58.59*    38.11     47.59     44.67     42.27

Three of the 30 observations are identified as a priori outliers (known to be sampled from the contamination gamma density). The question is then to identify the probability of outliers among the data $y_{ij}$ ($i = 1, \ldots, n_j$; $j = 1, \ldots, J$) and among the group means.
Specifically, suppose
$$y_{ij} \sim N(\theta_j, \tau_w^2), \qquad \theta_j \sim N(\mu, \tau_b^2)$$
Then the first and second stage residuals are defined as
$$e_{ij} = (y_{ij} - \theta_j)/\tau_w$$
and
$$e_j = (\theta_j - \mu)/\tau_b$$
and compared to $k$ above. If $\Pr(e_{ij} > k)$ or $\Pr(e_j > k)$ exceeds the prior probability of 0.05, then an outlier is indicated. Prior specification of $\tau_w^2$ and $\tau_b^2$ is important, as certain priors may allow excessive shrinkage ($\tau_b^2$ too small). Note that the observed between group variance is around 60.
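Given posterior draws of a group mean and the within group standard deviation, the first stage outlier probability is simply the share of draws in which the standardised residual exceeds $k$. The Python sketch below assumes such draws are available as plain lists; the four draws shown are made up purely to exercise the function and are not output from any of the models discussed here.

```python
def outlier_probability(y, theta_draws, tau_w_draws, k):
    """Share of posterior draws with (y - theta) / tau_w greater than k."""
    hits = sum(1 for th, tw in zip(theta_draws, tau_w_draws) if (y - th) / tw > k)
    return hits / len(theta_draws)


def flagged(prob, prior_prob=0.05):
    """An outlier is indicated when the posterior probability exceeds the prior one."""
    return prob > prior_prob


# Hypothetical draws for one group mean and the within group standard deviation.
theta_draws = [28.0, 29.0, 30.0, 31.0]
tau_w_draws = [5.0, 10.0, 12.0, 10.0]
p_out = outlier_probability(63.31, theta_draws, tau_w_draws, 3.14)
print(p_out, flagged(p_out))
```

The second stage probability for a group mean follows the same pattern, with draws of $(\theta_j - \mu)/\tau_b$ in place of the first stage residual.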
Consider first a prior (prior A) on the overall variance $\tau^2 = \tau_w^2 + I\tau_b^2$, with the proportion assigned to within group variation then decided by a ratio $p$ with a flat beta prior.
Prior B is used by Chaloner, namely
$$\Pr(\tau_w^2, \tau_b^2) \propto \tau_w^{-2} (\tau_w^2 + I\tau_b^2)^{-1}$$
which in BUGS involves a double grid prior scaled to ensure total mass of one. The grid takes account of the actual value of the between group variance to the extent that small overall variances (below 5) are excluded. Finally, prior C is a double grid prior with equal probability over the pairs of values of $\tau_w^2$ and $\tau_b^2$ in the grid.
Whatever the prior, it is clear that the likelihood for $\tau_b^2$ is relatively flat, but prior C results in a higher mean estimate for between group variance than the other two.
Despite this, probabilities of individual observation outliers are relatively similar regardless of the prior adopted, as are the estimates of the true group means $\theta_1, \theta_2, \ldots, \theta_5$. The probabilities that $y_{1,6}$ and $y_{4,1}$ are outliers are, respectively, {0.38, 0.028}, {0.42, 0.037} and {0.35, 0.031} under the three priors, whereas Chaloner, who uses a Normal approximation in conjunction with a Laplace approximation, finds values of 0.44 and 0.05. Priors A and B give a posterior probability of group 5 being an outlier of around 0.05 to 0.06; this compares to 0.019 cited by Chaloner.
Example 3.13 Stack loss. To illustrate the effect of robust alternatives to the Normal for metric outcomes based on the Student $t$ density, consider $n = 21$ points in the classic data set for stack loss, $y$, and predictors $x_1$ = air flow, $x_2$ = temperature and $x_3$ = acid.
A simple Normal errors model gives an estimated equation, with posterior means and standard deviations (in brackets),
$$y = \underset{(10.5)}{-43.6} + \underset{(0.12)}{0.72}\,x_1 + \underset{(0.32)}{1.28}\,x_2 - \underset{(0.13)}{0.11}\,x_3$$
Lange et al. (1989) show a small improvement in likelihood in moving from a Normal (degrees of freedom $\nu = \infty$) to a Cauchy density ($\nu = 0.5$), with the maximum likelihood estimate of the degrees of freedom provided by $\nu = 1.1$. Here an exponential prior for $\nu$ is assumed with mean $\eta$, which is itself assigned a uniform prior¹² between 0.01 and 1. It may be noted that a proper prior is needed on $\nu$ to avoid relapsing to the Normal (see Geweke, 1993, p. S27).
Adopting this approach (Model B in Program 3.13) gives a median value of around 1.7 for $\nu$, with the credible interval ranging from 0.5 to 60. The estimated equation is now
$$y = \underset{(7.6)}{-39.4} + \underset{(0.12)}{0.81}\,x_1 + \underset{(0.37)}{0.71}\,x_2 - \underset{(0.11)}{0.09}\,x_3$$
so that the coefficient on $x_2$ is considerably reduced.
We then take the scale mixture approach to the Student $t$ but with $\nu$ known (as 1.7).
This shows the lowest weights ($v_i$ as in (3.23)), namely 0.78 and 0.67, for observations 4 and 21. Adopting the MLE value of 1.1 instead gives the lowest weights (all under 0.25) on observations 3, 4 and 21 (cf. Lange et al., 1989, p. 883).
Finally, Model D in Program 3.13¹³ also uses a scale mixture approach, but with $\nu$ also unknown. The outcome of this model, from a run of 10 000 iterations, is a median for $\nu$ of 1.3, and weights of 0.14 and 0.10 on observations 4 and 21. The coefficient on $x_2$ is further reduced to average 0.61 (compare Lange et al., 1989, Table 1).
Example 3.14 Leukaemia survival. To illustrate the transposition model (3.24) for a binary outcome, we consider the leukaemia data of Feigl and Zelen (1965), but with the response for $n = 33$ subjects being defined according to whether they survived for a year or more ($y = 1$ for deaths at more than a year). The two covariates are $x_1$ = white blood cell count (WBC) and $x_2$ = positive or negative AG (presence or absence of certain morphologic characteristics in the white cells). A standard logistic regression shows both covariates negatively related to extended survival (Table 3.20), with the WBC coefficient being $-0.04$.
In adopting the alternative model allowing for possible contamination, it is to be noted that Copas (1988, p. 245) found by maximum likelihood methods that a value of $\gamma = 0.003$ in (3.24) leads to a large increase in the absolute size of the WBC coefficient. This means the predicted longer term survival chances of patients with low WBC counts are
12 This is broadly equivalent to assuming the degrees of freedom is between 1 and 100.
13 Run in version 1.2.
Table 3.20 Leukaemia data, logit model and logit under transposition, parameter summary

                       Mean      St. devn.   2.5%      Median    97.5%
Standard Model
  Intercept            1.073     0.708       -0.231    1.049     2.538
  WBC                  -0.041    0.022       -0.092    -0.038    -0.005
  Negative AG          -2.49     1.036       -4.688    -2.432    -0.601
Transposition Model
  Intercept            3.68      2.04        0.32      3.39      11.07
  WBC                  -0.32     0.18        -0.89     -0.30     -0.02
  Negative AG          -3.38     1.89        -10.47    -3.12     -0.30
  γ                    0.0066    0.0049      0.0003    0.0055    0.0228

Likelihood Ratio (Misrecording vs. Genuine) (see equation (3.25))

         Mean       Median              Mean       Median
D1       0.1268     0.0689     D18      53.7300    2.6940
D2       0.0969     0.0433     D19      48.4700    1.7380
D3       0.1906     0.1263     D20      0.6457     0.4211
D4       0.1342     0.0753     D21      1.8150     0.9181
D5       18.82      4.73       D22      0.1486     0.0917
D6       1.2400     0.7816     D23      0.4155     0.2796
D7       1.0180     0.6772     D24      0.1177     0.0680
D8       0.5933     0.1937     D25      0.0260     0.0047
D9       -39.61     5.65       D26      0.0120     0.0005
D10      0.3770     0.2835     D27      0.0111     0.0004
D11      0.8145     0.5736     D28      0.0091     0.0002
D12      0.0970     0.0022     D29      0.0130     0.0007
D13      0.0809     0.0010     D30      0.0206     0.0026
D14      0.0114     0.0001     D31      0.0017     0.0001
D15      0.0114     0.0001     D32      0.0011     0.0001
D16      0.0383     0.0001     D33      0.0011     0.0001
D17      8966       9999

increased. However, further increases in $\gamma$ had a relatively small impact. Therefore, a prior on $\gamma$ may be set drawing on this analysis, and specifically
$$\gamma \sim E(300)$$
A three chain run of 25 000 iterations shows convergence from around 5000 iterations, and the summary is based on iterations 5000 to 25 000. The posterior mean for the WBC coefficient now stands at $-0.34$, with a 95% interval confined to negative values.
The posterior mean for $\gamma$ itself is just under 0.007. The likelihood ratios (3.25) are highest (hence chance of outlier status greatest) for observation 17, which has $y = 1$ (the patient survived 65 weeks) but a WBC measure of 100. The coefficient on the dummy index for AG status is also increased in absolute size, though like the WBC coefficient it is estimated less precisely (i.e. the posterior standard deviation is increased over the standard logit).
Example 3.15 Travel for shopping. To illustrate alternative approaches to robustness with binary outcomes to the contamination/misclassification model, consider data from Guy et al. (1983) for a panel survey of shopping behaviour. This involved 84 family households in suburban Cardiff, approximately equidistant from the city centre, with the response $y_i$ being whether or not the household used a city centre store during a particular week. The predictors are income (Inc), household size (Hsz, for number of children) and whether the wife was working (WW). The first two covariates are respectively ordinal (with levels 1 for income under £1000 up to 8 for incomes over £15 000) and discrete, but both are taken as continuous.
Wrigley and Dunn (1986) consider issues of resistant and robust logit regression against a substantive background, whereby positive effects of income and working wife on central city shopping are expected, but a negative effect of household size.
They argue that it is preferable to work with all the explanatory variables considered relevant and that exclusion of predictors because of one or two outliers or influential observations should be avoided. Let
$$\pi_i = \Pr(y_i = 1)$$
Wrigley and Dunn cite estimates from a maximum likelihood fit as follows (with standard errors in brackets)
$$\mathrm{logit}(\pi_i) = \underset{(0.91)}{-0.72} + \underset{(0.23)}{0.14}\,\mathrm{Inc} - \underset{(0.19)}{0.56}\,\mathrm{Hsz} + \underset{(0.54)}{0.83}\,\mathrm{WW} \tag{3.27}$$
So the significance of the working wife variable is only marginal (i.e. income is significant at 5% only if a one-tail test is used).
Here we adopt mildly informative priors (cf. Chib, 1995) in a logit link model:
$N(0.75, 25)$ priors on $\beta_{\mathrm{Inc}}$ and $\beta_{\mathrm{WW}}$ are taken in line with an expected positive effect on central city shopping of income and female labour activity, while a $N(-0.75, 25)$ prior on $\beta_{\mathrm{Hsz}}$ reflects the expected negative impact of household size.
From a 10 000 iteration three chain run a stronger effect of income is obtained than in Equation (3.27). The analogous equation to that above, with posterior standard deviations in brackets, is
$$\mathrm{logit}(\pi_i) = \underset{(0.97)}{-0.56} + \underset{(0.25)}{0.39}\,\mathrm{Inc} - \underset{(0.20)}{0.61}\,\mathrm{Hsz} + \underset{(0.59)}{1.07}\,\mathrm{WW}$$
The 90% credible intervals on both the income and working wife variables are entirely confined to positive values, though this is not true for the 95% intervals. The highest deviance components are obtained for observations 5, 55, 58, 71 and 83. These points account for about 20% of the total deviance (minus twice the log likelihood, which averages about $-50$). The highest deviance is for case 55. The conditional predictive ordinates are lowest for cases 55 and 71.
Similarly, case 55 has the highest average residual under the latent utility logit model of Albert and Chib (1993); see Model B in Program 3.15. The logit is matched by a latent utility following Student $t$ errors with 8 degrees of freedom (logistic errors are 'heavy tailed' like the Student $t$ density). Trying Student $t$ sampling with smaller degrees of freedom (e.g. $t(2)$ errors) makes little difference to any conclusions about credible intervals for the income and working wife coefficients: the 95% intervals still straddle zero.
Note that either the Normal or the Student $t$ latent utility approach of Albert and Chib allows different known weights for each point. This approach might therefore be used to downweight certain observations using an influence function based on deviance or leverage contributions (see the Exercises).
As an illustration of potential distortion from particular data points, Model C in Program 3.15 applies the full cross-validation methodology based on single case omission. The differences between $\beta_{\mathrm{Inc}}$ and $\beta_{\mathrm{Inc}[i]}$ for income (Del.beta.Inc[] in Model C) show that major changes in this coefficient are caused by exclusion of particular points.
The average value of $\beta_{\mathrm{Inc}}$ from the standard logit link model is 0.39 with posterior standard deviation of 0.25. Exclusion of case 71 raises this coefficient by over half this standard deviation, to around 0.58, while exclusion of case 58 raises it to around 0.47.
By contrast, excluding case 29 lowers the coefficient to around 0.26. There might therefore be grounds for excluding case 71 at least, as it figures as an outlier and is influential on the regression. As discussed above, other options to assess robustness of inferences may be used, which retain the suspect case(s), but model them via contaminated priors or discrete mixture regressions.