4.2.1 Prediction
Consider model-based inference for prediction, where $R$ models are considered, each having the parameter $\theta$ as the predicted value of interest. Each model $i$ allows an estimate of the parameter, $\hat{\theta}_i$. If one of the models was clearly the K-L best (e.g., if its $w \geq 0.90$), then inference could probably be made, conditionally, on the selected best model. However, it is often the case that no single model is clearly superior to some of the others in the set. If the predicted value ($\hat{\theta}$) differs markedly across the models (i.e., the $\hat{\theta}_i$ differ across the models $i = 1, 2, \ldots, R$), then it is risky to base prediction on only the selected model. An obvious possibility is to compute a weighted estimate of the predicted value, weighting the predictions by the Akaike weights ($w_i$).
Model Averaging
This concept leads to the model-averaged estimates,
$$\hat{\bar{\theta}} = \sum_{i=1}^{R} w_i \hat{\theta}_i, \tag{4.1}$$
where $\hat{\bar{\theta}}$ denotes a model-averaged estimate of $\theta$. Alternatively, if the bootstrap is used to provide the estimated model selection frequencies ($\hat{\pi}_i$), model averaging can be done using
$$\hat{\bar{\theta}} = \sum_{i=1}^{R} \hat{\pi}_i \hat{\theta}_i. \tag{4.2}$$
This type of model averaging is useful for prediction problems or in cases where a particular parameter (e.g., $\gamma$, an immigration probability) occurs in all the models in the set. Prediction is an ideal way to view model averaging, because each model in a set, regardless of its parametrization, can be used to make a predicted value.
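As a minimal sketch of equation (4.1), the following Python computes Akaike weights from a set of AIC values and then forms the weighted prediction. The AIC values and per-model predictions are hypothetical numbers chosen for illustration, and the function names `akaike_weights` and `model_average` are assumptions of this sketch, not notation from the text. The weights use the standard form $w_i \propto \exp(-\Delta_i/2)$, with $\Delta_i$ the AIC difference from the best model.

```python
import math


def akaike_weights(aic_values):
    """Convert AIC values into Akaike weights w_i that sum to 1."""
    best = min(aic_values)
    # Delta_i = AIC_i - min(AIC); relative likelihood is exp(-Delta_i / 2).
    rel = [math.exp(-(a - best) / 2.0) for a in aic_values]
    total = sum(rel)
    return [r / total for r in rel]


def model_average(weights, estimates):
    """Weighted sum of per-model estimates, as in equation (4.1)."""
    if len(weights) != len(estimates):
        raise ValueError("need one weight per model estimate")
    return sum(w * t for w, t in zip(weights, estimates))


# Hypothetical inputs: AIC and predicted value for R = 3 models.
aic = [100.0, 101.2, 104.5]
theta_hat = [10.0, 12.0, 15.0]

w = akaike_weights(aic)
theta_bar = model_average(w, theta_hat)
print("weights:", [round(x, 3) for x in w])
print("model-averaged prediction:", round(theta_bar, 3))
```

The same `model_average` function would apply to equation (4.2) if bootstrap selection frequencies $\hat{\pi}_i$ were passed in place of the Akaike weights.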
Hirotugu Akaike was born in 1927 in Fujinomiya-shi, Shizuoka-ken, in Japan. He received B.S. and D.S. degrees in mathematics from the University of Tokyo in 1952 and 1961, respectively. He worked at the Institute of Statistical Mathematics for over 30 years, becoming its Director General in 1982. He has received many awards, prizes, and honors for his work in theoretical and applied statistics (deLeeuw 1992, Parzen 1994). The three-volume set, “Proceedings of the First US/Japan Conference on the Frontiers of Statistical Modeling: An Informational Approach” (Bozdogan 1994), commemorated Professor Hirotugu Akaike’s 65th birthday. Bozdogan (1994) records that the idea of a connection between the Kullback–Leibler discrepancy and the empirical log-likelihood function occurred to Akaike on the morning of March 16, 1971, as he was taking a seat on a commuter train.
4.2.2 Averaging Across Model Parameters
If one has a large number of closely related models, such as in linear-regression-based variable selection (e.g., all subsets selection), designation of a single best model is unsatisfactory because that “best” model is often highly variable. That is, the model estimated to be best would vary from data set to data set, where replicate data sets would be collected under the same underlying process. In this situation, model averaging provides considerably more stable inference.
The concept of inference being tied to all the models can be used to reduce model selection bias effects on linear regression coefficient estimates in all subsets selection. For the linear regression coefficient $\beta_j$ associated with predictor variable $x_j$ there are two versions of model averaging. First, we have the estimate $\hat{\bar{\beta}}_j$ where $\beta_j$ is averaged over all models in which $x_j$ appears (i.e., models in which $\beta_j$ is not fixed at zero):
$$\hat{\bar{\beta}}_j = \frac{\sum_{i=1}^{R} w_i I_j(g_i)\, \hat{\beta}_{j,i}}{w_+(j)}, \qquad w_+(j) = \sum_{i=1}^{R} w_i I_j(g_i),$$
and
$$I_j(g_i) = \begin{cases} 1 & \text{if predictor } x_j \text{ is in model } g_i, \\ 0 & \text{otherwise.} \end{cases}$$
Here, $\hat{\beta}_{j,i}$ denotes the estimator of $\beta_j$ based on model $g_i$. The notation $w_+(j)$ is merely the sum of the Akaike weights over all models in the set where predictor variable $x_j$ is explicitly in the model. Note that $w_+(j)$ is itself a model-averaged value about whether variable $x_j$ is in (or not in) a particular model.
Thus, $\hat{\bar{\beta}}_j$ is a “natural” average to consider, as it only averages $\hat{\beta}_j$ over models where an unknown $\beta_j$ parameter appears. Note, however, that the estimator $\hat{\bar{\beta}}_j$ ignores evidence about models $g_i$ wherein $\beta_{j,i} \equiv 0$.
An alternative way to average over linear regression models is to consider that variable $x_j$ is “in” every model; it is just that in some models the corresponding $\beta_j$ is set to zero, rather than being considered unknown. Conditional on model $g_i$ being selected, model selection has the effect of biasing $\hat{\beta}_{j,i}$ away from zero (Section 1.6). Thus, a second model-averaged estimator, denoted $\tilde{\bar{\beta}}_j$, is suggested:
$$\tilde{\bar{\beta}}_j = w_+(j)\, \hat{\bar{\beta}}_j.$$
This $\tilde{\bar{\beta}}_j$ actually derives from model averaging over all $R$ models. In models where $x_j$ does not appear, $\beta_{j,i} \equiv 0$ is used instead of the estimate $\hat{\beta}_{j,i}$. The resultant average is identical to $w_+(j)\hat{\bar{\beta}}_j$. Heuristically, $w_+(j)$ serves to shrink the conditional $\hat{\bar{\beta}}_j$ back towards zero, and this shrinkage serves to ameliorate much of the model selection bias of $\hat{\bar{\beta}}_j$ (Section 1.6). Investigation of this general idea and its extensions is an open research area.
One point here is that while $\hat{\bar{\beta}}_j$ can be computed ignoring models other than those where $x_j$ appears, $\tilde{\bar{\beta}}_j$ does require fitting all $R$ of the a priori models.
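The two coefficient estimators can be compared in a short Python sketch. It assumes each fitted model is summarized as a dictionary mapping predictor names to estimated coefficients, with a predictor simply absent when it is not in that model. The weights, coefficient values, and the function name `averaged_coefficients` are hypothetical and serve only to illustrate the formulas above.

```python
def averaged_coefficients(weights, coefs):
    """Return {predictor: (w_plus, beta_bar, beta_tilde)}.

    weights : Akaike weights, one per model (assumed positive, summing to 1).
    coefs   : one dict per model, predictor name -> estimated coefficient;
              a predictor missing from a dict is not in that model.
    """
    names = set().union(*[c.keys() for c in coefs])
    result = {}
    for name in sorted(names):
        # w_plus: sum of weights over models that contain this predictor.
        w_plus = sum(w for w, c in zip(weights, coefs) if name in c)
        # Numerator: weighted sum of estimates over those same models.
        weighted = sum(w * c[name] for w, c in zip(weights, coefs) if name in c)
        beta_bar = weighted / w_plus      # average conditional on inclusion
        beta_tilde = w_plus * beta_bar    # shrunk toward zero by w_plus
        result[name] = (w_plus, beta_bar, beta_tilde)
    return result


# Hypothetical example with R = 3 models and two predictors.
weights = [0.5, 0.3, 0.2]
coefs = [
    {"x1": 2.0, "x2": 1.0},  # model 1 contains x1 and x2
    {"x1": 2.4},             # model 2 contains x1 only
    {"x2": 0.5},             # model 3 contains x2 only
]

summary = averaged_coefficients(weights, coefs)
for name, (w_plus, beta_bar, beta_tilde) in summary.items():
    print(name, round(w_plus, 3), round(beta_bar, 3), round(beta_tilde, 3))
```

In this sketch `beta_tilde` equals the plain weighted sum over all models with zeros substituted where the predictor is absent, which is the sense in which the second estimator averages over all $R$ models.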
Improved inference requires fitting all the a priori models and then using a type of model averaging. When possible, one should use inference based on all the models, via model averaging and selection bias adjustments, rather than a “select the best model and ignore the others” strategy.
There are several advantages, both practical and philosophical, to model averaging, when it is appropriate. Where a model-averaged estimator can be used, it often has reduced bias, and sometimes better precision, compared to $\hat{\theta}$ from the selected best model. Hoeting et al. (1999) provide an introduction to model averaging from a Bayesian viewpoint (also see Leamer 1978 for motivating ideas). Bayesian model averaging is easy to understand, but can be difficult to implement in practice. Information-theoretic methods for model averaging are easy both to understand and implement, even when there is a large number of models, each with potentially many parameters.
While there are many cases where model averaging is useful, we warn against model averaging structural parameter estimates in some types of nonlinear models. While it is often appropriate to average slope parameters in linear regression models, structural parameters in nonlinear models such as $E(y) = (a + bx)/(1 + cx)$ or $E(y) = a\left(1 - [1 + (x/c)^d]^{-b}\right)$ should not be averaged. For example, a weighted average across these two models of any of the parameters $a$, $b$, $c$, or $d$ would not be appropriate. Instead, model averaging the predicted expected response variable $\hat{E}(y)$, for a given value of $x$, across models, is advantageous in reaching a robust inference that is not conditional on only a single model.
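A small Python sketch may make the distinction concrete: the prediction from each nonlinear model is computed with that model's own parameter estimates, and only the predictions are averaged. The parameter values and Akaike weights below are hypothetical, and the names `model_1`, `model_2`, and `averaged_prediction` are assumptions of this sketch.

```python
def model_1(x, a, b, c):
    """E(y) = (a + b*x) / (1 + c*x)."""
    return (a + b * x) / (1.0 + c * x)


def model_2(x, a, b, c, d):
    """E(y) = a * (1 - [1 + (x/c)**d]**(-b))."""
    return a * (1.0 - (1.0 + (x / c) ** d) ** (-b))


# Hypothetical fitted parameters; the letters a, b, c play different
# roles in each model, so averaging them across models is meaningless.
params_1 = {"a": 1.0, "b": 2.0, "c": 0.5}
params_2 = {"a": 4.0, "b": 1.5, "c": 2.0, "d": 2.0}

# Hypothetical Akaike weights for the two models.
weights = [0.6, 0.4]


def averaged_prediction(x):
    """Model-averaged predicted response at x (predictions, not parameters)."""
    preds = [model_1(x, **params_1), model_2(x, **params_2)]
    return sum(w * p for w, p in zip(weights, preds))


x0 = 2.0
print("model 1:", model_1(x0, **params_1))
print("model 2:", model_2(x0, **params_2))
print("averaged:", averaged_prediction(x0))
```

The averaged value always lies between the two model-specific predictions at the same $x$, which is a property that a prediction built from averaged parameters would not share in general.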
It is important to realize that the expected value of the model-averaged estimate, $E(\hat{\bar{\theta}})$, is not necessarily the same as $\theta$ from absolute truth. Under classical sampling theory the estimator $\hat{\theta}$ ($\equiv \hat{\theta}_i$ for the selected model $g_i$, which varies by sample), arrived at in the two-stage process of model selection followed by parameter estimation given the model, is by definition an unbiased estimator of $E(\hat{\bar{\theta}})$ as given by (4.1) or (4.2). Therefore, the unconditional sampling variance of $\hat{\theta} \equiv \hat{\bar{\theta}}$ is to be computed with respect to $E(\hat{\bar{\theta}})$. Any remaining bias, $E(\hat{\bar{\theta}}) - \theta$, in $\hat{\bar{\theta}}$ cannot be measured or allowed for in model selection uncertainty. However, part of the intent of having a good set of models and sound model selection is to render this bias negligible with respect to the unconditional $\mathrm{se}(\hat{\theta})$.
Model-averaging ideas are well developed from the Bayesian perspective (see Madigan and Raftery 1994, Draper 1995, Raftery 1996a, and particularly Hoeting et al. 1999; Newman 1997 provides an application). Model averaging has not yet been commonly adopted in applied frequentist inference. Some theoretical basis for these approaches and ideas appears in Chapter 6 (also see Buckland et al. 1997 and the Bayesian references just above).