
The hat diagonals will identify points that are potentially influential due to their location in x-space. It is desirable to consider both the location of the point and the response variable in measuring influence. Cook [1977, 1979] has suggested using a measure of the squared distance between the least squares estimate based on all $n$ points, $\hat{\beta}$, and the estimate obtained by deleting the $i$th point, say, $\hat{\beta}_{(i)}$. This distance measure can be expressed as

$$
D_i = \frac{(\hat{\beta} - \hat{\beta}_{(i)})' X'X (\hat{\beta} - \hat{\beta}_{(i)})}{p\,\mathrm{MSE}}, \qquad i = 1, 2, \ldots, n \qquad (3.59)
$$

A reasonable cutoff for $D_i$ is unity. That is, we usually consider observations for which $D_i > 1$ to be influential. Cook's distance statistic $D_i$ is actually calculated from

$$
D_i = \frac{r_i^2}{p} \cdot \frac{\mathrm{Var}[\hat{y}(x_i)]}{\mathrm{Var}(e_i)} = \frac{r_i^2}{p} \cdot \frac{h_{ii}}{1 - h_{ii}} \qquad (3.60)
$$

Note that, apart from the constant $p$, $D_i$ is the product of the square of the $i$th studentized residual and the ratio $h_{ii}/(1 - h_{ii})$. This ratio can be shown to be the distance from the vector $x_i$ to the centroid of the remaining data. Thus $D_i$ is made up of a component that reflects how well the regression model fits the $i$th observation $y_i$ and a component that measures how far that point is from the rest of the data. Either component (or both) may contribute to a large value of $D_i$.
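The calculation in Eq. (3.60) is straightforward to program. Below is a minimal sketch in Python/numpy; the function name and the convention that the model matrix X carries a leading column of ones are our own choices, and a production implementation would use the numerically stable routines of a statistics package rather than an explicit matrix inverse.

import numpy as np

def cooks_distance(X, y):
    """Cook's D_i for the linear model y = X beta + e.

    X is the n x p model matrix (first column all ones for the
    intercept), so p counts the intercept among the parameters.
    """
    n, p = X.shape
    H = X @ np.linalg.inv(X.T @ X) @ X.T   # hat matrix
    h = np.diag(H)                         # leverages h_ii
    e = y - H @ y                          # ordinary residuals
    mse = e @ e / (n - p)                  # residual mean square
    r = e / np.sqrt(mse * (1.0 - h))       # studentized residuals
    return (r**2 / p) * h / (1.0 - h)      # Eq. (3.60)

Applied to the satisfaction data with the two-predictor model of Example 3.1, the ninth element of the returned vector should reproduce the value for observation 9 computed below.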

Minitab will calculate and save the values of Cook's distance statistic $D_i$. Table 3.6 displays the values of Cook's distance statistic for the regression model for the patient satisfaction data in Example 3.1. The largest value, 0.467041, is associated with observation 9. This value was calculated from Eq. (3.60) as follows:

$$
D_9 = \frac{r_9^2}{p} \cdot \frac{h_{99}}{1 - h_{99}} = \frac{(-2.65767)^2}{3} \cdot \frac{0.165533}{1 - 0.165533} = 0.467041
$$

This does not exceed the cutoff of unity, so there are no influential observations in these data.
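As a quick check, the arithmetic of Eq. (3.60) for observation 9 can be reproduced directly from the quantities quoted above (a three-parameter model, so p = 3):

# Verify D_9 from the studentized residual and leverage quoted above.
r, h, p = -2.65767, 0.165533, 3
D9 = (r**2 / p) * h / (1.0 - h)
print(round(D9, 4))   # 0.467, matching the value in the text to rounding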

3.6 VARIABLE SELECTION METHODS IN REGRESSION

Regression model-building problems frequently involve a moderately large or large set of candidate predictors, and the objective of the analyst is to fit a regression model to the "best subset" of these candidates. This can be a complex problem, as these data sets frequently have outliers, strong correlations between subsets of the variables, and other complicating features.

There are several techniques that have been developed for selecting the best subset regression model. Generally, these methods are either stepwise-type variable selection methods or all possible regressions. Stepwise-type methods build a regression model by adding a predictor variable to, or removing one from, the model at each step. The forward selection version of the procedure begins with a model containing none of the candidate predictor variables and sequentially inserts variables into the model one at a time until a final equation is produced. The criterion for entering a variable into the equation is that the t-statistic for that variable must be significant.

The process is continued until there are no remaining candidate predictors that qualify for entry into the equation. In backward elimination, the procedure begins with all of the candidate predictor variables in the equation, and then variables are removed one at a time to produce a final equation. The criterion for removing a variable is usually based on the t-statistic, with the variable having the smallest t-statistic considered for removal first. Variables are removed until all of the predictors remaining in the model have significant t-statistics. Stepwise regression usually consists of a combination of forward and backward stepping. There are many variations of the basic procedures.
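To make the mechanics concrete, here is a sketch of forward selection in Python/numpy/scipy. At each pass it enters the candidate with the smallest partial p-value and stops when no candidate meets the alpha-to-enter threshold; backward elimination simply reverses the loop, removing the weakest remaining predictor. The function and variable names are illustrative, and this is not a reproduction of Minitab's implementation.

import numpy as np
from scipy import stats

def forward_selection(X, y, names, alpha_in=0.25):
    n = len(y)
    chosen, remaining = [], list(range(X.shape[1]))
    while remaining:
        best = None
        for j in remaining:
            # refit with the candidate j appended as the last column
            cols = np.column_stack([np.ones(n)] + [X[:, k] for k in chosen + [j]])
            beta, *_ = np.linalg.lstsq(cols, y, rcond=None)
            e = y - cols @ beta
            dof = n - cols.shape[1]
            mse = e @ e / dof
            cov = mse * np.linalg.inv(cols.T @ cols)
            t = beta[-1] / np.sqrt(cov[-1, -1])   # t-statistic of candidate
            pval = 2 * stats.t.sf(abs(t), dof)
            if best is None or pval < best[0]:
                best = (pval, j)
        if best[0] > alpha_in:
            break                                  # no candidate qualifies
        chosen.append(best[1])
        remaining.remove(best[1])
        print("enter", names[best[1]], "p =", round(best[0], 3))
    return [names[k] for k in chosen]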

In all possible regressions with K candidate predictor variables, the analyst examines all $2^K$ possible regression equations to identify the ones with potential to be a useful model. Obviously, as K becomes even moderately large, the number of possible regression models quickly becomes formidably large. Efficient algorithms have been developed that implicitly rather than explicitly examine all of these equations.

Typically, only the equations that are found to be "best" according to some criterion (such as minimum MSE) at each subset size are displayed. For more discussion of variable selection methods, see textbooks on regression such as Montgomery, Peck, and Vining [2006] or Myers [1990].
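For small K the brute-force enumeration is easy to write down. The sketch below scores every subset by R-squared and the Mallows Cp statistic discussed later in this section; the names are ours, and real all-subsets software uses the implicit-enumeration algorithms mentioned above rather than this full loop.

import itertools
import numpy as np

def all_subsets(X, y, names):
    n, K = X.shape
    # sigma^2 is estimated from the full model, as C_p requires
    full = np.column_stack([np.ones(n), X])
    e_full = y - full @ np.linalg.lstsq(full, y, rcond=None)[0]
    sigma2 = e_full @ e_full / (n - K - 1)
    sst = np.sum((y - y.mean())**2)
    out = []
    for r in range(K + 1):                       # subset sizes 0..K
        for subset in itertools.combinations(range(K), r):
            cols = np.column_stack([np.ones(n)] + [X[:, j] for j in subset])
            e = y - cols @ np.linalg.lstsq(cols, y, rcond=None)[0]
            sse = e @ e
            p = r + 1                            # parameters incl. intercept
            cp = sse / sigma2 - n + 2 * p        # Mallows C_p
            out.append(([names[j] for j in subset], 1 - sse / sst, cp))
    return out

Sorting the returned list by the Cp column reproduces the kind of ranking shown in Table 3.11.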

Example 3.8

Table 3.7 contains an expanded set of data for the hospital patient satisfaction data introduced in Example 3.1. In addition to the patient age and illness severity data, there are two additional regressors: an indicator of whether the patient is a surgical patient (1) or a medical patient (0), and an index indicating the patient's anxiety level.

We will use these data to illustrate how variable selection methods in regression can be used to help the analyst build a regression model.

We will illustrate the forward selection procedure first. The Minitab output that results from applying forward selection to these data is shown in Table 3.8. We used the Minitab default significance level of 0.25 for entering variables. The forward selection algorithm inserted the predictor patient age first, then severity, and finally a third predictor variable, anxiety, was inserted into the equation.

Table 3.9 presents the results of applying the Minitab backward elimination procedure to the patient satisfaction data, using the default level of 0.10 for removing variables. The procedure begins with all four predictors in the model; then the

TABLE 3.7 Expanded Patient Satisfaction Data

Observation   Age   Severity   Surgical-Medical   Anxiety   Satisfaction
     1         55      50             0             2.1          68
     2         46      24             1             2.8          77
     3         30      46             1             3.3          96
     4         35      48             1             4.5          80
     5         59      58             0             2.0          43
     6         61      60             0             5.1          44
     7         74      65             1             5.5          26
     8         38      42             1             3.2          88
     9         27      42             0             3.1          75
    10         51      50             1             2.4          57
    11         53      38             1             2.2          56
    12         41      30             0             2.1          88
    13         37      31             0             1.9          88
    14         24      34             0             3.1         102
    15         42      30             0             3.0          88
    16         50      48             1             4.2          70
    17         58      61             1             4.6          52
    18         60      71             1             5.3          43
    19         62      62             0             7.2          46
    20         68      38             0             7.8          56
    21         70      41             1             7.0          59
    22         79      66             1             6.2          26
    23         63      31             1             4.1          52
    24         39      42             0             3.5          83
    25         49      40             1             2.1          75
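For readers who wish to reproduce the following analyses outside Minitab, the data of Table 3.7 can be transcribed directly (the array names are our own):

import numpy as np

# Table 3.7: age, severity, surgical(1)/medical(0), anxiety, satisfaction
age      = np.array([55, 46, 30, 35, 59, 61, 74, 38, 27, 51, 53, 41, 37,
                     24, 42, 50, 58, 60, 62, 68, 70, 79, 63, 39, 49])
severity = np.array([50, 24, 46, 48, 58, 60, 65, 42, 42, 50, 38, 30, 31,
                     34, 30, 48, 61, 71, 62, 38, 41, 66, 31, 42, 40])
surg_med = np.array([0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0,
                     0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1])
anxiety  = np.array([2.1, 2.8, 3.3, 4.5, 2.0, 5.1, 5.5, 3.2, 3.1, 2.4,
                     2.2, 2.1, 1.9, 3.1, 3.0, 4.2, 4.6, 5.3, 7.2, 7.8,
                     7.0, 6.2, 4.1, 3.5, 2.1])
satisfaction = np.array([68, 77, 96, 80, 43, 44, 26, 88, 75, 57, 56, 88,
                         88, 102, 88, 70, 52, 43, 46, 56, 59, 26, 52, 83, 75])

With these arrays, X = np.column_stack([age, severity, surg_med, anxiety]) and y = satisfaction feed the sketches given earlier in this section.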

TABLE 3.8 Minitab Forward Selection for the Patient Satisfaction Data in Table 3.7

Stepwise Regression: Satisfaction Versus Age, Severity, ...

Forward selection. Alpha-to-Enter: 0.25

Response is Satisfaction on 4 predictors, with N = 25

Step             1        2        3
Constant     131.1    143.5    143.9
Age          -1.29    -1.03    -1.11
T-Value      -9.98    -8.92    -8.40
P-Value      0.000    0.000    0.000
Severity              -0.56    -0.58
T-Value               -4.23    -4.43
P-Value               0.000    0.000
Anxiety                          1.3
T-Value                         1.23
P-Value                        0.233
S             9.38     7.12     7.04
R-Sq         81.24    89.66    90.35
R-Sq(adj)    80.43    88.72    88.97

TABLE 3.9 Minitab Backward Elimination for the Patient Satisfaction Data

Stepwise Regression: Satisfaction Versus Age, Severity, ...

Backward elimination. Alpha-to-Remove: 0.1

Response is Satisfaction on 4 predictors, with N = 25

Step             1        2        3
Constant     143.9    143.9    143.5
Age          -1.12    -1.11    -1.03
T-Value      -8.08    -8.40    -8.92
P-Value      0.000    0.000    0.000
Severity     -0.59    -0.58    -0.56
T-Value      -4.32    -4.43    -4.23
P-Value      0.000    0.000    0.000
Surg-Med       0.4
T-Value       0.14
P-Value      0.892
Anxiety        1.3      1.3
T-Value       1.21     1.23
P-Value      0.242    0.233
S             7.21     7.04     7.12
R-Sq         90.36    90.35    89.66
R-Sq(adj)    88.43    88.97    88.72

surgical-medical indicator variable was removed, followed by the anxiety predictor.

The algorithm concluded with both patient age and severity in the model. Note that in this example, the forward selection procedure produced a different model than the backward elimination procedure. This happens fairly often, so it is usually a good idea to investigate different model-building techniques for a problem.

Table 3.10 shows the Minitab stepwise regression algorithm applied to the patient satisfaction data. The default significance levels of 0.15 to enter or remove variables from the model were used. At the first step, patient age is entered in the model.

Then severity is entered as the second variable. At that point, none of the remaining predictors met the 0.15 significance level criterion to enter the model, so stepwise regression terminated with age and severity as the model predictors. This is the same model found by backward elimination.

Table 3.11 shows the results of applying Minitab's all possible regressions algorithm to the patient satisfaction data. Since there are K = 4 predictors, there are $2^4 = 16$ possible regression equations. Minitab shows only the best two of each subset size,


TABLE 3.10 Minitab Stepwise Regression Applied to the Patient Satisfaction Data

Stepwise Regression: Satisfaction Versus Age, Severity, ...

Alpha-to-Enter: 0.15  Alpha-to-Remove: 0.15

Response is Satisfaction on 4 predictors, with N = 25

Step             1        2
Constant     131.1    143.5
Age          -1.29    -1.03
T-Value      -9.98    -8.92
P-Value      0.000    0.000
Severity              -0.56
T-Value               -4.23
P-Value               0.000
S             9.38     7.12
R-Sq         81.24    89.66
R-Sq(adj)    80.43    88.72

TABLE 3.11 Minitab All Possible Regressions Algorithm Applied to the Patient Satisfaction Data

Best Subsets Regression: Satisfaction Versus Age, Severity, ...

Response is Satisfaction

Vars   R-Sq   R-Sq(adj)   Mallows C-p        S    Age  Severity  Surg-Med  Anxiety
  1    81.2      80.4        17.9        9.3752    X
  1    52.3      50.2        78.0        14.955          X
  2    89.7      88.7         2.5        7.1177    X     X
  2    81.3      79.6        19.7        9.5626    X                          X
  3    90.4      89.0         3.0        7.0371    X     X                    X
  3    89.7      88.2         4.5        7.2846    X     X          X
  4    90.4      88.4         5.0        7.2074    X     X          X         X

along with the full (four-variable) model. For each model, Minitab presents the value of $R^2$, the adjusted $R^2$, the square root of the mean squared error (S), and the Mallows $C_p$ statistic, which is a measure of the amount of bias and variance in the model. If a model is specified incorrectly and important predictors are left out, then the predicted values are biased and the value of the $C_p$ statistic will exceed p, the number of model parameters. However, a correctly specified regression model will have no bias and the value of $C_p$ should equal p. Generally, models with small values of the $C_p$ statistic are desirable.
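For reference, the statistic is computed as

$$
C_p = \frac{SS_E(p)}{\hat{\sigma}^2} - n + 2p,
$$

where $SS_E(p)$ is the residual sum of squares for the subset model with $p$ parameters and $\hat{\sigma}^2$ is the mean squared error of the full model containing all K candidate predictors.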

The model with the smallest value of $C_p$ is the two-variable model with age and severity (the value of $C_p$ is 2.5, actually less than p = 3). The model with the smallest value of the mean squared error (or its square root, S) is the three-variable model with age, severity, and anxiety. Both of these models were found using the stepwise-type algorithms. Either one of these models is likely to be a good regression model describing the effects of the predictor variables on patient satisfaction. ■
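The reported $C_p$ for the two-variable model can be verified from the quantities in Table 3.11 alone, since S is the square root of the residual mean square:

# Check C_p for the (Age, Severity) model using values from Table 3.11.
n = 25
sse_sub = (n - 3) * 7.1177**2      # SSE of the subset model, p = 3
sigma2  = 7.2074**2                # MSE of the full four-variable model
cp = sse_sub / sigma2 - n + 2 * 3  # Mallows C_p
print(round(cp, 1))                # 2.5, less than p = 3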