• Nenhum resultado encontrado

Lecture 7 Linear Regression Diagnostics

N/A
N/A
Protected

Academic year: 2022

Share "Lecture 7 Linear Regression Diagnostics"

Copied!
41
0
0

Texto

(1)

Lecture 7

Linear Regression Diagnostics

BIOST 515

January 27, 2004

(2)

Major assumptions

1. The relationship between the outcomes and the predictors is (approximately) linear.

2. The error term has zero mean.

3. The error term has constant variance.

4. The errors are uncorrelated.

5. The errors are normally distributed or we have an adequate sample size to rely on large sample theory.

We should always check fitted models to make sure that these

assumptions have not been violated.

(3)

Departures from the underlying assumptions cannot be de-

tected using any of the summary statistics we’ve examined so

far such as the t or F statistics or R 2 . In fact, tests based on

these statistics may lead to incorrect inference since they are

based on many of the assumptions above.

(4)

Residual analysis

The diagnostic methods we’ll be exploring are based primarily on the residuals. Recall, the residual is defined as

e i = y i − y ˆ i , i = 1, . . . , n, where

ˆ

y = X β. ˆ

If the model is appropriate, it is reasonable to expect the residu-

als to exhibit properties that agree with the stated assumptions.

(5)

Characteristics of residuals

• The mean of the {e i } is 0:

¯

e = 1 n

n

X

i=1

e i = 0.

• The estimate of the population variance computed from the sample of the n residuals is

S 2 = 1

n − p − 1

n

X

i=1

e 2 i

which is the residual mean square, M SE = SSE/(n − p − 1).

(6)

• The {e i } are not independent random variables. In general,

if the number of residuals (n) is large relative to the number

of independent variables (p), the dependency can be ignored

for all practical purposes in an analysis of residuals.

(7)

Methods for standardizing residuals

• Standardized residuals

• Studentized residuals

• Jackknife residuals

(8)

Standardized residuals

An obvious choice for scaling residuals is to divide them by their estimated standard error. The quantity

z i = e i

√ M SE

is called a standardized residual. Based on the linear re-

gression assumptions, we might expect the z i s to resemble a

sample from a N (0, 1) distribution.

(9)

Studentized residuals

Using MSE as the variance of the ith residual e i is only an approximation. We can improve the residual scaling by dividing e i by the standard deviation of the ith residual. We can show that the covariance matrix of the residuals is

var(e) = σ 2 (I − H ).

Recall H = X (X 0 X ) −1 X 0 is the hat matrix. The variance of the ith residual is

var(e i ) = σ 2 (1 − h i ),

where h i is the ith element on the diagonal of the hat matrix

and 0 ≤ h i ≤ 1.

(10)

The quantity

r i = e i

p M SE (1 − h i )

is called a studentized residual and approximately follows a t distribution with n − p − 1 degrees of freedom (assuming the assumptions stated at the beginning of lecture are satisfied).

Studentized residuals have a mean near 0 and a variance, 1

n − p − 1

n

X

i=1

r i 2 ,

that is slightly larger than 1. In large data sets, the standardized

and studentized residuals should not differ dramatically.

(11)

Jackknife residuals

The quantity

r (−i) = r i

s M SE

M SE (−i) = e i q

M SE (−i) (1 − h i )

= r i

s (n − p − 1) − 1 (n − p − 1) − r i 2

is called a jackknife residual (or R-Student residual).

M SE (−i) is the residual variance computed with the ith ob- servation deleted.

Jackknife residuals have a mean near 0 and a variance 1

(n − p − 1) − 1

n

X

i=1

r (−i) 2

that is slightly greater than 1. Jackknife residuals are usually

the preferred residual for regression diagnostics.

(12)

How to use residuals for diagnostics?

Residual analysis is usually done graphically. We may look at

• Quantile plots: to assess normality

• Scatterplots: to assess model assumptions, such as constant variance and linearity, and to identify potential outliers

• Histograms, stem and leaf diagrams and boxplots

(13)

Quantile-quantile plots

Quantile-quantile plots can be useful for comparing two samples to determine if they arise from the same distribution.

Similarly, we can compare quantiles of a sample to the expected quantiles if the sample came from some distribution F for a visual assessment of whether the sample arises from F . In linear regression, this can help us determine the normality of the residuals (if we have relied on an assumption of normality).

To construct a quantile-quantile plot for the residuals, we

plot the quantiles of the residuals against the theorized quantiles

if the residuals arose from a normal distribution. If the residuals

come from a normal distribution the plot should resemble a

straight line. A straight line connecting the 1st and 3rd quartiles

is often added to the plot to aid in visual assessment.

(14)

Samples from N (0, 1) distribution

●●

● ●

●●

● ●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

● ●

● ●

●●

●●

●●

●●

●●

●●

●●

● ●

● ●

● ●

●●

●●

●●

● ●

●●

●●

●●

● ●

●●

●●

●●

● ●

● ●

● ●

●●

●●

● ●●

●● ●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

● ●

●●

●●

●●

● ●●

●●

●●

●●

●●

●●

●●

● ●

●●

●●

●●

−3 −2 −1 0 1 2 3

−3 −2 −1 0 1 2 3

Normal Q−Q Plot

Theoretical Quantiles

Sample Quantiles

(15)

Samples from a skewed distribution

● ●

●●

●●

● ●

●●

● ●

● ● ●

●● ●

● ●

● ●

● ●

● ● ●

● ●

●●●

● ● ●

● ● ● ●

● ●

● ●

● ● ●

●●

● ●

●● ●

● ●

● ●

● ●

● ●

● ●

● ●

● ●

● ●

●●

● ●

● ●

● ●

● ● ●

● ●

● ●

● ●●

● ●

● ● ● ●

●●

● ●

●●

● ●●

● ●

● ●

●● ●

● ●

●●

●●

● ●

● ●

● ●●●

● ●

● ●

● ●

● ●

● ●

● ●

● ● ●

● ●

●●

●●

● ●

● ●

●●

● ●

●●

●●

● ●

● ●●

●● ●

● ●

● ●

● ●

● ●

●●

● ●

●●● ●●

●●

● ●

● ● ●

● ●

●●

● ●

● ●

● ●

● ●

●●

● ● ●

● ● ●●

● ● ●

● ●

● ●

● ● ●

● ●

●●

● ● ●

● ●

● ●

● ●

●●●

● ●

● ● ●● ●● ●

● ●

● ●

● ●

● ●

● ● ●

●● ●

● ●

● ●

● ●

● ●

● ●

● ●

● ●

● ●

● ●

● ●

● ●

● ●

● ●

● ●

●●

● ●

● ●

● ●

● ●

● ● ●●

● ●

● ●

● ●●

● ●●

● ●

● ●

● ●

●●

● ●

● ● ●● ●

● ●

● ●

●●

● ●

● ●

● ●

● ●

● ●

● ●●

●● ●

● ●●

● ●

● ●

● ● ●

● ●●

●● ●

● ●

● ●

● ●

● ●●

● ●

● ●

● ●

●●

●●

● ●

● ●

● ●

● ●

● ● ●

● ● ●

● ●

● ●

●● ●

● ●

● ●

● ●●

● ● ●

● ●

● ●

● ●

● ●

●● ●

● ● ●

● ●

● ● ●

● ●

●●

●●

●●

● ●

● ● ●

● ●

● ●

●●●

● ●

●●

● ●

● ● ●

●● ●

●● ●●

● ●

● ●

● ●

● ●

● ●

● ●

●●

● ●

● ●

●●

●●● ●

●●

● ●

●●●

● ●

● ●

● ●

● ●●

● ●

● ●

● ●

● ●

● ●

●● ●

● ●● ● ●

−3 −2 −1 0 1 2 3

−2 0 2 4

Normal Q−Q Plot

Theoretical Quantiles

Residuals

Histogram of y

Residuals

Density

−2 −1 0 1 2 3 4

0.0 0.2 0.4

(16)

Samples from a heavy-tailed distribution

● ●

● ● ● ●

● ●

● ●

● ●●

●●

● ●

● ● ●●● ● ●●

● ●

● ●

● ●

● ● ● ●●● ● ●●

● ● ●

● ●

● ●

● ● ● ●

● ●

● ●

● ● ●

● ●

● ●

● ●

● ●

● ● ● ● ●

● ● ●

● ● ●

● ●

●● ● ●

● ● ● ●● ● ●

● ● ● ●

● ●

● ●

● ●

● ●

● ● ● ●

● ●●● ●● ●●●● ●●●●

● ●

● ● ●

● ● ●

● ● ●

● ●●● ● ●● ●

● ●

● ●

● ● ●

● ● ● ●● ●

●●

● ●

●● ● ●

● ●

● ●● ●

● ● ● ●

● ● ●

●● ●

● ● ●

● ● ●

● ● ● ● ●●●●●● ●

● ● ●

● ●

● ●

● ●

● ●

●●

● ●● ●

● ● ● ●● ● ●

● ● ●● ●

● ●

● ● ●

● ●

● ● ●●● ●

● ●

● ●

● ● ●

● ● ● ● ● ●●

●● ● ●

● ●

●●● ● ● ●

● ●

● ●

● ●

●●● ● ●

● ● ● ●● ● ●

●●

● ●

● ●

●● ●●● ● ●

● ●

● ● ●

● ●● ●

● ●

● ●

● ●●

● ●

● ● ● ● ●

● ●

● ●

● ●●

● ●

● ●

● ●

● ●

● ●

● ● ●●●● ● ●

● ● ●

● ● ● ●

● ● ●

●●

● ● ●

● ● ● ● ● ●● ●

● ● ●

● ●

● ● ●

● ● ● ●

● ● ● ●

● ● ●●●

● ●

●●● ● ● ●● ●

● ●

● ● ● ● ●

● ● ● ● ● ●

● ●

● ●● ●

● ●

● ● ●

● ●

● ● ●

● ● ●

● ●

● ●

● ●

● ●

● ●

● ●

●● ● ●

● ●

● ●●

● ● ●

●● ●●●● ●●●●●●●

● ● ●

● ●

● ●●● ●● ● ● ●●

● ●●●

● ● ● ●

● ● ●

● ● ●●

● ●●● ● ● ●

● ●

● ●

● ●

● ● ●

● ●

●●

● ● ●

● ●

● ●

● ● ● ●

● ●

● ●

● ●

● ●

● ●

● ●

● ● ● ● ●

● ●

● ● ● ●

● ●

● ●

●●

● ●

● ●

● ● ●

● ●

● ● ●

● ●

● ● ● ●

● ● ● ●●● ●●

● ●●●

● ● ● ●

● ●

● ●

● ● ● ●●

● ●

● ●

● ●● ● ●

● ● ●

● ● ● ●

● ● ● ● ● ● ● ●

● ●

● ●

● ● ●●

● ●

● ●

● ● ● ● ●

●● ● ●

● ● ● ● ●●

●● ●

● ● ●

● ● ● ●●

● ●

● ● ●

●● ● ●

● ●

● ●● ●● ● ●● ●●

● ● ●

● ●● ● ● ●

● ● ●

● ● ● ● ●

● ● ● ●

● ● ●

●● ● ●

● ● ● ● ●

● ●

● ● ●●●● ●

● ● ●

● ● ●

● ● ●

● ●

● ●

● ● ●

● ●● ●

● ● ●

● ● ●

● ● ●

● ● ●

● ● ● ● ●

● ●●●● ●● ●

● ● ● ●

● ●●

● ●● ● ●

● ● ● ●●

● ●●

● ●● ●

● ● ●

● ● ● ●

● ●

● ●

● ● ●

● ● ●

● ●

● ●● ●

● ● ● ● ●

● ●

● ● ●● ●●● ●

● ●

● ● ●

● ● ● ●

● ●

−3 −2 −1 0 1 2 3

−4 0 4 8

Normal Q−Q Plot

Theoretical Quantiles

Residuals

Histogram of y

Residuals

Density

−4 −2 0 2 4 6 8

0.0 0.3 0.6

(17)

Samples from a light-tailed distribution

● ●

● ●

● ● ●

●●

● ●

● ●

● ● ●

● ●

●●

● ●

● ●

● ●

● ●

● ●

●●

● ●

● ●●

● ●

● ●

●●

● ●

● ●

● ●

● ●

● ●

●●

● ●

● ●●

● ●

●●

● ●

● ● ●

● ●

●●

● ●

● ●

● ●

● ●

● ●

●●

● ●

●●

● ●

●●

●●

●●

● ●

● ●

●●●

● ●

● ●

● ●

● ●

●●●

● ●

● ●

●● ●

● ●

●●

● ●

● ●

● ●●

● ●

● ● ●

●●

● ●

● ●

●●

● ●

● ●

● ●

●●●

● ●

●●

●●

●●

● ●

● ●

● ●

●●

● ●

● ●

● ●

● ●

● ●

● ●

●●

● ●

● ●

● ●

● ●●

● ●

● ●

● ●

● ●

● ●

●● ●

● ●

● ●

● ●

● ●

●●

●●

● ●

● ●●

● ●

● ●

● ●

● ●

●● ●

● ●

● ●

●●●

● ● ●

● ●

● ●●

● ●

● ●

● ● ●●

●●

●●

● ●

●●●

● ●

●●

● ●

● ●

−3 −2 −1 0 1 2 3

−2 0 1 2

Normal Q−Q Plot

Theoretical Quantiles

Residuals

Histogram of y

Residuals

Density

−2 −1 0 1 2

0.0 0.2 0.4

(18)

Scatterplots

Another useful aid for inspection is a scatterplot of the residuals against the fitted values and/or the predictors. These plots can help us identify:

• Non-constant variance

• Violation of the assumption of linearity

• Potential outliers

(19)

● ●

● ●

● ●

● ●

● ●

●●

● ● ●

● ●

● ●

● ●

● ●

● ●

● ●

● ●

● ●

● ●

● ●

● ●

● ●

● ●

● ●

●●

● ●

● ●

● ●

● ●

● ●

● ●●

● ●

● ●

● ●

● ●

● ● ●

● ●

● ●

● ●

● ● ●

● ●

● ●

●●

● ●

● ●

● ●

● ●

● ●

● ● ●

● ●

● ●

● ●

● ●

● ●

●●

● ●

● ●

● ●

● ●

● ●

● ●

● ●

● ●

● ●

● ●

● ●

● ●

● ●

● ●

● ●

● ●

● ●

● ●

● ●

●●

● ●

● ●

● ●

● ●

● ●

● ●

● ●

● ●

● ●

● ●

● ●

● ●

● ●

● ●

● ●

● ●

● ●

● ●● ●

● ●

● ●

● ●

● ●

● ●

● ●

● ●

● ●

● ● ●

● ●

●●

● ●

● ●

● ●

● ●

● ●

● ●

● ● ●

Satisfactory residual plot

y ^

i

e

i

(20)

●●

● ●

● ●

● ●

● ●

● ●

● ●

● ●

● ● ●

● ●

● ●

● ●

● ●

● ●

● ●

● ●

● ●

● ●

● ●

● ●

● ●

● ●

●●

● ●

● ●

● ● ●

● ●

● ● ●

● ●

● ●●

● ●

● ●

● ●

● ●●

● ●

● ●

● ●

●●

● ●

● ●

● ●

● ●

● ●

● ●

● ●

● ●

● ●

● ●

● ●

● ●

● ●

● ●

● ●

● ●

● ●

● ●

● ●

● ● ●

● ●

● ●

● ●

● ●

● ●

● ●

● ●

● ●

● ●

● ●

● ●

● ●

● ●

● ● ●

● ●

● ●

● ●

● ●

● ●

● ●

● ●● ●

● ●

● ●

● ●

● ●

● ●

● ●

● ●

● ●

● ●

● ●

● ●

● ●

● ●

● ●

● ●

● ●

● ●

● ●

● ●

● ●

● ●

● ●

● ●

● ●

● ●

● ●

● ●

● ●

● ●

● ●

●●

● ●

● ●

● ●

● ●

● ●

● ●

● ●

● ●

Non−constant variance

y ^

i

e

i

Referências

Documentos relacionados

Destarte esta pesquisa buscou desenvolver práticas de ensino, aprendizagem e criações artísticas que foram realizadas em espaços educativos formais e não formais, a fim de estimular

No caso concreto deste estudo, esta estratégia englobou a realização de cinco trabalhos experimentais por cada um dos grupos de trabalho constituídos, em cada um dos dois

The aim of the present study was to evaluate the food habits and food choice motives of a sample of university students, examining the differences between both genders and

There- fore, the purpose of this work was to investigate the relationship between flower respiration and flower longevity, in potted min- iature rose plants (Rosa hybrida L.), as

Return on Invested Capital (ROIC) should also be a measure of profitability used once the postal bank estimates a break-even point in 3 years (2018). Banco CTT should also be

The probability of attending school four our group of interest in this region increased by 6.5 percentage points after the expansion of the Bolsa Família program in 2007 and

Se alguém que, como o autor destas linhas, viveu a Escola de Cinema (actual Departamento de Cinema da ESTC) nos últimos 25 ou mais anos, pode objectar algo a este amigável

102 “Raça, gênero e criminologia: Reflexões sobre o controle social das mulheres negras a partir da criminologia positivista de Nina Rodrigues”, de Naila Franklin (2017); e