Multivariate Linear Regression
Nathaniel E. Helwig
Assistant Professor of Psychology and Statistics University of Minnesota (Twin Cities)
Updated 16-Jan-2017
Copyright
Copyright © 2017 by Nathaniel E. Helwig
Outline of Notes
1) Multiple Linear Regression
   - Model form and assumptions
   - Parameter estimation
   - Inference and prediction

2) Multivariate Linear Regression
   - Model form and assumptions
   - Parameter estimation
   - Inference and prediction
Multiple Linear Regression
MLR Model: Scalar Form
The multiple linear regression model has the form

    y_i = b_0 + Σ_{j=1}^p b_j x_{ij} + e_i

for i ∈ {1, ..., n}, where
- y_i ∈ R is the real-valued response for the i-th observation
- b_0 ∈ R is the regression intercept
- b_j ∈ R is the j-th predictor's regression slope
- x_{ij} ∈ R is the j-th predictor for the i-th observation
- e_i ~iid N(0, σ²) is a Gaussian error term
MLR Model: Nomenclature
The model is multiple because we have p > 1 predictors.
- If p = 1, we have a simple linear regression model.

The model is linear because y_i is a linear function of the parameters (b_0, b_1, ..., b_p are the parameters).

The model is a regression model because we are modeling a response variable (Y) as a function of predictor variables (X_1, ..., X_p).
MLR Model: Assumptions
The fundamental assumptions of the MLR model are:
1. The relationship between X_j and Y is linear (given the other predictors)
2. x_{ij} and y_i are observed random variables (known constants)
3. e_i ~iid N(0, σ²) is an unobserved random variable
4. b_0, b_1, ..., b_p are unknown constants
5. (y_i | x_{i1}, ..., x_{ip}) ~ind N(b_0 + Σ_{j=1}^p b_j x_{ij}, σ²)
   note: homogeneity of variance

Note: b_j is the expected increase in Y for a 1-unit increase in X_j with all other predictor variables held constant.
MLR Model: Matrix Form
The multiple linear regression model has the form

    y = Xb + e

where
- y = (y_1, ..., y_n)' ∈ R^n is the n×1 response vector
- X = [1_n, x_1, ..., x_p] ∈ R^{n×(p+1)} is the n×(p+1) design matrix
  • 1_n is an n×1 vector of ones
  • x_j = (x_{1j}, ..., x_{nj})' ∈ R^n is the j-th predictor vector (n×1)
- b = (b_0, b_1, ..., b_p)' ∈ R^{p+1} is the (p+1)×1 vector of regression coefficients
- e = (e_1, ..., e_n)' ∈ R^n is the n×1 error vector
MLR Model: Matrix Form (another look)
Matrix form writes the MLR model for all n observations simultaneously:

    y = Xb + e

    [ y_1 ]   [ 1  x_11  x_12  ...  x_1p ] [ b_0 ]   [ e_1 ]
    [ y_2 ]   [ 1  x_21  x_22  ...  x_2p ] [ b_1 ]   [ e_2 ]
    [ y_3 ] = [ 1  x_31  x_32  ...  x_3p ] [ b_2 ] + [ e_3 ]
    [ ... ]   [ ...   ...   ...  ...  ...] [ ... ]   [ ... ]
    [ y_n ]   [ 1  x_n1  x_n2  ...  x_np ] [ b_p ]   [ e_n ]
MLR Model: Assumptions (revisited)
In matrix terms, the error vector is multivariate normal:

    e ~ N(0_n, σ²I_n)

In matrix terms, the response vector is multivariate normal given X:

    (y | X) ~ N(Xb, σ²I_n)
Ordinary Least Squares
The ordinary least squares (OLS) problem is

    min_{b ∈ R^{p+1}} ||y − Xb||² = min_{b ∈ R^{p+1}} Σ_{i=1}^n (y_i − b_0 − Σ_{j=1}^p b_j x_{ij})²

where ||·|| denotes the Frobenius norm.

The OLS solution has the form

    b̂ = (X'X)^{−1}X'y
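The notes illustrate everything in R; as a language-neutral supplement (not part of the original notes), the closed-form solution b̂ = (X'X)^{−1}X'y can be sketched in NumPy on simulated data and cross-checked against a numerically stable least-squares solver:

```python
# Minimal sketch of the OLS closed form on made-up data (all names hypothetical).
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # design matrix with intercept
b_true = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ b_true + rng.normal(scale=0.1, size=n)

# Closed-form OLS solution (X'X)^{-1} X'y via a linear solve
bhat = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check against the numerically stable least-squares routine
bhat_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(bhat, bhat_lstsq))  # True
```

In practice one solves the normal equations (or uses a QR/SVD-based routine) rather than explicitly inverting X'X.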
Fitted Values and Residuals
SCALAR FORM:
Fitted values are given by ŷ_i = b̂_0 + Σ_{j=1}^p b̂_j x_{ij}
and residuals are given by ê_i = y_i − ŷ_i

MATRIX FORM:
Fitted values are given by ŷ = Xb̂
and residuals are given by ê = y − ŷ
Hat Matrix
Note that we can write the fitted values as

    ŷ = Xb̂ = X(X'X)^{−1}X'y = Hy

where H = X(X'X)^{−1}X' is the hat matrix.

H is a symmetric and idempotent matrix: HH = H.
H projects y onto the column space of X.
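The stated hat-matrix properties can be checked numerically; this NumPy sketch (simulated data, not from the notes) verifies symmetry, idempotence, and that Hy reproduces the fitted values:

```python
# Sketch: checking H = X (X'X)^{-1} X' is symmetric, idempotent, and maps y to yhat.
import numpy as np

rng = np.random.default_rng(1)
n, p = 20, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = rng.normal(size=n)

H = X @ np.linalg.solve(X.T @ X, X.T)        # hat matrix
bhat = np.linalg.solve(X.T @ X, X.T @ y)     # OLS coefficients

print(np.allclose(H, H.T))           # symmetric
print(np.allclose(H @ H, H))         # idempotent
print(np.allclose(H @ y, X @ bhat))  # H projects y onto the fitted values
```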
Multiple Regression Example in R
> data(mtcars)
> head(mtcars)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
> mtcars$cyl <- factor(mtcars$cyl)
> mod <- lm(mpg ~ cyl + am + carb, data=mtcars)
> coef(mod)
(Intercept) cyl6 cyl8 am carb
25.320303 -3.549419 -6.904637 4.226774 -1.119855
Regression Sums-of-Squares: Scalar Form
In MLR models, the relevant sums-of-squares are
- Sum-of-Squares Total:      SST = Σ_{i=1}^n (y_i − ȳ)²
- Sum-of-Squares Regression: SSR = Σ_{i=1}^n (ŷ_i − ȳ)²
- Sum-of-Squares Error:      SSE = Σ_{i=1}^n (y_i − ŷ_i)²

The corresponding degrees of freedom are
- SST: df_T = n − 1
- SSR: df_R = p
- SSE: df_E = n − p − 1
Regression Sums-of-Squares: Matrix Form
In MLR models, the relevant sums-of-squares are

    SST = Σ_{i=1}^n (y_i − ȳ)²   = y'[I_n − (1/n)J]y
    SSR = Σ_{i=1}^n (ŷ_i − ȳ)²   = y'[H − (1/n)J]y
    SSE = Σ_{i=1}^n (y_i − ŷ_i)² = y'[I_n − H]y

where J = 1_n 1_n' is the n×n matrix of ones.
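The equivalence of the scalar and matrix forms can be checked directly; this NumPy sketch (simulated data, not from the notes) also confirms the partition SST = SSR + SSE:

```python
# Sketch: quadratic-form versions of SST, SSR, SSE vs. their scalar definitions.
import numpy as np

rng = np.random.default_rng(2)
n, p = 30, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ np.array([1.0, 0.5, -0.5, 2.0]) + rng.normal(size=n)

H = X @ np.linalg.solve(X.T @ X, X.T)  # hat matrix
J = np.ones((n, n))                    # n x n matrix of ones
yhat, ybar = H @ y, y.mean()

SST = y @ (np.eye(n) - J / n) @ y
SSR = y @ (H - J / n) @ y
SSE = y @ (np.eye(n) - H) @ y

print(np.isclose(SST, np.sum((y - ybar) ** 2)))
print(np.isclose(SSR, np.sum((yhat - ybar) ** 2)))
print(np.isclose(SST, SSR + SSE))
```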
Partitioning the Variance
We can partition the total variation in y_i as

    SST = Σ_{i=1}^n (y_i − ȳ)²
        = Σ_{i=1}^n (y_i − ŷ_i + ŷ_i − ȳ)²
        = Σ_{i=1}^n (ŷ_i − ȳ)² + Σ_{i=1}^n (y_i − ŷ_i)² + 2 Σ_{i=1}^n (ŷ_i − ȳ)(y_i − ŷ_i)
        = SSR + SSE + 2 Σ_{i=1}^n (ŷ_i − ȳ)ê_i
        = SSR + SSE

where the cross-product term vanishes because the residuals are orthogonal to the fitted values and sum to zero.
Regression Sums-of-Squares in R
> anova(mod)
Analysis of Variance Table

Response: mpg
          Df Sum Sq Mean Sq F value    Pr(>F)
cyl        2 824.78  412.39 52.4138  5.05e-10 ***
am         1  36.77   36.77  4.6730   0.03967 *
carb       1  52.06   52.06  6.6166   0.01592 *
Residuals 27 212.44    7.87
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

> library(car)   # Anova() is from the car package
> Anova(mod, type=3)
Anova Table (Type III tests)

Response: mpg
             Sum Sq Df  F value    Pr(>F)
(Intercept) 3368.1   1 428.0789 < 2.2e-16 ***
cyl          121.2   2   7.7048  0.002252 **
am            77.1   1   9.8039  0.004156 **
carb          52.1   1   6.6166  0.015923 *
Residuals    212.4  27
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Coefficient of Multiple Determination
The coefficient of multiple determination is defined as

    R² = SSR / SST = 1 − SSE / SST

and gives the proportion of variation in y_i that is explained by the linear relationship with x_{i1}, ..., x_{ip}.

When interpreting R² values, note that:
- 0 ≤ R² ≤ 1
- Large R² values do not necessarily imply a good model
Adjusted Coefficient of Multiple Determination (R²_a)
Including more predictors in an MLR model can artificially inflate R²:
- Capitalizing on spurious effects present in noisy data
- Phenomenon of over-fitting the data

The adjusted R² is a relative measure of fit:

    R²_a = 1 − (SSE/df_E) / (SST/df_T) = 1 − σ̂² / s²_Y

where s²_Y = Σ_{i=1}^n (y_i − ȳ)² / (n − 1) is the sample estimate of the variance of Y.
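Both definitions can be computed from the sums of squares; this NumPy sketch (simulated data, not from the notes) shows the computation and the fact that the adjustment can only lower R²:

```python
# Sketch: R^2 and adjusted R^2 from SSE/SST with the df correction.
import numpy as np

rng = np.random.default_rng(3)
n, p = 40, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ np.array([1.0, 1.0, 0.0, 0.0, 0.5]) + rng.normal(size=n)

H = X @ np.linalg.solve(X.T @ X, X.T)
SSE = y @ (np.eye(n) - H) @ y
SST = np.sum((y - y.mean()) ** 2)

r2 = 1 - SSE / SST
r2_adj = 1 - (SSE / (n - p - 1)) / (SST / (n - 1))
print(0 <= r2 <= 1, r2_adj <= r2)
```

Since R²_a = 1 − (1 − R²)(n − 1)/(n − p − 1) and (n − 1)/(n − p − 1) ≥ 1, the adjusted value never exceeds R².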
Coefficient of Multiple Determination in R
> smod <- summary(mod)
> names(smod)
[1] "call" "terms" "residuals" "coefficients"
[5] "aliased" "sigma" "df" "r.squared"
[9] "adj.r.squared" "fstatistic" "cov.unscaled"
> summary(mod)$r.squared
[1] 0.8113434
> summary(mod)$adj.r.squared
[1] 0.7833943
Relation to ML Solution
Remember that (y | X) ~ N(Xb, σ²I_n), which implies that y has pdf

    f(y | X, b, σ²) = (2π)^{−n/2} (σ²)^{−n/2} exp{ −(1/(2σ²)) (y − Xb)'(y − Xb) }

As a result, the log-likelihood of b given (y, X, σ²) is

    ln{L(b | y, X, σ²)} = −(1/(2σ²)) (y − Xb)'(y − Xb) + c

where c is a constant that does not depend on b.
Relation to ML Solution (continued)
The maximum likelihood estimate (MLE) of b is the estimate satisfying

    max_{b ∈ R^{p+1}} −(1/(2σ²)) (y − Xb)'(y − Xb)

Now, note that

    max_{b ∈ R^{p+1}} −(1/(2σ²)) (y − Xb)'(y − Xb) = max_{b ∈ R^{p+1}} −(y − Xb)'(y − Xb)
    max_{b ∈ R^{p+1}} −(y − Xb)'(y − Xb) = min_{b ∈ R^{p+1}} (y − Xb)'(y − Xb)

Thus, the OLS and ML estimates of b are the same:

    b̂ = (X'X)^{−1}X'y
Estimated Error Variance (Mean Squared Error)
The estimated error variance is

    σ̂² = SSE / (n − p − 1)
        = Σ_{i=1}^n (y_i − ŷ_i)² / (n − p − 1)
        = ||(I_n − H)y||² / (n − p − 1)

which is an unbiased estimate of the error variance σ².

The estimate σ̂² is the mean squared error (MSE) of the model.
Maximum Likelihood Estimate of Error Variance
σ̃² = Σ_{i=1}^n (y_i − ŷ_i)² / n is the MLE of σ².

From our previous results using σ̂², we have that

    E(σ̃²) = [(n − p − 1)/n] σ²

Consequently, the bias of the estimator σ̃² is given by

    [(n − p − 1)/n] σ² − σ² = −[(p + 1)/n] σ²

and note that −[(p + 1)/n] σ² → 0 as n → ∞.
Comparing σ̂² and σ̃²
Reminder: the MSE and MLE of σ² are given by

    σ̂² = ||(I_n − H)y||² / (n − p − 1)
    σ̃² = ||(I_n − H)y||² / n

From the definitions of σ̂² and σ̃² we have that

    σ̃² < σ̂²

so the MLE produces a smaller estimate of the error variance.
Estimated Error Variance in R
# get mean-squared error in 3 ways
> n <- length(mtcars$mpg)
> p <- length(coef(mod)) - 1
> smod$sigma^2
[1] 7.868009
> sum((mod$residuals)^2) / (n - p - 1)
[1] 7.868009
> sum((mtcars$mpg - mod$fitted.values)^2) / (n - p - 1)
[1] 7.868009

# get MLE of error variance
> smod$sigma^2 * (n - p - 1) / n
[1] 6.638633
Summary of Results
Given the model assumptions, we have

    b̂ ~ N(b, σ²(X'X)^{−1})
    ŷ ~ N(Xb, σ²H)
    ê ~ N(0, σ²(I_n − H))

Typically σ² is unknown, so we use the MSE σ̂² in practice.
ANOVA Table and Regression F Test
We typically organize the SS information into an ANOVA table:

    Source  SS                       df         MS    F    p-value
    SSR     Σ_{i=1}^n (ŷ_i − ȳ)²    p          MSR   F*   p*
    SSE     Σ_{i=1}^n (y_i − ŷ_i)²  n − p − 1  MSE
    SST     Σ_{i=1}^n (y_i − ȳ)²    n − 1

where MSR = SSR/p, MSE = SSE/(n − p − 1), F* = MSR/MSE ~ F(p, n − p − 1), and p* = P(F(p, n − p − 1) > F*).

The F*-statistic and p*-value test H0: b_1 = ... = b_p = 0 versus H1: b_k ≠ 0 for some k ∈ {1, ..., p}.
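Assembling the overall F statistic from the table entries is mechanical; this NumPy sketch (simulated data, not from the notes) computes F* = MSR/MSE, leaving the p-value to an F-distribution routine:

```python
# Sketch: building the overall regression F statistic from SSR and SSE.
import numpy as np

rng = np.random.default_rng(4)
n, p = 60, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ np.array([2.0, 1.0, -1.0, 0.5]) + rng.normal(size=n)

H = X @ np.linalg.solve(X.T @ X, X.T)
yhat, ybar = H @ y, y.mean()
SSR = np.sum((yhat - ybar) ** 2)
SSE = np.sum((y - yhat) ** 2)

MSR, MSE = SSR / p, SSE / (n - p - 1)
Fstar = MSR / MSE        # compare to the F(p, n-p-1) distribution for p*
print(Fstar > 1)
```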
Inferences about b̂_j with σ² Known
If σ² is known, form 100(1 − α)% CIs using

    b̂_0 ± Z_{α/2} σ_{b0}        b̂_j ± Z_{α/2} σ_{bj}

where
- Z_{α/2} is the normal quantile such that P(X > Z_{α/2}) = α/2
- σ_{b0} and σ_{bj} are the square roots of the diagonals of V(b̂) = σ²(X'X)^{−1}

To test H0: b_j = b*_j vs. H1: b_j ≠ b*_j (for some j ∈ {0, 1, ..., p}) use

    Z = (b̂_j − b*_j) / σ_{bj}

which follows a standard normal distribution under H0.
Inferences about b̂_j with σ² Unknown
If σ² is unknown, form 100(1 − α)% CIs using

    b̂_0 ± t_{n−p−1}(α/2) σ̂_{b0}        b̂_j ± t_{n−p−1}(α/2) σ̂_{bj}

where
- t_{n−p−1}(α/2) is the t_{n−p−1} quantile with P(X > t_{n−p−1}(α/2)) = α/2
- σ̂_{b0} and σ̂_{bj} are the square roots of the diagonals of V̂(b̂) = σ̂²(X'X)^{−1}

To test H0: b_j = b*_j vs. H1: b_j ≠ b*_j (for some j ∈ {0, 1, ..., p}) use

    T = (b̂_j − b*_j) / σ̂_{bj}

which follows a t_{n−p−1} distribution under H0.
Coefficient Inference in R
> summary(mod)

Call:
lm(formula = mpg ~ cyl + am + carb, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max
-5.9074 -1.1723  0.2538  1.4851  5.4728

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  25.3203     1.2238  20.690  < 2e-16 ***
cyl6         -3.5494     1.7296  -2.052 0.049959 *
cyl8         -6.9046     1.8078  -3.819 0.000712 ***
am            4.2268     1.3499   3.131 0.004156 **
carb         -1.1199     0.4354  -2.572 0.015923 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.805 on 27 degrees of freedom
Multiple R-squared: 0.8113, Adjusted R-squared: 0.7834
F-statistic: 29.03 on 4 and 27 DF, p-value: 1.991e-09

> confint(mod)
                 2.5 %        97.5 %
(Intercept)  22.809293 27.8313132711
cyl6         -7.098164 -0.0006745487
cyl8        -10.613981 -3.1952927942
Inferences about Multiple b̂_j
Assume that q < p and we want to test if a reduced model is sufficient:

    H0: b_{q+1} = b_{q+2} = ... = b_p = b*
    H1: at least one b_k ≠ b*

Compare the SSE for the full and reduced (constrained) models:
(a) Full Model:    y_i = b_0 + Σ_{j=1}^p b_j x_{ij} + e_i
(b) Reduced Model: y_i = b_0 + Σ_{j=1}^q b_j x_{ij} + b* Σ_{k=q+1}^p x_{ik} + e_i

Note: set b* = 0 to remove X_{q+1}, ..., X_p from the model.
Inferences about Multiple b̂_j (continued)
Test statistic:

    F* = [(SSE_R − SSE_F) / (df_R − df_F)] / [SSE_F / df_F]
       = [(SSE_R − SSE_F) / ((n − q − 1) − (n − p − 1))] / [SSE_F / (n − p − 1)]
       ~ F(p − q, n − p − 1)

where
- SSE_R is the sum-of-squares error for the reduced model
- SSE_F is the sum-of-squares error for the full model
- df_R is the error degrees of freedom for the reduced model
- df_F is the error degrees of freedom for the full model
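The reduced-vs-full comparison can be sketched in NumPy (simulated data, not from the notes); here the last p − q predictors are dropped (the b* = 0 case) and F* is formed from the two error sums of squares:

```python
# Sketch: nested-model F statistic from full and reduced OLS fits.
import numpy as np

rng = np.random.default_rng(5)
n, p, q = 50, 4, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
# simulate so that only the first q predictors matter
y = X[:, :q + 1] @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

def sse(Xmat, y):
    """Residual sum of squares from an OLS fit of y on Xmat."""
    H = Xmat @ np.linalg.solve(Xmat.T @ Xmat, Xmat.T)
    return y @ (np.eye(len(y)) - H) @ y

SSE_F = sse(X, y)              # full model: all p predictors
SSE_R = sse(X[:, :q + 1], y)   # reduced model: first q predictors only
df_F, df_R = n - p - 1, n - q - 1

Fstar = ((SSE_R - SSE_F) / (df_R - df_F)) / (SSE_F / df_F)
print(SSE_R >= SSE_F, Fstar >= 0)
```

Because the models are nested, SSE_R ≥ SSE_F always holds, so F* is nonnegative by construction.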
Inferences about Linear Combinations of b̂_j
Assume that c = (c_1, ..., c_{p+1})' and we want to test:

    H0: c'b = b*
    H1: c'b ≠ b*

Test statistic:

    t* = (c'b̂ − b*) / (σ̂ √(c'(X'X)^{−1}c)) ~ t_{n−p−1}
Confidence Interval for σ²
Note that

    (n − p − 1)σ̂² / σ² = SSE / σ² = Σ_{i=1}^n ê_i² / σ² ~ χ²_{n−p−1}

This implies that

    χ²_{(n−p−1; 1−α/2)} < (n − p − 1)σ̂² / σ² < χ²_{(n−p−1; α/2)}

where P(Q > χ²_{(n−p−1; α/2)}) = α/2, so a 100(1 − α)% CI is given by

    (n − p − 1)σ̂² / χ²_{(n−p−1; α/2)} < σ² < (n − p − 1)σ̂² / χ²_{(n−p−1; 1−α/2)}
Interval Estimation
Idea: estimate the expected value of the response for a given predictor score.

Given x_h = (1, x_{h1}, ..., x_{hp}), the fitted value is ŷ_h = x_h b̂.

The variance of ŷ_h is given by σ²_{ȳh} = V(x_h b̂) = x_h V(b̂) x_h' = σ² x_h (X'X)^{−1} x_h'
- Use σ̂²_{ȳh} = σ̂² x_h (X'X)^{−1} x_h' if σ² is unknown

We can test H0: E(y_h) = y*_h vs. H1: E(y_h) ≠ y*_h
- Test statistic: T = (ŷ_h − y*_h) / σ̂_{ȳh}, which follows a t_{n−p−1} distribution
- 100(1 − α)% CI for E(y_h): ŷ_h ± t_{n−p−1}(α/2) σ̂_{ȳh}
Predicting New Observations
Idea: estimate the observed value of the response for a given predictor score.
- Note: interested in the actual y_h value instead of E(y_h)

Given x_h = (1, x_{h1}, ..., x_{hp}), the fitted value is ŷ_h = x_h b̂.
- Note: same point estimate as in interval estimation

When predicting a new observation, there are two uncertainties:
- location of the distribution of Y for X_1, ..., X_p (captured by σ²_{ȳh})
- variability within the distribution of Y (captured by σ²)
Predicting New Observations (continued)
The two sources of variance are independent, so σ²_{yh} = σ²_{ȳh} + σ².
- Use σ̂²_{yh} = σ̂²_{ȳh} + σ̂² if σ² is unknown

We can test H0: y_h = y*_h vs. H1: y_h ≠ y*_h
- Test statistic: T = (ŷ_h − y*_h) / σ̂_{yh}, which follows a t_{n−p−1} distribution
- 100(1 − α)% Prediction Interval (PI) for y_h: ŷ_h ± t_{n−p−1}(α/2) σ̂_{yh}
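The two variances above drive the difference between confidence and prediction intervals; this NumPy sketch (simulated data and a hypothetical profile x_h, not from the notes) computes both and shows the prediction variance is always the larger one:

```python
# Sketch: variance of the estimated mean at x_h vs. the prediction variance.
import numpy as np

rng = np.random.default_rng(6)
n, p = 40, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ np.array([1.0, 0.5, -0.5]) + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
H = X @ XtX_inv @ X.T
sigma2_hat = (y @ (np.eye(n) - H) @ y) / (n - p - 1)   # MSE

xh = np.array([1.0, 0.3, -0.2])            # hypothetical predictor profile
var_mean = sigma2_hat * xh @ XtX_inv @ xh  # variance of the estimated mean E(y_h)
var_pred = var_mean + sigma2_hat           # adds the new observation's own noise
print(var_pred > var_mean)
```

This is why a prediction interval is always wider than the confidence interval at the same x_h.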
Confidence and Prediction Intervals in R
# confidence interval
> newdata <- data.frame(cyl=factor(6, levels=c(4,6,8)), am=1, carb=4)
> predict(mod, newdata, interval="confidence")
fit lwr upr
1 21.51824 18.92554 24.11094
# prediction interval
> newdata <- data.frame(cyl=factor(6, levels=c(4,6,8)), am=1, carb=4)
> predict(mod, newdata, interval="prediction")
fit lwr upr
1 21.51824 15.20583 27.83065
Simultaneous Confidence Regions
Given the distribution of b̂ (and some probability theory), we have that

    (b̂ − b)'X'X(b̂ − b) / σ² ~ χ²_{p+1}   and   (n − p − 1)σ̂² / σ² ~ χ²_{n−p−1}

which implies that

    (b̂ − b)'X'X(b̂ − b) / [(p + 1)σ̂²] ~ [χ²_{p+1}/(p + 1)] / [χ²_{n−p−1}/(n − p − 1)] ≡ F(p + 1, n − p − 1)

To form a 100(1 − α)% confidence region (CR), use limits such that

    (b̂ − b)'X'X(b̂ − b) ≤ (p + 1)σ̂² F_{(p+1, n−p−1)}(α)

where F_{(p+1, n−p−1)}(α) is the critical value for significance level α.
Simultaneous Confidence Regions in R
[Figure: joint confidence ellipses for the cyl6 and cyl8 coefficients at α = 0.1 (left panel) and α = 0.01 (right panel).]

> library(car)   # confidenceEllipse() is from the car package
> dev.new(height=4, width=8, noRStudioGD=TRUE)
> par(mfrow=c(1,2))
> confidenceEllipse(mod, c(2,3), levels=.9, xlim=c(-11,3), ylim=c(-14,0),
+                   main=expression(alpha*" = "*.1), cex.main=2)
Multivariate Linear Regression
MvLR Model: Scalar Form
The multivariate (multiple) linear regression model has the form

    y_{ik} = b_{0k} + Σ_{j=1}^p b_{jk} x_{ij} + e_{ik}

for i ∈ {1, ..., n} and k ∈ {1, ..., m}, where
- y_{ik} ∈ R is the k-th real-valued response for the i-th observation
- b_{0k} ∈ R is the regression intercept for the k-th response
- b_{jk} ∈ R is the j-th predictor's regression slope for the k-th response
- x_{ij} ∈ R is the j-th predictor for the i-th observation
- (e_{i1}, ..., e_{im}) ~iid N(0_m, Σ) is a multivariate Gaussian error vector
MvLR Model: Nomenclature
The model is multivariate because we have m > 1 response variables.

The model is multiple because we have p > 1 predictors.
- If p = 1, we have a multivariate simple linear regression model.

The model is linear because y_{ik} is a linear function of the parameters (the b_{jk} are the parameters for j ∈ {0, 1, ..., p} and k ∈ {1, ..., m}).

The model is a regression model because we are modeling response variables (Y_1, ..., Y_m) as a function of predictor variables (X_1, ..., X_p).
MvLR Model: Assumptions
The fundamental assumptions of the MvLR model are:
1. The relationship between X_j and Y_k is linear (given the other predictors)
2. x_{ij} and y_{ik} are observed random variables (known constants)
3. (e_{i1}, ..., e_{im}) ~iid N(0_m, Σ) is an unobserved random vector
4. b_k = (b_{0k}, b_{1k}, ..., b_{pk})' for k ∈ {1, ..., m} are unknown constants
5. (y_{ik} | x_{i1}, ..., x_{ip}) ~ N(b_{0k} + Σ_{j=1}^p b_{jk} x_{ij}, σ_{kk}) for each k ∈ {1, ..., m}
   note: homogeneity of variance for each response

Note: b_{jk} is the expected increase in Y_k for a 1-unit increase in X_j with all other predictor variables held constant.
MvLR Model: Matrix Form
The multivariate multiple linear regression model has the form

    Y = XB + E

where
- Y = [y_1, ..., y_m] ∈ R^{n×m} is the n×m response matrix
  • y_k = (y_{1k}, ..., y_{nk})' ∈ R^n is the k-th response vector (n×1)
- X = [1_n, x_1, ..., x_p] ∈ R^{n×(p+1)} is the n×(p+1) design matrix
  • 1_n is an n×1 vector of ones
  • x_j = (x_{1j}, ..., x_{nj})' ∈ R^n is the j-th predictor vector (n×1)
- B = [b_1, ..., b_m] ∈ R^{(p+1)×m} is the (p+1)×m matrix of coefficients
  • b_k = (b_{0k}, b_{1k}, ..., b_{pk})' ∈ R^{p+1} is the k-th coefficient vector ((p+1)×1)
- E = [e_1, ..., e_m] ∈ R^{n×m} is the n×m error matrix
  • e_k = (e_{1k}, ..., e_{nk})' ∈ R^n is the k-th error vector (n×1)
MvLR Model: Matrix Form (another look)
Matrix form writes the MvLR model for all nm points simultaneously:

    Y = XB + E

    [ y_11 ... y_1m ]   [ 1  x_11  x_12  ...  x_1p ] [ b_01 ... b_0m ]   [ e_11 ... e_1m ]
    [ y_21 ... y_2m ]   [ 1  x_21  x_22  ...  x_2p ] [ b_11 ... b_1m ]   [ e_21 ... e_2m ]
    [ y_31 ... y_3m ] = [ 1  x_31  x_32  ...  x_3p ] [ b_21 ... b_2m ] + [ e_31 ... e_3m ]
    [  ...  ...  ...]   [ ...   ...   ...  ...  ...] [  ...  ...  ...]   [  ...  ...  ...]
    [ y_n1 ... y_nm ]   [ 1  x_n1  x_n2  ...  x_np ] [ b_p1 ... b_pm ]   [ e_n1 ... e_nm ]
MvLR Model: Assumptions (revisited)
Assuming that the n subjects are independent, we have that
- e_k ~ N(0_n, σ_{kk}I_n), where e_k is the k-th column of E
- e_i ~iid N(0_m, Σ), where e_i is the i-th row of E
- vec(E) ~ N(0_{nm}, Σ ⊗ I_n), where ⊗ denotes the Kronecker product
- vec(E') ~ N(0_{nm}, I_n ⊗ Σ)

The response matrix is multivariate normal given X:

    (vec(Y) | X) ~ N([B' ⊗ I_n]vec(X), Σ ⊗ I_n)
    (vec(Y') | X) ~ N([I_n ⊗ B']vec(X'), I_n ⊗ Σ)

where [B' ⊗ I_n]vec(X) = vec(XB) and [I_n ⊗ B']vec(X') = vec(B'X').
MvLR Model: Mean and Covariance
Note that the assumed mean vector for vec(Y') is

    [I_n ⊗ B']vec(X') = vec(B'X') = [ B'x_1 ; ... ; B'x_n ]

where x_i is the i-th row of X.

The assumed covariance matrix for vec(Y') is block diagonal:

    I_n ⊗ Σ = [ Σ        0_{m×m}  ...  0_{m×m} ]
              [ 0_{m×m}  Σ        ...  0_{m×m} ]
              [ ...      ...      ...  ...     ]
              [ 0_{m×m}  0_{m×m}  ...  Σ       ]
Ordinary Least Squares
The ordinary least squares (OLS) problem is

    min_{B ∈ R^{(p+1)×m}} ||Y − XB||² = min_{B ∈ R^{(p+1)×m}} Σ_{i=1}^n Σ_{k=1}^m (y_{ik} − b_{0k} − Σ_{j=1}^p b_{jk} x_{ij})²

where ||·|| denotes the Frobenius norm. Expanding the objective,

    OLS(B) = ||Y − XB||² = tr(Y'Y) − 2 tr(Y'XB) + tr(B'X'XB)

    ∂OLS(B)/∂B = −2X'Y + 2X'XB

The OLS solution has the form

    B̂ = (X'X)^{−1}X'Y  ⟺  b̂_k = (X'X)^{−1}X'y_k

where b_k and y_k denote the k-th columns of B and Y, respectively.
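The stated column-wise equivalence can be checked numerically; this NumPy sketch (simulated data, not from the notes) fits all m responses at once and compares against separate single-response OLS fits:

```python
# Sketch: multivariate OLS Bhat = (X'X)^{-1} X'Y equals per-column OLS fits.
import numpy as np

rng = np.random.default_rng(7)
n, p, m = 50, 3, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
B_true = rng.normal(size=(p + 1, m))
Y = X @ B_true + rng.normal(scale=0.1, size=(n, m))

Bhat = np.linalg.solve(X.T @ X, X.T @ Y)   # all m responses in one solve

# column-by-column fits give the same coefficients
for k in range(m):
    bk = np.linalg.solve(X.T @ X, X.T @ Y[:, k])
    print(np.allclose(Bhat[:, k], bk))
```

So multivariate OLS estimation is m separate univariate regressions sharing one design matrix; the multivariate structure matters for inference (through Σ), not for the point estimates.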
Fitted Values and Residuals
SCALAR FORM:
Fitted values are given by ŷ_{ik} = b̂_{0k} + Σ_{j=1}^p b̂_{jk} x_{ij}
and residuals are given by ê_{ik} = y_{ik} − ŷ_{ik}

MATRIX FORM:
Fitted values are given by Ŷ = XB̂
and residuals are given by Ê = Y − Ŷ
Hat Matrix
Note that we can write the fitted values as

    Ŷ = XB̂ = X(X'X)^{−1}X'Y = HY

where H = X(X'X)^{−1}X' is the hat matrix.

H is a symmetric and idempotent matrix: HH = H.
H projects y_k onto the column space of X for each k ∈ {1, ..., m}.
Multivariate Regression Example in R
> data(mtcars)
> head(mtcars)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
> mtcars$cyl <- factor(mtcars$cyl)
> Y <- as.matrix(mtcars[,c("mpg","disp","hp","wt")])
> mvmod <- lm(Y ~ cyl + am + carb, data=mtcars)
> coef(mvmod)
                  mpg      disp         hp         wt
(Intercept) 25.320303 134.32487 46.5201421  2.7612069
cyl6        -3.549419  61.84324  0.9116288  0.1957229
cyl8        -6.904637 218.99063 87.5910956  0.7723077
am           4.226774 -43.80256  4.4472569 -1.0254749
carb        -1.119855   1.72629 21.2764930  0.1749132
Sums-of-Squares and Crossproducts: Vector Form
In MvLR models, the relevant sums-of-squares and crossproducts (SSCP) matrices are
- Total:      SSCP_T = Σ_{i=1}^n (y_i − ȳ)(y_i − ȳ)'
- Regression: SSCP_R = Σ_{i=1}^n (ŷ_i − ȳ)(ŷ_i − ȳ)'
- Error:      SSCP_E = Σ_{i=1}^n (y_i − ŷ_i)(y_i − ŷ_i)'

where y_i and ŷ_i denote the i-th rows of Y and Ŷ = XB̂, respectively.

The corresponding degrees of freedom are
- SSCP_T: df_T = m(n − 1)
- SSCP_R: df_R = mp
- SSCP_E: df_E = m(n − p − 1)
Sums-of-Squares and Crossproducts: Matrix Form
In MvLR models, the relevant sums-of-squares and crossproducts are

    SSCP_T = Σ_{i=1}^n (y_i − ȳ)(y_i − ȳ)'     = Y'[I_n − (1/n)J]Y
    SSCP_R = Σ_{i=1}^n (ŷ_i − ȳ)(ŷ_i − ȳ)'     = Y'[H − (1/n)J]Y
    SSCP_E = Σ_{i=1}^n (y_i − ŷ_i)(y_i − ŷ_i)' = Y'[I_n − H]Y

where J = 1_n 1_n' is the n×n matrix of ones.
Partitioning the SSCP Total Matrix
We can partition the total covariation in y_i as

    SSCP_T = Σ_{i=1}^n (y_i − ȳ)(y_i − ȳ)'
           = Σ_{i=1}^n (y_i − ŷ_i + ŷ_i − ȳ)(y_i − ŷ_i + ŷ_i − ȳ)'
           = Σ_{i=1}^n (ŷ_i − ȳ)(ŷ_i − ȳ)' + Σ_{i=1}^n (y_i − ŷ_i)(y_i − ŷ_i)' + 2 Σ_{i=1}^n (ŷ_i − ȳ)(y_i − ŷ_i)'
           = SSCP_R + SSCP_E + 2 Σ_{i=1}^n (ŷ_i − ȳ)ê_i'
           = SSCP_R + SSCP_E
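The SSCP partition above can be verified numerically; this NumPy sketch (simulated data, not from the notes) uses the quadratic-form versions of the three matrices:

```python
# Sketch: verifying SSCP_T = SSCP_R + SSCP_E via Y'[I-(1/n)J]Y, etc.
import numpy as np

rng = np.random.default_rng(8)
n, p, m = 30, 2, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
Y = X @ rng.normal(size=(p + 1, m)) + rng.normal(size=(n, m))

H = X @ np.linalg.solve(X.T @ X, X.T)  # hat matrix
J = np.ones((n, n))                    # n x n matrix of ones

SSCP_T = Y.T @ (np.eye(n) - J / n) @ Y
SSCP_R = Y.T @ (H - J / n) @ Y
SSCP_E = Y.T @ (np.eye(n) - H) @ Y
print(np.allclose(SSCP_T, SSCP_R + SSCP_E))
```

This mirrors the `SSCP.T` vs. `SSCP.R + SSCP.E` check done in R on the following slide.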
Multivariate Regression SSCP in R
> ybar <- colMeans(Y)
> n <- nrow(Y)
> m <- ncol(Y)
> Ybar <- matrix(ybar, n, m, byrow=TRUE)
> SSCP.T <- crossprod(Y - Ybar)
> SSCP.R <- crossprod(mvmod$fitted.values - Ybar)
> SSCP.E <- crossprod(Y - mvmod$fitted.values)
> SSCP.T
              mpg      disp         hp        wt
mpg    1126.0472 -19626.01  -9942.694 -158.61723
disp -19626.0134 476184.79 208355.919 3338.21032
hp    -9942.6938 208355.92 145726.875 1369.97250
wt     -158.6172   3338.21   1369.972   29.67875
> SSCP.R + SSCP.E
              mpg      disp         hp        wt
mpg    1126.0472 -19626.01  -9942.694 -158.61723
disp -19626.0134 476184.79 208355.919 3338.21033
hp    -9942.6938 208355.92 145726.875 1369.97250
wt     -158.6172   3338.21   1369.973   29.67875
Relation to ML Solution
Remember that (y_i | x_i) ~ N(B'x_i, Σ), which implies that y_i has pdf

    f(y_i | x_i, B, Σ) = (2π)^{−m/2} |Σ|^{−1/2} exp{ −(1/2)(y_i − B'x_i)'Σ^{−1}(y_i − B'x_i) }

where y_i and x_i denote the i-th rows of Y and X, respectively.

As a result, the log-likelihood of B given (Y, X, Σ) is

    ln{L(B | Y, X, Σ)} = −(1/2) Σ_{i=1}^n (y_i − B'x_i)'Σ^{−1}(y_i − B'x_i) + c

where c is a constant that does not depend on B.
Relation to ML Solution (continued)
The maximum likelihood estimate (MLE) of B is the estimate satisfying

    max_{B ∈ R^{(p+1)×m}} −(1/2) Σ_{i=1}^n (y_i − B'x_i)'Σ^{−1}(y_i − B'x_i)

and note that (y_i − B'x_i)'Σ^{−1}(y_i − B'x_i) = tr{Σ^{−1}(y_i − B'x_i)(y_i − B'x_i)'}.

Taking the derivative with respect to B, we see that

    ∂MLE(B)/∂B = −2 Σ_{i=1}^n x_i y_i'Σ^{−1} + 2 Σ_{i=1}^n x_i x_i'BΣ^{−1}
               = −2X'YΣ^{−1} + 2X'XBΣ^{−1}

Setting this derivative to zero gives the same solution as OLS: B̂ = (X'X)^{−1}X'Y.
Estimated Error Covariance
The estimated error covariance matrix is

    Σ̂ = SSCP_E / (n − p − 1)
       = Σ_{i=1}^n (y_i − ŷ_i)(y_i − ŷ_i)' / (n − p − 1)
       = Y'(I_n − H)Y / (n − p − 1)

which is an unbiased estimate of the error covariance matrix Σ.

The estimate Σ̂ is the mean SSCP error of the model.
Maximum Likelihood Estimate of Error Covariance
Σ̃ = (1/n)Y'(I_n − H)Y is the MLE of Σ.

From our previous results using Σ̂, we have that

    E(Σ̃) = [(n − p − 1)/n] Σ

Consequently, the bias of the estimator Σ̃ is given by

    [(n − p − 1)/n] Σ − Σ = −[(p + 1)/n] Σ

and note that −[(p + 1)/n] Σ → 0 as n → ∞.
Comparing Σ̂ and Σ̃
Reminder: the mean SSCP error and MLE of Σ are given by

    Σ̂ = Y'(I_n − H)Y / (n − p − 1)
    Σ̃ = Y'(I_n − H)Y / n

From the definitions of Σ̂ and Σ̃ we have that

    σ̃_{kk} < σ̂_{kk} for all k

where σ̂_{kk} and σ̃_{kk} denote the k-th diagonals of Σ̂ and Σ̃, respectively. The MLE produces smaller estimates of the error variances.
Estimated Error Covariance Matrix in R
> n <- nrow(Y)
> p <- nrow(coef(mvmod)) - 1
> SSCP.E <- crossprod(Y - mvmod$fitted.values)
> SigmaHat <- SSCP.E / (n - p - 1)
> SigmaTilde <- SSCP.E / n
> SigmaHat
             mpg        disp          hp         wt
mpg    7.8680094   -53.27166 -19.7015979 -0.6575443
disp -53.2716607  2504.87095 425.1328988 18.1065416
hp   -19.7015979   425.13290 577.2703337  0.4662491
wt    -0.6575443    18.10654   0.4662491  0.2573503
> SigmaTilde
            mpg       disp          hp         wt
mpg    6.638633  -44.94796 -16.6232233 -0.5548030
disp -44.947964 2113.48487 358.7058833 15.2773945
hp   -16.623223  358.70588 487.0718440  0.3933977
wt    -0.554803   15.27739   0.3933977  0.2171394
Expected Value of Least Squares Coefficients
The expected value of the estimated coefficients is given by

    E(B̂) = E[(X'X)^{−1}X'Y]
          = (X'X)^{−1}X'E(Y)
          = (X'X)^{−1}X'XB
          = B

so B̂ is an unbiased estimator of B.
Covariance Matrix of Least Squares Coefficients
The covariance matrix of the estimated coefficients is given by

    V{vec(B̂')} = V{vec(Y'X(X'X)^{−1})}
                = V{[(X'X)^{−1}X' ⊗ I_m]vec(Y')}
                = [(X'X)^{−1}X' ⊗ I_m] V{vec(Y')} [(X'X)^{−1}X' ⊗ I_m]'
                = [(X'X)^{−1}X' ⊗ I_m][I_n ⊗ Σ][X(X'X)^{−1} ⊗ I_m]
                = [(X'X)^{−1}X' ⊗ I_m][X(X'X)^{−1} ⊗ Σ]
                = (X'X)^{−1} ⊗ Σ

Note: we could also write V{vec(B̂)} = Σ ⊗ (X'X)^{−1}.
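The Kronecker algebra in this derivation can be checked numerically; this NumPy sketch (small random matrices, not from the notes) forms A = (X'X)^{−1}X' ⊗ I_m and verifies that A[I_n ⊗ Σ]A' collapses to (X'X)^{−1} ⊗ Σ:

```python
# Sketch: the mixed-product property gives A [I_n (x) Sigma] A' = (X'X)^{-1} (x) Sigma.
import numpy as np

rng = np.random.default_rng(9)
n, p, m = 8, 2, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
S = rng.normal(size=(m, m))
Sigma = S @ S.T + np.eye(m)          # a valid (positive definite) covariance

XtX_inv = np.linalg.inv(X.T @ X)
A = np.kron(XtX_inv @ X.T, np.eye(m))   # the linear map from vec(Y') to vec(Bhat')

V = A @ np.kron(np.eye(n), Sigma) @ A.T
print(np.allclose(V, np.kron(XtX_inv, Sigma)))
```

The collapse follows from (A ⊗ B)(C ⊗ D) = AC ⊗ BD applied twice, exactly as in the derivation above.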
Distribution of Coefficients
The estimated regression coefficients are a linear function of Y, so we know that B̂ follows a multivariate normal distribution:

    vec(B̂) ~ N[vec(B), Σ ⊗ (X'X)^{−1}]
    vec(B̂') ~ N[vec(B'), (X'X)^{−1} ⊗ Σ]

The covariance between two columns of B̂ has the form

    Cov(b̂_k, b̂_ℓ) = σ_{kℓ}(X'X)^{−1}

and the covariance between two rows of B̂ has the form

    Cov(b̂_g, b̂_j) = (X'X)^{−1}_{gj} Σ

where (X'X)^{−1}_{gj} denotes the (g,j)-th element of (X'X)^{−1}.