Multivariate Analysis

(1)

ANOVA and MANOVA

Prof. Dr. Anselmo E de Oliveira anselmo.quimica.ufg.br

(2)

ANOVA

• ANalysis Of VAriance

– separate and estimate the different sources of variation

– test whether changing a controlled factor leads to a significant difference between the mean values obtained

– a single factor (controlled or random) in addition to the random measurement error → one-way ANOVA

(3)

How ANOVA Works

• ANOVA essentially splits the variability into between-group variability and within-group variability, and compares the two

• The larger the former is relative to the latter, the stronger the evidence that there is variability between groups

(4)

How ANOVA Works

• Define the total sum of squares, SQ_T, computed from all the data, where $\bar{x}$ is the overall sample mean

$$SQ_T = \sum_i (x_i - \bar{x})^2$$

• Note that the usual estimate of the variance of a sample is a mean square (MQ)

$$\sigma^2 = \frac{SQ_T}{N - 1} = MQ$$

• SQ_T can be split as

$$SQ_T = SQ_D + SQ_E$$

where SQ_D is the within-group sum of squares (D from "dentro", within) and SQ_E is the between-group sum of squares (E from "entre", between)

(5)

How ANOVA Works

• SQ_D and SQ_E are defined as

$$SQ_D = \sum_{gp1} (x_i - \bar{x}_1)^2 + \sum_{gp2} (x_i - \bar{x}_2)^2 + \cdots$$

where $\bar{x}_k$ is the sample mean of group k, and

$$SQ_E = n_1 (\bar{x}_1 - \bar{x})^2 + n_2 (\bar{x}_2 - \bar{x})^2 + \cdots$$

where $n_k$ is the sample size of group k
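A minimal Python sketch of these definitions, assuming each group is given as a plain list of numbers. The function names (`sq_within`, `sq_between`, `sq_total`) and the two small groups are illustrative and do not come from the slides. The last lines check the decomposition SQ_T = SQ_D + SQ_E on that data.

```python
from statistics import fmean

def sq_within(groups):
    """SQ_D: squared deviations of each value from its own group mean."""
    total = 0.0
    for g in groups:
        mean_k = fmean(g)
        total += sum((x - mean_k) ** 2 for x in g)
    return total

def sq_between(groups):
    """SQ_E: group size times squared deviation of group mean from overall mean."""
    grand_mean = fmean([x for g in groups for x in g])
    return sum(len(g) * (fmean(g) - grand_mean) ** 2 for g in groups)

def sq_total(groups):
    """SQ_T: squared deviations of every value from the overall mean."""
    values = [x for g in groups for x in g]
    grand_mean = fmean(values)
    return sum((x - grand_mean) ** 2 for x in values)

# Hypothetical data: two groups of unequal size.
groups = [[1.0, 2.0, 3.0], [6.0, 8.0]]
sq_d = sq_within(groups)
sq_e = sq_between(groups)
sq_t = sq_total(groups)
print(sq_d, sq_e, sq_t)  # SQ_D + SQ_E should equal SQ_T
```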

(6)

How ANOVA Works

• After partitioning the variability, independent estimates of the common population variance $\sigma^2$ can be obtained from SQ_E and SQ_D. These estimates are called mean squares (MQ), and are given by

$$\sigma^2_{entre} = \frac{SQ_E}{m - 1} \qquad \sigma^2_{dentro} = \frac{SQ_D}{N - m}$$

where m is the number of groups, N is the total sample size, and the subscripts "entre" and "dentro" mean between and within groups
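The two divisors are the degrees of freedom of each sum of squares, and they add up in the same way as the sums of squares do. As a short worked identity (added for clarity, not on the original slide), with an illustrative case of m = 4 groups and N = 12 values:

```latex
(N - 1) = (m - 1) + (N - m), \qquad \text{e.g. } 12 - 1 = (4 - 1) + (12 - 4) = 3 + 8
```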

(7)

Comparing Several Means

• Problem: stability of a fluorescent reagent

sample   conditions                 measurements     mean $\bar{x}_m$
A        freshly prepared           102, 100, 101    101
B        stored 1 h in the dark     101, 101, 104    102
C        stored 1 h in the shade    97, 95, 99       97
D        stored 1 h in the light    90, 92, 94       92

Overall mean: $\bar{x} = 98$
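The sample means and the overall mean in this table can be reproduced with a few lines of Python. This is only a check of the arithmetic; the dictionary name `readings` is an illustrative choice.

```python
from statistics import fmean

# Fluorescence readings from the table, keyed by sample.
readings = {
    "A": [102, 100, 101],
    "B": [101, 101, 104],
    "C": [97, 95, 99],
    "D": [90, 92, 94],
}

# Mean of each sample and mean of all 12 readings.
group_means = {name: fmean(values) for name, values in readings.items()}
grand_mean = fmean([x for values in readings.values() for x in values])

print(group_means)
print(grand_mean)
```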

(8)

Comparing Several Means

• SQ_D

$$SQ_D = \sum_{gp1} (x_i - \bar{x}_1)^2 + \sum_{gp2} (x_i - \bar{x}_2)^2 + \cdots$$

$$SQ_D = \underbrace{(102 - 101)^2 + (100 - 101)^2 + (101 - 101)^2}_{\text{sample A}} + \underbrace{(101 - 102)^2 + (101 - 102)^2 + (104 - 102)^2}_{\text{sample B}}$$

$$\qquad + \underbrace{(97 - 97)^2 + (95 - 97)^2 + (99 - 97)^2}_{\text{sample C}} + \underbrace{(90 - 92)^2 + (92 - 92)^2 + (94 - 92)^2}_{\text{sample D}}$$

$$SQ_D = 2 + 6 + 8 + 8 = 24$$
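The per-sample contributions to SQ_D can be checked numerically. The sketch below recomputes each sample's term from the raw readings; the variable names are illustrative.

```python
# Fluorescence readings for samples A to D (from the example).
readings = {
    "A": [102, 100, 101],
    "B": [101, 101, 104],
    "C": [97, 95, 99],
    "D": [90, 92, 94],
}

# Sum of squared deviations from each sample's own mean.
contributions = {}
for name, values in readings.items():
    mean_k = sum(values) / len(values)
    contributions[name] = sum((x - mean_k) ** 2 for x in values)

sq_d = sum(contributions.values())
print(contributions, sq_d)
```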

(9)

Comparing Several Means

• SQ_E

$$SQ_E = n_1 (\bar{x}_1 - \bar{x})^2 + n_2 (\bar{x}_2 - \bar{x})^2 + \cdots$$

in this example $n_1 = n_2 = n_3 = n_4 = 3$

$$SQ_E = 3 (101 - 98)^2 + 3 (102 - 98)^2 + 3 (97 - 98)^2 + 3 (92 - 98)^2$$

$$SQ_E = 27 + 48 + 3 + 108 = 186$$
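The same between-sample terms can be obtained in Python from the sample means, the overall mean, and the common sample size. The numbers come from the example; the variable names are illustrative.

```python
# Sample means, overall mean and sample size from the example.
group_means = {"A": 101, "B": 102, "C": 97, "D": 92}
grand_mean = 98
n_k = 3  # every sample has three readings

# One term n_k * (mean_k - grand_mean)^2 per sample.
terms = {name: n_k * (mean_k - grand_mean) ** 2
         for name, mean_k in group_means.items()}
sq_e = sum(terms.values())
print(terms, sq_e)
```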

(10)

Comparing Several Means

• SQ_T

$$SQ_T = SQ_D + SQ_E$$

$$SQ_T = 24 + 186 = 210$$

• $\sigma^2_{entre} = \dfrac{SQ_E}{m - 1}$

$$\sigma^2_{entre} = \frac{186}{4 - 1} = 62$$

• $\sigma^2_{dentro} = \dfrac{SQ_D}{N - m}$

$$\sigma^2_{dentro} = \frac{24}{12 - 4} = 3$$
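As a cross-check, SQ_T can also be computed directly from all 12 readings instead of as a sum of the two parts. The sketch below does that and then forms the two mean squares, using SQ_D = 24 and SQ_E = 186 from the example. The variable names are illustrative.

```python
# All 12 readings of the example, sample by sample (A, B, C, D).
values = [102, 100, 101, 101, 101, 104, 97, 95, 99, 90, 92, 94]
m = 4                 # number of samples (groups)
n_total = len(values)

# SQ_T directly from its definition.
grand_mean = sum(values) / n_total
sq_t = sum((x - grand_mean) ** 2 for x in values)

# Mean squares from the two parts of the decomposition.
sq_d, sq_e = 24, 186
mq_between = sq_e / (m - 1)
mq_within = sq_d / (n_total - m)
print(sq_t, mq_between, mq_within)
```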

(11)

How ANOVA Works

• Hypothesis test

– null hypothesis, H₀: all samples belong to the same population with mean μ and variance $\sigma^2$

– estimates of $\sigma^2$

• within-sample variation: $\sigma^2_{dentro}$

• between-sample variation: $\sigma^2_{entre}$

(12)

How ANOVA Works

• Hypothesis test

null hypothesis, H₀:

true

• the estimates of $\sigma^2$ should not differ significantly

$$\sigma^2_{entre} - \sigma^2_{dentro} = 0$$

false

• $\sigma^2_{entre} > \sigma^2_{dentro}$

• $\dfrac{\sigma^2_{entre}}{\sigma^2_{dentro}} > 1$

(13)

How ANOVA Works

• Hypothesis test and F test

– to determine whether one value is greater than another → one-tailed F test

if F_calc > F_tab → the null hypothesis is rejected

since F_calc is greater than F_tab, the sample means differ significantly at the 95% confidence level

F table, P = 0.05

• ν is the number of degrees of freedom

• the tabulated critical values of F are always greater than 1
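The decision rule can be written out for the reagent example. The mean squares (62 and 3) come from the example. The critical value 4.066 is an assumption taken from a standard one-tailed F table at P = 0.05 for 3 and 8 degrees of freedom; it is not printed on the slide.

```python
# Mean squares from the fluorescent reagent example.
mq_between = 62.0   # between samples, 3 degrees of freedom
mq_within = 3.0     # within samples, 8 degrees of freedom

f_calc = mq_between / mq_within

# Assumed value from a standard F table (one-tailed, P = 0.05, nu1 = 3, nu2 = 8).
f_tab = 4.066

reject_h0 = f_calc > f_tab
print(round(f_calc, 1), reject_h0)
```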

(14)

ANOVA Table

source of variation    SQ     ν     MQ    F
between samples        186    3     62    20.7
within samples         24     8     3
total                  210    11

(15)

Octave

• Runs the F test to check whether the means are statistically equal

• Does not show the table

• Computes the value of F

• Computes the p-value

– H₀: are the means equal?

– p < 0.05 → the null hypothesis is rejected

• Computes gle (between-groups degrees of freedom) and gld (within-groups degrees of freedom)

• Matrix Y: groups in columns

(16)

Octave

> [p,F,gle,gld] = anova(Y)
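For readers without Octave, the following is a rough Python sketch that returns the same four quantities (p, F, gle, gld) for a matrix with groups in columns. It is not Octave's implementation. The helper names `f_upper_tail` and `one_way_anova` are hypothetical, and the p-value uses a simple Simpson integration of the incomplete beta function that assumes both degrees of freedom are at least 2.

```python
from math import exp, lgamma

def f_upper_tail(f_stat, d1, d2, steps=2000):
    """Approximate P(F > f_stat) for an F distribution with (d1, d2) d.o.f.

    Uses P(F > f) = I_x(d2/2, d1/2) with x = d2 / (d2 + d1 * f) and
    Simpson's rule for the incomplete beta integral.
    Assumes d1 >= 2 and d2 >= 2 so the integrand is finite on [0, x].
    """
    a, b = d2 / 2.0, d1 / 2.0
    x = d2 / (d2 + d1 * f_stat)

    def integrand(t):
        return t ** (a - 1.0) * (1.0 - t) ** (b - 1.0)

    h = x / steps  # steps must be even for Simpson's rule
    total = integrand(0.0) + integrand(x)
    for i in range(1, steps):
        total += (4.0 if i % 2 else 2.0) * integrand(i * h)
    log_beta = lgamma(a) + lgamma(b) - lgamma(a + b)
    return (total * h / 3.0) / exp(log_beta)

def one_way_anova(groups):
    """Return (p, F, gle, gld) for a list of groups (lists of numbers)."""
    values = [x for g in groups for x in g]
    n_total, m = len(values), len(groups)
    grand_mean = sum(values) / n_total
    means = [sum(g) / len(g) for g in groups]
    sq_e = sum(len(g) * (mk - grand_mean) ** 2 for g, mk in zip(groups, means))
    sq_d = sum((x - mk) ** 2 for g, mk in zip(groups, means) for x in g)
    gle, gld = m - 1, n_total - m
    f_stat = (sq_e / gle) / (sq_d / gld)
    return f_upper_tail(f_stat, gle, gld), f_stat, gle, gld

# Reagent example as a matrix with one group per column (A, B, C, D).
Y = [
    [102, 101, 97, 90],
    [100, 101, 95, 92],
    [101, 104, 99, 94],
]
groups = [list(column) for column in zip(*Y)]
p, F, gle, gld = one_way_anova(groups)
print(p, F, gle, gld)
```

For this data F is about 20.7 with 3 and 8 degrees of freedom, and p falls well below 0.05, so the null hypothesis of equal means is rejected.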

(17)

Java Application

(18)

MANOVA

MANOVA: Multivariate Analysis of Variance

(19)

MANOVA: What Kinds of Hypotheses Can it Test?

• A MANOVA, or multivariate analysis of variance, is a way to test the hypothesis that one or more independent variables (IV), or factors, have an effect on a set of two or more dependent variables (DV)

– For example, you might wish to test the hypothesis that sex and ethnicity interact to influence a set of job-related outcomes including attitudes toward co-workers, attitudes toward supervisors, feelings of belonging in the work environment, and identification with the corporate culture

– As another example, you might want to test the hypothesis that three different methods of teaching writing result in significant differences in ratings of student creativity, student acquisition of grammar, and assessments of writing quality by an independent panel of judges

(20)

Why Should You Do a MANOVA?

• You do a MANOVA instead of a series of one-at-a-time ANOVAs for two main reasons

– Supposedly to reduce the experiment-wise level of Type I error. Running 8 F tests at 0.05 each means the experiment-wise probability of making a Type I error (rejecting the null hypothesis when it is in fact true) can be as high as 40% (the bound 8 × 0.05; about 34% if the tests are independent). The so-called overall test or omnibus test protects against this inflated error probability only when the null hypothesis is true. If you follow up a significant multivariate test with a bunch of ANOVAs on the individual variables without adjusting the error rates for the individual tests, there’s no “protection”

– Another reason to do a MANOVA: none of the individual ANOVAs may produce a significant main effect on its DV, but in combination the DVs might, which suggests that the variables are more meaningful taken together than considered separately

• MANOVA takes into account the intercorrelations among the DVs
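The inflation of the experiment-wise error rate can be checked numerically. The sketch below assumes the 8 tests are independent (an assumption; the slide does not say so) and also shows the simple upper bound 8 × 0.05, which is where the 40% figure comes from. The per-test level shown last is a Bonferroni-style adjustment, included only as one common way of adjusting the individual tests.

```python
alpha = 0.05   # significance level of each individual F test
k = 8          # number of separate ANOVAs

# Probability of at least one Type I error if the k tests are independent.
fwer_independent = 1 - (1 - alpha) ** k

# Upper bound that holds regardless of dependence between tests.
upper_bound = min(1.0, k * alpha)

# Per-test level that keeps the overall rate at or below alpha (Bonferroni).
per_test_alpha = alpha / k

print(round(fwer_independent, 4), upper_bound, per_test_alpha)
```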

(21)

Type I and II error

• Type I error

– A type I error occurs when one rejects the null hypothesis when it is true

• Type II error

– A type II error occurs when one rejects the alternative hypothesis (fails to reject the null hypothesis) when the alternative hypothesis is true

If a cutoff value is used to decide between two means, moving it to decrease the Type I error rate will increase the Type II error rate (and vice versa)
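This trade-off can be illustrated with two normal distributions and a movable cutoff. The means, the standard deviation, and the cutoffs below are hypothetical values chosen only for illustration. The rule assumed here is "reject H₀ when the observed value exceeds the cutoff".

```python
from statistics import NormalDist

# Hypothetical populations under the null and alternative hypotheses.
null_dist = NormalDist(mu=100.0, sigma=2.0)
alt_dist = NormalDist(mu=105.0, sigma=2.0)

def error_rates(cutoff):
    """Return (Type I rate, Type II rate) for the rule 'reject H0 if x > cutoff'."""
    type_1 = 1.0 - null_dist.cdf(cutoff)  # H0 true, but value exceeds cutoff
    type_2 = alt_dist.cdf(cutoff)         # H1 true, but value stays below cutoff
    return type_1, type_2

cutoffs = (102.0, 103.0, 104.0)
rates = [error_rates(c) for c in cutoffs]
for c, (t1, t2) in zip(cutoffs, rates):
    print(f"cutoff={c}: type I={t1:.3f}, type II={t2:.3f}")
```

Raising the cutoff lowers the Type I rate and raises the Type II rate, and lowering it does the opposite.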

(22)

Assumptions of MANOVA

1. Multivariate normality

– All of the DVs must be distributed normally (can visualize this with histograms; tests are available for checking this out)

– Any linear combination of the DVs must be distributed normally

• Check out pairwise relationships among the DVs for nonlinear relationships using scatter plots

(23)

Assumptions of MANOVA

– All subsets of the variables must have a multivariate normal distribution

• These requirements are rarely if ever tested in practice

• MANOVA is assumed to be a robust test that can stand up to departures from multivariate normality in terms of Type I error rate

[Figure: log-likelihood density (log scale) of a correlated multivariate normal distribution]

(24)

Assumptions of MANOVA

– Statistical power (power to detect a main or interaction effect) may be reduced when distributions are very plateau-like (platykurtic)

– If the classes in the center of the distribution have more or less the same frequency, the resulting histogram looks like a plateau
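One way to quantify "plateau-like" is the excess kurtosis, which is negative for platykurtic data. The sketch below uses the moment-based definition (fourth central moment over squared variance, minus 3) on two small hypothetical samples. The function name and the data are illustrative, not from the slides.

```python
def excess_kurtosis(values):
    """Moment-based excess kurtosis: m4 / m2**2 - 3 (negative means platykurtic)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    return m4 / m2 ** 2 - 3.0

# Hypothetical samples: equal frequencies (plateau) versus a peaked centre.
plateau = [1, 2, 3, 4, 5]
peaked = [3, 3, 3, 3, 3, 3, 1, 5, 2, 4]

k_plateau = excess_kurtosis(plateau)
k_peaked = excess_kurtosis(peaked)
print(k_plateau, k_peaked)
```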

(25)

Assumptions of MANOVA, cont’d

2. Homogeneity of the covariance matrices

– In ANOVA we talked about the need for the variances of the dependent variable to be equal across levels of the independent variable

• In MANOVA, the univariate requirement of equal variances has to hold for each one of the dependent variables

– In MANOVA we extend this concept and require that the “covariance matrices” be homogeneous

• Computations in MANOVA require the use of matrix algebra, and each person’s “score” on the dependent variables is actually a “vector” of scores on DV1, DV2, DV3, …, DVn

• The covariance matrices (whose off-diagonal entries are the variance shared between any two variables) have to be equal across all levels of the independent variable
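To make the object being compared concrete, the sketch below computes one sample covariance matrix per level of the independent variable. The scores are hypothetical and the function name is illustrative. This only shows the matrices whose equality is assumed; it is not a test of that equality.

```python
def covariance_matrix(rows):
    """Sample covariance matrix (divisor n - 1) of a list of score vectors."""
    n = len(rows)
    p = len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    return [
        [sum((r[i] - means[i]) * (r[j] - means[j]) for r in rows) / (n - 1)
         for j in range(p)]
        for i in range(p)
    ]

# Hypothetical score vectors (DV1, DV2) for two levels of one IV.
level_1 = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
level_2 = [(1.0, 1.0), (2.0, 3.0), (3.0, 2.0)]

cov_1 = covariance_matrix(level_1)
cov_2 = covariance_matrix(level_2)
print(cov_1)
print(cov_2)
```

Here the two matrices differ both in the variance of DV2 and in the covariance between the DVs, which is the kind of difference the homogeneity assumption rules out.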

(26)

Assumptions of MANOVA, cont’d

– This homogeneity assumption is tested with a test that is similar to Levene’s test for the ANOVA case. It is called Box’s M, and it works the same way: it tests the null hypothesis that the covariance matrices of the dependent variables are equal across levels of the independent variable, so a significant result means they differ

• Putting this in English, what you don’t want is the case where, if your independent variable (IV) was, for example, ethnicity, all the people in the “other” category had scores on their 6 dependent variables clustered very tightly around their mean, whereas people in the “white” category had scores on the vector of 6 dependent variables clustered very loosely around the mean. You don’t want a leptokurtic set of distributions for one level of the IV and a platykurtic set for another level

(27)

Assumptions of MANOVA, cont’d

• If Box’s M is significant, it means you have violated an assumption of MANOVA. This is not much of a problem if you have equal cell sizes and large N; it is a much bigger issue with small sample sizes and/or unequal cell sizes (in factorial ANOVA with unequal cell sizes, the sums of squares for the three sources, two main effects and the interaction effect, won’t add up to the total sum of squares, SS)

(28)

Assumptions of MANOVA, cont’d

3. Independence of observations

– Subjects’ scores on the dependent measures should not be influenced by or related to scores of other subjects in the condition or level

– Can be tested with an intraclass correlation coefficient if lack of independence of observations is suspected
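The slide does not say which intraclass correlation coefficient is meant. As a sketch under that caveat, the code below computes one common form, the one-way ICC(1), from the between-group and within-group mean squares of a one-way ANOVA with equal group sizes k: (MS_between − MS_within) / (MS_between + (k − 1) MS_within). The data and the function name are hypothetical.

```python
def icc_one_way(groups):
    """One-way ICC(1); assumes every group has the same size k."""
    k = len(groups[0])
    m = len(groups)
    values = [x for g in groups for x in g]
    grand_mean = sum(values) / len(values)
    means = [sum(g) / k for g in groups]
    ms_between = k * sum((mk - grand_mean) ** 2 for mk in means) / (m - 1)
    ms_within = sum(
        (x - mk) ** 2 for g, mk in zip(groups, means) for x in g
    ) / (m * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)

# Hypothetical scores: subjects in the same cluster score alike.
clusters = [[1.0, 2.0, 3.0], [6.0, 7.0, 8.0]]
icc = icc_one_way(clusters)
print(round(icc, 3))
```

A value near 1 means scores within a cluster are strongly related, which would signal a violation of independence; a value near 0 is consistent with independent observations.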

(29)
