Multivariate Analysis
ANOVA and MANOVA
Prof. Dr. Anselmo E de Oliveira anselmo.quimica.ufg.br
ANOVA
• ANalysis Of VAriance
– separate and estimate the different sources of variation
– test whether changing a controlled factor leads to a significant difference between the mean values obtained
– one-way ANOVA: one factor (controlled or random) in addition to the random measurement error
How ANOVA Works
• ANOVA essentially splits the variability into between-group variability and within-group variability, and compares the two
• The larger the former is relative to the latter, the stronger the evidence that there is real variability between the groups (i.e., that the group means differ)
How ANOVA Works
• Define the total sum of squares, $SQ_T$, computed from all the data, where $\bar{x}$ is the overall sample mean
$SQ_T = \sum_i (x_i - \bar{x})^2$
• Note that the usual estimate of the variance of a sample is
$\hat{\sigma}^2 = \dfrac{SQ_T}{N - 1} = MQ$
• We can split $SQ_T$ as
$SQ_T = SQ_D + SQ_E$
where $SQ_D$ is the within-group sum of squares (D for "dentro", within) and $SQ_E$ is the between-group sum of squares (E for "entre", between)
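The decomposition can be checked with a short derivation. This is a sketch in notation introduced here for convenience: $x_{ki}$ is observation $i$ of group $k$, $\bar{x}_k$ is the mean of group $k$, $n_k$ is its size, and $m$ is the number of groups.

```latex
\begin{aligned}
SQ_T &= \sum_{k=1}^{m} \sum_{i=1}^{n_k} (x_{ki} - \bar{x})^2
      = \sum_{k=1}^{m} \sum_{i=1}^{n_k} \left[ (x_{ki} - \bar{x}_k) + (\bar{x}_k - \bar{x}) \right]^2 \\
     &= \underbrace{\sum_{k=1}^{m} \sum_{i=1}^{n_k} (x_{ki} - \bar{x}_k)^2}_{SQ_D}
      + \underbrace{\sum_{k=1}^{m} n_k (\bar{x}_k - \bar{x})^2}_{SQ_E}
      + 2 \sum_{k=1}^{m} (\bar{x}_k - \bar{x}) \underbrace{\sum_{i=1}^{n_k} (x_{ki} - \bar{x}_k)}_{=\,0}
\end{aligned}
```

The cross term vanishes because the deviations from a group's own mean sum to zero, which leaves $SQ_T = SQ_D + SQ_E$.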
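How ANOVA Works
• $SQ_D$ and $SQ_E$ are defined as
$SQ_D = \sum_{i \in \mathrm{gp}_1} (x_i - \bar{x}_1)^2 + \sum_{i \in \mathrm{gp}_2} (x_i - \bar{x}_2)^2 + \cdots$
where $\bar{x}_k$ is the sample mean of group $k$, and
$SQ_E = n_1 (\bar{x}_1 - \bar{x})^2 + n_2 (\bar{x}_2 - \bar{x})^2 + \cdots$
where $n_k$ is the sample size of group $k$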
How ANOVA Works
• After separating the variability, independent estimates of the common population variance $\sigma^2$ can be obtained from $SQ_E$ and $SQ_D$. These estimates are called mean squares (MQ), and they are given by
$\hat{\sigma}^2_{\mathrm{entre}} = \dfrac{SQ_E}{m - 1} \qquad \hat{\sigma}^2_{\mathrm{dentro}} = \dfrac{SQ_D}{N - m}$
where $m$ is the number of groups and $N$ is the total sample size (subscripts: "entre" = between groups, "dentro" = within groups)
Comparing Several Means
• Problem: stability of a fluorescent reagent
sample   conditions                  measurements     mean
A        freshly prepared            102, 100, 101    101
B        stored for 1 h, dark        101, 101, 104    102
C        stored for 1 h, shade       97, 95, 99       97
D        stored for 1 h, light       90, 92, 94       92
overall mean: $\bar{x} = 98$
Comparing Several Means
• $SQ_D$
$SQ_D = \sum_{i \in \mathrm{gp}_1} (x_i - \bar{x}_1)^2 + \sum_{i \in \mathrm{gp}_2} (x_i - \bar{x}_2)^2 + \cdots$
$SQ_D = (102 - 101)^2 + (100 - 101)^2 + (101 - 101)^2$ (sample A)
$\quad + (101 - 102)^2 + (101 - 102)^2 + (104 - 102)^2$ (sample B)
$\quad + (97 - 97)^2 + (95 - 97)^2 + (99 - 97)^2$ (sample C)
$\quad + (90 - 92)^2 + (92 - 92)^2 + (94 - 92)^2$ (sample D)
$SQ_D = 2 + 6 + 8 + 8 = 24$
Comparing Several Means
• $SQ_E$
$SQ_E = n_1 (\bar{x}_1 - \bar{x})^2 + n_2 (\bar{x}_2 - \bar{x})^2 + \cdots$
in the example, $n_1 = n_2 = n_3 = n_4 = 3$
$SQ_E = 3 (101 - 98)^2 + 3 (102 - 98)^2 + 3 (97 - 98)^2 + 3 (92 - 98)^2$
$SQ_E = 27 + 48 + 3 + 108 = 186$
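Comparing Several Means
• $SQ_T$
$SQ_T = SQ_D + SQ_E$
$SQ_T = 24 + 186 = 210$
• $\hat{\sigma}^2_{\mathrm{entre}} = \dfrac{SQ_E}{m - 1}$
$\hat{\sigma}^2_{\mathrm{entre}} = \dfrac{186}{4 - 1} = 62$
• $\hat{\sigma}^2_{\mathrm{dentro}} = \dfrac{SQ_D}{N - m}$
$\hat{\sigma}^2_{\mathrm{dentro}} = \dfrac{24}{12 - 4} = 3$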
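The arithmetic for the fluorescent-reagent example can be reproduced with a few lines of Python. This is a minimal sketch using only the standard library; the variable names (`sq_d`, `sq_e`, `sq_t`, `mq_between`, `mq_within`) are chosen here for illustration and follow the SQ/MQ notation of the slides.

```python
# Measurements from the fluorescent-reagent example (three replicates per sample).
groups = {
    "A": [102, 100, 101],
    "B": [101, 101, 104],
    "C": [97, 95, 99],
    "D": [90, 92, 94],
}


def mean(values):
    return sum(values) / len(values)


all_values = [x for xs in groups.values() for x in xs]
grand_mean = mean(all_values)  # overall mean of all 12 measurements
N = len(all_values)            # total sample size
m = len(groups)                # number of groups

# Within-group sum of squares: deviations from each group's own mean.
sq_d = sum((x - mean(xs)) ** 2 for xs in groups.values() for x in xs)
# Between-group sum of squares: group means versus the overall mean.
sq_e = sum(len(xs) * (mean(xs) - grand_mean) ** 2 for xs in groups.values())
# Total sum of squares, computed directly as a cross-check.
sq_t = sum((x - grand_mean) ** 2 for x in all_values)

mq_between = sq_e / (m - 1)  # between-group mean square
mq_within = sq_d / (N - m)   # within-group mean square

print(sq_d, sq_e, sq_t)       # 24.0 186.0 210.0
print(mq_between, mq_within)  # 62.0 3.0
```

Computing `sq_t` directly from the data, rather than as `sq_d + sq_e`, gives an independent check that the decomposition holds for these numbers.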
How ANOVA Works
• Hypothesis test
– null hypothesis, $H_0$: all samples are drawn from the same population, with mean $\mu$ and variance $\sigma^2$
– estimates of $\sigma^2$
• within-sample variation: $\hat{\sigma}^2_{\mathrm{dentro}}$
• between-sample variation: $\hat{\sigma}^2_{\mathrm{entre}}$
How ANOVA Works
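• Hypothesis test
null hypothesis, $H_0$:
true
• the two estimates of $\sigma^2$ should not differ significantly
• in expectation, $\hat{\sigma}^2_{\mathrm{entre}} - \hat{\sigma}^2_{\mathrm{dentro}} = 0$
false
• $\hat{\sigma}^2_{\mathrm{entre}} > \hat{\sigma}^2_{\mathrm{dentro}}$
• $\hat{\sigma}^2_{\mathrm{entre}} / \hat{\sigma}^2_{\mathrm{dentro}} > 1$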
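How ANOVA Works
• Hypothesis test and F test
– to decide whether one variance estimate is significantly greater than the other, use a one-tailed F test on $F_{calc} = \hat{\sigma}^2_{\mathrm{entre}} / \hat{\sigma}^2_{\mathrm{dentro}}$
– if $F_{calc} > F_{tab}$, the null hypothesis is rejected
– in the example, $F_{calc}$ is greater than $F_{tab}$, so at the 95% confidence level the sample means differ significantly
F table, P = 0.05
• $\nu$ is the number of degrees of freedom
• the tabulated F is always greater than 1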
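What the tabulated value means can be illustrated by simulation. The sketch below repeatedly draws 4 groups of 3 measurements from a single normal population (so the null hypothesis is true by construction) and counts how often the F ratio exceeds 4.066, the standard one-tailed critical value for P = 0.05 with 3 and 8 degrees of freedom. The population mean and standard deviation (98 and 2), the seed, and the number of trials are arbitrary choices made here, not values from the slides.

```python
import random


def f_ratio(groups):
    """Between-group mean square divided by within-group mean square."""
    values = [x for g in groups for x in g]
    grand = sum(values) / len(values)
    means = [sum(g) / len(g) for g in groups]
    sq_e = sum(len(g) * (mk - grand) ** 2 for g, mk in zip(groups, means))
    sq_d = sum((x - mk) ** 2 for g, mk in zip(groups, means) for x in g)
    m, n_total = len(groups), len(values)
    return (sq_e / (m - 1)) / (sq_d / (n_total - m))


random.seed(0)   # fixed seed so the run is repeatable
F_TAB = 4.066    # one-tailed critical value, P = 0.05, 3 and 8 d.f.
TRIALS = 20000

exceed = 0
for _ in range(TRIALS):
    # H0 is true: every group comes from the same normal population.
    groups = [[random.gauss(98.0, 2.0) for _ in range(3)] for _ in range(4)]
    if f_ratio(groups) > F_TAB:
        exceed += 1

false_alarm_rate = exceed / TRIALS
print(f"fraction of null experiments with F > F_tab: {false_alarm_rate:.3f}")
```

The printed fraction should come out close to 0.05: when the null hypothesis holds, the F ratio exceeds the tabulated value in roughly 5% of experiments, which is what P = 0.05 expresses.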
ANOVA Table
source of variation    SQ (sum of squares)    d.f.    MQ (mean square)    F
between samples        186                    3       62                  20.7
within samples         24                     8       3
total                  210                    11
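The F value in the table and the resulting decision can be recomputed from the sums of squares and degrees of freedom. This is a small sketch; the critical value 4.066 (one-tailed, P = 0.05, 3 and 8 degrees of freedom) comes from standard F tables and is not stated in the slides.

```python
# Sums of squares and degrees of freedom taken from the ANOVA table.
sq_between, df_between = 186, 3
sq_within, df_within = 24, 8

mq_between = sq_between / df_between  # 62.0
mq_within = sq_within / df_within     # 3.0
f_calc = mq_between / mq_within       # ratio of the two variance estimates

F_TAB = 4.066  # one-tailed critical value, P = 0.05, 3 and 8 d.f.
reject_h0 = f_calc > F_TAB

rows = [
    ("between samples", sq_between, df_between, mq_between, f_calc),
    ("within samples", sq_within, df_within, mq_within, None),
    ("total", sq_between + sq_within, df_between + df_within, None, None),
]
print(f"{'source':<18}{'SQ':>6}{'d.f.':>6}{'MQ':>8}{'F':>8}")
for name, sq, df, mq, f in rows:
    mq_txt = "" if mq is None else f"{mq:.0f}"
    f_txt = "" if f is None else f"{f:.1f}"
    print(f"{name:<18}{sq:>6}{df:>6}{mq_txt:>8}{f_txt:>8}")
print("reject H0:", reject_h0)  # True, since 20.7 > 4.066
```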
Octave
• Runs the F test to check whether the group means are statistically equal
• Does not display the ANOVA table
• Computes the value of F
• Computes the value of p
– $H_0$: the means are equal
– p < 0.05: the null hypothesis is rejected
• Computes gle (between-group degrees of freedom) and gld (within-group degrees of freedom)
• Matrix Y: groups in columns
Octave
> [p,F,gle,gld] = anova(Y)
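For readers without Octave, the same four quantities can be sketched in plain Python. The function names `anova_one_way` and `f_upper_tail` are hypothetical helpers written for this illustration, not part of any library. The p-value uses the standard identity P(F > f) = I_x(d2/2, d1/2) with x = d2/(d2 + d1·f), where I is the regularized incomplete beta function, evaluated here by Simpson's rule. The sketch assumes at least 2 within-group degrees of freedom and F > 0; for serious work a statistics library would be preferable.

```python
import math


def f_upper_tail(F, d1, d2, steps=2000):
    """Upper-tail probability of the F distribution with (d1, d2) d.f.

    Uses P(F' > F) = I_x(d2/2, d1/2), x = d2 / (d2 + d1*F), and integrates
    the beta density numerically with Simpson's rule (steps must be even).
    """
    a, b = d2 / 2.0, d1 / 2.0
    x = d2 / (d2 + d1 * F)

    def integrand(t):
        return t ** (a - 1) * (1 - t) ** (b - 1)

    h = x / steps
    total = integrand(0.0) + integrand(x)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * integrand(i * h)
    beta = math.gamma(a) * math.gamma(b) / math.gamma(a + b)
    return (total * h / 3) / beta


def anova_one_way(Y):
    """One-way ANOVA for a matrix Y (list of rows) with groups in columns.

    Returns (p, F, gle, gld), mirroring the outputs named in the Octave call.
    """
    columns = [list(col) for col in zip(*Y)]
    values = [x for col in columns for x in col]
    grand = sum(values) / len(values)
    means = [sum(col) / len(col) for col in columns]
    sq_e = sum(len(c) * (mk - grand) ** 2 for c, mk in zip(columns, means))
    sq_d = sum((x - mk) ** 2 for c, mk in zip(columns, means) for x in c)
    gle = len(columns) - 1            # between-group degrees of freedom
    gld = len(values) - len(columns)  # within-group degrees of freedom
    F = (sq_e / gle) / (sq_d / gld)
    return f_upper_tail(F, gle, gld), F, gle, gld


# Fluorescent-reagent data: one column per sample (A, B, C, D).
Y = [
    [102, 101, 97, 90],
    [100, 101, 95, 92],
    [101, 104, 99, 94],
]
p, F, gle, gld = anova_one_way(Y)
print(f"p = {p:.5f}, F = {F:.1f}, gle = {gle}, gld = {gld}")
```

For these data p comes out far below 0.05, consistent with rejecting the null hypothesis.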
Java Application
MANOVA
MANOVA: Multivariate Analysis of Variance
MANOVA: What Kinds of Hypotheses Can it Test?
• A MANOVA, or multivariate analysis of variance, is a way to test the hypothesis that one or more independent variables (IVs), or factors, have an effect on a set of two or more dependent variables (DVs)
– For example, you might wish to test the hypothesis that sex and ethnicity interact to influence a set of job-related outcomes including attitudes toward co-workers, attitudes toward supervisors, feelings of belonging in the work environment, and identification with the corporate culture
– As another example, you might want to test the hypothesis that three different methods of teaching writing result in significant differences in ratings of student creativity, student acquisition of grammar, and assessments of writing quality by an independent panel of judges
Why Should You Do a MANOVA?
• You do a MANOVA instead of a series of one-at-a-time ANOVAs for two main reasons
– Supposedly to reduce the experiment-wise level of Type I error. Running 8 F tests at 0.05 each means the experiment-wise probability of making at least one Type I error (rejecting the null hypothesis when it is in fact true) can be as high as 40% (about 34% if the tests are independent). The so-called overall test or omnibus test protects against this inflated error probability only when the null hypothesis is true. If you follow up a significant multivariate test with a bunch of ANOVAs on the individual variables without adjusting the error rates for the individual tests, there’s no “protection”
– Another reason to do a MANOVA: none of the individual ANOVAs may produce a significant main effect on its DV, but the DVs in combination might, which suggests that the variables are more meaningful taken together than considered separately
• MANOVA takes into account the intercorrelations among the DVs
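The experiment-wise error figure for 8 tests at 0.05 can be worked out directly. In this sketch, 8 × 0.05 = 40% is the upper (Bonferroni) bound that holds whatever the dependence between tests, while the value for independent tests is 1 − 0.95^8. The per-test adjustment α/k shown at the end is the Bonferroni correction, one common way of adjusting follow-up tests; the slides do not prescribe a specific method.

```python
alpha = 0.05  # per-test significance level
k = 8         # number of separate F tests

# Upper bound on the probability of at least one Type I error (any dependence).
bonferroni_bound = min(1.0, k * alpha)
# Probability of at least one Type I error if the k tests are independent.
fwer_independent = 1 - (1 - alpha) ** k
# Bonferroni-adjusted per-test level that keeps the overall rate at or below alpha.
alpha_per_test = alpha / k

print(f"upper bound:             {bonferroni_bound:.2f}")   # 0.40
print(f"independent tests:       {fwer_independent:.3f}")   # 0.337
print(f"adjusted per-test level: {alpha_per_test:.5f}")     # 0.00625
```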
Type I and II error
• Type I error
– A type I error occurs when one rejects the null hypothesis when it is true
• Type II error
– A type II error occurs when one rejects the alternative hypothesis (fails to reject the null hypothesis) when the alternative hypothesis is true
If a cutoff value is used to decide between two hypothesized means, moving it to decrease the Type I error rate will increase the Type II error rate (and vice versa)
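The trade-off can be made concrete with two normal distributions and a movable cutoff. In this sketch the means (0 under the null hypothesis, 2 under the alternative) and the common standard deviation (1) are hypothetical values chosen only for illustration; the decision rule is "choose the alternative when the observation exceeds the cutoff".

```python
import math

MU_NULL, MU_ALT, SIGMA = 0.0, 2.0, 1.0  # hypothetical values for illustration


def normal_cdf(x, mu, sigma):
    """Cumulative distribution function of a normal(mu, sigma) variable."""
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))


def error_rates(cutoff):
    """Return (Type I rate, Type II rate) for a given decision cutoff."""
    type_i = 1 - normal_cdf(cutoff, MU_NULL, SIGMA)  # reject H0 although it is true
    type_ii = normal_cdf(cutoff, MU_ALT, SIGMA)      # keep H0 although H1 is true
    return type_i, type_ii


cutoffs = (0.5, 1.0, 1.5, 2.0)
rates = [error_rates(c) for c in cutoffs]
for c, (type_i, type_ii) in zip(cutoffs, rates):
    print(f"cutoff = {c:.1f}   Type I = {type_i:.3f}   Type II = {type_ii:.3f}")
```

As the cutoff moves toward the alternative mean, the Type I column shrinks while the Type II column grows; with a fixed sample size one cannot reduce both at once.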
Assumptions of MANOVA
1. Multivariate normality
– All of the DVs must be distributed normally (can visualize this with histograms; tests are available for checking this out)
– Any linear combination of the DVs must be distributed normally
• Check out pairwise relationships among the DVs for nonlinear relationships using scatter plots
Assumptions of MANOVA
– All subsets of the variables must have a multivariate normal distribution
• These requirements are rarely if ever tested in practice
• MANOVA is assumed to be a robust test that can stand up to
departures from multivariate normality in terms of Type I error rate
[Figure: log-likelihood density (log scale) for a correlated multivariate normal distribution]
Assumptions of MANOVA
– Statistical power (power to detect a main or interaction effect) may be reduced when distributions are very plateau-like (platykurtic)
– If the classes in the center of the distribution have more or less the same frequency, the resulting histogram looks like a plateau
Assumptions of MANOVA, cont’d
2. Homogeneity of the covariance matrices
– In ANOVA we talked about the need for the variances of the dependent variable to be equal across levels of the independent variable
• In MANOVA, the univariate requirement of equal variances has to hold for each one of the dependent variables
– In MANOVA we extend this concept and require that the “covariance matrices” be homogeneous
• Computations in MANOVA require the use of matrix algebra, and each person’s “score” on the dependent variables is actually a “vector” of scores on DV1, DV2, DV3, …, DVn
• The matrices of the covariances (the variance shared between any two variables) have to be equal across all levels of the independent variable
Assumptions of MANOVA, cont’d
– This homogeneity assumption is tested with a test that is similar to Levene’s test for the ANOVA case. It is called Box’s M, and it works the same way: it tests the null hypothesis that the covariance matrices of the dependent variables are equal across levels of the independent variable, so a significant result indicates that they differ
• Putting this in English, what you don’t want is the case where, if your independent variable (IV) was, for example, ethnicity, all the people in the “other” category had scores on their 6 dependent variables clustered very tightly around their mean, whereas people in the “white” category had scores on the vector of 6 dependent variables clustered very loosely around the mean. Loosely speaking, you don’t want a leptokurtic set of distributions for one level of the IV and a platykurtic set for another level
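The idea of comparing covariance matrices across groups can be sketched for the simplest case of two dependent variables. The code below computes each group's sample covariance matrix, pools them, and evaluates the uncorrected Box's M statistic, M = (N − g) ln|S_pooled| − Σ (n_i − 1) ln|S_i|, with N the total number of cases and g the number of groups. The data are made up for illustration (one tightly clustered group, one loosely clustered group), the helper names are hypothetical, and no p-value is computed because that requires an additional correction factor and a chi-square or F approximation.

```python
import math


def covariance_matrix(rows):
    """Unbiased 2x2 sample covariance matrix for rows of (dv1, dv2) scores."""
    n = len(rows)
    m1 = sum(r[0] for r in rows) / n
    m2 = sum(r[1] for r in rows) / n
    s11 = sum((r[0] - m1) ** 2 for r in rows) / (n - 1)
    s22 = sum((r[1] - m2) ** 2 for r in rows) / (n - 1)
    s12 = sum((r[0] - m1) * (r[1] - m2) for r in rows) / (n - 1)
    return [[s11, s12], [s12, s22]]


def det2(s):
    """Determinant of a 2x2 matrix."""
    return s[0][0] * s[1][1] - s[0][1] * s[1][0]


def box_m(groups):
    """Uncorrected Box's M statistic for groups of (dv1, dv2) score vectors."""
    covs = [covariance_matrix(g) for g in groups]
    dfs = [len(g) - 1 for g in groups]
    total_df = sum(dfs)  # N - g
    pooled = [
        [sum(df * s[i][j] for df, s in zip(dfs, covs)) / total_df for j in range(2)]
        for i in range(2)
    ]
    return total_df * math.log(det2(pooled)) - sum(
        df * math.log(det2(s)) for df, s in zip(dfs, covs)
    )


# Hypothetical scores on two DVs for two levels of an IV.
tight = [(10, 20), (11, 21), (10, 22), (12, 21), (11, 20)]  # clustered tightly
loose = [(5, 15), (15, 28), (9, 18), (13, 20), (8, 24)]     # clustered loosely

print("M, identical groups:", round(box_m([tight, tight]), 6))
print("M, tight vs loose:  ", round(box_m([tight, loose]), 2))
```

M is zero when the group covariance matrices coincide and grows as they diverge, which is the situation the homogeneity assumption is meant to rule out.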
Assumptions of MANOVA, cont’d
• If Box’s M is significant, it means you have violated an assumption of MANOVA. This is not much of a problem if you have equal cell sizes and large N; it is a much bigger issue with small sample sizes and/or unequal cell sizes (in factorial ANOVA, if there are unequal cell sizes, the sums of squares for the three sources (two main effects and the interaction effect) won’t add up to the Total Sum of Squares, SS)