• Nenhum resultado encontrado

Updating strategies and speed/accuracy trade-offs in variable selection algorithms for multivariate data analysis

N/A
N/A
Protected

Academic year: 2021

Share "Updating strategies and speed/accuracy trade-offs in variable selection algorithms for multivariate data analysis"

Copied!
18
0
0

Texto

(1)

Updating Strategies and Speed/Accuracy

trade-offs in Variable Selection Algorithms

for Multivariate Data Analysis

FEG / CEGE

UNIVERSIDADE CATÓLICA PORTUGUESA

CENTRO REGIONAL DO PORTO

A. PEDRO DUARTE SILVA

(2)

Speed and Accuracy in Variable Selection

OUTLINE

1. CRITERIA FOR VARIABLE SELECTION

1.1. LINEAR MODELS WITH MULTIPLE RESPONSES

2. ALGORITHMS BASED ON SSCP MATRICES

4. FINAL REMARKS

1.2. A LINEAR HYPOTHESIS FRAMEWORK

2.1. UPDATING STRATEGIES 2.2 COMPUTATIONAL EFFORT

2.3. ERROR BOUNDS AND STABILITY CONDITIONS

(3)

Comparison Criteria:

Multivariate Indices

2 1

ccr

1/r r 1 i 2 i 2 ) ccr (1 1 τ r i 1 1 2 1 1 2 i ccr r

r

ccr

r 1 i 2 i 2

ξ

(

max ccr12 max 1 = Eigval1 (T-1E )

)

(

max 2 max v = tr E-1H

)

(

max 2 max u = tr T-1H

)

Speed and Accuracy in Variable Selection

LINEAR MODELS WITH MULTIPLE RESPONSES

v

r

v

) H (T Eigval ccri2 i 1 Y = X B + U E = SXY SYY-1 S YX T = SXX r = rank(H) H = T - E Roy criterion Lawley-Hotelling criterion 1 1 λ 1 λ Wilks criterion

)

det(T) det(W) Λ min max

(

τ2 Bartlett-Pillai criterion = 1 - 1/r

r

u

(i = 1, ..., r) = min

(

ncol(X),ncol(Y)

)

(4)

A LINEAR HYPOTHESIS FRAMEWORK

X = A

+ U

SELECT COLUMNS OF X IN ORDER TO EXPLAIN H1

PARTICULAR CASES:

H0: C

= 0

(i) LINEAR DISCRIMINANT ANALYSIS

(ii) MULTI-WAY MANOVA/MANCOVA EFFECTS

A = [1

g

]

= [

g

]

1 ... 0 1 ... ... ... ... 0 ... 1 1 C

= R(A) = R(A) N(C) r = dim( ) - dim( )

) H (T Eigval

ccri2 i 1 T = X’ (I - P ) X H = X’ (P - P ) X

(5)

Speed and Accuracy in Variable Selection

ALGORITHMS BASED ON SSCP MATRICES

2 y Xy Xy XX

s

S

S

S

A

FURNIVAL’S ALGORITHM FOR LINEAR REGRESSION (1971):

y = X’ + e

ˆ

min

s

e2

s

2y

S

yX

S

XX1

S

Xy GAUSSIAN ELIMINATION PIVOTS 2 y y2 1 y 2y 22 21 1y 12 11 s S S S S S S S S 1y 1 11 y1 2 y S S S s 12 1 11 y1 y2 1 11 1 y 1y 1 11 21 2y 12 1 11 21 22 1 11 21 1y 1 11 12 1 11 S S S S S S S S S S S S S S S S S S S S ...

X = [ X

1

| X

2

]

(6)

Speed and Accuracy in Variable Selection

ALGORITHMS BASED ON SSCP MATRICES

COMMENTS

1/4 OF THE PIVOTS UPDATE (2*2) SYMMETRIC SUBMATRICES

ONE SINGLE PIVOT NEEDS TO UPDATE A (P*P) SYMMETRIC MATRIX

(i) ONLY THE RIGHT-LOWER CORNER OF

A

NEEDS TO BE UPDATED AT EACH STEP

1/2 OF THE PIVOTS JUST HAVE TO COMPUTE s2

e

TOTAL NUMBER OF FLOATING POINT OPERATIONS:

6(2p) – (1/2)p2 - (7/2)p - 6 multiplications/divisions

4(2p) – (1/2)p2 - (5/2)p - 4 additions/subtractions

(7)

Speed and Accuracy in Variable Selection

ALGORITHMS BASED ON SSCP MATRICES

2 e 1 -XX yX Xy 1 -XX 1 -XX s S S -S S -S

-COMMENTS

BRANCH AND BOUND ALGORITHMS (FURNIVAL AND WILSON 1974)

6(2p) + O(p2) multiplications/divisions

4(2p) + O(p2) additions/subtractions

10(2p) + O(p2) flops

WORST CASE (NO PRUNING) TIME COMPLEXITY

FROM TO 2 y

s

...

...

...

(8)

Speed and Accuracy in Variable Selection

ALGORITHMS BASED ON SSCP MATRICES

COMMENTS

WORST CASE TIME COMPLEXITY: (iiI) AND REMAINS VALID ON APPLIED TO:

- DETERMINANTS OF THE TYPE det( 11)

- SUMS AND RATIOS OF THE ABOVE CRITERIA

SYMMETRIC, NON-SINGULAR AND POSITIVE-DEFINITE (5+5r) (2p) + O(p2) flops Lawley-Hotelling criterion r 1 i i 1 -i r 1 i i i 1 -1 - H E h h ' h ' E h E

(

)

tr tr v Bartlett-Pillai criterion r 1 i i 1 -i r 1 i i i 1 -1 -h T ' h ' h h T H T

)

(

tr tr u Wilks criterion = det(E) / det(T) (5+5r) (2p) + O(p2) flops 12 (2p) + O(p2) flops

- QUADRATIC FORMS OF THE TYPE ν1' Φ111 ν1 21 22

12 11 Φ Φ Φ Φ Φ

(9)

Speed and Accuracy in Variable Selection

ALGORITHMS BASED ON SSCP MATRICES

ERROR ANALYSIS

BASIC ALGORITHM det( 11) = det(M11)

L D L’

DECOMPOSITIONS OF SYMMETRIC POSITIVE-DEFENITE MATRICES,

M

11

,

BY GAUSSIAN ELIMINATION IF # X1 = K

k 1 i ii D COMPUTED Lˆ Dˆ Lˆ' = M11 + M11 u l u l l 1 γ AND (BARRLUND 1991) 2(M11) = ||M11||2 ||M11-1||2 l = dim(M11) u: = unit roundoff = - D11(k+1,K+1) 1 1 11 1' Φ ν ν 2 2 11 || L| D L' ||| M || k|| | ˆ ˆ |ˆ 2 11 2 1 11 F 11 2 1 11 2 11 F || ΔM || || M || 1 || ΔM || || M || || M || || D D || ˆ ) (M κ ) γ k (1 k 1 || M || ) (M κ ) γ k (1 γ k | (j) D D(j) | 11 2 1 k 2 11 11 2 1 k k 3/2 ˆ

(10)

Speed and Accuracy in Variable Selection

ALGORITHMS BASED ON SSCP MATRICES

ERROR ANALYSIS

Lawley-Hotelling criterion Bartlett-Pillai criterion or i 1 -i i 1 -i i 1 -1 -h E ' h α E ' h -h E -E -i 1 -i i 1 -i i 1 -1 -h T ' h β E ' h -h T -T -or r 1 i p 1 1 2 i 2 i i 2 1 p 1 p 3/2 ) (M κ γ 1) (p 1 1) p 1 || M || ) (M κ γ 1) p 1 γ 1) (p | |

)

(

( ( ˆv v i i i i α ' h h E M r 1 i p 1 1 2 i 2 i i 2 1 p 1 p 3/2 ) (N κ γ 1) (p 1 1) p 1 || N || ) (N κ γ 1) p 1 γ 1) (p | |

)

(

( ( ˆu u i i i i β ' h h T N

(11)

Speed and Accuracy in Variable Selection

ALGORITHMS BASED ON SSCP MATRICES

ERROR ANALYSIS

Wilks criterion M = E or -E-1 N = T or -T-1 r 1 i 2 i i 2 2 2 2 1 p 2 2 2 2 p p 3/2 ) O( (N) Egval (M) Egval 1 * * (N) κ (M) κ (N) κ (M) κ γ p 1 p 1 || N || || M || (N) κ (M) κ γ p 1 γ p | Λ Λ |

)

(

u ˆ

(12)

X = A

+ U

H0: C

= 0

= R(A) = R(A) N(C) r = dim( ) - dim( )

) H (T Eigval

ccri2 i 1 T = X’ (I - P ) X H = X’ (P - P ) X

Speed and Accuracy in Variable Selection

KEY INSIGHT:

THE COMPUTATION OF ccri2 CAN ALWAYS BE BASED ON A

GENERALIZED SINGULAR VALUE PROBLEM

WRITING:

H = X

H

’ X

H

E = X

E

’ X

E

X

H

= U

H

R

H

Q’

X

E

’ = U

E

R

E

Q’

THEN

ccr

i2

=

(

D

H/E

(i,i)

)

2

/

[

1 +

(

D

H/E

(i,i)

)

2

]

R

H

= D

H/E

R

E

(13)

Speed and Accuracy in Variable Selection

INITIALIZATION STEP

FIND X

H

AND X

E

DIRECTLY FROM

X

,

A

AND

C

FROM THE SVD’s

C V

s s2

V’

s

C’ = F D F ’

WE GET

X

E

= Ũ’ X

X

H

= D

-1/2

F’ C V

S S-1

U

S

’ X

A = U

V ’ =

Us U~ Σ Vs V~

(14)

Speed and Accuracy in Variable Selection

ITERATION STEP

X

H

= U

H

R

H

Q’

X

E

’ = U

E

R

E

Q’

R

H

= D

H/E

R

E AND BUT

MIGHT NOT BE TRIANGULAR

MIGHT NOT BE PARALLEL

1 0 0 Q' 1) k (., X U' | R U 1) k (., X | XE1 E E E E 1 0 0 Q' 1) k (., X U' | R U 1) k (., X | XH1 H H H H

1)

k

(.,

X

U'

|

R

H H H

1)

k

(.,

X

U'

|

R

E E E

(15)

Speed and Accuracy in Variable Selection

AN ALGORITHM BASED ON THE ORIGINAL DATA

ITERATION STEP

RESTORING TRIANGULAR STRUCTURE

- HOUSEHOLDER TRANSFORMATIONS ON

RESTORING PARALLELISM

- PAIGE’S (1986) ELEMENTARY 2*2 ROTATIONS

TIME COMPLEXITY OF EACH ITERATION:

1) k (., X U' | RH H H AND RE | U'E XE(.,k 1)

O

(s*k + r*k + k2)

(16)

Speed and Accuracy in Variable Selection

FINAL REMARKS

IN MOST PRATICAL PROBLEMS THE DATA DOES NOT EXHIBIT A PATTERN OF MULTICORRELATION STRONG ENHOUGH TO MAKE THE STABILITY OF ANY OF THE ALGORITHMS A RELEVANT CONCERN

WHEN SEVERE MULTICORRELATION IS PRESENT NONE OF THE ALGORITHMS CAN GIVE RELIABLE RESULTS

OFTEN STATISTICAL REASONS RECOMMEND THAT MULTICOLINEAR SUBSETS SHOULD BE AVOID, LONG BEFORE NUMERICAL ACCURACY BECAMES AN ISSUE

(17)

Speed and Accuracy in Variable Selection

FINAL REMARKS

WHEN THE NUMBER OF ORIGINAL VARIABLES IS MODERATE ALL ALGORITHMS ARE FAST ENOUGH SO THAT COMPUTATIONAL

EFFORT IS NOT A RELEVANT CONCERN

COMPUTATIONAL EFFORT

WHEN THE NUMBER OF ORIGINAL VARIABLES IS LARGE NONE OF THE ALGORITHMS CAN PROVE OPTIMAL RESULTS IN A REASONABLE TIME

WHEN TRUE OPTIMALITY CANNOT BE PROVEN THERE ARE NEVERTHELESS MANY HEURISTIC METHODS THAT ARE ABLE TO QUICKLY IDENTIFY GOOD (OFTEN THE BEST) VARIABLE SUBSETS EVEN IN PROBLEMS WITH A VERY LARGE NUMBER OF ORIGINAL VARIABLES

(18)

References

Barrlund, A. (1991). Pertubation Bounds for the LDL

H

and LU

Decompositions. BIT 31: 358-363.

Duarte Silva, A.P. (2001). Efficient Variable Screening for Multivariate

Analysis. Journal of Multivariate Analysis 76, 35-62.

Furnival, G.M. (1971). All Possible Regressions with Less Computation.

Technometrics 13: 403-408.

Furnival, G.M. & Wilson, R.W. (1974). Regressions by Leaps and

Bounds. Technometrics 16: 499-511.

Higham, N.J. (2002). Accuracy and Stability of Numerical Algorithms.

2nd Ed. SIAM Philadelphia, PA.

Paige, C.C. & Saunders, M.A. (1981). Towards a Generalized Singular

Value Decomposition. SIAM Journal of Numerical Analysis 18: 398-405.

Paige, C.C. (1986) Computing the Generalized Singular Value

Decomposition. SIAM Journal of Scientific and Statistical Computing 7:

1126-1146.

Referências

Documentos relacionados

Este modelo permite avaliar a empresa como um todo e não apenas o seu valor para o acionista, considerando todos os cash flows gerados pela empresa e procedendo ao seu desconto,

TÉCNICAS” são apresentadas as metodologias simplificadas utilizadas pela ANEEL no âmbito do 2ª e 3ª CRTP das distribuidoras, destacando-se distintos modelos

Quando ligar um televisor Sony, a entrada pode ser seleccionada automaticamente carregando no botão de entrada após a configuração (Entrada SYNC) - avance para o passo 5.. 4

The fourth generation of sinkholes is connected with the older Đulin ponor-Medvedica cave system and collects the water which appears deeper in the cave as permanent

Roll forward for pension funds means to extend the liability for a year or n-months from the last actuarial valuation and project to any date in time, according to IAS19

Nossos agradecimentos a todos os pesquisadores, funcionários e alunos de pós-graduação da Divisão de Sensoriamento Remoto, do Centro de Previsão de Tempo e Estudos Climáticos e

Neste trabalho o objetivo central foi a ampliação e adequação do procedimento e programa computacional baseado no programa comercial MSC.PATRAN, para a geração automática de modelos

Ousasse apontar algumas hipóteses para a solução desse problema público a partir do exposto dos autores usados como base para fundamentação teórica, da análise dos dados