Updating Strategies and Speed/Accuracy
trade-offs in Variable Selection Algorithms
for Multivariate Data Analysis
FEG / CEGE
UNIVERSIDADE CATÓLICA PORTUGUESA
CENTRO REGIONAL DO PORTO
A. PEDRO DUARTE SILVA
Speed and Accuracy in Variable Selection
OUTLINE
1. CRITERIA FOR VARIABLE SELECTION
1.1. LINEAR MODELS WITH MULTIPLE RESPONSES
2. ALGORITHMS BASED ON SSCP MATRICES
4. FINAL REMARKS
1.2. A LINEAR HYPOTHESIS FRAMEWORK
2.1. UPDATING STRATEGIES 2.2 COMPUTATIONAL EFFORT
2.3. ERROR BOUNDS AND STABILITY CONDITIONS
Comparison Criteria:
Multivariate Indices
2 1ccr
1/r r 1 i 2 i 2 ) ccr (1 1 τ r i 1 1 2 1 1 2 i ccr rr
ccr
r 1 i 2 i 2ξ
(
max ccr12 max 1 = Eigval1 (T-1E ))
(
max 2 max v = tr E-1H)
(
max 2 max u = tr T-1H)
Speed and Accuracy in Variable Selection
LINEAR MODELS WITH MULTIPLE RESPONSES
v
r
v
) H (T Eigval ccri2 i 1 Y = X B + U E = SXY SYY-1 S YX T = SXX r = rank(H) H = T - E Roy criterion Lawley-Hotelling criterion 1 1 λ 1 λ Wilks criterion)
det(T) det(W) Λ min max(
τ2 Bartlett-Pillai criterion = 1 - 1/rr
u
(i = 1, ..., r) = min(
ncol(X),ncol(Y))
A LINEAR HYPOTHESIS FRAMEWORK
X = A
+ U
SELECT COLUMNS OF X IN ORDER TO EXPLAIN H1
PARTICULAR CASES:
H0: C
= 0
(i) LINEAR DISCRIMINANT ANALYSIS
(ii) MULTI-WAY MANOVA/MANCOVA EFFECTS
A = [1
g]
= [
g]
1 ... 0 1 ... ... ... ... 0 ... 1 1 C= R(A) = R(A) N(C) r = dim( ) - dim( )
) H (T Eigval
ccri2 i 1 T = X’ (I - P ) X H = X’ (P - P ) X
Speed and Accuracy in Variable Selection
ALGORITHMS BASED ON SSCP MATRICES
2 y Xy Xy XX
s
S
S
S
A
FURNIVAL’S ALGORITHM FOR LINEAR REGRESSION (1971):
y = X’ + e
ˆ
min
s
e2s
2yS
yXS
XX1S
Xy GAUSSIAN ELIMINATION PIVOTS 2 y y2 1 y 2y 22 21 1y 12 11 s S S S S S S S S 1y 1 11 y1 2 y S S S s 12 1 11 y1 y2 1 11 1 y 1y 1 11 21 2y 12 1 11 21 22 1 11 21 1y 1 11 12 1 11 S S S S S S S S S S S S S S S S S S S S ...X = [ X
1| X
2]
Speed and Accuracy in Variable Selection
ALGORITHMS BASED ON SSCP MATRICES
COMMENTS
1/4 OF THE PIVOTS UPDATE (2*2) SYMMETRIC SUBMATRICES …
ONE SINGLE PIVOT NEEDS TO UPDATE A (P*P) SYMMETRIC MATRIX
(i) ONLY THE RIGHT-LOWER CORNER OF
A
NEEDS TO BE UPDATED AT EACH STEP1/2 OF THE PIVOTS JUST HAVE TO COMPUTE s2
e
TOTAL NUMBER OF FLOATING POINT OPERATIONS:
6(2p) – (1/2)p2 - (7/2)p - 6 multiplications/divisions
4(2p) – (1/2)p2 - (5/2)p - 4 additions/subtractions
Speed and Accuracy in Variable Selection
ALGORITHMS BASED ON SSCP MATRICES
2 e 1 -XX yX Xy 1 -XX 1 -XX s S S -S S -S
-COMMENTS
BRANCH AND BOUND ALGORITHMS (FURNIVAL AND WILSON 1974)
6(2p) + O(p2) multiplications/divisions
4(2p) + O(p2) additions/subtractions
10(2p) + O(p2) flops
WORST CASE (NO PRUNING) TIME COMPLEXITY
FROM TO 2 y
s
...
...
...
Speed and Accuracy in Variable Selection
ALGORITHMS BASED ON SSCP MATRICES
COMMENTS
WORST CASE TIME COMPLEXITY: (iiI) AND REMAINS VALID ON APPLIED TO:
- DETERMINANTS OF THE TYPE det( 11)
- SUMS AND RATIOS OF THE ABOVE CRITERIA
SYMMETRIC, NON-SINGULAR AND POSITIVE-DEFINITE (5+5r) (2p) + O(p2) flops Lawley-Hotelling criterion r 1 i i 1 -i r 1 i i i 1 -1 - H E h h ' h ' E h E
(
)
tr tr v Bartlett-Pillai criterion r 1 i i 1 -i r 1 i i i 1 -1 -h T ' h ' h h T H T)
(
tr tr u Wilks criterion = det(E) / det(T) (5+5r) (2p) + O(p2) flops 12 (2p) + O(p2) flops- QUADRATIC FORMS OF THE TYPE ν1' Φ111 ν1 21 22
12 11 Φ Φ Φ Φ Φ
Speed and Accuracy in Variable Selection
ALGORITHMS BASED ON SSCP MATRICES
ERROR ANALYSIS
BASIC ALGORITHM det( 11) = det(M11)L D L’
DECOMPOSITIONS OF SYMMETRIC POSITIVE-DEFENITE MATRICES,M
11,
BY GAUSSIAN ELIMINATION IF # X1 = K
k 1 i ii D COMPUTED Lˆ Dˆ Lˆ' = M11 + M11 u l u l l 1 γ AND (BARRLUND 1991) 2(M11) = ||M11||2 ||M11-1||2 l = dim(M11) u: = unit roundoff = - D11(k+1,K+1) 1 1 11 1' Φ ν ν 2 2 11 || L| D L' ||| M || k|| | ˆ ˆ |ˆ 2 11 2 1 11 F 11 2 1 11 2 11 F || ΔM || || M || 1 || ΔM || || M || || M || || D D || ˆ ) (M κ ) γ k (1 k 1 || M || ) (M κ ) γ k (1 γ k | (j) D D(j) | 11 2 1 k 2 11 11 2 1 k k 3/2 ˆSpeed and Accuracy in Variable Selection
ALGORITHMS BASED ON SSCP MATRICES
ERROR ANALYSIS
Lawley-Hotelling criterion Bartlett-Pillai criterion or i 1 -i i 1 -i i 1 -1 -h E ' h α E ' h -h E -E -i 1 -i i 1 -i i 1 -1 -h T ' h β E ' h -h T -T -or r 1 i p 1 1 2 i 2 i i 2 1 p 1 p 3/2 ) (M κ γ 1) (p 1 1) p 1 || M || ) (M κ γ 1) p 1 γ 1) (p | |)
(
( ( ˆv v i i i i α ' h h E M r 1 i p 1 1 2 i 2 i i 2 1 p 1 p 3/2 ) (N κ γ 1) (p 1 1) p 1 || N || ) (N κ γ 1) p 1 γ 1) (p | |)
(
( ( ˆu u i i i i β ' h h T NSpeed and Accuracy in Variable Selection
ALGORITHMS BASED ON SSCP MATRICES
ERROR ANALYSIS
Wilks criterion M = E or -E-1 N = T or -T-1 r 1 i 2 i i 2 2 2 2 1 p 2 2 2 2 p p 3/2 ) O( (N) Egval (M) Egval 1 * * (N) κ (M) κ (N) κ (M) κ γ p 1 p 1 || N || || M || (N) κ (M) κ γ p 1 γ p | Λ Λ |)
(
u ˆX = A
+ U
H0: C
= 0
= R(A) = R(A) N(C) r = dim( ) - dim( )
) H (T Eigval
ccri2 i 1 T = X’ (I - P ) X H = X’ (P - P ) X
Speed and Accuracy in Variable Selection
KEY INSIGHT:
THE COMPUTATION OF ccri2 CAN ALWAYS BE BASED ON A
GENERALIZED SINGULAR VALUE PROBLEM
WRITING:
H = X
H’ X
HE = X
E’ X
EX
H= U
HR
HQ’
X
E’ = U
ER
EQ’
THEN
ccr
i2=
(
D
H/E(i,i)
)
2/
[
1 +
(
D
H/E(i,i)
)
2]
R
H= D
H/ER
ESpeed and Accuracy in Variable Selection
INITIALIZATION STEP
FIND X
HAND X
EDIRECTLY FROM
X
,
A
AND
C
FROM THE SVD’s
C V
s s2V’
sC’ = F D F ’
WE GET
X
E= Ũ’ X
X
H= D
-1/2F’ C V
S S-1U
S’ X
A = U
V ’ =
Us U~ Σ Vs V~’
Speed and Accuracy in Variable Selection
ITERATION STEP
X
H= U
HR
HQ’
X
E’ = U
ER
EQ’
R
H= D
H/ER
E AND BUTMIGHT NOT BE TRIANGULAR
MIGHT NOT BE PARALLEL
1 0 0 Q' 1) k (., X U' | R U 1) k (., X | XE1 E E E E 1 0 0 Q' 1) k (., X U' | R U 1) k (., X | XH1 H H H H
1)
k
(.,
X
U'
|
R
H H H1)
k
(.,
X
U'
|
R
E E ESpeed and Accuracy in Variable Selection
AN ALGORITHM BASED ON THE ORIGINAL DATA
ITERATION STEP
RESTORING TRIANGULAR STRUCTURE
- HOUSEHOLDER TRANSFORMATIONS ON
RESTORING PARALLELISM
- PAIGE’S (1986) ELEMENTARY 2*2 ROTATIONS
TIME COMPLEXITY OF EACH ITERATION:
1) k (., X U' | RH H H AND RE | U'E XE(.,k 1)
O
(s*k + r*k + k2)Speed and Accuracy in Variable Selection
FINAL REMARKS
IN MOST PRATICAL PROBLEMS THE DATA DOES NOT EXHIBIT A PATTERN OF MULTICORRELATION STRONG ENHOUGH TO MAKE THE STABILITY OF ANY OF THE ALGORITHMS A RELEVANT CONCERN
WHEN SEVERE MULTICORRELATION IS PRESENT NONE OF THE ALGORITHMS CAN GIVE RELIABLE RESULTS
OFTEN STATISTICAL REASONS RECOMMEND THAT MULTICOLINEAR SUBSETS SHOULD BE AVOID, LONG BEFORE NUMERICAL ACCURACY BECAMES AN ISSUE
Speed and Accuracy in Variable Selection
FINAL REMARKS
WHEN THE NUMBER OF ORIGINAL VARIABLES IS MODERATE ALL ALGORITHMS ARE FAST ENOUGH SO THAT COMPUTATIONAL
EFFORT IS NOT A RELEVANT CONCERN
COMPUTATIONAL EFFORT
WHEN THE NUMBER OF ORIGINAL VARIABLES IS LARGE NONE OF THE ALGORITHMS CAN PROVE OPTIMAL RESULTS IN A REASONABLE TIME
WHEN TRUE OPTIMALITY CANNOT BE PROVEN THERE ARE NEVERTHELESS MANY HEURISTIC METHODS THAT ARE ABLE TO QUICKLY IDENTIFY GOOD (OFTEN THE BEST) VARIABLE SUBSETS EVEN IN PROBLEMS WITH A VERY LARGE NUMBER OF ORIGINAL VARIABLES