Minimum Classification Error Principal Component Analysis
Tiago de Carvalho¹, Maria Sibaldo¹, I. R. Tsang², George Cavalcanti²
¹Universidade Federal Rural de Pernambuco - UFRPE  ²Universidade Federal de Pernambuco - UFPE
KDMILE
Introduction
• Principal Component Analysis (PCA)
• Unsupervised dimensionality reduction
• Used in supervised tasks (face recognition, text classification)
• Supervised PCA for classification (Barshan et al. 2011)
• uses class representatives
• Supervised PCA for regression (Blair et al. 2006)
• pre-processing of feature selection
• Bayesian approach for classification
• depends on the covariance matrix as in PCA
• allows to estimate error rate from features
• Research Objective
• propose a new supervised PCA that selects projections that minimize the Bayes error rate
Notation
The dataset matrix with $n$ points and $d$ features is

$$X' = \begin{bmatrix} x_1^T \\ x_2^T \\ \vdots \\ x_n^T \end{bmatrix}. \quad (1)$$

The $j$-th point is

$$x_j = \begin{bmatrix} x_{j1} \\ x_{j2} \\ \vdots \\ x_{jd} \end{bmatrix}, \quad (2)$$

for $j = 1, \ldots, n$.
The data mean vector is

$$\bar{x} = n^{-1} \sum_{j=1}^{n} x_j. \quad (3)$$

The centered data matrix is

$$X = \begin{bmatrix} (x_1 - \bar{x})^T \\ (x_2 - \bar{x})^T \\ \vdots \\ (x_n - \bar{x})^T \end{bmatrix}. \quad (4)$$
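As a minimal sketch of this notation in NumPy, with a toy dataset of our own choosing, the mean vector and the centered matrix are computed as:

```python
import numpy as np

# Toy dataset: n = 4 points, d = 2 features (hypothetical values)
X_prime = np.array([[1.0, 2.0],
                    [3.0, 4.0],
                    [5.0, 6.0],
                    [7.0, 8.0]])

x_bar = X_prime.mean(axis=0)   # data mean vector, Eq. (3)
X = X_prime - x_bar            # centered data matrix: each row is (x_j - x_bar)^T

print(X.mean(axis=0))          # each column of X now has zero mean
```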
Feature Extraction with PCA
The covariance matrix of $X$ is

$$\Sigma_X = \frac{1}{n} X^T X. \quad (5)$$

• $\xi_i$ is an eigenvector of $\Sigma_X$, for $i = 1, \ldots, k$.

$$E_k = [\xi_1 \ldots \xi_k], \quad (6)$$

• $k = 1, \ldots, d$
• $k$ is the number of extracted features.

The $i$-th extracted feature is

$$f_i = [w_{1i} \ldots w_{ni}]^T = X \xi_i. \quad (7)$$

The projection of the point $x_j$ is

$$w_j^T = [w_{j1} \ldots w_{jk}] = x_j^T E_k. \quad (8)$$

• $\lambda_i$ is the eigenvalue of $\xi_i$
• $\lambda_i$ is the variance of $f_i$
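A sketch of this extraction step in NumPy, on random data of our own (the eigendecomposition routine and sorting are standard, not specific to the paper). It also checks the stated property that the variance of each extracted feature equals its eigenvalue:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 100, 3, 2
X = rng.normal(size=(n, d))
X = X - X.mean(axis=0)                       # center the data

Sigma_X = X.T @ X / n                        # covariance matrix, Eq. (5)
eigvals, eigvecs = np.linalg.eigh(Sigma_X)   # eigenpairs, ascending order
order = np.argsort(eigvals)[::-1]            # re-sort by decreasing eigenvalue
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

E_k = eigvecs[:, :k]                         # projection matrix, Eq. (6)
W = X @ E_k                                  # extracted features f_1 ... f_k

# The variance of the i-th extracted feature is the eigenvalue lambda_i
print(np.var(W, axis=0), eigvals[:k])
```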
PCA Projected Data
The new data matrix is

$$W = X E_k. \quad (9)$$

The covariance matrix of $W$ is $\Sigma_W = n^{-1} W^T W$:

$$\Sigma_W = \begin{bmatrix} \lambda_1 & 0 & \cdots & 0 \\ 0 & \lambda_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \lambda_k \end{bmatrix}$$

Meaning:
• the extracted features are uncorrelated
• this allows feature interactions to be ignored during feature selection
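The uncorrelatedness claim can be checked numerically. A sketch on correlated random data of our own: after projecting onto the eigenvectors, the covariance of $W$ comes out diagonal (up to floating-point error):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
X = rng.normal(size=(n, 3)) @ rng.normal(size=(3, 3))  # correlated features
X = X - X.mean(axis=0)

lam, E = np.linalg.eigh(X.T @ X / n)  # eigenpairs of Sigma_X
W = X @ E                             # project onto all eigenvectors, Eq. (9)
Sigma_W = W.T @ W / n                 # covariance of the projected data

# Off-diagonal entries vanish: the extracted features are uncorrelated,
# and the diagonal holds the eigenvalues.
print(np.round(Sigma_W, 10))
```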
Bayes Error Rate
The probability of classification error.
Simplified under five restrictions:
(1) The data follows a multivariate normal distribution.
(2) The problem has only two classes.
(3) Both classes have equal prior probabilities.
(4) Both classes have the same covariance matrix (as in PCA).
(5) The features are independent (as in PCA).

Then the Bayes error rate is given by

$$P(\mathrm{error}) = \frac{1}{\sqrt{2\pi}} \int_{r/2}^{\infty} e^{-u^2/2} \, du. \quad (10)$$
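The Gaussian tail integral above has a closed form via the complementary error function, which makes it easy to evaluate. A small sketch (the function name is ours):

```python
from math import erfc, sqrt

def bayes_error(r):
    """Bayes error rate under the five restrictions: the Gaussian tail
    (1/sqrt(2*pi)) * integral_{r/2}^{inf} exp(-u^2/2) du,
    using the identity that this tail equals 0.5 * erfc((r/2) / sqrt(2))."""
    return 0.5 * erfc((r / 2) / sqrt(2))

# r = 0: identical class means, so the error is 0.5 (chance level)
print(bayes_error(0.0))
# The error decreases monotonically as r grows
print(bayes_error(1.0) > bayes_error(2.0))
```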
Minimizing Bayes Error Rate
The Bayes error rate decreases as $r$ increases, where $r$ is the Mahalanobis distance between the mean vectors of the classes ($\mu_1$ and $\mu_2$):

$$r^2 = (\mu_1 - \mu_2)^T \Sigma^{-1} (\mu_1 - \mu_2). \quad (11)$$

For a diagonal covariance matrix,

$$r = \sqrt{\sum_{i=1}^{d} \left( \frac{\mu_{1i} - \mu_{2i}}{\sigma_i} \right)^2}, \quad (12)$$

where $\sigma_i$ is the standard deviation of feature $i$, which is the same for both classes.
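A sketch of Eq. (12) with hypothetical class means and standard deviations of our own:

```python
import numpy as np

# Hypothetical class means and shared per-feature standard deviations
mu1 = np.array([0.0, 1.0, 2.0])
mu2 = np.array([1.0, 1.0, 0.0])
sigma = np.array([1.0, 2.0, 2.0])

# Eq. (12): Mahalanobis distance for a diagonal covariance matrix
r = np.sqrt(np.sum(((mu1 - mu2) / sigma) ** 2))
print(r)  # sqrt(1^2 + 0^2 + 1^2) = sqrt(2)
```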
Proposed Score
Choose the projections (eigenvectors) that minimize the Bayes error rate instead of maximizing variance.

The mean of the $i$-th feature for the $c$-th class ($c = 1, 2$) is

$$\bar{w}_{ci} = \frac{\sum_{j=1}^{n} w_{ji} \delta_{jc}}{\sum_{j=1}^{n} \delta_{jc}}, \quad (13)$$

where $\delta_{jc} = 1$ if the $j$-th point belongs to the $c$-th class, and $\delta_{jc} = 0$ otherwise.

The score of the $i$-th feature is

$$s_i = \begin{cases} |\bar{w}_{1i} - \bar{w}_{2i}| / \lambda_i, & \text{if } \lambda_i \neq 0 \\ 0, & \text{if } \lambda_i = 0 \end{cases} \quad (14)$$
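Eqs. (13)-(14) translate directly into a few lines of NumPy. A sketch on a tiny hand-made example (the function name and data are ours): feature 1 separates the two classes, while feature 2 is constant and receives score 0 because its eigenvalue is 0:

```python
import numpy as np

def mce_scores(W, y, lam):
    """Eqs. (13)-(14): discriminant score |w_bar_1i - w_bar_2i| / lambda_i
    per projected feature (0 where lambda_i = 0). y holds labels 1 and 2."""
    m1 = W[y == 1].mean(axis=0)          # class-1 feature means, Eq. (13)
    m2 = W[y == 2].mean(axis=0)          # class-2 feature means
    s = np.zeros_like(lam, dtype=float)
    nz = lam != 0
    s[nz] = np.abs(m1 - m2)[nz] / lam[nz]
    return s

W = np.array([[0.0, 5.0], [1.0, 5.0], [4.0, 5.0], [5.0, 5.0]])
y = np.array([1, 1, 2, 2])
lam = np.array([1.0, 0.0])
print(mce_scores(W, y, lam))   # -> [4. 0.]
```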
Proposed Method
The PCA-projected features selected according to the proposed score minimize the Bayes error rate.

The proposed method consists of the following steps:
(1) Project the data as $W = X E_d$
(2) Compute the mean of each feature for each class
(3) Compute the relevance scores ($s_i$)
(4) Select the $k$ features with the highest scores
(5) Define the projection matrix with the eigenvectors of the selected features,

$$S_k = [\xi_1 \ldots \xi_k] \quad (15)$$

(6) Project the data
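The six steps above can be sketched end to end in NumPy. This is our own reading of the method (the function name and the toy data are ours, not the paper's): the classes differ only along a low-variance raw feature, so plain PCA with $k=1$ would keep the high-variance but uninformative direction, while the score-based selection keeps the discriminative one:

```python
import numpy as np

def mce_pca(X_raw, y, k):
    """Sketch of the six steps; y holds class labels 1 and 2."""
    X = X_raw - X_raw.mean(axis=0)             # center the data
    lam, E = np.linalg.eigh(X.T @ X / len(X))  # eigenpairs of Sigma_X
    W = X @ E                                  # (1) project with all d eigenvectors
    m1 = W[y == 1].mean(axis=0)                # (2) per-class feature means
    m2 = W[y == 2].mean(axis=0)
    s = np.zeros_like(lam)
    nz = lam != 0
    s[nz] = np.abs(m1 - m2)[nz] / lam[nz]      # (3) scores s_i, Eq. (14)
    top = np.argsort(s)[::-1][:k]              # (4) k highest-scoring features
    S_k = E[:, top]                            # (5) projection matrix, Eq. (15)
    return X @ S_k                             # (6) project the data

# Two classes that differ only along the second (low-variance) raw feature
rng = np.random.default_rng(0)
X1 = rng.normal([0, 0], [5, 1], size=(100, 2))
X2 = rng.normal([0, 3], [5, 1], size=(100, 2))
X = np.vstack([X1, X2])
y = np.repeat([1, 2], 100)

Z = mce_pca(X, y, 1)
print(Z.shape)   # (200, 1)
```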
Experiments
Datasets (UCI Machine Learning Repository):
• Climate Model Simulation Crashes (540 points, 18 features)
• Banknote Authentication (1,372 points, 4 features)

Metric: Accuracy.
Sampling: 100 holdouts (50% training / 50% testing).

Classifiers:
• 1-NN (Nearest Neighbor) with Euclidean distance
• Naive Bayes with normal kernel smoothing density estimate
• Pruned Decision Tree with Gini's diversity index and a minimum of 10 observations per leaf
• Linear Discriminant
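The sampling protocol (100 random 50/50 holdouts, accuracy averaged over repetitions) can be sketched with a minimal 1-NN on synthetic stand-in data; everything below (data, helper name) is ours, not the paper's setup:

```python
import numpy as np

rng = np.random.default_rng(42)

def nn1_accuracy(Xtr, ytr, Xte, yte):
    """1-NN with Euclidean distance: each test point takes the label
    of its closest training point."""
    d2 = ((Xte[:, None, :] - Xtr[None, :, :]) ** 2).sum(axis=2)
    pred = ytr[d2.argmin(axis=1)]
    return (pred == yte).mean()

# Toy two-class data (a stand-in for the UCI datasets)
X = np.vstack([rng.normal(0, 1, (50, 4)), rng.normal(3, 1, (50, 4))])
y = np.repeat([1, 2], 50)

accs = []
for _ in range(100):                           # 100 random holdouts
    idx = rng.permutation(len(y))
    tr, te = idx[:len(y) // 2], idx[len(y) // 2:]   # 50% train / 50% test
    accs.append(nn1_accuracy(X[tr], y[tr], X[te], y[te]))
print(np.mean(accs))
```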
Banknote (1-NN)
Accuracy for 2 features: 0.852 (PCA) and 0.959 (Proposed).
[Plot: accuracy vs. number of extracted features]
Banknote (Decision Tree)
Accuracy for 2 features: 0.820 (PCA) and 0.947 (Proposed).
[Plot: accuracy vs. number of extracted features]
Banknote (Naive Bayes)
Accuracy for 1 feature: 0.695 (PCA) and 0.891 (Proposed).
[Plot: accuracy vs. number of extracted features]
Banknote (Linear Discriminant)
Accuracy for 1 feature: 0.614 (PCA) and 0.886 (Proposed).
[Plot: accuracy vs. number of extracted features]
Climate (1-NN)
Accuracy for 10 features: 0.873 (PCA) and 0.901 (Proposed).
[Plot: accuracy vs. number of extracted features]
Climate (Naive Bayes)
Accuracy for 10 features: 0.916 (PCA) and 0.922 (Proposed).
[Plot: accuracy vs. number of extracted features]
Climate (Decision Tree)
Accuracy for 4 features: 0.877 (PCA) and 0.887 (Proposed).
[Plot: accuracy vs. number of extracted features]
Climate (Linear Discriminant)
Accuracy for 12 features: 0.923 (PCA) and 0.944 (Proposed).
[Plot: accuracy vs. number of extracted features]
Hypothesis test
The proposed method has significantly higher accuracy than PCA:

Climate
• from 2 to 16 extracted features (1-NN)
• from 3 to 11 extracted features (Naive Bayes)
• from 2 to 4 extracted features (Decision Tree)
• from 4 to 16 extracted features (Linear Discriminant)

Banknote
Conclusion
The features selected by PCA are those with the highest eigenvalues ($\lambda_i$), while the features selected by the proposed method are those with the highest discriminant scores ($s_i$).
The proposed method achieves higher accuracy than PCA for a smaller number of features.
Future work:
• extend the method to more than 2 classes