Minimum Classification Error Principal Component Analysis
Tiago de Carvalho¹, Maria Sibaldo¹, I. R. Tsang², George Cavalcanti²
¹Universidade Federal Rural de Pernambuco - UFRPE  ²Universidade Federal de Pernambuco - UFPE
KDMILE
Introduction
• Principal Component Analysis (PCA)
• Unsupervised dimensionality reduction
• Used in supervised tasks (face recognition, text classification)
• Supervised PCA for classification (Barshan et al. 2011)
• uses class representatives
• Supervised PCA for regression (Blair et al. 2006)
• pre-processing of feature selection
• Bayesian approach for classification
• depends on the covariance matrix as in PCA
• allows to estimate error rate from features
• Research Objective
• propose a new supervised PCA that selects projections that minimize the Bayes error rate
Notation
The dataset matrix with $n$ points and $d$ features is

$$X' = \begin{bmatrix} x_1^T \\ x_2^T \\ \vdots \\ x_n^T \end{bmatrix}. \quad (1)$$

The $j$-th point is

$$x_j = \begin{bmatrix} x_{j1} \\ x_{j2} \\ \vdots \\ x_{jd} \end{bmatrix}, \quad (2)$$

for $j = 1, \ldots, n$.
The data mean vector is

$$\bar{x} = n^{-1} \sum_{j=1}^{n} x_j. \quad (3)$$

The centered data matrix is

$$X = \begin{bmatrix} (x_1 - \bar{x})^T \\ (x_2 - \bar{x})^T \\ \vdots \\ (x_n - \bar{x})^T \end{bmatrix}. \quad (4)$$
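As a minimal sketch of this notation in NumPy, with a toy dataset of our own choosing, the mean vector and the centered matrix are computed as:

```python
import numpy as np

# Toy dataset: n = 4 points, d = 2 features (hypothetical values)
X_prime = np.array([[1.0, 2.0],
                    [3.0, 4.0],
                    [5.0, 6.0],
                    [7.0, 8.0]])

x_bar = X_prime.mean(axis=0)   # data mean vector, Eq. (3)
X = X_prime - x_bar            # centered data matrix: each row is (x_j - x_bar)^T

print(X.mean(axis=0))          # each column of X now has zero mean
```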
Feature Extraction with PCA
The covariance matrix of $X$ is

$$\Sigma_X = \frac{1}{n} X^T X. \quad (5)$$

• $\xi_i$ is an eigenvector of $\Sigma_X$, for $i = 1, \ldots, k$.

$$E_k = [\xi_1 \ldots \xi_k], \quad (6)$$

• $k = 1, \ldots, d$
• $k$ is the number of extracted features.

The $i$-th extracted feature is

$$f_i = [w_{1i} \ldots w_{ni}]^T = X \xi_i. \quad (7)$$

The projection of the point $x_j$ is

$$w_j^T = [w_{j1} \ldots w_{jk}] = x_j^T E_k. \quad (8)$$

• $\lambda_i$ is the eigenvalue of $\xi_i$
• $\lambda_i$ is the variance of $f_i$
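A sketch of this extraction step in NumPy, on random data of our own (the eigendecomposition routine and sorting are standard, not specific to the paper). It also checks the stated property that the variance of each extracted feature equals its eigenvalue:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 100, 3, 2
X = rng.normal(size=(n, d))
X = X - X.mean(axis=0)                       # center the data

Sigma_X = X.T @ X / n                        # covariance matrix, Eq. (5)
eigvals, eigvecs = np.linalg.eigh(Sigma_X)   # eigenpairs, ascending order
order = np.argsort(eigvals)[::-1]            # re-sort by decreasing eigenvalue
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

E_k = eigvecs[:, :k]                         # projection matrix, Eq. (6)
W = X @ E_k                                  # extracted features f_1 ... f_k

# The variance of the i-th extracted feature is the eigenvalue lambda_i
print(np.var(W, axis=0), eigvals[:k])
```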
PCA Projected Data
The new data matrix is

$$W = X E_k. \quad (9)$$

The covariance matrix of $W$ is $\Sigma_W = n^{-1} W^T W$:

$$\Sigma_W = \begin{bmatrix} \lambda_1 & 0 & \cdots & 0 \\ 0 & \lambda_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \lambda_k \end{bmatrix}$$

Meaning:
• the extracted features are uncorrelated
• this allows feature interactions to be ignored during feature selection
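The uncorrelatedness claim can be checked numerically. A sketch on correlated random data of our own: after projecting onto the eigenvectors, the covariance of $W$ comes out diagonal (up to floating-point error):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
X = rng.normal(size=(n, 3)) @ rng.normal(size=(3, 3))  # correlated features
X = X - X.mean(axis=0)

lam, E = np.linalg.eigh(X.T @ X / n)  # eigenpairs of Sigma_X
W = X @ E                             # project onto all eigenvectors, Eq. (9)
Sigma_W = W.T @ W / n                 # covariance of the projected data

# Off-diagonal entries vanish: the extracted features are uncorrelated,
# and the diagonal holds the eigenvalues.
print(np.round(Sigma_W, 10))
```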
Bayes Error Rate
The probability of classification error.
Simplified under five restrictions:
(1) The data follows a multivariate normal distribution.
(2) The problem has only two classes.
(3) Both classes have equal prior probabilities.
(4) Both classes have the same covariance matrix (as in PCA).
(5) The features are independent (as in PCA).

Then the Bayes error rate is given by

$$P(\mathrm{error}) = \frac{1}{\sqrt{2\pi}} \int_{r/2}^{\infty} e^{-u^2/2} \, du. \quad (10)$$
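The Gaussian tail integral above has a closed form via the complementary error function, which makes it easy to evaluate. A small sketch (the function name is ours):

```python
from math import erfc, sqrt

def bayes_error(r):
    """Bayes error rate under the five restrictions: the Gaussian tail
    (1/sqrt(2*pi)) * integral_{r/2}^{inf} exp(-u^2/2) du,
    using the identity that this tail equals 0.5 * erfc((r/2) / sqrt(2))."""
    return 0.5 * erfc((r / 2) / sqrt(2))

# r = 0: identical class means, so the error is 0.5 (chance level)
print(bayes_error(0.0))
# The error decreases monotonically as r grows
print(bayes_error(1.0) > bayes_error(2.0))
```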
Minimizing Bayes Error Rate
The Bayes error rate decreases as $r$ increases, where $r$ is the Mahalanobis distance between the mean vectors of the classes ($\mu_1$ and $\mu_2$):

$$r^2 = (\mu_1 - \mu_2)^T \Sigma^{-1} (\mu_1 - \mu_2). \quad (11)$$

For a diagonal covariance matrix,

$$r = \sqrt{\sum_{i=1}^{d} \left( \frac{\mu_{1i} - \mu_{2i}}{\sigma_i} \right)^2}, \quad (12)$$

where $\sigma_i$ is the standard deviation of feature $i$, which is the same for both classes.
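A sketch of Eq. (12) with hypothetical class means and standard deviations of our own:

```python
import numpy as np

# Hypothetical class means and shared per-feature standard deviations
mu1 = np.array([0.0, 1.0, 2.0])
mu2 = np.array([1.0, 1.0, 0.0])
sigma = np.array([1.0, 2.0, 2.0])

# Eq. (12): Mahalanobis distance for a diagonal covariance matrix
r = np.sqrt(np.sum(((mu1 - mu2) / sigma) ** 2))
print(r)  # sqrt(1^2 + 0^2 + 1^2) = sqrt(2)
```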
Proposed Score
Choose the projections (eigenvectors) that minimize the Bayes error rate instead of maximizing variance.

The mean of the $i$-th feature for the $c$-th class ($c = 1, 2$) is

$$\bar{w}_{ci} = \frac{\sum_{j=1}^{n} w_{ji} \delta_{jc}}{\sum_{j=1}^{n} \delta_{jc}}, \quad (13)$$

where $\delta_{jc} = 1$ if the $j$-th point belongs to the $c$-th class, and $\delta_{jc} = 0$ otherwise.

The score of the $i$-th feature is

$$s_i = \begin{cases} |\bar{w}_{1i} - \bar{w}_{2i}| / \lambda_i, & \text{if } \lambda_i \neq 0 \\ 0, & \text{if } \lambda_i = 0 \end{cases} \quad (14)$$
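Eqs. (13)-(14) translate directly into a few lines of NumPy. A sketch on a tiny hand-made example (the function name and data are ours): feature 1 separates the two classes, while feature 2 is constant and receives score 0 because its eigenvalue is 0:

```python
import numpy as np

def mce_scores(W, y, lam):
    """Eqs. (13)-(14): discriminant score |w_bar_1i - w_bar_2i| / lambda_i
    per projected feature (0 where lambda_i = 0). y holds labels 1 and 2."""
    m1 = W[y == 1].mean(axis=0)          # class-1 feature means, Eq. (13)
    m2 = W[y == 2].mean(axis=0)          # class-2 feature means
    s = np.zeros_like(lam, dtype=float)
    nz = lam != 0
    s[nz] = np.abs(m1 - m2)[nz] / lam[nz]
    return s

W = np.array([[0.0, 5.0], [1.0, 5.0], [4.0, 5.0], [5.0, 5.0]])
y = np.array([1, 1, 2, 2])
lam = np.array([1.0, 0.0])
print(mce_scores(W, y, lam))   # -> [4. 0.]
```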
Proposed Method
The PCA-projected features selected according to the proposed score minimize the Bayes error rate.

The proposed method consists of the following steps:
(1) Project the data as $W = X E_d$
(2) Compute the mean of each feature for each class
(3) Compute the relevance scores ($s_i$)
(4) Select the $k$ features with the highest scores
(5) Define the projection matrix with the eigenvectors of the selected features,

$$S_k = [\xi_1 \ldots \xi_k] \quad (15)$$

(6) Project the data
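The six steps above can be sketched end to end in NumPy. This is our own reading of the method (the function name and the toy data are ours, not the paper's): the classes differ only along a low-variance raw feature, so plain PCA with $k=1$ would keep the high-variance but uninformative direction, while the score-based selection keeps the discriminative one:

```python
import numpy as np

def mce_pca(X_raw, y, k):
    """Sketch of the six steps; y holds class labels 1 and 2."""
    X = X_raw - X_raw.mean(axis=0)             # center the data
    lam, E = np.linalg.eigh(X.T @ X / len(X))  # eigenpairs of Sigma_X
    W = X @ E                                  # (1) project with all d eigenvectors
    m1 = W[y == 1].mean(axis=0)                # (2) per-class feature means
    m2 = W[y == 2].mean(axis=0)
    s = np.zeros_like(lam)
    nz = lam != 0
    s[nz] = np.abs(m1 - m2)[nz] / lam[nz]      # (3) scores s_i, Eq. (14)
    top = np.argsort(s)[::-1][:k]              # (4) k highest-scoring features
    S_k = E[:, top]                            # (5) projection matrix, Eq. (15)
    return X @ S_k                             # (6) project the data

# Two classes that differ only along the second (low-variance) raw feature
rng = np.random.default_rng(0)
X1 = rng.normal([0, 0], [5, 1], size=(100, 2))
X2 = rng.normal([0, 3], [5, 1], size=(100, 2))
X = np.vstack([X1, X2])
y = np.repeat([1, 2], 100)

Z = mce_pca(X, y, 1)
print(Z.shape)   # (200, 1)
```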
Experiments
Datasets (UCI Machine Learning Repository):
• Climate Model Simulation Crashes (540 points, 18 features)
• Banknote Authentication (1,372 points, 4 features)

Metric: Accuracy.
Sampling: 100 holdouts (50% training / 50% testing).

Classifiers:
• 1-NN (Nearest Neighbor) with Euclidean distance
• Naive Bayes with normal kernel smoothing density estimate
• Pruned Decision Tree with Gini's diversity index and a minimum of 10 observations per leaf
• Linear Discriminant
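The sampling protocol (100 random 50/50 holdouts, accuracy averaged over repetitions) can be sketched with a minimal 1-NN on synthetic stand-in data; everything below (data, helper name) is ours, not the paper's setup:

```python
import numpy as np

rng = np.random.default_rng(42)

def nn1_accuracy(Xtr, ytr, Xte, yte):
    """1-NN with Euclidean distance: each test point takes the label
    of its closest training point."""
    d2 = ((Xte[:, None, :] - Xtr[None, :, :]) ** 2).sum(axis=2)
    pred = ytr[d2.argmin(axis=1)]
    return (pred == yte).mean()

# Toy two-class data (a stand-in for the UCI datasets)
X = np.vstack([rng.normal(0, 1, (50, 4)), rng.normal(3, 1, (50, 4))])
y = np.repeat([1, 2], 50)

accs = []
for _ in range(100):                           # 100 random holdouts
    idx = rng.permutation(len(y))
    tr, te = idx[:len(y) // 2], idx[len(y) // 2:]   # 50% train / 50% test
    accs.append(nn1_accuracy(X[tr], y[tr], X[te], y[te]))
print(np.mean(accs))
```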
Banknote (1-NN)
Accuracy for 2 features: 0.852 (PCA) and 0.959 (Proposed).
[Plot: accuracy vs. number of extracted features]
Banknote (Decision Tree)
Accuracy for 2 features: 0.820 (PCA) and 0.947 (Proposed).
[Plot: accuracy vs. number of extracted features]
Banknote (Naive Bayes)
Accuracy for 1 feature: 0.695 (PCA) and 0.891 (Proposed).
[Plot: accuracy vs. number of extracted features]
Banknote (Linear Discriminant)
Accuracy for 1 feature: 0.614 (PCA) and 0.886 (Proposed).
[Plot: accuracy vs. number of extracted features]
Climate (1-NN)
Accuracy for 10 features: 0.873 (PCA) and 0.901 (Proposed).
[Plot: accuracy vs. number of extracted features]
Climate (Naive Bayes)
Accuracy for 10 features: 0.916 (PCA) and 0.922 (Proposed).
[Plot: accuracy vs. number of extracted features]
Climate (Decision Tree)
Accuracy for 4 features: 0.877 (PCA) and 0.887 (Proposed).
[Plot: accuracy vs. number of extracted features]
Climate (Linear Discriminant)
Accuracy for 12 features: 0.923 (PCA) and 0.944 (Proposed).
[Plot: accuracy vs. number of extracted features]
Hypothesis test
The proposed method has significantly higher accuracy than PCA:

Climate
• from 2 to 16 extracted features (1-NN)
• from 3 to 11 extracted features (Naive Bayes)
• from 2 to 4 extracted features (Decision Tree)
• from 4 to 16 extracted features (Linear Discriminant)

Banknote
Conclusion
The features selected by PCA are those with the highest eigenvalues ($\lambda_i$), while the features selected by the proposed method are those with the highest discriminant scores ($s_i$).
The proposed method achieves higher accuracy than PCA for a smaller number of features.
Future work:
• extend the method to more than 2 classes