Feature Selection - The Promoted ML Investigation Tool

5.4 The Promoted ML Investigation Tool

5.4.3 Feature Selection

Selecting the relevant feature subset is crucial for any machine learning algorithm.

Reducing the dataset improves the readability of the problem, shortens the training time, reduces the dimensionality of the exploration space, and decreases overfitting.

5.4.3.1 Variance Threshold

This technique removes features with low variability from the dataset (e.g., removing features with constant values). Further, the variance threshold method targets only the features (i.e., microarchitectural values) and not the labels (i.e., soft error analysis). If a feature has a low variance, usually because it holds a constant value, it carries a reduced amount of information and can be removed.
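The filter described above can be sketched with Scikit-learn's VarianceThreshold. This is a minimal illustration, not the tool's actual code: the feature matrix and its values are hypothetical, and the zero threshold is an assumption that removes only strictly constant columns.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Hypothetical feature matrix: rows are runs, columns are microarchitectural features.
# Column 0 is constant (e.g., a page size), columns 1 and 2 vary between runs.
X = np.array([
    [4096.0, 10.0, 0.5],
    [4096.0, 20.0, 0.1],
    [4096.0, 15.0, 0.9],
    [4096.0, 30.0, 0.4],
])

# threshold=0.0 (an assumption) drops only columns whose variance is zero.
selector = VarianceThreshold(threshold=0.0)
X_reduced = selector.fit_transform(X)   # only features are passed, never the labels
kept = selector.get_support()           # boolean mask of the retained columns

print(kept)
print(X_reduced.shape)
```

A higher threshold would also discard nearly constant features, at the risk of removing low-magnitude parameters unless the data is scaled first.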

5.4.3.2 Principal Component Analysis

Figure 5.2: Principal Component Analysis. Panels: (a) Linearly Dependent Data; (b) Linearly Dependent Data.

Principal component analysis (PCA) is a mathematical method that reduces the dimensionality of a dataset by providing a set of orthogonal vectors that indicate the directions of maximum variance. Figure 5.2 shows random collections of points in orange, with black arrows representing the PCA components. The arrows indicate the direction and magnitude of the dataset variation when the data is linearly dependent (Figures 5.2a and 5.2b), while a random dataset yields an arbitrary direction (Figure 5.2c). This work searches for the microarchitectural parameters with a significant impact on application reliability. In this context, PCA provides a fast approach to find linearly dependent variables that maximize a particular direction. By observing the strength and direction (i.e., angle) of the components, it is possible to search for the ones that highlight a strong correlation with the fault injection outcome. A greater X-axis variation (i.e., soft error) carries more relevant information (Figure 5.2a) than a group with higher correlation (Figure 5.2b) with a high-angle principal component (i.e., close to 90° or 0°).
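As an illustration of reading the strength and angle of a principal component, the sketch below applies Scikit-learn's PCA to synthetic two-dimensional data. The data, the variable names, and the choice of folding the angle into the range 0° to 90° are assumptions made for this example only.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Hypothetical data: x stands for the soft error metric, y for one feature.
x = rng.normal(size=200)
y = 0.5 * x + rng.normal(scale=0.1, size=200)   # linearly dependent on x
data = np.column_stack([x, y])

pca = PCA(n_components=2).fit(data)
first = pca.components_[0]                       # unit vector of maximum variance
strength = pca.explained_variance_ratio_[0]      # share of variance it captures

# Angle of the first component relative to the X axis, folded into [0, 90]
# because the sign of a principal component is arbitrary.
angle = np.degrees(np.arctan2(first[1], first[0])) % 180.0
angle = min(angle, 180.0 - angle)

print(f"strength={strength:.3f} angle={angle:.1f} degrees")
```

A strength close to 1 indicates that almost all variation lies along a single direction, while a strength close to 0.5 corresponds to the arbitrary direction obtained from random data.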

5.4.3.3 Linear Regression

Linear regression models the relationship between two scalar variables: one dependent variable and one explanatory variable. This technique searches for the best-fitting straight line through the data points, usually with a least squares approach. By analyzing the model angle and accumulated error, it is possible to rank the features (Y) with a stronger impact on the soft error assessment (X).

$$ Y = \alpha X + \beta \tag{5.1} $$
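A least squares fit of Equation 5.1 can be sketched with NumPy as follows. The synthetic data is hypothetical, and interpreting the accumulated error as the sum of squared residuals is an assumption, since the text does not define it.

```python
import numpy as np

rng = np.random.default_rng(1)

# X: soft error metric, Y: one feature (names follow Equation 5.1).
X = rng.uniform(0.0, 1.0, size=100)
Y = 2.0 * X + 0.5 + rng.normal(scale=0.05, size=100)

# Least squares fit of Y = alpha * X + beta.
alpha, beta = np.polyfit(X, Y, deg=1)

# Slope expressed as an angle in degrees.
angle = np.degrees(np.arctan(alpha))

# Accumulated error, assumed here to be the sum of squared residuals.
residuals = Y - (alpha * X + beta)
accumulated_error = float(np.sum(residuals ** 2))

print(f"alpha={alpha:.2f} beta={beta:.2f} angle={angle:.1f} error={accumulated_error:.3f}")
```

Features can then be ranked by combining the angle with the accumulated error, where a small error indicates that the straight line describes the data well.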

5.4.3.4 Correlation Coefficient

A correlation coefficient measures the statistical dependence between two variables. In other words, this function describes the relationship between two variables through a score between +1 and -1, where 0 represents the absence of correlation, a magnitude of 1 a complete correlation, and the sign gives the correlation direction. Pearson's is the most widely adopted correlation coefficient, and it measures linear relationships between two variables. Pearson's correlation coefficient is the covariance of the two variables divided by the product of their standard deviations.

$$ \rho = \frac{\operatorname{cov}(X, Y)}{\sigma_X \sigma_Y} $$

Pearson's correlation coefficient

Further, Pearson's coefficient can be affected by strong outliers in a given dataset. For this purpose, Charles Spearman proposed a nonparametric measure of rank correlation to assess the monotonic relationship between two variables. A monotonic relationship means that two variables tend to move in the same relative direction, not necessarily at an equal rate. By using ranked variables, Spearman's coefficient presents a more robust solution to skewed points.

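$$ \rho_s = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)} $$

Spearman's correlation coefficient, where d_i is the difference between the two ranks of observation i and n is the number of observations

The coefficient selection depends on the raw dataset and its distribution. It is possible to filter features from a dataset by analyzing the ones with higher coefficients.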

Figure 5.3 shows four distinct datasets of 100 points, evaluated with Spearman's and Pearson's coefficients. First, Figure 5.3a shows random points with no dependence. Both coefficients perform similarly in linear dependence scenarios (Figure 5.3b), while Pearson's suffers from strong outliers (Figure 5.3c). Figure 5.3d displays a monotonic dataset, where Spearman's has a better score.
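The outlier and monotonic scenarios discussed above can be reproduced with a small sketch based on SciPy's pearsonr and spearmanr. The datasets below are synthetic stand-ins chosen for this example and are not the ones plotted in Figure 5.3.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

rng = np.random.default_rng(2)
x = rng.uniform(0.0, 1.0, size=100)

# Monotonic but strongly non-linear relationship.
y_monotonic = np.exp(8.0 * x)

# Linear relationship disturbed by a single strong outlier.
y_outlier = x + rng.normal(scale=0.02, size=100)
y_outlier[0] = 50.0

for name, y in [("monotonic", y_monotonic), ("outlier", y_outlier)]:
    r_pearson = pearsonr(x, y)[0]     # linear dependence
    r_spearman = spearmanr(x, y)[0]   # rank-based, monotonic dependence
    print(f"{name}: Pearson={r_pearson:.2f} Spearman={r_spearman:.2f}")
```

In both cases the rank-based coefficient stays close to 1, whereas Pearson's value drops because it depends on the magnitudes of the points rather than on their order.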

Figure 5.3: Principal Component Analysis. Panel: (a) Random.

5.4.3.5 Recursive Feature Elimination

Recursive Feature Elimination (RFE) uses a given external estimator to prune a dataset down to a target number of features. It employs the estimator coefficients to recursively remove features, create a model, and calculate its accuracy. Scikit-learn provides an RFE wrapper method for an arbitrary estimator. This work explored linear regression and the Huber Regressor, among other estimators. The quality of the selection depends on the estimator's capability to classify the relevance of different features for the model accuracy.
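A minimal sketch of the Scikit-learn RFE wrapper with a linear regression estimator is given below. The synthetic dataset, the number of features to keep, and the step size are assumptions for illustration; HuberRegressor from sklearn.linear_model could be passed as the estimator in the same way.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)

# Hypothetical dataset: 5 features, only columns 0 and 3 drive the target.
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(scale=0.1, size=200)

# RFE refits the estimator and drops the feature with the smallest
# coefficient magnitude until n_features_to_select features remain.
rfe = RFE(estimator=LinearRegression(), n_features_to_select=2, step=1)
rfe.fit(X, y)

print(rfe.support_)   # boolean mask of the selected features
print(rfe.ranking_)   # 1 marks selected features; larger values were dropped earlier
```

Because the ranking relies on coefficient magnitudes, the features should be on comparable scales before the elimination starts.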

5.4.3.6 Euclidean Distance

In the original data, or due to transformations, columns in the dataframe may have (almost) identical values. Removing such features improves the training time without reducing the information in the dataset. This method computes the Euclidean distance between two features and removes one of them if they are closer than a predefined threshold.
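The following sketch shows one possible way to implement this filter with NumPy. The function name drop_near_duplicates, the threshold value, and the policy of keeping the first column of each near-duplicate pair are hypothetical choices, since the text does not state which of the two features is removed.

```python
import numpy as np

def drop_near_duplicates(X, threshold):
    """Return the indices of the columns to keep.

    A column is dropped when its Euclidean distance to an already kept
    column is below `threshold` (keeping the earlier column is an assumption).
    """
    kept = []
    for j in range(X.shape[1]):
        if all(np.linalg.norm(X[:, j] - X[:, k]) >= threshold for k in kept):
            kept.append(j)
    return kept

# Hypothetical data: column 1 is almost identical to column 0.
X = np.array([
    [1.0, 1.0, 5.0],
    [2.0, 2.0, 1.0],
    [3.0, 3.001, 4.0],
])
kept = drop_near_duplicates(X, threshold=0.01)
print(kept)
```

The distance depends on the magnitude of the values, so a single threshold is only meaningful when the columns have been scaled to a common range.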

5.4.3.7 Soft Error Score

Microarchitectural information (i.e., features) comes in multiple forms and represents a wide range of parameters, such as cache misses, number of float instructions, or virtual memory page size. Each one has a different distribution: for example, the virtual memory page size has few possible constant values, while the number of float instructions ranges from zero to billions. Sometimes, a parameter is constant for one application while it fluctuates in another benchmark execution. Figure 5.4 shows six common feature arrangements found in the raw data from the gem5 microarchitectural statistics. For example, Figure 5.4a displays a random value distribution. Figures 5.4b and 5.4c exhibit two linear relationships, the first one being stronger, while Figures 5.4d and 5.4e display constant features (i.e., independent of the soft error). Finally, Figure 5.4f demonstrates the average behavior of raw features, mixing some linear behavior, outliers, and constant values.

Applying a single feature filtering algorithm does not provide the target results (i.e., the microarchitectural features with higher impact on the system soft error reliability). For instance, the correlation coefficients perform poorly on noisy datasets (Figure 5.4f), where the relationships exist among other types of data (e.g., constant values). PCA can find the maximum growth direction in complex features. However, it can result in false positives depending on the data distribution (Figure 5.4e). Regression models provide a more robust solution without presenting a general solution. Further, applying multiple filters in sequence results in a small dataset with a significant number of false-positive solutions.

Figure 5.4: Different feature arrangements extracted from gem5. Each panel reports the Spearman and Pearson correlations and overlays three regression models (poly, linear, rbf). Panels include: (b) Strong linear dependency (Spearman 0.84, Pearson 0.84); (e) Multiple constant values (Spearman -0.38, Pearson -0.11); (f) Usual feature shape (Spearman -0.06, Pearson 0.27). Source: The Authors

For this purpose, this work proposes a filter score (i.e., from 0 to 1) to measure the feature quality according to the target soft error problem. This score results from the simultaneous execution of several algorithms; in other words, the PCA, regression, and other techniques are computed in a single step. These outcomes are standardized to values between 0 and 1 to remove the parameter magnitude from the equation. The score computes the PCA first component, Pearson's correlation coefficient, variance, dispersion score, and linear regression. Pearson's is preferable due to its lower sensitivity to false-positive linear relationships. Linear regression models with a 45° angle (from the origin) represent the optimal correlation between two dependent variables; increasing or decreasing this angle indicates a weaker correlation.

The selection process computes the Soft Error Score for all columns in the target dataframe. The tool ranks the features by score and selects the first N most significant ones. This technique provides both filtering and ranking in a single step, reducing the number of feature selection steps.
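The score and ranking described above could be sketched as follows. This is a hypothetical composite and not the authors' formula: the text does not specify how the individual outcomes are combined, so the unweighted mean, the min-max scaling, the mapping of each term to the range 0 to 1, and the function names are all assumptions. The dispersion score is omitted because it is not defined in the text.

```python
import numpy as np

def minmax(v):
    """Scale a vector to [0, 1]; a constant vector maps to all zeros."""
    span = v.max() - v.min()
    return (v - v.min()) / span if span > 0 else np.zeros_like(v, dtype=float)

def soft_error_score(feature, soft_error):
    """Hypothetical composite score in [0, 1] (assumed unweighted mean)."""
    f, s = minmax(feature), minmax(soft_error)
    if f.std() == 0 or s.std() == 0:
        return 0.0  # constant data carries no information
    pearson = abs(np.corrcoef(s, f)[0, 1])
    # Share of variance on the first principal component, mapped from [0.5, 1] to [0, 1].
    eigvals = np.linalg.eigvalsh(np.cov(s, f))
    pca_strength = float(np.clip(2.0 * (eigvals[-1] / eigvals.sum() - 0.5), 0.0, 1.0))
    # Closeness of the regression angle to 45 degrees on the scaled data.
    alpha = np.polyfit(s, f, deg=1)[0]
    angle = np.degrees(np.arctan(abs(alpha)))
    angle_term = 1.0 - abs(angle - 45.0) / 45.0
    # 0.25 is the largest possible variance of values bounded to [0, 1].
    variance_term = min(f.var() / 0.25, 1.0)
    return float(np.mean([pearson, pca_strength, angle_term, variance_term]))

def select_top_n(features, soft_error, n):
    """Rank the columns by score (descending) and return the first n indices."""
    scores = [soft_error_score(features[:, j], soft_error)
              for j in range(features.shape[1])]
    order = np.argsort(scores)[::-1]
    return order[:n].tolist(), scores

# Hypothetical data: one linearly dependent, one random, and one constant feature.
rng = np.random.default_rng(4)
soft_error = rng.uniform(size=100)
features = np.column_stack([
    soft_error + rng.normal(scale=0.02, size=100),
    rng.uniform(size=100),
    np.full(100, 4096.0),
])
top, scores = select_top_n(features, soft_error, n=1)
print(top, [round(v, 2) for v in scores])
```

In this toy setup the linearly dependent column ranks first and the constant column receives a score of zero, which mirrors the filtering and ranking behavior described in the text.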

From the document Early Evaluation of Multicore Systems Soft Error Reliability Using Virtual Platforms (pages 117-123)