# Evaluation Metrics

From the document Predicting Passenger Connectivity in an Airline's Hub Airport. Aerospace Engineering (pages 31-34)

## 2.2 Imbalanced Data

### 2.2.2 Evaluation Metrics

The evaluation criterion is a key factor in assessing the classification performance of the model. A common method for determining the performance of a classifier is the confusion matrix (Figure 2.6).

Figure 2.6: Confusion matrix for a binary classification problem.

In a confusion matrix, true negatives (TN) is the number of negative instances correctly classified as negative, false negatives (FN) is the number of positive instances incorrectly classified as negative, false positives (FP) is the number of negative instances incorrectly classified as positive, and true positives (TP) is the number of positive instances correctly classified as positive.

The most common evaluation metrics derive from the confusion matrix. The most frequently used of these is accuracy:

$$\text{accuracy} = \frac{TP + TN}{TP + FP + TN + FN} \tag{2.5}$$

With imbalanced datasets, the evaluation of a classifier's performance must take the class distribution into account. Accuracy can be misleading here, because a classifier that always predicts the majority class already scores highly, so it is not a good metric for measuring the performance of classifiers in these cases. Alternatively, other metrics can be computed from the confusion matrix. Some of these metrics are summarized in Table 2.1.
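To make the point about accuracy concrete, the following minimal sketch evaluates a trivial classifier that always predicts the negative class. The class counts (950 negatives, 50 positives) are hypothetical and chosen only for illustration; they are not taken from the dataset studied in this work.

```python
# Hypothetical, illustrative class counts (not from the original dataset).
n_negative = 950
n_positive = 50

# A trivial classifier that always predicts the negative class.
tp, fn = 0, n_positive   # every positive instance is missed
tn, fp = n_negative, 0   # every negative instance is classified correctly

accuracy = (tp + tn) / (tp + fp + tn + fn)
recall = tp / (tp + fn)

print(f"accuracy = {accuracy:.2f}")  # accuracy = 0.95
print(f"recall   = {recall:.2f}")    # recall   = 0.00
```

The accuracy looks high even though not a single positive instance is detected, which is why metrics that focus on the positive class are preferred.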

Table 2.1: Classification performance metrics based on the confusion matrix.

| Metric | Formula |
|---|---|
| precision | $\frac{TP}{TP + FP}$ |
| recall, sensitivity, true positive rate (TPR) | $\frac{TP}{TP + FN}$ |
| false positive rate (FPR) | $\frac{FP}{FP + TN}$ |

From Table 2.1, precision is the proportion of positive class predictions that actually belong to the positive class, while recall is the proportion of positive class instances in the dataset that the classifier predicted as positive. In other words, and in the context of this problem, precision measures what proportion of predicted missed connections was actually correct, and recall measures what proportion of real missed connections was correctly identified. The G-mean measures the balance of the classification performance over the negative and positive classes. A low G-mean value indicates poor classifier performance. The F1-score is the harmonic mean of precision and recall.
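The metrics discussed above can be computed directly from the four confusion matrix counts. The sketch below is illustrative only: the function name `confusion_metrics` and the counts are hypothetical, all denominators are assumed to be non-zero, and the G-mean is assumed to be the geometric mean of the true positive rate and the true negative rate, which is one common definition and is not confirmed by the surviving text.

```python
import math

def confusion_metrics(tp, fp, tn, fn):
    """Compute confusion-matrix metrics (assumes non-zero denominators)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)   # also called sensitivity or TPR
    fpr = fp / (fp + tn)
    tnr = tn / (tn + fp)      # true negative rate (specificity)
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    g_mean = math.sqrt(recall * tnr)  # assumed definition: sqrt(TPR * TNR)
    return {
        "precision": precision,
        "recall": recall,
        "fpr": fpr,
        "f1": f1,
        "g_mean": g_mean,
    }

# Hypothetical counts for an imbalanced problem with 50 positives.
m = confusion_metrics(tp=30, fp=10, tn=940, fn=20)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```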

Instead of simply predicting a sample as positive or negative, some classifiers, called scoring classifiers, assign each instance a numeric score expressing how strongly it belongs to the positive class. Instances with a higher score are more likely to be classified as positive. The classifications are made by applying a threshold to the score, and the choice of this threshold determines the trade-off between the types of error the predictions make.

A commonly used measure for evaluating the predictive performance of scoring classifiers is the receiver operating characteristic (ROC) curve. A ROC curve is a graphical evaluation metric which does not depend on a specific threshold.

In this graph, the x-axis represents the FPR and the y-axis the TPR. To build the plot, the instances are sorted in decreasing order of their score for the positive class, and the threshold is then varied from the highest score (most restrictive) to the lowest one (least restrictive). Each threshold value yields one point in ROC space, given by the FPR and TPR obtained at that threshold. These points can be interpolated to approximate the curve.
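The construction described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the function name `roc_points` and the scores and labels are hypothetical, labels are 1 for positive and 0 for negative, and both classes are assumed to be present.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) points obtained by lowering the threshold step by step."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    # Sort the instances by decreasing score.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    # Threshold above every score: nothing is predicted positive.
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only once all instances tied at this score are counted.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / n_neg, tp / n_pos))
    return points

# Hypothetical scores and true labels.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0, 0]
points = roc_points(scores, labels)
for fpr, tpr in points:
    print(f"FPR = {fpr:.2f}, TPR = {tpr:.2f}")
```

The first point is the origin (nothing predicted positive) and the last is the upper right corner (everything predicted positive).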

In ROC space, a good classifier should be as close to the upper left corner as possible (see Figure 2.7(a)). This point, FPR=0 and TPR=1, corresponds to perfect classification. The lower left corner (i.e., the origin of the graph) corresponds to a classifier that always predicts the negative class, whereas the upper right corner corresponds to one that always predicts the positive class. The diagonal connecting these two points indicates random performance. Points below this diagonal indicate a performance worse than random, hence all the points of a useful ROC curve should lie above this line.

The area under the curve (AUC) is a measure used to evaluate the overall performance of scoring classifiers. AUCROC can be interpreted as the probability that the model will rank a randomly chosen positive sample higher than a randomly chosen negative one. Its value ranges from 0 to 1.

The AUCROC of the random performance is 0.5, so it is expected that for any useful classifier this value is higher than 0.5.
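The ranking interpretation of AUCROC can be checked numerically by comparing every positive sample with every negative sample. The sketch below is illustrative: the function name `auc_by_ranking` and the data are hypothetical, and counting a tie as half a win is an assumed (though common) convention.

```python
def auc_by_ranking(scores, labels):
    """Fraction of (positive, negative) pairs in which the positive is ranked higher."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # assumed convention: a tie counts as half a win
    return wins / (len(pos) * len(neg))

# Hypothetical scores and true labels (1 = positive, 0 = negative).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0, 0]
auc = auc_by_ranking(scores, labels)
print(f"AUCROC = {auc:.3f}")  # AUCROC = 0.889
```

Here 8 of the 9 positive-negative pairs are ordered correctly, giving a value well above the 0.5 expected from random ranking.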

Although ROC curves are widely used to evaluate classifiers in the presence of imbalanced data, some researchers argue that they may be deceptive with respect to conclusions about the reliability of classification performance. Precision-recall (PR) curves, on the other hand, give a more accurate picture of classification performance in this setting.

The PR curves (Figure 2.7(b)) can be obtained in a similar way to the ROC curves, but with recall on the x-axis and precision on the y-axis. In PR space, good classifiers should be as close as possible to the upper right corner, since this is the point that represents the best precision and recall trade-off. Random performance corresponds to a precision equal to the ratio of positive class samples to all samples.

Figure 2.7: Examples of ROC and PR curves. (a) ROC curve: false positive rate (x-axis) against true positive rate (y-axis). (b) PR curve: recall (x-axis) against precision (y-axis). Both panels mark the random-performance line.

As in ROC space, it is also possible to compute the area under the PR curve, AUCPR. However, in PR space this value does not have a probabilistic interpretation as it does in ROC space. The AUCPR value of a random classifier is expected to be close to the ratio of positive samples in the test set.
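The points of a PR curve, and the baseline set by the positive class ratio, can be computed with the same threshold sweep used for ROC curves. This is a minimal sketch: the function name `pr_points` and the data are hypothetical, and at least one positive sample is assumed.

```python
def pr_points(scores, labels):
    """Return (recall, precision) points obtained by lowering the threshold."""
    n_pos = sum(labels)
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    points = []
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only once all instances tied at this score are counted.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((tp / n_pos, tp / (tp + fp)))
    return points

# Hypothetical scores and true labels (1 = positive, 0 = negative).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0, 0]
pr = pr_points(scores, labels)
baseline = sum(labels) / len(labels)  # ratio of positive samples

for recall, precision in pr:
    print(f"recall = {recall:.2f}, precision = {precision:.2f}")
print(f"random-performance baseline = {baseline:.2f}")
```

At the lowest threshold every sample is predicted positive, so recall is 1 and precision falls to the positive class ratio, which is the baseline value.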

## 2.3 Encoders

Many statistical and machine learning algorithms require all variables to be numeric. This means that if the dataset contains categorical data, one must encode it to numeric values before fitting and evaluating a model. This process is called categorical encoding.

There are many different approaches for handling categorical variables. Two of the most widely used techniques are ordinal encoding and one-hot encoding.

In ordinal encoding, each category is assigned an integer value from 0 to N-1 (N is the number of categories for the feature). This results in a single column of integers per feature. For example, if a dataset feature was “colour” and the values were “blue”, “green” and “red”, the encoded values would be 0, 1 and 2, respectively. In this scenario, the colour names have no natural order, but once the encoding is performed a learning algorithm would assume a relationship between colours, such as red being larger than green, and green larger than blue. Although this assumption is incorrect, the algorithm could still produce useful results. However, those results would not be optimal. This type of encoding is only really appropriate for features with a known order between the categories.
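The colour example can be reproduced with a short sketch. Assigning the integers in alphabetical order is an assumption made here so that the output matches the example in the text; the sample values are hypothetical.

```python
# Hypothetical sample values of the "colour" feature.
colours = ["red", "blue", "green", "blue"]

# Assumption: categories are numbered in alphabetical order.
categories = sorted(set(colours))                     # ['blue', 'green', 'red']
mapping = {category: index for index, category in enumerate(categories)}

encoded = [mapping[colour] for colour in colours]
print(mapping)   # {'blue': 0, 'green': 1, 'red': 2}
print(encoded)   # [2, 0, 1, 0]
```

The encoded column now implies red > green > blue, an ordering that does not exist in the original data.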

To overcome this problem, one-hot encoding is commonly used. The idea behind this approach is to create multidimensional features based on the number of unique values in the categorical feature.

Given the same “colour” feature from the previous example, binary values can be used to indicate the particular colour of a sample, i.e. a red sample can be encoded as red=1, green=0, blue=0. Thus, each sample in the dataset is replaced with a vector and, in this example, one column becomes three.

This technique becomes very difficult to handle for high cardinality categorical variables, since it generates too many features.
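A minimal sketch of one-hot encoding for the same “colour” feature is given below. The sample values are hypothetical and the alphabetical column order is an assumption made for illustration.

```python
# Hypothetical sample values of the "colour" feature.
colours = ["red", "blue", "green", "blue"]

# One output column per unique category (alphabetical order assumed).
categories = sorted(set(colours))  # ['blue', 'green', 'red']

# Each sample becomes a binary vector with a single 1 in its category's column.
one_hot = [[1 if colour == category else 0 for category in categories]
           for colour in colours]

print(categories)
for colour, row in zip(colours, one_hot):
    print(colour, row)
```

The number of generated columns equals the number of unique categories, which is why the approach scales poorly to high cardinality features.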

An alternative and more informative encoding method is target encoding, which transforms the categorical variables into quasi-continuous numerical data. The general idea of this technique is to replace each categorical value with a blend of the posterior probability of the target given that particular categorical value, $p(Y \mid X = x_i)$, and the prior probability of the target over all the training data, $p(Y)$. The blending is controlled by a regularization parameter that depends on the sample size:

$$S_i = p(Y \mid X = x_i) \cdot \lambda(N_i) + p(Y) \cdot \left(1 - \lambda(N_i)\right) \tag{2.6}$$

where $N_i$ is the size of the sample $\{X \mid X = x_i\}$ and $\lambda$ is a function of the sample size. The larger the sample size, the more the estimate is weighted towards the posterior probability of the target given $X = x_i$. On the other hand, the smaller the sample size, the more the estimate is weighted towards the prior probability of the target.
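Equation (2.6) can be sketched for a binary target as follows. The text does not specify the form of λ, so the logistic function of the sample size used here, along with its parameters `k` and `f`, the function name `target_encode` and the data, are assumptions made only for illustration.

```python
import math
from collections import defaultdict

def target_encode(categories, targets, k=2.0, f=1.0):
    """Blend per-category posterior and global prior as in Eq. (2.6).

    Assumption: lambda(N) = 1 / (1 + exp(-(N - k) / f)); the source only
    states that lambda is a function of the sample size.
    """
    prior = sum(targets) / len(targets)   # p(Y)
    counts = defaultdict(int)
    sums = defaultdict(float)
    for category, y in zip(categories, targets):
        counts[category] += 1
        sums[category] += y
    encoding = {}
    for category, n in counts.items():
        posterior = sums[category] / n    # p(Y | X = x_i)
        lam = 1.0 / (1.0 + math.exp(-(n - k) / f))
        encoding[category] = posterior * lam + prior * (1.0 - lam)
    return encoding

# Hypothetical data: category "a" is frequent, category "b" appears once.
categories = ["a"] * 8 + ["b"]
targets = [0, 0, 0, 0, 0, 0, 1, 1] + [1]
encoding = target_encode(categories, targets)
prior = sum(targets) / len(targets)
print(encoding, prior)
```

With these values the frequent category “a” is encoded close to its own positive rate of 0.25, while the single-sample category “b” is pulled strongly from 1.0 towards the prior of about 0.33.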
