In ML, supervised learning refers to models and algorithms that map from the input features to the corresponding output. The algorithm is given many samples where the prespecified target variable is known, so that the model may learn from this data. After training, the model will receive new inputs and determine the correspondent target based on the previous training data. Thus, a supervised learning model aims to generate a prediction of the correct target given unseen input data.
Supervised learning is further separated into two subcategories, classification and regression prob-lems, based on the alphabet of the output variables. Classification predicts unordered discrete class labels while regression consists of predicting continuous output variables. Classification algorithms will be the focus of this work.
2.1.1 General Approach to Classification
The learning process is divided in two phases: training and testing.
During training a classification model is fit to a previously established set of data classes. This classification algorithm builds the classifier by analysing a set of data made up of samples from the dataset under analysis and their corresponding class labels. The individual samples making up the training set can also be referred to as examples, instances or data points.
In the second step, during test, the model is used to predict class labels for a untouched set of data.
In this phase, if the training set was used to measure the classifier’s performance, the estimate would be too optimistic. This happens because the classifier overfits the data, i.e., in the training phase the model learns the anomalies and details of the training data that do not represent the overall data and,
therefore, impacts negatively the performance of the model on new data. To overcome the overfitting problem, a test set is used. This set of data is independent of the training samples, meaning that it is not used to build the classifier .
2.1.2 Classification Algorithms
There are many different supervised learning methods used for classification, such as neural networks, nearest neighbor classifiers, support vectors machines and decision trees. Among these techniques, decision tree based algorithms are one of the most commonly used methods in classification problems due to the easy interpretability achieved by the hierarchical placement of decisions. In general, the learning and classification steps of decision trees are fast and the classifiers have good performance.
As the name suggests, this technique uses a tree-like model of decisions to predict the target value. Just like a conventional tree, a decision tree is composed of internal decision nodes, branches and terminal nodes (or leaf nodes). Each decision node denotes a test on an attribute with discrete results, so each branch represents an outcome of the test. Depending on the outcome value, one of the branches is taken. This process is done by traveling along the tree from the root and recursively partitioning the data until reaching a leaf node, which contains the output value. Thus, for a given data point, the target value is found by going through the tree and making decisions based on the feature values .
In a classification tree (a decision tree used for classification) to split the nodes at the most informative features, one must use an impurity measure. A split is said to be pure if after the split all the instances following a specific branch belong to the same class. In this case, there is no need to split any further and a leaf node is added with the class.
There are several functions used to measure the impurity of the splits, such as ID3 , its extension C4.5 , and the classification and regression trees (CART) algorithm . The CART algorithm will be the focus since it provides a foundation for important algorithms like boosted decision trees.
The decision trees produced by CART are strictly binary, i.e., contain exactly two branches for each decision node. This algorithm uses the Gini impurity as splitting condition, which can be understood as
a criterion to minimize the probability of misclassification and is given by,
wherep(i)is the proportion of the samples that belong to classi, andmis the number of classes .
The attribute having the lower Gini impurity value is chosen as the root node. This process is repeated for another leaf node until all leaves are pure or all features have been used.
Ensemble learning is a ML technique that combines several classifiers (e.g., decision trees) into a meta-classifier in order to improve the generalization performance and robustness over each individual classifier alone.
Bootstrap aggregation , also called bagging, is an ensemble algorithm which fits different instances of the base classifier, each on random subsets of the original training set, and then combines their individual prediction using a majority voting process. Majority voting means that the class label selected has been predicted by the majority of the classifier’s instances.
Majority Voting Prediction
Figure 2.2: Bagging.
By using different sets of training data and introducing randomization into the algorithm, the model becomes less likely to generate errors. Therefore, the performance with the test data is improved and helps to avoid the overfitting problem of the base estimators.
Random Forest is a bagging-based algorithm that creates a collection of decision trees, i.e., a forest.
However, instead of sampling random subsets of the original training set, only a random subset of fea-tures is selected to build each decision tree. Sampling over feafea-tures ensures that the different trees do not fit the exact same information and, therefore, reduces the correlation between the different predic-tions.
Thus, random forest combines the bagging technique and the concept of feature subspace selection to improve the model’s performance.
As it was mentioned before, the bagging algorithm can be an effective method to reduce the variance of a model. However, bagging is not a good algorithm to be fitted in models that are too simple to capture the trend in the data, meaning that is ineffective in reducing the model bias.
Boosting, originally proposed by Schapire , is an ensemble technique which combines weak models that are no longer fitted independently from each other. The main idea behind boosting is to sequentially fit models by minimizing the errors from the previous models, that is, to let the simple base classifiers learn from misclassified training samples and, consequently, improve the performance of the combined estimator.
Figure 2.3: Boosting.
Gradient boosting, also known as gradient boosting machine (GBM) , is a boosting algorithm in which the main purpose is to minimize a loss function by adding sequential weak learners using the gradient descent technique. Gradient descent is an iterative optimization algorithm for finding a local minimum of a differentiable function.
Gradient boosting is a stage-wise additive model that creates learners during the learning process.
At each particular iteration, a weak learner is fitted and its predictions are compared with the correct expected outcome. The difference between these values represents the error of the model, and can be used to compute the gradient of the loss function. Then, the gradient value is used to understand the direction in which the model parameters need to be changed in order to minimize the error in the next training iteration.
In gradient boosting, the sample distribution is not modified because the weak learns train on the pseudo-residuals (the remaining residual errors of the ensemble so far and the actual output).
There-fore, the algorithm does not optimizes the model parameters directly but the boosted model predictions instead. Moreover, the gradients are added to the training process by fitting the remaining weak models to these values.
XGBoost  stands for extreme gradient boosting and implements machine learning algorithms under the gradient boosting framework. It is an optimized distributed gradient boosting library designed to be highly efficient, flexible and portable.
Since XGBoost derives from GBM, there are many similarities between the algorithms and their tuning parameters. Both of them are ensemble methods based on the CART principle using the gradient descent architecture. However, XGBoost improves upon the base GBM framework through system optimization and algorithmic enhancements, characterized by regularization and sparsity awareness.
In the system optimization it is worth highlighting the parallelization, tree pruning and the hardware optimization. XGBoost approaches the process of sequential tree building using parallelized and dis-tributed implementation. This extension improves algorithmic performance by making the learning pro-cess faster which enables quicker model exploration. Regarding the tree pruning, XGBoost uses a depht-first approach which is achieved by definind a new paramenter, maximum tree depth for base learners, instead of the stopping criterion for tree splitting. This approach improves computational per-formance significantly. To solve the hardware optimization problem, this algorithm introduces two key concepts. It uses cache awareness by allocating internal buffers in each thread (where the gradient statistics can be stored) and the out-of-core computing, which optimizes the available disk space and maximizes its usage when handling big datasets that do not fit into memory.
XGBoost provides regularization parameters that help to reduce model complexity and to prevent overfitting. The first, gamma, is used to set the minimum reduction in loss required to make a further split on a leaf node of the tree. When this parameter is specified, the algorithm will grow the tree to the maximum depth defined but then prune the tree to find and remove splits that do not meet the specified gamma. Thus, this parameter controls the complexity of a given tree. The other parameters, alpha and lambda represent L1 and L2 regularization, respectively. L1 regularization adds a penalty which can yield sparse solutions. On the other hand, L2 regularization constrains the coefficient norm and keeps all the variables small.
XGBoost includes a sparsity-aware split finding algorithm that naturally handles different types of sparsity patterns in the data (e.g., features with missing values) more efficiently.