Performance and Competitiveness of Tree-Based Pipeline Optimization Tool

(1)

MDSAA

Mestrado em Ciência de Dados e Métodos Analíticos

Master Program in Data Science and Advanced Analytics

Performance and Competitiveness of Tree-Based Pipeline Optimization Tool

Marija Grjazniha

A thesis presented as a partial requirement for

obtaining a master’s degree in Data Science and

Advanced Analytics

(2)

Title: Performance and Competitiveness of Tree-Based Pipeline

Optimization Tool: Marija Grjazniha

MDSAA

22022

(3)

(4)

NOVA Information Management School

Instituto Superior de Estatística e Gestão de Informação Universidade Nova de Lisboa

Performance And Competitiveness of Tree-Based Pipeline Optimization Tool

by

Marija Grjazniha

The thesis presented as a partial requirement for obtaining a master’s degree in Data Science and Advanced Analytics with a specialization in Data Science

Advisor: Leonardo Vanneschi

November 2022

(5)

STATEMENT OF INTEGRITY

I hereby declare having conducted this academic work with integrity. I confirm that I have not used plagiarism or any form of undue use of information or falsification of results along the process leading to its elaboration. I further declare that I fully acknowledge the Rules of Con- duct and Code of Honor from the NOVA Information Management School.

Antwerp. November 29, 2022

(6)

ABSTRACT

Automated machine learning (AutoML) is the process of automating the entire machine learning workflow when applied to real-world problems. AutoML can increase data science productivity while keeping the same performance and accuracy, allowing non-experts to use complex machine learning methods. Tree-based Pipeline Optimization Tool (TPOT) was one of the first AutoML methods created by data scientists and is targeted to optimize machine learning pipelines using genetic programming. While still under active development, TPOT is a very promising AutoML tool. This Thesis aims to explore the algorithm and analyse its performance using real word data. Results show that evolution-based optimization is at least as accurate as TPOT initialization. The effectiveness of genetic operators, however, depends on the nature of the test case.

KEYWORDS

Genetic Programming; Tree-Based Pipeline Optimization Tool; Machine Learning

(7)

LIST OF FIGURES

Figure 2.1 Example of GP tree with four primitive nodes (white) and five terminal nodes (grey).

... 4 Figure 2.2 Example of GP crossover. Parent trees (a) and (b) produce offspring (c) and (d). .. 4 Figure 2.3 Example of GP mutation. ... 5 Figure 2.4 Example of a tree-based representation of TPOT pipeline. ... 10 Figure 4.1 Basic waveforms and 100 instances of each of the three classes. ... 20 Figure 4.2 Breast cancer training and testing accuracy distributions for each testing scenario

(n=30). ... 21 Figure 4.3 Best internal CV score distribution by generation for three scenarios. Breast cancer

dataset. ... 22 Figure 4.4 Number of times classifiers appear in the root of TPOT optimal pipelines for the

breast cancer dataset. ... 24 Figure 4.5 Test score distribution for each classifier of TPOT optimal pipelines for the breast

cancer dataset. ... 25 Figure 4.6 Waveform training and testing accuracy distributions for each testing scenario

(n=30). ... 25 Figure 4.7 Best internal CV score distribution by generation for three scenarios. Waveform

dataset. ... 26 Figure 4.8 Number of times classifiers appear in the root of TPOT optimal pipelines for the

Waveform dataset. ... 27 Figure 4.9 Test score distribution for each classifier of TPOT optimal pipelines for the

Waveform dataset. ... 28 Figure 4.10 Bioavailability training and testing negative root mean squared error distributions

for each testing scenario (n=30). ... 29 Figure 4.11 Best internal CV score distribution by generation for three scenarios.

Bioavailability dataset. ... 30 Figure 4.12 Number of times regressors appear in the root of TPOT optimal pipelines for the

bioavailability dataset. ... 31 Figure 4.13 Test score distribution for each regressor of TPOT optimal pipelines for the

bioavailability dataset. ... 31 Figure 4.14 Concrete training and testing negative root mean squared error distributions for

each testing scenario (n=30). ... 31 Figure 4.15 Best internal CV score distribution by generation for three scenarios. Concrete

dataset. ... 32 Figure 4.16 Number of times regressors appear in the root of TPOT optimal pipelines for the

concrete dataset. ... 33

(9)

Figure 4.17 Test score distribution for each regressor of TPOT optimal pipelines for the concrete dataset. ... 34 Figure 4.18 Selected individual size distribution by generation. Breast cancer. ... 37 Figure 4.19 Amount and performance of operations by genetic operators. Breast cancer. ... 38 Figure 4.20 Distribution of median accuracies by the generation before and after genetic

operation. Breast cancer. ... 38 Figure 4.21 Amount and performance of mutations by pipeline generation. Breast cancer. .. 39 Figure 4.22 Amount and performance of mutations by pipeline size increase. Breast cancer. 39 Figure 4.23 Amount and performance of mutation operations by type. Breast cancer. ... 40 Figure 4.24 Amount and performance of mutation operations by the added object. Breast

cancer. ... 40 Figure 4.25 Amount and performance of mutation operations by the changed object (short).

Breast cancer. ... 40 Figure 4.26 Amount and performance of mutation operations by the removed object. Breast

cancer. ... 41 Figure 4.27 Amount and performance of mutation operations by the level of operation. Breast

cancer ... 41 Figure 4.28 Amount and performance of crossover operations by generation. Breast cancer. 41 Figure 4.29 Amount and performance of crossover operations by pipeline size increase. Breast

cancer. ... 42 Figure 4.30 Amount and performance of crossover operations by crossover type. Breast cancer.

... 42 Figure 4.31 Amount and performance of crossover operations by crossover base. Breast cancer.

... 42 Figure 4.32 Amount and performance of crossover operations by crossover base and type.

Breast cancer. ... 43 Figure 4.33 Amount and performance of crossover operations by the level of operation. Breast

cancer. ... 43 Figure 4.34 Selected individual size distribution by generation. Waveform. ... 44 Figure 4.35 Amount and performance of operations by genetic operators. Waveform ... 44 Figure 4.36 Distribution of median accuracies by the generation before and after genetic

operation. Waveform. ... 45 Figure 4.37 Amount and performance of mutations by pipeline generation. Waveform. ... 45 Figure 4.38 Amount and performance of mutations by pipeline size increase. Waveform. .... 46 Figure 4.39 Amount and performance of mutation operations by type. Waveform. ... 46 Figure 4.40 Amount and performance of mutation operations by the added object. Waveform.

... 46 Figure 4.41 Amount and performance of mutation operations by the changed object (short).

Waveform. ... 47

(10)

Figure 4.42 Amount and performance of mutation operations by the removed object.

Waveform. ... 47 Figure 4.43 Amount and performance of mutation operations by the level of operation.

Waveform. ... 48 Figure 4.44 Amount and performance of crossover operations by generation. Waveform. .... 48 Figure 4.45 Amount and performance of crossover operations by pipeline size increase.

Waveform. ... 48 Figure 4.46 Amount and performance of crossover operations by crossover type. Waveform.

... 49 Figure 4.47 Amount and performance of crossover operations by crossover base. Waveform.

... 49 Figure 4.48 Amount and performance of crossover operations by crossover base and type.

Waveform. ... 50 Figure 4.49 Amount and performance of crossover operations by the level of operation.

Waveform. ... 50 Figure 4.50 Selected individual size distribution by generation. Bioavailability. ... 51 Figure 4.51 Amount and performance of operations by genetic operators. Bioavailability. ... 51 Figure 4.52 Distribution of median accuracies by the generation before and after genetic

operation. Waveform. ... 51 Figure 4.53 Amount and performance of mutations by generation. Bioavailability. ... 52 Figure 4.54 Amount and performance of mutations by pipeline size increase. Bioavailability.

... 52 Figure 4.55 Amount and performance of mutation operations by type. Bioavailability. ... 53 Figure 4.56 Amount and performance of mutation operations by the added object.

Bioavailability. ... 53 Figure 4.57 Amount and performance of mutation operations by the changed object (short).

Bioavailability. ... 54 Figure 4.58 Amount and performance of mutation operations by the removed object.

Bioavailability. ... 54 Figure 4.59 Amount and performance of mutation operations by the level of operation.

Bioavailability. ... 55 Figure 4.60 Amount and performance of crossover operations by generation. Bioavailability.

... 55 Figure 4.61 Amount and performance of crossover operations by pipeline size increase.

Bioavailability. ... 55 Figure 4.62 Amount and performance of crossover operations by crossover type.

Bioavailability. ... 56 Figure 4.63 Amount and performance of crossover operations by crossover base.

Bioavailability. ... 56

(11)

Figure 4.64 Amount and performance of crossover operations by crossover base and type.

Bioavailability. ... 57

Figure 4.65 Amount and performance of crossover operations by the level of operation. Bioavailability. ... 57

Figure 4.66 Selected individual size distribution by generation. Concrete. ... 58

Figure 4.67 Amount and performance of operations by genetic operators. Concrete... 58

Figure 4.68 Distribution of median accuracies by the generation before and after genetic operation. Concrete. ... 58

Figure 4.69 Amount and performance of mutations by generation. Concrete. ... 59

Figure 4.70 Amount and performance of mutations by pipeline size increase. Concrete. ... 59

Figure 4.71 Amount and performance of mutation operations by type. Concrete. ... 60

Figure 4.72 Amount and performance of mutation operations by the added object. Concrete. ... 60

Figure 4.73 Amount and performance of mutation operations by the changed object (short). Concrete. ... 61

Figure 4.74 Amount and performance of mutation operations by the removed object. Concrete. ... 61

Figure 4.75 Amount and performance of mutation operations by the level of operation. Concrete. ... 61

Figure 4.76 Amount and performance of crossover operations by generation. Concrete. ... 62

Figure 4.77 Amount and performance of crossover operations by pipeline size increase. Concrete. ... 62

Figure 4.78 Amount and performance of crossover operations by crossover type. Concrete. 63 Figure 4.79 Amount and performance of crossover operations by crossover base. Concrete. 63 Figure 4.80 Amount and performance of crossover operations by crossover base and type. Concrete. ... 63

Figure 4.81 Amount and performance of crossover operations by the level of operation. Concrete. ... 64

(12)

LIST OF TABLES

Table 2.1 Number of classifier hyperparameters (H), hyperparameter combinations (HC) and

classifier appearance in configurations. ... 7

Table 2.2 Number of regressor hyperparameters (H), hyperparameter combinations (HC) and regressor appearance in configurations. ... 7

Table 2.3 Number of preprocessor hyperparameters (H), hyperparameter combinations (HC) and preprocessor appearance in configurations for classification (⚫) and regression (). ... 8

Table 2.4 Number of selector hyperparameters (H), hyperparameter combinations (HC) and selector appearance in c configurations for classification (^⚫) and regression ()... 9

Table 2.5 Number of feature constructor hyperparameters (H), hyperparameter combinations (HC) and their appearance configurations for classification (^⚫) and regression (). ... 9

Table 4.1 Experimental scenario settings ... 18

Table 4.2 Test case descriptions ... 19

Table 4.3 Example of a drug expressed in linear notation (SMILES) and as a set of molecular descriptors. ... 20

Table 4.4 Breast cancer test accuracy mean, standard deviation and Mann-Whitney U p-values. ... 22

Table 4.5 Waveform test accuracy mean ±standard deviation and T-test p-values. ... 26

Table 4.6 Bioavailability test RMSE mean ±standard deviation and T-test p-values. ... 29

Table 4.7 Concrete test RMSE mean ±standard deviation and T-test p-values. ... 32

Table 4.8 Summary of the first experiment results: mean accuracy/negative RMSE and standard deviation by scenario. ... 34

Table 4.9 Summary of internal CV score comparison by generation. ... 35

Table 4.10 Number of pipelines when a certain number of operators is used. ... 35

Table 4.11 Number of operations by the test case. ... 37

(13)

LIST OF ABBREVIATIONS AND ACRONYMS

BC Breast Cancer BIO Bioavailability

CON Concrete

CV Cross-Validation

DL Deep Learning

GP Genetic Programming ML Machine Learning MLP Multi-layer Perceptron MSE Mean Squared Error

NB Naïve Bayes

RFE Recursive Feature Elimination RMSE Root Mean Squared Error SGD Stochastic Gradient Descent SVC Support Vector Classification

TPOT Tree-Based Pipeline Optimization Tool

WF Waveform

XGB Extreme Gradient Boosting

(14)

1. INTRODUCTION

Machine learning (ML) can still be challenging when implementing existing models to work well on your new dataset. The development of an ML model requires signiﬁcant human expertise. It requires being aware of available algorithms and models and their constraints. It is necessary to use knowledge and experience-based intuition to determine the best ML architecture for each new application. However, even an expert has to iteratively implement alternative pipelines to find a good enough solution, which alone exposes an opportunity for automation.

Besides the repetitive nature of many ML processes, the high demand for ML expertise and the shortage of ML specialists have inspired the creation of tools that reduce the manual efforts of developing different ML pipelines.

Automated machine learning (AutoML) is the process of automating one or several ML workflow processes like model training, hyperparameter optimization and feature engi- neering. AutoML allows non-experts to use complex ML methods, increasing the productivity of data scientists, ML engineers, and ML researchers while keeping the same performance and accuracy. AutoML methods vary from hyperparameter optimization to meta-learning and neural architecture search (Hutter et al., 2019). AutoML is usually built on one or several ML frameworks and specializes in automating standard ML (like Auto-Sklearn) or deep learning (DL) models (like AutoKeras). Recent works compare different AutoML tools based on their performance and features (Balaji & Allen, 2018; Truong et al., 2019).

Tree-based Pipeline Optimization Tool (TPOT) was one of the first AutoML methods created by data scientists and is targeted to optimize ML pipelines using genetic programming, i.e., evolving pipelines by means of a process that resembles the natural evolution of species.

While still under active development, TPOT is a very promising AutoML tool. TPOT is not limited to model selection and parameter optimization but automatically constructs and optimizes tree-based ML pipelines, including data and feature preprocessing and selection and new feature construction. Although it has already proved its potential against standard ML analyses (Olson et al., 2016ab) and other AutoML tools (Balaji & Allen, 2018), scientists and ML prac- titioners may still not understand how it actually works.

This Thesis aims to explore the algorithm and analyse its performance using four test cases. TPOT exploration will show how pipelines are constructed and how genetic programming operators optimize the solution. The further study will be split into two independent experiments. Both of them are performed on the same datasets. The first experiment will determine if evolution benefits pipeline optimization by comparing the TPOT-optimized solutions

(15)

under different scenarios: one scenario that initialises 250 pipelines and others that evolve pipelines for 5 and 10 generations. The second experiment will uncover how mutation and crossover determine optimization effectiveness. The performance of each genetic operator will be meas- ured by the share of successful operations and analysed by its type, level of operation and generation.

The following chapter provides a theoretical background for genetic programming—

the foundation of TPOT—and an in-depth overview of the TPOT optimization procedure.

Chapter 3 describes the main components of further analysis: the objectives, data, and approaches to achieve stated objectives. Chapter 4 discusses experimental settings, test cases and results of the experiments. The thesis is concluded in chapter 5. The last chapter also includes recommendations for future works.

(16)

2. THEORETICAL BACKGROUND

2.1. GENETIC PROGRAMMING

First introduced and popularized by Koza (1992, 1994, 2010), genetic programming (GP) is a subfield of evolutionary algorithms. Like other evolutionary algorithms, GP is inspired by the process of natural selection and applies the principles of Darwin’s evolution theory. GP often uses tree-based data structures to represent the computer programs instead of the list structures typical for genetic algorithms. It applies biologically inspired operators: mutation, crossover, and selection, to solve optimization problems. A mutation is an innovation operator as it intro- duces new genetic material variations inside the population. Crossover is a conservation operator as it keeps existing genetic material and passes it through generations.

GP evolution process follows the steps described in Algorithm 2.1. Usually, the termination criterion is the number of generations the population is set to evolve.

Algorithm 2.1. Genetic programming evolution algorithm.

1. Initialization: generate a population of computer programs (individuals).

2. Until termination criteria are not met, perform the following steps:

2.1. Evaluate the performance of each individual and assign it a fitness value.

2.2. Apply the following steps to create a new population:

2.2.1. Selection: select a set of individuals probabilistically based on their fitness values.

2.2.2. Reproduction: copy some of the selected individuals to the new population.

2.2.3. Crossover: recombine randomly chosen parts of two selected individuals to create new ones.

2.2.4. Mutation: substitute randomly chosen parts of a selected individual with randomly generated new ones to create a new individual.

3. Return the single best computer program existing in any generation. This result of the run of GP may be a solution or an approximate solution to the problem.

Any GP individual represented as a tree is defined by a set of primitive functions and terminals. Functions may include mathematical functions, boolean, conditional and iterative operations. Terminals typically consist of problem variables and constants. Trees are generated randomly and, therefore, can vary in shape and size. During the evolution process, trees may consist of any combination of primitives and terminals while that combination can express a solution to the problem (sufficiency requirement), and each function can handle any combination of inputs it can encounter (closure requirement). Figure 2.1 illustrates an example of a three with arithmetical operators as primitives.

(17)

The selection algorithm decides which individuals will be replicated, mated, and mutated. That decision is made based on individuals’ fitness values. The fitness value is a numeric score assigned by a fitness function and reflects how well a program (individual) fits its pur- pose. The fitter the individual, the higher are chances of surviving or passing its genes to the next generation. The likelihood of being selected can be proportional to the fitness value (rou- lette wheel selection), based on the rank individual takes according to their fitness value (rank- ing selection) or based on a competition between a randomly chosen subset of individuals (tour- nament selection).

The crossover operator creates variations in the population by producing offspring that take parts taken from each parent. Figure 2.2 illustrates an example of a crossover between trees (a) and (b) that produce offspring: trees (c) and (d). Trees exchange subtrees below randomly chosen crossover points to produce the offspring who inherit a part of each parent tree. The crossover points of parent trees are marked with a dot in Figure 2.2. Illustrated crossover is called one-point crossover and is the most basic implementation.

GP mutation randomly selects a mutation point and swaps a subtree located under that point with a randomly generated subtree. It requires only one parent. Figure 2.3 shows the

Figure 2.1 Example of GP tree with four primitive nodes (white) and five terminal nodes (grey).

x x x

x

y

x x

y

a b c d

x x- x-y x x x- -y

Figure 2.2 Example of GP crossover. Parent trees (a) and (b) produce offspring (c) and (d).

(18)

mutation of a tree (a) whose subtree was replaced with a randomly generated tree (e). A mutation point is illustrated as a dot on edge.

More advanced mutation and crossover algorithms can control the depth of the newly created trees, for example, by limiting the size of randomly created trees during mutation. In the case of crossover, different probabilities to be chosen as a crossover point can be assigned to parent tree nodes.

2.2. TPOT

Tree-based Pipeline Optimization Tool (TPOT) is an open-source library for Python that optimizes ML pipelines using genetic programming (GP). GP primitives in TPOT are ML pipeline operators, i.e., existing ML algorithms like random forest regressor or standard scaler from Python packages like scikit-learn (Pedregosa et al., 2011). TPOT uses GP procedures to discover the best-performing pipeline for a given dataset. In order to automatically generate and optimize ML pipelines via GP algorithms, TPOT utilizes the Python package DEAP (Fortin et al., 2012). TPOT was first introduced by Olson et el. (2016).

TPOT automates the part of the pipeline that includes feature creation, preprocessing and selection, model selection and parameter optimization. Therefore, input data must be valid, accurate, consistent, and complete. Median values impute all missing values.

TPOT has frequently outperformed standard ML analyses (Olson et al., 2016ab, Le et al., 2020). It was also proved to perform better than other automated ML frameworks in solving regression problems and is second after auto-sklearn in classification problems (Balaji & Allen, 2018). Other distinctive features of TPOT include the possibility to set a maximum execution time, pausing and resuming the optimization process, and exporting a model as Python code.

Although TPOT search space can be adjusted through a custom configuration file, the following chapter describes the operation of the default configurations. The TPOT process

a

x x-

e

y x- x

x x

x

y

Figure 2.3 Example of GP mutation.

(19)

description is based on studying the TPOT 0.11.7 code (https://github.com/EpistasisLab/tpot) and official documentation (http://epistasislab.github.io/tpot/).

2.2.1. TPOT operators

Depending on the selected configuration, TPOT applies a different set of operators. TPOT operators include classifiers, regressors, preprocessors, selectors and a special type of operator that combines two copies of a dataset. Besides default configurations for classification and regression, TPOT configurations include (1) cuML, which uses classifiers and regressors from cuml library, (2) light, (3) MDR, which uses feature constructors from MDR library, (4) NN, which uses PyTorch neural networks, and (5) sparse, which supports sparse matrices and uses One Hot Encoder. While classification has all five additional configurations, regression has all but “NN”. As will be seen later, the same operator can have different hyperparameter options under different configurations.

Classifiers are supervised ML algorithms that learn to predict a class label of a given data point. TPOT uses a set of 17 classifiers. Twelve of those are algorithms from the sklearn library: three ensemble algorithms, two linear models, three naïve Bayes models, and one of each: neural networks, neighbour, tree, and support vector algorithms. Cuml is an ML library that allows running models on GPU effortlessly, which improves large dataset training time 10- 50 times compared to CPU-based equivalents like sklearn (Nolet et al., 2020). Two cuml algorithms are used under cuML configuration. Extreme Gradient Boosting (XGB) classifier comes from the xgboost library. Two algorithms: Logistic Regression and Multilayer Perceptron, are based on PyTorch but adjusted for TPOT.

ML algorithms typically have multiple hyperparameters. Like grid search, TPOT search space is limited to hyperparameter values specified in configuration settings. Hyperpa- rameter combinations are calculated as a multiplication of all hyperparameter options. For example, K Neighbors Classifier (sklearn) has three hyperparameters specified in TPOT configurations: (1) number of neighbours, which can be one of 100 values ranging from 1 to 100, (2) weight function, which can be either “uniform” or “distance”, and a power parameter for the Minkowski metric: 1 for Manhattan distance and 2 for Euclidean distance. Therefore, K Neighbors Classifier (sklearn) model can have 400 different hyperparameter combinations. Un- like other classifiers, Extreme Gradient Boosting (XGB) has a different number of hyperparameters and, thus, a different number of hyperparameter combinations, depending on the configuration: nine hyperparameters in the case of cuML and 7 in others. Table 2.1 summarises

(20)

classifier algorithms, their appearance in TPOT configurations, source libraries, and the number of hyperparameters and their value combinations.

Table 2.1 Number of classifier hyperparameters (H), hyperparameter combinations (HC) and classifier appearance in configurations.

Classifier Library H HC Applied in configuration

default cuML light MDR NN sparse

Logistic Regression cuml 2 33 ^⚫

K Neighbors Classifier cuml 2 100 ^⚫

Gradient Boosting Classifier sklearn 7 7600000 ^⚫ ^⚫

Extra Trees Classifier sklearn 6 30400 ^⚫ ^⚫

Random Forest Classifier sklearn 6 30400 ^⚫ ^⚫ ^⚫

Logistic Regression sklearn 3 44 ^⚫ ^⚫ ^⚫ ^⚫ ^⚫

SGD Classifier sklearn 8 6300 ^⚫ ^⚫

Bernoulli NB sklearn 2 12 ^⚫ ^⚫ ^⚫ ^⚫

Gaussian NB sklearn 0 1 ^⚫ ^⚫ ^⚫

Multinomial NB sklearn 2 12 ^⚫ ^⚫ ^⚫ ^⚫

K Neighbors Classifier sklearn 3 400 ^⚫ ^⚫ ^⚫ ^⚫

MLP Classifier sklearn 2 20 ^⚫ ^⚫

Linear SVC sklearn 5 440 ^⚫ ^⚫ ^⚫

Decision Tree Classifier sklearn 4 7600 ^⚫ ^⚫ ^⚫

XGB Classifier xgboost 7

9

20000 22400

⚫

Pytorch LR Classifier tpot/pytorch 4 240 ^⚫

Pytorch MLP Classifier tpot/pytorch 4 240 ^⚫

Regressors are ML algorithms that investigate relationships between variables and thus learn to predict continuous outcomes. Like in the case of classifiers, Extreme Gradient Boosting (XGB) regressor has a more extensive set of hyperparameters under cuML configuration than other configurations. TPOT uses a set of 16 regressors, 11 of which are sklearn models: 4 ensembles, four linear models, one neighbour model, one tree model, and one support vector model. Four algorithms are from the cuml library, and one from xgboost. None of the neural algorithms is currently used to solve regression problems. Table 2.2 summarises regressor algorithms used in TPOT, its source library, the number of hyperparameters and hyperparameter combinations used in TPOT and regressor appearance in configurations.

Table 2.2 Number of regressor hyperparameters (H), hyperparameter combinations (HC) and regressor appearance in configurations.

Regressor Library H HC Applied in configuration

default cuML light MDR sparse

Elastic Net cuml 2 105 ^

Lasso cuml 1 2 ^

Ridge cuml 0 1 ^

K Neighbors Regressor cuml 2 100 ^

Ada Boost Regressor sklearn 3 15 ^

Extra Trees Regressor sklearn 5 15200 ^

(21)

Regressor Library H HC Applied in configuration

default cuML light MDR sparse

Gradient Boosting Regressor sklearn 9 182400000 ^

Random Forest Regressor sklearn 5 15200 ^ ^

Elastic Net CV sklearn 2 105 ^ ^ ^ ^

Lasso Lars CV sklearn 1 2 ^ ^

Ridge CV sklearn 0 1 ^ ^ ^

SGD Regressor sklearn 8 3780 ^

K Neighbors Regressor sklearn 3 400 ^ ^ ^

Linear SVR sklearn 5 1100 ^ ^ ^

Decision Tree Regressor sklearn 3 3800 ^ ^

XGB Regressor xgboost 8

10

20000 22400



Preprocessing operators transform the dataset in some way. For example, scaling variables might neglect the effect of varying magnitudes on distance-based and variance-sensitive models. Most of the preprocessors are from the sklearn library, which includes seven value preprocessors like normalizer, binarizer and scalers, two kernel approximations (Nystroem, RBF Sampler), two decomposition algorithms like independent component analysis (ICA) and principal component analysis (PCA), feature agglomeration algorithm that recursively merges pair of clusters of features. Table 2.3 summarises preprocessor algorithms used in TPOT, its source library, the number of hyperparameters and hyperparameter combinations used in TPOT and preprocessor appearance in configurations. TPOT does not use any preprocessors under MDR configuration and only uses One Hot Encoder under sparse configuration.

Table 2.3 Number of preprocessor hyperparameters (H), hyperparameter combinations (HC) and preprocessor appearance in configurations for classification (^⚫) and regression ().

Preprocessor Library H HC Applied in configuration

Default cuML light MDR NN sparse

Binarizer sklearn 1 21 ^⚫ ^⚫ ^⚫ ^⚫

Fast ICA sklearn 1 21 ^⚫ ^⚫ ^⚫

Feature Agglomeration sklearn 2 15 ^⚫ ^⚫ ^⚫ ^⚫

Max Abs Scaler sklearn 0 1 ^⚫ ^⚫ ^⚫ ^⚫

Min Max Scaler sklearn 0 1 ^⚫ ^⚫ ^⚫ ^⚫

Normalizer sklearn 1 3 ^⚫ ^⚫ ^⚫ ^⚫

Nystroem sklearn 3 1890 ^⚫ ^⚫ ^⚫ ^⚫

PCA sklearn 2 10 ^⚫ ^⚫ ^⚫ ^⚫

Polynomial Features sklearn 3 1 ^⚫ ^⚫

RBF Sampler sklearn 1 21 ^⚫ ^⚫ ^⚫ ^⚫

Robust Scaler sklearn 0 1 ^⚫ ^⚫ ^⚫ ^⚫

Standard Scaler sklearn 0 1 ^⚫ ^⚫ ^⚫ ^⚫

Zero Count tpot 3 1 ^⚫ ^⚫ ^⚫ ^⚫

One Hot Encoder tpot 3 5 ^⚫ ^⚫ ^⚫ ^⚫

Selecting a relevant subset of variables can reduce the training time and overfitting and improve accuracy and model interpretability. TPOT uses scikit-learn (sklearn) selection algorithms in most configurations, including the default one. Those algorithms apply common

(22)

approaches for dimensionality reduction: removing features with low variance (Variance Threshold), univariate feature selection (Select Percentile, Select FWE), recursive feature elimination (RFE) and selection from a model based on importance weights. Under MDR configuration, TPOT uses scikit-rebate (skrebate) Relief-based feature selection algorithms. Table 2.4 summarises selector algorithms used in TPOT, its source library, the number of hyperparameters and hyperparameter combinations used in TPOT and selector appearance in configurations.

The “Select from model” algorithm has a higher number of hyperparameters and hyperparameter combinations under default and sparse classification configurations than regression configurations. Recursive feature elimination (RFE) can only be used to solve classification problems.

Table 2.4 Number of selector hyperparameters (H), hyperparameter combinations (HC) and selector appearance in c configurations for classification (^⚫) and regression ().

Selector Library H HC Applied in configuration

Default cuML light MDR NN sparse

Select Fwe sklearn 2 50 ^⚫ ^⚫ ^⚫ ^⚫ ^⚫

Select Percentile sklearn 2 99 ^⚫ ^⚫ ^⚫ ^⚫ ^⚫

Variance Threshold sklearn 1 8 ^⚫ ^⚫ ^⚫ ^⚫ ^⚫

RFE sklearn 4 800 ^⚫ ^⚫ ^⚫

Select From Model sklearn 4 3

840 420

⚫

⚫ ⚫ ⚫

⚫

ReliefF skrebate 2 30 ^⚫

SURF skrebate 1 5 ^⚫

SURFstar skrebate 1 5 ^⚫

MultiSURF skrebate 1 5 ^⚫

Multifactor dimensionality reduction (MDR) algorithms convert two or more variables to a single attribute, which changes the data representation space (Sohn et al., 2017). Two algorithms: MDR and continuous MDR, are used to engineer new features for classification and regression problems, respectively. Feature construction operator is available only under MDR configuration. Table 2.5 summarises details about feature constructor algorithms used in TPOT.

Table 2.5 Number of feature constructor hyperparameters (H), hyperparameter combinations (HC) and their appearance configurations for classification (^⚫) and regression ().

Feature Constructor Library H HC Applied in configuration

MDR

MDR mdr 2 4 ^⚫

Continuous MDR mdr 2 4 ^⚫

Default configuration covers a sufficient number of classifiers, regressors, preprocessors and selectors. CuML configuration has much fewer options in classifiers and regressors, however, included algorithms are more efficient than default ones. Light configuration algorithms are limited only to computationally efficient ones. MDR is the only configuration that allows feature construction and relief-based feature selection. It uses no preprocessors and

(23)

single classifier and regressor options: Logistic regression and Elastic Net CV, respectively.

NN is an extended version of the default configuration that allows running neural network algorithms. Currently, it can only be used for classification. Running TPOT on sparse data requires using a separate configuration with a limited number of classifiers and regressors and the only preprocessor—One Hot Encoder. Additionally, all configurations have a special operator that combines two copies of input data. However, this is not included in the total number of operators in individual fitness calculations.

2.2.2. TPOT pipeline

Operators discussed in the previous chapter are sequentially combined to create a pipeline. Tree representation of TPOT pipelines allows performing genetic operations: mutation and crossover. Each pipeline has at least one primitive node, either a classifier or a regressor, and a set of terminals: input matrix and hyperparameters. The initially generated population consists only of trees with at most two operators, but with a custom template, it is possible to initialize pipelines in different sizes. Pipelines can grow or shrink with generations.

A pipeline can be expressed as a string. A simple example is Pipeline 2.1: a classification model with one primitive node and three hyperparameters. Pipeline 2.2 is more complex and much less intuitive for a human. Figure 2.4 visualizes Pipeline 2.2 as a horizontal tree with a root node on the left. Primitives and input matrices are highlighted in red and blue, respectively. Figure 2.4 shows Pipeline 2.2 consists of a classifier, a preprocessor, and a selector:

KNeighborsClassifier(input_matrix, n_neighbors=10, p=2, weights=distance) (2.1) GradientBoostingClassifier(CombineDFs(Nystroem(SelectPercentile(input_matrix,

percentile=33), gamma=0.35000000000000003, kernel=linear, n_components=5), input_matrix), learning_rate=1.0, max_depth=6, max_features=0.4, min_samples_leaf=13, min_samples_split=13, n_estimators=100, subsample=0.8)

(2.2)

Figure 2.4 Example of a tree-based representation of TPOT pipeline.

(24)

Gradient Boosting classifiers combine two data copies as input, one of which is row data. At the same time, the other is a preprocessed set of selected variables.

The optimal pipeline discovered by TPOT is then converted into Python code. The code generated by TPOT only requires specifying the input data source. Otherwise, it is ready to run. The code imports all necessary functions, and the data extracts features and target values, splits data into training and testing sets, creates a pipeline using a pipeline constructor from the scikit-learn library, trains the pipeline and predicts the result. An example of a python code generated by TPOT can be found in Appendix 1.

2.2.3. Individual fitness evaluation

TPOT optimizes the accuracy achieved by the pipeline while accounting for its complexity, making an optimization process a two-objective problem. By default, model performance is evaluated in terms of accuracy (Equation 2.1) for classification problems and negative mean squared error (Equation 2.2) for regression problems. Thus, model performance is always a maximization problem. Other classification and regression score options vary from average precision to ROC AUC score and from negative median absolute error to R². K-fold cross- validation is used to get a more accurate performance evaluation. TPOT uses a 5-fold CV by default. Sacrifice for evaluation accuracy is increased evaluation time.

𝑎𝑐𝑐𝑢𝑟𝑎𝑐𝑦 = ^{𝑇𝑃+𝑇𝑁}

𝑇𝑃+𝐹𝑃+𝑇𝑁+𝐹𝑁 (2.1)

−𝑀𝑆𝐸 = −¹

𝑛∑^𝑛_𝑖=1(𝑦_𝑖 − 𝑦̂_𝑖)² (2.2)

Pipeline complexity is evaluated by a number of operators, which include classifiers, regressors, preprocessors, and selectors. Operators that combine datasets do not contribute to the number of operators. The complexity objective is the minimization problem. TPOT uses the Pareto front to evaluate individual fitness by finding a trade-off between two objectives.

2.2.4. Evolution algorithm

Like any other genetic programming algorithm, TPOT evolves a population over a given number of populations, and the fittest individual from the last generation is selected as an optimal solution. However, every new generation is selected from a combination of previous population individuals and its offspring. Offspring is generated by mutation, crossover, or reproduction with given probabilities. Individuals subject to mutation and crossover are chosen randomly regardless of their fitness values. Algorithm 2.2 briefly summarises the process of discovering the TPOT optimal pipeline.

(25)

Algorithm 2.2. TPOT evolution algorithm.

0. Set a number of individuals in the population: population_size (default 100), number of generations to evolve the population: generation (default 100), number of offspring to produce in each generation: lambda (default equal to population_size), probability rate for mutation mutation_rate (default 0.9), probability rate for crossover crossover_rate (default 0.1).

1. Initialize a population set of population_size individuals.

2. Repeat generation times:

2.1. Repeat lambda times:

2.1.1. Randomly choose one of three options:

2.1.1.1. With probability mutation_rate, randomly select an individual from the population, mutate the individual, and add a mutated individual to the offspring set.

2.1.1.2. With probability crossover_rate, randomly select a pair of individuals from a set of eligible crossover pairs, mate them, and add the first outputting individual to the offspring set.

2.1.1.3. With probability 1 - mutation_rate - crossover_rate, randomly select an individual from the population, and add it to the offspring set.

2.2. Merge the population set with the offspring set.

2.3. Perform a selection of population_size individuals based on individual fitness values to create a new population set.

3. Select the fittest individual.

The selection operator used by TPOT is NSGA-II from the DEAP library (known as selNSGA2 in DEAP), which takes k individuals from a set of individuals larger than k and selects each individual at most once. NSGA-II or Non-dominated Sorting GA-II is a fast multi- objective evolutionary algorithm that uses non-dominated sorting and sharing. It has a better convergence to the near-Pareto-optimal front than older (Deb, 2002).

The crossover operator is a one-point crossover based on a similar algorithm from the DEAP library (cxOnePoint). First, a set of individual pairs eligible for crossover is created.

Individuals are eligible for crossover if they are not the same and share at least one primitive.

Then crossover is performed on a randomly chosen pair of eligible individuals. If no pairs are eligible for crossover, the mutation is performed instead.

The mutation operator randomly selects between three options: insert, shrink and replace a node. Insert and shrink options add and remove a tree element. If replacement is chosen, the algorithm randomly selects a node within an individual. If that node is terminal (i.e., a hyperparameter), the operator replaces it with a random allowed value. If the selected node is a primitive, the operator replaces the whole subtree, which has that primitive as a root, with another randomly created subtree.

(26)

3. METHODOLOGY

3.1. APPROACH

TPOT aims to discover the optimal pipeline by evolving a population of individuals or pipelines over generations. As discussed before, evolution involves performing genetic operations—mutation and crossover—and selection over randomly initialized pipelines. The value of TPOT is based on the idea that genetic operators benefit pipeline optimization.

The first experiment determines if evolving pipelines with TPOT output better-performing results than just initializing pipelines. For results to be comparable, both scenarios must have the same number of pipelines involved. For example, evolving 50 pipelines for five generations involves the same number of pipelines as initializing 250 pipelines.

A deeper analysis would require comparing model performances on different stages of evolution, i.e., generations. That would answer the question: if and when (i.e., in which generation) evolution scenario performance reaches the value of initialization scenario performance?

The second experiment aims to investigate the effectiveness of TPOT mutations and crossovers and discover what makes GP operators effective. Operator effectiveness is deter- mined by the share of operators that improve the fitness value of the pipeline.

3.2. OBJECTIVES

The main objectives of this thesis are:

• To determine if evolving pipelines with TPOT output better-performing results than just initializing pipelines.

• To compare different scenario performances over generations and determine the number of generations necessary for evolving pipelines to reach the level of initialized pipelines.

• To evaluate the performance of mutations and crossovers.

The first experiment will cover the first two objectives, and the second experiment the last one. Experiments are run independently from each other. Nevertheless, both experiments are repeated 30 times and run on the same four test cases described in detail in chapter 4.2.

3.3. DATA

After running the TPOT optimization process on some training data, TPOT returns a fitted TPOT object, which can be exported as a python code or used to get a prediction score. If the TPOT verbosity parameter is set to 2 or higher, then TPOT prints the best internal CV score for each generation and the best pipeline as a string. Therefore, available information about TPOT

(27)

optimization includes pipeline performance scores, internal CV scores and best pipelines. Two later are available in the program output stream. TPOT library code manipulations allowed extracting the best pipeline and evolution process details: genetic operator (mutation, crossover), pipeline, size and score of a selected individual (parent) and an individual after the operation (offspring). The rest of the metrics are calculated from extracted data: operation success rate, size increase, mutation type, mutated object, crossover type, crossover base, pipeline tree depth and operation level.

Scores

The analysis includes three types of scores: training, test and internal. The test score is the main performance indicator for a predictive model. Training score helps to detect overfitting. Internal CV score is a way to evaluate pipeline performance while training the pipeline.

First, a dataset is split into training and test subsets. The training set accounts for 90%

of all data and is used for TPOT optimization. A training score evaluates the performance of the resulting pipeline trained on training data to predict target values of the same data. A test score is based on the pipeline's ability to predict target values of new data—testing subset.

Internal CV tests the pipelines on different folds of training data and returns the average score. Internal CV score is included in the pipeline fitness value and plays a crucial role in the pipeline selection. The best internal score is the CV score of the fittest pipeline in its generation, and it can be accessed from the TPOT output stream if it is run with high verbosity.

Scores of individuals before and after mutation and crossover are also internal CV scores. They are used to calculate the improvement rate.

Test cases include both classification and regression problems. In both cases, default scores are used: accuracy for classification and negative MSE for regression. Negative MSE is converted into negative RMSE (Formula 3.1) for interpretability.

−RMSE = −√−(−MSE) (3.1)

Success rate

Success or improvement rate is a ratio of the number of successful or improved operations to the total number of operations (Equation 3.2). Successful operations are crossovers and mutations that improve the score of the pipeline. In other words, if the score of the individual selected for mutation or crossover is lower than the score after mutation or crossover, then the operation is considered successful (Equation 3.3).

𝑠𝑢𝑐𝑐𝑒𝑠𝑠_𝑟𝑎𝑡𝑒 =^∑^𝑁^𝑖=1^{𝑠𝑢𝑐𝑐𝑒𝑠𝑠}^𝑖

𝑁 (3.2)

(28)

𝑠𝑢𝑐𝑐𝑒𝑠𝑠_𝑖 = {1 , 𝑖𝑓 𝑠𝑐𝑜𝑟𝑒_𝑎𝑓𝑡𝑒𝑟_𝑖 > 𝑠𝑐𝑜𝑟𝑒_𝑏𝑒𝑓𝑜𝑟𝑒_𝑖

0 , 𝑜𝑡ℎ𝑒𝑟𝑤𝑖𝑠𝑒 (3.3)

While mutation requires one input individual (pipeline), crossover requires two.

Crossover outputs two modified individuals; however, TPOT takes only the first one to the offspring set. In order to determine if a crossover is successful, the resulting individual (first offspring) is compared to the parent with the highest score.

Mutation metrics

Mutation operations differ by the type of mutation and the mutated object. During mutation, tree nodes or subtrees can be added to, removed, or replaced in the pipeline. Having both pipelines before and after the operation allows us to identify which part of the pipeline was changed: classifier, regressor, preprocessor, selector, combiner or hyperparameter.

In order to determine mutation type, pipelines before and after the operation are compared. Pipeline strings are converted into lists of pipeline elements, comparing which gives lists of added and removed elements. A mutation is additive if only a list of added elements has values. A mutation is subtractive if only a list of removed elements has values. If both lists have values, then mutation has replaced an object. Empty lists mean no changes. Matching resulting list elements with the pipeline operator dictionary identifies the mutated object.

Crossover metrics

As mutation has only one input pipeline, it was also decided to keep only one input pipeline for crossover: the fittest parent, which score is compared with the score of the offspring. Any of two parents can be the fittest. The first offspring typically has a root of the first parent and a subbranch from the second. Sometimes the root node of the offspring is inherited from the individual before crossover—the fittest parent or alpha. In other cases, the root note is inherited from the “unseen” parent—parent beta. If both trees are the same or none of the nodes match, it is impossible to distinguish the root origin. For an individual to be a root parent, two condi- tions should be met: (1) root classifier (or regressor) match and (2) at least all but one (n-1) root classifier (or regressor) hyperparameters match.

Crossovers can also be characterized by the type of switched pipeline tree parts: one node (leaf) or a subtree (branch). The crossover type is “leaf” if the difference between the pipelines before and after the operation is one hyperparameter. It is a “branch” type if the difference is multiple connected nodes or a subtree. If both trees are the same or none of the nodes match, the type is not available.

Operation level

(29)

The range of possible operation levels depends on the pipeline tree depth. The maximal operation level equals the tree's depth and implies changes in tree leaves (hyperparameters). The operation level is equal to zero if changes are performed in the root node. TPOT pipeline depth is often equal to the number of pipeline operators. The depth is smaller than the number of operators when a combiner is used in the pipeline because combiners can create forks in the tree that allow multiple operators to be on the same level.

3.4. HYPOTHESIS TESTING

The main performance indicators in this thesis are a score indicating how accurately the TPOT- generated pipeline can predict target values of unseen data and genetic operator success rate.

Pipeline performance scores can be normally distributed or skewed in any direction if the problem is complex or the model is far from optimal. Scores of easy problems tend to be close to optimal (accuracy = 1 or -RMSE = 0) and negatively skewed. The Shapiro-Wilk test is used in this work to check whether scenario score or improvement rate distributions are Gaussian.

Mann-Whitney U test, also known as the Wilcoxon-Mann-Whitney test, is a nonpara- metric statistical significance test for two independent samples. It is applied if at least one of the distributions is not Gaussian. If both distributions are normal, Student’s T test determines if their means are significantly different. Each experiment is repeated 30 times and thus outputs 30 values, which is considered enough for a central limit theorem to hold. That makes the mean and standard deviation of observed scores an accurate approximation of the actual ones.

P-value is the probability of obtaining test results at least as extreme as the result observed. The threshold for p-values used in this thesis is 0.05. Any value below indicates that compared distributions are significantly different. In cases where the significance level is not stated, the value of 5% should be assumed.

(30)

4. EXPERIMENTAL STUDY

The study consists of two experiments. Both experiments aim to evaluate the effect of evolution (mutation and crossover) on the quality of the model trained using TPOT.

The first experiment compares the scores of models selected as a result of either evolution or initialization according to the following four scenarios:

1. Initializing 250 individuals.

2. Evolving 50 individuals for 5 generations with default mutation and crossover rates (0.9 and 0.1, respectively).

3. Evolving 25 individuals for 10 generations with default mutation and crossover rates (0.9 and 0.1, respectively).

4. Evolving 50 individuals for 5 generations with custom mutation and crossover rates (0.5 and 0.5).

All the scenarios involve exactly 250 individuals in total. That is because preliminary experiments showed that it is enough to get accurate model scores (see Results section) for simple problems like ”Breast cancer” and “Concrete” and decent scores for more complex problems: like “Waveform” and “Bioavailability” see Chapter 4.2).

Each scenario was run 30 times on each of four test cases: two classification problems “Breast cancer” and “Waveform” and two regression problems “Bioavailability” and “Con- crete” . The conclusion about the significance of evolution-added value is based on two analyses. First is an analysis of the distribution of accuracy and RMSE scores for each test case separately. The second is the development of the best internal Cross-Validation (later CV) score over generations compared to the best internal CV scores of initializations—the first scenario.

The second experiment analyses the performance of genetic operators: crossover and mutation. A crossover takes two individuals as parents and returns two offspring, but in TPOT, only the first (by order) offspring is taken as an output. Mutation, however, inputs and outputs one individual. Every time an operation happens, the fitness values of input and output individuals are evaluated and compared. TPOT always aims to maximize fitness; therefore, operations that output individual fitness is higher than input individual fitness are called successful or improved. In the case of crossover, the highest parent fitness is compared with the first offspring’s fitness. TPOT uses a Pareto front that evaluates two fitnesses: model best internal CV score (accuracy or MSE) and the size of the individual. In operator effectiveness evaluation, only the CV score is taken into account. Operator effectiveness is based on a relative number of successful operations in each test case.

(31)

4.1. EXPERIMENTAL SETTINGS

As discussed above, the study consists of two experiments. The first experiment compares the performance of the models generated according to three scenarios: (1) initialization of 250 individuals, (2) evolution of 50 individuals for 5 generations, and (3) evolution of 25 individuals for 10 generations. Both evolution scenarios use default TPOT parameters for mutation and crossover rates: 0.9 and 0.1 (Table 4.1). In the second experiment, the performance of mutation and crossover operations is evaluated; therefore, mutation and crossover rates are adjusted to equal 0.5. This results in about 125 operations of each kind per run per test case and about 3750 operations in total (30 runs, one test case). The second experiment setting corresponds to scenario 4 (Table 4.1). Besides extra computation regarding operator effectiveness, the second experiment also produces the output valid for the analysis of the first experiment. Therefore, scenario 4 is also included in this analysis.

Table 4.1 Experimental scenario settings

Scenario Repetitions Generations Population size Mutation rate Crossover rate

1 30 1 250 0 0

2 30 5 50 0.9 0.1

3 30 10 25 0.9 0.1

4 30 5 50 0.5 0.5

The number of generations cannot be set to zero in TPOT. Therefore, to simulate the initialization of 250 individuals, TPOT is run for one generation with mutation and crossover rates equal to zero. Thus, only the selection was used in scenario 1. In this thesis, only scenarios with mutation or crossover operations are called evolution.

TPOT optimization is run with default configurations for both classification and regression problems. That implies that solution search space includes classifiers and regressors from sklearn and xgboost libraries, selectors, and preprocessors from sklearn. It excludes feature construction operators and neural network-inspired models.

The random seed is set to be equal to the microsecond of the current timestamp and is reset every time (1) before splitting the dataset into training and testing and (2) before running the TPOT algorithm. TPOT is trained on 90% of the dataset and tested on the rest 10%. Each experiment is run 30 times for each of the 4 test cases.

Experiments are written and run with Python 3.9 using TPOT 0.11.7 and DEAP 1.3.1 as the basis and with the help of NumPy 1.22.3 and scikit-learn 1.0.2.

(32)

4.2. TEST CASES

An experimental study is performed on four test cases—two classifications and two regression problems. None of the cases has missing values nor required pre-processing. The basic information about the test cases is summarized in Table 4.2.

Table 4.2 Test case descriptions

Name Type Attributes Instances Output range

Breast cancer Classification 30 569 1 = malignant, 0 = benign

Waveform Classification 40 5000 0 = Class 1, 1 = Class 2, 2 = Class 3 Bioavailability Regression 241 359 [0.4; 100]

Concrete Regression 8 1030 [2.33; 80.6]

4.2.1. Breast cancer

Breast Cancer Wisconsin (Diagnostic) Data Set (Wolberg et al., 1995) (later: breast cancer) is a classification dataset applicable for breast tumour diagnoses, which was first introduced in the research by Street et al. in 1993. It consists of 569 instances of described characteristics of the cell nuclei extracted from the digitized image of fine needle aspirate of a breast mass via the image analysis program Xcyt (Mangasarian et al., 1995). 357 (or 63%) of the instances are malignant, and the rest (212 or 37%) are benign. Each image consisted of multiple nuclei; therefore, each instance is described as a mean, standard error, and the worst value of each of 10 features: radius, texture, perimeter, area, smoothness, compactness, concavity, number of con- cave points, symmetry, and fractal dimension. There are 30 attributes in total.

Street et al. (1993) performed classification using MSM-T (Multi-surface method – Tree), and the resulting classifier with three features—mean texture, worst area, and worst smoothness—had an accuracy of 97.3%.

4.2.2. Waveform

Waveform Database Generator (Version 2) Data Set (Breiman et al., 1988) (later: waveform) is a synthetically generated dataset with 5000 instances equally split into three classes that are described with 40 features—consecutive “height” values, which plot has a waveform. Three basic waveforms gradually increase, reach their peaks in the 7th, 11th, and 15th attributes, and decrease to the initial value of zero (Figure 4.1 Base Waves). All base waveforms are six units high at their peaks and are zeros after the 21st attribute. Three dataset classes are linear combinations of two different base waveforms with added noise generated with a normal distribution centred around 0 with a standard deviation of 1. Class 1 is a combination of 1st and 3rd

Performance and Competitiveness of Tree-Based Pipeline Optimization Tool

MDSAA

Mestrado em Ciência de Dados e Métodos Analíticos

Performance and Competitiveness of Tree-Based Pipeline Optimization Tool

Marija Grjazniha

A thesis presented as a partial requirement for

obtaining a master’s degree in Data Science and

Advanced Analytics

MDSAA

Performance And Competitiveness of Tree-Based Pipeline Optimization Tool

by

Marija Grjazniha

STATEMENT OF INTEGRITY

ABSTRACT

KEYWORDS

TABLE OF CONTENTS

LIST OF FIGURES

LIST OF TABLES

LIST OF ABBREVIATIONS AND ACRONYMS

1. INTRODUCTION

2. THEORETICAL BACKGROUND

3. METHODOLOGY

4. EXPERIMENTAL STUDY