Assessing irace for automated machine learning


Federal University of Rio Grande do Norte Center for Exact and Earth Sciences

Bachelor in Computer Science

Assessing irace for automated machine learning

Carlos Eduardo Morais Vieira

Natal-RN November 28, 2019


Carlos Eduardo Morais Vieira

Assessing irace for automated machine learning

Undergraduate Report presented as a partial requirement for obtaining the degree of Bachelor in Computer Science.

Advisor

Leonardo C. T. Bezerra, PhD

Federal University of Rio Grande do Norte – UFRN

Natal-RN November 28, 2019


Vieira, Carlos Eduardo Morais.

Assessing irace for automated machine learning / Carlos Eduardo Morais Vieira. - 2019.

53 leaves: ill.

Undergraduate thesis (Bachelor in Computer Science) - Federal University of Rio Grande do Norte, Center for Exact and Earth Sciences, Department of Informatics and Applied Mathematics. Natal, 2019.

Advisor: Leonardo C. T. Bezerra.

1. Computing - Undergraduate thesis. 2. Machine learning - Undergraduate thesis. 3. Algorithm configuration - Undergraduate thesis. 4. Computer vision - Undergraduate thesis. 5. Natural language processing - Undergraduate thesis. 6. Time series analysis - Undergraduate thesis. I. Bezerra, Leonardo C. T. II. Title.

RN/UF/CCET CDU 004

Federal University of Rio Grande do Norte - UFRN Library System - SISBI

Cataloging-in-Publication. UFRN - Prof. Ronaldo Xavier de Arruda Sectoral Library - CCET


Undergraduate thesis under the title Assessing irace for automated algorithm configuration presented by Carlos Eduardo Morais Vieira and accepted by the Federal University of Rio Grande do Norte, being approved by all the members of the examining board specified below:

Leonardo C. T. Bezerra, Ph.D.

Advisor

Digital Metropolis Institute Federal University of Rio Grande do Norte

Manuel López-Ibáñez, Ph.D.

Alliance Manchester Business School University of Manchester

Márjory da Costa Abreu, Ph.D.

Department of Informatics and Applied Mathematics Federal University of Rio Grande do Norte

Elizabeth Ferreira Gouvea Goldbarg, Ph.D.

Department of Informatics and Applied Mathematics Federal University of Rio Grande do Norte


Acknowledgments

I would like to thank my advisor, Leonardo, for his invaluable guidance throughout this project, from its very conception. More than knowledge, he gave me perspective, and allowed me to see importance in things that seemed trivial, and the silver lining in dire times. I would also like to thank professors Manuel López-Ibáñez, Márjory Abreu, and Elizabeth Goldbarg for being part of the examining board, as well as professor Daniel Sabino de Araújo for his participation in the proposal step of this work.

I would also like to thank my friends, who were an immense help in procrastinating this work, and who did not help at all in its development. Still, they made it bearable, and if their company were not so effective in making me forget all about it for a moment or two, I would not have navigated it to the end, but drowned in it.

Finally, I want to thank my family. Near or far, they have always supported and encouraged me. This is especially true of my parents, Carlos and Betânia. Without them I would not have had the opportunity and the will to start this work, nor the means to finish it.


“When action grows unprofitable, gather information; when information grows unprofitable, sleep.” Ursula K. Le Guin, The Left Hand of Darkness


Assessing irace for automated machine learning

Author: Carlos Eduardo Morais Vieira Advisor: Leonardo C. T. Bezerra, PhD

Resumo

Automated algorithm engineering tools have become an important resource for both academia and industry. In general, these tools rely on a few demonstrably efficient algorithm configurators, irace among them. In this proof of concept, we evaluate the application of irace to the field of machine learning (ML). To that end, we propose a template composed of components from the scikit-learn algorithmic framework, which we name iSklearn. In addition, we formally define a configuration space and an experimental setup that allow irace to treat ML datasets as instances of an optimization problem, making iSklearn a functional automated machine learning system. Preliminary results demonstrate that irace is able to produce effective models for three of the major ML application domains: computer vision, natural language processing, and time series analysis.

Keywords: Machine learning, algorithm configuration, computer vision, natural language processing, time series analysis.


Assessing irace for automated machine learning

Author: Carlos Eduardo Morais Vieira Advisor: Leonardo C. T. Bezerra, PhD

Abstract

Automated algorithm engineering tools have become an important asset for both academia and industry. In general, these tools are powered by a few algorithm configurators of proven effectiveness, among which is irace. In this proof-of-concept investigation, we assess the application of irace to the field of machine learning (ML). To do so, we propose a template built on top of the scikit-learn algorithmic framework, dubbed iSklearn, comprising many preprocessing, feature engineering, and prediction algorithms. Furthermore, we formally define a configuration space and an experimental setup that allow irace to treat machine learning datasets as instances of an optimization problem, making iSklearn a fully functional automated machine learning (AutoML) system. Preliminary results demonstrate that irace is able to engineer effective algorithms for three of the major ML application domains, namely computer vision, natural language processing, and time series analysis.

Keywords: Machine learning, algorithm configuration, computer vision, natural language processing, time series analysis


List of figures

Figure 1: 75 Fashion MNIST samples (5 rows, 15 columns).
Figure 2: 30 SVHN samples (3 rows, 10 columns). Margins were added for visualization.
Figure 3: 30 CIFAR-10 / CIFAR-100 samples (3 rows, 10 columns). Margins were added for visualization.
Figure 4: CIFAR-10 estimator tuning performance heatmaps
Figure 5: CIFAR-100 estimator tuning performance heatmaps
Figure 6: Fashion MNIST estimator tuning performance heatmaps
Figure 7: SVHN estimator tuning performance heatmaps
Figure 8: CIFAR-10 preprocessing tuning performance heatmaps
Figure 9: CIFAR-100 preprocessing tuning performance heatmaps
Figure 10: Fashion MNIST preprocessing tuning performance heatmaps
Figure 11: SVHN preprocessing tuning performance heatmaps


List of tables

Table 1: Algorithms considered for each template component
Table 2: Configuration space of the hyperparameters associated with the algorithms comprising iSklearn.
Table 3: iSklearn-configured pipeline overview for each dataset
Table 4: Accuracy and $R^2$ scores for each dataset. Values highlighted in boldface indicate that the corresponding algorithm was the best performing for the given dataset among all algorithms considered.
Table 5: Summary of the six experimental setups considered in this section. Regular stands for the configuration assessed in the previous section, used here as baseline.
Table 6: Accuracy for each dataset/experimental configuration pair.
Table 7: Feature engineering (FE) and prediction (Pred) components selected for each dataset/experimental configuration pair.
Table 8: iSklearn-configured pipelines for the best experimental configuration in each


List of abbreviations

AB AdaBoost.
AutoML automated machine learning.
CASH combined algorithm selection and hyperparameter optimization.
DT decision tree.
FE feature engineering.
kNN k-nearest neighbors.
LR linear regression.
ML machine learning.
MLP multi-layer perceptron.
RF random forests.
RSM response-surface model.
SF single-fold.
SMAC sequential model-based optimization for general algorithm configuration.
SVM support vector machine.


1 Introduction

The success of machine learning (ML) algorithms in a wide range of application domains has resulted in a rising demand for off-the-shelf ML solutions that can be used by practitioners with little or no background in computational intelligence and/or the application domain. This is best evidenced by algorithmic frameworks that have helped popularize ML and reduce the learning curve of newcomers to the field. Relevant examples are Weka (HOLMES; DONKIN; WITTEN, 1994), Keras (CHOLLET et al., 2015), and scikit-learn (PEDREGOSA et al., 2011), three of the most widely used libraries in the data science community.

Convenient as these packages may be for the application of ML by non-experts, they still leave most of the algorithm engineering process up to the user: (i) whether and how to preprocess the data; (ii) how to extract, reduce, or select features; (iii) which estimators to use; and (iv) how to configure all hyperparameters of the algorithms used. Automated machine learning (AutoML) initiatives attempt to automate part or the whole of this algorithm engineering process (THORNTON et al., 2013; KOMER; BERGSTRA; ELIASMITH, 2014; FEURER et al., 2015; OLSON et al., 2016; PUMPERLA et al., 2016; KOTTHOFF et al., 2017; REAL et al., 2017; ZOPH; LE, 2017). A successful example is the AutoSklearn package (FEURER et al., 2015), which builds a model using scikit-learn as underlying framework and SMAC (HUTTER; HOOS; LEYTON-BROWN, 2011) as algorithm configurator.

The effectiveness of AutoML approaches depends heavily on the configuration space and configurator provided. Although proven effective, SMAC is based on global model-based optimization, i.e., it models algorithm performance through response-surface models (RSMs; HUTTER et al., 2014) that depend on instance-specific problem features. However, the heterogeneity of application domains one encounters in everyday ML tasks may require a broad feature set that must be continuously (and, ideally, collectively) improved by the community in order to ensure robust RSMs. An alternative approach is local model-based optimization, where algorithm performance is not modeled by global models but rather by parameter-specific ones. One such configurator is irace (LÓPEZ-IBÁÑEZ et al., 2016), which promotes an iterative racing procedure where local candidate models get discarded if they are statistically outperformed by others. irace has been applied to numerous automated algorithm engineering tasks and application domains (BEZERRA, 2016; BEZERRA; LÓPEZ-IBÁÑEZ; STÜTZLE, 2017; BEZERRA; LÓPEZ-IBÁÑEZ; STÜTZLE, 2016; FRANCESCA et al., 2015; MASCIA et al., 2014; LÓPEZ-IBÁÑEZ; STÜTZLE, 2012). Yet, so far no AutoML initiative powered by irace can be identified in the literature or among ML communities.

Here, we conduct a proof-of-concept investigation in this direction. Concretely, we propose a machine learning template (iSklearn) consisting of a standard ML pipeline architecture, with many algorithmic components available for (i) preprocessing; (ii) feature engineering; and (iii) prediction. Our template is built on the popular scikit-learn package, representing the common options a practitioner has at hand when first working with ML. A configuration of this template represents a pipeline where all components were jointly selected and configured.

We evaluate the effectiveness of the proposed approach on different application domains, namely computer vision, natural language processing, and time series analysis. For each domain, we select a problem and compare pipelines identified by irace as high-performing against the prediction algorithms that comprise the template and ensembles automatically engineered by AutoSklearn. Preliminary results show that the configured pipelines are more effective than naively selecting an algorithm with its suggested default parameters. Remarkably, pipelines configured by irace seem to be competitive with the more elaborate ensembles from AutoSklearn under the setups considered in this work. In addition, we conduct further experiments on alternative experimental setups specifically targeting computer vision. Patterns for some of the design choices selected by irace become clearer, and results are once again competitive with the literature. Yet, the available components from scikit-learn are mostly domain-independent, and state-of-the-art performance would require more specialized components from domain-specific libraries. Besides that, further experiments on a more diverse set of datasets would be helpful towards confirming these preliminary results.

The remainder of this work is structured as follows. In Chapter 2 we give the necessary background and definitions for the whole work. Chapter 3 describes our main contributions, related to the application of irace to the context of machine learning: the proposed template, its available components and associated configuration space, and the experimental setup we adopt to model machine learning datasets as instances of an optimization problem. We assess the proposed approach in Chapters 4 and 5, where we show that pipelines identified by irace outperform the off-the-shelf configured algorithms that ship with scikit-learn, and are even competitive w.r.t. AutoSklearn and manually-designed algorithms from the literature. We conclude and discuss future work in Chapter 6.


2 Background

The now large, and still growing, field of machine learning concerns itself with the question of how to construct a program that improves with “experience” (MITCHELL, 1997). A definition due to Thomas Mitchell (MITCHELL, 1997) states that a computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E. We will use these terms as defined throughout this work, and define equivalent terms as needed.

The more specific area of machine learning this work concerns itself with is that of supervised learning. In this case, the task is to estimate a quantitative or categorical value (or output) based on a set of features (or inputs). For this purpose, we must have a training set of data, i.e., a set of tuples, each containing a sequence of features and the expected output (each of these tuples being what we previously defined as an experience) (HASTIE; TIBSHIRANI; FRIEDMAN, 2013). With this training data, we build our learning program, which in the case of supervised learning we will also call an estimator.

Cases in which we must assign to each input one of a finite number of discrete categories are called classification tasks. If, on the other hand, the output consists of one or more continuous variables, one has a regression task (BISHOP, 2016). Another important part of such a task is the metric used as performance measure. For regression, root mean squared error (RMSE) and mean absolute error (MAE) are common scores that one would wish to minimize (WILLMOTT; MATSUURA, 2005), whereas the also commonly used coefficient of determination (denoted $R^2$) is to be maximized (NAGELKERKE et al., 1991). For classification, one could use the simple accuracy score, i.e., the proportion of samples correctly classified, or, for example, the F1 score (or F-measure), which in the case of binary classification is the harmonic mean of precision (the ability of the classifier not to label as positive a sample that is negative) and recall (the ability of the classifier to correctly label all the positive samples) (HOSSIN; SULAIMAN, 2015).
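The performance measures mentioned above can be made concrete with a minimal sketch in plain Python. The function names below are chosen for illustration only and are not taken from this work or from any library; libraries such as scikit-learn ship their own, more general implementations.

```python
import math

def accuracy(y_true, y_pred):
    """Proportion of samples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the `positive` class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def rmse(y_true, y_pred):
    """Root mean squared error (lower is better)."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mae(y_true, y_pred):
    """Mean absolute error (lower is better)."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def r2(y_true, y_pred):
    """Coefficient of determination (higher is better, at most 1)."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot
```

Note that RMSE and MAE are errors to be minimized, while accuracy, F1, and $R^2$ are scores to be maximized; a configurator that minimizes a cost therefore needs the latter negated or subtracted from one.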

In this chapter, we discuss these estimators in more detail, how the data is prepared and transformed before being handed to an estimator (the transformation steps together with the estimator being what we will dub a pipeline), and how we can view all possible shapes and configurations of these pipelines as a framework.

2.1 Estimators

In general, an estimator (also called predictor) is any random variable used to estimate some parameter of the underlying population from which the sample is drawn (MITCHELL, 1997). In this work, we use a few different estimators. For regression tasks, there is the simple least squares linear regression model (LR). For either regression or classification, any of the following is a possible option.

K-nearest neighbors (kNN, (BHATIA et al., 2010)) is a simple approach in which the estimate for an input vector is based on the (known) outputs of the k inputs in the training set closest to it. This already implies three parameters that can be configured (an issue we will discuss later in Section 2.3): the number k of neighbors to be considered, the metric used to determine distance, and whether or not the neighbors’ outputs are weighted in inverse proportion to their distance to the input in question. Specific discussion of parameters for this and other estimators, when relevant, is given in Section 3.2.
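As an illustration of how these parameters shape a prediction, the following is a minimal kNN classifier sketch in plain Python. It assumes Euclidean distance as the metric, and the function name and the `weights` values are hypothetical choices made for this example rather than the interface used in this work.

```python
import math
from collections import defaultdict

def knn_classify(train_X, train_y, x, k=3, weights="uniform"):
    """Classify x by a (possibly distance-weighted) vote among its k nearest neighbors."""
    # Euclidean distance to every training point; another metric could be swapped in.
    neighbors = sorted(
        (math.dist(xi, x), yi) for xi, yi in zip(train_X, train_y)
    )[:k]
    votes = defaultdict(float)
    for dist, label in neighbors:
        if weights == "uniform":
            votes[label] += 1.0
        else:
            # Inverse-distance weighting; the epsilon avoids division by zero.
            votes[label] += 1.0 / (dist + 1e-12)
    return max(votes, key=votes.get)
```

With a small k the prediction follows the closest points, a large k with uniform weights lets a distant majority dominate, and distance-based weights can restore the influence of the closest neighbors.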

Another estimator is the decision tree (DT, (BREIMAN, 2017)), which infers a set of simple decision rules from the training data, from which values in the testing data can be estimated. There are also ensemble methods, in which a prediction model is built by combining simpler base models. In this category fall random forests (RF, (BREIMAN, 2001)), in which a committee of DTs is used to make estimations, and AdaBoost (AB, (FREUND; SCHAPIRE, 1997)), which has a similar committee of “weak learners” that are built sequentially and cast weighted votes.

A different family of approaches is that of analytical methods such as support vector machines (SVM (CORTES; VAPNIK, 1995)). SVMs establish boundaries for the input space according to the known output values. These boundaries can be nonlinear, since they are constructed by defining linear boundaries in a larger transformed version of the original input space.


Finally, a multi-layer perceptron (MLP, (HINTON, 1990)) is a neural network having at least one hidden layer (a layer between the input and the output layers). Each node in an MLP is a neuron which transforms the values from the previous layer using a weighted sum, followed by a non-linear activation function. Thanks to these multiple layers and the non-linear activation functions, it can, like SVMs, produce non-linear decision boundaries and thus handle data that is not linearly separable.

2.2 Pipelines and frameworks

For many combinations of tasks and estimators, it may be necessary to preprocess the original input into an input that the estimator can more easily use to solve the given task (KOTSIANTIS; KANELLOPOULOS; PINTELAS, 2006). This preprocessing is applied to all inputs, in training and testing data alike, and we will refer to the components that perform this preprocessing, along with the estimator itself, as a pipeline.

Such preprocessing can consist of (i) a simple scaling operation; (ii) feature extraction, where new features are produced from already existing ones; (iii) feature selection, where a subset of the existing features is selected and the remainder discarded; or (iv) any combination of these or other more complex transformations, in any order (KHALID; KHALIL; NASREEN, 2014).
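The notion of a pipeline can be sketched in plain Python as a chain of transformation steps followed by an estimator. The classes below are hypothetical and written only for this illustration: they mimic the fit/transform convention popularized by libraries such as scikit-learn but are not that library's API. The key point is that each step learns its parameters from the training data only and then applies the same transformation to training and testing inputs alike.

```python
import math
import statistics

class Standardizer:
    """Scaling step: remove the mean and scale each feature to unit variance."""
    def fit(self, X):
        columns = list(zip(*X))
        self.means = [statistics.fmean(c) for c in columns]
        # Guard against constant features, whose standard deviation is zero.
        self.stds = [statistics.pstdev(c) or 1.0 for c in columns]
        return self

    def transform(self, X):
        return [[(v - m) / s for v, m, s in zip(row, self.means, self.stds)]
                for row in X]

class SelectFeatures:
    """Feature selection step: keep only the given column indices."""
    def __init__(self, keep):
        self.keep = keep

    def fit(self, X):
        return self

    def transform(self, X):
        return [[row[i] for i in self.keep] for row in X]

class NearestCentroid:
    """Toy estimator: predict the label whose training centroid is closest."""
    def fit(self, X, y):
        groups = {}
        for row, label in zip(X, y):
            groups.setdefault(label, []).append(row)
        self.centroids = {label: [statistics.fmean(c) for c in zip(*rows)]
                          for label, rows in groups.items()}
        return self

    def predict(self, X):
        return [min(self.centroids,
                    key=lambda label: math.dist(self.centroids[label], row))
                for row in X]

class SimplePipeline:
    """Preprocessing steps followed by a final estimator."""
    def __init__(self, steps, estimator):
        self.steps, self.estimator = steps, estimator

    def fit(self, X, y):
        for step in self.steps:
            X = step.fit(X).transform(X)  # parameters come from training data only
        self.estimator.fit(X, y)
        return self

    def predict(self, X):
        for step in self.steps:
            X = step.transform(X)  # reuse the parameters learned in fit
        return self.estimator.predict(X)

train_X = [[1.0, 100.0, 7.0], [2.0, 110.0, 7.0], [8.0, 300.0, 7.0], [9.0, 310.0, 7.0]]
train_y = ["low", "low", "high", "high"]
pipeline = SimplePipeline([SelectFeatures([0, 1]), Standardizer()], NearestCentroid())
pipeline.fit(train_X, train_y)
predictions = pipeline.predict([[1.5, 105.0, 7.0], [8.5, 305.0, 7.0]])
```

Swapping a step or the estimator only requires another object with the same methods, which is the kind of interchangeability a configurator needs in order to select components automatically.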

An algorithmic framework, or simply framework, using the terms previously defined, is a set of algorithms which can be used to implement some specific set of pipelines. Such frameworks have become increasingly popular as a way both to facilitate the application of machine learning to real-life tasks and to reduce the learning curve for newcomers to the field. Examples of such frameworks are Weka (HOLMES; DONKIN; WITTEN, 1994), Keras (CHOLLET et al., 2015), and scikit-learn (PEDREGOSA et al., 2011). In particular, scikit-learn is a Python module containing a wide array of ML algorithms, with a high-level, consistent API (BUITINCK et al., 2013). This allows for easy interchangeability and composition of its components, essential for our particular use case. For these reasons, we employ it in the construction of our AutoML system.


2.3 Combined Algorithm Selection and Hyperparameter optimization (CASH)

Even with the rising popularity of algorithmic frameworks for ML, most of what we call algorithm engineering is left up to the user: (i) whether and how to preprocess the data; (ii) how to extract or select features; (iii) which estimation algorithm to use; and (iv) how to configure all of these algorithms appropriately. In other words, efficiently building a pipeline as we have previously defined it can still be very challenging, given this space of possibilities. The choices involved in navigating this space can be divided in two, both discussed in this section, as well as the greater issue of the combination of these choices (THORNTON et al., 2013).

The first type of choice in building a pipeline is that of algorithm selection: which preprocessing algorithms (if any), and which estimation algorithm will be used. While many users may make this selection based on intuition, reputation, or fallacious generalization from previous experience, these approaches can significantly impair performance gains.

Besides algorithm selection, each algorithm may have any number of parameters: a term which is typically used in the algorithm configuration literature, equivalent to the term hyperparameter more commonly used in the ML community, which we will use interchangeably. The issue of algorithm configuration (or more generally, automated algorithm engineering) has long been contemplated as an optimization task, and as such several tools exist geared towards optimizing algorithm parameters. Among these are sequential model-based optimization for general algorithm configuration (SMAC) (HUTTER; HOOS; LEYTON-BROWN, 2011), tree-structured Parzen estimator (TPE) (BERGSTRA et al., 2011), and iterated racing (irace) (LÓPEZ-IBÁÑEZ et al., 2016).

The irace package, which we will use later in this work, implements the iterated racing procedure, which is an extension of the iterated F-race procedure (I/F-race (BIRATTARI et al., 2010a)). Though the main purpose of irace in its inception was to automatically configure optimization algorithms, the ‘racing’ aspect of this procedure was first proposed within the context of machine learning, for the purpose of model selection (MARON; MOORE, 1997). A race starts with a set of candidates (models or configurations). At each step of the race, each candidate is evaluated on a single instance, and those candidates that perform statistically worse for that instance are discarded. This racing procedure continues until reaching a certain number of surviving candidates, a maximum number of instances evaluated, or a pre-defined computational budget.


Now, there are several ways to select which candidates should be discarded at each iteration. The F-race algorithm relies on the non-parametric Friedman’s two-way analysis of variance by ranks, the Friedman test (CONOVER; CONOVER, 1980). Ideally, the initial set of candidates would be exhaustive, but that is often impractical for larger configuration spaces, as it is for our case, which will be made clear in Sections 3.1 and 3.2. Thus, I/F-race was proposed in order to more effectively sample the configuration space. It iterates over a sequence of F-races, and the candidates evaluated in each race depend on the results of previous races.
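To make the racing idea more tangible, the sketch below computes the Friedman statistic over the costs observed so far and uses it in a deliberately simplified race. Two assumptions are worth stressing: the statistic is computed without tie correction, and the elimination rule (drop the single worst-ranked candidate whenever the test rejects equal performance at the 0.05 level) is a stand-in chosen for brevity, whereas F-race relies on pairwise post-hoc comparisons. All names are hypothetical and unrelated to the irace package.

```python
def rank_row(costs):
    """Ranks (1 = lowest cost) for one instance; ties receive the average rank."""
    order = sorted(range(len(costs)), key=lambda j: costs[j])
    ranks = [0.0] * len(costs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and costs[order[j + 1]] == costs[order[i]]:
            j += 1
        average = (i + j) / 2 + 1  # mean of the 1-based positions i..j
        for pos in range(i, j + 1):
            ranks[order[pos]] = average
        i = j + 1
    return ranks

def friedman_statistic(results):
    """Friedman statistic for results[instance][candidate], plus the rank sums."""
    n, k = len(results), len(results[0])
    rank_sums = [0.0] * k
    for row in results:
        for j, rank in enumerate(rank_row(row)):
            rank_sums[j] += rank
    stat = 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3.0 * n * (k + 1)
    return stat, rank_sums

# Chi-square critical values at the 0.05 level, keyed by degrees of freedom (k - 1).
# Only races with up to five candidates are covered by this small table.
CHI2_CRIT_05 = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488}

def race(candidates, instances, evaluate, min_survivors=1):
    """Evaluate survivors one instance at a time, discarding poor performers."""
    alive = list(candidates)
    history = []  # one row of costs per instance, one column per surviving candidate
    for instance in instances:
        history.append([evaluate(c, instance) for c in alive])
        if len(alive) <= max(1, min_survivors) or len(history) < 2:
            continue
        stat, rank_sums = friedman_statistic(history)
        if stat > CHI2_CRIT_05[len(alive) - 1]:
            worst = rank_sums.index(max(rank_sums))
            alive.pop(worst)
            # Drop the discarded candidate's column from the recorded costs.
            history = [row[:worst] + row[worst + 1:] for row in history]
    return alive
```

In an AutoML setting, `evaluate` would train and score a pipeline configuration on one instance, so discarding candidates early saves the cost of evaluating them on the remaining instances.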

The CASH problem (as defined by Thornton et al. (THORNTON et al., 2013)), is the combined algorithm selection and hyperparameter optimization problem. This is precisely the problem, within the ML context, that AutoML seeks to address.

2.4 AutoML

Automated machine learning (AutoML) initiatives attempt to automate part or the whole of the ML algorithm engineering process, as discussed in Section 2.3. Any such initiative comprises, as we understand them today: (i) a set of possible pipelines (as previously defined), which we will call a template; (ii) an associated configuration space, listing parameters of the algorithms and their valid domains, and; (iii) an experimental setup involving an algorithm configurator able to evaluate candidates, enabling the AutoML approach to select/configure an algorithm that is high-performing for the input dataset.

Recent works in AutoML which present functioning systems are Auto-Weka (THORNTON et al., 2013), which uses the Weka framework as its template implementation and, for candidate evaluation, the SMAC configurator, a random-forest-based Bayesian optimization method; and AutoSklearn (FEURER et al., 2015), which also uses SMAC, and builds an ensemble of algorithms from those evaluated during the optimization process (which are a sample of those available in the scikit-learn framework).

Two distinguishing features of the AutoSklearn system are the inclusion of a preliminary meta-learning step to warm-start the Bayesian optimization procedure, and of an automated ensemble construction step after the optimization procedure. Meta-learning is used to select possible pipelines that are likely to perform well on a new dataset. To do so, for a large number of datasets, both performance data and a set of meta-features (characteristics of the dataset) are collected. In that way, when a new dataset is considered, the pipelines that performed best on the datasets most similar (according to the meta-features used) to this one are evaluated before the optimization process begins (FEURER et al., 2015).

The ensemble construction seeks to produce an estimator that is more robust and less prone to overfitting than simply using the best model found. This ensemble is built using ensemble selection (CARUANA et al., 2004): the ensemble starts off empty and then, iteratively, the model found in the optimization procedure that maximizes the ensemble’s validation performance (with uniform weight) is added.
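A minimal sketch of this greedy procedure is given below, under a few assumptions made for illustration: each model is represented only by its predicted class probabilities on a validation set, validation performance is measured by accuracy, and a model may be added more than once. The function names are hypothetical and do not correspond to the AutoSklearn implementation.

```python
def ensemble_accuracy(members, predictions, y_val):
    """Accuracy of the uniformly weighted average of the members' probabilities."""
    correct = 0
    for i, label in enumerate(y_val):
        n_classes = len(predictions[members[0]][i])
        average = [sum(predictions[m][i][c] for m in members) / len(members)
                   for c in range(n_classes)]
        correct += average.index(max(average)) == label
    return correct / len(y_val)

def ensemble_selection(predictions, y_val, size):
    """Greedily add the model that maximizes validation accuracy of the ensemble."""
    ensemble = []
    for _ in range(size):
        best = max(predictions,
                   key=lambda m: ensemble_accuracy(ensemble + [m], predictions, y_val))
        ensemble.append(best)
    return ensemble

# Validation labels and per-model class probabilities (two classes, four samples).
y_val = [0, 1, 1, 0]
predictions = {
    "A": [[0.9, 0.1], [0.2, 0.8], [0.4, 0.6], [0.45, 0.55]],
    "B": [[0.6, 0.4], [0.55, 0.45], [0.52, 0.48], [0.9, 0.1]],
    "C": [[0.3, 0.7], [0.6, 0.4], [0.6, 0.4], [0.2, 0.8]],
}
chosen = ensemble_selection(predictions, y_val, size=2)
```

In this toy data, model A alone classifies three of the four samples correctly and model B only two, yet averaging A and B classifies all four correctly, which is the kind of complementarity ensemble selection exploits.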

Other AutoML systems worth mentioning are Hyperopt-sklearn (KOMER; BERGSTRA; ELIASMITH, 2014), which also uses algorithms from the scikit-learn framework, but with configurators from the Hyperopt (BERGSTRA et al., 2015) library, such as TPE and simulated annealing; and TPOT (Tree-based Pipeline Optimization Tool (OLSON et al., 2016)), another scikit-learn-based approach, but using genetic programming. In the following chapter we will describe our own AutoML approach, dubbed iSklearn, and each of its three elements.


3 Setting up irace for machine learning

As discussed in the previous chapter, any specific AutoML approach comprises (i) a template; (ii) an associated configuration space; and (iii) an experimental setup to evaluate candidates, using an algorithm configurator. In this chapter, we will describe each of these elements in turn, starting with the template we adopt.

3.1 The iSklearn template

The template we dub iSklearn in this work is given in Algorithm 3.1, and models a standard ML pipeline architecture. It comprises two major components, namely Preprocessing and Prediction, which respectively represent a preprocessing stage, performed over the data prior to fitting the model, and a prediction stage, where model fitting is actually performed. Both components are given the option of Scaling through standardization, as typically done in the literature.

The pre-scaling step may be selected if there are any feature engineering components being considered. Pre-scaling, as well as scaling, consists of removing the mean (except for sparse datasets) and scaling to unit variance. The scaling step, as opposed to the pre-scaling one, is always an option for the configurator.

Algorithm 3.1

    S             -> Preprocessing Prediction
    Preprocessing -> Scaling FE | none
    Scaling       -> True | False
    FE            -> Selection | Selection Extraction | Extraction Selection | Extraction
    Prediction    -> Scaling Predictor

Table 1: Algorithms considered for each template component

    Component     Algorithms                   Conditions
    Selection     model-free, model-based
    model-based   DT, RF, SVM                  classification and regression
                  LR                           regression
    Extraction    SVD                          sparse datasets
                  PCA, ICA, DL                 otherwise
    Predictor     DT, RF, SVM, kNN, MLP, AB    classification and regression
                  LR                           regression

Besides Scaling, the Preprocessing component comprises a feature engineering component (FE). Available options are feature Selection and Extraction, which can be used simultaneously and, if so, in any order. In this work, feature extraction options are dimensionality reduction algorithms, given in Table 1, and depend on the characteristics of the dataset provided: for sparse datasets, truncated singular value decomposition (SVD); otherwise, principal component analysis (PCA), independent component analysis (ICA), or dictionary learning (DL).
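The grammar of the template can be expanded mechanically to list every pipeline shape it allows. The sketch below encodes the productions of Algorithm 3.1 as a Python dictionary and enumerates all derivations; as a simplifying assumption, Selection, Extraction, and Predictor are treated as terminal symbols, although in the template each of them further expands into the algorithms of Table 1.

```python
from itertools import product

# Productions of the template grammar; each right-hand side is a list of symbols.
GRAMMAR = {
    "S": [["Preprocessing", "Prediction"]],
    "Preprocessing": [["Scaling", "FE"], ["none"]],
    "Scaling": [["True"], ["False"]],
    "FE": [["Selection"], ["Selection", "Extraction"],
           ["Extraction", "Selection"], ["Extraction"]],
    "Prediction": [["Scaling", "Predictor"]],
}

def expand(symbol):
    """Return every sequence of terminal symbols derivable from `symbol`."""
    if symbol not in GRAMMAR:  # terminal symbol
        return [[symbol]]
    derivations = []
    for production in GRAMMAR[symbol]:
        # Expand each right-hand-side symbol, then combine the alternatives.
        parts = [expand(s) for s in production]
        for combination in product(*parts):
            derivations.append([token for part in combination for token in part])
    return derivations

shapes = expand("S")
```

Under this simplification the grammar yields 18 shapes: nine preprocessing alternatives (two scaling options times four FE arrangements, plus no preprocessing) times two scaling options for the prediction stage. The count grows considerably once each component is replaced by its concrete algorithms and their hyperparameters.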

Component Selection comprises two options, model-free and model-based, also listed in Table 1. Model-free feature selection retrieves a certain percentile of features based on a given scoring function, such as the ANOVA F-values (GIRDEN, 1992) or the Mutual Information (KRASKOV; STÖGBAUER; GRASSBERGER, 2004) between features and target. Conversely, model-based selection fits a predictor that assigns importance weights to the features, and retrieves only the most relevant ones. In our template, model-based selection provides a choice among several prediction algorithms: decision trees (DT, (BREIMAN, 2017)), random forests (RF, (BREIMAN, 2001)), and support vector machines (SVM), for either classification or regression; and linear regression (LR), for regression only.¹ Furthermore, model-based selection can be performed recursively, following the traditional recursive feature elimination (RFE) approach.

Concerning component Predictor, we consider a subset of the estimators available in scikit-learn. Besides the options already discussed for model-based selection (DT, RF, SVM, and LR), we also consider k-nearest neighbors (kNN), multi-layer perceptron (MLP, (HINTON, 1990)), and AdaBoost (AB, (FREUND; SCHAPIRE, 1997)). Most algorithms present hyperparameters of their own which are exposed for configuration, as we detail next. All of the algorithms comprising this template were chosen for two primary reasons: they are already implemented in the scikit-learn framework, and they are commonly used in machine learning benchmarks.

¹ When used in the context of feature selection, we adopt these algorithms with their suggested default hyperparameters, since the configuration of nested models is a complex aspect to be addressed in future work.

Table 2: Configuration space of the hyperparameters associated with the algorithms comprising iSklearn.

    Algorithm   Parameter                  Space
    KNN         k                          {1, . . . , 100}
                weights                    {uniform, weighted}
    AB          Nestimators                {2, . . . , 500}
                learning rate              {uniform, weighted}
                loss function              {linear, square, exponential}
    DT & RF     max features               [0.01, 1.0]
                min samples leaf           [0.01, 0.5]
                max depth                  none or {2, . . . , 50}
                classification criterion   {gini, entropy}
                regression criterion       {MSE, MAE}
    RF          Nestimators                {2, . . . , 300}
    SVM         C                          [1e−3, 1e5]
                kernel                     {linear, polynomial, RBF, sigmoid}
                γ                          [1e−5, 1]
                polynomial degree          {1, . . . , 10}
    MLP         hidden layers              {1, 2, 3}
                nodes1, nodes2, nodes3     {3, . . . , 500}
                activation function        {identity, logistic, tanh, ReLU}
                optimizer                  {lbfgs, sgd, adam}
                L2 penalty                 [1e−5, 1e4]
                initial learning rate      [1e−6, 1]
                learning rate              {constant, invscaling, adaptive}

3.2 Configuration space

The implementations provided by scikit-learn of the algorithmic options available in Table 1 present associated hyperparameters that need to be configured by irace. Table 2 details those hyperparameters as well as their domains, grouped by algorithm. Below, we further detail the hyperparameters associated with each algorithm considered, respecting the order adopted in Table 2. For further reference on these hyperparameters, we refer to the original proposals of the algorithms and to the documentation of scikit-learn.

KNN: The number k of neighbors is provided as a hyperparameter, as well as whether to use uniform or distance-based weights for each neighbor.


AdaBoost (AB): irace must configure the number of base estimators and the learning rate. For regression tasks, three different loss functions are considered, used to update the weights after each iteration.

Decision trees (DT): irace needs to configure (i) the proportion of the features that can be used to build the tree, and (ii) the minimum number of samples required for a leaf node (given as a fraction of the total number of samples). Optionally, irace may configure a maximum depth value for the tree. The criterion used to measure the quality of a split is also provided as a hyperparameter, with different possible values for classification and regression tasks: gini or entropy for classification, and mean squared error (MSE) or mean absolute error (MAE) for regression.

Random forests (RF): The configuration space for random forests is a superset of the space for decision trees. Besides all of the hyperparameters from DT, irace must also configure the number of estimators (trees) to be used.

Support vector machines (SVM): irace must configure the penalty parameter of the error term (C), the kernel to be used, and its associated hyperparameter γ (in the case of non-linear kernels). Also, for a polynomial kernel, the degree of the polynomial function is configured.

Multi-layer perceptron (MLP): irace must configure the number of hidden layers and the number of neurons in each layer. The activation function, solver (or optimizer), and L2 penalty must also be configured. For solvers SGD and Adam, the initial learning rate is also configured and, specifically for SGD, a learning rate schedule is chosen.
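To illustrate how conditional hyperparameters shape this space, the following minimal Python sketch samples an SVM configuration from the domains listed in Table 2. The function name `sample_svm_configuration` and the dictionary keys are hypothetical and not part of iSklearn. Sampling is uniform here for simplicity, whereas irace samples from its own probabilistic model, so the sketch only shows which hyperparameters are active for each kernel.

```python
import random

# Domains taken from Table 2; the key names are illustrative assumptions.
SVM_SPACE = {
    "C": (1e-3, 1e5),
    "kernel": ["linear", "polynomial", "RBF", "sigmoid"],
    "gamma": (1e-5, 1.0),
    "degree": (1, 10),
}


def sample_svm_configuration(rng):
    """Draw one SVM configuration, adding conditional hyperparameters only when active."""
    config = {
        "C": rng.uniform(*SVM_SPACE["C"]),
        "kernel": rng.choice(SVM_SPACE["kernel"]),
    }
    # gamma is only configured for non-linear kernels.
    if config["kernel"] != "linear":
        config["gamma"] = rng.uniform(*SVM_SPACE["gamma"])
    # The polynomial degree is only configured for the polynomial kernel.
    if config["kernel"] == "polynomial":
        config["degree"] = rng.randint(*SVM_SPACE["degree"])
    return config


example = sample_svm_configuration(random.Random(0))
```

A configuration with a linear kernel therefore carries only C and the kernel, which keeps irrelevant hyperparameters out of the search.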

We make two additional remarks about the configuration space described above. First, we highlight the plasticity of the MLP architecture, given the freedom to configure the number of layers and nodes in each layer. Second, we adopt the logarithmic transformation, applying it to real-valued hyperparameters that present a very large range, following recent findings in the topic (FRANZIN; CÁCERES; STÜTZLE, 2017). For instance, the L2 penalty hyperparameter in MLPs is modeled as a surrogate parameter α ∈ [−5, 4], and is transformed back to L2 = 10^α ∈ [10^−5, 10^4] by iSklearn.
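The logarithmic transformation can be sketched in a few lines of Python. This is a minimal illustration assuming base 10, as implied by the ranges above; the name `from_log10` is hypothetical and does not refer to actual iSklearn code.

```python
def from_log10(alpha):
    """Map a surrogate exponent alpha back to the original scale: value = 10 ** alpha."""
    return 10.0 ** alpha


# Surrogate range used for the MLP L2 penalty, as described in the text.
ALPHA_RANGE = (-5.0, 4.0)

low, high = (from_log10(a) for a in ALPHA_RANGE)

# The midpoint of the surrogate range maps to the geometric centre of
# [1e-5, 1e4] (about 0.316), not to its arithmetic centre (about 5000).
middle = from_log10(sum(ALPHA_RANGE) / 2)
```

Sampling α uniformly thus gives every order of magnitude of the penalty the same chance of being explored, which is the motivation for the transformation.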

3.3 Experimental setup

In this work, we assess irace as a configurator that must identify high-performing

procedure (BIRATTARI et al., 2010b), and requires as input: (i) a configuration space, from which candidate configurations are sampled; (ii) a configuration budget, which determines the number of experiments it can perform and, consequently, the number of candidate configurations it will evaluate; and (iii) a testing setup, comprising an instance list where configurations will be tested according to a given performance metric. Items (i) and (ii) are closely related, as a larger configuration space will likely require a larger configuration budget for proper exploration.

In this work, the configuration space is the union of the components depicted in Table 1 and their associated hyperparameters, given in Table 2 and detailed in the previous section. The same configuration budget of 2 000 experiments is given to irace when configuring pipelines for each problem. Each experiment is limited to a maximum runtime of 15 minutes, ensuring a feasible total execution time for any given irace run, while also providing reasonable time for pipeline training and testing. If a configuration exceeds this limit, it is penalized so that irace may discard it at the end of the iteration.
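The penalization of experiments that exceed the runtime limit can be sketched as follows. This is a minimal illustration under stated assumptions: the callable `train_and_score`, the penalty value, and the use of 1 − score as the cost to be minimized are all hypothetical, since the text does not specify them, and the sketch assumes a score in [0, 1] such as accuracy.

```python
import time

CUTOFF_SECONDS = 15 * 60  # per-experiment limit stated in the text
PENALTY_COST = 1.0        # assumption: worst possible value of 1 - accuracy


def experiment_cost(train_and_score, cutoff=CUTOFF_SECONDS, penalty=PENALTY_COST):
    """Return the cost (1 - score) of one experiment, or the penalty if it overran.

    `train_and_score` is a hypothetical callable that fits a pipeline and
    returns its score. The check happens after the call returns; interrupting
    a running fit would require a separate process and is omitted here.
    """
    start = time.monotonic()
    score = train_and_score()
    elapsed = time.monotonic() - start
    if elapsed > cutoff:
        return penalty
    return 1.0 - score
```

Because the penalty equals the worst attainable cost, a configuration that overruns compares unfavourably with every configuration that finished in time.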

Concerning (iii), the performance metrics we adopt vary as a function of task nature, namely accuracy for classification and R² for regression, both for configuration and testing. Since the notion of an instance is native to optimization rather than machine learning, we propose to use a leave-p-out cross-validation approach (CELISSE; ROBIN, 2008). We apply this approach to datasets that represent both classification and regression tasks, traditionally adopted for relevant problems in different ML application domains, as follows:

Crime Incidence Data is a collection of real-world crime incidence time series provided by the Public Safety Secretariat of the state of Rio Grande do Norte in Brazil, in the context of the smart cities SmartMetropolis research project.² We consider here the time series representing the number of occurrences in 17 different police-defined districts for training, and a global time series comprising all districts for testing, each containing trend, seasonality and autoregressive features. For each time series, the latter 20% of the data is held out from model fitting, and used later as a testing set. To prevent overfitting in the configuration process, we regard as instance a tuple of three randomly sampled districts. The performance of a configuration on a given instance is the average of its performance on each district of the tuple. Due to the nature of this problem, the evaluation of a candidate in a given district uses time series walk-forward cross-validation, i.e., splitting the series into five spans and assessing prediction on a given span from fitting on the previous ones.


MNIST (LeCun et al., 1998) is a popular computer vision classification problem dataset comprising handwritten digits. It contains a 60 000-sample training set and a 10 000-sample testing set, each sample being a grey-level, 28x28-pixel, centered image of a single digit. Each matrix representing an image is flattened as a feature vector, and for consistency, we follow a setup similar to that of the time series dataset. In this case, we split the training set into 20 stratified folds (each containing 3 000 images), and created instances as tuples of three uniformly sampled, distinct folds. The performance of a configuration on a given instance is the average of its performance on each fold of the tuple. Finally, to assess a configuration on a given fold from a given instance, we adopted 5-fold cross-validation.

Large Movie Review Dataset (LMRD) (MAAS et al., 2011) is a natural language processing dataset used for binary sentiment analysis (classification) of online movie reviews collected from IMDB.³ It consists of 50 000 highly polar reviews split evenly into training and testing sets. We used a TF-IDF representation of the bag-of-words model provided for each review. The setup adopted is mostly similar to that of the MNIST dataset, but considers 10 stratified folds (instead of 20), given the smaller number of samples in the set.
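The instance construction shared by the three datasets, and the walk-forward evaluation used for the time series, can be sketched as follows. This is a minimal illustration and not the iSklearn implementation: all function names are hypothetical, `score_on_fold` stands for whatever evaluation is run on a single fold or district, and the walk-forward sketch assumes an expanding training window, so that five spans yield four fit/predict rounds.

```python
import random
from statistics import mean


def sample_instances(n_folds, n_instances, rng, size=3):
    """Each instance is a tuple of `size` distinct fold (or district) indices."""
    return [tuple(rng.sample(range(n_folds), size)) for _ in range(n_instances)]


def instance_score(instance, score_on_fold):
    """Performance on an instance is the average performance over its folds."""
    return mean(score_on_fold(fold) for fold in instance)


def walk_forward_splits(n_samples, n_spans=5):
    """Yield (train, test) index ranges: fit on all earlier spans, predict the next one."""
    bounds = [round(i * n_samples / n_spans) for i in range(n_spans + 1)]
    for i in range(1, n_spans):
        yield range(0, bounds[i]), range(bounds[i], bounds[i + 1])


# Example: MNIST-like setup with 20 folds, and a series of 100 time steps.
mnist_instances = sample_instances(n_folds=20, n_instances=5, rng=random.Random(0))
series_splits = list(walk_forward_splits(100))
```

With `n_folds=17` the same sampler produces the district tuples of the crime dataset, and with `n_folds=10` the fold tuples of LMRD.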

As previously discussed, the same baseline configuration space is used to configure high-performing pipelines for each of these problems, varying only as a function of the dataset density and task nature. In addition, problem-specific hyperparameters may also be configured if so required. In this work, for example, the number of lags in the time series problem is provided as an additional hyperparameter for irace to configure.

To benchmark the best pipelines found, we fit them using the whole training sets. This way we avoid using data previously used for training (during the configuration process) for validation. We then compare these models on the testing sets with the scikit-learn predictors present in our configuration space, using their suggested default hyperparameters from scikit-learn. Furthermore, to have an AutoML comparison baseline, we trained AutoSklearn ensembles for each dataset using a time configuration budget equivalent to that used by irace.⁴ Finally, we remark that algorithms that present stochastic components are run 10 times, and their mean performance over those runs is considered for comparison. In the following chapter we present the results of these experiments, discuss the pipelines selected by iSklearn, as well as compare the results obtained with AutoSklearn, and with current

³ <http://www.imdb.com>

⁴ One could argue that a comparison between pipelines and ensembles against stand-alone, unconfigured algorithms favors the more elaborate approaches. Yet, from a non-expert perspective these composite approaches are unlikely to be adopted due to their complexity. Indeed, AutoML is an important tool for bridging the gap between such users and more developed algorithmic concepts.


4 Assessing pipelines configured from irace

We start our discussion by analyzing the pipeline structures selected by irace for each dataset, given in Table 3. For brevity, the Crime Incidence dataset is referred to as Crime. In addition, FE1 and FE2 depict the possibility of using Selection and Extraction simultaneously; if only one of them is used, it is depicted as FE1 and FE2 is shown as blank. Finally, when Selection is adopted, its associated strategy is reported in parentheses (model-free or model-based).

Although many runs for many different datasets would be required to draw conclusive insights, we highlight a few observations concerning the pipelines engineered in this work.

Feature engineering. Even if specialized preprocessing is not available for many datasets, some form of feature engineering is always selected. In particular, Selection is always used, configured as model-based for the classification datasets and as model-free for the regression one. In this latter case, feature engineering also includes Extraction, for which PCA is selected. Furthermore, in two out of three cases, scaling is adopted prior to feature engineering, and in the case of MNIST, also after it.

Model-based selection. It is rather interesting that feature selection is performed with a model different from the model chosen for prediction. This is the case for both classification datasets, where random forests are chosen for model-based selection, whereas support vector machines are used for estimation. Although commonly adopted in practice by experienced ML practitioners, this option is counter-intuitive for non-experts, who would tend to select the same model for both components.

Simplicity of the predictors. The predictors selected are relatively simple when compared to some of the other possibilities, such as the neural network-based multi-layer perceptron or the ensemble methods AdaBoost and random forests. Once again, an experienced practitioner understands that simpler methods should be assessed first when dealing with a novel dataset, but a non-expert tends to select more complex models, assuming this complexity will translate into better predictive performance.

Table 3: iSklearn-configured pipeline overview for each dataset

Stage           Component   MNIST               LMRD                Crime
Preprocessing   Scaling     True                False               True
                FE1         Selection           Selection           Selection
                            (model-based: RF)   (model-based: RF)   (model-free: ANOVA F-values)
                FE2         -                   -                   Extraction (PCA)
Prediction      Scaling     True                False               False
                Predictor   SVM                 SVM                 LR

Some of these observations (particularly the latter) might suggest that more complex pipelines are not selected because of the maximum runtime we fixed for each experiment, for feasibility. Nonetheless, the comparisons of final scores for each dataset given in Table 4 confirm that these pipelines are indeed high-performing w.r.t. the provided parameter space. In this table, accuracy values are given for MNIST and LMRD, whereas R² is given for Crime. The best value obtained for each dataset is highlighted in boldface.

Below, we discuss the most important insights from this analysis, grouped by the algorithms with which we compare the pipelines produced by irace.

Algorithms that ship with scikit-learn. For all datasets, the best model found by irace from iSklearn outperforms all scikit-learn default models. Although this analysis does not cover all ML algorithms available in scikit-learn, it is certainly positive as to the effectiveness of our proposed approach. It is also important to remark that the models selected by irace for classification (SVM) perform very poorly when run with the suggested default parameter settings.

AutoSklearn. When comparing AutoML approaches, the pipelines obtained from iSklearn demonstrate competitive performance w.r.t. the ensembles created with AutoSklearn, given the setup adopted in this work. Indeed, for the MNIST and Crime datasets, the iSklearn pipelines display the overall best score among all algorithms considered. This is not the case on LMRD, where both AutoML approaches outperform the remaining ones, yet the ensemble from AutoSklearn achieves the best overall score. This is a very important result, given that our investigation proposes a very simple approach to AutoML, in contrast to the elaborate components considered in AutoSklearn, such as the possibility of creating ensembles.


Table 4: Accuracy and R² scores for each dataset. Values highlighted in boldface indicate that the corresponding algorithm was the best performing for the given dataset among all algorithms considered.

        KNN      DT       RF       AB       MLP      SVM       LR       iSklearn   AutoSklearn
MNIST   96.88    87.76    94.16    72.99    95.82    11.35     -        98.6       97.67
LMRD    67.47    70.07    73.01    80.34    86.35    63.39     -        88.736     88.916
Crime   0.1126   0.1479   0.1457   0.0937   0.1708   -0.1756   0.2008   0.2130     0.2102

Overall analysis. To draw a more general conclusion, we consider a rank sum analysis over all datasets, which ranks iSklearn first and AutoSklearn second. Interestingly, SVM is ranked last, and indeed its results are far inferior to those of the remaining algorithms. This is important evidence of the value of AutoML, given that the iSklearn pipelines for classification are both SVM-based, and yet a non-expert would have missed this option.

State-of-the-art. Put in perspective regarding the known state-of-the-art for the MNIST and LMRD datasets, the scores obtained still fall short. The explanation lies in the use of domain-specific feature processing algorithms and more advanced machine learning algorithms, such as convolutional neural networks for MNIST in (CIREGAN; MEIER; SCHMIDHUBER, 2012) and LSTM-based neural networks for LMRD in (MIYATO; DAI; GOODFELLOW, 2016), reaching accuracies of 99.77% and 94.1%, respectively. Rather than disappointing, these expected results serve as a strong motivation to further expand iSklearn with libraries that offer domain-specific components or more elaborate ML algorithmic packages, such as TensorFlow or Keras.
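The rank sum analysis mentioned in the overall analysis can be reproduced from the scores of Table 4 with a short Python sketch. It rests on one assumption not stated in the text: LR is left out, since it is only evaluated on Crime and cannot be ranked on the two classification datasets. Ties are not handled because none occur in these scores.

```python
# Scores copied from Table 4 (accuracy for MNIST and LMRD, R² for Crime).
# Assumption: LR is excluded because it has a score for Crime only.
SCORES = {
    "MNIST": {"KNN": 96.88, "DT": 87.76, "RF": 94.16, "AB": 72.99,
              "MLP": 95.82, "SVM": 11.35, "iSklearn": 98.6, "AutoSklearn": 97.67},
    "LMRD": {"KNN": 67.47, "DT": 70.07, "RF": 73.01, "AB": 80.34,
             "MLP": 86.35, "SVM": 63.39, "iSklearn": 88.736, "AutoSklearn": 88.916},
    "Crime": {"KNN": 0.1126, "DT": 0.1479, "RF": 0.1457, "AB": 0.0937,
              "MLP": 0.1708, "SVM": -0.1756, "iSklearn": 0.2130, "AutoSklearn": 0.2102},
}


def rank_sums(scores):
    """Sum, over datasets, the rank of each algorithm (1 = best score)."""
    totals = {}
    for results in scores.values():
        ordered = sorted(results, key=results.get, reverse=True)
        for rank, name in enumerate(ordered, start=1):
            totals[name] = totals.get(name, 0) + rank
    return totals


totals = rank_sums(SCORES)
ranking = sorted(totals, key=totals.get)  # lowest rank sum first
```

Under this assumption the rank sums are 4 for iSklearn, 5 for AutoSklearn and 24 for SVM, which agrees with the ordering reported above.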

In the next chapter we investigate other ways in which we should improve our work before tackling this issue.


5 Further experiments

As a way to better understand how our method compares to state-of-the-art approaches to common machine learning tasks, we undertook further experiments, turning now specifically to computer vision problems. In this section, we describe (i) the datasets selected, (ii) the changes to the experimental setup, and (iii) the results.

5.1 Datasets

The datasets chosen for the experiments conducted in this section are listed below. These datasets were chosen for their popularity in machine learning benchmarks (LIN; CHEN; YAN, 2013; ZHONG et al., 2017; KRIZHEVSKY; HINTON, 2010; HUANG et al., 2017), and for their relatively large size when compared to datasets used in previous works on automated machine learning (FEURER et al., 2015; THORNTON et al., 2013; KOMER; BERGSTRA; ELIASMITH, 2014). All datasets contain small images and are generally used for classification.

Fashion MNIST (FMNIST) (XIAO; RASUL; VOLLGRAF, 2017) is a classification problem dataset with the same number of training and testing samples as MNIST, each sample also sharing the same format (grayscale) and dimensions. In contrast to digits, the ten possible classes in Fashion MNIST are different types of articles of clothing: t-shirt/top, trouser, pullover, dress, coat, sandal, shirt, sneaker, bag, and ankle boot. A sample of the dataset can be seen in Figure 1.

Street View House Numbers (SVHN) (NETZER et al., 2011) is a real-world dataset of house numbers obtained from Google Street View images. Similarly to MNIST, it contains images of centered digits (see Figure 2). However, SVHN images come from a significantly harder real-world problem of digit recognition in natural scene images. It is also a larger dataset than MNIST, with the default split comprising 73,257 samples for training and 26,032 for testing.


Figure 1: 75 Fashion MNIST samples (5 rows, 15 columns).

Figure 2: 30 SVHN samples (3 rows, 10 columns). Margins were added for visualization.

Figure 3: 30 CIFAR-10 / CIFAR-100 samples (3 rows, 10 columns). Margins were added for visualization.

CIFAR-10 and CIFAR-100 (KRIZHEVSKY; HINTON et al., 2009) are datasets of 32x32-pixel natural scene images, like SVHN. Subjects range from living beings (people, animals, etc.) to objects (ships, trucks, cars, etc.), and the default split comprises 50,000 training images and 10,000 test images. In CIFAR-10, images depict 10 different classes, each class having 6,000 samples. By contrast, CIFAR-100 images depict 100 different classes, each class having 600 samples.

5.2 Experimental setup

In this section, we investigate three research directions to further refine our experimental setup design. A summary of all the different setups considered in these experiments is given in Table 5.


Table 5: Summary of the six experimental setups considered in this section. Regular stands for the configuration assessed in the previous section, used here as baseline.

           regular      5k budget    30m cutoff   SF            SF 5k         SF 20m
Instance   three-fold   three-fold   three-fold   single-fold   single-fold   single-fold
Budget     2000         5000         2000         2000          5000          2000
Cutoff     15m          15m          30m          10m           10m           20m

Table 6: Accuracy for each dataset/experimental configuration pair.

            iSklearn                                                    baseline
            regular   5k budget   30m cutoff   SF      SF 5k   SF 20m   SVM     RF
CIFAR-10    35.73     43.61       40.15        50.85   43.24   32.13    39.98   49.98
CIFAR-100   20.50     20.54       17.88        19.58   16.80   21.14    13.84   19.45
FMNIST      88.63     87.64       85.23        88.26   88.99   89.18    89.7    84.75
SVHN        65.89     72.82       66.7         65.33   80.21   55.10    37.39   33.89

Increased cutoff time: in this research direction, we increase the cutoff time for the evaluation of a given candidate on a given instance, from 15 to 30 minutes.

Increased tuning budget: in this research direction, we increase the maximum number of experiments irace is allowed to perform, from 2,000 to 5,000.

Single-fold (SF) approaches: in this research direction, we model an instance as a single randomly sampled fold, rather than as a tuple of three folds. Specifically, we investigate three setups, differing as to cutoff time and tuning budget:

• SF: 2,000 experiments, 10-minute cutoff time.
• SF 20m: 2,000 experiments, 20-minute cutoff time.
• SF 5k: 5,000 experiments, 10-minute cutoff time.
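To make the cost of each setup concrete, the sketch below encodes the six setups of Table 5 and computes a simple upper bound on sequential compute time, namely budget times cutoff. This bound is our own back-of-the-envelope arithmetic and is not reported in the experiments: it assumes every experiment runs until the cutoff and that nothing runs in parallel, so actual campaigns should take considerably less. The class and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Setup:
    name: str
    folds_per_instance: int
    budget: int          # maximum number of experiments
    cutoff_minutes: int  # per-experiment runtime limit

    def worst_case_hours(self):
        """Upper bound on sequential compute time if every experiment hits the cutoff."""
        return self.budget * self.cutoff_minutes / 60


# Values transcribed from Table 5.
SETUPS = [
    Setup("regular", 3, 2000, 15),
    Setup("5k budget", 3, 5000, 15),
    Setup("30m cutoff", 3, 2000, 30),
    Setup("SF", 1, 2000, 10),
    Setup("SF 5k", 1, 5000, 10),
    Setup("SF 20m", 1, 2000, 20),
]

hours = {setup.name: setup.worst_case_hours() for setup in SETUPS}
```

Under these assumptions the regular setup is bounded by 500 hours, while the two enlarged three-fold setups are bounded by 1,250 and 1,000 hours.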

5.3 Results

Accuracy results are given in Table 6, where rows depict datasets and columns the different experimental setups previously explained. The best value obtained per benchmark is highlighted in boldface. In addition, results from the literature that do not use specialized components nor are based on deep learning are given as baseline (LU; ZHANG; VETTIVELU, 2017; XIAO; RASUL; VOLLGRAF, 2017).

In general, results obtained from iSklearn are competitive with the manually-designed approaches from the literature. In particular, single-fold approaches are preferable to our previous approach of three-fold instances. Do note that, while our exploration of different experimental configurations seems to indicate that more favorable results are expected with the single-fold approach, most other points are inconclusive. That is, increasing the number of experiments and/or the cutoff time for each experiment does not necessarily lead to better results, and in some cases not even equivalent results are reached. On the other hand, this could be related to variance in runs, and repeating irace campaigns would likely help mature these conclusions.

Table 7: Feature engineering (FE) and prediction (Pred) components selected for each dataset/experimental configuration pair.

                   regular   5k budget   30m cutoff   SF      SF 5k   SF 20m
CIFAR-10    FE     -         E           S & E        S       E & S   -
            Pred   kNN       kNN         kNN          SVM     kNN     AB
CIFAR-100   FE     E         E & S       -            -       S       S
            Pred   kNN       kNN         DT           MLP     kNN     MLP
FMNIST      FE     S         -           S            E & S   -       -
            Pred   MLP       MLP         SVM          MLP     MLP     MLP
SVHN        FE     S         S & E       S            S       S       S
            Pred   kNN       kNN         kNN          kNN     SVM     kNN

In Appendix A, we can see for each experimental configuration and dataset the best performing candidate with a particular estimator at each iteration. Such performance is measured as the mean of scores obtained by the candidate on the experiments run at each iteration. Note that before concluding, most of these runs are already considering only one or two different estimators, while other parameters are still being configured. Note also the increase in number of iterations as the experimental budget is increased, possibly allowing for a finer tuning of each pipeline.

We proceed with our analysis with the pipeline composition assessment given in Table 7. Similarly to Table 6, rows depict benchmarks, whereas columns depict the different experimental setups considered. Each cell lists the feature engineering (FE) components when used (and, if so, in which order). In addition, cells indicate the predictors (Pred) selected by irace.

Patterns in this table are much clearer. Regarding predictors, k-nearest neighbors (kNN) is selected half of the time, although none of the best-performing pipelines are based on kNN. Yet, while this is a relatively simple prediction algorithm, most of the kNN-based pipelines use some form of FE. Together, these components lead to competitive performance on CIFAR-100 and Fashion MNIST w.r.t. the other pipelines. Out of the prediction algorithms that were in fact chosen in the best-performing pipelines, SVM and MLP occur equally often. The success of the latter might be correlated with the overall success of other more complex neural networks in computer vision problems. As for the former, SVMs had already been selected on two out of three datasets of different domains in our previous experiments. Indeed, a review of the best-performing algorithms that do not use specialized components nor are based on deep learning (LU; ZHANG; VETTIVELU, 2017; XIAO; RASUL; VOLLGRAF, 2017) confirms that these algorithms are the best options for small image computer vision datasets.

Table 8: iSklearn-configured pipelines for the best experimental configuration in each dataset.

Stage           Component   CIFAR-10            CIFAR-100                      FMNIST   SVHN
Preprocessing   Scaling     False               True                           -        False
                FE          Selection           Selection                      -        Selection
                            (model-based: RF)   (model-free: ANOVA F-values)            (model-based: RF)
Prediction      Scaling     False               False                          True     False
                Predictor   SVM                 MLP                            MLP      SVM

Concerning pre-processing, over two thirds of the cells include some FE component. Surprisingly, even if extraction is expected to be effective for computer vision problems, only half of the pipelines that include FE components comprise extraction. More importantly, none of the pipelines that are best-performing for each benchmark use feature extraction. Among the best-performing pipelines, three out of four adopt feature selection, but differ as to how. This is further detailed in Table 8, where we can see that the feature selection components are the same as those selected in the experiments of the previous section. Heatmaps representing the best performing candidates with particular preprocessing configurations at each iteration can be seen in Appendix B, which further illustrates the dominance of feature selection over extraction.
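Two of the proportions quoted in this discussion, namely the share of kNN-based pipelines and the share of pipelines with some FE component, can be checked directly against Table 7. The sketch below is only a tally; the cell values are our transcription of Table 7, with columns in the order regular, 5k budget, 30m cutoff, SF, SF 5k, SF 20m, and the variable names are hypothetical.

```python
# (FE, predictor) pairs transcribed from Table 7; "-" means no FE component.
TABLE_7 = {
    "CIFAR-10": [("-", "kNN"), ("E", "kNN"), ("S & E", "kNN"),
                 ("S", "SVM"), ("E & S", "kNN"), ("-", "AB")],
    "CIFAR-100": [("E", "kNN"), ("E & S", "kNN"), ("-", "DT"),
                  ("-", "MLP"), ("S", "kNN"), ("S", "MLP")],
    "FMNIST": [("S", "MLP"), ("-", "MLP"), ("S", "SVM"),
               ("E & S", "MLP"), ("-", "MLP"), ("-", "MLP")],
    "SVHN": [("S", "kNN"), ("S & E", "kNN"), ("S", "kNN"),
             ("S", "kNN"), ("S", "SVM"), ("S", "kNN")],
}

cells = [cell for row in TABLE_7.values() for cell in row]

# Fraction of the 24 pipelines whose predictor is kNN.
knn_share = sum(pred == "kNN" for _, pred in cells) / len(cells)

# Fraction of the 24 pipelines that include at least one FE component.
fe_share = sum(fe != "-" for fe, _ in cells) / len(cells)
```

With this transcription, kNN accounts for 12 of the 24 pipelines, and 17 of the 24 include some FE component.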


6 Concluding remarks

Automated machine learning seeks to bridge the gap between non-expert practitioners of ML and the specialized knowledge underlying successful computational intelligence applications. Both industry and academia have a growing interest in this field and seek to popularize it. Although there are some working AutoML tools available to the general public, the efficacy of these tools is continuously improved by the research community.

Still, a number of efficient tools devised over the past few years by the algorithm configuration community have rarely (if at all) been explored in the context of automated machine learning. irace is one such tool, having become one of the most used and efficient algorithm configurators currently available. Indeed, it is unfortunate that no AutoML tool powered by irace had been proposed so far, particularly due to its setup flexibility, ability to deal with different types of parameters, and distinct optimization algorithm. In this work, we have conducted a first effort in this direction, covering all aspects required to use irace in the context of AutoML.

Although our approach is fairly simple, we have demonstrated how irace can be used to configure pipelines that outperform the ML predictors they are built from, considering several relevant application domains, namely computer vision, natural language processing, and time series analysis. The pipelines engineered in this work even displayed competitive performance w.r.t. more elaborate ensembles produced by the well-known AutoSklearn. Furthermore, inquiries into alternate experimental setups for this tool showed that there is potential for improvement, as well as valuable research questions on the benefits and detriments of using irace in this context.

And yet, the ultimate goal of AutoML is to help devise state-of-the-art algorithms in an automated way, a goal that has been proven feasible in other fields as long as templates and frameworks are enriched with domain-specific components. Currently, our proposal is application-independent, but in future work we intend to pursue this goal. More specifically, a pressing point in our work will be investigating the inclusion of deep learning algorithms into our system by way of transfer learning. Although challenging due to the computational overhead incurred by deep learning, this research direction has the potential to help us approach state-of-the-art results. Besides, there are a number of frameworks offering such algorithms, which can be integrated with iSklearn to increase the efficacy of the pipelines one may instantiate from it.

Finally, it is also fairly important to expand the range of application domains considered. To this end, our goal is to provide iSklearn as an open source project, and to endeavor to


References

BERGSTRA, J. et al. Hyperopt: a python library for model selection and hyperparameter optimization. Computational Science & Discovery, IOP Publishing, v. 8, n. 1, p. 014008, 2015.

BERGSTRA, J. S. et al. Algorithms for hyper-parameter optimization. In: Advances in neural information processing systems. [S.l.: s.n.], 2011. p. 2546–2554.

BEZERRA, L. C. T. A component-wise approach to multi-objective evolutionary algorithms: from flexible frameworks to automatic design. Thesis (PhD) - IRIDIA, École polytechnique, Université Libre de Bruxelles, Belgium, 2016.

BEZERRA, L. C. T.; LÓPEZ-IBÁÑEZ, M.; STÜTZLE, T. Automatic component-wise design of multi-objective evolutionary algorithms. IEEE Transactions on Evolutionary Computation, IEEE, v. 20, p. 403–417, 2016.

BEZERRA, L. C. T.; LÓPEZ-IBÁÑEZ, M.; STÜTZLE, T. Automatic configuration of multi-objective optimizers and multi-objective configuration. In: KOROŠEC, P. et al. (Ed.). High-Performance Simulation Based Optimization. [S.l.]: Springer International Publishing, 2017, (Studies in Computational Intelligence). Accepted.

BHATIA, N. et al. Survey of nearest neighbor techniques. arXiv preprint arXiv:1007.0085, 2010.

BIRATTARI, M. et al. F-race and iterated f-race: An overview. In: Experimental methods for the analysis of optimization algorithms. [S.l.]: Springer, 2010. p. 311–336.

BISHOP, C. Pattern Recognition and Machine Learning. Springer New York, 2016. (Information Science and Statistics). ISBN 9781493938438. Available at: <https://books.google.com.br/books?id=kOXDtAEACAAJ>.

BREIMAN, L. Random forests. Machine learning, Springer, v. 45, n. 1, p. 5–32, 2001.

BREIMAN, L. Classification and regression trees. [S.l.]: Routledge, 2017.

BUITINCK, L. et al. Api design for machine learning software: experiences from the scikit-learn project. arXiv preprint arXiv:1309.0238, 2013.

CARUANA, R. et al. Ensemble selection from libraries of models. In: ACM. Proceedings of the twenty-first international conference on Machine learning. [S.l.], 2004. p. 18.

CELISSE, A.; ROBIN, S. Nonparametric density estimation by exact leave-p-out cross-validation. Computational Statistics & Data Analysis, Elsevier, v. 52, n. 5, p. 2350–2368, 2008.


CHOLLET, F. et al. Keras. 2015. <https://keras.io>.

CIREGAN, D.; MEIER, U.; SCHMIDHUBER, J. Multi-column deep neural networks for image classification. In: IEEE. Computer vision and pattern recognition (CVPR), 2012 IEEE conference on. [S.l.], 2012. p. 3642–3649.

CONOVER, W. J. Practical nonparametric statistics. Wiley New York, 1980.

CORTES, C.; VAPNIK, V. Support-vector networks. Machine Learning, v. 20, n. 3, p. 273–297, Sep 1995. ISSN 1573-0565. Available at: <https://doi.org/10.1007/BF00994018>.

FEURER, M. et al. Efficient and robust automated machine learning. In: CORTES, C. et al. (Ed.). Advances in Neural Information Processing Systems 28. Curran Associates, Inc., 2015. p. 2962–2970. Available at: <http://papers.nips.cc/paper/5872-efficient-and-robust-automated-machine-learning.pdf>.

FRANCESCA, G. et al. Automode-chocolate: automatic design of control software for robot swarms. Swarm Intelligence, Springer, v. 9, n. 2-3, p. 125–152, 2015.

FRANZIN, A.; CÁCERES, L. P.; STÜTZLE, T. Effect of transformations of numerical parameters in automatic algorithm configuration. Optimization Letters, Springer, p. 1–13, 2017.

FREUND, Y.; SCHAPIRE, R. E. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of computer and system sciences, Elsevier, v. 55, n. 1, p. 119–139, 1997.

GIRDEN, E. R. ANOVA: Repeated measures. [S.l.]: Sage, 1992.

HASTIE, T.; TIBSHIRANI, R.; FRIEDMAN, J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer New York, 2013. (Springer Series in Statistics). ISBN 9780387216065. Available at: <https://books.google.com.br/books?id=yPfZBwAAQBAJ>.

HINTON, G. E. Connectionist learning procedures. In: Machine Learning, Volume III. [S.l.]: Elsevier, 1990. p. 555–610.

HOLMES, G.; DONKIN, A.; WITTEN, I. H. Weka: A machine learning workbench. 1994.

HOSSIN, M.; SULAIMAN, M. A review on evaluation metrics for data classification evaluations. International Journal of Data Mining & Knowledge Management Process, Academy & Industry Research Collaboration Center (AIRCC), v. 5, n. 2, p. 1, 2015.

HUANG, G. et al. Densely connected convolutional networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition. [S.l.: s.n.], 2017. p. 4700–4708.

HUTTER, F.; HOOS, H. H.; LEYTON-BROWN, K. Sequential model-based optimization for general algorithm configuration. In: SPRINGER. International Conference on Learning and Intelligent Optimization. [S.l.], 2011. p. 507–523.

HUTTER, F. et al. Algorithm runtime prediction: Methods & evaluation. Artificial Intelligence, v. 206, p. 79 – 111, 2014.


KHALID, S.; KHALIL, T.; NASREEN, S. A survey of feature selection and feature extraction techniques in machine learning. In: IEEE. 2014 Science and Information Conference. [S.l.], 2014. p. 372–378.

KOMER, B.; BERGSTRA, J.; ELIASMITH, C. Hyperopt-sklearn: automatic hyperparameter configuration for scikit-learn. In: ICML workshop on AutoML. [S.l.: s.n.], 2014. p. 2825–2830.

KOTSIANTIS, S.; KANELLOPOULOS, D.; PINTELAS, P. Data preprocessing for supervised leaning. International Journal of Computer Science, Citeseer, v. 1, n. 2, p. 111–117, 2006.

KOTTHOFF, L. et al. Auto-weka 2.0: Automatic model selection and hyperparameter optimization in weka. The Journal of Machine Learning Research, JMLR. org, v. 18, n. 1, p. 826–830, 2017.

KRASKOV, A.; STÖGBAUER, H.; GRASSBERGER, P. Estimating mutual information. Physical review E, APS, v. 69, n. 6, p. 066138, 2004.

KRIZHEVSKY, A.; HINTON, G. Convolutional deep belief networks on cifar-10. Unpublished manuscript, v. 40, n. 7, p. 1–9, 2010.

KRIZHEVSKY, A.; HINTON, G. et al. Learning multiple layers of features from tiny images. [S.l.], 2009.

LECUN, Y. et al. Gradient-based learning applied to document recognition. Proceedings of the IEEE, IEEE, v. 86, n. 11, p. 2278–2324, 1998.

LIN, M.; CHEN, Q.; YAN, S. Network in network. arXiv preprint arXiv:1312.4400, 2013.

LÓPEZ-IBÁÑEZ, M. et al. The irace package: Iterated racing for automatic algorithm configuration. Operations Research Perspectives, v. 3, p. 43–58, 2016.

LÓPEZ-IBÁÑEZ, M.; STÜTZLE, T. The automatic design of multiobjective ant colony optimization algorithms. IEEE Transactions on Evolutionary Computation, IEEE, v. 16, n. 6, p. 861–875, 2012.

LU, D.; ZHANG, Y.; VETTIVELU, N. A Comparison of Classifiers on the CIFAR-10 Dataset. [S.l.], 06 2017.

MAAS, A. L. et al. Learning word vectors for sentiment analysis. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies. Portland, Oregon, USA: Association for Computational Linguistics, 2011. p. 142–150. Available at: <http://www.aclweb.org/anthology/P11-1015>.

MARON, O.; MOORE, A. W. The racing algorithm: Model selection for lazy learners. Artificial Intelligence Review, Springer, v. 11, n. 1-5, p. 193–225, 1997.

MASCIA, F. et al. Grammar-based generation of stochastic local search heuristics through automatic algorithm configuration tools. Computers & operations research, Elsevier, v. 51, p. 190–199, 2014.


MITCHELL, T. Machine Learning. McGraw-Hill, 1997. (McGraw-Hill International Editions). ISBN 9780071154673. Available at: <https://books.google.com.br/books?id=EoYBngEACAAJ>.

MIYATO, T.; DAI, A. M.; GOODFELLOW, I. Adversarial training methods for semi-supervised text classification. arXiv preprint arXiv:1605.07725, 2016.

NAGELKERKE, N. J. et al. A note on a general definition of the coefficient of determination. Biometrika, Oxford University Press, v. 78, n. 3, p. 691–692, 1991.

NETZER, Y. et al. Reading digits in natural images with unsupervised feature learning. 2011.

OLSON, R. S. et al. Evaluation of a tree-based pipeline optimization tool for automating data science. In: Proceedings of the Genetic and Evolutionary Computation Conference 2016. New York, NY, USA: ACM, 2016. (GECCO ’16), p. 485–492. ISBN 978-1-4503-4206-3. Available at: <http://doi.acm.org/10.1145/2908812.2908918>.

PEDREGOSA, F. et al. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, v. 12, p. 2825–2830, 2011.

PUMPERLA, M. et al. Hyperas. 2016. <https://github.com/maxpumperla/hyperas>.

REAL, E. et al. Large-scale evolution of image classifiers. CoRR, abs/1703.01041, 2017. Available at: <http://arxiv.org/abs/1703.01041>.

THORNTON, C. et al. Auto-weka: Combined selection and hyperparameter optimization of classification algorithms. In: ACM. Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining. [S.l.], 2013. p. 847–855.

WILLMOTT, C. J.; MATSUURA, K. Advantages of the mean absolute error (mae) over the root mean square error (rmse) in assessing average model performance. Climate research, v. 30, n. 1, p. 79–82, 2005.

XIAO, H.; RASUL, K.; VOLLGRAF, R. Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms. 2017.

ZHONG, Z. et al. Random erasing data augmentation. arXiv preprint arXiv:1708.04896, 2017.

ZOPH, B.; LE, Q. V. Neural architecture search with reinforcement learning. In: . [s.n.], 2017. Available at: <https://arxiv.org/abs/1611.01578>.


APPENDIX A -- Estimator tuning

Figure 4: CIFAR-10 estimator tuning performance heatmaps. Panels: (a) 30m cutoff; (b) regular; (c) 5k budget; (d) SF 20m; (e) SF; (f) SF 5k.

Figure 5: CIFAR-100 estimator tuning performance heatmaps. Panels: (a) 30m cutoff; (b) regular; (c) 5k budget; (d) SF 20m; (e) SF; (f) SF 5k.

Figure 6: Fashion MNIST estimator tuning performance heatmaps. Panels: (a) 30m cutoff; (b) regular; (c) 5k budget; (d) SF 20m; (e) SF; (f) SF 5k.

Figure 7: SVHN estimator tuning performance heatmaps. Panels: (a) 30m cutoff; (b) regular; (c) 5k budget; (d) SF 20m; (e) SF; (f) SF 5k.


APPENDIX B -- Preprocessing tuning performance heatmaps

Figure 8: CIFAR-10 preprocessing tuning performance heatmaps. Panels: (a) 30m cutoff; (b) regular; (c) 5k budget; (d) SF 20m; (e) SF; (f) SF 5k.

Figure 9: CIFAR-100 preprocessing tuning performance heatmaps. Panels: (a) 30m cutoff; (b) regular; (c) 5k budget; (d) SF 20m; (e) SF; (f) SF 5k.

Figure 10: Fashion MNIST preprocessing tuning performance heatmaps. Panels: (a) 30m cutoff; (b) regular; (c) 5k budget; (d) SF 20m; (e) SF; (f) SF 5k.

Figure 11: SVHN preprocessing tuning performance heatmaps. Panels: (a) 30m cutoff; (b) regular; (c) 5k budget; (d) SF 20m; (e) SF; (f) SF 5k.
