
Forecasts of Multivariate Time Series

Sampled From Industrial Machinery Sensors

Rio das Ostras

May 7, 2020


Forecasts of Multivariate Time Series Sampled From

Industrial Machinery Sensors

Dissertation presented to the Mestrado Profissional em Engenharia de Produção e Sistemas Computacionais of Universidade Federal Fluminense as a partial requirement for obtaining the degree of Master.

Research Line: Information Systems Engineering.

Universidade Federal Fluminense

Mestrado Profissional em Engenharia de Produção e Sistemas Computacionais

Advisor: Leila Weitzel Coelho da Silva

Co-advisor: Ana Paula Barbosa Sobral

Rio das Ostras

May 7, 2020


Librarian in charge: Monnique São Paio de Azeredo Esteves Veiga - CRB7/6921

Forecasts of Multivariate Time Series Sampled From Industrial Machinery Sensors / Heron Felipe Rosas dos Santos; Leila Weitzel Coelho da Silva, advisor; Ana Paula Barbosa Sobral, co-advisor. Niterói, 2019.

107 p. : il.

Dissertation (professional master's) - Universidade Federal Fluminense, Rio das Ostras, 2019.

DOI: http://dx.doi.org/10.22409/PPG-MESC.2019.m.04053481503

1. Industrial maintenance. 2. Forecasting. 3. Time series. 4. Neural network. 5. Intellectual production. I. Silva, Leila Weitzel Coelho da, advisor. II. Sobral, Ana Paula Barbosa, co-advisor. III. Universidade Federal Fluminense. Instituto de Ciência e Tecnologia. IV. Title.


Forecasts of Multivariate Time Series Sampled From

Industrial Machinery Sensors

Dissertation presented to the Mestrado Profissional em Engenharia de Produção e Sistemas Computacionais of Universidade Federal Fluminense as a partial requirement for obtaining the degree of Master.

Research Line: Information Systems Engineering.

Approved in September 2019.

Prof. Dra. Leila Weitzel Coelho da Silva (Advisor)

Instituto de Ciência e Tecnologia - UFF

Prof. Dra. Ana Paula Barbosa Sobral (Co-advisor)

Instituto de Ciência e Tecnologia - UFF

Prof. Dr. Leonard Barreto Moreira Instituto de Ciência e Tecnologia - UFF

Prof. Dra. Zenaide Carvalho Silva Instituto de Geociências e Engenharias - UNIFESSPA

Rio das Ostras

May 7, 2020


…complain of, if I'm going to do what I was born for, the things I was brought into the world to do? Or is this what I was created for? To huddle under the blankets and stay warm? (Marcus Aurelius)


Prognostics assesses and predicts future machine health, which includes detecting incipient failures and predicting remaining useful life. Several studies have treated prognostics from a time series forecasting perspective. The main goal of this study is to evaluate the performance of a set of methods in the prediction of future values on a dataset of time series collected from sensors installed in an industrial gas turbine. Forecasting methods tested include the use of multivariate and univariate neural networks (FNN and LSTM), exponential smoothing and ARIMA models. Results show that the use of ARIMA models to forecast on the studied dataset is the best default method to apply, and is the only forecasting method that consistently beats a simple naïve no-change model.


Prognostics assesses and predicts the future condition of machines, which includes detecting incipient failures and predicting remaining useful life. Several studies have treated prognostics from a time series forecasting point of view. The main goal of this study is to evaluate the performance of a set of methods in the prediction of future values on a set of time series collected from sensors installed in an industrial gas turbine. The forecasting methods evaluated include the use of univariate and multivariate neural networks (FNN and LSTM), exponential smoothing and ARIMA models. The results show that the use of ARIMA models to forecast on the studied set of time series is the best method to apply by default, and that it is the only forecasting method that consistently outperforms a simple naïve model that assumes no change over time.


List of Figures

Figure 1 – The V-Architecture of CBM.
Figure 2 – The supervised learning process.
Figure 3 – Nonlinear model of a neuron k.
Figure 4 – The threshold function.
Figure 5 – The sigmoid function for varying slope parameter a.
Figure 6 – A multilayer feedforward network.
Figure 7 – Representation of a recurrent neural network.
Figure 8 – A LSTM cell.
Figure 9 – Johnson & Johnson quarterly earnings per share, 84 quarters, 1960-I to 1980-IV.
Figure 10 – Example correlogram. Quarterly percentage change in US consumption.
Figure 11 – Example PACF. Quarterly percentage change in US consumption.
Figure 12 – Difference in perception between diagnostics and prognostics.
Figure 13 – Position of embedded sensors on the turbine under consideration.
Figure 14 – Data preprocessing.
Figure 15 – Steps in the proposed method.
Figure 16 – The study's dataset. Daily Values.
Figure 17 – The study's dataset. Daily Values. Without extreme outliers.
Figure 18 – ACF and PACF for series P1.
Figure 19 – ACF and PACF for series P2.
Figure 20 – ACF and PACF for series P3.
Figure 21 – ACF and PACF for series P4.
Figure 22 – ACF and PACF for series R1.
Figure 23 – ACF and PACF for series R2.
Figure 24 – ACF and PACF for series T1.
Figure 25 – ACF and PACF for series T2.
Figure 26 – ACF and PACF for series T3.
Figure 27 – ACF and PACF for series T4.
Figure 28 – ACF and PACF for series T5.
Figure 29 – ACF and PACF for series T6.
Figure 30 – ACF and PACF for series T7.
Figure 31 – ACF and PACF for series T8.
Figure 32 – ACF and PACF for series T9.
Figure 33 – ACF and PACF for series T10.
Figure 34 – ACF and PACF for series T11.
Figure 35 – ACF and PACF for series T12.
Figure 36 – ACF and PACF for series T13.
Figure 37 – ACF and PACF for series T14.
Figure 38 – ACF and PACF for series V1.
Figure 39 – ACF and PACF for series V2.
Figure 40 – ACF and PACF for series V3.
Figure 41 – ACF and PACF for series V4.
Figure 42 – ACF and PACF for series V5.
Figure 43 – ACF and PACF for series V6.
Figure 44 – ACF and PACF for series V7.
Figure 45 – ACF and PACF for series V8.
Figure 46 – ACF and PACF for series V9.
Figure 47 – ACF and PACF for series V10.
Figure 48 – ACF and PACF for series V11.


List of Tables

Table 1 – Taxonomy of exponential smoothing methods.
Table 2 – Number of points in the raw dataset, per sensor.
Table 3 – Number of points in the dataset, per sensor, after the removal of stoppage points.
Table 4 – p-values for the Augmented Dickey-Fuller test. All time series in the dataset.
Table 5 – p-values for Levene's test. All time series in the dataset.
Table 6 – p-values for the Anderson-Darling test. All time series in the dataset.
Table 7 – ETS models selected for the time series. Both the original training set and the training set without extreme outliers.
Table 8 – MSE obtained on the test set by the application of ETS models trained on the normal training set.
Table 9 – MSE obtained on the test set by the application of ETS models trained on the training set without extreme outliers.
Table 10 – ARIMA models selected for the time series. Both the original training set and the training set without extreme outliers.
Table 11 – For how many series an ARIMA selection procedure delivers the best results.
Table 12 – MSE obtained on the test set by the application of ARIMA models trained on the normal training set.
Table 13 – MSE obtained on the test set by the application of ARIMA models trained on the training set without extreme outliers.
Table 14 – MSE obtained on the test set by the application of non-seasonal ARIMA models without Box-Cox transform. Fitted on the normal training set.
Table 15 – Hyperparameters considered for the FNN networks.
Table 16 – Average validation loss (MSE) for the one hidden layer multivariate FNN after 10 training runs.
Table 17 – Top 10 two hidden layers multivariate FNN architectures. Ranked by average validation loss (MSE) after 10 training runs.
Table 18 – Number of units selected for the hidden layers of the univariate FNNs.
Table 19 – Average validation loss (MSE) for the multivariate LSTM networks after 10 training runs.
Table 20 – Number of units selected for the hidden layers of the univariate LSTM networks.
Table 21 – MSE obtained on the test set by the application of univariate FNNs. Trained on the normal training set.
Table 22 – MSE obtained on the test set by the application of univariate LSTMs. Trained on the normal training set.
Table 23 – MSE obtained on the test set by the application of univariate FNNs. Fitted on the training set without extreme outliers.
Table 24 – MSE obtained on the test set by the application of univariate LSTMs. Fitted on the training set without extreme outliers.
Table 25 – MSE obtained on the test set by the application of the best multivariate one hidden layer FNN. Trained on the normal training set.
Table 26 – MSE obtained on the test set by the application of the best multivariate two hidden layers FNN. Trained on the normal training set.
Table 27 – MSE obtained on the test set by the application of the best multivariate LSTM. Trained on the normal training set.
Table 28 – MSE obtained on the test set by the application of the best multivariate one hidden layer FNN. Fitted on the training set without extreme outliers.
Table 29 – MSE obtained on the test set by the application of the best multivariate two hidden layers FNN. Fitted on the training set without extreme outliers.
Table 30 – MSE obtained on the test set by the application of the best multivariate LSTM. Fitted on the training set without extreme outliers.
Table 31 – Average MSE obtained on the test set by the evaluated forecasting methods. Fitted on the normal training set.
Table 32 – Average MSE obtained on the test set by the evaluated forecasting methods. Fitted on the training set without extreme outliers.
Table 33 – Number of series for which a forecasting method produces the best results. Based on MSE. Fitted on the normal training set.
Table 34 – Number of series for which a forecasting method produces the best results. Based on MSE. Fitted on the training set without extreme outliers.


Contents

1 INTRODUCTION
1.1 Motivation
1.2 Main Goal
1.3 Detailed Goals
2 LITERATURE REVIEW
2.1 Artificial intelligence and Machine Learning
2.2 Neural Networks and Deep Learning
2.2.1 Introduction
2.2.2 Modern Neural Networks
2.2.3 Recurrent Neural Networks and Long Short Term Memory Units
2.2.4 Deep Learning
2.2.5 Recent Work on Deep Learning
2.3 Time Series
2.3.1 Stationarity and Normality
2.3.2 Autocorrelation
2.3.3 Forecasting
2.3.3.1 Exponential Smoothing
2.3.3.2 Automatic Forecasting With Exponential Smoothing
2.3.3.3 ARIMA Models
2.3.3.4 Automatic Forecasting With ARIMA Models
2.3.3.5 Model Selection and Evaluation
2.3.4 Time Series Forecasting With Neural Networks
2.4 Prognostics
2.4.1 Recent Work on Prognostics
2.4.2 Prognostics and Time Series
3 METHODOLOGY
3.1 Research Classification
3.2 Data
3.3 Method
4 DATA
4.1 Preprocessing
4.2 Preliminary Analysis
5.1 Exponential Smoothing
5.1.1 Method Selection
5.1.2 Results on the Test Set
5.2 ARIMA Models
5.2.1 Model Selection
5.2.2 Results on the Test Set
6 NEURAL NETWORKS
6.1 Model Selection
6.2 Results on the Test Set
7 DISCUSSION
8 CONCLUSION
REFERENCES


1 Introduction

1.1 Motivation

The International Organization for Standardization - ISO (2013, p. 25) defines maintenance as "the combination of all technical and administrative actions, including supervisory actions, intended to retain an item in, or restore it to, a state in which it can perform a required function". Maintenance is an important part of managing assets. When properly done, the maintenance function allows a plant to perform up to its design standards, with its maintenance costs tracking on budget, and with reasonable capital investments (CAMPBELL; JARDINE; MCGLYNN, 2016).

According to Pintelon and Parodi-Herz (2008, p. 21), at first maintenance was nothing more than "a mere inevitable part of production". It is Pintelon and Parodi-Herz's view that industrial maintenance was perceived both as an accessory function that could not be managed and as an unavoidable cost. However, there has been profound technological evolution between today's installations, which are more automated, complex and right-sized, and the installations from the start of the twentieth century. This technological evolution, combined with fierce and worldwide competition, means that an industry's physical assets take a central role, and so does the maintenance function, elevated to a strategic element in accomplishing business goals.

In the 1950s, almost all maintenance actions were corrective. Corrective maintenance actions can be defined as repair or restore actions following a breakdown or loss of function (PINTELON; PARODI-HERZ, 2008). Another definition of corrective maintenance is the maintenance that occurs after a system fails (WANG, 2002). Allowing a machine or equipment to run until its failure is a reactive maintenance management technique that requires a plant to be able to react to all possible failures within itself. This implies high spare parts inventory costs, high overtime labor costs, high machine downtime, and low production availability. Analysis of maintenance costs shows that repairs made in a reactive mode are normally three times more expensive than the same repair made on a scheduled basis (MOBLEY, 2002).

Precautionary maintenance actions have the fundamental aim of diminishing the failure probability of the physical asset and of anticipating or avoiding the consequences of a failure. Preventive maintenance is a precautionary maintenance policy popularized in the 1960s (PINTELON; PARODI-HERZ, 2008). Preventive maintenance is a time-driven maintenance management program that assumes all machines will degrade within a time frame typical of their particular classification. Maintenance is then scheduled based on the MTTF (Mean Time To Failure) statistic. The normal result of this approach is either unnecessary repair or catastrophic failure. The first option means the materials and labor used on the repair were wasted. The second option is even more costly, equating to run-to-failure maintenance (MOBLEY, 2002).

Shin and Jun (2015) define CBM (Condition-Based Maintenance), initially called predictive maintenance, as:

a maintenance policy which does maintenance actions before product failure happens, by assessing product condition, including operating environments, and predicting the risk of product failure in a real-time way, based on gathered product data (SHIN; JUN, 2015, p. 120).

Mobley (2002) says that:

the common premise of predictive maintenance is that regular monitoring of the actual mechanical condition, operating efficiency, and other indicators of the operating condition of machine-trains and process systems will provide the data required to ensure the maximum interval between repairs and minimize the number and cost of unscheduled outages created by machine-train failures (MOBLEY, 2002, p. 4).

While the technologies and technical methods for CBM are still in their infancy, advancements in information technology have accelerated growth in CBM technology by enabling network bandwidth, data collection and retrieval, data analysis, and decision support capabilities for large sets of time series data. Recently, the petroleum, petrochemical and natural gas industries have become more interested in the CBM policy (SHIN; JUN, 2015; PRAJAPATI; BECHTEL; GANESAN, 2012).

Prajapati, Bechtel and Ganesan (2012) represent the CBM process in a V architecture, shown in Figure 1. The diagram shows the several methods and activities of the CBM process, including acquisition of real-time input, data conversion and prognostication. Diagnostics and prognostics are two parts of CBM. Diagnostics is a reactive process. It takes place after a fault has already occurred and aims to determine the root cause of the failure. It cannot prevent machine downtime and the corresponding expenses. Prognostics is a proactive process. It assesses and predicts future machine health, which includes detecting incipient failures and predicting remaining useful life (LEE et al., 2014). Prajapati, Bechtel and Ganesan (2012) define prognostics as:

the process of predicting the future failure of any system by analyzing the current and previous history of the operating conditions of the system or monitoring the deviation rate of the operation from the normal conditions (PRAJAPATI; BECHTEL; GANESAN, 2012, p. 388).

Real prognostic systems are still scarce in industry (DRAGOMIR et al., 2009). In the Oil and Gas industry context, Cho et al. (2016, p. 3) state that estimating the next failure time with sensor data is still an undeveloped area in offshore plant equipment. Current approaches to prognostics fall into three classes: model based, data driven and hybrid. Model based approaches presume that it is possible to build a mathematical model from the understanding of the physical mechanisms involved in the failure modes of the machine for which the model is built. While these approaches have the advantage of incorporating physical understanding of the system, if the understanding of the system degradation is poor, it may be difficult to model the system behavior. Data driven approaches use data gathered from sensors or by the machine operators to track features that indicate the degradation of the system. Data driven approaches can leverage computer intelligence techniques like neural networks and decision trees, or statistical techniques like auto-regressive models (DRAGOMIR et al., 2009). There is no agreement on what are the best methods and techniques for each prognostics application. One commonly used technique is the analysis of time series.


Figure 1 – The V-Architecture of CBM.

Source: Prajapati, Bechtel and Ganesan (2012).

Several studies into prognostics have treated it from a time series forecasting perspective (PHAM et al., 2012; HENG et al., 2009; DATONG; YU; XIYUAN, 2011; NIU; YANG, 2010; CHO et al., 2016).

In recent years, many studies have shown good performance from artificial neural networks (ANNs) for time series forecasting in diverse domains, like predicting traffic speed and telephone call load. These studies explored different network architectures in the forecasting task (EGRIOGLU et al., 2015; BIANCHI et al., 2015; KHASHEI; BIJARI, 2010; MA et al., 2015).

Historically, it was believed that statistically sophisticated or complex time series forecasting methods do not necessarily produce more accurate forecasts than simpler methods developed by practicing forecasters (MAKRIDAKIS; HIBON, 2000). However, more recent evaluations have concluded that more complex methods based on computational intelligence and neural networks have caught up, and that simpler methods can no longer claim to outperform computer intelligence methods without a proper empirical evaluation (CRONE; HIBON; NIKOLOPOULOS, 2011).

It is often the case that process data, collected in the form of time series, is compressed and archived for record keeping and only retrieved for use in emergency analysis after a fault has occurred. This data could be of tremendous advantage when combined with effective analytics and superior computing power capable of generating knowledge from the data (QIN, 2014). It is possible that the application of time series forecasting methods based on neural networks, combined with the huge available amount of machinery historical data, may lead to more precise and accurate prognostics of industrial machines. Ultimately, better prognostics lead to reduced maintenance costs and increased production availability.

The structure of the remainder of this work is as follows: Chapter 2 presents a literature review of the subjects relevant to this work. Chapter 3 covers the nature of the research and the methods applied to data preprocessing, forecasting and forecast evaluation. Chapter 4 describes the dataset used in the study and the preprocessing applied to the dataset prior to any model building. Chapter 5 covers the selection of the ETS and ARIMA models, as well as the forecasts generated by these models on the dataset. Chapter 6 covers the design of the FNN and LSTM neural networks, and the forecasts generated by these networks on the dataset. Chapter 7 is a discussion of the results obtained by applying the proposed methods to the dataset. Chapter 8 provides a conclusion for the study.

1.2 Main Goal

The main goal of this research is to evaluate a set of established forecasting methods and a set of neural network based methods in time series forecasting tasks for industrial prognostics. The research tests the hypothesis that artificial neural network based methods may deliver superior performance compared to established statistical methods. This will be done by using these forecasting methods to predict future values for a set of time series of process and mechanical condition data collected from a gas turbine installed on an oil platform.

It is expected that the predictions should increase the information available for the maintenance decision-making process. Ultimately, the study should generate knowledge that could lead to improved asset management and reduced maintenance costs.

1.3 Detailed Goals

In order to achieve the main objective, the following steps are necessary:

• Systematic literature review in the area of the prognostics process of condition based maintenance.


• Identify the methods based on artificial neural networks that have been delivering the best performance in the task of time series prediction.

• Define and implement a set of methods for the prediction of future values for an industrial machine's monitored parameters, based on the usage of artificial neural networks.


2 Literature Review

2.1 Artificial intelligence and Machine Learning

Artificial intelligence, also called computational or machine intelligence, is the study of intelligent agents. An agent is anything that perceives and interacts with its environment. Other definitions of artificial intelligent systems vary on the definition of intelligence as human or rational behavior, and on the concern with intelligent thinking (the inner workings of the system) or with intelligent behavior (the system's output). It is possible to follow many different approaches in order to build an intelligent agent. There are many subfields to AI. Some are general, like perception and logical reasoning. Others are more specific, like playing chess (RUSSELL; NORVIG, 2016).

Ever since the inception of computers, humans have wondered whether they could be made to learn and to improve automatically with experience. Successfully understanding how to make computers learn would lead to many new uses of computers and new levels of competence and customization (MITCHELL et al., 1997). According to Segaran (2007), machine learning is a subfield of artificial intelligence concerned with algorithms that allow a machine to learn. The algorithm takes a data set, infers information about the properties of the data, and that inferred information is used to make predictions about other data in the future. The algorithm looks for patterns in the data and then uses those patterns to generalize. There are many different machine learning methods, all with different strengths and weaknesses.

Machine learning tasks can be divided into two categories: unsupervised learning and supervised learning. Supervised learning consists in building a model to predict, or estimate, an output based on one or more inputs. In unsupervised learning, we have a data set of measurements but no associated response variable. However, unsupervised learning makes it possible to learn relationships and structures of the data set, even in the absence of a response variable (JAMES et al., 2013).

Supervised learning can be divided into classification problems and regression problems. In classification problems, as in identifying email spam and in bank fraud detection, the dependent variable is a discrete variable with limited possibilities (the classes). The goal is to predict, based on the input values of a sample (the independent variables), to which class (the dependent variable) the sample belongs. In regression problems, as in weather forecasting and stock market prediction, the dependent variable is continuous and the goal is to estimate a value for this variable based on a sample's input values (BRINK; RICHARDS; FETHEROLF, 2016). Figure 2 shows the supervised learning process.

In the last two decades, machine learning has experienced significant progress:

Machine learning has progressed dramatically over the past two decades, from laboratory curiosity to a practical technology in widespread commercial use. [. . . ] Many developers of AI systems now recognize that, for many applications, it can be far easier to train a system by showing it examples of desired input-output behavior than to program it manually by anticipating the desired response for all possible inputs (JORDAN; MITCHELL, 2015, p. 255).


Figure 2 – The supervised learning process.

Source: Raschka (2015).

This progress has been supported by increased capabilities to gather and transmit vast amounts of data. Machine learning succeeds in solving problems in diverse fields like robotics and self-driving vehicles, speech and natural language processing, computer vision, neuroscience research and proteomics (JORDAN; MITCHELL, 2015; TYANOVA et al., 2016).

2.2 Neural Networks and Deep Learning

2.2.1 Introduction

Neural network is a term that encompasses a large class of models and learning methods. In spite of the hype surrounding neural networks, they are just nonlinear statistical models. They extract linear combinations of the input variables, and then model the target variable as a nonlinear function of these linear combinations (FRIEDMAN; HASTIE; TIBSHIRANI, 2001).

According to Haykin (2009), artificial neural networks are models built through the massive interconnection of simple computing cells, called neurons or processing units. In that way, a neural network is a massively parallel distributed processor. Neural networks acquire knowledge through a process called learning, and the knowledge is stored in the weights (the strengths) of the interneuron connections.

In 1943 McCulloch and Pitts proposed a theory for the calculus of events in the nervous system using propositional logic. The model was based on the assumptions of theoretical neurophysiology of the time, such as the assumption that the nervous system is a network of neurons and that at any instant a neuron has some threshold that an excitation must exceed in order to initiate an impulse.

In 1957 Rosenblatt introduced the idea of the perceptron. The perceptron is a system that emulates the perceptual processes of the biological brain, depending on probabilistic rather than deterministic principles for its operation. Rosenblatt's perceptron had a training rule that updated the weights proportionally to the values of the inputs.


After the introduction of the perceptron, research in the field of neural networks progressed with innovations like ADALINE (WIDROW; HOFF, 1960), which used an iterative gradient based method for training, until it was proved in 1969 that the perceptron model was not capable of representing many important problems. This caused a decrease in research funds for the area. The situation was only reversed in the 1980s, with the popularization of the backpropagation algorithm for network training and the discovery that non-linearly separable problems could be solved by applying multilayer perceptrons. From that time on, the development in the field of neural networks has been explosive (KRIESEL, 2007).

2.2.2 Modern Neural Networks

Neurons are the fundamental information-processing units in the operation of a neural network. A model of a neuron is shown in Figure 3. There are three basic elements in the neuron's model (HAYKIN, 2009):

• A set of connecting links, each characterized by a weight of its own. A signal $x_j$ at the output of neuron j, connected to neuron k, is multiplied by the weight $w_{kj}$.

• An adder for summing the input signals and the externally applied bias. This is also called the propagation function, often the weighted sum of inputs. It generates a net input (KRIESEL, 2007).

• An activation function for limiting the output of a neuron. Usually the output is limited to the closed unit interval [0, 1], or, alternatively, [−1, 1].

Figure 3 – Nonlinear model of a neuron k.


The adder function results in an activation potential $v_k$, given by Equation 2.1. The neuron then takes the activation potential and inputs it to the activation function $\varphi(v_k)$. McCulloch and Pitts' perceptron uses a threshold activation function. The output of a McCulloch-Pitts neuron is always either 0 or 1. The threshold function is given by Equation 2.2. The behavior of the threshold function can be seen in Figure 4.

$$v_k = \sum_{j=1}^{m} w_{kj}\, x_j + b_k \qquad (2.1)$$

$$\varphi(v) = \begin{cases} 1 & \text{if } v \geq 0 \\ 0 & \text{if } v < 0 \end{cases} \qquad (2.2)$$

Perceptrons are unstable to small variations in their weights and bias. A small change may cause the neuron's output to flip between 0 and 1. That flip may cause the behavior of the rest of the network to change in some very complicated way, a property that is undesirable for network training. This problem is overcome with the introduction of a different type of neuron. The sigmoid neuron uses a sigmoid activation function. The sigmoid function, given in Equation 2.3, is continuous for all values of v, as shown in Figure 5. Small changes in the sigmoid neuron's weights and bias cause only a small change in its output. That is the crucial fact that allows a network of sigmoid neurons to learn (NIELSEN, 2015).

$$\varphi(v) = \frac{1}{1 + \exp(-a v)} \qquad (2.3)$$

A third type of activation function, at present the most popular type for machine vision tasks, is the rectified linear unit (ReLU). As in Equation 2.4, ReLU simply returns the maximum value between 0 and the activation potential v (LECUN; BENGIO; HINTON, 2015). At present, the understanding of why ReLU delivers better results than sigmoid neurons is poor. Regardless, ReLU has shown good results on benchmark datasets and the practice of its use has spread (NIELSEN, 2015).

$$\varphi(v) = \begin{cases} v & \text{if } v \geq 0 \\ 0 & \text{if } v < 0 \end{cases} \qquad (2.4)$$

Figure 4 – The threshold function.

Figure 5 – The sigmoid function for varying slope parameter a. Source: Haykin (2009).

The manner in which a network's neurons are structured is called its architecture. Figure 6 shows a feedforward network. The figure shows the input layer, two hidden layers and the output layer. The input layer has one neuron for each independent variable in the input. The denomination hidden layer simply means that a layer is neither an input nor an output layer. As stated, Figure 6 shows a feedforward network, meaning there are no loops in the network. In a feedforward network, a layer only uses as input the output from the previous layer. There are neural network models where feedback loops are allowed. These models are called Recurrent Neural Networks (RNNs) (NIELSEN, 2015). Haykin (2009) speaks of three main classes of network architectures: single-layer feedforward networks, multilayer feedforward networks and recurrent networks.

Multilayer feedforward networks, which may be called multi-layer perceptrons, in spite of using sigmoid neurons instead of McCulloch-Pitts' perceptrons (NIELSEN, 2015), are the quintessential deep learning models, being of extreme importance to machine learning practitioners and forming the basis of many deep learning applications (GOODFELLOW; BENGIO; COURVILLE, 2016).

The layers of a multilayer feedforward network are fully connected. This means that the outputs of every neuron in a layer are used as inputs for all the neurons in the following layer. The activation of a neuron j in a layer l is given by Equation 2.5. The neuron sums the activations $a^{l-1}_k$ of all neurons in layer l − 1, weighted by the weights $w^l_{jk}$ of the connections between neuron k in layer l − 1 and neuron j in layer l. The neuron adds the bias $b^l_j$ and applies the activation function. All the activations in a layer can be calculated at once using vectors and matrices, as in Equation 2.6. A layer's activations are represented as a vector $a^l$, its biases as a vector $b^l$, and the weights connecting to the layer as a matrix $w^l$ (NIELSEN, 2015).

$$a^l_j = \varphi\left(\sum_k w^l_{jk}\, a^{l-1}_k + b^l_j\right) \qquad (2.5)$$

$$a^l = \varphi(w^l a^{l-1} + b^l) \qquad (2.6)$$
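Equation 2.6 maps directly onto matrix-vector code. The sketch below, a minimal NumPy illustration with arbitrary layer sizes rather than the architectures evaluated later in this study, propagates an input vector through a feedforward network with two hidden layers.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def forward(x, weights, biases):
    # Apply Equation 2.6 layer by layer: a^l = phi(w^l a^(l-1) + b^l).
    a = x
    for W, b in zip(weights, biases):
        a = sigmoid(W @ a + b)
    return a

# Example architecture: 4 inputs, hidden layers with 8 and 6 units, 1 output.
sizes = [4, 8, 6, 1]
weights = [rng.normal(size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(m) for m in sizes[1:]]

x = rng.normal(size=4)
print(forward(x, weights, biases))
```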

The entire neural network can be thought of as a chain of functions that represent each of the network's layers. As an example, a network with one hidden layer could be represented as $f(x) = f^{(2)}(f^{(1)}(x))$, where x is the input vector, $f^{(1)}$ is the hidden layer and $f^{(2)}$ is the output layer (GOODFELLOW; BENGIO; COURVILLE, 2016). Each layer's output is a function of its weights and biases, and of the previous layer's activations. Hence, the output of the neural network is a function of the input vector, which is the input to the first hidden layer, and of the network's weights and biases. Training the network corresponds to finding values for the parameters (the weights and biases) that minimize a network performance metric (the cost function) (JORDAN; MITCHELL, 2015). The performance metric is a quantitative measure of the performance of a machine learning algorithm. Usually, after training, the performance is evaluated on a data set that wasn't used during the training of the model. This data set is called a test set. The choice of cost function is an important part of neural network design. Neural networks often use one of the cost functions commonly used by other parametric machine learning models, combined with a regularization term (GOODFELLOW; BENGIO; COURVILLE, 2016). The regularization term aims to avoid overfitting, a situation in which, during training, the performance of the model improves on the training set, but that improvement doesn't translate into increased performance on the test data (NIELSEN, 2015).

Figure 6 – A multilayer feedforward network.

Supervised learning on a multilayer perceptron can be seen as a numerical optimization problem (HAYKIN, 2009). Neural network training is the most difficult optimization problem involved in deep learning, and one can spend considerable computational resources on a single instance of the neural network training problem. The training problem consists in finding a set of weights w that minimizes a cost function C(w), which includes a performance measure on the training set and may include additional regularization terms. Learning in neural networks differs from traditional optimization in that, while we care about a performance measure P defined with respect to the test set, we only optimize P indirectly. We reduce a cost C(w) measured on the training set, in the hope that this will result in improvements to P. Stochastic gradient descent is a commonly used algorithm for neural network training and, while probably the most used optimization algorithm for deep learning, it is but one method to minimize the cost function (GOODFELLOW; BENGIO; COURVILLE, 2016).

Gradient descent is a method to minimize a function f(x), where $f : \mathbb{R}^n \to \mathbb{R}$. The method starts from a point $x^{(0)}$ and searches for the minimum of the function f(x) by moving in a direction opposite to the gradient ($\Delta x = -\nabla f(x)$) (BOYD; VANDENBERGHE, 2004). The gradient of a function is a vector whose components are the partial derivatives of the function relative to each of the function's variables. A partial derivative tells the rate of change of a function when only one of its variables is changed. The gradient tells how to move up a function. The gradient vector's direction tells in which way to move, and the vector's amplitude tells how much to move (STRANG, 1991). The backpropagation algorithm is a way to calculate the partial derivatives of a neural network's cost function:

The back-propagation algorithm is a computationally efficient technique for computing the gradients (i.e., first-order derivatives) of the cost function e(w), expressed as a function of the adjustable parameters (synaptic weights and bias terms) that characterize the multilayer perceptron. (HAYKIN, 2009, p. 180)

The backpropagation algorithm is efficient because the complexity of the algorithm is linear in the number of weights in the network (HAYKIN, 2009). The algorithm is an application of the chain rule for differentiation:

The key insight is that the derivative (or gradient) of the objective with respect to the input of a module can be computed by working backwards from the gradient with respect to the output of that module (or the input of the subsequent module) [...] The backpropagation equation can be applied repeatedly to propagate gradients through all modules, starting from the output at the top (where the network produces its prediction) all the way to the bottom (where the external input is fed). Once these gradients have been computed, it is straightforward to compute the gradients with respect to the weights of each module. (LECUN; BENGIO; HINTON, 2015)

Nielsen (2015) summarizes the application of backpropagation in four equations, independent of the format of the activation function and of the cost function, as long as the cost function C can be written as an average over individual training examples $C_x$, and as a function of the outputs of the neural network. First, the error $\delta^l_j$ in the jth neuron of the lth layer is introduced in Equation 2.7, defined as the partial derivative of the cost with respect to the neuron's activation potential $v^l_j$. The first backpropagation equation is Equation 2.8. It results in a vector with the errors in the output layer. $\nabla_a C$ is a vector with the partial derivatives of the cost relative to the output layer activations. The second equation, Equation 2.9, results in a vector with the errors in layer l in terms of the errors in the next layer l + 1. The third equation, Equation 2.10, gives the partial derivative of the cost relative to any bias. The fourth equation, Equation 2.11, gives the partial derivative of the cost relative to any weight. These equations allow the calculation of the gradient of the cost function.

$$\delta^l_j = \frac{\partial C}{\partial v^l_j} \qquad (2.7)$$

$$\delta^L = \nabla_a C \odot \sigma'(v^L) \qquad (2.8)$$

$$\delta^l = \left((w^{l+1})^T \delta^{l+1}\right) \odot \sigma'(v^l) \qquad (2.9)$$

$$\frac{\partial C}{\partial b^l_j} = \delta^l_j \qquad (2.10)$$

$$\frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k\, \delta^l_j \qquad (2.11)$$
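The sketch below applies Equations 2.8 to 2.11 to a small fully connected network with sigmoid activations and a quadratic cost, followed by one gradient descent step. It is a toy illustration on a single training example, with arbitrary sizes and learning rate; it is not the training procedure used for the networks evaluated in this study.

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def sigmoid_prime(v):
    s = sigmoid(v)
    return s * (1.0 - s)

# A small network: 3 inputs -> 5 hidden units -> 2 outputs.
sizes = [3, 5, 2]
weights = [rng.normal(size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(m) for m in sizes[1:]]

def backprop(x, y):
    # Gradients of a quadratic cost C = 0.5 * ||a^L - y||^2 via Equations 2.8-2.11.
    # Forward pass, storing activation potentials v^l and activations a^l.
    a, activations, potentials = x, [x], []
    for W, b in zip(weights, biases):
        v = W @ a + b
        a = sigmoid(v)
        potentials.append(v)
        activations.append(a)
    # Output layer error (Equation 2.8): delta^L = grad_a C (.) sigma'(v^L).
    delta = (activations[-1] - y) * sigmoid_prime(potentials[-1])
    grads_W = [None] * len(weights)
    grads_b = [None] * len(biases)
    grads_W[-1] = np.outer(delta, activations[-2])   # Equation 2.11
    grads_b[-1] = delta                              # Equation 2.10
    # Propagate the errors backwards (Equation 2.9).
    for l in range(len(weights) - 2, -1, -1):
        delta = (weights[l + 1].T @ delta) * sigmoid_prime(potentials[l])
        grads_W[l] = np.outer(delta, activations[l])
        grads_b[l] = delta
    return grads_W, grads_b

# One gradient descent step with learning rate eta.
x, y, eta = rng.normal(size=3), np.array([0.0, 1.0]), 0.5
gW, gb = backprop(x, y)
for l in range(len(weights)):
    weights[l] -= eta * gW[l]
    biases[l] -= eta * gb[l]
```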

2.2.3 Recurrent Neural Networks and Long Short Term Memory Units

Recurrent Neural Networks (RNNs) are a family of neural networks specialized in processing sequential data. RNNs can be built in different ways, with recurrent connections between hidden units, or between output units and hidden units, and they can approximate any function involving recursion. In an RNN, the state h(t) (the activations of the hidden units) depends on the past value of the state h(t − 1), as in Equation 2.12. Applying Equation 2.12 recursively reveals that the state contains information on the entire past sequence. Figure 7 shows a recurrent network with recursive connections on the hidden layer.

$$h(t) = f(h(t-1), x(t); \theta) \qquad (2.12)$$

Figure 7 – Representation of a recurrent neural network.

Standard recurrent neural networks suffer from the problem of gradient instability. In typical neural networks, with standard activation functions, the backpropagated errors either shrink or grow exponentially with the number of layers in the network. The problem is worse in RNNs because the gradients are propagated back through time, and each time period equates to an additional layer. The longer the network runs, the more unstable are the gradients on inputs further back in time. This problem is also known as the long time lag problem and it makes deep networks hard to train by backpropagation. Standard RNNs fail to learn when the lag between relevant input and target signal is greater than 5 to 10 time steps (NIELSEN, 2015; GERS; SCHMIDHUBER; CUMMINS, 1999; SCHMIDHUBER, 2015). Gated RNNs, like Long Short Term Memory (LSTM) networks, address the gradient instability problem by creating paths through time whose derivatives do not vanish or explode. Instead of simple nonlinear units in the hidden layers, an LSTM network has LSTM cells. The LSTM cells have an internal recurrence in addition to the outer recurrence of the RNN. LSTM networks are some of the most effective models for sequential data (GOODFELLOW; BENGIO; COURVILLE, 2016).

Gers, Schmidhuber and Cummins (1999) introduced an LSTM network with the addition of adaptive forget gates. The forget gates are designed to reset the memory cells when their contents are no longer relevant. Figure 8 shows an LSTM cell.

In an LSTM cell, all gating units and the input unit take as input the current network input $x^{(t)}$ and the previous LSTM layer output $h^{(t-1)}$. The state unit $s^{(t)}_i$ of LSTM cell i has a self-loop and is the most important component of the cell. The forget gate $f^{(t)}_i$ controls the weight of the state unit self-loop, which it sets to a value between 0 and 1. In that way, the forget gate controls how much of the information in the state is preserved or discarded between time steps. The forget gate is a sigmoid unit and the value of $f^{(t)}_i$ is given by Equation 2.13, where $b^f$, $U^f$ and $W^f$ are the biases, input weights, and recurrent weights for the forget gates.

$$f^{(t)}_i = \sigma\left(b^f_i + \sum_j U^f_{i,j}\, x^{(t)}_j + \sum_j W^f_{i,j}\, h^{(t-1)}_j\right) \qquad (2.13)$$

The input gate $g^{(t)}_i$, similar to the forget gate, is a sigmoid unit that yields a gating value between 0 and 1, and in that way controls how much of the network input is allowed to accumulate in the state. $g^{(t)}_i$ is given by Equation 2.14, where $b^g$, $U^g$ and $W^g$ are the biases, input weights, and recurrent weights for the input gates.

$$g^{(t)}_i = \sigma\left(b^g_i + \sum_j U^g_{i,j}\, x^{(t)}_j + \sum_j W^g_{i,j}\, h^{(t-1)}_j\right) \qquad (2.14)$$

The cell state $s^{(t)}_i$ is then updated as in Equation 2.15, where $b$, $U$ and $W$ are the biases, input weights, and recurrent weights into the LSTM cells.

$$s^{(t)}_i = f^{(t)}_i\, s^{(t-1)}_i + g^{(t)}_i\, \sigma\left(b_i + \sum_j U_{i,j}\, x^{(t)}_j + \sum_j W_{i,j}\, h^{(t-1)}_j\right) \qquad (2.15)$$

The LSTM cell also has an output gate $q^{(t)}_i$. The output gate is also a sigmoid unit and is capable of shutting off the LSTM cell output $h^{(t)}_i$. $q^{(t)}_i$ is given by Equation 2.16, where $b^o$, $U^o$ and $W^o$ are the biases, input weights, and recurrent weights for the output gates. Equation 2.17 yields the LSTM cell output $h^{(t)}_i$ (GOODFELLOW; BENGIO; COURVILLE, 2016).

$$q^{(t)}_i = \sigma\left(b^o_i + \sum_j U^o_{i,j}\, x^{(t)}_j + \sum_j W^o_{i,j}\, h^{(t-1)}_j\right) \qquad (2.16)$$

$$h^{(t)}_i = \tanh\left(s^{(t)}_i\right) q^{(t)}_i \qquad (2.17)$$
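As a concrete reading of Equations 2.13 to 2.17, the sketch below advances one LSTM layer through a few time steps in NumPy. The dimensions, parameter names and random initialization are assumptions made for illustration only and do not correspond to the networks evaluated in Chapter 6.

```python
import numpy as np

rng = np.random.default_rng(2)

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, s_prev, p):
    # One time step of an LSTM layer, following Equations 2.13 to 2.17.
    f = sigma(p["b_f"] + p["U_f"] @ x_t + p["W_f"] @ h_prev)              # forget gate (2.13)
    g = sigma(p["b_g"] + p["U_g"] @ x_t + p["W_g"] @ h_prev)              # input gate  (2.14)
    s = f * s_prev + g * sigma(p["b"] + p["U"] @ x_t + p["W"] @ h_prev)   # cell state  (2.15)
    q = sigma(p["b_o"] + p["U_o"] @ x_t + p["W_o"] @ h_prev)              # output gate (2.16)
    h = np.tanh(s) * q                                                    # cell output (2.17)
    return h, s

# Toy dimensions: 3 inputs per time step, 4 LSTM cells.
n_in, n_cells = 3, 4
p = {}
for name in ["f", "g", "o", ""]:
    suffix = f"_{name}" if name else ""
    p[f"b{suffix}"] = np.zeros(n_cells)
    p[f"U{suffix}"] = rng.normal(scale=0.1, size=(n_cells, n_in))
    p[f"W{suffix}"] = rng.normal(scale=0.1, size=(n_cells, n_cells))

# Run the cell over a short random sequence.
h, s = np.zeros(n_cells), np.zeros(n_cells)
for t in range(5):
    x_t = rng.normal(size=n_in)
    h, s = lstm_step(x_t, h, s, p)
print(h)
```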

2.2.4 Deep Learning

A deep neural network is a network with many hidden layers. Deep learning in a neural network is about assigning weights across the many computational layers so that the network exhibits the desired behavior (SCHMIDHUBER, 2015). Most machine learning algorithms require the manual extraction of features from the raw data, using domain expertise to design a useful representation of the raw data that can be used by a classification or regression algorithm. The addition of computational stages allows a neural network to automatically learn a useful representation from the raw data. Each layer builds on the representation of the previous layer, resulting in a representation at a slightly more abstract level (LECUN; BENGIO; HINTON, 2015).

In theory, the backpropagation algorithm should allow for deep learning. However, in the 1980s, it seemed as if the addition of hidden layers did not offer empirical benefits. The major obstacle to the application of backpropagation in deep networks is the problem of vanishing or exploding gradients (SCHMIDHUBER, 2015). On the problem of vanishing or exploding gradients:

in at least some deep neural networks, the gradient tends to get smaller as we move backward through the hidden layers. This means that neurons in the earlier layers learn much more slowly than neurons in later layers. [. . . ] there are fundamental reasons why this happens in many neural networks. The phenomenon is known as the vanishing gradient problem. (NIELSEN, 2015)


Figure 8 – A LSTM cell.

Source: Goodfellow, Bengio and Courville (2016).

Graphics Processing Units (GPUs) are cheap hardware that allows for increased speeds in neural network training. Originally intended for video games, GPUs excel at the vector and matrix multiplications required for neural network training. The speed of GPUs allows for the training of deep neural networks using standard backpropagation in reasonable time, even while not eliminating the problem of vanishing or exploding gradients. This shows that advances in hardware can be more important than advances in algorithms for deep learning (SCHMIDHUBER, 2015).

2.2.5 Recent Work on Deep Learning

A method based on a combination of deep neural networks, trained by supervised and reinforcement learning, and search trees was able to beat the European champion of Go, viewed as the most challenging classical game for artificial intelligence, in 5 out of 5 games. The neural network is used to prune the number of possible plays to be evaluated (SILVER et al., 2016).

Glasser et al. (2016) attacked the problem of classifying distinct areas in the human brain using a multi-layer perceptron. An algorithm generates area boundaries in magnetic resonance images. The boundaries are reviewed by specialists. The areas are then classified using neural networks.

Figure 9 – Johnson & Johnson quarterly earnings per share, 84 quarters, 1960-I to 1980-IV. Source: Shumway and Stoffer (2017).

Xu et al. (2015) used the features extracted from images by Convolutional Neural Networks (CNNs) as inputs for a recurrent neural network in order to automatically generate captions describing the contents of images. CNNs are neural networks that use convolution, a specialized kind of linear operation, instead of regular matrix multiplication in at least one of their layers (GOODFELLOW; BENGIO; COURVILLE, 2016).

Shin et al. (2016) explored the application of CNNs to two problems in medical imaging: thoraco-abdominal Lymph Node (LN) detection and Interstitial Lung Disease (ILD) classification. The focus of the work was on the evaluation of three factors in CNNs: architecture, dataset characteristics and transfer learning.

2.3 Time Series

Chatfield (2003, p. 1) defines a time series as "a collection of observations made sequentially through time". There is an expected correlation between data points sampled adjacently in time. This correlation should be accounted for, and it invalidates the use of many statistical tools that rely on the different points of a data set being independent from each other. Time series are important to many fields like economics, social sciences, epidemiology, medicine and industrial process control (SHUMWAY; STOFFER, 2017; CROARKIN et al., 2002). Figure 9 shows an example time series: quarterly earnings from Johnson & Johnson for the 1960-1980 period.

A mathematical representation of a discrete time series, a time series with data points measured at a fixed sampling interval, is given by Equation 2.18 (METCALFE; COWPERTWAIT, 2009).

$$\{x_t\} = \{x_1, x_2, x_3, \ldots\} \qquad (2.18)$$

2.3.1 Stationarity and Normality

Many important results in statistical analysis follow from relevant theoretical assumptions, relating to a selected method of analysis, being approximately satisfied. Stationarity and normality are two such commonly made assumptions (SAKIA, 1992).

A sort of regularity may exist over time in the behavior of a time series. The concept of stationarity captures this notion of regularity. A strictly stationary time series is one for which the probabilistic behavior of every collection of values, as in Equation 2.19, is identical to that of the time shifted set in Equation 2.20, for all time shifts h. The definition of strict stationarity is too strong for most applications. Moreover, it is difficult to assess strict stationarity from a single data set. Rather than imposing conditions on all possible distributions of a time series, it is possible to use a milder version that imposes conditions only on the first two moments of the series (SHUMWAY; STOFFER, 2017).

$$\{x_{t_1}, x_{t_2}, \ldots, x_{t_k}\} \qquad (2.19)$$

$$\{x_{t_1+h}, x_{t_2+h}, \ldots, x_{t_k+h}\} \qquad (2.20)$$

A milder definition of stationarity is given by Makridakis, Wheelwright and Hyndman (2008). The conditions for time series stationarity are that the process generating the data is in equilibrium around a constant value (the underlying mean) and that the variance around the mean remains constant over time.

There are several statistical tests developed to determine if a time series is stationary on trend. These tests are also known as unit root tests. The Dickey-Fuller test is the most widely used stationarity test. The Dickey-Fuller test on a time series $\{Y_t\}$ begins with the estimation of a regression model on the differenced time series $\{Y'_t\}$, as in Equation 2.21. The number p of lagged terms is usually set to 3. If $Y_t$ is stationary, $\phi$ will be negative (MAKRIDAKIS; WHEELWRIGHT; HYNDMAN, 2008).

$$Y'_t = \phi Y_{t-1} + b_1 Y'_{t-1} + b_2 Y'_{t-2} + \ldots + b_p Y'_{t-p} \qquad (2.21)$$

Stationarity on variance implies that the variances of k different samples are equal. The definition of homoscedasticity is equal and finite variances across samples. There are several tests for this problem, but many of them rely on the assumption of normality and are not robust to its violation. Levene's test is robust to non-normality and is a popular tool for checking homogeneity of variances. The null hypothesis of Levene's test is that the sample variances are equal. Further details on Levene's test are given by Gastwirth, Gel and Miao (2009).

Normality implies that the distribution of a sample is approximately normal. The Anderson-Darling test is a test of goodness of fit of a sample to a given distribution. It can be used to test the normality of a sample. The null hypothesis is that the distribution of the sample fits the specified distribution. Further details on the Anderson-Darling test are given by Lovric (2011).
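Chapter 4 reports p-values for exactly these three tests on the study's series (Tables 4 to 6). The sketch below shows how such checks might be run in Python with statsmodels and SciPy on a synthetic series; note that SciPy's Anderson-Darling implementation reports critical values rather than a p-value, so this is an illustration of the tests, not a reproduction of the study's procedure.

```python
import numpy as np
from scipy import stats
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(3)
series = rng.normal(loc=10.0, scale=1.0, size=500)  # synthetic stand-in for one sensor series

# Augmented Dickey-Fuller test: null hypothesis of a unit root (non-stationarity).
adf_stat, adf_pvalue, *_ = adfuller(series)

# Levene's test for homogeneity of variances between two halves of the series.
first_half, second_half = series[:250], series[250:]
lev_stat, lev_pvalue = stats.levene(first_half, second_half)

# Anderson-Darling test of normality.
ad_result = stats.anderson(series, dist="norm")

print(f"ADF p-value: {adf_pvalue:.4f}")
print(f"Levene p-value: {lev_pvalue:.4f}")
print(f"Anderson-Darling statistic: {ad_result.statistic:.4f}")
print("critical values:", ad_result.critical_values)
```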

2.3.2 Autocorrelation

The correlation coefficient is a statistic that measures the linear relationship between two variables. Autocorrelation is a similar measure that serves the same purpose for a single time series. Autocorrelation measures the correlation between observations at time t and observations at time t − 1, or observations lagged by k periods at time t − k. The autocorrelation between observations at times t and t − k is $r_k$, as given in Equation 2.22 (MAKRIDAKIS; WHEELWRIGHT; HYNDMAN, 2008).

$$r_k = \frac{\sum_{t=k+1}^{n} (Y_t - \bar{Y})(Y_{t-k} - \bar{Y})}{\sum_{t=1}^{n} (Y_t - \bar{Y})^2} \qquad (2.22)$$

The autocorrelations at lags 1, 2, 3, and so on, define the autocorrelation function (ACF). The autocorrelations can be plotted against the lags in order to form a correlogram, as in Figure 10. The ACF is a standard tool in time series exploration (MAKRIDAKIS; WHEELWRIGHT; HYNDMAN, 2008). An autocorrelation coefficient is considered not significant if it lies within $\pm 2/\sqrt{T}$, where T is the length of the time series. An issue with the autocorrelation is that an autocorrelation coefficient $r_k$ is influenced by the values between $Y_t$ and $Y_{t-k}$. The use of partial autocorrelations overcomes this issue. A partial autocorrelation coefficient $\alpha_k$ measures the relationship between $Y_t$ and $Y_{t-k}$ after removing the effects of time lags 1, 2, 3, . . . , k − 1. An autoregressive model as in Equation 2.31 estimates $\alpha_k$; $\alpha_k$ is equal to $\phi_k$. The partial autocorrelations at lags 1, 2, 3, and so on, define the partial autocorrelation function (PACF), as in Figure 11 (HYNDMAN; ATHANASOPOULOS, 2014).
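The ACF and PACF summarized in Figures 10 and 11, and computed for every sensor series in Figures 18 to 48, can be obtained numerically as in the sketch below. The AR(1)-like series is synthetic and the statsmodels calls are an assumed tooling choice, used here only to illustrate Equation 2.22 and the ±2/√T significance bound.

```python
import numpy as np
from statsmodels.tsa.stattools import acf, pacf

rng = np.random.default_rng(4)
# Synthetic AR(1)-like series standing in for one of the sensor series.
x = np.zeros(300)
for t in range(1, 300):
    x[t] = 0.7 * x[t - 1] + rng.normal()

r = acf(x, nlags=20)       # autocorrelation function (Equation 2.22)
alpha = pacf(x, nlags=20)  # partial autocorrelation function

# Coefficients outside +-2/sqrt(T) can be regarded as significant.
bound = 2.0 / np.sqrt(len(x))
significant_acf_lags = [k for k in range(1, 21) if abs(r[k]) > bound]
print(significant_acf_lags)
```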

2.3.3 Forecasting

According to Chatfield (2003, p. 73), forecasting the future values of a time series is an important problem in many areas, including economics, production planning, sales forecasting and stock control. The main objective of the forecasting problem consists in, given a historical time series with data between $x_1$ and $x_N$, estimating future values up to $x_{N+h}$ for the time series, where h is the forecasting horizon (CHATFIELD, 2003; METCALFE; COWPERTWAIT, 2009).

Figure 10 – Example correlogram. Quarterly percentage change in US consumption. Source: Hyndman and Athanasopoulos (2014).

Figure 11 – Example PACF. Quarterly percentage change in US consumption. Source: Hyndman and Athanasopoulos (2014).

There are many forecasting methods. Forecasts may be based simply on the judgment of experts, they may be based on a statistical source, or they may combine expert knowledge with statistics. Statistical methods may be univariate, using values of only one series to predict future values for that same series, or multivariate, using data from more than one time series (ARMSTRONG, 2001).

A naïve model is a model that presumes things will remain as they have in the past. For time series data, the naïve (no change) model simply forecasts the next observation as equal to the latest observation. The naïve model serves as a benchmark model for other models. If a model cannot forecast better than a simple alternative like the naïve no-change model, it is of no use (ARMSTRONG, 2001).
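Since the naïve no-change model is the benchmark against which every forecasting method in this study is ultimately judged, a minimal sketch of such a benchmark is given below. The series values are invented for illustration, and MSE is used as the error measure because it is the measure reported throughout this work.

```python
import numpy as np

def naive_forecast(train, horizon):
    # Naive no-change forecast: repeat the last observed value over the horizon.
    return np.full(horizon, train[-1])

def mse(actual, predicted):
    return float(np.mean((np.asarray(actual) - np.asarray(predicted)) ** 2))

# Illustrative series split into training and test portions.
series = np.array([10.1, 10.3, 10.2, 10.6, 10.5, 10.9, 11.0, 11.2])
train, test = series[:6], series[6:]

forecast = naive_forecast(train, horizon=len(test))
print(mse(test, forecast))  # benchmark any candidate model against this value
```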

2.3.3.1 Exponential Smoothing

Exponential smoothing is an approach that uses all historical values as predictors, giving more weight to more recent values, as in Equation 2.23 for Simple Exponential Smoothing (SES). The equation shows that the forecast for time t + 1 is a weighted average between the most recent observation $x_t$ and the forecast value for time t. Recursively substituting $\hat{x}_t$ yields Equation 2.24.

$$\hat{x}_{t+1} = \alpha x_t + (1 - \alpha)\hat{x}_t \qquad (2.23)$$

$$\hat{x}_{t+1} = \alpha x_t + \alpha(1 - \alpha)x_{t-1} + \alpha(1 - \alpha)^2 x_{t-2} + \ldots \qquad (2.24)$$

As long as the smoothing parameter α is between 0 and 1, the weight given to each observation decreases exponentially as the observation comes from further in the past, hence the name exponential smoothing. The Holt-Winters procedure, given in the additive form by Equations 2.25 through 2.28, generalizes simple exponential smoothing, allowing for a trend (b) and a seasonal (s) term. In the equations, m is the period of seasonality, and α, β and γ are smoothing parameters. Selecting the best values for the parameters in a Holt-Winters model is a non-linear minimization problem, and the task requires an optimization tool (METCALFE; COWPERTWAIT, 2009; HYNDMAN; ATHANASOPOULOS, 2014).

$$\hat{x}_{t+h} = l_t + h b_t + s_{t-m+h^+_m} \qquad (2.25)$$

$$l_t = \alpha(x_t - s_{t-m}) + (1 - \alpha)(l_{t-1} + b_{t-1}) \qquad (2.26)$$

$$b_t = \beta(l_t - l_{t-1}) + (1 - \beta) b_{t-1} \qquad (2.27)$$

$$s_t = \gamma(x_t - l_{t-1} - b_{t-1}) + (1 - \gamma) s_{t-m} \qquad (2.28)$$
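In practice these recursions are seldom coded by hand. The sketch below fits an additive Holt-Winters model with statsmodels on a synthetic seasonal series; the data, the seasonal period and the library choice are assumptions made for illustration, not the configuration used in this study. The AIC exposed by the fitted result is the kind of criterion used by the automatic selection procedure described in Section 2.3.3.2.

```python
import numpy as np
from statsmodels.tsa.holtwinters import ExponentialSmoothing

rng = np.random.default_rng(5)
# Synthetic series with a linear trend and a period-12 seasonal pattern.
t = np.arange(48)
y = 50 + 0.5 * t + 5 * np.sin(2 * np.pi * t / 12) + rng.normal(scale=1.0, size=48)

# Additive Holt-Winters model (Equations 2.25 to 2.28); parameters are optimized by fit().
model = ExponentialSmoothing(y, trend="add", seasonal="add", seasonal_periods=12)
result = model.fit()

print(result.params)       # fitted smoothing parameters and initial states
print(result.forecast(6))  # point forecasts for the next 6 periods
print(result.aic)          # AIC, usable for comparing candidate specifications
```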

There are several exponential smoothing methods other than Simple Exponential Smoothing and the Holt-Winters procedure. Hyndman and Khandakar (2007) identify a total of fifteen exponential smoothing methods, with five possibilities for the trend component (None, Additive, Additive damped, Multiplicative and Multiplicative damped) and three possibilities for the seasonal component (None, Additive and Multiplicative), as shown in Table 1. Hyndman and Khandakar (2007) also provide the formulae for each of the methods.

2.3.3.2 Automatic Forecasting With Exponential Smoothing

Automatic forecasting of large sets of univariate time series is a common need in a variety of business settings. Hyndman et al. (2002) describe a methodology for automatic forecasting using ES methods based on a state space framework.


Table 1 – Taxonomy of exponential smoothing methods.

Source: Hyndman and Khandakar (2007)

An ES method is an algorithm capable of generating point forecasts only. For each ES method, it is possible to obtain two statistical state space models (ETS models, for Error, Trend and Seasonality) that underlie the ES method. One model has additive errors and the other has multiplicative errors. The state space model yields the same point forecasts as the ES method, but also provides a framework for computing prediction intervals and other properties. In what concerns the point forecasts, the distinction between additive and multiplicative errors is irrelevant. Each model consists of a measurement equation that describes the observed data, and some state equations that describe how the unobserved components (the level, trend and seasonal components; these are the states) change over time (HYNDMAN et al., 2002; HYNDMAN; KHANDAKAR, 2007; HYNDMAN; ATHANASOPOULOS, 2014).

There are 30 possible ETS models, two for each of the fifteen ES variations. The general model uses a state vector x_t = (l_t, b_t, s_t, s_{t-1}, \ldots, s_{t-m+1}), a measurement equation as in Equation 2.29 and a state equation as in Equation 2.30.

y_t = w(x_{t-1}) + r(x_{t-1})\varepsilon_t \qquad (2.29)

x_t = f(x_{t-1}) + g(x_{t-1})\varepsilon_t \qquad (2.30)

These models are called `innovations', or `single source of error', models because only one source of error \varepsilon_t appears in all of the model's equations. This makes it easy to compute maximum likelihood estimates for the model parameters and the initial states x_0, even if the model is nonlinear. Since ETS models may be nonlinear, time series that exhibit nonlinear characteristics may be better modelled using exponential smoothing ETS models (HYNDMAN et al., 2002; HYNDMAN; KHANDAKAR, 2007; HYNDMAN; ATHANASOPOULOS, 2014).
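Because every fitted ETS model reports a likelihood-based information criterion, the choice of model can be automated, as described by the selection rule below. A minimal sketch of the idea, assuming statsmodels' ETSModel (the import path and options may vary with the library version, and the candidate set and synthetic series are illustrative only), fits a few variants and keeps the one with the lowest AIC:

    import numpy as np
    import pandas as pd
    from statsmodels.tsa.exponential_smoothing.ets import ETSModel

    y = pd.Series(np.cumsum(np.random.normal(0.1, 1, 150)))  # synthetic trending series

    candidates = [
        {"error": "add", "trend": None},
        {"error": "add", "trend": "add"},
        {"error": "add", "trend": "add", "damped_trend": True},
    ]
    results = [ETSModel(y, **spec).fit(disp=False) for spec in candidates]
    best = min(results, key=lambda r: r.aic)  # model selection by AIC
    print(best.summary())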


The resulting automatic forecasting procedure is as follows: for each series, apply all models that are appropriate; obtain maximum likelihood estimates of the model parameters (both smoothing parameters and initial states); select the best of the models according to the AIC (HYNDMAN; KHANDAKAR, 2007).

2.3.3.3 ARIMA Models

An Auto Regressive (AR) model considers that the current value of a series, x_t, can be explained as a linear function of a number p of past values of itself, x_{t-1}, x_{t-2}, \ldots, x_{t-p}, hence auto regressive, as in Equation 2.31, where p is the order of the autoregressive model.

x_t = \phi_1 x_{t-1} + \phi_2 x_{t-2} + \ldots + \phi_p x_{t-p} + w_t \qquad (2.31)

The Moving Average (MA) model considers that the observed value of a time series can be modeled as a regression on past forecast errors. The lagged error terms are not actually observable, so maximum likelihood is commonly used to estimate the parameters of an MA model recursively (HYNDMAN; ATHANASOPOULOS, 2014; TSAY, 2005). The moving average model of order q is given by Equation 2.32.

x_t = w_t + \theta_1 w_{t-1} + \theta_2 w_{t-2} + \ldots + \theta_q w_{t-q} \qquad (2.32)

A more general model is the Auto Regressive Moving Average (ARMA) model, as in Equation 2.33. ARMA is a general model that reduces to a purely auto regressive or a purely moving average model. An ARMA(p, q) model has auto regressive order p and moving average order q. AR and MA models are usually restricted to stationary data: the properties of a stationary time series, such as the mean and variance, do not depend on the time at which the series is observed (SHUMWAY; STOFFER, 2017; HYNDMAN; ATHANASOPOULOS, 2014).

x_t = \phi_1 x_{t-1} + \phi_2 x_{t-2} + \ldots + \phi_p x_{t-p} + w_t + \theta_1 w_{t-1} + \theta_2 w_{t-2} + \ldots + \theta_q w_{t-q} \qquad (2.33)

Auto Regressive Integrated Moving Average (ARIMA) models add differencing to ARMA models. Differencing is a way to make time series stationary on trend by computing the differences between consecutive observations. A first order differenced time series is given by Equation 2.34. The addition of differencing to ARMA allows for non-stationary data. In an ARIMA(p, d, q), differencing of order d is applied to the data to make it at least approximately stationary. Using backshift notation (the backshift operator B shifts the data back one period as in Equation 2.35), an ARIMA model can be written as shown in Equation 2.36 (SHUMWAY; STOFFER, 2017; HYNDMAN; ATHANASOPOULOS, 2014).

x'_t = x_t - x_{t-1} \qquad (2.34)

B x_t = x_{t-1} \qquad (2.35)

(1 - \phi_1 B - \ldots - \phi_p B^p)(1 - B)^d x_t = c + (1 + \theta_1 B + \ldots + \theta_q B^q) w_t \qquad (2.36)

The Seasonal ARIMA(p, d, q)(P, D, Q)_m model is given by Equation 2.37, where \Phi(z) and \Theta(z) are polynomials of orders P and Q respectively. The (1 - B^m)^D term in the seasonal ARIMA equation accounts for the seasonal differencing of the time series; D is the order of the seasonal differencing and d is the order of the regular differencing (HYNDMAN; KHANDAKAR, 2007).

\Phi(B^m)\phi(B)(1 - B^m)^D (1 - B)^d y_t = c + \Theta(B^m)\theta(B) w_t \qquad (2.37)
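As an illustrative sketch (assuming statsmodels' SARIMAX implementation; the orders, seasonal period and synthetic random-walk series are placeholders rather than choices made in this study), a seasonal ARIMA model can be fitted by maximum likelihood and used to forecast as follows:

    import numpy as np
    from statsmodels.tsa.statespace.sarimax import SARIMAX

    y = np.cumsum(np.random.normal(0, 1, 200))  # synthetic random walk, so d = 1 is plausible

    # ARIMA(1, 1, 1) with a seasonal (1, 0, 1) component of period m = 12.
    result = SARIMAX(y, order=(1, 1, 1), seasonal_order=(1, 0, 1, 12)).fit(disp=False)
    print(result.aic)                 # information criterion, used later for order selection
    print(result.forecast(steps=10))  # out-of-sample point forecasts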

Unlike exponential smoothing methods, the ARIMA class of models assumes homoscedasticity (stationarity on variance). Consequently, transformations to the original data are sometimes necessary (HYNDMAN; KHANDAKAR, 2007). Many statistical techniques rely on assumptions about the sampled population that are not always satisfied. In situations where the assumptions are violated, one option is to ignore the violation and proceed with the analysis as if all assumptions were satisfied. Another is to apply a transformation to the data. The Box-Cox transform is a parametric power transformation that aims to reduce anomalies such as non-additivity, non-normality and heteroscedasticity. The transformed data w_t is given by Equation 2.38, where \lambda is the transformation parameter (SAKIA, 1992).

w_t = \begin{cases} \log(y_t), & \text{if } \lambda = 0 \\ (y_t^{\lambda} - 1)/\lambda, & \text{if } \lambda \neq 0 \end{cases} \qquad (2.38)
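A small sketch of the transformation, assuming SciPy (the synthetic positive-valued series is illustrative; when \lambda is not supplied, the library estimates it by maximum likelihood):

    import numpy as np
    from scipy import stats

    y = np.random.lognormal(mean=0.0, sigma=0.5, size=200)  # Box-Cox requires strictly positive data
    w, lam = stats.boxcox(y)  # transformed series and estimated lambda
    print(lam)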

2.3.3.4 Automatic Forecasting With ARIMA Models

A difficulty in the application of ARIMA models is the process of order selection. There have been several attempts at automating ARIMA modelling. This study applies the automatic procedure suggested by Hyndman and Khandakar (2007).

The task of automatic order selection consists of selecting values for p, q, P, Q, D and d. With d and D known, the other orders p, q, P and Q are selected via an information criterion. The Canova-Hansen test, which tests for deterministic seasonality, was originally used to select D; there has since been a switch in favor of the OCSB test for selecting D, resulting in better forecasts (HYNDMAN, 2011). After D is known, the selection of d occurs through successive KPSS unit-root tests: if the test is significant, the data is differenced and the differenced data goes through a new test. The procedure stops when the tests yield the first insignificant result.
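A minimal sketch of this d-selection step, assuming statsmodels' KPSS test (the significance level, the cap on d and the synthetic series are placeholders), differences the series until the test stops rejecting stationarity:

    import numpy as np
    from statsmodels.tsa.stattools import kpss

    def select_d(y, alpha=0.05, max_d=2):
        # Difference the series until the KPSS test no longer rejects stationarity.
        d = 0
        while d < max_d:
            stat, p_value, _, _ = kpss(y, regression="c", nlags="auto")
            if p_value > alpha:   # first insignificant result: stop differencing
                break
            y = np.diff(y)
            d += 1
        return d

    y = np.cumsum(np.random.normal(0, 1, 300))  # random walk: one difference should suffice
    print(select_d(y))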

Even after the selection of d and D, the number of potential models is large, and it is not feasible to fit every model and then choose the one with the smallest AIC. Hyndman and Khandakar (2007) suggest a way to traverse the space of models efficiently in order to arrive at a model with low AIC.

The procedure starts with four possible models. If d + D ≤ 1, the models are fitted with a constant (c ≠ 0); otherwise, the constant is set to zero (c = 0):

• ARIMA(2, d, 2) if m = 1 and ARIMA(2, d, 2)(1, D, 1) if m > 1.
• ARIMA(0, d, 0) if m = 1 and ARIMA(0, d, 0)(0, D, 0) if m > 1.
• ARIMA(1, d, 0) if m = 1 and ARIMA(1, d, 0)(1, D, 0) if m > 1.
• ARIMA(0, d, 1) if m = 1 and ARIMA(0, d, 1)(0, D, 1) if m > 1.

Among the four starting models, the one with the smallest AIC becomes the current model. The procedure then tests up to thirteen variations on the current model:

• where one of p, q, P and Q is allowed to vary by ±1 from the current model;
• where p and q both vary by ±1 from the current model;

• where P and Q both vary by ±1 from the current model;

• where the constant c is included if the current model has c = 0, or excluded if the current model has c ≠ 0.

Whenever a model with lower AIC is found, it becomes the new current model and the procedure is repeated. The procedure stops when it is not possible to find a model with lower AIC than the current model.
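In Python, the pmdarima package provides auto_arima, which follows a comparable stepwise, AIC-guided search (a sketch assuming pmdarima; the synthetic series and m = 12 are illustrative only):

    import numpy as np
    import pmdarima as pm

    y = np.cumsum(np.random.normal(0, 1, 300))  # synthetic non-stationary series

    # stepwise=True applies a Hyndman-Khandakar style search guided by the AIC.
    model = pm.auto_arima(y, seasonal=True, m=12, stepwise=True,
                          information_criterion="aic", suppress_warnings=True)
    print(model.order, model.seasonal_order)  # selected (p, d, q) and (P, D, Q, m)
    print(model.predict(n_periods=10))        # out-of-sample forecasts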

2.3.3.5 Model Selection and Evaluation

The model fitting process estimates parameters that maximize a measure of goodness of fit. The goodness of fit measures how well a model fits a data set, which includes both the regularity and the noise in the data. During the fitting of a model to a data set, a perfect fit can always be achieved by using a sufficiently complex model. Such a complex model overfits the data, meaning it captures both the regularity and the noise in the data. A good fit after model training does not imply a model with good forecasting capabilities (MAKRIDAKIS; WHEELWRIGHT; HYNDMAN, 2008; MYUNG; PITT, 2004). Measuring the forecast accuracy of the model out of the training sample avoids the possibility of being deceived by overfitting. Forecasts are made on a test set (or holdout set) that was not used during the model fitting phase. Since the model is exposed to the test set for the first time, it cannot be overfitted to the test data. The model accuracy is measured on the test data only (MAKRIDAKIS; WHEELWRIGHT; HYNDMAN, 2008).

Generalization refers to a model's ability to fit the regularity in the data and its prediction capability on independent test data. Estimating a model's prediction error on a test set estimates its generalization error. A good approach for model assessment is dividing the available data into three parts: training set, validation set and test set. The uses of these sets are, according to Friedman, Hastie and Tibshirani (2001): the training set is used to fit the models; the validation set is used to estimate prediction error for model selection; the test set is used for assessment of the generalization error of the final chosen model. The three-set approach reduces the amount of data available for model fitting.
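A minimal sketch of such a chronological split (the 60/20/20 proportions and names are arbitrary placeholders):

    import numpy as np

    def chronological_split(series, train_frac=0.6, val_frac=0.2):
        # Time series must be split in temporal order (never shuffled), so that
        # validation and test observations lie strictly in the future of the training data.
        n = len(series)
        train_end = int(n * train_frac)
        val_end = int(n * (train_frac + val_frac))
        return series[:train_end], series[train_end:val_end], series[val_end:]

    y = np.arange(100)
    train, val, test = chronological_split(y)
    print(len(train), len(val), len(test))  # 60 20 20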

There are methods that allow ignoring the validation step, dividing the available data into a training set and a test set and using an analytical measure of generalizability on the training set for model selection. The measure of generalizability balances model goodness of fit and model complexity. The Akaike information criterion (AIC) is a measure of generalizability that uses the log-likelihood as a measure of goodness of fit and the number of model parameters as the only relevant measure of model complexity. The AIC is calculated as in Equation 2.39, where \ln L(w^*) is the natural logarithm of the model's maximized likelihood and k is the number of model parameters. A useful approximation for the AIC based on the variance of the residuals is given by Equation 2.40, where \sigma^2 is the variance of the residuals and n is the number of observations in the time series (FRIEDMAN; HASTIE; TIBSHIRANI, 2001; MYUNG; PITT, 2004; MAKRIDAKIS; WHEELWRIGHT; HYNDMAN, 2008).

AIC = -2 \ln L(w^*) + 2k \qquad (2.39)

AIC \approx n(1 + \log(2\pi)) + n \log \sigma^2 + 2k \qquad (2.40)

It is important to evaluate forecasts properly:

In evaluating forecasting methods, researchers should follow accepted scientific procedures. Interestingly, much published research on forecasting ignores formal evaluation procedures and simply presents possible but untested approaches. In many cases, evaluation is done, but it tends to be narrow, with the intent of advocating a particular method. (ARMSTRONG, 2001, p. 441)

Empirical evaluations of forecasts produced by models based on computational intelligence have not followed the best practices derived from empirical evaluations of forecasts made by statistical methods, such as comparison with established statistical benchmarks, forecasts made on real out-of-sample data and the use of robust error metrics (CRONE; HIBON; NIKOLOPOULOS, 2011).

The forecasts from a given method must be compared with the forecasts from a naïve method. If the forecasting method under evaluation cannot show better results than a simple naïve method, it is of no use.

The main error measure used in this study is the Mean Squared Error (MSE) (MA et al., 2015; JIANG; SONG, 2011; KHASHEI; BIJARI, 2010). MSE is a stable error measure for the dataset used in this research. Other measures such as MAPE and Theil's U-statistic would be unstable because the time series have mean zero. When the time series values are very close to zero, computations involving percentage errors can be meaningless, because these error measures involve divisions by values of the time series (MAKRIDAKIS; WHEELWRIGHT; HYNDMAN, 2008). MSE is given by Equation 2.41, where k is the number of points in the forecast.

MSE = \frac{\sum_{t=1}^{k} (\hat{y}_t - y_t)^2}{k} \qquad (2.41)
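A direct implementation of Equation 2.41 (a small helper sketch; the function name and the toy inputs are illustrative):

    import numpy as np

    def mse(forecast, actual):
        # Mean Squared Error over the k points of the forecast horizon (Equation 2.41).
        forecast, actual = np.asarray(forecast), np.asarray(actual)
        return np.mean((forecast - actual) ** 2)

    print(mse([1.0, 2.0, 3.0], [1.5, 2.0, 2.5]))  # -> 0.1666...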

2.3.4 Time Series Forecasting With Neural Networks

Khashei and Bijari (2010) introduced an approach based on using an ARIMA model to extract features from a time series. The features are then used to train a single hidden layer feedforward network.

Kourentzes, Barrow and Crone (2014) explored different ways to combine the predictions of each neural network in an ensemble for time series forecasting. A novel mode operator is compared to the more established mean and median operators.

Ma et al. (2015) used LSTM neural networks to predict traffic speed. The LSTM network's performance was compared to other architectures of recurrent neural networks, support vector machines, ARIMA and Kalman filter models. In most situations, the LSTM design showed superior performance.

Wang, Zeng and Chen (2015) utilized Adaptive Differential Evolution (ADE) to initialize the weights of a neural network used for time series forecasting to a point believed to be close to the global optimum. The network is then further trained with backpropagation.

Claveria and Torra (2014) compared a single hidden layer neural network, with one input and three nodes in the hidden layer, to ARIMA and self-exciting threshold auto regression (SETAR) models for tourism demand forecasting. The work showed that ARIMA models have superior performance in out-of-sample forecasts in most scenarios.

Khatibi et al. (2011) compared single hidden layer artificial neural networks to ANFIS and genetic programming in the task of forecasting river flow. The research showed better performance by genetic programming.

Jiang and Song (2011) used a nonlinear autoregressive exogenous neural network, a model that includes feedback of the network output, to predict a time series of sunspots. The model showed better precision than ARIMA and a regular feedforward network.

2.4 Prognostics

Modern systems generate large amounts of data on system health and on operational and environmental conditions. It is common to use this data to monitor system health, which helps prevent costly in-service system failures that could lead to loss of property or lives (MEEKER; HONG, 2014). CBM can be defined as:
