Forecasting inflation in Portugal by using machine learning techniques

Academic year: 2021
Dedication

To my grandfather, the greatest human being that ever lived, who passed away while I was writing this thesis and who I miss every day.


Acknowledgments

I would like to express my deep gratitude to Professor Diana Mendes for always believing in me and guiding me throughout all this process. For all her patience, advice and support, my most sincere thank you.

I wish to thank my parents, my brothers and my sisters for their support and encouragement in both my academic and personal life. I would like to extend these thanks to the rest of my family, who always gave me motivation and assistance when necessary. I am grateful to my cousin, who helped me choose the subject of this project.

My friend Carlos Máximo helped me solve problems in times of despair and gave me encouragement when I needed it. For this, I am extremely grateful.

I would like to emphasize the help I received from my friend Ana Beatriz Lopes and thank her for her patience.

I would like to thank my friends, Raquel Sousa, Jorge Becho, Mariana Cancelinha and Inês Marques, who have always been by my side and helped me along the way.

Finally, I would like to express my gratitude to everyone else that I did not specify in these acknowledgments but that somehow helped me in the last few years and gave me motivation.


Abstract

The main objective of this thesis is a comparison between the performance of classic econometric models, such as ARMA/ARIMA and ARCH/GARCH models, and Machine Learning algorithms, namely Artificial Neural Networks, in forecasting Portugal’s inflation rate. Nowadays, Machine Learning algorithms are more and more applied to the econometric forecasting of financial indexes and, therefore, we considered it relevant to include them in this project.

At an early stage, basic statistical concepts are explained, as well as the methods and models chosen. ARMA/ARIMA models produced some satisfactory predictions. However, all models presented high levels of kurtosis, an indicator of heteroskedasticity, which leads us to conclude these models are not the most appropriate, since the variance is not constant over time.

ARCH/GARCH models’ performance in forecasting Portuguese inflation was not effective. Even though we solved the heteroskedasticity problem and obtained very convincing results for the residuals series, the predicted values were less accurate when compared to the previous models.

Finally, hundreds of different Machine Learning algorithms were tested. As mentioned before, the main focus of this dissertation was the use of Artificial Neural Networks. There are several adjustable parameters and infinitely many possible combinations. However, we found two models that seem to fit our time series very well. This thesis concludes that, out of all models tested, Artificial Neural Networks produce the most accurate forecast of the Portuguese inflation rate.

Keywords: Inflation Rate, Time Series, ARMA/ARIMA, ARCH/GARCH, Machine Learning, Artificial Neural Networks


Resumo

The main objective of this thesis is a comparison between the performance of classic econometric models, such as the ARMA/ARIMA (AutoRegressive Moving Average/AutoRegressive Integrated Moving Average) and ARCH/GARCH (AutoRegressive Conditional Heteroskedasticity/Generalized AutoRegressive Conditional Heteroskedasticity) models, and Machine Learning algorithms, namely Artificial Neural Networks, in forecasting the inflation rate in Portugal. Nowadays, Machine Learning algorithms have been increasingly applied to the econometric forecasting of financial indexes and, therefore, we considered their use relevant to this project. In a first stage, all the basic statistical concepts addressed in this work are explained, such as the definitions of stochastic process, White Noise process, time series, stationarity, skewness and kurtosis, unit root/stationarity tests, correlograms and heteroskedasticity. The models used in this dissertation, which include the AR(p), MA(q), ARMA(p,q) and ARCH/GARCH models, are also briefly explained, and the autocorrelation (ACF) and partial autocorrelation (PACF) functions are made explicit. We also considered it relevant to define the formulas of the forecast quality evaluation criteria used, such as the Mean Square Error and the Theil Inequality Coefficient and its respective proportions. In a second stage, we introduce the topic of Artificial Intelligence, more precisely Machine Learning algorithms. After a brief historical overview, we compare Artificial Neural Networks with Biological Neural Networks. We also include several activation functions frequently used in these algorithms.

After this theoretical contextualization, we move on to the analysis of the results obtained. We start by analyzing the descriptive statistics of the time series under study and verify that the series is asymmetric and leptokurtic. After running the Augmented Dickey-Fuller (ADF) and Phillips-Perron (PP) unit root tests and the Kwiatkowski-Phillips-Schmidt-Shin (KPSS) stationarity test, we note that the time series corresponding to the Portuguese inflation rate is not stationary. To apply the first models, we therefore had to difference the series once to obtain the desired stationarity. The first models applied were the ARMA/ARIMA models. First, we used the correlogram to determine which model orders should be tested. Next, we excluded all models that presented some statistically null coefficient. The resulting models were compared using well-known criteria, namely the Akaike Information Criterion (AIC) and the Schwarz Information Criterion (SIC). However, as the values were all similar, it was not possible to draw any conclusion. Subsequently, we analyzed the residuals series through descriptive statistics, histograms and serial correlation tests. We then used the validated models to forecast the Portuguese inflation rate for May 2019, comparing this value with the inflation rate registered in that month. Finally, we analyzed the forecast quality using pre-established criteria, such as the Mean Square Error and the Theil Inequality Coefficient. The ARMA/ARIMA models produced some satisfactory forecasts. However, all models presented high levels of kurtosis, a clear indicator of heteroskedasticity, which leads to the conclusion that they are not the most appropriate to use, since the variance is not constant over time.

Consequently, we applied the ARCH/GARCH models. Analogously to the previously tested models, we started by excluding all models that presented some statistically irrelevant coefficient. We analyzed the graphs and concluded that some volatility still existed. Next, we examined the residuals series of the validated models. After checking the heteroskedasticity of the residuals using the ARCH and Breusch-Pagan-Godfrey tests, we concluded that these models presented constant variance over time. Again, we used the models to forecast the Portuguese inflation rate for May 2019. Finally, we explored the forecast quality using some of the same criteria mentioned before. We concluded that the performance of the ARCH/GARCH models in forecasting inflation was not effective. Even though we solved the heteroskedasticity problem and obtained quite convincing values with respect to the residuals series, the forecast obtained was less accurate when compared with the previous models.

Finally, hundreds of Machine Learning algorithms were tested. As mentioned before, the focus of this dissertation was the use of Artificial Neural Networks. There are several adjustable parameters and the possible combinations are infinite; consequently, of all the models tested, only some appear in this thesis. In this project, the modified parameters were the number of hidden layers, the number of nodes in each hidden layer, the chosen activation function, the maximum number of epochs and the Training/Test Set partition. The comparison between models was again made using the Mean Square Error, the best models being those with a value closest to zero. Two models were found that seem to fit the time series in question very satisfactorily. The two models presenting the best results share several parameters: the chosen activation function was the hyperbolic tangent, and a single hidden layer with three nodes was used. The only differing parameter was the chosen number of epochs, which varied between two hundred and three hundred. It was also verified that the algorithm acts effectively, learning all the information quite quickly, since in both cases there was an early stop. We then observed the graphs of the computed models and verified that the graphs corresponding to the two referred models fit the original Portuguese inflation rate series almost perfectly.

This thesis concludes that, of all the models tested, Artificial Neural Networks produce the most reliable forecast of the Portuguese inflation rate and that they present advantages over the classic econometric models used. The continuation of the study of the application of Machine Learning algorithms to financial indexes is therefore considered pertinent.

Keywords: Inflation Rate, Time Series, ARMA/ARIMA, ARCH/GARCH, Machine Learning, Artificial Neural Networks


Index

Introduction
Chapter 1. Statistical Concepts and Classic Econometric Models
1.1. Random Variable
1.2. Covariance
1.3. Stochastic Process
1.4. White Noise Process
1.5. Time Series
1.6. Stationarity
1.7. Skewness and Kurtosis
1.8. Unit Root/Stationary Tests
1.8.1. Dickey-Fuller Unit Root Test
1.8.2. Augmented Dickey-Fuller Unit Root Test
1.8.3. Phillips-Perron Unit Root Test
1.8.4. Kwiatkowski-Phillips-Schmidt-Shin Stationarity Test
1.9. ARMA Models
1.9.1. AutoRegressive Models
1.9.2. Moving Average Models
1.9.3. AutoRegressive Moving Average Models
1.10. Correlogram
1.11. Homoskedasticity/Heteroskedasticity
1.12. ARCH/GARCH Models
1.13. Residuals Tests
1.14. Forecasting Evaluation Methods
1.14.1. Forecast Error
1.14.2. Mean Forecast Error or Bias
1.14.3. Mean Absolute Deviation
1.14.4. Mean Square Error
1.14.5. Mean Absolute Percentage Error
1.14.6. Theil Inequality Coefficient
Chapter 2. Machine Learning
2.1. A Brief History of ANNs
2.2. The Artificial Neuron
2.3. Artificial Neural Networks
2.4. The Multilayer Perceptron Neural Network
2.5. Activation Functions
2.6. Forecasting with ANNs
Chapter 3. Results & Analysis
3.1. Descriptive Statistics and Unit Root/Stationary Tests
3.2. ARMA/ARIMA Models’ Results
3.3. ARCH/GARCH Models’ Results
3.4. ANNs Models’ Results
Chapter 4. Conclusions/Discussion
Bibliography/References
Appendix
A. ARMA/ARIMA Models
A1. Outputs
A2. Residuals Graph of Validated ARMA/ARIMA Models
A3. Residuals Histogram of Validated ARMA/ARIMA Models
B. ARCH/GARCH Models
B1. Outputs
B2. Conditional Standard Deviation Graphs for Validated ARCH/GARCH Models
B3. Residuals’ Histogram for Validated ARCH/GARCH Models
C. ANNs Models
C1. Actual Series versus Forecast Series graphs ([-1,1] scale) and Forecast Series graph (real values)


Introduction

The main objective of this thesis is to use an Artificial Neural Network (ANN) to accurately forecast the behavior of the inflation rate in Portugal, using monthly data from January of 1949 until May of 2019, and to compare the results to the ones obtained with classical econometric models.

Inflation is the sustained rise in the general level of prices of goods and services in a certain economy over an extended period of time. It translates into a reduction of the purchasing power of the population: as prices increase, one unit of currency buys a smaller quantity of goods and services.

Inflation is measured as an annual percentage change and there are several ways to evaluate it. The most common is the consumer price index (CPI), where inflation is measured by the change in the prices of a basket of goods and services heavily purchased by specific subgroups of the population. There are many causes of inflation, which can be divided into two subgroups: Demand-Pull Inflation and Cost-Push Inflation. The first occurs when inflation is caused by an increase in aggregate demand that outweighs aggregate supply, while the second happens when an increase in the cost of wages and raw materials results in a decrease in the aggregate supply of goods and services. The most common causes of Demand-Pull Inflation are rapid economic growth, government spending, cuts in interest rates, increased money supply, inflation expectations and technological innovation. Cost-Push Inflation can be the result of higher wages, the existence of a monopoly, natural disasters, regulation and taxation applied by governments, changes in exchange rates and inflation expectations.

Previous studies have shown that, prior to the adoption of the euro, Portuguese inflation was explained essentially by foreign inflation. First, it is important to give some background on the historical evolution of the inflation rate in Portugal. As of 1950, prices were following a slight upward trend that continued throughout the 1960s.

Between 1970 and 1973, the inflation rate doubled in Portugal, due to several internal and external factors, such as the first oil crisis in 1973 and the dictatorial political regime that ended with the coup d'état of 1974. After this, as a result of the extreme climate of political and social instability, the inflation rate almost tripled. Some factors that contributed to this were the return of many Portuguese citizens from the ex-colonies in Africa and the nationalization of a substantial part of the productive sector. This pattern did not persist in the following years, which can be explained by the external aid received in this period and the end of military spending on the colonial war. Still, the inflation rate remained very high as a result of the instability and misadjusted macroeconomic policies. In 1985, prices started going down due to new external aid and, the following year, Portugal joined the European Economic Community (EEC) with the second highest inflation rate of the twelve member states, surpassed only by Greece. The main goal of the EEC's monetary and exchange rate policies was to decrease inflation rates and, for that purpose, a series of reform programs was implemented. This situation provided the opportunity to attract foreign direct investment.

Around 1990, there was a change in monetary policy, based on exchange rate stability. All these events led to a disinflation phenomenon between the mid-1980s and the late 1990s. Then, in 1999, Portugal joined the euro and, even though in the first years after adopting the single currency the Portuguese inflation rate accelerated, with values exceeding 5%, the historical tendency shows the euro created some stability in the rise of prices in Portugal. The Eurosystem, which is composed of the European Central Bank (ECB) and the national central banks (NCBs) of the euro area, is the entity responsible for implementing monetary policy across the euro area. Its primary goal is to protect the purchasing power of the common currency by maintaining price stability. This methodology has been successful in preventing big changes in Portuguese inflation values over short periods of time.

There are four chapters in this thesis. The first contains some basic statistical concepts and classic econometric models and displays the methods and tests chosen to verify each model; the second introduces and explains ANNs and the methods used; the third showcases and explains the results obtained; and the fourth presents the conclusions of this work. An appendix section was included afterwards, containing several models’ outputs and graphs. Throughout this thesis, the software used was EViews, and the Artificial Neural Networks were programmed using Python.


Chapter 1. Statistical Concepts and Classic Econometric Models

1.1. Random Variable

A random variable is a variable whose values are unknown or a measurable function that assigns values to the outcomes of a random phenomenon. Random variables are usually designated by a capital letter and can be classified as discrete or continuous. Discrete random variables have specific countable values while continuous random variables can take any value within a continuous range or interval. Random variables are often used in econometric analysis.

1.2. Covariance

Covariance of two random variables, X and Y, measures the total joint variation of the variables; it can assume any real value and depends on the unit of measure chosen. Positive covariance indicates the two variables tend to move in the same direction, while negative covariance indicates X and Y tend to move in opposite directions. Mathematically,

$$Cov(X, Y) = E[(X - \mu_X)(Y - \mu_Y)] \qquad (Eq.\ 1.2.1)$$

The stability of covariance over time can be quantified by the autocovariance function, defined as

$$\gamma(t, s) = Cov(y_t, y_{t-s}) = E[(y_t - \mu_t)(y_{t-s} - \mu_{t-s})] \qquad (Eq.\ 1.2.2)$$

Standardized covariance is called correlation. Correlation measures the strength of the relationship between the two variables and takes values in the range $[-1, 1]$. Unlike covariance, correlation is not affected by the unit of measure.

Pearson's correlation coefficient of two random variables, X and Y, is defined as

$$\rho_{X,Y} = \frac{Cov(X, Y)}{\sqrt{Var(X)\,Var(Y)}} \qquad (Eq.\ 1.2.3)$$

where $Var(X)$ and $Var(Y)$ are the variances of the variables $X$ and $Y$, respectively.

In this way, autocorrelation measures the extent of a linear relationship between lagged values of a variable. The Autocorrelation Function (ACF) for a time series $y_t$ is defined as

$$\rho(t, s) = \frac{Cov(y_t, y_{t-s})}{\sqrt{Var(y_t)\,Var(y_{t-s})}} = \frac{\gamma(t, s)}{\sqrt{\gamma(0)\,\gamma(0)}} = \frac{\gamma(t, s)}{\gamma(0)} \qquad (Eq.\ 1.2.4)$$

where $\gamma(t, s)$ is the covariance at time instants $t$ and $s$ and $\gamma(0)$ is the variance.

1.3. Stochastic Process

Let $(\Omega, \mathcal{F}, P)$ be a probability space and $(E, \mathcal{E})$ a measurable space. Let $T \neq \emptyset$ be a set such that for each $t \in T$ we have a random variable $Y(t)$, a measurable function from $\Omega$ to $E$. The family of real random variables $\{Y(t),\ t \in T\}$ is called a stochastic process defined on the space $(\Omega, \mathcal{F}, P)$, taking values in $(E, \mathcal{E})$.

The linear representation of a stochastic process can be defined by

$$y_t = a_0 + a_1 y_{t-1} + a_2 y_{t-2} + \dots + a_n y_{t-n} + \varepsilon_t \qquad (Eq.\ 1.3.1)$$

where the $y_{t-p}$ are lagged values of $y_t$ and the $a_p$ are real parameters. The error term $\varepsilon_t$ is given by a white noise process, which will be explained in further detail next.


1.4. White Noise Process

A process $\{\varepsilon_t,\ t \in T\}$ is said to be White Noise with mean $0$ and variance $\sigma^2$, written $\{\varepsilon_t\} \sim WN(0, \sigma^2)$, if it meets three requirements:

(1) $E(\varepsilon_t) = 0$ (Eq. 1.4.1)

(2) $Var(\varepsilon_t) = \sigma^2 < \infty$ (Eq. 1.4.2)

(3) $E(\varepsilon_t \varepsilon_{t-s}) = Cov(\varepsilon_t, \varepsilon_s) = 0$ for all $s \neq t$ (Eq. 1.4.3)

that is, null mean, constant variance and independence.

1.5. Time Series

A time series is a set of observations taken at determined instants in time, at regular intervals. Time series' characteristic movements, frequently denominated components, can be divided into four major types:

· trend movements refer to the direction the series graph appears to take over a long time lag (trend lines can be obtained through linear regression);

· cyclical movements report the long-term oscillations or deviations from the trend lines; these cycles may or may not be periodic;

· seasonal movements refer to identical patterns during successive time intervals;

· random or irregular movements allude to sporadic chronological shifts.

Most common methods of time series analysis assume independence of events. A time series can be characterized by:

· independence (purely random series or white noise, meaning that knowing what happened at instant $t$ will not help us forecast instant $t + 1$);

· long memory (dependence effects fade slowly, meaning past values influence future values);

· short memory (dependence effects disappear quickly).

Let $\{y_t,\ t \in T\}$ be a time series with $E[y_t^2] < \infty$:

· the mean function at instant $t$ is defined by

$$\mu_t = E[y_t] \qquad (Eq.\ 1.5.1)$$

· the variance of the time series at instant $t$ is defined as

$$Var(y_t) = \sigma^2_{y_t} = E[(y_t - \mu_t)^2] \qquad (Eq.\ 1.5.2)$$

1.6. Stationarity

Stationarity is one of the most desired properties in a time series. Stationary processes are much easier to analyze and to predict. Statistical properties such as mean, variance and autocovariance do not change over time in a stationary process. Estimators' properties and quality also depend on whether series are stationary or non-stationary. In the first case, irregularities can be softened and forecasting for time instant $(t + 1)$ is as accurate as for time instant $(t + k)$. In stationary time series, shocks fade away as time passes; in a non-stationary time series, those effects are propagated throughout time. Usual tests cannot be applied to non-stationary series, because the assumptions of the classical linear regression model are violated.

A series is said to have $k$-th order stationarity, with $k$ a positive integer, if for any set of $k$ points $t_1, \dots, t_k \in T$ and for all $h$ such that $t_i + h \in T,\ i = 1, \dots, k$, the random vectors of length $k$, $(X(t_1), \dots, X(t_k))$ and $(X(t_1 + h), \dots, X(t_k + h))$, are identically distributed. The process is strongly stationary if it has $k$-th order stationarity for every positive integer $k$.

A time series is called weakly stationary if for all $t$ and $(t - s) \in T$ we have:

1) constant mean (the same mean at all time points):

$$E(y_t) = E(y_{t-s}) = \mu \qquad (Eq.\ 1.6.1)$$

2) constant finite variance:

$$E[(y_t - \mu)^2] = E[(y_{t-s} - \mu)^2] = \sigma_y^2 < \infty \qquad (Eq.\ 1.6.2)$$

3) constant covariance:

$$E[(y_t - \mu)(y_{t-s} - \mu)] = E[(y_{t-j} - \mu)(y_{t-j-s} - \mu)] = \gamma(s) \qquad (Eq.\ 1.6.3)$$

All weakly stationary time series/processes have Autocorrelation and Partial Autocorrelation Functions (ACF and PACF, respectively) converging to zero as time lags grow. These functions will be defined and explained in sections 1.9.1. and 1.9.2.. A series can be stabilized by taking successive differences or log differences. Original data can be transformed into simple returns or log returns, respectively defined as

$$R_t = \frac{y_t - y_{t-1}}{y_{t-1}} \times 100\% \qquad (Eq.\ 1.6.4)$$

$$\log R_t = \log\left(\frac{y_t}{y_{t-1}}\right) \times 100\% \qquad (Eq.\ 1.6.5)$$
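As an illustration, the two transformations above can be sketched in Python with pandas; here `y` is an assumed variable name holding the original series as a pandas Series, not code from the thesis:

```python
import numpy as np
import pandas as pd

def simple_returns(y: pd.Series) -> pd.Series:
    # (y_t - y_{t-1}) / y_{t-1} * 100%  (Eq. 1.6.4)
    return y.pct_change() * 100

def log_returns(y: pd.Series) -> pd.Series:
    # log(y_t / y_{t-1}) * 100%  (Eq. 1.6.5)
    return np.log(y / y.shift(1)) * 100
```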

1.7. Skewness and Kurtosis

Skewness and kurtosis measure the shape of a probability distribution. Both are used to make inferences about the normality of such distribution. Skewness measures the degree of asymmetry, with zero skewness indicating symmetry, positive skewness a relatively long right tail in comparison to the left and negative skewness illustrating the opposite case. Positive skewness is called right skewness and negative skewness is called left skewness.

The skewness of a variable $X$ is:

$$S_k = E\left\{\frac{X - E[X]}{\sigma}\right\}^3 = \frac{E\{X - E[X]\}^3}{\sigma^3} \qquad (Eq.\ 1.7.1)$$

Kurtosis indicates the extent to which probability is concentrated in the center and especially the tails of the distribution, rather than in the "shoulders", the regions between the center and the tails. If the coefficient of kurtosis is 3, we are in the presence of a mesokurtic distribution. If the coefficient is bigger than 3, it indicates a leptokurtic distribution, meaning the data has heavy tails or outliers. Otherwise, we have a platykurtic distribution, indicating the data has light tails or a lack of outliers.

The kurtosis of a variable $X$ is given by:

$$K = E\left\{\frac{X - E[X]}{\sigma}\right\}^4 = \frac{E\{X - E[X]\}^4}{\sigma^4} \qquad (Eq.\ 1.7.2)$$
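For illustration, sample skewness and kurtosis can be computed with scipy; the heavy-tailed example data below is an assumption for demonstration only:

```python
import numpy as np
from scipy import stats

x = np.random.default_rng(0).standard_t(df=5, size=1000)  # heavy-tailed sample
print("skewness:", stats.skew(x))
# fisher=False returns Pearson kurtosis, which equals 3 for a normal distribution
print("kurtosis:", stats.kurtosis(x, fisher=False))
```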

1.8. Unit Root/Stationarity Tests

Prior to time-series econometric analysis, one of the priorities is to study the stationarity of the variables. Though stationarity is a desirable property, most time series do not possess it. A non-stationary series suffers permanent effects from random shocks and its variances are time-dependent. There are several available tests to investigate the stationarity properties of a time series. Some of the most common are the Augmented Dickey-Fuller (ADF) and Phillips-Perron (PP) unit root tests. The most challenging feature of these tests is their lack of accuracy when a process is stationary but has a root close to the non-stationary boundary. Sometimes these tests' results can be contradictory. When that happens, we can use other tests, such as the Kwiatkowski-Phillips-Schmidt-Shin (KPSS) test, to draw conclusions about the stationarity of the series.

1.8.1. Dickey-Fuller (DF) Unit Root Test

The initial regression model

$$z_t = \beta_0 + \beta_1 t + \rho z_{t-1} + \mu_t \qquad (Eq.\ 1.8.1.1)$$

is slightly altered to

$$z_t - z_{t-1} = (\rho - 1) z_{t-1} + \beta_0 + \beta_1 t + \mu_t \qquad (Eq.\ 1.8.1.2)$$

where $\rho - 1 = \alpha$.

The two hypotheses of the DF test are as follows:

$$H_0: \alpha = \rho - 1 = 0 \qquad H_1: \alpha < 0$$

$H_0$ is not rejected, which means the series is non-stationary or has a unit root, if the test value is higher than the critical values determined for the significance levels of 1%, 5% and 10%.

1.8.2. Augmented Dickey-Fuller (ADF) Unit Root Test

This test adds lagged differences of the dependent variable, i.e.,

$$\Delta z_t = z_t - z_{t-1} = (\rho - 1) z_{t-1} + \beta_0 + \beta_1 t + \beta_2 \Delta z_{t-1} + \beta_3 \Delta z_{t-2} + \dots + u_t \qquad (Eq.\ 1.8.2.1)$$

$H_0$ is not rejected, implying non-stationarity or the existence of a unit root, if the test value is higher than the critical values for the significance levels of 1%, 5% and 10%. Alternatively, we do not reject the null hypothesis if the p-value of the test is high. In this thesis, 0.05 (5%) was considered as the significance level.

ADF and DF tests have different critical values. If there is serial correlation in the error (residuals) series, the appropriate test is the ADF.

1.8.3. Phillips-Perron (PP) Unit Root Test

Similar to the ADF test, the null and alternative hypotheses of this test are, respectively,

$$H_0: \text{the series has a unit root} \qquad H_1: \text{the series is stationary}$$

Even though the PP test uses the same estimation scheme as the DF and ADF tests, there are some advantages in choosing it. The Phillips-Perron test uses a non-parametric correction that makes it robust to serial correlation and heteroskedasticity in the error terms, and we do not have to specify a lag length prior to running the test. PP tests are more powerful when we use large samples.

1.8.4. Kwiatkowski-Phillips-Schmidt-Shin (KPSS) Stationarity Test

Contrarily to the two tests presented before, the KPSS test has a different null hypothesis, namely that the series is stationary. The alternative hypothesis is that the series is non-stationary. That is:

$$H_0: y_t \sim I(0)\ \text{(level stationary)} \qquad H_1: y_t \sim I(1)\ \text{(difference stationary)}$$

One big disadvantage of the KPSS test is that it has a high rate of Type I errors, i.e., it tends to reject the null hypothesis too often. By using three different tests, we tend to minimize this kind of error.
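These tests can be run outside EViews as well; a minimal Python sketch using statsmodels (ADF, KPSS) and the arch package (PP), where `series` is assumed to be a pandas Series of the data:

```python
import pandas as pd
from statsmodels.tsa.stattools import adfuller, kpss
from arch.unitroot import PhillipsPerron

def unit_root_report(series: pd.Series) -> None:
    adf_stat, adf_p, *_ = adfuller(series, autolag="AIC")      # H0: unit root
    pp = PhillipsPerron(series)                                 # H0: unit root
    kpss_stat, kpss_p, *_ = kpss(series, regression="c", nlags="auto")  # H0: stationary
    print(f"ADF:  stat={adf_stat:.4f}  p-value={adf_p:.4f}")
    print(f"PP:   stat={pp.stat:.4f}  p-value={pp.pvalue:.4f}")
    print(f"KPSS: stat={kpss_stat:.4f}  p-value={kpss_p:.4f}")
```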

1.9. ARMA Models

1.9.1. AutoRegressive (AR) Models

An AutoRegressive model assumes that past behavior of a time series is useful to forecast future values. Hence, these models predict the next time steps based on a specific number of immediately preceding observations of the series.

A process defined by

$$y_t = a_0 + a_1 y_{t-1} + a_2 y_{t-2} + \dots + a_p y_{t-p} + \varepsilon_t \qquad (Eq.\ 1.9.1.1)$$

with $\varepsilon_t \sim WN(0, \sigma^2)$, is called an autoregressive model AR($p$), where $p$ is the lag order and $\varepsilon_t$ is white noise.

The order of an AR model is the number of past observations of the series used to predict the value at the present time. For example, an AR(1) is a first-order AR process, meaning the output variable is related exclusively to time periods that are one period apart, i.e., the value of the variable at $(t - 1)$.

The autocorrelation function (ACF) of an AR($p$) satisfies the recursion

$$\rho(h) = a_1 \rho(h-1) + a_2 \rho(h-2) + \dots + a_p \rho(h-p), \quad h > 0 \qquad (Eq.\ 1.9.1.2)$$

The partial autocorrelation function (PACF) is such that

$$r(h) \neq 0 \ \text{for}\ h \le p, \qquad r(h) = 0 \ \text{for}\ h > p \qquad (Eq.\ 1.9.1.3)$$

The PACF suggests we should use $p$ lags for the AR model order: it abruptly converges to zero after $p$ steps, while the ACF declines geometrically after lag $p$.


1.9.2. Moving Average (MA) Models

Instead of using past values of the variable to forecast future values, a Moving Average process uses past forecast errors.

A process defined by

$$y_t = \varepsilon_t + \theta_1 \varepsilon_{t-1} + \theta_2 \varepsilon_{t-2} + \dots + \theta_q \varepsilon_{t-q} \qquad (Eq.\ 1.9.2.1)$$

with $\varepsilon_t \sim WN(0, \sigma^2)$, is said to be a Moving Average model MA($q$), where $q$ is the order of the process and $\varepsilon_t$ is white noise.

The Autocorrelation Function (ACF) of MA($q$) models is defined as

$$\rho(h) = \begin{cases} \dfrac{\theta_h + \sum_{j=1}^{q-h} \theta_j \theta_{j+h}}{1 + \sum_{j=1}^{q} \theta_j^2} & \text{if } 0 < h \le q \\ 0 & \text{if } h > q \end{cases} \qquad (Eq.\ 1.9.2.2)$$

The corresponding autocovariance function can be represented as

$$\gamma(h) = \begin{cases} \sigma^2 \left(\theta_h + \sum_{j=1}^{q-h} \theta_j \theta_{j+h}\right) & \text{if } 0 < h \le q \\ 0 & \text{if } h > q \end{cases} \qquad (Eq.\ 1.9.2.3)$$

The ACF suggests we should use $q$ lags for the MA model order: it abruptly converges to zero after $q$ steps. The PACF, on the other hand, converges to zero slowly.

1.9.3. AutoRegressive Moving Average (ARMA) Models

Combining the two previous models, we obtain an AutoRegressive Moving Average model ARMA($p,q$) defined by

$$y_t = a_0 + a_1 y_{t-1} + \dots + a_p y_{t-p} + \theta_1 \varepsilon_{t-1} + \dots + \theta_q \varepsilon_{t-q} + \varepsilon_t \qquad (Eq.\ 1.9.3.1)$$

where

$$E(\varepsilon_t) = 0, \quad E(\varepsilon_t^2) = \sigma^2, \quad E(\varepsilon_t \varepsilon_s) = 0\ \ \forall\, t \neq s \qquad (Eq.\ 1.9.3.2)$$

The mean of an ARMA model is given by

$$E(y_t) = \frac{a_0}{1 - a_1 - a_2 - \dots - a_p} \qquad (Eq.\ 1.9.3.3)$$

where $a_0, \dots, a_p$ are real parameters.

Stationary/stability conditions regard the autoregressive part of the model while invertibility conditions regard the moving average part.

ARMA models can only be applied to stationary time series. If the series in question is non-stationary, we work with returns, i.e., first order differences defined as $y_t - y_{t-1}$, and AutoRegressive Integrated Moving Average (ARIMA) models should be used.


An ARIMA($p,d,q$) model is an ARMA($p,q$) model applied to a series that had to be differenced $d$ times to achieve stationarity. Hence, $p$ is the order of the autoregressive part, $d$ is the degree of differencing involved and $q$ is the order of the moving average part.

To construct an ARMA/ARIMA model we can use the Box-Jenkins methodology to identify, estimate and diagnose. To identify the model likely to give the best results, we can use graphic methods (the time series, the ACF and the PACF, to determine the orders $p$ and $q$ of the model, as will be explained in section 1.10.). To choose the best model, it is necessary to try ARMA/ARIMA models of different orders and compare the values of either the Akaike Information Criterion (AIC) or the Schwarz Information Criterion (SIC). If they suggest different orders, we are free to choose either, as neither is proven to be better than the other.
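The thesis performs this search in EViews; as a rough equivalent, a hedged statsmodels sketch of the order comparison might look as follows, with `y` assumed to hold the original series and the candidate orders purely illustrative:

```python
from statsmodels.tsa.arima.model import ARIMA

candidate_orders = [(1, 1, 0), (0, 1, 1), (2, 1, 1), (2, 1, 2)]  # illustrative only
results = {}
for order in candidate_orders:
    fit = ARIMA(y, order=order).fit()
    results[order] = (fit.aic, fit.bic)  # statsmodels calls the SIC "bic"

# rank candidates by AIC (lower is better)
for order, (aic, bic) in sorted(results.items(), key=lambda kv: kv[1][0]):
    print(f"ARIMA{order}: AIC={aic:.5f}  SIC={bic:.5f}")
```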

1.10. Correlogram

A correlogram is a graph that contains both the ACF and the PACF values for specific lags. An ACF plot contains the autocorrelations that help measure the relationship between $y_t$ and $y_{t-k}$ for different values of $k$. The PACF plot shows the partial correlations, which measure the relationship between $y_t$ and $y_{t-k}$ after removing the effect of lags $1, 2, 3, \dots, k-1$. The number of significant ACF peaks is related to the moving average order of the model, while the number of significant PACF peaks relates to the autoregressive part. Dashed lines in correlograms correspond to $\pm 2/\sqrt{N}$, where $N$ is the number of observations. If the autocorrelation values lie inside this range, we conclude they are not significantly different from zero at the 5% significance level. The last two columns represent the Ljung-Box Q-statistic and the respective p-values. In this case, the null hypothesis is

$H_0$: no autocorrelation up to the $k$-th order.
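A correlogram can be reproduced in Python with statsmodels; in this sketch `returns` is assumed to hold the stationary series of first differences, and the confidence band drawn by the functions plays the role of the ±2/√N dashed lines:

```python
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

fig, axes = plt.subplots(2, 1, figsize=(8, 6))
plot_acf(returns, lags=36, ax=axes[0])   # significant peaks hint at the MA order
plot_pacf(returns, lags=36, ax=axes[1])  # significant peaks hint at the AR order
plt.tight_layout()
plt.show()
```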

1.11. Homoskedasticity/Heteroskedasticity

Homoskedasticity refers to constant/uniform variance of the residuals in a model. When the variance of the error terms changes a lot over time, this might indicate the model is poorly defined.

Using correlograms of squared residuals, we can see if there is residual autocorrelation, which implies variance is non-constant over time. If the variance is not uniform, we are in the presence of a model with heteroskedasticity. When future periods of volatility cannot be identified, we have a model with conditional heteroskedasticity.
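A quick numeric check in the same spirit is the ARCH-LM test; a minimal statsmodels sketch, assuming `resid` holds the residuals of a fitted model:

```python
from statsmodels.stats.diagnostic import het_arch

lm_stat, lm_pvalue, f_stat, f_pvalue = het_arch(resid, nlags=12)
# a small p-value rejects homoskedasticity, i.e. it signals ARCH effects
print(f"ARCH-LM statistic: {lm_stat:.4f}, p-value: {lm_pvalue:.4f}")
```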

1.12. ARCH/GARCH models

AutoRegressive Conditional Heteroskedasticity (ARCH) models were introduced by Engle in 1982 and, later in 1986, Bollerslev and Taylor introduced Generalized AutoRegressive Conditional Heteroskedasticity (GARCH) models. As explained before, conditional heteroskedasticity exists when returns are volatile. If a time series has a high level of kurtosis, that is an indicator of heteroskedasticity.

ARCH models have two specifications, one for the conditional mean equation (1) and one for the conditional variance (2):

(1) $y_t = a_0 + a_1 y_{t-1} + \dots + a_p y_{t-p} + \varepsilon_t$ (Eq. 1.12.1)

(2) $\sigma_t^2 = \omega + \sum_{i=1}^{q} \alpha_i \varepsilon_{t-i}^2$ (Eq. 1.12.2)


The representation of the GARCH($p,q$) variance is

$$\sigma_t^2 = \omega + \sum_{j=1}^{p} \beta_j \sigma_{t-j}^2 + \sum_{i=1}^{q} \alpha_i \varepsilon_{t-i}^2 \qquad (Eq.\ 1.12.3)$$

where $\omega$ is a constant term, $\varepsilon_{t-i}^2$ represents the news about volatility from the previous periods and $\sigma_{t-j}^2$ the forecast variance from previous periods. Here $p$ identifies the order of the autoregressive GARCH terms and $q$ the order of the moving average ARCH terms.

There are many variations of these models within the GARCH family. Some of the most commonly used are Integrated GARCH (IGARCH), Exponential GARCH (EGARCH), Threshold GARCH (TGARCH), Power GARCH (PGARCH) and Component GARCH (CGARCH), among others.
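Although the thesis estimates these models in EViews, an equivalent GARCH(1,1) with an AR(1) mean equation can be sketched with the Python arch package; `returns` is assumed to be the stationary series:

```python
from arch import arch_model

model = arch_model(returns, mean="AR", lags=1, vol="GARCH", p=1, q=1)
res = model.fit(disp="off")
print(res.summary())

# one-step-ahead forecasts of the conditional mean and variance
fc = res.forecast(horizon=1)
print(fc.mean.iloc[-1], fc.variance.iloc[-1])
```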

1.13. Residuals Tests

After choosing a model, we need to study the residuals in order to check how well it fits the data. To do so, we can run several tests and look for low variance, zero mean, linear independence and a normal distribution. We can also obtain a correlogram of the residuals to look for some kind of pattern or peaks that may indicate there is still some linear or non-linear information in the residuals, meaning we are not in the presence of pure white noise.

It is important to check whether the peaks found before correspond to significant serial correlation by running a serial correlation LM test. If they do, we can improve our model by inserting dummy variables and then run new tests to see if the values of the AIC or the SIC improve and if the residuals now meet all the desired requirements.
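As a sketch of such a serial correlation check in Python, the Ljung-Box test from statsmodels can be applied to the residuals (here `resid` is an assumed variable holding them):

```python
from statsmodels.stats.diagnostic import acorr_ljungbox

# H0: no autocorrelation up to the chosen lag; large p-values validate the residuals
lb = acorr_ljungbox(resid, lags=[12], return_df=True)
print(lb)
```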

Once the model is validated, we can start forecasting, which is the main goal of this thesis.

1.14. Forecasting Evaluation Formulas

It is very important to evaluate the forecast accuracy of our model. Residuals alone do not provide reliable indications of what true forecast errors are likely to be. Depending on the model we are working with, we have different measures we can use to determine the accuracy of our prediction. Those measures are listed in sections 1.14.2 – 1.14.6.

1.14.1. Forecast error

Forecast error is the difference between the actual/real value of a time series and the predicted/forecast one. Hence, a positive forecast error indicates the forecast model underestimated the actual value, while a negative forecast error means the forecast method overestimated it.

$$F_e = A_t - F_t \qquad (Eq.\ 1.14.1.1)$$

where $A_t$ represents the actual value at instant $t$ and $F_t$ the forecast value at the same instant. Let $T$ be the total number of forecast values.


1.14.2. Mean Forecast Error or Bias

The Mean Forecast Error (MFE), also called Bias, is the mean of the forecast errors over all periods. We want this value to be as close to zero as possible. A large positive (negative) MFE value means we are underestimating (overestimating) the data.

$$MFE = \frac{\sum_{t=1}^{T} F_e}{T} \qquad (Eq.\ 1.14.2.1)$$

1.14.3. Mean Absolute Deviation

The Mean Absolute Deviation (MAD) of forecast errors is the average distance between each actual value and the corresponding estimated value.

$$MAD = \frac{\sum_{t=1}^{T} |A_t - F_t|}{T} \qquad (Eq.\ 1.14.3.1)$$

1.14.4. Mean Square Error

The Mean Square Error (MSE) is the average squared difference between the predicted values and the actual values. Hence, this value is always non-negative and we wish to find the minimum value possible.

$$MSE = \frac{\sum_{t=1}^{T} (A_t - F_t)^2}{T} \qquad (Eq.\ 1.14.4.1)$$

1.14.5. Mean Absolute Percentage Error

As the name suggests, the Mean Absolute Percentage Error (MAPE) gives information about the accuracy of a prediction in the form of a percentage. The MAPE works better when we have no extreme values and no zeros.

$$MAPE = \frac{100}{T} \sum_{t=1}^{T} \left| \frac{A_t - F_t}{A_t} \right| \qquad (Eq.\ 1.14.5.1)$$

1.14.6. Theil Inequality Coefficient

$$U = \frac{\sqrt{\frac{1}{T}\sum_{t=1}^{T}(A_t - F_t)^2}}{\sqrt{\frac{1}{T}\sum_{t=1}^{T} F_t^2} + \sqrt{\frac{1}{T}\sum_{t=1}^{T} A_t^2}} \qquad (Eq.\ 1.14.6.1)$$

This coefficient is scaled to be a value between 0 and 1, with 0 representing a perfect prediction and 1 the opposite case. Theil's $U$ statistic can be rescaled and decomposed into 3 proportions:

· bias: this value indicates the systematic error. No matter the value of $U$, we always prefer a bias value close to zero. A large bias suggests a systematic over- or under-prediction.

· variance: this value indicates the ability of the forecasts to replicate the degree of variability in the variable to be forecast. When the variance proportion is large, the actual series has fluctuated considerably whereas the forecast has not.

· covariance: this value measures unsystematic error. It should have the highest proportion of the inequality.
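All of the evaluation formulas of sections 1.14.1 – 1.14.6 reduce to a few lines of numpy; the following sketch mirrors the equations above (EViews reports these measures directly):

```python
import numpy as np

def forecast_metrics(actual: np.ndarray, forecast: np.ndarray) -> dict:
    e = actual - forecast  # forecast errors (Eq. 1.14.1.1)
    return {
        "MFE":  e.mean(),                         # bias (Eq. 1.14.2.1)
        "MAD":  np.abs(e).mean(),                 # (Eq. 1.14.3.1)
        "MSE":  (e ** 2).mean(),                  # (Eq. 1.14.4.1)
        "MAPE": 100 * np.abs(e / actual).mean(),  # (Eq. 1.14.5.1)
        "TheilU": np.sqrt((e ** 2).mean())        # (Eq. 1.14.6.1)
                  / (np.sqrt((forecast ** 2).mean()) + np.sqrt((actual ** 2).mean())),
    }
```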


Chapter 2. Machine Learning

Artificial Intelligence (AI) is a vast concept, which traditionally refers to artificial creations that can mimic the functioning of human intelligence to solve problems. Machine Learning (ML) is a field of Artificial Intelligence, based on the idea that systems can learn with data, identify patterns and make decisions with minimal human intervention. Machine learning is often used to make predictions using computers. Even though it is not new, it has been gaining importance in the last several years and is now used in a wide variety of applications.

The basics of machine learning consist in giving training data to a learning algorithm. The learning algorithm then generates a new set of rules, produces reliable decisions and results, including predictions, based on inferences from the data and generalizing from examples. This process is revolutionary because instructions do not have to be programmed step-by-step. Instead, the algorithm learns and adjusts, in an iterative process, its representative knowledge model in order to improve its performance. Computers can learn and build mathematical models without relying on being explicitly programmed to do specific tasks and can adapt when exposed to new data. The more data available to train the algorithm, the more it will learn and the results will be more accurate.

Machine learning models can apply a mix of different techniques, but the methods are typically categorized into three general types:

· Supervised learning: labeled input data and the desired output are provided to the learning algorithm beforehand. We use this type when we have data whose outcome we want to predict or explain.

· Unsupervised learning: unlabeled data is given to the learning algorithm, which is asked to recognize patterns in the data. This type is used when we want to relate and group data without having a target output.

· Reinforcement learning: a dynamic environment is set up to interact with the learning algorithm, with the goal of providing a valuation of the system's response.

The two biggest limitations of machine learning involve overfitting and dimensionality. The first happens when the data is biased and the model does not generalize to new data, while the second happens when we have trouble understanding the data as a result of its many dimensions. Another frequent problem is obtaining a large enough data set.

Financial time series prediction has gained a considerable amount of importance in the last few decades. However, this is not a simple task, due to the highly nonlinear nature of financial time series, their instability and/or uncertainty and the complex relations between the variables. Usually, financial data contains a tremendous amount of noise, non-stationary characteristics and complex dimensionality, and can change rapidly in very short periods of time.

ML models such as Artificial Neural Networks (ANN) have been applied to model a wide range of challenging problems in different areas, such as financial prediction. During the last few years, a number of neural network models and hybrid models have been proposed to obtain accurate prediction results, in an attempt to outperform traditional linear and nonlinear approaches such as ARIMA and GARCH models. Machine Learning models have displayed a better performance than other conventional models, and many studies have praised the benefits of this methodology in forecasting financial time series. When using classic econometric models such as linear regression, ARMA and ARIMA models, one has to make many assumptions, such as the linearity and stationarity of the time series. These classic estimation methods assume the mean and variance of the stochastic series are finite, constant and time-invariant. Therefore, using the traditional models can degrade the quality of the prediction results. Hence, ANNs can overcome many of the drawbacks associated with the traditional techniques.


A neural network is a bio-inspired system with several single processing elements called neurons or nodes. The neurons are connected to each other by connection mechanisms consisting of sets of assigned weights. Usually, in NN models, neurons are organized in layers. The most common architecture consists of an input layer, an output layer and one or more hidden layers. Each layer consists of multiple neurons that are connected to neurons in adjacent layers. The input layer corresponds to the input variables, with one node for each input variable. The hidden layers, where information processing takes place, are used to capture the non-linear relationships among variables via the connection weights. An appropriate number of neurons in the hidden layers needs to be determined by repeated training. The output layer consists of only one neuron, which represents the predicted value of the output variable. A neural network can be trained on the historical data of a time series in order to capture its non-linear characteristics. The model parameters, such as connection weights and node biases, are adjusted by an iterative process with the goal of minimizing the forecasting errors.

2.1. A Brief History of ANNs

In 1943, Warren McCulloch and Walter Pitts, a neurophysiologist and a logician, respectively, wrote “A Logical Calculus of Ideas Immanent in Nervous Activity”, one of the first modern computational theories of mind and brain. Their study focused on mechanisms that resembled neural ones and used logic and computation to understand neural activity. Even though this was an embryonic stage of neural network research, their model still serves as inspiration for most modern neural systems.

In 1969, Marvin Minsky and Seymour Papert, a cognitive scientist and a mathematician, wrote a book called “Perceptrons: an Introduction to Computational Geometry”. Using mathematical techniques, they studied the real possibilities of perceptrons, a type of Artificial Neural Network, focusing on their theoretical capabilities and limitations.

In 1982, John Hopfield, a scientist, presented his study “Neural Networks and Physical Systems with Emergent Collective Computational Abilities”, a groundbreaking innovation of an associative neural network. Hopfield defended the possibility of equipping electronic computers with associative memories and made advances in the pattern recognition models.

In 1986, D. E. Rumelhart, G. E. Hinton and R. J. Williams, a psychologist, a cognitive psychologist and a professor of computer science, published “Learning Representations by Back-Propagating Errors”, where they studied the nature of the connections in a neural network. They introduced a new learning procedure for neural networks, called back-propagation. This procedure consists in repeatedly adjusting the weights of the connections in the neural network with the goal of minimizing the error, i.e., the difference between the actual output vector obtained and the targeted output vector.

2.2. Artificial Neuron

As mentioned before, the inspiration behind Artificial Neural Networks came from the biological neuron of the human brain.

So let us take a look at biological neurons first and how they work. Inputs are received by the dendrites and passed to the soma, the cell body of the neuron, which processes the information received. The input signals are then passed on towards the axon terminals, also called synapses, through a transmission line called the axon. Axon terminals are connected to the dendrites of other neurons. The next figure is a scheme of a human neuron.

Figure 2.2.1. The biological neuron and its components

Basically, an artificial neuron is composed of a set of weighted inputs, which are similar to the inputs received by the dendrites of a biological neuron, and an activation function, which corresponds to the axon of a biological neuron.

The memory capacity of an artificial neural network is proportional to the number of connections between the neurons, not to the number of neurons itself, and the system's ability to learn is directly related to the strength of the connections.

Let $I = (x_1, x_2, \dots, x_n)$ be the vector of $n$ input signals an AN receives. To each input signal $x_i$ is attributed one weight $w_i$. Each weight can be seen as a measure of the strength of the connection between the neurons. As the learning process develops, the initial weight of a neuron can change. The expression

$$\sum_{i=1}^{n} x_i w_i \qquad (Eq.\ 2.2.1)$$

is the net input signal computed by the AN. It is obtained by multiplying each input by its weight and summing the products. After this, an output signal, $o$, is produced using the activation function.


2.3. Artificial Neural Networks (ANNs)

The next table identifies each term of a biological neural network with its counterpart in an artificial neural network.

Table 2.3.1. A comparison between biological and artificial neural networks

Biological Neural Network    Artificial Neural Network
Soma                         Node
Dendrites                    Input
Synapse                      Weights
Axon                         Output

Training a network means providing sets of inputs and outputs, without explaining how the output is obtained, and it involves adjusting the parameters, weights and/or biases of the model. At first, the connections between neurons are weighted randomly and we supply the desired output. The network then compares the actual output to the desired one and gradually changes the connections in order to minimize the error.

ANNs have the ability to work with incomplete, uncertain or slightly incorrect data and are tolerant to noise. They also have the ability to keep functioning in case of loss of a significant proportion of neurons.

Figure 2.3.1. An illustration of how ANNs work



2.4. Multilayer perceptron (MLP) neural network

A Multilayer Perceptron neural network is composed of several layers that contain nodes: the input layer, where external data is received, the hidden layers, and the output layer, where we obtain our results. A multilayer perceptron (MLP) is a feedforward artificial neural network that generates a set of outputs from a set of inputs and uses backpropagation to train the network, adjusting the weights and biases relative to the error. MLP neural networks are often applied to supervised learning problems.

In figure 2.4.1., we have an example of a simple neural network with three inputs, a single hidden layer with five nodes and one output layer. Figure 2.4.2. is a scheme of an MLP neural network, also with three inputs, but with several hidden layers and an output layer.

Figure 2.4.1. Single hidden layer ANN example


2.5. Activation functions

An artificial neuron implements, in the most frequent cases, a non-linear mapping from $\mathbb{R}^n$ to $[0, 1]$ or $[-1, 1]$, depending on the activation function used, where $n$ represents the number of input signals to the neuron.

An activation function, also frequently called a transfer function, is a mathematical equation that receives the net input signal and the bias and determines the output of the neuron. This function is linked to every artificial neuron in the network and determines whether or not a neuron should be activated, based on its importance to the model.

Activation functions can be divided into the three types listed below.

· Binary Step Functions – these are threshold-based activation functions. If the input value is above a certain threshold, the neuron is activated and sends the signal on to the next neuron.

The threshold function, where $\theta$ is the defined threshold, is described as

$$f_{AN}(x) = \begin{cases} 1 & \text{if } x \ge \theta \\ 0 & \text{if } x < \theta \end{cases} \qquad (Eq.\ 2.5.1)$$

A variant of this function, called the Signum function, is also sometimes chosen:

$$f_{AN}(x) = \begin{cases} 1 & \text{if } x \ge \theta \\ -1 & \text{if } x < \theta \end{cases} \qquad (Eq.\ 2.5.2)$$

· Linear functions – the derivative of this type of function is a constant and, therefore, it is not possible to use backpropagation to train the model. All layers of the neural network collapse into just one layer.

The linear function, where $a$ is the slope, is defined as

$$f_{AN}(x) = a x \qquad (Eq.\ 2.5.3)$$

When we combine the linear function and the threshold function, we obtain the ramp function, which is characterized as

$$f_{AN}(x) = \begin{cases} 1 & \text{if } x \ge \theta \\ x & \text{if } 0 < x < \theta \\ -1 & \text{if } x \le 0 \end{cases} \qquad (Eq.\ 2.5.4)$$

· Non-linear functions – these functions allow us to create complex mappings, essential for learning and modeling complex data. The derivative exists and is not constant. Hence, non-linear activation functions allow backpropagation and the stacking of multiple hidden layers.

One of the most frequently used non-linear functions is the sigmoid or logistic function. Its equation is as follows:

$$f_{AN}(x) = \frac{1}{1 + e^{-x}} \qquad (Eq.\ 2.5.5)$$

The graph of the sigmoid function is an S-shaped curve and its range is $[0, 1]$. The outputs are not zero-centered, but it has a smooth gradient and produces clear predictions.


We also have the tan-sigmoid or hyperbolic tangent (tanh) function, which can be defined by

$$f_{AN}(x) = \frac{1 - e^{-2x}}{1 + e^{-2x}} \qquad (Eq.\ 2.5.6)$$

Unlike the sigmoid function, the hyperbolic tangent function has outputs that are zero-centered, because its range is $[-1, 1]$, and optimization is easier. It also has a smooth gradient and makes good predictions.

The last two functions share the same disadvantages. Both the logistic and hyperbolic tangent activation functions have slow convergence and are heavy to compute. They also share the vanishing gradient problem: when values are either very high or very low, there is no change in the prediction. This can result in accuracy problems or slowness in finding an output.

Finally, it is also important to mention the Rectified Linear Unit (ReLU) function, defined as

$$f_{AN}(x) = \max(0, x) = \begin{cases} 0 & \text{if } x < 0 \\ x & \text{if } x \ge 0 \end{cases} \qquad (Eq.\ 2.5.7)$$

ReLU functions are computationally efficient and have quick convergence. They look linear, but have a derivative function. These functions should only be used within the hidden layers of the network. The major disadvantage is the Dying ReLU problem: when inputs approach zero and/or are negative, the gradient of the function becomes zero. This can result in dead neurons, which might impair the network's ability to learn and to perform backpropagation.

A variant of these functions exists, called the Leaky Rectified Linear Unit (Leaky ReLU) function:

$$f_{AN}(x) = \begin{cases} 0.01 x & \text{if } x < 0 \\ x & \text{if } x \ge 0 \end{cases} \qquad (Eq.\ 2.5.8)$$

Leaky ReLU functions prevent the dying ReLU problem, because they have a small positive slope in the negative region, which makes backpropagation possible. They share the same advantages as ReLU functions.

There have been some studies and research efforts aimed at automatically determining which activation function is optimal, but as of today this parameter has to be manually configured. However, some desirable properties of an activation function are non-linearity, continuous differentiability and monotonicity.
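For reference, the non-linear activations above translate directly into numpy (a sketch of Eqs. 2.5.5 – 2.5.8):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))                      # range [0, 1]   (Eq. 2.5.5)

def tanh(x):
    return (1 - np.exp(-2 * x)) / (1 + np.exp(-2 * x))   # range [-1, 1]  (Eq. 2.5.6)

def relu(x):
    return np.maximum(0.0, x)                            # (Eq. 2.5.7)

def leaky_relu(x):
    return np.where(x < 0, 0.01 * x, x)                  # (Eq. 2.5.8)
```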

2.6. Forecasting with ANNs

The accuracy of forecasts in ANN models can only be truly determined when we consider data that was not included when fitting the model. Consequently, it is common practice to separate the available data into training and test sets. Training data is used to estimate the parameters of the model and test data is used to evaluate the accuracy of the forecast method chosen. Once again, we will determine the accuracy of our predicted values using the formulas of sections 1.14.2. – 1.14.6.
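As a hedged Keras sketch of this workflow, loosely following the configuration the results chapter reports as best (one hidden layer with three tanh nodes, early stopping, an 80/20 Training/Test partition); the windowing details and names here are assumptions, not the thesis's exact code:

```python
import numpy as np
from tensorflow import keras

def make_windows(series: np.ndarray, lags: int = 12):
    # turn a series scaled to [-1, 1] into (lagged inputs, next value) pairs
    X = np.stack([series[i:i + lags] for i in range(len(series) - lags)])
    y = series[lags:]
    return X, y

X, y = make_windows(series)        # `series` is an assumed, pre-scaled input array
split = int(0.8 * len(X))          # 80/20 training/test partition
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]

model = keras.Sequential([
    keras.layers.Dense(3, activation="tanh", input_shape=(X.shape[1],)),  # 1 hidden layer, 3 nodes
    keras.layers.Dense(1),                                                # one output neuron
])
model.compile(optimizer="adam", loss="mse")
model.fit(X_train, y_train, epochs=300,
          validation_data=(X_test, y_test),
          callbacks=[keras.callbacks.EarlyStopping(patience=20)],  # early stop
          verbose=0)
print("test MSE:", model.evaluate(X_test, y_test, verbose=0))
```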


Chapter 3. Results & Data Analysis

3.1. Descriptive Statistics & Unit Root/Stationarity Tests

Our database has 845 observations, corresponding to the monthly values of Portuguese inflation between January of 1949 and May of 2019. The last value, referring to May of 2019, was only used for comparison with the forecast values. The graphic representation of this time series and its respective descriptive statistics are shown below.

Figure 3.1.1. Portuguese Inflation Rate Series graph

A first look at the graph illustrated in Figure 3.1.1 indicates that our time series is not stationary, with alternating upward and downward trends.

Exploratory data analysis shows that the average monthly inflation rate in Portugal is 7.07%, while the all-time high of 36.66% was reached in May of 1977 and the all-time low of -3.74% in June of 1954. Also, the skewness of this distribution is 1.33 while the kurtosis reaches 3.92, meaning the series is both asymmetric and leptokurtic.

Table 3.1.1. Descriptive Statistics (Portuguese Inflation Rate Series)

Mean                    7.080865
Median                  3.585000
Maximum                 36.66000
Minimum                 -3.740000
Standard Deviation      7.925615
Skewness                1.329424
Kurtosis                3.139424
Jarque-Bera             277.8362
Probability             0.000000
Sum                     5976.250
Sum Square Deviation    59253.36


We will now test whether the series is stationary, using the unit root/stationarity tests stated before.

Table 3.1.2. Unit Root/Stationarity Tests (Original Series)

Test                                 Statistic   1%        5%        10%       Probability
Augmented Dickey-Fuller              -1.3953     -3.4380   -2.8648   -2.5685   0.5858
Phillips-Perron                      -2.2401     -3.4380   -2.8648   -2.5685   0.1923
Kwiatkowski-Phillips-Schmidt-Shin    0.7244      0.739     0.463     0.347     -

Both ADF and PP tests show high p-values or, alternatively, test statistic values higher than the respective critical values, meaning we cannot reject $H_0$ and, therefore, we are in the presence of a unit root, i.e., the series is non-stationary. To confirm these results, we can use the KPSS test, which indicates we should reject $H_0$ ("the series is stationary") at the 5% significance level, since the test statistic of 0.7244 exceeds the 5% critical value of 0.463.

We can now conclude the series is non-stationary. After differencing the series once, i.e., applying the first order difference operator, we run new stationarity tests for the returns and the results follow below.

Table 3.1.3. Unit Root/Stationarity Tests (Series of Returns)

Test                                 Statistic   1%        5%        10%       Probability
Augmented Dickey-Fuller              -13.2553    -3.4380   -2.8648   -2.5685   0.0000
Phillips-Perron                      -23.4748    -3.4380   -2.8648   -2.5685   0.0000
Kwiatkowski-Phillips-Schmidt-Shin    0.0610      0.739     0.463     0.347     -

Considering the new results obtained from the ADF and PP tests, the test statistics now have much smaller values than the critical values and the associated p-values are statistically close to zero. Hence, we reject $H_0$: there is no unit root and we can conclude the series of returns is stationary.

Table 3.1.4. Descriptive Statistics (Series of Returns)

Mean                    -0.004259
Median                  -0.020000
Maximum                 6.400000
Minimum                 -7.640000
Standard Deviation      1.120456
Skewness                -0.223814
Kurtosis                11.87318
Jarque-Bera             2772.546
Probability             0.000000
Sum                     -3.590000
Sum Square Deviation    1057.066


The new series is slightly asymmetric, with a negative skewness coefficient of -0.22, but extremely leptokurtic, having a very high kurtosis value of 11.87. This might be an indicator of heteroskedasticity, which we will analyze later by running some specific tests.

3.2. ARMA/ARIMA models’ results

Since our time series is now stationary, we will take a look at the correlogram and analyze it to determine which models we should try in the context of ARMA/ARIMA models (for the series of returns or the original series, respectively).

Figure 3.2.1. Residuals Correlogram

Significant peaks determine the orders we should use in our models. From observation of the graph, we might be inclined to test models AR(1), AR(2), AR(3), MA(1), MA(2), MA(3), ARMA(1,1), ARMA(1,2), ARMA(2,1), ARMA(2,2), ARMA(1,3), ARMA(3,1), ARMA(2,3), ARMA(3,2) and ARMA(3,3) for the series of returns, or equivalently ARIMA(1,1,0), ARIMA(2,1,0), ARIMA(3,1,0), ARIMA(0,1,1), ARIMA(0,1,2), ARIMA(0,1,3), ARIMA(1,1,1), ARIMA(1,1,2), ARIMA(2,1,1), ARIMA(2,1,2), ARIMA(1,1,3), ARIMA(3,1,1), ARIMA(2,1,3), ARIMA(3,1,2) and ARIMA(3,1,3) for the original series. The fitting process used in this thesis was Generalized Least Squares (Gauss-Newton).

The upper panel of each output reports the inputs, the middle panel the coefficients, and the lower panel statistics about the equation used. Since many results are similar, only the middle and lower panels of some outputs are shown in this chapter and the results are briefly explained. Full outputs of all models can be consulted in the Appendix section.

The following list is a summary of how to interpret the outputs generated:

· Std. Error is the standard error associated with each coefficient tested.

· t-Statistic represents the value of the test statistic associated with the null hypothesis that each line's coefficient is equal to zero. If any of the coefficients of the model is zero, we can discard the model. If the absolute value of the t-statistic is higher than the chosen critical value, that specific coefficient is significant to the model in question.

· Prob represents the p-value. If the p-value is bigger than 0.05, we cannot reject the null hypothesis that the coefficient is zero, and we conclude the model does not fit our data.

· R-squared and Adjusted R-squared measure the degree of proximity between the variables.

· S.E. of regression corresponds to the estimated standard deviation of the errors.

· Mean dependent var and S.D. dependent var are, respectively, the mean and the sample standard deviation of the dependent variable.

· F-statistic and Prob(F-statistic) test the null hypothesis that all coefficients except the intercept are null. The closer Prob(F-statistic) is to zero, the better for our model: it means the coefficients are jointly statistically significant.

· Akaike info criterion and Schwarz criterion are the AIC and SIC, as explained before. We will use these criteria to compare different models. The sum of squared residuals and the log likelihood are also used to compare different regression models.

· Durbin-Watson stat is a classic statistical test for serial correlation. A value close to 2 means there is no serial correlation, which is the optimal case, while a value close to 0 means there is probably serial correlation.

· Inverted AR/MA roots should be values inside the unit circle.

The first ruled-out models were AR(2), AR(3), MA(2), MA(3), ARMA(1,1), ARMA(1,2), ARMA(1,3), ARMA(2,3), ARMA(3,1), ARMA(3,2) and ARMA(3,3). Below we can see the output obtained when computing ARMA(1,1), which serves as an example of why these models were excluded.

Table 3.2.1. ARMA(1,1) Model Output

Both the AR(1) and MA(1) coefficients have Prob > 0.05, which causes us not to reject $H_0$, meaning these coefficients are statistically non-significant. Therefore, we cannot validate this model. The other ten models were excluded for the exact same reasons.

Models AR(1), MA(1), ARMA(2,1) and ARMA(2,2) all have statistically significant coefficients, so they will be analyzed in further detail.


Table 3.2.2. ARMA(2,2) Model Output

· The intercept is statistically null, since its p-value is 0.9068, which prevents us from rejecting the hypothesis that it is zero.

· The ARMA coefficients all display probabilities of approximately zero, meaning all coefficients are statistically relevant.

· R-squared and Adjusted R-squared present low values of 0.114523 and 0.110297, respectively, meaning there is little explanatory relationship between the variables. These values are far from ideal, as we search for higher ones, but this is not as important as in regression models.

· Prob(F-statistic) is extremely close to zero, which is a desired property. It suggests $H_0$ is rejected and, therefore, the model is globally good.

· The Durbin-Watson statistic is 1.953467, very close to the optimal value of 2, meaning there is little residual correlation.

· Both the AR roots and the MA roots have values within the unit circle.

For this ARMA(2,2) model, the Akaike Information Criterion is 2.95546 and the Schwarz Information Criterion is 2.98355. We will compare these values with those of the other models to see which one is best suited to this particular time series.

The next table displays the AIC and SIC values obtained for the four previously validated models.

Table 3.2.3. AIC/SIC values for validated models

Model        AIC        SIC
AR(1)        3.03205    3.04329
MA(1)        3.03108    3.04232
ARMA(2,1)    3.02468    3.04715
ARMA(2,2)    2.95546    2.98355

Since all values are extremely close to one another, we decided to run some tests on the residuals and to forecast one value, in order to determine which of the models fits our time series best before making a decision about the efficiency of any model.

