No. 461
Building Neural Network Models for Time Series: A Statistical Approach
Marcelo C. Medeiros, Timo Teräsvirta, Gianluigi Rech
TEXTO PARA DISCUSSÃO
DEPARTAMENTO DE ECONOMIA, PUC-RIO
www.econ.puc-rio.br
Building Neural Network Models for Time Series: A Statistical Approach

Marcelo C. Medeiros
Department of Economics, Catholic University of Rio de Janeiro

Timo Teräsvirta
Department of Economic Statistics, Stockholm School of Economics

Gianluigi Rech
Quantitative Analysis, Electrabel, Louvain-la-Neuve, Belgium

August 21, 2002
Abstract

This paper is concerned with modelling time series by single hidden layer feedforward neural network models. A coherent modelling strategy based on statistical inference is presented. Variable selection is carried out using existing techniques. The problem of selecting the number of hidden units is solved by sequentially applying Lagrange multiplier type tests, with the aim of avoiding the estimation of unidentified models. Misspecification tests are derived for evaluating an estimated neural network model. A small-sample simulation experiment is carried out to show how the proposed modelling strategy works and how the misspecification tests behave in small samples. Two applications to real time series, one univariate and the other multivariate, are considered as well. Sets of one-step-ahead forecasts are constructed and forecast accuracy is compared with that of other nonlinear models applied to the same series.

Keywords: Model misspecification, neural computing, nonlinear forecasting, nonlinear time series, smooth transition autoregression, sunspot series, threshold autoregression, financial prediction.

JEL Classification Codes: C22, C51, C52, C61, G12
Acknowledgments: This research has been supported by the Tore Browaldh's Foundation. The research of the first author has been partially supported by CNPq. The paper is partly based on Chapter 2 of the PhD thesis of the third author. A part of the work was carried out during the visits of the first author to the Department of Economic Statistics, Stockholm School of Economics, and the second author to the Department of Economics, PUC-Rio. The hospitality of these departments is gratefully acknowledged. Material from this paper has been presented at the Fifth Brazilian Conference on Neural Networks, Rio de Janeiro, April 2001, the 20th International Symposium on Forecasting, Dublin, June 2002, and seminars at CORE (Louvain-la-Neuve), Monash University (Clayton, VIC), Swedish School of Economics (Helsinki), University of California, San Diego, Cornell University (Ithaca, NY), Federal University of Rio de Janeiro, and Catholic University of Rio de Janeiro. We wish to thank the participants of these occasions, Hal White in particular, for helpful comments. Our thanks also go to Chris Chatfield and Dick van Dijk for useful remarks, and Allan Timmermann for the data used in the second empirical example of the paper. The responsibility for any errors or shortcomings in the paper remains ours.

Address for correspondence: Timo Teräsvirta, Department of Economic Statistics, Stockholm School of Economics, Box 6501, SE-113 83 Stockholm, Sweden. E-mail: timo.terasvirta@hhs.se.
1 Introduction

Alternatives to linear models in econometric and time series modelling have increased in popularity in recent years. Nonparametric models that do not make assumptions about the parametric form of the functional relationship between the variables to be modelled have become more easily applicable due to computational advances and increased computational power. Another class of models, the flexible functional forms, offers an alternative that in fact also leaves the functional form of the relationship unspecified. While these models do contain parameters, often a large number of them, the parameters are not globally identified or, using the statistical terminology, estimable. Identification or estimability, if achieved, is local at best without additional parameter restrictions. The parameters are not interpretable either, as they often are in parametric models.

The artificial neural network (ANN) model is a prominent example of such a flexible functional form. It has found applications in a number of fields, including economics. Kuan and White (1994) surveyed the use of ANN models in (macro)economics, and several financial applications appeared in a recent special issue of IEEE Transactions on Neural Networks (Abu-Mostafa, Atiya, Magdon-Ismail and White 2001). The use of the ANN model in applied work is generally motivated by a mathematical result stating that, under mild regularity conditions, a relatively simple ANN model is capable of approximating any Borel-measurable function to any given degree of accuracy; see, for example, Funahashi (1989), Cybenko (1989), Hornik, Stinchcombe, and White (1989, 1990), White (1990), or Gallant and White (1992). Such an approximator would still contain a finite number of parameters. How to specify such a model, that is, how to find the right combination of parameters and variables, is a central topic in the ANN literature and has been considered in a large number of books, such as Bishop (1995), Ripley (1996), Fine (1999), Haykin (1999), or Reed and Marks II (1999), and articles. Many popular specification techniques are "general-to-specific" or "top-down" procedures: the investigator begins with a large model and applies appropriate algorithms to reduce the number of parameters using a predetermined stopping-rule. Such algorithms usually do not rely on statistical inference.
In this paper, we propose a coherent modelling strategy for simple single hidden-layer feedforward ANN time series models. The models discussed here are univariate, but adding exogenous variables to them is straightforward. The main difference between our approach and the general-to-specific approaches is that ours works in the opposite direction, from specific to general. We begin with a small model and expand that according to a set of predetermined rules. The reason for this is that we view our ANN model as a statistical nonlinear model and apply statistical inference to the problem of specifying the model or, as the ANN experts express it, finding the network architecture. We shall argue in the paper that proper statistical inference is not available if we choose to proceed from large models to smaller ones, from general to specific. Our "bottom-up" strategy builds partly on early work by Teräsvirta and Lin (1993). More recently, Anders and Korn (1999) presented a strategy that shares certain features with our procedure. Swanson and White (1995, 1997a, 1997b) also developed and applied a specific-to-general strategy that deserves mention here. Balkin and Ord (2000) proposed an inference-based method for selecting the number of hidden units in the ANN model. Zapranis and Refenes (1999) developed a computer-intensive strategy based on statistics to select the variables and the number of hidden units of the ANN model. Our aim has been to develop a strategy that minimizes the amount of computation required to reach the final specification and, furthermore, contains an in-sample evaluation of the estimated model. We shall consider the differences between our strategy and the other ones mentioned here in later sections of the paper.
The plan of the paper is as follows. Section 2 describes the model and Section 3 discusses geometric and statistical interpretations for it. A model specification strategy, consisting of specification, estimation, and evaluation of the model, is described in Section 4. The results concerning a Monte Carlo experiment are reported in Section 5 and two applications with real data sets are presented in Section 6. Section 7 contains concluding remarks.
2 The Autoregressive Neural Network Model
The AutoRegressive Neural Network (AR-NN) model is defined as

$$y_t = G(x_t;\psi) + \varepsilon_t = \alpha'\tilde{x}_t + \sum_{i=1}^{h} \lambda_i F(\tilde{\omega}_i' x_t - \beta_i) + \varepsilon_t, \qquad (1)$$

where $G(x_t;\psi)$ is a nonlinear function of the variables $x_t$ with parameter vector $\psi \in R^{q+1+h(q+2)}$ defined as $\psi = [\alpha', \lambda_1,\ldots,\lambda_h, \tilde{\omega}_1',\ldots,\tilde{\omega}_h', \beta_1,\ldots,\beta_h]'$. The vector $\tilde{x}_t \in R^{q+1}$ is defined as $\tilde{x}_t = [1, x_t']'$, where $x_t \in R^q$ is a vector of lagged values of $y_t$ and/or some exogenous variables. The function $F(\tilde{\omega}_i'x_t - \beta_i)$, often called the activation function, is the logistic function

$$F(\tilde{\omega}_i'x_t - \beta_i) = \left(1 + e^{-(\tilde{\omega}_i'x_t - \beta_i)}\right)^{-1}, \qquad (2)$$

where $\tilde{\omega}_i = [\tilde{\omega}_{1i},\ldots,\tilde{\omega}_{qi}]' \in R^q$ and $\beta_i \in R$, and the linear combination of these functions in (1) forms the so-called hidden layer. Model (1) with (2) does not contain lags of $\varepsilon_t$ and is therefore called a feedforward NN model. For other choices of the activation function, see Chen, Racine and Swanson (2001). Furthermore, $\{\varepsilon_t\}$ is a sequence of independently normally distributed random variables with zero mean and variance $\sigma^2$. The nonlinear function $F(\tilde{\omega}_i'x_t - \beta_i)$ is usually called a hidden neuron or a hidden unit. The normality assumption enables us to define the log-likelihood function, which is required for the statistical inference we need, but it can be relaxed.
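Model (1) with (2) is straightforward to evaluate numerically. The following sketch is not from the paper; the function and argument names (`ar_nn`, `alpha`, `lam`, `omega`, `beta`) are our own illustrative choices for $\alpha$, $\lambda_i$, $\tilde{\omega}_i$ and $\beta_i$.

```python
import numpy as np

def logistic(z):
    """Activation function F in (2)."""
    return 1.0 / (1.0 + np.exp(-z))

def ar_nn(x, alpha, lam, omega, beta):
    """Evaluate G(x_t; psi) of model (1) for a batch of input vectors.

    x     : (T, q) matrix whose rows are the regressor vectors x_t
    alpha : (q+1,) coefficients of the linear unit, applied to [1, x_t']'
    lam   : (h,) hidden-unit weights lambda_i
    omega : (h, q) rows are the weight vectors omega~_i
    beta  : (h,) location parameters beta_i
    """
    x_tilde = np.column_stack([np.ones(len(x)), x])
    linear = x_tilde @ alpha
    hidden = logistic(x @ omega.T - beta) @ lam   # sum_i lambda_i F(.)
    return linear + hidden
```

Noise $\varepsilon_t$ would be added on top of this skeleton when simulating from the model.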
3 Geometric and Statistical Interpretation
The characterization and tasks of a layer of hidden neurons in an AR-NN model are discussed in several textbooks such as Bishop (1995), Haykin (1999), Fine (1999), and Reed and Marks II (1999). It is nevertheless important to review some concepts in order to compare the AR-NN model with other well-known nonlinear time series models.
Consider the output $F(\tilde{\omega}_i'x_t - \beta_i)$ of a unit of the hidden layer of a neural network as defined in (1) and (2). When $\tilde{\omega}_i'x_t = \beta_i$, the parameters $\tilde{\omega}_i$ and $\beta_i$ define a hyperplane in a $q$-dimensional Euclidean space

$$H = \{x_t \in R^q \mid \tilde{\omega}_i'x_t = \beta_i\}. \qquad (3)$$

The direction of $\tilde{\omega}_i$ determines the orientation of the hyperplane and the scalar term $\beta_i/\|\tilde{\omega}_i\|$ the position of the hyperplane in terms of its distance from the origin. A hyperplane induces a partition of the space $R^q$ into two regions defined by the half-spaces

$$H^+ = \{x_t \in R^q \mid \tilde{\omega}_i'x_t \ge \beta_i\} \qquad (4)$$

$$H^- = \{x_t \in R^q \mid \tilde{\omega}_i'x_t < \beta_i\}, \qquad (5)$$

associated to the states of the neuron. With $h$ hyperplanes, a $q$-dimensional space will be split into several polyhedral regions. Each region is defined by the nonempty intersection of the half-spaces (4) and (5) of each hyperplane. Hyperplanes lying parallel to each other constitute a special case. In this situation, $\tilde{\omega}_i \equiv \tilde{\omega}$, $i = 1,\ldots,h$, and equation (1) becomes

$$y_t = \alpha'\tilde{x}_t + \sum_{i=1}^{h} \lambda_i F(\tilde{\omega}'x_t - \beta_i) + \varepsilon_t. \qquad (6)$$

The input space is thus split in $h+1$ regions.
Certain special cases of (1) are of interest. When $x_t = y_{t-d}$ in $F$, model (1) becomes a multiple logistic smooth transition autoregressive (MLSTAR) model with $h+1$ regimes in which only the intercept changes according to the regime. The resulting model is expressed as

$$y_t = \alpha'\tilde{x}_t + \sum_{i=1}^{h} \lambda_i F(\gamma_i(y_{t-d} - c_i)) + \varepsilon_t, \qquad (7)$$

where $\gamma_i = \tilde{\omega}_{i1}$ and $c_i = \beta_i/\tilde{\omega}_{i1}$. When $h = 1$, equation (7) defines a special case of an ordinary LSTAR model considered in Teräsvirta (1994). When $\gamma_i \to \infty$, $i = 1,\ldots,h$, model (7) becomes a Self-Exciting Threshold AutoRegressive (SETAR) model with a switching intercept and $h+1$ regimes. In another important special case, $x_t = t$ in $F$. In this situation the intercept of a linear model changes smoothly as a function of time. A linear model with $h$ structural shifts in the intercept is obtained as $\gamma_i \to \infty$, $i = 1,\ldots,h$.
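The SETAR limit can be illustrated numerically. In this small sketch (ours, not the paper's), the logistic transition $F(\gamma(y_{t-d} - c))$ approaches the indicator $I(y_{t-d} > c)$ as $\gamma$ grows:

```python
import numpy as np

F = lambda z: 1.0 / (1.0 + np.exp(-z))   # logistic activation (2)

y_lag = np.array([-2.0, -0.5, 0.5, 2.0])  # illustrative values of y_{t-d}
c = 0.0                                   # location parameter
for gamma in (1.0, 10.0, 50.0):
    # As gamma grows, the smooth transition sharpens towards a step:
    print(gamma, F(gamma * (y_lag - c)))
```

For moderate $\gamma$ the transition between regimes is smooth; for large $\gamma$ the output is essentially 0 below $c$ and 1 above it, which is the threshold (SETAR) behaviour described above.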
An AR-NN model can thus be either interpreted as a semi-parametric approximation to any Borel-measurable function or as an extension of the MLSTAR model where the transition variable can be a linear combination of stochastic variables. We should however stress the fact that model (1) is, in principle, neither globally nor locally identified. Three characteristics of the model imply non-identifiability. The first one is the exchangeability property of the AR-NN model: the value of the likelihood function of the model remains unchanged if we permute the hidden units. This results in $h!$ different models that are indistinguishable from each other and in $h!$ equal local maxima of the log-likelihood function. The second characteristic is the symmetry of the logistic function, $F(z) = 1 - F(-z)$, which yields two observationally equivalent parametrizations for each hidden unit. Finally, the presence of irrelevant hidden units is a problem. If model (1) has hidden units such that $\lambda_i = 0$ for at least one $i$, the parameters $\tilde{\omega}_i$ and $\beta_i$ remain unidentified. Conversely, if $\tilde{\omega}_i = 0$, then $\beta_i$ and $\lambda_i$ can take any value without the value of the likelihood function being affected.
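The symmetry-induced equivalence is easy to verify numerically. In this illustrative sketch (our own construction, not from the paper), flipping the signs of $(\tilde{\omega}_i, \beta_i)$, negating $\lambda_i$, and shifting the intercept by $\lambda_i$ leaves $G$ unchanged, because $F(z) = 1 - F(-z)$:

```python
import numpy as np

F = lambda z: 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 2))               # five sample points, q = 2
x_tilde = np.column_stack([np.ones(5), x])

alpha = np.array([0.3, -0.5, 0.2])        # linear unit
lam, omega, beta = 1.7, np.array([0.8, -0.6]), 0.4

# Original parametrization of a single hidden unit
g1 = x_tilde @ alpha + lam * F(x @ omega - beta)

# Observationally equivalent one: flip (omega, beta), negate lambda,
# and absorb lambda into the intercept, using F(z) = 1 - F(-z)
alpha2 = alpha.copy()
alpha2[0] += lam
g2 = x_tilde @ alpha2 + (-lam) * F(x @ (-omega) - (-beta))

print(np.allclose(g1, g2))   # both parameter vectors give identical G
```

This is exactly why sign restrictions on the weights (or, later, $\gamma_i > 0$) are needed for identification.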
The first problem is solved by imposing, say, the restrictions $\beta_1 \le \cdots \le \beta_h$ or $\lambda_1 \ge \cdots \ge \lambda_h$. The second source of underidentification can be circumvented, for example, by imposing the restrictions $\tilde{\omega}_{1i} > 0$, $i = 1,\ldots,h$. To remedy the third problem, it is necessary to ensure that the model contains no irrelevant hidden units. This difficulty is dealt with by applying statistical inference in model specification; see Section 4. For further discussion of the identifiability of ANN models see, for example, Sussman (1992), Kurkova and Kainen (1994), Hwang and Ding (1997), and Anders and Korn (1999).
For estimation purposes it is often useful to reparametrize the logistic function (2) as

$$F\left(\gamma_i(\omega_i'x_t - c_i)\right) = \left(1 + e^{-\gamma_i(\omega_i'x_t - c_i)}\right)^{-1}, \qquad (8)$$

where $\gamma_i > 0$, $i = 1,\ldots,h$, and $\|\omega_i\| = 1$ with

$$\omega_{i1} = \sqrt{1 - \sum_{j=2}^{q} \omega_{ij}^2} > 0, \quad i = 1,\ldots,h. \qquad (9)$$

The parameter vector of model (1) becomes

$$\psi = [\alpha', \lambda_1,\ldots,\lambda_h, \gamma_1,\ldots,\gamma_h, \omega_{12},\ldots,\omega_{1q},\ldots,\omega_{h2},\ldots,\omega_{hq}, c_1,\ldots,c_h]'.$$

In this case the first two identifying restrictions discussed above can be defined as, first, $c_1 \le \cdots \le c_h$ or $\lambda_1 \ge \cdots \ge \lambda_h$ and, second, $\gamma_i > 0$, $i = 1,\ldots,h$. We shall return to this parameterization in Section 4.3, when maximum likelihood estimation of the parameters is considered.
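Condition (9) is straightforward to impose in code. A minimal sketch (our own helper, assuming the free parameters are $\omega_{i2},\ldots,\omega_{iq}$ and that their squared sum does not exceed one):

```python
import numpy as np

def full_omega(omega_free):
    """Recover omega_i1 from the free elements omega_i2, ..., omega_iq
    via (9), so that ||omega_i|| = 1 with a positive first element."""
    w1 = np.sqrt(1.0 - np.sum(omega_free ** 2))
    return np.concatenate([[w1], omega_free])
```

For example, `full_omega(np.array([0.6]))` yields the unit-norm vector `[0.8, 0.6]`.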
4 The Modelling Strategy

4.1 Three Stages of Model Building
As mentioned in the Introduction, our aim is to construct a coherent strategy for building AR-NN models using statistical inference. The structure or architecture of an AR-NN model has to be determined from the data. We call this stage specification of the model, and it involves two sets of decision problems. First, the lags or variables to be included in the model have to be selected. Second, the number of hidden units has to be determined. Choosing the correct number of hidden units is particularly important, as selecting too many neurons yields an unidentified model. In this work, the lag structure or the variables included in the model are determined using well-known variable selection techniques. The specification stage of NN modelling also requires estimation because we suggest choosing the hidden units sequentially. After estimating a model with $h$ hidden units we shall test it against the one with $h+1$ hidden units and continue until the first acceptance of a null hypothesis. What follows thereafter is evaluation of the final estimated model to check if the final model is adequate. NN models are typically only evaluated out-of-sample, but in this paper we suggest in-sample misspecification tests for the purpose. Similar tests are routinely applied in evaluating STAR models (see Eitrheim and Teräsvirta (1996)), and in this work we adapt them to the AR-NN models. All this requires consistency and asymptotic normality for the estimators of parameters of the AR-NN model, conditions for which, based on results in Trapletti, Leisch and Hornik (2000), will be stated below.
We shall begin the discussion of our modelling strategy by considering variable selection. After dealing with that problem we turn to parameter estimation. Finally, after discussing statistical inference for selecting the hidden units and after presenting our in-sample model evaluation tools, we outline the modelling strategy as a whole.
4.2 Variable Selection
The first step in our model specification is to choose the variables for the model from a set of potential variables (lags in the pure AR-NN case). Several nonparametric variable selection techniques exist (Tschernig and Yang 2000, Vieu 1995, Tjøstheim and Auestad 1994, Yao and Tong 1994), but they tend to be computationally demanding when the number of observations is not small. In this paper variable selection is carried out by linearizing the model and applying well-known techniques of linear variable selection to this approximation. This keeps computational cost to a minimum. For this purpose we adopt the simple procedure proposed in Rech, Teräsvirta and Tschernig (2001). Their idea is to approximate the stationary nonlinear model by a polynomial of sufficiently high order. Adapted to the present situation, the first step is to approximate the function $G(x_t;\psi)$ in (1) by a general $k$-th order polynomial. By the Stone-Weierstrass theorem, the approximation can be made arbitrarily accurate if some mild conditions, such as the parameter space being compact, are imposed on the function $G(x_t;\psi)$.
Thus the AR-NN model, itself a universal approximator, is approximated by another function. This yields

$$G(x_t;\psi) = \theta'\tilde{x}_t + \sum_{j_1=1}^{q}\sum_{j_2=j_1}^{q} \theta_{j_1 j_2}\, x_{j_1,t}x_{j_2,t} + \cdots + \sum_{j_1=1}^{q}\cdots\sum_{j_k=j_{k-1}}^{q} \theta_{j_1 \cdots j_k}\, x_{j_1,t}\cdots x_{j_k,t} + R(x_t;\psi), \qquad (10)$$

where $R(x_t;\psi)$ is the approximation error that can be made negligible by choosing $k$ sufficiently high. The $\theta$'s are parameters, and $\theta \in R^{q+1}$ is a vector of parameters. The linear form of the approximation is independent of the number of hidden units in (1).
In equation (10), every product of variables involving at least one redundant variable has the coefficient zero. The idea is to sort out the redundant variables by using this property of (10). In order to do that, we first regress $y_t$ on all variables on the right-hand side of equation (10), assuming $R(x_t;\psi) \equiv 0$, and compute the value of a model selection criterion (MSC), AIC or SBIC for example. After doing that, we remove one variable from the original model, regress $y_t$ on all the remaining terms in the corresponding polynomial, and again compute the value of the MSC. This procedure is repeated by omitting each variable in turn. We continue by simultaneously omitting two regressors of the original model and proceed in that way until the polynomial is a function of a single regressor and, finally, just a constant. Having done that, we choose the combination of variables that yields the lowest value of the MSC. This amounts to estimating $\sum_{i=1}^{q}\binom{q}{i} + 1$ linear models; the lags and any exogenous variables of the ANN model are selected at the same time. Rech et al. (2001) showed that the procedure works well already in small samples when compared to well-known nonparametric techniques. Furthermore, it can be successfully applied even in large samples when nonparametric model selection becomes computationally infeasible.
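The subset search described above can be sketched as follows. This is our own illustrative implementation, not the authors' code; the helper names (`poly_design`, `select_variables`) and the use of SBIC are our choices, and the exhaustive loop mirrors the $\sum_{i=1}^{q}\binom{q}{i}+1$ regressions.

```python
import numpy as np
from itertools import combinations, combinations_with_replacement

def poly_design(X, order):
    """Constant, linear and cross-product terms up to the given order
    for all columns of X: the polynomial approximation in (10)."""
    T, q = X.shape
    cols = [np.ones(T)]
    for k in range(1, order + 1):
        for idx in combinations_with_replacement(range(q), k):
            cols.append(np.prod(X[:, idx], axis=1))
    return np.column_stack(cols)

def select_variables(y, X, order=3):
    """Return the index set of candidate variables minimizing SBIC
    over all subsets, using the polynomial approximation of G."""
    T, q = X.shape
    best, best_sbic = (), np.inf
    subsets = [()] + [s for r in range(1, q + 1)
                      for s in combinations(range(q), r)]
    for s in subsets:
        Z = poly_design(X[:, s], order) if s else np.ones((T, 1))
        coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
        sse = np.sum((y - Z @ coef) ** 2)
        sbic = T * np.log(sse / T) + Z.shape[1] * np.log(T)
        if sbic < best_sbic:
            best, best_sbic = s, sbic
    return best
```

Note that the search is exhaustive over all $2^q$ subsets, so it is practical only for a moderate number of candidate variables, which is the situation considered in the paper.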
4.3 Parameter Estimation
As selecting the number of hidden units requires estimation of neural network models, we now turn to this problem. A large number of algorithms for estimating the parameters of a NN model are available in the literature. In this paper we estimate the parameters of our AR-NN model by maximum likelihood, making use of the assumptions made of $\varepsilon_t$ in Section 2. The use of maximum likelihood or quasi maximum likelihood makes it possible to obtain an idea of the uncertainty in the parameter estimates through (asymptotic) standard deviation estimates. This is not possible by using the above-mentioned algorithms. It may be argued that maximum likelihood estimation of neural network models is most likely to lead to convergence problems, and that penalizing the log-likelihood function one way or the other is a necessary precondition for satisfactory results. Two things can be said in favour of maximum likelihood here. First, in this paper model building proceeds from small to large models, so that estimation of unidentified or nearly unidentified models, a major reason for the need to penalize the log-likelihood, is avoided. Second, the starting-values of the parameter estimates are chosen carefully, and we discuss the details of this in Section 4.3.2.

The AR-NN model is similar to many linear or nonlinear time series models in that the information matrix of the logarithmic likelihood function is block diagonal in such a way that we can concentrate the likelihood and first estimate the parameters of the conditional mean. Thus conditional maximum likelihood is equivalent to nonlinear least squares. Using model (1) as our starting-point, we make the following assumptions:
(A.1) The $((r+1)\times 1)$ parameter vector $\tilde{\psi} = [\psi', \sigma^2]'$ is an interior point of the compact parameter space $\tilde{\Psi}$, which is a subspace of $R^r \times R_+$, where $R^r$ is the $r$-dimensional Euclidean space.

(A.2) The roots of the lag polynomial $1 - \sum_{j=1}^{p} \alpha_j z^j$ lie outside the unit circle. This is a sufficient condition for stationarity.

(A.3) The parameters satisfy the conditions $c_1 \le \cdots \le c_h$, $\gamma_i > 0$, $i = 1,\ldots,h$, and $\omega_{i1}$ is defined as in (9) for $i = 1,\ldots,h$.
Assumption (A.3) guarantees global identifiability of the model. The maximum likelihood estimator of the parameters of the conditional mean equals

$$\hat{\psi} = \arg\max_{\psi} Q_T(\psi) = \arg\min_{\psi} \frac{1}{2}\sum_{t=1}^{T} q_t(\psi),$$

where $q_t(\psi) = (y_t - G(x_t;\psi))^2$. Conditions for consistency and asymptotic normality of $\hat{\psi}$ are stated in the following theorem.

Theorem 1. Under the assumptions (A.1)-(A.3) the maximum likelihood estimator $\hat{\psi}$ is almost surely consistent for $\psi$ and

$$T^{1/2}(\hat{\psi} - \psi) \xrightarrow{d} N\left(0,\ \operatorname{plim}_{T\to\infty} A(\psi)^{-1}\right), \qquad (11)$$

where $A(\psi) = -\dfrac{1}{2T}\dfrac{\partial^2 Q_T(\psi)}{\partial\psi\,\partial\psi'}$.

Proof. Assumptions (A.1)-(A.3) together with the normality assumption of the errors satisfy the assumptions of Theorem 4.1 (almost sure consistency) in Trapletti et al. (2000). As every hidden unit in model (1) is a logistic function, assumptions 4.5 and 4.6 of Theorem 4.2 (asymptotic normality) in Trapletti et al. (2000) are satisfied as well, so that result (11) follows.
In this paper, we apply the heteroskedasticity-robust large sample estimator of the covariance matrix of $\hat{\psi}$ (White 1980)

$$\hat{B}(\hat{\psi}) = \left(\sum_{t=1}^{T}\hat{h}_t\hat{h}_t'\right)^{-1} \left(\sum_{t=1}^{T}\hat{\varepsilon}_t^2\,\hat{h}_t\hat{h}_t'\right) \left(\sum_{t=1}^{T}\hat{h}_t\hat{h}_t'\right)^{-1}, \qquad (12)$$

where $\hat{h}_t = \partial q_t(\psi)/\partial\psi\,|_{\psi=\hat{\psi}}$, and $\hat{\varepsilon}_t$ is the residual. In the estimation, the use of algorithms such as the Broyden-Fletcher-Goldfarb-Shanno (BFGS) or the Levenberg-Marquardt algorithms is strongly recommended; see, for example, Bertsekas (1995) for details about optimization algorithms. The choice of an appropriate line search procedure to select the step-length is another important question. Cubic or quadratic interpolation are usually reasonable choices. All the models in this paper are estimated with the Levenberg-Marquardt algorithm based on a cubic interpolation line search.
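The estimator (12) is a standard sandwich form and can be computed in a few lines. A sketch (our own, with `H` holding the vectors $\hat{h}_t$ as rows and `resid` the residuals $\hat{\varepsilon}_t$):

```python
import numpy as np

def sandwich_cov(H, resid):
    """Heteroskedasticity-robust covariance estimator (12).

    H     : (T, r) matrix whose rows are the score contributions h_t
    resid : (T,) residuals
    """
    bread = np.linalg.inv(H.T @ H)                 # (sum h_t h_t')^{-1}
    meat = (H * resid[:, None] ** 2).T @ H         # sum e_t^2 h_t h_t'
    return bread @ meat @ bread
```

With homoskedastic residuals of constant magnitude $c$, this collapses to $c^2\,(\sum_t \hat{h}_t\hat{h}_t')^{-1}$, which is a convenient sanity check.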
4.3.1 Concentrated Maximum Likelihood
In order to reduce the computational burden we can apply concentrated maximum likelihood to estimate $\psi$ as follows. Consider the $i$-th iteration and rewrite model (1) as

$$y = Z(\eta)\theta + \varepsilon, \qquad (13)$$

where $y' = [y_1, y_2,\ldots,y_T]$, $\varepsilon' = [\varepsilon_1,\varepsilon_2,\ldots,\varepsilon_T]$, $\theta' = [\alpha', \lambda_1,\ldots,\lambda_h]$, and

$$Z(\eta) = \begin{pmatrix} \tilde{x}_1' & F(\gamma_1(\omega_1'x_1 - c_1)) & \cdots & F(\gamma_h(\omega_h'x_1 - c_h)) \\ \vdots & \vdots & \ddots & \vdots \\ \tilde{x}_T' & F(\gamma_1(\omega_1'x_T - c_1)) & \cdots & F(\gamma_h(\omega_h'x_T - c_h)) \end{pmatrix},$$

with $\eta = [\gamma_1,\ldots,\gamma_h, \omega_1',\ldots,\omega_h', c_1,\ldots,c_h]'$. Assuming $\eta$ fixed (the value is obtained from the previous iteration), the parameter vector $\theta$ can be estimated analytically by

$$\hat{\theta} = \left(Z(\eta)'Z(\eta)\right)^{-1} Z(\eta)'y. \qquad (14)$$

The remaining parameters $\eta$ are estimated conditionally on $\hat{\theta}$ by applying the Levenberg-Marquardt algorithm, which completes the $i$-th iteration. This form of concentrated maximum likelihood was proposed by Leybourne, Newbold and Vougas (1998) in the context of STAR models. It substantially reduces the dimensionality of the iterative estimation problem, as instead of an inversion of a single large Hessian two smaller matrices are inverted, and a line search is only needed to obtain the $i$-th estimate of $\eta$.
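The concentrated step (14) is a single least-squares solve given the nonlinear parameters. A sketch under the reparametrization (8) (function and argument names are our own):

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def concentrated_theta(y, x, gamma, omega, c):
    """Given the nonlinear parameters eta = (gamma, omega, c), build the
    regressor matrix Z(eta) of (13) and solve (14) for
    theta = [alpha', lambda_1, ..., lambda_h]' by least squares.

    x : (T, q); gamma, c : (h,); omega : (h, q)
    """
    T = len(y)
    Z = np.column_stack([np.ones(T), x,
                         logistic(gamma * (x @ omega.T - c))])
    theta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return theta, Z
```

Using `lstsq` rather than forming $(Z'Z)^{-1}$ explicitly is the numerically preferable way to evaluate (14).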
4.3.2 Starting-Values

Many iterative optimization algorithms are sensitive to the choice of starting-values, and this is certainly so in the estimation of AR-NN models. Besides, an AR-NN model with $h$ hidden units contains $h$ parameters $\gamma_i$, $i = 1,\ldots,h$, that are not scale-free. Our first task is thus to rescale the input variables in such a way that they have a standard deviation equal to unity. In the univariate AR-NN case, this simply means normalizing $y_t$. If the model contains exogenous variables, they are normalized separately. This, together with the fact that $\|\omega_h\| = 1$, gives us a basis for discussing the choice of starting-values of $\gamma_i$, $i = 1,\ldots,h$. Another advantage of this is that, in the multivariate case, normalization generally makes numerical optimization easier as all variables have the same standard deviation. Assume now that we have estimated an AR-NN model with $h-1$ hidden units and want to estimate one with $h$ units. Our specific-to-general specification strategy has the consequence that this situation frequently occurs in practice. A natural choice of initial values for the estimation of parameters in the model with $h$ neurons is to use the final estimates for the parameters in the first $h-1$ hidden units and the linear unit. The starting-values for the parameters $\lambda_h$, $\gamma_h$, $\omega_h$ and $c_h$ in the $h$-th hidden unit are obtained in three steps as follows.
1. For $k = 1,\ldots,K$:

(a) Construct a vector $v_h^{(k)} = [v_{1h}^{(k)},\ldots,v_{qh}^{(k)}]'$ such that $v_{1h}^{(k)} \in (0,1]$ and $v_{jh}^{(k)} \in [-1,1]$, $j = 2,\ldots,q$. The values for $v_{1h}^{(k)}$ are drawn from a uniform $(0,1]$ distribution and the ones for $v_{jh}^{(k)}$, $j = 2,\ldots,q$, from a uniform $[-1,1]$ distribution.

(b) Define $\omega_h^{(k)} = v_h^{(k)}\|v_h^{(k)}\|^{-1}$, which guarantees $\|\omega_h^{(k)}\| = 1$.

(c) Let $c_h^{(k)} = \operatorname{med}(\omega_h^{(k)\prime} x)$, where $x = [x_1,\ldots,x_T]$.

2. Define a grid of $N$ positive values $\gamma_h^{(n)}$, $n = 1,\ldots,N$, for the slope parameter. This need not be done randomly. As changes in $\gamma_h$ have a small effect on the slope when $\gamma_h$ is large, only a small number of large values are required, and the grid should be finer at the low end of the half-space of $\gamma_h$-values.

3. For $k = 1,\ldots,K$ and $n = 1,\ldots,N$, estimate $\theta$ using (14) and compute the value of $Q_T(\psi)$ for each combination of starting-values. Choose those values of the parameters that minimize $\sum_{t=1}^{T} q_t(\psi)$.
After selecting the initial values of the $h$-th hidden unit we have to reorder the units, if necessary, in order to ensure that the identifying restrictions discussed in Section 3 are satisfied.

Typically, choosing $K = 1000$ and $N = 20$ ensures good initial estimates. We should stress, however, that $K$ is a nondecreasing function of the number of input variables. If the latter is large, we have to select a large $K$ as well.
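Steps 1-3 above can be sketched as follows. This is our own illustrative implementation: the $\gamma$ grid values and the default $K$ are placeholders, and $\theta$ is concentrated out with a least-squares solve as in (14).

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def draw_start(y, x, Z_prev, K=100,
               gammas=(0.5, 1.0, 2.0, 5.0, 10.0, 25.0, 50.0)):
    """Random search for starting-values of the h-th unit (Section 4.3.2).

    Z_prev : (T, n) regressors of the linear unit and the h-1 units
             already in the model; theta is concentrated out by OLS.
    gammas : grid for the slope, finer at the low end (step 2).
    Returns the best (gamma, omega, c) combination and its SSR.
    """
    T, q = x.shape
    rng = np.random.default_rng(1)
    best, best_ssr = None, np.inf
    for _ in range(K):
        v = np.concatenate([rng.uniform(0.0, 1.0, 1),      # v_1h in (0, 1]
                            rng.uniform(-1.0, 1.0, q - 1)])  # v_jh in [-1, 1]
        omega = v / np.linalg.norm(v)                       # ||omega|| = 1
        c = np.median(x @ omega)                            # c_h = med(omega' x)
        for g in gammas:
            Z = np.column_stack([Z_prev, logistic(g * (x @ omega - c))])
            theta, *_ = np.linalg.lstsq(Z, y, rcond=None)
            ssr = np.sum((y - Z @ theta) ** 2)
            if ssr < best_ssr:
                best, best_ssr = (g, omega, c), ssr
    return best, best_ssr
```

In practice $K$ and the $\gamma$ grid would be chosen as discussed in the text ($K = 1000$, $N = 20$, say).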
4.3.3 Potential Problem in the Estimation of the Slope Parameter
Finding the appropriate starting-values for the parameters is not the only difficult phase of the parameter estimation. It may also be difficult to obtain reasonably accurate estimates for those slope parameters $\gamma_i$, $i = 1,\ldots,h$, that are very large. This is the case unless the sample size is also large. To obtain an accurate estimate of a large $\gamma_i$ it is necessary to have a large number of observations in the neighbourhood of $c_i$. When the sample size is not very large, there are generally few observations sufficiently close to $c_i$ in the sample, which results in imprecise estimates of the slope parameter. This manifests itself in low absolute $t$-values for the estimates of $\gamma_i$. In such cases, the model builder cannot take a low absolute value of the $t$-statistic of the parameters of the transition function as evidence for omitting the hidden unit in question. Another reason for not doing so is that the $t$-value does not have its customary interpretation as a value of an asymptotically $t$-distributed statistic. This is due to an identification problem to be discussed in the next subsection. For more discussion see, for example, Bates and Watts (1988, p. 87) or Teräsvirta (1994).
4.4 Determining the Number of Hidden Units
The number of hidden units included in an ANN model is usually determined from the data. A popular method for doing that is pruning, in which a model with a large number of hidden units is estimated first, and the size of the model is subsequently reduced by applying an appropriate technique such as cross-validation. Another technique used in this connection is regularization, which may be characterized as penalized maximum likelihood or least squares applied to the estimation of neural network models; for discussion see, for example, Fine (1999, pp. 215-221). Bayesian regularization is an example of such a technique. The use of regularization techniques precludes the possibility of applying methods of statistical inference to the estimated model.

As discussed in the Introduction, another possibility is to begin with a small model and sequentially add hidden units to the model; for discussion see, for example, Fine (1999, pp. 232-233), Anders and Korn (1999), or Swanson and White (1995, 1997a,b). The decision of adding another hidden neuron is often based on the use of model selection criteria (MSC) or cross-validation. This has the following drawback. Suppose the data have been generated by an AR-NN model with $h$ hidden units. Applying an MSC to decide whether or not another hidden unit should be added to the model requires estimation of a model with $h+1$ hidden neurons. In this situation, however, the larger model is not identified and its parameters cannot be estimated consistently. This is likely to cause numerical problems in maximum likelihood estimation. Besides, even when convergence is achieved, lack of identification causes a severe problem in interpreting the MSC. The NN model with $h$ hidden units is nested in the model with $h+1$ units. A typical MSC comparison of the two models is then equivalent to a likelihood ratio test of $h$ units against $h+1$ ones; see, for example, Teräsvirta and Mellin (1986) for discussion. The choice of MSC determines the (asymptotic) significance level of the test. But then, when the larger model is not identified under the null hypothesis, the likelihood ratio statistic does not have its customary asymptotic $\chi^2$ distribution when the null holds. For more discussion of the general situation of a model only being identified under the alternative hypothesis, see, for example, Davies (1977, 1987) or Hansen (1996). In the AR-NN case, this lack of identification shows as an ambiguity in determining the size of the penalty. An AR-NN model with $h+1$ hidden units can be reduced to one with $h$ units by setting $\lambda_{h+1} = 0$, which suggests that the number of degrees of freedom in the penalty term should equal one. On the other hand, the $(h+1)$-th hidden unit can also be eliminated by setting $\omega_{h+1} = 0$. This in turn suggests that the number of degrees of freedom should be $\dim(x_t)$. In practice, the most common choice seems to equal $\dim(x_t) + 2$.
We shall also select the hidden units sequentially starting from a small model, in fact from a linear one, but circumvent the identification problem in a way that enables us to control the significance level of the tests in the sequence and thus also the overall significance level of the acceptance of the null hypothesis. Assume now that our AR-NN model (1) contains $h+1$ hidden units and write it as follows:

$$y_t = \alpha'\tilde{x}_t + \sum_{i=1}^{h}\lambda_i F(\gamma_i(\omega_i'x_t - c_i)) + \lambda_{h+1}F(\gamma_{h+1}(\omega_{h+1}'x_t - c_{h+1})) + \varepsilon_t. \qquad (15)$$

Assume further that we have accepted the hypothesis of model (15) containing $h$ hidden units and want to test for the $(h+1)$-th hidden unit. The appropriate null hypothesis is

$$H_0: \gamma_{h+1} = 0, \qquad (16)$$

whereas the alternative is $H_1: \gamma_{h+1} > 0$. Under (16), the $(h+1)$-th hidden unit is identically equal to a constant and merges with the intercept in the linear unit.

We assume that under (16) the assumptions of Theorem 1 hold so that the parameters of (15) can be estimated consistently. Model (15) is only identified under the alternative so that, as discussed above, the standard asymptotic inference is not available. This problem is circumvented as in Luukkonen, Saikkonen and Teräsvirta (1988) by expanding the $(h+1)$-th hidden unit into a Taylor series around the null hypothesis (16). Using a third-order Taylor expansion, rearranging and merging terms results in the following model:

$$y_t = \alpha'\tilde{x}_t + \sum_{i=1}^{h}\lambda_i F(\gamma_i(\omega_i'x_t - c_i)) + \sum_{i=1}^{q}\sum_{j=i}^{q}\theta_{ij}x_{i,t}x_{j,t} + \sum_{i=1}^{q}\sum_{j=i}^{q}\sum_{k=j}^{q}\theta_{ijk}x_{i,t}x_{j,t}x_{k,t} + \varepsilon_t^*, \qquad (17)$$

where $\varepsilon_t^* = \varepsilon_t + \lambda_{h+1}R(x_t)$; $R(x_t)$ is the remainder. It can be shown that $\theta_{ij} = \gamma_{h+1}^2\tilde{\theta}_{ij}$, $\tilde{\theta}_{ij} \ne 0$, $i = 1,\ldots,q$, $j = i,\ldots,q$, and $\theta_{ijk} = \gamma_{h+1}^3\tilde{\theta}_{ijk}$, $\tilde{\theta}_{ijk} \ne 0$, $i = 1,\ldots,q$, $j = i,\ldots,q$, $k = j,\ldots,q$. Thus the null hypothesis becomes $H_0': \theta_{ij} = 0$, $i = 1,\ldots,q$, $j = i,\ldots,q$; $\theta_{ijk} = 0$, $i = 1,\ldots,q$, $j = i,\ldots,q$, $k = j,\ldots,q$. Note that under $H_0$, $\varepsilon_t^* = \varepsilon_t$, so that the properties of the error process remain unchanged under the null hypothesis. Finally, it may be pointed out that one may also view (17) as the auxiliary model underlying the test. The log-likelihood function for observation $t$ corresponding to (17) is

$$l_t = -\frac{1}{2}\ln(2\pi) - \frac{1}{2}\ln\sigma^2 - \frac{1}{2\sigma^2}\Big(y_t - \alpha'\tilde{x}_t - \sum_{i=1}^{h}\lambda_i F(\gamma_i(\omega_i'x_t - c_i)) - \sum_{i=1}^{q}\sum_{j=i}^{q}\theta_{ij}x_{i,t}x_{j,t} - \sum_{i=1}^{q}\sum_{j=i}^{q}\sum_{k=j}^{q}\theta_{ijk}x_{i,t}x_{j,t}x_{k,t}\Big)^2. \qquad (18)$$
We make the following assumption to accompany the previous assumptions (A.1)-(A.3):

(A.4) $E|x_{t,i}|^\delta < \infty$, $i = 1,\ldots,q$, for some $\delta > 6$.

This enables us to state the following well-known result:
Theorem 2. Under $H_0: \gamma_{h+1} = 0$ and assumptions (A.1)-(A.4), the LM type statistic

$$LM = \frac{1}{\hat{\sigma}^2}\sum_{t=1}^{T}\hat{\varepsilon}_t\hat{\nu}_t' \left\{\sum_{t=1}^{T}\hat{\nu}_t\hat{\nu}_t' - \sum_{t=1}^{T}\hat{\nu}_t\hat{h}_t'\left(\sum_{t=1}^{T}\hat{h}_t\hat{h}_t'\right)^{-1}\sum_{t=1}^{T}\hat{h}_t\hat{\nu}_t'\right\}^{-1}\sum_{t=1}^{T}\hat{\nu}_t\hat{\varepsilon}_t, \qquad (19)$$

where $\hat{\varepsilon}_t = y_t - G(x_t;\hat{\psi})$,

$$\hat{h}_t = \frac{\partial G(x_t;\psi)}{\partial\psi}\Big|_{\psi=\hat{\psi}} = \Big[\tilde{x}_t',\ F(\hat{\gamma}_1(\hat{\omega}_1'x_t - \hat{c}_1)),\ldots,F(\hat{\gamma}_h(\hat{\omega}_h'x_t - \hat{c}_h)),\ \hat{\lambda}_1\frac{\partial F(\hat{\gamma}_1(\hat{\omega}_1'x_t - \hat{c}_1))}{\partial\gamma_1},\ldots,\hat{\lambda}_h\frac{\partial F(\hat{\gamma}_h(\hat{\omega}_h'x_t - \hat{c}_h))}{\partial\gamma_h},\ \hat{\lambda}_1\frac{\partial F(\hat{\gamma}_1(\hat{\omega}_1'x_t - \hat{c}_1))}{\partial\omega_{12}},\ldots,\hat{\lambda}_1\frac{\partial F(\hat{\gamma}_1(\hat{\omega}_1'x_t - \hat{c}_1))}{\partial\omega_{1q}},\ldots,\hat{\lambda}_h\frac{\partial F(\hat{\gamma}_h(\hat{\omega}_h'x_t - \hat{c}_h))}{\partial\omega_{h2}},\ldots,\hat{\lambda}_h\frac{\partial F(\hat{\gamma}_h(\hat{\omega}_h'x_t - \hat{c}_h))}{\partial\omega_{hq}},\ \hat{\lambda}_1\frac{\partial F(\hat{\gamma}_1(\hat{\omega}_1'x_t - \hat{c}_1))}{\partial c_1},\ldots,\hat{\lambda}_h\frac{\partial F(\hat{\gamma}_h(\hat{\omega}_h'x_t - \hat{c}_h))}{\partial c_h}\Big]'$$

with

$$\frac{\partial F(\hat{\gamma}_i(\hat{\omega}_i'x_t - \hat{c}_i))}{\partial\gamma_i} = (\hat{\omega}_i'x_t - \hat{c}_i)\left\{2\cosh\!\left[\hat{\gamma}_i(\hat{\omega}_i'x_t - \hat{c}_i)/2\right]\right\}^{-2}, \quad i = 1,\ldots,h,$$

$$\frac{\partial F(\hat{\gamma}_i(\hat{\omega}_i'x_t - \hat{c}_i))}{\partial\omega_{ij}} = \hat{\gamma}_i\, x_{j,t}\left\{2\cosh\!\left[\hat{\gamma}_i(\hat{\omega}_i'x_t - \hat{c}_i)/2\right]\right\}^{-2}, \quad i = 1,\ldots,h,\ j = 2,\ldots,q,$$

$$\frac{\partial F(\hat{\gamma}_i(\hat{\omega}_i'x_t - \hat{c}_i))}{\partial c_i} = -\hat{\gamma}_i\left\{2\cosh\!\left[\hat{\gamma}_i(\hat{\omega}_i'x_t - \hat{c}_i)/2\right]\right\}^{-2}, \quad i = 1,\ldots,h,$$

and $\hat{\nu}_t = [x_{1,t}^2,\ x_{1,t}x_{2,t},\ldots,x_{i,t}x_{j,t},\ldots,x_{1,t}^3,\ldots,x_{i,t}x_{j,t}x_{k,t},\ldots,x_{q,t}^3]'$, has an asymptotic $\chi^2$ distribution with $m = q(q+1)/2 + q(q+1)(q+2)/6$ degrees of freedom.
The test can also be carried out in stages as follows:

1. Estimate model (1) with $h$ hidden units. If the sample size is small and the model thus difficult to estimate, numerical problems in applying the maximum likelihood algorithm may lead to a solution such that the residual vector is not precisely orthogonal to the gradient matrix of $G(x_t;\hat\psi)$. This has an adverse effect on the empirical size of the test. To circumvent this problem, we regress the residuals $\hat\varepsilon_t$ on $\hat h_t$ and compute the sum of squared residuals $SSR_0=\sum_{t=1}^{T}\tilde\varepsilon_t^2$. The new residuals $\tilde\varepsilon_t$ are orthogonal to $\hat h_t$.

2. Regress $\tilde\varepsilon_t$ on $\hat h_t$ and $\hat\nu_t$. Compute the sum of squared residuals $SSR_1=\sum_{t=1}^{T}\hat v_t^2$.

3. Compute the $\chi^2$ statistic
\[
LM_{hn}^{\chi^2} = T\,\frac{SSR_0-SSR_1}{SSR_0}, \tag{20}
\]
or the $F$ version of the test
\[
LM_{hn}^{F} = \frac{(SSR_0-SSR_1)/m}{SSR_1/(T-n-m)}, \tag{21}
\]
where $n=(q+2)h+p+1$. Under $H_0$, $LM_{hn}^{\chi^2}$ has an asymptotic $\chi^2$ distribution with $m$ degrees of freedom and $LM_{hn}^{F}$ is approximately $F$-distributed with $m$ and $T-n-m$ degrees of freedom.
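The staged computation can be written down compactly. The following sketch is ours, not the authors' code; the function names are invented, and ordinary least squares via `numpy` is assumed for the auxiliary regressions. It takes the residuals, the gradient matrix $\hat h_t$ and the cross-product regressors $\hat\nu_t$, and returns the $F$ version (21):

```python
import numpy as np

def cross_products(x):
    """nu_t: distinct second- and third-order products of x_{1,t},...,x_{q,t};
    gives q(q+1)/2 + q(q+1)(q+2)/6 columns."""
    q = x.shape[1]
    cols = [x[:, i] * x[:, j] for i in range(q) for j in range(i, q)]
    cols += [x[:, i] * x[:, j] * x[:, k]
             for i in range(q) for j in range(i, q) for k in range(j, q)]
    return np.column_stack(cols)

def lm_f_test(resid, grad, nu):
    """F version of the LM-type test, eqs. (20)-(21), in three stages."""
    T, n = grad.shape
    m = nu.shape[1]
    # Stage 1: re-orthogonalise the residuals against the gradient.
    e = resid - grad @ np.linalg.lstsq(grad, resid, rcond=None)[0]
    ssr0 = np.sum(e ** 2)
    # Stage 2: regress the new residuals on [gradient, nu].
    Z = np.hstack([grad, nu])
    v = e - Z @ np.linalg.lstsq(Z, e, rcond=None)[0]
    ssr1 = np.sum(v ** 2)
    # Stage 3: approximately F(m, T - n - m) under the null.
    return ((ssr0 - ssr1) / m) / (ssr1 / (T - n - m))
```

Since the stage-2 regressor set nests the stage-1 set, $SSR_1\le SSR_0$ and the statistic is nonnegative by construction.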
The following cautionary remark is in order. If any $\hat\gamma_i$, $i=1,\ldots,h$, is very large, the gradient matrix becomes near-singular and the test statistic numerically unstable, which distorts the size of the test. The reason is that the vectors corresponding to the partial derivatives with respect to $\gamma_i$, $\omega_i$, and $c_i$, respectively, tend to be almost perfectly linearly correlated. This is due to the fact that the time series of those elements of the gradient resemble dummy variables, being constant most of the time and nonconstant simultaneously. The problem may be remedied by omitting these elements from the regression in step 2. This can be done without significantly affecting the value of the test statistic; see Eitrheim and Teräsvirta (1996) for discussion. This situation may also cause problems in obtaining standard deviation estimates for parameter estimates through the outer product matrix (12). The same remedy can be applied: omit the offending rows and columns of (12) and compute standard deviation estimates for the remaining parameter estimates only. In fact, omitting the rows and columns corresponding to the high values $\hat\gamma_i$ should suffice.
Testing zero hidden units against at least one is a special case of the above test. This amounts to testing linearity, and the test statistic is in this case identical to the one derived for testing linearity against the AR-NN model in Teräsvirta, Lin and Granger (1993). A natural alternative to our procedure is the one first suggested in White (1989) and investigated later in Lee, White and Granger (1993). In order to test the null hypothesis of $h$ hidden units, one adds $q$ hidden units to model (1) by randomly selecting the parameters $\tilde\omega_{h+j}$, $\gamma_{h+j}$, $j=1,\ldots,q$. If $q$ is large (a small value may not render the test powerful enough), a small number of principal components of the $q$ extra units may be used instead. This solves the identification problem, as the extra neurons or the transformed neurons are observable, and the null hypothesis $\lambda_{h+1}=\cdots=\lambda_{h+q}=0$, or its equivalent when the principal component approach is taken, can be tested using standard inference. When $h=0$, this technique also collapses into a linearity test; see Lee et al. (1993). Simulation results in Teräsvirta et al. (1993) and Anders and Korn (1999) indicate that the polynomial approximation method presented here compares well with White's approach, and it is applied in the rest of this work.
As mentioned in the Introduction, the normality assumption can be relaxed while the consistency and asymptotic normality of the (quasi) maximum likelihood estimators are retained. This is important, for example, in financial applications of the AR-NN model. In financial applications, at least in ones to high-frequency data such as intradaily, daily or even weekly series, the series typically contain conditional heteroskedasticity. This possibility can be accounted for by robustifying the tests against heteroskedasticity following Wooldridge (1990). A heteroskedasticity-robust version of the LM type test, based on the notion of robustifying statistic (18), can be carried out as follows:

1. As before.

2. Regress $\hat\nu_t$ on $\hat h_t$ and compute the residuals $r_t$.

3. Regress 1 on $\tilde\varepsilon_t r_t$ and compute the sum of squared residuals $SSR_1$. The test statistic is
\[
LM_{r}^{\chi^2} = T - SSR_1. \tag{22}
\]

The test statistic has the same asymptotic $\chi^2$ null distribution as before.
It should be noticed that in the case of conditional heteroskedasticity, the maximum likelihood estimates discussed in Section 4.3 are just quasi maximum likelihood estimates. Under regularity conditions they are still consistent and asymptotically normal.
4.5 Evaluation of the Estimated Model

After a model has been estimated it has to be evaluated. We propose two in-sample misspecification tests for this purpose. The first one tests for parameter constancy. The second one tests the assumption of no serial correlation in the errors and is an application of the results in Eitrheim and Teräsvirta (1996) and Godfrey (1988, pp. 112–121).

4.5.1 Test of Parameter Constancy

Testing parameter constancy is an important way of checking the adequacy of linear or nonlinear models. Many parameter constancy tests are tests against unspecified alternatives or a single structural break. In this section we present a parametric alternative to parameter constancy which allows the parameters to change smoothly as a function of time under the alternative hypothesis.

In the following we assume that the logistic function (8) has constant parameters, whereas both $\pi$ and $\lambda_i$, $i=1,\ldots,h$, may be subject to changes over time. This assumption is made mainly because changes in the parameters of the logistic function are more difficult to detect than changes in the linear parameters.
In order to derive the test, consider a model with time-varying parameters defined as
\[
y_t = \tilde G(x_t;\psi,\tilde\psi) + \varepsilon_t = \tilde\pi(t)'\tilde x_t + \sum_{i=1}^{h}\tilde\lambda_i(t)F(\gamma_i(\omega_i'x_t-c_i)) + \varepsilon_t, \tag{23}
\]
where
\[
\tilde\pi(t) = \pi + \xi F(\gamma(t-\tau)), \tag{24}
\]
and
\[
\tilde\lambda_i(t) = \lambda_i + \zeta_i F(\gamma(t-\tau)),\quad i=1,\ldots,h. \tag{25}
\]
The function $F$ in (24) and (25) is defined as in (2), and $\gamma>0$. The parameter vector $\psi$ is defined as before, and $\tilde\psi=[\xi',\zeta_1,\ldots,\zeta_h,\gamma,\tau]'$. The parameter $\gamma$ controls the smoothness of the monotonic change in the autoregressive parameters. When $\gamma\to\infty$, equations (23)–(25) represent a model with a single structural break at $t=\tau$. Combining (24) and (25) with (23), we have the following model:
\[
y_t = \{\pi + \xi F(\gamma(t-\tau))\}'\tilde x_t + \sum_{i=1}^{h}\{\lambda_i + \zeta_i F(\gamma(t-\tau))\}F(\gamma_i(\omega_i'x_t-c_i)) + \varepsilon_t. \tag{26}
\]
The null hypothesis of parameter constancy is
\[
H_0\colon \gamma = 0. \tag{27}
\]
Note that model (26) is only identified under the alternative $\gamma>0$. As is obvious from Section 4.4, a consequence of this complication is that the standard asymptotic distribution theory for the likelihood ratio or other classical test statistics for testing (27) is not available. To remedy this problem we expand $F(\gamma(t-\tau))$ into a first-order Taylor expansion around $\gamma=0$. This yields
\[
F(\gamma(t-\tau)) = \frac{1}{2} + \frac{1}{4}\gamma(t-\tau) + R(t;\gamma,\tau), \tag{28}
\]
where $R(t;\gamma,\tau)$ is the remainder. Replacing $F(\gamma(t-\tau))$ in (26) by (28) and reparameterizing, we obtain
\[
y_t = (\varphi_0 + \varphi_1 t)'\tilde x_t + \sum_{i=1}^{h}(\beta_i + \delta_i t)F(\gamma_i(\omega_i'x_t-c_i)) + \varepsilon_t^*, \tag{29}
\]
where $\varphi_0 = \pi + \frac{1}{2}\xi - \frac{1}{4}\gamma\tau\xi$, $\varphi_1 = \frac{1}{4}\gamma\xi$, $\beta_i = \lambda_i + \frac{1}{2}\zeta_i - \frac{1}{4}\gamma\tau\zeta_i$, $\delta_i = \frac{1}{4}\gamma\zeta_i$, $i=1,\ldots,h$, and $\varepsilon_t^* = \varepsilon_t$ plus terms involving the remainder $R(t;\gamma,\tau)$. The null hypothesis (27) becomes
\[
H_0'\colon \varphi_1 = 0,\; \delta_1 = \cdots = \delta_h = 0. \tag{30}
\]
Under $H_0'$, $R(t;\gamma,\tau)=0$ and $\varepsilon_t^*=\varepsilon_t$, so that standard asymptotic distribution theory is available. The local approximation to the $t$-th term in the normal log likelihood function in a neighbourhood of $H_0'$ for observation $t$, ignoring $R(t;\gamma,\tau)$, is
\[
l_t = -\frac{1}{2}\ln(2\pi) - \frac{1}{2}\ln\sigma^2 - \frac{1}{2\sigma^2}\Big\{y_t - (\varphi_0+\varphi_1 t)'\tilde x_t - \sum_{i=1}^{h}(\beta_i+\delta_i t)F(\gamma_i(\omega_i'x_t-c_i))\Big\}^2. \tag{31}
\]
The consistent estimators of the relevant partial derivatives of the log likelihood under the null hypothesis are
\[
\left.\frac{\partial \hat l_t}{\partial \varphi_0}\right|_{H_0'} = \frac{1}{\hat\sigma^2}\hat\varepsilon_t\tilde x_t \tag{32}
\]
\[
\left.\frac{\partial \hat l_t}{\partial \varphi_1}\right|_{H_0'} = \frac{1}{\hat\sigma^2}\hat\varepsilon_t t\tilde x_t \tag{33}
\]
\[
\left.\frac{\partial \hat l_t}{\partial \beta_i}\right|_{H_0'} = \frac{1}{\hat\sigma^2}\hat\varepsilon_t F(\hat\gamma_i(\hat\omega_i'x_t-\hat c_i)) \tag{34}
\]
\[
\left.\frac{\partial \hat l_t}{\partial \delta_i}\right|_{H_0'} = \frac{1}{\hat\sigma^2}\hat\varepsilon_t tF(\hat\gamma_i(\hat\omega_i'x_t-\hat c_i)) \tag{35}
\]
\[
\left.\frac{\partial \hat l_t}{\partial \gamma_i}\right|_{H_0'} = \frac{1}{\hat\sigma^2}\hat\varepsilon_t\hat\lambda_i\frac{\partial F(\hat\gamma_i(\hat\omega_i'x_t-\hat c_i))}{\partial\gamma_i} \tag{36}
\]
\[
\left.\frac{\partial \hat l_t}{\partial \omega_i'}\right|_{H_0'} = \frac{1}{\hat\sigma^2}\hat\varepsilon_t\hat\lambda_i\frac{\partial F(\hat\gamma_i(\hat\omega_i'x_t-\hat c_i))}{\partial\omega_i'} \tag{37}
\]
\[
\left.\frac{\partial \hat l_t}{\partial c_i}\right|_{H_0'} = \frac{1}{\hat\sigma^2}\hat\varepsilon_t\hat\lambda_i\frac{\partial F(\hat\gamma_i(\hat\omega_i'x_t-\hat c_i))}{\partial c_i} \tag{38}
\]
where $i=1,\ldots,h$, $\hat\sigma^2=(1/T)\sum_{t=1}^{T}\hat\varepsilon_t^2$, and $\hat\varepsilon_t = y_t - G(x_t;\hat\psi) = y_t - \hat\pi'\tilde x_t - \sum_{i=1}^{h}\hat\lambda_i F(\hat\gamma_i(\hat\omega_i'x_t-\hat c_i))$ are
the residuals estimated under the null hypothesis. The LM statistic is (19) with $\hat h_t = \partial G(x_t;\psi)/\partial\psi|_{\psi=\hat\psi}$ and $\hat\nu_t = [t\tilde x_t',\, tF(\hat\gamma_1(\hat\omega_1'x_t-\hat c_1)),\ldots,tF(\hat\gamma_h(\hat\omega_h'x_t-\hat c_h))]'$. Testing the hypothesis that only a subset of the coefficients is constant is also possible in this framework. Under $H_0$, statistic (19) has an asymptotic $\chi^2$ distribution with $q+h+1$ degrees of freedom. This is the case even if $\hat\nu_t$ contains elements dominated by a deterministic trend; see Lin and Teräsvirta (1994) for details.

The test can be carried out in stages as before. The only differences are the new definition of $\hat\nu_t$ at stage 2 and the degrees of freedom in the $\chi^2$ or $F$ test. As before, use of the $F$ version of the test is recommended.
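The auxiliary regressors for the parameter-constancy test are straightforward to build, and the general case below (powers of the trend up to $K$, anticipating the generalization that follows) reuses the staged regression from Section 4.4. This is our own sketch; the rescaling of the trend to $(0,1]$ is an assumption made for numerical stability and does not affect the test:

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def constancy_regressors(x_tilde, x, gammas, omegas, cs, K=1):
    """Auxiliary regressors nu_t for the parameter-constancy test:
    [t^k * x_tilde_t', t^k * F_1, ..., t^k * F_h], k = 1..K,
    giving K*(q+h+1) columns when x_tilde has q+1 columns (intercept + q lags)."""
    T = x.shape[0]
    # hidden-unit outputs evaluated at the estimated parameters
    F = np.column_stack([logistic(g * (x @ w - c))
                         for g, w, c in zip(gammas, omegas, cs)])
    base = np.hstack([x_tilde, F])
    t = np.arange(1, T + 1) / T   # rescaled trend (our choice, for stability)
    return np.hstack([base * t[:, None] ** k for k in range(1, K + 1)])
```

For $K=1$ the number of columns, $q+h+1$, matches the degrees of freedom quoted above.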
The alternative to parameter constancy may be made more general simply by defining
\[
F(\gamma(t-\tau)) = \left(1+\exp\Big\{-\gamma\prod_{k=1}^{K}(t-\tau_k)\Big\}\right)^{-1},\quad \gamma>0, \tag{39}
\]
with $K\ge 1$. Transition function (39) with $K\ge 2$ allows the parameters to change nonmonotonically over time, in contrast to the case $K=1$. Consequently, in the test statistic (19), $\hat\nu_t=[\hat\nu_{1t}',\ldots,\hat\nu_{Kt}']'$, where
\[
\hat\nu_{kt}' = \big[t^k\tilde x_t',\; t^kF(\hat\gamma_1(\hat\omega_1'x_t-\hat c_1)),\ldots,t^kF(\hat\gamma_h(\hat\omega_h'x_t-\hat c_h))\big],\quad k=1,\ldots,K,
\]
and the number of degrees of freedom in the test statistic is adjusted accordingly. In this paper we report results for $K=1,2,3$, and call the corresponding test statistics $LM_K$, $K=1,2,3$, respectively. If the error process is heteroskedastic, a robust version of the test, immediately obvious from Section 4.4, has to be employed instead of the standard test. When the model is assumed not to contain any hidden units, $\tilde\lambda_i(t)\equiv 0$, $i=1,\ldots,h$, the test collapses into the parameter constancy test in Lin and Teräsvirta (1994).
4.5.2 Test of Serial Independence

Assume that the errors in equation (1) follow an $r$-th order autoregressive process defined as
\[
\varepsilon_t = \alpha'\nu_t + u_t, \tag{40}
\]
where $\alpha=[\alpha_1,\ldots,\alpha_r]'$ is a parameter vector, $\nu_t=[\varepsilon_{t-1},\ldots,\varepsilon_{t-r}]'$, and $u_t\sim \mathrm{NID}(0,\sigma^2)$. We assume that under $H_0$, assumptions (A.1)–(A.3) hold. Consider the null hypothesis $H_0\colon \alpha=0$ against the alternative $H_1\colon \alpha\neq 0$. The conditional normal log-likelihood of (1) with (40) for observation $t$, given the fixed starting values, has the form
\[
l_t = -\frac{1}{2}\ln(2\pi) - \frac{1}{2}\ln\sigma^2 - \frac{1}{2\sigma^2}\Big\{y_t - \sum_{j=1}^{r}\alpha_j y_{t-j} - G(x_t;\psi) + \sum_{j=1}^{r}\alpha_j G(x_{t-j};\psi)\Big\}^2. \tag{41}
\]
The first partial derivatives of the normal log-likelihood for observation $t$ with respect to $\alpha$ and $\psi$ are
\[
\frac{\partial l_t}{\partial\alpha_j} = \frac{u_t}{\sigma^2}\{y_{t-j} - G(x_{t-j};\psi)\},\quad j=1,\ldots,r,
\]
\[
\frac{\partial l_t}{\partial\psi} = \frac{u_t}{\sigma^2}\Big\{\frac{\partial G(x_t;\psi)}{\partial\psi} - \sum_{j=1}^{r}\alpha_j\frac{\partial G(x_{t-j};\psi)}{\partial\psi}\Big\}. \tag{42}
\]
Under the null hypothesis, the consistent estimators of the score are
\[
\sum_{t=1}^{T}\left.\frac{\partial\hat l_t}{\partial\alpha}\right|_{H_0} = \frac{1}{\hat\sigma^2}\sum_{t=1}^{T}\hat\varepsilon_t\hat\nu_t \quad\text{and}\quad \sum_{t=1}^{T}\left.\frac{\partial\hat l_t}{\partial\psi}\right|_{H_0} = \frac{1}{\hat\sigma^2}\sum_{t=1}^{T}\hat\varepsilon_t\hat h_t,
\]
where $\hat\nu_t'=[\hat\varepsilon_{t-1},\ldots,\hat\varepsilon_{t-r}]$, $\hat\varepsilon_{t-j}=y_{t-j}-G(x_{t-j};\hat\psi)$, $j=1,\ldots,r$, $\hat h_t=\partial G(x_t;\hat\psi)/\partial\psi$, and $\hat\sigma^2=(1/T)\sum_{t=1}^{T}\hat\varepsilon_t^2$. The LM statistic is (19) with $\hat h_t$ and $\hat\nu_t$ defined as above, and it has an asymptotic $\chi^2$ distribution with $r$ degrees of freedom under the null hypothesis. For details, see Godfrey (1988, pp. 112–121).
The test can be performed in three stages as shown before. It may be pointed out that the Ljung-Box test or its asymptotically equivalent counterpart, the Box-Pierce test, both recommended for use in connection with NN models by Zapranis and Refenes (1999), are not available here. Their asymptotic null distribution is unknown when the estimated model is an AR-NN model.
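The serial-independence test again reduces to the staged auxiliary regression, now with lagged residuals as the extra regressors. The sketch below is ours (the zero-padding of pre-sample residuals is a common convention we assume, not a detail taken from the paper) and returns the $\chi^2$ form of the statistic:

```python
import numpy as np

def lm_serial(resid, grad, r):
    """LM test of no r-th order error autocorrelation in an estimated AR-NN
    model; asymptotically chi-squared with r degrees of freedom under H0."""
    T = len(resid)
    # nu_t = (eps_{t-1}, ..., eps_{t-r}); pre-sample residuals set to zero
    nu = np.column_stack([np.r_[np.zeros(j), resid[:T - j]]
                          for j in range(1, r + 1)])
    # Stage 1: orthogonalise the residuals against the gradient.
    e = resid - grad @ np.linalg.lstsq(grad, resid, rcond=None)[0]
    ssr0 = np.sum(e ** 2)
    # Stage 2: regress on [gradient, lagged residuals].
    Z = np.hstack([grad, nu])
    v = e - Z @ np.linalg.lstsq(Z, e, rcond=None)[0]
    ssr1 = np.sum(v ** 2)
    # Stage 3: chi-squared form, eq. (20).
    return T * (ssr0 - ssr1) / ssr0
```

As in Section 4.4, an $F$ version with $r$ and $T-n-r$ degrees of freedom can be computed from the same two sums of squares.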
4.6 The Modelling Strategy

At this point we are ready to combine the above statistical ingredients into a coherent modelling strategy. We first define the potential variables (lags) and select a subset of them applying the variable selection technique considered in Section 4.2. After selecting the variables we select the number of hidden units sequentially. We begin by testing linearity against a single hidden unit as described in Section 4.4 at significance level $\alpha$. The model under the null hypothesis is simply a linear AR($p$) model. If the null hypothesis is not rejected, the AR model is accepted. In case of a rejection, an AR-NN model with a single unit is estimated and tested against a model with two hidden units at the significance level $\varrho\alpha$, $0<\varrho<1$. Another rejection leads to estimating a model with two hidden units and testing it against a model with three hidden neurons at the significance level $\varrho^2\alpha$. The sequence is terminated at the first acceptance of the null hypothesis. The significance level is reduced at each step of the sequence and converges to zero. In the applications in Section 6, we use $\varrho=1/2$. This way we avoid excessively large models and control the overall significance level of the procedure. An upper bound $\tilde\alpha$ for the overall significance level may be obtained using the Bonferroni bound. For example, if $\alpha=0.1$ and $\varrho=1/2$, then $\tilde\alpha\le 0.187$. Note that if we instead of our LM type test apply a model selection criterion such as AIC or SBIC to this sequence, we in fact use the same significance level at each step. Besides, the upper bound that can be worked out in the linear case (see, for example, Teräsvirta and Mellin (1986)) remains unknown due to the identification problem mentioned above.
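The quoted upper bound can be checked numerically: the probability of at least one false rejection in the sequence of independent tests at levels $\alpha, \varrho\alpha, \varrho^2\alpha,\ldots$ is $1-\prod_{k\ge 0}(1-\alpha\varrho^k)$. A minimal sketch (ours, with an invented function name):

```python
def overall_bound(alpha, rho, steps=200):
    """Upper bound on the overall significance level of the sequential
    hidden-unit tests run at levels alpha, rho*alpha, rho^2*alpha, ..."""
    p_no_rejection = 1.0
    for k in range(steps):
        p_no_rejection *= 1.0 - alpha * rho ** k
    return 1.0 - p_no_rejection

print(round(overall_bound(0.10, 0.5), 3))  # 0.187
```

With $\alpha=0.1$ and $\varrho=1/2$ this reproduces the bound of about 0.187 stated in the text.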
In following the above path we have indeed assumed that all hidden neurons contain the variables that are originally selected for the AR-NN model. Another variant of the strategy is one in which the variables in each hidden unit are chosen individually from the set of originally selected variables. In the present context this may be done, for example, by considering the estimated parameter vector $\hat\omega_h$ of the most recently added hidden neuron, removing the variables whose coefficients have the lowest $t$-values and re-estimating the model. Anders and Korn (1999) recommended this alternative. It has the drawback that the computational burden may become high, as frequent estimation of neural network models may be involved. Because of this we suggest another technique that combines sequential testing for hidden units and variable selection. Consider equation (15). Instead of just testing a single null hypothesis, as is done with (15), we can do the following. First, remove one of the variables from the $(h+1)$-th hidden unit and test the model with $h$ hidden units against this reduced alternative. Remove each variable in turn and carry out the test. Continue by removing two variables at a time. Finally, test the model with $h$ neurons against the alternatives in which the $(h+1)$-th unit only contains a single variable and an intercept. Find the combination of variables for which the $p$-value of the test is minimized. If this $p$-value is lower than a prescribed value, the "significance level", add the $(h+1)$-th unit with the corresponding variables to the model. Otherwise accept the AR-NN model with $h$ hidden units and stop. This way of selecting the variables for each hidden unit is analogous to the variable selection technique discussed in Section 4.2.
Compared to our first strategy, this one adds to the flexibility and on average leads to more parsimonious models than the other one. On the other hand, as every choice of hidden unit involves a possibly large number of tests, we do not control the significance level of the overall hidden unit test. We do that, albeit conditionally on the variables selected, if the set of input variables is determined once and for all before choosing the number of hidden units.

Evaluation following the estimation of the final model is carried out by subjecting the model to the misspecification tests discussed in Section 4.5. If the model does not pass the tests, the model builder has to reconsider the specification. For example, important lags (or, in the more general case, exogenous variables) may be missing from the model. If parameter constancy is rejected, model (23) may be estimated and used for forecasting, but reconsidering the whole specification may often be the more sensible option of the two.

Another way of evaluating the model is out-of-sample forecasting. As AR-NN models are most often constructed for forecasting purposes, this is important. This part of the model evaluation is carried out by saving the last observations in the series for forecasting and comparing the forecast results with those from at least one benchmark model. Note, however, that the results are dependent on the observations contained in that particular prediction period and may not allow very general conclusions about the properties of the estimated model. They are also conditional on the structure of the model remaining unchanged over the forecasting period, which may not necessarily be the case. For more discussion about this, see Clements and Hendry (1999, Chapter 2).
It is useful to compare our modelling strategy with other bottom-up approaches available in the literature. Swanson and White (1995), Swanson and White (1997a), and Swanson and White (1997b) apply the SBIC model selection criterion (Schwarz 1978) as follows. They start with a linear model, adding potential variables to it until SBIC indicates that the model cannot be further improved. Then they estimate models with a single hidden unit and select regressors sequentially to it, one by one, until SBIC shows no further improvement. Next, Swanson and White add another hidden unit and proceed by adding variables to it. The selection process is terminated when SBIC indicates that no more hidden units or variables should be added, or when a predetermined maximum number of hidden units has been reached. This modelling strategy can be termed fully sequential.

Anders and Korn (1999) essentially adopt the procedure of Teräsvirta and Lin (1993) described in Section 4.4 for selecting the number of hidden units. After estimating the largest model they suggest proceeding from general-to-specific by sequentially removing those variables from hidden units whose parameter estimates have the lowest ($t$-test) $p$-values. Note that this presupposes parameterizing the hidden units as in (2), not as in (8) and (9).

Balkin and Ord (2000) select the ordered variables (lags) sequentially using a linear model and a forward stepwise regression procedure. If the $F$-test statistic of adding another lag obtains a value exceeding 2, this lag is added to the set of input variables. The number of variables selected also serves as a maximum number of hidden units. The authors suggest estimating all models from the one with a single hidden unit up to the one with the maximum number of neurons. The final choice is made using the Generalized Cross-Validation Criterion of Golub, Heath and Wahba (1979). The model for which the value of this model selection criterion is minimized is selected.

Refenes and Zapranis (1999) (see also Zapranis and Refenes (1999)) propose adding hidden units into the model sequentially (there is a flowchart in the paper indicating this). The number of units, however, is selected only after adding all units up to a predetermined maximum number, so that the procedure is not genuinely sequential. The choice is made by applying the Network Information Criterion (Murata, Yoshizawa and Amari 1994) that can be traced back to Stone (1977). The model is then pruned by removing redundant variables from the neurons and re-estimating the model. The authors also discuss misspecification testing of the estimated model, which also forms an integral part of our modelling procedure. They suggest, for example, that the hypothesis of no error autocorrelation should be tested by the Ljung-Box or the asymptotically equivalent Box-Pierce test. Unfortunately, these tests do not have their customary asymptotic null distribution when the estimated model is an AR-NN model instead of a linear autoregressive one.

Of these strategies, the Swanson and White one is computationally the most intensive, as the number of steps involving the estimation of an NN model is large. Our procedure is in this respect the least demanding. The difference between our scheme and the Anders and Korn one is that in our strategy, variable selection does not require estimation of NN models because it is wholly based on LM type tests (the model is only estimated under the null hypothesis). Furthermore, there is a possibility of omitting certain potential variables before even estimating neural network models.

Like ours, the Swanson and White strategy is truly sequential: the modeller proceeds by considering nested models. The difference lies in how to compare two nested models in the sequence. Swanson and White apply SBIC, whereas Anders and Korn and we use LM type tests. The problems with the former technique have been discussed in Section 4.4. The problem of estimating unidentified models is still more acute in the approaches of Balkin and Ord and Refenes and Zapranis. Because these procedures require the estimation of NN models up to one containing a predetermined maximum number of hidden units, several estimated models may thus be unidentified. The problem is even more serious if statistical inference is applied in subsequent pruning, as the selected model may also be unidentified. The probability of this happening is smaller in the Anders and Korn case, in particular when the sequence of hidden unit tests has gradually decreasing significance levels.
5 Monte-Carlo Study

In this section we report results from two Monte Carlo experiments. The purpose of the first one is to illustrate some features of the NN model selection strategy described in Section 4.6 and compare it with the alternative in which model selection is carried out using an appropriate model selection criterion. In the second experiment, the performance of the misspecification tests of Section 4.5 is investigated. The data are generated by the model
\[
\begin{aligned}
y_t ={}& 0.10 + [0.75 - 0.2F^*(0.05(t-90))]y_{t-1} - 0.05y_{t-4} \\
&+ [-0.80 - 0.5F^*(0.05(t-90))]F(2.24(0.45y_{t-1} - 0.89y_{t-4} + 0.09)) \\
&+ [-0.70 + 2.0F^*(0.05(t-90))]F(1.12(0.44y_{t-1} + 0.89y_{t-4} + 0.35)) + \varepsilon_t, \\
\varepsilon_t ={}& \alpha\varepsilon_{t-1} + u_t,\quad u_t\sim \mathrm{NID}(0,\sigma^2),
\end{aligned} \tag{43}
\]
where $F^*$ is defined as in (39) with $K=1$ or $2$. However, in the first experiment the time-varying components are switched off and $\alpha=0$, so that the parameters of (43) are constant over time and the errors are serially uncorrelated. The number of observations in the first experiment is either 200 or 1000; in the second one we report results for 100 observations. In every replication, the first 500 observations are discarded to eliminate the initialization effects. The number of replications equals 500.
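As an illustration, the data-generating process (43) with $K=1$ (so that $F^*$ is the logistic transition) can be simulated as follows. This is our sketch of the design: the zero starting values, the way the constant-parameter case of the first experiment is obtained, and the alignment of the time index in $F^*$ with the retained sample are all assumptions:

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def simulate_43(T, alpha=0.0, sigma=1.0, time_varying=False, seed=0):
    """Generate T observations from model (43) after a 500-observation burn-in.

    time_varying=False switches off the F* components (our reading of the
    first experiment's design); alpha=0 removes the AR(1) error structure.
    """
    rng = np.random.default_rng(seed)
    burn = 500
    n = T + burn
    y = np.zeros(n + 4)              # four zero starting values
    eps = 0.0
    for t in range(4, n + 4):
        s = t - 3 - burn             # time index of the retained sample, 1..T
        Fs = logistic(0.05 * (s - 90)) if time_varying else 0.0
        eps = alpha * eps + sigma * rng.standard_normal()
        y[t] = (0.10 + (0.75 - 0.2 * Fs) * y[t - 1] - 0.05 * y[t - 4]
                + (-0.80 - 0.5 * Fs)
                * logistic(2.24 * (0.45 * y[t - 1] - 0.89 * y[t - 4] + 0.09))
                + (-0.70 + 2.0 * Fs)
                * logistic(1.12 * (0.44 * y[t - 1] + 0.89 * y[t - 4] + 0.35))
                + eps)
    return y[-T:]
```

The linear part of (43) is stationary (dominant root well inside the unit circle) and the hidden-unit outputs are bounded, so the simulated series remains well behaved.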
5.1 Architecture Selection

Results from simulating the modelling strategy can be found in Table 1. The table also contains results on choosing the number of hidden units using SBIC. This model selection criterion was chosen for the experiment because Swanson and White (1995, 1997a,b) applied it to this problem. In this case it is assumed that the model contains the correct variables. This is done in order to obtain an idea of the behaviour of SBIC free from the effects of an incorrectly selected model.

Different results can be obtained by varying the error variance, the size of the coefficients of the hidden units or "connection strengths" and, in the case of our strategy, the significance levels. In this experiment, the significance level is halved at every step, but other choices are of course possible. It seems, at least in the present experiment, that selecting the variables is easier than choosing the right number of hidden units. In small samples, there is a strong tendency to choose a linear model but, as can be expected, nonlinearity becomes more apparent with an increasing sample size. The larger initial significance level ($\alpha=0.10$) naturally leads to larger models on average than the smaller one ($\alpha=0.05$). Overfitting is relatively rare, but the results suggest, again not unexpectedly, that the initial significance level should be lowered when the number of observations increases. Finally, improving the signal-to-noise ratio improves the performance of our strategy.