No. 461
Building Neural Network Models for Time Series: A Statistical Approach
Marcelo C. Medeiros, Timo Teräsvirta, Gianluigi Rech
TEXTO PARA DISCUSSÃO
DEPARTAMENTO DE ECONOMIA, PUC-RIO
www.econ.puc-rio.br
Building Neural Network Models for Time Series: A Statistical Approach

Marcelo C. Medeiros
Department of Economics, Catholic University of Rio de Janeiro

Timo Teräsvirta
Department of Economic Statistics, Stockholm School of Economics

Gianluigi Rech
Quantitative Analysis, Electrabel, Louvain-la-Neuve, Belgium

August 21, 2002
Abstract

This paper is concerned with modelling time series by single hidden layer feedforward neural network models. A coherent modelling strategy based on statistical inference is presented. Variable selection is carried out using existing techniques. The problem of selecting the number of hidden units is solved by sequentially applying Lagrange multiplier type tests, with the aim of avoiding the estimation of unidentified models. Misspecification tests are derived for evaluating an estimated neural network model. A small-sample simulation experiment is carried out to show how the proposed modelling strategy works and how the misspecification tests behave in small samples. Two applications to real time series, one univariate and the other multivariate, are considered as well. Sets of one-step-ahead forecasts are constructed and forecast accuracy is compared with that of other nonlinear models applied to the same series.

Keywords: Model misspecification, neural computing, nonlinear forecasting, nonlinear time series, smooth transition autoregression, sunspot series, threshold autoregression, financial prediction.

JEL Classification Codes: C22, C51, C52, C61, G12
Acknowledgments: This research has been supported by the Tore Browaldh's Foundation. The research of the first author has been partially supported by CNPq. The paper is partly based on Chapter 2 of the PhD thesis of the third author. A part of the work was carried out during the visits of the first author to the Department of Economic Statistics, Stockholm School of Economics, and the second author to the Department of Economics, PUC-Rio. The hospitality of these departments is gratefully acknowledged. Material from this paper has been presented at the Fifth Brazilian Conference on Neural Networks, Rio de Janeiro, April 2001, the 20th International Symposium on Forecasting, Dublin, June 2002, and seminars at CORE (Louvain-la-Neuve), Monash University (Clayton, VIC), Swedish School of Economics (Helsinki), University of California, San Diego, Cornell University (Ithaca, NY), Federal University of Rio de Janeiro, and Catholic University of Rio de Janeiro. We wish to thank the participants of these occasions, Hal White in particular, for helpful comments. Our thanks also go to Chris Chatfield and Dick van Dijk for useful remarks, and Allan Timmermann for the data used in the second empirical example of the paper. The responsibility for any errors or shortcomings in the paper remains ours.

Address for correspondence: Timo Teräsvirta, Department of Economic Statistics, Stockholm School of Economics, Box 6501, SE-113 83 Stockholm, Sweden. E-mail: timo.terasvirta@hhs.se.
1 Introduction

Alternatives to linear models in econometric and time series modelling have increased in popularity in recent years. Nonparametric models that do not make assumptions about the parametric form of the functional relationship between the variables to be modelled have become more easily applicable due to computational advances and increased computational power. Another class of models, the flexible functional forms, offers an alternative that in fact also leaves the functional form of the relationship unspecified. While these models do contain parameters, often a large number of them, the parameters are not globally identified or, using the statistical terminology, estimable. Identification or estimability, if achieved, is local at best without additional parameter restrictions. The parameters are not interpretable either, as they often are in parametric models.

The artificial neural network (ANN) model is a prominent example of such a flexible functional form. It has found applications in a number of fields, including economics. Kuan and White (1994) surveyed the use of ANN models in (macro)economics, and several financial applications appeared in a recent special issue of IEEE Transactions on Neural Networks (Abu-Mostafa, Atiya, Magdon-Ismail and White 2001). The use of the ANN model in applied work is generally motivated by a mathematical result stating that, under mild regularity conditions, a relatively simple ANN model is capable of approximating any Borel-measurable function to any given degree of accuracy; see, for example, Funahashi (1989), Cybenko (1989), Hornik, Stinchcombe, and White (1989, 1990), White (1990), or Gallant and White (1992). Such an approximator would still contain a finite number of parameters. How to specify such a model, that is, how to find the right combination of parameters and variables, is a central topic in the ANN literature and has been considered in a large number of books, such as Bishop (1995), Ripley (1996), Fine (1999), Haykin (1999), or Reed and Marks II (1999), and articles. Many popular specification techniques are "general-to-specific" or "top-down" procedures: the investigator begins with a large model and applies appropriate algorithms to reduce the number of parameters using a predetermined stopping-rule. Such algorithms usually do not rely on statistical inference.
In this paper, we propose a coherent modelling strategy for simple single hidden-layer feedforward ANN time series models. The models discussed here are univariate, but adding exogenous variables to them is straightforward. The main difference between our approach and the general-to-specific approaches is that ours works in the opposite direction, from specific to general. We begin with a small model and expand that according to a set of predetermined rules. The reason for this is that we view our ANN model as a statistical nonlinear model and apply statistical inference to the problem of specifying the model or, as the ANN experts express it, finding the network architecture. We shall argue in the paper that proper statistical inference is not available if we choose to proceed from large models to smaller ones, from general to specific. Our "bottom-up" strategy builds partly on early work by Teräsvirta and Lin (1993). More recently, Anders and Korn (1999) presented a strategy that shares certain features with our procedure. Swanson and White (1995, 1997a, 1997b) also developed and applied a specific-to-general strategy that deserves mention here. Balkin and Ord (2000) proposed an inference-based method for selecting the number of hidden units in the ANN model. Zapranis and Refenes (1999) developed a computer-intensive strategy based on statistics to select the variables and the number of hidden units of the ANN model. Our aim has been to develop a strategy that minimizes the amount of computation required to reach the final specification and, furthermore, contains an in-sample evaluation of the estimated model. We shall consider the differences between our strategy and the other ones mentioned here in later sections of the paper.
The plan of the paper is as follows. Section 2 describes the model and Section 3 discusses geometric and statistical interpretations for it. A model specification strategy, consisting of specification, estimation, and evaluation of the model, is described in Section 4. The results concerning a Monte Carlo experiment are reported in Section 5 and two applications with real data sets are presented in Section 6. Section 7 contains concluding remarks.
2 The Autoregressive Neural Network Model
The AutoRegressive Neural Network (AR-NN) model is defined as

$$y_t = G(x_t;\psi) + \varepsilon_t = \alpha'\tilde{x}_t + \sum_{i=1}^{h} \lambda_i F(\tilde{\omega}_i' x_t - \beta_i) + \varepsilon_t, \qquad (1)$$

where $G(x_t;\psi)$ is a nonlinear function of the variables $x_t$ with parameter vector $\psi \in R^{q+1+h(q+2)}$ defined as $\psi = [\alpha', \lambda_1,\ldots,\lambda_h, \tilde{\omega}_1',\ldots,\tilde{\omega}_h', \beta_1,\ldots,\beta_h]'$. The vector $\tilde{x}_t \in R^{q+1}$ is defined as $\tilde{x}_t = [1, x_t']'$, where $x_t \in R^q$ is a vector of lagged values of $y_t$ and/or some exogenous variables. The function $F(\tilde{\omega}_i'x_t - \beta_i)$, often called the activation function, is the logistic function

$$F(\tilde{\omega}_i'x_t - \beta_i) = \left(1 + e^{-(\tilde{\omega}_i'x_t - \beta_i)}\right)^{-1}, \qquad (2)$$

where $\tilde{\omega}_i = [\tilde{\omega}_{1i},\ldots,\tilde{\omega}_{qi}]' \in R^q$ and $\beta_i \in R$, and the linear combination of these functions in (1) forms the so-called hidden layer. Model (1) with (2) does not contain lags of $\varepsilon_t$ and is therefore called a feedforward NN model. For other choices of the activation function, see Chen, Racine and Swanson (2001). Furthermore, $\{\varepsilon_t\}$ is a sequence of independently normally distributed random variables with zero mean and variance $\sigma^2$. The nonlinear function $F(\tilde{\omega}_i'x_t - \beta_i)$ is usually called a hidden neuron or a hidden unit. The normality assumption enables us to define the log-likelihood function, which is required for the statistical inference we need, but it can be relaxed.
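Model (1) with (2) is straightforward to evaluate numerically. The following sketch is not from the paper; the function and argument names (`ar_nn`, `alpha`, `lam`, `omega`, `beta`) are our own illustrative choices for $\alpha$, $\lambda_i$, $\tilde{\omega}_i$ and $\beta_i$.

```python
import numpy as np

def logistic(z):
    """Activation function F in (2)."""
    return 1.0 / (1.0 + np.exp(-z))

def ar_nn(x, alpha, lam, omega, beta):
    """Evaluate G(x_t; psi) of model (1) for a batch of input vectors.

    x     : (T, q) matrix whose rows are the regressor vectors x_t
    alpha : (q+1,) coefficients of the linear unit, applied to [1, x_t']'
    lam   : (h,) hidden-unit weights lambda_i
    omega : (h, q) rows are the weight vectors omega~_i
    beta  : (h,) location parameters beta_i
    """
    x_tilde = np.column_stack([np.ones(len(x)), x])
    linear = x_tilde @ alpha
    hidden = logistic(x @ omega.T - beta) @ lam   # sum_i lambda_i F(.)
    return linear + hidden
```

Noise $\varepsilon_t$ would be added on top of this skeleton when simulating from the model.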
3 Geometric and Statistical Interpretation
The characterization and tasks of a layer of hidden neurons in an AR-NN model are discussed in several textbooks such as Bishop (1995), Haykin (1999), Fine (1999), and Reed and Marks II (1999). It is nevertheless important to review some concepts in order to compare the AR-NN model with other well-known nonlinear time series models.
Consider the output $F(\tilde{\omega}_i'x_t - \beta_i)$ of a unit of the hidden layer of a neural network as defined in (1) and (2). When $\tilde{\omega}_i'x_t = \beta_i$, the parameters $\tilde{\omega}_i$ and $\beta_i$ define a hyperplane in a $q$-dimensional Euclidean space

$$H = \{x_t \in R^q \mid \tilde{\omega}_i'x_t = \beta_i\}. \qquad (3)$$

The direction of $\tilde{\omega}_i$ determines the orientation of the hyperplane and the scalar term $\beta_i/\|\tilde{\omega}_i\|$ the position of the hyperplane in terms of its distance from the origin. A hyperplane induces a partition of the space $R^q$ into two regions defined by the half-spaces

$$H^+ = \{x_t \in R^q \mid \tilde{\omega}_i'x_t \ge \beta_i\} \qquad (4)$$

$$H^- = \{x_t \in R^q \mid \tilde{\omega}_i'x_t < \beta_i\}, \qquad (5)$$

associated to the states of the neuron. With $h$ hyperplanes, a $q$-dimensional space will be split into several polyhedral regions. Each region is defined by the nonempty intersection of the half-spaces (4) and (5) of each hyperplane. Hyperplanes lying parallel to each other constitute a special case. In this situation, $\tilde{\omega}_i \equiv \tilde{\omega}$, $i = 1,\ldots,h$, and equation (1) becomes

$$y_t = \alpha'\tilde{x}_t + \sum_{i=1}^{h} \lambda_i F(\tilde{\omega}'x_t - \beta_i) + \varepsilon_t. \qquad (6)$$

The input space is thus split in $h+1$ regions.
Certain special cases of (1) are of interest. When $x_t = y_{t-d}$ in $F$, model (1) becomes a multiple logistic smooth transition autoregressive (MLSTAR) model with $h+1$ regimes in which only the intercept changes according to the regime. The resulting model is expressed as

$$y_t = \alpha'\tilde{x}_t + \sum_{i=1}^{h} \lambda_i F(\gamma_i(y_{t-d} - c_i)) + \varepsilon_t, \qquad (7)$$

where $\gamma_i = \tilde{\omega}_{i1}$ and $c_i = \beta_i/\tilde{\omega}_{i1}$. When $h = 1$, equation (7) defines a special case of an ordinary LSTAR model considered in Teräsvirta (1994). When $\gamma_i \to \infty$, $i = 1,\ldots,h$, model (7) becomes a Self-Exciting Threshold AutoRegressive (SETAR) model with a switching intercept and $h+1$ regimes. In another important special case, $x_t = t$ in $F$. In this situation the intercept of a linear model changes smoothly as a function of time. A linear model with $h$ structural shifts in the intercept is obtained as $\gamma_i \to \infty$, $i = 1,\ldots,h$.
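The SETAR limit can be illustrated numerically. In this small sketch (ours, not the paper's), the logistic transition $F(\gamma(y_{t-d} - c))$ approaches the indicator $I(y_{t-d} > c)$ as $\gamma$ grows:

```python
import numpy as np

F = lambda z: 1.0 / (1.0 + np.exp(-z))   # logistic activation (2)

y_lag = np.array([-2.0, -0.5, 0.5, 2.0])  # illustrative values of y_{t-d}
c = 0.0                                   # location parameter
for gamma in (1.0, 10.0, 50.0):
    # As gamma grows, the smooth transition sharpens towards a step:
    print(gamma, F(gamma * (y_lag - c)))
```

For moderate $\gamma$ the transition between regimes is smooth; for large $\gamma$ the output is essentially 0 below $c$ and 1 above it, which is the threshold (SETAR) behaviour described above.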
An AR-NN model can thus be either interpreted as a semi-parametric approximation to any Borel-measurable function or as an extension of the MLSTAR model where the transition variable can be a linear combination of stochastic variables. We should however stress the fact that model (1) is, in principle, neither globally nor locally identified. Three characteristics of the model imply non-identifiability. The first one is the exchangeability property of the AR-NN model: the value of the likelihood function of the model remains unchanged if we permute the hidden units. This results in $h!$ different models that are indistinguishable from each other and in $h!$ equal local maxima of the log-likelihood function. The second characteristic is the symmetry of the logistic function, $F(z) = 1 - F(-z)$, which yields two observationally equivalent parametrizations for each hidden unit. Finally, the presence of irrelevant hidden units is a problem. If model (1) has hidden units such that $\lambda_i = 0$ for at least one $i$, the parameters $\tilde{\omega}_i$ and $\beta_i$ remain unidentified. Conversely, if $\tilde{\omega}_i = 0$, then $\beta_i$ and $\lambda_i$ can take any value without the value of the likelihood function being affected.
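The symmetry-induced equivalence is easy to verify numerically. In this illustrative sketch (our own construction, not from the paper), flipping the signs of $(\tilde{\omega}_i, \beta_i)$, negating $\lambda_i$, and shifting the intercept by $\lambda_i$ leaves $G$ unchanged, because $F(z) = 1 - F(-z)$:

```python
import numpy as np

F = lambda z: 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 2))               # five sample points, q = 2
x_tilde = np.column_stack([np.ones(5), x])

alpha = np.array([0.3, -0.5, 0.2])        # linear unit
lam, omega, beta = 1.7, np.array([0.8, -0.6]), 0.4

# Original parametrization of a single hidden unit
g1 = x_tilde @ alpha + lam * F(x @ omega - beta)

# Observationally equivalent one: flip (omega, beta), negate lambda,
# and absorb lambda into the intercept, using F(z) = 1 - F(-z)
alpha2 = alpha.copy()
alpha2[0] += lam
g2 = x_tilde @ alpha2 + (-lam) * F(x @ (-omega) - (-beta))

print(np.allclose(g1, g2))   # both parameter vectors give identical G
```

This is exactly why sign restrictions on the weights (or, later, $\gamma_i > 0$) are needed for identification.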
The first problem is solved by imposing, say, the restrictions $\beta_1 \le \cdots \le \beta_h$ or $\lambda_1 \ge \cdots \ge \lambda_h$. The second source of underidentification can be circumvented, for example, by imposing the restrictions $\tilde{\omega}_{1i} > 0$, $i = 1,\ldots,h$. To remedy the third problem, it is necessary to ensure that the model contains no irrelevant hidden units. This difficulty is dealt with by applying statistical inference in model specification; see Section 4. For further discussion of the identifiability of ANN models see, for example, Sussman (1992), Kurkova and Kainen (1994), Hwang and Ding (1997), and Anders and Korn (1999).
For estimation purposes it is often useful to reparametrize the logistic function (2) as

$$F\left(\gamma_i(\omega_i'x_t - c_i)\right) = \left(1 + e^{-\gamma_i(\omega_i'x_t - c_i)}\right)^{-1}, \qquad (8)$$

where $\gamma_i > 0$, $i = 1,\ldots,h$, and $\|\omega_i\| = 1$ with

$$\omega_{i1} = \sqrt{1 - \sum_{j=2}^{q} \omega_{ij}^2} > 0, \quad i = 1,\ldots,h. \qquad (9)$$

The parameter vector of model (1) becomes

$$\psi = [\alpha', \lambda_1,\ldots,\lambda_h, \gamma_1,\ldots,\gamma_h, \omega_{12},\ldots,\omega_{1q},\ldots,\omega_{h2},\ldots,\omega_{hq}, c_1,\ldots,c_h]'.$$

In this case the first two identifying restrictions discussed above can be defined as, first, $c_1 \le \cdots \le c_h$ or $\lambda_1 \ge \cdots \ge \lambda_h$ and, second, $\gamma_i > 0$, $i = 1,\ldots,h$. We shall return to this parameterization in Section 4.3, when maximum likelihood estimation of the parameters is considered.
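Condition (9) is straightforward to impose in code. A minimal sketch (our own helper, assuming the free parameters are $\omega_{i2},\ldots,\omega_{iq}$ and that their squared sum does not exceed one):

```python
import numpy as np

def full_omega(omega_free):
    """Recover omega_i1 from the free elements omega_i2, ..., omega_iq
    via (9), so that ||omega_i|| = 1 with a positive first element."""
    w1 = np.sqrt(1.0 - np.sum(omega_free ** 2))
    return np.concatenate([[w1], omega_free])
```

For example, `full_omega(np.array([0.6]))` yields the unit-norm vector `[0.8, 0.6]`.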
4 The Modelling Strategy

4.1 Three Stages of Model Building
As mentioned in the Introduction, our aim is to construct a coherent strategy for building AR-NN models using statistical inference. The structure or architecture of an AR-NN model has to be determined from the data. We call this stage specification of the model, and it involves two sets of decision problems. First, the lags or variables to be included in the model have to be selected. Second, the number of hidden units has to be determined. Choosing the correct number of hidden units is particularly important, as selecting too many neurons yields an unidentified model. In this work, the lag structure or the variables included in the model are determined using well-known variable selection techniques. The specification stage of NN modelling also requires estimation because we suggest choosing the hidden units sequentially. After estimating a model with $h$ hidden units we shall test it against the one with $h+1$ hidden units and continue until the first acceptance of a null hypothesis. What follows thereafter is evaluation of the final estimated model to check if the final model is adequate. NN models are typically only evaluated out-of-sample, but in this paper we suggest in-sample misspecification tests for the purpose. Similar tests are routinely applied in evaluating STAR models (see Eitrheim and Teräsvirta (1996)), and in this work we adapt them to the AR-NN models. All this requires consistency and asymptotic normality for the estimators of parameters of the AR-NN model, conditions for which, based on results in Trapletti, Leisch and Hornik (2000), will be stated below.
We shall begin the discussion of our modelling strategy by considering variable selection. After dealing with that problem we turn to parameter estimation. Finally, after discussing statistical inference for selecting the hidden units and after presenting our in-sample model evaluation tools, we outline the modelling strategy as a whole.
4.2 Variable Selection
The first step in our model specification is to choose the variables for the model from a set of potential variables (lags in the pure AR-NN case). Several nonparametric variable selection techniques exist (Tschernig and Yang 2000, Vieu 1995, Tjøstheim and Auestad 1994, Yao and Tong 1994), but they tend to be computationally demanding when the number of observations is not small. In this paper variable selection is carried out by linearizing the model and applying well-known techniques of linear variable selection to this approximation. This keeps computational cost to a minimum. For this purpose we adopt the simple procedure proposed in Rech, Teräsvirta and Tschernig (2001). Their idea is to approximate the stationary nonlinear model by a polynomial of sufficiently high order. Adapted to the present situation, the first step is to approximate the function $G(x_t;\psi)$ in (1) by a general $k$-th order polynomial. By the Stone-Weierstrass theorem, the approximation can be made arbitrarily accurate if some mild conditions, such as the parameter space being compact, are imposed on the function $G(x_t;\psi)$.
Thus the AR-NN model, itself a universal approximator, is approximated by another function. This yields

$$G(x_t;\psi) = \theta'\tilde{x}_t + \sum_{j_1=1}^{q}\sum_{j_2=j_1}^{q} \theta_{j_1 j_2}\, x_{j_1,t}x_{j_2,t} + \cdots + \sum_{j_1=1}^{q}\cdots\sum_{j_k=j_{k-1}}^{q} \theta_{j_1 \cdots j_k}\, x_{j_1,t}\cdots x_{j_k,t} + R(x_t;\psi), \qquad (10)$$

where $R(x_t;\psi)$ is the approximation error that can be made negligible by choosing $k$ sufficiently high. The $\theta$'s are parameters, and $\theta \in R^{q+1}$ is a vector of parameters. The linear form of the approximation is independent of the number of hidden units in (1).
In equation (10), every product of variables involving at least one redundant variable has the coefficient zero. The idea is to sort out the redundant variables by using this property of (10). In order to do that, we first regress $y_t$ on all variables on the right-hand side of equation (10), assuming $R(x_t;\psi) \equiv 0$, and compute the value of a model selection criterion (MSC), AIC or SBIC for example. After doing that, we remove one variable from the original model, regress $y_t$ on all the remaining terms in the corresponding polynomial, and again compute the value of the MSC. This procedure is repeated by omitting each variable in turn. We continue by simultaneously omitting two regressors of the original model and proceed in that way until the polynomial is a function of a single regressor and, finally, just a constant. Having done that, we choose the combination of variables that yields the lowest value of the MSC. This amounts to estimating $\sum_{i=1}^{q}\binom{q}{i} + 1$ linear models; the lags and any exogenous variables of the ANN model are selected at the same time. Rech et al. (2001) showed that the procedure works well already in small samples when compared to well-known nonparametric techniques. Furthermore, it can be successfully applied even in large samples when nonparametric model selection becomes computationally infeasible.
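The subset search described above can be sketched as follows. This is our own illustrative implementation, not the authors' code; the helper names (`poly_design`, `select_variables`) and the use of SBIC are our choices, and the exhaustive loop mirrors the $\sum_{i=1}^{q}\binom{q}{i}+1$ regressions.

```python
import numpy as np
from itertools import combinations, combinations_with_replacement

def poly_design(X, order):
    """Constant, linear and cross-product terms up to the given order
    for all columns of X: the polynomial approximation in (10)."""
    T, q = X.shape
    cols = [np.ones(T)]
    for k in range(1, order + 1):
        for idx in combinations_with_replacement(range(q), k):
            cols.append(np.prod(X[:, idx], axis=1))
    return np.column_stack(cols)

def select_variables(y, X, order=3):
    """Return the index set of candidate variables minimizing SBIC
    over all subsets, using the polynomial approximation of G."""
    T, q = X.shape
    best, best_sbic = (), np.inf
    subsets = [()] + [s for r in range(1, q + 1)
                      for s in combinations(range(q), r)]
    for s in subsets:
        Z = poly_design(X[:, s], order) if s else np.ones((T, 1))
        coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
        sse = np.sum((y - Z @ coef) ** 2)
        sbic = T * np.log(sse / T) + Z.shape[1] * np.log(T)
        if sbic < best_sbic:
            best, best_sbic = s, sbic
    return best
```

Note that the search is exhaustive over all $2^q$ subsets, so it is practical only for a moderate number of candidate variables, which is the situation considered in the paper.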
4.3 Parameter Estimation
As selecting the number of hidden units requires estimation of neural network models, we now turn to this problem. A large number of algorithms for estimating the parameters of a NN model are available in the literature. In this paper we estimate the parameters of our AR-NN model by maximum likelihood, making use of the assumptions made of $\varepsilon_t$ in Section 2. The use of maximum likelihood or quasi maximum likelihood makes it possible to obtain an idea of the uncertainty in the parameter estimates through (asymptotic) standard deviation estimates. This is not possible by using the above-mentioned algorithms. It may be argued that maximum likelihood estimation of neural network models is most likely to lead to convergence problems, and that penalizing the log-likelihood function one way or the other is a necessary precondition for satisfactory results. Two things can be said in favour of maximum likelihood here. First, in this paper model building proceeds from small to large models, so that estimation of unidentified or nearly unidentified models, a major reason for the need to penalize the log-likelihood, is avoided. Second, the starting-values of the parameter estimates are chosen carefully, and we discuss the details of this in Section 4.3.2.

The AR-NN model is similar to many linear or nonlinear time series models in that the information matrix of the logarithmic likelihood function is block diagonal in such a way that we can concentrate the likelihood and first estimate the parameters of the conditional mean. Thus conditional maximum likelihood is equivalent to nonlinear least squares. Using model (1) as our starting-point, we make the following assumptions:
(A.1) The $((r+1)\times 1)$ parameter vector $\tilde{\psi} = [\psi', \sigma^2]'$ is an interior point of the compact parameter space $\tilde{\Psi}$, which is a subspace of $R^r \times R_+$, where $R^r$ is the $r$-dimensional Euclidean space.

(A.2) The roots of the lag polynomial $1 - \sum_{j=1}^{p} \alpha_j z^j$ lie outside the unit circle. This is a sufficient condition for stationarity.

(A.3) The parameters satisfy the conditions $c_1 \le \cdots \le c_h$, $\gamma_i > 0$, $i = 1,\ldots,h$, and $\omega_{i1}$ is defined as in (9) for $i = 1,\ldots,h$.
Assumption (A.3) guarantees global identifiability of the model. The maximum likelihood estimator of the parameters of the conditional mean equals

$$\hat{\psi} = \arg\max_{\psi} Q_T(\psi) = \arg\min_{\psi} \frac{1}{2}\sum_{t=1}^{T} q_t(\psi),$$

where $q_t(\psi) = (y_t - G(x_t;\psi))^2$. Conditions for consistency and asymptotic normality of $\hat{\psi}$ are stated in the following theorem.

Theorem 1. Under the assumptions (A.1)-(A.3) the maximum likelihood estimator $\hat{\psi}$ is almost surely consistent for $\psi$ and

$$T^{1/2}(\hat{\psi} - \psi) \xrightarrow{d} N\left(0,\ \operatorname{plim}_{T\to\infty} A(\psi)^{-1}\right), \qquad (11)$$

where $A(\psi) = -\dfrac{1}{2T}\dfrac{\partial^2 Q_T(\psi)}{\partial\psi\,\partial\psi'}$.

Proof. Assumptions (A.1)-(A.3) together with the normality assumption of the errors satisfy the assumptions of Theorem 4.1 (almost sure consistency) in Trapletti et al. (2000). As every hidden unit in model (1) is a logistic function, assumptions 4.5 and 4.6 of Theorem 4.2 (asymptotic normality) in Trapletti et al. (2000) are satisfied as well, so that result (11) follows.
In this paper, we apply the heteroskedasticity-robust large sample estimator of the covariance matrix of $\hat{\psi}$ (White 1980)

$$\hat{B}(\hat{\psi}) = \left(\sum_{t=1}^{T}\hat{h}_t\hat{h}_t'\right)^{-1} \left(\sum_{t=1}^{T}\hat{\varepsilon}_t^2\,\hat{h}_t\hat{h}_t'\right) \left(\sum_{t=1}^{T}\hat{h}_t\hat{h}_t'\right)^{-1}, \qquad (12)$$

where $\hat{h}_t = \partial q_t(\psi)/\partial\psi\,|_{\psi=\hat{\psi}}$, and $\hat{\varepsilon}_t$ is the residual. In the estimation, the use of algorithms such as the Broyden-Fletcher-Goldfarb-Shanno (BFGS) or the Levenberg-Marquardt algorithms is strongly recommended; see, for example, Bertsekas (1995) for details about optimization algorithms. The choice of an appropriate line search procedure to select the step-length is another important question. Cubic or quadratic interpolation are usually reasonable choices. All the models in this paper are estimated with the Levenberg-Marquardt algorithm based on a cubic interpolation line search.
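The estimator (12) is a standard sandwich form and can be computed in a few lines. A sketch (our own, with `H` holding the vectors $\hat{h}_t$ as rows and `resid` the residuals $\hat{\varepsilon}_t$):

```python
import numpy as np

def sandwich_cov(H, resid):
    """Heteroskedasticity-robust covariance estimator (12).

    H     : (T, r) matrix whose rows are the score contributions h_t
    resid : (T,) residuals
    """
    bread = np.linalg.inv(H.T @ H)                 # (sum h_t h_t')^{-1}
    meat = (H * resid[:, None] ** 2).T @ H         # sum e_t^2 h_t h_t'
    return bread @ meat @ bread
```

With homoskedastic residuals of constant magnitude $c$, this collapses to $c^2\,(\sum_t \hat{h}_t\hat{h}_t')^{-1}$, which is a convenient sanity check.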
4.3.1 Concentrated Maximum Likelihood
In order to reduce the computational burden we can apply concentrated maximum likelihood to estimate $\psi$ as follows. Consider the $i$-th iteration and rewrite model (1) as

$$y = Z(\eta)\theta + \varepsilon, \qquad (13)$$

where $y' = [y_1, y_2,\ldots,y_T]$, $\varepsilon' = [\varepsilon_1,\varepsilon_2,\ldots,\varepsilon_T]$, $\theta' = [\alpha', \lambda_1,\ldots,\lambda_h]$, and

$$Z(\eta) = \begin{pmatrix} \tilde{x}_1' & F(\gamma_1(\omega_1'x_1 - c_1)) & \cdots & F(\gamma_h(\omega_h'x_1 - c_h)) \\ \vdots & \vdots & \ddots & \vdots \\ \tilde{x}_T' & F(\gamma_1(\omega_1'x_T - c_1)) & \cdots & F(\gamma_h(\omega_h'x_T - c_h)) \end{pmatrix},$$

with $\eta = [\gamma_1,\ldots,\gamma_h, \omega_1',\ldots,\omega_h', c_1,\ldots,c_h]'$. Assuming $\eta$ fixed (the value is obtained from the previous iteration), the parameter vector $\theta$ can be estimated analytically by

$$\hat{\theta} = \left(Z(\eta)'Z(\eta)\right)^{-1} Z(\eta)'y. \qquad (14)$$

The remaining parameters $\eta$ are estimated conditionally on $\hat{\theta}$ by applying the Levenberg-Marquardt algorithm, which completes the $i$-th iteration. This form of concentrated maximum likelihood was proposed by Leybourne, Newbold and Vougas (1998) in the context of STAR models. It substantially reduces the dimensionality of the iterative estimation problem, as instead of an inversion of a single large Hessian two smaller matrices are inverted, and a line search is only needed to obtain the $i$-th estimate of $\eta$.
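The concentrated step (14) is a single least-squares solve given the nonlinear parameters. A sketch under the reparametrization (8) (function and argument names are our own):

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def concentrated_theta(y, x, gamma, omega, c):
    """Given the nonlinear parameters eta = (gamma, omega, c), build the
    regressor matrix Z(eta) of (13) and solve (14) for
    theta = [alpha', lambda_1, ..., lambda_h]' by least squares.

    x : (T, q); gamma, c : (h,); omega : (h, q)
    """
    T = len(y)
    Z = np.column_stack([np.ones(T), x,
                         logistic(gamma * (x @ omega.T - c))])
    theta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return theta, Z
```

Using `lstsq` rather than forming $(Z'Z)^{-1}$ explicitly is the numerically preferable way to evaluate (14).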
4.3.2 Starting-Values

Many iterative optimization algorithms are sensitive to the choice of starting-values, and this is certainly so in the estimation of AR-NN models. Besides, an AR-NN model with $h$ hidden units contains $h$ parameters $\gamma_i$, $i = 1,\ldots,h$, that are not scale-free. Our first task is thus to rescale the input variables in such a way that they have a standard deviation equal to unity. In the univariate AR-NN case, this simply means normalizing $y_t$. If the model contains exogenous variables, they are normalized separately. This, together with the fact that $\|\omega_h\| = 1$, gives us a basis for discussing the choice of starting-values of $\gamma_i$, $i = 1,\ldots,h$. Another advantage of this is that, in the multivariate case, normalization generally makes numerical optimization easier as all variables have the same standard deviation. Assume now that we have estimated an AR-NN model with $h-1$ hidden units and want to estimate one with $h$ units. Our specific-to-general specification strategy has the consequence that this situation frequently occurs in practice. A natural choice of initial values for the estimation of parameters in the model with $h$ neurons is to use the final estimates for the parameters in the first $h-1$ hidden units and the linear unit. The starting-values for the parameters $\lambda_h$, $\gamma_h$, $\omega_h$ and $c_h$ in the $h$-th hidden unit are obtained in three steps as follows.
1. For $k = 1,\ldots,K$:

(a) Construct a vector $v_h^{(k)} = [v_{1h}^{(k)},\ldots,v_{qh}^{(k)}]'$ such that $v_{1h}^{(k)} \in (0,1]$ and $v_{jh}^{(k)} \in [-1,1]$, $j = 2,\ldots,q$. The values for $v_{1h}^{(k)}$ are drawn from a uniform $(0,1]$ distribution and the ones for $v_{jh}^{(k)}$, $j = 2,\ldots,q$, from a uniform $[-1,1]$ distribution.

(b) Define $\omega_h^{(k)} = v_h^{(k)}\|v_h^{(k)}\|^{-1}$, which guarantees $\|\omega_h^{(k)}\| = 1$.

(c) Let $c_h^{(k)} = \operatorname{med}(\omega_h^{(k)\prime} x)$, where $x = [x_1,\ldots,x_T]$.

2. Define a grid of $N$ positive values $\gamma_h^{(n)}$, $n = 1,\ldots,N$, for the slope parameter. This need not be done randomly. As changes in $\gamma_h$ have a small effect on the slope when $\gamma_h$ is large, only a small number of large values are required, and the grid should be finer at the low end of the half-space of $\gamma_h$-values.

3. For $k = 1,\ldots,K$ and $n = 1,\ldots,N$, estimate $\theta$ using (14) and compute the value of $Q_T(\psi)$ for each combination of starting-values. Choose those values of the parameters that minimize $\sum_{t=1}^{T} q_t(\psi)$.
After selecting the initial values of the $h$-th hidden unit we have to reorder the units, if necessary, in order to ensure that the identifying restrictions discussed in Section 3 are satisfied.

Typically, choosing $K = 1000$ and $N = 20$ ensures good initial estimates. We should stress, however, that $K$ is a nondecreasing function of the number of input variables. If the latter is large, we have to select a large $K$ as well.
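Steps 1-3 above can be sketched as follows. This is our own illustrative implementation: the $\gamma$ grid values and the default $K$ are placeholders, and $\theta$ is concentrated out with a least-squares solve as in (14).

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def draw_start(y, x, Z_prev, K=100,
               gammas=(0.5, 1.0, 2.0, 5.0, 10.0, 25.0, 50.0)):
    """Random search for starting-values of the h-th unit (Section 4.3.2).

    Z_prev : (T, n) regressors of the linear unit and the h-1 units
             already in the model; theta is concentrated out by OLS.
    gammas : grid for the slope, finer at the low end (step 2).
    Returns the best (gamma, omega, c) combination and its SSR.
    """
    T, q = x.shape
    rng = np.random.default_rng(1)
    best, best_ssr = None, np.inf
    for _ in range(K):
        v = np.concatenate([rng.uniform(0.0, 1.0, 1),      # v_1h in (0, 1]
                            rng.uniform(-1.0, 1.0, q - 1)])  # v_jh in [-1, 1]
        omega = v / np.linalg.norm(v)                       # ||omega|| = 1
        c = np.median(x @ omega)                            # c_h = med(omega' x)
        for g in gammas:
            Z = np.column_stack([Z_prev, logistic(g * (x @ omega - c))])
            theta, *_ = np.linalg.lstsq(Z, y, rcond=None)
            ssr = np.sum((y - Z @ theta) ** 2)
            if ssr < best_ssr:
                best, best_ssr = (g, omega, c), ssr
    return best, best_ssr
```

In practice $K$ and the $\gamma$ grid would be chosen as discussed in the text ($K = 1000$, $N = 20$, say).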
4.3.3 Potential Problem in the Estimation of the Slope Parameter
Finding the appropriate starting-values for the parameters is not the only difficult phase of the parameter estimation. It may also be difficult to obtain reasonably accurate estimates for those slope parameters $\gamma_i$, $i = 1,\ldots,h$, that are very large. This is the case unless the sample size is also large. To obtain an accurate estimate of a large $\gamma_i$ it is necessary to have a large number of observations in the neighbourhood of $c_i$. When the sample size is not very large, there are generally few observations sufficiently close to $c_i$ in the sample, which results in imprecise estimates of the slope parameter. This manifests itself in low absolute $t$-values for the estimates of $\gamma_i$. In such cases, the model builder cannot take a low absolute value of the $t$-statistic of the parameters of the transition function as evidence for omitting the hidden unit in question. Another reason for not doing so is that the $t$-value does not have its customary interpretation as a value of an asymptotically $t$-distributed statistic. This is due to an identification problem to be discussed in the next subsection. For more discussion see, for example, Bates and Watts (1988, p. 87) or Teräsvirta (1994).
4.4 Determining the Number of Hidden Units
The number of hidden units included in an ANN model is usually determined from the data. A popular method for doing that is pruning, in which a model with a large number of hidden units is estimated first, and the size of the model is subsequently reduced by applying an appropriate technique such as cross-validation. Another technique used in this connection is regularization, which may be characterized as penalized maximum likelihood or least squares applied to the estimation of neural network models; for discussion see, for example, Fine (1999, pp. 215-221). Bayesian regularization is an example of such a technique. The use of regularization techniques precludes the possibility of applying methods of statistical inference to the estimated model.

As discussed in the Introduction, another possibility is to begin with a small model and sequentially add hidden units to the model; for discussion see, for example, Fine (1999, pp. 232-233), Anders and Korn (1999), or Swanson and White (1995, 1997a,b). The decision of adding another hidden neuron is often based on the use of model selection criteria (MSC) or cross-validation. This has the following drawback. Suppose the data have been generated by an AR-NN model with $h$ hidden units. Applying an MSC to decide whether or not another hidden unit should be added to the model requires estimation of a model with $h+1$ hidden neurons. In this situation, however, the larger model is not identified and its parameters cannot be estimated consistently. This is likely to cause numerical problems in maximum likelihood estimation. Besides, even when convergence is achieved, lack of identification causes a severe problem in interpreting the MSC. The NN model with $h$ hidden units is nested in the model with $h+1$ units. A typical MSC comparison of the two models is then equivalent to a likelihood ratio test of $h$ units against $h+1$ ones; see, for example, Teräsvirta and Mellin (1986) for discussion. The choice of MSC determines the (asymptotic) significance level of the test. But then, when the larger model is not identified under the null hypothesis, the likelihood ratio statistic does not have its customary asymptotic $\chi^2$ distribution when the null holds. For more discussion of the general situation of a model only being identified under the alternative hypothesis, see, for example, Davies (1977, 1987) or Hansen (1996). In the AR-NN case, this lack of identification shows as an ambiguity in determining the size of the penalty. An AR-NN model with $h+1$ hidden units can be reduced to one with $h$ units by setting $\lambda_{h+1} = 0$, which suggests that the number of degrees of freedom in the penalty term should equal one. On the other hand, the $(h+1)$-th hidden unit can also be eliminated by setting $\omega_{h+1} = 0$. This in turn suggests that the number of degrees of freedom should be $\dim(x_t)$. In practice, the most common choice seems to equal $\dim(x_t) + 2$.
We shall also select the hidden units sequentially starting from a small model, in fact from a linear one, but circumvent the identification problem in a way that enables us to control the significance level of the tests in the sequence and thus also the overall significance level of the acceptance of the null hypothesis. Assume now that our AR-NN model (1) contains $h+1$ hidden units and write it as follows:

$$y_t = \alpha'\tilde{x}_t + \sum_{i=1}^{h}\lambda_i F(\gamma_i(\omega_i'x_t - c_i)) + \lambda_{h+1}F(\gamma_{h+1}(\omega_{h+1}'x_t - c_{h+1})) + \varepsilon_t. \qquad (15)$$

Assume further that we have accepted the hypothesis of model (15) containing $h$ hidden units and want to test for the $(h+1)$-th hidden unit. The appropriate null hypothesis is

$$H_0: \gamma_{h+1} = 0, \qquad (16)$$

whereas the alternative is $H_1: \gamma_{h+1} > 0$. Under (16), the $(h+1)$-th hidden unit is identically equal to a constant and merges with the intercept in the linear unit.

We assume that under (16) the assumptions of Theorem 1 hold so that the parameters of (15) can be estimated consistently. Model (15) is only identified under the alternative so that, as discussed above, the standard asymptotic inference is not available. This problem is circumvented as in Luukkonen, Saikkonen and Teräsvirta (1988) by expanding the $(h+1)$-th hidden unit into a Taylor series around the null hypothesis (16). Using a third-order Taylor expansion, rearranging and merging terms results in the following model:

$$y_t = \alpha'\tilde{x}_t + \sum_{i=1}^{h}\lambda_i F(\gamma_i(\omega_i'x_t - c_i)) + \sum_{i=1}^{q}\sum_{j=i}^{q}\theta_{ij}x_{i,t}x_{j,t} + \sum_{i=1}^{q}\sum_{j=i}^{q}\sum_{k=j}^{q}\theta_{ijk}x_{i,t}x_{j,t}x_{k,t} + \varepsilon_t^*, \qquad (17)$$

where $\varepsilon_t^* = \varepsilon_t + \lambda_{h+1}R(x_t)$; $R(x_t)$ is the remainder. It can be shown that $\theta_{ij} = \gamma_{h+1}^2\tilde{\theta}_{ij}$, $\tilde{\theta}_{ij} \ne 0$, $i = 1,\ldots,q$, $j = i,\ldots,q$, and $\theta_{ijk} = \gamma_{h+1}^3\tilde{\theta}_{ijk}$, $\tilde{\theta}_{ijk} \ne 0$, $i = 1,\ldots,q$, $j = i,\ldots,q$, $k = j,\ldots,q$. Thus the null hypothesis becomes $H_0': \theta_{ij} = 0$, $i = 1,\ldots,q$, $j = i,\ldots,q$; $\theta_{ijk} = 0$, $i = 1,\ldots,q$, $j = i,\ldots,q$, $k = j,\ldots,q$. Note that under $H_0$, $\varepsilon_t^* = \varepsilon_t$, so that the properties of the error process remain unchanged under the null hypothesis. Finally, it may be pointed out that one may also view (17) as the auxiliary model underlying the test. The log-likelihood function for observation $t$ corresponding to (17) is

$$l_t = -\frac{1}{2}\ln(2\pi) - \frac{1}{2}\ln\sigma^2 - \frac{1}{2\sigma^2}\Big(y_t - \alpha'\tilde{x}_t - \sum_{i=1}^{h}\lambda_i F(\gamma_i(\omega_i'x_t - c_i)) - \sum_{i=1}^{q}\sum_{j=i}^{q}\theta_{ij}x_{i,t}x_{j,t} - \sum_{i=1}^{q}\sum_{j=i}^{q}\sum_{k=j}^{q}\theta_{ijk}x_{i,t}x_{j,t}x_{k,t}\Big)^2. \qquad (18)$$
We make the following assumption to accompany the previous assumptions (A.1)-(A.3):

(A.4) $E|x_{t,i}|^\delta < \infty$, $i = 1,\ldots,q$, for some $\delta > 6$.

This enables us to state the following well-known result:
Theorem 2. Under $H_0: \gamma_{h+1} = 0$ and assumptions (A.1)-(A.4), the LM type statistic

$$LM = \frac{1}{\hat{\sigma}^2}\sum_{t=1}^{T}\hat{\varepsilon}_t\hat{\nu}_t' \left\{\sum_{t=1}^{T}\hat{\nu}_t\hat{\nu}_t' - \sum_{t=1}^{T}\hat{\nu}_t\hat{h}_t'\left(\sum_{t=1}^{T}\hat{h}_t\hat{h}_t'\right)^{-1}\sum_{t=1}^{T}\hat{h}_t\hat{\nu}_t'\right\}^{-1}\sum_{t=1}^{T}\hat{\nu}_t\hat{\varepsilon}_t, \qquad (19)$$

where $\hat{\varepsilon}_t = y_t - G(x_t;\hat{\psi})$,

$$\hat{h}_t = \frac{\partial G(x_t;\psi)}{\partial\psi}\Big|_{\psi=\hat{\psi}} = \Big[\tilde{x}_t',\ F(\hat{\gamma}_1(\hat{\omega}_1'x_t - \hat{c}_1)),\ldots,F(\hat{\gamma}_h(\hat{\omega}_h'x_t - \hat{c}_h)),\ \hat{\lambda}_1\frac{\partial F(\hat{\gamma}_1(\hat{\omega}_1'x_t - \hat{c}_1))}{\partial\gamma_1},\ldots,\hat{\lambda}_h\frac{\partial F(\hat{\gamma}_h(\hat{\omega}_h'x_t - \hat{c}_h))}{\partial\gamma_h},\ \hat{\lambda}_1\frac{\partial F(\hat{\gamma}_1(\hat{\omega}_1'x_t - \hat{c}_1))}{\partial\omega_{12}},\ldots,\hat{\lambda}_1\frac{\partial F(\hat{\gamma}_1(\hat{\omega}_1'x_t - \hat{c}_1))}{\partial\omega_{1q}},\ldots,\hat{\lambda}_h\frac{\partial F(\hat{\gamma}_h(\hat{\omega}_h'x_t - \hat{c}_h))}{\partial\omega_{h2}},\ldots,\hat{\lambda}_h\frac{\partial F(\hat{\gamma}_h(\hat{\omega}_h'x_t - \hat{c}_h))}{\partial\omega_{hq}},\ \hat{\lambda}_1\frac{\partial F(\hat{\gamma}_1(\hat{\omega}_1'x_t - \hat{c}_1))}{\partial c_1},\ldots,\hat{\lambda}_h\frac{\partial F(\hat{\gamma}_h(\hat{\omega}_h'x_t - \hat{c}_h))}{\partial c_h}\Big]'$$

with

$$\frac{\partial F(\hat{\gamma}_i(\hat{\omega}_i'x_t - \hat{c}_i))}{\partial\gamma_i} = (\hat{\omega}_i'x_t - \hat{c}_i)\left\{2\cosh\!\left[\hat{\gamma}_i(\hat{\omega}_i'x_t - \hat{c}_i)/2\right]\right\}^{-2}, \quad i = 1,\ldots,h,$$

$$\frac{\partial F(\hat{\gamma}_i(\hat{\omega}_i'x_t - \hat{c}_i))}{\partial\omega_{ij}} = \hat{\gamma}_i\, x_{j,t}\left\{2\cosh\!\left[\hat{\gamma}_i(\hat{\omega}_i'x_t - \hat{c}_i)/2\right]\right\}^{-2}, \quad i = 1,\ldots,h,\ j = 2,\ldots,q,$$

$$\frac{\partial F(\hat{\gamma}_i(\hat{\omega}_i'x_t - \hat{c}_i))}{\partial c_i} = -\hat{\gamma}_i\left\{2\cosh\!\left[\hat{\gamma}_i(\hat{\omega}_i'x_t - \hat{c}_i)/2\right]\right\}^{-2}, \quad i = 1,\ldots,h,$$

and $\hat{\nu}_t = [x_{1,t}^2,\ x_{1,t}x_{2,t},\ldots,x_{i,t}x_{j,t},\ldots,x_{1,t}^3,\ldots,x_{i,t}x_{j,t}x_{k,t},\ldots,x_{q,t}^3]'$, has an asymptotic $\chi^2$ distribution with $m = q(q+1)/2 + q(q+1)(q+2)/6$ degrees of freedom.
The test can also be carried out in stages as follows:

1. Estimate model (1) with $h$ hidden units. If the sample size is small and the model thus difficult to estimate, numerical problems in applying the maximum likelihood algorithm may lead to a solution such that the residual vector is not precisely orthogonal to the gradient matrix of $G(x_t;\hat\psi)$. This has an adverse effect on the empirical size of the test. To circumvent this problem, we regress the residuals $\hat\varepsilon_t$ on $\hat h_t$ and compute the sum of squared residuals $SSR_0=\sum_{t=1}^{T}\tilde\varepsilon_t^2$. The new residuals $\tilde\varepsilon_t$ are orthogonal to $\hat h_t$.

2. Regress $\tilde\varepsilon_t$ on $\hat h_t$ and $\hat\nu_t$. Compute the sum of squared residuals $SSR_1=\sum_{t=1}^{T}\hat v_t^2$.

3. Compute the $\chi^2$ statistic
\[
LM_{hn}^{\chi^2} = T\,\frac{SSR_0-SSR_1}{SSR_0}, \tag{20}
\]
or the $F$ version of the test
\[
LM_{hn}^{F} = \frac{(SSR_0-SSR_1)/m}{SSR_1/(T-n-m)}, \tag{21}
\]
where $n=(q+2)h+p+1$. Under $H_0$, $LM_{hn}^{\chi^2}$ has an asymptotic $\chi^2$ distribution with $m$ degrees of freedom and $LM_{hn}^{F}$ is approximately $F$-distributed with $m$ and $T-n-m$ degrees of freedom.
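The staged computation can be written down compactly. The following sketch is ours, not the authors' code; the function names are invented, and ordinary least squares via `numpy` is assumed for the auxiliary regressions. It takes the residuals, the gradient matrix $\hat h_t$ and the cross-product regressors $\hat\nu_t$, and returns the $F$ version (21):

```python
import numpy as np

def cross_products(x):
    """nu_t: distinct second- and third-order products of x_{1,t},...,x_{q,t};
    gives q(q+1)/2 + q(q+1)(q+2)/6 columns."""
    q = x.shape[1]
    cols = [x[:, i] * x[:, j] for i in range(q) for j in range(i, q)]
    cols += [x[:, i] * x[:, j] * x[:, k]
             for i in range(q) for j in range(i, q) for k in range(j, q)]
    return np.column_stack(cols)

def lm_f_test(resid, grad, nu):
    """F version of the LM-type test, eqs. (20)-(21), in three stages."""
    T, n = grad.shape
    m = nu.shape[1]
    # Stage 1: re-orthogonalise the residuals against the gradient.
    e = resid - grad @ np.linalg.lstsq(grad, resid, rcond=None)[0]
    ssr0 = np.sum(e ** 2)
    # Stage 2: regress the new residuals on [gradient, nu].
    Z = np.hstack([grad, nu])
    v = e - Z @ np.linalg.lstsq(Z, e, rcond=None)[0]
    ssr1 = np.sum(v ** 2)
    # Stage 3: approximately F(m, T - n - m) under the null.
    return ((ssr0 - ssr1) / m) / (ssr1 / (T - n - m))
```

Since the stage-2 regressor set nests the stage-1 set, $SSR_1\le SSR_0$ and the statistic is nonnegative by construction.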
The following cautionary remark is in order. If any $\hat\gamma_i$, $i=1,\ldots,h$, is very large, the gradient matrix becomes near-singular and the test statistic numerically unstable, which distorts the size of the test. The reason is that the vectors corresponding to the partial derivatives with respect to $\gamma_i$, $\omega_i$, and $c_i$, respectively, tend to be almost perfectly linearly correlated. This is due to the fact that the time series of those elements of the gradient resemble dummy variables, being constant most of the time and nonconstant simultaneously. The problem may be remedied by omitting these elements from the regression in step 2. This can be done without significantly affecting the value of the test statistic; see Eitrheim and Teräsvirta (1996) for discussion. This situation may also cause problems in obtaining standard deviation estimates for parameter estimates through the outer product matrix (12). The same remedy can be applied: omit the offending rows and columns of (12) and compute standard deviation estimates for the remaining parameter estimates only. In fact, omitting the rows and columns corresponding to the high values $\hat\gamma_i$ should suffice.
Testing zero hidden units against at least one is a special case of the above test. This amounts to testing linearity, and the test statistic is in this case identical to the one derived for testing linearity against the AR-NN model in Teräsvirta, Lin and Granger (1993). A natural alternative to our procedure is the one first suggested in White (1989) and investigated later in Lee, White and Granger (1993). In order to test the null hypothesis of $h$ hidden units, one adds $q$ hidden units to model (1) by randomly selecting the parameters $\tilde\omega_{h+j}$, $\gamma_{h+j}$, $j=1,\ldots,q$. If $q$ is large (a small value may not render the test powerful enough), a small number of principal components of the $q$ extra units may be used instead. This solves the identification problem, as the extra neurons or the transformed neurons are observable, and the null hypothesis $\lambda_{h+1}=\cdots=\lambda_{h+q}=0$, or its equivalent when the principal component approach is taken, can be tested using standard inference. When $h=0$, this technique also collapses into a linearity test; see Lee et al. (1993). Simulation results in Teräsvirta et al. (1993) and Anders and Korn (1999) indicate that the polynomial approximation method presented here compares well with White's approach, and it is applied in the rest of this work.
As mentioned in the Introduction, the normality assumption can be relaxed while the consistency and asymptotic normality of the (quasi) maximum likelihood estimators are retained. This is important, for example, in financial applications of the AR-NN model. In financial applications, at least in ones to high-frequency data such as intradaily, daily or even weekly series, the series typically contain conditional heteroskedasticity. This possibility can be accounted for by robustifying the tests against heteroskedasticity following Wooldridge (1990). A heteroskedasticity-robust version of the LM type test, based on the notion of robustifying statistic (18), can be carried out as follows:

1. As before.

2. Regress $\hat\nu_t$ on $\hat h_t$ and compute the residuals $r_t$.

3. Regress 1 on $\tilde\varepsilon_t r_t$ and compute the sum of squared residuals $SSR_1$. The test statistic is
\[
LM_{r}^{\chi^2} = T - SSR_1. \tag{22}
\]

The test statistic has the same asymptotic $\chi^2$ null distribution as before.
It should be noticed that in the case of conditional heteroskedasticity, the maximum likelihood estimates discussed in Section 4.3 are just quasi maximum likelihood estimates. Under regularity conditions they are still consistent and asymptotically normal.
4.5 Evaluation of the Estimated Model

After a model has been estimated it has to be evaluated. We propose two in-sample misspecification tests for this purpose. The first one tests for parameter constancy. The second one tests the assumption of no serial correlation in the errors and is an application of the results in Eitrheim and Teräsvirta (1996) and Godfrey (1988, pp. 112–121).

4.5.1 Test of Parameter Constancy

Testing parameter constancy is an important way of checking the adequacy of linear or nonlinear models. Many parameter constancy tests are tests against unspecified alternatives or a single structural break. In this section we present a parametric alternative to parameter constancy which allows the parameters to change smoothly as a function of time under the alternative hypothesis.

In the following we assume that the logistic function (8) has constant parameters, whereas both $\pi$ and $\lambda_i$, $i=1,\ldots,h$, may be subject to changes over time. This assumption is made mainly because changes in the parameters of the logistic function are more difficult to detect than changes in the linear parameters.
In order to derive the test, consider a model with time-varying parameters defined as
\[
y_t = \tilde G(x_t;\psi,\tilde\psi) + \varepsilon_t = \tilde\pi(t)'\tilde x_t + \sum_{i=1}^{h}\tilde\lambda_i(t)F(\gamma_i(\omega_i'x_t-c_i)) + \varepsilon_t, \tag{23}
\]
where
\[
\tilde\pi(t) = \pi + \xi F(\gamma(t-\tau)), \tag{24}
\]
and
\[
\tilde\lambda_i(t) = \lambda_i + \zeta_i F(\gamma(t-\tau)),\quad i=1,\ldots,h. \tag{25}
\]
The function $F$ in (24) and (25) is defined as in (2), and $\gamma>0$. The parameter vector $\psi$ is defined as before, and $\tilde\psi=[\xi',\zeta_1,\ldots,\zeta_h,\gamma,\tau]'$. The parameter $\gamma$ controls the smoothness of the monotonic change in the autoregressive parameters. When $\gamma\to\infty$, equations (23)–(25) represent a model with a single structural break at $t=\tau$. Combining (24) and (25) with (23), we have the following model:
\[
y_t = \{\pi + \xi F(\gamma(t-\tau))\}'\tilde x_t + \sum_{i=1}^{h}\{\lambda_i + \zeta_i F(\gamma(t-\tau))\}F(\gamma_i(\omega_i'x_t-c_i)) + \varepsilon_t. \tag{26}
\]
The null hypothesis of parameter constancy is
\[
H_0\colon \gamma = 0. \tag{27}
\]
Note that model (26) is only identified under the alternative $\gamma>0$. As is obvious from Section 4.4, a consequence of this complication is that the standard asymptotic distribution theory for the likelihood ratio or other classical test statistics for testing (27) is not available. To remedy this problem we expand $F(\gamma(t-\tau))$ into a first-order Taylor expansion around $\gamma=0$. This yields
\[
F(\gamma(t-\tau)) = \frac{1}{2} + \frac{1}{4}\gamma(t-\tau) + R(t;\gamma,\tau), \tag{28}
\]
where $R(t;\gamma,\tau)$ is the remainder. Replacing $F(\gamma(t-\tau))$ in (26) by (28) and reparameterizing, we obtain
\[
y_t = (\varphi_0 + \varphi_1 t)'\tilde x_t + \sum_{i=1}^{h}(\beta_i + \delta_i t)F(\gamma_i(\omega_i'x_t-c_i)) + \varepsilon_t^*, \tag{29}
\]
where $\varphi_0 = \pi + \frac{1}{2}\xi - \frac{1}{4}\gamma\tau\xi$, $\varphi_1 = \frac{1}{4}\gamma\xi$, $\beta_i = \lambda_i + \frac{1}{2}\zeta_i - \frac{1}{4}\gamma\tau\zeta_i$, $\delta_i = \frac{1}{4}\gamma\zeta_i$, $i=1,\ldots,h$, and $\varepsilon_t^* = \varepsilon_t$ plus terms involving the remainder $R(t;\gamma,\tau)$. The null hypothesis (27) becomes
\[
H_0'\colon \varphi_1 = 0,\; \delta_1 = \cdots = \delta_h = 0. \tag{30}
\]
Under $H_0'$, $R(t;\gamma,\tau)=0$ and $\varepsilon_t^*=\varepsilon_t$, so that standard asymptotic distribution theory is available. The local approximation to the $t$-th term in the normal log likelihood function in a neighbourhood of $H_0'$ for observation $t$, ignoring $R(t;\gamma,\tau)$, is
\[
l_t = -\frac{1}{2}\ln(2\pi) - \frac{1}{2}\ln\sigma^2 - \frac{1}{2\sigma^2}\Big\{y_t - (\varphi_0+\varphi_1 t)'\tilde x_t - \sum_{i=1}^{h}(\beta_i+\delta_i t)F(\gamma_i(\omega_i'x_t-c_i))\Big\}^2. \tag{31}
\]
The consistent estimators of the relevant partial derivatives of the log likelihood under the null hypothesis are
\[
\left.\frac{\partial \hat l_t}{\partial \varphi_0}\right|_{H_0'} = \frac{1}{\hat\sigma^2}\hat\varepsilon_t\tilde x_t \tag{32}
\]
\[
\left.\frac{\partial \hat l_t}{\partial \varphi_1}\right|_{H_0'} = \frac{1}{\hat\sigma^2}\hat\varepsilon_t t\tilde x_t \tag{33}
\]
\[
\left.\frac{\partial \hat l_t}{\partial \beta_i}\right|_{H_0'} = \frac{1}{\hat\sigma^2}\hat\varepsilon_t F(\hat\gamma_i(\hat\omega_i'x_t-\hat c_i)) \tag{34}
\]
\[
\left.\frac{\partial \hat l_t}{\partial \delta_i}\right|_{H_0'} = \frac{1}{\hat\sigma^2}\hat\varepsilon_t tF(\hat\gamma_i(\hat\omega_i'x_t-\hat c_i)) \tag{35}
\]
\[
\left.\frac{\partial \hat l_t}{\partial \gamma_i}\right|_{H_0'} = \frac{1}{\hat\sigma^2}\hat\varepsilon_t\hat\lambda_i\frac{\partial F(\hat\gamma_i(\hat\omega_i'x_t-\hat c_i))}{\partial\gamma_i} \tag{36}
\]
\[
\left.\frac{\partial \hat l_t}{\partial \omega_i'}\right|_{H_0'} = \frac{1}{\hat\sigma^2}\hat\varepsilon_t\hat\lambda_i\frac{\partial F(\hat\gamma_i(\hat\omega_i'x_t-\hat c_i))}{\partial\omega_i'} \tag{37}
\]
\[
\left.\frac{\partial \hat l_t}{\partial c_i}\right|_{H_0'} = \frac{1}{\hat\sigma^2}\hat\varepsilon_t\hat\lambda_i\frac{\partial F(\hat\gamma_i(\hat\omega_i'x_t-\hat c_i))}{\partial c_i} \tag{38}
\]
where $i=1,\ldots,h$, $\hat\sigma^2=(1/T)\sum_{t=1}^{T}\hat\varepsilon_t^2$, and $\hat\varepsilon_t = y_t - G(x_t;\hat\psi) = y_t - \hat\pi'\tilde x_t - \sum_{i=1}^{h}\hat\lambda_i F(\hat\gamma_i(\hat\omega_i'x_t-\hat c_i))$ are
the residuals estimated under the null hypothesis. The LM statistic is (19) with $\hat h_t = \partial G(x_t;\psi)/\partial\psi|_{\psi=\hat\psi}$ and $\hat\nu_t = [t\tilde x_t',\, tF(\hat\gamma_1(\hat\omega_1'x_t-\hat c_1)),\ldots,tF(\hat\gamma_h(\hat\omega_h'x_t-\hat c_h))]'$. Testing the hypothesis that only a subset of the coefficients is constant is also possible in this framework. Under $H_0$, statistic (19) has an asymptotic $\chi^2$ distribution with $q+h+1$ degrees of freedom. This is the case even if $\hat\nu_t$ contains elements dominated by a deterministic trend; see Lin and Teräsvirta (1994) for details.

The test can be carried out in stages as before. The only differences are the new definition of $\hat\nu_t$ at stage 2 and the degrees of freedom in the $\chi^2$ or $F$ test. As before, use of the $F$ version of the test is recommended.
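The auxiliary regressors for the parameter-constancy test are straightforward to build, and the general case below (powers of the trend up to $K$, anticipating the generalization that follows) reuses the staged regression from Section 4.4. This is our own sketch; the rescaling of the trend to $(0,1]$ is an assumption made for numerical stability and does not affect the test:

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def constancy_regressors(x_tilde, x, gammas, omegas, cs, K=1):
    """Auxiliary regressors nu_t for the parameter-constancy test:
    [t^k * x_tilde_t', t^k * F_1, ..., t^k * F_h], k = 1..K,
    giving K*(q+h+1) columns when x_tilde has q+1 columns (intercept + q lags)."""
    T = x.shape[0]
    # hidden-unit outputs evaluated at the estimated parameters
    F = np.column_stack([logistic(g * (x @ w - c))
                         for g, w, c in zip(gammas, omegas, cs)])
    base = np.hstack([x_tilde, F])
    t = np.arange(1, T + 1) / T   # rescaled trend (our choice, for stability)
    return np.hstack([base * t[:, None] ** k for k in range(1, K + 1)])
```

For $K=1$ the number of columns, $q+h+1$, matches the degrees of freedom quoted above.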
The alternative to parameter constancy may be made more general simply by defining
\[
F(\gamma(t-\tau)) = \left(1+\exp\Big\{-\gamma\prod_{k=1}^{K}(t-\tau_k)\Big\}\right)^{-1},\quad \gamma>0, \tag{39}
\]
with $K\ge 1$. Transition function (39) with $K\ge 2$ allows the parameters to change nonmonotonically over time, in contrast to the case $K=1$. Consequently, in the test statistic (19), $\hat\nu_t=[\hat\nu_{1t}',\ldots,\hat\nu_{Kt}']'$, where
\[
\hat\nu_{kt}' = \big[t^k\tilde x_t',\; t^kF(\hat\gamma_1(\hat\omega_1'x_t-\hat c_1)),\ldots,t^kF(\hat\gamma_h(\hat\omega_h'x_t-\hat c_h))\big],\quad k=1,\ldots,K,
\]
and the number of degrees of freedom in the test statistic is adjusted accordingly. In this paper we report results for $K=1,2,3$, and call the corresponding test statistics $LM_K$, $K=1,2,3$, respectively. If the error process is heteroskedastic, a robust version of the test, immediately obvious from Section 4.4, has to be employed instead of the standard test. When the model is assumed not to contain any hidden units, $\tilde\lambda_i(t)\equiv 0$, $i=1,\ldots,h$, the test collapses into the parameter constancy test in Lin and Teräsvirta (1994).
4.5.2 Test of Serial Independence

Assume that the errors in equation (1) follow an $r$-th order autoregressive process defined as
\[
\varepsilon_t = \alpha'\nu_t + u_t, \tag{40}
\]
where $\alpha=[\alpha_1,\ldots,\alpha_r]'$ is a parameter vector, $\nu_t=[\varepsilon_{t-1},\ldots,\varepsilon_{t-r}]'$, and $u_t\sim \mathrm{NID}(0,\sigma^2)$. We assume that under $H_0$, assumptions (A.1)–(A.3) hold. Consider the null hypothesis $H_0\colon \alpha=0$ against the alternative $H_1\colon \alpha\neq 0$. The conditional normal log-likelihood of (1) with (40) for observation $t$, given the fixed starting values, has the form
\[
l_t = -\frac{1}{2}\ln(2\pi) - \frac{1}{2}\ln\sigma^2 - \frac{1}{2\sigma^2}\Big\{y_t - \sum_{j=1}^{r}\alpha_j y_{t-j} - G(x_t;\psi) + \sum_{j=1}^{r}\alpha_j G(x_{t-j};\psi)\Big\}^2. \tag{41}
\]
The first partial derivatives of the normal log-likelihood for observation $t$ with respect to $\alpha$ and $\psi$ are
\[
\frac{\partial l_t}{\partial\alpha_j} = \frac{u_t}{\sigma^2}\{y_{t-j} - G(x_{t-j};\psi)\},\quad j=1,\ldots,r,
\]
\[
\frac{\partial l_t}{\partial\psi} = \frac{u_t}{\sigma^2}\Big\{\frac{\partial G(x_t;\psi)}{\partial\psi} - \sum_{j=1}^{r}\alpha_j\frac{\partial G(x_{t-j};\psi)}{\partial\psi}\Big\}. \tag{42}
\]
Under the null hypothesis, the consistent estimators of the score are
\[
\sum_{t=1}^{T}\left.\frac{\partial\hat l_t}{\partial\alpha}\right|_{H_0} = \frac{1}{\hat\sigma^2}\sum_{t=1}^{T}\hat\varepsilon_t\hat\nu_t \quad\text{and}\quad \sum_{t=1}^{T}\left.\frac{\partial\hat l_t}{\partial\psi}\right|_{H_0} = \frac{1}{\hat\sigma^2}\sum_{t=1}^{T}\hat\varepsilon_t\hat h_t,
\]
where $\hat\nu_t'=[\hat\varepsilon_{t-1},\ldots,\hat\varepsilon_{t-r}]$, $\hat\varepsilon_{t-j}=y_{t-j}-G(x_{t-j};\hat\psi)$, $j=1,\ldots,r$, $\hat h_t=\partial G(x_t;\hat\psi)/\partial\psi$, and $\hat\sigma^2=(1/T)\sum_{t=1}^{T}\hat\varepsilon_t^2$. The LM statistic is (19) with $\hat h_t$ and $\hat\nu_t$ defined as above, and it has an asymptotic $\chi^2$ distribution with $r$ degrees of freedom under the null hypothesis. For details, see Godfrey (1988, pp. 112–121).
The test can be performed in three stages as shown before. It may be pointed out that the Ljung-Box test or its asymptotically equivalent counterpart, the Box-Pierce test, both recommended for use in connection with NN models by Zapranis and Refenes (1999), are not available here. Their asymptotic null distribution is unknown when the estimated model is an AR-NN model.
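The serial-independence test again reduces to the staged auxiliary regression, now with lagged residuals as the extra regressors. The sketch below is ours (the zero-padding of pre-sample residuals is a common convention we assume, not a detail taken from the paper) and returns the $\chi^2$ form of the statistic:

```python
import numpy as np

def lm_serial(resid, grad, r):
    """LM test of no r-th order error autocorrelation in an estimated AR-NN
    model; asymptotically chi-squared with r degrees of freedom under H0."""
    T = len(resid)
    # nu_t = (eps_{t-1}, ..., eps_{t-r}); pre-sample residuals set to zero
    nu = np.column_stack([np.r_[np.zeros(j), resid[:T - j]]
                          for j in range(1, r + 1)])
    # Stage 1: orthogonalise the residuals against the gradient.
    e = resid - grad @ np.linalg.lstsq(grad, resid, rcond=None)[0]
    ssr0 = np.sum(e ** 2)
    # Stage 2: regress on [gradient, lagged residuals].
    Z = np.hstack([grad, nu])
    v = e - Z @ np.linalg.lstsq(Z, e, rcond=None)[0]
    ssr1 = np.sum(v ** 2)
    # Stage 3: chi-squared form, eq. (20).
    return T * (ssr0 - ssr1) / ssr0
```

As in Section 4.4, an $F$ version with $r$ and $T-n-r$ degrees of freedom can be computed from the same two sums of squares.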
4.6 The Modelling Strategy

At this point we are ready to combine the above statistical ingredients into a coherent modelling strategy. We first define the potential variables (lags) and select a subset of them applying the variable selection technique considered in Section 4.2. After selecting the variables we select the number of hidden units sequentially. We begin by testing linearity against a single hidden unit as described in Section 4.4 at significance level $\alpha$. The model under the null hypothesis is simply a linear AR($p$) model. If the null hypothesis is not rejected, the AR model is accepted. In case of a rejection, an AR-NN model with a single unit is estimated and tested against a model with two hidden units at the significance level $\varrho\alpha$, $0<\varrho<1$. Another rejection leads to estimating a model with two hidden units and testing it against a model with three hidden neurons at the significance level $\varrho^2\alpha$. The sequence is terminated at the first acceptance of the null hypothesis. The significance level is reduced at each step of the sequence and converges to zero. In the applications in Section 6, we use $\varrho=1/2$. This way we avoid excessively large models and control the overall significance level of the procedure. An upper bound $\tilde\alpha$ for the overall significance level may be obtained using the Bonferroni bound. For example, if $\alpha=0.1$ and $\varrho=1/2$, then $\tilde\alpha\le 0.187$. Note that if we instead of our LM type test apply a model selection criterion such as AIC or SBIC to this sequence, we in fact use the same significance level at each step. Besides, the upper bound that can be worked out in the linear case (see, for example, Teräsvirta and Mellin (1986)) remains unknown due to the identification problem mentioned above.
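The quoted upper bound can be checked numerically: the probability of at least one false rejection in the sequence of independent tests at levels $\alpha, \varrho\alpha, \varrho^2\alpha,\ldots$ is $1-\prod_{k\ge 0}(1-\alpha\varrho^k)$. A minimal sketch (ours, with an invented function name):

```python
def overall_bound(alpha, rho, steps=200):
    """Upper bound on the overall significance level of the sequential
    hidden-unit tests run at levels alpha, rho*alpha, rho^2*alpha, ..."""
    p_no_rejection = 1.0
    for k in range(steps):
        p_no_rejection *= 1.0 - alpha * rho ** k
    return 1.0 - p_no_rejection

print(round(overall_bound(0.10, 0.5), 3))  # 0.187
```

With $\alpha=0.1$ and $\varrho=1/2$ this reproduces the bound of about 0.187 stated in the text.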
In following the above path we have indeed assumed that all hidden neurons contain the variables that are originally selected for the AR-NN model. Another variant of the strategy is one in which the variables in each hidden unit are chosen individually from the set of originally selected variables. In the present context this may be done, for example, by considering the estimated parameter vector $\hat\omega_h$ of the most recently added hidden neuron, removing the variables whose coefficients have the lowest $t$-values and re-estimating the model. Anders and Korn (1999) recommended this alternative. It has the drawback that the computational burden may become high, as frequent estimation of neural network models may be involved. Because of this we suggest another technique that combines sequential testing for hidden units and variable selection. Consider equation (15). Instead of just testing a single null hypothesis, as is done with (15), we can do the following. First, remove one of the variables from the $(h+1)$-th hidden unit and test the model with $h$ hidden units against this reduced alternative. Remove each variable in turn and carry out the test. Continue by removing two variables at a time. Finally, test the model with $h$ neurons against the alternatives in which the $(h+1)$-th unit only contains a single variable and an intercept. Find the combination of variables for which the $p$-value of the test is minimized. If this $p$-value is lower than a prescribed value, the "significance level", add the $(h+1)$-th unit with the corresponding variables to the model. Otherwise accept the AR-NN model with $h$ hidden units and stop. This way of selecting the variables for each hidden unit is analogous to the variable selection technique discussed in Section 4.2.
Compared to our first strategy, this one adds to the flexibility and on average leads to more parsimonious models than the other one. On the other hand, as every choice of hidden unit involves a possibly large number of tests, we do not control the significance level of the overall hidden unit test. We do that, albeit conditionally on the variables selected, if the set of input variables is determined once and for all before choosing the number of hidden units.

Evaluation following the estimation of the final model is carried out by subjecting the model to the misspecification tests discussed in Section 4.5. If the model does not pass the tests, the model builder has to reconsider the specification. For example, important lags (or, in the more general case, exogenous variables) may be missing from the model. If parameter constancy is rejected, model (23) may be estimated and used for forecasting, but reconsidering the whole specification may often be the more sensible option of the two.

Another way of evaluating the model is out-of-sample forecasting. As AR-NN models are most often constructed for forecasting purposes, this is important. This part of the model evaluation is carried out by saving the last observations in the series for forecasting and comparing the forecast results with those from at least one benchmark model. Note, however, that the results are dependent on the observations contained in that particular prediction period and may not allow very general conclusions about the properties of the estimated model. They are also conditional on the structure of the model remaining unchanged over the forecasting period, which may not necessarily be the case. For more discussion about this, see Clements and Hendry (1999, Chapter 2).
It is useful to compare our modelling strategy with other bottom-up approaches available in the literature. Swanson and White (1995), Swanson and White (1997a), and Swanson and White (1997b) apply the SBIC model selection criterion (Schwarz 1978) as follows. They start with a linear model, adding potential variables to it until SBIC indicates that the model cannot be further improved. Then they estimate models with a single hidden unit and select regressors sequentially to it, one by one, until SBIC shows no further improvement. Next, Swanson and White add another hidden unit and proceed by adding variables to it. The selection process is terminated when SBIC indicates that no more hidden units or variables should be added, or when a predetermined maximum number of hidden units has been reached. This modelling strategy can be termed fully sequential.

Anders and Korn (1999) essentially adopt the procedure of Teräsvirta and Lin (1993) described in Section 4.4 for selecting the number of hidden units. After estimating the largest model they suggest proceeding from general-to-specific by sequentially removing those variables from hidden units whose parameter estimates have the lowest ($t$-test) $p$-values. Note that this presupposes parameterizing the hidden units as in (2), not as in (8) and (9).

Balkin and Ord (2000) select the ordered variables (lags) sequentially using a linear model and a forward stepwise regression procedure. If the $F$-test statistic of adding another lag obtains a value exceeding 2, this lag is added to the set of input variables. The number of variables selected also serves as a maximum number of hidden units. The authors suggest estimating all models from the one with a single hidden unit up to the one with the maximum number of neurons. The final choice is made using the Generalized Cross-Validation Criterion of Golub, Heath and Wahba (1979). The model for which the value of this model selection criterion is minimized is selected.

Refenes and Zapranis (1999) (see also Zapranis and Refenes (1999)) propose adding hidden units into the model sequentially (there is a flowchart in the paper indicating this). The number of units, however, is selected only after adding all units up to a predetermined maximum number, so that the procedure is not genuinely sequential. The choice is made by applying the Network Information Criterion (Murata, Yoshizawa and Amari 1994) that can be traced back to Stone (1977). The model is then pruned by removing redundant variables from the neurons and re-estimating the model. The authors also discuss misspecification testing of the estimated model, which also forms an integral part of our modelling procedure. They suggest, for example, that the hypothesis of no error autocorrelation should be tested by the Ljung-Box or the asymptotically equivalent Box-Pierce test. Unfortunately, these tests do not have their customary asymptotic null distribution when the estimated model is an AR-NN model instead of a linear autoregressive one.

Of these strategies, the Swanson and White one is computationally the most intensive, as the number of steps involving the estimation of an NN model is large. Our procedure is in this respect the least demanding. The difference between our scheme and the Anders and Korn one is that in our strategy, variable selection does not require estimation of NN models because it is wholly based on LM type tests (the model is only estimated under the null hypothesis). Furthermore, there is a possibility of omitting certain potential variables before even estimating neural network models.

Like ours, the Swanson and White strategy is truly sequential: the modeller proceeds by considering nested models. The difference lies in how to compare two nested models in the sequence. Swanson and White apply SBIC, whereas Anders and Korn and we use LM type tests. The problems with the former technique have been discussed in Section 4.4. The problem of estimating unidentified models is still more acute in the approaches of Balkin and Ord and Refenes and Zapranis. Because these procedures require the estimation of NN models up to one containing a predetermined maximum number of hidden units, several estimated models may thus be unidentified. The problem is even more serious if statistical inference is applied in subsequent pruning, as the selected model may also be unidentified. The probability of this happening is smaller in the Anders and Korn case, in particular when the sequence of hidden unit tests has gradually decreasing significance levels.
5 Monte-Carlo Study

In this section we report results from two Monte Carlo experiments. The purpose of the first one is to illustrate some features of the NN model selection strategy described in Section 4.6 and compare it with the alternative in which model selection is carried out using an appropriate model selection criterion. In the second experiment, the performance of the misspecification tests of Section 4.5 is investigated. The data are generated by the model
\[
\begin{aligned}
y_t ={}& 0.10 + [0.75 - 0.2F^*(0.05(t-90))]y_{t-1} - 0.05y_{t-4} \\
&+ [-0.80 - 0.5F^*(0.05(t-90))]F(2.24(0.45y_{t-1} - 0.89y_{t-4} + 0.09)) \\
&+ [-0.70 + 2.0F^*(0.05(t-90))]F(1.12(0.44y_{t-1} + 0.89y_{t-4} + 0.35)) + \varepsilon_t, \\
\varepsilon_t ={}& \alpha\varepsilon_{t-1} + u_t,\quad u_t\sim \mathrm{NID}(0,\sigma^2),
\end{aligned} \tag{43}
\]
where $F^*$ is defined as in (39) with $K=1$ or $2$. However, in the first experiment the time-varying components are switched off and $\alpha=0$, so that the parameters of (43) are constant over time and the errors are serially uncorrelated. The number of observations in the first experiment is either 200 or 1000; in the second one we report results for 100 observations. In every replication, the first 500 observations are discarded to eliminate the initialization effects. The number of replications equals 500.
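As an illustration, the data-generating process (43) with $K=1$ (so that $F^*$ is the logistic transition) can be simulated as follows. This is our sketch of the design: the zero starting values, the way the constant-parameter case of the first experiment is obtained, and the alignment of the time index in $F^*$ with the retained sample are all assumptions:

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def simulate_43(T, alpha=0.0, sigma=1.0, time_varying=False, seed=0):
    """Generate T observations from model (43) after a 500-observation burn-in.

    time_varying=False switches off the F* components (our reading of the
    first experiment's design); alpha=0 removes the AR(1) error structure.
    """
    rng = np.random.default_rng(seed)
    burn = 500
    n = T + burn
    y = np.zeros(n + 4)              # four zero starting values
    eps = 0.0
    for t in range(4, n + 4):
        s = t - 3 - burn             # time index of the retained sample, 1..T
        Fs = logistic(0.05 * (s - 90)) if time_varying else 0.0
        eps = alpha * eps + sigma * rng.standard_normal()
        y[t] = (0.10 + (0.75 - 0.2 * Fs) * y[t - 1] - 0.05 * y[t - 4]
                + (-0.80 - 0.5 * Fs)
                * logistic(2.24 * (0.45 * y[t - 1] - 0.89 * y[t - 4] + 0.09))
                + (-0.70 + 2.0 * Fs)
                * logistic(1.12 * (0.44 * y[t - 1] + 0.89 * y[t - 4] + 0.35))
                + eps)
    return y[-T:]
```

The linear part of (43) is stationary (dominant root well inside the unit circle) and the hidden-unit outputs are bounded, so the simulated series remains well behaved.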
5.1 Architecture Selection

Results from simulating the modelling strategy can be found in Table 1. The table also contains results on choosing the number of hidden units using SBIC. This model selection criterion was chosen for the experiment because Swanson and White (1995, 1997a,b) applied it to this problem. In this case it is assumed that the model contains the correct variables. This is done in order to obtain an idea of the behaviour of SBIC free from the effects of an incorrectly selected model.

Different results can be obtained by varying the error variance, the size of the coefficients of the hidden units or "connection strengths" and, in the case of our strategy, the significance levels. In this experiment, the significance level is halved at every step, but other choices are of course possible. It seems, at least in the present experiment, that selecting the variables is easier than choosing the right number of hidden units. In small samples, there is a strong tendency to choose a linear model but, as can be expected, nonlinearity becomes more apparent with an increasing sample size. The larger initial significance level ($\alpha=0.10$) naturally leads to larger models on average than the smaller one ($\alpha=0.05$). Overfitting is relatively rare, but the results suggest, again not unexpectedly, that the initial significance level should be lowered when the number of observations increases. Finally, improving the signal-to-noise ratio improves the performance of our strategy.