Data imputation analysis for Cosmic Rays time series


R.C. Fernandes a,⇑, P.S. Lucio b, J.H. Fernandez b

a Programa de Pós-Graduação em Ciências Climáticas, Universidade Federal do Rio Grande do Norte, Natal/RN 59078970, Brazil

b Departamento de Ciências Atmosféricas e Climáticas, Universidade Federal do Rio Grande do Norte, Natal/RN 59078970, Brazil

Received 30 July 2016; received in revised form 12 February 2017; accepted 13 February 2017 Available online 22 February 2017

Abstract

The occurrence of missing data in Galactic Cosmic Ray (GCR) time series is inevitable, since data are lost through mechanical and human failure, technical problems and the different periods of operation of GCR stations. The aim of this study was to perform multiple dataset imputation in order to reconstruct the observational dataset. The study used the monthly GCR time series of Climax (CLMX) and Rome (ROME) from 1960 to 2004 to simulate scenarios of 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80% and 90% of missing data relative to the observed ROME series, with 50 replicates each. The CLMX station was then used as a proxy for the imputation of these scenarios. Three different methods for monthly dataset imputation were selected: AMÉLIA II, which runs a bootstrap Expectation Maximization algorithm; MICE, which runs an algorithm via Multivariate Imputation by Chained Equations; and MTSDI, an Expectation Maximization algorithm-based method for imputation of missing values in multivariate normal time series. The synthetic time series were compared with the observed ROME series using several skill measures, such as RMSE, NRMSE, Agreement Index, R, R², F-test and t-test. For CLMX and ROME, the R and R² statistics were equal to 0.98 and 0.96, respectively. Increases in the number of gaps degraded the quality of the time series. Data imputation was most efficient with the MTSDI method, with negligible errors and the best skill coefficients. The results suggest a limit of about 60% of missing data for the imputation of monthly averages. It is noteworthy that the CLMX, ROME and KIEL stations present no missing data in the target period. This methodology allowed 43 time series to be reconstructed.

Keywords: Bootstrap; Expectation maximization; Skill; Multivariate; Chained equations

1. Introduction

A major problem in the study of Galactic Cosmic Ray (GCR) time series is the difficulty of finding a long-term series without gaps. Data losses can be caused by mechanical or technical failure and by human error. As a result, several GCR studies are restricted to a few stations distributed around the globe. The missing data problem is not unique to GCR time series; it is also observed in several other areas, such as Meteorology, Biomedicine and Information Systems datasets, among others.

Over the past decades, the historical GCR time series has been reconstructed using sunspot numbers and cosmogenic ¹⁰Be isotope levels found in both of Earth's polar ice caps (Usoskin et al., 2002, 2005; Mursula et al., 2003; McCracken, 2004; McCracken and Beer, 2007). However, this leads to some questions, such as: (a) Is it possible to create a synthetic series from another Neutron Monitor (NM) station? (b) What is the criterion for filling data gaps? and (c) Which GCR stations would be filled?

Therefore, the main aim of this study was to report the use of the multiple imputation method and to analyze its efficiency in the reconstruction of observational GCR data.

http://dx.doi.org/10.1016/j.asr.2017.02.022

⇑Corresponding author.


www.elsevier.com/locate/asr


2. Material and methods

2.1. Dataset

Monthly GCR databases from the Russian Academy of Sciences (http://www.wdcb.ru/stp/) and from the Geophysical World Data Center (GWDC) for Solar-Terrestrial Physics, for the 1960–2004 period (http://www.wdcb.ru/stp/data/cosmic.ray/Neutron_Monitors(monthly_values)/), were used. The spatial distribution of the stations is shown in Fig. 1a. Fig. 1b shows the locations of the Climax station (CLMX, Lat. = 39.37°, Long. = 106.1°, Alt. = 3400 m, Cut-Off Rigidity 2.99 GV, 17 NM64) and the Rome station (ROME, Lat. = 41.9°, Long. = 12.52°, Alt. = 60 m, Cut-Off Rigidity 6.32 GV).

Fig. 1. (a) Spatial distribution of the GCR NM stations over the globe, in the Mollweide projection, and (b) locations of the Climax (CLMX) and Rome (ROME) GCR NM stations, in the azimuthal equal-area (azequalarea) projection.

Fig. 2. GCR intensity: (a) Climax (in blue) and Rome (in magenta) GCR monthly observed time series and (b) their correlation for the period from January 1st, 1960 to December 31st, 2004. Panel (b) is annotated with the fitted line y = 2362.39 + 0.56x, R² = 0.96. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

Fig. 3. Average coefficients (a) d, (b) R, (c) R² and (d) NSE as a function of the percentage of missing values, calculated by the AMÉLIA II (blue), MICE (green) and MTSDI (red) models. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

2.2. Simulation and assessment

For this study, two GCR stations, CLMX and ROME, were selected because their time series are complete for the chosen period (from 1960 to 2004). CLMX was adopted as the reference station and ROME as the simulated one. Overall, 9 different scenarios were prepared for ROME by randomly drawing 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80% and 90% of the observational data (N = 540 months). For each scenario, 50 replicates were performed.

A feature of the R program that performs a random "draw" based on the total length of the series and the desired amount of missing data was used. The series has a length of 540 months; for the 10% scenario, for instance, 54 observations were drawn and replaced by NA, and the imputations were performed afterwards. This procedure was repeated for the remaining percentages. The initial simulation tried to represent faithfully the several observed series that have missing data. In these simulations, MTSDI achieved imputations with less error both for continuous runs of missing data and for randomly distributed gaps.
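The random draw described above can be sketched as follows. This is a minimal illustration in Python rather than the R code used in the study; the function name `make_gap_scenario`, the seeding scheme and the placeholder series `rome_like` are assumptions introduced here, and `None` plays the role of R's NA.

```python
import random


def make_gap_scenario(series, percent, seed=None):
    """Return a copy of `series` with `percent`% of its values replaced by None."""
    rng = random.Random(seed)
    n_missing = round(len(series) * percent / 100)
    # Draw the positions to blank out, without replacement.
    missing_idx = set(rng.sample(range(len(series)), n_missing))
    return [None if i in missing_idx else x for i, x in enumerate(series)]


# Hypothetical stand-in for the 540 monthly ROME values (not real data).
rome_like = [4000.0 + (i % 12) for i in range(540)]

# One replicate per scenario: 10% to 90% of missing data in 10% steps.
scenarios = {
    pct: make_gap_scenario(rome_like, pct, seed=pct)
    for pct in range(10, 100, 10)
}
```

Repeating the call with 50 different seeds per percentage would give the 50 replicates per scenario mentioned in the text.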

Then, the R packages AMÉLIA II (Honaker et al., 2011), MICE (Multivariate Imputation by Chained Equations) and MTSDI (Multivariate Time Series Data Imputation) were used for missing data imputation. The CLMX time series served as a proxy for the imputation of the ROME scenarios. In this way, several synthetic series for ROME were created, and each scenario was compared with the original ROME series.

2.2.1. AMÉLIA II

The AMÉLIA II software (Honaker et al., 2011) uses an expectation maximization (EM) algorithm based on the bootstrap methodology, which is considered fast and robust (Dempster et al., 1977; McLachlan and Krishnan, 1997; Horton and Kleinman, 2007; Honaker et al., 2011, 2017). The bootstrap method treats a sample as a "pseudo-population", randomly generating other data sets of the same size as the original series (Kline, 2015). A 95% confidence interval is obtained from the distribution of the averages over the bootstrap samples (Schmidheiny, 2012). In this study, 100 replicates were used. To impute a missing value, an imputation value was randomly selected, based on the distribution of the series and on the number of generated replications.
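The bootstrap idea described above (resampling a series with replacement and reading a 95% interval from the distribution of the resampled means) can be sketched as below. This illustrates bootstrapping only, not the EM-with-bootstrap algorithm of the AMÉLIA II package; the function name, the seed and the sample values are assumptions made for the example.

```python
import random
import statistics


def bootstrap_mean_interval(sample, n_replicates=100, level=0.95, seed=0):
    """Percentile interval for the mean from `n_replicates` bootstrap resamples."""
    rng = random.Random(seed)
    n = len(sample)
    # Each replicate has the same size as the original sample (with replacement).
    means = sorted(
        statistics.fmean(rng.choices(sample, k=n)) for _ in range(n_replicates)
    )
    # Drop the same number of replicate means from each tail.
    cut = int((1 - level) / 2 * n_replicates)
    return means[cut], means[n_replicates - 1 - cut]


# Hypothetical monthly intensities (not real data).
sample = [4000.0, 4100.0, 4050.0, 3950.0, 4200.0, 4150.0, 4020.0, 3980.0]
lower, upper = bootstrap_mean_interval(sample)
```

With 100 replicates, as in the study, the interval is read from the 3rd and 98th ordered means; more replicates would give a smoother estimate.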

2.2.2. MICE

The MICE method is based on chained equations (Van Buuren et al., 2006; Van Buuren and Groothuis-Oudshoorn, 2011). Synthetic data allocation was carried out separately for each variable, using the other variables as predictors (Kline, 2015). At each step of the algorithm, a given missing value is imputed according to the predictor variable (Kline, 2015). This process is repeated to impute the missing data using the Gibbs sampling procedure until it reaches convergence, as defined by Kline (2015). For the imputation of the GCR flux, a continuous variable, a linear regression model was used.

Fig. 4. Average NRMSE coefficients (in %) as a function of the percentage of missing values, for the respective imputation scenarios in the AMÉLIA II (blue), MICE (green) and MTSDI (red) models. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

Fig. 5. Comparison between the original ROME station GCR time series and the reconstructed series for missing data scenarios 1–9, for the period from January 1960 to December 2004, as obtained using the (a) AMÉLIA II, (b) MICE and (c) MTSDI models.
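The regression step that MICE uses for a continuous variable can be illustrated with a simplified, single-pass sketch: a linear model of the gapped series on the proxy series is fitted to the observed pairs, and each gap is filled with the prediction plus a random residual. This is not the MICE package algorithm (which iterates over variables and draws the regression parameters as well); the function name and the toy data are assumptions for illustration.

```python
import random
import statistics


def regression_impute(target, proxy, seed=0):
    """Fill None entries of `target` from `proxy` by stochastic linear regression."""
    rng = random.Random(seed)
    pairs = [(x, y) for x, y in zip(proxy, target) if y is not None]
    mean_x = statistics.fmean([x for x, _ in pairs])
    mean_y = statistics.fmean([y for _, y in pairs])
    # Ordinary least squares fit on the observed (proxy, target) pairs.
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in pairs) / sxx
    intercept = mean_y - slope * mean_x
    residual_sd = statistics.pstdev(
        [y - (intercept + slope * x) for x, y in pairs]
    )
    # Observed values are kept; gaps get prediction plus random noise.
    return [
        y if y is not None
        else intercept + slope * x + rng.gauss(0.0, residual_sd)
        for x, y in zip(proxy, target)
    ]


# Toy data with an exact linear relation (hypothetical, not the station data).
proxy = [4000.0 + 10.0 * i for i in range(20)]
truth = [100.0 + 0.5 * x for x in proxy]
gapped = [None if i % 4 == 0 else y for i, y in enumerate(truth)]
filled = regression_impute(gapped, proxy)
```

Because the toy relation is exactly linear, the residual spread is essentially zero and the filled values reproduce the removed ones; with real station data the added noise reflects the scatter around the fitted line.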

2.2.3. MTSDI

The MTSDI method (Junger et al., 2003; Junger and Leon, 2012) uses the EM algorithm with the Autoregressive Integrated Moving Average (ARIMA) method, also known as the Box–Jenkins model (Box and Jenkins, 1976; Meyler et al., 1998). An ARIMA (p, d, q) model is specified by the number of autoregressive terms (p), the number of differences (d) and the number of terms in the moving average (q) (Meyler et al., 1998). The default configuration was used.
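For reference, the roles of p, d and q can be made explicit with the standard textbook form of an ARIMA (p, d, q) model. The notation below (series y_t, backshift operator B, white-noise term ε_t, coefficients φ and θ) is the usual Box–Jenkins convention and is an assumption here, not notation taken from the paper or from the MTSDI package.

```latex
\left(1 - \sum_{k=1}^{p} \phi_k B^k\right) (1 - B)^d \, y_t
  = \left(1 + \sum_{k=1}^{q} \theta_k B^k\right) \varepsilon_t ,
\qquad B^k y_t = y_{t-k}
```

Here p counts the autoregressive coefficients φ_k, d is the number of times the series is differenced, and q counts the moving-average coefficients θ_k.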

2.3. Evaluation of imputation methods

Relying on the goodness-of-fit functions for comparison of simulated and observed hydrological time series available in the hydroGOF package (Zambrano-Bigiarini, 2014), the performance of the data imputations was quantitatively evaluated by comparing the synthetic and observed series. The following measures were calculated:

The Root Mean Square Error (RMSE) is the standard deviation of the prediction errors of the model. A lower value indicates better model performance (Eq. (1)):

$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(S_i - O_i\right)^2} \qquad (1)$$

The Normalized Root Mean Square Error (NRMSE) is the relative sample standard deviation of the differences between predicted (S_i) and observed (O_i) values, given in percentage (Eq. (2)):

$$\mathrm{NRMSE} = 100\,\frac{\sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(S_i - O_i\right)^2}}{nval} \qquad (2)$$

where

$$nval = \begin{cases} \mathrm{sd}(O_i), & norm = \text{"sd"} \\ O_{i,\max} - O_{i,\min}, & norm = \text{"maxmin"} \end{cases}$$

The Nash-Sutcliffe Efficiency coefficient (NSE) determines the relative magnitude of the residual variance compared to the variance of the measured data (Nash and Sutcliffe, 1970). NSE can range from negative infinity to 1, with 1 indicating a perfect fit (Eq. (3)):

$$\mathrm{NSE} = 1 - \frac{\sum_{i=1}^{N}\left(S_i - O_i\right)^2}{\sum_{i=1}^{N}\left(O_i - \bar{O}\right)^2} \qquad (3)$$

The Agreement Index (d) is a standardized measure of the model prediction error ranging from 0 to 1 (Willmott, 1981), where 1 indicates a perfect match and 0 indicates no agreement at all. This index is sensitive to extreme values because of the squared differences (Legates and McCabe, 1999) (Eq. (4)):

$$d = 1 - \frac{\sum_{i=1}^{N}\left(O_i - S_i\right)^2}{\sum_{i=1}^{N}\left(\left|S_i - \bar{O}\right| + \left|O_i - \bar{O}\right|\right)^2} \qquad (4)$$

Fig. 6. Comparison between the original CLMX station GCR time series and the reconstructed series for missing data scenarios 1–9, for the period from January 1960 to December 2004, as obtained using the (a) AMÉLIA II, (b) MICE and (c) MTSDI models.

The Linear Correlation Coefficient (R) is the covariance between O_i (observed) and S_i (predicted) divided by the product of the standard deviations of the observed and simulated data. This coefficient is given by Eq. (5):

$$R = \frac{s_{O_i S_i}}{s_{O_i}\, s_{S_i}} \qquad (5)$$

The Determination Coefficient (R²) is given by Eq. (6):

$$R^2 = \frac{\sum_{i=1}^{n}\left(S_i - O_i\right)^2}{\sum_{i=1}^{n} O_i - \bar{O}} \qquad (6)$$
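The skill measures defined in this section can be computed with a few lines of code. The sketch below is a plain-Python illustration, not the hydroGOF implementation used in the study; the function names are chosen here for convenience, `sim` and `obs` stand for S_i and O_i, and R² is taken as the square of R, an assumption that is consistent with the reported values R = 0.98 and R² = 0.96.

```python
import math
import statistics


def rmse(sim, obs):
    """Root mean square error between simulated and observed values."""
    return math.sqrt(statistics.fmean([(s - o) ** 2 for s, o in zip(sim, obs)]))


def nrmse(sim, obs, norm="sd"):
    """RMSE in percent of sd(obs) (norm='sd') or of the range (norm='maxmin')."""
    nval = statistics.stdev(obs) if norm == "sd" else max(obs) - min(obs)
    return 100.0 * rmse(sim, obs) / nval


def nse(sim, obs):
    """Nash-Sutcliffe efficiency: 1 is a perfect fit."""
    mean_o = statistics.fmean(obs)
    num = sum((s - o) ** 2 for s, o in zip(sim, obs))
    return 1.0 - num / sum((o - mean_o) ** 2 for o in obs)


def agreement_index(sim, obs):
    """Willmott's index of agreement d, between 0 and 1."""
    mean_o = statistics.fmean(obs)
    num = sum((o - s) ** 2 for s, o in zip(sim, obs))
    den = sum((abs(s - mean_o) + abs(o - mean_o)) ** 2 for s, o in zip(sim, obs))
    return 1.0 - num / den


def pearson_r(sim, obs):
    """Covariance of sim and obs divided by the product of their spreads."""
    mean_s, mean_o = statistics.fmean(sim), statistics.fmean(obs)
    cov = sum((s - mean_s) * (o - mean_o) for s, o in zip(sim, obs))
    ss = sum((s - mean_s) ** 2 for s in sim)
    so = sum((o - mean_o) ** 2 for o in obs)
    return cov / math.sqrt(ss * so)


# Toy example: the simulated series is the observed one shifted by 1.
obs = [1.0, 2.0, 3.0, 4.0, 5.0]
sim = [2.0, 3.0, 4.0, 5.0, 6.0]
r_squared = pearson_r(sim, obs) ** 2  # assumption: R² is the square of R
```

The toy example shows why several measures are reported together: a constant offset leaves R equal to 1 while RMSE, NSE and d all register the error.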

Subsequently, the F-test was applied between the observed and simulated ROME series, with significance level α = 0.05, where $H_0: \sigma^2_{ROME} = \sigma^2_{SIM}$ and $H_1: \sigma^2_{ROME} \neq \sigma^2_{SIM}$.

Then, the t-test was also applied between the observed and simulated ROME series, with significance level α = 0.05, where $H_0: \mu_{ROME} = \mu_{SIM}$ and $H_1: \mu_{ROME} \neq \mu_{SIM}$.
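The statistics behind these two tests can be sketched as follows. The text does not state which variant of the t-test was applied, so the unequal-variance (Welch) form is assumed here; the function names and toy data are also illustrative. Only the test statistics are computed; turning them into p-values requires the F and t distributions from a statistics library, which is not shown.

```python
import math
import statistics


def variance_ratio_f(obs, sim):
    """F statistic for H0: equal variances (ratio of sample variances)."""
    return statistics.variance(obs) / statistics.variance(sim)


def welch_t(obs, sim):
    """t statistic for H0: equal means, without assuming equal variances."""
    standard_error = math.sqrt(
        statistics.variance(obs) / len(obs) + statistics.variance(sim) / len(sim)
    )
    return (statistics.fmean(obs) - statistics.fmean(sim)) / standard_error


# Toy series (hypothetical): the simulated one has doubled values.
observed = [1.0, 2.0, 3.0, 4.0, 5.0]
simulated = [2.0, 4.0, 6.0, 8.0, 10.0]
f_stat = variance_ratio_f(observed, simulated)
t_stat = welch_t(observed, simulated)
```

An F statistic near 1 and a t statistic near 0 are what a faithful reconstruction should produce, which corresponds to the large p-values reported for the better imputation methods.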

To calculate the coefficients and to obtain the time series, the free open-source software R (the R Project for Statistical Computing), version 3.1.2 (R Development Core Team, 2014), was used. Finally, the graphics and figures were produced with the lattice package (Sarkar, 2008).

3. Results and discussion

Fig. 2a shows that the CLMX and ROME time series have similar profiles, differing in the intensity of the GCR flux. The correlation between these two series corresponds to R = 0.98 and R² = 0.96 (Fig. 2b). It is well known that GCR is modulated by the magnetic field of the Sun, the geomagnetic cutoff rigidity, geographical position and altitude (Usoskin et al., 2005; Zhou et al., 2006; Herbst et al., 2013; Ahluwalia, 2014). These modulations correspond to time periods ranging from 11 years to secular variations (Solanki et al., 2000; Berggren et al., 2009).

The increase in the percentage of imputed data (from 10% up to 90%, in 10% steps) is negatively reflected in the synthetic series, with a gradual loss of efficiency across the respective scenarios (1–9). The d index (Willmott, 1981) ranges from 0 to 1, from the worst to the best model, respectively. MTSDI (0.999 ± 0.001 ≥ d_MTSDI ≥ 0.977 ± 0.03) behaved similarly to MICE (0.998 ± 0.001 ≥ d_MICE ≥ 0.965 ± 0.009), with MTSDI obtaining the best performance. AMÉLIA II (0.953 ≥ d_AMELIA ≥ 0.475) (Fig. 3a) provided the most discrepant results and was considered the worst model for these imputations.

Fig. 3b shows that the best correlations for the different scenarios were obtained with MTSDI (0.998 ± 0.001 ≥ R_MTSDI ≥ 0.976 ± 0.013) and MICE (0.996 ± 0.001 ≥ R_MICE ≥ 0.965 ± 0.009). AMÉLIA II (0.908 ≥ R_AMELIA ≥ 0.12) significantly corrupted the series, showing the worst correlations.

Fig. 3c shows that the MTSDI (0.997 ± 0.001 ≥ R²_MTSDI ≥ 0.953 ± 0.026) and MICE (0.992 ± 0.003 ≥ R²_MICE ≥ 0.88 ± 0.023) models yielded determination coefficients close to 1, with MTSDI providing the best fit. Once again, AMÉLIA II (0.825 ≥ R²_AMELIA ≥ 0.016) proved to be the worst model, showing significant losses of correlation as the amount of imputation increased.

Fig. 3d shows the NSE coefficients for scenarios 1–9 (from 10% to 90% of missing data, respectively). Both NSE_MICE (0.992 ± 0.003 ≥ NSE_MICE ≥ 0.876 ± 0.026) and NSE_MTSDI (0.997 ± 0.001 ≥ NSE_MTSDI ≥ 0.930 ± 0.072) were satisfactory; however, the efficiency of MTSDI was greater than that of MICE. Again, AMÉLIA II was the most discrepant model for these simulations, obtaining the lowest efficiency results.

Fig. 4 shows that NRMSE_MTSDI and NRMSE_MICE were lower than 24.23 ± 10.64% and 34.978 ± 3.588%, respectively, in all scenarios. Imputations made by the AMÉLIA II method did not produce satisfactory results, with NRMSE_AMELIA reaching 42.798% for scenario 1 (10% of imputations) and 142.022% for scenario 9 (90% of imputations).

The series resulting from the imputations made by AMÉLIA II (Fig. 5a) did not represent the expected behavior of the ROME time series in any of the scenarios, which explains the poor coefficients reported above. MICE (Fig. 5b) and MTSDI (Fig. 5c) fitted the real series well (in accordance with the indexes obtained above), and both were capable of reproducing the expected ROME time series profile, although the MTSDI model provided the best fits.

The adopted AMÉLIA II methodology failed to reproduce the behavior of the observed series satisfactorily. AMÉLIA II is based on the bootstrapping method, which randomly replicates the series 100 times, and the imputation of missing data follows the sample distribution. AMÉLIA II performed poorly, while MICE and MTSDI best represented the observed data. The AMÉLIA II package assumes that all variables in a dataset follow a multivariate normal (MVN) distribution and uses means and covariances to summarize the data. Because the imputed values are drawn randomly, it failed to represent the observed GCR data. MICE assumes that the probability that a given value is missing depends only on the observed values, which can therefore be used to predict it; the imputation is made with a linear regression, combining the results. MTSDI, based on the EM algorithm, takes the temporal correlation structure of GCR into account, modeling it with Autoregressive Integrated Moving Average models, ARIMA (p, d, q). Thus, MTSDI could better detect the behavior patterns of the GCR time series.

Could the ROME station serve as a proxy for the imputation of missing CLMX data? Yes. The procedures described above were repeated, this time simulating the CLMX series. The results were similar for AMÉLIA II (Fig. 6a), MICE (Fig. 6b) and MTSDI (Fig. 6c), with MTSDI presenting the best result.


For AMÉLIA II, the 5% significance threshold was already crossed from the third simulation scenario (30% of missing data) (Fig. 7a), whereas this occurred only from the eighth scenario (80% of missing data) for MICE (Fig. 7b) and from the sixth scenario (60% of missing data) for MTSDI (Fig. 7c). Below these thresholds (p > 0.05), the reconstructed series do not differ significantly from the original one, so the hypothesis of equal variances is accepted.

Likewise, the 5% significance threshold was crossed from the third simulation scenario (30% of missing data) for AMÉLIA II (Fig. 8a), from the eighth scenario (80% of missing data) for MICE (Fig. 8b) and from the sixth scenario (60% of missing data) for the MTSDI model (Fig. 8c). Again, below these limits the reconstructed series do not differ significantly from the original ROME series, so the hypothesis of equal means is accepted. Data imputation performed by the MTSDI method was the most efficient, yielding p-values close to 1.

According to the F-test, the significance threshold was crossed at 60% of missing data, which suggests this value as the limit for imputation, valid for GCR time series of monthly averages. At 60% of missing data, the mean NRMSE, d, R and R² coefficients were equal to 14.598 ± 4.662%, 0.994 ± 0.01, 0.989 ± 0.012 and 0.977 ± 0.023, respectively. Other studies have used different percentages of missing data, such as 33% (Rodwell et al., 2014), 20% (Nunes et al., 2009) and 27.1% (Nunes et al., 2010).

Taking the 1960–2004 time series as 100%, stations with up to 60% of missing data in this interval were selected, even when they operated in different periods. A total of 43 GCR stations around the world had up to 60% of missing data for the period analyzed here (1960–2004) (Fig. 9a). In the same period, the CLMX, KIEL and ROME stations showed no missing data at all (0% of missing data). The JUNG and JUNG1 stations had 0.4% (2 months) and 59.8% (323 months) of missing data, respectively. All the time series plotted in Fig. 9a present similar behavior. After the imputation process, various synthetic series were created, as shown in Fig. 9b, displayed at the right side of the observed data plot for visual comparison purposes. The efficiency of the imputations was verified with cross-correlation.

4. Summary and final remarks

Given the importance of GCR time series, both the missing data problem and the criteria for imputation have been investigated. The results were satisfactory, showing that reliable synthetic series can be created with different imputation methods, namely MICE and MTSDI. In this study, the AMÉLIA II method proved to be inefficient compared to the others. To verify the quality of the imputations, the results obtained with the "goodness-of-fit" measures were compared, showing that these indexes can be used to compare the models applied. The present analysis suggests that up to 60% of missing data in a time series is acceptable for creating a reliable synthetic series with the methodology adopted, provided that the gaps are randomly distributed over the series. Following the criteria adopted, the series of 43 GCR stations were completed by imputation, producing various synthetic series. Based on these results, this study recommends the use of imputation methods to complete gapped GCR time series.


Acknowledgments

Thanks are due to the Brazilian Coordination for the Improvement of Higher Education Personnel (CAPES) and to the Research Support Foundation of Rio Grande do Norte State (FAPERN) for granting a doctoral fellowship to the author. Paulo S. Lucio is sponsored by a PQ2 grant (Proc. 307988/2013-9) from CNPq (Brazil). The author thanks George U. Pedra for his contributions to this article.

References

Ahluwalia, H.S., 2014. Sunspot activity and cosmic ray modulation at 1 a.u. for 1900–2013. Adv. Space Res. 54 (8), 1704–1716.

[Fig. 9: GCR intensity versus time (1960–2004), panels (a) and (b), for the stations CLMX, KIEL, ROME, JUNG, MCMD, MOSC, APTY, KERG, IRKT, THUL, CALG, OULU, NWRK, TBLS, YKTK, LMKS, SOPO, SNAE, AATA, DPRV, GSBY, PTFM, NVBK, TXBY, MGDN, MTNR, HUAN, TERA, MTWL, MTWS, AATB, KIEV, TSMB, TKYO, DRBS, DRHM, CAPS, LEED, IRK2, MWSN, ALRT, BJNG, IRK3, MXCO, ZUGS and JUNG1.]

Berggren, A.M., Beer, J., Possnert, G., Aldahan, A., Kubik, P., Christl, M., Johnsen, S.J., Abreu, J., Vinther, B.M., 2009. A 600-year annual ¹⁰Be record from the NGRIP ice core, Greenland. Geophys. Res. Lett. 36 (11), L11801. http://dx.doi.org/10.1029/2009GL038004.

Box, G.E.P., Jenkins, G.M., 1976. Time Series Analysis: Forecasting and Control, second ed. Holden-Day, San Francisco.

Dempster, A.P., Laird, N.M., Rubin, D.B., 1977. Maximum likelihood from incomplete data via the EM algorithm. J. Roy. Stat. Soc.: Ser. B (Methodol.) 39, 1–38.

Herbst, K., Kopp, A., Heber, B., 2013. Influence of the terrestrial magnetic field geometry on the cutoff rigidity of cosmic ray particles. Ann. Geophys. 31, 1637–1643.

Honaker, J., King, G., Blackwell, M., 2011. Amelia II: a program for missing data. J. Stat. Softw. 45 (7), 1–47.

Honaker, J., King, G., Blackwell, M., 2017. Amelia II: A Program for Missing Data. http://gking.harvard.edu/amelia.

Horton, N.J., Kleinman, K.P., 2007. Much ado about nothing: a comparison of missing data methods and software to fit incomplete data regression models. American Statistician 61 (1), 79–90.

Junger, W., Leon, A.P., 2012. mtsdi: Multivariate Time Series Data Imputation. R package version 0.3.3, <https://CRAN.R-project.org/package=mtsdi>.

Junger, W., Leon, A.P., Santos, N., 2003. Missing data imputation in multivariate time series via EM Algorithm. Cadernos do IME 15, 8–21.

Kline, R.B., 2015. Bootstrapping. In: Principles and Practice of Structural Equation Modeling, fourth ed. Guilford Press, New York, NY, pp. 60–61.

Legates, D.R., McCabe, G.J., 1999. Evaluating the use of "goodness-of-fit" measures in hydrologic and hydroclimatic model validation. Water Resour. Res. 35 (1), 233–241.

McCracken, K.G., 2004. Geomagnetic and atmospheric effects upon the cosmogenic ¹⁰Be observed in polar ice. J. Geophys. Res. 109, A04101. http://dx.doi.org/10.1029/2003JA010060.

McCracken, K.G., Beer, J., 2007. Long term changes in the cosmic ray intensity at Earth, 1428–2005. J. Geophys. Res.: Space Phys. 112. http://dx.doi.org/10.1029/2006JA012117.

McLachlan, G.J., Krishnan, T., 1997. The EM Algorithm and Extensions. John Wiley and Sons, New York, NY.

Meyler, A., Kenny, G., Quinn, T., 1998. Forecasting Irish inflation using ARIMA models. In: Central Bank and Financial Services Authority of Ireland Technical Paper Series. Munich, Germany, No. 3/RT/98 (December 1998), pp. 1–48.

Mursula, K., Usoskin, I.G., Kovaltsov, G.A., 2003. Reconstructing the long-term cosmic ray intensity: linear relations do not work. Ann. Geophys. 21 (4), 863–867.

Nash, J.E., Sutcliffe, J.V., 1970. River flow forecasting through conceptual models Part I – a discussion of principles. J. Hydrol. 10 (3), 282–290.

Nunes, L.N., Klück, M.M., Fachel, J.M.G., 2009. Uso da imputação múltipla de dados faltantes: uma simulação utilizando dados epidemiológicos. Cadernos de Saúde Pública (Reports in Public Health). Rio de Janeiro 25 (2), 268–278 (in Portuguese).

Nunes, L.N., Klück, M.M., Fachel, J.M.G., 2010. Comparação de métodos de imputação única e múltipla usando como exemplo um modelo de risco para mortalidade cirúrgica. Rev. Bras. Epidemiol. 13 (4), 596–606 (in Portuguese).

R Development Core Team, 2014. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, ISBN: 3-900051-07-0.

Rodwell, L., Lee, K.J., Romaniuk, H., Carlin, J.B., 2014. Comparison of methods for imputing limited-range variables: a simulation study. BMC Med. Res. Methodol. 14. http://dx.doi.org/10.1186/1471-2288-14-57.

Sarkar, D., 2008. Lattice: Multivariate Data Visualization with R. Springer-Verlag, New York, NY, ISBN: 978-0-387-75968-5.

Schmidheiny, K., 2012. The bootstrap, in: Short Guides to Microeconometrics. Spring, Basel, Switzerland: Universität Basel, 1–10, <http://kurt.schmidheiny.name/teaching/bootstrap2up.pdf>.

Solanki, S.K., Schüssler, M., Fligge, M., 2000. Evolution of the Sun's large-scale magnetic field since the Maunder minimum. Nature 408, 445–447.

Usoskin, I.G., Mursula, K., Solanki, S.K., Schüssler, M., Kovaltsov, G.A., 2002. A physical reconstruction of cosmic ray intensity since 1610. J. Geophys. Res.: Space Phys. 107 (A11). http://dx.doi.org/10.1029/2002JA009343.

Usoskin, I.G., Alanko-Huotari, K., Kovaltsov, G.A., Mursula, K., 2005. Heliospheric modulation of cosmic rays: monthly reconstruction for 1951–2004. J. Geophys. Res.: Space Phys. 110 (A12). http://dx.doi.org/10.1029/2005JA011250.

Van Buuren, S., Groothuis-Oudshoorn, K., 2011. MICE: Multivariate imputation by chained equations in R. J. Stat. Softw. 45 (3). http://dx.doi.org/10.18637/jss.v045.i03.

Van Buuren, S., Brand, J.P., Groothuis-Oudshoorn, C.G.M., Rubin, D.B., 2006. Fully conditional specification in multivariate imputation. J. Stat. Comput. Simul. 76 (12), 1049–1064.

Willmott, C.J., 1981. On the validation of models. Phys. Geogr. 2 (2), 184–194.

Zambrano-Bigiarini, M., 2014. hydroGOF: Goodness-of-fit functions for comparison of simulated and observed hydrological time series. R package version 0.3-8, <http://CRAN.R-project.org/package=hydroGOF>.

Zhou, D., O'Sullivan, D., Semones, E., Heinrich, W., 2006. Radiation field of Cosmic Rays measured in low Earth orbit by CR-39 detectors. Adv. Space Res. 37 (9), 1764–1769.