Data imputation analysis for Cosmic Rays time series
R.C. Fernandes a,⇑, P.S. Lucio b, J.H. Fernandez b
a Programa de Pós-Graduação em Ciências Climáticas, Universidade Federal do Rio Grande do Norte, Natal/RN 59078970, Brazil
b Departamento de Ciências Atmosféricas e Climáticas, Universidade Federal do Rio Grande do Norte, Natal/RN 59078970, Brazil
Received 30 July 2016; received in revised form 12 February 2017; accepted 13 February 2017. Available online 22 February 2017.
Abstract
The occurrence of missing data in Galactic Cosmic Ray (GCR) time series is inevitable, since data are lost through mechanical and human failures, technical problems and the different operating periods of GCR stations. The aim of this study was to perform multiple imputation in order to reconstruct the observational dataset. The study used the monthly GCR time series of Climax (CLMX) and Rome (ROME) from 1960 to 2004 to simulate scenarios with 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80% and 90% of missing data in the observed ROME series, with 50 replicates each. The CLMX station was then used as a proxy for filling these scenarios. Three different methods for monthly dataset imputation were selected: AMELIA II, which runs a bootstrap Expectation Maximization algorithm; MICE, which runs an algorithm based on Multivariate Imputation by Chained Equations; and MTSDI, an Expectation Maximization algorithm-based method for imputation of missing values in multivariate normal time series. The synthetic time series were compared with the observed ROME series using several skill measures, such as RMSE, NRMSE, Agreement Index, R, R², F-test and t-test. For CLMX and ROME, the R and R² statistics were equal to 0.98 and 0.96, respectively. Increasing the number of gaps degraded the quality of the reconstructed time series. Data imputation was most efficient with the MTSDI method, which gave negligible errors and the best skill coefficients. The results suggest a limit of about 60% of missing data for the imputation of monthly averages. It is noteworthy that the CLMX, ROME and KIEL stations present no missing data in the target period. This methodology allowed reconstructing 43 time series.
© 2017 COSPAR. Published by Elsevier Ltd. All rights reserved.
Keywords: Bootstrap; Expectation maximization; Skill; Multivariate; Chained equations
1. Introduction
A major problem in the study of Galactic Cosmic Ray (GCR) time series is the difficulty of finding long-term series without gaps. Data losses can be caused by mechanical or technical failures and by human error. Thus, several GCR studies are restricted to a few stations distributed around the globe. The missing-data problem is not peculiar to GCR time series; it is also observed in several other areas, such as Meteorology, Biomedicine and Information Systems datasets, among others.
Over the past decades, the historical GCR time series has been reconstructed using sunspot numbers and cosmogenic ¹⁰Be isotope levels found in both of Earth's polar ice caps (Usoskin et al., 2002, 2005; Mursula et al., 2003; McCracken, 2004; McCracken and Beer, 2007). However, this leads to some questions, such as: (a) Is it possible to create a synthetic series from another Neutron Monitor (NM) station? (b) What is the criterion for filling data gaps? and (c) Which GCR stations would be filled?
Therefore, the main aim of this study was to report the use of the multiple imputation method in order to analyze its efficiency on the reconstruction of observational GCR data.
http://dx.doi.org/10.1016/j.asr.2017.02.022
0273-1177/© 2017 COSPAR. Published by Elsevier Ltd. All rights reserved.
⇑Corresponding author.
2. Material and methods
2.1. Dataset
Monthly GCR databases from the Russian Academy of Sciences (http://www.wdcb.ru/stp/) and from the Geophysical World Data Center (GWDC) for Solar-Terrestrial Physics, for the 1960–2004 period (http://www.wdcb.ru/stp/data/cosmic.ray/Neutron_Monitors(monthly_values)/), were used. The spatial distribution of the stations is shown in Fig. 1a. Fig. 1b shows the locations of the Climax station (CLMX, Lat. = 39.37°, Long. = 106.1°, Alt. = 3400 m, Cut-Off Rigidity 2.99 GV, 17 NM64) and the Rome station (ROME, Lat. = 41.9°, Long. = 12.52°, Alt. = 60 m, Cut-Off Rigidity 6.32 GV).
Fig. 1. (a) Spatial distribution of the GCR NM stations over the globe, in the Mollweide projection, and (b) locations of the Climax (CLMX) and Rome (ROME) GCR NM stations, in the azimuthal equal-area (azequalarea) projection.
Fig. 2. (a) Observed monthly GCR time series for Climax (in blue) and Rome (in magenta) and (b) their correlation (ROME intensity against CLMX intensity; fitted line y = 2362.39 + 0.56x, R² = 0.96), for the period from January 1st, 1960 to December 31st, 2004. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)
Fig. 3. Average coefficients (a) d, (b) R, (c) R² and (d) NSE as a function of the percentage of missing values, calculated with the AMELIA II (blue), MICE (green) and MTSDI (red) models. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)
2.2. Simulation and assessment
For this study, two GCR stations, CLMX and ROME, were selected because their time series are complete for the chosen period (from 1960 to 2004). CLMX was adopted as the reference station and ROME as the simulated one. Overall, 9 different scenarios for ROME were prepared by randomly drawing 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80% and 90% of the observational data (N = 540 months). For each scenario, 50 replicates were performed.
A feature of the R program that performs a random "draw", based on the total length of the series and the desired amount of missing data, was used. Since a series has a length of 540 months, the 10% scenario, for example, consisted of drawing 54 observations, replacing them by NA and then performing the imputations. This procedure was repeated for the remaining percentages. The initial simulation tried to faithfully represent the several observed series that have missing data. MTSDI achieved the imputations with less error both for continuous runs of missing data and for randomly distributed gaps.
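The random draw described above can be sketched as follows. This is a minimal illustration in Python rather than the R code used in the study; the function name `mask_series`, the seeds and the stand-in ROME values are assumptions made for the example, and `None` plays the role of R's NA.

```python
import random

def mask_series(series, fraction, seed=None):
    """Return a copy of `series` with round(fraction * N) randomly drawn values set to None."""
    rng = random.Random(seed)
    n_missing = round(fraction * len(series))
    missing_idx = set(rng.sample(range(len(series)), n_missing))
    return [None if i in missing_idx else v for i, v in enumerate(series)]

# Hypothetical stand-in for the 540 monthly ROME values (1960-2004).
rome = [4000.0 + (i % 12) for i in range(540)]

# Nine scenarios (10% ... 90% missing) with 50 replicates each, as in the text.
scenarios = {
    pct: [mask_series(rome, pct / 100, seed=1000 * pct + rep) for rep in range(50)]
    for pct in range(10, 100, 10)
}

# The 10% scenario removes 54 of the 540 months.
n_missing_10 = sum(v is None for v in scenarios[10][0])
print(n_missing_10)  # 54
```

Each masked replicate would then be passed, together with the complete CLMX series, to the imputation method under test.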
Then, the R packages AMELIA II (Honaker et al., 2011), MICE (Multivariate Imputation by Chained Equations) and MTSDI (Multivariate Time Series Data Imputation) were used for missing data imputation. The CLMX time series served as a proxy for filling the ROME scenarios. Several synthetic series for ROME were thus created, and each scenario was compared with the original ROME series.
2.2.1. AMELIA II
The AMELIA II software (Honaker et al., 2011) uses an expectation maximization (EM) algorithm based on the bootstrap methodology, considered fast and robust (Dempster et al., 1977; McLachlan and Krishnan, 1997; Horton and Kleinman, 2007; Honaker et al., 2011, 2017). The bootstrap method treats a sample as a "pseudo-population", randomly generating other datasets with the same size as the original series (Kline, 2015). This distribution has a 95% confidence interval, with sample replication of the average distribution on the bootstrap samples (Schmidheiny, 2012). In this study, 100 replicates were used. For the imputation of missing data, an imputation value was randomly selected, based on the series distribution and on the number of generated replications.
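The resampling idea behind the bootstrap can be illustrated with a short Python sketch. It shows only the "pseudo-population" step (resampling with replacement to the original size, 100 replicates) and a percentile interval; it is not the EM-with-bootstrap algorithm implemented in the package, and the function name and sample values are hypothetical.

```python
import random
import statistics

def bootstrap_means(sample, n_replicates=100, seed=0):
    """Resample `sample` with replacement (same size) and return the mean of each replicate."""
    rng = random.Random(seed)
    n = len(sample)
    return [statistics.fmean(rng.choices(sample, k=n)) for _ in range(n_replicates)]

# Hypothetical observed values treated as the pseudo-population.
observed = [4100.0, 4250.0, 4180.0, 4320.0, 4275.0, 4390.0, 4210.0, 4150.0]

means = sorted(bootstrap_means(observed, n_replicates=100))

# Percentile interval covering roughly the central 95% of the bootstrap means.
lower = means[int(0.025 * len(means))]
upper = means[int(0.975 * len(means)) - 1]
print(lower, upper)
```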
2.2.2. MICE
The MICE method is based on chained equations (Van Buuren et al., 2006; Van Buuren and Groothuis-Oudshoorn, 2011). Synthetic data allocation was carried out separately for each variable, using the other variables as predictors (Kline, 2015). At each step of the algorithm, a given missing value is imputed according to the predictor variable (Kline, 2015). This process is repeated continuously to impute the missing data using the Gibbs sampling procedure until the process reaches convergence, as defined by Kline (2015). For the imputation of the GCR flux, a continuous variable, a linear regression model was used.
Fig. 4. Average NRMSE coefficients (in %) for the respective imputation scenarios (percentage of missing values) in the AMELIA II (blue), MICE (green) and MTSDI (red) models. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)
Fig. 5. Comparison between the original ROME station GCR time series and the series reconstructed for missing data scenarios 1–9 (10% to 90%), for the period from January 1960 to December 2004, as obtained using the (a) AMELIA II, (b) MICE and (c) MTSDI models.
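The regression step of Section 2.2.2 can be illustrated with a deliberately simplified Python sketch: a single deterministic pass that fits a line on the months where both stations are observed and fills the gaps from the proxy. The actual MICE procedure additionally draws from the predictive distribution and iterates, which is not reproduced here. The function names and the data are hypothetical; the toy target series simply follows the line reported in Fig. 2b.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = a + b * x; returns (a, b)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    return my - b * mx, b

def impute_from_proxy(target, proxy):
    """Fill None entries of `target` with predictions from a line fitted on the complete pairs."""
    pairs = [(p, t) for p, t in zip(proxy, target) if t is not None]
    a, b = fit_line([p for p, _ in pairs], [t for _, t in pairs])
    return [a + b * p if t is None else t for p, t in zip(proxy, target)]

# Hypothetical proxy values and a target that follows y = 2362.39 + 0.56x exactly.
clmx = [3000.0, 3200.0, 3400.0, 3600.0, 3800.0, 4000.0]
rome = [2362.39 + 0.56 * x for x in clmx]
rome_gapped = [rome[0], None, rome[2], rome[3], None, rome[5]]

filled = impute_from_proxy(rome_gapped, clmx)
print(filled)
```

Because the toy data are exactly linear, the two gaps are recovered up to rounding error; with real data the imputed values would carry the scatter of the regression.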
2.2.3. MTSDI
The MTSDI method (Junger et al., 2003; Junger and Leon, 2012) uses the EM algorithm together with the Autoregressive Integrated Moving Average (ARIMA) method, also known as the Box–Jenkins model (Box and Jenkins, 1976; Meyler et al., 1998). An ARIMA (p, d, q) model is specified by the number of autoregressive terms (p), the number of differences (d) and the number of moving average terms (q) (Meyler et al., 1998). The default configuration was used.
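For reference, the role of the three orders can be seen in the usual textbook form of an ARIMA (p, d, q) model. The notation below (backshift operator B, coefficients φ and θ, white-noise term ε) is an assumption introduced for illustration and is not taken from the paper or from the package documentation.

```latex
% ARIMA(p, d, q) for a series y_t, with B y_t = y_{t-1}
\left(1 - \sum_{k=1}^{p} \phi_k B^{k}\right) (1 - B)^{d}\, y_t
  = c + \left(1 + \sum_{k=1}^{q} \theta_k B^{k}\right) \varepsilon_t
```

Here p counts the autoregressive coefficients φ, d is the number of times the series is differenced, and q counts the moving average coefficients θ.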
2.3. Evaluation of imputation methods
The performance of the data imputations was quantitatively evaluated by comparing the synthetic and observed series, relying on the goodness-of-fit functions for comparison of simulated and observed hydrological time series available in the hydroGOF package (Zambrano-Bigiarini, 2014). The following measures were calculated:
The Root Mean Square Error (RMSE) is the standard deviation of the model prediction error. A lower value indicates better model performance (Eq. (1)):
$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(S_i - O_i\right)^2} \tag{1}$$
The Normalized Root Mean Square Error (NRMSE) is the relative sample standard deviation of the differences between predicted ($S_i$) and observed ($O_i$) values, given in percentage (Eq. (2)):
$$\mathrm{NRMSE} = 100\,\frac{\sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(S_i - O_i\right)^2}}{nval} \tag{2}$$
where
$$nval = \begin{cases} \mathrm{sd}(O_i), & norm = \text{"sd"} \\ O_{\max} - O_{\min}, & norm = \text{"maxmin"} \end{cases}$$
The Nash-Sutcliffe Efficiency coefficient (NSE) determines the relative magnitude of the residual variance, compared to the data variance measurement (Nash and Sutcliffe, 1970). NSE can range from negative infinity to 1, with 1 indicating perfect fit (Eq. (3)):
$$\mathrm{NSE} = 1 - \frac{\sum_{i=1}^{N}\left(S_i - O_i\right)^2}{\sum_{i=1}^{N}\left(O_i - \bar{O}\right)^2} \tag{3}$$
where $\bar{O}$ is the mean of the observed values.
The Agreement Index (d) is a standardized measure of the model prediction error ranging from 0 to 1 (Willmott, 1981), where 1 indicates perfect match and 0 indicates no agreement at all. This index is sensitive to extreme values due to squared differences (Legates and McCabe, 1999) (Eq. (4)):
$$d = 1 - \frac{\sum_{i=1}^{N}\left(O_i - S_i\right)^2}{\sum_{i=1}^{N}\left(\left|S_i - \bar{O}\right| + \left|O_i - \bar{O}\right|\right)^2} \tag{4}$$
Fig. 6. Comparison between the original CLMX station GCR time series and the series reconstructed for missing data scenarios 1–9 (10% to 90%), for the period from January 1960 to December 2004, as obtained using the (a) AMELIA II, (b) MICE and (c) MTSDI models.
The Linear Correlation Coefficient (R) is the covariance ($s_{O_i S_i}$) between $O_i$ (observed) and $S_i$ (predicted), divided by the product of the standard deviations of the observed ($s_{O_i}$) and simulated ($s_{S_i}$) data. This coefficient is given by Eq. (5):
$$R = \frac{s_{O_i S_i}}{s_{O_i}\, s_{S_i}} \tag{5}$$
The Determination Coefficient (R²) is the square of the linear correlation coefficient R (Eq. (6)):
$$R^2 = \left(\frac{s_{O_i S_i}}{s_{O_i}\, s_{S_i}}\right)^2 \tag{6}$$
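The skill measures listed in this section can be written compactly in Python. This is an illustrative sketch and not the hydroGOF implementation used in the study: the function names are hypothetical, the "sd" normalization uses the sample standard deviation (N - 1 denominator) as an assumption, R² is computed as the square of R, and the five-point series are made up.

```python
import math

def _mean(values):
    return sum(values) / len(values)

def rmse(sim, obs):
    """Root mean square error between simulated and observed values."""
    return math.sqrt(sum((s - o) ** 2 for s, o in zip(sim, obs)) / len(obs))

def nrmse(sim, obs, norm="sd"):
    """RMSE in percent of sd(obs) (norm='sd') or of max(obs) - min(obs) (norm='maxmin')."""
    if norm == "sd":
        m = _mean(obs)
        nval = math.sqrt(sum((o - m) ** 2 for o in obs) / (len(obs) - 1))
    elif norm == "maxmin":
        nval = max(obs) - min(obs)
    else:
        raise ValueError("norm must be 'sd' or 'maxmin'")
    return 100.0 * rmse(sim, obs) / nval

def nse(sim, obs):
    """Nash-Sutcliffe efficiency: 1 minus residual sum of squares over observed variability."""
    m = _mean(obs)
    return 1.0 - sum((s - o) ** 2 for s, o in zip(sim, obs)) / sum((o - m) ** 2 for o in obs)

def agreement_index(sim, obs):
    """Willmott's index of agreement d."""
    m = _mean(obs)
    num = sum((o - s) ** 2 for s, o in zip(sim, obs))
    den = sum((abs(s - m) + abs(o - m)) ** 2 for s, o in zip(sim, obs))
    return 1.0 - num / den

def pearson_r(sim, obs):
    """Linear correlation coefficient: covariance over the product of standard deviations."""
    ms, mo = _mean(sim), _mean(obs)
    cov = sum((s - ms) * (o - mo) for s, o in zip(sim, obs))
    return cov / math.sqrt(sum((s - ms) ** 2 for s in sim) * sum((o - mo) ** 2 for o in obs))

# Made-up observed and simulated values for illustration.
obs = [1.0, 2.0, 3.0, 4.0, 5.0]
sim = [1.1, 1.9, 3.2, 3.8, 5.1]

print(rmse(sim, obs), nrmse(sim, obs, "maxmin"), nse(sim, obs))
print(agreement_index(sim, obs), pearson_r(sim, obs), pearson_r(sim, obs) ** 2)
```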
Subsequently, the F-test between the observed and simulated ROME series was applied, with significance level α = 0.05, where $H_0: \sigma^2_{ROME} = \sigma^2_{SIM}$ and $H_1: \sigma^2_{ROME} \neq \sigma^2_{SIM}$.
Then, the t-test was also applied between the observed and simulated ROME series, with significance level α = 0.05, where $H_0: \mu_{ROME} = \mu_{SIM}$ and $H_1: \mu_{ROME} \neq \mu_{SIM}$.
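The statistics behind these two tests can be sketched as follows. The text does not state which variant of the t-test was applied, so the unequal-variance (Welch) form is used here as an assumption; the function names and data are hypothetical. Turning the statistics into p-values requires the F and t distributions from a statistics library and is not shown.

```python
import math
import statistics

def f_statistic(obs, sim):
    """Ratio of sample variances, the statistic for H0: equal variances."""
    return statistics.variance(obs) / statistics.variance(sim)

def welch_t_statistic(obs, sim):
    """Difference of means over its standard error (Welch form), for H0: equal means."""
    se = math.sqrt(statistics.variance(obs) / len(obs) + statistics.variance(sim) / len(sim))
    return (statistics.fmean(obs) - statistics.fmean(sim)) / se

# Made-up series: the second has twice the spread and twice the mean of the first.
observed = [1.0, 2.0, 3.0, 4.0, 5.0]
simulated = [2.0, 4.0, 6.0, 8.0, 10.0]

f_value = f_statistic(observed, simulated)        # 2.5 / 10.0
t_value = welch_t_statistic(observed, simulated)  # -3 / sqrt(0.5 + 2.0)
print(f_value, t_value)
```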
The coefficients were calculated and the time series obtained with the free, open-source software R, the R Project for Statistical Computing, version 3.1.2 (R Development Core Team, 2014). Finally, the graphics and figures were produced with the lattice package (Sarkar, 2008).
3. Results and discussion
Fig. 2a shows that the CLMX and ROME time series have similar profiles, differing in the intensity of the GCR flux. The correlation between these two series corresponds to R = 0.98 and R² = 0.96 (Fig. 2b). It is well known that GCR is modulated by the magnetic field of the Sun, Earth's magnetic rigidity, geographical position and altitude (Usoskin et al., 2005; Zhou et al., 2006; Herbst et al., 2013; Ahluwalia, 2014). These modulations correspond to time periods ranging from 11 years to secular variations (Solanki et al., 2000; Berggren et al., 2009).
It was observed that increasing the percentage of imputed data (from 10% up to 90% of imputation, in 10% steps) is negatively reflected in the synthetic series, with a gradual loss of efficiency over the respective scenarios (1–9). Analyzing the d index (Willmott, 1981), which ranges from 0 for the worst model to 1 for the best, it was found that MTSDI (0.999 ± 0.001 ≥ d_MTSDI ≥ 0.977 ± 0.03) behaved similarly to MICE (0.998 ± 0.001 ≥ d_MICE ≥ 0.965 ± 0.009), with MTSDI obtaining the best performance. AMELIA II (0.953 ≥ d_AMELIA ≥ 0.475) (Fig. 3a) provided the most discrepant results, being considered the worst model for these imputations.
Fig. 3b also shows that the best correlations for the different scenarios were obtained with MTSDI (0.998 ± 0.001 ≥ R_MTSDI ≥ 0.976 ± 0.013) and MICE (0.996 ± 0.001 ≥ R_MICE ≥ 0.965 ± 0.009). AMELIA II (0.908 ≥ R_AMELIA ≥ 0.12) significantly corrupted the series, showing the worst correlations.
Fig. 3c shows that the MTSDI (0.997 ± 0.001 ≥ R²_MTSDI ≥ 0.953 ± 0.026) and MICE (0.992 ± 0.003 ≥ R²_MICE ≥ 0.88 ± 0.023) models yielded determination coefficients close to 1, with MTSDI being the model with the best fit. Once again, AMELIA II (0.825 ≥ R²_AMELIA ≥ 0.016) proved to be the worst model, showing significant correlation losses as the amount of imputations increased.
Fig. 3d shows the NSE coefficients for scenarios 1–9 (from 10% to 90% of missing data, respectively). Both MICE (0.992 ± 0.003 ≥ NSE_MICE ≥ 0.876 ± 0.026) and MTSDI (0.997 ± 0.001 ≥ NSE_MTSDI ≥ 0.930 ± 0.072) were satisfactory; however, the efficiency of MTSDI was greater than that of MICE. Again, AMELIA II was the most discrepant model for these simulations, obtaining the lowest efficiency results.
It was observed (Fig. 4) that NRMSE_MTSDI and NRMSE_MICE were lower than 24.23 ± 10.64% and 34.978 ± 3.588%, respectively, in all scenarios. Imputations made by the AMELIA II method did not produce satisfactory results, with NRMSE_AMELIA yielding 42.798% for scenario 1 (10% of imputations) and 142.022% for scenario 9 (90% of imputations).
The series resulting from the imputations made by AMELIA II (Fig. 5a) did not represent the expected behavior of the ROME time series in any of the scenarios, which explains the poor coefficients reported above. MICE (Fig. 5b) and MTSDI (Fig. 5c) fitted the real series well (also in accordance with the indexes obtained above), both being capable of reproducing the expected ROME time series profile, although the MTSDI model provided the best fits.
The adopted AMELIA II methodology failed to satisfactorily reproduce the behavior of the observed series. AMELIA II is based on the bootstrapping method, which randomly replicates the series 100 times, and the imputation of missing data follows the sample distribution. AMELIA II had poor performance, while MICE and MTSDI best represented the observed data. The AMELIA II package considers that all variables in a dataset have a multivariate normal (MVN) distribution, using the mean and covariance to summarize the data. Because the imputation is carried out randomly, it failed to represent the observed GCR data. MICE assumes that the probability that a given value is missing depends only on the observed values, so that these can be used to predict it; the imputation is then made with a linear regression, combining the results. MTSDI, based on the EM algorithm, takes into account the temporal correlation structure of GCR, which is modeled with the aid of Autoregressive Integrated Moving Average models, ARIMA (p, d, q). Thus, MTSDI could better detect the behavior patterns of the GCR time series.
Would it be possible for the Rome station to serve as a proxy for the imputation of missing CLMX data? Yes. The procedures previously described were repeated, this time simulating gaps in the CLMX series. The results were similar for AMELIA II (Fig. 6a), MICE (Fig. 6b) and MTSDI (Fig. 6c), with MTSDI presenting the best result.
For the F-test, the 5% significance threshold was already crossed from the third simulation scenario (30% of missing data) for AMELIA II (Fig. 7a), whereas this only occurred from the eighth scenario (80% of missing data) for MICE (Fig. 7b) and from the sixth scenario (60% of missing data) for MTSDI (Fig. 7c). Below these percentages of missing data (p > 0.05), the reconstructed series do not significantly differ from the original one, leading to the acceptance of the equality of variance hypothesis.
For the t-test, the 5% significance threshold was likewise crossed from the third simulation scenario (30% of missing data) for AMELIA II (Fig. 8a), from the eighth scenario (80% of missing data) for MICE (Fig. 8b) and from the sixth scenario (60% of missing data) for MTSDI (Fig. 8c). Again, series with percentages of missing data below these limits do not significantly differ from the original ROME series, so the hypothesis of equal means is accepted. Data imputation performed by the MTSDI method was more efficient, yielding p-values close to 1.
According to the F-test, the 5% significance threshold was crossed at 60% of missing data, thus suggesting this limit for imputation, valid for GCR time series of monthly averages. At 60% of missing data, the mean NRMSE, d, R and R² coefficients were equal to 14.598 ± 4.662%, 0.994 ± 0.01, 0.989 ± 0.012 and 0.977 ± 0.023, respectively. Other studies have used various percentages of missing data, such as 33% (Rodwell et al., 2014), 20% (Nunes et al., 2009) and 27.1% (Nunes et al., 2010).
Taking the 1960–2004 time series as 100%, stations with up to 60% of missing data in this interval were selected, even when they operated in different periods. In total, 43 GCR stations around the world had up to 60% of missing data for the period analyzed here (1960–2004) (Fig. 9a). In the same period, the CLMX, KIEL and ROME stations showed no missing data at all (0% of missing data). The JUNG and JUNG1 stations had 0.4% (2 months) and 59.8% (323 months) of missing data, respectively. Fig. 9a shows that all the time series plotted present similar behavior. After the imputation process, various synthetic series were created, as shown in Fig. 9b, displayed at the right side of the observed data plot for visual comparison purposes. The efficiency of the imputations was verified with cross-correlation.
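The selection rule and the percentages quoted above can be checked with a few lines of Python. The missing-month counts for CLMX, KIEL, ROME, JUNG and JUNG1 are those given in the text; the station labeled HYPOTHETICAL and the function name are assumptions added only to show a rejected case.

```python
N_MONTHS = 540     # January 1960 to December 2004
LIMIT_PCT = 60.0   # imputation limit suggested by the results

def missing_percentage(n_missing, n_total=N_MONTHS):
    """Share of missing months, in percent of the full 1960-2004 period."""
    return 100.0 * n_missing / n_total

# Missing-month counts from the text; 'HYPOTHETICAL' is an invented station with 400 gaps.
missing_months = {"CLMX": 0, "KIEL": 0, "ROME": 0, "JUNG": 2, "JUNG1": 323, "HYPOTHETICAL": 400}

selected = [name for name, m in missing_months.items() if missing_percentage(m) <= LIMIT_PCT]

print(round(missing_percentage(2), 1))    # 0.4
print(round(missing_percentage(323), 1))  # 59.8
print(selected)  # ['CLMX', 'KIEL', 'ROME', 'JUNG', 'JUNG1']
```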
4. Summary and final remarks
Given the importance of GCR time series, both the missing data problem and the criteria for imputation have been investigated. The results were satisfactory, showing that reliable synthetic series can be created with different imputation methods, namely MICE and MTSDI. In this study, the AMELIA II method proved to be inefficient compared to the others. To verify the imputation quality, the results obtained with the "goodness-of-fit" measures were compared, showing that these indexes can be used to compare the models applied. The present analysis suggests that up to 60% of missing data in a time series is acceptable for creating a reliable synthetic series, according to the methodology adopted, provided that the gaps are randomly distributed over the series. Following the criteria adopted, 43 GCR station series were completed by imputation, with the production of various synthetic series. Based on these results, this study suggests and recommends the use of imputation methods to complete gapped GCR time series.
Acknowledgments
To the Brazilian Coordination for the Improvement of Higher Education Personnel (CAPES) and to the Research Support Foundation of Rio Grande do Norte State (FAPERN) for granting a doctoral fellowship to the author. Paulo S. Lucio is sponsored by a PQ2 grant (Proc. 307988/2013-9) from CNPq (Brazil). The author thanks George U. Pedra for contributions to this article.
References
Ahluwalia, H.S., 2014. Sunspot activity and cosmic ray modulation at 1 a.u. for 1900–2013. Adv. Space Res. 54 (8), 1704–1716.
[Fig. 9, panels (a) and (b): GCR intensity against time in years, 1960–2004. Stations in the legend: CLMX, KIEL, ROME, JUNG, MCMD, MOSC, APTY, KERG, IRKT, THUL, CALG, OULU, NWRK, TBLS, YKTK, LMKS, SOPO, SNAE, AATA, DPRV, GSBY, PTFM, NVBK, TXBY, MGDN, MTNR, HUAN, TERA, MTWL, MTWS, AATB, KIEV, TSMB, TKYO, DRBS, DRHM, CAPS, LEED, IRK2, MWSN, ALRT, BJNG, IRK3, MXCO, ZUGS, JUNG1.]
Berggren, A.M., Beer, J., Possnert, G., Aldahan, A., Kubik, P., Christl, M., Johnsen, S.J., Abreu, J., Vinther, B.M., 2009. A 600-year annual ¹⁰Be record from the NGRIP ice core, Greenland. Geophys. Res. Lett. 36 (11), L11801. http://dx.doi.org/10.1029/2009GL038004.
Box, G.E.P., Jenkins, G.M., 1976. Time Series Analysis: Forecasting and Control, second ed. Holden-Day, San Francisco.
Dempster, A.P., Laird, N.M., Rubin, D.B., 1977. Maximum likelihood from incomplete data via the EM algorithm. J. Roy. Stat. Soc.: Ser. B (Methodol.) 39, 1–38.
Herbst, K., Kopp, A., Heber, B., 2013. Influence of the terrestrial magnetic field geometry on the cutoff rigidity of cosmic ray particles. Ann. Geophys. 31, 1637–1643.
Honaker, J., King, G., Blackwell, M., 2011. Amelia II: a program for missing data. J. Stat. Softw. 45 (7), 1–47.
Honaker, J., King, G., Blackwell, M., Amelia II: A Program for Missing Data. http://gking.harvard.edu/amelia.
Horton, N.J., Kleinman, K.P., 2007. Much ado about nothing: a comparison of missing data methods and software to fit incomplete data regression models. American Statistician 61 (1), 79–90.
Junger, W., Leon, A.P., 2012. mtsdi: Multivariate Time Series Data Imputation. R package 0.3, 3, <https://CRAN.R-project.org/package=mtsdi>.
Junger, W., Leon, A.P., Santos, N., 2003. Missing data imputation in multivariate time series via EM Algorithm. Cadernos do IME 15, 8–21.
Kline, R.B., 2015. Bootstrapping. In: Principles and Practice of Structural Equation Modeling, fourth ed. Guilford Press, New York, NY, pp. 60–61.
Legates, D.R., McCabe, G.J., 1999. Evaluating the use of "goodness-of-fit" measures in hydrologic and hydroclimatic model validation. Water Resour. Res. 35 (1), 233–241.
McCracken, K.G., 2004. Geomagnetic and atmospheric effects upon the cosmogenic ¹⁰Be observed in polar ice. J. Geophys. Res. 109, A04101. http://dx.doi.org/10.1029/2003JA010060.
McCracken, K.G., Beer, J., 2007. Long term changes in the cosmic ray intensity at Earth, 1428–2005. J. Geophys. Res.: Space Phys. 112. http://dx.doi.org/10.1029/2006JA012117.
McLachlan, G.J., Krishnan, T., 1997. The EM Algorithm and Extensions. John Wiley and Sons, New York, NY.
Meyler, A., Kenny, G., Quinn, T., 1998. Forecasting Irish inflation using ARIMA models. In: Central Bank and Financial Services Authority of Ireland Technical Paper Series. Munich, Germany, No. 3/RT/98 (December 1998), pp. 1–48.
Mursula, K., Usoskin, I.G., Kovaltsov, G.A., 2003. Reconstructing the long-term cosmic ray intensity: linear relations do not work. Ann. Geophys. 21 (4), 863–867.
Nash, J.E., Sutcliffe, J.V., 1970. River flow forecasting through conceptual models Part I – a discussion of principles. J. Hydrol. 10 (3), 282–290.
Nunes, L.N., Klück, M.M., Fachel, J.M.G., 2009. Uso da imputação múltipla de dados faltantes: uma simulação utilizando dados epidemiológicos. Cadernos de Saúde Pública (Reports in Public Health). Rio de Janeiro 25 (2), 268–278 (in Portuguese).
Nunes, L.N., Klück, M.M., Fachel, J.M.G., 2010. Comparação de métodos de imputação única e múltipla usando como exemplo um modelo de risco para mortalidade cirúrgica. Rev. Bras Epidemiol. 13 (4), 596–606 (in Portuguese).
R Development Core Team, 2014. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, ISBN: 3-900051-07-0.
Rodwell, L., Lee, K.J., Romaniuk, H., Carlin, J.B., 2014. Comparison of methods for imputing limited-range variables: a simulation study. BMC Med. Res. Methodol. 14. http://dx.doi.org/10.1186/1471-2288-14-57.
Sarkar, D., 2008. Lattice: Multivariate Data Visualization with R. Springer-Verlag, New York, NY, ISBN: 978-0-387-75968-5.
Schmidheiny, K., 2012. The bootstrap, in: Short Guides to Microeconometrics. Spring, Basel, Switzerland: Universität Basel, 1–10, <http://kurt.schmidheiny.name/teaching/bootstrap2up.pdf>.
Solanki, S.K., Schüssler, M., Fligge, M., 2000. Evolution of the Sun's large-scale magnetic field since the Maunder minimum. Nature 408, 445–447.
Usoskin, I.G., Mursula, K., Solanki, S.K., Schüssler, M., Kovaltsov, G.A., 2002. A physical reconstruction of cosmic ray intensity since 1610. J. Geophys. Res.: Space Phys. 107 (A11). http://dx.doi.org/10.1029/2002JA009343.
Usoskin, I.G., Alanko-Huotari, K., Kovaltsov, G.A., Mursula, K., 2005. Heliospheric modulation of cosmic rays: monthly reconstruction for 1951–2004. J. Geophys. Res.: Space Phys. 110 (A12). http://dx.doi.org/10.1029/2005JA011250.
Van Buuren, S., Groothuis-Oudshoorn, K., 2011. MICE: Multivariate imputation by chained equations in R. J. Stat. Softw. 45 (3). http://dx.doi.org/10.18637/jss.v045.i03.
Van Buuren, S., Brand, J.P., Groothuis-Oudshoorn, C.G.M., Rubin, D. B., 2006. Fully conditional specification in multivariate imputation. J. Stat. Comput. Simul. 76 (12), 1049–1064.
Willmott, C.J., 1981. On the validation of models. Phys. Geogr. 2 (2), 184–194.
Zambrano-Bigiarini, M., 2014. hydroGOF: Goodness-of-fit functions for comparison of simulated and observed hydrological time series. R package version 0.3-8, <http://CRAN.R-project.org/package=hydroGOF>.
Zhou, D., O’Sullivan, D., Semones, E., Heinrich, W., 2006. Radiation field of Cosmic Rays measured in low Earth orbit by CR-39 detectors. Adv. Space Res. 37 (9), 1764–1769.