Prediction of Stellar Rotation Periods Using Regression Analysis
4.3 Minimising the Size of the Data Sets
In section 3.3.2, we described how we built two new data sets of reduced size, DS9 and DS10, from the master data set stripped of the Prot variables (DS2). Using those data sets, we trained RF and XGBoost models, in a similar way to what we did before with DS0, DS1, ..., DS8. In what follows, we describe the procedure we followed and the results we obtained.
4.3.1 Results From Models Trained With DS9
DS9 resulted from selecting the most important predictors for all the models trained so far with both RFs and XGBoost, and from filtering out stars with rotation periods outside the interval ]7, 45[ days. We were left with a data set comprising 36 predictors and 13 627 instances, which was split into training and testing sets in an 80-to-20 ratio.*
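The filtering and splitting step described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the field name prot, the function filter_and_split, the toy values, and the use of a plain random shuffle are hypothetical and are not taken from the original workflow.

```python
import random

def filter_and_split(rows, lo=7.0, hi=45.0, train_fraction=0.8, seed=0):
    """Keep rows whose period lies in the open interval ]lo, hi[ and split them.

    `rows` is a list of dicts with a hypothetical "prot" key (rotation period
    in days). A simple shuffled split is assumed here for illustration.
    """
    kept = [r for r in rows if lo < r["prot"] < hi]  # strict bounds: open interval
    rng = random.Random(seed)                        # fixed seed for reproducibility
    rng.shuffle(kept)
    n_train = round(train_fraction * len(kept))
    return kept[:n_train], kept[n_train:]

# Toy periods (days); 3.0, 7.0, 45.0 and 60.0 fall outside ]7, 45[.
toy = [{"prot": p} for p in
       [3.0, 7.0, 7.5, 12.0, 20.0, 30.0, 44.9, 45.0, 60.0, 10.0, 15.0, 25.0]]
train, test = filter_and_split(toy)
```

Note that the interval bounds are excluded, so stars with periods of exactly 7 d or 45 d are dropped along with those outside the range.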
We started by training two RF models, one on the full DS9 data set and another after feature pre-pruning, that is, after selecting the most statistically significant predictors, as we previously did for the XGBoost models. We wanted to check whether the predictive performance would be affected by applying this procedure to a data set of reduced size, such as DS9. The pruned version of DS9 contained 33 predictors (three fewer than the full set). In both cases, we searched for the optimal values of the mtry and min.node.size hyperparameters using the same grid as before, but we fixed splitrule to “variance”, since this was always found to be the best value in the previous training runs, which allowed us to save learning time. The optimal number of randomly sampled predictors considered for splitting at each node, mtry, was 15 for the full version of DS9 and 13 for the set with recommended predictors, while the optimal minimal node size was 2 for both. The training time, given that we were always using a shared machine with heavy traffic, was similar to the training time obtained before with DS6, which has approximately the same number of features as DS9.
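The search over mtry and min.node.size with splitrule held fixed can be pictured as an exhaustive loop over a two-dimensional grid. The Python sketch below is schematic only: the grid values, the function names, and the toy scoring function are hypothetical placeholders. In a real run the scoring function would train a forest and return a resampled error such as the RMSE; here the toy score is deliberately constructed so that its minimum sits at the values reported for the full DS9 set.

```python
from itertools import product

def grid_search(fit_and_score, mtry_values, min_node_sizes, splitrule="variance"):
    """Return (best_score, best_params) over the grid; lower scores are better."""
    best = None
    for mtry, min_node_size in product(mtry_values, min_node_sizes):
        params = {"mtry": mtry, "min.node.size": min_node_size,
                  "splitrule": splitrule}  # splitrule is fixed, not searched
        score = fit_and_score(params)
        if best is None or score < best[0]:
            best = (score, params)
    return best

def toy_score(params):
    """Hypothetical stand-in for a cross-validated error; NOT a real model fit."""
    return (params["mtry"] - 15) ** 2 + (params["min.node.size"] - 2) ** 2

# Hypothetical grid; the grid actually used is defined elsewhere in the text.
best_score, best_params = grid_search(toy_score, range(5, 36, 5), [1, 2, 5, 10])
```

Fixing splitrule removes one dimension from the grid, which is where the saving in learning time comes from.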
The quality assessment of the RF models trained with DS9 is presented in the first two rows of table 4.6. Overall, the performance of the model built with the recommended variables is at the same level as that of the model trained with all predictors available in DS9. On average, the models were wrong about 2.3 % of the time. The 10 % interval-based accuracy was above 90 %. The RMSE measured on the testing set was about 2.7 times larger than its training counterpart, and the testing MAE was approximately 2.8 times larger than the training one. These differences, although not significantly large, denote some degree of overfitting by the models.
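For reference, the error metrics and the test-to-train ratios quoted above can be reproduced with a short Python sketch. The definition of interval-based accuracy used here (the fraction of predictions within a given relative tolerance of the reference value) is an assumption made for illustration, and the toy arrays are hypothetical. Only the four RMSE and MAE values at the end are taken from the RFrec row of table 4.6.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def interval_accuracy(y_true, y_pred, tol):
    """Fraction of predictions within tol * |reference| of the reference (assumed definition)."""
    hits = sum(abs(p - t) <= tol * abs(t) for t, p in zip(y_true, y_pred))
    return hits / len(y_true)

# Hypothetical toy periods (days), for illustration only.
y_true = [10.0, 20.0, 30.0, 40.0]
y_pred = [10.4, 18.5, 37.0, 40.0]
toy_rmse = rmse(y_true, y_pred)
toy_mae = mae(y_true, y_pred)
toy_acc10 = interval_accuracy(y_true, y_pred, 0.10)
toy_acc5 = interval_accuracy(y_true, y_pred, 0.05)

# Test-to-train ratios from the RFrec row of table 4.6.
rmse_ratio = 1.84 / 0.684   # about 2.7
mae_ratio = 0.597 / 0.212   # about 2.8
```

A ratio close to 1 would indicate that the model generalises as well as it fits; values between 2 and 3, as here, are the basis for the remark about some degree of overfitting.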
The goodness of fit, as measured by R²adj, was circa 96 %, denoting high-quality models: on the one hand, they surpass the performance of all the models previously built with RFs and XGBoost and, on the other hand, a large fraction of the variability of the response is explained by the predictors.
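As a reminder of how the adjusted coefficient of determination penalises model size, the sketch below implements the standard textbook formula, R²adj = 1 − (1 − R²)(n − 1)/(n − p − 1), for n instances and p predictors. The function name and the example numbers are hypothetical and do not come from the models discussed here.

```python
def adjusted_r2(r2, n, p):
    """Standard adjusted R^2 for n instances and p predictors."""
    if n - p - 1 <= 0:
        raise ValueError("need n > p + 1")
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

# Hypothetical example: the penalty is visible when n is small relative to p...
small_sample = adjusted_r2(0.90, n=100, p=10)
# ...and becomes negligible when n is much larger than p.
large_sample = adjusted_r2(0.90, n=10000, p=10)
```

When the number of instances is much larger than the number of predictors, the adjusted and unadjusted values are nearly identical, so removing a few predictors changes the penalty term only slightly.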
The scatter plots of the reference stellar rotation periods vs. the predicted values for the model built with the recommended variables are illustrated in fig. 4.3.† The corresponding residuals and 10 %-error metric plots are illustrated in fig. 4.4.‡ Overall, these plots indicate that there is generally good agreement between the predicted and the reference response values. This is reinforced by the marginal histograms and density plots for both the predicted and reference values on the right panel of fig. 4.3, which are similar to each other. Most of the points fall on the y = x line, but some outliers can be seen, representing both under- and overpredicted cases. The largest errors can go up to approximately 20 d in magnitude. The relative errors range between nearly −1 and +0.5, corresponding to overpredictions of approximately twice and underpredictions of about half the real values, respectively. Except for two cases, the underpredictions are always larger than 50 % of the corresponding real values. The overpredictions can go up to 200 % of the true values, the largest of which typically lie in the 20 d to 40 d range. The situation is similar for the model built with all available variables of DS9.
*Given the reduction in the number of cases when filtering out stars, we decided to adjust the relative sizes of the training and testing sets.
†Equivalent plots for the model built with all the features available in DS9 can be found in the appendix, in fig. B.10.
‡The equivalent plots for the model built with all variables available in DS9 are illustrated in fig. B.13.
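The correspondence between the relative errors and the over- and underprediction factors quoted in the text can be checked with a small Python sketch. The sign convention, relative error = (reference − predicted) / reference, is an assumption inferred from the statement that values near −1 correspond to overpredictions of about twice the real value. The function names and the example periods are hypothetical.

```python
def relative_error(reference, predicted):
    """Relative error under the assumed convention: negative means overprediction."""
    return (reference - predicted) / reference

def predicted_over_reference(rel_err):
    """Ratio predicted / reference implied by a given relative error."""
    return 1.0 - rel_err

# A 20 d star predicted at 40 d: relative error -1, i.e. twice the real value.
over = relative_error(20.0, 40.0)
# A 20 d star predicted at 10 d: relative error +0.5, i.e. half the real value.
under = relative_error(20.0, 10.0)
```

Under this convention the relative error is bounded above by +1 for positive predictions but unbounded below, which is consistent with overpredictions reaching larger magnitudes than underpredictions.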
Table 4.6: Quality assessment of the two RF (first two rows) and the XGBoost (last row) models learnt with DS9. The indices “rec” and “all” indicate the subset of recommended variables and the full version of DS9, respectively. The XGBoost model was built upon the recommended set of variables. The considerations applied to the quality metrics in table 4.2 hold here.
DS9     µerr     acc20   acc10   acc5    RMSE train   RMSE test   MAE train   MAE test   R²adj
RFrec   0.0231   0.945   0.902   0.837   0.684        1.84        0.212       0.597      0.9601
RFall   0.0229   0.946   0.901   0.839   0.681        1.84        0.210       0.592      0.9597
XGB     0.0226   0.958   0.906   0.810   0.921        1.45        0.389       0.564      0.9758
Figure 4.3: Scatter plots of the reference rotation periods vs. the predictions for the RF model trained with the DS9 data set using recommended variables. On the left panel, the blue solid line represents the identity function, and the red dashed line the linear model between the predicted and the true values. On the right panel, the red dashed line refers to the identity function; the margins contain the histograms and density plots of the sample of predicted and reference rotation periods.