5. Case Study
5.3. Estimator training stage
In this stage we train estimators to predict each of the partially observable features used by the optimistic classifiers, as defined in section 4.3. For this, we use the data from the data wrangling phase of the first stage. We train models to estimate the features in the booking curve data, based on the calendar data and on the booking curve features that are already observable from the perspective of the feature being estimated. We make the estimations through two different methods: the additive pickup method and traditional regression. We also tried to train an estimator based on recurrent neural networks, but it was unsuccessful. From all the estimators, we select the best performing one to be used in the prediction stage.
The performance of the estimators is evaluated using the 5-fold cross-validated RMSE of the final number of rooms sold and revenue gained. The splits used at this stage are the same as the ones used in the classifier training stage, to make sure that when we evaluate the whole methodology in the prediction stage, none of the models used has seen the test observations.
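As an illustration of reusing the same splits across stages, the following is a minimal sketch in plain Python. The helper names (`make_folds`, `cross_validated_rmse`), the seed, and the toy mean predictor are assumptions made for this example and are not taken from the case study. The sketch computes a plain RMSE; how the RMSE is normalised into the percentages reported in this section is not restated here.

```python
import math
import random


def make_folds(n_obs, k=5, seed=0):
    """Assign observation indices to k folds; the same seed gives the same folds."""
    idx = list(range(n_obs))
    random.Random(seed).shuffle(idx)
    return [sorted(idx[i::k]) for i in range(k)]


def rmse(actual, predicted):
    """Root mean squared error between two equally long sequences."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def cross_validated_rmse(xs, ys, folds, fit, predict):
    """Return one RMSE per fold, training on the remaining folds each time."""
    scores = []
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in range(len(xs)) if i not in held_out]
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        preds = [predict(model, xs[i]) for i in test_idx]
        scores.append(rmse([ys[i] for i in test_idx], preds))
    return scores


# Toy estimator (hypothetical): always predicts the training mean.
def fit_mean(train_x, train_y):
    return sum(train_y) / len(train_y)


def predict_mean(model, x):
    return model


xs = list(range(20))
ys = [float(v % 4) for v in xs]

# Build the folds once and pass the same object to every training stage.
folds = make_folds(len(xs), k=5, seed=42)
scores = cross_validated_rmse(xs, ys, folds, fit_mean, predict_mean)
mean_score = sum(scores) / len(scores)
```

Building the folds once and passing them to both the classifier and the estimator evaluation is what keeps the test observations of each fold unseen by every model in the pipeline.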
With any of the methods, we estimate the number of rooms sold and revenue gained for each DBA from 364 (one year before the stay date) to 0 (the stay date). These two indicators are predicted using the same data for each specific DBA, since the number of rooms sold and the revenue gained are both observed at the same point in time. To estimate more than one step ahead, we feed the values estimated for prior DBAs to the estimators of the following DBAs. Assuming we are currently 14 days (or 0.5 months) away from the stay date, the estimators for the indicators at DBA = 14 are given only known values. The estimators for the indicators at DBA = 13 and below, however, are given the estimated values for DBA = 14, since the actual values are not known until the next day.
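The chaining of estimates described above can be sketched as follows. This is a minimal illustration, assuming one estimator per DBA that receives the curve built so far; the function name `estimate_forward`, the dictionary layout, and the constant-pickup toy estimators are hypothetical and stand in for the trained models.

```python
def estimate_forward(known, current_dba, estimators):
    """Fill in (rooms, revenue) for DBAs current_dba..0.

    known:      dict DBA -> (rooms, revenue) for DBAs already observed
                (all DBAs greater than current_dba).
    estimators: dict DBA -> callable(curve) -> (rooms, revenue), where
                curve holds every value known or estimated so far.
    """
    curve = dict(known)
    for dba in range(current_dba, -1, -1):
        # The estimator for this DBA sees known values plus earlier estimates.
        curve[dba] = estimators[dba](curve)
    return curve


def make_toy_estimator(dba):
    """Hypothetical estimator: one extra room and 100.0 extra revenue per day."""
    def estimator(curve):
        rooms, revenue = curve[dba + 1]
        return (rooms + 1, revenue + 100.0)
    return estimator


estimators = {dba: make_toy_estimator(dba) for dba in range(15)}
known = {16: (10, 1000.0), 15: (12, 1200.0)}

curve = estimate_forward(known, current_dba=14, estimators=estimators)
final_rooms, final_revenue = curve[0]
```

Here the estimator for DBA = 14 only reads observed values, while the estimator for DBA = 13 reads the estimate just produced for DBA = 14, and so on down to the stay date.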
For the additive pickup method, we used the formulation defined in section 4.3.1: the value for a specific DBA is estimated by adding the difference in value seen the previous year between that DBA and the following one to the last known (or estimated) value of this year. This method yielded a 5-fold cross-validated RMSE for the final number of rooms sold between 3.441% and 19.061%, with a mean RMSE of 15.828%. For the final revenue gained, the RMSE values were between 2.514% and 15.188%, with a mean RMSE of 11.567%. The RMSE values for the final number of rooms sold and revenue gained, from 0 to 12 months away from the stay date, are shown in Fig 29 and Fig 30, respectively.
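One reading of the additive pickup step can be sketched in a few lines of Python. The exact formulation is the one in section 4.3.1, which is not reproduced here, so this is an assumption: the pickup for a DBA is taken as last year's value at that DBA minus last year's value one day earlier in the booking curve (DBA + 1). The function name, the data layout, and the toy numbers are hypothetical, and the sketch assumes last year's curve has already been aligned to the stay date being estimated.

```python
def additive_pickup(last_year, this_year_known, current_dba):
    """Estimate this year's curve for DBAs current_dba..0.

    last_year:       dict DBA -> value observed for the matching stay date last year.
    this_year_known: dict DBA -> value already observed this year
                     (must contain current_dba + 1).
    """
    curve = dict(this_year_known)
    for dba in range(current_dba, -1, -1):
        # Pickup seen last year between the preceding observation and this DBA.
        pickup = last_year[dba] - last_year[dba + 1]
        # Add it to the last known (or previously estimated) value of this year.
        curve[dba] = curve[dba + 1] + pickup
    return curve


# Toy numbers, for illustration only.
last_year = {3: 40, 2: 50, 1: 65, 0: 70}
this_year_known = {3: 45}

curve = additive_pickup(last_year, this_year_known, current_dba=2)
final_estimate = curve[0]
```

In this toy case the estimate at each DBA is last year's curve shifted by the 5 rooms that this year is ahead at DBA = 3, which is the defining property of an additive pickup.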
For the traditional regression estimators, we tested random forests and linear regressions. Since the features given to the estimators are highly correlated, we fitted the linear regression using least angle regression (LARS), tuning the L1 regularization parameter with in-sample cross validation.
With this linear regression, the RMSE values for the final number of rooms sold varied between 2.715% and 17.344%, with a mean RMSE of 14.087%. The RMSE values for the final revenue gained varied between 3.817% and 17.588%, with a mean RMSE of 11.994%. With the random forests, the RMSE values for the final number of rooms sold varied between 3.840% and 18.698%, with a mean RMSE of 14.423%. The RMSE values for the final revenue gained varied between 4.335% and 17.454%, with a mean RMSE of 11.631%. The RMSE values for the final number of rooms sold and revenue gained, from 0 to 12 months away from the stay date, are shown in Fig 29 and Fig 30, respectively, for both the linear regression and the random forest.
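For readers who want to reproduce a comparable setup, the two regression estimators could be fitted along the following lines. This is a sketch under assumptions, not the configuration used in the case study: it uses scikit-learn's `LassoLarsCV` (LARS with a cross-validated L1 penalty) and `RandomForestRegressor`, and the synthetic, highly correlated features, the number of trees, and the seeds are all hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LassoLarsCV

# Synthetic stand-in for highly correlated booking curve features (assumption).
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 1))
X = np.hstack([base + 0.05 * rng.normal(size=(200, 1)) for _ in range(5)])
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=200)

# LARS path with the L1 penalty chosen by cross validation on the training data.
lars = LassoLarsCV(cv=5).fit(X, y)

# Random forest with an illustrative, untuned configuration.
forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

lars_pred = lars.predict(X[:3])
forest_pred = forest.predict(X[:3])
chosen_alpha = lars.alpha_  # L1 penalty selected by the inner cross validation
```

The inner cross validation of `LassoLarsCV` runs only on the data passed to `fit`, so when it is used inside the outer 5-fold evaluation the test fold stays untouched.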
On average across both features, the random forest performed slightly better than the linear regression, and it also performed better than the linear regression closer to the stay date. From a business standpoint, performing well close to the stay date matters more than performing well further away: a suboptimal decision made far from the stay date still leaves time to react and adjust the strategy, whereas one made close to it does not. For these two reasons, the random forest was selected as the traditional regression estimator to be used in the prediction stage.
Fig 29. RMSE for the final rooms sold from 0 to 12 months to the stay date for all estimators
Fig 30. RMSE for the final revenue gained from 0 to 12 months to the stay date for all estimators
We also tried to train a sequential estimator with a recurrent neural network, but we were unsuccessful. We tested building the model as defined in section 4.3.3 with different network architectures. However, what had the most impact was the way we trained it. We tested training it with complete data, where every feature in the training set was known, and with simulated missing data, where we gave the model training data as seen from different numbers of days before the stay date. In the first scenario, which was the best performing, the model yielded reasonably good results up to 1 week from the stay date. Beyond that, the RMSE increased sharply, reaching 20% for both indicators at 3 months from the stay date and exceeding 40% beyond 9 months from the stay date. In the second scenario, the model completely ignored the booking curve data, so its accuracy was nearly constant between 0 and 12 months to the stay date.