
Table 3 below shows the best result obtained for each model, together with the scaler applied and whether parameter tuning and feature selection were used. Each model was trained repeatedly across combinations of these factors, and the configuration with the highest R2 was kept:

Model                             Parameter Tuning   Scaler                 Feature Selection   MAE       MSE       R2
Linear Regression                 No                 Standard Scaler        No                  0.15798   0.05263   0.40678
RANSAC Regressor                  No                 Standard Scaler        Yes                 0.15486   0.05699   0.35773
Huber Regressor                   No                 Standard Scaler        No                  0.14767   0.05421   0.38907
Decision Tree Regressor           No                 MinMax Scaler          No                  0.06692   0.01509   0.82989
Random Forest Regressor           Yes                Robust Scaler          No                  0.05798   0.01352   0.84766
Gradient Boosting Regressor       No                 MinMax Scaler          No                  0.08662   0.01812   0.79574
Hist Gradient Boosting Regressor  No                 Standard Scaler        No                  0.06950   0.01389   0.84344
AdaBoost Regressor                No                 MinMax (-1,1) Scaler   No                  0.14254   0.03539   0.60119
MLP Regressor                     Yes                Robust Scaler          No                  0.07886   0.01615   0.81800

Table 5 - Best Models Configuration

As can be seen, the highest-scoring configuration was the Random Forest Regressor, with an R2 of 0.84766. Many configurations were tried to reach this value. The code has 9 models, each of which can be run with 4 different scalers and with or without feature selection. In addition, 4 of these 9 models can be run again with the hyperparameters found by either GridSearchCV or RandomizedSearchCV. This totals 104 model configurations.
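The total of 104 follows directly from the counts given in the text. A minimal sketch of the arithmetic (the variable names are illustrative and not taken from the original code):

```python
# Counts stated in the text: 9 models, 4 scalers, feature selection used or not,
# and 4 of the 9 models re-run with tuned hyperparameters.
n_models = 9
n_tuned_models = 4
n_scalers = 4
n_feature_selection_options = 2  # feature selection used, or not used

default_runs = n_models * n_scalers * n_feature_selection_options
tuned_runs = n_tuned_models * n_scalers * n_feature_selection_options
total_runs = default_runs + tuned_runs

print(default_runs, tuned_runs, total_runs)  # 72 32 104
```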

The result of each model was obtained by following the workflow in the diagram below.

As Figure 35 shows, once the dataset had been treated, the scaling phase came next, in which all the scalers mentioned were applied. Table 3, which lists the best configuration for each model, shows that no single scaler gave the highest R2 across all models. Only the Linear, Huber and RANSAC regressors shared the same scaler, and the models based on decision trees show that this does not hold in general. This experimentation was therefore needed to reach the best result for each model. Feature selection came after that. In contrast to the variety of scalers, not using feature selection gave the best score for almost all models. Finally, in the model training phase, each path taken so far ends with the training of 13 models: the 9 listed in Table 3 plus 4 versions of the Decision Tree, Random Forest, Hist Gradient Boosting and MLP regressors in which the hyperparameters were tuned rather than left at their defaults.
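The scaling, feature selection and training phases described above could be sketched with scikit-learn as a loop over every combination. This is only an illustration under stated assumptions: the data is synthetic, only two of the models are included, and the RFE base estimator and the number of features kept are hypothetical choices that the text does not specify.

```python
from itertools import product

from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the treated dataset (assumption).
X, y = make_regression(n_samples=300, n_features=8, n_informative=5,
                       noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# The four scalers named in the text; each entry builds a fresh instance.
scalers = {
    "Standard": StandardScaler,
    "Robust": RobustScaler,
    "MinMax": MinMaxScaler,
    "MinMax (-1,1)": lambda: MinMaxScaler(feature_range=(-1, 1)),
}
# Two of the models only, to keep the sketch short.
models = {
    "Linear Regression": LinearRegression,
    "Decision Tree Regressor": lambda: DecisionTreeRegressor(random_state=0),
}

results = []
for (scaler_name, make_scaler), use_rfe, (model_name, make_model) in product(
        scalers.items(), (False, True), models.items()):
    steps = [("scaler", make_scaler())]
    if use_rfe:
        # Hypothetical RFE settings: linear base estimator, keep 5 features.
        steps.append(("rfe", RFE(LinearRegression(), n_features_to_select=5)))
    steps.append(("model", make_model()))
    pipeline = Pipeline(steps).fit(X_train, y_train)
    score = r2_score(y_test, pipeline.predict(X_test))
    results.append((model_name, scaler_name, use_rfe, score))

# Keep the highest-R2 configuration per model, as the results table does.
best = {}
for model_name, scaler_name, use_rfe, score in results:
    if model_name not in best or score > best[model_name][2]:
        best[model_name] = (scaler_name, use_rfe, score)

for model_name, (scaler_name, use_rfe, score) in best.items():
    print(f"{model_name}: scaler={scaler_name}, RFE={use_rfe}, R2={score:.3f}")
```

Wrapping each combination in a `Pipeline` means the scaler and RFE are fitted on the training split only, so the test score is not affected by information from the test data.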

The top-performing model therefore came from a run in which the Random Forest Regressor was trained after RandomizedSearchCV had been used to find suitable values for its hyperparameters. The values found were 10 for Number of Estimators, log2 for Max Features, 25 for Max Depth and Absolute Error for Criterion. The model was trained using all the variables in the dataset and the Robust Scaler. It achieved an R2 of 0.84766, with an MAE of 0.05798 and an MSE of 0.01352. The next figure is a sample of predictions made on the test split of the dataset.

[Figure: workflow diagram. Dataset -> Post Treatment Dataset -> Scaling Phase (Standard Scaler, Robust Scaler, MinMax Scaler, MinMax Custom) -> Feature Selection Phase (RFE or No RFE for each scaler) -> Model Training Phase (13 models used on each path)]

Figure 35 - Top Down Task Workflow

Figure 36 - Pred vs Actual Sample

Analysing the reasons behind the top score, the first thing to note is that the R2 value is very good, suspiciously so. This is due to the nature of the dataset. The dataset comes from a single dealership of the Caetano Baviera brand, which charges the same price for a given part, be it a filter or oil, and the same hourly labour rate, so the model became very good at predicting the prices of ubiquitous repair jobs such as an oil change or a filter change. Since the dataset also provides the model of the vehicle being repaired, these repetitive margins become very easy to predict, especially if the vehicle already appears in the dataset with a past visit for the same maintenance items.

Concerning the sale itself, the hourly labour rate is fixed, and every customer pays that rate multiplied by the active time spent on the service, meaning the actual work. Without the diversity that data from more dealerships would bring, which is discussed in the next section, it therefore becomes quite trivial for a model to predict the target variable. Dealers charge different prices because they face different challenges depending on their location, from supply chain to customer affluence. However, the model has the most trouble with unseen data concerning the small share of electric vehicles or more complicated repair jobs. As Figure 31 shows, the model makes very accurate predictions most of the time and a few that are not accurate at all, as is the case with the third, sixth and eighth bars of that histogram.

Comparing the best model's result on the test set with its result on the training set shows that the model performs extremely well on the training data, with an R2 of approximately 0.92 against 0.85 on the test set, which supports the points made above.
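A comparison of this kind can be made by scoring the same fitted model on both splits. The sketch below uses synthetic data and a default Random Forest as stand-ins, and `r2_gap` is a hypothetical helper name; it only illustrates how the training and test R2 values would be obtained side by side.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split


def r2_gap(model, X_train, y_train, X_test, y_test):
    """Return (train R2, test R2, train minus test) for a fitted model."""
    r2_train = r2_score(y_train, model.predict(X_train))
    r2_test = r2_score(y_test, model.predict(X_test))
    return r2_train, r2_test, r2_train - r2_test


# Synthetic stand-in data (assumption).
X, y = make_regression(n_samples=400, n_features=10, noise=20.0, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1)

model = RandomForestRegressor(n_estimators=50, random_state=1)
model.fit(X_train, y_train)

r2_train, r2_test, gap = r2_gap(model, X_train, y_train, X_test, y_test)
print(f"train R2={r2_train:.3f}  test R2={r2_test:.3f}  gap={gap:.3f}")
```

A large positive gap would suggest the model fits the training data more closely than it generalises to the test split.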

That being said, although the model used the entire dataset, some variables affected the result more than others, as Figure 37 shows. The most influential variables, in order of importance, can be separated into 3 bins. The first bin contains “Custo”, by a long margin, followed by “Tipo_Linha” and “Nr_Horas_Faturadas”. The second bin contains variables that are very close in importance and affect the model moderately: “Tipo_Venda”, “Tempo_Servico”, “Tipo_Cliente”, “Total_Gasto_Veiculo”, “Kms_Medios”, “Nr_Trabalho”, “Codigo_Veiculo”, “Colisao”, “Idade_Viatura” and “Modelo”. In the third bin, the variable “Electrico” and the ones identifying the brands, “BMW”, “BMW_i” and “MINI”, were of very little importance.

Figure 37 - Feature Importance
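One common way to obtain a ranking like this from a Random Forest is the impurity-based `feature_importances_` attribute in scikit-learn. Whether the original ranking was produced this way is not stated, so the following is only a sketch: it borrows four column names from the text, but the values and the target are synthetic and constructed so that the first column dominates.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Column names borrowed from the text; the values are synthetic (assumption).
feature_names = ["Custo", "Tipo_Linha", "Nr_Horas_Faturadas", "Electrico"]
X = rng.normal(size=(500, len(feature_names)))

# Synthetic target dominated by the first column and unrelated to the last.
y = (3.0 * X[:, 0] + 1.0 * X[:, 1] + 0.5 * X[:, 2]
     + rng.normal(scale=0.1, size=500))

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# Impurity-based importances sum to 1; sort from most to least important.
ranked = sorted(zip(feature_names, model.feature_importances_),
                key=lambda pair: pair[1], reverse=True)
for name, importance in ranked:
    print(f"{name:<20} {importance:.3f}")
```

Impurity-based importances can favour high-cardinality variables, so a permutation-based measure such as `sklearn.inspection.permutation_importance` on the test split may be a useful cross-check.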

Despite the complications discussed above, I believe the results achieved by this model are relevant and encouraging for any possible expansion of this study to other dealerships and brands, especially at a later stage.
