Predictive Modelling Using Machine Learning on Dealerships Profit Margin Concerning the Increase in Electric Vehicles Repairs

(1)

i

PREDICTIVE MODELLING USING MACHINE LEARNING ON DEALERSHIPS PROFIT MARGIN CONCERNING THE INCREASE IN ELECTRIC VEHICLE REPAIRS

Gonçalo Silvestre

Dissertation proposal presented as partial requirement for

obtaining the Master’s degree in Information Management

(2)

i

(3)

ii NOVA Information Management School

Instituto Superior de Estatística e Gestão de Informação Universidade Nova de Lisboa

PREDICTIVE MODELLING USING MACHINE LEARNING ON DEALERSHIPS PROFIT MARGIN CONCERNING THE INCREASE IN

ELECTRIC VEHICLE REPAIRS

by

Gonçalo Silvestre

Dissertation proposal presented as a partial requirement of the degree of Master of Information Management, with a specialization in Knowledge Management and Business Intelligence

Advisor: Bruno Damásio Co Advisor: Sandro Mendonça

October 2021

(4)

iii

ABSTRACT

The modern electric vehicle came to revolutionize the way we travel but also the way dealerships that perform car maintenance services operate. The repair services they offer make up a big slice of their generated profit, with approximately 47% of revenue from repairs being profit on average, but that is in a world where non-hybrid combustion vehicles are the majority of cars on the road, a reality we are gradually leaving behind. This study seeks to help dealerships increase their bottom line by creating a predictive model that determines the dealerships gross profit margin, concerning maintenance services, with real repair data from a BMW authorized dealership in Lisbon. All attributes that affect this gross profit margin were explored taking into special consideration the electric vehicles and the reasons and details of their visit. To achieve this objective, the Sample, Explore, Modify, Model, Assess methodology was used with Python. The models used were Linear Regression, Decision Tree Regressor, Random Forest Regressor, Gradient Boosting Regressor and MLP Regressor. Using the Random Forest Regressor with no Feature Selection and on the Robust Scaler the model scored in R2 0.85. Electric vehicles did not have the expected impact on the model performance, being quite irrelevant in the predicting the target variable.

KEYWORDS

Dealership; Electric; Car; Predictive Modelling; Profit; Automotive Industry

(5)

iv

ACKNOWLEDGEMENTS

I would like to thank Dr Rui Conceição of Caetano Baviera for allowing me access to the data that is the foundation of this study. Your prompt acceptance of my request and your immediate availability are things that I will be eternally grateful for.

I am also grateful for all the support my advisor and co-advisor Bruno Damásio and Sandro Mendonça have given me. They accepted and allowed me to run with my own idea for my thesis despite being somewhat outside their main knowledge area. Nevertheless, they were always

available to help. The cutting comments and knowledgeable remarks guided me to finish this thesis. I hope my work satisfies your high standards.

I would also like to thank my family for giving me their unconditional love and support in my endeavours and for allowing me to pursue this Masters.

I am grateful for my friends as well. They provided me with the motivation and guidance necessary to finish my work and understood the challenge ahead of me, which might have made me absent for any interactions. In this regard I’d like to thank Sara Nunes for being especially tolerant and supportive. I whole heartedly believe I would not have been able to finish this thesis without your help. Hopefully I have retributed in kind.

(6)

v

INDEX

1. Introduction ... 1

1.1. Context and Relevance ... 1

1.2. Background ... 2

1.3. Scope, Objectives and Research Questions ... 3

2. Literature Review ... 4

2.1. Profitability and Maintenance for Different Types of Vehicles ... 4

2.1.1. Li-Ion battery anatomy, degradation and charging ... 5

2.1.2. Li-Ion battery pricing ... 7

2.2. Dealership services ... 7

2.3. Profitability and Warranty Coverage ... 8

3. Methodology ... 10

3.1. Sample ... 10

3.2. Explore ... 14

3.3. Modify ... 24

3.3.1. Data Cleaning... 24

3.4. Model ... 31

3.4.1. Feature Selection ... 31

3.4.2. Models Used ... 32

3.5. Assess... 35

3.5.1. Metrics Used ... 35

4. Results and Discussion ... 36

5. Conclusions ... 40

Bibliography... 43

(7)

vi

LIST OF FIGURES

Figure 1 - Relative cost to ICE vehicles (Propfe et al., 2012) ... 4

Figure 2 - - Cost Distribution by Vehicle Type (Propfe et al., 2012) ... 4

Figure 3 - Representation of a lithium-ion battery (Keil et al., 2016) ... 5

Figure 4 - Charging Patterns ... 5

Figure 5 - Battery price evolution with battery pack and battery price split (Battery Pack Prices

Fall to an Average of $132/kWh, but Rising Commodity Prices Start to Bite, 2021) ... 7

Figure 6 - Row Distribution by Month ... 10

Figure 7 - "Nr_Trabalho" Distribution ... 15

Figure 8 - "Modelo" Distribution ... 15

Figure 9 - "BMW" Distribution ... 15

Figure 10 - "Idade_Viatura" Distribution ... 16

Figure 11 - "Tipo_Venda" Distribution ... 17

Figure 12 - "Tipo_Cliente" Distribution ... 18

Figure 13 - "Tipo_Linha" Distribution ... 18

Figure 14 - "Custo" Distribution ... 19

Figure 15 - "Custo" Distribution Post Outliers Removal ... 19

Figure 16 - "Total_Gasto_Veiculo" Distribution ... 20

Figure 17 - "Nr_Horas_Faturadas" Distribution ... 20

Figure 18 - "Nr_Horas_Faturadas" Post Outliers Removal ... 20

Figure 19 - "Mg_Bruta_Perc" Distribution ... 21

Figure 20 - "Mg_Bruta_Perc" Post Outlier Removal ... 21

Figure 21 - "Kms_Medios" Distribution ... 22

Figure 22 - "Colisao" Distribuition ... 22

Figure 23 - "Tempo_Serviço" Distribution ... 23

Figure 24 - "Electrico" Distribution ... 23

Figure 25 - Missing Values ... 24

Figure 26 - "Nr_Trabalho" Final ... 25

Figure 27 - "Custo" Final ... 25

Figure 28 - "Nr_Horas_Faturadas" Final ... 26

Figure 29 - "Mg_Bruta_Perc" Final ... 26

Figure 30 - "Total_Gasto_Veiculo" Final ... 27

Figure 31 - "Tempo_Servico" Final ... 27

Figure 32 - Feature Correlation Matrix ... 29

Figure 33 - Target Feature Correlation ... 29

(8)

vii

Figure 34 - Nested for RFE ... 31

Figure 35 - Top Down Task Workflow ... 37

Figure 36 - Pred vs Actual Sample ... 37

Figure 37 - Feature Importance ... 38

(9)

viii

LIST OF TABLES

Table 1 - Dataset Variables ... 13

Table 2 - Numerical Variable Descriptive Statistics ... 14

Table 3 - Categorical Variable Descriptive Statistics ... 14

Table 4 - Hyperparameter tuning for all used models ... 34

Table 5 - Best Models Configuration ... 36

Table 6 - Consolidated Model Metrics ... 40

(10)

ix

ACRONYMS

1. HEV – Hybrid Electric Vehicles

2. PHEV – Plug-in Hybrid Electric Vehicles 3. BEV – Battery Electric Vehicles

4. EDV – Electric Drive Vehicles 5. ICE – Internal Combustion Engine 6. ICV – Internal Combustion Vehicles 7. FCV – Fuel Cell Vehicles

8. M&R – Maintenance and Repair 9. RFE – Recursive Feature Elimination

(11)

1

1. INTRODUCTION

1.1. C

ONTEXT AND

R

ELEVANCE

Dealerships provide various services in addition to selling new and used vehicles and have a part to play in every vehicle’s life, both in the beginning and during its life cycle. This section of the dealerships has a large interval for profitability and the challenges stem from the way dealerships negotiate the parts and man hours required to perform each service. These challenges have been enlarged due to the mass emergence of the electric vehicle.

Currently, the electric vehicle has most definitively its foot firmly on the automotive market with the help of subsidies and government grants, with a study on a US federal tax credit in 2016

attributing more than 30% of BEV vehicles sales to this credit (Weldon et al., 2018). However, this alternative is far from being the main option on the market. In 2019, electric vehicles accounted for 2.6% of all worldwide vehicle sales and global stock of electric passenger cars continued to expand at a rapid pace in 2019, reaching 7.2 million units, 40% higher than in 2018. Concerning the markets themselves, in 2019 the largest ones were China, then Europe followed by the US. In these regions electric vehicle sales accounted for 4.9%, 3.5% and 2.1% respectively of all vehicle sales (IEA, 2020).

This increase in sales can also be attributed to the increase in affordability and in performance of batteries for motor vehicles applications, and people have started to view them as valid and attractive options concerning the automobile market. Consumers view the flexibility of charging at home and the potentially cheaper means of transportation as an advantage over regular cars. In addition, with the world facing consequences of climate change and global warming, removing the polluting aspect of the most ubiquitous mode of transportation in the northern hemisphere is a concern shared by many consumers, whose concerns are tackled by the electric vehicle, considered as a sustainable and environmentally friendly means of transportation.

The points explained above are quite strong evidence of the multitude of vehicle solutions that are now available in the market which adds to the challenges of supply chain management, customer retention and personnel training.

Due to these facts and the fact that every major car brand has an electric and hybrid solution available for their customers, it is now more important than ever to discuss the profitability of dealerships repair shops, and this is where the basis for this thesis lies.

More specifically, this study is relevant to the organization where the model might be applied, which is a chain of dealerships in Portugal for the BMW, BMW i and MINI brands, and it might be to automotive dealerships in general. It is relevant because the model seeks to determine profitability from historical and concrete facts gathered from the repair shop itself with the goal of optimizing potential profits to increase the dealerships bottom line and to help manage the dealerships usage of its vehicle bays before the car even made it to the appointed service.

(12)

2

1.2. B

ACKGROUND

Car repair and maintenance is a fundamental part of car ownership and as such businesses and brands have been created to offer those services at competitive prices, with two types of repair shops available. Official brand dealerships that offer repair and maintenance services, encompassed in this study, and the independent repair shops, not encompassed in this study.

The National Automobile Dealers Association states in their 2020 annual report (“NADA Data 2020”, 2020) that services and parts account for 12% of revenue generated by all US dealers in 2020 and that the average dealership generated 6.7 million dollars with 46.8% being gross profit. This makes the services offered post sale a very large contributor to a dealerships profit, precisely what we seek to analyse, while not being the most revenue generating section of the business model.

However, this section of profit being generated is currently changing as EVs and HEVs increase their popularity.

Dealerships have been increasingly dealing with cars that no longer have a typical powerplant or have both, a hybrid powerplant, as their associated brands launch and update their vehicles. This creates new challenges for dealerships as they are now working with vehicles that are fundamentally different than the usual automobiles they are accustomed to. Dealerships that have associated with brands that have a line-up of vehicles that is composed of typical vehicles, hybrid ones and electrical have seen the type of services they offer, and their technical knowledges, expand and vary, naturally.

Failure to adapt to these new challenges and this new reality would result in less profit and possibly customer dissatisfaction.

Considering the cause of the service itself, meaning the reason behind the visit to the shop, it can be either due to regular maintenance or due to a crash or other type of accident. Concerning maintenance, there are dissimilarities between these types of vehicles that will be discussed now and further in the literature review.

Normal ICE vehicles are quite maintenance heavy, at least when compared with electric vehicles, regarding their method of propulsion. Their inherent design, specially being a combustion engine, requires fluids and filters for the correct and efficient operation of said engine which itself has more moving parts than the electric engine.

This does not happen with electric cars. This is since fluids that need to be present and regularly changed in a vehicle with a typical reciprocating internal combustion engine are just not present in an electrical vehicle. But they are present in a car that is hybrid in addition to the electrical engine and all the expenses associated. Furthermore, an electric engine contains comparatively a lower number of moving parts than an internal combustion engine (Barkenbus, 2020) while a hybrid vehicle contains the sum of both the internal combustion engine mentioned and the electric motor,

themselves functioning in unison. These relatively new additions to the range of cars serviced by dealers and to the specifications of the vehicles themselves, change the way these businesses function with new parts and new practices thrown into the mix and with new inventory and training challenges surfacing.

(13)

3

1.3. S

COPE

, O

BJECTIVES AND

R

ESEARCH

Q

UESTIONS

The growth and popularization of EVs are changing inventory and material requirements of maintenance service providers implicating challenges in supplier management, due to the intricacies and nature of the vehicles in question, in addition to the fact that a significant source of profit for dealerships is the maintenance service they offer, as I’ve stated earlier.

This concern is deeply impactful on the profit of dealerships and it’s the basis for this study. This study aims to use data analytics on the data gathered by the dealership in question using their information system to attempt to predict their profit margin. This can help management in their decision making to maintain or improve their profit margin as it brings out hidden relationships and information considering all factors involved.

This study seeks to create a model that determines the profit of a typical dealership taking into consideration the percentages of the different types of cars it services, cars with a normal

powertrain, hybrids and EVs. The type of service will be taken into account, if it’s routine or resulting from an extraordinary reason, and also the type of part being replaced, if it’s either something common and kept in stock, such as oil, or something that is not kept in stock.

Based on what was discussed above, the main objective is to create a model that answers the following research questions:

1. How can gross profit margin be predicted?

2. What are the most impactful variables in predicting the target variable?

3. How does the electric vehicle impact the target variable?

4. What is the best model to predict the target variable?

(14)

4

2. LITERATURE REVIEW

This chapter will be discussing the different reports and literature on the aspects that are fundamental to comprehend in this study such as the intricacies of the multiple engines and services we are referring to and the particularities of prices and dealerships in the global economy.

2.1. P

ROFITABILITY AND

M

AINTENANCE FOR

D

IFFERENT

T

YPES OF

V

EHICLES

Comparatively to internal combustion engines, EVs have little to no maintenance. Tesla’s support section of their website states that “Tesla engineers continuously review maintenance recommendations to optimize the performance, reliability, durability, safety and resale value of your Tesla. Unlike gasoline cars, Tesla cars require no traditional oil changes, fuel filters, spark plug replacements or emission checks. As electric cars, even brake pad replacements are rare because regenerative braking returns energy to the battery, significantly reducing wear on brakes.” (Vehicle Maintenance | Tesla Support, n.d.). All the maintenance referred in the quote is where the

dealerships might make the bulk of their profit, the small and regular maintenance tasks. However, that would change in the case of a car fleet where most vehicles are electric.

A study done in 2012 (Propfe et al., 2012).

found that battery electric vehicles are in fact typically 18% less expensive to maintain and repair (M&R), as it can be seen in Figure 1, comparatively to an internal combustion engine.

It also found that in EVs, repairs related to the powertrain such as batteries account for most of the total money spent on maintenance where in ICE vehicles the cost is more spread out between component groups as displayed on Figure 2. It can be seen that M&R costs for full electric vehicles like BEVs tend to be driven by major components such as the battery components or other parts related to the electric motorization, whereas conventional ICVs as well as hybrids do not have a particular component accounting for a significant share of the overall M&R costs.

Another study (Palmer et al. (2018)), M&R costs were estimated for two countries and two states of the United States that were Japan, the UK, California and Texas. The percentage decrease of BEVs M&R costs compared with equivalent ICVs is 23% in Japan and the UK, 24%

in Texas and 30% in California.

Figure 2 - - Cost Distribution by Vehicle Type (Propfe et al., 2012)

Figure 1 - Relative cost to ICE vehicles (Propfe et al., 2012)

(15)

5 2.1.1. Li-Ion battery anatomy, degradation and charging

It is expected for EVs to change their batteries during their lifetime and the price per kilowatt hour of the battery pack being replaced will certainly influence the cost of maintenance of the vehicle for both the owner and the dealership.

Concerning the battery itself, in a most basic way, li- ion batteries are formed by the electrodes that are the 2 battery ends, the anode and the cathode, intersected by a separator and an electrolyte. Inside the anode and the cathode is lithium and as you can see in the picture the flow of electrons stemming from the lithium ions flowing between electrodes either charges or discharges the battery.

Batteries have 2 types of aging that lead to battery degradation these are calendar aging and cyclical aging.

Calendar aging comprises all aging processes that lead to a degradation of a battery cell independent of charge- discharge cycling. It is an important factor in many applications of lithium-ion batteries where the operation periods are substantially shorter than the idle intervals,

such as in electric vehicles (Keil et al., 2016). Cyclical aging on the other hand are all processes that lead to aging, related to a battery charge cycle. A charge cycle is using all the battery capacity not necessarily in just one charge.

Concerning calendar aging, studies have shown that to maximize battery life, lithium-ion cells should not be stored at high SoC corresponding to low anode potential. For long-term storage, the graphite anode should be lithiated less than 50% (Keil et al., 2016). In layman’s terms, this means that the battery must be below 50% capacity to best prevent degradation. For example, if the driver of a Nissan Leaf – which has about a 100-mile range – drives 30 miles daily, the battery will only be depleted 30% each day (Mude., 2018). This is a typical situation on commuter private cars that aids battery degradation with respect to

calendar aging.

When discussing cyclic aging, we need to specify charging levels, level 1, level 2 and Direct Current Fast Chargers (DCFC). Level 1 is just simply plugging the car to a normal outlet, and it takes the longest to fully charge, over 24 hours from empty to full;

level 2 provides more power compared to level 1 and it is what’s typically installed on public charging parking spaces. They can also be installed cheaply at home and take about 8 hours to fully charge. DCFC are the

Figure 3 - Representation of a lithium-ion battery (Keil et al., 2016)

Figure 4 - Charging Patterns

Source: Global EV Outlook 2020 (2020, p. 229)

(16)

6 most expensive due to their power draw and the fastest way to charge, normally taking 20 to 40 minutes to charge the battery from 0 to 80%.

On Figure 4, we can see the charging patterns identified on global EV outlook 2020 by the International Energy Agency on PLDVs (Passenger Light Duty Vehicles) and LCVs (Light Commercial Vehicles). It found that PLDVs are mostly used for commutes and charged at home thus most likely staying at the top percentiles of battery capacity incurring on the problem of calendar aging

mentioned above. These cars are mostly charged at home or in the office using most commonly level 1 or level 2 charging types. For LCVs they travel frequently resulting in many miles covered and consequently many cycles.

A study made in February 2018 compared the same battery packs, the ones used in the 2012 Nissan Leaf, to assess their cyclic aging. Overall, both packs experienced significant capacity fade. For instance, the AC Level 2 pack degraded 20% after 11 months of cycling after completing 660 cycles, which is an equivalent of 41,000 miles of driving under the accelerated cycling protocol. On the other hand, the DCFC pack reached 20% fade after only 9 months of cycling (540 cycles), which is

equivalent to 33,500 miles of driving (Tanvir et al., 2018). A BEV battery is typically considered to have reached the end of its life when its available capacity or maximum power under reference conditions has decreased by 20% of its original value, although the loss of capacity is typically the determining factor given that the initial power capabilities of the battery are superior to what the vehicle requires (Pelletier et al., 2017).

Another study (Knowles, 2013) found that the batteries’ aspects concerning range and efficiency was the cause for the below expected adoption of electric vehicles by the mass market, and that EVs reliability and longevity were, at the time of the study, challenges to overcome. The study stressed the importance of managing battery replacements which ties into the studies above concerning battery degradation. The uncertainty around battery life and replacement prices and costs was found to impact leases and sales which in turn impacts a dealerships repair profit margin, either negatively or positively, depending on how these obstacles are surpassed.

All these factors indicate that EVs are susceptible to battery degradation and battery replacement without factoring in the individual brand design of equipment and technology that might impact these values either negatively or positively. Factor in the uncertainty concerning batteries and a possible massification of EVs and hybrids, and these aspects could have a severe impact on dealerships repair departments.

(17)

7 2.1.2. Li-Ion battery pricing

Lithium-Ion battery packs costs have been decreasing their prices quickly now approximately costing 132 dollars per kWh (kilowatt hour) in 2021 as

compared to 2013 when they were 684 dollars per kWh as found in BloombergNEF’s annual battery price survey. “Lithium-ion battery pack prices, which were above

$1,100 per kilowatt-hour in 2010, have fallen 89% in real terms to

$132/kWh in 2021.” The same source states also that the prices for lithium have increased

substantially from 2020 to 2021 with the increase in demand.(Battery Pack Prices Fall to an Average of $132/kWh, but Rising Commodity Prices Start to Bite, 2021)

2.2. D

EALERSHIP SERVICES

On maintenance prices, recurrentauto.com compiled a list of real-world maintenance services where battery replacement was needed (Costs of Electric Car Battery Replacement, n.d.). This analysis they say, “doesn’t include the labor costs, tax policy, business models, and other

externalities related to a battery replacement.” It is also stated that “overall, it doesn’t appear that EV owners are benefiting from the Bloomberg reported average of $137/kWh price.” The cars listed are the BMW i3, Chevrolet Bolt, Nissan Leaf, Tesla Model 3 and Model S and VW e-Golf.

“In 2016, BMW reported that replacing an i3 battery would cost about $16,000. This was for the 2013-2016 model years which have 22 kWh packs, equating to $727/ kWh in real 2016 dollars.”

All Chevrolet Bolts from 2017 to 2022 are under a recall to replace their batteries due to safety issues, but prior to the recall “according to Chevrolet themselves, in 2017 they reported that the price of a battery pack for a Chevy Bolt was $262/kWh.”

“According to a post in the MyNissanLeaf.com forum, as of Jan 30, 2020, the cost of labor and replacement of a 24-kWh battery in a Nissan Leaf is $5,500. It appears the battery itself would cost

$4,500, placing the $/kWh at $187/kWh, which is 36% above the quoted 2020 $137/kWh price.”

“Current Automotive posted the 2020 receipt of one customer’s replacement of a remanufactured 75kWh battery in a Tesla Model 3: the battery itself was $13,500, Labor was

$2,299.27 totaling $15,799.27. This equates to $180/kWh which is about 31% higher than the Bloomberg cited average of $137/kWh. Since Tesla Model 3s are so new, there is likely very little data on repair costs in the public domain.” Another important aspect is that Tesla labor costs hover around 150 and 250 dollars per hour depending on location. So, if we divide the labour cost on this vehicles service by $200 per hour, we can see the service took just a bit longer than 11 hours.

Figure 5 - Battery price evolution with battery pack and battery price split (Battery Pack Prices Fall to an Average of $132/kWh, but Rising

Commodity Prices Start to Bite, 2021)

(18)

8 For the Tesla Model S there are various values. “In April 2021, FindMyElectric reported that there were a few cases where the price of the batteries cost $12,000 to $15,000. A used 2014 Tesla Model S owner reported that a full battery pack replacement would have cost $20,000, including labour, in Dec 2020. And finally on Tesla motors club, there are two reports of owners of (2012 and 2013) Model S’s being quoted in 2021 $22k for full battery replacements out of warranty.”

And for the VW e-Golf “a replacement battery for a 2017 to 2018 VW e-Golf is quoted as

$23,442.91 by Pignataro VW as of August 2021. Given that the 2017 e-golf battery pack is 35.8 kWh, that equates to $654/kWh, a whopping 477% more expensive than the average 2020 price.”

IMR states that “67% of independent repair shops indicate a portion of their business is from servicing battery electric vehicles. On average, 3.1% (or 3 of every 100 vehicles serviced) are BEVs.”

This demonstrates that ICE vehicles still are the main contributor to these businesses profit (BEV &

HEV Repair and Service Impact at Auto Repair Shops, 2021). They also elaborated a list of the 10 most common maintenance jobs where only 3 can be performed on an EV.

1. Oil/oil filter changed 2. Wiper blades replacement 3. Replace air filter

4. Scheduled maintenance 5. New tires

6. Battery replacement 7. Brake work

8. Antifreeze added 9. Engine tune-up

10. Wheels aligned/balanced

Note that the values above are for independent repair shops as such they are not applicable to dealerships but serve as an important comparison between these types of establishments.

2.3. P

ROFITABILITY AND

W

ARRANTY

C

OVERAGE

Some small aspects of warranties need to be expounded as these occurrences are not only present in our dataset but are commonplace in dealership repairs worldwide and thus impacts the predictions of the models created. Warranties can affect the repair cycles of vehicles and the profit margins of dealers. Dealers make a small profit with warranties as they are paid by the manufacturer.

The dealer is paid for the expense by the associated brand and generally makes about 5 to 10% in profit. This fact can offset the total profit of said dealership as they make less from an item that on a normal sale would earn a much larger profit. If we take the average profit percentage value

mentioned in the introduction of 46.8%, we can see that warranties impact the overall profit margin negatively. However, there is another aspect. A vehicle that requests a service due to a warranty issue technically goes to the dealership one more time than if it hadn’t run into that issue and just stayed on regular maintenance. While impacting the overall profit, it represents another instance of revenue.

(19)

9 Another aspect is the different warranty duration times and distances for different parts.

Considering the study’s focus on electric vehicles, items related to said motorizations will be investigated.

Different brands also offer different warranties. BMW, and consequently the electric vehicle sub brand BMW i, vehicles are under warranty for 36 months or 3 years after being bought. The brand offers an extension in warranty to up to 5 years or 200 thousand kilometres, whichever comes first, however BMW notes that high voltage batteries are not covered on this extended warranty package.

High voltage batteries that are present in BMW electric vehicles and hybrid vehicles have an 8 year or 160 thousand kilometres warranty. (Garantias E Assistência, n.d.)

Tesla offers a competitive warranty for their vehicles, 4 years or 50 thousand miles, or about 80 thousand kilometres. For their batteries and drive units they offer 8 years and 100 to 150 thousand miles depending on the model specification, 160 to 240 thousand kilometres. (Vehicle Warranty | Tesla Support, n.d.)

Despite being an almost impossible factor to control, warranties do in fact impact dealerships profit margins, which warranted this small mention.

(20)

10

3. METHODOLOGY

The goal of this study, as was touched on before, is to create a predictive model that can determine the dealerships profit margin with the maintenance services they offer from a set of variables taken from the dealerships information system.

Python was the basis for this study used in Jupyter Notebook, along with several data science packages. Pandas, sklearn, seaborn and numpy were invaluable in the elaboration of this study.

The study followed the SEMMA process for data mining, which consists of Sample, Explore, Modify, Model, and Assess, hence the SEMMA acronym. Sampling is comprised of inputting the data and possibly partitioning and exploring is following the input with an analysis of the data. In this phase, it is visualized, looking at variable distribution, detecting and deleting invalid variables and discovering outliers in the ones that remain. Modify consists of transforming the variables, creating new ones and deleting their outliers. Finally, Model is fitting models to the dataset after scaling the data and trimming or not the variables to use and Assess is evaluating the result of the models used on a partitioned part of the dataset for test.

3.1. S

AMPLE

T

he purpose of this study is to predict the dealerships maintenance services profit margins, as mentioned. In order to come up with the results required to draw conclusions we will be using a dataset sourced from Caetano Baviera, a member of the Salvador Caetano group, that contains the data from the repair and maintenance department retrieved from the information system that monitors that section of the business. From that dataset, the independent variable is the percentual gross profit margin or “Mg_Bruta_Perc”. Concerning the provider of this dataset, Salvador Caetano is in the automotive industry and one of the largest automotive groups in Portugal. It operates

dealerships around Portugal for almost every car brand and Caetano Baviera is the child brand that deals with the BMW group members, BMW and MINI. The dealerships belonging to this brand are official BMW group dealerships in every aspect, both in sales and in maintenance services. These two brands, BMW and MINI, have a line-up of cars

consisting of ICE vehicles, HEV, PHEV and EVs.

The dataset itself is from a single dealership in Lisbon, concerning the first six months of 2022.

From January 1^st, 2022, to July 7^th of the same year. These datasets are from the information system that dealers use to help them catalogue arrivals, types of services in addition to

characteristics of the vehicle such as age and kilometres travelled. The process that these systems report begins when a customer calls to have his or her car serviced in a special occasion, such as an accident or any other kind of problem separate from normal periodic repairs and

Figure 6 - Row Distribution by Month

(21)

11 maintenance, or the car itself communicates that it is due to be maintained for a few potential reasons which prompts the dealer to call the owner of the vehicle to schedule a visit. This initiation creates a document containing the vehicle information and will be updated as the process evolves.

Following the scheduling and the arrival of the vehicle to the dealership, the planned maintenance jobs are catalogued in the system along with the man hours required to perform such jobs and the parts that are needed to service the vehicle. After this, the real man hours, parts and tasks that were required are declared, creating an invoice for the customer from the document mentioned before.

The dataset is analysed comparatively to what was discussed above in the previous chapter to see if the literature and studies concerning the most common services and most repaired parts hold some grain of truth in the dataset as well.

The base dataset had several variables of which many either identified the customer or the car such as the clients address and the vehicle’s license plate. Others were either empty or did not add value to the dataset, for instance the name of the receptionist that took the clients call. These variables had to be removed, some due to not helping the performance of the dataset, others due to privacy reasons and confidentiality. These customers personal information was not needed for this dissertation and thus all data regarding said personal information was erased.

The dataset in its original form had 98604 entries for 6117 cars in total. Out of these cars, 95 are fully electric, 1897 rows, which corresponds to 1.92% rows of the dataset and to 1.55% of all cars in the dataset.

Concerning each variable, the following section gives a description and shows their values and distribution:

1. Nr Trabalho – Corresponds to the item number on the invoice presented to the customer. A single item on the invoice has several sub items that are individual tasks or aspects of the main item itself. An example might be a car that goes for regular maintenance and has a single item on its invoice, an oil change. This item has as sub items every part required (oil itself, oil plug, gasket or O-ring and oil filter), the labour hours required for all tasks related to the particular item, and extra tasks that in this case is the correct disposal of all the old oil and parts and any other residue treatment. So, every single row related to this item on the invoice has the value ‘1’ in its ‘Nr Trabalho’ variable. A second item would have its tasks as ‘2’ and a third as ‘3’ and so on.

2. Modelo – defines the model of the car. There are 72 different values for vehicle models. It was label encoded from the original variable that contained the names of all models in the dataset.

3. Codigo_Veiculo – Created from the original variable in the original CSV named “Matrícula”.

As the name says, it contains the license plates of all vehicles in the dataset. For privacy and data protection, it was changed into what it is now, an identifier of the vehicles in the dataset. Goes from 1 to 5980 after data processing.

4. Idade_Viatura – Has values measured in years concerning every vehicle age. It goes from 0 to 13 years. Note that rows that have the value 13 consider vehicles that are more than 13 years old and

(22)

12 not between 13 and 14 years old as is the case with every other value. This was a characteristic of the source dataset.

5. Tipo_Venda – Variable categorizing the type of sale. Can be external sales, internal sales, maintenance contracts and warranties, meaning the sale is made to the manufacturer, and “others”.

Variable was label encoded.

6. Tipo_Cliente – Has the various types of costumers. They can be normal vehicle owners, cars on behalf of insurance companies, fleet managers, companies that have negotiated protocols with the dealership, parts retailers, the Grupo Salvador Caetano itself, employees, agents. This variable was also label encoded.

7. Tipo_Linha – Categorizes the type of row concerning what’s involved in the repair from the dealership side, meaning, it diversifies between labour, parts, or “others” like proper disposal of any old material and fluids. This variable was label encoded as well.

8. Custo – Cost in euros of that specific row for the dealership

9. Mg_Bruta_Perc – Percentual gross profit margin, our target variable 10. Nr_Horas_Faturadas – Number of labour hours involved in the repair.

11. Kms_Medios – Kilometrage of the vehicle being repaired

12. Total_Gasto_Veiculo – Sum of all costs by vehicle code or “Codigo_Veiculo”. In other words, total of expenses made by the dealership in servicing a specific car throughout its life

13. Electrico – If a vehicle is fully electric or not. Binary, 1 for yes 0 for no.

14. Tempo_Servico – Number of days the service process in question took to be fully resolved.

Made from the difference between the sale date and when the service document was created, meaning the date the process began, when the costumer contacted the dealership to make use of its services.

15. Colisao – If the service is resultant of a crash or other accident. Binary, 1 for yes 0 for no.

Resulted from one hot encoding a variable that had two options, this variable, 15, and “Mecanica”.

16. BMW – If the car being serviced is a BMW. Binary, 1 for yes 0 for no. Resulted from one hot encoding a variable that had 3 distinct options, this variable, 16, and the next two, 17 and 18.

17. BMW_i – If the car being serviced is a BMW i.

18. MINI – If the car being serviced is a Mini.

(23)

13 Table 1 - Dataset Variables

Number Variable Type Values Units

1. Nr_Trabalho Continuous Numeric Number

2. Modelo Categorical 72 values Model Identifier

3. Codigo_Veiculo Categorical 5980 values

Individual Vehicle Identifier

4. Idade_Viatura Categorical 14 values Years

5. Tipo_Venda Categorical 5 values Sale type

Identifier

6. Tipo_Cliente Categorical 9 values Customer

Identifier

7. Tipo_Linha Categorical 3 values Type of task

Identifier

8. Custo Continuous Numeric Price

9. Target

Variable Mg_Bruta_Perc Continuous Numeric Percentual

10. Nr_Horas_Faturadas Continuous Numeric Hours

11. Kms_Medios Continuous Numeric Kilometers

12. Total_Gasto_Veiculo Continuous Numeric Price

13. Electrico Categorical Binary Binary

14. Tempo_Servico Continuous Numeric Days

15. Colisao Categorical Binary Binary

16. BMW Categorical Binary Binary

17. BMW_i Categorical Binary Binary

18. Mini Categorical Binary Binary

(24)

14

3.2. E

XPLORE

In SEMMA, after the sample phase, comes the exploration phase. The dataset has a total of 89271 rows and 18 columns. To explore these variables several plots and descriptive statistics were used to better understand and convey the data.

For the numerical variables, the following tables shows the maximum, minimum, mean, first quartile, median, third quartile and the standard deviation. This table displays all values from the original variables after data cleaning, totalling 87722 rows.

Variable Mean Standard

Deviation Min 1^st

Quartile Median 3^rd

Quartile Max

Nr_Trabalho 2.81 2.39 1 1 2 4 14

Idade_Viatura 4.59 3.55 0 2 4 6 13

Custo 25.13 72.36 0 1.25 5 18.11 991

Mg_Bruta_Perc 0.58 0.30 -1.92 0.37 0.62 0.80 1.03

Nr_Horas_Faturadas 0.20 0.72 0 0 0 0.17 14.83

Kms_Medios 82356.76 69052.38 1 33254 62037 113506 634108 Total_Gasto_Veiculo 968.56 1447.26 0 165.62 413.63 1125.90 9929.02

Tempo_Servico 11.65 22.52 -7 1 3 11 196

Table 2 - Numerical Variable Descriptive Statistics

For the categorical variables, the next tables display the count, number of unique values, mode and the number of times that most frequent value appeared in the variable. The table does not show the dummy variables as they will be addressed further ahead.

Variable Unique Mode Encoded Mode Frequency

Modelo 72 S1 – 5P (F20) 12 8117

Tipo_Venda 5 Vendas a Crédito 4 40901 Tipo_Cliente 9 Particulares 5 24050

Tipo_Linha 3 Peças 2 48102

Table 3 - Categorical Variable Descriptive Statistics

(25)

15 For the first variable “Nr_Trabalho” had the following distribution as can be seen above. The frequency of ocurrences decreases as the variable value increases and the mass of values remained on the lower values of the variable.

As will also be explained further ahead, the dataset started with a variable called “Marca” that was separated into 3 other dummy variables that have the values the original values took. Above we can see the distribution of that same variable, rows per brand. As seen, the car brand with the most rows is BMW by a large margin, then MINI, BMW i and then Audi and BMW Motos. The last two columns only make 7 rows and where deleted

before making the following 3 binary variables.

The first variable is “BMW”. Most of our rows are about cars that are BMW, 71411 of them, and 17655 are not.

The second and third variables are “MINI”

and “BMW_i”. We have 15784 rows that belong to the MINI brand of vehicles and 73282 that do not.

For BMW i, only 1871 rows belong to this brand while 87195 do not. In sum, 71411 are BMW, 15784 are MINI and 1871 are BMW i.

Figure 7 - "Nr_Trabalho" Distribution

Figure 8 - "Modelo" Distribution

Figure 9 - "BMW" Distribution

(26)

16 The variable “Codigo_Veiculo” is merely and identifier for the vehicle. There are 5981 unique values that indicates just as many distinct cars on the dataset.

Then we have the column “Modelo” which identifies the existent models in the dataset. Out of 72 models, 5 belong to the BMW i brand, 9 are MINI and the remaining 58 are BMW.

Vehicle age was explored here by using the graphs above. We can observe that most invoice items are concerning vehicles that are between 2 and 6 years old. 3-year-old vehicles have the highest frequency followed by vehicles that are 2 years old. This distribution is likely due to leasing contracts to either private citizens or enterprises, maintenance contracts and warranty and extended warranty incidents. The earlier referred aspects might be reasons to justify this observation, that newer cars appear more prevalent in official brand dealerships as shown in the above figure. I would add that the vehicles featured in this dataset, that belong to BMW, BMW i included, and MINI, are considered luxury cars which is a segment of the car market, meaning the luxury car market from subcompact to SUV, where the BMW group has a heavy and influential presence. Servicing these cars is also a premium service, which would explain the decline in the observations of older vehicles, as these cars pass to the second-hand market and as these cars end their leasing contracts and

warranty kilometrage and duration. There is an apparent challenge in maintaining these more recent customers.

Figure 10 - "Idade_Viatura" Distribution

Figure 10 - "MINI" Distribution Figure 11 - "BMW_i" Distribution

(27)

17 Note that all vehicles that have the “13” value are 13 years old and older.

This variable has 5 unique values possible. Firstly, we have normal sales that are the most frequent observation. Then we have sales regarding warranty, “Garantias”, after that, internal sales,

“Vendas Internas”, made between departments and then what the dealership classifies as “Outras”

or others. Finally, there is maintenance contracts or “Contratos de Manutenção”.

Figure 11 - "Tipo_Venda" Distribution

(28)

18 Concerning “Tipo_Cliente”, meaning the type of customer being serviced, we can attribute 3 bins to trios of values in this variable. The top 3 values numbering above 20 thousand of rows each belong to one bin. The top value in this bin is “Particulares”, meaning a private individual owner that brings his or her car in for a service, secondly, we have insurance companies, they typically have cars brought in for repair after a collision which undoubtedly requires more billable items on an invoice than normal maintenance which would explain the high numbers. Thirdly, we have warranties likely due to the problem that leads to the car being brought in, being more widespread. The second bin has the values concerning company cars, car fleet managers and internal customers and the final and third bin are the entities with protocols with the brand, parts retailers and itinerants.

This variable has 3 possible values when classifying the type of cost of the row. It can be labour costs, parts or other

“diverse” costs such as fluid disposal or recycling, where parts are the most common type, then labour then these diverse costs.

Figure 12 - "Tipo_Cliente" Distribution

Figure 13 - "Tipo_Linha" Distribution

(29)

19 The following variable is “Custo” or the cost of the row. As it can be seen it has a few outliers and most importantly negative values that will need to be addressed in the outlier removal phase.

Below is the boxplot of “Custo” after addressing the problems mentioned earlier. The densest area is between 1.25 and 20 euros, which indicates a ubiquity of cars performing regular maintenance involving more common repair items, procedures and lower labour durations and costs.

Below we have the figure for our next variable, “Total_Gasto_Veiculo”, meaning total spent in vehicle. The purpose of this variable is to track the total amount spent on a vehicle by the

dealership, attributing some similarity between vehicles that might have the same aspect such as being the same model, same type of sale and customer. Note that there are vehicles that only visit the dealership once in the datasets time interval.

Figure 15 - "Custo" Distribution Post Outliers Removal Figure 14 - "Custo" Distribution

(30)

20

“Nr_Horas_Faturadas” had the same issues as the variable “Custo” as demonstrated below.

The next figure shows the adjusted distribution. As can be seen, most rows take a short time to fulfil which might indicate that most tasks are basic and happen quite regularly.

Figure 16 - "Total_Gasto_Veiculo" Distribution

Figure 17 - "Nr_Horas_Faturadas" Distribution

Figure 18 - "Nr_Horas_Faturadas" Post Outliers Removal

(31)

21 The figure above, about the target variable, demonstrates that the column has quite

nonsensical values that completely skew the observations. After correcting these issues, which will also be referred in the outlier removal section, the following figure shows the new distribution.

The dealership typically makes a significant profit on every sale, the first quartile being 36,5%, the median 62,1% and the third 80,1%. This is due to well established supplier chains and the

Figure 19 - "Mg_Bruta_Perc" Distribution

Figure 20 - "Mg_Bruta_Perc" Post Outlier Removal

(32)

22 frequency of services that require common products. This is what the electric vehicle could

potentially disrupt.

This is the variable “Kms_Medios”. Most cars have around 100 thousand kilometres and lower. This graphic substantiates the observations made in “Idade_Viatura”, that the dealership has some challenges maintaining older vehicles and their owners as customers as they change to their second ownership and leasing contracts and warranties end. Concerning this last aspect, BMW offers extended warranty packages for both new and used vehicles, so warranties can go up to 5 years.

This is “Colisao”, and it simply

indicates if the car is being serviced due to a collision or other accident. This is because vehicles that have been involved in an accident require damage assessment and thus the dealership separates these cars from the normal maintenance

department into a collision department. As it can be seen most rows were not from accidents or other destructive incidents and compose approximately 31.5% of all rows, after the dataset has been treated of all missing values and outliers and other imperfections. This might not translate to a proportional number of vehicles in both groups of this variable, as a crashed vehicle most likely would need more repairs than a vehicle going in for normal maintenance which generates more rows per car.

Figure 21 - "Kms_Medios" Distribution

Figure 22 - "Colisao" Distribuition

(33)

23 The variable seen above quantifies the time it took to service a vehicle. Adding to what was said before, most tasks performed are maintenance tasks that take a shorter time and are easier to train younger staff in and easier to perform due to being less complex, justifying the location of the interquartile range and overall boxplot.

The final variable to be discussed in this phase is

“Electrico” that refers to whether a vehicle is electric or not. As indicated before, out of all cars, 95 are fully electric, 1897 rows, which corresponds to 1.92% rows of the dataset and to 1.55%

of all cars in the dataset.

However, these numbers refer to the original dataset.

Since this variable resulted from a selection of all electric vehicles in the dataset after all phases of the modify section of SEMMA except feature engineering, this graphic shows that the 1753 rows account for 2% of all rows in the dataset.

Figure 23 - "Tempo_Serviço" Distribution

Figure 24 - "Electrico" Distribution

(34)

24

3.3. M

ODIFY

3.3.1. Data Cleaning 3.3.1.1. Missing Values

Taking into account all that was seen in the previous step of

SEMMA, explore, the next task was dealing with missing data. The table to the right shows us the situation concerning this task. As it can be seen our target variable was missing

12.1% of its values, 11945 rows. The other row was missing just 48 values, not even 0.1%. Starting then with the latter, as it would not add or detract from the value of this dataset to fill the missing values or delete the values, these rows were promptly deleted. Our target variable proved a different challenge, due to the small number of outliers, that were so far apart from the rest of the values. The solution was to fill with the median instead of the mean. In conclusion, after this the dataset had less 48 rows, less than 0.1%.

3.3.1.2. Outlier and Inconsistency Removal

Outliers stand as values “out of range”, or values that significantly differ from other values.

Taking this into consideration, this could influence the training process of the algorithm because, during the fitting process, the algorithm will consider this data as cases that occurred naturally thus negatively influencing training. As outliers are phenomena that occur rarely in reality, this may introduce bias to an algorithm.

So, following the earlier step, the inconsistencies and imperfections in the dataset had to be addressed. In the variable concerning the model. “Modelo” had vehicles that weren’t from the brands belonging to the dealership and some bikes. These rows where discarded and comprised of 5227 units. Next, “Idade_Viatura”, the variable that held the value in years of the age of the vehicles, also had some imperfections, some values were “Não Definido” or “Undefined”. “Tipo_Cliente” also had this issue. The number of rows with this value were 2058 and were discarded. The latter variable needed some further changes concerning certain values that were overly specific concerning the dealerships clients. Thankfully these were only 82 rows, so they were discarded as well. All in all, the dataset decreased in size by 7367 rows. Adding the number of rows deleted mentioned in this paragraph to the ones from the paragraph before and we end up with 7415 rows deleted.

Figure 25 - Missing Values

(35)

25 Next step was applying this outlier removal process to numeric variables. Firstly, we had the variable “Nr_Trabalho”, that had, as we have seen before, a lot of values, from 8 to 49 that were outside the boxplot. Looking at Figure 4, we can confirm that many of the values have a relatively small frequency. As such, from 14 onward the values were deleted, numbering 928 rows.

The next variable that had outliers removed was “Custo”. This variable had values that were negative, indicating negative costs from the dealership. The reasons behind this could be varied.

However, in many of the cases where the cost was negative, the percentual profit margin, the target variable was positive. This either was some clerical error or a correction from the dealership,

indicating that the object that was initially sold might not have been sold after all and thus a deduction had to be made for to compensate for that fact. These values were transformed into positive values and accounted for 2205 rows. This was done due to the fact that both values provide valuable insight to the dataset models and because the number of rows affected were low relative to the overall size of the dataset.

In addition to these last changes, this column had outliers sparsely populating the area above the value of 1000 as can be seen in Figure 14, numbering 232 rows. These were discarded as well.

Concerning the issue discussed before with the previous column about the negative values, the variable “Nr_Horas_Faturadas” had the same issue. As for outliers, “Nr_Horas_Faturadas” had 31 that were removed. The figure below is the new boxplot without the removed rows.

Figure 26 - "Nr_Trabalho" Final

Figure 27 - "Custo" Final

(36)

26 Then we have variable “Mg_Bruta_Perc”, our target variable. As we could see earlier, the

variable sometimes took quite contrasting values that completely offset the boxplot. After limiting the profit margin to values between -2 and 2, the following graph shows the boxplot distribution of values. 674 rows were deleted.

The new variables also had to be inspected for outliers and this resulted in removals as well. The column “Total_Gasto_Veiculo” had a number of items greater than 10000 that were sparsely

populating that area as we could see in Figure 15, in the explore phase.

Figure 29 - "Mg_Bruta_Perc" Final Figure 28 - "Nr_Horas_Faturadas" Final

(37)

27 Those values, numbering 1344 rows, where removed and the figure above shows the new distribution of values.

The last variable that had outliers was “Tempo_Serviço”, 211 rows to be exact, that were removed as well. The box plot was trimmed above the value of 200 for this variable.

That concludes our section of data cleaning and outlier removal. For the first part there were removed 7415 rows and for the second, 3467 rows. Considering the 7415 rows, these had to be removed due to their inconsistencies as they were unsuited for the dataset. As for the 3467 rows they accounted for 3.8% of the dataset after removing the inconsistencies, above the 3% mark but excusable due to the size of the dataset that now has 87722 rows.

Figure 30 - "Total_Gasto_Veiculo" Final

Figure 31 - "Tempo_Servico" Final

(38)

28 3.3.1.3. Feature Engineering

This phase was where it was proceeded to label encode, to create dummy variables and to create new variables that would ultimately be used on the model.

As said before, the variables that were label encoded were “Modelo”, “Tipo_Venda”,

“Tipo_Cliente” and “Tipo_Linha”. This was due to these variables having 3 or more possible values.

The dummy variables that were created were “Colisao”, “BMW”, “BMW_i” and “MINI”. The first variable resulted from a columns in the original dataset that was type of department and had two values, mechanical or collision, which generated this binary variable. The latter three resulted from the brand variable that had 3 possible values, BMW, BMW i, and MINI.

When discussing the creation of new variables, “Total_Gasto_Veiculo”, “Electrico” and

“Tempo_Serviço” were created. The first one is the sum of dealership spending per vehicle, meaning the sum of all values in variable cost or “Custo”, where the vehicle code was the same. The second one is a binary variable as opposed to the first, which is continuous. It indicates if a car is electric or not where “1” means yes, it is electric and “0” means no, it is not. The third variable measures the time the process took to be fully completed from when the customer contacted the dealership scheduling a service to when the customer paid.

(39)

29 After creating all variables, an analysis of the correlation matrix between all columns was done.

As we can see in the matrix above, we have a high positive correlation between “Kms_Medios”

and “Idade_Viatura” and between “Electrico” and

“BMW_i” with values of 0.73 and 0.97 respectively.

This is likely due, in the first case, to the notion that an older car most likely would have travelled farther than a newer car, however that’s not always the case. Concerning the second case, almost all vehicles sold by BMW_i, except the i8, is electric which would explain the situation.

The table to the left is the list of correlation values between the target variable and all independent variables sorted by descending.

Figure 32 - Feature Correlation Matrix

Figure 33 - Target Feature Correlation

(40)

30 3.3.1.4. Scaling

The variables present in the dataset have very different scales even after label and one hot encoding. Thus, normalizing or standardizing data may be required to avoid bias in certain data mining algorithms. In this study, 4 methods were used alternatively, Standard Scaler, MinMax Scaler twice, using two variants, one where values were transformed into a range from 0 to 1 and another ranging from -1 to 1, and Robust Scaler. The latter is works by removing the median and scaling the data according to the interquartile range (IQR), meaning between the 1st quartile (25th quantile) and the 3rd quartile (75th quantile). This scaler is not affected by outliers as their values are not

considered for scaling (Scikit-Learn, 2020 e.).

(41)

31

3.4. M

ODEL

3.4.1. Feature Selection

Before starting to build the model, I used RFE to help with the selection of variables to use in the model. RFE is an algorithm for feature

selection meaning that it seeks to select variables from the training dataset that are most relevant for predicting the target variable. The configurations required are just the choice in the number of features and the algorithm used to choose the variables.

In this case, the block of code to the right was used to iterate through different test validation data samples to get some insights on the impact the samples have on the model score. The goal was to find the best score with the optimum number of variables.

The algorithm used as an

estimator was the Decision Tree Regressor to model a continuous dependent variable (which was the actual case). Before importing the model, fitting it into the Train set's independent variables, and making the prediction using the Test set, a few other feature selection methods were performed using the number of variables we got from the results of the block of code above. SelectKBest with Linear Regression and with Mutual Information Regression and RFE with the same Decision Tree Regressor to get further results on the ranking of variables and which to use.

This estimator selects the features by recursively considering smaller and smaller sets of features. In this case, as it wasn’t known the optimum number of features to choose and use, a nested for loop was created to get the best result after performing 10 to the power of 18 iterations, where it cycled through 10 different data samples. As the size of the dataset was quite large, about 80 thousand rows, and the complexity of the code was quadratic, due to the nested for, the number of different samples couldn’t be greater. The first for consists of the number of samples to do of the train dataset, that had passed through data pre-processing, and

Figure 34 - Nested for RFE

(42)

32 the second for had the range of the number of features, 18 features, so it would test the score the model had given all its features and the given data sample.

As the figure above shows, the number of features to use that would yield us the best score was 5 features when performing the RFE itself. The objective was obtaining the ideal number of features (n_features_to_select) to use in the following RFE, this one being used to rank the 18 features.

Then, as was said before, a few more feature selection algorithms were done using the ideal number of features found in the first part of this study’s RFE, 5 features.

Finally, the dataset with the optimal number of the most important features will be discussed in the results section of this study, as it was extensively compared to other training versions of the various models used with and without Feature Selection.

3.4.2. Models Used

After selecting features, the study proceeded to build the models required and as such the following were used:

1. Linear, Huber and RANSAC Regression

The Huber regressor and RANSAC regressor are versions of linear modelling that are resistant to outliers and were used in comparison to the Linear regression model. The Linear regressor maintained the highest score out of the 3 with a close second from the Huber Regressor. If the service is resultant of a crash or other accident. Binary, 1 for yes 0 for no. Resulted from one hot encoding a variable that had two options, this variable, 15, and the next one, 16.

2. Decision Tree Regressor

Typical decision tree regressor done before using the gradient boosting and bagging variants.

3. Random Forest

A random forest is a meta estimator that fits a number of classifying decision trees on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over- fitting (scikit-learn, 2022). One of the best performing algorithms on this study.

(43)

33 4. Gradient Boosting Regressor and Histogram-based Gradient Boosting Regression Tree This estimator builds an additive model in a forward stage-wise fashion; it allows for the optimization of arbitrary differentiable loss functions. In each stage a regression tree is fit on the negative gradient of the given loss function (scikit-learn, 2022). The Histogram version is a faster version for more bulky datasets like the one in this study and in this case, it also yielded a better result as we will see later.

5. Multi-Layer Perceptron Regressor

A good universal algorithm for data mining, especially due to its capacity to learn non-linear models. Had an already good score even before hyperparameter tuning.

6. AdaBoost Regressor

An AdaBoost regressor is a meta-estimator that begins by fitting a regressor on the original dataset and then fits additional copies of the regressor on the same dataset but where the weights of instances are adjusted according to the error of the current prediction (scikit-learn, 2022). This regressor also had a good result, similar to what MLP Regressor had.

(44)

34 3.4.2.1. Hyperparameter Tuning

Out of those earlier models mentioned, the Linear Regression is the only one where it is not possible to tune the hyperparameters for. As such, below are the tables with the

hyperparameters used to tune the models that allow such tuning, in order to optimise their performance, using GridSearchCV or RandomizedSearchCV both with 5 folds. GridSearch indeed was very exhaustive and time consuming to run for the number of parameters to tune and for the size of the dataset, so RandomizedSearchCV was more favored.

Table 4 - Hyperparameter tuning for all used models

Model Parameters

Decision Tree

Max Features ‘auto’, ‘sqrt’, ‘log2’

Max Depth None, 5

Criterion ‘squared_error’, ‘friedman_mse’,

‘absolute_error’

Min Samples Split 10, 20, 40

Min Samples Leaf 20, 40, 100

Max Leaf Nodes 5, 20, 100

Random Forest

Nr of Estimators 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 50, 100

Max Features ‘sqrt’, ‘log2’

Max Depth None, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55

Criterion ‘squared_error’, ‘absolute_error’

Histogram Gradient Boosting Regressor

Loss ‘squared_error’, ‘absolute_error’,

‘quantile’

Max Depth 2, 5, 10, 20, 50

Learning Rate 0.01, 0.02, 0.03, 0.05, 0.1

MLP Regressor

Hidden Layer Sizes (50,100,50), (100,)

Activation ‘tanh’, ‘relu’

Solver ‘sgd’, ‘adam’

Learning Rate ‘constant’, ‘adaptive’

(45)

35

3.5. A

SSESS

The final step of SEMMA methodology is the assess phase, where we try and understand and evaluate the results of the models based on the metrics we will discuss in this section.

3.5.1. Metrics Used

1. Mean Absolute Error (MAE)

Mean Absolute Error is a metric that measures the absolute difference between what the model predicted and the actual value. The MAE formula takes the absolute sum of all errors,

difference between prediction and value, and divides it by the total number of observations. The goal is to have this value as low as possible since it is a loss function, to have a low MAE is to have a more accurate prediction and thus model.

MAE is in the same unit as the target variable which in this case is percentage and it is robust to outliers due to its averaging operation in its calculation.

2. Mean Squared Error (MSE)

Mean Squared Error is a metric that is similar to the Mean Absolute Error, but it replaces the absolute with the squared operation. It is the average of the sum of the squared distance between the prediction and the observed value.

MSE is not as robust to outliers as MAE because of that change in the way its calculated.

Squaring the distance between points when the value in question is an outlier stretches the gap between an error value in the mean and another that’s an outlier.

3. R Squared (R2)

R2 is a metric for measuring the performance of the model, not a loss function as MAE and MSE.

This metric compares the model’s performance against the performance of a mean line from 0 to 1, where 0 equals the mean line, meaning no difference between the model and mean, and 1 equals perfection, the model makes no mistake which is impossible.