Multivariate Analysis of Sustainable Data in Energy Utility


Master Degree Program in Data Science and Advanced Analytics

Multivariate Analysis of Sustainable Data in Energy Utility: Clustering Analysis Application in Contributing Locations

Filipe Vieira Lourenço

Internship Report

presented as partial requirement for obtaining the Master Degree Program in Data Science and Advanced Analytics

NOVA Information Management School

Instituto Superior de Estatística e Gestão de Informação

Universidade Nova de Lisboa

MDSAA

NOVA Information Management School

Instituto Superior de Estatística e Gestão de Informação

Universidade Nova de Lisboa

by

Filipe Vieira Lourenço

Internship report presented as partial requirement for obtaining the Master’s degree in Advanced Analytics, with a Specialization in Business Analytics

Supervisor / Co Supervisor: Miguel de Castro Neto

November 2022


STATEMENT OF INTEGRITY

I hereby declare having conducted this academic work with integrity. I confirm that I have not used plagiarism or any form of undue use of information or falsification of results along the process leading to its elaboration. I further declare that I have fully acknowledged the Rules of Conduct and Code of Honor from the NOVA Information Management School.

[student signature]

[place, date]


ACKNOWLEDGEMENTS

I want to express my immeasurable gratitude to my family for all the support and love and for giving me every opportunity to become the person I wanted to be, the person I am today.


ABSTRACT

As the amount of data generated keeps increasing, finding a way to extract knowledge and insights from it is fundamental to decision-making for any kind of company in any field. For a company such as EDP, an electric utility, with the responsibilities and commitments this entails, it is crucial to meet the expectations of its customers and investors; otherwise it risks losing reputation and money to its competitors.

With this in mind, and since I was part of the team responsible for collecting the sustainable data from the locations belonging to EDP Group, we decided to analyse the reporting process and to complement it with a clustering solution that characterizes the diverse locations. This project followed the CRISP-DM methodology from beginning to end, and the clustering solution was found by applying hierarchical clustering on top of K-means. All data understanding, transformation, modeling and model evaluation was performed in Jupyter Notebook, and the final solution was built on the Power BI platform.

KEYWORDS

Sustainability; Reporting; CRISP-DM Methodology; Data Mining; Unsupervised Learning; Clustering


INDEX

1. INTRODUCTION
1.1. Company Context
1.1.1. History
1.1.2. Guiding Principles and Value Chain
1.1.3. Company’s Numbers and Acknowledgements
1.2. Objectives of Internship
2. THEORETICAL FRAMEWORK
2.1. Sustainability
2.2. CRISP-DM Methodology
2.3. Principal Component Analysis
2.4. Machine Learning
2.4.1. Supervised and Unsupervised Learning
2.4.2. Clustering Algorithms
3. METHODOLOGY
3.1. Business Understanding
3.1.1. Business Objectives
3.1.2. Situation Assessment
3.1.3. Data Mining Goals
3.1.4. Project Plan
3.2. Data Understanding I
3.3. Data Preparation I
3.4. Data Understanding II
3.5. Data Preparation II
3.6. Modeling
3.7. Evaluation
3.8. Deployment
4. RESULTS AND DISCUSSION
5. CONCLUSIONS
6. LIMITATIONS AND RECOMMENDATIONS FOR FUTURE WORKS
7. REFERENCES


LIST OF FIGURES

Figure 1. EDP S.A. Logo during the internship period
Figure 2. SDG 5 (SDG Compass, 2022)
Figure 3. SDG 7 (SDG Compass, 2022)
Figure 4. SDG 8 (SDG Compass, 2022)
Figure 5. SDG 9 (SDG Compass, 2022)
Figure 6. SDG 11 (SDG Compass, 2022)
Figure 7. SDG 12 (SDG Compass, 2022)
Figure 8. SDG 13 (SDG Compass, 2022)
Figure 9. SDG 15 (SDG Compass, 2022)
Figure 10. SDG 17 (SDG Compass, 2022)
Figure 11. Phases of the CRISP-DM Methodology (Wirth & Hipp, 2000)
Figure 12. Tasks in Business Understanding Phase
Figure 13. Tasks in Data Understanding Phase
Figure 14. Tasks in Data Preparation Phase
Figure 15. Tasks in Modeling Phase
Figure 16. Tasks in Evaluation Phase
Figure 17. Tasks in Deployment Phase
Figure 18. Data Dataframe
Figure 19. Data Details Dataframe
Figure 20. Distribution of Number of Indicators by Location
Figure 21. Distribution of Number of Subthemes by Location
Figure 22. Numerical Variables’ Histograms
Figure 23. Binary Variables’ Bar Charts
Figure 24. Inertia Plot and Average Silhouette Score for Uploaded Information Perspective
Figure 25. Inertia Plot and Average Silhouette Score for Indicators Perspective
Figure 26. Hierarchical Clustering Dendrogram
Figure 27. Distribution of Locations by Cluster
Figure 28. Binary Variables Proportion by Cluster
Figure 29. Binary Variables Proportion by Cluster
Figure 30. Feature Importance
Figure 31. Initial Analysis I Page
Figure 32. Initial Analysis II and III Pages
Figure 33. Initial Analysis IV Page
Figure 34. Initial Analysis V Page
Figure 35. Cluster Analysis I and II Pages


LIST OF TABLES

Table 1. Data Dataset Variables
Table 2. Data Details Dataset Variables
Table 3. Principal Components Created and Its Description
Table 4. Variables Included for Uploaded Information Perspective
Table 5. Variables Included for Indicators Perspective


LIST OF ABBREVIATIONS AND ACRONYMS

EDP Energias de Portugal
SDG Sustainable Development Goal
CRISP-DM CRoss-Industry Standard Process for Data Mining
CO2 Carbon Dioxide
LOF Local Outlier Factor
PC Principal Component
PCA Principal Component Analysis
KNN K-Nearest Neighbors


1. INTRODUCTION

“Availability and use of relevant and reliable information are vital to improving the governance of countries, cities, and companies to achieve sustainability goals” (Digitalization for Sustainability, 2022).

With the previous citation in mind, it is clear that, besides having a large quantity of data, knowing how to transform and report it properly is fundamental to extracting the knowledge and insights needed to understand the difficulties faced and the actions to take.

EDP has subscribed to more than half of the Sustainable Development Goals defined by the United Nations, which emphasizes the importance of taking informed decisions to accomplish these goals. In this report, I describe the application of a clustering solution to a real case: the EDP locations that report sustainable data through a specific data collection tool. The final solution is divided into two analyses: first, an initial analysis of data related to the reporting process and, second, the clustering analysis.

1.1. COMPANY CONTEXT

Energias de Portugal, S.A. (EDP S.A.) has more than 40 years of history and is currently the main electricity producer in Portugal, the third largest Iberian producer and one of the largest private operators in Brazil (EDP, 2022). EDP Group is also among the major operators in the European electricity sector, focusing its activity on the generation, distribution and supply of electricity in Portugal, Spain and Brazil and on the supply of natural gas in Portugal and Spain. EDP Group is composed of several subsidiaries, such as EDP Comercial, E-Redes, EDP España, EDP Brasil and EDP Renováveis, and is present in 20 different markets on 4 continents.

Figure 1. EDP S.A. Logo during the internship period

1.1.1. History

The history of electricity in Portugal began in the 19th century, more precisely on 28 February 1878, during the commemorations of the 15th birthday of Prince D. Carlos, when six voltaic lamps were lit on the esplanade of the Cascais Citadel. These lamps were imported from Paris and offered by the Lisbon Municipality, and they were used again the following month in the Chiado area. These events impressed Lisbon residents and gave a strong impulse to the development of electrification in Portugal, so much so that the first companies generating and distributing electricity emerged at the end of the 19th century and the beginning of the 20th century. Important milestones followed: in 1893, Braga became the first Portuguese city to be completely lit by electric light and, in 1894, Vila Real installed the first public lighting network powered by hydraulic energy. The first entrepreneurs and companies generating energy from hydroelectric and thermoelectric sources began to appear. Small thermal power plants predominated until the thirties, when the first large dams were built, making hydropower the main source of energy at the time. During the fifties and sixties, the Portuguese government applied policies to encourage the expansion of electrification to as many municipalities as possible, namely in rural areas where electrification was scarce. By the end of the sixties, a merger of hydroelectric, thermoelectric and energy transportation concessionaires was defined and assigned to Companhia Portuguesa de Eletricidade (CPE), which controlled 95% of the electric power produced in the country.

Two years after the overthrow of the dictatorial regime in the 1974 revolution, a decree-law was published by the then Minister of Industry and Technology and EDP was founded as a state-owned entity, arising from the merger of 13 companies responsible for power generation and electricity distribution in Portugal that had been nationalized in 1975. Confronted with an unbalanced distribution network, and due to its public character, EDP was made responsible for electrifying the entire country, improving the quality of the services provided and standardizing the tariffs paid by clients. Moving on to the nineties, in 1996 the first community directive was published, triggering changes in the organization and jurisdiction of the European electricity sector. In 1997, the first of eight privatization phases began with the sale of 30% of EDP’s capital: almost 180 million shares were sold to 770 thousand shareholders. In the fourth phase, in 2000, the Portuguese State ceased to be the main shareholder, keeping only 30% of the capital, and only in 2013 could EDP become completely private, due to legal obligations. Two years earlier, China Three Gorges Corporation had become the main shareholder, a position it still holds, by acquiring 21.35% of EDP’s capital for over 2.5 billion euros.

1.1.2. Guiding Principles and Value Chain

According to the EDP Sustainability Report 2022, the company’s vision is to be ‘a global energy company, leading the energy transition to create superior value’, following a mission in climate transition by promoting clean energy generation in a sustainable way that respects the environmental, social and governance dimensions (EDP Sustainability Report 2022, 2022). EDP has also defined commitments in line with the company’s vision and mission, namely in sustainability, by assuming its social and environmental responsibilities in the territories in which it operates and by avoiding specific greenhouse gas emissions with the energy produced. Besides sustainability, there are also commitments to people, by ensuring a fair and balanced work environment for its employees; to clients, by taking their opinions and needs into consideration in the company’s decisions; and to results, by complying with the responsibilities assumed towards its shareholders and anticipating what the market demands. EDP identifies as its key values building human and trusting relationships with employees, customers, partners and communities, improving the quality of life for present and future generations, and creating value by innovating.

EDP’s value chain comprises three phases: energy generation, networks or distribution, and supply to consumers. The generation phase consists of obtaining electricity from primary energy sources, either renewable, such as wind, solar and water, or non-renewable, such as coal, gas and nuclear. The distribution phase is only present in Portugal, Spain and Brazil and is where production centres are connected to consumption centres through the transportation of electricity at very high voltage and its distribution over high, medium and low voltage networks. In the last phase, the energy is at the point of supply and is sold by the retailer to the final consumer.

1.1.3. Company’s Numbers and Acknowledgements

In 2021, according to the EDP Sustainability Report 2022, EDP had more than 12 thousand employees distributed across 4 continents. Portugal was the country with the highest number of employees, with 5716, followed by Brazil with 3191 and Spain with 2013, and each employee received, on average, 28 hours of training. In the social dimension, EDP made a volunteer investment of 23 million euros, with over 11 thousand hours of volunteering time and almost 1 thousand beneficiary entities.

EDP provides electricity and gas to around 9 million customers in Portugal, Spain and Brazil, and is present in the remaining countries for renewable energy generation only, namely with wind and solar farms. EDP is the main electricity producer in Portugal, the third largest in Portugal and Spain combined, one of the main private entities in the energy sector in Brazil and one of the leading wind energy producers in the world, producing around 75% of its total energy from renewable resources. In 2021, specific emissions (Scope 1 and 2 emissions) were 51% lower than in 2015, and customers have saved 5 terawatt-hours of energy since 2015. The energy is delivered over high voltage lines, with 165 kilometres in operation and 1252 kilometres under construction. In distribution, EDP has 378 thousand kilometres of networks and registered an increase of 8 percentage points in smart meters compared to 2020, due to the investment in network modernization.

1.2. OBJECTIVES OF INTERNSHIP

I was integrated in the Information Management and Reporting team of the Sustainability department of EDP S.A. This team is responsible for gathering non-financial and sustainability-related data from the diverse subsidiaries of EDP Group, as well as for treating and preparing these data to make them available to other teams and to report them in the annual sustainability report. The annual sustainability report is of great importance to the company’s strategies and policies, as it allows shareholders and investors to keep up to date with the company’s situation and to take informed decisions.

My responsibilities centred on managing the data collection application, Sustainable Data: reporting the problems found in the application to the developers’ team, managing the data collection periods with the locations, answering the doubts and difficulties raised by the contributors and making the collected data available. Besides the Sustainable Data application, I was also responsible for building and maintaining dashboards in Power BI, using the data collected with Sustainable Data as well as data from other sources, and for sharing them with diverse teams in the Sustainability department and other business units inside EDP Group.

Given these responsibilities, I proposed to my team an analysis of the information obtained from Sustainable Data and of the events occurring during the data collection periods, in order to understand the main limitations and difficulties faced. Up to this point no such analysis had been made, and the team relied only on its perceptions and interactions with contributors to define actions aimed at avoiding these difficulties. As my team looked favorably upon this proposal, we moved forward with the idea and I decided to use it in my internship report.


2. THEORETICAL FRAMEWORK

2.1. SUSTAINABILITY

In 2015, at a meeting in New York with governments from all over the world during its seventieth anniversary, the United Nations adopted the 2030 Agenda for Sustainable Development, which defined 17 Sustainable Development Goals and 169 related targets to be reached by 2030. These goals and targets take into consideration the economic, social and environmental dimensions of sustainable development and follow up on the Millennium Development Goals, defined by the United Nations in 2000, which should have been accomplished by 2015. The document names fighting all forms of poverty as the main challenge and also mentions other important challenges, such as inequalities in society, the protection of human rights and gender equality, economic growth that respects the planet, and the conservation of natural resources. The Agenda emphasizes the importance of cooperation between all countries, businesses and citizens to reach the shared goals, since everyone has some responsibility during these 15 years, as well as the enormous ambition and the need to implement this “Agenda for the full benefit of all, for today’s generation and for future generations” (United Nations, 2015).

EDP Group has subscribed to 9 of the 17 Sustainable Development Goals defined by the United Nations and, having been recognized as one of the most sustainable companies in the world for several consecutive years, EDP believes it plays an important role in cooperating and taking measures and actions to ensure these goals are achieved.

The first Sustainable Development Goal subscribed to by EDP was Goal 5, to achieve gender equality and empower all women and girls. To achieve this goal, EDP intends to increase the percentage of women employees to 30% by 2025 through diverse initiatives, such as offering electrician courses for women, applying equal KPIs to all employees and management bodies, and providing specific training on the unconscious discrimination that may influence decision-making processes.

Figure 2. SDG 5 (SDG Compass, 2022)

The next subscribed goal was Goal 7, to ensure access to affordable, reliable, sustainable and modern energy for all. To contribute to this goal, EDP expects to invest 20 million euros in electrical networks for populations that do not have access to them, to finance Access to Energy projects in Africa, to produce energy only from renewable sources by 2030 and to increase decentralized solar capacity installed at clients. Initiatives that support these targets include backing renewable energy projects in developing countries, investing in new solar and wind farms in various regions and implementing energy efficiency measures and campaigns.

Goal 8 is to promote sustained, inclusive and sustainable economic growth, full and productive employment and decent work for all. By applying salary and hiring criteria, running internship programs and auditing suppliers, EDP commits to reaching 2% of employees with disabilities by 2025, paying equal salaries to employees in the same functions, having no fatal accidents and achieving 100% on the EDP certification, including high-risk suppliers.

Goal 9 focuses on building resilient infrastructure, promoting inclusive and sustainable industrialization and fostering innovation. To attain this goal, EDP commits to investing 2 billion euros in innovation and digitalization projects, installing 100% intelligent energy meters in the Iberian Peninsula by 2025, investing in systems that react smartly to consumers’ actions and in solutions to manage and store energy, and promoting electric mobility by electrifying 100% of its light fleet and 50% of its heavy vehicles.

Goal 11 aims to make cities and human settlements inclusive, safe, resilient and sustainable. EDP set a target of 180 thousand clients with electric mobility services by 2025.

Goal 12 is to ensure sustainable consumption and production patterns. To comply with this goal, EDP intends to eliminate single-use plastics by 2022, to stimulate the circular economy inside the group’s companies and to reach 100% environmental certification of its operational activities, including suppliers.

Goal 13 is to take urgent action to combat climate change and its impacts. EDP set out to be carbon neutral by 2030, to implement climate change adaptation plans in all business units, to reduce scope 1 and 2 CO2 emissions by 98% between 2015 and 2030 and to reach 100% environmental certification of its operational activities, including suppliers.

EDP expects to reach the previous goals by creating specific plans to adapt to and anticipate weather conditions and energy demand when planning the management of the electrical system.

Goal 15 is to protect, restore and promote sustainable use of terrestrial ecosystems, sustainably manage forests, combat desertification, and halt and reverse land degradation and halt biodiversity loss. EDP commits to implementing a strategy to mitigate the impact of its activity on biodiversity through the development of local environmental actions, the reduction of wind turbine blades, afforestation in specific areas and strategies to reduce the impact of dams on fish.

Goal 17 is to strengthen the means of implementation and revitalize the Global Partnership for Sustainable Development. EDP set out to invest 20 million euros in electricity access by 2025.

Figure 5. SDG 9 (SDG Compass, 2022)

Figure 6. SDG 11 (SDG Compass, 2022)

Figure 9. SDG 15 (SDG Compass, 2022)

Figure 10. SDG 17 (SDG Compass, 2022)

2.2. CRISP-DM METHODOLOGY

CRISP-DM stands for CRoss-Industry Standard Process for Data Mining. It was developed and published in 2000 by a consortium initially formed by three entities, DaimlerChrysler, SPSS and NCR, with financing from the European Commission, and its purpose is to be general and applicable to any data mining project. Even though this methodology is more than 20 years old, it remains the most complete data mining methodology for goal-oriented data mining projects and for achieving the goals of industrial projects (Martínez-Plumed, Contreras-Ochando, Ferri, Hernández-Orallo, Kull, Lachiche, Ramírez-Quintana, & Flach, 2019) and, according to the KDnuggets Polls on Methodology held in 2007 and 2014, CRISP-DM remains the most used methodology in data mining projects (Piatetsky, 2014). Additionally, CRISP-DM is considered the ‘de facto’ standard data mining methodology and the central approach in the evolution of data mining methodologies (Mariscal, Marbán & Fernández, 2010).

This methodology follows a hierarchical structure divided into four levels, from the most general to the most detailed: phases, generic tasks, specialized tasks and process instances. CRISP-DM defines six phases: business understanding, data understanding, data preparation, modeling, evaluation and deployment. They usually follow this order, as shown in Figure 11, but the sequence is flexible: moving forward and backward between phases is quite usual given the nature and complexity of data mining problems. Each phase influences the next, since its output and the knowledge obtained determine the direction to follow and the actions to be taken.

Figure 11. Phases of the CRISP-DM Methodology (Wirth & Hipp, 2000)

The business understanding phase involves understanding the project objectives and requirements from a business perspective in order to convert them into a data mining problem, and is divided into four tasks. It starts by determining the business goals: information about the organization’s context is recorded, the objectives are defined from a business perspective, related questions are addressed and business success criteria are defined, measurable if possible, so that the success of a solution can be judged objectively. The next task is to assess the situation, detailing the resources available, such as collaborators and their qualifications, data storage, and software and hardware resources. Besides the resources, this task also details requirements, assumptions, constraints, risks and contingencies, as well as terminology on both the business side and the data mining side, and includes a cost-benefit analysis to evaluate the feasibility and monetary return of the project. Another task is to derive the data mining goals and data mining success criteria from the business goals. Finally, the last task is to produce the project plan, describing the steps needed to achieve both the data mining and business goals in terms of their duration and the resources required, followed by an initial assessment of tools and techniques.

Figure 12. Tasks in Business Understanding Phase

The data understanding phase starts with data collection and exploration and ranges from identifying data quality problems, such as outliers, duplicated values or incoherencies in the dataset variables, to finding initial patterns in the data and obtaining basic statistics about the variables. It is also divided into four tasks, starting with collecting the initial data and producing an initial report that may list the datasets’ locations, the methods used to acquire them and the problems found. In the data description task, the goal is to analyse the characteristics of the data, including the variables’ data types, the number of rows and variables in each dataset, a statistical analysis of the variables and the meaning and importance of each variable in the problem’s context, producing a report and evaluating whether the data fulfils the constraints and requirements of the project. Data description is followed by data exploration, where queries and visualizations are performed to obtain deeper insights for use in the following phases, for example when constructing new variables in data preparation. The final task is to verify data quality by searching for special and missing values, incoherencies, duplicated rows and outliers, and by checking the keys.
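The data quality checks mentioned for this phase can be illustrated with a minimal Python sketch. The records, the column names and the use of Tukey's interquartile-range rule to flag outliers are assumptions made for illustration only; they are not the data or the outlier method used in this project.

```python
from statistics import quantiles

# Hypothetical records: one row per location, None marks a missing value.
rows = [
    {"location": "A", "indicators": 12},
    {"location": "B", "indicators": 15},
    {"location": "B", "indicators": 15},    # exact duplicate of the previous row
    {"location": "C", "indicators": None},  # missing value
    {"location": "D", "indicators": 14},
    {"location": "E", "indicators": 13},
    {"location": "F", "indicators": 90},    # suspiciously large value
]


def count_missing(rows, column):
    """Number of rows where `column` is None."""
    return sum(1 for row in rows if row[column] is None)


def find_duplicates(rows):
    """Indices of rows that repeat an earlier row exactly."""
    seen, duplicates = set(), []
    for index, row in enumerate(rows):
        key = tuple(sorted(row.items()))
        if key in seen:
            duplicates.append(index)
        seen.add(key)
    return duplicates


def iqr_outliers(values, k=1.5):
    """Values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's rule, an assumed choice)."""
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]


missing = count_missing(rows, "indicators")
duplicates = find_duplicates(rows)
present = [row["indicators"] for row in rows if row["indicators"] is not None]
outliers = iqr_outliers(present)
print(missing, duplicates, outliers)
```

Each check returns plain Python values, so the findings can be copied into the data quality report that this task is expected to produce.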

Figure 13. Tasks in Data Understanding Phase

The data preparation phase includes all the changes needed to turn the original dataset or datasets into the final dataset that is ready to use in the modeling phase. It covers table, record and attribute selection, as well as the transformation and cleaning of data for use in the models. It involves tasks such as data selection, where the inclusion and exclusion of data should be justified by considering the business and data mining goals, the organization’s context, the quality of the variables assessed in the data understanding phase and the limitations of each modeling technique in terms of data types, data volume and computational effort. Data selection is followed by data cleaning, whose goal is to increase data quality by correcting, eliminating or ignoring noise, dealing with special values and evaluating the importance or relevance of the noisy variables in the problem’s context. In the construct data task, new variables can be created from existing ones, such as a total sales amount obtained by multiplying quantity sold by unit price, or weights can be added to define the importance of each variable. Other activities are generating new records to add knowledge to the dataset, normalizing and standardizing numerical variables, handling missing values through an average, an aggregation or an estimate from a model, and applying transformations to specific fields to satisfy model requirements, such as transforming numerical variables into ranges or vice versa. In the integrate data task, data from multiple tables or different information sources about the same individuals are merged, and aggregations can also be performed to summarize information. Finally, the format data task involves formatting changes required by the modeling phase, such as reordering variables or records, without changing the meaning of the data.

Figure 14. Tasks in Data Preparation Phase
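As an illustration of the cleaning and construct data activities, the sketch below applies them to the total sales example given in the text. The records, function names and range edges are hypothetical, and mean imputation is only one of the options listed for handling missing values.

```python
from statistics import mean, pstdev

# Hypothetical sales records; None marks a missing unit price.
records = [
    {"quantity": 2, "unit_price": 10.0},
    {"quantity": 5, "unit_price": None},
    {"quantity": 1, "unit_price": 30.0},
    {"quantity": 4, "unit_price": 20.0},
]


def impute_mean(records, column):
    """Replace missing values in `column` with the mean of the observed ones."""
    fill = mean(r[column] for r in records if r[column] is not None)
    return [{**r, column: fill if r[column] is None else r[column]} for r in records]


def add_total_sales(records):
    """Construct a new variable: total sales = quantity sold * unit price."""
    return [{**r, "total_sales": r["quantity"] * r["unit_price"]} for r in records]


def standardize(values):
    """Z-scores: subtract the mean and divide by the population standard deviation."""
    centre, spread = mean(values), pstdev(values)
    return [(v - centre) / spread for v in values]


def to_range(value, edges, labels):
    """Label of the first range whose upper edge the value does not exceed."""
    for edge, label in zip(edges, labels):
        if value <= edge:
            return label
    return labels[-1]  # above every edge: last label


prepared = add_total_sales(impute_mean(records, "unit_price"))
totals = [r["total_sales"] for r in prepared]
z_scores = standardize(totals)
bands = [to_range(t, edges=[25, 75], labels=["low", "medium", "high"]) for t in totals]
print(totals, bands)
```

The functions return new lists instead of modifying the input, which keeps the original dataset available if a later phase requires going back to data preparation.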

The modeling phase consists of selecting one or more modeling techniques appropriate for the data mining problem defined previously, optimizing the models’ parameters and applying them to the dataset. Because some modeling techniques have specific characteristics and requirements, it may be necessary to return to the data preparation phase and apply the required modifications to the dataset. This phase starts with the select modeling technique task, where one or more techniques are chosen according to their requirements on data quality, format or distribution, political requirements and existing constraints inside the organization. The next task is to generate the test design in order to evaluate the model’s quality and suitability, by defining which technique should be applied to divide the dataset into training, test and validation data. The build model task consists of applying the chosen modeling techniques to create one or more models; since each model has several parameters to define, it is usual to create several models with the same modeling technique by changing its parameters. In the assess model task, it is assessed whether the model meets the data mining success criteria: the created models are interpreted, compared and ranked according to the success and evaluation criteria, the results are interpreted from a business perspective, the model’s reliability is checked and the possible deployment is analysed.
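The test design and build model tasks can be sketched as follows. This is a generic illustration and not the procedure followed in this project: the split proportions, the seed and the parameter names (`n_clusters`, `n_init`) are assumptions.

```python
import random
from itertools import product


def split_dataset(rows, train=0.6, validation=0.2, seed=42):
    """Shuffle rows reproducibly and cut them into train / validation / test parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train)
    n_validation = round(len(shuffled) * validation)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_validation],
        shuffled[n_train + n_validation:],  # whatever remains is the test part
    )


def parameter_grid(options):
    """Every combination of the listed parameter values, one dict per candidate model."""
    names = list(options)
    return [dict(zip(names, values))
            for values in product(*(options[name] for name in names))]


dataset = list(range(100))  # stand-in for 100 dataset records
train_rows, validation_rows, test_rows = split_dataset(dataset)

# Hypothetical parameters for a clustering technique such as K-means.
candidates = parameter_grid({"n_clusters": [2, 3, 4], "n_init": [10, 20]})
print(len(train_rows), len(validation_rows), len(test_rows), len(candidates))
```

Each dictionary in `candidates` describes one model to build with the same technique, which mirrors the practice of creating several models by changing the parameters.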

Figure 15. Tasks in Modeling Phase

The evaluation phase starts right after the models are built, when it is time to compare them in order to choose a final model and to evaluate whether the steps were properly carried out and whether the data mining and business goals have been achieved. Treating not only the models but also the findings and associated questions as results gives a broader view of the value the data mining project creates for the organization. This phase starts with the evaluate results task, which evaluates whether the model meets the business objectives and tries to find any weakness in the model. It includes activities such as understanding the data mining results, interpreting their applicability, utility and value in the business context, establishing whether new business objectives emerged from the findings of the current project and defining recommendations for future work. At the end of this task, after comparing and assessing the created models, a model that meets the defined criteria should be chosen.

Figure 16. Tasks in Evaluation Phase

The next task is to review the process. At this point the process must be summarized by identifying the activities and assessing whether anything was missing, wrongly executed, fundamental for the project, or could be improved or done differently. The last task of this phase is to determine the next steps: based on the previous tasks, a decision about how to move forward is made, taking into consideration the capacity for deployment and the available resources.

The deployment phase consists of presenting the results and knowledge obtained from the built model so that the client, who is probably not familiar with data mining concepts, can easily understand and use them to add value to their business. This phase can involve a solution as simple as a report or as complex as a dashboard updated with real-time data. In addition, a maintenance or guidelines report may be necessary to ensure the solution stays functional and keeps bringing value to the client’s decisions. Plan deployment is the first task and consists of summarizing the deployable results, considering different deployment strategies, documenting how the knowledge will be transmitted to users and how it will be monitored, defining how the results will be deployed inside the organization and with which resources, and identifying possible pitfalls in deployment. The plan monitoring and maintenance task includes a detailed maintenance strategy to support the correct use of the data mining results, defining how their quality will be evaluated, the criteria for retiring the model (so that, while the model is active, it produces a solution of adequate quality) and what should happen when the model can no longer be used. The produce final report task must include a description of the whole process, the associated costs, the results achieved, adjustments to the initial plan, the monitoring and maintenance plan and recommendations for future work. Review project is the last task and consists of weighing what was done correctly, what was done incorrectly and what could have been done differently, by gathering feedback from all participants and analysing the project from beginning to end.

2.3. PRINCIPAL COMPONENT ANALYSIS

Principal Component Analysis (PCA) is a well-known dimensionality reduction technique, defined as “a technique for forming new variables which are linear composites of the original variables” (Sharma, 1996) or as “a reduced dimension by mapping data to a new coordinate axis of a low dimension, which not only reduces the complexity of the data, but also identifies the most important features” (Yang, Yan & Dong, 2019).

Figure 17. Tasks in Deployment Phase

PCA looks for the direction in the data, or the rotation of the axes, along which the variance is maximized. This direction forms the first principal component (PC), which captures the largest possible share of the total variance. In this way, the first PC gathers the maximum information, followed by the second PC with the largest possible share of the remaining variance, and so on, with the maximum number of PCs being the number of variables in the dataset. Each additional PC is orthogonal to all previous PCs, meaning that the PCs are uncorrelated with each other (their pairwise correlation is 0). Additionally, the dataset must be standardized so that variables with larger variances do not have a disproportionate impact on the construction of the PCs.
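The standardization requirement mentioned above can be illustrated with a minimal sketch. It assumes z-score scaling with the sample standard deviation; the function name and the sample values are hypothetical and are not taken from the project data.

```python
from statistics import mean, stdev

def standardize(column):
    """Return the z-scores of a list of numbers (mean 0, sample std 1)."""
    mu = mean(column)
    sigma = stdev(column)  # sample standard deviation (n - 1 denominator)
    if sigma == 0:
        raise ValueError("a constant column cannot be standardized")
    return [(x - mu) / sigma for x in column]

# Two hypothetical variables measured on very different scales
n_indicators = [180, 175, 50, 182, 178]
n_doubts = [1, 2, 0, 1, 3]

z_indicators = standardize(n_indicators)
z_doubts = standardize(n_doubts)
```

After this step both variables have unit variance, so neither dominates the PCs merely because of its scale.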

When performing a PCA, there are as many eigenvectors as variables in the input dataset. They represent the contribution of each variable to each PC and provide the weights used to calculate the PC scores, the values of the new variables. Similarly, there are as many eigenvalues as eigenvectors, and each eigenvalue represents the variance explained by its PC, that is, the share of the information in the data that the PC carries. The sum of the eigenvalues of all PCs equals the total variance in the dataset. The correlations between each variable and each PC are called loadings and represent the impact and direction (negative or positive) of each variable on the construction of the PC, which makes them critical for interpreting and labelling the PCs.

Loadings can be calculated from the formula below:

$$l_{ij} = \frac{w_{ij}}{\hat{s}_j}\sqrt{\lambda_i}$$

where $l_{ij}$ is the loading of the $j$-th variable for the $i$-th PC, $w_{ij}$ is the eigenvector value of the $j$-th variable for the $i$-th PC, $\hat{s}_j$ is the standard deviation of the $j$-th variable and $\lambda_i$ is the variance or eigenvalue of the $i$-th PC.
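As a small worked illustration of the loading formula, the sketch below assumes two standardized variables (so each standard deviation is 1) with a hypothetical correlation of 0.6. In that textbook case the eigenvalues are 1.6 and 0.4 and both eigenvectors have components of magnitude 1/√2. The function name is illustrative only.

```python
from math import sqrt

def loadings(eigenvector, std_devs, eigenvalue):
    """Loadings of every variable on one PC: (w / s) * sqrt(lambda)."""
    return [w / s * sqrt(eigenvalue) for w, s in zip(eigenvector, std_devs)]

r = 0.6                  # hypothetical correlation between the two variables
w = 1 / sqrt(2)          # eigenvector component for a 2x2 correlation matrix
std_devs = [1.0, 1.0]    # standardized variables

pc1 = loadings([w, w], std_devs, 1 + r)    # first PC, eigenvalue 1.6
pc2 = loadings([w, -w], std_devs, 1 - r)   # second PC, eigenvalue 0.4
```

For standardized data, the squared loadings of a variable across all PCs add up to 1, which is a convenient sanity check on the computation.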

The number of PCs to retain always depends on the nature of the problem, but it is usual to measure the information loss by summing the variances (eigenvalues) of the PCs left out of the lower-dimensional space. Since the ideal number of PCs can be difficult to determine, three criteria are commonly used:

• Following Kaiser’s criterion, keep the PCs with eigenvalues greater than 1;

• With Pearson’s criterion, keep PCs until at least 80% of the variance is explained, in other words until the cumulative proportion of the eigenvalues reaches 80%;

• With the scree plot method, plot the proportion of variance explained by each PC, look for an elbow in the plot and retain the PCs up to the one at the elbow.
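The first two criteria can be sketched in a few lines. The eigenvalues below are hypothetical (five standardized variables, so they sum to 5) and the function names are illustrative; the scree plot criterion is visual and is therefore left out.

```python
def kaiser_criterion(eigenvalues):
    """Number of PCs whose eigenvalue is greater than 1."""
    return sum(1 for ev in eigenvalues if ev > 1)

def variance_threshold(eigenvalues, threshold=0.80):
    """Smallest number of leading PCs whose cumulative variance share reaches the threshold."""
    total = sum(eigenvalues)
    cumulative = 0.0
    for k, ev in enumerate(sorted(eigenvalues, reverse=True), start=1):
        cumulative += ev / total
        if cumulative >= threshold:
            return k
    return len(eigenvalues)

eigenvalues = [2.5, 1.2, 0.8, 0.3, 0.2]  # hypothetical values

n_kaiser = kaiser_criterion(eigenvalues)
n_variance = variance_threshold(eigenvalues)

# Information loss: share of variance carried by the PCs that are dropped
dropped = sorted(eigenvalues, reverse=True)[n_variance:]
information_loss = sum(dropped) / sum(eigenvalues)
```

With these values the two criteria disagree (2 PCs versus 3), which is why the choice ultimately depends on the nature of the problem.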

2.4. MACHINE LEARNING

2.4.1. Supervised and Unsupervised Learning

Supervised learning, or predictive modeling, is applied to labelled data with the goal of learning a decision criterion that predicts the target variable of new unlabelled objects. It can be divided into two main categories, classification and regression problems. In other words, the goal is to learn a function, as accurate as possible, that explains the relationship between the input variables and the target variable. The target variable is meant to bring value to the business by answering a specific problem, for example predicting hotel reservation cancellations or train delays from diverse variables.

Classification problems focus on predicting a categorical label by recognizing patterns in the input features, while regression problems aim to predict a continuous variable and to find relationships between continuous variables (Talabis et al., 2015). When choosing the algorithms to apply, the quality, quantity and type of the available input data should be assessed: without a large quantity of data, or with imbalanced data, high-complexity models should be avoided so as not to overfit the training data.

Contrary to supervised learning, in unsupervised learning or descriptive modeling there is no target variable, that is, the data carry no labels. Instead, the aim is to group the input dataset by discovering structure through common elements in the data (Talabis et al., 2015).

Clustering is one of the techniques included in unsupervised learning. It looks for similarities or dissimilarities in the data that allow the data points to be differentiated and grouped. The goal is to obtain clusters that are as different as possible from each other while, simultaneously, the objects inside each cluster share as many characteristics as possible.

Different perspectives can be chosen to perform the clustering. For example, in a market segmentation for a company, a value perspective can use variables related to the quantity of purchased products or the amount spent, while a customer perspective can use customer-related data such as age, income or address. K-means and hierarchical clustering are two of the most popular clustering algorithms and are described in more detail later. Dimensionality reduction is another use case for unsupervised learning: it helps to reduce the number of variables and to minimize the impact of data quality and integrity problems, with PCA being one such technique. It is quite often used in the data preparation phase to avoid the curse of dimensionality, to reduce the computational effort of applying a model on top of the data and to make it easier to visualize the data in a lower-dimensional space. Another technique is association rules, which are particularly useful for uncovering relationships between the products bought, improving cross-selling in physical stores and enabling personalized recommendations in online stores (Kerenidis & Prakash, 2016).

2.4.2. Clustering Algorithms

Mangiameli et al. (1996) state that there are two distinct problems in clustering: choosing the appropriate number of clusters and assigning each data object to one and only one cluster. With this in mind, clustering algorithms can be divided into partitioning and hierarchical techniques. Partitioning algorithms fall into three categories (distance-based, model-based or density-based) according to the similarity measure used, while hierarchical algorithms can be agglomerative or divisive. In the clustering algorithms explained below, hierarchical clustering and K-means, each data object belongs to exactly one cluster and each cluster must contain at least one data object.

In hierarchical clustering, the divisive approach is top-down: it starts with one cluster containing all the objects and sequentially divides clusters into smaller ones until a stopping condition is met or each object belongs to its own cluster. The agglomerative approach is the opposite, bottom-up one: initially each data object belongs to its own cluster, and clusters are successively merged according to the distance between them until all the objects belong to a single cluster (Saxena et al., 2017).
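The agglomerative procedure can be sketched as a naive loop. This is only an illustration: it assumes Euclidean distance, uses the minimum pairwise distance to compare clusters and stops at a requested number of clusters; the function name and the points are hypothetical.

```python
from math import dist

def agglomerate(points, n_clusters=1):
    """Bottom-up clustering; returns clusters as lists of point indices."""
    clusters = [[i] for i in range(len(points))]  # each object starts alone
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Distance between clusters: smallest pairwise distance
                d = min(dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]  # merge the closest pair
        del clusters[b]                          # b > a, so index a is unaffected
    return clusters

points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (10.0, 0.0)]
result = agglomerate(points, n_clusters=3)
```

Once two clusters are merged they are never split again, which is the irreversibility that characterizes hierarchical methods.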

One drawback of hierarchical clustering is that it is not possible to go back: a merge or split is never revisited, which reduces the computational effort but also means that wrong decisions cannot be rectified. On the other hand, it does not require the number of clusters to be defined a priori, as clusters are merged according to the dissimilarity between data objects, that is, the distance between them, such as the Euclidean, Minkowski or Mahalanobis distance (Oyelade et al., 2019). There are several methods to measure the distance between clusters; five of the most common are listed and described below (Sharma, 1996):

• Single-linkage or nearest-neighbor method measures the distance between two clusters as the minimum distance over all possible pairs of data objects, one from each cluster.

• Complete-linkage or farthest-neighbor method is the opposite of the single-linkage method: the distance between two clusters is the largest distance between a data object of one cluster and a data object of the other.

• Average-linkage method measures the distance between two clusters as the average distance over all pairs of data objects, one from each cluster.

• Centroid method calculates the distance between the centroids of the two clusters, the centroid being the average data object of a cluster.

• Ward’s method uses the within-cluster sum of squares, or error sum of squares, to measure cluster homogeneity and tries to minimize it so that the data objects in each cluster are as homogeneous as possible.
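To make the five methods concrete, the sketch below computes each measure for two small hypothetical clusters, assuming Euclidean distance. For Ward's method it computes the increase in the within-cluster sum of squares that merging the two clusters would cause, which is one common way of expressing the criterion. All function names are illustrative.

```python
from itertools import product
from math import dist

def single_linkage(a, b):
    return min(dist(p, q) for p, q in product(a, b))

def complete_linkage(a, b):
    return max(dist(p, q) for p, q in product(a, b))

def average_linkage(a, b):
    return sum(dist(p, q) for p, q in product(a, b)) / (len(a) * len(b))

def centroid(cluster):
    """Average data object of a cluster."""
    return tuple(sum(coords) / len(cluster) for coords in zip(*cluster))

def centroid_distance(a, b):
    return dist(centroid(a), centroid(b))

def within_ss(cluster):
    """Within-cluster (error) sum of squares."""
    c = centroid(cluster)
    return sum(dist(p, c) ** 2 for p in cluster)

def ward_increase(a, b):
    """Increase in the error sum of squares if clusters a and b were merged."""
    return within_ss(a + b) - within_ss(a) - within_ss(b)

cluster_a = [(0.0, 0.0), (0.0, 2.0)]
cluster_b = [(4.0, 0.0), (4.0, 2.0)]

d_single = single_linkage(cluster_a, cluster_b)
d_complete = complete_linkage(cluster_a, cluster_b)
d_average = average_linkage(cluster_a, cluster_b)
d_centroid = centroid_distance(cluster_a, cluster_b)
d_ward = ward_increase(cluster_a, cluster_b)
```

Even on this tiny example the measures differ, so the choice of method can change which pair of clusters is merged next.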

K-means is a distance-based partitioning algorithm, and one of its main differences from hierarchical clustering is the need to define the number of clusters a priori. Although K-means is one of the most popular clustering algorithms, it has limitations: there is no efficient way to identify the initial partition or the number of clusters, and it is sensitive to outliers, since every data object, including those with extreme values, must belong to a cluster (Saxena et al., 2017).

Once k, the number of clusters, is defined, K-means forms an initial partition and uses an iterative process to increase the homogeneity inside each cluster and the heterogeneity between clusters.

Basically, the algorithm can be described in the following steps:

1. Choose the number of clusters k and then the initial seeds.

2. Assign each data object to the closest seed (centroid).

3. Calculate the centroid of each cluster.

4. Repeat steps 2 and 3 until the centroids no longer change.

5. The algorithm ends when there are no alterations in the centroids.
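The steps above can be sketched in plain Python as follows. This is a minimal illustration, not the implementation used in the project: it assumes Euclidean distance, takes the initial seeds as an argument and, as an extra assumption, keeps the previous centroid if a cluster ever becomes empty.

```python
from math import dist

def k_means(points, seeds, max_iter=100):
    """Iterate assignment and centroid update; returns (centroids, labels)."""
    centroids = [tuple(s) for s in seeds]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each object goes to the closest centroid
        labels = [min(range(len(centroids)), key=lambda c: dist(p, centroids[c]))
                  for p in points]
        # Update step: recompute each centroid as the mean of its members
        new_centroids = []
        for c in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(
                    tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # assumption: keep old centroid
        if new_centroids == centroids:  # centroids stopped changing
            break
        centroids = new_centroids
    return centroids, labels

points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centroids, labels = k_means(points, seeds=[points[0], points[3]])
```

Because the result depends on the seeds, it is common to run the algorithm several times with different initial seeds and keep the most homogeneous solution.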

3. METHODOLOGY

3.1. BUSINESS UNDERSTANDING

3.1.1. Business Objectives

EDP Group is an organization that covers all the phases of the energy sector, from electricity generation in power plants spread across several countries and continents to electricity distribution and consumer supply in Portugal, Spain and Brazil. EDP Group is formed by various subsidiaries that are organized according to their main activity and geography, so the group’s data is also dispersed. This challenge is overcome with the Sustainable Data application, where each location of the different subsidiaries or companies reports its sustainability-related data through its contributors.

Up to this point the data were merely reported, and neither the difficulties faced by the contributors nor the interactions with them were registered. To address this, our team decided to perform an analysis with the objective of understanding which locations and contributors face more difficulties and of identifying possible groups of locations with similar characteristics. With this in mind, not only the data on the interactions between me and the contributors were taken into consideration, but also the data obtained from the Sustainable Data application. The Information Management and Reporting team intends to use this analysis to take measures that increase the quality of the reported data.

Considering the objective defined, we can identify two business success criteria: reducing the difficulties faced, along with the delays and readjustments in reported data, and providing valuable information about the types of locations and contributors according to the data obtained from them.

3.1.2. Situation Assessment

The vast majority of sustainability-related data at EDP is collected quarterly, so I was provided with data from one quarter, which had to be anonymized due to privacy constraints. I was provided with two Excel files. The first is an extraction of the data in Sustainable Data for an anonymous quarter, containing a total of 65,409 rows and 13 variables, where each row contains information about an indicator, a location and the value uploaded by the contributor. The second file details the difficulties and limitations faced by each location; each row corresponds to one location, for a total of 379 locations and 11 variables.

Given the above, it seems clear that some feature engineering on the data is essential to achieve the desired objectives and to carry out an analysis that is both of good quality and useful for the team.

The main risks come from this being the first time an analysis of this type is made, so no previous knowledge is available about the viability of the project, and from a potential lack of quality in the data.

3.1.3. Data Mining Goals

Taking into consideration the business objective described previously, the data mining goal is to produce an effective division of the locations into groups that highlights their main characteristics.

Besides that, it is fundamental to provide useful and clear insights so that the Information Management and Reporting team can gain knowledge about its contributors and locations and take actions and measures to increase the quality of the data collection process.

3.1.4. Project Plan

As mentioned previously, it will be necessary to apply some transformations to the Data dataset in order to merge its information with the Data Details dataset. These transformations will be made in the Jupyter Notebook environment, using 2 different notebooks.

The first notebook performs the transformations and the merge mentioned above, while the second works on the merged dataset. Without going into too much detail, the project will consist of the following steps:

1. Collect the two initial datasets and load them into the first notebook.

2. In Data Understanding I, explore the two datasets obtained, especially the Data dataset, including its method of collection, format and variable data types, through a statistical analysis and the construction of visualizations.

3. Moving to Data Preparation I, apply the needed transformations to the Data dataset, building new variables to add to the Data Details dataset. Once this is concluded, save the resulting dataset for use in the second notebook.

4. Moving to the second notebook, in Data Understanding II, explore the data in more detail with a statistical analysis and visualizations of the variables to uncover patterns and problems, and verify data quality by checking for null values, duplicated rows and incoherencies between the variables.

5. With Data Understanding II done, move to Data Preparation II to remove outliers, construct new variables, select the features and apply dimensionality reduction techniques to decrease the dataframe complexity.

6. Start the modeling phase by defining different perspectives on which to apply clustering techniques. Apply the K-means algorithm to each perspective, followed by hierarchical clustering to merge the cluster solutions found with K-means and, finally, a KNN classifier to assign the locations classified as outliers in Data Preparation II to the resulting clusters.

7. Evaluate the model and then deploy the results achieved through a Power BI dashboard.

3.2. DATA UNDERSTANDING I

Two datasets were needed as initial data. Collecting the first dataset (designated Data) from the Sustainable Data application was simple, as it is the file extraction made available to the Information Management and Reporting team and other business units at EDP. However, one adjustment had to be made: the location names were replaced by numbers so that the identity of the locations remains anonymous. Table 1 lists the 13 variables in this dataset with a short description and the data type of each one, and Figure 18 shows the first 3 rows of the Data dataframe together with the column names.

| Variable Name | Variable Description | Data Type |
|---|---|---|
| Index_Location | Location index | Nominal variable |
| Theme | The theme associated with the indicator | Nominal variable |
| Subtheme | The subtheme associated with the indicator | Nominal variable |
| Consolidation Method | The consolidation method associated with the location | Nominal variable |
| Technology | The type of technology used for energy generation in each location | Nominal variable |
| Activity | The type of activity associated to each location | Nominal variable |
| Indicator Reference | Indicator reference used in Sustainable Data application | Nominal variable |
| Indicator Name | Indicator name | Nominal variable |
| Active Indicator | If the indicator is active or not | Binary variable |
| Unit Category | Type of units associated with the indicator | Nominal variable |
| Unit Measure | Unit of measure associated with the indicator and location | Nominal variable |
| Value | Value uploaded by the contributor | Numerical variable |
| Data/hora última atualização | Date and time corresponding to the uploaded value | Datetime variable |

Table 1. Data Dataset Variables

Collecting the second dataset (designated Data Details) was not as straightforward as the first. Besides also replacing location names by numbers for anonymization, variables such as “Nº of Doubts via Teams” or “Nº of Doubts via Email” were collected manually from the contacts made with me, since I was responsible for managing the Sustainable Data application. Other variables, such as “Insertion via Excel” or “Nº of Uploaded Files”, as well as the “Replaced values” and “Validated Values” variables, were collected from the information available inside the Sustainable Data application. Table 2 shows the 11 variables in this dataset with their description and corresponding data type, and Figure 19 shows the first 3 rows of the Data Details dataframe together with the column names.

| Variable Name | Variable Description | Data Type |
|---|---|---|
| Index_Location | Location index | Nominal variable |
| Replaced values | 1 if the values were changed during the filling period, 0 otherwise | Binary variable |
| Validated Values | 1 if the values were validated from the person responsible for it, 0 otherwise | Binary variable |
| Nº of Doubts via Teams | Number of doubts exposed via Microsoft Teams for each location | Numerical variable |
| Nº of Doubts via Email | Number of doubts exposed via email for each location | Numerical variable |
| Nº of indicators with applicability changes | Number of indicators associated or disassociated for each location | Numerical variable |
| Nº of New Contributors | Number of new contributors for each location | Numerical variable |
| Change of Contributors | 1 if the contributors changed since the previous filling period, 0 otherwise (new locations were considered as not suffering change of contributors) | Binary variable |
| Insertion via Excel | 1 if the values were uploaded via Excel files, 0 otherwise | Binary variable |
| Nº of Uploaded Files | Number of uploaded files for each location | Numerical variable |
| New Location | 1 if it's a new location since the previous filling period, 0 otherwise | Binary variable |

Table 2. Data Details Dataset Variables

Figure 18. Data Dataframe

In this first notebook the objective is to use some of the information in the Data dataset to enrich the Data Details dataset, where each row represents one location. Information about the applicability of subthemes, activities or indicators to each location, or about the date and time of the uploaded data, will be useful for the proposed objective.

I started by reading the two Excel files into two Pandas dataframes and immediately faced a problem when trying to change the data type of the Value variable to float, the data type used in Python for numbers with decimals. Negative values in this variable had an apostrophe before the minus sign, which had to be removed before the data type could be changed to float.
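The cleaning step described above can be sketched as follows. This is a plain-Python illustration rather than the notebook code: the function name and the sample strings are hypothetical, and it assumes the values use a dot as decimal separator.

```python
def parse_value(raw):
    """Convert a raw cell such as "'-12.5" to float, dropping a leading apostrophe."""
    text = str(raw).strip()
    if text.startswith("'"):
        text = text[1:]  # remove the apostrophe placed before the minus sign
    return float(text)

# Hypothetical raw cells as they might be read from the Excel extraction
raw_values = ["3.5", "'-12.5", "0", "'-7"]
values = [parse_value(v) for v in raw_values]
```

In a Pandas dataframe the same idea would presumably be applied to the whole column as a string operation before converting it to float.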

As mentioned previously, there are 13 variables and 65,409 rows in the Data dataframe. Looking at some of the variables, Theme has 2 different values, Subtheme 18 and Technology 8, and there are a total of 486 different indicators accompanied by 22 different units of measure. The summary statistics reveal an unbalanced distribution in some variables, as in the Theme variable, where “Environmental” occurs more than 50 thousand times. The same happens with “Consolidated” in the Consolidation Method variable, “Wind” in Technology, “Renewables” in Activity, “Weight (g, kg, kt, t)” in Unit Category and “t” in Unit Measure. In the Value variable there is a large number of repeated values: the first and third quartiles of its distribution are both 0, meaning that at least 50% of the values are 0.

Figure 19. Data Details Dataframe

In the Data Details dataframe, there are 379 rows, each corresponding to a single location, and 11 variables. I will analyse this dataframe in more detail in the second notebook, since the goal of this first notebook is to prepare the information in the Data dataframe to be used in the Data Details dataframe.

To analyse the variables in the Data dataframe in more detail, I built some visualizations. Figure 20 shows that most of the locations have around 180 associated indicators, followed by some locations with 50 or fewer indicators, with a mean of 173 indicators per location.

Figure 20. Distribution of Number of Indicators by Location

Figure 21 displays the distribution of the number of subthemes by location and shows that most locations have 9 different associated subthemes, followed by some locations with 1 or 4 different associated subthemes.

Figure 21. Distribution of Number of Subthemes by Location

Further analysis shows that most of the locations reported their first values on the 22nd, the last day of the filling period, followed closely by the day before, the 21st. Regarding the last values reported, the 27th was the day with most locations, meaning that the majority of locations reported values after the defined period. Therefore, if a problem is found in the values, it will either delay the subsequent processes where these values are needed or wrong values will be used until there is a new update.

One last note: the Data dataframe does not contain any missing values, so I can proceed to the data preparation phase.

3.3. DATA PREPARATION I

Moving into data preparation, the next step is to take the variables from the Data dataframe, which has several rows for each location, perform some feature engineering to create new variables that will be useful in the following phases and merge them with the Data Details dataframe. In this way, each location has exactly one value for each newly created variable. The variables created and merged with the Data Details dataframe were the following:

• FirstDate, is the day of the earliest date, corresponding to the first value uploaded by the contributor for each location.

• LastDate, is the day of the most recent date, corresponding to the last value uploaded by the contributor for each location.

• MaxValuesInserted, is the maximum number of values uploaded simultaneously. This variable was created from the date column by counting the number of values uploaded with the same date and time and finding the maximum for each location.

• NLateInd, represents the number of indicators uploaded later than the last day of the filling period.

• NInd3week, is the number of indicators uploaded during the third week of the filling period.

• NInd2week, is the number of indicators uploaded during the second week of the filling period.

• NInd1week, is the number of indicators uploaded during the first week of the filling period.

• ValueNUnique, is the number of unique uploaded values.

• NSubthemeDist, is the number of distinct subthemes associated to each location.

• NTotalIndic, is the total number of indicators associated to each location.

• NMeasureDist, is the number of distinct measures associated to each location.

• NThemeDist, is the number of distinct themes associated to each location.

• NIndWaste, is the number of indicators associated with waste subtheme.

• NIndWater, is the number of indicators associated with water subtheme.

• NIndEmissions, is the number of indicators associated with emissions subtheme.

• NIndEnvComplaints, is the number of indicators associated with environmental complaints subtheme.

• NIndEffluents, is the number of indicators associated with liquid effluents subtheme.

• NIndSpills, is the number of indicators associated with spills and near misses subtheme.

• NIndEnergyEfficiency, is the number of indicators associated with energy efficiency subtheme.

• NIndOtherEmissions, is the number of indicators associated with other emissions subtheme.

• NIndEnergyConsup, is the number of indicators associated with energy consumption subtheme.

• NIndEnvMS, is the number of indicators associated with environmental management system subtheme.

• NIndNetwork, is the number of indicators associated with distribution network characterization subtheme.

• NIndFleet, is the number of indicators associated with fleet subtheme.

• NIndEnergyBal, is the number of indicators associated with distribution energy balance subtheme.

• NIndLinesEquip, is the number of indicators associated with lines and equipment in hazardous areas subtheme.

• NIndByProducts, is the number of indicators associated with by-products subtheme.

• NIndBiodiversity, is the number of indicators associated with biodiversity subtheme.

• NIndQualityNetwork, is the number of indicators associated with quality of network service subtheme.

• Consolidated, is a binary variable equal to 1 when the location is consolidated and 0 when its consolidation method is not consolidated or assets under management.

• Wind, is a binary variable equal to 1 when the location’s technology is wind and 0 otherwise.

• Renewables, is a binary variable equal to 1 when the location’s activity is renewables and 0 when it is distribution, generation or activities support.

• NZeroValues, is the number of uploaded values equal to zero.

• NNonZeroValues, is the number of uploaded values different from zero.

After some feature engineering to create these 34 variables, the dataframe is now ready to be used in the second notebook.
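A few of the variables listed above can be derived with a simple group-by over the rows of each location. The sketch below is a plain-Python illustration, not the notebook code. The row layout, the sample rows and their dates are hypothetical, each row is assumed to correspond to one indicator, and late uploads are identified by a day of the month after the 22nd, which assumes the whole filling period falls within one month.

```python
from collections import Counter, defaultdict
from datetime import datetime

DEADLINE_DAY = 22  # last day of the filling period, as stated in the text

# Hypothetical rows: (location, subtheme, value, upload timestamp)
rows = [
    (1, "Waste", 0.0, datetime(2022, 1, 21, 10, 0)),
    (1, "Waste", 5.2, datetime(2022, 1, 21, 10, 0)),
    (1, "Water", 0.0, datetime(2022, 1, 27, 9, 30)),
    (2, "Fleet", 3.0, datetime(2022, 1, 22, 16, 0)),
]

# Group the rows by location
by_location = defaultdict(list)
for location, subtheme, value, uploaded_at in rows:
    by_location[location].append((subtheme, value, uploaded_at))

# Build one record of features per location
features = {}
for location, records in by_location.items():
    values = [v for _, v, _ in records]
    stamps = [t for _, _, t in records]
    features[location] = {
        "NTotalIndic": len(records),
        "NSubthemeDist": len({s for s, _, _ in records}),
        "ValueNUnique": len(set(values)),
        "NZeroValues": sum(1 for v in values if v == 0),
        "NNonZeroValues": sum(1 for v in values if v != 0),
        "FirstDate": min(stamps).day,
        "LastDate": max(stamps).day,
        # Largest number of values sharing the same upload date and time
        "MaxValuesInserted": max(Counter(stamps).values()),
        "NLateInd": sum(1 for t in stamps if t.day > DEADLINE_DAY),
    }
```

Each location ends up with exactly one value per feature, which is what allows these variables to be merged with a one-row-per-location dataframe.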

3.4. DATA UNDERSTANDING II

Moving now to the second notebook, I started by reading the dataframe previously prepared in the first notebook. This dataframe contains 45 variables (the initial 11 variables plus the 34 created variables) and 379 rows, the total number of locations.

Looking at the data in more detail through summary statistics, boxplots, histograms and correlation matrices, some reasons for concern were uncovered. Firstly, a substantial part of the 34 numerical variables show an unbalanced distribution, with some values occurring with a high frequency, which may be problematic. As can be seen in Figure 22, only a few variables do not have an extremely unbalanced distribution, including some with a positively skewed distribution, namely the ValueNUnique and NNonZeroValues variables.

Figure 22 also shows that most of the locations raised 1 doubt via Microsoft Teams and 2 doubts via email, and most neither changed the applicability of their indicators nor had new contributors. Considering the Nº of Uploaded Files and MaxValuesInserted variables, locations most commonly upload 4 different files, with some locations uploading 10 files, which may be problematic in the filling process, and the vast majority of locations upload a maximum of 8 indicators simultaneously. This low number of indicators uploaded simultaneously is worrisome, as most of the locations have 180 associated indicators, meaning they do not upload all the information at once.

Regarding the variables with the number of indicators uploaded by week, it is evident that very few indicators are uploaded in the first and second weeks of the filling period, leading to a huge number of indicators uploaded in the third week and a significant number uploaded after the defined period. It is also clear that most of the locations show low variability in the distinct uploaded values, which is not the ideal scenario for dividing the locations into heterogeneous groups. Regarding the number of indicators by subtheme, the distribution is also poor, with a huge number of locations having the same number of indicators for each subtheme. One last point: there is an enormous number of indicators with zero values, which makes these data difficult to use further on.

Figure 22. Numerical Variables’ Histograms

Regarding binary variables, there is also an unbalanced distribution between yes and no values, as shown in Figure 23. Consolidated and Wind are the variables with the smallest difference between yes and no values; even so, most of the locations are consolidated and have wind technology. On the other hand, Insertion via Excel is the most unbalanced binary variable, meaning that almost all the locations upload their files using the Excel templates available in the Sustainable Data application. It is also important to highlight the New Location variable: a significant part of the locations are new, even though the previous filling period was only one quarter earlier.

Considering now the Replaced Values and Change of Contributors variables, a small portion of locations had their indicator values replaced or had changes in contributors.

Pearson correlation was used to measure the level of dependency between the numerical variables. The Pearson correlation coefficient measures the linear dependency between two numerical variables, ranging from -1 (perfect negative correlation) to 1 (perfect positive correlation), with 0 representing the absence of a linear relationship between the variables (Adler & Parmryd, 2010).

Figure 23. Binary Variables’ Bar Charts

The Pearson correlation coefficient between two variables X and Y can be calculated using the formula below, where $\rho_{X,Y}$ represents the correlation between variables X and Y, $\mathrm{cov}(X, Y)$ represents the covariance between variables X and Y and $\sigma_X$ and $\sigma_Y$ are the standard deviations of variables X and Y, respectively.

$$\rho_{X,Y} = \frac{\mathrm{cov}(X, Y)}{\sigma_X \sigma_Y}$$
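The formula translates directly into code. The sketch below is a minimal plain-Python version with hypothetical sample data; the 1/n factors of the covariance and of the two standard deviations cancel, so only the sums of deviations are needed.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    # Numerator: sum of products of deviations (covariance up to a 1/n factor)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Denominator: product of the standard deviations (same 1/n factor cancels)
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

x = [1, 2, 3, 4]
rho_positive = pearson(x, [2, 4, 6, 8])   # perfectly increasing relationship
rho_negative = pearson(x, [8, 6, 4, 2])   # perfectly decreasing relationship
```

Note that the coefficient is undefined when one of the variables is constant, since its standard deviation is then zero.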

Applying Pearson correlation to the dataframe variables, NInd2week and NInd3week stand out for having correlations very close to zero, while NTotalIndic, NInd3week, NIndWaste and NZeroValues are all highly correlated with each other. The variables mentioned are problematic for the proposed goal, so some action might be necessary in the data preparation phase to mitigate these problems.

Regarding data quality, there are no null values in the dataframe; however, 21 duplicated rows were found and removed from the dataframe. Regarding incoherences, none were found after testing the following rules:

1. The date corresponding to the first value uploaded cannot be more recent than the date corresponding to the last value uploaded;

2. The maximum number of values uploaded simultaneously cannot be higher than the total number of indicators because each value corresponds to a different indicator;

3. The sum between the number of uploaded values equal to zero and the number of uploaded values different from zero must be exactly equal to the total number of indicators, for the exact same reason mentioned above;

4. The sum of the diverse numbers of indicators by subtheme cannot be different from the total number of indicators;

5. If any location has a last value uploaded after the 22nd, the number of indicators uploaded later than the last day of the filling period must be different from zero;

6. If any location has no last value uploaded after the 22nd, the number of indicators uploaded later than the last day of the filling period must be equal to zero.
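The checks listed above can be sketched with pandas as shown below. This is a sketch under stated assumptions: apart from NTotalIndic, NZeroValues and NIndWaste, every column name is hypothetical, the rows are toy data, and the deadline date is an assumed placeholder for the 22nd of the filling period.

```python
import pandas as pd

# Toy rows; the third row duplicates the first on purpose.
# Hypothetical column names: FirstUpload, LastUpload, MaxSimultaneous,
# NNonZeroValues, NIndWater, NIndLate.
df = pd.DataFrame({
    "FirstUpload": pd.to_datetime(["2022-01-10", "2022-01-12", "2022-01-10"]),
    "LastUpload": pd.to_datetime(["2022-01-20", "2022-01-25", "2022-01-20"]),
    "MaxSimultaneous": [3, 4, 3],
    "NTotalIndic": [5, 6, 5],
    "NZeroValues": [1, 2, 1],
    "NNonZeroValues": [4, 4, 4],
    "NIndWaste": [2, 3, 2],
    "NIndWater": [3, 3, 3],
    "NIndLate": [0, 2, 0],
})
subtheme_cols = ["NIndWaste", "NIndWater"]  # assumed subtheme breakdown
deadline = pd.Timestamp("2022-01-22")       # assumed last day of the filling period

# Null values and duplicated rows.
n_nulls = int(df.isna().sum().sum())
n_duplicates = int(df.duplicated().sum())
df = df.drop_duplicates().reset_index(drop=True)

# One boolean mask per rule; True marks a row that violates the rule.
# Upload dates are assumed to carry no time component.
late = df["LastUpload"] > deadline
violations = {
    "1_first_after_last": df["FirstUpload"] > df["LastUpload"],
    "2_max_above_total": df["MaxSimultaneous"] > df["NTotalIndic"],
    "3_zero_plus_nonzero": (df["NZeroValues"] + df["NNonZeroValues"])
    != df["NTotalIndic"],
    "4_subtheme_sum": df[subtheme_cols].sum(axis=1) != df["NTotalIndic"],
    "5_late_without_count": late & (df["NIndLate"] == 0),
    "6_count_without_late": ~late & (df["NIndLate"] != 0),
}
summary = {name: int(mask.sum()) for name, mask in violations.items()}

print(n_nulls, n_duplicates, summary)
```

Keeping each rule as a separate mask makes it possible to inspect the offending rows directly, for example with `df[violations["4_subtheme_sum"]]`.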

3.5. DATA PREPARATION II

With the data understanding phase done, the data preparation phase starts by cleaning the data, removing the outliers in the dataframe. In the literature, outliers are defined as “a few unusual observations that do not seem to belong to the pattern of variability produced by the other observations” (Johnson & Wichern, 2002) or as “an observation (or subset of observations) which appears to be inconsistent with the remainder of that set of data” (Barnett & Lewis, 1994). Outliers are characterized as abnormal, inconsistent, irrelevant or noisy information and can have diverse causes, namely human, measurement, sampling or data processing errors; sometimes an observation may seem to be an error when in fact it is not (Mandhare & Idate, 2017).

Considering the definitions above and the goal of the modeling phase, outlier identification and elimination is fundamental, since outliers affect clustering results: they may lead to the formation of new clusters or bias the existing ones. One approach to detecting outliers is a depth-based approach, which neither assumes a statistical distribution nor needs a predefined distance measure.