Exploration of representation of the different ways data analysis in real time and in its
materialization in useful business information.
Cristina Meng
Dissertação de Mestrado apresentada à
Faculdade de Ciências da Universidade do Porto e Altice Labs em Data Analytics
2022
2º CICLO
MSc
FCUP UNI2 2022
Explorationofrepresentationofthedifferentwaysdataanalysisinrealtimeandinitsmaterializationinusefulbusinessinformation. CristinaMeng
Exploration of
representation of the different ways
data analysis in real time and in its
materialization in useful business information.
Cristina Meng
MSc. Engineering Physics
Department of Physics and Astronomy 2022
Orientador
João Manuel Viana Parente Lopes, Faculty of Sciences of University of Porto
Supervisor
Célio Gomes Abreu, Altice Labs
Todas as correções determinadas pelo júri, e só essas, foram efetuadas.
O Presidente do Júri,
Porto, / /
U NIVERSITY OF P ORTO
M
ASTERST
HESISExploration of representation of the different ways data analysis in real time and in its
materialization in useful business information.
Author:
Cristina MENG
Supervisors:
Jo˜ao Manuel VIANAPARENTE
LOPES C´elio GOMESABREU
A thesis submitted in fulfilment of the requirements for the degree of MSc. Engineering Physics
at the
Faculty of Sciences of University of Porto Department of Physics and Astronomy
December 16, 2022
“ When life gives you melons, you might be dyslexic. ”
Random girl I saw on TikTok
Sworn Statement
I, Cristina Meng, enrolled in the Master Degree of Engineering Physics at the Faculty of Sciences of the University of Porto hereby declare, in accordance with the provisions of paragraph a) of Article 14 of the Code of Ethical Conduct of the University of Porto, that the content of this internship report reflects perspectives, research work and my own interpretations at the time of its submission.
By submitting this dissertation/ internship report/ project [choose accordingly], I also declare that it contains the results of my own research work and contributions that have not been previously submitted to this or any other institution.
I further declare that all references to other authors fully comply with the rules of attribution and are referenced in the text by citation and identified in the bibliographic references section.
This dissertation/ internship report/ project [choose accordingly] does not include any content whose reproduction is protected by copyright laws.
I am aware that the practice of plagiarism and self-plagiarism constitute a form of academic offense.
Cristina Meng 16th December 2022
Acknowledgements
A huge thank you to my university’s supervisor, professor Jo˜ao Viana Lopes, for all the support given, the availability and the motivational words during the most challeng- ing times and the people at Altice Labs: C´elio Abreu, my supervisor at the company, and Filipa Ferreira, my team leader, both of them welcomed me, showed availability and accompanied all of my work during this internship, it was a wonderful experience.
I would also like to thank my friends from FEUP. You guys made my choice of en- rolling in informatics in my first year all worth it.
To my best friend of all time, Joana Ferreira, who was smart enough not to put herself through a master’s degree.The girls that girl, girl and the girls who girln’t, goren’t.
To my boyfriend, for the coffee, ice cream and support given during these last few days of insanity, in which I even dared to say I wanted a cat.
And lastly, I would like to thank the tiny human being created by my sister, to whom I am also thankful for: my wonderful niece. Let me raise a toast to the girl I currently love most. Auntie will buy you and your future sister all the ice cream in the world.
UNIVERSITY OF PORTO
Abstract
Faculty of Sciences of University of Porto Department of Physics and Astronomy
MSc. Engineering Physics
Exploration of representation of the different ways data analysis in real time and in its materialization in useful business information.
byCristina MENG
This internship carried out at Altice Labs has the goal of searching patterns in the clients’
activity and look for anomalous behaviour. This is important as it allows for the operator to know its client and take the necessary measures in case anything anomalous shows up.
If the operator does not know what a normal pattern looks like, if something anomalous comes up, it will look normal to him.
In order to do so, it is essential to, first, anonymise the data: numerous problems have arose from this process, including mixed up records and fields with the wrong data.
Second, to analyze the data, the most common data science technologies were used:
SQL, R and Python. These last two have libraries that enable clustering, a process that allows us to aggregate clients with similar attributes, enabling their general characteriza- tion.
Our patterns allowed us to discover not only clients with a million records that have an almost 100% of not answering phone calls, and others on the extreme opposite, but also clients with only a couple of calls.
UNIVERSITY OF PORTO
Resumo
Faculty of Sciences of University of Porto Department of Physics and Astronomy Mestrado Integrado em Engenharia F´ısica
Explora¸c˜ao da representa¸c˜ao das diferentes formas de an´alise de dados em tempo real e na sua materializa¸c˜ao em informa¸c ˜oes comerciais ´uteis.
porCristina MENG
Este est´agio, realizado na Altice Labs, tem como objetivo procurar padr ˜oes na ativi- dade dos diferentes clientes e detetar comportamentos an ´omalos. Isto ´e importante por- que permite que o operador conhec¸a os seus cliente e tome as providˆencias necess´arias caso algo fora do padr˜ao aparec¸a. Se o operador n˜ao souber o padr˜ao normal do seu cliente, se surgir algo fora do comum, este em princ´ıpio parecer´a normal para ele.
Para isso, ´e essencial, em primeiro lugar, anonimizar os dados: v´arios problemas sur- giram deste processo, incluindo registos trocados e campos com dados errados.
Em segundo lugar, para analisar os dados, foram utilizadas as tecnologias de data sciencemais comuns: SQL, R e Python. Estas duas ´ultimas possuem bibliotecas que per- mitem oclustering, processo que permite agregar clientes com atributos semelhantes, pos- sibilitando a sua caracterizac¸˜ao geral.
Os padr ˜oes encontrados permitiram-nos descobrir que n˜ao s ´o existem clientes com um milh˜ao de registos que tˆem quase 100% de chamadas n˜ao atendidas e outros no extremo oposto, mas como tamb´em existem clientes com apenas algumas chamadas.
Contents
Acknowledgements vi
Abstract vii
Resumo ix
Contents xi
List of Figures xiii
Glossary xv
1 Introduction 1
1.1 Altice Labs . . . 1
1.2 Context and Motivation . . . 2
1.3 Data . . . 5
1.3.1 General Data Protection Regulation (GDRP) . . . 5
1.3.2 Types of Data . . . 6
1.4 Internship and Report . . . 7
2 Methods 9 2.1 Technology Used . . . 9
2.1.1 Structured Query Language (SQL) . . . 10
2.1.2 Google Colaboratory . . . 10
2.1.3 Python . . . 11
2.1.4 R . . . 11
2.2 Workflow. . . 12
2.2.1 Features . . . 12
2.2.1.1 Data Encoding . . . 13
2.2.2 Data Cleaning . . . 14
2.2.2.1 Creation of a Calendar . . . 14
2.3 Global Characterization . . . 15
2.3.1 Call Pattern . . . 15
2.4 Client Segmentation . . . 15
2.4.1 Clustering . . . 16
2.4.1.1 Hierarchical Clustering . . . 16
xi
2.4.2 Daily Activity . . . 19
2.4.3 Answered and Unanswered Calls . . . 19
2.4.4 Internal Service & Service . . . 20
2.4.4.1 Approach 1: Waiting Queue . . . 22
2.4.4.2 Approach 2 . . . 22
3 Analysis 25 3.1 Global characterization . . . 25
3.1.1 Comparison between old and new datasets . . . 25
3.1.2 Call Type. . . 27
3.1.3 Time frame analysis . . . 28
3.2 Client Segmentation . . . 32
3.2.1 Session Answered . . . 32
3.2.1.1 Client 80933 (Almost all answered calls) . . . 42
3.2.1.2 Client 1303 (Almost no answered calls) . . . 44
3.2.2 Group Services . . . 48
3.2.2.1 Approach 1: Waiting Queue . . . 48
3.2.2.2 Approach 2: Call Pattern of Group Services . . . 49
4 Conclusions 55
Bibliography 57
List of Figures
1.1 Map of the locations of Altice Labs’ offices. Retrieved from [3]. . . 2
1.2 Strategic themes that Altice Labs’ RDI activities revolve around. Retrieved from [3]. . . 3
1.3 What ABC has to offer. Retrieved from [3]. . . 4
1.4 Different types of data variables. Retrieved from [16]. . . 7
2.1 Schematic of the access to the database. Adapted from [20, 21].. . . 10
2.2 The data processing pipeline. Adapted from [40, 41]. . . 15
2.3 Hierarchical Clustering . . . 17
2.4 Different criteria for Hierarchical Clustering. Adapted from [40]. . . 18
2.5 Schematic of Ward’s method. Retrieved from [49]. . . 18
3.1 Number of events associated with each call. . . 26
3.2 Internal services that a call goes through for GID = ’992005ff-69d8-4a62- bb21-0f53a0234fe5’, for both the old and new datasets.. . . 27
3.3 Different call types distribution.. . . 28
3.4 Hourly activity of each day with the average of events represented in blue. 29 3.5 Weekly Activity of some days of March. . . 30
3.6 Normalised hourly activity, in which the number between brackets repre- sents the day of the week. Zero for Monday and Six for Sunday. . . 31
3.7 Distribution of answered and not answered calls throughout March for ev- ery day and hour. . . 33
3.8 Number of clients that only have unanswered and answered voice calls. . . 35
3.9 Example of a sequence of test calls. . . 36
3.10 Circular dendrogram resulting from the Hierarchical Clustering of the clients that have answered all calls. . . 37
3.11 Distribution of the percentage of unanswered calls, divided by order of magnitude of calls that the clients receive. . . 38
3.12 Relation between the number of total calls and the percentage of unan- swered ones. . . 39
3.13 Distribution of the percentage of unanswered calls. . . 40
3.14 Activity of clients 1303 (in green) and 80933 (in pink). . . 41
3.15 Daily hourly activity of Client’ 80933’. . . 42
3.16 Normalized hourly activity of Client ’80933’. . . 43
3.17 Internal Services of client 80933 and respective Number of occurrences for answered (in green) and unanswered (in pink) sessions.. . . 44
3.18 Daily hourly activity of Client’ 1303’. . . 45
3.19 Normalized hourly activity of Client ’1303’. . . 45 xiii
3.20 Histogram of the occurrences of LEC=’33001’ for client 1303 as a function
of time. . . 46
3.21 Internal Services of client 1303 and respective Number of occurrences for answered (in green) and unanswered (in pink) sessions. . . 47
3.22 Clients’ count of the times they go through the service Waiting Queue. . . . 48
3.23 Number of events for each group service and each client. . . 50
3.24 Clients chose based on the number of events per service and service distri- bution. . . 50
3.25 Activity of Client 114114.. . . 51
3.26 Activity of Client 91763. . . 52
3.27 Call Pattern for an encoded GID found through IVR. . . 53
Glossary
FCUP Faculty of Sciences of University of Porto RDI Research, Development and Innovation
IoT Internet of Things
ABC Advanced Business Communications AS Application Server
SEC Convergent Enterprise Service CDR Call Detail Records
GDPR General Data Protection Regulation EU European Union
EEA European Economic Area ML Machine Learning
DB Database
GID Global Call ID Identifier LEC Logic End Cause
HG Hunt Group
WQ Waiting Queue PA Pre-Answer
IVR Interactive Voice Response WMS Windless Media Server
IP Internet Protocol
xv
Chapter 1
Introduction
In this document, a synthesis of the internship carried out at Altice Labs is presented.
This internship took place under the curricular unit of ”Internship” of the Master’s in En- gineering Physics of the Faculty of Sciences of University of Porto (FCUP). This internship takes place at Altice Labs, SA, headquartered in Aveiro but taken out remotely at Altice’s office in Porto.
1.1 Altice Labs
With a legacy of several decades,Altice Labs, previously known asPT Inova¸c˜ao, has been a catalyst of innovation and transformation, asserting a position of leadership in the RDI (Research, Development and Innovation) area in Portugal. For example, in the past year, 2021, Altice Labs announced an innovation in fibre optics that revolutionised how these networks were deployed on the ground. With this, they doubled their capacity, enabling them to serve twice as many customers as compared to the current technology. Over 300 million people and more than 60 countries use products and solutions made in Altice Labs [1].
Altice Labs has offices in Algarve, Madeira and Azores archipelagos, Porto, Lisbon, Viseu, and also outside of Europe, such as Brazil, through centres dedicated to the re- search and development of advanced solutions of Telecommunications and Systems of Information [2].
1
2
EXPLORATION OF REPRESENTATION OF THE DIFFERENT WAYS DATA ANALYSIS IN REAL TIME AND IN ITS MATERIALIZATION IN USEFUL BUSINESS INFORMATION.
FIGURE1.1: Map of the locations of Altice Labs’ offices. Retrieved from [3].
1.2 Context and Motivation
Altice Labs continually supports collaborative projects of RDI, working in partnership and cooperation with universities all over the world, RD institutions, partners, suppliers and costumers [3]. Presented in fig.1.2 are the activities of RDI in which Altice Labs is involved. These involve Artificial Intelligence & Machine Learning, Smart Cities, Smart Living, Internet of Things (IoT), Digital Services & Platforms, 5G Future Networks, in- cluding the optical evolution framework [3].
1. INTRODUCTION 3
FIGURE1.2: Strategic themes that Altice Labs’ RDI activities revolve around. Retrieved from [3].
ABC (Advanced Business Communications) is an Enterprise Collaboration Solution in which ABC-AS (where AS stands for Application Server), previously designated as SEC (Convergent Enterprise Service), delivers a unified communication experience across all user devices - fig.1.3. One of its main features is the Analytics, where we have the usage information of Customer resources and services. However, the corporate environment framed in telecommunications, combined with the emergence of new services and tech- nologies, is increasingly complex and with faster changes/transformations. The simple collection and presentation of data from all activities with an operational and business nature are insufficient. A need to manipulate and analyse the data from different per- spectives arises, fostering the interpretation and conversion of these data into useful in- formation that companies can use to understand and enhance their business.
Every telecommunication company stores detailed call records, denominated by CDRs (Call Detail Records), that consist of information captured by the telecommunications
4
EXPLORATION OF REPRESENTATION OF THE DIFFERENT WAYS DATA ANALYSIS IN REAL TIME AND IN ITS MATERIALIZATION IN USEFUL BUSINESS INFORMATION.
FIGURE1.3: What ABC has to offer. Retrieved from [3].
providers during the calls, messages, and internet activity of a customer. Telecommu- nication providers routinely collect these records to detect congested cell towers to man- age additional bandwidth, to bill customers for cellular usage and are typically used to troubleshoot and improve the network’s performance. For each communication between individuals, the mobile operator keeps a CDR that stores metadata that can contain in- formation such as the type of call, if the call was incoming or outgoing, the starting and ending time of the call and call duration. Since the contents of the communication are not revealed through the CDR, as they only store communication-related properties, CDR can be considered metadata [4]. These records are essential since they allow the identification of similar use patterns and information that telecom operators and their clients can use to their advantage [5,6]. The knowledge of the clients’ patterns allows the telecommunica- tions operator to understand what anomalous behaviour might look like and, according to that, make the appropriate decisions. Another critical aspect of these call patterns is that they can present distinct characteristics even though multiple clients may follow the same pattern. Characterising these clients would be in the operator’s best interest, as it may help with future business plans.
1. INTRODUCTION 5
1.3 Data
With the invention of the Internet and its technological advancements, data is at the centre of everything. The volume of data collected and stored globally is growing exponentially, and everyday data is being used in new ways. Worldwide demand for digital services and growth results in more data to protect: data privacy and security have become key social and business issues [7]. Data protection is about securing digital information while keep- ing data usable for business purposes without trading customer or end-user privacy [8].
It helps to reduce risk and enables a business or agency to respond quickly to threats. In 1995, the European Data Protection Directive was passed [8]. However, as it was adopted when the Internet was still in its early stages, it established minimum data privacy and security standards [9]. Stricter guidelines and restrictions needed to be adopted. In 2016, the EU adopted the General Data Protection Regulation (GDPR) [9], a regulation we will discuss below.
1.3.1 General Data Protection Regulation (GDRP)
GDRP (EU) 2016/679 is a regulation of the European right related to the privacy and pro- tection of personal data, applicable to all European Union (EU) and European Economic Area (EEA), implemented in 2018 [10]. It regulates the exportation of personal data out- side of the EU and EEA, and one of its main goals is to give the citizens and residents ways to control their data [10].
The regulation contains clauses and demands related to how personal information is treated in the EU, and it applies to every company that operates in the EEA, no matter their origin country [11]. The treatment of personal data is made sure to respect the prin- ciples of data protection: all the data must be stored using pseudonymisation or complete anonymisation:
• Pseudonymisation is a required process for stored data that transforms personal data in such a way that the resulting data cannot be attributed to a specific data subject without the use of additional information [12].Moreover, it needs to ensure that the data cannot be available without explicit consent and be used to identify someone without the use of additional information (key) kept separately from the pseudonymised data [12].
6
EXPLORATION OF REPRESENTATION OF THE DIFFERENT WAYS DATA ANALYSIS IN REAL TIME AND IN ITS MATERIALIZATION IN USEFUL BUSINESS INFORMATION.
• Complete anonymisation: for the data to be truly anonymised, the anonymisation process needs to be irreversible, as opposed to pseudonymisation that has a key to decrypt the sensitive data [12,13].
The regulation also does not allow the treatment of any data outside the legal context specified in it, besides the case in which the one who controls the data has received the explicit consent andopt-inof the data’s owner. The owner also has the right to revoke this permission at any time [14].
The responsible for the treatment of personal data should declare any data collection, the legal framework that allows that collection, the goal of the data processing, the time frame the data is going to be stored and the sharing of information with any third party partner outside the EU. The users have the right to demand a copy of the collected data and the right to demand the elimination of that same data under certain circumstances [14]. The public authorities and companies, whose activity is focused on the regular or systematic treatment of personal data, are mandated to have a data protection officer (DPO) responsible for ensuring that the processing is going according to the GDPR.
1.3.2 Types of Data
The simplest form of data is multidimensional data. This type of data typically contains a set of records. Each record contains a set of fields referred to as attributes, dimensions or features. These fields describe the different properties of that record. A multidimensional data set is defined as follows:
Definition 1.3.1 (Multidimensional Data) A multidimensional data set D is a set of n records, X1...Xn, such that each record Xicontains a set of d features denoted by(x1i...xdi)[15].
The records are the table rows, and the fields are the columns. As seen from fig.1.4, there are two broad types of data - categorical (or qualitative) and numeric (or quantita- tive) - and each of them can be split into different categories.
Categorical data is grouped into categories that take on discrete unordered values. It can be classified as nominal when the categories do not have an order or ordinal, when the categories do have an order [15,17].
Quantitative data, as the name says, tries to quantify things by considering numerical values that make it countable in nature [18]. Thus, when each value ofxijin Definition 1.3.1 is quantitative, the corresponding data set is referred to as quantitative multidimensional data [15].
1. INTRODUCTION 7
FIGURE1.4: Different types of data variables. Retrieved from [16].
• Continuous data is measured on a continuous numerical scale and can take on a large number of possible values. Continuous data can be classified as an interval when it does not have an absolute zero and negative numbers also have a meaning, or as a ratio when it does have an absolute zero and negative numbers do not have a meaning [17].
• Discrete data measures counts or numbers of events. So while it is numerical data, it is not measured on a continuous numerical scale and hence does not fit neatly into either of the classifications above. So it can be treated as either categorical or continuous, depending on the number of possible values [17].
To work with quantitative data, we first need to encode it before processing [15,18].
Data mining algorithms, such as Machine Learning (ML) ones, can only work with nu- merical data, as the models are mathematical by nature. Thus, these categories help us decide which encoding strategy can be applied to which data type.
1.4 Internship and Report
In this internship, the given use case is the search for anomalous behaviour in call pat- terns associated with voice communications. We explore different clients’ time series pat- terns to identify possible anomalous behaviour. Such behaviour encompasses traffic alter- ations (drop/peak in the time-series pattern) and abnormal percentages of answered/u- nanswered calls and their root causes.
This report divides into three main chapters, excluding the Introduction chapter. The first one focuses on the Methods, a chapter divided into four main topics, starting with
8
EXPLORATION OF REPRESENTATION OF THE DIFFERENT WAYS DATA ANALYSIS IN REAL TIME AND IN ITS MATERIALIZATION IN USEFUL BUSINESS INFORMATION. the technology used to retrieve and analyse the data and moving to the workflow of con- verting this raw data into useful one. The approach taken to characterise the data into patterns is described, and lastly, anomalous behaviour will try to be found. However, throughout this internship, many challenges arose with retrieving the data, as the main one being with its anonymisation: as the telecom’s operator consent was necessary, not only for which data to give but also if they agreed or not with the process of anonymisa- tion, multiple back and forth meetings took place with MEO (mobile and fixed telecom- munications service and a brand from Altice that revolutionised the telecommunications market in Portugal [19]). Only the data that contained sensitive information related to the users of the ABC service were anonymised or ’zeroed’, this last one meaning that, when importing these fields, the information was not obtained. The remaining fields remained unchanged.
MEO defined a key that allowed the anonymisation process to go through. Altice does not know this key, so this process belongs in the pseudonymisation category. This strategy guaranteed that, even though the information is always anonymised, it would always refer to the same identification. This process took around nine to ten months, so an extension of two months of the initial eleven months was necessary to have time to analyse the data.
The second chapter details the analysis of the patterns and anomalous behaviour, with a few examples of what this may look like, whether in the call patterns themselves and the percentage of unanswered calls and its logic end causes to a sample characterisation of one client and their call patterns.
The final chapter reports the conclusions retrieved from this internship, the challenges faced, and considerations for the future.
Chapter 2
Methods
In order to analyze the CDRs given, we first need to understand what is available for data analysis and the right tools for what we are trying to achieve. In terms of strategy, dia- grams depicting what was done regarding retrieving and converting raw data into usable data are presented. We studied the general activity of all the clients and their voice calls to accomplish the objectives for this internship. Clustering allowed us to aggregate clients with similar attributes, enabling their general characterization. These characteristics can be the number of unanswered/answered calls, the number of calls, the functionalities one goes through (internal services), or the most usual, the call activity. We will be exploring each one of these in the following sections.
2.1 Technology Used
The most commonly used set of tools for data analysis include SQL, Python and R, which we will go into more detail about below. The main tool used to run Python in the web browser, Google Colab, will also be discussed.
The connection between these used technologies is depicted in the workflow from fig.2.1.
9
10
EXPLORATION OF REPRESENTATION OF THE DIFFERENT WAYS DATA ANALYSIS IN REAL TIME AND IN ITS MATERIALIZATION IN USEFUL BUSINESS INFORMATION.
FIGURE2.1: Schematic of the access to the database. Adapted from [20,21].
2.1.1 Structured Query Language (SQL)
Due to the GDPR, exporting and storing sensitive data in our local hardware was prohib- ited. The only way to access the data was to connect directly to the database (DB). Due to this, SQL was used in order to write the required queries for retrieving the data. SQL is a non-procedural language, meaning the user only needs to specify ’what to do’ and not
’how to do it’.
SQL mainly has four essential operations: creating, reading, updating and deleting data from the table where the content is stored. In this internship, we used SQL solely to read the data from the database [22,23].
2.1.2 Google Colaboratory
Google Colaboratory, also known as Colab, is a Jupyter notebook environment that runs entirely in Google’s cloud servers. Similar to Jupyter Notebook, Colab is a list of cells that can contain text or Python executable code and its respective outputs. The main advantage of using Colab is that none of the user’s computer resources are used to run the notebook, and also, it makes it easier to share the notebooks with others since Colab notebooks are stored in Google Drive, and if it contains sensitive content, the cells’ output can be omitted [24,25]. However, since we need to access our local database, we need to be able to run Colab locally. Colab allows us to connect to a local runtime using Jupyter, meaning that all code will be executed on our local hardware, enabling access to our local files. In order to do so, the jupyter extension ’jupyter http over ws’ needs to be installed and enabled.
2. METHODS 11 2.1.3 Python
Python is a popular programming language with plenty of libraries suited for various tasks, which is often cited as one of Python’s strengths. A description of the libraries used can be seen below.
• Pandasallows data import from not only various file formats, such as comma-separated values, SQL database tables and Microsoft Excel spreadsheets, but also various data manipulation operations such as merging, reshaping, selecting, as well as data cleaning, and data wrangling features [26].
• cx Oracleis a module that enables access to Oracle Database. Executing SQL queries is the primary way a Python application communicates with Oracle Database. The queries can only be executed using the method Cursor.execute(). Rows can then be all fetched using one of the methods Cursor.fetchall(). The fetch methods return data as tuples which Pandas then process to convert into a dataframe [27].
• Matplotlib is Python’s data visualisation and graphical plotting library. All graphs from this report were plotted using this library [28].
• Scikitis a free software machine learning library for Python. It features various clas- sification, regression, and clustering algorithms, including hierarchical clustering, and it is designed to interoperate with the Python numerical and scientific libraries NumPy and SciPy [29,30].
2.1.4 R
R is a programming language focused on statistical analysis and data visualisation [31].
Like Python, it provides a collection of libraries that aid in everyday data science tasks. In this case, R was used for its libraries associated with hierarchical clustering: ’Circlize’ and
’Dendextend’. The first one allowed the plot of a circular dendrogram, while the second allowed for its personalisation, e.g., colours, labels, etc. When we have too many labels, a regular dendrogram can become hard to read, so plotting it in a circular manner is the best solution.
12
EXPLORATION OF REPRESENTATION OF THE DIFFERENT WAYS DATA ANALYSIS IN REAL TIME AND IN ITS MATERIALIZATION IN USEFUL BUSINESS INFORMATION.
2.2 Workflow
Unfortunately, most of the time, raw data is not immediately suitable for automated pro- cessing. In our case, the first challenge encountered is the number of features we have, which are over a hundred, most of them not relevant to the problem at hand. So here we highlight the importance of feature selection.
2.2.1 Features
The general features used in the analysis involve:
• CALL TYPE (Number)
TABLE2.1: CALL TYPE: Possible numbers and description.
Number CALL TYPE
-1 Unknown
0 Voice
1 Data
2 SMS
3
USSD
(Unstructured supplementary service data):
service used to receive important information,
such as the available balance, via SMS after dialing a service code.
4 Fax
5 SS NOTIF where SS stands for Supplementary Service [32].
6
CS VIDEO
(Circuit-Switched Video call)
where a video call occurs using a channel in the circuit-switched network.
7 DATA64K:
digital transmission method where each call is carried by a 64-kbps digital stream [33].
• CLIENT ID (Number): customer identifier. Customer is an ABC entity referring to enterprises that subscribe to the service provided by the telecommunications oper- ator.
• INTERNAL SERVICE (VARCHAR2): ABC functionality or list of functionalities ap- plied during the call. The reason for the field ’INTERNAL SERVICE’ to be ’VAR- CHAR2’ instead of ’Number’ is because sometimes, in one CDR, a call goes through multiple services at the same time, meaning that the field will be occupied by a se- quence of numbers separated by commas, making the final result, not in number form.
• INFO35 (VARCHAR2): general field prepared to accommodate a wide variety of value types. In this case, it contains the Global Call ID Identifier (GID), which refers
2. METHODS 13 to a generated unique identifier used to mark each call in order to allow tracing the call along its path.
• LOCATION TYPE (VARCHAR2): Type of call from location processing. It can be one of two types, either ONNET (internal calls), in which the call is between num- bers within the same client or OFFNET (external calls), where the call happens for offnet destinations (different clients). This field’s existence is associated with the different payment plans, e.g., internal calls are more likely to be free of charge than external calls.
• LOGIC END CAUSE (LEC) (Number): number code used to identify the cause of the call ending. These codes may refer to success causes, such as the call being ended by the origin after it was established successfully, or error codes, identifying the causes of why the call was not allowed to proceed.
• SERVICE (Number): ABC Sub-service used in the session, responsible to generate the CDR.
• SERVICE ACCESS TIME (DateTime): DateTime of service access, with system time.
• SESSION ANSWERED (Binary): 1 if the session is answered and 0 if not answered.
Afterwards, it comes the actual pre-processing of the data, that involves not only tasks such as data cleaning, data encoding, but also involves feature extraction, which generates new features from raw data that for some reason, are not directly comparable, such as categorical data. As mentioned previously, many ML algorithms only work on numerical data, so, in order to do so, we resort to data encoding.
2.2.1.1 Data Encoding
Several methods can be used to convert categorical data into numerical data, the most popular ones being One-Hot Encoding and Label Encoding.
What One-Hot Encoding does is it takes a column with categorical data. For each unique value in the set of the categorical attribute, additional columns (features) are cre- ated and assigned the value one or zero. Each feature is represented as a binary vector.
All the values are zero, and the index is marked with one [34]. This method increases the dimensionality of the dataset, which is not ideal [35].
14
EXPLORATION OF REPRESENTATION OF THE DIFFERENT WAYS DATA ANALYSIS IN REAL TIME AND IN ITS MATERIALIZATION IN USEFUL BUSINESS INFORMATION. In Label Encoding, each categorical value is replaced with a numeric value between 0 and the number of classes minus 1. Each label is assigned a unique value, and it is always possible to revert it to the original value. In the documentation of this technique, nothing explains how the numbers are attributed to the labels [36]. However, the problem with Label Encoding is that it might add a relationship between the labels that did not exist prior to the encoding. For example, if a distance based clustering is performed, Label Encoding may add a proximity relationship between the labels and cluster them together.
However, since it does not occupy more memory, we will most likely use this technique.
2.2.2 Data Cleaning
In order to analyze the events a call goes through, we decided to exclude those rows in which the internal service and the GID were NaN. NaN values are due to most likely pars- ing errors. Moreover, some CDRs go through multiple internal services simultaneously, presenting themselves no longer as numerical data, as commas separate them. This way, it was also necessary to separate this single CDR into multiple ones, where everything is the same besides the internal service field [37,38].
Following this, we have feature transformation, a family of algorithms that creates new features from existing features. For example, the date of an event might not be explana- tory enough. As we are looking for time-series patterns, knowing the weekday might be helpful. So a new feature for the weekday is created from the DateTime one.
2.2.2.1 Creation of a Calendar
The time and date of each CDR come in the format ’yy.MM.DD HH:mm:ss.ffffff’. To identify possible temporal patterns, we separated the CDRs within certain time frames, such as ’Week of the Month’, ’Day of the Week’, ’Month’ and even ’Hour of the Day’.
These time frames aggregated is what we call the calendar. In case of missing hours or even dates, the respective rows were inserted in the data table with the fields not related to the date time filled with ’NaN’. This allows for an aligned representation of the curves within the same day of the week for an other week [38,39].
As we can see, the data mining process is a pipeline containing many phases, as rep- resented in fig.2.2.
2. METHODS 15
FIGURE2.2: The data processing pipeline. Adapted from [40,41].
2.3 Global Characterization
Before going deep into characterising the clients, we will first be studying their call pat- terns and the services they go through. This is one way to find patterns, as clients may go through the same services, designing the same pattern or not.
2.3.1 Call Pattern
Each call has a series of events, so each GID corresponds to a different call. For each event, the call goes through an internal service, a functionality, so a first attempt to look at time- series patterns is to, for each GID, plot all the internal services the call goes through as a function of time.
In the first approach, we will analyse the call patterns of different clients and the clients themselves. This will allow us to move forward to not only analyse the most common and least common services a call goes through but also which clients appear more and less frequently and, afterwards, a more detailed analysis of the calls themselves if either they were or were not answered and the reason as to why it happened.
We are only interested in voice calls, so we will filter out every row that does not have
’CALL TYPE’ = 0, as this is the code for voice.
2.4 Client Segmentation
One of the possible ways to segment the different clients is by clustering them according to their characteristics. Clustering is the most common unsupervised learning algorithm used to explore the data analysis and find hidden patterns or groupings in the data [42].
16
EXPLORATION OF REPRESENTATION OF THE DIFFERENT WAYS DATA ANALYSIS IN REAL TIME AND IN ITS MATERIALIZATION IN USEFUL BUSINESS INFORMATION. Unsupervised learning is a type of machine learning that uses unlabeled and unclassified training data [43]. The main goal of it is to discover hidden and interesting patterns in unlabeled data, that the algorithm will find on its own, unlike what happens with super- vised learning, in which we know what values we will obtain for the output [44].
2.4.1 Clustering
One definition of clustering is: Given a set of data points, partition them into a set of groups which are as similar as possible [45].
There is not only one type of clustering algorithm; the most studied ones are the par- titional and hierarchical. They have been used in different applications due to their sim- plicity and ease of implementation relative to other clustering algorithms [45].
Partitional methods need to be provided with a set of initial seeds (or clusters), which are then improved iteratively and can be divided into distance-based and density-based [45]. Distance-based methods optimize a global criterion based on the distance between the patterns, while density-based methods optimize local criteria based on density infor- mation of the patterns [46].
On the other hand, hierarchical methods can start with the individual data points in single clusters and build the clustering. The role of the distance metric is also different in both of these algorithms. In hierarchical clustering, the distance metric is initially ap- plied to the data points at the base level and then progressively applied to subclusters by choosing absolute representative points. However, in the case of partitional methods, the representative points chosen at different iterations can be virtual points such as the centroid of the cluster (which is nonexistent in the data) [45]. Hierarchical clustering is the method chosen for clustering, as it does not have to define the number of clusters.
2.4.1.1 Hierarchical Clustering
Hierarchical clustering algorithms approach the problem of clustering by developing a tree structure called the dendrogram. Once the dendrogram is constructed, a horizontal line can be traced anywhere in the dendrogram, and the number of legs it intersects is the number of clusters we will have.
Hierarchical clustering can be achieved in two different ways, bottom-up (agglomer- ative) and top-down (divisive) clustering [45]. Divisive algorithms are more robust in the early stages compared to their counterpart, and agglomerative clustering techniques, in
2. METHODS 17 turn, are more understandable and by far the most popular [47]. From fig.2.3a, we can see that agglomerative clustering works like the following: from a data set, we attribute each data point to a cluster, and from there, the individual data points are successively agglom- erated into higher-level clusters, moving up in the hierarchy. At each stage of hierarchical clustering, the clusters for which the distance between them is the minimum are merged.
At the same time, a dendrogram is constructed: the top of the U-link indicates a cluster merge. The two legs of the U-link indicate which clusters were merged, and their length represents the distance between the child clusters [48].
(A) Schematic of bottom-up clustering. (B) Dendrogram.
FIGURE2.3: Hierarchical Clustering
In order to decide which clusters should be merged, appropriate criteria should be applied. These criteria can be based not only on the distances (such as Euclidean) between clusters, for example, [49,50]:
• Centroid linkage: computes the centroids of the clusters and calculates the distance between them;
• Single linkage: calculates the distance between the closest points of the clusters;
• Complete linkage: calculates the distance between the furthest points of the clusters;
• Average linkage (not so popular): calculates the distance between all pairs of the clusters and takes its average as the distance, as seen in eq.2.2
D(A,B) =TAB/(NA·NB) (2.1)
18
EXPLORATION OF REPRESENTATION OF THE DIFFERENT WAYS DATA ANALYSIS IN REAL TIME AND IN ITS MATERIALIZATION IN USEFUL BUSINESS INFORMATION. whereTAB is the sum of all pairwise distances between cluster A and cluster B.NA and NB are the sizes of clusters A and B, respectively. This formula is, of course, adaptable for every single number of clusters.
FIGURE2.4: Different criteria for Hierarchical Clustering. Adapted from [40].
But also on the minimization of dispersion of the points inside each cluster:
• Ward’s method is based on the error sum of squares (ESS). The ESS for a cluster is defined as the sum of the squared Euclidean distances from the cluster’s points to the cluster’s centroid [51]. It combines clusters where the increase in within-cluster variance is the smallest.
ESSj =
nj
i
∑
=1||Xij−Xj||2 (2.2)
WhereXij is theithobservation in thejthcluster andXjis the cluster’s centroid [51].
The distance between two clusters, 1 and 2, is how much the sum of squares will increase when they are merged [52].
FIGURE2.5: Schematic of Ward’s method. Retrieved from [49].
2. METHODS 19 2.4.2 Daily Activity
It is mainly here that we make use of the calendar. It is a way to see the client’s most active and inactive hours and if the activity curve, which consists of the number of events per hour, follows the same trend for all days of the week. An unusual spike or drop in the curve could represent a possible anomaly. In order to see if all days of the week follow the same trend, independently of the number of events registered that day, a normalisation of the curves was required. This consisted in dividing the number of events by the area below the activity curve.
2.4.3 Answered and Unanswered Calls
One important aspect to analyse is whether the call was answered or not. As in daily activity for all voice calls, in this case, we will divide it into two curves, one for the activity of answered calls and another for the unanswered calls. There are many reasons why a call was or was not answered: the general cause is identified by the first two digits of the
’LOGIC END CAUSE’ five-digit number and the remaining digits identify the specific cause
TABLE2.2: General Causes for answered and unanswered calls.
Cause Group Code
Calls with success Normal successful call 10
Call successful (but with completion forced by logic) 11
Calls without success
Generic failure of a call 30
Identification /validating failure 31
Loading failure 32
Failure associated with the destination 33 Failure associated with the origin 34
Incorrect utilization 35
Problems in external system 36
Below, in table2.3, there are the logic end causes that appear throughout the analysis in section3.2.1.
20
EXPLORATION OF REPRESENTATION OF THE DIFFERENT WAYS DATA ANALYSIS IN REAL TIME AND IN ITS MATERIALIZATION IN USEFUL BUSINESS INFORMATION. TABLE2.3: Logic End Causes that appear throughout this analysis and its description.
LOGIC END CAUSE Description
10001 CALL OK
10009 Origin Abandoned (caller hangs up before answering) 10022 Origin Abandoned with Call Completed Elsewhere Reason
11010 Session terminated by network
30055 call Forwarded by Unconditional forwarding
30057 call Forwarded by No answer forwarding
30059 call Forwarded by Moved temporarily forwarding
33001 Busy destination
33002 Destination does not respond
33004 Nonexistent destination
33007 Customer unreachable destination
34031 Exceeded number of simultaneous calls
Our source of concern is the percentage of unanswered calls. We visualise this data using a histogram with the percentages of unanswered calls versus the number of clients that have them. Moreover, we need to consider clients who may only have a few voice calls throughout the analysed dates. A percentage of 100% is more relevant to clients with a million voice calls than to clients with only one voice call. This way, we will separate the graphs by order of magnitude, from 1 to 10 million voice calls.
2.4.4 Internal Service & Service
ABC functionalities are classified as Services and Internal Services. Internal Services refer to specialised features within each ABC Service. There are internal services that are most common to appear throughout this analysis: some refer to Group or Advanced Services, which are functionalities applied to the enterprise as a whole, and others refer to User Services, which allow to process the call differently depending on the settings of a specific user. Group services involve Hunt Group (HG), Waiting Queue (WQ), Pre-Answer (PA) and Interactive Voice Response (IVR), and a respective explanation of these is given below.
The number between parenthesis corresponds to the number of the Service.
2. METHODS 21
• Global (1): Service that manages calls to or from a user, defines if it processes the originated or received party, and verifies if it is necessary to apply other services to the call.
• Hunt Group (2): Service that allows organisations to distribute calls to a group of people in a company. The incoming calls are made to a single phone number and can be re-routed to multiple phone lines.
• Waiting Queue (3): Service that distributes incoming customer calls based on the call order. The caller remains on hold until an agent becomes available, at which point the call queue routes the customer to the representative [53].
• Pre-Answer(4): An audio announcement made to an incoming caller before the call is put through.
• IVR (11): Service that allows incoming callers to access information via a voice re- sponse system of pre-recorded messages without having to speak to an agent, as well as to utilise menu options via touch tone keypad selection or speech recogni- tion to have their call routed to specific departments or specialists [54].
Similar to group services, an explanation of the User services is also given:
• Forward: allows the redirection of a received call to a new destination according to a set of conditions, such as temporal patterns, or due to the destination’s status (’Busy’, ’No Answer’, and ’Not Available’), among others.
• OIP (Originating Identity Presentation): OIP is a supplementary service in which the network provides the Called Party with the trusted identity of the Calling Party.
• Out Numbers: allows to define which is the number used to identify the call made by a user.
• Convergent: allows the call’s origin to be identified differently depending on whether the destination is fixed or mobile. For example, the call goes to the network identi- fied with a mobile number because the destination is mobile. However, if the desti- nation is fixed, the call made by the same user with the same telephone is identified with a fixed number.
22
EXPLORATION OF REPRESENTATION OF THE DIFFERENT WAYS DATA ANALYSIS IN REAL TIME AND IN ITS MATERIALIZATION IN USEFUL BUSINESS INFORMATION.
• NP Announce Surpress (Service Portability Announcement Inhibition): allows the activation or deactivation of the option to play an announcement identifying that the number the user is calling is ported. Ported number announcement is played when, for example, a user calls a 96xxx number which is currently ported to Vodafone. So, this announcement allows the user to realise that the call may have a different cost since the number now belongs to a different provider.
• On Net Display: allows to define the number type used to identify an internal call to the destination. For example, it defines if the call is identified by the user’s short number or by the user’s DDI (Direct Dial In number), in case the user has both.
• Number Portability: differentiation of ported numbers.
2.4.4.1 Approach 1: Waiting Queue
One of the internal services that’s the richest in information is the Waiting Queue. The idea is to identify which clients go through this Service more often, pick a suitable one and then check for abnormalities in the fields of this Service - table2.4.
TABLE2.4: Specific features used for the Waiting Queue service
Feature Description
INFO4 0: Call wasn’t answered by Waiting Queue destination;
1: Call was answered by Waiting Queue destination.
INFO5
0: Call has left the waiting queue due to error;
1: Call has left waiting queue because call was answered by waiting queue;
2: Call not answered by waiting queue because the queue is full. destination or because it was canceled by the origin;
3: Call has left waiting queue because user press Intercept exit option;
4: Call has left waiting queue because max waiting time reached.
INFO6 0: The call was not forwarded;
1: The call was forwarded INFO7 Indicates the waiting time in Waiting Queue.
This is a very targeted approach, as we pick a specific service. However, it is also possible to study the generality of group services which we will see below.
2.4.4.2 Approach 2
In this case, what we will be doing is analysing the call path for each one of these services.
However, instead of using all the global call IDs for a specific time frame like in section 2.3.1, we will be grabbing the ’USER ID’ that are present in each of the services for each
2. METHODS 23 client, storing all the GIDs that have the same common ’USER ID’ and for each one of them, do a plot of the services a call goes through for a certain DateTime. Note that typically the USER ID field is used to store the database ID of the user who made or received the call, but since these are Group Services and therefore not applied to a specific user, the same field is used to save the database rule IDs for each Service.
For a more in-depth analysis, we will be grabbing the same clients and will be analysing them by the ’LOCATION TYPE’ of the call and filtering the Service to Global only.
Chapter 3
Analysis
In this chapter, we will compare data sets anonymised manually and, by a process, have a general idea of the call types that our data has and which dates are worth analysing.
Moreover, the activity and unanswered and answered call curves for all clients will be studied, along with the most common LEC cause groups. Due to their unusually high percentage of unanswered and answered calls, some clients will be targeted, analysing their most common internal services and LECs. As the number of clients allows it (for a more readable graph), hierarchical clustering will be performed to aggregate them ac- cording to their activity for clients with only answered calls. At last, a characterisation according to the volume of internal and external calls will be performed on two different clients, and an analysis of the call pattern of the group services will be shown.
3.1 Global characterization
This section will be more of an understanding of the data we have at hand, knowing what it contains, the number of CDRs per hour, day, and so on.
3.1.1 Comparison between old and new datasets
At the start of this internship, no use case had been defined and at this point, having only minimal data, we started by analysing the Global Call ID (GID). Each call has a series of events, and each different GID corresponds to a different call. For each event, the call goes through a service. The initial approach consisted in counting the number of events per call and representing it graphically, the number of events as a function of each GID - fig.3.1.
25
26
EXPLORATION OF REPRESENTATION OF THE DIFFERENT WAYS DATA ANALYSIS IN REAL TIME AND IN ITS MATERIALIZATION IN USEFUL BUSINESS INFORMATION.
FIGURE3.1: Number of events associated with each call.
We can see a few calls with an unusually higher number of events. However, this anal- ysis is not feasible anymore for the new coming data. There are around 16000 clients in the new data set, with over thousands of records, making it impossible to analyse all calls for all existing clients. The first dataset provided, manually anonymised, was relatively poor, as access to only a day and a half of communications was provided. With the sec- ond batch, anonymised with an algorithm, we had access to around two and a half days of communications. The goal was to understand whether or not this anonymisation algo- rithm changed the data by any means. As previously mentioned, the number of events corresponds to the number of ”jumps” performed inside the same call, and therefore, in order to compare the data from the second dataset to the first one, we decided to check if, for the same GID, the call goes through the same internal services. An example of this comparison can be seen in fig.3.2 It is essential to remember that all fields of ’INFO35’
with NaN were erased, as it does not allow us to trace back the events to a call.
3. ANALYSIS 27
FIGURE 3.2: Internal services that a call goes through for GID = ’992005ff-69d8-4a62- bb21-0f53a0234fe5’, for both the old and new datasets.
At first sight, the plots look very distinct because the time of the call is much more precise in the second dataset, going up to milliseconds, while in the first dataset, it only goes to minutes. However, in both cases, the same internal services are passed through, meaning that the anonymisation algorithm is reliable.
3.1.2 Call Type
From this moment forward, all the analysis will be performed with the new data, the one anonymised with an automatic process rather than manually. Fig.3.3shows the different types of calls that appear in the data set. Our focus will be directed to voice calls, that occupy the most significant slice with a percentage of almost 78%. The remaining slices
28
EXPLORATION OF REPRESENTATION OF THE DIFFERENT WAYS DATA ANALYSIS IN REAL TIME AND IN ITS MATERIALIZATION IN USEFUL BUSINESS INFORMATION. of the pie chart include ’Data’ and ’SMS’, which although not our focus, their percentages were big enough to be left out as individual slices. The ’Miscellaneous’ slice includes calls of types such as video messages and faxes, which alone, were not significant enough to be left out as individual slices. In fact, these calls were aggregated into a broader category, in which the most significant participant are the ’NaN’ values, which is not possible for us to attribute to a type of call.
FIGURE3.3: Different call types distribution.
3.1.3 Time frame analysis
A count of the voice calls from the CDRs as a function of the hour of the day, for all of the days available, along with the daily hourly average, was plotted in fig.3.4.
3. ANALYSIS 29
FIGURE3.4: Hourly activity of each day with the average of events represented in blue.
Something that feels off is that, from fig.3.4, there is unusual activity at midnight from almost every day we have available in March. Moreover, there is a gap in the data at mid- day from every day. This leads us to believe that something went wrong in the anonymi- sation process and that the data from these two hours got mixed up. In theory, the data anonymisation process should retrieve the data from the source, anonymise it, and then write it in the target platform. The process was meant to anonymise some of the fields, and the columns with the data type ’DateTime’ were not supposed to be anonymised.
However, the anonymisation process, developed through the Talend tool [55], did not consider ’masking’ over the ’DateTime’ type of fields when obtaining the data from the source. And because of this, the ’DateTime’ type data came out in a text format, which was misinterpreted when placing the data on the target platform, thus promoting the dis- appearance of the midday events and the appearance of these at midnight. Upon further research, there were also duplicated/tripled CDRs. Due to the rush to make the data available as soon as possible, an attempt at a method of having parallel processes to write the data in the database was created. Unfortunately, these threads were consuming the same data, resulting in duplicate/triplicate registers for the same CDR.
After fixing these mistakes, we carried on with the analysis, meaning that only fig.3.4 is plotted using the bugged data. In addition to the fact that there are days in which the plotted line is not complete, for example, the 2nd of March, in which the line ends at 13h, the 9th of March, where there is only data from 14h, the 28th of February, even though not visible, only has activity from 21h-23h. The 1st of March drops at 13h, which stays constant until the very end of the day. As these last two days correspond to a Monday and
30
EXPLORATION OF REPRESENTATION OF THE DIFFERENT WAYS DATA ANALYSIS IN REAL TIME AND IN ITS MATERIALIZATION IN USEFUL BUSINESS INFORMATION. a Tuesday, respectively, it feels very out of the pattern of a regular weekday. Furthermore, there is also a gap of several days between the first two dates mentioned, in which there is no data. The decision was to check the number of voice calls per day as the day of the week to see if these dates are still worth analysing when compared to the rest. The result is in fig.3.5.
FIGURE3.5: Weekly Activity of some days of March.
Observing fig.3.5, we can see that the 28th of February has a deficient activity, which is predictable, as it only had 2 hours of activity. Since all the remaining data corresponds to March, not many conclusions can be taken out of this date and this date only, and therefore we decided to exclude it. Moreover, the lowest points of this graph reflect the lack of activity on the days mentioned previously. The fact that the 29th of March has almost the same Amount of calls as these days, means that somewhere in fig.3.4, the line for this day also either breaks somewhere or starts in the middle of the day. The reason behind these days’ abnormal activity is that the data collection was done by accessing a field that represents a sequence number, which is managed by caching instance. Caching is used to provide quick access to values that are often read from databases, storing them in a temporary storage instance, the cache. The cache is used since a simultaneous request of value in the same sequence could return repeated values. For example, ’instance 1’
and ’instance 2’ both cache 5000 values. ’Instance 1’ caches values from 1 to 5000 and
’instance 2’ caches values from 5001 to 10000. When its range ends, the sequence asks for another numbering range. And this assigns from 10001 to 15000 to ’instance 1’ and so on. When obtaining the data, since there are millions of records to be extracted, we
3. ANALYSIS 31 asked for the minimum ID and the maximum ID of the database. And so far, the data for those particular days have not yet arrived because they may have an ID higher than the maximum obtained so far.
Therefore, the data we will be analysing corresponds to the time frame between the beginning of the 10th of March and the end of the 28th of March. So far, we know the num- ber of events each day has gone through, that the remaining weeks are in sync with each other (fig.3.5: lines corresponding to weeks 2, 3 and 4 overlap), and that weekdays are much busier than weekends. However, we don’t know whether the curves for weekends and weekdays follow the same pattern, regardless of the number of calls made/received that day. The idea of normalising the daily curves came up, and the result is in fig.3.6.
FIGURE3.6: Normalised hourly activity, in which the number between brackets repre- sents the day of the week. Zero for Monday and Six for Sunday.
From fig.3.6, it is clear that we have two standard curves, one for the weekdays and another one for the weekends. Besides these last ones being much slower, weekdays have a vigorous activity within the Portuguese eight daily working hours, including a drop in the activity within the lunch hours. Outside these working hours, which comprehend the period of 9h until 17h or even 18h, the activity drops to almost null. For weekends, however, the peaks observed are at around 11h-12h and 19h, with a more subtle drop between these hours, which is not as abrupt as we saw on weekdays. Not many clients have an intense activity on weekends or any activity considered relevant at all, meaning that the curves with the peaks and lows observed for the weekends are primarily due to a few clients that have this activity in specific. For example, food delivery services usually
32
EXPLORATION OF REPRESENTATION OF THE DIFFERENT WAYS DATA ANALYSIS IN REAL TIME AND IN ITS MATERIALIZATION IN USEFUL BUSINESS INFORMATION. peak a bit before eating hours, and hospitals, as usually, should not be closed on any day of the week.
What is more curious is the orange line that represents the 28th of March. Compared to other weekdays, it seems to have an offset of one hour to the left. This is due to the time change of one hour more that happened on the 27th of March at 01:00. So for the daylight saving time, 01:00 is 02:00, and so, there is no data available for 01:00 as that hour did not exist that day. The transition in the UTC (Coordinated Universal Time) from UTC+0 to UTC+1 shifts the calls from the following days one hour to the left.
3.2 Client Segmentation
Clients are characterised according to their percentage of answered/unanswered calls.
For those with extreme percentages, we will perform a more in-depth analysis of the services they transit through and their LECs. For the remaining clients, we present a general review of their LECs. Finally, we will analyse the Waiting Queue service for one client and the group service patterns of two different clients.
3.2.1 Session Answered
The flux of calls that are answered or not answered is given in fig.3.7. The two curves accompany each other, not only in shape but also in number: besides making the ’M’
letter shape, the number of unanswered calls is surprisingly very similar to the number of answered ones. Even though the curve for unanswered calls is almost all of the time below the answered one, with some more unusual drops observed on the 11th and 24th of March and a more pronounced peak on the 14th of March, these ’anomalies’ are not that significant, as they are within the statistical sample, which leaves us to think as to why the number of unanswered calls is almost equal to the answered ones, as one would think that there would to be many more answered calls than unanswered.
3. ANALYSIS 33
FIGURE3.7: Distribution of answered and not answered calls throughout March for ev- ery day and hour.
If we go by numbers, there is a total of 32017 clients, 6381 ( 20%) of which have more unanswered calls than answered. Of these, 1023 clients, meaning around 16%, have zero answered calls ( 3.2% of the total number of clients). Furthermore, for answered calls, there is a 0.5% amount of the total number of clients that have only answered calls. This