

3.1.3 Time frame analysis

The number of voice calls in the CDRs was counted per hour of the day for every available day and plotted in fig. 3.4, together with the hourly average across all days.
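The hourly count and its average can be sketched as follows. This is a minimal illustration, not the code used for fig. 3.4: the function names and the sample timestamps (including the year) are made up, and each CDR is assumed to carry a call start time.

```python
from collections import defaultdict
from datetime import datetime


def hourly_counts_per_day(timestamps):
    """Return {date: [number of calls in hour 0..23]}."""
    counts = defaultdict(lambda: [0] * 24)
    for ts in timestamps:
        counts[ts.date()][ts.hour] += 1
    return dict(counts)


def hourly_average(per_day):
    """Average each hour's count over all days present in per_day."""
    n_days = len(per_day)
    return [sum(day[h] for day in per_day.values()) / n_days for h in range(24)]


# Made-up call start times, for illustration only.
calls = [
    datetime(2022, 3, 10, 9, 15),
    datetime(2022, 3, 10, 9, 40),
    datetime(2022, 3, 10, 14, 5),
    datetime(2022, 3, 11, 9, 30),
]
per_day = hourly_counts_per_day(calls)
average = hourly_average(per_day)
print(average[9], average[14])  # 1.5 0.5
```

Each list in `per_day` corresponds to one daily line of the plot, and `average` to the line of hourly averages.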

3. ANALYSIS 29

FIGURE 3.4: Hourly activity of each day with the average of events represented in blue.

What stands out in fig. 3.4 is the unusual activity at midnight on almost every day available in March. Moreover, there is a gap in the data at midday on every day. This leads us to believe that something went wrong in the anonymisation process and that the data from these two hours got mixed up. In theory, the data anonymisation process should retrieve the data from the source, anonymise it, and then write it to the target platform. The process was meant to anonymise only some of the fields, and the columns with the data type ’DateTime’ were not supposed to be anonymised.

However, the anonymisation process, developed with the Talend tool [55], did not take the ’DateTime’ fields into account in its ’masking’ step when obtaining the data from the source. As a result, the ’DateTime’ data came out in a text format, which was misinterpreted when the data was written to the target platform, causing the midday events to disappear and to reappear at midnight. Further investigation also revealed duplicated and triplicated CDRs. In the rush to make the data available as soon as possible, an attempt was made to write the data to the database with parallel processes. Unfortunately, these threads were consuming the same data, resulting in duplicate or triplicate records for the same CDR.
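The text does not say which parsing rule misread the timestamps. The sketch below shows one hypothetical rule that would produce the observed symptom: 24-hour text read with 12-hour-clock semantics and no AM/PM marker, so that hour 12 is taken as 12 AM. The function `load_with_12h_semantics` and the sample data are assumptions made for illustration.

```python
from datetime import datetime


def load_with_12h_semantics(text):
    """Hypothetical loader: reads 'YYYY-MM-DD HH:MM:SS' text, but treats the
    hour as a 12-hour-clock value with no AM/PM marker, so 12 becomes 0.
    Other hours pass through unchanged (lenient behaviour, an assumption)."""
    date_part, time_part = text.split(" ")
    year, month, day = (int(p) for p in date_part.split("-"))
    hour, minute, second = (int(p) for p in time_part.split(":"))
    if hour == 12:
        hour = 0  # "12" without AM/PM is read as midnight
    return datetime(year, month, day, hour, minute, second)


# Made-up source timestamps around midday.
source = [
    datetime(2022, 3, 14, 11, 59, 0),
    datetime(2022, 3, 14, 12, 30, 0),
    datetime(2022, 3, 14, 13, 5, 0),
]
# The 'DateTime' values leave the source as plain text ...
as_text = [ts.strftime("%Y-%m-%d %H:%M:%S") for ts in source]
# ... and are misread when written to the target.
loaded = [load_with_12h_semantics(t) for t in as_text]
print([ts.hour for ts in loaded])  # [11, 0, 13]
```

Under this assumed rule, only the 12h bucket empties and the 0h bucket gains its events, which matches the gap at midday and the excess at midnight.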

EXPLORATION OF REPRESENTATION OF THE DIFFERENT WAYS DATA ANALYSIS IN REAL TIME AND IN ITS MATERIALIZATION IN USEFUL BUSINESS INFORMATION.

After fixing these mistakes, we carried on with the analysis, meaning that only fig. 3.4 is plotted using the bugged data. Beyond these issues, the plotted line is incomplete on some days: on the 2nd of March the line ends at 13h; on the 9th of March there is only data from 14h onwards; and the 28th of February, even though not visible, only has activity from 21h to 23h. On the 1st of March, activity drops at 13h and stays constant until the very end of the day. As the 28th of February and the 1st of March correspond to a Monday and a Tuesday, respectively, this is very much out of the pattern of a regular weekday. Furthermore, there is also a gap of several days between the first two dates mentioned (the 2nd and the 9th of March), in which there is no data. We therefore decided to check the number of voice calls per day by day of the week, to see whether these dates are still worth analysing when compared to the rest. The result is in fig. 3.5.
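A minimal sketch of this per-day check is given below, assuming each call is reduced to its calendar date. Grouping by ISO week number is an illustrative choice, not necessarily how the weeks of fig. 3.5 were defined, and the dates are made up.

```python
from collections import Counter
from datetime import date


def by_week_and_weekday(call_dates):
    """Return {iso_week: {weekday: calls}}, with weekday 0 = Monday."""
    daily = Counter(call_dates)
    table = {}
    for day, n_calls in daily.items():
        week = day.isocalendar()[1]
        table.setdefault(week, {})[day.weekday()] = n_calls
    return table


# Made-up call dates: two Mondays and one Saturday.
call_dates = (
    [date(2022, 3, 14)] * 3
    + [date(2022, 3, 19)] * 1
    + [date(2022, 3, 21)] * 2
)
print(by_week_and_weekday(call_dates))
# {11: {0: 3, 5: 1}, 12: {0: 2}}
```

Each inner dictionary is one weekly line; a day with unusually few calls shows up as a low point relative to the same weekday in other weeks.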

FIGURE 3.5: Weekly activity of some days of March.

Observing fig. 3.5, we can see that the 28th of February has very low activity, which is expected, as it only had 2 hours of activity. Since all the remaining data corresponds to March, few conclusions can be drawn from this date alone, and therefore we decided to exclude it. Moreover, the lowest points of this graph reflect the lack of activity on the days mentioned previously. The fact that the 29th of March has almost the same number of calls as these days means that, somewhere in fig. 3.4, the line for this day also either breaks off or starts in the middle of the day.

The reason behind these days’ abnormal activity is that the data collection was done by accessing a field that represents a sequence number, which is managed by caching instances. Caching is used to provide quick access to values that are often read from databases, storing them in a temporary storage instance, the cache. The cache is used since simultaneous requests for a value in the same sequence could otherwise return repeated values. For example, ’instance 1’ and ’instance 2’ both cache 5000 values: ’instance 1’ caches values from 1 to 5000 and ’instance 2’ caches values from 5001 to 10000. When an instance exhausts its range, another numbering range is requested from the sequence, which assigns 10001 to 15000 to ’instance 1’, and so on. When obtaining the data, since there are millions of records to be extracted, we asked for the minimum ID and the maximum ID of the database. So far, the data for those particular days has not yet arrived, because it may have an ID higher than the maximum obtained so far.
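The toy simulation below illustrates why an extraction bounded by a maximum ID can leave a day only partially covered when IDs come from cached ranges. The classes `CachedSequence` and `Instance`, the day labels, and the write order are hypothetical; only the block size of 5000 and the range assignments come from the example above.

```python
class CachedSequence:
    """Hands out consecutive blocks of IDs (hypothetical model)."""

    def __init__(self, block_size):
        self.block_size = block_size
        self._next_start = 1

    def reserve_block(self):
        start = self._next_start
        self._next_start += self.block_size
        return iter(range(start, start + self.block_size))


class Instance:
    """A writer that caches one block of IDs at a time."""

    def __init__(self, sequence):
        self.sequence = sequence
        self._cached = sequence.reserve_block()

    def next_id(self):
        try:
            return next(self._cached)
        except StopIteration:
            # Range exhausted: ask the sequence for a new block.
            self._cached = self.sequence.reserve_block()
            return next(self._cached)


seq = CachedSequence(block_size=5000)
inst1, inst2 = Instance(seq), Instance(seq)

# Records as (id, day), in the order they were written.
writes = [("day A", inst1), ("day A", inst2), ("day A", inst1),
          ("day A", inst2), ("day B", inst1), ("day B", inst2)]
records = [(writer.next_id(), day) for day, writer in writes]

# The maximum ID is read after the first four writes (end of day A).
max_id = max(record_id for record_id, _ in records[:4])  # 5002
extracted = [r for r in records if r[0] <= max_id]
missing = [r for r in records if r[0] > max_id]
print(missing)  # [(5003, 'day B')]
```

Because IDs do not follow chronological order across instances, day B is extracted only in part: the record with ID 3 falls inside the bounds, while the one with ID 5003 does not.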

Therefore, the data we will be analysing corresponds to the time frame between the beginning of the 10th of March and the end of the 28th of March. So far, we know the number of events recorded each day, that the remaining weeks are in sync with each other (fig. 3.5: the lines corresponding to weeks 2, 3 and 4 overlap), and that weekdays are much busier than weekends. However, we do not know whether the curves for weekends and weekdays follow the same pattern, regardless of the number of calls made or received that day. We therefore normalised the daily curves, and the result is in fig. 3.6.

FIGURE 3.6: Normalised hourly activity, in which the number between brackets represents the day of the week: zero for Monday and six for Sunday.
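The normalisation method is not specified in the text. The sketch below assumes each daily curve is divided by that day's total number of calls, so that curves with very different volumes become comparable in shape. The counts and dates are made up.

```python
from datetime import date


def normalise(hourly_counts):
    """Scale a day's 24 hourly counts so they sum to 1 (assumed method)."""
    total = sum(hourly_counts)
    if total == 0:
        return [0.0] * len(hourly_counts)
    return [c / total for c in hourly_counts]


def label(day):
    """Label in the style of the figure legend: date plus weekday in brackets."""
    return f"{day.isoformat()} ({day.weekday()})"


# Made-up curves: a busy weekday and a quiet weekend day.
weekday_counts = [0] * 24
weekday_counts[9], weekday_counts[14] = 300, 100
weekend_counts = [0] * 24
weekend_counts[11], weekend_counts[19] = 30, 10

curves = {
    label(date(2022, 3, 14)): normalise(weekday_counts),
    label(date(2022, 3, 19)): normalise(weekend_counts),
}
for name, curve in curves.items():
    print(name, max(curve))
# 2022-03-14 (0) 0.75
# 2022-03-19 (5) 0.75
```

After normalisation the two days have the same peak height despite a tenfold difference in volume, so only the position of the peaks distinguishes them.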

From fig. 3.6, it is clear that we have two standard curves, one for the weekdays and another one for the weekends. Besides weekends being much quieter, weekdays show vigorous activity within the eight daily working hours in Portugal, including a drop in activity during lunch hours. Outside these working hours, which span the period from 9h until 17h or even 18h, the activity drops to almost nothing. For weekends, however, the peaks observed are at around 11h-12h and 19h, with a more subtle drop between these hours, not as abrupt as the one seen on weekdays. Not many clients have intense activity on weekends, or any activity considered relevant at all, meaning that the peaks and lows observed in the weekend curves are primarily due to the few clients that are active specifically at these times. For example, food delivery services usually peak a bit before mealtimes, and hospitals are usually not closed on any day of the week.

More curious is the orange line that represents the 28th of March. Compared to other weekdays, it seems to have an offset of one hour to the left. This is due to the clocks moving forward by one hour on the 27th of March at 01:00. With daylight saving time, 01:00 becomes 02:00, so there is no data available for 01:00, as that hour did not exist on that day. The transition from UTC+0 to UTC+1 (UTC being Coordinated Universal Time) shifts the calls of the following days one hour to the left.
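The one-hour shift can be reproduced with fixed UTC offsets, as in the sketch below. It assumes that the hour axis of the plot is in UTC while clients keep calling at the same local wall-clock time. The year 2022 is also an assumption, chosen because the 27th of March falls on the last Sunday of March that year.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets for local time before and after the clock change.
BEFORE_CHANGE = timezone(timedelta(hours=0), "UTC+0")
AFTER_CHANGE = timezone(timedelta(hours=1), "UTC+1")


def utc_hour(day, local_hour, offset):
    """UTC hour of a call made at local_hour on the given day of March."""
    local = datetime(2022, 3, day, local_hour, tzinfo=offset)  # year assumed
    return local.astimezone(timezone.utc).hour


# The same local habit (a call at 09:00) before and after the change.
hour_on_25th = utc_hour(25, 9, BEFORE_CHANGE)
hour_on_28th = utc_hour(28, 9, AFTER_CHANGE)
print(hour_on_25th, hour_on_28th)  # 9 8
```

A call made at 09:00 local time lands in the 9h bucket before the change and in the 8h bucket after it, which is the leftward offset seen for the 28th of March.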
