
FACULDADE DE ENGENHARIA DA UNIVERSIDADE DO PORTO

TweeProfiles: detection of spatio-temporal patterns on Twitter

Tiago Daniel Sá Cunha

Mestrado Integrado em Engenharia Informática e Computação

Supervisor: Carlos Soares (PhD)

Co-Supervisor: Eduarda Mendes Rodrigues (PhD)


TweeProfiles: detection of spatio-temporal patterns on Twitter

Tiago Daniel Sá Cunha

Mestrado Integrado em Engenharia Informática e Computação

Approved in oral examination by the committee:

Chair: Luís Filipe Pinto de Almeida Teixeira (PhD)

External Examiner: Nuno Filipe Fonseca Vasconcelos Escudeiro (PhD)

Supervisor: Carlos Manuel Milheiro de Oliveira Pinto Soares (PhD)

Co-Supervisor: Maria Eduarda Silva Mendes Rodrigues (PhD)


This work is financed by the National Funds through the FCT - Fundação para a Ciência e a Tecnologia (Portuguese Foundation for Science and Technology) within the project REACTION - Retrieval, Extraction, and Aggregation Computing Technology for Integrating and Organizing News (UTA-Est/MAI/0006/2009).


Resumo

Online social networks present themselves as valuable sources of information about their users and their respective behaviours and interests. Such information has been the subject of several studies, conducted by researchers in Data Mining and other areas around the world, with the aim of discovering interesting patterns. In addition, there has also been investment in creating platforms for the continuous extraction and visualization of this information.

This dissertation addresses the problem of identifying tweet profiles involving multiple types of information: spatial, temporal, social and content. The established goals are: 1) the development of a Data Mining methodology that identifies tweet profiles combining the considered dimensions in a flexible way, 2) the creation of a visualization tool to represent the results obtained and 3) its application to a case study in the Portuguese twittosphere.

The Data Mining process consists of the computation, normalization and combination of dissimilarity matrices. Each dissimilarity matrix is submitted to a clustering algorithm that performs the knowledge extraction. This dissertation studies several distance functions for different types of data, the available combination and normalization methods and the existing clustering algorithms.

The visualization tool is designed for dynamic and intuitive use, aimed at representing the profiles in an understandable and interactive way. To achieve these goals, several visualization mechanisms were studied, as well as several widgets capable of representing the patterns obtained.

The case study addressed in this dissertation uses the geo-referenced data of TwitterEcho. It is important to stress, however, that the methodology is suitable for handling any geo-referenced messages coming from Twitter.


Abstract

Online social networks present themselves as valuable information sources about their users and their respective behaviours and interests. Many researchers in data mining and other areas have analysed these types of data with the goal of finding interesting patterns. Additionally, significant investment has been made in platforms for the continuous collection and visualization of this information.

This dissertation addresses the problem of identifying tweet profiles by analysing multiple types of data: spatial, temporal, social and content. The goals are: 1) to develop an information extraction approach to combine the different types of information in a flexible way, 2) to develop a visualization tool for displaying the patterns found and 3) to apply this tool to a case study in the Portuguese twittosphere.

The data mining process is composed of the computation, normalization and combination of the dissimilarity matrices. Each dissimilarity matrix is then fed to a clustering algorithm that obtains the desired patterns. This dissertation studies appropriate distance functions for the different types of data, the normalization and combination methods available for different dimensions and the existing clustering algorithms.

The visualization platform is designed for a dynamic and intuitive usage, aimed at revealing the profiles in an understandable and interactive manner. In order to accomplish this, various visualization patterns were studied and widgets were chosen to better represent the information.

The case study uses geo-referenced data from the TwitterEcho platform. However, the method is applicable to any geo-referenced messages extracted from Twitter.


Acknowledgements

This dissertation was made possible by the contributions of various people, who deserve to be recognized.

Firstly, I would like to thank Professor Carlos Soares for his guidance on this project, both at the conceptual and practical levels. His ideas and his supervision enabled the final result of this dissertation.

Secondly, an honourable mention is due to Professor Eduarda Mendes Rodrigues who, in spite of important changes in her professional and personal life, did her best for the project to finish in the best possible way, from resource allocation to project guidance.

Last but not least, I would like to thank my family and friends, because they were there through the good days and the bad. I also dedicate this work to all of you.

Porto, 15th June 2013


Contents

1 Introduction
  1.1 Context
  1.2 Motivation
  1.3 Objectives
  1.4 Project description
  1.5 Document Structure

2 Clustering
  2.1 Clustering
    2.1.1 Partitioning
    2.1.2 Hierarchical
    2.1.3 Density based
    2.1.4 Grid based
    2.1.5 Data Input for Clustering
    2.1.6 Clustering Evaluation
    2.1.7 Discussion
  2.2 Distance Measures
    2.2.1 Spatial Distance Functions
    2.2.2 Temporal Distance Functions
    2.2.3 Social Distance Functions
    2.2.4 Content Distance Functions
    2.2.5 Mixed Distance Functions
  2.3 Related work
  2.4 Alternative Methodologies

3 Spatio-Temporal Visualization
  3.1 Spatio-Temporal Visualization
    3.1.1 Clustering Visualization
    3.1.2 Georeferenced Data Visualization
    3.1.3 Timestamped Data Visualization
  3.2 Twitter Overview
    3.2.1 General Description
    3.2.2 Twitter API
    3.2.3 TwitterEcho platform

4 Clustering on content, time, space and social dimensions
  4.1 Distance Measures
  4.3 Clustering
  4.4 Illustrative example
  4.5 Visualization
  4.6 System Architecture

5 TweeProfiles
  5.1 Data collection and preparation
  5.2 Data understanding and exploratory data analysis
  5.3 Clustering
  5.4 Visualization
  5.5 Results Illustration

6 Conclusions
  6.1 Summary
  6.2 Discussion
  6.3 Future work

A Clustering algorithms pseudo-codes

B Dimensional Combinations Table

List of Figures

2.1 Dissimilarity Matrix [HKP06].

3.1 Clustering Visualization [LPPM07].
3.2 Clustering Visualization [EO10].
3.3 Map with event detection on Twitter [Lee12].
3.4 Modified Google Earth rule visualization tool [CM07].
3.5 Real-time heat maps of positive and negative sentiments expressed via Twitter [Fit12].
3.6 Time graphic for event detection on Twitter [Lee12].
3.7 Timeline [RLW12].
3.8 Tweet example.
3.9 REST API interaction process [Twi12].
3.10 Streaming API interaction process [Twi12].
3.11 TwitterEcho Physical Architecture.

4.1 Spatial coordinates of example tweets mapped in a 2D plot.
4.2 Social graph of the users.
4.3 Clustering of matrix 4.10.
4.4 Clustering of matrix 4.11.
4.5 Clustering of matrix 4.12.
4.6 Spatio-temporal basis representation.
4.7 Spatial representation.
4.8 Temporal representation.
4.9 Details representation.
4.10 System architecture diagram.

5.1 Tweet count distribution over time.
5.2 Tweet distribution in space.
5.3 Number of tweets in the most representative countries (more than 1000 tweets).
5.4 Number of tweets per region.
5.5 Followers graph of all retrieved users.
5.6 Overall distances in the social graph.
5.7 Clusters calculated uniquely from the spatial dimension.
5.8 Clusters calculated uniquely from the temporal dimension.
5.9 Clusters calculated uniquely from the temporal dimension in a timeline representation.
5.10 Clusters calculated uniquely from the content dimension.
5.11 Tooltips demonstrating the content similarity.
5.13 Clusters calculated uniquely from the social dimension.
5.14 Clusters in Portugal: Content 100%.
5.15 Clusters in Portugal: Content 75% + Spatial 25%.
5.16 Clusters in Portugal: Content 75% + Temporal 25%.
5.17 Clusters in Portugal: Content 75% + Social 25%.
5.18 Clusters in Portugal: Content 100%.
5.19 Clusters in Portugal: Content 50% + Spatial 50%.
5.20 Clusters in Portugal: Content 50% + Temporal 50%.
5.21 Clusters in Portugal: Content 50% + Social 50%.
5.22 Clusters in Portugal: Content 25% + Spatial 25% + Temporal 25% + Social 25%.
5.23 Tweets in cluster 1 for dimension weights: Content 25% + Spatial 25% + Temporal 25% + Social 25%.
5.24 Tweets in cluster 3 for dimension weights: Content 25% + Spatial 25% + Temporal 25% + Social 25%.
5.25 Tweets in cluster 4 for dimension weights: Content 25% + Spatial 25% + Temporal 25% + Social 25%.
5.26 Tweets in cluster 5 for dimension weights: Content 25% + Spatial 25% + Temporal 25% + Social 25%.
5.27 Tweets in cluster 10 for dimensions: Content 25% + Spatial 25% + Temporal 25% + Social 25%.


List of Tables

2.1 Comparison of clustering algorithms (adapted from [LMS11]).

4.1 Subset of 5 tweets for the example.
4.2 Spatial dissimilarity matrix (values in kilometres).
4.3 Temporal dissimilarity matrix (values in seconds).
4.4 Content dissimilarity matrix (values on a 1 − cosine scale).
4.5 Social dissimilarity matrix (values as edge count in the followers graph).
4.6 Normalized spatial dissimilarity matrix.
4.7 Normalized temporal dissimilarity matrix.
4.8 Normalized content dissimilarity matrix.
4.9 Normalized social dissimilarity matrix.
4.10 Combined dissimilarity matrix (Spatial=25%, Temporal=25%, Content=25%, Social=25%).
4.11 Combined dissimilarity matrix (Spatial=25%, Content=75%).
4.12 Combined dissimilarity matrix (Temporal=25%, Social=75%).

5.1 Tweet example.
5.2 Table of clustering results for every combination calculated.


Abbreviations

API          Application Programming Interface
BIRCH        Balanced Iterative Reducing and Clustering using Hierarchies
CLARA        Clustering LARge Applications
CLARANS      Clustering LARge Applications based on RANdomized Search
CLIQUE       CLustering In QUEst
COD-CLARANS  Clustering with Obstructed Distance - Clustering LARge Applications based on RANdomized Search
DBSCAN       Density-Based Spatial Clustering of Applications with Noise
DENCLUE      DENsity-based CLUstEring
HDFS         Hadoop Distributed File System
HTML         HyperText Markup Language
HTTP         Hypertext Transfer Protocol
IDF          Inverse Document Frequency
JSON         JavaScript Object Notation
MDS          Multi-Dimensional Scaling
OPTICS       Ordering Points To Identify the Clustering Structure
PAM          Partitioning Around Medoids
REST         Representational State Transfer
STING        STatistical INformation Grid


Chapter 1

Introduction

1.1 Context

A social network is defined in the social sciences as a structure composed of a set of actors and the ties between them [PMMR13]. More recently, the term acquired a new meaning in information science: "a dedicated website or other application which enables users to communicate with each other" [Dic10]. More generally, online social networks provide a variety of social media services. In recent years, social media services have achieved huge importance in social life and also in companies' business strategies, as they "have been regarded as a timely and cost-effective source of spatio-temporal information" [LYCW11]. The massive adoption and the number of platforms that provide social interaction have led to a growth in the data stored within these services. These data have been used by many researchers in order to extract information [RLW12, LMS11, CM07].

Twitter has proven to be a popular data source within social media, due to its large number of active users and the easy access to its public API. As such, it has fuelled a number of studies [BO12, Bru11, Cor12, Gol10, AGHT11]. TwitterEcho [BO12] is the result of one of those projects. It is a research platform that collects tweets and user data from the Portuguese Twittosphere and aims to support R&D and journalistic tools. These tools use Data Mining techniques to generate knowledge. Some of the main functionalities already implemented in this platform are text mining, opinion mining and social network analysis.

1.2 Motivation

Twitter contains a lot of data about people's behaviours and interests. These data can probably be organized into subgroups that represent profiles of tweets and, thus, of their users. Such profiles can be useful for many tasks (marketing, political science, government, product development, etc.). However, given the amount of data, as well as its complex nature (space, time, content and social), these patterns cannot be obtained manually.


Therefore, our motivation lies in the design and development of an automated tool to retrieve and display the mined patterns. Our contribution with this project is the development of a Data Mining process that enables a weighted combination of multiple clusterings across various dimensions.

1.3 Objectives

This dissertation aims to identify tweet profiles by analysing multiple types of data: spatial, temporal, social and content. The goals are: 1) to develop an information extraction approach to combine the different types of information in a flexible way, 2) to develop a visualization tool for displaying the patterns found and 3) to apply this tool to a case study in the Portuguese twittosphere.

The experimental objective of this dissertation is to verify whether the combination of dimensions for clustering yields interesting results, and how these can be applied. Therefore, the system should provide results for different combinations of these dimensions and display all of them, ensuring that each dimension used is presented simultaneously.

While the spatial and temporal dimensions are the basis of this project, the social distances between users and the similarities between tweet contents must also be taken into account in the profiles generated. The visualization platform must use spatio-temporal visualization techniques to best represent the information and, at the same time, provide means to interact with the information displayed.

1.4 Project description

In order to extract the knowledge described in the previous objectives, a data mining process is required. This process is characterized by data retrieval and preparation, dissimilarity calculation, normalization and combination, clustering and, finally, visualization.

The primary data source is TwitterEcho, from which a database of geo-referenced tweets was used. This database provided all the data necessary for clustering according to the spatial, temporal and content dimensions, since each tweet carries latitude and longitude coordinates, a timestamp and the content written by its author.

With all the data required for the data mining process available, dissimilarity matrices were computed for each dimension, using distance measures suitable for each one. To compute the spatial dissimilarity, the haversine function [MK10] was used to assess the distance between two points given their latitude and longitude coordinates. For the temporal distance, a simple timestamp difference was chosen. For the content dissimilarity, the cosine dissimilarity was used, based on a TF-IDF [LPPM07] representation of the posts. Lastly, the social distance between two posts is defined as the geodesic distance, in the social graph defined by the friends and followers relationships, between the users who wrote them.

Data combination was planned so that, in the final platform, the user can choose different weights for each dimension and obtain the results for the given input. In order to reduce the number of combinations, a step of 25% was chosen and all possible combinations were calculated. Each combination yields a different combined dissimilarity matrix to be subjected to the clustering algorithm, as sketched below.
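A minimal sketch of this combination step follows; Python and the variable names are illustrative assumptions, since the thesis does not prescribe an implementation language:

```python
import numpy as np

def combine(matrices, weights):
    """Weighted linear combination of normalized dissimilarity matrices."""
    assert abs(sum(weights) - 1.0) < 1e-9, "weights must sum to 1"
    return sum(w * m for w, m in zip(weights, matrices))

# Illustrative call: Content 50% + Spatial 25% + Temporal 25%, one of the
# combinations produced by the 25% step (each matrix is an n x n numpy
# array, already normalized to a common scale).
# combined = combine([content_d, spatial_d, temporal_d], [0.50, 0.25, 0.25])
```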

The clustering algorithm used was DBSCAN [EKSX96], applied to every combined matrix. Each cluster presents the positions of its tweets at their respective locations, a circle that roughly delimits the selected group, a time interval (the timestamps of its first and last tweets), the most frequent words and a mini social graph of the most important users denoting the social distance.

The visualization tool provides the following components: a map, a timeline, an interactive graph and tooltips. The map contains all points, each represented with the colour of its cluster, and overlaid circles representing the clusters. The timeline presents a non-overlapping view of all clusters in the temporal dimension. Further details about the profiles are presented in a separate section, composed of a table with the most relevant data and an interactive social graph.

The user may zoom in and out on the map and the timeline, and a control bar is provided to show the clusters for a different subset or to adjust the weights of the dimensions.

1.5 Document Structure

This document is organized as follows: Chapter 2 contains the state of the art for the scientific fields related to this project, namely the clustering algorithms and distance measures studied for each dimension in spatio-temporal Data Mining.

In Chapter 3 we explain the TwitterEcho project in more detail, alongside a state of the art study on spatio-temporal visualization techniques to be used in the visualization tool.

In Chapter 4 the concepts and decisions behind the data mining process are explained in detail, namely the distance measures used, the combination process and the visualization patterns.

Chapter 5 presents the experimental data and fully describes the process used to extract knowledge. It details the data preparation methods, the clustering algorithm and the parameters used in this experiment. A few examples of the results obtained in the case study are presented and analysed in order to validate the implementation and to illustrate the type of knowledge that can be extracted.

Finally, in Chapter 6 we discuss the results obtained and the influence of our decisions on all steps of the data mining process. All limitations are presented and future work is discussed.


Chapter 2

Clustering

2.1 Clustering

Data Mining is "the process of discovering interesting patterns from large amounts of data" [HKP06]. [Cur06] claims that "Data Mining is a multi-disciplinary field at the confluence of Statistics, Computer Science, Machine Learning, Artificial Intelligence (AI), Database Technology, and Pattern Recognition". The main tasks of Data Mining are:

• Characterization and discrimination: summary of general characteristics or features.
• Mining of frequent patterns, associations and correlations: finding patterns that occur frequently in the data.
• Classification and regression: obtaining a model that represents the data.
• Clustering analysis: grouping objects into subgroups, where similar objects are in the same subgroup and different objects are in different subgroups.
• Outlier analysis: finding objects that are very different from the majority of other objects.

Clustering is formally defined as "the process of grouping a set of data objects into multiple groups or clusters so that objects within a cluster have high similarity, but are very dissimilar to objects in other clusters" [HKP06]. Similarity assessment is calculated through distance functions, which are described in Section 2.2.

Considering the proposed problem, clustering is the logical choice for extracting patterns from unlabelled data, such as the geo-reference and timestamp within each tweet. Furthermore, clustering provides a grouping of similar objects, which directly addresses the main objectives of this dissertation.

Another consideration is the need for a technique applicable not only to the spatial and temporal dimensions, but also to social and content similarities. In this respect too, clustering presents itself as the best suited technique.


In this section we present the most representative clustering algorithms, organized by their type. There are 4 types of clustering methods applicable to raw data: Partitioning, Hierarchical, Density based and Grid based.

2.1.1 Partitioning

Partitioning algorithms are known for generating mutually exclusive clusters of spherical shape, using distance-based techniques to group objects. They generally use the mean or a medoid to represent cluster centres and have proven effective up to medium-sized data sets [HKP06]. A partitioning algorithm organizes the objects to create partitions according to a particular criterion.

Within this set of partitioning algorithms, the most well known are k-Means and k-medoids [HKP06].

k-Means addresses the NP-hard problem of partitioning objects into clusters [HKP06]. It defines a centroid as the mean value of the points in a cluster and assigns each object to the most similar cluster, comparing the distance of each object to each cluster centroid. It employs an iterative approach to improve the variation of distances within clusters, recalculating the means and re-assigning objects to more similar clusters in each iteration. The within-cluster variation used to calculate the cluster quality in each iteration is presented in equation 2.1.

E = \sum_{i=1}^{K} \sum_{p \in C_i} dist(p, c_i)^2    (2.1)

where p is an object of cluster C_i and c_i is the centroid of C_i. The algorithm stops when the clusters show no change in two consecutive iterations, or when an imposed iteration limit is reached. The pseudo-code of this algorithm is presented in Algorithm 1.

Algorithm 1 k-Means
1: procedure K-MEANS(k : #clusters, D : dataset)
2:   arbitrarily choose k objects from D as the initial cluster centres;
3:   repeat
4:     (re)assign each object to the cluster to which the object is the most similar;
5:     update the cluster means;
6:   until no change in clusters
7:   return set of k clusters;
8: end procedure

Although it is a relatively scalable and efficient solution, the results are sensitive to the initial cluster centres and to outliers. The need to indicate the expected number of clusters in advance is another major disadvantage.


k-medoids tries to solve one of the previous disadvantages: the sensitivity to outliers. It changes the k-Means procedure by taking as cluster centre an actual object (also known as the representative object) instead of the mean value of all points, allowing outliers to have less influence on the cluster shape. The partition is obtained in k-medoids by minimizing the sum of dissimilarities between each object p and the representative object o_i:

E = \sum_{i=1}^{K} \sum_{p \in C_i} dist(p, o_i)    (2.2)

This k-medoids concept was implemented by the PAM (Partitioning Around Medoids) algorithm [KR08]. Firstly, this algorithm selects random objects (or seeds) as representative objects and, in the same manner as k-Means, iterates by switching cluster centres while the quality of the clustering can still be improved [HKP06]. The pseudo-code of this algorithm is presented in Annex A.

Although PAM indeed reduces the impact of outliers on the shape of the clusters, enabling better results, it has a higher complexity and is therefore only indicated for small data sets.

In order to overcome the scalability problem, a new approach was created in the algorithm CLARA (Clustering LARge Applications) [KR08]. It resorts to a sampling technique, clustering only a small set instead of all the data, and applies the PAM algorithm to the sample [HKP06]. It assumes that the sample distribution is the same as that of the set it was retrieved from. However, CLARA's effectiveness depends on the sample chosen. It is therefore a simple solution for clustering large data sets, but far from a perfect one.

CLARANS (Clustering LARge Applications based upon RANdomized Search) [Ng,94] was created based on CLARA in order to improve its scalability and clustering quality [HKP06]. Not only does it sample the data set, but it also includes a random search among the points of the data set to look for a better medoid. The pseudo-code for this algorithm can be found in Annex A.

CLARANS guarantees a local optimum when applied to large data sets. COD-CLARANS (Clustering with Obstructed Distance) [THH01] is a variation of CLARANS which preserves its advantages but was designed for a specific purpose: clustering in the presence of obstacles. The pseudo-code is available in Annex A.

The approach in COD-CLARANS is to verify whether the distance calculation between two points crosses an obstacle area, which is achieved through line intersections with plane functions. If there is an intersection, the distance is set to infinity and, therefore, discarded in clustering. The other distances are computed directly, and CLARANS is used for the clustering.

2.1.2 Hierarchical

"A Hierarchical clustering method works by grouping data objects intro a hierarchy or a "tree" of clusters" [HKP06]. This method can either be agglomerative (if it starts with small clusters and recursively merge them to find a single final cluster) or divisive (all objects are in a single cluster

(28)

Clustering

and iteratively are divided until it has only one object or the objects in each final cluster are very similar).

Usually, the results of hierarchical algorithms are represented by a dendrogram (i.e. a tree diagram), which separates the similarity of objects by levels and represents the connections between clusters by drawing lines from the root to the leaves.

We start by introducing the BIRCH algorithm [TL96]. BIRCH introduces the definition of the Clustering Feature (CF), used to summarize a cluster. It is a 3D vector defined by:

CF = \langle n, LS, SS \rangle    (2.3)

where n is the number of points, LS is the linear sum of points and SS is the square sum of the data points. CF enables the computation of a cluster centroid, radius and diameter for future processing.

This data structure is then used in a CF-tree, whose objective is to represent the cluster hierarchy, using the previous formulae to ensure the tightness of each cluster. Since the CF verifies the additive property, agglomerating two clusters is basically summing each component of the two CFs. This is the key to the space efficiency of BIRCH.

BIRCH builds an initial tree from the data set, where each CF is inserted into the closest leaf. These leaves are then provided to the clustering algorithm in order to group dense clusters into larger ones. The pseudo-code for this algorithm can be found in Annex A.

BIRCH does not perform well for non-spherical clusters since it uses the previous formulae for radius and diameter to organize the clustering.

Chameleon [KHK99] is another agglomerative hierarchical algorithm, which uses dynamic modelling to determine the similarity between two clusters. This technique is based on two concepts of similarity: the relative interconnectivity (RI) and the relative closeness (RC) between clusters.

The Chameleon algorithm builds a k-nearest-neighbour graph where each edge is weighted to measure similarity and vertices are connected if they are among the k most similar objects. The graph is then subjected to a graph partitioning algorithm to generate smaller clusters while minimizing edge cuts. Lastly, an agglomerative hierarchical algorithm merges the sub-clusters to output the final clustering. The pseudo-code for this algorithm is available in Annex A.

Chameleon can adapt itself to the cluster characteristics and therefore discover arbitrarily shaped clusters. It is also applicable to all data types, demanding only a suitable similarity function. This concludes the hierarchical clustering algorithms, although there are many more, including variations of those presented. We now look more closely at the density-based algorithms.

2.1.3 Density based

Density-based clustering algorithms follow the strategy of modelling clusters as "dense regions in the data space, separated by sparse regions" [HKP06]. Therefore, these algorithms are very suitable for finding non-spherical clusters.


DBSCAN (Density-Based Spatial Clustering of Applications with Noise) [EKSX96] finds core objects (i.e., points with dense neighbourhoods) and iteratively connects them to their neighbours if these are in the core object's ε-neighbourhood.

The ε-neighbourhood is defined through a user-defined parameter, the radius ε: a point is in a core object's ε-neighbourhood if it is within the pre-defined radius. Therefore, for two points p and q, we say that p is directly density-reachable from q if it is in the ε-neighbourhood of q.

Another user input is MinPts, which determines whether a point is a core object: if the ε-neighbourhood contains at least MinPts points, then we are in the presence of a core object.

Algorithm 2 takes the two previous concepts into account and iteratively connects core objects to their ε-neighbourhoods until all objects are processed.

Algorithm 2 DBSCAN
1: procedure DBSCAN(MinPts : neighbourhood threshold, D : dataset, ε : radius parameter)
2:   Mark all objects as unvisited;
3:   do {
4:     Randomly select an unvisited object p;
5:     Mark p as visited;
6:     if the ε-neighbourhood of p has at least MinPts objects {
7:       Create a new cluster C and add p to C;
8:       Let N be the set of objects in the ε-neighbourhood of p;
9:       for each point p' in N {
10:        if p' is unvisited {
11:          Mark p' as visited;
12:          if the ε-neighbourhood of p' has at least MinPts points
13:            Add those points to N; }
14:        if p' is not yet a member of any cluster
15:          add p' to C; }
16:      Output C; }
17:    else mark p as noise;
18:  } until no object is unvisited;
19: end procedure

DBSCAN has a complexity of O(n²), but effectively finds non-spherical clusters. It also has the advantage of not needing the expected number of clusters as input.
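For illustration only (the thesis does not state which implementation was used), DBSCAN can be run directly on a precomputed combined dissimilarity matrix, here with scikit-learn; eps and min_samples are placeholder values:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Stand-in for a combined dissimilarity matrix: symmetric, zero diagonal.
rng = np.random.default_rng(0)
d = rng.random((100, 100))
d = (d + d.T) / 2
np.fill_diagonal(d, 0.0)

# metric="precomputed" makes DBSCAN read d as pairwise dissimilarities.
labels = DBSCAN(eps=0.1, min_samples=5, metric="precomputed").fit_predict(d)
# labels[i] == -1 marks object i as noise.
```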

OPTICS [ABKS99] is a variation of DBSCAN whose purpose is to remove the need for user-defined global parameters. Instead, these are updated automatically in each iteration in order to better adapt to the data characteristics. However, OPTICS does not output a clustering, but a cluster ordering, i.e. an ordered sequence of the database items according to the computed density.

OPTICS introduces two concepts: core-distance and reachability-distance. The core-distance is the smallest ε′ such that the ε′-neighbourhood has at least MinPts points. The reachability-distance is the minimum radius that makes two points density-reachable, i.e., such that the distance between the two points is within the radius defined by the core-distance.

OPTICS starts by ordering all objects and computing core-distances and reachability-distances. With these values, it assigns objects to their neighbours, as DBSCAN does. However, at each iteration these values are automatically updated, and the radius parameter can be different for each core object. The pseudo-code is available in Annex A.

DENCLUE (DENsity-based CLUstEring) [HK98] is a clustering algorithm based on a set of density distribution functions. It makes use of a non-parametric density estimation approach called kernel density estimation [HKP06]; the kernel used in DENCLUE is generally a Gaussian kernel. This measure enables the definition of density attractors: points located at the local maxima of the density function. These density attractors are then filtered through a threshold to find the centres of the clusters.

DENCLUE defines a cluster as a set of density attractors with objects assigned to them, including other density attractors to generate the complete cluster. The pseudo-code is available in Annex A.

This algorithm can find arbitrarily shaped clusters and is also robust to noise in the data, since it distributes the noise throughout the density distribution.

2.1.4 Grid based

Grid-based algorithms use a space-driven approach instead of the data-driven approach of the previous algorithms [HKP06]. They partition the space into the cells of a multi-resolution grid data structure. This ensures a fast processing time, independent of the size of the data set, although it is affected by the resolution of the grid.

STING (STatistical INformation Grid) [WYM97] is a grid-based multi-resolution technique that splits the data into rectangular cells at each level. A multi-resolution grid structure is a hierarchical structure composed of levels, where each level is divided into cells. Each higher-level cell is decomposed into smaller cells at the lower level: the lower the level, the higher the resolution. At each level, statistical measures are computed and saved for future query processing.

[HKP06] states that "the distribution of a higher-level cell can be computed based on the majority of distribution types of its corresponding lower-level cells in conjunction with a threshold filtering process". This allows the clustering to be found by querying the hierarchical structure with a top-down query that goes through each level until it reaches the lowest one and returns the cells relevant to the specified query.

The clusters are found through a query approach: a level and grid must be chosen, and each pair must be evaluated to verify different clusters. The pseudo-code is presented in Annex A.

This algorithm presents many advantages: the grid structure is query-independent and enables parallel processing and incremental updates, which allows it to handle more scalable problems. However, cluster quality depends directly on the number of levels in the structure, and it can only produce isothetic clusterings, i.e. cluster boundaries are either vertical or horizontal, never diagonal, which directly affects the shapes of the clusters produced.

CLIQUE [AGGR98] is a simple grid-based method for retrieving density-based clusters in subspaces of data.

Initially, it partitions each dimension into non-overlapping intervals (i.e. cells in the grid) and, resorting to a threshold, classifies each cell as dense or sparse. In the second step, the dense cells are connected and the clusters are created.

CLIQUE is insensitive to the order in which the objects are presented and does not presume a specific distribution of the data. Although it provides good scalability, the clustering quality depends on the grid size, since a lower resolution will introduce error into the final clustering result.

2.1.5 Data Input for Clustering

Clustering algorithms usually perform the dissimilarity calculations internally, with specific metrics. The user is then limited in the data that can be presented when it is non-numerical or defined by pairs of values, as is the case of latitude and longitude coordinates.

When the metric used by the clustering algorithm suits the input data, only the data must be provided to obtain results. However, when the data requires another metric, independent from the algorithm, a dissimilarity matrix must be provided.

A dissimilarity matrix can be represented as an n-by-n table, n being the number of elements in the data set.

\begin{bmatrix}
0 & & & \\
d(2,1) & 0 & & \\
d(3,1) & d(3,2) & 0 & \\
\vdots & \vdots & \vdots & \ddots \\
d(n,1) & d(n,2) & \cdots & 0
\end{bmatrix}

Figure 2.1: Dissimilarity Matrix [HKP06].

This dissimilarity matrix holds the pair-wise distances among all objects according to a specific metric, and this data structure is then provided to the clustering algorithm. Not all algorithms support this feature, either due to internal specifications or simply due to implementation faults.
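A minimal sketch of how such a matrix can be built for an arbitrary distance function (illustrative Python, not the thesis's implementation):

```python
import numpy as np

def dissimilarity_matrix(items, dist):
    """n-by-n matrix of pairwise distances under an arbitrary metric."""
    n = len(items)
    d = np.zeros((n, n))
    for i in range(n):
        for j in range(i):            # symmetric: compute each pair once
            d[i, j] = d[j, i] = dist(items[i], items[j])
    return d

# e.g. the temporal matrix of equation 2.14, from a list of POSIX timestamps:
# temporal_d = dissimilarity_matrix(timestamps, lambda a, b: abs(a - b))
```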

2.1.6 Clustering Evaluation

Clustering evaluation is the assessment of three components: clustering tendency, the number of clusters in a data set and clustering quality [HKP06].


Clustering tendency checks whether the data set has a non-random structure, i.e., whether it does not follow a uniform distribution. This is important because, if the data is uniformly distributed, the results are meaningless even if the clustering completes. In order to compute the cluster tendency, we can use the Hopkins statistic:

H = \frac{\sum_{i=1}^{n} y_i}{\sum_{i=1}^{n} x_i + \sum_{i=1}^{n} y_i}    (2.4)

where x_i = \min_{v \in D}\{dist(p_i, v)\} and y_i = \min_{v \in D, v \neq q_i}\{dist(q_i, v)\}. This function calculates distances between neighbours and assesses whether they are not similar. If the resulting coefficient is near zero, then the data set is not uniformly distributed.
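A sketch of the Hopkins statistic of equation 2.4, assuming numeric data in a numpy array; the sample size m and the use of a k-d tree are implementation choices, not from the thesis:

```python
import numpy as np
from scipy.spatial import cKDTree

def hopkins(data, m=50, seed=0):
    """Hopkins statistic (equation 2.4); values near 0 suggest clustered data."""
    rng = np.random.default_rng(seed)
    tree = cKDTree(data)
    lo, hi = data.min(axis=0), data.max(axis=0)
    probes = rng.uniform(lo, hi, size=(m, data.shape[1]))   # uniform points p_i
    x = tree.query(probes, k=1)[0]                          # x_i: probe -> nearest datum
    sample = data[rng.choice(len(data), m, replace=False)]  # data points q_i
    y = tree.query(sample, k=2)[0][:, 1]                    # y_i: nearest *other* datum
    return y.sum() / (x.sum() + y.sum())
```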

The number of clusters is the second evaluation step, taking into consideration that the number of clusters is not always known a priori. Three methods are explained to help compute this value in a scientific manner.

The first is a simplistic approach: it always sets the number of clusters to \sqrt{n/2}, so that each cluster is expected to contain around \sqrt{2n} points, considering n as the number of objects in the data set.

The second method is called the elbow method. Firstly, the number of clusters is varied systematically and the within-cluster variation is computed for each execution. Then, in the analysis of this function, the turning point of the curve (the "elbow") is chosen as the optimal number of clusters to provide to the final clustering algorithm. The within-cluster variation for a point p in a cluster C_i with centroid c_i is given by:

E = \sum_{i=1}^{k} \sum_{p \in C_i} dist(p, c_i)^2    (2.5)

The final method is cross-validation. It divides the data set into m parts and clusters m − 1 of them; the remaining part is used as a test set. With this test set, the distances from each point to the respective centroid are computed and assembled into a single coefficient. This coefficient can be optimized by testing different numbers of clusters, therefore reaching the optimal solution.
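Returning to the elbow method, a minimal sketch follows; scikit-learn's inertia_ attribute is used as the within-cluster variation of equation 2.5, which is a tooling assumption rather than the thesis's code:

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).random((200, 2))   # placeholder data
# inertia_ is the within-cluster sum of squared distances (equation 2.5);
# plot it against k and pick the turning point of the curve.
variation = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
             for k in range(1, 11)]
```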

The final step in clustering evaluation is to assess the cluster quality. This is subdivided into two different types of methods, depending on whether the ground truth (i.e., the perfect clustering result) is known. If there is no possibility of knowing the ground truth, intrinsic methods should be used; otherwise, extrinsic methods are available.

The extrinsic methods check 4 criteria: cluster homogeneity, cluster completeness, rag bag (when in the presence of heterogeneous objects) and small cluster preservation. A well-known extrinsic method is BCubed precision, which calculates how many objects fit into the same category as the object being tested. Another metric that can be computed, BCubed recall, calculates how many objects of the same category can be found in the same clusters.

In intrinsic methods, cluster compactness and separation are evaluated, mainly resorting to the silhouette coefficient:

s(o) = \frac{b(o) - a(o)}{\max\{a(o), b(o)\}}    (2.6)

where a(o) computes the compactness of a cluster and b(o) the cluster's degree of separation. The coefficient for a clustering is computed by averaging the silhouette of all objects. The closer the silhouette coefficient is to 1, the more the clusters verify both compactness and separation.
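A small sketch of the silhouette coefficient on a precomputed dissimilarity matrix; the toy values and the use of scikit-learn are assumptions (note that DBSCAN's noise label -1 would be treated as a cluster of its own and should be filtered out in practice):

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score

# Toy matrix with two well separated groups: {0, 1} and {2, 3}.
d = np.array([[0.00, 0.10, 0.90, 0.80],
              [0.10, 0.00, 0.85, 0.90],
              [0.90, 0.85, 0.00, 0.05],
              [0.80, 0.90, 0.05, 0.00]])
labels = DBSCAN(eps=0.2, min_samples=2, metric="precomputed").fit_predict(d)
print(silhouette_score(d, labels, metric="precomputed"))  # close to 1 here
```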

[HBV01] defines other evaluation methods for clustering. Cluster validity is assessed by both compactness and separation (intra- and inter-cluster distances). Compactness relates to the distance between elements of the same cluster, which must be minimized; although many metrics can be used, the choice usually falls on the variance of the distances. Separation, on the other hand, refers to the distance between clusters, which must be maximized. Within this measure, there are three different types of distances that can be calculated: single linkage measures the distance between the closest elements of two clusters, complete linkage measures the distance between the farthest members and, finally, comparison of centroids measures the distance between the cluster centroids.

2.1.7 Discussion

[LMS11] took into consideration many of the previously explained algorithms and condensed the information about their characteristics in Table 2.1.

Algorithm   | Input number of clusters | Pair-wise distance computation | Mandatory space-mapping | No outliers detection
k-Means     | Y | N | N | Y
CLARANS     | Y | Y | N | Y
BIRCH       | N | N | Y | Y
CURE        | N | Y | N | N
Chameleon   | N | Y | N | N
DBSCAN      | N | Y | N | N
OPTICS      | N | Y | N | N
STING       | N | N | Y | Y
CLIQUE      | N | N | Y | Y

Table 2.1: Comparison of clustering algorithms (adapted from [LMS11]).

This table, besides being a good summary of each clustering algorithm, also makes it possible to filter our choice of algorithm by its functionalities. We now present the distance measures used for clustering.

2.2 Distance Measures

Clustering algorithms, as we have seen in Section 2.1, need distance functions in order to calculate dissimilarities between objects and to group those objects by similarity. [HKP06] states that "the objective function aims for high intra-cluster similarity and low inter-cluster similarity".


We consider that each tweet is formally defined as t_i, where i is the index identifier in the tweet data collection. The distance functions between two tweets t_i and t_j are hereby defined as dist_X(t_i, t_j), where X is the dimension on which the function maps the values. X can take the values Sp, T, C and So, which relate respectively to the spatial, temporal, content and social dimensions.

2.2.1 Spatial Distance Functions

In this context, the spatial dimension is defined by the latitude and longitude numeric values extracted from tweets. Therefore, similarity functions between numeric values must be explored. According to [HKP06], the 4 most important distances of this type in a Euclidean space are the Euclidean distance, the Manhattan distance, the Minkowski distance and the Mahalanobis distance. [AHMRS03] also defines the Chebyshev distance. Since weighted distances are very useful for assigning different importance to different components, the weighted Euclidean distance is defined as an example, although weighting can be applied to other distance functions. However, for the specific case of latitude and longitude points there is a better suited distance measure, one that considers the earth's shape: the haversine distance.

The Euclidean distance is defined by equation 2.7. For each pair of tweets t_i and t_j, the distance function uses the latitudes \phi_{t_i} and \phi_{t_j} and the longitudes \lambda_{t_i} and \lambda_{t_j} to determine the distance.

dist_{Sp}(t_i, t_j) = \sqrt{(\phi_{t_i} - \phi_{t_j})^2 + (\lambda_{t_i} - \lambda_{t_j})^2}    (2.7)

While the Euclidean distance is known as the straight-line distance, the Manhattan distance invokes the city block distance paradigm, which defines the distance between two points as the sum of the horizontal and vertical distances for each pair of coordinates. The Manhattan distance is shown in equation 2.8.

dist_{Sp}(t_i, t_j) = |\phi_{t_i} - \phi_{t_j}| + |\lambda_{t_i} - \lambda_{t_j}|    (2.8)

The Minkowski distance is a generalization of both the Euclidean and the Manhattan distances and is presented in equation 2.9.

dist_{Sp}(t_i, t_j) = \left( |\phi_{t_i} - \phi_{t_j}|^h + |\lambda_{t_i} - \lambda_{t_j}|^h \right)^{1/h}    (2.9)

It introduces a real number h, where h ≥ 1. When h = 1 we obtain the Manhattan distance and when h = 2 the Euclidean distance. For h → ∞, we obtain the Chebyshev distance (also known as the supremum distance). Equation 2.10 demonstrates the Chebyshev formula.

dist_{Sp}(t_i, t_j) = \lim_{h \to \infty} \left( |\phi_{t_i} - \phi_{t_j}|^h + |\lambda_{t_i} - \lambda_{t_j}|^h \right)^{1/h} = \max(|\phi_{t_i} - \phi_{t_j}|, |\lambda_{t_i} - \lambda_{t_j}|)    (2.10)


When attributes have different importance, a weighting system can be applied. The weighted Euclidean distance is defined in equation 2.11.

dist_{Sp}(t_i, t_j) = \sqrt{w_1(\phi_{t_i} - \phi_{t_j})^2 + w_2(\lambda_{t_i} - \lambda_{t_j})^2}    (2.11)

The Mahalanobis distance, although not as popular as the previously defined distances, has also long been used in clustering techniques. This distance includes the covariance matrix V of the distribution of the objects, which determines whether each two objects vary together or not. It is, however, calculated in an algebraic manner, and each pair of tweets t_i and t_j is now defined by the vectors x_i = [\phi_{t_i}, \lambda_{t_i}] and x_j = [\phi_{t_j}, \lambda_{t_j}].

dist_{Sp}(t_i, t_j) = \sqrt{(x_i - x_j) V^{-1} (x_i - x_j)^T}    (2.12)

The haversine distance [MK10], also known as the great circle distance, takes into consideration the earth's shape to define the space on which the points are mapped. The haversine distance is presented in equation 2.13.

dist_{Sp}(t_i, t_j) = 2R \sin^{-1}\left(\left[\sin^2\left(\frac{\phi_{t_i} - \phi_{t_j}}{2}\right) + \cos\phi_{t_i}\,\cos\phi_{t_j}\,\sin^2\left(\frac{\lambda_{t_i} - \lambda_{t_j}}{2}\right)\right]^{0.5}\right)    (2.13)

where R is the earth's radius; the computed distance is returned in the same unit as the provided R. In our case, we used kilometres.
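A direct transcription of equation 2.13 into Python (the coordinates in the usage comment are approximate and only illustrative):

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean earth radius, so distances come out in km

def haversine(lat1, lon1, lat2, lon2):
    """Great circle distance between two (latitude, longitude) points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))

# haversine(41.15, -8.61, 38.72, -9.14)  # Porto -> Lisbon, roughly 274 km
```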

2.2.2 Temporal Distance Functions

As far as the temporal dimension goes, contrary to the previous distances, which are mapped in R², time is represented in R, which facilitates the difference calculation. For each pair of tweets t_i and t_j, the timestamp values \Delta_i and \Delta_j are used to compute the distance. The time interval can be defined by equation 2.14.

dist_T(t_i, t_j) = |\Delta_i - \Delta_j|    (2.14)

However, any of the previous distance functions for Euclidean space is also applicable to the temporal dimension.

2.2.3 Social Distance Functions

Considering the connections between users stored in TwitterEcho, it is possible to assume the implicit existence of a social graph. The social distance is therefore reduced to a distance between nodes in a graph; in fact, almost all distance measures presented in this section are based on a graph. [HKP06] defines two distance measures for graphs: the geodesic distance and SimRank. Network similarity [ACF11] and a pseudo-logarithmic distance measure [Dek05] are also presented.


The geodesic distance is the shortest path between two vertices. This calculation relies simply on the minimum number of edges between the two vertices. To obtain this distance, one must apply a shortest path algorithm to the graph, for instance Dijkstra's [Dij59].
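Since the followers graph is unweighted, a plain breadth-first search already yields the geodesic distance (Dijkstra's algorithm is only needed when edges carry weights). A sketch, where the adjacency-dictionary representation is an assumption:

```python
from collections import deque

def geodesic(graph, u, v):
    """Edge count of the shortest path between users u and v.
    graph: dict mapping each user to an iterable of connected users."""
    if u == v:
        return 0
    seen, frontier, hops = {u}, deque([u]), 0
    while frontier:
        hops += 1
        for _ in range(len(frontier)):          # expand one BFS level
            for w in graph.get(frontier.popleft(), ()):
                if w == v:
                    return hops
                if w not in seen:
                    seen.add(w)
                    frontier.append(w)
    return float("inf")                         # disconnected users
```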

SimRank stands for Similarity based on Random Walk and Structural Context. In this distance measure, two vertices are similar if they are connected to common vertices. In order to calculate the similarity, we need to introduce the concept of the individual in-neighbourhood. Considering a directed graph G = (V, E), where V defines a set of vertices and E ⊆ V × V the set of edges, the individual in-neighbourhood of a vertex v is defined as in equation 2.15.

I(v) = {u|(u, v) ∈ E} (2.15)

The SimRank distance is defined, for any two vertices u and v related to two tweets t_i and t_j within the graph G, as shown in equation 2.16.

dist_{So}(u, v) = \begin{cases} 0, & \text{if } I(u) = \emptyset \lor I(v) = \emptyset \\ \frac{C}{|I(u)||I(v)|} \sum_{x \in I(u)} \sum_{y \in I(v)} s(x, y), & \text{if } I(u) \neq \emptyset \land I(v) \neq \emptyset \end{cases}    (2.16)

where C is a constant between 0 and 1.

[ACF11] defines a social distance function denominated network similarity. It takes into account two types of graphs: the mutual friends graph and the friendship graph.

In the mutual friends graph (MFG), the function takes into account the number of mutual friends and also the number of edges between these mutual friends. The larger the number of edges and mutual friends, the closer the distance between the two users.

On the other hand, the friendship graph (FG) is used to calculate the number of edges between the original users. The combined distance function takes both concepts into account and is defined in equation 2.17.

For each two tweets t_i and t_j we consider their users as u and v, with E referring to the edge set of each graph, MFG and FG.

dist_{So}(u, v) = \frac{\log(MFG(u, v).E)}{\log(2\,|FG(u).E|)}    (2.17)

[Dek05] defines a pseudo-logarithmic social distance based on a discrete variable approach. A value on a given scale is assigned depending on the number of communications during a predefined time span. For instance, a distance of 0 is assigned if no communications took place in the last month and 1 if communications occur every day. The distances used by [Dek05] were:

• 1.0 Communications every day.
• 0.8 Two or more times per week.
• 0.6 Once per week.


• 0.2 Once per month.

• 0.0 Less than once per month.

2.2.4 Content Distance Functions

Here, we present functions to calculate the similarity between two texts, which in our problem are the tweet messages. [HKP06] defines the cosine similarity (as the most commonly used) and a variation denominated the Tanimoto distance. Lastly, [RLW12] proposes a variation of the Jaccard similarity complemented with Dice's coefficient.

Before exploring content similarity functions, document representations must be explained, since they enable the use of some of the distance functions presented. For a tweet t_i we define the tweet text as \alpha_i and a term within the tweet as \gamma_{\alpha_i}.

[MRS08] defines IDF (inverse document frequency) as a representation that asserts the frequency of terms across a collection of documents. Considering the size of a text collection N and the document frequency of a term df(\gamma_{\alpha_i}), we can calculate the IDF as shown in equation 2.18.

IDF(\gamma_{\alpha_i}) = \log \frac{N}{df(\gamma_{\alpha_i})}    (2.18)

TF-IDF (term frequency - inverse document frequency) [LPPM07] is an extension of this measure. It includes the concept of term frequency within a well-defined document, as opposed to a document collection. Its purpose is to eliminate the influence of common terms in the similarity computation. Equation 2.19 shows the formula.

TFIDF(\alpha_i, \gamma_{\alpha_i}, D) = TF(\alpha_i, D) \times IDF(\gamma_{\alpha_i})    (2.19)

In order to use the cosine similarity to check whether two texts are similar, they must be converted into term-frequency vectors (also known as document vectors). These numeric vectors represent the number of times each word appears in each text. After this transformation, it is possible to compare two term-frequency vectors \beta_i, \beta_j as seen in equation 2.20.

dist_C(t_i, t_j) = \frac{\beta_i \cdot \beta_j}{\|\beta_i\| \|\beta_j\|}    (2.20)

The output of this similarity function is a value between 0 and 1, representing, respectively, texts that are not related at all and texts that are the same. A variation of this similarity measure is the Tanimoto distance, defined specially for binary-valued attributes within the term-frequency vectors. The Tanimoto distance is represented in equation 2.21.

dist_C(t_i, t_j) = \frac{\beta_i \cdot \beta_j}{\beta_i \cdot \beta_i + \beta_j \cdot \beta_j - \beta_i \cdot \beta_j}    (2.21)

The Jaccard and Dice similarity measure also evaluates whether two text representations are similar, as we can observe in equation 2.22.

dist_C(t_i, t_j) = \frac{|\beta_i \cap \beta_j|}{\min(|\beta_i|, |\beta_j|)}    (2.22)

[RLW12] uses this measure and also a combination of cosine similarity and TF-IDF, applying weights to the term-frequency vectors in different data processing phases, but with the same purpose of identifying similar documents.

[RKT11] proposed variations of both the cosine and Jaccard similarities for application to short text clustering. These variants are proposed due to the sparsity of term-frequency vectors for short texts, such as tweets. The variation of the cosine similarity is presented in equation 2.23.

dist_C(t_i, t_j) = 1 - \frac{\sum_{k=1}^{d} \beta_i^k \beta_j^k}{\|\beta_i\| \|\beta_j\|}    (2.23)

Jaccard's variation is represented in equation 2.24.

dist_C(t_i, t_j) = 1 - \frac{|\beta_i \cap \beta_j|}{|\beta_i \cup \beta_j|}    (2.24)

(2.24)
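A sketch tying equations 2.18-2.20 together: TF-IDF document vectors and the 1 − cosine dissimilarity used later in the thesis (scikit-learn and the toy corpus are assumptions):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

tweets = ["benfica wins the league",
          "benfica league celebrations in lisbon",
          "new phone review"]                       # toy corpus
vectors = TfidfVectorizer().fit_transform(tweets)   # TF-IDF document vectors
dissim = 1.0 - cosine_similarity(vectors)           # 1 - cosine dissimilarity
np.fill_diagonal(dissim, 0.0)                       # clear rounding residue
```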

2.2.5 Mixed Distance Functions

To the best of our knowledge, there is only one proposed distance measure that combines different types of data: the cosine similarity with temporal attenuation. [Lee12] states that content similarity is directly related to the temporal dimension; he expects to show that two texts are more similar if they occur within a short time window rather than a large one. For every pair of tweets t_i and t_j with respective timestamps \Delta_i and \Delta_j we can calculate the distance as seen in equation 2.25.

dist_C(t_i, t_j) = \frac{\beta_i \cdot \beta_j}{\|\beta_i\| \|\beta_j\|} \cdot e^{\zeta\,|\Delta_i - \Delta_j| / W}    (2.25)

This temporal evaluation is made with a temporal penalty which ensures that, if the time interval between two texts is large, the penalty suffered is also large. The parameter ζ makes it possible to adjust the penalty ratio.
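A one-function sketch of equation 2.25; the sign of ζ and the window W below are assumptions, chosen so that the factor decays as the time gap grows, matching the penalty described above:

```python
import numpy as np

def attenuated_similarity(cos_sim, dt_seconds, zeta=-1.0, window=3600.0):
    """Cosine similarity scaled by a temporal factor (equation 2.25).
    With zeta < 0 the factor shrinks as |dt| grows, penalizing pairs of
    tweets that are far apart in time."""
    return cos_sim * np.exp(zeta * abs(dt_seconds) / window)
```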

2.3 Related work

Much research has already been done on social media (including Twitter) and spatio-temporal analysis. [RLW12] created a framework to find and explore memes in social media. The patterns are obtained through phrase ranking and clustering to group similar texts relative to different memes.

Considering Twitter as the social media platform, many studies have been conducted on its data, in various fields of expertise and with different goals.


[Gol10] provides an extensive study of Twitter practices, mainly retweets. Twitter's conversational aspects are analysed in general in order to conclude why and what users retweet.

In the case of [Bru11], the study took a more practical intent, since hashtags and replies were extracted from the Twitter API in order to make a temporal analysis, and conclusions were drawn on how hashtags and replies are used. Social graph structures for both hashtags and replies were created to enable analysis of the time distributions in the data.

[AGHT11] took advantage of Twitter data to create a recommendation system for Twitter users. It takes into consideration the user’s personal interests and the public trends (identified by hashtags) to enable personalization of each user’s relationships.

Another recommendation system was developed by [SOS12]. They focused their study on verifying whether there is a relationship between content similarity and social distance. Hashtags and retweets were filtered to analyse each dimension and find patterns, which are then used to perform friend recommendation within Twitter.

Event detection on Twitter was approached by both [Cor12] and [Lee12], although with different strategies. [Cor12] achieves event detection through hashtag text mining directly on the Streaming API; the results are computed and presented from a temporal perspective. [Lee12], on the other hand, made use of both the temporal and spatial dimensions to monitor events on Twitter. An incremental density-based clustering algorithm was developed to enable the stream clustering of various features (including location and timestamp), and the results are shown using spatial and clustering visual representations.

Lastly, [LMS11] created an approach for mining the popularity of web items, using a time-series approach to find similarities. Clustering is then applied to retrieve the patterns and conclude on the popularity of each item, depending on the cluster it fits into.

2.4 Alternative Methodologies

Clustering, as explained in Section 2.1, is divided into many types. However, all of these algorithms follow a simple methodology: they receive either a dissimilarity matrix or raw data and return the clusters. There are, however, many methodologies adapted from simple clustering to solve problems such as multiple incoming data sources or different types of input process. Some methodologies that may be adapted to this dissertation are stream clustering, multi-view clustering and consensus clustering.

Stream clustering [Mah09] is based on the maxim that an element can only be analysed once, due to the high rate of incoming data. Therefore, it must have the lowest possible memory usage and time complexity. The basic idea is to adapt simple clustering methodologies to a streaming input source. Some famous examples are Divide and Conquer [GMMO00], the Doubling algorithm [CCFM97], STREAM [GMMM03], Grid-based [PL04] and CluStream [AHWY03].

This type of clustering is fit to be used directly on the Streaming API. However, analysing multiple dimensions simultaneously would be a very difficult process to implement. Another problem is that the data must be obtained from a continuous stream, and the data used in this dissertation does not meet this requirement (as is visible in Section 5.1).

A methodology for clustering over multiple data inputs is Multi-View Clustering [BS04]. The main approach of this technique is to train at least two different hypotheses which bootstrap each other by providing labels for unlabelled data. It takes a multi-view approach because hypotheses based on two attributes are trained in at least two distinct views to produce a clustering.
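A minimal sketch of this alternating, bootstrapping behaviour is given below in the form of a two-view k-means, where the partition found in one view initializes the centroid computation in the other. The function and the assumption that both views are numeric matrices with one row per object are ours, not part of [BS04].

    import numpy as np

    def multiview_kmeans(X1, X2, k, iters=20, seed=0):
        # Alternate between the two views: the partition obtained in one
        # view bootstraps the centroid computation in the other.
        rng = np.random.default_rng(seed)
        labels = rng.integers(0, k, size=len(X1))
        for t in range(iters):
            X = (X1, X2)[t % 2]
            centroids = np.array([X[labels == c].mean(axis=0)
                                  if np.any(labels == c)
                                  else X[rng.integers(len(X))]   # reseed empty cluster
                                  for c in range(k)])
            dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
            labels = dists.argmin(axis=1)
        return labels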

Multi-view clustering, although it comes close to our desired goal of analysing different features simultaneously, uses classification of each attribute to obtain labels. This is not possible in our case: we do not know how to label the data, nor is classification our goal. Although we could perhaps classify elements by a delimited space and time, we cannot attach such restrictions to the social ties and the content categories.

Another multiple-input strategy for clustering is known as Consensus Clustering, Clustering ensembles or Clustering aggregation [NC07, Str02]. The objective of this methodology is to combine multiple clusterings without access to the underlying features of the data. There are several traditional approaches: Pairwise Similarity, Graph-based, Mutual Information, Mixture Model and Cluster Correspondence.

In the Pairwise Similarity approach, the final clustering is achieved by computing distances between the same pair of points in the different clusterings and providing these to a final clustering algorithm. In Graph-based approaches, the distances are mapped onto a graph and a graph clustering algorithm is used to obtain the final results. Mutual Information bases its process on defining a target function that iteratively brings each clustering closer to an objective clustering. In the Mixture Model approach, a probabilistic model is constructed based on the various distributions and the solution is obtained through maximum likelihood estimation. Cluster Correspondence is equivalent to optimizing a linear programming formulation, with an objective function whose goal is to find maxima and assign these maximum values to the same clusters.
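To make the Pairwise Similarity approach concrete, the sketch below builds a co-association matrix from several partitions of the same n objects and feeds the resulting distances to hierarchical clustering. The naming and the choice of average linkage are ours, not prescribed by [NC07, Str02].

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster
    from scipy.spatial.distance import squareform

    def consensus(partitions, k):
        # Similarity of a pair = fraction of partitions that place the
        # pair in the same cluster; distance = 1 - similarity.
        n = len(partitions[0])
        co = np.zeros((n, n))
        for labels in partitions:
            labels = np.asarray(labels)
            co += (labels[:, None] == labels[None, :])
        dist = 1.0 - co / len(partitions)
        np.fill_diagonal(dist, 0.0)
        condensed = squareform(dist, checks=False)
        return fcluster(linkage(condensed, method='average'),
                        t=k, criterion='maxclust')

Here, partitions is a list of label vectors (one per base clustering), and the result is a consensus partition with at most k clusters.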

Consensus Clustering presents itself as an ideal approach to solve our problem, considering that each dimension is clustered separately and the results are aggregated in a final stage. One must note, however, that not all approaches can be used in our problem. We cannot define an objective function to guide the final clustering because we do not know a priori the desired clustering. However, both the Pairwise Similarity and Graph-based approaches are directly applicable to our problem.

An alternative methodology for combining different data sources to group by similarity can be found in Multi-Dimensional Scaling (MDS) [You85]. MDS is a "set of data analysis techniques that display the structure of distance-like data as a geometrical picture" [You85]. The main approach of MDS is to compute pairwise distances between all points, store them in a matrix and create groups from the lowest possible distances between points. It is, therefore, very similar to clustering and a fitting hypothesis for our problem.

There are three types of MDS: classical, replicated and weighted. In classical MDS only one dissimilarity matrix is used and no weighting scheme is available. In replicated MDS, the input of several matrices is enabled, although no weighting scheme is applicable; the combination is achieved by a linear transformation and the minimization of least-square errors. Lastly, weighted MDS allows both several matrices and weights. The difference to the previous approach is that it is possible to assign a different weight to each dissimilarity matrix used.
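Common libraries such as scikit-learn expose only classical MDS, but the weighting idea can be approximated by normalizing each dissimilarity matrix and combining them linearly before the embedding. The sketch below assumes symmetric matrices with zero diagonals; D_space and D_time are hypothetical precomputed dissimilarity matrices.

    import numpy as np
    from sklearn.manifold import MDS

    def weighted_embedding(matrices, weights, n_components=2):
        # Normalize each dissimilarity matrix to [0, 1] so the weights
        # express relative importance, then combine them linearly.
        combined = sum(w * (D / D.max()) for D, w in zip(matrices, weights))
        mds = MDS(n_components=n_components, dissimilarity='precomputed',
                  random_state=0)
        return mds.fit_transform(combined)

    # e.g. coords = weighted_embedding([D_space, D_time], [0.7, 0.3])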

A good example of an MDS and clustering approach for data mining was developed by [GAMA10]. A study was conducted to evaluate floral distribution based on spatial and species similarity; MDS was used to reduce the data dimensionality after dividing the area into grids. Lastly, clustering was applied to the MDS results to extract the final patterns.


Chapter 3

Spatio-Temporal Visualization

3.1 Spatio-Temporal Visualization

An important step in the Data Mining process is the interpretation of data and patterns. As [CM07] claims, "Visual data mining refers to methods, approaches and tools for the exploration of large data sets by allowing users to directly interact with visual representations of data and dynamically modify parameters to see how they affect the visualized data". The main properties that visualization tools in Data Mining must satisfy are: displaying the data and its temporal behaviour, showing properties of the entire displayed scene and supporting interaction [Gah09].

[Gah09] states that the main visualization techniques are: map-based, chart-based, projection, space-filling or pixel-based, iconographical or compositional, and hierarchical or network-based.

These data visualization techniques will be explored next, according to practical implementations found in related work.

3.1.1 Clustering Visualization

When referring to clustering, the most usual representation is a graph-like visualization. It presents the objects in each cluster, maintaining the clustering goal of assigning shorter distances to similar objects while keeping greater distances between clusters. [LPPM07] developed a clustering visualization tool, visible in Figure 3.1.


Figure 3.1: Clustering Visualization [LPPM07].

Another clustering visualization, suited to large amounts of data, involves assigning different colors to the objects. Overlapping ellipses are displayed over the most representative objects of each cluster to represent similar objects. [EO10] applied it to the study of geographical lexical variation, to better assist in mapping the results.
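A color-per-cluster display of this kind is simple to reproduce; the following is a minimal matplotlib sketch, assuming hypothetical 2D points and integer cluster labels.

    import matplotlib.pyplot as plt

    def plot_clusters(points, labels):
        # 2D points colored by cluster label; matplotlib picks the palette.
        plt.scatter(points[:, 0], points[:, 1], c=labels, cmap='tab10', s=10)
        plt.colorbar(label='cluster')
        plt.show()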


3.1.2 Georeferenced Data Visualization

Georeferenced data visualization typically involves plotting the information on top of a geographic representation, the most common being the map.

The first visualization type discussed is the 2D map, currently very popular due to Google Maps1. Using its API, a vast number of applications have been implemented, owing to its simplicity and visual appeal. [Lee12] used this tool to overlay the map with representative points of its solution, as well as geometric figures to highlight the obtained results. [LYCW11] used it to detect events on Twitter.

Figure 3.3: Map with event detection on Twitter [Lee12].

Another Google creation related to mapping is Google Earth2. It enables a 3D visualization of the Earth and, like Google Maps, it possesses an easy interface and allows an intuitive information representation. [CM07] used it to represent the association rules extracted from a data set of Hurricane Isabel.

[CM07] also included a timeline (which will be presented in the next section), since it uses a spatio-temporal approach to retrieve knowledge.

1https://developers.google.com/maps/?hl=pt-pt


Figure 3.4: Modified Google Earth rule visualization tool [CM07].

Although only Google's visualization tools have been detailed, there are many competitors in this market niche that also provide map APIs.
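As an illustration of how little code such an overlay requires, the sketch below uses folium, a Python wrapper for the Leaflet map library (one of those competitors), with hypothetical tweet coordinates.

    import folium

    # Hypothetical (latitude, longitude) pairs of georeferenced tweets.
    tweets = [(41.15, -8.61), (38.72, -9.14), (40.64, -8.65)]

    m = folium.Map(location=[39.5, -8.0], zoom_start=7)   # centered on Portugal
    for lat, lon in tweets:
        folium.CircleMarker([lat, lon], radius=4).add_to(m)
    m.save('tweets_map.html')                             # open in a browser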

Silicon Graphics International, in partnership with the University of Illinois, created a real-time visualization tool for sentiment mining on Twitter [Fit12]. The representation adopted a heat map approach, in which each color represents a different value for the majority of positive or negative comments.


3.1.3 Timestamped Data Visualization

Timestamped data invokes a linear organization of events, and therefore the most intuitive representation is a chart, with the temporal dimension on one axis and the values under analysis on the other. [Lee12] uses such a chart to plot the probability of a keyword belonging to a location over time, as visible in Figure 3.6.

Figure 3.6: Time Graphic for event detection on Twitter [Lee12].
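A chart of this kind takes only a few lines of matplotlib; the sketch below uses hypothetical hourly counts of a keyword.

    import matplotlib.pyplot as plt
    from datetime import datetime, timedelta

    start = datetime(2013, 1, 1)
    hours = [start + timedelta(hours=h) for h in range(24)]
    counts = [3, 2, 1, 0, 0, 1, 4, 9, 15, 18, 14, 10,    # hypothetical values
              12, 11, 9, 8, 10, 13, 17, 16, 12, 8, 5, 4]

    plt.plot(hours, counts)
    plt.xlabel('time')
    plt.ylabel('tweets mentioning the keyword')
    plt.gcf().autofmt_xdate()    # tilt the timestamps for readability
    plt.show()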

Although many charting tools provide interaction, nowadays the concept of a timeline has emerged as a very usable tool to navigate data through time. [RLW12] incorporated a timeline as a filter for the information gathered, to simplify access to the most important data collected.


3.2 Twitter Overview

This section provides a description of the Twitter1 social media service and its API, followed by an introduction to the TwitterEcho platform.

3.2.1 General Description

Twitter is a microblogging service that enables users to publish short messages (also known as "tweets") with a maximum size of 140 characters. Figure 3.8 shows a tweet displayed through the Twitter web page.

Figure 3.8: Tweet example.

Within Twitter, a social relationship is defined by whether a user is following or being followed by other users. Each user is responsible for adding users to their own following list. Security and privacy settings enable any user to prevent other users from following him/her.

Each tweet has well-defined items in its structure, although most are not mandatory. Each item has the purpose of enhancing social interaction or completing the information related to the message in question. Below we present these functionalities, followed by a small parsing sketch:

• Retweet (RT): share another user's tweet [Twi13a].

• Mention (@ + user name): identify a user in a tweet [Twi13e].
• Reply (@ + user name): answer a previous user's tweet [Twi13e].
• Hashtag (# + topic name): association of a keyword to a tweet [Twi13d].
• Localization: the user's geo-coordinates when sending a tweet [Twi13b].
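These items can also be recovered from raw tweet text with simple patterns. The sketch below uses deliberately simplified regular expressions, not Twitter's official entity extraction rules.

    import re

    def extract_entities(text):
        # Simplified patterns: real Twitter entity parsing is stricter.
        return {
            'retweet': bool(re.match(r'^RT\s+@\w+', text)),
            'mentions': re.findall(r'@(\w+)', text),
            'hashtags': re.findall(r'#(\w+)', text),
        }

    extract_entities('RT @user: loving #Porto today!')
    # -> {'retweet': True, 'mentions': ['user'], 'hashtags': ['Porto']}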

3.2.2 Twitter API

Twitter provides two APIs to access its information, namely the Streaming API and the REST API [Twi13c]. The REST API requires OAuth authentication and is request-based; the Streaming API, on the other hand, requires OAuth or HTTP basic authentication and provides information through events.

The Streaming API provides real-time data (where each tweet is flagged as an event), although the data available is limited to what arrives after the session's beginning timestamp. In contrast, the REST API allows access to information from both the past and the present. The only limits are the availability of Twitter data and the method and application rate limits.

The Twitter REST API enables access to a user's information, timeline, friends & followers, direct messages, general search, streaming, Places & Geo and trends, although limits are imposed on the number of requests allowed. For the REST API, a request window of 15 minutes is defined, during which each user is allowed either 15 or 180 requests per method invoked. Additionally, each application invoking this API has a general limit of 120 requests per hour. Figure 3.9 shows the REST API process for connection and data retrieval.

Figure 3.9: REST API interaction process [Twi12].
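As an illustration of request-based access under these limits, the sketch below performs a single call to the search method and inspects the rate-limit headers returned by Twitter. It relies on the requests and requests_oauthlib libraries; the credentials are placeholders obtained by registering an application, and the endpoint is the one documented at the time of writing.

    import requests
    from requests_oauthlib import OAuth1

    # Placeholder credentials obtained by registering an application.
    auth = OAuth1('CONSUMER_KEY', 'CONSUMER_SECRET',
                  'ACCESS_TOKEN', 'ACCESS_SECRET')

    r = requests.get('https://api.twitter.com/1.1/search/tweets.json',
                     params={'q': '#porto', 'count': 100}, auth=auth)
    tweets = r.json().get('statuses', [])
    # Requests remaining in the current 15-minute window:
    print(r.headers.get('x-rate-limit-remaining'))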

In the Streaming API there is not a request policy but a connection policy instead; limits are imposed on the volume of data transmitted per client per second. Public access does not allow receiving more than 50 tweets per second or 4 320 000 tweets per day. The data is separated into three streams: Public, User and Site streams. The Public stream provides the public data available on Twitter. In the case of User and Site streams, a filter is applied to a list of users and only tweets from these users are provided. The general process for connecting and obtaining data from the Streaming API is presented in Figure 3.10.


Figure 3.10: Streaming API interaction process [Twi12].
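Connection-based access looks quite different: a single long-lived request delivers one JSON object per line. Below is a minimal sketch of consuming the filter stream, with the same placeholder credentials.

    import json
    import requests
    from requests_oauthlib import OAuth1

    auth = OAuth1('CONSUMER_KEY', 'CONSUMER_SECRET',
                  'ACCESS_TOKEN', 'ACCESS_SECRET')

    r = requests.post('https://stream.twitter.com/1.1/statuses/filter.json',
                      data={'track': 'porto'}, auth=auth, stream=True)
    for line in r.iter_lines():
        if line:                          # keep-alive newlines are empty
            tweet = json.loads(line)
            print(tweet.get('text'))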

3.2.3 TwitterEcho platform

The TwitterEcho project [BO12] is a research platform for extracting, storing and analysing the Portuguese Twittosphere for R&D and journalistic purposes. Its current architecture is presented in Figure 3.11.

TwitterEcho collects data using the Twitter API. The platform accesses the Twitter Streaming API to obtain real-time tweets through the crawler clients. These tweets are sent to a message broker (i.e., a data format translator program) and processed in two components: stream processing and pre-processing. The resulting data is stored in both Apache Solr1 and MongoDB2.

In order to ease access to the information in a simple and effective manner, message and user indexes were created using Apache Solr. This allows a parallel exploration of the tweets, by text-searching tweets or users in Solr and retrieving all their information in Hadoop3.

After the information is stored in Hadoop, it is subjected to batch processing in order to mine different kinds of knowledge. This knowledge is made available through analysis modules, which include text mining, opinion mining, sentiment analysis and social network analysis.

1http://lucene.apache.org/solr/

2http://www.mongodb.org/


Figure 3.11: TwitterEcho Physical Architecture.

Among TwitterEcho's databases, we would like to highlight the GeoDB, which will be the main data source available to solve this dissertation's problem.

Although the complete system is presented in Figure 3.11, only the most important steps were explained.

