Association between news and tweets

Academic year: 2021

Association and temporality between news and tweets

by

Vânia Nogueira Moutinho

Master Dissertation in Data Analytics

Supervised by

Professor João Paulo Cordeiro

Professor Pavel Bernard Brazdil


Acknowledgements

Firstly, I would like to thank Professors Pavel Brazdil and João Cordeiro for their support, guidance and wisdom throughout this investigation.

I would also like to acknowledge Pedro Saleiro for his guidance and generosity at the start of this project, André Lima for his kind advice at critical moments, and Natália Silva and Filomena Anselmo for their companionship during the ups and downs of these past three years.

Lastly, I would like to thank my parents for making this journey possible, and Rafael Correia for his unending support and constant belief in my success.


Resumo

With the advent of social media, the boundaries between journalism and social networks are blurring. User-generated content (UGC) is increasing, and journalists dedicate a significant part of their day to announcing, spreading and monitoring news, as well as validating information, on platforms such as Facebook and Twitter. Several studies have tried to understand the role of social networks as news sources. However, the relationship and interconnections between this type of platform and the news media have not yet been studied in detail.

In this investigation, we studied a series of stories published in news articles and their sharing and discussion on a social network over a period of six months. Specifically, a sample of articles from generalist Portuguese news sources published in the first semester of 2016 was clustered using a hybrid algorithm. The resulting groups of stories were then associated with tweets by Portuguese users, using a similarity measure.

For a subset of the clusters obtained, we carried out a temporal analysis of these groups of stories, examining the evolution of the two types of documents (articles and tweets) and the moment of their creation. We concluded that, for some groups of stories, namely Brexit and the European Football Championship, the publication of news articles intensifies on key dates (event-oriented), whereas debate and discussion on social networks are more evenly spread over the months leading up to those events.


Abstract

With the advent of social media, the boundaries between mainstream journalism and social networks are becoming blurred. User generated content is increasing, and journalists dedicate considerable time to searching platforms such as Facebook and Twitter to announce, spread and monitor news and to crowd check information. Many studies have looked at social networks as news sources, but the relationship and interconnections between this type of platform and news media have still not been thoroughly investigated.

In this work, we have studied a series of news stories and their sharing and commenting on a social network during a period of six months. Specifically, a sample of articles from generalist Portuguese news sources published in the first semester of 2016 was subjected to hybrid text clustering. The groups of stories obtained were then associated with tweets of Portuguese users with the use of a similarity measure. Focussing on a set of clusters, we performed a temporal analysis on these groups of stories by examining the evolution of the two types of documents (articles and tweets) and the timing of their generation. We concluded that for some stories, namely Brexit and the European Football Cup, the publishing of news articles intensifies on key dates (event-oriented), while the discussion on social media is more balanced throughout the months leading up to those events.


Contents

Acknowledgements
Resumo
Abstract

1 Introduction
1.1 Motivation and objectives
1.2 Details and contribution
1.3 Organization

2 Related Work
2.1 Twitter
2.1.1 Facts and conventions
2.1.2 User intention
2.1.3 Twitter in Portugal
2.1.4 Twitter and news
2.2 Text mining
2.2.1 Definition and applications
2.2.2 The case of unstructured text data
2.2.3 Document, features and the representation model
2.2.4 Pre-processing
2.2.5 Mining tweets
2.3 Clustering
2.3.1 Introduction
2.3.2 Text clustering

3 Methodology
3.1 Motivation
3.2 Four main stages of the method
3.3 Data
3.4 News clustering
3.4.1 Sample
3.4.2 Pre-processing
3.4.3 Clustering
3.5 Assignment of tweets to clusters
3.5.1 Sample
3.5.2 Pre-processing
3.5.3 Assignment to clusters
3.5.4 Evaluation
3.6 Analysis
3.6.1 Timing of events on the news and on social media

4 Temporal Analysis of News and Tweets
4.1 News clustering
4.1.1 Selection of articles
4.1.2 Preprocessing and representation model
4.1.3 Parametrizing the method
4.1.4 Clustering results
4.2 Assignment of tweets to clusters
4.2.2 Pre-processing of tweets
4.2.3 Assignment to clusters
4.3 Temporal analysis
4.3.1 Evolution of articles and tweets
4.3.2 Time-wise differences

5 Conclusion
5.1 Results
5.2 Limitations and Future Work

Bibliography

Appendices
A Example of input tweet
B Cluster keywords
C Assignment of tweets with at least two features in common with cluster centroids - Evaluation results
D Temporality between news and tweets - considering the complete dataset of tweets and articles

List of Tables

4.1 Number of news articles per press source available
4.2 Frequent expressions on news articles
4.3 News articles pre-processing transformation examples
4.4 Number of elements per cluster and class homogeneity and separation
4.5 Tweets pre-processing transformation examples
4.6 Per-class evaluation of tweets assignment to clusters
4.7 Global evaluation of tweets assignment to clusters
C.1 Global evaluation of tweets assignment to clusters - considering tweets with at least two terms in common with cluster centroids
C.2 Per-class evaluation of tweets assignment to clusters - considering tweets with at least two terms in common with cluster centroids

List of Figures

3.1 Four main stages of the method
3.2 Data setup
4.1 Number of news articles in the dataset per month of 2016
4.2 News articles length - boxplot
4.3 Cluster size and mean length of articles
4.4 Explained inertia for k up to 500
4.5 Aggregation indices for k up to 500
4.6 Keywords per cluster - 4 examples
4.7 Number of available tweets per month of 2016
4.8 Number of tweets and articles per cluster
4.9 Evolution of the number of elements
4.10 Days difference between articles and the median tweet
A.1 Example of tweet in JSON - part 1
A.2 Example of tweet in JSON - part 2
B.1 Top 10 cluster keywords - part 1
B.2 Top 10 cluster keywords - part 2
D.1 Days difference between articles and the median tweet - considering the complete dataset of tweets and articles

Chapter 1

Introduction

This chapter presents the theme of the dissertation, describing the motivation behind it, the main goals to achieve and the proposed methodology, as well as the main contributions.

1.1 Motivation and objectives

News media can have a powerful influence on people's perception of reality, at least in some domains, such as politics (McCombs and Shaw, 1972) and foreign affairs (Wanta et al., 2004). A story reported in mass media reaches thousands of people who generally have no other direct contact with the subject of the story, and it is reasonable to say that news media are an important source of information for most people (McCombs and Shaw, 1972).

However, the advent of social media is gradually changing the way information is disseminated and, moreover, possibly shifting the roles of news makers and news recipients. According to Kaplan and Haenlein (2010), social media is the set of internet-based applications where user generated content (UGC) is created, modified and exchanged in a collaborative and participatory way. Following their categorization, examples of social media include: Facebook and LinkedIn, which can be classified as social networking sites; Wikipedia, a collaborative project; YouTube and Pinterest, as content communities; blogs; and virtual social or game worlds, like Second Life. In this setting, information is no longer the property of an elite set of sources; it is rather the result of a framework where every person with an internet connection can add, transform, update, diffuse, filter and share pieces of content.

The impact on journalism, in particular, is already noticeable. Although we still look to mainstream news to discern truthful from unreliable information, we are also more and more interested in the content posted or shared by friends or other entities we follow on social networks (Newman, 2009). A 2015 study of the impact of social media on the activities of public relations professionals and journalists reports that journalists spend 1.7 hours of their day using social media, with Facebook and Twitter being the first and second leading platforms. In addition to expected activities such as relationship management and responding to comments, these professionals use social media to announce, spread and monitor news and to check information (crowd checking), with 40% of them considering social media a reliable source (DVJ Insights and ING The Netherlands, 2015).

The expectations are that the relationship between news and social media will continue to grow (DVJ Insights and ING The Netherlands, 2015).

Considering these facts, can we still tell where and when a news story begins and who authored it? As the borders between journalism in the traditional sense and social media become blurred, what is the impact of one on the other? Does this evolution mean that the popularity of stories in social media is reflected in, and/or reflects, the attention they receive in the news press? In other words, to what extent are we still relying on news from the press to know what is happening and, conversely, how much is the social media focus impacting what the press reports?

This is a relationship worth analysing at this point in time. In particular, this empirical investigation explores news articles and social media messages from Portugal, a country where the use of social media is still increasing among both the population and enterprises (Lusa, 2016).

The main goal of this dissertation is to study stories that are reported in the news and commented on or diffused throughout social networks. The focus is to identify stories that were published over a six-month period and the social media posts about them. Then, for a selected set, the differences in timing between news and social media are analysed. This analysis includes examining the evolution in the number of articles and tweets about the same stories and identifying which groups of stories show signs of having been first published in the news and then diffused or commented on in a social network, and which reached the height of social media discussion before the height of news article publishing.

1.2 Details and contribution

As described in the first section, the press plays an important role in the daily conversations and opinions of people (McCombs and Shaw, 1972; Wanta et al., 2004). Also, trends on social media have an increasing influence on what journalists report (DVJ Insights and ING The Netherlands, 2015; Newman, 2009). To explore this relationship, we looked at the news media coverage during a certain period of time and identified groups of stories with the use of clustering techniques. We then searched for those stories on a social network using a measure of similarity to those groups of stories. We evaluated the results using 'semi-labelled' social media posts, that is, posts that shared news articles. Focusing on the groups with the best performance, we studied the evolution of the number of articles and posts and the timing differences between them. The differences found were meant to shed some light on the following questions. Stories that are brought to public attention by mainstream journalism and then generate a buzz on social media may indicate a prevalent function of that profession. Stories that begin on social media and are then picked up by the news can indicate either a reinforcement or a turnaround of this mechanism. Moreover, are there stories that only occasionally come about, or is there a continuous debate through time on either or both platforms?

It is not our objective to categorize every story in these terms. Indeed, we recognize these are subjective considerations, and a full understanding of the interconnections between news and social media and the underlying aspects of society requires more thorough research. Notwithstanding, we believe the approach and focus proposed for this investigation will provide initial grounds for it.

Furthermore, the outcome of this research should provide good insight into the main themes discussed, formally and informally, both chronologically and strength-wise, thus allowing a country's narrative to take form.

1.3 Organization

The remainder of this report is organized as follows. Chapter 2 provides the literature review, Chapter 3 describes the methodology employed, and Chapter 4 presents the results obtained from grouping news and tweets and discusses the temporal relationship between them. Chapter 5 concludes.


Chapter 2

Related Work

This chapter provides an overview of the most relevant and related work. It begins by describing the social network Twitter and noting certain aspects researchers have found important when working with tweets. Then, attention is given to text mining and clustering techniques.

2.1 Twitter

2.1.1 Facts and conventions

Twitter is a social network launched in October 2006. By the end of the first semester of 2016, it had more than 313 million monthly active users, of which 82% used the mobile app (Twitter, 2016). By the end of the first semester of 2018, the number of monthly active users was 335 million (Statista, 2018). Users are people or organizations who create an account so they can send or receive messages and follow other users. Those messages are known as tweets. The relationship between users is directional rather than reciprocal: a user chooses to follow another, but this does not mean that the other user will follow him/her back. Moreover, a user decides who he/she wants to follow, but has no power over who follows him/her.

Tweets are messages posted by a user that have the singular property of a maximum length of 140 characters. This means that users must be concise in their writing, and the little thought each post requires makes adherence to this form of blogging higher. Indeed, as Java et al. (2007) state in one of the earliest works on Twitter, this feature classifies it as a microblogging platform that makes communication faster and easier. On the other hand, it also follows that the use of abbreviations is fairly common (Sankaranarayanan et al., 2009), adding a layer of difficulty to text mining tasks.

Following the classification of Kwak et al. (2010), tweets can be singletons, replies, mentions or retweets. A reply or a mention uses the convention '@user' to indicate that someone is being addressed, whereas a retweet is a way of forwarding another user's message and is usually preceded by 'RT'. Kwak et al. (2010) consider retweets a very powerful feature for information filtering and diffusion. A singleton is a tweet with no reply or mentions (Kwak et al., 2010).

There is another interesting feature associated with tweets which, according to Sankaranarayanan et al. (2009), has been successfully utilized in clustering tasks, namely the hashtag. A hashtag is a word or expression that begins with the hash symbol (#), and it is generally used to indicate the topic of the tweet. A query on a particular hashtag returns all the tweets containing it. Considering that hashtags are chosen by individual users, it is surprising to see how few hashtags are associated with a single news issue (Sankaranarayanan et al., 2009).
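The conventions described above can be illustrated with a small heuristic sketch. It is not taken from the cited works: the function names `tweet_kind` and `hashtags` are hypothetical, the rules rely on the text alone (a leading 'RT', a leading '@user', an '@user' elsewhere), and real Twitter data also carries metadata fields that would be more reliable than pattern matching.

```python
import re

MENTION = re.compile(r"@\w+")
HASHTAG = re.compile(r"#\w+")


def tweet_kind(text):
    """Heuristically label a tweet using only its textual conventions."""
    stripped = text.strip()
    if re.match(r"RT\b", stripped):
        return "retweet"      # forwarded message, usually preceded by 'RT'
    if stripped.startswith("@"):
        return "reply"        # addressed to a user at the very start
    if MENTION.search(stripped):
        return "mention"      # '@user' somewhere inside the text
    return "singleton"        # no reply, mention or retweet marker


def hashtags(text):
    """Return the hashtags of a tweet, lower-cased for easier grouping."""
    return [tag.lower() for tag in HASHTAG.findall(text)]


print(tweet_kind("RT @user: Brexit vote today"))
print(hashtags("Great game #Euro2016 #POR"))
```

Lower-casing the hashtags is a design choice of this sketch, so that variants such as '#Euro2016' and '#euro2016' are counted together.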

2.1.2 User intention

Studies have shown that Twitter users are of three kinds: information sharers, information seekers or friends/acquaintances (Java et al., 2007; Krishnamurthy et al., 2008).

41.7 million user profiles collected in July 2009 revealed the average path length to be 4.12, which the authors consider to be relatively short. They conclude that Twitter is not only a social networking platform but also an information seeking facilitator. Furthermore, tweet creation roughly follows Pareto's law, as fewer than 10% of Twitter users produce more than 90% of all tweets (Sankaranarayanan et al., 2009).

2.1.3 Twitter in Portugal

In 2015, 54.8% of Portuguese people used social networks, according to a study by Marktest, cited by Lusa (2016). By 2017 the penetration rate had increased to 59.1% (Marketeer, 2017). The main activities on social media are sending/receiving messages, watching videos, chatting, and reading and sharing news (Lusa, 2016). Twitter has a penetration rate of 23.6% among the population and of 41.9% among enterprises (Lusa, 2016).

To the best of our knowledge, there is one attempt in the literature at empirically characterizing Twitter in Portugal: Brogueira et al. (2016). This work uses geolocated tweets collected from the Twitter streaming API from mid-September 2014 to mid-September 2015. The main findings are the following: the distribution of users per district is very similar to the population distribution; regional differences in Twitter usage throughout the year reflect the usual vacation destinations; tweets containing URLs and retweets represent a very small portion of the geolocated tweets (less than 3%), while mentions and replies add up to almost 35%, indicating these users mostly chat; the top hashtags were related to either football (soccer) or television entertainment shows (Brogueira et al., 2016).

Although these results apply to geolocated tweets only, they are relevant aspects of the Portuguese Twitter community that are taken into account in the empirical part of the dissertation.


2.1.4 Twitter and news

The literature concerning Twitter and news is often directed at using tweets as a single news source. Indeed, some studies regard Twitter as a substitute for (rather than a complementary platform to) traditional news sources (e.g. Sankaranarayanan et al., 2009; Zhao et al., 2011; Phuvipadawat and Murata, 2010). The main reason for this is perhaps the realization that some news breaks first on Twitter. For example, Hu et al. (2012) have shown that the capture and death of Osama Bin Laden was made public on Twitter at least 20 minutes sooner than on major U.S. television channels. The authors argue that this may happen due to the role of a particular set of influential users, namely journalists and politicians, whose credibility provokes an immediate reaction on social networks (Hu et al., 2012).

Sankaranarayanan et al. (2009) built a tool called TwitterStand with the goal of collecting and diffusing breaking news more quickly than conventional news media. This system performs online clustering on filtered tweets from a set of manually selected seeders (users that usually post news). In addition, it performs periodic checks to avoid fragmentation and ensure minimal duplication of clusters, i.e., topics. It also takes advantage of information in the content of the tweet and/or the user's profile to associate topics with geographic locations. The authors believe that if tweets belonging to a certain cluster mostly come from one location or a set of close locations, then the topic of that cluster is likely to pertain to that geographical area (Sankaranarayanan et al., 2009).

Zhao et al. (2011) used a corpus of news articles from the newspaper The New York Times (NYT) and tweets from Edinburgh, gathered from November 11, 2009 to February 1, 2010, to investigate how similar the topics in Twitter and a traditional news source are. Their results showed some differences regarding the most frequent categories and types of topics: Twitter users tweet the most about family and life,

Twitter and the NYT; world is much more frequent in the NYT; lastly, while long-standing topics have an equally strong presence, the same does not happen for entity-oriented and event-oriented topics, with Twitter favouring the former and the NYT the latter (Zhao et al., 2011). Regarding long-lasting topics, there is evidence that their prevalence is not due to an increasing number of users tweeting about them, but to a set of important users who discuss them over time (Kwak et al., 2010).

The above findings bring out relevant aspects of the similarities between Twitter and conventional news sources. In particular, they emphasize the importance of a certain type of user in social networks who fosters their role as a news medium. Still, while the reputation and popularity of users are significant for the level of certainty in the network regarding new information (Hu et al., 2012), it is the communication structure built upon follower/followee relationships that renders Twitter such a fast information diffusion network. In fact, this propagation may in some cases not depend entirely on the first user's network: Kwak et al. (2010) have found that if a message is retweeted, it quickly reaches an average of 1000 users, regardless of the first user's number of followers. This is what the authors call 'the emergence of collective intelligence', in the sense that individuals decide what information is good enough to spread and, once that decision is made, it almost instantly reaches a massive audience (Kwak et al., 2010).

2.2 Text mining

2.2.1 Denition and applications

Text mining is the process of retrieving useful and meaningful information from text. As in data mining, it generally does so by identifying relevant patterns.

In a world where unstructured information is undoubtedly abundant (a survey of data management professionals conducted in 2006 revealed an average of 31% unstructured plus 22% semi-structured data over the entire organization (Russom, 2007)), text mining has grown considerably. It is applied in a variety of fields, such as the analysis of patents, discovery of protein interactions, categorization of news stories, spam filtering and the identification of industry trends for corporate finance, to name a few (Hotho et al., 2005; Feldman and Sanger, 2006).

2.2.2 The case of unstructured text data

Even though the goal of text mining is conceptually similar to that of data mining in general, the unstructured format of the data implies a bigger effort in the pre-processing stage. Unlike data mining, where data is often extracted from databases in which the information is organized in records, the lack of structure in text data means that the data are not prepared to be used by common data mining algorithms.

It should be noted that text data can have some form of structure. Some documents are notably written following a specific format. For instance, a news article usually has the following elements: headline, byline (author and date), lead paragraph, body and conclusion. This type of text data can be classified as weakly structured data. Other documents that are constructed in a format to facilitate transmission (e.g. email, JSON) are considered semi-structured. However, the fact remains that even these types of structures are not adequate to feed into a data mining algorithm as is. The infinite possibilities of news headlines or email recipients (the shortest components in the examples) cannot, per se, allow for comparison and pattern recognition. It is necessary to apply a representation model to transform text data into an input that can be processed by machine learning algorithms.

Additionally, handling textual information requires an understanding of natural language, which is why many text mining techniques borrow knowledge from other fields of study, particularly information retrieval (obtaining documents that answer a certain query), information extraction (extracting specific information from documents) [...] lies in the connection between techniques from these related research areas and the methods and algorithms from data mining (Hotho et al., 2005).

2.2.3 Document, features and the representation model

Feldman and Sanger (2006) define a document as a unit of discrete textual data within a collection that usually, but not necessarily, correlates with some real world document, such as a business report, legal memorandum, e-mail, research paper, manuscript, article, press release, or news story. In text mining, a document collection is also known as a corpus.

Each document must be represented in a structured format. That transformation implies identifying the features that best characterise the document and ease the operations of text mining algorithms. According to Feldman and Sanger (2006), features are usually characters, words, terms or concepts. The difference between words and terms is that the latter can include multi-word expressions such as 'Mother Teresa' and 'European Union'. Concepts, in turn, represent a number of terms with similar or related meaning and may include words not present in the original documents.

The bag-of-words representation of documents uses these features without taking into account the order in which they appear in each document. A document is represented by its set of features, and when each of these features has a value for that document, the document is represented in a feature vector. This is also known as the vector space model.

Therefore, once the features have been produced, each document can be represented by a feature vector, and a corpus by a document-term matrix. The values of the vector or matrix depend on the weight attributed to the feature in the document and/or corpus. This weight may be binary, i.e., equal to one if the feature is present in the document and zero otherwise, or else dependent on the frequency of the term in the document and/or corpus. In this area, the most widely used representation format is the TF-IDF weighting, in which the weight of the term t in the document d is given by

\[ \text{TF-IDF}(t, d) = \text{TermFreq}(t, d) \cdot \log\left(\frac{N}{\text{DocFreq}(t)}\right) \]

where TermFreq(t, d) is the absolute frequency of t in d, N is the number of documents in the collection and DocFreq(t) is the number of documents containing t. Note that this weighting scheme penalizes terms that are too frequent across the corpus through the IDF component (the second factor).

The document-term matrix is hence the structured representation of the corpus of documents, allowing comparisons between documents and enabling pattern search.
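As an illustration of the TF-IDF weighting, the following minimal sketch computes the weights for a toy corpus of already tokenized documents. The corpus and the function name `tf_idf` are invented for the example, the natural logarithm is assumed, and DocFreq(t) is taken to be the number of documents containing t; library implementations often apply additional smoothing or normalization, so their values may differ.

```python
import math
from collections import Counter


def tf_idf(corpus):
    """Return one {term: weight} dict per document, with
    weight = TermFreq(t, d) * log(N / DocFreq(t))."""
    n_docs = len(corpus)
    doc_freq = Counter()
    for doc in corpus:
        doc_freq.update(set(doc))  # count each term once per document
    weights = []
    for doc in corpus:
        term_freq = Counter(doc)
        weights.append(
            {t: tf * math.log(n_docs / doc_freq[t]) for t, tf in term_freq.items()}
        )
    return weights


# Toy corpus (hypothetical): each document is a list of tokens.
corpus = [
    ["brexit", "vote", "uk"],
    ["brexit", "referendum", "uk", "uk"],
    ["football", "euro", "uk"],
]
weights = tf_idf(corpus)
for doc_weights in weights:
    print(doc_weights)
```

In this toy corpus the term 'uk' occurs in every document, so its IDF factor is log(3/3) = 0 and its weight vanishes regardless of how often it appears, which is the penalization effect described above.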

2.2.4 Pre-processing

Textual data has two major issues: high feature dimensionality and feature sparseness.

For any given document collection, the number of possible features is normally very high. Consequently, the data becomes very sparse. This may be costly for two main reasons: (1) the increased space hinders computational performance; and (2) as the number of variables becomes larger (possibly larger than the number of observations), the performance of some algorithms tends to degrade. Thus, it is convenient to keep only those features deemed informative and most representative of the document collection.

Some common pre-processing tasks are:

• Tokenization. A token is a meaningful constituent of a text stream (e.g. a chapter, section, paragraph, sentence, word) (Feldman and Sanger, 2006). After punctuation marks are removed, the goal is usually to split each document into words.


preposition, number and proper noun, among others, enrich the textual data by marking their syntactic value.

• Filtering. Removing words that add no meaning to the document because they appear in virtually every document (e.g. prepositions, conjunctions and articles, also known as stop words, i.e., the most common words in a language) is one way to reduce feature dimensionality; but there are other, more sophisticated feature selection techniques, based on, for example, the document frequency of the word/term or, in the case of classification problems, the information gain of the word/term.

• Stemming. A stem is the basic form of a word: its root, or its root with some suffix or prefix. The most widely used stemmer for the English language is the Porter Stemmer (Porter, 1980).

• Lemmatization. A lemma is the dictionary form of a word (e.g. the lemma of 'are' is 'be').
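Some of the tasks listed above can be sketched in a few lines. This is only an illustration under stated assumptions: the stop-word list is a toy one, and `crude_stem` is a hypothetical suffix-stripping rule that is deliberately much simpler than the Porter Stemmer. All function names are invented for the example.

```python
import re

# Toy stop-word list (assumption); real lists are language-specific and longer.
STOP_WORDS = {"the", "a", "of", "and", "in", "on", "to", "is", "are"}
# Crude suffix rules (assumption); this is NOT the Porter algorithm.
SUFFIXES = ("ing", "ed", "es", "s")


def tokenize(text):
    """Lower-case the text and split it into word tokens, dropping punctuation."""
    return re.findall(r"\w+", text.lower())


def remove_stop_words(tokens):
    """Filtering: drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def crude_stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize, filter stop words, then stem each remaining token."""
    return [crude_stem(t) for t in remove_stop_words(tokenize(text))]


print(preprocess("The voters are voting in the referendum."))
```

With these rules 'voters' and 'voting' end up as different stems ('voter' and 'vot'), which shows why a carefully designed stemmer or a lemmatizer is preferable in practice.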

2.2.5 Mining tweets

A very important aspect to consider is that short segments of text, such as tweets, aggravate the problem of sparseness in the dataset, and strategies relying on exact word matching may be inadequate (Aggarwal and Zhai, 2012).

For that reason, some authors recommend the use of extra information (Genc et al., 2011; Sriram et al., 2010; Aggarwal and Zhai, 2012). The approach of Genc et al. (2011) was to map each tweet to the closest Wikipedia page and take the distance between these pages as the distance between two tweets. Their argument is that both tweets and Wikipedia pages are constructed by humans, and that the categorization of Wikipedia pages in particular echoes how our brains link semantic structures.


Sriram et al. (2010), on the other hand, used the standard bag-of-words construct plus eight additional features to categorize tweets and confirmed the valuable contribution of extra information. These added features were, namely, the author, the presence of abbreviations or slang, opinionated words, any currency or percentage symbols, emphasis on words, mentions at the start and mentions within the tweet.

2.3 Clustering

2.3.1 Introduction

Aggarwal and Zhai (2012) define the clustering problem as that of finding groups of similar objects in the data, with the objective of obtaining high intra-group similarity and high inter-group dissimilarity. There are many categories of clustering algorithms (Gama et al., 2015), some of which are presented below.

Connectivity-based clustering algorithms, also known as hierarchical, consider that neighbouring objects are more similar, while more distant objects should appear in separate clusters. Thus, objects are linked based on their distances, and a dendrogram of the resulting hierarchy can be drawn.

The approach to link the objects can be either agglomerative, where all objects are initially separated in isolated clusters and successively grouped based on smallest distances; or divisive, where one big cluster containing all objects is initially formed and successively partitioned into smaller groups based on greatest distances.

The linkage method (or aggregation index) determines how to assess the distances between objects and/or groups: complete linkage considers the distance between the furthest elements of each group; single linkage, the distance between the nearest elements of each group; average linkage, the average dissimilarity between the elements of each group; centroid linkage, the distance between centroids; and Ward's linkage considers the increase in inertia (between and within class dispersion) when two groups are merged.

Since the final result is a hierarchy, to get the final clusters of objects it is necessary to apply a cut-off criterion to the tree-like structure.
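The agglomerative procedure, the linkage methods and the cut-off can be sketched as follows. This is a naive illustration on invented 2-D points, with a distance threshold assumed as the cut-off criterion; the function names are hypothetical, only single, complete and average linkage are covered, and the pairwise search is far too slow for real corpora.

```python
import math
from itertools import combinations


def linkage_distance(group_a, group_b, method):
    """Distance between two groups under the chosen linkage method."""
    dists = [math.dist(p, q) for p in group_a for q in group_b]
    if method == "single":
        return min(dists)               # nearest elements
    if method == "complete":
        return max(dists)               # furthest elements
    if method == "average":
        return sum(dists) / len(dists)  # mean pairwise distance
    raise ValueError(f"unknown linkage: {method}")


def agglomerative(points, cutoff, method="single"):
    """Merge the two closest groups until their distance exceeds the cut-off."""
    clusters = [[p] for p in points]  # start with every object on its own
    while len(clusters) > 1:
        best_pair, best_dist = min(
            (((i, j), linkage_distance(clusters[i], clusters[j], method))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if best_dist > cutoff:
            break  # cutting the hierarchy at this height
        i, j = best_pair
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (10.0, 0.0)]
clusters = agglomerative(points, cutoff=2.0, method="complete")
print(clusters)
```

Raising the cut-off merges more groups (a very large value leaves a single cluster), which corresponds to cutting the dendrogram closer to its root.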

In centroid-based clustering algorithms, the distances are computed to a cluster centre  a specic data point (medoid) or a central vector , and clusters are improved in an iterative process. The number of clusters needs to be set at the start. Initial centres can be randomly picked or user-chosen, and the idea is to allocate each of the data points to the closest centre, recalculate cluster centres and repeat, until no further improvements are possible.

Albeit ecient, these types of algorithms can converge to local optimum, and so it is important to execute them more than once with dierent seeds and evaluate the results. The most widely known and still used algorithm of this category is the k-means (MacQueen, 1967; Jain, 2010).

Other categories of algorithms include density-based clustering algorithms, where clusters correspond to regions with high densities of objects, separated by regions with low densities of objects (e.g. DBSCAN (Ester et al., 1996)) and distribution-based clustering algorithms, where clusters are formed based on the probability of its members following the same distribution (e.g. clustering using the Expectation Maximization algorithm (Dempster et al., 1977)).

For distance-based algorithms, a measure of (dis)similarity is needed, such as the Euclidean, Manhattan, Chebyshev or Mahalanobis distances. In the case of text data, the most widely used is the cosine similarity, which, unlike the Euclidean distance, ignores the magnitude of the vector representing the document and considers only its direction. This is particularly important when working with collections of documents of variable length, in which the weight of a particular term may be bigger not because it appears relatively more often in a document than in any other, but simply because that document is longer.
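A minimal sketch of this length effect, in Python for illustration only (the study used R). The three toy count vectors are hypothetical: the second document repeats the first one ten times, while the third has a different term profile but a similar length to the first.

```python
from math import dist, sqrt


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


short_doc = [2, 1, 0]    # raw term counts
long_doc = [20, 10, 0]   # same proportions, ten times longer
other_doc = [0, 1, 2]    # different topic, similar length to short_doc

# Euclidean distance ranks the unrelated document as closer to short_doc.
print(dist(short_doc, long_doc), dist(short_doc, other_doc))
# Cosine similarity ranks the longer copy as closer.
print(cosine_similarity(short_doc, long_doc),
      cosine_similarity(short_doc, other_doc))
```

With raw counts, the Euclidean distance places the short document nearer to the unrelated one, whereas the cosine similarity treats the short document and its longer copy as pointing in the same direction.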


evaluation stage, where an objective assessment is made regarding the significance of the resulting clusters and whether the number of clusters is adequate. However, the authors recognize that it is still an open area, and make the following remarks:

• There is no universal clustering algorithm capable of capturing every possible underlying data structure;

• If the goals of two clustering algorithms are different, it may not make sense to compare their results;

• The knowledge of the data and its domain is paramount for determining not only what data transformations are necessary but also the most adequate similarity measures and the properties inherent to each clustering algorithm.

Nevertheless, there are three types of criteria to evaluate the quality of a clustering (Gama et al., 2015):

1. Relative criteria, focusing on measuring which algorithm fits best to the data or on the most adequate number of clusters for a given algorithm. For example, the intra-cluster variance measures how compact the clusters are, while the connectivity assesses the degree to which neighbouring objects are positioned in the same cluster.

2. Internal criteria, focused on determining to what degree the partitioning represents the inherent data structure. For example, the Gap statistic compares the total intra-cluster variation with the expected value under a null reference distribution (i.e., a distribution with no obvious clustering).

3. External criteria, with the goal of evaluating how well the clustering obtained confirms a given pre-specified hypothesis. For example, the Jaccard index determines the probability of two objects of the same cluster also being clustered together in a different clustering scheme.


By interpreting the resulting classes, one can also validate the clustering results, especially with the support of a domain expert. The interpretation may include some form of labelling, computation of mean values or visual representations of the clusters.

Regarding the challenges in data clustering, Jain (2010) notes the importance of building benchmark data from different domains to test and evaluate clustering algorithms. Also, as there is not yet one clustering algorithm that clearly outperforms all others in every domain, it is suggested that algorithms should be designed and used according to the application needs. Finally, the rise of semi-supervised methods is encouraged, as the user's domain expertise on pair-wise must-link and/or cannot-link constraints is still very important for good quality clusters.

2.3.2 Text clustering

The additional challenges of clustering text rather than quantitative or even categorical data are the following (Aggarwal and Zhai, 2012):

• High dimensionality and sparseness: on the one hand, the range of possibilities for words present in a document (i.e. the glossary of the document collection) is extremely high; on the other hand, each document often has a relatively low number of them.

• Word correlations: even though the lexicon is large, there are generally many words relating to a single concept.

• Variable document length: the representation of such items should take into account a normalization process.

To face these challenges, a series of pre-processing tasks are usually employed before the clustering process, some of which were detailed in section 2.2.4.

In addition to the feature selection techniques already discussed, as Aggarwal and Zhai (2012) emphasize, dimensionality reduction can also be achieved through feature transformation methods, an example of which is Latent Semantic Indexing (Deerwester et al., 1990). This technique represents the documents in a new (smaller) feature space where the final features are a linear combination of the original features, thus eliminating noisy dimensions from the data (synonymy and polysemy) and enhancing its semantic value. This is particularly valuable in the context of text clustering.

Clustering techniques based on distances use similarity measures to evaluate how close or apart the objects are. Huang (2008) performed an experiment on seven different datasets, of which four comprised newsgroup posts or newspaper articles. Her results showed that the cosine similarity, Pearson's correlation coefficient and the averaged Kullback-Leibler Divergence clearly outperformed the Euclidean distance in all datasets in terms of entropy and purity. However, this experiment was conducted using the k-means algorithm, which belongs to the partitioning family of clustering algorithms, and therefore conclusions should be drawn only for this type of clustering.

A compromise between the robustness of hierarchical clustering methods and the efficiency of partitioning clustering methods involves the use of a hybrid approach. In Cutting et al. (1992), such an approach is used in order to provide an efficient interactive experience in document browsing. Specifically, the authors discuss two techniques, buckshot and fractionation, to find the initial centres to feed to the partitioning clustering algorithm. The former consists of taking a sample of $\sqrt{k \cdot N}$ documents and performing hierarchical clustering to find k centres that are more robust than if randomly chosen. The latter implies dividing the corpus into N/m buckets of size m > k, applying hierarchical clustering in each of them, and then using cluster centres to reapply the clustering routine iteratively until k clusters have been obtained. Both techniques provide better seeds for a more computationally efficient algorithm, such as k-means, to begin with when clustering the complete and larger document collection. In this study, the buckshot technique was applied.


There are other types of text clustering techniques, such as probabilistic clustering, of which topic modelling is an example (Aggarwal and Zhai, 2012). In this case, each document and each term of the lexicon has a probability of belonging to one of k topics. This also makes topic modelling a clustering technique that determines document clusters and word clusters at the same time, following the notion that good clusters of words are indicative of good clusters of documents.


Chapter 3

Methodology

3.1 Motivation

Journalism and social media have become more intricately interconnected. Traditionally, people resort to mainstream media to know what is happening in the world. However, this dynamic has been changing in recent years, at least for some news topics, due to the fact that a great proportion of the world population has access to platforms which broadcast real time events to an equally worldwide audience. Consequently, the beginning of the process of news generation and dissemination has in some cases changed from journalists to the general public. It has been recognised in various studies (Newman, 2009; DVJ Insights and ING The Netherlands, 2015) that journalists spend a lot of their time scouting social media for interesting topics to write about, treating these platforms as reliable sources.

It is therefore relevant to study how events or news come about on these two types of platforms (news articles and social posts), particularly whether and how they similarly arise, disseminate, gain strength and die. The goal of this dissertation is to characterize the news or events published by news sources and/or commented and shared on social media during a period of six months, focusing on the timing of their generation and the intensity with which they are mentioned on each and through the use of text mining techniques. The next section describes the empirical steps taken to achieve this goal and the following sections provide more details regarding each of them.

3.2 Four main stages of the method

The empirical process used in this study can be observed in Figure 3.1 and is summarized below.

1. Data gathering: obtaining the data. On the news side, on-line news articles were used; on the social media side, tweets were used (see section 3.3).

2. News clustering: forming groups of similar news (see section 3.4).

3. Assignment of tweets to news clusters: allocating tweets about the same stories to the groups of news obtained (see section 3.5).

4. Analysing the resulting groups of news articles and tweets: temporality between news and tweets (see section 3.6).


3.3 Data

Both tweets and on-line news articles were provided by the POPmine platform developed in SAPO Labs, a partnership between an internet services and products provider and an academic institution, namely the University of Porto. This platform gathers on a continuing basis tweets of approximately 100 thousand Portuguese users and news articles of over 40 Portuguese press sources (Saleiro et al., 2015).

The data were collected during the year of 2016. In total, there are more than 600 thousand news articles and almost 38 million tweets, distributed somewhat linearly across the year.

For computational reasons, the data time frame used in this study was reduced from one year to one semester. Both types of documents (news articles and tweets) were received in text files in the JSON format¹. The first task was to read and import the relevant components into a database, in order to ease data cleaning, transformation, access and analysis. The ETL (extract, transform, load) process was particularly important in the case of tweets, since each tweet can have multiple components recorded in a nested structure, which would be extremely difficult to read and use as is (see example in Appendix A). Figure 3.2 shows this setup process, as well as the tools utilized for information extraction, transformation, loading and storage/management.

Figure 3.2: Data setup

¹ JSON stands for JavaScript Object Notation. It is a file format used to easily transmit data, which uses attribute-value pairs and array data structures. See json.org for more information.


The following components were gathered and imported into the database:

• News articles: article id (integer), title (string), body (string), date of publication (timestamp), source (string), url (string).

• Tweets: tweet id (integer), text (string), date of posting (timestamp), user (string), shared URLs (string).
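As a rough sketch of the flattening step for tweets, the Python below (illustrative only; the study used its own ETL tooling and R) reads one nested JSON record and keeps only the components listed above. The JSON field names (`id`, `text`, `created_at`, `user.screen_name`, `entities.urls[].expanded_url`) and the example record are assumptions made for the illustration, not the documented POPmine schema.

```python
import json


def flatten_tweet(raw):
    """Turn one nested tweet JSON string into a flat record (assumed field names)."""
    tweet = json.loads(raw)
    urls = [u["expanded_url"] for u in tweet.get("entities", {}).get("urls", [])]
    return {
        "tweet_id": tweet["id"],
        "text": tweet["text"],
        "posted_at": tweet["created_at"],
        "user": tweet["user"]["screen_name"],
        "shared_urls": urls,
    }


# Hypothetical record, built here only to exercise the function.
example = json.dumps({
    "id": 1,
    "text": "Leiam esta noticia https://example.org/news/1",
    "created_at": "2016-01-05T10:00:00Z",
    "user": {"screen_name": "example_user", "followers_count": 10},
    "entities": {"urls": [{"expanded_url": "https://example.org/news/1"}]},
})
record = flatten_tweet(example)
print(record)
```

A flat record of this kind maps directly onto one row of a database table, which is what makes the later look-up of shared URLs straightforward.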

The next stages were implemented in R (R Core Team, 2017) and rely on the information stored in this database, accessed through the RODBC package (Ripley and Lapsley, 2017).

3.4 News clustering

The same story or event can be published in many different articles and shared or commented on in many tweets. It is therefore first necessary to identify such stories. The news articles were chosen as a base for that identification in preference to tweets, as tweets are particularly prone to being about personal life rather than news topics. Tweets are also more difficult to mine due to the shortness of the text, as seen in the previous chapter.

To obtain stories or groups of similar stories of the first semester of 2016, clustering techniques were applied to a sample of on-line news articles.

3.4.1 Sample

The sample construction considered the importance of having, on the one hand, news articles that were shared on social media, and, on the other hand, news articles that were not shared on social media.

It is possible to know if a news article was shared on Twitter by looking up its URL in the tweets information collected. Shared articles are necessary so that in the following stage the assignment of tweets to the clusters can be evaluated. We assume that if a tweet contains a link to a news article then that tweet is about the same story as that news article. However, if only these news articles were included in the sample, it would be extremely likely that, when analysing the final clusters comprised of both news articles and tweets, the following conclusion would be drawn: the story was first brought up by the press, because if a link was shared, the tweet necessarily came afterwards.

By also including news articles that were not shared on Twitter, we allow news articles published later to enter the cluster. Additionally, if there are clusters of news articles with no similar tweets assigned to them, it is possible to identify stories that are not as relevant on social media, which is useful to gain insights into the least talked about topics in Portuguese society.

3.4.2 Pre-Processing

Standard pre-processing techniques were applied to the news articles, including the ones listed below. We do not further discuss these techniques, as they were described in Chapter 2.

• Tokenization

• Lower case conversion

• Punctuation and numbers removal

• Portuguese stopwords removal

• Repeated expressions removal

• Stemming

These tasks were performed in R (R Core Team, 2017) using the following packages: tm (Feinerer and Hornik, 2017) and SnowballC (Bouchet-Valat, 2014). Examples of articles pre-processing are given in Table 4.3, in Chapter 4.
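To make the listed steps concrete, here is a minimal Python sketch of the first four (the study itself used the R packages named above). The stopword set is a tiny hypothetical subset, and stemming and repeated-expression removal are deliberately left out, since they depend on a Portuguese stemmer and on corpus-specific expressions.

```python
import re

# Tiny illustrative subset of Portuguese stopwords (assumption, not the list used in the study).
STOPWORDS = {"a", "o", "as", "os", "de", "do", "da", "e", "em", "que", "para"}


def preprocess(text, stopwords=STOPWORDS):
    """Lower-case, drop punctuation and digits, tokenize on whitespace, drop stopwords."""
    text = text.lower()
    text = re.sub(r"[^\w\s]|\d", " ", text)  # punctuation and digits become spaces
    tokens = text.split()
    return [t for t in tokens if t not in stopwords]


print(preprocess("O Governo aprovou 3 medidas, em Lisboa!"))
```

Replacing punctuation with spaces rather than deleting it avoids accidentally gluing together two words separated only by a comma or a dash.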


Tagging

Additionally, as a feature selection technique, terms were tagged according to their syntactic value (parts-of-speech) and any term not classified as a verb, noun or proper noun was discarded. The recognition of named entities in the text of news articles was also included. The reason behind this step is that the purpose of clustering is to find groups of stories or events and it is expected that the inclusion of named entities, such as personalities, names of events and locations, as features will help the representation and consequent pattern recognition in that respect.

For parts-of-speech tagging, the resources used were the openNLP and the NLP packages (Hornik, 2016, 2017). For named entities recognition, we used the PAMPO package (Rocha, 2016). The PAMPO method (Rocha et al., 2016) for named entities extraction was built for the Portuguese language and is based on two algorithms: the first generates candidates by gathering common named entities terms, such as capitalized words and personal titles; the second performs a candidate selection based on parts-of-speech tagging. Performance results on a Portuguese news corpus were a recall of 0.91, a precision of 0.959 and an F1 score of 0.932.

Representation

The news corpus was then represented in a document-term matrix (dtm), with normalized TF-IDF weights, according to the following formula:

$$tf_{i,j} \cdot idf_i = \frac{tf_{i,j}}{\sum_k n_{k,j}} \cdot \log_2 \frac{|D|}{|\{d \mid t_i \in d\}|}$$

where $tf_{i,j}$ is the absolute frequency of term $t_i$ in document $d_j$, $\sum_k n_{k,j}$ is the sum of absolute frequencies of each term $k$ in document $d_j$, $|D|$ is the total number of documents in the collection, and $|\{d \mid t_i \in d\}|$ is the number of documents in the collection that contain term $t_i$.
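The weighting scheme can be sketched in a few lines of Python (illustrative only; the study built its matrix with R packages). The function name and the toy corpus are hypothetical; the computation follows the formula term by term: relative frequency in the document times the base-2 logarithm of the inverse document frequency.

```python
from math import log2


def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    doc_freq = {}
    for tokens in docs:
        for term in set(tokens):          # count each term once per document
            doc_freq[term] = doc_freq.get(term, 0) + 1
    weights = []
    for tokens in docs:
        counts = {}
        for term in tokens:
            counts[term] = counts.get(term, 0) + 1
        total = len(tokens)               # sum of absolute frequencies in the document
        weights.append({
            term: (count / total) * log2(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


corpus = [
    ["governo", "aprova", "medida"],
    ["governo", "orcamento"],
    ["jogo", "golo", "golo"],
    ["jogo", "treinador"],
]
weights = tf_idf(corpus)
print(weights[0])
print(weights[2])
```

Dividing by the document total is what makes the weights comparable across documents of different lengths, while the logarithm down-weights terms that occur in many documents.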

Dimensionality reduction

For dimensionality reduction purposes, the level of sparseness of the dtm was set to 0.98, which means that if a feature was not present in at least 2% of the documents, it was discarded. The 0.98 level guarantees no documents were left with only zero entries, while significantly reducing the number of features.

The final document-term matrix was built using the tm and RWeka packages (Feinerer and Hornik, 2017; Witten and Frank, 2005; Hornik et al., 2009).
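The sparseness rule can be sketched as a document-frequency filter. The Python below is illustrative only and follows the wording of the text (keep a term if it occurs in at least 2% of the documents); it is not a re-implementation of the R routine used in the study, whose boundary handling may differ. The function name and toy corpus are hypothetical.

```python
def remove_sparse_terms(docs, min_doc_share=0.02):
    """Keep only terms that occur in at least min_doc_share of the documents."""
    n_docs = len(docs)
    doc_freq = {}
    for tokens in docs:
        for term in set(tokens):
            doc_freq[term] = doc_freq.get(term, 0) + 1
    kept = {t for t, df in doc_freq.items() if df / n_docs >= min_doc_share}
    return [[t for t in tokens if t in kept] for tokens in docs]


# 100 toy documents: "comum" is everywhere, "limite" in 2 documents, "raro" in 1.
corpus = [["comum"] for _ in range(100)]
corpus[0] = ["comum", "raro", "limite"]
corpus[1] = ["comum", "limite"]

filtered = remove_sparse_terms(corpus)
print(filtered[0], filtered[1])
```

After filtering it is worth checking that no document has become empty, which is the property the 0.98 level is reported to guarantee for the corpus used here.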

3.4.3 Clustering

The clustering experiment conducted included hierarchical and k-means clustering.

Hierarchical clustering

The hierarchical clustering algorithm used is of an agglomerative nature, which means each observation (news article) is isolated at the start in a cluster of its own; then, clusters are iteratively joined according to greatest similarities, until there is one single cluster. So, at each iteration, there is a need to compute a similarity (or dissimilarity) value between each of the current clusters. In this case, the dissimilarity measure used was the Euclidean distance, which is given by:

$$d(v_1, v_2) = \sqrt{\sum_{f=1}^{p} (v_{1f} - v_{2f})^2}$$

where $v_1$ and $v_2$ are the feature vectors representing each element to compare and $p$ is the total number of features. The values of the feature vectors are the TF-IDF values subject to a normalization at the document level, which means that the length of the document will not influence its distance to a document of different (larger or smaller) length.

As the aggregation index, by which the dissimilarity between clusters is determined, Ward's index was chosen (Ward Jr, 1963). This means that the dissimilarity is computed as the increase in inertia when the two clusters are joined.

K-means clustering

The k-means algorithm (MacQueen, 1967) starts by selecting a set of k initial centres from the observations (in this case, news articles) and assigning each of the remaining observations to its closest centre, immediately updating the centre of the chosen cluster to its mean point. Once every remaining observation is allocated to a cluster centre, the solution is optimized by repeating the assignment operation for each of the observations, until convergence is achieved, that is, until there is no change in the cluster centres. The selection of the initial centres can be random or user-defined.
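For illustration, the Python sketch below implements the common batch (Lloyd-style) variant of k-means, in which centres are recomputed after a full pass over the observations rather than immediately after each assignment as in the description above. It is a simplified stand-in, not the R implementation used in the study; function names and toy points are hypothetical.

```python
from math import dist


def mean_point(points):
    """Component-wise mean of a non-empty list of points."""
    return tuple(sum(c) / len(points) for c in zip(*points))


def k_means(points, initial_centres, max_iter=100):
    """Batch k-means: assign every point, then update all centres, until stable."""
    centres = list(initial_centres)
    clusters = []
    for _ in range(max_iter):
        clusters = [[] for _ in centres]
        for p in points:
            nearest = min(range(len(centres)), key=lambda i: dist(p, centres[i]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous centre.
        new_centres = [mean_point(c) if c else centres[i]
                       for i, c in enumerate(clusters)]
        if new_centres == centres:   # convergence: no centre moved
            break
        centres = new_centres
    return centres, clusters


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centres, clusters = k_means(points, [(0.0, 0.0), (10.0, 10.0)])
print(centres)
print(clusters)
```

Because the result depends on the initial centres, the choice of seeds passed as `initial_centres` matters for the quality of the final partition.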

For this type of algorithm, the number of clusters (k) needs to be set at the start. To determine the number of clusters, two methods were used: the representation of the aggregation indices of the hierarchical clustering and the representation of the explained inertia. The latter is the between-class dispersion (B), measured as the sum of squared distances of the cluster centres (centroids) to the centre of gravity g, divided by the total dispersion of the data (T ), measured as the sum of squared distances of every observation to the centre of gravity:

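$$B = \frac{1}{n} \sum_{h=1}^{k} n_h \, d^2(g_h, g) \qquad T = \frac{1}{n} \sum_{i=1}^{n} d^2(I_i, g)$$

where $n$ is the number of observations, $n_h$ is the size of cluster $h$, $g_h$ is its centroid and $I_i$ is the $i$-th observation. The explained inertia ($B/T$) naturally increases with the number of clusters, and the goal is to find the value of k beyond which this increase becomes marginal (the elbow method) (Bholowalia and Kumar, 2014). The same rationale applies to the aggregation indices, in reverse.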
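A minimal Python sketch of the explained inertia for a given partition, assuming squared Euclidean distance (illustrative only; the study computed these quantities in R). Function names and the toy clusters are hypothetical.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    return tuple(sum(c) / len(points) for c in zip(*points))


def explained_inertia(clusters):
    """Between-class dispersion B divided by total dispersion T."""
    points = [p for cluster in clusters for p in cluster]
    n = len(points)
    g = mean_point(points)  # centre of gravity of all observations
    between = sum(len(c) * squared_distance(mean_point(c), g) for c in clusters) / n
    total = sum(squared_distance(p, g) for p in points) / n
    return between / total


two_clusters = [[(0.0, 0.0), (0.0, 2.0)], [(10.0, 0.0), (10.0, 2.0)]]
one_cluster = [[p for c in two_clusters for p in c]]
singletons = [[p] for c in two_clusters for p in c]

for partition in (one_cluster, two_clusters, singletons):
    print(len(partition), round(explained_inertia(partition), 4))
```

The ratio is 0 when everything is in one cluster and 1 when every observation is its own cluster; plotting it against k and looking for the point where gains flatten is the elbow heuristic mentioned above.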


Hybrid clustering

The final clustering method chosen, whose results are presented in the next chapter, was of a hybrid nature, following in the footsteps of Cutting et al. (1992), namely the buckshot method: choose the initial centres for k-means clustering by performing hierarchical clustering on a sample of $\sqrt{k \cdot N}$ news articles and computing the resulting k centroids. This way, the described methodology can still be applied to a larger sample of news articles without losing either the robustness of hierarchical clustering or the efficiency of k-means clustering.
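The seeding step of the buckshot method can be sketched as follows, in Python for illustration (the study used R packages). The naive agglomerative routine merges, at each step, the pair of clusters with the smallest Ward-style cost; it is quadratic per merge and only meant for small samples. All names, the rounding of the sample size and the toy data are assumptions.

```python
import random
from math import sqrt


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    return tuple(sum(c) / len(points) for c in zip(*points))


def agglomerate(points, k):
    """Naive agglomerative clustering with a Ward-style merge cost, stopped at k clusters."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                a, b = clusters[i], clusters[j]
                cost = (len(a) * len(b) / (len(a) + len(b))
                        * squared_distance(mean_point(a), mean_point(b)))
                if best is None or cost < best[0]:
                    best = (cost, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


def buckshot_seeds(points, k, rng=random):
    """Cluster a random sample of about sqrt(k * N) points and return k centroids."""
    size = min(len(points), max(k, round(sqrt(k * len(points)))))
    sample = rng.sample(points, size)
    return [mean_point(c) for c in agglomerate(sample, k)]


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
print(agglomerate(points, 2))
print(buckshot_seeds(points, 2, random.Random(1)))
```

The returned centroids would then be passed as initial centres to a k-means run over the full collection, so the expensive hierarchical step only ever sees the sample.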

The tasks of hierarchical clustering and k-means clustering were implemented using the factoextra (Kassambara and Mundt, 2017) and ClustGeo (Chavent et al., 2017) packages.

Cluster labelling

Each cluster was labelled with keywords selected from the dictionary of features produced at the end of the pre-processing stage. For each cluster centroid, the terms with the highest TF-IDF values were considered to be the most representative of that cluster. The number of keywords per cluster was dependent on the cluster size and set in the following manner: (i) determine the minimum number of keywords in a cluster of size one; (ii) increase the number of keywords using a logarithmic growth, in order to capture the lexicon variety in larger clusters but keep the characterization limited to a relatively small set of keywords.

Let $|C_k|$ be the number of documents in cluster $k$ and $m$ the minimum number of keywords in a cluster of size one. The number of keywords of cluster $k$ is:

$$W_k = \log_2(|C_k| + 1) \cdot m$$
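A short Python sketch shows how slowly this count grows with cluster size (illustrative only). The value m = 5 is a hypothetical choice, and since the text does not state how non-integer results are rounded, the function returns the raw value.

```python
from math import log2


def keyword_count(cluster_size, m):
    """Number of keywords for a cluster, before any rounding."""
    return log2(cluster_size + 1) * m


m = 5  # hypothetical minimum number of keywords for a cluster of size one
for size in (1, 3, 7, 63, 1023):
    print(size, keyword_count(size, m))
```

A cluster of size one receives exactly m keywords, and the count increases by m each time the cluster size roughly doubles, which keeps labels short even for very large clusters.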

Cluster keywords were visually represented using the wordcloud package (Fellows, 2014).


3.5 Assignment of tweets to clusters

Once news articles clusters are formed, it is possible to add tweets about the same stories. To determine if a given tweet should be assigned to a cluster, we used a measure of similarity between that tweet and the cluster centroids. Next, we describe the methodology for this assignment.

3.5.1 Sample

All tweets with links to a clustered news article were included in the sample, thus allowing a later evaluation of the assignment method utilized. The sample size was then doubled by including tweets with no link to a clustered news article. Since there were over 19 million tweets to choose from, a tweet with no news article URL was selected only if it contained at least two terms from the set of keywords representing the clusters.

The reason to include this second set of tweets is straightforward: if only tweets with links to news articles were used, every story would be found to be rst talked about by the press, since those tweets can only exist after the shared URL exists and if the URL exists, an article has been published.
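The keyword filter for unlinked tweets can be sketched as a set intersection. The Python below is illustrative only; it assumes that "at least two terms" means two distinct keywords and that tweets have already been pre-processed into tokens. The names and toy tokens are hypothetical.

```python
def mentions_cluster_keywords(tokens, keywords, minimum=2):
    """True if the tweet shares at least `minimum` distinct terms with the keyword set."""
    return len(set(tokens) & set(keywords)) >= minimum


cluster_keywords = {"govern", "orcament", "jog", "golo"}
print(mentions_cluster_keywords(["govern", "aprov", "orcament"], cluster_keywords))
print(mentions_cluster_keywords(["jog", "jog", "bom"], cluster_keywords))
```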

3.5.2 Pre-processing

Similarly to the news articles, every tweet was subject to the pre-processing techniques listed in section 3.4.2.

Representation

The corpus of tweets was transformed into a document-term matrix (dtm), using the dictionary of features from the news articles dtm. The dtm also included as documents the cluster centroids, so that the TF-IDF weights later used to compute the similarity between each tweet and centroid were not based solely on the tweets corpus.

3.5.3 Assignment to clusters

Each tweet was assigned to the cluster whose centroid was closest based on the cosine similarity. This similarity measure is commonly used in text mining (Huang, 2008), particularly because it is not influenced by the size of the documents. Let $v_1$ and $v_2$ be two non-zero vectors, which in this case represent a given tweet and a certain centroid. The cosine similarity between these vectors is:

$$\cos(\theta) = \frac{v_1 \cdot v_2}{\|v_1\| \cdot \|v_2\|}$$

If it is equal to zero, the vectors are orthogonal, which means that the similarity between the two documents is non-existent. If it is equal to one, the angle between the two vectors is zero and hence the similarity between the two documents is maximal.
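The assignment rule can be sketched as an arg-max over centroid similarities. The Python below is illustrative only (the study used R); treating an all-zero vector as having similarity 0 is an assumption added for robustness, since the definition above requires non-zero vectors. Names and toy vectors are hypothetical.

```python
from math import sqrt


def cosine_similarity(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros (assumption)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def assign_to_cluster(tweet_vector, centroids):
    """Return (index of the most similar centroid, its similarity)."""
    sims = [cosine_similarity(tweet_vector, c) for c in centroids]
    best = max(range(len(sims)), key=sims.__getitem__)
    return best, sims[best]


centroids = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
tweet = [0.0, 2.0, 1.0]
print(assign_to_cluster(tweet, centroids))
```

Keeping the full list of similarities, rather than only the maximum, also makes it possible to rank the candidate clusters for each tweet.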

3.5.4 Evaluation

Because all the data used in this study are unlabelled, results cannot be directly evaluated. The methodology described so far has addressed this issue by including tweets with links to clustered news articles. It can be assumed that if a tweet shares a specic news article, it should belong to the same cluster. So, we consider that class as the real class of the tweet and compare it with the results of the assignment based on similarity to cluster centroids.

Accuracy, precision, recall and F1 measures were used to evaluate these results. Accuracy is the proportion of true positives, i.e., correctly assigned tweets, in the total number of observations evaluated. Precision evaluates the percentage of true positives among the predictions of a given class. On multi-class problems, the macro precision can be computed as the average of per-class precisions. Recall assesses the percentage of true positives among the actual members of a given class. The F1 score is the harmonic mean of precision and recall, and is used when false positives and false negatives are considered equally important.
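These measures can be sketched for the multi-class case as follows (Python, illustrative only). Averaging per-class F1 scores to obtain a macro F1 is one common convention and is an assumption here, as the text does not say which variant was used. Names and toy labels are hypothetical.

```python
def macro_scores(true_labels, predicted_labels):
    """Accuracy plus macro-averaged precision, recall and F1."""
    pairs = list(zip(true_labels, predicted_labels))
    classes = sorted(set(true_labels) | set(predicted_labels))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precisions, recalls, f1s = [], [], []
    for c in classes:
        tp = sum(t == c and p == c for t, p in pairs)
        predicted = sum(p == c for _, p in pairs)   # tweets assigned to class c
        actual = sum(t == c for t, _ in pairs)      # tweets truly in class c
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(classes)
    return {
        "accuracy": accuracy,
        "macro_precision": sum(precisions) / n,
        "macro_recall": sum(recalls) / n,
        "macro_f1": sum(f1s) / n,
    }


true_labels = ["a", "a", "b", "b"]
predicted_labels = ["a", "b", "b", "b"]
scores = macro_scores(true_labels, predicted_labels)
print(scores)
```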

In addition, we borrowed the concept of precision at n from the information retrieval field. In this context, performance measures consider as true positives the relevant documents among those retrieved from a query. When considering only the topmost query results instead of all the query results, the measure is called precision at n (P@n), where n is the cut-off rank. This is particularly used for web search engines, where the performance of the first results is more important than the overall performance (Schütze et al., 2008).

In the context of this study, we made the following adaptation: each observation has n predictions, based on the ranked closest news centroids. If the true class of a tweet is in the n-topmost predictions, it is considered as a true positive. The P@n is therefore the percentage of observations whose true class is present in the top n predictions.
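This adapted P@n can be sketched directly from its definition (Python, illustrative only). Each observation carries a list of cluster identifiers ranked from most to least similar; the names and the toy rankings are hypothetical.

```python
def precision_at_n(true_classes, ranked_predictions, n):
    """Share of observations whose true class is among their top-n ranked predictions."""
    hits = sum(true in ranked[:n]
               for true, ranked in zip(true_classes, ranked_predictions))
    return hits / len(true_classes)


true_classes = [3, 1, 2]
ranked_predictions = [
    [3, 1, 2],   # true class ranked first
    [2, 1, 3],   # true class ranked second
    [1, 3, 2],   # true class ranked third
]
for n in (1, 2, 3):
    print(n, precision_at_n(true_classes, ranked_predictions, n))
```

By construction the measure cannot decrease as n grows, and P@1 coincides with the accuracy of the top-ranked assignment.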

3.6 Analysis

Once the groups of similar articles and tweets have been formed, it is possible to study how the underlying events or stories develop over time, and in particular if the role of the press as a story breaker is still indisputable or, on the contrary, if there are some stories that break rst on social media. This section discusses how this analysis was conducted.


3.6.1 Timing of events on the news and on social media

Graphical representation

Firstly, the date and time of publication of each document was retrieved from the database. This allows the representation of the cluster documents on a timeline, signalling the evolution of the number of articles and tweets in that cluster.

Temporality between news and tweets

In order to get a picture of the temporality between articles and tweets, a second analysis was made. For each news article in a given cluster, the time difference to that cluster's median tweet was computed. So, each cluster has a number of lag values characterizing it, and those values represent the timing difference between articles and the moment the social network is fully engaged in the discussion of that story. If there is a significant proportion of articles with positive lag values, then it is a strong indicator that for that particular group, the press had a more important role in the beginning of the discussion; if, on the contrary, there is a significant proportion of articles with negative lag values towards the median tweet, then it is a sign that the discussion is likely to have started on social media.

As in each cluster there is a set of linked articles and tweets, this analysis can be biased towards the first hypothesis, since for these specific documents every article comes sooner than its linked tweet (a news article URL can only be shared by a tweet if the article has been published). Hence, we excluded these articles and tweets from this analysis.
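The lag computation can be sketched as below (Python, illustrative only). The sign convention, median tweet time minus article time, is an assumption chosen so that a positive lag means the article preceded the median tweet, in line with the interpretation given above. Expressing lags in hours, the function names and the toy timestamps are also assumptions.

```python
from datetime import datetime, timezone
from statistics import median


def article_lags_hours(article_times, tweet_times):
    """Hours from each article to the cluster's median tweet (positive: article came first)."""
    median_tweet = median(t.timestamp() for t in tweet_times)
    return [(median_tweet - a.timestamp()) / 3600 for a in article_times]


def utc(day, hour):
    """Helper building a timezone-aware timestamp in January 2016 (toy data)."""
    return datetime(2016, 1, day, hour, tzinfo=timezone.utc)


tweets = [utc(1, 12), utc(1, 14), utc(1, 18)]
articles = [utc(1, 10), utc(1, 15)]
lags = article_lags_hours(articles, tweets)
print(lags)
```

Using the median rather than the first tweet makes the reference point robust to a single early or late tweet in the cluster.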

Remarks on the analysis made

This analysis was conducted on a subset of clusters, the selection of which considered both the cluster size and the per-class precision obtained from the previous stage of tweets assignment.


Additionally, as this is an exploratory analysis, directed at gaining a general picture of the temporality of news and tweets in Portugal, we emphasize that conclusions are meant to be interpreted carefully, as there is no attempt at evaluating them in this study.


Chapter 4

Temporal Analysis of News and Tweets

The aim of this chapter is to present the empirical results obtained following the methodology described in the previous chapter. It begins by describing the process of clustering news articles. Then it presents the results of the assignment of tweets to those clusters. Lastly, the final clusters of news articles and tweets are analysed.

4.1 News clustering

Usually there are many news articles related to the same event or story, either due to the existence of many news sources that publish it or because it evolves through time. The goal of clustering the on-line news articles is to segment the published stories so that articles of the same story or similar stories are grouped together. This will allow the study of when a story or group of similar stories is brought up by the press and, later, compare this to what happens on the social media Twitter.


4.1.1 Selection of articles

The on-line news articles provided by the POPmine platform were collected during 2016 and amounted to over 600 thousand items. Figure 4.1 presents the frequency of the available news articles for this investigation over that year. The monthly average is 51.7 thousand news articles.

Figure 4.1: Number of news articles in the dataset per month of 2016

For computational reasons, only a sample of news articles was used to test the proposed methodology. Our sample included news articles from the first semester of 2016, which is still a reasonably long period in which to study the timing of stories in the news and social media and reduces the size by approximately 50%.

First attempts at news clustering revealed the prevalence of sports related content, which can be confirmed in Table 4.1: 58% of the articles are from generalist news sources, 28% from sports news sources, 8% from economics news sources, 3% from technology news sources and 3% from other types of news sources. In order to obtain a wider range of topics in the groups of articles formed, only those from the manually selected set of generalist news sources, highlighted in bold in Table 4.1, were used. This further reduced the sample size by about 42%, to about 174 thousand news articles.

Table 4.1: Number of news articles per press source available

Furthermore, although the representation model used a normalized TF-IDF weighting scheme, first attempts also revealed that longer news articles tended to be grouped together. Indeed, a careful examination of the length of news articles revealed that it could vary from as little as one word to up to 4731 words. The boxplot in Figure 4.2 shows the presence of outliers in terms of document length measured as the total number of characters (extreme outlier: 5011 characters; moderate outlier: 3349 characters). The scatterplot in Figure 4.3 shows clustering results (k=100) on a sample of three thousand articles: longer articles tend to be clustered together, whereas shorter ones tend to be separated.

This could be explained by the fact that longer articles have a wider range of vocabulary and therefore similarities between them are easier to find, whereas for shorter articles it is the opposite and hence they appear in separate clusters. For these reasons, we used another criterion for selecting news articles: to keep those with length between 100 and 3349 characters. The lower bound of 100 characters is used to exclude rather short and uninformative articles, such as the following examples: `Dados são relativos à zona euro e à União Europeia em geral.' (Data refer to the euro zone and the European Union in general.); `Veja na íntegra o debate entre os três candidatos presidenciais, transmitido na SIC Notícias.' (Watch the full debate between the three presidential candidates, broadcast on SIC Notícias.). It also prevents some articles that may not have been fully or correctly collected (for example, only the subtitle was registered) from entering the sample.

Figure 4.2: News articles length (boxplot)

Figure 4.3: Cluster size and mean length of articles

The final criterion for selecting the articles was to include both articles that were shared on Twitter and articles that were not shared. The reasons for this were explained in section 3.4.1. Hence, 50% of the final sample is comprised of on-line news articles whose URL was shared in at least one tweet and 50% of on-line news articles whose URL was not found in any of the tweets.

Under the above criteria, there were 3037 on-line news articles linked to at least one tweet. The other 50% was drawn at random from the articles that were not shared. The final sample therefore includes 6074 news articles.
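The selection criteria can be summarised in a short sketch. It is not the code used in this study: the record layout (dictionaries with `url` and `text` keys), the function name `build_sample` and the fixed random seed are assumptions made for illustration. Only the length bounds and the 50/50 split come from the text.

```python
import random

MIN_CHARS, MAX_CHARS = 100, 3349  # length bounds stated in the text

def build_sample(articles, shared_urls, seed=0):
    """Build a balanced sample of shared and unshared articles.

    articles: list of dicts with hypothetical keys 'url' and 'text'.
    shared_urls: set of URLs found in at least one tweet.
    """
    eligible = [a for a in articles if MIN_CHARS <= len(a["text"]) <= MAX_CHARS]
    shared = [a for a in eligible if a["url"] in shared_urls]
    not_shared = [a for a in eligible if a["url"] not in shared_urls]
    rng = random.Random(seed)
    # Draw as many unshared articles as there are shared ones (50/50 split).
    drawn = rng.sample(not_shared, k=min(len(shared), len(not_shared)))
    return shared + drawn
```

With 3037 shared articles passing the length filter, such a procedure would return 6074 articles, as long as at least 3037 unshared articles are also eligible.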

4.1.2 Preprocessing and representation model

Standard preprocessing techniques were applied, as outlined earlier in Chapter 3 (section 3.4.2).

Each document was converted to lower case and stripped of punctuation, numeric characters and Portuguese stopwords. Words were stemmed using the Portuguese Snowball stemmer (Bouchet-Valat, 2014).
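As an illustration of these steps, the sketch below lower-cases a document, keeps only alphabetic tokens (which drops punctuation and numeric characters) and removes stopwords. The stopword list is a tiny illustrative subset, not the list used in this study, and the stemmer is passed in as a function because the Python standard library does not ship one; by default no stemming is applied.

```python
import re

# Tiny illustrative stopword list (assumption); a real run would use a full
# Portuguese stopword list.
STOPWORDS = {"a", "o", "as", "os", "de", "da", "do", "na", "no", "em",
             "que", "e", "um", "uma", "se", "sua", "seu"}

def preprocess(text, stem=lambda word: word):
    """Lower-case, tokenize, drop stopwords and stem a single document."""
    text = text.lower()
    # Letters only, accented ones included: punctuation and digits are dropped.
    tokens = re.findall(r"[^\W\d_]+", text)
    return [stem(token) for token in tokens if token not in STOPWORDS]
```

A Snowball stemmer for Portuguese could be supplied through the `stem` argument, for instance the `stem` method of NLTK's `SnowballStemmer("portuguese")`, if that library is available.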

The first analysis of the most relevant terms obtained from the document-term matrix after these preprocessing tasks revealed the need to remove some expressions that frequently appeared in the body of several news articles, as exemplified in Table 4.2. These types of expressions were removed from the articles directly at the database level, before carrying out the selection of articles for our study.

For feature selection purposes, part-of-speech (POS) tagging was also used, so that only nouns, verbs and proper nouns were kept. By observing the most relevant terms with and without POS filtering, we concluded that this step also helped to improve the quality of the labels produced per cluster.

Another improvement to the representation of news articles was the identification of named entities and their inclusion as features. The application of PAMPO (Rocha


Table 4.2: Frequent expressions on news articles

Expression: Siga o CM no Facebook
Comments: Equivalent to: Follow CM on Facebook.

Expression: Os nossos termos e condições de privacidade foram alterados. Este website utiliza cookies que asseguram funcionalidades para uma melhor navegação. Ao continuar a navegar, está a concordar com a utilização de cookies e com os novos termos de utilização.
Comments: Warning about terms and conditions and cookies.

Expression: Partilhar o artigo [título] Imprimir o artigo [título] Enviar por email o artigo [título] Aumentar a fonte do artigo [título] Diminuir a fonte do artigo [título] Ouvir o artigo [título]
Comments: Options for web users to share, print, send or listen to the news article and to increase or decrease its font size (icon legends).

Expression: Completam-se agora 100 anos sobre o início da beligerância portuguesa. Uma data assinalada pela RTP com a publicação online dos seus mais significativos materiais de arquivo sobre o tema.
Comments: Advertisement for other content from the news source (present in over 100 articles).

Table 4.3: News articles pre-processing transformation examples

Before: Segundo site TMZ, Prince morreu na sua residência em Presley Park. A polícia está a investigar um óbito ocorrido na sua residência, mas não confirmou que se tratasse da morte do próprio artista. O cantor norte-americano, Prince Rogers Nelson de seu nome, terá sucumbido a uma gripe que originara o seu internamento de urgência na passada sexta feita.
After: sit tmz prince morr resident presley park políc investig óbit ocorr resident confirm trat mort artist cantor prince rogers nelson nom sucumb grip origin intern urgênc pass sext feit

Before: A ministra da Administração Interna justificou esta sexta-feira que, "por uma questão de proporcionalidade", optou por aplicar ao militar da GNR, que matou um jovem numa perseguição após um assalto, uma sanção menos gravosa do que a proposta pela IGAI.
After: ministra da administração interna justific sext feir questã proporcional optou aplic milit gnr mat perseguiã assalt sanção propost decisã tom propost igai

Before: Entre os detidos encontram-se familiares do extremista malaio Mohamed Jedi, que combate na Síria nas fileiras do Estado Islâmico.
After: det encontr malai mohamed jedi combat síria fileir estado islâmico


sample.

Table 4.3 presents examples of the transformations. Underlined terms correspond to named entities. After these transformations, the corpus was structured in a document-term matrix. Since the number of terms generated, including named entities, surpassed 44 thousand, feature reduction was applied by setting the maximum level of sparseness of the matrix to 98%. This left only 952 terms.
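The sparseness threshold can be read as follows: the sparseness of a term is the share of documents in which it does not occur, and terms above the threshold are dropped. The sketch below follows that reading. Whether the boundary case is kept or dropped, and the function name `remove_sparse_terms`, are assumptions rather than details taken from this study.

```python
def remove_sparse_terms(docs, max_sparsity=0.98):
    """Return the set of terms kept after sparseness filtering.

    docs: list of token lists, one per document.
    A term is kept when it occurs in more than (1 - max_sparsity) of the
    documents (strict inequality is an assumption of this sketch).
    """
    n_docs = len(docs)
    doc_freq = {}
    for tokens in docs:
        for term in set(tokens):  # count each term once per document
            doc_freq[term] = doc_freq.get(term, 0) + 1
    min_docs = n_docs * (1 - max_sparsity)
    return {term for term, df in doc_freq.items() if df > min_docs}
```

For the 6074 articles in the sample, a 98% threshold under this reading keeps terms that occur in more than roughly 121 documents.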

4.1.3 Parametrizing the method

The clustering method chosen to group the on-line news articles was of a hybrid nature, combining an efficient algorithm, k-means, with a robust setup of hierarchical clustering. The number of clusters (parameter k) that k-means requires was decided on the basis of a subjective evaluation of the number of different stories that could occur in a six-month period. We also considered the explained inertia for different k values.

Determining k

The explained inertia for hierarchical clustering, given by the between-class dispersion over the total dispersion of the dataset, is shown in Figure 4.4. The elbow of the curve is not clearly visible, but the largest growth in explained inertia happens for k up to 50; beyond that point, further partitioning yields smaller marginal gains in quality (measured as class separation). A similar conclusion can be drawn from the aggregation indices of the hierarchical clustering, presented in Figure 4.5. In this representation, the elbow is more clearly visible, for k values between 25 and 50.
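To make the definition concrete, the sketch below computes the explained inertia of a given partition as the between-class dispersion divided by the total dispersion, using squared Euclidean distances. The function name and the representation of documents as plain coordinate tuples are assumptions for illustration; it is not the code that produced Figure 4.4.

```python
def explained_inertia(points, labels):
    """Between-class dispersion over total dispersion (a value in [0, 1]).

    points: list of equal-length numeric tuples; labels: one label per point.
    Assumes the points are not all identical (total dispersion > 0).
    """
    dim = len(points[0])
    n = len(points)
    overall = [sum(p[d] for p in points) / n for d in range(dim)]
    total = sum(sum((p[d] - overall[d]) ** 2 for d in range(dim)) for p in points)

    groups = {}
    for point, label in zip(points, labels):
        groups.setdefault(label, []).append(point)

    between = 0.0
    for members in groups.values():
        centroid = [sum(p[d] for p in members) / len(members) for d in range(dim)]
        # Each class contributes its size times the squared distance between
        # its centroid and the overall mean.
        between += len(members) * sum(
            (centroid[d] - overall[d]) ** 2 for d in range(dim)
        )
    return between / total
```

Evaluating this quantity for the partitions obtained at increasing values of k gives a curve of the kind described above: it starts at 0 for a single cluster and reaches 1 when every document is its own cluster.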

Figure 4.4: Explained inertia for k up to 500

Figure 4.5: Aggregation indices for k up to 500

As a benchmark, we retrieved a list of events that occurred in the first half of 2016 from a well-known news source website, SIC Notícias. This information was published as part of the `Year in Review' at the end of 2016. The cited news source identified a total of 119 different events from January to June worth including in the year review. Hence, it is reasonable to assume that a large proportion of clusters of news articles characterizes different stories published in that time frame.

Consequently, we opted for the larger end of the spectrum and set k equal to 50.

4.1.4 Clustering results

As described in Chapter 3, hierarchical clustering was performed on a sample of √


centroids for each cluster were fed into the k-means algorithm, so that the starting points would have a higher quality than a random selection of 50 articles from the sample. The clustering results are summarized in Table 4.4.
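The hybrid scheme can be sketched as follows: hierarchical clustering is run on a sample, the centroids of the resulting clusters are computed, and k-means is then started from those centroids on the full dataset. This is a minimal illustration, not the implementation used in this study. The merge criterion (Ward's cost), the `sample_size` parameter and all function names are assumptions, and the naive agglomeration shown here is only practical for small samples.

```python
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))

def hierarchical_centroids(points, k):
    """Naive agglomerative clustering (Ward's merge cost); returns k centroids."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                a, b = clusters[i], clusters[j]
                cost = len(a) * len(b) / (len(a) + len(b)) * sq_dist(mean(a), mean(b))
                if best is None or cost < best[0]:
                    best = (cost, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [mean(c) for c in clusters]

def kmeans(points, centroids, max_iter=100):
    """Lloyd's algorithm started from the supplied centroids."""
    centroids = list(centroids)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(centroids)), key=lambda c: sq_dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for c in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the previous centroid if a cluster empties
                centroids[c] = mean(members)
    return labels, centroids

def hybrid_cluster(points, k, sample_size, seed=0):
    """Seed k-means with centroids from hierarchical clustering on a sample."""
    sample = random.Random(seed).sample(points, sample_size)
    return kmeans(points, hierarchical_centroids(sample, k))
```

The design rationale matches the one given in the text: hierarchical clustering is too costly for the full dataset, but on a sample it yields starting points of better quality than a random choice of k documents.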

Table 4.4: Number of elements per cluster and class homogeneity and separation

Cluster size and dispersion

The resulting clusters of news articles vary in size. Indeed, some clusters, such as cluster 5 (Miscellaneous), cluster 7 (Elections), cluster 14 (Politics) and cluster 19 (International security), are rather large, accounting for over 50% of the documents. Their within-class dispersion values reflect the diversity of the articles in them, and further partitioning would probably continue to divide these clusters into smaller
