Identifying Topic Relevant Hashtags in Twitter Streams


Filipe Daniel Marques Figueiredo

Master’s Degree in Computer Science

Computer Science Department 2018

Supervisor

Alípio Mário Guedes Jorge, Associate Professor, Faculty of Sciences, University of Porto


All corrections determined by the jury, and only those, were made.

The President of the Jury,


Abstract

Hashtags have become a crucial social media tool. The categorization of posts in a simple and informal manner stimulates the dissemination of content through the web. At the same time, it enables users to find messages within a specific topic of their interest. However, the flexibility provided to the user to apply any hashtag carries some problems. Equivalent expressions, like synonyms, are handled like entirely different words, while the same hashtag may refer to distinct topics. Also, many hashtags are dynamic in the sense that their meaning and connections with different subjects change through time and location. These factors may hinder content discovery, especially when discussing less popular subjects. One way to overcome this problem is to provide utilities to identify relevant hashtags. Some research on hashtag recommendation in Twitter has been conducted over recent years, but with greater focus on proposing hashtags for new posts than for a topic in general. Additionally, most of the current approaches rely on databases which require time to be assembled and rigorous maintenance to be kept up to date.

The approach we propose for the identification of topic relevant hashtags is the development of a method to search Twitter, in real time, for hashtags relevant to a topic and represent them in a graph. For this task, we first retrieve tweets within some degrees of connection with the subject. Next, we employ Latent Dirichlet Allocation and Support Vector Machines, to classify tweets and collect their hashtags relevant to the subject. Finally, we use these hashtags to assemble a network of relations that can be used to deepen content retrieval on the original subject. This approach takes into account factors such as the popularity of the hashtags and current meaning.

Furthermore, we analyze the proposed algorithm, both qualitatively and quantitatively, and compare it with past approaches, in order to evaluate its performance. The outcomes of our method are usually equal or superior to the available alternatives, in relation to the number of returned hashtags, and current relevance to the topic. However, our process is by default significantly slower than the existing alternatives.

Keywords: Text Mining, Topic Modeling, Latent Dirichlet Allocation, Support Vector Machines, Twitter, Hashtags


Resumo

Hashtags have become a crucial tool in social networks. The categorization of posts in a simple and informal way stimulates the dissemination of content across the internet. At the same time, it allows users to find messages about a specific topic of their interest. However, the flexibility offered to the user to apply any hashtag brings some problems. Equivalent expressions, such as synonyms, are treated as entirely different words, while the same hashtag may be associated with distinct topics. In addition, many hashtags are dynamic, in the sense that their meaning and relations with different subjects change with time and location. These factors may hinder content discovery, especially when less popular subjects are discussed. One way to overcome this problem is to provide tools to identify hashtags relevant to a topic. Some research on hashtag recommendation in Twitter has been carried out in recent years, but with greater focus on proposing hashtags for new posts than for a topic in general. Moreover, most current approaches depend on databases that require time to be collected and rigorous maintenance to remain up to date.

The approach we propose for identifying hashtags relevant to a topic consists of developing an algorithm that searches Twitter, in real time, for hashtags relevant to a topic and represents them in a graph. For this task, we first collect tweets within some degrees of connection with the subject. Next, we apply Latent Dirichlet Allocation and Support Vector Machines to classify tweets and collect the hashtags relevant to the theme. Finally, we use these hashtags to assemble a network of relations that can be used to deepen content retrieval on the original topic. This approach takes into account factors such as the popularity of the hashtags and their current meaning.

Furthermore, we analyze the proposed algorithm, both qualitatively and quantitatively, comparing it with past approaches in order to evaluate its performance. The results of our method are generally equal or superior to the available alternatives, both in the number of hashtags returned and in their current relevance to the topic. However, our process is, by default, significantly slower than the existing alternatives.

Keywords: Text Mining, Topic Models, Latent Dirichlet Allocation, Support Vector Machines, Twitter, Hashtags


Acknowledgments

I want to express my gratitude to Prof. Alípio Jorge for agreeing to be my mentor and thesis supervisor, for trusting in my capabilities to take part in his research and for all the guidance and support provided.

Furthermore, I am grateful to my family, who always supported me through my academic and personal life, especially to my parents Jorge and Anabela, my sister Filipa and my grandparents, and to my girlfriend Cláudia for always standing by my side.

Finally, I want to acknowledge all my friends for the joyful moments shared over the years, in particular Pedro, Ricardo, Duarte, João and Gonçalo for their companionship and helpfulness through my university years.

I would never have gotten this far without them.


The research leading to these results was conducted as part of the RECAP preterm Project, which received funding from the European Union’s Horizon 2020 research and innovation program under grant agreement No 733280.


List of Tables

4.1 Example of Collected Tweets for the Seed #Preterm

4.2 Preprocessed Tweets from Table 4.1

4.3 Example of a Document Term Matrix

4.4 Classification attributed to Tweets from Table 4.1

4.5 Example of Train Set for the Support Vector Machine

4.6 Example of Hashtags Returned for the Seed #Preterm

4.7 Example of an Adjacency Matrix for the Hashtags from Table 4.6

4.8 Converted Upper/Right Adjacency Matrix of Table 4.7

5.1 Summary of the Data Sets used for Evaluation

5.2 Seeds Explored for the Qualitative Evaluation

5.3 Comparison of the Results Collected by TORHID and Hashtagify.me

5.4 Comparison of the Hashtags Collected for the Topic #Preterm

5.5 Comparison of the Hashtags Collected for the Topic #CerebralPalsy

5.6 Comparison of the Hashtags Collected for the Topic #GunControl


List of Figures

2.1 Hyperplane, Margins and Support Vectors of a SVM

2.2 Example of SVM classification through the use of kernels

4.1 Example of a Topic Model

4.2 Representation of a Network of Hashtags Related to #Preterm

5.1 Amount of Tweets per number of Hashtags

5.2 Percentage of Types of Error with Different SVM Variables for the Hashtag #Preterm

5.3 Percentage of Types of Error with Different SVM Variables for the Hashtag #CerebralPalsy

5.4 Percentage of Types of Error with Different SVM Variables for the Hashtag #GunControl

5.5 Results for Different Combinations of SVM Variables Without Seed for the Hashtag #Preterm

5.6 Results for Different Combinations of SVM Variables Without Seed for the Hashtag #CerebralPalsy

5.7 Results for Different Combinations of SVM Variables Without Seed for the Hashtag #GunControl

5.8 Average of the Percentages of the Types of Error for Different Sizes of Train Sets and Different Seeds

5.9 Average of the Percentages of the Types of Error for Different Numbers of Latent Topics and Different Seeds

5.10 Number of Hashtags Collected with Different Number of Iterations for Different Seeds

5.11 Number of Hashtags Collected with Different Number of Tweets per Iterations for Different Seeds

5.12 Number of Hashtags Collected with Different Frequency Thresholds for Different Seeds

5.13 Examples from Hashtagify.me (left) and from TORHID (right) for the seed #Preterm

6.1 User Interface of the Shiny Application Running in Mozilla Firefox

6.2 User Interface of the Shiny Application with Advanced Options and Results

6.3 Comparison of the Runtimes of the Different Tools

6.4 Comparison of the Hashtags Returned by the Different Tools

7.1 Network of Relations of the Hashtags collected for the seed #XboxE3

7.2 Network of Relations of the Hashtags collected for the seed #PlaystationE3

7.3 Word cloud based on tweets collected for the seed #XboxE3

7.4 Word cloud based on tweets collected for the seed #XboxE3

7.5 Comparison of the Live and the Last Day Summaries of the Sentiments Regarding #XboxE3 and #PlaystationE3

7.6 Comparison of the Sentiments from the Tweets Collected for #XboxE3 and #PlaystationE3

7.7 Comparison of the Best and Worst Summaries from the Tweets Collected for #XboxE3 and #PlaystationE3 with the Respective Seeds

7.8 Comparison of the Sentiments of seeds with the entire Networks Collected for #XboxE3 and #PlaystationE3


List of Algorithms

1 Collecting Tweets

2 Classifier

3 Collecting the Relevant Hashtags

4 Weighting the Relations of the Hashtags

5 Collecting Relevant Hashtags using only One Data Set


Acronyms

CTM Correlated Topic Models

DF Data Frame

DTM Document Term Matrix

IDF Inverse Document Frequency

IR Information Retrieval

L2R Learning-to-Rank

LDA Latent Dirichlet Allocation

NLP Natural-Language Processing

pLSA Probabilistic Latent Semantic Analysis

POS Part-of-Speech-Tagging

SVM Support Vector Machine

TF-IDF Term Frequency - Inverse Document Frequency

TM Topic Model

UTF-8 8-bit Unicode Transformation Format


Chapter 1 Introduction

Twitter has become one of the most popular micro-blogging services in the world [24], with over 300 million active users every month [25]. The social network allows its users to communicate by posting short messages called “tweets”, which consist of pieces of text up to 280 characters long (originally only 140 characters). Due to the massive number of users, around 8,000 tweets are posted every second [74], which, despite the relatively small size of the messages, results in an astounding amount of information spread through the internet every day. However, accessing the relevant pieces of this information is not necessarily a straightforward task.

The original motivation for our work was to retrieve data from posts on social media in order to support a project (RECAP¹) whose objective was to study physical conditions, at any stage of life, that could be related to a premature birth. Obstacles in determining the most advantageous manner of querying Twitter to collect this information led to the development of a tool for hashtag recommendation. This system was then improved and adapted to perform appropriately for virtually any other topic discussed on the platform.

Most of the functionality and tools of Twitter that make it so important in public communication today were user-led innovations, later integrated into the service itself. One of these distinctive components is the hashtag, mostly associated with Twitter but also adopted by other social networks and communication platforms. Because of Twitter’s popularity, it became burdensome to search for relevant information about a specific topic on the service. This made it frustrating for users, especially those with small follower bases, to openly engage in discussions on the platform. For that reason, users started applying dynamic, user-generated tags signaled by a "#" to their tweets as a way to make them publicly available, categorized and, consequently, easier to find for other users interested in that same topic.

While the current hashtag implementation in Twitter brings many advantages for its users because of its ease of use, there are still some issues when searching for content on a specific topic. Since anyone can use any word or expression as a hashtag, limited only by the poster’s perspective and imagination and by Twitter’s imposed cap of 280 characters, it may be difficult for a user to find every tweet relevant to their search. Hashtags are often ambiguous, and some of the concepts expressed may be more popularly represented by a synonym, an acronym or even an analogous expression, which are handled by Twitter like entirely different topics. Furthermore, a hashtag may simply express a sub-topic or a super-topic of a more relevant subject. On the other hand, hashtags are dynamic in the sense that their popularity and meaning are constantly fluctuating. Since only one hashtag can be searched at a time, it may not be immediately clear which one is the best to search for a certain topic, especially when dealing with less popular subjects.

¹ https://recap-preterm.eu/

These situations are embedded in the principles of the hashtag categorization mechanism and, as a consequence, are analogous in every other social network and service making use of it. Nonetheless, over recent years, a few works have attempted to minimize these shortcomings in Twitter through techniques of hashtag recommendation [16][26][43][47][84][85].

However, previous research focuses primarily on proposing hashtags for new posts, taking into consideration an entire tweet, while mostly neglecting hashtag recommendation for a topic in general. Additionally, most of the current approaches are fully dependent on large-scale databases and, while this brings advantages in response time, on social media everything changes rapidly, so these data sets require a high degree of maintenance in order to keep providing relevant results. There is therefore interest in studying hashtag recommendation that takes into consideration the evolution of communication patterns, that is, in a streaming environment instead of a static database.

1.1 Goals and Methodology

The objective of this thesis is to develop a new methodology and a tool to efficiently identify and collect hashtags related to a topic in Twitter streams, along with their relations, and represent them in an understandable visual manner, assisting people to reach more information and communicate more effectively.

In this document, we present TORHID (Topic Relevant Hashtag Identification), a technique to identify and retrieve relevant hashtags for a topic in Twitter streams. First, we describe an algorithm to collect tweets and retrieve hashtags within some degree of connection with the subject. The relations among the hashtags are also determined and depicted through a graph. We subsequently improve on this concept through the application of topic modeling, to automatically disambiguate tweets and their respective hashtags, filtering out the non-relevant ones and thereby pruning the network. More specifically, we employ Latent Dirichlet Allocation and support vector machines for the classification tasks. In this manner, we present a method that obtains an assortment of hashtags relevant to the original topic and the relationships among them, in a network represented through a weighted, directed graph.

We proceed to implement our method in R and compare it with available alternatives, both qualitatively and quantitatively, in order to validate its precision and usefulness. We also present a new evaluation metric for qualitative analysis that takes into consideration the temporal context of the results. Except for being slower in response time, our results are usually as good as or better than those of previous techniques. We later improve the usability of our implementation through the addition of a user interface, while attempting to minimize its run times, at a possible expense of accuracy. Finally, we demonstrate its versatility through a real-world example.

1.2 Thesis outline

This chapter presented the motivation for this thesis, our objectives and contributions. The remainder of this document is structured as follows:

Chapter 2 - Basic Concepts presents the foundations of our work, defining the terminology and introducing the essential background concepts and theories applied through the remaining chapters;

Chapter 3 - Related Work discusses the past literature and contributions on the topic of this thesis, designating the respective limitations;

Chapter 4 - Design and Implementation describes the contributions of our work, explaining our theory and implementation;

Chapter 5 - Evaluation defines the experiments and metrics used to validate our method, in addition to a comparison with existing approaches;

Chapter 6 - Usability Improvement presents the development of a web application and some experimentation with our implementation;

Chapter 7 - Case Study demonstrates a possible application of our method with a real life example;

Chapter 8 - Conclusion summarizes the presented material and presents perspectives of


Chapter 2 Background

The purpose of this chapter is to cover the theories and concepts required to fully understand the rest of this document. First we explain some of the terminology employed and then we cover the tools and algorithms used to develop our method.

2.1 Terminology

In this section, we introduce the reader to some of the important terminology used throughout this thesis. To address the components and features of Twitter, the following terms will be frequently employed:

Twitter Status/Post/Tweet - a short message posted to Twitter, limited to a length of 280 characters. The content of tweets may include images, videos, URLs, emojis, mentions of other users, polls or simply plain text, while their metadata incorporates information like the ID, the location and the time of the post, the handle of the user who posted the status and the number of likes and retweets. Unless the account is set as private, all posts are public by default.

User - anyone with a Twitter account, who can post tweets and interact with other people. Every user is identified by a unique Twitter handle, which consists of the respective username preceded by an "at" sign (@), as in "@user".

Retweet - the repost of an already existing tweet from a different user, with the intention of sharing its content. In order to differentiate them and give credit to the primary posts, retweets are marked with an "RT" tag and a mention of the original poster, as in "RT @username".

Mentions and Replies - posts that refer directly to another Twitter user and are intended to start or continue conversations. Both kinds of posts are public by default and marked with the handles of the recipients.


Follower - a user who subscribes to the feed of other users, thereby receiving updates on their future tweets. In this manner, the followers of an account represent its main audience. Unlike in most social networks, following is not necessarily mutual on Twitter.

Feed and Timeline - a chronologically ordered list of tweets that is automatically updated with new content. Each account has a public timeline of tweets posted by itself and a private timeline with posts from the users it follows.

Hashtag - a user-defined tag for categorizing messages in Twitter, signaled by a keyword preceded by an octothorpe (#), as in #hashtag. Tweets with the same hashtags can be listed together. Hashtags are used to group and spread conversations about a specific topic or event in a feed.

Trending Topics - the most popular hashtags at a given moment in Twitter. These hashtags are publicly listed to facilitate their discoverability and allow more users to contribute to the conversation. Trending topics vary considerably over time and may differ according to the user’s location.

In addition, the following term is employed to simplify the explanation of some concepts and procedures:

Seed - a keyword that can accurately represent a given topic. In our work, seeds are hashtags characterizing the original topic for which we collect pertinent tweets and recommend related hashtags.
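To make the terminology above concrete, the following is a minimal Python sketch (not part of our R implementation) of how hashtags, mentions and the retweet marker could be pulled out of the raw text of a tweet. The function name `parse_tweet` and the regular expressions are illustrative assumptions; the real rules for valid handles and hashtags are stricter than `\w+`, and lowercasing hashtags is an assumption made so that tags differing only in case are merged.

```python
import re

# Simplifying assumptions: real hashtags and handles follow stricter rules
# (for example, handle length limits) than "\w+" captures.
HASHTAG_RE = re.compile(r"#(\w+)")
MENTION_RE = re.compile(r"@(\w+)")
RETWEET_RE = re.compile(r"^RT @(\w+)")


def parse_tweet(text):
    """Return the hashtags, mentions and retweet source found in a tweet."""
    retweet = RETWEET_RE.match(text)
    return {
        # Lowercase hashtags so that #Preterm and #preterm count as one tag.
        "hashtags": [tag.lower() for tag in HASHTAG_RE.findall(text)],
        "mentions": MENTION_RE.findall(text),
        "retweet_of": retweet.group(1) if retweet else None,
    }


example = parse_tweet("RT @user: New study on #Preterm birth and #CerebralPalsy")
print(example)
```

In this sketch a seed such as #Preterm would simply be one of the strings compared against the extracted `hashtags` list.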

2.2 Theoretical Concepts

The purpose of this section is to cover the elemental theoretical concepts required to fully understand the methodology of our approach.

2.2.1 Support Vector Machines

In data science fields associated with machine learning, support vector machines are a set of methods for supervised learning, used for both classification and regression analysis. Being a supervised learning model, SVMs always require the training data to be labeled.

Provided with labeled training data, that is, a training set with every instance marked as related to one of two classes, a support vector machine builds a non-probabilistic binary classifier that assigns new cases to either one of those categories.

The support vector learning machine was developed by Vapnik et al. [78][66] to implement principles proposed in statistical learning [79], according to which learning is fundamentally the same as estimating an accurate function from a set of examples (training set). In this manner, the task of a learning machine is to determine, from a given assortment of functions, the function that minimizes the risk of being different from the actual unknown function.

In essence, SVMs operate by representing the instances as points in an n-dimensional space and defining an optimal separating hyper-plane between the classes, that is, the one that maximizes the margin to the closest points of both classes (the support vectors).

For better understanding, these concepts are illustrated in Figure 2.1. A margin can be defined as the minimal distance of an example to a decision surface. In cases where some instances are placed on the incorrect side of the discriminant margin, their weights are reduced in order to decrease their influence on the final function. New cases are then mapped into the described space and, according to the side of the gap in which they are placed, are classified as belonging to one category or the other. The fundamental principle of support vector learning is the proven insight that the risk of misclassification is minimized when the discriminant margin is maximized [46].

Figure 2.1: Hyperplane, Margins and Support Vectors of a SVM.

In addition to performing linear classification, in cases where it is impossible to determine a linear separator, support vector machines are able to perform non-linear classification through the application of kernel functions, where the data points are implicitly projected into higher-dimensional feature spaces with the purpose of making them linearly separable. Figure 2.2 illustrates an example of the use of kernels in support vector machines where a linear separation of the classes would usually be impossible.

2.2.1.1 Formalization of Hyper-plane Classifiers

The function of a classifier is to discover a rule that correctly assigns a new instance to one of several given categories, based on previous external knowledge.

Figure 2.2: Example of SVM classification through the use of kernels.

Given training instances $x_i \in \mathbb{R}^N$ and a class label with only two possible values ($y_i \in \{-1, +1\}$), one possible formalization of the problem is to estimate the function $f : \mathbb{R}^N \to \{-1, +1\}$. In this case, all hyper-planes in $\mathbb{R}^N$ are parameterized by a vector $w$ and a constant $b$ and can be expressed by the equation:

$$w \cdot x + b = 0 \tag{2.1}$$

Given a hyper-plane $(w, b)$ that properly divides the training data, we get the corresponding decision function:

$$f(x) = \operatorname{sign}(w \cdot x + b) \tag{2.2}$$

Considering that every hyper-plane represented by $(w, b)$ is equally expressed by all pairs $\{\lambda w, \lambda b\}$, for $\lambda \in \mathbb{R}^+$, we define the acceptable hyper-plane as the one which separates the data by at least 1. That is, we only consider those that satisfy the conditions:

$$x_i \cdot w + b \geq +1 \quad \text{when } y_i = +1 \tag{2.3}$$

$$x_i \cdot w + b \leq -1 \quad \text{when } y_i = -1 \tag{2.4}$$

or, more compactly:

$$y_i (x_i \cdot w + b) \geq 1, \quad \forall i \tag{2.5}$$

This means that every training point has a functional distance of at least 1 to the hyper-plane. To calculate the Euclidean distance (margin) from the hyper-plane to a data point, we normalize the function by the magnitude of $w$, that is, the distance is:

$$d((w, b), x_i) = \frac{y_i (x_i \cdot w + b)}{\|w\|} \geq \frac{1}{\|w\|} \tag{2.6}$$
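As an illustration of the decision function (2.2), the separation condition (2.5) and the distance (2.6), the following Python sketch evaluates them for a hand-picked hyper-plane. The vector `w`, the constant `b` and the four labelled points are hypothetical values chosen for the example, and points lying exactly on the hyper-plane are assigned to the positive class by convention.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def decide(w, b, x):
    """Decision function (2.2): the side of the hyper-plane x falls on."""
    return 1 if dot(w, x) + b >= 0 else -1


def functional_margin(w, b, x, y):
    """Left-hand side of condition (2.5)."""
    return y * (dot(x, w) + b)


def geometric_margin(w, b, x, y):
    """Euclidean distance (2.6) from x to the hyper-plane."""
    return functional_margin(w, b, x, y) / math.sqrt(dot(w, w))


# Hypothetical hyper-plane and labelled points.
w, b = [0.5, 0.5], 0.0
data = [([1.0, 1.0], +1), ([3.0, 2.0], +1), ([-1.0, -1.0], -1), ([-2.0, -4.0], -1)]

for x, y in data:
    print(x, decide(w, b, x), functional_margin(w, b, x, y),
          round(geometric_margin(w, b, x, y), 4))
```

The points (1, 1) and (-1, -1) meet condition (2.5) with equality, so they are the ones that would act as support vectors for this hyper-plane.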

Since we want the hyper-plane that maximizes the margin, that is, the distance to the closest data points, we need to minimize $\|w\|$ (a quadratic programming problem). This task is usually achieved through the application of Lagrange multipliers [10], transforming the problem into:

$$\text{minimize:} \quad w(\alpha) = -\sum_{i=1}^{l} \alpha_i + \frac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l} y_i y_j \alpha_i \alpha_j (x_i \cdot x_j) \tag{2.7}$$

$$\text{subject to:} \quad \sum_{i=1}^{l} y_i \alpha_i = 0 \tag{2.8}$$

$$0 \leq \alpha_i \leq C, \quad \forall i \tag{2.9}$$

where the vector of the $l$ non-negative Lagrange multipliers to be determined is represented by $\alpha$, and $C$ is a constant. Essentially, if $C = \infty$, the optimal hyper-plane divides the classes completely [9], but when $C$ is finite, the classifier allows for a soft margin [12], that is, allows for a margin error. The matrix $(H)_{ij} = y_i y_j (x_i \cdot x_j)$ can be defined to introduce a more compact notation:

$$\text{minimize:} \quad w(\alpha) = -\alpha^T \mathbf{1} + \frac{1}{2} \alpha^T H \alpha \tag{2.10}$$

$$\text{subject to:} \quad \alpha^T y = 0 \tag{2.11}$$

$$0 \leq \alpha \leq C \mathbf{1} \tag{2.12}$$

From these equations, the optimal hyper-plane can be derived as:

$$w = \sum_i \alpha_i y_i x_i \tag{2.13}$$

Since only the support vectors are considered, that is, only the data points closest to the margin contribute to $w$, it can also be shown that:

$$\alpha_i \left( y_i (w \cdot x_i + b) - 1 \right) = 0, \quad \forall i \tag{2.14}$$

In order to specify the hyper-plane, after discovering the optimal $\alpha$ to construct $w$, it is still required to determine $b$. To perform this task, any positive and any negative support vector are taken as $x^+$ and $x^-$:

$$(w \cdot x^+ + b) = +1 \tag{2.15}$$

$$(w \cdot x^- + b) = -1 \tag{2.16}$$

which translates to:

$$b = -\frac{1}{2} (w \cdot x^+ + w \cdot x^-) \tag{2.17}$$
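The chain from the dual problem (2.7) to the hyper-plane parameters in (2.13) and (2.17) can be followed on a deliberately tiny case. The Python sketch below assumes a hypothetical training set with one example per class and no upper bound on the multipliers (C = ∞). Under constraint (2.8) both multipliers then share a single value a, so the objective reduces to a one-dimensional quadratic that is minimized in closed form; a realistic data set would instead require a quadratic programming solver.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


# Hypothetical training set: one positive and one negative example.
x_pos, x_neg = [1.0, 1.0], [-1.0, -1.0]
X, y = [x_pos, x_neg], [+1, -1]

# Matrix H from the text: H_ij = y_i y_j (x_i . x_j).
H = [[y[i] * y[j] * dot(X[i], X[j]) for j in range(2)] for i in range(2)]


def dual_objective(alpha):
    """Objective (2.7)/(2.10): -sum(alpha) + 1/2 alpha^T H alpha."""
    quad = sum(alpha[i] * H[i][j] * alpha[j] for i in range(2) for j in range(2))
    return -sum(alpha) + 0.5 * quad


# Constraint (2.8) gives alpha_1 = alpha_2 = a, so the objective becomes
# -2a + (a^2 / 2) * sum(H), whose minimum is at a = 2 / sum(H).
a = 2.0 / sum(sum(row) for row in H)
alpha = [a, a]

# Equation (2.13): w = sum_i alpha_i y_i x_i.
w = [sum(alpha[i] * y[i] * X[i][d] for i in range(2)) for d in range(2)]
# Equation (2.17): b from one support vector of each class.
b = -0.5 * (dot(w, x_pos) + dot(w, x_neg))

print("alpha =", alpha, "w =", w, "b =", b)
```

Both training points are support vectors here, so condition (2.14) holds with the second factor equal to zero for each of them.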

2.2.1.2 Kernels

The use of linear functions for the classifier gives the impression of being a limiting factor for support vector machines, since many data sets are not linearly separable. However, it is possible to have linear models with a set of nonlinear decision functions through the application of the kernel trick [12], which essentially consists of preprocessing the training data so that the problem is transformed into finding an ordinary hyper-plane.

Employing the kernel trick for support vector machines works through the creation of a non-linear mapping $\phi : \mathbb{R}^N \to F$, which maps the original input space into a new feature space $F$, usually of higher dimensionality. The maximum margin hyper-plane is now just required to separate the transformed data $(\phi(x_i), y_i)$. Usually, if $\phi(x)$ casts the input vector into a space of high enough dimensionality, the training data becomes separable [9].

Given a mapping $z = \phi(x)$ and using it to replace all occurrences of $x$ with $\phi(x)$, the quadratic problem is still transformed into:

$$\text{minimize:} \quad w(\alpha) = -\alpha^T \mathbf{1} + \frac{1}{2} \alpha^T H \alpha \tag{2.18}$$

in resemblance to (2.10). However, the matrix $(H)_{ij} = y_i y_j (x_i \cdot x_j)$ is converted to $(H)_{ij} = y_i y_j (\phi(x_i) \cdot \phi(x_j))$. Equation (2.13) becomes:

$$w = \sum_i \alpha_i y_i \phi(x_i) \tag{2.19}$$

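And equation (2.2) becomes:

$$\begin{aligned}
f(x) &= \operatorname{sign}(w \cdot \phi(x) + b) && (2.20) \\
&= \operatorname{sign}\left( \left[ \sum_i \alpha_i y_i \phi(x_i) \right] \cdot \phi(x) + b \right) && (2.21) \\
&= \operatorname{sign}\left( \sum_i \alpha_i y_i (\phi(x_i) \cdot \phi(x)) + b \right) && (2.22)
\end{aligned}$$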

One important aspect of these transformations is that wherever there is a $\phi(x_i)$ in an equation, it is always in a dot product with another $\phi(x_j)$. As a result, if there exists a kernel function for the dot product in the higher dimensional feature space,

$$K(x_i, x_j) = \phi(x_i) \cdot \phi(x_j) \tag{2.23}$$

the training algorithm would only require the use of that kernel, and it would never be necessary to even explicitly know the mapping function $z = \phi(x)$. This means that, in reality, the optimization matrix would simply be represented as $(H)_{ij} = y_i y_j K(x_i, x_j)$ and the classifier as $f(x) = \operatorname{sign}\left( \sum_i \alpha_i y_i K(x_i, x) + b \right)$. Once the kernel is applied, the process to find the optimal hyper-plane proceeds conventionally. However, in the original input space, the boundary separating the data will be curved and possibly non-continuous instead of linear.

Mercer’s condition provides the mathematical properties to verify whether a prospective kernel $K$ is a dot product in some feature space [78], but not to construct the mapping function itself. Finding the best kernel function for a data set has been the subject of research [64][75], and it was found that different kernels may give similar classification accuracy and use similar support vectors [65], suggesting that each problem may be characterized by a fixed set of support data.
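A small numerical check can make the kernel trick of (2.23) tangible. The Python sketch below uses the homogeneous polynomial kernel of degree 2 on two-dimensional inputs, for which an explicit mapping `phi` is known, and confirms that the kernel value equals the dot product in the feature space. The kernel choice, the function names and the hand-picked multipliers passed to `kernel_classifier` are assumptions made for the example, not values obtained by training.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def phi(x):
    """Explicit feature map for 2-D inputs matching the kernel below."""
    x1, x2 = x
    return [x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2]


def poly_kernel(x, z):
    """Homogeneous polynomial kernel of degree 2: K(x, z) = (x . z)^2."""
    return dot(x, z) ** 2


def kernel_classifier(alphas, ys, xs, b, kernel, x):
    """Kernelized decision function: sign(sum_i alpha_i y_i K(x_i, x) + b)."""
    score = sum(a * y * kernel(xi, x) for a, y, xi in zip(alphas, ys, xs)) + b
    return 1 if score >= 0 else -1


# The kernel never builds phi(x), yet both computations agree.
x, z = [1.0, 2.0], [3.0, -1.0]
print(poly_kernel(x, z), dot(phi(x), phi(z)))

# Hypothetical multipliers and support points, chosen by hand rather than trained.
support = [[2.0, 0.0], [0.0, 1.0]]
labels = [+1, -1]
alphas = [0.1, 0.1]
print(kernel_classifier(alphas, labels, support, 0.0, poly_kernel, [1.0, 0.0]))
print(kernel_classifier(alphas, labels, support, 0.0, poly_kernel, [0.0, 3.0]))
```

The classifier only ever calls the kernel, which is why the mapping itself need not be known; the decision boundary it induces in the input space is a quadratic curve rather than a straight line.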


2.2.2 Latent Dirichlet Allocation

In data science fields associated with natural-language processing and text mining, Latent Dirichlet Allocation is a generative probabilistic model used to group unclassified sets of discrete data into a previously defined number of categories, most often used to analyze the topics in text corpora and organize the documents into the respective classes. When applied to a corpus, Latent Dirichlet Allocation considers each document as a mixture of various topics and attributes to each topic a probability for the presence of each word. It is similar to probabilistic latent semantic analysis, except that Latent Dirichlet Allocation assumes that each document covers only a small number of topics and that each topic employs only a small set of words [23].

Latent Dirichlet Allocation is a three-level hierarchical Bayesian model, first introduced by Blei et al. [8] to analyze text corpora, that has become one of the most popular techniques in text mining [81]. Some extensions to the method were later developed, such as dynamic topic models [7], hierarchical Dirichlet processes [76] and correlated topic models [6]. Other researchers have also successfully adapted Latent Dirichlet Allocation to image [33] and video analysis [80].

Essentially, Latent Dirichlet Allocation assumes that when each document from a corpus is being composed, the author firstly decides the total number of words and the topic composition, according to a Dirichlet distribution, and then populates the document with words from those topics. The model attempts to backtrack from the final set of documents to find the set of topics that are likely to have been used to generate the collection.

Given a corpus for which to find the topic representation of each document and a number of topics to determine (chosen either by an informed estimate or by trial and error), the algorithm begins by semi-randomly assigning every word of every document to a temporary topic, according to a Dirichlet distribution (if a word appears twice, different topics may be assigned to each occurrence). Stopwords are removed and not assigned to any of the topics. This initial step provides a randomized topic representation of every document and an inaccurate word distribution for every topic. Next, the algorithm visits every word and updates its topic assignment based on the proportion of words in that document that are currently assigned to each topic (how prevalent each topic is in the document) and the proportion of assignments to each topic, over all documents, that come from that word (how frequent the word is in each topic). In this step, it is assumed that all topic assignments except that of the current word are correct, and the assignment is updated using the proposed model of how documents are generated. Through the repetition of this procedure, the topic assignments improve until they become acceptable and can be used to estimate the topic composition of the documents and the word composition of the topics.
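The iterative procedure described above corresponds closely to what is commonly called collapsed Gibbs sampling. The Python sketch below is one possible rendering of it under stated assumptions: documents are already preprocessed into lists of integer word ids, the initial assignment is uniformly random (a simplification of the semi-random initialization), and the smoothing constants `alpha` and `beta`, the function name `lda_gibbs` and the toy corpus are all hypothetical. It is not the implementation used in this thesis.

```python
import random


def lda_gibbs(docs, num_topics, vocab_size, alpha=0.1, beta=0.01,
              iterations=200, seed=0):
    """Collapsed Gibbs sampling for LDA on documents given as lists of word ids."""
    rng = random.Random(seed)
    doc_topic = [[0] * num_topics for _ in docs]                # words of doc d in topic k
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # word w assigned to topic k
    topic_total = [0] * num_topics                              # all words in topic k
    assign = []

    # Initial step: give every word occurrence a random temporary topic.
    for d, doc in enumerate(docs):
        topics = [rng.randrange(num_topics) for _ in doc]
        assign.append(topics)
        for w, k in zip(doc, topics):
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                # Treat all other assignments as correct: remove this one.
                k = assign[d][n]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Topic prevalence in the document times word frequency in the topic.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta) / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assign[d][n] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    # Estimated topic composition of each document.
    theta = [
        [(c + alpha) / (len(doc) + num_topics * alpha) for c in counts]
        for doc, counts in zip(docs, doc_topic)
    ]
    return theta, topic_word


# Toy corpus with word ids 0-2 in some documents and 3-5 in the others.
docs = [[0, 1, 2, 0, 1], [3, 4, 5, 3, 4], [0, 2, 1, 2], [5, 4, 3, 5]]
theta, topic_word = lda_gibbs(docs, num_topics=2, vocab_size=6)
for row in theta:
    print([round(p, 2) for p in row])
```

Each row of `theta` is the estimated topic composition of one document, and normalizing the rows of `topic_word` would give the word composition of each topic.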

Latent Dirichlet Allocation does not assume that topics are defined either semantically or epistemologically; they are simply characterized by the calculated probability of co-occurrence of each set of words. A lexical term may be included in more than one topic, with different probabilities. It is an unsupervised machine learning technique, since the analysis is performed only according to word frequencies and, as a consequence, the data does not need to be labeled.

Being a generative model, it avoids making any initial assumptions about how the text relates to the categories.

2.2.2.1 Formalization of the Generative Process

To formalize Latent Dirichlet Allocation, we start by defining the generative process in more detail as follows:

1. For each document d in a corpus D:

(a) Select a total number of words $N \sim \mathrm{Poisson}(\xi)$

(b) Select a topic distribution $\theta_d \sim \mathrm{Dir}(\alpha)$, where $\mathrm{Dir}(\alpha)$ is a Dirichlet distribution with parameter α

(c) For each of the N words $w_n$:

i. Select a specific topic $z_n \sim \mathrm{Multinomial}(\theta_d)$

ii. Select a word $w_n$ from $p(w_n \mid z_n, \beta)$, a multinomial probability conditioned on the topic $z_n$
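The generative process above can be sketched in a few lines of Python. This is an illustration only, under the assumption that `beta` is given as a list of per-topic word probability rows and that `alpha` is a list of positive Dirichlet parameters; the helper names are hypothetical. The Dirichlet draw is obtained by normalizing independent Gamma draws, and the Poisson draw uses Knuth's multiplication method.

```python
import math
import random

def sample_poisson(rng, lam):
    """Draw from Poisson(lam) with Knuth's multiplication method."""
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def sample_dirichlet(rng, alpha):
    """Draw from Dir(alpha) by normalizing independent Gamma(alpha_i, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [x / total for x in draws]

def generate_document(rng, alpha, beta, xi):
    """Steps (a) to (c): returns the topic mixture and a list of (topic, word) pairs."""
    n_words = sample_poisson(rng, xi)                   # (a) document length
    theta = sample_dirichlet(rng, alpha)                # (b) topic distribution
    topics = range(len(alpha))
    vocab = range(len(beta[0]))
    words = []
    for _ in range(n_words):                            # (c) one pass per word
        z = rng.choices(topics, weights=theta)[0]       # i. topic of this word
        w = rng.choices(vocab, weights=beta[z])[0]      # ii. word given the topic
        words.append((z, w))
    return theta, words
```

For example, `generate_document(random.Random(1), [1.0, 1.0], [[1.0, 0.0, 0.0], [0.0, 0.5, 0.5]], 8.0)` produces a document whose words with topic 0 are always word 0, since the first row of `beta` puts all its mass there.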

A draw from a k-dimensional Dirichlet distribution yields a random variable θ that takes values in the (k − 1)-simplex and has the following probability density:

$$p(\theta \mid \alpha) = \frac{\Gamma\left(\sum_{i=1}^{k} \alpha_i\right)}{\prod_{i=1}^{k} \Gamma(\alpha_i)} \, \theta_1^{\alpha_1 - 1} \cdots \theta_k^{\alpha_k - 1} \tag{2.24}$$

where α is the parameter of the Dirichlet distribution, a k-vector with components $\alpha_i > 0$, and Γ(x) is the gamma function.

Given the parameter β, a k × V matrix of word probabilities with one row per topic and one column per vocabulary term, where $\beta_{ij} = p(w^j = 1 \mid z^i = 1)$, the joint distribution of a topic mixture θ, a set of N topics z and a set of N words w is given by:

$$p(\theta, z, w \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta) \, p(w_n \mid z_n, \beta) \tag{2.25}$$

where $p(z_n \mid \theta)$ is $\theta_i$ for the unique i such that $z_n^i = 1$.
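As a small numerical illustration of Equation (2.25), the joint distribution can be evaluated in log space for a fixed topic mixture and fixed assignments. The sketch below is in Python and assumes that topics and words are encoded as integer indices rather than indicator vectors, so that $p(z_n \mid \theta)$ becomes `theta[z]` and $p(w_n \mid z_n, \beta)$ becomes `beta[z][w]`; the function name `log_joint` is hypothetical.

```python
import math

def log_joint(theta, topics, words, alpha, beta):
    """Log of Equation (2.25) for one document with index-encoded topics and words."""
    # log p(theta | alpha): the Dirichlet density of Equation (2.24).
    log_prior = math.lgamma(sum(alpha)) - sum(math.lgamma(a) for a in alpha)
    log_prior += sum((a - 1.0) * math.log(t) for a, t in zip(alpha, theta))
    # sum over n of log p(z_n | theta) + log p(w_n | z_n, beta).
    log_words = sum(math.log(theta[z]) + math.log(beta[z][w])
                    for z, w in zip(topics, words))
    return log_prior + log_words
```

With `theta = [0.5, 0.5]`, `alpha = [1.0, 1.0]` (for which the Dirichlet density equals 1), `beta = [[0.9, 0.1], [0.2, 0.8]]`, topics `[0, 1]` and words `[0, 1]`, the joint probability is 0.5 · 0.9 · 0.5 · 0.8 = 0.18.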

The marginal distribution of a document is obtained by integrating Equation (2.25) over θ and summing over z:

$$p(w \mid \alpha, \beta) = \int p(\theta \mid \alpha) \left( \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta) \, p(w_n \mid z_n, \beta) \right) d\theta \tag{2.26}$$

Finally, we obtain the probability of a corpus by calculating the product of the marginal probabilities of every single document:

$$p(D \mid \alpha, \beta) = \prod_{d=1}^{M} \int p(\theta_d \mid \alpha) \left( \prod_{n=1}^{N_d} \sum_{z_{dn}} p(z_{dn} \mid \theta_d) \, p(w_{dn} \mid z_{dn}, \beta) \right) d\theta_d \tag{2.27}$$

where the parameters α and β are only sampled once when the corpus is being generated, the variables $\theta_d$ are sampled once per document and the variables $w_{dn}$ and $z_{dn}$ are sampled once for every word.

2.2.2.2 Inference

The central inferential problem that needs to be solved in order to use Latent Dirichlet Allocation is computing the posterior distribution of the latent variables given a document:

$$p(\theta, z \mid w, \alpha, \beta) = \frac{p(\theta, z, w \mid \alpha, \beta)}{p(w \mid \alpha, \beta)} \tag{2.28}$$

This distribution is intractable to compute in general. To normalize it, the latent variables are marginalized out and Equation (2.26) is rewritten in terms of the model parameters:

$$p(w \mid \alpha, \beta) = \frac{\Gamma\left(\sum_{i} \alpha_i\right)}{\prod_{i} \Gamma(\alpha_i)} \int \left( \prod_{i=1}^{k} \theta_i^{\alpha_i - 1} \right) \left( \prod_{n=1}^{N} \sum_{i=1}^{k} \prod_{j=1}^{V} (\theta_i \beta_{ij})^{w_n^j} \right) d\theta \tag{2.29}$$

which is also intractable due to the coupling between θ and β [15]. However, although exact inference on the posterior distribution is not tractable, numerous approximate inference techniques can be adapted to Latent Dirichlet Allocation, such as the Laplace approximation [8], variational approximation [8] and Markov chain Monte Carlo [34].

2.3 Packages and APIs

Each of the algorithms and experiments described throughout this thesis will be implemented in the R programming language. In this section we cover the most relevant packages we will be using and their respective applications.

2.3.1 twitteR and the Twitter Search API

Part of the novelty in our methodology depends on our approach to data collection. For this task, we use the package twitteR [22], which provides a straightforward interface in R to the Twitter Search API [31].

The Twitter Search API is an interface provided by Twitter to facilitate searches in their platform. Using this interface, developers have official support for querying Twitter for a word, expression, hashtag or username and receiving a set of tweets matching that query. In addition, the friends and favourites of a particular user can be searched as well. The API also supports restricting the results to an interval of time, a geographical area or a language.

However, despite its obvious advantages, the Twitter Search service and, by extension, the Search API were not originally meant to be an exhaustive source of tweets. For this reason, various restrictions are imposed on the service. The API can only retrieve tweets posted in the interval between the moment of the query and seven days earlier. Twitter also limits the search rate to 180 calls of the GET function per 15-minute window [32], meaning that, after fetching a few thousand tweets, the API is blocked until the current 15-minute window ends.

The package twitteR attempts to mitigate some of these limitations by providing additional tools, particularly the built-in option of pausing the search when the rate limit is reached and automatically resuming it after the 15-minute interval.
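The pause-and-resume behaviour can be illustrated with a generic sketch. The Python code below is not the twitteR implementation: `fetch_page` is a hypothetical callable standing in for a single API request, and the `clock` and `sleep` parameters are injectable only so that the logic can be exercised without actually waiting. The defaults of 180 calls per 900 seconds mirror the limit quoted above.

```python
import time

def fetch_with_rate_limit(fetch_page, total_calls, max_calls=180,
                          window_seconds=15 * 60,
                          clock=time.monotonic, sleep=time.sleep):
    """Call fetch_page(i) total_calls times, pausing whenever the window budget is spent."""
    results = []
    window_start = clock()
    used = 0
    for i in range(total_calls):
        if clock() - window_start >= window_seconds:
            window_start, used = clock(), 0      # a fresh window has already begun
        if used == max_calls:
            remaining = window_seconds - (clock() - window_start)
            if remaining > 0:
                sleep(remaining)                 # wait out the rest of the window
            window_start, used = clock(), 0
        results.append(fetch_page(i))
        used += 1
    return results
```

Under these assumptions, 400 consecutive calls that each return instantly require two pauses: one after call 180 and another after call 360.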

2.3.2 tm

The package tm [18] is a popular R package for text mining. It facilitates the management of text through straightforward functions for document manipulation. Integrated database back-end support is also included to minimize memory demands.

We will use the package's capabilities predominantly to preprocess data sets, through actions such as converting characters to different standards, removing white space, stemming words and removing stopwords. Furthermore, we will also employ the corresponding functions to create and manipulate corpora and document-term matrices.
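To make these preprocessing steps concrete, the following Python sketch mimics them on a toy scale. It is not the tm API: the stopword list is a small illustrative set, the "stemming" is a crude plural-stripping stand-in for a real stemmer, and both function names are hypothetical.

```python
import re
from collections import Counter

# Illustrative stopword list; this is an assumption, not the list shipped with tm.
STOPWORDS = {"the", "a", "an", "is", "are", "of", "and", "to", "in"}

def preprocess(text):
    """Lowercase, tokenize (keeping hashtags), drop stopwords, strip a plural 's'."""
    tokens = re.findall(r"#?[a-z]+", text.lower())
    kept = [t for t in tokens if t not in STOPWORDS]
    # Crude stand-in for stemming; a real stemmer such as Porter's does far more.
    return [t[:-1] if len(t) > 3 and t.endswith("s") and not t.startswith("#") else t
            for t in kept]

def document_term_matrix(texts):
    """Return (sorted vocabulary, rows of term counts), one row per document."""
    docs = [Counter(preprocess(t)) for t in texts]
    vocab = sorted(set().union(*docs))
    return vocab, [[doc[term] for term in vocab] for doc in docs]
```

Each row of the returned matrix counts how often every vocabulary term occurs in the corresponding document, which is the representation later consumed by the topic model.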

2.3.3 topicmodels

The R package topicmodels [27] provides basic infrastructure for fitting topic models and correlated topic models. It is based on data structures from the text mining package tm, and complements its functionality. Two algorithms are provided for fitting topic models: the variational expectation-maximization algorithm and Gibbs sampling.

In our implementation, this package will be used to build topic models with Latent Dirichlet Allocation, as well as for calculating the posterior probabilities of tweets.

2.3.4 e1071

The R package e1071 [49] was originally developed as a collection of useful functions of the Probability Theory Group, formerly E1071, of the Department of Statistics at TU Wien. It is mostly known for its functions for training classifiers based on support vector machines (SVM), although it also provides other methods such as shortest path computation, short-time Fourier transform, naive Bayes classification and bagged clustering.

2.3.5 igraph

The R package igraph [13] provides an infrastructure in R for igraph, a library collection of open source network analysis tools for creating and manipulating graphs. Functions like graph generation, computation of network properties based on path length and calculation of clusters in graphs are included.

We will use the tools available in this package to represent the relations among the collected hashtags in a visual manner.

2.3.6 shiny

The R package shiny [62] was developed by RStudio [63] to facilitate the creation of interactive web applications directly from R code. Shiny applications can be further extended with CSS, HTML and JavaScript. RStudio also provides the option of hosting Shiny apps locally or on their own hosting servers.

After describing and evaluating our method, we will use Shiny to build a user-friendly application through the addition of an interactive user interface to our implementation.

Chapter 3 Related Work

The purpose of this chapter is to introduce some of the existing literature and research that served as a basis for the development of this thesis.

The popularization and growth of social networks in recent years has attracted a lot of attention from the research community in different areas. Twitter, being the most popular micro-blogging service in the world, has become appealing due to the public availability of millions of short texts shared every day by a multitude of people, covering a wide variety of subjects.

3.1 Short Text Classification

One of the first works on improving search and message filtering for short texts is presented by Esparza et al. [21] as an initial study of category recommendation applied to a micro-blogging environment. Their approach relies on a prior manual categorization of each tweet into five predetermined categories (books, movies, games, music and applications) to train a classifier that predicts the classes of new tweets, validating the possibility of indexing Twitter data for category recommendation and information retrieval.

Many works have applied Latent Dirichlet Allocation (LDA) to news articles, books, academic papers and other large text corpora. However, Ramage et al. [61] are the first to experiment with topic modeling for classification tasks on short documents like tweets. The authors employ Labeled-LDA [60], a partially supervised learning model based on Latent Dirichlet Allocation that incorporates supervision in the form of tweet-level labels, to classify the contents of a Twitter feed into four previously determined dimensions: substance topics, about events and ideas; social topics, comprised of language used toward communication; status topics, describing personal updates; and style topics, which embody broader trends in language usage. These classifications are then used to profile Twitter users and their habits for better follower recommendations.

Considering that data sparseness is the main reason why many classification tasks fail to achieve a decent accuracy rate when applied to short segments of text like tweets, Phan et al. [57] propose a method to expand the coverage of classifiers through the addition of external knowledge. The authors introduce a framework whose underlying idea is that, for each new classification task, a universal data set, consisting of large-scale data, is collected from an external source like Wikipedia and used to build a classifier on both the small set of labeled training data and the rich set of hidden topics derived from that external data collection. The proposed framework was considered general enough to be applied to a wide variety of data domains, such as Twitter feeds.

In order to facilitate the retrieval of information from Twitter, Antenucci et al. [3] propose a method to learn the relationships between the content of a tweet and the hashtags that could accurately describe it. In their approach, each tweet is represented as a frequency list of its constituent words that are neither stopwords nor hashtags, and the corresponding categories are the hashtags employed in the tweet. They introduce a technique to cluster hashtags into meaningful topic groups using a combination of co-occurrence frequency [58], graph clustering and textual similarity. Then, a method is described to classify a tweet, based on the content of the previously defined topic groups, through a combination of principal component analysis (PCA), dimensionality reduction and a variety of multi-class categorization algorithms. In this manner, the authors were able to perform supervised classification tasks on otherwise intractable problems.

To empirically compare the content of Twitter with a traditional news medium using unsupervised topic modeling, Zhao et al. [87] propose Twitter-LDA, a variation of Latent Dirichlet Allocation developed specifically to take into consideration the size limitations of tweets. Assuming that each tweet has only one topic, the Twitter-LDA classifier is employed to automatically collect posts classified as discussion topics. The topics in the New York Times are then used to semi-automatically group the relevant topics of the collected tweets into distinct topic categories. Finally, all topics are manually classified. Further experimentation reveals that Twitter-LDA may outperform regular LDA when coping with the short text characteristics of tweets.

Not completely satisfied with some shortcomings of Twitter-LDA, Mehrotra et al. [48] investigate techniques for tweet aggregation, with the goal of improving the accuracy of topic modeling in micro-blogs without modifying the fundamental concepts of Latent Dirichlet Allocation. The authors experiment with five different schemes to aggregate tweets in a preprocessing step before training the topic model: the Basic scheme, the baseline where each tweet is considered a single document and the model is trained on all tweets; Author-wise Pooling, a previously suggested process [82][29], where the tweets of each author in the data set are combined into a single document; the Burst-score-wise scheme, which first discovers trending topics [52] by detecting the excessive frequency of a term in the data, and then aggregates the tweets into a single document for each burst-term; the Temporal scheme, which attempts to capture the temporal consistency of the posts, through the [...]; and the Hashtag-based scheme, which considers hashtags as indications of the context of the posts, combining the tweets that share the same hashtag into a single document for every hashtag. The trained topic models were later compared on three different data sets. The proposed hashtag-based pooling scheme was found to significantly outperform all the alternative methods, revealing that hashtags may be valuable indicators of the substance of a tweet.
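The idea of pooling tweets by hashtag can be sketched in a few lines of Python. This is an illustration of the general idea rather than the procedure of Mehrotra et al.: it assumes that hashtags are matched case-insensitively with a simple regular expression and that tweets without hashtags are simply left out; the function name is hypothetical.

```python
import re
from collections import defaultdict

def pool_by_hashtag(tweets):
    """Concatenate tweets sharing a hashtag into one pseudo-document per hashtag."""
    pools = defaultdict(list)
    for tweet in tweets:
        # A tweet with several hashtags contributes to several pools.
        for tag in set(re.findall(r"#\w+", tweet.lower())):
            pools[tag].append(tweet)
    return {tag: " ".join(texts) for tag, texts in pools.items()}
```

The resulting pseudo-documents are longer than individual tweets, which is what makes them better suited as input for a standard topic model.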

3.2 Hashtag Recommendation

The majority of previous research on hashtag recommendation applied to Twitter has focused on suggesting a set of hashtags to complement a tweet at the moment it is being posted. For this reason, most of the methods described in this section can be used to recommend hashtags for an entire tweet but not for a single word.

In contrast to the earliest techniques of keyword extraction [84][44], Mazzia and Juett [47] introduce the use of probability distributions applied to hashtag suggestion for micro-blogging websites. In particular, the proposed method considers every hashtag as a category and, correspondingly, the tagged tweets as labeled data, thus allowing the application of a naive Bayes model to calculate the maximum a posteriori probability of each hashtag class, given the words of the tweet. The authors also propose some preprocessing procedures to adapt Twitter data to common machine learning techniques. However, one of the major drawbacks of their approach is the assumption of the mutual independence of the words inside tweets.
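A naive Bayes formulation of this kind can be sketched as follows. The Python code is a generic illustration, not the implementation of Mazzia and Juett: it assumes tweets are already tokenized, that each training tweet carries a single hashtag, and that word likelihoods use add-one (Laplace) smoothing; all names are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(tagged_tweets):
    """tagged_tweets: iterable of (list_of_words, hashtag) pairs."""
    tag_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for words, tag in tagged_tweets:
        tag_counts[tag] += 1
        word_counts[tag].update(words)
        vocab.update(words)
    return tag_counts, word_counts, vocab

def most_probable_hashtag(words, model):
    """Return the hashtag maximizing log P(tag) + sum over words of log P(word | tag)."""
    tag_counts, word_counts, vocab = model
    total = sum(tag_counts.values())

    def score(tag):
        # Add-one smoothing keeps unseen words from zeroing out a class.
        denom = sum(word_counts[tag].values()) + len(vocab)
        return math.log(tag_counts[tag] / total) + sum(
            math.log((word_counts[tag][w] + 1) / denom) for w in words)

    return max(tag_counts, key=score)
```

The sum of per-word log-likelihoods is exactly where the independence assumption enters: each word contributes to the score without regard to the other words in the tweet.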

Another method, which leverages the similarity between the tweets collected in a data set, is proposed by Zangerle et al. [85], whose goal is to discourage the usage of synonymous hashtags through the recommendation of homogeneous hashtags for a tweet. Their approach calculates the resemblance of tweets and assigns them a score based on a Term Frequency-Inverse Document Frequency (TF-IDF) scheme. Then, the hashtags are extracted and restricted to a final set of those with a score above a threshold. They experiment with ranking this set according to three different metrics: the OverallPopularityRank score, based on the overall popularity of the tweets; the RecommendationPopularityRank score, based on the popularity within the most similar tweets; and the SimilarityRank score, based on the most similar tweets. The last metric is reported to provide the best results, particularly when recommending five hashtags.
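A similarity-based ranking in this spirit can be sketched in Python as follows. The code is an assumption-laden illustration, not the method of Zangerle et al.: it uses raw term frequency times log inverse document frequency, cosine similarity between tweets, and scores each hashtag by the similarity of the best-matching tweet that contains it. All function names are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists. Returns one {term: tf * idf} dict per document."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return [{t: c * math.log(n / df[t]) for t, c in Counter(doc).items()} for doc in docs]

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

def recommend_by_similarity(new_tokens, corpus, top_n=5):
    """corpus: list of (tokens, hashtags). Rank hashtags by their best-matching tweet."""
    vectors = tfidf_vectors([tokens for tokens, _ in corpus] + [new_tokens])
    query = vectors[-1]
    best = {}
    for vec, (_, tags) in zip(vectors, corpus):
        sim = cosine(query, vec)
        for tag in tags:
            best[tag] = max(best.get(tag, 0.0), sim)
    # Hashtags whose tweets share no weighted term with the query are discarded.
    ranked = sorted((t for t in best if best[t] > 0), key=lambda t: -best[t])
    return ranked[:top_n]
```

The default of `top_n=5` reflects the setting reported above as giving the best results, but it is otherwise arbitrary.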

Taking into account that Twitter users adopt distinct writing styles when posting messages on the platform, an attempt to improve on the method described above was later made by Kywe et al. [37], who develop a personalized hashtag recommendation system that considers not only the tweet content but also the user preferences. Their approach applies an explicit user-profile-based collaborative filtering method, which represents each user as a vector weighted by a Term Frequency-Inverse Document Frequency (TF-IDF) scheme, analyzing the frequency of some keywords in user profiles. After calculating the weights of every user, they proceed to extract the hashtags from their most similar tweets, which are subsequently ranked and recommended to the user. This method recommends hashtags collected from one month of posts and was found to outperform the previous approach by up to 20 percent. The authors also discovered that the inclusion of other hashtags and tweets previously posted by the user improved performance slightly.

A different method, based on language modelling, is proposed by Efron [17], who attempts to improve Twitter search by retrieving hashtags related to a keyword query from a previously collected data set of tweets. He estimates the probability of every word in the data set co-occurring with every hashtag in a tweet and smooths the models through Bayesian updating with Dirichlet priors. The amount of information contained in every hashtag is later assessed through an Inverse Document Frequency (IDF) scheme. Finally, KL-divergence [36] is used to compare the model of every hashtag with that of the original query, measuring their respective relevance, according to which the hashtags are ranked. In this manner, hashtags are predicted through query expansion techniques.
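The combination of Dirichlet-smoothed language models and KL-divergence can be sketched in Python as follows. This is a simplified illustration rather than Efron's method: it omits the IDF step, builds the background model from the pooled tokens of the query and all hashtags, and uses an arbitrary smoothing weight `mu`; all names are hypothetical.

```python
import math
from collections import Counter

def smoothed_model(tokens, background, mu=10.0):
    """Dirichlet-smoothed unigram model: (count(w) + mu * P_bg(w)) / (len(tokens) + mu)."""
    counts = Counter(tokens)
    return {w: (counts[w] + mu * p_bg) / (len(tokens) + mu)
            for w, p_bg in background.items()}

def kl_divergence(p, q):
    """KL(p || q) over a shared vocabulary; q must be positive wherever p is."""
    return sum(pw * math.log(pw / q[w]) for w, pw in p.items() if pw > 0)

def rank_hashtags(query_tokens, hashtag_tokens):
    """hashtag_tokens: {hashtag: tokens of the tweets containing it}. Closest model first."""
    all_tokens = list(query_tokens) + [t for toks in hashtag_tokens.values() for t in toks]
    total = len(all_tokens)
    background = {w: c / total for w, c in Counter(all_tokens).items()}
    query_model = smoothed_model(query_tokens, background)
    scores = {tag: kl_divergence(query_model, smoothed_model(toks, background))
              for tag, toks in hashtag_tokens.items()}
    return sorted(scores, key=scores.get)
```

Smoothing against the background model guarantees that every hashtag model assigns a positive probability to every query word, so the divergence is always finite.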

Due to the volatility in the appropriate number of clusters to be found, Li and Wu [43] consider the common procedures for classification of regular text inadequate for hashtag prediction and employ network relatedness methods to recommend a hashtag for a new tweet. In their work, they argue that if the correlation between tweets can be mathematically measured, then it can also be represented as a network of points in a high-dimensional space through a latent space model. The similarity of the tweets is then calculated as the Euclidean distance between those points, considering the words in a dictionary as the orthonormal basis. The semantic correlation among words is determined through the construction of a matrix based on WordNet similarity [51] and its subsequent weighting by their co-occurrence in tweets. Finally, the tweets that are closest to the original are collected and new ones are added to the set until a certain hashtag becomes dominant, that is, appears in the majority of the tweets. One of the advantages of this method is the ease of updating the dictionary, since the addition of a new word only requires the computation of an extra column in the matrix.

Topic modeling is first employed for hashtag recommendation in Twitter by Godin et al. [26], attempting to facilitate the indexing and search of new tweets. In their work, they begin by introducing a binary language classifier for tweets, based on the Naive Bayes method and Expectation-Maximization, to filter out posts written in languages other than English from their database. A Latent Dirichlet Allocation model is then trained to cluster the remaining tweets into a set of general topics. Given a new tweet, its underlying topic distribution is also generated and the top keywords from the dominant subjects are recommended as hashtags. The authors argue that suggesting general words as hashtags instead of already existing ones is advantageous when compared to previous work, since it enables the categorization and search of tweets. A qualitative evaluation of this approach concluded that in 80% of the cases, a suitable hashtag is recommended from a selection of five possibilities.

She and Chen [69] develop a supervised topic model-based solution for hashtag recommendation on Twitter, abbreviated TOMOHA. They employ an adaptation of Twitter-LDA [87] which considers hashtags as the labels of the local topics to generate a model capable of analyzing relationships among the words, hashtags and topics of different posts. Their work follows some assumptions, such as that each tweet is solely about one of the topics of the corpus, and that every word is either a local topic word or a background word, common in most tweets. Considering that users can be influenced by their network of contacts, the study also incorporates user following relationships into the model. Finally, the probability of every hashtag being contained in a new tweet is calculated, using an asymmetric Dirichlet distribution, in order to recommend the most probable ones. The authors also propose the use of parallel computing to accelerate the model training process, through the random assignment of tweets to different processors.

Seeking to encourage more widespread adoption and usage of hashtags in Twitter, Dovgopol et al. [16] experiment with different methods of hashtag recommendation for new posts. The authors argue that hashtag recommendation carries two major challenges that set it apart from traditional document tag recommendation: the fact that the data consists of huge volumes of small and noisy content, for which they propose some preprocessing methods; and the size limitations of tweets, which discourage the repetition of words [77] and make traditional classification techniques like TF-IDF inefficient, for which they present an original alternative. They start by using Inverse Document Frequency (IDF) to discover the three most important words in the target tweet and filter out all tweets not containing those words, with the purpose of improving the speed of the system. Next, they present a hybrid classifier that combines Bayes' theorem, used to rank hashtags by their probability of co-occurring with each term in a tweet, and k-Nearest Neighbor, used to calculate the number of hashtags that occurred in similar tweets, and use it to classify and recommend hashtags for the given tweet.

Attempting to improve user interactions with reports, articles and publications, Xiao et al. [83] propose a method to recommend Twitter hashtags related to a keyword that represents a news-related topic. They begin by collecting news articles from major agencies and employ an original Probabilistic Inside-Outside Log (P-IOLog) method to cluster them into vectors representing different topics, according to the presence of topic-specific informative words that co-occur with the given target word in the editorials. News-related tweets published concurrently with the articles are also collected from a set of manually selected accounts and concatenated according to their hashtags, with the purpose of building a similar hashtag vector. The similarity between each news-topic vector and each hashtag vector is subsequently calculated, and the hashtags with the highest similarity scores are recommended for the news topic. Experiments performed by the authors revealed that the proposed process outperforms more common methods like Term Frequency-Inverse Document Frequency (TF-IDF), recommending more relevant hashtags for the selected news topic. However, they acknowledge that their approach is specifically tailored to news-related keywords, being incapable of recommending hashtags for topics not reported by news agencies.

Shi et al. [70] also propose a news-related hashtag recommendation solution, named Hashtagger, with the purpose of finding relevant hashtags to stimulate the dissemination of news articles in Twitter. In contrast to most recent alternatives, instead of focusing on topic models, they suggest a learning-to-rank (L2R) approach for modelling hashtag relevance, in order to avoid the need to continuously retrain the model to keep it updated. They adapt standard Information Retrieval (IR) methods to accept a news article as the initial query and a set of tweets representing their embraced hashtags as the documents. For each new article, a ranking is performed based on automatic query formulation, to retrieve candidate hashtags, and a previously trained pointwise learning-to-rank (L2R) method is subsequently used to score the relevance of each hashtag for each article. In the process of generating keyword queries from the news articles, the proposed method employs part-of-speech (POS) tagging to select pairs of nouns/phrases and calculates a score based on Term Frequency-Inverse Document Frequency (TF-IDF) to select the top 5 pairs. The authors argue that the features that make a hashtag relevant to an article remain constant over time, allowing the L2R model to be trained on manually labeled article-hashtag pairs only once and used several times. However, the procedure requires waiting until enough tweets are collected before it can provide recommendations, resulting in run-times of up to 12 hours. For this reason, the same authors later propose Hashtagger+ [71], an improved version of Hashtagger that employs cold-start algorithms to accelerate the recommendation process. This particular method was found to perform exceptionally well with less popular articles.

Focusing on the volatility of hashtag popularity and on data sparsity, Otsuka et al. [54] suggest a method employing an adapted ranking system to recommend more current trending hashtags for new Twitter posts. Inspired by classic information retrieval methods, their process requires an earlier compilation of millions of tweets into two nested map data structures: a Term to Hashtag Frequency Map (THFM), which maps each word to the frequency with which it co-occurs with each hashtag; and a Hashtag Frequency Map (HFM), an analogous data structure that maps each hashtag to the respective term frequencies. Next, these mappings are manipulated by a Hashtag Frequency-Inverse Hashtag Ubiquity scheme (HF-IHU), a variation of the popular Term Frequency-Inverse Document Frequency (TF-IDF) weighting scheme adapted to score hashtag relevancy while taking into account the sparseness of data sets collected from Twitter, to rank the hashtags in order of relevance to the tweet. The algorithm also makes use of the Hadoop distributed computing platform [55] with Map-Reduce [14] to accelerate the mapping process. The authors conducted experiments on a large Twitter data set and concluded that their proposed method successfully yielded relevant hashtags, outperforming other popular schemes like k-nearest neighbors, k-popularity and Naive Bayes.

One of the very few approaches that take into consideration the chronological evolution of hashtags in Twitter is proposed by Harvey and Crestani [28], who investigate the exploitation of temporal patterns in hashtag recommendation for new posts, attempting to tackle the heavy fluctuation of hashtag popularity over time [40][45]. The authors consider that hashtags can be split into two different types: organizational hashtags, which remain popular over long periods of time; and conversational hashtags, which refer to particular events and are only relevant during a short interval of time. The proposed approach first applies a collaborative filtering personalization technique, extending a popular method based on Term Frequency-Inverse Document Frequency (TF-IDF) [85], to determine a set of candidate hashtags from the users' previous posts. Subsequently, the tweets containing those hashtags are reviewed, taking into account the time when they were posted, in order to re-weight the scores of possible hashtags by their temporal relevance. Experimental results were positive, showing that even when the temporal weighting is not completely successful, the damage to the original ranking is usually insignificant.

A particularly distinct approach is taken by Hamidreza Alvari [2], who proposes a simple method to recommend relevant hashtags for a new tweet, based solely on the hashtag usage history of the users. The author argues that the content of the tweets is not necessarily required to perform good hashtag recommendation and that most alternatives in the literature carry an excessive computational weight due to the lack of strong natural language processing techniques to analyze the contents of tweets. The proposed process treats hashtag recommendation as a collaborative filtering problem, formulates it as an optimization problem and solves it through low-rank matrix factorization, characterizing users and hashtags by inferring vectors of latent factors from the users' hashtag usage history. According to the author, when users compose a new post, they are very likely to select hashtags similar to the ones they have already adopted in previous tweets. Empirical experiments demonstrate that the proposed method is indeed capable of proper hashtag recommendation.

In contrast with the majority of the previously described methods, Li et al. [42] propose a method to suggest relevant Twitter hashtags for a given concept represented by a keyword, instead of for an entire tweet. The authors were the first to exploit distributed language representations of words in hashtag recommendation for a subject, through the construction of word embeddings [50] from a database containing billions of words extracted from tweets, thus considering both the semantics and the syntax of the terms. When querying for a keyword, every hashtag extracted from the tweets is ranked according to the cosine similarity between its word-embedding vector and the embedding vector of the keyword itself. The highest ranked hashtags are then recommended. The authors concluded that their approach outperformed the conventional term co-occurrence based methods and that, despite being tested with health-related concepts, the procedure could be applied in any content domain.
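The ranking step of such an approach reduces to sorting hashtags by cosine similarity in the embedding space. The Python sketch below illustrates only that step; it assumes the embeddings have already been learned and are available as plain lists of floats, and the function names are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rank_hashtags_by_embedding(keyword_vec, hashtag_vecs, top_n=3):
    """hashtag_vecs: {hashtag: embedding}. Returns the top_n most similar hashtags."""
    ranked = sorted(hashtag_vecs,
                    key=lambda h: cosine_similarity(keyword_vec, hashtag_vecs[h]),
                    reverse=True)
    return ranked[:top_n]
```

For instance, with a keyword vector `[1.0, 0.0]` and hashtag vectors `[0.9, 0.1]`, `[0.7, 0.7]` and `[0.0, 1.0]`, the hashtags are returned in that same order, since their cosine similarities to the keyword decrease from about 0.99 to 0.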

3.2.1 Proprietary Methods

In recent years, due to the improvement and popularization of social media platforms, as well as the reduced broadcasting cost, marketers and communicators have increased their focus on online advertisement and content propagation on social networks. Being the world's most popular micro-blogging service, Twitter became an obvious target of product research that attempts to direct its multiple conveniences and instruments towards marketing and content propagation to the masses.

For companies and influencers, employing the appropriate hashtags can significantly improve brand awareness and the connection with their target audience, by encouraging conversations related to their products or ideas. Hashtags are also an important tool for community building and the management of user-generated content, considering that posts containing hashtags receive twice the engagement of those without them [67].

The need to better measure the success of online campaigns and collect more knowledge about the target audiences to optimize the reach of social media posts drove the establishment of a few companies and websites whose services focus on hashtag management, analytics and recommendation.

Companies like Sprout Social¹ have specialized in hashtag analytics aimed at enhancing the social media strategy of a specific brand, building reports about which hashtags are commonly employed when referring to a particular company in Twitter. Other websites like Hashtagify.me² attempt to amplify their clients' reach through hashtag marketing, while also allowing them to track their competitors and discover common variations of hashtags frequently used alongside a queried keyword, as well as its major influencers. Ritekit develops services like Ritetag³, a website that focuses on hashtag suggestions for pictures and tweets and not only recommends relevant hashtags, along with some extra information, but also discourages the use of some other hashtags.

Despite their usefulness for professionals, hashtag analytics and recommendation companies like the ones described above carry two significant drawbacks. Due to their commercial nature, these services require a costly premium subscription to unlock most of their useful features and settings, or even to be used at all. Furthermore, it is in the best interest of the companies maintaining these services to keep their technology secret. As a result, there is little to no information on their methodology, which diminishes their contribution from a scientific perspective.

3.2.2 Shortcomings of Current Hashtag Recommendation

While research on hashtag recommendation in Twitter has achieved some significant breakthroughs, current approaches still carry some shortcomings.

First of all, the majority of the literature reviewed limits its focus on hashtag recommendation to the improvement of new posts, concentrating on suggesting hashtags for short texts rather than for a topic or a keyword. Consequently, while these methods are advantageous for increasing the reach and consistency of new content, they are less convenient when searching for new information and connections.

Furthermore, the described approaches universally depend on the prior extraction of extensive databases populated with millions of random tweets to train their classifiers and recommender systems. Due to the static character of these data sets, very few studies take into consideration the evolution of the suggested hashtags, whose connotation and popularity frequently change over time. For example, in 2010 the hashtag #Johannesburg would presumably be considered relevant to the topic #WorldCup, considering that the FIFA World Cup was hosted in South Africa that year. However, that relevance would probably not persist until 2014, when the same event was held in Brazil. Surprisingly, even professional hashtag recommendation services make this kind of mistake, especially when searching for less popular topics, revealing that their databases may not be updated frequently enough.

¹ https://sproutsocial.com/about
² http://hashtagify.me/explorer/about
³
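The temporal drift in the #WorldCup example can be made concrete with a minimal sketch. It assumes a hypothetical toy collection of tweets, invented purely for illustration, and uses plain co-occurrence counting as a stand-in relevance signal. This is not the relevance measure developed in this thesis. The names `TWEETS`, `cooccurrence_by_window` and `top_hashtags` are illustrative only.

```python
from collections import Counter, defaultdict

# Hypothetical toy data, invented for illustration: (year, set of hashtags).
TWEETS = [
    (2010, {"worldcup", "johannesburg"}),
    (2010, {"worldcup", "johannesburg", "vuvuzela"}),
    (2010, {"worldcup", "southafrica"}),
    (2014, {"worldcup", "brazil"}),
    (2014, {"worldcup", "riodejaneiro", "brazil"}),
    (2014, {"johannesburg", "travel"}),
]


def cooccurrence_by_window(tweets, topic):
    """Count, separately for each time window, the hashtags seen with `topic`."""
    counts = defaultdict(Counter)
    for window, tags in tweets:
        if topic in tags:
            counts[window].update(tags - {topic})
    return counts


def top_hashtags(tweets, topic, window, k=1):
    """Return the k hashtags that co-occur most often with `topic` in one window."""
    window_counts = cooccurrence_by_window(tweets, topic)[window]
    return [tag for tag, _ in window_counts.most_common(k)]


if __name__ == "__main__":
    for year in (2010, 2014):
        print(year, top_hashtags(TWEETS, "worldcup", year))
```

Pooling both years into a single static data set would still rank #johannesburg among the hashtags associated with #worldcup, whereas the per-window counts show that the association is absent from the 2014 tweets.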

Therefore, in this thesis, we propose and develop techniques to properly identify, collect and recommend hashtags that are currently relevant to a given topic. Our approach queries Twitter to assemble smaller but more consistent data sets, which are easier to keep up to date, and uses them to train a classifier. In addition, our relevance metrics are less influenced by co-occurrence than most of the alternatives.
