
Temporal Research Interests Discovery Using Co-Occurrence Keywords Networks

Pedro Cardoso Belém

Master’s degree in Computer Science

Computer Science Department 2018

Supervisor

Fernando Manuel Augusto da Silva, Full Professor, Faculty of Sciences, University of Porto

Co-supervisor

Pedro Manuel Pinto Ribeiro, Assistant Professor, Faculty of Sciences, University of Porto


Abstract

Doing research has in itself a social component, not only because most publications are authored by more than one person but also because every scientific study is built upon previous work of different researchers. Having a profile that summarizes a researcher’s career, which can be decades long, is very important to facilitate this social dynamic.

In this thesis we focus on the creation of temporal expertise profiles that showcase not only the research interests of a researcher but also how those interests evolve over time. State-of-the-art approaches to this problem resort to LDA-based topic models [19], which have the drawback that the number of topics must be user-defined. In our approach we use keyword co-occurrence networks and hierarchical community detection algorithms to infer topics, organized in a hierarchy, from the data. We used publication metadata from the Authenticus system, a national repository of metadata of scientific publications authored by researchers associated with Portuguese institutions, with over 400,000 records.

We structured our methodology in three different steps. In the first step, we create a keyword co-occurrence network and infer a topic hierarchy from it. We tried different variations of networks, by including keywords extracted from the title and abstract of publications, and also semantic similarity connections between keywords. Having a topical hierarchy, we link publications to those topics and then create a temporal profile of any entity with associated publications, whether it is a researcher or an institution. After creating the profile, in the last step, we present the profile in a user-friendly way. In our results we present examples of expert profiles which look promising; however, we lack ground-truth data to evaluate the methodology.

(4)

Resumo

Doing research has an inherent social component, not only because most publications have more than one author but also because all scientific studies are built upon work done by other researchers. Having a profile that summarizes a researcher's career is very important to facilitate these social dynamics.

In this thesis we focus on the creation of temporal expertise profiles that show not only the research interests of authors but also how those interests evolve over time. State-of-the-art approaches resort to LDA-based topic models [19], which have the problem that the number of topics is defined by the user. In our approach we use keyword co-occurrence networks and hierarchical community detection algorithms to infer topics, organized in a hierarchy, from the data. We used metadata of publications from the Authenticus system, a national repository of metadata of scientific publications authored by researchers associated with Portuguese institutions, with more than 400,000 records.

We structured our methodology in three different steps. In the first, we create a keyword co-occurrence network and infer a topic hierarchy from it. We experimented with different variations of networks, by including keywords extracted from the title and abstract of publications, and also semantic similarity connections between keywords.

Having a topic hierarchy, we link publications to those topics and then create a temporal profile of any entity with associated publications, whether a researcher or an institution. After creating the profile, in the last step, we present it in a user-friendly way. In the results we show examples of promising profiles; however, we do not have ground-truth data to validate the methodology.

(5)

Contents

Abstract
Resumo
Contents
List of Tables
List of Figures
List of Algorithms
Acronyms
1 Introduction
  1.1 Bibliometrics
    1.1.1 Authenticus
  1.2 Expertise Profiling
  1.3 Objectives
  1.4 Organization
2 Basic Concepts and Related Work
  2.1 Expertise Retrieval
    2.1.1 Temporal Expertise Profiling
    2.1.2 Topical Hierarchies
  2.2 Topic Modelling
    2.2.1 LDA
    2.2.2 Author-Topic Model
  2.3 Network Analysis
    2.3.1 Centralities
    2.3.2 Community Detection
  2.4 Bibliometric Networks
  2.5 Keyword Extraction
3 Methodology
  3.1 Introduction
  3.2 Topic Modelling
    3.2.1 Keyword Co-occurrence Network
    3.2.2 Co-Occurrence with Semantic Similarity
    3.2.3 Community Detection
  3.3 Entity - Topic Association
    3.3.1 Publication - Topic Association
    3.3.2 Entity - Publication Association
    3.3.3 Temporal Decay
  3.4 Visualization
    3.4.1 Interests Selection
    3.4.2 Temporal Hierarchical Visualization
    3.4.3 Topic Representation
    3.4.4 Temporal Evolution
4 Application on the Authenticus data
  4.1 Data Pre-Processing
  4.2 Keyword Extraction
  4.3 Keyword Co-Occurrence Network
  4.4 Co-occurrence with Semantic Similarity
  4.5 Temporal Expertise Profile
    4.5.1 WOS Classification
5 Conclusion
  5.1 Main Contributions
  5.2 Future Work
Bibliography

List of Tables

2.1 22 WOS topics from the first hierarchy level
3.1 Top-5 most important keywords in a topic according to different centralities
4.1 Size of author-validated Authenticus data
4.2 Examples of the keyword pre-processing
4.3 Evaluation results of different keyword extraction algorithms
4.4 Summary of the keyword extraction procedure
4.5 Details of both networks
4.6 Size of topical hierarchies
4.7 Information about hybrid networks for various values of w

List of Figures

1.1 Authenticus Researcher Profile
2.1 Main actors in Expertise Retrieval: persons (p), topics (t) and documents (d)
2.2 Author - Expertise matrix
2.3 Example of the hierarchy of WOS classification
2.4 Example of a graph. Image taken from [5]
2.5 Simple graph with three communities. Image taken from [13]
2.6 Example of a network and its dendrogram. Image adapted from [13]
2.7 Example of a network with overlapping communities. Image taken from [13]
2.8 Example of a MultipartiteRank graph. Image taken from [8]
3.1 Keyword co-occurrence network of a single researcher
3.2 Keyword co-occurrence network with semantic similarity of a researcher (w = 0.5)
3.3 Representation of topic hierarchy T
3.4 Keyword co-occurrence networks colored according to the topic each keyword belongs to, for different granularity levels
3.5 Example of the thematic structure of a publication
3.6 Example of a profile of an entity without time component with 2 hierarchical levels
3.7 Temporal profile
3.10 Temporal decay with λ = 0.5
3.11 Normalization of a hierarchical tree
3.12 Temporal decay normalized
3.13 Expert profile of entity Peter
3.14 Process of selecting top-3 topics
3.15 Filtered grid with top-3 topics per year
3.16 Evolution of interests of Peter on two different granularity levels
3.18 Evolution of interests from granularity level 1 of Peter
3.19 Line Chart Example
4.1 Top keywords chosen by DCC authors vs top keywords extracted from DCC publications
4.2 Variation of number of publications and keywords after filtering low-occurring keywords
4.3 Three most used communities and their hierarchical structure
4.4 Evolution of interests of Departamento de Ciência de Computadores (DCC) publications
4.5 Variation of WOS classification of DCC publications

List of Algorithms

1 Function to extract keywords from a publication
2 Function to pre-process a keyword k
3 Procedure to create the keyword co-occurrence network from a list of publications P
4 Procedure to detect hierarchical communities
5 Procedure to generate topic hierarchy T

Acronyms

ACT Author-Conference-Topic Model

ATF Author Topic Flow

ATM Author-Topic Model

CRACS Center for Research in Advanced Computing Systems

DCC Departamento de Ciência de Computadores

FCUP Faculdade de Ciências da Universidade do Porto

ISI International Scientific Indexing

LDA Latent Dirichlet Allocation

R&D Research and Development

TATM Temporal Author Topic Model

TOT Topics Over Time

WOS Web-Of-Science


Chapter 1

Introduction

Throughout their career, researchers work on various scientific studies. A career can be decades long and, in many cases, because of the evolution of science and the rise of new discoveries, or due to collaborations with different colleagues, researchers tend to change the subjects they write about. Due to the social component of Research and Development (R&D), either through direct collaboration between researchers or indirectly, by building upon previous works, a profile that describes what a researcher's work focuses on and how it evolved over time can be very important, because it supports those social dynamics by giving others a way to become familiar with the researcher.

To fulfill this need of describing their research fields, many researchers maintain personal pages with sections dedicated to it. However, that is a manual process, and it is not always trivial for researchers to characterize themselves. Personal pages also tend to simply list the research areas they have worked on, discarding the temporal aspect and the fact that interests might have changed over the years. This evolution of interests is important because it gives a better idea of how the career evolved and thus a better understanding of the researcher.

A temporal profile, however, does not function solely as a "business card": it can also be used to study and model the dynamics of personal expertise [12] and, with that, find characteristics of researchers such as their tendency to change research areas, and even predict how their career path might change. Other use cases are, for example, using this profile to improve the results of a search engine based on the user's interests [40], or comparing researchers' careers to find researchers with similar career paths. Having this profile can also assist in the problem of finding experts in a certain research field.

1.1 Bibliometrics

Every year millions of scientific studies are published [50], and with that came the need to measure and analyze the scientific output, so that academics and policy makers can have an informed idea of how science is evolving [15].

One big component of this analysis of science uses publications and their metadata attributes such as the title, abstract, authors, journal of publication, etc. as the object of study. These approaches are categorized as bibliometrics.

1.1.1 Authenticus

To make bibliometric studies possible, many countries and institutions have developed bibliographic databases that aggregate publication data. Authenticus1 [11], a national repository of scientific publications developed by CRACS / INESC TEC, uses some of these databases to aggregate metadata of scientific publications produced by researchers associated with Portuguese institutions. To build the repository, Authenticus draws on 4 different sources2 from which it automatically finds publications and then joins the metadata available from each source.

This thesis was done in collaboration with the Authenticus team, which provided their collected data. At the time of publishing this thesis, the data had records of a total of 460,353 publications from 85,410 authors.

Authenticus also has a social network component with researcher profile pages which provide bibliometric information about researchers, such as the list of publications, as well as other information derived from the publication metadata, such as the number of citations or how the number of publications evolved over time. Regarding the interests of the researcher, this page currently simply contains a word cloud of the keywords chosen by the researcher across all publications, as can be seen in Fig. 1.1. Although one can get an idea of the research interests, this approach does not take the temporal component into account, and it can also be very confusing due to the large number of different keywords that a researcher might use during their career. With this thesis, our main goal is precisely to develop a method that describes the researcher's interests and how they changed over time.

1 www.authenticus.pt


Figure 1.1: Current Authenticus researcher profile

1.2 Expertise Profiling

In the literature, the task of linking authors with their areas of expertise is named Expertise Profiling, which, with the addition of the time component, becomes Temporal Expertise Profiling. In most cases the areas of expertise are not predefined and must instead be discovered from the data. For that reason, previous approaches to this problem use topic models based on the Latent Dirichlet Allocation (LDA) model, a state-of-the-art algorithm that detects n topics from a set of documents, where n is chosen by the user. Even though there are multiple variations of LDA adapted to Expertise Profiling [4, 20, 36, 42, 43, 48] and even Temporal Expertise Profiling [9, 19, 22, 49], they all share the same problem: the number of topics must be predefined by the user.

1.3 Objectives

The main objective of this thesis is to find a new method that uses bibliometric data to detect the research interests of a researcher and how those interests evolved throughout their career. We intend to contribute towards solving some of the limitations of LDA, such as having the number of topics defined by the user, and our approach relies on networks and community detection algorithms to form the topics. Having the researcher's interests over time, we also want to create a graphical profile that displays that evolution in a user-friendly way.

1.4 Organization

This dissertation is organized as follows. In Chapter 2 we give an overview of previous work in the area and explain the concepts needed for a better understanding of the remainder of the dissertation. Chapter 3 details the methodology developed in this thesis. In Chapter 4 we apply the methodology to the Authenticus data. Finally, in Chapter 5 we give a short summary of the thesis and of what can be done in the future to further this line of research.


Chapter 2

Basic Concepts and Related Work

In this chapter we explain the concepts required for understanding the remainder of the thesis. We also describe previous approaches to problems similar to the one tackled in this thesis. We begin by describing what expertise retrieval is and, more specifically, expertise profiling. Then we explain the need for topic modelling in expertise profiling and state-of-the-art approaches to it. Next, we give the core concepts of network analysis that will be used throughout our work.

2.1 Expertise Retrieval

In the literature, the problem of linking persons with topics they are experts on is named Expertise Retrieval [3]. In most approaches to this problem, there are three actors involved: persons p, topics t and documents d, where the documents are used to establish the relation between persons and topics, as shown in Figure 2.1.

Expertise Retrieval is a broad term that can be divided in two different problems:

Figure 2.1: Main actors in Expertise Retrieval: persons (p), topics (t) and documents (d).

• Expertise Finding: Given a topic, find persons who are experts on that topic.

• Expertise Profiling: Given a person, find topics that the person is an expert on.

In the context of our thesis, we have an expertise profiling problem where we want to, given a researcher, find the topics that the researcher is an expert on, based on their publications. Because we want those topics to convey the researcher's interests, we don't take into account the quality of the work when measuring expertise, only the quantity.

Even though Expertise Retrieval is highly associated with academia [12,14,19,22,42,51] and the association between researchers and their areas of investigation, it also has applications in other fields such as social networks [16] or enterprises [39].

Figure 2.2: Author - Expertise matrix taken from [3]

As mentioned in [3], "Expertise Finding and Expertise Profiling are two sides of the same coin". The reason for this is the fact that both problems can be reduced to the same one with the help of the matrix in Fig. 2.2. This matrix has two dimensions, one for the persons (candidates) and one for the topics (areas). The value of each cell, Score(a, e), represents the proximity between a candidate a and an area e; in the case of our thesis, we want this proximity to represent interest. Although in this example the matrix is binary, and each cell is either filled or not, we believe that a candidate can have a higher association with some expertise areas than with others, so we want this proximity to be numerical. With this matrix we can reduce expertise finding to the task of, given an expertise area, filling the corresponding column, and expertise profiling to the task of, given a candidate, filling the corresponding row; both, however, have the final goal of filling the entire matrix.


Despite this conclusion, Balog and de Rijke [2] found that simply 'inverting' an expertise finding algorithm is not an effective method for expertise profiling. The reason for this is that in expertise finding the scoring procedure only considers documents of that topic, while expertise profiling only considers the documents of the author, meaning that the score is not calculated with the same data in both approaches. This problem reduction also assumes that there are n predefined areas and m candidates and that the matrix is completely filled, which is not true. Instead, in expertise finding, a topic query such as 'which researchers are most expert in area x' is given and researchers are ranked based on their association with x. On the other hand, in expertise profiling, for areas to be ranked they must first be discovered from the data.

Expertise Profiling can also be divided into two types with regard to how the profile is constructed. If the profile is generated from implicit user feedback, as, for example, on personal pages, it is considered implicit expertise profiling, while explicit expertise profiling derives the profile from the user's work [38].

2.1.1 Temporal Expertise Profiling

Expertise profiling takes only two dimensions into account, persons and topics. Because persons and their expertise change over time, as we have mentioned before, there is a need to add a temporal component to Expertise Retrieval, especially to expertise profiling, even though there are also examples of temporal expertise finding [10]. For temporal expertise profiling, the association between persons and topics must be calculated for each point in time. Instead of a 2-D matrix, temporal expertise profiling is extended to a 3-D matrix where each cell value, Score(a, e, t), represents the proximity between author a and expertise area e for timestamp t.

While Temporal Expertise Profiling consists mostly of associating persons with topics and time, some studies have also focused on the dynamics of these associations. Rybak et al. [37], after creating expert profiles, also propose a strategy to find and characterize the expertise changes in the profile. Another example is Fang and Godavarthy [12], who use people's expertise profiles to create a model that characterizes their tendency to change or maintain expertise areas and also predicts future research changes. With this thesis our goal is not only to create explicit temporal expert profiles but also to focus on their presentation to a user.


2.1.2 Topical Hierarchies

Until this point, research areas have been presented as a list where every topic is independent of the others. Rybak et al. [37], however, proposed that expertise areas be organized in a hierarchy with super- and sub-topic relations, meaning that research areas have different levels of granularity and can be divided into more specific areas. Their study, however, does not focus on the creation of a topical hierarchy, but rather uses the ACM classification system of publications1 and its topics, organized in a taxonomy, to infer the interests; these topics were manually generated and mapped to the publications of their system.

Another example of a topical hierarchy is provided by Web-Of-Science (WOS)2, a web service that indexes scientific citations and manually classifies its publications using a topical hierarchy that is also manually built. This topical hierarchy has only two granularity levels, with 22 different topics at granularity level 0 (the broadest, presented in Table 2.1), which are then divided into 266 more specific ones. A small example of this topical hierarchy is presented in Figure 2.3, where both Artificial Intelligence and Information Systems are sub-categories of a broader area, Computer Science. WOS is also one of the sources of the Authenticus data, and therefore part of the Authenticus publications are classified with these categories; however, the classification is based solely on the journal of publication and not on the contents of the publication.

Figure 2.3: Example of the hierarchy of WOS classification

1 https://dl.acm.org/ccs/ccs.cfm


Agricultural Sciences
Biology & Biochemistry
Chemistry
Clinical Medicine
Computer Science
Economics & Business
Engineering
Environment/Ecology
Geosciences
Immunology
Materials Science
Mathematics
Microbiology
Molecular Biology & Genetics
Multidisciplinary
Neuroscience & Behavior
Pharmacology & Toxicology
Physics
Plant & Animal Science
Psychiatry/Psychology
Social Sciences, general
Space Science

Table 2.1: 22 WOS topics from the first hierarchy level.

2.2 Topic Modelling

As mentioned before, Expertise Profiling has the inherent problem that, in order for topics to be associated with a person, those topics must first be discovered. For that reason, the general approach to expertise profiling [12, 14, 19, 22, 42, 51] consists in applying topic models to the documents' text. In the academic context, the documents that are modeled are publications' titles and abstracts, since they explain the contents of the publication and are publicly available.

2.2.1 LDA

One of the first topic models to emerge, and still state-of-the-art, is Latent Dirichlet Allocation (LDA) [6]. This topic model detects n topics from a set of documents, where n is chosen by the user, and models each document as a distribution over those topics. Each topic is represented as a multinomial distribution over all the words in the vocabulary. LDA hypothesizes that each word in a document is generated by first sampling a topic from the document's topic distribution and then sampling a word from the word distribution of that topic, so the algorithm has an iterative process to learn those distributions. LDA, however, has a major flaw in that the number of topics must be predefined by the user.
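As an illustration of this limitation, a minimal LDA sketch with the gensim library (not code from the thesis; the toy documents are hypothetical) shows that the number of topics must be fixed by the user before training:

# Minimal LDA sketch with gensim (illustrative only, not the thesis's code).
# The point to notice is that num_topics must be chosen up front.
from gensim import corpora, models

docs = [
    "parallel algorithms for graph community detection".split(),
    "topic models for expertise profiling of researchers".split(),
    "keyword co-occurrence networks and community detection".split(),
]

dictionary = corpora.Dictionary(docs)               # word <-> id mapping
corpus = [dictionary.doc2bow(d) for d in docs]      # bag-of-words vectors

lda = models.LdaModel(corpus, id2word=dictionary, num_topics=2, passes=10)

for topic_id, words in lda.print_topics(num_words=5):
    print(topic_id, words)                          # each topic = word distribution
print(lda.get_document_topics(corpus[0]))           # document = topic distribution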

While LDA is the simplest version of a topic model, only linking documents with topics, various models have been adapted from it. For example, Allahyari and Kochut [1] added a new layer between documents and topics, by instead modeling documents as a mixture of concepts and the concepts as a distribution of words.

2.2.2 Author-Topic Model

Regarding expertise profiling, the first LDA-based topic model was the Author-Topic Model (ATM) [36]. Instead of linking documents with topics, ATM links authors with topics, by hypothesizing that each word in a document is generated by first sampling an author a from the document's authors, then sampling a topic t from the topic distribution of a, and then sampling a word from the word distribution of t. Because of this, instead of each document being represented as a distribution of topics, each author is represented as a distribution of topics.

Following ATM, more author-topic models emerged. The Author-Persona-Topic model [4] allows an author to have multiple personas, where each persona is a distribution of topics, by dividing the documents of an author into clusters.

Tang et al. [42] even integrated academic conferences with the Author-Conference-Topic Model (ACT), which not only models the relation between authors and topics but also models their relation with conferences. In the same vein, Wang et al. [48] extend the approach of Tang et al. [42] by adding a subject layer to conferences and a mapping between subjects and topics. Other examples focused on bibliometric data have also included citations [20, 43] in the topic model.

2.2.2.1 Temporal Topic Models

So far, the models mentioned have been used for expertise profiling, but they don't take time into consideration. Wang and McCallum [49] proposed Topics Over Time (TOT), adding a time dimension to LDA, but their approach only modeled documents and not authors.

Kong et al. [22] used LDA for temporal expertise profiling by grouping publications with the same author and year of publication, joining their abstracts, and running LDA on the joined abstracts. With this approach each document in the corpus is actually all the documents of a researcher for a year, so the topic distribution represents that researcher/year combination.

Daud et al. [9] proposed the Temporal Author Topic Model (TATM), which models the topic distribution of authors over time. It considers that each author has a distribution of topics and that each topic has a word distribution and a timestamp distribution, meaning that each topic has a temporal pattern.


Jeong et al. [19] considered TATM limited because each topic has a temporal pattern independent of the author, and proposed Author Topic Flow (ATF). ATF differs from TATM in that it has a time distribution for each combination of author and topic, meaning that, unlike TATM, ATF has a temporal pattern per author and topic combination, which better represents the evolution of authors' interests. Currently, ATF is the state-of-the-art approach to Temporal Expertise Profiling.

Topic models based on LDA, however, retain its limitation: the number of topics must be user-defined, which is subjective and doesn't always translate to the true distribution of topics. For these reasons we tried a different approach to topic modelling, based on the analysis of networks and community detection algorithms.

2.3 Network Analysis

A graph, or network, G is a data structure composed of a set of vertices, or nodes, V(G) and a set of edges E(G). While the vertices are the components of the network, the edges represent the relationships between those components. Fig. 2.4 is an example of a graph where each node represents an actor and each edge indicates that the two actors co-starred in a movie. In this example, the graph is unweighted, because the relations between nodes are binary, meaning that two nodes either are connected or not. In weighted networks, edges have an associated weight attribute that represents how strong the connection is. An example would be weighting each edge of the graph with the number of movies the two actors co-starred in.

If the edges have a direction from one node to another, the graph is directed, meaning that between two nodes u and v it is possible to have an edge that connects u to v and a different one that connects v to u. In cases where the relations between the nodes have no direction, as in the example in Fig. 2.4, the graph is undirected. Any subset of nodes v ∈ V(G) together with a subset of edges e ∈ E(G) forms a subgraph G′ of G.

2.3.1 Centralities

One frequent task in network analysis is determining the most important node in a network. In graph theory, this problem translates to finding the most central node in the network. There are various equations to calculate the centrality of the nodes and, depending on the context, some centralities are more accurate than others. In this thesis, centralities will be important when representing research areas.

Figure 2.4: Example of a graph. Image taken from [5].

Degree centrality The simplest notion of centrality is degree centrality. The degree of a node is the number of links of that node. In the case of directed networks, there are two degree centralities: the out-degree, which is the number of links going out of the node, and the in-degree, which is the number of links coming into the node.

Betweenness centrality Another example of a centrality is betweenness centrality. This centrality is calculated by the equation:

BC(v) = \sum_{s,t \in N} \frac{\sigma_{st}(v)}{\sigma_{st}}    (2.1)

where N is the set of nodes of the graph, \sigma_{st} is the number of shortest paths between s and t, and \sigma_{st}(v) is the number of shortest paths between s and t that pass through node v. This centrality is useful in communication networks, because more information passes through nodes with higher betweenness centrality.

Closeness centrality How close a node is to all the other nodes can also be a measure of its importance. That is called closeness centrality and it is measured with the formula:

CC(v) = \frac{|N|}{\sum_{u \in N} d(u, v)}    (2.2)

where |N| is the number of nodes in the network and d(u, v) is the shortest distance between nodes u and v.

PageRank Ranking web pages based on their importance is a problem that a search engine like Google has to address. PageRank [30] is the algorithm used by Google and, in simplified form, the rank of a node is calculated as:

PageRank(v) = \sum_{u \in B_v} \frac{PageRank(u)}{N_u}    (2.3)

where B_v is the set of nodes that link to v and N_u is the number of outgoing links of u. The idea of this metric is that a node is more important if it is linked to by many nodes and by important nodes.
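As an illustration, the centralities above can all be computed with the networkx library; this is a sketch on a toy graph, not code from the thesis:

# Computing the centralities described above with networkx (illustrative sketch).
import networkx as nx

G = nx.karate_club_graph()  # small example graph shipped with networkx

degree      = nx.degree_centrality(G)       # normalized degree
betweenness = nx.betweenness_centrality(G)  # Eq. (2.1), normalized by default
closeness   = nx.closeness_centrality(G)    # Eq. (2.2)
pagerank    = nx.pagerank(G)                # Eq. (2.3), with damping factor 0.85

# most central node under each measure
for name, scores in [("degree", degree), ("betweenness", betweenness),
                     ("closeness", closeness), ("pagerank", pagerank)]:
    best = max(scores, key=scores.get)
    print(f"{name:12s} -> node {best} ({scores[best]:.3f})")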

2.3.2 Community Detection

Another frequent use of graphs is to study structure in data, such as community structure. In graph theory, a community can be loosely described as a locally dense subgraph, in other words, a subgraph with a large number of edges connecting its nodes. Various definitions satisfy this property and, depending on the context, the rules that define a community vary.

One simplistic approach is to consider each clique of a graph as a community. A clique is a subgraph where each node is connected to all the other nodes in the subgraph. This community definition is not well regarded in most cases because it is very strict, in the sense that a subgraph with every possible connection except one would not comply with this rule, although it could still be considered locally dense [13].

Another fact to take into account when detecting communities is that, however locally dense a subgraph is, it should hardly be considered a community if it is also densely connected to the rest of the graph. So, when testing whether a subgraph C is a community, not only the intra-connections of C but also its inter-connections with the rest of the graph should be taken into account. Considering a subgraph C of a graph G, the internal degree of a node x, k_x^{int}, is the number of edges between node x and any other node in C, and the external degree of node x, k_x^{ext}, is the number of edges between node x and any node of G outside of C. With these metrics, two definitions of community are [13]:

Figure 2.5: Simple graph with three communities. Image taken from [13]

• Strong Community: every node of C has an internal degree higher than its external degree.

k_x^{int} > k_x^{ext}, \forall x \in C    (2.4)

• Weak Community: the sum of internal degrees of C is higher than its sum of external degrees.

\sum_{x \in C} k_x^{int} > \sum_{x \in C} k_x^{ext}    (2.5)

These definitions of community are examples of local definitions, in the sense that the communities are viewed as separate entities and are evaluated independently of the whole graph. On the other hand, global definitions are community definitions that satisfy a global property of the graph. In those cases, identical subgraphs may lead to different results depending on the graph they are part of.

As for global definitions, a popular approach is to compare the whole graph with its null model, i.e. a random version of the original graph that maintains some structural features. The basic idea is that a subgraph is considered a community if its internal degree is higher than the expected internal degree of the same subgraph in the null model of the graph. This is the core concept of modularity, a metric that quantifies the quality of a community in a network by comparing it with its null model.
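For reference, a common formulation of modularity is the standard Newman-Girvan form, stated here for completeness since the thesis does not reproduce it:

Q = \frac{1}{2m} \sum_{i,j} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta(c_i, c_j)

where A_{ij} is the adjacency matrix, k_i the degree of node i, m the number of edges, c_i the community of node i, and \delta the Kronecker delta.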

So far we have only mentioned examples of what constitutes a community. The goal of a community detection algorithm is to divide a network into those communities. In the simplest case, where each node belongs to only one community, that division is called a partition. There are multiple possible partitions of a graph, and the goal of community detection is finding which possible partition has the highest quality; because of that, modularity is popular as a quality function of a partition. The straightforward approach to find that solution would be to test every partition; however, that is not feasible because the number of partitions increases exponentially with the size of the network [13]. Instead, community detection algorithms use heuristic strategies, such as greedy techniques, to reduce the search space.

Regarding the search strategy, community detection algorithms can be divided into two different types: divisive algorithms and agglomerative algorithms. While divisive algorithms start by considering the entire graph as a community and iterate through the search space by removing low-similarity links, thus isolating communities, agglomerative algorithms start with each node as a community and then iterate by merging communities.

The Girvan-Newman algorithm [29] is one example of a divisive algorithm. The algorithm starts with the whole graph as a community and iteratively calculates the edge betweenness of all edges, removing the edge with the highest betweenness, until there are no edges left. Modularity is only used in the end to choose the best partition. This algorithm has a worst-case complexity of O(m²n) for a graph with n nodes and m edges.

Later, Newman [28] proposed a greedy approach with an agglomerative algorithm. The algorithm begins with each node as a community and iteratively merges communities if the merge leads to an increase of the partition's modularity. This algorithm has instead a complexity of O((m + n)n). Another agglomerative example is the Louvain algorithm [7], which starts in the same way, but in each iteration each node is tentatively removed from its community and added to the community of each of its neighbors, and the move that most increases modularity is kept.
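As an illustration of the agglomerative, modularity-driven strategy, below is a minimal sketch using networkx's built-in Clauset-Newman-Moore greedy implementation (chosen here only for illustration; it is not the thesis's code, which uses Louvain later on):

# Agglomerative modularity optimization with networkx's greedy implementation.
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities, modularity

G = nx.karate_club_graph()

communities = greedy_modularity_communities(G)   # list of frozensets of nodes
print(f"{len(communities)} communities found")
print(f"modularity Q = {modularity(G, communities):.3f}")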

2.3.2.1 Hierarchical Communities

So far we have looked at community detection as a problem of finding the partition that better divides the graph into communities according to some metric, such as modularity. However,


Figure 2.6: Example of a network and its dendrogram. Image adapted from [13]

some networks have a hierarchical structure, in the sense that communities are composed of smaller communities, which are themselves composed of even smaller communities, and so on. A real-world example is a social network, where a person can belong to a community that represents their school, which itself belongs to a community that represents all the schools in the city, which in turn belongs to a community that represents the schools in the country.

The end result of a hierarchical community detection algorithm, instead of a partition, is a dendrogram like the one represented in Figure 2.6; depending on where the dendrogram is cut, different partitions are obtained.

Both previously mentioned agglomerative algorithms, Newman's greedy algorithm [28] and Louvain [7], can yield dendrograms, because when iteratively merging communities while searching for the optimal solution, they consider the merged communities as sub-communities of the one they merge into.

2.3.2.2 Overlapping Communities

Up to this section we considered each node in the network to belong to a single community. However, many real-world examples don't follow this rule. For example, in the context of social networks, a person can belong to various communities, such as a school community and a family community. Some algorithms take this into account and detect communities in the network while allowing each node to belong to more than one community. Figure 2.7 shows an example of a network with overlapping communities.


Figure 2.7: Example of a network with overlapping communities. Image taken from [13]

2.4 Bibliometric Networks

Various studies have used networks to find structure in bibliometric data. The most common bibliometric networks study three types of relations: co-authorship relations, citation relations and keyword co-occurrence relations [44].

Co-authorship networks [23] are networks that capture collaboration between different entities that produce publications. These entities are represented as nodes and can be anything from a single researcher to a research center or even a country. Each edge represents the number of publications co-authored by the two entities. With this network it is possible to study, for example, how collaboration between researchers is structured, and, with the application of community detection algorithms, it can also be used to find communities of researchers that work together [34].

Citation networks, on the other hand, are more focused on the relations between publications. These networks normally consider publications as nodes and the edges measure how strongly the publications are related based on citations. Two examples are co-citation networks, which consider that the larger the number of publications that co-cite two publications, the stronger their relation, and bibliographic coupling networks, which, on the contrary, consider a strong relation between publications when both have overlapping citations. These networks have been used for topic detection [18, 45, 47] by clustering documents and considering the clusters as different topics. However, with citations there is the potential issue of social inconsistencies in citation patterns, because when citing a publication there are other factors besides topic similarity, e.g. publications with a higher number of citations tend to be cited even more, and possible rivalries lead some authors to avoid citing each other's works [46].

Lastly, there are keyword co-occurrence networks. This type of network uses textual information to structure semantic relations. The idea is that words that tend to be used together have thematic similarities. These networks are composed of terms, of one or more words, as nodes, and edges represent the number of publications in which those words were used together. Keyword co-occurrence networks have been mainly used to map the intellectual structure of different journals [21, 24, 26, 33, 41]. In this thesis we will use this type of network for topic modelling, by detecting communities of keywords that tend to be used together.

2.5 Keyword Extraction

In our methodology we experiment not only with author-given keywords for topic detection but also with keywords extracted from text. For that we use algorithms that extract from a text the keywords that best describe it. That task is named keyword extraction.

Keyword extraction algorithms are typically composed of two steps [17]. The first step, Candidate Selection, consists in the selection of phrases that can be chosen as keywords. In this step, stop words (i.e. words so common that they carry no thematic value) are removed and, depending on the algorithm, words that don't correspond to specific word classes, such as nouns and adjectives, are also removed. The second step, Candidate Weighting, assigns to each candidate a weight which represents the importance of the candidate in the document. This step is different for every algorithm, and there are both supervised and unsupervised approaches.

TextRank [27] represents the document as a graph where each node is a word of the document and each edge means that the two words co-occurred within a window of N words. Google's PageRank algorithm is used for the weight calculation, and the PageRank centrality of each node corresponds to the weight of the respective candidate.
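A self-contained, simplified TextRank-style sketch (a window-based co-occurrence graph ranked with PageRank, omitting the part-of-speech filtering used in [27]) could look like this:

# TextRank-style keyword scoring (simplified sketch of the idea in [27]):
# build a word co-occurrence graph over a sliding window and rank words with PageRank.
import networkx as nx

def textrank_keywords(words, window=2, top_n=5):
    G = nx.Graph()
    for i, w in enumerate(words):
        for other in words[i + 1:i + window]:   # co-occurrence within the window
            if w != other:
                G.add_edge(w, other)
    scores = nx.pagerank(G)                     # candidate weight = PageRank centrality
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

text = "keyword co-occurrence networks support topic detection in keyword networks"
print(textrank_keywords(text.split()))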

RAKE [35], on the other hand, uses a more statistical approach to weight the candidates, scoring each word by the ratio between its degree (the number of different words that occurred next to the word) and its frequency in the text.

Figure 2.8: Example of a MultipartiteRank graph. Image taken from [8].

In a more recent approach, Boudin [8] proposed the MultipartiteRank algorithm, which ensures topical coverage and diversity of the detected keywords. This algorithm initially clusters words into topics based on their stem roots; the document is then represented as a graph where words are only connected if they belong to different topics, with edges weighted according to the distance between the candidates. The candidate weighting is then done with the PageRank algorithm. Fig. 2.8 is an example of such a graph.

Although keyword extraction algorithms consist solely of candidate selection and candidate weighting, there is an additional step to actually extract keywords, which is the picking of candidates, or candidate filtering. This step is context dependent, but usually the top-n keywords are chosen, where n is defined by the user according to the problem.

Chapter 3

Methodology

This chapter details every step of our methodology. We begin by giving the general idea of our approach in the first section. In the second section, we focus on the topic modelling aspect. In the third section, we focus on the association between entities, time and topics. In the fourth section we tackle the visualization of the research interests over time.

3.1 Introduction

Due to the limitations of LDA, we tried a different approach to topic modelling based on keyword co-occurrence networks and the concept of community detection. With this step we infer a set of topics organized in a hierarchy with topic - subtopic relations. From this topic hierarchy, we calculate Score(e, t, y) between any entity e with publications, each topic t and each timestamp y, and thus get a hierarchical representation of the interests at different points in time. While expertise profiling has focused on the creation of expert profiles for authors, we don't limit our methodology to the profiling of authors, but rather cover any entity with associated publications. Examples of such entities are countries and institutions. In the end, we handle the visualization of the evolution of interests over time for a user.

Our methodology is then divided into the following steps:

1. Topic Modelling: Represent the data in a network and infer topics by applying community detection algorithms to the network.

2. Entity-Topic Association: Calculate Score(e, t, y) between each entity e, topic t and year y.

3. Visualization: Display the variations of Score(e, t, y) graphically.

3.2 Topic Modelling

For the topic modelling step, we use keyword co-occurrence networks and community detection algorithms to cluster keywords that tend to be used in the same publications and therefore have high thematic similarity. This concept has been used in the literature to map and analyze the intellectual structure of various journals [21, 24, 26, 33, 41]. We take this idea and apply it to the expertise profiling task, more specifically to the detection of the topics represented in the data.

3.2.1 Keyword Co-occurrence Network

We begin by creating a keyword co-occurrence network that represents every publication available in the dataset. Keywords are terms, with one or more words, that give thematic information about a publication. Usually, the authors of a publication pick keywords not only to give the reader an idea of the themes of the publication but also so the publication can be easily indexed; those keywords are part of the publication metadata and are publicly available. While these keywords are expected to be the most reliable source of thematic information, since they are chosen by the author with that intention, we also noticed that not every publication has keywords associated. In fact, only 60% of the publications in Authenticus have keywords assigned. For that reason, we also looked at other sources of thematic information of a publication from which to extract keywords.

Keyword Extraction

Keywords are a good attribute to study because they are semi-structured data: while they are still text tokens, and text data always has inconsistencies, they are organized in a list. Other types of text data, such as the title and abstract of a publication, are written excerpts of text and therefore highly unstructured, meaning that while humans can read and interpret the text, a machine only interprets it as a chunk of unstructured data. For that reason we used a keyword extraction algorithm to select only the terms of the text that best summarize it. We have mentioned different examples of keyword extraction algorithms (TextRank [27], RAKE [35] and MultipartiteRank [8]); however, these algorithms only select and weight possible keyword candidates from a document. To actually pick the keywords from the candidates, independently of the algorithm, we used the dynamic threshold function proposed in [31], which only selects candidates with a weight higher than the average. This procedure can be seen in Algorithm 1.

Algorithm 1 Function to extract keywords from a publication
Input: publication p and a keyword extraction algorithm ExtractKeywordsAlgorithm
Output: list of keywords
1: function ExtractKeywords(p, ExtractKeywordsAlgorithm)
2:     doc ← p.title ∪ p.abstract
3:     candidates ← ExtractKeywordsAlgorithm(doc)
4:     weightThreshold ← Mean(candidates.weights)
5:     return Filter(candidates, weight > weightThreshold)    ▷ Keep only candidates with weight > weightThreshold
6: end function
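A possible Python rendering of Algorithm 1, assuming a hypothetical extract_candidates function that stands in for any of the extraction algorithms above and returns (candidate, weight) pairs, is sketched below:

# Sketch of Algorithm 1. `extract_candidates` is a placeholder for any keyword
# extraction algorithm and is assumed to return (candidate, weight) pairs.
def extract_keywords(title, abstract, extract_candidates):
    doc = f"{title} {abstract}"
    candidates = extract_candidates(doc)                         # [(keyword, weight), ...]
    if not candidates:
        return []
    threshold = sum(w for _, w in candidates) / len(candidates)  # mean weight
    return [kw for kw, w in candidates if w > threshold]         # keep above-average candidates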

Keyword Cleaning

Considering that every publication has a set of keywords, either author-given or extracted, we apply a pre-processing procedure to every keyword before creating the network, to remove inconsistencies in the wording of the same concept by different authors, such as words in the plural form or with capitalized letters. This procedure consists in first splitting the keyword into word tokens, considering that words are separated either by a space or a punctuation mark. Each word token is first converted to lower case and then deleted if it is a stopword1; otherwise the word is converted to its root form using the PorterStemmer [32] algorithm. Lastly we join the different words with the symbol '_', producing the processed version of the keyword. We use the processed version of the keyword in every step of our methodology except the visualization, where we present the original version of the keyword to the user. This procedure is detailed in Algorithm 2.

1 A word so frequent in the language that it carries no thematic value, e.g. 'the'. We used the stopwords list from the gensim library.


Algorithm 2 Function to pre-process a keyword k
Input: keyword k
Output: processed version k′
1: function PreProcessKeyword(k)
2:     wordTokens ← Split(k)
3:     k′ ← []
4:     for word ∈ wordTokens do
5:         word ← Lowercase(word)
6:         if word ∉ stopwords then
7:             k′ ← k′ ∪ PorterStemmer(word)
8:         end if
9:     end for
10:    k′ ← Join(k′, delimiter = '_')
11:    return k′
12: end function
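A minimal Python sketch of Algorithm 2, assuming NLTK's PorterStemmer and gensim's stopword list (both mentioned in the text, though the thesis's own code is not shown):

# Sketch of Algorithm 2: lowercase, drop stopwords, stem, and re-join with '_'.
import re
from nltk.stem import PorterStemmer
from gensim.parsing.preprocessing import STOPWORDS

stemmer = PorterStemmer()

def preprocess_keyword(keyword):
    tokens = re.split(r"[\s\W]+", keyword)            # split on spaces and punctuation
    stems = [stemmer.stem(t.lower())
             for t in tokens
             if t and t.lower() not in STOPWORDS]     # drop empty tokens and stopwords
    return "_".join(stems)

print(preprocess_keyword("Parallel Algorithms"))      # e.g. 'parallel_algorithm'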

Assuming we have a list of publications with keywords, we create the co-occurrence network by iterating over each publication and updating the number of co-occurrences between each pair of keywords of the publication. In the end the network has every keyword in the data as a node, and each edge is weighted with the number of publications in which both keywords were used. Algorithm 3 details the network creation process.

A simple example of a co-occurrence network is Fig. 3.1. The network was generated using solely the publications of one researcher. Looking at this example we can mentally divide the graph and get a good understanding of the topics that the researcher writes about, especially because the graph is composed of two separate components. While detecting a researcher's interests by looking solely at their own publications might be possible in this example, it would mean that topics would not be shared among different entities and expert profiles could not be compared. Instead, we detect topics from the entirety of the data, by creating the keyword co-occurrence network using every publication in the dataset, and then associate each entity with those topics.


Algorithm 3 Procedure to create the keyword co-occurrence network from a list of publications P
1: procedure CreateCoOccurrenceNetwork(P)
2:     G ← InitializeNetwork()
3:     for p ∈ P do
4:         for key ∈ p.keywords do
5:             if key ∉ G.nodes then
6:                 G.AddNode(key)
7:             end if
8:         end for
9:         for (key1, key2) ∈ CombinationsSize2(p.keywords) do
10:            if (key1, key2) ∉ G.edges then
11:                G.AddEdge((key1, key2), weight = 0)
12:            end if
13:            G.edges[(key1, key2)].weight += 1
14:        end for
15:    end for
16: end procedure
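A minimal sketch of Algorithm 3 using networkx, assuming publications is a list of pre-processed keyword lists, one per publication:

# Sketch of Algorithm 3: build the keyword co-occurrence network with networkx.
from itertools import combinations
import networkx as nx

def build_cooccurrence_network(publications):
    G = nx.Graph()
    for keywords in publications:
        G.add_nodes_from(set(keywords))
        for k1, k2 in combinations(sorted(set(keywords)), 2):
            if G.has_edge(k1, k2):
                G[k1][k2]["weight"] += 1          # one more publication uses both keywords
            else:
                G.add_edge(k1, k2, weight=1)
    return G

G = build_cooccurrence_network([["lda", "topic_model"], ["lda", "topic_model", "network"]])
print(G["lda"]["topic_model"]["weight"])          # -> 2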


3.2.2 Co-Occurrence with Semantic Similarity

While co-word analysis is proven to work, relying solely on co-occurrences to measure the thematic similarity of two keywords has some flaws. Namely, the semantics of the keywords are not taken into account. Keywords, both author-given and extracted, are picked from an uncontrolled vocabulary, and therefore different researchers might use different words to describe the same concept, words which don't necessarily have a high number of co-occurrences. This is not only the case for near-synonyms such as network and subgraph but also for terms with very close meaning such as parallel computing and parallel algorithms. For that reason we experimented with mixing the frequency with which two keywords co-occur and their semantic similarity, to measure how thematically close those keywords are.

Semantic Similarity Network

Our first step was to create a semantic similarity network with the same keywords as the co-occurrence network, but with each edge weight representing the semantic similarity between two keywords. To calculate the semantic similarity of two terms we used Wikipedia and its link structure, due to its vast quantity of knowledge. Wikipedia is an encyclopedia composed of articles about multiple concepts, and each article has links to the other articles it mentions. To calculate the semantic similarity between two keywords we first assign an article to each keyword and then calculate the number of links the two articles have in common. To assign an article to a keyword we query the keyword in Wikipedia's search function and pick the first search result. Although this method can lead to wrong assignments, and has room for improvement, we found it to give consistently good results.

The equation of semantic similarity between keywords x and y is, then, given by:

SemanticSimilarity(x, y) = \frac{|n(x) \cap n(y)|}{\max(|n(x)|, |n(y)|)}    (3.1)

where n(x) is the set of links of the article page for keyword x.
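A sketch of Equation (3.1), using the third-party wikipedia package as the client (an assumption made here for illustration; the thesis does not name the library it used to query Wikipedia):

# Sketch of Eq. (3.1): map each keyword to its first Wikipedia search result and
# measure the overlap of the articles' outgoing links.
import wikipedia

def article_links(keyword):
    title = wikipedia.search(keyword)[0]          # first search result, as in the text
    return set(wikipedia.page(title).links)

def semantic_similarity(x, y):
    nx_, ny_ = article_links(x), article_links(y)
    if not nx_ or not ny_:
        return 0.0
    return len(nx_ & ny_) / max(len(nx_), len(ny_))

print(semantic_similarity("parallel computing", "parallel algorithm"))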

Co-occurrence network normalization

The first step in combining two different types of data is reducing both to the same scale. That process is named normalization. In a co-occurrence network the edge weights correspond to the number of co-occurrences, which can be any non-negative integer in [0, +∞), so they must be normalized to a weight in [0, 1], where 1 indicates that both keywords always co-occur and 0 indicates that they never occur together. There is no consensus in the literature on how co-occurrence data should be normalized. Leydesdorff and Vaughan [25] argue that co-occurrence data should not be normalized, but rather the underlying occurrence matrix2. Zhou and Leydesdorff [52] prove that normalizing co-occurrence networks with the Ochiai coefficient is equivalent to normalizing the underlying occurrence matrix with cosine normalization, and for that reason we normalized the co-occurrence weights with the Ochiai coefficient, given by:

Ochiai(x, y) = \frac{c_{xy}}{\sqrt{c_x \times c_y}}    (3.2)

where c_{xy} represents the number of co-occurrences of x and y, and c_x represents the number of occurrences of x.

Hybrid Network

Having both the co-occurrence and the semantic similarity networks, we created a hybrid network with every keyword as a node. For the edge weights we used the following equation:

hybrid_{ij} = w \times Ochiai(i, j) + (1 - w) \times WikiSimil(i, j)    (3.3)

where w is a value between 0 and 1, which corresponds to the weight assigned to the semantic similarity network.

Fig. 3.2 is an example of a small hybrid network created from the same publications as Fig. 3.1, in this case with w = 0.5. Comparing both networks, we see that the addition of semantic relationships adds an edge between Parallel Algorithms and Parallel Computing, and between Logic Programming and Parallel Logic Programming, pairs which clearly have thematic similarities but would not be connected based on co-occurrence alone.
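A small sketch combining Equations (3.2) and (3.3), assuming the co-occurrence counts come from the network built earlier and that semantic_similarity implements Equation (3.1):

# Sketch of Eqs. (3.2) and (3.3): Ochiai-normalize co-occurrence counts and mix them
# with semantic similarity using weight w. `occurrences[k]` and `cooccurrences[(k1, k2)]`
# are assumed inputs derived from the co-occurrence network above.
from math import sqrt

def ochiai(k1, k2, cooccurrences, occurrences):
    return cooccurrences.get((k1, k2), 0) / sqrt(occurrences[k1] * occurrences[k2])

def hybrid_weight(k1, k2, w, cooccurrences, occurrences, semantic_similarity):
    return (w * ochiai(k1, k2, cooccurrences, occurrences)
            + (1 - w) * semantic_similarity(k1, k2))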


Figure 3.2: Keyword co-occurrence network with semantic similarity of a researcher. w = 0.5

3.2.3 Community Detection

Once we have a network with keywords as nodes and edges representing their thematic similarity, we use a community detection algorithm to find communities of keywords with high thematic similarity. We chose the Louvain algorithm because it is compatible with weighted networks and because of its expected time complexity of O(n log n), which makes it very popular for large networks.

3.2.3.1 Hierarchical Communities

In order to create a topic hierarchy, with topics of different granularities, we apply the Louvain algorithm to the network, getting a list of topics which we consider the broadest ones. Then, we break these topics into smaller, more specific topics, by recursively applying the Louvain algorithm to the subgraph of each previously detected topic, for the desired number of hierarchical levels. This procedure is detailed in Algorithm 5. Although Louvain can inherently generate a dendrogram, we opted for this procedure because it gives us better control over the size of the topical hierarchy. In the end we get a topical hierarchy T, organized in a tree, where Ti is the set of topics at hierarchical level i.

Algorithm 5 Procedure to generate topic hierarchy T
1: function DetectHierarchicalCommunities(G, HierLevels)
2:     T ← []
3:     T0 ← Louvain(G)
4:     for k ← 1 to HierLevels do
5:         Tk ← []
6:         for Ti ∈ Tk−1 do
7:             G′ ← G.Subgraph(Ti)
8:             Tk ← Tk ∪ Louvain(G′)
9:         end for
10:    end for
11:    return T
12: end function
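A possible sketch of Algorithm 5 using networkx's Louvain implementation (louvain_communities, available in networkx 2.8 and later; the thesis does not specify which implementation it used):

# Sketch of Algorithm 5 with networkx's Louvain implementation.
import networkx as nx
from networkx.algorithms.community import louvain_communities

def detect_hierarchical_communities(G, hier_levels):
    T = [louvain_communities(G, weight="weight", seed=0)]   # T[0]: broadest topics
    for k in range(1, hier_levels + 1):
        level = []
        for topic in T[k - 1]:
            subgraph = G.subgraph(topic)                     # re-cluster each topic
            level.extend(louvain_communities(subgraph, weight="weight", seed=0))
        T.append(level)
    return T   # T[k] is the list of topics (keyword sets) at granularity level k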

Figure 3.3: Representation of topic hierarchy T .

In Figure 3.4 we can see the keyword co-occurrence network of the researcher shown before in Fig. 3.1, but now with the keywords colored according to the topic they belong to. Note that the topics were detected on the network built from every publication in the data, while Figure 3.4 only represents the publications of this researcher. In the top figure (3.4a) we consider topics of granularity level 0 (T0) and in the bottom figure (3.4b) topics of granularity level 1 (T1). This example shows the importance of hierarchical levels: at the broadest level this researcher has interest in a single topic, while at a more specific level we see a more varied set of topics.


(a) Topics in T0.

(b) Topics in T1.

Figure 3.4: Keyword co-occurrence networks colored according to the topics the keywords belong to, for different granularity levels.

3.3 Entity - Topic Association

Having a topical hierarchy T and an entity e, we represent Score(e, T) as a tree similar to T, where each topic is weighted with Score(e, t), which quantifies the interest of entity e in topic t. In order to calculate Score(e, T) we first calculate Score(p, T) for every publication p of e and then, based on the number of publications and the topics they address, we create the expert profile.


Figure 3.5: Example of the thematic structure of a publication

3.3.1 Publication - Topic Association

To associate a publication p with T we assume that a publication can have various themes and that the more keywords the publication has in common with a topic t, the bigger Score(p, t) is. With this in mind we consider Score(p, t) as the fraction of keywords of p in common with topic t, as shown in the following equation:

Score(p, t) = \frac{|Keywords_p \cap Keywords_t|}{|Keywords_p|}    (3.4)

where Keywords_p are the keywords of publication p and Keywords_t are the keywords of topic t.

With this equation we calculate Score(p, t) for each topic t and then represent Score(p, T) as a tree like the one in Figure 3.5. In this example, the publication has a Score of 1.0 for Topic 3 in hierarchy level 0, because all of its keywords belong to that topic in T0, and Scores of 0.50 for topic 3_35 and 0.25 for topics 3_39 and 3_41 in hierarchy level 1.
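A minimal sketch of equation 3.4, assuming topics are given as keyword sets indexed by a topic identifier:

def publication_topic_scores(pub_keywords, topics):
    """Score a publication against each topic as the fraction of its keywords
    that belong to the topic (Eq. 3.4).

    pub_keywords: set of keywords of the publication.
    topics: dict {topic_id: set of keywords}.
    """
    if not pub_keywords:
        return {}
    return {
        topic_id: len(pub_keywords & topic_keywords) / len(pub_keywords)
        for topic_id, topic_keywords in topics.items()
    }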

3.3.2 Entity - Publication Association

Considering we have Score(p, T) for every publication p of entity e, we calculate Score(e, t) under the assumption that the bigger Score(p, t) is, and the more publications address a topic t, the more interested the entity is in that topic. With that in mind, without considering the time component yet, we


calculate Score(e, t) as the average Score(p, t) over every publication of entity e, as presented in equation 3.5.

Score(e, t) = \frac{\sum_{p \in P_e} Score(p, t)}{|P_e|}    (3.5)

where P_e is the set of publications of e.

Calculating Score(e, t) for each topic t in T we get Score(e, T), which is represented as a tree. An example of this tree, for an entity John, can be seen in Fig. 3.6.

Figure 3.6: Example of a profile of an entity without time component with 2 hierarchical levels.
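Equation 3.5 can then be computed per topic by averaging the publication scores, as in the following sketch; the dictionary-based layout is an assumption made for illustration.

def entity_topic_scores(publication_scores):
    """Average the publication-topic scores of an entity (Eq. 3.5).

    publication_scores: list of dicts {topic_id: Score(p, t)}, one per publication.
    """
    totals = {}
    for scores in publication_scores:
        for topic_id, value in scores.items():
            totals[topic_id] = totals.get(topic_id, 0.0) + value
    n = len(publication_scores)
    return {topic_id: total / n for topic_id, total in totals.items()} if n else {}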

3.3.3 Temporal Decay

To add the temporal component we calculate Score(e, y, T) for each timestamp y. Because publications are dated only with the year of publication, we consider each timestamp y as a different year. The straightforward approach to calculate Score(e, y, t) consists of grouping the publications of e by year and calculating Score(e, T) for each year. With this approach Score(e, y, T) is given by equation 3.6; the temporal profile of John can be seen in Fig. 3.7, as well as his interests solely for year 2010.

Score(e, y, t) = \frac{\sum_{p \in P_{e,y}} Score(p, t)}{|P_{e,y}|}    (3.6)

where P_{e,y} is the set of publications of e published in year y.


(a) Temporal expertise profile of John (b) Expertise profile of John in year 2010

Figure 3.7: Temporal profile

While this approach might lead to good representations of how the interests evolve, Score(e, y, T) only takes into account publications of year y, meaning that when there are no publications in a year, such as 2018 in Figure 3.7, the entity appears to have no interests at all.

A better representation of the interests of an entity would be one where the interest in a topic increases when the topic is explored and decreases over time otherwise. With this in mind we propose that, for each year y, every publication previously published should have an impact on the interests of that year, and the older the publication, the smaller that impact. To calculate Score(e, y, t), every publication is therefore given a different weight depending on the number of years since it was published, as described in equation 3.7:

Score(e, y, t) = \sum_{p \in P_e, \; y_p < y} Score(p, t) \times w(y - y_p)    (3.7)

where w is a weight function that receives the number of years between the year p was published (y_p) and year y.

For the weight function we considered an exponential decay function given by the formula:

w(t) = e^{-\lambda t}

where λ is the decay rate, which controls how fast publications lose importance. This function is represented in Figure 3.8.


Figure 3.8: Exponential Decay Function for different values of λ
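The sketch below combines equation 3.7 with the exponential weight function, assuming each publication is given as a (year, topic-score dictionary) pair; as in the equation, only publications from years before y contribute.

from math import exp

def temporal_score(publications, year, topic_id, decay=0.5):
    """Score(e, y, t) with exponential temporal decay (Eqs. 3.7 and w(t) = e^(-λt)), a sketch.

    publications: list of (publication_year, {topic_id: Score(p, t)}) pairs of entity e.
    decay: the decay rate λ.
    """
    score = 0.0
    for pub_year, topic_scores in publications:
        if pub_year < year:                        # publications previously published
            age = year - pub_year
            score += topic_scores.get(topic_id, 0.0) * exp(-decay * age)
    return score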

In Figure 3.9 we can see how the interest of John in topic '3_47' evolves for different values of λ. In the left example, the score is calculated without temporal decay, and for that reason the interests are very volatile, eventually reaching zero in 2018 due to the lack of publications. In both the middle and right examples, with the addition of time decay, we see a smoother variation of the interests, which is not only more readable but also a better depiction of how the interests change over time. The bigger the value of λ, the faster a publication loses its importance. There is no single correct value for λ, as different researchers have different interpretations of their interests, so we propose that this decay rate should be chosen by the user. For the remainder of the thesis we will consider λ = 0.5.

(a) No temporal decay (b) λ = 0.5 (c) λ = 0.1


Figure 3.10: Temporal decay with λ = 0.5

With this new approach, the temporal profile of John with time decay can be seen in Figure 3.10. In this case the interests are not normalized, which has its advantages: we can see, for example, that in year 2012 John had his highest interest in topic 3, and the values could even be compared across different entities. However, normalizing the interests by year gives a better perspective on how the focus of an entity changes each year.

Figure 3.11: Normalization of a hierarchical tree

To normalize the interests we normalize the hierarchical tree of each year by dividing the Score(e, y, t) of every topic t by Score(e, y, P_t), where P_t is the super topic (parent) of t. This process is shown in Figure 3.11, and Figure 3.12 shows the normalized temporal expert profile of John.


Figure 3.12: Temporal decay normalized
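A minimal sketch of this per-parent normalization, assuming each year's tree is stored as a score dictionary plus a parent dictionary; this layout is illustrative, not the actual data structure used.

def normalize_by_parent(scores, parent_of):
    """Normalize a year's interest tree: divide each topic's score by its parent's score.

    scores: dict {topic_id: Score(e, y, t)} for one year, including the root.
    parent_of: dict {topic_id: parent_topic_id}; the root has no entry.
    Topics whose parent has score 0 are left at 0.
    """
    normalized = {}
    for topic_id, value in scores.items():
        parent = parent_of.get(topic_id)
        if parent is None:
            normalized[topic_id] = 1.0            # the root stands for the whole profile
        else:
            parent_score = scores.get(parent, 0.0)
            normalized[topic_id] = value / parent_score if parent_score else 0.0
    return normalized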

In this case, Score(e, y, t) represents the proportion of the focus of John on topic t in year y. Because of this we see that, although the absolute value of Score(e, y, t) for topic '3' decreases after year 2012, it was always his main interest in every year. We can also see that at the beginning, between years 2007 and 2009, John focused on various areas, and since 2010 he has narrowed his interests to topic 3_47.

3.4 Visualization

In this section we tackle the presentation of the research interests to the users. To simplify, throughout this section we will use the expert profile of an entity Peter, which can be seen in Figure 3.13. We start by selecting the most important topics of the profile. Then we address how to visualize the evolution of the interest trees. Next we demonstrate how to represent a topic, and finally we propose two different charts that show the evolution of the interests.


3.4.1 Interests Selection

To display the evolution of the interests we cannot show the interest in every topic, because some topics are so irrelevant that they would only add noise and confuse the user. For that reason we have to select the most important ones, those that best represent the interests of the entity. Because the interests only decay when a topic is not published about, the interest in a topic never disappears but rather decays to very low values. We can see in Fig. 3.13 that in year 2018 the author has some interest in every topic, even though most are irrelevant because they were either explored in the past or barely explored at all.

For that reason we select the top-n topics for each year and discard the rest. The value of n is chosen by the user, making the visualization interactive and allowing the user to better explore the temporal expertise profile. To choose the top-n topics, we start from the root and apply a depth-first search, selecting the topic with the highest value at each level, until reaching a leaf node. That topic is then selected and removed from the tree so it is not chosen again. We repeat this step until n different topics are selected, reconstruct the interest tree only with the selected topics, and normalize the tree again. Figure 3.14 details the process of selecting the top-3 topics in the hierarchical tree of 2003.

Figure 3.14: Process of selecting top-3 topics.
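The selection procedure can be sketched as follows, assuming each year's tree is stored as a dictionary mapping a topic identifier to its score and children, with a 'root' entry; this is an illustrative greedy implementation of the descend-and-remove loop described above, not the exact code used.

import copy

def select_top_n(tree, n):
    """Greedily select the top-n leaf topics of a year's interest tree, a sketch.

    tree: dict {topic_id: (score, list of child topic_ids)}; leaves have no children.
    Repeatedly descends from the root picking the highest-scored child, takes the
    leaf reached, removes it from the tree, and starts over, until n topics are selected.
    """
    tree = copy.deepcopy(tree)
    selected = []
    while len(selected) < n:
        node = "root"
        while tree[node][1]:                          # descend while there are children
            node = max(tree[node][1], key=lambda c: tree[c][0])
        if node == "root":                            # tree exhausted before reaching n
            break
        selected.append(node)
        for _, children in tree.values():             # detach the chosen leaf from its parent
            if node in children:
                children.remove(node)
        del tree[node]
    return selected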

After filtering the tree we also remove the topics with weight less than 0.1 and normalize the tree again. After applying these filters we get the interest evolution in Figure 3.15, which gives a summarized, more readable version of the expert profile.


Figure 3.15: Filtered grid with top-3 topics per year

3.4.2 Temporal Hierarchical Visualization

As we have mentioned, interests are represented in a hierarchical structure. There are various possible representations for hierarchical data, such as trees, but the same is not true for temporal-hierarchical data. For that reason, we decided not to show the full representation of the interests, but instead let the user interact with the visualization and dynamically choose the granularity level. This way we only need to show the evolution of the interests for the topics of one granularity level at a time. Figure 3.16 shows the evolution of the interests of Peter on granularity level 0 and on granularity level 1.


(a) Interests evolution - Level 0

(b) Interests evolution - Level 1

Figure 3.16: Evolution of interests of Peter on two different granularity levels.

3.4.3 Topic Representation

In the context of this thesis, each topic is composed of multiple keywords. We can get an understanding of what a topic represents, for example, by showing its keywords in a wordcloud, as exemplified in Fig. 3.17a; however, that representation is not practical to present to the user due to its size.


(a) Wordcloud of a topic (b) Topic subgraph

While we consider a topic mostly as a set of keywords, a topic is a community of keywords and therefore a graph. To represent the topic in a simple way we focused on representing it with the five most important keywords in the topic subgraph. For that reason, we ranked the subgraph nodes of each topic according to different centralities: degree centrality, closeness centrality, PageRank and betweenness centrality. Table 3.1 shows the top-5 nodes for each centrality. Empirically we noticed that any of the 4 centralities leads to satisfactory results, and even though the centrality could also be chosen by the user, we decided on degree centrality for the ranking function because it is computed directly and is therefore faster than the rest.

     Degree                   Closeness          PageRank                  Betweenness
1.   Neural Networks          Neural Networks    Neural Networks           Neural Networks
2.   Classification           Classification     Classification            Classification
3.   Data Mining              Machine Learning   Data Mining               Machine Learning
4.   Machine Learning         Image Processing   Support Vector Machine    Pattern Recognition
5.   Support Vector Machine   Data Mining        Machine Learning          Data Mining

Table 3.1: Top-5 most important keywords in a topic according to different centralities
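A sketch of this ranking with networkx degree centrality is shown below; whether the original ranking used weighted or unweighted degree is not specified, so plain degree centrality is assumed here.

import networkx as nx

def topic_label(G, topic_keywords, k=5):
    """Label a topic by its k most central keywords, using degree centrality.

    G: the full keyword network; topic_keywords: set of keywords of one topic.
    """
    subgraph = G.subgraph(topic_keywords)
    centrality = nx.degree_centrality(subgraph)      # normalized degree of each keyword
    ranked = sorted(centrality, key=centrality.get, reverse=True)
    return ranked[:k]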

3.4.4 Temporal Evolution

Finally, to display the evolution of the interests we propose two different charts: the Stacked Area Chart and the Line Chart. The Stacked Area Chart, represented in Figure 3.18, has the advantage that it gives a good notion of how the interests of the entity were divided over the years, by stacking the topics on top of each other. However, it is harder to track each interest individually, and in cases with many topics, such as profiles of institutions, it easily becomes unreadable.
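A minimal matplotlib sketch of such a stacked area chart is shown below; the text does not specify the plotting library actually used, so matplotlib is assumed here purely for illustration, with the per-topic interest series already extracted.

import matplotlib.pyplot as plt

def plot_stacked_area(years, topic_series, topic_labels):
    """Stacked area chart of normalized interests over time, a sketch.

    years: list of years; topic_series: list of score lists, one series per topic.
    """
    fig, ax = plt.subplots()
    ax.stackplot(years, topic_series, labels=topic_labels)
    ax.set_xlabel("Year")
    ax.set_ylabel("Interest")
    ax.legend(loc="upper left", fontsize="small")
    plt.show()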
