Interval-Weighted Networks: Community Detection and Centrality Measures

Hélder Fernando Cerqueira Alves

Doctoral thesis presented to the

Faculdade de Ciências da Universidade do Porto, Universidade de Aveiro and Universidade do Minho

Applied Mathematics

2020

Doctoral Program in Applied Mathematics

Advisor

Maria Paula de Pinho de Brito Duarte Silva, Associate Professor with Habilitation

Faculdade de Economia da Universidade do Porto

Co-advisor

Pedro José Ramos Moreira de Campos, Assistant Professor

This thesis is not just an isolated act of mine, but something that could only have happened with the support of a network of many people. It is with great pleasure that I take this opportunity to express my gratitude to many of them here.

First and foremost, I would like to express my profound gratitude to my advisor, Prof. Paula Brito, and my co-advisor, Prof. Pedro Campos, who guided me during these years, gave me valuable advice and provided constructive comments and remarks that greatly improved the quality of my thesis. I would like to acknowledge the constant support, encouragement, availability, feedback, valuable advice and ideas on my work that Prof. Paula Brito gave me. She has played a crucial role in this whole process through her rigor, high standards, helpful comments, countless corrections and leadership over the past few years. I am truly grateful for the opportunity to grow and learn from her. I thank my friend, Prof. Pedro Campos, for his optimistic attitude, his knowledge and especially his friendship. They have always inspired me by showing excitement for any result I presented during our meetings and by cheering me up whenever I was disappointed. I am also very much in their intellectual debt, and I hope to continue in the future the collaboration started with this work.

I would like to express my thanks to all the teachers who taught in the curricular year of this Doctoral Program in Applied Mathematics (PDMA) for passing on their valuable knowledge. I would especially like to thank the director of the PDMA, Prof. Sílvio Gama, for his help and permanent availability throughout this doctorate.

During these years, I had the privilege and opportunity to work at LIAAD/INESC TEC, which provided me with working conditions and access to people and knowledge, and from which I also benefited through support for participation in international conferences.

I would also like to thank Prof. Teresa Sá Marques (FLUP/CE-GOT) for her support, who provided valuable help both in providing access to data and in supporting the

time at LIAAD/INESC TEC, who proved to be a decisive person in terms of supporting both the R programming and the LaTeX writing of this dissertation. His input greatly improved the research presented here.

My gratitude also goes to Instituto Superior de Serviço Social do Porto (ISSSP), the institution where I have been teaching since 2007, for all the conditions offered that made it possible to successfully complete this journey.

I would also like to express my gratitude for the valuable feedback I received at the various conferences I attended, which helped to improve the work presented in this dissertation. These conferences were: XXXV Sunbelt Conference of the International Network for Social Network Analysis (INSNA), 2015 (Brighton, UK); 2nd European Conference on Social Networks (EUSN), 2016 (Paris, France); 3rd European Conference on Social Networks (EUSN), 2017 (Mainz, Germany); and 8th International Conference on Complex Networks and their Applications, 2019 (Lisbon, Portugal).

Finally, my deepest appreciation goes to my family and closest friends. I am especially grateful to my parents for their unwavering love, selfless support and encouragement over the years. A special thank you to my brother Paulo, whom I admire and love, for listening to me and supporting me, not only during these years but throughout my life. Thank you for taking the time to help with the translation of several sections of this thesis and for the insightful advice to improve them. I would also like to express my thanks and love to my sister Ana. I also owe a word of gratitude to my sister-in-law Mariana for her support in making the maps used in this thesis.

Last but not least, I would also like to thank my wonderful life partner, Marta Luz, who has supported me at each step of the way with her love and patience on the roller coaster of feelings that was this PhD. You are the kindest and most sensitive person I have ever met and I am very grateful for everything you have done for me and I am proud of what we achieved together. The best is yet to come!

Resumo

Nowadays, we live in an increasingly complex and interconnected world. As a consequence, social networks such as Facebook, Twitter, WhatsApp and Instagram, among others, have grown impressively, reaching hundreds of millions of users. The amount of available data, as well as the technology needed to access and analyse/explore these data (computational capacity), is increasingly accessible. Regardless of their context and size, these networks are usually represented as binary or weighted networks. However, in real-world applications these weights may vary within intervals rather than being constant. To better model this variability of weights in a network, instead of using constants (real numbers) and associated methods to represent the information present in the network edges, we represent the weights as intervals. A representation of these values as closed intervals composed of the precise information can be more meaningful and useful in a dynamic environment than a point-valued result, since these intervals contain more information to express the variability of the original data.

One of the most important features of these networks is the existence of a community structure. Analysing these networks reveals some of their structural properties, such as the modular organization into communities (or clusters), which are tightly (densely) connected internally and less so with the rest of the network. Identifying these communities is important, since it is closely related to the hierarchical organization of many complex systems in the real world. Furthermore, these communities are useful for a better understanding and visualisation of the whole network. Another important concept in network analysis is related to the notion of being the most connected vertex, or being positioned at the centre of the network, and thereby having advantages over other vertices. This centrality of vertices, or the place of a given entity (or actor) in the network, can be described using measures to determine its relative importance in the network.

This thesis makes contributions on community detection and centrality measures, which are two of the most important subjects of network science. Our research objective is twofold: (i) to generalize community detection based on the Louvain algorithm to interval-weighted networks; and (ii) to generalize the three best-known (classical) centrality measures, degree, closeness and betweenness, to interval-weighted networks (IWN). In both cases, we consider IWN in which the weights are represented by closed intervals, taking into account the variability of the weights on the network edges. Accordingly, both Newman's modularity and the modularity gain for weighted networks, as well as one of the state-of-the-art algorithms for maximizing modularity, the Louvain algorithm (LA), are extended to interval-weighted networks, considering a tabular representation of the networks (by contingency tables).

We propose a generalization of degree centrality, following Opsahl's approach of taking into account the edge weights and the number of intermediate vertices, by introducing a tuning parameter α, and call it Interval-Weighted Degree (IWD). Using Freeman's concept for flow networks, based on Ford and Fulkerson's algorithm, closeness and betweenness centralities are extended as Interval-Weighted Flow Closeness (IWFC) and Interval-Weighted Flow Betweenness (IWFB).

Finally, to evaluate and illustrate the proposed community detection methodology and centrality measures, we carry out several applied studies using two real-world interval-weighted networks (IWN). The first is a network of commuting movements in mainland Portugal, between the twenty-three NUTS 3 regions (IWCN). The second concerns annual merchandise trade between 28 European countries, between 2003 and 2015.

Keywords: Network analysis, Data with variability, Interval analysis, Interval-weighted networks, Community detection, Weighted networks, Centrality measures, Flow networks, Louvain algorithm, Modularity.

Abstract

Nowadays, we live in an increasingly complex and interconnected world. As a consequence, social networks such as Facebook, Twitter, WhatsApp and Instagram, among others, have registered astounding growth, reaching hundreds of millions of users. The amount of data available, as well as the technology required to access and mine/explore these data (computational capacity), has become increasingly affordable. Regardless of their context and size, these networks are usually represented as binary or weighted networks. However, in real-world applications these weights may vary within ranges rather than being constants. To better model such variability of weights in a network, instead of using constants (real numbers) and associated methods to represent the information present in the edges of the networks, we represent weights as intervals. A representation of these values as closed intervals composed of the precise information can be more meaningful and useful in a dynamic environment than a point-valued output, as these intervals carry more information about the variability of the raw data.

One of the most important features of these networks is the existence of a community structure. Analysing these networks reveals some of their structural properties, such as the modular organization into communities (or clusters), which are tightly (densely) connected internally and less so with the rest of the network. Identifying these communities is important, since it is closely related to the hierarchical organization of many complex systems in the real world. Furthermore, these communities are helpful for a better understanding and visualisation of the whole network. Another important concept of network analysis is related to the notion of being the most connected vertex, or being positioned in the center of the network, and thereby having advantages over other vertices. This centrality of vertices, or the place of a given entity (or actor) in the network, can be described using measures that determine its relative importance within the network.

This thesis makes contributions on community detection and centrality measures, which are two of the most important subjects of network science. Our research objective is twofold: (i) to extend community detection based on the Louvain algorithm to interval-weighted networks and (ii) to extend the three well-known (classical) centrality measures, degree, closeness and betweenness centrality, to interval-weighted networks (IWN). In both cases, we consider IWN where the weights are represented by closed intervals, thus taking into account the variability of network edge weights. Accordingly, both Newman's modularity and the modularity gain for weighted networks, as well as one of the state-of-the-art algorithms to maximize modularity, the Louvain algorithm (LA), are extended to IWN considering a tabular representation of networks (by contingency tables).
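To illustrate the tabular (contingency table) view mentioned above, the following minimal Python sketch computes Newman's modularity for a point-valued weighted network as the within-community sum of observed minus expected weights, where the expected weight of a pair of vertices is the product of their marginal sums divided by the grand total. The function name `weighted_modularity` and the toy network are assumptions made for this example only, and the sketch does not cover the interval-valued extensions developed in the thesis.

```python
from typing import List, Sequence


def weighted_modularity(observed: Sequence[Sequence[float]],
                        labels: Sequence[int]) -> float:
    """Newman's weighted modularity from a symmetric table of observed weights.

    observed[i][j] is the weight between vertices i and j (0 if no edge);
    labels[i] is the community of vertex i. Hypothetical helper, not thesis code.
    """
    n = len(observed)
    if len(labels) != n or any(len(row) != n for row in observed):
        raise ValueError("observed must be square and match labels in size")

    # Marginal sums of the table (vertex strengths) and grand total (2m).
    strength: List[float] = [sum(row) for row in observed]
    total = sum(strength)
    if total == 0:
        raise ValueError("the network has no weight")

    q = 0.0
    for i in range(n):
        for j in range(n):
            if labels[i] == labels[j]:
                # Expected weight under independence of rows and columns.
                expected = strength[i] * strength[j] / total
                q += observed[i][j] - expected
    return q / total


# Toy network: two disconnected edges, 0-1 and 2-3, each with weight 1.
W = [
    [0, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 1, 0],
]
print(weighted_modularity(W, [0, 0, 1, 1]))  # 0.5
print(weighted_modularity(W, [0, 0, 0, 0]))  # 0.0
```

Placing every vertex in a single community gives zero modularity, because the observed and expected tables have the same grand total.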

We propose a generalization of degree centrality, following Opsahl's approach of taking into account both edge weights and the number of intermediate vertices, by introducing a tuning parameter α, calling it Interval-Weighted Degree (IWD). Using Freeman's concept of flow networks based on Ford and Fulkerson's algorithm, closeness and betweenness centralities are extended as Interval-Weighted Flow Closeness (IWFC) and Interval-Weighted Flow Betweenness (IWFB).
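The role of the tuning parameter α can be sketched for interval weights as follows. The sketch assumes Opsahl's form k^(1−α) · s^α, where k is the number of neighbours and s the vertex strength, and applies it separately to the lower and upper bounds of the interval strength; because the expression is non-decreasing in s for α ≥ 0, the resulting bounds stay ordered. The function name and the toy data are hypothetical, and the exact IWD definition given later in the thesis may differ from this simplification.

```python
from typing import Dict, Tuple

Interval = Tuple[float, float]


def interval_weighted_degree(neighbours: Dict[str, Interval],
                             alpha: float) -> Interval:
    """Sketch of a degree centrality for interval weights (assumed form).

    neighbours maps each adjacent vertex to the interval weight
    (lower, upper) of the connecting edge.
    """
    if alpha < 0:
        raise ValueError("alpha must be non-negative")
    for lo, hi in neighbours.values():
        if lo < 0 or lo > hi:
            raise ValueError("each weight needs 0 <= lower <= upper")

    k = len(neighbours)  # number of neighbours (binary degree)
    if k == 0:
        return (0.0, 0.0)

    # Interval strength: sum the lower and upper bounds separately.
    s_lo = float(sum(lo for lo, _ in neighbours.values()))
    s_hi = float(sum(hi for _, hi in neighbours.values()))

    def tuned(s: float) -> float:
        # Opsahl-style combination of degree and strength.
        return k ** (1.0 - alpha) * s ** alpha

    return (tuned(s_lo), tuned(s_hi))


edges = {"b": (2.0, 4.0), "c": (1.0, 3.0)}
for a in (0, 0.5, 1, 1.5):
    print(a, interval_weighted_degree(edges, a))
```

With α = 0 the measure reduces to the number of neighbours, and with α = 1 it reduces to the interval strength, which is the behaviour expected of the two benchmark cases.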

Finally, to evaluate and illustrate the proposed community detection methodology and centrality measures, we conduct several applied studies using two real-world IWN. The first is a commuter network in mainland Portugal, between the twenty-three NUTS 3 regions (IWCN). The second focuses on annual merchandise trade between 28 European countries, from 2003 to 2015.

Keywords: Network analysis, Data with variability, Interval analysis, Interval-Weighted Networks, Community Detection, Weighted Networks, Centrality Measures, Flow Networks, Louvain Algorithm, Modularity.

Contents

Resumo
Abstract
List of Tables
List of Figures
List of Listings
Symbols and Notations

1 Introduction
1.1 Thesis Contributions
1.2 Thesis Outline

2 Elements of Network Analysis
2.1 Notations and basic concepts
2.2 Random graphs
2.3 Summary

3 Interval Analysis
3.1 Interval arithmetic: algebraic operations
3.1.1 Drawbacks of interval arithmetics
3.2 Interval order relation
3.3 Summary

4.1.1 Modularity for (undirected) binary networks
4.1.2 Modularity for (undirected) weighted networks
4.1.3 Normalized Modularity (comparing modularities)
4.2 Methods for community detection
4.2.1 Hierarchical clustering (divisive and agglomerative methods)
4.2.2 Modularity optimization
4.2.2.1 Greedy agglomeration
4.2.2.2 Louvain algorithm
4.3 Summary

5 Community Detection in Interval–Weighted Networks
5.1 Modularity based on the Contingency Table
5.2 Interval–Weighted Networks
5.3 Modularity for Interval–Weighted Networks
5.3.1 Interval differences
5.3.2 Interval Modularity–1
5.3.3 Interval Modularity–2
5.3.4 Methodology
5.3.5 Classic Louvain – Method 1.1: Intervals Sum
5.3.5.1 Adjustments of the Expected interval–weights [Method 1.1]
5.3.6 Classic Louvain – Method 1.2: Intervals Min–Max
5.3.6.1 Adjustments of the Expected interval–weights [Method 1.2]
5.3.7 Hybrid Louvain – Method 2.1: Intervals midpoint
5.3.8 Hybrid Louvain – Method 2.2: Intervals Sum
5.4 Summary

6 Centrality Measures
6.1 Centrality measures for weighted networks
6.2 Flow Centrality Measures
6.3 Centrality Measures for Interval–Weighted Networks
6.3.1 Interval–Weighted Degree (IWD)

7 Applications
7.1 Community Detection
7.1.1 Communities in a Commuters network
7.1.1.1 Results – Communities in a Commuters Network
7.1.2 Communities in a Trade network
7.1.2.1 Results – Communities in a Trade network
7.1.3 Summary for the community detection
7.2 Centrality Measures
7.2.1 Results – Commuters Network
7.2.2 Results – Trade Network

8 Concluding Remarks and Future Research
8.1 Main Conclusions
8.2 Future Research

References

A Modularity Gain calculations
B Interval–Weighted Centrality Measures
C Applications

List of Tables

2.1 Examples of vertices and edges in particular networks.
3.1 Interval ordering for the order relation “≤”, to choose the “greater interval”.
5.1 Modularity gain by moving vertex $v_1$ to neighbouring communities of vertices $v_2$ and $v_3$.
5.2 Modularity gain calculations, $\Delta Q^N_1$ and $\Delta Q^N_2$, for the weighted network of Figure 5.2a.
5.3 Modularity gain results for the 1st iteration of the 1st pass of the Louvain algorithm for interval–weighted networks (Method 1.1. Intervals Sum: Phase 1 = Sum / Phase 2 = Sum).
5.4 Modularity gain results for the 1st iteration of the 1st pass of the Louvain algorithm for interval–weighted networks (Method 1.2. Intervals Min–Max: Phase 1 = Min–Max / Phase 2 = Min–Max).
5.5 Modularity gain results for the 1st iteration of the 1st pass of the Louvain algorithm for interval–weighted networks (Method 2.1. Intervals Midpoint: Phase 1 = Sum / Phase 2 = Min–Max).
5.6 Modularity gain results for the 1st iteration of the 1st pass of the Louvain algorithm for interval–weighted networks (Method 2.2. Intervals: Phase 1 = Sum / Phase 2 = Min–Max).
6.1 Interval–Weighted Degree values for the benchmark values α = 0, 0.5, 1, 1.5.
6.2 Interval–Weighted Flow Centrality measures.
7.1 Interval–Weighted adjacency matrix for the commuting movements (> 50) between NUTS 3 in the Portuguese mainland (2011).
7.3 Interval–weighted adjacency matrix of NUTS 3 communities for the Hybrid Louvain – Method 2.1. Intervals Midpoint: phase 1 = Sum, phase 2 = Min–Max.
7.4 Summary table for the difference $d_1$ according to “Classic Louvain – Method 1.1. Intervals Sum: phase 1 = Sum, phase 2 = Sum”.
7.5 Summary table, difference $d_2$, “Classic Louvain – Method 1.1. Intervals Sum: phase 1 = Sum, phase 2 = Sum”.
7.6 Summary table for the difference $d_1$ according to “Hybrid Louvain – Method 2.2. Intervals: phase 1 = Sum, phase 2 = Min–Max”.
7.7 Summary table for the difference $d_2$ according to “Hybrid Louvain – Method 2.2. Intervals: phase 1 = Sum, phase 2 = Min–Max”.
7.8 Summary of main results for the method “Hybrid Louvain – Method 2.1. Intervals Midpoint: phase 1 = Sum, phase 2 = Min–Max” (Trade network).
7.9 Summary table for the difference $d_1$ according to “Method 1.1. Classic Louvain – Intervals Sum (Phase 1 = Phase 2 = Sum)” (Trade network).
7.10 Summary table for the difference $d_2$ according to “Method 1.1. Classic Louvain – Intervals Sum (Phase 1 = Phase 2 = Sum)” (Trade network).
7.11 Summary table for the difference $d_1$ according to “Method 2.2. Hybrid Louvain – Intervals Min–Max (Phase 1 = Sum and Phase 2 = Min–Max)” (Trade network).
7.12 Summary table for the difference $d_2$ according to “Method 2.2. Hybrid Louvain – Intervals Min–Max (Phase 1 = Sum and Phase 2 = Min–Max)” (Trade network).
7.13 Summary of all outcomes obtained by the community detection methods for the Interval–Weighted Commuters Network (IWCN), according to the modularity gain used.
7.14 Summary of all outcomes obtained by the community detection methods for the Interval–Weighted Trade Network (IWTN), according to the modularity gain used.
7.16 Flow centrality measures for the Interval–Weighted Commuters network (NUTS 3 ranked in descending order of interval rank for IWFB).
7.17 Degree centrality for the Interval–Weighted Trade network (Countries ranked in descending order of interval rank for α = 1).
7.18 Flow centrality measures for the Interval–Weighted Trade network (Countries ranked in descending order of interval rank for IWFB).
B.1 Lexicographic order of three intervals each with five values.
C.1 Louvain algorithm: phase 1 = Sum, phase 2 = Sum for the degenerate interval–weighted network (midpoints).
C.2 Classic Louvain – Method 1.1. Intervals Sum: phase 1 = Sum, phase 2 = Sum | $\Delta Q^I_{11}$ and $\Delta Q^I_{21}$.
C.3 Adjacency matrix of NUTS 3 communities – Classic Louvain – Method 1.1. Intervals Sum: phase 1 = Sum, phase 2 = Sum.
C.4 Classic Louvain – Method 1.1. Intervals Sum: phase 1 = Sum, phase 2 = Sum | $\Delta Q^I_{12}$ and $\Delta Q^I_{22}$.
C.5 Classic Louvain – Method 1.2. Intervals Min–Max: phase 1 = Min–Max, phase 2 = Min–Max | $\Delta Q^I_{11}$ and $\Delta Q^I_{21}$.
C.6 Classic Louvain – Method 1.2. Intervals Min–Max: phase 1 = Min–Max, phase 2 = Min–Max | $\Delta Q^I_{12}$ and $\Delta Q^I_{22}$.
C.7 Hybrid Louvain – Method 2.1. Intervals Midpoint: phase 1 = Sum, phase 2 = Min–Max.
C.8 Hybrid Louvain – Method 2.2. Intervals: phase 1 = Sum, phase 2 = Min–Max | $\Delta Q^I_{11}$ and $\Delta Q^I_{21}$.
C.9 Hybrid Louvain – Method 2.2. Intervals: phase 1 = Sum, phase 2 = Min–Max | $\Delta Q^I_{12}$ and $\Delta Q^I_{22}$.
C.10 Adjacency matrix – Interval–Weighted “Trade Network” (Part I)
C.11 Adjacency matrix – Interval–Weighted “Trade Network” (Part II)
C.12 Adjacency matrix – Interval–Weighted “Trade Network” (Part III)
C.13 Adjacency matrix – Interval–Weighted “Trade Network” (Part IV)
C.15 Classic Louvain – Method 1.1. Intervals Sum: phase 1 = Sum, phase 2 = Sum | $\Delta Q^I_{11}$ and $\Delta Q^I_{21}$ (Trade network).
C.16 Classic Louvain – Method 1.1. Intervals Sum: phase 1 = Sum, phase 2 = Sum | $\Delta Q^I_{12}$ and $\Delta Q^I_{22}$ (Trade network).
C.17 Classic Louvain – Method 1.2. Intervals Min–Max: phase 1 = Min–Max, phase 2 = Min–Max | $\Delta Q^I_{11}$ and $\Delta Q^I_{21}$ (Trade network).
C.18 Classic Louvain – Method 1.2. Intervals Min–Max: phase 1 = Min–Max, phase 2 = Min–Max | $\Delta Q^I_{12}$ and $\Delta Q^I_{22}$ (Trade network).
C.19 Hybrid Louvain – Method 2.1. Intervals Midpoint: phase 1 = Sum, phase 2 = Min–Max (Trade network).
C.20 Hybrid Louvain – Method 2.2. Intervals: phase 1 = Sum, phase 2 = Min–Max | $\Delta Q^I_{11}$ and $\Delta Q^I_{21}$ (Trade network).
C.21 Hybrid Louvain – Method 2.2. Intervals: phase 1 = Sum, phase 2 = Min–Max | $\Delta Q^I_{12}$ and $\Delta Q^I_{22}$ (Trade network).

List of Figures

2.1 Example of different networks for the same graph
2.2 Illustration of the most common network types
2.3 Example of a network and some of its subgraphs
2.4 Bipartite network
2.5 A weighted undirected network
2.6 Degree distribution of a binary undirected network
2.7 Examples of degree and strength distributions
2.8 Illustration of examples for: Walk, Trail, Path, Circuit, Cycle, Shortest Path (geodesic) and the Diameter of a network
2.9 Examples of connected and disconnected networks
2.10 Example of a triadic closure or a triad between three vertices
2.11 Four different realizations of ER uniform random graphs with n = 20 and m = 15
2.12 Four different realizations of ER binomial random graphs with n = 20 and p = 0.11
2.13 Watts and Strogatz's schematic illustrating the small–worlds model
3.1 Example of an Interval–Weighted Network where the values of the weights in the edges are represented by intervals.
3.2 Intersection of two intervals A and B.
3.3 Union and interval hull of two intervals A and B.
3.4 Different types of interval relations
4.1 Example of communities in a network
4.2 Example of an undirected weighted network – $G^W$
4.5 (a) Undirected weighted network $G^W$ with two communities, (b) Tabular representation of the weights $W$, of the perfect partition $W^C$
4.6 Sketch of a merging sequence in a greedy algorithm
4.7 Sketch of the optimization and aggregation steps of the Louvain algorithm
5.1 Example of a weighted network and its tabular representations of the observed and expected weights
5.2 Illustration of an application of the 1st iteration/phase/pass of the Louvain algorithm to a weighted network
5.8 Sketch of the Louvain methods extended to interval–weighted networks developed in this thesis
5.9 Example of a weighted network and its tabular representations of the observed and expected weights (Method 1.1)
5.10 Example of the adjustments for the expected interval-weights (Method 1.1)
5.11 Example of a weighted network and its tabular representations of the observed and expected weights (Method 1.2)
5.12 Example of the adjustments for the expected interval-weights (Method 1.2)
5.13 Hybrid Louvain – Method 2.1: Intervals midpoint
5.14 Hybrid Louvain – Method 2.2: Intervals Sum
6.1 Scheme of the generalizations made for the interval-weighted networks of the three (classical) measures of vertex centrality, degree, closeness and betweenness
6.2 Example of the transformation of an undirected interval–weighted network into a directed interval–weighted network
6.3 Interval–Weighted Network
6.4 Illustration on how to obtain the max–flow and the interval weighted flow betweenness (IWFB) values for the intervals lower and upper bounds of an interval-weighted network
7.2 Geographic representation of Portuguese NUTS 3 and the correspondent weighted network
7.3 Community structure – Hybrid Louvain – Method 2.1: Intervals midpoint (Commuters Network)
7.4 Community structure – Method 1.1. Intervals Sum – difference $d_1$ (Commuters Network)
7.5 Community structure – Method 1.1. Intervals Sum – difference $d_2$ (Commuters Network)
7.6 Community structure – Method 1.1. Intervals Sum – difference $d_2$ (Commuters Network)
7.7 Community structure for the CL (degenerate intervals by midpoints) and HL (Method 2.1 – intervals midpoint) (Commuters Network)
7.8 Geographical representation of the communities for difference $d_1$ and modularity gains $\Delta Q^I_{12}$ and $\Delta Q^I_{22}$ (Commuters Network)
7.9 Geographical representation of the communities for difference $d_2$ and modularity gains $\Delta Q^I_{12}$ and $\Delta Q^I_{22}$ (Commuters Network)
7.10 Geographic representation of the 28 European countries (Trade network) and the correspondent weighted network
7.11 Geographical representation of communities for Method 1.1. Classic Louvain – Intervals Sum (phase 1 = phase 2 = Sum), difference $d_1$ (Trade Network)
7.12 Geographical representation of communities for Method 1.1. Classic Louvain – Intervals Sum (phase 1 = phase 2 = Sum), difference $d_2$ (Trade Network)
7.13 Geographical representation of the communities for Method 2.1. Hybrid Louvain: Intervals midpoint (phase 1 = Sum and phase 2 = Min–Max), and Classical Louvain algorithm for the degenerate interval–weighted network (Trade Network)
A.1 Interval–Weighted network and tabular representations for the Classic Louvain Method 1.1: Intervals Sum
A.3 Interval–Weighted network and tabular representations for the Hybrid Louvain Method 2.1: Intervals Midpoint
A.4 Interval–Weighted network and tabular representations for the Hybrid Louvain Method 2.2: Intervals
B.1 Example of a lexicographic order of an interval–weighted network.
C.1 Community structure for the degenerate interval–weighted network (Commuters Network)

List of Listings

5.1 R output for the Louvain algorithm for weighted networks.
A.1 R output for the LA: Method 1.1. Intervals Sum / $Q^I_1$ / $d_1$ / $\Delta Q^I_{11}$
A.2 R output for the LA: Method 1.1. Intervals Sum / $Q^I_2$ / $d_2$ / $\Delta Q^I_{22}$
A.9 R output for the LA: Method 1.2. Intervals Min–Max / $Q^I_1$ / $d_1$ / $\Delta Q^I_{11}$
A.10 R output for the LA: Method 1.2. Intervals Min–Max / $Q^I_2$ / $d_2$ / $\Delta Q^I_{22}$
A.17 R output for the LA: Method 1.2. Intervals Midpoint $Q^I_1$ / $Q^I_2$ / $d_1$ / $d_2$ / $\Delta Q^I_{11}$ / $\Delta Q^I_{12}$ / $\Delta Q^I_{21}$ / $\Delta Q^I_{22}$
A.18 R output for the LA: Method 2.2. Intervals / $Q^I_1$ / $d_1$ / $\Delta Q^I_{11}$
A.19 R output for the LA: Method 2.2. Intervals / $Q^I_1$ / $d_2$ / $\Delta Q^I_{12}$
A.22 R output for the LA: Method 2.2. Intervals / $Q^I_2$ / $d_1$ / $\Delta Q^I_{21}$
A.25 R output for the LA: Method 2.2. Intervals / $Q^I_2$ / $d_2$ / $\Delta Q^I_{22}$

Acronyms

SNA  Social Network Analysis
ER  Erdős–Rényi
W-S  Duncan Watts and Steve Strogatz
LFR  Lancichinetti–Fortunato–Radicchi
IWN  Interval–Weighted Networks
LA  Louvain Algorithm
CL  Classic Louvain
HL  Hybrid Louvain
GN  Girvan and Newman
IWD  Interval–Weighted Degree
IWFB  Interval–Weighted Flow Betweenness
IWFC  Interval–Weighted Flow Closeness
NUTS  Nomenclature of Territorial Units for Statistics
IWCN  Interval–Weighted Commuters Network
IWTN  Interval–Weighted Trade Network

Notation

Networks

G(V, E) Undirected network, whereV is a set of vertices, andE a set of edges A(i, j) Undirected Adjacency matrix: 1if(i, j)_{∈ E}and0, otherwise

GW_{(V, E, W )} _{Weighted network, where}_V _{is a set of vertices,}_E_{a set of edges and}_W a set of weights

W (i, j) Weighted Adjacency matrix: wij weight in the link betweeniand j (0 if no link exists)

ki Degree of vertexi

hki Average degree of a network si Vertex strength

Intervals

[x, x] closed interval: x and x are respectively the lower and upper bounds (x 6 x)

R The set of real numbers [R] The set of intervals ℘(R) The powerset of a setR

◦ ∈ {+, ×} A binary algebraic operator

∈ {−,−1_} _{A unary algebraic operator} x, y, z Real variable symbols X, Y, Z Interval variable symbols

Weighted Networks: Community Detection

Q Unweighted Modularity

QW _{Weighted Modularity} C A partition

Cq CommunityC (or set)q(of a partitionC) δ(., .) Kronecker delta

(27)

sC

u Total weight linked to communityuor community strength

Qmax The maximum value that modularity can take for a given partition Qnorm Normalized modularity

∆QW _{Modularity gain}

∆QW

u→C Modularity gain by moving an isolated vertex u to a neighbouring

communityC

O Contingency table for the observed weights oij Observed weight between vertices(i, j) sO

i The total weight or strength attached (or linked) to vertex i (marginal sum)

E Contingency table for the expected weights eij Expected weight between vertices(i, j)

QN _{Modularity for the difference between the observed and the expected} weights (based on a contingency table)

QN

new Modularity after merging two communities (based on a contingency table) QN

last Modularity before merging two communities (based on a contingency table)

∆QN

1 Modularity gain for the difference between Q N new and Q N last (based on a contingency table) ∆QN

2 Modularity gain for the reduced formulation (based on a contingency table)

Interval–Weighted Networks: Community Detection

d1 Difference between two intervals based on the Hausdorff distance but does not take into account the modulus of the difference between the two intervals

d2 Difference between two intervals based on the Hausdorff distance but it does take into account the sign of the highest value

#»

d Vector that stores the difference between two intervals D Generalization of the different differences (d1andd2) QI

(28)

QI

2 Modularity-2 for interval-weighted networks ∆QI

11 Modularity gain for interval–weighted networks (based on Interval Modularity-1)

∆QI

12 Modularity gain for interval–weighted networks (reduced formula – based on Interval Modularity-1

∆QI

21 Modularity gain for interval–weighted networks (based on Interval Modularity-2)

∆QI

22 Modularity gain for interval–weighted networks (reduced formula – based on Interval Modularity-2)

QI

1norm Normalized modularity-1 for interval–weighted networks

QI

2norm Normalized modularity-2 for interval–weighted networks

#»

Qlast Vectorial modularity before merging two communities #»

Qnew Vectorial modularity after merging two communities

$O^I$ Contingency table for the observed interval–weights [Method 1.1]

$o^I_{ij}$ Observed interval–weights: $[\underline{o}_{ij}, \overline{o}_{ij}]$ ($\overline{o}_{ij} > \underline{o}_{ij} > 0$; $o^I_{ij} \subseteq \mathbb{R}^+$) [Method 1.1]

$s^{IO}_i$ Interval marginal sum attached to vertex $i$: $[\underline{s}^{IO}_i, \overline{s}^{IO}_i]$ [Method 1.1]

$E^I$ Contingency table for the expected interval–weights [Method 1.1]

$e^I_{ij}$ Expected interval–weights assuming independence between the vertices [Method 1.1]

$E'^I$ Adjustment of the expected interval–weights [Method 1.1]

$O^U$ Contingency table for the observed interval–weights [Method 1.2]

$o^U_{ij}$ Observed interval–weights: $[\underline{o}_{ij}, \overline{o}_{ij}]$ [Method 1.2]

$s^{OU}_i$ The maximum variability attached (or linked) to each vertex $i$: $[\underline{s}^{IO}_i, \overline{s}^{IO}_i]$ [Method 1.2]

$E^U$ Contingency table for the expected interval–weights [Method 1.2]

$E'^U$ Adjustment of the expected interval–weights [Method 1.2]

Interval–Weighted Networks: Centrality Measures

$c(\{i, j\})$ Forward capacity of $(i, j)$

$G_f = (V, E_f)$ Residual network induced by $f$

$c_f(j, i)$ Residual capacity from $j$ to $i$, in the backward direction of the edge $(i, j)$

$c_f(p)$ Residual capacity of a path $p$ ($p$ is called an augmented path)

$C^{IW}_D(i)$ Interval–Weighted Degree (IWD)

$C^{IW\alpha}_D(i)$ Interval–Weighted Degree with a tuning parameter $\alpha$ (Opsahl’s approach)

$C^{IW}_{FB}(i)$ Interval–Weighted Flow Betweenness (IWFB)

$C^{IW}_{FC}(i)$ Interval–Weighted Flow Closeness (IWFC)

“I think the next century will be the century of complexity.” Stephen Hawking, in 2000

There is a wide variety of complex systems in nature and society that can be depicted as networks representing the interactions between the systems’ components (Albert and Barabási, 2002). This has proven to be a convenient abstraction for representing many real–world problems, making it an important tool for better understanding the reality that surrounds us. This abstract representation raises nontrivial mathematical problems in different areas of knowledge, such as sociology (Granovetter, 1973; Freeman, 2004), physics (Newman, 2003a) or mathematics/computer science (Erdős and Rényi, 1959; Bollobás, 1998; Leskovec et al., 2010).

When reducing a full system to a network representation, we create an abstract structure and some information is lost in the process; nevertheless, this loss is also an advantage, since the representation captures only the essentials of the connection patterns (Newman, 2010).

To understand how these complex systems behave and function, we need to study the pattern of interactions among their components (Strogatz, 2001). Networks are ubiquitous, covering many areas of research, and may be used to represent any type of relation (edges) between interacting entities (vertices), such as information and communication networks (World Wide Web, phone and e-mail networks) (Albert et al., 1999; Barabási and Albert, 1999; Faloutsos et al., 1999; Anteneodo et al., 2010), social networks (friendship, collaboration) (Davis et al., 1941; Zachary, 1977; Wasserman and Faust, 1994; Newman and Park, 2003; Freeman, 2004), biological and ecological networks (food networks, metabolic networks, protein–protein interactions) (Lowry et al., 1951; Martinez, 1992; Jeong et al., 2001; Guimerà et al., 2010; Guimerà and Amaral, 2005; Kim et al., 2002), and transportation networks (road, rail and air networks) (Guimerà et al., 2005; Gastner and Newman, 2004), among others.

More specifically, the following examples can be conceived as networks: the flows of commuters within and between the regions of a country (De Montis et al., 2013a, 2011, 2013b; De Leo et al., 2013); the annual trade between countries (Barigozzi et al., 2011; Traag, 2014); the structure of interurban traffic (De Montis et al., 2007); the transportation network flow of people between different parts of a city (Cheng et al., 2015); the train operation schedule and passenger flow data for a city urban rail transit (Zeng et al., 2018); or the network structure of subway passenger flows (Xu et al., 2016). As it can easily be realized from the previous examples, the vertices (entities) and edges (links) of these networks can mean very different things according to what they represent.

Networks are formally represented as graphs, and graph theory provides an appropriate set of concepts, models and tools for the mathematical treatment of networks (Bollobás, 1998). Several types of graphs can be used to model different kinds of networks. When the classification is based on the direction of the edges, networks are called undirected if the edges connect unordered pairs of vertices (e.g., Facebook or LinkedIn), or directed (digraphs) if the edges have an orientation assigned to them (e.g., Twitter). On the other hand, if we regard the values assigned to edges, networks are classified as unweighted or weighted. Unweighted networks are binary, since edges are either present or absent. In weighted networks, each edge is associated with a weight that represents the strength or intensity (usually a constant positive real number) of the interaction between vertices (Wasserman and Faust, 1994).

Although network science is a recent scientific area, the study of networks is not new. In fact, it all began in 1735, when Leonhard Euler was trying to find a path through the city of Königsberg that would cross each of its seven bridges exactly once. Euler's seminal paper (Euler, 1741), representing land masses as vertices and bridges as edges, analysed the problem in abstract terms (and showed that it had no solution), laying the foundation of what we know today as network science (Newman, 2010; Barabasi, 2016). Another milestone in the study of networks occurred many years later, with two seminal papers: the first by Paul Erdős and Alfréd Rényi (ER) on random networks (Erdős and Rényi, 1959), and the second by Mark Granovetter (Granovetter, 1973), the most cited article on social networks, which definitely prompted the interest in the study of networks. Initially, both papers were highly regarded within their disciplines, but had only limited impact outside their fields. However, as a consequence of the emergence of a social and interdisciplinary awareness of network science, as well as of the availability of more and larger real–world networks (Watts and Strogatz, 1998), empirical studies were developed, and these classic publications saw an explosive growth in the number of citations, especially in the 21st century, which definitely boosted network science (Albert and Barabási, 2002; Newman, 2003a; Barrat et al., 2008).

When analysing the structure of a network, one usually starts by drawing a picture of it. However, for large networks, visualisation is not of much help, and other properties are required to unveil their features. In this sense, some of the structural properties of networks that empirical studies have repeatedly highlighted are: many real–world networks have a non-Poissonian degree distribution (with heavy tails) (Barabási and Albert, 1999); the mean geodesic distance between vertex pairs is very short (the “small–world effect”) (Milgram, 1967), a finding on which Watts and Strogatz later built the first network model that combines the high clustering characteristic of networks with the short average path lengths known from ER random graphs (Watts and Strogatz, 1998); and finally, the modular organization of networks in communities (or clusters), which are tightly (densely) connected internally, and less so with the rest of the network (Girvan and Newman, 2002; Fortunato, 2010).

Nowadays, the study of network community structure is considered a research field in itself, and is an area of network science that is rapidly expanding in different directions. Also known as groups, clusters or cohesive subgroups, communities have been extensively studied in many fields. Although the concept is not precisely defined, a general consensus suggests that communities are formed due to the structural or functional similarities among the vertices in a network. These communities can represent a large number of subjects in various areas of knowledge, such as Sociology (Zachary, 1977; Padgett and Ansell, 1993) and Biology (Jeong et al., 2001; Guimerà et al., 2010; Guimerà and Amaral, 2005); many other examples of communities can be found in the network literature (Guimerà et al., 2005; Meunier et al., 2009; Porter et al., 2005; Traag and Bruggeman, 2009; Barigozzi et al., 2011; De Montis et al., 2013a). In this thesis, our focus is on the communities that arise from the flows of commuters between the regions of a country (Portugal, NUTS 3) (De Montis et al., 2013a; De Leo et al., 2013), as well as on how a number of European countries group together considering the annual trade between them (Traag and Bruggeman, 2009; Barigozzi et al., 2011).

Community detection consists of decomposing the vertices of a network into sets (or groups) such that vertices within a set are densely connected internally and sparsely connected externally. A vast number of community detection methods and algorithms have been defined, especially in the last two decades (Girvan and Newman, 2002; Newman, 2004c; Porter et al., 2009; Fortunato, 2010; Fortunato and Hric, 2016; Chakraborty et al., 2017). The main breakthrough occurred with the work by Newman and Girvan, who introduced a quality function known as modularity, a quantitative criterion to evaluate the quality of a given partition without a priori knowledge of the true division of the network into communities (Newman and Girvan, 2004). In this context, modularity indicates how plausible the community structure found is (Clauset et al., 2004).
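For reference, the standard modularity of a partition of a weighted undirected network can be written as below. The symbols are introduced here only for illustration and may differ from the notation adopted in later chapters: $w_{ij}$ is the weight of the edge between vertices $i$ and $j$, $s_i = \sum_j w_{ij}$ is the strength of vertex $i$, $w = \frac{1}{2}\sum_{i,j} w_{ij}$ is the total edge weight, and $\delta(c_i, c_j)$ equals 1 when $i$ and $j$ belong to the same community and 0 otherwise.

$$Q = \frac{1}{2w} \sum_{i,j} \left[ w_{ij} - \frac{s_i s_j}{2w} \right] \delta(c_i, c_j)$$

The term $s_i s_j / 2w$ is the weight expected between $i$ and $j$ if edges were placed at random while preserving vertex strengths, so $Q$ is large when the weight inside communities exceeds what chance alone would produce.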

Nevertheless, modularity is not an “immaculate” solution, since it suffers from some limitations, such as the “resolution limit” (Fortunato and Barthelemy, 2007), a “fluctuation of randomness” (Guimerà et al., 2004), and “extreme degeneracies” (Good et al., 2010). However, in terms of accuracy and computational cost, the most successful solutions to the community detection problem are those based on the optimization of modularity (Newman and Girvan, 2004; Reichardt and Bornholdt, 2006a; Brandes et al., 2006; Clauset et al., 2004; Blondel et al., 2008).

Another, broader method of identifying groups in (social) networks is known in the sociology field as blockmodeling (White et al., 1976; Batagelj et al., 1992; Doreian et al., 1994; Wasserman and Faust, 1994; Batagelj, 1997), later extended to generalized blockmodeling (Doreian et al., 2004, 2005). Briefly, it consists of partitioning the vertices of the original network into clusters (groups, communities) and, at the same time, partitioning the set of ties into blocks. Blockmodeling was first developed for binary networks, and roughly a decade later extended to weighted networks by Aleš Žiberna (Žiberna, 2007). The groups created by this method take into account all types of patterns in the network, which can be the “group pattern” (vertices are densely connected within the group, but sparsely connected with vertices outside the group), or other patterns such as “core periphery structure” or “bipartite structure”, among others.

In this thesis, we focus on networks with ranges of values on the edges, thus taking into account the variability of edge weights. Usually, in classical graph theory, weights on the edges of weighted networks are assumed to be constants (Newman, 2004b). However, in real–world applications, these weights may vary within ranges rather than being constants (Hu and Hu, 2008).

A representation of these values in the form of closed intervals built from precise information can be more meaningful and useful in a dynamic environment than point–valued output, as these intervals carry more information about the variability of the raw data, thereby minimizing the loss of information (Noirhomme-Fraiture and Brito, 2011; Couso and Dubois, 2014; Grzegorzewski and Śpiewak, 2017).

Taking into account the variability of edge weights in the form of closed intervals, we call our networks interval–weighted networks (IWN). All arithmetic operations on these closed intervals are performed using a proper interval arithmetic, defined on sets of real intervals $[\underline{x}, \overline{x}] = \{x \in \mathbb{R} : \underline{x} \le x \le \overline{x}\}$ rather than on sets of real numbers (Moore, 1959, 1962; Moore et al., 2009).
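As a brief reminder, the basic operations of standard interval arithmetic on two closed intervals $[\underline{a}, \overline{a}]$ and $[\underline{b}, \overline{b}]$ can be stated as follows (the names $a$, $b$ and $S$ are used here only for illustration):

$$[\underline{a}, \overline{a}] + [\underline{b}, \overline{b}] = [\underline{a} + \underline{b},\; \overline{a} + \overline{b}]$$

$$[\underline{a}, \overline{a}] - [\underline{b}, \overline{b}] = [\underline{a} - \overline{b},\; \overline{a} - \underline{b}]$$

$$[\underline{a}, \overline{a}] \cdot [\underline{b}, \overline{b}] = [\min S,\; \max S], \qquad S = \{\underline{a}\,\underline{b},\; \underline{a}\,\overline{b},\; \overline{a}\,\underline{b},\; \overline{a}\,\overline{b}\}$$

For instance, $[1, 3] + [2, 5] = [3, 8]$ and $[1, 3] - [2, 5] = [-4, 1]$. Each result encloses every value that the operation can take when its operands range independently over their intervals.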

A relevant aspect when using intervals is their comparison. In real–world problems, sets of intervals may appear as coefficients in an inequality (or an equality) for the selection of the best alternative in a decision making problem. Thus, having a definition for comparing and ranking any two intervals is paramount (Ishibuchi and Tanaka, 1990; Sengupta and Pal, 2000; Karmakar and Bhunia, 2014, 2012). To date, one main dilemma in using interval data for decision problems is, perhaps, the choice of an appropriate interval order relation. Unlike the order on real numbers, the existing definitions cannot, in many situations, order two arbitrary intervals, even though they can be applied efficiently to solve models (Hossain, 2009; Karmakar and Bhunia, 2014). As a consequence, theoretically, intervals can only be partially ordered (Moore et al., 2009), and hence one can only compare certain elements. However, when a choice has to be made among alternatives, a comparison is indeed needed.

This problem arises when pursuing the objectives of this work: the extension to networks with intervals on the edges of (i) community detection, using Louvain’s algorithm to maximize modularity, and (ii) the three classical centrality measures (degree, closeness and betweenness). In both generalizations we rank the intervals based on Hossain’s interval ordering methodology (Hossain, 2009). We define a new approach to capture the maximum variability between two intervals by considering that one interval has a higher order than another when it has a higher midpoint value. If the midpoint values of the two intervals coincide, the highest order interval is the one with the largest radius.
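The ranking rule just described (higher midpoint first, ties broken by larger radius) can be sketched as a sort key. This is only an illustrative sketch: the tuple representation of an interval and the helper names `midpoint`, `radius` and `order_key` are assumptions made here, not part of the implementation used in this thesis.

```python
def midpoint(interval):
    """Centre of a closed interval given as a (lower, upper) tuple."""
    lower, upper = interval
    return (lower + upper) / 2


def radius(interval):
    """Half-width of a closed interval given as a (lower, upper) tuple."""
    lower, upper = interval
    return (upper - lower) / 2


def order_key(interval):
    """Rank by midpoint; when midpoints coincide, the larger radius ranks higher."""
    return (midpoint(interval), radius(interval))


# Hypothetical intervals: three of them share or nearly share a midpoint.
intervals = [(2.0, 6.0), (3.0, 5.0), (1.0, 4.0), (4.0, 5.0)]
ranked = sorted(intervals, key=order_key)   # lowest order first
highest = max(intervals, key=order_key)
```

Here `(3.0, 5.0)` and `(2.0, 6.0)` have the same midpoint, so the wider interval `(2.0, 6.0)` is ranked above the narrower one.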

The idea of using intervals on network links is not new. This approach can be found in works related to decision making based on Dijkstra’s (Dijkstra, 1959) shortest path problem (Okada and Gen, 1993; Nayeem and Pal, 2005; Sengupta and Pal, 2006; Gatev and Hossain, 2007; Nayeem et al., 2008; Sengupta and Pal, 2009; Hossain et al., 2009; Hossain and Gatev, 2010; Hossain, 2009; Zhang et al., 2012), or even in works based on flows in networks (Hu et al., 2007; Hu and Hu, 2008; Bozhenyuk et al., 2017). Nevertheless, in those works the representation and manipulation of inexact, vague, fuzzy, ambiguous or imprecise information in the form of intervals relies on fuzzy logic (Zadeh, 1965), while in this thesis we use closed intervals built from the precise information involved (minimum and maximum) in order to capture the maximum variability present, following an ontic rather than an epistemic approach (Couso and Dubois, 2014).

A convenient way to mathematically represent networks is by considering their tabular representation in the form of contingency tables (Traag, 2014), since Newman’s modularity (ignoring the fixed factor $1/2w$) basically consists of the sum over all communities of the difference between the observed and the expected weights (Newman and Girvan, 2004). To extend modularity, we may define two types of contingency table, one for the observed weights and another for the expected weights assuming independence between the vertices of the table (our null model). Then, based on the concept of the chi-square test of independence, which tests independence between the row and column variables in a contingency table, we evaluate the discrepancy between the observed counts in the table and the expected values of those counts under the null distribution. If the discrepancy is larger than expected (from a random chance model), then there is evidence against the null hypothesis of independence (Everitt, 1992).
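The contingency-table view can be illustrated with a minimal sketch for ordinary (real-valued) weights. It assumes a symmetric table of observed weights and the usual independence estimate (row total times column total divided by the grand total) for the expected table; the function names and the example table `O` are hypothetical.

```python
def expected_table(observed):
    """Expected weights under independence: row total * column total / grand total."""
    row_sums = [sum(row) for row in observed]
    col_sums = [sum(col) for col in zip(*observed)]
    total = sum(row_sums)
    return [[r * c / total for c in col_sums] for r in row_sums]


def modularity(observed, communities):
    """Sum over communities of (observed - expected), scaled by the grand total."""
    expected = expected_table(observed)
    total = sum(sum(row) for row in observed)
    q = 0.0
    for members in communities:
        for i in members:
            for j in members:
                q += observed[i][j] - expected[i][j]
    return q / total


# Hypothetical symmetric table: two heavy triangles joined by one light edge.
O = [
    [0, 5, 5, 1, 0, 0],
    [5, 0, 5, 0, 0, 0],
    [5, 5, 0, 0, 0, 0],
    [1, 0, 0, 0, 5, 5],
    [0, 0, 0, 5, 0, 5],
    [0, 0, 0, 5, 5, 0],
]
q_split = modularity(O, [[0, 1, 2], [3, 4, 5]])
q_single = modularity(O, [[0, 1, 2, 3, 4, 5]])
```

For a symmetric table the grand total equals $2w$, so this reproduces the usual weighted modularity. Placing all vertices in one community gives zero, whereas the split into the two triangles gives a clearly positive value.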

Based on this reasoning, we extend modularity to weighted networks, taking modularity as a sum over all communities of the difference between the observed and the expected weights (Newman and Girvan, 2004). Then, the modularity gain obtained by merging two communities into a single community is also extended: first as the difference between the modularity after and before the merge, and then through a more efficient expression derived from the previous one (reduced formula) (Clauset et al., 2004). Lastly, these procedures are extended to one of the state–of–the–art algorithms to maximize modularity, the Louvain algorithm (LA) (Blondel et al., 2008).
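The two ways of computing the gain can be compared in a short sketch for real-valued weights. It assumes a symmetric table of observed weights and uses the reduced expression $2(e_{ab} - a_a a_b)$ in the form given by Clauset et al. (2004); all function names and the example table are hypothetical.

```python
def community_term(observed, members, total):
    """Contribution of one community to modularity: inside share minus squared strength share."""
    strength = sum(sum(observed[i]) for i in members)
    inside = sum(observed[i][j] for i in members for j in members)
    return inside / total - (strength / total) ** 2


def gain_by_difference(observed, a, b):
    """Modularity after the merge minus modularity before it."""
    total = sum(map(sum, observed))
    q_last = community_term(observed, a, total) + community_term(observed, b, total)
    q_new = community_term(observed, a + b, total)
    return q_new - q_last


def gain_reduced(observed, a, b):
    """Reduced formula: 2 * (e_ab - a_a * a_b); valid for a symmetric table."""
    total = sum(map(sum, observed))
    e_ab = sum(observed[i][j] for i in a for j in b) / total
    a_a = sum(sum(observed[i]) for i in a) / total
    a_b = sum(sum(observed[i]) for i in b) / total
    return 2 * (e_ab - a_a * a_b)


# Hypothetical symmetric table: two heavy triangles joined by one light edge.
O = [
    [0, 5, 5, 1, 0, 0],
    [5, 0, 5, 0, 0, 0],
    [5, 5, 0, 0, 0, 0],
    [1, 0, 0, 0, 5, 5],
    [0, 0, 0, 5, 0, 5],
    [0, 0, 0, 5, 5, 0],
]
gain_triangles = gain_by_difference(O, [0, 1, 2], [3, 4, 5])
gain_pair = gain_by_difference(O, [0], [1])
```

The reduced formula only needs the weight between the two communities and their total strengths, which is why it is cheaper than recomputing modularity. In this example, merging the two triangles lowers modularity, while merging two strongly linked vertices raises it.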

However, our goal with this work is to extend these measures to the case of an interval–weighted network. In doing so, several major setbacks occur. For example, one way of evaluating the difference between two intervals is to use a measure of distance. However, when using a distance, the values of both modularity and modularity gain are always non–negative, which makes it impossible to determine whether a vertex stays in its own community or moves to a neighbouring community when implementing the 1st phase (optimization) of the Louvain algorithm. Other setbacks are directly related to interval arithmetic itself, such as “the interval dependency problem” (the subtraction of two identical intervals is not equal to the null interval) and the fact that “the distributive law does not hold” (only a subdistributive law does), among others (Moore et al., 2009).
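Both arithmetic pitfalls can be observed with a few lines of code. The `Interval` class below is a minimal sketch of standard interval arithmetic written for this illustration only; it is not the implementation used in this thesis.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    lo: float
    hi: float

    def __add__(self, other):
        return Interval(self.lo + other.lo, self.hi + other.hi)

    def __sub__(self, other):
        # The lower bound subtracts the other's upper bound, and vice versa.
        return Interval(self.lo - other.hi, self.hi - other.lo)

    def __mul__(self, other):
        products = [self.lo * other.lo, self.lo * other.hi,
                    self.hi * other.lo, self.hi * other.hi]
        return Interval(min(products), max(products))

    def contains(self, other):
        return self.lo <= other.lo and other.hi <= self.hi


# Interval dependency: x - x is not the null interval [0, 0].
x = Interval(2.0, 5.0)
difference = x - x

# Subdistributivity: a * (b + c) is contained in a * b + a * c, but may differ from it.
a, b, c = Interval(1.0, 2.0), Interval(1.0, 1.0), Interval(-1.0, -1.0)
lhs = a * (b + c)
rhs = a * b + a * c
```

Here `difference` is `Interval(-3.0, 3.0)` rather than the null interval, and `lhs` is the degenerate interval `[0, 0]`, strictly contained in `rhs`, which is `[-1, 1]`.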

To circumvent these setbacks, we propose two differences to evaluate the discrepancy between the observed and expected interval–weights. Both differences are based on the Hausdorff distance (Bryant, 1985; Rote, 1991; Chavent et al., 2006; Billard and Diday, 2007). The difference between them is that the first one, $d_1$, does not take into account the modulus of the difference between the two intervals, while the second one, $d_2$, does take into account the sign of the highest value. With the definition of these two measures we can evaluate the effect that different approaches have on the final clustering when applying the Louvain algorithm.
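For reference, the Hausdorff distance between two closed intervals $A = [\underline{a}, \overline{a}]$ and $B = [\underline{b}, \overline{b}]$ reduces to the expression below. Only this underlying distance is shown; the measures $d_1$ and $d_2$ themselves are not reproduced here.

$$d_H(A, B) = \max\left\{\, |\underline{a} - \underline{b}|,\; |\overline{a} - \overline{b}| \,\right\}$$

For example, with $A = [2, 5]$ and $B = [3, 9]$ we obtain $d_H(A, B) = \max\{1, 4\} = 4$. Being a distance, $d_H$ is never negative and is symmetric in $A$ and $B$, which is precisely why it cannot, by itself, indicate whether the observed weights lie above or below the expected ones.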

Taking into account all of the above, we have developed a strategy divided into two approaches, here called “Classic Louvain” and “Hybrid Louvain”, which in turn split into two different methods. The different approaches considered allow obtaining solutions according to different criteria, such as capturing more or less variability from data, or favouring a faster execution of the algorithm. All these procedures are then repeated using a vector difference between intervals ($\vec{d}$).

Our second goal with this work is the extension of the three well–known (classical) centrality measures, namely degree, closeness and betweenness centralities, to interval–weighted networks (IWN). The study of centrality measures is one of the most important topics in network science (Newman, 2010; Borgatti et al., 2013).

The importance of a vertex is usually related to the concept of being the most connected vertex or being positioned in the center of the network (Wasserman and Faust, 1994; Scott, 2000). Essentially, a vertex positioned in the center of a network has advantages over other vertices, as it is directly linked to many other vertices, acts as an intermediary in the communication between other vertices, or controls the flow that reaches the other vertices (Freeman, 1979; Bonacich, 1987; Borgatti, 2005; Borgatti et al., 2006). This centrality of vertices, or the place of a given entity (or actor) in the network, can be described using measures that determine their relative importance within the network.

Degree centrality is a measure that represents the number of edges that terminate or originate at a focal vertex (vertex strength in weighted networks). Our extension of degree, which we call Interval–Weighted Degree (IWD), follows the approach of Opsahl et al. (2010) of taking into account both edge weights and the number of intermediate vertices, by introducing a tuning parameter $\alpha$.
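For real-valued weights, the degree measure of Opsahl et al. (2010) combines the number of adjacent vertices $k_i$ and the vertex strength $s_i$ as $k_i^{(1-\alpha)} \cdot s_i^{\alpha}$. The sketch below illustrates this non-interval version only; the dict-of-dicts representation, the function name `degree_alpha` and the toy network are assumptions made for the example.

```python
def degree_alpha(weights, vertex, alpha):
    """Opsahl-style degree: k ** (1 - alpha) * s ** alpha for one vertex."""
    incident = [w for w in weights[vertex].values() if w > 0]
    k = len(incident)   # number of adjacent vertices
    s = sum(incident)   # vertex strength (sum of incident weights)
    if k == 0:
        return 0.0
    return k ** (1 - alpha) * s ** alpha


# Hypothetical undirected weighted network stored as a dict of dicts.
W = {
    "a": {"b": 4.0, "c": 2.0},
    "b": {"a": 4.0},
    "c": {"a": 2.0},
}
only_ties = degree_alpha(W, "a", 0.0)     # reduces to the plain degree
only_weights = degree_alpha(W, "a", 1.0)  # reduces to the vertex strength
balanced = degree_alpha(W, "a", 0.5)
```

With $\alpha = 0$ the measure ignores weights and returns the degree, with $\alpha = 1$ it returns the strength, and intermediate values trade off the two.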

The closeness and betweenness centrality measures rely on the identification of the shortest paths, and measure the number of them that pass through a vertex (Opsahl et al., 2010). Commonly, to identify the shortest (geodesic) paths in weighted networks using Dijkstra’s algorithm (Dijkstra, 1959), the edge weights are inverted, thus representing the edge cost and not the edge strength (Newman, 2001; Brandes, 2001; Opsahl et al., 2010).
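The effect of inverting the weights can be seen in a small sketch of Dijkstra's algorithm for real-valued weights, in which each edge of strength $w$ is given cost $1/w$. The function name `shortest_costs` and the toy network are hypothetical.

```python
import heapq


def shortest_costs(weights, source):
    """Dijkstra's algorithm where an edge of strength w has cost 1 / w."""
    cost = {source: 0.0}
    queue = [(0.0, source)]
    while queue:
        d, u = heapq.heappop(queue)
        if d > cost.get(u, float("inf")):
            continue  # stale queue entry
        for v, w in weights[u].items():
            candidate = d + 1.0 / w
            if candidate < cost.get(v, float("inf")):
                cost[v] = candidate
                heapq.heappush(queue, (candidate, v))
    return cost


# Hypothetical network: a weak direct tie a-b and strong ties through c.
W = {
    "a": {"b": 1.0, "c": 4.0},
    "b": {"a": 1.0, "c": 4.0},
    "c": {"a": 4.0, "b": 4.0},
}
costs = shortest_costs(W, "a")
```

Although `a` and `b` are directly connected, the cheapest route between them goes through `c` (cost 0.5 instead of 1.0), because the two strong ties are less costly than the single weak one.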

Since interval arithmetic has no additive or multiplicative inverses, except for degenerate intervals (Moore et al., 2009), we adopt the concept of flow networks of Freeman et al. (1991), based on Ford and Fulkerson’s (1956) algorithm (which uses all the independent paths between all pairs of vertices in the network instead of only the geodesic paths). To do so, we first prove that the maximum flow is obtained with the maximum flow values at each edge, and the minimum flow with the minimum flow values at each edge. Thus, we propose a generalization of closeness and betweenness centralities based on this flow concept. The use of interval–weighted flows to represent flow capacities allows taking into account the variability observed in the original network, thereby minimizing the loss of information.
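The idea of obtaining an interval-valued flow from the lower and upper capacities can be sketched as follows. The sketch uses the Edmonds-Karp variant (breadth-first search) of the Ford and Fulkerson method on a small directed network; the function names, the directed dict-of-dicts representation and the example capacities are assumptions made for this illustration.

```python
from collections import deque


def max_flow(capacity, source, sink):
    """Edmonds-Karp maximum flow on a dict-of-dicts capacity map."""
    residual = {u: dict(nbrs) for u, nbrs in capacity.items()}
    for u, nbrs in capacity.items():
        for v in nbrs:
            residual.setdefault(v, {}).setdefault(u, 0.0)  # backward edges
    flow = 0.0
    while True:
        # Breadth-first search for a path with spare residual capacity.
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v, c in residual[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return flow
        # Residual capacity of the path found.
        bottleneck = float("inf")
        v = sink
        while parent[v] is not None:
            bottleneck = min(bottleneck, residual[parent[v]][v])
            v = parent[v]
        # Push the flow along the path and update the residual network.
        v = sink
        while parent[v] is not None:
            u = parent[v]
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
            v = u
        flow += bottleneck


def bound(network, index):
    """Extract the lower (index 0) or upper (index 1) capacities."""
    return {u: {v: iv[index] for v, iv in nbrs.items()} for u, nbrs in network.items()}


# Hypothetical interval capacities (lower, upper) on a directed network.
interval_capacity = {
    "s": {"a": (2.0, 4.0), "b": (1.0, 3.0)},
    "a": {"t": (1.0, 2.0), "b": (1.0, 1.0)},
    "b": {"t": (2.0, 5.0)},
}
flow_interval = (
    max_flow(bound(interval_capacity, 0), "s", "t"),
    max_flow(bound(interval_capacity, 1), "s", "t"),
)
```

Running the same maximum-flow routine once on the lower bounds and once on the upper bounds yields the two endpoints of the flow interval between `s` and `t`, here `(3.0, 6.0)`.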

Throughout this research work, different software was used to handle network data, such as Pajek (Batagelj and Mrvar, 2014), UCINET (Borgatti et al., 2002), Gephi (Bastian et al., 2009) and some R packages (R Core Team, 2016), such as igraph (Csardi and Nepusz, 2006), sna (Butts, 2016), statnet (Handcock et al., 2003) and tnet (Opsahl, 2009).

Since none of these software tools and packages was designed to deal with interval variations on the edges of networks, it was necessary to implement several functions in R to compute the different community detection approaches, as well as the centrality measures, proposed in this thesis.

1.1. Thesis Contributions

Network and Interval Analysis are the unifying themes that link the two research lines addressed in this dissertation: community detection and centrality measures, both of utmost importance in the field of network science. This thesis contributes to the Network Analysis field by introducing interval variations on the edges of networks, which makes it possible to take into account the variability observed in the original network, thereby minimizing the loss of information present in the raw data.

The contributions of this thesis may be summarised as follows:

• The definition of a new approach to order two intervals taking into account the maximum variability between them;

• A new method to compute the modularity for weighted networks based on the concept of the chi-square test of independence and the representation of a weighted network in the form of two contingency tables, one for the observed counts and the other for the expected values of those counts;

• A new methodology to adjust the values in the expected interval–weighted contingency table;

• The development of two measures to evaluate the discrepancy between two intervals;

• New modularity and modularity gain definitions for interval–weighted networks;

• A new community detection framework based on the classic Louvain algorithm for networks considering ranges (intervals) on the edges (interval–weighted networks);

• A new community detection framework based on a hybrid adaptation of Louvain’s algorithm;

• The development of three new centrality measures for interval–weighted networks: the interval–weighted degree (IWD), interval–weighted flow betweenness (IWFB) and interval–weighted flow closeness (IWFC);

• Application of the developed methodologies for both community detection and centrality measures in two real–world networks.

1.2. Thesis Outline

Following the previous introduction, which provides the general context and summarises the original contributions of this thesis, we now provide an overview of its structure, which comprises seven more chapters and three appendices.

Chapter 2 provides an overview of the fundamental notations, terms, concepts and definitions of network (graph) theory, as well as of the corresponding mathematical language (Bollobás, 1998), in order to familiarize the reader with the topics covered in the following chapters.

In Chapter 3, we introduce the basic terms and concepts of interval analysis, followed by the presentation of the essential definitions in interval arithmetic and its properties. Then, after having presented a brief state–of–the–art of interval order relations, and based on our purpose to capture the maximum variability of an interval, a new approach for ranking intervals is proposed.

The next two chapters, Chapters 4 and 5, are dedicated to community detection. In Chapter 4, after presenting the background of the development in community detection, we firstly explain the best known measure to evaluate the quality of a network partition, modularity, and then we devote our attention to the optimization of modularity for weighted networks, using a greedy agglomeration perspective. Finally, we study in detail one of the state–of–the–art community detection methods, in terms of accuracy and efficiency, known as the “Louvain algorithm” (Blondel et al., 2008).

Chapter 5 begins with the extension of modularity and of the modularity gain considering a tabular representation of networks (by contingency tables), based on the concept of the chi-square test of independence, which tests independence between the row and column variables in a contingency table (Everitt, 1992). To generalize these notions to the case of Interval–Weighted Networks (IWN), and due to interval arithmetic pitfalls (e.g., interval dependency, among others), new measures are defined to evaluate the difference between two intervals. Furthermore, we propose a new approach for computing modularity and modularity gain using a vector that stores the difference between two intervals (this vector contains two values instead of a single real value). Based on these findings, we then extend modularity to the case of an interval–weighted network (IWN), as a “sum over all the different communities in the community structure” of the difference between the observed and the expected intervals, and then the modularity gain obtained by merging two communities into a single community.

These approaches were then extended to the Louvain algorithm (LA), developing a methodology based on two major methods: Interval Modularity–1, which extends the “classical” modularity and modularity gains of weighted networks to the case of interval–weighted networks, and Interval Modularity–2, which uses the vector difference of the intervals in the calculations performed. For each of them, according to the phases of Louvain’s algorithm (phase 1: optimization; and phase 2: aggregation), two different approaches were developed to deal with IWN.

In Chapter 6, we first discuss the three well–known (classical) centrality measures that aim at finding the most central vertex within a weighted network: degree centrality, closeness centrality and betweenness centrality. Then, we consider the extensions of these measures to IWN. Firstly, we define the Interval–Weighted Degree (IWD), taking into account both edge weights and the number of intermediate vertices by introducing a tuning parameter ($\alpha$). Secondly, using the concept of flow networks based on Ford and Fulkerson’s algorithm, closeness and betweenness centralities are extended as Interval–Weighted Flow Closeness (IWFC) and Interval–Weighted Flow Betweenness (IWFB).

Finally, in Chapter 7, to evaluate and illustrate the outcome of the proposed community detection methodology and centrality measures for IWN, two real–world interval–weighted networks are analysed. The first is a commuter network in mainland Portugal, between the twenty-three NUTS 3 regions (IWCN), used to highlight the community structure that emerges from the movements of daily commuters. The second focuses on annual merchandise trade between 28 European countries, from 2003 to 2015, analysing the commercial communities that emerge between these countries during the thirteen-year period considered.

The final chapter (Chapter 8) concludes the thesis by reviewing the most relevant research outcomes and outlining possible future developments.

The central objective of this chapter is to introduce the notations and terms that will be used throughout this thesis, as well as some basic concepts of a branch of discrete mathematics known as Graph Theory. Graph theory provides both an appropriate mathematical representation of a network and a set of concepts that can be used to study formal properties of networks (Harary, 1969; Wasserman and Faust, 1994; Bollobás, 1998; West, 2008; Newman, 2010; Barabasi, 2016). Graphs or networks provide a structural model that makes it possible to analyse and understand how many separate systems act together, thereby allowing us to understand patterns and regularities of interactions, and so constitute a first step in understanding complex systems (Albert and Barabási, 2002; Newman, 2003a; Barabasi, 2002; Watts, 2004; Reichardt, 2009). The visual representation of data that a graph or network provides (social scientists frequently call it a sociogram) often makes it possible to uncover patterns that might otherwise go undetected (Wasserman and Faust, 1994).

Basically, a network is a collection of vertices joined by edges (see Figure 2.1). In the scientific literature, vertices and edges go by different names: in computer science they are also called nodes and links, in physics sites and bonds, and in sociology actors and ties. Table 2.1 lists examples of vertices and edges in ten networks frequently used by researchers to illustrate key network properties (adapted from Newman, 2010 and Barabasi, 2016).

Note 2.1. Hereafter, throughout this thesis we will use the denomination network whenever we refer to graphs.


Table 2.1: Examples of vertices and edges in particular networks.

| Network | Vertex | Edge |
|---|---|---|
| Internet | Computer or router | Cable or wireless data connection |
| World Wide Web | Web page | Hyperlink |
| Citation network | Article, patent, or legal case | Citation |
| Power grid | Generating station or substation | Transmission line |
| Friendship network | Person | Friendship |
| Metabolic network | Metabolite | Metabolic reaction |
| Neural network | Neuron | Synapse |
| Food web | Species | Predation |
| Mobile phone calls | Subscribers | Calls |
| Email | Email addresses | Emails |

Figure 2.1 depicts different networks that share the same graph. While the nature of the vertices and the edges differs, the networks represented in Figures 2.1a, 2.1b and 2.1c have the same graph representation, consisting of $|V| = 4$ vertices and $|E| = 4$ edges, as represented in Figure 2.1d.

[Figure 2.1, panels (a) to (d): images not reproduced; panel (d) shows an undirected network on the vertices $v_1$, $v_2$, $v_3$ and $v_4$.]

Figure 2.1: (a) The Internet, where routers (specialized computers) are connected to each other; (b) the Hollywood actor network, where two actors are connected if they played in the same movie; (c) a protein-protein interaction network, where two proteins are connected if there is experimental evidence that they can bind to each other in the cell; (d) an undirected network with four vertices and four edges (Figures 2.1a, 2.1b and 2.1c were extracted from Barabasi, 2016).


2.1. Notations and basic concepts

A network $G$ is an entity used to represent the existence or absence of links among various objects. In other words, a network is an ordered pair of disjoint sets $(V, E)$ such that $E$ is a subset of the set $V^{(2)}$ of unordered pairs of elements of $V$.

According to the absence or presence of direction (orientation) on the edges, a network can be defined as undirected or directed (or digraph). In undirected networks, the edges have no orientation, i.e., each pair $(i, j) \in E$ is considered unordered, $(i, j) = (j, i)$, and the adjacency matrix is symmetric, $A = A^T$ ($A_{ij} = A_{ji}$) (e.g., in friendship networks such as Facebook, when two persons become friends the relationship is reciprocal, that is, they can see each other's posts). On the other hand, there are situations where it is important to consider the order of the two vertices connected by an edge. Therefore, in order to better describe this situation, we need to extend the previous mathematical concept to a directed network. Lastly, we may be interested in capturing the intensity of the interaction (or strength) between vertices. The best way to express this is by using a weighted network, in which edges are associated with weights (or values).

Undirected Networks

Mathematically, an undirected network can be defined in the following way:

Definition 2.1 (Undirected Network). A network $G(V, E)$, more specifically an undirected network, consists of two sets, $V \neq \emptyset$ (finite) and $E$. It represents the pairwise interactions between $n \in \mathbb{N}$ individual objects (or agents) and is defined by a set of vertices (or nodes or points) $V = \{v_1, \dots, v_n\}$ and a set of edges (or links or lines) $E = \{e_1, \dots, e_m\}$, with $E \subseteq \{(i, j) : i, j \in V\}$.

A pair $(i, j)$ belongs to $E$ if there is an interaction between the objects $i$ and $j$. The number of edges in the network, i.e., the cardinality of the set $E$, is denoted by $|E| = m$. An edge $(i, j)$ is said to be incident to both nodes $i$ and $j$, which in this case are termed neighbours. Likewise, the number of vertices in the network, i.e., the cardinality of the set $V$, is denoted by $|V| = n$.

Note 2.2. Throughout this thesis we will consider only finite networks, that is, $V$ and $E$ are always finite, and we will denote the number of vertices in a network by $n$ and the number of edges by $m$.

Matrices are an alternative way to represent and summarize network data. A matrix contains exactly the same information as a network, but is more useful for computation and computer analysis (Wasserman and Faust, 1994)¹. An (unweighted) network can be represented by its adjacency matrix $A \in \{0, 1\}^{n \times n}$ such that

\[
A(i, j) =
\begin{cases}
1, & \text{if } (i, j) \in E, \\
0, & \text{otherwise.}
\end{cases}
\tag{2.1}
\]
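As an illustration of Equation (2.1), the following minimal Python sketch builds the adjacency matrix of a small undirected network with four vertices and four edges. The edge set is a hypothetical choice made only for this example (the edges of Figure 2.1d are not listed in the text), and the function name `adjacency_matrix` is likewise an assumption, not part of the thesis. Vertices $v_1, \dots, v_4$ are indexed from 0 to 3.

```python
def adjacency_matrix(n, edges):
    """Return the n x n 0/1 adjacency matrix of an undirected network.

    `edges` is an iterable of vertex-index pairs (i, j), 0-indexed.
    """
    A = [[0] * n for _ in range(n)]
    for i, j in edges:
        A[i][j] = 1
        A[j][i] = 1  # undirected: (i, j) and (j, i) are the same edge
    return A


# Hypothetical edge set on four vertices (assumed for illustration only).
edges = [(0, 1), (0, 2), (1, 2), (1, 3)]
n = 4
A = adjacency_matrix(n, edges)

# The matrix of an undirected network is symmetric: A[i][j] == A[j][i].
is_symmetric = all(A[i][j] == A[j][i] for i in range(n) for j in range(n))

# Each edge contributes two entries equal to 1, so m is half the total sum.
m = sum(map(sum, A)) // 2

for row in A:
    print(row)
```

In this sketch the symmetry check and the edge count recover the properties stated above for undirected networks, namely $A = A^{T}$ and $|E| = m$.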

Note 2.3. In this thesis, we will only consider networks with strictly positive adjacency matrices, $A(i, j) > 0$, $\forall i, j$.

Directed Networks

The mathematical definition of a directed network is:

Definition 2.2 (Directed Network). A directed network $G(V, E)$ consists of two sets, $V \neq \emptyset$ (finite) and $E$. The elements of $V = \{v_1, \dots, v_n\}$ are the vertices (or nodes or points) of the network $G$. The elements of $E = \{e_1, \dots, e_m\}$ are distinct ordered pairs of distinct elements of $V$, and are called arcs or directed links.

In directed networks an edge (or arc) $(i, j)$ has a source $i$ and a destination $j$. In this kind of network, a neighbour $j$ of a vertex $i$ is called a child when $(i, j) \in E$ and a parent when $(j, i) \in E$. When an edge connects a vertex to itself, $(i, i) \in E$, it is called a self-loop (or just loop); in that case, a neighbour is both a child and a parent at the same time. Good examples of directed networks are Twitter and Instagram, where follower relationships are not necessarily bidirectional: one direction is called followers, and the other following.
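The child and parent relations can be sketched in a few lines of Python. The arc set below is hypothetical and the helper names `children` and `parents` are assumptions introduced only for this illustration; the self-loop $(3, 3)$ is included to show a vertex that is both a child and a parent of itself.

```python
def children(E, i):
    """Vertices j such that the arc (i, j) belongs to E."""
    return {dst for (src, dst) in E if src == i}


def parents(E, i):
    """Vertices j such that the arc (j, i) belongs to E."""
    return {src for (src, dst) in E if dst == i}


# Hypothetical arc set: ordered pairs (source, destination).
E = {(1, 2), (2, 1), (1, 3), (3, 3)}

print(children(E, 1))  # vertices that 1 points to
print(parents(E, 1))   # vertices that point to 1
print(children(E, 3), parents(E, 3))  # the self-loop makes 3 its own child and parent
```

Here vertex 2 is both a child and a parent of vertex 1 because the arcs $(1, 2)$ and $(2, 1)$ are both present, whereas vertex 3 is only a child of vertex 1, so the relation is not symmetric in general.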

¹ The study of a network based on this matrix representation is known as algebraic graph theory or spectral graph theory (Bollobás, 1998).

Weighted Networks

In undirected and directed networks as defined above, for each pair of vertices we can either have an edge or not. However, purely topological models are often inadequate to explain the rich and complex properties observed in real systems, and there is also a need for models that go beyond pure topology. In other words, there is a need to deal with networks displaying a large heterogeneity in the relevance of the connections.

Networks of this kind are better described as weighted networks, i.e., networks in which edges are associated with weights (or values) representing the intensity of the interaction between the incident vertices (Wasserman and Faust, 1994; Newman, 2004b). That is, they are networks where the strength of ties is taken into account. For social networks, as Granovetter (1973) argued, there may be stronger or weaker social ties between individuals (the edge values might represent the strength of social connections, which could reflect duration, emotional intensity, or intimacy, among others). For non-social networks, such as infrastructure and information networks, the strength of ties often reflects the flow of information, energy, people and goods along the tie.

Classic examples of real-world weighted networks are the “World-wide Airport Network”, which contains the list of airport pairs worldwide connected by direct flights and the number of available seats on any given connection (Barrat et al., 2004, 2008; Guimerà et al., 2005; Pastor-Satorras and Vespignani, 2007), and the “Scientific Collaboration Network” of scientists who have authored manuscripts submitted to the condensed matter physics e-print archive, where the weights measure the intensity $w_{ij}$ of the interaction (Barrat et al., 2004).

Weighted networks with binary weights or negative links (0/1 or +/-) form a special case, called signed networks (Traag et al., 2013, 2018).

Definition 2.3 (Weighted Network). A weighted or valued network $G^{W}(V, E, W)$ consists of a set $V = \{v_1, \dots, v_n\} \neq \emptyset$ of vertices, a set $E = \{e_1, \dots, e_m\}$ of edges and a set of weights or values $W = \{w_1, \dots, w_m\}$, positive real numbers associated with the edges.
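A weighted network in the sense of Definition 2.3 can be sketched in Python as a mapping from edges to positive weights. The weights below are made-up values used only for illustration, and the function name `weight_matrix`, as well as the convention of storing 0 for absent edges, are assumptions of this sketch rather than part of the definition. The network is taken to be undirected.

```python
def weight_matrix(n, weighted_edges):
    """Return an n x n matrix with entry w_ij for each edge (i, j), 0 elsewhere.

    `weighted_edges` maps 0-indexed vertex pairs (i, j) to positive weights.
    """
    W = [[0.0] * n for _ in range(n)]
    for (i, j), w in weighted_edges.items():
        if w <= 0:
            # Definition 2.3 requires positive real weights.
            raise ValueError(f"weight of edge {(i, j)} must be positive, got {w}")
        W[i][j] = w
        W[j][i] = w  # undirected: the weight is shared by (i, j) and (j, i)
    return W


# Hypothetical weights on four edges (illustrative values only).
weighted_edges = {(0, 1): 2.5, (0, 2): 1.0, (1, 2): 4.0, (1, 3): 0.5}
W = weight_matrix(4, weighted_edges)

# Dropping the weights recovers the 0/1 adjacency matrix of the same network.
A = [[1 if w > 0 else 0 for w in row] for row in W]

for row in W:
    print(row)
```

Thresholding the weight matrix at zero returns the unweighted adjacency matrix, which shows that the weighted representation carries the topology plus the intensity of each tie.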