Condensed Graphs: Towards a General Approach for Faster Subgraph Census


Miguel Lopes Martins

Master’s degree in Computer Science

Computer Science Department 2019

Supervisor

Pedro Manuel Pinto Ribeiro, Assistant Professor, Faculty of Sciences, University of Porto


All corrections determined by the jury, and only those, have been made.

The President of the Jury,


Acknowledgments

There is a special place in my heart for my parents and godmother who, in the face of the several and severe adversities I faced during the year I wrote this thesis, served as my guiding light and safe harbour. Without their love and constant care, this work would have been left unfinished.

To my advisor Pedro Ribeiro, for challenging me and being dutiful with his replies to my needs, even with his tight and stressful schedule. I have to admit my favorite part was the deviations we allowed ourselves to take in each meeting, talking about music and movies, sharing silly jokes and other geeky things.

Finally, to all my friends in Coimbra and Anadia for the jests, meat and mead we shared. All of them have a special place in my heart, in their own peculiar ways. Especially Fábio Ribeiro, for being my main backbone in this journey I took in Porto.


Abstract

Determining subgraph frequencies is at the core of several graph mining methodologies such as discovering network motifs or computing graphlet degree distributions. Current state-of-the-art algorithms for this task either take advantage of common patterns emerging on the networks or target a set of specific subgraphs for which analytical calculations are feasible. In this thesis, we propose a novel generic framework revolving around a new data structure, the Condensed Graph, that combines both of the aforementioned approaches, generalized to support any subgraph topology and size. Furthermore, our methodology can use any enumeration-based census algorithm as a baseline, speeding up its computation. We target simple topologies that allow us to skip several redundant and heavy computational steps using combinatorics. We were able to achieve substantial improvements on a broad set of real and synthetic networks, with evidence of exponential speedup for our best cases, where these patterns represent up to 97% of the network.

Keywords— Subgraph Census, Motif Discovery, Graphlets, Isomorphism, Condensed Graphs, Compression, Data-Structures, Algorithms, Network Science, Complex Networks


Resumo

Determining subgraph frequencies is one of the crucial tasks of several graph mining procedures, such as discovering network motifs or computing graphlet degree distributions. Current state-of-the-art algorithms use techniques that take advantage of recurring patterns in networks, or focus on restricted sets of specific subgraphs for which mathematical calculations are feasible.

In this thesis, we propose an innovative generic framework for which we developed a new data structure, Condensed Graphs, that lets us take advantage of both strategies found in the literature, for subgraphs of any topology or size. In short, we use the properties of simple topologies and, through combinatorial analysis, avoid heavy and redundant computations. In addition, our methodology allows us to take any enumeration-based algorithm and reduce its computation time. We achieved significant improvements, with evidence of exponential speedup for our best cases, where the patterns we focus on represent 97% of the size of the network, on a diverse set of real and synthetic networks.


List of Tables

6.1 The set of real networks in decreasing order of compression ratio (CR).

6.2 Speedup of our adaptations vs baseline algorithms on real networks.

6.3 The set of used synthetic networks generated using the NetworkX package.


List of Figures

1.1 Diagram of the Seven Bridges of Königsberg

1.2 Graph overlaid over the diagram of the city of Königsberg

2.1 An undirected graph G (left) and a directed graph G′ (right). Note how the direction of the edges affects the set of edges of G′

2.2 Difference between a regular subgraph and an induced subgraph

2.3 A Clique and an Independent Set contained in the same Graph

2.4 Two isomorphic graphs and the bijective function f between them.

2.5 A graph with two orbits, depicted in yellow and gray, followed by its automorphism set Aut(G).

2.6 All induced occurrences of a subgraph of size 3 on graph G.

2.7 Visual representation of Network Motifs on a real network.

2.8 Automorphism orbits for all possible graphlets from sizes 3 to 5.

3.1 Two induced occurrences on G that map to the same canonical class.

3.2 Example of an ESU-tree

3.3 Resulting trie for an input set of four words.

3.4 A subgraph of size 3 shared among four subgraphs of size 6.

3.5 A G-trie generated from an input set of 6 subgraphs.

3.6 The effect of different labeling schemes on the same graph.

3.7 Different LS-Labeling schemes using the adjacency list and matrix

3.8 Example of a G-Trie using the LS-Labeling scheme.

4.1 Visual representation of a graph containing 4 peripheral stars.

4.2 Graph containing a peripheral star and all its possible enumerations

4.3 Graphical representations of the binomial function’s properties.

4.4 Difference between lossless and Isomorphical Reconstruction.

4.5 A graph G and the output of the condensation algorithm, C(G).

4.6 Adjacency matrix of D(C(G)).

5.1 The result of condensation on a graph G and all the occurrences of its extensions.

5.2 Example of a Co-ESU-tree.


List of Algorithms

3.1 ESU: enumerates all induced k-subgraphs of G.

3.2 Insertion: inserting a subgraph G on a G-Trie T

3.3 Census: Census of subgraphs of T in a graph G.

3.4 FindConditions: Symmetry breaking conditions for graph G.

3.5 Insertion [Symmetry breaking]: inserting a subgraph G on T

3.6 Census [Symmetry breaking]: census of subgraphs of T on G.

3.7 FaSE: enumerates and counts all induced k-subgraphs of G.

4.1 CondenseGraph: Condenses an input graph G.

4.2 DecondenseGraph: Decondenses C(G) and returns D(C(G)).

4.3 CreateCondensedGraph: Creates a C from a graph G.

5.1 CGF: takes any algorithm A on a graph G and performs subgraph census using Condensed Graphs.


1 Introduction

Network Science traces back strongly to classical Graph Theory. In fact, one can make a strong case that the terms graph and network can be used interchangeably, since the term network usually refers to some type of large-scale graph. This means that when we use modern network analysis techniques, we are standing on the shoulders of giants such as Leonhard Euler, whose solution to the Seven Bridges of Königsberg problem in 1735 [Wil86] is considered by many the genesis of Graph Theory, and Paul Erdős, whose paper on Random Networks [ER60] is arguably one of the main reasons Network Science is so relevant today.

Many complex real-world problems become more tractable when represented as networks. In fact, the underlying properties of a graph convey important information about the originally observed phenomena. Since these mathematical structures are completely abstract and independent of context, they are highly applicable in a multitude of scientific fields, ranging from Biology [MV07, CSH+14], Sociology [TMP12], Neuroscience [ALB+13] and Cheminformatics [LREH+11] to Time-Series Analysis [LLB+08].

In this thesis we focus on one of these properties, more specifically, subgraphs. Although a more formal definition is given in Chapter 2, subgraphs are essentially small patterns contained in a network that can convey crucial information about its properties.

The explanatory potential of computing frequencies of subgraphs plays a crucial role in several research branches of network science, as we shall see throughout this thesis. At the core of the different approaches regarding subgraph occurrences lies the subgraph census problem, that is, computing the frequencies of a set of subgraphs. However, this is a fundamentally hard computational task that is related to the subgraph isomorphism problem, which is NP-complete [Coo71].

We propose a novel framework that draws inspiration from compression techniques coupled with simple combinatorics targeted at simple patterns in order to accelerate the frequency counting process. The key advantage is that it can take any enumeration based algorithm and improve its performance, should these patterns occur.


1.1 An Informal Introduction to Graph Theory

We will introduce the applicability of networks by drawing inspiration from the introduction of [Bar14, Section 2.1].

Evidence suggests that the problem of the Seven Bridges of Königsberg is the first known instance of a mathematical solution that uses a concept similar to our modern definition of a graph [Wil86], and it was solved in 1735 by Leonhard Euler. The city of Königsberg, the capital of Eastern Prussia, had a prolific merchant economy. This prosperous time led to the building of seven bridges around the Prussian metropolis. Five of those bridges connected the isle of Kneiphof to the mainland, and the remaining two crossed the branches of the river Pregel. With this new commodity, the need to know the best path to visit the whole city arose. Specifically, was it possible to perform a complete tour of Königsberg crossing each bridge exactly once?

The solution proposed by Euler can be informally described as follows:

A tour must have a starting and an ending point. Remember that we can only use each bridge once, either to enter or to leave a patch of land, meaning that for each intermediary patch we visit, we need to enter and leave it exactly the same number of times. This implies that it must be touched by an even number of bridges. There can be either none or exactly two patches with an odd number of bridges and, should they exist, they need to be the starting and ending points of the tour, since we leave the starting point once more than we enter it and enter the finishing point once more than we leave it. We must also be able to reach the city in its entirety, meaning that there can be no isolated patches of land. A simple map of the city can be seen in figure 1.1.

Figure 1.1: Representation of the diagram of the Seven Bridges of Königsberg presented by Euler in 1735 (Adapted from [GHSL07]).

We just informally used graph theory in order to solve this problem. Each patch of land can be represented as a vertex, and each bridge connecting two patches as an edge, and these two sets form a graph. The number of bridges of each patch of land, or number of edges incident to each vertex, is defined as the degree of the vertex, and the fact that there must be no isolated patches of land is equivalent to saying that the resulting graph must be connected. Finally, the tour of Königsberg we are looking for is known as an Euler tour on a graph. With this in mind, figure 1.2 shows the city of Königsberg modelled as a graph.

Figure 1.2: The city of Königsberg represented as a graph. See how different regions of the city correspond to vertices, and bridges to edges connecting them on the graph.

Theorem 1.1.1. A connected graph G has an Euler tour if and only if it has exactly 0 or 2 vertices of odd degree.
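As an illustration of Theorem 1.1.1, the degree test can be sketched in a few lines of Python. The land-mass labels A to D, the bridge list and the function names are our own encoding of the map, assumed here for illustration only.

```python
from collections import defaultdict

# Hypothetical labels: A = Kneiphof island, B and C = the two river banks,
# D = the second island. Each pair is one bridge (multi-edges are allowed).
BRIDGES = [("A", "B"), ("A", "B"), ("A", "C"), ("A", "C"),
           ("A", "D"), ("B", "D"), ("C", "D")]


def degrees(edges):
    """Count how many edge end-points touch each vertex."""
    deg = defaultdict(int)
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return dict(deg)


def is_connected(edges):
    """Depth-first search over the vertices that appear in `edges`."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    if not adj:
        return True
    start = next(iter(adj))
    seen, stack = {start}, [start]
    while stack:
        for w in adj[stack.pop()]:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return len(seen) == len(adj)


def has_euler_tour(edges):
    """Criterion of Theorem 1.1.1: connected, with 0 or 2 odd-degree vertices."""
    odd = sum(1 for d in degrees(edges).values() if d % 2 == 1)
    return is_connected(edges) and odd in (0, 2)
```

With this encoding all four land masses have odd degree, so `has_euler_tour(BRIDGES)` is false, while adding a single bridge between the two banks would leave exactly two odd-degree vertices.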

1.2 Motivation

The answer to the Seven Bridges of Königsberg problem points in an important direction: not only are we now able to prove conclusively that there is no possible Euler tour of the city, but we are also able to adapt our solution to any city or location, provided that we know its layout. Moreover, the topology of the resulting graph is the essence of the solution, and the only way to make such a tour possible is to add or remove bridges so that the resulting graph meets the criterion of theorem 1.1.1.

Modern Network Analysis takes advantage of the intrinsic properties of a graph to describe a wide plethora of phenomena. Although vertex degree is one of the most fundamental measures to infer the topology of a graph, there exists a wide set of properties that we can extract that gives us further intuition relating to the type of network we are studying. Thanks to graph theory, we can extract a considerable amount of information regardless of context, since the main object of study becomes the topology of the network.


Some of these properties, like the degree distribution, are trivial to extract. Others, such as those related to the frequencies of subgraphs, are computationally demanding, and there is not even proof that a truly efficient solution exists.

The remainder of this section delves into details that will be adequately introduced in Chapter 2. We advise, should the reader be inexperienced in this field, to consult the preliminaries first.

We already established that, since a graph is an abstract mathematical construct, it can be applied to a broad range of scientific fields. The process of extracting induced subgraph frequencies is related to the Subgraph Census Problem, that is, given a number k and a graph G, output the exact number of occurrences of each of the induced subgraphs of size k on G. Typically, in order to characterize a network through its subgraphs, we are concerned with the concept of frequent subgraphs, that is, the set of prominent subgraphs on G according to some criteria over their frequencies, like those seen in Motif discovery [MSOI+02] and Graphlet degree distributions [PCJ04]. The former defines frequent subgraphs as those that are over-represented when compared to the frequencies extracted from an ensemble of random networks, and has been shown to have high applicability on biological [DBBO04, SK04, MBV05, PIL05, Alo07, Kon08], social [HEL04, SC11] and World-Wide-Web [KSS97] networks. In Graphlet degree distributions, the frequencies of a set of subgraphs are used as a similarity measure between networks, and this measure has been extensively used in the comparison of biological [PCJ06, Prž07] and social [LB11, JHK12] networks.

1.3 Goals and Contributions

The purpose of this thesis is to improve the efficiency of solutions for the subgraph census problem, specifically by building an abstract framework that takes any enumeration-based census algorithm and accelerates it. Typically, subgraph counting algorithms only add a single occurrence of each subgraph per recursion branch. Through a new perspective on graph compression, we compress the input graph in a way that makes it possible to analytically calculate several combinations of topologically equivalent vertices, and use these results to add several occurrences in a single branch. We show that this framework is highly adaptable by using it to extend and accelerate state-of-the-art algorithms such as ESU [WR06] and FaSE [PR13].

1.4 Thesis Outline

This thesis is structured in seven chapters. A brief description of each one is provided below.

Chapter 1 - Introduction Introduces the thesis as well as the motivation behind it. Additionally, it presents the thesis’ structure.

Chapter 2 - Preliminaries Sets the formal graph-theoretical basis for graphs and Network Science. It states the thesis’ problem definition, the Subgraph Census Problem. Additionally, it introduces the fields of Frequent Subgraph Mining, Graphlet Degree Distributions and Network Motifs.

Chapter 3 - Algorithms for subgraph enumeration and counting Briefly expands on the problem definition and thoroughly explains, as closely as possible to the original authors’ literature, specific subgraph census algorithms (ESU, G-Tries and FaSE).

Chapter 4 - Condensed Graphs Introduces our novel data structure, the Condensed Graph, preceded by a solid theoretical foundation. Furthermore, it defines a new concept, Isomorphical Reconstruction, required to understand the concepts in Chapter 5.

Chapter 5 - Condensed Graph Framework Expands the concepts introduced in Chapter 4 in the context of subgraph census and thoroughly explains our strategy for building an expandable, adaptable and generic framework. Afterwards, we apply this procedure to known subgraph census algorithms, ESU and FaSE.

Chapter 6 - Performance evaluation Presents speedup measurements of ESU and FaSE versus their counterparts adapted with our compression scheme, on a set of real and synthetic networks. It finishes with a discussion of the conclusions drawn from the experiments.

Chapter 7 - Conclusions and Future Work States our main contributions as well as the improvements we have planned for future work.

The thesis is purposefully structured in an unconventional way. If you are already familiar with the problems addressed by this thesis we advise you to, after Chapter 2, delve into the state-of-the-art described in Chapter 4 section 4.1.

Notwithstanding, our goal was to ease any reader from a STEM field into the more formal part of the thesis. Should you fall within this category, we advise you to follow the thesis’ structure, as it eases you into the applicability and the types of approaches that are addressed and impacted by our contributions.

1.5 Bibliographic Note

The work developed in this thesis originated a paper called “Condensed Graphs: a generic framework for accelerating subgraph census computation”, accepted, with oral presentation, to CompleNet 2020 on 20 December 2019. This work was financed by National Funds through the Portuguese funding agency, FCT - “Fundação para as Ciências e Tecnologias”, with project: UID/EEA/50014/2019.


2 Preliminaries

In this chapter we introduce the notation and precise terminology used throughout this thesis. We start by presenting general concepts of graph theory, followed by an explanation of the importance of subgraphs in networks.

We conclude by performing a small survey on different techniques used in Network Science related to subgraph enumeration and counting, specifically, Frequent Subgraph Mining, Network Motifs and Graphlet Degree Distributions.

2.1 Basic Graph Theory

Although the term Network tends to be more generic than Graph, for example in the case of complex Multi-layer Networks, as shown in [KAB+14], in the context of this thesis these terms will be used interchangeably.

A graph G = (V, E) consists of a set of vertices, V(G), connected by a set of edges, E(G). More specifically, two vertices u, v ∈ V(G) are connected if (u, v) ∈ E(G). Note that in Network Science, vertices and edges are usually called nodes and links, respectively. A graph is said to be directed if its set of edges is comprised of ordered pairs. More specifically, for two vertices u, v ∈ V(G), (u, v) ∈ E(G) ⇏ (v, u) ∈ E(G).

Figure 2.1: An undirected graph G (left) and a directed graph G′ (right). Note how the direction of the edges affects the set of edges of G′.

Two vertices are said to be adjacent if they share an edge between them. This notion of adjacency is tightly knit to the concept of degree. On an undirected graph, the degree of u, d(u), measures the number of its adjacent vertices.


Note that, by definition, this concept of degree does not extend directly to directed graphs. Specifically, we need to measure d_in(u) and d_out(u), where the former counts incoming edges and the latter outgoing edges.

A graph is said to be simple if it is undirected and it has no self-loops, that is, (u, u) ∉ E(G), ∀ u ∈ V(G), nor multi-edges, meaning that any two vertices can be connected by at most one edge.

The size of a graph is measured by the size of its vertex set. Consider a graph G such that |V(G)| = k, then G is a k-graph.

A graph S is said to be a subgraph of G if V(S) ⊆ V(G) and E(S) ⊆ E(G). If all the edges of G that have both end-points in V(S) are present in E(S), then S is said to be an induced subgraph. The difference between these two concepts is illustrated in figure 2.2.

Figure 2.2: Two subgraph examples (outlined in black) on the same graph. (Left) Note how on the non-induced case, the subgraph does not use all edges between the vertices.
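The distinction between the two notions can be made concrete with a small Python sketch. The function names and the representation of a graph as a vertex collection plus a list of undirected edge pairs are assumptions made for this example, not notation from the thesis.

```python
def normalise(edges):
    """Represent each undirected edge as an unordered pair."""
    return {frozenset(e) for e in edges}


def is_subgraph(s_vertices, s_edges, g_vertices, g_edges):
    """V(S) is a subset of V(G) and E(S) is a subset of E(G)."""
    return (set(s_vertices) <= set(g_vertices)
            and normalise(s_edges) <= normalise(g_edges))


def is_induced_subgraph(s_vertices, s_edges, g_vertices, g_edges):
    """S is a subgraph that keeps every edge of G with both end-points in V(S)."""
    if not is_subgraph(s_vertices, s_edges, g_vertices, g_edges):
        return False
    vs = set(s_vertices)
    inside = {e for e in normalise(g_edges) if e <= vs}
    return inside == normalise(s_edges)
```

For a triangle with one pendant vertex, taking the three triangle vertices with only two of their edges gives a subgraph that is not induced, because the third edge of G between those vertices is missing.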

The open neighbourhood of u ∈ V(G), N(u), is the induced subgraph formed by all the vertices adjacent to u.

The closed neighbourhood of u ∈ V(G), N_G(u), additionally includes u in the vertex set that defines the induced subgraph.

A clique or complete graph is a graph where all vertices are adjacent to each other. An independent set, or anti-clique, is a subset of vertices of a graph G such that no two of them are adjacent. Figure 2.3 illustrates the contrast between these two structures.

Figure 2.3: A clique (outlined in black) and an independent set (outlined in white), both induced on the same graph.


Two graphs, G and G′, are said to be isomorphic if and only if there is a bijection f : V(G) → V(G′) such that (u, v) ∈ E(G) ⇐⇒ (f(u), f(v)) ∈ E(G′) (see figure 2.4).

Figure 2.4: Two isomorphic graphs and the bijective function f between them.
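For very small graphs the definition can be checked directly by trying every bijection. The following Python sketch does exactly that; it is a brute-force illustration for simple undirected graphs (no self-loops), not one of the algorithms discussed later, and `find_isomorphism` is a name introduced here.

```python
from itertools import permutations


def find_isomorphism(vertices_g, edges_g, vertices_h, edges_h):
    """Return a bijection f with (u, v) in E(G) <=> (f(u), f(v)) in E(H), or None."""
    vg, vh = sorted(vertices_g), sorted(vertices_h)
    eg = {frozenset(e) for e in edges_g}
    eh = {frozenset(e) for e in edges_h}
    if len(vg) != len(vh) or len(eg) != len(eh):
        return None
    for image in permutations(vh):
        f = dict(zip(vg, image))
        # With equal edge counts and f a bijection, mapping every edge of G
        # onto an edge of H is enough to guarantee the equivalence.
        if all(frozenset((f[u], f[v])) in eh for u, v in map(tuple, eg)):
            return f
    return None
```

The search takes factorial time in the number of vertices, which is one way to see why the isomorphism tests inside a census are worth avoiding whenever possible.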

An automorphism is an isomorphism from a graph onto itself, and the set of automorphisms of a graph G is denoted as Aut(G). Consider a vertex u ∈ V (G), then the automorphism orbit of u is:

\[ \mathrm{Orb}(u) = \{\, v \in V(G) \mid v = g(u),\ g \in \mathrm{Aut}(G) \,\} \tag{2.1} \]

Simply put, if u and v are in the same orbit, they are topologically equivalent, which means one could swap their labels without altering the graph topology. Figure 2.5 gives a concrete example of a set of automorphisms and the respective orbits.

Figure 2.5: A graph with two orbits, depicted in yellow and gray, followed by its automorphism set Aut(G).
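Equation 2.1 can be evaluated by brute force on small graphs, which helps build intuition for topologically equivalent vertices. The sketch below enumerates Aut(G) over all vertex permutations of a simple undirected graph; the function names are ours and the approach is only practical for a handful of vertices.

```python
from itertools import permutations


def automorphisms(vertices, edges):
    """All bijections g of V(G) onto itself that map E(G) onto E(G)."""
    vs = sorted(vertices)
    es = {frozenset(e) for e in edges}
    result = []
    for image in permutations(vs):
        g = dict(zip(vs, image))
        if {frozenset(g[x] for x in e) for e in es} == es:
            result.append(g)
    return result


def orbits(vertices, edges):
    """Partition V(G) into automorphism orbits, following equation 2.1."""
    autos = automorphisms(vertices, edges)
    seen, parts = set(), []
    for u in sorted(vertices):
        if u not in seen:
            orb = {g[u] for g in autos}
            seen |= orb
            parts.append(orb)
    return parts
```

On a star with centre 0 and leaves 1, 2 and 3, every permutation of the leaves is an automorphism, so the leaves share one orbit and the centre sits alone in another.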

Referring back to figure 2.3, note that the graph as a whole has two orbits: one containing the vertices of the induced clique, and another containing the vertices of the independent set.

The degree distribution of an undirected graph, P(k), measures the fraction of nodes in the graph with degree k. This is one of the crucial properties used to characterize a network. For example, many real-world networks are said to be scale-free, that is, their degree distribution follows a power law [BA99], P(k) ∼ k^(−γ).
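The degree distribution is cheap to compute, as the following minimal Python sketch suggests. It assumes an undirected graph given as a vertex collection and a list of edge pairs; `degree_distribution` is a name chosen for this example.

```python
from collections import Counter


def degree_distribution(vertices, edges):
    """Return P(k): the fraction of vertices with degree k, for each observed k."""
    degree = {v: 0 for v in vertices}
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    n = len(degree)
    return {k: count / n for k, count in sorted(Counter(degree.values()).items())}
```

A single pass over the edge list is enough, which contrasts with the cost of the subgraph-based properties discussed in the remainder of this chapter.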

Usually a graph is represented through an adjacency matrix, A ∈ ℝ^(|V|×|V|). On a plain graph, A[i][j] = 1 if (i, j) ∈ E and 0 otherwise. This representation has a space complexity of Θ(|V|²), with the benefit of a lookup complexity of Θ(1). Since real-world networks are often large and sparse, an alternative representation was developed, the adjacency list.


In an adjacency list, each row contains the list of connections of one node, meaning that a connection is only stored when (i, j) ∈ E. This has the benefit of saving space for sparse networks, the worst case being when the network is a complete graph, O(|V|²). Notwithstanding, it implies a space versus lookup speed trade-off, where queries may now take O(log(n)) using binary search over a sorted list.
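The trade-off between the two representations can be sketched as follows. This is a minimal Python illustration for an undirected graph whose vertices are numbered 0 to n − 1; the function names are ours, and keeping each neighbour list sorted is the assumption that makes binary search possible.

```python
from bisect import bisect_left


def adjacency_matrix(n, edges):
    """Dense |V| x |V| matrix: quadratic space, constant-time edge lookup."""
    a = [[0] * n for _ in range(n)]
    for i, j in edges:
        a[i][j] = a[j][i] = 1
    return a


def adjacency_list(n, edges):
    """One sorted neighbour list per vertex: space proportional to |V| + |E|."""
    adj = [[] for _ in range(n)]
    for i, j in edges:
        adj[i].append(j)
        adj[j].append(i)
    for row in adj:
        row.sort()
    return adj


def has_edge(adj, i, j):
    """Binary search for j in the sorted neighbour list of i."""
    row = adj[i]
    pos = bisect_left(row, j)
    return pos < len(row) and row[pos] == j
```

The matrix answers `a[i][j] == 1` directly, while `has_edge` pays a logarithmic search in the length of one neighbour list in exchange for the smaller footprint.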

2.2 Thesis’ Scope and Problem Definition

The term pattern in the context of Network Science usually refers to some kind of subgraph contained in G. Throughout this section we will discuss different techniques designed to measure the importance of these patterns. We will start by giving some context about Frequent Subgraph Mining (FSM), the broader scientific field that attributes importance to subgraphs through their frequency of occurrence in the network. We then proceed to state our problem definition, which is closely related to typical FSM, the main difference being that we are only concerned with induced subgraphs. Afterwards we will give two examples of techniques that fall closer within the scope of applicability of this thesis. Firstly, we tackle Network Motifs, which define an induced subgraph as being representative of the network if it is over-represented when compared to its occurrences computed on Random Networks. Finally, we discuss Graphlet Degree Distributions, a methodology that allows measurements of similarity between two networks.

Frequent Subgraph Mining

Frequent Subgraph Mining is the name of the Data Mining field that focuses on extracting insights on complex networks through the frequencies of their subgraphs. The core idea is that we first count the frequency of each subgraph, and then classify an occurring subgraph as frequent (i.e. insightful with respect to the network) through additional constraints. In FSM, we usually call the number of occurrences of a subgraph its support. Moreover, the definition of support varies according to the type of FSM performed.

The two standard ways of performing FSM are: single graph based FSM and graph transaction based FSM. They are formally defined as follows:

Definition 2.1 (Single Graph FSM). Given a graph G, find all its frequent subgraphs. For a given δ such that δ ≥ 0, a subgraph G_k is frequent if its support sup_{G_k} ≥ δ.

Definition 2.2 (Graph Transaction Based FSM). Given an input set of N transactions 𝒢 = {G_1, . . . , G_N}, find all its frequent subgraphs. The support of a subgraph G′_k is given by

\[ \mathrm{sup}^{\mathcal{G}}_{G'_k} = \frac{|\{\, G_i \mid G'_k \subseteq G_i \,\}|}{N}. \]

Given a δ s.t. δ ≥ 0, G′_k is considered frequent if sup^𝒢_{G′_k} ≥ δ.

Note that the definition of support varies between the two definitions. Furthermore, in Graph Transaction Based FSM we compute the transaction-based support, which can be seen as a generalization, to a set of multiple graphs, of the occurrence-based support computed in Single Graph FSM. Thus, occurrence-based support is a special case of transaction-based support for N = 1 and 𝒢 = {G}.
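To make the transaction-based support of Definition 2.2 concrete, the sketch below computes it by brute force in Python. Graphs are assumed to be pairs of a vertex list and an undirected edge list, containment is tested as non-induced subgraph isomorphism, and all function names are hypothetical; real FSM algorithms such as gSpan avoid this exhaustive search.

```python
from itertools import permutations


def contains(pattern, graph):
    """True if some injective vertex map sends every pattern edge to a graph edge."""
    p_vertices, p_edges = pattern
    g_vertices, g_edges = graph
    p_vertices = list(p_vertices)
    edge_set = {frozenset(e) for e in g_edges}
    for image in permutations(g_vertices, len(p_vertices)):
        f = dict(zip(p_vertices, image))
        if all(frozenset((f[u], f[v])) in edge_set for u, v in p_edges):
            return True
    return False


def transaction_support(pattern, transactions):
    """Fraction of transactions that contain the pattern (Definition 2.2)."""
    return sum(contains(pattern, g) for g in transactions) / len(transactions)


def is_frequent(pattern, transactions, delta):
    """The pattern is frequent when its support reaches the threshold delta."""
    return transaction_support(pattern, transactions) >= delta
```

For a triangle pattern and three transactions (a triangle, a path on three vertices and a complete graph on four vertices), the support is 2/3.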

We first compute the occurrences of a set of pre-computed candidate subgraphs. The type of search and matching is also application-dependent [AHHJ16].

Search can either be complete, if we fully traverse the entirety of the network, or incomplete. Matching relates to the isomorphism test performed when counting the occurrences, and can be exact if we perform a strict isomorphism test, resulting in the exact number of occurrences for complete search. In contrast, if we allow some relaxations on the constraints in the isomorphism test, such as not considering the vertices’ labels as descriptive characteristics of a graph, we are performing approximate matching.

Frequent Subgraph Mining also draws some inspiration from Frequent Itemset Mining using the downward closure property to accelerate counting.

Definition 2.3 (Supergraph). Given two graphs G and S, if V(S) ⊂ V(G), then G is a supergraph of S.

Definition 2.4 (Downward closure for subgraphs). If a graph is frequent, then all of its subgraphs are also frequent. Equivalently, if a subgraph S of G is not frequent, then none of its supergraphs can be frequent.

Examples of classic FSM algorithms are MoFa [BB02], gSpan [YH02], FFSM [HWP03] and Gaston [NK04], and a study comparing them directly was performed in [WMFP05]. There is no general consensus on what the best algorithm is, since performance highly depends on application specific constraints.

Problem Definition

As we have seen in the previous section, there are no a priori constraints regarding the topology of the graph. In this thesis we focus on a more specific problem, defined below.

Definition 2.5 (Subgraph Census Problem). Given some positive integer k and a graph G, count the exact number of distinct occurrences of each of all possible connected induced k-subgraphs of G. Two occurrences are distinct if there is at least one vertex that they do not share.

If the constraint on topology enforced by definition 2.5 is still not clear, you can refer back to section 2.1 of this chapter, specifically figure 2.2. Below, in figure 2.6, we show all occurrences under the constraints stated by the Subgraph Census Problem.

Figure 2.6: (Bottom) All induced occurrences of a subgraph of size 3 (Right) on graph G (Left).
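Definition 2.5 can be turned into a naive reference implementation, sketched below in Python. It tests every k-subset of vertices, keeps the connected induced ones and groups them by a brute-force canonical form. This is only a baseline for building intuition on tiny graphs, not ESU, G-Tries or FaSE, and the function names and the canonical form are choices made for this example.

```python
from collections import Counter
from itertools import combinations, permutations


def induced(vertices, edges):
    """Edges of G with both end-points in `vertices`."""
    vs = set(vertices)
    return {frozenset(e) for e in edges if set(e) <= vs}


def connected(vertices, edge_set):
    """Depth-first search restricted to `vertices`."""
    vertices = list(vertices)
    seen, stack = {vertices[0]}, [vertices[0]]
    while stack:
        u = stack.pop()
        for v in vertices:
            if v not in seen and frozenset((u, v)) in edge_set:
                seen.add(v)
                stack.append(v)
    return len(seen) == len(vertices)


def canonical(vertices, edge_set):
    """Smallest sorted edge list over all relabelings to 0..k-1."""
    best = None
    for perm in permutations(range(len(vertices))):
        relabel = dict(zip(vertices, perm))
        key = tuple(sorted(tuple(sorted(relabel[x] for x in e)) for e in edge_set))
        if best is None or key < best:
            best = key
    return best


def census(vertices, edges, k):
    """Count connected induced k-subgraphs, grouped by isomorphism class."""
    counts = Counter()
    for subset in combinations(sorted(vertices), k):
        es = induced(subset, edges)
        if connected(subset, es):
            counts[canonical(subset, es)] += 1
    return counts
```

On a triangle with one pendant vertex and k = 3, this finds one triangle and two paths, while the fourth vertex subset is discarded because its induced subgraph is disconnected.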

We will proceed with two examples of fields that measure subgraph importance in Networks under the aforementioned constraints.

Network Motifs

We have seen that in FSM we usually define some threshold δ for the support, above which subgraphs are considered frequent.

Milo et al. [MSOI+02] proposed Network Motifs, a methodology that determines which subgraphs are representative of the network. As we stated in the previous section, we will only be concerned with induced subgraphs from now on, so this term will be used interchangeably with subgraph, and will always refer to the induced definition. In Motif discovery, subgraphs are considered frequent only if they have statistical representativeness given a null model. In the case of Milo et al.’s methodology, the null model consists of comparing a real network’s subgraph occurrences with those of random networks.

The Erdős-Rényi model [ER60] is the most commonly known model for Random Networks. Gilbert’s approach tends to simplify the calculation of key network characteristics, and allows the number of links on the network to evolve over time (as stated in section 3.2 of [Bar14]). Following Milo’s definition, given a real-world network, a randomized version is generated such that each vertex has the same number of incoming and outgoing edges or, in other words, the degree sequence is kept intact.


Definition 2.6 (Network Motif). Given a network G and an ensemble of randomized networks, R = {R_1, ..., R_l}, where there is a bijection f : V(G) → V(R_i) for every R_i ∈ R, satisfying d_in(u) = d_in(f(u)) and d_out(u) = d_out(f(u)), an induced k-subgraph is a motif of G if its probability P of appearing in a randomized network from R an equal or greater number of times than in the real network is lower than a given cutoff value, τ.

A visual representation of this process is given in figure 2.7.

Figure 2.7: A real network with an emerging motif (A) and an ensemble of random networks (B). Note how the frequency of the motif (occurrences outlined in red) is considerably higher on the real network than on the randomized ones (taken from [MSOI+02]).
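The two ingredients of Definition 2.6 can be sketched in Python under stated assumptions. The randomization below uses repeated edge swaps, one common way of preserving every in-degree and out-degree that we assume here for illustration (the source does not fix a generation procedure), and P is estimated empirically from the ensemble. The function names, the seed parameter and the default cutoff are hypothetical.

```python
import random


def degree_preserving_shuffle(edges, swaps, seed=0):
    """Randomize a simple directed graph while keeping d_in and d_out of every vertex."""
    rng = random.Random(seed)
    edge_list = list(edges)
    edge_set = set(edge_list)
    for _ in range(swaps):
        i, j = rng.sample(range(len(edge_list)), 2)
        (a, b), (c, d) = edge_list[i], edge_list[j]
        # Rewire a->b, c->d into a->d, c->b unless that creates a
        # self-loop or duplicates an existing edge.
        if a == d or c == b or (a, d) in edge_set or (c, b) in edge_set:
            continue
        edge_set -= {(a, b), (c, d)}
        edge_set |= {(a, d), (c, b)}
        edge_list[i], edge_list[j] = (a, d), (c, b)
    return edge_list


def empirical_p_value(real_count, random_counts):
    """Fraction of randomized networks with at least as many occurrences."""
    return sum(c >= real_count for c in random_counts) / len(random_counts)


def is_motif(real_count, random_counts, tau=0.01):
    """Definition 2.6 with P replaced by its empirical estimate."""
    return empirical_p_value(real_count, random_counts) < tau
```

In practice `random_counts` would hold the frequency of the same subgraph in each randomized network, obtained with a census algorithm, which is why the cost of the census dominates motif discovery.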

The degree constraint ensures that the patterns that emerge are not merely a consequence of single-vertex characteristics. Milo et al. also presented an additional constraint that ensures that high significance is not assigned to a pattern only because it contains a highly significant subgraph: for k-motif discovery, the occurrences of the (k − 1)-subgraphs in R are preserved. Note that there are alternative ways of generating randomized networks for Motif Discovery and of determining which subgraphs are statistically significant [MKI+03, KA05].

Motif discovery has been a crucial tool for research in Biology [BO04, BLC+05], Neurology [RS10b], Medicine [BGL11] and many other fields.

Graphlet Degree Distributions

Milo et al. suggested that motifs could be seen as the building blocks of networks. Graphlet Degree Distributions (GDDs) [PCJ04] use a similar intuition, that is, the frequencies of subgraphs can be used as an indicator to distinguish between different networks. In a way, they can be seen as the fingerprint of a network, and from them we can build a similarity measure to compare different networks that emerge from distinct phenomena.


Note that in [PCJ04], the authors call induced subgraphs graphlets. We will use this terminology only in this section, to avoid unnecessary verbosity throughout the remainder of the thesis.

Consider that we have a set S of n candidate subgraphs. Let s be a subgraph in S, and FN(s) be the frequency of subgraph s on network N. Then the relative frequency of s in N is given by:

$$R_N(s) = -\log\left(\frac{F_N(s)}{\sum_{i \in S} F_N(i)}\right)$$

Note that the frequencies of different graphlets can differ by several orders of magnitude. If we did not apply the negative logarithm function, then RN would be entirely dominated by the most frequent subgraphs.

In order to calculate the distance between two networks N1 and N2 we simply compute:

$$D(N_1, N_2) = \sum_{s \in S} \left| R_{N_1}(s) - R_{N_2}(s) \right|$$
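The relative frequency and the distance between two networks can be sketched as below. This is a minimal illustration, assuming that frequencies are given as dictionaries keyed by subgraph, that every candidate subgraph occurs at least once, and that the logarithm is the natural one (the base is an assumption of this sketch).

```python
import math


def relative_frequencies(freq):
    """R_N(s) = -log(F_N(s) / sum_i F_N(i)) for every candidate subgraph s."""
    total = sum(freq.values())
    if total == 0 or any(f <= 0 for f in freq.values()):
        raise ValueError("every candidate subgraph must occur at least once")
    return {s: -math.log(f / total) for s, f in freq.items()}


def network_distance(freq_a, freq_b):
    """D(N1, N2): sum over s of |R_N1(s) - R_N2(s)|."""
    ra = relative_frequencies(freq_a)
    rb = relative_frequencies(freq_b)
    if ra.keys() != rb.keys():
        raise ValueError("both networks must use the same candidate set S")
    return sum(abs(ra[s] - rb[s]) for s in ra)
```

Because only relative frequencies enter the formula, two networks whose counts differ by a constant factor are at distance zero.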

Let j be some orbit, and let $d_G^j(k)$ be the number of vertices in G that touch orbit j exactly k times, meaning that $d_G^j$ is the j-th GDD. The scaled GDD is given by:

$$S_G^j(k) = \frac{d_G^j(k)}{k}$$

We normalize by k in order to attenuate the impact of larger degrees in the GDD. Let $T_G^j = \sum_{k=1}^{\infty} S_G^j(k)$ be the sum of the scaled GDD over all k; then the final normalized distribution is given by:

$$N_G^j(k) = \frac{S_G^j(k)}{T_G^j}$$

We can think of $N_G^j(k)$ as the fraction of the total area of the complete scaled GDD that pertains to $S_G^j(k)$. Finally, we can define the notion of distance between two networks relative to a given orbit j:

$$D^j(G_1, G_2) = \left( \sum_{k=1}^{\infty} \left[ N_{G_1}^j(k) - N_{G_2}^j(k) \right]^2 \right)^{\frac{1}{2}}$$

The GDD agreement between two networks relative to orbit j is then given by:

$$A^j(G_1, G_2) = 1 - D^j(G_1, G_2)$$

The agreement will be maximal when the networks have identical j-th GDDs, that is, when Dj(G1, G2) = 0.
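The scaling, normalization and agreement steps can be sketched as follows. The sketch assumes that the j-th GDD of each network is given as a dictionary mapping k to the number of vertices touching orbit j exactly k times; the function names are hypothetical.

```python
import math


def normalized_gdd(d):
    """Scale each entry by 1/k, then divide by the total of the scaled values."""
    scaled = {k: count / k for k, count in d.items() if k >= 1}
    total = sum(scaled.values())
    if total == 0:
        raise ValueError("the distribution has no vertex touching this orbit")
    return {k: s / total for k, s in scaled.items()}


def gdd_agreement(d1, d2):
    """A^j(G1, G2) = 1 - D^j(G1, G2), with D^j the Euclidean distance
    between the two normalized distributions."""
    n1, n2 = normalized_gdd(d1), normalized_gdd(d2)
    ks = set(n1) | set(n2)
    dist = math.sqrt(sum((n1.get(k, 0.0) - n2.get(k, 0.0)) ** 2 for k in ks))
    return 1.0 - dist
```

Values of k that are missing from one distribution are treated as zero, which matches summing over all k in the distance formula.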

Figure 2.8: Automorphism orbits for all possible graphlets from sizes 3 to 5 (taken from [Prž07]).

While network motifs are ideal to outline the set of subgraphs that capture the essence of the network, GDDs allow us to compare real networks with each other. Furthermore, through the notion of agreement, we can capture the local topological discrepancies for each graphlet. GDDs have shown promising results in the comparison of biological [PCJ04, PCJ06, Prž07] and social [LB11, JHK12] networks.

The relative frequency RN can be seen as the probability of

A similar concept to network motifs is graphlets [PCJ04]. They are also small graphs that can be seen as building blocks of a network, the main difference being that random networks are not used to verify their over-representation. Figure 2.8 shows the set of the 29 graphlets of sizes 3 to 5, taken from [PCJ04].


3 Algorithms for Subgraph Enumeration and Counting

In the previous chapter, throughout section 2.2, we pinpointed many different constraints regarding what should be considered an important subgraph. Remember that in this thesis we are only concerned with the Subgraph Census Problem. The process of computing subgraph frequencies is at the core of several Network Analysis techniques (like the examples of Motif Discovery and GDDs we gave previously). This fact motivates research on accelerating subgraph frequency counting and, throughout this chapter, we present a short survey of the literature on some of the most modern and powerful subgraph census algorithms.

The process of determining the k-subgraph frequencies of a graph is divided into two distinct steps:

1. Find or explicitly enumerate the distinct instances of the k-subgraphs in the network.

2. Determine which of these subgraphs are isomorphic, and group them into subgraph classes.

Step 2 usually relies on the concept of a canonical class: given a set of isomorphic graphs, map them all to the same abstract topological class. Nauty [McK03] is the most commonly used tool to perform this step; given an input subgraph, it produces a canonically-labelled representation, as shown below in figure 3.1.

Figure 3.1: Two induced occurrences on G (left) that map to the same canonical class (right).
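To make the idea of a canonical class concrete, the following sketch computes a canonical form for very small undirected graphs by brute force. It is only a stand-in for illustration: it tries every vertex ordering, which is not how nauty works, and the function name and the choice of "largest upper-triangle bit string" as the canonical form are assumptions of this sketch.

```python
from itertools import permutations


def canonical_form(vertices, edges):
    """Return the lexicographically largest upper-triangle adjacency bit
    string over all vertex orderings (feasible only for tiny graphs)."""
    vertices = list(vertices)
    n = len(vertices)
    edge_set = {frozenset(e) for e in edges}
    best = None
    for perm in permutations(vertices):
        bits = tuple(
            1 if frozenset((perm[i], perm[j])) in edge_set else 0
            for i in range(n)
            for j in range(i + 1, n)
        )
        if best is None or bits > best:
            best = bits
    return best
```

Two occurrences belong to the same class exactly when their canonical forms are equal, so the forms can be used directly as dictionary keys when grouping occurrences.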


We will give three examples of subgraph census algorithms and data-structures. Note that we purposefully stay as close as possible to the original literature, especially for the ESU algorithm, which can be seen as the first major breakthrough in accelerating the subgraph census problem (we will explain this better when addressing the state of the art in the following chapter).

3.1 Enumerate Subgraphs (ESU)

In 2005, Sebastian Wernicke introduced a novel and highly efficient methodology to enumerate subgraphs on networks, as well as a sampling procedure that avoided previous issues related to bias when determining network motifs [Wer05]. The former subject falls within the scope of this thesis, the latter does not, so we will leave it unaddressed.

Firstly, we introduce the concept of exclusive neighbourhood, Nexcl(v, V′) = N({v}) \ (V′ ∪ N(V′)), such that V′ ⊆ V(G) for some graph G; that is, the neighbours of v that neither belong to V′ nor are adjacent to any vertex of V′. The properties of this neighbourhood will allow us to enumerate each occurrence exactly once.

Algorithm 3.1 ESU: enumerates all induced k-subgraphs of G.

1: procedure ESU(G, k)
2:     for all vertex v of G do
3:         VExtension ← {u ∈ N(v) | u > v}
4:         ExtendSubgraph({v}, VExtension, v)
5: procedure ExtendSubgraph(VSubgraph, VExtension, v)
6:     if |VSubgraph| = k then
7:         output G[VSubgraph] and return
8:     while VExtension ≠ ∅ do
9:         Remove arbitrary w ∈ VExtension
10:        V′Extension ← VExtension ∪ {u ∈ Nexcl(w, VSubgraph) | u > v}
11:        ExtendSubgraph(VSubgraph ∪ {w}, V′Extension, v)

The intuition behind ESU is the following: start at some vertex v, and add to VExtension only vertices that have a label larger than v and that are neighbours of the newly added vertex w but not of any vertex already in VSubgraph, effectively enumerating each possible subgraph exactly once. Another way of looking at the extension procedure is to think of VExtension as a “string” composed of vertex labels. The aforementioned constraints ensure that any string with “prefix” u will be enumerated exactly once on its respective recursive branch.
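The procedure of algorithm 3.1 can be sketched in Python as follows. The sketch assumes an undirected graph stored as a dictionary from each vertex to the set of its neighbours, with comparable vertex labels; the generator interface and all names are choices made for this illustration.

```python
def esu(adj, k):
    """Yield the vertex set of every connected induced k-subgraph once.
    adj maps each vertex to the set of its neighbours (undirected)."""

    def extend(sub, ext, v):
        if len(sub) == k:
            yield frozenset(sub)
            return
        ext = set(ext)  # work on a copy so sibling branches are unaffected
        while ext:
            w = ext.pop()  # remove an arbitrary vertex from the extension
            # vertices in sub or adjacent to sub are excluded
            closed = sub | set().union(*(adj[s] for s in sub))
            # exclusive neighbourhood of w with respect to sub, labels > v
            excl = {u for u in adj[w] if u > v and u not in closed}
            yield from extend(sub | {w}, ext | excl, v)

    for v in adj:
        yield from extend({v}, {u for u in adj[v] if u > v}, v)


# A triangle (1, 2, 3) with a pendant vertex 4 attached to vertex 3.
graph = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
found = sorted(sorted(s) for s in esu(graph, 3))
```

On this small graph `found` holds the triangle {1, 2, 3} and the two paths {1, 3, 4} and {2, 3, 4}; the vertex set {1, 2, 4} is skipped because it does not induce a connected subgraph.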

We now introduce definitions and claims, as well as the appropriate proofs as given in [Wer05], to show the correctness of this algorithm.

Definition 3.1 (ESU-tree). Calling ESU(G, k) induces a tree of recursive function calls, called the ESU-tree. The root at depth zero represents the call to ESU(G, k). Each call of ExtendSubgraph(VSubgraph ∪ {w}, V′Extension, v) leads to an edge from the caller function node to a node that represents the callee. The callee node is labeled (VSubgraph, VExtension) and its depth is |VSubgraph|.

Figure 3.2: Resulting ESU-tree (bottom) from a call to ESU(G, 3) on graph G (top). Each leaf corresponds to a size 3 subgraph (adapted from [Wer05]).

For a node w, SUB(w) and EXT(w) represent VSubgraph and VExtension respectively. We assume that the nodes of the ESU-tree have the same order as the calls to the procedures they represent. If a node w1 precedes a node w2, we denote this by w1 ≺ w2. Given a set of nodes of the ESU-tree, a node is called minimal if it precedes all other nodes in the set.

In order to prove the correctness of ESU, the author starts by claiming the following properties:

1. Let w1 be a node distinct from the root. For every u ∈ EXT(w1), w1 has a child node w2 such that u ∈ SUB(w2).

2. For each node w in the ESU-tree distinct from the root and for each vertex u ∈ EXT(w), we have u > v, where v is the smallest label vertex in SUB(w).

3. Let w1 and w2 be two nodes in the tree with a common parent node and w1 ≺ w2. Then, SUB(w1) contains exactly one vertex u1 which is not contained in SUB(w2) and vice versa. For every node w′ whose path to the root contains w2, we have u1 ∉ SUB(w′).

Property 1 is a consequence of lines 8 and 9 of algorithm 3.1 being carried out for every vertex u in the original VExtension set.


Property 2 follows from lines 3 and 10, since one of the conditions for a vertex to be added to VExtension (and hence, later, to VSubgraph) is that its label is larger than that of v. Thus, since v is the first vertex added to VSubgraph, it has the smallest label in SUB(w).

Property 3 is not as trivial, and we need to consider two distinct situations. Firstly, let the common parent of w1 and w2 be the root of the ESU-tree. In this case, property 3 holds since the code in line 4 of the algorithm is executed only once for every vertex v ∈ V(G), and property 2 ensures that every node w′ that is a descendant of w2 satisfies SUB(w′) ∩ SUB(w1) = ∅.

Consider now that the common parent of w1 and w2 is not the root of the ESU-tree. Then, the claim u1 ∉ SUB(w′) comes from the fact that once u1 is selected for addition to VSubgraph, it is removed from VExtension by line 9 of ESU. Once the call to ExtendSubgraph(VSubgraph ∪ {u1}, V′Extension, v) in line 11 terminates, there can be no output subgraph containing u1 until VExtension = ∅. Note that u1 must be a neighbour of some vertex in VSubgraph. Therefore, once u1 is removed from VExtension, it is not added to this set again by any call to ExtendSubgraph until the aforementioned condition is satisfied.

Now that we have shown that these 3 properties hold, we can proceed to the proof of the correctness of the ESU algorithm.

Theorem 3.1.1. Given a graph G and k ≥ 2, ESU enumerates all induced k-subgraphs exactly once.

Proof. We will show that each subgraph is output at least once and at most once (and therefore exactly once).

For a call to ESU(G, k), let T be the associated ESU-tree. Assume that there exists a k-subgraph G′ with V(G′) = {v1, ..., vk} in G that is not output by ESU. Without loss of generality, consider that v1 is the smallest labeled vertex in G′. Line 4 of the algorithm ensures that for every vertex v ∈ V(G), including v1, the root of T has exactly one child w1 with SUB(w1) = {v1}. All neighbours of v1 in G′ are in EXT(w1), by line 3, since v1 is the smallest labeled vertex. By property 1, w1 has a child node w2 with SUB(w2) = {v1, v′} for each neighbour v′ of v1 in G′. Let w2′ be the minimal of these child nodes, and assume that the respective neighbour of v1 added to SUB(w1) is v2, that is, SUB(w2′) = {v1, v2}. We claim that EXT(w2′) contains all neighbours that v1 and v2 have in G′. For the purpose of contradiction, assume that there exists a neighbour which is not contained in EXT(w2′). There can be only three such cases:

1. Its label is smaller than that of v1. This can be easily ruled out, since v1 has the smallest label in G′.

2. It could be neither a neighbour of v1 nor in Nexcl(v2, {v1}). We can rule this case out, since in G′ it is a neighbour of either v1, v2 or both.

3. It could have been taken from EXT(w2′). This is also false, since w2′ is minimal by definition.

Applying the same argument inductively to the remaining vertices v3, ..., vk leads to a node wk in the ESU-tree for which SUB(wk) = V(G′), contradicting our assumption that G′ was not output by the algorithm (each subgraph is thus output at least once).

Finally, to prove the at most once case, we assume by contradiction that G′ is enumerated twice. This means that there are two leaves w1 and w2 in the ESU-tree where SUB(w1) = SUB(w2) holds. The path p1 from w1 to the root must differ (in at least one node) from the path p2 from w2 to the root. Define the split node as the node at the greatest depth in the tree shared by p1 and p2. Due to property 3, the existence of this split node implies that SUB(w1) and SUB(w2) differ by at least one element, and we thus arrive at a contradiction.

We arrive at the conclusion that Theorem 3.1.1 holds, and ESU enumerates all k-subgraphs of G (k ≥ 2) exactly once.

It turns out that the ESU-tree also has several properties that make it ideal for fast motif detection (see [Wer05]), and this led to the development of a tool based on this algorithm called FANMOD [WR06]. Its main competitors at the time were mfinder [KIMA02] and MAVisto [SS05], and FANMOD was faster and scaled considerably better as the subgraphs increased in size. FANMOD also had other conveniences, such as the possibility to handle networks with colored vertices and edges, and the ability to accurately predict the overall running time of the algorithm. The latter is possible because RAND-ESU (the version of ESU used when a complete traversal is too time-expensive, which explores only parts of the ESU-tree such that each leaf is reached with equal probability) allows for an accurate estimation of the total frequencies of each k-subgraph.

3.2 G-Tries

G-Tries are an efficient data structure designed to improve motif discovery [RS10a]. Typically, algorithms for subgraph census either have to enumerate all k-size subgraphs and perform a set of isomorphism tests in order to determine the frequency of each class (like ESU in the previous section), or require a set of input subgraphs where a specialized search is performed in order to compute the frequencies [GK07].

The main idea is to perform subgraph census in a way that stands in the middle of these opposing strategies: having a set of input subgraphs, which allows us to save computational time on counting subgraphs that do not interest us, while taking advantage of topological information from previous subgraph searches instead of performing each census completely independently.

Prefix Trees as an Intuition for G-Tries

Prefix trees or tries are a well known data-structure in Computer Science used to solve problems related to strings (as depicted in figure 3.3). In a prefix tree all descendant nodes share a common prefix, represented by a traversal from the root to the parent node.

Figure 3.3: Resulting trie for an input set of four words.

The intuition behind G-Tries is to apply a similar idea in the context of networks, since multiple graphs can share a common smaller subgraph. A visual example is given in figure 3.4.

Figure 3.4: A subgraph of size 3 shared among four subgraphs of size 6.

In a trie each node represents a letter, while in a G-Trie each node represents a graph vertex. Each vertex can be distinguished by its connections to ancestor nodes so that common substructures are precisely captured between two or more sibling nodes on the G-Trie tree. Furthermore, each distinct path in the tree leads to a unique graph.

Given that we now know the set of intuitions required for this section, we will provide the definition as stated by the authors:

Definition 3.2 (G-Trie). A G-Trie is a multiway tree, where each node contains information about a single graph vertex and its corresponding edges to ancestor nodes. A path from the root to a leaf corresponds to one distinct graph. Descendants of a G-Trie node share a common subgraph.


Figure 3.5: A G-trie generated from an input set of 6 subgraphs. For each level of the tree a new vertex is added (black) connecting to the vertices on the parent node (taken from [RSL10]).

Remember that each node needs to store information about a new vertex and its connections to ancestor vertices. Given its simplicity and ease of use, we will represent the graph by an adjacency matrix with ’1’ representing a connection and ’0’ its nonexistence. A vertex and its edges are therefore represented by the respective row of the matrix. Since every tree node represents a single new vertex, we can just store in it the corresponding row. To get a clearer visual idea of the inner workings of a G-Trie, consult figure 3.6.

Subgraph Labeling and Insertion

For clarity’s purpose, suppose that you are using a boolean adjacency matrix to represent a graph. Note that each row contains the desired information related to each vertex and its respective edges, thus we only need to store the row values up to the position pertaining to the target vertex itself.

Consider also that you want to insert a set of subgraphs on an empty G-Trie. To construct the tree we simply insert each subgraph one at a time, and in each traversal we check if any G-Trie node has the same partial row information as the graph we are inserting.

Note that depending on the order of labels of the vertices, the adjacency matrix representation changes drastically.

Using arbitrary labeling systems is inappropriate, since different adjacency matrices will generate different G-Tries, and may even lead to isomorphic graphs having different G-Trie representations, that is, to the same graph being represented on different branches of the tree. Thus, there is a need to define a canonical labeling in order to guarantee that identical sets of subgraphs will always lead to the same G-Trie.

Figure 3.6: The effect of different labeling schemes on the same graph (adapted from [RSL10]).

The authors chose a labelling that results in the lexicographically larger matrix, with the goal of ensuring that the highest degree vertices are stored in the first rows of the matrix. Using this representation in a graph without self-loops guarantees that the first vertex inserted will always have the same partial row representation of ’0’. The same goes for the second vertex to be inserted, which will always result in a label of ’10’.

This representation tends to enable the emergence of common patterns, and the fact that this effect is more prevalent in vertices with a lower index leads to a less dense G-Trie in the lower depths and a smaller tree overall.

The amount of compression performed by this data structure can be measured by the quotient between the size of the G-Trie and the sum of the sizes of the subgraphs in the input set:

$$\text{compress ratio} = 1 - \frac{\text{nodes in tree}}{\sum \text{nodes of stored subgraphs}} \qquad (3.1)$$

We will now thoroughly describe the complete insertion procedure presented in algorithm 3.2.

Algorithm 3.2 Insertion: inserting a subgraph G on a G-Trie T.

1: procedure Insert(G, T)
2:     M ← canonicalAdjMatrix(G)
3:     insertRecursive(M, T, 0)
4: procedure InsertRecursive(M, T, k)
5:     if k < numberRows(M) then
6:         for all children c of T do
7:             if c.value = first k + 1 values of M[k] then
8:                 insertRecursive(M, c, k + 1)
9:                 return
10:        else
11:            nc ← new G-Trie node
12:            nc.value ← first k + 1 values of M[k]
13:            T.insertChild(nc)
14:            insertRecursive(M, nc, k + 1)


We start by obtaining the canonical representation that, in this case, results in the lexicographically larger matrix (line 2). Afterwards the procedure InsertRecursive is called, where the tree is traversed recursively and an insertion is made when required. Inside this function we first iterate through all children of the current G-Trie node (line 6) and check if there is a match with their partial row of the adjacency matrix (line 7). If that is the case, we continue the recursion with the next vertex (line 8). Otherwise, a new child is created (lines 11 to 13), and only afterwards do we proceed with the recursion for the next vertex (line 14), resulting in an overall complexity of O(|V(G)|²) for procedure InsertRecursive.
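The insertion procedure can be sketched as follows. This is a simplified illustration under several assumptions: the canonical matrix is found by brute force over all vertex orderings (feasible only for very small subgraphs), the recursion is written as a loop, the root is not counted as a tree node, and all names are hypothetical.

```python
from itertools import permutations


class GTrieNode:
    def __init__(self, value=None):
        self.value = value    # partial adjacency row, e.g. "10"
        self.children = []


def canonical_adj_matrix(n, edges):
    """Lexicographically largest adjacency matrix of a graph on 0..n-1."""
    edge_set = {frozenset(e) for e in edges}
    best = None
    for perm in permutations(range(n)):
        rows = tuple(
            "".join("1" if frozenset((perm[i], perm[j])) in edge_set else "0"
                    for j in range(n))
            for i in range(n)
        )
        if best is None or rows > best:
            best = rows
    return best


def insert(root, n, edges):
    matrix = canonical_adj_matrix(n, edges)
    node = root
    for k in range(n):
        row = matrix[k][:k + 1]           # first k + 1 values of M[k]
        for child in node.children:
            if child.value == row:        # shared prefix: reuse the node
                node = child
                break
        else:                             # no child matched: create one
            new_child = GTrieNode(row)
            node.children.append(new_child)
            node = new_child


def count_nodes(node):
    return sum(1 + count_nodes(c) for c in node.children)


root = GTrieNode()
insert(root, 3, [(0, 1), (1, 2), (0, 2)])   # triangle
insert(root, 3, [(0, 1), (1, 2)])           # path
compress_ratio = 1 - count_nodes(root) / (3 + 3)
```

The triangle and the path share the nodes labelled ’0’ and ’10’, so the tree holds 4 nodes for 6 stored vertices and the compression ratio of equation 3.1 is 1/3.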

3.3 Subgraph Census

After creating a G-Trie one can perform subgraph census using the graphs stored in the G-Trie nodes as induced subgraphs of a larger graph.

Algorithm 3.3 Census: Census of subgraphs of T in a graph G.

1: procedure Census(G, T)
2:     for all children c of T.root do
3:         match(c, G, 0, ∅)
4: procedure Match(T, G, k, Vused)
5:     if Vused = ∅ then
6:         Vcand ← V(G)
7:     else
8:         Vconn ← {Vused[i] : T.value[i] = ’1’}
9:         m ← m ∈ Vconn : ∀v ∈ Vconn, |N(m)| ≤ |N(v)|
10:        Vcand ← {v ∈ N(m) : v ∉ Vused}
11:    for all v ∈ Vcand do
12:        add v to end of Vused
13:        if ∀i ∈ [0..k] : T.value[i] = Gadj[v][Vused[i]] then
14:            if T.isleaf() then
15:                reportGraph()
16:            else
17:                for all children c of T do
18:                    match(c, G, k + 1, Vused)
19:        remove v from Vused

We start by calling the Match procedure for all children of the root (line 2). Inside this function, we first construct the set of candidate vertices Vcand to expand the partial subgraph set Vused. If there are no vertices in Vused, all vertices of the graph are suitable candidates for the first vertex of the partial subgraph (lines 5 and 6). If Vused is not empty, we compute the set Vconn of vertices that our current vertex should be connected to (line 8). The vertex m of Vconn with the smallest neighbourhood is then selected, so as to keep the number of candidates small (line 9). Afterwards, Vcand is built from the vertices belonging to N(m) that do not already belong to Vused (line 10).

Following these operations we iterate through all candidates in Vcand (line 11), adding each of them to the partial subgraph Vused (line 12).

In order to check if a candidate is a suitable match to our current state, we compare the G-Trie node value with the connections of the candidate to the vertices already in Vused (line 13). If the candidate fulfills this condition, there are two possible outcomes: either we are on a leaf of G-Trie T and we can increase the frequency (lines 14 and 15), or we continue the recursion of the Match procedure with all the children of the current G-Trie node (line 18).

We finish by removing the vertex from Vused in order to repeat the process with the next candidate (line 19).

3.4 Symmetry Breaking Conditions

If we do not enforce constraints that prevent automorphisms from being accounted for in the frequency calculations, the resulting subgraph census will not be correct. Note that each induced occurrence needs to be accounted for exactly once.

A possible solution is to compute the number of automorphisms for each subgraph, and divide the total frequency by this number. However, this results in a considerable amount of computational time wasted on redundant calculations.
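The over-counting can be observed with a small sketch of the census procedure without any symmetry breaking. It assumes a hand-built G-Trie holding the triangle and the 3-vertex path, connected patterns only, and a graph stored as a dictionary of neighbour sets; the class, the `name` field and the automorphism counts supplied by hand are assumptions made for this illustration.

```python
class Node:
    """G-Trie node; value is the partial adjacency row as a '0'/'1' string."""
    def __init__(self, value="", children=(), name=None):
        self.value = value
        self.children = list(children)
        self.name = name  # hypothetical label of the graph stored at a leaf


def census(adj, root):
    counts = {}

    def match(node, used):
        if not used:
            candidates = list(adj)
        else:
            # matched vertices the new vertex must be connected to
            conn = [used[i] for i, bit in enumerate(node.value[:-1]) if bit == "1"]
            m = min(conn, key=lambda u: len(adj[u]))  # smallest neighbourhood
            candidates = [v for v in adj[m] if v not in used]
        for v in candidates:
            used.append(v)
            # v must agree with the node row on every matched position
            if all((used[i] in adj[v]) == (bit == "1")
                   for i, bit in enumerate(node.value)):
                if not node.children:
                    counts[node.name] = counts.get(node.name, 0) + 1
                else:
                    for child in node.children:
                        match(child, used)
            used.pop()

    for child in root.children:
        match(child, [])
    return counts


root = Node(children=[Node("0", [Node("10", [
    Node("110", name="triangle"),
    Node("100", name="path"),
])])])

graph = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
raw = census(graph, root)
automorphisms = {"triangle": 6, "path": 2}  # supplied by hand for this sketch
frequencies = {name: raw[name] // automorphisms[name] for name in raw}
```

The raw counts are 6 for the triangle and 4 for the path, because every occurrence is reported once per automorphism of the pattern; dividing by the automorphism counts recovers the true frequencies of 1 and 2.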

With the goal of solving this problem, the authors present an algorithm that generates a set of symmetry breaking conditions (described below), based on previous work of Grochow et al. [GK07]. A symmetry breaking condition is a constraint of the type a < b, meaning that vertex a should have a smaller label than b.

Algorithm 3.4 FindConditions: Symmetry breaking conditions for graph G.

1: procedure FindConditions(G)
2:     Conditions ← ∅
3:     Aut ← setAutomorphisms(G)
4:     while |Aut| > 1 do
5:         m ← minimum v : ∃map ∈ Aut, map[v] ≠ v
6:         for all v ≠ m : ∃map ∈ Aut, map[m] = v do
7:             add m < v to Conditions
8:         Aut ← {map ∈ Aut : map[m] = m}
9:     return Conditions

The algorithm starts with an empty set of conditions (line 2). Firstly, the set of all possible automorphisms for G, Aut, is calculated (line 3). The goal is to iteratively expand the conditions set with constraints that, when fulfilled, reduce the set to the identity mapping. Note that although calculating the number of automorphisms is a hard problem, the authors report that when using nauty it can usually be solved in less than a second for hundreds of vertices.

For each iteration of the while loop (line 4), we find the minimum vertex m that is still mapped to a different vertex by some automorphism in Aut, that is, a vertex that still has a topologically equivalent vertex for position m (line 5).

Afterwards, the constraints indicating that the vertex in position m should have the lowest index of every equivalent position are added to the condition set (lines 6 and 7), fixing m in its position.

Finally we filter the Aut set by removing the mappings that do not follow the newly enforced constraints, that is, the mappings that do not fix m (line 8). The loop (line 4) concludes when Aut contains only the identity mapping.
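Algorithm 3.4 can be sketched as below for very small graphs. The automorphisms are found here by brute force over all permutations, which is only an assumption made to keep the example self-contained (it is not how nauty computes them); vertices are assumed to be labelled 0 to n − 1 and edges to be listed without duplicates.

```python
from itertools import permutations


def automorphisms(n, edges):
    """All vertex permutations of 0..n-1 that map every edge onto an edge."""
    edge_set = {frozenset(e) for e in edges}
    return [
        perm for perm in permutations(range(n))
        if all(frozenset((perm[a], perm[b])) in edge_set for a, b in edges)
    ]


def find_conditions(n, edges):
    """Return symmetry breaking conditions as pairs (a, b) meaning a < b."""
    conditions = []
    aut = automorphisms(n, edges)
    while len(aut) > 1:
        # minimum vertex that some remaining automorphism still moves
        m = min(v for mapping in aut for v in range(n) if mapping[v] != v)
        # every vertex that m can be mapped to must get a larger label
        for v in sorted({mapping[m] for mapping in aut} - {m}):
            conditions.append((m, v))
        # keep only the automorphisms that fix m
        aut = [mapping for mapping in aut if mapping[m] == m]
    return conditions


triangle = find_conditions(3, [(0, 1), (1, 2), (0, 2)])
path = find_conditions(3, [(0, 1), (1, 2)])
```

For the triangle the result is 0 < 1, 0 < 2 and 1 < 2, which leaves a single valid labelling out of the six automorphic ones; for the path 0-1-2 the single condition 0 < 2 breaks the end-to-end symmetry.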

We are now ready to complete the insertion and census algorithms with symmetry breaking conditions (algorithms 3.5 and 3.6).

Algorithm 3.5 Insertion [Symmetry breaking]: inserting a subgraph G on T.

1: procedure InsertCond(G, T)
2:     M ← canonicalAdjMatrix(G, T)
3:     InsertCondRecursive(M, T, 0, C)
4: procedure InsertCondRecursive(M, T, k, C)
5:     lines 5 to 12 of algorithm 3.2
6:     nc.addSetConditions(C)
7:     lines 13 and 14 of algorithm 3.2

Algorithm 3.6 Census [Symmetry breaking]: census of subgraphs of T on G.

1: procedure Match(T, G, k, Vused)
2:     lines 5 to 13 of algorithm 3.3
3:     if ∃C ∈ T.conditions : Vused respects C then
4:         lines 14 to 18 of algorithm 3.3
5:     line 19 of algorithm 3.3

The intuition for these adaptations is that each G-Trie node should be aware of which set of graphs, as well as the appropriate conditions, are in its descendant leaves. The census algorithm has to check if the partial subgraph fulfills all the conditions of at least one of its leaves.

Note that since algorithm 3.4 creates conditions on the minimal indexes that are yet to be fixed, we can check if a condition is being broken as early as possible, which results in a strong pruning effect on the recursion tree.


3.5 Fast Subgraph Enumeration (FaSE)

The original version of G-Tries is one of the fastest algorithms available for exact subgraph census. However, it requires a set of subgraphs in order to build the G-Trie tree, and the number of distinct k-subgraphs grows exponentially with k. Precomputing the complete set of subgraphs a priori quickly becomes intractable, and thus the need to iteratively build the G-Trie during the process of subgraph census emerged. FaSE [PR15] provides an answer to this problem, trading off the necessity of having symmetry breaking conditions for a G-Trie where topologically redundant nodes are allowed (i.e. two or more G-Trie leaves can represent the same subgraph).

The reader might be reluctant to believe the benefits of this algorithm, since in the worst case scenario all the subgraphs will emerge on a network, and not enforcing symmetry breaking conditions will only make the resulting G-Trie even larger. Notwithstanding, when applied to real networks, FaSE shows promising results. Empirically, it appears that because the set of all subgraphs is exponential relative to k, the probability that all of the members of this set are represented on the network is negligible.

The FaSE algorithm is comprised of two steps: enumeration and encapsulation. Since we already introduced the former we will focus on the latter. Encapsulation is the process of obtaining the isomorphism classes by storing the topological features of the subgraph. This is done in two steps: an intermediary labeling the authors named LS-Labeling, which is generated whenever a vertex is added to the temporary set of subgraph vertices, and the final canonical label generated by nauty. One of the main advantages of using a G-Trie is that nauty is only called once per G-Trie leaf, and the intermediary LS-labels are computed in polynomial time.

Another strength of FaSE is that it only requires the enumeration process to transition from state to state by adding a single vertex at a time. This allows the resulting subgraph to be labeled according to the transitions taken to reach a final state. This means that FaSE is compatible with any modern enumeration algorithm, such as ESU [Wer05] and Kavosh [KAE+09].

LS-Labeling

LS-Labeling is crucial to the effectiveness of the G-Trie data-structure, as it directly impacts the branching factor of the tree since it defines its edges, having an influence on both the running time and the amount of memory required. It uses information only regarding the newly explored vertex of a partial subgraph (that is, a subgraph that is yet to be fully enumerated and has not reached a final state on the enumeration algorithm) and its connections to vertices in that subgraph.


A crucial point is that this is but an intermediary representation, and it is of utmost importance that it takes only polynomial time to compute. This way we take advantage of the fact that we do not need to call a canonical class label generator (such as nauty), which solves a computationally hard problem, until we arrive at a final state.

The authors present two simple labeling alternatives: adjacency list labeling and adjacency matrix labeling. Consider that we are dealing with an undirected network. If we are using the adjacency list labeling, the label is comprised of at most k − 1 integers (where k is the size of the partially enumerated subgraph), and i (0 < i < k) will only be present if the new vertex has a connection to the i-th added vertex. The process for the adjacency matrix labeling is similar: the label is comprised of a list of k − 1 boolean values, each of which indicates whether the newly added vertex has a connection to the corresponding previously added vertex, in order of addition. If we are considering directed networks, we only need to keep two lists instead of one for the adjacency list labeling, one for incoming edges and the other for outgoing edges.

Figure 3.7: Different LS-Labeling schemes using the adjacency list and matrix. Grey nodes outline the vertices being inserted.

The two label techniques are methodically equivalent and only change in the type of strategy used to display information. To prove the correctness we only need to show that two graphs labeled equally will be isomorphic. In other words, we need to find a bijection between the two subgraphs. Trivially this can be done by following the order in which each vertex was enumerated, which is implicitly stored in the actual label, and map the vertex in each position of the order to one another. Therefore, any two subgraphs that share a label belong to the same canonical class, and we arrive at our correctness proof.
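Both labeling schemes can be sketched for the undirected case as follows, assuming the graph is a dictionary of neighbour sets and the partial subgraph is given as the list of its vertices in order of addition. The function names are hypothetical.

```python
def ls_label_list(adj, subgraph, w):
    """Adjacency list label: the 1-based positions i of the previously
    added vertices that the new vertex w is connected to."""
    return tuple(i for i, u in enumerate(subgraph, start=1) if u in adj[w])


def ls_label_matrix(adj, subgraph, w):
    """Adjacency matrix label: one boolean per previously added vertex,
    in order of addition."""
    return tuple(u in adj[w] for u in subgraph)


graph = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
closing_triangle = ls_label_list(graph, [1, 2], 3)     # 3 touches both
extending_path = ls_label_list(graph, [1, 3], 4)       # 4 touches only 3
extending_path_bits = ls_label_matrix(graph, [1, 3], 4)
```

Both functions run in time linear in the size of the partial subgraph, and the two labels carry the same information: the list form stores the positions of the `True` entries of the matrix form.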

Customized G-Tries

The authors propose an alternate version of the original G-Tries [RS10a], which can be seen as a simplified version of the original data-structure. Each node stores the frequencies of each class, and its LS-label. The G-Trie is initialized having information pertaining only to an empty graph. When a vertex is added, its LS-Label is calculated to determine its location on the G-Trie. If this node is nonexistent, both the node and the edge are created. If two different subgraphs lead to the same G-Trie node, it is assured that they are isomorphic, thanks to the labeling properties described in the previous subsection.

We keep a “current node” that represents the partially enumerated subgraph, which is initially the root node and represents an empty graph. There are two operations used to navigate the data-structure as we explore the input network. The first one is Deepen (short for “deepening”), and it inserts a new vertex into the current graph by searching the G-Trie for the appropriate node, updating the “current node” accordingly. It also creates a new node (with a frequency of one) and edge if necessary. The generated label for the new vertex is used to select the edge to follow on the tree. The backbone of the G-Trie is a prefix tree (or “trie”), which ensures that the search for the new node takes time linear in the length of its label.

The Jump operation is simpler, and it sets the “current node” to its parent.

When inserting graphs on the G-Trie we can take advantage of the common topologies inherent to the subgraph enumeration algorithm. When a new vertex is selected by enumeration, the labeling algorithm is called and the new label is calculated using information of the already stored vertices relative to the new vertex. A Deepen operation is then performed, and after the recursive call that extends the current subgraph has ended, a Jump operation is called to get back to the previous node in the G-Trie. This allows the G-Trie pointer (i.e. our “current node”) to coincide with the enumeration algorithm, since the partially enumerated set is common to every member of its extension.

FaSE algorithm

The procedure EnumerateAll (line 5 of algorithm 3.7) iterates through all subgraphs of all sizes up to k, updating the appropriate frequency when the depth d equals k. The frequencies are stored in the G-Trie, but since LS-Labeling does not provide a canonical labeling, it is necessary to accumulate the frequencies of topologically equivalent G-Trie nodes (lines 3 and 4), performing the isomorphism tests that lead to the canonical labeling (using nauty). Note that EnumerateAll and EnumerateNext (line 9) are left purposely ambiguous, since FaSE is adaptable to any enumeration process where the transitions from state to state add only a single vertex at a time.


Figure 3.8: A G-Trie using the LS-Labeling scheme. Note how isomorphic subgraphs may lead to different nodes (e.g. third level of the tree) since we do not enforce symmetry breaking conditions.

Algorithm 3.7 FaSE: enumerates and counts all induced k-subgraphs of G.

1: procedure FaSE(G, k)
2:     EnumerateAll(G, k, ∅, 0)
3:     for all n in GTrie.leaves() do
4:         frequency[CanonicalLabel(n.Graph)] += n.count
5: procedure EnumerateAll(G, k, S, d)
6:     if d = k then
7:         GTrie.current.node.count += 1
8:     else
9:         while nS ← EnumerateNext(S) do
10:            w ← nS.NextNode()
11:            nL ← LSLabel(S, w)
12:            GTrie.Deepen(nL)
13:            nS.Subgraph ← nS.Subgraph ∪ w
14:            EnumerateAll(G, k, nS, d + 1)
15:            GTrie.Jump()
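To make the interplay between enumeration, labeling and the G-Trie pointer concrete, the following is a small runnable sketch for undirected graphs. It rests on several assumptions that are not taken from the thesis: ESU plays the role of EnumerateAll and EnumerateNext, `ls_label` is a simplified stand-in for LS-Labeling (the adjacency of the new vertex to the vertices already in the subgraph), and `canonical_label` is a brute-force search over vertex orderings used in place of nauty, so it is only practical for very small k. All function and class names are hypothetical.

```python
from collections import defaultdict
from itertools import permutations


class Node:
    def __init__(self, parent=None):
        self.parent, self.children, self.count = parent, {}, 0


def ls_label(adj, sub, w):
    """Simplified label: adjacency bits of w to each vertex of sub, in insertion order."""
    return tuple(int(u in adj[w]) for u in sub)


def canonical_label(labels):
    """Brute-force canonical form (stand-in for nauty): the smallest flattened
    adjacency matrix over all vertex orderings."""
    k = len(labels)
    m = [[0] * k for _ in range(k)]
    for i, label in enumerate(labels):
        for j, bit in enumerate(label):
            m[i][j] = m[j][i] = bit
    return min(tuple(m[p[i]][p[j]] for i in range(k) for j in range(k))
               for p in permutations(range(k)))


def fase(adj, k):
    """Count induced connected k-subgraphs of an undirected graph.
    adj maps each vertex (a comparable id) to the set of its neighbours."""
    root = Node()
    current = root

    def deepen(label):
        nonlocal current
        if label not in current.children:
            current.children[label] = Node(current)
        current = current.children[label]

    def jump():
        nonlocal current
        current = current.parent

    def extend(sub, ext, v):
        if len(sub) == k:
            current.count += 1
            return
        while ext:
            w = ext.pop()
            # ESU: only add neighbours of w that are larger than the root vertex
            # and neither in sub nor adjacent to it.
            seen = set(sub).union(*(adj[u] for u in sub))
            new_ext = ext | {u for u in adj[w] if u > v and u not in seen}
            deepen(ls_label(adj, sub, w))
            extend(sub + [w], new_ext, v)
            jump()

    for v in adj:
        deepen(())  # the first vertex has no earlier vertices to relate to
        extend([v], {u for u in adj[v] if u > v}, v)
        jump()

    # Merge the counters of leaves that belong to the same isomorphism class.
    freq = defaultdict(int)

    def collect(node, labels):
        if len(labels) == k:
            freq[canonical_label(labels)] += node.count
            return
        for label, child in node.children.items():
            collect(child, labels + [label])

    collect(root, [])
    return dict(freq)
```

For example, on a triangle 0-1-2 with a pendant vertex 3 attached to vertex 2, `fase(adj, 3)` reports one triangle and two induced paths. The isomorphism work happens once per G-Trie leaf rather than once per occurrence, which is the saving the algorithm aims for.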


4 Condensed Graphs

In the previous chapter we focused on three specific algorithms. Now that we have properly introduced the problem and scope of the thesis, we start this chapter by reviewing other state-of-the-art solutions for subgraph counting and enumeration. Afterwards, we lay out the baseline theory for our proposed innovations.

4.1 Related Work

Subgraph census computation has been studied for more than 15 years. In 2002, Milo et al. [MSOI+02] coined the term network motifs for frequent, over-represented induced subgraph patterns and offered the mfinder subgraph enumeration algorithm as a first practical approach for computing subgraph frequencies. The first major breakthrough was introduced by Wernicke [Wer06] with the ESU algorithm, which avoided graph symmetries and enumerated each subgraph only once. Isomorphism tests for each discovered subgraph occurrence are made through the third-party package nauty [McK03], a highly efficient isomorphism algorithm. In order to reduce the number of isomorphism tests needed, approaches such as QuateXelero [KSD+13] or FaSE [PR13] encapsulate the topology of the current subgraph match, grouping several occurrences as belonging to the same isomorphism class. If we know beforehand the set of subgraphs that we are interested in (which can be smaller than the entire set of all possible k-subgraphs), the original G-Tries data structure and algorithms [RS14] can be used, allowing for further improvements.

All the aforementioned approaches are general (i.e., they are applicable to any subgraph size and also support directed graphs) and rely on a full subgraph enumeration. However, for more specific sets of subgraphs there has been an increasing number of analytical algorithms that take into account the subgraphs' topology and its combinatorial effects. For example, ORCA [HD14], which counts orbits rather than subgraph occurrences directly, can tackle undirected subgraphs of up to size 5 and relies on a derived set of linear equations that relate the orbit counts. This was also generalized to other small undirected orbits [HD17]. PGD [ANRD15] (up to size 4) and Escape [PSV17] (size 5) are other examples of state-of-the-art analytical algorithms specialized in counting undirected subgraphs. Other analytical approaches only work for directed graphs, such as Acc-Motif [MMFDC14], which can count non-induced
