

2.4 Graph pattern mining paradigms

Definition 13. (Pattern-aware subgraph enumeration) Let G be a graph and let C be a predicate representing a set of conditions (or properties) that a subgraph in G must exhibit in order to be of interest. A pattern-aware subgraph enumeration (PASE) finds subgraphs in G by: (i) obtaining a set of patterns of interest P, none of which violates predicate C; and (ii) enumerating the subgraphs in G isomorphic to each p ∈ P.

Let 𝒫 = {P1, P2, ..., Pm} be the set of all canonical patterns in G. A pattern-aware subgraph enumeration (Definition 13) explores subgraphs isomorphic to a given set of patterns and uses this information to guide the extension of connected subgraphs.

Algorithm 1 describes the process at a high level. The method starts by selecting the set of patterns that may generate subgraphs of interest in G, i.e., patterns that contain k vertices and satisfy the predicate C (line 1). Next, each selected pattern is matched against the input graph, producing isomorphic subgraphs for processing (lines 2-5). The matching of each selected pattern proceeds vertex by vertex, through recursive calls that:

(i) select the next vertex to match (to add);
(ii) determine the pattern edges that connect the current subgraph to the next vertex to match;
(iii) retrieve from G the set of vertices that satisfy the connectivity pattern represented by those edges;
(iv) remove from those vertices any candidate that does not satisfy the predicate C or that produces a non-canonical subgraph code (details on the canonical ordering algorithm for a pattern-aware subgraph exploration are presented in Section 4.1.2); and
(v) add the valid candidates to the current subgraph in order to match the subsequent vertices.

Algorithm 1 pattern-aware-senum(G, k, C)

1: 𝒫 ← {P ∈ 𝒫 | |V(P)| = k and C(P)}
2: for P in 𝒫 do
3:     for u ∈ V(G) do
4:         for S in match({u}, P) do
5:             aggregate(S)
6: procedure match(S, P)
7:     if |V(S)| = k then
8:         emit S
9:     else
10:        vP ← next-vertex-to-match(S, P)
11:        EP ← {(s, ls, d, ld, l) ∈ P | d = vP}
12:        V ← {v ∈ ⋃_{u ∈ V(S)} N(u) | µ(S ∪ {v}) is canonical, C(S ∪ {v}), and τ(S ∪ {v}) = τ(S) ∪ EP}
13:        for v ∈ V do
14:            match(S ∪ {v}, P)
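The matching loop of Algorithm 1 can be illustrated with a minimal Python sketch. It is not the implementation used in this work: it assumes an undirected, unlabeled graph stored as a dictionary of neighbor sets, induced (vertex-based) matching, and patterns given as sets of edges over pattern vertices 0..k-1, ordered so that every pattern vertex after the first is adjacent to an earlier one. The names `match` and `pase` and the pattern encoding are hypothetical. The canonicality filter of line 12 is replaced by a plain set of already-seen vertex sets, which removes the automorphic duplicates but does not prune the search the way a canonical check would.

```python
def match(partial, pattern, k, adj):
    """Yield k-vertex matches extending `partial`.

    partial[i] is the graph vertex matched to pattern vertex i.
    """
    i = len(partial)
    if i == k:
        yield tuple(partial)
        return
    # Pattern edges connecting already-matched pattern vertices to vertex i.
    required = {j for j in range(i) if frozenset((j, i)) in pattern}
    # Graph vertices adjacent to every required, already-matched vertex.
    # Assumption: `required` is never empty (connected pattern ordering).
    candidates = set.intersection(*(adj[partial[j]] for j in required))
    for v in sorted(candidates - set(partial)):
        # Induced matching: v may not have edges the pattern lacks.
        if all((partial[j] in adj[v]) == (j in required) for j in range(i)):
            yield from match(partial + [v], pattern, k, adj)


def pase(adj, patterns, k, pred=lambda pattern: True):
    """Return {pattern name: set of matched vertex sets}."""
    result = {}
    for name, pattern in patterns.items():
        if not pred(pattern):  # pattern selection (line 1)
            continue
        seen = set()  # stand-in for the canonical filter of line 12
        for u in sorted(adj):
            for m in match([u], pattern, k, adj):
                seen.add(frozenset(m))
        result[name] = seen
    return result


# Toy graph: a triangle 0-1-2 with a pendant vertex 3 attached to 2.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
patterns = {
    "path": {frozenset((0, 1)), frozenset((1, 2))},
    "triangle": {frozenset((0, 1)), frozenset((1, 2)), frozenset((0, 2))},
}
found = pase(adj, patterns, k=3)
```

On the toy graph, `found` groups the subgraphs by pattern: the single triangle on vertices 0, 1, 2 and the two induced paths that pass through vertex 2 and end at vertex 3.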

Definition 14. (Pattern-oblivious subgraph enumeration) Let G be a graph and let C be a predicate representing a set of conditions (or properties) that a subgraph in G must exhibit in order to be of interest. A pattern-oblivious subgraph enumeration (POSE) finds subgraphs in G directly using Algorithm 2.

On the other hand, a pattern-oblivious subgraph enumeration (Definition 14) represents the class of enumeration strategies that are not pattern-aware. In this exploration category, canonical subgraph codes are visited without the help of any reference pattern to guide the subgraph growth. Algorithm 2 describes the overall method at a high level. This paradigm works by adding valid words (W(G): vertices or edges) connected to the current subgraph until it reaches size k. A subgraph extension is valid if it generates a canonical subgraph code (details on the canonical ordering algorithm for POSE are presented in Section 4.1.2) and satisfies the given subgraph predicate (line 8). Note that, in this case, no information is given about which patterns are being extracted from the input graph.

Algorithm 2 pattern-oblivious-senum(G, k, C)

1: for w in W(G) do
2:     for S in explore({w}) do
3:         aggregate(S)
4: procedure explore(S)
5:     if size(S) = k then
6:         emit S
7:     else
8:         W ← {w ∈ N(S) | µ(S ∪ {w}) is canonical and C(S ∪ {w})}
9:         for w in W do
10:            explore(S ∪ {w})
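As a rough illustration of Algorithm 2, the sketch below enumerates connected vertex-induced subgraphs in Python. It assumes that words are vertices, that the graph is an undirected dictionary of neighbor sets, and that the predicate can safely be applied to partial subgraphs. The canonicality test is a simple stand-in, not the algorithm of Section 4.1.2: a vertex sequence is accepted only if it equals the order obtained by starting at the smallest vertex of its set and repeatedly adding the smallest remaining vertex adjacent to the vertices chosen so far. Every prefix of such a sequence is itself canonical, so each connected vertex set is reached exactly once. The names `canonical_order`, `explore` and `pose` are hypothetical.

```python
def canonical_order(vertices, adj):
    """Return the unique reference visiting order of a connected vertex set."""
    remaining = set(vertices)
    order = [min(remaining)]
    remaining.discard(order[0])
    while remaining:
        # Smallest remaining vertex adjacent to the prefix chosen so far.
        nxt = min(v for v in remaining if any(u in adj[v] for u in order))
        order.append(nxt)
        remaining.discard(nxt)
    return order


def explore(seq, adj, k, pred):
    """Yield canonical k-vertex sequences that extend `seq`."""
    if len(seq) == k:
        yield tuple(seq)
        return
    # N(S): vertices adjacent to the current subgraph but not in it.
    neighborhood = {w for u in seq for w in adj[u]} - set(seq)
    for w in sorted(neighborhood):
        ext = seq + [w]
        # Line 8: keep only canonical extensions that satisfy the predicate.
        if canonical_order(ext, adj) == ext and pred(ext):
            yield from explore(ext, adj, k, pred)


def pose(adj, k, pred=lambda seq: True):
    """Enumerate each connected k-vertex induced subgraph exactly once."""
    for w in sorted(adj):
        yield from explore([w], adj, k, pred)


# Toy graph: a triangle 0-1-2 with a pendant vertex 3 attached to 2.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
subgraphs = list(pose(adj, k=3))
```

Here `subgraphs` contains one sequence per connected 3-vertex subgraph of the toy graph, with no indication of whether a given sequence is a triangle or a path, which is precisely the information a pattern-oblivious exploration does not have.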

Figure 2.5 illustrates the difference between the two subgraph enumeration frameworks. While POSE enumerates subgraph candidates directly, relying on a partial order among subgraphs of different patterns, PASE includes a pattern generation step in which the patterns of interest are first obtained and each is then matched against the input graph. We highlight that the output is equivalent, since the same subgraph candidates are generated regardless of the paradigm adopted.

We make an informal argument that both methods may be used to solve graph pattern mining applications. In POSE, every subgraph is visited once, and only once, because the algorithm only generates larger subgraph codes that are canonical. Subgraphs are therefore visited in no particular pattern order, meaning that there is no way of determining the pattern of the next subgraph being produced. This is nonetheless sufficient for visiting the entire search space required by the conditional subgraph enumeration problem. In PASE, on the other hand, patterns are selected before subgraph exploration. If the predicate is unable to select specific patterns, the algorithm can always be exhaustive in generating the set of patterns to be matched against the input graph.

In addition, each subgraph is visited only once, for the same reason as in POSE. Therefore, all subgraphs are visited in a particular order: all subgraphs isomorphic to pattern P1, then all subgraphs isomorphic to pattern P2, and so on. Naturally, this is also sufficient for visiting the entire search space required by the conditional subgraph enumeration problem.

Figure 2.5: GPM paradigms: pattern-oblivious (POSE) vs. pattern-aware (PASE). [Figure panels: enumeration by ordering (POSE); pattern generation followed by enumeration by matching (PASE).]

Source: Made by the author.

An interesting analogy for understanding these two alternative subgraph exploration paradigms for graph pattern mining is querying tuples in a relational database. In this scenario, we can always determine a set of queries (patterns) sufficient to cover all the tuples; this is analogous to the PASE approach. Alternatively, we may retrieve the same set of tuples by fully scanning the database; this is analogous to the POSE approach.
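The analogy can be made concrete with a toy Python example. The table, its column values, and the helper `select_where` are hypothetical and exist only for illustration: one selective query per department plays the role of one pattern in PASE, while a full scan plays the role of POSE.

```python
# Hypothetical table of (name, department) tuples.
rows = [
    ("alice", "engineering"),
    ("bob", "sales"),
    ("carol", "engineering"),
    ("dave", "support"),
]


def select_where(table, department):
    """A selective query: return the tuples of a single department."""
    return [row for row in table if row[1] == department]


# PASE-like retrieval: one query per department, covering every tuple.
departments = sorted({dept for _, dept in rows})
by_queries = [row for dept in departments for row in select_where(rows, dept)]

# POSE-like retrieval: a single full scan of the table.
by_scan = list(rows)

# Both strategies retrieve the same tuples, only in a different order.
assert sorted(by_queries) == sorted(by_scan)
```

The query-based retrieval returns the tuples grouped by department, whereas the scan returns them in storage order, mirroring the pattern-by-pattern order of PASE and the pattern-agnostic order of POSE.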

In this work, as we will see, we leverage these alternatives to propose a general model for graph pattern mining applications (Chapter 4) and abstractions (Chapter 6).


Chapter 3

Related Work

This chapter is devoted to reviewing the GPM literature and related topics. In Section 3.1 we introduce the class of algorithms known as graph analytics, and we argue why these algorithms may not exhibit the same challenges as GPM processing. In Section 3.2 we present some specialized ad-hoc solutions for specific GPM problems to motivate why this strategy may be too strict for real-world data analytics pipelines. In Section 3.3 we introduce existing general-purpose GPM systems, which aim to fill the gaps left by overly specialized GPM solutions. Finally, in Section 3.4 we review a few existing works whose goal is to characterize and understand GPM workloads.