Graph pattern mining problems - Graph pattern mining: consolidating models, systems, and abstra


2.3 Graph pattern mining problems


• Induced subgraphs with size k;

• Predicate C(S) is satisfied for all S ∈ 𝒮;

• Mapping function (g : 𝒮 → 𝒫) = S → τ(S);

• Mapping function (h : 𝒮 → ℕ) = S → 1;

• Reduction function (r : (ℕ × ℕ) → ℕ) = (a, b) → a + b.

The motifs kernel has only one constraint, which limits the size of induced subgraphs to k. The mapping function g extracts the aggregation key from a subgraph, namely its canonical pattern. The mapping function h extracts the value from a subgraph, which in this case is the constant 1 used for counting. Finally, the reduction function r defines the associative operation that sums the partial counts of the same pattern. Figure 2.3 illustrates the aggregation part of the motifs kernel.
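The aggregation described above can be sketched in a few lines of Python. This is only an illustrative brute-force sketch, not the enumeration strategy of any particular system: the names `aggregate`, `canonical_pattern`, `connected`, and `motifs` are hypothetical, the canonical pattern is obtained by trying every relabeling (feasible only for very small k), and we assume that only connected induced subgraphs are enumerated.

```python
from itertools import combinations, permutations

def connected(vertices, adj):
    """Return True if the subgraph induced by `vertices` is connected."""
    vertices = set(vertices)
    start = next(iter(vertices))
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for w in adj[u] & vertices:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return seen == vertices

def canonical_pattern(vertices, adj):
    """Brute-force canonical form: smallest sorted edge list over all relabelings."""
    vs = list(vertices)
    best = None
    for perm in permutations(range(len(vs))):
        relabel = dict(zip(vs, perm))
        edges = tuple(sorted(
            tuple(sorted((relabel[u], relabel[w])))
            for u, w in combinations(vs, 2) if w in adj[u]
        ))
        if best is None or edges < best:
            best = edges
    return best

def aggregate(subgraphs, g, h, r):
    """Group subgraphs by key g(S) and combine the values h(S) with r."""
    out = {}
    for S in subgraphs:
        key, value = g(S), h(S)
        out[key] = r(out[key], value) if key in out else value
    return out

def motifs(adj, k):
    """Count connected induced subgraphs with k vertices, grouped by pattern."""
    subgraphs = (S for S in combinations(sorted(adj), k) if connected(S, adj))
    return aggregate(subgraphs,
                     g=lambda S: canonical_pattern(S, adj),  # key: pattern
                     h=lambda S: 1,                          # value: constant 1
                     r=lambda a, b: a + b)                   # reduction: sum

# Toy input (an assumption for illustration): triangle 0-1-2 plus vertex 3 attached to 2.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
counts = motifs(adj, 3)
```

On this toy graph the result has two entries, one for the triangle pattern and one for the 3-vertex path, which mirrors the key/value/reduction split of the kernel definition.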

2.3.2 Cliques kernel (k-CL)

A k-clique is a complete subgraph having k nodes in an input graph. In this case, only the topology of the subgraphs is considered. The cliques kernel is used in [28, 25].

Formally, let G(V, E) be the input graph, k be the number of vertices in a clique, 𝒮 be the set of induced subgraphs of G, and S be an arbitrary induced subgraph (S ∈ 𝒮). The following items define the k-cliques kernel:

• Subgraphs (which are also induced subgraphs) with size k;

• Predicate C(S) ⇐⇒ ∀u, v ∈ V(S), u ≠ v : (u, v) ∈ E(G);

• Mapping function (g : 𝒮 → {cliques}) = S → cliques;

• Mapping function (h : 𝒮 → ℕ) = S → 1;

• Reduction function (r : (ℕ × ℕ) → ℕ) = (a, b) → a + b.

In this kernel, we have two constraints: valid subgraphs have k vertices, and each of those vertices is connected to the other k−1, yielding a total of k(k−1)/2 edges. Because we are interested in counting a single pattern (i.e., the k-clique pattern), the key mapping function g always returns the constant key cliques. The value mapping function h and the reduction function r together compute the sum over all valid subgraphs. The result of this kernel is a mapping with a single entry holding the number m of k-cliques: {(cliques, m)}. A particular instance of the cliques kernel is the triangle kernel, i.e., cliques with 3 vertices.
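As a minimal sketch of this kernel, the Python snippet below checks the clique predicate on every k-subset of vertices and sums a constant 1 under the single key cliques. The function names and the toy graph are hypothetical, and the exhaustive enumeration is chosen for clarity rather than efficiency.

```python
from itertools import combinations

def is_clique(vertices, adj):
    """Predicate C(S): every pair of distinct vertices in S is adjacent in G."""
    return all(v in adj[u] for u, v in combinations(vertices, 2))

def count_cliques(adj, k):
    """Return {"cliques": m}, or an empty mapping when no k-clique exists."""
    counts = {}
    for S in combinations(sorted(adj), k):
        if is_clique(S, adj):
            # g(S) = "cliques", h(S) = 1, r = addition
            counts["cliques"] = counts.get("cliques", 0) + 1
    return counts

# Toy input (an assumption for illustration): K4 on {0,1,2,3} plus vertex 4 attached to 3.
adj = {0: {1, 2, 3}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {0, 1, 2, 4}, 4: {3}}
triangles = count_cliques(adj, 3)   # the triangle kernel is the case k = 3
four_cliques = count_cliques(adj, 4)
```

Each valid subgraph found this way has k(k−1)/2 edges, e.g., 3 edges for triangles and 6 for 4-cliques.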


2.3.3 Pattern querying kernel (τ-PQ)

Querying a subgraph pattern is perhaps the most basic GPM application. The task is to list and count all the subgraphs in an input graph G that are isomorphic to a user-defined pattern p. The following items describe the subgraph querying kernel in terms of the subgraph aggregation problem:

• Subgraphs with size k = |E(p)|;

• Predicate C(S) ⇐⇒ τ(S) = p;

• Mapping function (g : 𝒮 → 𝒫) = S → p;

• Mapping function (h : 𝒮 → ℕ) = S → 1;

• Reduction function (r : (ℕ × ℕ) → ℕ) = (a, b) → a + b.

Valid subgraphs have exactly the same number of vertices as p, the same number of edges as p, and the same template as p. The output of this kernel is a mapping with only one entry, holding the number m of subgraphs in G that are isomorphic to pattern p: {(p, m)}.
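A brute-force reading of this kernel can be sketched as follows. We assume unlabeled, undirected graphs given as edge lists, enumerate every set of |E(p)| edges of G, and test isomorphism against p by trying all vertex bijections. The helper names `isomorphic` and `pattern_query`, as well as the string key "p", are hypothetical and serve only this illustration.

```python
from itertools import combinations, permutations

def isomorphic(edges_a, edges_b):
    """Brute-force isomorphism test between two small undirected edge lists."""
    va = sorted({x for e in edges_a for x in e})
    vb = sorted({x for e in edges_b for x in e})
    if len(va) != len(vb) or len(edges_a) != len(edges_b):
        return False
    target = {frozenset(e) for e in edges_b}
    for perm in permutations(vb):
        m = dict(zip(va, perm))
        if {frozenset((m[u], m[v])) for u, v in edges_a} == target:
            return True
    return False

def pattern_query(graph_edges, pattern_edges):
    """Count edge subsets of G with |E(p)| edges that are isomorphic to p."""
    k = len(pattern_edges)
    m = sum(1 for S in combinations(graph_edges, k)
            if isomorphic(S, pattern_edges))   # predicate: pattern of S equals p
    return {"p": m}

# Toy input (an assumption for illustration): triangle 0-1-2 plus edge 2-3.
graph_edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
wedge = [("a", "b"), ("b", "c")]               # path with two edges
triangle = [("a", "b"), ("b", "c"), ("a", "c")]
wedge_result = pattern_query(graph_edges, wedge)
triangle_result = pattern_query(graph_edges, triangle)
```

Because subgraphs are sized by number of edges here, the wedge query also matches pairs of edges taken from the triangle.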

2.3.4 Frequent subgraph mining kernel (k-FSM-γ)

A Frequent Subgraph Mining (FSM) kernel seeks to obtain all frequent subgraph patterns from a labeled input graph G. A pattern P is frequent if its support s(P) ∈ 𝕊, where 𝕊 is the set of supports, is at least a threshold γ, i.e., if s(P) ≥ γ. In particular, s(P) is calculated based on the set of isomorphisms of the extracted subgraphs. We adopt the minimum image-based support [16] as the support function s(·) to leverage its anti-monotonic property: a pattern can be frequent only if its smaller sub-patterns are also frequent. Because of this hierarchical behavior, the FSM kernel is defined as a chain of conditional subgraph enumeration problems, where the output of one computation level (the frequent patterns) serves as input to the next computation level.

Specifically, to compute the frequent patterns with k edges, we split the execution into k steps. For k = 3, for instance, these are: (1) computation of frequent patterns with 1 edge; (2) computation of frequent patterns with 2 edges from frequent patterns of size 1; and (3) computation of frequent patterns with 3 edges from frequent patterns of size 2. The following items describe each step i = 1, 2, ..., k of the FSM kernel. Let G(V, E) be the input graph, i be the current FSM step, 𝒮 be the set of subgraphs of G with i edges, and F_{i−1} be the set of frequent patterns with i−1 edges⁴:

• Subgraphs with size i (i.e., number of edges);

• Predicate C(S) ⇐⇒ τ_c(S) ∈ F_{i−1};

• Mapping function (g : 𝒮 → 𝒫) = S → τ_c(S);

• Mapping function (h : 𝒮 → 𝕊) = S → s(S);

• Reduction function (r : (𝕊 × 𝕊) → 𝕊) = (a, b) → a + b⁵.

In this case, step i generates valid subgraphs with i edges that can be obtained from frequent patterns (S′ ∈ F_{i−1}). The key mapping function extracts the pattern of the subgraph, and the value mapping function extracts the support of the subgraph. Finally, the reduction function combines the supports of the same pattern using an associative “+” operator. The output of step i is A(𝒮, g, h, r), from which we derive the input for step i+1 by keeping only the mappings of frequent patterns: F_i = {(τ, s) ∈ A(𝒮, g, h, r) such that s ≥ γ}.
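To make the aggregation of supports more concrete, the sketch below covers only the first step (i = 1) on a vertex-labeled graph. It is an assumption-laden illustration: we represent the minimum image-based support of a single-edge pattern as one vertex domain per pattern endpoint, realize the “+” operator as a position-wise set union, and take the support to be the size of the smallest domain. All function names are hypothetical.

```python
def mni_value(u, v, label):
    """Key and support value contributed by one edge (u, v)."""
    lu, lv = label[u], label[v]
    if lu == lv:
        # Both orientations are valid isomorphisms, so each endpoint
        # contributes to both domains.
        both = frozenset((u, v))
        return (lu, lv), (both, both)
    if lu > lv:
        u, v, lu, lv = v, u, lv, lu   # order endpoints by label
    return (lu, lv), (frozenset((u,)), frozenset((v,)))

def union_domains(a, b):
    """Assumed "+" over supports: position-wise union of vertex domains."""
    return tuple(x | y for x, y in zip(a, b))

def support(domains):
    """Minimum image-based support: size of the smallest domain."""
    return min(len(d) for d in domains)

def frequent_single_edge_patterns(edges, label, gamma):
    """Step i = 1: aggregate supports per pattern, keep those with s >= gamma."""
    agg = {}
    for u, v in edges:
        key, value = mni_value(u, v, label)
        agg[key] = union_domains(agg[key], value) if key in agg else value
    return {p: support(d) for p, d in agg.items() if support(d) >= gamma}

# Toy labeled graph (an assumption for illustration).
label = {0: "A", 1: "B", 2: "A", 3: "B", 4: "A"}
edges = [(0, 1), (2, 1), (2, 3), (4, 3), (0, 2)]
f1 = frequent_single_edge_patterns(edges, label, gamma=2)
```

Later steps would extend only subgraphs whose patterns survive this filter, which is where the anti-monotonic property pays off.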

2.3.5 Quasi-cliques kernel (k-QC-γ)

Dense subgraph extraction can assist in fraud detection for social networks [55] and in unveiling structural correlations for attributed graphs [115], among other applications. A γ-quasi-clique of size k is a subgraph that has k vertices, each with a degree of at least ⌈γ ∗ (|V(S)| − 1)⌉ within the subgraph, i.e., each vertex is connected to at least a fraction γ of the other vertices in the subgraph. The problem seeks to list and count γ-quasi-cliques in a graph:

• Subgraphs with k vertices;

• Predicate C(S) ⇐⇒ ∀u ∈ V(S), d_S(u) ≥ ⌈γ ∗ (|V(S)| − 1)⌉;

• Mapping function (g : 𝒮 → {qc}) = S → qc;

• Mapping function (h : 𝒮 → ℕ) = S → 1;

• Reduction function (r : (ℕ × ℕ) → ℕ) = (a, b) → a + b.
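The degree predicate of the quasi-cliques kernel can be sketched as below. This is a brute-force illustration over all k-subsets of vertices with hypothetical function names; it checks only the degree condition, and the examples use a density of at least 0.5, for which every subset satisfying the condition is necessarily connected.

```python
from itertools import combinations
from math import ceil

def is_quasi_clique(vertices, adj, gamma):
    """Predicate C(S): every vertex has degree >= ceil(gamma * (|V(S)| - 1)) in S."""
    vs = set(vertices)
    need = ceil(gamma * (len(vs) - 1))
    return all(len(adj[u] & vs) >= need for u in vs)

def count_quasi_cliques(adj, k, gamma):
    """Return {"qc": m}, or an empty mapping when no subgraph qualifies."""
    counts = {}
    for S in combinations(sorted(adj), k):
        if is_quasi_clique(S, adj, gamma):
            counts["qc"] = counts.get("qc", 0) + 1
    return counts

# Toy input (an assumption for illustration): 4-cycle 0-1-2-3 with chord 0-2.
adj = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1, 3}, 3: {0, 2}}
dense4 = count_quasi_cliques(adj, 4, 0.6)   # every vertex needs degree >= 2
strict4 = count_quasi_cliques(adj, 4, 1.0)  # gamma = 1 degenerates to cliques
```

Setting the density to 1 recovers the cliques kernel, since every vertex must then be adjacent to all other vertices of the subgraph.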

4 If i = 1, then F_{i−1} = ∅.

5 The operator “+” must be defined over the set of supports 𝕊.


2.3.6 Query specialization kernel (τ-QS)

Given a pattern query τ, the goal of query specialization [94, 137] is to unveil new queries that are specializations of τ, i.e., larger queries containing τ. The major procedure in unveiling query specializations is to list and count the subgraphs that are isomorphic to specializations of query τ. This kernel is used in the context of query recommendation, and we consider it for single-graph mining instead of graph databases. Indeed, the graph-database setting can be viewed as a particular case of the single-graph setting, which is consequently the more challenging of the two [62]. For simplicity, we consider specializations containing pattern τ and one additional edge:

• Subgraphs with k = |E(τ)| + 1 edges;

• Predicate C(S) ⇐⇒ p ⊂ τ(S);

• Mapping function (g : 𝒮 → {qs}) = S → qs;

• Mapping function (h : 𝒮 → ℕ) = S → 1;

• Reduction function (r : (ℕ × ℕ) → ℕ) = (a, b) → a + b.
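One possible brute-force reading of this kernel is sketched below, assuming unlabeled, undirected graphs given as edge lists. Since a specialization has exactly one edge more than the query, a connected subgraph contains the query if removing one of its edges leaves a subgraph isomorphic to the query. The helper names are hypothetical and the exhaustive search is only practical for tiny inputs.

```python
from itertools import combinations, permutations

def isomorphic(edges_a, edges_b):
    """Brute-force isomorphism test between two small undirected edge lists."""
    va = sorted({x for e in edges_a for x in e})
    vb = sorted({x for e in edges_b for x in e})
    if len(va) != len(vb) or len(edges_a) != len(edges_b):
        return False
    target = {frozenset(e) for e in edges_b}
    for perm in permutations(vb):
        m = dict(zip(va, perm))
        if {frozenset((m[u], m[v])) for u, v in edges_a} == target:
            return True
    return False

def edge_connected(edges):
    """Return True if the graph formed by `edges` is connected."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    start = next(iter(adj))
    seen, stack = {start}, [start]
    while stack:
        x = stack.pop()
        for y in adj[x] - seen:
            seen.add(y)
            stack.append(y)
    return len(seen) == len(adj)

def query_specialization(graph_edges, query_edges):
    """Count connected subgraphs with |E(query)| + 1 edges that contain the query."""
    k = len(query_edges) + 1
    count = 0
    for S in combinations(graph_edges, k):
        # Containment check: dropping one edge of S must yield the query pattern.
        if edge_connected(S) and any(
                isomorphic(S[:i] + S[i + 1:], query_edges) for i in range(k)):
            count += 1
    return {"qs": count}

# Toy input (an assumption for illustration): triangle 0-1-2 plus edge 2-3.
graph_edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
wedge = [("a", "b"), ("b", "c")]
specializations = query_specialization(graph_edges, wedge)
```

Following the kernel definition, all specializations are counted under the single key qs; grouping by the pattern of each subgraph instead would reveal which specializations occur.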

2.3.7 Label search kernel (k-LS-L)

Graph databases are often represented by entities (vertices) that are related among themselves (edges). In this context, vertices may carry labeled semantics representing roles or types in the database schema. The goal of label-based subgraph search is to extract relevant induced subgraphs from a larger input graph according to labels of interest L.

The following items define this kernel in terms of the subgraph aggregation problem:

• Induced subgraphs with size k;

• Predicate C(S) ⇐⇒ ∀u ∈ V(S) [L(u) ⊆ L];

• Mapping function (g : 𝒮 → {ls}) = S → ls;

• Mapping function (h : 𝒮 → ℕ) = S → 1;

• Reduction function (r : (ℕ × ℕ) → ℕ) = (a, b) → a + b.

Valid subgraphs for this kernel are induced, must have a given number k of vertices, and each vertex must have only labels within the label set L. While this kernel may seem simple, it is widely used in NoSQL graph databases (e.g., Neo4j) for exploratory analysis in machine learning pipelines.
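A minimal sketch of the label search kernel follows. It assumes that each vertex carries a set of labels, that the labels of interest are given as a set, and that only connected induced subgraphs are considered. The function names, the labels, and the toy graph are hypothetical and do not correspond to the API of any graph database.

```python
from itertools import combinations

def connected(vertices, adj):
    """Return True if the subgraph induced by `vertices` is connected."""
    vertices = set(vertices)
    start = next(iter(vertices))
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for w in adj[u] & vertices:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return seen == vertices

def label_search(adj, labels, k, wanted):
    """Count connected induced k-vertex subgraphs whose labels all lie in `wanted`."""
    count = 0
    for S in combinations(sorted(adj), k):
        # Predicate C(S): the label set of every vertex is a subset of `wanted`.
        if all(labels[u] <= wanted for u in S) and connected(S, adj):
            count += 1
    return {"ls": count}

# Toy input (an assumption for illustration): path 0-1-2-3 with labeled vertices.
adj = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
labels = {0: {"Person"}, 1: {"Person"}, 2: {"Person", "Admin"}, 3: {"Movie"}}
pairs = label_search(adj, labels, 2, {"Person", "Admin"})
triples = label_search(adj, labels, 3, {"Person", "Admin"})
```

Vertex 3 is excluded from every result because its label falls outside the labels of interest, and vertex 2 is excluded as soon as "Admin" is dropped from them.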

From the document Graph pattern mining: consolidating models, systems, and abstractions (pages 34-39)