2 Conjunctive Inductive Queries

In this section, we deﬁne conjunctive inductive queries as well as the pattern domain of strings. Our presentation closely follows that of [6].

A pattern language L is a formal language for specifying patterns. Each pattern φ ∈ L matches (or covers) a set of examples φ_e, which is a subset of the universe U of possible examples.

Example 1. Let Σ be a finite alphabet and U_Σ = Σ* the universe of all strings over Σ. We will denote the empty string with ε. The traditional pattern language in this domain is L_Σ = U_Σ. A pattern φ ∈ L_Σ covers the set φ_e = {ψ ∈ Σ* | φ ⊑ ψ}, where φ ⊑ ψ denotes that φ is a substring of ψ.

One pattern φ is more general than a pattern ψ, written φ ⪰ ψ, if and only if φ_e ⊇ ψ_e.
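For the string domain of Example 1, both the covers relation and the generality relation reduce to substring tests: φ_e ⊇ ψ_e holds exactly when φ is a substring of ψ, because ψ itself belongs to ψ_e. The following minimal sketch illustrates this; the function names `covers` and `is_more_general` are illustrative choices, not taken from the paper.

```python
def covers(pattern: str, example: str) -> bool:
    """A string pattern covers an example iff it occurs in it as a substring."""
    return pattern in example


def is_more_general(phi: str, psi: str) -> bool:
    """phi is more general than psi iff every string covered by psi is covered by phi.

    For substring patterns this is equivalent to phi being a substring of psi.
    """
    return phi in psi


print(covers("ca", "accag"))          # "ca" occurs in "accag"
print(is_more_general("ac", "acca"))  # every string containing "acca" contains "ac"
print(is_more_general("acca", "ac"))  # the converse does not hold
```

Note that the empty string is a substring of every string, so it is the most general pattern in this language.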

A pattern predicate defines a primitive property of a pattern, usually relative to some data set D (a set of examples), and sometimes other parameters. For any given pattern, it evaluates to either true or false.

We now introduce a number of pattern predicates that will be used for illustrative purposes throughout this paper. Our first pattern predicates are very general in that they can be used for arbitrary pattern languages:

– minfreq(φ, n, D) evaluates to true iff φ is a pattern that occurs in database D with frequency at least n ∈ ℕ. The frequency f(φ, D) of a pattern φ in a database D is the (absolute) number of examples in D covered by φ. The predicate maxfreq(φ, n, D) is defined analogously (frequency at most n).

– ismoregeneral(φ, ψ) is a predicate that evaluates to true iff pattern φ is more general than pattern ψ. Dual to the ismoregeneral predicate, one defines the ismorespecific predicate.

The following predicate is an example of a predicate tailored to the specific domain of string patterns over L_Σ.

– length_atmost(φ, n) evaluates to true for φ ∈ L_Σ iff φ has length at most n. The length_atleast(φ, n) predicate is defined analogously.
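The predicates listed above can be sketched for string patterns as plain Python functions. This is only an illustration under the assumption that a database is a set of strings and that covering is the substring test; the data set `toy` is hypothetical and not taken from the paper.

```python
def frequency(pattern: str, data: set) -> int:
    """Absolute number of examples in the data set that the pattern covers."""
    return sum(1 for example in data if pattern in example)


def minfreq(pattern: str, n: int, data: set) -> bool:
    return frequency(pattern, data) >= n


def maxfreq(pattern: str, n: int, data: set) -> bool:
    return frequency(pattern, data) <= n


def ismoregeneral(phi: str, psi: str) -> bool:
    # For substring patterns: phi is more general iff it is a substring of psi.
    return phi in psi


def ismorespecific(phi: str, psi: str) -> bool:
    return psi in phi


def length_atmost(pattern: str, n: int) -> bool:
    return len(pattern) <= n


def length_atleast(pattern: str, n: int) -> bool:
    return len(pattern) >= n


toy = {"abc", "bcd", "cd"}        # hypothetical data set
print(frequency("cd", toy))       # covered by "bcd" and "cd"
print(minfreq("c", 3, toy))       # "c" occurs in all three strings
print(length_atmost("cd", 2))
```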

In all the preceding examples, the pattern predicates have the form pred(φ, params) or pred(φ, D, params), where params is a tuple of parameter values, D is a data set and φ is a pattern variable.

We also speak a bit loosely of pred alone as a pattern predicate, and mean by that the collection of all pattern predicates obtained for different parameter values params. We say that m is a monotonic predicate if, for all possible parameter values params and all data sets D:

∀φ, ψ ∈ L such that φ ⪰ ψ: m(ψ, D, params) → m(φ, D, params)

The class of anti-monotonic predicates is defined dually. Thus, minfreq, ismoregeneral, and length_atmost are monotonic; their duals are anti-monotonic.
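The definition can be tested by brute force on a finite set of string patterns. The sketch below is illustrative only: the helper names and the two-string data set are assumptions, and a finite check can refute monotonicity but cannot prove it for the whole (infinite) language.

```python
from itertools import product


def substrings(s: str) -> set:
    """All substrings of s, including the empty string."""
    return {s[i:j] for i in range(len(s) + 1) for j in range(i, len(s) + 1)}


def frequency(pattern: str, data: set) -> int:
    return sum(1 for example in data if pattern in example)


def is_monotonic_on(pred, patterns) -> bool:
    """Whenever phi is more general than psi and pred(psi) holds, pred(phi) holds."""
    return all(pred(phi)
               for phi, psi in product(patterns, repeat=2)
               if phi in psi and pred(psi))


def is_anti_monotonic_on(pred, patterns) -> bool:
    """Dual condition: pred(phi) implies pred(psi) for every more specific psi."""
    return all(pred(psi)
               for phi, psi in product(patterns, repeat=2)
               if phi in psi and pred(phi))


data = {"ab", "b"}                               # hypothetical data set
patterns = substrings("ab") | substrings("b")    # {"", "a", "b", "ab"}

print(is_monotonic_on(lambda p: frequency(p, data) >= 2, patterns))       # minfreq, n = 2
print(is_anti_monotonic_on(lambda p: frequency(p, data) <= 1, patterns))  # maxfreq, n = 1
print(is_monotonic_on(lambda p: frequency(p, data) <= 1, patterns))       # maxfreq is not monotonic here
```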

A pattern predicate pred(φ, D, params) defines the solution set Th(pred(φ, D, params), L) = {ψ ∈ L | pred(ψ, D, params) = true}. Furthermore, for monotonic predicates m these sets will be monotone, i.e. for all φ ⪰ ψ ∈ L: ψ ∈ Th(m, L) → φ ∈ Th(m, L).

Example 2. Let L = L_Σ with Σ = {a, c, g, t} and D = {acca, at, gg, cga, accag, at, g}. Then the following predicates evaluate to true: minfreq(acca, 2, D), minfreq(g, 4, D), maxfreq(ca, 2, D), maxfreq(t, 1, D).

The pattern predicate Q_m := minfreq(φ, 2, D) defines Th(Q_m, L_Σ) = {ε, a, c, g, ac, ca, cc, acc, cca, acca}, and the pattern predicate Q_a := maxfreq(φ, 3, D) defines the infinite set Th(Q_a, L_Σ) = L_Σ \ {ε, g}.
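The four predicate values and the solution set of the minfreq query in Example 2 can be reproduced with a few lines of Python. This is a sketch under two assumptions: D is treated as a set, so the repeated string at counts once (which is what makes maxfreq(t, 1, D) true), and candidates are restricted to substrings of the examples, since every other pattern has frequency 0 and cannot satisfy minfreq with n = 2.

```python
def substrings(s: str) -> set:
    """All substrings of s, including the empty string."""
    return {s[i:j] for i in range(len(s) + 1) for j in range(i, len(s) + 1)}


def frequency(pattern: str, data: set) -> int:
    return sum(1 for example in data if pattern in example)


def minfreq(pattern: str, n: int, data: set) -> bool:
    return frequency(pattern, data) >= n


def maxfreq(pattern: str, n: int, data: set) -> bool:
    return frequency(pattern, data) <= n


# Set literal: the duplicate "at" collapses into a single example.
D = {"acca", "at", "gg", "cga", "accag", "at", "g"}

# Only substrings of some example can have non-zero frequency.
candidates = set().union(*(substrings(s) for s in D))

th_qm = {p for p in candidates if minfreq(p, 2, D)}
print(sorted(th_qm, key=lambda p: (len(p), p)))
```

The empty Python string plays the role of the empty pattern ε.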

The definition of Th(pred(φ, D, params), L) is extended in the natural way to a definition of the solution set Th(Q, L) for boolean combinations Q of pattern predicates over a unique pattern variable: Th(¬Q, L) := L \ Th(Q, L), Th(Q_1 ∨ Q_2, L) := Th(Q_1, L) ∪ Th(Q_2, L). The predicates that appear in Q may reference one or more data sets D_1, ..., D_n.
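On a finite stand-in for the pattern language, these definitions become ordinary set operations. The sketch below uses a hypothetical five-element candidate set in place of L (the real string language is infinite, so complements cannot be enumerated in general); the names `th`, `q1` and `q2` are illustrative.

```python
def th(pred, language: set) -> set:
    """Solution set of a query, restricted to a finite candidate language."""
    return {p for p in language if pred(p)}


language = {"", "a", "b", "ab", "ba"}   # hypothetical finite stand-in for L


def q1(p: str) -> bool:
    return len(p) <= 1                  # length_atmost(p, 1)


def q2(p: str) -> bool:
    return "a" in p                     # ismorespecific(p, "a")


negation = th(lambda p: not q1(p), language)
disjunction = th(lambda p: q1(p) or q2(p), language)

print(negation == language - th(q1, language))
print(disjunction == th(q1, language) | th(q2, language))
```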

We are interested in computing solution sets Th(Q, D, L) for boolean queries Q that are constructed from monotonic and anti-monotonic pattern predicates.

De Raedt et al. [6] presented a technique to rewrite an arbitrary boolean query Q into an equivalent query of the form Q_1 ∨ ... ∨ Q_k such that k is minimal and each of the subqueries Q_i is the conjunction Q_{a,i} ∧ Q_{m,i} of an anti-monotonic and a monotonic query. This type of subquery is called conjunctive. De Raedt et al. argued that this is useful because 1) there exist effective algorithms for computing the solution space of such queries Q_i (cf. [5, 4, 6, 17] and below), and 2) minimizing k in this context also minimizes the number of calls to such algorithms and thus corresponds to a kind of inductive query optimization. In the present paper, we focus on the subproblem of optimizing conjunctive inductive queries, which is an essential step in this process. Observe that in a conjunctive query Q_a ∧ Q_m, Q_a and Q_m need not be atomic expressions. Indeed, it is well known that both the disjunction and conjunction of two monotonic (resp. anti-monotonic) predicates are monotonic (resp. anti-monotonic). Furthermore, the negation of a monotonic predicate is anti-monotonic and vice versa.
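The closure under negation follows from the definition of monotonicity by contraposition. The short derivation below spells this out; it assumes the dual reading of anti-monotonicity, namely that a(φ) implies a(ψ) whenever φ is more general than ψ, which the text states only as "defined dually".

```latex
\begin{align*}
m \text{ monotonic}
  &\iff \forall \phi \succeq \psi:\;
        m(\psi, D, \mathit{params}) \rightarrow m(\phi, D, \mathit{params}) \\
  &\iff \forall \phi \succeq \psi:\;
        \neg m(\phi, D, \mathit{params}) \rightarrow \neg m(\psi, D, \mathit{params}) \\
  &\iff \neg m \text{ anti-monotonic.}
\end{align*}
```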

We will assume that there are cost functions c_a and c_m associated with the anti-monotonic and monotonic subqueries Q_a and Q_m. The idea is that the cost functions reflect the (expected) costs of evaluating the query on a pattern. E.g., c_a(φ) denotes the expected cost needed to evaluate the anti-monotonic query Q_a on the pattern φ. The present paper will neither propose realistic cost functions nor address the nature of these cost functions. Even though it is clear that some predicates are more expensive than others, more work seems needed in order to obtain cost estimates that are as reliable as in traditional databases. The present use of cost functions is only a first step in this direction. It is also worth mentioning that several of the traditional pattern mining algorithms, such as Agrawal et al.’s Apriori [2] and the levelwise algorithm [18], try to minimize the number of passes through the data. Even though this could also be cast within the present framework, the cost functions introduced above better fit the situation where the data can be stored in main memory. One direct application would concern molecular feature mining [17, 16], where one aims at discovering fragments (i.e. subgraphs) within molecules (represented as graph structures). In this case, a natural cost measure is the number of covers tests (i.e. matching a pattern with a molecule or graph), because each covers test corresponds to a subgraph isomorphism problem, a known NP-complete problem.
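To make the idea of counting covers tests concrete, the following sketch instruments the covers test with a counter and evaluates minfreq with early termination. It is an illustration only: the class `CoversCounter`, the early-exit strategy and the use of strings instead of graphs are assumptions made here, not the cost model of the paper.

```python
class CoversCounter:
    """Wraps the covers test and counts how often it is invoked."""

    def __init__(self) -> None:
        self.calls = 0

    def covers(self, pattern: str, example: str) -> bool:
        self.calls += 1
        return pattern in example   # stand-in for an expensive matching test


def minfreq(pattern: str, n: int, data: list, counter: CoversCounter) -> bool:
    """Evaluate minfreq, stopping as soon as n covered examples have been found."""
    if n <= 0:
        return True
    hits = 0
    for example in data:
        if counter.covers(pattern, example):
            hits += 1
            if hits >= n:
                return True
    return False


data = ["acca", "at", "gg", "cga", "accag", "g"]
counter = CoversCounter()

print(minfreq("a", 2, data, counter), counter.calls)  # succeeds after two covers tests
print(minfreq("t", 2, data, counter), counter.calls)  # needs a full pass over the data
```

Under this measure, the cost of evaluating the same predicate varies with the pattern, which is why the cost functions take the pattern as argument.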

We are now able to formulate the conjunctive inductive query optimization problem that is addressed in this paper:

Given

– a language L of patterns,

– a conjunctive query Q = Q_a ∧ Q_m,

– two cost functions c_a and c_m from L to ℝ,

Find the set of patterns Th(Q, D, L), i.e. the solution set of the query Q in the language L with respect to the database D, in such a way that the total cost needed to evaluate patterns is as small as possible.

One useful property of conjunctive inductive queries is that their solution space Th(Q, D, L) is a version space (sometimes also called a convex space).

Definition 3. Let L be a pattern language, and I ⊆ L. I is a version space if

∀φ, φ′, ψ ∈ L: φ ⪰ ψ ⪰ φ′ and φ, φ′ ∈ I ⟹ ψ ∈ I.
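For finite sets of string patterns, the convexity condition of Definition 3 can be checked directly: any pattern lying between a more general and a more specific element must be a substring of the more specific one, so only finitely many candidates need to be inspected. The sketch below assumes the substring ordering of Example 1; the function names are illustrative.

```python
def substrings(s: str) -> set:
    """All substrings of s, including the empty string."""
    return {s[i:j] for i in range(len(s) + 1) for j in range(i, len(s) + 1)}


def is_version_space(patterns: set) -> bool:
    """Check convexity of a finite set of string patterns under the substring order."""
    for general in patterns:
        for specific in patterns:
            if general not in specific:
                continue  # the pair is not ordered, nothing to check
            # Every pattern between `general` and `specific` must belong to the set.
            for psi in substrings(specific):
                if general in psi and psi not in patterns:
                    return False
    return True


print(is_version_space({"a", "c", "ac", "ca", "cc", "acc", "cca", "acca"}))
print(is_version_space({"a", "acca"}))  # "ac", "ca", ... lie in between but are missing
```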

Version spaces are particularly useful when they can be represented by boundary sets, i.e. by the sets G(Q, D, L) of their maximally general elements, and S(Q, D, L) of their minimally general elements. Finite version spaces are always boundary set representable, cf. [14], and this is what we will assume from now on.

Example 4. Continuing from Ex. 2, let Q = Q_m ∧ Q_a. We have Th(Q, D, L_Σ) = {a, c, ac, ca, cc, acc, cca, acca}. This set of solutions is completely characterized by S(Q, D, L_Σ) = {acca} and G(Q, D, L_Σ) = {a, c}.
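The boundary sets of Example 4 can be recomputed from the stated solution set, and the solution set can in turn be recovered from the boundaries. The sketch below takes the solution set exactly as given in the example; the helper names are illustrative, and the substring ordering of Example 1 is assumed.

```python
def substrings(s: str) -> set:
    """All substrings of s, including the empty string."""
    return {s[i:j] for i in range(len(s) + 1) for j in range(i, len(s) + 1)}


def boundary_sets(patterns: set):
    """Return (G, S): the maximally general and the minimally general elements."""
    g = {p for p in patterns if not any(q != p and q in p for q in patterns)}
    s = {p for p in patterns if not any(q != p and p in q for q in patterns)}
    return g, s


def expand(g: set, s: set) -> set:
    """Recover the version space: patterns between some element of G and some element of S."""
    return {psi
            for specific in s
            for psi in substrings(specific)
            if any(general in psi for general in g)}


solutions = {"a", "c", "ac", "ca", "cc", "acc", "cca", "acca"}
G, S = boundary_sets(solutions)
print(sorted(G), sorted(S))
print(expand(G, S) == solutions)
```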
