
2 Conjunctive Inductive Queries

From the PDF document www-ai.ijs.si (pages 53-56)

In this section, we define conjunctive inductive queries as well as the pattern domain of strings. Our presentation closely follows that of [6].

A pattern language $\mathcal{L}$ is a formal language for specifying patterns. Each pattern $\phi \in \mathcal{L}$ matches (or covers) a set of examples $\phi_e$, which is a subset of the universe $\mathcal{U}$ of possible examples.

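Example 1. Let $\Sigma$ be a finite alphabet and $\mathcal{U}_\Sigma = \Sigma^*$ the universe of all strings over $\Sigma$. We will denote the empty string with $\epsilon$. The traditional pattern language in this domain is $\mathcal{L}_\Sigma = \mathcal{U}_\Sigma$. A pattern $\phi \in \mathcal{L}_\Sigma$ covers the set $\phi_e = \{\psi \in \Sigma^* \mid \phi \sqsubseteq \psi\}$, where $\phi \sqsubseteq \psi$ denotes that $\phi$ is a substring of $\psi$.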

One pattern $\phi$ is more general than a pattern $\psi$, written $\phi \preceq \psi$, if and only if $\phi_e \supseteq \psi_e$.
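A minimal Python sketch of Example 1 and the generality relation, assuming patterns and examples are plain Python strings. The function names `covers` and `is_more_general` are illustrative and do not come from the paper.

```python
def covers(pattern: str, example: str) -> bool:
    """Return True iff `pattern` is a substring of `example`.

    The empty pattern is a substring of every string, so it covers everything.
    """
    return pattern in example


def is_more_general(phi: str, psi: str) -> bool:
    """Return True iff phi is more general than psi (cover of phi contains cover of psi).

    For substring patterns this holds exactly when phi is a substring of psi:
    every string containing psi then contains phi, and psi itself lies in its
    own cover, so the superset condition forces phi to occur in psi.
    """
    return phi in psi


print(covers("cc", "acca"))            # True
print(is_more_general("ac", "acca"))   # True
print(is_more_general("acca", "ac"))   # False
```

Note that generality is only a partial order: neither of `ac` and `ca` is more general than the other.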

A pattern predicate defines a primitive property of a pattern, usually relative to some data set $\mathcal{D}$ (a set of examples), and sometimes other parameters. For any given pattern, it evaluates to either true or false.

We now introduce a number of pattern predicates that will be used for illustrative purposes throughout this paper. Our first pattern predicates are very general in that they can be used for arbitrary pattern languages:

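$\mathit{minfreq}(\phi, n, \mathcal{D})$ evaluates to true iff $\phi$ is a pattern that occurs in database $\mathcal{D}$ with frequency at least $n \in \mathbb{N}$. The frequency $f(\phi, \mathcal{D})$ of a pattern $\phi$ in a database $\mathcal{D}$ is the (absolute) number of examples in $\mathcal{D}$ covered by $\phi$. The predicate $\mathit{maxfreq}(\phi, n, \mathcal{D})$ is defined analogously, with frequency at most $n$.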

$\mathit{ismoregeneral}(\phi, \psi)$ is a predicate that evaluates to true iff pattern $\phi$ is more general than pattern $\psi$. Dual to the $\mathit{ismoregeneral}$ predicate one defines the $\mathit{ismorespecific}$ predicate.

The following predicate is an example predicate tailored towards the specific domain of string patterns over $\mathcal{L}_\Sigma$.

$\mathit{length\_atmost}(\phi, n)$ evaluates to true for $\phi \in \mathcal{L}_\Sigma$ iff $\phi$ has length at most $n$. The $\mathit{length\_atleast}(\phi, n)$ predicate is defined analogously.

In all the preceding examples the pattern predicates have the form $\mathit{pred}(\phi, \mathit{params})$ or $\mathit{pred}(\phi, \mathcal{D}, \mathit{params})$, where $\mathit{params}$ is a tuple of parameter values, $\mathcal{D}$ is a data set and $\phi$ is a pattern variable.

We also speak a bit loosely of $\mathit{pred}$ alone as a pattern predicate, and mean by that the collection of all pattern predicates obtained for different parameter values $\mathit{params}$. We say that $m$ is a monotonic predicate, if for all possible parameter values $\mathit{params}$ and all data sets $\mathcal{D}$:

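$$\forall \phi, \psi \in \mathcal{L} \text{ such that } \phi \preceq \psi: \; m(\psi, \mathcal{D}, \mathit{params}) \rightarrow m(\phi, \mathcal{D}, \mathit{params})$$

The class of anti-monotonic predicates is defined dually. Thus, $\mathit{minfreq}$, $\mathit{ismoregeneral}$, and $\mathit{length\_atmost}$ are monotonic, and their duals are anti-monotonic.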
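The definition can be sanity-checked by brute force on a small finite set of patterns. The sketch below assumes substring patterns, a hypothetical toy data set, and illustrative helper names. A check over finitely many patterns can only refute monotonicity, not prove it for the whole language.

```python
from itertools import product


def substrings(s):
    """All substrings of s, including the empty string."""
    return {s[i:j] for i in range(len(s) + 1) for j in range(i, len(s) + 1)}


def frequency(pattern, data):
    """Number of examples in data that contain pattern as a substring."""
    return sum(1 for example in data if pattern in example)


def is_monotonic(pred, patterns):
    """For all phi more general than psi: pred(psi) implies pred(phi)."""
    return all(pred(phi)
               for phi, psi in product(patterns, repeat=2)
               if phi in psi and pred(psi))


def is_anti_monotonic(pred, patterns):
    """Dual condition: for all phi more general than psi, pred(phi) implies pred(psi)."""
    return all(pred(psi)
               for phi, psi in product(patterns, repeat=2)
               if phi in psi and pred(phi))


# Hypothetical toy data set (not from the paper).
data = ["ab", "abc", "bc", "c"]
patterns = set().union(*(substrings(example) for example in data))


def minfreq2(p):
    return frequency(p, data) >= 2


def maxfreq2(p):
    return frequency(p, data) <= 2


print(is_monotonic(minfreq2, patterns))        # True
print(is_anti_monotonic(maxfreq2, patterns))   # True
print(is_anti_monotonic(minfreq2, patterns))   # False: ab is frequent, abc is not
```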

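A pattern predicate $\mathit{pred}(\phi, \mathcal{D}, \mathit{params})$ defines the solution set $\mathit{Th}(\mathit{pred}(\phi, \mathcal{D}, \mathit{params}), \mathcal{L}) = \{\psi \in \mathcal{L} \mid \mathit{pred}(\psi, \mathcal{D}, \mathit{params}) = \mathit{true}\}$. Furthermore, for monotonic predicates $m$ these sets will be monotone, i.e. for all $\phi \preceq \psi \in \mathcal{L}$: $\psi \in \mathit{Th}(m, \mathcal{L}) \rightarrow \phi \in \mathit{Th}(m, \mathcal{L})$.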

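Example 2. Let $\mathcal{L} = \mathcal{L}_\Sigma$ with $\Sigma = \{a, c, g, t\}$ and $\mathcal{D} = \{acca, at, gg, cga, accag, at, g\}$. Then the following predicates evaluate to true: $\mathit{minfreq}(acca, 2, \mathcal{D})$, $\mathit{minfreq}(g, 4, \mathcal{D})$, $\mathit{maxfreq}(ca, 2, \mathcal{D})$, $\mathit{maxfreq}(t, 1, \mathcal{D})$.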

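The pattern predicate $Q_m := \mathit{minfreq}(\phi, 2, \mathcal{D})$ defines $\mathit{Th}(Q_m, \mathcal{L}_\Sigma) = \{\epsilon, a, c, g, ac, ca, cc, acc, cca, acca\}$, and the pattern predicate $Q_a := \mathit{maxfreq}(\phi, 3, \mathcal{D})$ defines the infinite set $\mathit{Th}(Q_a, \mathcal{L}_\Sigma) = \mathcal{L}_\Sigma \setminus \{\epsilon, g\}$.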
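The four predicate values and the solution set of the minimum-frequency query can be recomputed with a short Python sketch. It assumes the data set is read as a set of examples, as in the definition of a data set above, so the repeated entry `at` counts once. Helper names are illustrative. Only substrings of the examples need to be enumerated, because any other pattern has frequency 0.

```python
def substrings(s):
    """All substrings of s, including the empty string."""
    return {s[i:j] for i in range(len(s) + 1) for j in range(i, len(s) + 1)}


def frequency(pattern, data):
    """Number of examples in data that contain pattern as a substring."""
    return sum(1 for example in data if pattern in example)


def minfreq(pattern, n, data):
    return frequency(pattern, data) >= n


def maxfreq(pattern, n, data):
    return frequency(pattern, data) <= n


# Python set literal: the repeated "at" is stored once (set semantics assumed).
data = {"acca", "at", "gg", "cga", "accag", "at", "g"}

checks = [
    minfreq("acca", 2, data),
    minfreq("g", 4, data),
    maxfreq("ca", 2, data),
    maxfreq("t", 1, data),
]

# Candidate patterns: every substring of some example (empty string included).
candidates = set().union(*(substrings(example) for example in data))
th_qm = {p for p in candidates if minfreq(p, 2, data)}

print(all(checks))
print(sorted(th_qm, key=lambda p: (len(p), p)))
```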

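The definition of $\mathit{Th}(\mathit{pred}(\phi, \mathcal{D}, \mathit{params}), \mathcal{L})$ is extended in the natural way to a definition of the solution set $\mathit{Th}(Q, \mathcal{L})$ for boolean combinations $Q$ of pattern predicates over a unique pattern variable: $\mathit{Th}(\neg Q, \mathcal{L}) := \mathcal{L} \setminus \mathit{Th}(Q, \mathcal{L})$, $\mathit{Th}(Q_1 \vee Q_2, \mathcal{L}) := \mathit{Th}(Q_1, \mathcal{L}) \cup \mathit{Th}(Q_2, \mathcal{L})$. The predicates that appear in $Q$ may reference one or more data sets $\mathcal{D}_1, \ldots, \mathcal{D}_n$.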
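As a worked step that is derived here from the two rules for negation and disjunction (it is not stated explicitly in the text), the solution set of a conjunction follows by De Morgan's law:

$$
\begin{aligned}
\mathit{Th}(Q_1 \wedge Q_2, \mathcal{L})
  &= \mathit{Th}(\neg(\neg Q_1 \vee \neg Q_2), \mathcal{L}) \\
  &= \mathcal{L} \setminus \bigl( (\mathcal{L} \setminus \mathit{Th}(Q_1, \mathcal{L})) \cup (\mathcal{L} \setminus \mathit{Th}(Q_2, \mathcal{L})) \bigr) \\
  &= \mathit{Th}(Q_1, \mathcal{L}) \cap \mathit{Th}(Q_2, \mathcal{L}).
\end{aligned}
$$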

We are interested in computing solution sets $\mathit{Th}(Q, \mathcal{D}, \mathcal{L})$ for boolean queries $Q$ that are constructed from monotonic and anti-monotonic pattern predicates.

De Raedt et al. [6] presented a technique to rewrite an arbitrary boolean query $Q$ into an equivalent query of the form $Q_1 \vee \ldots \vee Q_k$ such that $k$ is minimal and each of the subqueries $Q_i$ is the conjunction $Q_{a,i} \wedge Q_{m,i}$ of an anti-monotonic and a monotonic query. This type of subquery is called conjunctive. De Raedt et al. argued that this is useful because 1) there exist effective algorithms for computing the solution space to such queries $Q_i$ (cf. [5, 4, 6, 17] and below), and 2) minimizing $k$ in this context would also minimize the number of calls to such algorithms and thus corresponds to a kind of inductive query optimization. In the present paper, we focus on the subproblem of optimizing conjunctive inductive queries, which is an essential step in this process. Observe that in a conjunctive query $Q_a \wedge Q_m$, $Q_a$ and $Q_m$ need not be atomic expressions. Indeed, it is well known that both the disjunction and conjunction of two monotonic (resp. anti-monotonic) predicates are monotonic (resp. anti-monotonic). Furthermore, the negation of a monotonic predicate is anti-monotonic and vice versa.
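The closure properties can be checked directly from the definition of monotonicity. The short derivation below suppresses the data set and parameter arguments and reads the dual definition as: $a$ is anti-monotonic if $a(\phi) \rightarrow a(\psi)$ whenever $\phi \preceq \psi$. Let $m, m_1, m_2$ be monotonic and $\phi \preceq \psi$:

$$
\begin{aligned}
m(\psi) \rightarrow m(\phi)
  \;&\Longleftrightarrow\; \neg m(\phi) \rightarrow \neg m(\psi)
  && \text{(contrapositive, so } \neg m \text{ is anti-monotonic)} \\
m_1(\psi) \wedge m_2(\psi)
  \;&\Longrightarrow\; m_1(\phi) \wedge m_2(\phi)
  && \text{(so } m_1 \wedge m_2 \text{ is monotonic)} \\
m_1(\psi) \vee m_2(\psi)
  \;&\Longrightarrow\; m_1(\phi) \vee m_2(\phi)
  && \text{(so } m_1 \vee m_2 \text{ is monotonic)}
\end{aligned}
$$

The anti-monotonic cases follow in the same way with the roles of $\phi$ and $\psi$ exchanged.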

We will assume that there are cost functions $c_a$ and $c_m$ associated with the anti-monotonic and monotonic subqueries $Q_a$ and $Q_m$. The idea is that the cost functions reflect the (expected) costs of evaluating the query on a pattern. E.g., $c_a(\phi)$ denotes the expected cost needed to evaluate the anti-monotonic query $Q_a$ on the pattern $\phi$. The present paper will neither propose realistic cost functions nor address the nature of these cost functions. Even though it is clear that some predicates are more expensive than others, more work seems needed in order to obtain cost estimates that are as reliable as in traditional databases. The present use of cost functions is only a first step in this direction. It is also worth mentioning that several of the traditional pattern mining algorithms, such as Agrawal et al.'s Apriori [2] and the levelwise algorithm [18], try to minimize the number of passes through the data. Even though this could also be cast within the present framework, the cost functions introduced above better fit the situation where the data can be stored in main memory. One direct application would concern molecular feature mining [17, 16], where one aims at discovering fragments (i.e. subgraphs) within molecules (represented as graph structures). In this case, a natural cost function counts the number of covers tests (i.e. matching a pattern with a molecule or graph), because each covers test corresponds to a subgraph isomorphism problem, a known NP-complete problem.

We are now able to formulate the conjunctive inductive query optimization problem that is addressed in this paper:

Given

a language $\mathcal{L}$ of patterns,

a conjunctive query $Q = Q_a \wedge Q_m$,

two cost functions $c_a$ and $c_m$ from $\mathcal{L}$ to $\mathbb{R}$,

Find the set of patterns $\mathit{Th}(Q, \mathcal{D}, \mathcal{L})$, i.e. the solution set of the query $Q$ in the language $\mathcal{L}$ with respect to the database $\mathcal{D}$, in such a way that the total cost needed to evaluate patterns is as small as possible.
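To make the cost accounting concrete, the following Python sketch computes the solution set with a naive baseline and tallies the cost it incurs. This is not the optimization method of the paper: it assumes a finite candidate set, a hypothetical toy data set, and unit cost functions, and all names are illustrative. It tests the monotonic subquery first and skips the anti-monotonic one whenever the first test fails.

```python
def substrings(s):
    """All substrings of s, including the empty string."""
    return {s[i:j] for i in range(len(s) + 1) for j in range(i, len(s) + 1)}


def frequency(pattern, data):
    """Number of examples in data that contain pattern as a substring."""
    return sum(1 for example in data if pattern in example)


def evaluate_conjunctive(candidates, q_a, q_m, c_a, c_m):
    """Naive baseline: test q_m on every candidate, then q_a on the survivors.

    Returns the solution set and the total evaluation cost that was charged.
    """
    solutions, total_cost = set(), 0.0
    for phi in candidates:
        total_cost += c_m(phi)      # pay for evaluating the monotonic subquery
        if not q_m(phi):
            continue                # q_a is never evaluated, so it is not charged
        total_cost += c_a(phi)      # pay for evaluating the anti-monotonic subquery
        if q_a(phi):
            solutions.add(phi)
    return solutions, total_cost


# Hypothetical toy data set and unit costs (assumptions, not from the paper).
data = ["ab", "abc", "bc", "c"]
candidates = set().union(*(substrings(example) for example in data))


def q_m(phi):
    return frequency(phi, data) >= 2   # minfreq with n = 2


def q_a(phi):
    return frequency(phi, data) <= 2   # maxfreq with n = 2


def unit_cost(phi):
    return 1.0


solutions, cost = evaluate_conjunctive(candidates, q_a, q_m, unit_cost, unit_cost)
print(sorted(solutions), cost)
```

Here 7 candidates are charged for the monotonic test and the 6 survivors for the anti-monotonic test. An optimizer in the sense of the problem statement would try to reach the same solution set with fewer or cheaper evaluations, for instance by exploiting monotonicity to infer predicate values instead of computing them.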

One useful property of conjunctive inductive queries is that their solution space $\mathit{Th}(Q, \mathcal{D}, \mathcal{L})$ is a version space (sometimes also called a convex space).

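Definition 3. Let $\mathcal{L}$ be a pattern language, and $I \subseteq \mathcal{L}$. $I$ is a version space, if

$$\forall \phi, \phi', \psi \in \mathcal{L}: \; \phi \preceq \psi \preceq \phi' \text{ and } \phi, \phi' \in I \implies \psi \in I.$$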
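For finite sets of string patterns, the convexity condition of Definition 3 can be tested directly. The sketch below assumes substring patterns and uses illustrative names. It relies on the fact that any pattern lying below a member of the set must be a substring of that member, so only those substrings need to be inspected.

```python
def substrings(s):
    """All substrings of s, including the empty string."""
    return {s[i:j] for i in range(len(s) + 1) for j in range(i, len(s) + 1)}


def is_version_space(patterns):
    """Check convexity: every psi between two members must itself be a member."""
    patterns = set(patterns)
    for upper in patterns:                       # candidate for the more specific member
        for psi in substrings(upper):            # every psi that is more general than upper
            if psi in patterns:
                continue
            # psi is missing; convexity fails if some member is more general than psi
            if any(lower in psi for lower in patterns):
                return False
    return True


print(is_version_space({"a", "ab", "abc"}))   # True
print(is_version_space({"a", "abc"}))         # False: ab lies between a and abc
```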

Version spaces are particularly useful when they can be represented by boundary sets, i.e. by the sets $G(Q, \mathcal{D}, \mathcal{L})$ of their maximally general elements, and $S(Q, \mathcal{D}, \mathcal{L})$ of their minimally general (i.e. maximally specific) elements. Finite version spaces are always boundary set representable, cf. [14], and this is what we will assume from now on.

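Example 4. Continuing from Ex. 2, let $Q = Q_m \wedge Q_a$. We have $\mathit{Th}(Q, \mathcal{D}, \mathcal{L}_\Sigma) = \{a, c, ac, ca, cc, acc, cca, acca\}$. This set of solutions is completely characterized by $S(Q, \mathcal{D}, \mathcal{L}_\Sigma) = \{acca\}$ and $G(Q, \mathcal{D}, \mathcal{L}_\Sigma) = \{a, c\}$.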
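The boundary sets can be read off the solution set mechanically. The Python sketch below takes the solution set exactly as stated in Example 4 as its input and assumes substring patterns; the function names are illustrative. It also shows how membership in the version space is decided from the two boundaries alone.

```python
def boundary_sets(solutions):
    """Return (G, S): the maximally general and maximally specific members."""
    solutions = set(solutions)
    # G: members with no other member that is strictly more general (a proper substring)
    g = {p for p in solutions if not any(q != p and q in p for q in solutions)}
    # S: members with no other member that is strictly more specific (a proper superstring)
    s = {p for p in solutions if not any(q != p and p in q for q in solutions)}
    return g, s


def in_version_space(phi, g, s):
    """phi belongs to the version space iff it lies between some g and some s."""
    return any(lo in phi for lo in g) and any(phi in hi for hi in s)


# Solution set as stated in Example 4.
th_q = {"a", "c", "ac", "ca", "cc", "acc", "cca", "acca"}
g, s = boundary_sets(th_q)

print(sorted(g), sorted(s))            # ['a', 'c'] ['acca']
print(in_version_space("cc", g, s))    # True
print(in_version_space("g", g, s))     # False
```

Storing only the two boundary sets is enough here: three patterns replace the eight members of the solution set.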
