
5 Generalized Version Space Trees

From the PDF document www-ai.ijs.si (pages 97-100)

We have extended our previous data structure “Version Space Trees” [3] for mining VS1 of strings to handle the general case of VSZ. The extended data structure is called Generalized Version Space Trees (GVS Tree).

The GVS Tree maintains a set of strings, which are the patterns we are discovering from the database. Each such string is represented by a node in the tree. For any given node n, we denote the string pattern it represents by s(n). The organization of the tree is based on suffix tries. A suffix trie is a trie with the following properties:

For each node n in the trie, and for each suffix t of s(n), there is also a node n′ in the trie representing t, i.e. t = s(n′).

Each node n has a suffix link, suffix(n) = n′, such that s(n′) is obtained from s(n) by dropping the first character. The root node is special because it represents the empty string, which has no suffixes. We define suffix(root) = Ω, where Ω denotes a unique fake node.

2 For brevity, we write VS(Q) for Th(Q, D, L).

Unlike the common approach in the literature on suffix trees [16, 17], we use suffix tries in two ways that differ from the mainstream. First, instead of building a suffix tree on all the suffixes of a single string, we index all the suffixes of a set of string patterns for a string database D. This means multiple strings are stored in the tree. Moreover, in parts of our algorithms, we even keep a count of occurrences of each such substring in the corresponding node.
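The construction described above can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: the names `Node`, `build_suffix_trie`, `count` and `suffix` are hypothetical, `None` stands in for the fake node Ω, and the count is taken to be the number of positions at which a substring occurs across all inserted strings.

```python
class Node:
    """One trie node; `string` plays the role of s(n) in the text."""

    def __init__(self, string):
        self.string = string
        self.children = {}   # symbol -> child Node
        self.suffix = None   # suffix link; None stands in for the fake node Ω
        self.count = 0       # occurrences of s(n) over all inserted strings


def build_suffix_trie(strings):
    """Index every suffix of every string in `strings` (hypothetical helper)."""
    root = Node("")
    nodes = {"": root}       # s(n) -> n, used to resolve suffix links
    for s in strings:
        for start in range(len(s)):
            node = root
            # Walking down the suffix s[start:] visits every substring
            # that begins at position `start` exactly once.
            for ch in s[start:]:
                child = node.children.get(ch)
                if child is None:
                    child = Node(node.string + ch)
                    node.children[ch] = child
                    nodes[child.string] = child
                child.count += 1
                node = child
    # Every substring's suffix is itself a substring, so the target exists.
    for string, node in nodes.items():
        if string:
            node.suffix = nodes[string[1:]]
    return root, nodes


root, nodes = build_suffix_trie(["abcd", "acd"])
print(nodes["abc"].suffix.string)   # the node reached by dropping the first character
print(nodes["cd"].count)            # "cd" occurs once in each input string
```

The naive double loop makes the quadratic cost visible: each string of length m contributes m suffixes of length at most m.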

A labelled trie Tf is a suffix trie where each node is labelled with either a “⊕” or a “⊖”. We will use the ⊕ label to indicate nodes representing elements in Th(Q, D, Σ*) (∈ VSZ) and ⊖ for those that are not. In our previous publication [3], the VS Tree had the restriction that there can be at most one sign change on any path from the root to a leaf. This is because the VS Tree was designed to model sets in VS1 only. As a generalization in this current work, we have removed this restriction, allowing complete freedom in the assignment of the labels “⊕” or “⊖” to any node. As a result, a GVS Tree can represent sets of string patterns in VSZ. The usual set operations can be performed by manipulating the labels on the nodes of GVS Trees, which we will come to in Sect. 5.1.

A GVS Tree T is a labelled trie that represents a generalized version space V (∈ VSZ). In other words, V = {ϕ | ∃ node n ∈ T : ϕ = s(n) ∧ n is labelled ⊕}. We have observed that if dim(V) = n, then there exists a path (exploiting both child and suffix links, and ignoring the link directions) in T with alternating signs so that the number of sign changes from ⊖ to ⊕ is n (Lemma 8).

Fig. 1 shows a labelled GVS Tree for the set Th(Q, D, Σ*) where Σ = {a, b, c, d} and Q(ϕ, D) = (bc ⪯ ϕ ⪯ abcd) ∨ (a ⪯ ϕ ⪯ acd). The dashed, curved arrows show the suffix links. The suffix links of the nodes immediately below the root node all point back to the root node, and are omitted for clarity. The ⊕ nodes represent the seven members of the GVS, namely a, abc, abcd, ac, acd, bc and bcd. Note that Q above is already in a minimal disjunctive normal form.

So the GVS has a dimension of 2, and the path through the nodes representing the empty string, a, ab and abc has exactly 2 sign changes from ⊖ to ⊕.
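The example can be checked by brute force. The sketch below assumes that the relation in Q is "is a substring of", so that Q reads: bc is a substring of ϕ and ϕ is a substring of abcd, or a is a substring of ϕ and ϕ is a substring of acd. The helper names `substrings`, `q` and `sign_changes_up` are hypothetical.

```python
def substrings(s):
    """All non-empty substrings of s."""
    return {s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1)}


def q(phi):
    """Q from the example, assuming the relation is 'is a substring of'."""
    return ("bc" in phi and phi in "abcd") or ("a" in phi and phi in "acd")


# Candidate patterns: every string that has a node in the trie, plus the root.
candidates = substrings("abcd") | substrings("acd") | {""}
members = sorted(p for p in candidates if q(p))
print(members)


def sign_changes_up(path, positive):
    """Count transitions from a ⊖ node to a ⊕ node along `path`."""
    labels = [p in positive for p in path]
    return sum(1 for a, b in zip(labels, labels[1:]) if not a and b)


# The path named in the text: root (empty string), a, ab, abc.
print(sign_changes_up(["", "a", "ab", "abc"], set(members)))
```

Under this reading the labels along the path are ⊖, ⊕, ⊖, ⊕, which gives the two changes from ⊖ to ⊕.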

Fig. 1. An example of GVS Tree

An important property of the GVS Tree is that checking for membership is very efficient. Given any string ϕ ∈ Σ*, we just need to follow the symbols of ϕ and descend through the tree accordingly. If the descent fails because a child link is missing, ϕ is not in the GVS. Otherwise we end up at a node n such that s(n) = ϕ, and this node has an ⊕ mark if and only if ϕ is in the GVS. The time complexity is O(|ϕ|). The space complexity of a GVS Tree and the time required to build it are the same as those of suffix tries, namely quadratic in the size of the input strings.
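A minimal sketch of the membership test follows. It is an illustration only: `GVSNode`, `insert` and `contains` are hypothetical names, suffix links are left out because the lookup uses child links alone, and the tree is built by inserting just the seven ⊕ patterns of the example, so it is not suffix-closed.

```python
class GVSNode:
    def __init__(self, positive=False):
        self.positive = positive   # True for a ⊕ label, False for ⊖
        self.children = {}         # symbol -> GVSNode


def insert(root, pattern, positive):
    """Create the path for `pattern` and label its final node."""
    node = root
    for ch in pattern:
        # Nodes created on the way default to ⊖.
        node = node.children.setdefault(ch, GVSNode())
    node.positive = positive


def contains(root, phi):
    """Descend along phi; O(|phi|) dictionary lookups."""
    node = root
    for ch in phi:
        node = node.children.get(ch)
        if node is None:           # no such node, so phi is not represented
            return False
    return node.positive


root = GVSNode()
for pattern in ["a", "abc", "abcd", "ac", "acd", "bc", "bcd"]:
    insert(root, pattern, True)

print(contains(root, "abc"))   # node exists and is labelled ⊕
print(contains(root, "ab"))    # node exists but is labelled ⊖
print(contains(root, "abd"))   # descent fails at the last symbol
```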

5.1 Algorithm TreeMerge

The TreeMerge algorithm is basically an algorithm for merging two ordinary trees. However, we have to combine the flags (⊖ or ⊕) from both trees, too.

The combining algorithm is presented in the pseudocode in Algorithm 1 with an abstract Boolean operation ∘. When the operation is “and”, the flags are combined using conjunction during the merging (⊖ is interpreted as “false” while ⊕ is treated as “true”), and hence the TreeMerge algorithm will compute the intersection of the represented GVSes. When ∘ = ∨, we get the union operation. When x ∘ y ≡ ¬(x → y), we get the set difference operation.

In the algorithm, the function root_or_negative(T) returns the root node of a tree T if T is non-empty, or a node with label ⊖ and no children if T is empty.

Function child(n, σ) returns the child node of n on the child link labelled σ, where σ ∈ Σ. If there is no such child, NULL is returned. The function tree_with_root(n) returns a GVS Tree whose root node is n, or T_empty if n is NULL. T_empty denotes an empty GVS Tree.

Algorithm 1 TreeMerge
Require:
/* Input: T1, T2: two GVS Trees; ∘: a binary Boolean operator */
/* Output: T: the result of merging T1 and T2, so that the flag on each node in T is the result of applying ∘ to the flags of the corresponding nodes in T1 and T2. */
if T1 = T_empty ∧ T2 = T_empty then
    return T_empty
r1 ← root_or_negative(T1); r2 ← root_or_negative(T2)
Create new tree T with root node r.
label(r) ← label(r1) ∘ label(r2)
for all σ ∈ Σ do
    c1 ← child(r1, σ); c2 ← child(r2, σ)
    if c1 ≠ NULL ∨ c2 ≠ NULL then
        Tc1 ← tree_with_root(c1); Tc2 ← tree_with_root(c2)
        c ← root_or_negative(TreeMerge(Tc1, Tc2, ∘)) /* Recursively */
        add c to r as a child node along link σ
return T

It should be emphasized that Algorithm 1 is just pseudocode. In practice, much optimization can be introduced by specializing the code for each particular Boolean operation. For instance, when the operation is ∧ and a certain branch in T1 is empty, there is no need to look at the corresponding branch in T2.
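The unspecialized merge can be sketched in Python as follows. This is an illustrative reading of Algorithm 1, not the authors' code: an empty tree is represented by `None`, a missing child by an absent dictionary key, and the names `Node`, `tree_merge`, `from_patterns` and `members` are hypothetical. Suffix links are not maintained here.

```python
class Node:
    def __init__(self, positive=False):
        self.positive = positive   # True for ⊕, False for ⊖
        self.children = {}         # symbol -> Node


def tree_merge(t1, t2, op):
    """Merge two trees (root Node or None), combining flags with `op`."""
    if t1 is None and t2 is None:
        return None
    # Stand-ins for root_or_negative: an empty side acts as a ⊖ leaf.
    r1 = t1 if t1 is not None else Node(False)
    r2 = t2 if t2 is not None else Node(False)
    r = Node(op(r1.positive, r2.positive))
    # Only symbols with a child on at least one side need a recursive call.
    for sigma in sorted(set(r1.children) | set(r2.children)):
        r.children[sigma] = tree_merge(
            r1.children.get(sigma), r2.children.get(sigma), op)
    return r


def from_patterns(patterns):
    """Build a labelled trie whose ⊕ nodes are exactly `patterns`."""
    root = Node()
    for p in patterns:
        node = root
        for ch in p:
            node = node.children.setdefault(ch, Node())
        node.positive = True
    return root


def members(root, prefix=""):
    """Collect the strings of all ⊕ nodes."""
    if root is None:
        return set()
    found = {prefix} if root.positive else set()
    for ch, child in root.children.items():
        found |= members(child, prefix + ch)
    return found


t1 = from_patterns(["bc", "abc", "bcd", "abcd"])
t2 = from_patterns(["a", "ac", "acd", "abc"])

union = members(tree_merge(t1, t2, lambda x, y: x or y))
intersection = members(tree_merge(t1, t2, lambda x, y: x and y))
difference = members(tree_merge(t1, t2, lambda x, y: x and not y))
print(sorted(union), sorted(intersection), sorted(difference))
```

Passing the operator as a function keeps the sketch close to the abstract formulation; a specialized intersection could instead return early whenever one side is `None`, which is the optimization mentioned above.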
