
5 Related Work and Conclusions


Algorithms that try to minimize the total number of predicate evaluations have been around for several years, most notably Gunopulos et al.'s Dualize & Advance algorithm [10, 9], which computes S(minfreq(φ,·,D),L) in the domain of itemsets. It works roughly as follows: first, a set MS of maximal specific sentences is computed by a randomized depth-first search. Then the negative border of MS is constructed by calculating a minimum hypergraph transversal of the complements of all itemsets in MS. This process is repeated with the elements from the hypergraph transversal until no more maximal specific sentences can be found. The result is then the set of all maximal interesting itemsets.
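To make the loop concrete, the following is a minimal Python sketch of the Dualize & Advance scheme as described above. It is an illustration only, not Gunopulos et al.'s implementation: the helper names (random_maximal_extension, minimal_transversals) and the naive transversal enumeration are assumptions made for this sketch.

```python
import random
from itertools import combinations

def dualize_and_advance(items, is_frequent):
    """Sketch of the Dualize & Advance loop [10, 9]: alternate between
    extending frequent seeds to maximal itemsets and using the minimal
    transversals of their complements as new seeds."""
    maximal = set()              # MS: maximal frequent itemsets found so far
    seeds = [frozenset()]        # start from the empty itemset
    while seeds:
        found_new = False
        for seed in seeds:
            if is_frequent(seed):
                m = random_maximal_extension(seed, items, is_frequent)
                if m not in maximal:
                    maximal.add(m)
                    found_new = True
        if not found_new:
            break                # all seeds infrequent: MS is complete
        # New seeds: minimal transversals of the complements of MS.
        complements = [frozenset(items) - m for m in maximal]
        seeds = minimal_transversals(complements)
    return maximal

def random_maximal_extension(seed, items, is_frequent):
    """Randomized depth-first specialization: greedily add items as long
    as the itemset stays frequent (anti-monotonicity makes one pass enough)."""
    current = set(seed)
    for item in random.sample(sorted(items), k=len(items)):
        if item not in current and is_frequent(frozenset(current | {item})):
            current.add(item)
    return frozenset(current)

def minimal_transversals(hyperedges):
    """Naive enumeration of all minimal transversals (exponential, for
    illustration only): a transversal intersects every hyperedge."""
    universe = sorted(set().union(*hyperedges)) if hyperedges else []
    transversals = []
    for k in range(1, len(universe) + 1):
        for combo in combinations(universe, k):
            s = frozenset(combo)
            if all(s & e for e in hyperedges) and not any(t <= s for t in transversals):
                transversals.append(s)
    return transversals

# Toy usage: with minfreq = 2, the maximal frequent itemsets are {a,b} and {b,c}.
transactions = [{"a", "b", "c"}, {"a", "b"}, {"b", "c"}, {"d"}]
freq = lambda s: sum(s <= frozenset(t) for t in transactions) >= 2
print(dualize_and_advance({"a", "b", "c", "d"}, freq))
```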

Although Gunopulos et al. work with itemsets and only consider monotonic predicates, there is a clear relation to our approach. Whereas their method needs to compute the minimum hypergraph transversals to find the candidates for new maximal interesting sentences, these candidates can be read directly off the partial version space tree: the nodes whose monotonic part is still undetermined are the only possible patterns of S(minfreq(φ,·,D),L) that have not been found so far, and these are exactly the nodes that are still waiting in the priority queue.
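The contrast can be illustrated with a small, purely hypothetical sketch: the Node layout and the queue snapshot below are assumptions for illustration and do not reproduce the paper's data structures. The point is only that the remaining candidates for S(minfreq(φ,·,D),L) are read off the queued nodes with an undetermined monotonic part, with no transversal computation involved.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Node:
    pattern: str
    monotonic: Optional[bool] = None   # None = monotonic part still undetermined

def remaining_maximal_candidates(priority_queue: List[Node]) -> List[str]:
    # The only possible, not-yet-found members of S(minfreq(phi, ., D), L)
    # are the queued nodes whose monotonic part is still undetermined.
    return [n.pattern for n in priority_queue if n.monotonic is None]

# Hypothetical snapshot of the priority queue of a partial version space tree:
queue = [Node("ab"), Node("abc", monotonic=True), Node("abd")]
print(remaining_maximal_candidates(queue))   # ['ab', 'abd']
```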

So by performing a depth-first expansion until all children are negative, our algorithm's behaviour is close to that of Dualize & Advance. It should be noted that the two strategies are not entirely equivalent: if a node φ has only negative children, it is not necessarily a member of S, because there could still be more specific patterns that have φ as their suffix and satisfy Qm.

One of the fastest algorithms for mining maximal frequent sets is Bayardo's Max-Miner [3]. It uses a special set-enumeration technique to find large frequent itemsets before considering any of their subsets. Although this is, at first sight, completely different from our approach, CounterSum also has a tendency to test long strings first because they will have higher zm-values. By assigning higher values of Pm to long patterns, this behaviour can even be enforced.
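The following toy sketch illustrates how such a long-pattern-first order can be enforced through the priority assignment. The weighting function monotonic_weight is a hypothetical choice standing in for Pm; the snippet is neither CounterSum nor Max-Miner itself.

```python
import heapq

def monotonic_weight(pattern, base=1.0, length_bonus=0.5):
    # Hypothetical choice of P_m that grows with pattern length, so that
    # long patterns accumulate higher z_m values and are expanded first,
    # mimicking Max-Miner's preference for long itemsets.
    return base + length_bonus * len(pattern)

def push(queue, pattern):
    z_m = monotonic_weight(pattern)
    # heapq is a min-heap, so negate the priority to pop the pattern with
    # the highest z_m (here: the longest pattern) first.
    heapq.heappush(queue, (-z_m, pattern))

queue = []
for p in ["a", "ab", "abc"]:
    push(queue, p)

print([heapq.heappop(queue)[1] for _ in range(len(queue))])  # ['abc', 'ab', 'a']
```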

As mentioned in the introduction, this work is related, at least in spirit, to several proposals for minimizing the number of membership queries when learning concepts, cf. [20].

Finally, let us mention that the present work is a significant extension of the work of [6], in that we have adapted and extended their version space tree and also shown how it can be used for optimizing the evaluation of conjunctive queries. The presented technique has also been shown to be more cost-effective than the Ascend and Descend algorithms proposed in [6].

Nevertheless, there are several remaining questions for further research. They include an investigation of cost functions for inductive queries, methods for estimating or approximating the probability that a query will succeed on a given pattern, determining the complexity class of the query optimization problem, addressing query optimization for general boolean queries, and considering other types of pattern domains. The authors hope that the presented formalization and framework will turn out to be useful in answering these questions.

Acknowledgements

This work was partly supported by the European IST FET project cInQ. The authors are grateful to Manfred Jaeger, Sau Dan Lee and Heikki Mannila for contributing the theoretical framework on which the present work is based, to Amanda Clare and Ross King for the yeast database and to Sau Dan Lee and Kristian Kersting for feedback on this work.

References

1. R. Agrawal, R. Srikant. Fast algorithms for mining association rules. In Proc. VLDB, 1994.

2. R. Agrawal, T. Imielinski, A. Swami. Mining association rules between sets of items in large databases. In Proc. SIGMOD, pp. 207-216, 1993.

3. R. Bayardo. Efficiently mining long patterns from databases. In Proc. SIGMOD, 1998.

4. C. Bucila, J. Gehrke, D. Kifer, W. White. DualMiner: A dual pruning algorithm for itemsets with constraints. In Proc. SIGKDD, 2002.

5. L. De Raedt, S. Kramer. The levelwise version space algorithm and its application to molecular fragment finding. In Proc. IJCAI, 2001.

6. L. De Raedt, M. Jäger, S. D. Lee, H. Mannila. A theory of inductive query answering. In Proc. ICDM, pp. 123-130, Maebashi, Japan, 2002.

7. L. De Raedt. A perspective on inductive databases. SIGKDD Explorations, Vol. 4(2), 2002.

8. B. Goethals, J. Van den Bussche. On supporting interactive association rule mining. In Proc. DAWAK, LNCS Vol. 1874, Springer Verlag, 2000.

9. D. Gunopulos, H. Mannila, S. Saluja. Discovering all most specific sentences by randomized algorithms. In Proc. ICDT, LNCS Vol. 1186, Springer Verlag, 1997.

10. D. Gunopulos, R. Khardon, H. Mannila, H. Toivonen. Data mining, hypergraph transversals, and machine learning. In Proc. PODS, Tucson, Arizona, 1997.

11. J. Han, Y. Fu, K. Koperski, W. Wang, O. Zaiane. DMQL: A data mining query language for relational databases. In Proc. SIGMOD'96 Workshop on Research Issues on Data Mining and Knowledge Discovery, Montreal, Canada, 1996.

12. J. Han, L. V. S. Lakshmanan, R. T. Ng. Constraint-based, multidimensional data mining. Computer, Vol. 32(8), pp. 46-50, 1999.

13. J. Han, J. Pei, Y. Yin. Mining frequent patterns without candidate generation. In Proc. SIGMOD, 2000.

14. H. Hirsh. Generalizing version spaces. Machine Learning, Vol. 17(1), pp. 5-46, 1994.

15. H. Hirsh. Theoretical underpinnings of version spaces. In Proc. IJCAI, 1991.

16. A. Inokuchi, T. Washio, H. Motoda. Complete mining of frequent patterns from graphs: Mining graph data. Machine Learning, Vol. 50(3), pp. 321-354, 2003.

17. S. Kramer, L. De Raedt, C. Helma. Molecular feature mining in HIV data. In Proc. SIGKDD, 2001.

18. H. Mannila, H. Toivonen. Levelwise search and borders of theories in knowledge discovery. Data Mining and Knowledge Discovery, Vol. 1, 1997.

19. T. Mitchell. Generalization as search. Artificial Intelligence, Vol. 18(2), pp. 203-226, 1980.

20. T. Mitchell. Machine Learning. McGraw Hill, 1997.

21. R. T. Ng, L. V. S. Lakshmanan, J. Han, A. Pang. Exploratory mining and pruning optimizations of constrained association rules. In Proc. SIGMOD, 1998.

22. NIC Genetic Sequence Database. Available at www.ncbi.nlm.nih.gov/Genbank/GenbankOverview.html

23. E. Ukkonen. On-line construction of suffix trees. Algorithmica, Vol. 14(3), pp. 249-260, 1995.

24. P. Weiner. Linear pattern matching algorithms. In Proc. 14th IEEE Symposium on Switching and Automata Theory, pp. 1-11, 1973.
