
Finding Interesting Relationships among Documents in DDB

K. REDDY MADHAVI
Assistant Professor, Dept. of CSE, SSITS, Rayachoty, Kadapa, PIN: 516269

DR. A. VINAYA BABU
Professor, Dept. of CSE, and Director, Admissions, JNTU College of Engineering, Hyderabad

DR. T. V. RAJINI KANTH
Professor and HOD, Dept. of IT, GRIET, Hyderabad

M. V. RATHNAMMA
Assistant Professor, Dept. of CSE, SSITS, Rayachoty, Kadapa, PIN: 516269

Abstract:

Content-based document management tasks have gained a prominent status in the information systems field due to the increased availability of documents. Retrieving Relevant and Related Documents from Document DataBases is a challenging task. Interestingness measures such as support and confidence are used to discover the relationships among documents and to identify which documents are frequently retrieved, but these measures alone can be misleading and can return unrelated documents in some real applications. Hence cosine, a correlation measure, is used in conjunction with support and confidence to obtain frequent document sets from Document DataBases, which generates correlation rules. In this paper we present the cosine measure applied to Document DataBases. With this correlation measure and the pattern interestingness measures, a more concise set of related, useful and relevant documents is retrieved, which also helps to provide a ranking for the documents.

Keywords: Correlation measure; Cosine; Document DataBases; Interesting Relationships.

1. Introduction

Today, information retrieval plays a large part in our everyday lives. Information retrieval (IR) is the area of study concerned with searching for documents, for information within documents, and for metadata about documents, as well as with searching relational databases and the World Wide Web. There is overlap in the usage of the terms data retrieval, document retrieval, information retrieval, and text retrieval, but each also has its own body of literature, theory, praxis, and technologies. IR is interdisciplinary, drawing on computer science, mathematics, library science, information science, information architecture, cognitive psychology, linguistics, and statistics. Many universities and public libraries use IR systems to provide access to books, journals and other documents. Web search engines are the most visible IR applications.

The idea of using computers to search for relevant pieces of information was popularized in the article "As We May Think" by Vannevar Bush in 1945. The first automated information retrieval systems were introduced in the 1950s and 1960s. By 1970 several different techniques had been shown to perform well on small text corpora such as the Cranfield collection (several thousand documents). Not all of the thousands of documents in such databases are useful: when a user poses a query, a number of documents that match the query are returned. To know whether these matched documents are relevant, and to understand the relationships among them, some measures are needed.


Several frequent-pattern mining algorithms use such measures, for example the Apriori algorithm (Agrawal et al., 1993) and FP-growth, which uses the Apriori property (Agrawal and Srikant, 1994). The Apriori property states that if a pattern with k items is not frequent, none of its super patterns with (k + 1) or more items can ever be frequent (Agrawal and Srikant, 1994). However, the Apriori algorithm requires a number of database scans. This problem is overcome in FP-growth, but FP-growth requires a considerable amount of time to obtain frequent docsets. Moreover, the frequent documents generated through this process are sometimes unrelated and even misleading (as illustrated in Section 2). Hence, it is a challenging task to retrieve Relevant and Related Documents (RRD) from Document DataBases (DDB) and to provide a ranking for documents.

The rest of the paper is organized as follows. Section 2 provides a motivating example. Section 3 presents the cosine measure and its working on a database. Section 4 presents experimental results obtained with the cosine measure in conjunction with support and confidence. Section 5 concludes and outlines future work.

2. Interesting Relationships

To discover interesting relationships between documents in DDBs, to know which documents are frequently used by users or readers, and to know which documents are closely related, the basic pattern interestingness measures support and confidence are used. They are defined as:

Support = P(X ∩ Y)    (1)

where X and Y are documents in a database.

Confidence = P(Y|X), the conditional probability of Y given X    (2)

(Han and Kamber, 2006).

Rules with high confidence and strong (reasonably large or high) support are referred to as strong rules (Agrawal et al., 1993; Han and Kamber, 2006; Park et al., 1995; Tan et al., 2006). According to Han and Kamber, the relation between the documents X and Y is said to be strong if it satisfies minimum support and minimum confidence (threshold values set by an analyst). However, rules satisfying min sup and min conf are not always strong in real applications. An example of this is considered below.

Suppose we want to know the relationship between documents in a database; the following procedure is followed.

Let X and Y be documents, say data mining and information retrieval respectively. Assume that the total number of documents in the DB is 30000, that 12000 documents are retrieved for X and 27000 for Y, and that 7500 documents contain both. Let minimum support be 20% and minimum confidence be 60%. To know the relation between X and Y, its support and confidence need to be computed:

XY [sup=25%, conf=62.5%] (3)

Sup= P(X Y)=7500/30000=25% (from eq.(1)

Conf== P(X|Y) =P(X Y)/P(X) =7500/12000=62.5% from eq.(2)

This rule satisfies min support and min confidence, and hence can be considered strong. But the rule is somewhat misleading, since the probability of Y (i.e., the fraction of information retrieval documents retrieved) is 90%, which is much higher than 62.5%. Also, although this is a strong association rule, it does not tell us how correlated the documents are. To address this, the correlation measure named cosine is used to determine the correlation between documents.
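As a minimal illustration of Eqs. (1) and (2), the following sketch computes these values for the example above (the class and variable names are our own, not from the authors' implementation):

// Illustrative sketch: support and confidence for the rule X => Y, using the
// counts from the example above (30000 documents in total, 12000 containing X,
// 27000 containing Y, and 7500 containing both).
public class SupportConfidenceExample {

    // Support of {X, Y}: fraction of documents containing both X and Y (Eq. (1)).
    static double support(int countXY, int totalDocs) {
        return (double) countXY / totalDocs;
    }

    // Confidence of X => Y: P(Y|X) = P(X and Y) / P(X) (Eq. (2)).
    static double confidence(int countXY, int countX) {
        return (double) countXY / countX;
    }

    public static void main(String[] args) {
        int totalDocs = 30000, countX = 12000, countY = 27000, countXY = 7500;
        System.out.printf("support    = %.3f%n", support(countXY, totalDocs)); // 0.250
        System.out.printf("confidence = %.3f%n", confidence(countXY, countX)); // 0.625
        // P(Y) alone is 27000/30000 = 0.900, much higher than the 62.5% confidence,
        // which is the misleading behaviour discussed above.
        System.out.printf("P(Y)       = %.3f%n", (double) countY / totalDocs);
    }
}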

3. Cosine Measure

Here, in addition to support and confidence, correlation is also used to know the relationship between documents:

X ⇒ Y [support, confidence, cosine]    (4)


The document sets generated satisfying this criterion are fewer and more relevant than those generated using support and confidence alone. The cosine measure is defined as

Cosine(X, Y) = P(X ∩ Y) / √(P(X) · P(Y)) = Sup(X ∩ Y) / √(Sup(X) · Sup(Y))    (5)

where P denotes probability, X and Y are documents, and Sup denotes support.

Cosine is defined as the probability of X and Y occurring together divided by the square root of the product of the probability of X and the probability of Y. Because of the square root, the cosine value is influenced only by the supports of X, Y, and (X ∩ Y), and not by the total number of transactions. The cosine measure is therefore null-invariant: its value is free from the influence of null transactions (Han and Kamber, 2006). Large datasets typically contain many null transactions, so the cosine measure gives better results on large datasets. If the resulting value of the above equation lies between 0 and 1, then X and Y are correlated with each other.

If Cosine(X, Y) < 1, then X and Y are positively correlated.
If Cosine(X, Y) = 1, then there is no correlation between X and Y.
If Cosine(X, Y) > 1, then X and Y are negatively correlated.
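A minimal sketch of Eq. (5), applied to the counts of the example in Section 2 (the names are illustrative, not the authors' implementation):

// Illustrative sketch of the cosine measure of Eq. (5):
// cosine(X, Y) = Sup(X and Y) / sqrt(Sup(X) * Sup(Y)).
// Since the total number of documents cancels out, raw counts can be used
// directly instead of probabilities (this is the null-invariance noted above).
public class CosineMeasureExample {

    static double cosine(int countXY, int countX, int countY) {
        return countXY / Math.sqrt((double) countX * countY);
    }

    public static void main(String[] args) {
        // Counts from the example in Section 2.
        int countX = 12000, countY = 27000, countXY = 7500;
        // 7500 / sqrt(12000 * 27000) = 7500 / 18000 = 0.4167
        System.out.printf("cosine(X, Y) = %.4f%n", cosine(countXY, countX, countY));
    }
}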

For more than two documents, say A, B, C, D, …, N, the cosine measure can be extended in the following manner:

Cosine(A, B, …, N) = P(A ∩ B ∩ C ∩ D ∩ … ∩ N) / √(P(A) · P(B) · P(C) · P(D) ⋯ P(N))    (6)

To work with this measure, let us consider the following database, with min sup = 3.

Table 1: Sample DDB

DocID Documents

1 A B C D E F G H I

2 B C D E K J

3 C E F G H

4 B C D E M

5 F C B A H I

6 F C D I

7 B C D H

8 B M L

9 A M

10 C M B

11 A B C F H I

12 A C D F H I

3.1 Frequent 1-Docsets

• The first scan of the dataset reads the documents, keeps a count of their frequencies and keeps track of the rows (DocIDs) in which each document is found. Once the whole dataset has been scanned, any document with a support count less than the minimum support count is deleted from the arraylist. The table above then results in Table 2; a sketch of this scan is given after the table.

Table 2: Frequent 1-Docsets

Document Support count List of DocIDs

C 10 1,2,3,4,5,6,7,10,11,12

B 8 1,2,4,5,7,8,10,11

D 6 1,2,4,6,7,12

F 6 1,3,5,6,11,12

H 6 1,3,5,7,11,12

A 5 1,5,9,11,12

I 5 1,5,6,11,12

E 4 1,2,3,4

M 4 4,8,9,10
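A rough sketch of this first scan, assuming each row of Table 1 is given as a list of document labels (the class and method names here are hypothetical, not the authors' Java implementation):

import java.util.*;

// Sketch of the first database scan (Section 3.1): count each document label,
// remember the DocIDs (row numbers) in which it occurs, and drop labels whose
// support count is below the minimum support.
public class FrequentOneDocsets {

    static Map<String, List<Integer>> scan(List<String[]> rows, int minSup) {
        Map<String, List<Integer>> docIdLists = new LinkedHashMap<>();
        for (int docId = 1; docId <= rows.size(); docId++) {
            for (String label : rows.get(docId - 1)) {
                docIdLists.computeIfAbsent(label, k -> new ArrayList<>()).add(docId);
            }
        }
        // Delete labels that do not satisfy the minimum support count.
        docIdLists.values().removeIf(ids -> ids.size() < minSup);
        return docIdLists;
    }

    public static void main(String[] args) {
        // The twelve rows of Table 1.
        List<String[]> rows = Arrays.asList(
            "A B C D E F G H I".split(" "), "B C D E K J".split(" "),
            "C E F G H".split(" "),         "B C D E M".split(" "),
            "F C B A H I".split(" "),       "F C D I".split(" "),
            "B C D H".split(" "),           "B M L".split(" "),
            "A M".split(" "),               "C M B".split(" "),
            "A B C F H I".split(" "),       "A C D F H I".split(" "));
        // With minSup = 3 this yields the support counts and DocID lists of
        // Table 2 (Table 2 lists them in descending order of support).
        scan(rows, 3).forEach((doc, ids) ->
            System.out.println(doc + " : " + ids.size() + " : " + ids));
    }
}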


3.2 Frequent 2-Docsets

Now, we generate frequent 2-document sets using Table 2.

• The pointer starts at the last document, i.e., at M.
• We check for documents after M, i.e., (M + 1), and there are none.
• The pointer then moves to the M − 1 location, i.e., E.
• Now check for documents at (E + 1); we find M, hence we check for documents in which E and M occur together, and find only document 4. Since this is less than min support, the pair is deleted.
• The pointer moves to the M − 2 location, i.e., I. Checking I and E together gives only document 1, which is below min support; checking I and M gives no documents.
• The pointer moves to the M − 3 location, i.e., A. Checking A and I together gives documents 1, 5, 11 and 12, so its support count is 4; since this satisfies min support, {A, I} can be considered a frequent 2-docset.
• Next, checking A and E together gives only document 1, which is below min support, hence deleted. Checking A and M gives only document 9, which is also below min support.

The process is continued in this way for all locations. The frequent 2-docsets with cosine are shown in Table 3. All are positively correlated because their cosine values are less than one. Docsets whose cosine value is greater than 1 must be deleted.

Table 3: Frequent 2-Docsets with cosine

Document Support count List of DocIDs Cosine

A,I 4 1, 5, 11,12 0.8

H,A 4 1, 5, 11,12 0.73

H,I 4 1, 5, 11,12 0.73

F,H 5 1, 3, 5, 11,12 0.83

F,A 4 1, 5, 11,12 0.73

F,I 5 1, 5, 6, 11,12 0.912

D,F 3 1,6,12 0.5

D,H 3 1,7,12 0.5

D,I 3 1,6,12 0.547

D,E 3 1,2,4 0.612

B,D 4 1,2,4,7 0.577

B,F 3 1,5,11 0.433

B,H 4 1,5,7,11 0.577

B,A 3 1,5,11 0.474

B,I 3 1,5,11 0.474

B,E 3 1,2,4 0.53

B,M 3 4,8,10 0.53

C,B 7 1,2,4,5,7,10,11 0.782

C,D 6 1,2,4,6,7,12 0.774

C,F 6 1,3,5,6,11,12 0.774

C,H 6 1,3,5,7,11,12 0.774

C,A 4 1,5,11,12 0.565

C,I 5 1,5,6,11,12 0.707

C,E 4 1,2,3,4 0.632
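One way to realize the step illustrated above is to intersect the DocID lists of Table 2 rather than rescanning the database. The following sketch (again hypothetical, not the authors' code) reproduces the pairs of Table 3, though in a different order:

import java.util.*;

// Sketch of frequent 2-docset generation (Section 3.2): the DocID lists of the
// frequent 1-docsets in Table 2 are intersected pair by pair; pairs whose
// intersection meets the minimum support are kept, and the cosine of Eq. (5)
// is attached to each of them.
public class FrequentTwoDocsets {

    public static void main(String[] args) {
        int minSup = 3;
        // DocID lists of the frequent 1-docsets from Table 2.
        Map<String, List<Integer>> one = new LinkedHashMap<>();
        one.put("C", Arrays.asList(1, 2, 3, 4, 5, 6, 7, 10, 11, 12));
        one.put("B", Arrays.asList(1, 2, 4, 5, 7, 8, 10, 11));
        one.put("D", Arrays.asList(1, 2, 4, 6, 7, 12));
        one.put("F", Arrays.asList(1, 3, 5, 6, 11, 12));
        one.put("H", Arrays.asList(1, 3, 5, 7, 11, 12));
        one.put("A", Arrays.asList(1, 5, 9, 11, 12));
        one.put("I", Arrays.asList(1, 5, 6, 11, 12));
        one.put("E", Arrays.asList(1, 2, 3, 4));
        one.put("M", Arrays.asList(4, 8, 9, 10));

        List<String> labels = new ArrayList<>(one.keySet());
        for (int i = 0; i < labels.size(); i++) {
            for (int j = i + 1; j < labels.size(); j++) {
                String x = labels.get(i), y = labels.get(j);
                List<Integer> common = new ArrayList<>(one.get(x));
                common.retainAll(one.get(y)); // DocIDs containing both x and y
                if (common.size() >= minSup) {
                    double cosine = common.size()
                            / Math.sqrt((double) one.get(x).size() * one.get(y).size());
                    System.out.printf("%s,%s : %d : %s : %.3f%n",
                            x, y, common.size(), common, cosine);
                }
            }
        }
    }
}

For instance, {A, I} occurs in DocIDs 1, 5, 11 and 12, so its cosine is 4/√(5 × 5) = 0.8, the value shown in the first row of Table 3.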

3.3 Frequent 3-Docsets

Now we generate frequent 3-docsets, again using Table 2.

• The pointer is again at M; checking M+1 and M+2, there are no such fields.
• The pointer moves to the M − 1 location, i.e., E; checking E+1 and E+2, there is only E+1 and no E+2.
• The pointer moves to the M − 2 location, i.e., I; checking I+1 and I+2 gives E and M, but I, E and M occur together in no document, hence this set is not frequent.
• Now the pointer is at A; none of the A-docsets, such as {A, I, E} and {A, I, M}, are frequent.
• Now the pointer is at H; checking H+1 and H+2 gives {H, A, I} with support count 4, while the remaining combinations are not frequent.

The process is repeated in this way for all documents. The result is shown in Table 4.

Table 4: Frequent 3-Docsets with cosine

Document Support count List of DocIDs Cosine

H,A,I 4 1,5,11,12 0.326

F,H,A 4 1,5,11,12 0.298

F,H,I 4 1,5,11,12 0.298

F,A,I 4 1,5,11,12 0.326

D,F,I 3 1,6,12 0.223

B,D,E 3 1,2,4 0.216

B,F,H 3 1,5,11 0.176

B,F,A 3 1,5,11 0.193

B,F,I 3 1,5,11 0.193

B,H,A 3 1,5,11 0.193

B,H,I 3 1,5,11 0.193

B,A,I 3 1,5,11 0.212

C,B,D 4 1,2,4,7 0.182

C,B,F 3 1,5,11 0.137

C,B,H 4 1,5,7,11 0.182

C,B,A 3 1,5,11 0.15

C,B,I 3 1,5,11 0.15

C,B,E 3 1,2,4 0.167

C,D,F 3 1,6,12 0.158

C,D,H 3 1,7,12 0.158

C,D,I 3 1,6,12 0.173

C,D,E 3 1,2,4 0.193

C,F,H 5 1,3,5,11,12 0.263

C,F,A 4 1,5,11,12 0.23

C,F,I 4 1,5,11,12 0.23

C,H,A 4 1,5,11,12 0.23

C,H,I 4 1,5,11,12 0.23

C,A,I 4 1,5,11,12 0.252
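The cosine values in Table 4 follow the extended form of Eq. (6). For example, the 3-docset {H, A, I} occurs in DocIDs 1, 5, 11 and 12, so

Cosine(H, A, I) = Sup(H ∩ A ∩ I) / √(Sup(H) · Sup(A) · Sup(I)) = 4 / √(6 × 5 × 5) ≈ 0.326,

which is the value shown in the first row of Table 4.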

3.4 Frequent 4-Docsets

Table 5: Frequent 4-Docsets with cosine

Document Support count List of DocIDs Cosine

F,H,A,I 4 1,5,11,12 0.133

C,B,D,E 3 1,2,4 0.0684

C,B,F,H 3 1,5,11 0.0559

C,B,F,A 3 1,5,11 0.0612

C,B,F,I 3 1,5,11 0.0612

C,B,H,A 3 1,5,11 0.0612

C,B,H,I 3 1,5,11 0.0612


3.5 Frequent 5-Docsets

Table 6: Frequent 5-Docsets with cosine

Document Support count List of DocIDs Cosine

C,B,F,H,A 3 1,5,11 0.025

C,B,H,A,I 3 1,5,11 0.027

4. Results and Discussions

This algorithm was implemented in Java and tested on a number of datasets. Across these datasets it is observed that RRDs are generated when the cosine measure is used in conjunction with support and confidence, rather than support and confidence alone. Fig. 1 shows the rules generated at a minimum support of 20%: for 1200 rules in the DDB, 1100 RRDs are generated.

Fig. 1: Rules generated at min. sup = 20%

If we increase the min sup, fewer rules are generated, as shown in Fig. 2 below: for 1200 rules in the DDB, 920 RRDs are generated, which is less than the number generated at a minimum support of 20%. The minimum support is fixed by the DBA based on requirements.


5. Conclusion and Future work

Using this cosine measure, the number of rules generated is smaller and more relevant compared to rules generated with support and confidence alone, and the database is scanned only once. This method helps in finding how one document is related to another, and it can also help researchers to know which documents are frequently retrieved, giving an idea of the latest trends. The method can also be applied using precision and recall measures, and may be implemented using the WEKA tool if available.

Acknowledgements

We would like to thank our colleagues in the Department of Computer Science Engineering for their advice and helpful discussions. We are grateful to Dr. C. Nageswara Raju, Associate Professor, SV Group of Institutions, Kadapa, for his support in all aspects. We thank the Dean of Academics, Dr. P. Kannaiah, and Dr. M. Murali Krishna, Principal, SSITS, for their moral support and for providing laboratory facilities. We are grateful to the British Library and the JNTU University Library for their services.

References

[1] Agrawal, R. and Srikant, R. (1994) ‘Fast algorithms for mining association rules in large databases’, Proceedings of the 20th VLDB Conference, pp.487–499.

[2] Agrawal, R. and Srikant, R. (1995) ‘Mining sequential patterns’, Proceedings of the International Conference on Data Engineering (ICDE ‘95), Taipei, Taiwan, pp.3–14.

[3] Agrawal, R., Imielinski, T. and Swami, A. (1993) ‘Mining association rules between sets of items in large databases’, ACM SIGMOD Conference, ACM Press, pp.207–216.

[4] Brin, S., Motwani, R., Ullman, J.D. and Tsur, S. (1997) ‘Dynamic itemset counting and implication rules for market basket data’, Proceedings of the 1997 ACM SIGMOD Conference, ACM Press, pp.225–264.

[5] Dong, G. and Li, J. (1999) ‘Efficient mining of emerging patterns: discovering trends and differences’, Proceedings of International Conference of Knowledge Discovery and Data Mining (KDD ‘99), San Diego, CA, pp.43–52.

[6] Gouda, K. and Zaki, M. (2005) ‘GenMax: an efficient algorithm for mining maximal frequent itemsets’, Data Mining and Knowledge Discovery, Vol. 11, No. 3, pp.223–242.

[7] Grahne, G. and Zhu, J. (2005) ‘Fast algorithms for frequent itemset mining using FP-trees’, IEEE Transactions on Knowledge and Data Engineering, Vol. 17, No. 10, pp.1347–1362.

[8] Han, J. and Kamber, M. (2006) Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, USA.

[9] Han, J., Pei, J. and Yin, Y. (2000) ‘Mining frequent patterns without candidate generation’, SIGMOD 2000, pp.1–12.

[10] Han, J., Pei, J., Yin, Y. and Mao, R. (2004) ‘Mining frequent patterns without candidate generation: a frequent-pattern tree approach’, Data Mining and Knowledge Discovery, Vol. 8, No. 1, pp.53–87.

[11] Pramudiono, I. and Kitsuregawa, M. (2003) ‘Parallel FP-growth on PC cluster’, Proceedings of the Seventh Pacific-Asia Conference on Knowledge Discovery and Data Mining.

[12] Lent, B., Swami, A. and Widom, J. (1997) ‘Clustering association rules’, Proceedings of the International Conference on Data Engineering (ICDE ‘97), Birmingham, England, pp.220–231.

Ms. K. Reddy Madhavi obtained her Bachelor of Technology in Mechanical Engineering from SVU, Tirupati, and her Master of Technology in Computer Science Engineering from JNTU. She is working as an Assistant Professor in CSE and is a member of R&D in her organization.

Dr. A. Vinaya Babu obtained his Bachelor's degree in Electronics & Communication Engineering from Osmania University. He has dual Master's degrees, one in Computer Science & Engineering and the other in ECE, from JNTU. He obtained his PhD from JNTU, Hyderabad. His research areas are Data Mining and Image Processing.
