International Journal of Electronics and Computer Science Engineering (IJECSE), ISSN 2277-1956, Volume 1, Number 4, pp. 2148-2151. Available online at www.ijecse.org

An Unsupervised Approach for Mining Multiple Web Databases

ANITHA RANI INTURI 1, JUJJURI RAMADEVI 2

#1 Student, P.V.P.Siddhardha Institute of Technology, Kanuru, Vijayawada, Krishna (Dt); #2 Asst. Professor, P.V.P.Siddhardha Institute of Technology, Kanuru, Vijayawada, Krishna (Dt)

#1 srianitharanich@gmail.com, #2 k.ramakarthik@gmail.com

Abstract- Trained record matching methods such as SVM, OSVM, PEBL, and Christen's method offer good performance when mining and filtering duplicate query results from multiple Web databases, but they require large training data sets for pre-learning. Unsupervised Duplicate Detection (UDD), a query-dependent record matching method that requires no pre-training, was developed earlier. Non-duplicate records from the same source can be used as training examples, so for a given query UDD uses two cooperating classifiers, a weighted component similarity summing (WCSS) classifier and an SVM classifier, that iteratively identify duplicates in the query results from multiple Web databases. To optimize its performance in string similarity calculations during information retrieval (IR), we propose the use of generalized Levenshtein distance algorithms. Experimental results show that our approach is comparable to previous works that require large numbers of training examples.

Keywords: UDD, SVM, OSVM, PEBL.

I INTRODUCTION

Web databases dynamically generate Web pages in response to user queries, and most of them are accessible only via a query interface through which users submit queries. Matching records from several such data sets is often required, as information from multiple sources needs to be integrated, combined, or linked in order to allow more detailed data analysis or mining. Our approach takes advantage of the dissimilarity among records from the same Web database for record matching.

In the Web database scenario, the records to match are highly query-dependent, since they can only be obtained through online queries. Moreover, these duplicates are only a partial and biased portion of all the data in the source Web databases. To overcome such problems, we propose a new method of identifying duplicates among records in query results from multiple Web databases. The proposed approach, Unsupervised Duplicate Detection (UDD), employs two classifiers that collaborate in an iterative manner.

Record matching can be done by supervised learning, where a training data set is required beforehand. In Web databases, however, the result records are obtained through online queries. They are query-dependent, and thus supervised learning is inappropriate: a representative training set used in supervised learning is not applicable to Web results that are generated on the fly. For each new query, depending on the results returned, the field weights should probably change too, which makes supervised-learning-based methods even less applicable.

Hence, we use an unsupervised technique named Unsupervised Duplicate Detection (UDD), which uses two classifiers for record matching and duplicate detection. This eliminates the user-preference problem of supervised learning. By employing two classifiers that collaborate in an iterative manner, UDD identifies duplicates based on the dissimilarity among these records: the field weights are set, and record matching is done by the first classifier. The matched records form the duplicate (positive) set. The second classifier then uses both the duplicate and the non-duplicate sets to identify further duplicate record pairs.

Thus, for a given query, UDD uses two cooperating classifiers, a weighted component similarity summing (WCSS) classifier and an SVM classifier, that iteratively identify duplicates in the query results from multiple Web databases.

II RELATED WORK

The records consist of multiple fields, making the duplicate detection problem much more complicated. Approaches that rely on training data to learn how to match records include probabilistic approaches and supervised machine learning techniques. Other approaches rely on domain knowledge or on generic distance metrics to match records.

The Bigram Indexing (BI) method, as implemented in the Febrl record linkage system, allows for fuzzy blocking. The basic idea is that the blocking key values are converted into a list of bigrams (sub-strings containing two characters) and sub-lists of all possible permutations are built using a threshold (between 0.0 and 1.0). The resulting bigram lists are sorted and inserted into an inverted index, which is used to retrieve the corresponding record numbers in a block. The number of sub-lists created for a blocking key value depends both on the length of the value and on the threshold: the lower the threshold, the shorter the sub-lists, but also the more sub-lists there will be per blocking key value, resulting in more (and smaller) blocks in the inverted index. In the information retrieval field, bigram indexing has been found to be robust to small typographical errors in documents.
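As an illustration of this fuzzy blocking idea, the Python sketch below builds the bigram-based index keys for a single blocking key value; the function name and the example threshold are ours for illustration and do not reproduce Febrl's actual code.

```python
from itertools import combinations

def bigram_index_keys(value, threshold=0.75):
    """Generate fuzzy blocking keys for one field value (bigram indexing).

    The value is split into bigrams; sub-lists containing a fraction of the
    bigrams (given by the threshold) are formed, and each sorted sub-list is
    concatenated into an index key. Lower thresholds give shorter sub-lists
    but more of them, i.e. more and smaller blocks.
    """
    value = value.lower().replace(" ", "")
    bigrams = [value[i:i + 2] for i in range(len(value) - 1)]
    sublist_len = max(1, int(round(len(bigrams) * threshold)))
    keys = set()
    for combo in combinations(bigrams, sublist_len):
        keys.add("".join(sorted(combo)))
    return keys

# Records whose key sets intersect fall into a common block and get compared.
print(bigram_index_keys("smith"))
```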

Supervised learning systems rely on the existence of training data in the form of record pairs, prelabeled as matching or not. One set of supervised learning techniques treats each record pair (a, b) independently, similar to the probabilistic techniques. Examples include the well-known CART algorithm, which generates classification and regression trees; a linear discriminant algorithm, which generates a linear combination of the parameters for separating the data according to their classes; and a “vector quantization” approach, which is a generalization of nearest-neighbor algorithms. The transitivity assumption can sometimes result in inconsistent decisions.

One of the problems with supervised learning techniques is the requirement for a large number of training examples. While it is easy to create a large number of training pairs that are either clearly non-duplicates or clearly duplicates, it is very difficult to generate ambiguous cases that would help create a highly accurate classifier. Based on this observation, some duplicate detection systems use active learning methods to automatically locate such ambiguous pairs. These methods suggest that, by creating multiple classifiers trained using slightly different data or parameters, it is possible to detect ambiguous cases and then ask the user for feedback. The key innovation in this work is the creation of several redundant functions and the concurrent exploitation of their conflicting actions in order to discover new kinds of inconsistencies among duplicates in the data set.

Chaudhuri et al. proposed a new framework for distance-based duplicate detection, observing that the distance thresholds for detecting real duplicate entries are different for each database tuple. To detect the appropriate threshold, Chaudhuri et al. observed that entries that correspond to the same real-world object but have different representations in the database tend 1) to have small distances from each other (the compact set property), and 2) to have only a small number of other neighbors within a small distance (the sparse neighborhood property).

The idea of unsupervised learning for duplicate detection has its roots in the probabilistic model proposed by Fellegi and Sunter. When there is no training data to compute the probability estimates, it is possible to use variations of the Expectation Maximization algorithm to identify appropriate clusters in the data. The duplicate record detection task is highly data-dependent, and it is unclear whether we will ever see a technique dominating all others across all data sets. The problem of choosing the best method for duplicate data detection is very similar to the problem of model selection and performance prediction for data mining.

III PROPOSED SYSTEM

The inputs to the algorithm are a potential duplicate vector set PDV, a non-duplicate vector set NDV, and an adjustable parameter A; the output is the duplicate vector set DV. For implementing the algorithm, two classifiers are used: the Weighted Component Similarity Summing (WCSS) classifier and the Support Vector Machine (SVM) classifier.

Major steps performed by UDD are:

• Similarity Calculation is performed.

• Obtain the potential duplicate vector set (PDV) and the non-duplicate vector set (NDV).

• Use the WCSS classifier to identify some duplicate vectors from PDV and NDV.

• Train the SVM.

• Classify the potential duplicate vector to obtain actual duplicates.

• Iteratively apply both classifiers until the non-duplicate vector set contains no further duplicates.

A) Similarity Calculation

In this paper, we propose to use generalized Levenshtein distance algorithms for calculating the similarity between two records. The Levenshtein distance between two strings is defined as the minimum number of edits needed to transform one string into the other, the allowable edit operations being insertion, deletion, or substitution of a single character. Generalized Levenshtein distance algorithms compute an approximate string distance between character vectors: a generalized, possibly weighted, edit distance giving the minimal weighted number of insertions, deletions, and substitutions needed to transform one string into another. The generalized Levenshtein distance can also be used for approximate (fuzzy) string matching, in which case one finds the substring of t with minimal distance to the pattern s; this distance is computed when partial matching is enabled and corresponds to the distance used by agrep. "Partial" here is a logical flag indicating whether the transformed x elements must exactly match the complete y elements, or only substrings of them. A similarity distance threshold value is set to initially identify a set of potential duplicate vectors and non-duplicate vectors.
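The sketch below gives one way to implement the weighted edit distance just described in Python, with an optional agrep-style partial mode that returns the distance of the best-matching substring of t to the pattern s; the weights, function names, and the normalized similarity helper are illustrative assumptions rather than the exact routine used in our experiments.

```python
def generalized_levenshtein(s, t, w_ins=1.0, w_del=1.0, w_sub=1.0, partial=False):
    """Weighted edit distance between strings s and t.

    With partial=True the distance from the pattern s to the best-matching
    substring of t is returned (agrep-style fuzzy matching): the match may
    start anywhere in t (zero-cost first row) and end anywhere in t
    (minimum over the last row).
    """
    n, m = len(s), len(t)
    prev = [0.0 if partial else j * w_ins for j in range(m + 1)]  # row for i = 0
    for i in range(1, n + 1):
        curr = [i * w_del] + [0.0] * m
        for j in range(1, m + 1):
            cost_sub = prev[j - 1] + (0.0 if s[i - 1] == t[j - 1] else w_sub)
            cost_del = prev[j] + w_del      # delete a character of s
            cost_ins = curr[j - 1] + w_ins  # insert a character of t
            curr[j] = min(cost_sub, cost_del, cost_ins)
        prev = curr
    return min(prev) if partial else prev[m]

def similarity(s, t):
    """Normalize the edit distance into a similarity score in [0, 1]."""
    if not s and not t:
        return 1.0
    return 1.0 - generalized_levenshtein(s, t) / max(len(s), len(t))

print(generalized_levenshtein("kitten", "sitting"))             # 3.0
print(generalized_levenshtein("abc", "xxabcxx", partial=True))  # 0.0
```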

B) Generating Vectors

Based on the similarity calculation, the potential duplicate vector set (PDV) and the non-duplicate vector set (NDV) are generated. An initial vector of weights is allocated to the fields for the similarity calculation.
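A minimal sketch of this step, reusing the similarity function from the previous sketch, might look as follows; the field names, the uniform initial weights, and the 0.85 seeding threshold are hypothetical.

```python
FIELDS = ["title", "author", "price"]                # hypothetical record fields
INITIAL_WEIGHTS = [1.0 / len(FIELDS)] * len(FIELDS)  # uniform starting weights

def similarity_vector(rec_a, rec_b):
    """Per-field similarity vector for a pair of query-result records."""
    return [similarity(str(rec_a[f]), str(rec_b[f])) for f in FIELDS]

def split_pdv_ndv(record_pairs, threshold=0.85):
    """Seed PDV and NDV by comparing the average field similarity to a threshold."""
    pdv, ndv = [], []
    for a, b in record_pairs:
        vec = similarity_vector(a, b)
        target = pdv if sum(vec) / len(vec) >= threshold else ndv
        target.append((vec, a, b))
    return pdv, ndv
```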

C) Identifying duplicates using WCSS classifier

The WCSS classifier is used to identify some duplicate vectors when no positive examples are available. An intuitive way to identify duplicate vectors is to assume that two records are the same if most of the fields under consideration are similar. The classifier considers both the potential duplicate vector set PDV and the non-duplicate vector set NDV. Using WCSS, a set of duplicate vector pairs is obtained from PDV and NDV; these pairs are removed from PDV and NDV and stored in the duplicate vector set DV.
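A simple version of this classification step is sketched below, building on the structures above; the 0.9 duplicate threshold is an assumption, and the re-estimation of the field weights from DV and NDV performed at each iteration is omitted.

```python
def wcss_score(vec, weights):
    """Weighted component similarity summing: weighted sum of field similarities."""
    return sum(w * s for w, s in zip(weights, vec))

def wcss_classify(pdv, ndv, weights, dup_threshold=0.9):
    """Move pairs whose weighted similarity exceeds the threshold into DV."""
    dv = []
    for pool in (pdv, ndv):
        remaining = []
        for item in pool:
            vec = item[0]
            (dv if wcss_score(vec, weights) >= dup_threshold else remaining).append(item)
        pool[:] = remaining  # remove the identified duplicates in place
    return dv
```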

D) Train SVM

The Support Vector Machine classifier is a useful technique for data classification. The goal of an SVM is to produce a model, based on the training data, which predicts the target values of the test data given only the test data attributes. The second classifier (SVM) used in UDD should be insensitive to the relative sizes of the positive and negative example sets, because the set of negative examples is usually much larger than the set of positive examples. The SVM is trained with the duplicate vectors in DV and the non-duplicate vectors in NDV as examples, and is then used to identify further duplicate vectors in PDV.
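As an illustration of the training and classification step, the sketch below uses scikit-learn's SVM on the similarity vectors accumulated so far; the choice of library, the linear kernel, and the balanced class weighting are assumptions made for this sketch, not prescriptions of the method.

```python
from sklearn import svm

def train_and_classify(dv, ndv, pdv):
    """Train an SVM on the duplicate (DV) and non-duplicate (NDV) similarity
    vectors, then classify the remaining potential duplicate vectors in PDV."""
    X = [item[0] for item in dv] + [item[0] for item in ndv]
    y = [1] * len(dv) + [0] * len(ndv)
    # class_weight="balanced" offsets the size imbalance between DV and NDV
    clf = svm.SVC(kernel="linear", class_weight="balanced")
    clf.fit(X, y)
    return [item for item in pdv if clf.predict([item[0]])[0] == 1]
```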

E) Identify Actual Duplicates

Remove the actual duplicates found from PDV and from NDV and update the new PDV and NDV. Iteratively apply the WCSS classifier and the SVM classifier, adjusting the parameters, until no further duplicates are found among the non-duplicates.
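Putting the previous sketches together, the overall iteration can be outlined roughly as follows; the stopping rule, the iteration cap, and the omitted field-weight update are simplifications of the procedure described above.

```python
def udd(record_pairs, max_iters=10):
    """Alternate the WCSS and SVM classifiers until no further duplicates are found."""
    weights = list(INITIAL_WEIGHTS)
    pdv, ndv = split_pdv_ndv(record_pairs)
    dv = []
    for _ in range(max_iters):
        found = wcss_classify(pdv, ndv, weights)  # also prunes PDV and NDV in place
        if (dv or found) and ndv:                 # the SVM needs both classes
            svm_dups = train_and_classify(dv + found, ndv, pdv)
            pdv = [item for item in pdv if item not in svm_dups]
            found += svm_dups
        if not found:                             # no new duplicates: stop iterating
            break
        dv += found
    return dv
```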

IV PERFORMANCE

Methods such as SVM, OSVM, PEBL, and Christen's method require prelabeled training examples to train the classifier to do classification. UDD, however, has the advantage that it does not require any prelabeled training examples, which relieves users of the burden of providing such examples and makes UDD applicable for online record matching in the Web database scenario.

UDD outperforms both PEBL and Christen's method. While both PEBL and UDD employ two classifiers, the two classifiers in UDD cooperate alternately inside the iterations, whereas the weak classifier in PEBL only works before the iterations. Thus, UDD outperforms PEBL because in UDD either classifier can identify instances that cannot be identified by the other, which reduces the possibility of being biased by false positive or negative examples. In both PEBL and Christen's method there is only one classifier in the iterations; when there is only one classifier in the iterations, the results identified in a previous iteration are used by the same classifier as the new training examples for the next iteration, which makes the classifier more vulnerable to false examples.

V CONCLUSION

Duplicate detection is an important step in data integration, and most state-of-the-art methods are based on offline learning techniques that require training data. In this paper, to address the problem of record matching in the Web database scenario, we present an unsupervised online record matching method called UDD (Unsupervised Duplicate Detection) which, for a given query, can effectively identify duplicates from the query result records of multiple Web databases. Non-duplicate records from the same source are used as training examples, and two cooperating classifiers, a weighted component similarity summing classifier and an SVM classifier, iteratively identify duplicates in the query results. To optimize performance in string similarity calculations during information retrieval, we propose to replace the existing similarity measure with generalized Levenshtein distance algorithms.

VI REFERENCES

[1] Weifeng Su, Jiying Wang, and Frederick H. Lochovsky, “Record Matching over Query Results from Multiple Web Databases,” IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 4, pp. 578-589, Apr. 2010.

[2] A.K. Elmagarmid, P.G. Ipeirotis, and V.S. Verykios, “Duplicate Record Detection: A Survey,” IEEE Transactions on Knowledge and Data Engineering, vol. 19, no. 1, pp. 1-16, 2007.

[3] M. Bilenko and R.J. Mooney, “Adaptive Duplicate Detection Using Learnable String Similarity Measures,” Proc. ACM SIGKDD, pp. 39-48, 2003.

[4] V. Levenshtein, “Binary Codes Capable of Correcting Deletions, Insertions, and Reversals,” Soviet Physics Doklady, vol. 10, no. 8, pp. 707-710, 1966.

[5] P. Christen, T. Churches, and M. Hegland, “Febrl—A Parallel Open Source Data Linkage System,” Advances in Knowledge Discovery and Data Mining, pp. 638-647, Springer, 2004.

[6] H. Yu, J. Han, and C.C. Chang, “PEBL: Web Page Classification without Negative Examples,” IEEE Transactions on Knowledge and Data Engineering, vol. 16, no. 1, pp. 70-81, Jan. 2004.

[7] R. Ananthakrishna, S. Chaudhuri, and V. Ganti, “Eliminating Fuzzy Duplicates in Data Warehouses,” Proc. 28th International Conference on Very Large Data Bases, pp. 586-597, 2002.
