Web Data Identification and Extraction


G.V.Rajya Lakshmi #1, Mr.B.Narasimha Swamy #2

#1 Student, P.V.P.Siddhartha Institute of Technology, Kanuru, Vijayawada, Krishna (Dt)

#2 Sr.Asst.Professor, P.V.P.Siddhartha Institute of Technology, Kanuru, Vijayawada, Krishna (Dt)

#1 gvrlaxmi@gmail.com, #2 swamy_bn@yahoo.com

Abstract: With the rapid growth of the web, a large volume of data and information is published in numerous web pages. As web sites become more complicated, the construction of web information extraction systems becomes more difficult and time-consuming. This paper proposes a new method that performs the task automatically and is more effective than machine learning and semi-automated systems. The proposed method consists of two steps: (1) identifying individual data records in a page, and (2) aligning and extracting data items from the identified data records. For step 1, we propose a method based on visual information to segment data records, which is more accurate than existing methods. For step 2, we propose a novel partial alignment technique based on tree matching. Partial alignment means that we align only those data fields in a pair of data records that can be aligned (or matched) with certainty, and make no commitment on the rest of the data fields.

Keywords: Web mining, Web data extraction, alignment, data records.

I INTRODUCTION

DATA MINING

Data mining is emerging as one of the key features of many homeland security initiatives. Often used as a means for detecting fraud, assessing risk, and product retailing, data mining [1] involves the use of data analysis tools to discover previously unknown, valid patterns and relationships in large data sets. In the context of homeland security, data mining is often viewed as a potential means to identify terrorist activities, such as money transfers and communications, and to identify and track individual terrorists themselves, such as through travel and immigration records.

WEB MINING

Web mining is the application of data mining techniques to discover patterns from the Web. According to the analysis target, web mining can be divided into three types: Web usage mining, Web content mining and Web structure mining.

WEB USAGE MINING

Web usage mining is the process of extracting useful information from server logs, i.e. the users' history. It is the process [2] of finding out what users are looking for on the Internet. Some users might be looking at only textual data, whereas others might be interested in multimedia data.

WEB STRUCTURE MINING

Web structure mining is the process of using graph theory to analyze the node and connection structure of a web site. According to the type of web structural data, web structure mining can be divided into two kinds:

1. Extracting patterns from hyperlinks in the web: a hyperlink is a structural component that connects the web page to a different location.

2. Mining the document structure: analysis of the tree-like structure of page structures to describe HTML or XML tag usage.

WEB DATA IDENTIFICATION AND EXTRACTION

While data mining products can be very powerful tools, they are not self-sufficient applications. To be successful, data mining requires skilled technical and analytical specialists who [3] can structure the analysis and interpret the output that is created. Consequently, the limitations of data mining are primarily data-related or personnel-related, rather than technology-related.


data record without extracting its data items. The method also uses visual cues to find data records. Visual information helps the system in two ways:

(i) It enables the system to identify gaps that separate data records, which helps to segment data records correctly because the gap within a data record (if any) is typically smaller than that in between data records.

(ii) The proposed system identifies data records by analyzing HTML tag trees or DOM trees. A straightforward way to build a tag tree is to follow the nested tag structure in the HTML code.

A novel partial tree alignment method is proposed to align and to extract corresponding data items from the discovered data records and put the data items in a database table. Using tree alignment is natural because of the nested (or tree structured) organization of HTML code. Specifically, after all data records have been identified, the sub-trees of each data record are re-arranged into a single tree as each data record may be contained in more than one subtree in the original tag tree of the page, and each data record may not be contiguous. The tag trees of all the data records are then aligned using our partial alignment method. The resulting alignment enables us to extract data items from all data records in the page. It can also serve as an extraction pattern to be used to extract data items from other pages with data records generated using the same template.

ADVANTAGES OF DATA MINING

Marketing / Retail

Data mining helps marketing companies build models based on historical data to predict who will respond to new marketing campaigns such as direct mail and online marketing campaigns. Through this prediction, marketers can take an appropriate approach to selling profitable products to targeted customers with high satisfaction.

Finance / Banking

Data mining gives financial institutions information about loans and credit reporting. By building a model from the data of previous customers with common characteristics, banks and financial institutions can estimate which loans are good or bad and what their risk level is. In addition, data mining can help banks detect fraudulent credit card transactions and so help credit card owners prevent losses.

DISADVANTAGES

Privacy Issues

Concerns about personal privacy have increased enormously in recent years, especially as the internet booms with social networks, e-commerce, forums and blogs. People are afraid that their personal information will be collected and used in unethical ways that could cause them a lot of trouble. Businesses collect information about their customers in many ways in order to understand their purchasing behavior and trends. However, businesses do not last forever; some day they may be acquired by others or close down. At that point the personal information they own may be sold to others or leaked.

Misuse of information/inaccurate information

Information collected through data mining for marketing or ethical purposes can be misused. It may be exploited by unethical people or businesses to take advantage of vulnerable people or to discriminate against a group of people.

DISADVANTAGES OF EXISTING SYSTEM

- When multiple pages are given, the extraction target aims at page-wide information.

- When single pages are given, the extraction target is usually constrained to record-wide information.

- Page-level extraction tasks are much more complicated than record-level extraction tasks since more data are concerned.

- It is time-consuming and it exploits only structural information to measure the similarity; visual information is recommended and important for such similarities.


ADVANTAGES OF PROPOSED SYSTEM

- The proposed system identifies gaps that separate data records, which helps to segment data records correctly because the gap within a data record (if any) is typically smaller than the gap between data records.

- Identifies data records by analyzing HTML tag trees. A straightforward way to build a tag tree is to follow the nested tag structure in the HTML code.

- Tree alignment is natural because of the nested organization of HTML code.

- Data items are extracted from the discovered data records and put in a database table.

- The proposed method leads to more robust tree construction due to the high error tolerance of the rendering engines of Web browsers.

II RELATED WORK

The first approach is wrapper induction, which uses supervised learning to learn data extraction rules from a set of manually labeled positive and negative examples. Manual labeling of data is, however, labor-intensive and time-consuming. Additionally, for different sites or even pages in the same site, the manual labeling process needs to be repeated because they follow different templates/patterns. Example wrapper induction systems include WIEN, Softmealy, Stalker, WL2 etc.

Record Level Extraction [Fig:1.1]

Our technique requires no human labeling. It mines data records in a page and extracts data from the records automatically. The second approach is automatic extraction. Previously, a study was made to automatically identify data record boundaries. The method is based on a set of heuristic rules, e.g., highest-count tags, repeating-tags and


However, those methods produce poor results. In addition, these methods do not extract data from data records.

Page Level Extraction [Fig:1.2]

Recently, two more techniques were proposed. However, they need multiple pages (which are assumed to be given) that contain similar data records from the same site in order to find patterns or grammars for extracting data records. Assuming the availability of multiple pages containing similar data records is a serious limitation. Our method works on each single page.

III PROPOSED SYSTEM

We focus on page-level extraction tasks and propose a new approach called Tree based web data extraction.

The proposed technique presents a new structure, called fixed/variant pattern tree, a tree that carries all of the required information needed to identify the template and detect the data schema.

We combine several techniques, such as alignment, pattern mining and tree templates, to solve the much more difficult problem of page-level template construction.

STEPS FOR RECORD DATA EXTRACTION

The method finds data records using two observations about data records in a Web page and an edit distance string matching algorithm [4]. The two observations are: (1) A group of data records that contains descriptions of a set of similar objects is typically presented in a contiguous region of a page and is formatted using similar HTML tags. Such a region is called a data record region (or data region for short). (2) The nested structure of HTML tags in a Web page naturally forms a tag tree.

Given a Web page, the algorithm works in three steps:

Step 1: Building an HTML tag tree of the page.

Step 2: Mining data regions in the page using the tag tree. A data region is an area in the page that contains a list of similar data records. Instead of mining data records directly, which is hard, the proposed algorithm mines data regions first and then finds data records within them.

Step 3: Identifying data records from each data region.
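The idea behind Step 2 can be sketched as follows. This is a minimal illustration, not the paper's algorithm: it assumes each child of a tag node has already been flattened into a list of tag names, it compares only single adjacent nodes (not combinations of several adjacent nodes), and the function names and the `threshold` value of 0.3 are hypothetical choices.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (here: lists of tag names)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j - 1] + (x != y),  # match or change
                           prev[j] + 1,             # delete x
                           cur[j - 1] + 1))         # insert y
        prev = cur
    return prev[-1]


def find_data_regions(child_tag_strings, threshold=0.3):
    """Group adjacent siblings whose tag strings are similar.

    Returns (start, end) index pairs, end exclusive, for every run of at
    least two consecutive siblings whose normalized edit distance to the
    previous sibling is at most `threshold` (an assumed value).
    """
    regions, start = [], 0
    count = len(child_tag_strings)
    for i in range(1, count + 1):
        similar = False
        if i < count:
            a, b = child_tag_strings[i - 1], child_tag_strings[i]
            similar = edit_distance(a, b) / max(len(a), len(b), 1) <= threshold
        if not similar:
            if i - start >= 2:          # a region needs at least two records
                regions.append((start, i))
            start = i
    return regions


children = [
    ["div", "h1"],
    ["tr", "td", "td", "a"],
    ["tr", "td", "td", "a"],
    ["tr", "td", "td"],
    ["p"],
]
regions = find_data_regions(children)
```

Normalizing the distance by the longer tag string keeps the threshold independent of record size; a real system would also have to try combinations of adjacent nodes, as described later in the text.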

DATA EXTRACTION

Fig : Proposed data extraction system model

(i) Building a HTML Tag Tree

In a Web browser, each HTML element (consisting of a start tag, optional attributes, optional embedded HTML content, and an end tag that may be omitted) is rendered as a rectangle. A tag tree can be constructed based on the nested rectangles (resulting from nested tags). The details are as follows:

1. Find the 4 boundaries of the rectangle of each HTML element by calling the embedded parsing and rendering engine of a browser, e.g., Internet Explorer.

2. Detect the containment relationship among the rectangles, i.e., whether one rectangle is contained inside another rectangle. A tree can be built based on the containment check.
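The containment check in the two steps above can be sketched as follows. This is a simplified illustration that assumes the rectangles have already been obtained from a rendering engine; the `Box` class, its field names and `build_tag_tree` are hypothetical, and the largest rectangle is assumed to be the root.

```python
from dataclasses import dataclass, field


@dataclass
class Box:
    """Rendered rectangle of one HTML element (hypothetical structure)."""
    tag: str
    left: float
    top: float
    right: float
    bottom: float
    children: list = field(default_factory=list)

    def area(self):
        return (self.right - self.left) * (self.bottom - self.top)

    def contains(self, other):
        return (self.left <= other.left and self.top <= other.top
                and self.right >= other.right and self.bottom >= other.bottom)


def build_tag_tree(boxes):
    """Attach each box to the smallest already placed box that contains it."""
    # Larger boxes are placed first, so a parent is always placed before
    # the boxes nested inside it.
    ordered = sorted(boxes, key=lambda b: b.area(), reverse=True)
    root = ordered[0]
    placed = [root]
    for box in ordered[1:]:
        candidates = [p for p in placed if p.contains(box)]
        # Fall back to the root if no placed box contains this one.
        parent = min(candidates, key=lambda p: p.area()) if candidates else root
        parent.children.append(box)
        placed.append(box)
    return root


page = [
    Box("td", 10, 10, 40, 40),
    Box("tr", 10, 10, 90, 40),
    Box("html", 0, 0, 100, 100),
    Box("tr", 10, 50, 90, 80),
    Box("table", 10, 10, 90, 90),
]
tree = build_tag_tree(page)
```

Because the tree is derived from rendered geometry rather than from the raw markup, missing or misnested end tags in the HTML source do not affect it, which is the robustness argument made for the proposed system.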

(ii) Mining Data Regions

[Figure: proposed data extraction system model. Boxes: INTERNET; build an HTML tag tree of the page; mine data regions in the page using the tag tree; identifying all data regions; identify data records from each region; align all data records using partial tree alignment.]

This step mines every data region in a page that contains similar data records. Instead of mining data records directly, which is hard, we first mine data regions. By comparing tag strings of individual nodes (including their descendants) and combinations of multiple adjacent nodes, we can find each data region. In our new system, gaps between data records are used to eliminate false node combinations. We utilize the following visual observation about data records: the gap between two data records in a data region should be no smaller than any gap within a data record.
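The visual observation can be expressed as a simple check on a candidate segmentation. The sketch below assumes each data record is given as a list of (top, bottom) pixel coordinates of its rendered items, ordered from top to bottom; the function names and this representation are hypothetical.

```python
def vertical_gaps(boxes):
    """Gaps between consecutive (top, bottom) boxes listed top to bottom."""
    return [max(0, nxt[0] - cur[1]) for cur, nxt in zip(boxes, boxes[1:])]


def is_plausible_segmentation(records):
    """Check that no gap between records is smaller than a gap inside a record."""
    within = [gap for record in records for gap in vertical_gaps(record)]
    # Vertical extent of each record: top of its first item, bottom of its last.
    extents = [(record[0][0], record[-1][1]) for record in records]
    between = vertical_gaps(extents)
    if not within or not between:
        return True  # nothing to compare, so the rule cannot reject
    return min(between) >= max(within)


good = [[(0, 10), (12, 20)], [(30, 40), (42, 50)]]
bad = [[(0, 10)], [(12, 20), (30, 40)], [(42, 50)]]
good_ok = is_plausible_segmentation(good)
bad_ok = is_plausible_segmentation(bad)
```

In the second segmentation a large gap falls inside a record while the gaps between records are small, so the candidate node combination would be eliminated.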

(iii) Identifying Data Records

After all data regions are identified, we identify data records from generalized nodes. We note that each generalized node (a single or a combination of tag nodes in the tag tree) may not represent a single data record. The situations can be quite complex. For this kind of situation, the corresponding children nodes of every tag node in a generalized node form a non-contiguous data record.

EDIT DISTANCE STRING ALGORITHM

The edit distance string algorithm is used to identify the similarity [9] between records. String edit distance is used as a general-purpose record matching scheme. This algorithm is flexible, fast and easy to implement.

String edit distance (also known as Levenshtein distance) is perhaps the most widely used string matching/comparison technique. The edit distance of two strings, S1 and S2, is defined as the minimum number of point mutations [15] required to change S1 into S2, where a point mutation is one of (1) change a character, (2) insert a character, and (3) delete a character. Assume we are given two strings S1 and S2. Here, a string means a data item.

Let E denote the empty string and s any string. Then the distances involving E are:

d(E, E) = 0

d(s, E) = d(E, s) = |s|

For non-empty strings, the following recurrence relation defines the edit distance d(S1, S2) of two strings S1 and S2:

d(S1+C1, S2+C2) = min [ d(S1, S2) + P(C1, C2), d(S1+C1, S2) + 1, d(S1, S2+C2) + 1 ]

where C1 and C2 are the last characters of the two strings, and P(C1, C2) = 0 if C1 = C2; P(C1, C2) = 1 otherwise.

The first two rules are obvious. Let us examine the last one. Since neither string is empty, each has a last character, C1 and C2 respectively. C1 and C2 have to be explained in an edit of S1+C1 into S2+C2. If C1 = C2, they match with no penalty, i.e. P(C1, C2) = 0, and the overall edit distance is d(S1, S2). If C1 ≠ C2, then C1 could be changed into C2, giving P(C1, C2) = 1 and an overall cost of d(S1, S2) + 1. Another possibility is to delete C1 and edit S1 into S2+C2, giving d(S1, S2+C2) + 1. The last possibility is to edit S1+C1 into S2 and then insert C2, giving d(S1+C1, S2) + 1. There are no other alternatives.

From the relations, we can see that d(S1, S2) depends only on d(S1', S2'), where S1' is a shorter string than S1, or S2' is a shorter string than S2, or both. Thus the dynamic programming technique can be applied to compute the edit distance of two strings.
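The recurrence can be transcribed almost literally into code. The sketch below is one possible rendering, using memoized recursion over prefix lengths so that each subproblem is solved once; it is intended for short strings such as data items, since very long inputs would exceed Python's default recursion depth.

```python
from functools import lru_cache


def edit_distance(s1: str, s2: str) -> int:
    """Edit distance d(s1, s2) following the recurrence in the text."""

    @lru_cache(maxsize=None)
    def d(i: int, j: int) -> int:
        # d(i, j) is the distance between the prefixes s1[:i] and s2[:j].
        if i == 0:
            return j                      # d(E, s) = |s|
        if j == 0:
            return i                      # d(s, E) = |s|
        p = 0 if s1[i - 1] == s2[j - 1] else 1
        return min(
            d(i - 1, j - 1) + p,          # match (p = 0) or change (p = 1)
            d(i, j - 1) + 1,              # insert the last character of s2
            d(i - 1, j) + 1,              # delete the last character of s1
        )

    return d(len(s1), len(s2))


example = edit_distance("kitten", "sitting")
```

An iterative table of size (|S1|+1) by (|S2|+1) computes the same values without recursion and is the more common form for long inputs.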

Using the edit distance string algorithm, linkages between the data records are identified, so that similar and redundant records can be identified and extracted. The same algorithm is used to match corresponding data items when extracting data records. After the data items are matched, data records are extracted by the partial tree alignment technique.

IV DATA EXTRACTION

We now present the partial tree alignment technique for data extraction [5]. The key task is how to match corresponding data items or fields from all data records. There are two sub-steps:

1. Produce one rooted tag tree for each data record: After all data records are identified, the sub-trees of each data record are rearranged into a single tree. As shown above, each data record may be contained in more than one sub-tree of the original tag tree of the page, and each data record may not be contiguous. Thus, this sub-step is needed to compose a single tree for each data record (an artificial root node may also need to be added).

2. Partial tree alignment: The tag trees of all data records in each data region are aligned using our partial alignment method which is based on tree matching. It should be noted that in the matching process, we only use tags. No data item is involved.

(i) Partial Tree Alignment


is similar to the center tree but without the O(k^2) pair-wise tree matching to choose it. The reason for choosing this seed tree is clear as it is more likely for this tree to have a good alignment with data fields in other data records. Then for each Ti (i ≠ s), the algorithm tries to find [10] for each node in Ti a matching node in Ts. When a match is found for node ni, a link is created from ni to ns to indicate its match in the seed tree. If no match can be found for node ni, then the algorithm attempts to expand the seed tree by inserting ni into Ts. The expanded seed tree Ts is then used in subsequent matching.

As indicated above, after Ts and Ti are matched, some nodes in Ti can be aligned with their corresponding [6] nodes of Ts because they match one another. For those nodes in Ti that are not matched, we want to insert them into Ts as they may contain optional data items. There are two possible situations when inserting a new node ni from Ti into the seed tree Ts, depending on whether a location in Ts can be uniquely determined [11] to insert ni. In fact, instead of considering a single node ni, we can consider each set of unmatched consecutive sibling nodes nj…nm from Ti together. Without loss of generality, we assume that the parent node of nj…nm has a match in Ts and we want to insert nj…nm into Ts under the same parent [8] node. We only insert nj…nm into Ts if a position for inserting nj…nm can be uniquely determined in Ts. Otherwise, they will not be inserted into Ts and left unaligned. The alignment is thus partial.
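The insertion rule can be illustrated on a single level of the trees. The text does not spell out when a position is "uniquely determined", so the sketch below assumes one plausible criterion: a run of unmatched siblings is inserted only if its matched left and right neighbours are adjacent in the seed, or if the run sits at an end and its only neighbour is the first or last child in the seed. It also assumes that sibling labels are unique and that nodes match when their labels are equal; the function name is hypothetical.

```python
def insert_unmatched_runs(seed_children, other_children):
    """Try to insert unmatched sibling runs from other_children into the seed.

    Both arguments are lists of tag labels under an already matched parent.
    Returns (expanded_seed, unaligned_labels).
    """
    seed = list(seed_children)
    unaligned = []
    i, n = 0, len(other_children)
    while i < n:
        if other_children[i] in seed:          # matched node, nothing to insert
            i += 1
            continue
        j = i                                   # collect the whole unmatched run
        while j < n and other_children[j] not in seed:
            j += 1
        run = other_children[i:j]
        # Seed positions of the matched neighbours on each side of the run.
        left = seed.index(other_children[i - 1]) if i > 0 else None
        right = seed.index(other_children[j]) if j < n else None
        if left is not None and right is not None and right == left + 1:
            seed[right:right] = run             # unique slot between neighbours
        elif left is None and right == 0:
            seed[0:0] = run                     # unique slot at the left end
        elif right is None and left == len(seed) - 1:
            seed.extend(run)                    # unique slot at the right end
        else:
            unaligned.extend(run)               # ambiguous, leave unaligned
        i = j
    return seed, unaligned


expanded, left_over = insert_unmatched_runs(["a", "b", "e"], ["b", "c", "d", "e"])
same_seed, ambiguous = insert_unmatched_runs(["a", "b", "c"], ["a", "x", "c"])
```

In the first call the run c, d fits between the adjacent seed nodes b and e and is inserted. In the second call x could go either before or after b, so it is left unaligned, which is what makes the alignment partial.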

V PERFORMANCE

In all approaches that extract SRRs and web template data from Web pages, the main problem is detecting record boundaries and data boundaries. We also address related problems [13][14], namely aligning [9] the data inside these data records as well as the extracted web data. The results below show that our approach performs better than the previous ones.

COMPARISONS: SRR Extraction

Actual SRRs: 419

             Depta     Fivatech   Our Approach
#Extracted   248       409        415
#Correct     226       401        410
Recall       53.9%     95.7%      97.9%
Precision    91.1%     98.0%      99.0%
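The recall and precision rows can be recomputed from the counts. The paper does not state its definitions, so the sketch below assumes the standard ones: recall = #Correct / actual SRRs and precision = #Correct / #Extracted. The variable names are illustrative.

```python
ACTUAL_SRRS = 419

# (extracted, correct) counts per system, taken from the comparison table.
counts = {
    "Depta": (248, 226),
    "Fivatech": (409, 401),
    "Our Approach": (415, 410),
}

metrics = {}
for name, (extracted, correct) in counts.items():
    recall = round(100 * correct / ACTUAL_SRRS, 1)      # percent of actual SRRs found
    precision = round(100 * correct / extracted, 1)     # percent of extractions that are right
    metrics[name] = (recall, precision)
```

Under these assumed definitions the recomputed values agree with the table to one decimal place, except that the precision of our approach comes out at 98.8% rather than the reported 99.0%; the paper does not say how that figure was rounded.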

GRAPH: [Bar chart comparing Our approach, Fivatech and Depta; axis range 0 to 1200.]

VI CONCLUSION


identifying data records without extracting each data field in the data records, and (2) aligning corresponding data fields from multiple data records to extract data from them to put in a database table. We proposed an enhanced method based on visual information for step (1), which significantly improves the accuracy of our previous algorithm. For step 2, we proposed a novel partial tree alignment technique to align corresponding data fields of multiple data records.

VII REFERENCES

[1] Data Mining News, Volume 1, No. 18, May 11, 1998.

[2] Arasu, A. and Garcia-Molina, H. Extracting Structured Data from Web Pages. SIGMOD-03, 2003.

[3] Baeza-Yates, R. Algorithms for string matching: A survey. ACM SIGIR Forum, 23(3-4):34-58, 1989.

[4] Barton, G., Sternberg, M. A strategy for the rapid multiple alignment of protein sequences: confidence levels from tertiary structure comparisons. J. Mol. Biol. 1987, 327-337.

[5] Bar-Yossef, Z. and Rajagopalan, S. Template Detection via Data Mining and its Applications. WWW 2002, 2002.

[6] Wang, J.-Y., and Lochovsky, F. Data extraction and label assignment for Web databases. WWW-03, 2003.

[7] Buttler, D., Liu, L., Pu, C. A fully automated extraction system for the World Wide Web. IEEE ICDCS-21, 2001.

[8] Liu, B., Grossman, R. and Zhai, Y. Mining data records from Web pages. KDD-03, 2003.

[9] Valiente, G. Tree edit distance and common subtrees. Research Report LSI-02-20-R, Universitat Politecnica de Catalunya, Barcelona, Spain, 2002.

[10] Chang, C. and Lui, S-L. IEPAD: Information extraction based on pattern discovery. WWW-10, 2001.

[11] Carrillo, H., Lipman, D. The multiple sequence alignment problem in biology. SIAM J. Applied Math., 1988;48(5).

[12] Chakrabarti, S. Mining the Web: Discovering Knowledge from Hypertext Data. Morgan Kaufmann Publishers, 2002.

[13] Chang, C. and Lui, S-L. IEPAD: Information extraction based on pattern discovery. WWW-10, 2001.

[14] Chen, H.-H., Tsai, S.-C., and Tsai, J.-H. Mining tables from large scale html texts. COLING-00, 2000.

[15] Chen, W. New algorithm for ordered tree-to-tree correction problem. Journal of Algorithms, 40:135–158, 2001.

B.N.Swamy received his B. Tech from Pondicherry University and completed his post graduation at SRM University, Chennai. He is currently working as Senior Assistant Professor in PVP Siddhartha Institute of Technology, in the Department of Computer Science and Engineering, Vijayawada, Andhra Pradesh. His research interests include Data Mining and Data Warehousing, Computer Networks and Network Security. He has more than four years of experience in teaching. He is a member of ACM, the Indian Society of Technical Education (ISTE) and the Computer Society of India.

G.V.RajyaLakshmi received her B.E from the Institution of Engineers (INDIA), Kolkata, and was elected as an Associate Member of the Institution of Engineers. She is currently pursuing Post Graduation (M.Tech) in PVP Siddhartha Institute of Technology, in the Department of Computer Science and Engineering, Vijayawada, Andhra Pradesh. Her research interests include Data Mining and Data Warehousing, Data Structures & Algorithms. She has five