

We also present WebGraph-It, a system which implements our methods and is available at http://webgraph-it.com. Web crawling engineers could use WebGraph-It to preprocess websites prior to web crawling in order to obtain lists of URLs to avoid as duplicates or near-duplicates, identify URLs to avoid as web spider traps, and generate webgraphs of the target website.

Finally, we conduct an experiment with a non-trivial dataset of websites to evaluate the proposed methods. Our contributions can be summarized as follows:

• we propose a set of methods to detect duplicate and near-duplicate webpages in real time during web crawling.

• we propose a set of methods to detect web spider traps using webgraphs in real time during web crawling.

• we introduce WebGraph-It, a web platform which implements the proposed methods.

The remainder of this chapter is organised as follows: Section 4.2 presents the main concepts of our methods and introduces new web crawling methods that use them to detect duplicate and near-duplicate content, as well as web spider traps. Section 4.3 presents the system architecture of WebGraph-It. Section 4.4 presents our experiments and detailed results.

Finally, Section 4.5 discusses results and presents future work.

4.2 Method

4.2.1 Key Concepts

A well-known problem in web crawling is that web crawlers may keep capturing webpages even though no new web content is available. This issue is also known as web spider traps and is detailed in Section 4.4.5.

To devise more efficient web crawling methods, we exploit three concepts which, to the best of our knowledge, have not been fully exploited to date:

Unique webpage identifier selection: Which webpage attribute is considered as its unique identifier?

Unique webpage identifiers similarity: Which webpage unique identifiers should be considered similar?

Webpage content similarity: Which webpage content should be considered similar?

Using these concepts, we identify duplicate or near-duplicate web content, which highlights webgraph nodes that contain little or no new information and can thus be removed. These findings result in webgraph edge contractions and restructuring. In addition, this process enables webgraph cycle detection. The result is a reduction of webgraph complexity, improving the efficiency of the web crawling process and the quality of its results.

We must note that each of the presented concepts can be used not only independently but also in conjunction with the others. In the following, we analyse each concept in detail.

Unique webpage identifier selection

Uniform Resource Identifiers (URIs) are the de facto standard for the unique identification of web resources on the WWW. The architecture of the WWW is based on the Uniform Resource Locator (URL), a subset of the URI which, in addition to identifying a resource, provides a means of locating it by describing its primary access mechanism (e.g., its network “location”) [19]. Many web-related technologies, such as the Semantic Web and Linked Open Data, use URLs [21].

We suggest that the use of URLs for unique webpage identification in web crawling applications should be rethought. There are several special cases where this assumption causes problems:

• URLs with excessive parameters usually point to the same webpage, since web applications ignore arbitrary HTTP GET parameters. For instance, the following three URLs point to the same webpage:

http://edition.cnn.com/videos

http://edition.cnn.com/videos?somevar=1

http://edition.cnn.com/videos?somevar=1&other=1

There is no restriction on using any of these URLs. If a web content editor or user mentions one of them for any reason, it would be accepted as valid because it points to a valid webpage: the web server responds with an HTTP 200 status and a correct web document. The problem is that web crawlers would capture three copies of the same webpage.

• Two or more totally different URLs could point to the same webpage. For instance, the following two AUTH university webpages are duplicates:

http://www.auth.gr/invalid-page-1

http://www.auth.gr/invalid-page-2

Both point to the same “Not found” webpage. If URLs such as these are mentioned in any webpage visited by a web crawler, the result is multiple copies of the same webpage.

• Problematic DNS configurations could lead to multiple duplicate web documents. For example, in many cases the handling of the ‘www.’ prefix in websites is not consistent.

For instance, the following two URLs point to exactly the same webpage:

http://www.example.com/

http://example.com/

The correct DNS configuration would make the system respond with an HTTP redirect from one URL to the other, according to the owner’s preference. Currently, web crawlers would consider them two different websites.

We suggest that URLs need to be preprocessed and normalised before being used as unique webpage identifiers. An appropriate solution to this problem is the Sort-friendly URI Reordering Transform (SURT). In short, SURT converts URLs from their original format:

scheme://user@domain.tld:port/path?query#fragment

into the following:

scheme://(tld,domain,:port@user)/path?query#fragment

An example conversion is presented below.

URL: http://edition.cnn.com/tech
SURT: com,cnn,edition)/tech

The ‘(’ and ‘)’ characters serve as an unambiguous notice that the so-called ‘authority’ portion of the URI ([userinfo@]host[:port] in http URIs) has been transformed; the commas prevent confusion with regular hostnames. This remedies the ‘problem’ of standard URIs that the host portion, with its dotted domains, is actually in reverse order from the natural hierarchy that is usually helpful for grouping and sorting. The value of respecting URI case variance is considered negligible: it is vanishingly rare for case variance to be meaningful, while it often arises from people’s confusion or sloppiness, and they only correct it insofar as necessary to avoid blatant problems. Thus, the usual SURT form is flattened to all lowercase and is not completely reversible4. Web archiving systems use SURT internally. For instance, Murray et al. use SURT and certain limits to conduct link analysis in captured web content [104]. Alsum et al. use SURT to create a unique ID for each URI to achieve incremental and distributed processing for the same URI on different web crawling cycles or different machines [5].

4http://crawler.archive.org/articles/user_manual/glossary.html, accessed: August 1, 2015
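To make the transformation concrete, the following Python sketch derives a simplified SURT form from a URL by lowercasing it and reversing the dotted host labels. It is only an approximation of the rules described above (the function name surt_form is ours, and scheme, port and userinfo handling are omitted), not the canonical Heritrix implementation.

from urllib.parse import urlsplit

def surt_form(url):
    # Sketch: simplified SURT form; omits scheme, port and userinfo handling.
    parts = urlsplit(url.lower())          # flatten to lowercase
    host = parts.hostname or ""
    reversed_host = ",".join(reversed(host.split(".")))  # e.g. com,cnn,edition
    surt = reversed_host + ")" + (parts.path or "/")
    if parts.query:
        surt += "?" + parts.query
    return surt

print(surt_form("http://edition.cnn.com/tech"))  # com,cnn,edition)/tech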

Unique webpage identifiers similarity

One of the basic assumptions of the web is that URLs are unique [19]. When a web crawler visits a webpage and encounters duplicate URLs, it does not visit the same URL twice. We suggest that in some cases, near-duplicate URLs could also lead to the same webpage and should be avoided. For instance, the following two URLs lead to the same webpage:

http://vbanos.gr/

http://vbanos.gr

Slight differences in HTTP GET URL parameters, such as lowercase or uppercase characters, unescaped parameters, or any other web-application-specific variable, could also trick web crawlers into processing duplicate webpages. For example, the following could be near-duplicate webpages:

http://vbanos.gr/show?show-greater=10

http://vbanos.gr/page?show-greater=11

Parameter ordering may also trick web crawlers. Example:

http://example.com?a=1&b=2

http://example.com?b=2&a=1

Thus, we propose to detect near-duplicate URLs using standard string similarity methods and to consider webpages with near-duplicate URLs as potential duplicates. The content of these webpages is also evaluated to clarify whether they are indeed duplicates.

We use the Sørensen-Dice coefficient because it is a string similarity algorithm with the following characteristics: (i) low sensitivity to word ordering, (ii) low sensitivity to length variations, and (iii) linear running time [16, 45].

For the sake of experimentation, we consider a 95% similarity threshold as appropriate to define near-duplicate URLs. Finally, we must highlight that the proposed method can be used with either URL or SURT as the unique identifier.
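As an illustration, the following Python sketch computes the Sørensen-Dice coefficient over character bigrams and applies the 95% threshold; the helper names and the bigram tokenisation are our own choices for this sketch, not a prescribed implementation.

from collections import Counter

def bigrams(text):
    # Overlapping character bigrams of a string.
    return [text[i:i + 2] for i in range(len(text) - 1)]

def dice_similarity(a, b):
    # Sørensen-Dice coefficient over bigram multisets: 2 * |X ∩ Y| / (|X| + |Y|).
    x, y = Counter(bigrams(a)), Counter(bigrams(b))
    total = sum(x.values()) + sum(y.values())
    if total == 0:
        return 1.0
    return 2.0 * sum((x & y).values()) / total

def near_duplicate_identifiers(a, b, threshold=0.95):
    # Flag two URLs (or SURTs) as near-duplicates at the 95% threshold.
    return dice_similarity(a, b) >= threshold

# Trailing-slash variants such as those above are flagged as near-duplicates.
print(near_duplicate_identifiers("http://vbanos.gr/", "http://vbanos.gr"))  # True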

Webpage content similarity

Webpage content similarity can be also used to detect duplicate or near-duplicate webpages.

The problem can be defined as:

Detect duplicate webpages: Two webpages which contain exactly the same content.

Detect near-duplicate webpages: Webpages with content that is very similar but not exactly the same. This is a common pattern on the web: a webpage may differ slightly even between two subsequent visits by the same user, because some dynamic parts of the webpage (e.g., a counter or some other widget) are updated.

Digital file checksum algorithms create a unique signature for each file based on its contents. They can be used to identify duplicate webpages but not near-duplicates. We need a very efficient and high-performance algorithm. The simhash algorithm by Charikar can be used to calculate hashes from documents in order to perform fast comparisons [35]. It has already been used very effectively to detect near-duplicates in web search engine applications [95]. This work demonstrates that simhash is appropriate and practical for near-duplicate detection in webpages.

To use simhash, we calculate the simhash signature of every webpage after it is captured and save it in a dictionary together with its URL. Then, when capturing any new page, we compare its simhash signature with the existing ones in the dictionary to find duplicates or near-duplicates. The similarity threshold would be an option according to user needs. For the sake of simplicity and experimentation, we only consider similarity evaluation between two webpages to identify exact similarity or at least 95% similarity.

The potential problem with this approach is that, if a website contains a large number of webpages, it is not efficient to calculate every new webpage’s similarity against all existing webpages, even though simhash is very efficient compared with a bag of words or any other attribute selection method [35].
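A minimal sketch of the idea is shown below. It computes a 64-bit signature from the tokens of a webpage and compares two signatures by Hamming distance; this is a simplified, unweighted variant of Charikar's scheme for illustration only, and mapping the 95% threshold to at most 3 differing bits out of 64 is one possible choice, not the exact rule used in our system.

import hashlib

def simhash(text, bits=64):
    # Sketch: unweighted simhash over whitespace tokens.
    vector = [0] * bits
    for token in text.split():
        digest = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16)
        for i in range(bits):
            vector[i] += 1 if (digest >> i) & 1 else -1
    signature = 0
    for i in range(bits):
        if vector[i] > 0:
            signature |= 1 << i
    return signature

def hamming_distance(a, b):
    # Number of differing bits between two signatures.
    return bin(a ^ b).count("1")

def near_duplicate_content(sig_a, sig_b, max_bits=3):
    # Assumed mapping: roughly 95% bit agreement on 64-bit signatures.
    return hamming_distance(sig_a, sig_b) <= max_bits

sig_a = simhash("example webpage markup tokens go here")
sig_b = simhash("example webpage markup tokens go here with a counter widget")
print(hamming_distance(sig_a, sig_b))

Because comparing two signatures requires only an XOR and a population count, keeping a dictionary of signatures per website remains cheap, although the pairwise comparison cost noted above still applies.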

Webgraph cycle detection

During web crawling, a webgraph is generated in real time using the newly captured webpages as nodes and their links as edges. New branches are created and expanded as the web crawler captures new webpages. The final outcome is a directed acyclic graph [28]. Our method can be summarised as follows: every time a new node is added to the webgraph, we evaluate whether it is a duplicate or near-duplicate of nearby nodes. If it is, the two nodes are merged, their edges are contracted, and we detect potential cycles in the modified graph starting from the new node up to a distance of 𝑁 nodes. If a cycle is detected, we do not proceed with crawling links from this node; otherwise, we continue. A generic version of the web crawling with cycle detection algorithm is presented in Listing 4.2.

In more detail, to implement our method we need a shared global webgraph object in memory, which can be accessed by all web crawler bots. Each webgraph node has the structure shown in Listing 4.1.

struct webgraph-node {
    string webpage-url
    string webpage-surt
    bitstream webpage-content-simhash
    list[string] internal-links
}

Listing 4.1: Webgraph node structure

Webpage-url keeps the original webpage URL without any modification. Webpage-surt is generated from the webpage-url, and webpage-content-simhash is generated from the webpage HTML markup. Only internal links are used, because the scope of our algorithms is to detect duplicate webpages within the same website.
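For reference, a direct Python counterpart of this structure could look as follows; this is only a sketch, with field names mirroring Listing 4.1 and the simhash stored as an integer.

from dataclasses import dataclass, field
from typing import List

@dataclass
class WebgraphNode:
    # Sketch mirroring Listing 4.1; the class and field names are illustrative.
    webpage_url: str                   # original URL, unmodified
    webpage_surt: str                  # SURT derived from webpage_url
    webpage_content_simhash: int       # simhash of the webpage HTML markup
    internal_links: List[str] = field(default_factory=list)  # same-website links only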

global var webgraph

method crawl(URL):
    Fetch webpage from URL
    new-node = create-webgraph-node(URL)
    webgraph->add-node(new-node)
    for limit = (1,...,N):
        near-nodes = webgraph->get-nodes(new-node, limit)
        for node in near-nodes:
            if is-similar(node, new-node):
                webgraph->merge(node, new-node)
    for limit = (1,...,N):
        has-cycle = dfs-check-cycle(webgraph, new-node, limit)
        if has-cycle is True:
            return
    parse webpage and extract all URLs
    save webpage
    for all URLs not seen before:
        crawl(URL)

Listing 4.2: Generic web crawling with cycle detection algorithm

The algorithm can have multiple variations regarding (a) the node similarity method and (b) the maximum node distance evaluated. Webgraph nodes which would otherwise be considered unique, when only the unique URL is taken into account, can now be identified as duplicate or near-duplicate using the methods presented earlier in this section. The potential similarity metrics are presented in Table 4.1.

Table 4.1: Potential webgraph node similarity metrics

Id    Identifier    Identifier Similarity    Content Similarity
𝑆1    URL           No                       No
𝑆2    SURT          No                       No
𝑆3    URL           Yes                      No
𝑆4    SURT          Yes                      No
𝑆5    URL           Yes                      Yes
𝑆6    SURT          Yes                      Yes
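To make the combinations of Table 4.1 concrete, the sketch below dispatches between them; it assumes the dice_similarity and hamming_distance helpers and the WebgraphNode fields from the earlier sketches, and it is only one possible way to wire the metrics together.

def nodes_similar(node_a, node_b, metric="S6", url_threshold=0.95, max_bits=3):
    # Sketch: evaluate two webgraph nodes under one of the S1-S6 metrics of Table 4.1.
    use_surt = metric in ("S2", "S4", "S6")
    id_a = node_a.webpage_surt if use_surt else node_a.webpage_url
    id_b = node_b.webpage_surt if use_surt else node_b.webpage_url

    if metric in ("S1", "S2"):          # exact identifier match only
        return id_a == id_b
    if metric in ("S3", "S4"):          # near-duplicate identifiers
        return dice_similarity(id_a, id_b) >= url_threshold
    # S5, S6: near-duplicate identifiers confirmed by near-duplicate content
    return (dice_similarity(id_a, id_b) >= url_threshold
            and hamming_distance(node_a.webpage_content_simhash,
                                 node_b.webpage_content_simhash) <= max_bits)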

To search for cycles, we use Depth-First Search (DFS) [136] because it is well suited to performing limited searches in potentially infinite graphs. We limit the search distance to 3 nodes because our experiments indicate that deeper searches are not needed to detect cycles. We present such an experiment in Section 4.4.4.
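A minimal sketch of such a depth-limited check is given below. The webgraph is assumed to be a plain adjacency dictionary, and the function reports whether any path of at most max_depth edges starting from the new node revisits a node already on that path.

def has_cycle_within(graph, start, max_depth=3):
    # Sketch: depth-limited DFS cycle check; `graph` maps node -> list of successors.
    def dfs(node, depth, on_path):
        if depth == max_depth:
            return False
        for neighbour in graph.get(node, ()):
            if neighbour in on_path:          # the path revisits a node: cycle found
                return True
            if dfs(neighbour, depth + 1, on_path | {neighbour}):
                return True
        return False
    return dfs(start, 0, {start})

# Example: a -> b and b -> a form a cycle reachable within two edges of a.
graph = {"a": ["b"], "b": ["a", "c"], "c": []}
print(has_cycle_within(graph, "a"))  # True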

We must note that our method is very efficient because we do not need to save the contents of every webpage, but only the specific webpage attributes presented in Listing 4.1. Also, our method uses a single shared webgraph model in memory, regardless of the number of web crawler processes, with locking mechanisms when adding or removing nodes. Since web crawling is I/O bound, this architecture does not incur performance penalties.
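A sketch of this shared-graph arrangement is shown below, using a single lock to serialise node additions and merges; the class and method names are illustrative and do not necessarily match the WebGraph-It implementation.

import threading

class SharedWebgraph:
    # Sketch: one in-memory webgraph shared by all crawler workers.
    def __init__(self):
        self._lock = threading.Lock()
        self.nodes = {}   # surt -> node object
        self.edges = {}   # surt -> set of successor surts

    def add_node(self, node):
        with self._lock:
            self.nodes[node.webpage_surt] = node
            self.edges.setdefault(node.webpage_surt, set())

    def merge(self, keep, drop):
        # Contract `drop` into `keep`: redirect its edges and remove the node.
        with self._lock:
            self.edges.setdefault(keep.webpage_surt, set()).update(
                self.edges.pop(drop.webpage_surt, set()))
            for successors in self.edges.values():
                if drop.webpage_surt in successors:
                    successors.discard(drop.webpage_surt)
                    successors.add(keep.webpage_surt)
            self.edges[keep.webpage_surt].discard(keep.webpage_surt)  # drop self-loop
            self.nodes.pop(drop.webpage_surt, None)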