

We also present WebGraph-It, a system which implements our methods and is available at http://webgraph-it.com. Web crawling engineers could use WebGraph-It to preprocess websites prior to web crawling in order to obtain lists of URLs to avoid as duplicates or near-duplicates, identify URLs to avoid as web spider traps, and generate webgraphs of the target website.

Finally, we conduct an experiment with a non-trivial dataset of websites to evaluate the proposed methods. Our contributions can be summarized as follows:

• we propose a set of methods to detect duplicate and near-duplicate webpages in real time during web crawling.

• we propose a set of methods to detect web spider traps using webgraphs in real time during web crawling.

• we introduce WebGraph-It, a web platform which implements the proposed methods.

The remainder of this chapter is organised as follows: Section 4.2 presents the main concepts of our methods and introduces new web crawling methods that use them to detect duplicate and near-duplicate content, as well as web spider traps. Section 4.3 presents the system architecture of WebGraph-It. Section 4.4 presents our experiments and detailed results.

Finally, Section 4.5 discusses results and presents future work.

4.2 Method

4.2.1 Key Concepts

A well-known problem in web crawling is that web crawlers may keep capturing webpages even though no new web content is available. This issue is also known as web spider traps and is detailed in Section 4.4.5.

To devise more efficient web crawling methods, we exploit three concepts which, to the best of our knowledge, have not been fully exploited to date:

Unique webpage identifier selection: Which webpage attribute is considered as its unique identifier?

Unique webpage identifiers similarity: Which webpage unique identifiers should be considered similar?

Webpage content similarity: Which webpage content should be considered similar?

Using these concepts, we identify duplicate or near-duplicate web content, which highlights webgraph nodes that contain little or no new information and can thus be removed. These findings result in webgraph edge contractions and restructuring. In addition, this process enables webgraph cycle detection. The result is a reduction of webgraph complexity, improving the efficiency of the web crawling process and the quality of its results.

We must note that each of the presented concepts can be used not only independently but also in conjunction with the others. In the following, we analyse each concept in detail.

Unique webpage identifier selection

Uniform Resource Identifiers (URIs) are the de facto standard for the unique identification of web resources on the WWW. The architecture of the WWW is based on the Uniform Resource Locator (URL), a subset of the URI which, in addition to identifying a resource, provides a means of locating it by describing its primary access mechanism (e.g., its network “location”) [19]. Many web-related technologies, such as the Semantic Web and Linked Open Data, use URLs [21].

We suggest that the use of URLs for unique webpage identification in web crawling applications should be rethought. There are several special cases where this assumption causes problems:

• URLs with excessive parameters usually point to the same webpage, since web applications ignore arbitrary HTTP GET parameters. For instance, the following three URLs point to the same webpage:

http://edition.cnn.com/videos

http://edition.cnn.com/videos?somevar=1

http://edition.cnn.com/videos?somevar=1&other=1

There is no restriction on using any of these URLs. If a web content editor or user mentions one of them for any reason, it would be accepted as valid because it points to a valid webpage: the web server responds with an HTTP 200 status and a correct web document. The problem is that web crawlers would capture three copies of the same webpage.

• Two or more totally different URLs could point to the same webpage. For instance, the following two AUTH university webpages are duplicates:

http://www.auth.gr/invalid-page-1

http://www.auth.gr/invalid-page-2

Both point to the same “Not found” webpage. If URLs such as these are mentioned in any webpage visited by a web crawler, the result is multiple copies of the same webpage.

• Problematic DNS configurations could lead to multiple duplicate web documents. For example, in many cases the handling of the ‘www.’ prefix in websites is not consistent.

For instance, the following two URLs point to exactly the same webpage:

http://www.example.com/

http://example.com/

The correct DNS configuration would make the system respond with an HTTP redirect from one URL to the other, according to the owner’s preference. Currently, web crawlers would consider them two different websites.

We suggest that URLs need to be preprocessed and normalised before being used as unique webpage identifiers. An appropriate solution to this problem is the Sort-friendly URI Reordering Transform (SURT). In short, SURT converts URLs from their original format:

scheme://user@domain.tld:port/path?query#fragment

into the following:

scheme://(tld,domain,:port@user)/path?query#fragment

An example conversion is presented below.

URL: http://edition.cnn.com/tech
SURT: com,cnn,edition)/tech

The ‘(’ and ‘)’ characters serve as an unambiguous notice that the so-called ‘authority’ portion of the URI ([userinfo@]host[:port] in http URIs) has been transformed; the commas prevent confusion with regular hostnames. This remedies the ‘problem’ of standard URIs that the host portion, with its dotted domains, is actually in reverse order from the natural hierarchy that is usually helpful for grouping and sorting. The value of respecting URI case variance is considered negligible: it is vanishingly rare for case variance to be meaningful, while it often arises from people’s confusion or sloppiness, and they only correct it insofar as necessary to avoid blatant problems. Thus, the usual SURT form is flattened to all lowercase and is not completely reversible4. Web archiving systems use SURT internally. For instance, Murray et al. use SURT and certain limits to conduct link analysis in captured web content [104]. Alsum et al. use SURT to create a unique ID for each URI to achieve incremental and distributed processing for the same URI on different web crawling cycles or different machines [5].

4http://crawler.archive.org/articles/user_manual/glossary.html, accessed: August 1, 2015
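To make the transformation concrete, the following Python sketch derives a simplified SURT form from a URL by lowercasing it and reversing the dotted host labels. It is only an approximation of the rules described above (the function name surt_form is ours, and scheme, port and userinfo handling are omitted), not the canonical Heritrix implementation.

from urllib.parse import urlsplit

def surt_form(url):
    # Sketch: simplified SURT form; omits scheme, port and userinfo handling.
    parts = urlsplit(url.lower())          # flatten to lowercase
    host = parts.hostname or ""
    reversed_host = ",".join(reversed(host.split(".")))  # e.g. com,cnn,edition
    surt = reversed_host + ")" + (parts.path or "/")
    if parts.query:
        surt += "?" + parts.query
    return surt

print(surt_form("http://edition.cnn.com/tech"))  # com,cnn,edition)/tech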

Unique webpage identifiers similarity

One of the basic assumptions of the web is that URLs are unique [19]. When a web crawler visits a webpage and encounters duplicate URLs, it does not visit the same URL twice. We suggest that in some cases, near-duplicate URLs could also lead to the same webpage and should be avoided. For instance, the following two URLs lead to the same webpage:

http://vbanos.gr/

http://vbanos.gr

Slight differences in HTTP GET URL parameters, such as lowercase or uppercase characters, unescaped parameters, or any other web-application-specific variable, could also trick web crawlers into processing duplicate webpages. For example, the following could be near-duplicate webpages:

http://vbanos.gr/show?show-greater=10

http://vbanos.gr/page?show-greater=11

Parameter ordering may also trick web crawlers. Example:

http://example.com?a=1&b=2

http://example.com?b=2&a=1

Thus, we propose to detect near-duplicate URLs using standard string similarity methods and to consider webpages with near-duplicate URLs as potential duplicates. The content of these webpages is also evaluated to clarify whether they are indeed duplicates.

We use the Sørensen-Dice coefficient because it is a string similarity algorithm with the following characteristics: (i) low sensitivity to word ordering, (ii) low sensitivity to length variations, and (iii) linear running time [16, 45].

For the sake of experimentation, we consider a 95% similarity threshold as appropriate to define near-duplicate URLs. Finally, we must highlight that the proposed method can be used with either URL or SURT as the unique identifier.
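As an illustration, the following Python sketch computes the Sørensen-Dice coefficient over character bigrams and applies the 95% threshold; the helper names and the bigram tokenisation are our own choices for this sketch, not a prescribed implementation.

from collections import Counter

def bigrams(text):
    # Overlapping character bigrams of a string.
    return [text[i:i + 2] for i in range(len(text) - 1)]

def dice_similarity(a, b):
    # Sørensen-Dice coefficient over bigram multisets: 2 * |X ∩ Y| / (|X| + |Y|).
    x, y = Counter(bigrams(a)), Counter(bigrams(b))
    total = sum(x.values()) + sum(y.values())
    if total == 0:
        return 1.0
    return 2.0 * sum((x & y).values()) / total

def near_duplicate_identifiers(a, b, threshold=0.95):
    # Flag two URLs (or SURTs) as near-duplicates at the 95% threshold.
    return dice_similarity(a, b) >= threshold

# Trailing-slash variants such as those above are flagged as near-duplicates.
print(near_duplicate_identifiers("http://vbanos.gr/", "http://vbanos.gr"))  # True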

Webpage content similarity

Webpage content similarity can be also used to detect duplicate or near-duplicate webpages.

The problem can be defined as:

Detect duplicate webpages: Two webpages which contain exactly the same content.

Detect near-duplicate webpages: Webpages with content that is very similar but not exactly the same. This is a common pattern on the web: a webpage may differ slightly even between two subsequent visits by the same user, because some dynamic parts of the webpage (e.g., a counter or some other widget) are updated.

Digital file checksum algorithms create a unique signature for each file based on its contents. They can be used to identify duplicate webpages but not near-duplicates. We need a very efficient and high-performance algorithm. The simhash algorithm by Charikar can be used to calculate hashes from documents in order to perform fast comparisons [35]. It has already been used very effectively to detect near-duplicates in web search engine applications [95]. This work demonstrates that simhash is appropriate and practical for near-duplicate detection in webpages.

To use simhash, we calculate the simhash signature of every webpage after it is captured and save it in a dictionary together with its URL. Then, when capturing any new page, we compare its simhash signature with the existing ones in the dictionary to find duplicates or near-duplicates. The similarity threshold would be an option according to user needs. For the sake of simplicity and experimentation, we only consider similarity evaluation between two webpages to identify exact similarity or at least 95% similarity.

The potential problem with this approach is that, if a website contains a large number of webpages, it is not efficient to calculate every new webpage’s similarity against all existing webpages, even though simhash is very efficient compared with a bag of words or any other attribute selection method [35].
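A minimal sketch of the idea is shown below. It computes a 64-bit signature from the tokens of a webpage and compares two signatures by Hamming distance; this is a simplified, unweighted variant of Charikar's scheme for illustration only, and mapping the 95% threshold to at most 3 differing bits out of 64 is one possible choice, not the exact rule used in our system.

import hashlib

def simhash(text, bits=64):
    # Sketch: unweighted simhash over whitespace tokens.
    vector = [0] * bits
    for token in text.split():
        digest = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16)
        for i in range(bits):
            vector[i] += 1 if (digest >> i) & 1 else -1
    signature = 0
    for i in range(bits):
        if vector[i] > 0:
            signature |= 1 << i
    return signature

def hamming_distance(a, b):
    # Number of differing bits between two signatures.
    return bin(a ^ b).count("1")

def near_duplicate_content(sig_a, sig_b, max_bits=3):
    # Assumed mapping: roughly 95% bit agreement on 64-bit signatures.
    return hamming_distance(sig_a, sig_b) <= max_bits

sig_a = simhash("example webpage markup tokens go here")
sig_b = simhash("example webpage markup tokens go here with a counter widget")
print(hamming_distance(sig_a, sig_b))

Because comparing two signatures requires only an XOR and a population count, keeping a dictionary of signatures per website remains cheap, although the pairwise comparison cost noted above still applies.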

Webgraph cycle detection

During web crawling, a webgraph is generated in real time using the newly captured webpages as nodes and their links as edges. New branches are created and expanded as the web crawler captures new webpages. The final outcome is a directed acyclic graph [28]. Our method can be summarised as follows: every time a new node is added to the webgraph, we evaluate whether it is a duplicate or near-duplicate of nearby nodes. If it is, the two nodes are merged, their edges are contracted, and we detect potential cycles in the modified graph starting from the new node up to a distance of 𝑁 nodes. If a cycle is detected, we do not proceed with crawling links from this node; otherwise, we continue. A generic version of the web crawling with cycle detection algorithm is presented in Listing 4.2.

In more detail, to implement our method we need a shared global webgraph object in memory, which can be accessed by all web crawler bots. Each webgraph node has the structure shown in Listing 4.1.

struct webgraph-node {
    string webpage-url
    string webpage-surt
    bitstream webpage-content-simhash
    list[string] internal-links
}

Listing 4.1: Webgraph node structure

Webpage-url keeps the original webpage URL without any modification. Webpage-surt is generated from the webpage-url, and webpage-content-simhash is generated from the webpage HTML markup. Only internal links are used, because the scope of our algorithms is to detect duplicate webpages within the same website.
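For reference, a direct Python counterpart of this structure could look as follows; this is only a sketch, with field names mirroring Listing 4.1 and the simhash stored as an integer.

from dataclasses import dataclass, field
from typing import List

@dataclass
class WebgraphNode:
    # Sketch mirroring Listing 4.1; the class and field names are illustrative.
    webpage_url: str                   # original URL, unmodified
    webpage_surt: str                  # SURT derived from webpage_url
    webpage_content_simhash: int       # simhash of the webpage HTML markup
    internal_links: List[str] = field(default_factory=list)  # same-website links only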

global var webgraph

method crawl(URL):
    Fetch webpage from URL
    new-node = create-webgraph-node(URL)
    webgraph->add-node(new-node)
    for limit = (1,...,N):
        near-nodes = webgraph->get-nodes(new-node, limit)
        for node in near-nodes:
            if is-similar(node, new-node):
                webgraph->merge(node, new-node)
    for limit = (1,...,N):
        has-cycle = dfs-check-cycle(webgraph, new-node, limit)
        if has-cycle is True:
            return
    parse webpage and extract all URLs
    save webpage
    for all URLs not seen before:
        crawl(URL)

Listing 4.2: Generic web crawling with cycle detection algorithm

The algorithm can have multiple variations regarding (a) the node similarity method and (b) the maximum node distance evaluated. Webgraph nodes which would otherwise be considered unique, when only the unique URL is taken into account, can now be identified as duplicate or near-duplicate using the methods presented earlier in this section. The potential similarity metrics are presented in Table 4.1.

Table 4.1: Potential webgraph node similarity metrics

Id    Identifier    Identifier Similarity    Content Similarity
𝑆1    URL           No                       No
𝑆2    SURT          No                       No
𝑆3    URL           Yes                      No
𝑆4    SURT          Yes                      No
𝑆5    URL           Yes                      Yes
𝑆6    SURT          Yes                      Yes
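To make the combinations of Table 4.1 concrete, the sketch below dispatches between them; it assumes the dice_similarity and hamming_distance helpers and the WebgraphNode fields from the earlier sketches, and it is only one possible way to wire the metrics together.

def nodes_similar(node_a, node_b, metric="S6", url_threshold=0.95, max_bits=3):
    # Sketch: evaluate two webgraph nodes under one of the S1-S6 metrics of Table 4.1.
    use_surt = metric in ("S2", "S4", "S6")
    id_a = node_a.webpage_surt if use_surt else node_a.webpage_url
    id_b = node_b.webpage_surt if use_surt else node_b.webpage_url

    if metric in ("S1", "S2"):          # exact identifier match only
        return id_a == id_b
    if metric in ("S3", "S4"):          # near-duplicate identifiers
        return dice_similarity(id_a, id_b) >= url_threshold
    # S5, S6: near-duplicate identifiers confirmed by near-duplicate content
    return (dice_similarity(id_a, id_b) >= url_threshold
            and hamming_distance(node_a.webpage_content_simhash,
                                 node_b.webpage_content_simhash) <= max_bits)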

To search for cycles, we use Depth-First Search (DFS) [136] because it is well suited to performing limited searches in potentially infinite graphs. We limit the search distance to 3 nodes because our experiments indicate that deeper searches are not needed to detect cycles. We present such an experiment in Section 4.4.4.
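A minimal sketch of such a depth-limited check is given below. The webgraph is assumed to be a plain adjacency dictionary, and the function reports whether any path of at most max_depth edges starting from the new node revisits a node already on that path.

def has_cycle_within(graph, start, max_depth=3):
    # Sketch: depth-limited DFS cycle check; `graph` maps node -> list of successors.
    def dfs(node, depth, on_path):
        if depth == max_depth:
            return False
        for neighbour in graph.get(node, ()):
            if neighbour in on_path:          # the path revisits a node: cycle found
                return True
            if dfs(neighbour, depth + 1, on_path | {neighbour}):
                return True
        return False
    return dfs(start, 0, {start})

# Example: a -> b and b -> a form a cycle reachable within two edges of a.
graph = {"a": ["b"], "b": ["a", "c"], "c": []}
print(has_cycle_within(graph, "a"))  # True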

We must note that our method is very efficient because we do not need to save the contents of every webpage, but only the specific webpage attributes presented in Listing 4.1. Also, our method uses a single shared webgraph model in memory, regardless of the number of web crawler processes, with locking mechanisms when adding or removing nodes. Since web crawling is I/O bound, this architecture does not incur performance penalties.
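A sketch of this shared-graph arrangement is shown below, using a single lock to serialise node additions and merges; the class and method names are illustrative and do not necessarily match the WebGraph-It implementation.

import threading

class SharedWebgraph:
    # Sketch: one in-memory webgraph shared by all crawler workers.
    def __init__(self):
        self._lock = threading.Lock()
        self.nodes = {}   # surt -> node object
        self.edges = {}   # surt -> set of successor surts

    def add_node(self, node):
        with self._lock:
            self.nodes[node.webpage_surt] = node
            self.edges.setdefault(node.webpage_surt, set())

    def merge(self, keep, drop):
        # Contract `drop` into `keep`: redirect its edges and remove the node.
        with self._lock:
            self.edges.setdefault(keep.webpage_surt, set()).update(
                self.edges.pop(drop.webpage_surt, set()))
            for successors in self.edges.values():
                if drop.webpage_surt in successors:
                    successors.discard(drop.webpage_surt)
                    successors.add(keep.webpage_surt)
            self.edges[keep.webpage_surt].discard(keep.webpage_surt)  # drop self-loop
            self.nodes.pop(drop.webpage_surt, None)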