
We presented our extended work towards the foundation of a quantitative method to evaluate WA. The Credible Live Evaluation of Archive Readiness Plus (CLEAR+) method to evaluate Website Archivability has been elaborated in great detail, the key Facets of WA have been defined, and the method for calculating them has been explained in theory and in practice.

In addition, we presented the ArchiveReady system, which is the reference implementation of CLEAR+. We gave an overview of all aspects of the system, including design decisions, technologies, workflows and interoperability APIs. We believe that it is quite important to explain how the reference implementation of CLEAR+ works, because transparency raises confidence in the method.

A critical part of this work is also the experimental evaluation. First, we performed experimental WA evaluations of assorted datasets and observed the behaviour of our metrics. Then, we conducted a manual characterisation of websites to create a reference standard and identified correlations with WA. Both evaluations provided very positive results, which support that CLEAR+ can be used to identify whether a website has the potential to be archived with correctness and accuracy. We also showed experimentally that the CLEAR+ method needs to evaluate only a single webpage to calculate the WA of a website, based on the assumption that webpages from the same website share the same components, standards and technologies.

Finally, we evaluated the WA of the most prevalent WCMS, one of the common technical denominators of current websites. We investigated the extent to which each WCMS meets the conditions for a safe transfer of its content to a web archive for preservation purposes, and thus identified their strengths and weaknesses. More importantly, we deduced specific recommendations to improve the WA of each WCMS, aiming to advance the general practice of web data extraction and archiving.

Introducing a new metric to quantify the previously unquantifiable notion of WA is not an easy task. We believe that, with the CLEAR+ method and the WA metric, we have captured the core aspects of a website that are crucial in diagnosing whether it has the potential to be archived with correctness and accuracy.

Near-duplicate and Cycle Detection in Webgraphs towards Optimised Web Crawling

The performance and efficiency of web crawling are important for many applications, such as search engines, web archives and online news. We propose methods to optimise web crawling through duplicate and near-duplicate webpage detection. Using webgraphs to model web crawling, we perform webgraph edge contractions and detect web spider traps, improving the performance and efficiency of web crawling as well as the quality of its results. We introduce http://webgraph-it.com (WebGraph-It), a web platform which implements the presented methods, and conduct extensive experiments using real-world web data to evaluate the strengths and weaknesses of our methods1.

4.1 Introduction

Websites have become large and complex systems, which require strong software systems to be managed effectively [22]. Web content extraction, or web crawling, is becoming increasingly important. It is crucial to have web crawlers capable of efficiently traversing websites to harvest their content. The sheer size of the web, combined with an unpredictable publishing rate of new information, calls for a highly scalable system, while the lack of programmatic access to the complete web content makes the use of automatic extraction techniques necessary.

1This chapter is based on the following publication:

• Banos V., Manolopoulos Y.: “Near-duplicate and Cycle Detection in Webgraphs towards Optimised Web Crawling”, ACM Transactions on the Web Journal, submitted, 2015.

Special software systems, the web crawlers, also known as “spiders” or “bots”, have been created to conduct web crawling with efficiency and performance at large scale. They are self-acting agents that navigate around-the-clock through the hyperlinks of the web, harvesting topical resources without human supervision [112]. Essentially, a web crawler starts from a seed webpage and then uses the hyperlinks within it to visit other webpages. This process repeats with every new webpage until some condition is met, e.g. a maximum number of webpages has been visited or no new hyperlinks are detected (a minimal sketch of this baseline loop is given after the list below). Despite the simplicity of the basic algorithm, web crawling has many challenges [12]. In this work, we focus on addressing two key issues:

• There are a lot of duplicate or near-duplicate data captured during web crawling. Such data are considered superfluous and, thus, great effort is necessary to detect and remove them after crawling [95]. To the best of our knowledge, there is no method to perform this task during web crawling.

• Web spider traps are sets of webpages that cause web crawlers to make an infinite number of requests. They result in software crashes, web crawling disruption and excessive waste of computing resources [110]. There is no automated way to detect and avoid web spider traps; web crawling engineers use various heuristics with limited success.
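The baseline crawling loop referred to before the list above can be made concrete with the following sketch: a breadth-first traversal driven by a frontier queue and a visited set, stopping when the page budget is reached or no new hyperlinks remain. This is an illustrative sketch only; `fetch` and `extract_links` are hypothetical placeholders for the HTTP client and link extractor of a concrete crawler, not components of any system described in this chapter.

```python
from collections import deque
from urllib.parse import urljoin

def crawl(seed, fetch, extract_links, max_pages=1000):
    """Baseline breadth-first crawl: start from a seed URI and follow
    hyperlinks until the page budget is reached or the frontier is empty.
    `fetch(uri)` and `extract_links(html)` are caller-supplied placeholders
    (HTTP client and HTML link extractor, respectively)."""
    frontier, visited = deque([seed]), set()
    while frontier and len(visited) < max_pages:
        uri = frontier.popleft()
        if uri in visited:
            continue                      # already harvested
        visited.add(uri)
        html = fetch(uri)
        for link in extract_links(html):
            target = urljoin(uri, link)   # resolve relative hyperlinks
            if target not in visited:
                frontier.append(target)
    return visited
```

Note that this loop keys its visited set on the raw URI; the concepts introduced below refine exactly this decision.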

These issues greatly impact the performance of web crawling systems and the experience of their users. To address them, we explore some fundamental web crawling concepts and present various methods to improve baseline web crawling:

Unique webpage identifier selection: URI is the de facto standard for unique webpage identification, but web archiving systems also use the Sort-friendly URI Reordering Transform (SURT)2, a transformation applied to URIs which makes their left-to-right representation better match the natural hierarchy of domain names3. We suggest using SURT as an alternative unique webpage identifier for web crawling applications.
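As an illustration of the idea (not the canonical Heritrix implementation, which also normalises ports, userinfo and other URI components), the sketch below reverses the host labels of a URI so that its left-to-right order follows the domain-name hierarchy:

```python
from urllib.parse import urlsplit

def surt(uri):
    """Illustrative SURT-style transform: reverse the host labels so that
    the left-to-right order matches the domain-name hierarchy."""
    parts = urlsplit(uri)
    host_labels = parts.hostname.split(".") if parts.hostname else []
    reversed_host = ",".join(reversed(host_labels)) + ","
    path = parts.path or "/"
    query = "?" + parts.query if parts.query else ""
    return f"{parts.scheme}://({reversed_host}){path}{query}"

print(surt("http://www.example.com/news/index.html"))
# -> http://(com,example,www,)/news/index.html
```

Because the registered domain now comes first, all SURTs of the same site sort together, which simplifies grouping and comparison during crawling.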

Unique webpage identifier similarity: The unique URI is the de facto standard, but we also look into near-duplicates. It is possible that two near-duplicate URIs or SURTs belong to the same webpage.
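The exact similarity measure is not fixed at this point; as one assumed example, a normalised character-level ratio already separates near-duplicate identifiers, such as two session-id variants of the same URL, from unrelated ones:

```python
from difflib import SequenceMatcher

def identifier_similarity(a, b):
    """Normalised similarity ratio between two URIs (or SURTs) in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()

# Two session-id variants of the same page (hypothetical URLs):
u1 = "http://(com,example,)/catalog/item?id=42&sid=abc123"
u2 = "http://(com,example,)/catalog/item?id=42&sid=xyz789"
print(round(identifier_similarity(u1, u2), 2))  # ~0.88 for this near-duplicate pair
```

A threshold on this ratio then decides whether two identifiers are treated as candidates for the same webpage.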

Webpage content similarity: Duplicate and near-duplicate webpage content detection can be used in conjunction with unique webpage identifier similarity.
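Any standard near-duplicate detection technique can fill this role; as one assumed example, word-level shingling combined with Jaccard similarity flags two pages whose visible text differs only marginally:

```python
import re

def shingles(text, k=5):
    """Set of k-word shingles extracted from a page's visible text."""
    words = re.findall(r"\w+", text.lower())
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity of two shingle sets (1.0 means identical)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

page_a = "Latest news from the city council meeting held on Monday evening"
page_b = "Latest news from the city council meeting held on Tuesday evening"
print(round(jaccard(shingles(page_a), shingles(page_b)), 2))  # ~0.56: most shingles shared
```

A threshold on this value decides whether the two pages count as near-duplicates and can therefore be merged during crawling.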

Webgraph edge contraction: Modeling websites as webgraphs [28] during crawling, we can apply node merging, using the previous three concepts as similarity criteria, to achieve webgraph edge contraction and cycle detection.
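The sketch below illustrates the idea on a toy directed webgraph using the networkx library: two nodes already flagged as near-duplicates are contracted into one, and the cycles that remain are reported as candidate spider traps. The node names and the choice of networkx are illustrative assumptions, not part of the WebGraph-It implementation.

```python
import networkx as nx

# Toy webgraph: a page links to a near-duplicate print view of an item,
# and to a calendar whose "next/previous month" pages link to each other.
G = nx.DiGraph()
G.add_edges_from([
    ("/a", "/item?id=1"),
    ("/a", "/item?id=1&print=1"),        # near-duplicate of /item?id=1
    ("/a", "/calendar?m=1"),
    ("/calendar?m=1", "/calendar?m=2"),
    ("/calendar?m=2", "/calendar?m=1"),  # potential spider trap
])

# Edge contraction: merge the near-duplicate node into its counterpart,
# assuming a similarity test has already judged them equivalent.
G = nx.contracted_nodes(G, "/item?id=1", "/item?id=1&print=1", self_loops=False)

# Remaining cycles in the contracted webgraph are spider-trap candidates.
for cycle in nx.simple_cycles(G):
    print(cycle)   # e.g. ['/calendar?m=1', '/calendar?m=2']
```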

Using these concepts, we establish a theoretical framework as well as novel web crawling methods, which provide us with the following information for any target website: (i) unique and valid webpages, (ii) hyperlinks between them, (iii) duplicate and near-duplicate webpages, (iv) web spider trap locations, and (v) a webgraph model of the website.

2http://crawler.archive.org/apidocs/org/archive/util/SURT.html, accessed August 1, 2015

3http://crawler.archive.org/articles/user_manual/glossary.html, accessed August 1, 2015

We also present WebGraph-It, a system which implements our methods and is available at http://webgraph-it.com. Web crawling engineers could use WebGraph-It to preprocess websites prior to web crawling in order to obtain lists of URLs to avoid as duplicates or near-duplicates, obtain URLs to avoid as web spider traps, and generate webgraphs of the target website.

Finally, we conduct an experiment with a non-trivial dataset of websites to evaluate the proposed methods. Our contributions can be summarised as follows:

• We propose a set of methods to detect duplicate and near-duplicate webpages in real time during web crawling.

• We propose a set of methods to detect web spider traps using webgraphs in real time during web crawling.

• We introduce WebGraph-It, a web platform which implements the proposed methods.

The remainder of this chapter is organised as follows: Section 4.2 presents the main concepts of our methods and introduces new web crawling methods that use them to detect duplicate and near-duplicate content, as well as to detect web spider traps. Section 4.3 presents the system architecture of WebGraph-It. Section 4.4 presents our experiments and detailed results. Finally, Section 4.5 discusses the results and presents future work.