
The WebGraph-It System Architecture

Table 4.2 summarises the presented web crawling algorithms. We note that we do not exhaust all potential method combinations; rather, we focus on a substantial set, which is sufficient to explore the value of our methods. In the following, we present the WebGraph-It platform, which implements the presented algorithms.

Table 4.2: Web crawling algorithms summary

Id    Identifier Selection    Identifier Similarity    Content Similarity    Cycle Detection
𝐶1    URL                     No                       No                    No
𝐶2    SURT                    No                       No                    No
𝐶3    URL                     Yes                      No                    No
𝐶4    SURT                    Yes                      No                    No
𝐶5    URL                     Yes                      Yes                   No
𝐶6    SURT                    Yes                      Yes                   No
𝐶7    URL                     No                       Yes                   Yes
𝐶8    SURT                    No                       Yes                   Yes
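To make the combinations concrete, the table can be read as a small configuration space. The following sketch enumerates the eight variants in Python; the names CrawlerConfig and CONFIGS are illustrative only and not part of the WebGraph-It code base.

from collections import namedtuple

# Each algorithm Ci is a combination of four switches (see Table 4.2).
CrawlerConfig = namedtuple(
    "CrawlerConfig",
    ["identifier", "identifier_similarity", "content_similarity", "cycle_detection"],
)

CONFIGS = {
    "C1": CrawlerConfig("URL",  False, False, False),
    "C2": CrawlerConfig("SURT", False, False, False),
    "C3": CrawlerConfig("URL",  True,  False, False),
    "C4": CrawlerConfig("SURT", True,  False, False),
    "C5": CrawlerConfig("URL",  True,  True,  False),
    "C6": CrawlerConfig("SURT", True,  True,  False),
    "C7": CrawlerConfig("URL",  False, True,  True),
    "C8": CrawlerConfig("SURT", False, True,  True),
}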

data, h) PhantomJS, a headless WebKit browser scriptable with a JavaScript API, which has fast, native support for various web standards: DOM handling, CSS selectors, JSON, Canvas and SVG, and i) JavaScript and CSS libraries such as Bootstrap for the user interface. An overview of the system architecture is presented in Figure 4.1.

Figure 4.1: WebGraph-It system architecture

We explain some of our system architecture decisions, as they are important for the implementation of our methods. First, we choose a micro-services architecture because we need to separate the web crawling logic, the data storage and the user interface. This way, we can upgrade each subsystem without affecting the others. For instance, we could create a new, more user-friendly web interface or a public REST API for WebGraph-It without modifying the web crawling logic.
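As an illustration of this separation, a front-end service only needs to accept a crawl request and hand it to the back-end over a shared job queue. The sketch below is a minimal example using Flask and RQ; the route and the job name crawler.jobs.start_crawl are our own assumptions, not the actual WebGraph-It API.

from flask import Flask, jsonify, request
from redis import Redis
from rq import Queue

app = Flask(__name__)
queue = Queue("standard", connection=Redis())  # job queue shared with the back-end


@app.route("/crawls", methods=["POST"])
def create_crawl():
    # The UI layer only validates input and enqueues work;
    # the crawling logic lives entirely in the back-end workers.
    seed_url = request.json["seed_url"]
    job = queue.enqueue("crawler.jobs.start_crawl", seed_url)
    return jsonify({"job_id": job.get_id()}), 202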

We use asynchronous job queues in the back-end to define and conduct the web crawling process because this scheme is flexible: we can run an arbitrary number of worker processes on one or more servers, and thus the system is resilient to faults caused by unexpected conditions. A single process can crash without affecting the others. Also, the results are kept in the job queues and can be evaluated later.
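A minimal sketch of the worker side of such a setup, again assuming Redis-backed RQ queues rather than WebGraph-It's exact tooling: each server simply starts as many of these worker processes as desired, and a crash of one worker does not affect the queue or the other workers.

from redis import Redis
from rq import Queue, Worker

# Any number of these processes can run on one or more servers;
# they all consume jobs from the same Redis-backed queue.
connection = Redis(host="localhost", port=6379)
worker = Worker([Queue("standard", connection=connection)], connection=connection)
worker.work()  # blocks, fetching and executing crawl jobs; results and failures remain inspectable in Redis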

Figure 4.2: Viewing a webgraph in the http://webgraph-it.com web application

We use Python to implement the front-end and the back-end subsystems because of its many features, including robust MVC frameworks and networking libraries such as python-requests⁵. Python also has a large set of libraries which implement algorithms such as simhash⁶ and Sørensen-Dice, as well as graph analysis (NetworkX⁷) and numeric calculations (numpy⁸).
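To illustrate the kind of similarity computations these libraries support, the following self-contained sketch computes a 64-bit simhash fingerprint and a Sørensen-Dice coefficient over token sets; it is a simplified stand-in for the library calls, not the exact WebGraph-It code.

import hashlib


def simhash(text, bits=64):
    # Weighted bitwise vote over token hashes, thresholded into a fingerprint.
    vector = [0] * bits
    for token in text.split():
        h = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16)
        for i in range(bits):
            vector[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if vector[i] > 0)


def sorensen_dice(text_a, text_b):
    # 2|A ∩ B| / (|A| + |B|) over the two pages' token sets.
    a, b = set(text_a.split()), set(text_b.split())
    if not a and not b:
        return 1.0
    return 2.0 * len(a & b) / (len(a) + len(b))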

We use PhantomJS to improve the ability of our web crawler to process webpages which use JavaScript, AJAX and other web technologies that are difficult to handle with plain HTML processing. Using PhantomJS, we render JavaScript in webpages and extract dynamic content. In the past, this method has been tested successfully in web crawling work [16].
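One common way to drive PhantomJS from Python, and a plausible approximation of this step, is Selenium's (now deprecated) PhantomJS driver; the sketch below renders a page and returns the post-JavaScript DOM, from which links can then be extracted. The helper name fetch_rendered_html is ours, not part of WebGraph-It.

from selenium import webdriver


def fetch_rendered_html(url, timeout=30):
    # PhantomJS executes the page's JavaScript, so the returned source
    # contains dynamically generated content and links.
    driver = webdriver.PhantomJS()  # assumes the phantomjs binary is on PATH
    try:
        driver.set_page_load_timeout(timeout)
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()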

We use Redis⁹ to store temporary data in memory because of its extremely high performance and its ability to support many data structures and multiple clients in parallel. Web crawling is performed by multiple software agents (web crawling processes), which can be distributed across one or more servers. The WebGraph-It architecture uses Redis as a common temporary data store to maintain asynchronous job queues, webgraph structures, visited URL lists, SURT lists, webpages' simhash values and other vital web crawl information.
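The sketch below shows how this shared state might look with the redis-py client; the key names (crawl:<id>:visited and so on) are illustrative assumptions, not the actual WebGraph-It schema.

import redis

r = redis.Redis(host="localhost", port=6379)
crawl_id = 42  # hypothetical crawl identifier

# Visited identifiers (URLs or SURTs) as a set shared by all crawler processes.
r.sadd("crawl:%d:visited" % crawl_id, "http://example.com/")
already_seen = r.sismember("crawl:%d:visited" % crawl_id, "http://example.com/")

# Simhash fingerprints per captured page, kept in a hash for later comparison.
r.hset("crawl:%d:simhash" % crawl_id, "http://example.com/", "0x1f3a5c")

# Webgraph edges can be accumulated in a list and exported when the crawl ends.
r.rpush("crawl:%d:edges" % crawl_id, "http://example.com/ -> http://example.com/about")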

5. http://python-requests.org, accessed: August 1, 2015
6. https://github.com/sangelone/python-hashes, accessed: August 1, 2015
7. https://networkx.github.io/, accessed: August 1, 2015
8. http://www.numpy.org/, accessed: August 1, 2015
9. http://redis.io, accessed: August 1, 2015


We use MariaDB to store permanent data for users, web crawls and captured webpage information such as hyperlinks. All data are stored in a relational model so that we can query them and generate views, reports and statistics.
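A rough sketch of such a relational model, written as Flask-SQLAlchemy models in the style of the Crawl and Page classes used in Listing 4.11; the column names are illustrative assumptions rather than the actual WebGraph-It schema.

from flask_sqlalchemy import SQLAlchemy

db = SQLAlchemy()


class Crawl(db.Model):
    id = db.Column(db.Integer, primary_key=True)
    user_id = db.Column(db.Integer)  # references the users table (omitted here)
    seed_url = db.Column(db.String(2048))


class Page(db.Model):
    id = db.Column(db.Integer, primary_key=True)
    crawl_id = db.Column(db.Integer, db.ForeignKey("crawl.id"))
    url = db.Column(db.String(2048))


class Link(db.Model):
    # One row per hyperlink, so the webgraph can be rebuilt with SQL joins.
    id = db.Column(db.Integer, primary_key=True)
    source_page_id = db.Column(db.Integer, db.ForeignKey("page.id"))
    target_url = db.Column(db.String(2048))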

The http://webgraph-it.com front-end enables users to register and conduct web crawls with various options. Users can see completed web crawls and retrieve their results. An indicative screenshot of the front-end is presented in Figure 4.3.1. Users are able to create new web crawling tasks or view the results of existing tasks via an intuitive interface. Users are also able to export webgraph data in a variety of formats, such as Graph Markup Language (GraphML) [26], Graph Modelling Language (GML) [30], the Graphviz DOT language [49], sitemap.xml [127] and CSV. Our aim is to enable the use of the generated webgraphs in a large variety of third-party applications and contexts.
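If the webgraph is represented as a NetworkX graph, most of these exports map to one-line library calls, as in the minimal sketch below (the DOT export additionally requires pydot to be installed; the sitemap.xml export is a custom serialisation and is omitted here).

import csv

import networkx as nx
from networkx.drawing.nx_pydot import write_dot

# A toy webgraph; in WebGraph-It the nodes and edges come from the crawl data.
graph = nx.DiGraph()
graph.add_edge("http://example.com/", "http://example.com/about")

nx.write_graphml(graph, "webgraph.graphml")  # GraphML
nx.write_gml(graph, "webgraph.gml")          # GML
write_dot(graph, "webgraph.dot")             # Graphviz DOT (needs pydot)

# CSV edge list: one source/target pair per row.
with open("webgraph.csv", "w") as handle:
    writer = csv.writer(handle)
    writer.writerow(["source", "target"])
    writer.writerows(graph.edges())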

4.3.2 Web Crawling Framework

The development of multiple alternative web crawling methods requires an appropriate code base. We implement a special framework for WebGraph-It which simplifies the web crawler creation process; we use it to implement the alternative web crawling algorithms presented in Section 4.2.2. The basic functionality of user input/output, storage and logging is common to all web crawlers. The developer only needs to create a Python module with three methods:

• permit_url: Check whether the crawler should continue to follow a URL.

• process: Analyse a captured webpage and extract information.

• capture: Download a webpage from a URL.

We present the Python implementation of the basic web crawling algorithm 𝐶1 (Section 4.2.2) in Listing 4.11.

from lib.crawl_list_class import CrawlList
from lib.crawling_utils import enqueue_capture
from app.models.base import db
from app.models.crawl import Crawl
from app.models.page import Page


def permit_url(target):
    # Follow a URL only if it has not been visited in this crawl.
    crawl_list = CrawlList(target.crawl_id)
    return not crawl_list.is_visited(target.url)


def capture(target):
    current_crawl = Crawl.query.filter(Crawl.id == target.crawl_id).first()
    crawl_list = CrawlList(target.crawl_id)
    if permit_url(target):
        if not current_crawl.has_available_pages():
            return
        if not target.get_html():
            return
        # Persist the captured page and mark its identifier as visited.
        new_page = Page(
            crawl_id=target.crawl_id,
            url=unicode(target.url),
        )
        db.session.add(new_page)
        db.session.commit()
        if target.unique_key and new_page.id:
            crawl_list.add_visited(target.unique_key, new_page.id)
        target.page_id = new_page.id
        if target.links:
            # Store the outgoing hyperlinks and enqueue them for capturing.
            target.save_links()
            enqueue_capture("standard", capture, target, target.links)


def process(target):
    # Entry point of a crawl: reset the visited list and enqueue the seed URL.
    crawl_list = CrawlList(target.crawl_id)
    crawl_list.clear()
    enqueue_capture("standard", capture, target, (target.url,))

Listing 4.11: 𝐶1: Basic web crawling algorithm Python implementation
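Under this framework, the remaining algorithms of Table 4.2 differ mainly in their permit_url, process and capture logic. As a hedged illustration only, and not the actual WebGraph-It code, a 𝐶2-style variant could swap the raw URL for its SURT form, here using the Internet Archive's surt package.

from surt import surt  # SURT canonicalisation library (assumed dependency)

from lib.crawl_list_class import CrawlList


def permit_url(target):
    # C2-style check: identical to C1, but the visited list is keyed by the
    # SURT form of the URL instead of the raw URL.
    crawl_list = CrawlList(target.crawl_id)
    return not crawl_list.is_visited(surt(target.url))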