Aristotle University of Thessaloniki Faculty of Sciences

In this thesis, we focus on improving the data collection aspect of the web archiving process. We introduce weblog archiving as a special type of web archiving and present our findings and developments in this field: a technical survey of the blogosphere, a scalable approach to harvesting modern weblogs, and an integrated approach to preserving weblogs using a digital storage system.

Key Definitions and Problem Description

We focus on the Blogosphere: the collective outcome of all weblogs, their content, interconnections and influences, which constitute an active part of social media, an established channel of online communication with great significance [78]. In the next section, we present our contributions and the overall structure of the thesis.

Contributions and Document Organisation

We focus on the state of the art in web content deduplication and web spider trap detection, as well as website evaluation and quality assurance (QA) methods for web archiving. We evaluate the website archivability (WA) of the most common WCMS and provide specific recommendations for their developers.

Publications

Web Archiving Quality Assurance

Focusing on quality review, when a harvest is completed, the harvest result is stored in the digital asset store and the target instance is stored in the harvested state. QA was listed as one of the most important tasks to be assigned to users: "The process of examining the characteristics of the websites captured by web crawling software, which is largely manual in practice, before making a decision on whether a website has been successfully captured to become a valid archive copy".

Web Content Deduplication

The nature of the Internet encourages the creation of duplicate content, whether intentionally or unintentionally. One of the most important principles of the Internet is that each piece of content has its own URI.

Web Crawler Automation

Search Engine Optimization (SEO) guidelines also advise website administrators to have one URI for each web resource. In Section 5.4, we present our design which is similar to the distributed active object pattern presented in [85].
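
To illustrate how a crawler can act on this one-URI-per-resource principle, the sketch below normalizes URLs so that trivially different forms of the same address map to a single URI. The specific rules shown (lowercasing, dropping default ports and fragments, sorting query parameters) are illustrative and not the exact set used in this thesis.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def canonicalize(url: str) -> str:
    """Normalize a URL so that trivially different forms map to one URI.

    Illustrative rules only: lowercase scheme/host, drop fragments,
    drop default ports, sort query parameters, strip trailing slash.
    """
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    netloc = parts.hostname.lower() if parts.hostname else ""
    if parts.port and not ((scheme == "http" and parts.port == 80)
                           or (scheme == "https" and parts.port == 443)):
        netloc += f":{parts.port}"
    path = parts.path.rstrip("/") or "/"
    query = urlencode(sorted(parse_qsl(parts.query)))
    return urlunsplit((scheme, netloc, path, query, ""))

# Both variants collapse to the same canonical URI.
assert canonicalize("HTTP://Example.com:80/post/?b=2&a=1") == \
       canonicalize("http://example.com/post?a=1&b=2")
```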

Blog Archiving

Blog Archiving Projects

We address two of the main challenges related to the technical aspects of web archiving: (a) the acquisition of web content and (b) the Quality Assurance (QA) assessment performed before content enters a web archive. We also make observations about the archivability of the 12 most prominent Web Content Management Systems and suggest improvements.

Credible Live Evaluation method for Archive Readiness Plus (CLEAR+)

  • Website Archivability Facets
  • Attributes
  • Evaluations
  • Example
  • The Evolution from CLEAR to CLEAR+

Sitemap.xml files are intended to include references to all web pages of a website. Source code and related web resources refer to the website's source code (HTML, JavaScript, CSS).
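
As an illustration of how such a file can be consumed, the following sketch fetches a sitemap.xml and extracts its <loc> entries using only the Python standard library; the element names follow the sitemaps.org schema, and the usage URL is hypothetical.

```python
import urllib.request
import xml.etree.ElementTree as ET

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def sitemap_urls(sitemap_url: str) -> list[str]:
    """Return the <loc> entries of a sitemap.xml file."""
    with urllib.request.urlopen(sitemap_url, timeout=30) as response:
        tree = ET.parse(response)
    return [loc.text.strip()
            for loc in tree.iter(f"{SITEMAP_NS}loc")
            if loc.text]

# Hypothetical usage: compare the sitemap against the pages found by crawling.
# urls = sitemap_urls("https://example.org/sitemap.xml")
```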

ArchiveReady: A Website Archivability Evaluation Tool

  • System Architecture
  • Scalability
  • Workflow
  • Interoperability and APIs

For example, checking whether archived versions of the target website are already present in the Internet Archive should be part of the assessment. When a user or third-party application initiates a new evaluation task, the web application server maps the task into multiple individual atomic subtasks, which are inserted into the system's asynchronous task queue, stored as a Redis list.
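
The following is a minimal sketch of that queueing step, assuming the redis-py client; the queue key, subtask names and payload format are illustrative rather than taken from the ArchiveReady code base.

```python
import json
import redis  # redis-py client

QUEUE_KEY = "archiveready:tasks"   # illustrative key name
r = redis.Redis()

def enqueue_evaluation(website_url: str) -> None:
    """Split one evaluation into atomic subtasks and push them onto a Redis list."""
    subtasks = ["accessibility", "standards", "cohesion", "metadata"]  # illustrative
    for name in subtasks:
        r.rpush(QUEUE_KEY, json.dumps({"check": name, "url": website_url}))

def worker_loop() -> None:
    """A worker blocks on the list and processes subtasks one at a time."""
    while True:
        _, raw = r.blpop(QUEUE_KEY)
        task = json.loads(raw)
        # ... run the actual evaluation for task["check"] on task["url"] ...
```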

Evaluation

  • Methodology and Limits
  • Experimentation with Assorted Datasets
  • Evaluation by Experts
  • WA Variance in the Same Website

We calculate and store the WA of the home page for each website (WA_homepage) as an extra variable. The WA standard deviation for web pages from the same website does not depend on the average WA of the website.
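
The comparison behind this observation can be reproduced with a simple per-site computation; the sketch below uses made-up WA scores purely for illustration.

```python
from statistics import mean, stdev

# Hypothetical per-page WA scores grouped by website.
wa_scores = {
    "site-a.example": [78, 81, 80, 77, 79],
    "site-b.example": [52, 60, 55, 58, 49],
}

for site, scores in wa_scores.items():
    wa_homepage = scores[0]   # assume the first entry is the home page
    print(site, "WA_homepage:", wa_homepage,
          "mean:", round(mean(scores), 1),
          "stdev:", round(stdev(scores), 1))
```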

Web Content Management Systems Archivability

Website Corpus Evaluation Method

A valid explanation for this phenomenon is that website owners spend more resources on the homepage than on any other page, because it is the most visited part of the website. Overall, we can confirm that it is justified to evaluate WA solely using the website's homepage.

Evaluation Results and Observations

The results of the A3 evaluation (Table 3.15) indicate that most WCMS lack proper support for this feature. We evaluate the performance of two FC-related evaluations. C1: the percentage of local versus remote images is shown in Table 3.16.
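
For illustration, the C1 measurement can be approximated by counting <img> elements whose src points to the site's own host versus a remote host; the sketch below is a simplified stand-in for the actual ArchiveReady check.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit

class ImageCounter(HTMLParser):
    """Count <img> tags whose src is on the same host vs. a remote host."""
    def __init__(self, site_host: str):
        super().__init__()
        self.site_host, self.local, self.remote = site_host, 0, 0

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        src = dict(attrs).get("src") or ""
        host = urlsplit(src).netloc
        if not host or host == self.site_host:
            self.local += 1   # relative URL or same host
        else:
            self.remote += 1

parser = ImageCounter("example.org")
parser.feed('<img src="/a.png"><img src="http://cdn.example.net/b.png">')
print("local %:", 100 * parser.local / (parser.local + parser.remote))
```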

Discussion

Additionally, Blogger scores very low on many metrics, such as the number of inline scripts (Table 3.14) and HTML errors (Table 3.18). One aspect its developers should consider is HTML errors (Table 3.18), where it has the second-worst result.

Conclusions

To the best of our knowledge, there is no method to perform this task during web crawling. There is no automatic way to detect and avoid web spider traps; web crawlers use various heuristics with limited success.
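
Two of the heuristics commonly used in practice are a maximum URL path depth and a cap on repeated path segments; the sketch below illustrates them with arbitrary limits and is not the detection algorithm proposed in this thesis.

```python
from urllib.parse import urlsplit

MAX_DEPTH = 12     # illustrative limits, not values from the thesis
MAX_REPEATS = 3

def looks_like_spider_trap(url: str) -> bool:
    """Two common heuristics: excessive path depth and repeated path segments."""
    segments = [s for s in urlsplit(url).path.split("/") if s]
    if len(segments) > MAX_DEPTH:
        return True
    for segment in set(segments):
        if segments.count(segment) > MAX_REPEATS:
            return True
    return False

assert looks_like_spider_trap("http://example.org/a/b/a/b/a/b/a/b/a/b")
assert not looks_like_spider_trap("http://example.org/blog/2015/06/post-title")
```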

Method

Key Concepts

If such URLs are referenced on a web page visited by a web crawler, the result is multiple copies of the same web page. Only internal links are followed, because the scope of our algorithms is to detect duplicate web pages within the same website.

Algorithms

In this algorithm, we propose to use near-similarity to decide whether a web page has already been visited. We propose that, in addition to the similarity of the unique identifier, we should also consider the similarity of the content of web pages.
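
A minimal sketch of this idea, assuming a plain string-similarity measure (difflib) and illustrative thresholds rather than the tuned values used in our experiments:

```python
from difflib import SequenceMatcher

URL_THRESHOLD = 0.90       # illustrative thresholds
CONTENT_THRESHOLD = 0.95

def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a, b).ratio()

def already_visited(url: str, content: str, visited: dict[str, str]) -> bool:
    """Treat a page as visited if both its URL and its content are
    near-similar to a page we have already crawled."""
    for seen_url, seen_content in visited.items():
        if (similarity(url, seen_url) >= URL_THRESHOLD and
                similarity(content, seen_content) >= CONTENT_THRESHOLD):
            return True
    return False
```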

The WebGraph-it System Architecture

System

For example, we can create a new, more user-friendly web interface or a public REST API for WebGraph-It without modifying the web crawling logic. Users are able to create new web crawling tasks or view the results of existing tasks through an intuitive interface.

Web Crawling Framework

All data is stored in a relational model so that it can be queried to generate views, reports and statistics. Users can also export web graph data in a variety of formats, such as Graph Markup Language (GraphML) [26], Graph Modelling Language (GML) [30], Graphviz DOT Language [49], sitemap.xml [127] and CSV.
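
As an illustration of such exports, the sketch below writes a toy web graph to GraphML, GML, Graphviz DOT and CSV using the networkx library; the thesis does not prescribe this particular library.

```python
import csv
import networkx as nx

# A toy web graph: nodes are URLs, edges are hyperlinks found while crawling.
graph = nx.DiGraph()
graph.add_edge("http://example.org/", "http://example.org/about")
graph.add_edge("http://example.org/", "http://example.org/blog")

nx.write_graphml(graph, "webgraph.graphml")           # GraphML
nx.write_gml(graph, "webgraph.gml")                   # Graph Modelling Language
nx.drawing.nx_pydot.write_dot(graph, "webgraph.dot")  # Graphviz DOT (needs pydot)

with open("webgraph.csv", "w", newline="") as fh:     # CSV edge list
    writer = csv.writer(fh)
    writer.writerow(["source", "target"])
    writer.writerows(graph.edges())
```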

Evaluation

  • Methodology
  • Example
  • Results
  • Optimal DFS Limit for Cycle Detection
  • Web Spider Trap Experiment

All web crawling tasks are executed in parallel using 4 processes (Python workers), as presented in Section 4.3. We perform 8 consecutive web crawls of each website with the WebGraph-it system, using the 8 different web crawling algorithms presented in Section 4.2.2 (C1-C8).
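
A sketch of how such an experiment can be orchestrated with a pool of 4 worker processes; the crawl function is a hypothetical placeholder for the actual WebGraph-It crawler invocation, and the website list is illustrative.

```python
from itertools import product
from multiprocessing import Pool

ALGORITHMS = [f"C{i}" for i in range(1, 9)]   # C1 .. C8

def crawl(job):
    """Placeholder for one crawl of `website` with `algorithm` (hypothetical)."""
    website, algorithm = job
    # ... run the WebGraph-It crawler here and return its statistics ...
    return website, algorithm, "ok"

if __name__ == "__main__":
    websites = ["http://example.org", "http://example.com"]  # illustrative list
    jobs = list(product(websites, ALGORITHMS))
    with Pool(processes=4) as pool:          # 4 worker processes, as in the text
        for website, algorithm, status in pool.imap_unordered(crawl, jobs):
            print(website, algorithm, status)
```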

Conclusions and Future Work

Using the framework we developed in the context of WebGraph-It to enable easy web crawling algorithm implementation (Section 4.3.2), we aim to evolve existing web crawling algorithms and create new ones. The uses of web crawl optimization, web graph generation, and web spider trap detection are numerous.

Blogosphere Technical Survey

Survey Implementation

The inclusion of additional blogs shared by survey participants expands the automatically generated list of blogs with a set of selectively contributed ones. Inclusion of top blogs from Technorati and Blogpulse enables a comparative analysis between the more general Weblogs.com cohort and the list of highly ranked blogs.

Results

Metadata formats detected in the survey include:

  • Dublin Core
  • Friend of a Friend (FOAF)
  • Open Graph Protocol (OG)
  • Semantically-Interlinked Online Communities (SIOC)

The use of MS Word and PDF documents accounts for between 4.9% and 6.4% of all sources studied.

Comparison Between Blogosphere and the Generic Web

It is also clear that the adoption of PNG images is higher in the blogosphere (25%) compared to the general web (20%). On the other hand, there are a large number of established and widespread technologies and standards that are used consistently throughout the blogosphere.

User Requirements

Preservation Requirements

Our evaluation shows that there are approximately 470 platforms in addition to the dominant WordPress and Blogger. The wide variety of content types, in addition to the 61% that is text/html, published across a wide range of coding standards, demonstrates this fact.

Interoperability Requirements

Therefore, the spider, and consequently the archive, should ideally be able to capture blogs from every platform, or at least from the most common platforms. Expose parts of the archive via the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) based on specified criteria.

Performance Requirements

The BlogForever platform must allow different parts of the archive to be exposed to different clients via OAI-PMH according to certain parameters and policies (e.g. client identity, admin settings). The archive should allow pingback/trackback to facilitate connections between archived content and external resources.
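
For illustration, a harvesting client would issue OAI-PMH requests of the following form; the endpoint and set name are hypothetical, while the verb and arguments follow the OAI-PMH 2.0 protocol.

```python
from urllib.parse import urlencode

BASE_URL = "https://archive.example.org/oai2d"   # hypothetical endpoint

params = {
    "verb": "ListRecords",
    "metadataPrefix": "oai_dc",   # Dublin Core, the mandatory OAI-PMH format
    "set": "blogs:wordpress",     # illustrative set for selective exposure
    "from": "2013-01-01",         # only records added/changed since this date
}
request_url = f"{BASE_URL}?{urlencode(params)}"
print(request_url)
# A harvester would fetch this URL and parse the XML <ListRecords> response.
```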

System Architecture

The BlogForever Software Platform

Because of the overhead incurred by the various network connections, and the delay incurred by the tools that process the information, it is accurate to say that archiving is as close to real-time as these factors allow. In addition, an extensive range of personal services, social and community tools are made available to the end user.

Blog Spider Component

At the core of the spider, the Source Database manages all blog URLs, as well as information about the structure of the blogs and the filters used to parse their content. The high load on the Source Database can be distributed across a number of dedicated database servers.

Digital Repository Component

The BlogForever METS profile should be considered a work in progress, to be adjusted as necessary as the repository implementation continues. When the repository receives new content in the form of individual packages (sets of resources), these are treated as SIPs according to the Open Archival Information System (OAIS) specification.
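
As an illustration of such packaging, the sketch below wraps the files of one ingested package in a bare METS document; the element names follow the METS schema, but this is a simplified stand-in for the BlogForever METS profile, not the profile itself.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
XLINK_NS = "http://www.w3.org/1999/xlink"
ET.register_namespace("mets", METS_NS)
ET.register_namespace("xlink", XLINK_NS)

def minimal_mets(package_id: str, files: list[str]) -> bytes:
    """Wrap the files of one ingested package (SIP) in a bare METS document."""
    mets = ET.Element(f"{{{METS_NS}}}mets", {"OBJID": package_id})
    file_sec = ET.SubElement(mets, f"{{{METS_NS}}}fileSec")
    file_grp = ET.SubElement(file_sec, f"{{{METS_NS}}}fileGrp", {"USE": "content"})
    for i, path in enumerate(files):
        file_el = ET.SubElement(file_grp, f"{{{METS_NS}}}file", {"ID": f"file-{i}"})
        ET.SubElement(file_el, f"{{{METS_NS}}}FLocat",
                      {"LOCTYPE": "URL", f"{{{XLINK_NS}}}href": path})
    return ET.tostring(mets, encoding="utf-8", xml_declaration=True)

# Hypothetical package with two resources.
print(minimal_mets("blog-post-0001", ["post.html", "image.png"]).decode())
```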

Implementation

We examine the detailed user requirements and the capabilities of the existing software to understand how they relate to each other. At the end of the implementation process, for both the spider and the repository, each feature is tested and documented.

Evaluation

Method

Detailed information about internal testing is part of BlogForever Deliverable 5.2, Case Study Implementation [6]. Detailed information about the user questionnaires is part of BlogForever Deliverable 5.3 User Questionnaires [14].

Results

RQ4: Does the use of the BlogForever repository lead to successful outcomes for the various users? In all other cases, the differences are much smaller, which strengthens the results of the evaluation.

Evaluation Outcomes

If we look at the general information about the case studies as recorded in the system logs, we see that the number of visitors and page views is significant. From these results, we can conclude that the majority of users are satisfied with the functioning of the BlogForever platform.

Discussion and Conclusions

We present the open source BlogForever Crawler, a key component of the BlogForever platform [80], responsible for crawling blogs, extracting their content, and monitoring their updates. We also map extracted blog content to Archival Information Packages (AIPs) using the METS and MARCXML standards for interoperability purposes.

Algorithms

In the remaining parts of this section, we expand our work on each specific part of the crawler system. Fourth, we can extend the goals of the BlogForever platform to include microblogs.
