
• Looking into the general information of the case studies as presented through the system logs, we see that the number of visitors and page views is substantial. These statistics demonstrate the rigour of our testing process: many users performed multiple tests. They also show that the platform is capable of handling a large number of users and requests.

• In addition, evaluating the system log metrics, and especially Metric 5: Python error codes and Metric 6: HTTP status distributions, we see that very few system errors occurred throughout the testing process.
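Metrics such as the HTTP status distribution can be derived directly from web server access logs. The following is a minimal sketch of such a computation, not the BlogForever implementation; the log format and function name are illustrative assumptions.

```python
from collections import Counter

def status_distribution(log_lines):
    """Count HTTP status codes from access-log lines in the common log
    format, where the status code is the second-to-last field
    (e.g. '... "GET /record/1 HTTP/1.1" 200 5123')."""
    counts = Counter()
    for line in log_lines:
        parts = line.split()
        if len(parts) >= 2 and parts[-2].isdigit():
            counts[parts[-2]] += 1
    return counts

# Illustrative sample log lines (not real BlogForever data).
logs = [
    '127.0.0.1 - - [10/Oct/2013] "GET /record/1 HTTP/1.1" 200 5123',
    '127.0.0.1 - - [10/Oct/2013] "GET /record/2 HTTP/1.1" 200 4410',
    '127.0.0.1 - - [10/Oct/2013] "GET /missing HTTP/1.1" 404 209',
]
print(status_distribution(logs))  # Counter({'200': 2, '404': 1})
```

A predominance of 2xx codes over 4xx/5xx codes in such a distribution is what indicates that few system errors occurred.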

• Theme 8: Functionality and Theme 2: System integrity scores are above average in the internal testing results and near average in the external testing results.

RQ2: Are complex BlogForever platform search strategies working efficiently when high levels of content are available within the BlogForever platform?

The fifth case study is characterised by the large volume and complexity of the blogs in scope, which the platform handled without any problem. We therefore consider that the BlogForever platform works efficiently.

RQ3: How useful is the BlogForever platform as a whole?

RQ3 is aligned with the following Themes: T2: System integrity, T6: Data integrity, T7: Preservation and T8: Functionality. Their internal and external testing scores are presented in Table 5.7. Also, the System Logs Metrics relevant to T2: System integrity are M5: Number of Python errors, M6: HTTP status distribution and M7: Page loading time distribution.

As we see in the results, all their scores are quite high.

RQ4: Does the use of the BlogForever repository lead to successful results for the different users?

RQ4 is aligned with T4: Searching and T5: Access. Their internal and external testing scores are quite good, as we see in Table 5.7.

RQ5: How user friendly are the BlogForever platform functions for the different designated blog communities?

RQ5 is aligned with the following Themes: T1: Using blog records, T3: Sharing and interaction, T9: System navigation and T10: System terminology. Their internal and external testing scores are also quite good.

To conclude, we consider the rating of all Themes to average out between 3 (“Most areas worked as expected”) and 4 (“All work as expected”). Any deviations between the different themes, evaluation methods and case studies are not significant. From these scores, we can conclude that the majority of users are satisfied with the performance of the BlogForever platform.

The first limitation concerns the definition of which blogs should be preserved. As we described in Section 5.4.1, the BlogForever spider is provided with a predefined list of blogs to analyse and harvest. The problem lies in the fact that this list of target blogs has to be defined explicitly. The administrators need to already have the list of blogs, which may not always be the case. When BlogForever is deployed to preserve the blogs of a specific organization, the list of target blogs is predefined, but when BlogForever is deployed in a different context, e.g. to create a repository of major Mathematics blogs, then the definition of target blogs becomes a major issue. The solution would be to have a mechanism able to generate and curate such a list in a semi-automatic or fully automatic way, based on configurable topic hierarchies. This way, the administrator would define the topics of interest and let the platform handle the specifics of blog collection management.
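The topic-hierarchy idea could be sketched as follows. This is a hypothetical illustration of the proposed mechanism, not an existing BlogForever feature; the hierarchy format, function name and sample data are assumptions.

```python
def match_topics(topic_hierarchy, candidate_blogs):
    """Select candidate blogs whose description mentions any keyword
    from a configurable topic hierarchy (nested dicts with keyword
    lists at the leaves)."""
    def leaves(node):
        # Walk the hierarchy down to its leaf keyword lists.
        if isinstance(node, dict):
            for child in node.values():
                yield from leaves(child)
        else:
            yield from node

    keywords = {kw.lower() for kw in leaves(topic_hierarchy)}
    selected = []
    for url, description in candidate_blogs:
        text = description.lower()
        if any(kw in text for kw in keywords):
            selected.append(url)
    return selected

# Illustrative topic hierarchy and candidate blogs.
topics = {"Mathematics": {"Algebra": ["group theory", "rings"],
                          "Analysis": ["measure theory"]}}
candidates = [
    ("http://example.org/blog-a", "Notes on group theory and beyond"),
    ("http://example.org/blog-b", "A cooking blog"),
]
print(match_topics(topics, candidates))  # ['http://example.org/blog-a']
```

A production version would of course draw candidates from blog directories or search services and use more robust topic matching than substring tests, but the division of labour is the same: the administrator configures topics, the platform curates the list.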

The Blog Spider has also limitations regarding new blog content detection and processing.

First of all, it uses RSS and ping servers to receive notifications for blog content updates.

Nevertheless, these methods do not notify about layout changes. Moreover, RSS is used in different ways: some blogs provide separate RSS feeds for posts and comments, while others provide RSS feeds just for posts. Thus, the detection of new comments in real time is problematic in such cases. We also face issues during the processing of unknown blog platforms or ‘exotic implementations’, because the identification/analysis process for the entities of blogs is knowledge intensive and has to be adapted to new developments in blog platforms. Therefore, the amount of necessary adaptation depends on the actual domain, i.e. the actual blog platforms that should be archived. In addition, while the identification of structural blog entities such as posts, comments, etc. is achieved, the validation of whether the author uses a real name or an alias cannot be performed automatically. To sum up, there are some difficulties in evaluating the validity of certain elements.
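Whether a blog exposes a separate comment feed can often be detected from the `<link rel="alternate">` elements in its HTML head. The following is a minimal sketch of such a check using only the Python standard library; it is not the BlogForever spider's code, and the class name and sample markup are illustrative.

```python
from html.parser import HTMLParser

class FeedLinkParser(HTMLParser):
    """Collect <link rel="alternate"> feed URLs from a blog's HTML."""
    FEED_TYPES = {"application/rss+xml", "application/atom+xml"}

    def __init__(self):
        super().__init__()
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if (tag == "link" and a.get("rel") == "alternate"
                and a.get("type") in self.FEED_TYPES):
            self.feeds.append((a.get("title", ""), a.get("href")))

# Illustrative blog front page advertising both post and comment feeds.
page = """<html><head>
<link rel="alternate" type="application/rss+xml"
      title="Posts" href="/feed/" />
<link rel="alternate" type="application/rss+xml"
      title="Comments" href="/comments/feed/" />
</head><body></body></html>"""

parser = FeedLinkParser()
parser.feed(page)
has_comment_feed = any("comment" in t.lower() for t, _ in parser.feeds)
print(parser.feeds, has_comment_feed)
```

When no comment-specific feed is advertised, a crawler has to fall back on re-fetching and re-parsing post pages, which is exactly why real-time comment detection is problematic in such cases.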

Another issue concerns the scalability of the BlogForever platform. Scalability is the ability of a system, network, or process to handle a growing amount of work in a capable manner, or its ability to be enlarged to accommodate that growth [25]. As we described in Section 5.4.3, the BlogForever repository component is based on the Invenio software suite, which has been in production since 2002 and is used with great success by many popular repositories hosting millions of records, such as the CERN Document Server40 and INSPIRE41. Nevertheless, due to its reliance on the MySQL RDBMS, the BlogForever repository architecture is inherently not scalable to web-scale datasets. New database technologies such as NoSQL databases are a better fit for this purpose [41]. Thus, it would not be possible to deploy BlogForever on a large cluster of servers in order to create internet-scale blog archives.

In summary, we identified several limitations that should be addressed in further research and development. However, they do not refute the claim that the BlogForever system solves the identified problems of current web archiving.

40http://cds.cern.ch/

41http://inspirehep.net/

A Scalable Approach to Harvest Modern Weblogs

We present methods to automatically extract content such as articles, authors, dates and comments from blog posts. To achieve this goal, we introduce a simple yet robust and scalable algorithm to generate extraction rules based on string matching, using the blog’s web feed in conjunction with blog hypertext. Furthermore, we present a system architecture which is characterised by efficiency, modularity, scalability and interoperability with third-party systems. Finally, we conduct thorough evaluations of the performance and accuracy of our system1.

6.1 Introduction

One of the key challenges in developing blog archiving systems is the design of a web crawler capable of efficiently traversing blogs to harvest their content. The sheer size of the blogosphere, combined with an unpredictable publishing rate of new information, calls for a highly scalable system, while the lack of programmatic access to the complete blog content makes the use of automatic extraction techniques necessary. The variety of available blog publishing platforms offers a limited common set of properties that a crawler can exploit, further narrowed by the ever-changing structure of blog contents. Finally, an increasing number of blogs heavily rely on dynamically created content to present information, using the latest

1This chapter is based on the following publications:

• Banos V., Blanvillain O., Kasioumis N., Manolopoulos Y.: “A Scalable Approach to Harvest Modern Weblogs”, International Journal of AI Tools, Vol. 24, No. 2, 2015.

• Blanvillain O., Banos V., Kasioumis N.: “BlogForever Crawler: Techniques and Algorithms to Harvest Modern Weblogs”, Proceedings of the 4th International Conference on Web Intelligence, Mining & Semantics (WIMS), ACM Press, Thessaloniki, Greece, 2014.

web technologies, hence invalidating traditional web crawling techniques.

A key characteristic of blogs which differentiates them from regular websites is their association with web feeds [89]. Their primary use is to provide a uniform subscription mechanism, thereby allowing users to keep track of the latest updates without the need to actually visit blogs. Concretely, a web feed is an XML file containing links to the latest blog posts along with their articles (abstract or full text) and associated metadata [122]. While web feeds essentially solve the question of update monitoring, their limited size makes it necessary to download blog pages to harvest previous content.
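To make the structure of a web feed concrete, the following sketch parses a minimal RSS 2.0 document with the Python standard library. The sample feed and function name are illustrative, not taken from the BlogForever codebase.

```python
import xml.etree.ElementTree as ET

# A minimal, illustrative RSS 2.0 feed.
RSS_SAMPLE = """<?xml version="1.0"?>
<rss version="2.0"><channel>
  <title>Example Blog</title>
  <item>
    <title>First post</title>
    <link>http://example.org/2015/01/first-post</link>
    <pubDate>Thu, 01 Jan 2015 10:00:00 GMT</pubDate>
    <description>Abstract of the first post.</description>
  </item>
</channel></rss>"""

def parse_feed(xml_text):
    """Extract title, link, date and summary for each feed item."""
    root = ET.fromstring(xml_text)
    posts = []
    for item in root.iter("item"):
        posts.append({
            "title": item.findtext("title"),
            "link": item.findtext("link"),
            "date": item.findtext("pubDate"),
            "summary": item.findtext("description"),
        })
    return posts

print(parse_feed(RSS_SAMPLE)[0]["link"])
# http://example.org/2015/01/first-post
```

Note that a real feed contains only the most recent items, which is precisely why the crawler must still download and parse blog pages to reach older content.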

We present the open-source BlogForever Crawler, a key component of the BlogForever platform [80] responsible for traversing blogs, extracting their content and monitoring their updates. Our main objectives in this work are to introduce a new approach to blog data extraction and to present the architecture and implementation of a blog crawler capable of extracting articles, authors, publication dates, comments and potentially any other element which appears in weblog web feeds. Our contributions can be summarized as follows:

• A new algorithm to build extraction rules from web feeds and an optimised reformulation based on a particular string similarity algorithm featuring linear time complexity.

• A methodology to use the algorithm for blog article extraction and how it can be augmented to be used with other blog elements such as authors, publication dates and comments.

• The overall BlogForever crawler architecture and implementation with a focus on design decisions, modularity, scalability and interoperability.

• An approach to use a complete web browser to render JavaScript powered webpages before processing them. This step allows our crawler to effectively harvest blogs built with modern technologies, such as the increasingly popular third-party commenting systems.

• A mapping of the extracted blog content to Archival Information Packages (AIPs) using METS and MARCXML standards for interoperability purposes.

• An evaluation of the content extraction and execution time of our algorithm against three state-of-the-art web article extraction algorithms.
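The intuition behind the first contribution can be sketched as follows: the article text found in the web feed is compared against candidate HTML nodes, and the selector of the best-matching node becomes the extraction rule for that blog. This is a simplified illustration only; it uses `difflib.SequenceMatcher` rather than the linear-time similarity measure of the actual algorithm, and the selectors and sample data are assumptions.

```python
from difflib import SequenceMatcher

def best_matching_rule(feed_text, candidate_nodes):
    """Pick the extraction rule (here, a CSS-like selector) whose
    node text is most similar to the feed's article text."""
    def similarity(a, b):
        return SequenceMatcher(None, a, b).ratio()
    return max(candidate_nodes,
               key=lambda rule: similarity(feed_text, candidate_nodes[rule]))

# Illustrative feed article and candidate page regions.
feed_text = "Today we discuss scalable blog crawling and content extraction."
candidates = {
    "div.post-content": ("Today we discuss scalable blog crawling and "
                         "content extraction. Posted by admin."),
    "div.sidebar": "Recent posts. Archives. Categories.",
    "div.header": "My example blog",
}
rule = best_matching_rule(feed_text, candidates)
print(rule)  # div.post-content
```

Once derived from a few feed items, such a rule can be applied to every page of the same blog, including the older posts that no longer appear in the feed.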

The concepts emerging from our research are viewed in the context of the BlogForever plat- form but the presented algorithms, techniques and system architectures can be used in other applications related to Wrapper Generation and Web Data Extraction.

The rest of this chapter is structured as follows: Section 6.2 introduces the new algorithms to extract data from blogs. Section 6.3 presents the blog crawler system architecture and implementation. Section 6.4 presents the evaluation and results. Finally, our conclusions and some discussion on our work are presented in Section 6.5.