
Universidade de Aveiro
Departamento de Electrónica, Telecomunicações e Informática
2019

Ricardo Jorge Ferreira Da Silva

Evolução do esquema e acoplamento de alterações no MediaWiki

Schema Evolution and Change Coupling in MediaWiki


Dissertation presented to the University of Aveiro to fulfill the necessary requirements to obtain the degree of Master in Computer and Telematics Engineering, conducted under the scientific guidance of Doctor José Moreira, Assistant Professor at the Department of Electronics, Telecommunications and Informatics of the University of Aveiro, and Stefanie Scherzinger, Professor at the Faculty of Computer Science and Mathematics of the Ostbayerische Technische Hochschule Regensburg (OTH Regensburg), Germany.


o júri / the jury

presidente / president: Professor Doutor Luís Filipe de Seabra Lopes, Professor Associado da Universidade de Aveiro

vogais / examiners committee: Professor Doutor Joel Perdiz Arrais, Professor Auxiliar, Departamento de Engenharia Informática da Fac. de Ciências e Tecnologia da Universidade de Coimbra

Professor Doutor José Manuel Matos Moreira, Professor Auxiliar da Universidade de Aveiro (orientador)


agradecimentos / acknowledgements

Agradeço ao Prof. Dr. José Moreira e Prof. Dra. Stefanie Scherzinger pelas orientações e conselhos que me foram fornecendo ao longo do desenvolvimento deste trabalho. Agradeço imenso aos meus pais pelo apoio e motivação que me forneceram, sem eles chegar a este ponto não seria possível. E agradeço também à minha restante família e amigos por me acompanharem ao longo deste árduo caminho.

I would like to thank Prof. Dr. José Moreira and Prof. Dr. Stefanie Scherzinger for the guidance and advice they have given me throughout the course of this work. I am very grateful to my parents for the support and motivation they have provided me; without them, reaching this point would not have been possible. I would also like to thank the rest of my family and my friends for accompanying me along this arduous path.


Palavras Chave Evolução de Esquemas, Bases de Dados, Extração de Dados, Padrões Frequentes, Regras de Associação

Resumo Os desenvolvedores de software enfrentam diversas adversidades durante o desenvolvimento dos seus projetos, sendo uma das mais importantes a evolução dos esquemas das bases de dados. Este é um procedimento inevitável que, quando não lhe é fornecida a devida atenção, pode inutilizar uma aplicação inteira.

A melhor opção para estudar o impacto da evolução dos esquemas no desenvolvimento de software foi a extração de dados de uma aplicação com uma grande base de dados. O caso de estudo escolhido foi a MediaWiki. Usando ferramentas existentes para extrair dados do projeto MediaWiki, o objetivo passou pela descoberta de ficheiros que causavam a evolução dos esquemas da base de dados e de ficheiros que eram afetados por essa evolução, e assim extrair padrões frequentes. Com base em trabalhos existentes, foi possível ter acesso a dados relacionados com as modificações que afetam os esquemas. A partir destes dados foi possível extrair mais de uma centena de regras de associação diferentes.

Estas regras permitiram o desenvolvimento de um sistema de recomendação. Este sistema permite ao utilizador fornecer dois ficheiros que operam sobre um esquema de uma base de dados e receber as possíveis modificações a fazer no esquema mais recente. Permite-se assim que os desenvolvedores de software mantenham os seus esquemas de bases de dados coerentes e concisos.


Keywords Schema Evolution, Databases, Data Mining, Frequent Patterns, Association Rules

Abstract Software developers face many adversities while working on their projects, one of the most important being schema evolution. This is an inevitable procedure that, when not given the proper attention, can render a whole application unusable.

The best option to study the impact of schema evolution on software development was the mining of data in a large database application. The chosen case study was MediaWiki. Using existing tools to mine the files of the MediaWiki project, the aim was to find which files caused and which were affected by schema evolution, and to extract frequent patterns. Building on existing work on mining schema updates in each revision of MediaWiki, it was possible to extract over a hundred different association rules.

These rules allowed the development of a recommendation system. This system allows the user to provide two files working on a database schema and to receive as output the possible missing updates that should be applied to the newer schema. This will help software developers to keep their database schemas coherent and concise.


Contents

Contents
List of Figures
List of Tables

1 Introduction
1.1 Overview
1.2 Motivation and Objectives
1.3 Contributions
1.4 Outline

2 State of the Art
2.1 Schema Evolution
2.2 Patterns of Schema Evolution
2.3 Schema and Code Co-Evolution in Database Applications
2.3.1 An empirical analysis
2.3.2 Study Results
2.4 Revision History Patterns
2.5 MediaWiki

3 Case Study
3.1 MediaWiki
3.2 Schema Evolution in Wikipedia
3.3 Objective

4 File Coupling in MediaWiki
4.1 MediaWiki Evolution Analysis
4.1.1 Codemaat
4.1.2 Statistical Temporal Summary
4.1.3 File Coupling Analysis
4.2 Results and Discussion
4.3 Summary

5 Mining Association Rules
5.1 Overview
5.2 Methods for Mining Association Rules
5.2.1 Algorithms
5.2.2 Metrics
5.3 Finding Association Rules in MediaWiki
5.3.1 Data collection
5.3.2 Rules extraction
5.4 Results and Discussion
5.4.1 MediaWiki Rules
5.4.2 Multiple Projects Rules
5.4.3 Rules violations analysis
5.4.4 Patterns within rules
5.5 Summary

6 Recommendation System
6.1 Overview
6.2 Tools and Methods
6.3 Using Association Rules For a Recommendation System
6.4 Recommendation System Application

7 Conclusion and Future Work
7.1 Conclusion


List of Figures

2.1 The evolution trend of tables/columns in the studied projects [32]
3.1 MediaWiki Number of Tables [25]
3.2 MediaWiki Number of Columns [25]
3.3 Proposed Solution Application Architecture
6.1 Recommendation System Application
6.2 Recommendations for table 'Cur'


List of Tables

2.1 Results of DB revision and schema change extraction [32]
3.1 Macro-Classification of Schema Changes
3.2 Micro-Classification of Schema Changes Using SMOs
4.1 Statistical summary from 2003 to 2018
4.2 File Coupling in MediaWiki
4.3 File Coupling in MediaWiki in 2018
5.1 Atomic changes made in all valid database revisions
5.2 Association Rules in MediaWiki
5.3 Association Rules Metrics in MediaWiki
5.4 Association Rules in 10 different projects


Chapter 1

Introduction

This chapter presents an overview of this work. It starts with a description of what schema evolution is, what the main issues are and their impact on the development of Information Systems, followed by a brief presentation of the objectives of this dissertation, the contributions and an outline of the thesis structure.

1.1 Overview

Evolution refers to change through time, caused by our need to adapt to new contexts or situations. From Charles Darwin's theory of evolution to the ever-evolving world of technology, we have the necessity to keep changing, improving and getting the best out of everything.

In every Information System this necessity to keep changing is well felt, as there is always room for improvement and there is the need to stay up to date regarding multiple factors, such as changing requirements, compliance with new regulations, implementation of new functionality, integration with other systems, technological updates, or dealing with security and privacy issues.

Information Systems, as critical components of modern organizations, are exposed to evolution, and the data management system is one of the core elements in this context. They need to change and to adapt whenever there is the need to modify the logical or the physical organization of the data. That is where schema evolution happens, because database schemas must be updated without any loss of data, and many changes occur when the systems are already in production. This kind of evolution represents the changes undergone by a database since its creation.

To further understand the complexity of schema evolution, let us look at one case of a constantly evolving Information System, in particular MediaWiki [4]. MediaWiki is a data-intensive, open-source, collaborative, web-portal software, originally developed to run Wikipedia. Wikipedia is a multilingual, web-based, free-content encyclopedia, used for academic, professional or just curiosity purposes, and it is one of the most well-known websites in the world. Nowadays, Wikipedia contains more than 300 million articles, with a growth of about 2.7 million articles per month [21].

In previous studies conducted in 2007 [25], when MediaWiki had merely 4 years of life, there were already 171 different database schema versions released to the public by means of a Subversion versioning system. By 2013 it had 377 versions [32]. Every change in the database schema impacts not only the queries but also the client applications managing or requesting data from MediaWiki. It is estimated that only a small percentage of queries (22%, to be more precise) originally written to run on old schema versions remain valid throughout the schema evolution [25].

1.2 Motivation and Objectives

Wikipedia is one of the main sources where a person can obtain information related to any subject; it has had over 752.2B page views since its creation. To put the size of the information contained in Wikipedia into perspective, if a person were to print it all into books, it would take roughly 7,473 volumes of 700 pages each [20].

The ever-growing quantity of data present in MediaWiki and the importance and impact that Wikipedia has in our lives nowadays make the job of keeping the integrity of all the data and information one of extreme importance. This growth is represented not only in the data but also in the application's code, because new features are always being added: the software keeps evolving and, inevitably, the schema will change with it.

Schema evolution is, and will always be, present in the development of MediaWiki, because it is necessary and unavoidable. The changes will continue to happen and, as mentioned before, they will deeply affect the queries and the code. The goal of this project is to develop a recommendation system that allows someone to make changes in the database schema and instantly get a recommendation about what changes to make next in the code, to avoid further changes and other problems in the future. For instance, when a developer uses the recommendation system and adds a new column and removes one key from a table in a database, he will get a recommendation to add a new key for the new column. This recommendation will have a confidence equal to, for instance, 100%, meaning that these three changes always happen together, and a precision equal to, for instance, 60%, meaning that the changes follow a specific pattern in 60% of the cases, in this case the relation between the added column and the key to be added.

1.3 Contributions

The main contributions of this dissertation are:

• Research work and deep analysis of file coupling changes concerning the data definition language operations performed throughout the entire history of MediaWiki.

• Analysis and retrieval of nearly 80 association rules, with a confidence above 60%, associated with atomic database changes of MediaWiki and 9 other popular database systems.

• A recommendation system to assist with schema evolution. The recommendations are based on the association rules mentioned before, and also take into account their confidence, precision and other metrics to improve the feedback to the users (programmers).

1.4 Outline

This dissertation starts with a brief introduction to schema evolution and presents the Information System under study, followed by the motivation and the main goal of this work. The remaining chapters are organized as follows:

Chapter 2 presents the related work on which this dissertation is based. Chapter 3 presents the database project under study and outlines the solution proposed in this work. Chapter 4 presents statistical data and an analysis of file coupling in MediaWiki. Chapter 5 presents an introduction to association rules and the mining of these rules in MediaWiki. Chapter 6 presents the recommendation system proposed in this work, how to use it and the benefits to the users. Chapter 7 presents the conclusions and guidelines for future work, concluding this dissertation.


Chapter 2

State of the Art

Schema evolution has long been a headache for developers. It is unpredictable, as it is a matter that tends to be overlooked until it causes problems. Data management and schema evolution are often forgotten and tend to be costlier when they are finally given some attention. This problem affects not only the schema itself but also all the code that depends on it, rendering it useless and causing unexpected errors that can make an application impossible for customers to use.

This chapter presents the related work on which this dissertation is based. It presents the impact of schema evolution on the development and maintenance of information systems, what schema and code co-evolution in database applications is, and the results of an empirical study of ten popular open-source projects.

2.1 Schema Evolution

At the beginning of every Information System project, the schema of its database is defined. The schema is the backbone of every database, defining its structure and logical configuration. Schemas may go unaltered for some time, but an Information System is always evolving and its code keeps changing, because the software must adapt to new trends, new functionalities are implemented and new approaches are taken. These variables tend to keep the software changing, and the database has to evolve with it, to keep supporting all the newly demanded features.

The data kept in a database and the related code, depend on the schema. When the schema changes, the data and the code must also change. This means that it is important to manage schema changes, as well as related code and data, all together. Schema evolution is the ability of the database to respond to changes to its schema without loss of existing data and without affecting day-to-day operations of the database.

Schema evolution influences all the applications surrounding a database. Even small schema changes can deeply impact the application queries and the code requesting them, causing the latter to be syntactically or semantically invalid, as well as missing or incorrect data and runtime crashes. According to [25], only 22% of the queries in a database application are valid throughout the entire lifecycle of an application.

2.2 Patterns of Schema Evolution

In [33], the authors studied the schema evolution of eight databases of open source projects. They focused on which tables evolved and how they evolved, paying special attention to the table schema size (the number of attributes a table has), duration (elapsed time between creating and dropping a table) and number of updates. From this data they concluded that there are at least four different patterns of schema evolution: the T pattern, the comet pattern, the inverse T pattern and the empty triangle pattern.

T Pattern

This pattern states that tables with small schema sizes can have arbitrary duration. On the other hand, tables with larger schema sizes tend to last longer. When a table has 10 or more attributes it has a high probability of surviving. Wide tables are frequently created early on and are not deleted afterwards.

Comet Pattern

In this pattern, the majority of tables have a small schema size and a small amount of change, while medium schema size tables undergo many changes and wide tables with large schema sizes undergo a medium amount of change.

Inverse T Pattern

This pattern states that tables with small duration undergo small changes, tables with medium duration undergo small or medium changes and long-lived tables demonstrate all kinds of change behavior.

Empty Triangle

The empty triangle pattern states that few deleted tables demonstrate either late birth, or long duration, or high average transitional update, or large number of updates resulting in a practically hollow triangle in the birth × duration space. There is often a large concentration of deleted tables in a cluster that are newly born, quickly removed, with few or no updates, whereas older tables of long duration are scarcely removed.

2.3 Schema and Code Co-Evolution in Database Applications

A database application is a software system that manages, retrieves and collects data. The data are usually stored in a database managed by a database management system and organized according to a database schema. The applications accessing the data need to conform to the structure of the data defined by the database schema. This means that the schema acts as a mediator that manages the interactions between the application code and the data. Consequently, the evolution of a database application is more complex than in other types of applications. For instance, consider a system that uses a table (USER) to store the user authentication data and other personal data. If the requirements change and the system needs to store user authentication and personal data separately, the table USER must be split into two new tables, USER_LOGIN and USER_DETAILS. The data and the application code must be synchronized to make them compliant with these changes. The original data must be migrated into the new tables and the original application code must be modified to access the newly organized data.

There are two types of co-evolution in database applications: data co-evolves with the schema and code co-evolves with the schema. The first type consists of three main tasks: predicting and estimating the effects before the proposed schema changes are performed, rewriting the existing DBMS-level queries to work on the new schema and migrating data to the new schema. The second type involves two main tasks: evaluating the cost of reconciling the existing code with the new schema before any schema changes, and locating and modifying all impacted code after applying the schema changes.

Schema changes, query updates and data migration are relatively straightforward, as they are done using the same languages, typically the Data Definition Language and the Data Manipulation Language. However, database schema changes and application code changes are at different levels. This is a much more challenging problem because the impact of schema changes on the application code is not as direct.

2.3.1 An empirical analysis

There are several studies regarding the first type of evolution mentioned above, data co-evolving with the schema, but there are no studies dealing with the second type, code co-evolving with the schema.

Analyzed Projects

The authors of [32] analysed 10 different projects (database applications) to collect information about schema and code co-evolution. The projects were:

• Coppermine, Web gallery;
• Dotproject, Project management;
• e107, Enterprise CMS;
• Joomla!, CMS;
• MediaWiki, Wiki;
• PrestaShop, online store;
• RoundCube, Webmail;
• Tikiwiki, Wiki/CMS;
• TYPO3, Enterprise CMS;
• webERP, Business management.


Analysis Process

The steps used to extract information about schema evolution and co-change analysis from those 10 projects were the following:

• Locate Schema Files: locate and extract the files with database schema definitions and updates.

• Extract Database Revisions: identify database revisions, these being revisions or commits that change a database schema.

• Extract valid Database Revisions: filter the database revisions to discard those that are not truly related to schema changes, namely those containing only syntax changes, comment changes, format changes, data-sensitive changes, DBMS-sensitive changes, system-related changes or rollbacks.

• Extract atomic changes: extract all schema changes. These schema changes are categorized into six high-level categories [30]: transformation, structure refactoring, referential integrity refactoring, architecture refactoring, data quality refactoring and method refactoring.

• Co-change analysis: analyze the real impact caused by these atomic schema changes by mining a project's version control history.

The atomic changes that fit into the six categories referred to above are the following: add table, add column, add view, drop table, rename table, drop column, rename column, change column data type, drop view, add key, drop key, add foreign key, drop foreign key, add trigger, drop trigger, add index, drop index, add column default value, drop column default value, make column not null, drop column not null, add stored procedure and drop stored procedure.

Project      # Total DB Rev.   # Valid DB Rev.   % Valid/Total   # Atomic Changes   # Atomic/Valid
Coppermine   116               69                59.5%           118                1.7
Dotproject   163               88                54.0%           279                3.2
e107         76                63                82.3%           114                1.8
Joomla!      532               133               25.0%           888                6.7
MediaWiki    377               221               58.6%           892                4.0
PrestaShop   221               203               91.9%           928                4.6
RoundCube    56                45                80.4%           101                2.3
Tikiwiki     941               493               52.4%           2208               4.5
TYPO3        73                58                79.5%           249                4.3
webERP       189               91                48.1%           640                7.0
Total        2744              1464              53.4%           6417               4.4

Table 2.1: Results of DB revision and schema change extraction. [32]

Table 2.1 presents the results of the first two steps: the second column lists the number of database revisions for each project, the third column lists the number of valid database revisions, the fourth shows the ratio of valid over total revisions, the fifth lists the number of atomic changes and the last column lists the average number of atomic changes per valid revision.

2.3.2 Study Results

The previous information about the 10 projects was used to answer three fundamental questions, to understand how database schemas and programs’ code co-evolve in database applications.

How Frequently and Extensively do Schemas Evolve?

This question focuses on how often and how much the schemas change across different versions, to understand how they evolve during an application’s development and maintenance process.

First, how frequently schemas evolve was measured by examining the occurrences of schema changes during each project's life cycle. The average number of valid database revisions and atomic schema changes per release (year) was calculated. It was found that there are around 5 to 25 valid database revisions and 15 to 180 atomic schema changes per release.

Second, how extensively schemas change was measured by examining the trend in schema size. To do so, the number of tables/columns in each valid revision was collected, as seen in Figure 2.1.

Figure 2.1: The evolution trend of tables/columns in the studied projects [32]

To achieve a precise evaluation to answer the proposed question, two metrics were used, Growth Rate (GR) and Change Rate (CR):

GR = \frac{\#\text{ Added Elements} - \#\text{ Deleted Elements}}{\#\text{ Initial Elements}} \qquad (2.1)

CR = \frac{\#\text{ Added Elements} + \#\text{ Deleted Elements}}{\#\text{ Initial Elements}} \qquad (2.2)
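As a small illustration of these two metrics, the snippet below computes GR and CR in Python for made-up numbers; the figures are purely illustrative and do not correspond to any of the studied projects.

# Toy illustration of the Growth Rate (GR) and Change Rate (CR) metrics
# from equations (2.1) and (2.2). The numbers are made up for the example.

def growth_rate(added, deleted, initial):
    """GR = (#added - #deleted) / #initial elements."""
    return (added - deleted) / initial

def change_rate(added, deleted, initial):
    """CR = (#added + #deleted) / #initial elements."""
    return (added + deleted) / initial

# Example: a schema that started with 17 tables, gained 20 and lost 3.
print(growth_rate(20, 3, 17))  # 1.0   -> the schema doubled in size (100% growth)
print(change_rate(20, 3, 17))  # ~1.35 -> total churn relative to the initial size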

The results show that there are on average 65 atomic schema changes per release and 90 atomic schema changes per release, across the ten projects. The size of the schemas in most projects grew significantly: the growth rate of the number of tables in 60% of the projects exceeded 100%. In seven projects, 60% of the schema changes occurred during only about 20% of the projects' life cycle. These results show that database schemas evolve significantly during the development of database applications.

How do Database Schemas Evolve?

This question helps to analyze all possible schema changes in database applications to understand what schema change types usually occur in practice. To answer it, it was necessary to analyze which schema change category each atomic change belongs to.

The results show that three high-level schema change categories covered most schema changes: transformation, structure refactoring and data quality refactoring. At the low level, the most frequent atomic changes were add table, add column and change column data type. Referential integrity constraints, such as foreign keys and triggers, and procedures, such as stored procedures, are rarely used in practice.

How much Application Code has Co-Changed with a Schema Change?

This question helps to quantify the real impact of schema changes on application code. This was performed using change history from the projects' repositories and associating changes in source code with schema changes. The validity of this approach was proven by answering two questions: how many valid database revisions contain the co-change information of schema and code? And how many code-level changes are truly caused by schema changes? The first question is to understand whether and how often schema and code changes are committed together and whether co-change history information is useful. The second question is to further explore the accuracy of co-change information as a means for estimating the code-level impact of a schema change. To answer these two questions, 10% of the valid database revisions (146 revisions) were used.

There are four possible co-change situations:

1. There is no code change in the revision, but there is code evolution caused by the schema changes that was committed in different revisions.
2. The schema changes do not impact the code.
3. The schema changes are responsible for all or some of the code changes.
4. None of the changed code was affected by the schema changes.

Regarding the first question, situation 2 and situation 3 provide effective co-change information. 72% of all valid database revisions provided useful co-change information, and 22% of valid database revisions did not need any code changes, as they fall into the second situation. To answer the second question, a metric, precision, was used:

Precision(revision) = \frac{|\text{Code Changes Caused By Schema Changes} \cap \text{Code Changes}|}{|\text{Code Changes}|} \qquad (2.3)

This metric allows us to estimate how much of the co-change history contains useful information. Any revision that belongs to situation 1 or situation 4 will have a precision of 0%; 27% of valid database revisions do not provide useful co-change information. Over 80% of the valid database revisions belonging to situation 2 and situation 3 have precisions over 60% or even over 70%, and 56% of the revisions have 100% precision.

Schema changes impact the code greatly. For an atomic schema change, developers will need to change about 10 to 100 lines of code on average. For a valid database revision, developers will need to change about 100 to 1,000 lines of code.

2.4 Revision History Patterns

To overcome the previously referred problems, an idea emerged: the development of a recommendation system. At first, there was the need to find out how to develop such an application. In order to give recommendations, we need to understand what these recommendations might be. A way to achieve this is to first find patterns in the data and then process them to obtain the desired recommendations. The authors of [27] mined patterns in order to achieve a similar solution, but in a different context.

The maintenance of correct and consistent links in Wikipedia pages is challenging for two reasons: the pages are often edited by different people, so not all updates performed on one page are properly propagated to other related pages; and the consistency constraints on Wikipedia are often soft and do not need to hold at all times. These problems gave the authors the objective of developing a tool, WiClean, that signals incomplete and inconsistent updates and suggests corrections. This system is based on update patterns mined from the update revision logs of Wikipedia, and it identifies potential errors by signalling updates that deviate from the mined patterns.

2.5 MediaWiki

The focus of this dissertation is on the MediaWiki project. This project is the backbone of one of the most famous and most used websites in the world, Wikipedia, a free online encyclopedia created and edited by volunteers around the world.

MediaWiki is a free server-based software that handles, processes and displays data from Wikipedia’s database. This database is in a constant state of change because thousands or millions of users keep changing and inserting new data into it, making it an interesting case study about schema evolution.


Chapter 3

Case Study

This chapter presents the Information System under study, MediaWiki. A brief introduction to MediaWiki is presented, as well as a study on schema evolution in its repository. It also presents the main objective of this project and the steps taken to achieve it.

3.1 MediaWiki

When Wikipedia [19] was launched, back in January 2001, it was designed to boost the production of content for Nupedia, a free-content and peer-reviewed encyclopedia. Wikipedia [19] used an engine written in Perl, UseModWiki, that stored all pages in individual text files but did not keep records of historical data. In this way, it was impossible to know what changes had been made. Wikipedia [19] was hosted on a single server, and so it was unknown to most search engines and had performance issues, because all its content was stored in a flat-file database due to UseModWiki. All these problems motivated one of its participants to take action and develop a new engine dedicated specifically to Wikipedia. This resulted in an engine called "PHP script" or "PHP wiki", a script written in PHP and backed by MySQL, with the objective of improving Wikipedia's performance.

The increasing traffic, expensive features and limited hardware caused performance issues despite the improvements brought by the PHP script. These issues motivated Lee Daniel Crocker to rewrite the code again; he reorganized the existing architecture for better performance and hacked on the code, adding new features in the process. Even though this restructuring solved some problems in the short term, performance problems soon reappeared. The developers had to choose whether to re-engineer and re-architect all the software from scratch or to continue to improve and tweak the existing code. The base code made most of the developers happy, so they went with the second option, and in 2003 a second server was added. This made it possible to load-balance between the two existing servers, improving the performance and enabling new functionalities.

In the same year, 2003, the Wikimedia Foundation was created, a nonprofit foundation to support and manage Wikipedia's infrastructure and operations. With its creation, and using wordplay on the foundation's name, the "PHP script" was officially named MediaWiki.

In short, MediaWiki is an open-source, server-based, collaborative, data-intensive, scalable and feature-rich wiki implementation that uses PHP to process and display data stored in a database.

The first version of the MediaWiki database schema had only 17 tables, and it stayed that way for the first versions of the project. The necessity to evolve, to establish new functionalities and to correct previous mistakes made the schema grow to 34 tables by 2007, a 100% increase [25]. By 2013 it had grown to 52 tables, the same tables we find in the present day. In the current version, the tables can be grouped into 15 different groups [5]:

• Users: user, user_properties, user_newtalk, actor, bot_passwords;
• Permissions: user_groups, user_former_groups, ipblocks, ipblocks_restrictions, page_restrictions, protected_titles;
• Logging: logging, log_search;
• Tags: change_tag_def, change_tag;
• Recent Changes: recentchanges, watchlist;
• Pages: archive, page, page_props, category, redirect;
• Revisions: revision, slots, slot_roles, content, content_models, text;
• Link Tables: pagelinks, imagelinks, iwlinks, templatelinks, categorylinks, externallinks, langlinks;
• Multimedia: image, filearchive, uploadstash, oldimage;
• Interwiki: sites, site_identifiers, interwiki;
• Caching tables: objectcache, querycache, querycache_info, l10n_cache, querycachetwo;
• Maintenance: updatelog, job;
• Statistics: site_stats;
• Search: searchindex;
• Resource Loader: module_deps.

These tables contain all the data of a website; in the case of Wikipedia [19], they hold over 200 TB of data [20].

3.2 Schema Evolution in Wikipedia

Based on the 171 schema versions that were committed to the SVN repository of MediaWiki between April 2003 and November 2007, the schema evolution was analyzed.


Statistics

As seen in Figure 3.1 and Figure 3.2, the size of the MediaWiki database shows an evident trend of growth. The number of tables had a 100% increase, from 17 to 34, and the number of columns had a 42% increase, from 100 to 142. From time to time there are schema versions that could not be properly installed, for example due to syntax errors; these are represented by the sudden drops displayed in the graphs.

The growth of the schema is due to three main aspects:

1. Performance improvement.

2. Addition of new features.

3. Preservation of the database content history.

Figure 3.1: MediaWiki Number of Tables [25]

Figure 3.2: MediaWiki Number of Columns [25]

As for table lifetimes, they range from very long ones, tables that were alive throughout the entire history of the project, to short ones, for example tables that only lasted for two revisions. The average lifetime of a table is about 103.3 versions, 60% of the total database history.

In the case of column lifetimes, columns lasted for 97.17 versions on average, 56.8% of the total database history. Like table lifetimes, column lifetimes also range from very long to short. This is because the growth of the schema is a constant procedure, so a significant portion of tables and columns had been introduced shortly before the study, explaining the short-lived ones, while the core columns and tables tend to be stable throughout the entire history, explaining the long-lived ones.

Macro-Classification of changes

As shown in Table 3.1, there are 170 evolution steps, divided by type of evolution. The ones categorized as actual schema changes have an impact on the queries, as they modify the schema layout. The evolution of the DBMS engine, indexes and data types does not require any query correction, due to the physical data independence provided by the DBMS.

Type of Change   # of Evolution Steps   % of Evolution Steps
Actual Schema    94                     54.9%
Index/Key        69                     40.3%
Data Type        22                     12.8%
Syntax Fix       20                     11.7%
Rollback         15                     8.8%
Doc Only         13                     7.6%
Engine           6                      3.5%

Table 3.1: Macro-Classification of Schema Changes

Approximately 55% of the evolution steps involve actual schema changes, 40% involve index or key adjustments, 8.8% rollbacks to previous schema versions and 7.6% documentation changes.

Micro-Classification of changes

To better understand the schema evolution in relational databases, schema modification operators (SMOs) were used to classify actual changes to the schemas. These operators express changes to the schema not modeled by previous attempts and also capture the essence of the existing work.

These Schema Modification Operators have a syntax similar to that of SQL DDL and describe typical modifications to the schema of a database as well as the corresponding data migration. The operators take a schema as input and produce a new version of this schema. Table 3.2 presents the different types of SMOs used. The first six operate on tables and the last five operate on columns of a given database schema. These operators can be combined to produce a sequence that describes a complex structural change to a schema. Table 3.2 also shows the distribution of the SMOs and how many times they have been used throughout the entire history of MediaWiki schema evolution.

SMO Type           # of Usages   % of Usages   % per Revision
CREATE TABLE       24            8.9%          14%
DROP TABLE         9             3.3%          5.2%
RENAME TABLE       3             1.1%          1.75%
DISTRIBUTE TABLE   0             0.0%          0%
MERGE TABLE        4             1.5%          2.33%
COPY TABLE         6             2.2%          3.5%
ADD COLUMN         104           38.7%         60.4%
DROP COLUMN        71            26.4%         41.5%
RENAME COLUMN      43            16.0%         25.1%
MOVE COLUMN        1             0.4%          0.58%
COPY COLUMN        4             1.5%          2.33%

Table 3.2: Micro-Classification of Schema Changes Using SMOs

The Impact on the Applications

To better understand how schema evolution affects the front-end application of MediaWiki, there was the need to resort to legacy lab-generated templates, i.e., queries generated and extracted from an old version of MediaWiki. In this set of queries, 77% fail when executed on the newer versions of the project due to schema evolution. In revisions where there is a major evolution of the schema, over 70% of the queries are invalidated.

This shows that the front-end application is highly influenced by the evolution of the schema and calls for better support to overcome this problem, given its strong impact.

3.3 Objective

As referred to in Chapter 1, Information Systems have a critical component that is subject to evolution, their data management core, most of the time implemented using relational database technology. MediaWiki [6] is a notable example, because of the increasing amount of data stored in the database and the necessity to adapt to new functionalities and features.

Main Objective

The main goal of this work is to study and evaluate a solution that gives software developers some kind of prediction of the changes that must be performed when evolving the schemas: the development of a recommendation system that assists with schema evolution, providing the user with recommendations and predictions of changes that can help to prevent inconsistencies in the schemas. In order to do so, it was essential to research how to develop such an application. Which files does schema evolution affect? How can predictions be generated? These are some of the questions that needed an answer to properly develop a recommendation system.

First Step - File Coupling

MediaWiki [6] is a very large project, very active and in constant development. There was the need to gather information about this project, more precisely, about its files. To handle evolution it is required to know what modules tend to change together, so a logical coupling analysis must be done.

Some modules, or files, that are logically coupled have implicit dependencies [3] between them and a change to one of them can lead to a predictable change in the coupled module.

This analysis is important to know which files cause the evolution of the schemas and which files are affected by it.

Second Step - Frequent Patterns

Once the coupled modules that affect the schema structure are discovered, a way to generate recommendations is needed to complete the main objective. What is expected is a prediction of the change that will have to be done next. Mining frequent patterns in data is a good way to achieve this objective, as data mining has been very useful in predictive analytics throughout the years.

The frequent patterns were discovered by mining association rules [10] from the valid database revisions of the MediaWiki [6] project. Data from 221 valid revisions [32] is a good sample from which to mine association rules and obtain predictions to use in the recommendation system.

Third Step - Tool

The final goal of this project is a tool capable of providing the user with useful recommendations regarding schema evolution. Such a tool must be capable of telling the user what changes can be done next, after the ones already made.

Resorting to the mined association rules, the recommendations will be filtered by the different metrics of the rules to provide the user with information about the efficiency and accuracy of such prediction.

The proposed solution for this tool is a recommendation system, developed as a client Java application.

As functional requirements the solution would have to:

• Receive as input different files with operations on database schema; • Obtain predictions from association rules;

• Provide the user with weighted predictions;

These functional requirements are all fulfilled by the final version of the tool. These functionalities will be demonstrated further ahead, in the last chapter.

Use Case Example: A software developer, currently working on a project that uses MediaWiki, wants to make changes to the database schema. The user makes his alterations to the schema and provides the application with the schema file of the previous version, before the modifications, and the modified file. The application will then use the frequent patterns mined from the MediaWiki project and, with these patterns, it will present the user a recommendation, or prediction, of what atomic change should be added. This recommendation is evaluated through a group of metrics that allow the user to have a certain degree of trust in it.

The following figure, Figure 3.3, presents the proposed architecture of the recommendation system.

Figure 3.3: Proposed Solution Application Architecture

Components

The proposed architecture is divided into four main components:

1. Schema Differences Extractor: This component represents a schema difference extractor, and its main function is to detect and return the differences between two different schema files. The result generated in this component then proceeds to the next one.

2. Association Rules: It contains the association rules mined from the MediaWiki [6] project. These rules are composed of two different sets of items, one representing an antecedent and the other a consequent. If an antecedent is found, it implies that the consequent may also be found. This component checks the result from the schema differences extractor and compares it with the antecedent of every mined rule.

3. Patterns: Some rules have a more specific pattern within them; if a rule detected in the previous component contains one of these patterns, a more specific recommendation will be generated in the next component.

4. Recommendations: In this component, the recommendations provided by the system are generated. These recommendations are based on the consequent of each rule with a positive match in the association rules component. They are general predictions, letting the user know what atomic change can be done afterwards. If patterns are found within the rules, the predictions provided to the user will be more specific, providing information not only about the atomic change to make but also about where to apply it, for example the name of the key that might be added and in which table. A simplified sketch of this flow is shown below.
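To make the interaction between these components more concrete, the following Python sketch outlines the overall flow under stated assumptions. It is only an illustration of the architecture described above, not the actual Java implementation; the rule representation, the example rule and the diff_schemas helper are hypothetical.

# Minimal sketch of the recommendation flow (illustrative only; the real tool
# is a Java client application and its rule format may differ).

# Each mined association rule: antecedent -> consequent, plus its metrics.
RULES = [
    # Hypothetical example rule, not taken from the actual mined rule set.
    ({"add column", "drop key"}, {"add key"}, {"confidence": 1.0, "lift": 2.4}),
]

def diff_schemas(old_sql: str, new_sql: str) -> set:
    """Placeholder for the Schema Differences Extractor: it should return the
    set of atomic changes (e.g. {'add column', 'drop key'}) between the two
    schema files. The real extraction logic is not shown here."""
    raise NotImplementedError

def recommend(changes: set):
    """Match the detected changes against every rule's antecedent and return
    the consequents (with their metrics) as recommendations."""
    recommendations = []
    for antecedent, consequent, metrics in RULES:
        if antecedent <= changes:  # antecedent fully contained in the diff
            recommendations.append((consequent, metrics))
    return recommendations

# Usage sketch:
#   changes = diff_schemas(open("tables_old.sql").read(), open("tables.sql").read())
#   for consequent, metrics in recommend(changes):
#       print(consequent, metrics)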


Chapter 4

File Coupling in MediaWiki

This chapter focuses on a temporal analysis of MediaWiki to collect data about the behavior of some variables of the project over time. The objective is to help us understand how the project evolves. The tool used for this analysis, Codemaat, is introduced, followed by the statistical temporal summary of MediaWiki obtained using that tool. Lastly, a file coupling analysis of the MediaWiki project is presented.

4.1 MediaWiki Evolution Analysis

Since its creation in 2003, MediaWiki [6] has been in a constant state of development. This raises a sense of curiosity about what is happening within the project and several questions: how is it evolving, how many developers are working on it, how many files exist and how many commits are made? To answer these questions, a tool, Codemaat [2], was used.

4.1.1 Codemaat

Codemaat [2] is a tool used to mine and analyze data from version-control systems, giving us insight into the development of a project. It analyses the log file of a repository, going through all its history, and returns historical statistics about it, providing information about the number of commits, the number of entities or files, the number of entities changed and the number of authors. This is enabled by using the statistical analysis summary option. There are multiple other options, such as the coupling analysis, which allows us to see which files are coupled the most and the degree of the coupling.
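As an illustration, the Python snippet below sketches how such an analysis can be driven from a script. The git log format, the name of the Codemaat jar and the repository path are assumptions that depend on the installed version, so this should be read as a sketch rather than the exact commands used in this work.

import subprocess

# Produce a git log in the format expected by Codemaat's "git2" parser
# (log format, jar name and paths are assumptions; adjust to the local setup).
with open("mediawiki.log", "w") as log:
    subprocess.run(
        ["git", "log", "--all", "--numstat", "--date=short",
         "--pretty=format:--%h--%ad--%aN", "--no-renames"],
        cwd="mediawiki", stdout=log, check=True)

# Statistical summary: number of commits, entities, entities changed, authors.
subprocess.run(["java", "-jar", "code-maat-standalone.jar",
                "-l", "mediawiki.log", "-c", "git2", "-a", "summary"], check=True)

# Logical (change) coupling analysis: entity, coupled entity, degree, revisions.
subprocess.run(["java", "-jar", "code-maat-standalone.jar",
                "-l", "mediawiki.log", "-c", "git2", "-a", "coupling"], check=True)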

4.1.2 Statistical Temporal Summary

To understand the evolution of MediaWiki [6], its log file was analyzed at three-year intervals, using the Codemaat [3] statistical analysis option. Data was gathered about the number of authors, number of entities, number of entities changed and number of commits.

Starting with the number of authors, as we can see in the second column of Table 4.1, there is an increase in the number of different developers that work on and contribute to this project. In every three-year interval the number of authors increases considerably.

The increase in the number of authors may have an impact on the increase in commits: as seen in the third column of Table 4.1, the number of commits increases as the number of authors grows. This may not be a direct relation, but as the development teams get larger, it is expected that the development of the project accelerates, causing more commits over time.

Year   Authors   Commits   Entities   Entities Changed
2003   16        391       295        990
2006   56        10473     2042       28491
2009   138       26795     4248       112124
2012   275       44347     6954       195183
2015   604       56975     14554      275946
2018   813       68260     24656      359858

Table 4.1: Statistical summary from 2003 to 2018.

The evolution of the project means that more functionalities are added over time, and this implies a significant increase in the number of files present in the project, as can be seen in the fourth column of Table 4.1.

The development of these new functionalities and the upgrading of existing ones is reflected in the number of files that are changed. This number keeps increasing, meaning that the code keeps evolving: in every three-year interval the number of files changed is much higher than in the previous one. The biggest increase can be seen in the first six years of the project: from 2006 to 2009, the number of files changed increased almost fourfold, as seen in the last column of Table 4.1.

4.1.3 File Coupling Analysis

The growth of the MediaWiki project is undeniable; the increasing number of entities and the changes to them make it hard to know which files are related and what a change to one of them affects. To find the files that cause others to change with them, Codemaat [3] was once again used, this time with the coupling option. This option allows the user to find coupling changes between files, giving us information about the entities that are related, how often they change together and the degree to which the two entities are coupled. The higher the degree, the more likely a change to an entity will affect the related entity in a coupled file.

4.2 Results and Discussion

Regarding file coupling, the results in Table 4.2 were obtained after running Codemaat [3] on the log file of the MediaWiki [6] Git repository.

The first column represents the entity under analysis, the second represents the file with which the analyzed entity is coupled, the third shows the degree of coupling, representing the degree of interdependence of two modules as a percentage, and the fourth column gives the average number of revisions of the entity.

These results alone are too vague and do not tell us which files cause schema evolution. Table 4.2 presents 9 examples of file coupling in MediaWiki [6]. There are a total of 1972 coupled modules with a coupling degree above or equal to 30%.

Entity                 Coupled                 Degree   Average Revisions
oojs-ui.js             oojs-ui.svg.css         100      25
oojs-ui-widgets.js     oojs-ui-windows.js      100      35
MessagesKk_cn.php      MessagesKk_kz.php       91       67
Preprocessor_DOM.php   Preprocessor_Hash.php   76       101
MysqlUpdater.php       SqliteUpdater.php       69       172
en.json                qqq.json                69       204
TitleArray.php         UserArray.php           64       13
MessagesDsb.php        MessagesHsb.php         60       248
SearchOracle.php       SearchPostgres.php      47       34

Table 4.2: File Coupling in MediaWiki

To find the coupled modules that affect the schemas, another approach was taken. The first step was to find the files that contain SQL Data Definition Language [17] and save them to a comma-separated values file. The second step was to compare this file with the one containing the file coupling information of MediaWiki, resorting to a script written in Python, and to create a new file containing only the couplings between files that contain SQL Data Definition Language [17].
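The dissertation does not reproduce this script, so the snippet below is only a plausible reconstruction of the filtering step: it assumes a CSV listing the files that contain SQL DDL and the Codemaat coupling output, and keeps only the coupled pairs whose files appear in that list. File names, column names and the exact filtering criterion are assumptions.

import csv

# Files previously identified as containing SQL Data Definition Language
# (e.g. 'CREATE TABLE', 'ALTER TABLE'); one file name per row (assumed layout).
with open("ddl_files.csv", newline="") as f:
    ddl_files = {row[0] for row in csv.reader(f) if row}

# Codemaat coupling output with an assumed header: entity, coupled, degree, average-revs.
with open("coupling.csv", newline="") as f_in, \
     open("ddl_coupling.csv", "w", newline="") as f_out:
    reader = csv.DictReader(f_in)
    writer = csv.DictWriter(f_out, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        # Keep the pair only if both sides are schema-related (DDL) files
        # (whether one or both sides must match is an assumption).
        if row["entity"] in ddl_files and row["coupled"] in ddl_files:
            writer.writerow(row)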

The result of this approach is presented in Table 4.3, which contains the coupled modules that use SQL Data Definition Language [17].

The files presented in Table 4.3 clarify the disruption caused by evolving database schemas. Going in order: in the first row, the files are those that handle the updates to the SQLite [18] and MySQL [11] databases. In the second row, the files contain the classes used to set up MediaWiki [6] with the Oracle [13] and SQLite [18] databases, followed by two PHP [15] files that contain some query language related to Microsoft SQL Server [7] and SQLite [18]. In the fourth row, the two files handle automatic connections to the databases, followed by two benchmark files. In the following rows, the files in the first column are those that handle the updates to their respective databases, Microsoft SQL Server [7], Oracle [13] and Postgres [16], and the files in the second column contain the schema structure of these databases.

Entity                   Coupled               Degree   Average Revisions
MysqlUpdater.php         SqliteUpdater.php     69       172
OracleInstaller.php      SqliteInstaller.php   40       50
DatabaseMssql.php        DatabaseSqlite.php    39       31
DBConnRef.php            IDatabase.php         52       57
bench_del_truncate.php   bench_if_switch.php   52       10
MssqlUpdater.php         mssql/tables.sql      67       55
OracleUpdater.php        oracle/tables.sql     55       77
PostgresUpdater.php      postgres/tables.sql   34       172
tables.sql               updaters.inc          32       402

Table 4.3: File Coupling in MediaWiki in 2018.

The files that are coupled are mostly files that handle updates to different databases and the changes that are made to these will affect the respective database schema files.

4.3 Summary

MediaWiki [6] is a complex project that has been in constant development for over 16 years; new functionalities are added and new authors keep coming, which is reflected in the increase in the number of commits and entities. The file coupling analysis gives us a deeper understanding of what is happening inside the project, providing insight about the hotspots [3], the files that change the most, and allowing us to keep track of which files will be affected by changes to these entities. The coupling analysis provides the pairs of files that are coupled, the degree of this coupling and the number of revisions that affected these files.

To be more precise with the gathered data, schema evolution is what we are looking into. The results of the file coupling analysis were filtered to see which coupled pairs contain commands that affect the database schema. To do so, the focus passed to the files that contain SQL Data Definition Language [17], for example the 'CREATE TABLE' command.

After a careful investigation, most of the files that contain SQL DDL [17] have files coupled to them; these are files that contain the database schemas for different types of databases, namely PostgreSQL [16], Oracle [13] and MySQL [11]. The rest are PHP [15] scripts that allow the authors of the project to update the different databases when needed.

The search for patterns that allow us to better understand and predict schema evolution will focus on the files that were brought to the spotlight by this analysis. In the next chapter these files will be used to mine frequent patterns.


Chapter 5

Mining Association Rules

For the development of a recommendation system, there was the need to discover patterns in order to prevent ruptures caused by schema evolution. The mining of association rules is one of many different methods to mine frequent patterns. Finding these patterns allows us to predict the occurrence of a specific item based on the occurrence of other items in a transaction. In order to find the kind of patterns that we needed, association rules were mined from data gathered from the files presented in the results of the previous chapter.

5.1 Overview

To make it easier to understand what an association rule is, it is best to resort to the story of one of the first use cases [1]. Association rules were first applied by a salesperson who wanted to boost the sales of his store. What he did was simply to bundle products together and apply discounts to these bundles. He began by making obvious bundles, like bread and jam, increasing his revenue and making the customers' lives easier. He then decided to analyze all the sales records to find new and less obvious bundles, and he discovered that most of the clients that bought diapers also bought beer. Diapers and beer: a strange combination, but very effective when thinking about bundle revenue. This is a perfect example of association rules in data mining.

{Diapers} =⇒ {Beer} (5.1)

An association rule consists of an antecedent and a consequent. It is basically a simple if/then statement that allows the discovery of patterns in datasets. These patterns are a relationship between the antecedent (if) and the consequent (then). A consequent is an item found in combination with the antecedent. The rules are mined by resorting to machine learning models that analyze data in search of frequent patterns, i.e., frequent antecedents (if) and their respective consequents (then) are identified. The most important relationships between an antecedent and a consequent are identified by using different metrics to evaluate the rules.

5.2 Methods for Mining Association Rules

5.2.1 Algorithms

In order to retrieve the relations between different itemsets, there are some popular algorithms that mine association rules. The three main ones are AIS [31], SETM [28] and Apriori [29], Apriori being the one used in this dissertation.

Apriori

Apriori is one of the most used algorithms for association rule mining [29]. It uses a level-wise search to explore itemsets, with the objective of mining frequent itemsets from transactional databases for boolean association rules. The first step is known as candidate generation: the frequent sub-itemsets are extended one item at a time, and these groups of candidates are then tested against the data, using a breadth-first search and a hash tree structure. The algorithm identifies frequent items in the database and extends them to larger itemsets. The requirement for these itemsets is that they appear sufficiently often in the database. The final goal of Apriori is to determine and gather the most frequent itemsets in a database. These itemsets can then be used to obtain association rules in order to have a preview of the trends in the database.
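As an illustration of how Apriori can be applied to this kind of data, the sketch below mines rules from a few hand-made "transactions" of atomic schema changes using the mlxtend library. The transactions and thresholds are invented for the example and do not come from the MediaWiki data, and the exact API may vary slightly between mlxtend versions.

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

# Each transaction = the atomic schema changes found in one valid DB revision
# (hand-made toy data, not the real MediaWiki revisions).
transactions = [
    ["add column", "add key"],
    ["add column", "add key", "add index"],
    ["add table", "add column", "add key"],
    ["drop column", "drop key"],
    ["add column", "add key"],
]

# One-hot encode the transactions, then mine frequent itemsets and rules.
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)
frequent = apriori(onehot, min_support=0.4, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.6)

# Each rule comes with its support, confidence, lift and leverage.
print(rules[["antecedents", "consequents", "support", "confidence", "lift", "leverage"]])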

5.2.2 Metrics

The strength of an association rule is understood using the following metrics.

Support

This metric is defined for itemsets, not association rules. The support metric helps us understand about the frequency of a certain itemset in all transactions. When mining associ-ation rules, there will be three different support metrics for an associassoci-ation rule, the antecedent support, the consequent support and the support of the rule. In an association rule, the an-tecedent support calculates the proportion of transactions that contain the anan-tecedent, the consequent support calculates the proportion for the consequent and the rule support calcu-lates the combined support of the antecedent and consequent [22].

In other words, the support expresses the abundance or significance of an itemset in a database. The higher the support, i.e., the closer it is to one, the more significant the itemset, meaning it occurs more frequently than others.

Support(X) = (Transactions containing X) / (Total transactions)    (5.2)

Support(X → Y) = Support(X ∪ Y)    (5.3)


Confidence

The confidence metric is the conditional probability of occurrence of the consequent given the antecedent. When the antecedent is present in a transaction, the confidence gives us the likelihood of occurrence of the consequent. It is the ratio of the number of transactions that include both the antecedent and the consequent to the number of transactions that include the antecedent [22].

Confidence(X → Y) = Support(X → Y) / Support(X)    (5.4)

This metric can sometimes be misleading, because the confidence of an association rule will be high whenever the consequent is very frequent. In these cases it does not matter which antecedent we have: if the consequent is a very frequent itemset, the confidence will always be high, causing an association that at first seems solid to actually be a weak one, misleading us with its confidence value.

Lift

This metric was introduced to overcome the misleading confidence measure. The lift value of an association rule measures the relation between the probability of the rule occurring and its expected probability if the antecedent and consequent were independent. It is the ratio of the confidence of the rule to its expected confidence, where the expected confidence is simply the support of the consequent (the value the confidence would take if antecedent and consequent were independent). It is often used to measure how much more often the antecedent and consequent occur together than we would expect if they were independent [23].

Lift(X → Y) = Confidence(X → Y) / Support(Y)    (5.5)

The value of the lift measure ranges from zero to infinity. If the value is greater than one, the antecedent and consequent appear together more often than expected, meaning that the occurrence of the antecedent has a positive effect on the occurrence of the consequent. If it is smaller than one, the opposite holds: the antecedent and consequent appear together less often than expected, meaning that the occurrence of the antecedent has a negative effect on the occurrence of the consequent. If the value equals one, the antecedent and consequent appear together exactly as often as expected, meaning that the antecedent has no effect on the occurrence of the consequent; in this case the antecedent and consequent are statistically independent.

Leverage

The leverage measure is similar to the lift measure. Both measure the relation between the probability of a rule occurring and its expected probability if the antecedent and consequent were independent of each other. The difference between the lift and leverage measures is that the lift computes the ratio between the confidence and the consequent support, while the leverage computes the difference between the observed support of the rule and its expected support under independence. The leverage tends to find very strong associations, prioritizing antecedents and consequents with higher support in the dataset, unlike the lift, which tends to find those strong associations prioritizing less frequent antecedents and consequents [26].

Leverage(X → Y) = Support(X → Y) − Support(X) × Support(Y)    (5.6)

The values of this measure lie between minus one and one. If the value is equal to zero, the antecedent and consequent are statistically independent. If the value is negative, there is a negative relation between the antecedent and the consequent; on the other hand, if the value is positive, there is a positive relation.

Conviction

The conviction metric is commonly used to measure how often the rule makes an incorrect prediction, by relating the expected frequency with which the antecedent would occur without the consequent (if they were independent) to the frequency actually observed [24].

Conviction(X → Y) = (1 − Support(Y)) / (1 − Confidence(X → Y))    (5.7)

The values range from zero to infinity. A high conviction value means that the consequent depends highly on the antecedent. If the antecedent and consequent are statistically independent, the value will be equal to one. When the confidence equals one (a perfect confidence score), the conviction value is defined as 'inf'.
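To tie the five metrics together, the short sketch below computes support, confidence, lift, leverage and conviction for a single rule X → Y over a toy transaction set, following equations (5.2) to (5.7). The transactions and the example rule are hypothetical and serve only to illustrate the formulas.

    def rule_metrics(transactions, antecedent, consequent):
        """Compute the metrics of the rule X -> Y over a list of transactions (sets of items)."""
        n = len(transactions)
        X, Y = frozenset(antecedent), frozenset(consequent)

        def support(itemset):
            return sum(1 for t in transactions if itemset <= frozenset(t)) / n

        support_x, support_y = support(X), support(Y)
        support_rule = support(X | Y)                       # rule support, eq. (5.3)
        confidence = support_rule / support_x               # eq. (5.4)
        lift = confidence / support_y                       # eq. (5.5)
        leverage = support_rule - support_x * support_y     # eq. (5.6)
        conviction = (float("inf") if confidence == 1       # eq. (5.7)
                      else (1 - support_y) / (1 - confidence))
        return {"support": support_rule, "confidence": confidence,
                "lift": lift, "leverage": leverage, "conviction": conviction}

    # Hypothetical revisions described by the atomic changes they contain.
    revisions = [{"Drop Key", "New Column", "Add Key"},
                 {"Drop Key", "Add Key"},
                 {"New Column"},
                 {"Drop Key", "New Column", "Add Key"}]
    print(rule_metrics(revisions, {"Drop Key", "New Column"}, {"Add Key"}))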

5.3 Finding Association Rules in MediaWiki

In order to find the kind of patterns needed to build a recommendation system, we resorted to a well-established and reliable frequent pattern mining method: association rules.

Schema evolution happens when there are changes that affect the structure of the schema. These changes can be categorized into different types [32]: transformation, structure refactoring, referential integrity refactoring, architecture refactoring, data quality refactoring and method refactoring. Each of these categories includes different types of atomic changes to the schema, each of which corresponds to a certain DDL command in MySQL. Since these are the changes that cause schema evolution, finding patterns in them seemed like the path to take.

Two things were needed: data and a way of mining association rules.

5.3.1 Data collection

To obtain frequent patterns related to schema evolution, data was gathered using a schema difference extraction tool [32]. This tool collected all the atomic changes made throughout all valid database revisions, 221 in the case of MediaWiki. These revisions represent the evolution of the schema throughout its entire history up to 2013, a 9-year period. In this period, 892 schema changes were made. This is the data used to mine frequent patterns. The same data was gathered for 9 other projects: coppermine, dotproject, e107, joomla, prestashop, roundcube, tikiwiki, typo3 and weberp. The data of all these projects was combined into one table, in order to find more global patterns. This table consists of information regarding 1474 different valid database revisions, containing 6416 different atomic changes.

ID    Revision   New Column   Add Key   ...   New Index   Drop Index
1     51         3            5         4
2     102        1
3     112        2
...
219   104687     1
220   107886     1            1
221   107887     1            1

Table 5.1: Atomic changes made in all valid database revisions.

5.3.2 Rules extraction

In order to find frequent patterns in the data presented in Table 5.1, a Python script was developed. This script starts by reading the file that contains the data and then normalizes the values, converting all values greater than or equal to one into ones and filling the empty cells with zeros. The counts of atomic changes are turned into ones because we do not care about the quantity within a transaction, only about the fact that the transaction happened. In some cases, association rules are mined with weights, which make some items more valuable than others; in this case our items, the atomic changes, are all worth the same.
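A minimal sketch of this pre-processing step is shown below, assuming the data of Table 5.1 was exported to a CSV file with one row per valid database revision; the file name and the exact column layout are hypothetical.

    import pandas as pd

    # Hypothetical export of Table 5.1: one row per valid database revision,
    # one column per type of atomic change, cells holding change counts.
    data = pd.read_csv("mediawiki_atomic_changes.csv", index_col=["ID", "Revision"])

    # Normalize: empty cells become 0 and any count greater than or equal to
    # one becomes True, since only the occurrence of a change type matters.
    basket = data.fillna(0).ge(1)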

Using Mlxtend [8], a Python library of machine learning tools, and its module for finding frequent patterns, the apriori method [9] was used to run the Apriori algorithm on the itemsets and identify frequent items in the data, with a minimum support of 1%, since patterns with lower support values have a very reduced presence in the data.

After running Apriori, the method for mining association rules was used to retrieve the respective rules and their metrics [10]. To better filter the obtained rules, only those with a minimum confidence of 40% were kept. This resulted in a table containing all the association rules with a confidence of at least 40% found in the 221 valid database revisions.
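Continuing the sketch above, the frequent itemsets and the rules can be obtained with Mlxtend roughly as follows; the variable and file names are hypothetical, while the thresholds are the ones reported in the text.

    from mlxtend.frequent_patterns import apriori, association_rules

    # Frequent itemsets with a minimum support of 1%.
    frequent_itemsets = apriori(basket, min_support=0.01, use_colnames=True)

    # Association rules filtered by a minimum confidence of 40%; the resulting
    # table includes support, confidence, lift, leverage and conviction.
    rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.4)
    rules.to_csv("association_rules.csv", index=False)

The resulting DataFrame can then be filtered or sorted by any of the metric columns when inspecting individual rules.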

5.4 Results and Discussion

The script was applied to the table containing data only from MediaWiki and to the table that contains data from all the 10 different projects mentioned before.

113 rules were mined for the MediaWiki data and 97 for all the projects.

5.4.1 MediaWiki Rules

Table 5.2 presents 10 examples of different association rules mined exclusively from the MediaWiki project [6], and Table 5.3 presents their respective metrics.

The first four are examples of trustworthy association rules. Their metrics are what one expects to find in rules that make correct predictions: a maximum confidence value and reasonably high lift values, which indicate that the antecedent and consequent of each rule occur together very often.


Rules 16 and 74 are a very curious example of two rules that have the same antecedent but differ in the consequent: both consequents include the addition of a new column, but the consequent of rule 74 has one more element, the addition of a key. They have the same confidence value, but in this case it can be observed how the confidence metric is misleading. Rule 16 has a lift value of 2.93, which is relatively low, while rule 74 has a very high one, 20.27. This means that the consequent of rule 74 occurs much more often together with the shared antecedent than the consequent of rule 16, which makes rule 74 more reliable than rule 16.

ID    Antecedent                                  Consequent
24    Drop Key, New Column                        Add Key
70    Drop Key, New Column, Drop Column           Add Key
73    Add Key, New Column, Drop Column            Drop Key
78    Drop Key, New Column, Change Data Type      Add Key
16    Drop Key, Drop Column                       New Column
74    Drop Key, Drop Column                       New Column, Add Key
32    Drop Index, New Column                      New Index
14    Drop Column, Add Key                        New Column
102   Drop Column, Drop Index                     New Index, Rename Column
6     Add Key                                     Drop Key

Table 5.2: Association Rules in MediaWiki

Rules 32 and 14 do not reach a 100% confidence value and do not have very high lift values either, but their consequents are still highly dependent on the respective antecedents, as shown by their conviction values, indicating that these rules frequently make correct predictions.

Rule 102 is an example of how a rule can have a low confidence value while its antecedent is still very often found together with the respective consequent. The high lift value indicates that the consequent does not have a high support value, meaning that it is not a very frequent consequent and, when it does occur, it is often together with the antecedent of rule 102.

These last three examples raise a sense of curiosity about the instances where these rules are violated, since their respective antecedents and consequents are very dependent on each other.

Rule 6 is a very simple rule with a low confidence value; despite its rather high lift value, its confidence indicates that it is a rule that does not frequently make correct predictions.

5.4.2 Multiple Projects Rules

These rules are useful for making predictions within the MediaWiki project, but they are mined from a rather small dataset, as they were extracted from only 221 valid database revisions. In order to obtain more precise predictions, there was the need to mine a bigger dataset. 97 rules were mined from the data of the 10 different projects and their 1474 valid database revisions. Table 5.4 shows 8 examples of rules mined from this dataset, and Table 5.5 shows their respective metrics.

One of the differences between the association rules mined from MediaWiki only and those mined from the 10 different projects is that the number of rules with 100% confidence is drastically reduced. In the MediaWiki rules there are 24 rules with 100% confidence, while in the data extracted from the 10 different projects there are only four rules with this value.

ID     Support   Confidence   Lift    Leverage   Conviction
24     0.022     1            8.58    0.02       inf
70     0.018     1            8.58    0.02       inf
73     0.018     1            13.12   0.02       inf
78     0.013     1            8.58    0.01       inf
16     0.017     1            2.93    0.01       inf
74     0.017     1            20.27   0.02       inf
32     0.031     0.875        3.25    0.02       5.84
14     0.018     0.8          2.34    0.01       3.29
102    0.013     0.6          44.6    0.01       2.47
6      0.05      0.5          6.55    0.05       1.84

Table 5.3: Association Rules Metrics in MediaWiki

ID    Antecedent                                      Consequent
64    Change Data Type, New Column, Rename Column     Add Key
70    New Column, Drop Key, Rename Column             Add Key
82    Add Key, Drop Column, Rename Column             Drop Key
7     Drop Key                                        Add Key
12    Drop Column, Rename Column                      New Column
32    New Column, Change Default Value                Change Data Type
87    Drop Column, Rename Column                      Drop Key, Add Key
56    Add Key, New Column, Drop Key                   Drop Column

Table 5.4: Association Rules in 10 different projects.

In the presented examples, all the rules have high lift values, especially rule 12. Rule 7 is an example of a rule capable of frequently giving a correct prediction, whereas rule 56 is not so reliable.

5.4.3 Rules violations analysis

Some rules have a very high confidence value, but there are violations that make their confidence fall below 100%. These violations reduce the confidence and turn a rule that is apparently very trustworthy into one that can be trusted less. A violation occurs when the antecedent of a certain rule is found in a valid database revision but its consequent is not. In this section those violations are presented, in order to try to understand why some antecedents and consequents do not appear together in some revisions. To find these violations, every valid database revision where the antecedent of the rule is present is checked for the absence of the consequent.

For example, if a rule has a 90% confidence value and its antecedent is present in 10 valid database revisions, it means that there is one revision that violates the rule, and it is this revision whose violation we are trying to explain.
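One possible way to locate these violating revisions, reusing the binarized table from the pre-processing sketch in Section 5.3.2, is outlined below; the helper function and the chosen rule are hypothetical examples.

    def find_violations(basket, antecedent, consequent):
        """Return the revisions where all antecedent items occur but the consequent does not."""
        has_antecedent = basket[list(antecedent)].all(axis=1)
        has_consequent = basket[list(consequent)].all(axis=1)
        return basket.index[has_antecedent & ~has_consequent]

    # Hypothetical check for rule 32 of Table 5.2: {Drop Index, New Column} -> {New Index}.
    violating_revisions = find_violations(basket, ["Drop Index", "New Column"], ["New Index"])
    print(violating_revisions)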
