Improving bug localization by mining crash reports: an empirical study


Department of Informatics and Applied Mathematics
Graduate Program in Systems and Computing
Academic Master’s Degree in Systems and Computing

Improving Bug Localization by Mining Crash Reports: An Empirical Study

Marcos Alexandre de Melo Medeiros

Natal, Brazil February, 2020


Improving Bug Localization by Mining Crash Reports:

An Empirical Study

A dissertation submitted to the Computer Science Graduate Program of the Center of Exact and Earth Sciences in conformity with the requirements for the Degree of Master in Systems and Computing.

Research line:

Software Engineering

Advisor

Uirá Kulesza

PPgSC – Graduate Program in Systems and Computing
DIMAp – Department of Informatics and Applied Mathematics
CCET – Center of Exact and Earth Sciences
UFRN – Federal University of Rio Grande do Norte

Natal, Brazil February, 2020


Medeiros, Marcos Alexandre de Melo.

Improving bug localization by mining crash reports: an empirical study / Marcos Alexandre de Melo Medeiros. - 2020. 85f.: il.

Dissertation (Master's) - Federal University of Rio Grande do Norte, Center of Exact and Earth Sciences, Graduate Program in Systems and Computing. Natal, 2020.

Advisor: Uirá Kulesza.

1. Software engineering - Dissertation. 2. Software failure - Dissertation. 3. Correlation between failures - Dissertation. 4. Fault localization - Dissertation. 5. Crash report - Dissertation. 6. Stack trace - Dissertation. I. Kulesza, Uirá. II. Title.

RN/UF/CCET CDU 004.41

Cataloging-in-Publication. UFRN - Biblioteca Setorial Prof. Ronaldo Xavier de Arruda - CCET


Thanks, family! I will be forever grateful for all the love, support and comprehension. Sorry for the moments of absence.

Thanks to all my professors for the knowledge disseminated. Thanks to my advisor for all the help and patience.

Thanks to the IT Superintendency of the Federal University of Rio Grande do Norte for their support and trust.

Thanks to my friends for encouragement.


Improving Bug Localization by Mining Crash Reports: An Empirical Study

Author: Marcos Alexandre de Melo Medeiros
Advisor: Uirá Kulesza

Abstract

The information available in crash reports has been used to understand the root cause of bugs and improve the overall quality of systems. Nonetheless, crash reports often produce a huge amount of information, making it necessary to apply techniques that consolidate the crash report data into groups according to a set of well-defined criteria. In this dissertation, we contribute a customization of rules that automatically find and group correlated crash reports (according to their stack traces) in the context of large-scale web-based systems. We select and adapt some existing approaches described in the literature on crash report grouping and on ranking the files suspected of crashing the system. Next, we design and implement a software tool to identify and rank buggy files using stack traces from crash reports. We use our tool and approach to identify and rank buggy files, that is, files that are most likely to contribute to a crash and thus need a fix. We evaluate our approach by comparing two sets of classes and methods: the classes (methods) that developers changed to fix a bug and the suspected buggy classes (methods) that are present in the stack traces of the correlated crash reports. Our study provides new evidence of the potential of crash report groups to correctly indicate buggy classes and methods present in stack traces. For instance, we successfully identify a buggy class with recall varying from 61.4% to 77.3% and precision ranging from 41.4% to 55.5%, considering the top 1, top 3, top 5, and top 10 suspicious buggy files identified and ranked by our approach. The main implication of our approach is that developers can locate and fix the root cause of a crash report by considering a few classes or methods, instead of having to review thousands of assets.


Crash Reports: An Empirical Study

Author: Marcos Alexandre de Melo Medeiros
Advisor: Uirá Kulesza

Abstract (Portuguese version, translated)

The information available in crash reports is being used to understand the root cause of errors and improve the overall quality of systems. However, these reports often lead to a huge amount of information, making it necessary to apply techniques that consolidate the data into groups according to a set of well-defined criteria. In this dissertation, we contribute a customization of rules that automatically locate and group correlated crash reports (according to their stack traces) in the context of large-scale web systems. To this end, we select and adapt some approaches described in the literature on crash report grouping and on ranking the files suspected of crashing the system. Next, we design and implement a software tool to identify and rank buggy files using the stack traces present in crash reports. We use our tool and our approach to identify and rank buggy files, that is, files that are most likely to cause a crash and that therefore need a fix. We evaluate our approach by comparing two sets of classes and methods: the classes (methods) that developers changed to fix a bug and the classes (methods) suspected of containing a bug among those present in the stack traces of the correlated crash reports. Our study provides new evidence of the potential of using crash report groups to correctly indicate buggy classes and methods among those present in the stack traces. For example, we successfully identify a buggy class with recall ranging from 61.4% to 77.3% and precision ranging from 41.4% to 55.5%, considering the top 1, 3, 5, and 10 suspicious files identified and ranked by our approach. The main implication of our approach is that developers can locate and fix the root cause of a crash report by considering a few classes or methods, instead of reviewing thousands of assets.


1 Stack trace example
2 Software idea
3 Urupema Architecture
4 Urupema Feature Model
5 Urupema Class Diagram
6 Urupema product generation
7 Coverage variant idea
8 Study detail screen
9 Issue tracker connection detail screen
10 Issue detail screen
11 Log group detail screen
12 Suspicious Files
13 Calc Top N form screen
14 Top N result - start
15 Top N result - end
16 Top N result - called methods
17 Study overview
18 Crash report grouping levels by stack traces
19 Linking issue with crash report group: equivalent signature without messages
20 Mean of Recall and MAP
21 [...] three modified files
22 Frequency of changed methods in crash report groups
23 Presence of changed file methods in crash report groups


1 Target Systems Characterization
2 Issues
3 Crash reports
4 Pairs formed with Issues and Crash Report Groups
5 Distinct Issues Found


WER – Windows Error Reporting
URI – Uniform Resource Identifier
SINFO – Superintendência de Informática
SIGAA – Sistema Integrado de Gestão de Atividades Acadêmicas
SIPAC – Sistema Integrado de Patrimônio, Administração e Contratos
SIGRH – Sistema Integrado de Recursos Humanos


1 Introduction
  1.1 Problem statement
  1.2 Current research limitations
  1.3 Dissertation proposal
  1.4 General and Specific Objectives
  1.5 Dissertation organization

2 Background
  2.1 Crash Reports and Crash Types
  2.2 Stack Traces
  2.3 Crash Report Clustering
    2.3.1 Clustering rules by Wang et al. (WANG; KHOMH; ZOU, 2016)
  2.4 Crash Fault Localization
    2.4.1 Discriminative factors by Wu et al. (WU et al., 2014)

3 Urupema
  3.1 Architecture
  3.2 Possible usage scenarios
    3.2.1 Bug prioritization
    3.2.2 Software testing prioritization and management
    3.2.3 Bug Code Identification
  3.3 Features Overview
    3.3.2 Issue tracker connection detail
    3.3.3 Issue detail
    3.3.4 Log group detail
    3.3.5 Suspicious files
    3.3.6 Calculate Top N
  3.4 Limitations
  3.5 Related Tools
  3.6 Summary

4 Empirical Study
  4.1 Introduction
  4.2 Study Settings
    4.2.1 Target Systems
    4.2.2 Study Procedures
    4.2.3 Data Collection
  4.3 Results and Discussion
    4.3.1 First Assessment: Coarse-grained level
    4.3.2 Second Assessment: Fine-grained level
  4.4 Discarding Suspicious Files and Stack Traces
  4.5 Recommendations for Approach Improvement
  4.6 Threats to validity
  4.7 Related Work
  4.8 Summary

5 Conclusions
  5.3 Future work

References


1 Introduction

The demand for software that continually evolves its functionalities, with frequent updates and without introducing new defects, increases every day (BOSCH, 2012). Additionally, users require quicker fixes for reported faults. Software testing helps maintain code quality and identify errors faster and more efficiently, mainly when the build process includes automated test execution (FOWLER, 2006). However, it is difficult to achieve 100% coverage of software code, due to the high volume of work required and the prohibitive cost of reproducing a test environment perfectly identical to production (CUKIER, 2013). Even when software tests cover a large part of the system, errors still occur in the production environment. These failures occur for a variety of reasons, such as hardware failures, communication problems, a load higher than anticipated, and input data different from those expected and tested.

The use of information collected in production can fill gaps that software testing does not address, such as the need to reproduce the bug, by providing information about the application context at crash time (user, input data, called methods) or by making it possible to notice, in real time, any anomaly in critical system operations (user logins, order submissions, user sign-up) (CUKIER, 2013). Organizations typically monitor their production environments and record software failures in log files for later analysis. Some use software (e.g., Bugzilla¹) to store and track issues reported manually by users or automatically by the system (e.g., Firefox² has the built-in Mozilla Crash Reporter³). Typically, these reports contain the stack of methods called by the thread that failed, an identifier, the timestamp of the crash, information about the execution environment (e.g., operating system, version, user), and, when manually reported, comments made by users (WANG; KHOMH; ZOU, 2016) (AN; KHOMH, 2015).

Developers commonly look for the method where the crash occurred in the stack trace to investigate the cause of the problem in the source code of the system. They typically analyze variable assignments, parameters, called methods, and more. When they cannot identify the problem in the investigated method, they analyze the immediately preceding method in the call stack. They continue this process until they find the cause of the problem or have checked all methods present in the stack trace. This is often a tedious and time-consuming task.

¹ https://www.bugzilla.org/
² https://www.mozilla.org/firefox

The difficulty level increases when the number of crash reports is high since it is necessary to choose what should be investigated first. If there are no selection criteria, a single-time failure may be analyzed before another that is affecting multiple parts of the system and hundreds of users. Thus, crash reports must be analyzed and grouped by bugs to facilitate prioritization, error identification, and correction. However, manually analyzing a large volume of crash reports is impractical.

1.1 Problem statement

Software crashes annoy users and can cause disasters in safety-critical systems; for this reason, they are considered severe bugs. Therefore, tools are developed and embedded in software systems to automatically collect and send information about system crashes, facilitating their identification and correction.

The problem is that these tools often collect a large number of crash reports. For example, Mozilla Firefox receives millions of crash reports every day (WANG; KHOMH; ZOU, 2016), and Windows Error Reporting (WER) collected billions of crash reports during ten years of operation (WU et al., 2018). It may be costly for the development team to analyze a large number of reported errors, and it may take a long time to identify and correct the causes of the problem.

In this context, it is challenging to facilitate the task of analyzing crash reports by providing more relevant, accurate, and faster information to the development team. Automatically grouping crash reports related to the same bug is one way to help developers, for example, in the prioritization task. Another strategy to reduce the effort to identify and solve the issue is to suggest source code snippets that have a high potential to contain the bug that generated the crash report.

The goal of this dissertation is to reduce the effort required to aggregate correlated crash reports, based on the similarity of their stack traces, and to locate buggy files by analyzing a large number of crash reports.


1.2 Current research limitations

Recent research has investigated how to find and correct failures based on stack traces. Dhaliwal et al. (DHALIWAL; KHOMH; ZOU, 2011) proposed a new way of grouping crash reports automatically submitted by the Firefox browser, based on the similarity of their stack traces measured with the Levenshtein distance⁴, reducing the time spent correcting the faults by 5%.

Wang et al. (WANG; KHOMH; ZOU, 2013) (WANG; KHOMH; ZOU, 2016), motivated by the fact that a failure has the potential to affect several parts of the application, proposed five rules to automatically identify correlated crash types. They conducted an empirical study with data from Firefox and Eclipse in which they identified correlated crash types with a precision of 91% and 76%, respectively. In addition to grouping correlated crash reports, they developed a methodology to identify buggy files based on the dimensions collected for each file present in the stack traces. When examining the first three suggested files, the recall was 62% and the precision was 42% for Firefox; for Eclipse, the precision was 52% and the recall was 50%. When examining the first ten suggested files, the recall increased to 92% for Firefox and 90% for Eclipse.

Wu et al. (WU et al., 2014) also studied the identification of faulty code. They proposed a method called CrashLocator to locate defective functions related to crash report groupings using crash stack information. They developed approaches to expand the crash stack using static analysis, and mathematical equations to rank suspicious functions. They located 50.6%, 63.7%, and 67.5% of failures by examining the top 1, 5, and 10 functions recommended by CrashLocator, respectively.

All of these previous works applied their proposed approaches exclusively to open-source systems. No prior work has investigated the impact of stack trace grouping in industrial software systems.

1.3 Dissertation proposal

This work also retrieves information present in stack traces to suggest a list of potentially buggy files to help development teams fix software bugs. We use a set of crash reports as input and produce, as outputs, crash report groups and a ranking of suspicious files for each generated group. In this way, the development team can find the root cause of the issue by analyzing a few files (out of all system assets) and crash reports. We group the crash reports based on the first three rules suggested by Wang et al. (WANG; KHOMH; ZOU, 2016). The ranking of suspicious files uses the ideas proposed by Wu et al. (WU et al., 2014). The main difference from previous research is that we apply our approach to large-scale industrial web information systems. Other differences arise because we adapted some of the proposed approaches to make the study feasible.

We analyzed tasks registered in the issue tracker (a Redmine⁵ customization) for three web information systems from the Federal University of Rio Grande do Norte (UFRN) to identify correlated stack traces and the files changed in bug fixes. We aggregate crash reports registered during the same period using their stack traces. The crash reports were automatically collected in the production environment and stored in Elasticsearch⁶ along with other information related to the crash (date, user, system, and others). We linked bug fix tasks to crash report groups to measure the precision and recall of the suspicious file list that we generate for each group.

1.4 General and Specific Objectives

Our objective is to verify whether the results for industrial web information systems are similar to those reported in previous work. We also check whether the methods changed to fix the bugs appear in the stack traces of the crash report groups. To enable the study, we developed a tool called Urupema to speed up the analysis of the desired amount of logs and minimize human error.

In this context, the specific objectives of this work are:

• Investigating the main stack-trace-based approaches to crash report grouping and to ranking files suspected of causing malfunctions, in order to know which techniques are currently used and to select the most appropriate ones for this study;

• Designing and implementing the Urupema tool to assist the study execution;

• Applying the proposed approach and tool in a real context to compare the results obtained with previous studies, as well as the benefits and limitations found.

⁵ https://www.redmine.org/


1.5 Dissertation organization

The remainder of this dissertation is organized as follows: Chapter 2 gives an overview of crash reports, crash types, and stack traces. Chapter 3 presents an overview of the Urupema tool, developed to perform analyses using log information. Chapter 4 describes the empirical study performed on three systems of the Federal University of Rio Grande do Norte. Finally, Chapter 5 presents the main contributions, limitations, and future work.


2 Background

The runtime information (operating system, user, error messages, and others) of a software system at the time of failure is important because it can help developers identify and solve software bugs (AN; KHOMH, 2015). Issue tracking systems (e.g., Bugzilla, Redmine¹, GitLab²) are used by companies to store and track bugs (WANG; KHOMH; ZOU, 2016), allowing users and developers to submit reports of perceived failures and include related information, such as stack traces, system versions, and failure time. However, not all users report the problems they encounter, and those who do may not submit the information needed to find the cause.

2.1 Crash Reports and Crash Types

A software crash is one of the most severe manifestations of software failure and therefore has a high correction priority (WU et al., 2014). To minimize the lack of information, many companies automatically collect and send data from the runtime environment, including the stack trace, at the time of the software breakdown (e.g., Mozilla Crash Reporter, Windows Error Reporting, Apple Crash Reporter). Automatically submitted crash reports are stored in repositories (e.g., Mozilla Crash Report Server, NetBeans exception reporter, Elasticsearch) maintained by software organizations. These data can then be processed and grouped into crash types by similarity. Ideally, each group should correspond to a unique failure. Depending on the software used to track bugs and group crash reports, it is possible to link bugs and crash types so that one can navigate from one to the other. Many bugs can be linked to a single crash type, and multiple crash types can be connected to a single bug (AN; KHOMH, 2015). Usually, a crash report contains a signature, a stack trace, a timestamp, user environment variables, operating system, version, software name, and installation date (WU et al., 2014) (WANG; KHOMH; ZOU, 2016).

¹ http://www.redmine.org/
² https://gitlab.com/
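To make the typical contents of a crash report concrete, the following minimal sketch models a report as an immutable Java value. The type and field names (`CrashReport`, `StackFrame`, `softwareVersion`, and so on) are hypothetical and chosen for illustration only; they cover a subset of the fields listed above and do not reproduce the schema of any crash reporting tool.

```java
import java.time.Instant;
import java.util.List;
import java.util.Map;

/** One frame of a stack trace: qualified method name, file name, and line number. */
record StackFrame(String qualifiedMethod, String fileName, int line) {}

/** A crash report carrying a subset of the usual fields (all names are hypothetical). */
record CrashReport(
        String signature,
        List<StackFrame> stackTrace,
        Instant timestamp,
        String operatingSystem,
        String softwareName,
        String softwareVersion,
        Map<String, String> environment) {

    /** Defensive copies keep a report immutable once it has been collected. */
    CrashReport {
        stackTrace = List.copyOf(stackTrace);
        environment = Map.copyOf(environment);
    }
}
```

Keeping the stack trace as an ordered list of frames is a design choice that matches how the grouping rules discussed later in this chapter consume it.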

2.2 Stack Traces

A stack trace is an ordered set of frames $\langle F_1, F_2, \ldots, F_n \rangle$. Each frame $F_i$ consists of the qualified name of a method (formed by package, class, and method name), a file name (fileName), and a line number (line): $F_i = \mathit{qMethodName}\,|\,(\mathit{fileName}\,|\,{:}\,|\,\mathit{line})$, where $i \in \{1, \ldots, n\}$ is the position of frame $F_i$ in the stack trace. The last executed frame (the most recent) is at the top, and the first (oldest) is at the bottom. The frame at the top of the stack is called the crash point, and the file named in that frame is called the Top Frame File. Fig. 1 illustrates an example of a stack trace adapted from the Java Language documentation (method Throwable.printStackTrace()).

Figure 1: Stack trace example

It is worth noting that the same sequence of method calls can come from two distinct source codes that generate similar stack traces. Listings 2.1 and 2.2 execute the same sequence of methods as Fig. 1, but generate the different stack traces shown in Listings 2.3 and 2.4, respectively. Whenever we parse a stack trace, we put the innermost method at the top, even when the trace is similar to the example in Listing 2.4. In this respect, our work may differ from others. Analyzing Fig. 4 of Wang et al. (WANG; KHOMH; ZOU, 2016), for example, it is possible that they considered the lines "HighLevelException: MidLevelException: LowLevelException at Junk.a (Junk.java:13)" as the top of the stack for Listing 2.4 instead of the lines "Caused by: LowLevelException at Junk.e (Junk.java:30)".

Listing 2.1: Java code example A

    public class Junk {
        public static void main(String args[]) throws LowLevelException {
            a();
        }
        static void a() throws LowLevelException {
            b();
        }
        static void b() throws LowLevelException {
            c();
        }
        static void c() throws LowLevelException {
            d();
        }
        static void d() throws LowLevelException {
            e();
        }
        static void e() throws LowLevelException {
            throw new LowLevelException(); // crash point (line 18)
        }
    }

    class LowLevelException extends Exception {}

Listing 2.2: Java code example B

    public class Junk {
        public static void main(String args[]) {
            try {
                a();
            } catch (HighLevelException e) {
                e.printStackTrace();
            }
        }
        static void a() throws HighLevelException {
            try {
                b();
            } catch (MidLevelException e) {
                throw new HighLevelException(e); // wraps the cause (line 13)
            }
        }
        static void b() throws MidLevelException {
            c();
        }
        static void c() throws MidLevelException {
            try {
                d();
            } catch (LowLevelException e) {
                throw new MidLevelException(e); // wraps the cause (line 23)
            }
        }
        static void d() throws LowLevelException {
            e();
        }
        static void e() throws LowLevelException {
            throw new LowLevelException(); // innermost cause (line 30)
        }
    }

    class HighLevelException extends Exception {
        HighLevelException(Throwable cause) { super(cause); }
    }
    class MidLevelException extends Exception {
        MidLevelException(Throwable cause) { super(cause); }
    }
    class LowLevelException extends Exception {}

Listing 2.3: Stack trace generated by Listing 2.1

    HighLevelException: MidLevelException: LowLevelException
            at Junk.a(Junk.java:6)
            at Junk.main(Junk.java:3)
    Caused by: MidLevelException: LowLevelException
            at Junk.c(Junk.java:12)
            at Junk.b(Junk.java:9)
            at Junk.a(Junk.java:6)
            ... 1 more
    Caused by: LowLevelException
            at Junk.e(Junk.java:18)
            at Junk.d(Junk.java:15)
            at Junk.c(Junk.java:12)
            ... 3 more

Listing 2.4: Stack trace generated by Listing 2.2

    HighLevelException: MidLevelException: LowLevelException
            at Junk.a(Junk.java:13)
            at Junk.main(Junk.java:4)
    Caused by: MidLevelException: LowLevelException
            at Junk.c(Junk.java:23)
            at Junk.b(Junk.java:17)
            at Junk.a(Junk.java:11)
            ... 1 more
    Caused by: LowLevelException
            at Junk.e(Junk.java:30)
            at Junk.d(Junk.java:27)
            at Junk.c(Junk.java:21)
            ... 3 more
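The parsing convention described in this section, placing the innermost method at the top even for chained traces, can be sketched as follows. This is a minimal illustration and not the Urupema implementation: the class name `StackTraceParser`, the regular expression, and the output format `method(file:line)` are assumptions, and frames without a `file:line` location (e.g., native methods) are simply skipped.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class StackTraceParser {
    // Matches "Caused by:", a frame "at pkg.Class.method(File.java:12)", or "... 3 more".
    private static final Pattern TOKEN = Pattern.compile(
            "(Caused by:)|\\bat\\s+([\\w.$<>]+)\\(([^:()]+):(\\d+)\\)|\\.\\.\\.\\s+(\\d+)\\s+more");

    /** Returns frames as "method(file:line)", innermost (crash point) first. */
    static List<String> parse(String trace) {
        List<String> enclosing = new ArrayList<>(); // fully expanded previous section
        List<String> current = new ArrayList<>();   // section currently being read
        Matcher m = TOKEN.matcher(trace);
        while (m.find()) {
            if (m.group(1) != null) {
                // A more deeply nested cause starts: what was read so far encloses it.
                enclosing = current;
                current = new ArrayList<>();
            } else if (m.group(2) != null) {
                current.add(m.group(2) + "(" + m.group(3) + ":" + m.group(4) + ")");
            } else {
                // "... n more": the last n frames are shared with the enclosing trace.
                int n = Math.min(Integer.parseInt(m.group(5)), enclosing.size());
                current.addAll(enclosing.subList(enclosing.size() - n, enclosing.size()));
            }
        }
        return current;
    }
}
```

Applied to the trace in Listing 2.4, this sketch yields Junk.e(Junk.java:30) as the first frame, followed by the frames of d, c, b, a, and main.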


2.3 Crash Report Clustering

Several researchers have conducted studies to automatically identify and aggregate duplicate or correlated crash reports to reduce development team efforts. One of the primary purposes of grouping crash reports is to facilitate the prioritization of crash types during screening, indicating which crashes should be fixed first.

Podgurski et al. (PODGURSKI et al., 2003) proposed an approach for classifying reported software failures to facilitate prioritization and diagnosis of their causes. Khomh et al. (KHOMH et al., 2011) proposed an entropy-based approach to identify crash types from crash frequencies and the distribution of occurrences among the users of a system.

Kim et al. (KIM et al., 2011) investigated the crash report databases of Firefox and Thunderbird and found that only 10 to 20 crashes account for the vast majority of crash reports. They developed a machine-learning methodology to predict whether a crash will be a top crash the first few times it is reported, allowing quick resolution of the most important crashes before a new release of a software system.

Dang et al. (DANG et al., 2012) developed the ReBucket method, which measures the similarities of call stacks in crash reports and then assigns the reports to the appropriate bucket based on the similarity values. They achieved better performance results than existing methods.

Wang et al. (WANG; KHOMH; ZOU, 2013)(WANG; KHOMH; ZOU, 2016) proposed five rules for automatically identifying correlated crash types and duplicate or related bug reports: Crash Type Signature Comparison (i.e., Rule 1), Top Frame Comparison (i.e., Rule 2), Frequent Closed Ordered Sub-Set Comparison (i.e., Rule 3), Time-based Co-occurrence of Crash Types Comparison (i.e., Rule 4) and Textual Similarity of Crash Types Comparison (i.e., Rule 5). The first three rules use the information present in the reported stack traces, while the other two use crash time and textual similarity between user comments. They obtained good precision measures by analyzing data from Firefox and Eclipse.

2.3.1 Clustering rules by Wang et al. (WANG; KHOMH; ZOU, 2016)

This section describes in detail the five rules for identifying crash correlation groups that Wang et al. proposed after manually analyzing 40 randomly selected Firefox crash types.


Some definitions are essential for understanding the rules. A crash type signature is defined as $S = P_1|P_2|\ldots|P_n$, where each element $P_i = \langle file_i \rangle \langle op_i \rangle \langle meth_i \rangle \langle param_i \rangle \langle memloc_i \rangle$. $P_j$ contains $P_i$ if $(file_i = file_j) \wedge \{op_i, meth_i, param_i\} \subseteq \{op_j, meth_j, param_j\}$.

Rule 1 (Crash type signature comparison) identifies similarities between the signatures of correlated crash types. More specifically, it compares the strings of the signatures of two crash types and uses the contains relation to decide if they are correlated. Given two crash types $CT_A$ and $CT_B$ with signatures $S_A$ and $S_B$ respectively, $CT_A$ and $CT_B$ are correlated if $S_A \subset S_B$ or $S_B \subset S_A$. They define a contains relation between crash type signatures as follows.

Let $S_A$ and $S_B$ be two crash type signatures where $S_A = P_1^A|P_2^A|\ldots|P_n^A$ and $S_B = P_1^B|P_2^B|\ldots|P_m^B$, with $P_i^A = \langle file_i^A \rangle \langle op_i^A \rangle \langle meth_i^A \rangle \langle param_i^A \rangle \langle memloc_i^A \rangle$, $P_j^B = \langle file_j^B \rangle \langle op_j^B \rangle \langle meth_j^B \rangle \langle param_j^B \rangle \langle memloc_j^B \rangle$, $i \in \{1, \ldots, n\}$, $j \in \{1, \ldots, m\}$, and $m \geq n$.

$S_A \subset S_B$ if $\forall P_i^A, i \in \{1, \ldots, n\}, \exists j \in \{1, \ldots, m\} \mid P_j^B \text{ contains } P_i^A$.
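The contains relation and the resulting correlation test of Rule 1 can be sketched in a few lines of Java. This is an illustrative reading of the definitions above, not the implementation of Wang et al.; the class and method names (`SignatureRule`, `Element`, `subsumedBy`) are hypothetical, and the set inclusion is applied literally to the values of op, meth, and param.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class SignatureRule {
    /** One signature element P; memloc is carried but not used by the contains relation. */
    record Element(String file, String op, String meth, String param, String memloc) {
        Set<String> parts() {
            return new HashSet<>(Arrays.asList(op, meth, param));
        }
    }

    /** P_j contains P_i: same file, and {op_i, meth_i, param_i} is a subset of P_j's parts. */
    static boolean contains(Element pj, Element pi) {
        return pi.file().equals(pj.file()) && pj.parts().containsAll(pi.parts());
    }

    /** S_A is included in S_B: every element of S_A is contained in some element of S_B. */
    static boolean subsumedBy(List<Element> sa, List<Element> sb) {
        return sa.stream().allMatch(pi -> sb.stream().anyMatch(pj -> contains(pj, pi)));
    }

    /** Rule 1: two crash types are correlated if either signature is included in the other. */
    static boolean correlated(List<Element> sa, List<Element> sb) {
        return subsumedBy(sa, sb) || subsumedBy(sb, sa);
    }
}
```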

Rule 2 (Top frame comparison) compares the fully qualified file names of the top frames of two crash types to verify if the crash types are correlated. When two crash types have the same fully qualified file name in their top frame, the two crash types are correlated.

Given two crash types $CT_A$ and $CT_B$ with top frames $F_1^A = \mathit{methSign}_1^A | \mathit{qfileName}_1^A$ and $F_1^B = \mathit{methSign}_1^B | \mathit{qfileName}_1^B$, respectively, $CT_A$ and $CT_B$ are correlated if $\mathit{qfileName}_1^A = \mathit{qfileName}_1^B$, ignoring file extensions.
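A minimal sketch of the Rule 2 comparison is shown below. It assumes that a fully qualified file name is a path-like string such as `src/widget/Window.cpp`; that format, as well as the names `TopFrameRule` and `stripExtension`, are assumptions made for this example.

```java
final class TopFrameRule {
    /** Removes the extension of the last path segment, if there is one. */
    static String stripExtension(String qualifiedFileName) {
        int slash = Math.max(qualifiedFileName.lastIndexOf('/'),
                             qualifiedFileName.lastIndexOf('\\'));
        int dot = qualifiedFileName.lastIndexOf('.');
        // Only a dot located after the last separator belongs to the file name itself.
        return dot > slash ? qualifiedFileName.substring(0, dot) : qualifiedFileName;
    }

    /** Rule 2: correlated when the top frame files match, ignoring file extensions. */
    static boolean correlated(String topFrameFileA, String topFrameFileB) {
        return stripExtension(topFrameFileA).equals(stripExtension(topFrameFileB));
    }
}
```

Under this reading, `src/widget/Window.cpp` and `src/widget/Window.h` are treated as the same top frame file, while files with the same name in different directories are not.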

Rule 3 (Frequent closed ordered sub-set comparison) analyzes the other subsequent frames in the stack traces of a crash type to further improve the identification of crash type correlations. For this, they introduce the concept of closed ordered sub-sets of frames for crash types.

Let $ST$ be a set of stack traces $\{T_1, T_2, \ldots, T_p\}$, where $p$ is the number of stack traces in the set, $T_i = \langle F_1^i, F_2^i, \ldots, F_{n_i}^i \rangle$, $F_j^i = \mathit{methSign}_j^i | \mathit{qfileName}_j^i$, $j \in \{1, \ldots, n_i\}$, $n_i$ is the number of frames in $T_i$, and $i \in \{1, \ldots, p\}$.

Given an ordered set of frames $SubF = \langle G_1, \ldots, G_m \rangle$, for each $T_i$, $i \in \{1, \ldots, p\}$, if $\exists k, l$ with $1 \leq k \leq l \leq n_i \mid (G_1 = \mathit{qfileName}_k^i) \wedge \ldots \wedge (G_m = \mathit{qfileName}_l^i)$, then $SubF$ is an ordered sub-set of frames of $T_i$. The value of each frame in $SubF$ is a Fully Qualified File Name.


as an ordered sub-set of frames of $ST$. $SubF$ is a closed ordered sub-set of frames of $ST$ if there is no other ordered sub-set of frames of $ST$ containing $SubF$.

The absolute support of $SubF$ is the number of $i \in \{1, \ldots, p\}$ such that $SubF$ is an ordered sub-set of frames of $T_i$. The relative support of $SubF$ is the absolute support divided by $p$. This relative support is the frequency of $SubF$ in $ST$. They consider an ordered sub-set of frames as frequent if its relative support is greater than 0.5.
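The support computation can be sketched as follows, with each stack trace reduced to its list of fully qualified file names. The sketch assumes that an ordered sub-set must preserve frame order but may skip frames in between; this is one reading of the definition above, and the class and method names are hypothetical.

```java
import java.util.List;

final class OrderedSubsetSupport {
    /** True if sub occurs in trace in the same order (gaps allowed: an assumption). */
    static boolean isOrderedSubset(List<String> sub, List<String> trace) {
        int matched = 0;
        for (String file : trace) {
            if (matched < sub.size() && sub.get(matched).equals(file)) {
                matched++;
            }
        }
        return matched == sub.size();
    }

    /** Relative support: the fraction of traces in which sub is an ordered sub-set. */
    static double relativeSupport(List<String> sub, List<List<String>> traces) {
        if (traces.isEmpty()) {
            return 0.0;
        }
        long absolute = traces.stream().filter(t -> isOrderedSubset(sub, t)).count();
        return (double) absolute / traces.size();
    }

    /** Frequent means a relative support strictly greater than 0.5. */
    static boolean isFrequent(List<String> sub, List<List<String>> traces) {
        return relativeSupport(sub, traces) > 0.5;
    }
}
```

For three traces ⟨A, B, C⟩, ⟨A, C⟩, and ⟨B, A⟩, the sub-set ⟨A, C⟩ has an absolute support of 2 and a relative support of 2/3, so it is frequent, whereas ⟨B, C⟩ has a relative support of 1/3 and is not.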

Rule 3 examines the frequent closed ordered sub-sets of frames (FCSF) of two crash types. If two crash types have a common FCSF, they are correlated. Its definition is as follows.

Given two crash types $CT_A$ and $CT_B$ with stack traces $ST_A = \{T_1^A, T_2^A, \ldots, T_p^A\}$ and $ST_B = \{T_1^B, T_2^B, \ldots, T_p^B\}$, respectively, if $S_{Sub}^{(A)}$ (respectively $S_{Sub}^{(B)}$) is the set of frequent closed ordered sub-sets of frames of $ST_A$ (respectively $ST_B$), then $S_{Sub}^{(A)} \cap S_{Sub}^{(B)} \neq \emptyset \implies CT_A$ and $CT_B$ are correlated.
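Once the frequent closed ordered sub-sets of each crash type are known, the final decision of Rule 3 reduces to a non-empty intersection test. The short sketch below assumes the two sets have already been mined and represents each sub-set as a list of fully qualified file names; the class name `FrequentSubsetRule` is hypothetical.

```java
import java.util.Collections;
import java.util.List;
import java.util.Set;

final class FrequentSubsetRule {
    /** Rule 3: correlated when the two sets of FCSF share at least one sub-set. */
    static boolean correlated(Set<List<String>> fcsfA, Set<List<String>> fcsfB) {
        // List equality is element-wise and order-sensitive, as required for ordered sub-sets.
        return !Collections.disjoint(fcsfA, fcsfB);
    }
}
```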

Rule 4 (Time-based co-occurrence of crash types comparison) identifies crash correlation groups using frequent patterns of co-occurrences of crash types on users’ machines.

Let $U$ be a set of users $\{U_1, U_2, \ldots, U_n\}$, where $n$ is the number of users who reported a crash. For each user $U_i$, the group of crash types reported by $U_i$ is $\langle C_1^i, C_2^i, \ldots, C_{n_i}^i \rangle$, where $n_i$ is the number of crash types reported by the user $U_i$, $C_j^i$ is the $j$th crash type reported by the user $U_i$, $j \in \{1, \ldots, n_i\}$, $i \in \{1, \ldots, n\}$.

Given a set of crash types $SubC = \langle C_1, \ldots, C_m \rangle$, where $m \geq 2$, for each user $U_i$, $i \in \{1, \ldots, n\}$, if $\exists k, l$ with $1 \leq k \leq l \leq n_i \mid (C_1 = C_k^i) \wedge \ldots \wedge (C_m = C_l^i)$, then $SubC$ is a sub-set of crash types of $U_i$.

The absolute support of $SubC$ is the number of $i \in \{1, \ldots, n\}$ such that $SubC$ is a sub-set of crash types of $U_i$. They mine all the groups of crash types of users and extract frequent sub-sets

of crash types, using AprioriTID, an algorithm for discovering frequent item-sets (groups of crash types appearing frequently) among users. To be able to capture more sub-sets of crash types, they set the absolute support threshold value of the algorithm to 2, i.e., as long as a sub-set appears twice among users and it contains at least two crash types, they consider it as frequent.

Once the sets of frequent sub-sets of crash types are identified, they use the crash times of crash types to validate these frequent sub-sets. Given a time window (e.g., one day or one week), if all crash types of a sub-set occur within the time window, they keep this sub-set as valid.

Given two crash types $CT_A$ and $CT_B$, if they are in a frequent sub-set of crash types $SubC$ and co-occurred within a given time window, they are correlated.
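The absolute-support part of Rule 4 can be illustrated with a minimal Java sketch. It is not the AprioriTID algorithm used by the authors: it only checks one candidate sub-set at a time, it assumes that a sub-set matches when its crash types appear in the same order in a user's list (not necessarily contiguously), and it leaves out the time-window validation. The class and method names are hypothetical.

```java
import java.util.List;

/** Sketch of the absolute-support computation of Rule 4 (hypothetical names). */
final class CrashTypeSupport {

    /** True if the crash types of sub appear in userCrashes in the same order. */
    static boolean isOrderedSubset(List<String> sub, List<String> userCrashes) {
        int matched = 0; // number of crash types of sub found so far
        for (String crashType : userCrashes) {
            if (matched < sub.size() && sub.get(matched).equals(crashType)) {
                matched++;
            }
        }
        return matched == sub.size();
    }

    /** Number of users whose reported crash types contain sub. */
    static int absoluteSupport(List<String> sub, List<List<String>> users) {
        int support = 0;
        for (List<String> userCrashes : users) {
            if (isOrderedSubset(sub, userCrashes)) {
                support++;
            }
        }
        return support;
    }

    /** A sub-set is frequent if it has at least two crash types and enough support. */
    static boolean isFrequent(List<String> sub, List<List<String>> users, int threshold) {
        return sub.size() >= 2 && absoluteSupport(sub, users) >= threshold;
    }

    public static void main(String[] args) {
        List<List<String>> users = List.of(
                List.of("C1", "C2", "C3"),
                List.of("C1", "C3"),
                List.of("C2"));
        System.out.println(isFrequent(List.of("C1", "C3"), users, 2));
    }
}
```

With a threshold of 2, as in the text, a pair reported by two different users already counts as frequent, which is why the later time-window check is needed to discard coincidental co-occurrences.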

Rule 5 (Textual similarity of crash types comparison) identifies crash correlation groups using the textual similarity between comments provided by users about the crash types.

Each crash type has its textual description, which is a set of user comments. They merge the set of user comments into a single document. Each document has a set of terms $\{TM_1^i, TM_2^i, \ldots, TM_m^i\}$, where $m$ is the total number of terms in the textual description. They have a mapping between a crash type and a set of terms.

They use the vector space model, a widely used technique in traditional information retrieval, to calculate the textual similarity between crash types. In the vector space model, each document (i.e., a crash type in our case) is represented as an $N$-dimensional vector, where $N$ is the number of unique terms appearing in all the documents and $W_i$, where $1 \leq i \leq N$, is the weight of the $i$th term in the vector $\langle W_1, \ldots, W_N \rangle$, defined by $W_i = TF_i \times IDF_i$. The Term Frequency (TF) is the frequency of a term appearing in a document. The Inverse Document Frequency (IDF) diminishes the weight of terms that occur very frequently in the whole corpus and increases the weight of terms that occur rarely. They calculate the $TF_i$ and the $IDF_i$ for each term.

$$TF_i = \frac{|\{\text{occurrences of } i\text{th term in the document}\}|}{|\{\text{total terms in the document}\}|} \tag{2.1}$$

$$IDF_i = \log\left(\frac{|\{\text{total documents in the corpus}\}|}{|\{\text{documents having the } i\text{th term}\}|}\right) \tag{2.2}$$

After the vectors are created for each document (i.e., a crash type in our case), we can calculate the similarity of a pair of documents through a formula defining the similarity of two vectors. Typically, for two vectors $V_1 = \langle W_{11}, W_{12}, \ldots, W_{1N} \rangle$ and $V_2 = \langle W_{21}, W_{22}, \ldots, W_{2N} \rangle$, the similarity of $V_1$ and $V_2$ equals the value of the cosine similarity of $V_1$ and $V_2$:

$$\mathrm{sim}(V_1, V_2) = \frac{\sum_{i=1}^{N} W_{1i} \times W_{2i}}{\sqrt{\sum_{i=1}^{N} (W_{1i})^2} \times \sqrt{\sum_{i=1}^{N} (W_{2i})^2}} \tag{2.3}$$
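As an illustration of Equations 2.1 to 2.3, the following Java sketch builds TF-IDF vectors for a small corpus and compares two of them with the cosine similarity. It assumes that each document has already been tokenized into a list of terms and that the natural logarithm is used (the base only rescales all weights and does not change the cosine value). The class and method names are hypothetical and do not come from the original work.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

/** Sketch of TF-IDF weighting (Eqs. 2.1 and 2.2) and cosine similarity (Eq. 2.3). */
final class TfIdfSimilarity {

    /** Builds one TF-IDF vector per document over the sorted corpus vocabulary. */
    static List<double[]> vectors(List<List<String>> docs) {
        TreeSet<String> vocabulary = new TreeSet<>();
        Map<String, Integer> documentFrequency = new HashMap<>();
        for (List<String> doc : docs) {
            vocabulary.addAll(doc);
            // Count each term once per document.
            for (String term : new HashSet<>(doc)) {
                documentFrequency.merge(term, 1, Integer::sum);
            }
        }
        List<String> terms = new ArrayList<>(vocabulary);
        List<double[]> result = new ArrayList<>();
        for (List<String> doc : docs) {
            double[] weights = new double[terms.size()];
            for (int i = 0; i < terms.size(); i++) {
                String term = terms.get(i);
                long occurrences = doc.stream().filter(term::equals).count();
                double tf = doc.isEmpty() ? 0.0 : (double) occurrences / doc.size();
                double idf = Math.log((double) docs.size() / documentFrequency.get(term));
                weights[i] = tf * idf;
            }
            result.add(weights);
        }
        return result;
    }

    /** Cosine similarity of two vectors of equal length; 0 if either is all zeros. */
    static double cosine(double[] v1, double[] v2) {
        double dot = 0.0;
        double norm1 = 0.0;
        double norm2 = 0.0;
        for (int i = 0; i < v1.length; i++) {
            dot += v1[i] * v2[i];
            norm1 += v1[i] * v1[i];
            norm2 += v2[i] * v2[i];
        }
        if (norm1 == 0.0 || norm2 == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(norm1) * Math.sqrt(norm2));
    }
}
```

Note that a term present in every document receives an IDF of zero and therefore contributes nothing to the similarity, which is the intended effect of the IDF factor.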

Given two crash types $CT_A$ and $CT_B$, if the similarity value of their textual description


2.4 Crash Fault Localization

Several crash fault localization techniques have been proposed over time. Some of them identify potentially faulty code regions by comparing failing execution traces with successful execution traces (BALL; NAIK; RAJAMANI, 2003; JONES; HARROLD; STASKO, 2002; JONES; HARROLD, 2005). Typically, these techniques use code instrumentation and software testing to capture execution traces. Nessa et al. (NESSA et al., 2008) applied data mining techniques to information collected in the testing phase, narrowing down the possible location of the fault. They developed a fault localization algorithm to rank the executable statements of software by the level of suspicion based on N-gram analysis.

Other works use stack traces to find and fix bugs. This makes sense since bugs that have at least one stack trace in their crash reports are fixed faster (SCHROTER et al., 2010). Wang et al. (WANG; KHOMH; ZOU, 2013, 2016) created an algorithm (BFFinder) for bug localization based on stack traces of correlated crash types. They trained a Bayesian Belief Network (BBN) with the dimensions collected for each file present in the stack traces and ranked the files by their probability of being buggy. Wu et al. (WU et al., 2014) proposed a method that expands the crash stack with functions from the static call graph and ranks each function in the approximate crash traces by its suspiciousness, based on four discriminative factors. They expanded the crash stack because not all faults reside in stack traces. For the same reason, Gu et al. (GU et al., 2019) proposed an automatic approach that predicts whether the crashing code resides in the lines of the stack traces or not. Wu et al. (WU et al., 2018) developed a method (ChangeLocator) to automatically locate crash-inducing changes for a given bucket of crash reports using the SZZ algorithm.

2.4.1 Discriminative factors by Wu et al. (WU et al., 2014)

This section shows the discriminative factors to identify the most suspicious functions that could cause crashes, proposed by Wu et al., based on the observations made in their empirical study. We preserve the original text written by the authors and, therefore, we warn that the term crash stack used by them is equivalent to the term stack trace used in our work. Likewise, what they call crash stack buckets we call crash report groups.

Function Frequency (FF): If a function appears frequently in crash traces caused by a certain fault, it is likely to be the cause of this fault. The FF factor measures the frequency of a function appearing in crash traces of a specific bucket B:

$$FF(f, B) = \frac{N_{f,B}}{N_B} \tag{2.4}$$

where $N_{f,B}$ is the number of crash traces in bucket $B$ in which the function $f$ appears, and $N_B$ is the total number of crash traces in bucket $B$.

Inverse Bucket Frequency (IBF): If a function appears in crash traces caused by many different faults, it is less likely to be the cause of a specific fault. The IBF factor measures the discriminative power of a function with respect to all buckets:

$$IBF(f) = \log\left(\frac{\#B}{\#B_f} + 1\right) \tag{2.5}$$

where $\#B$ is the total number of buckets, and $\#B_f$ is the number of buckets whose crash traces contain the function $f$.

Inverse Average Distance to Crash Point (IAD): If a function appears closer to the crash point, it is more likely to cause the crash. The IAD factor measures how close a function is to the crash point:

$$IAD(f, B) = \frac{N_{f,B}}{1 + \sum_{j=1}^{n} dis_j(f)} \tag{2.6}$$

where $n$ is the number of crash traces in the bucket $B$ that include the function $f$. $dis_j(f)$ represents the distance between the crash point and $f$ in the $j$th crash trace, which is defined as follows:

$$dis_j(f) = pos_j(f) + CallDepth_j(f) \tag{2.7}$$

where $pos_j(f)$ is the position offset between the crash point and the stack frame from which $f$ is expanded in the $j$th crash trace, and $CallDepth_j(f)$ is the call depth of $f$ in the $j$th crash trace.

Function’s Lines of Code (FLOC): Their prior research on software defect prediction shows that larger modules are more likely to be defect-prone. Therefore, they use the function’s size as a discriminative factor. They measure function size in terms of lines of code and define the FLOC factor as follows:

$$FLOC(f) = \log(LOC(f) + 1) \tag{2.8}$$

where $LOC(f)$ is the number of lines of code of function $f$.

Combining the above four factors, they calculate the suspiciousness score of a function $f$ with respect to a bucket $B$ as follows:

$$Score(f) = FF(f, B) \times IBF(f) \times IAD(f, B) \times FLOC(f) \tag{2.9}$$

Their method assigns higher scores to the functions that appear more frequently in crash traces in a bucket, less frequently in crash traces in other buckets, closer to the crash point, and larger in LOC. For each crash bucket, CrashLocator calculates the suspicious score of each function, ranks all the functions by the scores in descending order, and recommends the ranked list to developers. The developers can then examine the top N functions in the list to locate crashing faults.
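The four factors and the final score (Equations 2.4 to 2.9) can be sketched in Java as follows. This is a simplified illustration, not CrashLocator itself: it assumes no stack expansion, so the distance $dis_j(f)$ is taken as the index of the function in the trace (0 is the crash point) and the call depth is zero. A trace is a list of function names, a bucket is a list of traces, and all class and method names are hypothetical.

```java
import java.util.List;

/** Sketch of the suspiciousness score of Eqs. 2.4 to 2.9 without stack expansion. */
final class SuspiciousnessScore {

    /** FF (Eq. 2.4): fraction of the traces in the bucket that contain f. */
    static double ff(String f, List<List<String>> bucket) {
        long containing = bucket.stream().filter(trace -> trace.contains(f)).count();
        return (double) containing / bucket.size();
    }

    /** IBF (Eq. 2.5); f is assumed to appear in at least one bucket. */
    static double ibf(String f, List<List<List<String>>> buckets) {
        long bucketsWithF = buckets.stream()
                .filter(bucket -> bucket.stream().anyMatch(trace -> trace.contains(f)))
                .count();
        return Math.log((double) buckets.size() / bucketsWithF + 1);
    }

    /** IAD (Eq. 2.6), using the frame index as the distance to the crash point. */
    static double iad(String f, List<List<String>> bucket) {
        long containing = 0;
        long distanceSum = 0;
        for (List<String> trace : bucket) {
            int position = trace.indexOf(f);
            if (position >= 0) {
                containing++;
                distanceSum += position;
            }
        }
        return (double) containing / (1 + distanceSum);
    }

    /** FLOC (Eq. 2.8) for a function with the given number of lines of code. */
    static double floc(int linesOfCode) {
        return Math.log(linesOfCode + 1);
    }

    /** Score (Eq. 2.9): product of the four factors. */
    static double score(String f, int linesOfCode,
                        List<List<String>> bucket, List<List<List<String>>> buckets) {
        return ff(f, bucket) * ibf(f, buckets) * iad(f, bucket) * floc(linesOfCode);
    }
}
```

Ranking the functions of a bucket then amounts to computing `score` for each function that appears in its traces and sorting in descending order.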


3 Urupema

Manually analyzing log information is neither quick nor pleasant. As the volume of logs increases, it becomes slower, more exhausting, and more error-prone because repeated, similar, and related data may exist (GU et al., 2019). Typically, strategies for removing duplicate information and grouping related logs should be employed to reduce effort and time. The activity is even more complicated when it is necessary to compare information obtained from different sources and in different formats.

There is already software available that assists the task of grouping specific log types, such as Mozilla Socorro¹, which is a set of components for collecting, processing, and analyzing crash data (FOUNDATION, ). However, it is common for companies not to use such tools and instead to record their logs in their own formats and in different databases. The Information Technology Superintendency (SINFO), where we conducted our study, is an example of an organization that uses ElasticSearch to store crash reports in its own format. To conduct studies at these companies, we would first need either to change their organizational cultures so that they adopt other technological solutions, or to develop or adapt tools for collecting information without interfering in their culture.

In this context, we needed to develop software that allows conducting studies in a noninvasive way. It should also use existing enterprise logs and compare logs from distinct sources and formats. It should be flexible and easy to extend, so that it can be applied in diverse usage scenarios and with different data processing strategies. Figure 2 shows the idea of the application, which is to access logs from different sources and convert them into a generic model that supports operations on the logs and the extraction of useful information.

We designed and implemented the Urupema tool using software product line concepts, with variation points for sources of logs and issues, as well as for log aggregation and correlation algorithms.


Figure 2: Software idea

3.1 Architecture

We implemented Urupema as a web application to facilitate adoption. Thus, developers and architects can use it without installing any modules on their workstations. We use the Angular framework to develop the web application frontend. It is responsible for interacting with the user, receiving their requests, accessing the services available in the backend, and displaying the results. Angular² is an open-source platform for building mobile and desktop web applications based on Microsoft’s TypeScript³ language, which is a typed superset of JavaScript⁴ that compiles to plain JavaScript.

The backend is a layered Java application developed using the Spring⁵ framework, specifically Spring Boot and Spring Data JPA. Spring is an open-source framework for the Java platform, which includes several modules that provide a range of services (aspect-oriented programming, authentication/authorization, convention over configuration, data access, inversion of control, model-view-controller, transaction management) (JOHNSON et al., 2019). Spring Boot is the convention-over-configuration solution, designed to get you up and running as quickly as possible, with minimal upfront configuration of Spring (WEBB et al., 2019). Spring Data JPA provides repository support for the Java Persistence API (JPA), facilitating the development of applications that need access to JPA data sources (GIERKE et al., 2019).

The Service layer is the implementation of a REST Web Service, a software architectural style that defines a set of constraints to be used for creating Web services (BOOTH et al., 2004). The Process layer provides transactional control and orchestration of the functionality provided by the Core layer. In the Core layer are the model, the repositories, and the application variation points. It is responsible for the persistence of Urupema domain entities and interfacing with extended features plugged into variation points. Figure 3 shows the Urupema architecture.

² https://angular.io/

³ https://www.typescriptlang.org/

⁴ https://www.ecma-international.org/publications/standards/Ecma-262.htm

⁵ https://spring.io

Figure 3: Urupema Architecture

The purpose of the Log Source interface is to make it possible to extract access, execution, or error logs stored in different data sources, such as ElasticSearch and PerfMiner, by merely implementing the extraction logic for each of them. The idea of the Issue Tracker interface is the same but applied to software issues.

It is common for companies to use their own formats to store information according to their needs, even when they use the same tools to record data. For example, one company using Redmine may use plugins available for Git integration while another uses a webhook and logs the committed files in the issue notes. Both use the same issue tracker tool but track the changed files differently in each issue. The same thing happens with the crash reports. Thus, it would be necessary to create a version of Urupema for each company studied, since the solution for extracting information from one probably could not be used in another. On the other hand, we could reuse most of the application. Given this, we decided to develop Urupema as a software product line, which promotes the reuse of the same architecture and of the set of common and variable functionalities, with the possibility of generating products with specific characteristics.

Figure 4 shows the simplified Urupema feature model, containing the variation points and some possible variants. LogSource is an optional feature that allows you to include implementations to extract information from access logs, crash reports, runtime executions, among others, from different sources (ElasticSearch, PerfMiner, Pinpoint). IssueTrackerSource enables data extraction (title, description, date and time, stack trace, changed files) from tracking tools (Redmine, GAS, GitLab). The LogGroupAggregator feature allows the implementation of log grouping strategies (by stack trace, crash type signature, top frame file, URL) from the same data source. LogGroupCorrelator allows correlating groups (by equivalence, coverage) from different log sources. LogConnection, IssueTrackerConnection and LogCorrelator are features that make it possible to choose the data source and other connection settings (host, user, password).

Figure 4: Urupema Feature Model

Figure 5 shows a class diagram detailing the implemented solution for adding IssueTrackerSource and LogSource variants. The other variants use the same idea. For each of the points of variability, there is a related interface (shown in blue) that needs to be implemented, whose methods allow the identification of the variant, usage restrictions, and setting configuration parameters that must be supplied to use it. For issues and logs, you also need to develop concrete classes that implement one or more repository interfaces (highlighted in yellow) for retrieving information from external data sources.
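To make the idea of a variation point more concrete, the sketch below shows what such an interface and two variants might look like. It is only an illustration under stated assumptions: apart from `getId()`, which appears in Listing 3.1, every interface, class, method, and parameter name here is hypothetical, and the registry is filled from an explicit list rather than by classpath scanning.

```java
import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical variation-point interface: identifies a variant and its settings. */
interface LogSourceVariant {
    String getId();

    /** Names of the configuration parameters the user must supply. */
    List<String> getRequiredParameters();
}

/** Hypothetical variant for logs stored in ElasticSearch. */
final class ElasticSearchVariant implements LogSourceVariant {
    @Override
    public String getId() {
        return "elasticsearch";
    }

    @Override
    public List<String> getRequiredParameters() {
        return List.of("host", "user", "password");
    }
}

/** Hypothetical variant for execution scenarios stored by PerfMiner. */
final class PerfMinerVariant implements LogSourceVariant {
    @Override
    public String getId() {
        return "perfminer";
    }

    @Override
    public List<String> getRequiredParameters() {
        return List.of("host");
    }
}

/** Registry that indexes the available variants by identifier. */
final class VariantRegistry {
    private final Map<String, LogSourceVariant> variants = new TreeMap<>();

    VariantRegistry(Collection<? extends LogSourceVariant> available) {
        for (LogSourceVariant variant : available) {
            variants.put(variant.getId(), variant);
        }
    }

    LogSourceVariant findById(String id) {
        return variants.get(id);
    }

    Collection<LogSourceVariant> findAll() {
        return variants.values();
    }
}
```

Keeping the identifier and the required parameters on the interface lets the application list the available variants and build a configuration form for each one without knowing their concrete classes.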

We use the Java Reflections API⁶ to make it easy for users to register, discover, and make variants available. We did this through the Plugin interfaces, which are unified access points for variant configuration information and its respective repositories. To attach a variant, you need to add the concrete implementation of one of the interfaces highlighted in green to the application classpath. Variants are discovered and configured at Urupema load time. For the features shown in Figure 5, the IssueTrackerTypeRepository and LogSourceTypeRepository classes are responsible for finding the added variants. Listing 3.1 shows the source code of the IssueTrackerTypeRepository class.

We use Gradle⁷ for dependency management, build automation, and generation of different product types. Gradle is an open-source build automation tool focused on flexibility and performance (INC., 2019). Gradle build scripts are written using a Groovy⁸ or Kotlin⁹ DSL. We take advantage of Gradle’s conditional dependency management to generate different products, including the desired optional features specified as command-line parameters. Thus, the Java libraries of the selected variants are embedded in the application’s .jar file, recognized through Java reflection at application run time, and made available for use. Figure 6a shows Urupema modules. Figure 6b presents the build.gradle script with the conditional compilation. Figure 6c shows examples of product generation with only es-sinfo, only perfminer, and both.

Figure 6: Urupema product generation: (a) Urupema modules; (b) build.gradle script with conditional compilation; (c) examples of product generation

⁷ https://gradle.org/

⁸ http://groovy-lang.org/

⁹ https://kotlinlang.org/

Listing 3.1: IssueTrackerTypeRepository.java

    package br.ufrn.ase.urupema.extensible.issuetrackertype.repository;

    // omitted imports

    @Repository
    public class IssueTrackerTypeRepository {

        // Registered variants, indexed by their identifier
        Map<String, IssueTrackerType> mapIssueTrackerTypes;

        public IssueTrackerTypeRepository() throws InstantiationException,
                IllegalAccessException, IllegalArgumentException,
                InvocationTargetException, NoSuchMethodException, SecurityException {
            mapIssueTrackerTypes = new TreeMap<>();
            // Scan the Urupema base package for plugin implementations
            Reflections reflections = new Reflections("br.ufrn.ase.urupema");
            Set<Class<? extends IssueTrackerTypePlugin>> classes =
                    reflections.getSubTypesOf(IssueTrackerTypePlugin.class);
            for (Class<? extends IssueTrackerTypePlugin> class1 : classes) {
                // Instantiate and register only concrete classes
                if (!Modifier.isAbstract(class1.getModifiers()) && !class1.isInterface()) {
                    IssueTrackerTypePlugin inst = class1.getConstructor().newInstance();
                    mapIssueTrackerTypes.put(inst.getId(), inst);
                }
            }
        }

        public IssueTrackerTypePlugin findById(String id)
                throws InstantiationException, IllegalAccessException {
            return (IssueTrackerTypePlugin) mapIssueTrackerTypes.get(id);
        }

        public Collection<IssueTrackerType> findAll()
                throws InstantiationException, IllegalAccessException {
            return mapIssueTrackerTypes.values();
        }
    }

We have currently implemented variants of the LogSource feature to extract scenarios stored by PerfMiner as well as error and access logs stored in ElasticSearch in the formats used by SINFO. We also built the IssueTrackerSource feature variant to get information from issues recorded in GAS. For the LogGroupAggregator feature, we implemented variants to aggregate logs by stack trace, equivalent signature, crash type signature, top frame file, and others. The LogGroupCorrelator feature has only one implemented variant, which checks whether a given method call graph (e.g., error log) is contained in another. Figure 7 illustrates some examples. It is possible to see that the stack traces of logs A and C are contained in the execution trace of scenario S, while those of logs B (due to the ClassH.method15 method) and D (due to the inversion of methods ClassG.method14 and ClassG.method13) are not.
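A minimal sketch of such a coverage check is given below. It assumes that both the stack trace and the execution trace can be flattened into ordered lists of method names and that "contained" means every call of the stack trace appears in the execution trace in the same relative order, with other calls possibly interleaved. This interpretation, the class name, and the example traces are assumptions made for illustration; only the method names ClassG.method13, ClassG.method14, and ClassH.method15 come from the text.

```java
import java.util.List;

/** Sketch of a coverage check between two method-call sequences (hypothetical names). */
final class CoverageCheck {

    /**
     * Returns true if every call of stackTrace appears in executionTrace
     * in the same relative order (other calls may be interleaved).
     */
    static boolean isCovered(List<String> stackTrace, List<String> executionTrace) {
        int next = 0; // index of the next stack-trace call still to be found
        for (String call : executionTrace) {
            if (next < stackTrace.size() && stackTrace.get(next).equals(call)) {
                next++;
            }
        }
        return next == stackTrace.size();
    }

    public static void main(String[] args) {
        List<String> scenario = List.of("ClassA.method1", "ClassG.method13", "ClassG.method14");
        // Same methods in inverted order: not covered by the scenario.
        List<String> inverted = List.of("ClassG.method14", "ClassG.method13");
        System.out.println(isCovered(inverted, scenario));
    }
}
```

A single pass over the execution trace is enough because matching each stack-trace call at its earliest possible position never prevents a later call from being matched.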

Figure 7: Coverage variant idea

3.2 Possible usage scenarios

We can apply Urupema in different contexts and for different purposes due to its flexibility and the various combinations allowed. We have enumerated some of the forms of use we have imagined so far.

3.2.1 Bug prioritization

The grouping of access and error logs can allow the creation of metrics to indicate which crash types should be analyzed first by the development team. A simple example is to aggregate error logs by URI and first investigate the ones with the most errors. However, the URIs with the most errors may not be the most accessed ones. In this case, we could also group the access logs by URI and prioritize the most accessed URIs that have the largest number of errors.
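One possible way to combine the two groupings is sketched below. The ranking criterion is an assumption chosen for illustration, since the text does not fix one: URIs are ordered by number of errors, and ties are broken by number of accesses. Each log is represented only by its URI, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

/** Sketch of URI prioritization from error and access logs (hypothetical names). */
final class UriPrioritizer {

    /** Groups log entries by URI and counts the entries of each group. */
    static Map<String, Long> countByUri(List<String> uris) {
        return uris.stream()
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
    }

    /** URIs with errors, ordered by error count, then access count, then name. */
    static List<String> rank(List<String> errorLogUris, List<String> accessLogUris) {
        Map<String, Long> errors = countByUri(errorLogUris);
        Map<String, Long> accesses = countByUri(accessLogUris);
        List<String> ranked = new ArrayList<>(errors.keySet());
        ranked.sort((a, b) -> {
            int byErrors = Long.compare(errors.get(b), errors.get(a));
            if (byErrors != 0) {
                return byErrors;
            }
            int byAccesses = Long.compare(
                    accesses.getOrDefault(b, 0L), accesses.getOrDefault(a, 0L));
            return byAccesses != 0 ? byAccesses : a.compareTo(b);
        });
        return ranked;
    }
}
```

Other criteria, such as the ratio of errors to accesses, could be plugged into the comparator without changing the grouping step.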

A more complex example is aggregating error logs by the similarity of stack traces and first addressing those that have been occurring the longest, those that started happening after the latest release date, or those that affect more URIs or impact more users.


3.2.2 Software testing prioritization and management

It is possible to use Urupema to group identical manual test scenarios, that is, test executions with the same method call graph. This information can be used to indicate the tests that should be automated first. Repetitive execution of the same manual test may suggest that it is a strong candidate for automation, freeing testers to perform other tasks. If we do the same kind of grouping with automated test scenarios, we can correlate manual scenarios with automated and identify manual tests that are covered by automated tests, avoiding rework, and relieving the test team.

Another investigation that can be done using Urupema is to check whether the sequence of called methods that appear in error log stack traces collected in the production environment is also present in manual or automated test scenarios. If this is happening, it is essential to verify if the tests manifest the same errors during execution. If they do not produce the same errors, the input data used in testing may not be revealing the errors that occur in the production environment. This information may indicate whether the test team needs to improve input data and test cases.

3.2.3 Bug Code Identification

Error logs can also be used to identify buggy code. Some studies (WANG; KHOMH; ZOU, 2013) (WANG; KHOMH; ZOU, 2016) (WU et al., 2014) (WU et al., 2018) have found that grouping error logs by stack trace similarity along with extracting information about files and methods present in group stack traces can help identify faulty files and methods. Urupema allows the creation and use of various forms of log grouping. It also has functionality for ranking files suspected of being buggy and displaying their most common methods in stack traces, helping the development team find and fix the bug.

3.3 Features Overview

In this section, we show some of the features implemented, displaying screenshots, and commenting on them. We do not show simple functionalities such as the creation of a new study and sources of log or issues.


3.3.1 Study detail

The study entity is used to group connections and correlators, making it easier for users to organize. You can create more than one connection with issue tracker tools or log sources (error, access, or execution), as well as multiple log group correlators in each study.

The study detail screen displays information about connections and correlators, such as name, source, date, start and end timestamp to filter desired logs or issues, aggregator type, and action buttons (edit, detail, remove). Figure 8 shows the detail of the study performed in SIGAA, where it is possible to notice the existence of a connection with an issue tracker (GAS - SINFO) and three connections with the same log source (SINFO ElasticSearch), however using aggregators of different types.

Figure 8: Study detail screen

3.3.2 Issue tracker connection detail

The details screen of issue tracker connection allows you to create issues manually, import issues, and perform other functions that require prior configuration of error log connections in the same study:

• Import Affected URIs: queries the selected error log source for error reports that contain any of the stack traces present in the imported issues. We extract the URIs from the error reports found and link them to the corresponding issues. An affected URI is considered enabled if there is at least one enabled stack trace and one enabled changed file for a particular issue.

• Aggroup Logs: queries the selected error log source for error reports whose URLs contain the enabled imported URIs, and groups the logs found according to the aggregator type configured in the log source connection.

• Match Issues with Log Groups: Matches imported issues with log groups for the selected connection, based on stack traces.

The issue import process extracts the primary data of issues and information about the files changed to resolve them, marking each file as enabled if its extension matches one of the extensions provided by the user, and as disabled otherwise. It also extracts and formats stack traces, considering as disabled those with fewer than five classes. All imported information can be edited by the user to allow the correction of any import errors.
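The two enabling rules described above can be sketched in Java as follows. The sketch assumes Java-style stack trace frames of the form `at package.Class.method(File.java:line)` and counts distinct class names, which is one possible reading of "fewer than five classes". The regular expression and all class and method names are hypothetical and are not taken from Urupema's source code.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch of the enabled/disabled rules applied during issue import (hypothetical names). */
final class IssueImportFilter {

    // Matches frames such as "at br.ufrn.Foo.bar(Foo.java:10)"; group 1 is the class name.
    private static final Pattern FRAME =
            Pattern.compile("\\bat\\s+([\\w.$]+)\\.[\\w$<>]+\\(");

    /** A changed file is enabled if its extension is one of the given extensions. */
    static boolean isFileEnabled(String fileName, List<String> extensions) {
        int dot = fileName.lastIndexOf('.');
        return dot >= 0 && extensions.contains(fileName.substring(dot + 1));
    }

    /** Counts the distinct classes that appear in the frames of a stack trace. */
    static int countDistinctClasses(String stackTrace) {
        Set<String> classes = new HashSet<>();
        Matcher matcher = FRAME.matcher(stackTrace);
        while (matcher.find()) {
            classes.add(matcher.group(1));
        }
        return classes.size();
    }

    /** A stack trace is enabled if it contains at least five distinct classes. */
    static boolean isStackTraceEnabled(String stackTrace) {
        return countDistinctClasses(stackTrace) >= 5;
    }
}
```

Very short stack traces tend to match many unrelated error reports, which is presumably why they are disabled by default while remaining editable by the user.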

You can also view the imported issues, with their identifiers, titles, opening and closing timestamps, type, and status (enabled or disabled). The listing also shows the total and disabled quantities for changed files, stack traces, and affected URIs. Figure 9 shows a part of the issues imported from the SIPAC system, where you can see that the only stack trace extracted for the issue with id 1539 is disabled. Therefore, all 21 affected URIs are also disabled.


3.3.3 Issue detail

The issue detail screen shows imported stack traces and allows editing, deleting, and manually inserting stack traces. Similarly, the screen displays the revisions and the files changed to solve the problem, and allows operations on them. The possibility of manually including information allows conducting studies in companies that do not yet use issue tracking tools.

Figure 10 shows the details of a SIPAC system issue, where one stack trace was imported, along with two changed files in the same revision. You can see that the issue is related to three distinct connection log groups.

Figure 10: Issue detail screen

3.3.4 Log group detail

The log group detail screen shows information about the type of aggregator used, the timestamp range to search logs, the log occurrence period (first and last), the number of logs grouped, the supergroup, and the label, which is the group signature. It also shows a listing of subgroups with the quantity of grouped elements, the number of subgroups, and the dates of the first and last log records. Features accessible from this screen include viewing affected URIs and users, the distribution of logs over time, subgroup detail, and the files suspected of causing the crash for each subgroup.

Figure 11 shows a group of error logs formed by applying Frequent Closed Ordered Sub-Comparison (Rule 3), consisting of 146 elements divided into 3 subgroups. The first log occurred on 04/17/2018 and the last on 12/13/2018.

Figure 11: Log group detail screen

3.3.5 Suspicious files

The listing of suspicious files in an error log group shows the score and all methods called for each of the files, including the line number and the number of times it appeared in the group stack traces. Descending order is the default way of displaying both lists, but this is easily modifiable from the table headers. The system processes and ranks suspicious files on the first request to the listing, so it may be slow to display the files during the first access.

Figure 12 shows part of the files suggested as suspicious for a SIPAC system error log group. By default, no filter is applied to the filename, although the most common use is to filter by packages of the application source code (e.g., br/ufrn/sipac/). We can also see the list of methods of the ProcessorReqMaterial.java file present in the group’s stack traces, along with the lines where they stopped executing. Note that the difference in the line where a method stopped may occur due to the execution of a new version in the production environment. For example, reformatting the ProcessorReqMaterial file broke one of the lines into four, causing the error that occurred at line 231 to appear at line 235 of the error logs after deploying the new system version.


Figure 12: Suspicious Files

3.3.6 Calculate Top N

This feature provides information about the performance of the error log grouping and suspicious file ranking algorithms, and lets the user visualize the files suggested as suspicious alongside those that were changed to resolve the issue.

Figure 13 shows the form to calculate the performances, where we can choose the issue tracker connection, the error log connection, the file extensions (comma separated), the package (considers only files belonging to the package or sub-packages), the maximum number of enabled files to consider (ignores issues with more enabled files than specified). We can also choose whether to consider only stack traces that contain at least one class from the given source code package.

Figure 14 shows part of the results obtained for the data entered in the form in Figure 13. The listing shows the issue identifier, the number of files changed to resolve the issue, the log group identifier, and the number of error logs belonging to the group. The Top 1, 3, 5, and 10 columns show the number of changed files present in the list of N suspicious files. The Top S 1, 3, 5, and 10 columns show the same quantities, but only consider files contained in the selected source code package. We can expand each table row and view lists of changed files and the first 10 suspicious files considering all packages and the selected package.
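The value shown in each Top N column can be computed as in the sketch below, which counts how many of the files changed to resolve an issue appear among the first N entries of the ranked list of suspicious files. The class and method names are hypothetical, and the package filter used by the Top S N columns is left out.

```java
import java.util.List;
import java.util.Set;

/** Sketch of the Top N count for one issue and one log group (hypothetical names). */
final class TopNCounter {

    /** Number of changed files found among the first n ranked suspicious files. */
    static int changedFilesInTopN(List<String> rankedSuspiciousFiles,
                                  Set<String> changedFiles, int n) {
        int limit = Math.min(n, rankedSuspiciousFiles.size());
        int hits = 0;
        for (int i = 0; i < limit; i++) {
            if (changedFiles.contains(rankedSuspiciousFiles.get(i))) {
                hits++;
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> ranked = List.of("A.java", "B.java", "C.java", "D.java");
        Set<String> changed = Set.of("B.java", "D.java");
        for (int n : new int[] {1, 3, 5, 10}) {
            System.out.println("Top " + n + ": " + changedFilesInTopN(ranked, changed, n));
        }
    }
}
```

Applying the same function to a ranked list restricted to the selected source code package would yield the corresponding Top S N value.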

Figure 15 shows the final part of the listing, which shows for each of the Top N and Top S N columns some performance-related metrics of the algorithms used.


Figure 13: Calc Top N form screen

Figure 14: Top N result - start


Another relevant piece of information shown in the results is the list of methods present in the stack traces for each of the changed files found in the suspicious file list. We present the data as a table, where each row shows the issue and log group identifiers, the file name, the number of altered methods found in the suspicious files listing, and the total number of file methods present in the error log stack trace. We also display all called methods, highlighting in green those that were modified in the changed files to resolve the issue. For this to be possible, the names of the modified methods must be registered for each changed file. Figure 16 illustrates part of the results obtained for the parameters reported in Figure 13.

Figure 16: Top N result - called methods

3.4 Limitations

We designed Urupema to analyze data from web information systems, where the URL is very relevant information. Several Urupema features depend on the URL, such as filtering logs to be grouped, importing, and listing affected URIs. Given this, it can only be applied to web systems currently.

Only the parser for Java stack traces is implemented and available at the core of the application. However, it is possible to develop parsers for other stack trace formats and use them in Issue Tracker, Log Group Aggregator, and Log Group Correlator variants.

We did not test Urupema with big data, although we evaluated it with a substantial volume of crash reports (137,856 records).

The suspicious file ranking algorithm cannot be changed. Currently, we can only use an algorithm implemented based on the ideas proposed by Wu et al. (WU et al., 2014).

3.5 Related Tools

Buggy Files Finder (BFFinder) (WANG; KHOMH; ZOU, 2016) is a bug localization method to locate and rank buggy files from the stack traces in crash types. According to its authors, it should be implemented in future work as a tool to assist development teams. It uses log grouping rules to identify related crash types and a Bayesian Belief Network (BBN) to compute and rank files from stack traces based on their probability of being faulty. The authors train the BBN with four-dimensional feature vectors for the files appearing in a failing stack trace. Urupema lets you use implementations based on rules proposed by Wang and colleagues, as well as other forms of log grouping. It also ranks suspicious files but does not use a BBN, and its ranking algorithm is based on functions proposed by Wu et al. (WU et al., 2014).

CrashLocator (WU et al., 2014) is a method to locate faulty functions using the crash stack information in crash reports. Its authors also plan to integrate CrashLocator into a crash reporting system and evaluate its effectiveness in practice. It ranks suspicious functions based on stack traces of crash types, considering four discriminative factors to identify the most suspicious functions that could cause crashes. It generates approximate crash traces by stack expansions, computes the suspiciousness scores of all functions in the approximate crash traces, and returns a ranked list of suspicious functions. Urupema’s suspicious file ranking algorithm is closely based on the ideas proposed by Wu et al. to rank suspicious functions. However, they indicate functions, whereas we indicate files. Another difference is that Urupema performs log grouping, while CrashLocator uses buckets of crash reports produced by Mozilla Crash Reporter.

ChangeLocator is an automatic tool for locating crash-inducing changes. It (i) collects crash reports and their bucket information, as well as linked and fixed bugs; (ii) mines the changelogs in the code repository and identifies the crash-fixing changes for the linked bug; (iii) uses the improved SZZ algorithm to identify bug-inducing changes; and (iv) obtains the crash-inducing changes for each crash bucket to provide a suggested list of changes for a specific crash bucket. Thus, it can also suggest buggy functions whose code was changed in a way that induced the bug, but it needs access to the source code repository. Urupema uses only the information present in stack traces to suggest files. Also, Urupema groups error logs itself, while ChangeLocator needs to collect data from the buckets of another tool, which in this case was NetBeans Exception Reports.

3.6 Summary

In this chapter, we presented the architecture, features, and possible usage scenarios of Urupema, a tool designed to compare data about web systems from multiple sources, allowing log grouping and correlation between them. In future work, we intend to support the development of variants and the choice of algorithm. We also intend to implement a log aggregator pipeline that facilitates the chaining of existing aggregators, increasing the range of possible studies.

4 Empirical Study

We conducted a case study at the Federal University of Rio Grande do Norte (UFRN¹), in the Information Technology Superintendency (SINFO²). We investigated issues that reported system crashes and crash reports related to three large-scale web information systems. SINFO manages its projects in a customized version of Redmine³. This tool is used to prioritize and assign tasks to correct or evolve systems. SINFO also continuously monitors the use of its systems. When a crash occurs on the production servers, error-related information is automatically collected and sent to Elasticsearch⁴. The crash report contains the URI that raised the error, the server that was responding to the request, the module in which the error manifested, and the timestamp of the failure, in addition to the stack trace and error message.

4.1 Introduction

Software developers can leverage crash reports to build a general understanding of the possible root causes of bugs and then improve software quality by fixing them (AN; KHOMH, 2015; KOVASH, 2010; THOMSON, 2012). Developers can collect such information using built-in automatic crash reporting tools or log-based tools that monitor the execution of software systems. Each crash report usually holds a set of runtime information, such as the functionality requested by the user, the execution date and time, and an associated stack trace. A stack trace is an ordered set of frames, where each frame refers to a method signature. Developers often use the information available in crash reports to identify and correct existing bugs.
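The structure of a crash report described above can be sketched as a small data model. The following Python sketch assumes stack traces in the classic textual format printed by the Java runtime (`at package.Class.method(File.java:line)`); the names `Frame`, `CrashReport`, and `parse_frames` are hypothetical, and the chosen fields are an illustrative subset, not the schema used in this study.

```python
import re
from dataclasses import dataclass
from datetime import datetime

# One frame line of a classic Java stack trace, for example:
#   "at com.example.Service.run(Service.java:42)"
# Module-prefixed frames (Java 9+) are not handled in this sketch.
FRAME_RE = re.compile(r"^\s*at\s+([\w.$<>]+)\.([\w$<>]+)\(([^)]*)\)\s*$")


@dataclass(frozen=True)
class Frame:
    class_name: str  # fully qualified class name
    method: str      # method name, or "<init>" for constructors
    location: str    # e.g. "Service.java:42" or "Native Method"


@dataclass(frozen=True)
class CrashReport:
    """Hypothetical record holding the runtime information of one crash."""
    uri: str             # functionality requested by the user
    timestamp: datetime  # execution date and time
    error_message: str
    frames: tuple        # ordered frames, top of the stack first


def parse_frames(stack_trace: str) -> tuple:
    """Extract the ordered frames from the text of a stack trace."""
    frames = []
    for line in stack_trace.splitlines():
        match = FRAME_RE.match(line)
        if match:  # skip the exception header and "Caused by:" lines
            frames.append(Frame(*match.groups()))
    return tuple(frames)
```

Keeping the frames as an ordered, immutable tuple preserves the call order, which matters for any analysis that depends on the distance of a frame from the crash point.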

¹ Universidade Federal do Rio Grande do Norte
² Superintendência de Informática
³ http://www.redmine.org/
⁴ https://www.elastic.co/

Despite the benefits of using crash reports to automatically identify and correct bugs, crash reporting and logging tools usually need to deal with a large volume of daily crash reports (AN; KHOMH, 2015; KINSHUMANN et al., 2011). To mitigate this problem, crash reports can be grouped based on the similarity of their associated stack traces. The stack traces of grouped crash reports can then be used by developers to facilitate bug identification and fixing. Accordingly, in recent years many research works have explored the use of stack traces to aggregate crash reports (PODGURSKI et al., 2003; KHOMH et al., 2011; KIM et al., 2011; DANG et al., 2012; WANG; KHOMH; ZOU, 2013, 2016), as well as to locate and correct bugs (BALL; NAIK; RAJAMANI, 2003; JONES; HARROLD; STASKO, 2002; JONES; HARROLD, 2005; NESSA et al., 2008; SCHROTER et al., 2010; GU et al., 2019; WU et al., 2014, 2018). On the other hand, only a few research works have explored the development of automated tools to group correlated crash reports through their respective stack traces.

For instance, Dhaliwal et al. (DHALIWAL; KHOMH; ZOU, 2011) proposed a new way of automatically grouping crash reports of the Firefox browser, based on the Levenshtein distance between their stack traces. Their approach leads to a reduction of 5% in the time necessary to correct faults. Wang et al. (WANG; KHOMH; ZOU, 2013, 2016) proposed five rules to automatically group correlated crash reports. They conducted an empirical study with data from Firefox and Eclipse. Their approach identified correlated crash reports with a precision of 91% (Firefox) and 76% (Eclipse). The same authors also developed a methodology to identify buggy files, based on the dimensions collected for each file present in the stack traces. Similarly, Wu et al. (WU et al., 2014) studied the identification of faulty code using groups of crash reports. They proposed a method called CrashLocator to locate defective functions related to groups of crash reports that were previously merged using the stack trace information. They developed approaches to expand the crash stack trace using static analysis, as well as discriminative factors to rank suspicious functions. They located 50.6%, 63.7%, and 67.5% of failures by examining the top 1, top 5, and top 10 functions recommended by CrashLocator, respectively.
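As an illustration of similarity-based grouping, the sketch below computes the Levenshtein distance between two stack traces treated as sequences of frames and uses it to group traces greedily. It is a minimal sketch under stated assumptions, not the procedure of Dhaliwal et al.: comparing whole frames rather than characters, normalizing by the longer trace, the threshold value, and the greedy first-fit strategy are all assumptions made for illustration.

```python
def levenshtein(a, b):
    """Edit distance between two sequences (insert, delete, substitute cost 1)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute x by y
        prev = curr
    return prev[-1]


def similarity(a, b):
    """Normalized similarity in [0, 1]; 1.0 means identical traces."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def group_traces(traces, threshold=0.8):
    """Greedy grouping: each trace joins the first group whose first
    member (its representative) is similar enough, else starts a new group."""
    groups = []
    for trace in traces:
        for group in groups:
            if similarity(group[0], trace) >= threshold:
                group.append(trace)
                break
        else:
            groups.append([trace])
    return groups
```

With traces represented as lists of method signatures, two traces that differ only in one frame out of three have a similarity of about 0.67, so they are grouped together when the threshold is 0.6 and kept apart with the default of 0.8. The result of greedy grouping depends on the order of the input, which a production tool would need to account for.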

In this work, we also investigate the effects of rules for grouping crash reports on fault localization performance, though in a different context: large-scale proprietary web-based systems. To this end, we first tailored, improved, and implemented existing approaches to locate and rank buggy files using groups of crash reports (WU et al., 2014; WANG; KHOMH; ZOU, 2016). We then conducted an empirical study to answer two research questions: (RQ1) What is the performance of our stack-trace approach in identifying buggy code files in Java web-based systems? and (RQ2) To what extent do methods from closed bug-fix issues appear in the stack traces of the associated crash report group?