Federal University of Pernambuco
Informatics Center
Graduate Program in Computer Science

Evaluation of GUI Testing Techniques for System Crashing: From Real to Model-based Controlled Experiments

Cristiano Bertolini
PhD Thesis

Recife - PE
June 2010


Federal University of Pernambuco
Informatics Center

Cristiano Bertolini

Evaluation of GUI Testing Techniques for System Crashing: From Real to Model-based Controlled Experiments

Trabalho apresentado ao Programa de Pós-Graduação do Centro de Informática da Universidade Federal de Pernambuco como requisito parcial para obtenção do grau de Doutor em Ciência da Computação.

A Thesis presented to the Federal University of Pernambuco in partial fulfillment of the requirements for the degree of Doctor in Computer Science.

Advisor: Dr. Alexandre Mota
Co-Advisor: Dr. Eduardo Aranha

Recife - PE
June 2010

Catalogação na fonte
Bibliotecária Jane Souto Maior, CRB4-571

Bertolini, Cristiano.
    Evaluation of GUI testing techniques for system crashing: from real to model-based controlled experiments / Cristiano Bertolini - Recife: O Autor, 2010.
    xvi, 131 p. : il., fig., tab.
    Orientador: Alexandre Cabral Mota.
    Tese (doutorado) - Universidade Federal de Pernambuco. CIn, Ciência da Computação, 2010.
    Inclui bibliografia.
    1. Ciência da Computação. 2. Engenharia de software. 3. Verificação e validação. I. Mota, Alexandre Cabral (orientador). II. Título.
    004    CDD (22. ed.)    MEI2011 – 169


To all my family and friends.

Acknowledgements

First of all, I would like to thank Prof. Alexandre Mota for being a great advisor, for his guidance, great support, ideas and inspiration. Without his supervision, this work would not have been a reality. I would also like to thank my co-advisor, Prof. Eduardo Aranha, for his constant support, availability and constructive suggestions, which were also determinant to the accomplishment of this thesis. A special thanks goes to Prof. Marcelo d'Amorim, who helped me in the first steps of this thesis, mainly in the work described in Chapter 2. His extensive discussions around my work have been very helpful for this thesis. Concerning statistics, I am in debt to Prof. Cristiano Ferraz. He gave me several ideas about this work and was always ready to help me. I am particularly grateful for his cooperation reported in Chapters 3 and 4. I am also in great debt to Prof. Paulo Borba and Prof. Augusto Sampaio for their valuable suggestions and support during my PhD program and, in particular, for the opportunity of being part of the CIn-BTC research project. Furthermore, my gratitude goes to a number of people at the CIn-BTC project and the FORMULA group. All of them contributed to various aspects of this thesis. I thank them for their friendship and for providing a pleasing and friendly atmosphere. To my closest friends, especially André and Silvana, I would like to express my gratitude for their unconditional friendship, support and patience during all these years. Last but not least, I would like to thank my family, my parents José and Lizete, and my brother Rafael, for their unconditional support, encouragement and love, without which I would not have finished this thesis. A great thanks to all of you.

A theory is something nobody believes, except the person who made it. An experiment is something everybody believes, except the person who made it.
—ALBERT EINSTEIN

Abstract

Cellular phone applications are becoming more complex, and so is their testing. Graphical User Interface (GUI) testing is a current trend to test such applications by simulating user interaction. Several techniques have been proposed, and their efficiency (execution cost) and effectiveness (chance of finding bugs) are the aspects most desired by industry. However, more systematic evaluations are required to identify which technique maximizes such aspects. This thesis presents an experimental assessment of two GUI testing techniques, DH and BxT, which are used to test cellular phone applications with a history of real errors. These techniques run for a long period of time (a timeout of 40h, for instance) trying to identify critical issues that drive the system to an unexpected situation where it cannot continue its normal execution. We define this situation as a crash state. The DH technique already existed and is used by the software industry; the BxT technique is our proposal. In a preliminary evaluation, we compare the effectiveness and efficiency of DH and BxT through a descriptive analysis. We show that a systematic exploration, as performed by BxT, seems a more interesting approach to detect crashes in cellular phone applications. Based on the preliminary results, we plan and perform a controlled experiment to obtain statistical evidence about their efficiency and efficacy. As both techniques are constrained to a timeout of 40h, the controlled experiment gives partial results, and thus we perform a deeper investigation using survival analysis. Such an analysis allows us to find the probability of crashing an application using both DH and BxT. As real experiments are expensive, we propose a strategy based on computer experiments tailored to the PRISM notation and tool, to be able to compare GUI testing techniques in general, and DH and BxT in particular. However, our computer experiment results about DH and BxT have a weakness: the model accuracy is not statistically evidenced. Thus we use the previous survival analysis results to calibrate our models, proposing a new calibration strategy for that purpose. Finally, we reuse our computer experiments framework to assess a newly proposed GUI testing technique, called Hybrid-BxT (or simply H-BxT), which is a combination of DH and BxT.

Keywords: GUI Testing, Experimental Software Engineering, Model Checking, Computer Experiments, Probabilistic Models

Resumo

Aplicações para celular estão se tornando cada vez mais complexas, bem como testá-las. Teste de interfaces gráficas (GUI) é uma tendência atual e se faz, em geral, através da simulação de interações do usuário. Várias técnicas são propostas, nas quais eficiência (custo de execução) e eficácia (possibilidade de encontrar bugs) são os aspectos mais cruciais desejados pela indústria. No entanto, avaliações mais sistemáticas são necessárias para identificar quais técnicas melhoram a eficiência e eficácia de tais aplicações. Esta tese apresenta uma avaliação experimental de duas técnicas de teste de GUI, denominadas DH e BxT, que são usadas para testar aplicações de celulares com um histórico de erros reais. Estas técnicas são executadas por um longo período de tempo (timeout de 40h, por exemplo) tentando identificar as situações críticas que levam o sistema a uma situação inesperada, onde o sistema pode não continuar sua execução normal. Essa situação é chamada de estado de crash. A técnica DH já existia e é utilizada pela indústria de software; propomos outra, chamada BxT. Em uma avaliação preliminar, comparamos eficácia e eficiência entre DH e BxT através de uma análise descritiva. Demonstramos que uma exploração sistemática, realizada pela BxT, é uma abordagem mais interessante para detectar falhas em aplicativos de celulares. Com base nos resultados preliminares, planejamos e executamos um experimento controlado para obter evidência estatística sobre sua eficiência e eficácia. Como ambas as técnicas são limitadas por um timeout de 40h, o experimento controlado apresenta resultados parciais e, portanto, realizamos uma investigação mais aprofundada através da análise de sobrevivência. Tal análise permite encontrar a probabilidade de crash de uma aplicação usando tanto DH quanto BxT. Como experimentos controlados são onerosos, propomos uma estratégia baseada em experimentos computacionais utilizando a linguagem PRISM e seu verificador de modelos para poder comparar técnicas de teste de GUI, em geral, e DH e BxT em particular. No entanto, os resultados para DH e BxT têm uma limitação: a precisão do modelo não é estatisticamente comprovada. Assim, propomos uma estratégia que consiste em utilizar os resultados anteriores da análise de sobrevivência para calibrar nossos modelos. Finalmente, utilizamos esta estratégia, já com os modelos calibrados, para avaliar uma nova técnica de teste de GUI chamada Hybrid-BxT (ou simplesmente H-BxT), que é uma combinação de DH e BxT.

Palavras-chave: Teste de GUI, Engenharia de Software Experimental, Model Checking, Computer Experiments, Modelos Probabilísticos

Contents

List of Figures
List of Tables
List of Acronyms

1 Introduction
  1.1 Motivation
  1.2 Context
  1.3 Related Background
    1.3.1 Software Testing
    1.3.2 GUI Testing
    1.3.3 Statistical Analysis
    1.3.4 Experimental Software Engineering
    1.3.5 Probabilistic Models
    1.3.6 Computer Experiments
  1.4 Research Problem and Hypotheses
  1.5 Justification for the Research and Goals
  1.6 Proposed Solution
  1.7 Summary of the Contributions
  1.8 Thesis Outline

2 Exploratory Experimentation
  2.1 Software Testing Techniques
    2.1.1 Infrastructure
    2.1.2 Example
    2.1.3 DH
    2.1.4 BxT
    2.1.5 Exploration Parameters
  2.2 Characterization of Experimental Material
    2.2.1 Failures
  2.3 Empirical Evaluation
    2.3.1 Motivation and Hypothesis
    2.3.2 Comparison
    2.3.3 Random Data and Sequence in BxT
    2.3.4 Exploration Improvements
    2.3.5 Threats to Validity
  2.4 Related Work

3 Controlled Experimentation
  3.1 Experiment Planning
    3.1.1 Experimental Material
    3.1.2 Treatment and Experiment Design
    3.1.3 Experiment Plan and Model
  3.2 Evaluation
    3.2.1 Comparison of the Techniques
    3.2.2 Hypothesis Testing
    3.2.3 Factor Effects
    3.2.4 Threats to Validity
  3.3 Discussion
  3.4 Related Work

4 Experimental Data Evaluation by Survival Analysis
  4.1 Survival Analysis
  4.2 Survival Definitions
  4.3 Data Setup
  4.4 Evaluation
    4.4.1 Survival Tests
    4.4.2 Survival Models
      Exponential Model
      Weibull Model
    4.4.3 Failure Time Significance
    4.4.4 Estimates of Survival
    4.4.5 Threats to Validity
  4.5 Discussion
  4.6 Related Work

5 Computer Experiments Based on Formal Modeling
  5.1 Proposed Strategy
  5.2 PRISM
    5.2.1 The PRISM Language
    5.2.2 The PRISM Property Languages
    5.2.3 The PRISM Model Checker
  5.3 The GUI Framework
    5.3.1 GUI-based System in PRISM
  5.4 Case Study: Cellular Phone Applications
    5.4.1 GUI Testing Techniques in PRISM
    5.4.2 Model Instantiation
    5.4.3 Experiments
    5.4.4 Discussion and Limitations
    5.4.5 Threats to Validity
  5.5 Related Work

6 Process Model Calibration Based on Experiments and Survival Analysis
  6.1 Model Calibration
  6.2 Formal Models
  6.3 Model Calibration Strategy
  6.4 Applying the Proposed Strategy
    6.4.1 Step 1: Collect Data
    6.4.2 Step 2: Create a Markov-based Model
    6.4.3 Step 3: Perform a Survival Analysis
    6.4.4 Step 4: Calibration
  6.5 Threats to Validity
  6.6 Related Work

7 Computer Experiments: A Case Study
  7.1 Hybrid GUI Testing Techniques
    7.1.1 Planning
    7.1.2 Evaluation
    7.1.3 Threats to Validity
  7.2 Discussion
  7.3 Related Work

8 Conclusions
  8.1 Main Contributions
  8.2 Future Work
    8.2.1 Improve the State-Space Exploration
    8.2.2 Testing Theory
    8.2.3 Metrics on Probabilistic Models
    8.2.4 Extraction of Probabilistic Models
    8.2.5 GUI Testing Benchmark
    8.2.6 Runtime Verification
    8.2.7 Statistical Model Checking
    8.2.8 Comparison of Statistical Designs in Software Engineering

Bibliography
Index

List of Figures

2.1 Mobile device environment.
2.2 Sequence of screens.
2.3 BxT time distributions for random data and sequence.
2.4 Correlation between dispersion of screen and capability to find errors.
2.5 Residuals values.
3.1 Effectiveness of techniques.
3.2 Entire model leverage plot.
3.3 Interaction effects of configurations.
4.1 Types of censoring.
4.2 Survival plot.
4.3 Failure plot.
4.4 Exponential plot.
4.5 Weibull plot.
4.6 Linear regression for DH and BxT.
5.1 Strategy overview.
5.2 Experiment generation and graph plotting in PRISM.
5.3 Compositional framework.
5.4 Modules with synchronizing events.
5.5 GUI module.
5.6 SYSTEM module.
5.7 ORACLE to check the state changing frequency.
5.8 HANDLER module and its oracle.
5.9 DH technique module.
5.10 BxT technique module.
5.11 Probability to find a crash.
5.12 Sensitivity analysis.
5.13 Techniques running with unstable regions.
6.1 GUI module.
6.2 HANDLER module.
6.3 ORACLE module.
6.4 DH technique module.
6.5 BxT technique module.
6.6 Calibration process.
6.7 Survival plot.
6.8 Survival analysis versus PRISM for DH and BxT.
7.1 A screen example.
7.2 Hybrid-BxT module in PRISM.
7.3 GUI techniques with handler reliability equal to 98%.
7.4 GUI techniques with handler reliability equal to 50%.
7.5 GUI techniques with handler reliability equal to 80%.
8.1 GUI framework.

List of Tables

2.1 Characterization of experimental material.
2.2 DH versus BxT.
2.3 BxT distribution.
2.4 ANOVA table.
3.1 Characterization of experimental material.
3.2 Characterization of the factors.
3.3 Full factorial design and results.
3.4 Effectiveness data of techniques.
3.5 Mean of runs with fixed parameters.
3.6 ANOVA table of the experiments.
3.7 ANOVA table for all effects.
3.8 Least square mean of main effects.
3.9 Least square mean of main interaction effects.
4.1 Linear regression versus survival analysis.
4.2 Full factorial design and results with censored data.
4.3 Survival summary.
4.4 Survival quantiles.
4.5 Tests between groups.
4.6 Exponential parameter estimates.
4.7 Weibull parameter estimate.
4.8 Survival results for DH.
4.9 Survival results for BxT.
4.10 ANOVA table of DH and BxT regression models.
4.11 Estimates of survival analysis.
5.1 Variable instantiations.
6.1 ANOVA table for DH and BxT.
7.1 Handler's reliability equal to 98%.
7.2 Handler's reliability equal to 50%.
7.3 Handler's reliability equal to 80%.

List of Acronyms

ANOVA  Analysis of Variance
API    Application Programming Interface
BTC    Brazil Test Center
BxT    Behavior Explorer Tool
CID    Crash Identifier
CNL    Controlled Natural Language
CSL    Continuous Stochastic Logic
CSP    Communicating Sequential Processes
CTMC   Continuous Time Markov Chain
DH     Driven Hopper
DOE    Design of Experiments
DTMC   Discrete Time Markov Chain
FB     Flex Bit
GQM    Goal Question Metric
GUI    Graphical User Interface
KM     Kaplan-Meier
LTL    Linear Temporal Logic
MBT    Model Based Testing
MDP    Markov Decision Process
PTF    Phone Test Framework
PCTL   Probabilistic Computational Tree Logic
RQ     Research Question
RCBD   Randomized Complete Blocked Design
SD     Standard Deviation
SMC    Statistical Model Checking
SMS    Short Message Service
SQL    Structured Query Language
SUT    System Under Test
TC     Test Case
UML    Unified Modeling Language
USB    Universal Serial Bus

1 Introduction

In this chapter we present the main motivation to develop this work, the context in which it was developed, the main problem, the related background material, an overview of the proposed solution, and the main contributions.

1.1 Motivation

When a software system is developed, two factors are very important: (i) the system's conformance with respect to its specification (the system only performs what is stated in its requirements); and (ii) the system's reliability (its ability to perform and maintain its usual functions during a certain period of time, even under hostile or unexpected circumstances). Due to the complexity of current systems, neither (i) nor (ii) can be guaranteed completely. Therefore, it is common practice in industry to check them up to a certain acceptable degree. Several techniques have been proposed to this end. Testing is one of them. In fact, software testing is the dominant approach in industry to assure software quality. Testing is not cheap, though. Santhanam and Hailpern [107] and the National Institute of Standards and Technology (NIST) [118] reported that from 50% to 80% of the total cost of a project involves testing and debugging. Automation of software testing then becomes very important as a means to reduce costs.

However, the software industry has difficulty identifying which testing technique is more efficient and effective in a given context. These techniques are in general chosen without statistical evidence. Instead, other subjective factors are used: expected software quality, test team experience, availability of tools and manuals, and so on. For graphical user interface (GUI) testing, for instance, the situation can be even more complex. The testing techniques are in general state-exploration methods and, without guidance, the exploration can become intractable as well as very ineffective.

Another issue is related to the cost associated with obtaining statistical evidence. To be more precise, experiments must consider real systems, and this is very difficult due to resource availability, time to market, complexity and so on. But experiments can also be performed in an abstract world (model-based or computer experiments) where certain real difficulties can be overcome. Although computer experiments are not the best choice with respect to the accuracy of the experiment results, if they are properly calibrated the industry can benefit from an experimental framework where several different contexts (some indeed impossible to realize with real experiments) can be analysed in a lightweight manner.

1.2 Context

This work is the result of a research partnership between Motorola Inc. and the Informatics Center of the Federal University of Pernambuco (called CIn-BTC). CIn-BTC has a team that performs stress testing (automated testing aiming at breaking the system under test, similar to defect testing) on real cellular phone applications. This team developed a testing technique named DH (Driven Hopper). The goal of such a technique is to stress the phone within a certain amount of time (timeout), trying to put the phone into a crash state (in which the phone does not answer any user action and needs to be restarted). Usually, each run of DH takes a long time (around 40 hours). The technique ends only if a crash is found or the timeout is reached. Although it can take weeks to test just one phone build, this very expensive testing campaign adds quality to the final product and thus it is necessary.

One important fact we have observed is that this kind of technique is applied only in the final stage of software development. In this way, other testing techniques (unit, functional, integration, etc.) were already performed on the phone applications. Thus, we can assume that the applications are reasonably stable (free of a lot of errors) when DH is executed. Moreover, one important concept in testing is the oracle, which decides whether a given test case fails or not. In the context of this work, the test oracle is a general crash oracle that monitors the phone memory for bad states.

1.3 Related Background

In this section we summarize the main background used in this thesis. More details about the background are presented in later chapters.

1.3.1 Software Testing

Software testing is an area of growing interest for both academia and industry, with many applications in computer science. Following Dijkstra [29], testing can only show the presence of bugs in a software system, not their absence. Therefore, this is the main focus of any testing technique, which may use the most convenient testing strategy to improve its efficiency and effectiveness as much as possible.

Following Meyers [92], the two main strategies for software testing are black-box and white-box testing. Black-box testing [92] aims at testing without knowledge of the program's internal (source code) organization [15]. It consists of exercising the interface of a component (typically the entire system) to find errors. Any part of the system providing a public interface is amenable to black-box testing. Black-box testing is particularly important for organizations that outsource testing activities. Often these organizations need to keep the source code they develop private, say for confidentiality reasons. In those situations, black-box testing becomes the primary choice for independent assurance of the system's quality. This is the context in which this thesis is inserted. White-box testing, in contrast, requires knowledge of the program structure. For example, with white-box testing an engineer may construct a test with the goal of exercising a specific program path. Black-box and white-box testing are recognized as complementary techniques [92], in the sense that they can find different kinds of bugs and be used in different stages of a software development process. White-box testing has, of course, some advantages when compared to black-box testing and vice-versa. With the aid of white-box testing, one can detect and remove certain lines of code which might introduce potential defects. Only with white-box testing can we be sure of covering all possible paths of a given code at least once. Advantages of black-box testing compared to white-box testing include independence between the programmer and the tester – which may reduce bias – and no reliance on source code – which may help with work division.

Also, two complementary trends in software testing are explored in the literature: conformance and defect testing. Conformance testing aims at checking whether the system implementation does exactly what its requirements state; defect testing aims to identify unexpected situations in the system implementation, called crashes, which are not necessarily associated with its requirements. For example, manually written system tests (derived from requirements) may succeed in covering common user interactions but fail to cover corner-case scenarios that can lead to a crash [81, 9]. These kinds of testing are widely used in industry and can be applied either in a black-box or a white-box approach. Finally, one important effort is the automation of testing activities. Test automation is essential for any organization that works with software testing and aims at becoming more efficient [93].

1.3.2 GUI Testing

One important interface of an interactive system is its graphical user interface (GUI). Testing a system via its GUI is a kind of black-box testing. A GUI test consists of (i) a sequence of GUI commands and (ii) a test oracle to check whether the execution of this sequence produces the expected result [90]. Nowadays, GUI testing is getting special attention from academia [132, 122, 136, 101]. As a consequence, this is also benefiting industry by providing more disciplined testing techniques. It is worth noting that the automation of black-box testing can be challenging: unguided search may be ineffective for too large state spaces [53]. For GUI testing, in particular, the size of the state space reachable (from the GUI) is typically intractable [28, 136]. To alleviate the state-space explosion problem, many software model-checking techniques, which are based on testing, need to access the state to perform space reductions [57, 50, 123]. Unfortunately, such optimizations are not possible for black-box testing in general. Another challenge to address is to find faults that the GUI itself creates. Pichler and Ramler [101] report that up to 50% of the code written in a GUI application relates to the GUI part of the application itself.
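To make the notion of a GUI test concrete, the sketch below (an illustration only, not code from the tools studied in this thesis) represents a test as a sequence of GUI events plus a crash oracle; the device object and its send method are hypothetical stand-ins for the phone-side interface.

    from dataclasses import dataclass
    from typing import Callable, List

    @dataclass
    class Event:
        """A single GUI command, e.g. pressing a key or entering text."""
        name: str
        data: str = ""

    def run_gui_test(device, events: List[Event],
                     oracle: Callable[[object], bool]) -> bool:
        """Send each event to the device and ask the oracle after each step.

        Returns True if the oracle flags a crash state (the test found a
        defect), or False if the whole sequence completes normally.
        """
        for ev in events:
            device.send(ev)      # hypothetical call: deliver the event to the GUI
            if oracle(device):   # e.g. a monitor of the phone memory for bad states
                return True
        return False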

1.3.3 Statistical Analysis

One of the main goals of applying statistics is to investigate the causality of event occurrences, that is, to observe the effect of an independent variable (or variables) on the behavior of the dependent (or outcome) variable. To conduct a statistical study it is important to consider the statistical method that can be used; for instance, one may perform an analysis of variance (ANOVA) between two groups. Basically, a statistical method is used to answer research questions about one subject and it comprises, in general, a statistical analysis. The main goals of a statistical analysis over a data set are to make correct decisions about a phenomenon, to solve problems, or to produce new knowledge. But it also usually generates new problems, and in this way we have an iterative process.

Statistical analysis can be used for two main purposes: to describe a collection of data, that is, descriptive analysis, and to draw inferences about the process or population being studied, that is, inference statistics [46]. A descriptive analysis is used to analyse, in a quantitative way, the main characteristics of the collected data (it presents the facts). It uses graphs (like box plots, scatter plots, etc.), tables and measurements to discover certain characteristics (facts) about the studied phenomenon which are not easily observable by looking only at the data. The data can be qualitative or quantitative information about a variable. A qualitative variable describes the results as attributes or in terms of quality. A quantitative variable describes the results as numbers on a specific scale.

Inference statistics is a more detailed analysis. To obtain conclusions, the modeling decisions made by an analyst must have a minimal influence. For that reason, a widely used approach is to perform a controlled experiment. Thus, inference statistics results consider a confidence level, usually of 95%. However, the confidence level is not fixed and can change according to the research area and the goal of the study (for example, in the medical area this level is usually 99%). Design of experiments (DOE) is a method of inference statistics used to plan and perform controlled experiments. That is, it is used to define the data set, its size, and under what conditions the data must be collected in an experiment. When DOE is performed, a common technique used to analyse the data is regression analysis. It helps to understand how the values of the dependent variable change when any one of the independent variables varies and the others are fixed. Another useful method, for time-dependent problems, is survival analysis, which involves the modeling of time-to-event data. When we study such phenomena it is impossible to observe them for an infinite time or to guarantee that outliers do not affect our study. In these cases, survival analysis is a useful method to give us more accurate inference statistics. Thus, the results are given by a probability distribution, which is an equation that links each outcome of a statistical experiment with its occurrence probability. Moreover, a good statistical analysis in general includes both descriptive and inference statistics. In the context of this thesis, we explore a complete statistical analysis including DOE and survival analysis.
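As a generic illustration of how such an analysis can be run (the numbers below are invented and are not data from this thesis), a one-way ANOVA comparing the time-to-crash of two techniques could look as follows:

    from scipy import stats

    # Hypothetical time-to-crash samples (hours) for two techniques.
    dh_times = [38.5, 40.0, 35.2, 40.0, 39.1]
    bxt_times = [21.3, 18.7, 25.4, 30.2, 22.8]

    f_stat, p_value = stats.f_oneway(dh_times, bxt_times)
    print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
    # A p-value below the chosen significance level (e.g. 0.05) indicates a
    # statistically significant difference between the two groups.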

1.3.4 Experimental Software Engineering

Software development comprehends a lot of ideas, techniques and methods. Juristo and Moreno [65] present some general and even philosophical questions that are not easy to answer in the software engineering area: Are we sure about the validity of our beliefs? Which of the claims made by the software development community are valid? Under what circumstances are they valid? Experimentation in software engineering refers to finding facts, with statistical evidence, about the suppositions, assumptions, speculations and beliefs that are intrinsic to software development. Unfortunately, experimentation-based works are relatively scarce in the software engineering literature. An important area inside software engineering, where we can find most of the experimentation-related works, is testing. As new and different systems are developed constantly, experimental evaluation is frequently needed. However, the experiments in this field have some flaws and, consequently, the empirical knowledge used on testing techniques is far from solid [66].

Several software testing techniques are used in real systems. They are chosen according to different factors: usage-related costs (installation, maintenance, etc.), test engineers' skills, the level of automation, technical aspects (Web, operating system, etc.), market, and so on. But in general, intuition and personal experience are used to select them, because industry rarely uses resources to determine precisely which technique is really better (for instance, with respect to finding bugs more quickly). The evaluation of GUI testing techniques is in general made by experimental analysis [23, 28, 136]. Most experimental evaluations provide information about user usage, defects found, and the efficiency and efficacy of the techniques, that is, they provide a descriptive analysis. Experimental evaluation [67, 49, 138] is one of the best ways to demonstrate which technique is better or why practitioners should adopt a new technique introduced by research [126]. In this way, evaluation through experiments is important to obtain statistically significant evidence [137, 67]. Although there is in general a lack of knowledge in industrial practice about performing controlled experiments in software engineering, the distance between academia and industry has a tendency to decrease with the use of experimental studies of higher quality and relevance [113].

1.3.5 Probabilistic Models

Computer systems are often unpredictable or non-deterministic and may have several possible paths (that is, different behaviors). Some of these paths may be more likely than others, but none is certain. Probability theory was developed to deal with the randomness that emerges from the real world, and models can be constructed to describe systems with random characteristics.

A probabilistic model [91], also called a statistical model in statistics, is a set of mathematical equations that describe the behavior of an object of study in terms of random variables and their associated probability distributions. A common model used in computer science is the Markov chain [117], which is used to model different characteristics of systems. Variations of this model include Petri nets [100], stochastic process algebras [5, 56] like PRISM, and so on. Basically, the models describe a set of events that are associated with some outcome variable and a probability associated with the occurrence of each event. By convention, models are normalized in such a way that certain events receive a probability of 1 and non-occurring events a probability of 0. A probabilistic model is very useful to describe situations in which uncertainty is involved. In computer science we have an extensive number of different objectives and applications where these kinds of models are useful. Probabilistic methods are used to describe nature mathematically as well as to describe real-world applications. We can easily find several examples of probabilistic models in real-world problems: network traffic modeling, probabilistic analysis of algorithms and graphs, reliability modeling, simulation algorithms, data mining, speech recognition, and so on.
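As a toy illustration of such a model (not one of the models developed in this thesis), the snippet below simulates a small discrete-time Markov chain from its transition matrix and estimates the probability of reaching an absorbing crash state; all probabilities are invented.

    import numpy as np

    # States: 0 = idle screen, 1 = menu, 2 = crash (absorbing).
    # The transition probabilities are illustrative only.
    P = np.array([[0.70, 0.29, 0.01],
                  [0.50, 0.45, 0.05],
                  [0.00, 0.00, 1.00]])

    rng = np.random.default_rng(seed=42)

    def simulate(steps: int, start: int = 0) -> int:
        """Walk the chain for the given number of transitions; return the final state."""
        state = start
        for _ in range(steps):
            state = rng.choice(len(P), p=P[state])
        return state

    # Monte Carlo estimate of the probability of having crashed within 100 steps.
    runs = 1000
    crashes = sum(simulate(100) == 2 for _ in range(runs))
    print(f"Estimated crash probability: {crashes / runs:.2f}")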

1.3.6 Computer Experiments

Performing experiments on concrete systems is the best choice to obtain highly accurate results [126]. However, experiments on real systems can sometimes take an (almost) prohibitive time to be accomplished completely. Moreover, extending the techniques and ideas to other contexts requires time to study the context, implement the techniques and perform new experimentation. Following Briand and Labiche [26], whether humans are involved in exploratory testing activities or testing techniques are to be compared, working with models instead of real systems is desirable [26]. The major drawback of working with models instead of real systems is related to how good the results provided by the model are with respect to the real system, that is, how accurate the model is. An additional difficulty occurs when the model needs certain knowledge about the real system. In this case, humans provide information based on (limited) observable (visible or external) information about the system behavior [26]. Weyuker [127] reports a practical problem related to using models to compare testing techniques: the practitioner has difficulty building and maintaining such models. Thus, we need a way to follow [26] using models, but at a lower cost.

The process of adjusting a model to obtain results as close as possible to those of a real system is known as calibration [108]. The purpose of calibration is to determine the values of the model parameters (configuration) according to observations from experiments performed with the real system. We have exact responses from the real experiment and probabilistic predictions from the model. Thus, we use survival analysis to transform the exact responses (from the real experiment) into probabilistic predictions, to be able to calibrate the model.
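A minimal sketch of this calibration idea is given below, under the simplifying assumptions that the model exposes a single free parameter and that survival analysis has already produced reference crash probabilities at a few time points; every value, and the exponential crash law used as the model stand-in, is invented for illustration and is not a result from this thesis.

    import numpy as np
    from scipy.optimize import minimize_scalar

    # Reference crash probabilities at selected times, as a survival analysis of
    # real experiments might estimate them (illustrative values only).
    times = np.array([10.0, 20.0, 40.0])       # hours of execution
    observed = np.array([0.15, 0.35, 0.60])    # P(crash by time t)

    def model_prediction(rate: float, t: np.ndarray) -> np.ndarray:
        """Stand-in for the probabilistic model: an exponential crash law."""
        return 1.0 - np.exp(-rate * t)

    def loss(rate: float) -> float:
        """Squared error between model predictions and the survival estimates."""
        return float(np.sum((model_prediction(rate, times) - observed) ** 2))

    result = minimize_scalar(loss, bounds=(1e-4, 1.0), method="bounded")
    print(f"Calibrated rate parameter: {result.x:.4f}")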

1.4 Research Problem and Hypotheses

The main problem investigated in this thesis is how to improve state exploration in GUI testing from a black-box perspective. Moreover, we investigate techniques whose focus is crashing the system through the automated generation and execution of GUI tests. The main research hypotheses investigated in this work, stated in general terms, are:

• H1: to find more bugs, a GUI testing technique must improve the state-space exploration;
• H2: survival analysis can be used as a way to calibrate models;
• H3: using models is an efficient way to compare testing techniques.

1.5 Justification for the Research and Goals

As systems are becoming more complex and the expected time to market is decreasing, a precise evaluation of testing techniques is particularly important. Moreover, applying inappropriate testing techniques can significantly increase development costs as well as reduce the capability to detect bugs. In what follows, we summarize our specific goals:

• Study and propose GUI testing techniques used in an industrial environment;
• Evaluate the efficiency and effectiveness of GUI testing techniques: use statistical methods to analyse different GUI testing techniques and decide which one is the best for a specific context;
• Propose and show the usage of a strategy to compare testing techniques using probabilistic model checking, to overcome the usual lack of resources (phones, PCs, humans, etc.) and to perform complex analyses, for example, studying the variation of different parameters in the model;
• Calibrate the proposed models and perform new simulations.

1.6 Proposed Solution

We argue that changing the system state more frequently increases the chance to crash an application, and that a more systematic GUI testing technique can detect more crashes in less time. As a research effort, we propose a new technique, named BxT (Behavior eXplorer Tool). Our goal is to determine whether it is possible to replace DH with BxT (if we can show that BxT is more efficient and effective than DH) and when (whether BxT should still be applied at the last development stages, like DH, or in another stage).

We perform an experimental analysis of the two GUI testing techniques using real applications. These experiments were done based on a descriptive analysis only. Moreover, we lack internal knowledge about the application (we do not have access to its source code), we are unable to monitor specific aspects of the application (we cannot consider a bug taxonomy [15]), and we have a high cost to perform experiments (each experiment needs at most 40 hours of execution time). Thus, we cannot precisely identify which GUI testing technique is better in that context (the domain and system characteristics) [23]. However, we used controlled experiments in order to evaluate the parameters used to guide each technique as well as to show the results with statistical confidence. Moreover, we perform a survival analysis to deal with the time-dependent characteristics of our data set.

We propose a strategy that consists in modeling the GUI testing techniques as well as the GUI-based system using a Markov-based language (PRISM) [5, 75]. The GUI-based system is modeled following a proposed framework which is compositional and parametrized. The compositional aspect is employed to ease adapting the framework to different contexts as well as to capture different oracles and the bugs they can detect [15, 27]. The parameters are used to instantiate the model with specific characteristics based on test hypotheses obtained from the application domain. Moreover, we propose a new technique called Hybrid-BxT that is based on the previous techniques, BxT and DH. We highlight some conclusions about this new technique using our probabilistic models.

In our solution we address the problem by performing concrete experiments and also computer experiments, and we show that we can combine both in order to improve the analysis. Both approaches have advantages and disadvantages. We argue that concrete experiments are more precise but expensive (time, effort and resources), while computer experiments are less accurate but more flexible and take less time to be performed. The design and implementation of GUI testing techniques is expensive and experiments take a long time to be completed. However, the evidence found with such real implementations and controlled experiments is important to deliver the research results to industry. Models are also important and can bring faster results than controlled experiments; however, they are less precise. The use of both approaches can be a good solution, but it has an associated high cost.

1.7 Summary of the Contributions

In what follows we list our main contributions:

1. The proposal and implementation of GUI testing techniques, BxT and Hybrid-BxT, to analyse cellular phone applications;
   • A preliminary experimental evaluation of the techniques using Motorola cellular phones to evaluate some GUI techniques (DH and BxT) considering two metrics: the time and the capability to find crashes [23];
   • We designed a controlled experiment to investigate which technique (DH or BxT) is more efficient: in the end, BxT is more efficient than DH, and the values of some parameters of the techniques and their combination impact the results;
2. The use of survival analysis to evaluate the results and determine the probability of failure for each technique;
   • Based on survival analysis, we show that BxT has a higher probability (20% more after 40h of execution) to find crashes over time;
   • We show that the current timeout for DH is not enough to find crashes;
3. We propose a strategy to evaluate GUI-based testing techniques using probabilistic model checking [18];
   • We developed a compositional and parameterized framework to model GUI-based systems;
   • We illustrate the use of the PRISM model checker to assess testing techniques;
   • We show how to combine experimental software engineering and formal probabilistic models;
   • We performed an evaluation of some GUI testing techniques to increase the confidence in our strategy.

1.8 Thesis Outline

The thesis is organized as follows:

• Chapter 2 presents a preliminary experimentation with GUI techniques, where no experimental planning is provided. Its main focus is to check the feasibility of undertaking a more detailed study as well as to present preliminary results about the efficiency and applicability of the GUI testing techniques in a cellular phone application context;
• Chapter 2 presents a descriptive analysis in an ad hoc fashion and without enough control of our collected data. In this sense, Chapter 3 presents the planning, execution and analysis of controlled experiments, where we collect a new data set that describes the effect of several factors: the technique parameters and the techniques themselves;
• Chapter 4 presents an analysis considering the time-related nature of the data of the previous experimentation. In this way, we show a more precise statistical method considering censored data;
• Chapter 5 presents a general framework to perform experimentation based on probabilistic models;
• Chapter 6 presents an instantiation of the proposed probabilistic framework and an algorithm to calibrate the models;
• Chapter 7 presents a new testing technique and a simulated experiment that shows the relationship between this new technique and the original ones;
• Chapter 8 presents the final conclusions of this work and some topics to be investigated as future work.

2 Exploratory Experimentation

This chapter presents the results of an empirical study of GUI testing techniques. These experiments were carried out at a time when we did not have enough knowledge about the techniques to perform a controlled experiment. Thus, we conducted these experiments mainly to check the feasibility of such a research direction as well as to gather more information about the environment where this thesis was developed. These results can also be found in [23].

2.1 Software Testing Techniques

As discussed in the introduction, we have two GUI testing techniques to analyse, DH and BxT, whose main difference is that BxT changes the state more frequently. Basically, DH drives the application to a particular screen and keeps pressing random keys (some of which can change the current screen) for some (configured) time until it finds crashes in an application or a timeout occurs. To drive the application to a particular screen, DH uses the instruction goto(). This instruction actually implies the execution of a small test case that drives the phone to a specific screen.

Although BxT is similar to DH in the sense of visiting certain screens and pressing keys on them, it attempts to make a more systematic selection of inputs than DH: it recognizes which controls are available on a screen, selects inputs (when necessary) according to these controls and presses random keys within the available controls. For instance, BxT can send scroll-down and scroll-up events when it recognizes a scroll bar control in the current screen. Another example would be a screen that contains only two buttons, say "OK" and "Cancel". While DH may press several keys before it hits the ones for "OK" and "Cancel", BxT makes a random selection between one of these two options. Also, BxT can recognize when an event requires a data entry. For example, in an SMS message textbox BxT can identify that it is a data entry and generate random data for this textbox. Note, however, that BxT has more stringent observability and controllability requirements [92, 15] than DH: it requires a library (called PTF [42]) providing support for recognizing screen components and sending specific events to them. The algorithms of our GUI testing techniques can be generalized to other kinds of GUI applications (Desktop, Web, etc.). In our context, we verify crashes in cellular phones. In what follows we present these two techniques in more detail and also report an empirical study about them.

2.1.1 Infrastructure

Figure 2.1 shows the main environment where these GUI techniques are applied. Basically, we use a framework called PTF [42]. The PTF API is already inside any mobile device and allows one to send and receive commands to the phone using a USB connection. The Communication Controller is responsible for detecting crashes using an oracle monitor (it recognizes whether the phone is in a crash state or not), controlling the response time (sometimes the phone cannot respond) and the USB connection.

Figure 2.1 Mobile device environment.

An alternative way to perform tests uses an emulator [1]. But as several crashes are related to physical phone restrictions (memory, process allocation, and so on), the use of the environment depicted in Figure 2.1 represents more precise testing. As our goal is to detect crashes, we argue that these kinds of crashes cannot be detected just using an emulator.
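The role of the Communication Controller can be sketched as a simple polling loop, as below; this is only an illustration, and phone.is_responsive() and phone.in_crash_state() are hypothetical calls standing in for the response-time check and the PTF-based oracle monitor, whose real API is not shown here.

    import time

    def monitor_for_crash(phone, timeout_s: float, poll_s: float = 1.0) -> bool:
        """Poll the device until a crash is detected or the timeout expires."""
        deadline = time.monotonic() + timeout_s
        while time.monotonic() < deadline:
            # Hypothetical checks: the phone stopped answering over USB, or the
            # oracle monitor found a bad memory state.
            if not phone.is_responsive() or phone.in_crash_state():
                return True
            time.sleep(poll_s)
        return False  # timeout reached without detecting a crash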

2.1.2 Example

Here we demonstrate a simple example of how each technique explores phone applications. As a simple application can have thousands of different sequences of key presses, our goal here is just to visualize the exploration mechanism used by these GUI testing techniques. Thus, we will show only one simple sequence.

Figure 2.2 Sequence of screens.

Figure 2.2 shows a possible sequence of 5 screens in the Message application. These screens show the following observable behavior: from the main screen (screen 1), which has 1 message in the Message Inbox folder, 13 messages in the Outbox folder and the options button in the Message Center application, a new message is created (screen 2). This action creates an empty textbox screen (screen 3). When the text "Testing" is entered in the textbox (screen 4) and the option button is selected, a Message Abort Options screen is displayed (screen 5).

DH explores the application considering all possible inputs, that is, all keyboard inputs. In the first screen, it can choose an enabled input (options, back, center key, up and down) as well as press other keys (for example, the numbers available on the phone keyboard), even if they are not valid in a given context (screen). That is, in principle such keys should simply be discarded by the phone application. On the other hand, BxT explores the application more systematically than DH. For example, in the first screen it will select only the inputs that are enabled (options, back, center key, up and down). In order to know which options are available, it executes a special event that obtains all enabled inputs. Therefore, for each visited screen BxT sends an event just to learn its content. Also, BxT will always enter data when data components are enabled. In this way, we populate the memory more frequently. As will be presented later, by accelerating memory change we increase the potential to crash an application more quickly. Next we detail the algorithms of these two techniques.
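Before presenting those algorithms, the difference in input selection can be made concrete with the small sketch below (illustrative Python only, not the tools' actual code): DH draws from the whole physical keypad regardless of context, while BxT first queries the current screen for its enabled controls; screen.enabled_events() is a hypothetical call standing in for the PTF query.

    import random

    KEYPAD = ["OK", "Cancel", "Up", "Down", "Left", "Right",
              "0", "1", "2", "3", "4", "5", "6", "7", "8", "9"]

    def dh_next_key(rng: random.Random) -> str:
        """DH-style choice: any physical key, whether or not it is valid here."""
        return rng.choice(KEYPAD)

    def bxt_next_event(screen, rng: random.Random) -> str:
        """BxT-style choice: only among the events the current screen enables."""
        enabled = screen.enabled_events()  # hypothetical query of the screen content
        return rng.choice(enabled)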

2.1.3 DH

Algorithm 1 shows the pseudo code for DH. The algorithm repeats (loop at line 3) the following sequence of steps until it either reaches the timeout or finds a crash: (i) selects one screen, (ii) drives the application to that screen, (iii) sends random events (presses keys) to the GUI for a while (timeout1 at line 6), and (iv) checks for a crash.

Algorithm 1: DH pseudo code
1  main(List<Test> screens, int seed, int timeout1, int timeout2): bool;
2  begin main
3      while !isTimeout(timeout2) do
4          Test screen = listOfScreens.pickOne(seed);
5          screen.run(); /* goto random screen */
6          pressRandomKeys(seed, timeout1);
7          if isCrash() then return true;
8      endw
9  end

The inputs to DH are a sequence of manually written tests – screens, a random seed that the event generator uses – seed, a bound on the execution time for sending random events (the amount of keys that can be pressed) in one iteration – timeout1, and a bound on the total execution time – timeout2. Note that DH does not generate inputs (that is, GUI events) according to the active components in the current screen. It simply generates control events randomly within one important region of the application. It is also important to note that the algorithm uses a library to send events to the GUI. For DH it sends general events, those corresponding to physical keyboard keys. Line 6 highlights the use of a library function for sending key-press events to the application. One needs to provide such a function to allow DH to be used in a specific context. In our context (Motorola), it is already available. A note on the length of the sequence of keys that can be pressed: in an execution using the PTF library, 50 keys can be pressed in about 30 seconds. Thus, we use this information to set the timeout of our experiments.

2.1.4 BxT

Algorithm 2 shows the pseudo code for BxT. The main difference from DH is the way it generates events. Conceptually, the algorithm contains two main parts. The first part decides which screen to focus on. The second part stresses the application from the selected screen. This part performs the following steps a fixed number of times or until it crashes the application: (i) identifies the enabled events in the current screen, (ii) selects one of such events randomly, (iii) generates data corresponding to the selected control and sends the event to the GUI, and (iv) checks for a crash. The code fragment in the line range 12-13 corresponds to this sequence.

Algorithm 2: BxT pseudo code
 1  main(int seed, int numRept, int timeout): bool;
 2  begin
 3    Set<Test> screenSet = ∅;
 4    if driven() then                       /* random set of goto-screen tests */
 5      screenSet = {tc1, tc2, ..., tcn};
 6    else
 7      screenSet = {initialScreen()};
 8    endif
 9    while !isTimeout(timeout) do
10      screenSet.pickOne(seed).run();
11      for i = 1 to numRept do
12        Event ev = enabledEvents().pickOne(seed);
13        ev.genInputs(seed).run();          /* sends message to the GUI */
14        if isCrash() then return true;
15      endfor
16    endw
17    return false;
18  end

The inputs to BxT are a seed used for generating the sequences of pressed keys (that is, choosing the event) and the data (that is, generating input to the event) (seed); an integer denoting the number of iterations of the second part of the algorithm (numRept); and a bound on the total time for testing (timeout). Line 10 makes a random choice of which screen in the set screenSet the execution should stress. Note that the code fragment in the line range 4-8 initializes this variable. The external Boolean function driven() indicates whether or not the execution should perform jumps across different screens (in a similar fashion to DH). BxT can initialize the variable screenSet with a fixed set of screens. Lines 12 and 13 highlight the use of a library to identify which events are enabled and to send the event to the GUI, respectively.
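A corresponding Python sketch of the BxT loop is given below, again purely illustrative: enabled_events, gen_input, send_event, run_screen and is_crash are hypothetical stand-ins for the proprietary library, not its real API.

import random
import time

# Illustrative sketch of the BxT loop (Algorithm 2). All phone-side calls are
# hypothetical stand-ins for the proprietary library.
def bxt(seed, num_rept, timeout, screen_set, run_screen,
        enabled_events, gen_input, send_event, is_crash):
    """Returns True as soon as the oracle reports a crash, False on timeout."""
    rng = random.Random(seed)
    start = time.time()
    while time.time() - start < timeout:
        run_screen(rng.choice(screen_set))        # jump to one goto-screen test
        for _ in range(num_rept):
            ev = rng.choice(enabled_events())     # (i)-(ii) pick an enabled event
            send_event(ev, gen_input(ev, rng))    # (iii) generate data and send it
            if is_crash():                        # (iv) consult the crash oracle
                return True
    return False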

In the following section we explain the main elements we used to perform our preliminary experiments.

2.1.5 Exploration Parameters

We configure both techniques with some parameters that determine how the exploration of the system may be done. In what follows, we list the parameters we will evaluate:

• Re-initialization (Driven): the exploration usually starts from an initial state. From this initial state the testing techniques explore a set of screens for a specific time (number of pressed keys or timeout). The parameter Driven is used to characterize whether the choice of the initial state is fixed (always the same initial state) or random (any state from a set of available ones).

• Probabilities (KeyProb): each pressed key can have a different probability (frequency of occurrence). We assume that some pressed keys are more likely to occur than others. For example, keys like "Cancel" have a smaller probability of changing the system state than an "OK" key, so we assign the "OK" key a greater probability of occurrence.

• Test Case Size (SizeTC): a test case is a sequence of keys. Since there is an approximate correspondence between pressed keys and the time to perform them, we use the parameter SizeTC to control the amount of time each test case will need to be performed.

Depending on how these parameters are configured, the generated test sequences can be significantly different. For example, Driven may be one single screen or a set of completely different screens. The sketch below illustrates one possible encoding of these parameters.
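The snippet below is one possible, purely illustrative encoding of the three parameters in Python; the key names, probabilities and screen names are invented for the example and are not the values used in the experiments.

import random

# Hypothetical KeyProb table: "Cancel" is assumed less likely than "OK".
KEY_PROB = {"OK": 0.30, "UP": 0.20, "DOWN": 0.20, "OPTIONS": 0.15,
            "BACK": 0.10, "CANCEL": 0.05}

def make_test_case(size_tc, driven_screens, rng):
    """SizeTC bounds the number of keys; Driven selects the starting screen."""
    # Driven: a fixed initial screen when the list has one element, or a
    # random pick when several candidate screens are available.
    start = rng.choice(driven_screens)
    keys = rng.choices(list(KEY_PROB), weights=list(KEY_PROB.values()), k=size_tc)
    return start, keys

rng = random.Random(42)
print(make_test_case(size_tc=50, driven_screens=["MAIN", "INBOX", "CAMERA"], rng=rng))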

2.2 Characterization of Experimental Material

In what follows we define the material that we use to evaluate our GUI testing techniques: not only the software to be tested, but the embedded system as well. We characterize each experimental material with:

• The phone model: a list of external and internal phone features that identifies a set of similar phone functions;

• The hardware version: a hardware revision that represents a small change with respect to a previous version, for example an improved design or a bug fix in some physical component;

• The software version: the build of the operating system and its applications.

Table 2.1 shows the experimental materials we used in our experiments. Column "Config." introduces a unique identifier to distinguish each combination of model, hardware, software and flex bit. The other columns show each of these attributes. The identifiers we use in this table correspond to real identifiers, but they are masked for confidentiality reasons. Note that some configurations share the same model or hardware, but the software and flex bit vary. The selection of these configurations was driven by the availability of equipment where past errors had been detected.

Config.   Model   Hard.   Soft.
A         M1      H3      S1
B         M1      H4      S2
C         M1      H4      S3
D         M2      H2      S4
E         M2      H1      S5
F         M3      H5      S6
G         M3      H6      S7
H         M2      H2      S8

Table 2.1 Characterization of experimental material.

The experimental materials have a direct influence on the number of crashes detected. For example, a new model or a new set of applications may represent a more unstable cellular phone, and the GUI testing techniques may find more crashes in it than in more stable cellular phones (for example, old models).

2.2.1 Failures

The oracle does not operate on the GUI. It is a general Motorola proprietary program that monitors the phone memory for bad states. In our work we do not investigate the oracle problem [90], but we use this proprietary oracle to obtain the results of our tests. The oracle detected 12 distinct kinds of crashes across all experiments. In the following, we distinguish them using crash identifiers (CIDs) from 1 to 12. Each identifier denotes a different undesirable scenario of the application that the oracle is able to capture. For example, CID=1 denotes that the system makes no progress but the oracle is unable to ascertain the reason, CID=2 means that an issue with the hardware interface (for example, it is not possible to allocate memory) prevents the application from making progress, CID=6 denotes a programming error like divide by zero, etc. A minimal sketch of the oracle interface that the techniques rely on is shown below.
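The real oracle is proprietary, so the sketch below only illustrates, under invented names, the shape of the interface the testing techniques rely on: after each event they ask whether the phone entered a bad state and, if so, which CID was reported. Neither the classes nor the poll() call correspond to the actual Motorola tool.

from dataclasses import dataclass
from typing import Optional

@dataclass
class CrashReport:
    cid: int           # crash identifier, 1..12 (see the list below)
    description: str   # e.g. "freeze", "hardware interface issue"

class PhoneOracle:
    """Illustrative wrapper; `monitor` is a hypothetical handle to the crash monitor."""
    def __init__(self, monitor):
        self.monitor = monitor

    def check(self) -> Optional[CrashReport]:
        raw = self.monitor.poll()   # assumed call: returns None while the phone is healthy
        if raw is None:
            return None
        return CrashReport(cid=raw.cid, description=raw.text)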

It is worth noting that the oracle reports only the crash event; it does not inform the reason for the crash (as debuggers do). As a consequence, it is possible that distinct techniques report different manifestations (CIDs) of the same defect. There is no single correct way to characterize bugs; Beizer [15] presents a bug taxonomy that is used in several research works. Here, however, we present the bug taxonomy used in our context. We will not explore the correlation between these taxonomies because it is not the focus of this thesis. The following list associates each failure identifier with a description of the corresponding failure. These CIDs will be used later in this section.

• CID=1: The crash monitor identifies that the system makes no progress and is unable to ascertain the reason. This kind of crash is also called a freeze.

• CID=2: The crash monitor detects an issue with the program hardware interface. For example, it is unable to allocate memory, re-initialize status variables, or switch off phone connections (for example, Bluetooth, USB, etc.).

• CID=3: The monitor detects that the application cannot consume a multimedia file; for example, it cannot read the data of a picture that has just been taken.

• CID=4: A memory region is invoked by some request at the same time as a task requests the allocation of bytes from a partition that has been exhausted.

• CID=5: When calls to multimedia functions (for example, taking a picture or opening an image) are made and canceled (by pressing the "Cancel" key) very quickly in succession, the system loses the multimedia references.

• CID=6: The crash monitor observes an exception from the application, for example a memory access error, a division by zero, an illegal instruction, etc.

• CID=7: The system is creating a timer (for example, a timer to display a screen saver), the maximum limit is reached, but the system clock keeps creating the timer, so the counter range goes out of bounds.

• CID=8: A memory region is invoked by some request at the same time as an address that was already freed is being freed again.

• CID=9: A memory region is invoked by some request at the same time as a task requests the allocation of bytes from a partition that has been corrupted.

• CID=10: A timer specified by an internal function is not scheduled to expire, which results in a livelock situation.

• CID=11: Sometimes an internal timeout (related to the hardware interface) is not reset, which can then cause a crash (for example, in interrupts).

• CID=12: An error related to the SIM card (Subscriber Identity Module): for some reason the phone cannot read, write or access the SIM card, for example to store a new contact in the SIM card or to make a call.

2.3 Empirical Evaluation

We present in this section the results obtained in our preliminary empirical experimentation. These results demonstrate how the techniques work in an industrial context.

2.3.1 Motivation and Hypothesis

Our exploratory experimentation intends to validate the GUI techniques. We are interested in confirming the following two main hypotheses:

• H1: a more systematic GUI testing technique can detect more crashes and do so in less time.

• H2: there is a correlation between the time to find a crash and the dispersion of screens.

Based on these hypotheses we report the main results in the following section.

2.3.2 Comparison

This section compares BxT and DH. The goal is to determine empirically which technique is better, in the sense of the chance of finding a crash (efficacy) and how fast it can find one (efficiency).

Setup. This experiment configures BxT with a timeout of 40h. The parameter numRept is set to 50, so BxT runs 50 events in each iteration, taking approximately 30 seconds.

BxT explores one screen for some time and then jumps to another screen. It repeats these tasks until it reaches the 40h timeout. It is important to note that numRept is related to the amount of time through the equivalence, discussed earlier, between pressed keys and elapsed time. It is also important to observe that the setup can affect the time to find a crash. In this experiment we define the setup according to the practices used to test cellular phones in our context. In a more controlled experiment (Chapter 3) we show the effect of these parameters.

Table 2.2 summarizes the comparison. Column "Config." shows the identifier of one configuration (experimental material), column "CID" shows the identifier of the crash, and column "Time" shows the execution time of each experiment. Column "diff1" denotes the execution time difference between running BxT and DH (BxT time minus DH time) considering all configuration runs, and "diff2" the time difference considering only those runs in which both DH and BxT crash the application. Row "avg." reports the averages of each column: for the CID columns, it shows the fraction of experiments that revealed a crash; for the Time columns, it shows the arithmetic mean of the elapsed time. A small sketch at the end of this subsection illustrates how these summary figures can be computed from the raw runs.

Config.   DH             BxT             Diff
          CID    Time    CID    Time     diff1     diff2
A         1      21.2    6      33.8     +12.6     +12.6
B         -      40.0    6       6.2     -33.8     -
C         -      40.0    12     14.8     -25.2     -
D         8       3.3    8       4.4     +1.1      +1.1
E         3       7.0    5       2.7     -4.3      -4.3
F         6       5.7    11      4.0     -1.7      -1.7
G         6      33.7    1       1.7     -21       -21
H         5      21.7    10      7.9     -13.8     -13.8
Avg.      75%    20.2    100%    9.44    -86       -27

Table 2.2 DH versus BxT.

Results. We observe the execution time difference between running DH and BxT considering all configuration runs, and the time difference considering only those runs in which both DH and BxT crash the application. Our preliminary results show that the GUI techniques are effective and efficient, with BxT showing a better tendency. We list below our key observations:

• Time. On average, DH is slower than BxT when they both find a crash. DH ran 27 hours more than BxT when both find a crash and 86 hours more over all configurations. Considering all executions, BxT ran 43% less than DH.

• Effectiveness. In contrast to BxT, DH could not crash the application in 2 out of 8 configurations. It seems that jumping between screens and changing state more frequently was effective in improving the exploration. Conceptually, the jumps correspond to starting the technique in a specific state, which can be an initial or an intermediate state (considering a graph). Also, we observe that all reported crashes are different, but we cannot affirm whether the root causes of these errors are different too.
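As referenced above, the following sketch shows one way the summary figures in Table 2.2 (effectiveness, mean time, diff1 and diff2) can be derived from raw run records; the record format and the excerpt of values are chosen for illustration only.

# Sketch: deriving Table 2.2 summary columns from per-configuration run records.
# Each record maps a configuration to (CID, time in hours); CID is None when no
# crash was found within the 40h timeout. Only an excerpt of the data is used.
runs_dh  = {"A": (1, 21.2), "B": (None, 40.0), "D": (8, 3.3)}
runs_bxt = {"A": (6, 33.8), "B": (6, 6.2),     "D": (8, 4.4)}

def effectiveness(runs):
    """Fraction of configurations in which a crash (any CID) was detected."""
    return sum(cid is not None for cid, _ in runs.values()) / len(runs)

def mean_time(runs):
    return sum(t for _, t in runs.values()) / len(runs)

def diffs(dh, bxt):
    """diff1: BxT time minus DH time per config; diff2: only when both crashed."""
    d1 = {c: round(bxt[c][1] - dh[c][1], 1) for c in dh}
    d2 = {c: d for c, d in d1.items() if dh[c][0] is not None and bxt[c][0] is not None}
    return d1, d2

print(effectiveness(runs_dh), effectiveness(runs_bxt))
print(mean_time(runs_dh), mean_time(runs_bxt))
print(diffs(runs_dh, runs_bxt))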

2.3.3 Random Data and Sequence in BxT

In this section we evaluate the effect of randomization on BxT. That is, we observe both data and sequence generation in BxT. We evaluate the effect of using different random seeds as input to BxT with and without the Driven parameter. We show below the impact that the use of different random seeds has on the effectiveness of BxT.

Setup. We ran BxT with and without the Driven parameter 10 times for each configuration (5 runs in each mode) with different random seeds. The use of different seeds leads to the generation of different sequences of events and data. With this experiment we want to observe the variance of the technique for distinct seed selections. Figure 2.3 shows the distribution of execution time (in hours) and Table 2.3 shows the detailed data for all configurations. As we can observe, a small variation in the input of BxT seems to have a high impact on the time and also on the capability to find bugs.

Figure 2.3 BxT time distributions for random data and sequence (box plots of the time, in hours, per configuration A-H, for the BxT and Driven BxT distributions).

BxT without the Driven parameter (each cell shows CID / time in hours; "-" means no crash within the 40h timeout):

Config.   Run 1       Run 2       Run 3       Run 4       Run 5       avg.
A         6 / 14.6    1 / 1.4     - / 40.0    11 / 12.9   6 / 7.8     80% / 15.3
B         - / 40.0    - / 40.0    4 / 39.9    - / 40.0    - / 40.0    20% / 39.9
C         - / 40.0    - / 40.0    1 / 7.1     12 / 32.4   1 / 35.2    60% / 30.9
D         7 / 1.2     7 / 1.4     7 / 1.3     7 / 1.3     7 / 1.3     100% / 1.3
E         5 / 0.5     5 / 0.5     5 / 0.5     5 / 0.4     5 / 0.5     100% / 0.5
F         - / 40.0    - / 40.0    - / 40.0    - / 40.0    - / 40.0    0% / 40.0
G         6 / 8.0     6 / 10.0    6 / 10.6    6 / 10.5    6 / 10.7    100% / 9.9
H         4 / 12.0    4 / 15.4    5 / 1.2     5 / 4.5     10 / 0.8    100% / 6.8

BxT with the Driven parameter:

Config.   Run 1       Run 2       Run 3       Run 4       Run 5       avg.
A         1 / 21.5    5 / 9.1     6 / 22.3    1 / 18.4    - / 40.0    80% / 22.3
B         4 / 10.4    6 / 8.1     1 / 7.3     1 / 13.0    12 / 9.3    100% / 9.6
C         1 / 24.1    - / 40.0    6 / 8.9     1 / 15.1    - / 40.0    60% / 25.6
D         7 / 7.5     7 / 7.6     7 / 10.4    7 / 1.1     7 / 1.1     100% / 5.5
E         5 / 2.7     5 / 1.8     5 / 0.3     5 / 0.3     5 / 1.8     100% / 1.4
F         1 / 4.0     - / 40.0    4 / 30.1    1 / 14.2    - / 40.0    60% / 25.7
G         6 / 1.7     6 / 1.0     6 / 0.9     6 / 0.3     6 / 0.3     100% / 0.8
H         5 / 24.4    1 / 3.4     4 / 11.3    1 / 1.1     4 / 1.9     100% / 8.4

Table 2.3 BxT distribution.

Results. We list next the key observations:

• Effectiveness. Although BxT without Driven misses crashes in some executions, it can find crashes consistently. The mean effectiveness was high (69%).

• Variance. The standard deviation of BxT with and without Driven is high for configurations A, C and H but low for the other configurations. It is likely that, for those cases, the fault density is low relative to the other configurations and the selection plays an important role. In these cases, the kinds of failures found also vary more than in configurations with less dispersion.

• For configurations D, E and G, the random seed has a smaller impact on the time to find a crash.

• In Config. F no crash was detected in all 10 executions, which indicates that the crash may be in a specific application of the phone and that no sequence and/or data was generated to reach the crash within the established timeout.

• Observing Config. B, which finds crashes in only a few executions, we believe that the timeout is not enough and that the crash is not easily reached by the BxT exploration method.

2.3.4 Exploration Improvements

We conducted another experiment to check whether BxT, using a more uniform exploration, can find crashes even faster. The insight is that it is not worthwhile to explore regions without bugs for too long. Our null and alternative hypotheses are:

• H0: there is no correlation between the time to find a crash and the dispersion of screens.

• Ha: there is a correlation between the time to find a crash and the dispersion of screens.

We ran each of the 8 phone configurations 5 times with different seeds and measured two variables of interest: (i) the dispersion of screens, and (ii) the time to find a bug. For dispersion, we count how many times each screen is visited (in a single exploration) and calculate the standard deviation of these counters. Figure 2.4 shows the scatter plot with the points relating these two variables of interest. The linear regression line captures the tendency (red line), and the mean is shown as well (green line). Linear regression attempts to model the relationship between the time to find a crash and the standard deviation of the screen dispersion. We consider the time to find a crash as the explanatory variable and the standard deviation as the dependent variable. Thus, we want to relate the time to find a crash with the dispersion of screens using a linear regression model. We observe that the regression line has an increasing trend. That is, there is an association between the proposed explanatory and dependent variables (that is, the time to find a crash and the standard deviation of the screen dispersion).

Figure 2.4 Correlation between dispersion of screens and capability to find errors.
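The sketch below illustrates the two computations behind Figure 2.4: the dispersion of screens as the standard deviation of per-screen visit counters, and an ordinary least-squares line with the time to find a crash as the explanatory variable. The data points are invented for the example; they are not the measurements from the experiments.

import math

def dispersion(visit_counts):
    """Standard deviation of the per-screen visit counters of one exploration."""
    n = len(visit_counts)
    mean = sum(visit_counts) / n
    return math.sqrt(sum((c - mean) ** 2 for c in visit_counts) / n)

def fit_line(xs, ys):
    """Least-squares slope and intercept (x: time to crash, y: dispersion)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
            sum((x - mx) ** 2 for x in xs)
    return slope, my - slope * mx

# Hypothetical runs: time to find a crash (hours) and per-screen visit counters.
times = [1.3, 4.4, 9.4, 14.8, 21.2]
counters = [[5, 5, 6], [9, 2, 4], [12, 1, 3], [20, 2, 1], [30, 1, 2]]
stds = [dispersion(c) for c in counters]
print(fit_line(times, stds))   # a positive slope mirrors the increasing trend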
