

CARLOS HENRIQUE ANDRADE COSTA

DYNAMIC METHODOLOGY FOR OPTIMIZATION EFFECTIVENESS EVALUATION AND VALUE LOCALITY EXPLOITATION

Thesis presented to the Escola Politécnica da Universidade de São Paulo in partial fulfillment of the requirements for the degree of Doctor of Engineering. Concentration area: Digital Systems. Advisor: Prof. Dr. Paulo S. L. M. Barreto.

São Paulo, 2012

To my parents, who taught me. To my wife, who supported me. To my friends, who believed in me.

Acknowledgements

I would like first to thank my advisor, Prof. Dr. Paulo S. L. M. Barreto, who gave me the opportunity and support to pursue my research interests with independence. I also thank my collaborators, and informal advisors in a sense, Dr. José E. Moreira (IBM) and Prof. Dr. David Padua (University of Illinois at Urbana-Champaign), with whom I worked during the part of this research conducted at the IBM T. J. Watson Research Center, whose suggestions motivated me to pursue the core ideas in this thesis, and whose guidance, support, and encouragement have been invaluable. Without them all, this thesis would not have been possible.

I am grateful to many people at the IBM T. J. Watson Research Center and at the University of São Paulo, places where I had the opportunity and the environment to conduct research and to collaborate with amazingly talented people. At IBM, I would like to thank Dr. Robert Wisnieski for being such a good manager during my time with the Exascale System Software Group and for making this joint research possible. I thank José Brunheroto, with whom I worked developing the full-system simulator (Mambo) for BlueGene/Q, for sharing his vast knowledge of the simulator's internals, for helping me understand how this tool could be used in the development of the present work, and for the long and always interesting hours of conversation that shed new light on a wide spectrum of subjects. I also thank the people I worked with, directly or indirectly, who helped make my period at T. J. Watson truly enjoyable: Roberto Gioiosa (now with Pacific Northwest National Laboratory), Alejandro Rico-Carro (now with Barcelona Supercomputing Center, Spain), Jamin Naghmouchi (now with T.U. Braunschweig, Germany), Daniele P. Scarpazza (now with D. E. Shaw Research), Chen-Yong Cher, George Almasi, and many others.

At the University of São Paulo, my deepest gratitude goes to Prof. Tereza C. M. B. Carvalho, who believed in me and gave me a life-changing opportunity. Her constant willingness to help and guide made me a better researcher and person. I also thank all my talented fellow graduate students and researchers, Charles Miers (now with Universidade do Estado de Santa Catarina), Rony R. Sakuragui (now with Scopus Tecnologia), Fernando Redigolo, Marcelo C. Amaral, and Diego S. Gallo, with whom I worked or collaborated in multiple research projects, and many others who, even indirectly, inspired me. I thank in particular Guilherme C. Januário, a multi-talented graduate student who made valuable suggestions for optimizing the code written in this work.

Finally, I am grateful to my parents, who provided me an amazing environment in which to grow and pursue my true interests, and who taught me everything I know that really matters in life. Most of all I would like to thank my wife, Emanuele P. N. Costa, whose unending patience, love, and support through all the long working hours required for this work have made it possible. Thanks for making life worth living!

Resumo

The performance of software depends on the multiple code optimizations performed by modern compilers to remove redundant computation. The identification of redundant computation is, in general, undecidable at compile time, which prevents one from obtaining an ideal reference case for measuring the unexploited potential of remaining redundancy removal and for evaluating code optimization effectiveness. This work presents a set of methods for analyzing code optimization effectiveness by observing the complete set of dynamically executed instructions and memory references over a whole program execution. This is done by developing a dynamic value numbering algorithm and applying it as instructions are executed. The method reduces interprocedural analysis to the analysis of one large basic block and detects redundant memory and scalar operations that are visible only at run time. In this way, the work extends instruction-reuse analysis and offers both a more accurate approximation of the upper bound of exploitable optimization within a program and a reference point for evaluating the effectiveness of an optimization. The method also provides a clear picture of unexploited redundancy hotspots and a measure of value locality within a whole program execution. A framework implementing the method and integrating it with a full-system simulator based on the 64-bit Power ISA (version 2.06) is developed. A case study presents the results of applying this method to executables of a representative benchmark (SPECInt2006) built at each optimization level of the GNU C/C++ compiler. The proposed analysis yields a practical evaluation of code optimization effectiveness that reveals a significant amount of remaining unexploited redundancy, even when the highest available optimization level is used. Sources of inefficiency are identified through the evaluation of hotspots and value locality, information that proves useful for compiler and application tuning. The work also presents an efficient mechanism for exploiting hardware support for redundancy elimination.

Keywords: code optimization, dynamic analysis, value numbering, optimization effectiveness.

Abstract

Software performance relies on multiple optimization techniques applied by modern compilers to remove redundant computation. The identification of redundant computation is in general undecidable at compile time, which prevents one from obtaining an ideal reference for measuring the remaining unexploited potential of redundancy removal and for evaluating code optimization effectiveness. This work presents a methodology for optimization effectiveness analysis that observes the complete dynamic stream of executed instructions and memory references over the whole program execution, developing and applying a dynamic value numbering algorithm as instructions are executed. This method reduces interprocedural analysis to the analysis of one large basic block and detects redundant memory and scalar operations that are visible only at run time. In this way, the work extends instruction-reuse analysis and provides both a more accurate approximation of the upper bound of exploitable optimization in the program and a reference point to evaluate optimization effectiveness. The method also generates a clear picture of unexploited redundancy hotspots and a measure of value locality in the whole application execution. A framework that implements the method and integrates it with a full-system simulator based on the 64-bit Power ISA (version 2.06) is developed. A case study presents the results of applying this method to representative benchmark (SPECInt 2006) executables generated at various optimization levels of the GNU C/C++ compiler. The proposed analysis yields a practical evaluation that reveals a significant amount of remaining unexploited redundancies, present even when the highest available optimization level is used. Sources of inefficiency are identified with an evaluation of hotspots and value locality, information that is useful for compiler and application tuning. The thesis also shows an efficient mechanism to exploit hardware support for redundancy elimination.

Keywords: code optimization, dynamic analysis, value numbering, optimization effectiveness.

List of Figures

1. Example of unexploited optimization
2. Optimizing compiler structure
3. Source code and multiple levels of IR
4. Intermediate code representation with a DAG
5. Simple C code translated into SSA form
6. Local Value Numbering (LVN) operation
7. Steps to build a DAG applying local value numbering
8. Minimal SSA form from an intermediate code
9. Value Graph
10. Steps to perform register allocation with K-coloring
11. Example of interference graph
12. Graph coloring heuristic
13. Relationship between redundancy categories
14. Example of profile-guided transformation
15. An example of control flow profiles
16. An example of value profiles
17. An example of address profiles
18. Profile-guided optimizing compiler
19. Integrating a Reuse Buffer (RB) with the pipeline
20. Instruction Reuse (IR): scheme S_v
21. Instruction Reuse (IR): scheme S_n
22. Instruction Reuse (IR): scheme S_n+v
23. Value Prediction Unit
24. Classification of redundancy elimination techniques
25. Registers and value numbers mapping
26. Value numbers and accesses tables
27. Steps building a DAG with the dynamic value numbering algorithm
28. Operation of the algorithm to identify hotspots
29. Representation of the algorithm to detect unnecessary spills
30. IBM Full-System Simulator architecture
31. Statistics data collection with the IBM Full-System Simulator
32. Redundancy evaluation scheme
33. Approach for implementation of memory accesses with different sizes
34. Approach for creating an entry in the value number hash table
35. Redundancy evaluation scheme with validation support
36. Redundancy evaluation scheme with validation and reuse support
37. Dendrogram showing similarity between CINT2006 programs
38. Instruction distribution per category set
39. Redundant load instructions, normalized
40. Redundant store instructions, normalized
41. Redundant arithmetic instructions, normalized
42. Redundant arithmetic instructions per type, normalized
43. Complete redundancy detection statistics, normalized
44. Value locality identification: 400.perlbench
45. Value locality identification: 401.bzip2
46. Value locality identification: 429.mcf
47. Value locality identification: 445.gobmk
48. Value locality identification: 458.sjeng
49. Value locality identification: 462.libquantum
50. Value locality identification: 471.omnetpp
51. Redundancy detection with constrained dynamic value numbering
52. Redundancy detection for load instructions with constrained dynamic value numbering
53. Redundancy detection for arithmetic instructions with constrained DVN
54. Compiler audition based on dynamic value numbering: 400.perlbench
55. Compiler audition: 401, 429 and 445
56. Compiler audition: 458, 462 and 471
57. Prediction accuracy for top-128 most redundant instructions
58. Integer memory access instructions
59. Integer arithmetic instructions
60. Integer logical instructions
61. Integer rotate instructions
62. Integer compare instructions
63. Integer load and reserve / store conditional instructions

List of Tables

1. GCC optimization levels
2. SPECInt 2006 application set
3. Representative subset for SPECInt 2006
4. SPECInt 2006 benchmarks analyzed
5. Redundant load instructions
6. Redundant store instructions
7. Total redundant arithmetic instructions
8. Detailed redundant arithmetic instructions
9. Complete statistics of redundancy detection
10. Ratio between most redundant instruction and average
11. Total redundancy detection with dynamic value numbering
12. Load redundancy detection with dynamic value numbering
13. Arithmetic redundancy detection with dynamic value numbering
14. Constrained DVN
15. Prediction accuracy for top-128 most redundant instructions

List of Acronyms

AST    Abstract Syntax Tree
CFG    Control-Flow Graph
CFP    Control-Flow Profiles
CP     Copy Propagation
CSE    Common Subexpression Elimination
CT     Classification Table
DAG    Directed Acyclic Graph
DCE    Dead-Code Elimination
DFCM   Differential Finite Context Method
DIR    Dynamic Instruction Reuse
DSE    Dead Store Elimination
DVN    Dynamic Value Numbering
DVP    Data Value Prediction
FCM    Finite Context Method
GCC    GNU Compiler Collection
GSA    Gated Single Assignment
GVN    Global Value Numbering
IG     Interference Graph
IPC    Instructions per Cycle
IR     Instruction Reuse
LFU    Least Frequently Used
LV     Last Value
LVN    Local Value Numbering
PC     Program Counter
PDG    Program Dependence Graph
PGO    Profile-Guided Optimization
PRE    Partial Redundancy Elimination
RB     Reuse Buffer
RST    Register Source Table
SPEC   Standard Performance Evaluation Corporation
SSA    Static Single Assignment
STR    Stride Value
TNV    Top-n-Values Table
VG     Value Graph
VN     Value Numbering
VPT    Value Prediction Table
WPP    Whole Program Paths

Contents

List of Figures
List of Tables
List of Acronyms

1 Introduction
  1.1 Motivation
  1.2 Problem Statement
  1.3 Thesis
    1.3.1 Contributions summary
    1.3.2 Outline

2 Compile-time Optimization
  2.1 Preliminaries
    2.1.1 Intermediate Representation (IR)
  2.2 Code Optimization
    2.2.1 Flow-based Analysis
    2.2.2 Static Single Assignment (SSA)
  2.3 Value Numbering
    2.3.1 Local Value Numbering (LVN)
    2.3.2 Global Value Numbering (GVN)
    2.3.3 Partial Redundancy Elimination (PRE)
  2.4 Register Allocation
    2.4.1 Variable Liveness Analysis
    2.4.2 Register Allocation by Graph-Coloring

3 Dynamic Identification of Redundant Computation
  3.1 Limitations of Static Analysis
  3.2 Profile Guided Optimization (PGO)
  3.3 Dynamic Limit Studies
  3.4 Hardware Support for Redundancy Elimination
    3.4.1 Dynamic Instruction Reuse
    3.4.2 Data Value Prediction (DVP)

4 Run-time Optimization Effectiveness Analysis
  4.1 Motivation
  4.2 Dynamic Value Numbering Algorithm (DVN)
    4.2.1 Redundant Memory Operation
      4.2.1.1 Redundant Load
      4.2.1.2 Redundant Store
    4.2.2 Redundant Common-subexpression
    4.2.3 Unsupported Instructions
  4.3 Value-based Hotspot and h-index
  4.4 Dynamic Value Numbering and Unnecessary Spill
  4.5 Discussion

5 Experimental Evaluation and Optimization Framework
  5.1 IBM Full-System Simulator (Mambo)
    5.1.1 Analysis Facilities
  5.2 Instruction Set Architecture (ISA)
  5.3 Dynamic Value Numbering Algorithm Implementation
    5.3.1 Validation and Instruction Reuse Support

6 Case Study
  6.1 Evaluation Target
    6.1.1 GNU Compiler Collection (GCC)
    6.1.2 SPEC CPU2006 Benchmark Suite
      6.1.2.1 Subsetting SPEC
      6.1.2.2 Reduced Input Set
  6.2 Effectiveness Evaluation Results
    6.2.1 Memory operations
      6.2.1.1 Load instructions
    6.2.2 Store instructions
    6.2.3 Arithmetic instructions
    6.2.4 Discussion
  6.3 Value Locality Results
    6.3.1 Value Locality Graphs
      6.3.1.1 Cluster occurrence
      6.3.1.2 Optimization Effect
      6.3.1.3 H-index
  6.4 Instruction Reuse
    6.4.1 Discussion
  6.5 Performance Considerations
  6.6 Compiler Audition
  6.7 Value Prediction
  6.8 Discussion on the Replicability of the Study

7 Conclusions
  7.1 Summary of contributions and Future Work

References

Appendix A -- SPEC CPU2006 Benchmark Suite
  A.1 400.perlbench
  A.2 401.bzip2
  A.3 429.mcf
  A.4 445.gobmk
  A.5 458.sjeng
  A.6 462.libquantum
  A.7 471.omnetpp

Appendix B -- Power ISA decoding approach
  B.1 Fixed-point facility
    B.1.1 Memory Operation
    B.1.2 Arithmetic Instructions

Index

1 Introduction

1.1 Motivation

Computer systems relying on software are at the core of modern society. Software-based tools have never been as ubiquitous and powerful as they are today; it is no exaggeration to say they are a crucial part of the infrastructure that undergirds the global economy. The ability to produce highly specialized software algorithms relies strongly on the level of abstraction that modern high-level programming languages provide, allowing programmers to implement sophisticated algorithms with decreasing effort and decreasing knowledge of the machine's low-level implementation. Compiler technology, the central piece ultimately responsible for translating a high-level language into a machine-readable sequence of bits, is, along with advances in programming languages, at the core of today's complex software infrastructure. The high-performance computing level of two decades ago is now achieved by a handheld, battery-powered device (MARKOFF, 2011), and it is safe to say that compilers and high-level languages have played a role in this improvement as significant as that of semiconductor technology. It has been shown that improvements in software algorithms outpace the gains attributable to faster hardware: the time to complete some reference performance benchmarks improved by a factor of 43 million in 15 years, of which a factor of 43,000 is due to improvements in the efficiency of software algorithms (USA, 2010).

Software performance has historically relied on code optimization performed by compilers. In order to increase program efficiency by restructuring code to simplify instruction sequences and take advantage of machine-specific features, modern optimizing compilers for

general-purpose processors deploy several optimization techniques, such as Static Single Assignment (SSA)-based optimization, pointer analysis, profile-guided optimization, link-time cross-module optimization, automatic vectorization, and just-in-time compilation.

New compilation challenges arise as programming languages and computer architectures evolve. As hardware complexity increases, the development of smarter software able to take advantage of the improvements in hardware has never been as important as it is now. Compiler technology is a key component for tackling new software-hardware interaction challenges. The current wide availability of multicore processors offers a variety of resources of which the compiler designer must take advantage, such as multicore programming models and optimization and code generation for explicitly parallel programs, to name a few. It is also expected that research in compilers will provide new and more sophisticated tools for optimization tuning, exposing performance improvements left unexploited by existing compiler technology (ADVE, 2009). In this respect, compilers are expected to guide the adoption of speculative optimizations in order to compensate for the constraints imposed by conservative static analysis and to take advantage of recent architecture improvements that provide new ways to exploit hardware support for more powerful and nontraditional optimization.

1.2 Problem Statement

The main problem addressed in this work is the identification of suboptimal code produced by an optimizing compiler, measuring the compiler's effectiveness through the identification of opportunities missed in static analysis and code generation, and identifying approaches to take advantage of unexploited opportunities through software tuning and existing hardware support for redundancy removal. Generated compiled code is said to be suboptimal if the same effect on the program's state can be obtained by a shorter or faster sequence of instructions. The optimization effectiveness of a given program can be measured in terms of the amount of redundant computation executed, where an operation repeatedly performs the same computation because it sees the same operands. From a compiler-hardware interaction point of view, redundant computation has been related to the concept of value locality

(LEPAK; LIPASTI, 2000), defined as the likelihood of a previously seen value recurring repeatedly as the output of the same instruction. The traditional approach to eliminating redundant computation is compile-time optimization based on static analysis. Effectiveness in the elimination of redundant computation translates into performance improvement and a decrease in the power consumed by the processor during the program's execution. Common obstacles that limit the effectiveness of compile-time code optimization are listed below; an illustration of the third obstacle follows the list.

1. Compile-time limitations: limited information at compile time often prevents static analysis from distinguishing redundant dynamic instances of static instructions that are visible only at run time (COOPER; LU, 1997; COOPER; ECKHARDT; KENNEDY, 2008);

2. Analysis boundaries and limited scope: most deployed optimization techniques are limited to procedure boundaries, which inhibits their effectiveness and leaves unexploited the benefits to be gained from optimizing across procedure boundaries. In addition, the use of shared libraries, dynamic class loading, and run-time binding complicates the compiler's ability to statically analyze and optimize programs (BRUENING; GARNETT; AMARASINGHE, 2003);

3. Conservative program analysis: for correctness, compilers do not perform some optimizations because a necessary requirement cannot be guaranteed at compile time. In some cases, program analysis can be improved by identifying specific cases that can be precisely analyzed and optimized, but in general static analysis must speculate conservatively on the program's dynamic behavior (COOPER; XU, 2003);

4. Limited heuristic algorithms: since the identification of redundant computation is in general undecidable (RAMALINGAM, 1994), heuristic algorithms have to be applied, and they do not always produce satisfactory results. A good example is graph-coloring register allocation, which addresses an NP-complete problem for which a precise and efficient solution is not possible. A heuristic's performance is typically evaluated through its behavior on real programs and workloads;

5. Unpredictable optimization interactions: multiple and concurrent optimization methods are heavily deployed in modern compilation. In some cases they can interact with one another and with the system architecture in unpredictable ways, generating suboptimal code.
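As a minimal illustration of the third obstacle, consider the following hypothetical C fragment (an assumption for illustration, not an example from the thesis). Because the compiler cannot prove that the two pointers never alias, it must conservatively reload *src after the first store:

    /* Hypothetical example of conservative analysis: dst and src may alias,
     * so the compiler must assume the store through dst can change *src and
     * reload *src for the second addition instead of reusing the value. */
    void accumulate(int *dst, const int *src) {
        *dst += *src;   /* store to *dst may modify *src if dst == src */
        *dst += *src;   /* *src is reloaded: redundant at run time
                           whenever dst and src do not actually alias */
    }

At run time, most calls may pass non-aliasing pointers, making the reload dynamically redundant even though it is statically required.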

Dynamic identification and exploitation of opportunities for redundancy elimination attempts to overcome most of these limitations by providing ways to optimize the application at run time based on profiling data (BRUENING; GARNETT; AMARASINGHE, 2003). Profile-Guided Optimization (PGO) uses run-time knowledge collected in a training execution to guide optimization at compile time, for instance prioritizing a frequently executed path through code motion. The performance gains depend on the ability to obtain accurate profiling data and on the overhead incurred by the optimizer to instrument and interact with the executing program. Hardware support has also been proposed as a way to detect and avoid redundant operations, such as instruction reuse buffers (SODANI; SOHI, 1997; SODANI; SOHI, 1998a; SODANI; SOHI, 1998b) and value prediction mechanisms (LIPASTI; WILKERSON; SHEN, 1996; GABBAY; MENDELSON, 1997; GHANDOUR; AKKARY; MASRI, 2012). The improvements achievable through these methods are tightly bound to the mechanism used to populate the reuse buffer and the prediction table; an understanding of the sources of redundant computation plays a crucial role in extracting the most out of these techniques.
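As a rough sketch of the reuse-buffer idea (an illustration only; the actual schemes of Sodani and Sohi are reviewed in Chapter 3, and the table layout and names below are assumptions): an instruction's program counter and source operand values index a table, and a hit means the cached result can be reused instead of re-executing the instruction.

    #include <cstddef>
    #include <cstdint>
    #include <unordered_map>

    // Hypothetical value-based reuse buffer: maps (pc, operand values) to a
    // previously produced result. Real hardware uses a small set-associative
    // table; an unordered_map keeps the sketch short.
    struct ReuseKey {
        uint64_t pc, op1, op2;
        bool operator==(const ReuseKey& o) const {
            return pc == o.pc && op1 == o.op1 && op2 == o.op2;
        }
    };
    struct ReuseKeyHash {
        size_t operator()(const ReuseKey& k) const {
            return k.pc ^ (k.op1 * 0x9e3779b97f4a7c15ULL) ^ (k.op2 << 1);
        }
    };

    class ReuseBuffer {
        std::unordered_map<ReuseKey, uint64_t, ReuseKeyHash> table;
    public:
        // True on a hit; the cached result is written out and the
        // instruction need not be executed again.
        bool lookup(uint64_t pc, uint64_t op1, uint64_t op2,
                    uint64_t& result) const {
            auto it = table.find({pc, op1, op2});
            if (it == table.end()) return false;
            result = it->second;
            return true;
        }
        // Called after a normal execution to record the outcome.
        void update(uint64_t pc, uint64_t op1, uint64_t op2, uint64_t result) {
            table[{pc, op1, op2}] = result;
        }
    };

The quality of whatever policy populates such a table is exactly what ties these mechanisms to an accurate picture of where redundancy occurs.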

(22) 22. 1.2 Problem Statement. object dump. #define exte 200 #define inte 19 int main(){ int i,t,a[inte]; for(t=0,a[0]=4; t<exte;t++) for(i=1;i<inte;i++) a[i]= a[i-1] * 3; return a[3]*a[4]; }. GCC 4.3.2 -O3. PowerISA 2.06. 1000350 1000354 1000358 100035c 1000360 1000364 1000368 100036c 1000370 1000374 1000378 100037c 1000380 1000384 1000388 100038c 1000390 1000394 1000398 100039c 10003a0 10003a4 10003a8 10003ac 10003b0 10003b4 10003b8 10003bc. .... li li stw addi addi nop nop nop lwa mr nop nop rlwinm add extsw stw addi cmpd bne addi clrldi cmpwi bne lwz lwz mullw extsw blr. r0,4 r8,0 r0,-112(r1) r10,r1,-36 r7,r1,-108. r11,-112(r1) r9,r7. r0,r11,1,0,3 r0,r0,r11 r11,r0 r11,0(r9) r9,r9,4 cr7,r9,r10 cr7,1000380 r0,r8,1 r8,r0,32 cr7,r8,200 cr7,1000370 r0,-100(r1) r3,-96(r1) r3,r3,r0 r3,r3. .... Fig. 1: Example of unexploited optimization: redundant loop highlighted. with the same values being stored on the very same variables whereon they were previously stored. One would expect a highly optimized compilation when optimization is enabled. Such optimized executable should reduce the outer loop’s steps to only one, or, more drastically, to thoroughly remove the loop structures, replacing the returned multiplication by its foreseeable result. This is not what is seen for GCC (behavior observed in version 4.3 for PowerISA and x86_64). The outer loop still dwells on the object code, as shown in Figure 1 (instructions 100039c – 1000a8), and will be run redundantly as many times as the programmer unawares had prescribed. The compiler should be able to eliminate such instruction redundancies, but this simple scenario is sufficient to present a situation for which an unexpected interaction among optimizations is experienced. Even at its highest optimization level, GCC still emits a significant number of redundant instructions within a basic block boundaries, resulting in multiple redundant executions of the same set of instructions. If this is so for this scenario, situation could worsen for interprocedural environments. This example shows that even simple code compiled at high optimization level can exhibit.

significant unexploited optimization potential. The question is whether, and how often, such situations occur in the optimized code of real-world programs. The identification of redundant computation by analyzing executed instructions is a way to detect missed opportunities for optimization. The approach can be used as a reference for measuring how effectively optimization has been performed. It can also be used to indicate hotspots for compiler audition or tuning, and instructions and values to be targeted by hardware support for redundancy elimination.

1.3 Thesis

This work focuses on the development of a methodology, based on a dynamic limit study, for identifying and measuring the total redundancy exhibited at run time over a whole application execution. The approach yields a practical and arguably precise approximation to the upper bound of exploitable redundancies, as well as a reference point for optimization effectiveness evaluation based on the remaining unexploited redundancy potential. The methodology relies on the development and implementation of a dynamic approach, embodied in a framework, to detect redundant memory operations and dynamic instances of arithmetic instructions that are certain to produce the same known result as in a previous execution. The information produced by this approach is used to provide an accurate picture of hotspots and value locality, which can be used to correlate instruction redundancy to source-level constructs and to indicate the most profitable instructions and produced results to be targeted by hardware-based optimization.

The methodology is based on the observation of the stream of memory references and executed instructions. The objective is to find an approximation to all instruction redundancies visible at run time, along with the identification of the exact redundant instances, using a real-time execution stream as input. In order to detect redundancies, a novel dynamic variant of the Local Value Numbering algorithm (COCKE, 1969; COOPER; ECKHARDT; KENNEDY, 2008) is developed, which discovers redundancies and folds constants.

The proposed Dynamic Value Numbering (DVN) algorithm is designed to be applied over a stream of instruction execution and, for this reason, is inherently interprocedural. The method reduces the difficulties encountered in value numbering of extended basic blocks (ALPERN; WEGMAN; ZADECK, 1988a) to the analysis of one large basic block. In this way, the method allows the identification of suboptimal code related to dynamically linked libraries that would not be available at compile time.

The work presents a practical framework to validate the method. It relies on the IBM full-system simulator for an architecture implementing Power ISA (2.06) to execute optimized compiled code. The simulator is integrated with an implementation of the proposed method that observes, and can interfere with, each instruction executed by the simulator. The implementation relies on an approach that efficiently handles the large tables required when applying a value numbering-like algorithm to an execution trace.

The work also presents a case study of optimization effectiveness and value locality evaluation and exploitation, for which a reference suite of applications was used. For each application, multiple executables were generated using different compiler optimization levels. The GNU C/C++ compiler, version 4.3.2, was chosen as the optimizing compiler. The goal is to measure the occurrence frequency of, and to identify, redundant computations that can be detected using the proposed method but were not detected in GCC's static analysis. The study consists of an upper-limit analysis, and therefore also exposes instruction redundancy that cannot necessarily be exploited statically. The study shows how the identified hotspots can be used to understand the sources of inefficiency in the optimization and to provide an evaluation of value locality. How this information can be used in hardware-based redundancy elimination, and the practical gains achievable, are also discussed.

The main purpose of this work is to provide a framework for optimization effectiveness analysis that indicates new ways to extend dynamic optimization to static languages, and shows how hardware- and software-based mechanisms could benefit from the increasing support for optimization.
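To make the idea concrete before the full development in Chapter 4, the following is a minimal sketch of value numbering applied over a dynamic instruction stream; the data structures, names, and redundancy criteria shown here are illustrative assumptions, not the thesis's implementation.

    #include <cstdint>
    #include <string>
    #include <unordered_map>

    // Illustrative sketch of dynamic value numbering over an execution
    // stream: registers and memory addresses are mapped to value numbers
    // (VNs), and an instruction whose (opcode, operand VNs) tuple has been
    // seen before is flagged as redundant.
    class DynamicValueNumbering {
        std::unordered_map<uint32_t, uint64_t> regVN;     // register -> VN
        std::unordered_map<uint64_t, uint64_t> memVN;     // address  -> VN
        std::unordered_map<std::string, uint64_t> exprVN; // "op,vn,vn" -> VN
        uint64_t nextVN = 0;

        uint64_t vnOfReg(uint32_t r) {
            auto it = regVN.find(r);
            if (it != regVN.end()) return it->second;
            return regVN[r] = nextVN++;     // first use gets a fresh VN
        }
    public:
        // Observe an executed ALU instruction "rd = op(ra, rb)"; true means
        // the result is certain to repeat a previously produced one.
        bool observeAlu(const std::string& op, uint32_t rd,
                        uint32_t ra, uint32_t rb) {
            std::string key = op + "," + std::to_string(vnOfReg(ra)) + ","
                                 + std::to_string(vnOfReg(rb));
            auto it = exprVN.find(key);
            bool redundant = (it != exprVN.end());
            regVN[rd] = redundant ? it->second : (exprVN[key] = nextVN++);
            return redundant;
        }
        // A load is redundant if the destination register already holds the
        // value number of the loaded location.
        bool observeLoad(uint32_t rd, uint64_t addr) {
            auto it = memVN.find(addr);
            uint64_t vn = (it != memVN.end()) ? it->second
                                              : (memVN[addr] = nextVN++);
            bool redundant = (regVN.count(rd) && regVN[rd] == vn);
            regVN[rd] = vn;
            return redundant;
        }
        // A store is redundant (silent) if the location already holds the value.
        bool observeStore(uint32_t rs, uint64_t addr) {
            uint64_t vn = vnOfReg(rs);
            auto it = memVN.find(addr);
            bool redundant = (it != memVN.end() && it->second == vn);
            memVN[addr] = vn;
            return redundant;
        }
    };

Because the tables are keyed by value numbers rather than lexical names, the same machinery applies across procedure boundaries: the whole execution is treated as one large basic block.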

1.3.1 Contributions summary

The main contributions of this work can be summarized as follows:

• A novel approach to construct and manipulate the representation of a program from run-time information for a whole application execution (Section 4.1);

• The proposition of a dynamic method to evaluate optimization effectiveness through the Dynamic Value Numbering (DVN) algorithm. The method identifies the amount of redundant instructions found in a whole application execution and the individual redundancy instances, including the frequency of the redundantly produced results (Section 4.2);

• The development and validation of an experimental framework to evaluate optimization effectiveness, based on an efficient implementation of dynamic value numbering. The framework integrates the method's implementation with a full-system simulator, yielding practical analysis and validating the approach through the avoidance of all redundant instructions identified (Chapter 5);

• A case study exposing the unexploited opportunities in reference applications (SPECInt 2006) optimized with GCC, objectively evaluating the effectiveness of the optimizations applied (Section 6.2);

• The identification of hotspots through a practical evaluation of value locality and of the occurrence of redundancy clusters in reference applications (SPECInt 2006);

• A demonstration of realistically achievable optimization gains through an instruction reuse scheme based on DVN (Section 6.3);

• A demonstration of how value locality reports produced with DVN can be deployed as a method for compiler audition (Section 6.6) and value prediction (Section 6.7).

1.3.2 Outline

The thesis is organized as follows. Chapter 2 presents background on compile-time optimization, with a review of compilation, code representation, and optimization. The chapter also discusses the role of the intermediate representation, especially the SSA form, and of data-flow analysis in two major static optimization approaches, Local/Global Value Numbering

(27) 27. 2. Compile-time Optimization. A well-designed compiler has as ultimate goal the production of correct object code that runs as fast as possible, where fast means not only the time the program takes to produce the output but includes aspects as the data traffic traversing the memory hierarchy and other aspects related to an efficient use of the architecture resources. These goals in general translate into the aim to delete computations that have been previously performed, and when possible to move computations from their original positions to less frequently executed regions. A modern optimizing compiler relies strongly on static analysis to identify and remove redundant computation. This chapter introduces fundamentals of compilation and optimization at compile-time, reviewing classical methods for static analysis. Code optimization is a vast field, and it is not the intention of this work to review the existing techniques extensively. Rather, the aim is to provide the concepts of the data-flow analysis and a review of two prominent static optimization sets, Value Numbering (VN) and Partial Redundancy Elimination (PRE) from which the methodology to be proposed in this work is derived. The chapter also includes a discussion on the problem of register allocation and liveness analysis.. 2.1. Preliminaries. Compilation is the task of converting programs written in high-level (and desirably highly-abstracted) language into programs in machine language, the binary-encoded instructions that move data from and into memory, transfer data to devices and perform logical and arithmetic operations. It typically involves a lexical analysis, represented by the parsing that.

(28) 28. 2.1 Preliminaries Intermediate Representation Source Code. Scanner. Optimization n. String of tokens. Intermediate Representation. Intermediate Representation. Optimization 3. Parser. Register Allocation. Intermediate Representation. Parse tree Semantic Analyzer. Optimization 2. Optimization 1. Translator. Intermediate Representation. Intermediate Representation. Assembly. Intermediate Representation. Parse tree. Front-end. .... Instr. Selection. Intermediate Representation Object Code. Optimizer. Back-end. Fig. 2: Optimizing compiler structure. identifies tokens in the source code that are members of the high-level vocabulary; a syntactic analysis, the processing of the sequence of tokens producing an intermediate representation of the program; a semantic analysis, the process that determines that the program satisfies the semantic properties defined in the source language, and the code generation, represented by the transformation of an intermediate code into the specific machine code. A well-designed optimizing compiler is expected to produce executable code that is correct and efficient according to a performance metric, usually the time or memory required to complete the computational task. Optimization phases in the compilation process perform transformations in the program in a way to avoid unnecessary computations to be performed by the executable generated. Optimization can take place at any of the known compilation phases. Figure 2 shows the typical structure of an optimizing compiler.. 2.1.1. Intermediate Representation (IR). The optimization scope is typically dependent on the intermediate representation (IR) used by the compiler to manipulate the source code. Common design for the intermediate.

(29) 29. 2.1 Preliminaries. float a[10][20]; a[i][j+2];. (a) C Source Code. t1 t2 t3 t4 t5 t6 t7. = = = = = = =. j + 2 i * 20 t1 + t2 4 * t3 addr a t5 + t4 t7 * t6. t1 = a[i, j+2]. (b) High-level IR: low-level registers manipulation are not exposed to the compiler.. r1 r2 r3 r4 r5 r6 r7 f1. = = = = = = = =. [fp – 4] [r1 + 2] [fp – 8] r3 * 20 r4 + r2 4 * r5 fp - 216 [r7 + r6]. (c) Medium-level IR: general and infinite (d) Low-Level IR: limited and specialized regnumber of register available. isters.. Fig. 3: Source code and multiple levels of intermediate representation. code varies from low-level representations that include architecture dependent features to an architecture-independent representation. Optimizer operates over a low-level intermediate representation to take advantage of architecture-specific features such as special floating point units, hint bits for value speculation, etc, or over a high-level intermediate code to ensure a portable optimized executable. Mixed model implementing multiple passes over a low-level representation followed by a high-level representation is the approach frequently seen. Figure 3 shows possible intermediate representation for the code written in C. The translation of a statement from a high-level language into an intermediate representation involves the representation of the program’s simplest unit of control, basic blocks. They are the portion of the code that have one entry point, meaning that no instruction within their boundaries is the destination of a jump instruction, and only one exit point, meaning that only the last instruction causes the program to execute another basic block. In other words, a maximal length sequence of code without branch. An abstract syntax tree (AST) is typically built by the compiler to perform context-sensitive analysis and high-level optimizations. Three-address code in the form x ← y op z is generated during a walk of the AST. The.

(30) 30. 2.1 Preliminaries g h u z l. = = = = =. x u g k u. + + + +. l. y v h p z. +. z +. k. u +. k0. g. x0. p0. h y. +. x. p. y0. u. -. u0. v v0. Fig. 4: Intermediate code representation using a Directed Acyclic Graph (DAG). three-address code can be divided into basic blocks that have the property that if one instruction executes, they all execute. Basic blocks are organized into a control-flow graph (CFG), a directed graph, G = (N, E) for which each node n ∈ N corresponds to a basic block and each edge e = (ni , n j ) ∈ E corresponds to a possible transfer of control from block ni to block n j . CFGs are a graphical representation of the possible run-time paths of control-flow. The Directed Acyclic Graph (DAG) representation is a more compact representation than either trees or any linear notation as it reuses values. A DAG is the minimal sequence of trees that represents a basic block. Figure 4 shows a set of statements represented in a DAG. The nodes hold expressions or contents and are annotated with variable names (arrows). They are harder to construct and manipulate, but provide savings in space. They also highlight equivalent sections of code and for this reason are useful when studying redundancy elimination. In this case, the needed result is computed once and saved, rather than recomputed multiple times. Code representation has also to deal with the fact that programs are also divided into multiple procedures. Such fact has positive and negative impacts on generating code. Such division positively limits the amount of code to be considered by the compiler at any given time, helping to keep data structures small and to limit the problem size and cost of several.

(31) 2.2 Code Optimization. 31. compile-time algorithms. On the other hand, the division of the program into procedures limits the compiler’s view inside a call. Based on its scope, optimization is usually classified as either intraprocedural or interprocedural. Intraprocedural scope refers to local methods operating over basic blocks, regional methods operating over scopes larger than a single block but smaller than a full procedure, and global methods operating over whole procedures. Interprocedural scope refers to methods that operate over the whole program, i.e., larger than a single procedure (TORCZON; COOPER, 2011).. 2.2. Code Optimization. The optimization of a program involves the elimination and insertion of instructions in order to avoid extra computations. A redundant computation can be defined in two major ways. The first case is the case when two computations match exactly, including the variables names. This equivalence is known as lexical equivalence. A second case occurs when two computations are not lexically equivalent, i.e., represented by different variables, but are certain to compute the same value in any run of the program. This case is known as value equivalence. The analysis that takes into account value equivalences is stronger compared to the ones restricted to lexical equivalences. The optimization techniques can be divided into four main groups related to the level of intermediate code they operate over. In the first level are the optimizations applied to the source code or to a high-level description of the intermediate code that preserves loop structure and sequencing array access in essentially their original form. Second level optimizations are the ones that operate over a medium-level or low-level representation depending on whether a mixed model is being used. The optimizations in third level are architecture-dependent and operate over a low-level intermediate code. They can be related, for instance, to the addressing mode available on the target architecture. At last, optimizations in the fourth level are performed at the linking phase and operate on relocatable object code. The optimizations in the second level represent a large set of techniques that are used in.

compile-time algorithms. On the other hand, the division of the program into procedures limits the compiler's view inside a call. Based on its scope, optimization is usually classified as either intraprocedural or interprocedural. Intraprocedural scope refers to local methods operating over basic blocks, regional methods operating over scopes larger than a single block but smaller than a full procedure, and global methods operating over whole procedures. Interprocedural scope refers to methods that operate over the whole program, i.e., over more than a single procedure (TORCZON; COOPER, 2011).

2.2 Code Optimization

The optimization of a program involves the elimination and insertion of instructions in order to avoid extra computation. A redundant computation can be defined in two major ways. The first occurs when two computations match exactly, including the variable names; this equivalence is known as lexical equivalence. The second occurs when two computations are not lexically equivalent, i.e., they involve different variables, but are certain to compute the same value in any run of the program; this is known as value equivalence. Analyses that take value equivalence into account are stronger than those restricted to lexical equivalence, as the fragment below illustrates.
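A small hypothetical fragment (an illustration, not an example from the thesis) makes the distinction concrete:

    // Hypothetical fragment illustrating lexical vs. value equivalence.
    int equivalences(int b, int c) {
        int a = b + c;
        int d = b + c; // lexically equivalent to the expression above:
                       // same operator, same variable names
        int e = b;
        int f = e + c; // not lexically equivalent to "b + c", but
                       // value-equivalent: e always holds b's value here,
                       // so f equals a in every run of the program
        return a + d + e + f;
    }

A purely lexical analysis catches the recomputation in d but misses the one in f; a value-based analysis catches both.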

The optimization techniques can be divided into four main groups, according to the level of intermediate code over which they operate. At the first level are optimizations applied to the source code or to a high-level description of the intermediate code that preserves loop structure and array access sequencing in essentially their original form. At the second level are optimizations that operate over a medium-level or low-level representation, depending on whether a mixed model is used. Optimizations at the third level are architecture-dependent and operate over a low-level intermediate code; they can be related, for instance, to the addressing modes available on the target architecture. Finally, optimizations at the fourth level are performed at the linking phase and operate on relocatable object code.

The optimizations at the second level represent a large set of techniques used in compilers required to generate code for different architectures. They usually rely on so-called flow-based analysis, such as control-flow analysis and data-flow analysis. Flow analysis is the technique used to determine the multiple possible paths of a program execution and how data can be modified along these paths during execution. The optimizations relying on flow-based analysis perform the task of moving computations to places where they are computed less frequently, without changing the program's semantics. Optimizations at this level are numerous. There are, however, two major techniques of notable importance. Value Numbering (VN) is the group of techniques that analyzes the program to identify computations known to have the same value in a static view of the program; the group includes both Local Value Numbering (LVN) and Global Value Numbering (GVN). Partial Redundancy Elimination (PRE) is the group of techniques that hoist, i.e., anticipate, computations in order to make partially redundant computations fully redundant, allowing their removal.

2.2.1 Flow-based Analysis

Most existing compiler optimization algorithms need to perform control-flow analysis and build data-flow solvers. This section presents the basic concepts of these approaches. Flow-based analysis is the set of techniques used to determine the multiple possible paths of a program execution and how data is modified as the program takes these paths. This approach is at the core of a large set of the most used optimization techniques. Flow-based analysis is essentially divided into control-flow analysis and data-flow analysis. Control-flow analysis is used to characterize how conditionals affect the flow of the program, so that any unused generality can be removed and operations can be replaced by faster ones. The purpose of data-flow analysis is to provide global information on how the program manipulates data. The analysis seeks to determine, for example, whether all the assignments to a particular variable that may provide the value of that variable at some particular point necessarily give it the same constant value; in this case, the variable at that point can be replaced by the constant. Flow-based analysis relies greatly

(34) 2.2 Code Optimization. 34. when it occurs in the source program, being the variables divided into distinct versions. The SSA form makes more effective several kinds of optimization techniques such as value numbering, invariant code motion, constant propagation, and partial-redundancy elimination. The translation of an intermediate representation to the SSA form in order to optimize the program and its translation back into the original form is an approach typically used. Figure 5 shows an example of translation of a simple C code (5(a)) into its SSA equivalent form (5(b)). The SSA form has to deal with the multiple paths a basic block can take and the consequent merging. The merging of variables is represented in the SSA form by the φfunction at the joint points, i.e nodes in the CFG with merging paths (see Figure 5). Each version of the variable merging at a given point represents an argument to the φ-function in the SSA form. The control-flow defines the argument corresponding position. The procedure to create the SSA form starts identifying the joint points where to insert the φ-functions. Each function has the number of argument positions as the number of the control-flow preceding the point where some variable definition reaches. Since the φ-functions are conceptual tools and are not related to any computing operation, they have to be eliminated. The SSA form with a minimal number of φ-functions is known as minimal SSA form. The minimal SSA form can be obtained using the so-called dominance frontiers. In order to obtain the minimal SSA form, it is first generated the program’s Control Flow Graph (CFG), which is a graph with each node representing a basic block. The directed edges in the CFG are used to represent jumps in the control flow. For a CFG node x, the dominance frontier of DF(x), is the set of all nodes y in the graph such that x dominates the immediate predecessor of y, not strictly dominating y:. DF(x) = {y|(∃z ∈ Pred(y) : x dom z) and x !sdom y}. The computation of DF(x) for all x is quadratic in the number of nodes in the graph. The split of two intermediate components, DFlocal (x) and DFup (x, z) is performed in this way:.

DF_local(x) = {y ∈ Succ(x) | idom(y) ≠ x}
DF_up(x, z) = {y ∈ DF(z) | idom(z) = x and idom(y) ≠ x},

where idom(y) denotes the immediate dominator of y, i.e., idom(y) = x holds if and only if x immediately dominates y. DF(x) is then computed as:

DF(x) = DF_local(x) ∪ ⋃_{z : idom(z) = x} DF_up(x, z).

For a set S of graph nodes, the dominance frontier of S is defined as:

DF(S) = ⋃_{x ∈ S} DF(x).

The iterated dominance frontier DF⁺(S) can then be defined as:

DF⁺(S) = lim_{i→∞} DF^i(S),

where DF^1(S) = DF(S) and DF^{i+1}(S) = DF(S ∪ DF^i(S)). If S is defined as the set of nodes that assign values to variable x, including the entry node, then DF⁺(S) is exactly the set of nodes that need φ-functions for x. Many industrial compilers use the SSA form as an intermediate representation: GCC 4.0 (GOUGH, 2005), Sun's HotSpot JVM (KOTZMANN et al., 2008), IBM's Jikes RVM (ALPERN et al., 2005) and LLVM (LATTNER; ADVE, 2004).
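Returning to the computation of dominance frontiers: the quadratic blow-up of the set-based definition can be avoided in practice. A compact formulation popularized by Cooper, Harvey, and Kennedy walks up the dominator tree from each join point. The sketch below (with hypothetical names, assuming immediate dominators have already been computed and that idom[entry] == entry) illustrates the idea:

    #include <vector>

    // Computes dominance frontiers given, for each node, its CFG
    // predecessors and its immediate dominator. Duplicates in the
    // resulting sets are not filtered in this sketch.
    std::vector<std::vector<int>> dominanceFrontiers(
            const std::vector<std::vector<int>>& preds,
            const std::vector<int>& idom) {
        std::vector<std::vector<int>> df(preds.size());
        for (int b = 0; b < (int)preds.size(); ++b) {
            if (preds[b].size() < 2) continue;       // only join points matter
            for (int p : preds[b]) {
                int runner = p;
                while (runner != idom[b]) {          // walk up the dominator tree
                    df[runner].push_back(b);         // b is in DF(runner)
                    runner = idom[runner];
                }
            }
        }
        return df;
    }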

Fig. 6: Local Value Numbering (LVN) operation: common subexpression elimination and dead-code elimination. [Three-column example showing LVN applying CSE and then DCE. Figure not reproduced.]

2.3 Value Numbering

2.3.1 Local Value Numbering (LVN)

Local Value Numbering is a compile-time technique that recognizes redundancy in a basic block among expressions that are lexically different, but which are certain to compute the same value. Traditionally, this is achieved by assigning symbolic names (value numbers) to expressions. If the value numbers of the operands of two expressions and the operators applied by the expressions are identical, then the expressions receive the same value number and are certain to produce the same results. The analysis supports three different optimizations: common subexpression elimination (CSE), copy propagation (CP), and dead-code elimination (DCE) (CLICK, 1995; GULWANI; NECULA, 2007).

The method works by progressing through each statement in sequential order. For each new variable, a distinct value number is assigned. For an assignment expression, an existing value number on the right-hand side is assigned to the left-hand side (see Figure 6). When a new variable, i.e., one that has no existing value number, is found, a new value number is created and propagated to both sides; this procedure implements copy propagation. Each unary or binary expression is analyzed, and when a match is recognized the algorithm searches a history table of value numbers and replaces the right-hand side with the variable associated with the matching value number. This is equivalent to common subexpression elimination.

If the algorithm identifies that two different value numbers are assigned to the same variable and that no intervening use of that variable exists, then the first assignment is marked as dead and removed from the instruction list; this implements dead-code elimination.

The operation of the algorithm is illustrated by the steps shown in Figure 6. The algorithm starts with the original code in the first column, on the left. In the second column, all variables have been assigned value numbers, and common subexpressions are identified and eliminated: the binary expressions c3 = a1 + b2 and f3 = d1 + e2 have the same value-number pattern (vn(3) = vn(1) + vn(2)), hence f3 = d1 + e2 is replaced with f3 = c3. Dead-code elimination is shown in the third column: the two assignment expressions d1 = a1 and d4 = x4 associate different value numbers with the same variable d, and there is no intervening use of d, so the first assignment d1 = a1 is marked dead and removed from the list of instructions.

Hash-based value numbering is one family of LVN methods, originally fully described in (COCKE, 1969). It is the most used pessimistic algorithm operating over basic blocks and is a widely implemented and useful technique. The problem of determining whether two computations are equivalent is, in general, undecidable (RAMALINGAM, 1994). The approach taken in value numbering algorithms is that any two expressions identified as equivalent must produce identical values on all possible executions; this is commonly referred to as a conservative solution. Algorithm 2.1 formalizes the steps of an implementation of LVN. The algorithm can be implemented with worst-case complexity of O(N²), but most LVN analyses complete in O(kN) time, with k ≪ N (BRIGGS; COOPER; SIMPSON, 1997).

The construction of the DAG representing a basic block is closely related to the operation of a value numbering algorithm, except that nodes in the DAG are reused instead of new nodes with the same value being inserted. This operation corresponds to deleting later computations of equivalent values and replacing them with previously computed ones. Presenting the intermediate code in DAG form helps to illustrate the operation of the technique. Figure 7 shows the application of Algorithm 2.1 to a short sequence of expressions, presented in DAG form.

Algorithm 2.1 Local Value Numbering (LVN) algorithm over basic blocks

    T ← empty
    N ← 0
    for all quadruples a ← b ⊕ c in the block do
        if (b ↦ k) ∈ T for some k then
            n_b ← k
        else
            N ← N + 1
            n_b ← N
            put b ↦ n_b into T
        end if
        if (c ↦ k) ∈ T for some k then
            n_c ← k
        else
            N ← N + 1
            n_c ← N
            put c ↦ n_c into T
        end if
        if ((n_b ⊕ n_c) ↦ m) ∈ T for some m then
            put a ↦ m into T
            mark this quadruple a ← b ⊕ c as a common subexpression
        else
            N ← N + 1
            put (n_b ⊕ n_c) ↦ N into T
            put a ↦ N into T
        end if
    end for
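For concreteness, a minimal executable sketch of Algorithm 2.1 follows (the Quad structure and names are hypothetical; commutativity, constants, and variable redefinition are ignored). A single hash table plays the role of T, mapping both variable names and operand patterns to value numbers:

    #include <cstdio>
    #include <string>
    #include <unordered_map>
    #include <vector>

    struct Quad { std::string a, b; char op; std::string c; };   // a <- b op c

    void localValueNumbering(const std::vector<Quad>& block) {
        std::unordered_map<std::string, int> table;   // T: name or pattern -> value number
        int n = 0;                                    // N: last value number issued
        auto lookup = [&](const std::string& key) {   // find or create a value number
            auto it = table.find(key);
            if (it == table.end()) it = table.emplace(key, ++n).first;
            return it->second;
        };
        for (const Quad& q : block) {
            int nb = lookup(q.b);
            int nc = lookup(q.c);
            // Pattern key "nb op nc"; the '#' prefix avoids clashes with names.
            std::string pattern = "#" + std::to_string(nb) + q.op + std::to_string(nc);
            auto it = table.find(pattern);
            if (it != table.end()) {
                table[q.a] = it->second;              // a common subexpression
                std::printf("%s <- %s %c %s is redundant\n",
                            q.a.c_str(), q.b.c_str(), q.op, q.c.c_str());
            } else {
                table[pattern] = ++n;
                table[q.a] = n;
            }
        }
    }

    int main() {
        // c = a + b; f = a + b  ->  the second quadruple is flagged redundant.
        std::vector<Quad> block = { {"c", "a", '+', "b"}, {"f", "a", '+', "b"} };
        localValueNumbering(block);
    }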

Fig. 7: Steps to build a DAG applying local value numbering. [(a) DAG construction: 1 expression; (b) DAG construction: 4 expressions; (c) DAG construction: 7 expressions. Figure not reproduced.]

The figure shows the expressions scanned and the resulting graph in sequence. Each node holds a content or expression, and the value number assigned is annotated at the top of the node. Variables are annotated with arrows pointing to the content they hold as the expressions are scanned. For example, in the second block the reuse of x requires that the arrow pointing to node x0 be removed and redirected to the node holding the result of u - v. Similar reuses are seen in the third block.

2.3.2 Global Value Numbering (GVN)

The value numbering algorithm as proposed is only able to handle basic blocks. An extension that can be applied to extended basic blocks was proposed by Auslander and Hopkins and implemented in the IBM PL.8 compiler (AUSLANDER; HOPKINS, 2004). A general approach that operates over the program as a whole is known as Global Value Numbering (GVN). GVN is the analysis used to remove redundant computations that compute the same static value in a program, considering multiple procedures. The roots of the GVN approach are in the work of Alpern, Rosen, Wegman, and Zadeck (ALPERN; WEGMAN; ZADECK, 1988b), whose main contribution was the description of how expressions can be partitioned into congruence classes. The first contribution in the direction of a global value numbering targeting the whole program, including multiple procedures, was the optimistic algorithm proposed by Reif and Lewis (REIF; LEWIS, 1986); the algorithm, in spite of having good asymptotic complexity, is hard to implement. Alpern, Wegman and Zadeck (ALPERN; WEGMAN; ZADECK, 1988b) suggested an algorithm that used Hopcroft's finite-state minimization to determine the congruences, including a subscripting scheme, the φ-function, that captures the semantics of conditionals and loops; the use of the φ-function in the description of a program led to the development of the SSA form. This approach was extended by Yang, Horwitz and Reps (YANG; HORWITZ; REPS, 1989). A combination of the SSA form with the Program Dependence Graph (PDG) was proposed by Ballance, Maccabe and Ottenstein (OTTENSTEIN; BALLANCE; MACCABE, 1990); it is called Gated Single Assignment (GSA) and is arguably a better subscripting scheme than the one proposed by Alpern, Wegman and Zadeck.

Fig. 8: Minimal SSA form from an intermediate code. [(a) Intermediate code; (b) corresponding minimal SSA form. Figure not reproduced.]

At the core of the GVN technique is the discovery of the so-called congruence of variables. Two variables are congruent when the computations that define them have identical operators and their corresponding operands are congruent. In order to apply the GVN technique, the procedure needs to be translated to the minimal SSA form; such manipulation can be done using the dominance frontier. The SSA form can then be used to produce the so-called Value Graph (VG). The value graph is a labeled directed graph whose nodes are labeled with operators, functions and constants. The value graph edges represent the assignment from an operator or function to its operands, and are labeled with natural numbers indicating the operand position with respect to a given operator.

Fig. 9: Value Graph. [Figure not reproduced.]

Figure 9 shows the value graph for the intermediate code in Figure 8(a) after its translation into the minimal SSA form of Figure 8(b). On the value graph, congruence is defined as the maximal relation such that two nodes are congruent if either they are the same node or they have the same operators and their operands are congruent. The equivalence of two variables is then defined at a point P if they are congruent and their defining assignments dominate P (MUCHNICK, 1997). The Alpern-Rosen-Wegman-Zadeck algorithm makes the optimistic assumption that a large set of expression classes is congruent and refines the grouping by splitting the congruence classes until a fixed point is reached. A hash-table-based approach for GVN was introduced in (BRIGGS; COOPER; SIMPSON, 1997), in which the role of the hash table is to associate an expression with a value number. Taylor Simpson's SCCVN algorithm for optimistic value numbering discovers value-based identities; it is arguably the strongest global technique for detecting redundant scalar values and has been implemented in a number of compilers (SIMPSON, 1996).
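A minimal sketch of congruence computation over an acyclic value graph follows (the Node structure is hypothetical; φ-cycles created by loops require the Hopcroft-style partition refinement mentioned above and are outside this sketch). Nodes are assumed ordered so that operands precede their users; two nodes receive the same class exactly when their labels match and their operand classes match:

    #include <string>
    #include <unordered_map>
    #include <vector>

    struct Node {                      // one value-graph node
        std::string label;             // operator, constant, or function name
        std::vector<int> operands;     // indices of operand nodes (position matters)
    };

    // Returns a congruence-class id per node: equal ids mean congruent nodes.
    std::vector<int> congruenceClasses(const std::vector<Node>& graph) {
        std::unordered_map<std::string, int> classOf;  // (label, operand classes) -> id
        std::vector<int> cls(graph.size());
        for (std::size_t i = 0; i < graph.size(); ++i) {
            std::string key = graph[i].label;
            for (int op : graph[i].operands)
                key += "," + std::to_string(cls[op]);  // operands already classified
            auto it = classOf.emplace(key, (int)classOf.size()).first;
            cls[i] = it->second;
        }
        return cls;
    }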

2.3.3 Partial Redundancy Elimination (PRE)

Another large group of optimization techniques can be gathered under the Partial Redundancy Elimination (PRE) umbrella. The technique was first proposed in (MOREL; RENVOISE, 1979). The basic idea is to find computations that are redundant on some, but not all, paths (LO et al., 1998). The process can be viewed as follows: given an expression e that is redundant at some point p on a subset of the paths that reach p, the transformation inserts evaluations of e on the paths where it had been missing, making the evaluation at p redundant on all paths (XU, 2003). The conclusion frequently found in the literature comparing GVN and PRE can be summarized as follows: on the one hand, PRE finds lexical congruences instead of value congruences, but covers partial redundancies; on the other hand, GVN finds value congruences but can remove only full redundancies (VANDRUNEN, 2004). Attempts to integrate both approaches have appeared in the literature more recently; the mixed approaches proposed in (BODÍK; ANIK, 1998) and later in (VANDRUNEN, 2004) are significant examples. To date, however, such approaches have not been implemented in commercial compilers, mainly due to their complexity and memory demands.
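The following fragment (a hypothetical example, with b, c, and cond standing for arbitrary values) illustrates the transformation:

    int pre_example(bool cond, int b, int c) {
        // Before PRE: b + c is computed inside the branch and again after
        // it, so the second evaluation is partially redundant (it is
        // redundant only on the path where cond was true).
        int x = 0;
        if (cond)
            x = b + c;
        int y = b + c;

        // After PRE: an evaluation is inserted on the path where it was
        // missing, making the later evaluation fully redundant and removable.
        int t, x2 = 0;
        if (cond) {
            t = b + c;
            x2 = t;
        } else {
            t = b + c;   // inserted evaluation
        }
        int y2 = t;      // the original computation is now a simple copy

        return x + y + x2 + y2;
    }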

2.4 Register Allocation

Register allocation attempts to maximize the execution speed of programs by keeping as many variables as possible in registers instead of memory. When there are more live variables than available registers, variables must be spilled from registers, i.e., written to memory and reloaded at a later time when they are needed. An efficient allocator should avoid spilling invariant values, since such a spill is potentially redundant and can be folded by propagating the invariant value to the subsequent instructions that use it as a source operand. Register allocation is usually performed at the end of global optimization, when the final structure of the code is ready and all registers to be used are known. The register allocation procedure attempts to map the registers in such a way as to minimize the number of memory references. Register allocation affects performance by lowering the instruction count and potentially reducing the execution time per instruction by changing memory operands into register operands. Code size is reduced by these improvements, leading to other secondary gains. Register allocation depends on liveness analysis in order to decide which variables to spill.

2.4.1 Variable Liveness Analysis

Liveness analysis is one of the data-flow analysis techniques; it seeks the variables that may potentially be read before their next write. A variable in this state is called live. Liveness analysis is at the core of the register allocation task: register allocation determines which values should be in the machine registers at each point of the execution, and register assignment is usually preceded by a liveness analysis. The liveness analysis of a basic block S can be expressed by the following equations:

L_in-state[S] = G[S] ∪ (L_out-state[S] − U[S])
L_out-state[final] = ∅
L_out-state[S] = ⋃_{p ∈ succ[S]} L_in-state[p]
G[d : y ← f(x_1, · · · , x_n)] = {x_1, · · · , x_n}
U[d : y ← f(x_1, · · · , x_n)] = {y},

where G[S] is the set of variables used in S before any assignment, and U[S] is the set of variables assigned a value in S before any use. The sets L_in-state and L_out-state are, respectively, the set of variables that are live at the beginning of the block and the set of variables that are live at the end of the block. The transfer function f handles the process of making written variables dead and making read variables live (KHEDKER; SANYAL; KARKARE, 2009).
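A direct fixed-point solver for these equations might look as follows (a sketch with hypothetical structure names; real compilers typically use bit vectors rather than string sets):

    #include <set>
    #include <string>
    #include <vector>

    struct Block {
        std::set<std::string> gen, kill;   // G[S] and U[S] in the equations above
        std::vector<int> succ;             // indices of successor blocks
        std::set<std::string> liveIn, liveOut;
    };

    // Iterates the liveness equations until a fixed point is reached.
    void solveLiveness(std::vector<Block>& blocks) {
        bool changed = true;
        while (changed) {
            changed = false;
            // Liveness is a backward problem: visiting blocks in reverse
            // order speeds convergence, but any order reaches the fixed point.
            for (int i = (int)blocks.size() - 1; i >= 0; --i) {
                Block& b = blocks[i];
                std::set<std::string> out;
                for (int s : b.succ)
                    out.insert(blocks[s].liveIn.begin(), blocks[s].liveIn.end());
                std::set<std::string> in = b.gen;            // G[S] ∪ ...
                for (const std::string& v : out)
                    if (!b.kill.count(v)) in.insert(v);      // ... (L_out − U[S])
                if (in != b.liveIn || out != b.liveOut) {
                    b.liveIn = std::move(in);
                    b.liveOut = std::move(out);
                    changed = true;
                }
            }
        }
    }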

Fig. 10: Steps to perform register allocation with K-coloring. [(a) original code; (b) code rewritten with symbolic registers s1-s6; (c) code rewritten with allocated registers r1-r3. Figure not reproduced.]

Fig. 11: Example of interference graph. [Figure not reproduced.]

2.4.2 Register Allocation by Graph-Coloring

One of the most effective and widely used approaches to register allocation is the technique known as register allocation by graph coloring. The application of the general graph-coloring problem (SAATY; KAINEN, 1977) to register allocation was first proposed in (COCKE, 1969); the first implementation was obtained only ten years later in (CHAITIN, 2004), where the approach was first adapted for the PL.8 compiler for the IBM 801 RISC System. Most modern compilers implement this allocator or one of its derivations. Global register allocation by graph coloring comprises two major procedures: the construction of the so-called Interference Graph (IG) and the K-coloring procedure. The interference graph is created in such a way that every vertex represents a unique variable in the program. Interference edges are the ones connecting pairs of vertices that are live at the same time, while pairs of vertices involved in move instructions are connected by so-called preference edges. Register allocation is then performed by K-coloring the interference graph. Applied to register allocation, the K-coloring problem can be seen in a simplified way as the assignment of necessarily different colors to two vertices sharing an interference edge and, when possible, the assignment of the same color to vertices sharing a preference edge. Hard-coded register assignment is represented by precoloring some vertices. The K-coloring problem is known to be NP-complete (BRIGGS; COOPER; TORCZON, 1994); over the past several years, algorithms trading off the quality of the produced code against compilation performance have been proposed. Global register allocation by graph coloring can be summarized in the following steps (MUCHNICK, 1997):

1. In the phase that precedes register allocation, allocate objects that can be assigned to registers r1, r2, · · · , rn to distinct symbolic registers, using as many symbolic registers as necessary to hold the objects;

2. Determine the sets of objects that should be candidates for allocation;

3. Generate the interference graph, in which each node represents an allocatable object or a real register of the target machine. The arcs should represent the interferences,
