
João Paulo Labegalini de Carvalho

Improving Hardware/Software Transactional Memory Codesign: A Phase-based and Over-Instrumentation Elimination Approach

Melhorando o Codesign Hardware/Software em Memória Transacional: Uma Abordagem Baseada em Fases e Eliminação de Instrumentação em Excesso

CAMPINAS

2020


Improving Hardware/Software Transactional Memory Codesign: A Phase-based and Over-Instrumentation Elimination Approach

Melhorando o Codesign Hardware/Software em Memória Transacional: Uma Abordagem Baseada em Fases e Eliminação de Instrumentação em Excesso

Tese apresentada ao Instituto de Computação da Universidade Estadual de Campinas como parte dos requisitos para a obtenção do título de Doutor em Ciência da Computação.

Thesis presented to the Institute of Computing of the University of Campinas in partial fulfillment of the requirements for the degree of Doctor in Computer Science.

Supervisor/Orientador: Prof. Dr. Guido Costa Souza de Araújo

Este exemplar corresponde à versão final da Tese defendida por João Paulo Labegalini de Carvalho e orientada pelo Prof. Dr. Guido Costa Souza de Araújo.

CAMPINAS

2020


Ana Regina Machado - CRB 8/5467

Carvalho, João Paulo Labegalini de,

C253i Improving hardware/software transactional memory codesign : a phase-based and over-instrumentation elimination approach / João Paulo Labegalini de Carvalho. – Campinas, SP : [s.n.], 2020.

Orientador: Guido Costa Souza de Araújo.

Tese (doutorado) – Universidade Estadual de Campinas, Instituto de Computação.

1. Memória transacional. 2. Compiladores (Programas de computador). 3. Sistemas híbridos. 4. Linguagem de programação (Computadores). I. Araújo, Guido Costa Souza de, 1962-. II. Universidade Estadual de Campinas. Instituto de Computação. III. Título.

Informações para Biblioteca Digital

Título em outro idioma: Melhorando o codesign hardware/software em memória transacional : uma abordagem baseada em fases e eliminação de instrumentação em excesso

Palavras-chave em inglês:

Transactional memory

Compilers (Computer programs)

Hybrid systems

Programming languages (Electronic computers)

Área de concentração: Ciência da Computação

Titulação: Doutor em Ciência da Computação

Banca examinadora:

Guido Costa Souza de Araújo [Orientador]

Marcio Bastos Castro

Alexandro José Baldassin

Marcio Machado Pereira

João Pedro Faria Mendonça Barreto

Data de defesa: 23-07-2020

Programa de Pós-Graduação: Ciência da Computação

Identificação e informações acadêmicas do(a) aluno(a)

- ORCID do autor: https://orcid.org/0000-0002-3476-184X
- Currículo Lattes do autor: http://lattes.cnpq.br/4569617146064566


João Paulo Labegalini de Carvalho

Improving Hardware/Software Transactional Memory Codesign: A Phase-based and Over-Instrumentation Elimination Approach

Melhorando o Codesign Hardware/Software em Memória Transacional: Uma Abordagem Baseada em Fases e Eliminação de Instrumentação em Excesso

Banca Examinadora:

• Prof. Dr. Marcio Bastos Castro INE-UFSC

• Prof. Dr. Alexandro José Baldassin DEMAC-UNESP

• Prof. Dr. Marcio Machado Pereira IC-UNICAMP

• Prof. Dr. João Pedro Faria Mendonça Barreto INESC-ID/ULisboa

• Prof. Dr. Guido Costa Souza de Araújo (Advisor) IC-UNICAMP

A ata da defesa, assinada pelos membros da Comissão Examinadora, consta no SIGA/Sistema de Fluxo de Dissertação/Tese e na Secretaria do Programa da Unidade.


The author would like to thank his advisor, Prof. Guido Araujo, for all the support and guidance throughout his Ph.D. studies. He would also like to thank all his friends from the Laboratório de Sistemas Computacionais (LSC) for walking alongside him even in hard times, for the shared laughter and for the cups of coffee that helped on long days of work. The author not only thanks but acknowledges the love he received from all his friends and family during his Ph.D. studies; such love enabled him to accomplish a life dream that, unfortunately, not all his family members were able to see fulfilled. The author would also like to thank the Fundação de Amparo à Pesquisa do Estado de São Paulo (FAPESP)¹ for supporting this work.

¹Grant numbers 2013/08293-7, 2016/15337-9, and 2019/01110-0, São Paulo Research Foundation.


A época na qual computadores eram considerados artigos luxuosos demais para se ter em casa, ou quando eram apenas grandes calculadoras, já passou. Atualmente, ciência e engenharia da computação têm papel central nos avanços de diversas áreas de pesquisa. No entanto, atender à crescente demanda por desempenho ainda é um desafio para cientistas da computação, assim como para a indústria de computadores. Durante décadas a indústria de microprocessadores proveu contínuo aumento de desempenho a cada nova geração de processadores, seguindo a previsão de Moore baseada nas melhorias da tecnologia de semicondutores. Quando o limite na densidade de potência fornecida em semicondutores estava próximo de ser atingido, a mudança de máquinas mononúcleo para multinúcleo mostrou-se uma das melhorias mais significativas no desempenho de computadores modernos. Entretanto, mesmo com maior poder computacional, máquinas paralelas ainda são difíceis de serem adotadas devido aos desafios para desenvolver programas paralelos. A maioria dos modelos de programação paralela requer programas que explicitamente usem os núcleos de processamento disponíveis. Todavia, para garantir resultados corretos, a necessária sincronização no acesso a recursos compartilhados mostrou-se uma tarefa difícil e propensa a erros. Nessa direção, o modelo de programação denominado Memória Transacional (TM) propõe simplificar a escrita de programas paralelos ao tornar a sincronização transparente para o programador. No modelo de TM o programador apenas deve se preocupar com o que deve ser sincronizado, e não com a escrita de código de sincronização. Mesmo após um grande volume de resultados mostrando os benefícios e ganhos de desempenho em sistemas em software, hardware e híbridos, a adoção de TM ainda está restrita a aplicações de pesquisa. De fato, recentemente pesquisadores respeitados na área desistiram de apresentar a especificação técnica revisada ao comitê da linguagem C++ para incorporar TM como parte da linguagem. Após várias tentativas malsucedidas, os pesquisadores concluíram que não há dados suficientes para guiar as decisões de implementação baseadas na experiência de uso do modelo de TM. Somado a isso, sistemas híbridos convencionais (HyTM) provaram-se limitados, mesmo podendo executar transações em hardware e software simultaneamente. Os algoritmos baseados em fases (PhasedTM) eram, antes dos resultados apresentados neste manuscrito, considerados alternativas inferiores aos sistemas HyTM. Nessa direção, esta tese apresenta duas contribuições na área de TM. Primeiro, a tese constrói um caso sólido em favor de sistemas de TM baseados em fases. Segundo, a tese apresenta um suporte estendido para TM no compilador Clang que permite a geração automática de código eficiente. Mais especificamente, esta tese apresenta resultados que contradizem a suposta inferioridade de sistemas baseados em fases em contraste com sistemas híbridos convencionais de TM. A tese também apresenta um novo mecanismo de anotação (TMElide) para a omissão seletiva de barreiras transacionais que normalmente seriam inseridas desnecessariamente pelo compilador. A anotação TMElide estende o sistema de tipos da linguagem C/C++, e sua implementação incorpora o suporte a TM no compilador Clang/LLVM.


The time when computers were considered just another luxury commodity to have at home, or an expensive tool to crunch numbers, is long gone. Nowadays, computer science and engineering are at the core of recent advances in many research fields. However, the ever-increasing demand for performance still imposes a challenge on both computer scientists and industry. For decades, the microprocessor industry successfully provided a steady performance increase with each new processor release, leveraged by the continuous improvement of semiconductor technology predicted by Moore's Law. As the power density boundaries of semiconductor technology have been reached, the shift from single-core to multicore processing has become the most relevant performance improvement tool in modern computer design. Despite the great potential of multicore machines to meet today's demands, their adoption is still modest, mainly due to the challenging aspects of concurrent programming. Most concurrent programming models available require programmers to explicitly write their code in a way that uses all cores. Yet the task of synchronizing multiple concurrent accesses to shared resources has proven to be error-prone and far from trivial. In this direction, the Transactional Memory (TM) programming model provides a simple and transparent abstraction to express what needs to be synchronized, without requiring the programmer to write synchronization code. TM has the potential to greatly simplify the exploitation of the parallelism available on multicore architectures while simplifying the programmers' task. Nonetheless, TM usage and adoption are mostly restricted to research applications, although a large body of research shows the benefits and great performance results that software, hardware and hybrid transactional systems enable. In fact, leading researchers in the area recently backed off from presenting a revised technical specification for standardizing TM in C/C++, after many failed attempts, mainly due to the lack of usage experience. Moreover, although conventional hybrid TM systems allow different transactions to operate in hardware and software simultaneously, such systems were proven to be inherently limited. Phase-based transactional systems (PhasedTM) are another class of hybrid TMs that execute transactions in non-overlapping phases but, prior to the work described in this manuscript, they were considered an inferior hybrid variant. In this direction, this thesis makes two contributions to the field of TM. First, it makes a solid case in favor of phase-based transactional memory. Second, it shows how extended compiler support for TM can be used to automatically generate high-performance transactional code. More specifically, this work builds a case for phase-based transactional systems, which had, so far, been regarded as inferior variations of conventional hybrid transactional systems. In addition, it proposes a novel annotation mechanism (TMElide) to selectively eliminate unnecessary transactional memory barriers from compiler-generated code. This thesis presents the TMElide annotation, which extends the C/C++ language type system, and shows how transactional memory support was incorporated into the Clang/LLVM compiler framework.


Contents

1 Introduction
2 Concurrent Programming
  2.1 Intellectually Manageable Parallelism
    2.1.1 Quiescent Objects
    2.1.2 Sequential Objects
    2.1.3 Linearizability
  2.2 Synchronization (or Agreeing on Time)
    2.2.1 Consensus
    2.2.2 Blocking Synchronization
    2.2.3 Non-Blocking Synchronization
3 Transactional Memory
  3.1 Building Blocks of TM Systems
    3.1.1 Concurrency Control
    3.1.2 Version Management
    3.1.3 Conflict Detection and Resolution
  3.2 Semantics for Transactions
  3.3 Brief History on Transactional Code Generation
    3.3.1 Transactional Memory Support on GCC
4 Revisiting Phased Transactional Memory
  4.1 Abstract
  4.2 Introduction
    4.2.1 Motivating Example
    4.2.2 Contributions
  4.3 Background
    4.3.1 Hardware Transactional Support
    4.3.2 Hybrid Transactional Memory
    4.3.3 Phased Transactional Memory
  4.4 PhTM* – An Efficient Phased TM Algorithm
  4.5 Experimental Results
    4.5.1 Experimental Setup
    4.5.2 Results using TSX
    4.5.3 Results using Power8
    4.5.4 Challenges and Limitations of Phased TM Systems
  4.6 Conclusion
5 The Case for Phase-Based Transactional Memory
  5.1 Abstract
  5.2 Introduction
    5.2.1 Motivating Example
    5.2.2 Contributions
  5.3 Background
    5.3.1 Hardware Transactional Support
    5.3.2 Hybrid Transactional Memory
    5.3.3 Phased Transactional Memory
  5.4 PhTM* – An Efficient Phased TM Algorithm
  5.5 Experimental Results
    5.5.1 Experimental Setup
    5.5.2 Results using TSX
    5.5.3 Results using Power8
    5.5.4 Behavior of STAMP Applications
    5.5.5 Performance in the Presence of Phases and Hybrid Behavior
    5.5.6 Challenges and Limitations of Phased TM Systems
  5.6 Conclusion
6 On the Efficiency of Transactional Code Generation: A GCC Case Study
  6.1 Abstract
  6.2 Introduction
  6.3 Background and Related Work
    6.3.1 Compiler Instrumentation
    6.3.2 Related Work
  6.4 Language Support for Barrier Elision
    6.4.1 Elision Pragma: Syntax and Usage
    6.4.2 Implementation
  6.5 Experimental Results
    6.5.1 Experimental Setup
    6.5.2 Breakdown of STAMP Transactional Accesses
    6.5.3 Pragma Elidebar to the Rescue
  6.6 Conclusion
7 Improving Transactional Code Generation via Variable Annotation and Barrier Elision
  7.1 Abstract
  7.2 Introduction
    7.2.1 Motivating Example
    7.2.2 Contributions
  7.3 Background and Related Work
    7.3.1 Transactional Memory Support in GCC
      7.3.1.1 Language Support
      7.3.1.2 Transactional Code Generation
      7.3.1.3 Transactional Runtime Library
    7.3.2 Related Work
  7.4 Transactional Memory Support in Clang/LLVM
  7.5 Experimental Results
    7.5.1 Experimental Setup
    7.5.2 Performance Gains Through Barrier Elision
    7.5.3 Identifying Barrier Elision Opportunities
  7.6 Conclusion
8 Final Discussion and Conclusion
  8.1 Conclusion


Chapter 1

Introduction

Computer Science is among the youngest fields of science to be so widely adopted to improve and enable the progress of many research areas. The dawn of the 21st century has witnessed Computer Science becoming a major tool in scientific exploration [71]. In fact, computing machines have evolved from simple mechanical number-crunching machines [76] to sophisticated devices that aid surgical procedures, enable complex systems like those that simulate the sub-atomic particles of the universe and, most recently, unleash the power of self-learning machines [95]. In previous decades, most of the progress achieved by the microprocessor industry came from delivering increasingly faster machines, in alignment with Moore's Law [42]. For over 40 years, the software stack achieved performance gains with every new microprocessor generation, without the need to alter its source code. Increases in clock speed, micro-architectural optimizations (e.g. Instruction-Level Parallelism, ILP) and the exploitation of the memory hierarchy have been the main reasons the software stack kept improving its performance. Nevertheless, the physical limits on the power density of semiconductor technology [43], captured by the concept of the Power Wall [69], slowed down the increase in operating frequency, as well as the transfer speed between processor and memory. In addition, as typical programs have a small number of independent instructions [107], the effectiveness of ILP and related optimizations is also limited. Moreover, due to the Memory Wall [111], increasing the instruction scheduling window would not work as a long-term solution to improve performance. In essence, single-core processors are no longer a feasible path to faster systems and will not keep up with the ever-increasing demand for performance. As a result, the most widespread alternative to achieve better program performance has become microprocessors with multiple or many processing units (multicores and many-cores).

The effective exploitation of multicore architectures has proven to be an enormous programming challenge. Despite their success with general-purpose graphical processors (GP-GPUs) for data-parallel problems, existing programming models and tools are still too archaic for more demanding problems in concurrent programming [1]. When synchronization is required, there is consensus that concurrent programming is far from trivial and that, therefore, new tools and strategies are needed [100]. A concurrent programming model known as Transactional Memory (TM) [101], which abstracts shared-memory accesses through transactions, similarly to database transactions, has shown itself to be more than just promising [3, 30]. TM is not a novel idea: the concept of transactions on which it builds was introduced back in 1976 by Eswaran et al. [56]. Transactional systems are of interest not only to academia, but to industry as well. Two major microprocessor vendors, IBM® in 2012 and Intel® in 2013, released processors with Hardware Transactional Memory (HTM) support, and ARM recently announced extensions to its Instruction-Set Architecture for HTM [98]. Not long after that, transactional memory support appeared in production-ready compilers, such as the Intel® C/C++ Compiler (ICC) and the GNU Compiler Collection (GCC). Compiler support for TM focuses on software transactional memory (STM), in order to reduce the overheads of naïve usage and to relieve programmers of the burdensome task of manually calling the TM runtime. Despite language support having existed for over 7 years, the performance of automatically generated transactional code is still a significant barrier to new TM users [80].

Software TM systems are inherently more flexible than hardware TM systems, as multiple TM algorithms can be implemented in different libraries [28, 3]. In addition, if such libraries follow the same Application Programming Interface (API), algorithms can be used interchangeably without recompiling the application's code. Unfortunately, the potential parallelism that STMs are able to unveil is usually hindered in short-running transactions by the transaction-management overhead. Hardware TM systems excel at executing short transactions; however, the HTMs available on commodity processors only provide best-effort completion guarantees, due to resource and implementation constraints. In order to take advantage of both the software and the hardware TM incarnations, the widely accepted solution in the literature is known as Hybrid TM (HyTM) [58, 29]. Conventional HyTM systems allow different transactions to operate in hardware and software simultaneously. Despite this flexibility, conventional HyTMs introduce unavoidable overhead to guarantee correct and simultaneous execution in both software and hardware TM modes [6].

Phase-based transactional systems (PhasedTM) are another class of HyTMs, one that executes hardware and software transactions in non-overlapping phases [68]. However, before the work described in this manuscript, PhasedTM was considered an inferior variation of conventional HyTMs [27, 73]. This work claims that the potential performance of PhasedTM has been underestimated and, to support such claim, it makes two major contributions. First, it makes a solid case in favor of phase-based transactional memory. Second, it shows how extended compiler support for TM can be used to automatically generate high-performance transactional code. More specifically, this work's contributions are the following:

• A thorough experimental evaluation that contrasts conventional HyTMs with a novel phase-based TM implementation (PhTM*) and shows PhasedTM to be the higher-performance alternative (Chapter 4);

• A large experimental analysis showing that PhTM* also outperforms conventional HyTMs even for hybrid-behaved applications (Chapter 5).

• A study that reveals the deficiencies of existing compiler support for TM targeting the C/C++ languages, in particular in the GNU C/C++ Compiler (Chapter 6);

• An in-depth assessment of a novel transactional memory barrier elision mechanism (TMElide) added to the C/C++ type system and incorporated into the Clang/LLVM compiler framework (Chapter 7).

The above contributions have been published in recognized conferences and journals in the area. All the publications directly related to this work's research are listed below.

1. Chapter 4: Revisiting Phased Transactional Memory. João P. L. de Carvalho, Guido Araujo, and Alexandro Baldassin. Proceedings of the 31st Annual International Conference on Supercomputing, 2017.

2. Chapter 5: The Case for Phase-Based Transactional Memory. João P. L. de Carvalho, Alexandro Baldassin, and Guido Araujo. 30th Volume of IEEE Transactions on Parallel and Distributed Systems, 2018.

3. Chapter 6: On the Efficiency of Transactional Code Generation: A GCC Case Study. Bruno C. Honorio, João P. L. de Carvalho, and Alexandro Baldassin. Simpósio de Sistemas Computacionais de Alto Desempenho (WSCAD) [Best-Paper Award], 2018.

4. Chapter 7: Improving Transactional Code Generation via Variable Annotation and Barrier Elision. João P. L. de Carvalho, Bruno C. Honorio, Alexandro Baldassin, and Guido Araujo. Proceedings of the 34th Annual IEEE International Parallel and Distributed Processing Symposium, 2020.

Besides the publications related to this work, the following three articles were published by the author during his Ph.D. program.

1. An Efficient Parallel Implementation for Training Supervised Optimum-Path Forest Classifiers. Aldo Culquicondor, Alexandro Baldassin, Cesar Castelo-Fernandez, João P. L. de Carvalho, and João Paulo Papa. 393rd Volume of the Neurocomputing Journal, 2018.

2. DOACROSS Parallelization Based on Component Annotation and Loop-carried Probability. Luis Mattos, Divino Cesar, Juan Salamanca, João P. L. de Carvalho, Marcio Pereira, and Guido Araujo. Proceedings of the 30th International Symposium on Computer Architecture and High Performance Computing, 2018.

3. NV-PhTM: An Efficient Phase-Based Transactional System for Non-Volatile Memory. Alexandro Baldassin, Rafael Pizzirani Murari, João P. L. de Carvalho, Guido Araujo, Daniel Felipe Salvador de Castro, João Barreto, and Paolo Romano. Proceedings of the 26th Annual International European Conference on Parallel and Distributed Computing, 2020.

In addition, the author spent one year at the Software Systems Research Laboratory, part of the Department of Computing Science at the University of Alberta, Canada. During his stay, the author participated in a research project as part of his BEPE scholarship and was co-advised by Prof. José Nelson Amaral. As a result, the following research paper will be submitted to ACM Transactions on Architecture and Code Optimization: GEMM-FaReR: Replacing Native-Code Idioms with High-Performance Library Calls. João P. L. de Carvalho, Braedy Kuzma, José Nelson Amaral, Christopher Barton, José Moreira, Guido Araujo.

This document is organized around the articles published by the author during his Ph.D. program at the Institute of Computing (UNICAMP). The remainder of the document is organized as follows. Chapter 2 briefly discusses concurrent programming and its difficulties. Chapter 3 then presents a short introduction to TM and, more specifically, a brief history of its compiler support. Chapters 4 through 7 present each contribution as published in the associated research paper or journal version. Finally, the conclusions of the work are presented in Chapter 8.


Chapter 2

Concurrent Programming

Concurrent programming is the design and implementation of algorithms that divide the computation of a problem's solution into smaller concurrent parts. Each concurrent part can be executed in parallel while preserving the input → output relationship. The parallel execution of a program is highly desirable because either its output is computed faster (reducing execution time), or the program can compute multiple outputs for multiple inputs in the same amount of time taken by the sequential program to produce a single output (improving execution throughput). Concurrent programs are mainly written for the shared-memory or the distributed-memory architectural model. Nevertheless, there are many problems that benefit from, or even require, a combination of both [117]. In the shared-memory model, data is seamlessly shared between processes/threads and usually does not require copy operations. Although shared-memory algorithms execute within the same physical machine, memory access time might not be uniform among threads on different cores; this is the case in Non-Uniform Memory Access (NUMA) machines [59]. Processes in the distributed-memory model share data through message-passing protocols, which are implemented on top of high-speed networks. This work proposes a new algorithm (Chapters 4 and 5) and a code generation technique (Chapter 7) for Transactional Memory (TM), a programming model designed for shared-memory systems. To help in the understanding of those contributions, this chapter presents and discusses how concurrent programming works in the context of shared memory.


Figure 2.1: Life-cycle of threads in the Fork-Join model.

Threads are the most common concurrent programming model in shared-memory systems. Although alternatives such as the Actor Model [2] exist, they are less commonly used and are outside the scope of this work. In the thread model, each process is composed of one or more threads. Each thread executes within a private context, which contains its own set of registers (e.g. program counter). Nonetheless, all threads within a process share the same address space. The simplest, but perhaps most common, threading pattern in concurrent applications is the fork-join model, depicted in Figure 2.1. The figure shows the main thread spawning multiple threads, usually called worker threads, which perform some computation and then synchronize at the end (join). After the join, the main thread continues the execution sequentially until the next fork. Other patterns exist, for instance those in which threads do not synchronize at the end (disjoint threads). It is up to the programmer to decide which threading pattern better suits the application. The programmer is also in charge of choosing between various levels of control over the threads in the program. If the application does not require a finer control level, it is possible to use compiler-based parallelism, such as OpenMP [18]. If finer control or a more sophisticated threading pattern is desired, the programmer can use thread libraries such as Pthreads (POSIX Threads) [84].
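As a concrete illustration of the fork-join pattern in Figure 2.1, the sketch below spawns worker threads with C++'s std::thread. The worker count and the partial-sum workload are illustrative choices, not taken from the text.

```cpp
#include <cstdio>
#include <numeric>
#include <thread>
#include <vector>

int main() {
    const int num_workers = 4;                  // illustrative choice
    std::vector<long> partial(num_workers, 0);  // one slot per worker

    // Fork: the master thread spawns the worker threads.
    std::vector<std::thread> workers;
    for (int id = 0; id < num_workers; ++id)
        workers.emplace_back([id, num_workers, &partial] {
            for (long i = id; i < 1000000; i += num_workers)
                partial[id] += i;   // each worker writes only its own slot
        });

    // Join: the master thread blocks until every worker finishes, then
    // continues sequentially until the next fork.
    for (auto &t : workers) t.join();

    std::printf("sum = %ld\n",
                std::accumulate(partial.begin(), partial.end(), 0L));
}
```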


Figure 2.2: DAG of tasks to compute the Laplacian Estimate given image I.

Some fundamental problems in computer science, such as sorting and searching, have intrinsic characteristics that facilitate the design of concurrent algorithms. Problems with these features are known as embarrassingly parallel (or pleasingly parallel). Such ease in design comes from the fact that either the input or the computations can be divided into independent (or mostly independent) parts, such that they are processed or performed in parallel. The former falls under the category of data-parallel problems [57], and the latter is categorized as task-parallel problems [99]. Data-parallel problems are very common in the realm of Linear Algebra computations (e.g. multi-dimensional array multiplication), which are, for example, the building blocks of Convolutional Neural Networks (CNNs) [65]. Even though data-parallel algorithms do not usually require synchronization, they still impose the challenge of evenly partitioning the data among the processing elements, so that each processor performs the same amount of work. This is known as the load-balancing problem [25]. Task-parallel problems are less pleasingly parallel because, although independent tasks can indeed execute in parallel, tasks that are not independent must be executed in a specific order. For example, the Laplacian Estimate can be divided into three tasks: Dilation Filter (DF), Erosion Filter (EF) and Linear Combination (LC) [105]. Both DF and EF can execute in parallel, as together they produce two new matrices from the input image. However, the LC task, which linearly combines both matrices, cannot execute in parallel with DF and EF. This dependency, depicted in Figure 2.2, requires some form of synchronization. The problems with synchronization will be discussed later in Section 2.2.

The quality of a parallel algorithm is commonly measured in terms of how much faster it is compared with its best known sequential counterpart. Nevertheless, the best sequential algorithm might be unknown. Alternatively, the quality can be measured by how much faster the parallel execution is compared with the sequential execution of the same algorithm. Every implementation of a parallel algorithm splits the algorithm into two kinds of sections: parallel and sequential. As the names suggest, the parallel sections can execute in parallel, while the sequential sections must run serially, either because they were not, or could not, be parallelized. Amdahl's Law [7] gives the theoretical limit of the speedup that can be achieved by the parallelization of an algorithm, as a function of the number of processors p and the fractions of the program spent in sequential sections (f_s) and parallel sections (1 − f_s), as Equation 2.1 shows.

S(p) = \frac{1}{f_s + \frac{1 - f_s}{p}} \qquad (2.1)

\lim_{p \to \infty} S(p) = \frac{1}{f_s} \qquad (2.2)

As Equation 2.2 shows, when p approaches infinity the maximum speedup is limited only by the fraction of sequential code. Hence, no matter how efficiently a parallel algorithm exploits the underlying hardware to execute the parallel sections of the program, the speedup will always be limited by the time spent executing the serial sections. Therefore, reducing the amount of serial computation is as important as designing an effective parallelization strategy.
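A small numeric instance of Equation 2.1 makes the limit in Equation 2.2 tangible. The sequential fraction f_s = 0.1 below is an arbitrary value chosen for illustration.

```cpp
#include <cstdio>

// S(p) = 1 / (fs + (1 - fs) / p), Equation 2.1.
double amdahl_speedup(double fs, int p) {
    return 1.0 / (fs + (1.0 - fs) / p);
}

int main() {
    const double fs = 0.1;  // 10% of the execution is sequential
    for (int p : {2, 8, 64, 1024})
        std::printf("S(%4d) = %5.2f\n", p, amdahl_speedup(fs, p));
    // The output approaches 1/fs = 10 (Equation 2.2): even with 1024
    // processors, the speedup never exceeds 10x.
}
```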

2.1 Intellectually Manageable Parallelism

Even before a solution to a computational problem is devised, its key features must first be understood and abstracted into entities that are, at least, intellectually manageable [35]. Once the problem's components are identified and their relationships are known, the translation from the high-level abstractions to a computer algorithm takes place. In order to execute, such an algorithm must be further translated into a program written in a programming language. A program describes computational steps in a given execution model. For didactical purposes only, the discussions in the remainder of this chapter will employ an object-oriented model [97]. In this conceptual model the program is composed of objects and methods. Objects are the entities that store information relevant to the computation of a problem's solution; the current value stored by an object is called its state. Methods are the means by which objects change state; they represent the objects' interactions and describe their behavior (side effects). The state of the program is the state of all objects at a given point in time. Thus, the program execution, from the initial state encoding the input to a terminal state encoding the output, can be fully described by a sequence of method invocations, considering only the state of each object before and after each invocation. In other words, if the program state satisfies a set of conditions (preconditions), then, after a method invocation, the new program state will satisfy another set of conditions (post-conditions), which depend only on the preconditions and the method specification. Such a description is known as the sequential specification of a program [64].


The sequential specification is powerful enough to correctly describe the state transitions of sequential objects, i.e., objects in a sequential program. However, if a sequential object is shared among threads, precisely describing its state between method invocations might be impossible, as superposition may naturally occur: if method invocations overlap, an object may be observed in an intermediate state, produced by another, incomplete invocation. In addition, the notion of order, natural in sequential execution, loses its meaning, as different permutations of invocations are possible. In essence, a semantics for concurrent objects is required in order to precisely describe their behavior in the presence of multiple threads. Before moving to the discussion of consistency properties of concurrent objects, some auxiliary concepts and definitions must be established. The execution of a method consists of the period between its invocation event and its response event. A method is pending if only its invocation event has happened, but not its response. Final methods are methods that have a defined response event for every object state; otherwise, they are said to be partial methods. A correctness property P is composable if, whenever every object in the system satisfies P, the system as a whole also satisfies P [49]. Lastly, program order is the order in which method invocations happen within each thread. The invocation orders of different threads are independent, i.e., program order is a partial order over all invocations.

In the discussion of the following consistency properties a queue object will be used. A queue Q has two basic methods: enq() and deq(), which enqueue into and dequeue from Q, respectively. Q is assumed initially empty, unless said otherwise. The method enq() is final, since it is always possible to enqueue an element into a queue. However, deq() is not final, because it is not possible to dequeue from an empty queue. Nevertheless, it is possible to turn deq() into a final method by considering an exception as the response event for empty queues. The following principles will be used to define the semantics of each kind of object.

Principle 1: Method executions must appear to take effect one after another, in sequential order.

Principle 2: Executions separated by a quiescent period must appear to take effect in real-time order.

Principle 3: Executions must appear to take effect in program order.

Principle 4: Every execution must appear to take place instantaneously between its invocation and response events.
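A minimal sketch of the queue object Q used above, with its sequential specification stated as pre- and post-conditions in comments (illustrative code, not taken from the thesis):

```cpp
#include <deque>
#include <stdexcept>

template <typename T>
class Queue {
    std::deque<T> items;
public:
    // enq() is a final method: it has a response event for every state.
    // Post-condition: x is stored at the tail of the queue.
    void enq(T x) { items.push_back(x); }

    // deq() is partial on an empty queue; it is made final here by
    // treating an exception as the response event for that state.
    T deq() {
        if (items.empty()) throw std::runtime_error("empty queue");
        T x = items.front();
        items.pop_front();  // post-condition: the head element is removed
        return x;
    }
};
```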

2.1.1 Quiescent Objects

The most intuitive principle for concurrent methods is to make their executions appear as if taking effect one after another in sequential order. For instance, if threads A and B concurrently execute Q.enq(−3) and Q.enq(7), then a later Q.deq() executed by thread C must not return −7, a value that was never enqueued. Although easy to grasp, this principle, from here on referred to as Principle 1, is a very weak correctness criterion: for a method to satisfy it, it suffices to always return the initial state of the object. This vicious scenario can be avoided by imposing that invocations separated by a quiescence period appear in real-time order (Principle 2). A quiescence period is a time interval in which there are no pending methods. For a quiescently consistent queue object, if thread A executes Q.enq(x) and B executes Q.enq(y), and C later executes Q.enq(z) at a point where Q is quiescent, then z must be preceded by both x and y. An object is quiescently consistent if it satisfies both Principle 1 and Principle 2. Quiescent objects are composable and, more importantly, non-blocking (see Section 2.2.3): in any concurrent execution, pending final methods have quiescently consistent responses. It is important to notice that quiescent consistency does not specify an order for concurrent invocations, only for those that follow Principle 2.

2.1.2 Sequential Objects

Sequential objects behave as if their methods occurred in a sequential order (Principle 1) consistent with program order (Principle 3). In essence, it is possible to schedule the methods such that both program order and the sequential specification of the object are satisfied. Sequential objects are desirable because read and write operations in modern multiprocessor architectures are not sequentially consistent: the processor is free to reorder them. Despite their usefulness, sequential objects are not composable. This follows from the fact that sequential consistency imposes only that executions be consistent with respect to program order, and the program orders of two different threads are unrelated. For example, let us assume that threads A and B perform operations on two queues, P and Q, as Figure 2.3 shows. It is easy to see that, in isolation, both P and Q are sequentially consistent, i.e., there is a way to order the invocations on each queue in a sequential order that obeys program order. However, it is impossible to schedule them in a sequential order that satisfies the sequential specifications of both queues without violating program order. This result is core to understanding why operations that provide only sequential semantics are not sufficient for general synchronization (see Section 2.2).

Figure 2.3: P and Q are sequentially consistent but the execution as a whole is not.

2.1.3 Linearizability

In order to provide sequential consistency and composability, a property stronger than program order is required. If the real-time behavior of methods is preserved (Principle 4) and methods appear to take effect one after another in sequential order (Principle 1), the object has linearizable consistency. Such a property guarantees not only the program order of individual threads but also an order among the method invocations of different threads. Like quiescent objects, linearizable objects are composable and therefore better suited to describe the behavior of the components of a large system [49]. In addition, linearizability is non-blocking (see Section 2.2.3). The power of linearizable objects comes from the fact that every concurrent execution, no matter the order of method events, is equivalent to some sequential order of events.
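One common way to obtain linearizable histories (a sketch of a sufficient construction, not the only one) is to protect every method of a sequential object with a single lock: each method then takes effect at one instant inside its critical section, between invocation and response (Principles 1 and 4). Note that this particular implementation is blocking, even though linearizability itself is a non-blocking property.

```cpp
#include <deque>
#include <mutex>
#include <optional>

template <typename T>
class LinearizableQueue {
    std::deque<T> items;
    std::mutex m;
public:
    void enq(T x) {
        std::lock_guard<std::mutex> g(m);  // the linearization point lies
        items.push_back(x);                // inside the critical section
    }
    std::optional<T> deq() {
        std::lock_guard<std::mutex> g(m);
        if (items.empty()) return std::nullopt;  // response for empty queue
        T x = items.front();
        items.pop_front();
        return x;
    }
};
```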

2.2 Synchronization (or Agreeing on Time)

The correctness of concurrent objects is only possible if all computing agents (processes or threads) agree on the order in which the events of concurrent objects took place. The mechanism used to reach such agreement is synchronization. The issues and strategies of synchronization will be discussed using the consensus problem [40] as a tool to measure the strength of synchronization primitives. In this section both blocking and non-blocking primitives will be discussed, as well as their progress guarantees.

Threads in a concurrent program must synchronize their accesses to a shared memory region. Otherwise, the inherent non-determinism of parallel execution might allow individual computing steps to occur outside the expected order, i.e., a race condition [83]. This problem usually happens because computing steps are assumed to be indivisible when, in fact, even processor instructions are performed in multiple steps (micro-operations). The unexpected interleaving of such steps might produce results that violate the correctness properties of objects. For instance, if threads t1 and t2 want to increment a shared variable l, the following sequence of steps is possible (l is initially zero). t1 reads l into a register (v1) and then increments the register, as Figure 2.4 shows. After that, t2 also reads l into another register (v2) and increments it. In between the steps of t2, t1 stores the value of v1 to shared memory (l = 1). Finally, t2 stores v2 to shared memory (l = 1). Because there was no synchronization between t1 and t2, after two increments the value in l is 1 and not 2, as expected. Synchronization aims to impose an order in which steps from different threads are executed, thus eliminating the reorderings that would produce inconsistent results. Notice that, in the previous example, the increment operation might appear to be atomic in a high-level language; most synchronization problems are the result of this mismatch between the high-level language and the actual steps that are executed. This and many other subtleties are the main cause of confusion among novice programmers writing concurrent programs.

Figure 2.4: Example of a sequence of steps that result in an inconsistent state.
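The lost-update interleaving described above can be reproduced, and fixed, in a few lines of C++. Whether the race manifests in a given run depends on the scheduler; the iteration count is an arbitrary choice.

```cpp
#include <atomic>
#include <cstdio>
#include <thread>

int l = 0;                    // unsynchronized shared counter (racy)
std::atomic<int> l_atomic{0}; // synchronized counterpart

int main() {
    auto racy = [] { for (int i = 0; i < 100000; ++i) l = l + 1; };
    auto safe = [] { for (int i = 0; i < 100000; ++i) l_atomic.fetch_add(1); };

    std::thread t1(racy), t2(racy);
    t1.join(); t2.join();
    std::thread t3(safe), t4(safe);
    t3.join(); t4.join();

    // `l` may be smaller than 200000: the read-increment-store steps of t1
    // and t2 can interleave exactly as in Figure 2.4. fetch_add makes the
    // read-modify-write indivisible, so `l_atomic` always ends at 200000.
    std::printf("racy: %d, atomic: %d\n", l, l_atomic.load());
}
```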


2.2.1 Consensus

In the context of shared-memory systems, the consensus problem consists in finding an implementation for the decide(v) method such that the following conditions hold for a number n of threads: (i) every thread decides the same value v when decide() is invoked concurrently, and (ii) the decided value v = vi was proposed by some thread i by invoking decide(vi). Using this specification it is possible to formulate the following definitions:

Definition 2.1: A class C solves consensus for n threads if there exists an implementation of the decide() method that uses an arbitrary number of objects from C and atomic registers (registers with atomic load and store operations).

Definition 2.2: The consensus number of a class C is the largest number of threads for which there is an implementation of decide() following Definition 2.1. If there is no such largest number, the consensus number is said to be infinite.

Consensus is intimately related to synchronization, because reaching consensus is equivalent to deciding which thread will enter a critical section, whether there are elements/space to produce/consume from a shared queue, and so on. In short, if it is possible to solve consensus for n threads with a class C, then objects from C can be used to synchronize the shared-memory accesses of n threads. It is tempting to assume that atomic registers are, alone, sufficient to solve consensus and can thus be used as a synchronization primitive. A particular case of consensus, called binary consensus, will be used to show that this is not the case. For simplicity, the analysis will consider only two threads. Thus, the question is whether atomic registers solve consensus between threads A and B; in other words, whether it is possible to decide the execution order of the threads using the values of an arbitrary number of atomic registers. By contradiction, let us assume it is possible to solve consensus in this setting. There are three possible scenarios: one of the threads reads from a register, both threads write to separate registers, or both write to the same register. In the case where one performs a read, let us assume that A is the reader. A can read either before or after B writes to a register (Figure 2.5a). Starting from an undecided state s, if B executes its write first then, by convention, the system goes to a 1-valent state s′, and B runs to completion and eventually decides 1. In the alternative case, A reads first, then B performs its write and, by convention, the system goes to a 0-valent state s″; after that, B runs to completion and eventually decides 0. The problem is that, from B's point of view, the states s′ and s″ are indistinguishable, therefore B must decide the same value in both cases, a contradiction.

Figure 2.5: Bivalent states that arise from the use of atomic registers.

Figure 2.5b shows the scenario where threads A and B write to different registers. As before, there are two possible executions. If A writes first into register r0 then, by convention, the system is in a 0-valent state s′; B then writes and eventually decides 0. Analogously, if B writes first into r1, by convention, the system is in a 1-valent state s″; A then writes to r0 and eventually decides 1. In both cases neither A nor B can distinguish s′ from s″, therefore both should decide the same value, which is a contradiction. The third and last scenario is analogous to the first. If both threads A and B write to the same register (see Figure 2.5c), the two possible executions will decide different values via two indistinguishable states s′ and s″. Therefore, both threads must decide the same value, which is a contradiction.

As a result, the consensus number of atomic registers is 1. In essence, it is impossible to synchronize 2 or more threads using only atomic registers. This impossibility result demonstrates why modern multicore architectures must provide primitives stronger than atomic registers in order to allow the implementation of lock-free concurrent data structures [49]. In fact, modern processors provide operations from the Read-Modify-Write class, such as Compare-And-Swap (CAS), which have an infinite consensus number.

2.2.2 Blocking Synchronization

Blocking mechanisms, as the name suggests, are those that block, or limit, the number of threads that are allowed to enter a critical section. The most basic blocking objects are locks and semaphores. Lock objects essentially have two operations: lock(), to acquire the lock when entering the critical section, and unlock(), which releases the lock when leaving the critical section. An implementation of a lock object must provide a safety property that guarantees that only a single thread is inside a critical section at any point in time. This property is called mutual exclusion. In addition, lock objects must also provide a liveness property that guarantees deadlock freedom. Deadlock freedom ensures that, if a thread tries to acquire a lock, some thread will succeed in acquiring it; if a thread never acquires the lock, other threads must be completing an infinite number of critical sections [49]. A stronger and more desirable liveness property is starvation freedom, which guarantees that every thread that attempts to acquire a lock eventually succeeds. In other words, every call to lock() returns. This property is also called lockout freedom.

Semaphores are objects that control how many threads are allowed inside the critical section. Unlike locks, semaphores are not limited to a binary state and can assume any value between 0 and c, where c is the semaphore's capacity. The capacity translates to the number of threads allowed in the critical section. Semaphores have two methods: wait() and post(), which are analogous to lock() and unlock(), respectively. A wait() is invoked before entering the critical section and only returns when it successfully decrements the capacity counter, which must be greater than zero before the decrement. A post() invocation returns immediately after incrementing the capacity counter, if it is smaller than c, and leaves it unchanged otherwise.
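Sketches of both blocking primitives using the C++ standard library (C++20 for the semaphore). The capacity c = 4 is an arbitrary illustration; wait() and post() map to acquire() and release().

```cpp
#include <mutex>
#include <semaphore>

std::mutex lk;
std::counting_semaphore<4> sem(4);  // capacity c = 4

void with_lock() {
    lk.lock();      // blocks until the lock is acquired
    // Critical section: mutual exclusion, at most one thread here.
    lk.unlock();
}

void with_semaphore() {
    sem.acquire();  // wait(): returns only after decrementing the counter
    // Critical section: at most c = 4 threads here at a time.
    sem.release();  // post(): increments the counter
}
```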

Critical sections and, more specifically, shared variables are protected by locks and semaphores following a convention defined by the programmer. This absence of structure complicates the composition and maintenance of code, given that different programmers might follow different conventions. This aspect might severely limit code re-use, since re-use might produce a sequence of invocations not predicted in the original implementation and lead to deadlocks. A deadlock is a state where threads are waiting for a lock release that will not occur; for example, two or more threads wait for the release of a lock after acquiring another lock that prevents the first lock from being released. Another problem with blocking primitives is the trade-off between granularity and performance, and the trade-off between granularity and code complexity. For instance, using a single lock to protect all critical sections will clearly produce a correct execution, but the lock will serialize the execution. This strategy is simple to implement but will be no better and, probably, worse than the sequential execution, due to contention and by Amdahl's law. On the other hand, a fine-grained usage of locks will potentially increase parallelism, at the cost of code complexity. Correctly scheduling sequences of lock()/unlock() invocations and ensuring they happen in the desired order is far from trivial.

Blocking mechanisms have dependent progress guarantees, meaning that progress depends not only on the object implementation, but also on external conditions. For instance, a concurrent program will enter a deadlock state if the thread that holds a lock is preempted and never scheduled to run again due to priority-inversion problems. Therefore, the progress of blocking synchronization is dependent on the operating system.


2.2.3 Non-Blocking Synchronization

Non-blocking synchronization is an alternative to lock-based algorithms with independent progress guarantees. Non-blocking mechanisms ensure progress independently of the order in which threads are scheduled. Nevertheless, this does not imply that blocking algorithms, which provide only dependent guarantees (e.g. starvation and deadlock freedom), must be avoided. On the contrary, if pathologies like priority inversion and frequent preemption are rare, then both blocking and non-blocking implementations can exhibit similar behavior, apart from performance. Indeed, non-blocking solutions usually outperform their blocking counterparts, mainly due to the reduction in serial code execution. However, both the design and the implementation of non-blocking algorithms can be very challenging; for example, wait-free algorithms for basic data structures are highly desirable and have been published at quality venues [113]. The literature classifies non-blocking algorithms according to their progress guarantees, or freedoms [49]. The three main freedoms are:

1. Wait-free – all method invocations eventually complete after a finite number of steps. Whole-system progress is guaranteed.

2. Lock-free – some method invocation is guaranteed to eventually complete after a finite number of steps. The progress of some thread is guaranteed.

3. Obstruction-free – eventually an invocation, when in isolation, is guaranteed to complete. A thread in isolation makes progress.

Wait-freedom is the strongest of the three, as it guarantees starvation freedom and, thus, the progress of all threads. Lock-free algorithms, although non-blocking, are vulnerable to starvation. Nevertheless, a wait-free implementation might be orders of magnitude slower than its lock-free counterpart [37, 81], and a simple back-off mechanism might be sufficient to reduce, or even eliminate, starvation in some highly contended scenarios. By contrast, it is easy to see that deadlock freedom is a dependent progress guarantee, given that preempting a thread holding the lock prevents other threads from progressing.

The weakest freedom is delivered by obstruction-free algorithms, which only ensure the completion of conflicting invocations when they are executed in isolation. It is important to notice that non-conflicting methods, those that manipulate different objects or different properties of an object, are usually independent and can make progress in parallel. Non-blocking algorithms are built on top of, and rely on, hardware atomic instructions such as Compare-And-Swap [49]. An in-depth discussion of the hazards and problems of non-blocking algorithms is outside the scope of this work.
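A classic sketch of a lock-free operation: push on a Treiber-style stack. A failed CAS implies that another thread's CAS succeeded, so some thread always completes in a finite number of steps (lock-freedom), although an individual thread may starve. Memory reclamation, a well-known hazard of such structures, is deliberately ignored here.

```cpp
#include <atomic>

struct Node { int value; Node* next; };
std::atomic<Node*> top{nullptr};  // shared top-of-stack pointer

void push(int v) {
    Node* n = new Node{v, top.load()};
    // Retry until no concurrent push changed `top` in between; on failure,
    // compare_exchange_weak reloads the current `top` into n->next.
    while (!top.compare_exchange_weak(n->next, n)) { /* retry */ }
}
```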

The next chapter discusses the Transactional Memory (TM) programming model. TM algorithms are usually blocking and provide only obstruction or lock freedom; however, non-blocking and wait-free TM algorithms have been proposed [101]. The discussion in the next chapter focuses mostly on blocking TM algorithms, unless explicitly said otherwise. In TM, accesses to shared memory are abstracted through transactions, just like in database systems, which greatly simplifies both the design and the re-use of concurrent objects.


Chapter 3

Transactional Memory

General concurrent programming is still an open problem in computer science, yet database systems have efficiently exploited parallel hardware for decades. The ability to process multiple queries simultaneously through the use of multiple threads is the main feature responsible for the great performance achieved by modern database systems. The transactional programming model abstracts concurrency and frees programmers from the error-prone and difficult task of synchronizing individual accesses. In the face of the wide success and relative simplicity of this model, it was conjectured whether the transactional model could be extended to general-purpose concurrent applications. In this context, Transactional Memory (TM) was conceived as a conceptual parallel programming abstraction modeled after transactions in database systems [101]. A transaction is a sequence of operations that appear to be indivisible and whose modifications take effect instantaneously from the point of view of an external observer. Eswaran et al. [56] introduced the concept of transactions and, since then, many concurrent programming approaches have been inspired by transactions and atomic operations. For instance, Lomet [26] pioneered the use of atomic actions as a means to simplify the coding of process synchronization and system recovery. In this work, transactions operate on shared memory and have the following properties: atomicity, consistency and isolation. Only recently was the fourth property, durability, from the acronym ACID used in database systems, incorporated into TM transactions with the resurgence of Non-Volatile Memories (NVMs) [75]. The definition of each property follows below; the short code sketch after the list makes them concrete:

• Atomicity: a transaction is atomic if its changes either take effect instantaneously, at a point between its start and its commit operation, or not at all. This property is also known as all-or-nothing: the program state either appears to change instantaneously, in which case the transaction is said to have committed, or all modifications are rolled back and the transaction restarts, in which case the transaction is said to have aborted.

• Consistency: its precise definition is application-dependent. Nevertheless, consistency can be preserved for a given program by adhering to all invariants at every program point. In other words, a transaction is consistent if, starting from a consistent state, its committed changes also define a new consistent state.


• Isolation: all non-committed operations of a transaction must be invisible to all external observers. In other words, the result of a transaction must be the same, given the same initial conditions, regardless of whether it runs in parallel with, or isolated from, other transactions.
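The sketch below (referenced above) shows how these properties surface to the programmer, using the transactional language extension implemented by GCC (code compiled with -fgnu-tm). The linked-list example itself is illustrative.

```cpp
// Compile with: g++ -fgnu-tm prepend.cpp
struct Node { int value; Node* next; };
Node* head = nullptr;  // shared list head

void prepend(int v) {
    Node* n = new Node{v, nullptr};
    __transaction_atomic {
        // Atomicity and isolation: both writes appear to take effect at a
        // single instant, or not at all; no other thread can observe the
        // intermediate state where n->next is set but head is not.
        n->next = head;
        head = n;
    }
}
```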

Even with their similarities, database (DB) and transactional memory (TM) transactions differ in three main aspects: (i) DB data is persistently stored on hard disks and takes thousands of cycles to reach the processor, allowing millions of instructions to complete; TM data is stored in main memory, taking hundreds of cycles, or in cache, taking tens of cycles, to be available to the processor. Thus, in DB systems there are more opportunities to overlap computation with data access than in TM transactions. (ii) DB separates data representation from storage and thus allows the system implementation to change without modifying its interface with the user. In contrast, TM systems access data directly from memory in the same structure as designed by the programmer, which complicates the unification of a standard TM interface. (iii) TM is intended for general-purpose concurrent applications and thus must be compatible with a variety of libraries, languages and existing infrastructures. TM's learning curve will be less steep if adopting it does not require large changes in the software stack and does not impose a strict modus operandi, as DB systems do.

The remainder of this chapter discusses the building blocks of transactional memory systems (Section 3.1) and their semantics and progress guarantees (Section 3.2). Finally, a brief discussion of the main implementation and design decisions of the software (STM), hardware (HTM) and hybrid (HyTM) incarnations of TM is presented.

3.1 Building Blocks of TM Systems

This section discusses the building blocks of TM systems, as well as the main design decisions involved in their implementation. The most appealing feature of transactions is that concurrency control (Section 3.1.1) is made transparent to the programmer through speculative execution. In order to provide this transparency, TM systems employ a mechanism to manage both speculative and committed versions of objects (Section 3.1.2). Transactions are made consistent by detecting conflicting updates and resolving them (Section 3.1.3).

3.1.1 Concurrency Control

Concurrency control mechanisms can be classified based on the assumptions each mechanism makes on the likelihood of conflicts. Pessimistic concurrency control considers that the conflict rate will be high; therefore, both conflict detection and resolution are done at the moment that shared-memory accesses are made. Pessimistic implementations try to obtain the ownership of accessed regions before the first access. If conflicts are indeed frequent, employing a pessimistic strategy yields less wasted work, because postponing conflict detection would allow transactions to perform more computation that would eventually be rolled back. In contrast, optimistic concurrency control assumes low conflict-rate scenarios and defers conflict detection and resolution. Such approaches do not eagerly obtain exclusive access to shared-memory regions at access time. If conflicts are rare, an optimistic strategy allows more concurrency, given that multiple transactions can access the same data, as ownership is acquired only at a later time. Nevertheless, lazy conflict detection allows two or more transactions to attempt to commit conflicting updates. Therefore, the system must be able to decide which transactions are allowed to commit, thus resolving the conflict by aborting the other offending transactions. In practice, conflict detection and resolution can happen at any point between the shared-memory access time and the transaction's commit time. As a result, optimistic approaches are more flexible in this respect and enable the usage of a more diverse set of techniques.
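To make the contrast concrete, the minimal C++ sketch below shows, assuming word-level granularity, how a pessimistic write acquires ownership at access time while an optimistic read merely records the version it observed for later validation. All names (VersionedLock, pessimistic_write, optimistic_read, ReadEntry) are illustrative and not part of any specific TM system.

    #include <atomic>
    #include <cstdint>
    #include <vector>

    // One versioned lock per memory word; the least-significant bit marks
    // the word as owned (locked) by a writer. Purely illustrative metadata.
    struct VersionedLock { std::atomic<uint64_t> word{0}; };

    // Pessimistic (eager) control: ownership is acquired at access time,
    // so a conflict with another owner is detected and resolved on the spot.
    bool pessimistic_write(VersionedLock& lock, int& location, int value) {
        uint64_t v = lock.word.load();
        if ((v & 1) || !lock.word.compare_exchange_strong(v, v | 1))
            return false;            // conflict: word already owned, abort
        location = value;            // safe: exclusive access is held
        return true;
    }

    // Optimistic (lazy) control: no ownership is taken; the read only
    // records the version it saw, to be validated at a later time.
    struct ReadEntry { VersionedLock* lock; uint64_t seen_version; };

    int optimistic_read(VersionedLock& lock, const int& location,
                        std::vector<ReadEntry>& read_set) {
        uint64_t v = lock.word.load();        // observed version
        int value = location;                 // speculative read
        read_set.push_back({&lock, v});       // checked before commit
        return value;
    }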

3.1.2 Version Management

Transactions increase the exploitation of parallelism through speculative execution. Nevertheless, the system must guarantee consistent execution of shared-memory updates. Therefore, TM implementations must manage different versions of data: the speculative version, which is the result of non-committed updates of in-flight transactions, and the current, or globally committed, version visible to all transactions. As with concurrency control, there are two strategies for version management. Lazy version control stores all transactional updates in a private log structure (redo-log) and keeps shared memory always consistent with respect to the last committed transaction. During commit, the redo-log is atomically drained to shared memory. Lazy approaches perform well when conflicts are frequent because an abort does not perform any operations on shared memory, only requiring the redo-log entries to be discarded. Notwithstanding, the main disadvantage of redo-logs is the extra write operations required to update shared memory on commit. The alternative to lazy versioning is to employ eager version control. Eager versioning stores the speculative version of objects directly in shared memory and records the last consistent state of each object in a private log (undo-log). The undo-log is used to roll back shared memory to a previous state consistent with the last committed transaction. This rollback operation might significantly hurt performance if conflicts are frequent. In contrast, transactional commit with eager versioning has low cost, because no further shared-memory update is required. Nevertheless, as speculative updates are written directly to shared memory, only a single speculative version exists at any point in time. As a result, eager versioning cannot be coupled with lazy concurrency control. Lazy versioning, on the other hand, is suited for both eager and lazy concurrency control.
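As a rough illustration (hypothetical structures, not a particular STM), the sketch below contrasts the cost profile of the two strategies: with a redo-log, abort is cheap and commit pays extra writes; with an undo-log, commit is cheap and abort pays the rollback. Atomicity of the commit drain and conflict detection are elided.

    #include <unordered_map>

    // Lazy versioning: speculative writes go to a private redo-log; shared
    // memory remains consistent with the last committed state until commit,
    // when the log is drained (the atomicity of the drain is elided here).
    struct RedoLogTx {
        std::unordered_map<int*, int> redo;
        void write(int* addr, int value) { redo[addr] = value; }
        int read(int* addr) {                    // a read must see own writes
            auto it = redo.find(addr);
            return it != redo.end() ? it->second : *addr;
        }
        void commit() { for (auto& [a, v] : redo) *a = v; }  // extra writes
        void abort()  { redo.clear(); }          // cheap: nothing to undo
    };

    // Eager versioning: speculative writes go directly to shared memory and
    // the previous value is saved in a private undo-log for rollback.
    struct UndoLogTx {
        std::unordered_map<int*, int> undo;
        void write(int* addr, int value) {
            undo.emplace(addr, *addr);           // save the old value once
            *addr = value;                       // speculative, in place
        }
        void commit() { undo.clear(); }          // cheap: memory is current
        void abort()  { for (auto& [a, v] : undo) *a = v; } // costly rollback
    };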

3.1.3 Conflict Detection and Resolution

Conflict detection is central to a TM system's performance. A detection mechanism is defined by two attributes: detection granularity and instant of detection. Granularity defines the smallest unit of conflict detection, which is usually a cache line in HTMs and a memory word in STMs. Conflicts can be detected either at the moment of the first access to shared memory (eager detection), periodically during the execution of a transaction (incremental validation) or at commit time (lazy detection). The conflict detection mechanism must try to minimize wasted work due to conflict aborts. For instance, employing coarse-grained detection may be more efficient than a fine-grained approach, considering the cost of each check. However, coarse-grained detection can result in false positives and severely hurt performance. In practice, it is hard to know beforehand which is the best alternative for an application. The conflict rate is usually application dependent and, even for a given application, can be input dependent. For example, let us assume an application that inserts records in a sorted list. If insertions and deletions are performed in different parts of the list, then conflicts will be rare. In contrast, modify operations upon a common record will exhibit a high conflict rate. Such issues can be attenuated by either a hybrid mechanism (e.g. lazy detection for reads and eager detection for writes) or an adaptive approach, guided by online profiling of transactions [101].
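A minimal sketch of lazy (commit-time) detection at word granularity follows; with a coarser granularity, several words would share one version counter, making each check cheaper at the cost of false positives. The types and names are hypothetical.

    #include <atomic>
    #include <cstdint>
    #include <vector>

    // One version counter per detection unit (here, one per word); every
    // committed writer increments the counters of the locations it updated.
    struct Versioned { std::atomic<uint64_t> version{0}; };

    // What a transactional read records for later validation.
    struct ReadEntry { Versioned* loc; uint64_t seen; };

    // Lazy detection: at commit time, the transaction is consistent only if
    // every location it read still carries the version it observed.
    bool validate_read_set(const std::vector<ReadEntry>& read_set) {
        for (const auto& e : read_set)
            if (e.loc->version.load() != e.seen)
                return false;        // conflicting update detected: abort
        return true;                 // no conflicts: safe to commit
    }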

Future conflict aborts should be avoided through contention management policies. A plethora of policies exists not only to resolve conflicts, but also to reduce the likelihood of future aborts [77, 74, 108, 86]. The choice of contention management policy is also guided by the profile of each application, as is the choice of conflict detection mechanism. It is still an open problem how to choose a policy for maximum effectiveness, given that applications are usually comprised of phases with different behaviors and memory access profiles [96]. Perhaps the simplest policy to resolve conflicts is to choose one transaction among the conflicting ones to proceed, while aborting the others. For example, the transaction which detected the conflict, in a causal order, could abort itself. Such a passive policy is commonly known as suicide. Transactional suicide resolves the conflict but does not prevent future conflicts from happening. A backoff policy, which causes the aborted transaction to wait before retrying, also resolves conflicts and might avoid future ones. The idea is to separate conflicting transactions in time, such that they do not execute fully in parallel. The wait time is usually picked at random from an interval that increases either linearly or exponentially with every successive abort.
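A minimal sketch of randomized exponential backoff follows; the cap, the time unit and the shape of the retry loop (tx_attempt) are illustrative choices, not prescribed by any particular TM system.

    #include <algorithm>
    #include <chrono>
    #include <random>
    #include <thread>

    // Randomized exponential backoff: the upper bound of the wait interval
    // doubles with each successive abort, separating conflicting
    // transactions in time.
    void backoff(unsigned num_aborts) {
        static thread_local std::mt19937 rng{std::random_device{}()};
        unsigned max_us = 1u << std::min(num_aborts, 16u);  // capped window
        std::uniform_int_distribution<unsigned> wait(0, max_us);
        std::this_thread::sleep_for(std::chrono::microseconds(wait(rng)));
    }

    // Typical use in a retry loop (tx_attempt is a hypothetical function
    // that runs the transaction once and reports whether it committed):
    //   for (unsigned aborts = 0; !tx_attempt(); ++aborts) backoff(aborts);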

3.2 Semantics for Transactions

A simple and well-defined programming model is key to ease the understanding, writing and reuse of correct and efficient programs. With clear semantics it is easier to design and use tools to code and debug programs. Many proposals of transactional semantics for TM exist, although none of them is a consensus among researchers. Transactional semantics must precisely describe the expected behavior of its operations when executed concurrently, as well as their behavior per se. The semantics of database transactions was initially considered to build a theory of TM. However, due to the inherent dissimilarities of both kinds of transactions, a new or extended semantics is required. For instance, in the context of transactional objects, linearizability (see Section 2.1.3) defines that transactional updates on an object must appear to take effect atomically, at a point between the start and commit of a transaction. Nevertheless, linearizability does not allow the definition of a meaning for each sequential order produced by the possible interleavings of each transaction. In essence, linearizable transactional objects do not define a sequential semantics for transactions.

Serializability can be understood as a coarse-grained incarnation of linearizability. While the latter defines the sequential semantics of individual objects, the former defines a sequential semantics for transactions as a whole. It states that if there exists a transition from state $s_1$ to $s_2$ when transactions $T_a$ and $T_b$ execute concurrently, the same transition must exist if $T_a$ and $T_b$ execute sequentially, one after the other. Serializability can be further strengthened by requiring that if $T_a$ completes before $T_b$, then, in the equivalent serial order, $T_b$ must be preceded by $T_a$. This stronger version is known as strict serializability. It forbids invalid executions, such as when $T_a$ starts before $T_b$ but $T_b$ commits before $T_a$: in this execution, $T_b$ will not receive the updates of $T_a$, regardless of the fact that $T_b$ started afterward. Even in its stronger version, serializability does not define the semantics of interactions between transactional and non-transactional operations. Those interactions can be defined with linearizability by considering a transaction as a whole as an indivisible step. In other words, transactional and non-transactional operations must appear atomic to each other.

All the above correctness criteria consider only in-flight and committed transactions, leaving the behavior of aborted transactions undefined. Guerraoui and Kapałka [87] defined the correctness property of opacity to also take aborted transactions into account. Opacity is strict serializability in which aborted transactions also appear in the serial order, but their effects do not.

Recently, Bushkov et al. [13] presented an impossibility result stating that transactions in TM cannot simultaneously be disjoint-access parallel (Parallel), weak adaptive consistent (Consistent) and obstruction-free (Live), known as the Parallel-Consistent-Live (PCL) Theorem. Disjoint-access parallelism guarantees that transactions that do not conflict on their high-level objects do not conflict on their underlying components. Weak adaptive consistency is weaker than the snapshot-isolation criterion. As Bushkov et al. argue, the impossibility result can be circumvented by weakening one of the three properties, e.g. replacing obstruction-freedom with a blocking liveness progress guarantee. Although not part of Bushkov et al.'s results, the PCL impossibility seems to be at the heart of all concurrent programming problems.

3.3 Brief History on Transactional Code Generation

Harris et al. [102] pioneered language-level support for transactional memory by building upon Hoare's CCRs (Conditional Critical Regions) [50]. Harris et al. proposed a simple block-based construct to specify sequences of statements that should be executed atomically. Atomic blocks are defined with the atomic keyword, as Code 3.1a shows. Optionally, an expression can be supplied to atomic to enable conditional execution of the corresponding block (see Code 3.1b). Shortly after, Harris et al. [103] identified some inadequacies in the direct usage of library-based STMs in the runtime for language-level support. The proposed solution was to allow direct access to transactional objects, thus avoiding overhead due to shadow copies and additional levels of indirection. In addition, Harris et al. enabled classical optimizations (e.g. code motion) by decoupling metadata and object operations. For instance, metadata operations can usually be hoisted from loops.

public void insert(int x) {
    atomic {
        buffer[items] = x;
        items++;
    }
}

(a) Atomic Transaction

public int get() {
    atomic (items != 0) {
        items--;
        return buffer[items];
    }
}

(b) Conditional Transaction

Code 3.1: Examples with Harris et al.'s atomic construct

In order to reduce the number of transactional barriers, a probabilistic log-filtering technique was used. Unnecessary log operations for newly allocated objects are eliminated through static analysis. In this direction, Adl-Tabatabai et al. [5] proposed JIT-based techniques to eliminate unnecessary transactional barriers. Their work proposed the elimination of barriers for immutable objects (e.g. the String class in Java) and also for heap objects that are local to the transaction. While both Harris's [103] and Adl-Tabatabai's [5] works have shown good results, they employ techniques that are more effective and viable in managed programming languages.
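The C++ sketch below illustrates why stores to a transaction-local ("captured") object need no barriers until the object becomes reachable from shared memory; the transaction delimiters are placeholders, since the concrete syntax depends on the TM system.

    struct Node { int value; Node* next; };

    // Inside a transaction, a newly allocated node is private until it is
    // published, so its initializing stores need no write barriers.
    void insert_in_tx(Node*& shared_head, int x) {
        // --- begin transaction (syntax depends on the TM system) ---
        Node* node = new Node();   // transaction-local allocation
        node->value = x;           // no barrier: object is provably private
        node->next  = shared_head; // read barrier needed for shared_head
        shared_head = node;        // write barrier: publishes the node
        // --- end transaction ---
    }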

Wang et al. [19] were among the first to consider transactional code generation and optimizations for unmanaged languages. Wang et al. implemented two transactional constructs as pragma directives in a variant of Intel's C Compiler (ICC) known as the Intel C++ STM Compiler. Atomic blocks were defined using #pragma tm_atomic, similar to Harris et al.'s atomic blocks shown in Code 3.1. Also, transactional-safe functions were explicitly annotated with #pragma tm_function. The ICC modified by Wang et al. also extends classical code analyses and optimizations (e.g. partial redundancy elimination) to reduce the number of transactional barriers, inline STM routines and optimize checkpointing operations. The work of Yoo et al. [88] was the first to precisely characterize the over-instrumentation problem in the Intel C++ STM Compiler. Yoo et al. extended the work of Wang et al. [19] in two aspects: first, with a comprehensive analysis of the performance problems with naïve compiler support for TM; second, by extending the Intel C++ STM Compiler with a new pragma directive called tm_waiver. An atomic block or function annotated with #pragma tm_waiver will have all transactional barriers elided. Even though this annotation enables great performance improvements in some cases, its usage is limited by the fact that all, and not specific, barriers are elided. The TMElide annotation, proposed later in this work (Chapter 7), provides a natural way to express finer-grained control over transactional barriers. Wu et al. [110] present a more detailed discussion on the extension of conventional compiler analyses and optimizations for TM, in the scope of the IBM XL Compiler, for both managed environments (Java VM) and unmanaged languages.
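For illustration only, the sketch below combines the three directives named above in one translation unit; the exact spelling and placement rules accepted by the Intel C++ STM Compiler may differ in detail.

    // Callee that may be invoked from inside a transaction; the compiler
    // generates an instrumented clone for transactional call sites.
    #pragma tm_function
    void push(int* buffer, int* items, int x) {
        buffer[*items] = x;
        (*items)++;
    }

    void producer(int* buffer, int* items, int x) {
        #pragma tm_atomic
        {                      // every access here gets read/write barriers
            push(buffer, items, x);
        }
    }

    // All barriers elided: correct only if the function touches no data
    // shared with other transactions (the programmer asserts this).
    #pragma tm_waiver
    void update_private_stats(long* thread_private_counter) {
        (*thread_private_counter)++;
    }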

A study on capture analysis for managed languages was presented by Dragojević et al.
