The epidemics of programming language adoption

EMANOEL FRANCISCO SPÓSITO BARREIROS

THE EPIDEMICS OF PROGRAMMING LANGUAGE ADOPTION

Ph.D. Thesis

Federal University of Pernambuco posgraduacao@cin.ufpe.br www.cin.ufpe.br/~posgraduacao

RECIFE 2016

A Ph.D. Thesis presented to the Center for Informatics of the Federal University of Pernambuco in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Science.

Advisor: Sérgio Castelo Branco Soares

Co-Advisor: Jones Oliveira de Albuquerque

Cataloging at source

Librarian Monick Raquel Silvestre da S. Portes, CRB4-1217

B271e Barreiros, Emanoel Francisco Spósito

The epidemics of programming language adoption / Emanoel Francisco Spósito Barreiros. – 2016.

248 leaves: ill., fig., tab.

Advisor: Sérgio Castelo Branco Soares.

Thesis (Doctorate) – Universidade Federal de Pernambuco. CIn, Computer Science, Recife, 2016.

Includes references and appendices.

1. Software engineering. 2. Diffusion of innovations. 3. Epidemiological models. I. Soares, Sérgio Castelo Branco (advisor). II. Title.

005.1 CDD (23. ed.) UFPE- MEI 2016-120

Emanoel Francisco Spósito Barreiros

The Epidemics of Programming Language Adoption

Thesis presented to the Graduate Program in Computer Science of the Federal University of Pernambuco, in partial fulfillment of the requirements for the degree of Doctor in Computer Science.

Approved on: 29/08/2016

Prof. Sérgio Castelo Branco Soares

Thesis Advisor

EXAMINING COMMITTEE

Prof. Dr. André Luis de Medeiros Santos

Centro de Informática / UFPE

Prof. Dr. Fernando José Castor de Lima Filho

Centro de Informática / UFPE

Prof. Dr. Cláudio Tadeu Cristino

Departamento de Estatística e Informática / UFRPE

Prof. Dr. Oswaldo Gonçalves Cruz

Fundação Oswaldo Cruz

Prof. Dr. Jairo Rocha Faria

I dedicate this thesis to my baby son Davi. Now I can buy more milk.

Acknowledgements

I thank my parents, Rejane and Manoel, for providing all I ever needed to get here: a solid family. I am sure they did their best to support me, my brother, and my sister. This is more important than any other thing in life. Now, being a father myself, I know how hard it is to raise a child. Thank you for all your patience and endurance, because it was not easy. Thank you, Maurício and Manuela, for being my partners and helping me drive mom and dad nuts once in a while. Thank you, Thuanne, Igor, Jessica, Heline, Vinícius, Dulce and Marcos, for enriching my life with your talks, advice and all the moments we shared together. I hope I left a mark on your hearts the same way you left one on mine. Thank you, families (E)Spósito and Barreiros, for all the fun we had together. Some important people did not see me finish this PhD thesis, but this is for you too. I miss you, grandpas Midinho and Zito, aunt Eloíne, and uncle Rui. You left earlier than we wanted, but your mission was very well accomplished.

I also thank all the friends I made, some during and because of this thesis. I know I did not spend much time with you lately, but now you know why. I think now I will have some more spare time :)

I thank my wife Helaine, for the day-to-day support, for crying along in moments of sadness, for laughing out loud (the way only you know how to do), for the nights up, for pulling the bed sheet, for making me go to the grocery store twice because you forgot we also needed butter... I could keep writing forever, but most importantly, for allowing me to be a father. Because of that I wake up every day at 6 a.m. (some days earlier) with a smile on my face when I hear "daddy" from the room next door. Thank you, Davi, for making my days more colorful. I did not know I could feel so happy even being so tired. I hope some day you can read this and be proud of your dad. Most of this work was done after you were born, so you also have some credit here.

Last but not least, I thank Sérgio and Jones for the valuable advice and for being the great people they are, not only on the academic side, but in life. I know I had the best of you; keep up the good work, keep changing other people’s lives like you did with mine. Sérgio, I have always found you more a friend than an advisor, strictly speaking. I know it does not work for everybody, but it worked for us. Thank you for everything. Jones, I’ll try to start running, be prepared.

If you’re not having doubt, then you’re not pushing it hard enough, or you’re not looking at the details close enough. You need to be feeling that doubt every single day. —TONY FADELL, FOUNDER, NEST

Abstract

Context: In Software Engineering, technology transfer has been treated as a problem that concerns only two agents (innovation and adoption agents) working together to fill the knowledge gap between them. In this scenario, the transfer is carried out in a “peer-to-peer” fashion, not changing the reality of individuals and organizations around them. This approach works well when one is just seeking the adoption of a technology by a “specific client”. However, it cannot solve a common problem: the adoption of new technologies by a large mass of potential new users. In a wider context like this, it no longer makes sense to focus on “peer-to-peer” transfer. A new way of looking at the problem is necessary. It makes more sense to approach it as diffusion of innovations, where information spreads through a community in a way similar to that observed in epidemics.

Objective: This thesis proposes a paradigm shift to show that the adoption of programming languages can be formally addressed as an epidemic. This shift of focus allows the dynamics of programming language adoption to be mathematically modelled as such. Besides yielding models that explain the community’s behaviour when adopting programming languages, it allows some predictions to be made, helping both individuals who wish to adopt a new language that might become a new industry standard, and language designers who want to understand in real time the adoption of a particular language by a community.

Method: After a proof of concept with data from Sourceforge (2000 to 2009), data from GitHub (2009 to January 2016), a well-known open source software repository, and Stack Overflow (2008 to March 2016), a popular Q&A system for software developers, were obtained and preprocessed. Using cumulative biological growth functions, often used in epidemiological contexts, we fitted models to the data. With the fitted models in hand, we evaluated their predictive capabilities through repeated hypothesis tests and statistical calculations on different versions of the models, obtained by fitting the functions to samples covering different time frames from the repositories.

Results: We show that programming language adoption can be formally considered an epidemiological phenomenon by fitting a well-known mathematical function used to describe such phenomena. We also show that, using the models found, it is possible to forecast programming language adoption. Finally, we show that it is possible to gain similar insights by observing user data, as well as data from the community itself, not using software developers as susceptible individuals.

Limitations: The forecast of the adoption outcome (asymptote) needs to be taken with care because it varies depending on the sample size, which also influences the quality of forecasts in general. Unfortunately, we do not always have control over the sample size, because it depends on the population under analysis. The forecast of programming language adoption is only valid for the analysed population; generalizations should be made with caution.

Conclusion: Approaching programming language adoption as an epidemiological phenomenon allows us to perform analyses not possible otherwise. We can have an overview of a population in real time regarding the use of a programming language, which allows us, as innovation agents, to adjust our technology if it is not achieving the desired “penetration”; as adoption agents, we may decide, ahead of our competitors, to adopt a seemingly promising technology that may ultimately become a standard.

Keywords: Software Engineering. Technology Transfer. Diffusion of Innovations. Programming Languages Adoption. Epidemic Models. Computational Epidemiology.

Resumo (Portuguese abstract, translated)

Context: In Software Engineering, technology transfer has been treated as an isolated problem, a process that concerns two agents (the innovation and adoption agents) working together to fill a knowledge gap between them. In this scenario, the transfer is carried out “point to point”, involving and affecting only the individuals who take part in the process. This approach works well when one seeks only the adoption of the technology by a specific “client”. However, it cannot solve a very common problem: the adoption of new technologies by a large mass of potential new users. In this wider context, it no longer makes sense to focus on point-to-point transfer; a new way of looking at the problem is needed. It is more interesting to approach it as diffusion of innovations, where information spreads through a community in a way similar to what is observed in epidemics.

Objective: This doctoral thesis shows that the adoption of programming languages can be formally treated as an epidemic. This conceptual change in the way of looking at the phenomenon allows the dynamics of programming language adoption to be mathematically modelled as such. Besides yielding models that explain the community’s behaviour when adopting a programming language, it allows some forecasts to be made, helping both individuals who wish to adopt a new language that appears to be emerging as a new industry standard and language designers who want to understand in real time the adoption of a given language by the community.

Method: After a proof of concept with data from Sourceforge (2000 to 2009), data from GitHub (2009 to January 2016), a repository of open source software projects, and Stack Overflow (2008 to March 2016), a popular question-and-answer system for software developers, were obtained and preprocessed. Using a cumulative biological growth function, often used in epidemiological contexts, we obtained models fitted to the data. With the fitted models in hand, we evaluated their accuracy. We evaluated their forecasting capabilities through repeated applications of hypothesis tests and statistical calculations on different versions of the models, obtained by fitting the functions to samples of different sizes drawn from the data.

Results: We show that programming language adoption can be formally considered an epidemiological phenomenon by fitting a mathematical function known to be useful for describing such phenomena. We also show that it is possible, using the models found, to forecast the adoption of programming languages in a given community. Furthermore, we show that it is possible to reach similar conclusions by observing user data and community data only, not using software developers as susceptible individuals.

Limitations: The forecast of the upper limit of adoption (asymptote) is not reliable, varying greatly depending on the sample size, which also influences the quality of forecasts in general. Unfortunately, we will not always have control over the sample size, because it depends on the population under analysis. The forecast of programming language adoption is only valid for the population under analysis; generalizations should be made with caution.

Conclusion: Approaching programming language adoption as an epidemiological phenomenon allows us to perform analyses that are not possible otherwise. We can have an overview of a population in real time regarding the use of a programming language, which allows us, as innovation agents, to adjust the technology if it is not achieving the desired reach; as adoption agents, we may decide to adopt a seemingly promising technology that may become a standard.

Keywords: Software Engineering. Technology Transfer. Diffusion of Innovations. Programming Languages Adoption. Epidemiological Models. Computational Epidemiology.

Contents

1 Introduction
1.1 Problem Overview
1.2 Seminal Ideas and Preliminary Findings
1.2.1 Proof of Concept
1.3 Contributions
1.4 How to Read this Work

2 Background
2.1 Transfer and Diffusion of Innovations
2.1.1 Rogers’ Model of Diffusion
2.1.2 Redwine and Riddle’s Maturation Model of Adoption
2.1.3 Pfleeger’s Model of Technology Transfer
2.1.4 Gorschek et al.’s Model of Technology Transfer
2.2 The Social Aspects of Programming Language Adoption
2.3 Mathematical Epidemiology
2.3.1 Compartmental Models
2.3.2 Models for Biological Growth and Their Usage on Modelling of Epidemics
2.4 Model Fitting
2.4.1 Goodness of Fit and Hypothesis Testing
2.5 Chapter Summary

3 Materials and Methods
3.1 Data Extraction and Interpretation
3.1.1 Nature of the Data
3.1.2 Data Sources
3.1.3 Data Processing
3.1.4 Final Data Format
3.2 Fitting Models to the Data
3.2.1 Evaluating the Forecasting Capabilities of the Proposed Models
3.3 Residual-Based Correction Process

4 Results
4.1 Fit Results
4.2 Analysis of Residuals and Model Correction
4.4 Projects, Posts and Individuals as Source of Information
4.5 Inflection Point and Diffusion Milestones Analyses
4.6 Differences to the Month Dataset
4.7 The Data from Stack Overflow

5 Related Work
5.1 Inefficiencies in Technology Transfer: Theory and Empirics
5.2 Achieving Successful Technology Transfer
5.3 Technology Transfer: Why some Succeed and some don’t
5.4 Programming Language Adoption
5.5 Diffusion of Innovations and Epidemic Models on Non-Biological Phenomena
5.6 Chapter Summary

6 Discussion
6.1 Limitations
6.2 Future Work
6.2.1 New Functions
6.2.2 New Forecast Scenarios
6.2.3 Social Connections
6.2.4 Different Sources of Data
6.2.5 Diffusion of Innovations Theory and Technology Adoption in Computer Science
6.2.6 More Concepts from Epidemiology

References

Appendix

A The Systematic Mapping Study Protocol
A.1 Team
A.2 Introduction
A.3 Scope of the Study
A.4 Research Questions
A.5 Search Process
A.5.1 Inclusion and Exclusion Criteria
A.6 Selection Process
A.7 Data Extraction
A.7.1 Form A
A.7.2 Form B
A.7.4 Data Synthesis

B Fit Parameters and Plots for the Preliminary Findings and Proof of Concept
B.1 Plots from Sourceforge (Preliminary Findings)
B.2 Fit Parameters from Sourceforge (Preliminary Findings)

C Data from GitHub
C.1 Week
C.1.1 Assembly
C.1.2 C
C.1.3 C++
C.1.4 C#
C.1.5 Dart
C.1.6 Go
C.1.7 Java
C.1.8 Javascript
C.1.9 Julia
C.1.10 Objective-C
C.1.11 PHP
C.1.12 Python
C.1.13 R
C.1.14 Ruby
C.1.15 Rust
C.1.16 Shell
C.1.17 Swift
C.2 Month
C.2.1 Assembly
C.2.2 C
C.2.3 C++
C.2.4 C#
C.2.5 Dart
C.2.6 Go
C.2.7 Java
C.2.8 Javascript
C.2.9 Julia
C.2.10 Objective-C
C.2.11 PHP
C.2.12 Python
C.2.13 R
C.2.14 Ruby
C.2.15 Rust
C.2.16 Shell
C.2.17 Swift

D Data from Stack Overflow
D.1 Week
D.1.1 Assembly
D.1.2 C
D.1.3 C++
D.1.4 C#
D.1.5 Dart
D.1.6 Go
D.1.7 Java
D.1.8 Javascript
D.1.9 Julia
D.1.10 Objective-C
D.1.11 PHP
D.1.12 Python
D.1.13 R
D.1.14 Ruby
D.1.15 Rust
D.1.16 Shell
D.1.17 Swift
D.2 Month
D.2.1 Assembly
D.2.2 C
D.2.3 C++
D.2.4 C#
D.2.5 Dart
D.2.6 Go
D.2.7 Java
D.2.8 Javascript
D.2.9 Julia
D.2.10 Objective-C
D.2.11 PHP
D.2.12 Python
D.2.13 R
D.2.14 Ruby
D.2.15 Rust
D.2.16 Shell
D.2.17 Swift

1 Introduction

An innovation, as defined by Rogers, is “an idea, practice, or object that is perceived as new by an individual or other unit of adoption” (1). Throughout his career, Rogers studied the diffusion of innovations mainly in the domain of agriculture, where innovations range from technological advances, such as novel machinery, to new planting techniques. From the Software Engineering (SE) point of view, a technological novelty (or simply technology, as this concept will be referred to from now on) can encompass a number of things: techniques, methods, languages, tools, paradigms, and procedures (2). Methods and techniques can be perceived as formal procedures for producing some result. A tool is an artifact (usually an instrument, a language, or an automated system) that is used to effectively accomplish an objective, and a procedure can be defined as an orchestration of tools and actions (defined by techniques or methods) that produces a product (3).

Constant innovation is strategic to most companies. Successful innovation often translates into commercial advantage, greater market share, more customers, higher profits, and so forth. However, the adoption of new technologies is risky for most of them. Adoption agents (those who are adopting new technology) might delay the adoption as much as they can, to avoid risks and/or maximize the return on investment in the short run, which is not a surprise. As a rule, if the transfer of a technology brings great benefit, it will be attempted.

Technological innovations have always played a very important role in our world. Since the earliest times, new tools and techniques have allowed people to produce more food, hunt larger animals, build better shelters, and so on. At the same time, other kinds of technologies in the form of weapons (or even strategic thinking) have allowed some groups of people to dominate others, gaining considerable influence over massive crowds.

The process of technology transfer is very complex and a technology may take a long time to be fully adopted in practice (4). The innovation diffusion process is one in which a specific innovation is communicated through certain channels over time among the members of a social system (1). From that definition, much can still be developed. For example, the innovation must be communicated through channels considered relevant to the community it is targeted to; otherwise, that target community is very unlikely to “find” the new technology. A study by Jedlitschka et al. (5) discusses some relevant sources of information for technology transfer. The study shows that articles in scientific journals and other kinds of literature are only the 10th and 11th most important information sources in a list of 11 possible sources of information. This defines a problem we (technology developers) need to address to make our creations adoptable by the wider public.

Currently, the practice of SE is concerned with models for efficient and effective technology transfer from academia to industry. Existing models usually focus on a very restricted environment, where research is carried out by specialized research labs (the innovation agent) in close contact with the organization (the adoption agent). To this end, some models of technology transfer have been proposed (1; 4; 3; 6). The model proposed by Rogers (1) is a general one, even though most of his research was done in the agricultural domain. The ones proposed by Redwine and Riddle (4), Pfleeger (3) and Gorschek et al. (6) are more closely related to the domain of software engineering and describe their approaches to the technology transfer issue.

The aforementioned models address a very specific scenario, frequently argued to be the ideal one (6; 7; 8), since the technological gap of a specific adoption agent is being directly addressed (technology being developed or customized for a client) by a dedicated group of researchers, which has a higher chance of success. When a technology is developed and “thrown into the wild”, the role of the technology developer (innovation agent) is different. The technology is still being “produced”, but not aiming at a specific adoption agent. The problem (the transfer), in this case, is harder to solve, since there are many factors that may hinder the adoption, e.g., the adoption timing, the return on investment (ROI) ratio, and specific characteristics that should be tweaked or fine-tuned for each customer. In this context, how can one understand the dynamics of adopting a technological innovation in a population?

Until now, the problem has only been addressed with technical issues in mind. Technology transfer models have been proposed, as well as processes, support tools, and so forth. We propose to look at the technology transfer problem differently. Philosophically, looking at a problem from a different perspective might influence the way solutions are designed. SE researchers are starting to argue that the field shares many similarities with the social sciences (9). In this light, technology transfer may also be viewed as a social phenomenon, in which several rules may be derived from empirical observation. The problem now shifts from technology transfer to technology (information) diffusion. This shift is possible because in social systems of information diffusion, word of mouth is the key channel for communication, which closely resembles the dynamics of disease spread in a community, where personal contact is often required for transmission (9; 5).

Hence, this thesis aims at describing technology diffusion, more specifically the adoption of programming languages, as a social epidemic, similar to the ones Malcolm Gladwell discusses in the book entitled The Tipping Point: How Little Things Can Make a Big Difference (10). He discusses several examples of social epidemics and the rules these phenomena are subject to. He explains, for example, how the criminality in the subways of New York (USA) went down after a small set of actions taken by the police department or how a company that produced a specific line of shoes went from near bankruptcy to stellar success.

The concept of epidemiology emerged in the medical sciences, where the dynamics of infectious diseases are studied within a population of susceptible individuals. Apart from the medical context, the dynamics of infectious diseases may be mathematically modeled (11), which gives us the ability to analyze and simulate a wide range of scenarios. With current computing tools, heavy simulations can be run, and even some kinds of forecasting can be performed (12). To show that the dynamics of epidemic disease spread and programming language adoption share similarities, Figure 1.1 (a) depicts the Ebola outbreak in 2014 and Figure 1.1 (b) shows the creation of new projects using Matlab on Sourceforge from 2000 to 2009. The main shared features are the slow start of the spread, which quickly builds up because of its exponential nature, and the progression to an asymptote after the inflection point (peak of infection).

Figure 1.1: Ebola vs MATLAB. (a) Ebola outbreak in 2014. (b) New projects using Matlab, 2000 to 2009 (months). Source: The author.
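As an illustration of the curve shape described above, the following minimal sketch evaluates a plain logistic function. The function, the parameter names (K for the asymptote, r for the growth rate, t0 for the inflection time) and their values are assumptions chosen for illustration only; they are not fitted to the Ebola or Sourceforge data. The sketch shows the slow start, the peak of new cases around the inflection point, and the approach to the asymptote.

```python
import math


def logistic(t, K, r, t0):
    """Cumulative logistic growth: K is the asymptote, r the growth rate,
    t0 the time of the inflection point (all hypothetical here)."""
    return K / (1.0 + math.exp(-r * (t - t0)))


# Hypothetical parameters, not taken from the thesis data.
K, r, t0 = 1000.0, 0.15, 60.0

# Cumulative number of cases for months 0..120.
cumulative = [logistic(t, K, r, t0) for t in range(0, 121)]

# New cases per month are the differences of the cumulative series.
new_per_month = [b - a for a, b in zip(cumulative, cumulative[1:])]

# The month with the most new cases lies next to the inflection point t0.
peak_month = max(range(len(new_per_month)), key=new_per_month.__getitem__)

print(f"first month: {cumulative[0]:.2f} cases (slow start)")
print(f"peak of new cases in month {peak_month} (inflection near t0 = {t0})")
print(f"last month: {cumulative[-1]:.2f} cases (asymptote K = {K})")
```

The differences of the cumulative series peak where the cumulative curve changes from accelerating to decelerating, which is the sense in which the inflection point corresponds to the peak of infection.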

Considering the social aspects of programming language adoption and the mathematical modeling of the epidemic disease spread dynamics, we define the scope, assumption, and hypothesis of this thesis as follows:

Scope. The scope of this work comprises the analysis of programming language adoption by software developers, more specifically, users of open source software repositories like GitHub (https://github.com), and users of Stack Overflow (http://stackoverflow.com/). We chose those two platforms due to the easy access to their data. GitHub’s free repositories are open to the community. Stack Overflow, as part of the modern developer culture, also provides valuable information to add to our analysis.

Hypothesis. Biological growth models can describe the dynamics of programming language adoption by users of open source software hosting sites. With a satisfactorily fitted model, it is possible not only to describe the dynamics of programming language adoption but also, to some extent, to forecast them.

Figure 1.3 summarizes this research’s framework. Blue boxes with sharp edges represent knowledge that is somehow being used by or inspiring this research. Dashed lines connect these boxes, representing relations among them. Each relation is labelled with its nature. Solid lines indicate direct actions of our research, the green round box is our main goal, and the round yellow box is our area of concentration.

Figure 1.3: Research framework. [Diagram. Boxes: PhD Thesis; Technology Transfer/Diffusion in SE; Software Engineering; Programming Language Adoption; Epidemic Growth Models; Mathematical Epidemiology; Epidemiology; Diffusion of Innovations; Social Aspects. Relation labels: better understands; inspired by; powered by; knowledge area; describes, explains, and forecasts; incorporates; studies, explains, and describes; builds upon; extends.] Source: The author.

1.1 Problem Overview

The development of programming languages is a challenging task. It is challenging because it is hard to weigh the features that should be available based on usefulness and user requests, and to make design decisions regarding syntax, type systems, compilation time, and so forth. Also, once the language is released, it is always hard to get it adopted by the target population. The target population concept is important because not all languages are general-purpose; many are niche languages or designed to solve a specific problem relevant to a specific slice of the whole community of software developers.

From the adopter’s point of view, it is also hard to decide whether and when a language will be adopted. Several aspects of the language have to be evaluated, such as simplicity, relative advantage over the technologies currently in use, cost, risk, return on investment (ROI), and so on. This partially explains why the adoption curve starts slowly (see Figure 1.1(b)); few people are willing to try a technology that is not yet in use. They are called the innovators (1) and account for approximately 2.5% of the population.

Also, it is hard to evaluate the dynamics of adoption after a language has been released to the public. How do we evaluate if the language is being used as expected or how do we determine if its adoption process has gained momentum? Is it accelerating or decelerating? What is the predicted number of users of the language when it is widely adopted? These questions can be answered by using mathematical models that incorporate epidemiological concepts, briefly described next.

1.2 Seminal Ideas and Preliminary Findings

The main question that motivated this work was: is there a way to understand the dynamics of programming language adoption by a group of software developers? Understanding a system is the primary step when one wants to start manipulating it. After reading Meyerovich et al.’s work (9; 13; 14) it became clear that the adoption of programming languages was a social phenomenon. The authors approached the problem from a social perspective only, though. Malcolm Gladwell’s “The Tipping Point” (10) introduced the notion that simple ideas could go “viral”, with the dynamics of the phenomenon influenced mostly by social factors. The ideas discussed by Gladwell include the roles of specific kinds of individuals, the relation of these individuals to the community, the kind of information being communicated and the channels used to communicate the information. These concepts greatly resemble a biological system when viewed through the lens of epidemic theory.

In this context, diseases are very similar to the information that is being transmitted throughout the community. More importantly, the information transfer requires social interaction among individuals. We already knew that this feature of innovation diffusion could also exist in Software Engineering due to the work of Jedlitschka et al. (5). They discovered that, for a software developer, the most important source of information for adopting a technology was a good experience and recommendation by a colleague. Nowadays, more than ever, software developers work collectively and collaboratively. The suggestion that the adoption of programming languages could be modelled as an epidemic seemed to make sense. The main idea behind this is that open source projects and software developers could be mapped as susceptible individuals, while programming languages could be viewed as infective agents. Hence, programming language adoption can be characterized as a disease spreading in a community. Table 1.1 summarizes this concept mapping.

It is not possible for us to track the incidence or the incidence rate because, with the current approach, we are not able to estimate the number of potential adopters beforehand. Since we cannot estimate the susceptible population, we had to use cumulative biological growth models, which are concerned only with the prevalence, instead of the classical compartmental models.
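To make the kind of series used by a cumulative growth model concrete, the sketch below turns counts of newly observed adopters per month into a cumulative count. The numbers are hypothetical and are not thesis data; the point is only that this series can be built from observed cases alone, without knowing the size of the susceptible population.

```python
from itertools import accumulate

# Hypothetical counts of newly observed adopters per month (not thesis data).
new_adopters_per_month = [2, 3, 5, 9, 14, 20, 22, 18, 11, 6, 3, 1]

# Cumulative number of adopters observed up to and including each month.
# No denominator (number of potential adopters) is needed to build it.
cumulative_adopters = list(accumulate(new_adopters_per_month))

for month, total in enumerate(cumulative_adopters, start=1):
    print(f"month {month:2d}: {total:3d} adopters so far")
```

A rate "per potential adopter", as in the incidence and prevalence rate rows of Table 1.1, would additionally require dividing by the size of the susceptible population, which is the quantity that cannot be estimated beforehand.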

We then had to assess if the idea indeed made sense and we could continue the research. Meyerovich et al. made available a large dataset regarding the adoption of programming languages (14). The data includes, among other things, information about projects on Sourceforge (https://sourceforge.net/). We decided to use this dataset to run a preliminary study as a proof of concept.

Table 1.1: Epidemiological concepts and their mapping to this work.

Concept | Analogy
Incidence | Number of individuals that use a given programming language per potential adopter
Incidence rate | Incidence per unit of time (month or week)
Prevalence | The number of individuals (cases) that use a given programming language at one time
Prevalence rate | The number of cases per potential adopter
Infectious agent | Programming language
Susceptible | A potential adopter of a given programming language
Infected | Individual that has been identified as user of a given programming language
Transmission route | Word of mouth
Symptom | To participate in open source projects on GitHub that use a given programming language or post questions on Stack Overflow tagged with a given programming language

1.2.1 Proof of Concept

We started by processing the data and transforming it into the format required by the fitting algorithm (Chapter 3 details how we processed the files). The data ranges from 2000 to 2009, and the following languages have been analyzed: C, C++, C#, Java, Javascript, Matlab, Objective-C, Perl, PHP, Python and Ruby. Most of them are in the top 10 of the TIOBE Index language ranking (http://www.tiobe.com/tiobe_index). We also ranked the languages on Sourceforge from our dataset; the result is shown in Figure 1.4.

The next step was to fit models to the processed data. If we could successfully fit models to the data, it would be enough evidence to continue the research. We experimented with a few models, such as Gompertz (15), Koya-Goshu (16), Weibull (17), von Bertalanffy (18) and Richards (19). They all had very similar performance (AIC (20) and R² were very close), so it did not make sense to use several models simultaneously. We then chose the one that seemed most straightforward and had been used extensively elsewhere (21), Richards (details on the Richards function can be found in Section 2.3.2.1).
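As an illustration of how two candidate curves can be compared through R² and AIC, the sketch below evaluates a Richards curve and a logistic curve against synthetic data. Several things here are assumptions: the parametrisation of the Richards function is one common form and may differ from the one in Section 2.3.2.1, the AIC is the usual least-squares variant (n ln(RSS/n) + 2k), the data are generated rather than taken from the Sourceforge dataset, and no fitting is performed (parameters are fixed by hand).

```python
import math


def richards(t, K, r, t_m, nu):
    """Richards (generalised logistic) curve, one common parametrisation.

    K: upper asymptote, r: growth rate, t_m: inflection time, nu: shape.
    With nu = 1 this reduces to the logistic curve.
    """
    return K / (1.0 + nu * math.exp(-r * (t - t_m))) ** (1.0 / nu)


def sum_sq_residuals(observed, predicted):
    return sum((o - p) ** 2 for o, p in zip(observed, predicted))


def r_squared(observed, predicted):
    mean = sum(observed) / len(observed)
    ss_tot = sum((o - mean) ** 2 for o in observed)
    return 1.0 - sum_sq_residuals(observed, predicted) / ss_tot


def aic_least_squares(observed, predicted, n_params):
    """AIC for least-squares fits, up to an additive constant."""
    n = len(observed)
    return n * math.log(sum_sq_residuals(observed, predicted) / n) + 2 * n_params


# Synthetic "observations": a Richards curve with a small deterministic wobble.
months = list(range(80))
observed = [richards(t, 1000.0, 0.15, 40.0, 0.5) * (1.0 + 0.01 * math.sin(t))
            for t in months]

richards_pred = [richards(t, 1000.0, 0.15, 40.0, 0.5) for t in months]
logistic_pred = [richards(t, 1000.0, 0.15, 40.0, 1.0) for t in months]

r2_richards = r_squared(observed, richards_pred)
aic_richards = aic_least_squares(observed, richards_pred, n_params=4)
aic_logistic = aic_least_squares(observed, logistic_pred, n_params=3)

print("R2 (Richards):", round(r2_richards, 4))
print("AIC Richards vs logistic:", round(aic_richards, 1), round(aic_logistic, 1))
```

A lower AIC indicates the preferred model. When the values are very close, as reported above for the real data, the criterion gives little reason to prefer one model over another.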

We then proceeded with the fitting of the model to the data and the evaluation. The plots for the analysed languages can be found in Figures 1.5 and 1.7.

The results seemed promising after the visual analysis of the plots. The goodness-of-fit measures we calculated provided more evidence for us to believe the research was promising. They are shown in Table 1.2. Values close to 1.0 for both the coefficient of determination (R²) and the p-value of the Székely Energy test indicate good fits.

Figure 1.4: Sourceforge languages ranking from 2000 to 2009.

Source: The author.

Figure 1.5: Plot of observed values (abscissae) vs predicted values (ordinate), part 1.

(a) Fit plot for Java. (b) Fit plot for JavaScript.

(c) Fit plot for Matlab. (d) Fit plot for Python.

Source: The author.

The analysis also includes the evaluation of the residuals. In general, residuals are expected to have a mean close to zero, to be randomly distributed around the mean, to cluster towards the middle of the plot, and to display no clear patterns. We calculated the studentized residuals for the fitted models and obtained the results shown in Figures 1.9 and 1.11.
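The residual checks described here can be sketched in a few lines. Note the simplifications: the code computes standardized residuals (each residual divided by the residual standard error), which ignore the leverage correction used in proper studentized residuals; the 2-standard-deviation threshold and the sign-change count are simple heuristics; and the observed and predicted values are hypothetical.

```python
import math


def standardized_residuals(observed, predicted, n_params):
    """Residuals divided by the residual standard error (no leverage term)."""
    residuals = [o - p for o, p in zip(observed, predicted)]
    dof = len(residuals) - n_params
    sigma = math.sqrt(sum(e * e for e in residuals) / dof)
    return [e / sigma for e in residuals]


def residual_summary(std_res, threshold=2.0):
    """Mean, indices of points beyond the threshold, and number of sign changes.

    Few sign changes relative to the series length hint at a systematic
    pattern rather than random scatter around zero.
    """
    outliers = [i for i, e in enumerate(std_res) if abs(e) > threshold]
    sign_changes = sum(1 for a, b in zip(std_res, std_res[1:]) if a * b < 0)
    return {
        "mean": sum(std_res) / len(std_res),
        "outliers": outliers,
        "sign_changes": sign_changes,
    }


# Hypothetical observed and predicted values for a 4-parameter model.
observed = [10, 12, 15, 19, 24, 30, 37, 45, 54, 80]
predicted = [11, 12, 14, 18, 25, 31, 36, 44, 55, 60]

summary = residual_summary(standardized_residuals(observed, predicted, n_params=4))
print(summary)
```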

Figure 1.7: Plot of observed values (abscissae) vs predicted values (ordinate), part 2.

(a) Fit plot for Ruby. (b) Fit plot for Objective-C.

(c) Fit plot for C#. (d) Fit plot for C++.

(e) Fit plot for Perl. (f) Fit plot for PHP.

(g) Fit plot for C.

Source: The author.

The models are in line with the first property, having only a few outliers (points that are more than 2σ, in absolute value, away from the mean, where σ is the standard deviation), and the residuals are also clustered in the middle of the plot (second property). However, patterns emerge from the plots, with most models having a similar shape for the residuals, which indicates that they do not give the best fit throughout the whole series. Most models, though, present a better residual plot towards the end of the dataset. In practice, this analysis indicates that it is possible to improve the models, even though we do believe that they provide a good fit according to the other measures discussed earlier. This preliminary analysis shapes most of our final approach, which is discussed in detail in Chapter 3. All plots and fit parameters for the preliminary findings are available in Appendix B.

Table 1.2: Results for the goodness of fit test for the Richards model.

| Language | R² | Székely Energy statistic | Székely Energy p-value | Rejects H0? |
|---|---|---|---|---|
| C | 0.9988 | 1297.63 | 0.987 | No |
| C++ | 0.9989 | 1433.59 | 0.996 | No |
| C# | 0.9996 | 324.72 | 0.999 | No |
| Java | 0.9988 | 1793.37 | 0.993 | No |
| Javascript | 0.9993 | 405.01 | 0.995 | No |
| Matlab | 0.9994 | 15.13 | ≈1.0 | No |
| Objective-C | 0.9991 | 56.21 | 0.998 | No |
| Perl | 0.9987 | 482.55 | 0.985 | No |
| PHP | 0.9990 | 942.95 | 0.996 | No |
| Python | 0.9991 | 420.53 | 0.997 | No |
| Ruby | 0.9975 | 187.75 | 0.813 | No |

Figure 1.9: Plots of studentized residuals, part 1.

(a) Language C (b) Language C++

(c) Language C# (d) Language Java

Source: The author.

1.3 Contributions

We introduce the idea of treating the dynamics of programming language adoption as an epidemic phenomenon, which enables a series of analysis and forecasting features. A biological growth function was fitted to the data, and a means for evaluating its forecasting capabilities was devised.

Figure 1.11: Plots of studentized residuals, part 2.

(a) Language Javascript (b) Language Matlab

(c) Language Objective-C (d) Language Perl

(e) Language PHP (f) Language Python

(g) Language Ruby

Source: The author.

By using the aforementioned models, it is possible to do real-time analysis of the adoption dynamics. This kind of analysis allows us to understand whether the adoption is still in the exponential phase or is already decelerating towards an asymptote. The models also allow forecasting for a limited time range in the future, which can drive adoption by supporting decisions about which programming language to adopt.
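One rough way to tell the two phases apart directly from a cumulative series is to compare recent growth with earlier growth. The sketch below is a simple heuristic written for illustration, not the procedure used in this thesis; with a fitted sigmoid curve, the position relative to the inflection point plays the same role. The function name, the window size and the synthetic logistic series are assumptions.

```python
import math


def adoption_phase(cumulative, window=3):
    """Classify the latest part of a cumulative adoption series.

    Compares the mean increment over the last `window` periods with the mean
    increment over the `window` periods before that.
    """
    if len(cumulative) < 2 * window + 1:
        raise ValueError("need at least 2 * window + 1 observations")
    increments = [b - a for a, b in zip(cumulative, cumulative[1:])]
    recent = sum(increments[-window:]) / window
    previous = sum(increments[-2 * window:-window]) / window
    return "accelerating" if recent > previous else "decelerating"


# Synthetic logistic series with its inflection point at t = 10.
curve = [1000.0 / (1.0 + math.exp(-0.5 * (t - 10))) for t in range(21)]

early_phase = adoption_phase(curve[:9])   # observations before the inflection
late_phase = adoption_phase(curve)        # observations well past it
print(early_phase, late_phase)
```

On noisy real data a larger window, or a fitted curve, would be more robust than raw differences.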

The results of this work are important to several different kinds of individuals. First, they are a great source of information for developers who wish to know not only which languages are popular, but which ones are promising, which ones will last and which ones are already facing the community’s approval or disapproval. One example of this is Dart (https://www.dartlang.org/), a language developed by Google and released in November 2013, which is already seeing its usage decline on GitHub and Stack Overflow. Educators may also find our results and approach interesting because they might encourage them to try something different in class. For example, Dart will probably not be used in classes for the purpose of teaching new languages, since it is already facing the community’s lack of interest. It is probably much more interesting to teach Go (https://golang.org/) if the objective is to teach something new that has a promising future.

Second, it is also important for language developers because it may help them understand, in near real time, how the adoption of the language is progressing. Based on data from Stack Overflow, they could track the usage of specific features of the language, plan new releases, and so forth. The proposed approach can also be used to evaluate how different programming language features fare in general. Recently the code for all open source projects on GitHub has been released on Google’s BigQuery system. With a few adjustments, our approach could be used to track, for example, the usage of a specific Go module.

Third, our results can be used by managers in the decision making process of choosing a programming language to invest in. Managers might prefer long lasting technologies over the ones with lesser community appeal. It is important to say that all our claims have been made using open data, that is, data from the open source communities. The professional community might have different standards, but nowadays even the “big players” in industry have open sourced several of their products. Also, it is a very common practice for professionals to work on side projects that are frequently hosted by open source repositories, like GitHub. In addition, we use data from Stack Overflow, a relevant source of information for all kinds of developers worldwide, and definitely a part of the software developer culture, both for developers that work only on private and closed projects and for students and open source developers.

1.4 How to Read this Work

Chapter 2 brings the background required to fully understand this thesis. You can, however, skip a few sections if you feel comfortable with the concepts they are discussing.

Section 2.1 deals with the concepts of technology transfer and innovation diffusion. It discusses the concepts from a Software Engineering point of view, and is a good starting point for all readers to get a better understanding of the thesis’s context. We believe that neither this section nor Section 2.2 (The Social Aspects of Programming Language Adoption) should be skipped. In particular, Section 2.2 brings interesting ideas that we, software engineers, are not usually aware of. Section 2.3 brings the borrowed concepts from biology and Section 2.4 discusses some mathematical and statistical concepts we used in this thesis. If you have a standard computer science background, we advise you not to skip those sections. They will provide the knowledge necessary to make use of this thesis.

Chapter 3 discusses the materials and methods used to carry out this research. It describes the data, their sources, and the required pre-processing. The fitting of the models to the data is also discussed. Chapter 4 presents the results of the model fitting process, the statistics computed, and the evaluation of the models. The quality of the models is verified as well as their forecasting capabilities. Finally, Chapter 5 discusses related work and Chapter 6 presents the discussion and final considerations of this work and proposes further research based on the results.

2 Background

Technology transfer and diffusion have been studied for a while. The roots of diffusion theory trace back to Europe, where Gabriel Tarde started to discuss what he then called the “laws of imitation” in his 1903 book of the same title (22). He discusses, from a sociological and anthropological point of view, how and why some innovations are widely adopted, while others fail to reach the vast majority of users. He proposed the S-shaped curve that describes the time an innovation takes to be absorbed by the adopters, as seen in Figure 2.1.

Figure 2.1: The adoption rate of innovations as observed by Tarde.

Source: Rogers (1).

Georg Simmel, a contemporary of Tarde, was a philosopher and lecturer on Sociology at the University of Berlin. He devised the idea of a stranger in a social system: an individual who is a member of a system but is not strongly attached to it (23). Later, scholars on diffusion of innovations used his ideas and developed new concepts, e.g., communication networks, homophily and heterophily.

Research on this topic was carried out in isolation for about 60 years, meaning that researchers from different areas did not share knowledge with each other, even though the phenomenon of innovation diffusion is universal to science, a fact unknown to them at the time. Nowadays, Everett Rogers is known for his work of bringing together different views on the same subject and knocking down the barriers among different areas of knowledge. He started his work by studying the diffusion of innovations in the agricultural domain. He soon realized that the difficulties, solutions to problems and the model he was addressing were not specific to his domain. After the publication of a series of books (1), he became one of the most referenced researchers in diffusion of innovation studies.

Some of the concepts discussed in the following sections may not be directly used by this work, but are important for the reader to understand the context in which it is inserted. Also, it is important for the reader to understand how Software Engineering researchers approach the problem.

2.1 Transfer and Diffusion of Innovations

The concepts of diffusion and adoption of innovations are quite similar, but bear relevant differences. The adoption process is individual, that is, it concerns the steps taken by an adoption agent in order to adopt a given technology. For example, in Gorschek et al.’s model (6), the transfer process is centred on an individual (an organization), not a group of individuals. The study of diffusion is not concerned with how a technology is adopted, but with how it spreads throughout the community.

2.1.1 Rogers’ Model of Diffusion

According to Ryan and Gross (24), the decision about an innovation is not an immediate act. In fact, it is a process that comprises a series of steps. Based on that, Rogers devised his model (1), consisting of five stages:

1. Knowledge is obtained when an individual is first exposed to an innovation and its workings;

2. Persuasion occurs when an individual forms a favorable or unfavorable opinion towards the innovation;

3. Decision takes place when the individual acts towards a choice to adopt or reject the innovation;

4. Implementation occurs when the individual puts the innovation into practice;

5. Confirmation is the stage in which an individual seeks reinforcement about an innovation decision that has already taken place.

In the steps described above (and from now on), an individual may be a person or any other adoption agent, such as an organization. The innovation-decision process starts with the first contact of an individual with a new technology. This alone is a serious problem for technology creators: if a technology is not known, it cannot be applied. On top of that, there is the concept of selective perception, which states that individuals are more likely to be exposed to innovations that are consistent with their attitudes and beliefs, and when they feel the need for an innovation.

For Rogers, there are three types of knowledge: awareness-knowledge, how-to knowledge and principles-knowledge. Awareness-knowledge is related to what an innovation is. How-to knowledge is related to how an innovation can be used effectively. The more complex the technology, the greater the “amount” of how-to knowledge required. Principles-knowledge consists of information dealing with the functioning principles underlying how an innovation works. It is not always mandatory to have principles-knowledge to adopt an innovation, but it reduces the risks of misusing an idea.

At the persuasion stage, individuals form a favorable or unfavorable view of an innovation and become more psychologically involved with it. They look for information about the technology, decide which sources are credible, and decide how to interpret each piece of information. Third party evaluation is also important at this stage, since ideas that have already been tested by others in similar domains are more likely to perform similarly well in the user’s particular domain.

In the decision phase, individuals engage in activities that lead to the adoption or rejection of an innovation. Adoption means that the innovation will be fully used, while rejection means the opposite. Most innovators like the idea of trying the technology before its full adoption. The time to adopt an innovation can be reduced when a peer performs the trial phase. Sometimes a demonstration is enough to speed up the diffusion process. Regarding the rejection of innovations, there are two types:

1. Active rejection consists of considering the adoption of the innovation (including trial), but deciding not to adopt it;

2. Passive rejection consists of never really considering the use of the innovation.

In the implementation phase the idea is put into practice. It requires a fair amount of work to change the status quo and induce a change in behavior. In this phase, individuals are more concerned about operational details, such as “Where can I obtain the innovation?”, “How do I use it?” or “What problems may I encounter and how can I solve them?”. The implementation phase can take a long time and, eventually, the innovation is institutionalized. At this point, the innovation is no longer an innovation; it is part of the individual’s (or organization’s) culture.

In the confirmation phase individuals seek information that may change their mind about the innovation. Internal disequilibrium or an uncomfortable state of mind may induce a change in behavior, causing the adoption to be reversed. The decision to reject an innovation after a previous adoption is called discontinuance. There are two types of discontinuance: (i) replacement and (ii) disenchantment. The first type is characterized by the discontinuance of an innovation in favor of another (often better) one, while the second results from dissatisfaction with the innovation.

2.1.2 Redwine and Riddle’s Maturation Model of Adoption

Redwine and Riddle (4) analysed the transition of many different software technologies and tried to extract a model of maturation that can explain how they were transferred from academia to industry. They start by classifying technologies into four groups:

Major technology areas, such as metrics and compiler construction. Advancements in these areas require coordinated improvements in several other areas, many of them theoretical;

Technology concepts, such as abstract data types and structured programming. Technologies in this group are usually used to build other pieces of technology;

Methodology technology addresses how to most effectively create and evolve software and is a mixture of technical and managerial principles, practices and procedures. The authors call it a second-level technology that can be used to guide the use of other technologies;

Consolidated technology is also a second-level technology.

According to Redwine and Riddle, it is very hard to predict the evolution timeline of a specific technology, even if a similar technology is analysed, mainly because particular instances of technology transfer (technology + environment) are unique. They do, however, point out some factors that might help or hinder technology transfer. Their study shows that when one or more critical factors are not present, it is very likely that the technology transfer will fail. The technology must be well developed and present a good maturity level. This factor is named “conceptual integrity”. There can be no relevant question about the conceptual basis of a technology, otherwise it will slow down the process. The technology must also fill a well-defined and “recognized need”. In some cases this need might be articulated by a salesperson. It must be “tunable”, in the sense that it must be tailorable to the adopter’s specific context and needs. Reports on “prior positive experience” are also important, since they convey relevant information (mainly cost/benefit reports) about other adoption attempts of the technology. It is important to have the “higher management commitment”, meaning that management actively works to introduce the technology rather than opposing it. The last critical factor relates to “training”, especially when the technology introduces new concepts.

Even when the factors discussed above are present, there are some other factors that might inhibit the successful transfer. When a technology is adopted by an organization, it is done, at first, by a restricted group of people, and it will take some more time to be widely adopted by the organization. The technology should have a reasonable cost (time and money) to be adopted. If the adopting company perceives that the technology being adopted will not give it a competitive advantage over competitors, it might slow things down. Also, technologies that promise to change or automate processes that have been done for years are seen as psychological hurdles and often as a threat.

Other factors facilitate the technology transfer process, reducing the time taken for the transition. Prior success works as a good track record, not only making selling easier, but leading practitioners to seek out a technology when they read or hear about a recognized expert’s new developments. In some other cases, contractors might incentivize the use of a given technology. It is important to provide knowledgeable support to the adopters because a new technology might be hard to fully understand. If there is a latent demand for the technology, the adoption can be greatly accelerated. While a technology and its underlying basis can be quite complex, its adoption will be more certain and smooth if the instances available for use are easy to comprehend and are only minimally disruptive to the state of practice. Still related to simplicity, technologies will be more easily adopted if they are small extensions to technologies currently in use.

2.1.3 Pfleeger’s Model of Technology Transfer

Pfleeger (3) argues that until then (1999), researchers had been investigating the diffusion of specific technologies rather than the technology transfer process in general. One interesting thing about Pfleeger’s model is that it is discussed from the point of view of industry, even though she argues that it is very important for academia to develop new technology aimed at industry’s needs. This can be observed in the process defined and in the steps presented next. Pfleeger suggests that for the successful transfer of technology, five steps must be taken:

1. Technology creation;

2. Technology evaluation: preliminary;

3. Technology evaluation: advanced;

4. Technology packaging and support;

5. Technology diffusion.

2.1.3.1 Technology Creation

In this step, we must identify if there already exists a piece of technology that solves the problem we wish were solved. In some cases the technology exists and has already been applied in a similar domain/situation, in other cases the technology exists but hasn’t been used elsewhere. Also, the technology might not exist and should be developed. In any of these cases, the need for a technology arises from a business need. In general, some questions must be answered in order to perform an early evaluation of how suitable the technology is for a given application, for example:

What problem does it solve?

Does it work properly?

Does it replace/expand/enhance an existing technology?

Does it fit with existing processes?

Is it easy to understand?

Is it easy to learn?

Is it easy to use?

Is it cost effective?

2.1.3.2 Technology Evaluation: Preliminary

After the selection or creation of the candidate technology, the organization needs to know (i) if it fits well with the technologies the organization already uses, (ii) if there is any benefit to using the candidate technology and (iii) if it is actually usable in the organization’s context.

Often, this step is seen as a research task, hence, practitioners are interested in finding relevant insights in the literature that would help them decide if the new technology has a good potential. According to Zelkowitz et al. (25), researchers tend to use methods that are usually not valued by practitioners. For example, researchers are more likely to evaluate technologies using theoretical proofs, static analyses and simulations, while practitioners are often interested in case studies, field studies and replicated controlled experiments.

Practitioners are also interested in specific questions to be answered, namely:

To what degree is the new technology better than what is already available?

To what degree is it consistent with existing values, past experiences and needs of potential adopters?

To what degree is it easy to understand and use?

Can it be experimented with on a limited basis?

Are the results of using it visible to others?

2.1.3.3 Technology Evaluation: Advanced

Once the new technology has passed the initial scrutiny, it is analyzed more carefully. In particular, practitioners investigate the quality of the evidence that claims the technology’s benefits. First, the form of evidence is identified: tangible (e.g., documents, images, measurements), unequivocal testimonial (e.g., direct observation, opinion), equivocal testimonial (e.g., probabilistic argument), missing tangible (e.g., contradictory data) or authoritative records (e.g., facts, census data).

Second, practitioners must understand the confidence level of the evidence that is being presented. This is related to the degree of control the studies had when carried out. If a controlled experiment was used and all other variables were controlled, it is safer to say that an impact on the results was due to the technology under study.

Third, the process by which the evidence is generated is examined. Individual studies may bring conflicting data to the table, but as more and more studies are run, the confidence level is raised and a better conclusion is drawn.

Some more questions should be answered to make sure that each piece of evidence is not analyzed in isolation, but rather as part of a whole where each piece of information helps us in forming a body of knowledge, creating what Pfleeger calls a “fabric of an argument out of threads of evidence”. Questions may look like:

Is each piece of evidence relevant to the argument?

How accurate is the evidence?

How objective was the evidence collection and results?

2.1.3.4 Technology Packaging and Support

For the effective diffusion of a new technology, descriptions of its use are not enough. Packaging and support are needed to break down the “knowledge barriers” preventing the use of the new technology (26). The objective of a technology package is to make the technology easier to promote. In some cases, a wholesaler might be used as a channel to incentivize the use of the new technology.

2.1.3.5 Technology Diffusion

Once the technology is packaged with sound evidence of its efficacy and has good support, the diffusion can take place. One important thing to consider is the potential audience for the technology. It is important to analyze the difference between groups that have already adopted the technology and groups that have not. This difference, referred to as “heterophily”, is relevant because the more different two groups are, the more difficult it is for the technology to be transferred from the former to the latter. Pfleeger makes strong references to Rogers’ model in this step (1).

2.1.4 Gorschek et al.’s Model of Technology Transfer

Gorschek et al.’s model (6) differs from the other models because the authors describe a model that is closely attached to industrial partners. In fact, they describe this as an important characteristic of their model and one of the factors that facilitates the adoption of new technology. The model is split into seven steps, namely:

Identify potential improvement areas based on industry needs;

Formulate a research agenda;

Formulate a candidate solution;

Conduct lab validation;

Perform static validation;

Perform dynamic validation (pilot project);

Release the solution.

They start by identifying potential areas of improvement onsite. They learned that it is very important to maintain a friendly presence among practitioners, which gives them easy access to all practitioner groups. The potential areas of improvement are mapped and prioritized according to their perceived importance and dependencies (if technology B needs technology A to be adopted before it can be used, the investigation of A must come first).
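The prioritization rule above (order by perceived importance, but never before a prerequisite) can be sketched as a priority-aware topological sort. This is only an illustration of the idea, not a procedure taken from Gorschek et al.; the function name, the importance scores and the area names are hypothetical, and every prerequisite is assumed to appear in the importance table.

```python
import heapq


def investigation_order(importance, depends_on):
    """Order improvement areas by importance while respecting dependencies.

    importance: {area: score}, higher means more important.
    depends_on: {area: set of areas that must be adopted first}.
    """
    remaining = {a: set(depends_on.get(a, ())) for a in importance}
    dependents = {a: [] for a in importance}
    for area, prereqs in remaining.items():
        for p in prereqs:
            dependents[p].append(area)

    # Min-heap on negated importance: the most important ready area pops first.
    ready = [(-importance[a], a) for a, pre in remaining.items() if not pre]
    heapq.heapify(ready)

    order = []
    while ready:
        _, area = heapq.heappop(ready)
        order.append(area)
        for d in dependents[area]:
            remaining[d].discard(area)
            if not remaining[d]:
                heapq.heappush(ready, (-importance[d], d))

    if len(order) != len(importance):
        raise ValueError("circular dependency among improvement areas")
    return order


# Hypothetical areas: B is the most important one but needs A first.
importance = {"A": 2, "B": 9, "C": 5}
depends_on = {"B": {"A"}}

order = investigation_order(importance, depends_on)
print(order)
```

In this sketch a prerequisite keeps its own importance score. A variant could let A inherit the priority of B so that the whole chain is investigated earlier.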

The next step is to formulate a research agenda with the prioritized needs in hand. At this stage, a close communication channel is kept open with practitioners. Researchers should have an onsite presence and deeply understand the organization’s domain. The close proximity between researchers and practitioners is a critical success factor (27).

After formulating the research agenda, we need to formulate a candidate solution. A close collaboration with industry is still important. Gorschek et al. also believe that practitioners are very important to make sure the candidate solution is realistic and suits the organization’s current needs, and to help change the organization’s mindset and prepare it for the new technology to be adopted. The main role for researchers is to act as a link to the state of the art in research, ensuring that already existing technologies, techniques and processes are not overlooked.

After the technology is formulated, lab validation is needed. Experiments are conducted, usually in an academic context, simulating the use of the technology by practitioners. This setup is useful to enable the early identification of issues and analyses of usability and scalability. Besides giving researchers relevant insights about the technology’s effectiveness, the lab validation is important to communicate with practitioners and management, convincing them of both manageable risks and potential benefits.

The static validation follows the successful lab validation. In this step, the technology is showcased to practitioners. The candidate solution can be presented with seminars, demonstrations or any way researchers and practitioners see fit. Feedback is collected and any necessary improvements are made to the technology. Gorschek et al. argue that sometimes the solution might be scaled down to meet practitioners’ needs. This should not be feared, just seen as part of the refinement and validation process.

Practitioners are now “ready” for the dynamic validation. Usually they achieve this by performing pilot studies. Pilot studies might come in different sizes and organizations might choose to slowly scale them up and “feel” the new technology in practice. Piloting is very important because the technology can be evaluated in a realistic environment, and it minimizes risks because it is a limited test and bad results will not “contaminate” whole projects.

The final step is to release the solution. After carefully evaluating the candidate technology in several steps, it may be considered ready for full adoption.

It is important to note that, even though the concepts presented so far regarding technology transfer models are not directly used in this thesis, they are important because they set the context in which this work is inserted. The aforementioned models are the state of the art in technology transfer and diffusion in software engineering, and our work proposes a disruption in this framework by introducing a new way of thinking about diffusion theory in software engineering.

2.2 The Social Aspects of Programming Language Adoption

Diffusion of innovation is different from diffusion of information (28). In order for the former to happen, the latter must happen first. For programmers to adopt a particular language, they first must be aware it exists. Sometimes the adoption never happens if the benefits of adoption are overwhelmed by its costs or complexity.

Making programmers aware of a new language is a very important first step. However, if developers cannot continue the adoption process after their first contact with the language, it will most likely not be adopted. Some factors, as observed by Rogers, may influence the adoption process, and according to Meijer (29), language developers focus on the first factor, praising the benefits of their new creation against the state of the art:

Relative advantage: the improvement over a previous innovation;

Compatibility: how well an innovation integrates into an individual’s needs and beliefs;

Simplicity: how easy the idea is to use and understand;

Trialability: how easy it is to experiment with;

Observability: the ability to see results.

In a recent mapping study performed by our research group, we discovered several factors that influence the adoption of technologies (30) (protocol described in Appendix A). Some factors have a positive effect, while others have a negative effect on the adoption process. The top five positive influencing factors are, in rank order: existence of successful experimentation, higher management support, training, perception of benefit, and cooperation and collaboration (the two terms often have the same meaning, but in this context to cooperate means to enable the execution of some activity, while to collaborate means to actually work alongside someone). The top eight negatively influencing factors (there was a tie in ranks 5 through 8) are, in rank order: cost, lack of tool, lack of understanding, training, resistance to change, lack of correspondence, communication, and low maturity. Although some of the factors cited above are technical in nature, most of them carry a social side, which corroborates the research in the field.

For a programmer to adopt a language, he or she has to (i) hear about it, (ii) decide to study and try it, (iii) try it and evaluate whether it is worth the "shot", (iv) use it in a real project, and (v) confirm that the new language performed as expected and that its benefits surpassed its costs (recall the Rogers model (1)). The adoption process may fail at any of these steps, so it is hard for a language to pass through all of them and be adopted.

In the diffusion of information process, which is responsible for the first phase of the adoption process, word of mouth plays a key role. According to Jedlitschka et al. (5) and Diebold et al. (31), the most important information source for successful technology transfer is word of mouth, more specifically, the experiences of colleagues with the technology, whereas papers in scientific journals rank last in the list of relevant information sources. This shows the importance of social interactions for the diffusion of information.

A community of related individuals can usually be modeled as a graph, where individuals are nodes and the relations among them are edges (32; 33). This is the same kind of connection observed in epidemic theories (34). Individuals with a social interaction are potential transmission routes for both biological infectious agents and information. This is the link we need to map concepts from epidemiology to the ones in the programming language adoption context. The physical contact among individuals in a biological epidemic is mapped to the social connections that individuals maintain with one another. While in a biological epidemic an infectious agent is, most frequently, transmitted to a second individual through physical contact, in social epidemics information is transmitted through a series of different communication channels, which do not always require a physical connection among individuals. Digital communication channels play a key role in this scenario, allowing fast and broad communication from one individual to a large group of susceptible ones. In the current connected world, social epidemics can easily outpace biological ones. One example is the recent case of Pokémon Go, which became the most popular mobile game in history after only five days of existence (http://www.lifehacker.com.au/2016/08/pokemon-go-in-numbers-the-incredible-highs-and-frustrating-lows-infographic/). Hence, the diffusion of innovations and programming language adoption can be seen as an epidemic phenomenon and will be treated as such for the remainder of this thesis.
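The graph view described above can be illustrated with a minimal Python sketch. The community, the names of its members, and the function `awareness_rounds` are hypothetical and serve only as an illustration. The sketch also makes a strong simplifying assumption that is not part of this thesis: every individual who hears about a language tells all of his or her contacts in the next round, so information spreads deterministically along the edges.

```python
from collections import deque

# Hypothetical community: each key is an individual (node) and each value is
# the set of individuals he or she has a social connection with (edges).
community = {
    "ana": {"bruno", "carla"},
    "bruno": {"ana", "davi"},
    "carla": {"ana", "davi"},
    "davi": {"bruno", "carla", "elisa"},
    "elisa": {"davi"},
    "fabio": set(),  # isolated node: no transmission route reaches it
}


def awareness_rounds(graph, source):
    """Return the round in which each reachable individual first hears the news.

    Assumption (illustrative only): information always crosses an edge,
    so this is a breadth-first traversal starting at `source`.
    """
    heard_in = {source: 0}
    queue = deque([source])
    while queue:
        person = queue.popleft()
        for contact in graph[person]:
            if contact not in heard_in:
                heard_in[contact] = heard_in[person] + 1
                queue.append(contact)
    return heard_in


rounds = awareness_rounds(community, "ana")
for person in sorted(rounds):
    print(person, rounds[person])
```

In this toy community the isolated individual never becomes aware of the language, which mirrors the observation that awareness must precede adoption. A more realistic sketch would let information cross each edge only with some probability.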

The study of epidemics from the mathematical point of view is an active field of research (35). It allows for several interesting analyses, the construction of models and their evaluation. The theory of mathematical epidemiology is extensively used in this work and will be more thoroughly addressed in the following sections.

2.3 Mathematical Epidemiology

Epidemiology is the science that studies the patterns, causes and effects of health and disease conditions in defined populations. It is studied both by the medical sciences, which focus on the public health side of the problem, and by mathematics, which tries to develop models that describe epidemiological phenomena, allowing the scientific community to perform a number of relevant analyses on them. It is worth noting that epidemiology studies the spread of diseases in a population. In the present work, programming languages play the role of diseases, which act as infecting agents in a given population.

The history of mathematical epidemiology dates back to 1760, when Daniel Bernoulli developed a model to describe smallpox epidemics. The basic theory, however, was established between 1900 and 1935 (35). Some of the most common models are described in the sections that follow.

2.3.1 Compartmental Models

Most epidemic models divide a population into compartments, each containing individuals that are identical in terms of their status with respect to the disease in question. The SIR model, the most common, uses three compartments:

Susceptible: individuals who have no immunity to the infecting agent and might become infected if exposed;

Infectious: individuals who are infected and can transmit the infection to susceptible individuals if proper contact occurs;

Removed: individuals that were once infectious, but have recovered from the disease and have developed immunity; individuals in this compartment do not affect the transmission dynamics, since they cannot transmit the disease and are immune to it.

Usually, the number of individuals in these compartments is denoted by the letters S, I and R, respectively. The total host population is denoted by N = S + I + R. Now that the compartments have been defined, a set of equations can be written to describe the rate at which individuals migrate from one compartment to another. Note that each term that appears with a negative sign in one equation appears with a positive sign in the next, showing that individuals who leave a compartment move to the next one. The following differential equations describe these dynamics:

\[ \frac{dS}{dt} = -\beta S I, \tag{2.1} \]

\[ \frac{dI}{dt} = \beta S I - \gamma I, \tag{2.2} \]

\[ \frac{dR}{dt} = \gamma I. \tag{2.3} \]

In Equation 2.1, the term SI models the contact between susceptible and infected individuals, i.e., the number of newly infected individuals is proportional to the number of already infected individuals, which are the ones that transmit the disease. The (per capita) transmission rate is denoted by β and the recovery rate by γ; hence, the mean infectious period is 1/γ. Figure 2.2 depicts the classical SIR model and its compartments. This simple model is attributed to Kermack and McKendrick (36).

Figure 2.2: The compartmental SIR model.

Source: The author.
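To make the dynamics of Equations 2.1 to 2.3 concrete, the following Python sketch integrates them numerically. It is a minimal illustration, not the procedure used in this thesis: the forward Euler method, the function name `simulate_sir`, and all parameter values (β = 0.0005, γ = 0.1, N = 1000, a step of 0.1 time units) are assumptions chosen only for demonstration.

```python
def simulate_sir(beta, gamma, s0, i0, removed0=0.0, dt=0.1, steps=2000):
    """Integrate Equations 2.1 to 2.3 with the forward Euler method.

    Returns a list of (t, S, I, R) tuples. The step size `dt` and the
    number of steps are illustrative assumptions.
    """
    s, i, r = float(s0), float(i0), float(removed0)
    history = [(0.0, s, i, r)]
    for k in range(1, steps + 1):
        new_infections = beta * s * i * dt  # flow S -> I (term beta*S*I)
        recoveries = gamma * i * dt         # flow I -> R (term gamma*I)
        s -= new_infections
        i += new_infections - recoveries
        r += recoveries
        history.append((k * dt, s, i, r))
    return history


# Hypothetical scenario: one infectious individual in a population of 1000.
history = simulate_sir(beta=0.0005, gamma=0.1, s0=999, i0=1)

peak_time, _, peak_infectious, _ = max(history, key=lambda row: row[2])
print("peak of I:", round(peak_infectious, 1), "at t =", round(peak_time, 1))
print("final state (S, I, R):", [round(x, 2) for x in history[-1][1:]])
```

Because every quantity subtracted from one compartment is added to the next, the sum S + I + R stays equal to N throughout the simulation, which matches the sign structure of the three equations.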

When an infectious individual is introduced into a susceptible population of size N, this individual is expected to infect others at the rate βN during the infectious period 1/γ. Thus, the first infective individual is expected to infect R0 = βN/γ individuals. The number R0 is called the basic reproduction number and is the most important quantity in the analysis of any epidemic model, because it determines whether an epidemic will occur at all: an epidemic spreads only if R0 > 1. In a practical sense, R0 is the number of individuals each infective individual will infect during its infective phase. In other words, an epidemic occurs only if a single individual can spread its illness to